
📐 The Open Persian ASR Leaderboard ranks and evaluates speech recognition models on the Hugging Face Hub.
We report the Average WER (⬇️ lower the better) and Average CER (⬇️ lower the better). Check the 📈 Metrics tab to understand how the models are evaluated.
If you want results for a model that is not listed here, you can submit a request for it to be included ✉️✨.
The leaderboard includes Persian/Farsi ASR evaluation benchmarks.
We created our own high-quality evaluation dataset, the Persian ASR Benchmark, which is used to evaluate the models listed here.




1	c1tech small (244 M)	0.025	0.07	0.0258	0.071	0.0215	0.0724	0.0106	0.0424	0.0402	0.0933
2	c1tech base (74 M)	0.041	0.114	0.0597	0.1623	0.035	0.1034	0.0229	0.0818	0.0445	0.1082
3	vhdm/whisper-large-fa-v1 (809 M)	0.061	0.171	0.1078	0.3167	0.0349	0.1194	0.0245	0.091	0.0752	0.158
4	Neurai/NeuraSpeech_WhisperBase (74 M)	0.062	0.175	0.1016	0.251	0.0431	0.1383	0.0381	0.1481	0.065	0.1626
5	nezamisafa/whisper-persian-v4 (1550 M)	0.069	0.173	0.1095	0.326	0.0309	0.1061	0.0178	0.0647	0.1195	0.1969
6	nvidia/stt_fa_fastconformer_hybrid_large (115 M)	0.071	0.234	0.1369	0.3659	0.0742	0.246	0.0354	0.1613	0.0357	0.1618
7	jonatasgrosman/wav2vec2-large-xlsr-53-persian (315 M)	0.075	0.276	0.1076	0.367	0.064	0.2525	0.0497	0.2043	0.078	0.2799
8	m3hrdadfi/wav2vec2-large-xlsr-persian-v3 (315 M)	0.076	0.259	0.1115	0.3529	0.0616	0.2339	0.057	0.2215	0.0724	0.229
9	Vosk-0.5	0.082	0.209	0.1532	0.3573	0.0581	0.158	0.0606	0.1685	0.0567	0.1524
10	farbodbij/whisper-small-Persian (244 M)	0.154	0.244	0.0788	0.2136	0.0311	0.1073	0.038	0.1346	0.4677	0.5194
11	aictsharif/whisper-small-fa (244 M)	0.175	0.321	0.1149	0.338	0.0653	0.2197	0.0264	0.1405	0.4947	0.5861
12	openai/whisper-small (244 M)	0.285	0.662	0.2711	0.6464	0.1671	0.5147	0.2645	0.6984	0.4362	0.7899
13	openai/whisper-large-v3 (1550 M)	0.439	0.573	0.0725	0.228	0.0485	0.1783	0.0712	0.2384	1.5636	1.649
14	openai/whisper-base (74 M)	0.565	1.078	0.7858	1.3106	0.3312	0.8135	0.5566	1.0797	0.5872	1.1065

Here you will find details about the speech recognition metrics and datasets reported in our leaderboard.

Metrics

Models are evaluated using Character Error Rate (CER) and Word Error Rate (WER). The CER metric measures the accuracy of a system at the character level, capturing detailed errors such as misspellings, missing letters, or small deviations that WER might miss. A lower CER indicates better accuracy in reproducing the reference transcript character by character.

WER is also reported to provide a word-level perspective, but models are primarily ranked based on their CER, emphasizing fine-grained transcription quality.

For details on reproducing the benchmark numbers, refer to the Persian-ASR-Leaderboard GitHub repository.

Character Error Rate (CER)

Character Error Rate is used to measure the accuracy of automatic speech recognition systems at the character level. It calculates the percentage of characters in the system's output that differ from the reference (correct) transcript. A lower CER value indicates higher accuracy.

Take the following example:

Reference: علی کتاب خواند
Prediction: علی کتاه خاند

Reference:	د	ن	ا	و	خ		ب	ا	ت	ک		ی	ل	ع
Prediction:	د	ن	ا	-	خ		ه	ا	ت	ک		ی	ل	ع
Label:	✅	✅	✅	D	✅	✅	S	✅	✅	✅	✅	✅	✅	✅

Explanation of labels:

S (Substitution): ب substituted with ه
D (Deletion): و Deleted

Total reference characters (**N**) = 14 (12 letters plus 2 spaces)
Errors = 1 substitution (ب→ه) + 1 deletion (و) = **2 errors**

CER = (S + I + D) / N = 2/14

Final CER ≈ 0.1429 (14.3%)
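
The calculation above can be reproduced with a short script. This is a minimal sketch using a plain Levenshtein distance over characters, with spaces counted as characters as in the example; the function names `edit_distance` and `cer` are illustrative and are not taken from the leaderboard's evaluation code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference items
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(reference, prediction):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(reference, prediction) / len(reference)


reference = "علی کتاب خواند"
prediction = "علی کتاه خاند"
print(edit_distance(reference, prediction), len(reference))  # 2 14
print(round(cer(reference, prediction), 4))                  # 0.1429
```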

Word Error Rate (WER)

Word Error Rate is used to measure the accuracy of automatic speech recognition systems. It calculates the percentage of words in the system's output that differ from the reference (correct) transcript. A lower WER value indicates higher accuracy.

Take the following example:

Reference:	رفتند	مدرسه	به	پارسا	و	آرش
Prediction:	رفتن	مدرسه		بارسا	و	آرش
Label:	S	✅	D	S	✅	✅

Here, we have:

2 substitutions ("پارسا" → "بارسا" and "رفتند" → "رفتن")
0 insertions
1 deletion ("به" is missing)

This gives 3 errors in total. To get our word error rate (WER), we divide the total number of errors (substitutions + insertions + deletions) by the total number of words in our reference (N), which for this example is 6:

WER = (S + I + D) / N = (2 + 0 + 1) / 6 = 0.5

Giving a WER of 0.5, or 50%.
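
As a rough illustration, the substitution, deletion, and insertion counts can be recovered by backtracking through the edit-distance table over words. The sketch below assumes whitespace tokenization, and the helper name `align_counts` is hypothetical. A minimum-cost alignment is not always unique, so the script may pair individual words differently from the table above, but the totals for this example are the same.

```python
def align_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) for one minimum-edit alignment."""
    n, m = len(ref), len(hyp)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # match or substitution

    # Walk back from the bottom-right corner, counting each operation.
    subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        mismatch = i > 0 and j > 0 and ref[i - 1] != hyp[j - 1]
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + int(mismatch):
            subs += int(mismatch)
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return subs, dels, ins


reference = "آرش و پارسا به مدرسه رفتند".split()
prediction = "آرش و بارسا مدرسه رفتن".split()

s, d, i_count = align_counts(reference, prediction)
wer = (s + d + i_count) / len(reference)
print(s, d, i_count, wer)  # 2 1 0 0.5
```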

For a fair comparison, we calculate normalized CER and WER for all model checkpoints, meaning punctuation and casing are removed from the references and predictions. You can find the evaluation code in our GitHub repository.
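
To give a sense of what such normalization might look like, here is a minimal sketch that lowercases text and replaces every Unicode punctuation character with a space. It is an assumption for illustration only: the function name `strip_punctuation` is hypothetical, and the leaderboard's actual rules are the ones in its evaluation code.

```python
import unicodedata


def strip_punctuation(text):
    """Lowercase, drop Unicode punctuation, and collapse repeated whitespace."""
    # Unicode general categories starting with "P" cover punctuation,
    # including the Persian comma (،) and question mark (؟).
    cleaned = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch
        for ch in text.lower()
    )
    return " ".join(cleaned.split())


print(strip_punctuation("سلام، دنیا!"))      # سلام دنیا
print(strip_punctuation("Hello, World?"))   # hello world
```

Applying the same function to both the reference and the prediction before scoring keeps punctuation differences from counting as errors.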

Limitations of WER for Persian

Persian has complex linguistic features that make Word Error Rate (WER) less reliable as a metric.

1. Formal vs. Informal Variations

Persian often has multiple valid forms for the same sentence depending on formality:

Formal:
کتابم را از علی گرفتم
Informal:
کتابم رو از علی گرفتم

Both sentences are correct, but WER would count the difference between را and رو as a full word error, penalizing the model unfairly.

2. Morphological Complexity

Persian words often include clitics or attached pronouns (e.g., کتابم, رفتم), which can be split differently depending on tokenization. WER can exaggerate errors in these cases.

3. Word Segmentation Ambiguity

Persian does not always use spaces consistently, especially with prepositions, conjunctions, and enclitics. WER is sensitive to such inconsistencies, which can inflate error rates.

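To see the difference in practice, take the formal sentence from point 1 (کتابم را از علی گرفتم) as the reference and the informal one (کتابم رو از علی گرفتم) as the prediction.

Word Error Rate (WER) Calculation

Substitution: را → رو counts as 1 word error
Total words in reference: 5
WER = 1 / 5 = 0.2

Character Error Rate (CER) Calculation

Character-level difference: ا → و (1 character error)
Total characters in reference: 21 (including spaces)
CER = 1 / 21 ≈ 0.0476

The same one-letter difference costs 20% under WER but less than 5% under CER.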
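
The two calculations differ only in the unit being compared, which the following sketch makes explicit: one edit-distance function is applied once to word lists and once to character lists. The names `edit_distance` and `error_rate` are illustrative, not part of the leaderboard's code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of comparable items."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edits divided by the number of reference units (words or characters)."""
    return edit_distance(ref_units, hyp_units) / len(ref_units)


formal = "کتابم را از علی گرفتم"
informal = "کتابم رو از علی گرفتم"

wer = error_rate(formal.split(), informal.split())  # units are words
cer = error_rate(list(formal), list(informal))      # units are characters
print(f"WER = {wer:.4f}, CER = {cer:.4f}")  # WER = 0.2000, CER = 0.0476
```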

How to reproduce our results

The ASR Leaderboard will be a continued effort to benchmark open source/access speech recognition models where possible. Along with the Leaderboard we're open-sourcing the codebase used for running these evaluations. For more details head over to our repo at: Persian-ASR-Leaderboard GitHub repository

P.S. We'd love to know which other models you'd like us to benchmark next. Contributions are more than welcome! ♥️

Benchmark datasets

Dataset	Total Duration (h)	License
FLEURS	-	CC-BY-4.0
Persian-ASR-Benchmark	-	CC-BY-4.0
common_voice	-	CC0
ManaTTS (only parts 70-77)	-	CC0

Dataset and Normalization

During preprocessing, we noticed that some Persian words contained Arabic forms (e.g., دایرة المعارف), which added unnecessary complexity and confused the model. We normalized such words to standard Persian forms to improve consistency and model understanding.

For more information about our normalization methods, please refer to our GitHub page where we describe our preprocessing pipeline in detail.
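
As a hedged illustration of this kind of normalization, the sketch below maps a few Arabic letter forms to their common Persian counterparts. The mapping table and the name `normalize_arabic_forms` are assumptions made for this example; the project's real rules are the ones documented in its repository.

```python
# Hypothetical mapping, chosen for illustration only.
ARABIC_TO_PERSIAN = str.maketrans({
    "\u064A": "\u06CC",  # Arabic yeh (ي) -> Persian yeh (ی)
    "\u0643": "\u06A9",  # Arabic kaf (ك) -> Persian kaf (ک)
    "\u0629": "\u0647",  # teh marbuta (ة) -> heh (ه)
})


def normalize_arabic_forms(text):
    """Replace the Arabic letter forms listed above with Persian ones."""
    return text.translate(ARABIC_TO_PERSIAN)


# "دایرة" written with a teh marbuta becomes "دایره".
word = "\u062F\u0627\u06CC\u0631\u0629"
print(normalize_arabic_forms(word))
```

Normalizing references and predictions with the same table means a model is not penalized for choosing one letter form over the other.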

Since many models do not release their training data, we created an evaluation dataset using audio recorded after the public release dates (2 November 2025) of those models. This ensures fairness and prevents data leakage, as none of these samples were used during training.

Last updated on Oct 14th 2025

For further information, keep in touch:
info@c1tech.group