A multilingual (English, Chinese, Japanese, Korean) diagnostic benchmark that evaluates ASR robustness across three out-of-distribution dimensions: environmental degradation (reverberation, noise, clipping), demographic shift (accents, children, older speakers), and linguistic diversity (code-switching, short utterances, incomplete speech). It uses WER (word error rate) for English and CER (character error rate) for the CJK languages.
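As a rough illustration of the two metrics named above, the sketch below computes WER and CER as an edit distance divided by the reference length, expressed as a percentage. The function names (`edit_distance`, `wer`, `cer`) are hypothetical, and the whitespace handling is an assumption: the text normalization WildASR actually applies is not described here, so this should not be read as the benchmark's scoring code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent; tokens are whitespace-separated words."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref, hypothesis.split()) / len(ref)


def cer(reference, hypothesis):
    """Character error rate in percent.

    Assumption: spaces are dropped before comparison, which is a common
    choice for Chinese and Japanese text but not confirmed for WildASR.
    """
    ref = list(reference.replace(" ", ""))
    if not ref:
        raise ValueError("reference must contain at least one character")
    hyp = list(hypothesis.replace(" ", ""))
    return 100.0 * edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(round(wer("the cat sat on the mat", "the cat sat on mat"), 2))
    # One dropped character out of seven reference characters.
    print(round(cer("今日は晴れです", "今日は晴です"), 2))
```

Both functions divide by the reference length, so a hypothesis with many insertions can score above 100. Lower values mean fewer errors.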
CER (character error rate) is the metric WildASR reports for Chinese, Japanese, and Korean. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
Lower is better
| Rank | Model | Trust | Score | Year | Links |
|---|---|---|---|---|---|
| 01 | Gemini 3 Pro | verified | 6.10 | 2025 | Source ↗ |
| 02 | GPT-4o Transcribe | verified | 6.40 | 2025 | Source ↗ |
| 03 | Gemini 2.5 Pro | verified | 6.70 | 2025 | Source ↗ |
| 04 | Whisper Large V3 | verified | 7.50 | 2025 | Source ↗ |
| 05 | Scribe V1 | verified | 8.70 | 2025 | Source ↗ |
| 06 | Qwen2-Audio | verified | 9.10 | 2025 | Source ↗ |
| 07 | Nova 2 | verified | 10.1 | 2025 | Source ↗ |
WER (word error rate) is the metric WildASR reports for English. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
Lower is better
| Rank | Model | Trust | Score | Year | Links |
|---|---|---|---|---|---|
| 01 | Gemini 3 Pro | verified | 2.80 | 2025 | Source ↗ |
| 02 | GPT-4o Transcribe | verified | 2.80 | 2025 | Source ↗ |
| 03 | Gemini 2.5 Pro | verified | 3.60 | 2025 | Source ↗ |
| 04 | Scribe V1 | verified | 3.60 | 2025 | Source ↗ |
| 05 | Whisper Large V3 | verified | 4.20 | 2025 | Source ↗ |
| 06 | Qwen2-Audio | verified | 5.80 | 2025 | Source ↗ |
| 07 | Nova 2 | verified | 6.00 | 2025 | Source ↗ |