DeepL or Google Cloud Translation
Low ops, mature APIs, glossary support, and predictable latency.
Choose a translation model by language pair, domain, document context, deployment boundary, and evaluation metric. WMT is the main public benchmark signal, but production decisions still need COMET scores, terminology checks, and human edit rate.
BLEU is still reported for continuity, but modern translation decisions should look at COMET, human preference, terminology errors, and edit distance on your own corpus.
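
Two of these checks, terminology errors and edit distance against a post-edited version, can be approximated with a few lines of standard-library Python. This is a minimal sketch under stated assumptions: the function names, the glossary format (source term mapped to required target term), and the case-insensitive substring matching are all illustrative choices, and the edit rate here is plain word-level Levenshtein distance, not TER (it has no shift operation).

```python
def word_edit_distance(a: list[str], b: list[str]) -> int:
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete tok_a
                            curr[j - 1] + 1,      # insert tok_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def edit_rate(machine_output: str, post_edited: str) -> float:
    """Edits needed to turn the MT output into the human-edited text,
    normalized by the length of the edited text (whitespace tokens)."""
    mt_tokens, pe_tokens = machine_output.split(), post_edited.split()
    if not pe_tokens:
        return 0.0 if not mt_tokens else 1.0
    return word_edit_distance(mt_tokens, pe_tokens) / len(pe_tokens)


def terminology_errors(source: str, translation: str,
                       glossary: dict[str, str]) -> list[str]:
    """Return source terms that appear in the source text but whose
    required target term is missing from the translation.
    Assumption: simple case-insensitive substring matching."""
    src, tgt = source.lower(), translation.lower()
    return [term for term, required in glossary.items()
            if term.lower() in src and required.lower() not in tgt]


if __name__ == "__main__":
    print(edit_rate("the contract is void", "the agreement is void"))
    print(terminology_errors("Sign the agreement",
                             "Unterschreiben Sie den Vertrag",
                             {"agreement": "Vereinbarung"}))
```

Substring matching will miss inflected forms, so for morphologically rich target languages a lemmatizer or a list of accepted variants per term is likely needed.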
| Benchmark or signal | Role | Metric | Best use |
|---|---|---|---|
| WMT | Primary competition | COMET / human eval / BLEU | High-resource language pairs and annual system comparisons. |
| FLORES-200 | Coverage test | spBLEU / chrF++ | Broad multilingual coverage, especially underrepresented languages. |
| COMET | Learned quality metric | reference-based / reference-free (QE) | Ranking fluent translations when BLEU misses differences in meaning and adequacy. |
| Human review | Production gate | edits / accept rate | Legal, medical, marketing, and other high-liability translation. |
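
To make the chrF family of metrics in the table more concrete, here is a simplified character n-gram F-score. It is a sketch only: it averages precision and recall over n-gram orders 1 to 6 and combines them with beta = 2, drops whitespace, and omits the word n-grams that chrF++ adds. The function names are illustrative, and the numbers will not match a reference implementation such as sacreBLEU exactly.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing all whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf_like(hypothesis: str, reference: str,
              max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF-style score in [0, 1]; recall is weighted beta^2
    times as much as precision."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # text too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


if __name__ == "__main__":
    print(chrf_like("the cat sat on the mat", "the cat sits on the mat"))
```

Because it works on characters, a score like this gives partial credit for near-miss word forms, which is one reason character-level metrics are used for broad multilingual coverage tests.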
Dedicated MT, open multilingual models, and LLM translators solve different versions of the task. The right model depends more on pair and corpus than on one global rank.
| Model / system | Lane | Good fit | Watch out |
|---|---|---|---|
| HY-MT1.5 | WMT frontier | Competition-grade text translation; use it as the current WMT reference signal. | Availability and deployment path depend on the publisher. |
| DeepL | Production API | European languages, business copy, and editorial workflows. | Narrower language coverage than massively multilingual systems. |
| Google Cloud Translation | Production API | Broad language support, low operational burden, and AutoML-style customization. | Quality varies by pair and domain; test domain terminology. |
| NLLB-200 | Open model | Self-hosted translation across 200 languages, including low-resource ones. | Non-commercial license (CC-BY-NC) and uneven quality; evaluate per pair. |
| MADLAD-400 | Open model | Very broad language coverage when you need one multilingual backbone. | Larger deployment footprint and uneven pair quality. |
| Claude / GPT | LLM translation | Document-level context, terminology instructions, style transfer, and format repair. | Higher cost and latency; output needs review for fidelity to the source. |
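
Since the right choice depends on the pair and corpus, one way to act on an evaluation is to pick a system per language pair instead of one global winner. The sketch below assumes you already have a quality score per system per pair from your own test set. The system names and scores are placeholders for illustration, not measurements of any real system.

```python
Pair = tuple[str, str]  # (source language code, target language code)


def best_system_per_pair(
    scores: dict[Pair, dict[str, float]],
) -> dict[Pair, str]:
    """For each language pair, return the system with the highest score.
    Pairs with no scored systems are skipped."""
    return {
        pair: max(systems, key=systems.get)
        for pair, systems in scores.items()
        if systems
    }


if __name__ == "__main__":
    # Placeholder scores (higher is better), e.g. a metric averaged
    # over an in-domain test set. These numbers are invented.
    example_scores = {
        ("en", "de"): {"system_a": 0.86, "system_b": 0.84},
        ("en", "sw"): {"system_a": 0.71, "system_b": 0.78},
    }
    for pair, system in best_system_per_pair(example_scores).items():
        print(pair, "->", system)
```

In practice the routing table would also need to account for cost, latency, and deployment boundary, which a single quality score does not capture.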
Open multilingual models such as NLLB-200 and MADLAD-400 cover many pairs that commercial APIs handle weakly.
LLM translators use long context to preserve names, tone, and paragraph-level coherence.
Self-hosting keeps data inside your boundary and lets terminology be audited.
The building-block page explains encoder-decoder translation, LLM translation, beam search, and implementation options.