Language Modeling
Language modeling, the task of predicting the next token, is the pretraining objective that accidentally became the foundation of modern AI. From GPT-2's "too dangerous to release" moment in 2019 to GPT-4, Claude, Llama 3, and Gemini, scaling language models has produced emergent capabilities that no one predicted from loss curves alone. Perplexity on benchmarks like WikiText-103 and Penn Treebank is now largely a historical artifact; the field evaluates models on downstream tasks (MMLU, HumanEval, MATH) because raw perplexity stopped tracking practical usefulness years ago. The frontier has moved to mixture-of-experts architectures (Mixtral, DeepSeek-V3), longer context windows (1M+ tokens), and efficient inference: the model is no longer the bottleneck; serving it is.
WikiText Perplexity
Language modeling quality, measured by perplexity on Wikipedia text (lower is better).
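For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-likelihood a model assigns to the correct next tokens. The sketch below is a minimal illustration under stated assumptions, not the scoring code used for this leaderboard: the function name `perplexity` and the example probabilities are hypothetical, and it assumes natural-log probabilities with one entry per token.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) over a token sequence.

    token_log_probs: natural-log probabilities the model assigned to each
    true next token (hypothetical input format, one float per token).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical example: the model gives probability 0.25 to every correct
# token, so it is as uncertain as a uniform choice among four options.
example_log_probs = [math.log(0.25)] * 4
print(perplexity(example_log_probs))  # approximately 4
```

A perplexity of 1 would mean the model predicted every token with certainty. Scores are only comparable when the text is tokenized the same way, so word-level and subword-level perplexities on the same corpus should not be compared directly.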
All datasets
1 dataset tracked for this task.