
Language Modeling

Language modeling — predicting the next token — is the pretraining objective that accidentally became the foundation of modern AI. From GPT-2's "too dangerous to release" moment in 2019 to GPT-4, Claude, Llama 3, and Gemini, scaling language models has produced emergent capabilities no one predicted from loss curves alone. Perplexity on benchmarks like WikiText-103 and Penn Treebank is essentially a historical artifact now; the field evaluates via downstream tasks (MMLU, HumanEval, MATH) because raw perplexity stopped correlating with usefulness years ago. The frontier has moved to mixture-of-experts architectures (Mixtral, DeepSeek-V3), longer context windows (1M+ tokens), and efficient inference; the model is no longer the bottleneck, serving it is.


Language modeling — predicting the next token given preceding context — is the foundational task that powers all modern NLP. GPT-4, Claude, Llama, and Gemini are all language models at their core. Perplexity on held-out text remains the key intrinsic metric, but downstream task performance has become the real measure of progress.
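Concretely, perplexity is the exponential of the mean negative log-likelihood the model assigns to each held-out token. A minimal sketch (the per-token probabilities below are made up for illustration, not from any real model):

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-likelihood
    the model assigns to each observed next token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Hypothetical probabilities a model assigned to the true next tokens.
probs = [0.25, 0.5, 0.125, 0.25]
print(perplexity(probs))  # 4.0 — "as uncertain as choosing among 4 tokens"
```

A perfect model (probability 1.0 on every token) has perplexity 1; a model guessing uniformly over a vocabulary of size V has perplexity V.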

History

2003

Bengio et al. introduce neural language models with feedforward networks, replacing n-gram models

2013

Word2Vec shows that language model byproducts (embeddings) transfer to downstream NLP tasks

2017

Transformer architecture (Vaswani et al.) enables massively parallel training, replacing recurrent models

2018

GPT (Radford et al.) demonstrates that autoregressive pretraining on unlabeled text (the BooksCorpus) produces representations that transfer to downstream tasks

2019

GPT-2 (1.5B params) shows emergent generation quality; OpenAI delays release over misuse concerns

2020

GPT-3 (175B params) demonstrates in-context learning — the model performs tasks from examples in the prompt

2023

GPT-4 and Claude 2 reach broadly expert-level performance across NLP, coding, and reasoning

2023

Llama 2 (Meta) opens the floodgates for open-weight LLMs; Mistral-7B matches Llama 2 13B

2024

Llama 3.1 405B, DeepSeek-V3, and Qwen2.5-72B close the gap with proprietary frontier models

2025

Claude 3.5, GPT-4o, and Gemini 2.0 compete on reasoning, coding, and agentic capabilities; DeepSeek-R1 and Llama 4 push open-weight models further

How Language Modeling Works

Language Modeling Pipeline
1

Tokenization

Text is encoded into subword tokens using BPE (GPT), SentencePiece (Llama), or custom tokenizers; vocabulary sizes range from 32K to 256K

2

Embedding

Each token is mapped to a dense vector; positional information is added via learned or rotary (RoPE) position embeddings

3

Transformer layers

Tokens pass through N layers of multi-head self-attention and feed-forward networks; modern models use 32-128 layers

4

Next-token prediction

A linear head projects the final hidden state to vocabulary logits; softmax gives probability distribution over next token

5

Training

Cross-entropy loss on next-token prediction over trillions of tokens from web text, code, and curated data
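The five steps above can be sketched end to end in miniature. Everything here is illustrative (a toy vocabulary, random weights, and the transformer stack of step 3 stubbed out as an identity); it shows the shape of the computation, not a real model:

```python
import math
import random

random.seed(0)

VOCAB = ["<bos>", "the", "cat", "sat", "."]   # step 1: toy subword vocabulary
D = 4                                          # hypothetical embedding dimension

# Step 2: embedding table, one D-dim vector per token id.
emb = [[random.gauss(0.0, 0.1) for _ in range(D)] for _ in VOCAB]

# Step 4: linear head projecting a hidden state onto vocabulary logits.
head = [[random.gauss(0.0, 0.1) for _ in range(D)] for _ in VOCAB]

def softmax(logits):
    m = max(logits)                            # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def next_token_probs(hidden):
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in head]
    return softmax(logits)                     # distribution over next token

# Step 5: training loss is the negative log-probability of the observed token.
def loss(hidden, target_id):
    return -math.log(next_token_probs(hidden)[target_id])

# Step 3 (the transformer stack) is replaced by an identity here:
hidden = emb[VOCAB.index("cat")]
probs = next_token_probs(hidden)               # sums to 1 over VOCAB
```

In a real model the hidden state would come from 32-128 attention/feed-forward layers, and the loss would be averaged over trillions of tokens and minimized by gradient descent.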

Current Landscape

Language modeling in 2025 is the foundation of the entire AI industry. The scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022) continue to hold: more compute and data produce better models. But the frontier has shifted from pure scale to efficiency (MoE architectures, DeepSeek), reasoning (o1-style inference-time compute), and post-training (RLHF, DPO, Constitutional AI). Open-source models lag frontier by 6-12 months but are increasingly competitive. The Chinchilla-optimal training paradigm has given way to over-training smaller models for cheaper inference.
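The shift away from Chinchilla-optimal training is simple arithmetic. Hoffmann et al. (2022) suggest roughly 20 training tokens per parameter for compute-optimal loss; the model sizes and token counts below are illustrative round numbers:

```python
def chinchilla_optimal_tokens(params, tokens_per_param=20):
    """Hoffmann et al. (2022) rule of thumb: ~20 tokens per
    parameter for a compute-optimal training run."""
    return params * tokens_per_param

p = 8e9  # an 8B-parameter model
print(f"Chinchilla-optimal: {chinchilla_optimal_tokens(p):.0e} tokens")  # 1.6e+11

# Training the same model on ~15T tokens instead (as recent open models
# reportedly do) is heavily "over-trained" by the Chinchilla criterion,
# trading extra training compute for a smaller, cheaper-to-serve model.
print(f"Over-training ratio at 15T tokens: {15e12 / chinchilla_optimal_tokens(p):.0f}x")
```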

Key Challenges

Scaling cost: training a frontier model costs $50-500M+ in compute; only a handful of organizations can afford it

Data quality and curation are arguably more important than model size — garbage in, garbage out at scale

Evaluation: perplexity doesn't capture reasoning ability; benchmarks saturate quickly; human evaluation is expensive

Alignment: making models helpful, harmless, and honest through RLHF/RLAIF adds complexity and potential capability loss

Inference cost: serving large models requires expensive GPU clusters; efficiency techniques (quantization, speculative decoding) are critical
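One of the efficiency techniques named above, quantization, can be sketched minimally. This is symmetric per-tensor int8 quantization on a plain Python list, purely for illustration; production systems use per-channel or per-group schemes over real tensors:

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8: scale so the largest |weight| maps to 127."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]    # ints in [-127, 127]
    return q, scale

def dequantize(q, scale):
    """Recover approximate fp weights; error is bounded by scale/2."""
    return [x * scale for x in q]

w = [0.5, -1.27, 0.03, 0.9]                    # made-up weight values
q, s = quantize_int8(w)                        # q = [50, -127, 3, 90]
w_hat = dequantize(q, s)
```

The payoff is 4x less memory than fp32 (2x less than fp16) per weight, at the cost of rounding error proportional to the quantization scale.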

Quick Recommendations

Best frontier model

Claude 3.5 Sonnet, GPT-4o, or Gemini 2.0 Pro

Top performance on reasoning, coding, and instruction following; competitive pricing

Open-source (large)

Llama 3.1 405B or DeepSeek-V3-671B (MoE)

Approaching frontier model quality; self-hostable for full data control

Open-source (efficient)

Qwen2.5-72B or Llama 3.1 70B

Best quality at the 70B scale; fits on 2x A100 with quantization

Small / edge

Llama 3.2 3B or Phi-3.5 Mini (3.8B)

Runs on mobile and laptop hardware; surprisingly capable for their size

Research / perplexity benchmark

GPT-4 or Gemini 1.5 Pro

Strongest reported results on classic LM benchmarks, though frontier labs rarely publish raw perplexity figures
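Hardware sizings like "fits on 2x A100 with quantization" come from simple weight-memory arithmetic. A rough rule that ignores KV cache, activations, and runtime overhead (so real deployments need headroom beyond these figures):

```python
def weight_memory_gb(params, bits):
    """Approximate memory for model weights alone, in GB."""
    return params * bits / 8 / 1e9

print(weight_memory_gb(70e9, 16))  # 140.0 GB in fp16 -> needs 2x 80GB A100
print(weight_memory_gb(70e9, 4))   # 35.0 GB at 4-bit -> weights fit one GPU
print(weight_memory_gb(3e9, 4))    # 1.5 GB -> plausible on laptops and phones
```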

What's Next

The next phase is test-time compute scaling (thinking longer to solve harder problems), multi-modal native models (text + image + audio + video in one architecture), and agentic models that can use tools, write code, and take actions. Expect the open-source gap to continue closing, with 70B-class models matching today's frontier within a year. Architecture innovations (state-space models, hybrid attention-SSM) may complement or partially replace pure transformers.
