
Language Model

Transform, generate, or reason about text. The core building block for chatbots, summarization, translation, and more.

How Large Language Models Work

A technical deep-dive into transformer-based language models. How text becomes tokens, tokens become vectors, and vectors predict the next word.

1. Tokenization: Text to Numbers

LLMs don't see text as characters. They see tokens - subword units learned from training data.

Example: "The quick brown fox" splits into 4 tokens (19 characters, roughly 4.8 characters per token):

    The      ID: 464
    ▁quick   ID: 2159
    ▁brown   ID: 7586
    ▁fox     ID: 21831

(The ▁ marks a leading space that belongs to the token.)
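
A quick way to see tokenization in practice is to run a tokenizer library directly. The sketch below assumes the tiktoken package is installed; token IDs differ between tokenizers, so the IDs above are illustrative.

    import tiktoken

    # Load a BPE tokenizer; other models use other vocabularies and produce other IDs.
    enc = tiktoken.get_encoding("cl100k_base")

    text = "The quick brown fox"
    token_ids = enc.encode(text)                    # list of integer token IDs
    pieces = [enc.decode([t]) for t in token_ids]   # decode each ID back to its text piece

    print(token_ids)
    print(pieces)
    print(f"{len(token_ids)} tokens, {len(text)} characters, "
          f"{len(text) / len(token_ids):.1f} chars/token")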

How BPE (Byte Pair Encoding) Works

1. Start with characters: split the text into individual bytes/characters, e.g. ["h","e","l","l","o"].
2. Find common pairs: count adjacent pairs across the corpus and merge the most frequent, e.g. "l"+"l" -> "ll", because "ll" appears often.
3. Repeat ~50K times: the accumulated merges build a vocabulary of roughly 50K-100K subwords, so "hello" ends up as ["hel","lo"] or simply ["hello"].
2. Embeddings: Tokens to Vectors

Each token ID is mapped to a learned vector (typically 4096-12288 dimensions). These vectors capture semantic meaning - similar words have similar vectors.

Token Embedding Lookup

"cat"
Token ID: 9246
->
0.23
-0.45
0.12
0.89
-0.34
0.67
...
4096-dimensional vector

Semantic Similarity

"king" - "man" + "woman" ="queen"
"Paris" - "France" + "Japan" ="Tokyo"
cos("cat", "dog") =0.82
cos("cat", "democracy") =0.12
3. Attention: How Tokens Relate

The key innovation of transformers. Each token can "attend" to every other token, learning which words are relevant for understanding each position.

Example attention weights for "The cat sat on the mat" (each row shows how strongly that token attends to every token in the sentence; rows sum to 1):

            The    cat    sat    on     the    mat
    The     0.60   0.15   0.08   0.07   0.05   0.05
    cat     0.25   0.35   0.15   0.10   0.08   0.07
    sat     0.05   0.45   0.15   0.10   0.05   0.20
    on      0.08   0.12   0.35   0.25   0.10   0.10
    the     0.30   0.10   0.08   0.12   0.30   0.10
    mat     0.05   0.20   0.15   0.15   0.10   0.35

Reading the "sat" row: it attends most to "cat" (0.45), then "mat" (0.20), itself (0.15), and "on" (0.10).

The Q-K-V Mechanism

  • Q (Query): "What am I looking for?" Each token asks a question.
  • K (Key): "What do I contain?" Each token advertises itself.
  • V (Value): "What do I offer?" The actual content to retrieve.

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
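
A minimal numpy sketch of single-head scaled dot-product attention, directly following the formula above; Q, K, and V are random stand-ins for projected token vectors.

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)     # subtract the max for numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)             # (seq, seq): how well each query matches each key
        weights = softmax(scores, axis=-1)          # each row sums to 1, like the table above
        return weights @ V, weights

    seq_len, d_k = 6, 64                            # six tokens, as in "The cat sat on the mat"
    rng = np.random.default_rng(0)
    Q = rng.standard_normal((seq_len, d_k))
    K = rng.standard_normal((seq_len, d_k))
    V = rng.standard_normal((seq_len, d_k))

    out, weights = attention(Q, K, V)
    print(weights.shape, out.shape)                 # (6, 6) (6, 64)
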
4. Next Token Prediction

The core task of language models. Given context, predict the probability distribution over all possible next tokens.

Example: given the context "The capital of France is", the model's top predictions might look like:

    Paris   92.0%
    the      3.0%
    a        2.0%

with the remaining probability spread across the rest of the vocabulary.

Sampling Strategies

  • Greedy: always pick the top token. Deterministic but repetitive.
  • Temperature: flatten or sharpen the distribution. T -> 0 approaches greedy; T = 1 leaves the distribution unchanged; higher values are more diverse.
  • Top-K: sample from only the top K tokens. K=40 is a common default.
  • Top-P (Nucleus): sample from the smallest set of tokens whose cumulative probability reaches P. P=0.9 adapts to the shape of the distribution. (A toy sketch of these strategies follows below.)
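
The sketch below implements temperature, top-k, and top-p sampling over a toy five-token distribution with numpy; real implementations operate on logits over the full vocabulary.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample(probs):
        return int(rng.choice(len(probs), p=probs))

    def temperature_sample(logits, T=0.8):
        # T < 1 sharpens the distribution, T > 1 flattens it; T -> 0 approaches greedy.
        z = logits / T
        probs = np.exp(z - z.max())
        return sample(probs / probs.sum())

    def top_k_sample(probs, k=40):
        # Keep only the k most likely tokens, renormalize, then sample.
        top = np.argsort(probs)[-k:]
        p = np.zeros_like(probs)
        p[top] = probs[top]
        return sample(p / p.sum())

    def top_p_sample(probs, p=0.9):
        # Keep the smallest set of tokens whose cumulative probability reaches p.
        order = np.argsort(probs)[::-1]
        cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
        keep = order[:cutoff]
        q = np.zeros_like(probs)
        q[keep] = probs[keep]
        return sample(q / q.sum())

    # Toy distribution over a 5-token "vocabulary": ["Paris", "the", "a", "Lyon", "in"]
    probs = np.array([0.92, 0.03, 0.02, 0.02, 0.01])
    logits = np.log(probs)
    print(temperature_sample(logits), top_k_sample(probs, k=3), top_p_sample(probs, p=0.9))
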
5. Model Size Comparison

    Model           Parameters   Layers   Context
    GPT-2           1.5B         48       1K
    GPT-3           175B         96       4K
    GPT-4           ~1.8T*       ~120*    128K
    Llama 3.1 70B   70B          80       128K
    Claude 3.5      ?            ?        200K

* Estimated values for proprietary models

The Complete Pipeline

Text -> Tokens -> Embeddings -> N x (Attention + FFN) -> Next Token

The model generates text by repeatedly predicting the next token, appending it to the context, and predicting again. This autoregressive process continues until a stop condition is met.
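
A minimal sketch of that loop, with a stand-in next_token_probs function (a hypothetical name) in place of a real transformer forward pass:

    import numpy as np

    rng = np.random.default_rng(0)
    EOS = 0     # hypothetical end-of-sequence token ID

    def next_token_probs(token_ids):
        # Stand-in for a real model: ignores the context and returns a random
        # distribution over a tiny 10-token vocabulary.
        logits = rng.standard_normal(10)
        probs = np.exp(logits - logits.max())
        return probs / probs.sum()

    def generate(prompt_ids, max_new_tokens=20):
        tokens = list(prompt_ids)
        for _ in range(max_new_tokens):
            probs = next_token_probs(tokens)                # predict the next-token distribution
            next_id = int(rng.choice(len(probs), p=probs))  # sample (or argmax for greedy)
            tokens.append(next_id)                          # append and feed the longer context back in
            if next_id == EOS:                              # stop condition
                break
        return tokens

    print(generate([5, 2, 7]))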

Use Cases

  • Chatbots and assistants
  • Text summarization
  • Translation
  • Content generation
  • Question answering

Architectural Patterns

Direct LLM Generation

Pass input to an LLM with appropriate prompting.

Pros:
  • Simple
  • Flexible
  • Handles many tasks
Cons:
  • May hallucinate
  • Limited to training knowledge

RAG (Retrieval-Augmented Generation)

Retrieve relevant context, then generate with LLM.

Pros:
  • Grounded in data
  • Up-to-date
  • Citable
Cons:
  • Retrieval quality matters
  • More complex pipeline
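
A minimal sketch of the RAG flow. The retriever here ranks documents by word overlap purely for illustration (real systems use embedding similarity over a vector index), and call_llm is a hypothetical stand-in for whichever model API you use.

    def call_llm(prompt):
        # Hypothetical stand-in for a real LLM API call.
        return f"[answer generated from the prompt below]\n{prompt}"

    documents = [
        "The Eiffel Tower is in Paris and was completed in 1889.",
        "Photosynthesis converts sunlight into chemical energy.",
        "The capital of Japan is Tokyo.",
    ]

    def retrieve(question, k=1):
        # Toy retriever: rank documents by the number of shared lowercase words.
        q_words = set(question.lower().split())
        ranked = sorted(documents,
                        key=lambda d: len(q_words & set(d.lower().split())),
                        reverse=True)
        return ranked[:k]

    def rag_answer(question):
        context = "\n".join(retrieve(question))                  # 1. retrieve relevant context
        prompt = (f"Answer using only this context:\n{context}"  # 2. ground the prompt in it
                  f"\n\nQuestion: {question}")
        return call_llm(prompt)                                  # 3. generate with the LLM

    print(rag_answer("Where is the Eiffel Tower located?"))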

Agent with Tools

LLM that can call external tools, APIs, and functions.

Pros:
  • Can take actions
  • Access real-time data
Cons:
  • Complex error handling
  • Security considerations
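
A minimal sketch of the agent loop. Everything here is a stand-in: call_llm is a fake model that always requests the calculator once and then answers, and the JSON tool-call format is made up for illustration; real frameworks define their own protocols.

    import json

    def call_llm(messages):
        # Fake model: asks for the calculator if it has not seen a tool result yet.
        if not any(m["role"] == "tool" for m in messages):
            return json.dumps({"tool": "calculator", "arguments": {"expression": "17 * 24"}})
        return "17 * 24 = 408"

    TOOLS = {
        "calculator": lambda args: str(eval(args["expression"])),  # demo only; eval is unsafe
    }

    def run_agent(question, max_steps=5):
        messages = [{"role": "user", "content": question}]
        for _ in range(max_steps):
            reply = call_llm(messages)
            try:
                call = json.loads(reply)                        # the model requested a tool call
            except json.JSONDecodeError:
                return reply                                    # plain text: treat it as the final answer
            result = TOOLS[call["tool"]](call["arguments"])     # execute the tool
            messages.append({"role": "tool", "content": result})  # feed the result back to the model
        return "stopped: too many steps"

    print(run_agent("What is 17 * 24?"))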

Implementations

API Services

GPT-4o

OpenAI
API

Top-tier reasoning. Good balance of speed and quality.

Claude 3.5 Sonnet

Anthropic
API

Excellent for long context and code. Strong reasoning.

Gemini 1.5 Pro

Google
API

1M token context. Good for very long documents.

Mistral Large

Mistral
API

Strong European option. Good for function calling.

Open Source

Llama 3.1 405B

Llama 3.1 Community License
Open Source

Best open-source. Requires significant compute.

Qwen2.5-72B

Apache 2.0
Open Source

Strong multilingual. Good for self-hosting.

Benchmarks

Quick Facts

Input: Text
Output: Text
Implementations: 2 open source, 4 API
Patterns: 3 approaches

Have benchmark data?

Help us track the state of the art for language models.

Submit Results