
Language Model

Transform, generate, or reason about text. The core building block for chatbots, summarization, translation, and more.

How Large Language Models Work

A technical deep-dive into transformer-based language models. How text becomes tokens, tokens become vectors, and vectors predict the next word.

1. Tokenization: Text to Numbers

LLMs don't see text as characters. They see tokens - subword units learned from training data.

Example: "The quick brown fox" splits into 4 tokens (19 characters, roughly 4.8 characters per token):

    The      ID: 464
    ▁quick   ID: 2159
    ▁brown   ID: 7586
    ▁fox     ID: 21831

(The ▁ marks a leading space that belongs to the token.)
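
A quick way to see tokenization in practice is to run a tokenizer library directly. The sketch below assumes the tiktoken package is installed; token IDs differ between tokenizers, so the IDs above are illustrative.

    import tiktoken

    # Load a BPE tokenizer; other models use other vocabularies and produce other IDs.
    enc = tiktoken.get_encoding("cl100k_base")

    text = "The quick brown fox"
    token_ids = enc.encode(text)                    # list of integer token IDs
    pieces = [enc.decode([t]) for t in token_ids]   # decode each ID back to its text piece

    print(token_ids)
    print(pieces)
    print(f"{len(token_ids)} tokens, {len(text)} characters, "
          f"{len(text) / len(token_ids):.1f} chars/token")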

How BPE (Byte Pair Encoding) Works

1. Start with characters: split the text into individual bytes/characters, e.g. ["h","e","l","l","o"].
2. Find common pairs: count adjacent pairs across the corpus and merge the most frequent, e.g. "l"+"l" -> "ll", because "ll" appears often.
3. Repeat ~50K times: the accumulated merges build a vocabulary of roughly 50K-100K subwords, so "hello" ends up as ["hel","lo"] or simply ["hello"].
2. Embeddings: Tokens to Vectors

Each token ID is mapped to a learned vector (typically 4096-12288 dimensions). These vectors capture semantic meaning - similar words have similar vectors.

Token Embedding Lookup

"cat"
Token ID: 9246
->
0.23
-0.45
0.12
0.89
-0.34
0.67
...
4096-dimensional vector

Semantic Similarity

"king" - "man" + "woman" ="queen"
"Paris" - "France" + "Japan" ="Tokyo"
cos("cat", "dog") =0.82
cos("cat", "democracy") =0.12
3. Attention: How Tokens Relate

The key innovation of transformers. Each token can "attend" to every other token, learning which words are relevant for understanding each position.

Example attention weights for "The cat sat on the mat" (each row shows how strongly that token attends to every token in the sentence; rows sum to 1):

            The    cat    sat    on     the    mat
    The     0.60   0.15   0.08   0.07   0.05   0.05
    cat     0.25   0.35   0.15   0.10   0.08   0.07
    sat     0.05   0.45   0.15   0.10   0.05   0.20
    on      0.08   0.12   0.35   0.25   0.10   0.10
    the     0.30   0.10   0.08   0.12   0.30   0.10
    mat     0.05   0.20   0.15   0.15   0.10   0.35

Reading the "sat" row: it attends most to "cat" (0.45), then "mat" (0.20), itself (0.15), and "on" (0.10).

The Q-K-V Mechanism

  • Q (Query): "What am I looking for?" Each token asks a question.
  • K (Key): "What do I contain?" Each token advertises itself.
  • V (Value): "What do I offer?" The actual content to retrieve.

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
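
A minimal numpy sketch of single-head scaled dot-product attention, directly following the formula above; Q, K, and V are random stand-ins for projected token vectors.

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)     # subtract the max for numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)             # (seq, seq): how well each query matches each key
        weights = softmax(scores, axis=-1)          # each row sums to 1, like the table above
        return weights @ V, weights

    seq_len, d_k = 6, 64                            # six tokens, as in "The cat sat on the mat"
    rng = np.random.default_rng(0)
    Q = rng.standard_normal((seq_len, d_k))
    K = rng.standard_normal((seq_len, d_k))
    V = rng.standard_normal((seq_len, d_k))

    out, weights = attention(Q, K, V)
    print(weights.shape, out.shape)                 # (6, 6) (6, 64)
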
4. Next Token Prediction

The core task of language models. Given context, predict the probability distribution over all possible next tokens.

Example: given the context "The capital of France is", the model's top predictions might look like:

    Paris   92.0%
    the      3.0%
    a        2.0%

with the remaining probability spread across the rest of the vocabulary.

Sampling Strategies

  • Greedy: always pick the top token. Deterministic but repetitive.
  • Temperature: flatten or sharpen the distribution. T -> 0 approaches greedy; T = 1 leaves the distribution unchanged; higher values are more diverse.
  • Top-K: sample from only the top K tokens. K=40 is a common default.
  • Top-P (Nucleus): sample from the smallest set of tokens whose cumulative probability reaches P. P=0.9 adapts to the shape of the distribution. (A toy sketch of these strategies follows below.)
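
The sketch below implements temperature, top-k, and top-p sampling over a toy five-token distribution with numpy; real implementations operate on logits over the full vocabulary.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample(probs):
        return int(rng.choice(len(probs), p=probs))

    def temperature_sample(logits, T=0.8):
        # T < 1 sharpens the distribution, T > 1 flattens it; T -> 0 approaches greedy.
        z = logits / T
        probs = np.exp(z - z.max())
        return sample(probs / probs.sum())

    def top_k_sample(probs, k=40):
        # Keep only the k most likely tokens, renormalize, then sample.
        top = np.argsort(probs)[-k:]
        p = np.zeros_like(probs)
        p[top] = probs[top]
        return sample(p / p.sum())

    def top_p_sample(probs, p=0.9):
        # Keep the smallest set of tokens whose cumulative probability reaches p.
        order = np.argsort(probs)[::-1]
        cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
        keep = order[:cutoff]
        q = np.zeros_like(probs)
        q[keep] = probs[keep]
        return sample(q / q.sum())

    # Toy distribution over a 5-token "vocabulary": ["Paris", "the", "a", "Lyon", "in"]
    probs = np.array([0.92, 0.03, 0.02, 0.02, 0.01])
    logits = np.log(probs)
    print(temperature_sample(logits), top_k_sample(probs, k=3), top_p_sample(probs, p=0.9))
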
5. Model Size Comparison

    Model           Parameters   Layers   Context
    GPT-2           1.5B         48       1K
    GPT-3           175B         96       4K
    GPT-4           ~1.8T*       ~120*    128K
    Llama 3.1 70B   70B          80       128K
    Claude 3.5      ?            ?        200K

* Estimated values for proprietary models

The Complete Pipeline

Text -> Tokens -> Embeddings -> N x (Attention + FFN) -> Next Token

The model generates text by repeatedly predicting the next token, appending it to the context, and predicting again. This autoregressive process continues until a stop condition is met.
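
A minimal sketch of that loop, with a stand-in next_token_probs function (a hypothetical name) in place of a real transformer forward pass:

    import numpy as np

    rng = np.random.default_rng(0)
    EOS = 0     # hypothetical end-of-sequence token ID

    def next_token_probs(token_ids):
        # Stand-in for a real model: ignores the context and returns a random
        # distribution over a tiny 10-token vocabulary.
        logits = rng.standard_normal(10)
        probs = np.exp(logits - logits.max())
        return probs / probs.sum()

    def generate(prompt_ids, max_new_tokens=20):
        tokens = list(prompt_ids)
        for _ in range(max_new_tokens):
            probs = next_token_probs(tokens)                # predict the next-token distribution
            next_id = int(rng.choice(len(probs), p=probs))  # sample (or argmax for greedy)
            tokens.append(next_id)                          # append and feed the longer context back in
            if next_id == EOS:                              # stop condition
                break
        return tokens

    print(generate([5, 2, 7]))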

Use Cases

  • Chatbots and assistants
  • Text summarization
  • Translation
  • Content generation
  • Question answering

Architectural Patterns

Direct LLM Generation

Pass input to an LLM with appropriate prompting.

Pros:
  • Simple
  • Flexible
  • Handles many tasks
Cons:
  • May hallucinate
  • Limited to training knowledge

RAG (Retrieval-Augmented Generation)

Retrieve relevant context, then generate with LLM.

Pros:
  • Grounded in data
  • Up-to-date
  • Citable
Cons:
  • Retrieval quality matters
  • More complex pipeline
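
A minimal sketch of the RAG flow. The retriever here ranks documents by word overlap purely for illustration (real systems use embedding similarity over a vector index), and call_llm is a hypothetical stand-in for whichever model API you use.

    def call_llm(prompt):
        # Hypothetical stand-in for a real LLM API call.
        return f"[answer generated from the prompt below]\n{prompt}"

    documents = [
        "The Eiffel Tower is in Paris and was completed in 1889.",
        "Photosynthesis converts sunlight into chemical energy.",
        "The capital of Japan is Tokyo.",
    ]

    def retrieve(question, k=1):
        # Toy retriever: rank documents by the number of shared lowercase words.
        q_words = set(question.lower().split())
        ranked = sorted(documents,
                        key=lambda d: len(q_words & set(d.lower().split())),
                        reverse=True)
        return ranked[:k]

    def rag_answer(question):
        context = "\n".join(retrieve(question))                  # 1. retrieve relevant context
        prompt = (f"Answer using only this context:\n{context}"  # 2. ground the prompt in it
                  f"\n\nQuestion: {question}")
        return call_llm(prompt)                                  # 3. generate with the LLM

    print(rag_answer("Where is the Eiffel Tower located?"))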

Agent with Tools

LLM that can call external tools, APIs, and functions.

Pros:
  • Can take actions
  • Access real-time data
Cons:
  • Complex error handling
  • Security considerations
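
A minimal sketch of the agent loop. Everything here is a stand-in: call_llm is a fake model that always requests the calculator once and then answers, and the JSON tool-call format is made up for illustration; real frameworks define their own protocols.

    import json

    def call_llm(messages):
        # Fake model: asks for the calculator if it has not seen a tool result yet.
        if not any(m["role"] == "tool" for m in messages):
            return json.dumps({"tool": "calculator", "arguments": {"expression": "17 * 24"}})
        return "17 * 24 = 408"

    TOOLS = {
        "calculator": lambda args: str(eval(args["expression"])),  # demo only; eval is unsafe
    }

    def run_agent(question, max_steps=5):
        messages = [{"role": "user", "content": question}]
        for _ in range(max_steps):
            reply = call_llm(messages)
            try:
                call = json.loads(reply)                        # the model requested a tool call
            except json.JSONDecodeError:
                return reply                                    # plain text: treat it as the final answer
            result = TOOLS[call["tool"]](call["arguments"])     # execute the tool
            messages.append({"role": "tool", "content": result})  # feed the result back to the model
        return "stopped: too many steps"

    print(run_agent("What is 17 * 24?"))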

Implementations

API Services

GPT-4o

OpenAI
API

Top-tier reasoning. Good balance of speed and quality.

Claude 3.5 Sonnet

Anthropic
API

Excellent for long context and code. Strong reasoning.

Gemini 1.5 Pro

Google
API

1M token context. Good for very long documents.

Mistral Large

Mistral
API

Strong European option. Good for function calling.

Open Source

Llama 3.1 405B

Llama 3.1 Community License
Open Source

Best open-source. Requires significant compute.

Qwen2.5-72B

Apache 2.0
Open Source

Strong multilingual. Good for self-hosting.

Benchmarks

Quick Facts

Input: Text
Output: Text
Implementations: 2 open source, 4 API
Patterns: 3 approaches

Have benchmark data?

Help us track the state of the art for language models.

Submit Results