Tokens & Context Windows
How LLMs read text. Not characters, not words - tokens. Understanding this changes how you prompt.
What is a Token?
When you send text to an LLM, it does not see characters or words. It sees tokens - subword units that the model learned during training.
Tokens are chunks of text that typically represent common sequences of characters. A token might be:
- A whole common word: "the"
- Part of a word: "un" + "believ" + "able"
- A single character: "X" (rare or unusual characters)
- A number or punctuation mark: "123" or "!"
Rule of Thumb
In English, 1 token is approximately 4 characters or 0.75 words. But this varies significantly by language and content type.
Tokenization in Action
Different tokenizers split text differently. Here is how GPT-4 tokenizes common sentences:
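If you want to inspect the splits yourself, a minimal tiktoken sketch is to decode each token ID on its own (the example sentence here is arbitrary):

import tiktoken

# cl100k_base, the encoding used by GPT-4 and GPT-3.5-turbo
enc = tiktoken.encoding_for_model("gpt-4")

sentence = "Tokenization is unbelievably useful!"
token_ids = enc.encode(sentence)

# Decode each ID individually to see the sub-word pieces the model actually sees
pieces = [enc.decode([tid]) for tid in token_ids]
print(pieces)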
Counting Tokens in Python
Use tiktoken to count tokens for OpenAI models. Each model family uses a specific encoding.
import tiktoken
# Get the encoding for GPT-4
enc = tiktoken.encoding_for_model("gpt-4")
# Encode text to tokens
text = "Hello, world!"
tokens = enc.encode(text)
print(f"Text: {text}")
print(f"Token count: {len(tokens)}") # Output: 4
print(f"Token IDs: {tokens}") # Output: [9906, 11, 1917, 0]
# Decode tokens back to text
decoded = enc.decode(tokens)
print(f"Decoded: {decoded}") # Output: Hello, world!
# Different model encodings
for model in ["gpt-4", "gpt-3.5-turbo", "text-embedding-ada-002"]:
    enc = tiktoken.encoding_for_model(model)
    count = len(enc.encode("Machine learning is transforming industries."))
    print(f"{model}: {count} tokens")

Common Encodings
| Encoding | Models | Vocab Size |
|---|---|---|
| cl100k_base | GPT-4, GPT-3.5-turbo, text-embedding-3-* | 100,256 |
| o200k_base | GPT-4o, o1 | 200,000 |
| p50k_base | Codex models, text-davinci-003 | 50,281 |
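You can also load an encoding directly by name rather than by model, which helps when a model is not yet known to your tiktoken version (a small sketch; the sample text is arbitrary):

import tiktoken

# Load the encoding by name instead of via a model
enc = tiktoken.get_encoding("cl100k_base")

print(enc.n_vocab)                       # vocabulary size (including special tokens)
print(len(enc.encode("Hello, world!")))  # same 4 tokens as encoding_for_model("gpt-4")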
Understanding Context Windows
The context window is the maximum number of tokens a model can process in a single request. This includes both your input (prompt) AND the model's output (completion).
If your prompt is 3,000 tokens and the model has a 4K context window, you only have ~1,000 tokens left for the response.
Context Window = Input + Output
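A quick way to check the remaining budget is to count the prompt first (a minimal sketch, assuming a 4K window as in the example above; the prompt is a placeholder):

import tiktoken

CONTEXT_WINDOW = 4096   # assumed limit for this example

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
prompt = "..."          # placeholder for your actual prompt

prompt_tokens = len(enc.encode(prompt))
room_for_output = CONTEXT_WINDOW - prompt_tokens
print(f"Prompt uses {prompt_tokens} tokens; {room_for_output} left for the response")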
| Model | Context Window | Approx. Pages | Use Case |
|---|---|---|---|
| GPT-3.5-turbo | 16K tokens | ~24 pages | Short documents, chat |
| GPT-4 / GPT-4 Turbo | 8K / 128K tokens | ~12 / ~192 pages | Complex reasoning; 128K applies to Turbo |
| GPT-4o | 128K tokens | ~192 pages | Long documents, code |
| Claude 3 Opus/Sonnet | 200K tokens | ~300 pages | Books, legal docs |
| Gemini 1.5 Pro | 1M / 2M tokens | ~1500 / ~3000 pages | Massive context tasks |
| Llama 3.1 (8B/70B) | 128K tokens | ~192 pages | Self-hosted, private |
Why Context Size Matters
More Context = More Information
With 128K tokens, you can include entire codebases, long documents, or extensive conversation history. The model sees everything at once.
RAG vs Long Context
Long context can replace RAG for some use cases. Instead of retrieving chunks, just put the whole document in the prompt.
Attention Degradation
Models may lose focus in very long contexts. The "lost in the middle" problem: information at the start and end is recalled better than the middle.
Cost Scales Linearly
More tokens = higher cost. A 100K token request costs 10x more than a 10K token request. Optimize your context for efficiency.
Token Pricing
API providers charge per token, typically quoted per 1 million tokens. Input and output tokens often have different prices.
| Model | Input ($/1M) | Output ($/1M) | ~1000 queries* |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | $2.50 |
| GPT-4o-mini | $0.15 | $0.60 | $0.15 |
| Claude 3.5 Sonnet | $3.00 | $15.00 | $3.00 |
| Claude 3 Haiku | $0.25 | $1.25 | $0.25 |
| Gemini 1.5 Flash | $0.075 | $0.30 | $0.08 |
*Estimated cost for 1000 queries at ~1000 input tokens each (prices as of early 2025, subject to change)
Cost Calculation Example
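As a rough sketch using the GPT-4o prices from the table above (prices change, so treat the numbers as placeholders):

# GPT-4o prices from the table above: $2.50 / 1M input, $10.00 / 1M output
INPUT_PRICE_PER_1M = 2.50
OUTPUT_PRICE_PER_1M = 10.00

input_tokens = 10_000   # e.g. a long document in the prompt
output_tokens = 1_000   # e.g. a one-page summary

cost = (input_tokens / 1_000_000) * INPUT_PRICE_PER_1M \
     + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_1M
print(f"Estimated cost: ${cost:.4f}")   # $0.025 + $0.010 = $0.035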
Practical Tips
1. Count Before You Send
Always count tokens before making API calls, especially with user-provided content.
import tiktoken

def check_token_limit(text: str, max_tokens: int = 8000) -> bool:
    enc = tiktoken.encoding_for_model("gpt-4")
    token_count = len(enc.encode(text))
    return token_count <= max_tokens

2. Set max_tokens for Outputs
Always set max_tokens in your API calls to prevent runaway costs and ensure you stay within context limits.
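With the OpenAI Python client that looks roughly like this (a sketch; the model choice and the 200-token cap are arbitrary):

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize tokenization in two sentences."}],
    max_tokens=200,   # hard cap on the completion length
)
print(response.choices[0].message.content)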
3. Truncate Strategically
When content exceeds limits, truncate intelligently - keep the most relevant parts. For conversations, keep the system message + recent messages.
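One possible sketch, assuming OpenAI-style message dicts and ignoring per-message token overhead:

import tiktoken

def truncate_conversation(messages, max_tokens=8000, model="gpt-4"):
    """Keep the system message plus as many recent messages as fit the budget.

    `messages` is a list of {"role": ..., "content": ...} dicts; the counts
    here ignore per-message formatting overhead, so treat the budget as approximate.
    """
    enc = tiktoken.encoding_for_model(model)
    count = lambda m: len(enc.encode(m["content"]))

    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    budget = max_tokens - sum(count(m) for m in system)
    kept = []
    for msg in reversed(rest):      # walk backwards from the most recent message
        if count(msg) > budget:
            break
        kept.insert(0, msg)
        budget -= count(msg)
    return system + kept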
4. Use Smaller Models When Possible
GPT-4o-mini and Claude Haiku are often sufficient for simpler tasks. Save the expensive models for complex reasoning.
Tools and Resources
OpenAI Tokenizer
Official web tool to visualize how text gets tokenized. Great for debugging.
platform.openai.com/tokenizer

tiktoken Library
Fast Python library for tokenizing text for OpenAI models. Essential for production.
pip install tiktoken

Claude Tokenization
Anthropic's documentation on Claude tokenization. Different tokenizer than OpenAI.
docs.anthropic.com

HuggingFace Tokenizers
Deep dive into tokenizer concepts: BPE, WordPiece, SentencePiece, and more.
huggingface.co/docs

Key Takeaways
1. Tokens are subwords, not characters or words - LLMs see text as token sequences, typically ~4 characters per token in English.
2. Context window = input + output - Budget your tokens. A 128K context does not mean 128K input if you need a long response.
3. Use tiktoken to count tokens - Always validate token counts before API calls to avoid truncation or errors.
4. Optimize for cost and quality - More tokens = higher cost. Use smaller models for simpler tasks.