Tokens & Context Windows
How LLMs read text. Not characters, not words - tokens. Understanding this changes how you prompt.
What is a Token?
When you send text to an LLM, it does not see characters or words. It sees tokens - subword units that the model learned during training.
Tokens are chunks of text that typically represent common sequences of characters. A token might be:
- A whole common word: "the"
- Part of a word: "un" + "believ" + "able"
- A single character: "X" (rare or unusual characters)
- A number or punctuation mark: "123" or "!"
Rule of Thumb
In English, 1 token is approximately 4 characters or 0.75 words. But this varies significantly by language and content type.
Tokenization in Action
Different tokenizers split text differently. Here is how GPT-4 tokenizes common sentences:
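If you want to inspect the splits yourself, a minimal tiktoken sketch is to decode each token ID on its own (the example sentence here is arbitrary):

import tiktoken

# cl100k_base, the encoding used by GPT-4 and GPT-3.5-turbo
enc = tiktoken.encoding_for_model("gpt-4")

sentence = "Tokenization is unbelievably useful!"
token_ids = enc.encode(sentence)

# Decode each ID individually to see the sub-word pieces the model actually sees
pieces = [enc.decode([tid]) for tid in token_ids]
print(pieces)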
Counting Tokens in Python
Use tiktoken to count tokens for OpenAI models. Each model family uses a specific encoding.
import tiktoken
# Get the encoding for GPT-4
enc = tiktoken.encoding_for_model("gpt-4")
# Encode text to tokens
text = "Hello, world!"
tokens = enc.encode(text)
print(f"Text: {text}")
print(f"Token count: {len(tokens)}") # Output: 4
print(f"Token IDs: {tokens}") # Output: [9906, 11, 1917, 0]
# Decode tokens back to text
decoded = enc.decode(tokens)
print(f"Decoded: {decoded}") # Output: Hello, world!
# Different model encodings
for model in ["gpt-4", "gpt-3.5-turbo", "text-embedding-ada-002"]:
    enc = tiktoken.encoding_for_model(model)
    count = len(enc.encode("Machine learning is transforming industries."))
    print(f"{model}: {count} tokens")

Common Encodings
| Encoding | Models | Vocab Size |
|---|---|---|
| cl100k_base | GPT-4, GPT-3.5-turbo, text-embedding-3-* | 100,256 |
| o200k_base | GPT-4o, o1 | 200,000 |
| p50k_base | Codex models, text-davinci-003 | 50,281 |
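You can also load an encoding directly by name rather than by model, which helps when a model is not yet known to your tiktoken version (a small sketch; the sample text is arbitrary):

import tiktoken

# Load the encoding by name instead of via a model
enc = tiktoken.get_encoding("cl100k_base")

print(enc.n_vocab)                       # vocabulary size (including special tokens)
print(len(enc.encode("Hello, world!")))  # same 4 tokens as encoding_for_model("gpt-4")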
Understanding Context Windows
The context window is the maximum number of tokens a model can process in a single request. This includes both your input (prompt) AND the model's output (completion).
If your prompt is 3,000 tokens and the model has a 4K context window, you only have ~1,000 tokens left for the response.
Context Window = Input + Output
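A quick way to check the remaining budget is to count the prompt first (a minimal sketch, assuming a 4K window as in the example above; the prompt is a placeholder):

import tiktoken

CONTEXT_WINDOW = 4096   # assumed limit for this example

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
prompt = "..."          # placeholder for your actual prompt

prompt_tokens = len(enc.encode(prompt))
room_for_output = CONTEXT_WINDOW - prompt_tokens
print(f"Prompt uses {prompt_tokens} tokens; {room_for_output} left for the response")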
| Model | Context Window | Approx. Pages | Use Case |
|---|---|---|---|
| GPT-3.5-turbo | 16K tokens | ~24 pages | Short documents, chat |
| GPT-4 / GPT-4 Turbo | 8K / 128K tokens | ~12 / ~192 pages | Complex reasoning; 128K applies to Turbo |
| GPT-4o | 128K tokens | ~192 pages | Long documents, code |
| Claude 3 Opus/Sonnet | 200K tokens | ~300 pages | Books, legal docs |
| Gemini 1.5 Pro | 1M / 2M tokens | ~1500 / ~3000 pages | Massive context tasks |
| Llama 3.1 (8B/70B) | 128K tokens | ~192 pages | Self-hosted, private |
Why Context Size Matters
More Context = More Information
With 128K tokens, you can include entire codebases, long documents, or extensive conversation history. The model sees everything at once.
RAG vs Long Context
Long context can replace RAG for some use cases. Instead of retrieving chunks, just put the whole document in the prompt.
Attention Degradation
Models may lose focus in very long contexts. The "lost in the middle" problem: information at the start and end is recalled better than the middle.
Cost Scales Linearly
More tokens = higher cost. A 100K token request costs 10x more than a 10K token request. Optimize your context for efficiency.
Token Pricing
API providers charge per token, typically quoted per 1 million tokens. Input and output tokens often have different prices.
| Model | Input ($/1M) | Output ($/1M) | ~1000 queries* |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | $2.50 |
| GPT-4o-mini | $0.15 | $0.60 | $0.15 |
| Claude 3.5 Sonnet | $3.00 | $15.00 | $3.00 |
| Claude 3 Haiku | $0.25 | $1.25 | $0.25 |
| Gemini 1.5 Flash | $0.075 | $0.30 | $0.08 |
*Estimated cost for 1000 queries at ~1000 input tokens each (prices as of early 2025, subject to change)
Cost Calculation Example
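As a rough sketch using the GPT-4o prices from the table above (prices change, so treat the numbers as placeholders):

# GPT-4o prices from the table above: $2.50 / 1M input, $10.00 / 1M output
INPUT_PRICE_PER_1M = 2.50
OUTPUT_PRICE_PER_1M = 10.00

input_tokens = 10_000   # e.g. a long document in the prompt
output_tokens = 1_000   # e.g. a one-page summary

cost = (input_tokens / 1_000_000) * INPUT_PRICE_PER_1M \
     + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_1M
print(f"Estimated cost: ${cost:.4f}")   # $0.025 + $0.010 = $0.035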
Practical Tips
1. Count Before You Send
Always count tokens before making API calls, especially with user-provided content.
import tiktoken

def check_token_limit(text: str, max_tokens: int = 8000) -> bool:
    enc = tiktoken.encoding_for_model("gpt-4")
    token_count = len(enc.encode(text))
    return token_count <= max_tokens

2. Set max_tokens for Outputs
Always set max_tokens in your API calls to prevent runaway costs and ensure you stay within context limits.
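With the OpenAI Python client that looks roughly like this (a sketch; the model choice and the 200-token cap are arbitrary):

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize tokenization in two sentences."}],
    max_tokens=200,   # hard cap on the completion length
)
print(response.choices[0].message.content)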
3. Truncate Strategically
When content exceeds limits, truncate intelligently - keep the most relevant parts. For conversations, keep the system message + recent messages.
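One possible sketch, assuming OpenAI-style message dicts and ignoring per-message token overhead:

import tiktoken

def truncate_conversation(messages, max_tokens=8000, model="gpt-4"):
    """Keep the system message plus as many recent messages as fit the budget.

    `messages` is a list of {"role": ..., "content": ...} dicts; the counts
    here ignore per-message formatting overhead, so treat the budget as approximate.
    """
    enc = tiktoken.encoding_for_model(model)
    count = lambda m: len(enc.encode(m["content"]))

    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    budget = max_tokens - sum(count(m) for m in system)
    kept = []
    for msg in reversed(rest):      # walk backwards from the most recent message
        if count(msg) > budget:
            break
        kept.insert(0, msg)
        budget -= count(msg)
    return system + kept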
4. Use Smaller Models When Possible
GPT-4o-mini and Claude Haiku are often sufficient for simpler tasks. Save the expensive models for complex reasoning.
Tools and Resources
OpenAI Tokenizer
Official web tool to visualize how text gets tokenized. Great for debugging.
platform.openai.com/tokenizer

tiktoken Library
Fast Python library for tokenizing text for OpenAI models. Essential for production.
pip install tiktoken

Claude Tokenization
Anthropic's documentation on Claude tokenization. Different tokenizer than OpenAI.
docs.anthropic.com

HuggingFace Tokenizers
Deep dive into tokenizer concepts: BPE, WordPiece, SentencePiece, and more.
huggingface.co/docs

Key Takeaways
1. Tokens are subwords, not characters or words - LLMs see text as token sequences, typically ~4 characters per token in English.
2. Context window = input + output - Budget your tokens. A 128K context does not mean 128K input if you need a long response.
3. Use tiktoken to count tokens - Always validate token counts before API calls to avoid truncation or errors.
4. Optimize for cost and quality - More tokens = higher cost. Use smaller models for simpler tasks.