Level 0: Foundations (~12 min)

Tokens & Context Windows

How LLMs read text. Not characters, not words - tokens. Understanding this changes how you prompt.

What is a Token?

When you send text to an LLM, it does not see characters or words. It sees tokens - subword units that the model learned during training.

Tokens are chunks of text that typically represent common sequences of characters. A token might be:

  • A whole common word: "the"
  • Part of a word: "un" + "believ" + "able"
  • A single character: "X" (rare or unusual characters)
  • A number or punctuation mark: "123" or "!"

Rule of Thumb

In English, 1 token is approximately 4 characters or 0.75 words. But this varies significantly by language and content type.

  • English text: ~4 chars/token
  • Code: ~2-3 chars/token
  • Non-Latin scripts: ~1-2 chars/token
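
These ratios are easy to check yourself. Here is a quick sketch using tiktoken's cl100k_base encoding - the sample strings are arbitrary, and your exact ratios will vary with the text:

chars_per_token.py
import tiktoken

# cl100k_base is the encoding used by GPT-4 and GPT-3.5-turbo
enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English text": "The quick brown fox jumps over the lazy dog.",
    "Python code": 'def add(a, b):\n    return a + b',
}

for label, text in samples.items():
    n_tokens = len(enc.encode(text))
    print(f"{label}: {len(text)} chars / {n_tokens} tokens "
          f"= {len(text) / n_tokens:.1f} chars/token")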

Tokenization in Action

Different tokenizers split text differently. Here is how GPT-4's tokenizer (cl100k_base) splits some common inputs:

Input:
"Hello, world!"
Tokens (4 total):
"Hello" | "," | " world" | "!"

Input:
"Tokenization is fascinating!"
Tokens (5 total):
"Token" | "ization" | " is" | " fascinating" | "!"

Input (Python code):
def hello(): return "world"
Tokens (9 total) - keywords, punctuation, and string quotes each tend to become separate tokens, which is why code averages fewer characters per token than prose.
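
You can reproduce splits like these by decoding each token ID individually - a handy trick for inspecting how any string gets divided:

token_splits.py
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["Hello, world!", "Tokenization is fascinating!"]:
    tokens = enc.encode(text)
    pieces = [enc.decode([t]) for t in tokens]  # decode one token at a time
    print(f"{text!r} -> {len(tokens)} tokens: {pieces}")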

Counting Tokens in Python

Use tiktoken to count tokens for OpenAI models. Each model family uses a specific encoding.

token_counting.py
import tiktoken

# Get the encoding for GPT-4
enc = tiktoken.encoding_for_model("gpt-4")

# Encode text to tokens
text = "Hello, world!"
tokens = enc.encode(text)

print(f"Text: {text}")
print(f"Token count: {len(tokens)}")  # Output: 4
print(f"Token IDs: {tokens}")         # Output: [9906, 11, 1917, 0]

# Decode tokens back to text
decoded = enc.decode(tokens)
print(f"Decoded: {decoded}")          # Output: Hello, world!

# Different model encodings
for model in ["gpt-4", "gpt-3.5-turbo", "text-embedding-ada-002"]:
    enc = tiktoken.encoding_for_model(model)
    count = len(enc.encode("Machine learning is transforming industries."))
    print(f"{model}: {count} tokens")

Common Encodings

Encoding      Models                                      Vocab Size
cl100k_base   GPT-4, GPT-3.5-turbo, text-embedding-3-*    100,256
o200k_base    GPT-4o, o1                                  200,000
p50k_base     Codex models, text-davinci-003              50,281
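
To confirm these numbers yourself, tiktoken exposes each encoding by name. Note that the library's reported vocabulary size may differ slightly from the table depending on how special tokens are counted:

encodings.py
import tiktoken

for name in ["cl100k_base", "o200k_base", "p50k_base"]:
    enc = tiktoken.get_encoding(name)
    print(f"{name}: vocab size {enc.n_vocab}")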

Understanding Context Windows

The context window is the maximum number of tokens a model can process in a single request. This includes both your input (prompt) AND the model's output (completion).

If your prompt is 3,000 tokens and the model has a 4K context window, you only have ~1,000 tokens left for the response.

Context Window = Input + Output

  Input tokens    (your prompt + system message + context)
+ Output tokens   (the model's response)
= Context window  (the maximum allowed)
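
In code, the budgeting looks like this - a minimal sketch (the function name is our own) that computes how many tokens remain for the response:

output_budget.py
import tiktoken

def remaining_output_budget(prompt: str, context_window: int = 4096) -> int:
    """Tokens left for the completion after the prompt is counted."""
    enc = tiktoken.get_encoding("cl100k_base")
    return context_window - len(enc.encode(prompt))

# A 3,000-token prompt in a 4K window leaves ~1,000 tokens for the response
print(remaining_output_budget("Summarize the following document: ..."))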
Model                  Context Window     Approx. Pages         Use Case
GPT-3.5-turbo          16K tokens         ~24 pages             Short documents, chat
GPT-4                  8K / 128K tokens   ~12 / ~192 pages      Varies by endpoint
GPT-4o                 128K tokens        ~192 pages            Long documents, code
Claude 3 Opus/Sonnet   200K tokens        ~300 pages            Books, legal docs
Gemini 1.5 Pro         1M / 2M tokens     ~1500 / ~3000 pages   Massive context tasks
Llama 3.1 (8B/70B)     128K tokens        ~192 pages            Self-hosted, private

Why Context Size Matters

More Context = More Information

With 128K tokens, you can include entire codebases, long documents, or extensive conversation history. The model sees everything at once.

RAG vs Long Context

Long context can replace RAG for some use cases. Instead of retrieving chunks, just put the whole document in the prompt.

Attention Degradation

Models may lose focus in very long contexts. The "lost in the middle" problem: information at the start and end is recalled better than the middle.

Cost Scales Linearly

More tokens = higher cost. A 100K token request costs 10x more than a 10K token request. Optimize your context for efficiency.

Token Pricing

API providers charge per token, typically quoted per 1 million tokens. Input and output tokens often have different prices.

Model               Input ($/1M)   Output ($/1M)   ~1000 queries*
GPT-4o              $2.50          $10.00          $2.50
GPT-4o-mini         $0.15          $0.60           $0.15
Claude 3.5 Sonnet   $3.00          $15.00          $3.00
Claude 3 Haiku      $0.25          $1.25           $0.25
Gemini 1.5 Flash    $0.075         $0.30           $0.08

*Estimated cost for 1000 queries at ~1000 input tokens each (prices as of early 2025, subject to change)

Cost Calculation Example

# Example: GPT-4o query with 2,000 input + 500 output tokens
input_cost = 2000 / 1_000_000 * 2.50     # $0.005
output_cost = 500 / 1_000_000 * 10.00    # $0.005
total_cost = input_cost + output_cost    # $0.01 per query
monthly_cost = total_cost * 10_000       # $100/month at 10,000 queries
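
The same arithmetic works for any model in the pricing table. Here is a small helper (the function name is ours) that takes per-million-token prices:

def query_cost(input_tokens: int, output_tokens: int,
               input_price: float, output_price: float) -> float:
    """Dollar cost of one query; prices are $ per 1M tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# GPT-4o example from above: 2000 input + 500 output tokens
print(query_cost(2000, 500, input_price=2.50, output_price=10.00))  # 0.01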

Practical Tips

1. Count Before You Send

Always count tokens before making API calls, especially with user-provided content.

import tiktoken

def check_token_limit(text: str, max_tokens: int = 8000) -> bool:
    """Return True if the text fits within the token budget."""
    enc = tiktoken.encoding_for_model("gpt-4")
    token_count = len(enc.encode(text))
    return token_count <= max_tokens

2. Set max_tokens for Outputs

Always set max_tokens in your API calls to prevent runaway costs and ensure you stay within context limits.
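
For example, with the OpenAI Python SDK (a minimal sketch - assumes OPENAI_API_KEY is set in your environment):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize tokenization in one sentence."}],
    max_tokens=100,  # hard cap on the completion length
)
print(response.choices[0].message.content)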

3. Truncate Strategically

When content exceeds limits, truncate intelligently - keep the most relevant parts. For conversations, keep the system message + recent messages.
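
One way to implement this for chat histories - a sketch that assumes messages[0] is the system message and that counting message content alone is close enough (real chat formats add a few tokens of per-message overhead):

import tiktoken

def truncate_history(messages: list[dict], max_tokens: int = 6000) -> list[dict]:
    """Keep the system message plus as many recent messages as fit."""
    enc = tiktoken.encoding_for_model("gpt-4")
    count = lambda m: len(enc.encode(m["content"]))
    system, rest = messages[0], messages[1:]
    budget = max_tokens - count(system)
    kept = []
    for msg in reversed(rest):  # walk backwards from the most recent message
        budget -= count(msg)
        if budget < 0:
            break
        kept.append(msg)
    return [system] + list(reversed(kept))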

4. Use Smaller Models When Possible

GPT-4o-mini and Claude Haiku are often sufficient for simpler tasks. Save the expensive models for complex reasoning.

Key Takeaways

  1. Tokens are subwords, not characters or words - LLMs see text as token sequences, typically ~4 characters per token in English.

  2. Context window = input + output - budget your tokens. A 128K context does not mean 128K of input if you need a long response.

  3. Use tiktoken to count tokens - always validate token counts before API calls to avoid truncation or errors.

  4. Optimize for cost and quality - more tokens = higher cost. Use smaller models for simpler tasks.