Real-time Systems
Build production AI systems that respond in milliseconds. Streaming, caching, GPU serving, and cost optimization.
The Latency Challenge
Users expect fast responses. A 200ms delay feels instant. A 1-second delay is noticeable. A 3-second delay loses users.
LLMs are inherently slow because they generate text one token at a time. Real-time systems need multiple strategies to hide and reduce this latency.
Where Latency Comes From
End-to-end response time adds up from several parts: network and queueing overhead, prompt processing (time to first token), and token-by-token generation, which usually dominates for long outputs.
Streaming Responses
Streaming sends tokens as they're generated instead of waiting for completion. The user sees the response building in real-time, which feels much faster.
Without Streaming
User waits 3 seconds seeing nothing, then entire response appears.
With Streaming
First token appears in 100ms, response builds progressively.
Streaming with OpenAI
# Streaming with OpenAI
from openai import OpenAI

client = OpenAI()

def stream_response(prompt: str):
    stream = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )
    for chunk in stream:
        if chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content

# Usage - print tokens as they arrive
for token in stream_response("Explain quantum computing"):
    print(token, end="", flush=True)

Server-Sent Events (SSE) in FastAPI
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI

app = FastAPI()
client = OpenAI()

@app.get("/stream")
async def stream_endpoint(prompt: str):
    # Sync generator: FastAPI/Starlette iterates it in a threadpool,
    # so the blocking OpenAI stream does not stall the event loop.
    def generate():
        stream = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            stream=True
        )
        for chunk in stream:
            if chunk.choices[0].delta.content:
                yield f"data: {chunk.choices[0].delta.content}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(
        generate(),
        media_type="text/event-stream"
    )

Latency Optimization Techniques
Semantic Caching
Cache responses by semantic similarity, not exact match, so similar queries return cached results (see the sketch after this list of techniques).
Speculative Decoding
Use a small fast model to draft tokens, large model to verify. 2-3x speedup with same quality.
Model Distillation
Train smaller models on larger model outputs. Production uses the fast distilled model.
Prefetching
Predict likely queries and pre-compute responses. Works for predictable user flows.
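Semantic Cache Sketch
A minimal sketch of the semantic caching idea above, assuming OpenAI embeddings and a small in-memory store; the embedding model, the 0.92 similarity threshold, and the linear scan are illustrative choices, not a specific library's API.
# Semantic cache: return a stored answer when a new query is close enough
# (by cosine similarity) to a previously answered one.
import numpy as np
from openai import OpenAI

client = OpenAI()
_cache = []  # list of (query embedding, cached response) pairs

def _embed(text: str) -> np.ndarray:
    result = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(result.data[0].embedding)

def cached_answer(query: str, threshold: float = 0.92):
    """Return a cached response for a semantically similar past query, else None."""
    q = _embed(query)
    for vec, response in _cache:
        similarity = float(np.dot(q, vec) / (np.linalg.norm(q) * np.linalg.norm(vec)))
        if similarity >= threshold:
            return response
    return None

def store_answer(query: str, response: str) -> None:
    _cache.append((_embed(query), response))
In production the linear scan is typically replaced by a vector index, and the threshold is tuned so that paraphrases hit the cache while genuinely different questions do not.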
GPU Serving at Scale
For self-hosted models, you need efficient inference servers. vLLM and TGI are the industry standards.
vLLM
PagedAttention for memory efficiency. High throughput batching. OpenAI-compatible API.
- 24x throughput vs HuggingFace
- Continuous batching
- Tensor parallelism
Text Generation Inference (TGI)
HuggingFace's production inference server. Flash Attention, quantization, watermarking (see the client sketch after this list).
- Easy HuggingFace integration
- Built-in metrics
- Speculative decoding
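Querying TGI from Python
A minimal client-side sketch, assuming a TGI server is already running locally on port 8080 and that huggingface_hub is installed; the URL and generation parameters are illustrative.
# Query a running TGI server with huggingface_hub's InferenceClient
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")

# One-shot generation
print(client.text_generation("Explain quantum computing", max_new_tokens=128))

# Streaming generation - tokens arrive as they are produced
for token in client.text_generation(
    "Explain quantum computing", max_new_tokens=128, stream=True
):
    print(token, end="", flush=True)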
vLLM Batch Inference
# vLLM serving for high throughput
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")
params = SamplingParams(temperature=0.7, max_tokens=256)

# Batch multiple requests
prompts = ["Question 1", "Question 2", "Question 3"]
outputs = llm.generate(prompts, params)  # Batched inference

for output in outputs:
    print(output.outputs[0].text)

vLLM as OpenAI-compatible Server
# Start vLLM server
# python -m vllm.entrypoints.openai.api_server \
#     --model meta-llama/Llama-2-7b-chat-hf \
#     --port 8000

# Use with OpenAI client
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Hello!"}]
)

Cost Optimization
Real-time doesn't have to mean expensive. These techniques reduce costs while maintaining speed.
Request Batching
Collect multiple requests and process them together. GPU utilization goes from 10% to 80%. Trade a small latency increase for major throughput gains.
Quantization
Reduce model precision from FP16 to INT8 or INT4. 2-4x memory reduction, 1.5-2x speed increase, minimal quality loss.
Model Routing
Use cheap models for simple queries and expensive models for complex ones. A classifier routes each request to the appropriate tier (see the sketch after this list).
Prompt Compression
Reduce input token count. Remove redundant instructions, compress context, use efficient prompting. Fewer tokens = lower cost + faster response.
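Model Routing Sketch
A minimal sketch of complexity-based routing, assuming an OpenAI-style API and using gpt-3.5-turbo and gpt-4 as stand-ins for the cheap and expensive tiers; the keyword heuristic is an illustrative placeholder for a trained router.
# Route simple queries to a cheap model, complex queries to a stronger one
from openai import OpenAI

client = OpenAI()

SMALL_MODEL = "gpt-3.5-turbo"  # cheap, fast tier (illustrative choice)
LARGE_MODEL = "gpt-4"          # expensive, capable tier (illustrative choice)

def classify_complexity(query: str) -> str:
    """Crude heuristic stand-in for a learned classifier."""
    long_query = len(query.split()) > 60
    hard_keywords = any(
        k in query.lower() for k in ("prove", "analyze", "compare", "step by step")
    )
    return "complex" if long_query or hard_keywords else "simple"

def route_and_answer(query: str) -> str:
    model = LARGE_MODEL if classify_complexity(query) == "complex" else SMALL_MODEL
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": query}],
    )
    return response.choices[0].message.content
In practice the heuristic is replaced by a small trained classifier or an embedding-based router, and low-confidence answers can be retried on the larger tier.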
Cost Comparison (per 1M tokens)
Production Architecture Patterns
Putting it all together: a production real-time AI system.
High-Throughput Architecture
User Request
      |
      v
[Load Balancer]
      |
      v
[Semantic Cache] --hit--> Return cached response
      |
     miss
      v
[Query Classifier] --simple--> Small Model (7B)
      |                              |
   complex                           v
      v                          [Response]
[Large Model (70B)]                  |
      |                              v
      v                        [Cache Update]
[Response]
      |
      v
[Cache Update]

Key insight: Real-time doesn't mean no latency. It means perceived low latency. Streaming, caching, and smart routing make the system feel instant even when models take seconds to generate.
Key Takeaways
1. Streaming hides latency - Users see tokens appearing, making responses feel instant even when generation takes seconds.
2. Semantic caching multiplies throughput - Similar queries return cached results. Huge win for common questions.
3. vLLM/TGI for self-hosted inference - 24x throughput improvement over naive HuggingFace inference.
4. Route by complexity - Simple queries to small models, complex to large. Best cost/quality tradeoff.
Level 4 Complete
You've completed the Advanced level. You now understand multi-modal RAG, agent architectures, video understanding, and production real-time systems.