Real-time Systems
Build production AI systems that respond in milliseconds. Streaming, caching, GPU serving, and cost optimization.
The Latency Challenge
Users expect fast responses. A 200ms delay feels instant. A 1-second delay is noticeable. A 3-second delay loses users.
LLMs are inherently slow because they generate text one token at a time. Real-time systems need multiple strategies to hide and reduce this latency.
Where Latency Comes From
End-to-end response time adds up from several parts: network and queueing overhead, prompt processing (time to first token), and token-by-token generation, which usually dominates for long outputs.
Streaming Responses
Streaming sends tokens as they're generated instead of waiting for completion. The user sees the response building in real-time, which feels much faster.
Without Streaming
User waits 3 seconds seeing nothing, then entire response appears.
With Streaming
First token appears in 100ms, response builds progressively.
Streaming with OpenAI
# Streaming with OpenAI
from openai import OpenAI

client = OpenAI()

def stream_response(prompt: str):
    stream = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )
    for chunk in stream:
        if chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content

# Usage - print tokens as they arrive
for token in stream_response("Explain quantum computing"):
    print(token, end="", flush=True)

Server-Sent Events (SSE) in FastAPI
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI

app = FastAPI()
client = OpenAI()

@app.get("/stream")
async def stream_endpoint(prompt: str):
    # Sync generator: FastAPI/Starlette iterates it in a threadpool,
    # so the blocking OpenAI stream does not stall the event loop.
    def generate():
        stream = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            stream=True
        )
        for chunk in stream:
            if chunk.choices[0].delta.content:
                yield f"data: {chunk.choices[0].delta.content}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(
        generate(),
        media_type="text/event-stream"
    )

Latency Optimization Techniques
Semantic Caching
Cache responses by semantic similarity, not exact match, so similar queries return cached results (see the sketch after this list of techniques).
Speculative Decoding
Use a small fast model to draft tokens, large model to verify. 2-3x speedup with same quality.
Model Distillation
Train smaller models on larger model outputs. Production uses the fast distilled model.
Prefetching
Predict likely queries and pre-compute responses. Works for predictable user flows.
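Semantic Cache Sketch
A minimal sketch of the semantic caching idea above, assuming OpenAI embeddings and a small in-memory store; the embedding model, the 0.92 similarity threshold, and the linear scan are illustrative choices, not a specific library's API.
# Semantic cache: return a stored answer when a new query is close enough
# (by cosine similarity) to a previously answered one.
import numpy as np
from openai import OpenAI

client = OpenAI()
_cache = []  # list of (query embedding, cached response) pairs

def _embed(text: str) -> np.ndarray:
    result = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(result.data[0].embedding)

def cached_answer(query: str, threshold: float = 0.92):
    """Return a cached response for a semantically similar past query, else None."""
    q = _embed(query)
    for vec, response in _cache:
        similarity = float(np.dot(q, vec) / (np.linalg.norm(q) * np.linalg.norm(vec)))
        if similarity >= threshold:
            return response
    return None

def store_answer(query: str, response: str) -> None:
    _cache.append((_embed(query), response))
In production the linear scan is typically replaced by a vector index, and the threshold is tuned so that paraphrases hit the cache while genuinely different questions do not.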
GPU Serving at Scale
For self-hosted models, you need efficient inference servers. vLLM and TGI are the industry standards.
vLLM
PagedAttention for memory efficiency. High throughput batching. OpenAI-compatible API.
- 24x throughput vs HuggingFace
- Continuous batching
- Tensor parallelism
Text Generation Inference (TGI)
HuggingFace's production inference server. Flash Attention, quantization, watermarking (see the client sketch after this list).
- Easy HuggingFace integration
- Built-in metrics
- Speculative decoding
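Querying TGI from Python
A minimal client-side sketch, assuming a TGI server is already running locally on port 8080 and that huggingface_hub is installed; the URL and generation parameters are illustrative.
# Query a running TGI server with huggingface_hub's InferenceClient
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")

# One-shot generation
print(client.text_generation("Explain quantum computing", max_new_tokens=128))

# Streaming generation - tokens arrive as they are produced
for token in client.text_generation(
    "Explain quantum computing", max_new_tokens=128, stream=True
):
    print(token, end="", flush=True)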
vLLM Batch Inference
# vLLM serving for high throughput
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")
params = SamplingParams(temperature=0.7, max_tokens=256)

# Batch multiple requests
prompts = ["Question 1", "Question 2", "Question 3"]
outputs = llm.generate(prompts, params)  # Batched inference

for output in outputs:
    print(output.outputs[0].text)

vLLM as OpenAI-compatible Server
# Start vLLM server
# python -m vllm.entrypoints.openai.api_server \
#     --model meta-llama/Llama-2-7b-chat-hf \
#     --port 8000

# Use with OpenAI client
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Hello!"}]
)

Cost Optimization
Real-time doesn't have to mean expensive. These techniques reduce costs while maintaining speed.
Request Batching
Collect multiple requests and process them together. GPU utilization goes from 10% to 80%. Trade a small latency increase for major throughput gains.
Quantization
Reduce model precision from FP16 to INT8 or INT4. 2-4x memory reduction, 1.5-2x speed increase, minimal quality loss.
Model Routing
Use cheap models for simple queries and expensive models for complex ones. A classifier routes each request to the appropriate tier (see the sketch after this list).
Prompt Compression
Reduce input token count. Remove redundant instructions, compress context, use efficient prompting. Fewer tokens = lower cost + faster response.
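Model Routing Sketch
A minimal sketch of complexity-based routing, assuming an OpenAI-style API and using gpt-3.5-turbo and gpt-4 as stand-ins for the cheap and expensive tiers; the keyword heuristic is an illustrative placeholder for a trained router.
# Route simple queries to a cheap model, complex queries to a stronger one
from openai import OpenAI

client = OpenAI()

SMALL_MODEL = "gpt-3.5-turbo"  # cheap, fast tier (illustrative choice)
LARGE_MODEL = "gpt-4"          # expensive, capable tier (illustrative choice)

def classify_complexity(query: str) -> str:
    """Crude heuristic stand-in for a learned classifier."""
    long_query = len(query.split()) > 60
    hard_keywords = any(
        k in query.lower() for k in ("prove", "analyze", "compare", "step by step")
    )
    return "complex" if long_query or hard_keywords else "simple"

def route_and_answer(query: str) -> str:
    model = LARGE_MODEL if classify_complexity(query) == "complex" else SMALL_MODEL
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": query}],
    )
    return response.choices[0].message.content
In practice the heuristic is replaced by a small trained classifier or an embedding-based router, and low-confidence answers can be retried on the larger tier.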
Cost Comparison (per 1M tokens)
Production Architecture Patterns
Putting it all together: a production real-time AI system.
High-Throughput Architecture
User Request
      |
      v
[Load Balancer]
      |
      v
[Semantic Cache] --hit--> Return cached response
      |
     miss
      v
[Query Classifier] --simple--> Small Model (7B)
      |                              |
   complex                           v
      v                          [Response]
[Large Model (70B)]                  |
      |                              v
      v                        [Cache Update]
[Response]
      |
      v
[Cache Update]

Key insight: Real-time doesn't mean no latency. It means perceived low latency. Streaming, caching, and smart routing make the system feel instant even when models take seconds to generate.
Key Takeaways
1. Streaming hides latency - Users see tokens appearing, making responses feel instant even when generation takes seconds.
2. Semantic caching multiplies throughput - Similar queries return cached results. Huge win for common questions.
3. vLLM/TGI for self-hosted inference - 24x throughput improvement over naive HuggingFace inference.
4. Route by complexity - Simple queries to small models, complex to large. Best cost/quality tradeoff.
Level 4 Complete
You've completed the Advanced level. You now understand multi-modal RAG, agent architectures, video understanding, and production real-time systems.