Every LLM that runs on the Hailo-10H
Decode throughput, time-to-first-token, memory, and power for every LLM officially or community-supported on the Hailo-10H edge accelerator — side-by-side with Raspberry Pi 5 CPU and Jetson Orin Nano.
At a glance
LLM catalog
Decode tokens/sec at batch 1 on Hailo-10H. Status marks whether the model is officially supported by Hailo (Official), reproducible through community tools like hailo-ollama (Community), or shown only in Hailo's internal numbers without a shipped HEF (Demonstrated).
| Model | Vendor | Params | Quant | Decode | TTFT | KV ctx | Memory | Power | Status | Src |
|---|---|---|---|---|---|---|---|---|---|---|
| Phi-2 | Microsoft | 2.7B | INT4 | 19 tok/s | — | 2048 | 2.8 GB | — | Official | [F] |
| Llama 3 8B | Meta | 8.0B | INT4 | 11 tok/s | — | 8192 | 5.2 GB | 4.5 W | Official | [F] |
| Llama 2 7B | Meta | 7.0B | INT4 | 10 tok/s | — | 2048 | 4.8 GB | 5 W | Official | [P] |
| Qwen2-1.5B-Instruct | Alibaba | 1.5B | W4A8 group-wise | 9.45 tok/s | 289 ms | 2048 | 1.2 GB | 2.1 W | Official | [H] |
| Qwen2-1.5B (ollama Q4_0) | Alibaba | 1.5B | Q4_0 (GGUF) | 8.03 tok/s | — | 2048 | 1.3 GB | — | Community | [S] |
| Qwen2.5-Coder-1.5B | Alibaba | 1.5B | Q4_0 | 7.94 tok/s | — | 2048 | 1.3 GB | — | Community | [S] |
| DeepSeek-R1-Distill-Qwen-1.5B | DeepSeek | 1.5B | Q4_0 | 6.83 tok/s | — | 2048 | 1.3 GB | — | Community | [S] |
| Qwen2.5-1.5B-Instruct | Alibaba | 1.5B | Q4_0 | 6.76 tok/s | — | 2048 | 1.3 GB | — | Community | [S] |
| Qwen3-1.7B-Instruct | Alibaba | 1.7B | W4A8 group-wise | 4.78 tok/s | 620 ms | 2048 | 1.79 GB | — | Official | [HX] |
| Llama 3.2 3B | Meta | 3.0B | Q4_0 | 2.65 tok/s | — | 4096 | 2.4 GB | — | Community | [S] |
Dashes mean the source didn’t publish that metric. Memory numbers include weights + KV cache at the listed context length.
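The weights + KV cache split can be sanity-checked with a back-of-the-envelope estimate. A minimal sketch, assuming Qwen2-1.5B's published shape (28 layers, 2 KV heads via GQA, head dim 128; verify these against the model's config.json before relying on them):

```python
def llm_memory_gb(params_b, ctx, n_layers, n_kv_heads, head_dim,
                  w_bits=4, kv_bits=8):
    """Rough on-device footprint: quantized weights + KV cache."""
    weights = params_b * 1e9 * w_bits / 8                     # bytes for weights
    # KV cache: two tensors (K and V) per layer, per cached token
    kv = 2 * n_layers * n_kv_heads * head_dim * ctx * kv_bits / 8
    return (weights + kv) / 1e9

# Assumed Qwen2-1.5B shape at the table's 2048-token context:
print(round(llm_memory_gb(1.5, 2048, 28, 2, 128), 2))  # ≈ 0.78 GB
```

This lands around 0.78 GB, noticeably under the table's 1.2 GB, because group-wise scales, higher-precision embeddings, and runtime buffers add real overhead on top of the raw tensors.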
Quality benchmarks
Every model’s published FP16 score on standard academic benchmarks. INT4 quantization on Hailo-10H typically costs 1–2 points vs these baselines — good enough that the ranking rarely changes.
| Model | MMLU | GSM8K | MATH | HumanEval | MBPP | IFEval | ARC-C | GPQA | Src |
|---|---|---|---|---|---|---|---|---|---|
| Qwen3-0.6B (no HEF yet) | 52.8 | 59.6 | 32.4 | — | 36.6 | — | — | 26.8 | [Q3] |
| Qwen3-1.7B-Instruct | 62.6 | 75.4 | 43.5 | — | 55.4 | — | — | 28.3 | [Q3] |
| Qwen3-4B (no HEF yet) | 73.0 | 87.8 | 54.1 | — | 67.0 | — | — | 36.9 | [Q3] |
| Qwen2-1.5B-Instruct | 41.2 | 61.6 | 25.3 | 42.1 | 44.2 | 29.0 | — | 21.2 | [Q2] |
| Qwen2-1.5B (ollama Q4_0) | 41.2 | 61.6 | 25.3 | 42.1 | 44.2 | 29.0 | — | 21.2 | [Q2] |
| Qwen2.5-Coder-1.5B | — | — | — | 70.7 | 69.2 | — | — | — | [QC] |
| Qwen2.5-1.5B-Instruct | 50.7 | 73.2 | 55.2 | 61.6 | 63.2 | 42.5 | — | 29.8 | [Q25] |
| DeepSeek-R1-Distill-Qwen-1.5B | — | — | 83.9 | — | — | — | — | 33.8 | [DR] |
| Phi-2 | 56.7 | 61.1 | — | — | 59.1 | — | — | — | [P2] |
| Llama 3.2 3B | 63.4 | — | — | — | — | 77.4 | 78.6 | — | [L32] |
| Llama 2 7B | 45.3 | 14.6 | 2.5 | 12.8 | 20.8 | — | 45.9 | — | [L2] |
| Llama 3 8B | 66.6 | 79.6 | 30.0 | 61.6 | — | 69.2 | — | 34.2 | [L3] |
Benchmark score sources
- [Q2] Qwen2 paper / Qwen2.5 blog comparison table
- [Q25] Qwen2.5 LLM blog — Qwen2.5-1.5B-Instruct table
- [QC] Qwen2.5-Coder Technical Report (Table 16)
- [DR] DeepSeek-R1 paper Table 5 — distill models
- [P2] Microsoft Research — Phi-2 blog
- [L32] Meta — Llama 3.2 release
- [L3] Meta — Llama 3 8B Instruct card
- [L2] Llama 2 paper (Touvron et al.)
- [Q3] Qwen3 Technical Report (May 2025, base models)
- [HX] Hailo Model Explorer — Qwen3-1.7B-Instruct page
Quality × throughput cheat-sheet
The honest Pareto frontier: higher is better in every numeric column.
| Model | Decode tok/s | MMLU | HumanEval | GSM8K | When to pick |
|---|---|---|---|---|---|
| Phi-2 | 19 | 56.7 | — | 61.1 | Max tok/s with decent quality. Best default. |
| Llama 3 8B | 11 | 66.6 | 61.6 | 79.6 | Highest quality that fits. Use if you can spare 5 GB. |
| Qwen2.5-1.5B-Instruct | 6.8 | 50.7 | 61.6 | 73.2 | Small model with strong GSM8K. Tool-calling. |
| Qwen2.5-Coder-1.5B | 7.9 | — | 70.7 | — | On-device coding assistant. Small footprint. |
| DeepSeek-R1-Distill-Qwen-1.5B | 6.8 | — | — | — | 83.9 MATH-500 in 1.3 GB — shockingly good at math. |
| Llama 3.2 3B | 2.65 | 63.4 | — | — | Better knowledge than 1.5B models, but 3× slower. |
Quality scores are FP16 baselines from each model’s technical report. Throughput is INT4 decode on Hailo-10H. Choose by the column that matches your application.
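If you'd rather pick programmatically, the cheat-sheet reduces to a small filter. A sketch with a hypothetical `pick()` helper over a hand-copied subset of the table (models without a published MMLU score are skipped here):

```python
# (model, decode tok/s on Hailo-10H, memory GB, FP16 MMLU)
CATALOG = [
    ("Phi-2",                  19.0,  2.8, 56.7),
    ("Llama 3 8B",             11.0,  5.2, 66.6),
    ("Qwen2.5-1.5B-Instruct",   6.8,  1.3, 50.7),
    ("Llama 3.2 3B",            2.65, 2.4, 63.4),
]

def pick(max_mem_gb, min_tok_s):
    """Best-MMLU model that fits the memory budget and speed floor."""
    fits = [m for m in CATALOG
            if m[2] <= max_mem_gb and m[1] >= min_tok_s]
    return max(fits, key=lambda m: m[3])[0] if fits else None

print(pick(3.0, 5.0))    # under 3 GB at >= 5 tok/s -> Phi-2
print(pick(6.0, 10.0))   # more memory, >= 10 tok/s -> Llama 3 8B
```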
Pick a model by use case
What each supported model is actually good for on Hailo-10H.
| Model | Status | Good for |
|---|---|---|
| Qwen3-1.7B-Instruct | Official | Newest Qwen family; strongest small-model reasoning |
| Qwen2-1.5B-Instruct | Official | General chat; Hailo reference model |
| Qwen2-1.5B (ollama Q4_0) | Community | hailo-ollama pipeline |
| Qwen2.5-Coder-1.5B | Community | Inline code completion, small refactors |
| Qwen2.5-1.5B-Instruct | Community | Function-calling, tool use |
| DeepSeek-R1-Distill-Qwen-1.5B | Community | Reasoning traces, math |
| Phi-2 | Official | Fast general chat; highest tok/s on this chip |
| Llama 3.2 3B | Community | Chat with a larger knowledge base |
| Llama 2 7B | Official | Hailo launch reference; 10 tok/s under 5 W |
| Llama 3 8B | Official | Best-quality model that still fits |
vs other edge platforms
All platforms run Qwen2-1.5B (INT4 or Q4_0) at the same prompt length. The Hailo-10H's edge isn't raw tokens/sec; it's the power envelope. Jetson is faster but draws roughly 5× the power.
| Platform | Model | tok/s | Power | tok/s/W | Cost | Notes |
|---|---|---|---|---|---|---|
| Hailo-10H | Qwen2-1.5B INT4 | 9.45 | 2.1 W (NPU) | 4.50 | ~$170 (M.2 module) | Hailo reference number |
| Raspberry Pi 5 (CPU only) | Qwen2-1.5B Q4_0 | 5.5 | ~8 W (SoC) | ~0.7 | $80 (Pi 5 8GB) | llama.cpp on 4× Cortex-A76 |
| Jetson Orin Nano 8GB | Qwen2-1.5B INT4 | 25 | ~10 W | ~2.5 | $250 | GPU decode, higher throughput, 5× the power |
| Apple M2 Pro (MLX) | Qwen2-1.5B 4-bit | 95 | ~15 W | ~6.3 | N/A (laptop) | Not an edge device, shown for scale |
The honest picture
Independent testing (CNX Software, schwab.sh) has found that on some prompts, the Raspberry Pi 5's own CPU comes within striking distance of the Hailo-10H for pure decode throughput. The Hailo-10H wins on sustained performance, TTFT, and power, not on peak single-prompt speed. If you care about battery life or multi-stream always-on LLM inference, it's the right pick. If you just want the fastest tok/s off a Pi, try the CPU first.
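The power argument is easiest to see as energy per token. A quick sketch from the table's numbers (the non-Hailo power figures are approximate, so treat the results as order-of-magnitude):

```python
# (decode tok/s, watts) from the platform table above
platforms = {
    "Hailo-10H":        (9.45, 2.1),
    "Raspberry Pi 5":   (5.5,  8.0),
    "Jetson Orin Nano": (25.0, 10.0),
}

def efficiency(tok_s, watts):
    """Tokens per second per watt, i.e. tokens per joule."""
    return tok_s / watts

for name, (tok_s, watts) in platforms.items():
    print(f"{name}: {efficiency(tok_s, watts):.2f} tok/s/W, "
          f"{1000 * watts / tok_s:.0f} mJ/token")
```

Jetson decodes faster in absolute terms, but per joule the Hailo-10H comes out well ahead, which is the metric that matters for battery-powered or always-on deployments.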
Quantization stack
Every LLM on Hailo-10H goes through the same quantization pipeline. Understanding this helps explain the accuracy/speed trade-offs vs a vanilla FP16 or Q4_K_M build.
| Tensor | Method | Why |
|---|---|---|
| Weights | Static, 4-bit symmetric, group-wise (GPTQ / QuaROT) | Biggest memory win — 4× smaller than FP16. Group-wise keeps accuracy within 1-2% of FP16 on MMLU. |
| Activations | Static, 8-bit asymmetric, per-tensor | Matches Hailo’s INT8 NN core natively — no runtime scaling overhead. |
| KV cache | Static, 8-bit asymmetric, per-tensor | KV cache dominates memory at long context. 8-bit halves it vs FP16 with negligible quality loss. |
Hailo’s compiler builds on GPTQ and QuaROT to produce the INT4/INT8 mix above. Accuracy delta vs FP16 is typically 1-2% on standard benchmarks (MMLU, ARC, HellaSwag) for models in the 1.5B-8B range.
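The weight row is the interesting one. A minimal NumPy sketch of symmetric 4-bit group-wise quantization, without the GPTQ error compensation or QuaROT rotations the real compiler adds (real schemes also group along the input dimension per output channel, rather than over a flat vector as here):

```python
import numpy as np

def quantize_groupwise_int4(w, group_size=128):
    """Symmetric 4-bit quantization with one scale per group."""
    w = w.reshape(-1, group_size)
    # One shared scale per group; max magnitude maps to the int4 extreme 7
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)
q, s = quantize_groupwise_int4(w)
err = np.abs(dequantize(q, s) - w).max()
print(f"max abs error: {err:.4f}")  # bounded by half a group's scale
```

Smaller groups mean tighter scales and lower error, at the cost of storing more scale values; group-wise is the middle ground between per-tensor (cheap, lossy) and per-element (pointless).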
Deployment paths
Three ways to actually run one of these models on Hailo-10H today.
1. Hailo HEF + HailoRT
Official path. Download the pre-compiled HEF from Hailo’s Model Zoo, load it with HailoRT, run inference in Python or C++. Lowest-level, fastest.
Best for production embedded devices.
2. hailo-ollama
Community Ollama fork that routes GGUF Q4_0 models through the Hailo instead of the CPU, so any Q4_0 model in the Ollama library becomes a Hailo target. Slight performance tax vs a native HEF.
Best for prototyping on Raspberry Pi 5.
3. Compile your own
Start from a PyTorch or ONNX checkpoint, run through the Hailo Dataflow Compiler with a calibration set, get a custom HEF. Longest path but unlocks any model that fits the 8 GB budget.
Best for custom fine-tunes or distilled variants.
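For path 2, hailo-ollama is assumed to keep stock Ollama's REST schema, so a plain `/api/generate` call works. A sketch using only the standard library (the model tag and port are illustrative; check your install):

```python
import json
import urllib.request

def build_request(prompt, model="qwen2:1.5b"):
    # stream=False asks for one JSON reply instead of chunked lines
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt, model="qwen2:1.5b", host="http://localhost:11434"):
    """Send a generate request to a running (hailo-)ollama server."""
    data = json.dumps(build_request(prompt, model)).encode()
    req = urllib.request.Request(
        f"{host}/api/generate", data=data,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Because the schema is unchanged, existing Ollama clients and tooling should work against a hailo-ollama endpoint without modification.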
Sources
Missing a model you care about?
This page tracks every LLM with a published Hailo-10H benchmark. If you’ve run a model that isn’t listed — or have more recent numbers — tell us and we’ll add it.
Submit a benchmark

Last updated April 2026. Benchmarks collected from cited sources — not independently re-run by CodeSOTA. Numbers change as Hailo ships new SDK releases.