Hailo-10H · 40 TOPS INT4 · ~2.5 W typical

Every LLM that runs on Hailo-10H

Decode throughput, time-to-first-token, memory, and power for every LLM officially or community-supported on the Hailo-10H edge accelerator — side-by-side with Raspberry Pi 5 CPU and Jetson Orin Nano.

At a glance

- 10 LLMs benchmarked
- Fastest: Phi-2 at 19 tok/s
- Largest that fits: Llama 3 8B (8.0B params)
- Power efficiency: ~4.5 tok/s/W (Qwen2-1.5B)

LLM catalog

Decode tokens/sec at batch 1 on Hailo-10H. Status marks whether the model is officially supported by Hailo (Official), reproducible through community tools like hailo-ollama (Community), or shown in internal Hailo numbers without a shipped HEF yet (Demonstrated).

| Model | Vendor | Params | Quant | Decode | TTFT | KV ctx | Memory | Power | Status | Src |
|---|---|---|---|---|---|---|---|---|---|---|
| Phi-2 | Microsoft | 2.7B | INT4 | 19 tok/s | — | 2048 | 2.8 GB | — | Official | [F] |
| Llama 3 8B | Meta | 8.0B | INT4 | 11 tok/s | — | 8192 | 5.2 GB | 4.5 W | Official | [F] |
| Llama 2 7B | Meta | 7.0B | INT4 | 10 tok/s | — | 2048 | 4.8 GB | 5 W | Official | [P] |
| Qwen2-1.5B-Instruct | Alibaba | 1.5B | W4A8 group-wise | 9.45 tok/s | 289 ms | 2048 | 1.2 GB | 2.1 W | Official | [H] |
| Qwen2-1.5B (ollama Q4_0) | Alibaba | 1.5B | Q4_0 (GGUF) | 8.03 tok/s | — | 2048 | 1.3 GB | — | Community | [S] |
| Qwen2.5-Coder-1.5B | Alibaba | 1.5B | Q4_0 | 7.94 tok/s | — | 2048 | 1.3 GB | — | Community | [S] |
| DeepSeek-R1-Distill-Qwen-1.5B | DeepSeek | 1.5B | Q4_0 | 6.83 tok/s | — | 2048 | 1.3 GB | — | Community | [S] |
| Qwen2.5-1.5B-Instruct | Alibaba | 1.5B | Q4_0 | 6.76 tok/s | — | 2048 | 1.3 GB | — | Community | [S] |
| Qwen3-1.7B-Instruct | Alibaba | 1.7B | A8W4 group-wise | 4.78 tok/s | 620 ms | 2048 | 1.79 GB | — | Official | [HX] |
| Llama 3.2 3B | Meta | 3.0B | Q4_0 | 2.65 tok/s | — | 4096 | 2.4 GB | — | Community | [S] |

Dashes mean the source didn’t publish that metric. Memory numbers include weights + KV cache at the listed context length.
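The memory column can be sanity-checked with a back-of-envelope model: INT4 weights take about half a byte per parameter, and an INT8 KV cache takes 2 · layers · KV heads · head dim · context bytes. A minimal sketch, using Qwen2-1.5B's published config (28 layers, 2 KV heads, head dim 128); treat the result as an estimate, not a measured figure:

```python
def llm_memory_gb(params_b, ctx, n_layers, n_kv_heads, head_dim,
                  w_bits=4, kv_bits=8):
    """Rough weights + KV-cache footprint in GB (decimal)."""
    weights = params_b * 1e9 * w_bits / 8              # quantized weight bytes
    # K and V: one entry per layer, per KV head, per head dim, per position
    kv = 2 * n_layers * n_kv_heads * head_dim * ctx * kv_bits / 8
    return (weights + kv) / 1e9

# Qwen2-1.5B at 2048 context
est = llm_memory_gb(1.54, 2048, 28, 2, 128)
print(f"{est:.2f} GB")   # ≈ 0.80 GB
```

The estimate lands below the 1.2 GB in the table, which also covers embeddings kept at higher precision and runtime buffers; the point is the scaling, not the exact figure.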

Quality benchmarks

Every model’s published FP16 score on standard academic benchmarks. INT4 quantization on Hailo-10H typically costs 1–2 points vs these baselines — good enough that the ranking rarely changes.

| Model | MMLU | GSM8K | MATH | HumanEval | MBPP | IFEval | ARC-C | GPQA | Src |
|---|---|---|---|---|---|---|---|---|---|
| Qwen3-0.6B (no HEF) | 52.8 | 59.6 | 32.4 | 36.6 | — | — | — | 26.8 | [Q3] |
| Qwen3-1.7B-Instruct | 62.6 | 75.4 | 43.5 | 55.4 | — | — | — | 28.3 | [Q3] |
| Qwen3-4B (no HEF yet) | **73.0** | **87.8** | 54.1 | 67.0 | — | — | — | **36.9** | [Q3] |
| Qwen2-1.5B-Instruct | 41.2 | 61.6 | 25.3 | 42.1 | 44.2 | 29.0 | — | 21.2 | [Q2] |
| Qwen2-1.5B (ollama Q4_0) | 41.2 | 61.6 | 25.3 | 42.1 | 44.2 | 29.0 | — | 21.2 | [Q2] |
| Qwen2.5-Coder-1.5B | — | — | — | **70.7** | **69.2** | — | — | — | [QC] |
| Qwen2.5-1.5B-Instruct | 50.7 | 73.2 | 55.2 | 61.6 | 63.2 | 42.5 | — | 29.8 | [Q25] |
| DeepSeek-R1-Distill-Qwen-1.5B | — | — | **83.9** (MATH-500) | — | — | — | — | 33.8 | [DR] |
| Phi-2 | 56.7 | 61.1 | — | 59.1 | — | — | — | — | [P2] |
| Llama 3.2 3B | 63.4 | — | — | — | — | **77.4** | **78.6** | — | [L32] |
| Llama 2 7B | 45.3 | 14.6 | 2.5 | 12.8 | 20.8 | — | 45.9 | — | [L2] |
| Llama 3 8B | 66.6 | 79.6 | 30.0 | 61.6 | **69.2** | — | — | 34.2 | [L3] |

**Bold** = best in column · — = not reported
- Best general knowledge: Llama 3 8B (66.6 MMLU · 5.2 GB memory)
- Best reasoning / math: DeepSeek-R1-Distill-Qwen-1.5B (83.9 MATH-500 · 1.3 GB memory)
- Best code: Qwen2.5-Coder-1.5B (70.7 HumanEval · 1.3 GB memory)
- Best instruction following: Llama 3.2 3B (77.4 IFEval · 2.4 GB memory)
Benchmark score sources
  1. [Q2] Qwen2 paper / Qwen2.5 blog comparison table
  2. [Q25] Qwen2.5 LLM blog — Qwen2.5-1.5B-Instruct table
  3. [QC] Qwen2.5-Coder Technical Report (Table 16)
  4. [DR] DeepSeek-R1 paper Table 5 — distill models
  5. [P2] Microsoft Research — Phi-2 blog
  6. [L32] Meta — Llama 3.2 release
  7. [L3] Meta — Llama 3 8B Instruct card
  8. [L2] Llama 2 paper (Touvron et al.)
  9. [Q3] Qwen3 Technical Report (May 2025, base models)
  10. [HX] Hailo Model Explorer — Qwen3-1.7B-Instruct page

Quality × throughput cheat-sheet

The honest Pareto frontier: higher is better on both axes.

| Model | Decode tok/s | MMLU | HumanEval | GSM8K | When to pick |
|---|---|---|---|---|---|
| Phi-2 | 19 | 56.7 | — | 61.1 | Max tok/s with decent quality. Best default. |
| Llama 3 8B | 11 | 66.6 | 61.6 | 79.6 | Highest quality that fits. Use if you can spare 5 GB. |
| Qwen2.5-1.5B-Instruct | 6.8 | 50.7 | 61.6 | 73.2 | Small model with strong GSM8K. Tool-calling. |
| Qwen2.5-Coder-1.5B | 7.9 | — | 70.7 | — | On-device coding assistant. Small footprint. |
| DeepSeek-R1-Distill-Qwen-1.5B | 6.8 | — | — | — | 83.9 MATH-500 in 1.3 GB; shockingly good at math. |
| Llama 3.2 3B | 2.65 | 63.4 | — | — | Better knowledge than 1.5B models, but 3× slower. |

Quality scores are FP16 baselines from each model’s technical report. Throughput is INT4 decode on Hailo-10H. Choose by the column that matches your application.
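The Pareto claim can be made concrete: a model sits on the frontier if no other listed model beats it on both decode speed and the quality metric you care about. A small sketch using MMLU as the quality axis, with decode and MMLU values copied from the tables above (models without a published MMLU score are skipped):

```python
# (decode tok/s, MMLU) pairs from the tables above
models = {
    "Phi-2": (19.0, 56.7),
    "Llama 3 8B": (11.0, 66.6),
    "Qwen2.5-1.5B-Instruct": (6.76, 50.7),
    "Llama 3.2 3B": (2.65, 63.4),
}

def pareto(points):
    """Keep entries not strictly dominated on both axes (higher = better)."""
    return {
        name for name, (x, y) in points.items()
        if not any(ox > x and oy > y for ox, oy in points.values())
    }

print(sorted(pareto(models)))   # ['Llama 3 8B', 'Phi-2']
```

Swap the MMLU column for HumanEval or GSM8K and the frontier shifts, which is exactly why the cheat-sheet keeps multiple quality columns.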

Pick a model by use case

What each supported model is actually good for on Hailo-10H.

- Qwen3-1.7B-Instruct (Official): newest Qwen family; strongest small-model reasoning. 4.78 t/s · 1.79 GB · 1.7B params
- Qwen2-1.5B-Instruct (Official): general chat, Hailo reference model. 9.45 t/s · 1.2 GB · 1.5B params
- Qwen2-1.5B, ollama Q4_0 (Community): hailo-ollama pipeline. 8.03 t/s · 1.3 GB · 1.5B params
- Qwen2.5-Coder-1.5B (Community): inline code completion, small refactors. 7.94 t/s · 1.3 GB · 1.5B params
- Qwen2.5-1.5B-Instruct (Community): function-calling, tool use. 6.76 t/s · 1.3 GB · 1.5B params
- DeepSeek-R1-Distill-Qwen-1.5B (Community): reasoning traces, math. 6.83 t/s · 1.3 GB · 1.5B params
- Phi-2 (Official): fast general chat, strongest tok/s on this chip. 19 t/s · 2.8 GB · 2.7B params
- Llama 3.2 3B (Community): chat with a larger knowledge base. 2.65 t/s · 2.4 GB · 3.0B params
- Llama 2 7B (Official): Hailo launch reference, 10 tok/s under 5 W. 10 t/s · 4.8 GB · 7.0B params
- Llama 3 8B (Official): best-quality model that still fits. 11 t/s · 5.2 GB · 8.0B params

vs other edge platforms

Qwen2-1.5B INT4/Q4_0 on the same prompt length. Hailo-10H’s edge isn’t raw tokens/sec — it’s the power envelope. Jetson is faster but draws 5× the power.

| Platform | Model | tok/s | Power | tok/s/W | Cost | Notes |
|---|---|---|---|---|---|---|
| Hailo-10H | Qwen2-1.5B INT4 | 9.45 | 2.1 W (NPU) | 4.50 | ~$170 (M.2 module) | Hailo reference number |
| Raspberry Pi 5 (CPU only) | Qwen2-1.5B Q4_0 | 5.5 | ~8 W (SoC) | — | $80 (Pi 5 8GB) | llama.cpp on 4× Cortex-A76 |
| Jetson Orin Nano 8GB | Qwen2-1.5B INT4 | 25 | ~10 W | — | $250 | GPU decode, higher throughput, 5× the power |
| Apple M2 Pro (MLX) | Qwen2-1.5B 4-bit | 95 | ~15 W | — | N/A (laptop) | Not an edge device, shown for scale |
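The missing tok/s/W cells are just throughput divided by power draw. Since the wattages above are rough (~) figures, the derived values are rough too; a quick sketch:

```python
# (tok/s, approx. watts) from the platform table above
platforms = {
    "Hailo-10H":             (9.45, 2.1),
    "Raspberry Pi 5 (CPU)":  (5.5,  8.0),
    "Jetson Orin Nano 8GB":  (25.0, 10.0),
    "Apple M2 Pro (MLX)":    (95.0, 15.0),
}

# Hailo-10H should reproduce the 4.50 figure published in the table
for name, (tps, watts) in platforms.items():
    print(f"{name:22s} ~{tps / watts:.2f} tok/s/W")
```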

The honest picture

Independent testing (CNX Software, schwab.sh) has found that on some prompts, Raspberry Pi 5’s own CPU is within striking distance of Hailo-10H for pure decode throughput. Hailo-10H wins on sustained performance, TTFT, and power, not on peak single-prompt speed. If you care about battery life or multi-stream always-on LLM inference, it’s the right pick. If you just want the fastest tok/s off a Pi, try the CPU first.

Quantization stack

Every LLM on Hailo-10H goes through the same quantization pipeline. Understanding this helps explain the accuracy/speed trade-offs vs a vanilla FP16 or Q4_K_M build.

| Tensor | Method | Why |
|---|---|---|
| Weights | Static, 4-bit symmetric, group-wise (GPTQ / QuaRot) | Biggest memory win: 4× smaller than FP16. Group-wise scales keep accuracy within 1-2% of FP16 on MMLU. |
| Activations | Static, 8-bit asymmetric, per-tensor | Matches Hailo’s INT8 NN core natively, with no runtime scaling overhead. |
| KV cache | Static, 8-bit asymmetric, per-tensor | KV cache dominates memory at long context. 8-bit halves it vs FP16 with negligible quality loss. |

Hailo’s compiler builds on GPTQ and QuaRot to produce the INT4/INT8 mix above. The accuracy delta vs FP16 is typically 1-2% on standard benchmarks (MMLU, ARC, HellaSwag) for models in the 1.5B-8B range.
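Group-wise symmetric 4-bit quantization is easy to sketch: each group of weights shares one floating-point scale, and values round to integers in [-8, 7]. This is only the round-to-nearest core (GPTQ adds error compensation and QuaRot adds rotations on top), but it shows where both the 4× memory saving and the small accuracy loss come from:

```python
def quantize_group(weights, group_size=128):
    """Symmetric 4-bit round-to-nearest with one scale per group."""
    out = []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        scale = max(abs(w) for w in group) / 7 or 1.0   # map ±max to ±7
        q = [max(-8, min(7, round(w / scale))) for w in group]
        out.append((scale, q))
    return out

def dequantize(groups):
    return [scale * q for scale, qs in groups for q in qs]

w = [0.12, -0.40, 0.03, 0.25]
restored = dequantize(quantize_group(w, group_size=4))
# restored is close to w, but each weight carries a small rounding error;
# accumulated over billions of weights this is the 1-2 point benchmark cost
```

Smaller groups mean more scales (more memory) but tighter error bounds, which is the knob the compiler tunes per layer.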

Deployment paths

Three ways to actually run one of these models on Hailo-10H today.

1. Hailo HEF + HailoRT

Official path. Download the pre-compiled HEF from Hailo’s Model Zoo, load it with HailoRT, run inference in Python or C++. Lowest-level, fastest.

Best for production embedded devices.

2. hailo-ollama

Community Ollama fork that routes GGUF Q4_0 models through Hailo instead of CPU. Every Ollama model becomes a Hailo target. Slight performance tax vs native HEF.

Best for prototyping on Raspberry Pi 5.
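Because hailo-ollama keeps Ollama’s HTTP interface, measuring decode throughput yourself is one POST away. A stdlib-only sketch, assuming the server runs on Ollama’s default port 11434 and returns Ollama’s usual `eval_count` / `eval_duration` (nanoseconds) fields; adjust if the fork differs:

```python
import json
import urllib.request

def tok_per_s(eval_count, eval_duration_ns):
    """Decode throughput from Ollama's response counters."""
    return eval_count / eval_duration_ns * 1e9

def decode_tok_s(model, prompt, host="http://localhost:11434"):
    """Request a completion and report decode tokens/sec."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    req = urllib.request.Request(
        f"{host}/api/generate", data=body.encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
    return tok_per_s(reply["eval_count"], reply["eval_duration"])

# With a running server:
# decode_tok_s("qwen2:1.5b", "Explain KV caches in one sentence.")
```

The model tag (`qwen2:1.5b`) is illustrative; use whatever tag your hailo-ollama install lists.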

3. Compile your own

Start from a PyTorch or ONNX checkpoint, run through the Hailo Dataflow Compiler with a calibration set, get a custom HEF. Longest path but unlocks any model that fits the 8 GB budget.

Best for custom fine-tunes or distilled variants.

Sources

  1. [H] Hailo blog — Qwen2-1.5B reference deployment
  2. [S] schwab.sh — hailo-ollama benchmarks on Pi 5 + AI HAT+ 2, Jan 2026
  3. [F] faceofit.com — Pi AI HAT+ 2 compatibility + Llama 3 / Phi-2 numbers
  4. [P] Hailo press — Llama2-7B at up to 10 tok/s under 5 W

Missing a model you care about?

This page tracks every LLM with a published Hailo-10H benchmark. If you’ve run a model that isn’t listed — or have more recent numbers — tell us and we’ll add it.

Submit a benchmark

Last updated April 2026. Benchmarks collected from cited sources — not independently re-run by CodeSOTA. Numbers change as Hailo ships new SDK releases.