Every LLM that runs on the Hailo-10H
Decode throughput, time-to-first-token, memory, and power for every LLM officially or community-supported on the Hailo-10H edge accelerator — side-by-side with Raspberry Pi 5 CPU and Jetson Orin Nano.
At a glance
LLM catalog
Decode tokens/sec at batch 1 on Hailo-10H. Status marks whether the model is officially supported by Hailo (Official), reproducible through community tools like hailo-ollama (Community), or shown only in Hailo's internal numbers without a shipped HEF (Demonstrated).
| Model | Vendor | Params | Quant | Decode | TTFT | KV ctx | Memory | Power | Status | Src |
|---|---|---|---|---|---|---|---|---|---|---|
| Phi-2 | Microsoft | 2.7B | INT4 | 19 tok/s | — | 2048 | 2.8 GB | — | Official | [F] |
| Llama 3 8B | Meta | 8.0B | INT4 | 11 tok/s | — | 8192 | 5.2 GB | 4.5 W | Official | [F] |
| Llama 2 7B | Meta | 7.0B | INT4 | 10 tok/s | — | 2048 | 4.8 GB | 5 W | Official | [P] |
| Qwen2-1.5B-Instruct | Alibaba | 1.5B | W4A8 group-wise | 9.45 tok/s | 289 ms | 2048 | 1.2 GB | 2.1 W | Official | [H] |
| Qwen2-1.5B (ollama Q4_0) | Alibaba | 1.5B | Q4_0 (GGUF) | 8.03 tok/s | — | 2048 | 1.3 GB | — | Community | [S] |
| Qwen2.5-Coder-1.5B | Alibaba | 1.5B | Q4_0 | 7.94 tok/s | — | 2048 | 1.3 GB | — | Community | [S] |
| DeepSeek-R1-Distill-Qwen-1.5B | DeepSeek | 1.5B | Q4_0 | 6.83 tok/s | — | 2048 | 1.3 GB | — | Community | [S] |
| Qwen2.5-1.5B-Instruct | Alibaba | 1.5B | Q4_0 | 6.76 tok/s | — | 2048 | 1.3 GB | — | Community | [S] |
| Qwen3-1.7B-Instruct | Alibaba | 1.7B | W4A8 group-wise | 4.78 tok/s | 620 ms | 2048 | 1.79 GB | — | Official | [HX] |
| Llama 3.2 3B | Meta | 3.0B | Q4_0 | 2.65 tok/s | — | 4096 | 2.4 GB | — | Community | [S] |
Dashes mean the source didn’t publish that metric. Memory numbers include weights + KV cache at the listed context length.
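The weights + KV cache split can be sanity-checked with a back-of-the-envelope estimate. A minimal sketch, assuming Qwen2-1.5B's published shape (28 layers, 2 KV heads via GQA, head dim 128; verify these against the model's config.json before relying on them):

```python
def llm_memory_gb(params_b, ctx, n_layers, n_kv_heads, head_dim,
                  w_bits=4, kv_bits=8):
    """Rough on-device footprint: quantized weights + KV cache."""
    weights = params_b * 1e9 * w_bits / 8                     # bytes for weights
    # KV cache: two tensors (K and V) per layer, per cached token
    kv = 2 * n_layers * n_kv_heads * head_dim * ctx * kv_bits / 8
    return (weights + kv) / 1e9

# Assumed Qwen2-1.5B shape at the table's 2048-token context:
print(round(llm_memory_gb(1.5, 2048, 28, 2, 128), 2))  # ≈ 0.78 GB
```

This lands around 0.78 GB, noticeably under the table's 1.2 GB, because group-wise scales, higher-precision embeddings, and runtime buffers add real overhead on top of the raw tensors.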
Quality benchmarks
Every model’s published FP16 score on standard academic benchmarks. INT4 quantization on Hailo-10H typically costs 1–2 points vs these baselines — good enough that the ranking rarely changes.
| Model | MMLU | GSM8K | MATH | HumanEval | MBPP | IFEval | ARC-C | GPQA | Src |
|---|---|---|---|---|---|---|---|---|---|
| Qwen3-0.6B (no HEF yet) | 52.8 | 59.6 | 32.4 | — | 36.6 | — | — | 26.8 | [Q3] |
| Qwen3-1.7B-Instruct | 62.6 | 75.4 | 43.5 | — | 55.4 | — | — | 28.3 | [Q3] |
| Qwen3-4B (no HEF yet) | 73.0 | 87.8 | 54.1 | — | 67.0 | — | — | 36.9 | [Q3] |
| Qwen2-1.5B-Instruct | 41.2 | 61.6 | 25.3 | 42.1 | 44.2 | 29.0 | — | 21.2 | [Q2] |
| Qwen2-1.5B (ollama Q4_0) | 41.2 | 61.6 | 25.3 | 42.1 | 44.2 | 29.0 | — | 21.2 | [Q2] |
| Qwen2.5-Coder-1.5B | — | — | — | 70.7 | 69.2 | — | — | — | [QC] |
| Qwen2.5-1.5B-Instruct | 50.7 | 73.2 | 55.2 | 61.6 | 63.2 | 42.5 | — | 29.8 | [Q25] |
| DeepSeek-R1-Distill-Qwen-1.5B | — | — | 83.9 | — | — | — | — | 33.8 | [DR] |
| Phi-2 | 56.7 | 61.1 | — | — | 59.1 | — | — | — | [P2] |
| Llama 3.2 3B | 63.4 | — | — | — | — | 77.4 | 78.6 | — | [L32] |
| Llama 2 7B | 45.3 | 14.6 | 2.5 | 12.8 | 20.8 | — | 45.9 | — | [L2] |
| Llama 3 8B | 66.6 | 79.6 | 30.0 | 61.6 | — | 69.2 | — | 34.2 | [L3] |
Benchmark score sources
- [Q2] Qwen2 paper / Qwen2.5 blog comparison table
- [Q25] Qwen2.5 LLM blog — Qwen2.5-1.5B-Instruct table
- [QC] Qwen2.5-Coder Technical Report (Table 16)
- [DR] DeepSeek-R1 paper Table 5 — distill models
- [P2] Microsoft Research — Phi-2 blog
- [L32] Meta — Llama 3.2 release
- [L3] Meta — Llama 3 8B Instruct card
- [L2] Llama 2 paper (Touvron et al.)
- [Q3] Qwen3 Technical Report (May 2025, base models)
- [HX] Hailo Model Explorer — Qwen3-1.7B-Instruct page
Quality × throughput cheat-sheet
The honest Pareto frontier: higher is better in every numeric column.
| Model | Decode tok/s | MMLU | HumanEval | GSM8K | When to pick |
|---|---|---|---|---|---|
| Phi-2 | 19 | 56.7 | — | 61.1 | Max tok/s with decent quality. Best default. |
| Llama 3 8B | 11 | 66.6 | 61.6 | 79.6 | Highest quality that fits. Use if you can spare 5 GB. |
| Qwen2.5-1.5B-Instruct | 6.8 | 50.7 | 61.6 | 73.2 | Small model with strong GSM8K. Tool-calling. |
| Qwen2.5-Coder-1.5B | 7.9 | — | 70.7 | — | On-device coding assistant. Small footprint. |
| DeepSeek-R1-Distill-Qwen-1.5B | 6.8 | — | — | — | 83.9 MATH-500 in 1.3 GB — shockingly good at math. |
| Llama 3.2 3B | 2.65 | 63.4 | — | — | Better knowledge than 1.5B models, but 3× slower. |
Quality scores are FP16 baselines from each model’s technical report. Throughput is INT4 decode on Hailo-10H. Choose by the column that matches your application.
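If you'd rather pick programmatically, the cheat-sheet reduces to a small filter. A sketch with a hypothetical `pick()` helper over a hand-copied subset of the table (models without a published MMLU score are skipped here):

```python
# (model, decode tok/s on Hailo-10H, memory GB, FP16 MMLU)
CATALOG = [
    ("Phi-2",                  19.0,  2.8, 56.7),
    ("Llama 3 8B",             11.0,  5.2, 66.6),
    ("Qwen2.5-1.5B-Instruct",   6.8,  1.3, 50.7),
    ("Llama 3.2 3B",            2.65, 2.4, 63.4),
]

def pick(max_mem_gb, min_tok_s):
    """Best-MMLU model that fits the memory budget and speed floor."""
    fits = [m for m in CATALOG
            if m[2] <= max_mem_gb and m[1] >= min_tok_s]
    return max(fits, key=lambda m: m[3])[0] if fits else None

print(pick(3.0, 5.0))    # under 3 GB at >= 5 tok/s -> Phi-2
print(pick(6.0, 10.0))   # more memory, >= 10 tok/s -> Llama 3 8B
```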
Pick a model by use case
What each supported model is actually good for on Hailo-10H.
| Model | Status | Good for |
|---|---|---|
| Qwen3-1.7B-Instruct | Official | Newest Qwen family; strongest small-model reasoning |
| Qwen2-1.5B-Instruct | Official | General chat; Hailo reference model |
| Qwen2-1.5B (ollama Q4_0) | Community | hailo-ollama pipeline |
| Qwen2.5-Coder-1.5B | Community | Inline code completion, small refactors |
| Qwen2.5-1.5B-Instruct | Community | Function-calling, tool use |
| DeepSeek-R1-Distill-Qwen-1.5B | Community | Reasoning traces, math |
| Phi-2 | Official | Fast general chat; highest tok/s on this chip |
| Llama 3.2 3B | Community | Chat with a larger knowledge base |
| Llama 2 7B | Official | Hailo launch reference; 10 tok/s under 5 W |
| Llama 3 8B | Official | Best-quality model that still fits |
vs other edge platforms
All platforms run Qwen2-1.5B (INT4 or Q4_0) at the same prompt length. The Hailo-10H's edge isn't raw tokens/sec; it's the power envelope. Jetson is faster but draws roughly 5× the power.
| Platform | Model | tok/s | Power | tok/s/W | Cost | Notes |
|---|---|---|---|---|---|---|
| Hailo-10H | Qwen2-1.5B INT4 | 9.45 | 2.1 W (NPU) | 4.50 | ~$170 (M.2 module) | Hailo reference number |
| Raspberry Pi 5 (CPU only) | Qwen2-1.5B Q4_0 | 5.5 | ~8 W (SoC) | ~0.7 | $80 (Pi 5 8GB) | llama.cpp on 4× Cortex-A76 |
| Jetson Orin Nano 8GB | Qwen2-1.5B INT4 | 25 | ~10 W | ~2.5 | $250 | GPU decode, higher throughput, 5× the power |
| Apple M2 Pro (MLX) | Qwen2-1.5B 4-bit | 95 | ~15 W | ~6.3 | N/A (laptop) | Not an edge device, shown for scale |
The honest picture
Independent testing (CNX Software, schwab.sh) has found that on some prompts, the Raspberry Pi 5's own CPU comes within striking distance of the Hailo-10H for pure decode throughput. The Hailo-10H wins on sustained performance, TTFT, and power, not on peak single-prompt speed. If you care about battery life or multi-stream always-on LLM inference, it's the right pick. If you just want the fastest tok/s off a Pi, try the CPU first.
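The power argument is easiest to see as energy per token. A quick sketch from the table's numbers (the non-Hailo power figures are approximate, so treat the results as order-of-magnitude):

```python
# (decode tok/s, watts) from the platform table above
platforms = {
    "Hailo-10H":        (9.45, 2.1),
    "Raspberry Pi 5":   (5.5,  8.0),
    "Jetson Orin Nano": (25.0, 10.0),
}

def efficiency(tok_s, watts):
    """Tokens per second per watt, i.e. tokens per joule."""
    return tok_s / watts

for name, (tok_s, watts) in platforms.items():
    print(f"{name}: {efficiency(tok_s, watts):.2f} tok/s/W, "
          f"{1000 * watts / tok_s:.0f} mJ/token")
```

Jetson decodes faster in absolute terms, but per joule the Hailo-10H comes out well ahead, which is the metric that matters for battery-powered or always-on deployments.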
Quantization stack
Every LLM on Hailo-10H goes through the same quantization pipeline. Understanding this helps explain the accuracy/speed trade-offs vs a vanilla FP16 or Q4_K_M build.
| Tensor | Method | Why |
|---|---|---|
| Weights | Static, 4-bit symmetric, group-wise (GPTQ / QuaROT) | Biggest memory win — 4× smaller than FP16. Group-wise keeps accuracy within 1-2% of FP16 on MMLU. |
| Activations | Static, 8-bit asymmetric, per-tensor | Matches Hailo’s INT8 NN core natively — no runtime scaling overhead. |
| KV cache | Static, 8-bit asymmetric, per-tensor | KV cache dominates memory at long context. 8-bit halves it vs FP16 with negligible quality loss. |
Hailo’s compiler builds on GPTQ and QuaROT to produce the INT4/INT8 mix above. Accuracy delta vs FP16 is typically 1-2% on standard benchmarks (MMLU, ARC, HellaSwag) for models in the 1.5B-8B range.
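The weight row is the interesting one. A minimal NumPy sketch of symmetric 4-bit group-wise quantization, without the GPTQ error compensation or QuaROT rotations the real compiler adds (real schemes also group along the input dimension per output channel, rather than over a flat vector as here):

```python
import numpy as np

def quantize_groupwise_int4(w, group_size=128):
    """Symmetric 4-bit quantization with one scale per group."""
    w = w.reshape(-1, group_size)
    # One shared scale per group; max magnitude maps to the int4 extreme 7
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)
q, s = quantize_groupwise_int4(w)
err = np.abs(dequantize(q, s) - w).max()
print(f"max abs error: {err:.4f}")  # bounded by half a group's scale
```

Smaller groups mean tighter scales and lower error, at the cost of storing more scale values; group-wise is the middle ground between per-tensor (cheap, lossy) and per-element (pointless).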
Deployment paths
Three ways to actually run one of these models on Hailo-10H today.
1. Hailo HEF + HailoRT
Official path. Download the pre-compiled HEF from Hailo’s Model Zoo, load it with HailoRT, run inference in Python or C++. Lowest-level, fastest.
Best for production embedded devices.
2. hailo-ollama
Community Ollama fork that routes GGUF Q4_0 models through the Hailo instead of the CPU, so any Q4_0 model in the Ollama library becomes a Hailo target. Slight performance tax vs a native HEF.
Best for prototyping on Raspberry Pi 5.
3. Compile your own
Start from a PyTorch or ONNX checkpoint, run through the Hailo Dataflow Compiler with a calibration set, get a custom HEF. Longest path but unlocks any model that fits the 8 GB budget.
Best for custom fine-tunes or distilled variants.
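For path 2, hailo-ollama is assumed to keep stock Ollama's REST schema, so a plain `/api/generate` call works. A sketch using only the standard library (the model tag and port are illustrative; check your install):

```python
import json
import urllib.request

def build_request(prompt, model="qwen2:1.5b"):
    # stream=False asks for one JSON reply instead of chunked lines
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt, model="qwen2:1.5b", host="http://localhost:11434"):
    """Send a generate request to a running (hailo-)ollama server."""
    data = json.dumps(build_request(prompt, model)).encode()
    req = urllib.request.Request(
        f"{host}/api/generate", data=data,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Because the schema is unchanged, existing Ollama clients and tooling should work against a hailo-ollama endpoint without modification.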
Sources
Missing a model you care about?
This page tracks every LLM with a published Hailo-10H benchmark. If you’ve run a model that isn’t listed — or have more recent numbers — tell us and we’ll add it.
Submit a benchmark

Last updated April 2026. Benchmarks collected from cited sources — not independently re-run by CodeSOTA. Numbers change as Hailo ships new SDK releases.