Hardware: Apple M3 Pro · 18 GB unified memory · Models: 8 tested · Updated April 2026
Version Comparison
Average across all models · higher tok/s and lower latency are better
Models
Per-model summary across all tested Ollama versions
Thinking models: qwen3:4b and qwen3.5:4b use chain-of-thought reasoning before emitting visible output. Their high TTFT is intentional, not a performance defect.
phi4-mini-reasoning context dependency: This model requires a sufficiently large context window to function correctly. At Ollama's default num_ctx (~4096 on <24 GB systems), the reasoning chain overflows the context, producing garbled output and 0 % task pass rates. All 0.18.3 and 0.20.0 phi4 results were collected at the default context and are unreliable for quality benchmarks. The 0.20.2 entry uses num_ctx=16384 and reflects actual model capability.
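For reproducibility, the context window can be raised per request through the `options` field of Ollama's `/api/generate` endpoint. A minimal sketch of the request body used for the 0.20.2 phi4 runs (the prompt text is illustrative, and the request is only constructed here, not sent; a local server would listen at http://localhost:11434 by default):

```python
import json

# Request body for Ollama's /api/generate endpoint. num_ctx raises the
# context window from the default (~4096 on <24 GB machines) to 16384 so
# phi4-mini-reasoning's reasoning chain does not overflow the context.
payload = {
    "model": "phi4-mini-reasoning",
    "prompt": "Which is larger, 9.11 or 9.9?",  # illustrative prompt only
    "stream": False,
    "options": {"num_ctx": 16384},
}

body = json.dumps(payload)
# To send: POST this body to http://localhost:11434/api/generate
print(body)
```

The same option can be baked into a Modelfile with `PARAMETER num_ctx 16384`, which avoids having to repeat it on every request.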
Throughput & Latency (Warm)
10 warm runs after one warmup · short prompt · bars grouped by model, colored by version
Tokens per Second
Mean decode throughput across warm runs. Higher is better.
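Ollama's `/api/generate` response reports `eval_count` (tokens generated) and `eval_duration` (time spent generating them, in nanoseconds), so decode throughput falls out directly. A small sketch with made-up values, not the benchmark's raw data:

```python
def decode_tok_per_s(resp: dict) -> float:
    """Decode throughput from an Ollama /api/generate response.

    eval_count is the number of generated tokens; eval_duration is the
    time spent generating them, in nanoseconds.
    """
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)

# Illustrative response fragment (values invented for the example).
sample = {"eval_count": 320, "eval_duration": 5_000_000_000}  # 5.0 s
print(f"{decode_tok_per_s(sample):.1f} tok/s")  # 64.0 tok/s
```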
Time to First Token (TTFT)
Mean TTFT across warm runs. Lower is better. Thinking models (†) show elevated values due to internal reasoning chains.
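One plausible way to estimate TTFT without streaming is from the response's timing fields: model load plus prompt prefill, both in nanoseconds. This is an assumption about the derivation, not a statement of how the dashboard measures it, and the values below are invented:

```python
def ttft_seconds(resp: dict) -> float:
    """Approximate time to first token from Ollama timing fields (all ns).

    TTFT ~= load_duration + prompt_eval_duration. For warm runs
    load_duration is near zero, so prefill dominates.
    """
    return (resp.get("load_duration", 0) + resp.get("prompt_eval_duration", 0)) / 1e9

# Illustrative warm-run fragment (made-up values).
warm = {"load_duration": 12_000_000, "prompt_eval_duration": 180_000_000}
print(f"{ttft_seconds(warm) * 1000:.0f} ms")  # 192 ms
```

Note that thinking models emit their reasoning chain after prefill, so their elevated time-to-visible-output only shows up in wall-clock streaming measurements, not in this field-based estimate.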
Cold Start
Model load latency measured from the Ollama API load_duration field
Load Duration — First Cold-Start Run
Time Ollama spent loading the model into memory before serving the first token.
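Since `load_duration` is reported in nanoseconds and collapses to a few milliseconds once the model is resident, a simple threshold separates cold loads from warm runs. A sketch with made-up values and an arbitrary illustrative cutoff:

```python
def load_seconds(resp: dict) -> float:
    """Model load time in seconds from the load_duration field (ns)."""
    return resp.get("load_duration", 0) / 1e9

def is_cold_start(resp: dict, threshold_s: float = 0.1) -> bool:
    # Warm runs reuse the already-resident model, so load_duration is
    # typically a few milliseconds; a cold load takes hundreds of ms or
    # seconds. The 0.1 s threshold is an arbitrary illustrative cutoff.
    return load_seconds(resp) > threshold_s

cold = {"load_duration": 1_120_000_000}  # 1.12 s (invented example)
warm = {"load_duration": 9_000_000}      # 9 ms
print(is_cold_start(cold), is_cold_start(warm))  # True False
```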
llama3.2:3b anomaly: First cold-start load was 10.13 s — well above subsequent runs (~0.62 s). This reflects the model being fully evicted from the OS page cache. Later cold-start runs are consistent.
0.20.2 finding (qwen3:4b): Cold load dropped from 1.67 s → 1.12 s (−33 %) versus 0.18.3, suggesting faster model initialization in the newer runtime. Decode throughput was unchanged (~45.7 tok/s).
0.20.2 finding (phi4-mini-reasoning, ctx=16384): With the default context window (~4096), phi4-mini-reasoning failed completely on cold-start in all prior runs. At num_ctx=16384 the model loads correctly — first cold load is 2.45 s, consistent with other 4 B-class models.
Prompt Length Scaling
Mean TTFT across 3 reps per size · output fixed at 64 tokens · Y-axis log scale
TTFT vs Input Context Size
Solid lines = 0.18.3. Each additional version adds dashed lines. Use the filter to reduce clutter. 0.20.2 only has qwen3:4b data so far.
Output Length Scaling
Mean tok/s across 3 reps per output length · fixed short input prompt
Throughput vs Output Length
Stable lines indicate consistent decode speed. Drops may indicate memory pressure at longer sequences.
Variance & Memory
Run-to-Run TTFT Std Dev
Standard deviation of TTFT across 10 identical runs. Lower = more predictable. phi4-mini-reasoning 0.18.3/0.20.0 omitted (all runs failed at default ctx). At ctx=16384 on 0.20.2, phi4 recovers to σ = 21 ms. 0.20.2 qwen3:4b σ = 155 ms vs 4 ms on 0.18.3 — notable variance regression.
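The σ values above are sample standard deviations over the 10 identical runs, which the stdlib computes directly. A sketch with illustrative TTFT samples (not the benchmark's raw data):

```python
import statistics

# TTFT samples in milliseconds across identical runs (invented values).
ttft_ms = [101, 99, 104, 98, 102, 100, 97, 103, 101, 100]

# statistics.stdev is the sample standard deviation (n-1 denominator),
# appropriate when the 10 runs are treated as a sample of run behavior.
sigma = statistics.stdev(ttft_ms)
print(f"sigma = {sigma:.1f} ms")  # sigma = 2.2 ms
```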
System Memory Pressure
Mean system memory % during warm runs. On unified memory Macs, sustained high usage increases swap risk.
Long-Context Retrieval
~645-token context · fact planted at beginning or end · mean TTFT across 3 runs
TTFT — Fact Position in Context
Compares TTFT when the target fact is at the start vs end of the context. A gap may indicate position-sensitive attention cost.
phi4-mini-reasoning total response latency (ctx=16384): The chart shows TTFT only (~0.52 s, unchanged). However, total response latency improved substantially with the larger context window — fact-at-beginning: 23.3 s → 14.0 s (−40 %), fact-at-end: 35.1 s → 17.0 s (−52 %). The model generates a shorter, more focused reasoning chain when it is not context-constrained.