Hardware: Apple M3 Pro · 18 GB unified memory · Models: 8 tested · Updated April 2026
Version Comparison
Average across all models · higher tok/s and lower latency are better
Models
Per-model summary across all tested Ollama versions
Thinking models: qwen3:4b and qwen3.5:4b use chain-of-thought reasoning before emitting visible output. Their high TTFT is intentional, not a performance defect.
phi4-mini-reasoning context dependency: This model requires a sufficiently large context window to function correctly. At Ollama's default num_ctx (~4096 on <24 GB systems), the reasoning chain overflows the context, producing garbled output and 0 % task pass rates. All 0.18.3 and 0.20.0 phi4 results were collected at the default context and are unreliable for quality benchmarks. The 0.20.2 entry uses num_ctx=16384 and reflects actual model capability.
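For reproducibility, the context window can be raised per request through the `options` field of Ollama's `/api/generate` endpoint. A minimal sketch of the request body used for the 0.20.2 phi4 runs (the prompt text is illustrative, and the request is only constructed here, not sent; a local server would listen at http://localhost:11434 by default):

```python
import json

# Request body for Ollama's /api/generate endpoint. num_ctx raises the
# context window from the default (~4096 on <24 GB machines) to 16384 so
# phi4-mini-reasoning's reasoning chain does not overflow the context.
payload = {
    "model": "phi4-mini-reasoning",
    "prompt": "Which is larger, 9.11 or 9.9?",  # illustrative prompt only
    "stream": False,
    "options": {"num_ctx": 16384},
}

body = json.dumps(payload)
# To send: POST this body to http://localhost:11434/api/generate
print(body)
```

The same option can be baked into a Modelfile with `PARAMETER num_ctx 16384`, which avoids having to repeat it on every request.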
Throughput & Latency (Warm)
10 warm runs after one warmup · short prompt · bars grouped by model, colored by version
Tokens per Second
Mean decode throughput across warm runs. Higher is better.
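Ollama's `/api/generate` response reports `eval_count` (tokens generated) and `eval_duration` (time spent generating them, in nanoseconds), so decode throughput falls out directly. A small sketch with made-up values, not the benchmark's raw data:

```python
def decode_tok_per_s(resp: dict) -> float:
    """Decode throughput from an Ollama /api/generate response.

    eval_count is the number of generated tokens; eval_duration is the
    time spent generating them, in nanoseconds.
    """
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)

# Illustrative response fragment (values invented for the example).
sample = {"eval_count": 320, "eval_duration": 5_000_000_000}  # 5.0 s
print(f"{decode_tok_per_s(sample):.1f} tok/s")  # 64.0 tok/s
```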
Time to First Token (TTFT)
Mean TTFT across warm runs. Lower is better. Thinking models (†) show elevated values due to internal reasoning chains.
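One plausible way to estimate TTFT without streaming is from the response's timing fields: model load plus prompt prefill, both in nanoseconds. This is an assumption about the derivation, not a statement of how the dashboard measures it, and the values below are invented:

```python
def ttft_seconds(resp: dict) -> float:
    """Approximate time to first token from Ollama timing fields (all ns).

    TTFT ~= load_duration + prompt_eval_duration. For warm runs
    load_duration is near zero, so prefill dominates.
    """
    return (resp.get("load_duration", 0) + resp.get("prompt_eval_duration", 0)) / 1e9

# Illustrative warm-run fragment (made-up values).
warm = {"load_duration": 12_000_000, "prompt_eval_duration": 180_000_000}
print(f"{ttft_seconds(warm) * 1000:.0f} ms")  # 192 ms
```

Note that thinking models emit their reasoning chain after prefill, so their elevated time-to-visible-output only shows up in wall-clock streaming measurements, not in this field-based estimate.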
Cold Start
Model load latency measured from the Ollama API load_duration field
Load Duration — First Cold-Start Run
Time Ollama spent loading the model into memory before serving the first token.
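Since `load_duration` is reported in nanoseconds and collapses to a few milliseconds once the model is resident, a simple threshold separates cold loads from warm runs. A sketch with made-up values and an arbitrary illustrative cutoff:

```python
def load_seconds(resp: dict) -> float:
    """Model load time in seconds from the load_duration field (ns)."""
    return resp.get("load_duration", 0) / 1e9

def is_cold_start(resp: dict, threshold_s: float = 0.1) -> bool:
    # Warm runs reuse the already-resident model, so load_duration is
    # typically a few milliseconds; a cold load takes hundreds of ms or
    # seconds. The 0.1 s threshold is an arbitrary illustrative cutoff.
    return load_seconds(resp) > threshold_s

cold = {"load_duration": 1_120_000_000}  # 1.12 s (invented example)
warm = {"load_duration": 9_000_000}      # 9 ms
print(is_cold_start(cold), is_cold_start(warm))  # True False
```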
llama3.2:3b anomaly: First cold-start load was 10.13 s — well above subsequent runs (~0.62 s). This reflects the model being fully evicted from the OS page cache. Later cold-start runs are consistent.
0.20.2 finding (qwen3:4b): Cold load dropped from 1.67 s → 1.12 s (−33 %) versus 0.18.3, suggesting faster model initialization in the newer runtime. Decode throughput was unchanged (~45.7 tok/s).
0.20.2 finding (phi4-mini-reasoning, ctx=16384): With the default context window (~4096), phi4-mini-reasoning failed completely on cold-start in all prior runs. At num_ctx=16384 the model loads correctly — first cold load is 2.45 s, consistent with other 4 B-class models.
Prompt Length Scaling
Mean TTFT across 3 reps per size · output fixed at 64 tokens · Y-axis log scale
TTFT vs Input Context Size
Solid lines = 0.18.3. Each additional version adds dashed lines. Use the filter to reduce clutter. 0.20.2 only has qwen3:4b data so far.
Output Length Scaling
Mean tok/s across 3 reps per output length · fixed short input prompt
Throughput vs Output Length
Stable lines indicate consistent decode speed. Drops may indicate memory pressure at longer sequences.
Variance & Memory
Run-to-Run TTFT Std Dev
Standard deviation of TTFT across 10 identical runs. Lower = more predictable. phi4-mini-reasoning 0.18.3/0.20.0 omitted (all runs failed at default ctx). At ctx=16384 on 0.20.2, phi4 recovers to σ = 21 ms. 0.20.2 qwen3:4b σ = 155 ms vs 4 ms on 0.18.3 — notable variance regression.
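The σ values above are sample standard deviations over the 10 identical runs, which the stdlib computes directly. A sketch with illustrative TTFT samples (not the benchmark's raw data):

```python
import statistics

# TTFT samples in milliseconds across identical runs (invented values).
ttft_ms = [101, 99, 104, 98, 102, 100, 97, 103, 101, 100]

# statistics.stdev is the sample standard deviation (n-1 denominator),
# appropriate when the 10 runs are treated as a sample of run behavior.
sigma = statistics.stdev(ttft_ms)
print(f"sigma = {sigma:.1f} ms")  # sigma = 2.2 ms
```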
System Memory Pressure
Mean system memory % during warm runs. On unified memory Macs, sustained high usage increases swap risk.
Long-Context Retrieval
~645-token context · fact planted at beginning or end · mean TTFT across 3 runs
TTFT — Fact Position in Context
Compares TTFT when the target fact is at the start vs end of the context. A gap may indicate position-sensitive attention cost.
phi4-mini-reasoning total response latency (ctx=16384): The chart shows TTFT only (~0.52 s, unchanged). However, total response latency improved substantially with the larger context window — fact-at-beginning: 23.3 s → 14.0 s (−40 %), fact-at-end: 35.1 s → 17.0 s (−52 %). The model generates a shorter, more focused reasoning chain when it is not context-constrained.