fmbench · Apple Foundation Models

model: system · decoding: greedy · 52 cases · 2026-06-10 08:02:32 · github.com/dwstevens/fmbench

Metric	Value	Detail
JSON validity	50/52	96% structurally valid
Tool Routing	100%	routing accuracy
Nested Extraction	97%	field accuracy
Enum / Const Enforcement	100%	constraint compliance
Throughput	54 tok/s	on the ANE
Neural Engine	92%	ANE duty cycle under load

Suite scores

Suite	Cases	Valid JSON	Refused	Avg score
Tool Routing	14	100%	0	100%
Nested Extraction	8	100%	0	97%
Enum / Const Enforcement	14	100%	0	100%
Failure Modes (characterization)	10	100%	1	78%
Big-Args Scaling	6	83%	0	83%

Scores exclude guardrail refusals, which are tracked separately as a reliability signal.
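As a rough cross-check, the per-suite rows can be tallied against the 50/52 JSON-validity headline. The sketch below is illustrative and not part of fmbench. It assumes the Valid JSON percentage is computed over non-refused cases only, which this report does not state; under that assumption the one refusal is not counted as valid output and the total comes to 50.

```python
# Rows copied from the suite table: (suite, cases, valid-JSON fraction, refused).
SUITES = [
    ("Tool Routing", 14, 1.00, 0),
    ("Nested Extraction", 8, 1.00, 0),
    ("Enum / Const Enforcement", 14, 1.00, 0),
    ("Failure Modes (characterization)", 10, 1.00, 1),
    ("Big-Args Scaling", 6, 0.83, 0),
]


def valid_count(cases, valid_fraction, refused):
    # ASSUMPTION (not stated in the report): the Valid JSON percentage is
    # taken over non-refused cases, so a refusal never counts as valid JSON.
    scored = cases - refused
    return round(scored * valid_fraction)


total_cases = sum(cases for _, cases, _, _ in SUITES)
total_valid = sum(valid_count(c, v, r) for _, c, v, r in SUITES)

print(f"{total_valid}/{total_cases} = {total_valid / total_cases:.0%}")
```

If the refusal were instead counted as valid JSON, the same tally would give 51/52, so the headline figure is at least consistent with refusals being excluded.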

Confirmed limits

Case	Kind	Observation	Result
fail-math-mugs	guardrail	request blocked by safety guardrail	refused
fail-math-pens	arithmetic	got 11.5, correct 11.5	correct
fail-math-cables	arithmetic	got 39.96, correct 39.96	correct
fail-math-mixed	arithmetic	got 47.25, correct 50.7	wrong
fail-math-decimals	arithmetic	got 66.67, correct 53.31	wrong
fail-notool-haiku	no-tool-fits	chose respond_directly	used escape
fail-notool-meaning	no-tool-fits	chose respond_directly	used escape
fail-notool-translate	no-tool-fits	chose respond_directly	used escape
fail-notool-joke	no-tool-fits	chose respond_directly	used escape
fail-notool-recommend	no-tool-fits	chose respond_directly	used escape
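The 78% average for the Failure Modes suite can be reproduced from the rows above. This is a minimal sketch, not the fmbench grader: it assumes a case scores 1 when its result is "correct" or "used escape" and 0 when it is "wrong", and it drops refused cases before averaging, as the note under the suite table describes. The names `CASES`, `PASSING`, and `average_score` are hypothetical.

```python
# (case, result) pairs copied from the "Confirmed limits" table.
CASES = [
    ("fail-math-mugs", "refused"),
    ("fail-math-pens", "correct"),
    ("fail-math-cables", "correct"),
    ("fail-math-mixed", "wrong"),
    ("fail-math-decimals", "wrong"),
    ("fail-notool-haiku", "used escape"),
    ("fail-notool-meaning", "used escape"),
    ("fail-notool-translate", "used escape"),
    ("fail-notool-joke", "used escape"),
    ("fail-notool-recommend", "used escape"),
]

# ASSUMPTION: these two results count as a pass; the report does not say so.
PASSING = {"correct", "used escape"}


def average_score(cases):
    """Return (average score over non-refused cases, number of refusals)."""
    scored = [result for _, result in cases if result != "refused"]
    refused = len(cases) - len(scored)
    if not scored:
        return 0.0, refused
    passed = sum(result in PASSING for result in scored)
    return passed / len(scored), refused


score, refused = average_score(CASES)
print(f"refused={refused} avg score={score:.0%}")
```

Under these assumptions 7 of the 9 scored cases pass, which rounds to the 78% shown in the suite table, with the single refusal reported separately.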

Big-args scaling (one tool call, growing arg object)

Fields	Valid JSON	Field accuracy	Latency
5	valid	5/5	0.9s
10	valid	10/10	1.6s
25	valid	25/25	3.4s
50	valid	49/50	6.3s
75	valid	75/75	13.4s
100	invalid	n/a	11.2s

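A single tool call whose argument object scales to ~100 fields across mixed nesting, to show where validity or field accuracy slips as the structure grows. In this run the first field error appears at 50 fields (49/50), and the output stops being valid JSON at 100 fields.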
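The slip points can be read off the table programmatically. The sketch below is illustrative only: the rows are copied from the table, and the helper name `first_where` is hypothetical. It also prints latency per field, which is not a metric the report itself provides.

```python
# (fields, valid_json, correct_fields, latency_seconds) from the table.
# correct_fields is None where the output was not valid JSON.
ROWS = [
    (5, True, 5, 0.9),
    (10, True, 10, 1.6),
    (25, True, 25, 3.4),
    (50, True, 49, 6.3),
    (75, True, 75, 13.4),
    (100, False, None, 11.2),
]


def first_where(rows, predicate):
    """Return the field count of the first row matching predicate, or None."""
    for row in rows:
        if predicate(row):
            return row[0]
    return None


# First size at which a valid response still got at least one field wrong.
first_field_miss = first_where(ROWS, lambda r: r[1] and r[2] < r[0])
# First size at which the response was not valid JSON at all.
first_invalid = first_where(ROWS, lambda r: not r[1])

print("first field miss at", first_field_miss, "fields")
print("first invalid JSON at", first_invalid, "fields")

for fields, _, _, latency in ROWS:
    print(f"{fields:>3} fields: {1000 * latency / fields:.0f} ms per field")
```

Note that the 100-field case finishes faster than the 75-field case; the report does not say why, so the per-field latency of an invalid response should not be compared directly with the valid ones.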

Where the compute landed

Process	Role	Avg CPU%	Peak CPU%
aned	Apple Neural Engine daemon (ANE dispatch)	4.0	71.7
modelcatalogd	model weight management	12.8	41.9
TGOnDeviceInferenceProviderService	on-device inference worker	20.3	38.7
fm	CLI client (should be ~idle)	2.8	5.0
GenerativeExperiencesSafetyInferenceProvider	safety/guardrail inference	1.2	3.5
IntelligencePlatformComputeService	compute coordination	0.1	3.5

GPU never appears as a consumer: inference runs on the Apple Neural Engine, so it is free of GPU contention.

ANE hardware activity (Instruments Core ML trace)

Metric	Value
Neural Engine ops	1623
ANE active time	15.5 s
Duty cycle (active / window)	92.0%
Median / max op	1063 µs / 104.0 ms

Direct evidence that inference runs on the Apple Neural Engine: the trace records 1,623 Neural Engine Prediction hardware intervals. Most ops are short bursts (median about 1 ms, although the longest ran 104 ms), which is a likely reason power/energy sampling reads ~0.
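Two further figures follow from the trace metrics by simple arithmetic. The sketch below assumes that "ANE active time" is the sum of the individual op durations and that the duty cycle is active time divided by the trace window; neither definition is spelled out beyond the "active / window" label.

```python
# Values copied from the "ANE hardware activity" table.
ops = 1623
active_s = 15.5
duty_cycle = 0.92
median_op_ms = 1.063  # 1063 µs

# ASSUMPTION: duty cycle = active time / window, so window = active / duty.
window_s = active_s / duty_cycle

# ASSUMPTION: active time is the sum of op durations, so this is the mean op.
mean_op_ms = active_s / ops * 1000

print(f"trace window ~ {window_s:.2f} s")
print(f"mean op ~ {mean_op_ms:.2f} ms vs median {median_op_ms:.3f} ms")
```

Under these assumptions the window is roughly 16.85 s and the mean op is roughly 9.55 ms, about nine times the median. That gap suggests a long tail: most ops are very short, and a small number of longer ones (up to the 104.0 ms maximum) account for much of the active time.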

generated by fmbench · greedy decoding · deterministic, code-graded