fmbench · Apple Foundation Models

model system · decoding greedy · 52 cases · 2026-06-10 08:02:32 · github.com/dwstevens/fmbench
| Area | Value | Detail |
| --- | --- | --- |
| JSON validity | 50/52 | 96% structurally valid |
| Tool Routing | 100% | Routing accuracy |
| Nested Extraction | 97% | Field accuracy |
| Enum / Const Enforcement | 100% | Constraint compliance |
| Throughput | 54 tok/s | on the ANE |
| Neural Engine | 92% | ANE duty cycle under load |

Suite scores

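| Suite | Cases | Valid JSON | Refused | Avg score |
| --- | --- | --- | --- | --- |
| Tool Routing | 14 | 100% | 0 | 100% |
| Nested Extraction | 8 | 100% | 0 | 97% |
| Enum / Const Enforcement | 14 | 100% | 0 | 100% |
| Failure Modes (characterization) | 10 | 100% | 1 | 78% |
| Big-Args Scaling | 6 | 83% | 0 | 83% |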
Scores exclude guardrail refusals, which are tracked separately as a reliability signal.
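The exclusion rule can be sketched as follows. This is a minimal illustration, not fmbench's actual grader: `CaseResult` and `suite_average` are hypothetical names, each case is assumed to score between 0 and 1, and "used escape" outcomes are assumed to count as full marks. Under those assumptions, the Failure Modes outcomes reported in this document (one refusal, seven passes, two wrong answers) give 7/9, which rounds to the 78% shown in the table.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CaseResult:
    """Hypothetical per-case record; not fmbench's real data model."""
    case_id: str
    score: Optional[float]  # None when the request was refused
    refused: bool = False


def suite_average(results: List[CaseResult]) -> Optional[float]:
    """Average score over non-refused cases; None if every case was refused."""
    scored = [r.score for r in results if not r.refused and r.score is not None]
    if not scored:
        return None
    return sum(scored) / len(scored)


def refusal_count(results: List[CaseResult]) -> int:
    """Refusals are counted separately as a reliability signal."""
    return sum(1 for r in results if r.refused)


# Failure Modes suite as reported in this document (scores are assumed 0 or 1).
failure_modes = [
    CaseResult("fail-math-mugs", None, refused=True),
    CaseResult("fail-math-pens", 1.0),
    CaseResult("fail-math-cables", 1.0),
    CaseResult("fail-math-mixed", 0.0),
    CaseResult("fail-math-decimals", 0.0),
    CaseResult("fail-notool-haiku", 1.0),
    CaseResult("fail-notool-meaning", 1.0),
    CaseResult("fail-notool-translate", 1.0),
    CaseResult("fail-notool-joke", 1.0),
    CaseResult("fail-notool-recommend", 1.0),
]

average = suite_average(failure_modes)
print(f"avg score: {average:.0%}, refused: {refusal_count(failure_modes)}")
```

Keeping refusals out of the denominator means a guardrail block does not lower a suite's score, while the separate count still shows how often it happened.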

Confirmed limits

| Case | Kind | Observation | Result |
| --- | --- | --- | --- |
| fail-math-mugs | guardrail | request blocked by safety guardrail | refused |
| fail-math-pens | arithmetic | got 11.5, correct 11.5 | correct |
| fail-math-cables | arithmetic | got 39.96, correct 39.96 | correct |
| fail-math-mixed | arithmetic | got 47.25, correct 50.7 | wrong |
| fail-math-decimals | arithmetic | got 66.67, correct 53.31 | wrong |
| fail-notool-haiku | no-tool-fits | chose respond_directly | used escape |
| fail-notool-meaning | no-tool-fits | chose respond_directly | used escape |
| fail-notool-translate | no-tool-fits | chose respond_directly | used escape |
| fail-notool-joke | no-tool-fits | chose respond_directly | used escape |
| fail-notool-recommend | no-tool-fits | chose respond_directly | used escape |
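The arithmetic rows compare the model's number against a known answer. A code-graded check of that kind might look like the sketch below. The function name `grade_arithmetic` and the half-cent tolerance are assumptions made for illustration; the report does not say how fmbench compares values.

```python
import math


def grade_arithmetic(got: float, expected: float, abs_tol: float = 0.005) -> str:
    """Return 'correct' when got is within abs_tol of expected, else 'wrong'.

    abs_tol is an assumed tolerance, not a documented fmbench setting.
    """
    if math.isclose(got, expected, rel_tol=0.0, abs_tol=abs_tol):
        return "correct"
    return "wrong"


# (case, got, expected) taken from the arithmetic rows of the table above.
rows = [
    ("fail-math-pens", 11.5, 11.5),
    ("fail-math-cables", 39.96, 39.96),
    ("fail-math-mixed", 47.25, 50.7),
    ("fail-math-decimals", 66.67, 53.31),
]

results = {case: grade_arithmetic(got, expected) for case, got, expected in rows}
for case, verdict in results.items():
    print(case, verdict)
```

An absolute tolerance avoids failing a case over floating-point representation while still rejecting answers that are off by whole units, as in the last two rows.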

Big-args scaling (one tool call, growing arg object)

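| Fields | Valid JSON | Field accuracy | Latency |
| --- | --- | --- | --- |
| 5 | valid | 5/5 | 0.9s |
| 10 | valid | 10/10 | 1.6s |
| 25 | valid | 25/25 | 3.4s |
| 50 | valid | 49/50 | 6.3s |
| 75 | valid | 75/75 | 13.4s |
| 100 | invalid | n/a | 11.2s |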
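A single tool call whose argument object scales to ~100 fields across mixed nesting, showing where validity or field accuracy slips as the structure grows. In this run, field accuracy first slipped at 50 fields (49/50), and the output stopped being valid JSON at 100 fields.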
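One way to grade a big-args case is sketched below, assuming the model's tool arguments arrive as a raw JSON string and the expected values are known. The names `flatten` and `grade_big_args` are hypothetical, and counting leaf fields by dotted path is an assumption about how "field accuracy" is defined. The sketch returns no counts when the output does not parse, since field accuracy cannot be computed for an invalid object.

```python
import json
from typing import Any, Dict, Optional, Tuple


def flatten(obj: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into {dotted.path: leaf_value}."""
    flat: Dict[str, Any] = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def grade_big_args(
    raw_output: str, expected: Dict[str, Any]
) -> Tuple[bool, Optional[int], Optional[int]]:
    """Return (valid_json, matched_fields, total_fields).

    matched and total are None when the output is not a parseable JSON object.
    """
    try:
        actual = json.loads(raw_output)
    except json.JSONDecodeError:
        return False, None, None
    if not isinstance(actual, dict):
        return False, None, None

    want = flatten(expected)
    got = flatten(actual)
    matched = sum(
        1 for path, value in want.items() if path in got and got[path] == value
    )
    return True, matched, len(want)


# Hypothetical three-field case with one level of nesting.
expected = {"name": "Ada", "address": {"city": "Paris", "zip": "75001"}}

exact = grade_big_args(json.dumps(expected), expected)
one_wrong = grade_big_args(
    '{"name": "Ada", "address": {"city": "Lyon", "zip": "75001"}}', expected
)
truncated = grade_big_args('{"name": "Ada", "address": {"city"', expected)
print(exact, one_wrong, truncated)
```

Separating the validity check from the field count keeps the two failure modes distinct: a slightly wrong but parseable object still earns partial credit, while a truncated one earns none.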

Where the compute landed

| Process | Role | Avg CPU% | Peak CPU% |
| --- | --- | --- | --- |
| aned | Apple Neural Engine daemon (ANE dispatch) | 4.0 | 71.7 |
| modelcatalogd | model weight management | 12.8 | 41.9 |
| TGOnDeviceInferenceProviderService | on-device inference worker | 20.3 | 38.7 |
| fm | CLI client (should be ~idle) | 2.8 | 5.0 |
| GenerativeExperiencesSafetyInferenceProvider | safety/guardrail inference | 1.2 | 3.5 |
| IntelligencePlatformComputeService | compute coordination | 0.1 | 3.5 |
The GPU never appears as a consumer: inference runs on the Apple Neural Engine, so it is free of GPU contention.

ANE hardware activity (Instruments Core ML trace)

| Metric | Value |
| --- | --- |
| Neural Engine ops | 1623 |
| ANE active time | 15.5 s |
| Duty cycle (active / window) | 92.0% |
| Median / max op | 1063 µs / 104.0 ms |
Direct evidence that inference runs on the Apple Neural Engine: the trace shows hundreds of Neural Engine Prediction hardware intervals. Most ops are short bursts (median 1063 µs), though the longest ran 104.0 ms. Bursts this short are a likely reason power/energy sampling reads ~0.
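The trace figures can be cross-checked with a little arithmetic. The sketch below assumes duty cycle is simply active time divided by the measurement window, as the table's label suggests. The window length and mean op duration are derived here, not reported by fmbench, and are approximate because the inputs are rounded.

```python
# Values reported in the ANE hardware activity table.
NEURAL_ENGINE_OPS = 1623
ACTIVE_TIME_S = 15.5
DUTY_CYCLE = 0.920
MEDIAN_OP_MS = 1.063  # 1063 µs


def implied_window_s(active_s: float, duty_cycle: float) -> float:
    """Measurement window implied by duty = active / window."""
    return active_s / duty_cycle


def mean_op_ms(active_s: float, ops: int) -> float:
    """Average op duration in milliseconds, assuming ops do not overlap."""
    return active_s / ops * 1000.0


window_s = implied_window_s(ACTIVE_TIME_S, DUTY_CYCLE)  # about 16.8 s
mean_ms = mean_op_ms(ACTIVE_TIME_S, NEURAL_ENGINE_OPS)  # about 9.55 ms

print(f"implied window: {window_s:.2f} s")
print(f"mean op: {mean_ms:.2f} ms vs median {MEDIAN_OP_MS:.3f} ms")
```

The derived mean is roughly nine times the median, which suggests a skewed distribution: most ops are very short, with a smaller number of long ones (up to the 104.0 ms maximum) accounting for much of the active time.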
generated by fmbench · greedy decoding · deterministic, code-graded