fmbench · Apple Foundation Models
| Headline metric | Value | Detail |
|---|---|---|
| JSON validity | 50/52 | 96% structurally valid |
| Tool Routing | 100% | Routing accuracy |
| Nested Extraction | 97% | Field accuracy |
| Enum / Const Enforcement | 100% | Constraint compliance |
| Throughput | 54 tok/s | on the ANE |
| Neural Engine | 92% | ANE duty cycle under load |
Suite scores
| Suite | Cases | Valid JSON | Refused | Avg score |
|---|---|---|---|---|
| Tool Routing | 14 | 100% | 0 | 100% |
| Nested Extraction | 8 | 100% | 0 | 97% |
| Enum / Const Enforcement | 14 | 100% | 0 | 100% |
| Failure Modes (characterization) | 10 | 100% | 1 | 78% |
| Big-Args Scaling | 6 | 83% | 0 | 83% |
Scores exclude guardrail refusals, which are tracked separately as a reliability signal.
Confirmed limits
| Case | Kind | Observation | Result |
|---|---|---|---|
| fail-math-mugs | guardrail | request blocked by safety guardrail | refused |
| fail-math-pens | arithmetic | got 11.5, correct 11.5 | correct |
| fail-math-cables | arithmetic | got 39.96, correct 39.96 | correct |
| fail-math-mixed | arithmetic | got 47.25, correct 50.7 | wrong |
| fail-math-decimals | arithmetic | got 66.67, correct 53.31 | wrong |
| fail-notool-haiku | no-tool-fits | chose respond_directly | used escape |
| fail-notool-meaning | no-tool-fits | chose respond_directly | used escape |
| fail-notool-translate | no-tool-fits | chose respond_directly | used escape |
| fail-notool-joke | no-tool-fits | chose respond_directly | used escape |
| fail-notool-recommend | no-tool-fits | chose respond_directly | used escape |
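The 78% average for the Failure Modes suite can be reproduced from the rows above. This is a minimal sketch, not fmbench's own scoring code: it assumes `correct` and `used escape` each score 1, `wrong` scores 0, and refused cases are dropped from the average. The names `RESULTS`, `ASSUMED_SCORE`, and `suite_average` are hypothetical.

```python
# Outcomes copied from the "Confirmed limits" table.
RESULTS = {
    "fail-math-mugs": "refused",
    "fail-math-pens": "correct",
    "fail-math-cables": "correct",
    "fail-math-mixed": "wrong",
    "fail-math-decimals": "wrong",
    "fail-notool-haiku": "used escape",
    "fail-notool-meaning": "used escape",
    "fail-notool-translate": "used escape",
    "fail-notool-joke": "used escape",
    "fail-notool-recommend": "used escape",
}

# Assumed per-result scores (not taken from fmbench itself).
ASSUMED_SCORE = {"correct": 1.0, "used escape": 1.0, "wrong": 0.0}


def suite_average(results):
    """Return (mean score over non-refused cases, number of refusals)."""
    scored = [ASSUMED_SCORE[r] for r in results.values() if r != "refused"]
    refused = len(results) - len(scored)
    average = sum(scored) / len(scored) if scored else 0.0
    return average, refused


average, refused = suite_average(RESULTS)
print(f"avg score {average:.0%}, refused {refused}")  # avg score 78%, refused 1
```

Under these assumptions, 7 of the 9 scored cases pass, which rounds to the 78% reported in the suite table, with the single guardrail refusal counted separately.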
Big-args scaling (one tool call, growing arg object)
| Fields | Valid JSON | Field accuracy | Latency |
|---|---|---|---|
| 5 | valid | 5/5 | 0.9s |
| 10 | valid | 10/10 | 1.6s |
| 25 | valid | 25/25 | 3.4s |
| 50 | valid | 49/50 | 6.3s |
| 75 | valid | 75/75 | 13.4s |
| 100 | invalid | n/a | 11.2s |
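A single tool call whose argument object scales to ~100 fields across mixed nesting, showing where validity or field accuracy slips as the structure grows. In this run the output stayed valid through 75 fields, missed one field at 50 (49/50), and produced invalid JSON at 100.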
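The suite-level figures for Big-Args Scaling (83% valid JSON, 83% average score) follow from the six rows in the table. The sketch below assumes an invalid response scores 0 and a valid one scores its fraction of correct fields; that scoring rule and the names `ROWS` and `summarize` are illustrative, not taken from fmbench.

```python
# (fields requested, JSON valid, fields correct, latency in seconds), from the table.
ROWS = [
    (5, True, 5, 0.9),
    (10, True, 10, 1.6),
    (25, True, 25, 3.4),
    (50, True, 49, 6.3),
    (75, True, 75, 13.4),
    (100, False, None, 11.2),
]


def summarize(rows):
    """Return (valid-JSON rate, average score, first field count that broke)."""
    valid_rate = sum(1 for _, ok, _, _ in rows if ok) / len(rows)
    # Assumption: invalid JSON scores 0; valid JSON scores correct / fields.
    scores = [(correct / fields) if ok else 0.0 for fields, ok, correct, _ in rows]
    avg_score = sum(scores) / len(scores)
    first_invalid = next((fields for fields, ok, _, _ in rows if not ok), None)
    return valid_rate, avg_score, first_invalid


valid_rate, avg_score, first_invalid = summarize(ROWS)
print(f"valid JSON {valid_rate:.0%}, avg score {avg_score:.0%}, "
      f"first invalid at {first_invalid} fields")
```

Both rates come out at 83% under this scoring, and the first invalid response appears at 100 fields.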
Where the compute landed
| Process | Role | Avg CPU% | Peak CPU% |
|---|---|---|---|
| aned | Apple Neural Engine daemon (ANE dispatch) | 4.0 | 71.7 |
| modelcatalogd | model weight management | 12.8 | 41.9 |
| TGOnDeviceInferenceProviderService | on-device inference worker | 20.3 | 38.7 |
| fm | CLI client (should be ~idle) | 2.8 | 5.0 |
| GenerativeExperiencesSafetyInferenceProvider | safety/guardrail inference | 1.2 | 3.5 |
| IntelligencePlatformComputeService | compute coordination | 0.1 | 3.5 |
The GPU never appears as a consumer: inference runs on the Apple Neural Engine, so it is free of GPU contention.
ANE hardware activity (Instruments Core ML trace)
| Metric | Value |
|---|---|
| Neural Engine ops | 1623 |
| ANE active time | 15.5 s |
| Duty cycle (active / window) | 92.0% |
| Median / max op | 1063 µs / 104.0 ms |
Direct evidence that inference runs on the Apple Neural Engine: the trace records 1,623 Neural Engine Prediction hardware intervals. The median op is a roughly 1 ms burst (the longest is 104 ms), and bursts this short are the likely reason power/energy sampling reads ~0.
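Two figures that the trace table does not list can be derived from it: the length of the measurement window and the mean op duration. This is a rough sketch that assumes the reported ANE active time is the sum of non-overlapping op durations; the function name `derived_metrics` is hypothetical.

```python
# Values copied from the ANE hardware activity table.
ANE_OPS = 1623
ACTIVE_SECONDS = 15.5
DUTY_CYCLE = 0.92
MEDIAN_OP_MS = 1.063  # 1063 microseconds


def derived_metrics(ops, active_s, duty_cycle):
    """Return (measurement window in seconds, mean op duration in ms)."""
    window_s = active_s / duty_cycle      # duty cycle = active / window
    mean_op_ms = active_s / ops * 1000.0  # assumes ops do not overlap
    return window_s, mean_op_ms


window_s, mean_op_ms = derived_metrics(ANE_OPS, ACTIVE_SECONDS, DUTY_CYCLE)
print(f"window {window_s:.1f} s, mean op {mean_op_ms:.2f} ms, "
      f"mean/median {mean_op_ms / MEDIAN_OP_MS:.1f}x")
```

If the assumption holds, the window is about 16.8 s and the mean op lasts about 9.55 ms, roughly nine times the median, which would suggest that a small number of long ops (up to the 104 ms maximum) account for much of the active time.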