My benchmark graded '7! = 5040' as wrong — and three other ways it lied to me
My benchmark marked the answer “7! = 5040” as wrong.
The model was right. 5040 is correct. But my scorer pulled the first number out of the response — the 7 — compared it to the expected 5040, and failed the task. I only noticed because I went to re-run an old benchmark and couldn’t reproduce my own published numbers.
That one bug had quietly inflated the quality scores in three posts I’d already shipped. Fixing it surfaced a second bug. Fixing that surfaced a third. Here’s the whole chain, because the lesson underneath it is the useful part: in a benchmark that executes and checks, a wrong number looks exactly like a right one — and the failure modes are model-specific, so they hide until you read the actual outputs.
Bug 1: the scorer read the first number, not the answer
My math checker extracted a number from each response and compared it to the known answer. The extraction grabbed the first number it found via regex.
That works for a model that answers “5040.” It breaks for a model that shows its work:
- “7! = 5040” → scored on the 7
- “17.5% of 240 = 0.175 × 240 = 42” → scored on the 17.5
- “1024 is 2^10, so log₂(1024) = 10” → scored on the 1024
Every show-your-work answer was being marked wrong. The fix is one line of intent — read the last number, not the first — but the consequence was large: it had pushed several models’ “real” scores down, and inflated others to a too-clean “21/21 (100%)” that I’d put in a headline.
I fixed it to take the last number in the response, unit-tested it against the exact answers that had been failing, and re-ran every model. That produced honest numbers — and immediately exposed the next problem.
Bug 2: a model that codes correctly, scored 1/8
With the math checker fixed, one result still looked absurd: Gemma 4 12B scored 1/8 on coding while acing math (8/8) and factual recall (5/5). A model that can do all the math but almost no code? That’s not a capability profile — that’s a harness smell.
I read the actual responses. The code was correct. Gemma 4’s default output is a long exploratory reasoning trace: it writes the function inside a ```python fence, but describes the call — print(fizzbuzz(15)) — out in the prose, not in the code block. My harness executed the fenced block, got a function definition with nothing calling it, captured no output, and scored it failed. Eight times.
This one I did not “fix” with a cleverer extractor. I tried — and a heuristic that scraped print() calls out of the prose recovered one task and introduced syntax errors in others. The honest conclusion: Gemma 4’s thinking-style output isn’t cleanly scorable by an execute-the-block harness, and any number I produced would be a function of how hard my extractor tried, not of the model. So I report its coding as N/A, with the explanation, rather than a fake number. Reporting “1/8” would have been as wrong as the old “100%,” just in the other direction.
Bug 3: every model reported the same RAM
The third one I almost shipped. My batch runner measured peak memory per model with mx.get_peak_memory(). Three different-sized models came back with identical peak RAM: 11.37 GB, 11.37 GB, 11.37 GB.
get_peak_memory() returns the peak for the whole process, and the batch ran every model in one process without resetting. So once the largest model set the high-water mark, every model after it inherited that number. A 3.4 GB model was being reported as 11.4 GB. The fix is one call — mx.reset_peak_memory() between models — but I’d have published a table where the memory column was silently meaningless for most rows.
The pattern: failures look like successes
Three bugs, one shape. A benchmark that executes code and checks answers feels objective — there’s no rubric, no judgment, the task either passes or fails. That’s exactly what makes its bugs dangerous: a false fail and a real fail produce the same cell in the table. “1/8” looks like a verdict on the model. “11.37 GB” looks like a measurement. Neither announces that it’s wrong.
And the bugs were model-specific. The first-number bug only bit verbose models. The code-extraction bug only bit thinking-style output. The RAM bug only bit the non-largest models in a batch. None of them showed up on the model I built the harness against. They waited for a different model to walk into the trap.
The practical defense is unglamorous: spot-check the failures, not the passes. A passing task is rarely lying to you. A failing task might be the model, or it might be your harness — and the only way to tell is to read the actual output, not the score. I now treat any surprising failure (a capable model bombing one category) as a harness bug until I’ve read the raw response and proven otherwise.
What I did about the posts
These numbers were already public. Three posts cited the inflated table. So I re-ran every model on one uniform methodology with all three fixes, corrected the live posts with dated notes explaining what changed, and pushed the fixes to the benchmark itself (ondevice-bench is open source, including the reset_peak_memory call and the documented extractor limitation). A benchmark that checks other people’s work should survive a check of its own.
This is the part of on-device work I keep coming back to: the bugs that never show up in a clean demo or a cloud eval, only when a specific model meets a specific harness on a specific machine. The score looked fine. The model looked bad. The truth was one regex deep — and it was mine.