Even a Screw Works as a Nail If You Hit It with a Big Enough Hammer
LLM Models, Parameters, Quantizations, Prompts, and Tool Usage
A two-hour Friday-night experiment with 35+ models and the infamous 3-digit Mastermind puzzle (secret code 042)
The Clues
- 682 → one correct and well-placed
- 614 → one correct but wrong-placed
- 206 → two correct but wrong-placed
- 738 → nothing correct
- 780 → one correct but wrong-placed
I forced thirty-five open-source models to solve it the hard way: pure step-by-step deduction, no code allowed.
Then I opened Claude Code and told Opus 4.5: “Solve this by writing and running a short Python script.”
Two hours later the verdict was merciless.
The Hammer-vs-Screwdriver Leaderboard
| Approach | Accuracy | Time to Answer | Verdict |
|---|---|---|---|
| Qwen3 family (1.7B – 30B, Q4) | 13/15 correct | 17–155 s | Best hammer available |
| GPT-OSS 20B & 120B (Q4) | 100% | ~15–24 s | Bigger hammer, identical result |
| Gemma3 27B (Q4, Q8, FP16) | 0/8 correct | 0.2–49 s | Precise hammer that still misses the nail |
| Devstral / Mistral 7B / GLM-Air | Chaos → timeout | Sometimes infinite | Fancy hammer, wrong head |
| Claude Code + Opus 4.5 + 15 lines of Python | 100% | 1.3 ms script runtime (after ~10 s of thinking) | Didn’t swing a hammer. Just used a screwdriver |
The Run That Made Every Other Run Look Silly
Claude Code took ten seconds to think, then executed a 15-line brute-forcer that checked all 1,000 possibilities and printed 042 before most 30B models had finished their first paragraph.
Same underlying intelligence, different tool.
Result: a script that runs four to five orders of magnitude faster than the 15–155 s reasoning runs (1.3 ms against tens of seconds), zero hallucinations, perfect reliability.
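For readers who want to try the screwdriver themselves, here is a minimal sketch of such a brute-forcer. It is not the script Claude Code actually wrote (that one is not reproduced in this post); the names `CLUES`, `score`, and `solve` are made up for illustration, and it assumes standard Mastermind scoring with repeated digits allowed in the secret.

```python
from collections import Counter
from itertools import product

# (guess, well_placed, wrong_placed), transcribed from the five clues.
CLUES = [
    ("682", 1, 0),  # one correct and well-placed
    ("614", 0, 1),  # one correct but wrong-placed
    ("206", 0, 2),  # two correct but wrong-placed
    ("738", 0, 0),  # nothing correct
    ("780", 0, 1),  # one correct but wrong-placed
]


def score(secret, guess):
    """Return (well_placed, wrong_placed) for a guess against a secret."""
    well = sum(s == g for s, g in zip(secret, guess))
    # Digits shared by both strings regardless of position (multiset overlap).
    common = sum((Counter(secret) & Counter(guess)).values())
    return well, common - well


def solve(clues=CLUES):
    """Try all 1,000 three-digit codes and keep those matching every clue."""
    candidates = ("".join(digits) for digits in product("0123456789", repeat=3))
    return [
        c for c in candidates
        if all(score(c, guess) == (well, wrong) for guess, well, wrong in clues)
    ]


print(solve())  # ['042']
```

Every clue becomes an exact constraint on the `(well_placed, wrong_placed)` pair, so a candidate survives only if it reproduces all five responses. Under these assumptions exactly one code out of the 1,000 survives.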
What Two Hours of Swinging Hammers Proved
- Parameters: massive gains 1B → ~20B, then almost zero extra value.
- Quantization: Q4 vs FP16 changes almost nothing on final correctness.
- Prompts: a “perfect” 300-word LogicMaster prompt rescued tiny Qwen models but actively crippled several others. Still roulette.
- Architecture & training data: the single biggest predictor of success.
- Tool usage: the ultimate cheat code. One model + one tiny script matches the best pure-reasoning accuracy and beats every pure-reasoning attempt on speed.
The Only Rule You Need in Late 2025
When the search space is under 10,000 candidates, drop the parameter hammer.
Just ask a vibe-coding tool such as Claude Code or Google Antigravity to write and run a short script instead.
They’ll be done before the 120B model has warmed up.
Two hours, thirty-five bruised hammers, one inescapable truth:
Even a screw works as a nail if you hit it with a big enough model…
…but maybe just use a screwdriver.
The code was 042 all along.
Some models hammered until they got lucky.
Claude Code reached for the screwdriver and was done before my coffee cooled.