Even a Screw Works as a Nail If You Hit It with a Big Enough Hammer

LLM Models, Parameters, Quantizations, Prompts, and Tool Usage

A two-hour Friday-night experiment with 35+ models and the infamous 3-digit Mastermind puzzle (secret code 042)

The Clues

682 → one correct and well-placed
614 → one correct but wrong-placed
206 → two correct but wrong-placed
738 → nothing correct
780 → one correct but wrong-placed
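
For anyone who wants to check the clues mechanically, here is a minimal sketch of how they could be encoded. It assumes standard Mastermind scoring, where "one correct and well-placed" means exactly one digit matches in position and no other digit of the guess appears in the code. The names feedback, CLUES and satisfies_all are illustrative and do not come from the experiment.

```python
from collections import Counter

def feedback(secret: str, guess: str) -> tuple[int, int]:
    """Return (well_placed, wrong_placed) for a guess against a secret.

    Assumption: standard Mastermind scoring, repeated digits allowed.
    """
    well_placed = sum(s == g for s, g in zip(secret, guess))
    # Digits shared regardless of position, counted with multiplicity.
    shared = sum((Counter(secret) & Counter(guess)).values())
    return well_placed, shared - well_placed

# The five clues as (guess, well_placed, wrong_placed).
CLUES = [
    ("682", 1, 0),
    ("614", 0, 1),
    ("206", 0, 2),
    ("738", 0, 0),
    ("780", 0, 1),
]

def satisfies_all(secret: str) -> bool:
    """True if the candidate is consistent with every clue."""
    return all(feedback(secret, guess) == (wp, mp) for guess, wp, mp in CLUES)

print(satisfies_all("042"))  # True
```

Under that reading of the clues, 042 passes all five checks, while a near miss such as 062 already fails the first one.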

I forced thirty-five open-source models to solve it the hard way: pure step-by-step deduction, no code allowed.

Then I opened Claude Code and told Opus 4.5: “Solve this by writing and running a short Python script.”

Two hours later the verdict was merciless.

The Hammer-vs-Screwdriver Leaderboard

Approach	Accuracy	Time to Answer	Verdict
Qwen3 family (1.7B – 30B, Q4)	13/15 correct	17–155 s	Best hammer available
GPT-OSS 20B & 120B (Q4)	100%	~15–24 s	Bigger hammer, identical result
Gemma3 27B (Q4, Q8, FP16)	0/8 correct	0.2–49 s	Precise hammer that still misses the nail
Devstral / Mistral 7B / GLM-Air	Chaos → timeout	Sometimes infinite	Fancy hammer, wrong head
Claude Code + Opus 4.5 + 15 lines of Python	100%	1.3 milliseconds	Didn’t swing a hammer. Just used a screwdriver

The Run That Made Every Other Run Look Silly

Claude Code took ten seconds to think, then executed a 15-line brute-forcer that checked all 1,000 possibilities and printed 042 before most 30B models had finished their first paragraph.
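
The exact script Claude Code wrote is not reproduced here, so the following is only a sketch of what a brute-forcer of that size might look like. It assumes standard Mastermind scoring with repeated digits allowed; score and solve are hypothetical names chosen for illustration.

```python
from collections import Counter
from itertools import product

# Each clue maps a guess to (well_placed, wrong_placed).
CLUES = {"682": (1, 0), "614": (0, 1), "206": (0, 2), "738": (0, 0), "780": (0, 1)}

def score(secret, guess):
    # Assumption: standard Mastermind scoring.
    exact = sum(s == g for s, g in zip(secret, guess))
    shared = sum((Counter(secret) & Counter(guess)).values())
    return exact, shared - exact

def solve():
    # Enumerate all 1,000 codes from 000 to 999 and keep the consistent ones.
    candidates = ("".join(digits) for digits in product("0123456789", repeat=3))
    return [c for c in candidates if all(score(c, g) == fb for g, fb in CLUES.items())]

print(solve())  # ['042']
```

Returning a list rather than the first hit is a deliberate choice: it shows that 042 is the only code consistent with all five clues, not just one that happens to fit.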

Same underlying intelligence, different tool.

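Result: the script itself ran roughly four to five orders of magnitude faster than the local models (1.3 ms against 15–155 s), with zero hallucinations and perfect reliability. Even with the ten seconds of thinking included, it beat every local model that answered correctly.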

What Two Hours of Swinging Hammers Proved

Parameters: massive gains 1B → ~20B, then almost zero extra value.
Quantization: Q4 vs FP16 changes almost nothing on final correctness.
Prompts: a “perfect” 300-word LogicMaster prompt rescued tiny Qwen models but actively crippled several others. Still roulette.
Architecture & training data: the single biggest predictor of success.
Tool usage: the ultimate cheat code. One model + one tiny script matches the best pure-reasoning accuracy and beats every pure-reasoning attempt on speed.

The Only Rule You Need in Late 2025

When the search space is under 10,000 candidates, drop the parameter hammer.

Just ask a vibe-coding tool such as Claude Code or Google Antigravity to write and run a short script instead.

They’ll be done before the 120B model has warmed up.

Two hours, thirty-five bruised hammers, one inescapable truth:

Even a screw works as a nail if you hit it with a big enough model…

…but maybe just use a screwdriver.

The code was 042 all along.

Some models hammered until they got lucky.

Claude Code reached for the screwdriver and was done before my coffee cooled.