A More Detailed Look at 'Even a Screw Works as a Nail If You Hit It with a Big Enough Hammer'

A Large Independent Mastermind-Style PIN Puzzle Benchmark
December 2025 – 620+ runs, 73 local models, 10 frontier APIs

TL;DR – The Ultimate Irony in One Paragraph

On 1 December 2025 the best AI frontier models collectively spent $2.40, 42 minutes of inference time, and ~187,000 reasoning tokens trying to solve a single 4-digit PIN puzzle… and every single one of them failed.

Meanwhile, a 35-line Python script generated in ten seconds by any decent coding assistant solves the identical puzzle in 31 milliseconds, uses less than a joule of energy, and is correct with mathematical certainty.

The same models that can perfectly write the solver cannot be trusted to reason through the problem themselves. That is the state of “pure reasoning” in late 2025.

Recommendation: Never waste tokens on pure chain-of-thought for any constraint-satisfaction problem with an enumerable search space (Mastermind, Sudoku, small scheduling, verification tasks, etc.). Give your agent tool use — specifically code execution — on day one. It instantly turns unreliable probabilistic reasoning into deterministic perfection and shrinks cost and latency by four to six orders of magnitude. In 2025 and beyond, the single highest-ROI feature you can add to any LLM system is the ability to say “stop thinking, start computing.”

The Puzzles

3-digit puzzle – answer 042

682: One digit correct and well placed
614: One digit correct but wrongly placed
206: Two digits correct but wrongly placed
738: Nothing correct
780: One digit correct but wrongly placed

4-digit puzzle – answer 5930

3593: Three digits correct but wrongly placed
2266: Nothing correct
8348: One digit correct but wrongly placed
8085: Two digits correct but wrongly placed
1489: One digit correct but wrongly placed

These are classic Bulls-and-Cows (Mastermind) constraint-satisfaction problems – exactly the kind of systematic deduction that LLMs are supposed to excel at in 2025.
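The clue wording maps directly onto standard Mastermind scoring: "correct and well placed" digits are bulls, "correct but wrongly placed" digits are cows, counted as multisets. A minimal scoring function makes the clue format precise (the function name and clue restatement here are illustrative, not taken from the benchmark harness):

```python
from collections import Counter

def score(guess: str, answer: str) -> tuple[int, int]:
    """Return (bulls, cows): exact-position matches, then
    right-digit-wrong-position matches, counted as multisets."""
    bulls = sum(g == a for g, a in zip(guess, answer))
    common = sum((Counter(guess) & Counter(answer)).values())
    return bulls, common - bulls

# The 4-digit clues above, restated as (bulls, cows) against 5930:
assert score("3593", "5930") == (0, 3)  # three correct, wrongly placed
assert score("2266", "5930") == (0, 0)  # nothing correct
assert score("8348", "5930") == (0, 1)  # one correct, wrongly placed
assert score("8085", "5930") == (0, 2)  # two correct, wrongly placed
assert score("1489", "5930") == (0, 1)  # one correct, wrongly placed
```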

Scale of the Benchmark

  • 73 local models (4B → 120B parameters) via Ollama on an RTX 6000 Blackwell → 584 individual runs
  • 10 frontier models × 2 generations via OpenRouter → ~60 additional runs
  • Total: more than 620 attempts at pure logical reasoning

Every run used identical prompts and automated answer extraction.

Local Model Results (default temperature)

| Condition | Accuracy |
| --- | --- |
| 3-digit baseline | 31.5% |
| 3-digit + LogicMaster | 39.7% |
| 4-digit baseline | 23.3% |
| 4-digit + LogicMaster | 32.9% |
| Overall | 31.8% |

The LogicMaster system prompt (explicit rules + “act as a flawless deduction engine”) helped, but only by ~8–10 percentage points – and actually made several models worse.

The Eight Local Champions (perfect 4/4)

| Model | Size | Notes |
| --- | --- | --- |
| phi4-reasoning:14b | 14B | Microsoft reasoning fine-tune |
| Phi-4-reasoning-plus-GGUF | 14B | Unsloth community fine-tune |
| qwen3:30b-a3b-instruct-2507 (fp16 & q4) | 30B | Alibaba |
| qwen3:30b-a3b-thinking-2507-q4_K_M | 30B | “thinking” variant |
| qwen3:4b-thinking-2507-fp16 | 4B | A 4-billion-parameter model beat 70B+ giants |
| AM-Thinking-v1 | ~32B | Community reasoning fine-tune |
| gpt-oss:20b | 20B | Open-source GPT-style |

A 4B model achieving perfection is one of the clearest demonstrations yet that targeted reasoning training matters more than raw parameter count.

Frontier Model Results – 3-digit puzzle (easier)

| Model | Score | Cost (4 runs) | Notes |
| --- | --- | --- | --- |
| Gemini 3 Pro | 4/4 | $0.03 | Cheapest frontier model, fastest, perfect even at temp=0 |
| Grok-4 | 4/4 | $0.04 | Perfect, very slow (~2 min/response) |
| Claude Opus 4.5 | 3/4 | $0.31 | Failed one temp=0 run |
| Claude Sonnet 4.5 | 3/4 | $0.05 | |
| Claude Haiku 4.5 | 1/4 | $0.08 | Extremely verbose, worst performer |

Frontier Model Results – 4-digit puzzle (the real test)

| Model | Correct | Typical wrong answer | Tokens used | Cost |
| --- | --- | --- | --- | --- |
| Claude Opus 4.5 | 0/2 | 0962 / 6942 | 2–24k | ~$0.19 |
| Claude Sonnet 4.5 | 0/2 | 0912 | 24k | $0.36 |
| Gemini 3 Pro | 0/2 | 6942 | 21k | $0.21 |
| Grok-4 | 0/2 | 6942 | 54k | $0.82 |

Total across 9 frontier models: 187,624 tokens, $2.40, 42+ minutes → 0% accuracy

The Temperature=0 Disaster

Deterministic mode (temp=0) is widely recommended for reasoning.

Reality:

  • Local accuracy collapsed from 31.8% → 15.8%
  • Every single Claude 4.5 model failed at least one temp=0 condition
  • Only Gemini 3 Pro and Grok-4 stayed perfect

Many leading models need randomness as a crutch to escape reasoning dead-ends.

The Computational Solution (the one that actually works)

```python
for pin in range(10000):
    if satisfies_all_clues(pin):
        print(f"{pin:04d}")   # → 5930
```

  • Runtime on one core (Ryzen 9 9950X): 31 ms
  • Energy consumption: 0.59 joules (≈ $0.000026 of electricity)
  • Accuracy: 100% (mathematical proof)
  • Cost after code generation: ~3¢ (tokens to write the script) + 0.59 J to execute
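Spelled out in full, with `satisfies_all_clues` replaced by an explicit clue table (the helper names and clue encoding below are mine, not the post's original 35-line script), the whole solver fits in a few lines:

```python
from collections import Counter

def score(guess: str, answer: str) -> tuple[int, int]:
    """Standard Mastermind scoring: (bulls, cows) as multiset counts."""
    bulls = sum(g == a for g, a in zip(guess, answer))
    common = sum((Counter(guess) & Counter(answer)).values())
    return bulls, common - bulls

# (guess, (bulls, cows)) encoding of the five 4-digit clues
CLUES = [
    ("3593", (0, 3)),  # three digits correct, wrongly placed
    ("2266", (0, 0)),  # nothing correct
    ("8348", (0, 1)),  # one digit correct, wrongly placed
    ("8085", (0, 2)),  # two digits correct, wrongly placed
    ("1489", (0, 1)),  # one digit correct, wrongly placed
]

solutions = [
    f"{pin:04d}" for pin in range(10000)
    if all(score(guess, f"{pin:04d}") == expected for guess, expected in CLUES)
]
print(solutions)  # → ['5930'] (the solution is unique)
```

Exhaustive search over all 10,000 candidates is what makes the answer provably correct: the loop doesn't just find 5930, it certifies that no other PIN satisfies the clues.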

Even a 4B local model with a code interpreter beats every pure-reasoning frontier giant.

Final 2025 Leaderboard – Real-World Performance

| Rank | Solution | Correct | Monetary Cost per Solve | Energy per Solve | Notes |
| --- | --- | --- | --- | --- | --- |
| 1 | Any LLM + code execution | 100% | ~3¢ (code gen) + $0.000026 | 0.59 joules | Mathematically guaranteed |
| 2 | Gemini 3 Pro (API, no tools) | 100%* | $0.0075 | ~15–25 kJ (cloud) | *only on 3-digit; fails 4-digit |
| 3 | Grok-4 (API, no tools) | 100%* | $0.010 | ~40–60 kJ (cloud) | *only on 3-digit |
| 4 | Local Phi-4-reasoning-14B or Qwen3-4B-thinking | 100% | $0 (your GPU) | <3 J with code exec | Offline + deterministic |
| 5 | Claude Opus 4.5 (pure reasoning) | 0–75% | $0.10–$0.31 | Hundreds of kJ | Expensive and unreliable on hard puzzles |

Key Takeaways – December 2025

  1. Pure chain-of-thought reasoning remains brittle and unreliable on hard constraint problems.
  2. Tool use (code execution) is the single largest immediate performance multiplier in applied AI.
  3. Reasoning-specialized small models running locally + a code interpreter routinely crush $100/M-token cloud giants.
  4. Temperature=0 hurts more than it helps on multi-step deduction; keep 0.2–0.5 or use voting.
  5. Cost per correct answer varies by >10,000× depending on whether you let the model “think” or let it compute.
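The voting suggested in point 4 can be as simple as sampling several completions at moderate temperature, extracting each answer, and taking the mode. A sketch of that last step (the function name is mine; how you obtain the sampled answers depends on your client):

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Return the most common answer across sampled completions."""
    return Counter(answers).most_common(1)[0][0]

# e.g. five samples at temperature ~0.3, three of which agree:
assert majority_vote(["5930", "6942", "5930", "0962", "5930"]) == "5930"
```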

The smartest capability in late 2025 is not a bigger context window or a better reasoning chain.

It is the meta-capability to recognize when reasoning is the wrong tool — and to delegate to deterministic computation instead.

That 0.59 joules on your own machine isn’t just cheaper and faster than $2.40 of cloud inference.

It is the difference between probabilistic hallucination and provable correctness.

And that gap will only widen in 2026.


This post expands on the findings from our earlier experiment, Even a Screw Works as a Nail If You Hit It with a Big Enough Hammer, scaling from 35 runs to over 620.