TL;DR
We built a web-based Minesweeper game where AI models compete head-to-head on identical boards. What started as a fun demonstration became an uncomfortable lesson in LLM limitations. Even with 30B+ parameter models, the AI couldn’t reliably pick coordinates from an explicit list without extensive guardrails. It couldn’t resist inventing coordinates, selecting already-revealed cells, or pattern-matching on the wrong part of the prompt.
After five major iterations of prompt engineering, we achieved working AI players—but examining what we built revealed an uncomfortable truth: we’d reduced the LLM’s role to trivial instruction-following. The actual Minesweeper logic—constraint propagation, deducing safe cells from number relationships—happens entirely in deterministic JavaScript. The AI just picks from pre-labeled lists we generate.
The business lesson: LLMs excel at flexible language interpretation but struggle with spatial reasoning, multi-step logic, and precise constraint following. Effective AI applications aren’t about making LLMs smarter—they’re about designing systems where code handles structured reasoning while LLMs handle what they’re actually good at. Know the difference, and you’ll build systems that work.
The Experiment
If you’ve been following my Joshua8.AI experiments, you know I’ve been stress-testing AI models with thinking games. This time: Minesweeper. We built a web-based version with AI capabilities, letting local LLM servers play the game. The crown jewel was “Race Mode”—two AI models competing head-to-head on identical boards.
The initial approach seemed reasonable. Convert the board to a 2D array, send it to the LLM with instructions to analyze and respond with a move in JSON format. The AI should read the grid, understand that a “1” means one adjacent mine, and deduce safe cells.
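For context, the first version of the prompt builder looked roughly like the sketch below. It is a minimal illustration, not the project’s actual code: the cell fields and the buildPrompt name are made up for this example, and the real prompt wording was longer.

```javascript
// Minimal sketch of the initial approach (illustrative names, not the actual code).
// Revealed cells are serialized as their adjacent-mine count, hidden cells as "?",
// flags as "F", and the model is asked to answer with a single JSON move.
function buildPrompt(board) {
  const grid = board.cells.map(row =>
    row.map(cell => {
      if (cell.flagged) return "F";
      if (!cell.revealed) return "?";
      return String(cell.adjacentMines);
    }).join(" ")
  ).join("\n");

  return [
    "You are playing Minesweeper. Here is the board:",
    grid,
    'Respond with JSON only, e.g. {"action": "reveal", "row": 2, "col": 3}.'
  ].join("\n\n");
}
```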
Here’s where it got interesting.
The Cascade of Failures
Failure 1: Spatial Blindness. The AI immediately and repeatedly selected already-revealed cells. Even with hidden cells clearly distinguished from revealed numbers, the models couldn’t parse 2D spatial relationships. They’d analyze a cell showing “1” and try to reveal that same cell rather than its hidden neighbors.
Failure 2: Coordinate Hallucination. We pre-computed valid moves and presented them explicitly: “VALID MOVES: (0,4), (0,5), (1,4).” Given this list, the AI would output (3,0), a coordinate that appeared nowhere in it, apparently interpolated from the pattern of the listed values. It invented coordinates that didn’t exist.
Failure 3: Pattern Matching Gone Wrong. We added constraint hints like “Cell (2,3)=1: 1 mine among [(1,2), (1,3)].” The AI would output (2,3)—the revealed cell providing the constraint, not a valid hidden cell. It pattern-matched on coordinate formats rather than understanding semantic meaning.
Failure 4: Priority Blindness. When the AI finally picked valid cells, it made poor strategic choices. Given guaranteed-safe cells and 50/50 guesses, it would flag the uncertain cell instead of revealing the safe one. It understood Minesweeper rules in principle but couldn’t prioritize its moves accordingly.
Each failure required another iteration. More explicit instructions. More guardrails. More preprocessing in our code.
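Much of that preprocessing was defensive. Here is a simplified sketch of the kind of guardrail we converged on, with illustrative names rather than the project’s actual API:

```javascript
// Simplified guardrail: reject any move the current game state does not allow.
// Hallucinated coordinates and already-revealed cells both fail the same check,
// because neither appears in the pre-computed valid-move list.
function validateMove(move, validMoves) {
  if (!move || !Number.isInteger(move.row) || !Number.isInteger(move.col)) {
    return { ok: false, reason: "malformed or missing coordinates" };
  }
  const listed = validMoves.some(m => m.row === move.row && m.col === move.col);
  if (!listed) {
    return { ok: false, reason: `(${move.row},${move.col}) is not a valid move` };
  }
  return { ok: true };
}
// A rejected move can be retried with the error message added to the prompt,
// or replaced outright with a random pick from validMoves.
```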
The Uncomfortable Realization
After five major iterations, we achieved working AI players. But examining what we built revealed an uncomfortable truth.
We had reduced the LLM’s role to trivial instruction-following.
The actual Minesweeper logic—constraint propagation, deducing safe cells and mines from number relationships—happens entirely in our JavaScript. Our boardToAIFormat() function does the real work: identifying frontier cells, analyzing constraints, finding guaranteed-safe cells and guaranteed mines, and labeling them clearly as “SAFE” or “MINES.”
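To show where the reasoning actually lives, here is a stripped-down sketch of that analysis. The real boardToAIFormat() covers more cases; this version shows only the single-cell deductions, and the board helper methods (revealedNumberCells, hiddenNeighbors, flaggedNeighbors) and the key field are assumed for brevity.

```javascript
// Stripped-down sketch of the deterministic analysis (single-cell deductions only).
// For each revealed number, compare its value against its flagged and hidden neighbors:
//  - if no mines remain unaccounted for, every hidden neighbor is safe;
//  - if the remaining mines equal the hidden neighbors, every hidden neighbor is a mine.
function analyzeConstraints(board) {
  const safe = new Set();
  const mines = new Set();
  for (const cell of board.revealedNumberCells()) {
    const hidden = board.hiddenNeighbors(cell);
    const flagged = board.flaggedNeighbors(cell);
    const remaining = cell.adjacentMines - flagged.length;
    if (remaining === 0) {
      hidden.forEach(n => safe.add(n.key));
    } else if (remaining === hidden.length) {
      hidden.forEach(n => mines.add(n.key));
    }
  }
  return { safe: [...safe], mines: [...mines] };
}
```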
The AI’s job became (sketched in code after this list):
- See “SAFE” label → pick one of those cells
- See “MINES” label → flag one of those cells
- Otherwise → pick randomly from valid moves
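Spelled out as code, the entire decision we were asking the LLM to make is a few lines. This is an illustration of that point, not a function from the project; in the real system the model makes this choice from the labeled prompt.

```javascript
// What the prompt effectively asks the model to do: a trivial priority rule
// followed by a random guess when no labeled cells exist.
function pickMove({ safe, mines, validMoves }) {
  if (safe.length > 0) return { action: "reveal", cell: safe[0] };
  if (mines.length > 0) return { action: "flag", cell: mines[0] };
  const i = Math.floor(Math.random() * validMoves.length);
  return { action: "reveal", cell: validMoves[i] };
}
```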
For comparison, our JavaScript Perfect Solver implements actual constraint propagation and achieves 100% win rates on solvable boards. It does real reasoning. The LLM mostly follows explicit hints and guesses randomly when no deterministic solution exists.
The prompt engineering effort wasn’t teaching the AI to play Minesweeper better. It was compensating for fundamental limitations by moving all reasoning into deterministic code.
Why This Matters for Your Business
This pattern repeats across AI implementations. Language models excel at natural language understanding, pattern matching, flexible output formatting, and following well-structured instructions. They struggle with spatial reasoning, multi-step logical deduction, maintaining precise constraints, and structured problem-solving.
The lesson isn’t that AI is useless. It’s that AI is differently capable.
Hybrid systems where code handles structured logic while LLMs handle flexible interpretation can be powerful. But you must recognize what the LLM actually contributes versus what you might wishfully attribute to “AI reasoning.”
For small to medium-sized businesses (SMBs) building AI applications, this means:
- Don’t expect LLMs to reason about structured data. Pre-process it into explicit options.
- Validate outputs aggressively. Even simple list-picking requires guardrails.
- Put logic in code, flexibility in LLMs. Let each component do what it’s good at.
- Test with adversarial cases. If the AI can misinterpret something, it will.
The gap between “understands the rules when asked” and “can reliably execute the logic” proved enormous. Recognizing that gap is the difference between AI projects that work and the 95% that fail.
The Bottom Line
Building this Minesweeper AI was a masterclass in LLM limitations. We started expecting the AI to reason about a game board and ended up spoon-feeding it pre-computed answers with explicit labels. The final system works, but the intelligence lives almost entirely in traditional code.
That’s not a failure—it’s a design pattern. The AI serves as a flexible but unreliable interface layer that required extensive engineering to constrain into correct behavior. Knowing that upfront would have saved us five iterations.
Know what AI actually does. Build systems accordingly. That’s how you conquer AI instead of being conquered by it.
Appendix: About That “We”
A note on pronouns: the “we” throughout this post refers to myself and Claude Code working together. I haven’t settled on the right way to describe human-AI collaboration yet, so “we” it is until someone convinces me otherwise.
Here’s the meta-lesson this project revealed about AI coding tools themselves. Given a complete GitHub specification—game rules, UI requirements, API structure—Claude Code one-shotted the entire Minesweeper application. Working game, race mode, local LLM integration. Done.
The prompts were a different story. Those required human intervention and a few hours of experimentation. No amount of specification could shortcut the iterative discovery of how LLMs actually behave when faced with spatial reasoning tasks.
AI coding tools like Claude Code (and there are other excellent ones) excel at translating well-defined specifications into working code. They struggle with the ambiguous, experimental work of figuring out what the specification should be in the first place.
The human in the loop isn’t going anywhere. AI just saves time on the implementation.