The Decision Chart: Which Model Should You Actually Use? (Part 5)

Part 5 of the 600,000 Questions benchmark series


TL;DR

After 200+ GPU-hours and 300,000+ inference calls, here’s what I’d actually deploy: Phi4 14B for math-heavy workloads on consumer hardware, Qwen3-Next 80B Q4_K_M for general-purpose assistants on prosumer rigs, and Qwen3-VL AWQ on vLLM for high-throughput batch processing. Skip the 7B models for anything requiring reasoning. Avoid quantizations below Q4 unless you’re desperate for VRAM. And if you’re processing batches, parallelize—you’re leaving 5-9x performance on the table otherwise.


After four parts of detailed analysis, let’s cut to the chase. You have a specific use case, specific hardware, and specific requirements. Here’s how to navigate the decision.

The Decision Flowchart

```mermaid
flowchart TD
    A[What's your primary task?] --> B[Math/Code/Reasoning]
    A --> C[General Knowledge/Chat]
    A --> D[Batch Processing]
    B --> E{VRAM?}
    E --> |< 16GB| F[Phi4 14B<br>95.1% GSM8K]
    E --> |16-32GB| G[Qwen3-Coder 30B<br>95.1% GSM8K, fastest]
    E --> |48GB+| H[Qwen3-Next 80B Q4_K_M<br>96.3% GSM8K]
    C --> I{VRAM?}
    I --> |< 16GB| J[Phi4 14B<br>77.9% MMLU]
    I --> |16-32GB| K[Qwen3-VL AWQ on vLLM<br>78.9% MMLU, fast]
    I --> |48GB+| L[Qwen3-Next 80B Q4_K_M<br>83.8% MMLU]
    I --> |96GB+| M[GPT-OSS 120B Q4<br>87.9% MMLU]
    D --> N[vLLM + AWQ + parallel=16<br>9x throughput]
```

Quick Reference Table

| Use Case | Model | Quant | VRAM | MMLU | GSM8K | Speed |
|---|---|---|---|---|---|---|
| Budget math | Phi4 | default | 10GB | 77.6% | 95.1% | 2.0s/q |
| Budget general | Phi4 | default | 10GB | 77.6% | 95.1% | 0.45s/q |
| Mid-tier math | Qwen3-Coder | default | 20GB | 75.3% | 95.1% | 1.3s/q |
| Mid-tier general | Qwen3-VL | AWQ-4bit | 20GB | 78.9% | 95.2% | 0.03s/q |
| High-end | Qwen3-Next 80B | Q4_K_M | 50GB | 83.8% | 96.3% | 0.19s/q |
| Maximum accuracy | GPT-OSS 120B | Q4 | 80GB | 87.9% | - | 1.3s/q |
| Predictable latency | Gemma3-27B-IT | QAT | 18GB | 74.5% | 94.6% | 0.24s/q |

The Detailed Breakdown

If you need math/code and have limited VRAM:

Phi4 14B is the clear winner. At 95.1% GSM8K accuracy, it matches models 4-6x its size. It fits comfortably in 10GB VRAM and runs at 150+ tok/s. There’s no reason to use anything else in the sub-16GB tier for reasoning tasks.

Do NOT use: Mistral-7B (53.1% GSM8K—catastrophic failure on math)
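
If you want to sanity-check Phi4 on your own hardware before committing, here is a minimal sketch using the Ollama Python client. It assumes a local Ollama server and that the model has been pulled under the phi4 tag; adjust the tag to whatever your local library shows.

```python
# Minimal sketch: one math question against Phi4 via the Ollama Python client.
# Assumes a local Ollama server and a model pulled as "phi4" (the tag is an assumption).
import ollama

response = ollama.chat(
    model="phi4",
    messages=[{
        "role": "user",
        "content": "A train covers 180 km in 90 minutes. What is its average speed in km/h?",
    }],
)
print(response["message"]["content"])
```

The same pattern works for any Ollama-served model in this series; only the tag changes.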

If you need general knowledge and have limited VRAM:

Phi4 14B again. Its 77.9% MMLU is competitive with 30B models. The only reason to look elsewhere is if you need the absolute highest accuracy and have the VRAM to spare.

If you have 16-32GB VRAM:

→ For math: Qwen3-Coder 30B delivers 95.1% GSM8K in just 1.34s per question—the fastest math performance in the benchmark. It’s optimized for code and reasoning tasks.

→ For general: Qwen3-VL AWQ on vLLM hits 78.9% MMLU at 0.030s per question. That’s 15x faster than Phi4 for similar accuracy. The catch: you need vLLM, not Ollama (a minimal loading sketch follows after this list).

→ For predictable latency: Gemma3-27B-IT-QAT has the lowest variance of any model tested. Its QAT training produces rock-solid response times without the “thinking spirals” that plague post-training quantized models.
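
For the Qwen3-VL AWQ pick above, here is a minimal sketch of loading an AWQ checkpoint with vLLM's offline API. The model ID is a placeholder for whichever AWQ export you actually deploy, and the memory fraction is a starting point rather than a tuned value.

```python
# Minimal sketch: offline inference with an AWQ-quantized checkpoint in vLLM.
# The model ID below is a placeholder; point it at your actual AWQ export.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/your-qwen3-vl-awq-checkpoint",  # placeholder
    quantization="awq",                             # use vLLM's AWQ kernels
    gpu_memory_utilization=0.90,                    # starting point, tune per GPU
)

params = SamplingParams(temperature=0.0, max_tokens=256)
outputs = llm.generate(["Which planet has the shortest day?"], params)
print(outputs[0].outputs[0].text)
```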

If you have 48GB+ VRAM:

Qwen3-Next 80B Q4_K_M is the sweet spot. At 83.8% MMLU and 96.3% GSM8K, it’s within striking distance of the 120B model while requiring half the memory. The Q4_K_M quantization has minimal accuracy impact.

If you have 96GB+ VRAM and need maximum accuracy:

GPT-OSS 120B Q4 leads at 87.9% MMLU. But honestly, the roughly 4-point gain over the 80B model rarely justifies the jump from 50GB to 80GB of VRAM and the much higher per-question latency. Only go here if accuracy is truly critical.

What NOT to Do

Don’t use Q2_K or Q3_K_M quantizations. The accuracy loss (17-21% below FP16) is severe, and, counterintuitively, they’re often slower than higher-bit quantizations because they fall into the extended reasoning patterns described earlier.

Don’t use 7B models for reasoning. Mistral-7B’s 53.1% on GSM8K isn’t a typo. Small models collapse on multi-step reasoning regardless of how fast they generate tokens.

Don’t trust tok/s as a speed metric. A model generating 300 tok/s but needing 100 tokens to answer is slower than one generating 100 tok/s but needing 10 tokens. Measure wall-clock time.
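
To apply that advice directly, the sketch below times end-to-end seconds per question against any OpenAI-compatible endpoint (both Ollama and vLLM expose one); the base URL and model name in the usage comment are placeholders.

```python
# Minimal sketch: measure wall-clock seconds per answered question, not tokens/second.
# Works against any OpenAI-compatible /v1/chat/completions endpoint (Ollama, vLLM, ...).
import time
import requests

def seconds_per_question(base_url: str, model: str, questions: list[str]) -> float:
    start = time.perf_counter()
    for q in questions:
        resp = requests.post(
            f"{base_url}/v1/chat/completions",
            json={"model": model, "messages": [{"role": "user", "content": q}]},
            timeout=300,
        )
        resp.raise_for_status()
    return (time.perf_counter() - start) / len(questions)

# Example with placeholder URL and model tag:
# seconds_per_question("http://localhost:11434", "phi4", ["What is 17 * 23?"])
```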

Don’t process batches sequentially. If you’re running vLLM for batch inference, parallel=16 gives you 5-9x throughput with zero accuracy loss. Sequential processing is leaving money on the table.
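
In practice, "parallelize" just means keeping many requests in flight so the server's continuous batching can fill the GPU. A hedged sketch: the endpoint and model name below are placeholders, and 16 concurrent workers mirrors the parallel=16 setting from the earlier parts.

```python
# Minimal sketch: 16 concurrent requests against a vLLM (or Ollama) OpenAI-compatible
# server instead of one question at a time. URL and model name are placeholders.
from concurrent.futures import ThreadPoolExecutor
import requests

BASE_URL = "http://localhost:8000"   # vLLM's default serving port
MODEL = "your-served-model"          # placeholder

def ask(question: str) -> str:
    r = requests.post(
        f"{BASE_URL}/v1/chat/completions",
        json={"model": MODEL, "messages": [{"role": "user", "content": question}]},
        timeout=600,
    )
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

questions = [f"Is {n} a prime number? Answer yes or no." for n in range(2, 102)]
with ThreadPoolExecutor(max_workers=16) as pool:   # 16 requests in flight at once
    answers = list(pool.map(ask, questions))
```

If you use vLLM's offline LLM.generate() instead, passing the whole prompt list in a single call gets you the same effect, since vLLM batches internally.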

Don’t assume FP16 is slow. For the Qwen3-30B family, FP16 was actually faster than quantized versions because it generates more concise responses. Test your specific model.

The Surprising Recommendations

A few results that contradicted my expectations:

  1. Q6_K beats Q8_0 on accuracy (71.8% vs 71.1% for Qwen3-30B). K-quant’s smart precision allocation matters more than uniform bit depth.

  2. Phi4 14B matches 80B models on math. Architecture and training data trump parameter count for reasoning tasks.

  3. vLLM at 66.9 tok/s beats Ollama at 310.6 tok/s on wall-clock time. Token generation speed is meaningless if you’re generating unnecessary tokens.

  4. QAT quantization has lower variance than FP16. Gemma3’s quantization-aware training produced more stable outputs than full-precision models with post-hoc quantization.

My Personal Setup

For what it’s worth, here’s what I actually run:

  • Daily coding assistant: Qwen3-Coder 30B on Ollama. Fast math, good code completion, fits in my GPU.
  • Research/complex questions: Qwen3-Next 80B Q4_K_M when I need deeper reasoning.
  • Batch processing: Qwen3-VL AWQ on vLLM with parallel=16. Throughput is king for bulk work.

The 80B model stays loaded most of the time. The 30B model is for quick iterations where latency matters more than depth.
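
One practical detail worth noting: Ollama unloads idle models after a few minutes by default, so keeping the 80B resident means setting the OLLAMA_KEEP_ALIVE environment variable or passing keep_alive per request. A minimal sketch, with a placeholder model tag:

```python
# Minimal sketch: ask Ollama to keep a model resident in VRAM between requests.
# The tag below is a placeholder for your local Qwen3-Next 80B Q4_K_M build.
import ollama

response = ollama.chat(
    model="qwen3-next-80b:q4_k_m",   # placeholder tag
    messages=[{"role": "user", "content": "Summarize the trade-offs of MoE models in two sentences."}],
    keep_alive=-1,                   # negative value keeps the model loaded indefinitely
)
print(response["message"]["content"])
```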

Final Thoughts

The local LLM landscape rewards specificity. There’s no single “best model”—there’s the best model for your VRAM budget, your task type, your latency requirements, and your accuracy threshold.

The data in this series should help you navigate those trade-offs. But ultimately, the right answer is to benchmark on your workload. The 14,042 MMLU questions and 1,319 GSM8K problems I used are proxies for general capability. Your actual use case may have different characteristics.

The space heater is finally cooling down. The benchmarks are complete. And somewhere in these numbers is the answer to “which model should I use?”—you just have to find the row that matches your constraints.

Happy inferencing.


This concludes the 5-part series on local LLM benchmarking. Start from Part 1 if you want the full journey.

Benchmarks conducted January 17-20, 2026 on an Intel Ultra 7 system with an NVIDIA RTX 6000 Blackwell Pro QMax GPU. Full MMLU (14,042 questions) and GSM8K (1,319 questions) datasets. Inference via Ollama 0.14.1 and 0.14.3, and vLLM 0.13.0.