Which Qwen Should You Use? A Practical Benchmark for SMBs

TL;DR: If you’re a small business wondering which AI model to choose, the rest of this post has the details. Or just contact Joshua8.AI and let us answer it for you.


It’s Sunday night and there’s a dusting of snow outside—late December. My wife is watching yet another Hallmark Christmas movie, which means it’s the perfect time to log into the basement servers and put them to work. I recently repositioned the rack directly beneath the family room, so that NAS array and those three CPUs and five GPUs crunching AI inference now double as a space heater. The warmth rising through the floorboards is all the motivation I need to answer the question I’ve been asked three times this month:

“There are like ten different Qwens now. Which one should I actually use?”

Fair question. Alibaba has been shipping Qwen variants like they’re getting paid per model release. Dense models, MoE models, vision models, thinking models, different quantizations—the menu has gotten unwieldy. For a small business trying to deploy a local LLM, the choice paralysis is real.

So I did what any reasonable person would do on a Sunday night: I built a benchmark.

Key Findings

I tested 9 models across 5 practical small and medium business (SMB) tasks: CSI MasterFormat classification, email drafting, document summarization, code generation, and math. 300 test cases in the main benchmark, plus 300 more testing quantization methods and model variants. Here’s what I found:

GGUF quantization beats AWQ by 13 points. On domain-specific tasks, Ollama’s Q4_K_M format consistently outperformed vLLM’s AWQ, and it’s not the sampling parameters: I tested both with identical settings.

“Thinking” vs “Instruct” is a wash. On 50-case CSI classification, qwen3-next:80b Thinking scored 54% vs Instruct at 58%. The small-sample 10-case results were misleading—chain-of-thought reasoning doesn’t help domain classification.

Tokens per second isn’t everything. qwen3-next:80b-a3b (Instruct) only generates 45 tok/s, but it finished the benchmark in 0.6 minutes because it gives concise, direct answers instead of verbose explanations.

The biggest Qwen wasn’t the best. qwen3-next:80b took 27 minutes to finish what gpt-oss:120b did in 4 minutes—and scored lower on math.

Fine-tuning approach mattered for math. qwen3-next:80b-a3b Instruct hit 100% while qwen3-next:80b Thinking hit 50%—same model, same 3B active parameters, different fine-tuning. Dense models and STEM-focused training also helped.

For practical SMB work, gpt-oss:120b is the all-rounder. Fast (175 tok/s), accurate (78% on domain-specific classification), and reliable across all categories.


The Setup

All models ran locally via Ollama on a single RTX 6000 Blackwell, plus one AWQ model via vLLM on separate hardware for an accuracy-only comparison. No cloud APIs. No token costs. Just raw local inference.

Models tested:

| Model | Type | Speed | VRAM | Notes |
|---|---|---|---|---|
| qwen3-30b-a3b | MoE (3B active) | 203 tok/s | ~19 GB | Fastest via Ollama |
| qwen3-30b-vl | Vision-Language | 199 tok/s | ~19 GB | Text-only in this test |
| qwen3-30b-vl-awq | AWQ via vLLM | — | ~19 GB | Accuracy comparison only |
| qwen3:32b | Dense | 55 tok/s | ~20 GB | Verbose responses |
| qwen3-next:80b (Thinking) | MoE (3B active) | 90 tok/s | ~49 GB | Chain-of-thought reasoning |
| qwen3-next:80b-a3b (Instruct) | MoE (3B active) | 45 tok/s | ~49 GB | Direct instruction following |
| gpt-oss:120b | MoE (128 experts, 4 active) | 175 tok/s | ~73 GB | OpenAI open-weight, Apache 2.0 |
| gemma3:27b | Dense | 67 tok/s | ~17 GB | Google’s contender |
| GLM-4.5-Air | MoE (12B active) | 111 tok/s | ~65 GB | Zhipu AI, 106B total params |

Benchmarks:

  1. CSI MasterFormat Classification: Given construction spec text, classify into the correct Division. Tests domain knowledge in commercial construction. (This benchmark was chosen to support development efforts at TeraContext.AI, one of Joshua8.AI’s incubator companies building AI tools for the commercial construction industry.)
  2. Email Drafting: Write professional emails from scenarios. Tests tone and completeness.
  3. Document Summarization: Compress business documents accurately. Tests comprehension.
  4. Code Generation: Write working Python/Bash scripts. Tests technical capability.
  5. Math: Solve geometry, integration, and differentiation problems. Tests reasoning.

10 test cases per category, 50 cases per model. The initial benchmark covered 6 models (300 cases), with 3 more models added in extended testing.
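
For concreteness, here is a minimal sketch of how a CSI classification case gets scored: exact match on the two-digit division number pulled out of the model’s reply. The case format and the regex extraction are simplifications for illustration, not the verbatim harness.

```python
# Minimal sketch of CSI scoring: exact match on the two-digit division number
# extracted from the model's reply. Case format and regex are illustrative.
import re

def extract_division(reply: str) -> str | None:
    m = re.search(r"\b(\d{2})\b", reply)    # first two-digit number in the reply
    return m.group(1) if m else None

def score(replies: list[str], expected: list[str]) -> float:
    hits = sum(extract_division(r) == e for r, e in zip(replies, expected))
    return hits / len(expected)

# Example: "Division 07 - Thermal and Moisture Protection" scores as correct
# against an expected value of "07".
print(score(["Division 07 - Thermal and Moisture Protection"], ["07"]))  # 1.0
```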


The Quantization Deep Dive: GGUF vs AWQ

Here’s where it gets interesting. The initial 10-case benchmark showed a small gap between Ollama (GGUF Q4_K_M) and vLLM (AWQ 4-bit). Was that noise, or something real?

I ran 200 more tests to find out: 50 CSI classification cases × 2 backends × 2 parameter configurations. Both models are 4-bit quantized, so this isolates the quantization method rather than precision.
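
As a sketch of the sweep (not the exact harness): both backends got the same prompt under each configuration. Ollama takes sampling parameters in an options field, while vLLM’s OpenAI-compatible server takes them at the top level of the request body. The URLs and model tags below are placeholders for whatever your local setup exposes.

```python
# Sketch of the 2-backend x 2-configuration sweep. Endpoints and model tags
# are assumptions for a default local setup; swap in your own.
from itertools import product
import requests

CONFIGS = {
    "qwen-recommended": {"temperature": 0.7, "top_p": 0.8},
    "original":         {"temperature": 1.0, "top_p": 0.95},
}

def ask_ollama(model: str, prompt: str, cfg: dict) -> str:
    # Ollama: sampling parameters go inside "options".
    r = requests.post("http://localhost:11434/api/chat", json={
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "options": cfg,
        "stream": False,
    })
    r.raise_for_status()
    return r.json()["message"]["content"]

def ask_vllm(model: str, prompt: str, cfg: dict) -> str:
    # vLLM (OpenAI-compatible server): sampling parameters sit at the top level.
    r = requests.post("http://localhost:8000/v1/chat/completions", json={
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        **cfg,
    })
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

BACKENDS = {"ollama-gguf": ask_ollama, "vllm-awq": ask_vllm}

for (backend, ask), (cfg_name, cfg) in product(BACKENDS.items(), CONFIGS.items()):
    # In the real run, each of these four combinations sees all 50 CSI cases.
    print(backend, cfg_name, ask("your-model-tag", "Classify this spec text ...", cfg)[:80])
```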

The results were consistent:

| Configuration | Ollama (GGUF) | vLLM (AWQ) |
|---|---|---|
| Qwen recommended (temp=0.7, top_p=0.8) | 50% | 34% |
| Original params (temp=1.0, top_p=0.95) | 44% | 34% |
| Average | 47% | 34% |

That’s a 13 percentage point gap—and it held regardless of sampling parameters. vLLM scored exactly 34% both times.

What this means:

The difference isn’t about temperature or top_p settings. It’s something fundamental about how GGUF Q4_K_M and AWQ 4-bit preserve (or lose) model knowledge during quantization—or possibly differences in post-quantization tuning. I couldn’t find details on whether either model had additional training after quantization.

For general tasks like email and summarization, both performed identically (100%). But for domain-specific classification requiring niche knowledge, GGUF retained more capability in this particular instance.

One caveat for vLLM users: make sure you’re using the right API endpoint. Chat/instruct models need /v1/chat/completions, not /v1/completions. I burned an hour debugging empty responses before catching that. Ollama handles that automatically.
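
If it helps, here is a minimal working call against a local vLLM server using the openai Python client; the base URL and model id are placeholders for your own deployment. The important part is going through the chat completions route.

```python
# Minimal chat call against a local vLLM OpenAI-compatible server.
# base_url and model id are placeholders; the key detail is using
# /v1/chat/completions (via chat.completions), not /v1/completions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="your-awq-model-id",      # whatever `vllm serve` was started with
    messages=[{"role": "user", "content": "Classify this spec text ..."}],
    temperature=0.7,
    top_p=0.8,
)
print(resp.choices[0].message.content)
```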


Domain Knowledge: Size Matters (Sort Of)

CSI MasterFormat is the construction industry’s standard for organizing specifications. Divisions like 03-Concrete, 07-Thermal and Moisture Protection, 22-Plumbing. A useful LLM should be able to classify spec text into the right division.

I ran an extended 50-case CSI benchmark on additional models to get more reliable numbers:

| Model | CSI Accuracy | Test Size | Notes |
|---|---|---|---|
| gpt-oss:120b | 78% | 50 cases | Best overall |
| gemma3-27b | 68% | 50 cases | |
| qwen3-next:80b-a3b (Instruct) | 58% | 50 cases | MoE, 3B active |
| qwen3-next:80b (Thinking) | 54% | 50 cases | Chain-of-thought |
| qwen3:32b | 48% | 50 cases | Dense |
| qwen3-vl:30b (GGUF) | 47% | 50 cases | |
| qwen3-30b-a3b | 38% | 50 cases | MoE, 3B active |
| qwen3-30b-vl-awq (vLLM) | 34% | 50 cases | AWQ penalty |
| GLM-4.5-Air | 26% | 50 cases | Worst on domain knowledge |

The qwen3-next:80b naming is confusing. There are two variants on Ollama:

  • qwen3-next:80b — “Thinking” fine-tune (chain-of-thought reasoning)
  • qwen3-next:80b-a3b-instruct — “Instruct” fine-tune (direct answers)

Both are MoE with 80B total params but only 3B active per token. On a fair 50-case comparison, they scored nearly identically (54% vs 58%). The initial 10-case sample suggested Thinking was better, but that was noise—sample size matters.
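
To put numbers on "sample size matters": a rough normal-approximation 95% interval on an observed accuracy is very wide at n=10 and only moderately tight at n=50 (a Wilson interval would behave a bit better at these sizes, but the point stands).

```python
# Back-of-envelope 95% margin of error (normal approximation) for an observed
# accuracy p over n test cases: 1.96 * sqrt(p * (1 - p) / n).
import math

def margin_95(p: float, n: int) -> float:
    return 1.96 * math.sqrt(p * (1 - p) / n)

print(f"n=10, p=0.60 -> +/-{margin_95(0.60, 10):.0%}")  # about +/-30 points
print(f"n=50, p=0.58 -> +/-{margin_95(0.58, 50):.0%}")  # about +/-14 points
```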

Takeaway: If your use case requires domain-specific knowledge (construction, legal, medical), you need the larger models like gpt-oss:120b. The 80B Qwen variants (both Thinking and Instruct) hover around 55%, which isn’t much better than the smaller 30B models.


Math: The Surprise Winner

This one surprised me. I expected the bigger models to dominate math like they did CSI classification. Instead:

| Model | Math Accuracy |
|---|---|
| qwen3:32b | 100% |
| qwen3-next:80b-a3b (Instruct) | 100% |
| gemma3-27b | 90% |
| GLM-4.5-Air | 90% |
| gpt-oss:120b | 60% |
| qwen3-30b-a3b | 60% |
| qwen3-30b-vl | 60% |
| qwen3-next:80b (Thinking) | 50% |

The pattern here is nuanced. qwen3:32b (dense) and qwen3-next:80b-a3b Instruct (MoE, 3B active) both hit 100%, while gemma3-27b (dense) and GLM-4.5-Air (MoE, 12B active) tied at 90%. But qwen3-next:80b Thinking—same architecture as the Instruct variant, same 3B active parameters—scored only 50%.

Why? Fine-tuning matters more than architecture for math. The “Thinking” chain-of-thought approach seems to introduce errors, while direct Instruct responses stay accurate. The dense models (qwen3:32b, gemma3-27b) performed consistently well. Google’s Gemma and Zhipu’s GLM were also explicitly trained with STEM focus. The takeaway: for math, fine-tuning approach and training focus matter more than parameter count or architecture.

Here’s gemma3 solving a calculus problem:

Step 1: We need to find the derivative of f(x) = ln(sin(x² + 1))

Step 2: Using the chain rule: d/dx[ln(u)] = (1/u) · du/dx

Step 3: Let u = sin(x² + 1), then du/dx = cos(x² + 1) · 2x

Final answer: f’(x) = 2x · cot(x² + 1)

Clean, correct, well-structured. The “Thinking” variants and smaller MoE Qwen models often got lost in the chain rule or made arithmetic errors—though both the dense qwen3:32b and the Instruct-tuned qwen3-next:80b-a3b nailed it.
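
For what it’s worth, the answer checks out symbolically; a quick SymPy verification:

```python
# Verify the worked example: d/dx ln(sin(x^2 + 1)) equals 2x * cot(x^2 + 1).
import sympy as sp

x = sp.symbols("x")
derivative = sp.diff(sp.ln(sp.sin(x**2 + 1)), x)
print(sp.simplify(derivative - 2 * x * sp.cot(x**2 + 1)))  # prints 0 -> identical
```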

Takeaway: For math-heavy applications (tutoring, technical analysis, scientific computing), use dense models: qwen3:32b or gemma3-27b. If you need both math and fast total response time, qwen3-next:80b-a3b (Instruct) scored 100% and finished the benchmark in just 0.6 minutes—despite only 45 tok/s, it gives concise direct answers instead of verbose explanations.


The All-Rounder: gpt-oss:120b

If I had to pick one model for general SMB tasks, it’s gpt-oss:120b—OpenAI’s open-weight model released under Apache 2.0.

| Category | gpt-oss:120b | Best Alternative |
|---|---|---|
| CSI MasterFormat | 78% | gemma3 (68%) |
| Email Drafting | 100% | All tied |
| Summarization | 100% | All tied |
| Code Generation | 90% | Most at 90% |
| Math | 60% | qwen3:32b (100%) |
| Speed | 175 tok/s | qwen3-30b-a3b (203) |
| Total Time | 4.4 min | — |

It’s fast (175 tok/s), reliable across all categories, and finished the entire benchmark in under 5 minutes. The only weakness is math, where qwen3:32b (100%) and gemma3 (90%) pull ahead.

For context: qwen3-next:80b took 27 minutes to complete the same benchmark that gpt-oss:120b finished in under 5, and it still scored lower on CSI classification (54% vs 78%). That’s not a tradeoff most businesses would accept.


The Speed Champions

For Ollama users: qwen3-30b-a3b at 203 tok/s. Best balance of speed and accuracy.

It achieves:

  • 100% on email and summarization
  • 90% on code generation
  • Total time: 2-3 minutes

The catch: it scored only 38% on domain-specific CSI classification and 60% on math. For general-purpose tasks like drafting emails, summarizing meeting notes, or generating boilerplate code, it’s excellent. For anything requiring specialized knowledge or complex reasoning, you’ll want something bigger.


What Actually Matters

Here’s the decision framework I’d use:

For domain-specific work (construction, legal, medical):

  • Use gpt-oss:120b (78% accuracy, 175 tok/s)
  • The Qwen 80B variants only hit 54-58%—not worth the slower speed
  • The smaller MoE models don’t have the knowledge depth either

For math and STEM reasoning:

  • Use qwen3:32b or gemma3:27b (both dense models)
  • qwen3:32b hit 100%, gemma3 hit 90%
  • For speed + math accuracy, qwen3-next:80b-a3b (Instruct) scored 100% at 45 tok/s

For general business tasks (email, summaries, basic code):

  • Use qwen3-30b-a3b for speed
  • Use gpt-oss:120b for quality
  • Both are reliable for non-specialized work
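
If you want that framework as code, a toy router might look like the sketch below; the task buckets and model tags mirror the tables above and are obviously simplifications.

```python
# Toy version of the decision framework above. Task buckets and model tags
# mirror this post's results; they are illustrative, not a product.
def pick_model(task: str) -> str:
    domain_specific = {"construction", "legal", "medical"}
    math_heavy = {"math", "stem", "tutoring"}
    if task in domain_specific:
        return "gpt-oss:120b"      # 78% CSI, 175 tok/s
    if task in math_heavy:
        return "qwen3:32b"         # 100% math; gemma3:27b is the other dense option
    return "qwen3-30b-a3b"         # fast default for email, summaries, boilerplate

print(pick_model("construction"))  # gpt-oss:120b
```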

The Full Results

For the benchmark completists:

| Model | CSI | Email | Summary | Code | Math | Speed | Total Time |
|---|---|---|---|---|---|---|---|
| qwen3-30b-a3b | 38% | 100% | 100% | 90% | 60% | 203 tok/s | 3.0 min |
| qwen3-30b-vl | 50% | 100% | 100% | 90% | 60% | 199 tok/s | 15.3 min |
| qwen3-30b-vl-awq | 34% | 100% | 100% | 90% | 60% | — | — |
| qwen3:32b | 48% | 100% | 100% | 90% | 100% | 55 tok/s | 8.1 min |
| qwen3-next:80b (Thinking) | 54% | 100% | 100% | 90% | 50% | 90 tok/s | 27.4 min |
| qwen3-next:80b-a3b (Instruct) | 58% | 100% | 100% | 100% | 100% | 45 tok/s | 0.6 min |
| gpt-oss:120b | 78% | 100% | 100% | 90% | 60% | 175 tok/s | 4.4 min |
| gemma3:27b | 68% | 100% | 100% | 90% | 90% | 67 tok/s | 6.9 min |
| GLM-4.5-Air | 26% | 100% | 100% | 100% | 90% | 111 tok/s | 42.2 min |

Note the qwen3-30b-vl anomaly: same tok/s as the a3b variant but took 5x longer. That’s because the VL model generates much longer responses—more tokens at the same speed means more time.

The qwen3-next:80b-a3b (Instruct) finished in 0.6 minutes because it returns concise ~20-token responses. The qwen3:32b took 8 minutes despite being smaller because it generates verbose 500-1000 token responses for the same task.
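
The arithmetic behind that: wall-clock time is tokens generated divided by tokens per second, ignoring prompt processing and per-request overhead. The token counts below are illustrative, not measured.

```python
# Rough wall-clock estimate: cases * tokens_per_response / tokens_per_second.
# Ignores prompt processing and per-request overhead; token counts illustrative.
def total_minutes(cases: int, tokens_per_response: float, tok_per_s: float) -> float:
    return cases * tokens_per_response / tok_per_s / 60

print(f"{total_minutes(50, 20, 45):.1f} min")    # terse ~20-token answers at 45 tok/s
print(f"{total_minutes(50, 600, 55):.1f} min")   # verbose ~600-token answers at 55 tok/s
```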


The Meta-Lesson

Every time I run one of these benchmarks, I learn the same thing: the hype metrics don’t matter as much as you’d think.

Parameter count? The 80B model was slower and less accurate than the 27B on math.

Speed? The fastest models (MoE variants) struggled with domain-specific classification.

Quantization method? GGUF Q4_K_M beat AWQ 4-bit by 13 points on domain tasks: same bit depth, very different results.

Model origin? Zhipu’s GLM-4.5-Air scored 26% on CSI—dead last—despite excellent performance on general tasks. Domain knowledge varies wildly by training data, not architecture.

Sampling parameters? Barely mattered. I tested identical prompts with temp=0.7 and temp=1.0, top_p=0.8 and top_p=0.95. The accuracy gap stayed the same.

What matters is testing your actual use case with your actual data on your actual hardware. The 4 hours I spent running 600+ benchmark cases will save months of frustration from deploying the wrong model.

So to answer the original question: which Qwen model should I use? For general use, qwen3-30b-a3b-instruct is fast and fits in modest amounts of VRAM. qwen3-next improves on that but requires more than 2x the VRAM. If you need vision capabilities, go with the qwen3-vl version. But there’s domain expertise not well captured in the Qwen3 series that gemma and gpt-oss do capture—so evaluate those for your specific application. Or consider additional model training to add the expertise you need. (A likely subject for a future blog post.)


Benchmark run on December 28-29, 2025. Hardware: RTX 6000 Blackwell for Ollama models. vLLM model (qwen3-30b-vl-awq) tested on different hardware—speed not directly comparable; included for quantization accuracy comparison only. Initial benchmark: 300 cases across 6 models. Extended tests: 200 CSI cases for GGUF vs AWQ comparison, 150 CSI cases for qwen3:32b, qwen3-next:80b (Thinking), and qwen3-next:80b-a3b (Instruct), plus 100 cases for GLM-4.5-Air. All Ollama models use Q4_K_M quantization; vLLM model uses AWQ 4-bit.