TL;DR: When small businesses wonder which AI model to choose, you can read the details in the rest of this post—or just contact Joshua8.AI and let us answer it for you.
It’s Sunday night and there’s a dusting of snow outside—late December. My wife is watching yet another Hallmark Christmas movie, which means it’s the perfect time to log into the basement servers and put them to work. I recently repositioned the rack directly beneath the family room, so that NAS array and those three CPUs and five GPUs crunching AI inference now double as a space heater. The warmth rising through the floorboards is all the motivation I need to answer the question I’ve been asked three times this month:
“There are like ten different Qwens now. Which one should I actually use?”
Fair question. Alibaba has been shipping Qwen variants like they’re getting paid per model release. Dense models, MoE models, vision models, thinking models, different quantizations—the menu has gotten unwieldy. For a small business trying to deploy a local LLM, the choice paralysis is real.
So I did what any reasonable person would do on a Sunday night: I built a benchmark.
Key Findings
I tested 9 models across 5 practical small and medium business (SMB) tasks: CSI MasterFormat classification, email drafting, document summarization, code generation, and math. 300 test cases in the main benchmark, plus 300 more testing quantization methods and model variants. Here’s what I found:
GGUF quantization beats AWQ by 13 points. On domain-specific tasks, Ollama’s Q4_K_M format consistently outperformed vLLM’s AWQ 4-bit—and it’s not the sampling parameters; I tested both with identical settings.
“Thinking” vs “Instruct” is a wash. On 50-case CSI classification, qwen3-next:80b Thinking scored 54% vs Instruct at 58%. The small-sample 10-case results were misleading—chain-of-thought reasoning doesn’t help domain classification.
Tokens per second isn’t everything. qwen3-next:80b-a3b (Instruct) only generates 45 tok/s, but it finished the benchmark in 0.6 minutes because it gives concise, direct answers instead of verbose explanations.
The biggest Qwen wasn’t the best. qwen3-next:80b took 27 minutes to finish what gpt-oss:120b did in 4 minutes—and scored lower on math.
Fine-tuning approach mattered for math. qwen3-next:80b-a3b Instruct hit 100% while qwen3-next:80b Thinking hit 50%—same model, same 3B active parameters, different fine-tuning. Dense models and STEM-focused training also helped.
For practical SMB work, gpt-oss:120b is the all-rounder. Fast (175 tok/s), accurate (78% on domain-specific classification), and reliable across all categories.
The Setup
Everything ran locally: the Ollama models on a single RTX 6000 Blackwell, plus one model served via vLLM on separate hardware for the quantization comparison. No cloud APIs. No token costs. Just raw local inference.
Models tested:
| Model | Type | Speed | VRAM | Notes |
|---|---|---|---|---|
| qwen3-30b-a3b | MoE (3B active) | 203 tok/s | ~19 GB | Fastest via Ollama |
| qwen3-30b-vl | Vision-Language | 199 tok/s | ~19 GB | Text-only in this test |
| qwen3-30b-vl-awq | AWQ via vLLM | — | ~19 GB | Accuracy comparison only |
| qwen3:32b | Dense | 55 tok/s | ~20 GB | Verbose responses |
| qwen3-next:80b (Thinking) | MoE (3B active) | 90 tok/s | ~49 GB | Chain-of-thought reasoning |
| qwen3-next:80b-a3b (Instruct) | MoE (3B active) | 45 tok/s | ~49 GB | Direct instruction following |
| gpt-oss:120b | MoE (128 experts, 4 active) | 175 tok/s | ~73 GB | OpenAI open-weight, Apache 2.0 |
| gemma3:27b | Dense | 67 tok/s | ~17 GB | Google’s contender |
| GLM-4.5-Air | MoE (12B active) | 111 tok/s | ~65 GB | Zhipu AI, 106B total params |
Benchmarks:
- CSI MasterFormat Classification: Given construction spec text, classify into the correct Division. Tests domain knowledge in commercial construction. (This benchmark was chosen to support development efforts at TeraContext.AI, one of Joshua8.AI’s incubator companies building AI tools for the commercial construction industry.)
- Email Drafting: Write professional emails from scenarios. Tests tone and completeness.
- Document Summarization: Compress business documents accurately. Tests comprehension.
- Code Generation: Write working Python/Bash scripts. Tests technical capability.
- Math: Solve geometry, integration, and differentiation problems. Tests reasoning.
10 test cases per category, 50 cases per model. The initial benchmark covered 6 models (300 cases), with 3 more models added in extended testing.
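For context on the mechanics, here’s roughly the shape of the harness. This is a minimal sketch, not the actual benchmark code: it assumes Ollama’s default local API on port 11434, and the single `TEST_CASES` entry and substring check are illustrative stand-ins for the real test data and grading.

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local endpoint

# Illustrative test case: (prompt, substring expected in a correct answer).
# The real benchmark used 10 curated cases per category; this grader is a
# simple substring check, just for illustration.
TEST_CASES = [
    ("Classify this construction spec into a CSI MasterFormat Division. "
     "Answer with the division number only: 'Cast-in-place concrete for "
     "foundations and slabs on grade.'", "03"),
]

def ask(model: str, prompt: str, temperature: float = 0.7, top_p: float = 0.8) -> str:
    """Send one prompt to a local Ollama model and return the reply text."""
    resp = requests.post(OLLAMA_URL, json={
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "options": {"temperature": temperature, "top_p": top_p},
    }, timeout=600)
    resp.raise_for_status()
    return resp.json()["message"]["content"]

def score(model: str) -> float:
    """Fraction of cases where the expected answer shows up in the reply."""
    hits = sum(expected in ask(model, prompt) for prompt, expected in TEST_CASES)
    return hits / len(TEST_CASES)

if __name__ == "__main__":
    for model in ("qwen3-30b-a3b", "gpt-oss:120b"):
        print(f"{model}: {score(model):.0%}")
```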
The Quantization Deep Dive: GGUF vs AWQ
Here’s where it gets interesting. The initial 10-case benchmark showed a small gap between Ollama (GGUF Q4_K_M) and vLLM (AWQ 4-bit). Was that noise, or something real?
I ran 200 more tests to find out: 50 CSI classification cases × 2 backends × 2 parameter configurations. Both models are 4-bit quantized, so this isolates the quantization method rather than precision.
The results were consistent:
| Configuration | Ollama (GGUF) | vLLM (AWQ) |
|---|---|---|
| Qwen recommended (temp=0.7, top_p=0.8) | 50% | 34% |
| Original params (temp=1.0, top_p=0.95) | 44% | 34% |
| Average | 47% | 34% |
That’s a 13 percentage point gap—and it held regardless of sampling parameters. vLLM scored exactly 34% both times.
What this means:
The difference isn’t about temperature or top_p settings. It’s something fundamental about how GGUF Q4_K_M and AWQ 4-bit preserve (or lose) model knowledge during quantization—or possibly differences in post-quantization tuning. I couldn’t find details on whether either model had additional training after quantization.
For general tasks like email and summarization, both performed identically (100%). But for domain-specific classification requiring niche knowledge, GGUF retained more capability in this particular instance.
One caveat for vLLM users: make sure you’re using the right API endpoint. Chat/instruct models need /v1/chat/completions, not /v1/completions. I burned an hour debugging empty responses before catching that. Ollama handles that automatically.
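For anyone reproducing this against vLLM’s OpenAI-compatible server, here’s a minimal sketch of the call that works; the port and model name depend on how you launched the server.

```python
import requests

# Chat/instruct models need the chat endpoint, not the bare completions one.
VLLM_URL = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "qwen3-30b-vl-awq",  # whatever name you launched the vLLM server with
    "messages": [{"role": "user", "content": "Classify this spec into a CSI Division: ..."}],
    "temperature": 0.7,
    "top_p": 0.8,
    "max_tokens": 256,
}

resp = requests.post(VLLM_URL, json=payload, timeout=300)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```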
Domain Knowledge: Size Matters (Sort Of)
CSI MasterFormat is the construction industry’s standard for organizing specifications. Divisions like 03-Concrete, 07-Thermal and Moisture Protection, 22-Plumbing. A useful LLM should be able to classify spec text into the right division.
I ran an extended 50-case CSI benchmark on additional models to get more reliable numbers:
| Model | CSI Accuracy | Test Size | Notes |
|---|---|---|---|
| gpt-oss:120b | 78% | 50 cases | Best overall |
| gemma3-27b | 68% | 50 cases | |
| qwen3-next:80b-a3b (Instruct) | 58% | 50 cases | MoE, 3B active |
| qwen3-next:80b (Thinking) | 54% | 50 cases | Chain-of-thought |
| qwen3:32b | 48% | 50 cases | Dense |
| qwen3-30b-vl (GGUF) | 47% | 50 cases | |
| qwen3-30b-a3b | 38% | 50 cases | MoE, 3B active |
| qwen3-30b-vl-awq (vLLM) | 34% | 50 cases | AWQ penalty |
| GLM-4.5-Air | 26% | 50 cases | Worst on domain knowledge |
The qwen3-next:80b naming is confusing. There are two variants on Ollama:
- qwen3-next:80b — “Thinking” fine-tune (chain-of-thought reasoning)
- qwen3-next:80b-a3b-instruct — “Instruct” fine-tune (direct answers)
Both are MoE with 80B total params but only 3B active per token. On a fair 50-case comparison, they scored nearly identically (54% vs 58%). The initial 10-case sample suggested Thinking was better, but that was noise—sample size matters.
Takeaway: If your use case requires domain-specific knowledge (construction, legal, medical), you need the larger models like gpt-oss:120b. The 80B Qwen variants (both Thinking and Instruct) hover around 55%, which isn’t much better than the smaller 30B models.
Math: The Surprise Winner
This one surprised me. I expected the bigger models to dominate math like they did CSI classification. Instead:
| Model | Math Accuracy |
|---|---|
| qwen3:32b | 100% |
| qwen3-next:80b-a3b (Instruct) | 100% |
| gemma3-27b | 90% |
| GLM-4.5-Air | 90% |
| gpt-oss:120b | 60% |
| qwen3-30b-a3b | 60% |
| qwen3-30b-vl | 60% |
| qwen3-next:80b (Thinking) | 50% |
The pattern here is nuanced. qwen3:32b (dense) and qwen3-next:80b-a3b Instruct (MoE, 3B active) both hit 100%, while gemma3-27b (dense) and GLM-4.5-Air (MoE, 12B active) tied at 90%. But qwen3-next:80b Thinking—same architecture as the Instruct variant, same 3B active parameters—scored only 50%.
Why? Fine-tuning matters more than architecture for math. The “Thinking” chain-of-thought approach seems to introduce errors, while direct Instruct responses stay accurate. The dense models (qwen3:32b, gemma3-27b) performed consistently well. Google’s Gemma and Zhipu’s GLM were also explicitly trained with STEM focus. The takeaway: for math, fine-tuning approach and training focus matter more than parameter count or architecture.
Here’s gemma3 solving a calculus problem:
Step 1: We need to find the derivative of f(x) = ln(sin(x² + 1))
Step 2: Using the chain rule: d/dx[ln(u)] = (1/u) · du/dx
Step 3: Let u = sin(x² + 1), then du/dx = cos(x² + 1) · 2x
Final answer: f’(x) = 2x · cot(x² + 1)
Clean, correct, well-structured. The “Thinking” variants and smaller MoE Qwen models often got lost in the chain rule or made arithmetic errors—though both the dense qwen3:32b and the Instruct-tuned qwen3-next:80b-a3b nailed it.
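As an aside, answers like this are easy to check mechanically. A quick sketch with SymPy (not part of the benchmark harness, just a sanity check on the worked solution above):

```python
import sympy as sp

x = sp.symbols("x")
f = sp.log(sp.sin(x**2 + 1))           # f(x) = ln(sin(x^2 + 1))

derivative = sp.diff(f, x)             # 2*x*cos(x**2 + 1)/sin(x**2 + 1)
claimed = 2 * x * sp.cot(x**2 + 1)     # gemma3's answer: 2x * cot(x^2 + 1)

# If the difference simplifies to 0, the claimed answer is correct.
print(sp.simplify(derivative - claimed))  # -> 0
```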
Takeaway: For math-heavy applications (tutoring, technical analysis, scientific computing), use dense models: qwen3:32b or gemma3-27b. If you need both math and fast total response time, qwen3-next:80b-a3b (Instruct) scored 100% and finished the benchmark in just 0.6 minutes—despite only 45 tok/s, it gives concise direct answers instead of verbose explanations.
The All-Rounder: gpt-oss:120b
If I had to pick one model for general SMB tasks, it’s gpt-oss:120b—OpenAI’s open-weight model released under Apache 2.0.
| Category | gpt-oss:120b | Best Alternative |
|---|---|---|
| CSI MasterFormat | 78% | gemma3 (68%) |
| Email Drafting | 100% | All tied |
| Summarization | 100% | All tied |
| Code Generation | 90% | Most at 90% |
| Math | 60% | gemma3 (90%) |
| Speed | 175 tok/s | qwen3-30b-a3b (203) |
| Total Time | 4.4 min | — |
It’s fast (175 tok/s), reliable across all categories, and finished the entire benchmark in under 5 minutes. The only weakness is math, where gemma3 dominates.
For context: qwen3-next:80b scored higher on CSI classification but took 27 minutes to complete the same benchmark. That’s not a tradeoff most businesses would accept.
The Speed Champions
For Ollama users: qwen3-30b-a3b at 203 tok/s offers the best balance of speed and general-task accuracy.
It achieves:
- 100% on email and summarization
- 90% on code generation
- Total benchmark time: about 3 minutes
The catch: it scored only 38% on domain-specific CSI classification and 60% on math. For general-purpose tasks like drafting emails, summarizing meeting notes, or generating boilerplate code, it’s excellent. For anything requiring specialized knowledge or complex reasoning, you’ll want something bigger.
What Actually Matters
Here’s the decision framework I’d use, with a small routing sketch after the lists:
For domain-specific work (construction, legal, medical):
- Use gpt-oss:120b (78% accuracy, 175 tok/s)
- The Qwen 80B variants only hit 54-58%—not worth the slower speed
- The smaller MoE models don’t have the knowledge depth either
For math and STEM reasoning:
- Use qwen3:32b or gemma3:27b (both dense models)
- qwen3:32b hit 100%, gemma3 hit 90%
- For speed + math accuracy, qwen3-next:80b-a3b (Instruct) scored 100% at 45 tok/s
For general business tasks (email, summaries, basic code):
- Use qwen3-30b-a3b for speed
- Use gpt-oss:120b for quality
- Both are reliable for non-specialized work
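If you want to bake that framework into tooling, say a small router that picks a local model per request type, it boils down to a lookup table. A sketch, with task labels that are my own shorthand:

```python
# The framework above as a routing table. The task labels are arbitrary;
# map them to however you tag incoming requests.
MODEL_BY_TASK = {
    "domain_classification": "gpt-oss:120b",                 # construction/legal/medical
    "math":                  "qwen3:32b",                    # dense, 100% on the math set
    "math_fast":             "qwen3-next:80b-a3b-instruct",  # 100% math, concise answers
    "general_fast":          "qwen3-30b-a3b",                # email, summaries, boilerplate code
    "general_quality":       "gpt-oss:120b",
}

def pick_model(task: str) -> str:
    """Route a request to a local model, defaulting to the all-rounder."""
    return MODEL_BY_TASK.get(task, "gpt-oss:120b")

print(pick_model("math"))  # qwen3:32b
```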
The Full Results
For the benchmark completists:
| Model | CSI | Email | Summary | Code | Math | Speed | Total Time |
|---|---|---|---|---|---|---|---|
| qwen3-30b-a3b | 38% | 100% | 100% | 90% | 60% | 203 tok/s | 3.0 min |
| qwen3-30b-vl | 50% | 100% | 100% | 90% | 60% | 199 tok/s | 15.3 min |
| qwen3-30b-vl-awq | 34% | 100% | 100% | 90% | 60% | — | — |
| qwen3:32b | 48% | 100% | 100% | 90% | 100% | 55 tok/s | 8.1 min |
| qwen3-next:80b (Thinking) | 54% | 100% | 100% | 90% | 50% | 90 tok/s | 27.4 min |
| qwen3-next:80b-a3b (Instruct) | 58% | 100% | 100% | 100% | 100% | 45 tok/s | 0.6 min |
| gpt-oss:120b | 78% | 100% | 100% | 90% | 60% | 175 tok/s | 4.4 min |
| gemma3:27b | 68% | 100% | 100% | 90% | 90% | 67 tok/s | 6.9 min |
| GLM-4.5-Air | 26% | 100% | 100% | 100% | 90% | 111 tok/s | 42.2 min |
Note the qwen3-30b-vl anomaly: same tok/s as the a3b variant but took 5x longer. That’s because the VL model generates much longer responses—more tokens at the same speed means more time.
The qwen3-next:80b-a3b (Instruct) finished in 0.6 minutes because it returns concise ~20-token responses. The qwen3:32b took 8 minutes despite being smaller because it generates verbose 500-1000 token responses for the same task.
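Back-of-envelope: 50 cases at roughly 20 tokens each, divided by 45 tok/s, is about 22 seconds of generation, which accounts for most of that 0.6 minutes. The same 50 cases at 500-1,000 tokens each at 55 tok/s works out to roughly 7.5 to 15 minutes, which brackets qwen3:32b’s 8.1.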
The Meta-Lesson
Every time I run one of these benchmarks, I learn the same thing: the hype metrics don’t matter as much as you’d think.
Parameter count? The 80B model was slower and less accurate than the 27B on math.
Speed? The fastest models (MoE variants) struggled with domain-specific classification.
Quantization method? GGUF Q4_K_M beat AWQ 4-bit by 13 points on domain tasks—same bit depth, very different results.
Model origin? Zhipu’s GLM-4.5-Air scored 26% on CSI—dead last—despite excellent performance on general tasks. Domain knowledge varies wildly by training data, not architecture.
Sampling parameters? Barely mattered. I tested identical prompts with temp=0.7 and temp=1.0, top_p=0.8 and top_p=0.95. The accuracy gap stayed the same.
What matters is testing your actual use case with your actual data on your actual hardware. The 4 hours I spent running 600+ benchmark cases will save months of frustration from deploying the wrong model.
So to answer the original question: which Qwen model should I use? For general use, qwen3-30b-a3b-instruct is fast and fits in modest amounts of VRAM. qwen3-next improves on that but requires more than 2x the VRAM. If you need vision capabilities, go with the qwen3-vl version. But there’s domain expertise not well captured in the Qwen3 series that gemma and gpt-oss do capture—so evaluate those for your specific application. Or consider additional model training to add the expertise you need. (A likely subject for a future blog post.)
Benchmark run on December 28-29, 2025. Hardware: RTX 6000 Blackwell for Ollama models. vLLM model (qwen3-30b-vl-awq) tested on different hardware—speed not directly comparable; included for quantization accuracy comparison only. Initial benchmark: 300 cases across 6 models. Extended tests: 200 CSI cases for GGUF vs AWQ comparison, 150 CSI cases for qwen3:32b, qwen3-next:80b (Thinking), and qwen3-next:80b-a3b (Instruct), plus 100 cases for GLM-4.5-Air. All Ollama models use Q4_K_M quantization; vLLM model uses AWQ 4-bit.