This Week’s Friday Night Experiment: Why Speculative Decoding Didn’t Speed Up My 120B Model (And Why That’s Actually Fine)
December 12, 2025
Last week, my wife soundly defeated approximately 70 AI models in a sudoku showdown. It wasn’t even close. This week, I had plans for a rematch—NYT Wordle, humans versus machines, the whole production. But she was elbow-deep in wrapping paper and ribbon, buried under a mountain of holiday presents that apparently couldn’t wait. “You and your GPU will have to entertain yourselves tonight,” she said, not looking up from a particularly stubborn bow.
So the GB202 and I did what we do best: dig into something technical that’s been nagging at me. This week: speculative decoding. The promise is tantalizing—use a small “draft” model to predict what a large model will say, verify those predictions in parallel, and get 2-3x faster inference. Papers show impressive speedups. NVIDIA’s pushing it. The vLLM team just shipped Eagle3 support. So naturally, I had to try it on my single-GPU GPT-OSS-120B setup.
Spoiler: baseline won. But the journey taught me more than the destination.
The Setup
I’m running OpenAI’s GPT-OSS-120B—a 117-billion-parameter Mixture-of-Experts model with only 5.1B parameters active per token—on a single RTX PRO 6000 Blackwell Max-Q GPU with 96GB of GDDR7. The model uses MXFP4 quantization, cramming ~66GB of weights into VRAM with room to spare for KV cache. vLLM v0.12.0 handles inference, pulling about 180 tokens/second on autoregressive decoding.
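If you want to poke at the same setup, the baseline is nothing exotic. Something like the following is the whole offline harness; the Hugging Face repo id is the public release and the engine arguments are defaults, so treat it as a sketch rather than my exact launch script.

```python
# Baseline: plain autoregressive decoding, no speculation.
# Assumptions: the public Hugging Face release "openai/gpt-oss-120b" and
# default engine args; vLLM loads the MXFP4 checkpoint as-is.
from vllm import LLM, SamplingParams

llm = LLM(
    model="openai/gpt-oss-120b",    # ~66 GB of MXFP4 weights
    gpu_memory_utilization=0.90,    # leave headroom for the KV cache
)

params = SamplingParams(temperature=0.7, max_tokens=1024)
out = llm.generate(["Write a 1000 word story about apples."], params)
print(out[0].outputs[0].text[:200])
```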
That’s already pretty good for a 120B model on consumer-adjacent hardware. But could speculative decoding make it better?
The Hypothesis
Speculative decoding works by having a smaller “draft” model generate several candidate tokens, then having the larger “target” model verify them in a single forward pass. If the draft model guesses correctly, you get multiple tokens for the cost of one target model inference. The catch: if the draft model guesses wrong, you’ve wasted compute.
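If you’ve never looked under the hood, the accept/reject logic is simpler than it sounds. Here’s a toy greedy version with stand-in “models” (just functions from a token sequence to a next-token guess); real implementations verify against the target’s full probability distribution and do the checking in one batched forward pass, but the control flow is the same.

```python
# Toy illustration of greedy speculative decoding.
# draft / target are stand-ins: any function mapping a token sequence to
# its predicted next token. Real systems check all draft positions in ONE
# batched target forward pass; the loop below just makes the accept/reject
# logic explicit.
from typing import Callable, List

Model = Callable[[List[str]], str]

def speculative_decode(target: Model, draft: Model, prompt: List[str],
                       num_draft: int = 4, max_new: int = 32) -> List[str]:
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        # 1. Draft proposes num_draft tokens autoregressively (cheap model).
        proposed = []
        for _ in range(num_draft):
            proposed.append(draft(tokens + proposed))
        # 2. Target checks each position; accept the longest matching prefix.
        accepted = 0
        for i in range(num_draft):
            if target(tokens + proposed[:i]) == proposed[i]:
                accepted += 1
            else:
                break
        tokens += proposed[:accepted]
        # 3. The target always contributes one token: its correction at the
        #    first mismatch, or a bonus token if every draft was accepted.
        tokens.append(target(tokens))
    return tokens
```

When the draft is right, one round of target work yields up to num_draft + 1 tokens; when it’s wrong, you’ve paid for the draft and still only get one token. That asymmetry is the whole game.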
NVIDIA released an Eagle3 draft model specifically for GPT-OSS-120B, so I had a matched pair ready to test. The research literature suggested I might see 1.5-2.5x speedups depending on the task.
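Wiring that pair up in vLLM is a small config change, something like the sketch below. The draft repo id is a placeholder (check NVIDIA’s model card for the real one), and the speculative_config keys have moved around between vLLM releases, so treat this as a shape rather than gospel.

```python
from vllm import LLM

# Same target model as before, now with an Eagle3 draft attached.
# The draft repo id is a placeholder; the speculative_config keys follow
# recent vLLM releases and may differ in yours.
llm = LLM(
    model="openai/gpt-oss-120b",
    speculative_config={
        "method": "eagle3",
        "model": "nvidia/<gpt-oss-120b-eagle3-draft>",  # placeholder id
        "num_speculative_tokens": 2,   # the knob I swept: 1, 2, 4
    },
)
```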
The First Disappointment
My initial benchmark used creative writing prompts: “Write a 1000 word story about apples/bananas/dogs/etc.” The results were grim:
| Configuration | Speed (tok/s) |
|---|---|
| Baseline (no speculation) | ~180 |
| Eagle3 (1 token) | ~37 |
| Eagle3 (2 tokens) | ~146 |
| Eagle3 (4 tokens) | ~118 |
Wait—1 speculative token gave me 37 tokens/second? That’s an 80% slowdown! Turns out I’d only run a single test, capturing the warmup/compilation penalty. After fixing my benchmark methodology (always run multiple iterations, discard the first), the 1-token config actually hit ~158 tok/s. Still slower than baseline, but not catastrophically so.
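The fixed benchmark loop, in sketch form: warm up, time several iterations, report the median. It assumes an llm object built like the ones above; the throughput number includes prefill, which is negligible for long generations.

```python
import statistics
import time

from vllm import SamplingParams

def bench(llm, prompt: str, iters: int = 5, warmup: int = 1) -> float:
    """Median decode throughput (tok/s), discarding warmup iterations."""
    params = SamplingParams(temperature=0.7, max_tokens=1024, ignore_eos=True)
    rates = []
    for i in range(warmup + iters):
        start = time.perf_counter()
        out = llm.generate([prompt], params)
        elapsed = time.perf_counter() - start
        if i >= warmup:  # the first pass pays compilation / CUDA graph capture
            rates.append(len(out[0].outputs[0].token_ids) / elapsed)
    return statistics.median(rates)

print(f"{bench(llm, 'Write a 1000 word story about apples.'):.0f} tok/s")
```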
But the pattern was clear: more speculative tokens meant more overhead, not more speedup. Why?
The Task Matters More Than You Think
Here’s where my Friday night took an interesting turn. I started researching what benchmarks the academic papers actually use. Turns out, creative story generation is one of the worst cases for speculative decoding.
Speculative decoding shines when the draft model can accurately predict what the target model will generate. This happens with:
- Summarization: Output tokens often appear verbatim in the input
- Code completion: Syntax is predictable
- Translation: Structured, pattern-following
- RAG/retrieval QA: Answers drawn from provided context
Creative writing? The whole point is to be unpredictable. A draft model trying to guess the next word in an imaginative story is basically rolling dice.
I pivoted to a summarization task: a ~500-word article about AI history, asking for a 3-4 paragraph summary. The results improved noticeably:
| Configuration | Story Gen (tok/s) | Summarization (tok/s) |
|---|---|---|
| Baseline | ~180 | ~180 |
| Eagle3 (2 tokens) | ~146 | ~177 |
| Eagle3 (4 tokens) | ~118 | ~154 |
Summarization with 2 speculative tokens hit 98% of baseline speed, which is within the margin of error. But it still wasn’t faster.
The Real Bottleneck: Memory Bandwidth
After more experimentation—trying n-gram matching, different token counts, disabling CUDA graphs (which made everything 2x slower, don’t do that)—I realized the fundamental issue.
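Quick aside, because the n-gram variant is the odd one out: it needs no draft model at all and simply copies candidate tokens from n-grams already sitting in the context, which is why it tends to suit summarization and RAG far better than fresh storytelling. Enabling it is one more speculative_config flavor (again, keys shift between vLLM versions):

```python
from vllm import LLM

# Prompt-lookup ("ngram") speculation: drafts are copied from matching
# n-grams already in the context, so no draft model is loaded.
# Config keys follow recent vLLM releases; adjust for yours.
llm = LLM(
    model="openai/gpt-oss-120b",
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 4,
        "prompt_lookup_max": 4,  # longest n-gram to match in the context
    },
)
```

Anyway, none of these knobs changed the picture, and here is why.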
Speculative decoding trades extra compute for fewer sequential passes through the target model: the draft proposes a handful of tokens, the target verifies them all in one forward pass, and every accepted token amortizes the weight reads for that pass. The trade only pays off if the draft is cheap relative to the target and its guesses get accepted often enough.
Neither condition worked out in my favor. At batch size 1 on the Blackwell Max-Q, autoregressive decoding already runs at 180 tok/s with the GDDR7 memory system, not compute, setting the pace, so there isn’t much slack for a draft to claw back. The Eagle3 draft adds overhead of its own: loading its weights, running its forward passes, the verification logic. And because GPT-OSS-120B is a Mixture-of-Experts model, verifying several speculative tokens in one pass routes to more experts than decoding a single token does, so even the “cheap” verification step costs extra memory traffic. With acceptance hovering around 45% on creative prompts, the occasional multi-token win never covered those costs.
This is actually documented in vLLM’s own blog: “in high-QPS environments, speculative decoding may introduce performance trade-offs. The extra compute required to propose and verify tokens can sometimes slow down the system when it is already compute-bound.”
My single-request setup fails for a different reason than the high-QPS case vLLM describes: the baseline is already about as fast as the memory system will let it be, and the draft’s overhead plus a middling acceptance rate eat whatever speculation gives back.
When Would Speculation Actually Help?
Based on this experiment, speculative decoding is most valuable when:
- The GPU has compute headroom to spend on drafting and verification (per the vLLM caveat above, it stops paying off once the system is already compute-bound)
- Your task has high input-output overlap (summarization, RAG, code)
- Your target model is very slow due to size or hardware limitations
- Draft acceptance rate exceeds ~60% (mine was ~45% on creative tasks; one way to eyeball this is sketched below, after these lists)
It’s less useful when:
- Single-request latency is already good (memory-bandwidth limited)
- Tasks are creative/unpredictable (low acceptance rate)
- You’re on newer hardware with fast memory subsystems
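About measuring that acceptance rate: when serving with vllm serve, the Prometheus endpoint at /metrics exposes speculative-decoding counters. The exact metric names have moved around between vLLM versions, so the sketch below just scrapes anything with "spec_decode" in the name and divides accepted tokens by drafted tokens; treat it as a rough eyeball check, not instrumentation.

```python
# Rough acceptance-rate check against a running `vllm serve` instance.
# Assumes Prometheus metrics at /metrics with spec-decode counters whose
# names contain "accepted" and "draft" (exact names vary by vLLM version).
import urllib.request

def spec_decode_acceptance(base_url: str = "http://localhost:8000"):
    text = urllib.request.urlopen(f"{base_url}/metrics").read().decode()
    accepted = drafted = 0.0
    for line in text.splitlines():
        if line.startswith("#") or "spec_decode" not in line:
            continue
        name = line.split("{")[0].split()[0]   # metric name without labels
        value = float(line.split()[-1])        # sample value
        if "accepted" in name:
            accepted += value
        elif "draft" in name:
            drafted += value
    return accepted / drafted if drafted else None

print(spec_decode_acceptance())
```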
The Takeaway
Sometimes the best optimization is no optimization. My 180 tok/s baseline is excellent for a 120B model on a single GPU. The Eagle3 draft model works correctly—it just can’t improve on something that’s already efficient.
This is a pattern I see repeatedly in AI infrastructure: techniques that provide massive gains in one context provide nothing (or negative value) in another. Speculative decoding isn’t snake oil. It’s a tool with specific use cases. The research showing 2-3x speedups used batch sizes of 24-56 and tasks with high token predictability. My single-request creative workload is the opposite scenario.
Next Friday? Maybe I’ll put this setup under concurrent load and find out where the trade-off actually flips. But for tonight, I’m satisfied knowing exactly why my experiment “failed”—and that failure taught me more than success would have.
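For when that Friday arrives, the harness is probably nothing fancier than vLLM’s OpenAI-compatible server plus a pile of async requests. A rough shape, with the endpoint, model id, prompts, and concurrency all placeholders:

```python
# Sketch of a concurrent-load test against `vllm serve` and its
# OpenAI-compatible API. Endpoint, model id, prompts, and concurrency
# are placeholders for whatever the real experiment ends up using.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def one_request(prompt: str) -> int:
    resp = await client.completions.create(
        model="openai/gpt-oss-120b",
        prompt=prompt,
        max_tokens=512,
        temperature=0.7,
    )
    return resp.usage.completion_tokens

async def run(concurrency: int = 16) -> None:
    prompts = [f"Summarize the history of AI in three paragraphs. ({i})"
               for i in range(concurrency)]
    start = time.perf_counter()
    counts = await asyncio.gather(*(one_request(p) for p in prompts))
    elapsed = time.perf_counter() - start
    print(f"{concurrency} concurrent requests: "
          f"{sum(counts) / elapsed:.0f} aggregate tok/s")

asyncio.run(run())
```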
Tested on: RTX PRO 6000 Blackwell Max-Q (96GB GDDR7), vLLM v0.12.0, GPT-OSS-120B (MXFP4), Eagle3 draft model