TL;DR: I benchmarked Llama 3.1 8B inference across four NVIDIA Blackwell GPUs using vLLM. The entry-level RTX 5060 Ti ($450) hit 1,259 tokens/second—that’s 17x faster than CPU inference and enough for most single-user applications. Scaling up: the RTX 5070 Ti ($1,000) reaches 2,689 TPS, the RTX 5090 ($3,200) peaks at 8,691 TPS with NVFP4, and the $8,500 RTX PRO 6000 trails at 7,648 TPS. Yes, the $3,200 card beat the $8,500 card by 14%. Why? Memory bandwidth. The 5090 and PRO 6000 share identical 1,792 GB/s bandwidth, but the 5090’s 575W TDP delivers more compute than the PRO 6000’s efficiency-focused 300W. The 16GB cards (5070 Ti, 5060 Ti) hit OOM at 64 parallel queries—fine for development, limiting for production. CPU inference? Of the CPUs I tried, only the AMD Ryzen worked: the Intel Core Ultra and the older Xeons I had access to lack the AVX-512 instructions vLLM requires. At 74 TPS peak, CPU is 17x slower than even the budget 5060 Ti. Bottom line: start with the 5060 Ti for development, buy the 5090 for production. The PRO 6000’s extra VRAM only matters for 70B+ models.
You don’t need an $8,500 GPU to run local LLMs. You might not even need a $3,200 one.
I benchmarked four Blackwell GPUs on Llama 3.1 8B to answer the real question: how cheap can you go before performance becomes painful?
The answer surprised me. The $450 RTX 5060 Ti generates 1,259 tokens per second. That’s 17x faster than CPU inference and plenty fast for development, testing, or single-user chat applications.
The Contenders
I tested every Blackwell GPU I could get my hands on:
| GPU | VRAM | Memory Bandwidth | TDP | Price |
|---|---|---|---|---|
| RTX 5090 | 32GB GDDR7 | 1,792 GB/s | 575W | $3,200 |
| RTX PRO 6000 Max-Q | 96GB GDDR7 | 1,792 GB/s | 300W | $8,500 |
| RTX 5070 Ti | 16GB GDDR7 | 896 GB/s | 300W | $1,000 |
| RTX 5060 Ti | 16GB GDDR7 | 448 GB/s | 180W | $450 |
Notice anything? The 5090 and PRO 6000 share identical memory bandwidth—1,792 GB/s. The PRO 6000 costs 2.7x more but doesn’t move data any faster.
Test setup (see the sketch after this list):
- Model: Llama 3.1 8B Instruct (GPTQ-INT4 and NVFP4 quantization)
- Framework: vLLM 0.13.0 with FP8 KV cache
- Parallelism: 1, 2, 4, 8, 16, 32, 64, 128 concurrent requests
- Prompt: 500-token essay generation task
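For context, a minimal harness in the spirit of this setup might look like the sketch below. It is not my exact benchmark script: the checkpoint ID, prompt, and max_model_len are stand-in assumptions, and vLLM detects GPTQ or NVFP4 quantization from the checkpoint itself; kv_cache_dtype="fp8" enables the FP8 KV cache.

```python
# Rough throughput sketch, not the exact benchmark script.
# The checkpoint ID and prompt are assumptions -- point it at whichever
# GPTQ-INT4 (or NVFP4) build of Llama 3.1 8B Instruct you actually use.
import time

from vllm import LLM, SamplingParams

llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4",  # assumed checkpoint
    kv_cache_dtype="fp8",   # FP8 KV cache, as in the benchmark
    max_model_len=4096,     # assumed; plenty for a 500-token prompt plus output
)
params = SamplingParams(temperature=0.7, max_tokens=500)

for parallel in (1, 2, 4, 8, 16, 32, 64, 128):
    prompts = ["Write a detailed 500-word essay on the history of computing."] * parallel
    start = time.perf_counter()
    outputs = llm.generate(prompts, params)
    elapsed = time.perf_counter() - start
    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"{parallel:>4} concurrent: {generated / elapsed:,.0f} TPS")
```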
The Results: Peak Throughput
Here’s what happened when I pushed each GPU to its limit:
| GPU | GPTQ-INT4 Peak | NVFP4 Peak | Max Parallel |
|---|---|---|---|
| RTX 5090 | 6,510 TPS | 8,691 TPS | 128 |
| RTX PRO 6000 | 6,176 TPS | 7,648 TPS | 128 |
| RTX 5070 Ti | 2,689 TPS | — | 32 (OOM at 64) |
| RTX 5060 Ti | 1,259 TPS | — | 32 (OOM at 64) |
The $3,200 card beat the $8,500 card. By 14%.
How? The RTX 5090’s 575W power budget lets it clock higher and sustain more compute than the PRO 6000’s efficiency-optimized 300W design. Same memory bandwidth, more muscle to use it.
The NVFP4 vs GPTQ-INT4 Tradeoff
Blackwell supports native FP4 compute, so I tested both quantization formats. The results reveal an interesting tradeoff:
| Parallelism | GPTQ-INT4 (5090) | NVFP4 (5090) | Winner |
|---|---|---|---|
| 1 query | 205 TPS | 124 TPS | GPTQ-INT4 (+65%) |
| 8 queries | 1,473 TPS | 951 TPS | GPTQ-INT4 (+55%) |
| 32 queries | 4,440 TPS | 3,296 TPS | GPTQ-INT4 (+35%) |
| 128 queries | 6,510 TPS | 8,691 TPS | NVFP4 (+33%) |
For interactive use—chatbots, coding assistants, single-user applications—GPTQ-INT4 generates tokens 65% faster: 205 tokens/second versus 124.
But for batch processing at scale, NVFP4 pulls ahead. At 128 concurrent requests, it’s generating 33% more tokens per second.
My recommendation: Use GPTQ-INT4 for latency-sensitive applications. Use NVFP4 when you’re saturating the GPU with batch jobs.
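If you want that rule of thumb in code form, a deliberately blunt helper based on the 5090 numbers above might look like this. The 64-request threshold is an assumption, since the actual crossover lies somewhere between 32 and 128 concurrent requests:

```python
def pick_quantization(expected_concurrency: int) -> str:
    """Choose a quantization format for Llama 3.1 8B on Blackwell.

    GPTQ-INT4 won every measurement up to 32 concurrent requests;
    NVFP4 won at 128. The exact crossover wasn't measured, so the
    64-request threshold below is an assumption, not a benchmark result.
    """
    return "GPTQ-INT4" if expected_concurrency < 64 else "NVFP4"
```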
The 16GB Problem
The RTX 5070 Ti and 5060 Ti both have 16GB VRAM. That’s enough to load Llama 3.1 8B (~4GB in INT4), but not enough for large KV caches at high concurrency.
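A back-of-the-envelope KV cache calculation shows how fast 16GB disappears. This is a sketch assuming Llama 3.1 8B's published architecture (32 layers, 8 KV heads, 128-dim heads) and 1 byte per value thanks to the FP8 cache; the per-request context length is an illustrative assumption.

```python
# KV cache cost for Llama 3.1 8B with an FP8 (1 byte per value) cache.
layers, kv_heads, head_dim, bytes_per_value = 32, 8, 128, 1
per_token = 2 * layers * kv_heads * head_dim * bytes_per_value   # K and V planes
print(per_token / 1024)                  # 64 KiB of cache per token of context

# Illustrative: 64 concurrent requests each holding ~2,048 tokens of context
requests, context_tokens = 64, 2048
print(requests * context_tokens * per_token / 2**30)   # 8.0 GiB of KV cache
```

Eight gigabytes of cache on top of the ~4GB of weights, plus activations and framework overhead, leaves essentially no headroom on a 16GB card, which is consistent with both cards falling over at 64 parallel queries while the 32GB and 96GB cards keep scaling.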
Both cards hit out-of-memory errors at 64 parallel queries. Their maximum sustainable throughput:
| GPU | Peak TPS | Parallel Limit | TPS per Dollar |
|---|---|---|---|
| RTX 5090 | 8,691 | 128 | 2.72 |
| RTX PRO 6000 | 7,648 | 128 | 0.90 |
| RTX 5070 Ti | 2,689 | 32 | 2.69 |
| RTX 5060 Ti | 1,259 | 32 | 2.80 |
The 5060 Ti delivers the best tokens-per-dollar. The 5070 Ti and 5090 are nearly tied at ~2.7 TPS/$, while the PRO 6000 lags at 0.90.
The PRO 6000? It’s not competing in the same category. That 96GB VRAM lets you run GPT-OSS 120B with a full 128k context window—something none of the other cards can touch. For an 8B model benchmark, the VRAM is wasted. For production workloads with frontier-class open models, it’s the only option.
Memory Bandwidth is Everything
Look at how throughput scales with memory bandwidth:
| GPU | Bandwidth | Peak TPS | TPS per GB/s |
|---|---|---|---|
| RTX 5090 | 1,792 GB/s | 8,691 | 4.85 |
| RTX PRO 6000 | 1,792 GB/s | 7,648 | 4.27 |
| RTX 5070 Ti | 896 GB/s | 2,689 | 3.00 |
| RTX 5060 Ti | 448 GB/s | 1,259 | 2.81 |
The relationship is close to proportional: double the bandwidth and you roughly double the tokens per second, with the bigger cards squeezing out a little extra per GB/s thanks to their additional compute.
This is why LLM inference is called “memory-bandwidth bound.” The GPU spends most of its time waiting for weights to load from VRAM, not computing. More bandwidth = less waiting = more tokens.
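A quick roofline-style estimate makes this concrete. At batch size 1, generating each token means streaming essentially all of the weights from VRAM, so bandwidth divided by weight size is a hard ceiling on single-stream speed (using the ~4GB INT4 weight figure from above as an approximation):

```python
# Upper bound on single-stream decode speed: every token reads ~all weights once.
weights_gb = 4  # approximate GPTQ-INT4 weight size for Llama 3.1 8B
for name, bandwidth in [("RTX 5090", 1792), ("RTX 5070 Ti", 896), ("RTX 5060 Ti", 448)]:
    print(f"{name}: <= {bandwidth / weights_gb:.0f} tokens/s at batch size 1")
```

The 5090's measured 205 TPS at a single query sits well under its roughly 450 tokens/s ceiling, which is what you'd expect once dequantization, KV cache reads, and scheduling overhead are counted. Batching is how the cards reach thousands of TPS: many requests share a single pass over the weights, so the same bandwidth produces far more tokens.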
The CPU Surprise
I also tested CPU inference on an AMD Ryzen 9 9950X (16 cores, 32 threads). Peak throughput: 74 tokens/second at 64 parallel queries.
That’s 117x slower than the RTX 5090.
But here’s the real kicker: Intel Core Ultra CPUs don’t work at all. vLLM’s CPU backend requires AVX-512 instructions, which Intel dropped from its desktop line after 11th gen, and the older Xeon CPUs I had access to for testing never included them either. (To be fair, older AMD CPUs also omit these instructions, and newer Intel server CPUs do include them.)
If you’re planning CPU inference with vLLM, you need a chip with AVX-512: a recent AMD Ryzen or an AVX-512-capable Xeon. The Ryzen 9 9950X works; modern Intel desktop CPUs don’t.
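Before committing a box to CPU inference, it's worth checking the flag directly. A quick Linux sketch (grep /proc/cpuinfo for avx512f, the foundation subset vLLM's x86 backend needs) looks like this:

```python
# Check whether this Linux machine advertises the AVX-512 foundation
# instructions ("avx512f") that vLLM's x86 CPU backend requires.
def has_avx512f() -> bool:
    with open("/proc/cpuinfo") as f:
        return any("avx512f" in line for line in f if line.startswith("flags"))

if __name__ == "__main__":
    status = "present" if has_avx512f() else "missing (vLLM's CPU backend won't run)"
    print(f"AVX-512F: {status}")
```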
When to Buy What
RTX 5090 ($3,200): Best choice for models up to 30B parameters with modest context windows. Highest throughput for production workloads. The typical card is 3 to 4 slots wide and draws close to 600W, so packing multiple GPUs into a server is difficult.
RTX PRO 6000 ($8,500): The only card that can run GPT-OSS 120B, Llama 3.1 70B, or similar frontier models with full context windows. If you need 96GB VRAM, this is your only single-GPU option—and it’s worth every penny for that use case. With nearly the performance of the 575W RTX 5090 in a two-slot, 300W package, it’s my top choice if you have the budget.
RTX 5070 Ti ($1,000): Solid mid-range option. Handles 8B models well, but you’ll hit VRAM limits before GPU limits at high concurrency.
RTX 5060 Ti ($450): Best value per dollar. Good for development, testing, and single-user applications. The entry point for serious local inference. Great option for local embedding models to avoid network latency.
CPU (AMD Ryzen 9 9950X): Baseline only, 117x slower than the RTX 5090. Only useful when you can’t install a GPU or need the absolute lowest upfront cost.
The Bottom Line
You can run Llama 3.1 8B at 1,259 tokens/second for $450. That’s the headline.
The RTX 5060 Ti won’t win any benchmarks, but it delivers 17x the performance of CPU inference at a price most developers can expense without approval. For prototyping, testing, and single-user applications, it’s more than enough—and it has the best TPS-per-dollar of any card tested.
When you need production throughput, the RTX 5090 is the workhorse—8,691 TPS at $3,200, beating the $8,500 PRO 6000 Max-Q by 14%. Memory bandwidth determines LLM throughput, and both cards share identical bandwidth; the 5090 just has more power to use it. (The RTX PRO 6000 Blackwell is also available in a 600W non-Max-Q version that likely outperforms the RTX 5090, but I didn’t have one to test.)
The PRO 6000 plays a different game entirely. Its 96GB VRAM runs GPT-OSS 120B with a full 128k context window on a single card—impossible on any consumer GPU. If you’re deploying frontier-class open models, the PRO 6000 isn’t overpriced; it’s the only option. For 8B models, you’re paying for capacity you won’t use.
And if you’re considering CPU inference: make sure the chip has AVX-512, which today mostly means AMD. Intel’s desktop parts left AVX-512 behind, and vLLM left them behind with it.
Benchmarks run January 9, 2026. Hardware: RTX 5090, RTX PRO 6000 Blackwell Max-Q, RTX 5070 Ti, RTX 5060 Ti, AMD Ryzen 9 9950X. Model: Llama 3.1 8B Instruct in GPTQ-INT4 and NVFP4 quantization. Framework: vLLM 0.13.0 with FP8 KV cache. Test prompt: 500-token essay generation. All GPUs tested at parallelism levels 1-128 until OOM.