TL;DR: RAG pipelines flip the local vs cloud economics completely. vLLM embeddings hit 8,091 chunks/sec—385x faster than cloud and 1,000x cheaper ($0.01 vs $10 per million chunks). Even Ollama manages 117 chunks/sec, 6x faster than OpenRouter. Reranking on a $1,000 RTX 5070 Ti scores 329 docs/sec. For enterprise RAG processing 100M chunks/month, the RTX 6000 Blackwell Pro Q-Max pays for itself in about 8.5 months on embedding savings alone. The choice between Ollama and vLLM matters enormously: vLLM’s native batching delivers 70x better throughput than Ollama’s sequential processing on the same hardware.
In Parts 1 and 2, I benchmarked LLM generation—the part of AI that writes responses. The verdict: cloud wins on cost for low-volume, local wins on latency and privacy.
But most production AI systems aren’t just generating text. They’re doing Retrieval-Augmented Generation (RAG)—searching a knowledge base, ranking results, then generating answers grounded in retrieved documents. RAG pipelines have two additional compute-intensive steps: embedding and reranking.
The question: how do those two extra steps change the calculation?
The RAG Pipeline
A typical RAG system:
1. Embed the query → Convert user question to a vector
2. Vector search → Find similar documents (handled by vector DB)
3. Rerank results → Score relevance of top-K candidates
4. Generate response → LLM synthesizes answer from context
Steps 1 and 3 hit your inference hardware. Step 2 is your vector database. Step 4 we covered in Parts 1-2. Tonight we’re benchmarking embedding and reranking.
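If you want to see the shape of the whole flow in code, here's a minimal sketch. It assumes OpenAI-compatible local endpoints (vLLM for embeddings, a vLLM-style rerank route for the reranker, and whatever generation server you prefer); the URLs, ports, and model names are placeholders, and the vector search is stubbed out. This is not my benchmark harness, just the pattern.

```python
# Minimal RAG pipeline sketch. Endpoints, ports, and model names are assumptions,
# and vector_search() is a stub standing in for your vector database.
import requests
from openai import OpenAI

EMBED_API = "http://localhost:8000/v1"          # assumed: vLLM serving the embedding model
RERANK_API = "http://localhost:8001/v1/rerank"  # assumed: vLLM rerank route for the reranker
LLM_API = "http://localhost:8002/v1"            # assumed: generation server from Parts 1-2

embed_client = OpenAI(base_url=EMBED_API, api_key="not-needed")
llm_client = OpenAI(base_url=LLM_API, api_key="not-needed")


def vector_search(query_vec: list[float], top_k: int = 100) -> list[str]:
    """Step 2: vector DB lookup. Stubbed here; plug in your own database."""
    return [f"placeholder chunk {i} from the knowledge base" for i in range(top_k)]


def answer(question: str) -> str:
    # Step 1: embed the query.
    q_vec = embed_client.embeddings.create(
        model="Qwen/Qwen3-Embedding-0.6B", input=question
    ).data[0].embedding

    # Step 2: vector search (handled by the vector DB, not the GPU).
    candidates = vector_search(q_vec, top_k=100)

    # Step 3: rerank the candidates and keep the best few for context
    # (assumes a Jina/Cohere-style response: {"results": [{"index", "relevance_score"}]}).
    results = requests.post(RERANK_API, json={
        "model": "Qwen/Qwen3-Reranker-4B",
        "query": question,
        "documents": candidates,
    }).json()["results"]
    results.sort(key=lambda r: r["relevance_score"], reverse=True)
    context = "\n\n".join(candidates[r["index"]] for r in results[:10])

    # Step 4: generate an answer grounded in the retrieved context.
    reply = llm_client.chat.completions.create(
        model="local-llm",  # placeholder model name
        messages=[{
            "role": "user",
            "content": f"Answer using only this context:\n{context}\n\nQuestion: {question}",
        }],
    )
    return reply.choices[0].message.content
```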
The Setup
Embedding model: Qwen3-Embedding 0.6B (1024-dim vectors, BF16)
- Local: Ollama on RTX 6000 Blackwell Pro Q-Max
- Cloud: OpenRouter → OpenAI text-embedding-3-small (1536-dim)
- Test chunks: ~500 tokens each (typical RAG chunk size)
Reranking model: Qwen3-Reranker 4B
- Local: vLLM on RTX 5070 Ti ($1,000 GPU, ultra9 server)
- Cloud: Qwen3-Reranker wasn’t responding on OpenRouter during testing (Cohere Rerank used for cost comparison)
This brings up another advantage of local: model availability on cloud is outside your control. OpenRouter handles provider outages better than most by routing to alternatives, but it remains an issue—your production RAG pipeline shouldn’t fail because a cloud endpoint went down.
Network: 2.5 Gbps bidirectional WAN, 5ms latency (to Cloudflare), 25 Gbps LAN. This is top-end for most office connections—typical offices with slower links or higher latency would see an even larger advantage for local embeddings and reranking.
Embedding Results
| Backend | Concurrency | Chunks/sec | Latency/chunk |
|---|---|---|---|
| Ollama (local) | 1 | 24 | 42ms |
| Ollama (local) | 4 | 67 | 15ms |
| Ollama (local) | 16 | 114 | 9ms |
| Ollama (local) | 64 | 117 | 9ms |
| vLLM (local) | 4 batches | 8,091 | 0.1ms |
| vLLM (local) | 16 batches | 3,586 | 0.3ms |
| OpenRouter (cloud) | batch | 21 | 48ms |
The vLLM result is stunning: 8,091 chunks/second—that’s 70x faster than Ollama and 385x faster than cloud.
vLLM’s native batching is the key. While Ollama processes embeddings one at a time (even with concurrent requests), vLLM batches them efficiently on the GPU. The 0.6B model in BF16 is small enough that vLLM can process massive batches with near-zero latency per chunk.
Ollama tops out around 117 chunks/sec regardless of concurrency—the bottleneck is its sequential processing, not the GPU. OpenRouter’s cloud batching hits 21 chunks/sec, limited by network round-trips and API overhead.
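The request pattern is what makes the difference. Here's a rough sketch of the two approaches, assuming both servers expose the OpenAI-compatible /v1/embeddings route (vLLM natively, Ollama via its OpenAI compatibility layer) on their default ports; the model tags are placeholders for whatever you have loaded.

```python
# Embedding 1,000 chunks two ways. Ports and model tags are assumptions; both
# servers are assumed to expose the OpenAI-compatible /v1/embeddings route.
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

chunks = [f"chunk {i} ... (~500 tokens of text)" for i in range(1000)]

# vLLM: send the whole list in one request and let the server batch it on the GPU.
vllm = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
vllm_vectors = [
    d.embedding
    for d in vllm.embeddings.create(model="Qwen/Qwen3-Embedding-0.6B", input=chunks).data
]

# Ollama: even with 16 client workers, the server works through the texts
# one at a time, which is why throughput plateaus around 117 chunks/sec.
ollama = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")


def embed_one(text: str) -> list[float]:
    # model tag is a placeholder for whatever embedding model you've pulled
    return ollama.embeddings.create(model="qwen3-embedding:0.6b", input=text).data[0].embedding


with ThreadPoolExecutor(max_workers=16) as pool:
    ollama_vectors = list(pool.map(embed_one, chunks))
```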
Reranking Results
| Backend | Concurrency | Docs/sec | Latency/doc |
|---|---|---|---|
| vLLM ultra9 (RTX 5070 Ti) | 1 | 104 | 10ms |
| vLLM ultra9 (RTX 5070 Ti) | 8 | 329 | 3.0ms |
The RTX 5070 Ti—a $1,000 consumer GPU—reranks 329 documents per second with 8 concurrent requests. That’s 3.0ms per document.
For context: a typical RAG query retrieves 20-100 candidate documents from vector search, then reranks them. At 329 docs/sec, reranking 100 documents takes 300ms. That’s imperceptible in a user-facing application.
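A minimal sketch of that rerank call, assuming vLLM's Jina/Cohere-style /v1/rerank route is enabled for the reranker; the URL, model name, and sample documents are placeholders.

```python
# Rerank 100 candidates against one query (assumed vLLM rerank route and model name).
import time
import requests

query = "How do I rotate an API key?"
candidates = [f"candidate document {i} ..." for i in range(100)]

t0 = time.perf_counter()
resp = requests.post("http://localhost:8001/v1/rerank", json={
    "model": "Qwen/Qwen3-Reranker-4B",
    "query": query,
    "documents": candidates,
})
resp.raise_for_status()
results = sorted(resp.json()["results"],
                 key=lambda r: r["relevance_score"], reverse=True)
top_docs = [candidates[r["index"]] for r in results[:10]]  # keep the 10 best for context
print(f"reranked {len(candidates)} docs in {time.perf_counter() - t0:.3f}s")
```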
The Cost Analysis
OpenAI text-embedding-3-small pricing: $0.02 per 1M tokens
A typical RAG chunk is ~500 tokens. At $0.02/M tokens:
- Cost per 1M chunks embedded: $10
- At 21 chunks/sec cloud throughput: 13 hours to embed 1M chunks
Local embedding costs:
RTX 6000 Blackwell Pro Q-Max hourly cost: $0.39 (from Part 1)
| Backend | Chunks/sec | Time for 1M | Cost per 1M |
|---|---|---|---|
| OpenRouter | 21 | 13.2 hours | $10.00 |
| Ollama | 117 | 2.4 hours | $0.94 |
| vLLM | 8,091 | 2 minutes | $0.01 |
With vLLM, embedding 1 million chunks costs about a penny and takes 2 minutes. That’s 1,000x cheaper than cloud.
| Metric | Cloud | vLLM Local | Advantage |
|---|---|---|---|
| Cost per 1M chunks | $10.00 | $0.01 | 1,000x cheaper |
| Time to embed 1M | 13.2 hours | 2 minutes | 385x faster |
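If you want to plug in your own prices or throughput, the arithmetic behind both tables is just this:

```python
# Back-of-envelope math behind the embedding cost tables (figures from this post).
TOKENS_PER_CHUNK = 500
CLOUD_PRICE_PER_M_TOKENS = 0.02   # OpenAI text-embedding-3-small, $ per 1M tokens
GPU_HOURLY_COST = 0.39            # RTX 6000 Blackwell Pro Q-Max, from Part 1
CHUNKS = 1_000_000

cloud_cost = CHUNKS * TOKENS_PER_CHUNK / 1_000_000 * CLOUD_PRICE_PER_M_TOKENS
print(f"OpenRouter: {CHUNKS / 21 / 3600:.1f} h, ${cloud_cost:.2f} per 1M chunks")  # 13.2 h, $10.00

for name, chunks_per_sec in [("Ollama", 117), ("vLLM", 8091)]:
    hours = CHUNKS / chunks_per_sec / 3600
    print(f"{name}: {hours:.2f} h, ${hours * GPU_HOURLY_COST:.2f} per 1M chunks")
    # Ollama: ~2.4 h, about a dollar; vLLM: ~2 minutes, about a penny.
```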
For reranking, cloud options like Cohere Rerank cost ~$1 per 1K searches. Local reranking on a $1,000 RTX 5070 Ti is essentially free after hardware costs.
Break-Even for RAG Workloads
The RTX 6000 Blackwell Pro Q-Max at ~$8,500 breaks even on embedding alone at:
$8,500 / ($10.00 - $0.01 saved per 1M chunks) ≈ 850M chunks
With vLLM at 8,091 chunks/sec, that’s just 29 hours of continuous embedding.
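Extending the same arithmetic to monthly volumes gives the table below (figures rounded):

```python
# Break-even for the embedding GPU, using savings of $10.00 - $0.01 per 1M chunks.
GPU_PRICE = 8_500
SAVINGS_PER_M_CHUNKS = 10.00 - 0.01

breakeven_chunks = GPU_PRICE / SAVINGS_PER_M_CHUNKS * 1_000_000     # ~850M chunks
print(f"break-even volume: {breakeven_chunks / 1e6:.0f}M chunks")
print(f"continuous vLLM time: {breakeven_chunks / 8091 / 3600:.0f} h")  # ~29 h

for name, chunks_per_month in [("Light", 100_000), ("Medium", 1_000_000),
                               ("Heavy", 10_000_000), ("Enterprise", 100_000_000)]:
    months = breakeven_chunks / chunks_per_month
    print(f"{name}: {months:,.1f} months to break even")
```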
| Usage | Chunks/month | Break-even |
|---|---|---|
| Light (100K chunks/month) | 100K | ~700 years |
| Medium (1M chunks/month) | 1M | ~70 years |
| Heavy (10M chunks/month) | 10M | ~7 years |
| Enterprise (100M chunks/month) | 100M | ~8.5 months |
For enterprise RAG deployments processing 100M+ chunks monthly—customer support knowledge bases, legal document search, enterprise wikis—the GPU pays for itself on embedding savings alone in well under a year.
The RTX 5070 Ti Sweet Spot
The $1,000 RTX 5070 Ti running reranking:
- 329 docs/sec reranking throughput (8 concurrent)
- 16GB VRAM fits the 4B reranker model
- Power: 300W GPU + 150W system = 450W total
- Hourly cost: ~$0.11 (electricity + 3yr depreciation)
Break-even vs Cohere Rerank (~$1/1K searches):
| Daily Queries | Cloud Cost/Day | Break-even |
|---|---|---|
| 1,000 | $1 | 2.7 years |
| 10,000 | $10 | 3.3 months |
| 50,000 | $50 | 20 days |
For high-volume RAG (50K+ queries/day), the RTX 5070 Ti pays for itself in under a month.
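The division behind that table, ignoring local running costs as the table above does:

```python
# Reranker break-even: $1,000 GPU vs Cohere Rerank at ~$1 per 1K searches.
GPU_PRICE = 1_000
CLOUD_COST_PER_SEARCH = 1.00 / 1_000   # ~$0.001 per reranked query

for queries_per_day in (1_000, 10_000, 50_000):
    days = GPU_PRICE / (queries_per_day * CLOUD_COST_PER_SEARCH)
    print(f"{queries_per_day:>6} queries/day: {days:,.0f} days to break even")
    # 1,000 days (~2.7 yr), 100 days (~3.3 mo), 20 days
```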
The Latency Advantage
Beyond cost, local RAG components transform user experience:
| Operation | Cloud | Ollama | vLLM |
|---|---|---|---|
| Embed query | 48ms | 9ms | 0.1ms |
| Rerank 100 docs | ~500ms | N/A | 300ms |
| Total RAG overhead | ~548ms | ~309ms | ~300ms |
With vLLM, embedding latency is effectively zero. Combined with the LLM latency advantages from Part 1 (88ms vs 760ms TTFT for Gemma), local RAG delivers noticeably snappier responses.
When Local RAG Wins
Always wins:
- High-volume embedding (10M+ chunks/month)
- Reranking-heavy workloads (1K+ queries/day)
- Latency-sensitive applications
- Privacy-critical document search
Cloud might win:
- Low-volume, occasional indexing
- Burst capacity for one-time migrations
- When you need OpenAI’s larger embedding models
The Hybrid Architecture
My recommendation for most RAG deployments:
- vLLM for embeddings—the 1,000x cost advantage over cloud is too large to ignore
- Local reranking on even modest hardware—a $1,000 RTX 5070 Ti handles enterprise load
- Local or cloud LLM depending on volume and privacy needs (see Parts 1-2)
- Ollama for convenience, vLLM for performance—same model, 70x throughput difference
The GPU rack in my basement now runs triple duty: LLM inference, embedding, and reranking. The space heater metaphor from Part 1 keeps getting more literal—but the economics finally work out.
Benchmarks run January 5, 2026. Embedding: Qwen3-Embedding-0.6B (BF16) via Ollama and vLLM on RTX 6000 Blackwell Pro Q-Max. Reranking: Qwen3-Reranker-4B via vLLM on RTX 5070 Ti. Cloud: OpenRouter → OpenAI text-embedding-3-small. Network: 2.5 Gbps WAN, 5ms latency, 25 Gbps LAN. Test corpus: 80 chunks (~500 tokens each). Concurrency tested: 1, 4, 16, 64 workers.