TL;DR: RAG pipelines flip the local vs cloud economics completely. vLLM embeddings hit 8,091 chunks/sec—385x faster than cloud and 1,000x cheaper ($0.01 vs $10 per million chunks). Even Ollama manages 117 chunks/sec, 6x faster than OpenRouter. Reranking on a $1,000 RTX 5070 Ti scores 329 docs/sec. For enterprise RAG processing 100M chunks/month, the RTX 6000 Blackwell Pro Q-Max pays for itself in about 8.5 months on embedding savings alone. The choice between Ollama and vLLM matters enormously: vLLM’s native batching delivers 70x better throughput than Ollama’s sequential processing on the same hardware.
In Parts 1 and 2, I benchmarked LLM generation—the part of AI that writes responses. The verdict: cloud wins on cost for low-volume, local wins on latency and privacy.
But most production AI systems aren’t just generating text. They’re doing Retrieval-Augmented Generation (RAG)—searching a knowledge base, ranking results, then generating answers grounded in retrieved documents. RAG pipelines have two additional compute-intensive steps: embedding and reranking.
The question: how do those two extra steps change the calculation?
The RAG Pipeline
A typical RAG system:
1. Embed the query → Convert user question to a vector
2. Vector search → Find similar documents (handled by vector DB)
3. Rerank results → Score relevance of top-K candidates
4. Generate response → LLM synthesizes answer from context
Steps 1 and 3 hit your inference hardware. Step 2 is your vector database. Step 4 we covered in Parts 1-2. Tonight we’re benchmarking embedding and reranking.
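If you want to see the shape of the whole flow in code, here's a minimal sketch. It assumes OpenAI-compatible local endpoints (vLLM for embeddings, a vLLM-style rerank route for the reranker, and whatever generation server you prefer); the URLs, ports, and model names are placeholders, and the vector search is stubbed out. This is not my benchmark harness, just the pattern.

```python
# Minimal RAG pipeline sketch. Endpoints, ports, and model names are assumptions,
# and vector_search() is a stub standing in for your vector database.
import requests
from openai import OpenAI

EMBED_API = "http://localhost:8000/v1"          # assumed: vLLM serving the embedding model
RERANK_API = "http://localhost:8001/v1/rerank"  # assumed: vLLM rerank route for the reranker
LLM_API = "http://localhost:8002/v1"            # assumed: generation server from Parts 1-2

embed_client = OpenAI(base_url=EMBED_API, api_key="not-needed")
llm_client = OpenAI(base_url=LLM_API, api_key="not-needed")


def vector_search(query_vec: list[float], top_k: int = 100) -> list[str]:
    """Step 2: vector DB lookup. Stubbed here; plug in your own database."""
    return [f"placeholder chunk {i} from the knowledge base" for i in range(top_k)]


def answer(question: str) -> str:
    # Step 1: embed the query.
    q_vec = embed_client.embeddings.create(
        model="Qwen/Qwen3-Embedding-0.6B", input=question
    ).data[0].embedding

    # Step 2: vector search (handled by the vector DB, not the GPU).
    candidates = vector_search(q_vec, top_k=100)

    # Step 3: rerank the candidates and keep the best few for context
    # (assumes a Jina/Cohere-style response: {"results": [{"index", "relevance_score"}]}).
    results = requests.post(RERANK_API, json={
        "model": "Qwen/Qwen3-Reranker-4B",
        "query": question,
        "documents": candidates,
    }).json()["results"]
    results.sort(key=lambda r: r["relevance_score"], reverse=True)
    context = "\n\n".join(candidates[r["index"]] for r in results[:10])

    # Step 4: generate an answer grounded in the retrieved context.
    reply = llm_client.chat.completions.create(
        model="local-llm",  # placeholder model name
        messages=[{
            "role": "user",
            "content": f"Answer using only this context:\n{context}\n\nQuestion: {question}",
        }],
    )
    return reply.choices[0].message.content
```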
The Setup
Embedding model: Qwen3-Embedding 0.6B (1024-dim vectors, BF16)
- Local: Ollama on RTX 6000 Blackwell Pro Q-Max
- Cloud: OpenRouter → OpenAI text-embedding-3-small (1536-dim)
- Test chunks: ~500 tokens each (typical RAG chunk size)
Reranking model: Qwen3-Reranker 4B
- Local: vLLM on RTX 5070 Ti ($1,000 GPU, ultra9 server)
- Cloud: Qwen3-Reranker wasn’t responding on OpenRouter during testing (Cohere Rerank used for cost comparison)
This brings up another advantage of local: model availability on cloud is outside your control. OpenRouter handles provider outages better than most by routing to alternatives, but it remains an issue—your production RAG pipeline shouldn’t fail because a cloud endpoint went down.
Network: 2.5 Gbps bidirectional WAN, 5ms latency (to Cloudflare), 25 Gbps LAN. This is top-end for most office connections—typical offices with slower links or higher latency would see an even larger advantage for local embeddings and reranking.
Embedding Results
| Backend | Concurrency | Chunks/sec | Latency/chunk |
|---|---|---|---|
| Ollama (local) | 1 | 24 | 42ms |
| Ollama (local) | 4 | 67 | 15ms |
| Ollama (local) | 16 | 114 | 9ms |
| Ollama (local) | 64 | 117 | 9ms |
| vLLM (local) | 4 batches | 8,091 | 0.1ms |
| vLLM (local) | 16 batches | 3,586 | 0.3ms |
| OpenRouter (cloud) | batch | 21 | 48ms |
The vLLM result is stunning: 8,091 chunks/second—that’s 70x faster than Ollama and 385x faster than cloud.
vLLM’s native batching is the key. While Ollama processes embeddings one at a time (even with concurrent requests), vLLM batches them efficiently on the GPU. The 0.6B model in BF16 is small enough that vLLM can process massive batches with near-zero latency per chunk.
Ollama tops out around 117 chunks/sec regardless of concurrency—the bottleneck is its sequential processing, not the GPU. OpenRouter’s cloud batching hits 21 chunks/sec, limited by network round-trips and API overhead.
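The request pattern is what makes the difference. Here's a rough sketch of the two approaches, assuming both servers expose the OpenAI-compatible /v1/embeddings route (vLLM natively, Ollama via its OpenAI compatibility layer) on their default ports; the model tags are placeholders for whatever you have loaded.

```python
# Embedding 1,000 chunks two ways. Ports and model tags are assumptions; both
# servers are assumed to expose the OpenAI-compatible /v1/embeddings route.
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

chunks = [f"chunk {i} ... (~500 tokens of text)" for i in range(1000)]

# vLLM: send the whole list in one request and let the server batch it on the GPU.
vllm = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
vllm_vectors = [
    d.embedding
    for d in vllm.embeddings.create(model="Qwen/Qwen3-Embedding-0.6B", input=chunks).data
]

# Ollama: even with 16 client workers, the server works through the texts
# one at a time, which is why throughput plateaus around 117 chunks/sec.
ollama = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")


def embed_one(text: str) -> list[float]:
    # model tag is a placeholder for whatever embedding model you've pulled
    return ollama.embeddings.create(model="qwen3-embedding:0.6b", input=text).data[0].embedding


with ThreadPoolExecutor(max_workers=16) as pool:
    ollama_vectors = list(pool.map(embed_one, chunks))
```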
Reranking Results
| Backend | Concurrency | Docs/sec | Latency/doc |
|---|---|---|---|
| vLLM ultra9 (RTX 5070 Ti) | 1 | 104 | 10ms |
| vLLM ultra9 (RTX 5070 Ti) | 8 | 329 | 3.0ms |
The RTX 5070 Ti—a $1,000 consumer GPU—reranks 329 documents per second with 8 concurrent requests. That’s 3.0ms per document.
For context: a typical RAG query retrieves 20-100 candidate documents from vector search, then reranks them. At 329 docs/sec, reranking 100 documents takes 300ms. That’s imperceptible in a user-facing application.
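A minimal sketch of that rerank call, assuming vLLM's Jina/Cohere-style /v1/rerank route is enabled for the reranker; the URL, model name, and sample documents are placeholders.

```python
# Rerank 100 candidates against one query (assumed vLLM rerank route and model name).
import time
import requests

query = "How do I rotate an API key?"
candidates = [f"candidate document {i} ..." for i in range(100)]

t0 = time.perf_counter()
resp = requests.post("http://localhost:8001/v1/rerank", json={
    "model": "Qwen/Qwen3-Reranker-4B",
    "query": query,
    "documents": candidates,
})
resp.raise_for_status()
results = sorted(resp.json()["results"],
                 key=lambda r: r["relevance_score"], reverse=True)
top_docs = [candidates[r["index"]] for r in results[:10]]  # keep the 10 best for context
print(f"reranked {len(candidates)} docs in {time.perf_counter() - t0:.3f}s")
```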
The Cost Analysis
OpenAI text-embedding-3-small pricing: $0.02 per 1M tokens
A typical RAG chunk is ~500 tokens. At $0.02/M tokens:
- Cost per 1M chunks embedded: $10
- At 21 chunks/sec cloud throughput: 13 hours to embed 1M chunks
Local embedding costs:
RTX 6000 Blackwell Pro Q-Max hourly cost: $0.39 (from Part 1)
| Backend | Chunks/sec | Time for 1M | Cost per 1M |
|---|---|---|---|
| OpenRouter | 21 | 13.2 hours | $10.00 |
| Ollama | 117 | 2.4 hours | $0.94 |
| vLLM | 8,091 | 2 minutes | $0.01 |
With vLLM, embedding 1 million chunks costs about a penny and takes 2 minutes. That’s 1,000x cheaper than cloud.
| Metric | Cloud | vLLM Local | Advantage |
|---|---|---|---|
| Cost per 1M chunks | $10.00 | $0.01 | 1,000x cheaper |
| Time to embed 1M | 13.2 hours | 2 minutes | 385x faster |
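If you want to plug in your own prices or throughput, the arithmetic behind both tables is just this:

```python
# Back-of-envelope math behind the embedding cost tables (figures from this post).
TOKENS_PER_CHUNK = 500
CLOUD_PRICE_PER_M_TOKENS = 0.02   # OpenAI text-embedding-3-small, $ per 1M tokens
GPU_HOURLY_COST = 0.39            # RTX 6000 Blackwell Pro Q-Max, from Part 1
CHUNKS = 1_000_000

cloud_cost = CHUNKS * TOKENS_PER_CHUNK / 1_000_000 * CLOUD_PRICE_PER_M_TOKENS
print(f"OpenRouter: {CHUNKS / 21 / 3600:.1f} h, ${cloud_cost:.2f} per 1M chunks")  # 13.2 h, $10.00

for name, chunks_per_sec in [("Ollama", 117), ("vLLM", 8091)]:
    hours = CHUNKS / chunks_per_sec / 3600
    print(f"{name}: {hours:.2f} h, ${hours * GPU_HOURLY_COST:.2f} per 1M chunks")
    # Ollama: ~2.4 h, about a dollar; vLLM: ~2 minutes, about a penny.
```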
For reranking, cloud options like Cohere Rerank cost ~$1 per 1K searches. Local reranking on a $1,000 RTX 5070 Ti is essentially free after hardware costs.
Break-Even for RAG Workloads
The RTX 6000 Blackwell Pro Q-Max at ~$8,500 breaks even on embedding alone at:
$8,500 / ($10.00 - $0.01 saved per 1M chunks) ≈ 850M chunks
With vLLM at 8,091 chunks/sec, that’s just 29 hours of continuous embedding.
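Extending the same arithmetic to monthly volumes gives the table below (figures rounded):

```python
# Break-even for the embedding GPU, using savings of $10.00 - $0.01 per 1M chunks.
GPU_PRICE = 8_500
SAVINGS_PER_M_CHUNKS = 10.00 - 0.01

breakeven_chunks = GPU_PRICE / SAVINGS_PER_M_CHUNKS * 1_000_000     # ~850M chunks
print(f"break-even volume: {breakeven_chunks / 1e6:.0f}M chunks")
print(f"continuous vLLM time: {breakeven_chunks / 8091 / 3600:.0f} h")  # ~29 h

for name, chunks_per_month in [("Light", 100_000), ("Medium", 1_000_000),
                               ("Heavy", 10_000_000), ("Enterprise", 100_000_000)]:
    months = breakeven_chunks / chunks_per_month
    print(f"{name}: {months:,.1f} months to break even")
```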
| Usage | Chunks/month | Break-even |
|---|---|---|
| Light (100K chunks/month) | 100K | ~700 years |
| Medium (1M chunks/month) | 1M | ~70 years |
| Heavy (10M chunks/month) | 10M | ~7 years |
| Enterprise (100M chunks/month) | 100M | ~8.5 months |
For enterprise RAG deployments processing 100M+ chunks monthly—customer support knowledge bases, legal document search, enterprise wikis—the GPU pays for itself on embedding savings alone in well under a year.
The RTX 5070 Ti Sweet Spot
The $1,000 RTX 5070 Ti running reranking:
- 329 docs/sec reranking throughput (8 concurrent)
- 16GB VRAM fits the 4B reranker model
- Power: 300W GPU + 150W system = 450W total
- Hourly cost: ~$0.11 (electricity + 3yr depreciation)
Break-even vs Cohere Rerank (~$1/1K searches):
| Daily Queries | Cloud Cost/Day | Break-even |
|---|---|---|
| 1,000 | $1 | 2.7 years |
| 10,000 | $10 | 3.3 months |
| 50,000 | $50 | 20 days |
For high-volume RAG (50K+ queries/day), the RTX 5070 Ti pays for itself in under a month.
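The division behind that table, ignoring local running costs as the table above does:

```python
# Reranker break-even: $1,000 GPU vs Cohere Rerank at ~$1 per 1K searches.
GPU_PRICE = 1_000
CLOUD_COST_PER_SEARCH = 1.00 / 1_000   # ~$0.001 per reranked query

for queries_per_day in (1_000, 10_000, 50_000):
    days = GPU_PRICE / (queries_per_day * CLOUD_COST_PER_SEARCH)
    print(f"{queries_per_day:>6} queries/day: {days:,.0f} days to break even")
    # 1,000 days (~2.7 yr), 100 days (~3.3 mo), 20 days
```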
The Latency Advantage
Beyond cost, local RAG components transform user experience:
| Operation | Cloud | Ollama | vLLM |
|---|---|---|---|
| Embed query | 48ms | 9ms | 0.1ms |
| Rerank 100 docs | ~500ms | N/A | 300ms |
| Total RAG overhead | ~548ms | ~309ms | ~300ms |
With vLLM, embedding latency is effectively zero. Combined with the LLM latency advantages from Part 1 (88ms vs 760ms TTFT for Gemma), local RAG delivers noticeably snappier responses.
When Local RAG Wins
Always wins:
- High-volume embedding (10M+ chunks/month)
- Reranking-heavy workloads (1K+ queries/day)
- Latency-sensitive applications
- Privacy-critical document search
Cloud might win:
- Low-volume, occasional indexing
- Burst capacity for one-time migrations
- When you need OpenAI’s larger embedding models
The Hybrid Architecture
My recommendation for most RAG deployments:
- vLLM for embeddings—the 1,000x cost advantage over cloud is too large to ignore
- Local reranking on even modest hardware—a $1,000 RTX 5070 Ti handles enterprise load
- Local or cloud LLM depending on volume and privacy needs (see Parts 1-2)
- Ollama for convenience, vLLM for performance—same model, 70x throughput difference
The GPU rack in my basement now runs triple duty: LLM inference, embedding, and reranking. The space heater metaphor from Part 1 keeps getting more literal—but the economics finally work out.
Benchmarks run January 5, 2026. Embedding: Qwen3-Embedding-0.6B (BF16) via Ollama and vLLM on RTX 6000 Blackwell Pro Q-Max. Reranking: Qwen3-Reranker-4B via vLLM on RTX 5070 Ti. Cloud: OpenRouter → OpenAI text-embedding-3-small. Network: 2.5 Gbps WAN, 5ms latency, 25 Gbps LAN. Test corpus: 80 chunks (~500 tokens each). Concurrency tested: 1, 4, 16, 64 workers.