Running Qwen3.5-35B on an RTX 5090 with vLLM: A Practical Guide

If you’ve tried running Qwen3.5-35B-A3B on an RTX 5090 and hit a wall of cryptic Triton OOM errors, you’re not alone. This model runs beautifully on paper — 35 billion parameters but only 3 billion active per token thanks to its sparse mixture-of-experts architecture — yet getting it to actually serve requests on a 32 GB consumer GPU requires navigating a specific and poorly documented pitfall. Here’s everything I learned getting it working with full multimodal support, 128K context, and fp8 KV cache.

The Setup

The model in question is cyankiwi/Qwen3.5-35B-A3B-AWQ-4bit, a community AWQ 4-bit quantization of Alibaba’s Qwen3.5-35B-A3B. At 4-bit precision, the weights come in at roughly 22 GB — a tight fit on the 5090’s 32 GB, but seemingly feasible with room left over for KV cache.

For the serving engine, I’m using vLLM. The critical detail: as of late February 2026, the official vllm/vllm-openai:latest image (v0.16.0) does not include Qwen3.5 model support. You need a nightly or pre-release build that contains the qwen3_5.py model class. In my case, that was a custom image based on vLLM 0.16.0rc2 with CUDA 13.0 and PyTorch 2.10.

The Problem Nobody Warns You About

The model loads fine. Weights take 22 GB. The server starts, healthcheck passes, everything looks perfect. Then you send your first request and the whole thing crashes:

RuntimeError: Triton Error [CUDA]: out of memory

The stack trace points to solve_tril.py inside vLLM’s Flash Linear Attention ops — specifically, the Triton autotuner. This is the key to understanding the problem.

Qwen3.5 uses a novel hybrid architecture. Three out of every four layers use Gated Delta Networks (GDN), a form of linear attention implemented as custom Triton kernels. These kernels use Triton’s @triton.autotune decorator, which benchmarks multiple kernel configurations on first use by allocating temporary GPU buffers. This benchmarking process needs 4-8 GB of free VRAM.

Here’s the catch-22: vLLM pre-allocates KV cache to fill available GPU memory up to your configured --gpu-memory-utilization limit. With a 22 GB model on a 32 GB GPU, even at 0.83 utilization, vLLM claims roughly 26.5 GB total — leaving only about 5.5 GB free. When the Triton autotuner fires on the first inference request, it tries to allocate its temporary benchmarking buffers and finds there’s not enough contiguous free memory. OOM.
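The budget arithmetic above reduces to a few lines. All figures are the rough estimates from the text, not measured values:

```python
# Back-of-the-envelope VRAM budget for the 5090 scenario described above.
# All figures in GiB and approximate.
TOTAL_VRAM = 32.0
MODEL_WEIGHTS = 22.0      # AWQ 4-bit weights
GPU_MEM_UTIL = 0.83       # --gpu-memory-utilization

vllm_claim = TOTAL_VRAM * GPU_MEM_UTIL      # weights + KV cache + activations
free_after_vllm = TOTAL_VRAM - vllm_claim   # headroom left for everything else
autotuner_need = (4.0, 8.0)                 # Triton autotune scratch buffers

print(f"vLLM claims     {vllm_claim:.1f} GiB")       # ~26.6 GiB
print(f"free headroom   {free_after_vllm:.1f} GiB")  # ~5.4 GiB
print(f"autotuner needs {autotuner_need[0]:.0f}-{autotuner_need[1]:.0f} GiB -> OOM")
```

Even at the low end of the autotuner's 4 GB estimate, the ~5.4 GiB of headroom is marginal, and it isn't guaranteed to be contiguous.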

Setting --enforce-eager doesn’t help. That flag disables torch.compile and CUDA graphs, but the Triton autotuner is a completely separate mechanism baked into the kernel definitions themselves. Lowering --gpu-memory-utilization enough to leave headroom for Triton (around 0.72) means there’s no memory left for KV cache, and vLLM refuses to start. You’re stuck.

The Solution: Cross-GPU Triton Cache Warmup

The fix is elegant if you have access to a second GPU with more VRAM — and many workstation and multi-GPU setups do. In my case, I have an RTX PRO 6000 Blackwell alongside the 5090. Both are Blackwell architecture (sm_120), which means their Triton kernel caches are binary-compatible.

The strategy:

  1. Warm up on the big GPU. Run the model on the high-VRAM card with full multimodal enabled. With 96 GB of VRAM available, there’s no memory pressure at all. The Triton autotuner runs happily, benchmarks all its kernel configurations, and writes the results to a cache directory.

  2. Send diverse warmup requests. This is the step people miss. You need to trigger every kernel code path — not just text, but image processing at multiple resolutions, and multi-image requests. Each unique tensor shape produces a different autotuner cache key.

  3. Persist the cache. Mount ~/.cache/triton as a Docker volume so the autotuning results survive container restarts.

  4. Switch to the 5090. Change CUDA_VISIBLE_DEVICES to point at the smaller GPU. Because the Triton cache already contains tuned kernel configurations, the autotuner skips benchmarking entirely — no temporary buffer allocation, no OOM.
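The warmup traffic in step 2 can be sketched with any OpenAI-compatible client. The endpoint URL, served model name, and image URLs below are assumptions for illustration; what matters is shape diversity, since each distinct image resolution (and the multi-image case) produces a different autotune cache key:

```python
# Sketch of diverse warmup traffic: one text-only request, one request per
# image resolution, and one multi-image request. Endpoint and image URLs
# are placeholders -- substitute your own (base64 data URLs also work).
import json, urllib.request

BASE_URL = "http://localhost:8000/v1/chat/completions"  # assumed warmup endpoint
MODEL = "cyankiwi/Qwen3.5-35B-A3B-AWQ-4bit"

def image_part(url):
    return {"type": "image_url", "image_url": {"url": url}}

def build_warmup_payloads(images_by_size):
    """One text-only request, one per image size, one multi-image request."""
    msgs = [[{"role": "user", "content": "Warmup: reply with one word."}]]
    for size, url in images_by_size.items():
        msgs.append([{"role": "user", "content": [
            {"type": "text", "text": f"Describe this {size}x{size} image."},
            image_part(url)]}])
    urls = list(images_by_size.values())
    if len(urls) >= 2:  # multi-image request exercises the batched-image path
        msgs.append([{"role": "user", "content":
                      [{"type": "text", "text": "Compare these two images."}]
                      + [image_part(u) for u in urls[:2]]}])
    return [{"model": MODEL, "max_tokens": 16, "messages": m} for m in msgs]

def send(payload):
    req = urllib.request.Request(BASE_URL, json.dumps(payload).encode(),
                                 {"Content-Type": "application/json"})
    urllib.request.urlopen(req).read()

# Placeholder image URLs; send each payload with send(...) once the server is up.
payloads = build_warmup_payloads({64: "https://example.com/64.png",
                                  256: "https://example.com/256.png",
                                  512: "https://example.com/512.png"})
print(f"{len(payloads)} warmup requests built")  # → "5 warmup requests built"
```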

The Docker Compose Configuration

For the warmup phase on the large GPU, I use CUDA_VISIBLE_DEVICES=0 with --gpu-memory-utilization 0.92 and --max-model-len 131072. Send a handful of text prompts and image requests at various sizes (64x64, 256x256, 512x512, plus a multi-image request). Verify the Triton cache is populated — you should see 70-90+ files in ~/.cache/triton.
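A quick way to verify the cache is populated (the path is the default; adjust if you set TRITON_CACHE_DIR):

```python
# Count files under the Triton cache; Triton keeps one subdirectory per
# compiled kernel variant, holding the binary plus autotune metadata.
import os
from pathlib import Path

def triton_cache_entries(cache_dir=None):
    root = Path(cache_dir or os.path.expanduser("~/.cache/triton"))
    if not root.exists():
        return 0
    return sum(1 for p in root.rglob("*") if p.is_file())

print(f"{triton_cache_entries()} cached files")  # expect roughly 70-90+ after warmup
```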

For production on the 5090, the key settings are:

--gpu-memory-utilization 0.83
--max-model-len 131072          # full 128K context
--max-num-seqs 2
--kv-cache-dtype fp8
--quantization compressed-tensors
--reasoning-parser qwen3
--limit-mm-per-prompt '{"image": 2, "video": 0}'
--enable-prefix-caching
--trust-remote-code

The fp8 KV cache is essential: it halves the per-token cache footprint, which is how the full 128K context fits at all. The compressed-tensors quantization is auto-detected from the AWQ checkpoint, but I specify it explicitly for clarity.
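To make the fp8 savings concrete, here is the standard KV-cache sizing formula. The layer and head counts below are HYPOTHETICAL placeholders (the real values live in the model's config.json); only the formula and the factor-of-two conclusion carry over:

```python
# Rough KV-cache sizing at full context. ATTN_LAYERS, KV_HEADS, and HEAD_DIM
# are hypothetical illustration values, not Qwen3.5's actual config.
CONTEXT = 131072
ATTN_LAYERS = 12   # hypothetical: only ~1 in 4 hybrid layers keeps a KV cache
KV_HEADS = 8       # hypothetical GQA key/value head count
HEAD_DIM = 128     # hypothetical

def kv_cache_gib(bytes_per_elem):
    # two tensors (K and V) per attention layer, per token
    per_token = 2 * ATTN_LAYERS * KV_HEADS * HEAD_DIM * bytes_per_elem
    return CONTEXT * per_token / 2**30

print(f"fp16 KV cache: {kv_cache_gib(2):.1f} GiB")  # 6.0 GiB with these numbers
print(f"fp8  KV cache: {kv_cache_gib(1):.1f} GiB")  # exactly half: 3.0 GiB
```

Whatever the true dimensions, the ratio is what matters: dropping from 16-bit to 8-bit cache entries halves the footprint.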

One subtle requirement: --max-num-batched-tokens must be at least 2096. Qwen3.5’s Mamba-style cache alignment forces an attention block size of 2096 tokens, and vLLM will refuse to start if the batched token limit is smaller than the block size.
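The constraint in code form, mirroring (not reproducing) vLLM's startup check:

```python
# Validator sketch for the batched-token constraint described above.
# 2096 is the attention block size forced by Qwen3.5's Mamba-style
# cache alignment, per the text; this is not vLLM's actual code.
ATTN_BLOCK_SIZE = 2096

def validate_batched_tokens(max_num_batched_tokens):
    if max_num_batched_tokens < ATTN_BLOCK_SIZE:
        raise ValueError(
            f"--max-num-batched-tokens={max_num_batched_tokens} is smaller "
            f"than the attention block size ({ATTN_BLOCK_SIZE}); "
            "vLLM will refuse to start.")

validate_batched_tokens(2096)  # OK: exactly one block
```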

What You Get

The result is a fully functional Qwen3.5-35B-A3B server on a consumer RTX 5090:

  • 128K token context window
  • Full multimodal support (image understanding)
  • 2 concurrent request slots
  • Thinking mode with reasoning traces
  • Prefix caching for repeated prompt prefixes

The MoE architecture is what makes this possible on 32 GB. Despite having 35 billion total parameters, only 3 billion are active per token (8 of 256 routed experts plus 1 shared expert). That sparsity keeps per-token compute low, and the hybrid design helps on memory too: only the one-in-four standard-attention layers contribute to the KV cache, keeping it small relative to a dense model of similar capability.
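The sparsity arithmetic, using only the counts from the text:

```python
# Sparse-activation arithmetic for Qwen3.5-35B-A3B, per the figures above.
TOTAL_PARAMS_B = 35.0
ACTIVE_PARAMS_B = 3.0
ROUTED_EXPERTS = 256
ACTIVE_ROUTED = 8  # plus 1 shared expert, always active

active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
print(f"{active_fraction:.1%} of weights active per token")  # 8.6%
print(f"{ACTIVE_ROUTED}/{ROUTED_EXPERTS} routed experts fire per token")
```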

Performance

To put some numbers on this, I ran a simple benchmark: generate a 1000-word story for each of six different animals (ape, bobcat, crocodile, dolphin, elephant, falcon) with max_tokens=1024. For context, I’ve included the same benchmark from Qwen3-VL-30B running on the same GPU and vLLM image.

  Model                                Avg tok/s
  Qwen3-VL-30B-A3B (MoE, AWQ 4-bit)        259.3
  Qwen3.5-35B-A3B (MoE, AWQ 4-bit)         202.3
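The averaging itself is simple; this is the method, not my exact benchmark script, and the example run below uses made-up timings:

```python
# Aggregate decode throughput over the six story prompts: total generated
# tokens divided by total wall time.
ANIMALS = ["ape", "bobcat", "crocodile", "dolphin", "elephant", "falcon"]

def avg_tok_per_s(completions):
    """completions: list of (generated_tokens, wall_seconds) per request."""
    total_tokens = sum(t for t, _ in completions)
    total_time = sum(s for _, s in completions)
    return total_tokens / total_time

# Hypothetical example: six runs of 1024 tokens at 5.0 s each.
runs = [(1024, 5.0)] * len(ANIMALS)
print(f"{avg_tok_per_s(runs):.1f} tok/s")  # → 204.8
```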

Both models are sparse MoE architectures with similar active parameter counts (~3B), but they differ in total parameters and attention mechanism. Qwen3-VL-30B-A3B has 30B total parameters across 128 experts with standard attention, while Qwen3.5-35B-A3B has 35B total across 256 experts with a hybrid GDN+attention design. The throughput gap (202 vs. 259 tok/s) comes down to that hybrid architecture — the Triton-based Gated Delta Network kernels in Qwen3.5 aren’t as optimized as the FlashAttention path that Qwen3-VL uses for all its layers.

Lessons Learned

The biggest takeaway: Triton autotuning is an invisible memory consumer that nobody documents. It’s not in vLLM’s troubleshooting guides, it doesn’t show up in memory profiling, and it only strikes on the first inference after a cold start. If you’re running any model with custom Triton kernels on a memory-constrained GPU, persist your Triton cache and consider warming it up on a larger card.

The second lesson: don’t assume your GPU is too small just because the first attempt fails. The 5090 runs this 35B model comfortably — it just needs a helping hand getting past that initial autotuning hurdle.