Running MiniMax-M2.5 on a Single RTX 6000 Blackwell: 68 Tokens/s with 64K Context

MiniMax-M2.5 is a 139B parameter mixture-of-experts model with only 10B active parameters per token, making it surprisingly efficient for its size. Using the REAP NVFP4 quantization from lukealonso, you can run it on a single NVIDIA RTX PRO 6000 Blackwell GPU with 96 GB of VRAM — and get a very usable 68 tokens per second with a 64K token context window.

Here’s exactly how to do it.

The Stack

  • Model: lukealonso/MiniMax-M2.5-REAP-139B-A10B-NVFP4
  • Inference engine: SGLang v0.5.8.post1
  • Docker image: lmsysorg/sglang:v0.5.8.post1-cu130
  • GPU: NVIDIA RTX PRO 6000 Blackwell (96 GB)

Why SGLang and Not vLLM?

I tried vLLM first — versions 0.15.1, 0.16.0, and the cu130 nightly. All three crash with a CUDA illegal memory access in the MoE gate layer during inference. The model loads fine, the server starts, but the first request kills the engine. Both the CUTLASS and Marlin GEMM backends hit the same error. I filed this as a bug (vllm-project/vllm#35566).

SGLang’s FlashInfer-based MoE kernels handle the NVFP4 checkpoint without issues on Blackwell.

The Docker Compose File

services:
  sglang:
    image: lmsysorg/sglang:v0.5.8.post1-cu130
    container_name: sglang-minimax-reap
    runtime: nvidia
    shm_size: "1g"
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    environment:
      - CUDA_VISIBLE_DEVICES=0
      - LD_LIBRARY_PATH=/lib/x86_64-linux-gnu
    command:
      - python3
      - -m
      - sglang.launch_server
      - --model
      - lukealonso/MiniMax-M2.5-REAP-139B-A10B-NVFP4
      - --served-model-name
      - minimax-m2.5-reap-nvfp4
      - --reasoning-parser
      - minimax
      - --tool-call-parser
      - minimax-m2
      - --trust-remote-code
      - --tp
      - "1"
      - --mem-fraction-static
      - "0.95"
      - --max-running-requests
      - "32"
      - --context-length
      - "65536"
      - --quantization
      - modelopt_fp4
      - --attention-backend
      - flashinfer
      - --moe-runner-backend
      - flashinfer_cutlass
      - --kv-cache-dtype
      - fp8_e5m2
      - --enable-flashinfer-allreduce-fusion
      - --host
      - "0.0.0.0"
      - --port
      - "8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped

Run docker compose up -d and wait about two minutes for the model to load.
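Rather than tailing logs to see when loading finishes, you can poll the server's health endpoint. A minimal sketch using only the standard library — the port matches the compose file above, and the /health path is SGLang's readiness endpoint (verify against your SGLang version):

```python
import time
import urllib.request
import urllib.error

def wait_for_server(url="http://localhost:8000/health", timeout=300):
    """Poll the health endpoint until the model finishes loading."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; keep polling
        time.sleep(5)
    return False
```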

Two Key Settings

fp8 KV cache is essential. The model card recommends bf16 for the KV cache, but that only gives you about 33K tokens of capacity on 96 GB. Switching to --kv-cache-dtype fp8_e5m2 doubles it to 67K tokens, which is enough to actually use the full 64K (65,536-token) context window. In my testing, output quality was not noticeably affected.
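The doubling follows directly from the KV-cache size formula: each token stores a K and a V tensor per layer, so capacity scales inversely with bytes per element. A sketch — the layer/head/dim values below are illustrative placeholders, not MiniMax-M2.5's real architecture; the 2x ratio is the point:

```python
def kv_capacity_tokens(free_bytes, bytes_per_elem, num_layers, num_kv_heads, head_dim):
    """Tokens of KV cache that fit: 2 tensors (K and V) per layer per token."""
    per_token_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return free_bytes // per_token_bytes

free = 9 * 1024**3  # roughly the headroom left at --mem-fraction-static 0.95
# Placeholder architecture numbers, chosen only to illustrate the ratio:
bf16 = kv_capacity_tokens(free, 2, num_layers=64, num_kv_heads=8, head_dim=128)
fp8  = kv_capacity_tokens(free, 1, num_layers=64, num_kv_heads=8, head_dim=128)
assert fp8 == 2 * bf16  # halving bytes per element doubles token capacity
```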

Set memory fraction to 0.95. The default 0.85-0.88 range doesn’t leave enough room for the KV cache after the model’s 81.6 GB of weights are loaded. At 0.95, you get about 9 GB for KV cache, CUDA graphs, and overhead.
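The arithmetic behind that choice is easy to check — at the common 0.85 default, the static budget lands almost exactly on the weight footprint, leaving nothing for KV cache:

```python
total_vram_gb = 96.0
weights_gb = 81.6  # NVFP4 checkpoint footprint reported at load time

for frac in (0.85, 0.95):
    budget = total_vram_gb * frac   # what SGLang may statically allocate
    headroom = budget - weights_gb  # left over for KV cache, CUDA graphs, overhead
    print(f"--mem-fraction-static {frac}: {headroom:.1f} GB headroom")
# 0.85 * 96 = 81.6 GB -- exactly the weights, zero headroom
# 0.95 * 96 = 91.2 GB -- about 9.6 GB of headroom
```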

Performance

I generated six 1000+ word stories and measured throughput:

Prompt            Tokens   Time     Tokens/s
Elephant story     1,364   20.1s    67.9
Fox story          1,580   23.1s    68.3
Zebra story        1,334   19.3s    69.1
Dolphin story      1,205   17.7s    67.9
Owl story          1,248   18.0s    69.1
Wolf story         1,328   19.1s    69.4
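As a sanity check, dividing tokens by wall time reproduces the throughput column (per-run values can drift by a tenth since the times are rounded); the aggregate works out to about 68.7 tok/s:

```python
# (name, generated tokens, wall time in seconds) from the table above
runs = [
    ("Elephant", 1364, 20.1), ("Fox", 1580, 23.1), ("Zebra", 1334, 19.3),
    ("Dolphin", 1205, 17.7), ("Owl", 1248, 18.0), ("Wolf", 1328, 19.1),
]
total_tokens = sum(tokens for _, tokens, _ in runs)
total_time = sum(secs for _, _, secs in runs)
print(f"mean throughput: {total_tokens / total_time:.1f} tok/s")  # ~68.7
```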

Consistent 68-69 tokens/s on short-context generation. This is well above the 15-30 t/s some early reports suggested for this model on Blackwell. Long-context workloads (above 32K input tokens) will be slower, as expected for single-GPU MoE inference.

The model supports both reasoning (chain-of-thought in reasoning_content) and tool calling out of the box through SGLang’s OpenAI-compatible API at http://localhost:8000/v1/chat/completions.
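A minimal client sketch using only the standard library — the URL and model name come from the compose file above, and reasoning_content is where SGLang's reasoning parser places the chain-of-thought (exact response fields may vary by SGLang version):

```python
import json
import urllib.request

def chat_payload(prompt, model="minimax-m2.5-reap-nvfp4", max_tokens=1024):
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def ask(prompt, url="http://localhost:8000/v1/chat/completions"):
    """Send a chat request and return (chain_of_thought, final_answer)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(chat_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        msg = json.loads(resp.read())["choices"][0]["message"]
        return msg.get("reasoning_content"), msg["content"]

# thinking, answer = ask("Why is the sky blue?")  # requires the server running
```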