GLM-4.7-Flash: 128k Context on a Single Consumer GPU

There’s a new contender in the local LLM space: GLM-4.7-Flash from Zhipu AI. This 31B-parameter MoE model punches well above its weight class, and with the right configuration you can run it with the full 128k context window on a single 32GB consumer GPU.

Model Highlights

GLM-4.7-Flash uses a Mixture of Experts (MoE) architecture with 31 billion total parameters but only 3 billion active during inference. This makes it remarkably efficient while delivering impressive benchmark results.
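
To make the total-vs-active distinction concrete, here is a toy routing sketch in pure Python. The expert count and top-k are made-up numbers for illustration, not GLM-4.7-Flash's real configuration: a router scores every expert for each token, but only the top-k highest-scoring experts actually execute.

# Toy illustration of MoE routing (expert count and top-k are made-up numbers,
# not GLM-4.7-Flash's real configuration). A router scores every expert for
# each token, but only the top-k highest-scoring experts actually run.
import random

NUM_EXPERTS = 64   # assumed for illustration
TOP_K = 2          # assumed for illustration

def route(scores: list[float], k: int = TOP_K) -> list[int]:
    """Return the indices of the k highest-scoring experts for one token."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

gate_scores = [random.random() for _ in range(NUM_EXPERTS)]
active = route(gate_scores)
print(f"experts executed for this token: {active} ({TOP_K} of {NUM_EXPERTS})")
# All experts contribute to the total parameter count, but each token only pays
# the compute cost of the routed ones -- hence the gap between total and active.
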

Benchmark Performance

Benchmark            GLM-4.7-Flash   Qwen3-30B-A3B
AIME 25              91.6            81.4
GPQA                 66.8            65.8
SWE-bench Verified   59.2            38.6
LiveCodeBench        58.7            49.5

Source: THUDM/GLM-4.7-Flash on HuggingFace

The model excels at coding and agentic tasks, with strong tool-use capabilities. It’s released under the MIT license, making it fully open for commercial use.

The Challenge: 128k Context on Consumer Hardware

Running GLM-4.7-Flash with 128k context presents a memory challenge. With standard attention, the KV cache alone would require approximately 29GB for 128k tokens—leaving almost nothing for the model itself on a 32GB GPU.

The solution is Multi-Head Latent Attention (MLA), the same attention mechanism used in DeepSeek models. MLA compresses the KV cache to approximately 8GB for 128k context, making full-length inference feasible on consumer hardware.
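
A quick back-of-the-envelope calculation shows where those numbers come from. The dimensions in the sketch below are illustrative placeholders rather than GLM-4.7-Flash's real config (the ~29GB and ~8GB figures above come from the model's actual dimensions), but the formula is the same: standard attention caches K and V per layer and per KV head, while MLA caches a single compressed latent per layer.

# Back-of-the-envelope KV cache sizing for 128k tokens. Dimensions are
# illustrative placeholders, not GLM-4.7-Flash's real config; plug in values
# from the model's config.json to reproduce the ~29 GB / ~8 GB figures above.
SEQ_LEN = 131_072   # 128k context
BYTES = 2           # bf16 / fp16 cache entries
LAYERS = 48         # placeholder layer count

# Standard attention (GQA): cache K and V per layer, per KV head.
kv_heads, head_dim = 8, 128   # placeholders
std_bytes = 2 * LAYERS * kv_heads * head_dim * SEQ_LEN * BYTES

# MLA: cache one compressed latent (plus a small RoPE slice) per layer.
kv_lora_rank, rope_dim = 512, 64   # DeepSeek-style values, assumed here
mla_bytes = LAYERS * (kv_lora_rank + rope_dim) * SEQ_LEN * BYTES

print(f"standard attention: {std_bytes / 2**30:.1f} GiB")  # 24.0 GiB with these dims
print(f"MLA latent cache:   {mla_bytes / 2**30:.1f} GiB")  # 6.8 GiB with these dims
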

The catch? As of the January 21, 2026 nightly build, vLLM doesn’t recognize glm4_moe_lite as an MLA model by default.

The Solution: Patching vLLM for MLA Support

Getting MLA working requires a small patch to vLLM’s model architecture config. You extract the config file that holds the MLA whitelist from the container, add glm4_moe_lite to that whitelist, and mount the patched file back in.

Here’s the key change:

elif self.hf_text_config.model_type in (
    "deepseek_v2",
    "deepseek_v3",
    "deepseek_v32",
    "deepseek_mtp",
    "kimi_k2",
    "kimi_linear",
    "longcat_flash",
    "pangu_ultra_moe",
    "pangu_ultra_moe_mtp",
    "glm4_moe_lite",  # <-- ADD THIS LINE
):
    return self.hf_text_config.kv_lora_rank is not None
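
One way to apply that change without rebuilding the image is to locate the file inside the container, copy it out, edit it, and bind-mount the patched copy back over the original. The in-container path below is a placeholder (it varies by image and vLLM version), so locate the file first:

# Locate the vLLM package inside the running container (path varies by image/version)
docker compose exec vllm-glm python3 -c "import vllm, os; print(os.path.dirname(vllm.__file__))"

# Copy out the file containing the whitelist (path below is a placeholder),
# add the glm4_moe_lite line shown above, then bind-mount the patched copy back:
docker compose cp vllm-glm:/usr/local/lib/python3.12/dist-packages/vllm/config.py ./patched/config.py
# docker-compose.yml:
#   volumes:
#     - ./patched/config.py:/usr/local/lib/python3.12/dist-packages/vllm/config.py:ro
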

With this patch and chunked prefill enabled (by setting max-num-batched-tokens lower than max-model-len), vLLM processes long sequences in manageable chunks while still supporting the full 128k context window.
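
Concretely, that comes down to two serve arguments. A sketch showing only the context and prefill flags (everything else omitted):

vllm serve GadflyII/GLM-4.7-Flash-NVFP4 \
  --max-model-len 131072 \
  --max-num-batched-tokens 8192
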

Performance Results

Performance varies significantly depending on your GPU generation and configuration:

Configuration                                  Throughput
With --enforce-eager (CUDA graphs disabled)    ~30 tok/s
Without --enforce-eager on Blackwell           100-136 tok/s

On Blackwell GPUs (RTX 5090), the TritonMLA backend fully supports CUDA graphs. Enabling them provides a 3-4x speedup over eager mode.

GPU-specific guidance (example invocations below):

  • Blackwell+ (RTX 5090): Do NOT use --enforce-eager
  • Pre-Hopper GPUs: Use --enforce-eager (CUDA graphs unsupported with MLA)
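
In practice this is a single flag on the serve command. A sketch with other arguments omitted:

# Blackwell (RTX 5090): leave CUDA graphs on, no extra flag needed
vllm serve GadflyII/GLM-4.7-Flash-NVFP4 --max-model-len 131072

# Pre-Hopper GPUs: fall back to eager execution
vllm serve GadflyII/GLM-4.7-Flash-NVFP4 --max-model-len 131072 --enforce-eager
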

Quick Setup

The full setup involves the following steps; a minimal compose sketch follows the list:

  1. Use vLLM nightly - The glm4_moe_lite architecture isn’t in stable releases yet
  2. Install transformers from main - Glm4MoeLiteConfig is too new for stable transformers
  3. Patch the MLA config - Add glm4_moe_lite to the MLA whitelist
  4. Configure chunked prefill - Set max-num-batched-tokens to 8192
  5. Use the NVFP4 quantized model - GadflyII/GLM-4.7-Flash-NVFP4 fits in 32GB
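
Wired together, the service looks roughly like the sketch below. This is a minimal sketch under assumptions: the image tag, in-container path, and port mapping are placeholders, the argument style assumes the standard vllm/vllm-openai entrypoint, and step 2 (transformers from main) is omitted because it needs a custom image. The downloadable guide below has the complete file.

# Minimal sketch only -- image tag, paths, and port mapping are assumptions.
services:
  vllm-glm:
    image: vllm/vllm-openai:nightly   # placeholder tag, see step 1
    ipc: host                          # recommended for vLLM's shared-memory use
    command: >
      --model GadflyII/GLM-4.7-Flash-NVFP4
      --max-model-len 131072
      --max-num-batched-tokens 8192
    ports:
      - "8000:8000"
    volumes:
      # step 3: overlay the patched config onto the path used inside the image
      - ./patched/config.py:/usr/local/lib/python3.12/dist-packages/vllm/config.py:ro
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
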

Once running, verify MLA is active:

docker compose logs vllm-glm 2>&1 | grep -E "TRITON_MLA|Available KV cache"

You should see:

Using TRITON_MLA attention backend out of potential backends: ('TRITON_MLA',)
Available KV cache memory: 8.11 GiB
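
From there, a quick smoke test against the OpenAI-compatible API confirms the server is answering. This assumes the service publishes vLLM's default port 8000 and serves the model under its repo name (no --served-model-name override):

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "GadflyII/GLM-4.7-Flash-NVFP4",
        "messages": [{"role": "user", "content": "Summarize what MLA does in one sentence."}],
        "max_tokens": 64
      }'
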

Download the Full Setup Guide

For complete step-by-step instructions including the full docker-compose.yml, troubleshooting tips, and verification commands:

Download: GLM-4.7-Flash MLA Setup Guide (Markdown)

Coming Soon: Benchmarks

I’ll be publishing detailed benchmarks comparing GLM-4.7-Flash against other local models in the same parameter class. Stay tuned for:

  • Coding benchmark comparisons
  • Long-context retrieval tests
  • Real-world agentic task performance
  • Memory and throughput analysis across GPU generations

Conclusion

GLM-4.7-Flash represents a significant step forward for local LLM inference. The combination of MoE efficiency, MLA attention, and strong benchmark performance makes it a compelling choice for anyone running local models.

With the MLA patch and proper configuration, you can run the full 128k context window on a single 32GB GPU at over 100 tokens per second on Blackwell hardware. That’s enterprise-grade capability on consumer hardware.

The model’s MIT license and strong tool-use capabilities also make it an excellent foundation for agentic applications where you need both performance and the freedom to deploy anywhere.