There’s a new contender in the local LLM space: GLM-4.7-Flash from Zhipu AI. This 30B parameter MoE model punches well above its weight class, and with the right configuration, you can run it with full 128k context on a single 32GB consumer GPU.
Model Highlights
GLM-4.7-Flash uses a Mixture of Experts (MoE) architecture with 31 billion total parameters but only 3 billion active during inference. This makes it remarkably efficient while delivering impressive benchmark results.
Benchmark Performance
| Benchmark | GLM-4.7-Flash | Qwen3-30B-A3B |
|---|---|---|
| AIME 25 | 91.6 | 81.4 |
| GPQA | 66.8 | 65.8 |
| SWE-bench Verified | 59.2 | 38.6 |
| LiveCodeBench | 58.7 | 49.5 |
Source: THUDM/GLM-4.7-Flash on HuggingFace
The model excels at coding and agentic tasks, with strong tool-use capabilities. It’s released under the MIT license, making it fully open for commercial use.
The Challenge: 128k Context on Consumer Hardware
Running GLM-4.7-Flash with 128k context presents a memory challenge. With standard attention, the KV cache alone would require approximately 29GB for 128k tokens—leaving almost nothing for the model itself on a 32GB GPU.
The solution is Multi-Head Latent Attention (MLA), the same attention mechanism used in DeepSeek models. MLA compresses the KV cache to approximately 8GB for 128k context, making full-length inference feasible on consumer hardware.
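Those estimates translate into a simple per-token budget. The snippet below is only a back-of-envelope check that divides this post's figures (~29 GB and ~8 GB at a 128k window, treated as GiB for simplicity) by the token count; it is not derived from the model's actual layer layout:

```bash
# Per-token KV cache footprint implied by the estimates above,
# at a 128k (131,072-token) context window.
python3 -c '
tokens = 128 * 1024
for label, gib in (("standard attention", 29), ("MLA", 8)):
    print(f"{label}: ~{gib * 2**30 / tokens / 1024:.0f} KiB per token")
'
# Prints roughly 232 KiB/token (standard) vs 64 KiB/token (MLA).
```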
The catch? As of the January 21, 2026 nightly build, vLLM doesn’t recognize glm4_moe_lite as an MLA model by default.
The Solution: Patching vLLM for MLA Support
Getting MLA working requires a small patch to vLLM's model architecture config. You extract the relevant config file from the container, add `glm4_moe_lite` to the MLA whitelist, and mount the patched copy back in.
Here’s the key change:
```python
elif self.hf_text_config.model_type in (
    "deepseek_v2",
    "deepseek_v3",
    "deepseek_v32",
    "deepseek_mtp",
    "kimi_k2",
    "kimi_linear",
    "longcat_flash",
    "pangu_ultra_moe",
    "pangu_ultra_moe_mtp",
    "glm4_moe_lite",  # <-- ADD THIS LINE
):
    return self.hf_text_config.kv_lora_rank is not None
```
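Applying the change means getting that file out of the image first. Below is one way to do it, a sketch assuming you run the official `vllm/vllm-openai` Docker image; the container name, the local `patched/` directory, and the fallback file path are illustrative, so trust the `grep` output over the hard-coded path:

```bash
# Spin up a throwaway container from the image you'll actually serve with.
IMAGE="vllm/vllm-openai:nightly"   # assumption - substitute your nightly tag
docker run -d --name vllm-extract --entrypoint sleep "$IMAGE" infinity

# Locate the installed vLLM package inside the container.
VLLM_DIR=$(docker exec vllm-extract python3 -c \
  "import vllm, os; print(os.path.dirname(vllm.__file__))")

# Find the file containing the MLA whitelist (the tuple shown above),
# then copy it out so glm4_moe_lite can be added.
docker exec vllm-extract grep -rl --include="*.py" "deepseek_mtp" "$VLLM_DIR"
mkdir -p patched
docker cp "vllm-extract:$VLLM_DIR/config.py" patched/   # adjust to the grep result

docker rm -f vllm-extract
```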
With this patch and chunked prefill enabled (by setting max-num-batched-tokens lower than max-model-len), vLLM processes long sequences in manageable chunks while still supporting the full 128k context window.
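Concretely, that comes down to two engine arguments. Whether you set them on `vllm serve` directly or in the compose service's command, the flags are standard vLLM options; the values below mirror this post's setup (8192-token chunks against a 131,072-token window) and use the NVFP4 build covered in the setup list further down:

```bash
# Chunked prefill: schedule at most 8192 tokens per step while still
# advertising the full 128k (131,072-token) context window.
vllm serve GadflyII/GLM-4.7-Flash-NVFP4 \
  --max-model-len 131072 \
  --max-num-batched-tokens 8192
```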
Performance Results
Performance varies significantly depending on your GPU generation and configuration:
| Configuration | Throughput |
|---|---|
| With `--enforce-eager` (CUDA graphs disabled) | ~30 tok/s |
| Without `--enforce-eager` on Blackwell | 100-136 tok/s |
On Blackwell GPUs (RTX 5090), the TritonMLA backend fully supports CUDA graphs. Enabling them provides a 3-4x speedup over eager mode.
GPU-specific guidance (the pre-Hopper command variant is sketched after this list):

- Blackwell+ (RTX 5090): Do NOT use `--enforce-eager`
- Pre-Hopper GPUs: Use `--enforce-eager` (CUDA graphs unsupported with MLA)
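For older cards, the launch command shown earlier simply gains one flag; this is the entire difference between the two rows of the throughput table:

```bash
# Pre-Hopper GPUs: disable CUDA graphs (expect the ~30 tok/s row).
# On Blackwell, omit --enforce-eager so TritonMLA can capture CUDA graphs.
vllm serve GadflyII/GLM-4.7-Flash-NVFP4 \
  --max-model-len 131072 \
  --max-num-batched-tokens 8192 \
  --enforce-eager
```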
Quick Setup
The full setup involves (a condensed launch sketch follows this list):

- Use vLLM nightly - The `glm4_moe_lite` architecture isn't in stable releases yet
- Install transformers from main - `Glm4MoeLiteConfig` is too new for stable transformers
- Patch the MLA config - Add `glm4_moe_lite` to the MLA whitelist
- Configure chunked prefill - Set `max-num-batched-tokens` to 8192
- Use the NVFP4 quantized model - `GadflyII/GLM-4.7-Flash-NVFP4` fits in 32GB
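For orientation, here is roughly what those steps collapse to as a single `docker run`. Treat it as a sketch rather than the reference setup: the image tag, port, cache mount, and the in-container destination for the patched file are assumptions (reuse the path you found with `grep` earlier), the model argument style depends on the image's entrypoint, and it skips the transformers-from-main step that the full guide below handles:

```bash
# Single-command sketch of the compose service. The patched file must be
# mounted over the exact path it was copied from; the dist-packages path
# below is an assumption - substitute the path grep reported earlier.
docker run --gpus all --ipc=host -p 8000:8000 \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  -v "$PWD/patched/config.py:/usr/local/lib/python3.12/dist-packages/vllm/config.py:ro" \
  vllm/vllm-openai:nightly \
  --model GadflyII/GLM-4.7-Flash-NVFP4 \
  --max-model-len 131072 \
  --max-num-batched-tokens 8192
```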
Once running, verify MLA is active:

```bash
docker compose logs vllm-glm 2>&1 | grep -E "TRITON_MLA|Available KV cache"
```

You should see:

```
Using TRITON_MLA attention backend out of potential backends: ('TRITON_MLA',)
Available KV cache memory: 8.11 GiB
```
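With the backend confirmed, a quick request against vLLM's OpenAI-compatible endpoint verifies end-to-end generation (the port assumes the default 8000 mapping):

```bash
# Smoke test: one short completion through the OpenAI-compatible API.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "GadflyII/GLM-4.7-Flash-NVFP4",
        "messages": [{"role": "user", "content": "Say hello in five words."}],
        "max_tokens": 32
      }'
```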
Download the Full Setup Guide
For complete step-by-step instructions including the full docker-compose.yml, troubleshooting tips, and verification commands:
Download: GLM-4.7-Flash MLA Setup Guide (Markdown)
Coming Soon: Benchmarks
I’ll be publishing detailed benchmarks comparing GLM-4.7-Flash against other local models in the same parameter class. Stay tuned for:
- Coding benchmark comparisons
- Long-context retrieval tests
- Real-world agentic task performance
- Memory and throughput analysis across GPU generations
Conclusion
GLM-4.7-Flash represents a significant step forward for local LLM inference. The combination of MoE efficiency, MLA attention, and strong benchmark performance makes it a compelling choice for anyone running local models.
With the MLA patch and proper configuration, you can run the full 128k context window on a single 32GB GPU at over 100 tokens per second on Blackwell hardware. That’s enterprise-grade capability on consumer hardware.
The model’s MIT license and strong tool-use capabilities also make it an excellent foundation for agentic applications where you need both performance and the freedom to deploy anywhere.