TL;DR: Z.ai GLM-5: New SOTA Open-Weights LLM
Major Highlights:
- Bold scale-up with cost-aware attention
- GLM-5 upgrades Zhipu AI’s MoE line from 355B params / 32B active (GLM-4.5) to 744B / 40B active, with pretraining data rising from 23T to 28.5T tokens. It integrates DeepSeek Sparse Attention (DSA) to preserve long-context capability while cutting serving costs—framed by many as further evidence of DeepSeek-style attention winning in open models.
- Open-weight frontier leader on agentic/econ tasks
- Third-party synthesis (Artificial Analysis) places GLM‑5 as the top open-weights model on its Intelligence Index (score 50, up from 42 for GLM‑4.7), with major gains on real-work benchmarks: a GDPVal‑AA Elo of 1412 (trailing only Opus 4.6 and GPT‑5.2 xhigh in their setup) and a large reduction in hallucinations (an AA‑Omniscience score of −1, the best among models tested). Zhipu also cites strong results on BrowseComp and VendingBench 2.
- Day-0 ecosystem coverage with permissive licensing
- Released under an MIT license, GLM‑5 saw immediate hosting across OpenRouter, Modal (limited-time free endpoint), DeepInfra, and Ollama Cloud, plus day‑0 inference stack support in vLLM (DSA + speculative decoding hooks) and SGLang (cookbooks). Broad distribution landed on Hugging Face and ModelScope.
- Demand crush and phased rollout
- Zhipu flagged tight serving capacity and staged access (e.g., prioritized rollout for “Coding Plan Pro” subscribers) amid traffic spikes and pricing changes. Despite this, GLM‑5 quickly reached #1 among open models on Text Arena (roughly #11 overall at the time of the snapshot).
Key Technical Details:
- Architecture: Mixture-of-Experts; 744B total params, 40B active per token (up from 355B/32B).
- Training data: 28.5T tokens (from 23T).
- Context/IO limits: 200K-token context window; 128K maximum output tokens cited.
- Attention: DeepSeek Sparse Attention integrated for cheaper long-context serving.
- Modalities and focus: Text-only; emphasis on office-document workflows (PDF/Word/Excel), similar to Kimi K2.5.
- Weights/hosting: Released in BF16; the ~1.5 TB footprint makes self-hosting non-trivial compared with models shipped natively in FP8/INT4.
- License: MIT.
- Availability: OpenRouter, Modal, DeepInfra, Ollama Cloud; vLLM and SGLang day‑0 support; HF/ModelScope mirrors.
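
The parameter and memory figures above can be sanity-checked with back-of-the-envelope arithmetic (a sketch; the 744B/40B and BF16 numbers come from the summary, everything else is plain unit math):

```python
# Back-of-the-envelope sizing for GLM-5's reported figures.
TOTAL_PARAMS = 744e9       # total MoE parameters (reported)
ACTIVE_PARAMS = 40e9       # parameters active per token (reported)
BYTES_PER_PARAM_BF16 = 2   # BF16 = 16 bits = 2 bytes

# Weight footprint in decimal terabytes -- matches the ~1.5 TB figure.
weights_tb = TOTAL_PARAMS * BYTES_PER_PARAM_BF16 / 1e12
print(f"BF16 weights: ~{weights_tb:.2f} TB")    # ~1.49 TB

# Fraction of parameters touched per token: the MoE efficiency lever.
active_frac = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"Active per token: {active_frac:.1%}")   # ~5.4%
```

The same two-bytes-per-parameter math is why BF16 distribution roughly doubles the download and VRAM cost versus a native FP8 release of the same model.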
Community Response/Impact:
- “China open model week” narrative: GLM‑5 is seen as part of a rapid, China-led open-ecosystem release cycle (DeepSeek, MiniMax, Zhipu) that is compressing iteration times, a pace some commentators have dubbed a “bloodbath.”
- DeepSeek influence: Many credit DeepSeek’s attention and MoE recipes as shaping current open-frontier designs; GLM‑5’s DSA adoption reinforces that trend.
- Caveats raised: Lack of vision; BF16 vs. natively quantized comparisons may reshuffle perceived rankings; compute scarcity affected user access and pricing.
- Office benchmark bragging rights: On GDPVal‑AA, GLM‑5 ranks above Kimi K2.5, reinforcing its “white collar work” positioning.
First Principles Analysis:
- Why it matters: GLM‑5 advances the open-weight state of the art by pairing large MoE capacity with sparse attention that meaningfully lowers long-context inference costs—a core bottleneck for agentic, multi-document, and tool-heavy workflows.
- How it works: MoE raises representational capacity without linearly scaling active FLOPs, while DSA prunes attention computation/storage over long sequences. Releasing BF16 weights with a permissive license accelerates research and infra optimization—though it shifts the burden of quantization/serving efficiency to the community.
- Strategic takeaway: The open frontier is now competing on attention efficiency and real-work benchmarks, not just raw param counts—pushing vendors to ship better inference kernels, memory layouts, and scheduling alongside the models.
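
The two mechanisms in the “How it works” bullet can be sketched in a few lines of toy numpy. This is an illustration only: the expert count, top-k value, attention budget, and the simple top-k key-selection rule are made-up stand-ins, not GLM‑5’s or DSA’s actual configuration (DSA’s real selection mechanism is learned, not a plain score sort):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 64, 8, 2                   # toy sizes, not real config
router = rng.standard_normal((d, n_experts))
experts = rng.standard_normal((n_experts, d, d))  # one weight matrix per expert

def moe_forward(x):
    """Top-k MoE routing: only k of n_experts run per token, so active
    FLOPs are ~k/n_experts of the dense-equivalent cost."""
    scores = x @ router                       # router score per expert
    top = np.argsort(scores)[-k:]             # indices of the k best experts
    gates = np.exp(scores[top] - scores[top].max())
    gates /= gates.sum()                      # softmax over selected experts
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

def sparse_attention(q, keys, values, budget=4):
    """Sparse attention in the spirit of DSA: score all keys cheaply,
    keep only a small budget of them, and attend over that subset."""
    scores = keys @ q / np.sqrt(d)            # one score per cached key
    keep = np.argsort(scores)[-budget:]       # only `budget` keys survive
    w = np.exp(scores[keep] - scores[keep].max())
    w /= w.sum()
    return w @ values[keep]                   # weighted sum over kept values

x = rng.standard_normal(d)                    # one token's hidden state
ctx_k = rng.standard_normal((32, d))          # toy KV cache (32 positions)
ctx_v = rng.standard_normal((32, d))
out = moe_forward(x) + sparse_attention(x, ctx_k, ctx_v)
print(out.shape)                              # (64,)
```

The point of the sketch is the cost structure: the MoE path touches 2 of 8 expert matrices per token, and the attention path reduces the softmax/value gather from 32 cached positions to 4, which is where the long-context serving savings come from.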