TL;DR: Z.ai GLM-5: New SOTA Open-Weights LLM
Major Highlights:
- Bold scale-up with cost-aware attention
- GLM-5 upgrades Zhipu AI’s MoE line from 355B params / 32B active (GLM-4.5) to 744B / 40B active, with pretraining data rising from 23T to 28.5T tokens. It integrates DeepSeek Sparse Attention (DSA) to preserve long-context capability while cutting serving costs—framed by many as further evidence of DeepSeek-style attention winning in open models.
- Open-weight frontier leader on agentic/econ tasks
- Third-party synthesis (Artificial Analysis) places GLM‑5 as the top open-weights model on its Intelligence Index (score 50, up from 42 for GLM‑4.7), with major gains on real-work benchmarks: a GDPVal‑AA Elo of 1412 (trailing only Opus 4.6 and GPT‑5.2 xhigh in their setup) and a large reduction in hallucinations (an AA‑Omniscience score of −1, the best among models tested). Zhipu also cites strong results on BrowseComp and VendingBench 2.
- Day-0 ecosystem coverage with permissive licensing
- Released under an MIT license, GLM‑5 saw immediate hosting across OpenRouter, Modal (limited-time free endpoint), DeepInfra, and Ollama Cloud, plus day‑0 inference stack support in vLLM (DSA + speculative decoding hooks) and SGLang (cookbooks). Broad distribution landed on Hugging Face and ModelScope.
- Demand crush and phased rollout
- Zhipu flagged tight serving capacity and staged access (e.g., prioritized rollout for “Coding Plan Pro” subscribers) amid traffic spikes and pricing changes. Despite this, GLM‑5 quickly reached #1 among open models on Text Arena (roughly #11 overall at the time of the snapshot).
Key Technical Details:
- Architecture: Mixture-of-Experts; 744B total params, 40B active per token (up from 355B/32B).
- Training data: 28.5T tokens (from 23T).
- Context/IO limits: 200K-token context window; 128K maximum output tokens cited.
- Attention: DeepSeek Sparse Attention integrated for cheaper long-context serving.
- Modalities and focus: Text-only; emphasis on office-document workflows (PDF/Word/Excel), similar to Kimi K2.5.
- Weights/hosting: Released in BF16; the ~1.5 TB footprint makes self-hosting non-trivial compared with models shipped natively in FP8/INT4.
- License: MIT.
- Availability: OpenRouter, Modal, DeepInfra, Ollama Cloud; vLLM and SGLang day‑0 support; HF/ModelScope mirrors.
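
The parameter and memory figures above can be sanity-checked with back-of-the-envelope arithmetic (a sketch; the 744B/40B and BF16 numbers come from the summary, everything else is plain unit math):

```python
# Back-of-the-envelope sizing for GLM-5's reported figures.
TOTAL_PARAMS = 744e9       # total MoE parameters (reported)
ACTIVE_PARAMS = 40e9       # parameters active per token (reported)
BYTES_PER_PARAM_BF16 = 2   # BF16 = 16 bits = 2 bytes

# Weight footprint in decimal terabytes -- matches the ~1.5 TB figure.
weights_tb = TOTAL_PARAMS * BYTES_PER_PARAM_BF16 / 1e12
print(f"BF16 weights: ~{weights_tb:.2f} TB")    # ~1.49 TB

# Fraction of parameters touched per token: the MoE efficiency lever.
active_frac = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"Active per token: {active_frac:.1%}")   # ~5.4%
```

The same two-bytes-per-parameter math is why BF16 distribution roughly doubles the download and VRAM cost versus a native FP8 release of the same model.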
Community Response/Impact:
- “China open model week” narrative: GLM‑5 is seen as part of a rapid, China-led open-ecosystem release cycle (DeepSeek, MiniMax, Zhipu) that is compressing iteration times, a pace some commentators have dubbed a “bloodbath.”
- DeepSeek influence: Many credit DeepSeek’s attention and MoE recipes as shaping current open-frontier designs; GLM‑5’s DSA adoption reinforces that trend.
- Caveats raised: Lack of vision; BF16 vs. natively quantized comparisons may reshuffle perceived rankings; compute scarcity affected user access and pricing.
- Office benchmark bragging rights: On GDPVal‑AA, GLM‑5 ranks above Kimi K2.5, reinforcing its “white collar work” positioning.
First Principles Analysis:
- Why it matters: GLM‑5 advances the open-weight state of the art by pairing large MoE capacity with sparse attention that meaningfully lowers long-context inference costs—a core bottleneck for agentic, multi-document, and tool-heavy workflows.
- How it works: MoE raises representational capacity without linearly scaling active FLOPs, while DSA prunes attention computation/storage over long sequences. Releasing BF16 weights with a permissive license accelerates research and infra optimization—though it shifts the burden of quantization/serving efficiency to the community.
- Strategic takeaway: The open frontier is now competing on attention efficiency and real-work benchmarks, not just raw param counts—pushing vendors to ship better inference kernels, memory layouts, and scheduling alongside the models.
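
The two mechanisms in the “How it works” bullet can be sketched in a few lines of toy numpy. This is an illustration only: the expert count, top-k value, attention budget, and the simple top-k key-selection rule are made-up stand-ins, not GLM‑5’s or DSA’s actual configuration (DSA’s real selection mechanism is learned, not a plain score sort):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 64, 8, 2                   # toy sizes, not real config
router = rng.standard_normal((d, n_experts))
experts = rng.standard_normal((n_experts, d, d))  # one weight matrix per expert

def moe_forward(x):
    """Top-k MoE routing: only k of n_experts run per token, so active
    FLOPs are ~k/n_experts of the dense-equivalent cost."""
    scores = x @ router                       # router score per expert
    top = np.argsort(scores)[-k:]             # indices of the k best experts
    gates = np.exp(scores[top] - scores[top].max())
    gates /= gates.sum()                      # softmax over selected experts
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

def sparse_attention(q, keys, values, budget=4):
    """Sparse attention in the spirit of DSA: score all keys cheaply,
    keep only a small budget of them, and attend over that subset."""
    scores = keys @ q / np.sqrt(d)            # one score per cached key
    keep = np.argsort(scores)[-budget:]       # only `budget` keys survive
    w = np.exp(scores[keep] - scores[keep].max())
    w /= w.sum()
    return w @ values[keep]                   # weighted sum over kept values

x = rng.standard_normal(d)                    # one token's hidden state
ctx_k = rng.standard_normal((32, d))          # toy KV cache (32 positions)
ctx_v = rng.standard_normal((32, d))
out = moe_forward(x) + sparse_attention(x, ctx_k, ctx_v)
print(out.shape)                              # (64,)
```

The point of the sketch is the cost structure: the MoE path touches 2 of 8 expert matrices per token, and the attention path reduces the softmax/value gather from 32 cached positions to 4, which is where the long-context serving savings come from.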