TL;DR: MiniMax‑M2.5 goes open source with agent‑native RL; GLM‑5 joins “near‑frontier” open wave; agent tooling and eval debates heat up
Major Highlights:
- MiniMax‑M2.5 open-sourced as an “agent‑native” model: MiniMax released weights and code for M2.5, trained with RL across 200k+ real‑world environments spanning coding, tool use, web search, and office workflows. vLLM and SGLang shipped day‑0 support, framing the model as production‑ready for long‑horizon, always‑on agents.
- Economics and throughput as the headline: MiniMax markets “$1 per hour at 100 TPS” for agent workloads, positioning M2.5 for low‑cost, high‑throughput self‑hosting. Early local reports show ~50 tok/s on MLX and ~40 tok/s on an M3 Ultra (6‑bit) with ~186 GB peak memory—unusually strong on‑device viability for its class.
- Forge RL system surfaces as the bigger story: Community writeups describe MiniMax’s “Forge” as CISPO‑like RL with process and completion‑time rewards, multi‑level prefix caching, and heavy rollout allocation (~60% of compute), yielding millions of trajectories per day. Leadership hints at “10B active” parameterization, argues knowledge capacity (not control) now limits scaling, and teases structural/pretraining changes for M3.
- GLM‑5 positions as “near‑frontier” open: Zhipu’s GLM‑5, served by Together, claims best‑in‑class open‑source agent performance (77.8% SWE‑Bench Verified; 50.4% HLE w/tools), MoE efficiency via “DeepSeek Sparse Attention,” and a permissive license. Community notes potential dataset fingerprints (“truthy‑dpo”) in outputs.
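MiniMax has not published Forge’s exact reward design, but the reported combination of per‑step process rewards with a completion‑time bonus is easy to sketch. Everything below (function name, weights, the 1.0 terminal reward) is illustrative, not MiniMax’s actual implementation:

```python
def shaped_reward(step_rewards, completed, wall_time_s,
                  time_budget_s=600.0, time_weight=0.2):
    """Combine per-step process rewards with a completion-time bonus.

    All names and constants here are hypothetical, chosen only to show
    how process and completion-time signals can be blended.
    """
    # Average process reward over the trajectory's steps.
    process = sum(step_rewards) / max(len(step_rewards), 1)
    if not completed:
        # Unfinished trajectories earn only their process signal,
        # so finishing is always strictly better than stalling.
        return process
    # Finishing faster relative to the time budget earns a larger bonus,
    # aligning the policy toward quick, reliable task completion.
    time_bonus = time_weight * max(0.0, 1.0 - wall_time_s / time_budget_s)
    return process + 1.0 + time_bonus  # 1.0 = flat task-completion reward
```

The key property is that the terminal bonus dominates the process term, so the policy is pushed to actually finish tasks rather than accumulate plausible intermediate steps.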
Key Technical Details:
- Benchmarks (MiniMax‑M2.5): 80.2% SWE‑Bench Verified; 76.3% BrowseComp (vLLM-reported).
- Training scale: 200k+ RL environments; millions of trajectories/day; rollout share ~60% of compute.
- Throughput/cost: “$1/hour at 100 TPS” (MiniMax claim); early local speed ~40–50 tok/s on Apple silicon at 6‑bit with high RAM; self‑hosting helped by low activated‑parameter count.
- Ecosystem: Day‑0 support in vLLM/SGLang; rapid GGUF/quant releases; Intel‑hosted 2‑bit GGUF (MiniMax‑M2) and INT4 for Qwen3‑Coder‑Next.
- GLM‑5: 77.8% SWE‑Bench Verified; 50.4% HLE w/tools; claims 744B params (MoE) and MIT licensing; “new RL framework.”
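Taking the “$1/hour at 100 TPS” claim at face value, the implied per‑token price is a one‑line calculation (a back‑of‑envelope sketch, not a quoted rate card):

```python
# Back-of-envelope: what "$1/hour at 100 TPS" implies per million tokens.
dollars_per_hour = 1.0
tokens_per_second = 100
tokens_per_hour = tokens_per_second * 3600          # 360,000 tokens/hour
cost_per_million = dollars_per_hour / tokens_per_hour * 1_000_000
print(f"${cost_per_million:.2f} per 1M tokens")     # ≈ $2.78
```

That figure matters for the token‑hungriness critique above: a model that burns ~2× the tokens of a competitor can still win on cost per task if its per‑token price is low enough.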
Community Response/Impact:
- Reviews call M2.5 “viable for multi‑turn work,” close to Claude Sonnet on coding, with better stability vs M2.1—but token‑hungry (≈2× Sonnet in one test), making economics crucial.
- Immediate tooling uptake: package mirrors, quantized artifacts, and agent stacks adopting M2.5 for “always‑on” use.
- Eval skepticism persists: SWE‑Bench “saturation” doubts resurface; SWE‑rebench shows different rankings; concerns that token/latency tradeoffs obscure true capability and that high TPS says little about model size or quality.
- Agent engineering pragmatics: Claude Code “Agent Teams” reportedly coordinate via JSON files on disk (simple, observable, tradeoffs on atomicity/backpressure). Terminal‑first agents surge (Cline CLI 2.0) with parallel agents, headless CI/CD, and broad editor support.
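The reported files‑on‑disk coordination pattern is simple to reproduce. A minimal sketch using atomic rename to avoid torn reads; the helper names are hypothetical and this is not Claude Code’s actual mechanism:

```python
import json
import os
import tempfile

def post_task(dir_path: str, task_id: str, payload: dict) -> str:
    """Publish a task as a JSON file: write to a temp file, then rename.

    os.replace is atomic on both POSIX and Windows, so readers never
    observe a half-written JSON file. Note what this does NOT give you:
    backpressure, ordering between agents, or delivery guarantees.
    """
    os.makedirs(dir_path, exist_ok=True)
    final = os.path.join(dir_path, f"{task_id}.json")
    fd, tmp = tempfile.mkstemp(dir=dir_path, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(payload, f)
    os.replace(tmp, final)  # atomic swap into place
    return final

def read_tasks(dir_path: str) -> dict:
    """Read all published tasks; observable with plain `ls` and `cat`."""
    tasks = {}
    for name in os.listdir(dir_path):
        if name.endswith(".json"):
            with open(os.path.join(dir_path, name)) as f:
                tasks[name[:-5]] = json.load(f)
    return tasks
```

The appeal is exactly the observability tradeoff described above: any agent (or human with a shell) can inspect the coordination state, at the cost of the atomicity and flow‑control properties a real queue would provide.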
First Principles Analysis:
- “Agent‑native” RL shifts optimization from static next‑token prediction to closed‑loop control under latency and budget constraints. Process and completion‑time rewards align models to finish tasks quickly and reliably, not just answer well.
- Low activated‑parameter designs and MoE+sparse attention architectures target the real bottlenecks for agents: throughput, stability across long horizons, and cost per successful trajectory. As evals lag real‑world agent workloads, operational metrics (TPS/$, failure modes, recovery behavior) may become the decisive differentiators.
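“Cost per successful trajectory” is straightforward to operationalize, and it shows why reliability can beat raw token efficiency. A sketch of the metric (field layout assumed, counting cost in tokens):

```python
def cost_per_success(trajectories):
    """trajectories: iterable of (succeeded: bool, tokens_used: int).

    Failed attempts' tokens are charged against the successes, which is
    why a token-hungry but reliable model can be cheaper per completed
    task than a terse model that fails and must retry.
    """
    total_tokens = sum(tokens for _, tokens in trajectories)
    successes = sum(1 for ok, _ in trajectories if ok)
    if successes == 0:
        return float("inf")  # no successes: cost per success is unbounded
    return total_tokens / successes
```

For example, two successes out of three attempts at 1,000/500/1,500 tokens costs 1,500 tokens per success, regardless of which attempts were the expensive ones.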