TL;DR: MiniMax‑M2.5 goes open source with agent‑native RL; GLM‑5 joins “near‑frontier” open wave; agent tooling and eval debates heat up
Major Highlights:
- MiniMax‑M2.5 open-sourced as an “agent‑native” model: MiniMax released weights and code for M2.5, trained with RL across 200k+ real‑world environments spanning coding, tool use, web search, and office workflows. vLLM and SGLang shipped day‑0 support, framing the model as production‑ready for long‑horizon, always‑on agents.
- Economics and throughput as the headline: MiniMax markets “$1 per hour at 100 TPS” for agent workloads, positioning M2.5 for low‑cost, high‑throughput self‑hosting. Early local reports show ~50 tok/s on MLX and ~40 tok/s on an M3 Ultra (6‑bit) with ~186 GB peak memory—unusually strong on‑device viability for its class.
- Forge RL system surfaces as the bigger story: Community writeups describe MiniMax’s “Forge” as CISPO‑like RL with process and completion‑time rewards, multi‑level prefix caching, and heavy rollout allocation (~60% of compute), yielding millions of trajectories per day. Leadership hints at “10B active” parameterization, argues knowledge capacity (not control) now limits scaling, and teases structural/pretraining changes for M3.
- GLM‑5 positions as “near‑frontier” open: Zhipu’s GLM‑5, served by Together, claims best‑in‑class open‑source agent performance (77.8% SWE‑Bench Verified; 50.4% HLE w/tools), MoE efficiency via “DeepSeek Sparse Attention,” and a permissive license. Community notes potential dataset fingerprints (“truthy‑dpo”) in outputs.
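MiniMax has not published Forge’s exact reward design, but the reported combination of per‑step process rewards with a completion‑time bonus is easy to sketch. Everything below (function name, weights, the 1.0 terminal reward) is illustrative, not MiniMax’s actual implementation:

```python
def shaped_reward(step_rewards, completed, wall_time_s,
                  time_budget_s=600.0, time_weight=0.2):
    """Combine per-step process rewards with a completion-time bonus.

    All names and constants here are hypothetical, chosen only to show
    how process and completion-time signals can be blended.
    """
    # Average process reward over the trajectory's steps.
    process = sum(step_rewards) / max(len(step_rewards), 1)
    if not completed:
        # Unfinished trajectories earn only their process signal,
        # so finishing is always strictly better than stalling.
        return process
    # Finishing faster relative to the time budget earns a larger bonus,
    # aligning the policy toward quick, reliable task completion.
    time_bonus = time_weight * max(0.0, 1.0 - wall_time_s / time_budget_s)
    return process + 1.0 + time_bonus  # 1.0 = flat task-completion reward
```

The key property is that the terminal bonus dominates the process term, so the policy is pushed to actually finish tasks rather than accumulate plausible intermediate steps.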
Key Technical Details:
- Benchmarks (MiniMax‑M2.5): 80.2% SWE‑Bench Verified; 76.3% BrowseComp (vLLM-reported).
- Training scale: 200k+ RL environments; millions of trajectories/day; rollout share ~60% of compute.
- Throughput/cost: “$1/hour at 100 TPS” (MiniMax claim); early local speed ~40–50 tok/s on Apple silicon at 6‑bit with high RAM; self‑hosting helped by low activated‑parameter count.
- Ecosystem: Day‑0 support in vLLM/SGLang; rapid GGUF/quant releases; Intel‑hosted 2‑bit GGUF (MiniMax‑M2) and INT4 for Qwen3‑Coder‑Next.
- GLM‑5: 77.8% SWE‑Bench Verified; 50.4% HLE w/tools; claims 744B params (MoE) and MIT licensing; “new RL framework.”
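Taking the “$1/hour at 100 TPS” claim at face value, the implied per‑token price is a one‑line calculation (a back‑of‑envelope sketch, not a quoted rate card):

```python
# Back-of-envelope: what "$1/hour at 100 TPS" implies per million tokens.
dollars_per_hour = 1.0
tokens_per_second = 100
tokens_per_hour = tokens_per_second * 3600          # 360,000 tokens/hour
cost_per_million = dollars_per_hour / tokens_per_hour * 1_000_000
print(f"${cost_per_million:.2f} per 1M tokens")     # ≈ $2.78
```

That figure matters for the token‑hungriness critique above: a model that burns ~2× the tokens of a competitor can still win on cost per task if its per‑token price is low enough.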
Community Response/Impact:
- Reviews call M2.5 “viable for multi‑turn work,” close to Claude Sonnet on coding, with better stability vs M2.1—but token‑hungry (≈2× Sonnet in one test), making economics crucial.
- Immediate tooling uptake: package mirrors, quantized artifacts, and agent stacks adopting M2.5 for “always‑on” use.
- Eval skepticism persists: SWE‑Bench “saturation” doubts resurface; SWE‑rebench shows different rankings; concerns that token/latency tradeoffs obscure true capability and that high TPS says little about model size or quality.
- Agent engineering pragmatics: Claude Code “Agent Teams” reportedly coordinate via JSON files on disk (simple, observable, tradeoffs on atomicity/backpressure). Terminal‑first agents surge (Cline CLI 2.0) with parallel agents, headless CI/CD, and broad editor support.
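The reported files‑on‑disk coordination pattern is simple to reproduce. A minimal sketch using atomic rename to avoid torn reads; the helper names are hypothetical and this is not Claude Code’s actual mechanism:

```python
import json
import os
import tempfile

def post_task(dir_path: str, task_id: str, payload: dict) -> str:
    """Publish a task as a JSON file: write to a temp file, then rename.

    os.replace is atomic on both POSIX and Windows, so readers never
    observe a half-written JSON file. Note what this does NOT give you:
    backpressure, ordering between agents, or delivery guarantees.
    """
    os.makedirs(dir_path, exist_ok=True)
    final = os.path.join(dir_path, f"{task_id}.json")
    fd, tmp = tempfile.mkstemp(dir=dir_path, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(payload, f)
    os.replace(tmp, final)  # atomic swap into place
    return final

def read_tasks(dir_path: str) -> dict:
    """Read all published tasks; observable with plain `ls` and `cat`."""
    tasks = {}
    for name in os.listdir(dir_path):
        if name.endswith(".json"):
            with open(os.path.join(dir_path, name)) as f:
                tasks[name[:-5]] = json.load(f)
    return tasks
```

The appeal is exactly the observability tradeoff described above: any agent (or human with a shell) can inspect the coordination state, at the cost of the atomicity and flow‑control properties a real queue would provide.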
First Principles Analysis:
- “Agent‑native” RL shifts optimization from static next‑token prediction to closed‑loop control under latency and budget constraints. Process and completion‑time rewards align models to finish tasks quickly and reliably, not just answer well.
- Low activated‑parameter designs and MoE+sparse attention architectures target the real bottlenecks for agents: throughput, stability across long horizons, and cost per successful trajectory. As evals lag real‑world agent workloads, operational metrics (TPS/$, failure modes, recovery behavior) may become the decisive differentiators.
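“Cost per successful trajectory” is straightforward to operationalize, and it shows why reliability can beat raw token efficiency. A sketch of the metric (field layout assumed, counting cost in tokens):

```python
def cost_per_success(trajectories):
    """trajectories: iterable of (succeeded: bool, tokens_used: int).

    Failed attempts' tokens are charged against the successes, which is
    why a token-hungry but reliable model can be cheaper per completed
    task than a terse model that fails and must retry.
    """
    total_tokens = sum(tokens for _, tokens in trajectories)
    successes = sum(1 for ok, _ in trajectories if ok)
    if successes == 0:
        return float("inf")  # no successes: cost per success is unbounded
    return total_tokens / successes
```

For example, two successes out of three attempts at 1,000/500/1,500 tokens costs 1,500 tokens per success, regardless of which attempts were the expensive ones.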