
Feb 13 MiniMax-M2.5: SOTA coding, search, tool calls, $1/hour

news.smol.ai • 12 days ago

TL;DR: MiniMax‑M2.5 goes open source with agent‑native RL; GLM‑5 joins “near‑frontier” open wave; agent tooling and eval debates heat up

Major Highlights:

  • MiniMax‑M2.5 open-sourced as an “agent‑native” model: MiniMax released weights and code for M2.5, trained with RL across 200k+ real‑world environments spanning coding, tool use, web search, and office workflows. vLLM and SGLang shipped day‑0 support, framing the model as production‑ready for long‑horizon, always‑on agents.
  • Economics and throughput as the headline: MiniMax markets “$1 per hour at 100 TPS” for agent workloads, positioning M2.5 for low‑cost, high‑throughput self‑hosting. Early local reports show ~50 tok/s on MLX and ~40 tok/s on an M3 Ultra (6‑bit) with ~186 GB peak memory—unusually strong on‑device viability for its class.
  • Forge RL system surfaces as the bigger story: Community writeups describe MiniMax’s “Forge” as CISPO‑like RL with process and completion‑time rewards, multi‑level prefix caching, and heavy rollout allocation (~60% of compute), yielding millions of trajectories per day. Leadership hints at “10B active” parameterization, argues knowledge capacity (not control) now limits scaling, and teases structural/pretraining changes for M3.
  • GLM‑5 positions as “near‑frontier” open: Together’s GLM‑5 claims best‑in‑class open‑source agent performance (77.8% SWE‑Bench Verified; 50.4% HLE w/tools), MoE efficiency via “DeepSeek Sparse Attention,” and a permissive license. Community notes potential dataset fingerprints (“truthy‑dpo”) in outputs.

Key Technical Details:

  • Benchmarks (MiniMax‑M2.5): 80.2% SWE‑Bench Verified; 76.3% BrowseComp (vLLM-reported).
  • Training scale: 200k+ RL environments; millions of trajectories/day; rollout share ~60% of compute.
  • Throughput/cost: “$1/hour at 100 TPS” (MiniMax claim); early local speed ~40–50 tok/s on Apple silicon at 6‑bit with high RAM; self‑hosting helped by low activated‑parameter count.
  • Ecosystem: Day‑0 support in vLLM/SGLang; rapid GGUF/quant releases; Intel‑hosted 2‑bit GGUF (MiniMax‑M2) and INT4 for Qwen3‑Coder‑Next.
  • GLM‑5: 77.8% SWE‑Bench Verified; 50.4% HLE w/tools; claims 744B params (MoE) and MIT licensing; “new RL framework.”
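
The “$1/hour at 100 TPS” claim implies a per‑token price that is easy to sanity‑check. A quick back‑of‑the‑envelope calculation, assuming the claim holds at sustained throughput with no idle time (an optimistic simplification):

```python
# Convert MiniMax's "$1/hour at 100 TPS" marketing claim into a per-token price.
# Assumes fully sustained throughput, which real agent workloads rarely achieve.
dollars_per_hour = 1.0
tokens_per_second = 100

tokens_per_hour = tokens_per_second * 3600               # 360,000 tokens/hour
cost_per_million_tokens = dollars_per_hour / tokens_per_hour * 1_000_000

print(f"${cost_per_million_tokens:.2f} per million tokens")  # ≈ $2.78
```

At roughly $2.78 per million tokens, the “token‑hungry” critique below matters: a model that burns 2× the tokens of a competitor needs roughly half the per‑token price just to break even on cost per completed task.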

Community Response/Impact:

  • Reviews call M2.5 “viable for multi‑turn work,” close to Claude Sonnet on coding, with better stability vs M2.1—but token‑hungry (≈2× Sonnet in one test), making economics crucial.
  • Immediate tooling uptake: package mirrors, quantized artifacts, and agent stacks adopting M2.5 for “always‑on” use.
  • Eval skepticism persists: SWE‑Bench “saturation” doubts resurface; SWE‑rebench shows different rankings; concerns that token/latency tradeoffs obscure true capability, and that raw throughput (TPS) by itself says little about model size or quality.
  • Agent engineering pragmatics: Claude Code “Agent Teams” reportedly coordinate via JSON files on disk (simple, observable, tradeoffs on atomicity/backpressure). Terminal‑first agents surge (Cline CLI 2.0) with parallel agents, headless CI/CD, and broad editor support.
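
The JSON‑files‑on‑disk coordination pattern reported for Agent Teams can be sketched in a few lines. The real schema is not public, so the file name, fields, and helpers below are illustrative assumptions; the sketch also shows the standard atomic‑rename mitigation for the partial‑write (atomicity) concern the report raises:

```python
import json
import os
import tempfile

# Hypothetical sketch of file-based agent coordination: agents share state
# via a JSON file on disk. Names and fields here are assumptions, not the
# actual Agent Teams schema.
STATE_FILE = "team_state.json"

def write_state(state: dict, path: str = STATE_FILE) -> None:
    """Write to a temp file, then rename into place. os.replace is atomic,
    so readers never observe a half-written file."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f, indent=2)
    os.replace(tmp, path)

def read_state(path: str = STATE_FILE) -> dict:
    with open(path) as f:
        return json.load(f)

# One agent claims a task; any other process (or a human with cat/jq)
# can observe progress directly -- the observability upside.
write_state({"tasks": [{"id": 1, "status": "claimed", "owner": "agent-a"}]})
state = read_state()
```

Note what the sketch does not solve: there is no backpressure, so a fast writer can thrash the file while readers poll, which is exactly the tradeoff the report flags.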

First Principles Analysis:

  • “Agent‑native” RL shifts optimization from static next‑token prediction to closed‑loop control under latency and budget constraints. Process and completion‑time rewards align models to finish tasks quickly and reliably, not just answer well.
  • Low activated‑parameter designs and MoE+sparse attention architectures target the real bottlenecks for agents: throughput, stability across long horizons, and cost per successful trajectory. As evals lag real‑world agent workloads, operational metrics (TPS/$, failure modes, recovery behavior) may become the decisive differentiators.
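
The process‑plus‑completion‑time reward idea attributed to Forge can be sketched abstractly. MiniMax’s actual reward function is not public, so every weight and term below is an illustrative assumption; the point is only the shape: outcome reward, a per‑step process term, and a time bonus that decays toward the deadline.

```python
def trajectory_reward(
    task_completed: bool,
    process_scores: list[float],   # per-step rewards for good intermediate actions
    elapsed_seconds: float,
    deadline_seconds: float = 600.0,
    w_outcome: float = 1.0,        # all weights are assumptions for illustration
    w_process: float = 0.3,
    w_time: float = 0.5,
) -> float:
    """Illustrative blend of outcome, process, and completion-time rewards.
    Finishing correctly AND quickly scores highest; the time bonus decays
    linearly to zero at the deadline and pays nothing for failed runs."""
    outcome = w_outcome if task_completed else 0.0
    process = w_process * (sum(process_scores) / max(len(process_scores), 1))
    time_bonus = 0.0
    if task_completed:
        time_bonus = w_time * max(0.0, 1.0 - elapsed_seconds / deadline_seconds)
    return outcome + process + time_bonus

# Same steps, same success -- the faster trajectory earns more reward,
# which is how such a scheme pushes models to finish, not just answer.
fast = trajectory_reward(True, [1.0, 1.0], elapsed_seconds=60.0)
slow = trajectory_reward(True, [1.0, 1.0], elapsed_seconds=540.0)
```

Under this shape, an agent that stalls or loops pays twice: it forfeits the time bonus and accrues weak process scores, which aligns with the “closed‑loop control under latency and budget constraints” framing above.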