TL;DR: Feb 12 — Gemini 3 Deep Think, Anthropic $30B @ $380B, GPT-5.3-Codex-Spark, MiniMax M2.5
Major Highlights:
- Google ships Gemini 3 Deep Think V2 to users with SOTA reasoning
- Productized “Deep Think” reasoning mode now rolling out to Google AI Ultra subscribers in the Gemini app; Vertex AI/Gemini API early access opening for select researchers/enterprises. Framed as test-time-compute-heavy but deployable, not just a lab demo.
- New state-of-the-art on ARC-AGI-2 (84.6%, independently certified), strong “no tools” results on Humanity’s Last Exam (48.4%), and elite Codeforces Elo (3455; ~top-10 globally).
- Google highlights real engineering/science workflows: debugging math proofs, physics system modeling, semiconductor crystal growth optimization, sketch-to-CAD/STL for 3D printing.
- Anthropic closes massive round; revenue surges
- Closes a reported $30B raise at a $380B valuation. Revenue run-rate jumps >10x to $14B; Claude Code ARR doubles, reaching $2.5B YTD. Positions Anthropic for expanded compute, model training, and enterprise growth.
- OpenAI debuts GPT-5.3-Codex-Spark for speed
- New “Spark” mode targets Claude’s fast mode with >1000 tokens/s generation (≈10x speedup) versus prior baselines; framed as rapid commercialization following the Cerebras deal. Emphasis on latency-sensitive coding/agent workloads.
- China’s open(ish) coding wave: MiniMax M2.5 and GLM-5
- MiniMax M2.5 claims 80.2% on SWE-Bench Verified (Opus-level), with rapid distribution across OpenRouter, Arena, Cline, Eigent, Qoder, and Blackbox AI, plus Ollama Cloud promotions. Community notes strong throughput and price competitiveness.
- GLM-5 circulates with reported 744B total params (~40B active MoE), 28.5T tokens, DeepSeek Sparse Attention, and “Slime” async RL infra; 200K context on YouWare, ~14 tok/s on OpenRouter, and local mlx-lm runs on M3 Ultra (512GB).
Key Technical Details:
- Gemini 3 Deep Think V2: ARC-AGI-2 84.6% (certified), HLE 48.4% (no tools), Codeforces Elo 3455. Jeff Dean emphasizes efficiency: up to 82% cheaper per task on select evals.
- ARC eval pricing (ARC Prize): ~$13.62/task (ARC-AGI-2), ~$7.17/task (ARC-AGI-1).
- GPT-5.3-Codex-Spark: >1000 tok/s generation; positioned as a 10x speedup vs typical LLM output rates, competing with Claude’s 2.5x fast mode.
- MiniMax M2.5: cited 100 tok/s and ~$0.06/M tokens with caching (per Cline); 80.2% SWE-Bench Verified.
- GLM-5: 744B params (~40B active), 28.5T tokens; 200K context window; MoE-style sparsity; community-reported ~14 tok/s cloud, ~15 tok/s locally (mlx-lm, M3 Ultra 512GB).
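A quick back-of-the-envelope sketch of the arithmetic behind these figures. All inputs are the numbers reported above, not independently verified, and the ~100 tok/s baseline used for the Codex-Spark "10x" framing is an assumption, since the digest does not state the comparison rate explicitly:

```python
# Arithmetic check on figures cited in this digest (reported claims only).

# GLM-5 MoE sparsity: fraction of parameters active per token.
total_params_b = 744        # reported total parameters, in billions
active_params_b = 40        # reported active (MoE) parameters, in billions
active_fraction = active_params_b / total_params_b
print(f"GLM-5 active fraction: {active_fraction:.1%}")  # ~5.4%

# GPT-5.3-Codex-Spark speedup: >1000 tok/s vs an ASSUMED ~100 tok/s
# "typical LLM output rate" implied by the 10x framing.
spark_tok_s = 1000
assumed_baseline_tok_s = 100  # assumption, not stated in the digest
print(f"Implied speedup: {spark_tok_s / assumed_baseline_tok_s:.0f}x")

# MiniMax M2.5: cost and wall-clock time for 1M output tokens at the
# cited cached rate (per Cline) and cited throughput.
price_per_m_tokens = 0.06   # USD per 1M tokens, with caching
m25_tok_s = 100
hours_for_1m = 1_000_000 / m25_tok_s / 3600
print(f"1M tokens: ${price_per_m_tokens:.2f}, ~{hours_for_1m:.1f} h")
```

Nothing here is surprising, but the active-parameter fraction (~5.4%) makes concrete why a 744B-param MoE model can run at interactive speeds on a single high-memory workstation.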
Community Response/Impact:
- ARC creator François Chollet welcomes progress but reiterates ARC targets fluid test-time adaptation, not “AGI proof.” Expects benchmarks to evolve until the human–AI gap closes (speculates ~2030 horizon).
- Debate over “no-tools” evaluation conditions (e.g., Codeforces) and generalization to real tasks.
- Practitioners note M2.5 as one of the first open-ish coding models viable for daily work; growing momentum behind agentic, long-horizon coding stacks.
First Principles Analysis:
- The shift is from mere benchmark wins to deployable test-time reasoning: productized heavy reasoning modes, faster token throughput, and lower per-task costs.
- Efficiency gains (82% cheaper tasks; >1000 tok/s generation) expand the feasible frontier for real engineering/science workflows and multi-agent coding systems.
- Funding scale (Anthropic) plus specialized hardware tie-ups (OpenAI–Cerebras) signal a near-term race to compress “reasoning quality × latency × cost” into production-grade offerings.