
Feb 19 — Gemini 3.1 Pro: 2× Gemini 3.0 on ARC-AGI-2

news.smol.ai • 6 days ago

TL;DR: Gemini 3.1 Pro — >2× Gemini 3.0 on ARC-AGI-2

Major Highlights:

  • Strong reasoning jump and broader rollout

    • Google released Gemini 3.1 Pro (developer preview), rolling it into the Gemini app, NotebookLM, Gemini API/AI Studio, and Vertex AI. Positioned as the “same core intelligence” that powers Gemini 3 Deep Think, scaled for products.
    • Headline score: ARC-AGI-2 = 77.1% (>2× Gemini 3.0/3 Pro), corroborated by independent leaderboards (Artificial Analysis, LMSYS Arena).
  • Material gains in coding, tools, and practical outputs

    • Coding/agentic benchmarks improved: SWE-Bench Verified 80.6%; Terminal-Bench 2.0 68.5%; APEX-Agents tool-use 33.5% vs 18.4% on 3 Pro (≈82% relative gain).
    • Community reports highlight better SVG/UI/web outputs and more reliable tool use—practical improvements beyond standard academic evals.
  • Hallucination reduction and cost/intelligence positioning

    • Artificial Analysis (AA) measured a 38-percentage-point drop in hallucination rate vs Gemini 3 Pro Preview on AA-Omniscience.
    • AA ranks 3.1 Pro at or near the top of its Intelligence Index (leading 6/10 evals), with favorable cost-to-intelligence tradeoffs; community sentiment: “Google back on the frontier.”

Key Technical Details:

  • Model/access:

    • Available now: Gemini app, NotebookLM; Developer Preview via Gemini API/AI Studio; Enterprise via Vertex AI. Also on OpenRouter; Perplexity upgraded Pro/Max users to 3.1 Pro.
    • Framing: “Same core intelligence” as Deep Think.
  • Specs:

    • Context window: 1M tokens; max output: 64k tokens
    • Knowledge cutoff: Jan 2025
    • Tooling: tool calling, structured outputs, JSON mode
  • Pricing (as reported by Artificial Analysis/Phil Schmid):

    • $2/$12 per 1M input/output tokens for ≤200k context (AA’s full suite cost: $892; ~57M tokens). Still ~2× cost of open-weights GLM-5 for the same AA run.
  • Benchmarks (selected):

    • ARC-AGI-2: 77.1%
    • SWE-Bench Verified: 80.6%
    • Terminal-Bench 2.0: 68.5% (Hard: 54%)
    • APEX-Agents: 33.5% (vs 18.4% on 3 Pro)
    • MCP Atlas: 69.2%; BrowseComp: 85.9%
    • SciCode: 59%; CritPt (research physics): 18% (best score, leading by >5 percentage points)
    • GDPval-AA: ELO 1316 (up >100 pts, still behind leaders)
    • ARC Prize: ARC-AGI-1 98% at ~$0.52/task; ARC-AGI-2 77% (cost not reported)
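The reported deltas and list prices are easy to sanity-check in a few lines. Only the $2/$12 per-1M-token rates and the benchmark scores come from the post; the 150k-in / 8k-out token split in the example run is a made-up illustration, not a measured workload.

```python
def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, as a fraction."""
    return (new - old) / old

# APEX-Agents: 33.5% vs 18.4% on 3 Pro -> ~82% relative gain, matching the post.
print(f"APEX relative gain: {relative_gain(33.5, 18.4):.0%}")  # -> 82%

def run_cost(input_tokens: int, output_tokens: int,
             in_price: float = 2.0, out_price: float = 12.0) -> float:
    """Cost in USD at the reported <=200k-context rates ($ per 1M tokens)."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Hypothetical single agent run: 150k input tokens, 8k output tokens.
print(f"${run_cost(150_000, 8_000):.2f}")  # -> $0.40
```

Note that the AA suite figure ($892 for ~57M tokens) implies an effective blended rate above the raw list prices, so per-run costs depend heavily on the input/output (and reasoning-token) mix.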
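As a rough illustration of the "structured outputs / JSON mode" tooling listed in the specs, here is a minimal sketch of a `generateContent` request body. The field names follow the public Gemini REST schema, but the model identifier is an assumption of this sketch, not something the post specifies.

```python
import json

# Assumed model id for illustration only; check the API docs for real names.
MODEL = "gemini-3.1-pro-preview"

payload = {
    "contents": [
        {
            "role": "user",
            "parts": [{"text": "Summarize these benchmark results as JSON."}],
        }
    ],
    "generationConfig": {
        # JSON mode: constrain the model to emit JSON only.
        "responseMimeType": "application/json",
        "maxOutputTokens": 64_000,  # the reported 64k output cap
    },
}

# The body is plain JSON, ready to POST to the generateContent endpoint.
print(json.dumps(payload)[:60], "...")
```

For stricter guarantees than free-form JSON, the API also accepts a response schema alongside the MIME type, which is the usual route for "structured outputs" in agent pipelines.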

Community Response/Impact:

  • Enthusiasm: tangible upgrades in coding, agent workflows, SVG/UI design; strong leaderboard showings.
  • Skepticism: benchmark-targeting and “eval tweeting”; GDPval/real-world agentic tasks still not state-of-the-art.
  • Rollout friction: some tools (Gemini CLI/Code Assist/Antigravity) inconsistently available at launch; packaging UX critiques.
  • Market impact: strengthens Google’s cost–intelligence stance vs OpenAI/Anthropic; immediate adoption by aggregators (Perplexity, OpenRouter).

First Principles Analysis:

  • Why it matters: ARC-AGI-2 and agent/tool benchmarks test compositional reasoning and multi-step tool competence—capabilities essential for dependable agents. Gemini 3.1 Pro’s gains, coupled with lower hallucinations and 1M context, shift it from “demo wins” to production viability. However, lag on GDPval signals that robust end-to-end real-world task automation remains unsolved; improvements likely hinge on tighter tool ecosystems, planning reliability, and evals that penalize fragile behaviors, not just average-case accuracy.