
Jan 22: not much happened today

news.smol.ai • about 1 month ago

TL;DR: Jan 22–23, 2026 AI Roundup — Agent Harnesses Go First‑Class, FrontierMath Jump, Google–Sakana Tie-Up, Gemini Quotas Rise

Major Highlights:

  • FrontierMath leap signals real capability gains
    • Epoch reports GPT-5.2 Pro hits 31% on FrontierMath Tier 4 (prev. best 19%), with practitioners citing concrete usefulness (e.g., catching benchmark flaws). Separate analysis shows strong cross-benchmark correlations (≈0.68 across domains; ≈0.79 within-domain), reinforcing a shared “capability factor” across math/coding/reasoning. The counterpoint: systems remain uneven—“smarter than a PhD in math, dumber than an intern”—highlighting reliability gaps outside verifiable domains.
  • Agent harnesses become first-class systems
    • OpenAI published a technical deep dive on the Codex agent loop (assemble inputs → model inference → tool execution → feedback → stop criteria), elevating the harness/orchestration layer as core. Cursor shipped discoverable Agent Skills for dynamic context and specialization, while Anthropic expanded “Claude in Excel” to Pro with safer cell writes, multi-file drag/drop, and longer sessions via auto-compaction. Community research converges on hybrid architectures: Skills, decision-time guidance, and Recursive Language Models (RLMs) that offload long prompts into code + subcalls.
  • Google boosts Gemini quotas; strategic tie-up with Sakana AI
    • Gemini App quotas for Ultra members increased to 1,500 Thinking + 500 Pro prompts/day. Google announced a partnership and investment in Sakana AI to combine Gemini/Gemma with Sakana’s “AI Scientist”/“ALE-Agent,” targeting high-security deployments in Japan.
  • Ecosystem, funding, and second-tier multimodal
    • Baseten raised $300M at a $5B valuation, positioning around a “many-model future” and high-performance inference. In China, Baidu’s ERNIE 5.0 (≈2T params, ~61K context) is reviewed as improved/stable but still “second tier” vs top multimodals with larger compute budgets.
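The Codex-style agent loop described above (assemble inputs → model inference → tool execution → feedback → stop criteria) can be sketched as a generic harness. This is a toy illustration, not OpenAI's implementation: `fake_model` and the `TOOLS` registry are hypothetical stand-ins for the real model call and tool layer.

```python
# Minimal agent-harness loop: the five stages from the Codex write-up.
# `fake_model` and `TOOLS` are illustrative stand-ins, not a real API.

def fake_model(context):
    # Stand-in for model inference: requests a tool call until it sees a result.
    if "result:4" in context:
        return {"type": "finish", "answer": "4"}
    return {"type": "tool_call", "tool": "calc", "args": "2+2"}

TOOLS = {"calc": lambda expr: str(eval(expr))}  # toy tool: arithmetic only

def run_agent(task, max_steps=8):
    context = f"task:{task}"                            # 1) assemble inputs
    for _ in range(max_steps):                          # 5) stop criterion (step budget)
        action = fake_model(context)                    # 2) model inference
        if action["type"] == "finish":
            return action["answer"]
        output = TOOLS[action["tool"]](action["args"])  # 3) tool execution
        context += f"\nresult:{output}"                 # 4) feed observation back
    return None

print(run_agent("what is 2+2?"))  # prints 4
```

The point the deep dive makes is that this outer loop, not the model weights, is where state, tool sequencing, and stop criteria live.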

Key Technical Details:

  • Benchmarks: GPT-5.2 Pro = 31% on FrontierMath Tier 4 (Epoch); correlation across benchmarks ≈0.68 (cross-domain), ≈0.79 (within-domain).
  • Product quotas: Gemini Ultra—1,500 Thinking + 500 Pro prompts/day.
  • Claude in Excel (now in Pro): multi-file drag/drop, safer cell writes, longer sessions via auto-compaction.
  • Agent frameworks: OpenAI Codex loop/harness published; Cursor Agent Skills launched; RLMs advocated by DSPy community for “arbitrarily long” tasks via code + subcalls.
  • Claude Code ecosystem: tasks stored at ~/.claude/tasks enabling multi-session/subagent collaboration; tutorials show local, tool-enabled “Claude Code-like” setups with open models.
  • Corporate moves: Baseten $300M raise at $5B; Google–Sakana partnership for high-security Japanese deployments leveraging Gemini/Gemma.
  • China: Baidu ERNIE 5.0 ≈2T params, ~61K-token context; seen as costlier and “firmly second tier.”
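The RLM idea referenced above — handling “arbitrarily long” inputs by turning one oversized prompt into code plus bounded sub-calls — can be illustrated with a toy recursive reducer. `summarize_chunk` is a hypothetical stand-in for a short-context model call, and the depth cap is an assumption to guarantee termination.

```python
# Toy recursive-decomposition sketch: instead of one huge prompt, code splits
# the input into chunks, issues bounded sub-calls, and recurses on the result.

CHUNK = 20  # pretend the "model" only accepts 20 characters of context

def summarize_chunk(text):
    # Stand-in for a short-context model call: keep the chunk's first word.
    words = text.split()
    return words[0] if words else ""

def rlm_reduce(text, depth=0):
    if len(text) <= CHUNK or depth >= 5:       # fits in one sub-call (or give up)
        return summarize_chunk(text)
    chunks = [text[i:i + CHUNK] for i in range(0, len(text), CHUNK)]
    combined = " ".join(summarize_chunk(c) for c in chunks)
    return rlm_reduce(combined, depth + 1)     # recurse until it fits

print(rlm_reduce("alpha beta gamma delta epsilon zeta"))  # → alpha
```

A real RLM would generate this decomposition code itself; the sketch only shows why sub-calls sidestep the prompt-length bottleneck.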

Community Response/Impact:

  • Strong engagement around Claude in Excel and how it compares with Microsoft 365 Copilot’s pace of iteration; excitement over practical agent-harness patterns.
  • Debates on MoE provenance: arguments that DeepSeek’s MoE isn’t derived from Mixtral (“neoMoE” vs “oldMoE” framing).
  • Timelines: Shane Legg assigns 50% probability to “minimal AGI” by 2028 (with continuous learning/memory/world models); Demis Hassabis says continual learning remains unsolved; exploration of AlphaZero-style methods with foundation models continues.

First Principles Analysis:

  • The locus of progress is shifting from model scaling alone to the agent harness: state representation, tool sequencing, and skill composition. This reduces prompt-length bottlenecks and improves reliability via structured execution.
  • Math/coding gains lead because they’re verifiable and richly augmentable with synthetic data; extrapolation to open-ended tasks is limited until continual learning, memory, and robust world modeling improve.
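The shift described above — toward state representation, tool sequencing, and skill composition in the harness — can be sketched as a minimal skill registry where only matching skills are loaded into the working context, keeping prompts short. The registry and trigger scheme here are illustrative assumptions, not any vendor’s API.

```python
# Minimal skill-composition sketch: skills carry metadata for discovery, and
# the harness injects only the relevant docs into context on demand.

SKILLS = {
    "sql":  {"triggers": ["query", "table"], "doc": "Write SQL against the warehouse."},
    "plot": {"triggers": ["chart", "plot"],  "doc": "Render charts from dataframes."},
}

def select_skills(task):
    # Discovery step: match task text against each skill's trigger keywords.
    words = task.lower().split()
    return [name for name, s in SKILLS.items()
            if any(t in words for t in s["triggers"])]

def build_context(task):
    # State representation: base prompt plus docs for the selected skills only.
    docs = [SKILLS[n]["doc"] for n in select_skills(task)]
    return "\n".join([f"task: {task}"] + docs)

print(select_skills("plot revenue by table"))  # → ['sql', 'plot']
```

Loading skill docs lazily like this is what makes prompt length scale with the task rather than with the full skill library.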