
Jan 26: Anthropic launches the MCP Apps open spec in Claude.ai

news.smol.ai • 30 days ago

TL;DR: Anthropic launches the MCP Apps open spec in Claude.ai

Major Highlights:

  • Anthropic standardizes “rich generative UI” with MCP Apps + Claude.ai support
    Anthropic absorbed the independent MCP UI project and, collaborating with OpenAI, Block, VS Code, Antigravity, JetBrains, AWS, and others, released the MCP Apps open spec with official support in Claude.ai. This creates a vendor-neutral way for AI apps to return interactive, rich UIs and interoperate across tools and IDEs—addressing fragmentation after slow uptake of OpenAI’s ChatGPT Apps. It positions MCP Apps as the shared substrate for agentic interfaces and could consolidate today’s fragmented “$20/month per app” ecosystem.

  • Agent orchestration and recursion-first designs become the default pattern
    NVIDIA’s ToolOrchestra proposes a small “conductor” (Orchestrator-8B) that alternates reasoning with tool calls and specialist/foundation models, claiming frontier-like outcomes at lower cost via end-to-end RL on synthesized multi-turn tool-use tasks. Parallel momentum around Recursive Language Models (RLMs) advocates pass-by-reference context (shell/grep/AST) instead of stuffing everything into prompt context; Daytona pitches sandboxed sub-agents for “unlimited recursion.” A “Clawdbot” meme signals demand for outcome-first UX with tight context/tool integration—paired with renewed warnings about prompt-injection risks for browser/desktop agents.

  • Reasoning/model evals intensify; meta-inference trumps sheer scale in places
    Alibaba’s Qwen3-Max-Thinking touts strong math/agent scores (HMMT Feb 98.0; HLE 49.8) and adaptive tool-use. Tencent’s HunyuanImage 3.0-Instruct (80B MoE, 13B active) enters image-edit leaderboards (Arena rank #7) with native CoT + MixGRPO. A method combining Recursive Self-Aggregation (RSA) with Gemini 3 Flash hits 59.31% on ARC-AGI-2 at ~1/10 the cost of Gemini Deep Think, underscoring the leverage of meta-inference. Open entrants include Molmo 2 (Apache 2.0) and GLM-4.7-Flash via llama.cpp.

  • RL everywhere: test-time training, stability knobs, pretraining, and compute savings
    A cluster of test-time training (TTT) + RL results reports a new upper bound for an Erdős overlap problem, A100 kernels ~2× faster than the best human baselines, and AtCoder wins over the best prior AI and human attempts. Practical levers emerge (GRPO stability with delta=4.0); NVIDIA’s “Reinforcement as a Pretraining Objective” (RLP) is accepted to ICLR 2026; and AI21’s Dynamic Data Snoozing claims up to 3× compute reduction in RLVR by shelving “too-easy” examples.
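The conductor pattern in the orchestration highlight above can be sketched in a few lines. Everything here is illustrative: the function names (`plan`, `run_conductor`), the `TOOLS` registry, and the rule-based planner are assumptions for demonstration only; the real Orchestrator-8B is an RL-trained model, not hand-written rules.

```python
# Hypothetical sketch of a small "conductor" alternating reasoning with
# tool calls. Names and control flow are invented for illustration.

def calculator(expr: str) -> str:
    """Toy specialist tool: evaluate an arithmetic expression (demo only)."""
    return str(eval(expr, {"__builtins__": {}}))

TOOLS = {"calculator": calculator}

def plan(task: str, history: list) -> dict:
    """Stand-in for the controller's reasoning step: pick the next action."""
    if not history:  # nothing computed yet -> route the task to a tool
        return {"action": "call_tool", "tool": "calculator", "args": task}
    return {"action": "finish", "answer": history[-1]}

def run_conductor(task: str, max_steps: int = 4) -> str:
    """Interleave plan() decisions with tool executions until finish."""
    history: list = []
    for _ in range(max_steps):
        step = plan(task, history)
        if step["action"] == "finish":
            return step["answer"]
        history.append(TOOLS[step["tool"]](step["args"]))
    return history[-1] if history else ""

answer = run_conductor("2 + 3 * 4")
```

The point of the pattern is that the loop, not the model's size, carries the routing logic; swapping `plan` for a learned policy is where the RL training in ToolOrchestra comes in.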

Key Technical Details:

  • MCP Apps: open spec for rich UI returned by agents; official Claude.ai support; partners include OpenAI, Block, VS Code, JetBrains, AWS, Antigravity.
  • Orchestrator-8B: small controller alternates reasoning with tool/expert calls; trained end-to-end via scalable RL on synthetic tool-use environments.
  • RLMs: pass-by-reference context; Daytona introduces per-(sub)agent sandboxes for deep recursion.
  • Benchmarks: Qwen3-Max-Thinking (HMMT Feb 98.0; HLE 49.8); HunyuanImage 3.0-Instruct (80B MoE, 13B active, MixGRPO, native CoT, Arena image-edit #7); ARC-AGI-2 59.31% with RSA + Gemini 3 Flash at ~1/10 cost vs Gemini Deep Think; GLM-4.7-Flash via llama.cpp (Q4_K_M, 24k context); Molmo 2 (Apache 2.0).
  • Monitoring scope: 12 subreddits, 544 Twitter/X accounts, 24 Discords (206 channels; 14,285 messages); ~1,208 minutes of reading time saved.
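For flavor on the MCP Apps bullet, here is an illustrative (not spec-exact) shape for a tool result that embeds a renderable HTML resource, in the style the MCP UI project popularized. The field names and the `ui://` URI are assumptions; consult the published MCP Apps spec for the authoritative schema.

```python
import json

def ui_tool_result(html: str, uri: str = "ui://demo/widget") -> dict:
    # Assumed payload shape: a tool result whose content embeds an HTML
    # resource that a host app (e.g. Claude.ai) could render interactively.
    return {
        "content": [
            {
                "type": "resource",
                "resource": {"uri": uri, "mimeType": "text/html",
                             "text": html},
            }
        ]
    }

payload = ui_tool_result("<button>Refresh forecast</button>")
print(json.dumps(payload, indent=2))
```

The vendor-neutral part is exactly this: any host that understands the spec can render the returned resource, regardless of which model or server produced it.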

Community Response/Impact:

  • Broad endorsement of MCP Apps suggests standardization momentum that OpenAI’s ChatGPT Apps alone did not achieve.
  • “Outcome-first” assistant UX is ascendant; security concerns (prompt injection) remain the gating factor for powerful browser/desktop agents.
  • Tool-enabled evals and aggregation strategies reshape leaderboards; open-weight inference continues to commoditize.
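The “aggregation strategies” point can be made concrete with a toy Recursive Self-Aggregation loop. Here aggregation is a plain majority vote, a deliberate stand-in for the real method, which prompts a model to merge a subset of candidate solutions into one refined candidate.

```python
import random
from collections import Counter

def aggregate(subset: list) -> str:
    # Toy aggregation: majority vote. RSA proper would have the model read
    # the subset of candidate solutions and synthesize a better one.
    return Counter(subset).most_common(1)[0][0]

def recursive_self_aggregate(candidates: list, subset_size: int = 3,
                             rounds: int = 2, seed: int = 0) -> str:
    rng = random.Random(seed)
    pop = list(candidates)
    for _ in range(rounds):
        # Each new candidate aggregates a random subset of the population.
        pop = [aggregate(rng.sample(pop, subset_size))
               for _ in range(len(pop))]
    return aggregate(pop)

best = recursive_self_aggregate(["A"] * 8 + ["B"])
```

The recursive structure is why the ARC-AGI-2 result above comes in at a fraction of Deep Think’s cost: inference budget is spent on combining cheap samples rather than on a single expensive forward pass.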

First Principles Analysis:

  • A shared UI/agent schema decouples models from app surfaces, enabling cross-tool interoperability and lowering integration costs—akin to HTTP for agent UIs.
  • In agents, controller size is less decisive than policy quality, tool routing, and realistic RL rollouts.
  • Recursion-first designs minimize context bloat and improve precision, while meta-inference (aggregation, pruning, test-time learning) delivers outsized cost-performance gains relative to raw model scale.
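The recursion-first point can be illustrated with a toy pass-by-reference setup (helper names are invented for illustration): the agent’s context holds only file paths plus a grep-like tool, and document content is resolved on demand rather than inlined into the prompt.

```python
import pathlib
import tempfile

def grep_tool(paths: list, needle: str) -> list:
    # Called by the agent on demand, instead of carrying file contents
    # in its prompt context.
    return [p for p in paths if needle in pathlib.Path(p).read_text()]

with tempfile.TemporaryDirectory() as d:
    refs = []  # the "context": references only, no document text
    for i, text in enumerate(["alpha beta", "beta gamma", "gamma delta"]):
        p = pathlib.Path(d) / f"doc{i}.txt"
        p.write_text(text)
        refs.append(str(p))
    hits = grep_tool(refs, "gamma")  # lazily resolved, like shell/grep in RLMs
```

Context cost stays proportional to the number of references touched, not the corpus size, which is the core economy the RLM argument rests on.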