Feb 17 Claude Sonnet 4.6: clean upgrade of 4.5, mostly better with some caveats

news.smol.ai • 8 days ago

TL;DR: Claude Sonnet 4.6 is a cleaner, broader upgrade over 4.5, approaching Opus-class capability with a larger context window and better agents, at the cost of some efficiency caveats.

Major Highlights:

Bold upgrade with 1M-token context (beta)
- Anthropic released Sonnet 4.6 as the new top “workhorse” model, claiming improvements across coding, computer use, long-context reasoning, agent planning, knowledge work, and design—plus a 1,000,000-token context window in beta.
- Both internally and among early testers, it is framed as nearing Opus-class capability while keeping Sonnet’s lower price tier.
Agentic “Computer Use” is maturing into product
- The once-slow/inaccurate Computer Use (launched Oct 2024) is now productized as Claude Cowork, with anecdotal adoption reportedly ahead of OpenAI’s Operator/agent iterations.
- This signals a shift from demo-level agents to reliable workflow tools in enterprise contexts.
Strong benchmark movement—but higher token burn
- Sonnet 4.6 posts 79.6% on SWE-Bench Verified and 58.3% on ARC-AGI-2.
- On GDPval-AA (agentic knowledge work), it leads with an Elo of 1633 (adaptive thinking, max effort) but consumes far more tokens: ~280M vs. Sonnet 4.5’s ~58M and Opus 4.6’s ~160M (roughly 4.8x and 1.75x, respectively). Total job cost and latency can therefore be higher despite a lower list price.
Tooling quality-of-life upgrades
- New search/fetch performs executable pre-filtering of results, yielding +13% accuracy on BrowseComp while cutting input tokens by 32%.
- Minor upgrades land across API platform tools and Excel integrations.
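
The summary above does not describe how the new search/fetch tooling filters results. As a rough illustration of the general idea (run code over fetched pages first, then pass only the relevant parts into the model's context), here is a minimal Python sketch. All names (`SearchResult`, `prefilter`, `rough_token_count`) are hypothetical, the keyword match is a stand-in for whatever logic a real system would run, and the whitespace-based token count is only a crude proxy for a real tokenizer.

```python
from dataclasses import dataclass


@dataclass
class SearchResult:
    """Hypothetical container for one fetched page."""
    url: str
    text: str


def rough_token_count(text: str) -> int:
    # Crude proxy: counts whitespace-separated words, not real model tokens.
    return len(text.split())


def context_tokens(results: list[SearchResult]) -> int:
    # Total (approximate) tokens that would be placed in the prompt.
    return sum(rough_token_count(r.text) for r in results)


def prefilter(results: list[SearchResult], keywords: list[str],
              max_paragraphs: int = 3) -> list[SearchResult]:
    """Keep only paragraphs that mention at least one keyword.

    Pages with no matching paragraph are dropped entirely, so they
    never reach the model's context.
    """
    lowered = [k.lower() for k in keywords]
    kept = []
    for result in results:
        paragraphs = [p.strip() for p in result.text.split("\n\n") if p.strip()]
        hits = [p for p in paragraphs if any(k in p.lower() for k in lowered)]
        if hits:
            kept.append(SearchResult(result.url, "\n\n".join(hits[:max_paragraphs])))
    return kept


if __name__ == "__main__":
    raw = [
        SearchResult(
            "https://example.com/a",
            "Cookie banner and navigation links.\n\n"
            "Sonnet 4.6 scores 79.6% on SWE-Bench Verified.\n\n"
            "Subscribe to our newsletter.",
        ),
        SearchResult("https://example.com/b", "Unrelated recipe for lentil soup."),
    ]
    filtered = prefilter(raw, ["SWE-Bench"])
    print(context_tokens(raw), "->", context_tokens(filtered))
```

The design point is that the filter runs outside the model: pages or paragraphs it rejects cost no input tokens, which fits the reported pairing of higher accuracy with fewer input tokens.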

Key Technical Details:

Context: 1M tokens (beta).
Benchmarks:
- SWE-Bench Verified: 79.6%.
- ARC-AGI-2: 58.3%.
- Preference: Users preferred Sonnet 4.6 over Opus 4.5 in 59% of comparisons.
- GDPval-AA Elo: 1633, ranked #1, but within the 95% confidence interval of Opus 4.6's score.
Token usage on GDPval-AA: Sonnet 4.6 ~280M vs Opus 4.6 ~160M vs Sonnet 4.5 ~58M.
Pricing: Claimed unchanged from Sonnet 4.5; typical Sonnet rate cited externally as $3 input / $15 output per 1M tokens.
Availability/integrations: Cursor, Windsurf, Microsoft Foundry (Azure), Perplexity Pro/Max, and Comet browser agent (for Pro).
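
To make the cost caveat concrete, the token totals above can be turned into ratios and rough cost bounds. This is a minimal sketch, assuming the cited $3 / $15 per 1M token rates apply to both Sonnet versions. The input/output split of the reported totals is not given, so only an all-input lower bound and an all-output upper bound are computed. Function names are illustrative, and caching or other discounts are ignored.

```python
# Token totals reported for GDPval-AA, in millions of tokens.
TOKENS_M = {"Sonnet 4.6": 280, "Opus 4.6": 160, "Sonnet 4.5": 58}

# Externally cited Sonnet list prices, USD per 1M tokens.
# Assumption: the same rates apply to Sonnet 4.5 and Sonnet 4.6.
SONNET_INPUT_USD = 3
SONNET_OUTPUT_USD = 15


def token_ratio(model: str, baseline: str) -> float:
    """How many times more tokens `model` used than `baseline`."""
    return TOKENS_M[model] / TOKENS_M[baseline]


def cost_bounds_usd(model: str) -> tuple[int, int]:
    """(lower, upper) cost bounds at Sonnet list prices.

    Lower bound bills every token as input, upper bound bills every
    token as output; the true cost lies somewhere in between.
    """
    tokens = TOKENS_M[model]
    return tokens * SONNET_INPUT_USD, tokens * SONNET_OUTPUT_USD


def breakeven_price_fraction(model: str, baseline: str) -> float:
    """Blended per-token price of `model`, as a fraction of `baseline`'s,
    at which both models cost the same on this workload."""
    return TOKENS_M[baseline] / TOKENS_M[model]


if __name__ == "__main__":
    print(round(token_ratio("Sonnet 4.6", "Sonnet 4.5"), 2))
    print(cost_bounds_usd("Sonnet 4.6"), cost_bounds_usd("Sonnet 4.5"))
    print(round(breakeven_price_fraction("Sonnet 4.6", "Opus 4.6"), 2))
```

Under these assumptions, the Sonnet 4.6 run falls between $840 and $4,200 versus $174 to $870 for Sonnet 4.5, and Sonnet 4.6 would need a blended per-token price below about 57% of Opus 4.6's to come out cheaper on this workload.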

Community Response/Impact:

Positive: “Approaching Opus-class,” “insane jump over 4.5,” improved “taste”/aesthetics in outputs; noticeable gains on long tasks and agent planning.
Mixed/negative: Early reports of hallucinated function names and broken structured outputs; later anecdotally “seems fixed.” Cursor notes it’s better than 4.5 on longer tasks but still below Opus 4.6 for raw intelligence.
Implications: Enterprises get a cheaper model edging into frontier capability—with the catch that higher token use can erase cost advantages on complex, agentic workloads.

First Principles Analysis:

The big shifts are (1) massive context and (2) smarter retrieval via “compute-before-context.” Executable pre-filtering reduces prompt bloat and boosts signal-to-noise, directly addressing long-context inefficiency.
Sonnet 4.6’s agentic gains suggest the model is spending more tokens to think/plan—great for quality on messy tasks, but it pressures cost/latency. Buyers should optimize for total job cost, not list price.
Productizing Computer Use as Claude Cowork indicates agent reliability crossed a threshold from novelty to utility, potentially redefining how “knowledge work” gets automated at scale.