Feb 17 Claude Sonnet 4.6: clean upgrade of 4.5, mostly better with some caveats

news.smol.ai • 8 days ago

TL;DR: Claude Sonnet 4.6 is a cleaner, broader upgrade over 4.5—approaching Opus-class with bigger context, better agents, and some efficiency caveats

Major Highlights:

  • Bold upgrade with 1M-token context (beta)

    • Anthropic released Sonnet 4.6 as the new top “workhorse” model, claiming improvements across coding, computer use, long-context reasoning, agent planning, knowledge work, and design—plus a 1,000,000-token context window in beta.
    • Both Anthropic and early testers frame it as nearing Opus-class capability while keeping Sonnet’s lower price tier.
  • Agentic “Computer Use” is maturing into product

    • The once-slow/inaccurate Computer Use (launched Oct 2024) is now productized as Claude Cowork, with anecdotal adoption reportedly ahead of OpenAI’s Operator/agent iterations.
    • Signals a shift from demo-level agents to reliable workflow tools in enterprise contexts.
  • Strong benchmark movement—but higher token burn

    • Sonnet 4.6 posts 79.6% on SWE-Bench Verified and 58.3% on ARC-AGI-2.
    • On GDPval-AA (agentic knowledge work), it leads with an Elo of 1633 (adaptive thinking, max effort) but consumes far more tokens: ~280M vs ~58M for Sonnet 4.5 and ~160M for Opus 4.6. Total job cost and latency can therefore be higher despite the lower list price.
  • Tooling quality-of-life upgrades

    • New search/fetch performs executable pre-filtering of results, yielding +13% accuracy on BrowseComp while cutting input tokens by 32%.
    • Minor upgrades land across API platform tools and Excel integrations.
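
To make the pre-filtering idea concrete, here is a minimal sketch in Python. It is not Anthropic's implementation: the source only says results are filtered by executable code before they reach the model. The `SearchResult` type, the `prefilter` function, the keyword-matching rule, and the example URLs are all hypothetical stand-ins for whatever filter code actually runs.

```python
from dataclasses import dataclass


@dataclass
class SearchResult:
    url: str
    text: str


def prefilter(results, query_terms, max_chars=500):
    """Keep only paragraphs that mention a query term, trimmed to max_chars per result.

    Hypothetical stand-in for the filtering code that runs before
    search results are placed in the model's context.
    """
    terms = [t.lower() for t in query_terms]
    kept = []
    for r in results:
        # Split each page into paragraphs and keep only the relevant ones.
        paragraphs = [p.strip() for p in r.text.split("\n\n") if p.strip()]
        relevant = [p for p in paragraphs if any(t in p.lower() for t in terms)]
        if relevant:
            kept.append(SearchResult(r.url, "\n\n".join(relevant)[:max_chars]))
    return kept


# Made-up fetched pages: one useful sentence buried in page chrome.
results = [
    SearchResult(
        "https://example.com/a",
        "Sonnet 4.6 scores 79.6% on SWE-Bench Verified.\n\nSubscribe to our newsletter.",
    ),
    SearchResult("https://example.com/b", "Cookie policy.\n\nAbout us."),
]

filtered = prefilter(results, ["SWE-Bench"])
before = sum(len(r.text) for r in results)
after = sum(len(r.text) for r in filtered)
print(f"{len(results)} results -> {len(filtered)}; {before} chars -> {after} chars")
```

Only the filtered text would be passed to the model, which is how input tokens can fall while accuracy rises: the irrelevant page chrome never enters the prompt.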

Key Technical Details:

  • Context: 1M tokens (beta).
  • Benchmarks:
    • SWE-Bench Verified: 79.6%.
    • ARC-AGI-2: 58.3%.
    • Preference: Users preferred Sonnet 4.6 over Opus 4.5 in 59% of comparisons.
    • GDPval-AA Elo: 1633, ranked #1 but within the 95% confidence interval of Opus 4.6.
  • Token usage on GDPval-AA: Sonnet 4.6 ~280M vs Opus 4.6 ~160M vs Sonnet 4.5 ~58M.
  • Pricing: Claimed unchanged from Sonnet 4.5; typical Sonnet rate cited externally as $3 input / $15 output per 1M tokens.
  • Availability/integrations: Cursor, Windsurf, Microsoft Foundry (Azure), Perplexity Pro/Max, and Comet browser agent (for Pro).

Community Response/Impact:

  • Positive: “Approaching Opus-class,” “insane jump over 4.5,” improved “taste”/aesthetics in outputs; noticeable gains on long tasks and agent planning.
  • Mixed/negative: Early reports of hallucinated function names and broken structured outputs; later anecdotally “seems fixed.” Cursor notes it’s better than 4.5 on longer tasks but still below Opus 4.6 for raw intelligence.
  • Implications: Enterprises get a cheaper model edging into frontier capability—with the catch that higher token use can erase cost advantages on complex, agentic workloads.

First Principles Analysis:

  • The two big shifts are (1) a massive context window and (2) smarter retrieval via “compute-before-context.” Executable pre-filtering reduces prompt bloat and boosts signal-to-noise, directly addressing long-context inefficiency.
  • Sonnet 4.6’s agentic gains suggest the model is spending more tokens to think/plan—great for quality on messy tasks, but it pressures cost/latency. Buyers should optimize for total job cost, not list price.
  • Productizing Computer Use as Claude Cowork indicates agent reliability crossed a threshold from novelty to utility, potentially redefining how “knowledge work” gets automated at scale.
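
The "total job cost, not list price" point can be checked with a few lines of arithmetic. The sketch below uses the GDPval-AA token totals and the $3 / $15 per 1M Sonnet rates quoted above. The `output_share` parameter and its 20% value are assumptions for illustration only, since the source does not report how the tokens split between input and output. Caching and batch discounts are ignored.

```python
# Token totals reported for the GDPval-AA run, in millions of tokens.
TOKENS_M = {"Sonnet 4.5": 58, "Opus 4.6": 160, "Sonnet 4.6": 280}


def token_ratio(model, baseline):
    """How many times more tokens `model` used than `baseline`."""
    return TOKENS_M[model] / TOKENS_M[baseline]


def job_cost_usd(tokens_m, output_share, in_rate=3.0, out_rate=15.0):
    """Cost of a job given total tokens (millions) and the share billed as output.

    Rates default to the $3 / $15 per 1M tokens cited for Sonnet.
    output_share is a hypothetical knob: the source does not report the split.
    """
    out_m = tokens_m * output_share
    in_m = tokens_m - out_m
    return in_m * in_rate + out_m * out_rate


vs_45 = token_ratio("Sonnet 4.6", "Sonnet 4.5")
vs_opus = token_ratio("Sonnet 4.6", "Opus 4.6")
print(f"Sonnet 4.6 used {vs_45:.2f}x the tokens of Sonnet 4.5")
print(f"Sonnet 4.6 used {vs_opus:.2f}x the tokens of Opus 4.6")

# Assumed 20% output share, purely for illustration.
for name in ("Sonnet 4.5", "Sonnet 4.6"):
    print(f"{name}: ${job_cost_usd(TOKENS_M[name], 0.2):,.0f}")
```

With unchanged per-token pricing and the same input/output mix, cost scales directly with tokens, so this workload would cost roughly 4.8x more on Sonnet 4.6 than on Sonnet 4.5. Against Opus 4.6, Sonnet 4.6 used 1.75x the tokens, so it is only cheaper on this workload if Opus's blended per-token price is more than 1.75x Sonnet's.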