TL;DR: Claude Code Anniversary + Qwen 3.5, Cursor Demos, OpenAI GPT‑5.3 Codex, Inception Mercury 2
Major Highlights:
Qwen 3.5 bets on “more intelligence, less compute”
- Alibaba’s new mid-sized family—Qwen3.5-Flash (hosted, 1M-token context by default, built-in tools), Qwen3.5-35B-A3B (MoE), Qwen3.5-122B-A10B (MoE), and Qwen3.5-27B (dense)—bets on architecture, data, and RL rather than raw parameter scaling. Early users report that 35B-A3B and 122B-A10B feel unusually strong, with “intelligence-per-watt” claims such as the 35B model surpassing a 235B predecessor.
- Rapid ecosystem support: GGUF builds and sizing guidance from Unsloth, SGLang serving, and aggressive INT4 quant releases are driving local and low‑bit deployment enthusiasm.
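The “intelligence-per-watt” pitch rests on MoE arithmetic: only a few billion parameters fire per token. A back-of-envelope sketch, assuming the “A&lt;k&gt;B” suffix means roughly k billion active parameters per token (a reading of the naming convention, not an official spec) and that forward-pass FLOPs scale with active parameters:

```python
# Back-of-envelope: per-token compute of MoE vs dense models.
# Assumes ~2 FLOPs per active parameter per token, and that the
# "A3B"/"A10B" suffixes denote ~3B/~10B active parameters.

def flops_per_token(active_params_b: float) -> float:
    """Rough forward-pass FLOPs per token for a given active-parameter count."""
    return 2 * active_params_b * 1e9

models = {
    "Qwen3.5-35B-A3B (MoE)":   3.0,   # 35B total, ~3B active
    "Qwen3.5-122B-A10B (MoE)": 10.0,  # 122B total, ~10B active
    "Qwen3.5-27B (dense)":     27.0,  # dense: every parameter is active
}

dense = flops_per_token(27.0)
for name, active in models.items():
    ratio = flops_per_token(active) / dense
    print(f"{name}: ~{ratio:.2f}x the per-token compute of the 27B dense model")
```

By this rough measure the 35B-A3B costs about a ninth of the dense 27B per token, which is why it reads as a sweet spot for local deployment even before INT4 quantization.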
Coding agents become the product surface (OpenAI + Anthropic)
- OpenAI shipped GPT‑5.3‑Codex to all developers via the Responses API, expanded direct file ingestion (docx, pptx, csv, xlsx, and more), and highlighted WebSockets as a throughput win (~30% faster rollouts). Third‑party leaderboards show strong Codex 5.3 placements on TerminalBench, IOI, LiveCodeBench, and VibeCodeBench. Pricing is cited as $1.75 input / $14 output.
- Anthropic introduced Claude Code “Remote Control,” letting devs start a local terminal session and continue it from their phone; it also rolled out enterprise cowork and plugin‑customization updates that drew strong engagement.
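The cited prices are easy to turn into per-request estimates. A minimal sketch, assuming (the post does not say) that the figures are per million tokens; the constants and example token counts are illustrative:

```python
# Rough request-cost estimate at the cited GPT-5.3-Codex prices,
# assuming they are quoted per million tokens (not confirmed above).

INPUT_PER_M = 1.75    # $ per 1M input tokens (cited)
OUTPUT_PER_M = 14.00  # $ per 1M output tokens (cited)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the cited rates."""
    return (input_tokens / 1e6) * INPUT_PER_M + (output_tokens / 1e6) * OUTPUT_PER_M

# e.g. a coding-agent turn with a 50k-token context and a 2k-token patch:
print(f"${request_cost(50_000, 2_000):.4f}")  # 0.0875 + 0.0280 = $0.1155
```

The 8x input/output asymmetry matters for agent loops: long contexts are cheap relative to long generations, which rewards designs that re-read a lot but emit tight diffs.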
Cursor pivots to “demos, not diffs”
- A new UX has agents exercise the software they build and return video demos. Builders report practical gains from cloud VMs: async execution, self‑verification loops, and demo artifacts as deliverables—shifting review from reading code diffs to observing behavior.
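The self-verification loop above can be sketched in a few lines. This is a hypothetical skeleton, not Cursor’s implementation: a shell command and its captured stdout stand in for launching a VM and recording a video demo, and `run_demo`/`self_verify` are names invented here.

```python
import subprocess

def run_demo(cmd: list[str]) -> str:
    """Execute the built artifact and capture its observable behavior."""
    return subprocess.run(cmd, capture_output=True, text=True).stdout

def self_verify(cmd: list[str], check, max_attempts: int = 3) -> bool:
    """Run the demo until a behavioral check passes or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        artifact = run_demo(cmd)
        if check(artifact):
            print(f"demo passed on attempt {attempt}")
            return True
        # a real agent would patch the code here before retrying
    return False

# Toy usage: "echo" plays the role of the freshly built app.
ok = self_verify(["echo", "hello from the app"], lambda out: "hello" in out)
```

The point of the pattern is that review targets the captured artifact (`artifact` here, a video in Cursor’s case) rather than the diff that produced it.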
Diffusion for language races on speed (Inception Mercury 2)
- Mercury 2, a “reasoning diffusion LLM,” emphasizes output speed (~1,000 tokens/s). While not frontier‑leading on intelligence, it posts solid agentic/coding results (Terminal‑Bench Hard, IFBench) and showcases architecture‑level parallel token refinement as a latency play.
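The latency argument for diffusion LLMs is that many positions are refined per step instead of one token per step. A toy illustration follows—emphatically not Mercury’s actual algorithm: the “denoiser” is an oracle that copies from a fixed target string, and real models would pick the highest-confidence positions rather than random ones.

```python
import random

MASK = "_"
TARGET = list("diffusion decodes tokens in parallel")  # stand-in for model output

def refine_step(seq: list, k: int, rng: random.Random) -> list:
    """Commit up to k masked positions at once (parallel refinement)."""
    masked = [i for i, t in enumerate(seq) if t == MASK]
    for i in rng.sample(masked, min(k, len(masked))):
        seq[i] = TARGET[i]  # oracle "denoiser" for the demo
    return seq

rng = random.Random(0)
seq, steps = [MASK] * len(TARGET), 0
while MASK in seq:
    seq = refine_step(seq, k=8, rng=rng)
    steps += 1
print("".join(seq), "| steps:", steps)  # 36 tokens / 8 per step -> 5 steps
```

An autoregressive decoder would need 36 sequential steps for the same output; the parallel-refinement loop needs 5, which is the shape of the ~1,000 tokens/s claim.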
Key Technical Details:
- Qwen 3.5: Flash (1M context, tools), 35B‑A3B MoE, 122B‑A10B MoE, 27B dense; INT4 quant variants; SGLang + GGUF/Unsloth support. Qwen3.5‑397B‑A17B trended on HF and scored well in Code Arena’s webdev‑style agent evals.
- OpenAI: GPT‑5.3‑Codex available in the Responses API; expanded file-type ingestion; WebSockets reported to yield a ~30% rollout speedup; pricing cited as $1.75 input / $14 output. Strong third‑party benchmark showings.
- Inception Labs: Mercury 2 emphasizes speed (~1k tok/s) with competitive agentic/coding evaluations.
- SWE‑bench Multilingual: 300 tasks across 9 languages, with a reported SOTA of 72%. It highlights that rankings can invert outside English/Python.
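Rank inversion is a weighting effect worth seeing concretely. A minimal sketch with invented numbers (agents “A”/“B”, task counts, and solve rates are all hypothetical, not SWE‑bench Multilingual data): the Python-heavy aggregate crowns one agent while the other wins every non‑Python language.

```python
# Hypothetical per-language solve rates and task counts (invented for
# illustration): "A" dominates Python, "B" wins every other language.
scores = {
    "python": {"n": 200, "A": 0.85, "B": 0.60},
    "java":   {"n": 30,  "A": 0.50, "B": 0.65},
    "go":     {"n": 30,  "A": 0.45, "B": 0.60},
    "rust":   {"n": 30,  "A": 0.40, "B": 0.55},
}

def aggregate(agent: str) -> float:
    """Overall solve rate: tasks solved across all languages / total tasks."""
    solved = sum(v["n"] * v[agent] for v in scores.values())
    total = sum(v["n"] for v in scores.values())
    return solved / total

for agent in ("A", "B"):
    print(agent, f"{aggregate(agent):.3f}")
# A wins the aggregate on Python volume alone, despite losing java/go/rust.
```

This is why a benchmark that rebalances task counts across languages can reorder a leaderboard without any model changing.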
Community Response/Impact:
- Practitioners hail Qwen 35B‑A3B as a sweet spot for local and efficient deployment; ultra‑low‑bit quant workflows gain momentum.
- Reliability and safety remain bottlenecks: a Princeton study quantifies a capability–reliability gap (12 dimensions, with only modest reliability gains); OpenClaw shows that “routine‑step decomposition” can bypass safeguards; and research finds that LLM‑generated AGENTS.md/context files often reduce success rates while raising costs.
- Enterprise traction: Claude cowork/plugin updates and Remote Control resonate with team workflows and mobile development.
First Principles Analysis:
- The center of gravity is moving from sheer scale to systems engineering: MoE + quantization + serving stacks (WebSockets, SGLang) and artifact‑driven UX (videos, terminals, files) compress verification loops and raise trust.
- Diffusion‑style parallel refinement attacks the latency wall; 2026 competition likely optimizes latency and throughput as much as benchmark maxima.
- Multilingual SWE evaluations expose data coverage gaps; winning agent stacks will hinge on non‑English code/data pipelines and robust, reliability‑first orchestration.