
Feb 18: not much happened today

Source: news.smol.ai

TL;DR: AI News for 2/17–2/18/2026 (Frontier Model Churn, Agent Harnesses, EVMbench)

Major Highlights:

  • Claude 4.6 jumps in capability, but at a steep token cost
    • Artificial Analysis places Sonnet 4.6 at 51 on its Intelligence Index (up from 43 for Sonnet 4.5 with reasoning), just behind Opus 4.6 at 53. However, Sonnet 4.6 consumed ~74M output tokens to run the suite vs ~25M for Sonnet 4.5 and ~58M for Opus 4.6, costing ~$2,088 at “max effort.” Community feedback notes better critique and architecture skills but ongoing reliability and tooling issues around Claude Code. Arena added Opus/Sonnet 4.6 to its search leaderboard, and Anthropic published autonomy telemetry from millions of tool-using interactions.
  • Qwen 3.5: efficiency vs overthinking, and open FP8 weights
    • Qwen3.5 drew attention for “excess thinking” and token bloat; community reports suggest Qwen3.5-Plus trims long-chain verbosity but may regress on non-reasoning tasks. Distribution momentum continued: Qwen3.5-Plus is now on Vercel AI Gateway, and Alibaba Cloud launched a fixed-price, high-cap coding plan. Critically, Alibaba released FP8 weights for Qwen3.5‑397B‑A17B, with SGLang support live and vLLM support imminent, exemplifying fast OSS ecosystem bring-up.
  • GLM‑5 pushes “agentic engineering” with detailed RL infra
    • The GLM‑5 technical report details asynchronous agent RL (decoupling generation and training) and DSA for lower compute with preserved long-context performance. Practitioners praised unusually actionable details (optimizers, state handling, data curation for terminal envs/slide gen), positioning GLM‑5 as a replicable OSS reference.
  • OpenAI launches EVMbench for smart-contract agent security
    • EVMbench evaluates agents on detecting, exploiting, and patching high-severity EVM vulnerabilities, signaling agentic security as a first-class evaluation domain. The community immediately compared model families on precision/recall and exploit reliability.
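
The token-economics point above can be made concrete. A quick back-of-the-envelope sketch using only the figures reported in this digest (~74M vs ~25M output tokens, ~$2,088 total); the per-million-token rate is derived from those numbers, not an official price:

```shell
# Figures from the article: Sonnet 4.6 spent ~74M output tokens on the
# Artificial Analysis suite vs ~25M for Sonnet 4.5, at ~$2,088 total.
SONNET_46_TOKENS_M=74
SONNET_45_TOKENS_M=25
SONNET_46_COST=2088

# Token inflation factor: how many times more output tokens 4.6 spent.
awk -v a="$SONNET_46_TOKENS_M" -v b="$SONNET_45_TOKENS_M" \
    'BEGIN { printf "token inflation: %.1fx\n", a / b }'
# prints "token inflation: 3.0x"

# Implied blended cost per million output tokens (derived, not a quoted price).
awk -v c="$SONNET_46_COST" -v t="$SONNET_46_TOKENS_M" \
    'BEGIN { printf "implied $/M output tokens: %.2f\n", c / t }'
# prints "implied $/M output tokens: 28.22"
```

Roughly 3× the tokens at an implied ~$28 per million output tokens is the “steep token cost” in question: capability gains have to be weighed against a near-tripled suite bill.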

Key Technical Details:

  • Claude 4.6 Intelligence Index: Opus 4.6 = 53; Sonnet 4.6 = 51; Sonnet 4.5 (reasoning) = 43.
  • Token usage to run the suite: Sonnet 4.6 ~74M; Sonnet 4.5 ~25M; Opus 4.6 ~58M; Sonnet 4.6 cost ≈ $2,088 at “max effort.”
  • Anthropic autonomy telemetry: ~73% of tool calls are human-in-the-loop; 0.8% appear irreversible; ~50% of tool calls on their API are software-engineering tasks.
  • Harness delta (same model): LangChain Deep Agents CLI completed the same task in 9s vs 16s for Claude Code (1.7× faster).
  • SWE-bench: the leaderboard now standardizes on mini‑SWE‑agent v2, changing performance baselines; parallel critiques of the “SWE‑fficiency” ranking continue.
  • Qwen3.5‑397B‑A17B FP8: open weights; SGLang supported; vLLM PR “within days.”
  • Gemini 3.1 Pro: early tests suggest longer “thinking” traces vs Gemini 3 Pro; skepticism on benchmark trust and adversarial robustness persists.
  • MiniMax M2.5/M2.5 Lightning: added to community leaderboards via OpenRouter/prompt-vote.
  • Practical safety note: avoid Windows-specific “2>nul” in Git Bash/MSYS2 to prevent undeletable files; prefer Unix redirects or wrap with “cmd /c.”
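
The `2>nul` pitfall above is easy to demonstrate. In Git Bash/MSYS2 (or any POSIX shell), `2>nul` is not the cmd.exe null device; it is a redirect into a regular file literally named `nul`, which on Windows collides with the reserved NUL device name and becomes hard to delete. The sketch below runs in any POSIX shell, where the stray file is at least removable:

```shell
#!/bin/sh
cd "$(mktemp -d)"

# Wrong: "2>nul" is a cmd.exe idiom. In a POSIX shell it creates a regular
# file named "nul" holding the redirected stderr.
ls no-such-file 2>nul || true
test -f nul && echo "created a stray file named 'nul'"

# Right: use the Unix null device...
ls no-such-file 2>/dev/null || true

# ...or, on Windows, hand the whole command line to cmd.exe instead:
#   cmd /c "dir no-such-file 2>nul"

rm nul   # trivially deletable here; on Windows the device-name collision makes this painful
```

The `|| true` guards only keep the sketch's exit status clean; the core point is the difference between the two redirect targets.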

Community Response/Impact:

  • Strong interest in Claude 4.6’s gains tempered by token economics and complaints about Claude Code stability/docs.
  • Broader recognition that orchestration, memory, and tool policies dominate perceived “agent capability.”
  • Momentum toward open weights plus immediate inference-stack support as table stakes for OSS competitiveness.
  • Heightened scrutiny of evals: harness changes (SWE-bench), leaderboard optics, and adversarial test reliability.

First Principles Analysis:

  • The center of gravity is shifting from “model-only” performance to “agent = model + harness + memory + tools + supervision.” Small orchestration choices can yield 1.7× speedups and different success rates without changing weights.
  • Token efficiency is now a core product constraint. Models that “think longer” must justify cost via measurable gains on complex tasks.
  • Open weights paired with rapid ecosystem enablement (SGLang/vLLM) compress time-to-adoption—crucial for reproducibility and community-driven optimization.
  • Evals are specializing by domain (e.g., EVMbench), reflecting a move toward measuring outcomes in high-stakes, tool-rich environments rather than generic benchmarks alone.