TL;DR: Agentic Engineering — WTF Happened in December 2025?
Major Highlights:
- Perplexity launches “Computer,” an orchestration-first agent platform
- Computer coordinates parallel, asynchronous sub-agents (research, coding, media) under a central “coordinator” model, shifting from single-loop chats to distributed workflows.
- Emphasis on systems UX: multi-model routing, sandboxed tools/environments, persistent memory, spend caps, and usage-based pricing.
- Rollout: web access for Max first, then Pro/Enterprise; Max includes 10k monthly credits plus a time-limited bonus. Sub-agents can select models per task.
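The coordinator-plus-sub-agents pattern can be sketched in a few lines. This is an illustrative sketch, not Perplexity's actual API: the agent names, the one-credit-per-sub-agent accounting, and the `coordinator` signature are all assumptions made for the example.

```python
import asyncio

# Hypothetical orchestration-first loop: a coordinator fans a task out to
# specialist sub-agents in parallel and gathers their results. In a real
# system each sub-agent could route to a different model and use tools.

async def research_agent(task: str) -> str:
    await asyncio.sleep(0)  # stand-in for tool calls / model inference
    return f"research notes for {task!r}"

async def coding_agent(task: str) -> str:
    await asyncio.sleep(0)
    return f"patch for {task!r}"

async def coordinator(task: str, spend_cap_credits: int = 100) -> dict:
    # Run specialists concurrently; a production system would also persist
    # memory between steps and enforce the spend cap before dispatching.
    research, code = await asyncio.gather(research_agent(task), coding_agent(task))
    return {
        "task": task,
        "research": research,
        "code": code,
        "credits_remaining": spend_cap_credits - 2,  # 1 credit per sub-agent (assumed)
    }

print(asyncio.run(coordinator("add retry logic")))
```

The point of the shape: sub-agents are independent coroutines, so the coordinator gets parallelism and per-task model choice for free, while spend caps and memory live at the coordinator layer.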
- Coding agents hit a “phase change” in December
- Andrej Karpathy reports end-to-end autonomy on multi-hour dev tasks (e.g., SSH → vLLM deploy → model bench → API → UI → systemd → report) with minimal intervention, a move from brittle demos to coherent, tenacious execution.
- Devtools echo the shift (Cursor, independent builders), framing December as an inflection point for real-world reliability and long-horizon tasking.
- New coding stack drops: GPT-5.3-Codex, Claude Code integrations, Copilot CLI GA
- OpenAI GPT‑5.3‑Codex: ~25% faster than 5.2 with fewer tokens per task; early claims of strong SWE-Bench Pro and “86% on IBench” circulate, but methodology caveats apply.
- Claude Code: year-one maturity; adds a Slack plugin and LangSmith tracing to diagnose routing and suspected “nerfing”; raises context-length vs. memory tradeoff concerns.
- GitHub Copilot CLI GA: adds /research for repo-wide inquiries via GitHub code search + MCP-based dynamic fetching; exports to gists; improved terminal UX.
- Open models surge: Qwen3.5 “Medium” wave + local-agent tipping point
- Day-0 support across vLLM, GGUF, LM Studio, Ollama, Jan underscores rapid deployment pipelines.
- Claims: near-lossless 4-bit weight and KV-cache quantization; long context (27B at ~800K+ tokens, 35B‑A3B at >1M on 32GB VRAM, 122B‑A10B at 1M+ on 80GB GPUs); FP8 weights with native vLLM/SGLang support.
- Practitioners report that Qwen3.5‑35B‑A3B improves local agent loops (tool calling, stability) while activating only ~3B params per token, making local viable alongside top hosted agents.
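The memory math behind the local-MoE claim is worth spelling out. A back-of-envelope sketch, assuming 4-bit weights cost 0.5 bytes per parameter and ignoring runtime overhead:

```python
# Rough weight-memory estimate: params (billions) * bits-per-param / 8.
def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

# All 35B MoE params must be resident even though only ~3B fire per token,
# so memory tracks total size while per-token compute tracks the ~3B figure.
print(f"35B @ 4-bit:  {weight_memory_gb(35, 4):.1f} GB of weights")
print(f"122B @ 4-bit: {weight_memory_gb(122, 4):.1f} GB of weights")
```

At 4 bits, 35B params need roughly 17.5 GB, which is why a 32 GB card can hold the weights and still leave headroom for a (quantized) KV cache; the 122B model at ~61 GB similarly fits an 80 GB GPU.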
Key Technical Details:
- Perplexity Computer: orchestration-first, coordinator + specialist sub-agents; usage-based with spend caps; Max priority access; 10k monthly credits + time-limited bonus.
- GPT‑5.3‑Codex: ~25% latency improvement vs 5.2; efficiency gains in token usage; strong early SWE-Bench Pro chatter; IBench “86%” claim pending verification.
- Copilot CLI: /research feature using GitHub code search + MCP dynamic fetching; shareable reports via gists.
- Qwen3.5: FP8 weights released; 27B, 35B‑A3B (MoE), 122B‑A10B; >1M-token contexts on consumer/pro GPUs; broad inference stack coverage.
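The >1M-token context claims hinge on KV-cache quantization as much as on weight quantization. A rough sketch using the standard KV-cache size formula; the layer/head/dim numbers below are placeholders, not Qwen3.5's actual architecture:

```python
# KV cache size: 2 (keys + values) * layers * KV heads * head dim
# * sequence length * bytes per element.
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, bytes_per_elem: float) -> float:
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

# Hypothetical config: 48 layers, 8 KV heads (GQA), head_dim 128, 1M tokens.
print(f"fp16 KV:  {kv_cache_gb(48, 8, 128, 1_000_000, 2):.0f} GB")
print(f"4-bit KV: {kv_cache_gb(48, 8, 128, 1_000_000, 0.5):.0f} GB")
```

Under these assumptions an fp16 cache at 1M tokens would dwarf any single GPU, while 4-bit KV brings it down by 4x, which is why near-lossless KV quantization is the enabling claim for million-token contexts on consumer hardware.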
Community Response/Impact:
- Broad sentiment that “software development changed” around Dec 2025 as agents began closing the loop reliably.
- Evaluation caution: warnings about benchmaxxing and MoE-vs-dense confusion; surprising parity across sizes on some benches; Arena adds Qwen3.5 Medium for head-to-head comparisons.
- Reliability research flags that failures often stem from compounding tool-call errors; capability gains haven’t translated linearly into reliability.
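The compounding-error point has a simple model behind it: if each tool call succeeds independently with probability p and the agent cannot recover from a miss, an n-step loop succeeds with probability p^n. A minimal illustration:

```python
# Independent per-step success p, no recovery: whole-loop success is p**n.
def loop_success(p: float, n: int) -> float:
    return p ** n

for n in (1, 10, 50):
    print(f"p=0.99, {n:>2} steps -> {loop_success(0.99, n):.2f}")
```

Even a 99%-reliable step drops below ~61% success over 50 steps, which is why capability gains alone don't buy long-horizon reliability; recovery and checkpointing change the math.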
First Principles Analysis:
- The center of gravity is moving from single-model chats to orchestrated, multi-agent systems with resource governance, observability, and specialized tools.
- MoE + FP8/quant advances plus mature local runtimes (vLLM, SGLang, GGUF/Ollama) make “local-first” agents practical for many workflows.
- Real progress hinges on reliability engineering (interface design, tracing, recoverability, and minimal yet meaningful agent benchmarks) more than on raw model scores.