
Jan 15: not much happened today

news.smol.ai • about 1 month ago

TL;DR: Jan 15 — Quiet day, but meaningful moves in long‑horizon coding, inference speed, and agent tooling

Major Highlights:

  • OpenAI ships GPT‑5.2‑Codex for long‑running coding; Cursor + GitHub integrate immediately

    • OpenAI added GPT‑5.2‑Codex to the Responses API, positioning it as the strongest coding model for extended tasks (feature work, refactors, bug‑finding) and “most cyber‑capable” for code vulnerability understanding. Cursor framed it as a frontier model for long workflows; GitHub rolled it into @code and tweaked preview/GA labeling to ease enterprise adoption.
    • A standout datapoint: a team ran GPT‑5.2 autonomously in Cursor for a week to build a basic browser, generating 3M+ lines of Rust across thousands of files (HTML parsing → CSS layout → painting → custom JS VM). It “kind of works” for simple sites and now serves as a reference point for continuous agent runtimes. An emerging best practice: build explicit “review” loops into agents to control quality and safety.
  • Inference “speed is the product”: OpenAI–Cerebras tie‑up and granular provider benchmarks

    • Cerebras announced a partnership with OpenAI, reinforcing latency/tokens‑per‑second as visible product differentiators versus rivals.
    • Artificial Analysis benchmarked GLM‑4.7 across providers: Cerebras served ~1,445 output tok/s with a ~1.6 s time to first answer token (TTFAT); GPU providers (e.g., Fireworks/Baseten) lagged on throughput and latency but offered larger context windows (~200k vs Cerebras’s ~131k) and different caching discounts.
    • Ops content emphasized cost and throughput: Modal argued self‑hosted inference can match or beat API economics; scaling notes covered operating large fleets (~20k GPUs) with vLLM + FlashInfer, async scheduling, and careful batch sizing to saturate H100s.
  • Agent engineering gets productized: skills, dynamic tools, and when not to go multi‑agent

    • “Skills” emerge as a portability layer (standardized folders; CLI/MCP compatibility) to reduce plugin versioning pain. LangChain’s LangSmith Agent Builder treats “agents as a filesystem” with built‑in memory, ambient triggers, skills/MCP/subagents; CopilotKit bridges to UI apps.
    • Guidance trend: default to a single agent; only go multi‑agent for hard constraints (context limits, distributed ownership, necessary decomposition).
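
The “skills as standardized folders” idea above can be sketched in a few lines. This is a minimal illustration, not LangChain’s or Anthropic’s actual layout: it assumes each skill lives in its own subfolder with a hypothetical `manifest.json` naming and versioning it.

```python
import json
from pathlib import Path

def load_skills(root: Path) -> dict[str, dict]:
    """Discover skills under `root`: each subfolder with a manifest.json
    (hypothetical layout) becomes one named, versioned skill."""
    skills = {}
    for manifest in root.glob("*/manifest.json"):
        meta = json.loads(manifest.read_text())
        skills[meta["name"]] = {
            "version": meta.get("version", "0.0.0"),
            "path": manifest.parent,  # agent loads prompts/scripts from here
            "description": meta.get("description", ""),
        }
    return skills
```

Because the unit of distribution is a plain folder, the same skill can be picked up by a CLI, an MCP server, or an agent runtime without plugin version negotiation.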

Key Technical Details:

  • GPT‑5.2‑Codex: in OpenAI Responses API; integrated by Cursor and GitHub @code.
  • Autonomous build case: 1 week continuous run, 3M+ Rust LOC, thousands of files; basic rendering pipeline and custom JS VM.
  • GLM‑4.7 serving: ~1,445 tok/s (Cerebras), TTFAT ~1.6s; context: Cerebras ~131k vs ~200k on some GPU providers.
  • Infra ops: vLLM + FlashInfer backend; Modal shares practices for batch inference and large GPU fleets.
  • Coverage scope: 12 subreddits, 544 Twitter accounts, 24 Discords (204 channels; 5,168 messages). New site with metadata search: https://news.smol.ai/
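
The serving numbers above translate directly into user‑visible latency. A rough back‑of‑the‑envelope model (ignoring queueing, network jitter, and speculative decoding): end‑to‑end time is TTFAT plus output tokens divided by steady‑state throughput. The 200 tok/s GPU figure below is a hypothetical contrast point, not a benchmarked number.

```python
def e2e_latency_s(ttfat_s: float, out_tokens: int, tok_per_s: float) -> float:
    """Rough end-to-end latency: time to first answer token plus
    decode time at steady-state throughput."""
    return ttfat_s + out_tokens / tok_per_s

# Figures from the digest: Cerebras serving GLM-4.7.
cerebras = e2e_latency_s(1.6, 2000, 1445)  # ~3.0 s for a 2k-token answer
# Hypothetical GPU provider at 200 tok/s, for contrast.
gpu = e2e_latency_s(1.6, 2000, 200)        # ~11.6 s
```

At these throughputs, decode time dominates for long answers, which is why tokens‑per‑second is becoming a headline product metric.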

Community Response/Impact:

  • Evaluation discourse: “vibes vs metrics” for coding models; long‑horizon evals (e.g., METR‑style) may detect capability jumps earlier than standard benchmarks.
  • Enterprise angle: clarity on model readiness (preview vs GA) seen as key to adoption.
  • Agent ops: rising consensus that quality requires first‑class review/verification loops.
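
The review/verification‑loop consensus can be made concrete with a small sketch. The `generate` and `review` callables here are hypothetical stand‑ins for model calls; the point is the control flow: a reviewer gates every draft, and its feedback seeds the next attempt.

```python
from typing import Callable, Optional

def run_with_review(task: str,
                    generate: Callable[[str, str], str],
                    review: Callable[[str, str], Optional[str]],
                    max_rounds: int = 3) -> str:
    """Generate-review loop: the reviewer returns feedback text,
    or None/empty when the draft is acceptable."""
    feedback = ""
    draft = ""
    for _ in range(max_rounds):
        draft = generate(task, feedback)
        feedback = review(task, draft) or ""
        if not feedback:      # reviewer found no issues: accept
            return draft
    return draft              # best effort after max_rounds
```

Bounding the rounds matters in practice: an unbounded loop on a week‑long autonomous run is exactly where quality and cost drift.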

First Principles Analysis:

  • The news crystallizes three fronts: (1) models that sustain reliable, week‑long chains of thought and action; (2) economics shifting to latency/throughput as core UX and margin drivers; and (3) agent ecosystems converging on minimal, composable “skills” plus a single‑agent default. Research threads (DroPE long‑context without positional embeddings; DeepSeek’s Engram memory with hashed O(1) lookups) point to architectures separating compute from memory—critical for scaling long‑horizon reasoning without blowing up cost.
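
The digest only names DeepSeek’s Engram, so as a loose illustration of the underlying idea (constant‑time hashed lookup into a memory store that grows independently of compute), here is a toy hashed memory; none of the naming or structure below comes from the paper.

```python
import hashlib
from typing import Optional

class HashedMemory:
    """Toy illustration: keys hash into a fixed-size table, so lookup
    cost stays O(1) no matter how much history has been written."""
    def __init__(self, slots: int = 1 << 16):
        self.slots = slots
        self.table: list[list[tuple[str, str]]] = [[] for _ in range(slots)]

    def _slot(self, key: str) -> int:
        digest = hashlib.sha256(key.encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.slots

    def write(self, key: str, value: str) -> None:
        self.table[self._slot(key)].append((key, value))

    def read(self, key: str) -> Optional[str]:
        for k, v in self.table[self._slot(key)]:
            if k == key:
                return v
        return None
```

The design point is the separation: attention‑style compute scans only what it must, while recall against a large store is a hash probe rather than a pass over the full context.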