
Jan 02: not much happened today

Source: news.smol.ai

TL;DR: Jan 02 — not much happened today (AI News 1/1/2026–1/2/2026)

Major Highlights:

  • mHC makes hyper-connections stable and scalable
    • DeepSeek’s Manifold‑Constrained Hyper‑Connections (mHC) generalize residuals from one stream to n parallel streams with learned mixing along identity and update paths. The key advancement is constraining the residual mixing matrix to the Birkhoff polytope (doubly stochastic), which preserves identity-like behavior while avoiding explosive/vanishing products over depth. Early training on 3B/9B/27B models shows smoother token-scaling curves, higher stability, and small but consistent quality gains versus baseline residuals and naïve HCs.
  • Systems + math jointly drive the result
    • Beyond theory, DeepSeek reports fused kernels, activation recomputation in backward, mixed precision tuning, and dedicated high-priority streams for pipeline comms—underscoring that architectural gains matter only when backed by kernel/memory/parallelism engineering.
  • Long-horizon agents shift from “bigger context” to “context management”
    • Prime Intellect’s Recursive Language Models (RLMs) propose that agents should learn to manage their own context—offloading work to tools/sub-LLMs and keeping the main context small. Parallel conversations argue production value moves from datasets to durable “context graphs” (decision/action traces) and from prompt tweaks to full stack optimization (RAG, tools, memory, orchestration).
  • Coding agents: harness design may be the differentiator
    • Practitioners argue current agent harnesses underutilize frontier models; better orchestration and eval design (e.g., SWE-Bench claims scrutiny, less LLM-judge bias) could unlock step-changes in code automation, not just raw model upgrades.
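The mHC mechanism above can be illustrated with a toy NumPy sketch. This is not DeepSeek's implementation; the function names, the single shared sublayer update, and the hand-built mixing matrix are simplifications for illustration. It shows the core idea: an identity-centered, doubly stochastic matrix mixes n parallel residual streams before the residual add.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8  # n parallel residual streams, width d (toy sizes)

# A doubly stochastic mixing matrix built as a convex combination of
# permutation matrices (by Birkhoff-von Neumann, every doubly stochastic
# matrix is such a combination). Here: mostly identity, plus a small
# cyclic shift, so the operator stays identity-centered.
I = np.eye(n)
P = np.roll(I, 1, axis=0)      # cyclic permutation of the streams
H = 0.9 * I + 0.1 * P          # rows and columns each sum to 1

def block(x):
    """Stand-in for the transformer sublayer f(.)."""
    return np.tanh(x)

def mhc_layer(streams):
    """One hyper-connection step: mix the streams along the identity
    path, then add the sublayer update (a single shared update is a
    simplification of the per-stream update path)."""
    mixed = H @ streams                  # (n, d) identity-centered mixing
    update = block(mixed.mean(axis=0))   # (d,) shared sublayer output
    return mixed + update                # residual add on every stream

x = rng.normal(size=(n, d))
y = mhc_layer(x)
```

With n = 1 and H = [[1]] this reduces to an ordinary residual connection, which is why the paper describes mHC as a generalization of residuals.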

Key Technical Details:

  • mHC mechanism: n-stream residual path with learned mixing; residual matrix constrained to the Birkhoff polytope; efficient Sinkhorn-like row/column normalization projection.
  • Stability/performance: example backward-pass amplification capped at ~1.6 under mHC vs ~3000 for naïve HC; improved stability across depth and modest loss/benchmark improvements.
  • Overhead: ~6.7% training overhead reported for n=4.
  • Scale: experiments cited at 3B/9B/27B parameters; better token-scaling curves vs baseline.
  • Systems: fused kernels, mixed precision details, activation recomputation in backward, pipeline comm scheduling on high-priority streams.
  • Meta: smol.ai launched a new site with full metadata search across past issues (news.smol.ai).

Community Response/Impact:

  • Researchers see residual-path design becoming a first-class scaling lever (callouts by @teortaxesTex, @rasbt, @norxornor, @AskPerplexity).
  • @iamgrigorev links mHC to broader architectural trends (residual variants, GRAPE-like positional work, Muon optimizers), speculating MLP expansion factors may be partly offset by more expressive residual streams.
  • Agent discourse centers on enterprise moats via traceable “context graphs” (@ashugarg), and 2026 themes of enterprise agents + scientific acceleration (@gdb), plus “verification over belief” and “tool users → system owners” (@TheTuringPost).
  • Memory practices debated: pragmatic “MEMORIES.md” per project (@giffmana) vs risks of overlearning and preference for explicit, inspectable memory tools (@swyx).

First Principles Analysis:

  • Residuals work by preserving identity and enabling deep signal flow; HCs expand capacity but destabilize via matrix products. Constraining to doubly stochastic matrices sustains a bounded, identity-centered operator over depth (closure under multiplication), combining expressivity with stable gradients.
  • For agents, the bottleneck is not tokens but control: selecting, summarizing, verifying, and reusing context. Durable value accrues in structured traces and verifiability, pushing the moat from raw data toward system design, instrumentation, and governance.
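The closure argument for stability can be checked numerically. Doubly stochastic matrices are convex combinations of permutation matrices, so each has spectral norm at most 1, and products of them remain doubly stochastic; unconstrained mixing matrices compound over depth. A small demo (my construction, not the paper's experiment):

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 4, 64

def doubly_stochastic(rng, n, iters=50):
    """Sample a near-doubly-stochastic matrix via Sinkhorn iteration."""
    M = np.exp(rng.normal(size=(n, n)))
    for _ in range(iters):
        M /= M.sum(axis=1, keepdims=True)
        M /= M.sum(axis=0, keepdims=True)
    return M

# Compose `depth` mixing steps: constrained (doubly stochastic)
# vs unconstrained (identity plus random perturbation).
P_con = np.eye(n)
P_unc = np.eye(n)
for _ in range(depth):
    P_con = doubly_stochastic(rng, n) @ P_con
    P_unc = (np.eye(n) + 0.3 * rng.normal(size=(n, n))) @ P_unc

norm_con = np.linalg.norm(P_con, 2)  # spectral norm stays <= 1
norm_unc = np.linalg.norm(P_unc, 2)  # grows without bound with depth
```

The bounded product is exactly the identity-preserving behavior that keeps backward amplification near 1 over depth, while the unconstrained product mirrors the explosive gains reported for naïve HCs.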