TL;DR (Jan 19): not much happened today
Major Highlights:
- Bold architectures for scaling memory and context
- STEM (Scaling Transformers with Embedding Modules) from CMU + Meta swaps ~1/3 of the FFN up-projection for a static, token-indexed embedding lookup while keeping the gate + down-projection dense. The static lookup avoids MoE-style runtime routing and cross-device comms, enabling CPU offload and async prefetch. Bottom line: decouple model capacity from per-token FLOPs and comms via systems-friendly static sparsity, an alternative to expert-parallel MoE.
- RePo (Context Re-Positioning) by Sakana AI lets LMs adaptively reorder positional structure so relevant distant tokens are “pulled closer” in attention while noise is pushed away. It targets robustness on noisy contexts, structured inputs, and long-range dependencies, complementing retrieval/packing rather than replacing it.
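The STEM bullet above can be sketched in a few lines. This is a minimal numpy illustration of the idea, not the paper's implementation: the dimensions, the exact 1/3 split, and the SwiGLU-style gate structure are assumptions for the sake of the example. The key point is that the static slice of the up-projection is a pure table lookup indexed by token id, so it involves no matmul, no runtime routing, and can be offloaded/prefetched.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_ff, vocab = 64, 256, 1000
d_static = d_ff // 3            # portion of the up-projection served by lookup
d_dense = d_ff - d_static

# Gate and down-projection stay fully dense, as in the summary above.
W_gate = rng.normal(scale=0.02, size=(d_model, d_ff))
W_up = rng.normal(scale=0.02, size=(d_model, d_dense))   # shrunken dense up-proj
W_down = rng.normal(scale=0.02, size=(d_ff, d_model))

# Static, token-indexed table standing in for the removed up-projection slice.
# Indexed only by token id, so it can live in CPU DRAM and be async-prefetched.
E_static = rng.normal(scale=0.02, size=(vocab, d_static))

def silu(x):
    return x / (1.0 + np.exp(-x))

def stem_ffn(x, token_ids):
    """x: (seq, d_model); token_ids: (seq,)."""
    up_dense = x @ W_up                  # computed per token (FLOPs)
    up_static = E_static[token_ids]      # pure memory lookup (no FLOPs, no routing)
    up = np.concatenate([up_dense, up_static], axis=-1)
    gate = silu(x @ W_gate)
    return (gate * up) @ W_down

x = rng.normal(size=(8, d_model))
ids = rng.integers(0, vocab, size=8)
y = stem_ffn(x, ids)
print(y.shape)  # (8, 64)
```

Growing `vocab` or `d_static` adds capacity without changing the per-token matmul cost, which is the decoupling the bullet describes.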
- GLM-4.7-Flash: a 30B-class, MLA + small MoE model aimed at coding/agents
- Zhipu AI released GLM-4.7-Flash, later clarified as a 30B-A3B MoE with “MLA” design choices (unconventional head dims and higher head counts after down-projection). Community summaries suggest ~3B active parameters per token and strong coding/agent benchmarks (SWE-bench Verified, τ²-Bench, HLE, BrowseComp), with Qwen leading on LCB. Treat these as second-hand until verified via the model card.
- Inference and deployment: day-0 ecosystem support, speed-first workflows
- Immediate integrations: mlx-lm (0.30.3) with reported 4-bit performance on an M5 32GB laptop (~43 tok/s generation, ~800 tok/s prefill), LM Studio (Apple Silicon via MLX), Ollama v0.14.3+ (pre-release), vLLM day-0 PR, and Hugging Face Inference Providers (including local via Ollama + Harbor).
- Engineers increasingly favor small, fast models for synchronous coding: for most interactive work, “good enough” capability at low latency beats maximal scale.
Key Technical Details:
- STEM: static embedding lookup replaces part of FFN up-projection; predictable memory access patterns; mitigates MoE routing overhead/instability; better CPU offload and async prefetch; reduces cross-device communication.
- RePo: module reshapes attention geometry by re-indexing based on content relevance; framed via Cognitive Load Theory; aims at long-range dependency handling and noise robustness.
- GLM-4.7-Flash: 30B-A3B MoE, MLA-style architecture; claimed strengths on coding and agentic tasks; availability via MLX/LM Studio/Ollama/vLLM/HF providers; local 4-bit metrics reported by mlx-lm.
- Systems trends (China recap): KV-cache offload for “cold” tokens to DRAM; decode-to-prefill GPU flow; dual-hash routing (“power of two choices”); latency slack exploitation; agent memory as reusable KV blocks to preserve prefix continuity.
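The “dual-hash routing (power of two choices)” item above refers to a classic load-balancing trick: hash each request with two independent hashes to get two candidate workers, then send it to the less-loaded one. The sketch below is a generic illustration of that technique, not the specific system's implementation; the load metric (open-request counts) and worker count are assumptions for the example.

```python
import hashlib

class TwoChoiceRouter:
    """Route keys to the less-loaded of two hash-chosen candidates
    ("power of two choices"). Loads here are open-request counts; a real
    serving router might track queue depth or KV-cache pressure instead."""

    def __init__(self, num_workers):
        self.loads = [0] * num_workers

    def _hash(self, key, salt):
        # Two independent candidates via two salted hashes of the same key.
        h = hashlib.blake2b(f"{salt}:{key}".encode(), digest_size=8)
        return int.from_bytes(h.digest(), "big") % len(self.loads)

    def route(self, key):
        a, b = self._hash(key, 0), self._hash(key, 1)
        chosen = a if self.loads[a] <= self.loads[b] else b
        self.loads[chosen] += 1
        return chosen

    def finish(self, worker):
        self.loads[worker] -= 1

router = TwoChoiceRouter(num_workers=8)
for i in range(10_000):
    router.route(f"req-{i}")
print(max(router.loads), min(router.loads))
```

Compared with a single hash, picking the better of two candidates shrinks the worst-case load gap dramatically (the well-known O(log log n) balance result), which is why it shows up in inference routers.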
Community Response/Impact:
- MoE skepticism fuels interest in static-sparse, systems-friendly designs (STEM).
- The “compression” narrative around GLM (e.g., GLM-4.5 110B → GLM-4.7 ~31B) reflects a push to retain capability at smaller scales; the details remain community interpretation rather than confirmed.
- Agents/memory debates: filesystem-as-memory (simplicity, tool familiarity) vs database-first (indexing, concurrency, permissions) split persists.
- Cerebras vs GPUs reframed: Cerebras excels at ultra-low-latency small-model inference; GPUs remain better for FLOPs/memory efficiency in typical workloads.
First Principles Analysis:
- The frontier is shifting from isolated model kernels to end-to-end SLO-goodput systems. Static sparsity (STEM) and attention geometry control (RePo) reflect a broader move to architect for predictable memory access, reduced comms, and robustness to messy inputs. Combined with rapidly maturing local runtimes and KV-centric systems designs, “fast-enough, small-enough, robust-enough” stacks look set to dominate most interactive and agentic workloads, while giant dense models recede to offline or specialized roles.
Note: The newsletter scanned 12 subreddits, 544 Twitter accounts, and 24 Discords (205 channels; 13,654 messages), estimates 1,062 minutes saved at 200 wpm, and launched a searchable archive at news.smol.ai. Consider reviewing the ARC AGI 2025 Report for broader context.