Jan 20: not much happened today

Source: news.smol.ai • about 1 month ago

TL;DR: Jan 20, "a quiet day" AI news roundup

Major Highlights:

  • X open-sources “For You” recommender (Grok-style transformer)

    • X Engineering published its ranking/recommendation stack, claiming it uses the same transformer architecture as xAI’s Grok. Early community reads highlight familiar components (candidate-generation isolation, strong out-of-network discovery, minimal content features) alongside skepticism that “it uses a transformer” implies Grok “reads every post.” Reactions split between optimism about transparency and immediate adversarial tinkering; creators simultaneously reported sudden reach drops, underscoring that transparency does not guarantee perceived fairness.
  • GLM-4.7-Flash emerges as a strong local model, with KV-cache caveats

    • GLM-4.7-Flash (30B A3B) drew momentum as a “local workhorse,” with claims of best-in-class 30B results on SWE-Bench and GPQA, 200K context, GGUF packaging, and “runs locally with 24GB RAM.” Threads emphasize that the KV cache dominates memory earlier than many expect, and that MLA is not free if run as naive MHA: users diagnosed vLLM’s observed ~1MB/token against a first-principles ~54KB/token for GLM-Flash. Quantization issues (looping) and mitigation tips also circulated.
  • Reasoning research cluster: internal debate, multiplex tokens, distillation, and synthetic data strategy

    • “Societies of Thought” frames gains in o-series/R1/QwQ as emergent internal debate patterns (questioning, disagreement, convergence), attributing a sizable chunk (20%+) of accuracy gains to these behaviors. “Multiplex Thinking” proposes branch-and-merge tokens (K candidates per step) to encode uncertainty efficiently. Distillation via a token-ranking logistic loss (vs KL/SFT) gained attention. DeepMind-summarized results suggest that, under a fixed inference budget, drawing more samples from smaller models improves synthetic data quality (coverage +11%, diversity +86%), enabling training gains of up to 31.6%.
  • Agents in production: RLMs and long-horizon compute management

    • Recursive Language Models (RLMs) were pitched as a pragmatic backbone for long-running systems: symbolic recursion enables many sub-reads/edits without tokenizing every intermediate, mitigating context-window blowups that plague tool-chains and sub-agents.
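The token-ranking distillation idea above can be sketched as a pairwise logistic loss over the teacher's token ordering. The function below (`pairwise_ranking_loss`) and its pair-sampling scheme are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def pairwise_ranking_loss(student_logits, teacher_logits, n_pairs=4, rng=None):
    """Train the student to order tokens the way the teacher ranks them,
    rather than matching the full distribution (KL) or only the gold
    token (SFT). Pair sampling and count are hedged assumptions."""
    if rng is None:
        rng = np.random.default_rng(0)
    order = np.argsort(-teacher_logits)  # teacher's token ranking, best first
    loss = 0.0
    for _ in range(n_pairs):
        i, j = sorted(rng.choice(len(order), size=2, replace=False))
        hi, lo = order[i], order[j]      # teacher ranks `hi` above `lo`
        margin = student_logits[hi] - student_logits[lo]
        loss += np.log1p(np.exp(-margin))  # logistic loss on the pair
    return loss / n_pairs
```

A student whose logits agree with the teacher's ordering gets positive margins and a loss below log 2 per pair; a student that inverts the ordering gets negative margins and a larger loss.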

Key Technical Details:

  • X algorithm: candidate generation isolation; emphasis on out-of-network discovery; minimal content features; transformer-based ranking. Code published by X Engineering on GitHub.
  • GLM-4.7-Flash: 30B A3B; 200K context; “local on 24GB RAM”; GGUF available. Quant tips: increase --dry-multiplier to 1.1; prefer higher-quality quants (e.g., UD-Q4_K_XL+); add tool-calling data in calibration.
  • KV cache: vLLM observed ~1MB/token for GLM-Flash under naïve MHA vs an expected ~54KB/token; MLA–MHA mismatch can explode memory.
  • Throughput: ~100 tok/s with tensor-parallel GLM-Flash across 4× M4 Pro Mac Minis using RDMA over Thunderbolt + MLX; target ~200 tok/s.
  • GLM ecosystem: GLM-Image reached #8 on an open image leaderboard in one snapshot; devs report “one-shotting” small projects locally.
  • NanoGPT “speedrun”: ~99.3s with bigram hash embedding added to residual stream; deviates from Chinchilla token/parameter ratio.
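One way the circulated anti-looping quant tips might translate into a llama.cpp invocation (a hedged sketch: the GGUF filename is a placeholder, and flag names should be verified against your build with `llama-server --help`):

```shell
# Apply the circulated tips: higher-quality quant + DRY sampling against loops.
llama-server \
  -m GLM-4.7-Flash-UD-Q4_K_XL.gguf \
  --ctx-size 32768 \
  --dry-multiplier 1.1
```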
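The ~1MB/token vs ~54KB/token gap is back-of-envelope arithmetic. The dimensions below are illustrative assumptions chosen to land near the reported numbers, not GLM-4.7-Flash's published config:

```python
# Per-token KV-cache sizing in fp16, under assumed (not published) dims.
BYTES = 2            # fp16
LAYERS = 48          # assumed transformer depth
HEADS, HEAD_DIM = 48, 128  # assumed attention shape
LATENT_DIM = 576     # assumed MLA compressed KV latent (e.g. 512 + 64 RoPE)

# Naive MHA: cache full K and V for every head in every layer.
mha_per_token = 2 * LAYERS * HEADS * HEAD_DIM * BYTES

# MLA done right: cache only the shared compressed latent per layer.
mla_per_token = LAYERS * LATENT_DIM * BYTES

print(f"naive MHA:  {mha_per_token / 1024:.0f} KiB/token")   # 1152 KiB ~ 1.1 MB
print(f"MLA latent: {mla_per_token / 1024:.0f} KiB/token")   # 54 KiB
print(f"blow-up:    {mha_per_token / mla_per_token:.0f}x")   # ~21x
```

The point of the thread stands independent of the exact dims: serving an MLA model through a plain-MHA code path multiplies KV memory by an order of magnitude or more.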
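The bigram-hash-embedding trick from the NanoGPT speedrun can be sketched as follows; the table size, hash function, and dimensions here are assumptions, and the actual speedrun code differs:

```python
import numpy as np

# Hash each (prev_token, token) bigram into a fixed-size embedding table
# and add the looked-up vector to the residual stream alongside the usual
# unigram token embedding.
VOCAB, D_MODEL, TABLE_SIZE = 50304, 64, 2**16  # assumed sizes
rng = np.random.default_rng(0)
tok_emb = rng.standard_normal((VOCAB, D_MODEL)).astype(np.float32) * 0.02
bigram_emb = rng.standard_normal((TABLE_SIZE, D_MODEL)).astype(np.float32) * 0.02

def bigram_hash(prev_tok: int, tok: int) -> int:
    # Cheap mixing hash; the real hash choice is an implementation detail.
    return (prev_tok * 1000003 + tok) % TABLE_SIZE

def embed(tokens):
    x = tok_emb[tokens].copy()       # unigram embeddings, shape (T, D_MODEL)
    for i in range(1, len(tokens)):  # add the hashed bigram vector per position
        x[i] += bigram_emb[bigram_hash(tokens[i - 1], tokens[i])]
    return x

h = embed([11, 42, 42, 7])
print(h.shape)  # (4, 64)
```

The first position has no preceding token, so it receives only its unigram embedding; every later position gets a cheap, parameter-light bigram signal injected directly into the residual stream.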

Community Response/Impact:

  • Transparency vs trust gap: X’s code drop met with enthusiasm and immediate attempts to “fix”/game it; creators report suppressed reach.
  • Local-first momentum: Interest coalescing around GLM-4.7-Flash for practical, high-throughput local inference; active debugging/quant threads indicate rapid, community-driven hardening.
  • Research energy: Strong appetite for methods that compress reasoning (multiplex tokens, ranking-based distillation) and stretch budgets (synthetic data from smaller models).

First Principles Analysis:

  • Platform transparency raises two tensions: exposing incentives (driving content drift) and enabling gaming. The market will test whether openness improves user trust or accelerates adversarial tuning.
  • Local inference economics hinge on KV-cache math and interconnects: MLA vs MHA memory behavior and cheap RDMA-style clustering on commodity Apple silicon suggest a viable “edge cluster” path.
  • Reasoning gains increasingly reflect structured exploration (branching, internal debate) rather than brute longer-chain tokens; methods that encode multiple hypotheses per step may deliver better accuracy-per-token and cheaper training via smarter distillation and sampling.