Jan 08 — not much happened today

news.smol.ai • about 2 months ago

TL;DR: Jan 08 — Quiet day on the surface, meaningful infra moves underneath

Major Highlights:

  • GLM-4.7 leads open-weight benchmarks; Z.ai goes public
    • Artificial Analysis Intelligence Index v4.0 ranks GLM-4.7 as the strongest open-weights model evaluated, with notable gains in coding, agents, and scientific reasoning. Zhipu AI’s parent Z.ai listed on HKEX, signaling capital-market confidence in the open-weight MoE track.
  • Qwen ships multimodal retrieval stack (text+image+video)
    • Alibaba released Qwen3-VL-Embedding and Qwen3-VL-Reranker, a two-stage retrieval/RAG system spanning 30+ languages and mixed media. Early benchmarks suggest SOTA across MMEB-V2 and MMTEB; vLLM added nightly support, pointing to fast ecosystem uptake.
  • Inference and kernels: vLLM throughput, KV offloading, AI-written ops
    • vLLM reported 16k tokens/sec on NVIDIA B200 and integrated IBM Research’s KV Offloading Connector, materially improving throughput and TTFT under preemption. An “AI-generated” fused RMSNorm kernel (“Oink”) shows promising micro-kernel gains, while Keras Pallas pushes Python-authored custom ops.
  • Product and policy watch: Gmail’s “Gemini era”; LLM memorization claims
  • Google announced Gemini-powered Gmail features (AI Overviews, AI Inbox, natural-language search, writing help) with user-controlled toggles. Separately, a Stanford preprint alleges copyrighted-text extraction from frontier models; notably, Claude 3.7 Sonnet reportedly reproduced 95.8% of the first Harry Potter book in their setup (vs. far lower rates for GPT-4.1), reigniting copyright and safety debates.
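The embed-then-rerank pattern behind the Qwen release can be sketched generically. This is a toy illustration of the two-stage design (cheap vector search, then an expensive reranker over a shortlist), not the actual Qwen3-VL API; `embed` and `rerank_score` are hypothetical stand-ins.

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Hypothetical embedder: deterministic-per-text random unit vector."""
    local = np.random.default_rng(abs(hash(text)) % (2**32))
    v = local.standard_normal(dim)
    return v / np.linalg.norm(v)

def rerank_score(query: str, doc: str) -> float:
    """Hypothetical cross-encoder stand-in: simple token-overlap score."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def retrieve(query: str, corpus: list[str], k_fast: int = 3, k_final: int = 1):
    # Stage 1: cheap similarity search over precomputed embeddings.
    doc_vecs = np.stack([embed(d) for d in corpus])
    sims = doc_vecs @ embed(query)
    candidates = [corpus[i] for i in np.argsort(-sims)[:k_fast]]
    # Stage 2: run the expensive reranker only on the shortlist.
    candidates.sort(key=lambda d: rerank_score(query, d), reverse=True)
    return candidates[:k_final]

corpus = [
    "GLM-4.7 tops open-weight benchmarks",
    "vLLM adds KV offloading connector",
    "Gmail gains Gemini-powered search",
]
print(retrieve("kv offloading in vllm", corpus))
# → ['vLLM adds KV offloading connector']
```

The design point: stage 1 keeps per-document cost near zero at query time, while stage 2 spends compute only on the few candidates where ranking quality matters.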

Key Technical Details:

  • GLM-4.7: 355B MoE (32B active), 200K context, text-only I/O, MIT license; Reasoning score 42 (vs. 32 for GLM-4.6); GDPval-AA ELO 1193; ~710GB BF16 weights (won’t fit on a single 8×H100 ~640GB node).
  • Qwen3-VL-Embedding/Reranker: multimodal (text/images/screenshots/video), 30+ languages, configurable embedding dims, instruction tuning, quantization; MMEB-V2 77.9%, MMTEB 67.88%; shipped via HF/GitHub/ModelScope; Alibaba Cloud API “coming soon.”
  • vLLM + IBM KV Offloading: up to 9× throughput improvement on H100; 2×–22× TTFT reductions for cache hits; async DMA with contiguous physical blocks; CLI flags include --kv_offloading_backend native.
  • Kernel advances: AI-generated fused RMSNorm kernel ~40% kernel speedup, ~1.6% end-to-end; Keras Pallas lowers Python kernels to Mosaic (TPUs) / Triton (GPUs).
  • Additional models: Falcon-H1R-7B (hybrid Transformer–Mamba) scores 16 on AA v4.0 among <12B params; AI21’s Jamba2 (hybrid SSM-Transformer, KV-cache efficiency) under Apache 2.0 via AI21 SaaS and Hugging Face.
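The ~710GB figure for GLM-4.7 follows directly from BF16's 2 bytes per parameter; a quick sanity check:

```python
# BF16 stores 2 bytes per parameter, so 355B params need ~710 GB,
# exceeding the ~640 GB of aggregate HBM on an 8x H100-80GB node.
params = 355e9
bf16_bytes = 2
weights_gb = params * bf16_bytes / 1e9   # decimal GB, matching the text
node_gb = 8 * 80                         # 8 GPUs x 80 GB each

print(f"weights: {weights_gb:.0f} GB, node: {node_gb} GB")
# weights: 710 GB, node: 640 GB
assert weights_gb > node_gb  # won't fit without quantization or offload
```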

Community Response/Impact:

  • Rapid integration: vLLM nightly support for Qwen’s multimodal embeddings suggests growing demand for multimodal RAG “by default.”
  • Ecosystem churn meme: Ongoing fragmentation in kernel DSLs/backends (Triton, Mojo, TileIR, Pallas, etc.) reflects real migration costs.
  • OSS funding optics: Google AI Studio’s TailwindCSS sponsorship lands amid broader open-source sustainability debates.
  • Legal/safety discourse: The Stanford memorization claims revitalize copyright, safety, and evaluation rigor discussions.

First Principles Analysis:

  • Open-weight MoE models like GLM-4.7 show that selective activation (32B active) can deliver high reasoning performance without serving all parameters—if the software stack (KV management, scheduling) keeps up.
  • Multimodal retrieval is becoming foundational: most enterprise knowledge lives in PDFs, screenshots, and videos, so embedding and reranking across modalities is essential for accurate RAG.
  • Inference wins are compounding: KV offloading, kernel fusion, and Python-first kernel authoring shrink latency/cost and broaden hardware targets, enabling larger contexts and more concurrent agents without linear cost blowups.
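The selective-activation point can be made concrete with a toy top-k MoE layer: only k of E experts run per token, so active compute scales with k rather than E. The dimensions here are illustrative, not GLM-4.7's actual architecture.

```python
import numpy as np

def moe_layer(x, experts_w, router_w, k=2):
    """Toy mixture-of-experts layer.
    x: (d,) token; experts_w: (E, d, d) expert weights; router_w: (d, E)."""
    logits = x @ router_w                  # router score per expert
    top = np.argsort(-logits)[:k]          # select top-k experts
    gates = np.exp(logits[top])
    gates /= gates.sum()                   # softmax over selected experts only
    # Only the k selected experts' weight matrices are ever touched.
    return sum(g * (x @ experts_w[e]) for g, e in zip(gates, top))

rng = np.random.default_rng(0)
d, E, k = 8, 16, 2
experts_w = rng.standard_normal((E, d, d))
router_w = rng.standard_normal((d, E))
x = rng.standard_normal(d)
y = moe_layer(x, experts_w, router_w, k)

total_params = E * d * d
active_params = k * d * d
print(f"active fraction: {active_params / total_params:.3f}")
# active fraction: 0.125
```

GLM-4.7's 32B-active-of-355B ratio (~9%) is the same idea at scale: FLOPs per token track the active slice, while the full parameter set still has to live somewhere, which is why KV management and scheduling in the serving stack determine whether the savings materialize.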