
Feb 16 — Qwen3.5-397B-A17B: the smallest Open-Opus-class, very efficient model

news.smol.ai • 9 days ago

TL;DR: Qwen3.5-397B-A17B — open-weight frontier MoE with tractable long context

Major Highlights:

  • Alibaba ships Qwen3.5-397B-A17B (open-weight, Apache-2.0): First open model in the Qwen3.5 line, natively multimodal with “thinking/non-thinking” modes, trained across 201 languages. It combines hybrid linear attention with sparse MoE and reports gains over Qwen3-Max and Qwen3-VL, especially in vision and spatial reasoning. It’s a heavyweight refresh in China’s open ecosystem, landing in the ~400B “Kimi-class” with a different sparsity choice (~4.3% of parameters active per token, i.e. 17B of 397B, vs Kimi’s ~3.25%).
  • Hosted twin with longer context: Qwen3.5-Plus is the API-hosted version built on the same base, extending context from the model-native 256K to 1M tokens and bundling search and code-interpreter integrations.
  • Day-0 infra support, unexpectedly runnable: vLLM added immediate support and showcased throughput/latency advantages; Ollama made it available on their cloud. Despite ~800GB BF16 weights, users reported local runs via MLX and 4-bit quantization on high-RAM Apple Silicon (≈225GB+), with Unsloth providing 4-bit guidance for 256GB setups.
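The sparsity comparison above is just the ratio of active to total parameters. A quick sanity check, using only the counts quoted in the summary (the Kimi figure is taken as reported, not recomputed):

```python
# Back-of-envelope activation sparsity from the summary's parameter counts.
total_params = 397e9   # Qwen3.5-397B-A17B total parameters
active_params = 17e9   # parameters activated per token via sparse MoE

sparsity = active_params / total_params
print(f"Qwen3.5 active fraction: {sparsity:.2%}")   # ~4.3%, as quoted

# Per-token compute scales with active (not total) parameters, so per-token
# FLOPs are roughly those of a ~17B dense model despite 397B total weights.
print(f"Dense-equivalent compute: ~{active_params / 1e9:.0f}B params/token")
```

This is why the model can sit in the ~400B weight class while serving at a fraction of dense-400B cost: routing picks a small expert subset per token.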

Key Technical Details:

  • Architecture: 397B total parameters with 17B “active” via sparse MoE; Gated Delta Networks (“GatedDeltaNet”) plus hybrid linear attention to curb memory/computation at long context; native multimodality and spatial intelligence features.
  • Context/KV efficiency: Community estimates put the KV cache at ~31KB/token, i.e. ~8.05GB at ~262K tokens in BF16 (≈4GB in FP8), enabled by a small number of KV heads and many gated-delta layers; this is key to viable long-context inference.
  • Context and licensing: 256K model-native context; 1M via hosted Qwen3.5-Plus; Apache-2.0 license.
  • Benchmarks: Not claiming across-the-board SOTA (notably in coding), but shows solid deltas vs Qwen3-Max and vision-led gains; some of the improvement is suspected to come from agentic RL; mixed results on task-specific SVG/Vending-Bench-style tests.
  • Pricing: API pricing drew criticism as “high/weird” relative to the efficiency narrative and to competitors (Kimi/GLM), raising questions about true serving costs.
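The KV-cache figures above are simple arithmetic on the community's per-token estimate; a sketch of that calculation (all inputs are the summary's approximate numbers, not measured values):

```python
# Rough KV-cache sizing from the community figures quoted above:
# ~31KB/token in BF16, context up to the model-native 256K tokens.
kv_bytes_per_token_bf16 = 31_000          # ~31KB/token (BF16)
context_tokens = 262_144                  # ~262K-token context

kv_bf16_gb = kv_bytes_per_token_bf16 * context_tokens / 1e9
kv_fp8_gb = kv_bf16_gb / 2                # FP8 halves the per-element size

print(f"BF16 KV cache @262K tokens: ~{kv_bf16_gb:.1f} GB")
print(f"FP8  KV cache @262K tokens: ~{kv_fp8_gb:.1f} GB")
```

For comparison, a conventional dense-attention model of this scale would typically need hundreds of KB per token of KV cache; the few-KV-heads-plus-gated-delta design is what keeps a full 262K context to single-digit gigabytes.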

Community Response/Impact:

  • Adoption momentum: Open weights plus vLLM/Ollama/MLX support lowered friction; quantization pathways broaden local experimentation. Positive reception on efficiency and long-context practicality.
  • Skepticism and eval debates: Calls for clearer “reasoning efficiency” evidence beyond headline scores; critiques of black-box evals and cherry-picked demos.
  • Agent ecosystem shifts: The OpenClaw saga (solo dev leverage; Peter Steinberger joining OpenAI) reignited debates on OSS friendliness (notably Anthropic’s) and emphasized the “harness” (tooling, context management, skills, eval/observability) as the true agent moat; lightweight alternatives (PicoClaw, nanobot) gained attention; agent observability (LangSmith/LangChain) is trending toward table stakes. Separately, Anthropic announced a Bengaluru expansion, noting India as Claude.ai’s #2 market.

First Principles Analysis:

  • Why this matters: Qwen3.5 pairs sparse MoE (activate a small expert subset) with linear attention and GatedDeltaNet to compress KV footprint and stabilize long-context compute. That makes 256K+ contexts practical at frontier scale, shifting the constraint from algorithmic complexity to bandwidth and memory locality. Open weights under Apache-2.0 catalyze rapid infra integration and community evaluation, while “agentic RL” hints at performance gains derived from richer, harness-shaped interaction data. The strategic question is no longer just “who has the best model,” but “who controls the harness and serve-cost economics.”
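The mechanism behind that KV compression can be sketched as a linear-attention recurrence with a per-step decay gate and a delta-rule write. The single-head NumPy toy below is a heavily simplified illustration (the real GatedDeltaNet layer’s gating parameterization, chunked computation, and normalization differ); its point is that the recurrent state is a fixed-size matrix, independent of context length, unlike a softmax-attention KV cache that grows per token:

```python
import numpy as np

def gated_delta_step(S, k, v, q, alpha, beta):
    """One step of a simplified gated delta rule (illustrative, not Qwen's exact layer).

    S: (d_k, d_v) state matrix -- fixed size, independent of sequence length.
    k, q: (d_k,) key/query; v: (d_v,) value.
    alpha: scalar decay gate in (0, 1); beta: scalar write strength.
    """
    # Decay the old state, erase the stale association along k, write k -> v.
    S = alpha * (S - beta * np.outer(k, k @ S)) + beta * np.outer(k, v)
    y = q @ S                     # read-out for this token: (d_v,)
    return S, y

d_k, d_v, T = 8, 8, 1000
rng = np.random.default_rng(0)
S = np.zeros((d_k, d_v))
for _ in range(T):
    k = rng.normal(size=d_k)
    k /= np.linalg.norm(k)        # unit-norm key keeps the update contractive
    v, q = rng.normal(size=d_v), rng.normal(size=d_k)
    S, y = gated_delta_step(S, k, v, q, alpha=0.95, beta=0.5)

# After 1000 tokens the state is still just (d_k, d_v) -- no growing KV cache.
print(S.shape)
```

Softmax attention pays O(T) memory and O(T) per-token compute at decode time; a recurrence like this pays O(1) of each, which is exactly the bandwidth/memory-locality shift the analysis describes.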