TL;DR: Qwen3.5-397B-A17B — Alibaba’s open-weight frontier MoE with efficient long‑context multimodality
Major Highlights:
- Open-weight frontier model with native multimodality: Alibaba released Qwen3.5-397B-A17B, the first open-weight model in the Qwen3.5 line. It is natively multimodal (vision + text) with “spatial intelligence,” supports 201 languages, and ships under Apache-2.0. It includes “thinking” and “non-thinking” modes and is positioned as a generational step-up over Qwen3-Max/VL (but not claiming universal SOTA, especially on coding).
- Efficient long-context via hybrid attention + sparse MoE: The architecture combines hybrid linear attention, sparse Mixture-of-Experts, and Gated Delta Networks (“GatedDeltaNet”). Community analyses highlight unusually low KV-cache costs for the context length, making long-context inference more tractable than typical LLMs of this size.
- Hosted twin with 1M context and tools: Qwen3.5-Plus (API) is the served version of the same base model, offering 1M-token context (vs 256K native) and integrations like search and a code interpreter.
- Fast ecosystem uptake despite huge weights: vLLM shipped day-0 support; local quantized runs appeared quickly on Apple Silicon via MLX/Q4; Unsloth published 4-bit guidance; Ollama onboarded it to their cloud. Despite ~800GB BF16 weights, reports show practical routes to experimentation via quantization and cloud hosting.
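The long-context efficiency claim above rests on gated-delta layers keeping a constant-size state instead of a growing KV cache. Below is a minimal single-head sketch of the gated delta rule as described in the Gated DeltaNet literature; the dimensions, gate values, and parameterization are illustrative assumptions, not Qwen3.5’s actual implementation:

```python
import numpy as np

def gated_delta_step(S, q, k, v, alpha, beta):
    """One step of a (simplified) gated delta-rule recurrence.

    S     : (d_v, d_k) fast-weight state matrix
    q, k  : (d_k,) query/key vectors (k assumed L2-normalized)
    v     : (d_v,) value vector
    alpha : scalar decay gate in (0, 1)
    beta  : scalar write-strength gate in (0, 1)
    """
    # Decay the old state, erase the component along k, then write the new value:
    # S_t = alpha * (S_{t-1} - beta * (S_{t-1} k) k^T) + beta * v k^T
    S = alpha * (S - beta * np.outer(S @ k, k)) + beta * np.outer(v, k)
    o = S @ q  # read-out: the output is a query against the state
    return S, o

# The state is a fixed (d_v, d_k) matrix no matter how long the sequence is,
# which is why per-token memory stays flat for gated-delta layers.
d_k, d_v, T = 8, 8, 32
rng = np.random.default_rng(0)
S = np.zeros((d_v, d_k))
for _ in range(T):
    k = rng.normal(size=d_k)
    k /= np.linalg.norm(k)
    S, o = gated_delta_step(S, rng.normal(size=d_k), k,
                            rng.normal(size=d_v), alpha=0.95, beta=0.5)
print(S.shape)  # constant-size state after 32 tokens
```

Contrast this with softmax attention, where the KV cache grows linearly with sequence length; interleaving such layers with a few full-attention layers is what drives the low per-token cache figures reported below.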
Key Technical Details:
- Parameters: 397B total, 17B active per token (A17B sparse MoE).
- Context: 256K native; API variant offers 1M context with tools.
- License: Apache-2.0; Languages: 201.
- KV-cache efficiency (community estimates): ~31KB/token; ~8.05GB of KV cache at the full 262,144-token (256K) context in BF16 and ~4GB in FP8, attributed to few KV heads plus many gated-delta layers.
- Inference footprint: ~800GB BF16 weights; local demos reported on Apple Silicon with ~225GB RAM via Q4 quantization; Unsloth guidance targets 4-bit on 256GB machines.
- Infra support: Day-0 vLLM recipe; rapid availability on Ollama Cloud; MLX demonstrations.
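The memory figures above can be sanity-checked with back-of-envelope arithmetic. This sketch only replays the community-reported numbers; small gaps versus the quoted 8.05GB come from rounding in the ~31KB/token estimate (and GB-vs-GiB conventions), and the ~225GB observed for Q4 local runs includes KV cache, activations, and quantization overhead on top of raw weights:

```python
# Back-of-envelope check of the reported memory estimates (approximations, not measurements).
KB, GB = 1024, 10**9

# KV cache: ~31 KB/token at the full 262,144-token (256K) context in BF16.
kv_per_token = 31 * KB                       # bytes/token (community estimate)
kv_total = kv_per_token * 262_144            # bytes for a full-context KV cache
print(f"KV @ 256K ctx: {kv_total / GB:.2f} GB (BF16), "
      f"{kv_total / 2 / GB:.2f} GB (FP8)")   # FP8 halves the cache

# Weights: 397B params at 2 bytes each (BF16) vs ~0.5 bytes each (4-bit).
params = 397e9
print(f"weights: {params * 2 / GB:.0f} GB BF16, "
      f"~{params * 0.5 / GB:.0f} GB at 4-bit (before runtime overhead)")
```

The BF16 weight total lands at ~794GB, matching the “~800GB” figure, and the 4-bit floor of ~200GB explains why 256GB machines are the practical target for local quantized runs.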
Community Response/Impact:
- Broad praise for practical efficiency and vision quality; seen as a solid refresh from China’s most prolific open-weight lab and likely the last major open release before DeepSeek v4.
- Benchmarks show improvements over Qwen3-Max/VL; in some harnesses it even surpasses “Qwen3-Max-thinking.” Critics call for clearer “reasoning-efficiency” evidence, note failures on certain SVG and “Vending-Bench” tasks, and voice persistent concerns about black-box evals.
- Pricing sparked debate: users question API pricing relative to the model’s claimed efficiency and to peers like Kimi/GLM, leaving uncertainty about real serve-cost advantages.
- Agent-stack discourse intensified: OpenClaw’s trajectory (and its creator’s move to OpenAI) reignited debates over open-source posture (notably around Anthropic) and reinforced the thesis that the “harness” (tools, context control, lifecycle, eval/observability) is a growing moat for agent systems.
First Principles Analysis:
- Sparse MoE (397B total, 17B active) concentrates capacity when needed while keeping per-token compute manageable. Pairing this with linear attention and GatedDeltaNet reduces KV-head count and KV growth, shrinking memory per token and enabling tractable 256K–1M contexts.
- Open weights under Apache-2.0 catalyze infra support (vLLM/MLX/Ollama) and community tuning, accelerating downstream adoption. If “agentic RL” improvements generalize, the model could narrow the gap with top closed models in agent tasks—provided pricing and eval transparency align with the efficiency story.
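The capacity-versus-compute point in the first bullet can be made concrete with a toy top-k router. Expert count, top-k, and dimensions here are illustrative stand-ins, not Qwen3.5’s actual configuration; the point is only that per-token compute scales with the selected experts, mirroring the 17B-active / 397B-total split (~4.3% of parameters active):

```python
import numpy as np

def moe_forward(x, W_gate, experts, top_k=2):
    """Toy sparse-MoE layer: route one token to its top_k experts only.

    x       : (d,) one token's hidden state
    W_gate  : (n_experts, d) router weights
    experts : list of per-expert weight matrices, each (d, d)
    """
    logits = W_gate @ x
    top = np.argsort(logits)[-top_k:]            # indices of selected experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                         # softmax over selected experts
    # Only top_k expert matmuls actually run; all other experts cost nothing.
    return sum(g * (experts[i] @ x) for g, i in zip(gates, top))

rng = np.random.default_rng(0)
d, n_experts, top_k = 16, 32, 2
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
y = moe_forward(rng.normal(size=d), rng.normal(size=(n_experts, d)), experts, top_k)

# Most parameters sit idle for any given token, which is the whole point.
print(f"experts used per token: {top_k}/{n_experts}")
```

Total capacity grows with the expert count while per-token FLOPs stay pinned to top_k, which is the mechanism behind “397B total, 17B active.”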