TL;DR: OpenAI GPT Image‑1.5 tops leaderboards but flunks community “vibe checks”; Xiaomi’s MiMo‑V2‑Flash, NVIDIA’s Nemotron‑Cascade, FLUX.2 Max, and new factuality/science evals
Major Highlights:
- OpenAI GPT Image‑1.5 claims #1 across Arenas, but users prefer Nano Banana Pro
  - OpenAI’s new image model improves instruction following, precise editing, text/markdown rendering, and face/logo fidelity, and fixes known GPT‑Image‑1 bugs. It posts #1 scores on major image leaderboards: LM Arena 1277, Design Arena 1344, AA Arena 1272.
  - Despite the benchmark wins, “vibe checks” across Twitter/Reddit/Discord skew negative versus Gemini’s Nano Banana Pro, especially on subjective quality and “visual IQ” tasks (math/maze reasoning). Progress over GPT‑Image‑1 is real, but the gap has renewed skepticism that Arena rankings reflect real user preferences.
- Xiaomi MiMo‑V2‑Flash (309B MoE; 15B active) pushes fast, long‑context, agentic LLMs
  - Specs: 150 tokens/s, 256K context, top open‑source scores on SWE‑Bench (Verified 73.4%, Multilingual 71.7%). Uses Hybrid Sliding Window Attention (sparse local windows plus a small set of global layers) and a 3‑layer MTP head for speculative decoding.
  - Engineering notes: a 128‑token sliding window beats a 512‑token one after post‑training; attention sinks matter; MTP reaches an average accept length above 3 tokens, yielding a ~2.5× coding speedup; multi‑teacher on‑policy distillation (MOPD) matches the teacher with under 1/50th of the SFT+RL compute. Claims parity with DeepSeek‑V3.2 at lower latency. Day‑0 support in LMSYS/SGLang; free limited access via OpenRouter.
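The hybrid attention scheme above can be sketched as a toy mask builder: most layers attend only within a local sliding window (plus a few always‑visible “attention sink” tokens), while a small set of global layers keeps full causal attention. The window size, global‑layer indices, and sink count below are illustrative assumptions, not MiMo’s actual configuration:

```python
import numpy as np

def attention_mask(seq_len, layer_idx, window=128, global_layers=(0, 12, 24), n_sinks=1):
    """Boolean mask (True = query i may attend to key j).

    Sliding-window layers see only the last `window` tokens plus the sink
    tokens; layers listed in `global_layers` keep full causal attention.
    """
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    causal = j <= i                   # never attend to the future
    if layer_idx in global_layers:
        return causal                 # full causal attention
    local = (i - j) < window          # within the sliding window
    sink = j < n_sinks                # always-visible sink tokens
    return causal & (local | sink)

m = attention_mask(seq_len=512, layer_idx=5, window=128)
# Query 400 sees the sink token and its last 128 keys, nothing in between.
```

Because only the few global layers carry full‑length KV caches, long‑context memory and compute are dominated by the cheap windowed layers, which is what makes 256K context and high throughput compatible.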
- FLUX.2 Max raises open image quality
  - Adds web grounding and up to 10 reference images for consistent editing. Ranks #2–3 on T2I and editing leaderboards. Pricing: ~$70/1k T2I images, ~$140/1k edits. Hosted by fal/arena.
- NVIDIA’s Nemotron‑Cascade and broader Nemotron 3 availability
  - “Cascade RL” (8B/14B) runs RL sequentially, one domain at a time; the 14B beats DeepSeek‑R1‑0528 (671B) on LiveCodeBench v5/v6/Pro and hits 43.1% pass@1 on SWE‑Bench Verified (53.8% with test‑time scaling). The recipe emphasizes an RLHF stage before the reasoning‑focused RL.
  - Nemotron 3 Nano ships to Ollama and MLX/LM Studio, aligning open models with NVIDIA‑optimized training/inference stacks.
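The domain‑wise sequential structure of “Cascade RL” can be sketched schematically. The stage names, reward functions, and the toy `rl_step` below are hypothetical stand‑ins to show the sequencing (RLHF first, then per‑domain RL carrying the policy forward), not NVIDIA’s actual pipeline or API:

```python
def rl_step(policy, batch, reward_fn):
    """Toy stand-in for a policy-gradient update: here the 'policy' is just
    a list recording which reward signal shaped it, in order."""
    return policy + [reward_fn(batch)]

def cascade_rl(policy, stages):
    # Run RL one domain at a time; each stage starts from the previous
    # stage's policy rather than training all domains jointly.
    for name, batches, reward_fn in stages:
        for batch in batches:
            policy = rl_step(policy, batch, reward_fn)
    return policy

stages = [
    ("rlhf", [["p1"]], lambda b: "human-pref"),   # alignment pre-step
    ("math", [["p2"]], lambda b: "verifier"),
    ("code", [["p3"]], lambda b: "unit-tests"),
    ("swe",  [["p4"]], lambda b: "agent-env"),
]
print(cascade_rl([], stages))  # order reflects the cascade
```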
Key Technical Details:
- GPT Image‑1.5: New “Images” surface in ChatGPT; up to 4× faster generation; some regressions openly documented. Pricing estimates: ~$133/1k 1MP images (high quality), ~$9/1k (low quality).
- MiMo‑V2‑Flash: 309B MoE, 15B active; Hybrid SWA, MTP; MOPD post‑training; 256K context, 150 tok/s; top SWE‑Bench scores; free trial via OpenRouter.
- FLUX.2 Max: #2–3 on leaderboards; $70/$140 per 1k (T2I/edits); up to 10 refs.
- Nemotron‑Cascade 14B: LiveCodeBench SOTA vs much larger R1; SWE‑Bench Verified 43.1% (53.8% with TTS).
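The MiMo MTP figures above (accept length above 3, ~2.5× speedup) are consistent with a standard speculative‑decoding cost model; the draft‑cost ratio below is an assumed figure for illustration, not a published number:

```python
def speculative_speedup(accept_len, draft_tokens=3, draft_cost=0.05):
    """Rough speedup vs. plain autoregressive decoding.

    Baseline: one target-model forward per token. Speculative: one target
    verification pass plus `draft_tokens` cheap draft forwards (each costing
    `draft_cost` of a target forward) yields `accept_len` tokens on average.
    """
    cost_per_token = (1 + draft_tokens * draft_cost) / accept_len
    return 1 / cost_per_token

# A 3-layer MTP head accepting ~3 tokens per pass with a cheap draft
# lands in the ~2.5x range claimed:
print(round(speculative_speedup(accept_len=3.0), 2))  # → 2.61
```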
Community Response/Impact:
- GPT Image‑1.5: Broadly negative sentiment vs Nano Banana Pro despite #1 Arena scores, prompting skepticism that Arena rankings map to real user preferences. Launch timing amid Google’s Gemini “Code Red” narrative further dampens reception; the model will likely still see heavy usage via ChatGPT integration.
- Open‑source momentum: MiMo‑V2‑Flash energizes agentic/fast‑serve workflows. NVIDIA’s “hardware‑defined AI” strategy deepens, pushing open models optimized for its stack.
First Principles Analysis:
- Benchmark vs vibe gap: Arena leaderboards measure pairwise preferences under curated prompts; users weigh instruction adherence, text fidelity, edge‑case robustness, and “visual IQ”—areas where Nano Banana Pro may still lead. Cost/latency also shape perceived value.
- Efficient scaling trends: MiMo shows MoE + Hybrid SWA + MTP can deliver long context and high throughput without teacher‑level compute. Nemotron‑Cascade suggests sequencing RL by domain and doing RLHF early can lift reasoning at smaller scales.
- Factuality/science evals (noted elsewhere in the issue): FACTS shows Gemini 3 Pro leading overall (68.8%), with Claude safer but conservative and GPT models broader but riskier; OpenAI’s FrontierScience ties benchmark gains to real wet‑lab outcomes (e.g., a 79× gain in cloning efficiency), nudging evals toward real‑world utility.
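Arena scores like the 1277/1344/1272 figures cited earlier are Elo‑style ratings fitted from exactly these pairwise votes, which is why they can diverge from broader “vibes”. A minimal online Elo sketch (the K‑factor is an assumption; production arenas typically fit a Bradley‑Terry model over all votes rather than updating sequentially):

```python
def expected(ra, rb):
    """Elo / Bradley-Terry win probability of model A over model B."""
    return 1 / (1 + 10 ** ((rb - ra) / 400))

def elo_update(ra, rb, a_wins, k=32):
    """One pairwise vote: move both ratings toward the observed outcome."""
    e = expected(ra, rb)
    s = 1.0 if a_wins else 0.0
    return ra + k * (s - e), rb - k * (s - e)

ra, rb = 1000.0, 1000.0
for a_wins in [True, True, False, True]:   # four pairwise votes
    ra, rb = elo_update(ra, rb, a_wins)
# ra > rb after a 3-1 vote split; updates are zero-sum.
```

The key caveat from the section above: the fitted score only reflects the prompt distribution voters see, so a model can hold a top rating while losing on out‑of‑distribution tasks like maze or math reasoning.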