
Jan 29 xAI Grok Imagine API - the #1 Video Model, Best Pricing and Latency - and merging with SpaceX

Source: news.smol.ai

TL;DR: xAI’s Grok Imagine API debuts as a top-ranked video model, with aggressive pricing, strong latency claims, and a reported tie-up with SpaceX.

Major Highlights:

  • xAI Grok Imagine launches SOTA video+audio generation in API
    • Multiple leaderboards place Grok Imagine at or near #1 for video and image-to-video; Artificial Analysis ranks it top and Arena lists it among best-in-class. The model natively generates synchronized audio and supports 15s clips.
    • Pricing undercuts rivals: $4.20/min including audio, positioning xAI against Google Veo/Sora-class offerings on both cost and quality. Early testers cite strong controllability and editing coverage (text-to-video, image-to-video, video editing).
  • Strategic positioning: xAI + SpaceX alignment and IPO race
    • The report frames a SpaceX–xAI consolidation/merger narrative, speculating a combined valuation of roughly $1.1T following xAI’s recent $20B Series E, and casts OpenAI (~$800B), Anthropic (~$350B), and SpaceX+xAI in a “race to IPO” by year-end. These details remain publicly unconfirmed; treat them as market chatter and positioning rather than finalized corporate action.
  • Google fires back with Genie 3 “world model” prototype
    • Genie 3 rolls out to Google AI Ultra subscribers (U.S., 18+), offering text/image-to-interactive worlds, remixing, and a gallery. Constraints: ~60s generation windows, control latency, and imperfect physics. Signals a near-term push to interactive, controllable simulation rather than pure video “dreaming.”
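
At the quoted $4.20/min (audio included), per-clip cost is simple arithmetic; a minimal sketch, using the 15-second clip length cited above:

```python
# Estimate Grok Imagine generation cost from the quoted rate of
# $4.20 per minute of output video (audio included).
PRICE_PER_MINUTE_USD = 4.20

def clip_cost(seconds: float) -> float:
    """Cost in USD for a clip of the given length."""
    return PRICE_PER_MINUTE_USD * seconds / 60.0

# A single 15-second clip (the maximum length cited above):
print(f"${clip_cost(15):.2f} per 15s clip")   # $1.05
# One hour of raw output, for iteration-heavy workflows:
print(f"${clip_cost(3600):.2f} per hour")     # $252.00
```

Numbers like these are what make the "pricing undercuts rivals" framing concrete: roughly a dollar per maximum-length clip.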

Key Technical Details:

  • Grok Imagine
    • Modalities: Text-to-video, image-to-video, video editing; native audio synthesis
    • Clip length: ~15 seconds; priced at $4.20/min including audio
    • Distribution: API available now; fal is a day-0 platform partner exposing endpoints for text-to-image, image editing, text-to-video, image-to-video, and video editing
  • Runway Gen-4.5
    • New controls: Motion Sketch (annotate camera/motion on first frame), Character Swap; pitched increasingly as an “animation engine” with packaged controllability primitives
  • World models
    • Google Genie 3: Prototype; limited availability; interactive worlds with remixing; noted latency/control limits
    • LingBot-World (open-source): Claims <1s latency at 16 FPS, minute-level coherence, better object permanence/landmark persistence; built on Wan 2.2
  • Open stack momentum
    • Kimi K2.5: Promoted as #1 open model on multiple evals (VoxelBench, Vision Arena); Kimi Code switches to token-based billing with temporary 3× quota
    • Alibaba Qwen3-ASR: Apache-2.0 ASR + forced aligner, 52 languages/dialects, up to 20-minute chunks, timestamps, native streaming; vLLM day-0 support and high-throughput claims
    • Arcee Trinity Large: 400B MoE with ~13B active parameters; router/load-balancing and stability tricks for high throughput
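
The Trinity Large figures (400B total, ~13B active) illustrate the sparse-MoE trade-off: per-token compute scales with routed (active) parameters, not the full weight count. A toy sketch of that accounting and of top-k expert routing; the expert count, scores, and `top_k` here are hypothetical, not Arcee's published configuration:

```python
# Sparse-MoE accounting, in the spirit of Trinity Large's
# "400B total / ~13B active" figures from the summary above.
TOTAL_PARAMS = 400e9
ACTIVE_PARAMS = 13e9

# Only the routed experts run per token, so compute tracks the
# active (not total) parameter count:
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"{active_fraction:.2%} of weights active per token")  # 3.25%

# A toy top-k router: pick the k highest-scoring experts per token.
# Real routers also apply load-balancing losses, as the bullet notes.
def top_k_experts(scores: list[float], k: int) -> list[int]:
    """Indices of the k experts with the highest router scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

router_scores = [0.1, 0.7, 0.05, 0.9, 0.3, 0.2, 0.6, 0.15]  # 8 experts (toy)
print(top_k_experts(router_scores, k=2))  # [3, 1]
```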
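For Qwen3-ASR's stated 20-minute chunk limit, long recordings would need to be windowed before transcription, with per-chunk timestamps offset back into recording time. A hypothetical pre-processing sketch; the function and its signature are ours for illustration, not part of Qwen's API:

```python
# Split a long recording into windows no longer than the 20-minute
# chunk limit cited for Qwen3-ASR. Bounds are in seconds; the start
# offset of each window can be added to per-chunk timestamps to map
# them back onto the full recording.
MAX_CHUNK_S = 20 * 60  # 20 minutes

def chunk_bounds(total_s: float, max_chunk_s: float = MAX_CHUNK_S):
    """Yield (start, end) windows covering [0, total_s]."""
    start = 0.0
    while start < total_s:
        end = min(start + max_chunk_s, total_s)
        yield (start, end)
        start = end

# A 45-minute recording splits into 20 + 20 + 5 minute windows:
print(list(chunk_bounds(45 * 60.0)))
# [(0.0, 1200.0), (1200.0, 2400.0), (2400.0, 2700.0)]
```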

Community Response/Impact:

  • Video model shakeout: Commentators suggest smaller video labs just faced a “bitter lesson” as xAI’s pricing/quality compress margins and raise the bar on integrated audio and editing.
  • Control > base quality trend: Runway’s features and Genie’s interactivity reinforce that controllability, latency, and workflow packaging are now primary differentiators.
  • Open-source parity push: LingBot-World and Qwen3-ASR highlight rapid open advancements on latency, coherence, and full-stack deployability, pressuring proprietary roadmaps.

First Principles Analysis:

  • Why this matters: Grok Imagine’s combination of cost, latency, and audio-native video shifts the competitive frontier from “can you generate great clips?” to “can you deliver controllable, end-to-end creative tooling at platform scale?” If the SpaceX+xAI narrative holds, access to bandwidth, deployment venues, and capital could give xAI a distribution and infrastructure edge. Meanwhile, Google’s Genie signals the next battleground—interactive, persistent, controllable worlds—where open projects are racing to close gaps on coherence and real-time performance.