Jan 21: OpenEvidence, the 'ChatGPT for doctors,' raises $250M at $12B valuation, 12x from $1B last Feb

news.smol.ai • about 1 month ago

TL;DR: OpenEvidence raises $250M at $12B; Anthropic open-sources Claude “constitution”; agents hit production while benchmarks expose brittleness

Major Highlights:

  • OpenEvidence becomes a decacorn with $12B valuation
    • The “ChatGPT for doctors” raised $250M at a $12B valuation—12x from ~$1B last February. CEO Daniel Nadler told CNBC the product is used by 40% of U.S. physicians and exceeded $100M in annual revenue in 2025. That implies a steep ~120x revenue multiple, signaling investor conviction in clinical AI’s defensibility and expansion potential (evidence retrieval, billing/EHR workflows, and payer/provider integrations).
  • Anthropic publishes Claude “constitution” under CC0
    • Anthropic released the full, living “constitution” that guides Claude’s behavior, under CC0 1.0 for unrestricted reuse. The document is reportedly used directly during training, aiming to make alignment goals transparent and portable. The move invites scrutiny, replication, and potential standardization of safety/value frameworks across models.
  • Agents move from demos to dollars: Podium's "Jerry" surpasses $100M in ARR
    • Podium claims over $100M AI agent ARR with 10k+ agents deployed, reframing AI as an “operator” that runs workflows end-to-end (after-hours leads, missed calls). Board-level metrics cite burn reduced from $95M to $0 and AI ARR from $0 to $100M in ~21 months. Meanwhile, engineering threads converge on memory/reliability as the real bottleneck for long-horizon agents.
  • Agent benchmarks show autonomy is still brittle
    • New evaluations (APEX-Agents, legal-research “prinzbench”) report modest pass rates and highlight search/tool-use as the main failure modes—underscoring the need for better context layers, governance, and tool design.

Key Technical Details:

  • APEX-Agents (Google Workspace professional tasks, Pass@1): Gemini 3 Flash High 24.0%, GPT-5.2 High 23.0%, Claude Opus 4.5 High 18.4%.
  • Legal “prinzbench”: GPT-5.2 Thinking scores just over 50%; Gemini is close behind; Sonnet/Opus 4.5 scored 0/24 on Search tasks (search orchestration is the choke point).
  • Memory/reliability: Agent Cognitive Compressor (ACC) proposes a bounded “Compressed Cognitive State” vs naive transcript replay to reduce drift/hallucination. MCP-SIM multi-agent loop reportedly solves 12/12 scientific tasks vs one-shot GPT at 6/12.
  • Platform/UX: Prefect Horizon frames a managed “context layer” for MCP servers (registry, RBAC, audit logs). Phil Schmid recommends outcome-centric tools with typed, constrained args; positions Skills and MCP as complementary. LangChain ships Agent Builder GA, “agents as folders,” templates (Tavily, PagerDuty, Box), and subagent/skills-on-demand patterns.
  • Inference: AirLLM uses sequential layer loading (load → compute → free) and compression to serve large models on GPUs with minimal VRAM; supports CPU/GPU backends on Linux and macOS.
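
The ACC paper's internals aren't detailed in the summary above; a minimal sketch of the general idea it names, replacing unbounded transcript replay with a bounded rolling state, might look like the following (the class name and placeholder summarizer are illustrative assumptions, not the paper's design):

```python
from collections import deque

class BoundedAgentMemory:
    """Keep a fixed-size window of recent turns plus a rolling summary,
    instead of replaying the full transcript (the naive baseline).

    The compression step here is a stand-in (truncating evicted turns);
    a real system would call a model to summarize what gets evicted.
    """

    def __init__(self, max_recent=3):
        self.recent = deque(maxlen=max_recent)
        self.summary = []  # compressed trace of evicted turns

    def add_turn(self, turn):
        if len(self.recent) == self.recent.maxlen:
            evicted = self.recent[0]
            # Placeholder compression: keep only the first few words.
            self.summary.append(" ".join(evicted.split()[:4]) + " ...")
        self.recent.append(turn)  # deque drops the oldest automatically

    def context(self):
        """Bounded context handed to the model each step."""
        return {"summary": list(self.summary), "recent": list(self.recent)}

mem = BoundedAgentMemory(max_recent=2)
for t in ["user asked to book a flight to Oslo",
          "agent searched flights and found three options",
          "user picked option two and asked for a window seat"]:
    mem.add_turn(t)
ctx = mem.context()
print(len(ctx["recent"]))   # 2: only the newest turns are kept verbatim
```

The point is that context size stays constant regardless of episode length, which is what reduces drift relative to ever-growing transcript replay.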
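
Schmid's "outcome-centric tools with typed, constrained args" can be sketched generically; the tool name, fields, and return shape below are hypothetical, chosen only to show the pattern of validating at the schema boundary rather than inside the agent loop:

```python
from dataclasses import dataclass
from enum import Enum

class Priority(Enum):
    """Constrained argument: only these values are representable."""
    LOW = "low"
    HIGH = "high"

@dataclass(frozen=True)
class CreateTicketArgs:
    """Typed arguments for a hypothetical outcome-centric tool:
    'create a ticket' as one verb, not a pile of low-level CRUD calls."""
    title: str
    priority: Priority
    assignee: str

    def __post_init__(self):
        if not self.title.strip():
            raise ValueError("title must be non-empty")

def create_ticket(args: CreateTicketArgs) -> dict:
    # A real tool would hit an API; this echoes a structured outcome.
    return {"outcome": "ticket_created",
            "title": args.title,
            "priority": args.priority.value}

result = create_ticket(CreateTicketArgs("Fix login bug", Priority.HIGH, "sam"))
print(result["outcome"])
```

Because invalid priorities and empty titles fail before the tool runs, the model gets a crisp schema error to repair instead of a silent downstream failure, which is the reliability argument behind typed contracts.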
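
AirLLM's actual API isn't shown in the summary; the load → compute → free loop it describes can be illustrated with a toy stdlib-only version, where each "layer" is a small matrix checkpointed to disk (the helper names and toy model are assumptions for illustration):

```python
import os
import pickle
import tempfile

def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def save_layers(layers, directory):
    """Persist each layer to its own file, like a disk-offloaded checkpoint."""
    paths = []
    for i, w in enumerate(layers):
        path = os.path.join(directory, f"layer_{i}.pkl")
        with open(path, "wb") as f:
            pickle.dump(w, f)
        paths.append(path)
    return paths

def run_sequential(paths, x):
    """Hold only one layer in memory at a time: load, apply, free."""
    for path in paths:
        with open(path, "rb") as f:
            w = pickle.load(f)   # load: this layer's weights only
        x = matvec(w, x)         # compute: forward through the layer
        del w                    # free: release before loading the next
    return x

layers = [[[1, 0], [0, 1]], [[2, 0], [0, 2]]]  # identity, then doubling
with tempfile.TemporaryDirectory() as d:
    paths = save_layers(layers, d)
    out = run_sequential(paths, [3, 4])
print(out)  # [6, 8]
```

Peak memory is one layer's weights plus the activation vector instead of the whole model, which is the trade (extra I/O per forward pass) that lets large models run in small VRAM.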

Community Response/Impact:

  • Anthropic’s constitution sparked debate: openness and clarity vs alignment signaling or “persona baking”; some note circularity in training models on normative documents about their own behavior.
  • Practitioners emphasize production scaffolding (auth, observability, guardrails) and better tool schemas over larger context windows.
  • Hiring/evals shift: Anthropic notes Opus 4.5 solved an internal performance-engineering take-home, prompting redesign of evaluation methods.

First Principles Analysis:

  • OpenEvidence’s multiple suggests investors are pricing in regulated-market moats (clinical accuracy, liability handling, HIPAA/EHR integrations) and distribution (claimed 40% physician use). The path to durable revenue likely runs through billing, order entry, and payer/provider workflows where switching costs and compliance barriers are highest. The risk: verifying real clinical adoption and ensuring outcomes/medico-legal safety at scale.
  • Agent progress hinges less on model IQ and more on disciplined tool interfaces, memory compression, and governance. Benchmarks show autonomy is fragile; the winning stack will pair strong models with robust context layers, typed tool contracts, and observability—turning “agent as demo” into “agent as dependable operator.”