Published: June 30, 2025

Microsoft claims their new AI framework diagnoses 4x better than doctors. I'm a medical doctor and I actually read the paper. Here's my perspective on why this is both impressive AND misleading ... 🧵

Image in tweet by Dr. Dominic Ng

What did they create? Two key innovations:

1. SDBench: A testing environment using 304 real medical mysteries from NEJM where AI starts with just "29yo woman with sore throat" and must decide what to ask/test next
2. MAI-DxO: An AI system that simulates 5 doctors working together as a team

How did they test the AI/Doctors? They took 304 real cases from NEJM and turned them into an interactive game. The setup:

Step 1: You (human doctor or AI) get a tiny intro like: "52-year-old man with fever and breathing problems." That's it. No test results, no detailed history - just like a patient walking into the ER.

Step 2: There's a "Gatekeeper" (another AI) that has the full case file but won't tell you anything unless you specifically ask.

Step 3: You can do three things:
1. Ask questions ("Any recent travel?" "Is there chest pain?")
2. Order tests ("CBC" "Chest X-ray" "CT scan")
3. Make your final diagnosis ("This is pneumonia")

Step 4: The Gatekeeper then answers the question. BUT it only reveals what you ask for. If you don't think to ask about travel history, you won't find out the patient just returned from a cave expedition (real case - histoplasmosis).

Step 5: Every test costs money (real US hospital prices). Every round of questions = $300 office visit.
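To make the setup concrete, here is a minimal Python sketch of the gatekeeper-and-cost idea described above. It is my own illustration, not the paper's code: the `Gatekeeper` class, its method names, the findings and the $120 test price are all hypothetical. Only the $300 charge per round of questions comes from the setup above.

```python
from dataclasses import dataclass

VISIT_COST = 300  # per round of questions, as described in the thread


@dataclass
class Gatekeeper:
    """Holds the full case file and reveals only what is explicitly requested."""
    findings: dict      # hypothetical case file: request -> result
    test_prices: dict   # hypothetical prices, NOT real hospital prices
    total_cost: int = 0

    def ask(self, questions):
        # One round of questions is billed as one visit,
        # however many questions it contains.
        self.total_cost += VISIT_COST
        return {q: self.findings.get(q, "not available") for q in questions}

    def order(self, test):
        # Each ordered test adds its own price to the running bill.
        self.total_cost += self.test_prices.get(test, 0)
        return self.findings.get(test, "not available")


gk = Gatekeeper(
    findings={
        "recent travel": "returned from a cave expedition",
        "chest x-ray": "(placeholder result)",
    },
    test_prices={"chest x-ray": 120},
)

answers = gk.ask(["recent travel"])   # +$300
result = gk.order("chest x-ray")      # +$120
print(gk.total_cost)                  # 420
```

The point of the design is that nothing is volunteered: if "recent travel" is never asked, the cave expedition never comes up, and every request moves the bill.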

MAI-DxO isn't a new model but a framework built on top of existing LLMs (ChatGPT, Claude, Gemini). How does this framework work? It asks the LLM to simulate a virtual panel of 5 specialised AI doctors:

Dr. Hypothesis (tracks diagnoses)
Dr. Test-Chooser (selects optimal tests)
Dr. Challenger (plays devil's advocate)
Dr. Stewardship (manages costs)
Dr. Checklist (quality control)

They then argue it out between themselves as to the best path forward.
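As a rough illustration of "one LLM, five personas", here is a Python sketch that assembles a single role-play prompt from the five roles listed above. The prompt wording, the `build_panel_prompt` function and the ASK/TEST/DIAGNOSE labels are my assumptions; I don't know what the real MAI-DxO prompts look like, and no model is actually called here.

```python
# The five roles and their jobs, as listed in the thread.
PANEL = {
    "Dr. Hypothesis": "tracks diagnoses",
    "Dr. Test-Chooser": "selects optimal tests",
    "Dr. Challenger": "plays devil's advocate",
    "Dr. Stewardship": "manages costs",
    "Dr. Checklist": "quality control",
}


def build_panel_prompt(case_so_far: str) -> str:
    """Build one prompt asking a single LLM to role-play the whole panel.

    Hypothetical wording: a sketch of the idea, not the actual framework.
    """
    roles = "\n".join(f"- {name}: {job}" for name, job in PANEL.items())
    return (
        "Simulate a panel of five doctors discussing this case.\n"
        f"{roles}\n\n"
        f"Case so far: {case_so_far}\n"
        "After the discussion, choose exactly one action: "
        "ASK, TEST, or DIAGNOSE."
    )


prompt = build_panel_prompt("52-year-old man with fever and breathing problems.")
print(prompt)
```

The returned string would be sent to whichever LLM sits underneath, and the chosen action would go back to the Gatekeeper.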

The results?

📊 Accuracy:
Doctors: 20% (ouch)
Standard AI: 30-79%
MAI-DxO: 80-85.5%

💰 Cost per case:
Doctors: $2,963
Standard AI (o3): $7,850
MAI-DxO: $2,397

On paper the AI was 4x more accurate AND cheaper...
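For anyone who wants to check where "4x" comes from, here is the arithmetic on the figures above as a small Python snippet. The variable names are mine; the numbers are the ones quoted in this thread, and I'm not claiming which accuracy figure pairs with which cost figure.

```python
# Figures as quoted in the thread.
doctor_acc = 20.0                      # percent
dxo_acc_low, dxo_acc_high = 80.0, 85.5  # percent
doctor_cost, o3_cost, dxo_cost = 2963, 7850, 2397  # USD per case

# Accuracy ratio of MAI-DxO to doctors, at both ends of the quoted range.
ratio_low = dxo_acc_low / doctor_acc
ratio_high = round(dxo_acc_high / doctor_acc, 3)

# Cost saving of MAI-DxO relative to doctors, in dollars and percent.
saving_usd = doctor_cost - dxo_cost
saving_pct = round(saving_usd / doctor_cost * 100, 1)

print(ratio_low, ratio_high)   # 4.0 4.275
print(saving_usd, saving_pct)  # 566 19.1
```

So "4x" is the ratio of 80% to 20%, and the quoted saving against doctors is $566 per case, roughly 19%.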

But there are five issues I see:

1. They used ZERO healthy patients

95% of sore throats are viral, and this AI was only tested on incredibly rare diagnostic cases. We don't know if it will order biopsies on every patient with a sore throat "just to rule out rhabdomyosarcoma."

2. "Cost-effective" ignores the human toll

Their costs only count lab fees, not:
- 2 weeks of anxiety waiting for biopsy results
- Radiation from "precautionary" CT scans (cancer risk!)
- Complications from unnecessary procedures
- Time off work
- Psychological trauma of false cancer scares

3. The physician comparison was rigged

Docs were banned from:
❌ Googling symptoms
❌ Consulting colleagues
❌ Using UpToDate/medical databases
❌ Calling specialists

That's not how we practice!! It's like testing a chef who can't use recipes or taste their food.

4. The "Retrospective Oracle" Problem

These cases were already SOLVED and published. Real medicine involves genuine uncertainty - sometimes the diagnosis is never found. Does the AI know when to stop investigating?

5. No "When to Stop" Testing

Great doctors know when NOT to test. This AI was never evaluated on:
"This headache is just stress"
"Let's wait and see"
"More tests will cause more harm than good"

The benchmark rewards finding zebras, not recognising horses.

Don't get me wrong - this tech is amazing and I have no doubt I might be getting replaced in the not-so-distant future. But we need:
✓ Testing on actual patient populations (mostly healthy!)
✓ Measuring overdiagnosis harm
✓ Real-world physician comparisons

Final thought: We don't need AI that can diagnose every rare disease. We need AI that knows when to diagnose and when to reassure. That's the real art of medicine. But what do you think?
