Published: August 29, 2025

🧬 Bad news for medical LLMs. This paper finds that top medical AI models often match patterns instead of truly reasoning. Small wording tweaks cut accuracy by up to 38% on validated questions. The team took 100 MedQA questions and replaced the correct choice with "None of the other answers," so a model that genuinely reasons should now select that option.
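The perturbation described in the tweet can be sketched in a few lines. This is a hypothetical reconstruction, not the paper's code: the field names (`stem`, `options`, `answer`) and the exact NOTA wording are assumptions for illustration.

```python
# Hypothetical sketch of the "None of the other answers" (NOTA)
# perturbation the study applies to MedQA items. Field names and
# the NOTA string are assumptions, not taken from the paper's code.

def apply_nota(question):
    """Replace the correct option's text with a NOTA distractor.

    `question` is a dict with:
      - 'stem':    the question text
      - 'options': a mapping of label -> option text
      - 'answer':  the label of the correct option
    The label of the correct answer stays the same, so a model that
    truly reasons should now pick the NOTA option; a pattern-matcher
    tends to pick the familiar-looking (now wrong) distractors.
    """
    perturbed = {
        "stem": question["stem"],
        "options": dict(question["options"]),  # copy, don't mutate input
        "answer": question["answer"],
    }
    perturbed["options"][question["answer"]] = "None of the other answers"
    return perturbed


def accuracy_drop(base_acc, perturbed_acc):
    """Percentage-point drop, e.g. the up-to-38% drop the tweet cites."""
    return base_acc - perturbed_acc


# Illustrative MedQA-style item (made up for this example).
example = {
    "stem": "Which electrolyte disturbance most commonly causes torsades de pointes?",
    "options": {
        "A": "Hypokalemia",
        "B": "Hypomagnesemia",
        "C": "Hypercalcemia",
        "D": "Hyponatremia",
    },
    "answer": "B",
}

nota_item = apply_nota(example)
```

Evaluating a model on both the original and the NOTA-perturbed set, then comparing accuracies, reproduces the study's basic design under these assumptions.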

Image in tweet by Rohan Paul

@rohanpaul_ai Irrelevant & obsolete study. Just look at the models they used for this. DeepSeek-R1 (model 1), o3-mini (reasoning models) (model 2), Claude-3.5 Sonnet (model 3), Gemini-2.0-Flash (model 4), GPT-4o (model 5), and Llama-3.3-70B (model 6).

@rohanpaul_ai Of course. All models aren't really "reasoning" at this stage.

@drgurner Yes, that too.

@rohanpaul_ai It’s almost as if they are next-token predictors

@rohanpaul_ai What could solve that problem? Fine tuning? A vast RAG system or a knowledge graph to help?

@w1kke well, this is one negative study, and then there are hundreds of studies proving LLMs' capability in the medical space.

@rohanpaul_ai Not really bad news… early study that will only get better. Basically internet dial up now until we get broadband WiFi

@Trace_Cohen Oh yes, absolutely.

@rohanpaul_ai you can hear the collective sigh of doctors globally, relieved to be still keeping their jobs

@0xPrismatic Yeah 😀😀

@rohanpaul_ai deepseek R1 did great tho

@nisten what a great model they produced.

@rohanpaul_ai Benchmark scores ≠ real reasoning. This study nails that gap.

@rohanpaul_ai Benchmarks can flatter, but real robustness shows when the format shifts and the answer stays the same.

@rohanpaul_ai This just seems like fidelity is working out what reasoning models do in general, no?

@rohanpaul_ai gpt5 or opus 4.1 should be included for this to be remotely interesting

@rohanpaul_ai BS… this is a lie for medical professionals to stay relevant.

@rohanpaul_ai idk, if you do this with a set of human doctors their performance might decrease as well. And only an 8% decrease on the best model, which isn't even the SOTA

@rohanpaul_ai Researchers test medical AI by removing all safety tools, databases, and verification systems, then shocked when performance drops. This is like testing pilots by removing radar and GPS, then concluding planes aren't safe.

@rohanpaul_ai On the other hand I’ve had a GP Google my symptoms right in front of me, so 🤷‍♂️

@rohanpaul_ai Why are we doing these studies with non-SOTA models? Literally invalidates this study, lol

@rohanpaul_ai what bunch of bs

@rohanpaul_ai Interesting. Pattern matching mimics reasoning, creating an illusion of understanding. This reveals the limits of current AI in complex fields.

@rohanpaul_ai Lol wait til you hear from doctors. We don’t reason either. It’s all protocol / evidence based practice. LLMs probably do the same.

@stephen_winters 😀😀 Ha ha

@rohanpaul_ai Interesting findings. Probably this evaluation should be done again with Gemini 2.5 Pro, GPT-5 Pro, Claude 4.0. The models they have used seem a little old now: "We evaluated 6 models spanning different architectures and capabilities: DeepSeek-R1 (model 1), o3-mini (reasoning models) (model 2), Claude-3.5 Sonnet (model 3), Gemini-2.0-Flash (model 4), GPT-4o (model 5), and Llama-3.3-70B (model 6)."

@AIWithRithesh yes, many studies use older models, I guess to reduce their eval cost

@rohanpaul_ai This is a common pattern in the world of AI. People mistake clever pattern matching for intelligence or even superintelligence and trust the benchmarks too much. The reasoning capabilities, especially spatial and temporal reasoning, are badly lacking for now, but for how long?

@rohanpaul_ai >models often match patterns instead of truly reasoning. That's what LLMs do. Did someone think something else?

@rohanpaul_ai No way, I'm shocked I tell you. Who'd have thought the pattern matching algorithm matched patterns instead of reasoning, despite zero evidence of reasoning capacity.

@rohanpaul_ai How is this a surprise to anyone who knows how LLMs work?

@rohanpaul_ai Gary Marcus is in the room... @GaryMarcus

@rohanpaul_ai Too bad they couldn't test o3/2.5pro or gpt-5

@krasmanalderey yes, they were probably reducing their eval cost
