Published: August 7, 2025

#GPT5 STILL has a severe confirmation bias, like previous SOTA models! 😜 Try it yourselves (images and prompts available in 1 click): https://vlmsarebiased.github.i... It's fast to test for such biases in images. Similar biases likely exist in non-image domains as well...

Image in tweet by Anh Totti Nguyen
Image in tweet by Anh Totti Nguyen

@GDocal @taesiri @an_vo12 Did you have Settings -> Personalization -> Memory turned ON? Memory OFF: it's consistently been "two". Even after horizontal flipping.

Image in tweet by Anh Totti Nguyen

@anh_ng8 @taesiri @an_vo12 been exploring VLMs lately; the issue with VLMs is that they are quite heavily language biased, barely use the image tokens, and place attention on the wrong parts of the image

@ForBo7_ @taesiri @an_vo12 I agree with all 3 things, which we are also seeing and trying to understand :)

@anh_ng8 @taesiri @an_vo12 It seems to think it is an easy question at first and barely spends any time actually looking at the image until it gets it wrong, and then it actually carefully analyses the image

Image in tweet by Anh Totti Nguyen

@FeltSteam @taesiri @an_vo12 yea, that's the behavior @knnguyen2511 previously found in o3, o4-mini in the Chat interface as well. The problem is GPT does not know when to turn on the "careful analysis" mode. :)

@wendyweeww @taesiri @an_vo12 Yep! We also tested this in our benchmarks. Across 6 illusions, SOTA VLMs correctly identify the illusions' names and expected answers, yet are almost always biased (accuracy ≈ random chance). #GPT5 is no different.

Image in tweet by Anh Totti Nguyen
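The "accuracy ≈ random chance" claim above can be made concrete with a tiny scoring sketch. The question format, option count, and the example predictions below are illustrative assumptions, not the benchmark's actual schema: the point is just that a model which keeps giving the *expected* answer (e.g. "4 legs") instead of counting what is in the edited image scores near the guessing baseline.

```python
# Minimal sketch: comparing a VLM's accuracy on illusion-style counting
# questions to the random-chance baseline. Data below is hypothetical.

def chance_baseline(num_options: int) -> float:
    """Expected accuracy from guessing uniformly among the options."""
    return 1.0 / num_options

def accuracy(predictions, labels) -> float:
    """Fraction of predictions matching the ground-truth labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical results: edited images where the animal has an atypical
# number of legs; the biased model mostly answers the typical count (4).
labels      = [5, 5, 3, 5, 4, 5]   # actual leg counts in the images
predictions = [4, 4, 4, 4, 4, 5]   # model mostly repeats its prior

acc = accuracy(predictions, labels)
print(f"accuracy = {acc:.2f}, chance (4 options) = {chance_baseline(4):.2f}")
```

With 4 answer options, chance is 0.25; the biased predictions above land at 0.33, i.e. essentially chance-level despite the model "knowing" the illusion.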

@anh_ng8 @taesiri @an_vo12 These are all adversarial examples where it's often relying on high-res detail, like the audi/zebra/duck ones. Interesting, though! A kind of generalization

@anh_ng8 @taesiri @an_vo12 Yep, just some of the current limitations of AI.

@anh_ng8 @taesiri @an_vo12 someone should make a benchmark for VLMs with lots of tests like this

@anh_ng8 @taesiri @an_vo12 GPT-5 thinking mode can tell it the zebra has 5 legs according to the Chain of Thought, but explicitly decides to ignore it because *Zebras* have 4 legs. But tell it "in this image" (and make sure to use thinking mode) and it can count them just fine: https://x.com/happysmash27/sta...

@anh_ng8 @taesiri @an_vo12 GPT5 Pro, the most advanced version, reasoned for 1 minute 19 seconds and came up with the wrong answer. Same with GPT5 Thinking. Amazing level of incompetence from the smartest model yet.

Image in tweet by Anh Totti Nguyen
Image in tweet by Anh Totti Nguyen

@anh_ng8 @taesiri @an_vo12 sounds like classic bias issues are still lingering, which is no surprise really. testing those image biases is smart though. cool to see more folks diving into this stuff. it's a complex area for sure

@anh_ng8 @taesiri @an_vo12 FYI: In research mode

Image in tweet by Anh Totti Nguyen
