Claude 3 gets ~60% accuracy on GPQA. It's hard for me to overstate how hard these questions are—literal PhDs (in different domains from the questions) with access to the internet get 34%. PhDs *in the same domain* (also with internet access!) get 65%–75% accuracy.
@idavidrein No need to overhype! if you look into the actual test you'll see that it is still just memorization based stuff. LLMs fail simple reasoning tasks and there is no reliable way to fix that. So.. :)
@yar_vol I did look into the actual test, as I’m the first author on the paper introducing the dataset :) https://arxiv.org/abs/2311.120...
@idavidrein @yar_vol @davidrein -- curious if you agree with the "it is still just memorization based stuff" and if so, what a GPQA++ that captures more abstract scientific reasoning would look like. I love this eval so much.
@logangraham @yar_vol @DavidRein thank you!! re: memorization—it's hard for me to imagine how answering these (novel, not curated) questions could be done with simple/rote memorization. I would actually be more excited though about *less* abstract, more "practical" evals, where you have people use models IRL
@idavidrein Love this app! :)
@idavidrein @yar_vol Contrived testbed. As Princeton Review will attest, multiple-choice questions are highly prone to elimination of some/all wrong answers by the ignorant. Open-ended Qs (as in the real world) force LLMs to conceptually understand, and so afford them space to reveal their erroneous “thinking.”