OpenAI may have tested GPT-4 on the training data: we found slam-dunk evidence that it memorizes coding problems that it's seen. Besides, exams don't tell us about real-world utility: It’s not like a lawyer’s job is to answer bar exam questions all day. https://aisnakeoil.substack.co...
After seeing the quoted thread, @sayashk dug deeper and found that there’s a sharp drop in performance based on the exact date of the problem: before Sep 5 vs after Sep 12, 2021. Even more blatantly, we can just ask it for memorized details of problems! https://x.com/cHHillee/status/...
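The date-split check described above can be sketched in a few lines (illustrative Python with made-up records; the actual analysis used real Codeforces problems and GPT-4's solve results, and the cutoff date here is an assumption based on the thread):

```python
from datetime import date

# Hypothetical per-problem records: (problem release date, whether the model solved it).
# In the real analysis these would come from Codeforces problems and model transcripts.
results = [
    (date(2021, 8, 20), True),
    (date(2021, 8, 30), True),
    (date(2021, 9, 1), True),
    (date(2021, 9, 15), False),
    (date(2021, 9, 20), False),
    (date(2021, 10, 1), False),
]

cutoff = date(2021, 9, 5)  # approximate training-data cutoff discussed in the thread

def solve_rate(records):
    """Fraction of problems solved in a list of (date, solved) records."""
    return sum(solved for _, solved in records) / len(records)

before = [r for r in results if r[0] <= cutoff]
after = [r for r in results if r[0] > cutoff]

print(f"solve rate before cutoff: {solve_rate(before):.2f}")
print(f"solve rate after cutoff:  {solve_rate(after):.2f}")
```

A sharp drop at the cutoff, with everything else held constant, is hard to explain by anything other than memorization.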
The paper’s Codeforces results aren’t affected by this, as OpenAI used recent problems (and GPT-4 performed very poorly). For the non-coding benchmarks, there isn’t a clear way to separate problems by date, so we think it is unlikely that OpenAI was able to avoid contamination.
There’s a bigger issue: The manner in which language models solve problems is different from how people do it, so these results tell us very little about how a bot will do when confronted with the real-life problems that professionals face. https://aisnakeoil.substack.co...
Instead of standalone benchmarks, we should study how well language models can do any of the real-world tasks that professionals must do. But it's not human vs bot. Study professionals doing their jobs with the help of AI tools — ideally qualitatively and not just quantitatively.
This is the latest in the AI Snake Oil book blog by @sayashk and me. There's more in the post that I didn't summarize. Read it on Substack and maybe subscribe! https://aisnakeoil.substack.co... https://x.com/random_walker/st...
A recurring q: how to decontaminate a training set that includes the whole web?! Exactly! We think accuracy benchmarking isn't too meaningful for LLMs. There are ideas like holistic evaluation https://arxiv.org/abs/2211.091... But also look beyond metrics and study use-cases in the field.
@random_walker My colleagues ran the bar exam piece of it and confirmed that the questions were not in the training set. You’re right that lawyers don’t take the bar all day. The real proof will be if useful products can be built with this tech. We believe so based on what we’ve seen so far!
@Jacob_Heller Yes, we acknowledge Casetext's work in the post! But the bar exam was the only benchmark where this sort of effort was made. I don't doubt that useful products can be built! Indeed, we advocate evaluating real-world usefulness instead of obsessing over benchmarks.
@random_walker lmao. "it's not useful, it's not useful!! i scream as GPT obsoletes my job in front of me"
@random_walker Re "Melanie Mitchell gives an example of an MBA test question where changing some details in a way that wouldn’t fool a person is enough to fool ChatGPT (running GPT-3.5)." — Couldn't reproduce with GPT-4. Using the same prompt @MelMitchell1 used, GPT-4 got it right every time.
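The perturbation test referenced here (changing surface details that wouldn't fool a person) is easy to sketch. A minimal illustration, assuming a hypothetical question and substitutions of my own; a real run would send each variant to the model and compare answers:

```python
# A hypothetical exam-style question; the perturbations swap surface details
# (names, objects, numbers) that shouldn't change the underlying reasoning.
question = "Alice buys 3 widgets at $5 each. How much does she spend?"

perturbations = [
    {"Alice": "Carol", "widgets": "gadgets"},  # rename entities
    {"3": "4", "$5": "$6"},                    # change the numbers
]

def perturb(text, mapping):
    """Apply literal find-and-replace substitutions to produce a variant."""
    for old, new in mapping.items():
        text = text.replace(old, new)
    return text

variants = [perturb(question, m) for m in perturbations]
for v in variants:
    print(v)
# A person answers every variant the same way, so divergent model answers
# across variants are evidence of brittleness or memorization.
```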
@random_walker Thanks so much for this! I started doing similar but got pulled into other things. Appreciate the work from you and @sayashk !
@random_walker Contamination is all you need. NLP is fairly behind vs CV on this.
@random_walker have u ever checked out http://betterwithout.ai ?
@random_walker @YiTayML ~seen a c# kernel based pressure solver because it was trained before. used .net 7, fx no fluid experts use c# or compute, maybe conv NASA Fortran to a new language i can actually build.. coding sucks. i ask it simulate exotic 🛸 bf ww3, bias, war or ☄️. gpt4 isnt tuned yet tho.
@random_walker This says more about the exams used to evaluate human performance rather than it does about GPT-4’s performance.
@random_walker > we found slam-dunk evidence that it memorizes coding problems that it's seen. < So what, and who cares? All anyone, other than researchers, cares about is that it provides good, valuable content
@random_walker I am confident that OpenAI will rebrand soon. The name is a little on the nose and points to a problem that will become glaringly obvious soon. Unless there is a way to validate claims openly, brand loyalty / tribalism will inevitably be the dominant driver of opinion
@random_walker Good piece! Our education tests humans based on our training data. Memorization tests provide permission to practice many professions. So, maybe we shouldn’t test AI the same way as humans. Have you thought about alternatives better suited to testing practical AI use?
@random_walker @mmitchell_ai Really great points made here. Excellent article.
@random_walker When I first read the GPT-4 “paper”, I was impressed, but on second thought suspected a minority of problems might be “indirectly” in the training set. I just never thought it could be more pervasive than that. They probably shouldn’t have cited its math score to excuse it from (possibly failing at) multiplying long numbers.
@random_walker Looking forward to diving deeper into this critical work. Thanks
@random_walker “Benchmarks are already wildly overused in AI for comparing different models. They have been heavily criticized for collapsing a multidimensional evaluation into a single number. When used as a way to compare humans & bots, what results is misinformation.” 🙏🏽
@random_walker @fchollet The reason I know chatGPT won’t replace the coding part of my job is that I’ve made a serious effort to get it to do that part of my job and it’s failed. It’s still been useful, as a way to look things up with focused tailored answers, as a smart template to kickstart, etc.
@random_walker I am frustrated at the NLP community's lack of interest in evaluation based on real-world usage by professionals. We had a paper at NAACL 22 that looked at this in a medical context. The paper won an award, but has otherwise been ignored. https://aclanthology.org/2022....
@random_walker This is pretty consistent in my use as a researcher. If I ask some questions that somewhat deviate from text-book examples, even GPT4 starts spitting some long-winded answer that looks reasonable but has serious errors upon closer examination.
@random_walker And the people who can't afford any kind of lawyer at all, but have free or cheap access to an AI version? Poor people didn't choose to lock away or paywall so much information and knowledge, did they? It's long overdue for any tech to bust that disgusting 'Normal.'
@random_walker I think we need to take seriously the possibility that this is deliberate deception on the part of OpenAI, and look for evidence of other deception. Were they p-hacking as well? Almost certainly. Note the HUGE spike in interest leading up to the "surprise" GPT-4 release. 1/
@random_walker I don't understand the point. Anyone paying attention doesn't really expect creativity from these models. They are recursive comparison systems. A lot of work people do, especially in code and law, neatly falls into this category.
@random_walker There is always a way to trip up an AI. I wonder if it could code something that isn't blatantly on Stack Overflow, like emulating an obscure CPU, or even emulating a CPU in general
@random_walker I have used Copilot and ChatGPT; both are great, but they are not replacing any job anytime soon.
@random_walker This is bad @sama @gdb @ilyasut @OpenAI
@random_walker So what? ML scientists are chasing SOTA non-stop. This tool is just wonderful; it is useful. Who cares about the accuracy on task x? We can judge for ourselves, unlike with so many “open source” publications
@random_walker Seems like a lot of tier 2 computer scientists scramble to piss on the AI progress that has been made.
@random_walker Finally. Thanks.
@random_walker @fchollet Whoa 🤯



