Published: November 17, 2023

New paper from my group: "Comparing Humans, GPT-4, and GPT-4V On Abstraction and Reasoning Tasks". 🧵 (1/9) https://arxiv.org/abs/2311.092...

We compare the performance of people, GPT-4 (text-only), and GPT-4 Vision on tasks from the ConceptARC benchmark. (2/9)

ConceptARC is a benchmark in the ARC domain that systematically tests understanding of core-knowledge concepts (e.g., "object", "top/bottom", "same/different", etc.) (3/9)

Image in tweet by Melanie Mitchell
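For readers unfamiliar with the format: tasks in the ARC domain are typically distributed as JSON, with a few "train" input/output grid pairs demonstrating a hidden rule and one or more "test" inputs to solve. Below is a minimal sketch of that structure (ConceptARC follows the same layout); the grid values are invented purely for illustration.

```python
# Illustrative ARC-style task layout. Grids are lists of rows of
# integers 0-9 (color indices). The specific values here are made up.
example_task = {
    "train": [  # demonstration input/output pairs
        {"input": [[0, 1], [0, 1]], "output": [[1, 0], [1, 0]]},
        {"input": [[0, 2], [0, 2]], "output": [[2, 0], [2, 0]]},
    ],
    "test": [   # the solver must produce the output grid for this input
        {"input": [[0, 3], [0, 3]]},
    ],
}
```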

In a previous paper in TMLR (https://openreview.net/pdf?id=...) we tested GPT-4 using a very simple zero-shot prompt that encoded visual grids as arrays of numbers. Here we extend this work in two ways. 1. We test text-only GPT-4 using a more informative, one-shot prompt. (4/9)
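As a rough illustration of what "encoding visual grids as arrays of numbers" can look like in a text prompt (a hedged sketch; the paper's exact prompt wording may differ):

```python
def grid_to_text(grid):
    """Render an ARC-style grid (rows of ints 0-9) as lines of digits.

    Illustrative only -- not necessarily the exact encoding used in the paper.
    """
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

# A 2x3 grid becomes two lines of space-separated digits:
print(grid_to_text([[0, 1, 2], [0, 0, 2]]))
# 0 1 2
# 0 0 2
```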

2. We test GPT-4 Vision on the simplest (minimal) tasks with the same prompt, but using visual grids instead of text to represent the task grids. This is one of the first tests (AFAIK) of GPT-4 Vision on ARC-like tasks. (5/9)
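For the vision condition, the same prompt can be paired with an image of the task grids. Here is a minimal sketch using the OpenAI chat API with an image content part; the file name, prompt text, and model name are placeholders, not the paper's exact setup.

```python
import base64
from openai import OpenAI  # official openai Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical PNG rendering of a task's grids, encoded for the API.
with open("task_grids.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # vision-capable GPT-4 model of that period
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Infer the rule from the example grids and give the "
                     "output grid for the test input."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```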

Results of the paper: Performance of GPT-4 (text-only) is improved with the better prompt (33% correct overall), but is still far below that of humans (91% correct overall). (6/9)

Performance of GPT-4 with Vision on the very simplest ("minimal") tasks is substantially worse than that of text-only GPT-4, which is in turn worse than that of humans. Minimal tasks:
GPT-4 Vision: 25% correct
GPT-4 Text-only: 65% correct
Humans: 95% correct (7/9)

Conclusion: "Our results support the hypothesis that GPT-4, perhaps the most capable “general” LLM currently available, is still not able to robustly form abstractions and reason about basic core concepts in contexts not previously seen in its training data." (8/9)

Comments welcome! (9/9 -- End)

@MelMitchell1 The reason might be that there is not much abstraction training data
