Article reprint source: AI Trends
Original source: Quantum Bit
GPT-4 once amazed countless people by acing the famous Internet meme challenge "Chihuahua or blueberry muffin".
However, it has now been accused of "cheating"!
When the test reused all the pictures from the original puzzle but shuffled their order and arrangement, the latest all-in-one version of GPT-4 not only miscounted the pictures but also misidentified Chihuahuas it had originally recognized correctly.
So why does GPT-4 perform so well on the original image?
Xin Eric Wang, an assistant professor at UCSC who conducted the test, speculated that the original image was so popular on the Internet that GPT-4 had seen the original answer many times during training and had memorized it.
Yann LeCun, one of the three deep-learning pioneers who shared the Turing Award, also took note of the matter and said:
Be wary of testing on the training set.
Teddy bear and fried chicken are indistinguishable
How popular is the original picture? It is not only a famous Internet meme, but also a classic problem in the field of computer vision, and has appeared many times in related research papers.
So, setting aside the influence of the original image, where exactly do GPT-4's abilities fall short? Many netizens devised their own tests.
To rule out the possibility that an overly complex arrangement was the problem, some people simplified it to a 3x3 grid, which still produced many errors.
Someone else cropped out individual pictures and sent them to GPT-4 one at a time, getting 5/5 accuracy.
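As a rough illustration of that per-tile setup, here is a minimal sketch using the OpenAI Python SDK and Pillow; the file name muffin_grid.jpg, the 3x3 grid size, and the prompt wording are assumptions for the sketch, not the exact setup used in the test.

```python
import base64
import io

from openai import OpenAI
from PIL import Image

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_tile(tile: Image.Image) -> str:
    """Ask GPT-4V about a single tile cropped out of the meme grid."""
    buf = io.BytesIO()
    tile.save(buf, format="JPEG")
    b64 = base64.b64encode(buf.getvalue()).decode()
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Is this a Chihuahua or a blueberry muffin? Answer with one word."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        max_tokens=5,
    )
    return resp.choices[0].message.content.strip()

# Slice the grid into tiles and query each one individually.
grid = Image.open("muffin_grid.jpg").convert("RGB")  # assumed local file
rows, cols = 3, 3  # matches the simplified 3x3 arrangement
w, h = grid.width // cols, grid.height // rows
for r in range(rows):
    for c in range(cols):
        tile = grid.crop((c * w, r * h, (c + 1) * w, (r + 1) * h))
        print(f"tile ({r},{c}):", classify_tile(tile))
```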
But Xin Eric Wang believes that putting these confusing images together is precisely the point of this challenge.
There was also someone who cast the two prompt "spells", "take a deep breath" and "think step by step", at the same time and got the correct result.
However, GPT-4's phrasing in its answer, "This is an example of a visual pun or famous meme," also suggests that the original image may indeed be in the training data.
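The exact wording used in that attempt was not shared, so the following is only a hedged sketch of how the two "spells" might be combined into a single vision request (it reuses the client and the assumed file name from the sketch above):

```python
import base64

# Encode the full meme grid as base64 for the data URL.
with open("muffin_grid.jpg", "rb") as f:
    grid_b64 = base64.b64encode(f.read()).decode()

# Both "spells" go into one prompt: the calming preamble plus a step-by-step cue.
prompt = (
    "Take a deep breath and think step by step. "
    "For each picture in this grid, going row by row, say whether it shows "
    "a Chihuahua or a blueberry muffin, then give a final count of each."
)
resp = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{grid_b64}"}},
        ],
    }],
    max_tokens=500,
)
print(resp.choices[0].message.content)
```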
Finally, someone ran the "Teddy or fried chicken" test, a meme that often circulates alongside the original, and found that GPT-4 could not tell the two apart well either.
But this "blueberries or chocolate chips" one is a bit much...
Visual hallucinations become a hot topic
The "nonsense" of large models is called the hallucination problem in academia. The problem of visual hallucinations in large multimodal models has become a hot research topic recently.
A study at EMNLP 2023 constructed a dataset called GVIL, containing 1,600 data points, to systematically evaluate visual hallucination problems.
The study found that larger models are more susceptible to visual illusions, aligning more closely with human perception.
Another newly released study focused on evaluating two types of hallucination: bias and interference.
Bias refers to the model's tendency to produce certain types of responses, possibly due to imbalances in its training data.
Interference refers to the model's judgment being thrown off by the way the text prompt is worded or the way the input image is presented.
The study noted that GPT-4V often got confused when interpreting multiple images together and performed better when sent the images individually, which is consistent with observations in the "Chihuahua or muffin" test.
Popular mitigations such as self-correction and chain-of-thought prompting do not effectively address these issues, and other multimodal models such as LLaVA and Bard show similar problems.
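For concreteness, "self-correction" here means feeding the model's first answer back and asking it to re-check. Below is a minimal sketch under the same assumptions as the earlier ones; the prompt wording is hypothetical, and per the study this often fails to remove the hallucination.

```python
def self_correct(image_b64: str, first_answer: str) -> str:
    """Second-turn self-correction: show the model its own answer and ask
    it to re-examine the image. Reuses the `client` from the sketches above."""
    image_part = {"type": "image_url",
                  "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}}
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[
            {"role": "user", "content": [
                {"type": "text",
                 "text": "What does each picture in this image show?"},
                image_part,
            ]},
            # The model's first (possibly hallucinated) answer.
            {"role": "assistant", "content": first_answer},
            # Ask it to double-check; the study found this rarely fixes errors.
            {"role": "user", "content":
                "Re-examine the image carefully. Are you sure about every "
                "item? Correct any mistakes in your previous answer."},
        ],
        max_tokens=500,
    )
    return resp.choices[0].message.content
```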
The study also found that GPT-4V is better at interpreting images with Western cultural backgrounds or images with English text.
For example, GPT-4V can correctly count Snow White plus the seven dwarfs, but counts the seven Calabash Brothers, characters from a classic Chinese cartoon, as 10.
Reference links:
[1] https://twitter.com/xwang_lk/status/1723389615254774122
[2] https://arxiv.org/abs/2311.00047
[3] https://arxiv.org/abs/2311.03287