Article reprinted from: Machine Heart
The deployment of multimodal large models is finally taking off.
About ten days ago, OpenAI added image recognition to ChatGPT, allowing users to upload one or more images and converse about them. From OpenAI's own public documentation, we learned that the model behind ChatGPT's image recognition feature is a new large model called GPT-4V.
In fact, this capability already existed when GPT-4 was released half a year ago, but it had never been made available to ordinary users. In the field of AI, multimodal large models have long been recognized as a major trend and are considered a key component of general-purpose AI assistants.
Given OpenAI's insistence on keeping its models closed source, many researchers have raced to release multimodal large models of their own. For example, the two representative works LLaVA and MiniGPT-4 have demonstrated impressive natural instruction-following and visual reasoning capabilities.
In April this year, researchers from the University of Wisconsin-Madison, Microsoft Research, and Columbia University jointly released LLaVA (Large Language and Vision Assistant). Although LLaVA was trained on a small multimodal instruction dataset, it produced reasoning results on some samples that were remarkably similar to GPT-4's.
Today, this work has received a major upgrade: LLaVA-1.5 has been officially released. With only simple modifications to the original LLaVA, it sets a new state of the art on 11 benchmarks.
Paper address: https://browse.arxiv.org/pdf/2310.03744.pdf
Demo address: https://llava.hliu.cc/
Using only 1.2 million publicly available training samples, LLaVA-1.5 completes training in less than a day on a single node with 8 A100 GPUs.
In the paper, the researchers introduce two simple improvements: an MLP cross-modal connector and the incorporation of data from academic tasks such as VQA. Combined with LLaVA, these two changes yield stronger multimodal understanding.
Compared to InstructBLIP or Qwen-VL, which train specially designed visual resamplers on hundreds of millions or even billions of image-text pairs, LLaVA uses one of the simplest possible architecture designs: it only needs to train a simple fully connected projection layer on about 600K image-text pairs.
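To make that projection layer concrete, here is a minimal PyTorch sketch; the feature dimensions and patch count are illustrative assumptions, not numbers taken from this article:

```python
import torch
import torch.nn as nn

# Minimal sketch of the original LLaVA connector. The dimensions are
# illustrative assumptions (CLIP ViT-L/14-style features fed into a
# 4096-d LLM), not figures from the article.
clip_dim, llm_dim = 1024, 4096
projector = nn.Linear(clip_dim, llm_dim)

# One visual feature per image patch, from a frozen vision encoder.
image_features = torch.randn(1, 576, clip_dim)  # (batch, patches, dim)
visual_tokens = projector(image_features)       # -> (1, 576, llm_dim)

# During pre-training, only `projector` receives gradient updates;
# the vision encoder and the LLM stay frozen.
```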
Can it beat GPT-4V?
Before diving into the paper, let's look at LLaVA-1.5's recognition capabilities and see whether it can compete with GPT-4V.
Task 1: Convert groceries to JSON
Instructions: You need to identify all fruits (and only fruits), and for each fruit create an object with a name attribute and a nutrition attribute containing estimated calories, carbohydrates, fat, and protein.
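For reference, the requested output has roughly the shape sketched below; the fruit and its nutrition numbers are hypothetical illustrations, not either model's actual answer:

```python
import json

# Hypothetical example of the output shape the instruction asks for.
# The nutrition values are rough illustrative estimates.
fruits = [
    {
        "name": "apple",
        "nutrition": {
            "calories": 95,
            "carbohydrates_g": 25,
            "fat_g": 0.3,
            "protein_g": 0.5,
        },
    }
]
print(json.dumps(fruits, indent=2))
```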
LLaVA-1.5's answer:
GPT-4V's answer:
Task 2: Identify a movie from a simplified sketch
Instructions: What movie is this picture about? Note: I changed the characters' names to make identification harder.
LLaVA-1.5's answer:
GPT-4V's answer:
Paper details
LLaVA demonstrates commendable visual reasoning capabilities, outperforming multiple state-of-the-art models on a variety of benchmarks for real-life visual instruction tasks, falling short only on academic benchmarks that typically require short-form answers. The research team attributes the latter to the fact that LLaVA was not pre-trained on large-scale data the way other methods were.
Specifically, the study first analyzes the impact of scaling data, model size, and input image resolution on the three datasets selected in Table 1 below, and then conducts comparative experiments on 12 benchmarks in Table 2. The results show that the LLaVA architecture is powerful and data-efficient for visual instruction tuning, achieving the best performance with far less computation and training data than all other methods.
Response format prompt
The study found that there are two main reasons why methods such as InstructBLIP cannot strike a balance between short-form and long-form VQA:
First, the prompts given to the LLM are ambiguous about the desired response format. For example, a prompt like "Q: {question} A: {answer}" does not clearly articulate the desired output format, which can cause the LLM to overfit to short answers even in natural visual conversations.
Second, the LLM is not fine-tuned. For example, InstructBLIP relies on the Q-Former's visual output tokens to control the length of the LLM's output (long-form vs. short-form), but the Q-Former may lack the capacity to do this reliably.
To address this, the study proposes a "response format prompt" that clearly specifies the output format. For example, when the model is required to give a short answer, a sentence is appended to the VQA question: "Answer the question using a single word or phrase."
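As a concrete illustration, here is a minimal sketch of appending such a suffix to a VQA question; the helper function is hypothetical, while the suffix string is the one quoted above:

```python
# Response-format suffix quoted above; the helper name is ours.
FORMAT_SHORT = "Answer the question using a single word or phrase."

def build_vqa_prompt(question: str, short_answer: bool) -> str:
    """Append an explicit output-format instruction for short-form VQA."""
    return f"{question}\n{FORMAT_SHORT}" if short_answer else question

print(build_vqa_prompt("What color is the bus?", short_answer=True))
# What color is the bus?
# Answer the question using a single word or phrase.
```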
The study shows experimentally that when the LLM is fine-tuned with such prompts, LLaVA can adapt its output format appropriately to the user's instructions, without needing to post-process the VQA data with ChatGPT.
The study also found that, compared with the original design, upgrading the vision-language connector to a two-layer MLP improves LLaVA's multimodal capabilities. In addition, the researchers expanded the training data with academic-task-oriented datasets covering VQA, OCR, and region-level perception, further strengthening the model.
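For concreteness, here is a minimal sketch of such a two-layer MLP connector, assuming a GELU activation between the linear layers (an assumption consistent with the public LLaVA code) and illustrative dimensions:

```python
import torch.nn as nn

# Sketch of LLaVA-1.5's upgraded connector: the single linear
# projection is replaced by a two-layer MLP. GELU activation and
# dimensions are assumptions, not figures from this article.
clip_dim, llm_dim = 1024, 4096
mlp_connector = nn.Sequential(
    nn.Linear(clip_dim, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)
```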
Interested readers can read the original paper to learn more about the research content.
Reference Links:
https://twitter.com/rowancheung/status/1710736745904721955
https://twitter.com/imhaotian/status/1710192818159763842
