In this paper, we pioneer the evaluation of recent MLLMs (LLaVA-1.5, LLaVA-NeXT, BLIP2, InstructBLIP, GPT-4V, and GPT-4o) on datasets designed to address weaknesses of earlier benchmarks such as VQAv2 and RefCOCO, including extreme bias, spurious correlations, and a lack of fine-grained analysis.
We assess four VQA datasets: VQDv1, TallyQA, TDIUC, and DVQA.
Our experiments reveal previously unreported weaknesses in today's MLLMs. Our code is integrated into a fork of the widely used LAVIS framework for MLLM evaluation, enabling the rapid assessment of future MLLMs.
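As a rough illustration of the kind of evaluation loop the fork supports, here is a minimal sketch that loads a BLIP2 checkpoint through the upstream LAVIS API and answers a single visual question. The model name, checkpoint type, and prompt format follow LAVIS's public examples; the fork's actual evaluation entry points may differ.

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load a BLIP2 (FlanT5-XL) checkpoint through the upstream LAVIS API.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_t5", model_type="pretrain_flant5xl", is_eval=True, device=device
)

raw_image = Image.open("example.jpg").convert("RGB")  # example path, not from the repo
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# BLIP2's generate() takes an image tensor and a text prompt.
answer = model.generate({"image": image, "prompt": "Question: How many dogs are sitting? Answer:"})
print(answer)
```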
Description 🗂️: VQDv1 challenges models to generate multiple bounding boxes 📦, not just one! Unlike typical grounding datasets, where each query maps to exactly one region, a VQDv1 query can match anywhere from 0 to N bounding boxes, testing general detection skills 🔍 and adding an extra layer of difficulty.
Results 📊: All models show decreased Recall as the number of bounding boxes 📦 increases. This indicates that models struggle to identify multiple objects matching the query. 🤖🔍
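For reference, below is a minimal sketch of how recall can be computed for a query with a variable number of target boxes. The IoU threshold, greedy matching, and the scoring of zero-target queries are our assumptions here, not necessarily the exact protocol used in the paper.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def recall(pred_boxes, gt_boxes, thresh=0.5):
    """Fraction of ground-truth boxes matched by a prediction (greedy matching, IoU >= thresh)."""
    if not gt_boxes:  # zero-target queries: credit only if nothing is predicted (assumption)
        return 1.0 if not pred_boxes else 0.0
    unmatched = list(pred_boxes)
    hits = 0
    for gt in gt_boxes:
        best = max(unmatched, key=lambda p: iou(p, gt), default=None)
        if best is not None and iou(best, gt) >= thresh:
            unmatched.remove(best)  # each prediction can match at most one target
            hits += 1
    return hits / len(gt_boxes)
```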
Description 🔢: TallyQA tests models' counting skills 🔢. It includes simple questions that require only object detection and complex ones that need additional reasoning, such as pose estimation ("How many dogs are sitting?") and positional reasoning ("How many dogs are in front of the white building?"). 🏢🔍
Results 📊: Some models, like BLIP2, excel at the simple counting questions but struggle with the complex ones. For example, BLIP2 matches GPT-4 on simple questions but drops significantly on complex questions requiring advanced reasoning, like pose estimation 🕺 and positional reasoning 📍.
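Sketched below is how the simple/complex split can be scored with exact-match counting accuracy, assuming each record carries the model's generated answer, the ground-truth count, and a flag marking simple vs. complex questions (TallyQA's annotations distinguish the two; the field names here are illustrative).

```python
from collections import defaultdict

def counting_accuracy(records):
    """Exact-match counting accuracy, split into simple vs. complex questions.

    Each record is assumed to look like:
      {"predicted": "3", "answer": 3, "is_simple": True}
    (illustrative field names, not the dataset's exact schema).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        split = "simple" if r["is_simple"] else "complex"
        total[split] += 1
        try:
            pred = int(str(r["predicted"]).strip())
        except ValueError:
            pred = -1  # non-numeric generations count as wrong
        correct[split] += int(pred == int(r["answer"]))
    return {s: correct[s] / total[s] for s in total}
```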
Description 🎯: TDIUC tests models' versatility across 12 question types, including object 🏷️, attribute 🔖, and activity recognition 🏃, as well as overall scene understanding 🌆. The meaningful question-type categories allow fine-grained analysis of models' abilities, highlighting specific strengths and weaknesses.
Results 📊: TDIUC provides detailed per-question-type performance for each model, and models show varying strengths and weaknesses across these types. Key takeaways: all models struggle with positional reasoning 📍, and LLaVA-NeXT leads on almost all question types, even surpassing the previous state-of-the-art VQA algorithm, MuRel.
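A sketch of the per-type breakdown, including the arithmetic and harmonic mean-per-type summaries commonly reported on TDIUC; the record field names and answer normalization are our assumptions.

```python
from collections import defaultdict
from statistics import harmonic_mean

def per_type_metrics(records):
    """Accuracy per question type plus arithmetic / harmonic mean-per-type.

    Each record is assumed to provide: question_type, predicted, answer
    (illustrative field names).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        qtype = r["question_type"]
        total[qtype] += 1
        correct[qtype] += int(
            str(r["predicted"]).strip().lower() == str(r["answer"]).strip().lower()
        )
    acc = {t: correct[t] / total[t] for t in total}
    return {
        "per_type": acc,
        "arithmetic_mpt": sum(acc.values()) / len(acc),
        "harmonic_mpt": harmonic_mean(list(acc.values())),
    }
```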
Description 📊: DVQA tests models' ability to interpret and analyze data presented as charts 📊. It requires OCR skills 🔡 and the ability to handle chart-specific words that rarely appear elsewhere. The synthetically generated images 🖥️ pose unique challenges compared to natural images 🌳.
Results 📊: DVQA results show that models which perform well on natural images 🌳 can struggle with synthetic chart images 🖥️. The dataset is entirely synthetic, with bars in random colors and styles, and most models struggle in particular with the reasoning and data-retrieval question types. 🧠🔍
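Because many DVQA answers are chart-specific strings (bar labels, legend entries) rather than entries in a fixed answer vocabulary, scoring reduces to exact string match after light normalization. A minimal sketch, where the normalization choices are our assumption rather than the dataset's official scorer:

```python
def normalize(ans):
    """Light normalization for chart answers: case, whitespace, trailing punctuation."""
    return str(ans).strip().strip(".").lower()

def dvqa_accuracy(records):
    """Exact-match accuracy after normalization; chart-specific answer strings
    must match exactly, so no synonym handling is applied."""
    correct = sum(int(normalize(r["predicted"]) == normalize(r["answer"])) for r in records)
    return correct / len(records)
```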