Revisiting Multi-Modal LLM Evaluation

1University of Rochester 2SRI International 3Adobe
*jlu59@u.rochester.edu

In this paper, we pioneer evaluating recent MLLMs (LLaVA-1.5, LLaVA-NeXT, BLIP2, InstructBLIP, GPT-4V, and GPT-4o) on datasets designed to address the weaknesses of earlier benchmarks such as VQAv2 and RefCOCO, including extreme bias, spurious correlations, and the lack of fine-grained analysis.
We assess four VQA datasets:

  1. VQDv1, which requires identifying all image regions that satisfy a given query;
  2. TallyQA, which contains both simple and complex counting questions;
  3. TDIUC, which permits fine-grained analysis across 12 question types; and
  4. DVQA, which requires optical character recognition for chart understanding.

Our experiments reveal the weaknesses of today's MLLMs that have not previously been reported. Our code is integrated into a fork of the widely used LAVIS framework for MLLM evaluation, enabling the rapid assessment of future MLLMs.
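To illustrate how such an evaluation can be driven programmatically in a LAVIS-style codebase, here is a minimal sketch that loads a BLIP-2 model through LAVIS's `load_model_and_preprocess` and queries it on a single image-question pair. The model name, checkpoint type, image path, and prompt template are assumptions for illustration; the dataset-level evaluation in our fork runs through LAVIS's config-driven task pipeline rather than this one-off call.

```python
# Minimal sketch (not the full evaluation pipeline): query a BLIP-2 model via LAVIS.
# The model name/type, image path, and prompt format below are assumptions.
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load a BLIP-2 FLAN-T5 model together with its matching image preprocessor.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_t5", model_type="pretrain_flant5xl", is_eval=True, device=device
)

image = Image.open("example.jpg").convert("RGB")
image = vis_processors["eval"](image).unsqueeze(0).to(device)

# Ask a single VQA-style question; the prompt template is an assumption.
answer = model.generate({"image": image, "prompt": "Question: How many dogs are sitting? Answer:"})
print(answer)
```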

VQDv1

Description 🗂️: VQDv1 challenges models to generate multiple bounding boxes 📦, not just one! Unlike typical datasets, queries in VQDv1 can result in 0 to N bounding boxes, testing general detection skills 🔍. This adds an extra layer of difficulty.

Example Image
Example images of VQDv1 with multiple bounding boxes.

Results 📊: All models show decreased Recall as the number of bounding boxes 📦 increases. This indicates that models struggle to identify multiple objects matching the query. 🤖🔍

VQDv1 Result Image
Comparison of Multi-Modal LLMs on VQDv1. MGM refers to MiniGemini 7B, LLaVA-OV to LLaVA OneVision, L to LLaVA-1.5, and L-NeXT to LLaVA-NeXT.
The large gap from optimal performance on VQDv1 suggests a major limitation of today's MLLMs.
VQDv1 Performance Graphs
Precision and recall graphs for VQDv1.
This trend suggests that increasing the number of objects to be grounded challenges the models' detection capabilities, potentially due to limitations in how they interpret contextual information within visual scenes.
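For readers re-implementing VQDv1-style scoring, the sketch below shows one common way to score a variable-length set of predicted boxes against 0 to N ground-truth boxes: greedy one-to-one matching at an IoU threshold, from which precision and recall follow. The 0.5 threshold, the greedy matching rule, and the handling of empty prediction sets are assumptions for illustration, not necessarily the paper's exact protocol.

```python
# Sketch of IoU-based scoring for a query with 0..N ground-truth boxes.
# The IoU threshold, greedy matching rule, and empty-set conventions are assumptions.
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def iou(a: Box, b: Box) -> float:
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def precision_recall(preds: List[Box], gts: List[Box], thr: float = 0.5):
    if not preds and not gts:
        return 1.0, 1.0  # query with zero target boxes and nothing predicted
    matched_gt = set()
    tp = 0
    for p in preds:  # greedily match each prediction to an unused ground-truth box
        best_j, best_iou = -1, thr
        for j, g in enumerate(gts):
            if j in matched_gt:
                continue
            v = iou(p, g)
            if v >= best_iou:
                best_j, best_iou = j, v
        if best_j >= 0:
            matched_gt.add(best_j)
            tp += 1
    precision = tp / len(preds) if preds else 0.0  # conventions vary for empty predictions
    recall = tp / len(gts) if gts else 0.0
    return precision, recall
```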

TallyQA

Description 🔢: TallyQA tests models' counting skills 🔢. It includes simple questions on object detection and complex ones needing advanced reasoning, like pose estimation ("How many dogs are sitting?") and positional reasoning ("How many dogs are in front of the white building?"). 🏢🔍

Example Image
Example images from TallyQA showing various counting challenges.

Results 📊: TallyQA includes simple and complex counting questions 🔢. Some models, like BLIP2, excel at simple questions but struggle with complex ones. For example, BLIP2 matches GPT-4 on simple tasks but drops significantly on complex questions requiring advanced reasoning, like pose estimation 🕺 and positional reasoning 📍.

TallyQA Result Image
Results on TallyQA. For Acc., best performers based on paired asymptotic McNemar tests (α = 0.05) are in bold. For RMSE, the highest value is bolded. For comparison, the result from SMoLA is the current best on TallyQA.
TallyQA Simple Counting Performance Graph
TallyQA Simple
TallyQA Complex Counting Performance Graph
TallyQA Complex
Accuracy distribution by correct answer for simple (left) and complex (right) counting questions in TallyQA. These results suggest the necessity of incorporating more complex counting questions into the evaluation of visual reasoning models.
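For completeness, the two counting metrics reported above, accuracy and RMSE, follow their standard definitions; the short sketch below assumes model answers have already been parsed into integers.

```python
# Sketch: accuracy and RMSE for counting questions (predictions already parsed to ints).
import math
from typing import List

def counting_metrics(preds: List[int], gts: List[int]):
    assert len(preds) == len(gts) and preds
    acc = sum(p == g for p, g in zip(preds, gts)) / len(gts)
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in zip(preds, gts)) / len(gts))
    return acc, rmse

# Example: answering "2" when the true count is 3 hurts RMSE less than answering "7".
print(counting_metrics([2, 3, 0, 7], [3, 3, 0, 3]))
```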

TDIUC

Description 🎯: TDIUC tests models' versatility across 12 tasks, including object 🏷️, attribute 🔖, and activity recognition 🏃, as well as overall scene understanding 🌆. The meaningful categories of question types allow fine-grained analysis of models' abilities, highlighting specific strengths and weaknesses.

Example Image
Example images from TDIUC showing various question types.

Results 📊: TDIUC results provide detailed performance metrics for each model on different question types. Models show varying strengths and weaknesses across these types. Key takeaways: all models struggle with positional reasoning 📍, and LLaVA-NeXT performs best on almost all question types, even surpassing the previous state-of-the-art VQA algorithm, MuRel.

TDIUC Result Image
Accuracy on TDIUC for each question type. Best performers based on paired asymptotic McNemar tests (α = 0.05) are in bold. For comparison, MuRel is the previous best result from training on TDIUC.
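The pairwise comparisons that determine the bolding here (and in the TallyQA and DVQA tables) are paired asymptotic McNemar tests. As a rough sketch of that test using `statsmodels`, with made-up per-question correctness vectors for two models:

```python
# Sketch: paired asymptotic McNemar test on two models' per-question correctness.
# The correctness vectors here are made up for illustration.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

model_a = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1], dtype=bool)  # 1 = answered correctly
model_b = np.array([1, 0, 0, 1, 0, 0, 1, 0, 1, 0], dtype=bool)

# 2x2 table of paired outcomes: rows = model A correct/incorrect, cols = model B.
table = np.array([
    [np.sum(model_a & model_b),  np.sum(model_a & ~model_b)],
    [np.sum(~model_a & model_b), np.sum(~model_a & ~model_b)],
])

result = mcnemar(table, exact=False, correction=True)  # asymptotic (chi-square) version
print(result.statistic, result.pvalue)  # difference is significant at alpha = 0.05 if pvalue < 0.05
```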

DVQA

Description 📊: DVQA tests models' ability to interpret and analyze visual data in chart form 📊. It requires OCR skills 🔡 and handling unusual words found in charts. The synthetically generated images 🖥️ pose unique challenges compared to natural images 🌳.

Example Image
Example images from DVQA showing various chart types.

Results 📊: DVQA results show that while some models perform well on natural images 🌳, they struggle with synthetic images 🖥️. The DVQA dataset is entirely synthetic, with bars in random colors and styles. Most models especially struggle with reasoning and data retrieval questions in DVQA. 🧠🔍

DVQA Result Image
Percentage (%) accuracy results on DVQA. Best performers based on paired asymptotic McNemar tests (α = 0.05) are in bold. For comparison, PReFIL and Human results correspond to performance on Test-Novel, where PReFIL uses Improved OCR. PReFIL is a DVQA system trained on DVQA’s training set.
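As a final illustration, the DVQA (and TDIUC) accuracies reported here are exact-match scores broken down by question type. The sketch below assumes a simple normalization (lowercasing and whitespace stripping) and a hypothetical record schema with `question_type`, `prediction`, and `answer` fields, which may differ from the official scoring scripts.

```python
# Sketch: exact-match accuracy per DVQA question type (e.g., structure / data retrieval / reasoning).
# The normalization rule and record field names below are assumptions for illustration.
from collections import defaultdict

def normalize(ans: str) -> str:
    return ans.strip().lower()

def accuracy_by_type(records):
    """records: iterable of dicts with 'question_type', 'prediction', 'answer' keys (assumed schema)."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        qtype = r["question_type"]
        total[qtype] += 1
        correct[qtype] += int(normalize(r["prediction"]) == normalize(r["answer"]))
    return {qtype: 100.0 * correct[qtype] / total[qtype] for qtype in total}

print(accuracy_by_type([
    {"question_type": "reasoning", "prediction": "Yes", "answer": "yes"},
    {"question_type": "data retrieval", "prediction": "7", "answer": "9"},
]))
```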