Revisiting Multi-Modal LLM Evaluation

1University of Rochester 2SRI International 3Adobe
*jlu59@u.rochester.edu

In this paper, we pioneer evaluating recent MLLMs (LLaVA-1.5, LLaVA-NeXT, BLIP2, InstructBLIP, GPT-4V, and GPT-4o) on datasets designed to address the weaknesses of earlier benchmarks such as VQAv2 and RefCOCO, including extreme bias, spurious correlations, and the lack of fine-grained analysis.
We assess four VQA datasets:

  1. VQDv1, which requires identifying all image regions that satisfy a given query;
  2. TallyQA, which contains both simple and complex counting questions;
  3. TDIUC, which permits fine-grained analysis across 12 question types; and
  4. DVQA, which requires optical character recognition for chart understanding.

Our experiments reveal the weaknesses of today's MLLMs that have not previously been reported. Our code is integrated into a fork of the widely used LAVIS framework for MLLM evaluation, enabling the rapid assessment of future MLLMs.
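To illustrate how such an evaluation can be driven programmatically in a LAVIS-style codebase, here is a minimal sketch that loads a BLIP-2 model through LAVIS's `load_model_and_preprocess` and queries it on a single image-question pair. The model name, checkpoint type, image path, and prompt template are assumptions for illustration; the dataset-level evaluation in our fork runs through LAVIS's config-driven task pipeline rather than this one-off call.

```python
# Minimal sketch (not the full evaluation pipeline): query a BLIP-2 model via LAVIS.
# The model name/type, image path, and prompt format below are assumptions.
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load a BLIP-2 FLAN-T5 model together with its matching image preprocessor.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_t5", model_type="pretrain_flant5xl", is_eval=True, device=device
)

image = Image.open("example.jpg").convert("RGB")
image = vis_processors["eval"](image).unsqueeze(0).to(device)

# Ask a single VQA-style question; the prompt template is an assumption.
answer = model.generate({"image": image, "prompt": "Question: How many dogs are sitting? Answer:"})
print(answer)
```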

VQDv1

Description 🗂️: VQDv1 challenges models to generate multiple bounding boxes 📦, not just one! Unlike typical datasets, queries in VQDv1 can result in 0 to N bounding boxes, testing general detection skills 🔍. This adds an extra layer of difficulty.

Example Image
Example images of VQDv1 with multiple bounding boxes.

Results 📊: All models show decreased Recall as the number of bounding boxes 📦 increases. This indicates that models struggle to identify multiple objects matching the query. 🤖🔍

VQDv1 Result Image
Comparison of Multi-Modal LLMs on VQDv1. MGM refers to MiniGemini 7B, LLaVA-OV to LLaVA OneVision, L to LLaVA-1.5, and L-NeXT to LLaVA-NeXT.
The large gap from optimal performance on VQDv1 suggests a major limitation of today's MLLMs.
VQDv1 Performance Graphs
Precision and recall graphs for VQDv1.
This trend suggests that increasing the number of objects to be grounded challenges the models' detection capabilities, potentially due to limitations in how they interpret contextual information within visual scenes.
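For readers re-implementing VQDv1-style scoring, the sketch below shows one common way to score a variable-length set of predicted boxes against 0 to N ground-truth boxes: greedy one-to-one matching at an IoU threshold, from which precision and recall follow. The 0.5 threshold, the greedy matching rule, and the handling of empty prediction sets are assumptions for illustration, not necessarily the paper's exact protocol.

```python
# Sketch of IoU-based scoring for a query with 0..N ground-truth boxes.
# The IoU threshold, greedy matching rule, and empty-set conventions are assumptions.
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def iou(a: Box, b: Box) -> float:
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def precision_recall(preds: List[Box], gts: List[Box], thr: float = 0.5):
    if not preds and not gts:
        return 1.0, 1.0  # query with zero target boxes and nothing predicted
    matched_gt = set()
    tp = 0
    for p in preds:  # greedily match each prediction to an unused ground-truth box
        best_j, best_iou = -1, thr
        for j, g in enumerate(gts):
            if j in matched_gt:
                continue
            v = iou(p, g)
            if v >= best_iou:
                best_j, best_iou = j, v
        if best_j >= 0:
            matched_gt.add(best_j)
            tp += 1
    precision = tp / len(preds) if preds else 0.0  # conventions vary for empty predictions
    recall = tp / len(gts) if gts else 0.0
    return precision, recall
```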

TallyQA

Description 🔢: TallyQA tests models' counting skills 🔢. It includes simple questions on object detection and complex ones needing advanced reasoning, like pose estimation ("How many dogs are sitting?") and positional reasoning ("How many dogs are in front of the white building?"). 🏢🔍

Example Image
Example images from TallyQA showing various counting challenges.

Results 📊: TallyQA includes simple and complex counting questions 🔢. Some models, like BLIP2, excel at simple questions but struggle with complex ones. For example, BLIP2 matches GPT-4 on simple tasks but drops significantly on complex questions requiring advanced reasoning, like pose estimation 🕺 and positional reasoning 📍.

TallyQA Result Image
Results on TallyQA. For Acc., best performers based on paired asymptotic McNemar tests (α = 0.05) are in bold. For RMSE, the highest value is bolded. For comparison, the result from SMoLA is the current best on TallyQA.
TallyQA Simple Counting Performance Graph
TallyQA Simple
TallyQA Complex Counting Performance Graph
TallyQA Complex
Accuracy distribution by correct answer for simple (left) and complex (right) counting questions in TallyQA. These results suggest the necessity of incorporating more complex counting questions into the evaluation of visual reasoning models.
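For completeness, the two counting metrics reported above, accuracy and RMSE, follow their standard definitions; the short sketch below assumes model answers have already been parsed into integers.

```python
# Sketch: accuracy and RMSE for counting questions (predictions already parsed to ints).
import math
from typing import List

def counting_metrics(preds: List[int], gts: List[int]):
    assert len(preds) == len(gts) and preds
    acc = sum(p == g for p, g in zip(preds, gts)) / len(gts)
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in zip(preds, gts)) / len(gts))
    return acc, rmse

# Example: answering "2" when the true count is 3 hurts RMSE less than answering "7".
print(counting_metrics([2, 3, 0, 7], [3, 3, 0, 3]))
```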

TDIUC

Description 🎯: TDIUC tests models' versatility across 12 tasks, including object 🏷️, attribute 🔖, and activity recognition 🏃, as well as overall scene understanding 🌆. The meaningful categories of question types allow fine-grained analysis of models' abilities, highlighting specific strengths and weaknesses.

Example Image
Example images from TDIUC showing various question types.

Results 📊: TDIUC results provide detailed performance metrics for each model on different question types. Models show varying strengths and weaknesses across these types. Key takeaways: all models struggle with positional reasoning 📍, and LLaVA-NeXT performs best on almost all question types, even surpassing the previous state-of-the-art VQA algorithm, MuRel.

TDIUC Result Image
Accuracy on TDIUC for each question type. Best performers based on paired asymptotic McNemar tests (α = 0.05) are in bold. For comparison, MuRel is the previous best result from training on TDIUC.
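The pairwise comparisons that determine the bolding here (and in the TallyQA and DVQA tables) are paired asymptotic McNemar tests. As a rough sketch of that test using `statsmodels`, with made-up per-question correctness vectors for two models:

```python
# Sketch: paired asymptotic McNemar test on two models' per-question correctness.
# The correctness vectors here are made up for illustration.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

model_a = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1], dtype=bool)  # 1 = answered correctly
model_b = np.array([1, 0, 0, 1, 0, 0, 1, 0, 1, 0], dtype=bool)

# 2x2 table of paired outcomes: rows = model A correct/incorrect, cols = model B.
table = np.array([
    [np.sum(model_a & model_b),  np.sum(model_a & ~model_b)],
    [np.sum(~model_a & model_b), np.sum(~model_a & ~model_b)],
])

result = mcnemar(table, exact=False, correction=True)  # asymptotic (chi-square) version
print(result.statistic, result.pvalue)  # difference is significant at alpha = 0.05 if pvalue < 0.05
```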

DVQA

Description 📊: DVQA tests models' ability to interpret and analyze visual data in chart form 📊. It requires OCR skills 🔡 and handling unusual words found in charts. The synthetically generated images 🖥️ pose unique challenges compared to natural images 🌳.

Example Image
Example images from DVQA showing various chart types.

Results 📊: DVQA results show that while some models perform well on natural images 🌳, they struggle with synthetic images 🖥️. The DVQA dataset is entirely synthetic, with bars in random colors and styles. Most models especially struggle with reasoning and data retrieval questions in DVQA. 🧠🔍

DVQA Result Image
Percentage (%) accuracy results on DVQA. Best performers based on paired asymptotic McNemar tests (α = 0.05) are in bold. For comparison, PReFIL and Human results correspond to performance on Test-Novel, where PReFIL uses Improved OCR. PReFIL is a DVQA system trained on DVQA’s training set.
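As a final illustration, the DVQA (and TDIUC) accuracies reported here are exact-match scores broken down by question type. The sketch below assumes a simple normalization (lowercasing and whitespace stripping) and a hypothetical record schema with `question_type`, `prediction`, and `answer` fields, which may differ from the official scoring scripts.

```python
# Sketch: exact-match accuracy per DVQA question type (e.g., structure / data retrieval / reasoning).
# The normalization rule and record field names below are assumptions for illustration.
from collections import defaultdict

def normalize(ans: str) -> str:
    return ans.strip().lower()

def accuracy_by_type(records):
    """records: iterable of dicts with 'question_type', 'prediction', 'answer' keys (assumed schema)."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        qtype = r["question_type"]
        total[qtype] += 1
        correct[qtype] += int(normalize(r["prediction"]) == normalize(r["answer"]))
    return {qtype: 100.0 * correct[qtype] / total[qtype] for qtype in total}

print(accuracy_by_type([
    {"question_type": "reasoning", "prediction": "Yes", "answer": "yes"},
    {"question_type": "data retrieval", "prediction": "7", "answer": "9"},
]))
```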