MVTamperBench: Adversarial Benchmark for Evaluating Video-Language Models

(Work in Progress: the benchmark is being expanded)

1 Liverpool John Moores University · 2 Birla Institute of Technology · 3 Christ University · 4 Columbia University · 5 New York University · 6 University of Washington · 7 Hanyang University
Equal Contribution · *Leadership

What is MVTamperBench?

MVTamperBench is a benchmark of 18,495 videos spanning 5 video tampering scenarios, providing a robust evaluation across 16 video understanding tasks.

Each example comes from one task in MVTamperBench and illustrates the diverse tampering applied to temporal relations in videos.
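
To make the structure concrete, the sketch below shows one way a single benchmark item could be represented in Python. The field names and values here are hypothetical illustrations, not the benchmark's actual schema; they only convey that each item pairs a video, a task, and a tampering type with a question to answer.

    # Hypothetical representation of a single MVTamperBench item; field names
    # and values are illustrative only, not the benchmark's actual schema.
    sample_item = {
        "video": "videos/000123_substitution.mp4",   # tampered clip (hypothetical path)
        "task": "scene transition",                  # one of the video understanding tasks
        "tamper_type": "substitution",               # rotation / dropping / masking / substitution / repetition
        "question": "Which part of the clip appears to have been tampered with?",
        "options": ["beginning", "middle", "end", "no tampering"],
        "answer": "middle",
    }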

MVTamperBench -- Novel Features


  • MVTamperBench provides a systematic evaluation of model robustness against video tampering effects by introducing diverse manipulations (see the sketch after this list), enabling a deeper understanding of model strengths and vulnerabilities under adversarial scenarios.
  • MVTamperBench evaluates video tampering across a comprehensive range of 20 temporal understanding abilities, e.g., scene transition, action localization, and character order.
  • MVTamperBench contains 9 diverse spatial setups and relations for evaluating tampering, e.g., counting, pose, cognition, action, scene, and objects.
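
The snippet below is a minimal sketch of what the five tampering effects (rotation, dropping, masking, substitution, and repetition) could look like when applied to a clip represented as a list of frames. It is not the benchmark's actual implementation; the segment boundaries, rotation angle, and mask size are illustrative assumptions.

    # Minimal sketch of the five tampering effects applied to a video given as a
    # list of H x W x 3 numpy frames. Parameters are illustrative assumptions,
    # not the benchmark's actual construction procedure.
    import numpy as np

    def rotate(frames, start, end, k=2):
        """Rotate frames in [start, end) by k * 90 degrees."""
        out = list(frames)
        for i in range(start, end):
            out[i] = np.rot90(out[i], k)
        return out

    def drop(frames, start, end):
        """Remove the segment [start, end) entirely."""
        return frames[:start] + frames[end:]

    def mask(frames, start, end):
        """Black out the central region of each frame in [start, end)."""
        out = list(frames)
        for i in range(start, end):
            f = out[i].copy()
            h, w = f.shape[:2]
            f[h // 4: 3 * h // 4, w // 4: 3 * w // 4] = 0
            out[i] = f
        return out

    def substitute(frames, start, end, donor_frames):
        """Replace the segment [start, end) with frames from another clip."""
        return frames[:start] + list(donor_frames[: end - start]) + frames[end:]

    def repeat(frames, start, end):
        """Duplicate the segment [start, end) immediately after itself."""
        return frames[:end] + frames[start:end] + frames[end:]

In this sketch each effect targets a contiguous segment of frames; how the benchmark actually selects and parameterizes the tampered segment may differ.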



Abstract

Recent advancements in multimodal research reveal an increasing trend of leveraging Vision-Language Models (VLMs) for complex video understanding tasks. However, there is still little research on how stable these models are in real-world situations and how susceptible they are to realistic perturbations. To address this gap, we present MVTamperBench, a thorough benchmark created to assess VLM resilience against manipulations including rotation, dropping, masking, substitution, and repetition. MVTamperBench systematically evaluates state-of-the-art models and reveals significant variability in robustness: models like InternVL2-4B and the MOLMO variants achieve near-perfect accuracy across all manipulations, while models like Llama-VILA1.5-8b and the llavaonevision variants show severe vulnerabilities with accuracy near zero. To facilitate adoption, we have integrated the benchmark into VLMEvalKit, enabling seamless evaluation and promoting improvements in model robustness. Our research is an important step toward dependable VLMs that can function well in the face of real-world disturbances.
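
Because the benchmark is integrated into VLMEvalKit, evaluation can go through VLMEvalKit's standard run.py entry point. The snippet below sketches one way to invoke it from Python; the dataset and model identifiers ("MVTamperBench", "InternVL2-8B") are assumptions about how they are registered in VLMEvalKit and may need to be adjusted.

    # Sketch: driving VLMEvalKit's run.py from Python. The dataset and model
    # identifiers below are assumptions about their registered names.
    import subprocess

    subprocess.run(
        [
            "python", "run.py",
            "--data", "MVTamperBench",   # assumed dataset name
            "--model", "InternVL2-8B",   # assumed model identifier
            "--verbose",
        ],
        check=True,
    )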

Quantitative Results

Results of different models on MVTamperBench. The first row shows the different tampering types and the overall performance. Model performance varies drastically across model families, with InternVL2-8B being more robust to tampering detection than the VILA models.

Overall performance: the average accuracy of even the most advanced multimodal LLMs on MVTamperBench is no better than 84%, which is still far from satisfactory utility. The mean accuracies of open-source multimodal LLMs with video support hover between 27.54% and 84.80%. Notably, there is no obvious correlation between model size and performance, underscoring the importance of training data and training procedures in developing multimodal LLMs with better temporal and spatial understanding of videos/scenes and multi-image scenarios. For certain models and tasks, results are only on par with or even below random guessing.

Experiment Analysis


1. In which tampering type do multimodal LLMs show relative strengths and weaknesses?

As shown in the figure, multimodal LLMs perform relatively better on image-text matching, visual retrieval, and diagram understanding. In contrast, multi-image ordering and visual grounding are more challenging for these models, because these tasks require understanding the whole multi-image context and then performing more complex reasoning across images and modalities.


2. How do multimodal LLMs perform on average?

We compare the overall performance of all studied models. InternVL2-8B consistently outperformed other models, achieving the highest accuracy across all tampering effects. Conversely, VILA1.5-13b and VILA1.5-3b displayed significant vulnerabilities, particularly under the Dropping and Substitution manipulations.


BibTeX


        @article{mvtamperbench2024,
          title={MVTamperBench: A Benchmark for Robustness Against Video Tampering Effects},
          author={Amit Agarwal and Srikant Panda and Angeline and Bhargava and Hitesh and Priyan and Taki and Tejaswini},
          journal={Journal of Video Understanding},
          year={2024}
        }