MVTamperBench: Evaluating Robustness of Vision-Language Models

(Work in progress / benchmark is being expanded)


What is MVTamperBench?

MVTamperBench contains 17,435 tampered videos and 3,487 untampered videos spanning five video tampering types, providing a robust evaluation across 19 video understanding tasks.
By focusing on tampering resilience and real-world challenges such as clickbait detection, harmful content moderation, and policy enforcement, MVTamperBench establishes a benchmark for developing tamper-resilient AI systems. It advances adversarial robustness in video understanding, fostering trust in MLLMs for safety-critical domains.
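
The expansion from 3,487 originals to 17,435 tampered clips follows from applying each of the five tampering types once per video (3,487 × 5 = 17,435). The sketch below is a minimal illustration, assuming a clip is a list of frame arrays; the function names and parameters are illustrative placeholders, not the benchmark's released code.

    import numpy as np

    def rotate(frames, k=1):
        # Spatial tampering: rotate every frame by k * 90 degrees.
        return [np.rot90(f, k=k) for f in frames]

    def mask(frames, start, end):
        # Spatial tampering: black out the frames in [start, end).
        out = [f.copy() for f in frames]
        for f in out[start:end]:
            f[:] = 0
        return out

    def substitute(frames, donor, start, end):
        # Temporal tampering: replace a segment with frames from another video.
        return frames[:start] + donor[: end - start] + frames[end:]

    def repeat(frames, start, end):
        # Temporal tampering: duplicate the segment [start, end) in place.
        return frames[:end] + frames[start:end] + frames[end:]

    def drop(frames, start, end):
        # Temporal tampering: remove the segment [start, end) entirely.
        return frames[:start] + frames[end:]

    # One original clip yields five tampered variants: 3,487 x 5 = 17,435.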

Each example comes from one task in MVTamperBench, illustrating the diverse tampering applied across temporal relations in videos.

Abstract

Multimodal Large Language Models (MLLMs), also known as Large Multimodal Models (LMMs), are a recent advancement of Vision-Language Models (VLMs) that has driven major advances in video understanding, yet their vulnerability to adversarial tampering and manipulation remains underexplored. To address this gap, we introduce MVTamperBench, a benchmark that systematically evaluates MLLM robustness against five prevalent tampering techniques: rotation, masking, substitution, repetition, and dropping. It is built from 3.4K original videos, expanded to over 17K tampered clips spanning 19 video understanding tasks.

MVTamperBench challenges models to detect manipulations in spatial and temporal coherence. We evaluate 45 recent MLLMs from 15+ model families, revealing substantial variability in resilience across tampering types and showing that larger parameter counts do not necessarily guarantee robustness. MVTamperBench sets a new benchmark for developing tamper-resilient MLLMs for safety-critical applications, including detecting clickbait, preventing harmful content distribution, and enforcing policies on media platforms. We release all code and data to foster open research in trustworthy video understanding.

Quantitative Results

Results of different models on MVTamperBench. The first row shows the different tampering types and the overall performance across tamper types. Model performance varies drastically across model families and sizes. We further categorize the models based on performance and model size.

Overall performance: the average accuracies of even the most advanced multimodal LLMs on MVTamperBench are no better than 87%, which is still far from satisfactory utility. The mean accuracies of open-source multimodal LLMs that support video range between 27.54% and 87.90%. Notably, there is no obvious correlation between model size and performance, indicating the importance of training data and training procedures in developing multimodal LLMs with better temporal and spatial understanding of videos/scenes and multi-image scenarios. For certain models and tasks, some results are only on par with, or even below, random guessing.
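
As a quick illustration of the comparison against chance, the snippet below summarizes a set of overall accuracies against a hypothetical random-guess baseline; only the 27.54% and 87.90% endpoints come from the reported range, while the middle value and the assumed 4-way chance level are placeholders.

    import statistics

    # Overall accuracy (%) per open-source video MLLM; only the two endpoints
    # come from the reported range, the middle value is a placeholder.
    overall_acc = {"model-a": 27.54, "model-b": 61.30, "model-c": 87.90}

    chance = 100 / 4  # assumed 4-way multiple-choice baseline (25%)

    print(f"min={min(overall_acc.values()):.2f}%  "
          f"mean={statistics.mean(overall_acc.values()):.2f}%  "
          f"max={max(overall_acc.values()):.2f}%  chance={chance:.1f}%")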

Experiment Analysis


1. Does MLLM performance depend on model size?

As shown in the figure, MLLM performance is spread across the spectrum and is not correlated with model size (Pearson correlation = 0.05). We also observe that a few model families perform better with smaller models than with their larger variants (e.g., Molmo-1B vs. Molmo-72B), while others show similar performance between smaller and larger variants (e.g., Llama-3.2-Vision 11B and 90B).
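
A minimal sketch of this size-versus-performance check, assuming per-model overall accuracies are available; the scipy.stats.pearsonr call is standard, but the model entries below are placeholders rather than our reported numbers.

    from scipy.stats import pearsonr

    # (parameter count in billions, overall accuracy in %) per model;
    # the values are placeholders, not the reported numbers.
    results = {
        "model-a-1b":  (1,  62.0),
        "model-b-11b": (11, 55.0),
        "model-c-72b": (72, 64.0),
    }

    sizes = [s for s, _ in results.values()]
    accs  = [a for _, a in results.values()]

    r, p = pearsonr(sizes, accs)  # Pearson correlation between size and accuracy
    print(f"Pearson r = {r:.2f} (p = {p:.2f})")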


2. In which Video Understanding tasks do MLLMs show relative strengths and weaknesses?

While certain tasks exhibit high tampering-detection F1 scores, others remain significantly more challenging.

Easier Tasks such as Episodic Reasoning, Scene Transition, Ego-centric Navigation, and State Change consistently achieve higher F1 scores, as shown in the figure below. These tasks often involve shorter temporal dependencies and simpler spatial reasoning, allowing models to rely on pre-trained vision-language features rather than advanced temporal reasoning.

Challenging Tasks such as Action Prediction, Counterfactual Inference, and Fine-Grained Action pose significant challenges for tampering detection. These tasks inherently require complex temporal reasoning and context preservation, making them highly sensitive to disruptions introduced by tampering techniques.
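
For reference, the per-task F1 scores discussed above can be computed by grouping per-clip predictions by task. The sketch below assumes a simple DataFrame layout with hypothetical column names and rows; it is not the benchmark's evaluation script.

    import pandas as pd
    from sklearn.metrics import f1_score

    # One row per evaluated clip: its video task, the model's prediction, and
    # the ground-truth label; the rows here are illustrative only.
    df = pd.DataFrame({
        "task":  ["Scene Transition", "Scene Transition", "Action Prediction"],
        "pred":  [1, 0, 1],
        "label": [1, 0, 0],
    })

    per_task_f1 = df.groupby("task").apply(
        lambda g: f1_score(g["label"], g["pred"], zero_division=0)
    )
    print(per_task_f1.sort_values(ascending=False))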


3. How do different tampering types impact performance across video tasks and MLLMs?

We compare the overall performance of all studied models across the proposed tamper types. Masking is relatively easier, as it creates information loss that MLLMs are able to manage better than the other spatial (Rotation) and temporal (Repetition, Substitution, Dropping) tampering types.
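
A sketch of the per-tamper-type aggregation behind this comparison, assuming per-clip correctness flags are available; the column names and rows are illustrative placeholders.

    import pandas as pd

    # Per-clip correctness flags, one row per (model, tamper type) evaluation;
    # values are illustrative, not the released results.
    df = pd.DataFrame({
        "model":   ["model-a", "model-a", "model-b", "model-b"],
        "tamper":  ["masking", "rotation", "masking", "dropping"],
        "correct": [1, 0, 1, 0],
    })

    # Mean accuracy per tamper type, averaged over all models and clips.
    print(df.groupby("tamper")["correct"].mean().sort_values(ascending=False))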


We observe a similar trend across the various video tasks, where masking is either easier for a few video tasks or poses challenges similar to the other tamper types. Video tasks such as Action Prediction, Action Count, and Scene Transition struggle the most overall under tampering.


BibTeX


        @article{agarwal2024mvtamperbench,
          title   = {MVTamperBench: Evaluating Robustness of Vision-Language Models},
          author  = {Agarwal, Amit and Panda, Srikant and Charles, Angeline and Kumar, Bhargava and Patel, Hitesh and Pattnayak, Priyanranjan and Rafi, Taki Hasan and Kumar, Tejaswini and Chae, Dong-Kyu},
          journal = {arXiv preprint arXiv:2412.19794},
          year    = {2024}
        }