(Work in progress: the benchmark is being expanded.)
Multimodal Large Language Models (MLLMs), also known as Large Multimodal Models (LMMs), are a recent advancement of
Vision-Language Models (VLMs) and have driven major advances in video understanding, yet their vulnerability to
adversarial tampering and manipulations remains underexplored. To address this gap, we introduce \textbf{MVTamperBench},
a benchmark that systematically evaluates MLLM robustness against five prevalent tampering techniques: rotation,
masking, substitution, repetition, and dropping. Built from 3.4K original videos and expanded to over 17K tampered clips
spanning 19 video tasks, MVTamperBench challenges models to detect manipulations that disrupt spatial and temporal
coherence. We evaluate 45 recent MLLMs
from 15+ model families, revealing substantial variability in resilience across tampering types and showing that larger
parameter counts do not necessarily guarantee robustness. MVTamperBench sets a new benchmark for developing
tamper-resilient MLLMs for safety-critical applications, including detecting clickbait, preventing harmful content
distribution, and enforcing policies on media platforms. We release all
code and data to foster open research in trustworthy video understanding.
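To make the five tampering types concrete, the sketch below shows how each operation could be applied to a video represented as a NumPy array of frames with shape (T, H, W, C). This is a minimal illustration under assumed function names, segment boundaries, and parameters; it is not the benchmark's actual tampering pipeline.

```python
# Illustrative sketch of the five tampering types (rotation, masking,
# substitution, repetition, dropping) on a video stored as a NumPy array of
# frames with shape (T, H, W, C). Names and parameters are assumptions for
# demonstration only, not the official MVTamperBench implementation.
import numpy as np


def rotate_segment(frames: np.ndarray, start: int, end: int) -> np.ndarray:
    """Rotate the frames in [start, end) by 180 degrees in the spatial plane."""
    out = frames.copy()
    out[start:end] = np.rot90(out[start:end], k=2, axes=(1, 2))
    return out


def mask_segment(frames: np.ndarray, start: int, end: int) -> np.ndarray:
    """Black out the frames in [start, end)."""
    out = frames.copy()
    out[start:end] = 0
    return out


def substitute_segment(frames: np.ndarray, donor: np.ndarray,
                       start: int, end: int) -> np.ndarray:
    """Replace the frames in [start, end) with frames from another video."""
    out = frames.copy()
    out[start:end] = donor[: end - start]
    return out


def repeat_segment(frames: np.ndarray, start: int, end: int) -> np.ndarray:
    """Duplicate the frames in [start, end) immediately after the segment."""
    return np.concatenate([frames[:end], frames[start:end], frames[end:]], axis=0)


def drop_segment(frames: np.ndarray, start: int, end: int) -> np.ndarray:
    """Remove the frames in [start, end) entirely."""
    return np.concatenate([frames[:start], frames[end:]], axis=0)


if __name__ == "__main__":
    video = np.random.randint(0, 255, size=(64, 224, 224, 3), dtype=np.uint8)
    donor = np.random.randint(0, 255, size=(64, 224, 224, 3), dtype=np.uint8)
    tampered = drop_segment(mask_segment(video, 16, 32), 40, 48)
    print(video.shape, tampered.shape)  # (64, 224, 224, 3) (56, 224, 224, 3)
```

Operations that preserve frame count (rotation, masking, substitution) tamper with spatial coherence, while repetition and dropping alter the timeline and therefore test temporal coherence.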
@article{agarwal2024mvtamperbench,
  title={MVTamperBench: Evaluating Robustness of Vision-Language Models},
  author={Agarwal, Amit and Panda, Srikant and Charles, Angeline and Kumar, Bhargava and Patel, Hitesh and Pattnayak, Priyaranjan and Rafi, Taki Hasan and Kumar, Tejaswini and Chae, Dong-Kyu},
  journal={arXiv preprint arXiv:2412.19794},
  year={2024}
}