( Work in Progress / Benchmark is being expanded )
Recent advances in multimodal research show a growing trend of applying Vision-Language Models (VLMs) to complex video understanding tasks. However, there is still little work examining how stable these models are in real-world settings and how susceptible they are to realistic perturbations. To address this gap, we present MVTamperBench, a comprehensive benchmark for assessing VLM resilience against video tampering effects, including rotation, dropping, masking, substitution, and repetition. MVTamperBench systematically evaluates state-of-the-art models and reveals substantial variability in robustness: models such as InternVL2-4B and the MOLMO variants achieve near-perfect accuracy across all tampering types, while models such as Llama-VILA1.5-8b and the LLaVA-OneVision variants show severe vulnerabilities, with accuracy close to zero. To facilitate adoption, we have integrated the benchmark into VLMEvalKit, enabling seamless evaluation and promoting improvements in model robustness. Our work is an important step toward dependable VLMs that function reliably in the face of real-world disturbances.
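To make the five tampering effects concrete, below is a minimal, hypothetical sketch of what they might look like when applied to a video represented as a list of H x W x 3 NumPy frames. The function names, parameters, and segment conventions are illustrative assumptions, not the benchmark's actual implementation.

```python
# Hypothetical sketch of the five tampering effects evaluated in MVTamperBench.
# Assumes a video is a list of H x W x 3 uint8 NumPy arrays (one per frame).
import numpy as np


def rotate(frames, k=1):
    """Rotate every frame by k * 90 degrees."""
    return [np.rot90(f, k=k) for f in frames]


def drop(frames, start, length):
    """Remove a contiguous segment of frames."""
    return frames[:start] + frames[start + length:]


def mask(frames, start, length):
    """Black out a contiguous segment of frames."""
    out = list(frames)
    for i in range(start, min(start + length, len(out))):
        out[i] = np.zeros_like(out[i])
    return out


def substitute(frames, start, length, foreign_frames):
    """Replace a segment with frames taken from an unrelated video."""
    out = list(frames)
    out[start:start + length] = foreign_frames[:length]
    return out


def repeat(frames, start, length):
    """Duplicate a segment in place so its content plays twice."""
    segment = frames[start:start + length]
    return frames[:start + length] + segment + frames[start + length:]
```

Once the benchmark data is registered in VLMEvalKit, evaluation follows the toolkit's standard model/dataset entry point; the exact dataset identifier used for MVTamperBench there is not specified here.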
@article{mvtamperbench2024,
  title={MVTamperBench: A Benchmark for Robustness Against Video Tampering Effects},
  author={Amit Agarwal and Srikant Panda and Angeline and Bhargava and Hitesh and Priyan and Taki and Tejaswini},
  journal={Journal of Video Understanding},
  year={2024}
}