Humans can naturally identify, reason about, and explain anomalies in their environment.
We introduce CAVE, the first benchmark of real-world commonsense anomalies.
CAVE contains images captured in real-world scenarios and supports three open-ended tasks:
anomaly description, explanation, and justification.
It also includes numerical attributes representing how humans perceive these anomalies:
severity, surprisal, and complexity.
These annotations draw inspiration from cognitive science research on how humans
identify and resolve anomalies, providing a comprehensive framework for evaluating
Vision-Language Models (VLMs) in detecting and understanding anomalies.
We show that state-of-the-art VLMs struggle with visual anomaly perception
and commonsense reasoning, even with advanced prompting strategies.
Figure: Main overview of the CAVE benchmark.
CAVE was created to evaluate how Vision-Language Models (VLMs) handle
commonsense anomalies in real-world images.
(1) Image Collection: Images were sourced from the top 1,000 posts across various subreddits and filtered to retain only high-quality, harmless content (a minimal collection sketch follows Figure 1).
(2) Human Annotation: Initial annotations were performed by Mechanical Turk workers, focusing on anomaly identification and classification.
(3) Expert Verification & Annotation: A subsequent round of expert-driven annotation and verification ensured high-quality, consistent annotations across the three open-ended tasks (description, explanation, and justification) and along the numerical axes of severity, surprisal, and complexity.
Figure 1: Overview of the CAVE data collection and annotation pipeline.
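As a rough illustration of step (1), here is a minimal collection sketch in Python, assuming the PRAW Reddit API wrapper; the credentials, subreddit names, and filtering rules below are placeholders rather than the ones actually used to build CAVE.

import praw

# Placeholder credentials; a real run needs a registered Reddit application.
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="cave-image-collection/0.1",
)

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")
SUBREDDITS = ["mildlyinteresting", "confusing_perspective"]  # illustrative only

candidates = []
for name in SUBREDDITS:
    # Reddit listings cap out around 1,000 items, matching the top-1,000 posts per subreddit.
    for post in reddit.subreddit(name).top(time_filter="all", limit=1000):
        if post.over_18:  # drop NSFW posts as part of the harmlessness filter
            continue
        if not post.url.lower().endswith(IMAGE_EXTENSIONS):
            continue  # keep direct image links only
        candidates.append({"id": post.id, "title": post.title, "url": post.url})

print(f"Collected {len(candidates)} candidate images for manual quality filtering.")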
The dataset includes 361 images and 334 anomalies across categories such as entity presence/absence, attribute errors, spatial relation mismatches, uniformity breaches, and textual anomalies; each image contains 0 to 3 anomalies. Each anomaly is paired with human-written descriptions, explanations, and plausible justifications, along with fine-grained scores: Severity (impact or risk), Surprisal (unexpectedness), and Complexity (difficulty of detection).
Figure 2: Statistics of anomaly categories and numerical attributes.
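To make the annotation structure concrete, the sketch below shows one hypothetical way a CAVE example could be represented in Python; the class and field names are illustrative assumptions, not the benchmark's actual release schema.

from dataclasses import dataclass, field
from typing import List

@dataclass
class AnomalyAnnotation:
    category: str               # e.g. "entity absence", "attribute error", "textual anomaly"
    description: str            # human-written description of the anomaly
    explanation: str            # why the observation violates commonsense expectations
    justifications: List[str]   # plausible reasons the anomaly could have occurred
    severity: float             # impact or risk
    surprisal: float            # unexpectedness
    complexity: float           # difficulty of detection

@dataclass
class CaveExample:
    image_path: str
    anomalies: List[AnomalyAnnotation] = field(default_factory=list)  # 0 to 3 per image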
We evaluated 8 state-of-the-art VLMs on CAVE. Even the strongest model achieved only around 57% F1 on anomaly description. Models perform best on severe and surprising anomalies but struggle with perceptually complex ones, especially those requiring spatial reasoning and pattern detection.
Figure 3: Performance of leading VLMs for anomaly detection (AD) and anomaly explanation (AE), evaluated using GPT-4o-as-a-judge. We test various prompting strategies: Chain-of-Thought (CoT), Set-of-Mark (SoM), multi-step reasoning (MS CoT), and self-consistency (CoT + consist.).
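The snippet below sketches how Chain-of-Thought prompting with self-consistency and GPT-4o-as-a-judge scoring could be wired together, using the OpenAI Python SDK as a stand-in for the evaluated VLMs; the prompt wording, sampling settings, and matching criterion are illustrative assumptions, not the paper's exact protocol.

from collections import Counter
from openai import OpenAI

client = OpenAI()

COT_PROMPT = (
    "Look at the image and list any commonsense anomalies you see. "
    "Think step by step before giving your final answer."
)

def describe_anomalies(image_url: str, n_samples: int = 5) -> str:
    """Sample several CoT answers and keep the most frequent one (self-consistency)."""
    answers = []
    for _ in range(n_samples):
        response = client.chat.completions.create(
            model="gpt-4o",
            temperature=0.7,  # sampling diversity is what self-consistency votes over
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": COT_PROMPT},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }],
        )
        answers.append(response.choices[0].message.content.strip())
    # Exact-match voting keeps the sketch short; free-form answers would normally be
    # normalized or clustered before voting.
    return Counter(answers).most_common(1)[0][0]

def judge_answer(model_answer: str, gold_description: str) -> bool:
    """Ask GPT-4o whether the model's answer matches the human-written annotation."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                "Reference anomaly description:\n" + gold_description
                + "\n\nModel answer:\n" + model_answer
                + "\n\nDoes the model answer describe the same anomaly? Reply 'yes' or 'no'."
            ),
        }],
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")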
CAVE captures a wide variety of real-world commonsense anomalies, ranging from misplaced objects to textual inconsistencies. This diversity requires models to go beyond simple object recognition and reason about context and expectations.
Figure 4: Examples from CAVE showing diverse anomaly types.
@inproceedings{bhagwatkar2025cave,
title={CAVE: Commonsense Anomalies in Visual Environments},
author={Bhagwatkar, Rishika and Montariol, Syrielle and Romanou, Angelika and Borges, Beatriz and Rish, Irina and Bosselut, Antoine},
booktitle={EMNLP 2025},
year={2025}
}
Rishika Bhagwatkar: rishika.bhagwatkar@mila.quebec
Syrielle Montariol: syrielle.montariol@epfl.ch
This website is adapted from LLaVA-VL, Nerfies, and VL-RewardBench, licensed under CC BY-SA 4.0.