CAVE

Detecting and Explaining Commonsense Anomalies
in Visual Environments

EMNLP 2025 Main

Rishika Bhagwatkar1,2,*, Syrielle Montariol1,*, Angelika Romanou1, Beatriz Borges1, Irina Rish2, Antoine Bosselut1
1 École Polytechnique Fédérale de Lausanne (EPFL), 2 Quebec Artificial Intelligence Institute (Mila)
* Equal Contribution

Summary

Humans can naturally identify, reason about, and explain anomalies in their environment.

We introduce CAVE, the first benchmark of real-world commonsense anomalies. CAVE contains images captured in real-world scenarios and supports three open-ended tasks: anomaly description, explanation, and justification. It also includes numerical annotations capturing how humans perceive these anomalies: severity, surprisal, and complexity.

These annotations draw inspiration from cognitive science research on how humans identify and resolve anomalies, providing a comprehensive framework for evaluating Vision-Language Models (VLMs) in detecting and understanding anomalies. We show that state-of-the-art VLMs struggle with visual anomaly perception and commonsense reasoning, even with advanced prompting strategies.


Figure: Main overview of CAVE benchmark.

CAVE Benchmark Overview

CAVE was created to evaluate how Vision-Language Models (VLMs) handle commonsense anomalies in real-world images.
(1) Image Collection: Images were sourced from the top 1,000 posts of various subreddits and filtered to retain only high-quality, harmless content.
(2) Human Annotation: Initial annotations were performed by Mechanical Turk workers, focusing on anomaly identification and classification.
(3) Expert Verification & Annotation: A subsequent round of expert-driven annotation and verification ensured high-quality, consistent annotations across three open-ended tasks: description, explanation, and justification, as well as along numerical axes of severity, surprisal, and complexity.
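As a rough illustration of step (1), the snippet below pulls the top posts of a subreddit with the PRAW library and keeps only safe, direct image links; the subreddit name, credentials, and filtering criteria are placeholders, not the exact pipeline used to build CAVE.

# A minimal sketch of step (1), image collection, using the PRAW Reddit API wrapper.
# Subreddit names, credentials, and filters below are illustrative placeholders,
# not the exact configuration used to build CAVE.
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",          # Reddit API credentials (placeholders)
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="cave-data-collection",
)

IMAGE_EXTS = (".jpg", ".jpeg", ".png")

def collect_candidates(subreddit_name: str, limit: int = 1000) -> list[dict]:
    """Pull the top `limit` posts of a subreddit and keep plausible image candidates."""
    candidates = []
    for post in reddit.subreddit(subreddit_name).top(time_filter="all", limit=limit):
        if post.over_18:                                  # drop NSFW content
            continue
        if not post.url.lower().endswith(IMAGE_EXTS):     # keep direct image links only
            continue
        candidates.append({"id": post.id, "title": post.title,
                           "url": post.url, "score": post.score})
    return candidates

# Example: gather candidates from an anomaly-themed subreddit (name is illustrative).
images = collect_candidates("mildlyinteresting")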


Figure 1: Overview of the CAVE data collection and annotation pipeline.



Dataset Statistics

The dataset includes 361 images and 334 anomalies across categories such as entity presence/absence, attribute errors, spatial relation mismatches, uniformity breaches, and textual anomalies. Each image contains 0 to 3 anomalies. Each anomaly is paired with a human-written description, explanation, and plausible justification, along with fine-grained scores: Severity (impact or risk), Surprisal (unexpectedness), and Complexity (difficulty of detection).
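To make the annotation structure concrete, a single CAVE entry can be pictured as a record like the sketch below; the field names and example values are an illustrative schema, not the exact format of the released dataset.

# An illustrative schema for one CAVE annotation record (field names and example
# values are assumptions, not the released dataset format).
from dataclasses import dataclass, field

@dataclass
class Anomaly:
    category: str          # e.g. "entity presence/absence", "attribute error", ...
    description: str       # human-written description of the anomaly
    explanation: str       # why it violates commonsense expectations
    justification: str     # a plausible scenario under which it could occur
    severity: float        # impact or risk
    surprisal: float       # unexpectedness
    complexity: float      # difficulty of detection

@dataclass
class CaveImage:
    image_id: str
    image_path: str
    anomalies: list[Anomaly] = field(default_factory=list)   # 0 to 3 per image

# Example record (values invented purely for illustration).
example = CaveImage(
    image_id="cave_0001",
    image_path="images/cave_0001.jpg",
    anomalies=[Anomaly(
        category="spatial relation mismatch",
        description="A ladder is leaning against a window from inside the room.",
        explanation="Ladders are normally set against walls or used outdoors.",
        justification="Someone may be fixing a light fixture above the window.",
        severity=2.0, surprisal=3.5, complexity=1.5,
    )],
)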


Figure 2: Statistics of anomaly categories and numerical attributes.



Experimental Results

We evaluate 8 state-of-the-art VLMs on CAVE. Even the strongest model achieves only around 57% F1 on anomaly description. Models perform best on severe and surprising anomalies, but struggle with complex perception, especially spatial reasoning and pattern detection.
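For reference, the judge-based scoring behind these F1 numbers can be approximated as in the sketch below, where GPT-4o decides whether a predicted anomaly description matches a gold annotation and standard precision/recall/F1 are computed over the matches; the prompt wording, matching rule, and helper names are assumptions rather than the paper's exact protocol.

# A minimal sketch of LLM-as-a-judge matching for anomaly descriptions, followed by
# precision / recall / F1. Prompt wording and the one-to-one matching rule are
# assumptions, not the exact evaluation protocol of the paper.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def judge_match(predicted: str, gold: str) -> bool:
    """Ask GPT-4o whether a predicted anomaly description refers to the gold anomaly."""
    prompt = (
        "Do these two sentences describe the same anomaly in an image? Answer yes or no.\n"
        f"Prediction: {predicted}\nReference: {gold}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

def description_f1(predictions: list[str], golds: list[str]) -> float:
    """Match each prediction to at most one gold anomaly and compute F1."""
    unmatched_golds = list(golds)
    tp = 0
    for pred in predictions:
        hit = next((g for g in unmatched_golds if judge_match(pred, g)), None)
        if hit is not None:
            tp += 1
            unmatched_golds.remove(hit)
    precision = tp / len(predictions) if predictions else 0.0
    recall = tp / len(golds) if golds else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0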


Figure 3: Performance of leading VLMs for anomaly detection (AD) and anomaly explanation (AE), evaluated using GPT-4o-as-a-judge. We test various prompting strategies: Chain-of-Thought (CoT), Set-of-Marks (SoM), multi-step reasoning (MS CoT), and self-consistency (CoT + consist.).
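As one concrete reading of the CoT + self-consistency setting, the sketch below samples several chain-of-thought answers from a vision-language model and keeps the final answer that recurs most often; the prompt, model call, and aggregation rule are illustrative assumptions, not the exact configuration evaluated in the paper.

# One concrete reading of "CoT + self-consistency": draw several chain-of-thought
# samples and vote on the final answer. Prompt, model name, and normalization are
# illustrative assumptions, not the paper's exact setup.
from collections import Counter
from openai import OpenAI

client = OpenAI()

COT_PROMPT = (
    "Look at the image carefully. Think step by step about what you would expect to see, "
    "then end with one line starting with 'Anomaly:' describing the single most unusual thing."
)

def self_consistent_anomaly(image_url: str, n_samples: int = 5) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        n=n_samples,                 # several independent CoT samples
        temperature=0.7,             # non-zero temperature so samples differ
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": COT_PROMPT},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    # Extract the final 'Anomaly:' line of each sample and vote on a normalized form.
    answers = []
    for choice in resp.choices:
        lines = [l for l in choice.message.content.splitlines()
                 if l.lower().startswith("anomaly:")]
        if lines:
            answers.append(lines[-1].split(":", 1)[1].strip().lower().rstrip("."))
    return Counter(answers).most_common(1)[0][0] if answers else ""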



Example Anomalies

CAVE captures a wide variety of real-world commonsense anomalies, ranging from misplaced objects to textual inconsistencies. This diversity challenges models to detect anomalies beyond simple object recognition and requires reasoning about context and expectations.


Figure 4: Examples from CAVE showing diverse anomaly types.

Key Findings

State-of-the-art VLMs struggle with visual anomaly perception and commonsense reasoning on CAVE: the strongest model reaches only about 57% F1 on anomaly description, models do best on severe and surprising anomalies, and advanced prompting strategies (CoT, Set-of-Marks, multi-step reasoning, self-consistency) do not close the gap, with spatial reasoning and pattern detection remaining particularly difficult.

BibTeX


@inproceedings{bhagwatkar2025cave,
  title={CAVE: Detecting and Explaining Commonsense Anomalies in Visual Environments},
  author={Bhagwatkar, Rishika and Montariol, Syrielle and Romanou, Angelika and Borges, Beatriz and Rish, Irina and Bosselut, Antoine},
  booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year={2025}
}

Contact

Rishika Bhagwatkar: rishika.bhagwatkar@mila.quebec

Syrielle Montariol: syrielle.montariol@epfl.ch

Acknowledgement

This website is adapted from LLaVA-VL, Nerfies, and VL-RewardBench, licensed under CC BY-SA 4.0.