Humans can naturally identify, reason about, and explain anomalies in their environment.
We introduce CAVE, the first benchmark of real-world commonsense anomalies.
CAVE contains images captured in real-world scenarios and supports three open-ended tasks:
anomaly description, explanation, and justification.
It also includes numerical attributes representing how humans perceive these anomalies:
severity, surprisal, and complexity.
These annotations draw inspiration from cognitive science research on how humans
identify and resolve anomalies, providing a comprehensive framework for evaluating
Vision-Language Models (VLMs) in detecting and understanding anomalies.
We show that state-of-the-art VLMs struggle with visual anomaly perception
and commonsense reasoning, even with advanced prompting strategies.
Figure: Main overview of the CAVE benchmark.
CAVE was created to evaluate how Vision-Language Models (VLMs) handle
commonsense anomalies in real-world images.
(1) Image Collection: Images were sourced from the top 1,000 posts across various subreddits and filtered to retain only high-quality, harmless content (a minimal collection sketch follows Figure 1).
(2) Human Annotation: Initial annotations were performed by Mechanical Turk workers, focusing on anomaly identification and classification.
(3) Expert Verification & Annotation: A subsequent round of expert-driven annotation and verification ensured high-quality, consistent annotations across the three open-ended tasks (description, explanation, and justification) and along the numerical axes of severity, surprisal, and complexity.
Figure 1: Overview of the CAVE data collection and annotation pipeline.
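As a rough illustration of step (1), here is a minimal collection sketch in Python, assuming the PRAW Reddit API wrapper; the credentials, subreddit names, and filtering rules below are placeholders rather than the ones actually used to build CAVE.

import praw

# Placeholder credentials; a real run needs a registered Reddit application.
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="cave-image-collection/0.1",
)

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")
SUBREDDITS = ["mildlyinteresting", "confusing_perspective"]  # illustrative only

candidates = []
for name in SUBREDDITS:
    # Reddit listings cap out around 1,000 items, matching the top-1,000 posts per subreddit.
    for post in reddit.subreddit(name).top(time_filter="all", limit=1000):
        if post.over_18:  # drop NSFW posts as part of the harmlessness filter
            continue
        if not post.url.lower().endswith(IMAGE_EXTENSIONS):
            continue  # keep direct image links only
        candidates.append({"id": post.id, "title": post.title, "url": post.url})

print(f"Collected {len(candidates)} candidate images for manual quality filtering.")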
The dataset includes 361 images and 334 anomalies across categories such as entity presence/absence, attribute errors, spatial relation mismatches, uniformity breaches, and textual anomalies; each image contains 0 to 3 anomalies. Each anomaly is paired with human-written descriptions, explanations, and plausible justifications, along with fine-grained scores: Severity (impact or risk), Surprisal (unexpectedness), and Complexity (difficulty of detection).
Figure 2: Statistics of anomaly categories and numerical attributes.
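To make the annotation structure concrete, the sketch below shows one hypothetical way a CAVE example could be represented in Python; the class and field names are illustrative assumptions, not the benchmark's actual release schema.

from dataclasses import dataclass, field
from typing import List

@dataclass
class AnomalyAnnotation:
    category: str               # e.g. "entity absence", "attribute error", "textual anomaly"
    description: str            # human-written description of the anomaly
    explanation: str            # why the observation violates commonsense expectations
    justifications: List[str]   # plausible reasons the anomaly could have occurred
    severity: float             # impact or risk
    surprisal: float            # unexpectedness
    complexity: float           # difficulty of detection

@dataclass
class CaveExample:
    image_path: str
    anomalies: List[AnomalyAnnotation] = field(default_factory=list)  # 0 to 3 per image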
We evaluated 8 state-of-the-art VLMs on CAVE. Even the strongest model achieved only around 57% F1 on anomaly description. Models perform best on severe and surprising anomalies but struggle with perceptually complex ones, especially those requiring spatial reasoning and pattern detection.
Figure 3: Performance of leading VLMs for anomaly detection (AD) and anomaly explanation (AE), evaluated using GPT-4o-as-a-judge. We test various prompting strategies: Chain-of-Thought (CoT), Set-of-Mark (SoM), multi-step reasoning (MS CoT), and self-consistency (CoT + consist.).
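The snippet below sketches how Chain-of-Thought prompting with self-consistency and GPT-4o-as-a-judge scoring could be wired together, using the OpenAI Python SDK as a stand-in for the evaluated VLMs; the prompt wording, sampling settings, and matching criterion are illustrative assumptions, not the paper's exact protocol.

from collections import Counter
from openai import OpenAI

client = OpenAI()

COT_PROMPT = (
    "Look at the image and list any commonsense anomalies you see. "
    "Think step by step before giving your final answer."
)

def describe_anomalies(image_url: str, n_samples: int = 5) -> str:
    """Sample several CoT answers and keep the most frequent one (self-consistency)."""
    answers = []
    for _ in range(n_samples):
        response = client.chat.completions.create(
            model="gpt-4o",
            temperature=0.7,  # sampling diversity is what self-consistency votes over
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": COT_PROMPT},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }],
        )
        answers.append(response.choices[0].message.content.strip())
    # Exact-match voting keeps the sketch short; free-form answers would normally be
    # normalized or clustered before voting.
    return Counter(answers).most_common(1)[0][0]

def judge_answer(model_answer: str, gold_description: str) -> bool:
    """Ask GPT-4o whether the model's answer matches the human-written annotation."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                "Reference anomaly description:\n" + gold_description
                + "\n\nModel answer:\n" + model_answer
                + "\n\nDoes the model answer describe the same anomaly? Reply 'yes' or 'no'."
            ),
        }],
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")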
CAVE captures a wide variety of real-world commonsense anomalies, ranging from misplaced objects to textual inconsistencies. This diversity requires models to go beyond simple object recognition and reason about context and expectations.
Figure 4: Examples from CAVE showing diverse anomaly types.
@inproceedings{bhagwatkar2025cave,
title={CAVE: Commonsense Anomalies in Visual Environments},
author={Bhagwatkar, Rishika and Montariol, Syrielle and Romanou, Angelika and Borges, Beatriz and Rish, Irina and Bosselut, Antoine},
booktitle={EMNLP 2025},
year={2025}
}
Rishika Bhagwatkar: rishika.bhagwatkar@mila.quebec
Syrielle Montariol: syrielle.montariol@epfl.ch
This website is adapted from LLaVA-VL, Nerfies, and VL-RewardBench, licensed under CC BY-SA 4.0.