Abstract

Most video anomaly research stops at frame-level detection, outputting only frame-wise anomaly scores without spatial or semantic context and thus offering little insight into why an event is abnormal. Recent video anomaly localization and video anomaly understanding methods improve explainability but remain data-dependent and task-specific. We propose a unified reasoning framework that bridges temporal detection, spatial localization, and textual explanation. Our approach is built on a chained test-time reasoning process that connects these tasks sequentially, enabling holistic zero-shot anomaly analysis without any additional training. Specifically, it uses intra-task reasoning to refine temporal detections and inter-task chaining for spatial and semantic understanding, improving both interpretability and generalization. Without any additional data or gradients, our method achieves state-of-the-art zero-shot performance across multiple video anomaly detection, localization, and explanation benchmarks. These results demonstrate that careful prompt design with task-wise chaining can unlock the reasoning power of foundation models, enabling practical, interpretable video anomaly analysis in a fully zero-shot manner.

Highlights

We present a unified reasoning framework that connects temporal detection, spatial localization, and textual explanation for holistic video anomaly analysis in a fully zero-shot manner. Our key contributions include:

Overview of the unified reasoning framework.
  • Introduces a unified, training-free pipeline that chains temporal detection, spatial localization, and textual explanation for holistic video anomaly analysis.
  • Proposes Intra-Task Reasoning (IntraTR) to refine temporal anomaly scores using contextual video priors, enhancing detection accuracy (a minimal sketch follows this list).
  • Develops Inter-Task Chaining (InterTC) to link temporal detection with spatial and semantic tasks, guiding frozen detectors and narrators using temporal cues.
  • Establishes new state-of-the-art zero-shot baselines across multiple video anomaly detection, localization, and explanation benchmarks without any additional training or data.
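
As a rough illustration of the IntraTR idea, the sketch below refines per-frame anomaly scores with a prior taken from the most suspicious temporal segment. The window size, blending weight, and similarity heuristic are illustrative assumptions, not the components used in the paper.

import numpy as np

def intra_task_reasoning(frame_scores: np.ndarray,
                         window: int = 16,
                         blend: float = 0.5) -> np.ndarray:
    # Hypothetical IntraTR sketch: re-weight frame-level anomaly scores
    # using a prior extracted from the most suspicious temporal segment.
    # Assumes len(frame_scores) >= window.
    # 1. Find the most suspicious window (highest mean anomaly score).
    means = np.convolve(frame_scores, np.ones(window) / window, mode="valid")
    start = int(np.argmax(means))
    prior = frame_scores[start:start + window].mean()

    # 2. Use the prior to sharpen the curve: frames whose scores resemble
    #    the suspicious segment stay high, dissimilar frames are damped.
    similarity = np.exp(-np.abs(frame_scores - prior))
    refined = blend * frame_scores + (1.0 - blend) * frame_scores * similarity

    # 3. Re-normalize to [0, 1] for comparability across videos.
    refined = (refined - refined.min()) / (refined.max() - refined.min() + 1e-8)
    return refined

A toy call such as intra_task_reasoning(np.random.rand(256)) returns a refined score curve of the same length as the input.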

Methodology

The proposed framework consists of two key components: Intra-Task Reasoning (IntraTR) and Inter-Task Chaining (InterTC). IntraTR refines temporal anomaly scores by leveraging priors from the most suspicious video segment, while InterTC connects temporal detection with spatial localization and textual explanation, guiding frozen detectors and narrators using temporal cues. This chained reasoning process enables comprehensive zero-shot video anomaly analysis without any additional training or data.
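
To make the chaining concrete, here is a minimal sketch of how a temporal cue could be threaded into the prompts of frozen spatial and textual models. FrozenDetector, FrozenNarrator, and the prompt strings are hypothetical stand-ins with assumed interfaces, not the actual models or prompts used in the framework.

from typing import List, Tuple

class FrozenDetector:
    # Placeholder for a frozen open-vocabulary detector (assumed interface).
    def localize(self, frame, prompt: str) -> Tuple[int, int, int, int]:
        raise NotImplementedError

class FrozenNarrator:
    # Placeholder for a frozen video-language narrator (assumed interface).
    def describe(self, clip, prompt: str) -> str:
        raise NotImplementedError

def inter_task_chaining(video_frames: List,
                        refined_scores: List[float],
                        detector: FrozenDetector,
                        narrator: FrozenNarrator,
                        threshold: float = 0.5):
    # Sketch of InterTC: temporal detections (e.g. IntraTR output) select
    # the suspicious frames, and the temporal cue is folded into the
    # prompts that guide the frozen detector and narrator.
    flagged = [i for i, s in enumerate(refined_scores) if s >= threshold]
    if not flagged:
        return [], [], "No anomalous segment detected."

    # Spatial task: localize the anomaly in each flagged frame.
    cue = f"frames {flagged[0]}-{flagged[-1]} appear anomalous"
    boxes = [detector.localize(video_frames[i],
                               prompt=f"Locate the anomalous region ({cue}).")
             for i in flagged]

    # Semantic task: explain the anomaly on the flagged clip.
    clip = [video_frames[i] for i in flagged]
    explanation = narrator.describe(
        clip, prompt=f"Explain why this segment is abnormal ({cue}).")

    return flagged, boxes, explanation

Keeping the detector and narrator frozen means the only design freedom lies in the prompts and the order of the chain, which is what keeps the pipeline training-free.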

Results at a Glance

The unified reasoning framework establishes new zero-shot baselines across four video anomaly benchmarks while remaining completely training-free. Below we summarize the key quantitative improvements.

Qualitative Results

Below we present qualitative results demonstrating our method's ability to accurately detect, localize, and explain anomalies in a fully zero-shot manner across diverse scenarios. In all cases, our method effectively identifies anomalous events, precisely localizes them in space and time, and generates fine-grained anomaly tags and coherent textual explanations, showcasing the power of chained reasoning over frozen vision-language models.

Qualitative temporal detection and textual explanation results. Our method accurately identifies anomalous segments and generates detailed explanations by leveraging chained reasoning across tasks.

Qualitative Localization Results on UCF-Crime

Qualitative spatial localization results on UCF-Crime. Our method accurately localizes anomalies by leveraging priors extracted from the temporal task to guide frozen VLM detectors.

BibTeX

@inproceedings{lin2025AUR,
  title={A Unified Reasoning Framework for Holistic Zero-Shot Video Anomaly Analysis},
  author={Dongheng Lin and Mengxue Qu and Kunyang Han and Jianbo Jiao and Xiaojie Jin and Yunchao Wei},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025},
  url={https://openreview.net/forum?id=Qla5PqFL0s}
}