Abstract
Highlights
We present a unified reasoning framework that connects temporal detection, spatial localization, and textual explanation for holistic video anomaly analysis in a fully zero-shot manner. Our key contributions include:
- A unified, training-free pipeline that chains temporal detection, spatial localization, and textual explanation for holistic video anomaly analysis.
- Intra-Task Reasoning (IntraTR), which refines temporal anomaly scores using contextual video priors to improve detection accuracy.
- Inter-Task Chaining (InterTC), which links temporal detection with the spatial and semantic tasks, guiding frozen detectors and narrators with temporal cues.
- New state-of-the-art zero-shot baselines across multiple video anomaly detection, localization, and explanation benchmarks, without any additional training or data.
Methodology
The proposed framework consists of two key components: Intra-Task Reasoning (IntraTR) and Inter-Task Chaining (InterTC). IntraTR refines temporal anomaly scores by leveraging priors from the most suspicious video segment, while InterTC connects temporal detection with spatial localization and textual explanation, guiding frozen detectors and narrators using temporal cues. This chained reasoning process enables comprehensive zero-shot video anomaly analysis without any additional training or data.
Intra-Task Reasoning (IntraTR) enhances temporal anomaly detection by refining scores based on contextual video segments.
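To make the idea concrete, below is a minimal sketch of one way such score refinement could look: the highest-scoring segment serves as a contextual anchor, and the remaining scores are blended with their similarity to it. The function name, blending scheme, and inputs are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def intra_task_refine(scores, features, alpha=0.5):
    """Refine per-segment anomaly scores with a prior taken from the most
    suspicious segment (illustrative sketch, not the exact IntraTR method).

    scores   : (T,) initial anomaly scores from a frozen VLM scorer
    features : (T, D) per-segment embeddings from the same frozen backbone
    alpha    : blending weight between the original score and the prior
    """
    scores = np.asarray(scores, dtype=np.float32)
    feats = np.asarray(features, dtype=np.float32)
    feats = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)

    # Treat the highest-scoring segment as the contextual prior.
    anchor = feats[int(np.argmax(scores))]

    # Cosine similarity to that anchor, rescaled to [0, 1].
    prior = feats @ anchor
    prior = (prior - prior.min()) / (prior.max() - prior.min() + 1e-8)

    # Blend the initial scores with the prior to obtain refined scores.
    return (1.0 - alpha) * scores + alpha * prior
```

Blending, rather than replacing, keeps the original ranking intact while letting the contextual prior re-weight borderline segments.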

Inter-Task Chaining (InterTC) links temporal detection with spatial localization and textual explanation, enabling holistic anomaly analysis.
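The sketch below illustrates the chaining pattern under similar assumptions: refined temporal scores select an anomalous window, a frozen detector is queried only inside that window, and a frozen narrator explains the detected regions. The callables `detector` and `narrator` are placeholders for whichever frozen VLMs are used, not the paper's actual components.

```python
def inter_task_chain(frames, refined_scores, detector, narrator,
                     score_thresh=0.5, context=2):
    """Chain temporal cues into spatial localization and explanation
    (a sketch; detector and narrator are any frozen VLM callables).

    frames         : list of video frames
    refined_scores : per-frame (or per-segment) anomaly scores from IntraTR
    detector       : frozen open-vocabulary detector, frame -> list of boxes
    narrator       : frozen VLM narrator, (frames, boxes) -> explanation text
    """
    # 1. Temporal cue: keep only the window flagged as anomalous.
    anomalous = [t for t, s in enumerate(refined_scores) if s >= score_thresh]
    if not anomalous:
        return [], "No anomaly detected."

    start = max(0, min(anomalous) - context)
    end = min(len(frames), max(anomalous) + context + 1)
    window = frames[start:end]

    # 2. Spatial task: run the frozen detector only inside the window,
    #    so the temporal prior narrows where it has to look.
    boxes = [detector(frame) for frame in window]

    # 3. Semantic task: the narrator is prompted with the same window
    #    and the detected regions to produce a grounded explanation.
    explanation = narrator(window, boxes)
    return boxes, explanation
```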
Results at a Glance
The unified reasoning framework establishes new zero-shot baselines across four video anomaly benchmarks while remaining completely training-free. Below we summarize the key quantitative improvements.
Temporal Video Anomaly Detection (VAD)

Intra-Task Reasoning consistently boosts AUC across backbones and datasets, with adaptive margins providing further gains. Our method outperforms prior zero-shot approaches by a significant margin, closing the gap to supervised methods without any training.
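For reference, AUC in temporal VAD typically denotes frame-level ROC-AUC over anomaly scores; a minimal example with scikit-learn, where the labels and scores are purely illustrative:

```python
from sklearn.metrics import roc_auc_score

# Hypothetical frame-level ground-truth labels (1 = anomalous) and the
# refined anomaly scores produced for the same frames.
frame_labels = [0, 0, 1, 1, 1, 0, 0]
frame_scores = [0.1, 0.2, 0.8, 0.7, 0.9, 0.3, 0.1]

auc = roc_auc_score(frame_labels, frame_scores)
print(f"Frame-level AUC: {auc:.3f}")
```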
Spatial Video Anomaly Localization (VAL)

Inter-Task Chaining leverages temporal priors to guide frozen detectors, improving TIoU by 1.1 points over direct VLM localization.
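One common reading of TIoU in spatial anomaly localization is the average box IoU over annotated anomalous frames; the sketch below assumes that reading and axis-aligned boxes in (x1, y1, x2, y2) format, and is not taken from the paper's evaluation code.

```python
import numpy as np

def box_iou(a, b):
    """IoU of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def mean_track_iou(pred_boxes, gt_boxes):
    """Average spatial IoU over annotated anomalous frames.

    pred_boxes, gt_boxes : dicts mapping frame index -> (x1, y1, x2, y2);
    frames with a ground-truth box but no prediction count as IoU 0.
    """
    ious = [box_iou(pred_boxes[t], gt) if t in pred_boxes else 0.0
            for t, gt in gt_boxes.items()]
    return float(np.mean(ious)) if ious else 0.0
```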
Textual Video Anomaly Understanding (VAU)

Inter-Task Chaining consistently improves both traditional and LLM-based metrics over frozen narrators, closing the gap to supervised, instruction-tuned systems.
Qualitative Results
Below we present qualitative results demonstrating our method's ability to accurately detect, localize, and explain anomalies in a fully zero-shot manner across diverse scenarios. In all cases, our method effectively identifies anomalous events, precisely localizes them in space and time, and generates fine-grained anomaly tags and coherent textual explanations, showcasing the power of chained reasoning over frozen vision-language models.
Qualitative temporal detection and textual explanation results. Our method accurately identifies anomalous segments and generates detailed explanations by leveraging chained reasoning across tasks.

Qualitative spatial localization results on UCF-Crime. Our method accurately localizes anomalies by leveraging priors extracted from the temporal task to guide frozen VLM detectors.
BibTeX
@inproceedings{lin2025AUR,
  title={A Unified Reasoning Framework for Holistic Zero-Shot Video Anomaly Analysis},
  author={Dongheng Lin and Mengxue Qu and Kunyang Han and Jianbo Jiao and Xiaojie Jin and Yunchao Wei},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025},
  url={https://openreview.net/forum?id=Qla5PqFL0s}
}