Abstract
Highlights
We present a unified reasoning framework that connects temporal detection, spatial localization, and textual explanation for holistic video anomaly analysis in a fully zero-shot manner. Our key contributions include:
- A unified, training-free pipeline that chains temporal detection, spatial localization, and textual explanation for holistic video anomaly analysis.
- Intra-Task Reasoning (IntraTR), which refines temporal anomaly scores using contextual video priors to improve detection accuracy.
- Inter-Task Chaining (InterTC), which links temporal detection with the spatial and semantic tasks, guiding frozen detectors and narrators with temporal cues.
- New state-of-the-art zero-shot baselines across multiple video anomaly detection, localization, and explanation benchmarks, without any additional training or data.
Methodology
The proposed framework consists of two key components: Intra-Task Reasoning (IntraTR) and Inter-Task Chaining (InterTC). IntraTR refines temporal anomaly scores by leveraging priors from the most suspicious video segment, while InterTC connects temporal detection with spatial localization and textual explanation, guiding frozen detectors and narrators using temporal cues. This chained reasoning process enables comprehensive zero-shot video anomaly analysis without any additional training or data.
Intra-Task Reasoning (IntraTR) enhances temporal anomaly detection by refining scores based on contextual video segments.
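To make the idea concrete, below is a minimal sketch of one way such score refinement could look: the highest-scoring segment serves as a contextual anchor, and the remaining scores are blended with their similarity to it. The function name, blending scheme, and inputs are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def intra_task_refine(scores, features, alpha=0.5):
    """Refine per-segment anomaly scores with a prior taken from the most
    suspicious segment (illustrative sketch, not the exact IntraTR method).

    scores   : (T,) initial anomaly scores from a frozen VLM scorer
    features : (T, D) per-segment embeddings from the same frozen backbone
    alpha    : blending weight between the original score and the prior
    """
    scores = np.asarray(scores, dtype=np.float32)
    feats = np.asarray(features, dtype=np.float32)
    feats = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)

    # Treat the highest-scoring segment as the contextual prior.
    anchor = feats[int(np.argmax(scores))]

    # Cosine similarity to that anchor, rescaled to [0, 1].
    prior = feats @ anchor
    prior = (prior - prior.min()) / (prior.max() - prior.min() + 1e-8)

    # Blend the initial scores with the prior to obtain refined scores.
    return (1.0 - alpha) * scores + alpha * prior
```

Blending, rather than replacing, keeps the original ranking intact while letting the contextual prior re-weight borderline segments.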

Inter-Task Chaining (InterTC) links temporal detection with spatial localization and textual explanation, enabling holistic anomaly analysis.
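The sketch below illustrates the chaining pattern under similar assumptions: refined temporal scores select an anomalous window, a frozen detector is queried only inside that window, and a frozen narrator explains the detected regions. The callables `detector` and `narrator` are placeholders for whichever frozen VLMs are used, not the paper's actual components.

```python
def inter_task_chain(frames, refined_scores, detector, narrator,
                     score_thresh=0.5, context=2):
    """Chain temporal cues into spatial localization and explanation
    (a sketch; detector and narrator are any frozen VLM callables).

    frames         : list of video frames
    refined_scores : per-frame (or per-segment) anomaly scores from IntraTR
    detector       : frozen open-vocabulary detector, frame -> list of boxes
    narrator       : frozen VLM narrator, (frames, boxes) -> explanation text
    """
    # 1. Temporal cue: keep only the window flagged as anomalous.
    anomalous = [t for t, s in enumerate(refined_scores) if s >= score_thresh]
    if not anomalous:
        return [], "No anomaly detected."

    start = max(0, min(anomalous) - context)
    end = min(len(frames), max(anomalous) + context + 1)
    window = frames[start:end]

    # 2. Spatial task: run the frozen detector only inside the window,
    #    so the temporal prior narrows where it has to look.
    boxes = [detector(frame) for frame in window]

    # 3. Semantic task: the narrator is prompted with the same window
    #    and the detected regions to produce a grounded explanation.
    explanation = narrator(window, boxes)
    return boxes, explanation
```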
Results at a Glance
The unified reasoning framework establishes new zero-shot baselines across four video anomaly benchmarks while remaining completely training-free. Below we summarize the key quantitative improvements.
Temporal Video Anomaly Detection (VAD)

Intra-Task Reasoning consistently boosts AUC across backbones and datasets, with adaptive margins providing further gains. Our method outperforms prior zero-shot approaches by a significant margin, closing the gap to supervised methods without any training.
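For reference, AUC in temporal VAD typically denotes frame-level ROC-AUC over anomaly scores; a minimal example with scikit-learn, where the labels and scores are purely illustrative:

```python
from sklearn.metrics import roc_auc_score

# Hypothetical frame-level ground-truth labels (1 = anomalous) and the
# refined anomaly scores produced for the same frames.
frame_labels = [0, 0, 1, 1, 1, 0, 0]
frame_scores = [0.1, 0.2, 0.8, 0.7, 0.9, 0.3, 0.1]

auc = roc_auc_score(frame_labels, frame_scores)
print(f"Frame-level AUC: {auc:.3f}")
```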
Spatial Video Anomaly Localization (VAL)

Inter-Task Chaining leverages temporal priors to guide frozen detectors, improving TIoU by 1.1 points over direct VLM localization.
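One common reading of TIoU in spatial anomaly localization is the average box IoU over annotated anomalous frames; the sketch below assumes that reading and axis-aligned boxes in (x1, y1, x2, y2) format, and is not taken from the paper's evaluation code.

```python
import numpy as np

def box_iou(a, b):
    """IoU of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def mean_track_iou(pred_boxes, gt_boxes):
    """Average spatial IoU over annotated anomalous frames.

    pred_boxes, gt_boxes : dicts mapping frame index -> (x1, y1, x2, y2);
    frames with a ground-truth box but no prediction count as IoU 0.
    """
    ious = [box_iou(pred_boxes[t], gt) if t in pred_boxes else 0.0
            for t, gt in gt_boxes.items()]
    return float(np.mean(ious)) if ious else 0.0
```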
Textual Video Anomaly Understanding (VAU)

Inter-Task Chaining consistently improves both traditional and LLM-based metrics over frozen narrators, closing the gap to supervised, instruction-tuned systems.
Qualitative Results
Below we present qualitative results demonstrating our method's ability to accurately detect, localize, and explain anomalies in a fully zero-shot manner across diverse scenarios. In all cases, our method effectively identifies anomalous events, precisely localizes them in space and time, and generates fine-grained anomaly tags and coherent textual explanations, showcasing the power of chained reasoning over frozen vision-language models.
Qualitative temporal detection and textual explanation results. Our method accurately identifies anomalous segments and generates detailed explanations by leveraging chained reasoning across tasks.

Qualitative spatial localization results on UCF-Crime. Our method accurately localizes anomalies by leveraging priors extracted from the temporal task to guide frozen VLM detectors.
BibTeX
@inproceedings{lin2025AUR,
  title={A Unified Reasoning Framework for Holistic Zero-Shot Video Anomaly Analysis},
  author={Dongheng Lin and Mengxue Qu and Kunyang Han and Jianbo Jiao and Xiaojie Jin and Yunchao Wei},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025},
  url={https://openreview.net/forum?id=Qla5PqFL0s}
}