What Time Tells Us? An Explorative Study of Time Awareness Learned from Static Images

Key Takeaways

Time becomes visible through the illumination changes in what we see. Inspired by this, in this paper we explore the potential to learn time-of-day awareness from static images, asking: what does time tell us?

📕Dataset

Dataset Overview

We introduce the Time-Oriented Collection (TOC) dataset, which consists of 130,906 images with manually verified, reliable timestamps. This dataset enables us to analyze how time-related visual cues can be extracted from static images.

🔍What time tells us in Time-based Image Retrieval?

Time-based Image Retrieval
As an extension of the pretext task, the results show that under zero-shot nearest-neighbour retrieval, TICL features retrieve a higher percentage of images with small time errors than other features.
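Concretely, this protocol amounts to nearest-neighbour search over image embeddings, scored by how far the retrieved images' timestamps fall from the query's time of day. The sketch below is a minimal illustrative version of that setup (not the released evaluation code), assuming precomputed image embeddings and timestamps given as hours in [0, 24).

```python
# Minimal sketch of zero-shot nearest-neighbour time-based retrieval with a
# circular time-of-day error. Embeddings and timestamps are random stand-ins.
import numpy as np

def retrieve_topk(query_emb, gallery_embs, k=5):
    """Indices of the k gallery embeddings closest to the query (cosine similarity)."""
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    sims = g @ q
    return np.argsort(-sims)[:k]

def circular_time_error(hours_a, hours_b):
    """Absolute time-of-day difference on a 24-hour circle (23:00 vs 01:00 -> 2h)."""
    diff = np.abs(hours_a - hours_b) % 24.0
    return np.minimum(diff, 24.0 - diff)

# Toy usage with random stand-ins for TICL (or any other) image features.
rng = np.random.default_rng(0)
gallery_embs = rng.normal(size=(1000, 512))
gallery_hours = rng.uniform(0, 24, size=1000)
query_emb, query_hour = rng.normal(size=512), 14.5

idx = retrieve_topk(query_emb, gallery_embs, k=5)
errors = circular_time_error(gallery_hours[idx], query_hour)
print("mean time error of top-5 (hours):", errors.mean())
```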

🏞️What time tells us in Video Scene Classification?

Video Scene Classification
Supported by the intuition that time correlates with scene context (see the t-SNE plot and text probabilities in the figure for evidence), our model's understanding of time-related visual context improves video scene classification, as we demonstrate on multiple datasets.
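As one illustration of how time-aware features could plug into scene classification, the sketch below mean-pools hypothetical per-frame TICL embeddings over a clip, concatenates them with a generic video feature, and applies a linear head; the names, dimensions, and fusion strategy are illustrative assumptions rather than the paper's exact pipeline.

```python
# Minimal sketch: fuse clip-level time-aware features with a video feature
# and classify the scene with a linear head. Dimensions are assumptions.
import torch
import torch.nn as nn

class TimeAwareSceneHead(nn.Module):
    def __init__(self, video_dim=768, ticl_dim=512, num_classes=10):
        super().__init__()
        self.classifier = nn.Linear(video_dim + ticl_dim, num_classes)

    def forward(self, video_feat, ticl_frame_feats):
        # video_feat: (B, video_dim); ticl_frame_feats: (B, T, ticl_dim)
        ticl_clip_feat = ticl_frame_feats.mean(dim=1)        # temporal mean pooling
        fused = torch.cat([video_feat, ticl_clip_feat], dim=-1)
        return self.classifier(fused)

# Toy forward pass.
head = TimeAwareSceneHead()
logits = head(torch.randn(4, 768), torch.randn(4, 16, 512))
print(logits.shape)  # torch.Size([4, 10])
```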

🌅What time tells us in Time-aware Image Editing?

Time-aware Image Editing
Our model also learns a visually grounded representation for different clock times throughout the day, so we can use this representation to guide image editing toward more plausible illumination. The experimental results in the figure span baseline editing methods built on both GAN and diffusion models.
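To make the guidance idea concrete, the sketch below defines a hypothetical time-guidance term: the cosine distance between an edited image's embedding and the embedding of a target clock time, which could be added as an extra loss to a GAN-inversion or diffusion-guidance objective. The `image_encoder` and `time_encoder` here are toy stand-ins for the learned encoders, not the released editing code.

```python
# Minimal sketch of a time-guidance loss for optimisation-based editing.
import torch
import torch.nn as nn
import torch.nn.functional as F

def time_guidance_loss(image_encoder, time_encoder, edited_image, target_hour):
    """Cosine distance between the edited image and a target time-of-day, in [0, 2]."""
    img_emb = F.normalize(image_encoder(edited_image), dim=-1)
    t = torch.full((edited_image.size(0), 1), target_hour, device=edited_image.device)
    time_emb = F.normalize(time_encoder(t), dim=-1)
    return 1.0 - (img_emb * time_emb).sum(dim=-1).mean()

# Toy usage with random stand-in encoders, just to show the interface.
image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 512))
time_encoder = nn.Linear(1, 512)
x_edit = torch.randn(2, 3, 64, 64, requires_grad=True)
loss = time_guidance_loss(image_encoder, time_encoder, x_edit, target_hour=19.0)
loss.backward()  # gradient w.r.t. the edited image can steer the edit toward 19:00 lighting
print(float(loss))
```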

✍Methodology on Pretext Estimation Task

Methodology Overview

Our proposed method, Time-Image Contrastive Learning (TICL), employs a cross-modal contrastive learning framework. Intuitively, time correlates with many abstract concepts that can be described in natural language; this motivated us to align CLIP image embeddings with our clock-timestamp representations, allowing the model to learn time-related patterns from rich visual-semantic features. The indirect correlations inherited from CLIP help our method outperform previous methods that take raw geolocation/date metadata (which is directly time-related!) as additional inputs.
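The sketch below illustrates this contrastive alignment under stated assumptions: clock times are encoded cyclically (sin/cos of the hour of day) and projected by a small MLP, then aligned to frozen CLIP image embeddings with a symmetric InfoNCE loss. The exact time encoder, projection heads, and hyperparameters in TICL may differ.

```python
# Minimal sketch of time-image contrastive alignment. CLIP embeddings are
# replaced by random tensors; the time encoder and loss are illustrative.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class TimeEncoder(nn.Module):
    def __init__(self, embed_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2, 256), nn.ReLU(), nn.Linear(256, embed_dim))

    def forward(self, hours):  # hours: (B,) in [0, 24)
        angle = hours / 24.0 * 2 * math.pi
        cyc = torch.stack([torch.sin(angle), torch.cos(angle)], dim=-1)  # cyclic encoding
        return self.mlp(cyc)

def symmetric_infonce(img_emb, time_emb, temperature=0.07):
    img_emb = F.normalize(img_emb, dim=-1)
    time_emb = F.normalize(time_emb, dim=-1)
    logits = img_emb @ time_emb.t() / temperature
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy training step with random stand-ins for frozen CLIP image embeddings.
time_enc = TimeEncoder()
img_emb = torch.randn(32, 512)   # would come from a frozen CLIP image encoder
hours = torch.rand(32) * 24.0
loss = symmetric_infonce(img_emb, time_enc(hours))
print(float(loss))
```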

Time Estimation

🙏 Acknowledgements

This project is partially supported by the Royal Society grants (SIF\R1\231009, IES\R3\223050) and an Amazon Research Award. The computations in this research were performed using the Baskerville Tier 2 HPC service. Baskerville was funded by the EPSRC and UKRI through the World Class Labs scheme (EP\T022221\1) and the Digital Research Infrastructure programme (EP\W032244\1) and is operated by Advanced Research Computing at the University of Birmingham.

BibTeX

@misc{lin2025timetellsusexplorative,
      title={What Time Tells Us? An Explorative Study of Time Awareness Learned from Static Images},
      author={Dongheng Lin and Han Hu and Jianbo Jiao},
      year={2025},
      eprint={2503.17899},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.17899},
}