We introduce the Time-Oriented Collection (TOC) dataset, which consists of 130,906 images with reliable, manually verified timestamps. This dataset enables us to analyze how time-related visual cues can be extracted from static images.
Our proposed method, Time-Image Contrastive Learning (TICL), employs a cross-modal contrastive learning framework. Intuitively, time correlates with many of the metaphysical concepts that can be described in natural language. This motivated us to align CLIP image embeddings with our clock-timestamp representations, allowing the model to learn time-related patterns from rich visual-semantic features. The indirect correlations inherited from CLIP help our method outperform previous methods that take raw geolocation/date metadata (directly time-related!) as additional inputs.
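Below is a minimal sketch of what such cross-modal alignment could look like in PyTorch. The `TimeEncoder` module, the cyclic (sin/cos) hour-of-day representation, and the symmetric InfoNCE-style loss are illustrative assumptions, not the exact TICL implementation; the CLIP image embeddings are stood in by random tensors here.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class TimeEncoder(nn.Module):
    """Hypothetical timestamp encoder: maps hour-of-day to an embedding
    via a cyclic (sin/cos) representation followed by a small MLP."""

    def __init__(self, dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2, 256), nn.ReLU(), nn.Linear(256, dim)
        )

    def forward(self, hours):  # hours: (B,) floats in [0, 24)
        angle = 2 * math.pi * hours / 24.0
        cyc = torch.stack([torch.sin(angle), torch.cos(angle)], dim=-1)
        return self.mlp(cyc)


def contrastive_loss(img_emb, time_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss aligning image and time embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    time_emb = F.normalize(time_emb, dim=-1)
    logits = img_emb @ time_emb.t() / temperature
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


# Toy usage: stand-in for frozen CLIP image features of shape (B, 512).
img_emb = torch.randn(8, 512)
hours = torch.rand(8) * 24.0  # timestamps as hour of day
time_encoder = TimeEncoder(dim=512)
loss = contrastive_loss(img_emb, time_encoder(hours))
loss.backward()
```

The key design choice illustrated here is that the image side reuses pretrained CLIP features rather than raw metadata, so the time encoder learns to match timestamps against visual semantics.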
@misc{lin2025timetellsusexplorative,
title={What Time Tells Us? An Explorative Study of Time Awareness Learned from Static Images},
author={Dongheng Lin and Han Hu and Jianbo Jiao},
year={2025},
eprint={2503.17899},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.17899},
}