Fifth International Workshop on Large Scale Holistic Video Understanding

In Conjunction with CVPR 2024

Holistic Video Understanding is a joint project of KU Leuven, the University of Bonn, KIT, ETH, and the HVU team.


The ability of computer systems to classify Internet videos or analyze human actions in video has improved tremendously in recent years. Substantial work in the video recognition field has targeted specific video understanding tasks, such as action recognition and scene recognition. Despite strong progress on these individual tasks, holistic video understanding has not received enough attention: current video understanding systems are specialists in a single field. Real-world applications, such as analyzing multiple concepts in a video for video search engines and media monitoring systems, or characterizing the environment surrounding a humanoid robot, require a combination of current state-of-the-art methods. This workshop takes a step toward holistic video understanding (HVU), with a challenge focused on recognizing scenes, objects, actions, attributes, and events in real-world videos. To support these tasks, we introduce the HVU dataset, which is organized hierarchically according to a semantic taxonomy for holistic video understanding. Most existing real-world video datasets target human action or sport recognition; our new dataset can serve the vision community and draw more attention to the development of holistic video understanding solutions. The workshop will bring together ideas on multi-label and multi-task recognition in real-world videos, and the dataset will be used to evaluate research efforts.


The main objective of the workshop is to establish a video benchmark integrating joint recognition of all the semantic concepts, as a single class label per task is often insufficient to describe the holistic content of a video. The planned panel discussion with the world's leading experts on this problem will provide fruitful input and a source of ideas for all participants. Further, we invite the community to help extend the HVU dataset, which will spur research on video understanding as a comprehensive, multi-faceted problem. As organizers, we expect to receive valuable feedback from users and from the community on how to improve the benchmark.


  • Large-scale video understanding
  • Multi-modal learning from videos
  • Multi-concept recognition from videos
  • Multi-task deep neural networks for videos
  • Learning holistic representations from videos
  • Weakly supervised learning from web videos
  • Object, scene and event recognition from videos
  • Unsupervised video visual representation learning
  • Unsupervised and self-supervised learning with videos