Second Tutorial on Large Scale Holistic Video Understanding

In conjunction with ICCV 2021

Holistic Video Understanding is a joint project of KU Leuven, University of Bonn, KIT, ETH, and the HVU team.


In recent years, we have seen tremendous progress in the ability of computer systems to classify video clips taken from the Internet and to analyze human actions in videos. Much of the work in video recognition focuses on specific video understanding tasks, such as action recognition and scene understanding. These tasks have seen great achievements, but holistic video understanding has not received enough attention as a problem in its own right. Current systems are experts in specific subfields of the general video understanding problem. Real-world applications, however, require a combination of current state-of-the-art methods: for example, analyzing multiple concepts of a video for video search engines and media monitoring systems, or providing an appropriate description of the surrounding environment of a humanoid robot. In this tutorial, we therefore introduce holistic video understanding as a new challenge for the computer vision field. This challenge focuses on the recognition of scenes, objects, actions, attributes, and events in real-world and user-generated videos. We also aim to cover the most important aspects of video recognition and understanding in the tutorial.


  • Holistic video recognition
  • Future prediction in videos
  • Large scale video understanding
  • Multiple category recognition in videos
  • Object and activity detection in videos
  • Weakly supervised learning from web videos
  • Learning visual representation from videos
  • Unsupervised and self-supervised learning with videos
  • 3D/2D Deep Neural Networks for action and activity recognition



October 16, 2021 - Full Day

Tutorial Video (YouTube)

Time (EDT)  Speaker             Description
08:50                           Opening Remarks
09:00       Hilde Kuehne        Understanding videos without labels
10:00       Efstratios Gavves   The Machine Learning of Time and Dynamics ... with an emphasis on Sciences
11:00       Cees G. M. Snoek    Return of the Tubelets
12:00       Yale Song           Towards Self-Supervised Holistic Video Representations
13:00                           Break
13:30       Raquel Urtasun      An AI-First Approach to Self Driving
14:30       Ehsan Adeli         Forecasting Human Dynamics in Multiple Levels of Abstraction
15:30       Xiaolong Wang       Learning to Perceive Videos for Embodiment
16:30       Georgia Gkioxari    Can Video Rescue Object Understanding?
17:30                           Conclusion