Fourth International Workshop on Large Scale Holistic Video Understanding

In Conjunction with CVPR 2023

Holistic Video Understanding is a joint project of KU Leuven, the University of Bonn, KIT, ETH, and the HVU team.


The ability of computer systems to classify Internet videos and analyze human actions in them has improved tremendously in recent years. Much of the work in video recognition has targeted specific understanding tasks, such as action recognition and scene recognition. Despite substantial achievements on these tasks, holistic video understanding has not received enough attention: current video understanding systems specialize in narrow fields. Real-world applications, such as analyzing multiple concepts in a video for search engines and media-monitoring systems, or characterizing a humanoid robot's surrounding environment, require a combination of current state-of-the-art methods. This workshop is a step toward holistic video understanding (HVU), and the challenge focuses on recognizing scenes, objects, actions, attributes, and events in real-world videos. To support these tasks, we introduce the HVU dataset, which is organized hierarchically according to a semantic taxonomy for holistic video understanding. Most existing real-world video datasets target human action or sport recognition; we hope our new dataset serves the vision community and draws more attention to developing holistic video understanding solutions. The workshop will bring together ideas on multi-label and multi-task recognition in real-world videos, with the dataset serving as a testbed for research efforts.


The main objective of the workshop is to establish a video benchmark for joint recognition of all these semantic concepts, since a single class label per task is often not sufficient to describe the holistic content of a video. A planned panel discussion with the world's leading experts on this problem will provide fruitful input and a source of ideas for all participants. We further invite the community to help extend the HVU dataset, which we expect to spur research on video understanding as a comprehensive, multi-faceted problem. As organizers, we look forward to valuable feedback from users and the community on how to improve the benchmark.


  • Large-scale video understanding
  • Multi-modal learning from videos
  • Multi-concept recognition from videos
  • Multi-task deep neural networks for videos
  • Learning holistic representations from videos
  • Weakly supervised learning from web videos
  • Object, scene, and event recognition from videos
  • Unsupervised video visual representation learning
  • Unsupervised and self-supervised learning with videos



June 18th - AM, Room: East 8, Poster: West Exhibit Hall, #100-114

   PT    Description    Speaker
08:30    Opening Remarks
08:45    Orals: Papers 1-3
09:15    Invited Speaker 1:    Rohit Girdhar
10:00    Poster (West Exhibit Hall, #100-114) & Break    All Papers
10:30    Invited Speaker 2:    Cordelia Schmid
11:15    Invited Speaker 3:    Carl Vondrick
12:00    Closing Remarks


1. Class Prototypes based Contrastive Learning for Classifying Multi-Label and Fine-Grained Educational Videos; Rohit Gupta, Anirban Roy, Sujeong Kim, Claire Christensen, Sarah Gerard, Madeline Cincebeaux, Todd Grindal, Ajay Divakaran, Mubarak Shah
2. AutoAD: Movie Description in Context; Tengda Han, Max Bain, Arsha Nagrani, Gul Varol, Weidi Xie, Andrew Zisserman
3. HierVL: Learning Hierarchical Video-Language Embeddings; Ashutosh Kumar, Rohit Girdhar, Lorenzo Torresani, Kristen Grauman
4. DOAD: Decoupled One Stage Action Detection Network; Shuning Chang, Pichao Wang, Hao Luo, Fan Wang, Mike Zheng Shou
5. PIVOT: Prompting for Video Continual Learning; Andres Villa, Juan Leon Alcazar, Motasem Alfarra, Kumail Alhamoud, Julio Hurtado, Fabian Caba Heilbron, Alvaro Soto, Bernard Ghanem
6. Token Turing Machines; Michael S. Ryoo, Keerthana Gopalakrishnan, Kumara Kahatapitiya, Ted Xiao, Kanishka Rao, Austin Stone, Yao Lu, Julian Ibarz, Anurag Arnab
7. A New Dataset and Approach for Timestamp Supervised Action Segmentation Using Human Object Interaction; Saif Sayed, Reza Ghoddoosian, Bhaskar Chandra Trivedi, Vassilis Athitsos
8. Meta-Personalizing Vision-Language Models to Find Named Instances in Video; Chun-Hsiao Yeh, Bryan Russell, Josef Sivic, Fabian Caba Heilbron, Simon Jenni
9. Test of Time: Instilling Video-Language Models with a Sense of Time; Piyush Bagad, Makarand Tapaswi, Cees G. M. Snoek
10. AutoLabel: CLIP-based framework for Open-set Video Domain Adaptation; Giacomo Zara, Subhankar Roy, Paolo Rota, Elisa Ricci
11. MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge; Wei Lin, Leonid Karlinsky, Nina Shvetsova, Horst Possegger, Mateusz Kozinski, Rameswar Panda, Rogerio Feris, Hilde Kuehne, Horst Bischof
12. Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning; AJ Piergiovanni, Weicheng Kuo, Anelia Angelova
13. Multi-Annotation Attention Model for Video Summarization; Hacene Terbouche, Maryan Morel, Mariano Rodriguez, Alice Ahlem Othmani
14. Global Motion Understanding in Large-Scale Video Object Segmentation; Volodymyr Fedynyak, Yaroslav Romanus, Igor Babin, Roman Riazantsev, Oles Dobosevych
15. Re2TAL: Rewiring Pretrained Video Backbones for Reversible Temporal Action Localization; Chen Zhao, Shuming Liu, Karttikeya Mangalam, Bernard Ghanem
16. TimeBalance: Temporally-Invariant and Temporally-Distinctive Video Representations for Semi-Supervised Action Recognition; Ishan Rajendrakumar Dave, Mamshad Nayeem Rizve, Chen Chen, Mubarak Shah
17. Connecting Vision and Language with Video Localized Narratives; Paul Voigtlaender, Soravit Changpinyo, Jordi Pont-Tuset, Radu Soricut, Vittorio Ferrari


Prospective authors are invited to submit a regular paper of previously unpublished work (in CVPR paper format) or an extended abstract of published work. The review process will be double-blind, and all submissions will be peer-reviewed by the international program committee. Accepted papers will be presented as posters or contributed talks; they are considered non-archival and will be published via the Open Access versions provided by the Computer Vision Foundation. Accepted extended abstracts will be presented at the poster session.


June 18th


You can submit papers in two formats:

  1. We will accept papers presenting previously unpublished work. Accepted papers will appear in the CVPR workshop proceedings. These submissions follow a double-blind review process: authors do not know the names of the reviewers of their papers, and reviewers do not know the names of the authors. Authors must follow the CVPR 2023 submission policy. Papers are limited to eight pages, including figures and tables, in the CVPR style; additional pages containing only cited references are allowed. Please refer to the CVPR 2023 website for more information. Papers that are not properly anonymized, do not use the template, or exceed eight pages (excluding references) will be rejected without review. The paper submission deadline is March 30, and authors will be notified by April 10. Accepted papers must follow the CVPR 2023 camera-ready format per the instructions given here, but should be limited to 4-8 pages excluding references.
  2. For submissions of papers that have been published, or accepted for publication, at a recent venue (including CVPR 2023), we will follow a single-blind review process: authors do not know the names of the reviewers of their papers, but reviewers do know the names of the authors. Authors MUST indicate, in a footnote on the first page of their submission, the venue at which their paper has been or will be published. For example, if the paper will appear at CVPR 2023, the submission should include a footnote on the first page reading "To appear at the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition".
All papers must be formatted using the CVPR template style, which can be obtained from the CVPR style page.

Submit Your Work