Second International Workshop on Large Scale Holistic Video Understanding

In Conjunction with CVPR 2021

Holistic Video Understanding is a joint project of KU Leuven, the University of Bonn, KIT, ETH, and the HVU team.


In recent years, we have seen tremendous progress in the ability of computer systems to classify video clips taken from the Internet and to analyze human actions in videos. Much of the work in the video recognition field focuses on specific video understanding tasks, such as action recognition and scene understanding. While there have been great achievements on these individual tasks, holistic video understanding has not received enough attention as a problem in its own right. Current systems excel at specific sub-fields of the general video understanding problem. For real-world applications, however, such as analyzing the multiple concepts present in a video for video search engines and media monitoring systems, or modeling the environment surrounding a humanoid robot, a combination of current state-of-the-art methods would be required. In this workshop, we therefore introduce holistic video understanding as a new challenge for the video understanding community. The challenge focuses on the recognition of scenes, objects, actions, attributes, and events in real-world, user-generated videos. To support this task, we also introduce a new dataset, the Holistic Video Understanding (HVU) dataset, which is organized hierarchically in a semantic taxonomy for holistic video understanding. Almost all existing real-world video datasets target human action or sports recognition, so our new dataset can help the vision community and draw more attention to interesting solutions for holistic video understanding. The workshop is tailored to bringing together ideas around multi-label and multi-task recognition of different semantic concepts in real-world videos, and these research efforts can be evaluated on our new dataset.


The main objective of the workshop is to establish a video benchmark that integrates the joint recognition of all these semantic concepts, since a single class label per task is often not sufficient to describe the holistic content of a video. The planned panel discussion with the world's leading experts on this problem will provide fruitful input and a source of ideas for all participants. Further, we invite the community to help extend the HVU dataset, which will spur research in video understanding as a comprehensive, multi-faceted problem. As organizers, we expect to receive valuable feedback from users and the community on how to improve the benchmark.


  • Large-scale video understanding
  • Multi-modal learning from videos
  • Multi-concept recognition from videos
  • Multi-task deep neural networks for videos
  • Learning holistic representations from videos
  • Weakly supervised learning from web videos
  • Object, scene, and event recognition from videos
  • Unsupervised video visual representation learning
  • Unsupervised and self-supervised learning with videos


ET | CET    Description    Speaker/Paper ID
11:00 | 17:00    Opening remarks
11:10 | 17:10    Invited Speaker 1:    Joao Carreira
12:00 | 18:00    Invited Speaker 2:    Cordelia Schmid
12:50 | 18:50    Invited Speaker 3:    Dima Damen
13:40 | 19:40    Oral Session 1    1, 2, 3, 4, 5, 6
14:40 | 20:40    Break
15:10 | 21:10    Oral Session 2    7, 8, 9, 10, 11, 12
16:10 | 22:10    Invited Speaker 4:    Carl Vondrick
17:00 | 23:00    Invited Speaker 5:    Sanja Fidler
17:50 | 23:50    Invited Speaker 6:    Kristen Grauman
18:40 | 00:40    Conclusion

Oral Papers

  1. IntegralAction: Pose-driven Feature Integration for Robust Human Action Recognition in Videos; Gyeongsik Moon, Heeseung Kwon, Kyoung Mu Lee, Minsu Cho
  2. Rethinking Training Data for Mitigating Representation Biases in Action Recognition; Kensho Hara, Yuchi Ishikawa, Hirokatsu Kataoka
  3. MDMMT: Multidomain Multimodal Transformer for Video Retrieval; Maksim Dzabraev, Maksim Kalashnikov, Stepan A Komkov, Aleksandr Petiushko
  4. SAIL-VOS 3D: A Synthetic Dataset and Baselines for Object Detection and 3D Mesh Reconstruction from Video Data; Yuan-Ting Hu, Jiahong Wang, Raymond A Yeh, Alexander Schwing
  5. ObjectGraphs: Using Objects and a Graph Convolutional Network for the Bottom-up Recognition and Explanation of Events in Video; Nikolaos Gkalelis, Andreas Goulas, Damianos Galanopoulos, Vasileios Mezaris
  6. CoCon: Cooperative-Contrastive Learning; Nishant Rai, Ehsan Adeli, Kuan-Hui Lee, Adrien Gaidon, Juan Carlos Niebles
  7. Learning to Segment Actions from Visual and Language Instructions via Differentiable Weak Sequence Alignment; Yuhan Shen, Lu Wang, Ehsan Elhamifar
  8. Training vision transformers for image retrieval; Alaaeldin El-Nouby, Natalia Neverova, Ivan Laptev, Hervé Jégou
  9. Unidentified Video Objects: A Benchmark for Dense, Open-World Segmentation; Weiyao Wang, Matt Feiszli, Heng Wang, Du Tran
  10. Learning Video Representations from Textual Web Supervision; Jonathan C. Stroud, Zhichao Lu, Chen Sun, Jia Deng, Rahul Sukthankar, Cordelia Schmid, David A. Ross
  11. Parameter Efficient Multimodal Transformers for Video Representation Learning; Sangho Lee, Youngjae Yu, Gunhee Kim, Thomas Breuel, Jan Kautz, Yale Song
  12. Just Ask: Learning to Answer Questions from Millions of Narrated Videos; Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid


Prospective authors are invited to submit either a regular paper of previously unpublished work (in CVPR paper format) or an extended abstract of published work. The review process will be double blind, and all submissions will be peer-reviewed by the international program committee. Accepted papers will be presented as posters or contributed talks; they will be considered non-archival and published via the Open Access versions provided by the Computer Vision Foundation. Accepted extended abstracts will be presented in the poster session.


June 25th


You can submit papers in two different formats:

  1. We will accept papers that have not been published elsewhere, or that have recently been published elsewhere, including at CVPR 2021. Accepted papers will appear in the CVPR proceedings. These submissions follow a double-blind review process: authors do not know the names of the reviewers of their papers, and reviewers do not know the names of the authors. Authors must follow the CVPR 2021 submission policy. Papers are limited to eight pages in the CVPR style, including figures and tables; additional pages containing only cited references are allowed. Please refer to the CVPR 2021 website for more information. Papers that are not properly anonymized, do not use the template, or exceed eight pages (excluding references) will be rejected without review. The deadline for paper submission is March 31st, with notification to authors by April 14th. Accepted papers must follow the CVPR 2021 camera-ready format as per the instructions given here, but limited to 4-8 pages excluding references.
  2. Submissions of papers that have been published or accepted for publication at a recent venue follow a single-blind review process: authors do not know the names of the reviewers of their papers, but reviewers do know the names of the authors. Authors MUST indicate, in a footnote on the first page of their submission, the venue at which their paper has been or will be published. For example, if the paper will appear at CVPR 2021, the submission should include a footnote on the first page stating "To appear at the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition".
All papers must be formatted using the CVPR template, which can be obtained at CVPR style.

Submit Your Work

Program Committee

  • Abhishek Aich (University of California, Riverside)
  • AJ Piergiovanni (Google / Indiana University)
  • Akash Gupta (University of California, Riverside)
  • Alejandro Pardo (KAUST)
  • Amogh Gupta (Columbia University)
  • Ananda Chowdhury (Jadavpur University)
  • Anna Kukleva (MPII)
  • Bing Li (KAUST)
  • Boxiao Pan (Stanford University)
  • Boyuan Chen (Columbia University)
  • Bruno Korbar (Facebook)
  • Chengzhi Mao (Columbia University)
  • Evangelos Kazakos (University of Bristol)
  • Graham Taylor (University of Guelph)
  • Hazel Doughty (University of Amsterdam)
  • Jinwoo Choi (Kyung Hee University)
  • Karttikeya Mangalam (UC Berkeley)
  • Khoi-Nguyen Mac (UIUC)
  • Hilde Kuehne (IBM)
  • Michael Wray (University of Bristol)
  • Mingmin Zhao (MIT)
  • Mohammad Sabokrou (Institute for Research in Fundamental Sciences (IPM))
  • Mohammadreza Zolfaghari (University of Freiburg)
  • Nadine Behrmann (Bosch Center for Artificial Intelligence)
  • Nikita Araslanov (TU Darmstadt)
  • Pascal Mettes (University of Amsterdam)
  • Rameswar Panda (MIT-IBM Watson AI Lab, IBM Research)
  • Ruohan Gao (Stanford University)
  • Sachit Menon (Duke University)
  • Shyamal Buch (Stanford University)
  • Silvio Giancola (KAUST)
  • Tao Hu (University of Amsterdam)
  • Tengda Han (University of Oxford)
  • Weidi Xie (University of Oxford)
  • Yang Liu (University of Oxford)
  • Yingcheng Liu (MIT)
  • Yunhua Zhang (University of Amsterdam)
  • Zhicheng Yan (Facebook AI)