Holistic Video Understanding is a joint project of KU Leuven, the University of Bonn, KIT, ETH, and the HVU team.
In recent years, the ability of computer systems to classify and analyze online videos has greatly improved. Significant advancements have been made in specific video recognition tasks, such as action and scene recognition. However, the comprehensive understanding of videos, known as holistic video understanding (HVU), has not received the attention it deserves. Current video understanding systems are specialized, focusing on narrow tasks. For real-world applications like video search engines, media monitoring systems, and humanoid robots that must perceive their environment, integrating state-of-the-art methods across these tasks is essential. To address this need, we are hosting a workshop focused on HVU. The workshop covers the recognition of scenes, objects, actions, attributes, and events in real-world videos. We are introducing our HVU dataset, organized hierarchically with a semantic taxonomy for holistic video understanding. While many existing datasets focus on human action or sport recognition, our new dataset aims to broaden the scope and draw attention to the potential for more comprehensive video understanding solutions. Our workshop will gather ideas related to multi-label and multi-task recognition in real-world videos, using our dataset to test and showcase research efforts.
The primary goal of this workshop is to create a comprehensive video benchmark that integrates the recognition of all semantic concepts. A single class label per task often falls short of capturing the full content of a video. Engaging with the world’s leading experts on this issue will provide invaluable insights and ideas for all participants. We also invite the community to contribute to the expansion of the HVU dataset, which will drive research in video understanding as a multifaceted problem. As organizers, we look forward to receiving constructive feedback from users and the community on how to enhance the benchmark.
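To make the multi-label, multi-task framing concrete: a single clip can carry scene, object, action, attribute, event, and concept annotations simultaneously. The snippet below is a minimal, hypothetical sketch (not an official HVU baseline) of how such a setup can be modeled in PyTorch, with one multi-label classifier per semantic category on top of pooled features from any video backbone; all names, dimensions, and label counts are illustrative assumptions.

```python
# Hypothetical sketch of a multi-task, multi-label head for HVU-style labels.
# Category names and label counts below are illustrative placeholders.
import torch
import torch.nn as nn

class MultiTaskVideoHead(nn.Module):
    def __init__(self, feature_dim, num_labels_per_task):
        super().__init__()
        # One independent linear classifier per semantic category.
        self.heads = nn.ModuleDict({
            task: nn.Linear(feature_dim, n)
            for task, n in num_labels_per_task.items()
        })
        # Multi-label objective: each label is an independent binary decision.
        self.criterion = nn.BCEWithLogitsLoss()

    def forward(self, video_features, targets=None):
        logits = {task: head(video_features) for task, head in self.heads.items()}
        if targets is None:
            return logits
        # Total loss is the sum over tasks; per-task weighting is also possible.
        loss = sum(self.criterion(logits[t], targets[t]) for t in logits)
        return logits, loss

# Example usage with made-up category sizes and random pooled clip features.
tasks = {"scene": 200, "object": 1500, "action": 700,
         "event": 70, "attribute": 100, "concept": 300}
head = MultiTaskVideoHead(feature_dim=2048, num_labels_per_task=tasks)
features = torch.randn(4, 2048)
targets = {t: torch.randint(0, 2, (4, n)).float() for t, n in tasks.items()}
logits, loss = head(features, targets)
```

Treating each category as its own head keeps the tasks decoupled at the output while still sharing the backbone representation, which is one common way to approach multi-task video recognition.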
| Time (PDT) | Description | Speaker | Title |
| --- | --- | --- | --- |
| 08:35 | Opening Remarks | | |
| 08:40 | Invited Speaker 1 | Angela Yao | Part I: VideoQA in the Era of LLMs |
| 09:15 | Invited Speaker 2 | Lu Yuan | Revolutionizing Computer Vision: The Power of Small Vision Language Foundation Models |
| 09:50 | Invited Speaker 3 | Cees Snoek | Learning to Generalize in Video Space and Time |
| 10:25 | Invited Speaker 4 | Yale Song | Procedural Activity Understanding |
| 11:00 | Oral Presentations | | See list below |
| 12:00 | Closing Remarks | | |
1. From Video Generation to Embodied AI; Ruoshi Liu · Carl Vondrick
2. MoReVQA: Exploring Modular Reasoning Models for Video Question Answering; Juhong Min · Shyamal Buch · Arsha Nagrani · Minsu Cho · Cordelia Schmid
3. Learning from One Continuous Video Stream; Joao Carreira · Michael King · Viorica Patraucean · Dilara Gokay · Catalin Ionescu · Yi Yang · Daniel Zoran · Joseph Heyward · Carl Doersch · Yusuf Aytar · Dima Damen · Andrew Zisserman
4. PEEKABOO: Interactive Video Generation via Masked-Diffusion; Yash Jain · Anshul Nasery · Vibhav Vineet · Harkirat Behl
5. Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language; Mark Hamilton · Andrew Zisserman · John Hershey · William Freeman
6. MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding; Bo He · Hengduo Li · Young Kyun Jang · Menglin Jia · Xuefei Cao · Ashish Shah · Abhinav Shrivastava · Ser-Nam Lim