Holistic Video Understanding is a joint project of KU Leuven, the University of Bonn, KIT, ETH, and the HVU team.
In recent years, the ability of computer systems to classify and analyze online videos has improved significantly. Major advances have been made on specific video recognition tasks, such as action and scene recognition. However, the comprehensive understanding of videos, known as holistic video understanding (HVU), has not received the attention it deserves. Current video understanding systems are specialized and focus on narrow tasks. Real-world applications such as video search engines, media monitoring systems, and environment perception for humanoid robots require the integration of state-of-the-art methods across these tasks. To address this need, we are hosting a workshop on HVU, covering the recognition of scenes, objects, actions, attributes, and events in real-world videos. We also introduce our HVU dataset, organized hierarchically under a semantic taxonomy for holistic video understanding. While most existing datasets focus on human action or sport recognition, our dataset broadens the scope and draws attention to the potential of more comprehensive video understanding solutions. The workshop will gather ideas on multi-label and multi-task recognition in real-world videos, using our dataset to evaluate and showcase research efforts.
The primary goal of this workshop is to establish a comprehensive video benchmark that integrates the recognition of all of these semantic concepts, since a single class label per task often falls short of capturing the full content of a video. Engaging with the world's leading experts on this problem will provide invaluable insights and ideas for all participants. We also invite the community to contribute to the expansion of the HVU dataset, which will drive research on video understanding as a multifaceted problem. As organizers, we look forward to constructive feedback from users and the community on how to improve the benchmark.