Holistic Video Understanding is a joint project of KU Leuven, the University of Bonn, KIT, ETH Zurich, and the HVU team.
In recent years, we have seen tremendous progress in the ability of computer systems to classify video clips taken from the Internet and to analyze human actions in videos. Many works in the video recognition field focus on specific video understanding tasks, such as action recognition or scene understanding, and great results have been achieved on these tasks. However, holistic video understanding has not received enough attention as a problem in its own right. Current systems are experts in specific subfields of the general video understanding problem, yet real-world applications, such as analyzing the multiple concepts of a video for video search engines and media monitoring systems, or describing the surrounding environment of a humanoid robot, would require a combination of current state-of-the-art methods.

Therefore, in this workshop we introduce holistic video understanding as a new challenge for the video understanding community. This challenge focuses on the recognition of scenes, objects, actions, attributes, and events in real-world user-generated videos. To support these tasks, we also introduce a new dataset, the Holistic Video Understanding (HVU) dataset, which is organized hierarchically in a semantic taxonomy for holistic video understanding. Almost all real-world video datasets target human action or sport recognition, so our new dataset can serve the vision community and draw more attention to interesting solutions for holistic video understanding. The workshop is tailored to bringing together ideas around multi-label and multi-task recognition of different semantic concepts in real-world videos, and these research efforts can be evaluated on our new dataset.
The main objective of the workshop is to establish a video benchmark integrating joint recognition of all these semantic concepts, as a single class label per task is often not sufficient to describe the holistic content of a video. The planned panel discussion with the world's leading experts on this problem will provide fruitful input and a source of ideas for all participants. Further, we invite the community to help extend the HVU dataset, which will spur research on video understanding as a comprehensive, multi-faceted problem. As organizers, we expect to receive valuable feedback from users and the community on how to improve the benchmark.
| ET    | CET   | Description       | Speaker/Paper ID     |
|-------|-------|-------------------|----------------------|
| 11:00 | 17:00 | Opening remarks   |                      |
| 11:10 | 17:10 | Invited Speaker 1 | Joao Carreira        |
| 12:00 | 18:00 | Invited Speaker 2 | Cordelia Schmid      |
| 12:50 | 18:50 | Invited Speaker 3 | Dima Damen           |
| 13:40 | 19:40 | Oral Session 1    | 1, 2, 3, 4, 5, 6     |
| 14:40 | 20:40 | Break             |                      |
| 15:10 | 21:10 | Oral Session 2    | 7, 8, 9, 10, 11, 12  |
| 16:10 | 22:10 | Invited Speaker 4 | Carl Vondrick        |
| 17:00 | 23:00 | Invited Speaker 5 | Sanja Fidler         |
| 17:50 | 23:50 | Invited Speaker 6 | Kristen Grauman      |
| 18:40 | 00:40 | Conclusion        |                      |
Prospective authors are invited to submit either a regular paper describing previously unpublished work (CVPR paper format) or an extended abstract of already published work. The review process is double blind, and all submissions will be peer-reviewed by the international program committee. Accepted papers will be presented as posters or contributed talks; they are considered non-archival and will be published via the Open Access versions provided by the Computer Vision Foundation. Accepted extended abstracts will be presented in the poster session.
You can submit papers in two different formats:
- Regular papers: previously unpublished work, in CVPR paper format
- Extended abstracts: summaries of already published work