Holistic Video Understanding is a joint project of KU Leuven, the University of Bonn, KIT, ETH, and the HVU team.
This workshop aims to advance the field of video understanding by fostering discussions around holistic and generalist video foundation models. Building upon the Holistic Video Understanding (HVU) initiative and dataset introduced in 2019, we have successfully organized eight HVU workshops and tutorials at top-tier venues such as CVPR and ICCV, uniting researchers, practitioners, and students from around the world. These efforts have played a central role in moving the community beyond narrow action recognition tasks toward multi-faceted, semantic, and generalist video understanding.
With the emergence of large-scale foundation models and video large language models (Video-LLMs), the landscape of video understanding is rapidly evolving. These models enable unified reasoning across spatial, temporal, and multimodal dimensions, yet introduce new challenges in scalability, efficiency, interpretability, and responsible deployment.
The HVU Workshop 2025 will provide a platform to explore these frontiers, discussing topics such as multimodal representation learning, long-context reasoning, evaluation of general-purpose video systems, efficient adaptation and scaling laws, and the ethical and societal implications of video AI. Our goal is to bring together a diverse and inclusive community to define the next chapter of holistic, generalist, and responsible video understanding.
In recent years, the rise of large-scale foundation models and video large language models (Video-LLMs) has transformed how we approach video understanding. Instead of task-specific networks, we now see generalist models capable of reasoning over long temporal horizons, aligning multiple modalities (vision, language, and audio), and adapting to a wide range of downstream tasks. These developments open new opportunities but also introduce challenges in scalability, efficiency, evaluation, and responsible deployment. In this edition of the HVU Workshop, we aim to bring together researchers and practitioners to discuss the advances and open questions in this new era of Video Foundation Models. Building on the legacy of holistic video understanding, this workshop focuses on unified architectures, training methodologies, and evaluation paradigms for generalist video models.
| Local Time | Description | Speaker |
| --- | --- | --- |
| 08:50 | Opening Remarks | |
| 09:00 | Invited Speaker 1 | TBA |
| 09:30 | Invited Speaker 2 | TBA |
| 10:00 | Poster Session and Coffee Break | |
| 10:30 | Invited Speaker 3 | TBA |
| 11:00 | Invited Speaker 4 | TBA |
| 11:30 | Contributed Oral Presentations | |
| 12:00 | Panel Discussion | |
| 12:45 | Closing Remarks | |
Prospective authors are invited to submit either a regular paper describing previously unpublished work (NeurIPS 2025 paper format) or an extended abstract of already published work. The review process will be double-blind, and all submissions will be peer-reviewed by the international program committee. Accepted papers will be presented as posters or contributed talks. Accepted extended abstracts will also be presented at the poster session or as oral presentations.
You can submit papers in two different formats: