Nicu Sebe from the University of Trento presented recent work on video generation, focusing on animating objects in a source image using external information like labels, driving videos, or text. He introduced a Learnable Game Engine (LGE) trained from monocular annotated videos, which maintains states of scenes, objects, and agents to render controllable viewpoints. Why it matters: This talk highlights advancements in cross-modal AI, potentially enabling new applications in gaming, simulation, and content creation within the region.
The paper introduces VENOM, a text-driven framework for generating high-quality unrestricted adversarial examples using diffusion models. VENOM unifies image content generation and adversarial synthesis into a single reverse diffusion process, enhancing both attack success rate and image quality. The framework incorporates an adaptive adversarial guidance strategy with momentum to ensure the generated adversarial examples align with the distribution of natural images.
A researcher at the University of Oxford presented new findings on 3D neural reconstruction. The talk introduced a dataset comprising real-world video captures with perfect 3D models. A novel joint optimization method refines camera poses during the reconstruction process. Why it matters: High-quality 3D reconstruction has broad applicability to robotics and computer vision applications in the region.
Egor Zakharov from ETH Zurich AIT lab will present research on creating controllable and detailed 3D head avatars using data from consumer-grade devices. The presentation will cover high-fidelity image-based facial reconstruction/animation and video-based reconstruction of detailed structures like hairstyles. He will showcase integrating human-centric assets into virtual environments for real-time telepresence and entertainment. Why it matters: This research contributes to advancements in digital human modeling and telepresence, with applications in communication and gaming within the region.
FancyVideo, a new video generator, introduces a Cross-frame Textual Guidance Module (CTGM) to enhance text-to-video models. CTGM uses a Temporal Information Injector and Temporal Affinity Refiner to achieve frame-specific textual guidance, improving comprehension of temporal logic. Experiments on the EvalCrafter benchmark demonstrate FancyVideo's state-of-the-art performance in generating dynamic and consistent videos, also supporting image-to-video tasks.
Gregory Chirikjian presented an overview talk on applying probability, harmonic analysis, and geometry to robotics, emphasizing the need for robots to function beyond traditional industrial programming. He discussed a new approach where robots define affordances of objects, using simulation to 'imagine' object use and enabling reasoning about novel objects. Probabilistic methods on Lie-groups, initially developed for mobile robot state estimation, are now adapted for one-shot learning of affordances, with plans to integrate large language models. Why it matters: This research direction aims to enhance robot intelligence and adaptability, crucial for service robots in dynamic environments and aligning with broader goals of advanced AI integration in robotics.