This proposal requests 20,000 GPU hours per month on NVIDIA A100 GPUs and 100 TB of Mimer storage to advance generative foundation models for 4D human animation and video generation at KTH’s Embodied Social Agents Lab (ESAL). Following recent changes in intellectual property regulations for industrial collaborations, our work is now conducted entirely at KTH. This enables open publication, release of code and models, and direct contribution to the academic community, but it also means we require dedicated large-scale computational infrastructure of our own rather than relying on industrial partners’ resources.
Our research plan builds on recently completed projects that demonstrate a strong scientific track record and a need for substantial GPU resources. AMUSE (CVPR 2024) introduced disentangled latent diffusion for emotional, speech-driven 3D body animation, providing independent control over linguistic content, emotion, and personal style through a three-way disentanglement strategy. SPECTRUM developed 3D texture-aware representations for parsing human clothing and body parts by repurposing image-to-texture diffusion models fine-tuned on 3D human texture maps, achieving robust alignment across diverse and previously unseen garments. Emotional 3D Humans (SIGGRAPH I3D 2025, Frontiers in Computer Science) used VR-based user studies with 48 participants to evaluate state-of-the-art generative models for emotional animation, showing that methods explicitly modeling emotions achieve significantly higher recognition accuracy than approaches focused primarily on speech synchrony. Synthetically Expressive (ACM IVA 2025, Best Paper Award) examined how synthesized speech and gestures are perceived in VR versus 2D displays, finding that VR enhances the perception of natural gesture–voice pairings but does not improve synthetic combinations. Audiopedia (ICASSP 2025, Oral) introduced knowledge-augmented audio language models for reasoning-heavy audio question answering, addressing the limitations of current large audio language models on knowledge-intensive tasks.
The requested compute will support training and fine-tuning large-scale diffusion models with 1 to 10 billion parameters for 4D human motion synthesis across multiple emotional and stylistic dimensions; post-training of foundation models via supervised fine-tuning, LoRA adaptation, direct preference optimization, and reinforcement learning from human feedback for controllable generation; video generation model training on large-scale motion capture datasets and multi-view video sequences; multimodal controllable generation with vision-language and large language models; and extensive ablation studies and hyperparameter optimization across multiple architectures. These experiments require long training runs, large batch sizes, and repeated evaluations on high-resolution 4D and video data, making A100-class hardware essential for practical iteration cycles and for remaining competitive with international research groups.
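To give a concrete sense of the parameter-efficient post-training workloads described above, the following is a minimal sketch of a LoRA adapter in plain PyTorch; the layer sizes, rank, and scaling factor are illustrative placeholders rather than our actual model configurations.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA adapter: y = W x + (alpha / r) * B(A x), with the base W frozen."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)      # freeze pretrained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)          # adapter starts as a no-op
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))

# Illustrative usage: wrap one projection layer of a (hypothetical) pretrained
# denoiser so that only the low-rank adapter weights are updated during post-training.
layer = nn.Linear(1024, 1024)
adapted = LoRALinear(layer, r=16, alpha=32.0)
out = adapted(torch.randn(2, 77, 1024))
trainable = sum(p.numel() for p in adapted.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")        # only the two low-rank matrices
```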
Our team has extensive experience with A100 and H100 GPU clusters for large-scale training, fine-tuning, and post-training of vision-language and language models, including efficient mixed-precision training, distributed data and model parallelism, and scalable experiment management. This expertise ensures that the requested resources will be used efficiently and translated into high-impact publications, open-source tools, and publicly released models that benefit both KTH and the wider research community, while providing a robust computational foundation for future collaborations and externally funded projects.
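As a hedged illustration of the mixed-precision, data-parallel setup we would run on the allocation, the sketch below shows a typical PyTorch DistributedDataParallel training loop in bfloat16; the model, data loader, and hyperparameters are placeholders and do not correspond to a specific project configuration.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train(model: torch.nn.Module, loader, num_steps: int = 1000):
    """Data-parallel bf16 training loop; launch with: torchrun --nproc_per_node=<gpus>."""
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])     # set by torchrun
    torch.cuda.set_device(local_rank)
    model = DDP(model.cuda(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)

    data_iter = iter(loader)
    for step in range(num_steps):
        batch = next(data_iter).cuda(local_rank, non_blocking=True)
        optimizer.zero_grad(set_to_none=True)
        # bfloat16 autocast: A100 tensor cores support bf16, so no loss scaling is needed
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            loss = model(batch).mean()             # placeholder objective
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()
```

In practice, runs at the 1 to 10 billion parameter scale would combine this data parallelism with sharded model parallelism (for example FSDP), which follows the same launch pattern.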