Understanding human actions from video is a fundamental problem in computer vision, with applications in robotics, human-computer interaction, and assistive systems. Egocentric video understanding is particularly challenging due to viewpoint changes, occlusions, and complex object interactions. The EPIC-Kitchens dataset provides a large-scale benchmark for fine-grained action recognition in real-world kitchen environments, making it well suited for studying these challenges.

In this project, we aim to develop a video action recognition framework built on foundation models and transformer architectures. We will leverage pre-trained visual encoders (e.g., VideoMAE, TimeSformer, or CLIP-based models) and adapt them to EPIC-Kitchens for action classification and temporal understanding. The project will compare several strategies, including feature extraction versus full fine-tuning, temporal modeling with transformers, and efficient adaptation methods; a minimal sketch of the fine-tuning setup is included below.

Because the dataset is large and training video transformers is computationally expensive, substantial GPU resources are required for both training and experimentation.
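The following is a minimal, illustrative sketch (not the final pipeline) of the feature-extraction versus full fine-tuning comparison, using a VideoMAE backbone via Hugging Face Transformers and PyTorch. The checkpoint name, the number of classes, the `FREEZE_BACKBONE` toggle, and the clip shape are assumptions for illustration; the actual EPIC-Kitchens label space, dataloader, and evaluation are omitted.

```python
# Sketch: clip-level action classification with a VideoMAE backbone.
# Assumptions: "MCG-NJU/videomae-base" checkpoint, 97 target classes,
# and a dataloader yielding (clips, labels) with clips of shape
# (batch, 16 frames, 3, 224, 224). All of these are placeholders.
import torch
from torch.utils.data import DataLoader
from transformers import VideoMAEForVideoClassification

NUM_CLASSES = 97          # assumption: size of the target label space
FREEZE_BACKBONE = True    # True = feature extraction, False = full fine-tuning

model = VideoMAEForVideoClassification.from_pretrained(
    "MCG-NJU/videomae-base",   # assumed checkpoint; classification head is newly initialized
    num_labels=NUM_CLASSES,
)

if FREEZE_BACKBONE:
    # Feature-extraction regime: freeze the encoder, train only the head.
    for param in model.videomae.parameters():
        param.requires_grad = False

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)

def train_one_epoch(loader: DataLoader, device: str = "cuda") -> None:
    """One training pass over clips with integer action labels."""
    model.to(device).train()
    for clips, labels in loader:
        outputs = model(pixel_values=clips.to(device), labels=labels.to(device))
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

Toggling `FREEZE_BACKBONE` switches between the two regimes we plan to compare; the frozen-backbone setting is much cheaper and serves as a baseline, while full fine-tuning is where most of the requested GPU time would be spent.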