Understanding human actions from video is a fundamental problem in computer vision, with applications in robotics, human-computer interaction, and assistive systems. Egocentric video understanding is particularly challenging due to viewpoint changes, occlusions, and complex object interactions. The EPIC-Kitchens dataset provides a large-scale benchmark for fine-grained action recognition in real-world kitchen environments, making it well suited for studying these challenges.

In this project, we aim to develop a video action recognition framework built on foundation models and transformer architectures. We will leverage pre-trained visual encoders (e.g., VideoMAE, TimeSformer, or CLIP-based models) and adapt them to EPIC-Kitchens for action classification and temporal understanding. The project will compare several strategies, including feature extraction versus full fine-tuning, temporal modeling with transformers, and efficient adaptation methods; a minimal sketch of the fine-tuning setup is included below.

Because the dataset is large and training video transformers is computationally expensive, substantial GPU resources are required for both training and experimentation.
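The following is a minimal, illustrative sketch (not the final pipeline) of the feature-extraction versus full fine-tuning comparison, using a VideoMAE backbone via Hugging Face Transformers and PyTorch. The checkpoint name, the number of classes, the `FREEZE_BACKBONE` toggle, and the clip shape are assumptions for illustration; the actual EPIC-Kitchens label space, dataloader, and evaluation are omitted.

```python
# Sketch: clip-level action classification with a VideoMAE backbone.
# Assumptions: "MCG-NJU/videomae-base" checkpoint, 97 target classes,
# and a dataloader yielding (clips, labels) with clips of shape
# (batch, 16 frames, 3, 224, 224). All of these are placeholders.
import torch
from torch.utils.data import DataLoader
from transformers import VideoMAEForVideoClassification

NUM_CLASSES = 97          # assumption: size of the target label space
FREEZE_BACKBONE = True    # True = feature extraction, False = full fine-tuning

model = VideoMAEForVideoClassification.from_pretrained(
    "MCG-NJU/videomae-base",   # assumed checkpoint; classification head is newly initialized
    num_labels=NUM_CLASSES,
)

if FREEZE_BACKBONE:
    # Feature-extraction regime: freeze the encoder, train only the head.
    for param in model.videomae.parameters():
        param.requires_grad = False

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)

def train_one_epoch(loader: DataLoader, device: str = "cuda") -> None:
    """One training pass over clips with integer action labels."""
    model.to(device).train()
    for clips, labels in loader:
        outputs = model(pixel_values=clips.to(device), labels=labels.to(device))
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

Toggling `FREEZE_BACKBONE` switches between the two regimes we plan to compare; the frozen-backbone setting is much cheaper and serves as a baseline, while full fine-tuning is where most of the requested GPU time would be spent.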