In this project, we focus on egocentric video action recognition: given an egocentric video, the task is to predict its action label. We develop a causality-based deep learning method for this task. Our method combines vision-language models (VLMs), video transformers, and causal variational autoencoders (VAEs). We use VLMs to extract language descriptions of the videos and train video transformers to obtain features representing the verb, the noun, and the action. The causal VAEs then learn the causal relationships among the language, verb, noun, and action representations.
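
Below is a minimal sketch of one way the components could be wired together. All class and variable names (e.g. `CausalVAE`, `lang_feat`, `verb_feat`, `noun_feat`) are hypothetical placeholders, not the project's actual implementation: it assumes a frozen VLM supplies a per-video language embedding, a video transformer supplies verb/noun features, and a conditional VAE models the dependency of the action feature on them.

```python
# Hypothetical sketch: a conditional VAE linking language, verb, noun, and action features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalVAE(nn.Module):
    """Encodes (language, verb, noun) features into a latent variable z and
    decodes an action feature from z together with the verb/noun context."""
    def __init__(self, feat_dim=512, latent_dim=64, num_actions=97):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim * 3, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + feat_dim * 2, 256), nn.ReLU(),
            nn.Linear(256, feat_dim),
        )
        self.classifier = nn.Linear(feat_dim, num_actions)

    def forward(self, lang_feat, verb_feat, noun_feat):
        # Encode the joint context of language, verb, and noun features.
        h = self.encoder(torch.cat([lang_feat, verb_feat, noun_feat], dim=-1))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick for sampling the latent variable.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        # Decode an action feature conditioned on z and the verb/noun features.
        action_feat = self.decoder(torch.cat([z, verb_feat, noun_feat], dim=-1))
        logits = self.classifier(action_feat)
        # KL divergence of the approximate posterior from a standard normal prior.
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return logits, kl

# Toy usage: random tensors stand in for VLM / video-transformer outputs.
batch, dim = 4, 512
model = CausalVAE()
logits, kl = model(torch.randn(batch, dim), torch.randn(batch, dim), torch.randn(batch, dim))
labels = torch.randint(0, 97, (batch,))
loss = F.cross_entropy(logits, labels) + 0.1 * kl  # classification + KL regularization
loss.backward()
```

In this sketch the action feature is generated conditionally on the verb and noun features, which is one simple way to encode a directed dependency structure; the actual causal model in the project may differ.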