SUPR
Deep learning-based proteomics data harmonization for neurodegenerative diseases
Dnr:

NAISS 2024/22-457

Type:

NAISS Small Compute

Principal Investigator:

Lijun An

Affiliation:

Lunds universitet

Start Date:

2024-04-08

End Date:

2025-05-01

Primary Classification:

30208: Radiology, Nuclear Medicine and Medical Imaging

Abstract

Neurodegenerative disorders affect over 57 million individuals globally, and effective treatment and diagnostic options are still lacking. Understanding the mechanisms that are specific to each disorder and those shared among them is crucial for improving diagnosis and treatment. The Global Neuroproteomics Consortium (GNPC) has collected over 40,000 patient samples from 20 international research groups, resulting in nearly 300 million unique protein measurements. Despite harmonization efforts in data collection and preprocessing, site effects in the multi-site GNPC data cannot be overlooked: they can bias analyses and lead to false positive discoveries. We therefore aim to develop state-of-the-art deep learning algorithms for the multi-site proteomics dataset to support researchers' work. In previous work, we developed a series of deep learning-based harmonization models for MRI data (An et al., 2022; An et al., 2024). In this project, we will develop a new conditional variational autoencoder (VAE) model to harmonize proteomics data according to each researcher's goals. For example, researchers could choose to preserve the effects of specific variables in the harmonized data. The latent representation of the proteomics data learned by the VAE will let researchers investigate this valuable dataset (nearly 300 million measurements) in a much lower-dimensional space (1000 dimensions). Overall, we expect our harmonization model to remove unwanted site variability from the proteomics data while preserving as much biological information as possible. The harmonized data will be made available to all global research teams.
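To make the conditional VAE idea in the abstract more concrete, the following is a minimal sketch of the forward pass and the harmonization step, not the project's actual model. It assumes that site is encoded as a one-hot vector, that the variables to preserve are passed as extra conditioning covariates, and that harmonization means decoding the latent mean with every sample mapped to a common reference site. All names (`ConditionalVAE`, `harmonize`, `reference_site`), the layer sizes, and the random toy data are hypothetical, and training (gradient updates on the loss) is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)


def one_hot(site_ids, n_sites):
    """Encode integer site labels as one-hot rows."""
    out = np.zeros((len(site_ids), n_sites))
    out[np.arange(len(site_ids)), site_ids] = 1.0
    return out


def linear(n_in, n_out):
    """Randomly initialised weights and zero bias for one dense layer."""
    return rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out)


class ConditionalVAE:
    """Untrained conditional VAE: forward pass only (hypothetical sketch)."""

    def __init__(self, n_proteins, n_sites, n_covariates, n_latent, n_hidden=32):
        n_cond = n_sites + n_covariates
        self.n_sites = n_sites
        self.enc_h = linear(n_proteins + n_cond, n_hidden)
        self.enc_mu = linear(n_hidden, n_latent)
        self.enc_logvar = linear(n_hidden, n_latent)
        self.dec_h = linear(n_latent + n_cond, n_hidden)
        self.dec_out = linear(n_hidden, n_proteins)

    def _condition(self, site_ids, covariates):
        # Condition vector = [site one-hot, covariates to preserve].
        return np.concatenate([one_hot(site_ids, self.n_sites), covariates], axis=1)

    def encode(self, x, cond):
        h = np.tanh(np.concatenate([x, cond], axis=1) @ self.enc_h[0] + self.enc_h[1])
        mu = h @ self.enc_mu[0] + self.enc_mu[1]
        logvar = h @ self.enc_logvar[0] + self.enc_logvar[1]
        return mu, logvar

    def decode(self, z, cond):
        h = np.tanh(np.concatenate([z, cond], axis=1) @ self.dec_h[0] + self.dec_h[1])
        return h @ self.dec_out[0] + self.dec_out[1]

    def loss(self, x, site_ids, covariates):
        """Squared-error reconstruction term plus KL divergence to N(0, I)."""
        cond = self._condition(site_ids, covariates)
        mu, logvar = self.encode(x, cond)
        # Reparameterisation trick: z = mu + sigma * eps.
        z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
        recon = np.mean(np.sum((x - self.decode(z, cond)) ** 2, axis=1))
        kl = np.mean(-0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar), axis=1))
        return recon + kl

    def harmonize(self, x, site_ids, covariates, reference_site=0):
        """Encode with the true site, decode as if measured at reference_site."""
        mu, _ = self.encode(x, self._condition(site_ids, covariates))
        ref_ids = np.full(len(x), reference_site)
        # Covariates keep their observed values, so their effect is preserved.
        return mu, self.decode(mu, self._condition(ref_ids, covariates))


# Toy usage with random data standing in for protein measurements.
n_samples, n_proteins = 6, 50
x = rng.standard_normal((n_samples, n_proteins))
site_ids = np.array([0, 0, 1, 1, 2, 2])
covariates = rng.standard_normal((n_samples, 2))  # e.g. age, sex (hypothetical)

model = ConditionalVAE(n_proteins, n_sites=3, n_covariates=2, n_latent=8)
latent, x_harmonized = model.harmonize(x, site_ids, covariates)
print(latent.shape, x_harmonized.shape)
```

In this sketch only the site part of the condition vector is swapped at decoding time, which is one simple way to let a researcher choose which variables are preserved; the low-dimensional `latent` array plays the role of the learned representation mentioned in the abstract.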