Develop mathematical methods to create synthetic, multi-modal, high-dimensional cohorts for studies of humans with neurodegenerative diseases. The synthetic, or “twin”, cohorts will represent the original data across a range of modalities, including clinical and cognitive testing, high-dimensional neuroimaging, fluid biomarker proteomics, and genomics. The aim of this project is twofold. First, we aim to accelerate neurodegenerative disease research by generating synthetic cohorts from existing observational data, enabling the creation of large, shareable datasets that can be freely accessed across institutions without the constraints of patient recruitment or legal and ethical barriers. Second, we aim to use these synthetic cohort models to uncover new associations across various aspects of neurodegenerative diseases through detailed analysis of the generated data.
The overall aim of this project is to generate synthetic data that accurately replicates the properties of the original observational data, so that it can serve as a reliable substitute for real clinical datasets. Beyond mirroring the characteristics of the original data, the synthetic data should be usable for scientific discovery, particularly through machine learning models and other mathematical frameworks. This will allow new insights into neurodegenerative diseases while overcoming the legal, ethical, and logistical barriers of working with real patient data.
The project will progress in a logical sequence:
1. We will first generate structural MRI (T1).
2. Next, we will sequentially incorporate additional metadata, including demographics, key biomarkers, cognitive assessments, and amyloid and tau PET summary measures.
3. We will then add multiple MRI modalities, such as FLAIR.
4. Finally, we will integrate longitudinal data, which will consist of both longitudinal images and longitudinal metadata.
We will evaluate our synthetic cohort(s) in several ways. We will use the synthetic data to test different classification and regression problems (e.g., the ability to distinguish patients from controls using MRI, or to predict continuous cognitive scores from continuous MRI features), and we will compare performance on these tasks in the synthetic data with performance in the original data. We will also evaluate the “privacy” of the synthetic data using different metrics and methods, to guard against generating data that is too similar to the original data.
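The two evaluations described above can be sketched on toy data as follows. This is a minimal illustration under stated assumptions, not the project's evaluation pipeline: the nearest-centroid classifier, the distance-to-closest-record check, the Gaussian toy features, and all function and variable names (`make_cohort`, `distance_to_closest_record`, `utility_gap`, and so on) are hypothetical choices made for this example. The "synthetic" cohort here is simply a fresh draw from the same toy distribution, standing in for the output of a generative model.

```python
import numpy as np

def nearest_centroid_fit(X, y):
    """Fit a toy classifier: one mean feature vector per class."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def nearest_centroid_predict(model, X):
    """Assign each row to the class with the closest centroid."""
    classes, centroids = model
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[d.argmin(axis=1)]

def accuracy(model, X, y):
    return float((nearest_centroid_predict(model, X) == y).mean())

def distance_to_closest_record(A, B):
    """For each row of A, Euclidean distance to its nearest row in B."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return d.min(axis=1)

def make_cohort(rng, n, shift=1.5, dim=5):
    """Toy stand-in for imaging features: label 1 = patient, 0 = control."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, dim)) + shift * y[:, None]
    return X, y

rng = np.random.default_rng(0)
X_train, y_train = make_cohort(rng, 200)   # "original" data used to build the model
X_test, y_test = make_cohort(rng, 200)     # held-out "original" data
X_syn, y_syn = make_cohort(rng, 200)       # assumed output of a generative model

# Utility: train on original vs. train on synthetic, test both on held-out original data.
acc_real = accuracy(nearest_centroid_fit(X_train, y_train), X_test, y_test)
acc_syn = accuracy(nearest_centroid_fit(X_syn, y_syn), X_test, y_test)
utility_gap = acc_real - acc_syn

# Privacy: synthetic records should not sit closer to the training records
# than genuinely unseen original records do.
dcr_syn = distance_to_closest_record(X_syn, X_train)
dcr_holdout = distance_to_closest_record(X_test, X_train)

# A deliberately leaky cohort that copies five training records verbatim.
X_leaky = np.vstack([X_syn[:-5], X_train[:5]])
n_copies = int((distance_to_closest_record(X_leaky, X_train) < 1e-8).sum())

print(f"accuracy (train on original):  {acc_real:.3f}")
print(f"accuracy (train on synthetic): {acc_syn:.3f}")
print(f"median DCR synthetic vs. holdout: "
      f"{np.median(dcr_syn):.3f} vs. {np.median(dcr_holdout):.3f}")
print(f"exact copies found in leaky cohort: {n_copies}")
```

A small `utility_gap` suggests that the synthetic cohort supports the task about as well as the original data. A synthetic distance distribution that is clearly shifted toward zero relative to the holdout baseline, or any exact copies, would indicate that the generator is reproducing individual records.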