SUPR
Gradient learning of heterogeneity in cryo-EM
Dnr:

NAISS 2024/22-387

Type:

NAISS Small Compute

Principal Investigator:

Björn Forsberg

Affiliation:

Karolinska Institutet

Start Date:

2024-03-13

End Date:

2024-08-01

Primary Classification:

10203: Bioinformatics (Computational Biology) (applications to be 10610)

Allocation

Abstract

Processing structural biology data from cryo-electron microscopy (cryo-EM) experiments is a computationally demanding image-processing task. GPUs were therefore adopted early to accelerate this processing, which has enabled cryo-EM to become the de facto method for structural characterization of, e.g., membrane proteins and viruses, as well as biomolecules traditionally amenable to high-resolution X-ray crystallography. Further developments have since reduced the necessary computation. However, because the processing output lacks robustness and explainability, large-scale resources and computations are still required to assure its fidelity. Unfortunately, automated methods for such analyses have not been established, so structural data are used as unquantifiable support for biological hypotheses, following ad hoc and subjective user interpretation of the processing output. Rather than working to rectify this deficiency, more recent method developments instead aim to extend the analysis further by appending agnostic analysis of the processing output, with the goal of discovering statistically significant modes of co-variation in the data. This can clarify data heterogeneity and discover meaningful sub-populations within the data, but it is still subject to stochastic variation, user interpretation, and unquantifiable error. Crucially, it is also contingent on successful prior processing, which can be compromised by data variations. Our research aims to re-develop the capabilities of existing software in an extensible ML framework that incorporates the co-variance analysis into the processing workflow and also permits us to employ more informative priors based on both first principles and learned features. The aim is to make the co-variance analysis useful during data processing and to quantify heterogeneity according to explicit causal models.
This ML framework will thus be employed to improve both the fidelity and the explainability of structural biology data, and to permit explicit hypothesis testing against cryo-EM data. The ultimate goal is to incorporate the co-variance analysis into the processing framework and to employ more informative priors that render the co-variance analysis less agnostic.