Introduction:
High-resolution mass spectrometry (HRMS) is widely employed for molecular identification and structural characterization, yet only 1–20% of spectra acquired from human or environmental samples match existing HRMS/MS libraries. Reducing human exposure to hazardous substances requires more effort to identify potentially toxic, persistent, bioaccumulative, or environmentally mobile substances in complex samples. A prior proof-of-concept study (MS2Tox) showed that machine learning can predict the acute toxicity of unidentified substances from molecular fingerprints derived from in-silico software workflows. Because HRMS MS2 fragmentation patterns (e.g., mass-to-charge ratio (m/z) and intensity) encode structural information in a machine-readable format, our approach uses the spectral information directly rather than predicting intermediate structures, which reduces error propagation and enables property prediction for unknown substances. Beyond toxicity, early identification of persistent (resistant to degradation) and mobile (prone to transport via porewater flow) substances is essential for proactive risk management.
Aim:
Our study aims to develop deep learning models that predict environmental persistence and mobility directly from MS2 data acquired in human or environmental samples, enabling rapid prioritization or flagging of hazardous chemicals in untargeted MS datasets. The specific objectives are: (1) Model development: Train and optimize models using curated HRMS MS2 data for compounds with known persistence and mobility. (2) Benchmarking: Compare model performance with traditional quantitative structure-property relationship (QSPR) approaches that use predicted molecular fingerprints. (3) Model validation: Apply trained models to unannotated MS2 spectra and assess robustness through subsequent chemical annotation. (4) Implementation: Develop a user-friendly graphical interface to facilitate model application and generalization.
Methods:
HRMS MS2 data were sourced from MassBank (release 2025.10), and persistence and mobility endpoints were taken from the framework of Arp and Hale, which is based on degradation half-life and organic carbon–water partitioning. Only compounds with both MS2 data and annotated endpoints were included. Data were stratified by ionization mode and partitioned into training, validation, and testing sets.
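The stratified partitioning described above can be sketched as follows. This is a minimal illustration, not the study's actual pipeline: the record field `ion_mode`, the 80/10/10 split fractions, and the fixed seed are assumptions, since the abstract does not state them.

```python
import random
from collections import defaultdict


def stratified_split(records, key="ion_mode", fractions=(0.8, 0.1, 0.1), seed=0):
    """Split records into train/validation/test sets within each stratum.

    `key` and `fractions` are illustrative assumptions, not values from the study.
    """
    rng = random.Random(seed)

    # Group records by stratum (here: ionization mode).
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)

    train, val, test = [], [], []
    for stratum in sorted(groups):
        group = groups[stratum]
        rng.shuffle(group)
        n_train = int(len(group) * fractions[0])
        n_val = int(len(group) * fractions[1])
        train.extend(group[:n_train])
        val.extend(group[n_train:n_train + n_val])
        test.extend(group[n_train + n_val:])  # remainder goes to the test set
    return train, val, test


# Toy records standing in for curated spectra (hypothetical data).
records = [
    {"id": i, "ion_mode": "positive" if i % 2 == 0 else "negative"}
    for i in range(40)
]
train, val, test = stratified_split(records)
```

Splitting within each ionization mode keeps the proportion of positive and negative spectra the same in all three sets. If several spectra exist per compound, splitting at the compound level would additionally prevent the same compound from appearing in more than one set; the abstract does not specify which level was used.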
Spectra were encoded as ordered m/z–intensity pairs together with the precursor m/z and processed using a transformer architecture inspired by MS2Prop. Models were trained by minimizing mean squared error for continuous endpoints and cross-entropy for categorical endpoints. Performance was evaluated using mean absolute error for continuous endpoints and balanced error rate for categorical endpoints. A benchmark model using molecular fingerprints derived from MS2 data was implemented for comparison. Finally, the trained model was applied to thousands of molecular features in untargeted HRMS datasets (strategic water monitoring and experimental trials), and predictions were checked against empirical measurements and PubChem records.
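To make the encoding and evaluation steps concrete, the sketch below shows one plausible way to turn a spectrum into a fixed-length sequence of m/z–intensity tokens and to compute the two reported metrics. It is an illustration under stated assumptions: the function names, the ascending m/z ordering, base-peak normalization, zero padding, the `max_peaks` limit, and the choice to place the precursor as the first token are all hypothetical and not taken from MS2Prop or from the study.

```python
def encode_spectrum(peaks, precursor_mz, max_peaks=64):
    """Encode a spectrum as a fixed-length list of (m/z, relative intensity) tokens.

    Assumed layout: precursor token first, then fragment peaks in ascending
    m/z order, then zero padding up to max_peaks + 1 tokens.
    """
    # Keep the most intense peaks, then normalize to the base peak.
    top = sorted(peaks, key=lambda p: p[1], reverse=True)[:max_peaks]
    base = max((intensity for _, intensity in top), default=1.0) or 1.0
    ordered = sorted(top)  # ascending m/z
    tokens = [(precursor_mz, 1.0)]
    tokens += [(mz, intensity / base) for mz, intensity in ordered]
    tokens += [(0.0, 0.0)] * (max_peaks + 1 - len(tokens))
    return tokens


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference, used for continuous endpoints."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_error_rate(y_true, y_pred):
    """Mean of per-class error rates, used for categorical endpoints."""
    classes = sorted(set(y_true))
    errors = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        wrong = sum(1 for i in idx if y_pred[i] != c)
        errors.append(wrong / len(idx))
    return sum(errors) / len(classes)


# Toy spectrum and predictions (hypothetical values).
tokens = encode_spectrum(
    [(150.0, 20.0), (91.05, 100.0), (119.05, 50.0)],
    precursor_mz=181.07,
    max_peaks=4,
)
mae = mean_absolute_error([1.0, 2.0, 4.0], [1.5, 2.0, 3.0])
ber = balanced_error_rate([1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 0, 1])
```

Balanced error rate averages the error within each class, so a model cannot score well on an imbalanced endpoint (for example, few persistent compounds among many non-persistent ones) simply by predicting the majority class.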
Expected Results:
We expect MS2-based models to outperform fingerprint-based approaches by avoiding the propagation of structural annotation errors, and we expect compounds predicted as persistent or mobile to correspond to known substances documented in the literature or in databases.
Discussion:
This work introduces a novel strategy for predicting persistence and mobility directly from MS2 spectra, advancing the applicability of quantitative spectra–activity relationships (QSpAR) in nontarget HRMS analysis. By eliminating structural inference, the approach enables untargeted screening of unknown compounds and could be extended to other physicochemical, environmental, and biologically relevant endpoints.