Machine learning models for predicting permeability of macrocycles

NAISS 2023/22-253


NAISS Small Compute

Principal Investigator:

Saw Simeon


Uppsala universitet

Start Date:


End Date:


Primary Classification:

30103: Medicinal Chemistry




Macrocycles are advantageous ligands for difficult-to-drug targets such as protein-protein interactions (PPIs). To reach the intracellular environment and to enable oral intake, they must be cell permeable. However, estimating cell permeability is a challenge, particularly for large macrocycles in the beyond-"Rule of 5" (bRo5) range. Machine learning was recently shown to classify the cell permeability of macrocycles with restricted structural variability effectively, and machine learning algorithms can help design macrocycles with desired properties, such as decreased toxicity and increased uptake efficiency. Rzepiela et al. investigated the effects of macrocyclic conformation on the passive membrane permeability of synthetic macrocycles; their results showed that macrocycles with a higher degree of planarity and more rigid conformations had decreased membrane permeability. Williams-Noonan et al. provide numerous design guidelines derived from machine learning to help researchers create effective and safe membrane-permeating macrocycles for drug delivery. In this context, the development of in silico models is of great importance, as these can drastically reduce the number of required in vitro tests. Furthermore, because computational predictions are fast, the time scales involved are strongly reduced too. This project will build on these results and use a larger, more reliable, and more diverse dataset to predict macrocycle cell permeability quickly and accurately. To this end, we will use more than 3000 macrocycles from the dataset and employ various machine learning algorithms (decision trees, support vector machines, Gaussian processes, deep neural networks, partial least squares, and random forests) to develop our predictive models. The data will be randomly split into training and test sets, and the process will be repeated 100 times with different random seeds to avoid bias from any single seed.
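The repeated random-split procedure described above can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic, a one-descriptor least-squares line stands in for the actual learning algorithms, the 20% test fraction is an arbitrary choice, and all function names are hypothetical rather than taken from the project.

```python
import random
import statistics


def split_indices(n, test_fraction, seed):
    """Shuffle indices 0..n-1 with the given seed and split them into train/test."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = max(1, round(n * test_fraction))
    return idx[n_test:], idx[:n_test]


def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b (placeholder for a real model)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)) ** 0.5


def repeated_evaluation(xs, ys, n_repeats=100, test_fraction=0.2):
    """Repeat split, fit, and test with a different seed each time."""
    scores = []
    for seed in range(n_repeats):
        train, test = split_indices(len(xs), test_fraction, seed)
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        preds = [a * xs[i] + b for i in test]
        scores.append(rmse([ys[i] for i in test], preds))
    return scores


# Synthetic stand-in data: one descriptor and a noisy permeability-like response.
data_rng = random.Random(0)
xs = [data_rng.uniform(0, 10) for _ in range(200)]
ys = [-0.5 * x + data_rng.gauss(0, 1) for x in xs]

scores = repeated_evaluation(xs, ys)
print(f"test RMSE over {len(scores)} splits: "
      f"mean={statistics.fmean(scores):.3f}, sd={statistics.stdev(scores):.3f}")
```

Reporting the mean and spread of the test error over all repeats, rather than the result of one split, is what makes the estimate independent of a particular seed.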
To assess model performance, we will use the correlation coefficient, root mean squared error, and mean absolute error, together with Y-scrambling. We will also investigate feature importance through mechanistic interpretation of the best model. In addition, we will evaluate the model's domain of applicability by assessing its performance in different regions of chemical space and checking whether it predicts permeability reliably in each of them. This will allow us to determine the boundaries of the model's applicability and to decide whether the model can be reliably used in future experiments.
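The performance metrics and the Y-scrambling check mentioned above could look roughly like this. It is a minimal sketch, not the project's implementation: the data are synthetic, a one-descriptor least-squares line again stands in for the real model, and the function names and the choice of 50 permutations are assumptions made for illustration.

```python
import random
import statistics


def pearson_r(a, b):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    denom = (va * vb) ** 0.5
    return cov / denom if denom > 0 else 0.0


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)) ** 0.5


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def fit_predict(xs, ys):
    """Fit y = a*x + b by least squares and return the fitted values (placeholder model)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return [a * x + (my - a * mx) for x in xs]


def y_scrambling(xs, ys, n_permutations=50, seed=0):
    """Compare r^2 on the real responses with r^2 after shuffling the responses."""
    rng = random.Random(seed)
    r2_real = pearson_r(ys, fit_predict(xs, ys)) ** 2
    r2_scrambled = []
    for _ in range(n_permutations):
        shuffled = ys[:]
        rng.shuffle(shuffled)  # breaks the descriptor-response relationship
        r2_scrambled.append(pearson_r(shuffled, fit_predict(xs, shuffled)) ** 2)
    return r2_real, r2_scrambled


# Synthetic stand-in data: one descriptor and a noisy permeability-like response.
data_rng = random.Random(0)
xs = [data_rng.uniform(0, 10) for _ in range(200)]
ys = [-0.5 * x + data_rng.gauss(0, 1) for x in xs]

r2_real, r2_scrambled = y_scrambling(xs, ys)
print(f"r2 (real y) = {r2_real:.3f}, max r2 (scrambled y) = {max(r2_scrambled):.3f}")
```

If models refitted on scrambled responses scored about as well as the real model, the apparent performance would likely reflect chance correlation rather than a genuine structure-permeability relationship.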