
Conquering Chemical Space

Optimization of Docking Libraries through Interconnected Molecular Features

Leonard Sparring

Degree project in bioinformatics, 2020

Examensarbete i bioinformatik 45 hp till masterexamen, 2020

Biology Education Centre and Department of Cell and Molecular Biology, Uppsala University

Supervisors: Jens Carlsson and Andreas Luttens


Abstract

The development of new pharmaceuticals is a long and arduous process that typically requires more than 10 years from target identification to approved drug. This process often relies on high-throughput screening of molecular libraries.

However, this is a costly and time-intensive approach and the selection of molecules to screen is not obvious, especially in relation to the size of chemical space, which has been estimated to consist of 10^60 compounds. To accelerate this exploration, molecules can be obtained from virtual chemical libraries and tested in silico using molecular docking.

Still, such methods are incapable of handling the increasingly colossal virtual libraries, which currently reach into the billions. As the libraries continue to expand, a pre-selection of compounds will be necessary to allow accurate docking predictions.

This project aims to investigate whether the search for ligands in vast molecular libraries can be made more efficient with the aid of classifiers extended with the conformal prediction framework. This is also explored in conjunction with a fragment-based approach, where information from smaller molecules is used to predict larger, lead-like molecules.

The methods in this project are retrospectively tested with two clinically relevant G protein-coupled receptor targets, A2A and D2. Both of these targets are involved in devastating diseases, including Parkinson's disease and cancer.

The framework developed in this project has the capacity to reduce a chemical library of > 170 million molecules tenfold, while retaining 80 % of the molecules scoring among the top 1 % of the entire library. Furthermore, it is also capable of finding known ligands. This will allow the reduction of ultra-large chemical libraries to manageable sizes, and will allow increased sampling of the selected molecules. Moreover, the framework can be used as a modular extension on top of almost any classifier. The fragment-based approaches that were tested in this project performed unreliably and will be explored further.


Conquering Chemical Space

Popular Science Summary

Leonard Sparring

Drug development is a long and expensive process. The typical first step is the identification of a receptor that is involved in a particular disease process. Thereafter, it usually takes more than 10 years before a drug against the receptor can be developed and approved. The early part of this process is often carried out with automated screening of molecular libraries. Here, the aim is to find a molecule that binds to the receptor, a so-called ligand, which can then be optimized into a more drug-like compound. This is, however, a costly and time-consuming approach, and the choice of molecules to examine is not obvious, especially in relation to the size of chemical space, which has been estimated to consist of 10^60 compounds. To accelerate this search, molecules can be tested with computer simulations using so-called molecular docking, in which molecules are fitted against 3D models of a receptor.

If the molecule and the receptor fit well together, the molecule receives a good score. The molecules for these simulations are obtained from virtual chemical libraries that contain commercially available and pharmacologically relevant molecules.

These molecules have not necessarily been synthesized before; instead, they are generated by combining molecular building blocks through a number of permitted chemical reactions.

These libraries are growing and have now reached several billion molecules, sizes that even molecular docking is unable to handle. Now, and especially as the libraries continue to expand, a pre-selection of compounds will have to be made to reduce the time spent examining irrelevant molecules and to enable more accurate scoring in molecular docking.

This work aims to investigate whether the search for ligands in large molecular libraries can be made more efficient using classifiers extended with the conformal prediction framework. This framework allows control over how many well-docking molecules are retained in the reduced library. The project also examines a fragment-based approach, in which information from smaller molecules is used to predict larger, more drug-like molecules.

The methods in this project are tested retrospectively on two clinically relevant receptors, the adenosine receptor A2A and the dopamine receptor D2. Both of these receptors are interesting targets for devastating diseases, for example Parkinson's disease and cancer.

The method developed in this project has the capacity to reduce a chemical library of > 170 million molecules tenfold, while retaining 80 % of the molecules ranked in the top percentile of the entire library with respect to docking score. In addition, the method can correctly predict known ligands for both of the tested receptors. This enables the reduction of ultra-large chemical libraries to manageable sizes and will allow increased sampling of the selected molecules. Furthermore, the method can be used as a modular extension on top of almost any classifier. The fragment-based approaches tested in this project were unstable for the two tested receptors, and other methods should be investigated further.

Degree project in bioinformatics, 2020

Examensarbete i bioinformatik 45 hp till masterexamen, 2020
Department of Cell and Molecular Biology, ICM

Supervisors: Jens Carlsson & Andreas Luttens


Contents

1 Introduction
  1.1 Background
    1.1.1 High Throughput Screening & Commercial Chemical Libraries
    1.1.2 Virtual Screening
    1.1.3 Machine Learning in Cheminformatics
    1.1.4 Conformal Prediction
    1.1.5 Adenosine A2A Receptor & Dopamine D2 Receptor
  1.2 Purpose

2 Methods
  2.1 AMCP Architecture & Classification
  2.2 Descriptive Features
  2.3 Retrospective Prediction of Known Ligands & Docked Compounds
  2.4 Lead-Likes Predicted from Fragments
  2.5 Adaptive Learning & Prediction
  2.6 Heavy Atom Count Dependence of Required Training Set

3 Results
  3.1 Retrospective Prediction of Known Ligands & Docked Compounds
  3.2 Lead-Likes Predicted from Fragments
  3.3 Adaptive Learning & Prediction
  3.4 Heavy Atom Count Dependence of Required Training Set

4 Discussion
  4.1 Retrospective Predictions
  4.2 Choice of Features
  4.3 Heavy Atom Count Dependence of Required Training Size
  4.4 Fragment-Based Conformal Prediction
  4.5 Prospective Use
  4.6 AMCP Framework
  4.7 Conclusions & Future Directions

5 Acknowledgements

References

A Appendix: Test Set Distributions

B Appendix: Adaptive Predictions

C Appendix: Predictions by Choice of Descriptive Features


Abbreviations

A2AR . . . Adenosine A2A Receptor

AMCP . . . Aggregated Mondrian Conformal Prediction

AUC . . . Area Under the Curve

CDDD . . . Continuous and Data-Driven Descriptors

cLogP . . . Calculated Logarithm of the Partition coefficient

D2R . . . Dopamine D2 Receptor

DUD-E . . . Directory of Useful Decoys, Enhanced

ECFP4 . . . Extended Connectivity Fingerprint, up to four bonds

FWOK . . . Fragment Wait-OK

GPCR . . . G Protein-Coupled Receptor

HTS . . . High-Throughput Screening

IID . . . Independently Identically Distributed

InChI . . . International Chemical Identifier

LSD . . . Large-Scale Docking screen

LLWOK . . . Lead-Like Wait-OK

NMR . . . Nuclear Magnetic Resonance

ROC . . . Receiver Operating Characteristic

SMILES . . . Simplified Molecular Input Line Entry Specification


1 Introduction

1.1 Background

1.1.1 High Throughput Screening & Commercial Chemical Libraries

Modern drug discovery typically starts with the identification and validation of a biological target that is involved in a particular disease. This initiates the search for active molecules that are capable of binding to the target to modulate its activity. When information about the target is sparse, high-throughput screening (HTS) is a common approach in this endeavour: a large molecular library is screened against the target in a given assay. The goal is to efficiently find compounds that elicit a response in the assay, so-called hits. These are subsequently refined into analogous lead compounds and, if they show promise in other assays, optimized into more selective drug-like molecules for pre-clinical and clinical testing. This optimization often results in an increase in the lipophilicity and molecular weight of the hit. Typically, libraries of 100,000 to 1,000,000 compounds are screened in HTS campaigns (Brown & Boström 2018), with hit rates ranging from 0.01 % to 0.14 % (Zhu et al. 2013). A major challenge in HTS is the design of the molecular library.

This is especially so when considering the size of chemical space, which has been estimated to consist of 10^60 molecules (Kirkpatrick & Ellis 2004) (see figure 1). The capacity of HTS also appears minuscule in relation to the number of commercially available compounds. These include molecules that have not necessarily been synthesized before but are synthetically tractable. They are catalogued in virtual chemical libraries that attempt to capture all compounds that are available on demand. These libraries are generally generated with a set of building blocks, chemical reactions and limitations on the end products (Walters 2019). The generated molecules are often confined to Lipinski's rule of five (Lipinski et al. 2001). That is, their molecular weight is restricted to < 500 Da, their calculated octanol-water partition coefficient (cLogP) to < 5 and their numbers of hydrogen-bond acceptors and donors to < 5. The libraries are not static in size and have grown dramatically over recent years. One example is the ZINC database (Sterling & Irwin 2015), which aims to catalogue all biologically relevant molecules for sale in 2D and 3D representations. It has grown from 120 million compounds in 2015 to more than 900 million compounds in 2020. Likewise, the pharmaceutical vendor Enamine provides four billion molecules in its searchable REAL space (Hoffmann & Gastreich 2019). These virtual chemical libraries and spaces represent a huge opportunity and challenge in the field of drug discovery.


Figure 1: Chemical space. Chemical space is enormous, and the number of experimentally screened compounds represents only a fraction of the commercially available compounds.

One approach to these library sizes is fragment-based drug discovery (see figure 2). Initially, a small library with compounds of 150 Da is screened. The number of non-hydrogen atoms of these molecules, also referred to as the heavy atom count (HAC), is typically ≤ 20. Thus, combinatorial expansion of chemical libraries will not affect these molecules to the same extent as it has, and will continue to do, for the larger lead-like and drug-like compounds. Furthermore, after a fragment's binding mode has been determined with X-ray crystallography or nuclear magnetic resonance (NMR), it constitutes a good starting point for chemical modification. Fragments can be used as scaffolds that are rationally combined with other chemical groups to promote interactions with other parts of the binding pocket (Murray & Rees 2009).



Figure 2: Fragment-based drug discovery. HTS (A) typically aims to find large ligands that interact with most parts of the binding pocket by testing millions of compounds. In fragment-based drug discovery (B), few fragments are screened to find low affinity ligands that allow more thorough optimization.

Hann et al. (2001) illuminated the strength of fragment-based drug discovery with a simple model of molecular recognition. They showed that as the complexity of molecules increases, the probability of the molecules making unique complementary interactions with a binding pocket decreases. The opposite relationship with complexity was shown for measuring that interaction in assays: as the complexity of molecules increases, they can make more interactions with the binding pocket, resulting in increased affinity. Thus, the probability of measuring binding is higher for more complex molecules that do have complementary interactions with the binding pocket. These two probabilities are independent, and the authors multiplied them to define the probability of a useful event (see equation 1).

Maximization of this probability produces an optimal complexity for molecules to be screened (see figure 3).

p(useful event) = p(measuring binding of ligand) · p(molecule matches pocket in a unique way)

Equation 1: Probability of a useful event.

It should be noted that complexity is an abstract concept that encompasses many molecular features that are arguably dependent on more factors than size, but HAC and molecular weight can serve as proxies. The authors’ model was related to experimental methods where compounds of 100-250 Da were tested, suggesting this range for experimental fragment-based screenings. This study illustrates that a higher fraction of fragments will be ligands to a receptor, but that they will be less potent than ligands from lead-like space and thus less likely to elicit a response in a given assay.
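The trade-off described by Hann et al. can be illustrated with a toy calculation. The two probability curves below are invented stand-ins, not the authors' actual functional forms; they are chosen only so that one rises and the other falls with complexity, so that maximizing their product yields an intermediate optimum, as in figure 3.

```python
# Toy model of the Hann et al. (2001) complexity trade-off.
# Both probability curves are illustrative guesses, not the authors' model:
# binding detection rises with complexity, while the chance of a unique
# complementary match falls.

def p_measure(c):
    """Probability of detecting binding; grows with complexity c in [0, 1]."""
    return c ** 2

def p_unique(c):
    """Probability of a unique complementary match; decays with complexity."""
    return (1.0 - c) ** 2

def p_useful(c):
    """Equation 1: product of the two independent probabilities."""
    return p_measure(c) * p_unique(c)

# Scan complexities and report the optimum of the product.
grid = [i / 1000 for i in range(1001)]
best = max(grid, key=p_useful)
print(f"optimal complexity ~ {best:.2f}")  # for these toy curves: 0.50
```

Any pair of monotone curves produces the same qualitative picture: the product peaks at an intermediate complexity, which is the argument for screening fragments rather than maximally complex molecules.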



Figure 3: The success landscape. Schematic adaptation of the molecular complexity model (Hann et al. 2001).

1.1.2 Virtual Screening

In the age of high-performance computing clusters, virtual screening has become an important toolbox for accessing large virtual libraries. This group of techniques utilizes information about already established ligands (ligand-based) or the target's structure (structure-based) and guides compound selection for experimental testing accordingly. Ligand-based methods are useful when there are already many compounds known to interact with a given target but limited information about the target itself. The premise is that molecules that are similar to known binders are likely to be binders as well. However, ligand-based methods are generally unable to detect compounds that would interact with the target in a previously unknown manner (Geppert et al. 2010). In contrast, structure-based methods are not as dependent on previously known ligands. These methods are used when the structure of the target has been experimentally solved, through X-ray crystallography, NMR or cryogenic electron microscopy. A protein target's structure can also be accurately modeled if there is an already solved homologous protein with high amino acid sequence similarity. With a 3D model of the target, its interaction with unseen molecules can be simulated by fitting the molecules' 3D representations to the binding pocket and by estimating binding affinities with scoring functions. Physics-based scoring functions attempt to approximate affinities, where a more negative term indicates higher affinity (see equation 2). In this way, structure-based methods are capable of finding novel scaffolds. Due to the nearly exponential increase in experimentally solved protein structures in the 1990s and 2000s (Berman 2012), these methods have gained importance in recent years.

E_docking = E_electrostatics + E_van der Waals + E_desolvation

Equation 2: Scoring function. Physics-based scoring function for the molecular docking software DOCK3.7.

Molecular docking is a structure-based technique that can emulate HTS in silico. The aim is to filter through a large set of compounds in the computer and to rank them by their docking scores. Only a few compounds among the top-ranking molecules are then selected for experimental testing. Using molecular docking, researchers are still able to access a large proportion of virtual chemical libraries. The success of docking relies on the accuracy of the structural model, the accuracy of the 3D structures of the molecules to be docked and the scoring function. The concordance of docking setups with experimental results is generally estimated by how well they can replicate experimentally solved poses and by how well their scoring function enriches active molecules over highly similar inactive molecules, so-called decoys (Mysinger et al. 2012).

Such decoys are often obtained from the Directory of Useful Decoys, Enhanced (DUD-E) database. The molecular docking software DOCK3.7 (Coleman et al. 2013) is especially geared towards screening large chemical libraries. Implemented in Fortran, DOCK3.7 is one of the fastest docking programs (Fan et al. 2019). It achieves this speed by assuming a rigid protein structure, by bounding the sampling coordinates and by performing many calculations prior to the docking screen.


For instance, the 3D coordinates of a molecule’s structural conformations are generated beforehand. They can also be obtained from the ZINC database. The protein environment is also represented as a grid with precalculated potentials.

Lyu et al. (2019) investigated the ZINC library for lead-like (molecular weight between 250 and 350 Da) molecules available on demand in two large-scale docking screens (LSDs). Docking was performed with 99 million molecules against the enzyme AmpC β-lactamase. After filtering out compounds that were similar to known inhibitors and clustering for similarity, 51 diverse molecules ranking in the top 1 % on docking scores were selected for experimental testing. Out of the 44 successfully synthesized compounds, 5 were shown to inhibit β-lactamase, a hit rate of 11 %. Additionally, after probing the analogues, the most potent non-covalent inhibitor of β-lactamase was identified. For the G protein-coupled receptor (GPCR) dopamine receptor D4, 138 million molecules were docked. After similar filtering for established ligands and selection of top-ranking diverse compounds, 589 molecules were selected for experimental testing. 81 new chemotypes were discovered, with 31 of them showing sub-micromolar activity.

In another LSD, Stein et al. (2020) docked more than 150 million molecules against the sleep-wake cycle regulating GPCR MT1. Its orthosteric binding site is highly similar to that of MT2, and few selective ligands are known. The top-scoring 300,000 molecules were clustered for similarity, and compounds with chemotypes similar to known ligands were removed. After manual selection from the top 10,000 clusters, 38 molecules were experimentally tested. Of these, 15 had activity at MT1 or MT2, a hit rate of 39 %. Two selective MT1 inverse agonists were also found. In this study, it was also shown that docking cluster representatives rather than the complete library produced worse docking scores.

In the largest molecular docking screen to date, Gorgulla et al. (2020) docked 330 million molecules from ZINC and more than one billion compounds from the Enamine REAL space against KEAP1's NRF2 interaction interface. Three million high-ranking molecules were chosen from a first-stage rigid-target docking screen and rescored in a second-stage side-chain-flexible docking screen. 492 compounds from the top 0.03 % of the second-stage docking and 98 from the top 0.0001 % of the first-stage docking were chosen for experimental validation. 69 of the selected compounds were shown to bind KEAP1 with a fluorescence-based method, a hit rate of 12 %. In this study, the authors also showed that, keeping the number of selected compounds constant, the true positive rate (TPR) scales with the size of the docked library. This is not obvious, as larger libraries generally entail that actives are increasingly outnumbered by decoys. This further illustrates the power of docking.

However, for LSDs to be computationally feasible, several sacrifices have to be made. They are computationally demanding, requiring approximately 40,000 CPU-hours to dock the 300 million molecules of the current ZINC lead-like library, according to in-house experience. Additionally, the molecules are sampled to a limited extent. This problem will become even more accentuated as libraries continue to increase in size. To overcome this challenge, fragment-based approaches have also been tried in silico. For example, fragments from ZINC can be docked and substructure searches can then be performed in the lead-like domain.

1.1.3 Machine Learning in Cheminformatics

Apart from virtual screening methods, other computational approaches have also been employed in the search for novel receptor ligands and in cheminformatics in general. Machine learning has successfully been used for predictions of drug metabolism, drug-drug interactions and carcinogenesis. The idea of supervised machine learning is that the relationship between a set of simple molecular descriptors and a numerical value – for regression – or label – for classification – can be learned by a model. The model's parameters can then be transferred to molecules with other value combinations of the descriptors, after which their state can be predicted. The recent advent of deep learning has also made an impact in cheminformatics. Broadly, these models generate hidden descriptive features about an object that maximize the model's capacity to make accurate predictions about future objects. This is done by leveraging a voting system of 'neurons' arranged in layers (Lo et al. 2018).

Class imbalance is an important consideration in classification applied to cheminformatics, as the goal is often to predict minority classes from large libraries. For example, finding a selective inhibitor is much like finding a needle in a haystack. This is highly pertinent to virtual screening: the growth of molecular libraries will bring more active molecules, but will also exacerbate the imbalance between active and non-active compounds.

A common evaluation criterion for classifiers is their accuracy, the fraction of correct predictions. While this may be informative for balanced datasets, it is not useful when the data is highly imbalanced. To illustrate this, imagine a classifier that is presented with the task of discerning positive samples from negative samples. A poor classifier could merely predict all samples as negative, regardless of their properties. Such a classifier actually has an accuracy of 99 % if the negative samples outnumber the positive samples 99 to 1, despite it being wrong in 100 % of its predictions on positive samples. For these scenarios, precision and recall are valuable class-wise metrics obtained from the confusion matrix (see figure 4). In this example, precision captures the proportion of true positives in the set of samples that are predicted as positive. Recall, also known as the true positive rate (TPR), captures the proportion of all positive samples that are predicted as positive. The maximization of both of these metrics is often desired, but they are generally inversely correlated. When the threshold for classifying a sample as positive is strict, a classifier's precision will be favoured, while when the threshold is more lenient, recall is favoured. Which of the two is more important depends on the question asked. Both of them are captured in the F1-score, which is their harmonic mean. This mean's strength over its common arithmetic counterpart is that it is more sensitive to extremes: if either the recall or the precision is low, the F1-score will be low as well.

Lastly, the false positive rate (FPR) captures how many of the negative samples are predicted as positives. The TPR and FPR are plotted against each other in the receiver operating characteristic (ROC) space to evaluate a classifier’s performance. This plot attempts to capture how well a classifier enriches positives over negatives as the threshold for classifying a sample as positive is made increasingly lenient. The area under this curve (AUC) is often used as a single measure of model performance, but it should be interpreted cautiously when there is high class imbalance.

Recall = TPR = TP / (TP + FN)

Precision = TP / (TP + FP)

F1-score = 2 · (Precision · Recall) / (Precision + Recall)

FPR = FP / (FP + TN)

Equation 3: Common machine learning metrics based on the confusion matrix.
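The metrics in Equation 3 can be computed directly from the confusion-matrix counts. The short sketch below (an illustration, not code from this project) reproduces the 99:1 example above, where accuracy looks excellent while every class-wise metric collapses to zero.

```python
# Confusion-matrix metrics from Equation 3, illustrating why accuracy
# misleads on imbalanced data: a classifier that predicts everything
# negative on a 99:1 dataset scores 99 % accuracy but zero recall.

def metrics(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    recall = tp / (tp + fn) if tp + fn else 0.0          # TPR
    precision = tp / (tp + fp) if tp + fp else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "fpr": fpr, "f1": f1}

# The "always negative" classifier on 1 positive per 99 negatives:
m = metrics(tp=0, fp=0, tn=99, fn=1)
print(m)  # accuracy 0.99, but recall, precision and F1 are all 0
```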


Figure 4: Confusion matrix and ROC plot. The confusion matrix captures the numbers of true negatives, true positives, false negatives and false positives after classification. It is the basis for most metrics of supervised classification. The ROC plot is used to assess enrichment of true positives over false positives.


1.1.4 Conformal Prediction

Conformal prediction (Vovk et al. 2005) is a framework that can be applied to almost any classifier or regressor. It is designed to output predictions with guaranteed error rates on a per-sample basis. That is, if the accepted error rate is ε, the conformal predictor should make at most a fraction ε of erroneous predictions. It achieves this by predicting sets of labels. For the case of binary classification, where samples are labeled either '1' or '0', the possible set-predictions are {1}, {0}, {1,0} and { }. If a sample's actual label is included in the set, the prediction is considered valid. The error rate is guaranteed in the sense that if a sample's actual label is '1', the predicted set will contain '1' with probability 1 − ε. A useful metric of conformal predictions is the efficiency, the proportion of single-label predictions. There is an inherent reciprocal relationship between the expected error rate ε and the efficiency. At low levels of ε, the framework will assign both labels to the majority of the samples, while at high levels of ε, it will assign neither label to the majority of the samples (see figure 5).


Figure 5: Conformal prediction. For binary classification, if samples are assigned their actual label, or more labels, the prediction is valid. Samples that are not assigned any labels (predicted to the { } set) are always erroneous. As the value of ε increases from 0 to 1, the proportion of the {0,1} set decreases in favor for single label predictions (higher efficiency). After a certain point, no labels will be assigned to the samples, instead predicting all of them in the { } set.
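The set-construction rule can be sketched in a few lines. This is a minimal illustration; the p-values mirror the example values shown in figure 5.

```python
# Hedged sketch: turning per-label conformal p-values into set-valued
# predictions for binary labels. A label enters the prediction set
# whenever its p-value exceeds the accepted error rate epsilon.

def prediction_set(p_values, epsilon):
    """p_values: dict mapping label -> conformal p-value."""
    return {label for label, p in p_values.items() if p > epsilon}

eps = 0.2
print(prediction_set({1: 0.52, 0: 0.23}, eps))  # both exceed eps -> {0, 1}
print(prediction_set({1: 0.05, 0: 0.13}, eps))  # neither -> set()
print(prediction_set({1: 0.65, 0: 0.13}, eps))  # single-label set -> {1}
```

Raising ε shrinks the sets: at a high enough ε, no p-value clears the threshold and samples fall into the empty { } set, exactly the behaviour described for figure 5.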

An important aspect of conformal prediction is the distributions of the training set and the set that is predicted. If both sets are obtained through the same generative process, the sets are said to be independently identically distributed (IID). Conformal prediction operates under the exchangeability assumption, a slightly weaker assumption than IID that does not require independence. The typical difference is seen when drawing random samples: drawing with replacement, so that already drawn samples can be drawn again, generates IID datasets, whereas sampling without replacement generates exchangeable sequences. Furthermore, similar distributions for the training and test sets are also important for the underlying classifier's predictive performance.

The first described variant of the framework is the online, transductive approach (Shafer & Vovk 2008). This relies on updating the model after each sample has been predicted, thus requiring retraining of the classifier for every new prediction. To achieve computational efficiency, the inductive, 'batch setting' approach was proposed. This method is outlined for the case of binary classification below.

Firstly, the training set is split into a proper training set and a calibration set. The proper training set is used to fit an underlying classifier. The calibration set is used to generate a list of nonconformity scores. To do this, conformal prediction relies on the ability of the classifier to provide the value α, a measure of nonconformity that aims to capture how dissimilar a given sample is to those that have already been seen. The typical function to generate α is 1 minus the classifier's probability for the given label (see equation 4). However, the conformal predictor's guarantees on the error rate are independent of this function; although the predictions would be uninformative, even a random number generator could be used. The nonconformity function is applied to each sample of the calibration set and the resulting scores are sorted in a list. At the stage of prediction, each label y_j is examined for each sample. The nonconformity score of the sample and label combination is calculated, and the rank of this score among the calibration set's nonconformity scores is reported as the conformal prediction p-value. If the p-value is greater than ε, the sample conforms to the label and the label is added to the sample's prediction set. If the p-value is less than or equal to ε, the label is rejected from the prediction set.

When observing predictions on a sample-per-sample basis, samples are predicted to the class that has the highest p-value. In accordance with conformal terminology, this p-value is the prediction's credibility. The conformal prediction confidence is 1 minus the second-highest p-value. Another useful metric is the quality of information, defined by Ho & Wechsler (2008) as the largest minus the second-largest p-value.

A caveat of the framework is that the conformal prediction error rate is not guaranteed for each class individually, which can cause unreliable predictions, especially for imbalanced datasets. The Mondrian conformal predictor provides a simple solution to ensure class-wise validity. The nonconformity function is unmodified, but the calibration set's nonconformity scores are separated by class and the p-value for each label is deduced from the nonconformity score's rank in each class's list separately (see equation 5).

α_i,j := f(z_i) = 1 − P_h(y_j | x_i)

Equation 4: Typical nonconformity function for the calibration set and for samples to be predicted. For each sample z_i with label y_j and features x_i, the nonconformity score α is 1 minus the probability that model h assigns to label y_j. For samples to be predicted, nonconformity scores are calculated for every label; for samples of the calibration set, only the nonconformity score corresponding to the true label is calculated.

p_i,y_j = |{α_j,k : α_j,k ≥ α_j,n_j+1}| / (n_j + 1)

Equation 5: The Mondrian conformal prediction p-value. For each sample, the p-value of each label y_j is calculated by counting the nonconformity scores in that label's calibration set (of size n_j, with the new sample's own score α_j,n_j+1 included in the count) that are greater than or equal to the new sample's nonconformity score, and dividing by n_j + 1.

Furthermore, a problem with the inductive approach is that the calibration set is not utilized for training of the classifier. Depending on the size of the training set, this may reduce the predictive performance of the classifier. To counter this issue, several conformal predictors can be aggregated (Carlsson et al. 2014). Also known as cross-conformal predictors, aggregated conformal predictors can be implemented in different ways. Commonly, the training set is split into a proper training set and a calibration set for each classifier to be trained, and the training set is randomly shuffled between the training of each classifier. At the subsequent stage of prediction, each sample is predicted by each classifier and the median of the models' conformal prediction p-values is selected. This increases the likelihood that each sample of the training set will influence the predictions. Aggregated conformal predictors achieve higher efficiency, but the error rate is no longer automatically guaranteed. The error rates are instead conditional on the stability of the underlying classifier. If the aggregated models produce similar p-values, the error rate approximates ε. Otherwise, error rates are generally less than ε for ε ≤ 0.4 (Linusson et al. 2017).
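The aggregation step itself is simple: each trained conformal predictor yields a p-value for a given sample and label, and the median across predictors is used (the values below are invented for illustration):

```python
# Aggregated (cross-) conformal prediction: take the median of the
# p-values produced by the individual predictors (Carlsson et al. 2014).
# The p-values below are invented example numbers.
from statistics import median

p_values_per_model = [0.10, 0.35, 0.22, 0.28, 0.31]  # one p-value per model
aggregated = median(p_values_per_model)
print(aggregated)  # -> 0.28
```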

Conformal prediction has previously been used in cheminformatics for toxicological modeling and molecular docking.

Svensson et al. (2017) used iterative docking and conformal prediction to improve HTS efficiency. Firstly, docking was performed with the DUD-E datasets against their respective targets. The top 1 % scoring compounds were chosen to be screened experimentally. The confirmed actives and non-actives were labeled and used for training of a conformal predictor after which the remaining samples in the dataset were predicted. Conformal prediction achieved an average hit rate of 7.6 %, compared to a hit rate of 5.2 % achieved through docking alone.

Ahmed et al. (2018) leveraged the conformal prediction framework in an iterative manner to predict molecular docking scores in order to limit the total number of molecules to dock. Firstly, molecular signatures – features that capture molecules' atomistic environments – were generated for a molecular library of 2 million compounds. An initial training set was randomly extracted and docked against four receptors. High scoring compounds were assigned the label '1' and low scoring compounds the label '0', using a docking score histogram approach. Compounds between the 'high scoring' and the 'low scoring' bins were not labeled and hence not used for training. The labeled molecules were subsequently used to train a conformal predictor with an underlying support vector machine for classification. The model was used to predict the rest of the library, after which several iterative prediction-docking steps were performed.

Firstly, if compounds were predicted in the {0} set, they were disregarded for subsequent docking steps. A smaller set of compounds was then randomly selected from the rest of the library to expand the training set. After this, the model was re-trained and the rest of the library predicted again. This was repeated until predictions reached a tolerable level of efficiency. With this method, the authors managed to reduce the number of docked molecules by 62.61 %, while maintaining an average accuracy of 94 % for the 30 best scoring binders.

1.1.5 Adenosine A2A Receptor & Dopamine D2 Receptor

G protein-coupled receptors (GPCRs) are a diverse group of membrane receptors that respond to a vast variety of stimuli, including ions, odors, hormones and even light. Their activation triggers intracellular signaling cascades that mediate physiological responses throughout the entire mammalian body. GPCRs also represent the most pharmaceutically targeted group of receptors, with > 30 % of all small-molecule drugs having a GPCR as their primary target. Due to their association with the plasma membrane, early structure determination of GPCRs proved difficult. Nevertheless, breakthroughs in structural biology have produced a plethora of high resolution structures that are ideal for structure-based drug discovery. (Hauser et al. 2017)

In this project, molecular docking scores against two GPCR targets are investigated: the adenosine A2A receptor (A2AR) and the dopamine D2 receptor (D2R) (see binding modes in figure 6).

A2AR has physiological roles in the regulation of myocardial oxygen demand, immune cell suppression and control of glutamatergic and dopaminergic release in the brain. Non-selective A2AR antagonists, most notably aminophylline, were developed in the 1990s and are used for the treatment of bradyasystolic arrest (Burton et al. 1997). Recent research focuses on discovery of selective antagonists for the treatment of neurodegenerative disease (Franco & Navarro 2018) and cancer immunotherapy (Leone et al. 2015). A2AR is one of the best structurally characterized protein targets.

Ligand binding is generally coordinated by hydrophobic interactions, and selectivity is mainly mediated by a polar interaction with asparagine 253 (PDB accession 4EIY). The distinguishing feature of agonists is hydrogen bonds made deep in the binding pocket, in adenosine's case through its ribose moiety. (Carpenter & Lebon 2017)

D2R has predominant roles in neurophysiology and neurobiological disorders. It is highly expressed in the striatum, where it regulates locomotion (Mishra et al. 2018). It is also involved in cognitive flexibility, learning and memory (Cameron et al. 2018). Many first-generation antipsychotic drugs, such as haloperidol and chlorpromazine, target D2R as non-selective antagonists, while selective agonists, such as bromocriptine, are used for the treatment of Parkinson's disease. Additionally, A2AR and D2R have been shown to interact in-vivo by dimerization, causing a reduction in D2R activity (Kamiya et al. 2003). The key interaction for ligand binding to D2R is the formation of a salt-bridge between aspartic acid 114 (PDB accession 6CM4) and a protonatable amine of the ligand. Similarly to A2AR, several hydrophobic interactions contribute to the binding of D2R ligands as well, but the pocket is more constrained (Wang et al. 2018).

Figure 6: Binding modes of the antagonist ZMA (PDB accession: 4EIY) to A2AR (A) and the antagonist risperidone (PDB accession: 6CM4) to D2R (B). The ribose binding pocket (RBP) of A2AR and both targets' key interacting residues (Asn253 and Asp114) are marked.

1.2 Purpose

Finding novel selective ligands to A2AR and D2R can help alleviate suffering for patients with devastating diseases, such as neurodegenerative disorders and cancer. The expansion of molecular libraries provides an enormous resource in this endeavour. However, the current methods of exploring such libraries require massive amounts of computational resources, which necessitates a method that is less restricted by library size.

To solve this problem, this project aims to investigate the fusion of molecular docking and machine learning models within the conformal prediction framework. The goal is to design a combined machine learning-docking method that has the capacity to find ligands in chemical datasets of billions of compounds. To accomplish this, the final framework must reduce the time spent docking irrelevant compounds while retaining the docking-only approach's sensitivity in finding ligands. Furthermore, this project investigates a fragment-based approach, something that has not yet been tried within the conformal prediction framework. If successful, the framework designed in this project may expedite the early computational drug-discovery process.

Aim 1: Reduce large virtual libraries to manageable sizes for molecular docking, with a controlled loss of found ligands.

Aim 2: Exploit knowledge acquired from fragments to find ligands among lead-like molecules.

2 Methods

2.1 AMCP Architecture & Classification

To allow for parallel building and prediction, to control memory usage, and to support multi-class predictions, a new implementation of an aggregated Mondrian conformal prediction (AMCP) framework was written in Python as a collaborative effort. The implementation is briefly described here.

The main AMCP script has two modes, build and predict (see figure 7). Both modes expect an input file with whitespace-delimited values, where each line represents a sample to be used for building or prediction. The first field is the ID of the sample, the second is the label of the sample and subsequent fields are descriptive features. For the build mode, the entire input file is read into memory. For each model built, the samples are shuffled and split into a proper training set and a calibration set. A classifier is fit to the proper training set and the calibration set is used to generate conformity scores, rather than nonconformity scores. That is, P_h is used directly, instead of 1 − P_h, in equation 4. Models are built sequentially and saved in a compressed format together with the calibration set's conformity scores in a model directory.

Prediction is performed by loading all models and their calibration sets' conformity scores into memory and instantiating an array of fixed size. Samples are loaded into the array and prediction is performed in batches. All models are used to obtain a conformity score for each sample. The conformity score is compared to each model's respective calibration set's conformity scores. The conformal prediction p-values are then obtained by equation 5, after which their median is written to the output file.

Parallel building and prediction are performed with job arrays. Models can be built separately from the same input file and afterwards aggregated into the same model directory. Parallel prediction can be performed by splitting the input files, predicting, and concatenating the results. For all tests in this project, the random forest classifier implemented in Python's scikit-learn is used; it allows further parallelization of each job across all designated cores.
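The build-predict cycle described above can be sketched as a miniature version in scikit-learn (all data, sizes and names here are invented for the example; conformity scores are the class probabilities P_h themselves, as in the implementation):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Miniature AMCP-style build/predict cycle on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)          # synthetic binary labels

n_models, epsilon = 5, 0.2
models = []
for _ in range(n_models):
    idx = rng.permutation(len(X))
    train, cal = idx[:400], idx[400:]            # proper training / calibration
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(X[train], y[train])
    proba = clf.predict_proba(X[cal])            # conformity = P_h(class | x)
    # store sorted conformity scores per true class (Mondrian calibration)
    cal_scores = {c: np.sort(proba[y[cal] == c, c]) for c in (0, 1)}
    models.append((clf, cal_scores))

def predict_set(x):
    """Median p-value per class across models; labels with p > epsilon."""
    p_vals = {c: [] for c in (0, 1)}
    for clf, cal_scores in models:
        conf = clf.predict_proba(x.reshape(1, -1))[0]
        for c in (0, 1):
            # conformity variant of Eq. 5: count calibration scores <= ours
            n_c = len(cal_scores[c])
            rank = np.searchsorted(cal_scores[c], conf[c], side="right")
            p_vals[c].append(rank / (n_c + 1))
    return {c for c in (0, 1) if np.median(p_vals[c]) > epsilon}

example = predict_set(np.array([2.0, 2.0, 0.0, 0.0, 0.0]))  # clearly class 1
```

Note the inversion relative to Equation 5: with conformity scores, a higher score means the sample looks more like the class, so the p-value counts calibration scores less than or equal to the sample's.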

Figure 7: Building and predicting with the AMCP framework. In build mode, samples (ID, class label, features) are randomly split into a proper training set and a calibration set, a classifier is fit, and the calibration set's conformity scores are stored per class (repeated for 20 aggregated models). In predict mode, each model provides conformity scores, p-values are obtained from the conformity scores' positions in the calibration set lists, and prediction sets are formed at ε = 0.2.

The random forest classifier (Ho 1995, Breiman 2001) was chosen due to its capacity for handling imbalanced datasets. To understand this classifier, its underlying units, decision trees, must first be explained. A decision tree is trained by constructing a binary tree graph, where each internal node represents a true-or-false question about the training samples' properties. For each node, the feature and value that best split the classes are used. Nodes are added recursively, top-down, until only one class is represented in the node. Thus, every leaf node is associated with one of the classes. When predicting new samples, they are guided through the tree by their features, and the leaf that a sample ends up in determines its predicted class (see figure 8).

Figure 8: A simple decision tree, splitting on molecular properties (e.g. MW > 423 Da, cLogP, number of hydrogen bond donors, number of rings) to classify compounds as active or non-active. This is the minimal unit of a random forest classifier.

While decision trees are highly interpretable, they are generally inaccurate for predictions of unseen samples. To counter this weakness, random forests are constructed with stochastic components. These classifiers rely on building many decision trees from the original training set by sampling with replacement, so-called bootstrapping. The classifier also selects a random subset of features for building each tree. All of the trees are subsequently used to vote on classes, where the majority vote decides the final classification. For each sample, class-wise probabilities are also assigned, calculated by dividing the votes for each class by the number of decision trees. These probabilities are used as conformity scores for the conformal predictor.

2.2 Descriptive Features

Molecules can be represented virtually in several ways. The simplified molecular input line entry specification (SMILES) format represents molecules as interpretable strings. Similarly, the international chemical identifier (InChI) represents molecules in string format, but in a unique manner. These are the predominant 2D representations found in virtual libraries. From these, more descriptive features can be generated.

In this work, three sets of descriptive features are used. Firstly, a set of 97 physico-chemical properties. These include, for example, molecular weight, the number of atoms and the cLogP of the molecule. Secondly, extended connectivity fingerprints with diameter 4 (ECFP4s) are used (see figure 9). These represent the presence of particular substructures and are generated algorithmically through the Morgan algorithm (Morgan 1965). Firstly, each non-hydrogen atom is labeled with an integer that represents its element, number of heavy neighbours, number of bonded hydrogens, charge, isotope and whether it is within a ring. These values are hashed into a single 32-bit integer. Next, the atomic identifiers are iteratively updated by collecting all of the neighbouring atoms' bond orders and identifiers into a single array. The bond orders are 1 for single, 2 for double, 3 for triple and 4 for aromatic bonds. The identifiers are the neighbouring atoms' previously assigned integer values. Before the next iteration, this array is again hashed to a 32-bit integer, which is used to update the identifier of the atom. After a fixed number of iterations, any duplicate identifiers are removed and the set of integers is represented as a bit-vector; in this work, 1024 bits were used as binary features. The diameter of the fingerprint is twice the number of iterations performed.
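The iterative identifier-update idea can be sketched in pure Python (a deliberately simplified illustration, not RDKit's actual ECFP implementation; the toy molecule encoding and the CRC32 stand-in hash are invented for the example):

```python
import zlib

def h32(obj):
    """Deterministic 32-bit hash standing in for the real hash function."""
    return zlib.crc32(repr(obj).encode())

def ecfp_like_bits(atoms, bonds, n_bits=1024, iterations=2):
    """atoms: list of invariant tuples; bonds: {atom: [(bond_order, neighbour)]}."""
    ids = [h32(a) for a in atoms]                 # initial atomic identifiers
    seen = set(ids)
    for _ in range(iterations):                   # diameter = 2 * iterations
        new_ids = []
        for i in range(len(atoms)):
            # absorb neighbours' (bond order, identifier) pairs
            env = sorted((order, ids[j]) for order, j in bonds.get(i, []))
            new_ids.append(h32((ids[i], tuple(env))))
        ids = new_ids
        seen.update(ids)
    bits = [0] * n_bits
    for identifier in seen:                       # fold identifiers into bits
        bits[identifier % n_bits] = 1
    return bits

# Toy three-atom chain C-C-O with single bonds (order 1); invariants are
# (element, heavy neighbours, bonded hydrogens, charge, in-ring).
atoms = [("C", 1, 3, 0, False), ("C", 2, 2, 0, False), ("O", 1, 1, 0, False)]
bonds = {0: [(1, 1)], 1: [(1, 0), (1, 2)], 2: [(1, 1)]}
fp = ecfp_like_bits(atoms, bonds)                 # 1024-bit binary features
```
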

Figure 9: ECFP4s. Each bit represents a substructure of the molecule, up to diameter 4.

Thirdly, continuous and data-driven molecular descriptors (CDDDs) (Winter et al. 2019) are used. These features are generated with a pre-trained autoencoder. The autoencoder leverages deep learning to maximize its performance in translating SMILES to InChIs (see figure 10). The idea is that this requires the model to learn the properties that best describe the molecule. The translation process uses an encoder, forcing the SMILES string to be represented as a vector of 512 floating point numbers, and a decoder, translating the vector to InChIs. Here, the encoder of the authors' pre-trained model is used to generate descriptors.

Figure 10: Autoencoder. Latent features (CDDDs) generated from an autoencoder translating SMILES to InChI.

2.3 Retrospective Prediction of Known Ligands & Docked Compounds

Prior to this work, DOCK3.7 was used to perform large-scale docking screens (LSDs) against A2AR and D2R. The docking grids were optimized and validated with satisfactory enrichment of actives over decoys. 3D models of molecules with wait-ok status – that is, commercially available – were obtained from ZINC. Molecules with molecular weights < 250 Da were considered fragments (FWOK) and those between 250 Da and 350 Da were considered lead-like (LLWOK). Both sets were docked with DOCK3.7, using the same settings, and all docking scores were saved. To validate the predictive power of the AMCP framework, the top scoring compounds for A2AR and D2R were predicted retrospectively.

The training set was chosen as a random subset of the combined ZINC FWOK and LLWOK sets, at 18 to 26 heavy atoms. Compounds with docking scores in the top 1 % of the combined FWOK-LLWOK sets were labeled '1' and the rest were labeled '0'. The docking score cutoff was -33.45 for A2AR and -51.53 for D2R. The underlying classifier was a random forest with 100 trees. 20 aggregated models were used, with a training size of 3,000,000 molecules.

Validation was also performed by prediction of known ligands to the two targets. Ligand SMILES were obtained from ChEMBL (ID: CHEMBL251 for A2AR and CHEMBL217 for D2R), with the requirement that the Ki values were below 10 µM. To ensure that none of the predicted ligands were in the FWOK and LLWOK sets, the compounds were compared based on their SMILES after neutralization and removal of salts and stereoisomers, and any ligands overlapping with the training set were removed from the ligand set.

For evaluation of retrospective predictions, the proportion of 'actives' predicted in the {1} set was determined at ε = 0.2. The ROC AUC was calculated both for the known binders (considering all compounds of the FWOK-LLWOK library as false positives) and for the top 1 % scoring compounds (considering all compounds not in the top 1 % as false positives). This was done using three ranking schemes, based on the conformal prediction confidence, 1 − p(0); credibility, p(1); and quality of information, p(1) − p(0), respectively.
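The three ranking schemes can be sketched directly from the two conformal p-values of each compound (the compound IDs and p-values below are illustrative; the compound IDs follow the style of figure 7):

```python
# Ranking compounds by confidence, credibility or quality of information.

def rank_compounds(predictions, scheme):
    """predictions: {compound_id: (p0, p1)}; scheme is one of
    'confidence', 'credibility' or 'quality'. Best-ranked first."""
    key = {
        "confidence":  lambda p: 1 - p[0],    # confidence, 1 - p(0)
        "credibility": lambda p: p[1],        # credibility, p(1)
        "quality":     lambda p: p[1] - p[0], # quality of information
    }[scheme]
    return sorted(predictions, key=lambda c: key(predictions[c]), reverse=True)

preds = {"Z02954": (0.11, 0.44), "Z07714": (0.88, 0.11), "Z10491": (0.66, 0.88)}
by_quality = rank_compounds(preds, "quality")        # 0.33 > 0.22 > -0.77
by_credibility = rank_compounds(preds, "credibility")
```
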

2.4 Lead-Likes Predicted from Fragments

The aim of this method was to assess whether models trained on fragments could accurately predict lead-like compounds.

Approximately 20 % of the ZINC FWOK (MW less than 250 Da) library was chosen as the training set for building AMCP models, giving a training size of 2,929,118 samples. A 10 % subset of the LLWOK set, 30 million samples, was used as the test set.

As controls, AMCP models were also built with a subset of the LLWOK compounds of the same size and used for prediction of the same LLWOK test set. For the same reason, the rest of the FWOK set was also predicted with the FWOK-built AMCP models. For each AMCP model built, molecules with docking scores in the top 1 % of fragments (a docking score cutoff of -31) were labeled '1' and the rest of the compounds were labeled '0'.

The same labeling strategy was applied to the test sets: molecules with docking scores below -31 were labeled '1' and the rest were labeled '0'. Due to the large file size of the LLWOK library, the CDDD training and test sets were selected after quasi-shuffling. That is, the lines of the file, each representing a sample, were extracted in blocks, after which the blocks were shuffled individually and then concatenated back together. This does not ensure exchangeability between training and test sets. For the ECFP4 and physico-chemical features, the same training and test sets were used; these were obtained through random sampling from the entire LLWOK library. For each model built, 20 aggregated models were used, each with 100 trees for the random forest classifier.
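The quasi-shuffling step can be sketched as follows (a minimal illustration under the reading that each block is shuffled internally before the blocks are concatenated; block size and data are invented):

```python
import random

# Quasi-shuffle: split into blocks, shuffle within each block, concatenate.

def quasi_shuffle(lines, block_size, seed=0):
    rng = random.Random(seed)
    blocks = [lines[i:i + block_size] for i in range(0, len(lines), block_size)]
    for block in blocks:
        rng.shuffle(block)                    # shuffle within each block only
    return [line for block in blocks for line in block]

shuffled = quasi_shuffle(list(range(10)), block_size=5)
# elements from the first block stay in the first half, and so on -
# samples never move between blocks, hence no global exchangeability
```
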

2.5 Adaptive Learning & Prediction

Based on the previous method's results (see 3.2), the conformal prediction error rate was considered controlled when predicting compounds approximately three to four heavy atoms larger than the training set when ECFP4s were used as features. Hence, several different approaches to adaptive learning and prediction were attempted, all with the same general idea: to use a large proportion of the fragments as a training set to predict slightly larger compounds. These predictions would guide which of the larger compounds to 'dock', and the docked compounds would then constitute the training set for predictions of molecules of an even larger size (see figure 11).

Figure 11: Adaptive learning. The guiding idea: models are iteratively built on docked compounds and used to predict compounds of increasing heavy atom count (HAC).

The docked and scored ZINC FWOK and LLWOK compounds were neutralized, stripped of salts and subsequently distributed into bins of 14-17, 18-21, 22-24 and 25-26 heavy atoms. A random subset was extracted from each bin to be used for a non-adaptive control scheme (see description below). The remaining compounds were used as test sets. If a compound of the test set had a docking score among the top 1 % of the entire combined FWOK and LLWOK libraries, the compound was considered 'active'; otherwise, it was considered 'non-active'. For A2AR, this docking score cutoff was -33.45 and for D2R, it was -51.53. For all methods other than the multi-class methods – described explicitly below – training set labeling was performed by labeling 'actives' as '1' and 'non-actives' as '0'. Note that the top 1 % scoring compounds are not uniformly distributed across the HAC bins. This means that when 'actives' were labeled as '1', each step of prediction had a slightly different ratio of 'actives' to 'non-actives' in the training sets. The rationale for using this strategy, as opposed to choosing the top 1 % scoring compounds for each HAC bin individually, was that it would allow more accurate comparisons between the different approaches. In prospective use cases, labeling would rather be based on the initial training set. Each time a prediction set was extracted, it was done at a significance of ε = 0.2.
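The labeling and binning scheme above can be sketched as follows (the example compounds are invented, and the cutoff is assumed inclusive; cutoffs and bins are those stated for A2AR):

```python
# Label by global top-1 % docking score cutoff and assign HAC bins.

A2AR_CUTOFF = -33.45
HAC_BINS = [(14, 17), (18, 21), (22, 24), (25, 26)]

def label(score, cutoff=A2AR_CUTOFF):
    return 1 if score <= cutoff else 0        # more negative = better score

def hac_bin(hac):
    for lo, hi in HAC_BINS:
        if lo <= hac <= hi:
            return (lo, hi)
    return None                               # outside the binned range

compounds = [("Z1", -35.2, 19), ("Z2", -20.0, 23), ("Z3", -33.45, 25)]
labeled = [(cid, label(s), hac_bin(h)) for cid, s, h in compounds]
# -> [('Z1', 1, (18, 21)), ('Z2', 0, (22, 24)), ('Z3', 1, (25, 26))]
```
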

The first approach was the random-subset method, which was used as a non-adaptive negative control for the subsequent methods. Training was performed with the aforementioned extracted sets. For each HAC bin, 2,000,000 samples were used for training. The models were used to predict the test sets with the same HAC bin as was used for training. This ensured exchangeable distributions. Ten aggregated models were trained for each AMCP model.

Secondly, the {1} set based method represented the most naïve conformal prediction method. The 14-17 HAC AMCP models were used to predict the 18-21 HAC bin, and the compounds predicted in the {1} set were extracted. For prospective use, these compounds are the ones that would be indicated for docking, but in this retrospective scenario, they were already scored. A random subset of 2,000,000 of these 'docked' molecules was then used for training new AMCP models. These models were used for prediction of the 22-24 HAC bin. As previously, compounds predicted in the {1} set were chosen to be 'docked' and a random subset of 2,000,000 of these was used to build new AMCP models for prediction of the 25-26 HAC bin. Lastly, the 25-26 HAC compounds predicted in the {1} set were selected to be 'docked'. For each training step, ten aggregated models were used.

The third approach was the confidence-sort method, which was designed to ignore the framework's indications of how many molecules to dock and instead select a fixed number of compounds that the model is most confident in. The 14-17 HAC bin models were used to predict the 18-21 HAC bin. After this, the predicted samples were sorted according to their confidence (1 − p(0)) and the top 2,000,000 compounds were selected to be 'docked'. The 'docked' compounds were subsequently used for training new AMCP models for prediction of the 22-24 HAC bin. As in the previous step, the 2,000,000 compounds that were most confidently predicted as 'actives' were 'docked', forming the training set for new AMCP models for the last prediction step. Lastly, the 25-26 HAC bin was predicted and the 2,000,000 compounds that the framework was most confident in were 'docked'. For each training step, twenty aggregated models were used.

Two multi-class approaches with differing labeling strategies were also attempted. In both cases, compounds with docking scores among the bottom 10 % of the combined FWOK and LLWOK sets were labeled '0'. For A2AR, this docking score cutoff was -4.82 and for D2R, it was -17.90. For the first multi-class approach, compounds with docking scores among the top 1 % were labeled '2' (docking score cutoff at -33.45 for A2AR and -51.53 for D2R) and the middle scoring compounds were labeled '1'. For the second multi-class approach, compounds scoring among the top 10 % of the entire FWOK and LLWOK libraries were labeled '2' (docking score cutoff at -27.35 for A2AR and -39.75 for D2R) and the middle 80 % of scoring compounds were labeled '1'. The same labeling strategy, with the same docking score cutoffs, was applied at each HAC bin step.

The practical method was the same for the two multi-class approaches. Firstly, AMCP models were built on 2,000,000 molecules from the 14-17 HAC bin and used to predict the 18-21 HAC bin. 2,000,000 compounds from the 18-21 HAC bin were then randomly selected from samples in all prediction sets with the exception of the {0} set. That is, the selected compounds included the samples that were predicted in the {1} set and the {2} set, but also in the { } set, the {1,2} set, the {0,1} set, the {0,2} set and the {0,1,2} set. The rationale for choosing these compounds was that they would be less biased towards 'actives' and therefore present a more accurate representation of the 'active' to 'non-active' ratio for training of the next model. By not training on compounds predicted in the {0} set, the sampling of non-interesting compounds was still reduced to a limited extent.

After the AMCP models had been trained, they were used in two ways. Firstly, the same HAC bin was predicted again, and the {2} set was extracted for 'docking'; the compounds predicted in the {2} set are those the model is most confident will score well. Secondly, these models were used to predict the 22-24 HAC bin. As before, new AMCP models were then built on compounds not predicted in the {0} set. These models were used to predict the 22-24 HAC bin again, and the compounds predicted in the {2} set were extracted for 'docking'. The same idea was then repeated in the last step for the 25-26 HAC bin. While these multi-class approaches may achieve more exchangeable distributions between training and test sets, they require sampling many compounds that are not expected to score well.

For comparison of these methods, the recall and precision of top 1 % scoring compounds were assessed in the extracted {1} sets for the first three methods: the random-subset method, the {1} set based method and the confidence-sort method. For the multi-class approaches, recall and precision of top 1 % scoring compounds were assessed in the extracted {2} sets.

2.6 Heavy Atom Count Dependence of Required Training Set

The purpose of these tests was to assess whether the training set size required for predictive performance increases with the HAC of the compounds. Similarly to section 2.5, the compounds of the FWOK and LLWOK libraries were binned by HAC. AMCP models were trained at 12 different training set sizes for each HAC bin: 1,000, 2,000, 4,000, 8,000, 16,000, 32,000, 64,000, 128,000, 160,000, 320,000, 1,000,000 and 2,000,000 samples. Both the training data and separate test sets were predicted with these models. Three replicates were used for each training size. After prediction, recall was assessed at ε = 0.2.


3 Results

3.1 Retrospective Prediction of Known Ligands & Docked Compounds

The application of the AMCP method for retrospective prediction of A2AR LSD scores allowed for finding 86.8 % of the top 1 % scoring compounds while only evaluating 9.62 % of the entire 18-26 HAC test set. Of the compounds predicted in the {1} set, 9.58 % were part of the top 1 % scoring compounds for A2AR. Similar predictive power was observed for D2R: retrospective prediction of the D2R test set allowed for finding 84.4 % of the top 1 % scoring compounds while only evaluating 9.72 % of the entire 18-26 HAC test set. Of the compounds predicted in the {1} set, 8.45 % were top 1 % scoring compounds for D2R. The error rates for both the active and the non-active class were lower than the chosen significance of ε = 0.2 for both A2AR and D2R (see table 1).

Table 1: Retrospective prediction of top 1 % scoring compounds in FWOK-LLWOK LSDs with ε = 0.2.

                                        A2AR           D2R
Test set size                           179,805,087    219,039,320
Number of top 1 % scoring compounds     1,908,170      2,132,578
{1} set size                            17,293,923     21,296,055
{1} set proportion                      9.62 %         9.72 %
Top 1 % predicted in {1} set            1,657,007      1,799,949
Recall in {1}                           86.8 %         84.4 %
Efficiency                              98.0 %         99.4 %
Precision in {1}                        9.58 %         8.45 %
Error rate class '1'                    13.1 %         14.6 %
Error rate class '0'                    10.8 %         9.07 %
ROC AUC, p(1)                           95.7 %         94.7 %
ROC AUC, 1 − p(0)                       95.7 %         94.7 %
ROC AUC, p(1) − p(0)                    95.7 %         94.7 %

Calibration plots and predictive efficiencies for A2AR are shown in figure 12. These show that the error rates were ≤ ε as ε increased linearly from 0 to 1 in 50 steps. The behaviour of the error rate differed between the classes. Predictions of the majority class (non-actives) were consistently over-conservative, especially at higher values of ε. The minority class (actives) followed the expected error rate more closely, but had over-conservative error rates at lower values of ε. The proportion of single-label predictions was maximized at ε = 0.18, reaching 99.97 %. At ε = 0.2, where the {1} set was extracted, it was 97.99 %.

Figure 12: Class-wise error rates (A) and efficiency (B) for retrospective top 1 % predictions of A2AR. For (A), the dotted line indicates the expected error rates. For (B), the dotted line shows the ε used for extracting the {1} set.

Calibration plots for retrospective predictions of D2R showed a pattern highly similar to that of A2AR (cf. figure 12). Error rates were ≤ ε as ε increased linearly from 0 to 1 in 50 steps. Predictions of the majority class (non-actives) were over-conservative, and the minority class (actives) was mostly only over-conservative at low values of ε.

The scores of the non-active compounds predicted in the {1} set were also investigated. It was seen that these compounds, although not reaching the docking score cutoff-value, were generally among the better scoring (more negative) compounds of the test set. This was true for both A2AR and D2R (see figure 13).

Figure 13: Docking score distributions for the {1} set compared to the test set for A2AR (A) and D2R (B) at ε = 0.2. The vertical lines indicate the thresholds for the top 1 % scoring compounds. For A2AR, the right side of the plot is cut due to a long tail.

Retrospective prediction of known ligands to A2AR showed 100 % efficiency: each ligand was given a single label and none were predicted as both ({0,1}) or null ({ }). The recall in the {1} set also reflected the expected rate of 80 % at ε = 0.2. Retrospective prediction of known ligands to D2R showed an efficiency of 99.0 % and an error rate marginally higher than expected, at 23.3 % (see table 2).


Table 2: Retrospective prediction of known ligands for A2AR and D2R with ε = 0.2.

                               A2AR       D2R
Number of ligands              8,683      9,477
Ligands predicted in {1} set   7,314      7,167
Recall in {1} set              84.2 %     75.6 %
Error rate                     15.8 %     23.3 %
Efficiency                     100 %      99.0 %
ROC AUC, p(1)                  93.4 %     90.2 %
ROC AUC, 1 − p(0)              92.8 %     91.1 %
ROC AUC, p(1) − p(0)           92.8 %     90.2 %

Lastly, the AMCP models' ability to enrich 'actives' in the predicted sets was also assessed. This was done after ranking the samples on their p-values. The AMCP models enriched top 1 % scoring compounds over non-top 1 % scoring compounds in the 18-26 HAC test set when ranking predictions on credibility (p(1)), confidence (1 − p(0)) and quality of information (p(1) − p(0)). For these predictions, there was no discernible difference in performance between the ranking schemes in regards to the AUC metric.

For enrichment of known ligands, detection of any compound of the combined FWOK-LLWOK set was considered a false positive, including its top 1 % scoring compounds. The ranking schemes performed differently for the two targets: for A2AR, credibility showed the best enrichment; for D2R, confidence showed the best enrichment. In figure 14, ROC plots of known binders and top 1 % scoring compounds can be seen.

Figure 14: ROC curves for A2AR (A) and D2R (B) based on credibility (p(1)), for top 1 % scoring compounds and known ligands.

3.2 Lead-Likes Predicted from Fragments

After the AMCP framework had been validated for retrospective predictions, it was tested whether models trained in the fragment domain could be used to predict molecules in the lead-like domain. Considering the quasi-shuffle approach that was used to extract the CDDD training and test sets from the LLWOK library, precision and recall values for CDDD-trained models should be compared with the other features' models cautiously. With this in mind, it is apparent that models built on each set of features perform well when predicting in the same domain as they were trained in. For building and prediction in fragment space, the error rates are consistently < ε. The models are also efficient, with each model labeling more than 85 % of the samples with a single label. For models built with CDDDs, the error rate exceeds ε, likely a result of the aforementioned quasi-shuffling.

When predicting lead-like molecules with models built on fragments, it is evident that the physico-chemical and CDDD models successfully label almost all 'actives' correctly, with recalls of 99.9 % and 98.8 %, respectively. However, they also predict a large proportion of 'non-actives' in the {1} set, as shown by their low precision of 3.27 % and 4.98 %. In
