• No results found

On the relationship between molecular high-throughput screening hit rates and molecular descriptors

N/A
N/A
Protected

Academic year: 2022

Share "On the relationship between molecular high-throughput screening hit rates and molecular descriptors"

Copied!
32
0
0

Loading.... (view fulltext now)

Full text

(1)

On the relationship between molecular high-throughput

screening hit rates and molecular descriptors

Mari Hansson

Degree Project in Engineering Chemistry, 30 hp

Report passed: April 2013 Supervisors:

Hongming Chen and Ola Engkvist, AstraZeneca Anna Linusson, Umeå University

(2)
(3)

Abstract

High-Throughput Screening (HTS) is widely used in the pharmaceutical industry to identify novel chemical starting points for drug discovery projects. The current study focuses on the relationship between molecular HTS hit rate and four common molecular descriptors: molecular lipophilicity (as described by ClogP), size (heavy atom count, HEV), degree of aromaticity content (Fsp3) and fraction of molecular framework (fMF). Molecular HTS hit rate is defined as the ratio between the number of times a compound was labeled as active and the number of times the compound was tested in a diverse set of HTS assays. Our results show that all four descriptors are correlated with molecular hit rate. Beta-binomial statistical models were built to model hit rates as a function of these descriptors and to quantify the influence of the descriptors on the molecular HTS hit rate. The study shows that among the four descriptors, ClogP has the largest influence on molecular hit rate. Fsp3 and HEV are also independently influencing the molecular hit rate. fMF has only minor influence beyond that it is correlated to the other descriptors. Higher degree polynomial terms of these descriptors were added into the beta-binomial statistic model to improve the model quality. The results show that models with high degree polynomial terms fit better to the experimental data than the non-polynomial models. In conclusion, for the first time the relative importance of the different molecular descriptors on the molecular HTS hit rate has been estimated. It was found that ClogP has the largest influence followed by Fsp3 and HEV.

(4)

II

Contents

Introduction ... 1

Methods... 4

Datasets ... 4

Calculation of molecular HTS hit rate and molecular descriptors ... 5

Fitting molecular HTS hit rate with beta-binomial distribution ... 6

Results and Discussion ... 10

Relationship between molecular HTS hit rate and molecular descriptors ... 10

Inter-correlation analysis ... 12

Fitting molecular HTS hit rate with beta-binomial statistical models ... 13

Conclusion ... 20

References ... 21

Appendix ... 24

(5)

III

(6)

1

Introduction

The drug discovery process is well recognized as a lengthy and complex journey. It can be divided into four different steps: target identification, lead identification (LI), lead optimization (LO) and clinical development.1) Although no accurate number exists, the cost of developing a new drug from scratch until regulatory approval has been estimated to 0.8~1 billion dollar on average, with a development time of up to 20 years.2)

The target identification and validation steps are the starting points of this journey and the main goal here is to identify and validate the target of a specific disease, either in vitro (in enzyme or cellular model) or in vivo (animal model). The definition of a target is wide and can lie in between an isolated active protein to a complex cellular system. Since the beginning of the genomics era in the 1990s, the number of targets identified has increased significantly.3)

Once a target has been selected the project enters the lead identification phase. The main task here is to understand how the selected target acts to form and sustain the disease and accordingly to find a so called lead compound that can interact with the target in order to slow down or even neutralize the disease process.4)-5) There are many ways to identify lead chemical series like fragment based screening, the fast follower strategy, structure based de novo design etc.6) However the main strategy for lead identification in pharmaceutical companies is nowadays to run a High-Throughput Screen (HTS) against a large compound collection.7) The following lead optimization (LO) step aims at further optimization of the lead compound against a multitude of in vitro and in vivo parameters (aiming at low dosage and more desirable Absorption, Distribution, Metabolism, Excretion and Toxicity (ADMET) related properties).8) Within the lead optimization stage drug developers put a lot of effort into investigating the structure-activity relationship (SAR) around the lead series to identify the best compound in terms of potency and ADMET profile.9) The final compound coming out of the LO phase is nominated as a drug candidate for clinical trials, which constitute the last drug development steps where efficacy and safety in humans will be investigated.

For lead identification, the screening of chemical compounds to elucidate pharmacological activity is the key step and has been ongoing in various forms for at least 40 years. The screening methodologies have improved with time, both in terms of throughput and the amount of information that can be derived from the screen. HTS technology emerged in the 1990s and it has become a key technology in the industrialized drug discovery paradigm. With the latest HTS technology, it is possible to generate more than 100,000 data points per day. As described above, the main use of HTS technologies have for many years been for lead identification but thanks to improvement of these technologies, companies are now expanding its use backward and forward in the drug discovery process. Backward toward target validation and forward to converting assay hits (active compounds) to qualified leads via information generated either within screens or through downstream, high throughput ADMET testing.10) The HTS technology involves robotics and different technological devices such as liquid control and sensitive detectors. Before running a screen, a suitable biological assay for the intended target has to be developed. The developed assay is then transferred to the HTS facility and a validation of the robustness of the assay on the HTS equipment is often done to ensure reliable HTS results.

One thing all the HTS’s have in common is the so called microplates. A microplate is a plastic plate, containing a number of wells, typically 384 or 1536, where the compounds are diluted in a solvent using robotics. The screening procedure is divided into several steps in order to reduce the number of false positive results to a minimum. These steps are automation, liquid handling, plate storage and incubation, and detection instrumentation.

Automation can be done by two different approaches, either by unit automation, which is designed to perform a specific task, or fully automated systems, which are designed to perform a number of steps in the screening procedure. The liquid handling systems are responsible for adding the liquids into the wells. For systems responsible for the storage and incubation of the plates, different conditions such as storage space, temperature and humidity have to be considered. The last system is the detection instrumentation, referring to

(7)

2 the read-out of the biological response that has taken place in the wells. Typically, all wells are read simultaneously, using fluorescence or luminescence detection with direct imaging or a fiber optic feed network.11)

It has been argued that HTS is not a reliable method for these activity investigations due to noise and unreliability’s in the results. Other concerns about HTS have also raised the question about undesirable compounds for the specific target.7) But these counter arguments have not been widely accepted in pharmaceutical industry and HTS has remained the most popular lead generation method. By implementing rigorous quality assurance methods, such as Z′ trend monitoring,12) plate pattern recognition algorithms,13) regular usage of pharma- cological standards,14) and liquid handler and reader performance monitoring,15) the quality of HTS data can be very high and is often also confirmed by low throughput biological assays which generate concentration response results. Here, a selection of compounds is tested at different concentrations, generating concentration-response curves from which biochemical parameters such as IC50 values can be estimated. Undesirable compounds can be removed by follow-up counter screening and selectivity screens. In a recent analysis of 58 drugs approved between 1991 and 2008, for which the lead compounds were known, 14 originated from HTS and most of the other compounds had been previously reported in the literature.16)

Despite all of these technology advancements in drug discovery, the pharmaceutical industry has for many years faced difficulties with high attrition rates in the drug development phase. The main reason for failure has historically been due to poor pharmacokinetics and bioavailability associated with clinical candidates. Consideration of ADMET properties is thus introduced already in the lead generation step, in an attempt to decrease the attrition rate. Establishing the relationship between physiochemical properties and a compound’s potential to become a drug is a problem of paramount importance in drug discovery. The first attempt to apply the ‘drug-likeness’ strategy based on physicochemical properties came from Lipinski et al.17) 1997 where the “rule-of-five” criteria for oral drug bioavailability were proposed. Since then, many other research groups have done similar research. Oprea et al.18) investigated the differences between lead compounds and drugs.

Wenlock et al.19) tried to identify physiochemical property differences between compounds in different clinical stages and marketed drugs in order to find what properties were favored for passing through the clinical phases and later reached the market. Vieth and Sutherland20) used different protein target families of current market drugs and compared their molecular weight (MW) and LogP. These studies has shown that LogP and MW are important parameters to predict compound quality and drugs have significantly lower LogP and MW than normal bioactive compounds in general.

Previously a lot of efforts have also been taken to investigate the relationship between compound quality and molecular topology related descriptors. Ritchie et al.21) investigated the impact of aromatic ring count in different development stages. Lovering et al.22) introduced the fraction of sp3-hybridized carbons (Fsp3) topology descriptor in 2009, showing the trends of Fsp3 in different clinical stages. Leeson et al.23) investigated trends for a number of molecular descriptors together with the topology descriptor aromatic atom count – sp3 atom count of different oral drugs. It was found that these topology descriptors are correlated with clinical success through the influence on ADMET, in addition to the physiochemical properties. Another topology descriptor, fraction of the molecular framework (ƒMF), was introduced by Yang et al.24) in 2010 where the relationship between topology and selectivity for druglike molecules was investigated.

Inspired by Lipinski’s pioneering work, some researchers have also looked into factors associated with specific biological activities rather than drug-likeness. Gillet et al.25) adopted a different strategy in an effort to delineate the factors associated with biological activity. In this work the WDI26) database was used as representative of biologically active molecules and the SPRESI27) database was used as representative of biologically inactive compounds. The distributions of various molecular properties were calculated for the two databases, and the resulting distributions were compared. Diller et al.28) published a statistical analysis of a company’s historical high-throughput screening data examining the relationship between physicochemical properties and substructures on the likelihood of compounds displaying biological activity, irrespective of target class (structural properties versus molecular HTS hit

(8)

3 rate). Data mining revealed that rotatable bond count had the most significant impact on hit rate, with more rigid structures preferred. Compounds having higher molecular weight tend to have higher hit rate. When logP is less than 5, compounds having higher logP value also tend to have higher hit rate. But when logP values exceed 5, the trend is opposite. Pope29) made a similar investigation based on GSK’s screening data and examined the relationship between molecular HTS hit rate and different molecular descriptors, indicating that molecular hit rate positively correlates with ClogP and MW, respectively, and has a negative correlation with Fsp3.

The aim of this project was to collect in-house historical screening data and investigate the relationship between molecular HTS hit rate and some basic molecular physiochemical properties like lipophilicity (ClogP), heavy atom number (HEV) as well as some topological descriptors like Fsp3 and fMF. The initial step was to make this investigation on primary screen data and then validate the results using concentration-response screened data. The pair wise correlations between ClogP, HEV, Fsp3 and fMF were also studied. The aim is to find trends that can guide in identifying good starting points for the lead generation efforts. Earlier studies have been showing the influence on the molecular HTS hit rates by several descriptors individually. However, there have been no earlier investigations of how the descriptors are inter-correlated in an molecular HTS hit rate analysis and how the inter- correlation influences the relative importance of the different descriptors with respect to the molecular HTS hit rate.

(9)

4

Methods

Datasets

The dataset used in this study was extracted from the in-house screening database, HBase. It includes the screening results for 1 769 089 compounds on a diverse set of 260 primary screening assays, which were carried out at AstraZeneca’s HTS department starting form 2008 and onwards. The initial dataset is single point (SP, screen at single concentration) data which is heterogeneous and noisy. For example the compound screening concentrations were not consistent in some of the assays, the activity of the compounds was labeled in various ways, some compounds had been tested multiple times etc. Due to these inconsistencies, following modifications (Figure 1) of the data were made resulting in a “cleaned” dataset for each assay.

In the first step of the cleaning procedure, assays that did not contain any active compounds were excluded. For the compounds tested in the same assay, only compounds tested in the most occurring screening concentration were kept. Then, compounds labeled as either “A” or

“N” were saved and compounds with any other labels were excluded. In cases where

compounds had been tested multiple times, the activity flag was determined by the majority label, e.g. if a compound were tested three times in one assay and two times were defined as

“A” and one time was defined as “N”, the final activity flag will be “A”, i.e. the majority label wins. If there were equal numbers of “A” and “N”, the compound was labeled as “N”. The biological effect values were also kept for calculating molecular hit rates (explained later) and median effect values were calculated for compounds which had been tested multiple times in the same assay. After the unique compound records were compiled for each assay, only assays with more than 100 000 unique compounds were utilized for data analysis and Figure 1: Flow chart over the data cleaning process

No No No No

Yes Yes

Yes Yes Yes

No 260 Assays

Are there any active compounds

in the assay results? Remove assay

Compound tested in most

occurring concentration? Remove compound

Compound labeled as A or N? Remove compound

Compound tested multiple times?

Change activity flag label

Keep compound

Calculate median effect

At least 100 000 compounds included in the assay?

102 assays

Remove assay

(10)

5 altogether 102 HTS assays remained after the cleaning procedure. Only compounds with a heavy atom count between 6 and 45, a ClogP range between -2.0 and 6.0 and compounds labeled as a neutral, acidic or basic were kept for further study, removing cationic and zwitterionic compounds. Altogether, 1 307 989 compounds tested in more than 15 assays (Dataset 1) were included for the hit rate analysis.

To remove the false positive actives in the cleaned SP dataset, the concentration response (CR) results in the screening cascade were examined for each SP test assay, using in-house work annotating the assays according to the Bio-Assay Ontology (BAO).30) An active from SP screening was labeled as “A” (active) only if it was also active in the following CR test, otherwise it was regarded as “N” (inactive) or simply removed if not included in the CR test. Since the CR test results have much higher accuracy, this step improves overall data quality. In the end, 85 of the 102 SP assays had CR confirmation results and were thus included in the CR corrected compound set which comprises 1 283 283 compounds (Dataset 2). The cleaning and correction procedure was made in python.Error! Reference source not found.

Calculation of molecular HTS hit rate and molecular descriptors

Two different approaches were used for calculating the molecular HTS hit rate. One method uses the activity flag annotated by the staff that carried out the screen. This is straightforward since the activity flag field is stored in the database. Since the activity and the biological effect of a compound are related, an alternative method is to re-label the activity flag based on the compounds biological effect value.32) This method was implemented by defining the top 1%

compounds as actives and the rest as inactive (Dataset 3). This is a more objective way and all human bias in the process is eliminated. Once the compound activity flag is determined, the molecular hit rate for each individual compound is calculated according to Eq. 1, which is the ratio of the number of times a compound is active and the number of times it had been tested across the included HTS assays. The hit rate of a molecule is the probability to be active in an HTS.

(1)

Several commonly used molecular physicochemical descriptors were used in current study.

This includes ClogP33) as a measure of lipophilicity, the number of non-hydrogen atoms in the compound, denoted as heavy atom count (HEV), as a measure of molecular size, and two topological descriptors: fraction of sp3 hybridized carbon atoms (Fsp3) and fraction of the molecular framework (ƒMF). ClogP is a well known descriptor and defined as the logarithm of the calculated partition coefficient between 1-octanol and water. Fsp3 is the ratio between the number of sp3-hybridized carbons and the total number of carbon atoms included in the compound (as shown in Eq. 2). It describes the fraction of the relative aliphatic carbon atom content in relation to the total number of carbon atoms. A higher value of Fsp3 indicates that the shape of the compound tends to be less flat.

(2)

fMF is another topological descriptor which is related with the concept of molecular framework (MF). Murcko et al.34) fragmented a molecular structure into four different subunits: ring systems, side chains, molecular framework, and linkers. Ring systems are defined as cycles or cycles sharing an edge within the graph representation of a molecule.

Linkers are atoms connecting two ring systems. Ring systems and relevant linkers in a compound constitute the molecular framework. Yang et al.24) divided the molecule into MF and side chain part as defined by Murcko (Figure 2).

(11)

6 Figure 2: Division of a compound into a Molecular framework and side chains.24)

The descriptor fMF is defined as the number of heavy atoms in the MF (HEVMF) divided by the total number of heavy atoms (HEVtotal) in the molecule as shown in Eq. 3 and fMF [0, 1].

The side chains refer to all the heavy atoms that are not included in the MF.

ƒ

(3)

The value of fMF reveals the compound composition in terms of MF and side chains. A high fMF indicates that the compound contains a large MF and a small side chain and vice versa.

Molecular HTS hit rate was calculated with python,Error! Reference source not found.

ClogP was calculated with a commercial program,35) while HEV, Fsp3 and fMF were calculated with an in-house C++ program.36) Since fMF is based on HEV, it should be independent of the program used for calculation. Fsp3, on the other hand, is based on the aromaticity of the compound which might be defined differently in different programs. However, for drug-like molecules as is used in this study, it should be seldom that different programs assign different aromaticity to a ring system.

Wilcoxon rank-sum tests were made in R37) to determine if the difference in average molecular hit rate were statistical significant between different descriptor values. The Wilcoxon test is a non-parametric statistical hypothesis test that compares the difference between the population mean of two related samples.38)

Fitting molecular HTS hit rate with beta-binomial distribution

Model with non-polynomial parameters. To quantitatively measure the influence of different descriptors on the molecular hit rate, empirical hit rate probability models were built to fit with experimental HTS hit rate data. In the probability theory of statistics, the binomial distribution is the discrete probability distribution of the number of successes in a sequence of n independent binary yes/no experiments, each of which yields success with probability p. In this study, the probability of a compound being active in a given assay corresponds to its molecular hit rate, h, and if the assumptions that for any compound, there is a constant probability that the compound would be registered as a “hit” (active) in a randomly chosen assay and the results of different assay are independent, are valid, then the probability for a compound being y times active in a fixed number of assays, N, is given by the binomial distribution with a probability mass function according to Eq. 4.

( )

( ) ( ) (4)

However, the molecular hit rate for compounds with the same descriptor value varies in a random manner, so the hit rate is actually unknown. A flexible way to model this is to assume that for a randomly chosen compound, the hit rate will vary accordingly to a beta distribution with a probability density function depending on descriptor values according to Eq. 5.

(12)

7 ( )

( ( ) ( ))( ) ( ) ( ) ( ) ( ) ( )

(5)

The variable x corresponds to the molecular descriptor value, B is the beta function and a and b are parameters dependent on x. a and b have to be positive and they are presumed to be exponential functions of x (Eq. 6). For example, with two descriptors denoted by x1 and x2.

( ) ( ) (6)

As described above, if the hit rate is constant then the probability of having certain number of hits in a fixed number of assays would follow a binomial distribution. On the other hand, the molecular hit rate at each descriptor value follows a beta distribution with the probability density function in Eq. 5. Then the probability of a randomly selected compound being y times active in N assays is given by the probability mass function according to Eq. 7.

( ) ( )

( ( ) ( ) ) ( ( ) ( ))

(7)

This is obtained by combing Eq. 4 and 5 and integrating out the hit rate (applying the usual rules of probability) and used to build the hit rate models.

For assumption in Eq. 6, which is used in the beta distribution (Eq. 5), estimations of the coefficients a0, a1, b0, b1 etc. (corresponding to how many descriptors that were included in the model) had to be done. These estimations were achieved by using the standard statistical method of maximum likelihood.39) The likelihood function simply is the product of the probability mass function (Eq. 7) evaluated at all of the data point and then it is maximized by using a nonlinear numerical optimization procedure with respect to the coefficients.

To find out which descriptor that is most important to describe molecular hit rate, a small subset of the data was created, containing 100 000 randomly chosen compounds that had been tested in more than 42 assays and had the same descriptor values as at least 50 other compounds (Dataset 4). The initial coefficient values (a0, a1, b0, b1, etc.) for the optimization process were all set to zero and linear terms of the descriptors were used to build different models with forward selection. Forward selection is an approach that starts without any independent variables in the model and then adds each variable, one at the time, to find the one that fits the model best. This procedure was repeated until no improvement of adding a new variable could be found. To decide which model fitted best at each cycle, the model selection criterion BIC (Bayesian Information Criterion, Eq. 8) was introduced.40)

( ) ( ) (8) where k is the number of estimated parameters and nobs the number of data points. A smaller BIC value represents a better fitted model.

This BIC value is a summarized value of two parts, the first part decreases as the

“goodness of fit” of a model improves (which happens every time another explanatory variable is included, in addition to those already present) and the second part is a penalty term, which increases with the number of independent variables added to the model. This is to guard against over fitting the model by continually adding terms to the model until there are as many parameters as data (where fit would be perfect, but with zero reduction in complexity). The negative log likelihood is usually termed as “lack of fit”, since it decreases as fit improves.

Model with polynomial parameters. So far the beta-binomial model described above use ClogP, HEV, Fsp3 and fMF descriptors to fit with experimental hit rate data and the exponential term in parameter a, b of the probability mass function (Eq. 7) is a form of linear

(13)

8 combination of those descriptors, i.e. a non-polynomial function. It is of great interest to explore these models by adding polynomial terms of descriptors in the function (Eq. 6), e.g.

adding ClogP2, ClogP3 as separate descriptors. This will change Eq. 6 into Eq. 9.

( ) ( ) (9)

Here the a, b are exemplified as having square terms (including cross terms) and in current study higher order polynomial terms were also tried out to examine the model performance.

For models using polynomial parameters, a more complex protocol as shown in Figure 3 was used to estimate the parameters.

First the range of the descriptors was divided into small intervals with equal length, using Dataset 1. Since the change in x over each interval is small, the quantities in the exponential function of Eq. 9 could be replaced by constant values and this makes the parameters a and b in the standard beta-binomial model (Eq. 8) become constant in each interval. These two parameters can be first initialized using the method of moments.39) Method of moments is a statistical approach, using the population hit rate mean and variance to get the initial a and b parameters (as shown in Eq. 10-13) for each small bin.

∫ ( )

(10)

∫ ( )

( ) ( )

(11)

( )

(12) ( ) ( )

(13)

Figure 3: The protocol for estimating polynomial parameters in the model Divide each descriptor range into bins

Method of moments for each bin

Likelihood function for each bin

Maximize the likelihood function for local optimal a,b in each bin

Least squares to solve a(x), b(x)

Likelihood function for the whole range

Maximize the likelihood function for global optimal a,b

Hit rate mean and variance for each bin

Product of the probability mass function Using the optimization

function in R

(14)

9 Here f(h;a,b) is the probability density function (Eq. 5), µ is the hit rate mean and σ2 is the hit rate variance. Substituting the sample mean and variance provides the method of moments estimates for a and b. These estimates were then used as initial guesses and an optimization procedure was run in each bin to obtain local optimal a, b values in each bin, which maximize the likelihood function. This created a sequence of local optimal a and b values across the whole descriptor range and an ordinary least squares fitting was done to estimate the global coefficients (a0, a1, b0, b1, etc.) of polynomial terms for the functions a(x) and b(x) in Eq. 9.

These least squares estimates can be used to initialize the maximum likelihood estimation of the whole data set.

An R37) script was written to do the parameter estimation for beta-binomial distribution models, forward descriptor selection and BIC score calculation. The numerical optimization was done by using BFGS algorithmError! Reference source not found.

implemented in “Optim” module of R.

(15)

10

Results and Discussion

Relationship between molecular HTS hit rate and molecular descriptors Investigations of the relationship between molecular HTS hit rate and lipophilicity (ClogP), size (HEV), Fsp3 and fMF were done based on the SP data (Dataset 1) and results are shown in Figure 4. For each property plot, compounds were binned according to their property value and plotted against average hit rate of each bin (red curve) and the number of compounds representing each bin (blue bars). The bins containing less than 1% of the whole compound set were removed. For results that are extensively discussed in the text, we have included statistical tests (Wilcoxon rank-sum) in the relevant figures. Differences between other bins might also be statistically significant. However, since the differences are not explicitly discussed in the text, the corresponding statistical tests are not included in the figures.

It can be seen from Figure 4 that the ClogP, HEV and fMF descriptors are all positively correlated with molecular hit rate. That means compounds which have larger ClogP, HEV and fMF values in the compound collection tend to have higher molecular hit rate in HTS screens. It has been shown by various studies24),42) that ClogP, HEV and fMF are positively correlated with compound promiscuity based on the Bioprint database,42) so our results based on HTS data are consistent with previous findings. It was previously found that compounds with an fMF value >0.65 are significantly more promiscuous than compounds with lower fMF values. This is however not in line with the finding in our study, where a more linear relationship between molecular hit rate and fMF is demonstrated. This difference between the studies might be due to the differences in the datasets used in the two studies. On the other hand, Fsp3has an inverse correlation with hit rate (shown in Figure 4c), which means that a larger Fsp3 value corresponds to a lower hit rate, in line with a similar previously published investigation.43) The average hit rate according to ionization states is displayed in Figure 4e indicating that acidic compounds have slightly higher hit rates than basic and neutral compounds.

The same plots were generated based on Dataset 3, where the compounds activity flag were labeled due to biological effect (explained in Methods) to remove possible human errors. The relationships between molecular hit rate and ClogP, HEV, fMF and Fsp3, respectively, are included in Appendix Figure A1. In these plots, similar correlation trends were observed.

a) b)

c) d)

0 50000 100000 150000 200000 250000 300000 350000

0 0.005 0.01 0.015 0.02 0.025 0.03 0.035

-2-1 1-2 2-3 3-4 4-5 5-6 Number of compounds

Average Hit rate

ClogP x

x

0 50000 100000 150000 200000 250000 300000 350000

0 0.005 0.01 0.015 0.02 0.025 0.03 0.035

9-12 12-15 15-18 18-21 21-24 24-27 27-30 30-33 33-36 36-39 Number of compounds

Average Hit rate

HEV x

x

0 50000 100000 150000 200000 250000 300000

0 0.005 0.01 0.015 0.02 0.025 0.03 0.035

Number of compounds

Average Hit rate

Fsp3 x

x

0 50000 100000 150000 200000 250000 300000 350000 400000 450000

0 0.005 0.01 0.015 0.02 0.025 0.03 0.035

Number of compounds

Average Hit rate

fFM x

x

(16)

11

e)

Figure 4: Relationship between molecular hit rate and a) ClogP, b) HEV, c) Fsp3, d) fMF, and e) ion class based on the SP data (Dataset 1) [(x) p < 0.0001]

We have also traced down SP actives from each primary assay in their following screening cascades to check if they are true actives or not. In this case, SP actives will be labeled as true actives only if they were labeled as active compound in the following concentration response (CR) test. By doing in this way, it was our hope that false positive data points could be filtered out. After cleaning the data with this CR test correction (Dataset 2), the molecular HTS hit rates were then calculated and plotted against the four molecular properties (shown in Figure 5). Comparing these results with those obtained in Figure 4, it can be seen that the average hit rate value decreases after the CR correction. However, the correlation trends are still the same for all the properties, indicating that our conclusions regarding molecular HTS hit rate and molecular property relationships were valid.

It was also of interest to investigate these trends by dividing the compounds into different ion classes. Similar plots to Figure 4 were generated for the different ion classes and are displayed in Appendix, Figure A2-A4. Basically the same trends can be seen between molecular hit rate and the four descriptors for all three ion classes, indicating that the trends in Figure 4 are true, regardless of ion class.

a) b)

c) d)

0 200000 400000 600000 800000 1000000

0 0.005 0.01 0.015 0.02 0.025 0.03 0.035

Acid Base Neutral Number of compounds

Average Hit rate

Ion class

0 50000 100000 150000 200000 250000 300000 350000

0 0.001 0.002 0.003 0.004 0.005 0.006 0.007 0.008 0.009

-2-1 1-2 2-3 3-4 4-5 5-6 Number of compounds

Average Hit rate

ClogP x

x

0 50000 100000 150000 200000 250000 300000

0 0.001 0.002 0.003 0.004 0.005 0.006 0.007 0.0080.009

9-12 12-15 15-18 18-21 21-24 24-27 27-30 30-33 33-36 36-39 Number of compounds

Average Hit rate

HEV x

x

0 50000 100000 150000 200000 250000 300000

0 0.001 0.002 0.003 0.004 0.005 0.006 0.007 0.008

0.05 0.15 0.25 0.35 0.45 0.55 0.65 0.75 Number of compounds

Average Hit rate

Fsp3 x

x

0 50000 100000 150000 200000 250000 300000 350000 400000 450000

0 0.001 0.002 0.003 0.0040.005 0.0060.007 0.008 0.009

Number of compounds

Average Hit rate

fFM

x

x

(17)

12

e)

Figure 5: Relationship between molecular hit rate and a) ClogP, b) HEV, c) Fsp3, d) fMF, and e) ion class based on the CR corrected data (Dataset 2)[(x) p < 2.2e-16]

Inter-correlation analysis

The correlations between the descriptors were investigated by plotting the average value of the first descriptor against binned values of the second descriptor, based on Dataset 1.

Lipophilicity and size have a strong positive correlation (Figure 6a). ClogP and Fsp3 are negatively correlated (Figure 6b) and Figure 6f shows that Fsp3 and fMF are negatively correlated. On the other hand, for ClogP vs. fMF (Figure 6c), HEV vs. Fsp3 (Figure 6d) and HEV vs. fMF (Figure 6e), no monotonous relationship can be found. Thus certain interdependence between the used descriptors were observed, this triggered a deeper analysis of how the descriptor inter-correlation would influence the correlation between the descriptors and the molecular HTS hit rate as described in the next section.

a) b)

c) d)

0 100000 200000 300000 400000 500000 600000 700000 800000 900000

0 0.001 0.002 0.003 0.004 0.005 0.006 0.007 0.008 0.009

Acid Base Neutral Number of compounds

Average Hit rate

Ion class

0 50000 100000 150000 200000 250000 300000 350000 400000 450000

0 0.5 1 1.5 2 2.5 3 3.5 4 4.5

8 13 18 23 28 33 38 43 Number of compounds

Average ClogP

HEV

0 50000 100000 150000 200000 250000 300000

0 0.5 1 1.5 2 2.5 3 3.5

Number of compounds

Average ClogP

Fsp3

0 50000 100000 150000 200000 250000 300000 350000 400000 450000

0 0.5 1 1.5 2 2.5 3 3.5

Number of compounds

Average ClogP

fMF

0 50000 100000 150000 200000 250000 300000

0 5 10 15 20 25 30

Number of compounds

Average HEV

Fsp3

(18)

13

e) f)

Figure 6: Relationship between a) ClogP and HEV, b) ClogP and Fsp3, c) ClogP and fMF, d) HEV and Fsp3, e) HEV and fMF and f) Fsp3 and fMF based on SP data (Dataset 1).

Fitting molecular HTS hit rate with beta-binomial statistical models To quantitatively compare the influence of different molecular descriptors on the molecular HTS hit rate, we build statistical models which take the various descriptors as independent variables and hit rate as the response variable. The influence of the descriptors on the molecular hit rate could then be quantitatively measured by examining the deviance between the observed and modeled hit rate distributions.

As already explained, we have used the systematic procedure of utilising the BIC score value (Eq. 8) to select the most appropriate model for the hit rate distribution.

Models with non-polynomial parameters (Dataset 4). ClogP, HEV, Fsp3 and fMF were used as independent variables for describing hit rate and a forward selection process was carried out to build the hit rate beta-binomial models. The BIC score value (Eq. 8) was used to measure the model quality and Table 1 demonstrates how the BIC score varies during the forward selection process. When single descriptor models were built (the first round of model building), the ClogP model had the lowest BIC value and became the best model. This means that ClogP is the most influential descriptor among all four descriptors. In the second round of model building, ClogP was kept in the model and one at the time another descriptor was selected from the three remaining descriptors to be added into the model. Comparing these results with the single descriptor model, it seems like the model quality was further improved by adding one additional descriptor and the model combing Fsp3 and ClogP turned out to be the best two-descriptor-model. In the third round, adding HEV to the model provided the largest decrease of the BIC value and in the end, when fMF was added, the four- descriptor-model got only a small decrease of the BIC value.

0 50000 100000 150000 200000 250000 300000 350000 400000 450000

0 5 10 15 20 25 30 35

Number of compounds

Average HEV

fMF

0 50000 100000 150000 200000 250000 300000 350000 400000 450000

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7

Number of compounds

Average Fsp3

fMF

(19)

14 Table 1: BIC values in the four rounds of the forward selection procedure where the descriptors were added one at the time to fit the hit rate data.

Forward Selection BIC Score

Descriptors Round 1 Round 2 Round 3 Round 4 ClogP 992337.1

Fsp3 994222.5 988786.0

HEV 997614.3 991946.3 987743.2

fMF 998365.2 991637.4 988608.3 987696.5

Figure 7 demonstrates the decrease of the BIC values for the best model in each round of the forward selection procedure. It can be seen that adding more descriptors always improves the model quality and especially when adding Fsp3, a large improvement of the model was obtained. When HEV is added, the improvement margin is less than that of Fsp3 and fMF

gives the smallest improvement for the model. This suggests the impact of the four descriptors on molecular hit rate, from largest to least, is in following order ClogP > Fsp3 >

HEV > fMF. Especially Fsp3 provides a significant impact on hit rate on top of ClogP. Since the improvement of the model was very small when fMF was added it could be established that it did not have a significant impact on hit rate on top of the other descriptors and the descriptor was therefore excluded from further investigations.

Figure 7: Decrease of the BIC value when additional descriptors are added. The first model contains ClogP only. The second, third and fourth model are ClogP+Fsp3, ClogP+Fsp3+HEV and ClogP+Fsp3+HEV+fMF respectively

The cumulative distribution functions (cdf) for different models are shown in Figure 8. The cdf is obtained by integrating the probability density function in the beta distribution (Eq. 5) over a range of hit rate and indicates the cumulative fraction of compounds whose hit rate are lower than a specific hit rate for a given descriptor value. The comparison of cdf curves for the ClogP and ClogP+Fsp3 models is shown in the Figure 8. Each plot in Figure 8 corresponds to compounds with a given ClogP value and four representative ClogP values (ranging from high to low ClogP values) are displayed in Figure 8. The black curves in these plots always refer to the cdf’s for the ClogP model. The color curves in each plot refer to the cdf’s of the ClogP+Fsp3 model and the color scheme corresponds to different Fsp3 values in the model.

It can be seen from these cdf curves that for compounds with fixed ClogP value increasing Fsp3 value will decrease the probability of having a high molecular hit rate and this is true at all four ClogP values. This means that increasing Fsp3 will decrease the hit rate and the effect is independent of the influence of ClogP. Figure 4c and 5c have already shown that Fsp3 is negatively correlated with hit rate and the effect is reflected in the cdf of the ClogP+Fsp3 model as well. This results also demonstrate that the derived hit rate statistical model correctly capture the relationship between hit rate and Fsp3.

987000 988000 989000 990000 991000 992000 993000

1 2 3 4

BIC

Models

(20)

15

a) b)

c) d)

Figure 8: Hit rate cdf plots for the ClogP model and the ClogP+Fsp3 model. Four plots correspond to compounds with four representative ClogP value: a) ClogP= 4.8, b) ClogP=3.8, c) ClogP = 2.3 and d) ClogP = 0.9. X-axis refers to hit rate and y-axis refers to the cumulative fraction of compounds whose hit rate is lower than a given value on the x-axis. In each plot, the black curve refers to the ClogP model and other curves are colored according to different Fsp3 values.

It is also of our interest to examine the additional influence of size on top of ClogP and Fsp3.

As shown in Figure 9, the cdf’s for the ClogP+Fsp3 and the ClogP+Fsp3+HEV model are compared to investigate the influence of size on the molecular HTS hit rate. Similar to Figure 8, the black cdf curves refer to the cumulative distribution of compounds in the ClogP+Fsp3 model at fixed ClogP and Fsp3 values for given hit rate, while the colored cdf curves correspond to the cumulative distribution in the ClogP+Fsp3+HEV model for compounds which have fixed ClogP and Fsp3 value (same as in the ClogP+Fsp3 model) but with different HEV values. In Figure 9, three ClogP levels were selected to represent high, medium and low ClogP levels. For each ClogP level, compounds are splited into two groups with respect to high and low Fsp3 levels and compounds within each Fsp3 subgroup are further subdivided into five levels based on their size. The cumulative distribution for compounds in each subgroup are generated and displayed in each plot of Figure 9. It seems that for compounds with fixed ClogP and Fsp3 values, smaller molecular size lowers the odds of making a high hit rate. This is consistent to what we have seen in Figure 4b and 5b where larger compounds

(21)

16 tend to have higher hit rate. On the other hand, it also means that the effect of size on molecular hit rate is independent from ClogP and Fsp3 levels. Comparing with Figure 8, it can be seen that the effect margin is lower than that of Fsp3. Another interesting observation is that at the highest ClogP level (Figure 9a,b), the influence of size on hit rate is larger than that in other two ClogP levels (Figure 9c-f) reflecting that the influence of size on molecular hit rate is not constant, but depends on the specific ClogP and Fsp3 values.

a) b)

c) d)

(22)

17

e) f)

Figure 9: Hit rate cdf plots for the ClogP+Fsp3 and ClogP+Fsp3+HEV model at various ClogP and Fsp3 values: a) ClogP=4.8, Fsp3=0.5; b) ClogP=4.8, Fsp3=0.1; c) ClogP=2.9, Fsp3=0.5; d) ClogP=2.9, Fsp3=0.1; e) ClogP=0.9, Fsp3=0.6; f) ClogP=0.9, Fsp3=0.2. The X- axis refers to the hit rate and the y-axis refers to the cumulative fraction of compounds whose hit rate is lower than a given value on x axis. In each plot, the black curve refers to cdf for the ClogP+Fsp3 model and other colored curves are the cdfs for the ClogP+Fsp3+HEV model.

The colors correspond to different HEV values.

Models with polynomial parameters (Dataset 1). To further decrease the deviation of the statistical model from the experimental hit rate measurements, higher degree polynomial terms were added into the a(x) and b(x) parameters (Eq. 9) of the pdf for the beta distribution (Eq. 5). Actually, models with non-polynomial parameters described previously can also be regarded as polynomial models which only have one degree polynomial terms.

Since our study results in previous section has shown that ClogP is the most important factor for explaining molecular hit rate, we first built models with polynomial terms using only ClogP. The polynomial degree for ClogP models started from one and gradually increased to a degree of five. The relationship between each models BIC value and polynomial degree is shown in Figure 10. It seems that when quadratic and cubic terms are added into the model, the BIC value has a large decrease. But when higher polynomial terms are further added, the decrease of the BIC value is much smaller. This is probably because that gaining of model deviance from experimental data is largely offset by the penalty on introducing too many descriptors in the model. The ClogP polynomial model at a degree of four was chosen as the best ClogP model for further studies.

Figure 10: Decrease of the BIC value as the degree of polynomial increased for the ClogP model.

11020500 11021000 11021500 11022000 11022500 11023000

1 2 3 4 5

BIC

Degree of polynomial

(23)

18 The impact of Fsp3and size was also investigated by building ClogP+Fsp3 and ClogP+HEV models respectively, with different degree polynomial terms. Figure 11 demonstrates the improvement of fit of the ClogP model when Fsp3 and HEV are added to the model, respectively, as well as when polynomial terms are gradually added to the three different models. It can be seen that adding polynomial terms into these models will improve the model fit to the experimental hit rate data and Fsp3 have larger impact than size in terms of decreasing BIC score. It can also be seen that gradually adding an additional polynomial term to the different models the decrease of BIC score is lowest for the ClogP model and biggest for the ClogP+Fsp3 model. These results clearly show that Fsp3has a more significant impact on hit rate than size.

Figure 11: Comparison of the ClogP, ClogP+Fsp3 and ClogP+HEV polynomial models.

To visually display the deviance between the predicted and experimental hit rate, the surfaces for predicted mean molecular hit rate are generated and compared with experimental mean hit rate in Figure 12 and 13. The surface in each figure represents the predicted mean hit rate from the model. The colored dot represents the mean experimental hit rate for a given descriptor bin and the color of the dots corresponds to the compound density in each bin, where the green dots represents lowest density and the purple dots represents the highest.

Figure 12a shows the comparison between the four degree ClogP+Fsp3 polynomial model fitted mean hit rate and mean experimental hit rate data. It can be seen in the figure that the model (blue surface) in general align well with the dots, which indicates that the model has a good fit with the experimental mean hit rate. In Figure 12b, another surface (red) is added in for comparison, which is generated from a four degree ClogP polynomial model. We can see that the red surface has a larger deviation to those green dots which have high Fsp3 in general, while the surface from the ClogP+Fsp3model can match with them nicely. Similar surface plots are made for the ClogP+HEV model (Figure 13a, b). The ClogP+HEV model surface matches the dots closely and also gives a better prediction comparing with the ClogP model. For the ClogP model, the dots corresponding to high HEV values have large deviation to the model surface and they are well fitted in the ClogP+HEV model (Figure 13b).

Comparing Figure 12 with Figure 13 it seems that the ClogP+Fsp3 model surface fits better to experimental data than the ClogP+HEV model and this is also consistent with the BIC values for these two models.

10970000 10980000 10990000 11000000 11010000 11020000 11030000

1 2 3 4 5

BIC

Degrees of polynomial

ClogP ClogP+HEV ClogP+Fsp3

(24)

19

a) b)

Figure 12: Surface plots comparing a) the ClogP+Fsp3 polynomial model (four degree polynomial terms) predicted mean hit rate with mean experimental hit rate; b) the predicted mean hit rate from the ClogP and the ClogP+Fsp3 polynomial model (both have four degree polynomial terms) with mean experimental hit rate. The dots correspond to the experiment data and the color represents compound density in the bin, where the green dots represent the bins containing less than 500 compounds, blue dots/bins containing compounds within the [500, 1000] range and purple dots/bins containing more than 1 000 compounds). The X- Y plane refers to Fsp3 and ClogP descriptor space and the Z axis refers to the mean hit rate.

The blue surface corresponds to the ClogP+Fsp3 model and the red surface corresponds to the ClogP model.

a) b)

Figure 13: Surface plots comparing a) the ClogP+HEV polynomial model (four degree polynomial terms) predicted mean hit rate with mean experimental hit rate; b) the predicted mean hit rate from the ClogP and the ClogP+Fsp3 polynomial model (both have four degree polynomial terms) with mean experimental hit rate. The dots correspond to experiment data and the color represents compound density in the bin, where the green dots represent the bins containing less than 500 compounds, blue dots/bins containing compounds within the [500, 1000] range and purple dots/bins containing more than 1000 compounds). The X-Y plane refers to HEV and ClogP descriptor space and the Z axis refers to the mean hit rate. The blue surface corresponds to the ClogP+HEV model and the red surface corresponds to the ClogP model.

References

Related documents

This leads to significant strength enhancement in these films by improving adhesion between the fibres, as indicated by the better strength properties of the sheets obtained

46 Konkreta exempel skulle kunna vara främjandeinsatser för affärsänglar/affärsängelnätverk, skapa arenor där aktörer från utbuds- och efterfrågesidan kan mötas eller

The increasing availability of data and attention to services has increased the understanding of the contribution of services to innovation and productivity in

Parallellmarknader innebär dock inte en drivkraft för en grön omställning Ökad andel direktförsäljning räddar många lokala producenter och kan tyckas utgöra en drivkraft

Närmare 90 procent av de statliga medlen (intäkter och utgifter) för näringslivets klimatomställning går till generella styrmedel, det vill säga styrmedel som påverkar

• Utbildningsnivåerna i Sveriges FA-regioner varierar kraftigt. I Stockholm har 46 procent av de sysselsatta eftergymnasial utbildning, medan samma andel i Dorotea endast

Industrial Emissions Directive, supplemented by horizontal legislation (e.g., Framework Directives on Waste and Water, Emissions Trading System, etc) and guidance on operating

The EU exports of waste abroad have negative environmental and public health consequences in the countries of destination, while resources for the circular economy.. domestically