
Estimating environmentally important properties of chemicals from the chemical structure



Erik Furusjö, Magnus Andersson, Magnus Rahmberg, Anders Svenson


Tools for predicting the bioaccumulation potential, toxicity and persistence of chemicals

Address: Box 21060, 100 31 Stockholm

Telephone: 08-598 563 00

Project sponsor: CFs Miljöfond, SIVL

Authors: Erik Furusjö, Magnus Andersson, Magnus Rahmberg, Anders Svenson

Title and subtitle of the report: Estimating environmentally important properties of chemicals from the chemical structure

Summary

The need for models that predict important environmental properties is easily recognised in light of the large number of existing chemicals that still need to be characterised. Such models may also be useful for meeting the need to test new chemicals. Here, new quantitative structure-activity relationship (QSAR) models are presented that predict acute and subacute aquatic toxicity to a green alga (Pseudokirchneriella subcapitata), a crustacean (Daphnia magna) and two fish species (Lepomis macrochirus, Leuciscus idus), as well as toxicity in a bacterial bioluminescence inhibition test (Microtox). The toxicity is predicted from more than 1400 molecular descriptors using the multivariate statistical method partial least squares (PLS) regression. The models are based on descriptors calculated from the chemical structure only and can therefore be applied to substances that have not yet been isolated or synthesised.

QSAR models were obtained for which the standard prediction errors in logarithmic units correspond to the following concentration factors: Microtox 15 min bioluminescence inhibition EC50 – a factor 3.4; green alga 96 h growth rate inhibition EC50 – a factor 2.8; Daphnia magna 48 h immobilisation EC50 – a factor 2.3; Lepomis macrochirus 96 h toxicity LC50 – a factor 2.4; Leuciscus idus 96 h toxicity LC50 – a factor 3.5.

In addition to developing prognosis models, the aim of this project was to develop methodology for obtaining more reliable QSAR predictions of toxicity. Two methodologies that are very important in this respect are systematic selection of the training set by statistical molecular design (SMD) and outlier detection. Partial least squares (PLS) modelling provides unique diagnostic tools that apply when the model is used to predict the toxicity of new substances. Using these, it can be detected whether the model covers the substance it is applied to, i.e. whether the substance is a model outlier for which the prediction is likely to be inaccurate. It is shown that reliable and efficient automatic outlier detection can be obtained. This is a major advantage for routine use of QSAR models and a leap forward towards reliable QSAR estimates of substance properties that do not require expert knowledge of the user.

Keywords (with any connection to a geographical area or industry sector)

Quantitative structure activity relationships, QSAR, SAR, chemical structure, aquatic toxicity, screening new chemicals, multivariate data analysis, MVA, partial least squares, PLS, statistical molecular design, SMD, outlier detection, robust models

Bibliographic data

IVL Rapport/report B1517

Ordering address: IVL, Publikationsservice, Box 21060, S-100 31 Stockholm


Table of Contents

Abstract
1 Introduction
2 Toxicity
3 Theory
3.1 Molecular descriptors
3.1.1 Measured descriptors
3.1.2 Calculated descriptors
3.1.3 Software
3.2 Modelling methods
3.2.1 Linear regression
3.2.2 Multivariate projection methods
3.2.3 Non-linear methods
3.2.4 Common Reactivity Pattern
3.2.5 Model validation and model accuracy measures
3.2.6 Outliers in QSAR models
3.2.7 Statistical Molecular Design (SMD)
4 Methods
4.1 Toxicity data
4.1.1 Microtox toxicity
4.1.2 Alga toxicity
4.1.3 Daphnia toxicity
4.1.4 Fish toxicity
4.2 Descriptor calculation and QSAR modelling
5 Results
5.1 Microtox prognosis model
5.1.1 Prediction outlier detection
5.1.2 Random training set selection
5.1.3 Systematic training set selection
5.2 Green alga toxicity prognosis model
5.2.1 Random training set selection
5.2.2 Systematic training set selection
5.3 Daphnia toxicity prognosis model
5.3.1 Simple variable selection
5.4 Fish toxicity prognosis model
5.4.1 Lepomis macrochirus toxicity model with random training set selection
5.4.2 Lepomis macrochirus toxicity model with systematic training set
5.4.3 Leuciscus idus toxicity models
6 Discussion
7 Conclusions
8 Acknowledgements
9 References
Appendix A: Descriptors calculated by the Dragon software
Appendix B: Substances and reference data used


Abstract

The need for models that predict important environmental properties is easily recognised in light of the large number of existing chemicals that still need to be characterised. Such models may also be useful for meeting the need to test new chemicals. Here, new quantitative structure-activity relationship (QSAR) models are presented that predict acute and subacute aquatic toxicity to a green alga (Pseudokirchneriella subcapitata), a crustacean (Daphnia magna) and two fish species (Lepomis macrochirus, Leuciscus idus), as well as toxicity in a bacterial bioluminescence inhibition test (Microtox). The toxicity is predicted from more than 1400 molecular descriptors using the multivariate statistical method partial least squares (PLS) regression. The models are based on descriptors calculated from the chemical structure only and can therefore be applied to substances that have not yet been isolated or synthesised.

QSAR models were obtained for which the standard prediction errors in logarithmic units correspond to the following concentration factors:

• Microtox 15 min bioluminescence inhibition EC50 – a factor 3.4

• green alga 96 h growth rate inhibition EC50 – a factor 2.8

• Daphnia magna 48 h immobilisation EC50 – a factor 2.3

• Lepomis macrochirus 96 h toxicity LC50 – a factor 2.4

• Leuciscus idus 96 h toxicity LC50 – a factor 3.5
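The link between a standard prediction error in logarithmic units and a concentration factor can be illustrated with a small calculation. The sketch below assumes base-10 logarithms (the text only says "logarithmic units"), and the function names are illustrative rather than taken from the report.

```python
import math

def factor_to_log_error(factor: float) -> float:
    """Error in log10 units that corresponds to a multiplicative factor."""
    return math.log10(factor)

def log_error_to_factor(log_error: float) -> float:
    """Multiplicative concentration factor for an error given in log10 units."""
    return 10.0 ** log_error

# Concentration factors quoted in the list above.
reported_factors = {
    "Microtox 15 min EC50": 3.4,
    "Green alga 96 h EC50": 2.8,
    "Daphnia magna 48 h EC50": 2.3,
    "Lepomis macrochirus 96 h LC50": 2.4,
    "Leuciscus idus 96 h LC50": 3.5,
}

for test_name, factor in reported_factors.items():
    print(f"{test_name}: factor {factor} ~ {factor_to_log_error(factor):.2f} log10 units")
```

Under this assumption, a factor of 10 would correspond to an error of exactly one log unit, so all five models have standard prediction errors well below one order of magnitude.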

In addition to developing prognosis models, the aim of this project was to develop methodology for obtaining more reliable QSAR predictions of toxicity. Two methodologies that are very important in this respect are systematic selection of the training set by statistical molecular design (SMD) and outlier detection. Partial least squares (PLS) modelling provides unique diagnostic tools that apply when the model is used to predict the toxicity of new substances. Using these, it can be detected whether the model covers the substance it is applied to, i.e. whether the substance is a model outlier for which the prediction is likely to be inaccurate. It is shown that reliable and efficient automatic outlier detection can be obtained. This is a major advantage for routine use of QSAR models and a leap forward towards reliable QSAR estimates of substance properties that do not require expert knowledge of the user.


1 Introduction

The work described in this report concerns methods for estimating environmentally important properties, such as aquatic toxicity, from the structure of the chemical substance. The use of quantitative structure-activity relationships (QSAR) is becoming established and accepted for estimating the ecotoxicity of many chemicals in the absence of results from actual toxicity tests (see footnote 1). However, most QSAR models used today have some limitations:

• Application relies on the availability of measured physicochemical parameters such as octanol/water partition coefficient, density, refractive index, boiling and melting points, etc.

• A prediction from the model does not give any diagnostic information on whether or not the model is valid for this compound and, thus, what the quality of the prediction can be expected to be.

• The applicability of the models is in general limited to narrow classes of compounds.

The aim of the research presented here is to investigate to what degree these limitations can be relaxed and what modelling methods and molecular descriptors are best in this respect.

The first limitation is serious, since it is often of great interest to assess the environmental properties of a substance that has not been isolated in a laboratory. Further, laboratory tests, even simple ones such as determination of solubility and partition coefficients, are time-consuming and expensive even when the substance has been isolated. Thus, our work has focused completely on models that are based on the structure of the substance alone, i.e. without requiring access to the actual substance or any physicochemical measurements.

The lack of prediction diagnostics indicating the expected quality of a prediction can easily lead to over-confidence in the value produced by the model. A value without an uncertainty measure is often perceived as exact, although the opposite is usually closer to the truth. In our opinion, a value should be treated with scepticism in the absence of an uncertainty estimate of some kind.

1 See e.g. the homepage of the ECOSAR software under the USEPA New chemicals program, http://www.epa.gov/oppt/newchems/21ecosar.htm. Please note the discussion about these models in


Multivariate modelling methods based on latent variables, such as principal component analysis (PCA) and partial least squares (PLS), can provide prediction diagnostics with each prediction because the covariance structure of the descriptor set is modelled. Although PLS has been used in several QSAR studies [Giraud et al. 2000, Shi et al. 2001, Tong et al. 1998, Eriksson et al. 2000], this important feature of the algorithm is often neglected.

The third limitation listed above, that the models are valid only for a narrow class of compounds, is probably the most difficult to solve. The reason for this is that different groups of substances can act with different mechanisms, which may be difficult to capture in a single model, especially for more specific and complex responses. The research presented here deals primarily with more general responses like aquatic toxicity. In order to reach the goal of general applicability, the work has been focused on models covering a wide range of chemicals.

2 Toxicity

Toxicity is an important property used in risk assessment and classification of substances. Acute toxicity, usually synonymous with lethality, is characterised by exposure that is short in relation to the life cycle of the organism. Long-term exposure, usually to lower doses, may cause chronic effects. Effects on reproduction and effects of exposure over more than one life cycle are examples of such effects.

In this study, the bioluminescence inhibition of a marine bacterium kept under non-reproducing conditions (Microtox), the Daphnia magna 48 h immobilisation test and the 96 h fish lethality tests all represent acute toxic effects. The alga growth rate inhibition test, however, can be considered at least a sub-acute or sub-chronic test although its duration was only 96 h: several life cycles pass within this time, so effects on reproduction may be captured.

3 Theory

3.1 Molecular descriptors

In the scope of the investigation presented in this report, the purpose of molecular descriptors is to be the basis of models describing some aspect or aspects of the behaviour of chemical substances. Some general requirements that need to be fulfilled in order to make this possible are:


• The descriptors should contain relevant information for the purpose of the modelling, i.e. the aspect of the behaviour modelled. This means that the descriptor should allow for, and take into account, flexibility in the chemical structure if this is necessary to capture the behaviour of the substance.

• Most modelling methods require that the size of the descriptor set is independent of the size of the molecule.

Molecular descriptors can be classified by origin into measured and calculated descriptors. The major difference from an application point of view is that the chemical substance in question is required in order to obtain a measured descriptor, while calculated descriptors can be obtained for substances that cannot be isolated or have not yet been synthesised.

Andersson et al. [2000] have compared the information content in measured physicochemical descriptors and some calculated descriptors. Their results show that the descriptor sets contain similar information for the data sets investigated. The aim of the work presented in this report is to forecast environmental properties of large sets of new chemical substances. Measured properties are frequently not available for such sets, and the aim is often to prioritise the substances for testing. Thus, the work presented is focused on the use of calculated descriptors, and measured descriptors are discussed only very briefly below.

The distinction between measured and calculated descriptors is only one of many that can be made. Other possible classifications are global and local (depending on whether the descriptor describes a property of the whole molecule or of a part of it), static and dynamic (depending on whether dynamics such as conformational changes are considered) as well as relative and absolute [Wehrens et al. 1999].

3.1.1 Measured descriptors

Undoubtedly, the single most important descriptor used in QSAR is hydrophobicity, which is usually measured as the logarithm of the octanol/water partition coefficient, log KOW.

Other examples of useful measured descriptors include [Andersson et al. 2000, Livingstone 2000]:

• solubilities in different solvents

• boiling, melting and flash points

• spectroscopic properties such as NMR shifts or IR/Raman stretching frequencies

• molecular volume and density

• specific refraction and molecular refractivity


It is not difficult to understand why properties like log KOW and solubility are important since they reflect the way the substance is distributed within an organism, which is of course important for its biological activity.

There are numerous methods for estimating log KOW from the chemical structure based on different algorithms. Frequently such estimates are used as a descriptor for further QSAR modelling if experimental log KOW values are not available. In such cases, the descriptor is not measured but often it is still denoted measured since both measured and calculated values of log KOW are used as basis for the same model and no distinction is made between them.

3.1.2 Calculated descriptors

In order to relate chemical structure to biological activity or other molecular properties, it is necessary to describe the chemical structure numerically in some manner. A calculated molecular descriptor is a number extracted by a well-defined algorithm from a structural representation of the molecule. The descriptors are defined by the algorithm used for the calculation. Often, the chemical/physical interpretation of this number is not straightforward. However, this does not mean that the descriptor does not contain useful information about the properties of the molecule. We quote Professor Roberto Todeschini of the Chemometrics and QSAR research group, Dept. of Environmental Sciences, University of Milano-Bicocca, Italy: "There is good reason to believe that often our difficulties in attributing a meaning to this number lie ultimately in the lack of deeper chemical theories and higher level languages and not from esoteric approaches to the descriptor definition." [web site http://www.disat.unimib.it/chm].

Numerous types of descriptors have been developed to numerically describe chemical structures. They can be coarsely classified into the groups 0D, 1D, 2D, 3D and other. These groups are briefly reviewed below.

3.1.2.1 0D descriptors

0D descriptors are constitutional in character and independent of molecular connectivity and conformations. Typical examples are atom and bond type counts, molecular weight and sum of atomic van der Waals volumes.

Descriptors of this type cannot distinguish between most isomers or between otherwise closely similar molecules, e.g. m-nitrophenol and p-nitrophenol.
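The following minimal sketch illustrates why constitutional descriptors cannot separate isomers. It computes atom counts and molecular weight from a molecular formula only; the helper names and the rounded atomic weights are illustrative choices, not part of the software used in this report.

```python
import re

# Rounded standard atomic weights (g/mol) for the elements used in the example.
ATOMIC_WEIGHT = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}

def atom_counts(formula: str) -> dict:
    """Count atoms of each element in a simple formula such as 'C6H5NO3'."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts

def molecular_weight(formula: str) -> float:
    """Sum of atomic weights, a typical 0D descriptor."""
    return sum(ATOMIC_WEIGHT[el] * n for el, n in atom_counts(formula).items())

# Both isomers share the same molecular formula, so every 0D descriptor
# derived from it is identical for the two substances.
m_nitrophenol = "C6H5NO3"
p_nitrophenol = "C6H5NO3"

print(atom_counts(m_nitrophenol), round(molecular_weight(m_nitrophenol), 2))
print(atom_counts(p_nitrophenol), round(molecular_weight(p_nitrophenol), 2))
```

Because the position of the nitro group never enters the calculation, separating the two isomers requires descriptors that encode connectivity (2D) or geometry (3D).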

3.1.2.2 1D descriptors

Counts of functional groups and atom-centred fragments, i.e. fractions of a molecule involving a few atoms, are often termed 1D molecular descriptors.


Molecular holograms. Holographic QSAR (HQSAR) is a recently developed technique that uses molecular holograms as descriptors [Burden, Winkler 1999]. We have chosen to classify this type of descriptors as 1D since it is based on structure fragments similar to the other 1D descriptors. The calculation procedure is roughly as follows: the molecule is divided into fragments of a number of atoms. Typically, a range like 3 to 8 atoms per fragment excluding hydrogen is used. Each fragment is mapped to an integer number. The integers are arranged in a number of bins (similar to a histogram) and the descriptors are the number of fragments in each bin. Typically, the number of bins used is in the range 20-400. HQSAR descriptors have not been used to obtain the results in this report.

3.1.2.3 2D descriptors

2D descriptors are dependent on the constitution and connectivity of the molecule but independent of conformation. Thus, 2D descriptors can be calculated from a 2D-structure representation of the molecule, e.g. a structure formula of an organic molecule.

2D autocorrelations. An autocorrelation function of the form $A(d) = \sum_{i,j} p_i p_j$, where the sum runs over pairs of atoms i and j separated by the topological distance d, can be used to encode the topology of a molecular graph. $p_i$ and $p_j$ represent the values of an atomic property at atoms i and j, respectively, and d is the topological distance between the two atoms measured in bonds along the shortest path. The function has the useful property that no matter how large and complex the molecule, it can be encoded in a fixed-length vector of small rank. Typically, only path lengths of 2 to 8 are considered. Atomic properties include e.g. atomic mass, volume, polarisability and electronegativity.
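The autocorrelation function can be sketched in a few lines. The example assumes a hydrogen-suppressed molecular graph stored as an adjacency list and counts each unordered atom pair once; both choices, as well as the function names, are assumptions made for illustration and may differ from the conventions of a given descriptor package.

```python
from collections import deque

def topological_distances(adjacency):
    """All-pairs shortest path lengths (in bonds) found by breadth-first search."""
    n = len(adjacency)
    dist = [[None] * n for _ in range(n)]
    for start in range(n):
        dist[start][start] = 0
        queue = deque([start])
        while queue:
            a = queue.popleft()
            for b in adjacency[a]:
                if dist[start][b] is None:
                    dist[start][b] = dist[start][a] + 1
                    queue.append(b)
    return dist

def autocorrelation(adjacency, prop, d):
    """A(d): sum of p_i * p_j over atom pairs at topological distance d."""
    dist = topological_distances(adjacency)
    n = len(adjacency)
    return sum(prop[i] * prop[j]
               for i in range(n) for j in range(i + 1, n)
               if dist[i][j] == d)

def autocorrelation_vector(adjacency, prop, distances=range(2, 9)):
    """Fixed-length descriptor vector, independent of the size of the molecule."""
    return [autocorrelation(adjacency, prop, d) for d in distances]

# Heavy-atom skeleton of 1-butanol: C-C-C-C-O, weighted by atomic mass.
butanol = [[1], [0, 2], [1, 3], [2, 4], [3]]
masses = [12.011, 12.011, 12.011, 12.011, 15.999]
print(autocorrelation_vector(butanol, masses))
```

Distances that do not occur in the molecule simply give zero, which is what keeps the vector length constant across molecules of different size.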

BCUT descriptors are calculated as the eigenvalues of the so-called adjacency matrix with the diagonal elements weighted by atomic masses. The adjacency matrix is a square matrix with each row/column corresponding to one atom. The ij element is 0 if atoms i and j are not connected, 1 if they are connected by a single bond, √2 if they are connected by a double bond etc.
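As a sketch of the construction described above, the matrix can be assembled and diagonalised with NumPy. The square-root weighting of bond orders follows the description in the text (1 for a single bond, √2 for a double bond); extending this to √3 for triple bonds is an assumption, and the function names are illustrative.

```python
import numpy as np

def weighted_adjacency(masses, bonds):
    """Adjacency matrix with sqrt(bond order) off the diagonal and
    atomic masses on the diagonal. bonds holds (i, j, order) tuples."""
    n = len(masses)
    matrix = np.zeros((n, n))
    for i, j, order in bonds:
        matrix[i, j] = matrix[j, i] = np.sqrt(order)
    matrix[np.diag_indices(n)] = masses
    return matrix

def bcut_eigenvalues(masses, bonds):
    """Eigenvalues of the symmetric weighted matrix, in ascending order."""
    return np.linalg.eigvalsh(weighted_adjacency(masses, bonds))

# Heavy-atom skeleton of acetaldehyde: C-C=O.
masses = [12.011, 12.011, 15.999]
bonds = [(0, 1, 1), (1, 2, 2)]
eigenvalues = bcut_eigenvalues(masses, bonds)
print(eigenvalues)
```

In practice only a few of the eigenvalues, commonly the largest and smallest, are kept as descriptors so that the descriptor set does not grow with the size of the molecule.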

Galvez topological charge indices are similar to the BCUT descriptors but the diagonal elements of the adjacency matrix are weighted by atomic charges instead of atomic weights.

Molecular walk counts. These descriptors count the walks and self-returning walks of different lengths in the molecular graph.
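Walk counts follow directly from powers of the adjacency matrix: entry (i, j) of the k-th power is the number of walks of length k from atom i to atom j. The sketch below counts directed walks (a walk and its reverse are counted separately), which is an assumed convention; some definitions halve these numbers.

```python
import numpy as np

def walk_counts(adjacency, max_length):
    """Return (length, total walks, self-returning walks) for each length."""
    a = np.array(adjacency, dtype=int)
    power = np.eye(len(a), dtype=int)
    results = []
    for length in range(1, max_length + 1):
        power = power @ a  # power now holds the adjacency matrix raised to `length`
        results.append((length, int(power.sum()), int(np.trace(power))))
    return results

# Carbon skeleton of propane: C-C-C.
propane = [[0, 1, 0],
           [1, 0, 1],
           [0, 1, 0]]
counts = walk_counts(propane, 3)
print(counts)
```

For an acyclic skeleton such as this one, self-returning walks of odd length are impossible, so the corresponding counts are zero.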

Various topological descriptors. A diverse set of descriptors, e.g. Wiener type indices and connectivity indices can be calculated from the 2D molecular structure.


3.1.2.4 3D descriptors

The 0D, 1D and 2D descriptors discussed above are independent of the 3D geometry of the molecule. It is reasonable to believe that the 3D structure of a molecule has a large influence on its biological activity. Thus, descriptors that contain information on 3D structure should be valuable for QSAR studies.

3D descriptors are calculated from the 3D structure of the molecule, i.e. they are dependent on the conformation, including bond angles, interatomic distances etc. Since these properties are not generally available for an arbitrary chemical substance, some type of geometry optimisation must be included in the modelling (and prediction) process if a generally applicable method is sought.

A relatively simple and fast method that is applicable for small as well as large molecules is to optimise the geometry of the molecule by molecular mechanics. All structures used for modelling in this work were optimised by this method. Various force fields can be used but the MM+ force field is a versatile force field that suits the aim of general applicability of the results.

The geometry optimisation is performed by a local optimisation algorithm, which means that it may converge to different local minima depending on the initial geometry. One approach is to perform optimisations starting with a number of different conformations and choosing the 3D structure with the lowest energy. Chemical knowledge can be used to start with reasonable conformations, which makes the probability of reaching the global energy minimum relatively large, but there is no guarantee that the global optimum is reached, especially for very large molecules.
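The multi-start strategy can be illustrated on a toy problem. The one-dimensional torsion-like energy function below is purely hypothetical and stands in for a molecular mechanics force field such as MM+, which is not implemented here; plain gradient descent plays the role of the local optimiser.

```python
import math

def energy(phi):
    """Hypothetical torsional energy with three minima of different depth."""
    return (1.0 + math.cos(3.0 * phi)) + 0.5 * (1.0 + math.cos(phi))

def gradient(phi):
    """Analytical derivative of the toy energy."""
    return -3.0 * math.sin(3.0 * phi) - 0.5 * math.sin(phi)

def local_minimise(phi, step=0.05, n_iter=2000):
    """Plain gradient descent: converges to the minimum nearest the start."""
    for _ in range(n_iter):
        phi -= step * gradient(phi)
    return phi

# Several starting conformations; each ends in a different local minimum.
starts = [0.5, 3.0, 5.5]
candidates = [local_minimise(s) for s in starts]

# Keep the candidate with the lowest energy.
best = min(candidates, key=energy)
for s, c in zip(starts, candidates):
    print(f"start {s:.1f} -> minimum at {c:.3f}, energy {energy(c):.3f}")
print(f"selected: {best:.3f}")
```

Only the start near phi = 3.0 reaches the deepest minimum, which shows both why several starting points are needed and why the result is only as good as the set of starting conformations.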

It should be noted that it is by no means certain that it is the conformation lowest in energy that is active in a biological system. This is a drawback of using the geometry dependent 3D descriptors as is done in this work. There are some methods that take the possibility of different active conformations into account, e.g. the CoRePa method discussed below. The conformation problems do not apply to 0D, 1D and 2D descriptors since these are conformation independent.

Other possible geometry optimisation methods include semi-empirical (e.g. AM1) geometry optimisation and quantum mechanical methods, but these require substantially more computing power and are for this reason less suited for large molecules. In addition, they require more advanced and expensive software that may not be as widely available.

Randic molecular profiles characterise molecular shape in the form of a shape profile (a series of numbers) [Randic 1995, Randic, Razinger 1995].


Radial distribution functions (RDF) contain information about the interatomic distances in a molecule, unweighted or weighted by different atomic properties such as atomic mass, electronegativity, van der Waals volume and atomic polarisability [Hemmer et al. 1999].
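A minimal sketch of an RDF-type descriptor is given below. It assumes the commonly used Gaussian-smoothed form, g(r) = Σ w_i w_j exp(−B (r − r_ij)²) summed over atom pairs; this functional form, the smoothing parameter B and the sampling radii are assumptions for illustration and are not taken from this report.

```python
import math
from itertools import combinations

def rdf_descriptor(coords, weights, radii, smoothing=100.0):
    """Evaluate a Gaussian-smoothed radial distribution function at each radius.

    coords  : list of (x, y, z) atomic coordinates
    weights : one atomic property value per atom (use 1.0 for unweighted)
    radii   : distances at which the function is sampled
    """
    pairs = [(math.dist(coords[i], coords[j]), weights[i] * weights[j])
             for i, j in combinations(range(len(coords)), 2)]
    return [sum(w * math.exp(-smoothing * (r - d) ** 2) for d, w in pairs)
            for r in radii]

# Hypothetical three-atom geometry (coordinates in angstrom), unweighted.
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.2, 0.0)]
weights = [1.0, 1.0, 1.0]
radii = [0.5 + 0.1 * k for k in range(30)]
profile = rdf_descriptor(coords, weights, radii)
print([round(value, 3) for value in profile])
```

Sampling the function at a fixed set of radii gives a vector whose length does not depend on the number of atoms, in line with the general descriptor requirements stated earlier.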

3D-MoRSE descriptors reflect the three-dimensional distribution of different properties in the molecule. The transformation is derived from calculations used when determining molecular structure from electron diffraction measurements. The descriptors are obtained by summing products of atomic properties (mass, electronegativity, polarisability) weighted by different angular scattering functions and have been shown to preserve information about e.g. branching [Schur et al. 1996].

WHIM (weighted holistic invariant molecular) descriptors are based on principal component analysis of atomic co-ordinates with different weighting schemes. Weighting by atomic mass, electronegativity, atomic polarisability, van der Waals volume and electrotopological state, as well as unweighted analysis, gives a total of 99 descriptors. The descriptors are of two types: directional (shape related) and non-directional (size related) [Livingstone 2000].

GETAWAY (Geometry Topology and Atom Weights Assembly) descriptors are calculated from a leverage matrix based on atomic co-ordinates called the molecular influence matrix. They are weighted by different atomic properties such as atomic mass, electronegativity, van der Waals volume and atomic polarisability.

Various 3D geometrical descriptors are based on molecular geometry, e.g. sums of interatomic geometrical distances.

Quantum mechanical/semi-empirical descriptors. As discussed above, geometry optimisation can be performed by quantum mechanical or semi-empirical methods. When such methods are applied to a molecule, a description of the molecule is obtained that potentially contains large amounts of information about the properties of the molecule. A large number of descriptors can be extracted, e.g. energies of molecular orbitals (HOMO and LUMO), molecular polarisability, charge distribution, heat of formation, ionisation potential etc. We have not used quantum mechanical/semi-empirical descriptors to obtain the results presented below.

EVA (Eigenvalue) descriptors are vectors based on eigenvalues corresponding to a molecule's vibrational modes.


3.1.2.5 Other calculated descriptors

Estimated physical properties. As noted above, QSAR estimates of physical properties, most commonly the octanol/water partition coefficient log KOW, are used as descriptors for further QSAR modelling.

3.1.2.6 CoMFA and GRID

There exist specific QSAR descriptors that are based on a more physical model or understanding of the molecular interactions behind the biological response measured. Two methods that are closely related and based on superposition and alignment of molecular structures are Comparative molecular field analysis (CoMFA) and GRID [Livingstone 2000]. Both involve the use of a molecular probe and calculation of the interaction between the probe and the molecule that is being analysed. Interactions are measured at a (usually large) number of points in space defined by a grid placed around the molecular structure. PLS (see below) is usually used as the regression method in CoMFA.

CoMFA and GRID require that molecules be aligned relative to some common reference, e.g. the centre of mass. Aligning molecules with a similar structure is usually not that difficult, but a more diverse data set poses problems for all methods requiring alignment [Buydens et al. 1999]. CoMFA and GRID descriptors have not been used in the present work.

3.1.3 Software

A survey of available software for calculation of molecular descriptors was performed during the first stages of the project. The survey showed a variety of software packages of which most are strongly focused on drug discovery and drug design. Examples of software packages that compute molecular descriptors are Tsar, Dragon, AMPAC, MolconnZ and MOPAC. The software packages are more or less advanced; some of them only allow descriptor calculation but a few of them are also capable of QSAR modelling. An important aspect in the choice of a tool for calculation of descriptors is the licence fee for the software. Almost all of the software packages are licensed for a substantial fee, which means that they are not generally available to potential users of the QSAR models. This would limit the possible use of the models.

Thus, the criteria for our choice of software were calculation of a wide variety of relevant molecular descriptors at a reasonable price on a computer running Windows. Evaluation of these criteria led to the choice of the Dragon software. Dragon is a free software package developed by the Chemometrics and QSAR research group at Milan University, Italy. Dragon can be used to calculate a large number (1481) of molecular descriptors from molecular structures saved in several different file formats, e.g. the standard format .mol and HyperChem .hin. The descriptors calculated by Dragon are discussed in Appendix A.

3.2 Modelling methods

This section describes some modelling methods that can be used to relate the chemical structure to environmental properties. The emphasis is on multivariate regression methods based on latent variables, since it is one of these methods, partial least squares (PLS), that has been used to obtain the results presented in this report. Other methods are discussed but less in-depth. Molecular docking algorithms are not considered at all.

Clustering of substances prior to regression modelling is often beneficial, as reported by several authors, see e.g. [Suzuki et al. 2001]. Classification of substances prior to modelling has not been performed in this study, since the aim was to obtain models covering a broad range of chemicals in order to facilitate forecasting of environmental properties of large sets of new chemicals. This means that the prediction errors for the environmental properties predicted by the models are probably larger than they would be if clustering were used prior to regression modelling. On the other hand, the models are more generally applicable, which is considered to be of greater importance.

3.2.1 Linear regression

The simplest forms of QSAR models are simple univariate linear regression models of the form

$$\text{response} = k_0 + k_1 \times \text{descriptor}$$

These very simple models are of limited use since such a simple relationship is usually inadequate. An extension of this equation is

$$\text{response} = k_0 + \sum_{i=1}^{p} k_i \times \text{descriptor}^{\,i}$$

p is usually chosen as p = 2 or p = 3. The extension allows non-linear relations between the response and the single descriptor. However, a single descriptor is usually not sufficient to capture the behaviour of a substance, although successful applications have been reported for narrow groups of substances, usually with log KOW as the descriptor.


Multiple linear regression (MLR) can be used to model the dependence of the response on several descriptors according to the equation

$$\text{response} = k_0 + \sum_{i=1}^{p} k_i \times x_i$$

x_i is the i-th descriptor. The number of descriptors, p, can vary widely from p = 2 to relatively large numbers. However, if many descriptors are used that contain similar information, i.e. are co-linear, problems with so-called variance inflation occur: the models become very sensitive to small variations in the descriptors and their predictive performance becomes poor. To solve this problem, different variable selection algorithms can be used to select a small set of variables with high information content. Another approach is to use multivariate projection methods, described in the next section, that handle, and even utilise, the co-linearity in the descriptor set.
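Variance inflation is easy to demonstrate numerically. The sketch below uses synthetic data (not data from this report) and ordinary least squares from NumPy: the same model is refitted many times with fresh measurement noise, once with two independent descriptors and once with two nearly identical ones, and the spread of the fitted coefficients is compared.

```python
import numpy as np

def coefficient_spread(X, true_coefs, noise_sd, n_repeats, rng):
    """Standard deviation of fitted MLR coefficients over repeated noisy responses."""
    design = np.column_stack([np.ones(len(X)), X])  # first column gives k_0
    fitted = []
    for _ in range(n_repeats):
        y = design @ true_coefs + rng.normal(0.0, noise_sd, size=len(X))
        k, *_ = np.linalg.lstsq(design, y, rcond=None)
        fitted.append(k)
    return np.std(fitted, axis=0)

rng = np.random.default_rng(1)
x1 = rng.normal(size=50)
independent = np.column_stack([x1, rng.normal(size=50)])
collinear = np.column_stack([x1, x1 + 0.01 * rng.normal(size=50)])
true_coefs = np.array([1.0, 1.0, 1.0])  # k_0, k_1, k_2 (synthetic)

spread_independent = coefficient_spread(independent, true_coefs, 0.1, 200, rng)
spread_collinear = coefficient_spread(collinear, true_coefs, 0.1, 200, rng)
print("independent descriptors:", spread_independent.round(3))
print("co-linear descriptors:  ", spread_collinear.round(3))
```

With co-linear descriptors the individual coefficients vary far more between repetitions, even though the noise level is unchanged, which is the sensitivity described above.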

3.2.2 Multivariate projection methods

Typical examples of multivariate projection methods are principal component analysis (PCA) and partial least squares (PLS). Sometimes this type of methods is denoted multivariate data analysis (MVA) methods, which is a rather non-descriptive name but nevertheless adopted here due to convention. More informative names are multivariate projection methods or latent variable methods.

The fundamental MVA method is PCA. Only a very brief description of PCA is given here. More detailed introductory descriptions can be found in [Wold et al. 1987], [Martens, Naes 1989] and [Esbensen et al. 1996]. PCA decomposes a data matrix X (a table; in the current context the rows correspond to the substances while the columns correspond to descriptors) according to:

$$X = T P^{T} + E$$

PCA can be considered a co-ordinate transformation from the original variable space to a model hyper-plane of much lower dimensionality that captures the variance in the data in the most efficient way. The scores, denoted t or T, are the co-ordinates in the new co-ordinate system and thus describe the objects (here: chemical substances). The loadings, denoted p or P, describe the relation between the latent variables (principal components) that span the model space and the original variables.

The matrix E in the equation above contains the residuals, i.e. the part of the data not captured by the model hyper-plane. Substances that do not conform to the "pattern" found among the other substances will be badly described by the model and thus have large residuals. This can be caused by corrupted data or by the substance in question being different from the others, which may indicate that a QSAR model based on the rest of the compounds will not be valid.
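The decomposition and the use of residuals can be sketched with NumPy. The example computes PCA via the singular value decomposition of the mean-centred data and uses synthetic descriptors generated from two latent variables; the data, the function names and the choice of two components are illustrative assumptions.

```python
import numpy as np

def pca_fit(X, n_components):
    """Return mean, scores T, loadings P and residuals E with X - mean = T P^T + E."""
    mean = X.mean(axis=0)
    Xc = X - mean
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :n_components] * s[:n_components]
    P = Vt[:n_components].T
    E = Xc - T @ P.T
    return mean, T, P, E

def residual_norm(x, mean, P):
    """Distance from a new object to the model hyper-plane."""
    xc = x - mean
    return float(np.linalg.norm(xc - P @ (P.T @ xc)))

# Synthetic data: 30 "substances", 5 descriptors driven by 2 latent variables.
rng = np.random.default_rng(0)
latent = rng.normal(size=(30, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.01 * rng.normal(size=(30, 5))
mean, T, P, E = pca_fit(X, n_components=2)

# One new object that follows the pattern and one pushed off the hyper-plane.
conforming = rng.normal(size=2) @ mixing
off_plane = np.ones(5) - P @ (P.T @ np.ones(5))
deviating = conforming + 3.0 * off_plane / np.linalg.norm(off_plane)

print("residual, conforming object:", round(residual_norm(conforming, mean, P), 4))
print("residual, deviating object: ", round(residual_norm(deviating, mean, P), 4))
```

The deviating object has a residual far larger than anything seen among the objects that follow the pattern, which is the signal used to question the validity of a model for that object.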

The substantial dimensionality reduction achieved by applying PCA to molecular descriptor data sets leads to enhanced interpretation abilities which facilitate classification and clustering of substances. This is utilised in a methodology known as statistical molecular design (SMD), see the separate discussion below.

PCA is not a regression method and cannot be used for finding quantitative relationships between descriptors and responses. The most common multivariate regression method is PLS.

3.2.2.1 Partial least squares

PLS is a latent variable based regression method described in several references [Martens, Naes 1989, Esbensen et al. 1996, Geladi, Kowalski 1986]. PLS has several benefits compared to ordinary multiple linear regression:

• Co-linearity is handled in a natural way and even utilised to find a robust estimate of the data structure. This means that variable selection methods are of less importance than in MLR.

• The latent variable approach means that outlier diagnostics can be obtained both for training and prediction substances.

The prediction outlier diagnostics have no counterpart in MLR or in non-linear regression methods such as artificial neural networks (ANN), discussed below, and are in our view the greatest advantage of latent variable regression methods. For a new sample it is possible to calculate a probability that the sample belongs to the sample population from which the model was estimated, and thus that the model is likely to yield a valid prediction. It should be noted that, as shown below, it is quite possible for a model to yield good predictions although the sample is classified as not belonging to the model. The opposite, that the sample is classified as belonging to the model and is poorly predicted, is uncommon. This is the behaviour required for risk assessment of substances, since a false prediction that is not detected may lead to a substance being erroneously classified as likely to be non-toxic, and thus to further testing of the substance being given low priority.
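To make the regression step concrete, a minimal sketch of PLS with a single response (PLS1) is given here. It assumes NumPy and follows the commonly described NIPALS formulation; it is not necessarily the implementation used in the software applied in this work, and the data, function names and number of components are hypothetical.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit PLS1 by NIPALS on mean-centred data; returns means and coefficients."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    weights, loadings, y_loadings = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        w = w / np.linalg.norm(w)        # weight vector
        t = Xk @ w                       # scores
        tt = t @ t
        p = Xk.T @ t / tt                # X loadings
        c = yk @ t / tt                  # y loading
        Xk = Xk - np.outer(t, p)         # deflate X
        yk = yk - c * t                  # deflate y
        weights.append(w)
        loadings.append(p)
        y_loadings.append(c)
    W = np.array(weights).T
    P = np.array(loadings).T
    q = np.array(y_loadings)
    b = W @ np.linalg.solve(P.T @ W, q)  # coefficients in the original X space
    return x_mean, y_mean, b

def pls1_predict(model, X):
    x_mean, y_mean, b = model
    return (X - x_mean) @ b + y_mean

# Synthetic, strongly co-linear descriptors (assumption): 8 columns, 2 latent variables.
rng = np.random.default_rng(4)
latent = rng.normal(size=(50, 2))
mixing = np.array([[1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.5, 0.5]])
X = latent @ mixing + 0.01 * rng.normal(size=(50, 8))
y = latent @ np.array([1.0, -2.0])

model = pls1_fit(X, y, n_components=2)
rmsee = float(np.sqrt(np.mean((pls1_predict(model, X) - y) ** 2)))
```

Although the eight descriptors are almost perfectly co-linear, two latent variables are enough to describe the response, and no variable selection is needed.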

3.2.2.2 Hierarchical modelling

During recent years, hierarchical multivariate modelling methods have undergone rapid development and several successful applications within process modelling have been published [Westerhuis et al. 1998, Qin et al. 2001, Westerhuis, Coenegracht, 1997]. A hierarchical model structure can be beneficial when several distinct and separate blocks of data are used for modelling. In process modelling this usually corresponds to process data from different process sections (reactors, coolers, distillation columns etc.) that influence product properties in different ways. A separate model (e.g. a PLS model) is built for each block. The scores calculated from each of these block models are then used as input to a top-level hierarchical model. It is beyond the scope of this report to go into any detail regarding the theory of hierarchical multivariate modelling; more details are given in Westerhuis et al. [1998]. The benefit of using a hierarchical model structure is that the complexity of the individual models is decreased, while the interaction between different blocks can still be modelled and information from different blocks can still be combined in the top-level model.

In QSAR modelling the different groups of descriptors reflect different aspects of the substance and can be treated as blocks in a hierarchical model structure. Interpretation of the top-level model gives insights into which descriptor groups contain most information about the biological response and how the information is combined.

3.2.3 Non-linear methods

Non-linear methods are not applied in the work presented in this report but several investigations presented in the literature indicate that they give superior performance to linear methods in some cases. The most common group of methods is artificial neural networks (ANN) that exist in a variety of different forms.

It should be noted that ANNs have some drawbacks that sometimes are neglected: the large number of parameters means that a large amount of training data is needed and that validation must be performed rigorously in order to avoid over-fitting that leads to poor model performance. Further, prediction diagnostics are not obtained from ANN models. One needs to ensure in some other way, independently from the ANN model, that the model is valid for the substance in question or, which is common practice, predict and pray.

3.2.4 Common Reactivity Pattern

The modelling methods discussed above are general empirical regression methods that can in principle be applied to any regression problem and that can be used for QSAR modelling when applied to molecular descriptors and molecular properties of substances.

Another approach is Common Reactivity Pattern (CoRePa) [Mekenyan et al. 1997], which accounts for conformer flexibility in the structures. A brief description of CoRePa is as follows. A set of chemicals that are most (or sometimes least) active, i.e. that exceed (fall short of) a threshold for the biological activity in question, is selected. Then, a set of parameters that are hypothesised to be potentially important for the biological activity is identified. These are evaluated for a distribution of conformers for each compound to give a distribution of each parameter per substance. All distributions for a certain parameter are superimposed and common regions are identified. The common regions identified (i.e. for different parameters) constitute the common reactivity pattern.

3.2.5 Model validation and model accuracy measures

It is important to be able to measure model performance for different reasons, including ranking of models and estimating the reliability of predictions, when the model is used on new substances. An accuracy measure is essential in order to be able to trust and use a model prediction.

The data used to estimate the model, the training set, cannot be used to reliably estimate model performance. Two validation methods are commonly used:

• Cross-validation. In cross-validation the model is estimated a number of times. In each round, a part of the training substances are kept out. The toxicities of these substances are then predicted by the model and compared to the known (reference) values. The procedure is repeated until all samples have been kept out exactly once and cross-validation prediction errors have been obtained for all substances.

• Test set validation. Test set validation is used when there are enough data available to exclude some of it, called the test set, from the model estimation and use it solely for validation. The model is estimated from the remaining data, the training set.

Test set validation is the most reliable method to estimate the true model performance since, if the test set is adequately selected, it corresponds exactly to future model use: substances that are completely unknown to the model are predicted. Cross-validation is a reasonable substitute if the amount of data is limited, but its reliability is lower; slightly over-optimistic results are usually obtained.

For multivariate modelling methods and some other modelling methods there is a further complication. Validation is usually used both for model complexity selection (e.g. the number of PLS components in PLS regression) and for estimation of model performance. Since the model complexity selection is usually based on a prediction error criterion, this can lead to so-called selection bias, which means that over-optimistic estimates of model performance are obtained. One way to deal with this problem, which has been used in this work, is to use cross-validation to select model complexity and test set validation to estimate model performance. This means that selection bias is avoided and that very reliable estimates of model performance can be obtained.
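The two-stage procedure (cross-validation for complexity selection, a separate test set for performance estimation) can be sketched as follows. This is an illustration only: it assumes NumPy, uses synthetic data, and uses principal component regression with k components as a simple stand-in for PLS so that the example stays short. All function names and settings (e.g. 7 cross-validation groups) are hypothetical.

```python
import numpy as np

def pcr_fit_predict(X_train, y_train, X_new, k):
    """Principal component regression with k components (stand-in for PLS)."""
    x_mean, y_mean = X_train.mean(axis=0), y_train.mean()
    _, _, Vt = np.linalg.svd(X_train - x_mean, full_matrices=False)
    P = Vt[:k].T
    T = (X_train - x_mean) @ P
    coef, *_ = np.linalg.lstsq(T, y_train - y_mean, rcond=None)
    return ((X_new - x_mean) @ P) @ coef + y_mean

def rmse(y, y_hat):
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def rmsecv(X, y, k, n_groups=7):
    """Cross-validation error; every substance is kept out exactly once."""
    indices = np.arange(len(y))
    y_cv = np.empty_like(y)
    for held_out in np.array_split(indices, n_groups):
        keep = np.setdiff1d(indices, held_out)
        y_cv[held_out] = pcr_fit_predict(X[keep], y[keep], X[held_out], k)
    return rmse(y, y_cv)

# Synthetic data (assumption): 80 substances, 10 descriptors, 3 latent variables.
rng = np.random.default_rng(2)
latent = rng.normal(size=(80, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(80, 10))
y = latent @ np.array([1.0, -0.5, 0.8]) + 0.1 * rng.normal(size=80)

# Half of the substances form a test set that is never used for model selection.
order = rng.permutation(80)
train, test = order[:40], order[40:]

cv_errors = {k: rmsecv(X[train], y[train], k) for k in range(1, 9)}
best_k = min(cv_errors, key=cv_errors.get)       # complexity chosen by cross-validation
rmsep = rmse(y[test], pcr_fit_predict(X[train], y[train], X[test], best_k))
```

Because the test substances play no part in choosing `best_k`, the resulting `rmsep` is not affected by selection bias.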

Model performance can be measured by different metrics:

• R2 (or R2Y) is the part of the variance explained in the training data, i.e. without validation. Thus, it does not give information about model performance for new substances. If R2 is 1 the model explains the data perfectly; if R2 is zero the model explains nothing and is no better than using the mean of the training data as the prediction for every substance.

• Q2 is the validation counterpart to R2. It measures the part of the variance explained in the validation data. Q2 can be calculated both for cross-validation, in which case it is sometimes denoted Q2CV, and for test set validation.

• RMSEP (root mean square error of prediction) is a measure of the prediction error and has the same unit as the response predicted by the model. It is calculated similarly to a standard deviation and can be used roughly as a standard deviation of predictions. In the formula, y_i is the reference value and ŷ_i the predicted value for substance i, and n is the number of predicted substances.

$$\mathrm{RMSEP} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}}$$

• RMSECV (root mean square error of cross-validation) is the cross-validation version of RMSEP, i.e. corresponding to Q2CV.

• RMSEE (root mean square error of estimation) is the non-validated version of RMSEP, i.e. corresponding to R2.
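The metrics above can be computed with a few lines of code. The sketch below is an illustration with hypothetical reference and predicted values; the function names are assumptions. The same two functions give RMSEE and R2 when applied to training data, RMSECV and Q2CV when applied to cross-validation predictions, and RMSEP and Q2 when applied to a test set. Note that the total sum of squares is here taken around the mean of the values supplied, which is one of several conventions for test set Q2.

```python
import math

def rmse(y, y_hat):
    """Root mean square error: sqrt(sum((y_i - yhat_i)^2) / n)."""
    n = len(y)
    return math.sqrt(sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat)) / n)

def explained_variance(y, y_hat):
    """1 - SS_residual / SS_total (R2 or Q2 depending on the data used)."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def concentration_factor(error_log_units):
    """Factor in the concentration domain for an error given in log10 units."""
    return 10.0 ** error_log_units

y_ref = [0.5, 1.0, 2.0, 3.0]     # hypothetical reference pEC50 values
y_pred = [0.7, 0.9, 2.4, 2.8]    # hypothetical predicted pEC50 values

error = rmse(y_ref, y_pred)                  # 0.25 log units
q2 = explained_variance(y_ref, y_pred)       # about 0.93
factor = concentration_factor(0.53)          # about 3.4
```

Since the responses are logarithmic, an error of 0.53 log units corresponds to a factor of about 3.4 in the concentration domain.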

3.2.6 Outliers in QSAR models

An outlier in a QSAR model is a substance that is in some way different from the rest (majority) of the substances used to estimate the QSAR model and for which the model is not valid. The difference can be caused by different features in the chemical structure, which is closely related to the discussion above on classification of substances prior to modelling.

The common explanation of a model outlier is that it is badly predicted (has a large y residual), but this is a somewhat limited definition since a good prediction may be purely due to chance even though the substance class in question is not at all present in the training data. In multivariate statistics, it is common to define three types of outliers:

• X/Y outliers are outliers in the normal meaning, i.e. substances for which the relationship between the descriptors (X variables) and the environmental property (Y variable) is not valid, e.g. due to different toxicity mechanisms.

• X outliers. In short, a substance is an X outlier if the molecular descriptors for this substance do not conform to the "pattern" (covariance structure) in the (rest of the) training data. A different pattern in the descriptors indicates that the substance is different from the training data and thus that the prediction is likely to be inaccurate, i.e. a substance that is an X outlier is likely to be an X/Y outlier as well.

• Y outliers are only defined for training or test samples. They are substances for which the reference value of the response is bad for some reason.

It is important to note that outliers can be present both during training (model estimation) and model use (prediction). Naturally, since no Y value is normally available during prediction (this is why the model is used to estimate the property in question), Y outliers cannot be present and X/Y outliers cannot be detected directly.

However, if multivariate projection methods are used, X outliers can be detected during prediction from the X residuals of the projection (also known as the distance to model in X space). This is a significant advantage of multivariate projection methods, like PLS, that facilitates automatic detection of outliers during the use of a QSAR model. This possibility is a property of the PLS method and not of the descriptors used. Thus, the advantage is present regardless of the molecular descriptors used, although the success is of course dependent on the information content in the descriptors.
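The idea of detecting X outliers from the residuals of the projection can be sketched as follows. This is a simplified illustration and not the exact statistic computed by any particular software: it assumes NumPy, uses a PCA model of synthetic training descriptors, and compares the residual length of a new substance with the typical residual length in the training set. The cut-off value of 3 is a hypothetical, empirically chosen limit.

```python
import numpy as np

def residual_distance(X, mean, P):
    """Length of the part of each row that lies outside the model plane."""
    Xc = X - mean
    E = Xc - (Xc @ P) @ P.T
    return np.linalg.norm(E, axis=1)

def fit_projection(X_train, n_components):
    """PCA model of the training descriptors plus the typical residual length."""
    mean = X_train.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    P = Vt[:n_components].T
    typical = np.sqrt(np.mean(residual_distance(X_train, mean, P) ** 2))
    return mean, P, typical

# Synthetic training set (assumption): 40 substances, 5 descriptors, 2 latent variables.
rng = np.random.default_rng(3)
W = np.array([[1.0, 1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0, 1.0]])
X_train = rng.normal(size=(40, 2)) @ W + 0.05 * rng.normal(size=(40, 5))
mean, P, typical = fit_projection(X_train, n_components=2)

# Two new substances: one follows the training pattern, one does not.
conforming = np.array([0.5, -0.3]) @ W
deviating = conforming + np.array([1.0, -1.0, 0.0, 1.0, -1.0])
X_new = np.vstack([conforming, deviating])

relative_distance = residual_distance(X_new, mean, P) / typical
LIMIT = 3.0                                   # hypothetical empirical cut-off
is_x_outlier = relative_distance > LIMIT
```

No response value is needed for the new substances; the classification relies on the descriptors alone.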

Lipnick [1991] discussed possible reasons for outliers (X/Y only) in QSAR models and related them to different mechanisms of action.

3.2.7 Statistical Molecular Design (SMD)

Statistical molecular design (SMD) is a method introduced by researchers in Umeå, Sweden [Eriksson, Johansson 1996, Andersson et al. 2000, Eriksson et al. 2000]. The purpose of SMD is to apply experimental design methodology in QSAR modelling. The goal of experimental design is to select a training set for modelling that contains maximal information given the number of experiments that can be performed. In QSAR, the experiments correspond to substances but their properties (molecular descriptors) cannot be designed since they are impossible to control independently in practically all cases.

SMD uses a large number of candidate structures for which the response (y) variable does not need to be measured or known. Molecular descriptors are calculated or

principal components that are combinations of the molecular properties are referred to as the principal properties (PP) of the data set, since they are the combinations that explain the variation among the molecules in an optimal way.

The design is then performed with respect to the principal properties by selecting a subset of substances that are most efficient in spanning the substance (or PCA model) space and thus are the best selection of training set for a QSAR model. The selection can be done manually from the score plots if the number of principal components (properties) is three to four or less. An algorithm based on D-optimality is necessary when a higher number of PCs are used. Such an algorithm can be used for low-dimensional models as well, but it is often sufficient to select samples manually.
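A selection based on D-optimality can be sketched with a simple greedy search. The example below is a hypothetical illustration and not the algorithm used in any specific design software: it assumes NumPy, treats the rows of a score matrix T as candidate substances, and repeatedly adds the candidate that gives the largest increase in the determinant of the information matrix for a model with an intercept and linear terms. The small ridge term only keeps the matrix invertible in the first steps.

```python
import numpy as np

def greedy_d_optimal(T, n_select, ridge=1e-6):
    """Greedily pick rows of T that maximise det(Z_s' Z_s), with Z = [1, T]."""
    n = T.shape[0]
    Z = np.column_stack([np.ones(n), T])
    M = ridge * np.eye(Z.shape[1])            # information matrix of the selected rows
    available = np.ones(n, dtype=bool)
    chosen = []
    for _ in range(n_select):
        # det(M + z z') = det(M) * (1 + z' M^-1 z), so maximise z' M^-1 z.
        gain = np.einsum("ij,jk,ik->i", Z, np.linalg.inv(M), Z)
        gain[~available] = -np.inf
        best = int(np.argmax(gain))
        chosen.append(best)
        available[best] = False
        M += np.outer(Z[best], Z[best])
    return chosen

# Hypothetical score plot: 25 candidates near the centre and 4 at the corners.
grid = np.linspace(-0.5, 0.5, 5)
inner = np.array([[a, b] for a in grid for b in grid])
corners = np.array([[2.0, 2.0], [2.0, -2.0], [-2.0, 2.0], [-2.0, -2.0]])
T = np.vstack([inner, corners])

selected = greedy_d_optimal(T, n_select=4)
```

For this constructed score plot the four selected candidates are the corner points, i.e. the substances that span the score space most efficiently, which is also what a manual selection from the plot would give.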

An excellent illustration of the usefulness of SMD and multivariate techniques for exploration of principal properties can be found in a recent publication [Giraud et al. 2000].

4 Methods

4.1 Toxicity data

4.1.1 Microtox toxicity

The toxicity of various substances to the marine bacterium Vibrio fischeri was taken from the literature [Kaiser, Palabrica 1991]. The EC50 (in mmoles/L) of the bacterial luminescence inhibition at 15 min exposure was selected as the toxic endpoint and transformed to the logarithm of the inverse of the millimolar concentration to yield pEC50 values.

The Microtox toxicity of ethylene diamine was also tested experimentally following essentially the procedure of the manufacturer's manual [Azur Environmental, Carlsbad, USA (www.azurenv.com), Svenson 1993]. Before testing, the solution of the toxicant was adjusted to pH 7.3 ± 0.05. The procedure involving a combined duplicate of tests was repeated three times to generate a log-normal average value of the EC50. Ethylene diamine obtained from Merck, freshly distilled prior to use, was a kind gift from Fredrik Rahm at the Department of Organic Chemistry, Royal Institute of Technology in Stockholm.

4.1.2 Alga toxicity

The unicellular green alga Pseudokirchneriella subcapitata was chosen as the organism for prediction of alga toxicity. The species, also known by its synonyms Selenastrum capricornutum and Raphidocelis subcapitata, is the most widely used freshwater organism for testing alga toxicity. The inhibition of growth rate was selected as the toxic endpoint [Nyholm, Källqvist 1989]. The 96 h EC50 values were collected for a set of substances from published sources [Alexander et al. 1988, Blaylock et al. 1985, Calamari et al. 1979, 1980, 1983, Draper, Brewer 1979, Eloranta 1982, Galassi, Vighi 1981, Galassi et al. 1988, IUCLID database 2000, Kuivasniemi et al. 1985, Macri, Sbardella 1984, Shigeoka et al. 1988]. Before use, the data were transformed to the logarithms of the inverse EC50 in mmoles/L. EC50 values for some substances were calculated by non-linear regression of the logarithmic rate from growth data given in the literature [Adams, Dobbs 1984].

4.1.3 Daphnia toxicity

The toxicity of various substances to Daphnia magna exposed for 48 h at specified conditions was used to derive a prognosis model for a crustacean. Toxicity data were collected from a published source [Devillers et al. 1987] and used as logarithms of the inverse in mmoles/L.

4.1.4 Fish toxicity

Data from two fish species were selected to model fish toxicity. The lethal toxicity to Leuciscus idus was taken from Juncke, Lüdemann [1978], i.e. data determined in one of the two laboratories reported in the literature source, and that to Lepomis macrochirus from Buccafusco et al. [1981]. The data represent LC50 at 96 h exposure, and the values were transformed to the logarithms of the inverse millimolar concentrations.

4.2 Descriptor calculation and QSAR modelling

The following procedure was used to obtain the results presented in the following section.

For each substance:

• The structure of the substance was obtained from Internet databases, e.g. ChemIDplus², or, if not available, drawn manually.

• The structure was imported into a molecular modelling software, HyperChem³, and the minimum energy conformation was determined by molecular mechanics with the MM+ force field. Different initial conformations were used in order to decrease the risk of finding local energy minima. The optimised structure was saved.

2 http://chem.sis.nlm.nih.gov/chemidplus/cmplxqry.html

For all substances belonging to a data set:

• From the saved structures, 1481 molecular descriptors were calculated for each substance by the Dragon software⁴ and the results were saved.

• The descriptor file was extended with the biological response variable.

• The data set was imported into the multivariate modelling software SIMCA P-10⁵ for modelling.

Auto-scaling (also known as unit variance scaling) has been used throughout this work, since the variables are on different scales and no a priori information about variable importance was available that could motivate other scaling schemes.
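Auto-scaling amounts to subtracting the mean of each descriptor and dividing by its standard deviation. The sketch below assumes NumPy and uses a small table of hypothetical descriptor values; the function names are assumptions. The scaling parameters are estimated from the training set only and then re-used for new substances.

```python
import numpy as np

def autoscale_fit(X_train):
    """Estimate the column means and standard deviations from the training set."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0, ddof=1)
    std[std == 0] = 1.0          # constant descriptors: avoid division by zero
    return mean, std

def autoscale_apply(X, mean, std):
    """Centre and scale X with parameters estimated from the training set."""
    return (X - mean) / std

# Hypothetical descriptor values on very different scales (two descriptors).
X_train = np.array([[46.1, -0.3],
                    [78.1, 2.1],
                    [92.1, 2.7],
                    [153.8, 2.8]])
X_new = np.array([[84.9, 1.3]])

mean, std = autoscale_fit(X_train)
X_train_scaled = autoscale_apply(X_train, mean, std)
X_new_scaled = autoscale_apply(X_new, mean, std)
```

After scaling, every descriptor in the training set has zero mean and unit variance, so that no descriptor dominates the model merely because of its numerical range.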

PCA was used to detect trends and groupings in the data. In the cases where SMD was used, samples were selected manually from the score plots as discussed in chapter 5 below for each model.

Regression was performed by the PLS method. All regression models were validated by both cross-validation and a separate test set. Cross-validation was used to select model complexity, and on some occasions to perform variable selection, while the test set was used solely for estimating prediction error and judging the ability of the model to detect outliers in the prediction stage. This procedure gives a reliable and objective estimate of model performance.

An unusually large proportion of the available data, usually about 50 %, has been used for model testing. This is motivated by the fact that the focus of the work is to develop methodology to obtain reliable QSAR models. The only way to estimate reliability is by the test set, and the estimate is better the more samples are used for this purpose.

5 Results

The research presented in this report was performed with multiple aims. The discussion in this chapter and chapter 6 is meant to reflect all of these.

• To develop accurate and useful QSAR models for toxicity of chemical substances.

• To develop and evaluate methodology for increasing the reliability of QSAR predictions.

• To investigate the information content and usefulness of different groups and types of descriptors.

4 http://www.disat.unimib.it/chm/Dragon.htm

5 Umetrics Inc., http://www.umetrics.com

Before viewing the modelling results it can be interesting to note the distribution and span of the toxicity values in the data sets used for modelling. These are shown in Figure 1 below and values are also given in Appendix B. It can be noted that the span for the fish species and for green alga toxicity was significantly shorter than the span for the other two.

[Figure 1: distribution plot; horizontal axis pEC50 (-4.5 to 6), vertical axis number of substances (0 to 30); series Daphnia, Microtox, Lepomis, Alga and Leuciscus]

Figure 1. Distribution of reference values used for QSAR modelling.

5.1 Microtox prognosis model

83 substances with reference values between pEC50=-3.3 and pEC50=3.8 were available for modelling, which means that the substances span 7 orders of magnitude in the concentration domain.

A preliminary evaluation of Microtox QSAR models showed that ethylene diamine appeared as a Y outlier; it had a very low predicted pEC50 (about –2 to –3) in all models while the reference value obtained from the literature was 0.47. Therefore a new test of the compound was conducted. The average EC50 at 15 min exposure of ethylene diamine was 17.9 g/L (limits of one standard deviation 17.6-18.2 g/L). This corresponds to pEC50 = -2.47, which agrees very well with the value predicted by the preliminary models but is considerably lower than the published value. The higher toxic value probably depended on insufficient pH control. Now, after testing, pH 7.3 was recorded, indicating that pH was maintained at a constant and optimal level throughout the 15 min

capacity changes pH, which in turn affects the toxic behaviour. Therefore a careful pH control is extremely important. Alkaline condition itself will inhibit the luminescence, and the amine will probably have a higher toxicity due to a higher proportion as the dissociated, uncharged species, as recently was shown in alga toxicity for ammonia [Källqvist, Svenson 2003].

The ability to detect erroneous literature values, which was confirmed by new experiments, shows the power of QSAR modelling. The new reference value Microtox pEC50 = -2.47 was used in all further modelling. A list of all Microtox toxicity values used for modelling and validation is shown in Appendix B.
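The conversion from the measured EC50 to the pEC50 scale can be checked with a short calculation. The sketch below is an illustration only; the function name is hypothetical, and the molar mass of ethylene diamine (about 60.10 g/mol for C2H8N2) is not stated in the report and is supplied here as an assumption.

```python
import math

def pec50_from_mass_conc(ec50_g_per_l, molar_mass_g_per_mol):
    """pEC50 = log10(1 / EC50) with EC50 expressed in mmol/L."""
    ec50_mmol_per_l = ec50_g_per_l / molar_mass_g_per_mol * 1000.0
    return -math.log10(ec50_mmol_per_l)

ETHYLENE_DIAMINE_MOLAR_MASS = 60.10   # g/mol, assumption (not given in the report)

pec50_new = pec50_from_mass_conc(17.9, ETHYLENE_DIAMINE_MOLAR_MASS)
pec50_literature = 0.47

# Difference between the literature value and the new value, in log10 units.
difference_log_units = pec50_literature - pec50_new
```

The calculation gives a pEC50 of about -2.47 for the new measurement, so the literature value of 0.47 differs from it by almost three orders of magnitude in concentration.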

5.1.1 Prediction outlier detection

A number of QSAR models for Microtox toxicity were developed based on different sets of descriptors. These were used to evaluate both the performance of the descriptors and different prediction outlier detection methods. Evaluation of criteria for detection of outliers during model application (i.e. prediction) is discussed in this section. The actual modelling results are discussed in the next section.

Outlier detection during prediction, i.e. to detect substances that do not fit in the model and thus have a high risk of being poorly predicted, is very important. Since the aim of the prognosis model is typically screening of new substances and prioritising further testing, it is serious if substances are classified as false negatives. On the other hand, to predict false positives is less serious, since this will be revealed by the testing performed as a result of the QSAR prediction. However, also such malpredictions decrease the efficiency of the screening and prioritisation and should, naturally, be avoided if possible.

Outlier detection during prediction aims at completely avoiding grossly erroneous predictions. If the substance in question risks being badly predicted, this should be detected and the prediction should be considered unreliable and not used. Other methods for screening and prioritisation should then be used, e.g. other QSAR models or toxicity testing. In outlier detection, it is inefficient but not serious if a reliable prediction is classified as unreliable, i.e. if a substance that is well predicted is classified as an outlier, since this will lead to prediction by other methods or toxicity testing of the substance. The opposite mistake, i.e. to classify a bad prediction as reliable, is serious. This should be kept in mind while reading the discussion in this section.

When PLS regression is used as the modelling method, as in this work, two measures can be considered when judging whether or not a new sample belongs to the model. The first is the distance to the model plane (also called residual magnitude) and the second is the distance between the model centre and the projection in the model plane. In the SIMCA software, the distance to the model plane of a prediction is known as DModXPS (Distance to Model in X space for the Prediction Set), while also considering the distance in the model plane leads to the statistic DModXPS+. From these distances and the corresponding distances in the training set, it is possible to calculate a probability that a (new) substance belongs to the model. These probabilities are known as PModXPS and PModXPS+, respectively, in the software.

In order to classify substances as falling within or outside the domain of the model, one must choose a significance level. Initial investigations with significance levels corresponding to 5 % or 1 % theoretical risk of erroneously classifying a valid prediction as an outlier showed that these levels gave a very large number of erroneous outlier indications. The reason is probably that the theoretical assumptions, e.g. normally distributed data, are not fulfilled. In such cases, it is common to use empirical significance levels in statistical tests.

Results from further investigations with both PModXPS and PModXPS+ at 0.5 % and 0.1 % risk levels are shown in Table 1 for 9 different PLS models based on different sets of descriptors. The RMSEE and RMSEP values in the second and third column are the root mean squared error of estimation for the training data and the root mean squared error of prediction for the full test set. In addition, the table shows the number of outlying substances in the test set according to each method and the RMSEP after these were removed from the test set.

RMSEE is expected to be significantly lower than RMSEP since it is calculated by predicting the same substances from which the model parameters were estimated. The RMSEP values are aggregated values for the whole prediction set, but it is clear from plots of predicted versus measured toxicities that the high RMSEP values for some models are caused by one or a few substances being poorly predicted by the model, i.e. they are outliers. This is visualised in Figure 2 and even more clearly in Figure 3.

No differences were encountered between the PModXPS and PModXPS+ methods as shown in the table. Therefore, this is not further discussed but the following discussion is devoted to the choice of significance level and the reliability of the outlier detection method.

Table 1. Outlier detection in the Microtox toxicity data set.

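| Model a | RMSEE | RMSEP | PModXPS+ 0.5 %: outliers | PModXPS+ 0.5 %: RMSEP b | PModXPS 0.5 %: outliers | PModXPS 0.5 %: RMSEP b | PModXPS+ 0.1 %: outliers | PModXPS+ 0.1 %: RMSEP b | PModXPS 0.1 %: outliers | PModXPS 0.1 %: RMSEP b |
|---|---|---|---|---|---|---|---|---|---|---|
| PLS2 | 0.34 | 0.64 | 7 | 0.53 | 7 | 0.53 | 7 | 0.53 | 7 | 0.53 |
| PLS3 | 0.46 | 0.73 | 7 | 0.65 | 7 | 0.65 | 5 | 0.64 | 5 | 0.64 |
| PLS4 | 0.56 | 0.77 | 7 | 0.66 | 7 | 0.66 | 6 | 0.70 | 6 | 0.70 |
| PLS5 | 0.42 | 1.06 | 11 | 0.62 | 11 | 0.62 | 7 | 0.58 | 7 | 0.58 |
| PLS6 | 0.65 | 3.72 | 5 | 0.82 | 5 | 0.82 | 4 | 0.82 | 4 | 0.82 |
| PLS7 | 0.52 | 0.98 | 7 | 0.66 | 7 | 0.66 | 7 | 0.66 | 7 | 0.66 |
| PLS8 | 1.03 | 0.95 | 9 | 0.89 | 9 | 0.89 | 8 | 0.90 | 8 | 0.90 |
| PLS9 | 0.50 | 1.47 | 6 | 0.74 | 6 | 0.74 | 5 | 0.74 | 5 | 0.74 |
| PLS10 | 0.38 | 0.86 | 3 | 0.54 | 3 | 0.54 | 2 | 0.55 | 2 | 0.55 |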

a For explanation of model notation see 5.1.2 below.

b RMSEP for the test set after removal of the outliers indicated by this method.

In Figures 2 and 3 the substances have different symbols according to their probability of belonging to the model according to PModXPS+. Substances marked with squares are outliers at both the 0.1 % and 0.5 % levels, while the substances marked with diamonds in Figure 3 are outliers only at the 0.5 % level.

It is clear from Figure 2 that at least p,p-DDT (ppDDT) and carbon tetrachloride (ccl4) are poorly predicted by the model PLS2. However, 5 more substances are classified as outliers although they are quite well predicted: tetrachloroethene (teke), nitrilotriacetic acid (nta), methanol (meoh), dichloromethane (dkm), and dioxane (dioxan). There are two possible reasons:

1. They are outliers in the model that are reasonably well predicted 'by chance'. This should be the case for at least methanol and nitrilotriacetic acid. Methanol is an extreme sample with a toxicity value lower than any compound in the training set. Thus the model is extrapolated, which is uncertain, and the outlier classification is correct although the prediction happens to be correct in this case. Nitrilotriacetic acid, with its large number of polar bonds in such a small molecule, is different from the substances in the training set.

2. They are falsely classified as outliers although they are similar to the compounds in the training set. This can be the case for dioxane that is not structurally dissimilar to substances in the training set.

For dichloromethane and tetrachloroethene it is questionable if they are outliers or not. There were relatively few smaller chlorinated compounds in the training data (chloroform, trichloroethene).

[Figure 2: scatter plot; both axes run from -3 to 3; points are labelled with substance abbreviations]

Figure 2. Measured versus predicted Microtox pEC50 values for the PLS2 model based on all descriptors. Substances marked with squares have a probability of less than 0.1 % of belonging to the model according to PModXPS+.

For the model PLS5 visualised in Figure 3 similar results were obtained at the 0.1 % level (blue substances):

• p,p-DDT, methanol, CCl4 and tetrachloroethene were inaccurately predicted and this is detected by the outlier detection method.

• Diethylether, nitrilotriacetic acid and 1,3,5-trichlorobenzene were correctly predicted but nevertheless classified as outliers. For nitrilotriacetic acid, this should be considered a coincidence as discussed for the PLS2 model above. For the other two substances the classification is more questionable and they are probably falsely detected as outliers. Nevertheless, the RMSEP of the test set is decreased from 1.06 to 0.58 when the substances indicated as outliers were removed, which according to our experience was in reasonable relation to the RMSEE of 0.42.

The substances classified as outliers only at the 0.5 % level in model PLS5 (diamonds in Figure 3), dichloromethane, 4-chlorophenol, 3,5-dichlorophenol and 1,2,3-trichlorobenzene, were all predicted correctly by the model. At least the three aromatic compounds should not be outliers considering their structural similarity to the training set. Similar results were obtained for other models (not shown). It can be noted in Table 1 that although more outliers were detected at the 0.5 % level for several models, the RMSEP of the test set has not decreased significantly.

[Figure 3: scatter plot; both axes run from -4 to 5; points are labelled with substance abbreviations]

Figure 3. Measured versus predicted Microtox pEC50 values for the PLS5 model based on 2D autocorrelation descriptors. Substances marked with squares have a probability of less than 0.1 % of belonging to the model according to PModXPS+ and are classified as outliers. Substances marked with diamonds have probabilities between 0.1 % and 0.5 %.

From inspection of predicted versus measured plots for all the other models investigated, it was observed that not a single clear outlier was missed at the 0.1 % level. Results similar to those discussed above were obtained when evaluating the outlier detection method for the models based on systematic selection of the training set, discussed in 5.1.3 (not shown).

To summarise, it can be concluded that outlier detection at the 0.1 % level is sufficiently safe and more efficient than at the 0.5 % level. In the rest of the report, outlier detection at the 0.1 % level with the PModXPS+ statistic has been used.

5.1.2 Random training set selection

A number of QSAR models for Microtox toxicity were developed based on different sets of descriptors. The complete data set collected consisted of 83 substances, see Appendix B. Initial modelling showed that 1-pentadecanol, which was the longest carbon chain in the data set, was difficult to fit into a model with the rest of the substances and, hence, it was removed. The remaining 82 substances were split non-systematically into a training set and a test set each comprising 41 substances. To leave 50 % of the substances in a test set is unusual but was motivated by the intention to develop methodology for reliable and robust QSAR predictions. The only way to objectively test the reliability of the models is to use a test set and the larger the test set the better the estimate of the degree of reliability. Information about the models developed is shown in Table 2. No outliers were removed from the training set for any of the models.
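A non-systematic split of this kind can be sketched with the Python standard library. The example is hypothetical: the report does not state how the random split was generated, and the substance names and the seed below are placeholders.

```python
import random

def split_half(substances, seed=0):
    """Shuffle the substances and split them into two halves of equal size."""
    shuffled = list(substances)
    random.Random(seed).shuffle(shuffled)     # fixed seed makes the split reproducible
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Placeholder names for the 82 substances remaining after removal of 1-pentadecanol.
substances = [f"substance_{i:02d}" for i in range(1, 83)]
training_set, test_set = split_half(substances)
```

Fixing the seed makes it possible to reproduce the same training and test sets when models based on different descriptor sets are compared.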
