
In silico modeling for uncertain biochemical data

Daniel Gusenleitner

Master's dissertation
University of Skövde

3 June 2009


In silico modeling for uncertain data

Daniel Gusenleitner

Submitted by Daniel Gusenleitner to the University of Skövde as dissertation towards the degree of Master by examination and dissertation in the School of Life Sciences.

3 June 2009

I certify that all material in this thesis which is not my own work has been identified and that no material is included for which a degree has previously been conferred on me.

Daniel Gusenleitner


Abstract

Analyzing and modeling data is a well-established research area, and a vast variety of different methods have been developed over the last decades. Most of these methods assume fixed positions of data points; only recently has uncertainty in data caught attention as a potentially useful source of information. In order to provide a deeper insight into this subject, this thesis concerns itself with the following essential question: Can information on the uncertainty of feature values be exploited to improve in silico modeling?

For this reason a state-of-the-art random forest algorithm was developed using Matlab®. In addition, three techniques for handling uncertain numeric features are presented and incorporated into different modified versions of random forests. To test the hypothesis, six real-world data sets were provided by AstraZeneca. The data describe biochemical features of chemical compounds, including the results of an Ames test, a widely used technique to determine the mutagenicity of chemical substances. Each of the datasets contains a single uncertain numeric feature, represented as an expected value and an error estimate. The modified algorithms are then applied to the six data sets in order to obtain classifiers able to predict the outcome of an Ames test. The hypothesis is tested using a paired t-test, and the results reveal that information on uncertainty can indeed improve the performance of in silico models.


Contents

List of Figures
List of Tables

1 Introduction
1.1 Uncertainty in Data
1.2 Ames Test for Mutagenicity

2 Motivation
2.1 Problem Description
2.2 Related Work
2.3 Objectives

3 Material and Methods
3.1 Datasets
3.2 Software
3.2.1 Matlab®
3.2.2 Rule Discovery System™
3.3 Decision Trees
3.4 Handling of Numerical Values
3.5 Handling of Missing Values
3.6 Random Forests
3.7 Handling of Uncertain Numerical Data
3.7.1 Uniformly Distributed Uncertainty
3.7.2 Normally Distributed Uncertainty
3.8 Evaluation
3.8.1 Cross-Validation
3.8.2 Statistical Characteristics
3.9 Statistical Inference – Paired t-Test

4 Results
4.1 Graphical Analysis of Uncertain Data
4.2 Variable Importance using Rule Discovery System™
4.3 Test Results
4.3.1 Uniformly Distributed Uncertainty using Split Criteria
4.3.2 Uniformly Distributed Uncertainty using Random Split Value
4.3.3 Normally Distributed Uncertainty
4.3.4 The CAS dataset

5 Discussion

Bibliography


List of Figures

1.1 Two probability density functions for uncertain attributes
1.2 Ames spot test on a petri dish
2.1 Problem description on a nearest neighbor classifier
3.1 Data representation of a binary tree
3.2 Two methods of finding a split criterion
3.3 Method of handling missing values
3.4 The three cases of partitioning uncertain uniformly distributed instances
3.5 The three cases of partitioning uncertain normally distributed instances
3.6 Confusion matrix
3.7 Receiver operating characteristic (ROC)
4.1 ACE error bar plot
4.2 ACE error plot
4.3 ACHE error bar plot
4.4 ACHE error plot
4.5 BZR error bar plot
4.6 BZR error plot
4.7 DHFR error bar plot
4.8 DHFR error plot
4.9 COX2 error bar plot
4.10 COX2 error plot
4.11 CAS error bar plot
4.12 CAS error plot
4.13 ROC curves of the DHFR dataset using randomizer 1
4.14 ROC curves of the DHFR dataset using randomizer 0.5
4.15 ROC curves of the COX dataset using randomizer 1
4.16 ROC curves of the DHFR dataset using randomizer 0.5
4.17 ROC curves of the BZR dataset using randomizer 2
4.18 ROC curves of the ACHE dataset using randomizer 0.5


List of Tables

1.1 Fictive cancer treatment dataset with categorical attributes
1.2 Fictive cancer treatment dataset with numerical attributes
3.1 The six datasets with the corresponding number of samples
4.1 Variable importance
4.2 Student's t-table for 9 degrees of freedom
4.3 Accuracy of trained models with median as split criterion, assuming uniformly distributed uncertainty that is partitioned by the split criterion
4.4 Detailed results of the DHFR dataset using randomizer 1
4.5 Accuracy of trained models with the split criterion producing maximum information gain, assuming uniformly distributed uncertainty that is partitioned by the split criterion
4.6 Detailed results of the DHFR dataset using randomizer 0.5
4.7 Accuracy of trained models with median as split criterion, assuming uniformly distributed uncertainty that is partitioned by a random value within the uncertainty interval
4.8 Detailed results of the COX dataset using randomizer 1
4.9 Accuracy of trained models with the split criterion producing maximum information gain, assuming uniformly distributed uncertainty that is partitioned by a random value within the uncertainty interval
4.10 Detailed results of the BZR dataset using randomizer 2
4.11 Accuracy of trained models with median as split criterion, assuming normally distributed uncertainty that is partitioned by the split criterion
4.12 Detailed results of the BZR dataset using randomizer 2
4.13 Accuracy of trained models with the split criterion producing maximum information gain, assuming normally distributed uncertainty that is partitioned by the split criterion
4.14 Detailed results of the ACHE dataset using randomizer 2
4.15 Accuracy of trained models on the CAS data set, with median as split criterion


Chapter 1

Introduction

1.1 Uncertainty in Data

Data miners, including bioinformaticians, all too often assume that given data points have fixed positions, without any variation, but this is usually not the case. Consider, for instance, a digital scale: the actual value is mostly not the displayed value but lies between two boundaries spanned around the measured value. Usually only the measured weight would be used in a learning algorithm, without any knowledge of the variance.

There are many ways in which uncertainty can be introduced into the data collection process. The scale example illustrates a general problem of measuring instruments: an accuracy of 100% is often not attainable, but the uncertainty of such features can often be estimated by prior experiments. Many datasets are derived from raw data using statistical or other data mining methods, and such methods often provide an error estimate. Sometimes missing values are estimated by imputation, which also introduces uncertainty; depending on the imputation process, the statistical error can likewise be estimated. These are only a few examples of how imprecision can enter a dataset; Aggarwal et al. (2007) offer a more extensive insight into this issue. Although the error is often very small relative to the measured value and can be neglected, this is not always the case, and uncertainty can carry essential information.

Uncertainty can occur in categorical as well as in numerical attributes. The uncertainty in categorical attributes is usually expressed as a probability for each possible categorical value. The standard approach to deal with such data is to transform them into certain data by using the value with the highest probability. The problem with this approach is that information gets lost during the transformation, which could be crucial for the success of a subsequently modeled classifier. Table 1.1 shows an example of such data.

Numerical attributes can also possess uncertainties. In this case the attribute does not have only a single value but also an associated range and a probability density function over this range; in this way the uncertain attribute can be treated as a continuous random variable (Qin et al., 2009). In the simplest case the probability density function is assumed to be uniformly distributed, i.e., the probability of the actual value being located at a certain position is the same at every point in the defined interval. A more sophisticated and often more realistic assumption is that the probability density function is normally distributed. In general, however, the probability density can be any arbitrary continuous function, as shown in figure 1.1. In an ideal situation one has prior knowledge of the distribution within the range, which can then be exploited to obtain possibly more accurate results. An example of a numerical dataset including an uncertain value can be found in table 1.2.

Table 1.1: Fictive cancer treatment dataset with categorical attributes. The cancer type prediction is an uncertain attribute showing the outcome of a previously used classifier. There are three cancer types (A–C) with the corresponding probabilities in %. The treatment column indicates if the investigated treatment plan actually worked.

Id  Sex  Smoker  Age<60  Cancer Type Prediction    Treatment X?
1   f    no      no      (A:62% – B:29% – C:9%)    yes
2   m    yes     yes     (A:13% – B:11% – C:76%)   no
3   m    no      yes     (A:22% – B:67% – C:1%)    yes
4   f    no      no      (A:47% – B:9% – C:44%)    no
5   m    yes     no      (A:8% – B:12% – C:80%)    no
6   f    yes     yes     (A:73% – B:12% – C:15%)   yes
7   m    no      yes     (A:30% – B:42% – C:28%)   yes
8   f    no      yes     (A:7% – B:82% – C:11%)    no
9   f    no      yes     (A:51% – B:27% – C:22%)   yes

Figure 1.1: Two probability density functions (PDFs) for uncertain attributes. On the left a uniformly distributed PDF, on the right a normally distributed PDF.
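The two PDF assumptions above can be made concrete with a small sketch, written in Python rather than the thesis's Matlab. An uncertain attribute is represented by its interval (or mean and standard deviation), and the probability that the true value lies below some threshold follows from the corresponding cumulative distribution function. The interval bounds, thresholds, and normal parameters below are illustrative, not taken from the thesis data.

```python
import math

def p_below_uniform(lo, hi, t):
    """P(X <= t) when X is uniform on [lo, hi]."""
    if t <= lo:
        return 0.0
    if t >= hi:
        return 1.0
    return (t - lo) / (hi - lo)

def p_below_normal(mean, sd, t):
    """P(X <= t) when X is normal with the given mean and standard deviation."""
    return 0.5 * (1.0 + math.erf((t - mean) / (sd * math.sqrt(2.0))))

# A growth rate reported as the interval 1.1-1.8 (cf. row 1 of table 1.2),
# queried against an arbitrary threshold of 1.5:
print(p_below_uniform(1.1, 1.8, 1.5))    # fraction of the interval below 1.5
print(p_below_normal(1.45, 0.175, 1.5))  # same question under a normal assumption
```

Either function answers the question a decision tree has to ask of an uncertain attribute at a split point: how much of the probability mass falls on each side of the threshold.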

1.2 Ames Test for Mutagenicity

Table 1.2: Fictive cancer treatment dataset with numerical attributes. The estimated growth rate is an uncertain attribute. The treatment column indicates if the investigated treatment plan actually worked.

Id  Sys. BP  Dia. BP  Weight  Age  Est. Growth Rate  Treatment X?
1   130      70       55      59   1.1–1.8           yes
2   150      85       81      67   2.4–2.8           no
3   115      60       73      43   1.5–2.6           no
4   140      75       72      35   0.9–1.4           yes
5   175      95       97      72   2.1–2.7           no
6   110      65       62      69   1.6–2.1           yes
7   135      70       85      52   1.0–1.8           no
8   125      60       82      49   2.3–2.9           no
9   120      55       61      55   0.8–1.9           yes

Mutagenicity is a characteristic of chemicals or radiation that leads to changes in genotypes. These mutations can be categorized in several ways: changes of a single nucleotide into another are called single point mutations. Indels, insertions and deletions of a couple of bases, are likewise minor changes in a chromosome; but major mutations can also occur, such as the loss or duplication of larger parts of a chromosome or even whole chromosomes, as well as rearrangements of whole genomes. For more information on mutations, see Alberts et al. (2007). Changes in genotypes can have several side effects. Eukaryotes possess large non-coding regions in the DNA; mutations occurring in such a region show no effect.

This is also the case for silent mutations, where a nucleotide changes but the correct amino acid is still translated. Nonsense and missense mutations, on the other hand, can lead to changes in the translated amino acid and therefore truncate or alter the coded protein. In a benign scenario, such an alteration has only minor effects on the cell, such as a change of pigments. In a more severe case, a mutation alters an essential part of the coding region and the affected cell loses its ability to sustain itself; this usually triggers programmed cell death and does not pose a long-term threat to the survival of the organism. In the worst case, an essential process is altered and remains undetected by the apoptosis mechanism. A mutation in the coding region of oncogenes or tumor suppressor genes, for instance, can lead to the development of cancer. For this very reason, it has become an essential task to identify the potential mutagenicity of materials in use (Mortelmans and Zeiger, 2000). The pharmaceutical industry in particular is very interested in such analyses, because they offer the opportunity to screen for carcinogenic chemicals and to assess possible long-term risks.

Ames (1971) introduced a method of screening for chemical mutagens using Salmonella histidine mutants. Salmonella was named after Dr. Daniel E. Salmon over a century ago. The two most common types are Salmonella Enteritidis and Salmonella Typhimurium. It is a gram-negative, rod-shaped bacterium known to cause diarrheal illness in humans (Giannella and Giannella, 1996). The method described by Ames evolved over the following decades; adaptations were introduced, for instance, by Ames et al. (1973), McCann et al. (1975), and Maron and Ames (1983), and it is now known as the Ames or Salmonella test.

This test uses Salmonella Typhimurium strains which carry a preexisting mutation and are therefore not able to synthesize histidine; consequently, this prevents them from forming colonies. Such Salmonella strains are exposed to different chemicals, which are thereby tested for mutagenicity. If a substance triggers mutations, there is the possibility of reversing the preexisting mutation, which leads to the production of histidine and subsequently to the formation of colonies. For this reason, the Ames test is also referred to as a reversion assay. The Salmonella test is very widely used because a positive result is a strong indicator of possible carcinogenicity of newly developed drugs and chemicals.

Mortelmans and Zeiger (2000) offers a more detailed description of the Ames test.

Figure 1.2: Ames spot test with strain TA100 and methyl methanesulfonate (10 ml) on a petri dish, from Mortelmans and Zeiger (2000)

The Salmonella Typhimurium strains used nowadays possess mutations in multiple genes and are engineered to be more sensitive towards mutations. This enables the Ames test to detect a large number of mutagens, even if they use different mechanisms for mutating (Mortelmans and Zeiger, 2000). In addition to the described procedure, rat liver homogenates are often added in order to test in a mammalian environment. Some chemical substances are actually not harmful until they are processed by the metabolism, whereupon they can become mutagenic; adding liver homogenates helps to simulate such metabolic breakdowns (Ames et al., 1973, Kier et al., 1986).


Chapter 2

Motivation

2.1 Problem Description

Analyzing and modeling data is a well-established research area, and a vast variety of methods have been developed over the last decades. Most of these methods, however, assume fixed positions of data points; only recently has uncertainty in data caught attention as a potentially useful source of information. Uncertainty in data has most of the time been ignored or even discarded, since it is quite simple to transform uncertain features into certain ones: for categorical attributes, the value with the highest probability is usually chosen as representative, while numerical values are reduced to representative values such as the mean or the median. The problem with this approach is that potentially useful information gets lost. An instance of a binary attribute with a 51% probability of one value and a 49% probability of the other is far from certain, and transforming it into 1 for the first value and 0 for the second does not accurately reflect reality. Variance in numerical values is very common and could often provide essential information that improves a classifier. An example is shown in figure 2.1, which outlines a binary classification problem in a two-dimensional feature space. Within this space three data points are located: one for each class, and an unknown sample (labeled X). The variance of the known data points is displayed by the oval shapes. A nearest neighbor classifier, as described in Dasarathy (1991), using Euclidean distance would calculate the distances between the known samples and the unknown data point. In this case the data point labeled A has a greater distance to the unknown sample than the second data point, and the unknown sample would subsequently be labeled B; but obviously the error of attribute 1 is greater than that of attribute 2. Considering the variance, one would more likely classify the unknown data point as A (Aggarwal et al., 2007). This shows that information on uncertainty could indeed be essential for improving current classifiers.


Figure 2.1: Problem description on a nearest neighbor classifier. Two samples A and B with known class in a two-dimensional uncertain feature space. A nearest neighbor algorithm that does not use the information on uncertainty would calculate the distances and classify the unknown sample as B, because d1 > d2 (Aggarwal et al., 2007).
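The effect in figure 2.1 can be reproduced with a small sketch (Python here, not the thesis's Matlab implementation). The coordinates and variances below are hypothetical, chosen so that plain Euclidean distance and a variance-scaled distance (a simplified, diagonal Mahalanobis distance) disagree about the nearest neighbor.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def variance_scaled(a, b, var):
    """Distance where each squared difference is divided by the reference
    point's variance in that dimension (a simplified Mahalanobis distance)."""
    return math.sqrt(sum((x - y) ** 2 / v for x, y, v in zip(a, b, var)))

# Hypothetical points echoing figure 2.1: A is farther away in plain Euclidean
# terms, but its large variance along attribute 1 explains away that distance.
A, var_A = (0.0, 0.0), (9.0, 1.0)   # large uncertainty along attribute 1
B, var_B = (3.0, 2.0), (0.5, 0.5)   # small uncertainty
X = (2.5, 0.0)                      # unknown sample

print(euclidean(A, X) > euclidean(B, X))   # True: plain 1-NN picks B
print(variance_scaled(A, X, var_A) <
      variance_scaled(B, X, var_B))        # True: variance-aware 1-NN picks A
```

The same numeric example would of course behave identically in any language; only the idea of normalizing distances by the per-point error matters here.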

2.2 Related Work

In the last few years, uncertainty in data has caught the focus of several researchers. Several of them, e.g., Ngai et al. (2006), Xia and Xi (2007), and Kriegel and Pfeifle (2005), are concerned with clustering uncertain data. Ngai et al. (2006) propose such an approach by modifying the standard K-means algorithm: expected distances between uncertain objects are calculated using a probability density function. Uncertain data have also been exploited in other ways. Chui et al. (2007) deal with mining frequent item sets from uncertain data under a probabilistic framework, and Aggarwal et al. (2007) suggest a method for handling error-prone and missing data using a density-based approach. More recently, Qin et al. (2009) introduced a decision tree algorithm which is able to handle uncertain categorical as well as uncertain numerical attributes. For categorical values, the probabilities of the single values are used instead of the discrete values. Numerical values are represented as probability density functions; the area under such a function can simply be distributed to the subbranches when splitting on an uncertain attribute. Currently, the supervisor of this thesis, Henrik Boström, is also working on a modified version of a random forest algorithm, and the results on the datasets have been compared.

2.3 Objectives

The aim of this thesis is to test the hypothesis that information on uncertainty in datasets can be exploited to improve in silico modeling. To acquire realistic results, the hypothesis is tested on a set of real-world datasets provided by AstraZeneca. These datasets contain samples with features of chemical compounds, including the corresponding Ames class. Predicting the mutagenicity of a substance is a relevant task in drug development, and the use of the Ames datasets in this thesis should underline the importance of in silico modeling for the research area of bioinformatics.

There are pre-built packages and tools (e.g., Frank et al. (2005)) available for training classifiers, but it was decided to re-implement a learning algorithm for two reasons. First, many assumptions and choices are usually made during the implementation of an algorithm for generating classifiers, and implementing it provides a deeper insight into these choices. Secondly, and more importantly, the incorporation of uncertainty information is a major change to the learning algorithm and is easier to perform when implementing it from scratch. To achieve novelty, the random forest algorithm first proposed by Breiman (2001) was chosen for this thesis. It is a well-performing, state-of-the-art classifier based on ensembles of decision trees and is straightforward to adapt to handle uncertain data. A similar approach has been proposed by Qin et al. (2009), where the decision tree algorithm was adapted to handle uncertain data.


Chapter 3

Material and Methods

3.1 Datasets

Table 3.1: The six datasets with the corresponding number of samples.

Name   Samples
ACE    114
ACHE   111
BZR    163
CAS    4337
COX2   321
DHFR   397

The datasets for the thesis were provided by the pharmaceutical company AstraZeneca and describe characteristics of chemical compounds. For the sake of confidentiality, exact information on the attributes is omitted in this thesis. Each of the six datasets contains 95 features and a binary Ames class attribute (−1, 1), which indicates the result of a previously conducted Ames test. The datasets are well suited to building a structure–toxicity classifier, which should predict the outcome of an Ames test using only the 95 features. The datasets contain varying numbers of samples, see Table 3.1. The CAS dataset in particular was inconvenient to handle due to its size, see section 3.4. The data contain missing values with no additional information on the reason, so it was assumed that these were due to problems in the data collection process; hence their existence holds no additional information that could be used for training. In a preprocessing step, the missing values were replaced by NaN (Not-a-Number). Each dataset also contains exactly one uncertain feature with an according error estimate, representing a 95% confidence interval. Initially there was no knowledge of the probability density function, hence a uniform distribution was assumed for the first tests; after further advice from the co-supervisor, a normal distribution was then assumed for the error distribution.
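Under the normal assumption, the reported error estimate can be converted into a standard deviation: if it is read as the half-width of a 95% confidence interval, dividing by the normal quantile z ≈ 1.96 recovers σ. A hedged Python sketch (the thesis code is in Matlab, and whether the estimate is a half-width or a full width is an assumption here; the example values are invented):

```python
def sigma_from_ci95(half_width):
    """Convert the half-width of a 95% confidence interval into the standard
    deviation of the assumed normal error distribution (z_0.975 ~ 1.95996)."""
    return half_width / 1.959963984540054

# A hypothetical uncertain feature reported as 4.2 +/- 0.5 (95% CI):
value, err = 4.2, 0.5
print(round(sigma_from_ci95(err), 4))  # standard deviation of the error
```

With σ in hand, the normally distributed PDF of each uncertain feature value is fully specified by the expected value and the error estimate.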


3.2 Software

3.2.1 Matlab®

The structure–toxicity classifier for uncertain data was implemented in Matlab®, a technical computing language and interactive environment for numeric computation, algorithm development and data visualization. Matlab is an abbreviation of "matrix laboratory", which points to its strength in the fast computation of matrix operations. The implemented algorithm takes advantage of this by using only arrays and matrices as data structures. Other advantages of Matlab are its built-in algebra capabilities and its simplicity in creating graphical output of data and functions. A large variety of toolboxes is also available for different research and application areas, but these were not exploited for this thesis. Matlab 2008a was used for the implementation.

3.2.2 Rule Discovery System™

RDS™ is a state-of-the-art data mining tool developed by the supervisor of this thesis. Amongst other algorithms, it uses the random forest algorithm to build models; hence it was used for the purpose of comparison. The implemented algorithm is not as sophisticated as the one in RDS™ and therefore has a higher variance in its output results, but it was still used to validate the implemented standard random forest without the handling of uncertain data. The performance of the implementation is slightly lower than that of RDS™, but since the aim is to investigate the usefulness of uncertainty in data, this issue is not relevant. RDS™ has a built-in variable importance analysis tool that was used to determine the importance of the uncertain variable when modeling without the error estimate.

3.3 Decision Trees

Decision trees are a well-known and widely used type of machine learning algorithm. The ID3 algorithm proposed by Quinlan (1986) and its successors C4.5, developed by Quinlan (1993, 1996), and C5.0 are powerful algorithms that have been applied to a vast number of datasets in various areas. Decision trees use a divide-and-conquer approach to build a classifier. An attribute is selected and placed at the root node of the tree. This node can subsequently be used to split the whole dataset according to the values of the attribute, i.e., for N values of the attribute there will be N separate subsets. This procedure is applied recursively to the subsets as long as the classifications of the contained samples are heterogeneous. The recursive process in a branch terminates when all samples possess the same classification; in this case a leaf node is inserted containing the class of the samples. The algorithm produces an unbalanced tree, which represents the model learned from the training data. In this basic version of the decision tree only categorical values can be used; numerical values would create a subset for every single sample, which would overfit the data and render the model useless for classifying unknown data (Witten and Eibe, 2001).

The decision trees used in the implementation are even more restrictive, in that they split the data into only two subsets at every node. Decision trees can handle binary classifications but are also perfectly suited to handling multiple classes: they split the data into subsets until only homogeneous sets remain, and multiple classes do not interfere with this process. The provided datasets contain only binary classes (Ames pos/neg, see section 3.1) and, since the implemented algorithm was tailored to these datasets, it supports only binary classifications. To exploit the matrix operation capability of Matlab®, the implemented algorithm represents a tree as a matrix, see figure 3.1. Besides the speed-up, this representation also has a memory complexity of only O(2M), where M represents the total number of nodes.

Figure 3.1: Data representation of a binary tree. On the left, a binary decision tree. On the right, the matrix representation of the same tree: the first row holds the attributes and classes of the tree. The left subnode of an inner node is always at the next position in the matrix; the number in the same column of the second row acts as a pointer towards the right subnode. Leaf nodes do not possess subnodes and therefore have no pointer.
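A plausible reading of this matrix representation can be sketched in Python rather than Matlab. The exact layout of the thesis implementation is not shown, so the convention below (pointer 0 marks a leaf, since column 0 is always the root) and the attribute indices, thresholds, and labels are all invented for illustration.

```python
# Each column is a node. Row 0 holds the attribute index of an inner node or
# the class label of a leaf; row 1 holds the column of the right subnode
# (0 marks a leaf). The left subnode is implicitly the next column.
# Hypothetical tree: the root splits on attribute 2 (left child: leaf -1,
# right child: a node splitting on attribute 5 into leaves -1 and +1).
tree = [
    [2, -1, 5, -1, 1],   # attribute index or class label
    [2,  0, 4,  0, 0],   # column of the right subnode (0 = leaf)
]

def classify(tree, sample, thresholds, col=0):
    """Follow the matrix until a leaf (right-pointer 0) is reached."""
    while tree[1][col] != 0:
        attr = tree[0][col]
        col = col + 1 if sample[attr] <= thresholds[col] else tree[1][col]
    return tree[0][col]

# Split thresholds, indexed by node column (only inner nodes use theirs):
thresholds = {0: 0.5, 2: 1.5}
print(classify(tree, {2: 0.3, 5: 9.9}, thresholds))  # goes left: class -1
print(classify(tree, {2: 2.0, 5: 9.9}, thresholds))  # right twice: class 1
```

Because the whole tree is two parallel arrays, growing and traversing it maps naturally onto the array-and-matrix data structures the implementation restricts itself to.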

The question that remains is how to select an attribute for a node. A random selection would produce a result, but the modeled tree would certainly be larger than the smallest possible model, and although the training samples would be classified correctly, unknown samples might not be. Training sets are supposed to represent a general population, and the aim of a classifier is usually to find underlying principles that hold not only for the training set but also for the universe it represents; this is called generalization. A classifier that only memorizes the training samples, without learning those underlying principles, often does poorly in predicting the classification of unknown samples, i.e., it overfits the training data. To minimize overfitting in decision trees, it is a good idea to select attributes in a more target-oriented way, according to their relevant information for the classification process. The standard approach, already introduced by Quinlan (1993), uses the two measures entropy and information gain:

Entropy(C) = − (p / (p+n)) · log2(p / (p+n)) − (n / (p+n)) · log2(n / (p+n))

InfoGain(C, A) = Entropy(C) − Σ_i ( |C_i| / |C| ) · Entropy(C_i)

This assumes that there are two classes, where A is the attribute used for the current node, C the set of samples for the current node, |C| the number of samples in C and p, n the number of positive/negative samples in C.

The entropy is a measure describing the disorder amongst a set of samples. If a node contains only samples with the same classification, the entropy is zero; if, in a binary case, a node has exactly the same number of positive and negative classifications, the entropy is at its maximum. The information gain, on the other hand, describes the reduction of entropy across the subnodes when a certain attribute A is used. If an attribute splits the sample set such that each subnode contains only samples of a single class, the second term of the information gain equation is zero, because the entropy of every subset is zero; i.e., the information gain is equal to the entropy of the samples C. When selecting a suitable attribute, the information gain is calculated for all attributes and the one with the highest gain is selected (Quinlan, 1993). This greedy technique does not guarantee the smallest possible trees, but it still produces small trees, since attributes with no information are usually not selected, and it is computationally feasible.
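The two measures can be written down directly; here is a minimal sketch in Python (in place of the thesis's Matlab), for the binary-class case matching the equations above.

```python
import math

def entropy(p, n):
    """Entropy of a node with p positive and n negative samples."""
    total = p + n
    e = 0.0
    for k in (p, n):
        if k > 0:  # 0 * log2(0) is taken as 0
            e -= (k / total) * math.log2(k / total)
    return e

def info_gain(p, n, splits):
    """Information gain of splitting (p, n) into the given (p_i, n_i) subsets."""
    total = p + n
    return entropy(p, n) - sum((pi + ni) / total * entropy(pi, ni)
                               for pi, ni in splits)

print(entropy(5, 5))                      # maximum disorder: 1.0
print(entropy(8, 0))                      # pure node: 0.0
print(info_gain(5, 5, [(5, 0), (0, 5)]))  # perfect binary split: gain 1.0
```

The printed values illustrate the two extremes described in the text: a balanced node has entropy 1, a homogeneous node has entropy 0, and a split that separates the classes completely gains the full entropy of the parent node.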

However, problems can occur when high-branching attributes are used: an identifier unique to every sample would split a set into N subnodes, each containing a single sample (Witten and Eibe, 2001). Several methods have been proposed to handle this problem, such as the use of the Laplace estimate; but, as mentioned before, the implementation for this thesis uses only binary splits, and therefore the information gain is perfectly suited for attribute selection.

For the sake of completeness, the concept of pruning is also presented. If a tree is deeply branched, with many nodes containing only a few or even single samples, it tends to overfit the training dataset. To overcome this problem and achieve a more general classifier, such sparse branches are pruned. There are basically two strategies: pre- and post-pruning. Pre-pruning is performed while the tree is grown; for example, if the entropy in a node falls below a certain threshold, the entropy is presumed to be zero and the node is labeled with the class of the majority. Post-pruning, on the other hand, analyzes a tree after modeling, cuts sparse branches and merges subsets. Although this reduces the performance of the classifier on the training data, generalization is achieved and the performance on unknown samples is often improved (Witten and Eibe (2001), p. 192). As will be explained in section 3.6, the idea of random forests often works better without pruning; hence the implementation for this thesis does not contain any pruning strategies.

3.4 Handling of Numerical Values

The basic version of decision trees described in section 3.3 can only use categorical val- ues. This is impractical for many real world data sets, including the ones described in section 3.1. Aside from prior discretization of the numerical values that is usually coupled with a loss of potentially useful information, there is a simple solution to circumvent this problem. A split criterion can be introduced; if an instance of the used numerical attribute complies with this criterion it precedes in a certain branch of the subtree and if it does not it proceeds in the other one. In this way, a binary split is achieved and the algorithm can proceed as described. In contrast to splitting categorical values, splitting numerical values does not use all the potential information of an attribute; hence a numerical attribute can be used again in a subbranch. The problem that remains is finding a suitable split criterion. There are several ways to deal with this, but the implementation for the thesis includes only two of them, presented in figure 3.4. The simplest solution is to rank the instances according to their value and use the median as split criteria. This yields some additional runtime for sorting the numeric values but is still computationally feasible. The resulting subsets have more or less the same size, which is not necessarily ideal. A more sophisticated approach would be to also sort the numerical values and to investigate ev- ery single leap between two neighboring values to find the split criteria with the highest information gain. For this technique, repeated values can be collapsed; i.e. instances with the same numerical value are represented by a single value, because boundaries between such values cannot be drawn. With this approach the average size of a tree can usually be reduced, but it is computationally very expensive. 
A large dataset like CAS (see table 3.1) with no repeated values can have several thousand potential thresholds for a single attribute. If all features are assessed for a node, this becomes computationally infeasible due to the vast number of necessary information gain calculations (Witten and Eibe (2001), p. 189–191). A middle ground between the two presented methods is to evaluate only a subset of the possible split criteria of an attribute, as done for instance by Rule Discovery SystemTM.
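The exhaustive threshold search described above can be sketched as follows. The thesis implementation is in Matlab; this Python version (function names are illustrative) only mirrors the idea of scanning all boundaries between distinct sorted values:

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def best_split(values, labels):
    """Return the threshold between two neighboring distinct values
    that yields the highest information gain."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    n = len(pairs)
    best_gain, best_thr = -1.0, None
    for i in range(1, n):
        # boundaries between repeated values cannot be drawn
        if pairs[i][0] == pairs[i - 1][0]:
            continue
        thr = (pairs[i][0] + pairs[i - 1][0]) / 2
        left = [l for v, l in pairs[:i]]
        right = [l for v, l in pairs[i:]]
        gain = (base
                - (len(left) / n) * entropy(left)
                - (len(right) / n) * entropy(right))
        if gain > best_gain:
            best_gain, best_thr = gain, thr
    return best_thr, best_gain
```

For a perfectly separable attribute such as values (1, 2, 3, 10, 11, 12) with classes (−, −, −, +, +, +), the boundary between 3 and 10 is chosen, with the maximal information gain of one bit.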


Figure 3.2: Two methods of finding a split criterion. Top: the instances of attribute A are sorted and the median is used as split criterion. Bottom: the information gain of every boundary between non-repetitive values is calculated, and the one with the highest information gain is used as threshold for the numerical condition.

3.5 Handling of Missing Values

Real-world data often contain missing values and there are many strategies to deal with them. In the simplest case, either the attributes or the samples containing these values are excluded before training the classifier. Of course, there is always a chance that the existence of missing values points toward a subgroup among the samples and that those values are characteristic for this group. Additionally, dropping samples reduces the data available for training, and too few samples might be inadequate to represent the whole population. After all, the classifier should be successfully applicable to unknown data and not only to the training set. Categorical values offer additional solutions: missing values can be handled as a separate value, if it can be assumed that the fact they are missing carries significant information. It is also possible to simply treat them like the majority category of the non-missing values, or, a little more thought out, like the majority category of the non-missing values that have the same classification (Witten and Eibe, 2001).

A more sophisticated approach for numerical data is to consider all branches of the subtree. The methods described so far, like the information gain calculation, do not require an integer number of instances; they can also handle fractional counts. Hence, a sample can be split into pieces, each piece proceeding to a different subnode, see figure 3.3.

The ratio of the single pieces can be determined by the distribution of the non-missing values. This introduces an additional weight for every sample, which is 1.0 if a sample has not been split and 0.0 < weight < 1.0 if the sample was already split before. Naturally, a sample can possess several missing attributes, or, in the case of numeric values, an attribute can subsequently be tested again with a different split criterion; hence an already split sample can be split again (Witten and Eibe, 2001).
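A minimal sketch of this fractional splitting (Python for illustration; the dict-based sample representation with a `weight` key is an assumption, not the thesis data structure):

```python
def split_with_missing(samples, attr, threshold):
    """Split weighted samples on a numeric attribute. A sample with a
    missing value (None) is divided into two fractional pieces according
    to the weight distribution of the non-missing samples."""
    left = [s for s in samples if s[attr] is not None and s[attr] <= threshold]
    right = [s for s in samples if s[attr] is not None and s[attr] > threshold]
    wl = sum(s['weight'] for s in left)
    wr = sum(s['weight'] for s in right)
    frac = wl / (wl + wr)  # fraction of the known weight going left
    for s in samples:
        if s[attr] is None:
            piece_l, piece_r = dict(s), dict(s)
            piece_l['weight'] = s['weight'] * frac
            piece_r['weight'] = s['weight'] * (1 - frac)
            left.append(piece_l)
            right.append(piece_r)
    return left, right
```

With two known samples going left and one going right, a missing-value sample of weight 1.0 is split into pieces of weight 2/3 and 1/3, so the total weight is preserved.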


Figure 3.3: Binary split of an attribute containing missing values. +/− indicates the classification of the samples. Samples in grey have missing values for the used attribute. Unlike the normal samples, they proceed to both subnodes with a smaller weight.

3.6 Random Forests

The random forest algorithm was first introduced in Breiman (2001). It combines the two methods random subspace introduced by Ho (1995) and bagging by Breiman (1996).

Random forests use ensembles of decision trees and combine the results of the single trees in order to improve the overall accuracy. The combining, also called voting, exploits the finding formulated in Condorcet's jury theorem in 1785: "If each member of a jury is more likely to be right than wrong, then the majority of the jury, too, is more likely to be right than wrong; and the probability that the right outcome is supported by a majority of the jury is a swiftly increasing function of the size of the jury, converging to 1 as the size of the jury tends to infinity" (Condorcet (1785)). This assumes that the members of the jury vote independently. If a single member could influence all other members to vote the same way as he does, the outcome probability would be the same as that of the single member. The same assumption is necessary for an ensemble of decision trees: if all of them were trained on the same dataset with the algorithm described in section 3.3, all trees would produce the same outcome. In order to achieve independence between single trees, random forests use bagging and randomization (Witten and Eibe, 2001).

Bagging uses bootstrap replication to create different training datasets. If the original training dataset contains N samples, it draws N samples with replacement to create a new dataset. The probability of a sample being drawn for the new dataset is 1 − (1 − 1/N)^N, which converges to 1 − 1/e ≈ 0.632 for large N. This means, of course, that information is lost and the accuracy of a single tree in the ensemble deteriorates, but the trees are more independent from each other and more often than not the overall performance of the classifier improves (Breiman (1996), Hall et al. (1998)). The implementation for the thesis also contains an optional bootstrap mechanism. It was tested on the six datasets, but the achieved results deteriorated; hence it was not used for the final versions of the algorithm.
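Bootstrap replication is straightforward to sketch; for a large dataset, the fraction of distinct samples in one replicate is close to 1 − 1/e ≈ 0.632 (Python illustration, not the thesis Matlab code):

```python
import random

def bootstrap(dataset, rng):
    """Draw len(dataset) samples with replacement (one bagging replicate)."""
    n = len(dataset)
    return [dataset[rng.randrange(n)] for _ in range(n)]

rng = random.Random(42)
data = list(range(10000))
replicate = bootstrap(data, rng)
unique_fraction = len(set(replicate)) / len(data)
# for large N, unique_fraction is close to 1 - 1/e ≈ 0.632
```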

Randomization, on the other hand, tries to introduce diversity by randomizing the training of a classifier. The information-based selection of an attribute described in section 3.3 is a greedy algorithm, which does not guarantee the smallest possible trees, but usually the produced models are very compact. A simple way of randomizing the selection is to not use all attributes for the selection, but only a randomly drawn subset of them. This produces bigger trees which are more prone to overfitting, even more so if pruning strategies are left out, but the overall performance of the ensemble is often improved. A good starting point for the number of randomly chosen attributes is ≈ √N, where N is the number of attributes. This can vary depending on the number of attributes that carry only very little or no relevant information for the classification (Witten and Eibe, 2001). The algorithm for the thesis uses a variable magnitude of randomly chosen attributes (0.5√N, √N, 2√N), and this parameter is referred to as the randomizer.
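The randomizer parameter can be sketched like this (Python; the function name is illustrative):

```python
import math
import random

def candidate_attributes(attributes, randomizer, rng):
    """Draw ~ randomizer * sqrt(N) attributes to evaluate at a node."""
    k = max(1, round(randomizer * math.sqrt(len(attributes))))
    return rng.sample(attributes, min(k, len(attributes)))

rng = random.Random(0)
attrs = ['a%d' % i for i in range(16)]
subset = candidate_attributes(attrs, 1.0, rng)  # 4 of 16 attributes
```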

Another important feature of random forests is the voting system. In the simplest case, each vote counts as positive or negative and the majority of all votes decides the outcome. But as described in section 3.5, the splitting of samples due to missing values can lead to results in the form of probabilities instead of pure binary classifications. This can easily be circumvented by classifying a sample according to the highest probability, but with this transformation potentially useful information is lost. A classifier which produces a result of 51% yes and 49% no is not as certain as a classifier producing a result of 100% yes and 0% no; hence it might be a good idea to use the probabilities of the single trees instead of a transformed binary classification (Bostrom (2007)). The implemented algorithm for the thesis uses those probabilities instead of the binary classes.
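Averaging per-tree probabilities rather than hard votes can be sketched as follows (Python, illustrative):

```python
def forest_vote(tree_probs, threshold=0.5):
    """Combine per-tree probabilities of the positive class by averaging
    instead of majority-voting hard binary predictions."""
    p = sum(tree_probs) / len(tree_probs)
    return p, ('positive' if p >= threshold else 'negative')
```

A tree that is barely leaning positive (0.51) contributes less to the ensemble than a fully confident one (1.0), which is exactly the information a hard vote would discard.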

3.7 Handling of Uncertain Numerical Data

The novel part of this thesis is the extension of the random forest algorithm to handle uncertain data. As described in section 1.1, there are different kinds of uncertainty which can be roughly partitioned into categorical and numeric uncertainty. Table 1.1 illustrates an example of uncertain categorical data, where the prediction of a cancer type has a probability assigned to every single class. To extend a decision tree to handle uncertain data, the sample splitting technique, described in section 3.5, can be adopted. Instead of using the uncertain categorical value in a single branch, it is split according to the probabilities and every piece proceeds to the appropriate branch. This is basically the method Qin et al. (2009) used to handle uncertain categorical attributes.

Numeric uncertainties, on the other hand, are distributed according to an assumed probability density function, and uncertain numeric values have characteristics of this distribution assigned to them. The datasets provided by AstraZeneca each have a single uncertain attribute. The implementation was tailored to these datasets, so it can handle only numeric uncertain data with different distributions.

3.7.1 Uniformly Distributed Uncertainty

Figure 3.4: The three cases of partitioning uncertain, uniformly distributed instances.

In the simplest case, probability density functions for numeric uncertainties are uniformly distributed.

Table 1.2 shows an example of an uncertain numeric attribute in the form of assigned ranges. The provided datasets, on the other hand, depict numerical values as a mean and an error estimate spanning a 95% confidence interval around the mean.

In order to adapt the random forest to handle the uncertain data, several aspects had to be considered. First of all, the selection of split criteria had to be defined. For this, only the expected values were used, each representing its whole interval. With the defined threshold, there are three cases to consider when using an uncertain attribute for a node, see figure 3.4. In the first and the last case, the split criterion lies outside the interval; hence the interval can be handled as a normal numeric value and proceed in either the left or the right branch. This leaves the middle case, where the interval is partitioned by the split criterion. The implementation for the thesis has two approaches to handle this case. The first and more obvious method is to split the sample according to the span-width ratio obtained by applying the split criterion. The two pieces can then proceed in the corresponding branches. A more general algorithm following this approach, dealing with arbitrary probability density functions, can be found in Qin et al. (2009). The second variant is to split the interval at a random position. This seems like an unusual choice because it introduces another random element, but as explained in section 3.6, this could lead to more independent trees and therefore to an improvement of the final classification.
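The span-width ratio for a uniform interval reduces to simple arithmetic; a Python sketch covering all three cases, with the interval endpoints derived from the expected value and the error estimate:

```python
def uniform_split_ratio(expected, err, threshold):
    """Weight sent to the left branch when splitting a uniform
    uncertainty interval [expected - err, expected + err] at threshold."""
    lo, hi = expected - err, expected + err
    if threshold <= lo:   # case 1: whole interval goes right
        return 0.0
    if threshold >= hi:   # case 3: whole interval goes left
        return 1.0
    return (threshold - lo) / (hi - lo)  # case 2: interval is partitioned
```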

3.7.2 Normally Distributed Uncertainty

As mentioned in section 1.1, the underlying probability density function was initially unknown. Later, it turned out that the underlying distribution can be assumed to be normal and that the error estimate is the 95% confidence interval of this distribution. With the given confidence interval, the standard deviation can be roughly estimated (4σ ≈ 95.45% confidence interval, i.e., the interval spans about two standard deviations on each side of the mean). A normal distribution is completely described by its mean and standard deviation; hence the implementation was extended to handle normally distributed uncertain data.


Figure 3.5: The three cases of partitioning uncertain, normally distributed instances.

A normal distribution is defined on ]−∞; ∞[. Considering the whole range would introduce the additional problem that the first and the last case of figure 3.5 would not exist, because every threshold would split the distribution. Subsequently, this would lead to very odd split ratios; hence only the 95% confidence interval is considered for the calculations. The method used in the implementation is essentially the same as the first one described in the previous subsection, see figure 3.5. In the middle case, where the split criterion partitions the normal distribution, the split ratio for the sample is calculated by considering the two areas under the Gauss curve; hence the normal cumulative distribution function is evaluated:

p = F(x | µ, σ) = (1 / (σ√(2π))) ∫_{−∞}^{x} e^{−(t−µ)²/(2σ²)} dt
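With σ estimated from the error estimate, the left-branch weight is the value of this CDF at the threshold. A Python sketch using the error function; the clipping to the 95% interval mirrors the three cases described above (the function name and err95 parameter are illustrative):

```python
import math

def normal_split_ratio(mean, err95, threshold):
    """Left-branch weight for a normally distributed uncertain value.
    err95 is half the 95% confidence interval, so sigma ~ err95 / 2."""
    if threshold <= mean - err95:  # threshold below the considered interval
        return 0.0
    if threshold >= mean + err95:  # threshold above the considered interval
        return 1.0
    sigma = err95 / 2.0
    # normal cumulative distribution function F(threshold | mean, sigma)
    return 0.5 * (1 + math.erf((threshold - mean) / (sigma * math.sqrt(2))))
```

A threshold at the mean sends exactly half of the sample's weight to each branch.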

3.8 Evaluation

3.8.1 Cross-Validation

A typical method of evaluating the performance of a classifier is to use only a part of the dataset for training and hold out the rest for testing. The classifier is modeled with the training set and can then be validated on the test set. As described in section 3.3, the ultimate goal of modeling a classifier is to find the underlying principles describing not only the representative dataset, but also the general population. A classifier that performs well on the training set but has a poor performance on the test set overfits the training set, i.e., it does not generalize very well. To detect this issue, a test set is used to measure the performance of a classifier on samples which were not used for the modeling and can therefore not be overfit. A split ratio of 2:1 (training data : test data) is quite common for this purpose (Witten and Eibe, 2001). To build high-performing classifiers, the training set should be representative, which is a problem for small datasets. The dataset ACE, for instance, has only 111 samples, meaning that only 74 samples would be available for training. To avoid this reduction, cross-validation, a well-known statistical method (Stone, 1977), can be used. This technique divides the entire dataset into N non-overlapping partitions. Then N − 1 partitions are used to train a model and the left-out partition is used for testing purposes. In the next step, another partition is left out and the rest is used for training. This is repeated until all partitions have been used for testing once.

During the cross-validation, N models are trained and tested, which is computationally more expensive, but it has the advantage that the classifier is tested on all samples. The standard approach is to use N = 10, which is also applied in the experiments of the thesis.
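The partitioning step of N-fold cross-validation can be sketched as follows (Python, illustrative; a real run would shuffle the indices first):

```python
def kfold(n_samples, n_folds=10):
    """Split sample indices into n_folds non-overlapping folds and yield
    (train, test) index lists; each sample appears in exactly one test set."""
    folds = [list(range(i, n_samples, n_folds)) for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [j for k, f in enumerate(folds) if k != i for j in f]
        yield train, test
```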

3.8.2 Statistical Characteristics

The test results from cross-validation can be used to calculate several statistical measures.

Which measurement is the most useful always depends on the application; for the implementation of the thesis, confusion matrices, accuracy, precision, recall, the receiver operating characteristic (ROC) and the area under the curve (AUC) were considered.

Figure 3.6: Confusion matrix

Confusion Matrix: The cross-validation provides a prediction for every sample in the dataset. Besides that, the actual classification is also known. Now there are four cases: the actual class and the prediction concur, resulting in either true positive or true negative values. If the actual class is positive but the classifier predicted a negative class, this results in a false negative value, and in the vice versa case a false positive value is produced. All four of these values are displayed in a confusion matrix, see figure 3.6 (Witten and Eibe (2001), p. 161–172).

Accuracy is the most widely used measurement, describing the fraction of overall correct predictions:

    accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision describes the correct predictions for a class:

    precision = TP / (TP + FP)

Recall, also called sensitivity when referring to the positive class, describes the fraction of a certain class that is correctly predicted:

    recall = TP / (TP + FN)

Receiver Operating Characteristic (ROC): ROC curves are graphical representations of the performance of a classifier. The vertical axis displays the percentage of true positives, whereas the horizontal axis contains the false positives. ROC curves can only be created from classifiers with outputs that are not discrete, i.e., continuous numerical values, because the outputs are ranked accordingly. With this ranking, the numbers of true positives (tp) and false positives (fp) can be defined in relation to a threshold Θ: an instance is only counted as (true or false) positive if its non-discrete value lies above Θ. Hence different values for Θ produce different tp and fp. If the output of a classifier is defined to be in [0, 1], then a Θ of zero would lead to a result where all actual positive instances are identified correctly and all negative instances are identified incorrectly. A threshold of one, on the other hand, would classify all instances as negative, and neither true positives nor false positives are found. These are the two endpoints of the ROC curve, in the upper right and the lower left corner. Thresholds between zero and one lead to individual tp and fp values, which can be plotted and result in a ROC curve like the one in figure 3.7 (Fawcett, 2004).

Figure 3.7: Receiver operating characteristic (ROC) (Witten and Eibe (2001), p.169).

Area Under Curve: To obtain a general performance measurement of a classifier, the area under a ROC curve can be calculated. The AUC value lies in the interval [0, 1], where one means all positives are ranked ahead of all negative examples. A coin-toss classifier would achieve an AUC of 0.5; hence a classifier should always have a higher value. Since the AUC only reflects the average performance of a classifier, it is possible that a classifier with a higher AUC has a lower performance than a classifier with a lower AUC at a certain threshold (Fawcett, 2004).
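The AUC equals the probability that a randomly chosen positive sample is ranked above a randomly chosen negative one, which gives a compact way to compute it from the non-discrete outputs (Python sketch):

```python
def auc(scores, labels):
    """AUC via pairwise ranking: fraction of positive/negative pairs in
    which the positive sample receives the higher score (ties count half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```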

3.9 Statistical Inference – Paired t-Test

When dealing with classifiers, especially when they are based on random elements, it is important to consider that classification results almost always vary. Simply comparing the accuracy of two models does not provide a significant result, because no information on variance is included. Statistical tests provide the means for asserting possible significance. A suitable solution is to use the results of every partition of the cross-validation in a paired t-test, in order to test whether the result is significant.

The ultimate goal of the thesis is to test whether exploiting information on uncertainty can improve a classifier. To phrase this in a more mathematical way: the objective is to test the hypothesis that the mean performance of a modified classifier is higher than the mean performance of the standard classifier, i.e., a one-sided test.

To test this hypothesis, a paired t-test as described in Goulden (1959) can be applied:

Definition:

    H0: µ2 ≤ µ1
    Ha: µ2 > µ1

Test statistic:

    t = (X̄ − Ȳ) · √( n(n − 1) / Σ_{i=1}^{n} (X̂_i − Ŷ_i)² )

where X_i and Y_i are a set of paired results obtained by two models trained on the same partition, X̄ and Ȳ are the sample means, n is the number of samples (partitions), X̂_i = X_i − X̄, and Ŷ_i = Y_i − Ȳ.

The null hypothesis is rejected when:

    t > t_{α,ν}

where t_{α,ν} is the critical value of the t-distribution with ν degrees of freedom. In the case of an N-fold cross-validation, ν = N − 1.
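The statistic is a direct transcription of the formula above (Python; the two input vectors are the per-fold results of the standard and the modified classifier):

```python
import math

def paired_t(x, y):
    """Paired t statistic: (xbar - ybar) * sqrt(n(n-1) / sum(d_i^2)),
    where d_i = (x_i - xbar) - (y_i - ybar). Equivalent to the usual
    paired t-test on the per-fold differences."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    # ss == 0 (identical per-fold differences) is a degenerate case
    ss = sum(((xi - xbar) - (yi - ybar)) ** 2 for xi, yi in zip(x, y))
    return (xbar - ybar) * math.sqrt(n * (n - 1) / ss)
```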


Chapter 4

Results

This chapter starts with a graphical analysis and a variable importance analysis using RDSTM to give a deeper insight into the used datasets. After that, the different versions of the implemented classifier are tested. A detailed analysis, including statistical inference, is applied to several of the presented classifiers in order to determine the significance of the results.

4.1 Graphical Analysis of Uncertain Data

The uncertain attributes were analyzed graphically using two kinds of plots in Matlab (Figures 4.1–4.10):

Error bar plot: shows the expected values and their corresponding error estimates as an interval. The samples with an Ames-positive classification are plotted in red, whereas the Ames-negative values are blue. The instances were sorted in order to achieve better comparability. The x-axis represents the sample number, whereas the y-axis represents the actual value of the samples for the uncertain attribute.

Error plot: shows a dot plot of the error estimates. The colors used are the same as for the error bar plot, but the samples are ordered according to the value of the error estimate. The x-axis represents the sample number and the y-axis shows the value of the error estimate.

Both plots for the CAS dataset (Figures 4.11, 4.12) look different from the plots for the other datasets, because the number of positive and negative samples is not balanced in the CAS dataset.

Figure 4.1: ACE error bar plot
Figure 4.2: ACE error plot
Figure 4.3: ACHE error bar plot
Figure 4.4: ACHE error plot
Figure 4.5: BZR error bar plot
Figure 4.6: BZR error plot
Figure 4.7: DHFR error bar plot
Figure 4.8: DHFR error plot
Figure 4.9: COX2 error bar plot
Figure 4.10: COX2 error plot
Figure 4.11: CAS error bar plot
Figure 4.12: CAS error plot

4.2 Variable Importance using Rule Discovery SystemTM

As described in section 3.2.2, RDSTM has a built-in tool to analyze the variable importance of a model. Although it cannot handle uncertain attributes, information on the significance of the single attributes can be useful in order to obtain a more profound knowledge of the datasets. As expected, the importance of the attributes varies between the different models, so the median over the models produced during the cross-validation was used to get more meaningful results. However, the values illustrated in table 4.1 show only the characteristics of the expected values, without the error estimate of the uncertain attributes. It cannot be assumed that these values are the same for models trained with the use of the error estimate.

Table 4.1: Median importance of the expected value of the uncertain attribute obtained by using RDSTM. The table also includes the average importance ranking from the single cross-validation iterations.

    Name   Variable Importance   Rank
    ACE    0.5%                  37
    ACHE   2.3%                  7
    BZR    2.5%                  9
    CAS    1.3%                  27
    COX2   5.2%                  3
    DHFR   1.8%                  19

4.3 Test Results

Four versions of the random forest algorithm were implemented for the thesis: a standard random forest able to handle numerical data and missing values, and three modified versions, each capable of handling a different type of uncertainty. The results of these three random forests, produced using different parameters, were then compared with the results of the standard algorithm in order to find significant improvements. Bagging was tested in the initial phase, but the results were not promising; the performance of the classifier using bagging deteriorated significantly in comparison to classifiers without bagging. For this reason bagging was left out for all versions of the classifier. The number of trees was set to 101 for each model, to fully utilize the voting capabilities of the random forest and to reduce the variance of the results. For each dataset, varying numbers for the random selection of attributes at each node were tested. The ACE, ACHE, BZR, DHFR and COX2 datasets were trained using a 10-fold cross-validation to get more significant results. The sample size of these datasets made it computationally feasible to train multiple models for cross-validation, using not only the faster version where the median of the instances of an attribute is used as split criterion, but also the much slower version, where all boundaries between the sorted attribute values were analyzed for the split criterion with the highest information gain. To obtain comparable results, the standard version as well as the tested modified version of the algorithm were trained on the same subsets during cross-validation, i.e., the tested dataset was partitioned into ten parts, nine for training and one for testing. Then both models were created using the same training set and validated using the same test set. In this way, a paired t-test could be used to determine the significance of the results. Since the single results are obtained by 10-fold cross-validation, the degrees of freedom are 9. Table 4.2 shows a one-tailed Student's t-table for 9 degrees of freedom and several tail probabilities.

Table 4.2: Student's t-table for 9 degrees of freedom

    Tail Probability                  10%     5%      2.5%    1%
    t-value for 9 degrees of freedom  1.383   1.833   2.262   2.821

The CAS dataset was analyzed separately, using a random partition of 2:1 (training set : test set) for validation. In addition, for the CAS dataset only the faster determination of the split criterion using the median was used, due to the limited time for the preparation of this thesis.

4.3.1 Uniformly Distributed Uncertainty using Split Criteria

The first analyzed kind of uncertainty assumes a uniformly distributed interval for the uncertain attribute in each dataset, see section 3.7.1. At the beginning of the thesis, the underlying probability density was unknown; hence assuming a uniform distribution seemed the least error-prone approach. The five smaller datasets (111–397 samples) were used for this analysis. For the random attribute selection when building a tree, three different magnitudes, known as the randomizer value, were tested: 0.5√N, √N and 2√N. For this test, the split criterion was obtained by using the median of all instances of an attribute. This split criterion was then also used to partition the uncertainty interval.

Table 4.3 shows the results of the test using the median of all instances of an attribute as split criterion. The rows labeled "without" correspond to the results of the standard random forest, whereas the results of the modified algorithm are labeled "with" accordingly.

Table 4.3 indicates, aside from the obvious variance in the results, that there is an improvement when using the DHFR dataset with a magnitude of 1√N for the number of randomly selected attributes. The result is significantly better than the standard algorithm with the same parameter settings. The t-value of 1.8731 exceeds the 1.833 for a 5% tail probability from table 4.2, i.e., the probability of incorrectly rejecting the null hypothesis is less than 5%. For a better comparison between the two classifiers, table 4.4 shows the statistical characteristics of the models and figure 4.13 shows the corresponding ROC curves.

For the next test series, the same setup was used as before, but instead of using the median of all instances of an attribute for the split criterion, the instances were ordered,

For the next test series, the same setup was used as before, but instead of using the median of all instances of an attribute for the split criterion, the instances were ordered,


Table 4.3: Accuracy of trained models with the median as split criterion, assuming uniformly distributed uncertainty that is partitioned by the split criterion

    Randomizer  Uncertainty  ACE    ACHE   BZR    DHFR   COX2
    0.5         without      86.84  64.86  76.69  84.13  79.81
                with         87.72  65.77  77.30  84.63  78.26
    1           without      86.84  70.27  76.69  84.63  80.43
                with         86.84  69.37  74.85  86.40  78.57
    2           without      88.60  69.37  79.14  84.89  78.88
                with         87.72  66.67  77.91  85.64  77.64

Table 4.4: Detailed results of the DHFR dataset using randomizer 1. The split criterion was obtained by using the median. On the left, the results obtained without using information on uncertainty; on the right, with uniformly distributed uncertainty.

                      without uncertainty   with uncertainty
    Accuracy          84.63                 86.40
    Precision         84.50                 86.07
    Recall            84.92                 86.93
    AUC               92.99                 92.71
    Confusion Matrix  169  30               173  26
                       31 167                28 170
    t-Value                     1.8731

Figure 4.13: ROC curves of the DHFR dataset using randomizer 1. The median was used as split criterion. On the left without using uncertainty, and on the right with uniformly distributed uncertainty.


repetitive values were discarded, and the information gain of all possible thresholds was calculated. For the actual split criterion, the value with the highest information gain was used, see section 3.7.1. The results of this test are displayed in table 4.5.

Table 4.5: Accuracy of trained models with the maximum-information-gain split criterion, assuming uniformly distributed uncertainty that is partitioned by the split criterion

    Randomizer  Uncertainty  ACE    ACHE   BZR    DHFR   COX2
    0.5         without      85.96  67.57  79.75  85.64  77.33
                with         85.09  68.47  78.53  86.90  77.33
    1           without      87.72  67.57  75.46  86.40  79.50
                with         85.96  69.37  75.46  86.40  78.88
    2           without      85.96  72.97  72.39  84.63  80.12
                with         85.09  71.17  71.78  84.63  79.19

Again, there is an improvement on the DHFR dataset, this time when using 0.5√N random attributes. The t-value is 1.8224, which is slightly below the value for 95% confidence. The detailed comparison can be found in table 4.6 and figure 4.14. There are also improvements on the ACHE dataset, but the results of classifiers on this set have a very high variance (t_{r=0.5} = 0.6228 / t_{r=1} = 0.9757).

Table 4.6: Detailed results of the DHFR dataset using randomizer 0.5. The split criterion was obtained by using the median. On the left, the results obtained without using information on uncertainty; on the right, with uniformly distributed uncertainty.

                      without uncertainty   with uncertainty
    Accuracy          85.64                 86.90
    Precision         86.22                 87.31
    Recall            84.92                 86.43
    AUC               93.37                 93.50
    Confusion Matrix  169  30               172  27
                       27 171                25 173
    t-Value                     1.8224

4.3.2 Uniformly Distributed Uncertainty using Random Split Value

After testing the partitioning with the split criterion, a second version of handling uniformly distributed uncertain values was investigated. As described in section 3.6, the random forest algorithm benefits from introducing random elements, because the single trees


Figure 4.14: ROC curves of the DHFR dataset using randomizer 0.5. The median was used as split criterion. On the left without using uncertainty, and on the right with uniformly distributed uncertainty.

in the ensemble tend to be more independent; partitioning the interval by a random split value is such an element, as described in section 3.7.1. The parameter setup was the same as described in section 4.3.1; only the partitioning of the uncertainty interval differs. Table 4.7 shows the results of tests using the median of an attribute as split criterion.

Table 4.7: Accuracy of trained models with the median as split criterion, assuming uniformly distributed uncertainty that is partitioned by a random value within the uncertainty interval.

    Randomizer  Uncertainty  ACE    ACHE   BZR    DHFR   COX2
    0.5         without      87.72  69.37  77.30  84.13  79.19
                with         86.84  68.47  76.69  84.63  78.57
    1           without      87.72  70.27  77.30  83.88  77.02
                with         87.72  71.17  74.85  83.88  78.26
    2           without      85.96  72.07  77.91  86.40  79.81
                with         85.96  72.97  77.30  86.65  78.57

There is an improvement when using the COX2 dataset with a random selection magnitude of √N, but the t-value indicates that the difference is not significant. The other two comparisons (for r = 0.5, r = 2) even indicate a deterioration, so the improvement can be interpreted as an outlier; still, a more detailed analysis can be found in table 4.8 and figure 4.3.2.

The test was then repeated with the same parameter settings, except for the choice of the split criteria; the same procedure as in the second test series of section 4.3.1 was used.

The results can be found in table 4.9. Interestingly, a deterioration of the accuracy of the
