International Master’s Thesis

A Hybrid Filter-Wrapper Approach for Feature Selection

Ghayur Naqvi

Studies from the Department of Technology at Örebro University, Örebro 2012

Supervisors: Dr. Marco Trincavelli
Examiners: Dr. Marcello Cirillo

© Ghayur Naqvi, 2012

Title: A Hybrid Filter-Wrapper Approach for Feature Selection

Abstract

Feature selection is the task of selecting a small subset of the original features that can achieve maximum classification accuracy. This subset of features has several important benefits: it reduces the computational complexity of learning algorithms, saves time, improves accuracy, and the selected features can be insightful for the people involved in the problem domain. This makes feature selection an indispensable step in any classification task.

This dissertation presents a two-phase approach for feature selection. In the first phase a filter method is used with the correlation coefficient and mutual information as statistical measures of similarity. This phase helps improve classification performance by removing redundant and unimportant features. A wrapper method is used in the second phase with sequential forward selection and sequential backward elimination. This phase selects the relevant feature subset that produces maximum accuracy according to the underlying classifier. The Support Vector Machine (SVM) classifier (linear and nonlinear) is used to evaluate the classification accuracy of our approach.

The empirical results on commonly used data sets from the University of California, Irvine repository and on microarray data sets show that the proposed method performs better in terms of classification accuracy, number of selected features, and computational efficiency.


Acknowledgements

Many people helped me in different ways in completing my MS thesis and I would like to take this opportunity to thank them all. First and foremost I would like to acknowledge the consistent help, advice and prompt support of my supervisor Dr. Marco Trincavelli. Marco has provided me the necessary tools and has always allowed me complete freedom throughout my research. This work would have been impossible without his guidance, criticism, endorsement and patience. Thank you very much dear Marco.

I would like to express my very great appreciation to Dr. Federico Pecora for providing me with a server PC to get my results done. I would like to express my gratitude to Per Sporrong for ensuring unhampered access to the AASS lab and setting up printing facilities for me. Ayoub Einollahi, a good friend and a great classmate, deserves thanks for assistance with LaTeX and for the valuable discussions.

Special thanks must also go to my parents and my entire family for providing me unconditional support and encouragement throughout my time in graduate school. Thank you in particular, Ami and Abu, for your tireless efforts to provide me an education in Sweden while patiently waiting for me.

My roommate and best friend Zill-e-Hussnain, thank you for your unwavering support throughout this endeavor. Last but not least, to all my friends here in Sweden and back in Pakistan, for helping me get through the difficult times, and for all the emotional support, laughs, entertainment, and caring they provided. There is no way I can fully express how much you all mean to me and how much I value your support.


Contents

1 Introduction
  1.1 Motivation
  1.2 Learning Algorithms
    1.2.1 Supervised Learning
    1.2.2 Unsupervised Learning
    1.2.3 Semi-supervised Learning
  1.3 The Classification Problem
  1.4 Thesis Structure

2 Literature Review
  2.1 Introduction
  2.2 Dimensionality Reduction
  2.3 Feature Selection
    2.3.1 Feature Ranking
    2.3.2 Feature Subset Selection
  2.4 Feature Selection Models
    2.4.1 Filter Methods
    2.4.2 Wrapper Methods
    2.4.3 Embedded Methods

3 A Hybrid Feature Selection Approach
  3.1 Introduction
  3.2 Quadratic Programming Feature Selection (QPFS) Filter Algorithm
    3.2.1 Quadratic Optimization Problem and CVX
  3.3 Sequential Feature Selection (SFS) Wrapper Algorithm
  3.4 Support Vector Machine Classifier

4 Datasets
  4.1 Experimental Design
  4.2 Datasets Description
    4.2.1 IONOSPHERE Dataset
    4.2.2 WDBC Dataset
    4.2.3 ARR Dataset
    4.2.4 NCI60 Dataset
    4.2.5 SRBCT Dataset

5 Results
  5.1 IONOSPHERE Dataset Experiments
    5.1.1 Case 1: Filter Corrcoef, Wrapper SFS, Classifier SVM-LIN
    5.1.2 Case 2: Filter Corrcoef, Wrapper SFS, Classifier SVM-RBF
    5.1.3 Case 3: Filter Corrcoef, Wrapper SBE, Classifier SVM-LIN
    5.1.4 Case 4: Filter Corrcoef, Wrapper SBE, Classifier SVM-RBF
    5.1.5 Case 5: Filter MI, Wrapper SFS, Classifier SVM-LIN
    5.1.6 Case 6: Filter MI, Wrapper SFS, Classifier SVM-RBF
    5.1.7 Case 7: Filter MI, Wrapper SBE, Classifier SVM-LIN
    5.1.8 Case 8: Filter MI, Wrapper SBE, Classifier SVM-RBF
    5.1.9 IONOSPHERE Results Summary
  5.2 WDBC Dataset Experiments
    5.2.1 Case 1: Filter Corrcoef, Wrapper SFS, Classifier SVM-LIN
    5.2.2 Case 2: Filter Corrcoef, Wrapper SFS, Classifier SVM-RBF
    5.2.3 Case 3: Filter Corrcoef, Wrapper SBE, Classifier SVM-LIN
    5.2.4 Case 4: Filter Corrcoef, Wrapper SBE, Classifier SVM-RBF
    5.2.5 Case 5: Filter MI, Wrapper SFS, Classifier SVM-LIN
    5.2.6 Case 6: Filter MI, Wrapper SFS, Classifier SVM-RBF
    5.2.7 Case 7: Filter MI, Wrapper SBE, Classifier SVM-LIN
    5.2.8 Case 8: Filter MI, Wrapper SBE, Classifier SVM-RBF
    5.2.9 WDBC Results Summary
  5.3 ARR Dataset Experiments
    5.3.1 Case 1: Filter Corrcoef, Wrapper SFS, Classifier SVM-LIN
    5.3.2 Case 2: Filter Corrcoef, Wrapper SFS, Classifier SVM-RBF
    5.3.3 Case 3: Filter Corrcoef, Wrapper SBE, Classifier SVM-LIN
    5.3.4 Case 4: Filter Corrcoef, Wrapper SBE, Classifier SVM-RBF
    5.3.5 Case 5: Filter MI, Wrapper SFS, Classifier SVM-LIN
    5.3.6 Case 6: Filter MI, Wrapper SFS, Classifier SVM-RBF
    5.3.7 Case 7: Filter MI, Wrapper SBE, Classifier SVM-LIN
    5.3.8 Case 8: Filter MI, Wrapper SBE, Classifier SVM-RBF
    5.3.9 ARR Results Summary
  5.4 NCI60 Dataset Experiments
    5.4.1 Case 1: Filter Corrcoef, Wrapper SFS, Classifier SVM-LIN
    5.4.2 Case 2: Filter Corrcoef, Wrapper SFS, Classifier SVM-RBF
    5.4.3 Case 3: Filter Corrcoef, Wrapper SBE, Classifier SVM-LIN
    5.4.4 Case 4: Filter Corrcoef, Wrapper SBE, Classifier SVM-RBF
    5.4.5 Case 5: Filter MI, Wrapper SFS, Classifier SVM-LIN
    5.4.6 Case 6: Filter MI, Wrapper SFS, Classifier SVM-RBF
    5.4.7 Case 7: Filter MI, Wrapper SBE, Classifier SVM-LIN
    5.4.8 Case 8: Filter MI, Wrapper SBE, Classifier SVM-RBF
    5.4.9 NCI60 Results Summary
  5.5 SRBCT Dataset Experiments
    5.5.1 Case 1: Filter Corrcoef, Wrapper SFS, Classifier SVM-LIN
    5.5.2 Case 2: Filter Corrcoef, Wrapper SFS, Classifier SVM-RBF
    5.5.3 Case 3: Filter Corrcoef, Wrapper SBE, Classifier SVM-LIN
    5.5.4 Case 4: Filter Corrcoef, Wrapper SBE, Classifier SVM-RBF
    5.5.5 Case 5: Filter MI, Wrapper SFS, Classifier SVM-LIN
    5.5.6 Case 6: Filter MI, Wrapper SFS, Classifier SVM-RBF
    5.5.7 Case 7: Filter MI, Wrapper SBE, Classifier SVM-LIN
    5.5.8 Case 8: Filter MI, Wrapper SBE, Classifier SVM-RBF
    5.5.9 SRBCT Results Summary
  5.6 Results Summary

6 Conclusions
  6.1 Conclusion and Future Work

A Appendices
  A.1 Dataset Configuration


List of Figures

1.1 Hybrid filter-wrapper algorithm
1.2 Two-dimensional points belonging to two different classes (asterisks and addition-signs) are shown in the figure. A classifier will learn a model using these points and then use the same model to classify the new samples, marked by "?"
2.1 A unified view of a feature selection process.
2.2 Filter, wrapper and embedded feature selection scheme
2.3 Sequential Forward Selection and Sequential Backward Elimination
2.4 A taxonomic summary of feature selection techniques with important characteristics of each technique. “Reprinted from [75]”.
3.1 Schematic describing the proposed method.
3.2 (a) A maximal margin hyperplane with its support vectors highlighted in bold (b) A plane defined by SVM in the feature space
3.3 Kernel functions
4.1 Schematic describing the proposed method with the possible choices at each step of the algorithm.
4.2 As the radar signal is transmitted, some of the signal will escape the earth through the ionosphere (green arrow). The ground wave (purple arrow) is the direct signal we hear on a normal basis (fading signal). Red and blue arrows are called "skywaves." These waves bounce off the ionosphere and can bounce for many 1000's of miles depending upon the atmospheric conditions [63].
4.3 (a) and (b) Arrangement of benign and malignant (large nuclei and prominent nucleoli) epithelial cells respectively [58] (c) FNA, a small needle is inserted into the mass. Negative pressure is created in the syringe, allowing for cellular material to be drawn in to the syringe for cytological analysis [69].
4.4 Normal and abnormal heart rhythms. “Reprinted from [57]”.
4.5 (a) Ewing’s sarcoma (b) Rhabdomyosarcoma (c) Neuroblastoma (d) Non-Hodgkin lymphoma
5.1 Results comparison for Case-1 using IONOSPHERE dataset (a) Accuracy in percent. (b) Number of features remained. (c) Time cost.
5.2 Results comparison for Case-2 using IONOSPHERE dataset (a) Accuracy in percent. (b) Number of features remained. (c) Time cost.
5.3 Results comparison for Case-3 using IONOSPHERE dataset (a) Accuracy in percent. (b) Number of features remained. (c) Time cost.
5.4 Results comparison for Case-4 using IONOSPHERE dataset (a) Accuracy in percent. (b) Number of features remained. (c) Time cost.
5.5 Results comparison for Case-5 using IONOSPHERE dataset (a) Accuracy in percent. (b) Number of features remained. (c) Time cost.
5.6 Results comparison for Case-6 using IONOSPHERE dataset (a) Accuracy in percent. (b) Number of features remained. (c) Time cost.
5.7 Results comparison for Case-7 using IONOSPHERE dataset (a) Accuracy in percent. (b) Number of features remained. (c) Time cost.
5.8 Results comparison for Case-8 using IONOSPHERE dataset (a) Accuracy in percent. (b) Number of features remained. (c) Time cost.
5.9 Results comparison for Case-1 using WDBC dataset (a) Accuracy in percent. (b) Number of features remained. (c) Time cost.
5.10 Results comparison for Case-2 using WDBC dataset (a) Accuracy in percent. (b) Number of features remained. (c) Time cost.
5.11 Results comparison for Case-3 using WDBC dataset (a) Accuracy in percent. (b) Number of features remained. (c) Time cost.
5.12 Results comparison for Case-4 using WDBC dataset (a) Accuracy in percent. (b) Number of features remained. (c) Time cost.
5.13 Results comparison for Case-5 using WDBC dataset (a) Accuracy in percent. (b) Number of features remained. (c) Time cost.
5.14 Results comparison for Case-6 using WDBC dataset (a) Accuracy in percent. (b) Number of features remained. (c) Time cost.
5.15 Results comparison for Case-7 using WDBC dataset (a) Accuracy in percent. (b) Number of features remained. (c) Time cost.
5.16 Results comparison for Case-8 using WDBC dataset (a) Accuracy in percent. (b) Number of features remained. (c) Time cost.
5.17 Results comparison for Case-1 using ARR dataset (a) Accuracy in percent. (b) Number of features remained. (c) Time cost.
5.18 Results comparison for Case-2 using ARR dataset (a) Accuracy in percent. (b) Number of features remained. (c) Time cost.
5.19 Results comparison for Case-3 using ARR dataset (a) Accuracy in percent. (b) Number of features remained. (c) Time cost.
5.20 Results comparison for Case-4 using ARR dataset (a) Accuracy in percent. (b) Number of features remained. (c) Time cost.
5.21 Results comparison for Case-5 using ARR dataset (a) Accuracy in percent. (b) Number of features remained. (c) Time cost.
5.22 Results comparison for Case-6 using ARR dataset (a) Accuracy in percent. (b) Number of features remained. (c) Time cost.
5.23 Results comparison for Case-7 using ARR dataset (a) Accuracy in percent. (b) Number of features remained. (c) Time cost.
5.24 Results comparison for Case-8 using ARR dataset (a) Accuracy in percent. (b) Number of features remained. (c) Time cost.
5.25 Results comparison for Case-1 using NCI60 dataset (a) Accuracy in percent. (b) Number of features remained. (c) Time cost.
5.26 Results comparison for Case-2 using NCI60 dataset (a) Accuracy in percent. (b) Number of features remained. (c) Time cost.
5.27 Results comparison for Case-3 using NCI60 dataset (a) Accuracy in percent. (b) Number of features remained. (c) Time cost.
5.28 Results comparison for Case-4 using NCI60 dataset (a) Accuracy in percent. (b) Number of features remained. (c) Time cost.
5.29 Results comparison for Case-5 using NCI60 dataset (a) Accuracy in percent. (b) Number of features remained. (c) Time cost.
5.30 Results comparison for Case-6 using NCI60 dataset (a) Accuracy in percent. (b) Number of features remained. (c) Time cost.
5.31 Results comparison for Case-7 using NCI60 dataset (a) Accuracy in percent. (b) Number of features remained. (c) Time cost.
5.32 Results comparison for Case-8 using NCI60 dataset (a) Accuracy in percent. (b) Number of features remained. (c) Time cost.
5.33 Results comparison for Case-1 using SRBCT dataset (a) Accuracy in percent. (b) Number of features remained. (c) Time cost.
5.34 Results comparison for Case-2 using SRBCT dataset (a) Accuracy in percent. (b) Number of features remained. (c) Time cost.
5.35 Results comparison for Case-3 using SRBCT dataset (a) Accuracy in percent. (b) Number of features remained. (c) Time cost.
5.36 Results comparison for Case-5 using SRBCT dataset (a) Accuracy in percent. (b) Number of features remained. (c) Time cost.
5.37 Results comparison for Case-6 using SRBCT dataset (a) Accuracy in percent. (b) Number of features remained. (c) Time cost.
5.38 Results comparison for Case-7 using SRBCT dataset (a) Accuracy in percent. (b) Number of features remained. (c) Time cost.
5.39 Results comparison for Case-8 using SRBCT dataset (a) Accuracy in percent. (b) Number of features remained. (c) Time cost.


List of Tables

4.1 Characteristics of the datasets used in the experiments.
4.2 Average correlation coefficients of the datasets used in the experiments.
4.3 Average mutual information of the datasets used in the experiments.
4.4 All possible combinations of the proposed algorithm
A.1 All possible configurations of a dataset with the Hybrid (Filter and Wrapper) method and their comparisons with only the Filter method and only the Wrapper method.


Chapter 1

Introduction

Most of the problems in machine learning are prediction problems, i.e. problems in which an output Y has to be predicted given an input vector X. If the output Y is a real-valued variable the prediction problem is called regression, while if the output Y is a set of classes the prediction problem is called classification. Given a feature vector X, the task is to predict some output vector Y, or its conditional probability distribution P(Y|X). Unfortunately, in most real-world problems the input feature vector X is constituted by a very large number of features. The high dimensionality of a problem can cause increased computational complexity and hinder the performance of the learning algorithm [31]. Therefore, reducing the dimensionality of a problem has certain advantages [4] [28]. Consequently, in any classification task we need to select the relevant subset of features that possibly contains the least number of dimensions which are best for solving the task at hand. Such a relevant subset of features is unknown a priori; instead, we need to use some dimensionality reduction method (Section 2.2) to discover it. Feature selection (Section 2.3) is one of the dimensionality reduction methods. It is commonly used in applications where the original features need to be preserved [52]. Although feature selection can be applied to both supervised and unsupervised learning [75], in this thesis we focus only on the problem of feature selection in supervised learning (in particular for classification problems).

Supervised methods can be categorized into three main categories according to their dependence on the classifier: filter methods, wrapper methods and embedded methods [43]. The filter methods estimate the classification performance by assessing the relevance of features, looking only at the intrinsic properties (e.g., distance, mutual information or correlation) of the data, without acquiring feedback from classifiers. The wrapper methods are classifier-dependent: these methods evaluate the “goodness” of the selected feature subset directly from classifier feedback in terms of classification accuracy. In embedded methods, the search for an optimal subset of features is built into the classifier construction, and can be seen as a search in the combined space of feature subsets and hypotheses. The embedded methods are also specific to a given classifier but less expensive than wrapper methods [75].

According to the results reported in the literature, when only filter methods are used their performance is lower than that obtained by wrapper methods, while wrapper methods can become unfeasible for large datasets due to their computationally expensive nature when the number of features is high [77]. A hybrid approach (Fig. 1.1) can be an attractive compromise that joins the positive sides of the two approaches while limiting the influence of their drawbacks. We propose a novel hybrid filter-wrapper approach (Chapter 3) that exploits the speed of the filter method followed by the wrapper’s accuracy. In the first phase of feature selection we apply the Quadratic Programming Feature Selection (QPFS) filter [70] (Section 3.2) to remove redundant and irrelevant features. In the second phase, we use the Sequential Feature Selection (SFS) wrapper [73] [3] (Section 3.3) with both forward selection and backward elimination to tune the selected feature subset to the target classifier. Our aim is to combine the strengths of both methods, yielding an approach that is less computationally expensive than the wrapper and possibly more accurate than the filter. Moreover, we will also consider the cardinality of the obtained feature subset as an indicator of performance (smaller feature subsets are preferable to larger ones).

Figure 1.1: Hybrid filter-wrapper algorithm

1.1 Motivation

The high dimension of today’s real-world data poses a serious problem for standard classifiers. Therefore feature selection is a common preprocessing step in many data analysis algorithms. It prepares data for mining and machine learning, which aim to transform data into business intelligence or knowledge. Performing feature selection may have various motivations. For instance, consider the following simple but revealing example of a binary classification problem.

Broadly speaking, two factors matter most for effective learning:

1. N, the number of features, and

2. M, the number of instances.

When N is fixed, a larger M means more constraints, and the resulting correct hypothesis is expected to be more reliable. When M is fixed, a reduced N is equivalent to significantly increasing the number of instances. Theoretically, the reduction of dimensionality can exponentially shrink the hypothesis space. Suppose we have four binary (i.e. 0-1) features F1, F2, F3, F4 and a class C (e.g., positive or negative). If the training data comprises 4 instances, i.e. M = 4, this is only a quarter of the total number of possible instances, 2^4 = 16. The size of the hypothesis space is 2^(2^4) = 65,536. If only two features are relevant to the target concept, the size of the hypothesis space becomes 2^(2^2) = 16, which is an exponential reduction of the hypothesis space. Now that we are left with only 2 features, the 4 available instances might be sufficient for perfect learning (assuming there is no repeated instance in the reduced training data). The resulting model with 2 features can also be less complicated than one with 4 features. Hence, feature selection can effectively reduce the hypothesis space, or virtually increase the number of training instances, and help create a compact model.
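The counting in this example can be reproduced directly; the following MATLAB lines merely restate the arithmetic above.

% Hypothesis-space sizes for the worked example above (four binary features,
% of which only two are relevant to the target concept).
instances_all      = 2^4      % 16 possible instances
hypotheses_all     = 2^(2^4)  % 65,536 Boolean concepts over 4 features
hypotheses_reduced = 2^(2^2)  % 16 Boolean concepts over the 2 relevant features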

This dissertation presents a novel hybrid filter-wrapper feature selection technique for classification, based on QPFS filter and SFS wrapper.

Recently, nonlinear programming has become a computational tool of central importance due to its capability to solve very large, practical engineering problems reliably and efficiently [34]. Nonlinear programming has also found wide application in combinatorial optimization and global optimization, where it is used to find bounds on the optimal value, as well as approximate solutions [9].

The QPFS filter method that we use is based on quadratic programming, a branch of optimization for which very efficient algorithms have been designed and are available in the literature [9] [53] [5] [54] [89], and is therefore particularly suitable for coping with large datasets. Experimental results show that the method achieves accuracy similar to that of other existing methods on medium-size data sets, but offers far better time complexity on very large, high-dimensional data sets like the large MNIST data set [70]. In the next phase, we run the SFS wrapper, which selects a subset of features from the data based on the underlying classifier. Features are sequentially added or removed from the active subset until a stopping criterion, often based on the degradation of the classification performance, is met.

1.2 Learning Algorithms

Feature selection algorithms can be further categorized into supervised, unsupervised, and semi-supervised, corresponding to different types of learning algorithms. The primary difference between these types of learning is whether or not the training data has been hand-labeled to generate the classifier’s output.


1.2.1 Supervised Learning

In supervised learning, we have a training set in which each training example is an (instance, label) pair, and the learner’s goal is to learn a mathematical model that represents the mapping function between the instance vectors and the label values. The model is then used to predict the label of a new unseen instance with only a small chance of erring.

More formally, we have a training set:

S_m = {(x^i, y^i)}_{i=1}^{m},  x^i ∈ R^N,  y^i = c(x^i)    (1.1)

The goal is to find a mapping P from R^N to the label set with a small chance of committing an error on a new unseen instance x ∈ R^N; performance is measured by the quality of the approximation of c on the new unseen instances [60].

In essence, supervised feature selection algorithms try to find features that help separate data of different classes. In the case of regression, feature selection is done by selecting the variables that most reduce the residual sum of squares, as in forward stepwise selection, or that minimize a penalized criterion [88].

1.2.2 Unsupervised Learning

Unlike supervised learning, unsupervised learning has no labeled data. Learning algorithms rely only on the input values from the data. The aim is to find a subset of features according to similar patterns in the data, without prior information. This type of learning is used, for example, in clustering, self-organization, auto-association and some visualization algorithms [44].

D = {x_1, x_2, ..., x_m} = X    (1.2)

Typical approaches to unsupervised learning include clustering, building probabilistic generative models and finding meaningful transformations of the data. Given a fixed number of clusters, we aim to find a grouping of the objects such that similar objects belong to the same cluster. K-means is a classical clustering algorithm [55] which clusters instances according to the Euclidean distance. Unsupervised methods perform poorly in the beginning compared to supervised learning, as they are untuned, but performance increases as they tune themselves over time [87].

1.2.3 Semi-supervised Learning

The acquisition of labeled instances for a learning problem is often difficult, expensive, or time consuming, as it requires the efforts of experienced human annotators. Therefore it may sometimes not be feasible to obtain labeled instances. On the other hand, unlabeled instances are relatively inexpensive to collect.

In a scenario where we have a small amount of labeled data together with a large amount of unlabeled data for training, semi-supervised learning techniques can be of great practical value for training our classifier [95].

It has been observed that a large number of unlabeled instances, used in conjunction with a small amount of labeled instances, gives higher accuracy. Because semi-supervised learning requires less human effort and gives considerably higher accuracy, it is of great interest both in theory and in practice [97].

1.3 The Classification Problem

By definition, classification is the grouping together of similar objects. Such a group shares common characteristics and is called a class, and the process that performs the grouping is called classification in machine learning.

In a classification task, the classifier is first provided with training data to learn from, as each instance in the training data is already labeled with the correct label or class. Then we supply test data (where instances are not labeled) to the classifier, and the classifier’s job is to assign these new and previously unseen instances to a particular class based on the prior knowledge. This process is illustrated in Figure 1.2. Formally, given a set of input items X = {x_1, x_2, ..., x_n}, a set of labels/classes Y = {y_1, y_2, ..., y_n} and training data T = {(x_i, y_i) | y_i is the label/class for x_i}, a classifier is a mapping from X to Y, f(T, x) = y.

This process is not flawless for many obvious reasons; consider, for example, the case where the classifier has inadequate training instances. The high-dimensional nature of some classification tasks may also be a real problem for classifiers, especially when we have relatively few samples for training [74].

Classification has been an active topic in various research fields over many years. In biology, for example, the study of gene expression microarrays became very popular [86]. Optical Character Recognition uses classification to translate images of handwritten, typewritten or printed text (usually captured by a scanner) into machine-editable text [47]. In document classification the task is to assign an electronic document to one or more categories, based on the document’s contents [36].


Figure 1.2: Two-dimensional points belonging to two different classes (asterisks and addition-signs) are shown in the figure. A classifier will learn a model using these points and then use the same model to classify the new samples, marked by "?"

1.4 Thesis Structure

The structure of this thesis is the following. Chapter 2 deals with the literature review and surveys dimensionality reduction methods, focusing the attention on feature selection approaches. Chapter 3 gives details about the implementation of the algorithm developed during the course of this thesis and the framework within which it operates. Chapter 4 contains a detailed description of the five publicly available datasets which we have used in our experiments. Chapter 5 discusses the results obtained for the five datasets during the experiments. Chapter 6 presents conclusions and suggests future work.


Chapter 2

Literature Review

2.1 Introduction

Today many modern scientific research fields make use of machine learning techniques for predictive modeling problems. These problems are often data-intensive and involve such a large number of features that building a predictive model becomes impractical. Usually, most of these features are either redundant or irrelevant to the predictive model. In typical predictive modeling tasks like supervised and unsupervised classification or regression, extensively large feature sets can lead to poor accuracy, high computational cost, large memory usage and slow speed. Therefore, selecting the optimal (possibly minimal) feature set giving the best possible results is desirable for classification or regression [62]. Feature selection is an indispensable step which has numerous advantages in tackling such problems [64].

Selecting the optimal feature set that contributes most to the best results can be a daunting task when we have tens or hundreds of thousands of features to choose from. In the literature, features are also known as variables or dimensions. Generally, a feature can be undesirable for one of the two following reasons:

• Features which are manifestly irrelevant for the task at hand (output)
• Features which contain redundant information

In machine learning we need feature selection essentially for the classification or regression task. Our aim is to select a subset of relevant features for which a classifier or regressor performs best.

2.2 Dimensionality Reduction

Every data entity in a computer is represented and stored as a set of features, for instance, age, height, weight, and so on. Features can interchangeably be termed dimensions, because an entity with N features can also be represented as a multidimensional point in an N-dimensional space. The process of reducing the initial feature set composed of N features to a feature set composed of K features with K < N is called dimensionality reduction. Ideally, the K reduced features retain the important characteristics of the original N features.

Dimensionality reduction is substantial in many domains like database and machine learning systems, and consequently it offers invaluable results like data compression, better data visualization, improved classification accuracy, fast and efficient data retrieval, and boosted index performance [91] [90]. There exist two important categories of dimensionality reduction techniques, named feature extraction and feature selection [76].

Feature extraction, also known as feature transformation, is the process that finds K new dimensions that are a combination of the N original dimensions. The best known feature extraction techniques are based on projection and compression methods. Principal component analysis (PCA) and linear discriminant analysis (LDA) are examples of projection methods for unsupervised and supervised learning respectively. Mutual information and information theory are used in compression methods [75] [22] [92].

In contrast to feature extraction, feature selection aims to retain a subset of the K best features from an original set of N features, and the remaining features are discarded. Feature selection techniques do not alter the original representation of the features. The best known feature selection techniques are filter, wrapper and embedded methods [43].

The dimensionality reduction technique used in this dissertation is feature selection. While feature selection can be applied to both supervised and unsupervised learning, we merely focus on the problem of classification here. The remainder of this chapter provides a brief survey of the most common feature selection approaches that can be found in the literature.

2.3 Feature Selection

Feature selection (also known as attribute selection or variable selection) is a technique to select an optimal feature subset from the original input features according to some criterion. The criterion is often formulated as an objective function that determines which features are most appropriate for the task at hand. The reason why we are interested in finding a subset of features is that it is always easier to solve a problem in a lower dimension. This helps us in understanding the nonlinear mapping between the input and the output variables [51]. Feature selection is the process of finding the most optimal subset of features of a certain size that leads to the largest possible generalization [44]. Figure 2.1 explains the feature selection process.

Nevertheless, reducing the dimensionality has certain advantages. The purpose of feature selection is three-fold. Firstly, lower dimensionality enhances the prediction accuracy of the classifier. Once an optimal subset has been determined, even very simple learning algorithms can give very good performance. In this way, the preprocessing step of feature selection may strikingly improve the predictive accuracy of a classifier. Secondly, it lowers the computational cost. Most learning algorithms become computationally intractable when the number of features is large, both in the training and in the prediction step. A feature selection step preceding the training algorithm can alleviate this computational burden. Lastly, dimensionality reduction provides better insight into the process that generated the data. This purpose is also substantial, since in many cases the ability to point out the most informative features is important [28].

The selection of features can be achieved in two ways: feature ranking or feature subset selection [28].

Figure 2.1: A unified view of a feature selection process.

2.3.1 Feature Ranking

In the feature ranking approach, features are ranked according to some criterion in order to select the top “n” ranked features based on their individual relevance; this number “n” is either specified by the user [30] or determined automatically [81].

The main drawback of this method is that it assumes the features to be independent from each other [29]. This can cause two problems:

• Features that are discarded because they are not individually relevant may turn out to be relevant when considered together with some other features.

• Features that are regarded individually relevant may cause unnecessary redundancies.


Feature ranking uses scoring functions (e.g., Euclidean distance), correlation (e.g., the Pearson correlation coefficient) or information-based criteria as the evaluation criterion.

Generally this is used as a preprocessing step because it is usually very efficient from the computational point of view [6] [11] [19] [93]. However, this technique inevitably fails in situations where only a combined set of features is predictive of the target function. The technique normally fits problems like microarray analysis, finding genes that differentiate between healthy and sick patients, where genes are evaluated in isolation without considering the gene-to-gene correlation [21].

2.3.2 Feature Subset Selection

In contrast to feature ranking, feature subset selection algorithms may automatically determine how many features to select. The rapid advances in several research fields with huge datasets have made it essential to select only the most important or descriptive features, while the remaining ones are discarded [37] [83].

Feature subset selection can be divided into three models: filters, wrappers and embedded methods. All feature selection models have their own advantages and drawbacks. In general, filters are fast because they do not incorporate learning and rely on the intrinsic characteristics of the training data to select and discard features (mutual information, data consistency, etc.). A wrapper model involves a learning algorithm (a classifier, or a clustering algorithm) to evaluate the quality of each subset of features. By including the learning algorithm they aim at improving accuracy. However, wrapper models are computationally intensive, which restricts their application to huge datasets. An embedded model embeds feature selection in the training process of the classifier and is usually specific to a given learning machine. Embedded models are usually faster than wrapper approaches but are also more likely to overfit [44]. If we have a large training set, embedded models can eventually replace filter models [29].

The feature subset selection approach is believed to have better predictive ability than ranking features according to their individual predictive power. As already mentioned, a single feature that is completely useless by itself can strikingly improve performance when taken into account together with other features. On the other hand, a good feature which is highly correlated with another feature already in the subset would provide no additional benefit since it would be redundant. Feature ranking approaches cannot deal with these scenarios [28].

2.4 Feature Selection Models

2.4.1 Filter Methods

Filter methods were the earliest approaches for feature selection in machine learning [32] [13]. As the name suggests, filters are algorithms which filter out insignificant features that have little chance of being useful in the analysis of the data. Filter methods are computationally less expensive and also more generic than wrappers or embedded methods because they do not consider the underlying classifier [85].

Filter methods are algorithms (Fig. 2.2a) which return a relevance index J(S|D, Y) that approximates, given the data D, how relevant a given feature subset S is for the task Y. These indices are commonly known as feature selection metrics; some popular metrics are correlation based, distance based or information based, or some algorithmic procedure such as decision trees may be used to estimate the relevance index [33] [8] [83] [84]. Filter methods tend to select subsets with a high number of features (even all the features) and therefore a proper threshold is required to choose a subset [77].

Figure 2.2: Filter, wrapper and embedded feature selection scheme

Concerning feature selection metrics, in contrast to information-based and decision-tree-based metrics, the correlation coefficient approach is perhaps the simplest and preferable as a feature relevance measurement because it avoids problems with probability density estimation and discretization of continuous features [29]. Here we discuss in detail only two feature selection metrics, Pearson’s correlation coefficient and mutual information, because we have used these two metrics in our work.

Pearson’s correlation coefficient between two variables is defined as the covariance of the two variables divided by the product of their standard deviations. For a feature X with values x and classes Y with values y, treated as random variables, it is defined as:

ρ(X, Y) = [E(XY) − E(X)E(Y)] / sqrt(σ²(X) σ²(Y)) = Σ_i (x_i − x̄)(y_i − ȳ) / sqrt( Σ_i (x_i − x̄)² · Σ_i (y_i − ȳ)² )    (2.1)

The correlation coefficient gives us a good idea of how closely one variable is related to another. Correlation is a measure of the strength of the linear relation between two random variables; it merely indicates how much variation in one variable is linearly related to changes in the other variable.

The correlation coefficient always lies in the interval [−1, 1]. For example, if ρ(X, Y) (in Equation 2.1) is equal to −1 or 1, i.e., one of the two extreme values of this range, it represents a perfectly linear relation between the two features, “negative” (inversely proportional) in the first case and “positive” (directly proportional) in the other. When ρ(X, Y) is exactly equal to 0, there is no linear relationship at all between the two features. If 0 < ρ(X, Y) < 1 or 0 > ρ(X, Y) > −1, there is a positive or negative relationship between the two features respectively; the larger |ρ(X, Y)|, the stronger the relationship.
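To illustrate how the correlation coefficient is used as a relevance measure, the following MATLAB sketch ranks features by the absolute value of their correlation with the class label; the matrix X and label vector y are toy placeholders rather than any of the thesis datasets.

% Rank features by |Pearson correlation| with the class label (illustrative).
X = randn(100, 10);                 % toy data: 100 instances, 10 features
y = double(X(:,3) + X(:,7) > 0);    % toy label driven by features 3 and 7

M = size(X, 2);
relevance = zeros(M, 1);
for j = 1:M
    R = corrcoef(X(:,j), y);        % 2-by-2 correlation matrix
    relevance(j) = abs(R(1,2));     % |rho| between feature j and the label
end
[~, ranking] = sort(relevance, 'descend');   % most relevant features first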

The mutual information of two discrete random variables X and Y can be defined as:

I(X; Y) = Σ_{y∈Y} Σ_{x∈X} ρ(x, y) log [ ρ(x, y) / (ρ1(x) ρ2(y)) ]    (2.2)

where ρ(x, y) is the joint probability distribution function of X and Y, and ρ1(x) and ρ2(y) are the marginal probability distribution functions of X and Y respectively.

Mutual information is a well known non-linear measure of statistical dependence based on information theory [79] [14]. It measures the amount of information that one random variable possesses about another random variable. Thus the mutual information is the reduction in the uncertainty of X due to the knowledge of Y. The mutual information is always non-negative, i.e., I(X; Y) ≥ 0, and symmetric, i.e., I(X; Y) = I(Y; X). It equals zero if and only if X and Y are statistically independent, meaning that X contains no information about Y.

Mutual information is accepted as a better quantity than the correlation function for measuring the statistical dependence between two variables, because mutual information measures general dependence while the correlation function measures linear dependence [50]. However, mutual information is harder to estimate than linear correlation since it requires the estimation of high-dimensional probability density functions (pdfs); therefore it is worth investigating both measures.
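To make Equation 2.2 concrete, the following sketch estimates I(X;Y) for two discrete vectors from their empirical joint distribution. It is an illustrative histogram-based estimator on toy data, not the toolbox implementation used later in this thesis.

% Histogram-based estimate of the mutual information I(X;Y) (Equation 2.2)
% for two discrete vectors with positive-integer levels (toy data).
x = randi(4, 500, 1);                 % discrete feature with 4 levels
y = double(x > 2) + 1;                % class labels with 2 levels

joint = accumarray([x y], 1);         % joint counts
pxy   = joint / sum(joint(:));        % joint distribution p(x,y)
px    = sum(pxy, 2);                  % marginal p(x)
py    = sum(pxy, 1);                  % marginal p(y)

nz    = pxy > 0;                      % skip zero-probability cells (log 0)
ratio = pxy ./ (px * py);             % p(x,y) / (p(x) p(y))
I     = sum(pxy(nz) .* log(ratio(nz)))   % mutual information in nats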

2.4.2 Wrapper Methods

Wrapper methods select a feature subset using a learning algorithm as part of the evaluation function (Fig. 2.2b). The learning algorithm is used as a kind of “black box” function to guide the search. The evaluation function for each candidate feature subset returns an estimate of the quality of the model that is induced by the learning algorithm, which therefore yields a better estimate of accuracy [45]. Wrapper methods tend to be prohibitively slow and computationally expensive, since, for each candidate feature subset evaluated during the search, the target learning algorithm is usually applied several times (e.g., in the case of 10-fold cross validation being used to estimate model quality).

In wrapper methods, a search strategy iteratively adds or removes features from the data to find the best possible feature subset that maximizes accuracy. A search approach decides the order in which the variable subsets are evaluated [42], such as best-first, exhaustive search, simulated annealing, genetic algorithms, branch and bound, etc.

The two simplest, and probably most common, search strategies that can be used in a wrapper are [56]:

1. Forward Selection: start with an empty set of features and add features one at a time. In each iteration, the feature that causes the largest increase of the evaluation function (with respect to the value of the current set) is added. The search stops when adding new features does not cause any improvement of the evaluation function.

2. Backward Elimination: start with a set of features that contains all the features and start removing features one at a time. In each step the feature whose removal results in the largest increase in the evaluation function value is removed. The search stops when removing features causes a decrease of the evaluation function.

Other methods like sequential backward selection-SLASH (SBS-SLASH) [10], (p,q) sequential search (PQSS), bi-directional search (BDS) [17], Schemata Search [59], relevance in context (RC) [18], and Queiros and Gelsema’s [66] are variants of forward selection, or backward elimination, or both.

Backward elimination has the advantage that when it evaluates the contribution of a feature it takes into account all of the remaining candidate features, whereas in forward selection a feature that was added at one point can become useless later on, and vice versa. Forward selection is much faster when we intend to select a small number of features. Furthermore, backward elimination becomes infeasible when the initial number of features is very large [60].
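The forward selection strategy described above can be summarized by the simplified loop below. Here crossval_error is a hypothetical helper that returns the cross-validated error of the chosen classifier on the given feature columns, and X and y stand for an arbitrary data matrix and label vector; the stopping rule follows the description given earlier.

% Simplified greedy forward selection (illustrative sketch only).
% crossval_error(X, y) is a hypothetical helper returning the cross-validated
% error of the chosen classifier trained on the columns of X.
selected  = [];
remaining = 1:size(X, 2);
best_err  = inf;
improved  = true;
while improved && ~isempty(remaining)
    improved = false;
    for f = remaining
        err = crossval_error(X(:, [selected f]), y);
        if err < best_err                 % candidate improves the evaluation
            best_err = err; best_f = f; improved = true;
        end
    end
    if improved                           % add the best feature of this round
        selected  = [selected best_f];
        remaining = setdiff(remaining, best_f);
    end
end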


Figure 2.3: Sequential Forward Selection and Sequential Backward Elimination

2.4.3 Embedded Methods

In contrast to filter and wrapper methods, in embedded methods the learning part and the feature selection part are carried out together. Decision tree learning [67] [68] can also be considered an embedded method, as the construction of the tree and the selection of the features are interleaved, but the selection of the feature in each iteration is usually done by a simple filter or ranker. A well-known example of an embedded method is the L1-SVM [61]. Since in this thesis we focus on the problem of feature selection and do not explicitly address the problem of classification, embedded methods are outside the scope of this thesis.

Figure 2.4: A taxonomic summary of feature selection techniques with important characteristics of each technique. “Reprinted from [75]”.


Chapter 3

A Hybrid Feature Selection Approach

3.1 Introduction

Having discussed the advantages and drawbacks of filters and wrappers according to the results presented in the literature, we propose a two-phase feature selection algorithm that preserves the advantages of both methods while mitigating their drawbacks. This chapter presents the hybrid feature selection approach that is the main contribution of this thesis. A schematic that represents the idea behind our approach is given in Fig. 3.1.

In the first phase, we use a relatively new filter method named Quadratic Programming Feature Selection (QPFS) [70], which reduces the task to a quadratic optimization problem and selects the top-ranked features. To rank the features, we use two similarity measures (metrics): the Pearson correlation coefficient and mutual information. QPFS is devised to deal with very large data sets with high dimensionality, providing an improvement in time complexity with respect to current methods because it is based on efficient quadratic programming [7]. QPFS selects a feature subset as a compromise between the subset being relevant for predicting the label and the features present in the subset being uncorrelated. Therefore a distinctive feature of QPFS is that it evaluates the subset of features as a whole and does not just rank the features according to an optimality criterion. However, the subset selected with QPFS can still be quite large, and it is not tuned on a specific classifier; therefore we introduce a second phase where a wrapper is used to further reduce the dimensionality of the feature subset.

In the second phase of our approach, we use a common wrapper method for feature selection named “Sequential Feature Selection”. We have used the “sequentialfs” function from the Statistics Toolbox, which carries out sequential feature selection in MATLAB®.


Although wrapper methods use the FFS and BFE heuristics to avoid an exhaustive search, they are still computationally very demanding and typically become unfeasible when the number of features is huge. Therefore, we use a filter method in the first phase to reduce the search space for the wrapper.

Figure 3.1: Schematic describing the proposed method.

To examine the performance of our approach, we need a classifier to measure classification accuracy, so we use Support Vector Machine (SVM) classification [30] [26] [35]. Recently, SVMs have emerged as a promising tool for data classification; their aim is to devise a computationally efficient way of learning separating hyperplanes with the maximal margin in a high-dimensional feature space [29] [15]. We have used both the linear and the non-linear SVM classifier.

This chapter is divided into three parts: Section 3.2 presents the QPFS filter algorithm, the Sequential Feature Selection (SFS) wrapper algorithm is described in Section 3.3, and finally Section 3.4 gives a brief description of the Support Vector Machine classifier.

3.2 Quadratic Programming Feature Selection (QPFS) Filter Algorithm

The aim of the QPFS filter method is to devise a feature selection method that is capable of dealing with very large data sets. To achieve this objective, the problem is formulated as a quadratic programming optimization problem [7]. Consider a classifier learning task that has N training examples (instances) and M features (also called attributes or variables). A linearly constrained optimization problem that minimizes a multivariate quadratic objective function is called a QP and is as follows:

min_x [ (1/2) x^T Q x − F^T x ]    (3.1)

The quadratic part of the objective function (Equation 3.1) measures the dependence (correlation, or mutual information) between each pair of features, while the linear term captures the relationship between each feature and the target class. In Equation 3.1, x is an M-dimensional vector, Q ∈ R^(M×M) is a symmetric positive semidefinite matrix that represents the similarity among features (redundancy), and F is a vector in R^M with non-negative elements that calculates how much each feature is correlated with the target class (relevance). Once the quadratic programming optimization problem is solved, the entries of x represent the weight of each feature: the higher the weight of a feature, the better it is to include it in training a classifier. Since x_i represents the importance of each feature, the following constraints are implied:

x_i ≥ 0 for all i = 1, ..., M,  and  Σ_{i=1}^{M} x_i = 1

The quadratic and linear terms of the objective function (Equation 3.1) can be weighted differently depending on the data set in the learning problem. Therefore a scalar parameter α is added to Equation 3.1 as follows:

min_x [ (1/2)(1 − α) x^T Q x − α F^T x ]    (3.2)


The best choice of α determines the minimum number of features that can be selected for a classification task. In Equation 3.2, x, Q and F are the same as defined in Equation 3.1 and α ∈ [0, 1]. When α = 1 the quadratic programming problem becomes linear and only the relevance of each feature to the target class is considered, whereas if α = 0 only the inter-dependence between features is considered, which means that features which have lower similarity coefficients with the other features receive higher weights. A fair choice of α must balance the linear and quadratic terms of Equation 3.2. For this we iterated the value of α from 0 to 1 with a step of 0.1.

The problem formulation in Equation 3.2 is sufficiently general to allow any symmetric similarity measure to be used. We have chosen the “Pearson correlation coefficient” and “mutual information” to measure similarity. We have used the MATLAB® “corrcoef” function for calculating the correlation coefficients of the feature matrix:

R = corrcoef(X)

The above function returns a matrix R of correlation coefficients calculated from an input matrix X whose rows are observations and whose columns are features. The other similarity measure we used is mutual information,

h = mutualinfo(vec1, vec2)

We have used the “Mutual Information computation” toolbox [64] for calculating the mutual information between two vectors.
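Putting the two measures together, the QPFS inputs could be assembled roughly as follows. The names Q and F follow Equation 3.1, while Xtrain and ytrain are assumed variable names for the training data; this is a sketch, not the exact code used in the experiments.

% Sketch of assembling the QPFS inputs (Pearson variant shown).
M = size(Xtrain, 2);
Q = abs(corrcoef(Xtrain));               % M-by-M feature-feature similarity
F = zeros(M, 1);
for j = 1:M
    R = corrcoef(Xtrain(:, j), ytrain);  % feature-label correlation
    F(j) = abs(R(1, 2));                 % relevance of feature j
end
% For the mutual information variant, Q(i,j) and F(j) would instead be filled
% with mutualinfo(Xtrain(:,i), Xtrain(:,j)) and mutualinfo(Xtrain(:,j), ytrain).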

3.2.1 Quadratic Optimization Problem and CVX

In mathematics and computer science, an optimization problem is one in which the aim is to find the point at which a function attains its minimum or maximum value. An optimization problem has an objective or cost function and a set of (equality/inequality) constraints. If the objective or cost function is convex (minimization problem) or concave (maximization problem), and if the set of constraints is also convex or concave, then the problem is called a convex or concave optimization problem respectively [9].

Disciplined convex programming (DCP) is a methodology for constructing, analyzing, and solving convex optimization problems [25] [23]. DCP uses a set of conventions or rules, called the DCP ruleset. Problems which comply with the ruleset can be rapidly and automatically verified as convex and converted to solvable form, while problems that deviate from the ruleset need to be rewritten in a way that adheres to it. A number of common numerical methods for optimization can be adapted to solve DCPs. The conversion of DCPs to solvable form can be fully automated. The ruleset is derived from basic principles of convex analysis. By complying with the ruleset we obtain considerable benefits, such as automatic conversion of problems to solvable form, which can then be solved reliably and efficiently.

CVX [24] is a modeling framework for DCPs which can solve standard problems such as linear programs (LPs), quadratic programs (QPs), second-order cone programs (SOCPs), and semidefinite programs (SDPs). Compared to directly using a solver for one of these types of problems, CVX can greatly simplify the task of specifying the problem. The components of the CVX modeling framework address key tasks such as verification, conversion to standard form, and construction of an appropriate numerical solver.

CVX is a Matlab-based modeling system that effectively turns Matlab into a modeling language, allowing constraints and objectives to be specified using standard Matlab expression syntax [24]. Matlab code can easily be mixed with these specifications. This combination makes it simple to perform the calculations needed to form optimization problems as well as to process the results obtained from their solution. For example, consider the following convex optimization model, which we have used in the QPFS filter phase:

...
cvx_begin
    variable x(n_features);
    minimize( alpha*x'*G*x - (1-alpha)*F'*x )
    subject to
        sum(x) == 1;
        x >= 0;
cvx_end
...
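One possible way of using the solved weights, sweeping α as described above and keeping the features with non-negligible weight, is sketched below. It assumes the similarity matrix G, the relevance vector F, and n_features are already defined and CVX is installed; the weight threshold is an assumption rather than a value taken from the thesis.

% Illustrative sweep over alpha, reusing the CVX model above for each value.
for alpha = 0:0.1:1
    cvx_begin quiet
        variable x(n_features);
        minimize( alpha*x'*G*x - (1-alpha)*F'*x )   % objective as in the model above
        subject to
            sum(x) == 1;
            x >= 0;
    cvx_end
    [w, order] = sort(x, 'descend');   % features ordered by weight
    subset = order(w > 1e-4);          % keep features with non-negligible weight
    % pass 'subset' on to the wrapper phase and record its performance
end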

3.3 Sequential Feature Selection (SFS) Wrapper Algorithm

In wrapper algorithms, the selection method directly optimizes the performance of the predictor (e.g., a classifier or a clustering algorithm), which means this feature selection method should be more accurate than a filter method. We use SVM as the reference classifier. We have used a common wrapper algorithm called “sequentialfs”, which is based on sequential feature selection.

This method has two parts:

1. The method aims to minimize the criterion or an objective function over all possible feature subsets. Common examples of an objective function are mean squared error (for regression models) and misclassification rate (for classification models).


2. To evaluate the criterion, the method needs a sequential search algorithm that adds or removes features from a candidate subset. Sequential search algorithms move in only one direction, either always growing or always shrinking the candidate set.

The second part of the “sequentialfs” method has two variants.

Sequential Forward Selection (SFS), in which features are sequentially added to an empty candidate set until the addition of further features does not decrease the criterion, and Sequential Backward Elimination (SBE), in which features are sequentially removed from a full candidate set until the removal of further features increases the criterion.

...
[fs, history] = sequentialfs(fun, XT, yT, 'cv', c, 'options', opts, 'direction', 'backward');
...

The sequentialfs input arguments include the predictor and response data and a function handle to code implementing the criterion function. The function also uses “cvpartition” to evaluate the criterion at different candidate sets. sequentialfs returns a logical vector “fs” that contains the selected features.
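The criterion handle fun is not listed here; one plausible form, using the LIBSVM interface and counting misclassified validation samples, is sketched below together with the cross-validation partition and options objects referred to in the call above.

% Possible criterion function for sequentialfs (a sketch, not the thesis code).
% It trains a LIBSVM model on the candidate feature columns and returns the
% number of misclassified validation samples.
fun = @(Xtr, ytr, Xte, yte) ...
    sum(svmpredict(yte, Xte, svmtrain(ytr, Xtr, '-q')) ~= yte);

c    = cvpartition(yT, 'kfold', 10);    % 10-fold cross-validation partition
opts = statset('display', 'iter');      % report progress at each step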

3.4 Support Vector Machine Classifier

Support Vector Machine (SVM) is a supervised learning prediction tool for classification and regression. The fundamental idea of SVM is to map the data into a high-dimensional space and find a linear separating hyperplane with the maximal margin (Fig. 3.2a). The data points are projected (Fig. 3.2b) into a higher dimensional space where they can be linearly separable. The projection is determined by the kernel function and a set of specifically selected support vectors. The SVM in its simplest form works only for linearly separable problems (Fig. 3.3a).

However, when the data is not linearly separable (Fig: 3.3b), it is possible to allow misclassification using a soft margin (controlled by a hyperparameter) and by performing a nonlinear mapping of the input data [48] [65]. To map nonlinear data samples into a higher dimensional space, we use the Radial Basis Function (RBF) kernel because it has been found to work well for a wide variety of applications [12]. Moreover, the linear kernel is a special case of the RBF kernel [39]. Another reason is the small number of hyperparameters, which reduces the complexity of model selection. The RBF kernel used in our experiment is:

K(x_i, x_j) = exp(−γ ‖x_i − x_j‖²),   γ > 0.

There are two parameters to be determined in the RBF kernel model: C and γ.
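For concreteness, the RBF kernel matrix between two sets of samples can be computed in Matlab as in the sketch below; the function name rbf_kernel and its inputs are illustrative assumptions, not code taken from the thesis.

function K = rbf_kernel(X1, X2, gamma)
    % Pairwise squared Euclidean distances between the rows of X1 and X2.
    D2 = bsxfun(@plus, sum(X1.^2, 2), sum(X2.^2, 2)') - 2*(X1*X2');
    % RBF kernel: K(i,j) = exp(-gamma * ||x_i - x_j||^2); the max() call
    % guards against tiny negative values caused by rounding.
    K = exp(-gamma * max(D2, 0));
end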



Figure 3.2: (a) A maximal margin hyperplane with its support vectors highlighted in bold (b) A plane defined by SVM in the feature space

Figure 3.3: Kernel functions: (a) linear, (b) nonlinear

In the experiment, we used a library for SVM classification called LIBSVM [12]. It implements the “one-against-one” approach [41] for multi-class classification. If k is the number of classes, then k(k − 1)/2 classifiers are constructed and each one is trained on the data from two classes.
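As a small illustration of the scale of this scheme, the class pairs can be enumerated as follows; the value k = 9 corresponds to the NCI60 dataset described later, and the variable names are only illustrative.

k = 9;                           % number of classes (e.g., NCI60)
pairs = nchoosek(1:k, 2);        % all unordered pairs of classes
n_classifiers = size(pairs, 1);  % k*(k-1)/2 = 36 binary SVMs to train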

The good generalization accuracy of an SVM model primarily depends on the selection of the model parameters. We used a grid search to find the best C (soft margin hyperparameter) and γ (RBF kernel hyperparameter). A grid search tries values of each parameter across the specified search range using geometric steps. For example, if a grid search is used with 5 search intervals and an RBF kernel function with two parameters (C and γ), then the model must be evaluated at 5 × 5 = 25 grid points. To avoid overfitting, cross-validation is used to evaluate the fit provided by each parameter value set tried during the grid search process. The number of actual SVM calculations is therefore further multiplied by the number of cross-validation folds.
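A minimal sketch of such a grid search with LIBSVM is shown below. The geometric search ranges, the 5-fold setting and the variable names (XTrain, yTrain) are illustrative choices for the sketch, not the exact values used in our experiments; with the '-v' option LIBSVM's svmtrain returns the cross-validation accuracy instead of a trained model.

% Geometric grids for the soft margin C and the RBF width gamma.
C_range     = 2.^(-5:2:15);
gamma_range = 2.^(-15:2:3);
best_acc = -inf;
for C = C_range
    for gamma = gamma_range
        % '-t 2' selects the RBF kernel, '-v 5' requests 5-fold
        % cross-validation accuracy, '-q' suppresses output.
        opts = sprintf('-t 2 -c %g -g %g -v 5 -q', C, gamma);
        acc  = svmtrain(yTrain, XTrain, opts);
        if acc > best_acc
            best_acc   = acc;
            best_C     = C;
            best_gamma = gamma;
        end
    end
end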


Chapter 4

Datasets

In this chapter, we describe the datasets used in this work and empirically evaluate the efficiency of the proposed method on five data sets summarized in Table 4.1. These data sets were chosen because they have been extensively used for classification and for the feature selection task in earlier studies [46] [49] [16] [96] [94] [71] [78]. Moreover, while choosing these datasets we also considered the ratio between the number of instances and the number of features in each dataset. In Table 4.1, the first two datasets have many more instances than features, the third dataset has roughly half as many features as instances, whereas the last two datasets have very few instances compared to the number of features. In this way we cover all the important cases regarding dataset dimensions.

The first three data sets, Ionosphere, WDBC and ARR, are available in the widely used University of California, Irvine (UCI) machine learning repository [20]. The last two, NCI60 [1] [72] and SRBCT [2], are available on the respective authors’ web sites.

All datasets are split into disjoint train and test sets with 80% and 20% proportions respectively. Since different features have different scales, a standardization of the data is needed in order to compare them: each feature variable in the train and test sets was preprocessed to have zero mean and unit variance (i.e., transformed to its z-scores). The accuracy of our hybrid feature selection algorithm is calculated as the percentage of test examples which are correctly classified.
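A minimal sketch of this standardization step is given below. It assumes that the statistics estimated on the training split are reused for the test split (a detail not spelled out above), and the variable names XTrain and XTest are illustrative.

mu    = mean(XTrain, 1);            % per-feature mean of the training split
sigma = std(XTrain, 0, 1);          % per-feature standard deviation
sigma(sigma == 0) = 1;              % avoid division by zero for constant features
XTrain = bsxfun(@rdivide, bsxfun(@minus, XTrain, mu), sigma);
XTest  = bsxfun(@rdivide, bsxfun(@minus, XTest,  mu), sigma);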

For the two measures of similarity used in the filter method, the average correlation coefficient of the correlation matrix and the average mutual information are calculated for all the datasets and are shown in Table 4.2 and Table 4.3 respectively. They measure the internal reliability of the features in the matrix (Q - Matrix). Furthermore, they give a net measure of the predictiveness of the features when the correlations or mutual information are computed with respect to a target class (F - Matrix).


Data Set     Instances   Features   Classes
Ionosphere       351          34        2
WDBC             569          32        2
ARR              422         278        2
NCI60             60        1123        9
SRBCT             83        2308        4

Table 4.1: Characteristics of the datasets used in the experiments.

Data Set     Avg. Corrcoef (Q - Matrix)   Avg. Corrcoef (F - Matrix)
Ionosphere             0.2464                       0.1913
WDBC                   0.5054                       0.5623
ARR                    0.0964                       0.1031
NCI60                  0.2709                       0.0084
SRBCT                  0.1618                       0.0060

Table 4.2: Average correlation coefficients of the datasets used in the experiments.

Data Set     Avg. MI (Q - Matrix)   Avg. MI (F - Matrix)
Ionosphere          2.6434                 1.9349
WDBC                1.8285                 2.4341
ARR                 1.1519                 2.5688
NCI60               2.3340                 2.7804
SRBCT               2.5018                 2.3177

Table 4.3: Average mutual information of the datasets used in the experiments.


4.1 Experimental Design

The aim of the experiments described here is twofold: first, to compare the classification accuracy achieved by our method (i.e., hybrid filter-wrapper feature selection) with that obtained by the QPFS filter alone and by the sequential feature selection wrapper alone; and second, to compare the cardinality of the feature set after the feature selection process has been completed. As we have three phases in our hybrid feature selection algorithm, and each phase has two possibilities (Fig: 4.1), this makes a total of eight different combinations, shown in Table 4.4. For the purpose of comparison, we further conduct two more sets of experiments, where only the filter method and only the wrapper method are executed. A detailed experimental configuration, valid for any dataset, is included in Appendix A.1. We iterate the value of α in Equation 3.2 of the filter with a step of 0.1, and for each value of α we perform an 8-fold cross validation; in this way, each step value of α is averaged over the 8 folds. In order to estimate the hyperparameters C and γ for the SVM classifier, we have used the features of the training set that remained after the filter and wrapper phases. For each case in Table 4.4, we plot three graphs, comparing accuracy, the number of features remaining after the filter and wrapper phases, and time cost.

S. No.     Filter              Wrapper          Classifier
           Corrcoef  |  MI     SFS  |  SBE      Linear  |  Nonlinear
Case - 1   X  X  X
Case - 2   X  X  X
Case - 3   X  X  X
Case - 4   X  X  X
Case - 5   X  X  X
Case - 6   X  X  X
Case - 7   X  X  X
Case - 8   X  X  X

Table 4.4: The eight experimental cases; each case selects one option from each of the three phases (filter measure, wrapper search direction, and SVM kernel).
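The α sweep with 8-fold cross validation described above can be sketched as follows; run_filter_wrapper is a hypothetical helper that stands in for one full filter + wrapper + SVM evaluation, and the variable names and the 0 to 1 range for α are illustrative assumptions.

alphas  = 0:0.1:1;                          % alpha iterated with a 0.1 step
cv      = cvpartition(yTrain, 'KFold', 8);  % 8-fold cross validation
meanAcc = zeros(size(alphas));
for i = 1:numel(alphas)
    foldAcc = zeros(cv.NumTestSets, 1);
    for f = 1:cv.NumTestSets
        tr = cv.training(f);
        te = cv.test(f);
        % Hypothetical helper: QPFS filter with alphas(i), sequential
        % wrapper, then SVM accuracy on the held-out fold.
        foldAcc(f) = run_filter_wrapper(XTrain(tr,:), yTrain(tr), ...
                                        XTrain(te,:), yTrain(te), alphas(i));
    end
    meanAcc(i) = mean(foldAcc);             % accuracy averaged over the 8 folds
end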


Figure 4.1: Schematic describing the proposed method with the possible choices at each step of the algorithm.


4.2 Datasets Description

4.2.1 IONOSPHERE Dataset

The Johns Hopkins University Ionosphere dataset [80] contains 351 instances and 34 features with two classes. The task is to classify a given radar return as “good” or “bad”. This radar data was collected by a system in Goose Bay, Labrador, which consists of a phased array of 16 high-frequency antennas with a total transmitted power on the order of 6.4 kilowatts. The targets were free electrons in the ionosphere. “Good” radar returns are those showing evidence of some type of structure in the ionosphere. “Bad” returns are those that do not, as their signals pass through the ionosphere. This process is illustrated in Fig: 4.2.

Received signals were processed using an autocorrelation function whose arguments are the time of a pulse and the pulse number. There were 17 pulse numbers for the Goose Bay system. Instances in this dataset are described by 2 attributes per pulse number, corresponding to the complex values returned by the function resulting from the complex electromagnetic signal.

Figure 4.2: As the radar signal is transmitted, some of the signal will escape the earth through the ionosphere (green arrow). The ground wave (purple arrow) is the direct signal we hear on a normal basis (fading signal). Red and blue arrows are called "skywaves." These waves bounce off the ionosphere and can bounce for many 1000’s of miles depending upon the atmospheric conditions [63].

4.2.2 WDBC Dataset

The Wisconsin Diagnostic Breast Cancer dataset (WDBC) [82] was first used to classify breast cancer masses as “malignant” or “benign” (Fig: 4.3(a) and (b)). This dataset contains features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass (Fig: 4.3(c)). They describe characteristics of the cell nuclei present in the image. The dataset contains 569 instances and 32 features with two classes.

Figure 4.3: (a) and (b) Arrangement of benign and malignant (large nuclei and prominent nucleoli) epithelial cells respectively [58] (c) FNA: a small needle is inserted into the mass and negative pressure is created in the syringe, allowing cellular material to be drawn into the syringe for cytological analysis [69].

4.2.3 ARR Dataset

The Arrhythmia (ARR) dataset distinguishes between the presence and absence of cardiac arrhythmia in a patient. An arrhythmia is a condition in which the heart is not beating in a normal rhythm (Fig: 4.4). It is a symptom of a wide variety of diseases, disorders and conditions that cause the heart to beat in a way that is irregular, too rapid (tachycardia), too slow (bradycardia) and/or not at all (asystole). Arrhythmia is also known as cardiac arrhythmia [38]. This dataset contains 422 samples and 278 features with two classes and was first studied in [27].


Figure 4.4: Normal and abnormal heart rhythms. “Reprinted from [57]”.

4.2.4 NCI60 Dataset

The NCI60 dataset was first studied in [71] and was used to examine the variation in gene expression among the 60 cell lines of the National Cancer Institute’s (NCI) anticancer drug screen. These 60 human tumor cell lines are derived from patients with leukaemia and melanoma, along with lung, colon, central nervous system, ovarian, renal, breast and prostate cancers. It contains 60 instances and 1,123 gene features with nine classes.

4.2.5 SRBCT Dataset

The small round blue cell tumor (SRBCT) dataset contains in total 83 samples in four classes (Fig: 4.5): neuroblastoma (NB), rhabdomyosarcoma (RMS), non-Hodgkin lymphoma (NHL, here of the Burkitt type, BL) and the Ewing family of tumors (EWS) [40]. Every sample in this dataset contains 2,308 gene expression values. Among the 83 samples, 29, 11, 18, and 25 samples belong to classes EWS, BL, NB and RMS, respectively. Accurate diagnosis of SRBCTs is essential because the treatment options, responses to therapy and prognoses vary widely depending on the diagnosis. As their name implies, these cancers are difficult to distinguish by light microscopy, and currently no single test can precisely distinguish them.


Figure 4.5: (a) Ewing’s sarcoma (b) Rhabdomyosarcoma (c) Neuroblastoma (d) Non-Hodgkin lymphoma


Chapter 5

Results

5.1 IONOSPHERE Dataset Experiments

5.1.1 Case - 1 (Filter: Corrcoef, Wrapper: SFS, Classifier: SVM−LIN)

In Figure 5.1 we can see the results obtained with the current configuration. The accuracy of our hybrid approach is slightly worse than that of the filter method, because the accuracy of the wrapper is not as high as expected. The filter method and our hybrid approach achieve maximum accuracy when the hyperparameter alpha almost balances the linear and quadratic terms. In terms of time cost, our hybrid approach is slightly more expensive than the filter method, and the wrapper method is the most expensive, as expected.
