
Classifying Multivariate Electrocorticographic Signal Patterns from different sessions

OSKAR SEGERSVÄRD

<oskarseg@kth.se>

DENNIS SÅNGBERG

<densan@kth.se>

Bachelor's Thesis at CSC
Supervisor: Pawel Herman
Examiner: Mårten Björkman

2013-05-27


Statement of Collaboration

Oskar Segersvärd and Dennis Sångberg collaborated in writing this report as well as conducting the experiments leading to it. Dennis focused on the wrapper feature selectors, while Oskar focused on the filter feature selector and code structure, although most parts of both implementation and report writing were done together.


Abstract

In the field of Brain-Computer Interfaces (BCI) there is a problem called the inter-session problem, which generally causes a decrease in classification performance between sessions. This study investigates the extent of this problem in Electrocorticographic (ECoG) data, and how it may be approached using classification and feature selection algorithms. The focus regarding classification is whether linear or nonlinear classification methods generalize better, and the focus regarding feature selection is whether filter or wrapper methods improve generalization more. These questions are answered by empirical experiments on two sets of ECoG data collected over two different sessions.

The inter-session problem in ECoG data proved to be of considerable size. Classification performance dropped from 78-91% on the training data set (using cross validation) to 70-80% on the tests. Better normalization and scaling methods were deemed necessary to help reduce this drop.

The results were inconclusive as to whether linear or nonlinear classifiers generalize better, since performance was nearly identical. Due to their simplicity, linear methods would be preferable in this case. As to feature selection, the risks of overfitting became apparent when using Simulated Annealing (SA) wrapper methods. Simpler feature selection algorithms that were less prone to overfitting, both filter and wrapper methods, helped to improve generalization more.


Referat

Classification of multivariate ECoG signal patterns from different sessions

In the field of Brain-Computer Interfaces (BCI) there is a problem known as the inter-session problem, which usually causes a decrease in classification performance between sessions. This study investigates the extent of the problem with ECoG (Electrocorticography) data, and how it can be handled with classification and feature selection algorithms. The focus regarding classification is whether linear or nonlinear classification methods generalize better, and the focus regarding feature selection is whether filter or wrapper methods improve generalization more. These questions are answered through empirical experiments on two sets of ECoG data collected from two different sessions.

The inter-session problem with ECoG data proved to be of considerable size. Classification performance dropped from 78-91% on the training data (using cross validation) to 70-80% on the tests. Better normalization and scaling methods are considered necessary to reduce this drop.

The results were inconclusive as to whether linear or nonlinear classifiers generalized better, since performance was nearly the same. In this case linear classifiers are preferred, due to their simplicity. Regarding feature selection, the risks of overfitting became apparent when using Simulated Annealing. Simpler feature selection methods that were less prone to overfitting, both filter and wrapper methods, improved generalization more.


Abbreviations

BCI: Brain-Computer Interface
CV: Cross Validation
ECoG: Electrocorticography
EEG: Electroencephalography
FR: Fisher's Ratio
GFS: Greedy Forward Search
LDA: Linear Discriminant Analysis
MLP: Multi-Layer Perceptron
NN: Neural Network
RBF: Radial Basis Function (used as a nonlinear kernel in SVM)
SA: Simulated Annealing
SLP: Single-Layer Perceptron
SVM: Support Vector Machine


Contents

1 Introduction
  1.1 Background
  1.2 Problem statement
  1.3 Purpose

2 Method
  2.1 Method description
  2.2 Data description
  2.3 Cross validation
  2.4 Classification algorithms
    2.4.1 Single-layer perceptron
    2.4.2 Multi-layer perceptron
    2.4.3 Support vector machine
    2.4.4 Linear Discriminant Analysis
    2.4.5 Ensemble
  2.5 Feature selection algorithms
    2.5.1 Filter methods
    2.5.2 Wrapper methods
  2.6 Scaling and normalization of data

3 Results
  3.1 Discussion
  3.2 Conclusions

Bibliography


Chapter 1

Introduction

1.1 Background

Brain-Computer Interfaces (BCI) are ways to interact with machines using only voluntarily modulated brain activity. This is done by reading and analyzing electrical signals of the brain while a subject performs a cognitive task, for example imagining a movement (of an arm, etc.), an object or a sensation (such as anger) [1].

Not surprisingly, there are many uses for a system which is controllable by one's mind. BCIs have proven to be a very important technology for the severely disabled [2], as they can be used as a tool both for communication [3, 4] and for restoration of motor and environment control [5, 6]. Apart from clinical uses, BCIs have also been used for media and entertainment. Some gaming application examples are Pinball [7] and Angry Birds [8].

There are a few different means of obtaining measurements of electrical brain activity. Non-invasive methods like electroencephalography (EEG), where electrodes are placed on the scalp of the subject's head, are the most commonly used in BCI [9, p. 9]. There are also invasive methods like electrocorticography (ECoG), where the electrodes are placed directly on the surface of the cortex. The advantage of ECoG over EEG is that since the electrodes are in direct contact with the brain, the signals are much clearer [10, p. 538]. However, the electrodes must be placed inside the skull, which has to be opened. This is usually done when subjects undergo brain surgery, for example to counteract epilepsy. As such, there has not been as much work studying ECoG data as there has been with EEG.

The brain signals usually go through several stages in order to be classified into categories of associated mental tasks that the subject is supposed to perform, as shown in Figure 1.1. The first step is pre-processing of the data to amplify the signals and remove artifacts. Then feature analysis is done, where significant patterns in the input data are extracted and recognized as features. These features are then translated by a classification algorithm into commands that software should execute [11, p. 16]. The focus of this report is on the selection of relevant features and the classification of ECoG signals.

Figure 1.1. BCI and the steps of signal pattern analysis: electrodes, preprocessing, feature analysis, classification, and output command to the computer.

Classification algorithms are needed to make use of the data and translate it into a command. They are therefore a very important part of the BCI process, and much consideration should be given to choosing which to use. However, when classifying high-dimensional data there is a high risk of poor classification performance. This is due to the curse of dimensionality [12]: as the number of dimensions in the data increases, more data points are needed to classify accurately. Since the number of data points is most often limited, a better alternative is to reduce the number of dimensions. Feature selection algorithms aim to reduce the number of dimensions by removing irrelevant features from the data. Feature selection and classification algorithms can therefore be used well in combination to obtain better results.

Problems arise when attempting to use a classifier trained on data from an earlier session to classify data collected during a later session. The brain signal fluctuations depend on many factors which change from day to day, resulting in different data distributions and therefore different results. This is a major problem in BCI when classifying e.g. EEG data, and is in this report referred to as the inter-session problem [13]. To overcome the changes in the data, generalization methods capable of classifying based on only significant data are required.

1.2 Problem statement

When attempting to classify EEG signal patterns, there is a problem with decreasing inter-session performance, in terms of classification accuracy. Since ECoG signal patterns are clearer than EEG patterns, we believe that the problem (the decrease in classification performance) should not be significant (< 5%) if suitable measures are taken, in terms of classification and feature selection algorithms. In this study we will examine this problem and either find support for this hypothesis or not. Along the way we aim to reveal suitable methods and look for trends in generalization performance with regard to linear or nonlinear classifiers as well as filter and wrapper feature selection approaches.

Should the inter-session problem still prove to be major, alternative or additional methods that could reduce it will be discussed.

1.3 Purpose

What we hope to achieve in this study is to provide some insight into the inter-session problem of ECoG data and what classification and feature selection methods are suitable. The report will help to determine appropriate classification and feature selection algorithms to use, and whether there is a need to pay special attention to the inter-session problem when designing an ECoG-based BCI system.


Chapter 2

Method

2.1 Method description

To assess the problem of decreased inter-session classification performance, empirical experiments using ECoG data sets were conducted. For the simplicity of this study, classification was done for a two-class problem: to distinguish between the imagination of finger and tongue movement. Since the investigation concerns inter-session performance, two data sets were used, one from each session. The correct classifications for both data sets are given. Using the first data set (the training set), the algorithms may be trained using the difference between the produced classifications and the correct classifications, referred to as supervised learning [12]. Once the classifiers were trained they were used to classify the second data set (the test set). The results from this classification were compared to the correct classifications of the test set to evaluate the algorithms' ability to generalize. A more detailed description of the data can be found in Section 2.2.

We implemented a select few of the currently available algorithms. Explanations of each algorithm can be found in their respective subsections of Sections 2.4 and 2.5. There is ongoing discussion in the scientific community about whether linear or nonlinear classification algorithms are better suited for BCI [14]; therefore we wanted to test at least one linear and one nonlinear classification algorithm. Eventually we used three linear classifiers (Single-Layer Perceptron, SLP; linear Support Vector Machine, linear SVM; and Linear Discriminant Analysis, LDA) and two nonlinear classifiers (Multi-Layer Perceptron, MLP, and nonlinear Support Vector Machine, nonlinear SVM) to compare the results. Feature selection algorithms can be categorized into filter, wrapper and embedded methods [15]. We implemented one filter and two wrapper feature selection algorithms. It would be interesting to compare other types of the available algorithms, but this was not done due to time constraints.


2.2 Data description

The data sets were provided by our supervisor and were originally used as data set 1 in BCI Competition III [16]. A subject's brain activity was measured using ECoG platinum electrodes in an 8 × 8 grid. The electrodes were placed on the right motor cortex of the subject. During the trials, the subject had to perform imagined movements of either the left small finger or the tongue. Recording was done for 3 seconds, starting 0.5 seconds after the visual cue had ended. A sampling rate of 1000 Hz was used. Data from 278 trials was collected during the training session, and 100 trials from the test session, which was carried out a week after the first session.

Each trial is represented by a 64 × 10 matrix and a label with value -1 or 1 for either the left little finger or the tongue. There are 8 × 8 = 64 channels, and for each channel the spectral power extracted from the µ and β frequency bands during 5 time intervals is stored. The 5 time intervals are 0.6 s long with 0.1 s overlap during the 3 seconds. This gives a 64 × 2 × 5 = 640-dimensional input, each dimension of which we consider a feature.
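To make the data layout concrete, the sketch below flattens each trial's 64 × 10 matrix into the 640-dimensional feature vector described above. It is written in Python/NumPy purely as an illustration; the array names and the random placeholder data are assumptions, not the competition's actual file format.

    import numpy as np

    # Hypothetical placeholder data: 278 training trials, each a 64 x 10 matrix
    # (64 electrodes; 10 columns = 2 frequency bands x 5 time windows).
    n_trials, n_channels, n_cols = 278, 64, 10
    rng = np.random.default_rng(0)
    trials = rng.random((n_trials, n_channels, n_cols))   # stands in for the real spectral powers
    labels = rng.choice([-1, 1], size=n_trials)           # -1 = little finger, 1 = tongue

    # Flatten each 64 x 10 trial into one 640-dimensional feature vector.
    X = trials.reshape(n_trials, n_channels * n_cols)
    print(X.shape)  # (278, 640)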

2.3 Cross validation

To evaluate the generalization ability of our methods in an inter-session problem scenario, we compared results on the test set. However, to estimate generalization ability before test data is available, methods to evaluate using only training data are needed. One common method is cross validation, which separates the data into training and validation sets. During validation, the classification algorithms may only classify data they have not seen during training, which is necessary to measure generalization performance. In this study n-fold cross validation is used.

n-fold cross validation works by randomly separating the data into n different sets. The classification algorithms train on n − 1 sets and one set is used for validation. The process is repeated with each of the other sets as the validation set, until every set has been used for validation once. This approach is especially useful when limited training data is available, but requires much more computational time, since training has to be done n times when cross validating an algorithm.
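As a rough sketch of the procedure just described (not the MATLAB code used in the study), the function below randomly splits the data into n folds, trains on n − 1 of them, validates on the remaining one, and returns the mean validation accuracy. The train_fn/predict_fn interface is an assumption made for illustration.

    import numpy as np

    def n_fold_cv(X, y, train_fn, predict_fn, n=5, seed=0):
        """Mean validation accuracy over n folds of a random split."""
        rng = np.random.default_rng(seed)
        folds = np.array_split(rng.permutation(len(y)), n)
        accs = []
        for k in range(n):
            val = folds[k]
            train = np.concatenate([folds[j] for j in range(n) if j != k])
            model = train_fn(X[train], y[train])      # train on n - 1 folds
            pred = predict_fn(model, X[val])          # validate on the held-out fold
            accs.append(np.mean(pred == y[val]))
        return float(np.mean(accs))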

2.4 Classification algorithms

Classification algorithms can be categorized in many different ways [17]; however, this study pays special attention to the choice between linear and nonlinear classification algorithms.

Linear classification algorithms use a hyperplane to separate data into two classes, and can therefore only hope to correctly classify linearly separable data. However, linear classification algorithms can classify nonlinearly if the kernel trick is used to transform the input space. In many cases in BCI, data is not fully linearly separable; therefore linear classifiers often tend to have lower classification accuracy due to bias, or approximation error.

Nonlinear classifiers are more powerful since they can fit an arbitrary classification boundary and can therefore reduce the bias. However, nonlinear classifiers may fit the boundary to peculiarities in the training data, which will yield classification errors on test data. This reduces classification accuracy due to high variance, or estimation error.

The bias-variance problem is a major one in BCI [17]. It explains why less powerful classification algorithms often outperform more powerful ones. Since this study emphasizes the inter-session problem, it is crucial to keep variance at low levels so that the models perform reasonably close to their estimates.

2.4.1 Single-layer perceptron

A single-layer perceptron (SLP) is one of the simplest artificial neural networks (NN), and as such, it is easy to implement and use. Due to its similarities with MLP and LDA it is sometimes used for BCI [17, p. 4]. It consists of a set of neurons o, one for each output.

Figure 2.1. Visualization of an SLP. This network uses 4 inputs and 2 output nodes.

Each of the neurons has a weight vector w with a weight for each of the inputs x and a bias input θ = 1, to enable neuron output even when all other inputs are 0. This can be thought of as a (|x| + 1) × |o| sized matrix W. The outputs y of the NN are calculated as

y = sign(W · x). (2.1)

An SLP is visualized in Figure 2.1. The SLP is often trained using the delta rule, Formula (2.2), where t is a vector of target values and η is the learning rate [18].

(16)

CHAPTER 2. METHOD

W ← W − η(W · x − t) · xᵀ (2.2)

This is also the rule we use in our implementation. Since it is a two-class problem, only one output node was needed to classify either 1 or −1. Parameters for the algorithm were determined through trial and error using cross validation results with different feature sets. The learning rate η = 0.001 was found to be a suitable value, and the initial weights were drawn at random from a normal distribution. The number of iterations needed usually differed between feature subsets.
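A minimal sketch of delta-rule training for the two-class case, in Python/NumPy rather than the MATLAB used in the study. The bias is handled by appending a constant input of 1, and η = 0.001 follows the text; the fixed iteration count is an arbitrary placeholder, since the text notes that the required number varied between feature subsets.

    import numpy as np

    def train_slp(X, y, eta=0.001, iters=500, seed=0):
        """Single-layer perceptron for labels in {-1, +1}, trained with the delta rule."""
        rng = np.random.default_rng(seed)
        Xb = np.hstack([X, np.ones((len(X), 1))])   # append the bias input of 1
        w = rng.normal(size=Xb.shape[1])            # random normally distributed initial weights
        for _ in range(iters):
            out = Xb @ w                            # linear output before sign()
            w -= eta * Xb.T @ (out - y)             # Formula (2.2): W <- W - eta (W.x - t) x^T
        return w

    def predict_slp(w, X):
        Xb = np.hstack([X, np.ones((len(X), 1))])
        return np.sign(Xb @ w)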

2.4.2 Multi-layer perceptron

A multi-layer perceptron (MLP) is only slightly harder to implement than the SLP, but much more powerful since it can perform nonlinear classifications. It can be thought of as several SLPs stacked on top of one another, with one weight matrix for each layer. Figure 2.2 gives an illustration of a two-layered MLP. Using an MLP is not any more difficult than using an SLP, but training the network introduces new challenges.

Figure 2.2. Visualization of an MLP. This network uses 4 inputs, 2 hidden nodes in one hidden layer and 1 output node in the output layer.

Since there is more than one layer in an MLP, it is not possible to use the SLP's method of correcting errors, since it is not known in which layer a possible error lies. Instead, at every layer, propagating backwards from the output layer, an error is estimated using Formula (2.3).

E(t, y) = (1/2) Σ_{k=1}^{n} (t_k − y_k)² (2.3)


This error can be used as a measure of height in a gradient descent algorithm. Whereas the SLP used a discontinuous activation function, it is necessary for the MLP to have a differentiable activation function, or transfer function, for the gradient descent. In order for the MLP to perform nonlinear classifications, its transfer function must also be nonlinear. A popular function that is very similar to the threshold function used by the SLP is the sigmoid transfer function [12].

The MLP is a discriminative, static and unstable classifier. It is unstable due to its ability to approximate any continuous function; therefore, outliers and noise can be detrimental to its performance. The MLP is the most popular NN classifier due to its flexibility and ability to classify an arbitrary number of classes [17, p. 4].

In this case, a two-layered perceptron was used. The MLP was implemented using the backpropagation algorithm, with one hidden layer and one output node. The parameters were found in the same manner as for the SLP. The hidden layer had 8 nodes, the learning rate was η = 0.001 and the momentum was α = 0.9. The transfer function used was

ϕ(x) = 2 / (1 + e^{−x}) − 1 (2.4)

with the derivative

ϕ′(x) = [1 + ϕ(x)][1 − ϕ(x)] / 2 (2.5)

which is nonlinear and sigmoidal, as required.
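As a quick sanity check (not part of the original implementation), the snippet below compares the closed-form derivative in Formula (2.5) with a central finite-difference approximation of Formula (2.4); the two agree to numerical precision.

    import numpy as np

    phi = lambda x: 2.0 / (1.0 + np.exp(-x)) - 1.0            # Formula (2.4)
    dphi = lambda x: (1.0 + phi(x)) * (1.0 - phi(x)) / 2.0    # Formula (2.5)

    x = np.linspace(-4.0, 4.0, 9)
    h = 1e-6
    numeric = (phi(x + h) - phi(x - h)) / (2 * h)             # central finite difference
    print(np.allclose(dphi(x), numeric, atol=1e-6))           # True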


2.4.3 Support vector machine

Support Vector Machines (SVM) have taken off since they were introduced by Vapnik in 1992, and are widely used in machine learning contexts today. They attempt to find a way to linearly separate two classes whilst maximizing the margin from the hyperplane to the nearest data points. The resulting hyperplane can be defined as w · x + b = 0, where w is a vector and b is a scalar [12]. Figure 2.3 visualizes this [19].

Figure 2.3. Visualization of a hyperplane calculated by an SVM.

An SVM with a linear decision boundary is known as a linear SVM, and is one of the SVMs used in this study, known to perform well. To be able to classify data points that are not linearly separable in the original input space, SVMs can create nonlinear decision boundaries by increasing the dimensionality of the data in a number of ways, for example by using the kernel trick. The radial basis function (RBF) kernel

K(x, y) = exp(−‖x − y‖² / 2) (2.6)

is a popular choice of kernel in BCI research, and as such was included in this paper.

Since all computations are done in the original input space, the dimensionality of the kernel-induced feature space is largely irrelevant. Since SVMs maximize the margin and regularize, they are quite good generalizers, capable of avoiding trouble caused by the curse of dimensionality and overtraining [17].

For testing the SVM we used libsvm for MATLAB, a widely used SVM implementation [20]. Due to the mentioned ongoing discussion on linear versus nonlinear methods, two SVMs were tested, one with a linear kernel and the other with a nonlinear RBF kernel. The parameter c for both kernels and γ for the RBF kernel were tuned through trial and error by cross validating the algorithms on 50 features selected by Fisher's Ratio, as described in Section 2.5.1. The cost parameter c was set to 8 for both kernels and γ = 0.75 for the RBF kernel.
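The study used libsvm for MATLAB; the sketch below sets up the corresponding experiment with scikit-learn's SVC (which also wraps libsvm), using the parameter values quoted above (C = 8 for both kernels, γ = 0.75 for the RBF kernel). The feature matrix X and label vector y are assumed to hold the 50 FR-selected features.

    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    def evaluate_svms(X, y):
        """5-fold cross-validation accuracy for the two SVM variants used in the study."""
        classifiers = {
            "linear SVM": SVC(kernel="linear", C=8),
            "RBF SVM": SVC(kernel="rbf", C=8, gamma=0.75),
        }
        for name, clf in classifiers.items():
            scores = cross_val_score(clf, X, y, cv=5)
            print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")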

2.4.4 Linear Discriminant Analysis

One of the most popular classifiers in the BCI field, Linear Discriminant Analysis (LDA) is a method that uses a discriminant hyperplane for classification, similar to the SVM. We chose to use MATLAB’s built-in function classify [21], which uses LDA to classify data, for comparative purposes while not having to implement LDA ourselves. For LDA to work best, the data must have a normal distribution.

LDA calculates its hyperplane by finding the data projection that maximizes the distance between the classes' means and minimizes the within-class variance. MATLAB's implementation requires there to be many more data points than dimensions, so it is unable to run when using all 640 features at once. This should be taken into account when examining our results in Chapter 3.

2.4.5 Ensemble

Ensemble learning algorithms work differently from the algorithms described above. They gather a set of other classifiers, resulting in an ensemble or committee, which collaborate in classifying new data points by weighted voting. The ensemble's classification is defined by the sum of the weighted classifications c_i(x) provided by the n committee classifiers.

C(x) = Σ_{i=1}^{n} w_i c_i(x) (2.7)

Ensembles can be constructed with methods of varying complexity to force diversity among the committee classifiers, as diversity is what leads to the performance increase. A combination of good classifiers is less likely to perform poorly on new data than any single classifier [22].

We decided to go with a simple voting implementation (with all weights set to 1), which created a committee using the classifiers above: SLP, MLP, linear SVM, nonlinear SVM and LDA. When the ensemble was to be trained, it trained these committee classifiers. When it was to classify new data, it gathered the results of the five committee members and used the signed sum as its result, in accordance with Formula (2.8).


C(x) = sign( Σ_{i=1}^{5} c_i(x) ) (2.8)
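A minimal sketch of this voting rule: each committee member outputs −1 or +1 and the ensemble returns the sign of the sum, as in Formula (2.8). It assumes the five trained classifiers share a common predict interface; it is an illustration, not the study's MATLAB code.

    import numpy as np

    def ensemble_predict(classifiers, X):
        """Unweighted committee vote: sign of the summed member outputs (Formula 2.8)."""
        votes = np.sum([clf.predict(X) for clf in classifiers], axis=0)
        # With five members each voting -1 or +1 the sum is never zero, so sign() is well defined.
        return np.sign(votes)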

2.5 Feature selection algorithms

In order to classify data, the definition of a feature must be determined. The simplest approach is to consider each element of one electrode, frequency band and time window as a single feature, which gives 640 features. Since there are 640 features in the data but only 278 data points in the training set, feature selection is important. In this study filter and wrapper feature selection methods are used; brief descriptions of them follow.

2.5.1 Filter methods

Filter methods select features independently of the classifier. They have some method of ranking features, of which the top n features are then used for classification.

The advantage of filter methods is that they are fast and give an overview of what the useful features might be. However, some features may only work well in combination with other features, and these will not necessarily be the top-ranked features.

Fisher’s Ratio

Fisher's Ratio (FR) can be used to reduce the dimensionality of large data sets while maintaining the most significant data [23, p. 747]. This is achieved by ranking each feature f by the distance between, and the density of, the two class clusters according to Formula (2.9), where m_{f,i} is the mean and σ_{f,i} the standard deviation of feature f for class i. Features with the largest ratio are deemed most significant, and are the preferred features for classification.

R(f) = |m_{f,1} − m_{f,2}| / √(σ²_{f,1} + σ²_{f,2}) (2.9)

Fisher's Ratio was easily implemented by applying Formula (2.9) to each feature, and it supplies a vector of ones and negative ones to represent whether a feature is to be used or not. The only decision to make was how many features to use.

Looking at Figure 2.4, and aiming to choose as few features as possible, it is clear that 50 features gives the best performance in relation to the number of features.
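A small sketch of the filter step: compute Formula (2.9) for every feature and keep the n highest-ranked ones (50 in this study). The ±1 selection vector mirrors the description above; the implementation itself is an illustrative Python/NumPy reconstruction, not the original code.

    import numpy as np

    def fisher_ratio_selection(X, y, n_keep=50):
        """Rank features by Fisher's Ratio (Formula 2.9) and keep the n_keep highest."""
        X1, X2 = X[y == 1], X[y == -1]
        ratio = np.abs(X1.mean(axis=0) - X2.mean(axis=0)) / np.sqrt(X1.var(axis=0) + X2.var(axis=0))
        keep = np.argsort(ratio)[::-1][:n_keep]        # indices of the top-ranked features
        mask = -np.ones(X.shape[1], dtype=int)         # -1 = feature not used
        mask[keep] = 1                                 # +1 = feature selected
        return mask, keep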


Figure 2.4. Classification performance using various numbers of features.

2.5.2 Wrapper methods

Wrapper methods evaluate subsets of features instead of ranking features individually [15]. They often use a classification algorithm's performance on a subset to evaluate its significance. The goal is to find a subset with a small number of features that still gives good classification performance. With n features the total number of subsets is 2^n, so an exhaustive search is not an alternative for a large number of features. Instead, greedy search algorithms or search heuristics are used to find satisfactory solutions. Below are descriptions of the two search algorithms used in this study. Wrapper methods have the advantage of catching correlations between features and identifying useful combinations of features, but are in return much more computationally demanding than filter methods.

Greedy Forward Search

Greedy Forward Search (GFS) is a simple search algorithm with a greedy approach. Forward search means that the starting subset is empty, and features are consecutively added until a satisfactory solution is found. The opposite is backward search, which starts with the full set of features and consecutively removes features from the subset. In this study we are interested in obtaining a low number of features starting from a high number of features; therefore forward selection is considered more suitable.

At every iteration, the algorithm tries adding each unused feature to the current subset and evaluates the result. The best addition is then permanently added to the subset. This may be repeated until the increase in classification performance becomes insignificant, or until a large enough number of features is used. Since this is a greedy algorithm, the solutions it finds may be suboptimal. However, due to its simplicity and successful previous use [24], it too was included in our trials.


Figure 2.5. Greedy forward search using linear SVM

Our implementation of the algorithm tries out different subsets of up to 100 features. Cross validation performance is plotted and each subset is stored as the algorithm chooses a new feature to include in the subset, as seen in Figure 2.5. The value of each new feature is the average over five 5-fold cross validation runs, since cross validation performance varies depending on how the data is split. The final subset is chosen based on the plot, at the point where the performance increase becomes insignificant, and the subset used at that point is used for the actual classification.

SLP and MLP turned out to be too slow for use in the internal value function of the GFS algorithm. SVMs with both linear and nonlinear kernels, as well as LDA, all seemed to work well, although the selected subsets cross validated differently for different classifiers. In the end, GFS gave 45 features using a linear SVM, 30 features using a nonlinear SVM and 10 features using LDA.
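The search itself can be sketched as follows: starting from an empty subset, each round adds the single feature whose inclusion gives the best cross-validation score, and the growing subsets are stored so the final one can be picked from the plot. Here cv_score stands in for the averaged 5-fold cross-validation runs described above, and the fixed maximum of 100 features is the only stopping rule; both are simplifications of the actual implementation.

    def greedy_forward_search(X, y, cv_score, max_features=100):
        """cv_score(X_subset, y) -> mean cross-validation accuracy of the wrapped classifier."""
        selected = []
        remaining = list(range(X.shape[1]))
        history = []
        while remaining and len(selected) < max_features:
            # Try adding each unused feature and keep the best addition.
            best_f, best_score = None, -1.0
            for f in remaining:
                score = cv_score(X[:, selected + [f]], y)
                if score > best_score:
                    best_f, best_score = f, score
            selected.append(best_f)
            remaining.remove(best_f)
            history.append((list(selected), best_score))   # stored subsets, as plotted in Figure 2.5
        return history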

Simulated Annealing

The Simulated Annealing (SA) algorithm is a probabilistic heuristic which has successfully been used in BCI [25]. SA works by randomly trying and evaluating different solutions, converging to a stable solution as a temperature variable is lowered. When the temperature is high, worse solutions are more likely to be accepted by the algorithm. As the temperature is lowered, worse solutions become less and less likely to be accepted, until the temperature reaches zero, at which point the current solution stops changing and is returned. One variant of the algorithm is described below in Algorithm 1, which gives the general outline. A suitable neighbor function, value function, acceptance function, start solution and appropriate parameters must be chosen and tuned to achieve good results.

Algorithm 1: Simulated annealing in pseudo-code

    currentSolution ← generateStartSolution()
    currentValue ← evaluate(currentSolution)
    while temperature > 0 do
        for i = 1 to iterations do
            newSolution ← neighbor(currentSolution)
            newValue ← evaluate(newSolution)
            if newValue > currentValue or accept(currentValue, newValue, temperature) then
                currentSolution ← newSolution
                currentValue ← newValue
            end if
        end for
        temperature ← temperature − 1
    end while
    return currentSolution

As a wrapper approach to the feature selection problem, the neighbor function in SA will randomly add or remove features from its current subset. The value of a subset is determined by the classification performance of the classification algorithm.

SA has the useful ability to avoid local minima, since it has a probability of accepting worse solutions. However, in a large search space the SA algorithm needs many iterations to find satisfactory solutions. SA algorithms are also often criticized for being prone to overfitting, which would hurt generalization performance [26]. Therefore SA may or may not be a more suitable alternative to greedy approaches in BCI, depending on the data.

In our implementation of the algorithm, the neighbor function first determines with 50% probability whether it should add or remove a feature from the current subset. If it should add, one of the remaining features is selected at random, and vice versa for removal.

The internal value function is the 5-fold cross validation performance, in percent, of one of the classification algorithms, minus a penalty for the number of features in the subset. A penalty for selecting additional features is included since we want to minimize the subset size. A penalty value of 4/25 per feature was found to be appropriate through trial and error. As with GFS, cross validation is run 5 times and the average is used as the true value for the subset.

The acceptance function for worse solutions accepts with a probability of

P(accept) = e^{kΔV/T} (2.10)


where ΔV is the new value minus the old value, T is the current temperature, and k is a constant determined by trial and error to have the same value as the starting temperature.
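Putting the pieces together, the sketch below runs the annealing loop with the neighbour, value and acceptance functions described above: one feature is flipped in or out with equal probability, the value is the cross-validation accuracy minus the per-feature penalty, and worse solutions are accepted with the probability in Formula (2.10). The start solution and the value_fn interface are assumptions made for illustration.

    import math
    import random

    def sa_feature_selection(n_features, value_fn, t_start=100, iters_per_temp=500,
                             penalty=4 / 25, seed=0):
        """value_fn(subset) -> cross-validation accuracy (in %) for a set of feature indices."""
        rng = random.Random(seed)
        current = set(rng.sample(range(n_features), n_features // 2))   # arbitrary start solution
        current_val = value_fn(current) - penalty * len(current)
        k = t_start                                       # k chosen equal to the start temperature
        for temperature in range(t_start, 0, -1):
            for _ in range(iters_per_temp):
                new = set(current)
                if rng.random() < 0.5 and len(new) < n_features:
                    new.add(rng.choice([f for f in range(n_features) if f not in new]))
                elif new:
                    new.remove(rng.choice(sorted(new)))
                new_val = value_fn(new) - penalty * len(new)
                delta = new_val - current_val
                if delta > 0 or rng.random() < math.exp(k * delta / temperature):  # Formula (2.10)
                    current, current_val = new, new_val
        return current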

Figure 2.6. Simulated annealing using nonlinear SVM. This run used 100 as starting temperature and 100 internal iterations. The left graph shows classification accuracy over the number of iterations while the right shows the number of features used at each iteration.

We let the algorithm run with 100 as the starting temperature, and for 500 iterations at each integer value of the temperature between 100 and 0. The subset at zero temperature is then chosen as the subset of features to use for classification. The algorithm was run several times, and the subset which gave the highest cross validation result with a low number of features was chosen. Figure 2.6 shows a typical run of the algorithm, with the left graph showing the property that only better solutions are accepted after many iterations.

As with GFS, SLP and MLP were too slow for practical use. Feature selection using SA with a linear SVM gave a subset of 102 features, a nonlinear SVM gave 105 features and LDA gave 47 features.

2.6 Scaling and normalization of data

Basic analysis of the data was done to improve the performance of the chosen methods. In the BCI competition where the data sets were originally used [16], the test data (but not the test labels) was available beforehand. As such, analysis was done on both training and test data.


When examining the training data, it was found that the smallest value was on the order of 10², the largest on the order of 10⁵ and the mean value on the order of 10³. For many classification algorithms, performance improves if values lie in a smaller range [12]. For that reason the data was scaled to the interval 0 to 1 using Formula (2.11), where x is the unscaled data point, x′ is the scaled data point and min(X) and max(X) are values with respect to the data set X.

x′ = (x − min(X)) / (max(X) − min(X)) (2.11)

However, when examining the test data, the smallest value was on the order of 10², the largest on the order of 10⁷ and the mean value on the order of 10⁵. Before testing, it was already evident that the test data differed from the training data and that, without some sort of normalization, it would be unlikely to get good results.

The method used was to calculate the mean and standard deviation of each feature individually in the training data, and then, for each feature, subtract the mean and divide by the standard deviation in the test data. This lets both data sets have the same mean and standard deviation in each dimension.
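The two transformations can be sketched as below: min-max scaling with Formula (2.11) using one global minimum and maximum over all dimensions (the variant the study settled on), and per-feature standardization. Whether the test set is standardized with the training statistics or with its own is not entirely clear from the text, so the helper takes the reference statistics as an explicit argument; this is a hedged reconstruction, not the original MATLAB code.

    import numpy as np

    def minmax_scale_uniform(X, X_ref=None):
        """Formula (2.11) with a single global min/max taken over all dimensions."""
        ref = X if X_ref is None else X_ref
        lo, hi = ref.min(), ref.max()
        return (X - lo) / (hi - lo)

    def standardize(X, X_stats):
        """Subtract the per-feature mean of X_stats and divide by its per-feature standard deviation."""
        mu = X_stats.mean(axis=0)
        sd = X_stats.std(axis=0)
        return (X - mu) / sd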

To see whether the prepared methods would perform reasonably well on the test data before comparing classification results to the actual labels, it was assumed that the test labels contained roughly as many 1 labels as −1 labels. The labels given by each classification method were summed and then plotted, showing how partial each classifier was to either class. The sum corresponds to how biased each classifier is towards one class. If the above assumption is true, then good unbiased methods should yield sums close to 0, and different classification methods should have roughly the same sum.

Graphs were plotted for several scaling and normalization methods to see which gave the smallest and most stable bias. Figure 2.7 shows our initial attempt at normalizing the data by scaling each dimension individually; it was clear that the bias was too great. Figure 2.8, on the other hand, shows the method we ended up using, scaling all dimensions uniformly, which resulted in more stable values that were closer to 0.


Figure 2.7. The bias towards either class, scaling each dimension individually. Numbers in parentheses represent how many features were used for that feature selection method.


Figure 2.8. The bias towards either class, scaling all dimensions uniformly. Numbers in parentheses represent how many features were used for that feature selection method.


Chapter 3

Results

Below, two graphs are presented. Figure 3.1 shows the average classification accuracy in percent for each feature selection and classification algorithm over 10 five-fold cross validation runs on training data. The numbers in parentheses represent how many features were used with that feature selection method. Figure 3.2 shows how each method performed on the test data. In the next section (Section 3.1) the results of these figures are compared with each other.


Figure 3.1. The final cross-validation results on training data. The bars show the mean value of 10 five-fold cross validation runs for each method. The vertical lines show the standard deviation of the mean values. LDA is not present when no feature selection is used, since the number of data points must be greater than the number of dimensions for LDA to work.


Figure 3.2. The final results on test data.


3.1 Discussion

Observing the test results, it is clear that all methods performed significantly better than chance (50%, since it is a two-class problem), with the exception of LDA using no feature selection, due to the number of dimensions (see Section 2.4.4). It is, however, also evident that classification accuracy was lower on test data (70-80%) than in the cross-validation results on training data (78-91%). Every method combination showed a drop in performance, some of up to 20%. This result indicates that the inter-session problem is greater than the 5% decrease that we predicted, and that there may be measures one should take, apart from feature selection and classification, to improve the result.

In Section 2.6 it was found that training and test data differed in that the test data contained values on the order of 10² times larger. The performance difference between training and test results also emphasizes the difference in data between sessions. For these reasons, more sophisticated distribution analysis should be carried out. For example, Principal Component Analysis (PCA) could be used, which is capable of finding the directions with the largest variations [12]. This can be used to visualize the important properties of the data, to see how the data distribution varies between sessions.

We also believe that our method of scaling and normalizing data had a large impact on the test results. Of the several methods tried before choosing which one to use, many gave very different classifications on the test session. In a future study, different methods of transforming the data should be investigated.

It is possible to define a feature in many different ways. We chose to look at each electrode with one time window for one frequency band as a single feature. There may be other better alternatives that could improve generalization performance.

In a real-world scenario, it might be more interesting to view each electrode as a single feature, by averaging over the different time windows. It is likely that the signals from some areas of the human cortex, as read by an electrode, are not critical to classification. In those cases the irrelevant electrodes can be ignored, which would also be more practical in use. It is also possible to treat frequency bands as the same feature or as individual features. A different method of dealing with the different time windows used in BCI is to perform the feature extraction and classification steps on each time window and combine the results. Another is to use a dynamic classifier which is capable of classifying a sequence of feature vectors and capturing temporal dynamics [17].

Irrespective of whether the inter-session problem was underestimated or not, conclusions can still be drawn about which methods had the best generalization properties. It can be seen in both Figure 3.1 and Figure 3.2 that the choice of feature selection algorithm affects classification performance. The cross validation results show that a classifier using features selected with itself as the internal classifier performs better than the others, which is to be expected since the features are optimized for that specific classifier. This is, however, not seen to the same extent in the results from the test session.

Feature selection algorithms that boosted performance the most in the cross validation trials did not necessarily do so in the test session. While SA initially looked like a valuable feature selection algorithm from the cross validation results, it had very poor results on the test data, even worse than using no feature selection at all. SA is often criticized for having an increased risk of overfitting, as mentioned in Section 2.5.2, and looking at the results this was likely the case in this study. For the GFS algorithm, the internal classifiers linear SVM and LDA had better test results. In the choice of feature selection algorithm, simpler methods such as FR and GFS with linear internal classifiers seem to improve generalization the most.

Between linear and nonlinear classification algorithms, the findings of this study could not clearly indicate which generalizes better, as the SLP performed similarly to the MLP and the linear SVM performed similarly to the nonlinear SVM. This implies, however, that if the performance of linear and nonlinear methods is about the same, the added complexity of nonlinear classifiers is unnecessary. The neural networks performed slightly better than the SVMs in many cases, which was not expected from the cross validation results. The MLP had good generalization performance, despite often being mentioned as having a high risk of overfitting [REF].

LDA classification had the single best result during the test session and is therefore worthy of consideration. This was, however, achieved using a very low number of features (10), which is necessary when using LDA, and with a feature selection algorithm that used LDA internally. The ensemble of classifiers was almost always outperformed by another classifier, and it is therefore not worth the effort of training several classifiers using this method. The method used for ensemble learning in this study was, however, very simple; in a future study it would be more interesting to try complex variants of ensemble learning, as mentioned in Section 2.4.5.

3.2 Conclusions

The hypothesis that inter-session dissimilarities in ECoG data would not result in a decrease greater than 5% was not supported by the results of this study. The method combinations showed a decrease in classification accuracy of up to 20%, which we were unable to reduce further.

The detrimental effects of overfitting can be seen in the performance decrease between the sessions using Simulated Annealing feature selection. Simpler feature selection methods such as Fisher’s Ratio and greedy search algorithms using linear classifiers generalized better.

Linear and nonlinear classification algorithms performed roughly equally in this study. Since linear methods are sufficient, introducing the complexity of nonlinear classifiers is unnecessary. However, our results are far from putting an end to the ongoing debate of linear versus nonlinear classifiers in the field of BCI.

Making an ensemble classifier from a simple combination of classifiers was not effective, since it generally did not perform better than any one of the other classifiers did on its own. It was, however, in accordance with theory, more stable than the others.


It was rather late in our experiments that we found that scaling and normalization methods were very important to the outcome. Thorough data distribution analysis should be carried out early to better utilize appropriate data transformation methods. Getting to know your data is key.


Bibliography

[1] Wolpaw JR, Birbaumer N, Heetderks WJ, McFarland DJ, Peckham PH, Schalk G, et al. Brain-computer interface technology: a review of the first international meeting. IEEE Trans Rehabil Eng. 2000 Jun;8(2):164-73.

[2] Mak JN, Wolpaw JR. Clinical Applications of Brain-Computer Interfaces: Current State and Future Prospects. IEEE Rev Biomed Eng. 2009;2:187-99.

[3] Sellers EW, Donchin E. A P300-based brain-computer interface: initial tests by ALS patients. Clin Neurophysiol. 2006 Mar;117(3):538-48.

[4] Nijboer F, Sellers EW, Mellinger J, Jordan MA, Matuz T, Furdea A, et al. A P300-based brain-computer interface for people with amyotrophic lateral sclerosis. Clin Neurophysiol. 2008 Aug;119(8):1909-16.

[5] Pfurtscheller G, Guger C, Muller G, Krausz G, Neuper C. Brain oscillations control hand orthosis in a tetraplegic. Neurosci Lett. 2000 Oct 13;292(3):211-4.

[6] Cincotti F, Mattia D, Aloise F, Bufalari S, Schalk G, Oriolo G, et al. Non-invasive brain-computer interface system: towards its application as assistive technology. Brain Res Bull. 2008 Apr 15;75(6):796-803.

[7] Tangermann M, Krauledat M, Grzeska K, Sagebaum M, Blankertz B, Vidaurre C, et al. Playing Pinball with non-invasive BCI. In: Proceedings of NIPS. 2008, p. 1641-8.

[8] Mind Controlling Angry Birds [homepage on the Internet]. [updated 2011 May 6; cited 2013 Apr 11]. Available from http://www.feng-gui.com/research/MindControlAngryBirds/.

[9] Tan D, Nijholt A. Brain-Computer Interfaces and Human-Computer Interaction. In: Brain-Computer Interfaces: Applying Our Minds to Human-Computer Interaction. Springer London, 2010; p. 3-19.

[10] Lebedev MA, Nicolelis MA. Brain-machine interfaces: past, present and future. Trends Neurosci. 2006 Sep;29(9):536-46.

[11] Graimann B, Allison BZ, Pfurtscheller G. Brain-Computer Interfaces: A Gentle Introduction. In: Brain-Computer Interfaces: Revolutionizing Human-Computer Interaction. Springer Berlin Heidelberg, 2010; p. 1-27.

[12] Marsland S. Machine Learning: An Algorithmic Perspective. Chapman & Hall/CRC; 2009.

[13] Shenoy P, Krauledat M, Blankertz B, Rao RP, Müller KR. Towards adaptive classification for BCI. J Neural Eng. 2006 Mar;3(1):R13-23.

[14] Müller KR, Anderson CW, Birch GE. Linear and Nonlinear Methods for Brain-Computer Interfaces. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 2003; 11(2):165-9.

[15] Guyon I, Elisseeff A. An introduction to variable and feature selection. J. Mach. Learn. Res., 2003 Mar 1; 3:1157-82.

[16] BCI Competition III [homepage on the Internet]. [updated 2005 Jun 16; cited 2013 Apr 26]. Available from http://www.bbci.de/competition/iii/.

[17] Lotte F, Congedo M, Lécuyer A, Lamarche F, Arnaldi B. A review of classification algorithms for EEG-based brain-computer interfaces. J Neural Eng. 2007 Jun; 4(2):R1-R13.

[18] Ekeberg Ö. [lecture slides] KTH. Available from http://www.csc.kth.se/utbildning/kth/kurser/DD2432/ann13/forelasningsanteckningar/02-singlelayer-2x2.pdf.

[19] Multid Analyses [image on the Internet]. [updated 2013 Oct 8; cited 2013 Apr 12]. Available from http://www.multid.se/genex/SVM_illustration.png.

[20] Chang C, Lin C. LIBSVM: A library for support vector machines [homepage on the Internet]. [updated 2013 April; cited 2013 Feb 14]. Available from http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[21] Statistics Toolbox - Classification [homepage on the Internet]. [cited 2013 Apr 12]. Available from http://www.mathworks.se/products/statistics/examples.html?file=/products/demos/shipping/stats/classdemo.html.

[22] Dietterich TG. Ensemble Learning. In: Arbib MA, editor. The Handbook of Brain Theory and Neural Networks. Cambridge, MA: The MIT Press, 2002; p. 405-8.

[23] Cheng M, Jia W, Gao X, Gao S, Yang F. Mu rhythm-based cursor control: an offline analysis. Clin Neurophysiol. 2004 Apr;115(4):745-51.

[24] Reunanen J. Overfitting in Making Comparisons Between Variable Selection Methods. J. Mach. Learn. Res., 2003; 3:1371-82.

[25] Filippone M, Masulli F, Rovetta S. Simulated annealing for supervised gene selection. In: Proceedings of the International Joint Conference on Neural Networks, 2006; p. 3566-71.

[26] Guyon I. Feature selection, fundamentals and applications [video lecture]. 2007. Available from http://videolectures.net/mmdss07_guyon_fsf/.
