Redundant and Irrelevant Attribute Elimination using Autoencoders


TIM GRANSKOG

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION


Master in Computer Science

Date: July 6, 2017

Supervisor: Örjan Ekeberg

Examiner: Erik Fransén

Swedish title: Redundant och irrelevant attributeliminering med autoencoders


Abstract

Real-world data can often be high-dimensional and contain redundant or irrelevant attributes. High-dimensional data are problematic for machine learning as the high dimensionality causes learning to take more time and, unless the dataset is sufficiently large to provide an ample number of samples for each class, the accuracy will suffer. Redundant and irrelevant attributes cause the data to take on a higher dimensionality than necessary and obfuscate the important attributes.

Because of this, it is of interest to be able to reduce the dimensionality of the data whilst preserving the important attributes. Several techniques have been presented in the field of computer science in order to reduce the dimensionality of data. One of these is the autoencoder, an unsupervised neural network that uses its input as the target output. By limiting the number of neurons in the hidden layer, the autoencoder is forced to learn a lower dimensional representation of the data.

This study focuses on using the autoencoder to reduce the dimensionality, and eliminate irrelevant or redundant attributes, of four different datasets from different domains. The results show that the autoencoder can eliminate redundant attributes, that are a linear combination of the other attributes, and provide a better lower dimensional representation of the data than that of the unreduced data. However, in data that is gathered under a controlled and carefully managed situation, the autoencoder cannot always provide a better lower dimensional representation than the data with redundant attributes. Lastly, the results show that the autoencoder cannot eliminate irrelevant attributes which have no correlation to the class or other attributes.


Sammanfattning

Verklig data kan ofta vara högdimensionella och innehålla överflödiga eller irrelevanta attribut. Högdimensionell data är problematisk för maskininlärning, eftersom det medför att lärandet tar längre tid och om inte datasetet är tillräckligt stort för att ge ett tillräckligt antal instanser för varje klass kommer precisionen att drabbas. Överflödiga och irrelevanta attribut gör att datan får en högre dimension än vad som är nödvändigt och gör det svårare att avgöra vilka de viktiga attributen är. På grund av detta är det av intresse att kunna reducera datans dimensionalitet samtidigt som de viktiga attributen bevaras.

Flera tekniker har presenterats för dimensionsreducering av data. En utav dessa tekniker är autoencodern, som är ett oövervakat lärande neuralt nätverk som använder sin indata som målutdata, och genom att begränsa antalet neuroner i det dolda lagret tvingas autoencodern att lära sig en representation av datan i en lägre dimension.

Denna studie fokuserar på att använda autoencodern för att minska dimensionerna och eliminera irrelevanta eller överflödiga attribut, av fyra olika dataset från olika domäner. Resultaten visar att autoencodern kan eliminera redundanta attribut, som är en linjär kombination av de andra attributen, och ge en bättre lägre dimensionell representation av datan än den ej reducerade datan. I data som samlats in under en kontrollerad och noggrant hanterad situation kan emellertid autoencodern inte alltid ge en bättre lägre dimensionell representation än datan med redundanta attribut. Slutligen visar resultaten att autoencodern inte kan eliminera irrelevanta attribut, som inte har någon korrelation med klassen eller andra attribut.


Contents

1 Introduction

1.1 Research Question

2 Background

2.1 Neural Networks

2.2 Dimensionality Reduction

2.3 Variations of the Autoencoder

2.4 Related Research

3 Method

3.1 The Autoencoder

3.2 Datasets

3.3 Training and Testing

3.4 Inducing Additional Attributes

4 Results

4.1 Irrelevant Attributes

4.2 Redundant Attributes

5 Discussion

5.1 Results

5.2 Method

5.3 Ethical Consequences

5.4 Future Work

6 Conclusion

Bibliography


1 Introduction

Computers have been designed to solve some tasks much faster than is possible for humans, in particular tasks that involve complex formulas or rules. However, it turns out that computers have a hard time solving things a person can do intuitively or almost automatically, like recognizing a face or a number in a picture. The problem is that computers solve problems by following a set of predefined rules, whereas for problems that we might find trivial we intuitively draw on a lot of experience and information for which no easily defined rules exist. [7, Ch. 1]

Another problem for computers is that the real world by its nature has many degrees of freedom, which means that for many areas of research and commerce, the data they work with is high dimensional [10, Ch. 1]. Furthermore, real world data is not only high dimensional because there are many intrinsic attributes, the attributes that cause the observed class, that need to be taken into account. Sometimes it is not exactly clear what should be measured, which can result in some measured attributes being redundant, as they are in actuality a combination of other attributes. In other cases the measured attributes can instead be irrelevant, as they have no correlation to the classification, or otherwise damaging. This has been observed both in a paper [5] on arrhythmia, where the authors achieved the highest accuracy using 21 out of the 278 attributes, and in the detection of heart disease, where one report found the highest accuracy using six out of the original 13 attributes [21].

In other cases one or several measured attributes that seem to be intrinsic attributes are actually caused by an unobserved, or in other words latent, attribute. In both cases the data takes on a higher dimensionality than necessary. The curse of dimensionality is a well-known problem in machine learning that causes learning to take longer and yield worse results as the dimensionality climbs, so many areas rely on techniques for dimensionality reduction to achieve better results or even to make learning possible. [10, Ch. 1]

To solve the problem of defining what characterizes an object, and since it is far too time-consuming to define these rules manually, especially for more complex problems, we can use the fact that computers are good at repetitive tasks: the computer is given a large number of examples of what we want it to learn about and creates its own idea of what characterizes the data. This process is sometimes called representation learning or feature learning and is used to make a future task such as classification easier by presenting a more robust representation of the gathered data [7, Ch. 15]. One such representation learning system that has been used to try to alleviate the problems of dimensionality is the autoencoder. An autoencoder is a neural network that tries to match its output with its input, and by limiting the number of neurons in the hidden layer it is forced to extract a lower dimensional representation of the data [12]. For the autoencoder to provide the best possible representation for a classifier, it would need to accurately extract all the intrinsic attributes, which is made more difficult as the number of redundant and irrelevant attributes increases. To this end, it is of interest to discover just how the number of redundant and irrelevant attributes affects the capabilities of the autoencoder.

1.1 Research Question

Redundant and irrelevant attributes occur in real world data, and the more of them there are obfuscating the intrinsic attributes, the harder it is to find those intrinsic attributes. This paper, therefore, examines the question: can dimensionality reduction using autoencoders extract low-dimensional and robust representations from high-dimensional data by eliminating redundant or irrelevant attributes?


2 Background

2.1 Neural Networks

The smallest constitutive unit of a neural network is the neuron. A neuron takes its input from the previous layer, calculates a pre-activation value, applies an activation function to the pre-activation value and outputs the result to the next layer. The pre-activation of a neuron is

$$f(x) = b + \sum_i w_i x_i \tag{2.1}$$

where $b$ is the bias of the neuron, $x_i$ is an input from the previous layer and $w_i$ is the weight for that particular connection. On its own this would produce only a linear activation, so in order to achieve non-linearity in the neuron an activation function $g$ is applied

$$h(x) = g(f(x)) = g\left(b + \sum_i w_i x_i\right) \tag{2.2}$$

which gives the neuron its final output. [7, Ch. 6]
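As an illustration of equations 2.1 and 2.2, the following minimal sketch computes the pre-activation and the output of a single neuron. The function names, the example inputs and weights, and the choice of a sigmoid for g are assumptions made only for this example.

```python
import numpy as np

def pre_activation(x, w, b):
    """Equation 2.1: f(x) = b + sum_i w_i * x_i."""
    return b + np.dot(w, x)

def sigmoid(z):
    """One possible activation function g (an assumption of this sketch)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(x, w, b, g=sigmoid):
    """Equation 2.2: h(x) = g(f(x))."""
    return g(pre_activation(x, w, b))

# Hypothetical inputs, weights and bias, chosen only for illustration.
x = np.array([1.0, 0.0])
w = np.array([0.5, -0.5])
b = -0.5

print(pre_activation(x, w, b))  # 0.0
print(neuron_output(x, w, b))   # 0.5
```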

A single neuron is not very powerful by itself, but it can perform some simple linear classifications such as the OR and AND operations, which is illustrated in figure 2.1. However, when the classification boundary is not linear, as is the case with XOR in figure 2.2, the single neuron cannot perform the classification correctly. In some cases, it is possible to change how a problem is represented or learn something about it that makes it solvable. In the case of XOR it is possible to change the input from two booleans into the result of two AND operations by adding a hidden layer in between the input and the single neuron, which makes it possible for the single neuron to classify XOR input correctly, as shown in figure 2.3. This is the basis of the idea of having layers of neurons, where an earlier layer can find something special about the data and then pass it along to the next layer, a process called forward propagation, making it possible for each consecutive layer to learn more and more abstract features of the data [7, Ch. 6].
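The XOR construction described above can be sketched with hand-picked weights. The weights, the threshold activation and the function names below are assumptions chosen for illustration and are not learned by training: each hidden neuron computes an AND of one input with the negation of the other, and the output neuron computes an OR of the two hidden neurons.

```python
import numpy as np

def step(z):
    """Threshold activation: 1 where the pre-activation is positive, else 0."""
    return (np.asarray(z) > 0).astype(int)

def xor_network(x):
    """Two-layer network with hand-picked (not learned) weights."""
    # Hidden layer: h1 = x1 AND NOT x2, h2 = NOT x1 AND x2.
    W_hidden = np.array([[1.0, -1.0],
                         [-1.0, 1.0]])
    b_hidden = np.array([-0.5, -0.5])
    h = step(W_hidden @ x + b_hidden)
    # Output neuron: OR of the two hidden neurons.
    w_out = np.array([1.0, 1.0])
    b_out = -0.5
    return int(step(w_out @ h + b_out))

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [xor_network(np.array(p, dtype=float)) for p in inputs]
print(outputs)  # [0, 1, 1, 0]
```

No single neuron with a threshold activation can produce this output directly from the two inputs, since the two classes are not linearly separable in the original input space.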

Figure 2.1: Linear classification of the OR and AND operations.

Figure 2.2: The initial non-linear case of the XOR operation, and XOR in a changed basis that makes it linearly separable.


[Figure 2.3 diagrams. Left: an input layer (x₁, x₂) connected directly to an output layer (y₁). Right: an input layer (x₁, x₂), a hidden layer, and an output layer (y₁).]

Figure 2.3: To the left: A single neuron network which can learn to classify the OR and AND operations. To the right: A small neural network in which the hidden layer can learn AND and allow the output neuron to learn the XOR operation.

In practice, neural networks are usually far more complex than the one in figure 2.3, and since different networks are suited to different tasks there are several factors to consider, such as the number of layers, the number of neurons in each layer and the aforementioned activation functions.

More layers and neurons can improve the capabilities of the network; however, the network will also take longer to train. Other choices, such as whether to use fully connected or partially connected layers and how the data flows, either forward only as in feed-forward networks or with connections going backward as in recurrent neural networks, can be highly problem-specific and should be thoroughly researched. [7, Ch. 11]

One last part to consider when designing a neural network is deciding what cost function to use. The cost function defines how close the output of the network is to the correct output, and is therefore needed during training to ensure the network is moving towards the correct goal. Just like the structure of the neural network, the cost function is problem-specific. [7, Ch. 6]

When the neural network has been implemented it can be trained through backpropagation. This is done by first computing the gradient of the cost function with respect to each individual output neuron. Next, moving backward one layer at a time, the gradients of the activation functions are used to compute the gradient with respect to the weights and the bias in each layer.

This process can be done in a number of ways. The first is batch gradient descent, which for each epoch, or iteration of training, goes through the whole training set and calculates the total cost function before performing backpropagation. Another way is to use stochastic gradient descent, which only looks at one training example at a time before performing backpropagation. However, most neural network learning algorithms use something in between called mini-batch gradient descent, where n examples of the training data are looked at together. [7, Ch. 8]
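The difference between the three variants comes down to how many training examples are used for each weight update. The sketch below is a minimal illustration under stated assumptions: it fits a simple linear model rather than a neural network, and the function name, learning rate, number of epochs and synthetic data are all hypothetical.

```python
import numpy as np

def train_linear(X, y, batch_size, epochs=200, lr=0.1, seed=0):
    """Fit y ~ X @ w with gradient descent on the mean squared error.

    batch_size == len(X) gives batch gradient descent,
    batch_size == 1 gives stochastic gradient descent,
    anything in between gives mini-batch gradient descent.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(X))  # reshuffle the examples every epoch
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            error = X[idx] @ w - y[idx]
            grad = 2.0 * X[idx].T @ error / len(idx)  # gradient of the batch MSE
            w -= lr * grad  # one weight update per batch
    return w

# Synthetic, noise-free data generated from hypothetical weights.
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(64, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

for batch_size in (64, 8, 1):
    print(batch_size, np.round(train_linear(X, y, batch_size), 3))
```

With 64 examples, a batch size of 64 performs one update per epoch, a batch size of 8 performs eight, and a batch size of 1 performs 64.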

2.2 Dimensionality Reduction

The concept of dimensionality reduction is quite old: one of the arguably most well-known algorithms, Principal Component Analysis (PCA), was first introduced by Pearson in 1901. The assumption on which PCA is based is that our n observed variables are a linear transformation of i < n unknown latent variables with a transformation matrix T. [10, Ch. 2.4]
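The linear assumption behind PCA can be illustrated with a small sketch. It computes PCA through a singular value decomposition, which is one common way of doing so, and applies it to synthetic data where two latent variables are mapped linearly to five observed ones. The function name, the data and the matrix T below are assumptions for illustration only.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the rows of X onto their first n_components principal axes."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T, singular_values

# Hypothetical data: 2 latent variables mapped linearly to 5 observed ones.
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 2))
T = rng.normal(size=(2, 5))  # assumed transformation matrix
observed = latent @ T

reduced, singular_values = pca_reduce(observed, n_components=2)
print(reduced.shape)  # (100, 2)
print(singular_values)
```

Because the five observed variables are generated from only two latent variables, all but the first two singular values are zero up to rounding error.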

The use of kernel functions in PCA has been common practice to extract nonlinear properties based on a linear representation in a high dimensional space. However, these kernel function methods do not scale well with increasing sizes of data, as they are based on matrix representations of the data in a higher dimensional space, which grow far too large as the amount of data increases. One way to combat this has been to use randomized algorithms, which scale better and thus allow faster training whilst keeping similar accuracy. However, they often fail to extract discriminative features of the data in favor of creating a simple representation, which limits their classification accuracy. [3]

Autoencoders are a type of fully connected feed-forward network that was first introduced in the 1980s by Hinton and the PDP group [17] to address the problem of “backpropagation without a teacher”, by using the input data also as the goal for the output. Hinton showed that if each layer of the network is pre-trained using an autoencoder to learn a sparse representation before the classification task, the learning problem is greatly reduced. In 2006, Hinton et al. showed how training a stack of Restricted Boltzmann Machines (RBMs) and unrolling them into a multi-layered autoencoder outperforms PCA in compressing and reconstructing images as well as creating two-dimensional codes for digits. They also showed that the multi-layered autoencoder outperformed Latent Semantic Analysis, a document retrieval algorithm based on PCA, when retrieving documents of a given class. [8]

Between the input and output layers of an autoencoder, there is a hidden layer. If the hidden layer has the same number of neurons as, or more neurons than, the input and output layers, the autoencoder is said to be an over-complete autoencoder, and if it has fewer it is an under-complete autoencoder. It is by using an under-complete autoencoder that the dimensionality of the input data can be reduced. This is done by encoding the data to the hidden layer and minimizing the reconstruction error at the output [12].

The reason this is possible is that by limiting the number of neurons available in the hidden layer to less than that of the input layer the autoencoder is forced to learn the most important attributes of the input data [7, Ch. 14].
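A minimal sketch of an under-complete autoencoder is given below, written with plain NumPy so that every step is visible. The tanh hidden layer, the linear output layer, the mean squared reconstruction error, the hyperparameters and the synthetic data are all assumptions of this sketch and are not the configuration used in this study.

```python
import numpy as np

def train_autoencoder(X, n_hidden, epochs=2000, lr=0.1, seed=0):
    """Train a one-hidden-layer autoencoder with batch gradient descent."""
    rng = np.random.default_rng(seed)
    n_inputs = X.shape[1]
    W_enc = rng.normal(scale=0.1, size=(n_inputs, n_hidden))
    b_enc = np.zeros(n_hidden)
    W_dec = rng.normal(scale=0.1, size=(n_hidden, n_inputs))
    b_dec = np.zeros(n_inputs)
    losses = []
    for _ in range(epochs):
        code = np.tanh(X @ W_enc + b_enc)       # encode to the hidden layer
        reconstruction = code @ W_dec + b_dec   # decode back to input size
        diff = reconstruction - X               # the input is the target
        losses.append(float(np.mean(diff ** 2)))
        # Backpropagation of the mean squared reconstruction error.
        grad_out = 2.0 * diff / diff.size
        grad_code = (grad_out @ W_dec.T) * (1.0 - code ** 2)
        W_dec -= lr * (code.T @ grad_out)
        b_dec -= lr * grad_out.sum(axis=0)
        W_enc -= lr * (X.T @ grad_code)
        b_enc -= lr * grad_code.sum(axis=0)

    def encode(data):
        """Map data to its lower dimensional representation."""
        return np.tanh(data @ W_enc + b_enc)

    return encode, losses

# Hypothetical data: 5 observed attributes generated from 2 latent ones.
rng = np.random.default_rng(1)
latent = rng.uniform(size=(100, 2))
X = latent @ rng.uniform(size=(2, 5))
X = X / X.max()  # scale the values into (0, 1]

encode, losses = train_autoencoder(X, n_hidden=2)
print(encode(X).shape)  # (100, 2)
print(losses[0], losses[-1])
```

The hidden layer has two neurons for five inputs, so the network cannot copy its input and has to find a compressed code from which the input can be reconstructed.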

2.3 Variations of the Autoencoder

Other than the under-complete and over-complete autoencoders, there exist several other types of autoencoders designed for different tasks. Some of the more common ones include Convolutional Autoencoders, Denoising Autoencoders, and Sparse Autoencoders.

The convolutional autoencoder is not fully connected like many of the other autoencoders. Instead, each hidden neuron only has connections from a selected number of input neurons. This is because convolutional autoencoders are designed to work on data with a known grid-like structure, such as images, where we know that points on the structure that are close together can be a part of the same feature. The weights are also shared across the neurons so that the convolutional autoencoder can find localized features that repeat themselves in the input. [14]

Both sparse autoencoders and denoising autoencoders are over-complete and work on the same idea: we want to learn a good internal representation of the data whilst not simply learning the identity function. The sparse autoencoder has a constraint that limits the activation of its neurons, which means each neuron is forced to ignore some of the inputs it is receiving. Similarly to the under-complete autoencoder, this forces the neuron to decide which inputs are important. For denoising autoencoders it is instead the input that is changed. The denoising autoencoder receives input that has been corrupted and is trained to reconstruct the uncorrupted original data. Since it cannot simply copy the input anymore it is instead forced to learn what is important about the data. [7, Ch. 14]


There are many ways to alter the autoencoder, and other more task-specific variations have been used to solve natural language processing tasks such as predicting the sentiment of a sentence [20] and detecting paraphrases [19].

2.4 Related Research

As mentioned in the introduction there has been some research into removing redundant and irrelevant features. Cohen, Ruppin, and Dror [5] developed an iterative algorithm that uses a classifier as a black box for ranking each attribute and achieved state of the art accuracy on several datasets. Bhatia and Pillai [21] evaluated a genetic algorithm that attempted to find the best subset of attributes for an SVM from a heart disease dataset, and achieved the highest accuracy with six out of the original 13 attributes. What makes an autoencoder based method different from these is that they are supervised, with the goal of improving the accuracy of a given classifier, whilst the autoencoder is unsupervised and works towards finding a robust lower dimensional representation.

While looking at related research in the area of neural networks and the autoencoder, we found only a few reports which studied the capabilities and effects of dimensionality reduction using an autoencoder; the rest are focused on applying an autoencoder to a problem. Wang et al. [25] looked at how well an under-complete autoencoder could reduce the dimensionality and found that the autoencoder is unstable, as it can produce different low-dimensional representations. However, the autoencoder did manage to detect repetitive elements in the input and map them to the same point in the lower dimension, thus removing redundant data. They also found that when reducing the dimensionality of data they got the highest accuracy when the number of hidden neurons was the same as the intrinsic dimensionality of the data. Another study of dimensionality reduction using an autoencoder was done in 2008 by a medical research group which used a variety of non-linear dimensionality reduction algorithms on a molecule which could be represented on a 2-manifold in Cartesian coordinates. They found that even though the autoencoder was the slowest to train it outperformed the other algorithms in creating low-dimensional representations [1].


One area of research in computer science which uses autoencoders and has achieved significant improvements in recent years is the area of image recognition tasks. When the ImageNet Large Scale Visual Recognition Challenge (ILSVRC)¹ was held in 2011 the error rate of the winning team was 25.2%. When Krizhevsky et al. entered in 2012 with a Convolutional Neural Network (CNN) they managed to achieve an error rate of 16.4%, and in 2014 the lowest error rate was 6.4% [18]. In 2011 Jonathan Masci et al. showed that using a Convolutional Autoencoder to pre-train a CNN can improve its classification rate slightly but consistently [14]. This can be especially useful for very complex networks, as they can take very long to train and as the error rate gets lower it gets more difficult to improve the accuracy.

CNNs are not infallible, however; A. Nguyen et al. [15] showed that it is possible to use genetic algorithms to generate images that look nothing like what the neural network believes them to be. This could possibly also have consequences for systems using autoencoders as a hashing method for retrieval, which has been used because of its speedy performance in large databases, such as the x-ray database with 100 million entries worked on by A. Sze-To et al. [22].

Originally autoencoders were made to train without labeled data, using their own input as the goal, but in 2014 the concept of a Generalized Autoencoder [24] was presented. The idea is that instead of inputting a training example $x$ and matching it to itself, it could be matched to a set $\{x_1, x_2, \ldots, x_i\}$. The set, in this case, would be data of the same class as $x$, and their research showed improved classification on both faces and digits.

In research on deep networks and special activation functions, it has been shown that rectifying neurons, that is neurons with the rectified linear unit (ReLU) activation function, perform well "creating sparse representations with true zeros, which seem remarkably suitable for naturally sparse data" and can also reach their best performance without needing any pre-training [6]. More recent research into alternatives to ReLU in 2016 [4] yielded the Exponential Linear Unit (ELU) function, which sped up learning and, when used in a convolutional neural network, set a new state of the art on CIFAR-100.

Some closing remarks on the state of deep learning and unsupervised learning methods such as the autoencoder can be found in a recent review by Yann LeCun et al. [9], where they state that unsupervised learning methods are mainly to thank for the regained interest in machine learning. Furthermore, even though unsupervised methods have largely been replaced by supervised learning, they believe unsupervised learning will return since humans learn in a mostly unsupervised manner.

¹ http://www.image-net.org/challenges/LSVRC/


3 Method

To answer the question of whether an autoencoder can create a good lower dimensional representation by eliminating redundant and irrelevant attributes, we would ideally want to try every autoencoder architecture and test it on every available dataset. However, neither of these is feasible in reality, so a methodology which provides representative results has to be established. First of all, a measure of how well the autoencoder performs had to be established, and to that end it was decided to compare the accuracy of a classifier on the reduced data to its accuracy on the unreduced data. How the autoencoder was structured, which datasets were used, and how training and testing were done will be discussed in the following sections.

3.1 The Autoencoder

As the autoencoder uses its input as the target output, the numbers of input and output neurons are both decided by the data. What is left to decide on is the number of hidden neurons, the number of layers, and the activation functions. For the number of layers and neurons it is not possible to test all permutations, as there would be far too many for anything but very low dimensional inputs. It was therefore decided to use a shallow autoencoder with one hidden layer and to choose the number of hidden neurons in the manner of a binary search. The first test reduced the dimensionality by half. If the classifier performed worse on the reduced data than on the unreduced data, the data was reduced to 75% of its original dimensionality in the next test phase, and if it performed better the data was reduced to 25%.
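One way to express this search is sketched below. The function names and the rounding of the number of hidden neurons are assumptions made for illustration; the text above only specifies the 50%, 75% and 25% steps.

```python
def next_reduction(low, high, reduced_is_better):
    """One step of a binary search over the fraction of dimensions kept.

    low and high bound the fractions still under consideration; the
    fraction that was just tested is their midpoint.
    """
    tested = (low + high) / 2.0
    if reduced_is_better:
        high = tested  # try an even smaller hidden layer next
    else:
        low = tested   # back off towards the original dimensionality
    return low, high, (low + high) / 2.0

def hidden_neurons(n_inputs, fraction):
    """Number of hidden neurons for a given fraction, at least one."""
    return max(1, round(n_inputs * fraction))

# The first test keeps 50% of the dimensions (the midpoint of 0 and 1).
print(next_reduction(0.0, 1.0, reduced_is_better=False)[2])  # 0.75
print(next_reduction(0.0, 1.0, reduced_is_better=True)[2])   # 0.25
print(hidden_neurons(48, 0.25))                              # 12
```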


To determine the activation functions, some preliminary tests were done on the MNIST¹ dataset, and the Iris² dataset which is described in greater detail in the next section, using the older, yet common sigmoid and tanh activation functions [6], the currently common ReLU activation function [7, Ch. 6] and the ELU activation function.

Figure 3.1: On top: Original data with 784 (28×28) dimensions. The rest are compressed from 784 to 128 to 64 to 32 using only, from top to bottom: ReLU, sigmoid, tanh and ELU.

Figure 3.1 shows the effect of training an autoencoder network on the MNIST dataset with only one type of activation function in all the layers. This shows that none of them works really well as the only activation function present in a neural network. Further tests were therefore done on the Iris dataset to test all combinations of the activation functions, the results of which are shown in figure 3.2.

¹ http://yann.lecun.com/exdb/mnist/

² http://archive.ics.uci.edu/ml/datasets/Iris


Figure 3.2: Loss and Accuracy for the Iris dataset when the dimensionality has been reduced to 50%. Rows are activation function in the hidden layer. Columns are the activation function in the output layer. Best values are marked in bold.

The preliminary tests on the Iris dataset show that tanh-sigmoid, ELU-sigmoid and ELU-tanh all get the same loss. However, the combination of ELU and sigmoid got the best accuracy. Based on these results it was decided to use the ELU activation function in the hidden layer and the sigmoid activation function in the output layer.
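The chosen combination can be sketched as a forward pass through a shallow autoencoder with an ELU hidden layer and a sigmoid output layer. This is a NumPy illustration and not the Keras code used in the tests; the function names, the value alpha = 1, the random weights and the example data are assumptions.

```python
import numpy as np

def elu(z, alpha=1.0):
    """Exponential Linear Unit: z for z > 0, alpha * (exp(z) - 1) otherwise."""
    return np.where(z > 0, z, alpha * (np.exp(np.minimum(z, 0.0)) - 1.0))

def sigmoid(z):
    """Logistic function, bounded between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def autoencoder_forward(X, W_enc, b_enc, W_dec, b_dec):
    """ELU in the hidden layer, sigmoid in the output layer."""
    code = elu(X @ W_enc + b_enc)
    reconstruction = sigmoid(code @ W_dec + b_dec)
    return code, reconstruction

def cross_entropy(target, output, eps=1e-12):
    """Mean element-wise cross-entropy between targets and outputs in [0, 1]."""
    output = np.clip(output, eps, 1.0 - eps)
    return -np.mean(target * np.log(output)
                    + (1.0 - target) * np.log(1.0 - output))

# Hypothetical normalized data with 4 attributes, reduced to 2 dimensions.
rng = np.random.default_rng(0)
X = rng.uniform(size=(5, 4))
W_enc, b_enc = rng.normal(size=(4, 2)), np.zeros(2)
W_dec, b_dec = rng.normal(size=(2, 4)), np.zeros(4)

code, reconstruction = autoencoder_forward(X, W_enc, b_enc, W_dec, b_dec)
print(code.shape, reconstruction.shape)  # (5, 2) (5, 4)
print(cross_entropy(X, reconstruction))
```

Since the sigmoid output is bounded between 0 and 1, the input data has to lie in the same range for the reconstruction to be able to match it.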

3.2 Datasets

In order to provide more representative results the datasets were picked from different domains. In total four datasets were collected from the UCI Machine Learning Repository [11]. The first dataset used in the experiments is the Iris dataset, which contains three different classes of 50 instances each. The classes in the dataset are three types of Iris plants and each instance has four real-valued attributes. One of these classes is linearly separable from the other two, whilst the other two are not linearly separable from each other.

The next dataset used for testing is the Arrhythmia³ dataset, which has 16 classes corresponding to either one of 15 types of arrhythmia or no arrhythmia in the heart. This data is a collection of 278 measurements including real-valued attributes, such as age and weight, and nominal attributes such as sex. The arrhythmia dataset is imbalanced, with 245 out of the 452 instances belonging to the "no arrhythmia" class whilst the second and third largest classes have 50 and 44 instances respectively. Furthermore, three of the classes have no instances in the dataset. Some instances also had missing values, which was handled by setting any value marked with a question mark to zero.

³ http://archive.ics.uci.edu/ml/datasets/arrhythmia

The third dataset is the Drive⁴ dataset which has 48 real-valued attributes corresponding to measurements from currents through engine components. The purpose of these measurements is to detect if the engine is currently suffering from one of 10 possible errors or is running as intended. For each of the 11 classes the dataset has 5319 instances, meaning that it is a balanced dataset.

The last data set is the Glass⁵ dataset which has 10 real-valued attributes pertaining to the oxide content, such as the amount of iron, of seven different types of glass. This dataset is unbalanced with 70 and 76 instances, out of 214 total instances, for the two different types of building window glass as well as zero instances for one type of glass.

For all of the datasets there is only one class label per instance.

3.3 Training and Testing

An autoencoder on its own cannot perform the task of classifying a data set. For this, a classifier is needed and common choices for neural networks are a Softmax layer or a Support Vector Machine (SVM) with one outperforming the other in some cases, whilst being outperformed in other cases [23] [16]. The difference between them is that the SVM outputs a score, that can be any real number, for each class given whilst the Softmax classifier outputs a probability, between 0 and 1, for each class [23]. In this study the Softmax classifier was used since they both perform similarly but the Softmax classifier gives a more intuitive output.
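The Softmax output can be sketched as follows. The function name and the example scores are hypothetical; the function turns arbitrary real-valued class scores into probabilities that sum to one.

```python
import numpy as np

def softmax(scores):
    """Turn real-valued class scores into probabilities."""
    # Subtracting the maximum does not change the result but avoids overflow.
    shifted = scores - np.max(scores, axis=-1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)

scores = np.array([2.0, 1.0, -1.0])  # hypothetical scores for three classes
probabilities = softmax(scores)
print(probabilities)
print(int(np.argmax(probabilities)))  # 0
```

The predicted class is the one with the highest probability, which is also the one with the highest score.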

Initial tests using the Iris dataset showed that the autoencoder converged at 1000 epochs of training and the classifiers converged after 750 epochs, so these numbers of epochs were used for all tests. To lower the risk of over-fitting, the weights from the epoch with the best results on the validation set are saved for both the autoencoder and the classifiers. Before the autoencoder is used to compress the data its saved weights are loaded, and similarly, before the classifiers are evaluated on the training set, their saved weights are loaded.

⁴ http://archive.ics.uci.edu/ml/datasets/Dataset+for+Sensorless+Drive+Diagnosis

⁵ https://archive.ics.uci.edu/ml/datasets/Glass+Identification

Before training, the data had to be normalized as the sigmoid activation function used in the output layer of the autoencoder has a lower and upper bound of 0 and 1 respectively. This was done on each attribute independently by subtracting the minimum value of that attribute from all instances and then dividing by the maximum value.
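The per-attribute normalization can be sketched as below. The sketch assumes that the maximum value used for the division is the maximum of the attribute after the minimum has been subtracted, which maps every attribute to the range 0 to 1; the handling of constant attributes and the function name are also assumptions.

```python
import numpy as np

def min_max_normalize(X):
    """Scale every attribute (column) of X to the range [0, 1]."""
    shifted = X - X.min(axis=0)          # subtract the per-attribute minimum
    max_shifted = shifted.max(axis=0)    # maximum after the shift
    max_shifted[max_shifted == 0] = 1.0  # constant attributes stay at zero
    return shifted / max_shifted

# Small hypothetical example with two attributes on different scales.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 40.0]])
print(min_max_normalize(X))
```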

For training and testing, the data is randomly shuffled and split into 70% training data, 20% validation data and 10% testing data. Both the autoencoder and the classifiers used the cross-entropy function as their cost function. The process of training and then testing is repeated five times with different training, validation and test sets, which is called 5-fold cross-validation, and was done to lower the chance that we just happened to pick a particularly easy or difficult combination of sets. This 5-fold cross-validation is then repeated with the same training, validation and test sets but with newly generated additional attributes, which was done to lower the chance of having generated a set of attributes that is particularly easy, or difficult, to distinguish. For the Iris and Glass datasets the whole process was repeated once more in order to reduce their confidence intervals.
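The shuffling and splitting can be sketched as follows. The function name, the rounding of the set sizes and the placeholder data are assumptions made for illustration.

```python
import numpy as np

def shuffle_split(X, y, rng, train_fraction=0.7, validation_fraction=0.2):
    """Shuffle the instances and split them into three disjoint sets."""
    order = rng.permutation(len(X))
    n_train = int(round(train_fraction * len(X)))
    n_validation = int(round(validation_fraction * len(X)))
    train, validation, test = np.split(order, [n_train, n_train + n_validation])
    return (X[train], y[train]), (X[validation], y[validation]), (X[test], y[test])

# Placeholder data with the size of the Iris dataset (150 instances, 4 attributes).
X = np.arange(600, dtype=float).reshape(150, 4)
y = np.repeat([0, 1, 2], 50)

rng = np.random.default_rng(0)
for run in range(5):  # five repetitions with different random splits
    train, validation, test = shuffle_split(X, y, rng)
    print(run, len(train[0]), len(validation[0]), len(test[0]))
```

For 150 instances every repetition yields 105 training, 30 validation and 15 test instances.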

The same generated data set is given to each autoencoder as well as the Softmax classifier for the unreduced data to make it as fair as possible.

Lastly, the test programs were all written in Python 3.5 using the Keras [2] and TensorFlow [13] deep learning libraries. All tests were run using an NVIDIA GeForce GTX 970 graphics card on a 64-bit Windows 10 computer with a 3.2 GHz Intel® Core™ i5-6500 processor (4 cores), 8 GB RAM, and a 1 TB disk (SATA, 7200 rpm).

3.4 Inducing Additional Attributes

For the tests with additional irrelevant attributes, such as in figure 4.1, the normalization was done first. After the normalization, values were generated from a uniform distribution between zero and one and added to all instances as new attributes. This randomization was done so that there is no correlation between the additional attributes and the class or the other attributes.
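A minimal sketch of this step is shown below. The function name and the example data are hypothetical.

```python
import numpy as np

def add_irrelevant_attributes(X_normalized, n_extra, rng):
    """Append n_extra uniformly random attributes to already normalized data."""
    noise = rng.uniform(0.0, 1.0, size=(X_normalized.shape[0], n_extra))
    return np.hstack([X_normalized, noise])

rng = np.random.default_rng(0)
X_normalized = rng.uniform(size=(150, 4))  # placeholder for normalized data
X_augmented = add_irrelevant_attributes(X_normalized, n_extra=8, rng=rng)
print(X_augmented.shape)  # (150, 12)
```

The new columns are drawn independently of both the original attributes and the class labels, and already lie in the same range as the normalized data.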

Additional redundant attributes, such as in the test whose results are displayed in figure 4.9, were added before normalization. These attributes were created by uniformly randomizing a weight between zero and one for each of the original attributes, multiplying each weight with the corresponding attribute and then summing the products. This generates a new attribute that is a linear combination of the existing attributes, and as such has a correlation to both the class and the original attributes.
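The generation of redundant attributes can be sketched as a matrix product, where each column of the weight matrix holds the random weights for one new attribute. The function name and the example data are hypothetical.

```python
import numpy as np

def add_redundant_attributes(X, n_extra, rng):
    """Append n_extra random linear combinations of the original attributes."""
    # One weight between zero and one per original attribute and new attribute.
    weights = rng.uniform(0.0, 1.0, size=(X.shape[1], n_extra))
    return np.hstack([X, X @ weights])

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))  # placeholder for unnormalized data
X_augmented = add_redundant_attributes(X, n_extra=3, rng=rng)
print(X_augmented.shape)                   # (150, 7)
print(np.linalg.matrix_rank(X_augmented))  # 4
```

The rank of the augmented data stays at four, which shows that the new attributes carry no information beyond the original ones.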

Each test started with no added attributes, after which additional attributes equal to half the number of original attributes were added, then equal to the number of original attributes, after which the number of additional attributes was doubled in each step. So for the tests on the Iris dataset, shown in figures 4.1 to 4.2 and figures 4.9 to 4.10, the number of additional attributes becomes 0, 2, 4, 8, 16 and so on. This was done in order to speed up testing compared to adding a constant number of attributes in each iteration until a trend could be found.
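The resulting schedule can be sketched as follows. The function name, the upper limit parameter and the use of integer division for an odd number of original attributes are assumptions.

```python
def additional_attribute_schedule(n_original, max_extra):
    """Numbers of additional attributes: 0, then n/2, n, 2n, 4n, ..."""
    counts = [0]
    extra = max(1, n_original // 2)
    while extra <= max_extra:
        counts.append(extra)
        extra *= 2
    return counts

# The Iris dataset has four original attributes.
print(additional_attribute_schedule(4, 256))
# [0, 2, 4, 8, 16, 32, 64, 128, 256]
```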


Results

In the results the average error rate of the Softmax classifier is presented for runs on the four data sets with varying amounts of additional attributes, displayed along the x-axis, as well as with varying amounts of compression, which is displayed by different colors.

These average error rates are shown with their 95% confidence intervals to visualize the effects additional attributes have on the autoencoder. For each of the datasets, the spread of the error rates is displayed for the uncompressed data and for the data compressed by one of the autoencoders. The spreads are shown to present possible trends in how the additional attributes affect the autoencoder. Only one compressed spread is shown per dataset in order to present the results in a more compact format.
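How the confidence intervals were computed is not stated here. As one possible reading, the sketch below uses the normal approximation (mean plus or minus 1.96 standard errors) over the error rates of the individual runs; the function name `mean_with_ci95` and the example values are made up for illustration. With only around ten runs, a Student's t quantile would give a somewhat wider interval.

```python
import numpy as np

def mean_with_ci95(error_rates):
    """Return (mean, lower, upper) of a normal-approximation 95% interval."""
    rates = np.asarray(error_rates, dtype=float)
    mean = rates.mean()
    # Standard error of the mean, using the sample standard deviation.
    sem = rates.std(ddof=1) / np.sqrt(rates.size)
    half_width = 1.96 * sem  # assumption: normal approximation
    return mean, mean - half_width, mean + half_width

# Hypothetical error rates from three runs, for illustration only.
mean, lower, upper = mean_with_ci95([0.1, 0.2, 0.3])
```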

4.1 Irrelevant Attributes

Figure 4.1: Error rate of the Softmax classifier on compressed and uncompressed data of the Iris dataset.


The first tests were run on the Iris dataset with up to 256 additional attributes. In figure 4.1 above it can be observed that the autoencoder lowers the classification error on the Iris dataset from 0 up to 16 additional attributes, yet no reduction rate performs better than the other rates across the test cases. With more than 16 additional attributes the autoencoder causes the error rate to increase above what the classifier achieves on the uncompressed data. Also, at 16 or more additional attributes the confidence intervals for the unreduced and reduced data start to overlap to a large extent.


Figure 4.2: Above: Spread of error rate of the Softmax classifier on the Iris data that has had its dimensionality reduced to 75% of the dimensionality of the uncompressed data. Below: Spread of error rate of the Softmax classifier on the Iris data that has not had its dimensionality reduced.


For the spread of errors on the Iris dataset, displayed in figure 4.2, there seems to be no clear pattern other than that the error rate intervals grow as the number of irrelevant attributes increases. The same holds true for the mean error rate, shown by the dotted line, which also gets worse as the dimensionality increases.

Figure 4.3: Error rate of the Softmax classifier on compressed and uncompressed data of the Arrhythmia dataset.

The next round of tests was done using the Arrhythmia dataset with up to 4448 additional attributes. These tests, shown in figure 4.3, show that the error rate goes up as the number of additional attributes increases but, unlike on the Iris dataset, the classifier gets the best result on the uncompressed data the majority of the time. The exceptions are at 1112 and 4448 additional attributes, where reducing the data to 50% and 25% respectively performs a little better.


Figure 4.4: Above: Spread of error rate of the Softmax classifier on the Arrhythmia data that has had its dimensionality reduced to 75% of the dimensionality of the uncompressed data. Below: Spread of error rate of the Softmax classifier on the Arrhythmia data that has not had its dimensionality reduced.


Similarly to the Iris dataset, the spread of error rates on the Arrhythmia dataset appears to have no clear pattern. For both the data that has had its dimensionality reduced to 75% and the uncompressed data, shown in figure 4.4, the interval of the error rates starts out quite wide and remains so.

Figure 4.5: Error rate of the Softmax classifier on compressed and uncompressed data of the Drive dataset.

With the Drive dataset the classifier achieves lower error rates with up to 96 additional attributes for the lowest compression rate, shown in orange in figure 4.5. With 192 or more additional attributes the classifier performs worse at any amount of compression than on the uncompressed dataset. A trend in the error rate seems to be that at some point it jumps up to just above 0.5 for any amount of dimensionality reduction. For example, for the data reduced to 25% this happens already at 24 additional attributes, and for the data reduced to 50% it happens at 48 additional attributes.


Figure 4.6: Above: Spread of error rate of the Softmax classifier on the Drive data that has had its dimensionality reduced to 75% of the dimensionality of the uncompressed data. Below: Spread of error rate of the Softmax classifier on the Drive data that has not had its dimensionality reduced.


Unlike for the Iris and Arrhythmia datasets, the error rates of the runs on the Drive dataset are clustered quite closely together. The error rate on the uncompressed data, shown in blue in figure 4.6, appears to increase linearly as the number of irrelevant attributes increases. Initially, the error rate on the data reduced to 75%, illustrated in the same figure, also increases linearly; however, at 192 additional attributes it jumps from around 25% to around 55%.

Figure 4.7: Error rate of the Softmax classifier on compressed and uncompressed data of the Glass dataset.

The last set of tests was run on the Glass dataset and the results are shown above in figure 4.7. In contrast to the other tests, the classifier seems to get lower error rates the more additional attributes are added, both on compressed and uncompressed data. In fact, the error rate on the uncompressed data is at its lowest with 640 additional attributes. However, the error rate fluctuates too much to say whether any degree of dimensionality reduction provides a better representation of the data for the classifier than the uncompressed data already does. For example, the error rate is lower with a reduction in dimensionality to 75% of the original at 0, 5, 10, 80 and 320 additional attributes, but worse than on the unreduced data at 20, 40, 160 and 640 additional attributes. Furthermore, the confidence intervals overlap at each number of additional attributes and get wider as the number of additional attributes increases.


Figure 4.8: Above: Spread of error rate of the Softmax classifier on the Glass data that has had its dimensionality reduced to 75% of the dimensionality of the uncompressed data. Below: Spread of error rate of the Softmax classifier on the Glass data that has not had its dimensionality reduced.


As with the previous datasets, figure 4.8 shows the spread of the error rates for data that has had its dimensionality reduced, in this case by 25%, and for the data that has not been reduced. We can again see that the average error rate goes down as the number of additional attributes increases for both the reduced and the unreduced data. Also, just as the confidence interval grows as the number of additional attributes increases, the spread of error rates seems to get larger as more additional attributes are added.

4.2 Redundant Attributes

Figure 4.9: Error rate of the Softmax classifier on compressed and uncompressed data of the Iris dataset.

Contrary to the results on the Iris dataset with irrelevant attributes, there seems to be no increase in error rate as more redundant attributes are added. Instead, for the reduced data it appears that the additional attributes only cause some fluctuation between error rates of about 5% and 10%. However, the Softmax classifier seems to perform better on the reduced data up until 64 attributes have been added, after which it performs better on the unreduced data. Also, just as on the Glass dataset, the error rate on the unreduced data seems to clearly go down as the number of additional attributes increases.


Figure 4.10: First: Spread of error rate of the Softmax classifier on the Iris data that has had its dimensionality reduced to 75% of the dimensionality of the uncompressed data. Second: Spread of error rate of the Softmax classifier on the Iris data that has not had its dimensionality reduced.


Looking at the spread of error rates for the compressed data in figure 4.10, we can see that the mean error does change a bit with different numbers of additional attributes. However, most of the data points seem to land on the same error rates. For the uncompressed data, both the mean error and the spread of points decrease with the number of additional attributes.

Figure 4.11: Error rate of the Softmax classifier on compressed and uncompressed data of the Arrhythmia dataset.

Just as with the irrelevant added attributes, it seems that the autoencoder cannot find a better representation than the unreduced Arrhythmia data, as can be seen in figure 4.11. From 0 to 2224 additional attributes the classifier performs either better on the unreduced data or the same as on one of the reduced datasets, and only performs worse on the unreduced data at 4448 additional attributes. However, unlike the results on the data with irrelevant attributes, the error rate does not seem to increase as more attributes are added.


Figure 4.12: First: Spread of error rate of the Softmax classifier on the Arrhythmia data that has had its dimensionality reduced to 75% of the dimensionality of the uncompressed data. Second: Spread of error rate of the Softmax classifier on the Arrhythmia data that has not had its dimensionality reduced.


The error spread on the Arrhythmia dataset, shown in figure 4.12, does not land on the same numbers very often, unlike the spread for the Iris dataset in figure 4.10. Similarly, though, the mean stays almost the same no matter the number of additional attributes.

Figure 4.13: Error rate of the Softmax classifier on compressed and uncompressed data of the Drive dataset.

Similarly to the test using redundant attributes with the Iris dataset, we can see that the error rate on the Drive dataset in figure 4.13 remains largely unchanged as more redundant attributes are added. In fact, it appears to be even more stable than on the Iris dataset, and any reduction improves the error rate by approximately 5%. We also never see the same jump in error rate as in figure 4.5.


Figure 4.14: First: Spread of error rate of the Softmax classifier on the Drive data that has had its dimensionality reduced to 87.5% of the dimensionality of the uncompressed data. Second: Spread of error rate of the Softmax classifier on the Drive data that has not had its dimensionality reduced.


From the spread in figure 4.14 we can see that the average error rate, shown by the dotted line, fluctuates by about 0.01 for the reduced data and 0.02 for the unreduced, which is very stable. The same can be said for the largest spread of error rates, which at 48 additional attributes varies by about 0.02 for the reduced data, and also by about 0.02 at 96 additional attributes for the unreduced data.

Figure 4.15: Error rate of the Softmax classifier on compressed and uncompressed data of the Glass dataset.

Once again we can see, just as for the Glass data with additional irrelevant attributes, that there seems to be a downward trend in the average error rate as the number of additional attributes increases, as shown in figure 4.15. With redundant rather than irrelevant attributes, however, it seems that reducing the dimensionality of the data to 75% or 50% generally results in lower error rates than using the Softmax classifier on the unreduced data. The exceptions to this are at 5 additional attributes, where the classifier performs better on the unreduced data than on either of the reduced datasets, and at 320 additional attributes, where the classifier performs the same on both the unreduced data and the data reduced to 50%.


Figure 4.16: First: Spread of error rate of the Softmax classifier on the Glass data that has had its dimensionality reduced to 75% of the dimensionality of the uncompressed data. Second: Spread of error rate of the Softmax classifier on the Glass data that has not had its dimensionality reduced.


For the error spreads on the Glass dataset with additional redundant attributes, shown in figure 4.16, we can again see that the average error rate is in fact going down. However, the spread of the error rates seems to have no correlation with the number of additional attributes.


Discussion

5.1 Results

The results seem to indicate that an under-complete autoencoder can successfully reduce data that has up to 16 additional irrelevant attributes for the Iris dataset and up to 96 for the Drive dataset, giving the Softmax classifier a better representation that allows it to classify the data more accurately. However, with more additional attributes than this, the autoencoder appears unable to discover a better representation. In fact, for the Drive dataset there is a point for each compression rate where the accuracy falls drastically. One possibility is that, with so many additional irrelevant attributes and with the autoencoder trying to reconstruct its input as well as possible, the additional attributes affect the inner representation adversely enough that the Softmax classifier cannot learn to differentiate between the classes as well, whilst the Softmax classifier on the uncompressed data can eventually learn to ignore at least some of the additional irrelevant attributes.

The Drive dataset seems to benefit from reducing the dimensionality as little as possible when either irrelevant or redundant attributes are present. For the Iris dataset it seems the classifier benefits from higher degrees of reduction up to the point where 16 additional attributes have been added. When the Iris dataset instead contains redundant attributes it appears to clearly benefit from any degree of dimensionality reduction up until 64 attributes have been added. Furthermore, the Softmax classifier on the reduced data appears to be able to keep more or less the same accuracy no matter the number of added redundant attributes. This is likely because the added attributes are linear combinations of the original attributes, so with enough training the autoencoder can learn to ignore them. However, the error rate on the unreduced data, as well as its spread, has the interesting characteristic that it shrinks and becomes better than that on the reduced data. The dataset only has 150 instances, which leaves 105 instances to train on, so it could be that the additional attributes have an effect here similar to that of the corrupted input in a denoising autoencoder.

For the Glass and Arrhythmia datasets the autoencoder appears unable to provide a better representation of the data when irrelevant attributes have been added. It seems, though, that with additional redundant attributes in the Glass data the autoencoder provides a slightly better representation. For these two datasets it could be that the original attributes have been chosen and collected well enough that, when there are irrelevant attributes, the classifier manages to distinguish them from the added attributes more easily than the autoencoder does. This seems probable, as Arrhythmia is a medical dataset concerning an illness that has most likely been researched for years, and Glass was collected for the purpose of classifying forensic evidence. The Glass dataset shares the interesting characteristic, with the redundant data on the Iris dataset, that in both cases the classifier seems to perform better with additional attributes. Similarly to the Iris dataset it only has a few instances available to train on, so it could again be that the additional attributes have an effect here similar to that of the corrupted input in a denoising autoencoder.

So one possible reason that the autoencoder can provide a better representation for the Iris and Drive datasets, but not for the Glass and Arrhythmia datasets, could be the presence of noise or of already existing redundant or irrelevant attributes in the former two. The Iris dataset was measured by hand, and as such the autoencoder might help reduce the effect of these factors and learn the underlying characteristics. Likewise, the Drive dataset consists of measurements of currents through an engine, which were probably never intended to be measured and used in this way. Both the Arrhythmia and Glass datasets were likely measured in a controlled environment with carefully designed tools and by people with significant education, and might therefore have minimal or no noise. Lastly, when the autoencoder is used on data with additional redundant attributes, the classification error stays approximately the same as when no additional attributes have been added, or goes down, whilst with irrelevant attributes the error rate tends to go up. This is probably because the autoencoder cannot, in fact, remove irrelevant attributes and only removes redundant ones.

5.2 Method

There are some improvements that could be made to the method that were not feasible at the time. First of all, not all compression rates were tested to find the optimal one at each stage, which is possible to do but could not be done in this project because of time constraints. Finding the compression which results in the highest accuracy could plausibly provide a better understanding of how the additional attributes affect the autoencoder's ability to provide a more robust set of attributes.

Because of time constraints, it was also not possible to do much more than 10 test runs. Apart from the results on the Drive dataset, for which the error rates were tightly clustered, the results had quite a large spread of error rates even for lower numbers of additional attributes. Because of this large spread, performing more iterations of tests should have provided more representative results.

Lastly, the choice of using the ELU activation function in the hidden layer and the sigmoid activation function in the last layer was based on preliminary tests on the MNIST and Iris datasets and may not be the optimal choice for other datasets. Preliminary tests for the choice of activation functions should possibly have been done on every dataset individually.
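As a rough illustration of the configuration discussed in this section, the sketch below computes the forward pass of an under-complete autoencoder with an ELU hidden layer, a sigmoid output layer and a cross-entropy reconstruction cost. It uses NumPy only. The layer sizes, the bias terms, the random untrained weights and all function names are assumptions made for the example; the actual experiments were implemented in Keras and include training, which is omitted here.

```python
import numpy as np

def elu(z, alpha=1.0):
    """Exponential linear unit: z for z > 0, alpha * (exp(z) - 1) otherwise."""
    return np.where(z > 0, z, alpha * (np.exp(np.minimum(z, 0.0)) - 1.0))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def autoencoder_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Return the hidden code and the reconstruction of x."""
    code = elu(x @ w_enc + b_enc)           # hidden layer with fewer units than inputs
    recon = sigmoid(code @ w_dec + b_dec)   # output layer with values in (0, 1)
    return code, recon

def cross_entropy(x, recon, eps=1e-7):
    """Mean element-wise cross-entropy between the input and its reconstruction."""
    recon = np.clip(recon, eps, 1.0 - eps)
    return float(-np.mean(x * np.log(recon) + (1.0 - x) * np.log(1.0 - recon)))

rng = np.random.default_rng(0)
n_inputs = 8
n_hidden = int(0.75 * n_inputs)  # reduction to 75% of the input dimensionality
x = rng.uniform(size=(5, n_inputs))  # stand-in for normalized data in [0, 1)
w_enc = rng.normal(scale=0.1, size=(n_inputs, n_hidden))
w_dec = rng.normal(scale=0.1, size=(n_hidden, n_inputs))
code, recon = autoencoder_forward(x, w_enc, np.zeros(n_hidden), w_dec, np.zeros(n_inputs))
loss = cross_entropy(x, recon)
```

Swapping `elu` or `sigmoid` for another activation in this sketch corresponds to the kind of per-dataset preliminary test suggested above.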

5.3 Ethical Consequences

The results show that in some cases it is possible to extract better representations of the data automatically, and as such it is possible that jobs pertaining to designing data suited for the task of classification could be lost. Another possible outcome of being able to extract robust representations from data with large amounts of irrelevant and redundant attributes is that there is less need to be selective about what to collect. Some less scrupulous actors might see this as an opportunity to collect all the information about people that they can, possibly even including sensitive information.

However, not all outcomes are negative. By removing people from the picture when it comes to designing datasets, it is possible to eliminate the human factor in handling data. This could lead to better and more trustworthy results, provided it is possible to guarantee that the input data is in fact entirely intact. Furthermore, the automatic handling of data could also reduce the privacy concerns that arise from having a human handle the information, if it is used to eliminate personal information from the data before allowing it to be handled by other people.

5.4 Future Work

There are several possibilities for future work based on what has been found in this paper. To start with, the tests in this paper are based on a shallow autoencoder and linear redundant attributes, and as such it could be of value to see how a deeper autoencoder structure handles both linear and non-linear redundant attributes. Based on the results on the Glass dataset, where accuracy improved as the number of additional attributes increased, it could also be interesting to see if it is possible to create something similar to a denoising autoencoder that adds attributes instead of corrupting the existing ones.

Lastly, this work has focused on only the under-complete autoencoder, so in the future it could be interesting to combine under-complete with perhaps a sparse autoencoder.


Conclusion

In this study it was investigated to what degree under-complete autoencoders can extract a lower dimensional and robust representation from high-dimensional data containing redundant or irrelevant attributes. The results from tests on four datasets from different domains show that the autoencoder can eliminate redundant attributes, that are a linear combination of the other attributes, and provide a better lower dimensional representation of the data than that of the unreduced data. However, in data that is gathered under a controlled and carefully managed situation, the autoencoder cannot provide a better lower dimensional representation than the data with redundant attributes. Lastly, the results show that the autoencoder cannot eliminate irrelevant attributes which have no correlation to the class or other attributes.


[1] W. Michael Brown et al. “Algorithmic dimensionality reduction for molecular structure analysis”. In: The Journal of Chemical Physics 129.6 (2008), p. 064118. DOI: 10.1063/1.2968610.

[2] François Chollet. keras. https://github.com/fchollet/keras. 2015.

[3] Haoda Chu et al. “SDRNF: generating scalable and discriminative random nonlinear features from data”. In: Big Data Analytics 1.1 (2016), p. 10. ISSN: 2058-6345. DOI: 10.1186/s41044-016-0015-z.

[4] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. “Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)”. In: CoRR abs/1511.07289 (2015). URL: http://arxiv.org/abs/1511.07289.

[5] Shay Cohen, Eytan Ruppin, and Gideon Dror. “Feature Selection Based on the Shapley Value”. In: IJCAI. 2005, p. 665.

[6] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. “Deep Sparse Rectifier Neural Networks”. In: JMLR Workshop and Conference Proceedings Volume 15: AISTATS 2011. Apr. 2011, pp. 315–323.

[7] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. http://www.deeplearningbook.org. MIT Press, 2016.

[8] G. E. Hinton and R. R. Salakhutdinov. “Reducing the Dimensionality of Data with Neural Networks”. In: Science 313.5786 (2006), p. 504.

[9] Yann Lecun, Yoshua Bengio, and Geoffrey Hinton. “Deep learning”. In: Nature 521.7553 (May 2015), pp. 436–444. ISSN: 0028-0836. DOI: 10.1038/nature14539.


[10] John A. Lee and Michel Verleysen. Nonlinear Dimensionality Reduction. 233 Spring Street, New York, NY 10013, USA: Springer Science + Business Media, LLC, 2007. ISBN: 978-0-387-39351-3.

[11] M. Lichman. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml. Accessed: 2016-02-03. 2013.

[12] Laurens Van Der Maaten, Eric Postma, and Jaap Van den Herik. “Dimensionality reduction: a comparative review”. In: Journal of Machine Learning Research 10 (2009), pp. 66–71.

[13] Martín Abadi et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Software available from tensorflow.org. 2015. URL: http://tensorflow.org/.

[14] Jonathan Masci et al. “Stacked Convolutional Auto-encoders for Hierarchical Feature Extraction”. In: Proceedings of the 21th International Conference on Artificial Neural Networks - Volume Part I. ICANN’11. Espoo, Finland: Springer-Verlag, 2011, pp. 52–59. ISBN: 978-3-642-21734-0. URL: http://dl.acm.org/citation.cfm?id=2029556.2029563.

[15] A. Nguyen, J. Yosinski, and J. Clune. “Deep neural networks are easily fooled: High confidence predictions for unrecognizable images”. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). June 2015, pp. 427–436. DOI: 10.1109/CVPR.2015.7298640.

[16] Thomas Pellegrini. “Comparing SVM, Softmax, and shallow neural networks for eating condition classification”. In: PLoS ONE 11.5 (May 2016), e0154486.

[17] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. “Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1”. In: ed. by David E. Rumelhart, James L. McClelland, and CORPORATE PDP Research Group. Cambridge, MA, USA: MIT Press, 1986. Chap. Learning Internal Representations by Error Propagation, pp. 318–362. ISBN: 0-262-68053-X. URL: http://dl.acm.org/citation.cfm?id=104279.104293.

[18] Olga Russakovsky et al. “ImageNet Large Scale Visual Recognition Challenge”. In: International Journal of Computer Vision (IJCV) 115.3 (2015), pp. 211–252. DOI: 10.1007/s11263-015-0816-y.


[19] Richard Socher et al. “Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection”. In: Proceedings of the 24th International Conference on Neural Information Processing Systems. NIPS’11. Granada, Spain: Curran Associates Inc., 2011, pp. 801–809. ISBN: 978-1-61839-599-3. URL: http://dl.acm.org/citation.cfm?id=2986459.2986549.

[20] Richard Socher et al. “Semi-supervised Recursive Autoencoders for Predicting Sentiment Distributions”. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. EMNLP ’11. Edinburgh, United Kingdom: Association for Computational Linguistics, 2011, pp. 151–161. ISBN: 978-1-937284-11-4. URL: http://dl.acm.org/citation.cfm?id=2145432.2145450.

[21] Sumit Bhatia, Praveen Prakash, and G. N. Pillai. “SVM Based Decision Support System for Heart Disease Classification with Integer-Coded Genetic Algorithm to Select Critical Features”. In: Proceedings of The World Congress on Engineering and Computer Science 2008. 2008, pp. 34–38.

[22] A. Sze-To, H. R. Tizhoosh, and A. K. C. Wong. “Binary codes for tagging x-ray images via deep de-noising autoencoders”. In: 2016 International Joint Conference on Neural Networks (IJCNN). July 2016, pp. 2864–2871. DOI: 10.1109/IJCNN.2016.7727561.

[23] Yichuan Tang. “Deep Learning using Support Vector Machines”. In: CoRR abs/1306.0239 (2013). URL: http://arxiv.org/abs/1306.0239.

[24] W. Wang et al. “Generalized Autoencoder: A Neural Network Framework for Dimensionality Reduction”. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops. June 2014, pp. 496–503. DOI: 10.1109/CVPRW.2014.79.

[25] Yasi Wang, Hongxun Yao, and Sicheng Zhao. “Auto-encoder Based Dimensionality Reduction”. In: Neurocomput. 184.C (Apr. 2016), pp. 232–242. ISSN: 0925-2312. DOI: 10.1016/j.neucom.2015.08.104. URL: http://dx.doi.org/10.1016/j.neucom.2015.08.104.
