
Convolutional Neural Networks: Performance on Imbalanced Data

Oscar Sallander

Spring term 2021


Abstract

Imbalanced data is a major problem in machine learning classification, since predictive performance can be hindered when one class occurs more frequently than the others. In medical science, for example, imbalanced data sets are very common: when searching for rare diseases in a population, the healthy proportion can be extremely large in comparison to the proportion with a disease. This raises a problem, because when a model is given only a few example observations of one class and a larger number of observations of the other, the model tends to be biased towards the majority class. When the label with fewer occurrences is of great importance, or if both labels must be correctly classified, this creates a problem. In deep learning and image classification, there is a lack of research on how Convolutional Neural Networks perform on imbalanced data compared to other classifiers. The goal of this thesis is to analyze and compare the performance of Convolutional Neural Networks against the k-Nearest-Neighbor algorithm. Performance is evaluated on a data set that is modified to have increasingly imbalanced classes. The results show that imbalanced data does have a negative effect on the performance of Convolutional Neural Networks when classifying the minority classes, but to a lesser degree than for the k-Nearest-Neighbor algorithm.

Abstract (Swedish)

Title: Faltningsnätverk: Prestanda på obalanserade data

Imbalanced data is a major problem in machine learning and classification, since the predictive performance of models can deteriorate when one class occurs more often than another. In medical research, for example, imbalanced classes are very common. When searching for rare diseases in a population, the healthy proportion of participants can be considerably larger than the diseased proportion. This leads to the problem that a model receives only a small number of observations of one class and far more of another. Models then tend to become biased towards the majority class, which becomes a problem if the minority class is important to predict correctly. In deep learning and image recognition, there is a research gap regarding how convolutional networks perform on imbalanced classes compared to other classification methods.

The goal of this thesis is to analyze and compare the predictive performance of convolutional networks against the k-nearest-neighbor algorithm. Performance is measured on a data set modified to have a gradually stronger imbalance between classes. The results show that imbalanced classes have a negative effect on the convolutional network's ability to classify the minority classes, but the negative effect is larger for the k-nearest-neighbor algorithm.


Popular scientific summary

Image recognition has an essential role in the pursuit of global digitalization. It is used to automate driving, detect diseases in medical imagery, recognize faces and much more. The neural network is the backbone of image recognition because of its ability to identify complex patterns in data. In order to get the best results in a given task, we need to show the neural networks the optimal data. In general, it is easier to recognize an image correctly if the data is balanced. Balanced data in an image recognition setting means that the network is shown an equal number of pictures of each object in order to learn what they represent. In the real world, however, the number of objects is rarely balanced. For example, when analyzing x-ray pictures of lungs in search of a disease, there could be many more pictures of healthy lungs than of diseased ones. In this thesis, the Convolutional Neural Network is evaluated on five increasingly imbalanced data sets. The results are then compared to those of the k-Nearest-Neighbor algorithm. It was concluded that imbalanced data does have a negative effect on the performance of the Convolutional Neural Network, but it was less affected by the imbalance than the k-Nearest-Neighbor algorithm.


Acknowledgements

I would like to thank my supervisor Xijia Liu, who helped me all throughout the process of this project.


Contents

1 Introduction
2 The MNIST Fashion Data Set
3 Method
  3.1 Image Representation
  3.2 Classification
  3.3 Holdout Method
  3.4 Artificial Neural Networks
    3.4.1 Activation Functions
    3.4.2 Hyperparameters
    3.4.3 Over-fitting
    3.4.4 Gradient Descent
  3.5 Convolutional Neural Networks
    3.5.1 The Convolution Process
    3.5.2 Pooling
  3.6 k-Nearest Neighbor
  3.7 Data Manipulation
  3.8 Evaluation Metrics
    3.8.1 Confusion Matrix
4 Results
5 Discussion
6 Software


1 Introduction

Imbalanced data is a major problem in machine learning, since most machine learning algorithms thrive when classes are of equal proportions. Consequently, when there are fewer training instances of one class, the decision boundaries associated with the algorithms tend to be biased towards the majority class. This leads to an increased misclassification rate for the minority class as well as potentially misleading evaluation metrics (Hoang, Bouzerdoum, & Lam, 2009).

According to Wang, Liu, Wu, Cao, Meng, & Kennedy (2016), deep learning studies have focused primarily on data sets with balanced classes and have not thoroughly examined model performance on imbalanced data. They further argue that imbalanced data sets are common in the real world, and thus more focus should be put on model performance in real-world scenarios. More research on imbalanced data could lead to societal benefits, as medical science is one of many fields that face real-world applications of imbalanced classification. For example, image recognition in the form of image-based diagnosis is becoming more common, as the amount of medical data is rapidly growing (Khatami, Babaie, Khosravi, Tizhoosh, & Nahavandi, 2018).

There are multiple common solutions for the issues of training a machine learning algorithm on an imbalanced data set. One of the simplest is oversampling, which increases the size of the minority classes by creating replicates of random observations (Ganganwar, 2012). Conversely, undersampling can be used to downsize the majority classes by randomly discarding some observations. The drawback of undersampling is that potentially useful information is lost when removing observations. It is also possible to counter the class imbalance on an algorithmic level, by adjusting the decision thresholds and thereby minimizing the skewness towards the majority class (Ganganwar, 2012). However, the goal of this thesis is not to increase predictive performance by using these solutions, but to help map how well the algorithms can classify minority classes, even when classes are imbalanced.

Convolutional Neural Networks (CNNs) are one of the most common types of neural networks for high-dimensional data and have repeatedly been proven to be among the best performing algorithms for image classification (Rawat & Wang, 2017). In this thesis, the predictive performance of the CNN on imbalanced data will be evaluated on a multi-label classification problem containing 10 different pieces of clothing. In order to evaluate the effect of imbalanced classes, a modified randomized sampling process will be used to purposely create imbalanced data sets. Model performance will be evaluated on increasingly imbalanced data sets and then compared to the k-Nearest-Neighbor algorithm (k-NN).

The results showed that the CNN achieved overall better predictive performance than the k-NN. However, with increasingly imbalanced classes, the macro-average-precision of the k-NN did not decrease as much as that of the CNN. Conversely, the macro-average-recall decreased substantially more for the k-NN than for the CNN as classes became more imbalanced. This suggests that, on this data, the CNN algorithm is the better choice if identifying the true positives of the minority classes is of great importance.

The structure of this thesis is as follows. First, the MNIST Fashion data set is introduced. Second, the method section describes how images are quantized into matrices, and then the basics behind supervised machine learning and classification. The theory behind the artificial neural network and its basic parts is then introduced before the main section about the CNN. The k-NN algorithm is briefly summarized before the experiment design is introduced. Here, the process of modifying the original data set into increasingly imbalanced sets is explained. Then, the section about evaluation metrics explains different measurements that are suitable for multi-class classification of imbalanced data. The results section includes the hyperparameters of the final models and their performance on the differently balanced test data sets. Lastly, the conclusion and discussion sections interpret the results and discuss the limitations of the study.


2 The MNIST Fashion Data Set

The MNIST Zalando Fashion data set (Xiao, Rasul, & Vollgraf, 2017) is a successor of the popular MNIST set of handwritten digits (LeCun & Cortes, 2010). Since the set of handwritten digits was first introduced in 1998, it has become the most widely used testbed for deep learning. One possible reason for the data set's popularity is its size, straightforward encoding and permissive license. According to the Zalando Research Team, the original MNIST set is too easy, as classic machine learning algorithms can easily achieve almost perfect scores. Therefore, the team decided to create an improved MNIST set with the same accessibility as the original. Just like the original set, the new set has 70 000 grayscale images (see examples in Figure 1) in 28x28 resolution with 10 classes (see Table 1).

Figure 1: Four grayscale 28x28 images from the MNIST Fashion data set.


Label Description No. of Obs.

0 T-Shirt/Top 7000

1 Trouser 7000

2 Pullover 7000

3 Dress 7000

4 Coat 7000

5 Sandals 7000

6 Shirt 7000

7 Sneaker 7000

8 Bag 7000

9 Ankle boots 7000

Table 1: Description of the ten classes, which represent ten different pieces of clothing. The data set contains 7000 observations (images) of each class.
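The thesis itself does not list data-loading code, but its Software section states that Keras was used. As a minimal, hedged sketch, the data set can be loaded directly through the Keras dataset API:

```python
import numpy as np
from tensorflow.keras.datasets import fashion_mnist

# 70 000 grayscale 28x28 images: 60 000 for training and 10 000 for testing.
(x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()

print(x_train.shape)         # (60000, 28, 28)
print(np.bincount(y_train))  # 6000 images per class in the training split
```

Each class's 7 000 observations are split as 6 000 training and 1 000 test images.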


3 Method

3.1 Image Representation

Almost any image can be viewed in the form of a matrix, or in the case of a colorized (RGB) image, three matrices. These matrices consist of quantized numbers of a certain bit length, which represent information about the intensity and color of the image.

In a camera, there are many sensors that help capture an image, and the number of sensors is what determines its resolution and size. Suppose that the sensor array has (n, m) sensors, which results in an image of the same (n, m) size. When a photo is taken, each sensor grabs a sample of light, resulting in a value between 0 and (2^b − 1) for a b-bit image. For example, an 8-bit image would sample values between 0 and 255. Because of this process, it is possible to represent an image as stored digital data. The pixel representation reflects the values that were collected by the sensors, in the same exact order, where higher values indicate a greater intensity of coloring. Each sensor samples a value independently of the others, but the resulting pixels are seldom independent. In general, pixels often look similar to the ones next to them, except where edges occur in an image. The edges result in large changes in the pixel values, which are good indicators of objects in the image (Venkatesan & Li, 2017).
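To make the matrix view concrete, here is a hedged toy example (not from the thesis) of a tiny 8-bit grayscale image as a NumPy array:

```python
import numpy as np

# A 4x4 8-bit grayscale "image": each entry is a pixel intensity in [0, 255].
image = np.array([[  0,   0, 255, 255],
                  [  0,   0, 255, 255],
                  [  0,   0, 255, 255],
                  [  0,   0, 255, 255]], dtype=np.uint8)

# Neighboring pixels are similar; the large jump between columns 1 and 2
# is exactly the kind of vertical edge that hints at an object boundary.
print(np.abs(np.diff(image.astype(int), axis=1)))
```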

3.2 Classification

Machine learning is a sub-field of artificial intelligence (AI), which has the objective of learning from data without having to program explicit rules or logic. These learning algorithms can be divided into three main groups: supervised, semi-supervised and unsupervised. Supervised learning has the goal of producing a prediction y* based on a collection of (input: x, label: y) pairs, where the input x can be a feature vector or a vector of data points from features, images, documents or graphs. In most classification problems, the output y is a binary label that differentiates two groups with different properties. There are, however, numerous multi-label classification problems where y can take k labels (Khan, Rahmani, Shah, & Bennamoun, 2018).

Conversely, unsupervised learning is based on modeling underlying patterns in the input x without labels. The most commonly used unsupervised learning method is the clustering approach, which clusters data points into groups based on similar properties. When labels are obtainable for only a portion of a large amount of input data, a semi-supervised learning algorithm can be used to classify the unlabeled data correctly (Khan et al., 2018).

3.3 Holdout Method

In machine learning, the Holdout Method is used to evaluate model performance on unseen data. While algorithm coefficients can be learned from the training set, some model parameters have to be chosen manually. To do this, the training set can be split into two parts: a validation set and a reduced training set. The model can then be trained on the reduced training set with a specific model parameter, and the best parameter setting is the one that performs best on the validation set (Venkatesan & Li, 2017). Ideally, the validation set should follow the same distribution as the test set.
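A minimal sketch of such a split, assuming NumPy arrays and hypothetical names (the thesis does not publish its code):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def holdout_split(x, y, val_fraction=0.2):
    """Randomly split a training set into a reduced training set and a validation set."""
    idx = rng.permutation(len(x))
    n_val = int(len(x) * val_fraction)
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    return x[train_idx], y[train_idx], x[val_idx], y[val_idx]
```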

3.4 Artificial Neural Networks

In order to explain how the CNN works, its building blocks will be introduced sequentially, starting with the Artificial Neural Network (ANN). ANNs can be grouped into the categories feed-forward networks and feed-back networks, based on how the information flows between the neurons (also called nodes or perceptrons). In a feed-forward network, such as the CNN, the information only moves in one direction, from input to output, whereas a feed-back network uses loops and cycles to store information and sequence relationships in its internal memory (Khan et al., 2018).

In Figure 2, the function of a neuron is visualized. The basic neuron works by computing the weighted sum s = \sum_{i=1}^{n} x_i w_i + b between the inputs (x1, x2, x3, ..., xn) and their internal weights (w1, w2, w3, ..., wn). This dot product then passes through a non-linear activation function, f. The bias term b is analogous to the intercept of a linear equation: adding this constant allows the neuron to shift the output of the weighted input sum, which can help fit the data better (Verdhan, 2021).


Figure 2: A perceptron, also known as node or neuron, where input flows from the left, through the activation function, which produces an output.
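As a hedged illustration (names hypothetical, not from the thesis), a single neuron's forward pass is only a few lines of NumPy:

```python
import numpy as np

def neuron(x, w, b, f):
    """A single neuron: weighted input sum plus bias, passed through activation f."""
    s = np.dot(w, x) + b
    return f(s)

relu = lambda s: np.maximum(0.0, s)  # ReLU, introduced in Section 3.4.1
print(neuron(np.array([1.0, 2.0, 3.0]), np.array([0.5, -0.25, 0.1]), b=0.2, f=relu))  # 0.5
```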

To get a network that can classify more complex relations, a network with more layers has to be built. Figure 3 shows a network with an input layer (L1), an added hidden layer (L2) with three neurons, and an output layer (L3). There are now two sets of trainable weights. The dot products from the input layer and its corresponding weights act as input for the neurons in the hidden layer, where new weights can be trained and passed forward to the output layer (Zafar, Tzanidou, Burton, Patel, & Araujo, 2018).


Figure 3: A two-layer network, with a hidden layer containing three neurons. This network is capable of learning more complex patterns than the simple perceptron.

3.4.1 Activation Functions

To allow the neural network to learn more complex patterns, a non-linear block has to be added after the dot product. By doing this, and then cascading these non-linear layers, the network can compose different concepts together and solve complex problems more easily (Zafar et al., 2018). The non-linear activation functions in the neurons enable the ANN to tackle more complex problems: no matter how many layers a network contains, without these functions the network would always behave like a linear model. One of the most popular functions is the Rectified Linear Unit (ReLU), defined as

f(x) = max(0, x).


Figure 4: The ReLU activation function, which produces the output 0 for negative values.

The ReLU is a simple activation function which is of special practical importance thanks to its computational speed (see Figure 4). The ReLU maps the input to 0 if it is negative and keeps its value unchanged if it is larger than zero (Khan et al., 2018).

Another activation function, the softmax activation function, is used in the final layer to generate a final classification of an image over different categories. The softmax works like multinomial logistic regression, by calculating probabilities for each class over all the possibilities. For example, if the classes are t-shirts, trousers, and coats, then the softmax function will calculate three probabilities for an input to determine which class it is likely to belong to. These probabilities add up to 1, and the predicted class is then chosen based on the highest value (Verdhan, 2021).
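A minimal NumPy sketch of both functions (the max-subtraction inside the softmax is a standard numerical-stability trick, not something the thesis discusses):

```python
import numpy as np

def relu(x):
    # Maps negative inputs to 0 and keeps positive inputs unchanged.
    return np.maximum(0.0, x)

def softmax(z):
    # Outputs are positive and sum to 1, so they act as class probabilities.
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # e.g. raw scores for t-shirt, trouser, coat
print(softmax(scores))              # three probabilities summing to 1
print(np.argmax(softmax(scores)))   # 0: the class with the highest probability
```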

3.4.2 Hyperparameters

In the training process of a neural network, the algorithm learns about the attributes of the data, but there are some parameters that cannot be learned by the network itself and are required to be chosen manually. These parameters are called hyperparameters and need to be set before training the network. One of these hyperparameters is the learning rate, which defines the step size that a model takes to reduce errors (Verdhan, 2021). Training a network is an optimization problem, with the goal of finding the weights that minimize the number of misclassifications. The learning rate decides the amount of adjustment made to the weights in the training process. Intuitively, a small learning rate increases the time for the network to converge and reach the global minimum. A learning rate of 0.01 is acceptable in most cases.

Other examples of hyperparameters are the number of hidden layers in the network, the number of neurons in each layer, the weight initialization and which activation function to use. Hyperparameters are chosen based on their performance: they are evaluated on the validation set and tweaked until the performance is maximized (Verdhan, 2021).

3.4.3 Over-fitting

As mentioned previously, the goal of the training process is to learn the patterns in the training data and then make correct predictions on unseen data. A common problem, however, is when the network mimics the training data and reaches a high training accuracy, but has a low accuracy on the unseen validation set. This phenomenon is called over-fitting. One solution is to reduce the complexity of the network by using regularization methods (Verdhan, 2021).

Dropout is a regularization method that randomly drops the outputs of some neurons in a layer (see Figure 5). By doing this, each training pass effectively uses a different network, and the networks are less likely to over-fit the training data (Verdhan, 2021).

Figure 5: On the left is a network where all outputs from the layers are included in the next layer. On the right, some outputs are randomly discarded, reducing the risk of over-fitting.

3.4.4 Gradient Descent

Gradient descent is used to find the weights that minimize the loss caused by misclassified observations. It is an optimization technique which iteratively moves in the direction of the steepest descent (see Figure 6), which is defined by the negative of the gradient (Venkatesan & Li, 2017). The problem with this method is that it is time consuming for large data sets, since each iteration has to predict every sample in the training set in order to update the weights. The solution for this is to update the weights using stochastic gradient descent. Previously, in every iteration, all observations were used to produce a gradient approximation.

Now, the weights are updated by randomly choosing one observation at each iteration (Verdhan, 2021).
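A hedged toy sketch of stochastic gradient descent on a simple least-squares problem (the setup and names are illustrative, not the thesis's networks):

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Toy data: y is an exact linear function of X; loss per sample is (x.w - y)^2.
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5])

w = np.zeros(3)        # the weights start at an arbitrary value
learning_rate = 0.01   # the step size taken against the gradient

for step in range(5000):
    i = rng.integers(len(X))                 # stochastic: one random observation
    grad = 2 * (X[i] @ w - y[i]) * X[i]      # gradient of the loss at that observation
    w -= learning_rate * grad                # move in the direction of steepest descent

print(w)  # approaches the true weights [1.0, -2.0, 0.5]
```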

Figure 6: The learning rate decides how much the weights are changed in each iteration. A small learning rate often requires more iterations before minimizing the loss function, but might do so more accurately.

Figure 6 visualizes how gradient descent minimizes the loss, which is the difference between the actual and predicted labels. W denotes the weights, which initially start at a random value and are then updated to minimize the loss. A commonly used loss function in deep learning is the cross-entropy loss function (Verdhan, 2021). The cross-entropy between model predictions q and true labels p, where i indexes the elements of the labels and predictions, is defined as

H(p, q) = -\sum_{i} p_i \log(q_i).

For a binary classification problem, where the label y is 0 or 1, p ∈ {y, 1 − y} and q ∈ {ŷ, 1 − ŷ}, the loss function L that should be minimized is

L(\hat{y}, y) = -\frac{1}{m} \sum_{i=1}^{m} \left( y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right),

which is intuitively correct: when y = 0, the loss reduces to L(ŷ, y) = −log(1 − ŷ), which is minimized by a small ŷ, and when y = 1, it reduces to L(ŷ, y) = −log(ŷ), which is minimized by a large ŷ (Zafar et al., 2018). For multi-class classification, the loss L is defined as

L(\hat{y}, y) = -\sum_{k=1}^{K} y_k \log(\hat{y}_k),

where y_k is 0 or 1 and represents whether class k is the correct classification for the prediction ŷ_k.

In order to use this loss function, a softmax activation function needs to be added to the final layer of the network, which normalizes the output into a probability distribution over the predicted classes (Zafar et al., 2018). The multi-class cross-entropy with softmax is defined as

L(\hat{y}, y) = -\sum_{k=1}^{K} y_k \log\left( \frac{e^{\hat{y}_k}}{\sum_{j=1}^{K} e^{\hat{y}_j}} \right).
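Combining the formulas above, a minimal NumPy check (illustrative only) shows the intended behavior: a confident correct prediction gives a small loss, a confident wrong one a large loss:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(y_true, z):
    """Multi-class cross-entropy between a one-hot label and raw network outputs z."""
    return -np.sum(y_true * np.log(softmax(z)))

y_true = np.array([0.0, 1.0, 0.0])                       # the true class is class 1
print(cross_entropy(y_true, np.array([0.1, 3.0, 0.2])))  # small loss: confident and correct
print(cross_entropy(y_true, np.array([3.0, 0.1, 0.2])))  # large loss: confident and wrong
```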

3.5 Convolutional Neural Networks

CNNs work in a similar way to standard neural networks. The difference, however, is how the CNN breaks down the image into small basic components, for example edges and other local properties found by the convolutional layers. This functionality is missing from a standard neural network, since such a network has problems recognizing that two patterns are the same if they appear in different positions of an image (Venkatesan & Li, 2017).

3.5.1 The Convolution Process

In the convolutional layers, features are extracted from the input image by applying filters. Filters are basically matrices with values called weights, which are trained during the model training process. For example, in the case of a 32x32 image, features can be extracted by applying a 5x5 filter, resulting in a 28x28 output. The filter is passed over the entire image, and the features obtained from the process can be lines, edges, curves and more (Verdhan, 2021).

Figure 7: Convolving a 5x5 image with a 3x3 filter. The filter (green) slides over the input matrix while searching for the specific pattern encoded in the filter.

Figure 7 shows how a 3x3 filter passes over a 5x5 matrix (left), and results in an output (right). The stride value determines how many cells the filter moves with each step (in Figure 7 the stride value is 1). The output is the result of how many cells in the colored area match the 3x3 filter. The 3x3 filter checks if the feature it is meant to detect is present in the 5x5 matrix or not. The highest possible output value is 3, which means that the filter is confident that the specific pattern is present in the matrix. In contrast, a lower value means that the specific pattern is not present in a specific location. The resulting output matrix is called a feature map or activation map.

Given an image of dimension (n, n) and a filter of dimension (x, x), the output of a single CNN layer has dimension (n − x + 1, n − x + 1). For the example in Figure 7, this equals (5 − 3 + 1, 5 − 3 + 1) = (3, 3).
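A hedged NumPy sketch of this sliding-window process (CNN libraries technically compute the cross-correlation shown here, without flipping the filter):

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """'Valid' convolution: slide the filter over the image with stride 1, no padding."""
    n = image.shape[0]
    x = kernel.shape[0]
    out = np.zeros((n - x + 1, n - x + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Each output cell is the elementwise product of the filter
            # and the patch it currently covers, summed up.
            out[i, j] = np.sum(image[i:i + x, j:j + x] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)   # a 5x5 input, as in Figure 7
kernel = np.ones((3, 3)) / 9.0                     # a 3x3 averaging filter
print(convolve2d_valid(image, kernel).shape)       # (3, 3) = (5 - 3 + 1, 5 - 3 + 1)
```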

Channels are an additional component that represents the depth of the matrices in the convolution process. If w = width, h = height and c = channels, then an image generally has dimensions (w, h, c). In a gray-scale image, there is only one channel, resulting in (w, h, 1). However, in a colorized image, there are often c = 3 channels, representing RGB (Red, Green, Blue), resulting in a matrix of dimension (w, h, 3).

This means that when convolving images, the filter must have the same number of channels as the input. In Figure 8, our 5x5 image now has 3 channels instead of the previous single channel. However, the output is still an image of 3x3.

Figure 8: Convolving a 5x5x3 image by a filter of 3x3x3.

One can see that the number of pixels reduces rapidly through the convolution process, as the output is smaller than the input. When the number of convolution layers increases, this becomes a problem. The solution is padding, a method for adding pixels to a processed image. One example is zero padding (see Figure 9), where the number of pixels is increased by adding zeros around the output from the convolution process. The pixels along the periphery are then not lost, which leads to better results and better analysis from the CNNs (Verdhan, 2021).


Figure 9: Padding with zeros along the edges of an output matrix.

To summarize the convolution process: a filter of size f is applied to an image of size (n, n) with a stride value of s and padding p, which results in an output of size

((n + 2p − f)/s + 1, (n + 2p − f)/s + 1).

To illustrate, with a colorized input image of size 37x37, a 3x3 filter, a stride value of 1 and no padding (p = 0):

((37 + 0 − 3)/1 + 1, (37 + 0 − 3)/1 + 1) = (35, 35).

When using 10 filters, this results in an output of 35x35x10, as visualized in Figure 10. As seen in Figure 8, the initial 3 channels are collapsed by the 3x3x3 filter. However, since the number of filters is 10, the end result is a total of 10 channels.

Figure 10: Convolving a 37x37x3 input with ten 3x3x3 filters produces a 35x35x10 output.
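The output-size formula is simple enough to wrap in a helper (a hedged convenience function, not from the thesis):

```python
def conv_output_size(n, f, p=0, s=1):
    """Spatial output size of a convolution: (n + 2p - f) // s + 1 per dimension."""
    return (n + 2 * p - f) // s + 1

print(conv_output_size(n=37, f=3, p=0, s=1))  # 35, matching the worked example
print(conv_output_size(n=5, f=3, p=1, s=1))   # 5: one ring of zero padding preserves the size
```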


3.5.2 Pooling

As the complexity of the features increases, so does the complexity of the computation. The number of dimensions in the network increases with each layer, and the overall complexity of the CNN increases. Pooling is a way to deal with this increasing number of dimensions (Verdhan, 2021).

As mentioned previously, the convolution process results in an output called a feature map. An issue with this feature map is that any augmentation of the image will change the feature map. For example, if an image is rotated, the feature map will change because of the changed position of the feature in the image. The pooling layer is a way to lower the resolution of the input image, and by doing this the change in the feature map can be mitigated. Most commonly, a pooling layer is applied after a convolutional layer.

Figure 11 shows two different versions of pooling. First, max pooling is applied to the 4x4 feature map from a convolution layer. The pooling layer with 2x2 pixels and a stride value of 2 results in a 2x2 matrix with the max value of each 2x2 patch of the feature map. Average pooling, on the other hand, calculates the mean value of the 2x2 patches of the feature map.


Figure 11: Pooling generalizes the values of a large matrix into a smaller matrix, by choosing the max value or the average value in a given patch.
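Both variants in a short, hedged NumPy sketch (the function name and the example feature map are illustrative):

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Reduce a feature map by taking the max (or mean) of each size x size patch."""
    out_n = (feature_map.shape[0] - size) // stride + 1
    out = np.zeros((out_n, out_n))
    for i in range(out_n):
        for j in range(out_n):
            patch = feature_map[i * stride:i * stride + size,
                                j * stride:j * stride + size]
            out[i, j] = patch.max() if mode == "max" else patch.mean()
    return out

fmap = np.array([[1., 3., 2., 1.],
                 [4., 6., 5., 0.],
                 [1., 2., 8., 7.],
                 [0., 1., 4., 9.]])
print(pool2d(fmap, mode="max"))   # [[6. 5.] [2. 9.]]
print(pool2d(fmap, mode="avg"))   # [[3.5 2. ] [1.  7. ]]
```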

3.6 k-Nearest Neighbor

The k-NN is a supervised classification algorithm that finds the group of k observations in the training set that are the nearest neighbors of the test observation.

The test observation is then assigned to the class that is most predominant in this neighborhood (Wu et al., 2008). The nearest neighbors are chosen based on a distance (or similarity) metric between observations, for example the Euclidean distance (Ali, Neagu, & Trundle, 2019), defined as

d(x, y) = \sqrt{ \sum_{i=1}^{n} (x_i - y_i)^2 }.

Given a training set D with training observations (x, y) ∈ D (where x is the input and y the class) and a test observation z = (x′, y′), the distance between z and all training observations in D is computed by the algorithm to determine the nearest-neighbor list D_z. The test observation is then classified by majority voting, based on the classes of the k nearest neighbors:

\hat{y} = \arg\max_{v} \sum_{(x_i, y_i) \in D_z} I(v = y_i),

where v is a class label, y_i is the class of the i-th nearest neighbor, and the indicator function I(·) returns 1 if the argument is true and 0 if false.

The choice of k affects the flexibility of the k-NN classifier. When k = 1, the classifier is overly flexible and the training error becomes zero, since every training observation chooses itself as its nearest neighbor. As k grows, the decision boundary of the classifier slowly becomes less flexible, until it is close to linear (James et al., 2013).

The MNIST Fashion benchmark showed that k = 5 resulted in the highest accuracy on the original (and balanced) data set (Xiao et al., 2017). It should be noted that the best performing model used the Minkowski distance instead of the Euclidean distance, but the model using Euclidean distance was very close in performance. Therefore, for the sake of simplicity, the Euclidean distance is used for the models in this thesis.
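The full algorithm fits in a few lines of NumPy. A hedged sketch (images are assumed to be flattened into vectors first; the thesis does not publish its implementation):

```python
import numpy as np
from collections import Counter

def knn_predict(x_train, y_train, x_test, k=5):
    """Classify one test observation by majority vote among its k nearest neighbors."""
    dists = np.sqrt(np.sum((x_train - x_test) ** 2, axis=1))  # Euclidean distances
    nearest = np.argsort(dists)[:k]                           # indices of the k closest
    return Counter(y_train[nearest]).most_common(1)[0][0]     # majority vote
```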

3.7 Data Manipulation

This experiment aims to compare how the CNN and the k-NN algorithm perform on increasingly imbalanced data sets. The data sets are created from the original MNIST Fashion data by following a modified randomized sampling process. The performance will be evaluated by inspecting how the algorithms classify the minority classes.

To compare how the CNN and k-NN were affected by different degrees of class imbalance, five different training, validation, and test subsets were created from the original training and test sets. The MNIST Fashion data set contained a training set of 60 000 images and a test set of 10 000. A validation set of 10 000 images (1000 of each class) was randomly split from the training set.

The goal was to create five different imbalanced data sets for the training, validation and test sets, while keeping the same total number of observations across the sets. The sample size had to remain even across the different sets, since machine learning algorithms tend to perform better with more data. Five of the ten classes were chosen to be minority classes, i.e. occurring less often than the other five. The approach was to create a highly imbalanced training set by restricting the sample size to 25 000, and to use observations from the remaining 25 000.

Training set Observations (n) Minority (class 0-4) Majority (class 5-9)

Set 1 25 000 12 500 (50%) 12 500 (50%)

Set 2 25 000 9 375 (37.5%) 15 625 (62.5%)

Set 3 25 000 6 250 (25%) 18 750 (75%)

Set 4 25 000 3 125 (12.5%) 21 875 (87.5%)

Set 5 25 000 1 550 (6.2%) 23 450 (93.8%)

Table 2: Training set sample size

However, if every set of 25 000 observations had been sampled independently from the set of 50 000, an element of randomness would be involved. In a worst-case scenario, training set 1 and training set 2 could sample almost completely different observations, which could invalidate the comparison of model performance on the two. To solve this, an algorithm was created to sequentially build each set based on the same base set (training set 1). To illustrate:

(1) From the original training set of 50 000 observations, reduce the total number of images to 25 000, still with 10 balanced classes (2 500 each). This is training set 1.

(2) From the unused 25 000 images, randomly sample 625 for each of classes 5, 6, 7, 8 and 9. Add these observations to training set 2 along with all observations from training set 1. Randomly remove 625 observations from each of classes 0, 1, 2, 3 and 4. This is training set 2.

(3) From the remaining 21 875 images, randomly sample 625 for each of classes 5, 6, 7, 8 and 9. Add these observations to training set 3 along with all observations from training set 2. Randomly remove 625 observations from each of classes 0, 1, 2, 3 and 4. This is training set 3.

(4) Repeat for training sets 4 and 5 (a code sketch of this procedure follows).
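A hedged sketch of one such step, under stated assumptions: the function name, the seed and the index bookkeeping are illustrative, and the final step towards set 5 evidently moved fewer observations per class, since 1 550 minority observations remain (see Table 2):

```python
import numpy as np

rng = np.random.default_rng(seed=2)

def next_imbalanced_set(current_idx, unused_idx, labels, n_move=625,
                        minority=range(5), majority=range(5, 10)):
    """Build the next training set from the previous one: for each majority class,
    add n_move unused observations; for each minority class, remove n_move."""
    current_idx, unused_idx = list(current_idx), list(unused_idx)
    for c in majority:
        pool = [i for i in unused_idx if labels[i] == c]
        picked = set(rng.choice(pool, size=n_move, replace=False).tolist())
        current_idx += list(picked)
        unused_idx = [i for i in unused_idx if i not in picked]
    for c in minority:
        members = [i for i in current_idx if labels[i] == c]
        dropped = set(rng.choice(members, size=n_move, replace=False).tolist())
        current_idx = [i for i in current_idx if i not in dropped]
    return current_idx, unused_idx  # the training set stays at 25 000 observations
```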

Figure 12 shows all sets and their class imbalances.


Figure 12: Visualization of the percentages of each class.

The same steps were used to create the five validation sets and the five test sets from their respective 10 000 images.

3.8 Evaluation Metrics

3.8.1 Confusion Matrix

The Confusion Matrix is a table layout that shows the performance of classification algorithms. The classifications are divided into four groups: True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN) (Sokolova & Lapalme, 2009). See Table 3.

              Predicted class
True class      0     1
     0          TN    FP
     1          FN    TP

Table 3: Confusion matrix.


Precision is the percentage of all classified positives that are true positives:

\text{Precision} = \frac{TP}{TP + FP}.

Macro-average-precision is the average per-class agreement of the class labels with the predicted classes (Sokolova & Lapalme, 2009). It is more suitable for imbalanced multi-class classification, since all classes contribute equally regardless of the imbalance. It is defined as

\text{Macro-average-precision} = \frac{P_A + P_B + P_C}{\text{number of classes}},

where P_A, P_B, P_C are the precision values for classes A, B, C.

Recall is the percentage of all positives that are classified as positives:

\text{Recall} = \frac{TP}{TP + FN},

which in a multi-class setting can be used to find the macro-average-recall:

\text{Macro-average-recall} = \frac{R_A + R_B + R_C}{\text{number of classes}},

where R_A, R_B, R_C are the recall values for classes A, B, C.

The F1-Score is a combination of precision and recall (Sokolova & Lapalme, 2009), defined as

\text{F1-Score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}},

which in a multi-class setting is extended to the macro-average-F1-Score:

\text{Macro-average-F1-Score} = \frac{F1_A + F1_B + F1_C}{\text{number of classes}}.
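All three macro-averaged metrics can be computed directly from a multi-class confusion matrix. A hedged sketch (it assumes every class is predicted at least once, so no zero divisions occur):

```python
import numpy as np

def macro_metrics(conf):
    """Macro-averaged precision, recall and F1 from a confusion matrix conf,
    where conf[i, j] counts observations of true class i predicted as class j."""
    tp = np.diag(conf).astype(float)
    precision = tp / conf.sum(axis=0)   # column sums: everything predicted as class j
    recall = tp / conf.sum(axis=1)      # row sums: everything truly of class i
    f1 = 2 * precision * recall / (precision + recall)
    # Macro-averaging weights every class equally, regardless of class size.
    return precision.mean(), recall.mean(), f1.mean()
```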


4 Results

To choose the hyperparameters for the CNNs, a CNN was first trained on training set 1 and evaluated on validation set 1. The model was created with a 32-filter convolutional layer and a ReLU activation function. The filter size 3x3 was chosen over 5x5, since it resulted in a better performance on the validation set. A dropout layer with rate 0.2 was chosen after yielding the best performance among the rates (0.1, 0.2, 0.3, 0.4, 0.5) on the validation set. The first fully connected layer has 32 neurons with a ReLU activation function and the second has 10 neurons with a softmax activation function. Lastly, the CNN was evaluated on the five unseen test sets. The k-NN used k = 5 neighbors and the Euclidean distance.
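A minimal Keras sketch consistent with this description; the thesis does not publish its code, so the layer ordering around Dropout/Flatten and the optimizer are assumptions:

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    layers.Dropout(0.2),                     # dropout rate 0.2, as selected above
    layers.Flatten(),
    layers.Dense(32, activation="relu"),     # first fully connected layer
    layers.Dense(10, activation="softmax"),  # one probability per clothing class
])
model.compile(optimizer="adam",              # placeholder: the thesis does not state it
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```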

Figure 13 and Table 4 show how the macro-average-precision changed as classes became more imbalanced. For the CNN, it dropped from 0.91 to 0.85. For the k-NN, it dropped for sets 3 and 4, but ended up at 0.87 for set 5, which was the same value as for set 1.

Figure 13: Macro-average-precision on each test set.

Set (Balance %)   1 (50/50)   2 (37.5/62.5)   3 (25/75)   4 (12.5/87.5)   5 (6.2/93.8)

Table 4: Macro-average-precision for CNN and k-NN.

Figure 14 and Table 5 show how the macro-average-recall decreases from 0.91 to 0.88 for the CNN, while the k-NN drops from 0.83 to 0.71.

Figure 14: Macro-average-recall on each test set.

Set (Balance %)   1 (50/50)   2 (37.5/62.5)   3 (25/75)   4 (12.5/87.5)   5 (6.2/93.8)
CNN               0.91        0.91            0.91        0.88            0.88
k-NN              0.83        0.84            0.84        0.77            0.71

Table 5: Macro-average-recall for CNN and k-NN.

Figure 15 visualizes the macro-average-F1-Score for the CNN and the k-NN. As seen in Table 6, the CNN score drops from 0.91 to 0.86 as the imbalance increases, while the k-NN drops from 0.84 to 0.76.


Figure 15: Macro-average-F1-Score on each test set.

Set (Balance %)   1 (50/50)   2 (37.5/62.5)   3 (25/75)   4 (12.5/87.5)   5 (6.2/93.8)
CNN               0.91        0.91            0.90        0.88            0.86
k-NN              0.84        0.85            0.83        0.79            0.76

Table 6: Macro-average-F1-Score for CNN and k-NN.


5 Discussion

The results showed that for the CNN, the macro-average-precision and macro-average-recall decreased, along with the combined metric macro-average-F1-Score. If the true positives of the minority classes are of great importance, the macro-average-recall is the metric of greatest interest. In that case, a drop from 0.91 to 0.88 means that the CNN did not lose much predictive performance in classifying true positives correctly. Conversely, the k-NN algorithm had a substantial drop in macro-average-recall, from 0.83 to 0.71, which indicates that if the true positives of the minority classes are of most interest, then the CNN should be used instead of the k-NN algorithm. However, the k-NN did not decrease in macro-average-precision, which indicates that the k-NN is more robust in terms of precision, even when classes are imbalanced. Lastly, the macro-average-F1-Score suggests that the CNN is the better performing model overall, but also that the F1-Score for the CNN and the k-NN decrease by almost the same amount as the sets go from balanced to highly imbalanced.

In order to create the imbalanced data sets, a sampling algorithm had to be created. Multiple decisions had to be made, which could have had a great impact on the results. The first decision was to reduce the total size of the data sets, to make it possible to add samples of specific classes in order to make them imbalanced. This could instead have been done by keeping the original sets, randomly sampling duplicates of the majority classes and randomly deleting observations from the minority classes. The problem with this approach is that for the last set, where almost all observations belong to classes five to nine, almost every single observation would have a duplicate. This could have made the majority classes easier for the models to predict, while the minority classes remained relatively more difficult. To avoid this, the data sets were instead halved in size, and the unused half was used to create imbalanced classes with no duplicates. This solution also has some downsides. For example, fifty percent of the observations from the minority classes (0, 1, 2, 3, 4) were lost, and therefore the models lose valuable data from which they could have learned. Furthermore, some of these deleted observations could have been easier or harder to predict than others, and we do not know which of them ended up in the sets. However, due to the large size of the original data set, it is unlikely that a randomized half of the data differs much from the other half.

The second decision was to build each set with respect to the prior set. For example, training set 2 was a copy of training set 1 but with randomly deleted observations from classes 0-4, and randomly added observations from classes 5-9.

An alternative approach would have been to simply sample the desired imbalanced sets directly from the original set. To illustrate, given 50 000 training images, training set 1 (which is balanced) could have been built by sampling 2 500 of each class. Subsequently, training set 2 would have been created by sampling 1 875 of each minority class and 3 125 of each majority class. The reason for using the first approach was that it minimized the probability of ending up with different observations in the sets, especially since the sampling process had to be used for five sets of training, validation and test data, which is a total of 15 times.

As mentioned in Section 3.6, the k-NN hyperparameter k was chosen based on the benchmark results from the Zalando Research Team. This value of k was used for all k-NN models, even though k = 5 was only proven to yield the best accuracy on the first, balanced, data set. The CNN hyperparameters were also tuned on the first set, and then used throughout the sets. It is hard to say whether the optimal hyperparameters would differ as the sets became more imbalanced, but it is a possibility. However, changing the network layer structure and the k-NN value k for each set could also have led to an unfair comparison. Therefore, additional research could be done in this area to see if the optimal hyperparameters change as classes become more imbalanced.

Undoubtedly, the network hyperparameters play a big role in the results of the CNN, and the results are difficult to generalize across all possible CNN structures. Also, there are many different hyperparameters that can be tuned for CNNs, such as the learning rate, the number of hidden layers, the dropout rate, the number of convolutional layers, and more. Due to computational requirements and time restrictions, it is not feasible to evaluate every possible network.

As stated in the introduction, there are multiple solutions for treating an imbalanced data set before feeding it into a classifier. In this thesis, it was a conscious decision to evaluate model performance on imbalanced classes without attempts to solve the imbalance first. Further research could replicate this study, but first treat the training sets with oversampling or undersampling. Doing this, and also including more classification algorithms, would yield interesting results on how algorithms perform when trained on a balanced training set and then evaluated on increasingly imbalanced test data.


6 Software

The coding for this study was done in the programming language Python 3 (Van Rossum & Drake, 2009). The TensorFlow and Keras (Chollet et al., 2015) packages were used to create the CNNs. Pandas (pandas development team, 2020) and NumPy (Harris et al., 2020) were used throughout the data cleaning and sampling process.


References

Ali, N., Neagu, D., & Trundle, P. (2019). Evaluation of k-nearest neighbour classifier performance for heterogeneous data sets. SN Applied Sciences, 1(12), 1559. https://doi.org/10.1007/s42452-019-1356-9

Chollet, F., et al. (2015). Keras. https://github.com/fchollet/keras

Ganganwar, V. (2012). An overview of classification algorithms for imbalanced datasets. International Journal of Emerging Technology and Advanced Engineering, 2(4), 42–47.

Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N. J., Kern, R., Picus, M., Hoyer, S., van Kerkwijk, M. H., Brett, M., Haldane, A., del Río, J. F., Wiebe, M., Peterson, P., ... Oliphant, T. E. (2020). Array programming with NumPy. Nature, 585(7825), 357–362. https://doi.org/10.1038/s41586-020-2649-2

Hoang, G., Bouzerdoum, A., & Lam, S. (2009). Learning pattern classification tasks with imbalanced data sets. In Pattern Recognition (pp. 137–144). InTech. https://doi.org/10.5772/7544

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning (Vol. 103). Springer New York. https://doi.org/10.1007/978-1-4614-7138-7

Khan, S., Rahmani, H., Shah, S. A. A., & Bennamoun, M. (2018). A Guide to Convolutional Neural Networks for Computer Vision (Vol. 8). https://doi.org/10.2200/s00822ed1v01y201712cov015

Khatami, A., Babaie, M., Khosravi, A., Tizhoosh, H. R., & Nahavandi, S. (2018). Parallel deep solutions for image retrieval from imbalanced medical imaging archives. Applied Soft Computing Journal, 63, 197–205. https://doi.org/10.1016/j.asoc.2017.11.024

LeCun, Y., & Cortes, C. (2010). MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/

pandas development team, The. (2020). pandas-dev/pandas: Pandas. Zenodo. https://doi.org/10.5281/zenodo.3509134

Rawat, W., & Wang, Z. (2017). Deep convolutional neural networks for image classification: A comprehensive review. Neural Computation, 29(9), 2352–2449. https://doi.org/10.1162/neco_a_00990

Sokolova, M., & Lapalme, G. (2009). A systematic analysis of performance measures for classification tasks. Information Processing and Management, 45(4), 427–437. https://doi.org/10.1016/j.ipm.2009.03.002

Van Rossum, G., & Drake, F. L. (2009). Python 3 Reference Manual. CreateSpace.

Venkatesan, R., & Li, B. (2017). Convolutional Neural Networks in Visual Computing: A Concise Guide. CRC Press.

Verdhan, V. (2021). Computer Vision Using Deep Learning: Neural Network Architectures with Python and Keras. Apress.

Wang, S., Liu, W., Wu, J., Cao, L., Meng, Q., & Kennedy, P. J. (2016). Training deep neural networks on imbalanced data sets. 2016 International Joint Conference on Neural Networks (IJCNN), 4368–4374. https://doi.org/10.1109/IJCNN.2016.7727770

Wu, X., Kumar, V., Ross Quinlan, J., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G. J., Ng, A., Liu, B., Yu, P. S., Zhou, Z.-H., Steinbach, M., Hand, D. J., & Steinberg, D. (2008). Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1), 1–37. https://doi.org/10.1007/s10115-007-0114-2

Xiao, H., Rasul, K., & Vollgraf, R. (2017). Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv. http://arxiv.org/abs/1708.07747

Zafar, I., Tzanidou, G., Burton, R., Patel, N., & Araujo, L. (2018). Hands-On Convolutional Neural Networks with TensorFlow. Packt Publishing.
