
IT 16 066

Degree project (Examensarbete), 30 credits

September 2016

Tool-Mediated Texture Recognition

Using Convolutional Neural Network

Ziring Tawfique


Abstract

Tool-Mediated Texture Recognition Using

Convolutional Neural Network

Ziring Tawfique

Vibration patterns can be captured by an accelerometer sensor attached to a hand-held device when it is scratched over various types of surface textures. These acceleration signals carry information that is relevant for surface texture classification. Typically, methods rely on hand-crafted feature engineering, but with a Convolutional Neural Network manual feature engineering can be eliminated. A method using modern machine learning techniques such as Dropout is proposed, in which a Convolutional Neural Network is trained to distinguish between 69 and 100 different surface textures, respectively. EHapNet, the proposed Convolutional Neural Network model, achieved state-of-the-art results on the datasets used.

Printed by: Reprocentralen ITC
IT 16 066


Acknowledgment


Contents

Acknowledgment . . . i

List of Figures . . . vii

List of Tables . . . viii

1 Introduction 1
1.1 Motivation . . . 1

1.2 Background . . . 1

1.3 Project Aim . . . 2

1.4 Project Scope . . . 3

2 Overview of Machine Learning 4
2.1 Background . . . 4

2.2 Learning . . . 5

2.3 Weight and Bias . . . 5

2.4 Cost Function . . . 7

2.5 Activation Function . . . 7

2.6 Overfitting . . . 8

3 Literature Review 9
3.1 Deep Learning . . . 9


3.3 Convolutional Neural Network . . . 10

3.3.1 Properties of CNN . . . 11

3.3.2 Network Architecture . . . 13

3.4 Cross-Entropy Cost Function . . . 17

3.4.1 Minimizing The Cost . . . 19

3.5 Why Rectified Linear Unit (ReLU) . . . 20

3.6 Backpropagation . . . 21

3.7 Related Work . . . 23

4 Datasets 25
4.1 LMT Haptic Texture Database (LMT) . . . 26

4.1.1 Recording Process . . . 27

4.2 PENN Haptic Texture Toolkit (HaTT) . . . 27

4.2.1 Recording Process . . . 27

5 Methodology 29
5.1 Framework . . . 29

5.2 Pre-Processing . . . 29

5.2.1 Converting xml files to text files . . . 29


6 Results 43

6.1 Hardware . . . 43

6.2 Experiments with LMT Dataset . . . 43

6.2.1 State of The Art Models . . . 43

6.2.2 Improving Network Performance . . . 45

6.2.3 Number of Output Channels . . . 49

6.2.4 Learning Rate . . . 49

6.2.5 Other Modifications & Increasing Regularization . . . 50

6.3 Experiments with HaTT Dataset . . . 52

6.3.1 Data Division . . . 52
7 Discussion 55
7.1 LMT Dataset . . . 55
7.1.1 Extra Layers . . . 55
7.1.2 Kernel Size . . . 56
7.1.3 Batch Size . . . 56

7.1.4 Number of Output Channels . . . 56

7.1.5 Data Shuffling . . . 57
7.2 HaTT Dataset . . . 57
7.2.1 Data Division . . . 57
7.2.2 Learning Rate . . . 58
8 Conclusion 59
8.1 Conclusion . . . 59
8.2 Future Work . . . 60


B Code for generating spectrogram images 63

C Implementation Code 70

D Experiment Results 83


List of Figures

1.1 System Architecture . . . 3

2.1 Neural Network [42]. . . 4

2.2 Outline process of an artificial neuron [43]. . . 6

2.3 Activation function including a bias term . . . 6

2.4 Activation function excluding a bias term. . . 7

3.1 Topology of Convolutional Neural Network [14] . . . 10

3.2 Local Receptive Field and Filter in CNN [24] . . . 11

3.3 Convolution Operation [9] . . . 12

3.4 Convolutional Layer with feature maps [47]. . . 13

3.5 Max-Pooling Operation [24] . . . 14

3.6 Derivative curve of the sigmoid activation function [33] . . . 20

3.7 Relu Activation Function y(S) = max(0, S) . . . 21

3.8 HapticNet [10] . . . 24

4.1 A typical acceleration data trace with constant normal force; as the velocity increases (going from left to right), the signal power and variance increase [39] . . . 25

4.2 Hardware setup for texture recording in controlled environment [40] . . . 26

4.3 Hand-held device for texture recording [39] . . . 27


5.1 Brick in LMT (left), Aluminum in HaTT (Right) . . . 31

5.2 Example of a 50 x 250 image of the training set . . . 32

5.3 Selecting the appropriate learning rate [46] . . . 37

6.1 Training progress of up to 1270 steps for three various models. . . 47

6.2 Training progress of adding an extra fully connected layer to Experiment C . . . 47
6.3 Training progress for modifying the kernel sizes in the convolutional layers . . . 48
6.4 Training progress for modifying the number of output channels in the convolutional layers . . . 49

6.5 Evaluation progress for changing learning rate from 0.01 to 0.001 . . . 50

6.6 Loss convergence for new baseline model . . . 51

6.7 Loss convergence for EHapNet . . . 51

6.8 Error rate as training progresses for both training and testing set . . . . 53

6.9 Loss convergence for EHapNet for 100 classes from the HaTT Dataset . . 54


List of Tables

3.1 Neural Network 1 . . . 18

3.2 Neural Network 2 . . . 18

5.1 2 training set images for class 0, 1 and 68. . . 33

5.2 2 training set images for class 0, 1 and 68 (with trivial shuffling function) . . . 34
5.3 EHapNet Architecture . . . 40

6.1 Training Progress for CIFAR-10 Model . . . 44

6.2 MNIST Model Architecture . . . 45

6.3 Training Progress for MNIST Model . . . 45

6.4 The hyper-parameters for EHapNet . . . 52


Chapter 1

Introduction

1.1

Motivation

User experience on mobile devices has developed and improved enormously. As the world moves towards an ever more technology-oriented culture, end-user devices play an increasingly significant role in the IT field. However, several practical applications available today still have limitations, as technology is born incomplete.

It is certain that the sense of touch is significant for understanding the real world. How to enhance the user interfaces of today's systems has long been an important topic in human-computer interaction. E-commerce applications typically display an image of the product (e.g. an IKEA sofa) in, at most, a 360-degree view. Hopefully, with the combination of embedded computer devices such as sensors and tactile interfaces, a next level of user interaction can be established. With a system that can recognize a particular texture and later render the feeling of that texture, online shoppers would not only be able to view products, but also get a tactile impression of their texture.

1.2

Background

A haptic interface can be defined as a tactile interface, which generates a touch sensation to a user when exploring the screen. As opposed to only allowing the user to feed in information to the device, a haptic device can be used to allow the device to output information in a touch sensation form to the user.

For example, haptic feedback can allow application icons or buttons on the screen to be easily distinguished. Beyond that, one could easily imagine how fascinating haptic interfaces could be for the gaming industry. The TPad Phone, which is a regular Motorola Moto G Android smartphone, is an example of a haptic device with a variable friction display attached on top of it [16]. The device can adjust the friction between the surface of the display and the fingertip. This differs from vibration because, by varying the friction level, a resistance force can be applied to the user's finger when exploring the TPad display. For instance, the resistance force could let a slider give a sort of push-back feeling before unlocking the mobile device, which generates a more realistic feeling [7].

In relation to a surface texture context, scratching a rigid hand-held tool over the surface of an object produces vibration patterns that can be captured by an accelerometer sensor attached to the device. This vibration can be reproduced to render a realistic feeling of the object's texture. Joseph Romano's and Katherine Kuchenbecker's [26] study involved capturing acceleration, position and contact force data of the tool over time. The captured data was used to render a simulated feeling of 8 different surface textures using voice coil actuators.

Unlike capturing data in a controlled environment, where most of the parameters such as the scan velocity, the applied force of the tool and the angle between the tool and the surface are kept constant, capturing data using a hand-held tool is less controlled. These parameters are not controlled when a human explores a surface with a hand-held tool, since they vary over the exploration time. Therefore, to mimic a more realistic context, this thesis only considers data from a hand-held device.

Moreover, a critical step in machine learning is feature engineering. Feature engineering involves manually extracting key features from the data before applying classification to the particular problem. Typically, this is the approach and part of the pre-processing stage in previous machine learning methods. However, a significant property of convolutional neural networks is that they act as a feature extractor as well as a classifier. A CNN automatically learns valid features, in contrast to previous methods that emphasize designing task-specific, hand-crafted features [11].

1.3

Project Aim


Figure 1.1: System Architecture

The aim of this project is to build a system that can recognize surface textures from vibration signals captured by a sensor such as an accelerometer, so that they can later be rendered on a haptic device such as the TPad Phone.

1.4

Project Scope

A classification problem is when a machine is trained to predict which class a particular input belongs to. Texture classification problems are usually tackled by using images of the texture as the source of input data. However, for this thesis, generated spectrograms of haptic signals, i.e. acceleration signals, are used to classify textures. The datasets used in this research are the LMT Haptic Texture Database provided by Technische Universität München [41] and the Penn Haptic Texture Toolkit (HaTT) provided by the Haptic Group at the University of Pennsylvania [15].

The scope of this thesis is to investigate various convolutional neural network (CNN) models, in order to arrive at a model that can hopefully classify textures with state-of-the-art results. Initially, the model will be trained and tested on the LMT Haptic Texture database, which contains 69 textures. To extend the usability of the model, it will also be trained and tested on the Penn Haptic Texture database, which contains 100 textures.


Chapter 2

Overview of Machine Learning

Figure 2.1: Neural Network [42].

2.1

Background

Psychologists have for many years studied learning in animals and humans. Computer scientists have tried to adopt similar techniques, derived from psychologists, to teach machines to be more intelligent. It is fair to say that many of the concepts and techniques used in machine learning are inspired by aspects of biological learning.

The output of the first hidden layer is passed on to the next hidden layer and finally to the output layer, assuming there are only two hidden layers in the network.

2.2

Learning

Learning can be done in different ways, such as supervised learning, unsupervised learning or reinforcement learning. For this thesis, supervised learning is used to train a network to classify different surface textures. In supervised learning, learning is established by providing a labeled training data set. This means that each input vector and its corresponding target value is provided to the network during training.

The goal of supervised learning is for the network to find and update its weights to minimize the error between the actual network output prediction, p, and the desired target value, t. To find such weights, an optimization algorithm is used. The network is later tested with a separate labeled dataset called the testing set, which has not been seen by the network during training, to evaluate its classification accuracy.

2.3

Weight and Bias

The weight value expresses the importance of an input to the output prediction; it is used to strengthen or weaken the input signal, x. The bias b, shown in equation 2.1, is considered an extra fixed input value that also has a weight value, w0, associated with it. The bias can be considered to act similarly to the constant y-intercept, c, in the linear equation shown in equation 2.2, where c would equal the constant bias.

b = w0 · biasValue, where biasValue typically equals 1   (2.1)

y = mx + c, where c = b   (2.2)

As shown in Figure 2.2 on page 6, the weight and bias values are linearly combined with the input values; this is referred to as the net input function. Typically, the net input is passed through a non-linear activation function, f(netInput), whose result is in turn passed on to the output layer to return the probability of the input belonging to a specific class.

Figure 2.2: Outline process of an artificial neuron [43].

Including a bias value is important because it shifts the activation function to the right or left along the x-axis, as seen in Figure 2.3. If the bias were excluded, only the steepness of the activation function would change, as shown in Figure 2.4 on page 7. Thus, having a bias term may allow the network to learn significantly better, because it can fit the prediction values to the data more closely. If one wants the network to output 0 when the input is 2, then the graph needs to be able to shift and not only change its steepness.

netInput = x1w1 + x2w2 + · · · + xmwm + b   (2.3)

f(netInput) = f(x1w1 + x2w2 + · · · + xmwm + b)   (2.4)
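As a concrete illustration of equations 2.3 and 2.4, the minimal Python sketch below computes the net input and a sigmoid activation for a single artificial neuron; the input values, weights and bias are hypothetical and chosen only for the example.

import numpy as np

def neuron_output(x, w, b):
    net_input = np.dot(w, x) + b                 # equation 2.3: weighted sum of inputs plus bias
    return 1.0 / (1.0 + np.exp(-net_input))      # equation 2.4 with a sigmoid activation

x = np.array([0.5, -1.0, 2.0])     # hypothetical inputs
w = np.array([0.1, 0.4, -0.3])     # hypothetical weights
print(neuron_output(x, w, b=0.0))  # without a bias term
print(neuron_output(x, w, b=1.5))  # the bias shifts the activation along the net-input axis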


Figure 2.4: Activation function excluding a bias term.

2.4

Cost Function

A cost function, also known as the objective function or loss function, can be used to quantify how well the network performs as it updates its weight values. The lower the cost of the network model, the better the network will perform. In other words, if the learning algorithm finds weights during training such that the cost is close to zero, it is fair to say that it has performed well. To calculate the cost, a comparison is made between the actual network output prediction p and its specified target value t. Ideally, the aim is for them to be equal, but if p > t then the connections should be weakened, in other words the weighted sum should be decreased. On the other hand, if p < t then the weighted sum should be increased. In this thesis, the cost function used was the cross-entropy cost function, which can be seen in equation 3.4 on page 17.

2.5

Activation Function

There are several types of activation functions that can be used, such as the step function, the logistic sigmoid function or the rectified linear unit (ReLU) function. In this thesis, a deep neural network is used rather than a shallow neural network; thus, the activation function used is the Rectified Linear Unit (ReLU), as it is currently the most popular activation function for training deep neural networks with large and complex data. It is argued that it is more effective, faster and a better model of a biological neuron [4].


The ReLU activation function is defined as f(netInput) = max(0, netInput). More information on why the ReLU activation function is used will be discussed in section 3.5.

2.6

Overfitting

One major problem that occurs when training a neural network is called overfitting or overtraining. This occurs when the network model tries so hard to capture the real structure of the training set that it starts to treat noise as relevant information and therefore performs badly when presented with the test set. The worst case of training that might cause overfitting is a neural network with too many neurons (i.e. weights) that is trained for too long on too little data. There are several techniques for trying to avoid overfitting:

1. Increasing the size of the training set if the network has too many neurons.
2. Dividing the data set into three sets: a training set, a validation set and a testing set.
3. Reducing the size of the network, i.e. reducing the number of weights.

4. Using a modern technique referred to as Dropout, which will be explained in section 5.3.2


Chapter 3

Literature Review

3.1

Deep Learning

Deep learning is a sub-field of machine learning with the objective of moving the machine learning field closer to one of its original goals, Artificial Intelligence. It can be defined as "a class of machine learning techniques that exploit many layers of non-linear information processing for supervised or unsupervised feature extraction, pattern analysis and classification" [48].

For the past decades, most machine learning techniques have used regular neural networks with a shallow architecture that usually contains one or two hidden layers (i.e. non-linear layers), such as the multilayer perceptron. Regular neural networks have performed well on simple problems, but when it comes to more complicated real-world problems involving natural signals, such as speech recognition or image recognition, their models show limitations. Hence the need for deep architectures that can take complex data and break it down into less complex, more understandable representations.

For instance, face detection can be broken down into several layers, where one layer is used to detect eyes in the image, another layer to detect if there is a mouth. These layers can in fact be broken down even further; the first mentioned layer could be broken down to detect eyelashes. Together, these layers can determine whether or not there exists a face in the image. The result of such a network is a multi-layer structure, which is referred to as a deep neural network [33].

The early layers of a deep network are responsible for extracting the most basic features, and if they get it wrong the accuracy of the network is extremely low [33]. Section 3.3.2 will discuss how to overcome this problem.

3.2

Why Deep Neural Network?

The architecture of a regular neural network is made up of an input layer, hidden layers and an output layer. Each hidden layer consists of a set of neurons that are fully connected to the previous layer. The output layer is the final fully connected layer and typically outputs the class score for a particular input. With that being said, one can argue that a regular neural network does not scale well to large inputs such as images: since each neuron is fully connected to every neuron in the previous layer, there would be a very large number of parameters, which can easily cause the network to overfit. A regular neural network might manage small images with a height and width of 32 pixels. This would mean that each neuron in the first hidden layer has 32 × 32 × 3 = 3072 weights, assuming the image is an RGB image. One can imagine that this does not scale up to larger images with a height and width of 200 pixels [24]; this would simply result in an enormous number of weights.

Therefore, a deep neural network is preferably used when working with complex data such as accelerometer signals or images. It is also better to use a deep neural network due to several of its properties, which are discussed in the following section 3.3.1.

3.3

Convolutional Neural Network

Figure 3.1: Topology of Convolutional Neural Network [14]


The main idea of a CNN is to extract local features at a high resolution and later combine these features into more complex features at a lower resolution. However, due to its deep network architecture it is computationally expensive, and training on a large dataset with high-resolution images could take several weeks to finish if it is performed on the CPU. Therefore, such networks are preferably trained on the GPU, which reduces the training time enormously [3].

3.3.1

Properties of CNN

Local Receptive Field

The architecture of a CNN reduces the number of parameters in the network because each layer consists of neurons that are arranged in 3 dimensions (height, width, depth). The neurons in a layer are only connected to a small local region of the previous layer, which is referred to as the local receptive field. In other words, each neuron receives its input from a set of neurons located in a small region of the previous layer, and not from all the neurons in the previous layer.

Each local receptive field, also known as a filter or kernel, has a predefined width and height selected by the experimenter, which is referred to as the filter/kernel size. Figure 3.2 shows a filter with a filter size of 5 x 5 that connects a set of neurons to a small local region of the input image of size 32 x 32 x 3 (3 is the depth of the RGB input image, which has 3 channels: red, green and blue). It is also worth mentioning that the filters do extend through the full depth of the input.

Therefore, by having such a structure, a CNN is able to extract various significant features, such as edges, corners or regions of particularly interesting colors, from the input by applying its filters to the image. Moreover, since each neuron is not connected to every single neuron in the previous layer, the reduced number of connections helps in overcoming the overfitting problem.


Figure 3.3: Convolution Operation [9]

Convolution Operation

Every element of a filter is used at every position of the input. In other words, the filter scans the entire input, moving by a hyper-parameter referred to as the stride length, and computes the sum of products between the input image pixels and the entries of the filter at each position. This process is referred to as the convolution operation, and the outcome is a plane (feature map) composed of the results. Figure 3.3 gives a graphical explanation of the convolution operation: a convolutional filter is used to scan the input image for edges and the result is a convolved feature map.

Each filter produces a separate 2-dimensional feature map and is initialized randomly with different weights, so that different filters look for different features in the input. The combination of these features can determine higher-level features such as a face.
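To make the convolution operation concrete, the sketch below implements it naively in numpy for a single-channel input and a single filter (CNN libraries actually compute a cross-correlation, which is what is shown here); the image and the edge filter are hypothetical.

import numpy as np

def convolve2d(image, kernel, stride=1):
    # Slide the kernel over the image and take the sum of element-wise
    # products at every position; the result is one feature map.
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.random.rand(32, 32)                 # hypothetical single-channel input
vertical_edge = np.array([[1., 0., -1.],
                          [1., 0., -1.],
                          [1., 0., -1.]])      # a hand-crafted vertical-edge filter
feature_map = convolve2d(image, vertical_edge) # one 30 x 30 feature map
print(feature_map.shape)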

Sharing Weight and Bias


Figure 3.4: Convolutional Layer with feature maps [47].

3.3.2

Network Architecture

The architecture of a simple CNN typically consists of:

1. Convolutional Layer.
2. Non-linearity Layer.
3. Subsampling Layer/Pooling Layer.
4. Fully Connected Layers.
5. Output Layer.

Convolutional Layer

It is in the convolutional layer that the learnable filters reside and the convolution operation is performed. In Figure 3.4, the convolutional layer has 4 filters that produce 4 feature maps. As the input image is passed forward from layer to layer, the filters in each convolutional layer perform a convolution operation to produce the 2-dimensional feature maps. As already stated, each filter produces a separate feature map, and all feature maps produced are stacked along the depth dimension. Each feature map consists of a set of neurons, and the number of neurons in a convolutional layer is determined by the following hyper-parameters [24]:

1. Depth: this corresponds to the number of filters that the experimenter selects.

2. Stride: the sliding length of the filter as it moves when scanning the input. If the stride is 1, the filter moves one pixel at a time.

3. Zero padding: this refers to padding the input volume with zeros around the border in order to control the width and height of the output when a convolution operation is made.
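These hyper-parameters, together with the filter size, determine the spatial size of the resulting feature maps through the standard relation (input − filter + 2·padding)/stride + 1; a small sketch with example values:

def conv_output_size(input_size, filter_size, stride, zero_padding):
    # spatial width (or height) of a feature map produced by a convolutional layer
    return (input_size - filter_size + 2 * zero_padding) // stride + 1

print(conv_output_size(32, 5, 1, 0))  # 32-pixel input, 5 x 5 filter, stride 1, no padding -> 28
print(conv_output_size(32, 5, 1, 2))  # zero padding of 2 preserves the size -> 32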

Non-linearity Transformation Layer

As the filters perform several convolution operations on the input, it produces several linear activations. These linear activations are typically passed on through a nonlinear activation function such as a sigmoid or rectified linear unit (ReLU) activation function as explained in section 2.3. This happens in the so-called non-linearity transformation layer.

Subsampling Layer/Pooling Layer

Traditionally, the next layer of a CNN is the pooling layer, which modifies the output of the previous layer further by discarding irrelevant details and preserving important ones. The purpose of the pooling layer is to achieve spatial invariance by reducing the resolution of the feature maps from the previous layer by a hyper-parameter chosen by the experimenter. The result of reducing the spatial size of the feature maps is a reduction in the number of parameters and the amount of computation in the network, which also helps keep the network from overfitting.

There are several pooling operations, such as max pooling and average pooling. However, average pooling has recently fallen out of use, since max pooling has been shown to work better in practice [2]. Max pooling is typically the operation used in the subsampling layer of several state-of-the-art models [28, 44, 20, 13]. It applies a window function to the input and computes the maximum in the neighborhood, as seen in Figure 3.5, where the window size is 2 x 2 and the stride, the length of the window movement, is 2.

Figure 3.5: Max-Pooling Operation [24]

On the contrary, average pooling computes the average of the pixels (xpq) within the window region [45]. Average pooling for a single feature map can be computed as seen in equation 3.1. Either way, after a pooling operation the spatial dimensions of the feature maps are reduced.

averagePooling = (1 / |Rij|) · Σ_{(p,q) ∈ Rij} xpq   (3.1)
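A minimal numpy sketch of both pooling operations on a small, hypothetical 4 x 4 feature map, using a 2 x 2 window and a stride of 2 as in Figure 3.5:

import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    # keep either the maximum or the average (equation 3.1) of each window
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

fm = np.array([[1., 3., 2., 1.],
               [4., 6., 5., 0.],
               [2., 1., 0., 1.],
               [3., 2., 4., 2.]])
print(pool2d(fm, mode="max"))      # [[6. 5.] [3. 4.]]
print(pool2d(fm, mode="average"))  # [[3.5 2.] [2. 1.75]]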

Fully Connected Layers

Regular neural networks require feature engineering, i.e. manually extracting features from the dataset to make it easier for the network to perform classification. For instance, assume the machine learning problem is to indicate whether an image contains sea or land. The pixel colors (bluish vs. greenish) can be used as a feature to assist the network in classifying the image. However, there are some limitations to feature engineering: features that are relevant in one dataset might not be usable in another dataset [8].

One of the significant strengths of a CNN is that there is no need for feature engineering. After several convolutional and pooling layers, the network has automatically extracted relevant features from the input before passing them forward to the classifier part of the network, the fully connected layer.

The fully connected layer consists of neurons that are fully connected to all other neurons in the previous layer just like a regular artificial neural network such as the multilayer perceptron. It is in this layer that the high level reasoning is made.

Output Layer (Softmax)

Classification in machine learning can be described as projecting the input vector onto a set of hyperplanes, each of which corresponds to a class. The distance from the input to a hyperplane is mapped to the probability that the input is a member of the corresponding class. The probability of the input being a member of a specific class is determined by a threshold function inside the neurons in the output layer of the network.


The threshold function for such a problem is typically a so-called unit step function. The function restricts the output value to be either 1 or 0, to determine whether or not the input belongs to the class. However, binary classification is not sufficient for the research problem investigated in this thesis, because the input data must be classified into multiple classes. Therefore, there is a need for a different type of output function in the output layer of the neural network: the Softmax function.

The process is still similar in the sense of calculating the net linear input, but instead of feeding it to a neuron with a unit step function in the output layer, it is placed into a Softmax function, defined as follows.

Softmax(net_i) = e^(net_i) / Σ_j e^(net_j)   (3.2)

• net_i is the weighted sum of inputs for a single neuron i, net_i = Σ_{i=0}^{n} wi·xi.
• j runs over all neurons in the output layer.
• net_j is the weighted sum of inputs of neuron j.
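A minimal numpy implementation of equation 3.2; subtracting the maximum net input before exponentiating is a common numerical-stability trick and does not change the result. The net-input values are hypothetical.

import numpy as np

def softmax(net):
    e = np.exp(net - np.max(net))   # equation 3.2, shifted for numerical stability
    return e / np.sum(e)

net = np.array([2.0, 1.0, 0.1])     # hypothetical weighted sums of three output neurons
p = softmax(net)
print(p)         # approximately [0.659, 0.242, 0.099]
print(p.sum())   # always 1.0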

The sigmoid function can be used in the output layer, but it has fallen out of use, as many state-of-the-art neural network architectures use a Softmax function in the output layer [19, 10, 1, 38, 13]. Even though the sigmoid function performs a non-linear mapping of the linear net input to a number between 0 and 1, which does give it a conditional-probability character, it does not, unlike the Softmax function, have the pleasing property of restricting the sum of the outputs to equal 1. In the Softmax function, the output activations are guaranteed to always sum to 1, because of the sum in the denominator.

This is equivalent to saying that, if the output activation of a neuron increases, the output activation of another neuron must decrease to compensate and ensure that the sum over all activation is equal to 1. Another pleasant property of the Softmax function is that the output activation of a neuron will always be positive because of the exponentials in the equation [33].

Combining both of these observations, the Softmax function can be thought of as producing a better-behaved probability distribution than the sigmoid function. Moreover, making the output layer a probability distribution is rather wise: it aids in interpreting the network's output activations as estimated probabilities of an input belonging to each class.

For instance, imagine there are a total of four neurons in the output layer, [a1, a2, a3, a4]. If a4 outputs an activation of 0.73, this can be interpreted as a 73% probability of the input belonging to class 4. Therefore, the outputs of a1, a2 and a3 must each be less than 0.73, and all four outputs should sum to 1. On the other hand, a sigmoid output layer does not explicitly enforce this, because the output activations of all four neurons with a sigmoid function could each end up being, for example, 0.80 [33].

3.4

Cross-Entropy Cost Function

Cost function, also known as loss function, in a neural network determines the error value between the network’s output prediction value (p) and the target value (t). There are several different cost functions that can be used when training a neural network. The mean-square error function has typically been used but has shown some limitations and is nowadays replaced by the cross-entropy error function in classification problems. In practice, it has been shown that cross-entropy is more efficient as it converges faster and finds a better local optimum with a random initialization of weights. It has also been shown that when using the mean squared error as a cost function, the network is more likely to quickly get stuck in a local minimum [21].

The equation of the mean squared error can be seen in equation 3.3:

Error = (1/N) · Σ_{i=1}^{N} (t − p)²   (3.3)

• N is the number of training samples seen by the network.
• t is the target value.
• p is the actual output prediction value of the neural network.

The equation of the mean cross-entropy can be seen in equation 3.4:

Error = −(1/N) · Σ_{i=1}^{N} t · log(p)   (3.4)

Comparing these two equations, one can notice that the mean squared error gives too much emphasis to incorrect output predictions, while the log function in the cross-entropy function has the advantage of taking the closeness of a network's prediction into account. An example by James D. McCaffrey [32] illustrates this very well and shows why cross-entropy is preferable.


Output Predictions    Target Values    Classified Correctly
0.1  0.1  0.8         1  0  0          No
0.3  0.4  0.3         0  1  0          Yes
0.3  0.3  0.4         0  0  1          Yes

Table 3.1: Neural Network 1

Output Predictions    Target Values    Classified Correctly
0.3  0.4  0.3         1  0  0          No
0.1  0.1  0.8         0  0  1          Yes
0.1  0.8  0.1         0  1  0          Yes

Table 3.2: Neural Network 2

Both networks classify the same number of samples correctly; however, neural network 1 can be considered worse than network 2, since it only barely classifies the second and third samples correctly and is way off on the first.

Let us calculate the cost for each network using the two cost functions mentioned above (base-10 logarithms are used for the cross-entropy).

Mean Squared Error

Neural Network 1:
(1 − 0.1)² + (0 − 0.1)² + (0 − 0.8)² = 1.46
(0 − 0.3)² + (1 − 0.4)² + (0 − 0.3)² = 0.54
(0 − 0.3)² + (0 − 0.3)² + (1 − 0.4)² = 0.54
Cost = (1.46 + 0.54 + 0.54) / 3 ≈ 0.85

Neural Network 2:
(1 − 0.3)² + (0 − 0.4)² + (0 − 0.3)² = 0.74
(0 − 0.1)² + (0 − 0.1)² + (1 − 0.8)² = 0.06
(0 − 0.1)² + (1 − 0.8)² + (0 − 0.1)² = 0.06
Cost = (0.74 + 0.06 + 0.06) / 3 ≈ 0.29

Cross-Entropy Error

Neural Network 1:
−(1·log(0.1) + 0·log(0.1) + 0·log(0.8)) = −log(0.1)
−(0·log(0.3) + 1·log(0.4) + 0·log(0.3)) = −log(0.4)
−(0·log(0.3) + 0·log(0.3) + 1·log(0.4)) = −log(0.4)
Cost = −(log(0.1) + log(0.4) + log(0.4)) ≈ 1.80

Neural Network 2:
−(1·log(0.3) + 0·log(0.4) + 0·log(0.3)) = −log(0.3)
−(0·log(0.1) + 0·log(0.1) + 1·log(0.8)) = −log(0.8)
−(0·log(0.1) + 1·log(0.8) + 0·log(0.1)) = −log(0.8)
Cost = −(log(0.3) + log(0.8) + log(0.8)) ≈ 0.72

Comparing the costs of the two networks using both methods, the cross-entropy cost of neural network 1, the network that performed worse in classifying, differs greatly from that of neural network 2. The difference is less pronounced with the mean squared error method. This is because the log function puts more emphasis on how confidently the correct class is predicted and therefore is a harsher way of calculating the cost.
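The comparison above can be reproduced with a few lines of numpy; the values are taken from Tables 3.1 and 3.2, the mean squared error follows equation 3.3, and the cross-entropy is summed with base-10 logarithms as in the worked example.

import numpy as np

nn1_p = np.array([[0.1, 0.1, 0.8], [0.3, 0.4, 0.3], [0.3, 0.3, 0.4]])
nn1_t = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
nn2_p = np.array([[0.3, 0.4, 0.3], [0.1, 0.1, 0.8], [0.1, 0.8, 0.1]])
nn2_t = np.array([[1, 0, 0], [0, 0, 1], [0, 1, 0]])

def mse(t, p):
    return np.mean(np.sum((t - p) ** 2, axis=1))   # equation 3.3

def cross_entropy(t, p):
    return np.sum(-t * np.log10(p))                # summed, base-10 logs as above

print(mse(nn1_t, nn1_p), mse(nn2_t, nn2_p))                      # ~0.85 vs ~0.29
print(cross_entropy(nn1_t, nn1_p), cross_entropy(nn2_t, nn2_p))  # ~1.80 vs ~0.72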

3.4.1

Minimizing The Cost

Derivatives are a useful tool for finding the local minimum of a function because they help determine the local shape of the function. The slope of the tangent line to a function determines whether the function is increasing, decreasing or flat. In other words, if the derivative is zero, the cost function is flat; if the derivative is negative, the cost function is going down, and vice versa.

Gradient Descent Optimization Algorithm

An optimization algorithm such as the gradient descent optimization algorithm is required to find the local minimum of the cost function during training. It takes steps proportional to the negative of the gradient of the cost function to approach a local minimum. There are several gradient descent optimization algorithms that can be used in deep learning, to mention a few: Momentum [35], AdaGrad [23] and Adam [25].

There are several variants of gradient descent:

1. Batch Gradient Descent (BGD): for each training step/iteration, the true gradient of the entire training set is calculated before the weights are updated. In other words, the entire training set must be considered before updating the weights.

2. Stochastic Gradient Descent (SGD): the weights are updated using the gradient computed on a small mini-batch of training samples instead of the entire training set (see also section 5.3.2).

3.5

Why Rectified Linear Unit (ReLU)

The objective of training a CNN is to achieve a minimum error (i.e. cost) between the network prediction and the target value. As previously stated, derivatives are useful for finding the local minimum of a function. Therefore, the derivative of the cost function is significant, because it helps in understanding the cost function and finding that minimal cost. However, since the cost function (E) is a function of the activation function (y), which is a function of the weighted sum of the inputs (S), which in turn is a function of the weights (w), the derivative of the cost function involves the derivative of the activation function through the chain rule. Mathematically, this can be seen in the following equation 3.5 for a single neuron i:

dE/dwi = (dE/dy) · (dy/dS) · (dS/dwi)   (3.5)

• E is the cost function.
• y = f(S) is the non-linear activation function.
• S is the weighted sum of the inputs to a neuron, i.e. the netInput = Σ_{i=0}^{n} wi·xi.
• xi is an input value and wi is its corresponding weight.

Each time the gradient of a neuron is calculated, the gradients of all previous layers up to that neuron are also included. In other words, the gradient at any neuron is the product of the previous gradients up to that point. Since the derivative of a sigmoid-like activation function is a small value between 0 and 1, the product of m such values for m layers causes the gradient to decrease exponentially, which results in the earlier layers learning more slowly. Therefore, one major factor in this problem, which is also referred to as the vanishing gradient problem mentioned in section 3.1, is the type of activation function used in the CNN's non-linearity transformation layer.

Figure 3.6: Derivative curve of the sigmoid activation function [33]

As Figure 3.6 shows, the maximum value of the sigmoid derivative is 0.25, which occurs when the weighted sum of the inputs to a neuron is S = 0. Otherwise, the derivative will always be close to zero for large and small values of S.

Therefore, one suggestion for overcoming this problem is to use another activation function in the network, such as the rectified linear unit (ReLU) activation function. As seen in Figure 3.7, the derivative of the ReLU function is zero only for negative values and equal to 1 for all positive values, so it does not shrink the gradient.

Figure 3.7: ReLU Activation Function y(S) = max(0, S)
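A small sketch contrasting the two derivatives discussed above: the ReLU gradient is exactly 1 for every positive net input, whereas the sigmoid derivative never exceeds 0.25, which is what makes deep networks with sigmoid activations prone to vanishing gradients.

import numpy as np

def relu(s):
    return np.maximum(0.0, s)            # y(S) = max(0, S)

def relu_derivative(s):
    return (s > 0).astype(float)         # 0 for negative inputs, 1 for positive inputs

def sigmoid_derivative(s):
    y = 1.0 / (1.0 + np.exp(-s))
    return y * (1.0 - y)                 # at most 0.25 (at S = 0)

s = np.array([-2.0, -0.1, 0.0, 0.5, 3.0])
print(relu(s))                # [0.  0.  0.  0.5 3. ]
print(relu_derivative(s))     # [0. 0. 0. 1. 1.]
print(sigmoid_derivative(s))  # all values <= 0.25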

3.6

Backpropagation

Backpropagation is a common method used when training a neural network. It is used together with a gradient descent optimization algorithm to calculate the gradient of the cost function, i.e. the rate at which the cost/error changes with respect to the weights of the neurons in the network.

The gradients are passed to the optimization algorithm, which uses them to update the weights and biases in an attempt to minimize the cost. When the gradient is small the network learns slowly; when it is large the network learns faster [33].

Backpropagation starts by performing a forward propagation (going from the input layer to the output layer), which passes the training sample through the network to generate the network's output activations. A backward propagation (going from the output layer towards the input layer) then follows: using the target values associated with the inputs, the difference (error) between the target and the actual network output is calculated for all neurons in both the output and hidden layers. For each neuron, an associated weight change (∆w) is calculated to adjust its weight by using an update rule.

To illustrate how the weight changes are derived, assume the following:

1. The cost function used is the mean squared error, which was previously seen in equation 3.3.

2. The activation function used is the logistic sigmoid function,

y(S) = f(S) = 1 / (1 + e^(−S))   (3.6)

3. S = weighted sum of the inputs = Σ_{i=0}^{n} wi·xi.

For simplicity, let us consider a single-layer neural network with only one neuron. To calculate the weight change, the delta rule is used:

∆w = −η · dE/dwi, where η is the learning rate   (3.7)

w_{t+1} ← w_t + ∆w   (weight update)   (3.8)

The gradient dE/dwi in equation 3.7 is calculated as previously seen in equation 3.5 on page 20. The partial derivatives are calculated as follows:

• dE/dy = 2 · (1/2) · (t − p) · (−1) = −(t − p)
• dy/dS = f′(S) = y(1 − y), where y is the sigmoid function.
• dS/dwi = xi

Hence, the dE/dwi term equals:

dE/dwi = −(t − p) · f′(S) · xi   (3.9)

Substituting into equation 3.7:

∆w = −η · dE/dwi = η · (t − p) · f′(S) · xi   (3.10)

In a more general form,

∆w = η·δ·xi   where δ = (t − p)·f′(S) = (t − p)·y·(1 − y)

For a hidden neuron, there is no target value (t); therefore, the chain rule leads to an expression where (t − p) is replaced by a weighted sum of the δ's from the layer above.

Therefore, in the backpropagation phase for larger neural networks, the weight changes (∆w) are calculated layer by layer, from the output layer back to the first hidden layer, as follows:

∆wij = η·δj·xi

δj = (tj − pj)·yj·(1 − yj),        if yj is an output neuron
δj = (Σ_k wkj·δk)·yj·(1 − yj),     if yj is a hidden neuron

Where the sum is over all k nodes in the layer above, i.e. the layer for which the δ-values were computed in the previous step.
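The delta rule above can be written out directly in a few lines of Python for a single sigmoid neuron; the inputs, weights, target and learning rate are hypothetical and serve only to show one gradient-descent step (equations 3.7-3.10).

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

eta = 0.5                              # learning rate
x = np.array([1.0, 0.5, -1.0])         # hypothetical inputs
w = np.array([0.2, -0.3, 0.1])         # current weights
t = 1.0                                # target value

y = sigmoid(np.dot(w, x))              # forward pass, p = y
delta = (t - y) * y * (1 - y)          # delta = (t - p) f'(S)
dw = eta * delta * x                   # delta rule: each weight change is eta * delta * x_i
w = w + dw                             # weight update (equation 3.8)
print(y, dw, w)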

3.7

Related Work

Convolutional neural networks have produced stunning results on many image recognition tasks in deep learning. Krizhevsky et al. [20] trained a large CNN to classify 1.2 million high-resolution images of various objects into 1000 classes. Using modern techniques such as dropout and data augmentation to reduce overfitting, their CNN model achieved state-of-the-art results on the ImageNet benchmark, with a top-1 error rate of 37.5%.

LeCun et al.'s research [34] showed that convolutional neural networks outperformed many other methods in recognizing handwritten text. Their LeNet-5 model achieved a 0.8% error rate on the MNIST handwritten digits database. In [11], LeNet is used to classify generated spectrograms of acceleration traces from the LMT Haptic Database, which is the same database used in this thesis. The results were a 94.4% error rate when 100 spectrogram fragments were taken from each data trace and a 44.3% error rate with 1000 fragments.

In the context of surface texture classification, Tivive et al. [5] used a four-layer CNN to classify images of textures from the Brodatz Texture database and achieved an average error rate of 17.2%. Their network architecture consisted of kernels of size 7x7 in the first layer and 5x5 in the second layer, a pooling layer that down-sampled the spatial dimension of their 13 x 13 pixel images by a factor of 2 after the first convolutional layer, and an output layer with each neuron representing one texture class.


reduces the dimensionality while keeping significant temporal and spectral properties. Their model, ACNN, consisted of 2 convolutional layers, 2 pooling layers and 3 fully connected layers, and achieved an error rate of 20.8%. They increased their model's performance, reducing the error rate to 18.2%, by using the weights of a trained sparse auto-encoder to initialize the weights of the first convolutional layer.

Figure 3.8: HapticNet [10]


Chapter 4

Datasets

There are, as previously mentioned in section 2.2, several ways in which a machine can learn. In this thesis, supervised learning is used to train a CNN to recognize various textures. This implies a requirement for labeled data, in which each input sample must have a corresponding target value, in other words an associated class that it belongs to.

The two datasets investigated in this thesis are:

1. LMT Haptic Texture Database [41].

2. PENN Haptic Texture Toolkit [15].

When a rigid tool is stroked over a surface texture, several scanning parameters such as the scanning velocity, the applied normal force or angle between the tool and surface, can influence the recorded acceleration signal. Figure 4.1 shows an example of how increasing the scanning velocity linearly affects the acceleration trace.

Figure 4.1: A typical acceleration data trace with constant normal force. One can notice that as the velocity increases (going from left to right), the signal power and variance increase [39]

Since humans cannot discern the direction of such high-frequency vibrations [22], both databases include a conversion of the three-axis acceleration to a one-axis acceleration signal using the DFT321 algorithm proposed by Landin et al. [30].

4.1

LMT Haptic Texture Database (LMT)

The LMT haptic database is a publicly available database provided by the Technische Universität München in Germany. The database was established in response to limitations of the PENN Haptic Texture Toolkit (HaTT); it was argued that HaTT only provided free-hand exploration of the acceleration signals, with varying force and velocity. LMT includes acceleration signals measured in both a controlled and an uncontrolled environment. In the controlled environment, the device is attached to a Phantom Omni device, as shown in Figure 4.2, while the velocity or the force is linearly increased using a motor-controlled rotary plate. In the uncontrolled environment, the device is hand-held and freely used to explore the surface.


Figure 4.3: Hand-held device for texture recording [39]

4.1.1

Recording Process

As the stainless steel tool-tip of the hand-held device, shown in Figure 4.3, scratches a texture, vibration of the tool is produced and can be captured with the attached acceleration sensor. The acceleration sensor used is a three-axis LIS344ALH accelerometer with a range of ±6 g, manufactured by STMicroelectronics.

There are in total 10 free hand recordings (labeled query0-query9) of approximately 20 seconds for each texture by a human subject. The first 5 recordings have a circular movement on the texture. The other 5 recordings are recorded by allowing the user to move back and forth between two points. In order to cover a wide range of typical human exploration, the scan parameters and exploration traces are varied between the 10 recordings. For further details on the recording procedure refer to [40] and [39].

4.2

PENN Haptic Texture Toolkit (HaTT)

The Penn Haptic Texture Toolkit (HaTT) is provided by the Haptic Group at the University of Pennsylvania [15]. It is a collection of 100 different textures across 10 categories: paper, plastic, fabric, tile, carpet, foam, metal, stone, carbon fiber and wood. Unlike LMT, HaTT includes only 2 recorded traces of approximately 10 seconds for each class/texture.

4.2.1

Recording Process


the sensor resonance. The data for each recorded trace of a texture is provided as an XML file, as seen in Figure 4.4. The acceleration recordings in the <Accel> field were used in this thesis.

Figure 4.4: Data structure in the XML file [29]


Chapter 5

Methodology

5.1

Framework

The implementation framework used in this thesis is TensorFlow (TF), Google's recently released open-source deep learning framework [12]. Currently, TensorFlow provides only two APIs, one for C++ and one for Python; it will hopefully support other languages over time. The Python API was used in this thesis.

5.2

Pre-Processing

5.2.1

Converting xml files to text files


5.2.2

Spectrogram Generation

CNNs take advantage of having images as inputs because of their 3-dimensional (width, height, depth) layer architecture. As previously mentioned in section 3.3.1, neurons in a feature map share weights and biases, which makes a CNN well adapted to translation invariance.

It has been demonstrated in many speech recognition studies, as mentioned in [10], that converting a one-dimensional signal into the spectral domain helps a CNN to achieve translation invariance between the temporal and frequency domains of the signal. Therefore, as suggested in [10], spectrograms for all one-dimensional DFT321 acceleration traces in the text files were generated by computing the squared magnitude of the Discrete Fourier Transform. The spectrogram images were in the log domain, with a frame length of 500 and a frame increment of 250, as shown in the code below.

import math
import numpy as np

def spectrum_generation(fileName, fileLableName, idxLabel):
    data = np.array(lines)      # lines is the parsed acceleration trace from the text file
    frame = 500                 # frame length
    overlap = 250               # the frame increment
    numFrame = (len(data) - frame) / overlap + 1   # assumed; not shown in the original excerpt
    spectrum = np.array([])                        # assumed initialization
    for i in xrange(numFrame):
        seg = data[i*overlap:i*overlap+frame]
        seg = [float(str_temp2) for str_temp2 in seg]
        segF = np.fft.fft(seg)
        segF = segF[0:frame/2]      # keep one half of the symmetric spectrum
        segF = segF[::-1]
        segF = abs(segF)**2         # squared magnitude
        spectrum = np.concatenate((spectrum, segF))
    logSpectrum = [math.log10(i) for i in spectrum]
    logSpectrum = np.array(logSpectrum)

To adjust the calculated values to a notionally common scale, each spectrogram was normalized by using the feature scaling normalization method as seen in the code below.

data_min = min(logSpectrum)
data_max = max(logSpectrum)
logSpectrumNorm = (logSpectrum - data_min) / (data_max - data_min)   # feature scaling to [0, 1]


Figure 5.1: Brick in LMT (left), Aluminum in HaTT (Right)

LMT Haptic Texture Database (LMT)

As previously mentioned, the LMT dataset has a total of 10 acceleration traces for each of its 69 textures/classes. Each acceleration trace was measured for approximately 20-25 seconds. Nine of the acceleration traces were used to train the CNN, and the last acceleration trace was used as the testing set to evaluate the trained CNN. For each acceleration trace, 100 spectrogram samples of size 50 x 250 were extracted, as shown in the code below, using the predefined configuration settings mentioned above. The result was a total of 62,100 images used for training and 6,900 images used for testing the CNN. Figure 5.2 on page 32 shows an example of an image used for training the CNN.

# numFrame, logSpectrumNorm, filePathImage, str_temp, idxLabel and text_label_file
# are defined earlier in the pre-processing script.
shift = 7
num_image = (numFrame - 50) / shift + 1
k = 0
idx_min = max(-1, num_image - 101)            # keep at most 100 images per trace
for i in xrange(num_image - 1, idx_min, -1):
    image_cur = logSpectrumNorm[:, i*shift:i*shift+50]   # 50-frame wide slice of the spectrogram
    I8 = (image_cur * 255.9).astype(np.uint8)            # scale [0, 1] values to 8-bit pixels
    img = Image.fromarray(I8)
    imageFileName = filePathImage + str_temp + str(k).zfill(3) + ".png"
    img.save(imageFileName)
    text_label_file.write("%s " % idxLabel + str_temp + str(k).zfill(3) + ".png" + '\n')
    k = k + 1


Figure 5.2: Example of a 50 x 250 image of the training set

PENN Haptic Texture Toolkit (HaTT)

The HaTT dataset has only 2 acceleration traces for each of its 100 textures/classes. Each acceleration trace was half the size compared to the LMT database, approximately 10 seconds long. Therefore, only 50 spectrogram samples of size 50 x 250 could be extracted for each acceleration trace.

Initially, only 69 classes were considered from the database. The first dataset division was a 50:50 ratio: the first acceleration trace was used for training the CNN and the second acceleration trace was used for evaluating it. The result was a total of 3450 images for each of the training and testing sets. Compared to the LMT dataset, this was a huge reduction of the training set, and one could easily predict that this would produce a common machine-learning problem called overfitting, which was introduced in section 2.6.

Therefore, the dataset division was changed to a 70:30 ratio. All data from the first acceleration trace, together with the last 20 images from the second acceleration trace of each texture, were used to train the CNN. The remaining images, that is 30 images per texture from the second acceleration trace, were used to evaluate the trained CNN. In total, there were 4830 images for training and 2070 images for testing when training and evaluating the network on 69 classes. The total number of images obviously increased when the CNN was trained and tested on 100 classes. The source code for generating spectrogram images for the HaTT dataset is similar to that for the LMT dataset, with minor changes, and can be found in Appendix B.

5.2.3

Data Augmentation


Data augmentation can be defined as a technique to artificially increase the training data in order to boost the network's performance. There are several ways of doing data augmentation, such as horizontally flipping, adjusting the contrast or rotating the training image. Several of these processes can also be combined, for example by first rotating the image and then adjusting its contrast. By adding such noise to the training data, the neural network is able to learn other features and typically generalizes better to the testing data [19].

Experiments were made by applying several data augmentations to the training images, randomly rotating and horizontally flipping the images in a batch. The image contrast was also adjusted by a random factor between 0.2 and 1.8.
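A minimal sketch of the flip and contrast steps using TensorFlow's image ops, assuming a single [height, width, channels] image tensor; the random rotation step described above is omitted here, since it requires an additional rotation op, and the exact augmentation code used in the thesis may differ.

import tensorflow as tf

def augment(image):
    # image: a float32 tensor of shape [height, width, channels] for one training sample
    image = tf.image.random_flip_left_right(image)                  # random horizontal flip
    image = tf.image.random_contrast(image, lower=0.2, upper=1.8)   # contrast factor in [0.2, 1.8]
    return image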

5.3

Implementation

As mentioned in the project scope in section 1.4, the aim is to initially experiment with the LMT Haptic database and achieve a CNN model that can at least produce competitive results. From there on, the ability of the same CNN model will be extended and experiments with the HaTT database will be made.

5.3.1

Data Input

Data input to TensorFlow was similar for both the LMT and HaTT databases. When the spectrogram images were generated for each database, two text files were generated. These files included the name of each image file and its corresponding class label. One text file contained the images of the training set and the other text file the images of the testing set. Table 5.1 shows a preview of the training set text file for the LMT database.

0 G1EpoxyRasterPlate query0000.png
0 G1EpoxyRasterPlate query0001.png
1 G1IsolatingFoilMesh query0000.png
1 G1IsolatingFoilMesh query0001.png
...
68 G1SquaredAluminumMesh query0000.png
68 G1SquaredAluminumMesh query0001.png

Table 5.1: 2 training set images for class 0, 1 and 68.

Each line could then be parsed to load the corresponding image from it; its corresponding class could also be extracted. Therefore, before the training process was initiated, both the training and testing text files were parsed. All image samples used for training or testing were allocated into a numpy array, and their corresponding class labels were allocated into another numpy array. This was done by the extract_TRAIN_data function for image samples used to train the CNN model and the extract_TEST_data function for image samples used to evaluate the CNN model. The source code can be viewed in Appendix C.

Both functions included a call to a separate function called generate_dataArrays. This function was used to modify each image sample before finally presenting it to the network. For instance, image pixel values were rescaled from [0, 255] to [-0.5, 0.5], as can be seen in the code snippet below. Other modifications used were the various data augmentation techniques previously explained.

import numpy
from PIL import Image

def generate_dataArrays(data, labels, line, idx_image, trainingset):
    list_tmp = line.split(' ')
    # main.IMAGEPATH and PIXEL_DEPTH are defined elsewhere in the project
    fileImageName = main.IMAGEPATH + list_tmp[-1].replace('\n', '')
    im = Image.open(fileImageName)
    data_tmp = numpy.array(im)
    data_tmp = (data_tmp - (PIXEL_DEPTH / 2.0)) / PIXEL_DEPTH   # rescale [0, 255] to [-0.5, 0.5]

The result of this was arrays containing non-shuffled training data. In other words, the network would initially be presented with all images from the first class, then the second class, and so on until the final class. Therefore, a trivial shuffling function was introduced, which was later replaced with a more random shuffling function. The trivial shuffling function would fill two new arrays by first allocating the first image of each class and its corresponding class label to the new arrays; the process would then continue with the second image of each class, and so on for the remaining images per class. The result was two new numpy arrays containing images ordered as in Table 5.2.

0 G1EpoxyRasterPlate query0000.png
1 G1IsolatingFoilMesh query0000.png
...
68 G1SquaredAluminumMesh query0000.png
0 G1EpoxyRasterPlate query0001.png
1 G1IsolatingFoilMesh query0001.png
...
68 G1SquaredAluminumMesh query0001.png

Table 5.2: 2 training set images for class 0, 1 and 68 (with trivial shuffling function).

The replacement shuffling function increased the randomness of shuffling by using Python's random module. After parsing the training set file, a temporary text file was created, which only included images for the number of classes specified by the experimenter. This temporary file was then randomly shuffled before creating the data and label arrays as explained above. As a result, images were presented in a completely random order during training. To view the source code, refer to Appendix C, section extract_data.py.
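A sketch of the shuffling step, under the assumption that the training list file looks like Table 5.1 (one class label and one file name per line); the file name used here is hypothetical.

import random

NUM_CLASSES = 69                       # number of classes selected by the experimenter
with open("train_set.txt") as f:       # hypothetical name of the training list file
    lines = [line for line in f if int(line.split(" ")[0]) < NUM_CLASSES]
random.shuffle(lines)                  # images now appear in random order during training
# the data and label numpy arrays are then built from `lines` as described above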

The key component of TensorFlow is the data flow graph. The graph represents a description of the computations. Each operation in the graph is called a node, and to compute or initialize the operations the graph must be launched in a session. A tensor, which can be thought of as an n-dimensional array, is used as input to these operations. To feed data of a different format, such as a Python numpy array, into the operations, a placeholder along with the feed_dict argument must be used.

Therefore, since the format used in the implementation was numpy arrays, a small batch of images and labels was extracted from the created train_data and train_labels numpy arrays for each step/iteration during training. These numpy arrays were fed to the placeholders (the inputData and train_labels_node nodes) by a run() call, as shown in the code below.

import tensorflow as tf

with tf.Session() as sess:
    for step in xrange(int(num_epochs * train_size) // BATCH_SIZE):
        offset = (step * BATCH_SIZE) % (train_size - BATCH_SIZE)
        batch_data = train_data[offset:(offset + BATCH_SIZE), ...]
        batch_labels = train_labels[offset:(offset + BATCH_SIZE)]
        feed_dict = {inputData: batch_data, train_labels_node: batch_labels}
        # operations_to_run stands for the graph nodes evaluated each step (e.g. optimizer and loss)
        sess.run([operations_to_run], feed_dict=feed_dict)

5.3.2

Training


Batch Size

The batch size is one hyper-parameter that needs to be tuned by the experimenter. The particular batch size is significant for training, since the CNN is trained using the stochastic gradient descent (SGD) method described in section 3.4.1. The reason for choosing SGD over Batch Gradient Descent (BGD) is the following advantages [27]:

1. SGD is usually much faster than BGD for large datasets because it uses mini batches of the training set instead of the entire training set.

2. SGD often results in better solutions because of the noise in the weight updates. Deep neural networks with many layers tend to have a cost function with many local minima. With the noise present in the weight updates, the optimization may jump out of a poor local minimum and find a better one.

When choosing the batch size, one must take the size of the training set and the available hardware into account. The larger the batch size, the more computationally expensive training becomes, because computing the gradients for each training step by backpropagation takes longer.

Learning Rate

The learning rate parameter is another hyper-parameter that needs to be tuned. The learning rate controls the size of the weight and bias updates, as seen in Equation 3.7 on page 22. In other words, it controls the step length taken when searching for the global minimum of the cost function. Finding an appropriate learning rate can be challenging: if the learning rate is too large, the updates can overshoot minima, and if it is too small, training can get stuck in a poor local minimum and the network stops learning.

In practice, there are a few tips and tricks that can help in choosing a fairly good learning rate by looking at the graph of the cost/loss function. If the learning rate is very high, the loss will diverge. The idea is to obtain a smooth convergence so that the loss approaches zero as the number of epochs increases. This is illustrated in Figure 5.3.


Figure 5.3: Selecting the appropriate learning rate [46]

def exponential_decay(learning_rate, global_step, decay_steps, decay_rate):
    #:param learning_rate: initial learning rate (0.01)
    #:param global_step: counter that counts the number of training samples processed
    #:param decay_steps: when the learning rate should decay (2 * epoch)
    #:param decay_rate: scalar factor (0.95)
    #:return: decayed_learning_rate

    # Note: exponentiation in Python is written **, not ^.
    decayed_learning_rate = learning_rate * decay_rate ** (global_step / decay_steps)

    return decayed_learning_rate

Several experiments were made with different learning rates, such as 0.05, 0.005 and 0.1. However, the value that worked best for the proposed model was an initial learning rate of 0.01 with a learning rate decay applied after every second epoch.
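For reference, the same schedule can be expressed with TensorFlow's built-in tf.train.exponential_decay and passed to a momentum optimizer, as sketched below; the variable name batch and the momentum value of 0.9 are assumptions made for illustration, while BATCH_SIZE, train_size and loss refer to the quantities used in the training loop and cost function shown elsewhere in this chapter.

import tensorflow as tf

# Counter incremented once per optimization step; it drives the decay schedule.
batch = tf.Variable(0, trainable=False)

learning_rate = tf.train.exponential_decay(
    0.01,                 # initial learning rate
    batch * BATCH_SIZE,   # current position in the training set
    2 * train_size,       # decay once every second epoch
    0.95,                 # decay rate
    staircase=True)

# The decayed rate is handed to the momentum optimizer minimizing the loss.
optimizer = tf.train.MomentumOptimizer(learning_rate, 0.9).minimize(
    loss, global_step=batch)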

Cost Function


Regularization Techniques

Regularization methods are techniques used in an attempt to reduce overfitting in machine learning. Increasing the size of the training set or reducing the size of the network are two ways of reducing overfitting. However, large networks have the potential to be more powerful than small networks, so reducing the network size should only be done when necessary. Fortunately, there are other techniques for reducing overfitting.

Dropout [37] is an effective technique proposed by Geoffrey Hinton et al. to deal with overfitting in deep neural networks. Dropout has been shown to improve the performance of neural networks trained with supervised learning. The idea of Dropout is to turn neurons off every once in a while, with a probability parameter selected by the experimenter. Once a neuron in the network is turned off, its weights are not updated, so it does not affect the learning of the other neurons. Thus, Dropout prevents neurons from co-adapting too much, since the weights of the neurons that are turned off become insensitive to the weights of the other neurons. This forces the neurons to learn useful features on their own.

The proposed CNN model contains a Dropout layer after the first fully connected layer. Initially, experiments were made with the probability parameter equal to 0.5, i.e. dropping 50% of the neurons during training. The parameter was later changed to 0.7. As recommended by the TensorFlow API [12], dropout was only activated during training; its functionality was turned off during evaluation, as can be seen in the following code.

def model(data, train=False):
    #:param data: training set images
    with tf.variable_scope('FCL1') as scope:
        hidden = tf.nn.relu(tf.matmul(reshape, weights['fcw1']) + biases['fcb1'],
                            name=scope.name)
        # Apply dropout during training only (keep probability 0.7).
        if train:
            hidden = tf.nn.dropout(hidden, 0.7, seed=SEED)

L2 Regularization is another effective technique that was used to reduce overfitting and improve generalization. The idea of L2 regularization is to modify the cost function by adding an extra term, called the regularization term, to the cost function E_0, which penalizes large weights. How strongly the weights are penalized is specified by λ, as can be seen in Equation 5.1, where E_0 is the original, unregularized cost function and the sum runs over all the free parameters of the network, i.e. the weights and biases.

E(w) = E_0(w) + \frac{\lambda}{2} \sum_i w_i^2   (5.1)

where λ is the regularisation parameter.

As specified earlier in section 2.3, the weights in a network strengthen or weaken an input. An unregularized network can use its large weights to learn noise in the training set. By penalizing large weights, regularized networks are instead constrained to learn features that are significant and frequently seen during training [33]. In other words, small weights make it difficult for the network to learn noise in the data, because the behavior of the network does not change much if the input is modified slightly. For more information about weight decay and how it can improve generalisation, refer to [18].

During training, L2 regularization was used to decay the weights of the neurons in the fully connected layers. Some trial and error was required in order to find an appropriate value for the regularization parameter λ. When the HaTT dataset was used, the regularization strength was increased, since the HaTT dataset is much smaller than the LMT dataset. The idea was to use the same CNN model without modifications and compare its performance with the LMT dataset. The following code shows how the L2 regularization was implemented.

def cost_function(logits, labels):
    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(logits, labels))
    # L2 regularization for the fully connected parameters.
    regularizers = (tf.nn.l2_loss(weights['fcw1']) + tf.nn.l2_loss(biases['fcb1'])
                    + tf.nn.l2_loss(weights['fcw2']) + tf.nn.l2_loss(biases['fcb2']))
    # Add the regularization term to the loss.
    loss += 0.05 * regularizers  # 0.05 is the regularization parameter
    return loss

Normalization Layer
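As a hedged illustration only, a local response normalization layer of the kind used in TensorFlow's CIFAR-10 tutorial is one possible realisation of the normalization layers listed later in Table 5.3; the parameter values below are the tutorial's defaults and are not claimed to be EHapNet's actual settings, and pool1 stands for the output of the first pooling layer.

# Local response normalization applied after the first pooling layer (pool1).
# depth_radius, bias, alpha and beta follow the CIFAR-10 tutorial defaults and
# are assumptions here, not EHapNet's actual settings.
norm1 = tf.nn.local_response_normalization(
    pool1, depth_radius=4, bias=1.0, alpha=0.001 / 9.0, beta=0.75, name='norm1')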


Network Architecture

As previously mentioned, the MNIST model was used as a starting point. That model is a CNN with 2 convolutional layers, 2 max-pooling layers, 1 dropout layer and 2 fully connected layers, the last of which is a softmax output layer. Several experiments were made by applying various modifications to the network architecture, such as adding convolutional layers followed by a max-pooling layer, changing the number of output channels in each layer and adding normalization layers.

In order to perform the convolution operation described in section 3.3.1, two filter sizes commonly seen in the literature were tested: 3 x 3 and 7 x 7. The filter stride was kept at 1 pixel, and the padding scheme used was TensorFlow's SAME padding, which pads the borders of the image with zeros so that an even filtering is applied over the whole image.
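To make these settings concrete, the sketch below shows one convolution block with a 7 x 7 filter, 1-pixel stride and SAME padding followed by 2 x 2 max pooling; the variable names, the input tensor data and the constant NUM_CHANNELS are illustrative assumptions rather than the exact code in Appendix C.

# 7 x 7 filters, 32 output channels, weights drawn from a truncated normal.
conv1_weights = tf.Variable(
    tf.truncated_normal([7, 7, NUM_CHANNELS, 32], stddev=0.1, seed=SEED))
conv1_biases = tf.Variable(tf.zeros([32]))

conv = tf.nn.conv2d(data, conv1_weights,
                    strides=[1, 1, 1, 1],   # 1-pixel stride in both spatial dimensions
                    padding='SAME')         # zero-pad so the output keeps the input size
relu = tf.nn.relu(tf.nn.bias_add(conv, conv1_biases))
pool = tf.nn.max_pool(relu,
                      ksize=[1, 2, 2, 1],   # 2 x 2 pooling window
                      strides=[1, 2, 2, 1], # stride 2, non-overlapping
                      padding='SAME')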

The final architecture and layer configuration of the proposed convolutional neural network for haptic (acceleration) data, EHapNet, can be seen in Table 5.3. It takes generated spectrogram images of acceleration signals as input and outputs softmax vectors, each representing a probability distribution over the classes for the corresponding input. Every neuron in the network uses the ReLU activation function, for the reasons explained in section 3.3.2.

Layer Type        Patch Size / Stride    Output Channel Size
Convolution       7 x 7 / 1              32
Max Pooling       2 x 2 / 2              32
Normalization     -                      32
Convolution       7 x 7 / 1              64
Max Pooling       2 x 2 / 2              64
Normalization     -                      64
Convolution       7 x 7 / 1              96
Max Pooling       2 x 2 / 2              96
Fully Connected   -                      512
Dropout           -                      512
Fully Connected   -                      Number of classes
Softmax           -                      Number of classes

Table 5.3: EHapNet Architecture


As shown in Table 5.3, the output size of the last fully connected layer and of the softmax layer is equivalent to the number of classes specified by the experimenter. There are also 2 normalization layers, the first one coming right after the first convolutional layer and the second one after the second convolutional layer. To prevent overfitting, a dropout layer is also introduced after the first fully connected layer.

Training Process

Two numpy arrays were created, one containing the training image samples and one containing the class labels they belong to. The weights and biases of the network were randomly initialized using a truncated normal distribution. For every iteration, a mini-batch of 100 training samples was extracted from the arrays and presented to the CNN. The CNN computed the neuron activations and output logits, which are the network predictions. These predictions were passed to the cross-entropy cost function, which calculated the error between the network predictions and the target class values. Finally, the momentum optimizer tried to find better weight values, with the help of the gradients of the cost function calculated by backpropagation, in order to minimize the network error.

After each approximate epoch, that is, after the entire training set had been presented to the network, the network was evaluated on the test set. The accuracy of the network was computed and a checkpoint file containing the model configuration was saved after each evaluation. The implementation was written such that only the model configuration giving the lowest prediction error, i.e. the highest accuracy, on the test set was saved.
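A minimal sketch of this best-checkpoint logic is given below; the variable best_error, the names test_data and test_labels for the evaluation arrays, and the checkpoint file name are assumptions for illustration, while error_rate, eval_in_batches, eval_prediction and sess are the functions and session described in this chapter.

saver = tf.train.Saver()    # saves all model variables by default
best_error = 100.0          # worst possible error rate in percent

# After each approximate epoch, evaluate on the test set and keep the
# checkpoint only if the error rate improved on the best value seen so far.
test_error = error_rate(
    eval_in_batches(test_data, sess, eval_prediction), test_labels)
if test_error < best_error:
    best_error = test_error
    saver.save(sess, 'ehapnet_best.ckpt')   # assumed checkpoint file name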

5.3.3 Evaluating

In order to compute the network performance, an evaluation of the network is made. The evaluation method used was a fragmented evaluation: the test set was passed to the evaluation function eval_in_batches, and small batches of the test set were in turn passed through the CNN model by eval_prediction to compute the network's softmax predictions. This procedure was repeated until the entire test set had been seen by the network. All predictions were saved in a numpy array and returned by the eval_in_batches function. The following code shows the implementation.

def eval_in_batches(data, sess, eval_prediction):
    """Get all predictions for a dataset by running it in small batches."""
    size = data.shape[0]
    if size < EVAL_BATCH_SIZE:
        raise ValueError("batch size for evals larger than dataset: %d" % size)
    predictions = numpy.ndarray(shape=(size, NUM_LABELS), dtype=numpy.float32)
    for begin in xrange(0, size, EVAL_BATCH_SIZE):
        end = begin + EVAL_BATCH_SIZE
        if end <= size:
            predictions[begin:end, :] = sess.run(
                eval_prediction,
                feed_dict={inputData: data[begin:end, ...]})
        else:
            # Handle the final, partial batch by re-running the last full batch
            # and keeping only the predictions for the remaining samples.
            batch_predictions = sess.run(
                eval_prediction,
                feed_dict={inputData: data[-EVAL_BATCH_SIZE:, ...]})
            predictions[begin:, :] = batch_predictions[begin - size:, :]
    return predictions

Each image sample produces a softmax prediction array whose size equals the number of classes. This array contains the probability distribution over the classes for that particular input. The index of the maximum value in this array is extracted, because this is the class the network believes the input belongs to. This value is compared with the actual target class the input belongs to. If the two values are equal, the comparison for that input is set to True, otherwise to False.

Once the network predictions for the entire test set had been established, the error rate was computed. The error rate was obtained by dividing the total number of correctly classified inputs (all the comparisons that were set to True) by the total number of test samples and subtracting the resulting accuracy from 100%. The code below shows the implementation of the error rate function.

def error_rate(predictions, labels):
    predicted_class = numpy.argmax(predictions, 1)
    booleanArray = predicted_class == labels
    # Percentage of correctly classified samples.
    accuracy = (100.0 * numpy.sum(booleanArray) / predictions.shape[0])
    return 100.0 - accuracy


Chapter 6

Results

6.1 Hardware

Initially, experiments were executed on the CPU, which was time consuming. After a few trivial experiments this was changed and the GPU version of TensorFlow was installed. The remaining experiments were all run on an ASUS NVIDIA GeForce GTX 970 4 GB GPU, which saved a considerable amount of time. Ubuntu 14.04 LTS and Python 2.7 were used to implement the system. Training the proposed CNN on the GPU took on average 3 days with the LMT dataset and 1 day with the HaTT dataset.

6.2 Experiments with LMT Dataset

6.2.1 State of The Art Models

Training on the CIFAR-10 Model

The CIFAR-10 model, provided by TensorFlow's Convolutional Neural Network tutorial, is a 9-layer CNN that achieves a top-1 precision of 86% on the CIFAR-10 dataset. CIFAR-10 is a 10-class dataset consisting of tiny 32 x 32 images of various objects such as airplanes, dogs and trucks.
