INTELLIGENT SYSTEMS

Academic year: 2022

PROFESSION & FRAMSTEG
PROFESSION & PROGRESS
YRKESHÖGSKOLAN

Johan Westö

Abstract

This course material is meant as an introduction to neural networks. Based on a mathematical representation of neurons, the material will try to present 1) how these neurons can be used to build models, 2) how the models depend on the network's structure, and 3) how we can make the networks learn from data. During the course, neural networks will be trained to solve simple example problems as well as large-scale real problems in the form of image classification. The goal is to give students a basic understanding of how neural networks can be used to solve both regression and classification problems.

Sammanfattning

This course is intended as an introduction to neural networks. The material will show how a mathematical representation of nerve cells can be used to represent models, how these models depend on the network's structure, and how they can be made to learn from data. During the course, neural networks will be used to solve both fictitious example problems and real problems in the form of image classification. The goal is for students to gain a basic understanding of how neural networks can be used for both regression and classification problems.

Publisher: Novia University of Applied Sciences, Wolffskavägen 35 B, 65200 Vasa, Finland
© Johan Westö & Novia University of Applied Sciences

Novia Publications and Productions, series L: Teaching materials 2/2014
ISSN: 1799-4195, ISBN: 978-952-5839-86-9 (online)
Layout: Johan Westö

Contents

1 Course information
2 Introduction
  2.1 Intelligence
  2.2 Machine learning
    2.2.1 Traditional machine learning versus deep learning
  2.3 Neurons and neural networks
    2.3.1 Artificial neuron
  2.4 Used notation
  2.5 MATLAB
3 Linear regression
  3.1 A quadratic error measure
  3.2 Example data (part 1)
  3.3 Finding optimal weight values
    3.3.1 Running gradient descent
  3.4 Example data (part 2)
  3.5 Final remarks on gradient descent
4 Softmax regression
  4.1 Maximum likelihood
  4.2 Gradient checking
  4.3 2 class example
  4.4 Training and testing
  4.5 MNIST
  4.6 Restrictions for linear models
5 Multilayer Perceptrons
  5.1 Backpropagation
  5.2 The XOR problem
  5.3 Non-linear regression
  5.4 MNIST revisited
  5.5 Deeper architectures
6 What is next?
References
Appendices
  A MATLAB code for solving the XOR problem
  B Suggested course structure
    B.1 Guidelines for reporting answers to homework problems
    B.2 Homework 1
    B.3 Presentation
    B.4 Homework 2
    B.5 Homework 3
Notation
Acronyms
Index

1 Course information

This is a course on intelligent systems with a focus on Artificial Neural Networks (ANNs). ANNs are currently experiencing something like a new golden age due to their recent successes on problems related to image and speech recognition (Bengio, Courville, & Vincent, 2012). The purpose of this course is to provide the background information about ANNs needed to understand the kind of thinking that has led to these successes. Hence, the course targets students in fields where models need to be learned from data, such as computer science and electrical engineering. Upon completing the course, participating students are expected to have obtained knowledge about 1) how ANNs can solve both regression and classification problems, 2) how gradient-based optimization methods can be used for training ANNs, 3) how layering allows networks to solve non-linear problems, and 4) why deep ANNs are thought to be useful.

Most of the material used is taken from Haykin (2009), but today it is also possible to find excellent free courses online. I would recommend visiting Coursera’s homepage.

There you can find a really good course on “Machine Learning” taught by Andrew Ng (co-founder of Coursera, by the way), and an equally good course on “Artificial Neural Networks” taught by Geoffrey Hinton (a.k.a. the godfather of neural networks). As inspiration, I would also recommend an article about deep neural networks and the human brain by Laserson (2011).

Prerequisites: This material assumes that students are familiar with 1) linear algebra (matrix operations), 2) calculus (the gradient and partial derivatives), 3) system modelling, and 4) MATLAB.

2 Introduction

Today more and more appliances connect to the internet, and this increased connectivity is accompanied by an increased ability to collect and store data. Factories have more sensors collecting data about their processes, and companies such as Facebook and Google sit on a wealth of information about their users; but how do we make sense of all this data? One way is to let computers build models from it, which is what this course is about. More specifically, this course will give an introduction to how we can teach machines to learn from data using ANNs. We will start by looking at how linear regression can be represented by one artificial neuron and how we can use gradient descent to solve optimization problems. From there, we will move on to see how softmax regression can complement linear regression in solving problems related to classification. Finally, we will look into how multilayer perceptron networks emerge out of both linear and softmax regression models by first making a non-linear projection of the input data.

2.1 Intelligence

What is meant by intelligence? Different people will probably have different opinions, but what follows is one that is currently receiving a lot of attention. Karl Friston, who is a famous neuroscientist, proposed that several different brain theories could be gathered under one concept (Friston, 2010). Within this concept, the brain’s functionality could be looked upon as minimizing surprise. If the brain is thought to be intelligent, one meaning of intelligence would then be the ability to make correct predictions. This definition is by no means a new one, and when asked, “What is intelligence if it is not defined by behaviour?” Jeff Hawkins replied (Hawkins, 2004, p. 6):

The brain uses vast amounts of memory to create a model of the world. Everything you know and have learned is stored in this model. The brain uses this memory-based model to make continuous predictions of future events. It is the ability to make predictions about the future that is the crux of intelligence. I will describe the brain's predictive ability in depth; it is the core idea in the book.

Similarly, in a translation of Sun Tzu's “The Art of War” by Cleary (1988, p. xi), the following two statements are found.


According to an old story, a lord of ancient China once asked his physician, a member of a family of healers, which of them was the most skilled in the art.

The physician, whose reputation was such that his name became synonymous with medical science in China, replied, “My eldest brother sees the spirit of sickness and removes it before it takes shape, so his name does not get out of the house. My elder brother cures sickness when it is still extremely minute, so his name does not get out of the neighbourhood. As for me, I puncture veins, prescribe potions, and massage skin, so from time to time my name gets out and is heard among the lords.”

Just as the eldest brother in the story was unknown because of his acumen and the middle brother was hardly known because of his alacrity, Sun Tzu also affirms that in ancient times those known as skilled warriors won when victory was still easy, so the victories of skilled warriors were not known for cunning or rewarded for bravery.

Although these Chinese stories refer to old texts (2,500 years old), they still agree with the interpretation of intelligence as the ability to foresee events, even if it is disguised as skill in this particular case. Going back to Friston's idea that brain functionality tries to minimize surprise, we see that predictions should not be restricted to only foreseeing the future. A definition of intelligence based on predictive capability could also include the ability to make correct judgements of data. That is, if an object recognized to be a car actually is a car, then this also corresponds to a situation where surprise is minimized.

So, if a system is able to either make correct judgements of data or predict future events, we could call it an intelligent system. However, people have discovered during the last 50 years that programming intelligent systems explicitly is very difficult; one often fails to note the complexity of a task. For example, humans easily recognize Figure 2.1a as a car, but it is very difficult to tell a computer how to infer the same thing from the RGB matrices representing an image (see Figure 2.1b), and the task gets even more difficult if one also has to allow for all different types of cars and viewpoints. One way to solve this problem would then be to write code instructing a computer on how to learn the task, instead of trying to tell it explicitly how to do it. This way of thinking leads us straight to the next section on machine learning.

2.2 Machine learning

The material in this course will mainly relate to a branch of Artificial Intelligence (AI) called machine learning, a term defined by Mitchell (1997, p. 2) as:

Definition

“A computer program is said to learn from experience E with respect to some task T and performance measure P, if its performance at task T, as measured by P, improves with experience E.”

In practice, this means that machine learning applications include, among other things: face detection, object recognition, cluster analysis, recommender systems, fault detection, spam detection, and automatic speech recognition systems. In all these situations, a machine learning algorithm has learned to perform the task from experience (old data).



Figure 2.1: Image recognition: a) an image of a T-Ford as seen by a human, adapted from Wikipedia (n.d.-a), and b) a color image as seen by a computer.

Despite their widespread use, machine learning algorithms can normally be classified as belonging to one of the following three categories:

Supervised learning

Represents situations where each data point is associated with a desired output. Normal tasks include classification, when the answer is a label (face / not a face), and regression, when the answer is a real value (function fitting).

Unsupervised learning

Seeks to find structure in unlabelled data, examples include cluster analysis and dimensionality reduction.

Reinforcement learning

Focuses on on-line trial-and-error learning, where a computer program tries to improve its performance on a task by testing different actions and evaluating the responses observed.

This course will only look at methods belonging to the first category, and focus is put on methods based on ANNs. The reason for this is “deep learning”, which is a collective name for different types of ANNs with a “deep” hierarchical structure. These networks are especially good at image and speech recognition, and they are currently receiving a lot of attention from big companies, such as Microsoft, Google, and Facebook. Both Geoffrey Hinton and Yann LeCun, who are big names within the field, have recently been hired by Google and Facebook respectively (Mcmilan, 2013; Metz, 2013).

2.2.1 Traditional machine learning versus deep learning

Several problems within machine learning face high-dimensional data, e.g. the dimensionality of image data corresponds to the number of pixels in the image. Richard Bellman pointed out already 50 years ago that the learning complexity increases exponentially as the dimensionality of the data increases linearly, and he named this problem the “curse of dimensionality”. Traditionally, machine learning methods have tried to avoid this problem by first performing dimensionality reduction, or feature extraction as it is also commonly called (Arel, Rose, & Karnowski, 2010).

In simpler terms, the above means that machine learning has traditionally operated in two stages. First, the dimensionality of the problem has been decreased by extracting features, whereupon these features have been fed to a machine learning algorithm selected for the task at hand. Examples of feature extraction methods are “Bag of Words” for text data and Fourier transforms for temporal and spatial data. As feature extraction is performed first, the success of many machine learning algorithms is strongly dependent on how well the extracted features can represent variation in the original data. Furthermore, the feature extraction process is normally labour intensive and often performed by humans.

Taken together, the above can be seen as a serious weakness, and an obstacle to a widespread deployment of intelligent systems (Bengio et al., 2012).

Deep learning methods differ in the way they try to handle the curse of dimensionality. Instead of relying on human ingenuity, these methods strive to incorporate the feature extraction process in the machine learning algorithm by taking inspiration from the neocortex (Arel et al., 2010). This is the wrinkled outermost layer of the brain, and it is thought to be responsible for our cognitive abilities. Hence, this is also where the visual and auditory cortices are located. Investigations of these have revealed that sensory information is processed through a hierarchical structure where higher-level information is extracted as one moves up the hierarchy. For vision, this means that edges are detected at the lowest levels, whereas more complex objects such as cars and faces would be detected at higher levels (Poggio & Ullman, 2013). In hierarchical systems, depth can also be used as a measure of the number of levels, and hence deep learning, as we shall see, corresponds to ANNs with several levels, or layers as they are normally called.

Figure 2.2: Biological neuron, adapted from Wikipedia (n.d.-b).

2.3 Neurons and neural networks

Your brain consists of around 86 billion connected neurons, a number recently verified by Herculano-Houzel (2012), and each neuron can connect to up to 10^4 other neurons (Squire & Kandel, 2009). In more detail, each neuron is a type of cell capable of sending and receiving impulses. Functionally, the neuron receives impulses from other neurons on its dendrites, and these can trigger the neuron to send an electrical impulse of its own down the axon, which in turn connects to other neurons (see Figure 2.2). The sites that connect different neurons to each other are called synapses, and in contrast to the neuron's electrical impulse, signals are transmitted chemically within the synapse using substances called neurotransmitters.


The term neural network is used to describe networks of connected neurons, and one example of a neural network is therefore the brain. But what is it about these types of networks that makes it possible to store memories or information? Santiago Ramón y Cajal, a Nobel laureate in Physiology or Medicine 1906, proposed that memories are stored through alterations of the synapses connecting neurons. In other words, he proposed that synapses could form connections of various strengths, and that these strengths were plastic in the sense that they could change over time. In 2000, Eric Kandel received the Nobel prize in Physiology or Medicine for his work on verifying the above statement and for describing how this process occurs in real synapses (Squire & Kandel, 2009).

Even if changes in the brain's synapses constitute the basis for memories, one should not imagine memories as being stored in any specific location. That is, there is no specific place where a memory is stored; instead, memories are stored in a distributed fashion throughout the network. This property is maybe best captured by Lashley (1950), who spent his career searching for a specific memory trace, or engram as he called it, and arrived at the following conclusion:

This series of experiments has yielded a good bit of information about what and where the memory trace is not. It has discovered nothing directly of the real nature of the engram. I sometimes feel, in reviewing the evidence on the localization of the memory trace, that the necessary conclusion is that learning just is not possible.

The message one should take home from the above is then not that learning is not possible, but rather that finding one or more synapses that specifically code for a memory trace might very well be.

Before moving on to ANNs, there is still one more detail of interest. As the brain performs so many different tasks, neural networks clearly possess interesting features from an intelligent systems viewpoint; but what is the chance of us being able to utilize them to the same degree as the brain does? And how do we know that we are not just going to end up copying lots of different learning mechanisms? That is, is it possible that there could be one general learning method for neural networks, utilized by the brain, that could also be implemented for ANNs? The real answer is that we do not know yet, but findings such as the one by Von Melchner, Pallas, and Sur (2000) provide hope. In this study, the signals coming from the eyes were rerouted to the auditory cortex in newborn ferrets. The purpose was to investigate whether the auditory cortex could learn to process visual information, and this seems to be the case. Therefore, there is some hope for the existence of one general learning algorithm for neural networks.

2.3.1 Artificial neuron

Similarly to real neural networks, ANNs are built up from connected neurons, but in this case from artificial neurons. Different models of artificial neurons exist, but for the purpose of this course all neurons will be of the type described in Figure 2.3. This model uses a vector (w) of connection weights to indicate synaptic strengths for the inputs in a vector (x), a summation for determining the induced field (v), and an activation function (ϕ(.)) to calculate the output (ŷ). These calculations can be described mathematically as:

ŷ = ϕ( Σ_{m=0}^{M} w_m x_m ) = ϕ(wᵀx)    (2.1)
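As a concrete illustration of Equation 2.1, a single neuron is just a weighted sum passed through an activation function. The following is a Python sketch (not part of the original MATLAB-based material; the identity default for the activation is an assumption matching the linear neuron used later):

```python
# Single artificial neuron: y_hat = phi(sum_m w_m * x_m) = phi(w . x)
def neuron(w, x, phi=lambda v: v):
    # x[0] must equal 1 so that w[0] acts as the bias term
    v = sum(wm * xm for wm, xm in zip(w, x))  # induced field v = w^T x
    return phi(v)

w = [0.5, 2.0]       # w0 (bias) and w1
x = [1.0, 3.0]       # x0 = 1 and x1 = 3
print(neuron(w, x))  # 0.5*1 + 2.0*3 = 6.5
```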



Figure 2.3: Artificial neuron.

2.4 Used notation

The notation used is as follows: upper case bold letters represent matrices, lower case bold letters represent vectors, lower case letters with subscripts represent elements in matrices or vectors (if the vector only contains one element the subscript is left out), and finally, superscripts with Roman numerals represent depth in a hierarchy when not obvious from context. A complete list of all used notations and symbols is found on page 43.

2.5 MATLAB

The Matrix Laboratory (MATLAB) environment is widely used within both academia and industry, and it will be used throughout this course to illustrate different learning examples. MATLAB is, as the name implies, well suited for matrix and vector operations, and these types of operations should be favoured over loops whenever possible. One neat feature of our artificial neuron is therefore that the induced field is determined by the dot product¹ of the inputs and the weights. This means that the model output (Ŷ) for all data points (X) can easily be determined as:

% Calculating the locally induced field
% Please note that W is transposed!
V = W'*X;
% Calculating yHat (assuming that the activation function
% can handle matrices)
YHat = phi(V);

and these calculations still work even if there are several neurons attached to the inputs. In the chapters that follow, we will use this model of a neuron to see 1) how we can represent models, 2) how the models depend on the network's structure, and 3) how we can make the networks learn from data.

¹The dot product between two vectors a and b is obtained as the sum of an element-wise multiplication: Σ_{i=1}^{|a|} a_i b_i.
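For comparison, the same vectorized forward pass can be sketched in plain Python (an illustrative translation of the MATLAB snippet above, not part of the original material; the nested-list matrix layout and the identity default for phi are assumptions):

```python
# V = W'*X: induced fields for K neurons over N data points.
# W has M+1 rows (one per input, including the bias row) and K columns;
# X has M+1 rows and N columns (one per data point).
def forward(W, X, phi=lambda v: v):
    K, N = len(W[0]), len(X[0])
    M1 = len(W)  # M + 1 rows, including the bias row
    V = [[sum(W[m][k] * X[m][n] for m in range(M1)) for n in range(N)]
         for k in range(K)]
    return [[phi(v) for v in row] for row in V]  # YHat = phi(V)

W = [[1.0], [0.5]]          # one neuron: bias 1.0 and weight 0.5
X = [[1, 1, 1], [0, 2, 4]]  # row of ones on top, then x1 values
print(forward(W, X))        # [[1.0, 2.0, 3.0]]
```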

3 Linear regression

In linear² regression, the aim is to fit a hyperplane³ to a set of data points. Such models can be visualized as a single neuron connected to every element in the input vector. We will begin by looking at a simple example with one-dimensional input and output data, and we will assume that the process generating the data can be described with the following model:

y = w_0 + x_1 w_1 + ε    (3.1)

where ε is an error term representing the influence of unknown or unmeasured terms and noise. This model can be described by a single neuron, of the same form as the one in Figure 2.3, if we assume that the activation function (ϕ(.)) simply returns the argument unchanged. In this case, we could then model the process with a neuron looking like the one in Figure 3.1.

At this stage, however, our input data (x) only has dimensionality one, whereas the dimensions of the weight vector (w) would require it to have a dimensionality of two. As can be seen from Equation 3.1 though, w_0 should always be multiplied by one. w_0 hence represents what is called a bias term, and it is present in all models we will look at. For this reason, the simplest fix is to always add a row of ones to the input data matrix (X) (X then gets the dimensions M + 1 by N). This is also the reason why the indexing here starts from zero when all other indexes start from one.


Figure 3.1: Linear regression neuron (x_0 = 1 and w_0 = bias).

²A linear function must satisfy f(x + y) = f(x) + f(y) and f(ax) = af(x) for all a (Lay, 2012, p. 65).

³A hyperplane is a plane with dimensionality D − 1, where D is the dimensionality of the space (Lay, 2012, p. 440). In two-dimensional space, the hyperplane then becomes a line.


Important

Always remember to add a row of ones to the top of your observed data matrix (X) (x0 should always equal 1).
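As a sketch of this bias trick (a Python illustration, not from the original material; the list-of-rows data layout is an assumption):

```python
# Prepend a row of ones to the data matrix X (stored as a list of rows),
# so that w0 always multiplies a constant 1 and acts as the bias term.
def add_bias_row(X):
    n = len(X[0])  # number of data points N
    return [[1.0] * n] + [list(row) for row in X]

X = [[3.0, -1.0, 2.0]]  # one-dimensional inputs, N = 3
print(add_bias_row(X))  # [[1.0, 1.0, 1.0], [3.0, -1.0, 2.0]]
```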

3.1 A quadratic error measure

The problem in linear regression then boils down to choosing values for the weights (w) so that the model explains the observed output (y) as well as possible. In order to know what is good and what is bad, we have to define some kind of error measure that scores each possible combination of weight values. To this end, let's start off by defining an error signal (e) as the difference between each desired output (y), indexed with n, and the corresponding model output (ŷ).

e(n) = y(n) − ŷ(n)    (3.2)

Using the error signal, we then define an instantaneous error energy (E) as:

E(n) = ½ e²(n)    (3.3)

where the factor 1/2 is added for mathematical convenience (a simpler derivative). Finally, we define the average error energy (E_av) to be:

E_av = (1/N) Σ_{n=1}^{N} E(n)
     = (1/(2N)) Σ_{n=1}^{N} e²(n)
     = (1/(2N)) Σ_{n=1}^{N} [y(n) − ŷ(n)]²
     = (1/(2N)) Σ_{n=1}^{N} [y(n) − Σ_{m=0}^{M} w_m x_m(n)]²    (3.4)

where it has been assumed that a linear regression output neuron is used with the activation function ϕ(v) = v. As E(n) squares the observed error signal, E_av will obtain positive contributions from all error signals, independently of whether they have a positive or negative sign.

Hence, the lower the value of E_av, the better the model can explain the observed data. The optimal choice of model weights is therefore obtained at the minimum of E_av with respect to different weight combinations. However, one should keep in mind that these model weights are only optimal for the currently used quadratic error measure.
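Equations 3.2 to 3.4 can be sketched as follows (a Python illustration, not part of the original material; the data values are made up):

```python
# Average error energy: E_av = 1/(2N) * sum_n [y(n) - y_hat(n)]^2
def average_error_energy(y, y_hat):
    N = len(y)
    return sum((yn - yhn) ** 2 for yn, yhn in zip(y, y_hat)) / (2 * N)

y     = [1.0, 2.0, 3.0]
y_hat = [1.0, 1.0, 4.0]                # error signals e(n): 0, 1, -1
print(average_error_energy(y, y_hat))  # (0 + 1 + 1) / (2*3) = 0.333...
```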

3.2 Example data (part 1)

Let's look at an example to illustrate what has been said so far. Figure 3.2a plots 100 data points generated by the process:

y = 1 + 0.5x + r (3.5)

where r is a normally distributed random variable with mean zero and standard deviation 0.5. To this data, we will try to fit the model in Figure 3.1. We are not yet in a position to determine what the optimal weight values are, but as there are only two weights in this particular case, we can plot a surface illustrating how E_av varies for different weight combinations. This is done in MATLAB by calculating E_av as:

E = Y - Yhat;
En = 1/2*E.*E;
En_av = mean(En);

over a grid of different weight combinations and visualizing the result as a surface plot. Such a surface is shown in Figure 3.2b, and although the curvature along the w_0 axis is small, it is still possible to get an idea of where the minimum is located.
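The grid evaluation behind such a surface can be sketched like this (a Python illustration, not from the original material; the tiny noise-free data set and the coarse grid are assumptions made to keep the example small):

```python
# Evaluate E_av over a grid of (w0, w1) combinations for y_hat = w0 + w1*x
def error_surface(xs, ys, w0_grid, w1_grid):
    N = len(xs)
    surface = {}
    for w0 in w0_grid:
        for w1 in w1_grid:
            e_av = sum((y - (w0 + w1 * x)) ** 2
                       for x, y in zip(xs, ys)) / (2 * N)
            surface[(w0, w1)] = e_av
    return surface

xs, ys = [0.0, 2.0, 4.0], [1.0, 2.0, 3.0]  # exactly y = 1 + 0.5x
surf = error_surface(xs, ys, [0, 1], [0, 0.5])
print(min(surf, key=surf.get))             # (1, 0.5), where E_av = 0
```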


Figure 3.2: Example data: a) 100 data points, selected randomly in the interval [-5, 5], generated by the process in Equation 3.5 (black circles), together with a blue line representing the same process without the stochastic variable r; b) E_av as a function of w_0 and w_1 in the interval [-3, 3].

Important

Before proceeding, one should ask if this is the only minimum that can be found. In this case it is, but that is not always true, and we will therefore return to this question later.

3.3 Finding optimal weight values

For simple problems, it is often possible to analytically determine the exact weight values where E_av has its minimum. Linear regression belongs to this group of simpler problems, but in order to prepare for more difficult challenges ahead, we will look at a gradient-based iterative method called gradient descent (also known as the method of steepest descent). Given any set of weights, this method evaluates the curvature of the energy surface (Figure 3.2b) and adjusts the parameters so that a step is taken in the direction of steepest descent. From calculus, we know that the gradient⁴ of a function gives the direction of steepest ascent. Hence, with gradient descent we want to update our current weights so that a step is taken in the negative direction of the gradient. Mathematically, we define this parameter updating process as:

⁴The gradient of a function is a vector where each element is the partial derivative of the function with respect to a certain variable. In our case, the variables are represented by our model's weights.


w_m^new = w_m^old − η ∂E_av/∂w_m^old    (3.6)

where η is a parameter that determines the step size. In order to simplify things further ahead, we will at this point also introduce the following shorthand notation for the weight change (∆w) in Equation 3.6.

∆w_m = η ∂E_av/∂w_m    (3.7)

The next step is then to obtain an expression for the partial derivatives of E_av. We begin by noting the similarity:

∂E_av/∂w_m = (1/N) Σ_{n=1}^{N} ∂E(n)/∂w_m(n)    (3.8)

From here, we derive the partial derivatives of E(n) by applying the chain rule⁵:

∂E(n)/∂w_m(n) = [∂E(n)/∂e(n)] [∂e(n)/∂ŷ(n)] [∂ŷ(n)/∂v(n)] [∂v(n)/∂w_m(n)]

where

∂E(n)/∂e(n) = ∂(½e²(n))/∂e(n) = e(n)
∂e(n)/∂ŷ(n) = ∂[y(n) − ŷ(n)]/∂ŷ(n) = −1
∂ŷ(n)/∂v(n) = ∂ϕ(v(n))/∂v(n) = ∂v(n)/∂v(n) = 1
∂v(n)/∂w_m(n) = ∂[Σ_{m=0}^{M} w_m(n)x_m(n)]/∂w_m(n) = x_m(n)

therefore

∂E(n)/∂w_m(n) = −e(n)x_m(n)

and

∂E_av/∂w_m = −(1/N) Σ_{n=1}^{N} e(n)x_m(n)    (3.9)

Equation 3.9 then tells us in which direction, in weight space, we should move our weights so that E_av decreases.
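Equation 3.9 can be turned into code almost directly. Below is a Python sketch (not part of the original material; the two-point data set is made up for the example):

```python
# Partial derivatives dE_av/dw_m = -(1/N) * sum_n e(n) * x_m(n)
def gradient(X, y, w):
    # X holds one row per input x_m (including the row of ones)
    N = len(y)
    y_hat = [sum(w[m] * X[m][n] for m in range(len(w))) for n in range(N)]
    e = [y[n] - y_hat[n] for n in range(N)]
    return [-sum(e[n] * X[m][n] for n in range(N)) / N for m in range(len(w))]

X = [[1, 1], [0, 2]]               # bias row plus x1 values
y = [1.0, 2.0]                     # generated by y = 1 + 0.5*x1
print(gradient(X, y, [1.0, 0.5]))  # the gradient vanishes at the optimum
```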

3.3.1 Running gradient descent

Using partial derivatives, we have a method for updating the model's weights so that E_av decreases; but this requires that we already have a set of weights that we are trying to improve. To get started, we therefore need to select a set of initial weight values. Previous knowledge about the problem can be used here, but if no such knowledge exists, it is common to simply generate a set of random values. Based on this, the complete algorithm for gradient descent is summarized in Algorithm 1, where it has been assumed that the iterative process continues until convergence, that is, until a point where E_av is no longer decreasing.

⁵The chain rule says that dz/dx = (dz/dy)(dy/dx), assuming z to be a function of a variable y, which in turn is a function of a variable x (Croft, Davison, & Hargreaves, 2001, p. 368).


Algorithm 1: Linear regression using gradient descent.

w ← random initial weights in [-1, 1]
i ← 1   {epoch^a counter}
repeat   {iteration loop for gradient descent}
    E_av ← 0, ∆w ← 0   {reset the accumulators each epoch}
    for n = 1 to N do   {loop over all training examples; replace with matrix operations in MATLAB}
        v(n) ← wᵀx(n)
        ŷ(n) ← v(n)
        e(n) ← y(n) − ŷ(n)
        E(n) ← ½ e²(n)
        E_av ← E_av + (1/N) E(n)
        ∆w ← ∆w − (η/N) x(n) e(n)
    end for
    E_av(i) ← E_av
    Plot progress   {check that E_av is decreasing}
    w ← w − ∆w
    i ← i + 1
until convergence   {E_av is no longer decreasing}

^a When training networks, an epoch refers to one complete run through the training set.

We have seen earlier that the for loop in Algorithm 1 can be replaced with matrix multiplications when calculating V, Ŷ, E, and E_av. Similarly, ∆w can also be calculated directly in MATLAB using

E = Y - Yhat;
dw = -eta * 1/N * X*E';

This speeds up the calculations drastically when large datasets are used. Finally, using Equation 3.6 we update the weights in MATLAB as:

w = w - dw;

It could be noted here that MATLAB also has built-in optimization routines and that several of these are gradient-based. It is therefore possible to use these routines instead of gradient descent, but we will stick with gradient descent throughout this course.
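Putting the pieces of Algorithm 1 together, a batch gradient descent loop can be sketched in Python (illustrative, not the course's MATLAB code; the noise-free data from Equation 3.5, the zero initial weights, and the choices η = 0.1 and 200 epochs are assumptions):

```python
# Batch gradient descent for the linear regression neuron y_hat = w0 + w1*x
def train(xs, ys, eta=0.1, epochs=200):
    w0, w1 = 0.0, 0.0  # deterministic initial weights for reproducibility
    N = len(xs)
    for _ in range(epochs):
        e = [y - (w0 + w1 * x) for x, y in zip(xs, ys)]  # error signals
        dw0 = -eta * sum(e) / N                          # eta * dE_av/dw0
        dw1 = -eta * sum(en * x for en, x in zip(e, xs)) / N
        w0, w1 = w0 - dw0, w1 - dw1                      # w <- w - dw
    return w0, w1

xs = [float(x) for x in range(-5, 6)]
ys = [1 + 0.5 * x for x in xs]     # Equation 3.5 with r = 0
w0, w1 = train(xs, ys)
print(round(w0, 3), round(w1, 3))  # converges towards 1 and 0.5
```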

3.4 Example data (part 2)

Now that we have a general method for finding the minimum of E_av, we can apply it to the example studied earlier. Starting from random initial weight values, the result after 30 iterations of gradient descent is shown in Figure 3.3a. As the energy surface has the form of a long valley, the direction of gradient descent will not necessarily point towards the minimum. Too large η values can therefore make the algorithm take big leaps that end up increasing E_av. In Figure 3.3a, η is on the verge of becoming too large, and this is illustrated by the zigzag path taken. Nevertheless, the algorithm was able to find good model parameters after only 30 iterations (red line in Figure 3.3b).


Figure 3.3: Gradient descent: a) 30 iterations from random initial parameters with η = 0.15; b) 100 data points from Equation 3.5 (black circles), a blue line representing the same process without the stochastic variable r, and a red line representing the model obtained from the trained neuron.

Important

The gradient descent algorithm might become unstable and diverge if η is chosen too large.

3.5 Final remarks on gradient descent

One might wonder why we are using gradient descent to find the minimum when several more efficient algorithms exist. The reason is that gradient descent is very simple and intuitive, and it can also be implemented in batch or online mode. Batch mode corresponds to the description given above, where the contributions from all training examples are summed up before the weights are updated. That is, the output for each data point is evaluated using the same weights before the update is performed. In online mode, the weights are instead updated continuously using the individual partial derivatives obtained from each data point.

Batch and online mode are thus two extremes: one uses all training examples to update the weights, whereas the other uses only one. Between these two extremes we find the so-called mini-batch mode, and this is one of the main reasons why gradient descent is still sometimes used. Imagine having a million training examples. If you implement gradient descent in batch mode, you will have to do a lot of calculations before any progress can be made, but you will know exactly in which direction to move the weights. Online mode requires far fewer calculations before any weights are updated, but here you have only considered one data point, and the direction suggested is not likely to be the same as the one suggested by the whole batch. If you instead randomly select a subset consisting of, say, ten to a hundred thousand training examples (a mini-batch), the direction suggested by this mini-batch will most likely point approximately in the direction suggested by the full batch, but you obtain this information at a fraction of the cost of evaluating all data points. As Equation 3.9 includes a sum over all data points, the presented gradient descent algorithm can easily be used with mini-batches, as this only requires that the sum is restricted to a subset of the data points.
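To make the mini-batch idea concrete, here is a small sketch in Python/NumPy (used here only for illustration, since it mirrors the MATLAB matrix expressions in this compendium). The data-generating process, learning rate, and batch size below are hypothetical choices, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear-regression data: y = 2x - 1 + noise.
N = 1000
X = np.vstack([np.ones(N), rng.uniform(-2, 2, N)])  # 2 x N, row of ones for the bias
y = 2 * X[1] - 1 + 0.1 * rng.standard_normal(N)

w = rng.uniform(-1, 1, 2)     # random initial weights [-1 1]
eta, batch_size = 0.1, 100

for step in range(200):
    idx = rng.choice(N, batch_size, replace=False)  # draw a random mini-batch
    Xb, yb = X[:, idx], y[idx]
    e = yb - w @ Xb                                 # errors on the mini-batch only
    w += eta / batch_size * Xb @ e                  # gradient step from the subset

print(w)   # converges close to [-1, 2]
```

Each step here costs a tenth of a full batch pass, yet the averaged update still moves approximately along the full-batch gradient direction.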


4 Softmax regression

Unfortunately, softmax regression is a bit of a misnomer. We saw already in chapter 2 that the term regression is used for real-valued outputs, whereas the term classification is used for labelled outputs. Softmax regression is, however, a classification algorithm despite its name.

In softmax regression,6 each class (k) is represented by one neuron, and each neuron represents the probability that a given input vector belongs to its corresponding class. The vectors y and ŷ must therefore both sum to 1 (probabilities have to sum to 1). For y, the class label is known, and this vector therefore contains a 1 (indicating 100 % confidence) for the correct class and zeros for the rest. For ŷ, the summation constraint is in turn satisfied by using the following activation function:

φ(v_k) = e^{v_k} / ∑_{k=1}^{K} e^{v_k}    (4.1)

The probability, given by our model, that an input vector belongs to class k is therefore given by:

ŷ_k = p(class = k | x) = e^{w_k^T x} / ∑_{k=1}^{K} e^{w_k^T x}    (4.2)

In MATLAB, we can make use of built in functions and matrix operations to calculate ˆY from V as:

% Assuming V has dimensions K by N
% (for numerical stability, one can subtract the column-wise max from V first)
Yhat = exp(V) ./ ( ones(size(V,1),1) * sum(exp(V)) );
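As a side note, exp(V) can overflow for large induced fields. A common remedy, not used in the MATLAB snippet above, is to subtract the column-wise maximum before exponentiating; the common factor cancels in the ratio of Equation 4.1, so the result is unchanged. A Python/NumPy sketch of this trick (the matrix V below is hypothetical):

```python
import numpy as np

def softmax(V):
    """Column-wise softmax of a K x N matrix of induced fields.
    Subtracting the column max avoids overflow in exp without
    changing the result (the factor cancels in the ratio)."""
    V = V - V.max(axis=0, keepdims=True)
    E = np.exp(V)
    return E / E.sum(axis=0, keepdims=True)

V = np.array([[1.0, 1000.0],
              [2.0, 1001.0]])   # naive exp(1000) would overflow
Yhat = softmax(V)
print(Yhat.sum(axis=0))         # every column sums to 1
```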

The summation over all induced fields in Equation 4.2 requires that each output is connected to all the induced fields. Hence, we will represent softmax regression with the structure in Figure 4.1.

6Softmax regression is a generalization of the more common logistic regression algorithm to more than two classes. This generalization is also known as multinomial logistic regression.



Figure 4.1: Softmax regression model for classification between K classes.

4.1 Maximum likelihood

Softmax regression outputs probabilities for different classes; hence, weight selection should be based upon how likely the observed combination of inputs and desired outputs would be for a given set of weights. As an example, imagine that you are trying to model human height by determining the mean (µ) and the standard deviation (σ), and that you have been given a sample containing the heights of 1000 randomly selected persons from the entire human population. Based upon our sample, we can calculate how likely we are to observe it for different values of µ and σ with a likelihood function (L), defined as L(µ, σ|sample). With this definition, the most logical choice of model parameters would be found at the maximum of L(µ, σ|sample); the task hence becomes an optimization problem, which again could be solved using gradient descent.

Similarly, in softmax regression we are interested in finding the weights for our neurons that would be the most likely ones given the available data, and for our case, the likelihood that our model generated one data point is given by:

L(w|x(n)) = ∏_{k=1}^{K} p_k(n)^{y_k(n)}    (4.3)

Assuming independence between data points, the likelihood function over X then becomes the product over all observed data points.

L(w|X) = ∏_{n=1}^{N} ∏_{k=1}^{K} p_k(n)^{y_k(n)}    (4.4)

Large products are, however, cumbersome to work with. A normal trick is therefore to work with the mean log likelihood function (l) instead, or, as in this case, the mean negative log likelihood function.7 We obtain this function by simply taking the logarithm of Equation 4.4 and multiplying the expression by −1/N, which gives:

l = −(1/N) ∑_{n=1}^{N} ∑_{k=1}^{K} y_k(n) ln(p_k(n)) = −(1/N) ∑_{n=1}^{N} ∑_{k=1}^{K} y_k(n) ln(ŷ_k(n))    (4.5)

7The negative log likelihood function is used in order to transform a maximization problem into a minimization problem. Multiplying a function by negative one flips its surface so that the previous maximum becomes a minimum.



In order to avoid unnecessary loops, it is a lot faster to calculate l directly in MATLAB using:

% Mean negative log likelihood
l = -1/N * sum(sum( Y.*log(Yhat) ));

Just as Eav for the linear regression problem, l has only one minimum in softmax regression; following the gradient is therefore guaranteed to lead to the global minimum.

Differentiating Equation 4.5 with respect to the model parameters is somewhat tricky. A complete derivation can be found in Bishop (1995); here we will just conclude that the derivation in the end gives:

∂l/∂w_mk = −(1/N) ∑_{n=1}^{N} e_k(n) x_m(n)    (4.6)

where e_k(n) is defined as

e_k(n) = y_k(n) − ŷ_k(n)    (4.7)

Quite interestingly, Equation 4.6 is identical to what we obtained for linear regression in the previous chapter. At this point, we have all the information needed to fit models such as the one in Figure 4.1 to labelled data. Using gradient descent, the complete procedure is summarized in Algorithm 2.

Algorithm 2: Softmax regression using gradient descent.

W ← random initial weights [-1 1]
i ← 1 {epoch counter}
repeat {iteration loop for gradient descent}
    ∆W ← 0, l ← 0
    for n = 1 to N do {loop over all training examples (replace with matrix operations in MATLAB)}
        v(n) ← W^T x(n)
        for k = 1 to K do
            ŷ_k(n) ← e^{v_k(n)} / ∑_{k=1}^{K} e^{v_k(n)}
        end for
        e(n) ← y(n) − ŷ(n)
        l(n) ← −∑_{k=1}^{K} y_k(n) ln(ŷ_k(n))
        l ← l + (1/N) l(n)
        ∆W ← ∆W − (η/N) [x(n) e^T(n)]
    end for
    l(i) ← l
    Plot progress {check that l is decreasing}
    W ← W − ∆W
    i ← i + 1
until convergence {l is no longer decreasing}
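The loops of Algorithm 2 collapse into a few matrix operations. The sketch below shows this in Python/NumPy on hypothetical data resembling the two-class example of section 4.3 (two Gaussian clouds); it is an illustration of the procedure under those assumptions, not the course's reference implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 50 points around (2, 2) and 50 around (-2, -2).
N = 100
X1 = np.vstack([np.ones(N // 2),
                2 + rng.standard_normal(N // 2),
                2 + rng.standard_normal(N // 2)])
X2 = np.vstack([np.ones(N // 2),
                -2 + rng.standard_normal(N // 2),
                -2 + rng.standard_normal(N // 2)])
X = np.hstack([X1, X2])          # 3 x N, with the row of ones included
Y = np.zeros((2, N))             # one-hot labels, K x N
Y[0, :N // 2] = 1
Y[1, N // 2:] = 1

W = rng.uniform(-1, 1, (3, 2))   # random initial weights [-1 1]
eta = 0.25

for epoch in range(30):
    V = W.T @ X                                  # induced fields, K x N
    Yhat = np.exp(V) / np.exp(V).sum(axis=0)     # softmax outputs
    W += eta / N * X @ (Y - Yhat).T              # batch gradient step

Yhat = np.exp(W.T @ X) / np.exp(W.T @ X).sum(axis=0)
l = -np.mean((Y * np.log(Yhat)).sum(axis=0))     # mean negative log likelihood
C = np.mean(Y.argmax(axis=0) != Yhat.argmax(axis=0))
print(l, C)
```

After 30 epochs on this well-separated data, l has dropped far below the chance level ln 2 ≈ 0.69 and nearly all points are classified correctly.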


4.2 Gradient checking

Programming mistakes easily occur during implementation of the presented algorithms.

In the example we looked at in chapter 3, it was possible to confirm that our selected parameter values moved in the intended direction by looking at how our position changed on the Eav surface. As the dimensionality of X grows larger (more weights), this is no longer possible, and we therefore need another method to verify our calculations. A nice way of verifying that everything is working as intended is to compare the calculated partial derivatives to approximate numerical estimates.

The partial derivative for each weight tells us how much Eav or l should change when that weight is varied. We can easily verify this by adding or subtracting a small number (κ) from a weight and calculating the induced change in Eav or l afterwards. To get an unbiased estimate, it is a good idea to take the average of the observed change when κ is both added and subtracted. Mathematically, we can describe the estimated partial derivative as:

estimated value for ∂l/∂w_mk = [l(w_mk^+) − l(w_mk^−)] / (2κ)    (4.8)

where

w_mk^+ = w_mk + κ  and  w_mk^− = w_mk − κ

Equation 4.8 can be repeated for every weight, and the results can be compared to what is obtained from either Equation 3.9 or Equation 4.6. For small values of κ (e.g. 10^−4), the estimated and computed partial derivatives should be identical for at least the first three or four decimals.
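The central-difference check of Equation 4.8 can be sketched in Python/NumPy as follows, here verified against the known gradient of a simple quadratic function, which serves as a hypothetical stand-in for l:

```python
import numpy as np

def num_grad(f, W, kappa=1e-4):
    """Central-difference estimate of df/dW, one weight at a time (Eq. 4.8)."""
    G = np.zeros_like(W)
    for i in np.ndindex(W.shape):
        Wp, Wm = W.copy(), W.copy()
        Wp[i] += kappa             # w_mk^+
        Wm[i] -= kappa             # w_mk^-
        G[i] = (f(Wp) - f(Wm)) / (2 * kappa)
    return G

# Check against the analytic gradient of f(W) = sum(W^2), which is 2W.
W = np.array([[0.5, -1.0],
              [2.0, 0.25]])
G = num_grad(lambda W: np.sum(W**2), W)
print(np.max(np.abs(G - 2 * W)))   # numerically zero
```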

4.3 A two-class example

Let's start off with a two-class example to get a feel for how softmax regression works. Our problem consists of learning to correctly classify the 100 data points in Figure 4.3a. These data points are generated by the processes:

Class 1: x_1 = 2 + r, x_2 = 2 + s        Class 2: x_1 = −2 + r, x_2 = −2 + s    (4.9)

where both r and s are normally distributed random numbers with mean 0 and standard deviation 1. After adding a row of ones to X, its dimensions are 3 by 100. Labels are coded using the desired outputs as:

y = [1 0]^T if class = 1,  [0 1]^T if class = 2    (4.10)

Hence, the model that we will try to fit looks like the one illustrated in Figure 4.2. Even if Equation 4.5 provides an error measure, it is still interesting to know how often data points are classified incorrectly. For this purpose, we introduce a classification error ratio (C) as:

C = (1/N) ∑_{n=1}^{N} { 0 if the predicted class is equal to the assigned class, 1 otherwise }    (4.11)




Figure 4.2: Softmax regression model for a two dimensional two class problem.

When calculating C in MATLAB, we can make use of built-in relational operators, but first we need a vector with class labels for both Y and Ŷ. As our model outputs probabilities, we would like to assign a data point to the class with the highest probability. With our matrix definitions, this corresponds to selecting the row in Y or Ŷ with the maximum value in each column, and taking the row number as the defined or predicted label for each data point. In the end, we can implement all of this using the following three lines of code in MATLAB:

% Labels from Y
[~, Y_labels] = max(Y, [], 1);
% Labels from Yhat
[~, Yhat_labels] = max(Yhat, [], 1);
% The classification error ratio
C = mean( Y_labels ~= Yhat_labels );
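The same label-extraction and error-ratio computation can be sketched in Python/NumPy; the tiny label and output matrices below are hypothetical:

```python
import numpy as np

Y = np.array([[1, 0, 0],
              [0, 1, 1]])            # one-hot labels, K x N
Yhat = np.array([[0.9, 0.2, 0.6],
                 [0.1, 0.8, 0.4]])   # model output probabilities

Y_labels = Y.argmax(axis=0)          # row index of the 1 in each column
Yhat_labels = Yhat.argmax(axis=0)    # row with the highest probability
C = np.mean(Y_labels != Yhat_labels)
print(C)   # the third column is misclassified -> 0.333...
```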

Before running the procedure in Algorithm 2, we check that our implementation is correct by comparing the calculated and estimated gradients. For random initial weights [-1 1], this comparison showed that the calculated and estimated partial derivatives are identical for the first four decimals (see Table 4.1).

Table 4.1: Comparison between calculated and estimated partial derivatives for random initial weights [-1 1] and with κ = 10^−4

             ∂l/∂w01   ∂l/∂w02   ∂l/∂w11   ∂l/∂w12   ∂l/∂w21   ∂l/∂w22
Calculated   -0.0142    0.0142   -0.3747    0.3747   -0.4990    0.4990
Estimated    -0.0142    0.0142   -0.3747    0.3747   -0.4990    0.4990

Figure 4.3b illustrates training progress for 30 epochs while running the procedure in Algorithm 2 with η = 0.25. After these 30 epochs, l is hardly decreasing any more, and it was concluded that the algorithm had converged. The decision surface found by the algorithm is illustrated in Figure 4.3c, together with a decision surface found using linear regression in Figure 4.3d. Linear regression is clearly not suited for classification tasks, whereas softmax regression is meant to handle labelled data.

4.4 Training and testing

Our hope when training a model is that it will be able to generalize to unseen data. However, as one starts to fit models with more and more weights, there is always a risk that the model starts to learn patterns that are only found in the data used for training. This phenomenon is called overfitting, and it can make the model perform significantly worse on unseen data. Even if overfitting has not occurred, models normally perform worse on unseen data, because it might contain patterns that were not present in the data used for training. Model performance on training data is therefore not a good estimate of how well the model will perform. Available data is therefore normally divided into a training set and a test set. The training set is used for training the model, whereas the test set is used for getting an unbiased measure of how well the model performs. A common division is to randomly select around 80 % of the data for training and leave the rest for testing.
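A random 80/20 split of the kind described above can be sketched in Python/NumPy; the data matrices here are hypothetical placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 1000))   # hypothetical 3 x 1000 data matrix
Y = rng.integers(0, 2, (2, 1000))    # hypothetical label matrix

perm = rng.permutation(X.shape[1])   # shuffle the column indices
n_train = int(0.8 * X.shape[1])      # 80 % for training

X_train, X_test = X[:, perm[:n_train]], X[:, perm[n_train:]]
Y_train, Y_test = Y[:, perm[:n_train]], Y[:, perm[n_train:]]
print(X_train.shape, X_test.shape)   # (3, 800) (3, 200)
```

Using one permutation for both X and Y keeps each data point paired with its label after the shuffle.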


Figure 4.3: a) 50 data points from both class 1 and class 2, generated in accordance with Equation 4.9. b) Training results with random initial weights [-1 1] and η = 0.25. The blue line represents l and the dashed blue line is the classification error as given by Equation 4.11. c) Decision surface for the softmax regression network trained in Figure 4.3b. d) Decision surface for a linear regression model trained on the data in Figure 4.3a.

4.5 MNIST

Different people and groups all over the world develop new or improve existing algorithms, and it is often difficult to know which ones are the best. However, there exist several “famous” datasets that people try their algorithms on, and comparisons are hence possible. One well-known dataset is the MNIST database of handwritten digits (LeCun, Bottou, Bengio, & Haffner, 1998). This dataset contains 60 000 images (20 by 20 pixels) that are to be used for training and 10 000 more for testing. As can be noted in Benenson (2013), the best reported classification error ratio on the test set is 0.21 %. It could also be noted that deep learning networks are currently at the top for both this and other image classification datasets.

The MNIST datasets can be freely downloaded from the MNIST homepage, and a function for reading them into MATLAB can be found on Matlab Central. Figure 4.4b illustrates 16 images taken from the training set.

Data in image format have to be concatenated into vector format before we can fit a model using softmax regression. This is accomplished by simply taking each column in the image matrix and placing them on top of each other to form a vector x with dimensionality 400. As before, we also have to add a row of ones; the dimensions of X_training then become 401 by 60 000. A similar process for the test set gives X_test with dimensions 401 by 10 000.
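The column-wise concatenation, plus the added bias element, can be sketched in Python/NumPy (the image content below is hypothetical):

```python
import numpy as np

img = np.arange(400).reshape(20, 20)   # a hypothetical 20 x 20 image
x = img.flatten(order='F')             # 'F' stacks the columns on top of each other
x = np.concatenate([[1.0], x])         # prepend the bias input x0 = 1
print(x.shape)   # (401,)
```

Note the column-major order 'F', which matches both the text's description and MATLAB's own `img(:)` behaviour.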

Labels are coded using the desired outputs as:

y = [1 0 0 0 0 0 0 0 0 0]^T if digit = 1
    [0 1 0 0 0 0 0 0 0 0]^T if digit = 2
    ...
    [0 0 0 0 0 0 0 0 0 1]^T if digit = 0    (4.12)

Y_training and Y_test therefore have the dimensions 10 by 60 000 and 10 by 10 000, respectively.
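The coding in Equation 4.12 (digit 1 in the first row, …, digit 0 in the last row) can be sketched in Python/NumPy; the label vector below is a hypothetical example:

```python
import numpy as np

labels = np.array([1, 2, 0, 9])              # hypothetical digit labels
rows = np.where(labels == 0, 9, labels - 1)  # digit 0 goes into the last row
Y = np.zeros((10, labels.size))              # 10 x N one-hot matrix
Y[rows, np.arange(labels.size)] = 1
print(Y.argmax(axis=0))   # [0 1 9 8]
```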

In the end, this means that the model we will try to fit looks like the one illustrated in Figure 4.4a. Training results after 100 epochs of gradient descent using random initial weights [-1 1], η = 2, and mini-batches of 20 % are shown in Figure 4.4c. Running gradient descent for additional epochs could improve the results slightly, but overall it looks like the algorithm has converged; the final results are summarized in Table 4.2.


Figure 4.4: a) Softmax regression model for classifying the MNIST data set. b) 16 images from the MNIST training set. c) Training progress on the MNIST dataset using softmax regression with random initial weights [-1 1], η = 2, and mini-batches of size 20 %. Solid lines represent l and the dashed line is C.


Table 4.2: Numerical summary of the final training results from Figure 4.4c

l_training   l_test   C_training   C_test
0.4561       0.4695   0.1051       0.1053

Correctly classified training images:    53696
Incorrectly classified training images:   6304
Correctly classified test images:         8947
Incorrectly classified test images:       1053

4.6 Restrictions for linear models

Both linear and softmax regression represent artificial neural networks with a single layer of neurons. This simplicity has the advantage that both problems are convex,8 but it also brings restrictions on what is possible to achieve. Linear regression can only fit hyperplanes to the observed data, and softmax regression can only separate classes using hyperplanes. Softmax regression can therefore only classify all data points correctly if the data is linearly separable. In two dimensions, linearly separable means that the classes should be separable by a straight line. A simple example that is not linearly separable is the XOR problem.

Important

Softmax regression can only try to separate classes using hyperplanes as boundaries and can therefore only obtain C = 0 on linearly separable problems.

If the models found using linear or softmax regression are unsatisfactory, more complex models will have to be used. One such example is the multilayer perceptron network, presented in the next chapter.

8Convex problems only have one minimum, and hence gradient descent is guaranteed to find the global minimum of either Eav or l.


5 Multilayer Perceptrons

The previous chapter ended with a discussion about limitations of one-layer networks. It turns out that we can get around these limitations by adding another layer of neurons. This gives us a Multilayer Perceptron (MLP) network (another misnomer, unfortunately)9 and an example is given in Figure 5.1. Previously, we only had an input layer with inputs and an output layer with neurons; now we have added a hidden layer with neurons in between these two layers (one could also add more than one hidden layer). The neurons in the hidden layer function as inputs to the output layer. It is, however, important that neurons in the hidden layer have a non-linear activation function. If not, the network can be collapsed down to a single layer and there is no gain in using a hidden layer. At the same time, the activation functions must be differentiable; otherwise, as we shall see, we will not be able to train the network using gradient descent.

Important

Neurons located in hidden layers must have differentiable non-linear activation functions.


Figure 5.1: MLP neural network with one hidden layer (Roman numerals are used to indicate depth).

9The perceptron is actually a type of learning algorithm for linear classifiers and unfortunately does not have anything to do with MLP networks.



Figure 5.2: a) The hyperbolic tangent function. b) The derivative of the hyperbolic tangent function.

What, then, are suitable activation functions? A common choice that fulfils the requirements is the hyperbolic tangent function. This function and its derivative are shown in Figure 5.2a and Figure 5.2b, and they can be represented mathematically as:

ϕ_tanh(v) = tanh(v) = −1 + 2 / (1 + e^{−2v})    (5.1)

ϕ'_tanh(v) = sech²(v) = 4e^{2v} / (e^{2v} + 1)²    (5.2)
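A quick numerical check of Equation 5.2: sech²(v) also equals 1 − tanh²(v), a form worth remembering since it lets the derivative be computed directly from the neuron's output. A Python/NumPy sketch:

```python
import numpy as np

v = np.linspace(-3, 3, 101)
phi = np.tanh(v)                                            # Equation 5.1
d_explicit = 4 * np.exp(2 * v) / (np.exp(2 * v) + 1) ** 2   # Equation 5.2
d_from_output = 1 - phi ** 2                                # same derivative, via the output
print(np.max(np.abs(d_explicit - d_from_output)))           # numerically zero
```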

Regardless of whether we are performing regression or classification, we will here always use the hyperbolic tangent function as the activation function for all neurons in the hidden layer. The output layer, on the other hand, will either consist of a linear regression neuron or softmax regression neurons. We can therefore look upon the hidden layer as a non-linear projection of the inputs before performing linear or softmax regression. Thinking about the XOR problem, our objective then becomes to select weights for the hidden layer so that the classes, after the non-linear projection, become linearly separable.

With multiple layers we have to calculate the outputs sequentially. That is, we first determine the outputs for the neurons in the hidden layer, whereupon we determine the outputs for the neurons in the output layer. Just as we always add a row of ones to X before calculating Ŷ^i, we also have to add a row of ones to Ŷ^i before calculating Ŷ^ii. In MATLAB, this is done by:

% Calculating the induced fields in the hidden layer
% Please note that Wi is transposed!
Vi = Wi'*X;
% Calculating outputs from the hidden layer
YiHat = tanh(Vi);

% Adding a row of ones to YiHat
YiHat = [ones(1,size(YiHat,2)); YiHat];

% Calculating induced fields in the output layer
% Please note that Wii is transposed!
Vii = Wii'*YiHat;
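To see why the hidden layer helps, consider the XOR problem from chapter 4. With hand-picked (not trained) weights, two tanh neurons can project the four XOR points so that they become linearly separable for the output layer. The weight values in this Python/NumPy sketch are illustrative choices, not something derived in the text:

```python
import numpy as np

X = np.array([[1, 1, 1, 1],          # bias row
              [0, 0, 1, 1],
              [0, 1, 0, 1]])         # 3 x 4: the four XOR inputs
y = np.array([0, 1, 1, 0])           # XOR labels

# Hand-picked hidden weights (one column per neuron):
# an OR-like unit and an AND-like unit.
Wi = np.array([[-5.0, -15.0],
               [10.0,  10.0],
               [10.0,  10.0]])
H = np.tanh(Wi.T @ X)                # 2 x 4 hidden-layer outputs
H = np.vstack([np.ones(4), H])       # add the bias row, as in the MATLAB code

Wii = np.array([[-1.0], [1.0], [-1.0]])      # output = -1 + h1 - h2
yhat = (Wii.T @ H > 0).astype(int).ravel()
print(yhat)   # [0 1 1 0]
```

In the projected (h1, h2) space, the two classes end up on opposite sides of a single line, which is exactly the linear separability that the raw XOR inputs lack.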

References
