
Development of a General Purpose On-Line Update Multiple Layer Feedforward Backpropagation Neural Network

Master Program in Electrical Science, 1997
College/University of Karlskrona/Ronneby
Department of Signal Processing


Abstract

This Master's thesis deals with the complete understanding and creation of a 3-layer backpropagation neural network with synaptic weight updates performed on a per-sample basis (called on-line update). The aim is to create such a network for general purpose applications and with a great degree of freedom in choosing the inner structure of the network.

The algorithms used are all members of supervised learning classes, i.e. they are all supervised by a desired signal.

The theory will be treated thoroughly for the steepest descent algorithm and for additional features which can be employed in order to increase the degree of generalization and the learning rate of the network.

Empirical results will be presented, and some comparisons with purely linear algorithms will be made for a signal processing application: speech enhancement.


Contents

Section 1 Preface
Section 2 Structure of a single neuron
2.1 General methods for unconstrained optimization
2.2 Basic neuron model
2.3 Widrow-Hoff Delta Rule or LMS Algorithm
2.4 Single Layer Perceptron
2.5 General Transfer functions
Section 3 Neural Network Models
3.1 Different Models
3.2 Feedforward Multilayer Neural Network
Section 4 Steepest Descent Backpropagation Learning Algorithm
4.1 Learning of a Single Perceptron
4.2 Learning of a Multilayer Perceptron
Section 5 Additional Features for improving Convergence speed and Generalization
5.1 Algorithm with Momentum Updating
5.2 Algorithms with Non-Euclidean Error Signals
5.3 Algorithm with an Adaptation of the Slope of the Activation Functions
5.4 Adaptation of the Learning Rate and Mixing of input patterns
5.5 A combined stochastic-deterministic weight update
Section 6 Comparison of a Batch update system with an On-Line Update system
Section 7 Empirical Results
7.1 Approximation of sin(x) with two neurons in hidden layer
7.2 Separation of signals with noise
7.3 Processing of EEG-signals
7.4 Echo-canceling in a Hybrid
7.5 Speech enhancement for Hands-Free mobile telephone set
Section 8 Further Development
Section 9 Conclusions


1. Preface

Neural Networks are used in a broad variety of applications. The way they work and the tasks they solve differ widely. Often a Neural Network is used as an associator, where an input vector is associated with an output vector. These networks are trained rather than programmed to perform a given task, i.e. a set of associated vector pairs is presented in order to train the network. The network will then hopefully give satisfactory outputs when facing input vectors not present in the training phase. This is the generalization property, which in fact is one of the key features of Neural Networks.

The training phase is sometimes done in batch, which means that the training vector pairs are all presented at the same time to the network. If the task to be solved is the same over time, i.e. the task is stationary, training will be needed only once. In this case the batch procedure can be used. If the task changes over time the network will have to be adaptive, i.e. to change with the task. In this case on-line training is preferable. The on-line approach takes the training vector pairs one at a time and performs a small adjustment in performance for every such presentation.

The on-line approach has the great advantage that learning time is reduced considerably compared to a batch system. In signal processing applications the adaptation during training is of great importance. This can only be achieved with the on-line system.

In this thesis an on-line update Neural Network will be created and thoroughly explained. The Neural Network will be general in the sense that its structure, performance, complexity and adaptation rules can be chosen arbitrarily. Empirical experiments on both artificial and real-life applications will be presented.

The results show that many standard applications which have earlier been solved with linear systems can easily be solved with neural networks. Often the result is better. With a properly chosen structure, adaptation during learning can be satisfactory.

In real-time implementations the use of neural networks can be limited by their need for computational power. However, the parallel structure of these networks can make way for implementations with parallel processors.

I would like to thank Professor Andrzej Cichocki for letting me use material from his and Dr Rolf Unbehauen's book [7].

Table 2.1 and figures 3.1, 4.1 and 4.2 are taken from this book.


2. Structure of a single neuron

2.1 General methods for unconstrained optimization

Consider the following optimization problem: find a vector w that minimizes the real-valued scalar function J(w):

min J(w) , where w = [w_1 w_2 ... w_n]^T and the number of elements n of w is arbitrary.

This problem can be transformed into an associated system of first-order ordinary differential equations as

dw_i/dt = −Σ_{j=1}^{n} η_ij ∂J(w)/∂w_j

The vector w will then follow a trajectory in the phase portrait of the differential equations to a local or global minimum.

In vector/matrix notation this becomes

dw/dt = −η(w,t) ∇_w J(w)

An initial vector w_0, from which the trajectory starts, must be chosen.

Different selections of the matrix η(w,t) give different methods. The simplest selection is to choose the matrix as the identity matrix multiplied by a small constant η_0, the learning parameter; in this case the method becomes the well-known method of Steepest Descent.

If a Taylor approximation of J(w), truncated after the second-order terms, is taken, its gradient can be set to zero and solved, since a necessary condition for a local minimum is that the gradient equals zero. In this case η(w,t) is chosen as the inverse Hessian matrix. This is the well-known Newton's method:

dw/dt = −η_0 [∇²J(w)]^(−1) ∇_w J(w)

There are some drawbacks with Newton's method. First, the inverse of the Hessian must be calculated, and this inverse does not always exist (or the Hessian can be very ill-conditioned).

One way to overcome this is to add an identity matrix multiplied by a small scalar to the Hessian matrix in order to improve its condition. Of course, this will then only be an approximation to the true Hessian. This method is called the Marquardt-Levenberg algorithm.

Second, there are strong restrictions on the chosen initial values w_0, because the truncated Taylor series gives poor approximations far away from the local minima.

In the context of Neural Networks there are even more difficulties because of the computational overhead and large memory requirements. In an on-line update system we would need to calculate as many matrix inversions as there are neurons in the network, for every sample.

There are several approaches which iteratively approximate this inverse without actually calculating one, i.e. quasi-Newton methods.

This thesis will emphasize the Steepest Descent method, since the weight update will be on-line and this sets strong requirements on computational simplicity.
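To make the discrete form of the method concrete, the following is a minimal Python sketch of the steepest descent iteration w(k+1) = w(k) − η ∇J(w(k)) applied to a simple quadratic test function (the test function and all names here are illustrative assumptions, not taken from the thesis):

```python
import numpy as np

def steepest_descent(grad, w0, eta=0.1, n_iter=200):
    """Discrete steepest descent: w(k+1) = w(k) - eta * grad_J(w(k))."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_iter):
        w = w - eta * grad(w)
    return w

# Illustrative quadratic J(w) = 0.5 * w^T A w with positive definite A,
# so grad J(w) = A w and the unique minimum is w = 0.
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
w_min = steepest_descent(lambda w: A @ w, w0=[1.0, -1.0])
print(w_min)  # close to [0, 0]
```

Newton's method would replace `grad(w)` by `np.linalg.solve(hessian(w), grad(w))`, which is exactly the per-step matrix inversion that makes it unattractive for on-line use.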


2.2 Basic neuron model

Let us consider a single neuron which takes a vector x = [x_1 x_2 ... x_n]^T as input and produces a scalar y as output. As before, the number of elements in vector x, denoted n, is arbitrary. The neuron has inner strengths w = [w_1 w_2 ... w_n]^T, also called the synaptic weights. In most models of a neuron there is also a bias, a scalar denoted Θ. Often the input x is multiplied with the synaptic weights and then the bias is added. The result is passed through a scalar activation function, Ψ(⋅), according to

y = Ψ(Σ_{i=1}^{n} w_i x_i + Θ)

where index i stands for the i:th scalar in the vectors w and x respectively.

This can be written in a compact form as

y = Ψ(Σ_{i=0}^{n} w_i x_i)

where w_0 = Θ and x_0 = 1 by definition.

The activation function Ψ(⋅) can be any linear or non-linear function which is piece-wise differentiable.

The state of the neuron is measured by the output signal y.
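As a minimal sketch, the neuron model above can be written directly in Python (tanh stands in for the activation Ψ; all names are illustrative):

```python
import numpy as np

def neuron(x, w, theta, psi=np.tanh):
    """Basic neuron: y = psi(sum_i w_i * x_i + theta)."""
    return psi(np.dot(w, x) + theta)

x = np.array([0.5, -1.0, 2.0])   # input vector
w = np.array([0.1, 0.4, -0.2])   # synaptic weights
print(neuron(x, w, theta=0.3))   # scalar output y
```

Prepending x_0 = 1 to the input and Θ to the weight vector gives the compact form y = Ψ(w^T x) used throughout the text.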

2.3 Widrow-Hoff Delta rule or LMS algorithm

In the Widrow-Hoff Delta rule, also called the LMS algorithm, the activation function is linear, often with a slope of unity. In supervised learning the output y should equal a desired value d. The LMS algorithm is a method for adjusting the synaptic weights to minimize the error function

J = (1/2) e²(t) = (1/2)(d − y)²

For several inputs, the sum of J is to be minimized according to the L2-norm (other norms can be used, see section 5.2).

Applying the steepest descent approach, the following system of differential equations is obtained:

dw_i/dt = −η ∂J/∂w_i = −η (∂J/∂y)(∂y/∂w_i)

Since

y = Σ_{i=1}^{n} w_i x_i

we get

dw_i/dt = η ⋅ e(t) ⋅ x_i(t)

where t is the continuous time index and η is the learning parameter.

Converting this into the standard discrete-time LMS algorithm and writing it in vector notation gives

w(k+1) = w(k) + µ e(k) x(k)

Here index k stands for the discrete sample index and µ is the learning parameter (stepsize). The parameter µ differs from the learning parameter η in the continuous case; generally it must be smaller to ensure convergence.
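A sketch of one on-line LMS step, following the discrete update above (function and variable names are illustrative):

```python
import numpy as np

def lms_step(w, x, d, mu):
    """One on-line LMS update: w(k+1) = w(k) + mu * e(k) * x(k)."""
    y = np.dot(w, x)      # linear neuron with unit slope
    e = d - y             # error against the desired value
    return w + mu * e * x, e
```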


In signal processing applications, where LMS is widely used, the vector x is a time-shifted version of a stream of discrete input samples.

Consider a stream of input scalars (a sampled signal) with sample index k starting at 0 and increasing to infinity, that is

X = [x(0) x(1) , ... , x(i−1) x(i) x(i+1) , ...]

The aim is to map this signal one-to-one onto a desired signal D

D = [d(0) d(1) , ... , d(i−1) d(i) d(i+1) , ...]

by filtering the signal X with a finite impulse response filter F. This type of filter uses a finite length of past experience from the signal X as input, without any recursion. The number of filter coefficients in F is denoted n and the filter coefficients are w = [w(0) w(1) ... w(n−1)]^T. The output at sample index k is

y(k) = Σ_{i=0}^{n−1} w(i) x(k−i)

which is the convolution between w and x, that is y = w ∗ x. Using the LMS algorithm to update the filter coefficients during presentation of the input stream gives, as before,

w(k+1) = w(k) + µ e(k) x(k)

where the input vector x at sample index k is

x(k) = [x(k) x(k−1) ... x(k−n+1)]^T

The error e(k) is, as before, defined as the difference between the desired value and the actual outcome of the filter, that is e(k) = d(k) − y(k).

There are several improvements to this standard LMS algorithm, each fitting its special purpose. Table 2.1 shows some modified versions of the LMS algorithm. The aim of these modifications is to improve the standard LMS algorithm with regard to different aspects: some focus on rapid convergence, others on minimizing the excess mean squared error. Some of the modifications, such as the sign algorithms, aim to reduce computational complexity. Further insight into their behavior can be obtained from any textbook in signal processing.
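The following sketch runs the FIR/LMS setup just described on synthetic data, identifying an unknown filter from its input and output (the scenario and all names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8                                    # number of filter taps
w_true = rng.standard_normal(n)          # unknown FIR filter generating d(k)
w = np.zeros(n)                          # adaptive filter coefficients
mu = 0.01                                # stepsize

x_stream = rng.standard_normal(5000)     # input signal
for k in range(n - 1, len(x_stream)):
    x = x_stream[k - n + 1:k + 1][::-1]  # x(k) = [x(k) x(k-1) ... x(k-n+1)]^T
    d = np.dot(w_true, x)                # desired signal
    e = d - np.dot(w, x)                 # a priori error
    w = w + mu * e * x                   # LMS update

print(np.max(np.abs(w - w_true)))        # near zero after convergence
```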


Table 2.1 Modified and improved versions of the LMS algorithm, [7]


2.4 Single layer perceptron

The single layer perceptron uses the hardlimiter as the activation function. This hardlimiter gives only quantized binary outputs, that is ỹ ∈ {−1, 1}. In this case the output is compared with the quantized binary desired value d̃ ∈ {−1, 1} and a quantized error signal ẽ = (d̃ − ỹ) ∈ {−2, 0, 2} is produced. The update rule is very similar to that of the LMS algorithm:

w(k+1) = w(k) + µ ẽ(k) x(k)

The only difference is that the error is quantized. This perceptron can achieve any separation of input vectors that are linearly separable. Input vectors that are not linearly separable can be separated as well, by appropriately interconnecting several perceptrons.

In more general perceptron implementations the activation function is non-linear, see the next section.

If an arbitrary transfer function is used which is piecewise differentiable, the update rule can be derived in a similar way as for the LMS algorithm. Consider a general transfer function Ψ(⋅) with derivative Ψ'(⋅):

u = Σ_{i=0}^{n} w_i x_i and y = Ψ(u)

As previously, we will use the L2-norm for error measurement (a more general error norm will be described in section 5.2):

J = (1/2) ẽ²(t) = (1/2)(d̃ − ỹ)² , where index t stands for continuous time.

The differential equation states (applying the chain rule)

dw_i/dt = −η ∂J/∂w_i = −η (∂J/∂ẽ)(∂ẽ/∂w_i) = −η (∂J/∂ẽ)(∂ẽ/∂u)(∂u/∂w_i)

This gives, for a general activation function, the discrete vector form

w(k+1) = w(k) + µ ⋅ ẽ(k) ⋅ Ψ'(u(k)) ⋅ x(k)

Here the error ẽ is quantized as before, index k stands for the discrete sample index and µ is the discrete stepsize. Often the activation function is an s-shaped sigmoid function.
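A sketch of the quantized perceptron rule from the beginning of this section, trained on a small linearly separable problem (the logical OR task and all names are illustrative assumptions):

```python
import numpy as np

def perceptron_step(w, x, d, mu):
    """Hardlimiter perceptron update with quantized error in {-2, 0, 2}."""
    y = 1.0 if np.dot(w, x) >= 0 else -1.0
    e = d - y
    return w + mu * e * x

# Bipolar OR; x[0] = 1 carries the bias weight.
data = [(np.array([1.0, -1.0, -1.0]), -1.0),
        (np.array([1.0, -1.0,  1.0]),  1.0),
        (np.array([1.0,  1.0, -1.0]),  1.0),
        (np.array([1.0,  1.0,  1.0]),  1.0)]
w = np.zeros(3)
for _ in range(20):                      # a few passes suffice here
    for x, d in data:
        w = perceptron_step(w, x, d, mu=0.1)
print([1.0 if np.dot(w, x) >= 0 else -1.0 for x, _ in data])  # [-1, 1, 1, 1]
```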

2.5 General transfer functions

There is a variety of transfer functions, each with its key applications. The most general and widely used functions are the linear, bipolar sigmoid and unipolar sigmoid functions. These are the transfer functions which will be used in this approach. In the network, which will be derived later, they can be intermixed in any way.

The linear function just passes the value with an amplification due to the slope. Often the slope is set to unity, because it only affects the magnitude of the stepsize.

The bipolar sigmoid function is the hyperbolic tangent function with a slope, see fig 2.4.1:

Ψ(u) = tanh(γ⋅u) = (1 − e^(−2γu)) / (1 + e^(−2γu))

The definition set is all real numbers and the target set is the real numbers between −1 and 1.

The unipolar sigmoid function is also s-shaped, but the target set is the real numbers between 0 and 1, see fig 2.4.2:

Ψ(u) = 1 / (1 + e^(−γu))


The slope γ is a parameter which controls the steepness of the non-linearity; it can either be pre-specified or updated according to a scheme. Section 5.3 deals with this aspect.

Fig 2.4.1 The bipolar sigmoid function for different slopes (0.5, 1.0, 2.0)

Fig 2.4.2 The unipolar sigmoid function for different slopes (0.5, 1.0, 2.0)

The derivative of the bipolar sigmoid function (bsig) follows by using the quotient rule:

dΨ(u)/du = d/du [(1 − e^(−2γu)) / (1 + e^(−2γu))]
= [2γe^(−2γu)(1 + e^(−2γu)) + 2γe^(−2γu)(1 − e^(−2γu))] / (1 + e^(−2γu))²
= 4γe^(−2γu) / (1 + e^(−2γu))²
= γ(1 − tanh²(γu)) = γ(1 − Ψ²(u))

And for the unipolar sigmoid function (usig):

dΨ(u)/du = d/du [1 / (1 + e^(−γu))]
= γe^(−γu) / (1 + e^(−γu))²
= γ ⋅ Ψ(u) ⋅ (1 − Ψ(u))
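These closed forms, where the derivative is expressed through the function's own output, are what make the sigmoids cheap to use in backpropagation. A sketch with a finite-difference check (names are illustrative):

```python
import numpy as np

def bsig(u, gamma=1.0):
    """Bipolar sigmoid: tanh(gamma * u), range (-1, 1)."""
    return np.tanh(gamma * u)

def bsig_prime(u, gamma=1.0):
    """gamma * (1 - Psi(u)^2), using the output itself."""
    y = bsig(u, gamma)
    return gamma * (1.0 - y ** 2)

def usig(u, gamma=1.0):
    """Unipolar sigmoid: 1 / (1 + exp(-gamma * u)), range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-gamma * u))

def usig_prime(u, gamma=1.0):
    """gamma * Psi(u) * (1 - Psi(u)), using the output itself."""
    y = usig(u, gamma)
    return gamma * y * (1.0 - y)

# Verify both derivatives against central finite differences.
u, h, g = 0.7, 1e-6, 2.0
assert abs(bsig_prime(u, g) - (bsig(u + h, g) - bsig(u - h, g)) / (2 * h)) < 1e-6
assert abs(usig_prime(u, g) - (usig(u + h, g) - usig(u - h, g)) / (2 * h)) < 1e-6
```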


3. Neural Network Models

3.1 Different models

There are different types of Neural Network models for different tasks. The way they work and the tasks they solve differ widely, but they also share some common features. Generally, Neural Networks consist of simple processing units interconnected in parallel.

These connections divide the set of Networks into three large categories:

1. Feedforward networks

These networks can be compared with conventional FIR-filters without adaptation of the weights. The difference is that they can perform a non-linear mapping of input signals onto the target set. As for FIR-filters, there are no stability problems.

2. Feedback networks

These Networks are non-linear dynamic systems and can be compared with IIR-filters. The Elman Network and the Hopfield Network are both feedback Networks used in practice.

3. Cellular networks

These Networks have more complex interconnections. Every neuron is interconnected with its neighbors, and the neurons can be organized in two or three dimensions. A change in one neuron's state will affect all the others. Often-used training methods are the simulated annealing schedule (which is a pure stochastic algorithm) and mean field theory (which is a deterministic algorithm).

In supervised learning, backpropagation learning can be used in order to find the weights that best (according to an error norm) perform a specific task. This method will be described in detail in section 4.2 for a feedforward Neural Network.

In unsupervised learning, where no desired target is present, the task to be performed is often to reduce all components in the input signal that are correlated (here not only linear correlation is meant, as is often the case). This can for instance be done by minimizing

E(w) = (α/2)‖w‖₂² − σ(w^T x) , where σ(u) is the loss function, typically σ(u) = (1/2)u²

This is called the potential learning rule. The first term on the right side is usually called the leaky effect. If this unsupervised approach is adopted, the algorithm described above will only change to the following:

w(k+1) = (1 − α)⋅w(k) + µ⋅ũ(k)⋅Ψ'(ũ(k))⋅x(k)

Here the error e is replaced with the internal state u, and a small constant α is added to prevent the weights from growing beyond limits. Sometimes the constraint on w is instead chosen as a fixed length of unity, that is ‖w‖₂² = 1. In this case one wishes to minimize the absolute value of the loss function.

Observe that this is a local update rule, that is, no information has to be passed to other interconnected neurons.
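A sketch of this local unsupervised update as stated above (the unit-slope default for Ψ' and all names are illustrative assumptions):

```python
import numpy as np

def potential_learning_step(w, x, mu, alpha, psi_prime=lambda u: 1.0):
    """Leaky local update: w(k+1) = (1 - alpha)*w(k) + mu * u * psi'(u) * x(k)."""
    u = np.dot(w, x)                     # internal state replaces the error
    return (1.0 - alpha) * w + mu * u * psi_prime(u) * x
```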

3.2 Feedforward Multilayer Neural Network

A feedforward neural network can consist of an arbitrary number of layers, but in practice there will only be one, two or three layers. It has been shown theoretically that it is sufficient to use a maximum of three layers to solve an arbitrarily complex pattern classification problem [1]. The layers are ordered in series and the last layer is called the output layer. The preceding layers are


called hidden layers with indices starting at one in the layer closest to the input vector.

Sometimes the input vector is called the input layer.

Here a three-layer perceptron will be regarded. Every layer can contain neurons with any transfer function, with or without a bias. The number of neurons in the hidden layers is determined by the complexity of the problem. The number of neurons in the output layer is determined by the type of problem to be solved, since it equals the number of output signals from the network.

In this approach the transfer functions used are the linear, bsig and usig functions, intermixed arbitrarily in each layer. All neurons will have a bias in order to improve the generalization.

Using the compact notation (as in section 2.2)

y = Ψ(Σ_{i=0}^{n} w_i x_i) , where w_0 = Θ and x_0 = 1

and arranging the neurons as a column, the processing of a whole layer can be described in matrix notation. Define a matrix W[1] for the first layer, where the j:th row contains the weights of the j:th neuron, and the transfer function Ψ[1] as a multivariable function

Ψ[1] : ℝⁿ → ℝⁿ , Ψ[1] = [Ψ_1[1] Ψ_2[1] ... Ψ_n[1]]^T

The components of Ψ[1] are arranged according to the chosen transfer functions.

The first layer's processing can then be described as

o[1] = Ψ[1](W[1] x) , where o[1] is the first layer's output (see fig 3.1)

In a similar way the other layers can be described, with the input signals of each layer being the output values of the preceding layer. The total processing of the feedforward network is then

y(x) = Ψ[3](W[3] ⋅ Ψ[2](W[2] ⋅ Ψ[1](W[1] x)))

Figure 3.1 shows the configuration of a three-layer feedforward neural network.

Fig 3.1 A Feedforward Multilayer Neural Network (3-layer), [7]
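The total processing above translates directly into code. A minimal sketch of the forward pass (layer sizes, the choice of activations, and all names are illustrative assumptions; biases are omitted here as in fig 4.2, and can be handled by augmenting each layer input with a constant 1 as in section 2.2):

```python
import numpy as np

def forward(x, W1, W2, W3, psi1, psi2, psi3):
    """y(x) = psi3(W3 @ psi2(W2 @ psi1(W1 @ x)))."""
    o1 = psi1(W1 @ x)        # first hidden layer output o[1]
    o2 = psi2(W2 @ o1)       # second hidden layer output o[2]
    return psi3(W3 @ o2)     # network output y

rng = np.random.default_rng(0)
n0, n1, n2, n3 = 4, 5, 3, 2  # input size and neurons per layer
W1 = rng.standard_normal((n1, n0))
W2 = rng.standard_normal((n2, n1))
W3 = rng.standard_normal((n3, n2))
y = forward(rng.standard_normal(n0), W1, W2, W3,
            np.tanh, np.tanh, lambda u: u)   # bsig, bsig, linear
print(y)  # two output signals
```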


Here all neurons in every layer are connected to all feeding sources (inputs). This approach is called a fully connected neural network. If some of the connections are removed, the network becomes a reduced-connection network. The way in which one chooses the connections in the latter network is not in any way trivial. The aim is to maintain the maximum information flow through the network, that is, only to remove connections that are redundant.

This of course varies with the application. In speech enhancement systems, for example, one could try to make use of the fact that speech is quite correlated. This could perhaps make it possible to remove some of the connections without any major loss in performance.

In this thesis a general approach is made with a fully connected network.


4. Steepest Descent Backpropagation Learning Algorithm

4.1 Learning of a Single Perceptron

The derivation of the learning rule for the single perceptron will be done with bias and with a general transfer function. The perceptron will be denoted with index j for straightforward incorporation into the network later, see figure 4.1.

Consider a general transfer function Ψ(⋅) with derivative Ψ'(⋅):

u_j = Σ_{i=0}^{n} w_ji x_i , where w_j0 = Θ, x_0 = 1 and y_j = Ψ(u_j)

We wish to minimize the instantaneous squared error of the output signal:

J_j = (1/2) e_j²(t) = (1/2)(d_j − y_j)²

The steepest descent approach gives the following differential equations

dw_ji/dt = −η ∂J_j/∂w_ji = −η (∂J_j/∂e_j)(∂e_j/∂w_ji) = −η (∂J_j/∂e_j)(∂e_j/∂u_j)(∂u_j/∂w_ji)

Since

∂J_j/∂e_j = e_j
e_j = d_j − Ψ(u_j) ⇒ ∂e_j/∂u_j = −Ψ'(u_j)
u_j = Σ_i w_ji x_i ⇒ ∂u_j/∂w_ji = x_i

the update rule becomes

dw_ji/dt = η ⋅ e_j ⋅ Ψ'(u_j) ⋅ x_i

If we define a learning signal δ_j as

δ_j = e_j ⋅ Ψ'(u_j)

we can write the discrete vector update form as

w_j(k+1) = w_j(k) + µ ⋅ δ_j ⋅ x(k)

where k is the iteration index and µ is the learning parameter. This last formula is represented by the box called Adaptive algorithm in figure 4.1.

Fig 4.1 Learning of a single perceptron, [7]

If the activation function bsig derived in section 2.5 is used, the update rule becomes

w_j(k+1) = w_j(k) + µ ⋅ γ_j ⋅ e_j(k) ⋅ (1 − y_j²(k)) ⋅ x(k)

and for the usig activation function the update rule becomes (see section 2.5)

w_j(k+1) = w_j(k) + µ ⋅ γ_j ⋅ e_j(k) ⋅ y_j(k) ⋅ (1 − y_j(k)) ⋅ x(k)

For the purely linear function the rule becomes the well-known LMS algorithm (section 2.3):

w_j(k+1) = w_j(k) + µ ⋅ γ_j ⋅ e_j(k) ⋅ x(k)

As stated before, the slope γ in LMS is often chosen as unity, since it only affects the step size of the rule.
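The three update rules can be collected into one on-line step. A sketch (the dispatch on a string name and all identifiers are illustrative assumptions):

```python
import numpy as np

def delta_rule_step(w, x, d, mu, gamma=1.0, act="bsig"):
    """One on-line update of a single neuron; x[0] = 1 carries the bias."""
    u = np.dot(w, x)
    if act == "bsig":
        y = np.tanh(gamma * u)
        dpsi = gamma * (1.0 - y ** 2)          # Psi'(u) for bsig
    elif act == "usig":
        y = 1.0 / (1.0 + np.exp(-gamma * u))
        dpsi = gamma * y * (1.0 - y)           # Psi'(u) for usig
    else:                                      # linear: reduces to LMS
        y = gamma * u
        dpsi = gamma
    e = d - y
    return w + mu * e * dpsi * x               # w(k+1) = w(k) + mu*delta_j*x(k)
```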

4.2 Learning of a Multilayer Perceptron, MLP

Here a 3-layer perceptron will be derived; fewer layers can also be used. The network structure is according to fig 4.2. The structure is built up of several single layer perceptrons where the learning signals are modified according to the rules described below. In this figure the biases are left out.

The extension to more layers is straightforward. The network behavior is determined on the basis of input/output vector pairs, where the sizes of these vectors can be chosen in any way. The network will have the same number of neurons in the third layer as the number of desired signals. A set of input/output pairs will be used to train the network. The presented pair will be denoted with index k. Each learning pair is composed of n_0 input signals x_i (i = 1, 2, ..., n_0) and n_3 corresponding desired output signals d_j (j = 1, 2, ..., n_3). In the first hidden layer there will be n_1 neurons and in the second hidden layer there will be n_2 neurons.

The number of neurons in the hidden layers can be chosen arbitrarily and represents the complexity of the problem. It is a delicate task to choose this complexity for specific applications, and often a trial-and-error approach is used. There are some ways to automatically increase or decrease the number of neurons in hidden layers during learning. This approach is called pruning. Sometimes this is done by looking at each neuron's output state during the learning cycles; if it is similar to another neuron's state (or to it with opposite sign), then one of them can be removed. Levels for how much they may vary are proposed in [2]. Too many neurons in the hidden layers decrease the generalization of the network, and too few neurons will not solve the task.

The learning of the MLP consists in adjusting all the weights in order to minimize the error measure between the desired and the actual outcome. An initial set of weights must be chosen.

Fig 4.2 Learning of a Multiple Layer Perceptron, MLP (3-layer), [7]

If multiple outputs is chosen then the sum of all errors is to be minimized for all input/output- pairs, that is

min{Jk} Jk (djk yjk) (e )

j n

jk j n

, = − =

= =

∑ ∑

1 2

1 2

2

1

2

1

3 3

, ∀k Here the L2-norm is used, in section 5.2 other error measures is discussed.

This function is called the local error function since this error is to regard of the instantaneous error for the k:th input/output pair.

The global error function (also called the performance function) is the sum of the above function over all training data.

If the task to be solved is stationary, the latter function is to be minimized; if, on the other hand, adaptation during the learning patterns is desired, then the local error function must be minimized. This can only be done in an on-line update system, which is of concern here. A discussion of this topic will be presented in section 6.


When minimizing the local error function, the global error function is not always minimized. It has been proven that, if the learning parameter (step size) is sufficiently small, the local minimization will lead to a global minimization [3].

When deriving the backpropagation algorithm, the differential equations are obtained by the gradient method for minimization of the local error function, that is

dw_ji[s]/dt = −η ∂J_k/∂w_ji[s] , η > 0

(Here index [s] stands for layer s, j stands for neuron j, i stands for the i:th weight and k for the k:th input/output pattern.)

First the update rules for the output layer will be derived (s = 3). Using the steepest descent approach gives

dw_ji[3]/dt = −η ∂J_k/∂w_ji[3] = −η (∂J_k/∂u_j[3])(∂u_j[3]/∂w_ji[3])

since

u_j[3] = Σ_{i=1}^{n_2} w_ji[3] o_i[2]

where o_i[2] is the input vector to layer 3, the same as the output vector from layer 2; see fig 4.2 and compare with the single perceptron.

This becomes

dw_ji[3]/dt = −η (∂J_k/∂u_j[3]) o_i[2] = −η (∂J_k/∂e_jk)(∂e_jk/∂u_j[3]) o_i[2] = η ⋅ e_jk ⋅ Ψ'_j(u_j[3]) ⋅ o_i[2]

If we define the local error δ_j[3] as

δ_j[3] = −∂J_k/∂u_j[3] = e_jk Ψ'_j(u_j[3]) = (d_jk − y_jk) Ψ'_j(u_j[3])

this can be written as

dw_ji[3]/dt = η ⋅ δ_j[3] ⋅ o_i[2]

and the discrete vector update rule becomes

w_j[3](k+1) = w_j[3](k) + µ ⋅ e_j(k) ⋅ Ψ'_j(u_j[3](k)) ⋅ o[2](k)    [Output layer, neuron j]

where index k stands for the k:th iteration of input/output patterns and index j is the j:th neuron in the output layer. Here the stepsize µ differs from the learning parameter η in the continuous case; generally it must be smaller to ensure convergence.

The gradient above depends on the transfer function chosen. In this approach the transfer functions can be intermixed in any order, as mentioned earlier; see section 2.5.


For the second hidden layer the error is not directly accessible, so the derivatives must be taken with respect to quantities already calculated and others that can be evaluated.

This still holds:

dw_ji[2]/dt = −η ∂J_k/∂w_ji[2] = −η (∂J_k/∂u_j[2])(∂u_j[2]/∂w_ji[2]) = η ⋅ δ_j[2] ⋅ o_i[1]

The difference from the output layer lies in the local error δ_j[2]. As before, the local error is defined as

δ_j[2] = −∂J_k/∂u_j[2]

and this gives (using the chain rule)

δ_j[2] = −(∂J_k/∂o_j[2])(∂o_j[2]/∂u_j[2])

Since

o_j[2] = Ψ_j[2](u_j[2])

we have

δ_j[2] = −(∂J_k/∂o_j[2]) Ψ'_j[2](u_j[2])

The second factor on the right is the derivative of the transfer function used for the j:th neuron.

The reason why this algorithm is called backpropagation is seen when calculating −∂J_k/∂o_j[2], since information from the output layer update is used here to update the second hidden layer:

−∂J_k/∂o_j[2] = −Σ_{i=1}^{n_3} (∂J_k/∂u_i[3])(∂u_i[3]/∂o_j[2]) = −Σ_{i=1}^{n_3} (∂J_k/∂u_i[3]) ∂/∂o_j[2] (Σ_{p=1}^{n_2} w_ip[3] o_p[2]) = Σ_{i=1}^{n_3} δ_i[3] w_ij[3]

This gives the local error for the second hidden layer as

δ_j[2] = Ψ'_j[2](u_j[2]) Σ_{i=1}^{n_3} δ_i[3] w_ij[3]

In vector notation this becomes

w_j[2](k+1) = w_j[2](k) + µ ⋅ Ψ'_j[2](u_j[2](k)) ⋅ (Σ_{i=1}^{n_3} δ_i[3] w_ij[3](k)) ⋅ o[1](k)    [Second hidden layer, neuron j]

Here w_ij[3] means neuron i, weight j, in the output layer.

The information that is backpropagated consists of the local errors and the updated weights from layer three.

In a similar way the update rule for the first hidden layer is derived, now with respect to the second hidden layer's local errors and weights. This gives the vector update rule for the first hidden layer as

w_j[1](k+1) = w_j[1](k) + µ ⋅ Ψ'_j[1](u_j[1](k)) ⋅ (Σ_{i=1}^{n_2} δ_i[2] w_ij[2](k)) ⋅ x(k)    [First hidden layer, neuron j]

Here w_ij[2] means neuron i, weight j, in the second hidden layer. The extension to more hidden layers is straightforward.
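Putting the three layer rules together gives one complete on-line training step. The following sketch implements the updates derived above for a 3-layer network (the layer sizes, the bias handling via a prepended constant 1, and all names are illustrative assumptions; note that, as stated above, the already-updated weights are the ones backpropagated):

```python
import numpy as np

def act(u, kind, gamma=1.0):
    if kind == "linear":
        return gamma * u
    if kind == "bsig":
        return np.tanh(gamma * u)
    return 1.0 / (1.0 + np.exp(-gamma * u))      # "usig"

def act_prime(y, kind, gamma=1.0):
    """Derivatives expressed through the layer output y (section 2.5)."""
    if kind == "linear":
        return gamma * np.ones_like(y)
    if kind == "bsig":
        return gamma * (1.0 - y ** 2)
    return gamma * y * (1.0 - y)                 # "usig"

def online_backprop_step(x, d, Ws, kinds, mu):
    """One on-line update of a 3-layer feedforward network.
    Ws = [W1, W2, W3]; a constant 1 is prepended to every layer input,
    so the first column of each weight matrix acts as the bias."""
    ins, outs = [], []
    o = np.asarray(x, dtype=float)
    for W, kind in zip(Ws, kinds):
        ob = np.concatenate(([1.0], o))          # x0 = 1 carries the bias
        ins.append(ob)
        o = act(W @ ob, kind)
        outs.append(o)
    e = d - outs[-1]
    delta = e * act_prime(outs[-1], kinds[-1])   # output layer local error
    for s in (2, 1, 0):                          # output layer first
        Ws[s] = Ws[s] + mu * np.outer(delta, ins[s])
        if s > 0:
            back = Ws[s][:, 1:].T @ delta        # bias column carries no delta
            delta = back * act_prime(outs[s - 1], kinds[s - 1])
    return Ws, e

# Example: learn y = sin(x) on-line (compare section 7.1).
rng = np.random.default_rng(0)
Ws = [0.5 * rng.standard_normal((4, 2)),         # n1 = 4 hidden neurons
      0.5 * rng.standard_normal((3, 5)),         # n2 = 3 hidden neurons
      0.5 * rng.standard_normal((1, 4))]         # n3 = 1 output neuron
kinds = ("bsig", "bsig", "linear")
for _ in range(20000):
    x = rng.uniform(-np.pi, np.pi, size=1)
    Ws, e = online_backprop_step(x, np.sin(x), Ws, kinds, mu=0.05)
print(abs(e))                                    # typically small once trained
```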


5. Additional features for improving Convergence speed and Generalization

5.1 Algorithm with Momentum Updating

The standard backpropagation algorithm has some drawbacks: the learning parameter (stepsize in the discrete case) should be chosen small in order to provide minimization of the global error function, but a small learning parameter slows down the learning process. In the discrete case the stepsize must also be kept small to ensure convergence. A large learning parameter is desirable for rapid convergence and for minimizing the risk of getting trapped in local minima or on very flat plateaus of the error surface.

One way to improve the backpropagation algorithm is to smooth the weight updates by over-relaxation. This is done by adding a fraction of the previous weight update to the actual weight update. The fraction is called the momentum term and the update rule is modified to

∆w_ji[s](k) = η ⋅ δ_j[s] ⋅ o_i[s−1] + α ⋅ ∆w_ji[s](k−1) , (s = 1, 2, 3)

where α is a parameter which controls the amount of momentum (0 ≤ α < 1). This is done for each layer [s] separately.

The momentum concept will increase the speed of convergence and at the same time improve the steady state performance of the algorithm.

If we are on a plateau of the error surface, where the gradient is approximately the same for two consecutive steps, the effective step size will be

η_eff = η / (1 − α)

since

∆w_ji[s](k) = η ⋅ (−∂J_k/∂w_ji[s]) + α ⋅ ∆w_ji[s](k−1) ≅ −(η / (1 − α)) ⋅ ∂J_k/∂w_ji[s]

If we are at a local or global minimum, the momentum term will have the opposite sign to the local error and thus decrease the effective step size.

The result is that the learning rate is increased without magnifying the parasitic oscillations around minima.
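A sketch of the momentum-augmented weight change for one layer (names are illustrative; the previous change dW_prev must be kept between iterations):

```python
import numpy as np

def momentum_step(W, dW_prev, delta, o_in, eta, alpha):
    """dW(k) = eta * outer(delta, o_in) + alpha * dW(k-1); 0 <= alpha < 1."""
    dW = eta * np.outer(delta, o_in) + alpha * dW_prev
    return W + dW, dW   # keep dW for the next iteration
```

On a plateau, where `delta` barely changes between calls, the accumulated change approaches `eta / (1 - alpha)` times the plain gradient step, matching the effective step size above.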

5.2 Algorithm with Non-Euclidean Error Signals

So far the optimization criterion has been to minimize the local/global error surface on the basis of the least-squares error: the Euclidean or L2-norm. When the output layer is large, and/or the input signals are contaminated with non-Gaussian noise (especially wild spiky noise), other error measures are needed in order to improve the learning of the network. In this general approach the error measure will be derived for the L1-norm and for a continuous selection of norms between the Euclidean and the L1-norm. The norms will be approximations with minor differences from the true norms.

Consider the performance (error) function defined as

J_k = Σ_{j=1}^{n_3} σ(e_jk) , where e_jk = d_jk − y_jk as before.


n_3 is the number of neurons in the output layer and σ is a function (typically convex) called the loss function. When σ(e) = e²/2 the L2-norm is obtained, and when σ(e) = |e| the L1-norm is obtained.

There are many different loss functions proposed for a variety of applications. A general loss function which easily gives freedom in choosing the shape of the function is the logistic function

σ(e) = β² ⋅ ln(cosh(e/β))

Figure 5.1 shows the logistic function for different selections of β.

Fig 5.1 The logistic function for β = 1.0, 1.3, 1.7, 5.0 and 1000

When β is close to one the function approximates the absolute (L1) norm, and when β is large it approximates the Euclidean (L2) norm.

The derivative of this loss function must be calculated in order to employ it in the network:

∂J_k/∂y_jk = (∂J_k/∂e_jk)(∂e_jk/∂y_jk) = −∂J_k/∂e_jk , where e_jk = d_jk − y_jk , ∀j (j = 1, 2, ..., n_3)

Introducing

h(g) = ln(g) ⇒ ∂h/∂g = 1/g
g(y) = cosh(y) ⇒ ∂g/∂y = (e^y − e^(−y))/2 = sinh(y)
y(e) = e/β ⇒ ∂y/∂e = 1/β

and using the chain rule gives

∂σ(e)/∂e = β² ⋅ (1/cosh(e/β)) ⋅ sinh(e/β) ⋅ (1/β) = β ⋅ tanh(e/β)
