
IT 17 073

Degree project, 30 credits

October 2017

ANN for Optimization on Large-Scale

Structural Acoustics Models

Desislava Stoyanova

Department of Information Technology


Faculty of Science and Technology, UTH unit

Visiting address: Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0

Postal address: Box 536, 751 21 Uppsala

Telephone: 018 – 471 30 03

Fax: 018 – 471 30 00

Website: http://www.teknat.uu.se/student

Abstract

ANN for Optimization on Large-Scale Structural

Acoustics Models

Desislava Stoyanova

Optimization of transformer design is a challenging task: based on a given specification, the dimensions of all the transformer parts must be defined so as to achieve better operating performance. The mechanical force distributions that result from transformer operation make the structure vibrate at twice the network frequency and ultimately lead to noise emission from the outer surface of the tank. In this paper, an artificial intelligence technique is proposed for transformer noise data prediction, as an optimized alternative to the finite-element method with multi-physics capabilities.

The technique uses a feedforward artificial neural network and the backpropagation of error learning rule for predicting the noise data, along with a finite-element model for computing a training data set. The method considers two well-known backpropagation algorithms, Levenberg–Marquardt and Bayesian Regularization; while both of them appear to be extremely efficient in terms of execution time, Bayesian Regularization delivers considerably higher accuracy. The level of accuracy, together with the fast execution time, makes the application of artificial neural networks for finite-element model optimization a viable and efficient approach for industrial use.

Printed by: Reprocentralen ITC, IT 17 073

Examiner: Mats Daniels

Subject reader: Maya Neytcheva

Supervisor: Anders Daneryd


Acknowledgement

I would first like to thank my thesis supervisor Anders Daneryd of the Power Devices Department at ABB Corporate Research Center. The door of his office was always open whenever I ran into a trouble spot or had a question about my research or writing. He consistently allowed this paper to be my own work and inspired me to do what I am extremely passionate about, but steered me in the right direction whenever he thought I needed it.

I would also like to acknowledge Prof. Maya Neytcheva of the Department of Information Technology, Division of Scientific Computing at Uppsala University as the second reader of this thesis, and I am gratefully indebted to her for her very valuable comments on this thesis.

I must also express my very profound gratitude to Camila Medina and Monika Miljanović for providing me with unfailing support and continuous encouragement throughout the process of researching and writing this thesis. Thank you for the breakfasts, coffee breaks and advice; you were always there with a word of encouragement or a listening ear. You should know that your support and encouragement were worth more than I can express on paper. This accomplishment would not have been possible without you.

Finally, I would like to thank my family, especially my grandparents, for supporting me throughout the last couple of years, financially, practically and morally. Mum, you knew it would be a long and sometimes bumpy road, but you encouraged and supported me along the way.

To dad who was often in my thoughts on this journey – you are missed.

Thank you!

Desislava Stoyanova Västerås, June 2017


By far, the greatest danger of Artificial Intelligence is that people conclude too early that they understand it.

— Eliezer Yudkowsky (Writer)


Contents

1 Introduction
1.1 Motivation and Problem Statement

2 Artificial Neural Networks
2.1 Components of the Neural Network
2.1.1 Neuron Functions
2.2 Neural Network Topologies
2.2.1 Feedforward Neural Networks
2.2.2 Recurrent Neural Networks
2.2.3 Completely Linked Networks
2.2.4 The bias Neuron
2.2.5 Number of Layers and Neurons
2.2.6 Setting Weights
2.2.7 Pruning
2.3 Order in which the neuron activations are calculated
2.3.1 Synchronous Activation
2.3.2 Asynchronous Activation
2.4 Communication with the world outside the network
2.5 Learning and training samples
2.5.1 Paradigms of Learning
2.5.2 Training patterns and teaching input
2.5.3 Learning curve and error measurement
2.5.4 Gradient Optimization Procedures
2.5.5 The Hebbian Learning Rule
2.6 The perceptron, backpropagation and its variants
2.6.1 Single Layer Perceptron (SLP)
2.6.2 Multi Layer Perceptron (MLP)
2.6.3 Backpropagation (BP) of error learning rule
2.6.4 Resilient BP of error
2.6.5 Extensions of BP besides the Resilient BP
2.7 Radial Basis Functions
2.7.1 Components and structure of an RBF network
2.7.2 Processing of information in the RBF neurons
2.7.3 Training an RBF network
2.7.4 Comparison between MLPs and RBF networks
2.8 Recurrent perceptron-like networks
2.8.1 Jordan networks
2.8.2 Elman networks
2.8.3 Training recurrent perceptron-like networks
2.9 Self-organizing feature maps
2.9.1 Structure of a SOM
2.9.2 Training of a self-organizing map

3 Related Work
3.1 Top-oil temperature prediction with neural networks
3.1.1 Feedforward neural network
3.1.2 Elman recurrent neural network
3.1.3 Performance, results and discussion
3.2 Neural network usage in structural crack detection

4 3DoF Model of Vibrating System
4.1 Description of the system
4.2 Equations of motion
4.2.1 Matrix form of equations of motion
4.3 Train an artificial neural network
4.3.1 Description of the training procedures
4.3.2 Performance of the training procedures
4.4 Further notes

5 Multimillion DoF Model
5.1 Description of the data set
5.2 ANN Training Results
5.3 Reducing the size of the data set
5.4 Stressing the data set
5.5 Further notes

6 Conclusion
6.1 Future Work

Bibliography

A Tables

B Figures
B.1 Related work
B.2 ANN training results using trainlm
B.3 ANN training performance after stressing the data set values by 40%
B.4 ANN training performance after stressing the data set values by 50%

C Source Code
C.1 Generate a data set and train a feedforward neural network for 3 DoF model of vibrating system
C.2 Calculate the data for the error histograms and put that data into excel files
C.3 ANN training for multimillion DoF model of vibrating system
C.4 Identify the points to be excluded from the initial data set
C.5 ANN training using a reduced by 20% data set


List of Figures

1.1 Schematic presentation of power transmission from source to domestic consumption
1.2 A typical three-phase unit (to the left) and a one-phase 70 MVA unit (to the right)
2.1 A simple feedforward neural network
2.2 Visualization of the gradient descent [Kri07]
2.3 Possible errors during a gradient descent [Kri07]
2.4 Single Layer Perceptron
2.5 2-3-2-1 Multi Layer Perceptron
2.6 Generalized delta rule
2.7 Illustration of a Jordan recurrent perceptron-like network
2.8 Illustration of an Elman recurrent perceptron-like network
2.9 Example distances of a one- and two-dimensional SOM topology
4.1 Graphical representation of the system
4.2 trainbr predictions vs. trainlm predictions for m1, m2 and m3
4.3 trainbr regression plots
4.4 trainlm regression plots
4.5 Error histogram for m1
4.6 Error histogram for m2
4.7 Error histogram for m3
5.1 Transformer winding block referring to the one-phase unit in Figure 1.2
5.2 trainbr predictions for Power (W) at a frequency of 80 Hz
5.3 trainbr predictions for Power (W) at a frequency of 90 Hz
5.4 trainbr predictions for Power (W) at a frequency of 100 Hz
5.5 trainbr predictions for Power (W) at a frequency of 110 Hz
5.6 trainbr predictions for Power (W) at a frequency of 120 Hz
5.7 trainbr predictions for Power (W) at a frequency of 130 Hz
5.8 trainbr predictions for Power (W) at a frequency of 140 Hz
5.9 trainbr predictions for Power (W) at a frequency of 150 Hz
5.10 trainbr predictions for Power (W) at a frequency of 160 Hz
5.11 trainbr regression plots
5.12 trainbr predictions for Power (W) at a frequency of 80 Hz; the training set is reduced by 20%
5.13 trainbr predictions for Power (W) at a frequency of 90 Hz; the training set is reduced by 20%
5.14 trainbr predictions for Power (W) at a frequency of 100 Hz; the training set is reduced by 20%
5.15 trainbr predictions for Power (W) at a frequency of 110 Hz; the training set is reduced by 20%
5.16 trainbr predictions for Power (W) at a frequency of 120 Hz; the training set is reduced by 20%
5.17 trainbr predictions for Power (W) at a frequency of 130 Hz; the training set is reduced by 20%
5.18 trainbr predictions for Power (W) at a frequency of 140 Hz; the training set is reduced by 20%
5.19 trainbr predictions for Power (W) at a frequency of 150 Hz; the training set is reduced by 20%
5.20 trainbr predictions for Power (W) at a frequency of 160 Hz; the training set is reduced by 20%
5.21 trainbr regression plots; the training set is reduced by 20%
B.1 One hidden layer feedforward neural network [Vil11]
B.2 Two hidden layer feedforward neural network [Vil11]
B.3 Elman recurrent neural network [Vil11]
B.4 Performance with regard to training time in seconds [Vil11]
B.5 Performance with regard to average deviation from the measured value [Vil11]
B.6 Elman network's performance with regard to average deviation from the measured value [Vil11]
B.7 Geometry of a cracked solid beam [Moh+15]
B.8 Comparison between ANSYS and ANN results for the three natural frequencies (crack depth) [Moh+15]
B.9 Comparison between ANSYS and ANN results for the three natural frequencies (crack location) [Moh+15]
B.10 trainlm predictions for Power (W) at a frequency of 80 Hz
B.11 trainlm predictions for Power (W) at a frequency of 90 Hz
B.12 trainlm predictions for Power (W) at a frequency of 100 Hz
B.13 trainlm predictions for Power (W) at a frequency of 110 Hz
B.14 trainlm predictions for Power (W) at a frequency of 120 Hz
B.15 trainlm predictions for Power (W) at a frequency of 130 Hz
B.16 trainlm predictions for Power (W) at a frequency of 140 Hz
B.17 trainlm predictions for Power (W) at a frequency of 150 Hz
B.18 trainlm predictions for Power (W) at a frequency of 160 Hz
B.19 trainlm regression plots
B.20 trainbr predictions for Power (W) at a frequency of 110 Hz; stressed values by 40%
B.21 trainlm predictions for Power (W) at a frequency of 110 Hz; stressed values by 40%
B.22 trainbr regression plots
B.23 trainlm regression plots
B.24 trainbr predictions for Power (W) at a frequency of 110 Hz; stressed values by 50%
B.25 trainlm predictions for Power (W) at a frequency of 110 Hz; stressed values by 50%
B.26 trainbr regression plots
B.27 trainlm regression plots


List of Tables

A.1 Characteristics of the investigated power transformers
A.2 States of pumps and fans in operation of Transformer 3
A.3 The best-performing models
A.4 The worst-performing models
A.5 The best-performing models of a neural network that contains two hidden layers
A.6 The worst-performing models of a neural network that contains two hidden layers
A.7 The best-performing models of the Elman neural network
A.8 trainlm function's parameters
A.9 trainbr function's parameters


1 Introduction

Chapter 1 is an introduction to this study. More specifically, it gives a background to this paper, covering power transformer modelling using the Finite Element Method (FEM) and Artificial Neural Networks (ANNs) as an optimization tool for such models.

Power transformers are electrical devices that change, or transform, voltage levels between two circuits. In this process, the power transferred between the circuits is unchanged, except for a typically small loss. This power transfer occurs only under alternating current (AC) or transient electrical conditions. Transformer operation is based on the principle of induction discovered by Faraday in 1831, and the first practical transformer was invented by three Hungarian engineers in 1885. The transformers used in power systems range in size from small units attached to the tops of telephone poles to units as large as a small house, weighing hundreds of tons.

In practice, long-distance power transmission is carried out with voltages of 100–500 kV, and more recently, even up to 1200 kV. These high voltages are, however, not safe for usage in households or factories. Therefore, transformers are necessary to convert these high voltages to lower levels at the receiving point. In addition, generators are designed to produce electrical power at voltage levels of approximately 10–40 kV for practical reasons such as cost and efficiency. Thus, transformers are also necessary at the sending point of the line to boost the generator voltage up to the required transmission levels, as schematically shown in Figure 1.1.

Fig. 1.1.: Schematic presentation of power transmission from source to domestic consumption


A typical three-phase power transformer is shown on the left side of Figure 1.2. The main bodies constituting a transformer are the core, coils (or windings), oil, tank, and auxiliary components for connections, cabling, etc. A one-phase unit before being immersed into an oil-filled tank is shown on the right side of Figure 1.2.

Fig. 1.2.: A typical three-phase unit (to the left) and a one-phase 70 MVA unit (to the right)

During operation, the interaction between the currents, the magnetic fields and the mechanical structure gives rise to various mechanical force distributions. These forces make the structure vibrate, mainly at twice the network frequency, and quite complex vibration transmission mechanisms ultimately lead to noise emission from the outer surface of the tank. Standard practice for noise prediction is to use the Finite Element Method (FEM) with multi-physics capabilities. This inevitably leads to time-consuming analyses which generally cannot be managed in the everyday transformer design work.

1.1 Motivation and Problem Statement

Artificial Neural Networks (ANNs) are a data mining tool mainly used for classification and clustering of data. One of the main challenges for such networks is building a machine that can mimic brain activities and, most of all, has the ability to learn. A neural network learns from examples, and if provided with enough examples, it can successfully classify data and even discover new trends and patterns. Moreover, ANNs resemble successfully working biological systems, which consist of numerous nerve cells that work massively in parallel and have the ability to learn. One result of the ability to learn is that the neural network is capable of associating and generalizing data. In other words, the neural network


can give reasonable solutions to similar problems of the same class that have not been trained explicitly. This leads to a high degree of fault tolerance with respect to noisy input data. With all this in mind, applying ANNs for FE model optimization can be considered a viable approach.

Frequently, large-scale FE models have to be run repeatedly in, for example, parametric sensitivity, propagation of uncertainty and/or optimization analyses. Even though parallel execution has been available for decades for certain classes of FE solvers, the models are most commonly executed in a sequential manner, and either mode of execution is time-consuming; in some industrial applications, one has to resort to model reduction or reduced modelling. There are many model reduction alternatives, but this thesis work is focused on ANN function approximation applied to FE models for structural acoustics applications. Moreover, the work includes both theoretical work and the implementation of a proof-of-concept for the complete chain, from computing the FE model, through generating a training data set, to training the ANN and using it for prediction.

Furthermore, there are several paradigms of learning when it comes to ANN training, and all of them are covered in the theoretical part of this study. However, the actual implementation is entirely based on the supervised learning paradigm, where the learning process requires a set of input data patterns and a set of corresponding output data patterns. The performance of the network is measured by the deviation of the network's predictions from the expected output. For this reason, once the FE model has been computed, the results are used to form a data set for the ANN training.

On the other hand, the topology of the network is just as important as the learning paradigm. Several neural network topologies have already been invented and are ready to use, but most of them are either not relevant to the purpose of this study or too complicated for the problem at hand. Considering that, the multilayer perceptron along with the backpropagation learning rule has been chosen for this thesis work. More specifically, the two most common backpropagation algorithms for function approximation have been investigated: Levenberg–Marquardt and Bayesian Regularization, implemented in MATLAB as trainlm and trainbr, respectively.
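As a minimal illustration of how such a setup looks in MATLAB's Neural Network Toolbox, the sketch below builds a feedforward network and trains it with one of the two algorithms; the hidden-layer size and the toy data are illustrative assumptions, not the configuration used in this thesis.

    % Toy sketch: feedforward ANN trained with Bayesian Regularization
    % (trainbr) or Levenberg-Marquardt (trainlm). Data is synthetic.
    x = rand(3, 200);                    % 200 input patterns, 3 features each
    t = sum(x.^2, 1);                    % toy target values
    net = feedforwardnet(10, 'trainbr'); % 10 hidden neurons; 'trainlm' also works
    net = train(net, x, t);              % supervised training on (input, target)
    y = net(x);                          % network predictions
    err = perform(net, t, y);            % mean squared error performance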

After the theoretical part of this study, the training is first performed on a relatively small data set generated for a simple model of a vibrating system with only three degrees of freedom. After that, it is applied to the actual model provided by ABB Corporate Research, which contains approximately two million degrees of freedom. The performance of the two backpropagation algorithms is investigated with respect to execution time, but mainly with respect to the level of accuracy. Last but not least, a suggestion is given on how to reduce the data set generated from the computation of the FE model, since a full factorial experiment has been used. Such


experiments have a design that consists of two or more factors, each with discrete possible values or "levels", and their experimental units take on all possible combinations of these levels across all factors, which leads to considerably large data sets.
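As a small sketch of what a full factorial design means in practice, the MATLAB snippet below enumerates every combination of a few assumed factor levels; the factors and levels are purely illustrative and not those of the actual FE experiment.

    % Full factorial design: every combination of the factor levels.
    freq = 80:10:160;                 % factor 1: frequency levels (Hz)
    ampl = [0.5 1.0 1.5];             % factor 2: amplitude levels (toy)
    damp = [0.01 0.02];               % factor 3: damping levels (toy)
    [F, A, D] = ndgrid(freq, ampl, damp);
    design = [F(:), A(:), D(:)];      % one row per experimental unit
    nRuns  = size(design, 1);         % 9 * 3 * 2 = 54 runs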


2 Artificial Neural Networks

Chapter 2 covers the fundamentals of Artificial Neural Networks. The components of such networks are presented, as well as the different topologies that a network might implement. In the second part of the chapter, different training paradigms and procedures are considered, with the focus on the backpropagation of error learning rule.

Technically, the main characteristics that a neural network tries to adopt from a biological system are:

1. Self-organization and learning capability
2. Generalization and data summarizing capability
3. Fault tolerance

However, it is usually difficult to tell what a neural network knows, how it performs and where its faults lie. That is one of the main disadvantages of neural networks, since such analysis is usually easier for a conventional algorithm. [Kri07]

2.1 Components of the Neural Network

Technically, a neural network consists of two main components. Firstly, such a network has simple processing units, also called neurons. Secondly, there are connections between these neurons that are directed and weighted, where the weight of a single connection can be associated with its strength.

A neural network is a sorted triple $(N, V, w)$, where $N$ is a set of neurons, $V$ is a set of connections such that $V = \{(i, j) \mid i, j \in N\}$, where $i$ and $j$ are neurons from the neural network, and $w: V \to \mathbb{R}$ is a function which defines the weight of the connection between two neurons. For example, $w_{i,j}$ is the weight of the connection between neurons $i$ and $j$, where $i$ and $j$ are called source and target, respectively.


2.1.1 Neuron Functions

Considering neurons within a neural network, three functions must be taken into account:

1. Propagation function
2. Activation function
3. Output function

Assuming that neuron $j$ is a target neuron for other neurons, $j$'s propagation function receives the outputs $o_{i_1}, o_{i_2}, \ldots, o_{i_n}$ from the neurons $i_1, i_2, \ldots, i_n$, respectively. The result of the propagation function is called the network input for neuron $j$. Let $I = \{i_1, i_2, \ldots, i_n\}$ be a set of neurons such that $\forall z \in \{1, 2, \ldots, n\}: \exists w_{i_z,j}$. The network input for neuron $j$ is called $net_j$ and is calculated by the propagation function $f_{prop}$ as follows:

$$net_j = f_{prop}(o_{i_1}, \ldots, o_{i_n}, w_{i_1,j}, \ldots, w_{i_n,j})$$

$$net_j = \sum_{k=1}^{n} o_{i_k} \cdot w_{i_k,j}$$

Before the formal definition of an activation function in a neural network, its two main components must be defined. A threshold value for neuron $j$ is denoted $\Theta_j$; it is uniquely assigned to $j$ and marks the position of the maximum gradient value of the activation function. Also, an activation state for neuron $j$ is defined as $a_j(t)$, where $t$ stands for the current time; the activation state is also the result of the activation function. Let $j$ be a neuron. The activation function is defined as

$$a_j(t) = f_{act}(net_j(t), a_j(t-1), \Theta_j)$$

In general, the activation function transforms the network input $net_j$ as well as the previous activation state $a_j(t-1)$ into a new activation state, considering the threshold value. It is important that the neurons only get active if the network input exceeds their threshold value. Furthermore, the activation function is usually defined globally, or at least for a set of neurons, while the threshold value might be different for each neuron. The threshold value may also change over time, in which case the notation $\Theta_j(t)$ is used for the respective time $t$.

The output function of neuron $j$ calculates the values that are to be transmitted to the other neurons connected to $j$. Let $j$ be a neuron. If $a_j$ is $j$'s activation state, the output function is defined as

$$f_{out}(a_j) = o_j$$


In a nutshell, it calculates the output value $o_j$ from the neuron's activation state $a_j$. Similarly to the activation function, the output function can be defined globally. Often, the output function is simply the identity, so that $o_j = a_j$.
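As a minimal sketch of the three neuron functions, the MATLAB snippet below computes the network input, activation state and output of a single neuron $j$, assuming the weighted sum as propagation function, a logistic activation shifted by the threshold, and the identity as output function (all common choices, not mandated by the definitions above).

    % One neuron j: propagation, activation and output function.
    o_in  = [0.2; 0.7; 0.1];      % outputs o_i of the predecessor neurons
    w     = [0.5; -0.3; 0.8];     % connection weights w_{i,j}
    theta = 0.1;                  % threshold value Theta_j
    net_j = sum(o_in .* w);                   % net_j = sum_k o_{i_k} * w_{i_k,j}
    a_j   = 1 / (1 + exp(-(net_j - theta)));  % logistic activation
    o_j   = a_j;                              % identity output function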

2.2 Neural Network Topologies

First of all, the neurons within a neural network are grouped into several processing layers: one input layer, one or more hidden (processing) layers and one output layer. It is important to mention that the input layer only represents the data which is put into the network; this part of the network never changes the input data, it only transmits it to the neurons of the following layers. There, the data is modified according to the input values and the connection weights, until the network produces some output value (or a vector of values). Depending on the connections between the neurons, three different topologies can be distinguished:

1. Feedforward neural networks
   a) No shortcut connections are allowed
   b) Shortcut connections are allowed
2. Recurrent neural networks
   a) Direct recurrences
   b) Indirect recurrences
   c) Lateral recurrences
3. Completely linked networks

2.2.1 Feedforward Neural Networks

In the feedforward neural networks the layers are clearly separated. They consist of one input layer, one or more processing layers (also called hidden since they are not visible from the outside) and one output layer. Connections are only allowed from one layer to another layer of the network towards the output layer. In other words, there is no way to get backwards from the output layer to the input layer. One such neural network is presented in Figure 2.1.


Fig. 2.1.: A simple feedforward neural network

Feedforward neural networks with shortcut connections follow the topology of the standard feedforward networks, but additionally allow shortcut connections that skip one or more layers.

2.2.2 Recurrent Neural Networks

Recurrent neural networks differ from feedforward networks in that they can influence themselves. Another difference is that it is not always possible to distinguish the input, output and processing layers. There are three different kinds of recurrence within a neural network:

1. Direct recurrence - neurons can be connected to themselves, so that they strengthen themselves in order to reach their activation limit. Such a connection has a weight denoted $w_{j,j}$.

2. Indirect recurrence - such recurrences can affect the starting neuron only by making detours. These networks allow connections towards the input layer; a neuron $j$ can affect itself via a connection to a neuron $h$ (a member of another layer) and an opposite connection from $h$ back to $j$. In general, such networks allow connections to the preceding layer.

3. Lateral recurrence - a laterally recurrent network allows connections between the neurons within a layer. In such networks, a neuron inhibits its neighbours and thereby strengthens itself. It is important to mention that in these networks only the strongest neuron becomes active.

2.2.3 Completely Linked Networks

Completely linked networks permit all kinds of connections between the neurons except direct recurrence: every neuron can be connected to any other element within the network. As a result, each neuron can become an input neuron, which is the main reason why direct recurrence is not allowed and clearly defined layers no longer exist. A popular example of such networks is the self-organizing map.

2.2.4 The bias Neuron

In many cases, network neurons have a threshold value $\Theta$ that determines when the neuron becomes active; the threshold value is a parameter of the activation function. Technically, the bias neuron is a trick for representing threshold values as connection weights.

Let $j_1, \ldots, j_n$ be network neurons with corresponding threshold values $\Theta_{j_1}, \ldots, \Theta_{j_n}$. An additional neuron, the bias neuron, is integrated into the network and connected to the neurons $j_1, \ldots, j_n$. Its output value is always 1, the weights of the new connections are set to $-\Theta_{j_1}, \ldots, -\Theta_{j_n}$, respectively, and the threshold value of each neuron is set to 0. Defining the bias neuron in this way allows every weight-training algorithm to train the biases together with the connection weights. In a nutshell, the threshold value is no longer included in the activation function; it is now a parameter of the propagation function. The result is an equivalent neural network with threshold values represented as connection weights.
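The following MATLAB sketch illustrates the equivalence, assuming a logistic activation; the toy values are arbitrary.

    % Bias-neuron trick: the threshold becomes a connection weight.
    o_in = [0.2; 0.7];  w = [0.5; -0.3];  theta = 0.1;
    a1 = 1 / (1 + exp(-(sum(o_in .* w) - theta)));  % explicit threshold
    o_aug = [o_in; 1];             % append the bias neuron (output always 1)
    w_aug = [w; -theta];           % its connection weight is -Theta_j
    a2 = 1 / (1 + exp(-sum(o_aug .* w_aug)));       % threshold now 0
    % a1 and a2 are identical: the two networks are equivalent.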

2.2.5 Number of Layers and Neurons

The number of layers and the number of neurons within a single layer are chosen depending on the problem that the neural network is trying to solve, the type of data the network is working with, the quality of this data, and more. On the one hand, the number of input and output neurons depends on the training set (a training set consists of several training patterns and is thoroughly discussed later in this paper). On the other hand, determining the number of neurons within the hidden layers is a very challenging task [Lar05]. To be more concrete, if the number of neurons within a hidden layer is too large, this automatically increases the number of computations that the algorithm has to deal with. In contrast, picking too small a number of hidden neurons will prevent the algorithm from reaching its full learning capability. For this reason, the right balance must be found. Last but not least, it is very important to monitor the progress of a neural network while it is learning; if it is not improving, the model must be modified [Lar05].

2.2.6 Setting Weights

Modifying the connection weights is the way to control the behaviour of a neural network. Initially, the weights are assigned randomly, and they are changed during the training of the network. However, the focus should not be on changing one weight at a time; all of the weights should be modified simultaneously [Fog02]. This comes from the fact that a neural network may contain thousands of neurons, and changing one weight at a time would be very time-consuming. The logic behind the update is simple: connection weights are modified after each iteration during the process of learning, so that the set of weights improves after each iteration if the learning is going well. In other words, the focus should be on finding a set of weights that minimizes the error at the end of each iteration [Fog02].

2.2.7 Pruning

Pruning is a technique which is useful when dealing with large neural networks. The idea is to prune useless neurons and/or connections in order to obtain a smaller neural network that is easier to interpret and less time-consuming to train. However, this is a trade-off, since the lower execution time sometimes leads to poorer overall performance of the network.

2.3 Order in which the neuron activations are calculated

For every neural network, it is very important in which order the input neurons receive the input data and in which order the output neurons produce the output data. Two model classes can be differentiated: synchronous activation and asynchronous activation.

2.3.1 Synchronous Activation

All neurons within the network change their values simultaneously. Such networks are useful only if they are implemented on certain parallel systems and if they do not follow the feedforward topology. All neurons in the network calculate the network input using the propagation function, get active by means of the activation function and produce output by means of the output function, with all of this happening at the same time. This is one completed activation cycle.

2.3.2 Asynchronous Activation

Neurons no longer change their values simultaneously but in a certain order. This can be achieved in one of the following ways:

1. Random order - a randomly chosen neuron $i$ updates its $net_i$, $a_i$ and $o_i$. For $n$ neurons, a cycle is the $n$-fold execution of this step. Obviously, the main downside of this method is that some neurons are repeatedly updated during one cycle while others are not updated at all.

2. Random permutation - similarly to the previous approach, a neuron is chosen at random and its $net_i$, $a_i$ and $o_i$ are calculated, but only once per cycle. This is achieved by defining a permutation of the neurons at the beginning of each cycle and then processing the neurons in that order. One downside of this approach is that the order is generally not meaningful; another is that it is time-consuming to calculate a new permutation for each cycle.

3. Topological order - neurons are processed in each cycle following a fixed order defined by the network's topology. This method is only well defined for non-recurrent networks, because otherwise there is no order of activation (every neuron could potentially be an input neuron). For this reason, this method of activation is reasonable for feedforward networks: the input neurons are processed first, the inner neurons second, and the output neurons last. This is one completed cycle.

4. Fixed order of activation - neurons are processed following a fixed order determined during the implementation. The order is defined once, according to the topology, and does not need to be verified during runtime. A disadvantage of this method is that it is not suitable for networks that might change their topology.

2.4 Communication with the world outside the network

All neural networks receive input data and produce output data as a result. Looking at Figure 2.1, it can be noticed that there are two input neurons, $i_1$ and $i_2$, which receive two numerical values, $x_1$ and $x_2$, as input data. Similarly, there are two output neurons, $\Omega_1$ and $\Omega_2$, which produce the output values $o_1$ and $o_2$, respectively. Let the input layer of a neural network consist of $n$ neurons and the output layer consist of $m$ neurons. The input data is represented by a vector $x = (x_1, \ldots, x_n)$ with one component per input neuron. Similarly, there is an output vector $y = (y_1, \ldots, y_m)$ comprised of the values of the output neurons. Data is put into a neural network by using the components of the input vector as network inputs for the neurons of the input layer. At the end of each iteration, the $m$ output neurons produce $m$ output values which form the output vector.

2.5 Learning and training samples

In principle, a neural network can learn by changing its components. In other words, this can happen by:

1. Developing new connections
2. Deleting existing connections
3. Changing weights of existing connections
4. Changing one or more of the neuron functions (propagation, activation and output functions)
5. Developing new neurons
6. Deleting existing neurons

The technique of changing the weights of existing connections is the most commonly used. A neural network learns according to rules which are set by a learning procedure. The procedure is nothing more than an algorithm that can be implemented in a programming language, and it defines how the weights are going to be changed. Besides the learning procedure, a neural network needs a training set to produce a desired output. The training set $P$ is a set of training patterns that are used by the neural network in the process of learning.

2.5.1 Paradigms of Learning

Unsupervised learning In this method of learning, the training set consists only of input patterns. As a result, the network tries to identify similarities and to organize them into pattern classes.


Reinforcement learning In this method of learning, the training set consists of training patterns and a value to be returned after a sequence of actions. This value indicates whether the produced result is right or wrong and possibly how wrong it is.

Supervised learning In this method of learning, the training set consists of training patterns with corresponding correct results, so the neural network can produce a precise error vector. Supervised learning proceeds in five main steps:

1. Entering the input pattern (activation of the input neurons)
2. Forward propagation (generation of the output)
3. Comparing the returned output to the desired output (teaching input) and returning an error vector
4. Computing corrections of the network, based on the error vector
5. Applying the corrections

Offline learning A set of training patterns is entered into the network at once. After that, the connection weights are changed and the error vector is produced by means of the error function. In a nutshell, the network learns from all patterns at once.

Online learning A number of training patterns is entered into the neural network one after another. In this case, the connection weights are changed after each pattern is presented and the error vector is returned as well. To sum up, the network learns from the error after each training pattern.

2.5.2 Training patterns and teaching input

In the case of supervised learning, a training set consists of several training patterns, each with a corresponding output value that the output neurons are desired to produce as a result of the training process. As long as the neural network is still learning, i.e. producing wrong results, these output values are referred to as the teaching input for the respective output neurons. To sum up, let $j$ be an output neuron and $p$ a training pattern: the output value of neuron $j$ is denoted $o_j$, while the desired output for $p$ (the teaching input for $j$) is denoted $t_j$.

Training patterns A training pattern is a vector $p$ with components $p_1, p_2, \ldots, p_n$ whose desired output values are known. After each pattern is entered into the network, each output value is compared to the corresponding teaching input. The set of training patterns is called $P$; it is made up of a finite number of ordered pairs $(p, t)$ of training patterns and corresponding teaching inputs.

Teaching input Let $j$ be an output neuron. The teaching input $t_j$ is the desired output that $j$ has to produce after the input of a certain training pattern. Similarly to the vector $p$, the teaching inputs can be organized into a vector $t = (t_1, t_2, \ldots, t_m)$ corresponding to a particular $p$.

Error vector For several output neurons $\Omega_1, \Omega_2, \ldots, \Omega_m$, the difference between the teaching input vector and the output vector,

$$E = \begin{pmatrix} t_1 - y_1 \\ t_2 - y_2 \\ \vdots \\ t_m - y_m \end{pmatrix}$$

is considered as an error vector, sometimes called the difference vector. Depending on whether the training is performed online or offline, the error vector is based on a single training pattern or on a set of training patterns; in the latter case, it is normalized in a certain way.

2.5.3 Learning curve and error measurement

The learning curve indicates the progress of the error and can be defined in different ways. In other words, such a curve can indicate whether the network is progressing or not. In order to make this possible, the error must be normalized, representing a distance measure between the teaching input and output vectors. [Pre94]

Specific Error Let $\Omega$ be an output neuron and $O$ the set of output neurons.

$$Err_p = \frac{1}{2} \sum_{\Omega \in O} (t_\Omega - y_\Omega)^2$$

Euclidean Distance This is a generalization of the theorem of Pythagoras, which is reasonable for lower dimensions where its usefulness can still be observed.

$$Err_p = \sqrt{\sum_{\Omega \in O} (t_\Omega - y_\Omega)^2}$$

Root Mean Square (RMS)

$$Err_p = \sqrt{\frac{\sum_{\Omega \in O} (t_\Omega - y_\Omega)^2}{|O|}}$$

To conclude, in all three of these ways of calculating the error curve, $Err_p$ is based on a single training pattern, which automatically means that it is generated online.

Total Error The total error is based on all training patterns, which means it is generated offline.

$$Err = \sum_{p \in P} Err_p$$
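As a sketch, the error measures above can be computed in MATLAB as follows for a single training pattern; the teaching-input and output vectors are toy values.

    % Error measures for one training pattern p.
    t = [1.0; 0.0; 0.5];                          % teaching input vector
    y = [0.8; 0.1; 0.6];                          % actual network output
    err_specific  = 0.5 * sum((t - y).^2);        % specific error Err_p
    err_euclidean = sqrt(sum((t - y).^2));        % Euclidean distance
    err_rms       = sqrt(sum((t - y).^2) / numel(t));  % root mean square
    % The total (offline) error is the sum of Err_p over all patterns p in P.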

2.5.4 Gradient Optimization Procedures

Fig. 2.2.: Visualization of the gradient descent [Kri07]

Gradient descent procedures are mainly used when trying to maximize or minimize an $n$-dimensional function. Figure 2.2 visualizes gradient descent on a two-dimensional error function: on the left side, the area is shown in 3D, while on the right side, the steps over the contour lines are shown in 2D. The movement is made in the direction opposite to the gradient $g$, towards the minimum of the function, and it continuously slows down proportionally to $|g|$.

Gradient The gradient $g$ is a vector with $n$ components that is defined for any point of a differentiable $n$-dimensional function $f(x_1, x_2, \ldots, x_n)$. The gradient operator $\nabla$ is defined as

$$g(x_1, x_2, \ldots, x_n) = \nabla f(x_1, x_2, \ldots, x_n)$$

More concretely, $g$ points from the given point in the direction of the steepest ascent, and $|g|$ stands for the degree of that ascent.


Gradient descent Let $f$ be an $n$-dimensional function and $s = (s_1, s_2, \ldots, s_n)$ a point of this function, called the starting point. Gradient descent means going from $s$ against the direction of the gradient $g$, i.e. towards $-g$, in steps of length $|g|$, towards smaller and smaller values of $f$.
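A minimal MATLAB sketch of gradient descent on a two-dimensional function is given below; the function, its gradient, the step factor and the iteration count are illustrative assumptions.

    % Gradient descent on f(s) = s1^2 + 10*s2^2.
    f     = @(s) s(1)^2 + 10*s(2)^2;
    gradf = @(s) [2*s(1); 20*s(2)];   % gradient g of f
    s   = [4; 2];                     % starting point
    eta = 0.05;                       % step factor
    for k = 1:100
        g = gradf(s);
        s = s - eta * g;              % step against g, length proportional to |g|
    end
    f(s)                              % close to the minimum at the origin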

Gradient descent procedures are considered optimization procedures, since they work very well on many problems connected to neural networks, which makes them frequently used. However, they are not completely safe and have several disadvantages:

1. A gradient descent procedure can get stuck in a local minimum. This problem grows with the size of the error surface, and there is no universal solution. In reality, one can never know whether the optimal minimum has been reached.

2. Flat plateaus (regions of little or no change) within the error surface may slow down the training process: the gradient becomes considerably small when passing a flat plateau, since there is hardly any descent, which requires many more steps to be taken.

3. If there is an area within the error surface where a steep slope is followed by a steep ascent (in other words, a sudden alternation from a very strong negative gradient to a very strong positive one), this can lead to oscillation.

4. If the gradient is very large, a potentially good minimum can be left behind: when going down a steep slope, the steps taken may be too large to stop in the minimum.

Fig. 2.3.: Possible errors during a gradient descent [Kri07]

Figure 2.3 illustrates the possible errors during a gradient descent: (a) getting stuck in a local minimum, (b) passing through flat plateaus, (c) oscillation caused by a fast change from a very strong negative gradient to a very strong positive one, and (d) missing a potentially good minimum.

2.5.5 The Hebbian Learning Rule

In 1949, Donald O. Hebb formulated the Hebbian rule, which is the basis of most of the more complicated learning rules [Heb49]. There is an original form and a more generalized one.

Hebbian rule Let $i$ and $j$ be neurons. If $j$ receives an input from $i$ and both neurons are active at the same time, increase the weight $w_{i,j}$:

$$\Delta w_{i,j} \sim \eta \cdot o_i \cdot a_j$$

In other words, the change in the strength of the connection between $i$ and $j$ ($\Delta w_{i,j}$) is proportional to:

1. the output of the source neuron, $o_i$
2. the activation of the target neuron, $a_j$
3. the constant $\eta$, the learning rate

Generalized Hebbian rule The generalized form of the Hebbian rule defines the change in the connection weight $w_{i,j}$ as a product of two functions that are left unspecified but have defined input values [MR86]:

$$\Delta w_{i,j} = \eta \cdot h(o_i, w_{i,j}) \cdot g(a_j, t_j)$$
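A one-line sketch of the original Hebbian update in MATLAB, with toy values for the quantities involved:

    % Hebbian update for a connection i -> j.
    eta  = 0.1;                      % learning rate
    o_i  = 0.8;                      % output of the source neuron i
    a_j  = 0.6;                      % activation of the target neuron j
    w_ij = 0.2;                      % current connection weight
    w_ij = w_ij + eta * o_i * a_j;   % co-active neurons strengthen the connection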

2.6 The perceptron, backpropagation and its variants

A perceptron is a feedforward network containing a retina layer whose neurons are used only for gathering and transmitting data; for this reason, the connections between the retina and the input layer are statically weighted. The fixed-weight layer must be followed by at least one trainable layer. In addition, the neurons of one layer must be completely connected to the neurons of the following layer. In most cases, the first layer of a perceptron is considered as the input layer; the retina and the static connections behind it are usually not shown, since they do not process information in any way.


2.6.1 Single Layer Perceptron (SLP)

In this perceptron, the trainable weights go from the input layer to a single output neuron or to a layer of output neurons $\Omega_1, \Omega_2, \ldots, \Omega_n$. In other words, an SLP consists of only one level of trainable weights. One such perceptron is presented in Figure 2.4.

Fig. 2.4.: Single Layer Perceptron

Delta rule - Gradient based learning strategy for SLPs Consider a single layer perceptron with randomly assigned weights that need to be taught. For this purpose, a set of training patterns $P$ is defined, consisting of ordered pairs $(p, t)$ of a training pattern $p$ and a corresponding teaching input $t$. Other components of the SLP are:

1. a set of input neurons $I$
2. a set of output neurons $O$
3. an input vector $x$
4. an output vector $y$
5. a set of output neurons $\Omega_1, \ldots, \Omega_{|O|}$
6. $i$, the input of a neuron
7. $o$, the output of a neuron
8. $E_p$, the error vector for a particular training pattern $p$

The main goal is to meet the desired output as a result of the training process. In other words, for each training pattern the actual output $y$ must be approximately equal to the teaching input $t$:

$$\forall p \in P: y \approx t \quad \text{or} \quad \forall p \in P: E_p \approx 0$$

Error function The error function

$$Err: W \to \mathbb{R}$$

takes the set of weights $W$ as a vector and maps it onto a normalized output error. Going back to the main goal, this error must be minimized, which can happen by changing the connection weights during the training. For this purpose a gradient descent procedure is used, which defines how exactly the weights must be transformed. In other words, the change in the weights $W$ is defined by the gradient $\nabla Err(W)$ of the error function $Err(W)$:

$$\Delta W = -\eta \cdot \nabla Err(W)$$

Assuming that the error function is defined as the squared distance between the teaching input and the output vectors (Equation 2.1), the summation of the specific errors $Err_p(W)$ over all training patterns yields the definition of the error function $Err(W)$:

$$Err_p(W) = \frac{1}{2} \sum_{\Omega \in O} (t_{p,\Omega} - y_{p,\Omega})^2 \qquad (2.1)$$

$$Err(W) = \sum_{p \in P} Err_p(W) = \frac{1}{2} \sum_{p \in P} \left( \sum_{\Omega \in O} (t_{p,\Omega} - y_{p,\Omega})^2 \right)$$

where $t_{p,\Omega}$ is the teaching input for neuron $\Omega$ corresponding to the training pattern $p$, and $y_{p,\Omega}$ is the actual output. Moreover, their difference is denoted $\delta_{p,\Omega}$, which gives the delta rule its name. The next equation is the offline version of the delta rule, since the connection weights are updated after a whole set of patterns has been presented to the network:

$$\Delta w_{i,\Omega} = -\eta \frac{\partial Err(W)}{\partial w_{i,\Omega}} = \eta \sum_{p \in P} \delta_{p,\Omega} \cdot o_{p,i}$$

On the other hand, the online version of this rule simply omits the summation, and the learning is performed after each pattern. In other words:

$$\Delta w_{i,\Omega} = \eta \cdot \delta_\Omega \cdot o_i$$

Apparently, the delta rule is only suitable for SLPs, since the formula is always related to the teaching input, and there is no teaching input for the neurons within the processing layers. To sum up, this rule only works when the layer of input neurons is directly connected to the layer of output neurons; no hidden layers are allowed. [Kri07]
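As a sketch, the online delta rule for an SLP with a single linear output neuron can be written in MATLAB as below; the data, the number of epochs and the learning rate are illustrative assumptions.

    % Online delta-rule training of a single layer perceptron.
    P   = rand(50, 3);              % 50 training patterns, 3 input neurons
    T   = P * [1; -2; 0.5];         % teaching inputs (toy linear target)
    W   = 0.1 * randn(3, 1);        % trainable weights to one output neuron
    eta = 0.05;
    for epoch = 1:200
        for p = 1:size(P, 1)
            o     = P(p, :)';              % outputs of the input neurons
            y     = W' * o;                % network output (identity activation)
            delta = T(p) - y;              % teaching input minus output
            W     = W + eta * delta * o;   % online delta rule
        end
    end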

2.6.2 Multi Layer Perceptron (MLP)

Fig. 2.5.: 2-3-2-1 Multi Layer Perceptron

A Multi Layer Perceptron (MLP) is a perceptron that has more than one layer of variably assigned weights. An $n$-layer perceptron has exactly $n$ layers of trainable weights and $n+1$ layers of neurons in total (the retina is not counted here), including the layer of input neurons. One such perceptron is presented in Figure 2.5. The notation 2-3-2-1 MLP indicates that the input layer consists of 2 neurons, the two hidden layers are made up of 3 and 2 neurons, respectively, and the output layer has a single output neuron.
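A forward pass through such a 2-3-2-1 MLP could look as follows in MATLAB, assuming logistic activations in the hidden layers, a linear output neuron and randomly drawn weights (purely illustrative):

    % Forward pass through a 2-3-2-1 MLP.
    sig = @(z) 1 ./ (1 + exp(-z));
    x  = [0.5; -1.2];        % 2 input neurons
    W1 = randn(3, 2);        % input -> first hidden layer (3 neurons)
    W2 = randn(2, 3);        % first hidden -> second hidden layer (2 neurons)
    W3 = randn(1, 2);        % second hidden -> output layer (1 neuron)
    h1 = sig(W1 * x);
    h2 = sig(W2 * h1);
    y  = W3 * h2;            % network output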


2.6.3 Backpropagation (BP) of error learning rule

This is one of the most popular learning procedures when it comes to neural networks. The simplest formulation of the backpropagation algorithm can be broken down into four main steps [Roj96]:

1. Feedforward computation
2. BP to the output layer
3. BP to the hidden layer
4. Update of the connection weights

The connection weights are chosen randomly in the beginning, and BP is used to compute the necessary corrections. The changes are then applied and a new iteration follows; the algorithm stops when the error has become sufficiently small. However, this is a rough though easy-to-follow definition. In this paper, the BP algorithm is defined by means of the delta rule presented earlier, but modified for MLPs. More concretely, the derivation strategy of (Zell, 1994) [Zel94] and (McClelland and Rumelhart, 1986) [MR86] is followed. Even though the procedure had already been described by Paul Werbos in (Werbos, 1974) [Wer74], the (McClelland and Rumelhart, 1986) [MR86] publication happens to have many more readers.

Backpropagation is a gradient descent procedure on the function $Err(W)$, which takes all the weights of the neural network as arguments; as such, it has all the advantages and disadvantages of a gradient descent procedure. Similarly to the delta rule, the BP method trains a neural network by changing the connection weights (where this is actually possible). Moreover, it is exactly the delta rule, or rather its variable $\delta_i$ for a neuron $i$, extended from one layer of trainable weights to several.

Generalized delta rule The first task is to identify the neuron that is going to be used for calculating $\delta$. For this purpose, an inner neuron $h$ is defined, with a set of predecessor neurons $K$ and a set of successor neurons $L$. It is important to mention that all $k \in K$ and all $l \in L$ are also hidden (inner) neurons. This structure can be seen in Figure 2.6.

Backpropagation of error The mathematical derivation of the generalized delta rule, which leads to the definition of the backpropagation of error, is thoroughly described in [Kri07]. However, it is not the focus of this paper, and only the backpropagation of error itself is considered next.


Fig. 2.6.: Generalized delta rule

In a nutshell, the result for $\Delta w_{k,h}$ (the change in the weight between $k$ and $h$) is a generalization of the delta rule, called backpropagation of error:

$$\Delta w_{k,h} = \eta \cdot \delta_h \cdot o_k$$

where

$$\delta_h = \begin{cases} f'_{act}(net_h) \cdot (t_h - y_h), & \text{if } h \text{ is an output neuron} \\ f'_{act}(net_h) \cdot \sum_{l \in L} (\delta_l \cdot w_{h,l}), & \text{if } h \text{ is an inner neuron} \end{cases}$$

1. If $h$ is an output neuron:

$$\delta_{p,h} = f'_{act}(net_{p,h}) \cdot (t_{p,h} - y_{p,h})$$

More concretely, for a training pattern $p$ the weight $w_{k,h}$ changes proportionally according to:

a) the learning rate $\eta$
b) the output $o_{p,k}$ of the predecessor neuron $k$
c) the gradient of the activation function at the position of the network input of the successor neuron, $f'_{act}(net_{p,h})$
d) the difference between the teaching input and the actual output, $(t_{p,h} - y_{p,h})$, for the successor neuron $h$

In this case, BP works on two layers: the output layer with the successor neuron $h$ and the preceding layer with the predecessor neuron $k$.

2. If $h$ is an inner neuron:

$$\delta_{p,h} = f'_{act}(net_{p,h}) \cdot \sum_{l \in L} (\delta_{p,l} \cdot w_{h,l})$$

More concretely, for a training pattern $p$ the weight $w_{k,h}$ changes proportionally according to:

a) the learning rate $\eta$
b) the output $o_{p,k}$ of the predecessor neuron $k$
c) the gradient of the activation function at the position of the network input of the successor neuron, $f'_{act}(net_{p,h})$
d) the sum of the weighted deltas, $\sum_{l \in L} (\delta_{p,l} \cdot w_{h,l})$, of all neurons following $h$

In this case, BP works on three layers: the neuron $k$ is the predecessor of the connection whose weight $w_{k,h}$ is to be changed, the neuron $h$ is the successor of this connection, and the neurons $l \in L$ form the layer that follows the successor neuron $h$.

Obviously, BP initially processes the last weight layer by means of the teaching input, and then works backwards from layer to layer, taking each preceding change in weights into account. In this way, the teaching input leaves traces in all weight layers.
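The following MATLAB sketch performs one backpropagation update for a small 2-3-1 MLP, assuming logistic hidden neurons and a linear output neuron (so $f'_{act} = 1$ at the output); the pattern, target and weights are toy values.

    % One BP step for a 2-3-1 MLP.
    sig = @(z) 1 ./ (1 + exp(-z));
    x = [0.5; -1.2];  t = 0.3;  eta = 0.1;
    W1 = randn(3, 2);  W2 = randn(1, 3);
    % forward pass
    o1 = sig(W1 * x);                             % hidden layer outputs
    y  = W2 * o1;                                 % linear output neuron
    % backward pass: delta at the output, then propagated to the hidden layer
    delta2 = (t - y);                             % f'_act = 1 for the identity
    delta1 = (o1 .* (1 - o1)) .* (W2' * delta2);  % logistic derivative
    % weight updates: Delta w_{k,h} = eta * delta_h * o_k
    W2 = W2 + eta * delta2 * o1';
    W1 = W1 + eta * delta1 * x';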

Selection of the learning rate In any case, the change in a connection's weight is strongly proportional to the learning rate $\eta$. Therefore, the selection of $\eta$ is crucial for the BP algorithm, as it is for many other learning procedures: the speed and accuracy of these procedures are always dependent on and controlled by the learning rate.


On the one hand, if the chosen value of $\eta$ is too large, this leads to large jumps on the error surface: narrow valleys can simply be jumped over, and the movement across the error surface becomes very uncontrolled. On the other hand, a very small value of $\eta$ makes the learning procedure more time-consuming, often unacceptably so. Experience shows that the best value for $\eta$ lies in the following range [Kri07]:

$$0.01 \le \eta \le 0.9$$

In addition, the value of the learning rate can change during the process of learning, or it can depend on the topology. For example:

1. Variation of $\eta$ over time - In the beginning of the learning process, a large value of $\eta$ leads to good results and a shorter execution time, but to inaccurate results later. For this reason, the value of $\eta$ is decreased once or repeatedly over time, which leads to a slower learning procedure but a higher level of accuracy.

2. Different $\eta$ for different layers - BP gets slower the further one moves from the output layer of the neural network. This means that a larger value of $\eta$ can be chosen for the layers near the input layer, while the value of $\eta$ is kept smaller for the layers close to the output layer.

2.6.4 Resilient BP of error

Several properties of the BP algorithm that can be problematic (besides those inherited from the gradient descent procedure) have already been discussed: for example, a user might choose a bad learning rate, and BP gets slower for weights far from the output layer. For these reasons, Martin Riedmiller improved the BP learning procedure and called it Resilient BP of error [RB93], [Rie94]. There are two main ideas behind the algorithm, which are presented next.

1. Weight changes are not proportional to the gradient

According to the BP algorithm, connection weights are changed proportionally to the gradient of the error function. This means that every feature of the error surface is incorporated into the weight changes, which raises the question of whether this is always useful. The Resilient BP follows another approach: $\Delta w_{i,j}$ directly corresponds to the automatically adjusted learning rate $\eta_{i,j}$. In other words, the change is not proportional to the gradient; it is only influenced by the sign of the gradient. The result is much smoother learning with respect to the error function.


The weight-specific learning rates serve as absolute values for the changes of the respective weights. The question remains: where does the sign come from? Similarly to the BP algorithm, the error function $Err(W)$ is derived with respect to the individual weights $w_{i,j}$, yielding the gradients $g = \frac{\partial Err(W)}{\partial w_{i,j}}$. Rather than multiplicatively incorporating the absolute value of the gradient into the weight change, only the sign of the gradient is considered: the gradient now defines only the direction of the change, but no longer its strength. If the sign of the gradient $g$ is positive, the weight $w_{i,j}$ is reduced by $\eta_{i,j}$; if the sign of the gradient is negative, $\eta_{i,j}$ is added to the weight $w_{i,j}$; if the gradient is 0, nothing happens. To sum up:

$$\Delta w_{i,j}(t) = \begin{cases} -\eta_{i,j}(t), & \text{if } g(t) > 0 \\ +\eta_{i,j}(t), & \text{if } g(t) < 0 \\ 0, & \text{if } g(t) = 0 \end{cases}$$

2. Many dynamically adjusted learning rates instead of one static rate

As already mentioned, BP uses a learning rate $\eta$ which is defined by the user; it applies to the entire network and remains static until it is manually changed. This method has the disadvantages explored earlier in the text. In deep contrast, the Resilient BP uses a completely different approach: each weight $w_{i,j}$ has its own learning rate $\eta_{i,j}$, and these learning rates are not defined by the user but by the algorithm itself. In addition, the weight changes are not static but adapted at each time step; more formally, the learning rates $\eta_{i,j}(t)$ are defined per time step $t$. To sum up, this leads to more focused learning as well as to a solution of the problem of the learning procedure slowing down across the layers.

In order to adjust the learning rate $\eta_{i,j}$, the associated gradients of two time steps must be taken into account: the current gradient $g(t)$ and the gradient $g(t-1)$ that has just passed. The sign of the gradient either remains the same or flips over the two time steps. If the sign changes from $g(t-1)$ to $g(t)$, a local minimum was missed; this also means that the previous update of $\eta_{i,j}(t-1)$ was too large, and $\eta_{i,j}(t)$ must be reduced. In mathematical terms, $\eta_{i,j}(t-1)$ is multiplied by a constant $\eta^{\downarrow}$ between 0 and 1. Additionally, the weight change $\Delta w_{i,j}$ is set to 0, since something went wrong at the previous time step $(t-1)$. If the sign of the gradient remains the same, $\eta_{i,j}(t-1)$ is multiplied by a constant $\eta^{\uparrow}$ greater than 1; in this way, shallow areas of the error surface can be crossed quickly. In a nutshell:

$$\eta_{i,j}(t) = \begin{cases} \eta_{i,j}(t-1) \cdot \eta^{\downarrow}, & \text{if } g(t-1) \cdot g(t) < 0 \\ \eta_{i,j}(t-1) \cdot \eta^{\uparrow}, & \text{if } g(t-1) \cdot g(t) > 0 \\ \eta_{i,j}(t-1), & \text{if } g(t-1) \cdot g(t) = 0 \end{cases}$$
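A sketch of this update for a single weight is given below in MATLAB; the values of $\eta^{\uparrow}$ and $\eta^{\downarrow}$ are the ones commonly used for Rprop (1.2 and 0.5), not values taken from this thesis, and the gradients are toy numbers.

    % Resilient BP update for one weight w_{i,j}.
    eta_up = 1.2;  eta_down = 0.5;
    g_prev = -0.4;  g = 0.3;          % gradients at steps t-1 and t
    eta_ij = 0.07;  w_ij = 1.5;       % current learning rate and weight
    if g_prev * g < 0                 % sign flipped: a minimum was missed
        eta_ij = eta_ij * eta_down;   % shrink the learning rate
        dw = 0;                       % no weight change for this step
    elseif g_prev * g > 0             % sign kept: speed up
        eta_ij = eta_ij * eta_up;
        dw = -sign(g) * eta_ij;       % move against the gradient's sign
    else                              % g_prev * g == 0: keep eta unchanged
        dw = -sign(g) * eta_ij;
    end
    w_ij = w_ij + dw;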
