
Department of Science and Technology
Institutionen för teknik och naturvetenskap

Linköping University
Linköpings universitet

LIU-ITN-TEK-A-19/046--SE

A Recurrent Neural Network For Battery Capacity Estimations In Electrical Vehicles

Simon Corell

2019-09-03


LIU-ITN-TEK-A-19/046--SE

A Recurrent Neural Network For Battery Capacity Estimations In Electrical Vehicles

Master's thesis in Computer Science and Engineering
carried out at the Institute of Technology, Linköping University

Simon Corell

Supervisor: Daniel Jönsson
Examiner: Gabriel Eilertsen


Linköping University | Department of Science and Technology
Master's thesis, 30 ECTS | Computer Science and Engineering
LIU-ITN/LITH-EX-A--2019/001--SE

A Recurrent Neural Network For Battery Capacity Estimations In Electrical Vehicles

Simon Corell

Supervisor: Daniel Jönsson
Examiner: Gabriel Eilertsen

Linköpings universitet, SE-581 83 Linköping


Upphovsrätt

Detta dokument hålls tillgängligt på Internet - eller dess framtida ersättare - under 25 år från publiceringsdatum under förutsättning att inga extraordinära omständigheter uppstår.

Tillgång till dokumentet innebär tillstånd för var och en att läsa, ladda ner, skriva ut enstaka kopior för enskilt bruk och att använda det oförändrat för ickekommersiell forskning och för undervisning. Överföring av upphovsrätten vid en senare tidpunkt kan inte upphäva detta tillstånd. All annan användning av dokumentet kräver upphovsmannens medgivande. För att garantera äktheten, säkerheten och tillgängligheten finns lösningar av teknisk och administrativ art.

Upphovsmannens ideella rätt innefattar rätt att bli nämnd som upphovsman i den omfattning som god sed kräver vid användning av dokumentet på ovan beskrivna sätt samt skydd mot att dokumentet ändras eller presenteras i sådan form eller i sådant sammanhang som är kränkande för upphovsmannens litterära eller konstnärliga anseende eller egenart.

För ytterligare information om Linköping University Electronic Press se förlagets hemsida http://www.ep.liu.se/.

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its WWW home page: http://www.ep.liu.se/.


Abstract

This study investigates whether a recurrent long short-term memory (LSTM) based neural network can be used to estimate the battery capacity in electric cars. There is an enormous interest in finding the underlying reasons for why and how lithium-ion batteries age, and this study is a part of that broader question. The research questions that have been answered are how well an LSTM model estimates the battery capacity, how the LSTM model performs compared to a linear model, and which parameters are important when estimating the capacity. There have been other studies covering similar topics, but only a few have been performed on a real data set collected from cars in actual driving. With a data science approach, it was found that the LSTM model is indeed a powerful model to use for estimating the capacity. It had better accuracy than a linear regression model, although the linear regression model still gave good results. The parameters that appeared to be important when estimating the capacity were logically related to the properties of a lithium-ion battery.


Acknowledgments

This project has been the most challenging and exciting work I have ever been a part of. I am grateful to have received this opportunity at such a reputable company as Volvo Cars. The idea behind the project came from Christian Fleischer at Volvo Cars. Christian has supervised the project and has provided me with various pieces of advice. A huge thank you, Christian, for giving me this opportunity and for all the help you have provided. The initial idea was unfortunately not possible due to different circumstances, but I am still satisfied with what the project resulted in. A big thank you to all other Volvo Cars employees who have helped in various discussions and meetings, especially Herman Johnsson for all the support you gave. I also want to thank my examiner Gabriel Eilertsen and supervisor Daniel Jönsson for the great feedback and guidance throughout the project. Lastly, I want to thank all other master's thesis students who were at the same department for all the laughs and amusing moments.


Contents

Abstract iii
Acknowledgments iv
Contents v
List of Figures vii

1 Introduction 1
1.1 Motivation 1
1.2 Aim 1
1.3 Research questions 2
1.4 Delimitations 2

2 Background description and related work 3
2.1 Big Data 3
2.2 Time Series 3
2.3 Machine Learning 4
2.4 Neurons & Synapses 4
2.5 Lithium Ion Battery Description and Battery Management System 4
2.6 Related research 7

3 Theory 9
3.1 Data pre-processing 9
3.2 Linear Regression 17
3.3 Artificial neural networks 18

4 Method 25
4.1 Prestudy: Capacity estimation using lab data from NASA 25
4.2 Overall Project Progression 27
4.3 Data pre-processing 28
4.4 Implementation of the Linear Regression model 33
4.5 Implementation of the recurrent LSTM neural network 33

5 Results and Analysis 35
5.1 Results From Linear Regression Model 35
5.2 Long Short-Term Memory (LSTM) 38

6 Discussion 40
6.1 Method 40
6.2 Results 43
6.3 Replicability, reliability and validity 45

7 Conclusion 46
7.1 Summarizing 46
7.2 Future work 47

A Appendix 49


List of Figures

2.1 A Lithium Ion battery pack for Nissan Leaf at the Tokyo Motor Show 2009 5

3.1 Basic idea behind Local Outlier Factor 11
3.2 The normal distribution 12
3.3 The typical pipeline for a filter method 14
3.4 The typical pipeline for a Wrapper method 16
3.5 A simple linear model 17
3.6 A simple ANN 19
3.7 Vanilla recurrent neural network 23
3.8 The LSTM cell 24

4.1 Absolute difference for every 3000 estimation 26
4.2 Model Pipeline 27
4.3 Capacity values over all cycles for some cars 28
4.4 Capacity values over all cycles for some cars 29
4.5 Top 25 Pearson correlation coefficients 31
4.6 The Pearson correlations between the signals 33

5.1 Error distribution for linear regression 36
5.2 Plot showing the estimated values scattered against the real values 36
5.3 Plot showing the estimated values and the real values for car BDCWCU5 37
5.4 Plot showing the estimated values against the real values for car BDCWCU5 37
5.5 Error distribution for LSTM 38
5.6 Plot showing the estimated values against the real values 38
5.7 Plot showing the estimated values against the real values for car BDCWCU5 39


1 Introduction

This research study focuses primarily on the hypothesis that a recurrent neural network can, with high accuracy, estimate the current capacity of lithium-ion batteries. Lithium-ion batteries are used as the primary power source in the latest era of electric cars. Volvo Cars wants to investigate how well a recurrent neural network performs in this estimation and has provided a data set containing information from a fleet of their cars. The study was carried out at Volvo Cars' headquarters in Gothenburg, Sweden.

1.1 Motivation

Traditionally, fossil fuels have been the main source of energy in automobiles. This makes vehicles a huge contributor to the carbon footprint, which has a detrimental impact on the environment. Many companies nowadays are taking a more eco-friendly approach by electrifying new automobiles and, to a large extent, also the existing fleet in operation. The main source of portable power for these automobiles will be batteries. Therefore, it is of primary importance that research is done on the batteries when commercializing them, so that an understanding of their viability in the field of transportation and vehicle propulsion can be gained. Among the existing chemical combinations of batteries, lithium-ion batteries have proved to be the best choice for Electric Vehicles (EVs). The reasons for this choice are fast charge capability, high power density and energy efficiency, a wide operating temperature range, a low self-discharge rate, light weight, and small size [1]. However, the battery wears out due to many factors such as temperature, depth of discharge, charge/discharge rates, driving style, route terrain, etc.

1.2 Aim

Today it is difficult for a driver of an electric vehicle (EV) to know exactly how many ampere-minutes/hours the battery in his/her car has left. This is because the current State of Health (SoH) cannot precisely state this value, since it does not take into consideration factors such as driving patterns, driving style, the type of environment around the car and so on. Being able to estimate the capacity of the battery with high precision would certainly be beneficial for a car company like Volvo. This motivation can help us understand which factors make a battery perform well over a longer time, or why it does not perform well over a longer time. The underlying purpose of this thesis project is to evaluate and investigate whether a recurrent neural network model can learn to estimate the battery's capacity with respect to all the different parameters the Battery Management System (BMS) has sent out.

1.3 Research questions

Listed here are the research questions that this report will try to answer.

1. How well does a recurrent LSTM neural network estimate the capacity of a battery in an electrical vehicle?

2. Is a recurrent LSTM neural network necessary for the estimation of the capacity or can a linear model be used instead?

3. Which parameters seem to be important when estimating the battery capacity?

1.4 Delimitations

The most important delimitations are listed here.

• The data covers only 3-6 months, while a battery is estimated to last for around 7 years. Therefore, the decision was made to change the direction of the thesis from predicting the battery's end-of-life to instead estimating the online State of Health/capacity that the battery currently has.

• The study will only focus on the provided data set, and the results can therefore, in theory, only be proven to apply to this particular data set.

• In a wider scope, Volvo wants to investigate whether a global model can be combined with local models to make the estimation even more precise. Can, for example, Gothenburg, Sweden have a local model that applies only to the cars there, alongside a more general global model for all the cars in the world? The focus of this master's thesis project, however, is only to implement a global model for the estimation of the capacity of the batteries.


2 Background description and related work

This chapter will in detail explain all the background information that was researched and the related work that was investigated.

2.1 Big Data

Big data is stored digital information of immense magnitude [2]. The sizes of big data sets vary from gigabytes to terabytes or even petabytes. Big data appears in many different areas such as bioinformatics, physics, environmental studies, simulations and web services such as Youtube, Twitter, Facebook and Google. Many companies and organizations are trying to extract as much information and value as possible from their big data. It is becoming more and more important for large global companies to handle and manage their data as well as possible. The emergence of big data comes from the possibility of collecting digital information using different sensors, as well as from the internet, and later storing the information on hard drives located in different data servers and centers. Big data is too large and/or complex for traditional data-processing methods. Using new algorithms and methods, complex correlations in the data can be found; correlations explaining phenomena such as business trends, diseases, crimes and so on. The data on our planet grows rapidly with the increase of Internet of Things (IoT) devices such as cameras, microphones, mobile devices, and cars. The global data volume will grow exponentially from around 4.4 zettabytes to around 44 zettabytes between 2013 and 2020. By 2025 there will be around 163 zettabytes of digital information [3].

2.2 Time Series

The data in this study is a time series. A time series is a data set where the samples are indexed in time order [4]. The most common time series have the same time difference between samples, making them sequences of discrete-time data. Time series can deal with real continuous data, discrete numeric data or symbolic data. Time series are used in many different fields including statistics, signal processing, weather forecasting, astronomy and so on. Time series analysis consists of analyzing time series to extract meaningful information from them. Time series forecasting tries to build a model to predict future values based on the previously observed values. For example, given previously observed weather values, how will the weather be tomorrow?


2.3 Machine Learning

Machine learning (ML) is a set of algorithms that can learn behavior and patterns without being hardcoded for the purpose [5]. ML algorithms are founded on optimization and probability theory, which use multidimensional calculus and linear algebra in their calculations. ML has become enormously popular due to the excellent results it achieves in solving problems involving complex and large data sets. There are mainly two different categories of problems ML algorithms can deal with. These are called supervised and unsupervised learning. In supervised learning, the input data is labeled, so the task is to learn the correct relation between the inputs and outputs so that the ML algorithm can generalize what it has learned to new data. In unsupervised learning, the input data is not labeled. In this case, it is up to the ML algorithm to find underlying structures and patterns within the input data. Another way to divide ML problems is by the format of the output data. If the output data consists of discrete values out of a set of classes it is called classification. If the output data consists of real values it is called regression. Another ML problem is called clustering. Clustering tries to divide the input data into different groups and in that way categorize them. This is most commonly an unsupervised problem. Today ML algorithms are used in various fields including data mining, natural language processing, and image recognition. Deep learning is a class of machine learning algorithms that is used in supervised and unsupervised tasks. Most deep learning algorithms are based on artificial neural networks. They can learn multiple levels of representations and use a large number of layers of nonlinear connectivity. The "deep" in deep learning means that there is a large number of layers in the artificial neural network.

2.4 Neurons & Synapses

The human brain consists of a very large collection of neurons [6]. Neurons are the core component of the nervous system, and this is especially true for the brain. Neurons are electrically excitable cells that receive, process and later transmit information to other neurons through chemical and electrical signals. The connections between the neurons are called synapses, and these complex membrane junctions allow neurons to communicate with each other. The neurons operate in different neural circuits. A neural circuit is a population of neurons designed to carry out a specific task. These neural circuits are interconnected with other neural circuits to form a large-scale neural network in the brain. There are an estimated 100 billion neurons in the human brain. Each neuron can be connected to up to 10 thousand other neurons, passing signals through up to 1,000 trillion synapse connections. This is equivalent to a CPU performing 1 trillion bit operations per second. The human brain's memory capacity is estimated to vary between 1 and 1,000 terabytes [7]. It is fairly simple to understand the enormous power of the human brain: how fast it operates and how much memory it can store. The human brain is amazing in how it can learn to understand complex patterns, process information and memorize large amounts of information. All these factors made scientists wonder if there was a possibility to somehow mimic the human brain. The result is called artificial neural networks.

2.5 Lithium Ion Battery Description and Battery Management System

Lithium-ion batteries were first introduced during the 1990s and have now become one of the most common batteries on the market [8][9]. They are used in a large majority of different electronic devices including music devices, laptops, mobile phones and now electric cars. Lithium-ion batteries have been found to be very well-suited for mobile devices since they have a high power density (W/kg) and a long expected lifetime. Lithium-ion batteries consist of lithium-ion cells. Lithium-ion cells have primarily three components in their internal construction. These are two electrodes, called cathode (positive) and anode (negative), as well as a conductive electrolyte. During a charging cycle, the lithium cations move to the negative electrode, the anode. During a discharge cycle, the lithium cations move to the positive electrode, the cathode. For an electric car, a lithium-ion battery consists of tens to thousands of individual cells packaged together. The number of cells is determined by the required power, voltage, and energy. These lithium-ion battery cell packs are located underneath the car and can be seen in figure 2.1.

Figure 2.1: A Lithium Ion battery pack for Nissan Leaf at the Tokyo Motor Show 2009 taken from [10].

So what characterizes a lithium-ion battery? There are some important parameters one must know to be able to understand the most essential properties of a Li-ion battery. These values are discharging and charging cycles, State of Charge, battery voltage, discharge current, internal resistance, capacity and State of Health. These values are briefly described below.

• Discharging and charging cycles: A discharging cycle is when the battery is being used. In the case of a car battery, this is equivalent to the car driving and being used. A charge cycle is when the car is standing still and the battery is being filled up with energy.

• State of Charge: State of Charge is the equivalent of a fuel gauge for the energy in the battery of an electric vehicle, 0% meaning an empty battery and 100% meaning a full battery.

• Battery cell voltage: The battery voltage is the difference in charge between the cathode (positive electrode) and anode (negative electrode). The higher the difference is, the higher the voltage is. The battery voltage is equivalent to the electric potential in the battery. The electric potential is measured in volts.

• Discharge current: An electric current is the rate of flow of electric charge passing a point or over a region. It is measured in ampere. Discharge current is the electric current used during a discharge cycle. The discharge current is equivalent to the current required to handle the demands for energy.

• Internal resistance: Internal resistance is the impedance in an electric circuit. It is measured in Ohm (Ω) and is defined as the opposition to the current in an electric circuit.

• State of Health (SoH): State of Health is a value for the health of a battery compared to its ideal conditions. The state of health will be 100% at the time of manufacture and will later decrease when being used over time. The state of health is measured from the capacity of the battery.


• Capacity: A battery's capacity is the amount of electric charge it can deliver. It is measured in ampere-hours and is a good indication of the health of the battery.

• EOL: EOL stands for End-Of-Life and is a state in which the battery is considered to be useless. This is estimated to be when a full charge of the battery only results in 70% of its original capacity.

So what is known to cause the degradation of lithium-ion batteries? The capacity is a good indication of the health of the battery and will therefore be the value representing the degradation. The capacity degrades over thousands of cycles. The degradation consists of slow electrochemical processes in the electrodes. Chemical and mechanical changes to the electrodes are major causes of the reduction of the capacity. These changes cause flaws such as lithium plating, mechanical cracking of electrode particles, and thermal decomposition of the electrolyte. Charge and discharge cycles consume lithium ions on the electrodes. With fewer lithium ions on the electrodes, the charge and discharge efficiency of the electrode material is reduced. This means that charges of the cells will result in less capacity and that discharge cycles will consume the capacity faster. The degradation is highly correlated to the cell temperature. The cylindrical cell temperatures are linearly correlated to the discharge current. It is also believed that high discharge levels and elevated ambient temperatures will cause an increase in degradation. High discharge levels mean that the State of Charge has been pushed down to small percentages.

The battery management system (BMS) is a software program that manages a battery. Examples of its functions are keeping the battery from operating outside its safe operating area and monitoring its different parameters, including, for example, the voltage, temperature and current. The battery management system also communicates directly with the battery and does all of the computations needed for the values used for the management of the battery, including State of Charge, State of Health and internal resistance.


2.6 Related research

In the paper written by Hicham Chaoui [11], a neural network with delayed inputs is used to predict the SOC and SOH values of Li-ion batteries. The benefit of this method lies in the fact that the network does not need any information regarding the battery model or parameters. Previously recorded parameters such as ambient temperature, voltage and current are used as inputs. Another advantage of this method is that it compensates for the non-linear behaviour of the batteries, such as hysteresis and reduction in performance due to aging. From the results one can learn that this method provides high accuracy and robustness with a simple algorithm to evaluate a LiFePO4 battery.

Batteries were conventionally modelled by defining an empirical model dependent on the equivalent electric circuit, which can be made through actual measurements or estimation of battery parameters using extrapolation. The problem with these models is that they can only describe the battery characteristics up until the battery is charged again. In a research paper written by Massimo [12], the inter-cycle battery effects are automatically taken into account and implemented into the generic model used as a reference template. This method was used to model a commercial lithium iron phosphate battery to check the flexibility and accuracy of the described strategy.

In the paper written by Ecker [13], a multivariate analysis of the battery parameters was performed on batteries subjected to accelerated aging procedures. Impedance and capacity were used as the output variables. The change in these parameters was observed when the battery was subjected to different conditions of State of Charge and temperature. This led to the development of battery functions which model the aging process, supported by physical measurements on the battery. These functions led to the development of a general battery aging model. These models were subjected to different drive cycles and management strategies, which gave a better understanding of their impact on battery life.

The study done by Tina [14] highlights the use of several layers of Recurrent Neural Networks (RNNs) to take into account the non-linear behaviour of photovoltaic batteries while modelling. In this manner the trends of change in the input and output parameters of the battery are taken into account during charge and discharge processes. This has proved to be a powerful tool despite the heavy computations required to run the neural network architecture. In this model, the electric current supplied to the battery is the only effective external input parameter, because it depends on the user application. The voltage and SOC are taken as the output variables.

In the study done by Chemali [15], LSTM cells were used in a Recurrent Neural Network (RNN) to estimate the State of Charge (SOC). The significance of the research presented in this paper lies in the fact that a standalone neural network is used to take into account all the dependencies, without using any prior knowledge of the battery model or employing any filters for estimation, such as the Kalman filter. The network is capable of generalizing all kinds of feature changes it is trained on, using data sets containing different battery behaviour subjected to different initial environment conditions. One such feature taken into account while recording the data sets is ambient temperature. The model was able to predict a very accurate SOC with variation in external temperature in this case.

In the research conducted by Saha [16], a Bayesian network was used to estimate the remaining useful life of a battery whose internal state is difficult to measure under normal operating conditions. Therefore, these states need to be estimated using a Bayesian statistical approach based on previously recorded data, indirect measurements and different operational conditions. Concepts from electrochemical processes, like the equivalent electric circuit, and from statistics, like state transition models, were combined in a systematic framework to model the changes in the aging process and predict the Remaining Useful Life (RUL). Particle filters and Relevance Vector Machines (RVMs) were used to provide uncertainty bounds for these models.


In the study conducted by You [17], the State of Health (SOH) of the battery is estimated to determine the schedule for maintenance and the replacement time of the battery, or to estimate the driving mileage. Most of the previous research is done in a constrained environment where a full charge cycle with constant current has been a common assumption. However, these assumptions cannot be used for EVs, as the charge cycles are mostly partial and dynamic. The objective of this study was to come up with a robust model to estimate SOH where batteries are subjected to real-world environment conditions. The study demonstrated a technique that makes use of historic data collected on current, voltage and temperature in order to estimate the State of Health of the battery in real time with high accuracy.


3 Theory

This chapter will explain the theory behind the key components used in the thesis. The chapter starts with data pre-processing, followed by an explanation of linear regression and neural networks.

3.1 Data pre-processing

This section goes through the theory behind data pre-processing, first with an overall description and then with descriptions of some of the different steps in data pre-processing.

Data pre-processing is an important step in a machine learning project [18]. When data has just been collected it is called raw data. Raw data often comes with many flaws. These flaws include, for example, values that are out of range, values that are missing or combinations of values that are impossible. Including these kinds of values can lead to misleading results. This is why it is important to analyze the data and to make the necessary changes to it before applying any kind of machine learning algorithm. There are many different steps in data pre-processing. The only steps that this report covers are called data cleaning, data transformation, and feature selection. These will be explained in the upcoming paragraphs.

3.1.1 Data Cleaning

Data cleaning is a process where the aim is to detect incorrect values and later correct or remove these [19]. The purpose of data cleaning is to make all data sets consistent with each other. Data cleaning can consist of many different steps. The steps that were taken into consideration in this report are NaN-replacements, outlier detection, and Data Aggregation. These will be explained in the upcoming paragraphs.

3.1.1.1 NaN-values

NaN stands for "not a number" and is an error that can occur in the collection of data [20]. NaN is exactly what it stands for: for a value which should be numeric, nothing or a non-numeric value has been saved. This error can happen due to wrong user input or due to an error in the sensor used in the collection of the data. It is important to deal with NaNs because their false values will affect the data set. There are different solutions for dealing with NaN-values. Some of these are listed below.

• Replacing the NaN-value with the mean of the column.

• Removing the samples with a NaN-value.

• Replacing the NaN-value with the mean of the previous valid value and the next valid value.

• Replacing the NaN-value with either the previous valid value or the next valid value.
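To make the options above concrete, the sketch below shows how each strategy could be applied with pandas; the column name `voltage` and the DataFrame `df` are hypothetical placeholders, not names from the thesis data set.

```python
import pandas as pd
import numpy as np

# Hypothetical data frame with a missing value in a numeric column.
df = pd.DataFrame({"voltage": [3.7, np.nan, 3.9, 4.0]})

# 1. Replace the NaN with the mean of the column.
filled_mean = df["voltage"].fillna(df["voltage"].mean())

# 2. Remove the samples (rows) that contain a NaN.
dropped = df.dropna()

# 3. Replace the NaN with the mean of the previous and next valid values
#    (linear interpolation between the two neighbours).
interpolated = df["voltage"].interpolate(method="linear")

# 4. Replace the NaN with the previous valid value (forward fill)
#    or with the next valid value (backward fill).
forward_filled = df["voltage"].ffill()
backward_filled = df["voltage"].bfill()
```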

3.1.1.2 Outlier detection

Outlier detection identifies events or observations in data sets that differ significantly from the majority of the data [21]. Outlier detection can either be unsupervised or supervised. When dealing with unsupervised outlier detection the assumption is made that the majority of the samples in an unlabeled data set are normal and the aim is to find the abnormal samples. In supervised learning, a classifier is trained on data sets that are labeled either normal or abnormal. There are many different outlier detection algorithms. The one used in this report is based on the K-nearest neighbors algorithm and is called Local outlier factor.

Local outlier factor is an algorithm created by Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander in the year 2000 [22]. The algorithm emanates from the samples' local densities. The local density is estimated by using the k-distance given by the k-nearest neighbors. The basic idea is to calculate every sample's local density and then compare it with the neighboring samples' densities. This way, the samples whose densities differ the most compared to their neighbors can be found and later be classified as outliers. The algorithm starts with the user deciding how many k-nearest neighbors should be used. The k-distance is then the euclidean distance to the given neighbor.

This k-distance is used to define what is called the reachability-distance. The reachability-distance is defined as the maximum of the k-distance and the distance between the samples. This means that if the sample is within the k neighbors, the reachability-distance will be the k-distance. Otherwise, it will be the real distance between the samples. This procedure is just a "smoothing factor" used to get more stable results. Lastly, the reachability-distance is used to calculate the local reachability density. The local reachability density for a given sample is the inverse of the average of all the reachability-distances to its k-nearest neighbors. The longer the reachability-distances are, the sparser the local reachability density will become. The local reachability density describes how far away the sample is from other samples or clusters of samples. Local outlier factor is performed by first calculating the local reachability density for each sample. Then, for each sample, k ratios are calculated between the local reachability densities of its k-nearest neighbors and the local reachability density of the given sample. The k ratios are later averaged into one single value. The equation for Local Outlier Factor for an object p can be seen in equation 3.1.

$$\mathrm{LOF}_{kNN}(p) = \frac{\sum_{o \in N_{kNN}(p)} \frac{\mathrm{lrd}_{kNN}(o)}{\mathrm{lrd}_{kNN}(p)}}{|N_{kNN}(p)|} \qquad (3.1)$$

Where o is an object among the k-nearest neighbors of p, lrd_kNN(o) is the local reachability density of o, lrd_kNN(p) is the local reachability density of p, and |N_kNN(p)| is the number of k-nearest neighbors.


If this value is greater than 1, the local reachability density of the given sample is lower than that of its k-nearest neighbors, and the sample is considered to be an outlier. The basic idea behind local outlier factor can be seen in figure 3.1.

Figure 3.1: Basic idea behind Local Outlier Factor taken from [23].

Figure 3.1 shows the local reachability densities for some samples. Sample A has a sparse local reachability density with far k-distances to its neighbors. It is thereby considered to be an outlier.
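As a minimal illustration of how such an outlier detector could be applied in practice, the sketch below uses scikit-learn's LocalOutlierFactor on synthetic two-dimensional data; the number of neighbors and the contamination level are arbitrary assumptions, not values used in the thesis.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# A dense cluster of normal samples plus a few far-away points.
normal = rng.normal(loc=0.0, scale=0.5, size=(100, 2))
outliers = np.array([[5.0, 5.0], [-4.0, 6.0]])
X = np.vstack([normal, outliers])

# LOF compares each sample's local reachability density with that of
# its k nearest neighbors; labels are -1 for outliers and 1 for inliers.
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.02)
labels = lof.fit_predict(X)

# negative_outlier_factor_ is -LOF, so values far below -1 indicate outliers.
scores = -lof.negative_outlier_factor_
print("Detected outlier indices:", np.where(labels == -1)[0])
print("LOF scores of those samples:", scores[labels == -1])
```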

3.1.2 Data transformation

3.1.2.1 Data Aggregation

Data aggregation is a process in which data is gathered and expressed in a summary form [24]. The purpose of data aggregation is to prepare data sets for data processing. Data aggregation can, for example, consist of aggregating raw data over some time period to provide new information such as mean, maximum, minimum, sum or count. With this new information, one can analyze the aggregated data to gain insights about the data.

With the explosion of big data, organizations and companies are seeking ways of understanding their large amounts of information. Data aggregation can be a critical part of data management. One example is that reducing the number of rows leads to a dramatic decrease in the time required to process the data and to a dramatically smaller data size. Data aggregation is a popular strategy in Business Intelligence (BI) solutions and databases.
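As a small sketch of what such an aggregation step could look like, the example below aggregates a hypothetical signal per charge/discharge cycle with pandas; the column names `cycle` and `cell_temp` are placeholders and not taken from the thesis data set.

```python
import pandas as pd

# Hypothetical raw samples: several rows per cycle.
raw = pd.DataFrame({
    "cycle":     [1, 1, 1, 2, 2, 3],
    "cell_temp": [21.5, 23.0, 24.1, 20.9, 22.4, 25.3],
})

# Aggregate the raw rows into one summary row per cycle.
summary = raw.groupby("cycle")["cell_temp"].agg(["mean", "min", "max", "count"])
print(summary)
```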

3.1.2.2 Normalization techniques and Normal distributions

Normalization techniques are used in statistics and applications of statistics. Normalization can have different purposes and can be used within data transformation. In the simplest cases, normalization is used to adjust values measured on different scales to a common scale. In more advanced cases, normalization is used to change whole probability distributions. There are different normalization algorithms, for example Standard score and Studentized residual. The one that this report will focus on is called Standard score. Standard score is a common normalization technique used when working with machine learning algorithms.

3.1.2.3 Normal Distributions

In probability theory, the normal distribution (Gaussian distribution) is an important probability distribution [25]. The normal distribution comes from the central limit theorem. The central limit theorem states that sums or averages of observations of random variables become approximately normally distributed when the number of observations becomes sufficiently large. The distribution fits many natural phenomena, for example blood pressures, heights, IQ scores, and measurement errors. The normal distribution is always symmetrical around the mean, and the shape of its distribution can be determined by the mean and a value called the standard deviation. The graph for the normal distribution can be seen in figure 3.2.

Figure 3.2: The normal distribution taken from [26].

In figure 3.2 σ is the standard deviation and µ is the mean. One can see that 68 % of the samples are within one σ of the mean, 95 % are within two standard deviations of the mean and 99.7 % of the samples are within three standard deviations of the mean.

The values come from the probability density function for the normal distribution. This function can be seen in equation 3.2.

$$P(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \qquad (3.2)$$

Where x is a single data sample, σ is the standard deviation and µ is the mean.
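The 68/95/99.7 percentages quoted above can be checked empirically; the short sketch below draws samples from a standard normal distribution with NumPy and measures how many fall within one, two and three standard deviations (the sample size is an arbitrary choice for illustration).

```python
import numpy as np

rng = np.random.default_rng(42)
samples = rng.normal(loc=0.0, scale=1.0, size=1_000_000)

# Fraction of samples within k standard deviations of the mean.
for k in (1, 2, 3):
    fraction = np.mean(np.abs(samples) <= k)
    print(f"within {k} sigma: {fraction:.3f}")
# Expected output is close to 0.683, 0.954 and 0.997.
```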


3.1.2.4 Shapiro-Wilk test

The Shapiro-Wilk test was published in 1965 by Samuel Sanford Shapiro and Martin Wilk [27]. The test is a test of normality and tests the null hypothesis that a distribution is normally distributed.

$$W = \frac{\left(\sum_{i=1}^{n} a_i x_{(i)}\right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \qquad (3.3)$$

Where x_(i) is the ith smallest number (the ith order statistic) and x̄ is the sample mean. The coefficients a_i are given by a vector of expected values of order statistics of independent and identically distributed random samples. The null hypothesis is rejected if the probability value (p-value) is less than the chosen alpha value. If the p-value is higher than the chosen alpha value, the null hypothesis cannot be rejected and the tested data is assumed to be normally distributed. The sample size must be greater than two, and if the sample size is sufficiently large the test may detect departures from the null hypothesis that are trivial, resulting in rejections based on small, meaningless departures.

3.1.2.5 Standardization (Standard Score)

The standard score is a normalization technique where the mean is subtracted from the observed value and the result is divided by the standard deviation [28].

$$z = \frac{x - \alpha}{\beta} \qquad (3.4)$$

Where x is the sample value, α is the mean and β is the standard deviation. The result of applying the standard score is that all values are rescaled so that the standard deviation is one and the mean is zero. Having a standard normal distribution is a general requirement for a large proportion of machine learning algorithms. Some examples of machine learning algorithms that need standard scores are neural networks with gradient descent as an optimization algorithm, K-nearest neighbors with euclidean distance measurement and principal component analysis [29]. All have their different reasons. For example, for neural networks with gradient descent-based optimization, some weights would otherwise be updated much faster than others, and for principal component analysis one wants to find the directions that maximize the variance, which can only be done if the values have the same scale.
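A minimal sketch of standardization, shown both as the plain formula with NumPy and with scikit-learn's StandardScaler; the small data matrix is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 220.0],
              [3.0, 260.0],
              [4.0, 300.0]])

# Standard score by hand: subtract the mean and divide by the standard deviation.
z_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# Equivalent transformation with scikit-learn.
z_sklearn = StandardScaler().fit_transform(X)

print(np.allclose(z_manual, z_sklearn))               # True
print(z_sklearn.mean(axis=0), z_sklearn.std(axis=0))  # ~0 and ~1 per column
```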

3.1.3 Feature Selection

Feature selection is the process of selecting a subset of features out of an original, larger set of features [30]. The newly selected subset of features is then used when constructing a machine learning model. There are mainly five reasons for using feature selection. These are listed below.

• Simplification of the data makes the models easier to understand for users and researchers.

• It reduces the chances of overfitting the model. Overfitting means that the model cannot generalize to new unseen data.

• It shortens the time needed for training the model.

• It avoids the curse of dimensionality. The curse of dimensionality refers to various negative phenomena that can occur when dealing with high-dimensional data.

• It improves the performance of the model if the most suitable subset is selected.

Feature selection emanates from the fact that there are features that are redundant or irrelevant for the particular problem faced and can therefore be removed. Redundant means that some features could be relevant but be highly correlated with other, more relevant features. Feature selection is often used for data sets which have many features but comparatively few samples. The most famous cases where feature selection is used are the analysis of written texts and DNA analysis. In these cases, there are thousands of features but only hundreds of samples. The aim of feature selection is to obtain the most optimal subset of features for the particular problem. There are three main categories of feature selection algorithms: filter methods, wrapper methods and embedded methods. Filter methods and wrapper methods will be explained in the upcoming paragraphs. This report will not cover embedded methods.

3.1.3.1 Filter Method

Filter methods are the simplest feature selection algorithms. In filter methods, a measurement is used to determine how relevant and important a feature is. The measurements that are most commonly used are mutual information, the Pearson correlation coefficient, relief-based algorithms and significance tests. These measurements are fast to compute and capture the importance of the features. Some filter methods provide the "best" subset of features while others provide a ranking of the features. The typical pipeline for a filter method can be seen in figure 3.3.

Figure 3.3: The typical pipeline for a filter method taken from [31].

Figure 3.3 shows the typical pipeline for a filter method. It starts with all the features. Then each feature's correlation with the target variable is calculated. Then the best subset is selected with respect to the chosen threshold. This subset is used when training and later evaluating the model.


The measurement that this report will focus on is the Pearson correlation coefficient [32]. The Pearson correlation coefficient was created by Karl Pearson and measures the linear correlation between two variables. It has a value between -1 and +1, where +1 means a total positive linear correlation and -1 means a total negative linear correlation. A value of 0 means that there is no linear correlation between the variables. The correlation coefficient is defined as the covariance of the two variables divided by the product of their standard deviations.

$$\rho = \frac{\mathrm{cov}(X, Y)}{\sigma_x \sigma_y} \qquad (3.5)$$

Where cov is the covariance, σx is the standard deviation of X (X meaning the population of the feature x) and σy is the standard deviation of Y (Y meaning the population of the target feature y). When used in the filter method, the user decides what threshold the Pearson correlation coefficients should exceed when selecting the subset of features. This means that the filter method consists of first calculating each feature's Pearson correlation coefficient and later selecting those features whose Pearson correlation coefficient satisfies the chosen threshold. The drawback of using a filter method with Pearson correlation coefficients is that it does not take into consideration the dependencies between features in the data. It can only find the best subset of features if all features in the data are completely independent of each other. It also does not take into consideration non-linear correlation between the features and the target feature. So, when dealing with data sets with non-linearities and features with dependencies, the only result obtained from this method is a ranking of how all features are linearly correlated to the target feature. It will not find the most optimal subset of features. But this ranking can still provide the user with a first overall linear correlation description of the data.
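As a sketch of such a correlation-based filter, the example below ranks features by their absolute Pearson correlation with the target using pandas and keeps those above a threshold; the feature names, the synthetic data and the 0.5 threshold are illustrative assumptions only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500

# Hypothetical features and a target that depends linearly on two of them.
df = pd.DataFrame({
    "mean_cell_temp":    rng.normal(25, 3, n),
    "discharge_current": rng.normal(40, 10, n),
    "random_noise":      rng.normal(0, 1, n),
})
df["capacity"] = (150 - 0.8 * df["mean_cell_temp"]
                  - 0.3 * df["discharge_current"]
                  + rng.normal(0, 1, n))

# Pearson correlation of every feature with the target variable.
correlations = df.drop(columns="capacity").corrwith(df["capacity"], method="pearson")

# Keep the features whose absolute correlation exceeds the chosen threshold.
threshold = 0.5
selected = correlations[correlations.abs() > threshold].index.tolist()

print(correlations.abs().sort_values(ascending=False))
print("Selected features:", selected)
```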

3.1.3.2 Wrapper Method

Wrapper methods are feature selection methods which take both linear and non-linear dependencies between features into consideration. Wrapper methods can deal with data sets with features that are dependent on each other. Wrapper methods use a predictive model to score different selected subsets of features. They train a new model for each subset and calculate a score based on how well the model performed. Wrapper methods are very computationally heavy but are usually the best feature selection methods to use. The typical pipeline for a wrapper method can be seen in figure 3.4.


Figure 3.4: The typical pipeline for a Wrapper method taken from [33].

The typical pipeline for a wrapper method starts with all the features. Then subsets of features are generated and evaluated until the subset of features which gives the best performance is found. It is essentially reduced to a search problem and is very computationally expensive. There are mainly three different wrapper methods. These are called Forward Selection, Backward Elimination and Recursive Feature Elimination. They are listed and explained below.

• Forward Selection: Forward Selection starts with no features. Then, at each iteration, it adds the feature that best improves the model. This is done until no feature can be added to improve the performance of the model.

• Backward Elimination: Backward Elimination starts with all features. It then removes the least important feature at each iteration with respect to the performance. This is repeated until no removal of features can improve the performance.

• Recursive Feature Elimination: Recursive Feature Elimination is a greedy optimization algorithm which repeatedly creates new models. At each iteration, it saves the best performing feature. It later constructs the next model with the saved features. This is done until all features have been tried.

The wrapper method that this report will cover is a backward elimination method based on shadow features and a random forest algorithm [34]. The wrapper method consists of four major steps in its pipeline. The pipeline is explained below.

• 1: The given data set is extended by creating shuffled copies of all features. These new copies are called shadow features. Each original feature has one corresponding shadow feature.

• 2: A random forest regressor is then trained on the extended data and a feature importance measure is applied. The measurement is called Mean Decrease Accuracy. The measurement explains how important each feature is.

• 3: At each iteration, it is checked whether an original real feature has higher importance than the best of the shadow features. If the best shadow feature has higher importance, then that feature is considered unimportant and is removed.


In the backward elimination method based on shadow features and a random forest algorithm, the original data is expanded with an equally large randomized data set of shadow features. Then, at each iteration, a random forest algorithm is trained on the expanded data set. This calculates the relative importance of each feature. Then, for each feature, a check is made to see if the best of the shadow features has a higher value. If the check results in the feature having the higher value, it is considered to be a "hit". This "hit" is later saved in a table. At each iteration, another check is also done which compares the number of times the feature did better than the best shadow feature, using a binomial distribution. For example, say F1 is a feature which gets 3 hits over 3 iterations. With these values, we can calculate a probability value using a binomial distribution with k=3, n=3, and p=0.5. F1 is confirmed to be an important feature and its column is removed from the original data set. If a feature has not recorded a hit over a certain number of iterations, it is considered to be unimportant and is rejected (also removed from the original data set). The iterations continue until all features have either been confirmed or rejected.
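The following is a highly simplified, single-iteration sketch of the shadow-feature idea described above (in the spirit of the Boruta algorithm): each real feature is compared against the best shadow feature using a random forest's importance scores. It uses scikit-learn's impurity-based importances rather than Mean Decrease Accuracy, and the data and column names are invented, so it should be read as an illustration rather than the thesis implementation.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 400

# Hypothetical data: two informative features and one pure-noise feature.
X = pd.DataFrame({
    "mean_cell_temp":    rng.normal(25, 3, n),
    "discharge_current": rng.normal(40, 10, n),
    "random_noise":      rng.normal(0, 1, n),
})
y = 150 - 0.8 * X["mean_cell_temp"] - 0.3 * X["discharge_current"] + rng.normal(0, 1, n)

# Step 1: extend the data with shuffled copies of every feature (shadow features).
shadows = pd.DataFrame(
    {f"shadow_{c}": rng.permutation(X[c].values) for c in X.columns},
    index=X.index,
)
X_ext = pd.concat([X, shadows], axis=1)

# Step 2: train a random forest on the extended data and read out the importances.
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_ext, y)
importance = pd.Series(forest.feature_importances_, index=X_ext.columns)

# Step 3: a real feature counts as a "hit" if it beats the best shadow feature.
best_shadow = importance[shadows.columns].max()
hits = importance[X.columns] > best_shadow
print(importance.sort_values(ascending=False))
print("Hits this iteration:", hits[hits].index.tolist())
```

In the full method, these hits would be accumulated over many iterations and tested against a binomial distribution before a feature is confirmed or rejected.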

3.2 Linear Regression

Linear regression is a linear model which tries to model the relationship between a target variable and predictor variables [35]. It assumes that the relationship between the target variable and the predictor variables is linear. When there is only one predictor variable the model is called simple linear regression, and when there is more than one predictor variable it is called multiple linear regression.

$$Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip} + \epsilon_i \qquad (3.6)$$

Where Yi is the target variable, β0, β1, ..., βp are the regression coefficients and Xi1, ..., Xip are the predictor variables. εi is the error variable. The error variable ε is an unobserved random variable which adds noise to the linear relationship. β0, β1, ..., βp are interpreted as partial derivatives of the target variable with respect to the predictor variables. In figure 3.5 an example of a simple linear regression model can be seen.

Figure 3.5: A simple linear model taken from [36].

In figure 3.5 the red dots are the observed pairs of the target variable and the predictor variable. The green lines are the random deviations from the underlying relationship, which the blue line represents.
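For reference, fitting such a multiple linear regression model could look like the sketch below with scikit-learn; the two predictor variables and their coefficients are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 300

# Two hypothetical predictor variables and a noisy linear target.
X = rng.normal(size=(n, 2))
y = 5.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.3, size=n)

model = LinearRegression().fit(X, y)

print("Intercept (beta_0):", round(model.intercept_, 2))
print("Coefficients (beta_1, beta_2):", np.round(model.coef_, 2))
print("R^2 on training data:", round(model.score(X, y), 3))
```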


3.3 Artificial neural networks

This section will describe what artificial neural networks (ANNs) [37] are and explain the different aspects of ANNs in detail.

3.3.0.1 Background

The interest in neural networks began in 1958 when the perceptron was created by Rosenblatt. But due to computer hardware limitations, and the fact that basic perceptrons were not sufficient, the interest stagnated until the backpropagation algorithm was created in 1975. The backpropagation algorithm made it possible to train multi-layer networks in a much more efficient way and paved the way for new sets of applications. In the following years more and more algorithms were discovered while the computational power increased. This has led us to what we have today. Today there are various ANNs, including recurrent neural networks, deep neural networks, and convolutional neural networks. These ANNs are used, for example, for image, text, and voice recognition.

3.3.0.2 General description

ANNs are computing systems inspired by the neurons and synapses in our brains. ANNs are collections of connected nodes called artificial neurons. These artificial neurons try to model the neurons in the human brain. Each connection between the artificial neurons can send signals between them, just as the synapses in our brain. One can see ANNs as a framework where different types of machine learning algorithms can be used to process complex data inputs. The most important quality that ANNs have is the ability to perform tasks without being hardcoded for them. ANNs learn how to process new data based on what they have previously learned. The basic idea with a feedforward ANN is to train the network so that it can predict what the input data it gets corresponds to, e.g. predict that an input image contains a human face or a dog. The procedure for training a network for this kind of problem is called supervised learning. In supervised learning, the network is fed with labeled training data over a time duration. After the time duration is over, the network changes its internal parameters with respect to the error it had. The update of its internal parameters is done through an optimization strategy; more about these in the optimization section. The error is calculated from what the network predicted the training data was, compared to what the training data was labeled with. The calculation of the error is called the cost function and will be described more closely in the cost-functions section. The ANN repeats this procedure over the labeled training data until the maximum number of iterations has been reached. When this occurs, the network is tested on the test data. It is here one can tell whether the network has learned to generalize its ability to predict, or whether the network is overfitted to the training data and predicts wrongly on new data. More on this in a later section. The basic structure of an ANN consists of nodes, layers, and weights. These handle data inputs and compute an output as a result of the data input. A simple ANN can be seen in figure 3.6 below.


Figure 3.6: A simple ANN taken from [38].

The ANN in figure 3.6 consists of three layers. The first layer is called the input layer and is the layer which first receives the input data. The input layer here consists of three nodes, meaning that the input data consists of three features. These three features together form a sample. This sample can represent any kind of data, e.g. a point in a vector space (x, y, z) or three properties of a flower. Each feature needs to be represented by a number representing its value. Since this is a network of the fully connected type, every feature value gets transmitted to every hidden node in the hidden layer. When each feature value leaves the input layer and is transmitted to the hidden layer, it is multiplied with a unique weight for each hidden layer node. The weights are scalar values and are for the most part initialized with a random value within a certain range. The second layer in this ANN is called the hidden layer. It has four nodes, and since this ANN is fully connected, every node computes the sum of every weighted feature from the previous layer.

pj(t) =

ÿ

i

oi(t)wij. (3.7)

Where pj(t)is the output of the given node, oi(t)wijis the weighted output from the

pre-vious node. When this has been done it activates the summation of the weighted features with an activation function, more about activation functions in the next section. The hidden layer later transmits its activated values to the third layer in the same way as the input layer transmitted to the hidden layer. The third layer is called the output layer and has two nodes. Every node in the output layer calculates the sum of the weighted activated values and com-putes the activation of this value using an activation function. The network later decides what to output regarding what activation in the output layer that has the highest value. This procedure is done for every single sample in the training data. With an increasing number of nodes and layers, the complexity increases and more computations need to be done. But this also increases the possibility to create a network which can handle a large amount of complex data with a high degree of variations. The difficulty for a designer of an ANN is to first find the right amount of hidden nodes and hidden layers to use in his/her network. The


The creator also needs to find the right type of activation functions, cost function, optimization strategy, and validation procedure. There are also many different variations of ANNs, each with a unique architecture and each most suitable to a certain type of data and problem. The type of ANN that has been chosen for this problem is a recurrent neural network, described in a later section.

3.3.0.3 Activation functions

The activation function decides what a node in an ANN should output. It is used to normalize the outputs and, in some cases, it acts as a binary switch which decides whether the node should be "activated" or not. Non-linear activation functions are central in making the neural network able to model non-linear relationships: non-linearity is needed because the aim is to produce a non-linear decision boundary via non-linear combinations of the weights and inputs. Some of the most common activation functions for ANNs are listed below together with descriptions of how they work; a small code sketch of the three functions follows the list.

• Tanh: Tanh normalizes the input between -1 and 1.

tanh(x) = (e^x - e^{-x}) / (e^x + e^{-x})    (3.8)

where x is the input value and e is Euler's number.

• Rectified linear unit (ReLU): ReLU outputs the maximum of zero and the input value, which makes negative outputs impossible.

ReLU(x) = max(0, x)    (3.9)

where x is the input value.

• Sigmoid: Sigmoid normalizes the input between 0 and 1.

f(x) = 1 / (1 + e^{-x})    (3.10)

where x is the input value and e is Euler's number.
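The three functions in equations 3.8-3.10 can be written directly from their definitions; the NumPy sketch below is for illustration only.

import numpy as np

def tanh(x):
    # Equation 3.8: normalizes the input between -1 and 1.
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def relu(x):
    # Equation 3.9: the maximum of zero and the input value.
    return np.maximum(0.0, x)

def sigmoid(x):
    # Equation 3.10: normalizes the input between 0 and 1.
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 2.0])
print(tanh(x), relu(x), sigmoid(x))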

3.3.0.4 Cross-validation

Cross-validation strategies are different techniques for validating an ANN and decide how the whole data set should be divided into training, test and validation parts. The cross-validation must be well thought out, since problems like overfitting can otherwise occur. Overfitting is the phenomenon where the ANN performs well on the training data but cannot generalize its predictions to new, unknown data, mostly because the training data contains patterns and behavior that do not exist in the test data. There are two common types of cross-validation.

• Holdout method: This method splits the entire data set into two parts; one is used for training and the other for testing.

• n-fold cross-validation: The original data set is divided into n equal-sized subsamples. One subsample is picked as validation data while the remaining n-1 subsamples are used as training data. This procedure is repeated n times until all n subsamples have been used for validation, and a single estimate is then produced by averaging the n results; a sketch of the procedure is shown after this list.
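The sketch below illustrates the n-fold procedure. It assumes a generic train_and_score(train_x, train_y, val_x, val_y) function that trains a model and returns a validation score; that function and the dummy example data are placeholders, not the thesis implementation.

import numpy as np

def n_fold_cross_validation(x, y, n, train_and_score, seed=0):
    """Split the data into n equal-sized subsamples, use each one once for
    validation and the remaining n-1 for training, then average the scores."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(x))
    folds = np.array_split(indices, n)
    scores = []
    for i in range(n):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n) if j != i])
        scores.append(train_and_score(x[train_idx], y[train_idx], x[val_idx], y[val_idx]))
    return float(np.mean(scores))

# Example with a dummy scoring function (placeholder):
x, y = np.arange(20).reshape(10, 2).astype(float), np.arange(10).astype(float)
dummy = lambda tx, ty, vx, vy: float(np.mean(vy))
print(n_fold_cross_validation(x, y, n=5, train_and_score=dummy))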


3.3.0.5 Cost-functions

Cost functions measure how well an ANN performed when mapping the training examples to correct outputs. The result is computed by calculating the differences between the network's outputs and the correct outputs for every training sample, and an error or score is then obtained by averaging all the differences. The creator of the ANN wants to either minimize (if it is an error) or maximize (if it is a score) the cost function by using an optimization algorithm; more about optimization algorithms in the next section. A cost function depends on four things: all the weights and biases in the network, one single training sample, and the correct output for that sample. It returns a single value which represents either an error or a score. Two of the most used cost functions are listed below, followed by a short code sketch of both.

• Mean Squared Error (MSE): MSE sums the squared error over the training set.

C_MSE(W, B, S^r, E^r) = 0.5 ∑_j (a^L_j - E^r_j)^2    (3.11)

where W are the weights, B are the biases, S^r is the training sample, E^r is the correct output, and a^L_j is the activation of output node j.

• Cross-entropy cost (CE): CE sums all the cross-entropies over the training set.

C_CE(W, B, S^r, E^r) = -∑_j [E^r_j ln(a^L_j) + (1 - E^r_j) ln(1 - a^L_j)]    (3.12)

where W are the weights, B are the biases, S^r is the training sample, E^r is the correct output, and a^L_j is the activation of output node j.
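The sketch below writes equations 3.11 and 3.12 directly from their definitions, taking the output activations a^L and the correct outputs E^r for a single training sample as NumPy arrays; the example values are assumptions for illustration.

import numpy as np

def mse_cost(a_L, e_r):
    # Equation 3.11: half the summed squared difference between the
    # output activations and the correct outputs for one training sample.
    return 0.5 * np.sum((a_L - e_r) ** 2)

def cross_entropy_cost(a_L, e_r, eps=1e-12):
    # Equation 3.12: summed cross-entropy; eps guards against log(0).
    a_L = np.clip(a_L, eps, 1.0 - eps)
    return -np.sum(e_r * np.log(a_L) + (1.0 - e_r) * np.log(1.0 - a_L))

a_L = np.array([0.8, 0.1])   # output activations for one sample (assumed values)
e_r = np.array([1.0, 0.0])   # correct outputs for that sample
print(mse_cost(a_L, e_r), cross_entropy_cost(a_L, e_r))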

3.3.0.6 Optimization algorithms

Optimization algorithms are used to either minimize or maximize the return value of the cost function. They play a major role in the training of an ANN because they state how the internal parameters, the weights and the bias values, should be updated. The goal of an optimization algorithm is to update the internal parameters in the direction of the optimal solution. There are two major categories of optimization algorithms: first-order and second-order optimization algorithms. First-order algorithms use the cost function's gradient values to update the ANN's internal parameters, while second-order algorithms use the cost function's second-order derivative values. First-order algorithms converge rather fast on large data sets and are easier to compute, whereas second-order algorithms are often slower and require more time and memory. Two of the most used update rules are listed below, followed by a short code sketch of both.

• Gradient Descent: Gradient descent is a common algorithm used for optimization. It updates the weights and bias values after each training step according to the formula in equation 3.13 below. The gradient is calculated using a method called backpropagation.

θ_{t+1,i} = θ_{t,i} - η ∇_θ J(θ_{t,i})    (3.13)

where θ are the weight and bias values, η is the learning rate and ∇_θ J(θ) is the gradient of the cost function. Backpropagation is commonly used in the training of deep neural networks. The method computes the gradients for each layer in the ANN by applying the chain rule to the cost function, propagating backwards through the network so that the effect each inner neuron had


on the output can be traced, and the neuron can thereby be updated correctly. Optimization algorithms that use gradient descent include standard gradient descent, stochastic gradient descent, and mini-batch gradient descent.

• Momentum: Momentum is an improvement of the previously described gradient descent; it is gradient descent with a short-term memory. It takes more than only the most recent gradient into consideration when updating the weights.

z_{t+1,i} = β z_{t,i} + ∇_θ J(θ_{t,i})    (3.14)

θ_{t+1,i} = θ_{t,i} - η z_{t+1,i}    (3.15)

where θ are the weight and bias values, η is the learning rate, ∇_θ J(θ) is the gradient of the cost function and z_{t+1,i} is the momentum value. Momentum gives gradient descent an acceleration, and how large this acceleration should be is up to the user of the algorithm. When β = 0, Momentum becomes exactly gradient descent, but for β = 0.99 the acceleration can be what is needed. Momentum has proved to be suitable for data with pathological curvatures, i.e. regions that are not scaled properly and form shapes similar to valleys, canals and ravines. Where gradient descent gets stuck or jumps between such regions, Momentum more often progresses along the most optimal direction. An optimizer which uses Momentum is Adam.
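As an illustration of equations 3.13-3.15, the sketch below applies the two update rules to a simple quadratic cost whose gradient is known in closed form; the cost function and the values of η and β are assumptions made only for this example.

import numpy as np

def grad_J(theta):
    # Gradient of the toy cost J(theta) = ||theta||^2 / 2, used only for illustration.
    return theta

theta_gd = np.array([5.0, -3.0])
theta_mom = theta_gd.copy()
z = np.zeros_like(theta_mom)                  # momentum value z
eta, beta = 0.1, 0.9                          # learning rate and momentum coefficient

for t in range(100):
    # Equation 3.13: plain gradient descent.
    theta_gd = theta_gd - eta * grad_J(theta_gd)
    # Equations 3.14 and 3.15: momentum keeps a short-term memory of past gradients.
    z = beta * z + grad_J(theta_mom)
    theta_mom = theta_mom - eta * z

print(theta_gd, theta_mom)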

3.3.1 Recurrent neural network

Recurrent neural networks (RNNs) are a type of ANN with a structure that allows training on data where the order of the input sequence is of importance [39]. Recurrent neural networks can scale to much longer sequences than traditional feedforward neural networks. An RNN processes its inputs as sequences, defined as vectors where the elements are separated by a time step; the time step can either be a timestamp in the real world or the position in the sequence. RNNs are based on the idea of unfolding a computational graph to include cycles [40]. In these cycles, the present value has an influence on its own value at a future time step. With a computational graph there is a way to form a structure that maps inputs to outputs where the inputs have a repetitive structure, typically corresponding to a chain of events. The structure of most RNNs can be decomposed into three blocks of parameters and associated transformations: from the input node to the hidden node, from the previous hidden node to the next hidden node, and from the hidden node to the output. Each of these transformations corresponds to a shallow transformation, i.e. a learned affine transformation followed by a fixed non-linearity. RNNs are typically normal feedforward networks where the outputs from the hidden nodes are fed back into the network itself. There are three different types of RNNs: many-to-many, one-to-many, and many-to-one. The different types and their explanations are listed below.

• Many-to-many: The network takes in a sequence of inputs and outputs a sequence of outputs.

• One-to-many: The network takes in a single input and outputs a sequence of outputs.

• Many-to-one: The network takes in a sequence of inputs and outputs a single output.

The vanilla RNN is a fully recurrent neural network, meaning that every neuron in a given layer is directly connected to every neuron in the next layer. Each hidden neuron is at any given time step feeding its output back into itself. With this behavior, the hidden neurons calculate their outputs with regard to the values they previously outputted. The network has a "memory" of the previous inputs.


Figure 3.7: Vanilla recurrent neural network taken from [41].

"memory" of the previous inputs. A folded and unfolded example of a Vanilla RNN can be seen in figure 3.7.

In figure 3.7, the folded vanilla RNN shows the cycle where the hidden node h always saves its previous output and considers this value when calculating its next output. The unfolded vanilla RNN shows how this looks in practice over three time steps t-1, t and t+1.
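The sketch below illustrates the unfolded computation in figure 3.7: the hidden state at time step t is computed from the current input and the hidden state from the previous time step, here in a many-to-one layout. The sizes, random weights and tanh activation are assumptions made for illustration only.

import numpy as np

rng = np.random.default_rng(2)
n_features, n_hidden, n_outputs, n_steps = 3, 8, 1, 5

W_xh = rng.normal(scale=0.1, size=(n_features, n_hidden))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(n_hidden, n_hidden))     # hidden -> hidden (the recurrence)
W_hy = rng.normal(scale=0.1, size=(n_hidden, n_outputs))    # hidden -> output

sequence = rng.normal(size=(n_steps, n_features))            # one input sequence
h = np.zeros(n_hidden)                                        # initial hidden state

for x_t in sequence:
    # The hidden node feeds its previous output back into itself (the "memory").
    h = np.tanh(x_t @ W_xh + h @ W_hh)

y = h @ W_hy                                                  # single output (many-to-one)
print(y)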

Vanilla RNNs are not used very often for long-term memory tasks. The main reason for this is the vanishing gradient problem: the more time steps there are, the greater the chance that the backpropagation gradients either accumulate and explode or vanish down to zero. This is because, in the backpropagation, the weights become exponentially smaller due to long-term interactions that involve the multiplication of many Jacobians. With many time steps, many non-linear functions are composed and the result is highly non-linear, typically with most derivative values being small, some being large, and many alternations between increasing and decreasing. Derivatives becoming small is especially true when the Sigmoid function is used as activation function, because the gradient of the Sigmoid function becomes very small when its output is close to zero or one. The problem cannot be avoided by simply staying in a region of parameter space where the gradients do not vanish or explode, because, to store memories in a way that is robust to small perturbations, the RNN must enter a region of parameter space where the gradients vanish. When the model learns long-term dependencies, the gradient of a long-term dependency has exponentially smaller magnitude than the gradient of a short-term dependency. It is not impossible to learn long-term dependencies, but it might take a very long time because the long-term dependencies tend to be hidden by the smallest fluctuations arising from the short-term dependencies. To remove the vanishing gradient problem, paths are needed where the derivatives neither vanish nor explode. Gated RNNs are a type of RNN that does this, by updating the weights at a given time step when suitable and by teaching the network to clear its state when needed. Long short-term memory (LSTM) is a gated RNN that does exactly this.

3.3.2 LSTM

LSTM introduced the idea of self-loops to create paths where the gradient neither vanishes nor explodes [40]. The most important addition has been to make the weight in this self-loop dynamic rather than fixed. By making the weight of the self-loop gated (controlled by another hidden unit), the effect that time has on the integration can be changed dynamically, because the time constants are outputted by the model itself. LSTM has been enormously
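The text does not give the LSTM equations at this point, but as a rough illustration of the gated self-loop, the sketch below implements one step of a standard LSTM cell with input, forget and output gates (biases omitted for brevity). All sizes and weight values are assumptions made only for this example.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(3)
n_in, n_hidden = 3, 8

# One weight matrix per gate, acting on the concatenated [input, previous hidden state].
W_f, W_i, W_o, W_c = (rng.normal(scale=0.1, size=(n_in + n_hidden, n_hidden)) for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([x_t, h_prev])
    f = sigmoid(z @ W_f)               # forget gate: how much of the old cell state to keep
    i = sigmoid(z @ W_i)               # input gate: how much new information to write
    o = sigmoid(z @ W_o)               # output gate: how much of the cell state to expose
    c_tilde = np.tanh(z @ W_c)         # candidate cell state
    c = f * c_prev + i * c_tilde       # gated self-loop: the weight on the loop is dynamic
    h = o * np.tanh(c)
    return h, c

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.normal(size=(5, n_in)):
    h, c = lstm_step(x_t, h, c)
print(h)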

