
DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2021

Data-Driven Learning for Approximating Dynamical Systems Using Deep Neural Networks

MAX BERG WAHLSTRÖM
AXEL DERNSJÖ


Abstract

In this thesis, a one-step approximation method has been used to produce approximations of two dynamical systems. The two systems considered are a pendulum and a damped dual-mass-spring system.

Using the one-step approximation method proposed in [15], it is first shown that the state variables of a general dynamical system one time-step ahead can be expressed using a concept called the effective increment.

The state of the system one time-step ahead then depends only on the previous state and the effective increment, and this effective increment in turn depends only on the previous state and the governing equation of the dynamical system.

By introducing the concept of neural networks and the surrounding concepts, it is shown that a neural network can be trained to approximate this effective increment, removing the need for a known governing equation when determining the system state. The solution to a general dynamical system can then be approximated using only the trained neural network operator and a state variable to produce the state variable one discrete time-step ahead.

When training the neural network operator to approximate the effective increment, the analytical solutions to the two dynamical systems are used to produce large amounts of training data on which the network can be trained. Using the optimizer algorithm Adam [8] and the collected training data, the network parameters were adjusted to minimize the difference between the output of the network and the target value, in this case the correct state variable one time-step ahead.

The results show that training a neural network to produce approximations of a dynamical system is possible, but to produce more accurate approximations of systems more complex than the ones considered in this thesis, greater care has to be taken when choosing the parameters of the network and when tuning the hyper-parameters of the optimizer Adam. Furthermore, the structure of the network could be adjusted by changing the number of hidden layers and the number of nodes in them.


Sammanfattning

This report has used a one-step approximation method to approximate two dynamical systems. The two systems approximated are a pendulum and a damped mass-spring system with two masses.

Using the one-step approximation method proposed in [15], it is first shown that the state variables of a general dynamical system one time-step ahead can be expressed using the concept of the effective increment. The state of the system one step ahead in time depends only on the previous state and the effective increment. The effective increment in turn depends only on the previous state and the governing equation of the system.

By introducing the concept of neural networks and the surrounding concepts, it is shown that a neural network can be trained to approximate the effective increment. Hence there is no need for a known governing equation. The solution to a general dynamical system can then be approximated using the trained neural network and the state variables to produce the state at the next discrete point in time.

To train the neural network to approximate the effective increment, the analytical solutions to the two dynamical systems are used to create large amounts of data on which the network can train. Using the optimization algorithm Adam [8] and the collected training data, the network parameters can be changed to make the error between the network output and the desired value as small as possible, where the desired value is the state variable one step ahead in time.

The results show that a neural network can be trained to approximate the dynamical systems. To achieve high accuracy in the approximations for more complex systems than those presented in the report, more care is needed when choosing the network parameters and the hyper-parameters of the optimization algorithm Adam.


Acknowledgments

The authors of this thesis would like to thank our supervisors David Krantz and Xin Huang for their great support and engaging discussions throughout this project.


Contents

1 Introduction
  1.1 Background
  1.2 Problem Statement
  1.3 Overview
2 Dynamical Systems
  2.1 Pendulum
  2.2 Damped Dual-Mass-Spring System
3 Deep Neural Network
  3.1 Architecture
  3.2 Activation Function
  3.3 Output and Loss Function
  3.4 Residual Neural Network
  3.5 Training the Network
    3.5.1 Gradient-Based Learning
    3.5.2 Back-Propagation
    3.5.3 Optimization Algorithm
4 Approximating Dynamical Systems With Deep Neural Networks
  4.1 One-Step Solution Using Effective Increment
  4.2 One-Step Approximation Using ResNet
  4.3 Data Sets
5 Implementation
  5.1 Data Collection
    5.1.1 Pendulum
    5.1.2 Damped Dual-Mass-Spring System
  5.2 Network Setup
  5.3 Training Process
  5.4 Validation of the Trained Model
6 Results
  6.1 Pendulum
    6.1.1 No External Force and Constant Time-Step
    6.1.2 No External Force and Varying Time-Step
    6.1.3 External Force Input
    6.1.4 Different Stopping Criteria
  6.2 Damped Dual-Mass-Spring System
    6.2.1 No External Force and Constant Time-Step
    6.2.2 Force Input
    6.2.3 Overshooting
7 Discussion
8 Conclusion


1 Introduction

Dynamical systems are modeled extensively in industry, from the suspension of a car to the complex weather systems studied in meteorology.

For some problems, such as modeling the suspension of a car or describing the motion of a planetary body, an analytical solution describing the state of the system can be found. For many dynamical systems of interest, however, an analytical solution cannot be found or is hard to find, and the solution has to be simulated or approximated instead. These simulations are often costly in terms of both time and processing power. Instead of such simulations, an approximation method based on data-driven learning can be used.

This thesis will investigate one particular approximation method based on data-driven learning with a deep neural network.

1.1 Background

Data-driven learning is the process of learning a specific task by using large amounts of data. One example that utilizes data-driven learning is Google's translation system, the Google Neural Machine Translation system (GNMT) [19]. The process works by using large amounts of training examples (data) on which the translation method can be trained, hence the term data-driven. By utilizing this data-driven approach, the translation process can overcome many of the weaknesses of conventional translation.

The data-driven learning method that will be utilized is based on an artificial neural network that uses large amounts of training data to learn a specific task. The comparison between an artificial neural network, illustrated in figure 3, and a biological neural network, the human brain, has been drawn several times [5]. Each node in the network represents a neuron in the brain, and each connection between nodes represents a synapse. These connections have the ability to transmit a signal to the next node (neuron).

The learning process of an artificial neural network can also be seen to mimic a human brain, which learns by examples.

The ability of an artificial neural network to learn a specific task demands training. For this process to be successful the network needs large amounts of data to train on and to have the correct setup of nodes and connections between them.

The training of a neural network is a costly affair that can take days to complete and can require large amounts of processing power and memory. The benefit of using a neural network is that once this training is complete, it can produce some output near instantly. This means that the network only has to be trained once, and even though the data needed for training might need to be collected through a costly simulation, this simulation will only need to be done once.

Some popular numerical methods are, for example, the 4th-order Runge-Kutta method and adaptive finite element methods, which can produce accurate approximations in many cases. But these methods also have restrictions; one area where ordinary numerical methods have difficulty predicting the system state is when a system has a discontinuous motion [3]. The proposed benefit of using a deep neural network in this regard is that a neural network has the ability to learn trends.

This is useful in cases where one does not seek an extremely accurate solution but is instead interested in the underlying dynamics, and wants to see how the system would react if some parameters were changed or a particular external force acted upon the system. If the network obtains the ability to find a system's underlying dynamics, such questions could be easy to answer with the model instead of running costly simulations with other methods.

The data-driven approximation method that this thesis will utilize is based on previous work by [15] and [16].

The method is based on a concept called the effective increment, which describes the exact change of a dynamical system from one state in time to another. The method is also based on a special type of deep neural network, called a residual neural network, which is well-suited to approximate a general dynamical system by approximating the effective increment. Conceptually, there are similarities between a residual neural network and a first-order approximation method such as Euler's method, in that both take a step by using the previous step plus some change depending on the previous step [2].

1.2 Problem Statement

The purpose of this thesis is to use a data-driven method to predict the behavior of a dynamical system. In particular, this thesis will look into a pendulum and a damped dual-mass-spring system. These dynamical systems can be described by differential equations. Let us consider a general non-autonomous dynamical system

$$\frac{dx}{dt} = f(x(t), u(t)), \qquad x(t_0) = x_0 \tag{1}$$

where $f : \mathbb{R}^n \to \mathbb{R}^n$ is the function describing the governing equation of the system, which depends on the time-dependent state variable $x \in \mathbb{R}^n$ and some time-dependent input signal $u \in \mathbb{R}^n$. The goal is to utilize a data-driven numerical model that takes an initial condition $x_0$ at time $t_0$ and produces a prediction $\hat{x}$ of the actual state $x$ such that

$$\hat{x}(t; x_0) \approx x(t; x_0) \tag{2}$$

where t is the time at which the prediction should be made. This model will then be used to make multiple consecutive one-step predictions using the previous prediction as an input for the next prediction.

1.3 Overview

In section 2 we introduce the two dynamical systems on which we seek to evaluate the proposed data-driven method, and the analytical solutions to these systems are derived. In section 3 the fundamentals behind deep neural networks are introduced. In section 4 we show that the exact solution to a dynamical system can be written in a form expressed with the effective increment. We then motivate the use of a deep neural network with the residual neural network architecture to approximate the effective increment. Section 5 explains in more detail the process of implementing the theory and producing results, as well as the implementation of a validation model on which the network is tested. Section 6 then presents the results from the validation model, and section 7 discusses these results.


2 Dynamical Systems

A dynamical system is a system in which the state can be described as a function of time. The state of the system can consist of one or several physical quantities, such as position, velocity, or temperature. In this thesis, two dynamical systems will be approximated using a data-driven approximation method. This section therefore derives analytical solutions to two dynamical systems, namely the pendulum and a damped dual-mass-spring system. These two systems are chosen in order to test the approximation method on systems of different complexity. The pendulum has relatively low complexity, while the damped dual-mass-spring system has a higher complexity, largely due to the damping in the system.

2.1 Pendulum

Figure 1: Pendulum

The differential equation describing the motion of the pendulum in figure 1 can be formulated using Newton's second law,

$$mg\sin(\theta) = -ml\frac{d^2\theta}{dt^2} + F(t) \tag{3}$$

where $m$ is the mass attached to the mass-less rod with length $l$, and $m$ is considered to be a point mass. The gravitational acceleration is denoted $g$, $\theta(t)$ is the angle as defined in figure 1, and $F(t) \in \mathbb{R}$ is an external time-variant force that acts tangentially on the mass. The differential equation (3) can be rewritten as

$$\frac{d^2\theta}{dt^2} + \frac{g}{l}\sin(\theta) = \frac{F(t)}{ml}. \tag{4}$$

To simplify the differential equation further, the small angle approximation $\theta \approx \sin(\theta)$ is used, which means that equation (4) can be written

$$\frac{d^2\theta}{dt^2} + \frac{g}{l}\theta = \frac{F(t)}{ml}. \tag{5}$$

To find a homogeneous solution to this differential equation, a characteristic equation can be formed,

$$r^2 + \frac{g}{l} = 0. \tag{6}$$


The characteristic equation (6) has the roots $r = \pm\sqrt{\frac{g}{l}}\,i$. Because these roots are complex, the homogeneous solution can be written in the following form

$$\theta_h(t) = A\cos\left(\sqrt{\frac{g}{l}}\,t\right) + B\sin\left(\sqrt{\frac{g}{l}}\,t\right), \tag{7}$$

where $A, B \in \mathbb{R}$ are constants. Setting the initial conditions to $\theta(0) = \theta_0$ and $\dot{\theta}(0) = 0$, and writing the angular frequency as $\omega = \sqrt{\frac{g}{l}}$ to simplify further, the constants can be determined and the final homogeneous solution to equation (5) can be formed,

$$\theta_h(t) = \theta_0\cos(\omega t). \tag{8}$$

The particular solution to the differential equation of the pendulum depends on $F(t)$. In the example below, the equation is solved for the force $F(t) = F_0\sin(\alpha t)$, where $F_0$ is the amplitude of the force and $\alpha$ is its angular frequency. To find the particular solution, we use the ansatz $\theta_p(t) = K_1\sin(\alpha t) + K_2\cos(\alpha t)$, where $K_1, K_2 \in \mathbb{R}$ are constants. Solving for the constants gives the particular solution

$$\theta_p(t) = \frac{F_0}{ml(\omega^2 - \alpha^2)}\sin(\alpha t). \tag{9}$$

The solution to the differential equation is then

$$\theta(t) = \theta_0\cos(\omega t) + \frac{F_0}{ml(\omega^2 - \alpha^2)}\sin(\alpha t). \tag{10}$$

Differentiating equation (10) with respect to time, the angular velocity is obtained as

$$\dot{\theta}(t) = -\omega\theta_0\sin(\omega t) + \frac{\alpha F_0}{ml(\omega^2 - \alpha^2)}\cos(\alpha t), \tag{11}$$

where $\dot{\theta}$ denotes the time derivative of $\theta$.
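As a concrete illustration, equations (10) and (11) can be evaluated directly in code; a minimal sketch in Python, where the default parameter values match table 1 and the function name is our own:

```python
import numpy as np

def pendulum_state(t, theta0, F0, alpha, m=0.7, l=0.5, g=9.82):
    """Evaluate the analytical solution (10) and its time derivative (11).

    Assumes the forcing F(t) = F0*sin(alpha*t) with alpha != omega.
    Returns (theta, theta_dot) at time t.
    """
    omega = np.sqrt(g / l)                      # natural angular frequency
    amp = F0 / (m * l * (omega**2 - alpha**2))  # particular-solution amplitude
    theta = theta0 * np.cos(omega * t) + amp * np.sin(alpha * t)    # eq. (10)
    theta_dot = (-omega * theta0 * np.sin(omega * t)
                 + alpha * amp * np.cos(alpha * t))                 # eq. (11)
    return theta, theta_dot

# Example: unforced pendulum released from pi/12 rad
theta, theta_dot = pendulum_state(t=0.5, theta0=np.pi / 12, F0=0.0, alpha=1.0)
```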


2.2 Damped Dual-Mass-Spring System

Figure 2: The damped dual-mass-spring system.

Given the dynamical system as presented in figure 2, the differential equations describing the system can be written as

$$\begin{cases} m_1\ddot{x}_1 - c(\dot{x}_2 - \dot{x}_1) - k_2 x_2 + (k_1 + k_2)x_1 = 0, \\ m_2\ddot{x}_2 + c(\dot{x}_2 - \dot{x}_1) + k_2(x_2 - x_1) = F(t), \end{cases} \tag{12}$$

where $m_1$ and $m_2$ are the two masses, here considered as point masses, $x_1$ and $x_2$ are the positions of the two masses, $c$ is a damping constant, $k_1$ and $k_2$ are the spring constants of the two springs, and $F(t)$ is a time-variant force acting on the mass $m_2$. The system of differential equations (12) can be rewritten in matrix form

$$\underbrace{\begin{pmatrix} \dot{x}_1 \\ \ddot{x}_1 \\ \dot{x}_2 \\ \ddot{x}_2 \end{pmatrix}}_{\dot{\mathbf{x}}} = \underbrace{\begin{pmatrix} 0 & 1 & 0 & 0 \\ -\frac{k_1+k_2}{m_1} & -\frac{c}{m_1} & \frac{k_2}{m_1} & \frac{c}{m_1} \\ 0 & 0 & 0 & 1 \\ \frac{k_2}{m_2} & \frac{c}{m_2} & -\frac{k_2}{m_2} & -\frac{c}{m_2} \end{pmatrix}}_{\mathbf{A}} \underbrace{\begin{pmatrix} x_1 \\ \dot{x}_1 \\ x_2 \\ \dot{x}_2 \end{pmatrix}}_{\mathbf{x}} + \underbrace{\begin{pmatrix} 0 \\ 0 \\ 0 \\ \frac{F(t)}{m_2} \end{pmatrix}}_{\mathbf{F}}. \tag{13}$$

The system in equation (13) will be written in the more compact form

$$\dot{\mathbf{x}} = \mathbf{A}\mathbf{x} + \mathbf{F}(t). \tag{14}$$

To solve this system of differential equations, a new variable $\mathbf{z}(t)$ is introduced, where

$$\mathbf{z}(t) = e^{-\mathbf{A}t}\mathbf{x}(t). \tag{15}$$

The derivative of $\mathbf{z}(t)$ with respect to time $t$ can be written as

$$\dot{\mathbf{z}}(t) = -\mathbf{A}e^{-\mathbf{A}t}\mathbf{x}(t) + e^{-\mathbf{A}t}\dot{\mathbf{x}}(t) = e^{-\mathbf{A}t}(\dot{\mathbf{x}}(t) - \mathbf{A}\mathbf{x}(t)) = e^{-\mathbf{A}t}\mathbf{F}. \tag{16}$$

The difference between $\mathbf{z}(t)$ at time $t$ and at the initial time $t_0$ can be expressed as

$$\mathbf{z}(t) - \mathbf{z}(t_0) = \int_{t_0}^{t} \frac{d\mathbf{z}}{d\tau}\,d\tau = \int_{t_0}^{t} e^{-\mathbf{A}\tau}\mathbf{F}(\tau)\,d\tau, \tag{17}$$

where $\tau$ is an integration variable. By substituting $\mathbf{z}(t)$ as expressed by equation (15) into equation (17), the equation can be rewritten as

$$e^{-\mathbf{A}t}\mathbf{x}(t) - e^{-\mathbf{A}t_0}\mathbf{x}(t_0) = \int_{t_0}^{t} e^{-\mathbf{A}\tau}\mathbf{F}(\tau)\,d\tau, \tag{18}$$

which can be further rewritten as

$$\mathbf{x}(t) = e^{\mathbf{A}(t - t_0)}\mathbf{x}(t_0) + \int_{t_0}^{t} e^{\mathbf{A}(t - \tau)}\mathbf{F}(\tau)\,d\tau. \tag{19}$$


For the purpose of discretizing equation (19), let $t = k\Delta t$ where $k = 0, 1, 2, \ldots$ and $t_0 = 0$, and define $\mathbf{x}_k := \mathbf{x}(k\Delta t)$, $\mathbf{x}_0 := \mathbf{x}(0)$. Inserting this expression for $t$ into equation (19) gives

$$\mathbf{x}_k = e^{\mathbf{A}k\Delta t}\mathbf{x}_0 + \int_{0}^{k\Delta t} e^{\mathbf{A}(k\Delta t - \tau)}\mathbf{F}(\tau)\,d\tau. \tag{20}$$

Taking one more discrete step we get

$$\mathbf{x}_{k+1} = e^{\mathbf{A}(k+1)\Delta t}\mathbf{x}_0 + \int_{0}^{(k+1)\Delta t} e^{\mathbf{A}((k+1)\Delta t - \tau)}\mathbf{F}(\tau)\,d\tau. \tag{21}$$

The integral in equation (21) can be divided into two integrals as

$$\mathbf{x}_{k+1} = e^{\mathbf{A}\Delta t}\left(e^{\mathbf{A}k\Delta t}\mathbf{x}_0 + \int_{0}^{k\Delta t} e^{\mathbf{A}(k\Delta t - \tau)}\mathbf{F}(\tau)\,d\tau\right) + \int_{k\Delta t}^{(k+1)\Delta t} e^{\mathbf{A}((k+1)\Delta t - \tau)}\mathbf{F}(\tau)\,d\tau. \tag{22}$$

By comparing equation (20) with equation (22), we see that it can be simplified to

$$\mathbf{x}_{k+1} = e^{\mathbf{A}\Delta t}\mathbf{x}_k + \int_{k\Delta t}^{(k+1)\Delta t} e^{\mathbf{A}((k+1)\Delta t - \tau)}\mathbf{F}(\tau)\,d\tau. \tag{23}$$

The substitution $v = (k+1)\Delta t - \tau$ is made to simplify equation (23). This gives

$$\mathbf{x}_{k+1} = e^{\mathbf{A}\Delta t}\mathbf{x}_k + \int_{0}^{\Delta t} e^{\mathbf{A}v}\mathbf{F}(t + \Delta t - v)\,dv. \tag{24}$$

If we assume that $\mathbf{F}(t)$ is constant between the time-steps, the force can be discretized and the following equation is obtained,

$$\mathbf{x}_{k+1} = e^{\mathbf{A}\Delta t}\mathbf{x}_k + \left(\int_{0}^{\Delta t} e^{\mathbf{A}v}\,dv\right)\mathbf{F}_k, \tag{25}$$

where $\mathbf{F}_k$ is the force at time $t = k\Delta t$. The matrix $\mathbf{A}$ has full rank (see Remark 1), which implies that it is invertible, so the integral can be evaluated in closed form. The final equation describing the state of the system at the next step can then be written as

$$\mathbf{x}_{k+1} = e^{\mathbf{A}\Delta t}\mathbf{x}_k + \mathbf{A}^{-1}(e^{\mathbf{A}\Delta t} - \mathbf{I})\mathbf{F}_k, \tag{26}$$

where $\mathbf{I}$ is the identity matrix.

Remark 1 (Rank) [7] The rank of a matrix $\mathbf{A}$ is the dimension of the vector space spanned by the columns of the matrix. That a matrix is of full rank means that the dimension of this vector space equals the number of columns, i.e. that the columns are linearly independent.
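As an illustration, the exact discrete update (26) can be evaluated numerically with a matrix exponential; a minimal sketch using NumPy and SciPy, where the parameter values are taken from table 2 and the initial state from section 6.2:

```python
import numpy as np
from scipy.linalg import expm

# Parameters from table 2
m1, m2, k1, k2, c = 3.0, 2.0, 250.0, 1000.0, np.sqrt(500.0)

# System matrix A from equation (13), state x = [x1, x1_dot, x2, x2_dot]
A = np.array([
    [0.0,              1.0,      0.0,       0.0],
    [-(k1 + k2) / m1, -c / m1,   k2 / m1,   c / m1],
    [0.0,              0.0,      0.0,       1.0],
    [k2 / m2,          c / m2,  -k2 / m2,  -c / m2],
])

def step(x_k, F_k, dt):
    """One exact discrete step, equation (26), assuming F is constant over the step."""
    eAdt = expm(A * dt)
    F_vec = np.array([0.0, 0.0, 0.0, F_k / m2])
    return eAdt @ x_k + np.linalg.solve(A, (eAdt - np.eye(4)) @ F_vec)

x = np.array([0.0, 0.0, 0.1, 0.0])  # initial state as in section 6.2
x = step(x, F_k=0.0, dt=0.1)
```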


3 Deep Neural Network

In the background, the concept of artificial neural networks (ANN) was introduced and the comparison between the human brain and a neural network was made. A neural network will now be introduced in a more mathematically rigorous way. The purpose of a deep neural network (DNN) is to map some input $y_{in} \in \mathbb{R}^m$ to some output $y_{out} = \mathbf{N}(y_{in}; \theta) \in \mathbb{R}^n$, where $\mathbf{N} : \mathbb{R}^m \to \mathbb{R}^n$ is the fully connected feed-forward operator of the network with parameters $\theta$. This section will further examine the basic concepts and mathematical properties of a DNN.

Figure 3: Illustration of a neural network [10].

3.1 Architecture

The architecture of an ANN can be constructed in multiple ways to achieve different properties. One common network architecture is the feedforward neural network (FNN) [5], which simply means that information can only pass forward in the network. This type of architecture will be the foundation of the DNN in this thesis.


Figure 4: Neural network with 3 layers and defined variables.

A neural network is made up of several connected layers where each layer contains several nodes. Each node is connected to all the nodes in the previous layer and the next layer. When all the nodes are connected like this, the network is said to be fully connected. This is illustrated in figure 3 and 4.

The first layer of nodes in the network is called the input layer, and the last layer is called the output layer, denoted by $L$. The layers between the input and output layers are called the hidden layers. A layer is denoted by $l$, the node number in layer $l$ by $j$, and the node number in the previous layer $l-1$ by $k$.

Two terms that are useful when describing an ANN are width and depth. The depth of an artificial neural network is the number of layers it contains, and the width refers to the maximum number of nodes in any layer [5]. With these definitions, the network in figure 4 has a depth of 3 and a width of 3. The term DNN simply expresses that an artificial neural network has one or several hidden layers, that is to say, a depth of at least 3.

The weight between a node $k$ and the node $j$ is denoted $w^l_{jk}$. This weight can be seen as a type of priority function that rates how important the incoming information is to the node. The output of the previous node is denoted $a^{l-1}_k$. The output of the current node is obtained by multiplying the output from each previous node with the corresponding weight, summing the contributions from all the previous nodes, adding the bias of the current node, a constant denoted $b^l_j$, and feeding the result through an activation function $\sigma : \mathbb{R} \to \mathbb{R}$, giving the equation

$$a^l_j = \sigma\left(\sum_k w^l_{jk}\, a^{l-1}_k + b^l_j\right), \tag{27}$$

which tells us the output of a node given all the inputs to this node. This way of passing information is called a linear connection [5]. For ease of reading and to express equation (27) in a more compact form, we introduce a vector $z^l$ at layer $l$, a matrix $w^l$ containing all weights, a vector $a^{l-1}$ containing the outputs from the previous layer, and a bias vector $b^l$ containing all the biases in the layer. Using these vector notations the following equation is obtained,

$$z^l = w^l a^{l-1} + b^l. \tag{28}$$

Substituting the sum in equation (27) with the new expression (28), equation (27) can be expressed in the compact vectorized form

$$a^l = \sigma(z^l) = \begin{pmatrix} \sigma(z_1) \\ \sigma(z_2) \\ \vdots \\ \sigma(z_n) \end{pmatrix} \tag{29}$$

where $z_i$ is the $i$:th element of $z^l$ and $z_n$ denotes the last element.
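To illustrate equations (28) and (29), one fully connected layer is only a few lines of NumPy; a minimal sketch with arbitrary example shapes:

```python
import numpy as np

def layer_forward(w, b, a_prev, sigma=np.tanh):
    """Forward pass through one fully connected layer.

    Implements z^l = w^l a^{l-1} + b^l (eq. 28) followed by the
    element-wise activation a^l = sigma(z^l) (eq. 29).
    """
    z = w @ a_prev + b
    return sigma(z)

rng = np.random.default_rng(0)
w = rng.standard_normal((60, 6))   # weights: 60 nodes, 6 inputs
b = rng.standard_normal(60)        # one bias per node
a_prev = rng.standard_normal(6)    # outputs of the previous layer
a = layer_forward(w, b, a_prev)    # output of this layer, shape (60,)
```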

3.2 Activation Function

One can look at the activation function as a filter that decides how information should be passed from one node to another, or how the node should react to the information it gets. If the weight is a function that determines how important some information is, the activation function rates whether the weighted information is important enough to pass through the node or not. In more mathematical terms, the activation function creates the output of a node given the inputs from nodes in the previous layer.

Two commonly used activation functions are the sigmoid function and the Rectified Linear Unit (ReLU) function [5]. The sigmoid function is not a single function but a family of functions characterized by an S-shaped graph. Figures 5a and 5b show examples of a sigmoid function and the ReLU function, plotted using the expressions

$$f(z) = \tanh(z), \qquad \mathrm{ReLU}(z) = \begin{cases} 0 & z < 0 \\ z & z \geq 0 \end{cases} \tag{30}$$

Figure 5: Illustration of the characteristics of two different activation functions. (a) $f(z)$ as a function of $z$. (b) $\mathrm{ReLU}(z)$ as a function of $z$.

As seen in figure 5, the sigmoid function converges towards -1 for negative values and towards 1 for positive values. This means that whatever the input may be, the activation value lies between -1 and 1. As can be seen from figure 5b, the ReLU function returns 0 if $z < 0$ and $z$ if $z \geq 0$.


3.3 Output and Loss Function

The network aims to produce some satisfactory output, where the meaning of the word satisfactory depends on the task that the network should perform. In this section, the output of the network is defined, as well as the loss function, which is a way to measure whether the produced output was satisfactory or not.

Let the input to the network be some arbitrary $y_{in} \in \mathbb{R}^m$. This input is passed to the input layer, and the network maps it to the output $y_{out} \in \mathbb{R}^n$. The mapping is done through the network operator $\mathbf{N}$. Here this operator has the architecture of a feedforward network, and the network parameters, the weights $w$ and biases $b$, are collected into one parameter vector $\theta$. The output is then formulated as

$$y_{out} = \mathbf{N}(y_{in}; \theta). \tag{31}$$

The network operator depends on the depth, and the output is the composition of the hidden layers such that

$$\mathbf{N}(y_{in}; \theta) = (\sigma^{(L)} \circ z^{(L-1)}) \circ \cdots \circ (\sigma^{(2)} \circ z^{(1)})(y_{in}), \tag{32}$$

where $\circ$ stands for the composition operator and $z^l$ is defined in equation (28).

With the output from the network, it is necessary to introduce a way to quantify how well the network has performed. This is done through a loss function, also often referred to as a cost function [5]: a function that quantifies the performance of the network by some comparison to the expected value. The choice of loss function, much like the choice of activation function, is up to the constructor of the network.

One example of how the loss can be quantified is the mean squared error

$$L(y_{out}, y) = \frac{1}{n}\sum_{k=1}^{n}(y_{out,k} - y_k)^2 \tag{33}$$

where $y \in \mathbb{R}^n$ contains the desired values given the input $y_{in}$, and $n$ is the number of elements in $y$ and $y_{out}$. The goal is to make this loss function sufficiently small, which is done by training the network parameters $\theta$. The process of minimizing the loss is what is called training. This is achieved through some optimization algorithm, which will be further explained in section 3.5.3.
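For reference, the mean squared error (33) can be computed directly in PyTorch, the library used in the implementation; a minimal sketch with example values:

```python
import torch

y_out = torch.tensor([0.9, 1.2, -0.3])  # network output (example values)
y = torch.tensor([1.0, 1.0, 0.0])       # target values (example values)

# Mean squared error, equation (33)
loss = torch.mean((y_out - y) ** 2)
# Equivalent built-in: torch.nn.functional.mse_loss(y_out, y)
```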

3.4 Residual Neural Network

The FNN has already been introduced and defined, but there exist several versions of this type of architecture. As stated in the background, the one-step approximation method used here is based on the concept of a residual neural network (ResNet) [6], which is one version of an FNN.

(16)

Figure 6: ResNet with one block.

The ResNet structure is illustrated in figure 6 and builds on the principle of compounding the hidden layers into blocks, where each layer is fully connected. Before each block, an identity operator $\mathbf{I}$ is introduced, which connects the input $y_{in}$ of the block to the output of the block $y_{out}$ such that the following mapping is obtained:

$$y_{out} = y_{in} + \mathbf{N}(y_{in}; \theta), \tag{34}$$

$$y_{out} = (\mathbf{I} + \mathbf{N}(\cdot\,; \theta))\,y_{in}, \tag{35}$$

where $\mathbf{N}$ is the operator of the FNN. The matrix $\mathbf{I}$ is of size $\mathbb{R}^{m \times n}$.
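A one-block ResNet in the sense of equation (34) can be sketched in PyTorch as follows. The class name is ours, the layer sizes anticipate section 5.2, and letting only the first entries of the input carry the skip connection is our assumption for the case where the input also contains quantities other than the state:

```python
import torch
import torch.nn as nn

class OneBlockResNet(nn.Module):
    """y_out = y_in + N(y_in; theta), equation (34).

    N is a fully connected feedforward block with 3 hidden layers of
    60 tanh nodes, as in section 5.2. The skip connection here uses the
    first dim_out entries of the input (assumed to be the state part).
    """
    def __init__(self, dim_in, dim_out, width=60):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(dim_in, width), nn.Tanh(),
            nn.Linear(width, width), nn.Tanh(),
            nn.Linear(width, width), nn.Tanh(),
            nn.Linear(width, dim_out),
        )

    def forward(self, y_in):
        # Add the block output to the state part of the input
        dim_out = self.block[-1].out_features
        return y_in[..., :dim_out] + self.block(y_in)

model = OneBlockResNet(dim_in=4, dim_out=2)  # e.g. [theta, theta_dot, F, delta]
```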

3.5 Training the Network

The process of training a neural network is in simple terms to use some kind of algorithm to change the weights and biases until the network produces a satisfying result. In our case, a satisfying result would be that the difference between some exact value and a value produced by the network is small. This process of changing the network parameters is an iterative process, where for each iteration the network is fed new training data from a data set. This data set can be divided into smaller data sets, where each smaller set is called a batch. One reason for dividing the data set into batches is to make the computations less memory-intensive [5]. Once all batches have been fed through the network an epoch is said to be completed.

The network is then trained for a certain number of epochs, or until a satisfying value of the loss function has been reached. What the data set contains, how large it is, and how it is constructed depend on the task at hand. In general, the data must contain some input and some wanted output, also called the target. Without a target the training would serve no purpose, and without input the network would not produce an output.


3.5.1 Gradient-Based Learning

The goal of the training is as previously stated, to make the difference between the target value and the output small. One of the most popular ways to minimize this function is through gradient-based learning [5].

Figure 7: Illustration of gradient descent [14].

The idea behind gradient-based learning is that the gradient of the loss function gives the direction in which the loss function increases the most. Taking a small step in the opposite direction, changing the parameters proportionally to the gradient, will therefore decrease the value of the loss function. This can be visualized in figure 7, where each step is shown by a black arrow and the value of the loss function decreases with every step.

This process can then be repeated until a local minimum for the loss function is found i.e. the gradient becomes zero. For each of these iterations, new inputs and new targets are taken from the training data set and used in the loss function. This process will not guarantee that the network parameters have reached their optimal values, only that the weights and biases can not be easily further tweaked to reach better performance. The gradient can be calculated through different methods but one of the most commonly used is back-propagation.

3.5.2 Back-Propagation

Back-propagation is an algorithm used to compute the gradient of the loss function [11]. To calculate the gradient means to calculate the partial derivatives of the loss function with respect to the weight and bias of each node connection. The algorithm calculates the gradient by letting information from the scalar output of the loss function (51) flow backward through the network.

In order to explain the algorithm, some fundamental equations need to be stated. Firstly, we introduce the error in the output layer $L$,

$$\delta^L_j = \frac{\partial L}{\partial a^L_j}\,\sigma'(z^L_j), \tag{36}$$

which can be interpreted as the rate of change of the loss function with respect to the output, times the rate of change of the activation function at $z^L_j$. Since working in a matrix-based form is more convenient, equation (36) can be written in vectorized form as

$$\delta^L = \nabla_a L \odot \sigma'(z^L), \tag{37}$$


where $\nabla$ is the nabla operator and $\odot$ is the Hadamard product (see Remark 2). The aim is to calculate this error term in each layer. To accomplish this, the algorithm utilizes the error obtained in the output layer, such that the error in the previous layer can be computed as

$$\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l), \tag{38}$$

where the transpose of the weight matrix $w^{l+1}$ makes the error propagate backwards. In [11] it is shown that by utilizing the chain rule and differentiating the loss function, the following equations can be derived:

$$\frac{\partial L}{\partial b^l_j} = \delta^l_j, \tag{39}$$

$$\frac{\partial L}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j. \tag{40}$$

Equations (39) and (40) give the expressions for the partial derivatives of the loss function, and their right-hand sides can be calculated using equations (37) and (38). With the partial derivatives, the gradient is thereby also obtained. These steps can be condensed into the algorithm known as back-propagation, whose pseudo-code is presented in Algorithm 1.

Remark 2 (Hadamard Product) [12] The operator $\odot$ takes two matrices or vectors of the same size and multiplies them element-wise, such that $(A \odot B)_{ij} = (A)_{ij}(B)_{ij}$.

Algorithm 1: Back-propagation [11]

Send the input $y_{in}$ through the feedforward network to obtain $a^L$
Let $\delta^L = \nabla_a L \odot \sigma'(z^L)$
for $l = L-1, L-2, \ldots, 2$ do
  With the error from the following layer, compute $\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$
end for
$\frac{\partial L}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$
$\frac{\partial L}{\partial b^l_j} = \delta^l_j$
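In the implementation described later, these gradients are not computed by hand; PyTorch's autograd performs the back-propagation. A minimal sketch of how that looks, with a placeholder network and data:

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(2, 60), nn.Tanh(), nn.Linear(60, 2))
y_in = torch.randn(8, 2)   # a batch of placeholder inputs
y = torch.randn(8, 2)      # placeholder targets

loss = torch.mean((net(y_in) - y) ** 2)
loss.backward()            # back-propagation: fills p.grad for every parameter

for p in net.parameters():
    print(p.shape, p.grad.norm())  # gradients dL/dw and dL/db, eqs. (39)-(40)
```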

3.5.3 Optimization Algorithm

Once the gradient with respect to the network parameters $\theta$ is calculated, the weights and biases can be changed. How the weights are changed in each step depends on the optimization algorithm. One of the simpler algorithms, Stochastic Gradient Descent (SGD), uses only one hyper-parameter, called the learning rate [17].

For example, using SGD with a learning rate $\Delta$ to update the weights gives the following equation for each step,

$$w_{i+1} = w_i - \Delta\frac{\partial L}{\partial w_i}, \tag{41}$$

where $w_{i+1}$ is the new weight and $w_i$ is the weight from the previous step. Once the weights are updated, the gradient is once again calculated using back-propagation, and the process is repeated until some criterion is reached. This criterion is often that the loss function yields a small enough value. More sophisticated and complicated optimizers than SGD exist; these do more in each iteration than equation (41).


This thesis will utilize the optimizer algorithm Adam [8], which has proven to be a useful algorithm for many machine learning problems. The method uses individual learning rates for different parameters that are calculated using the first and second moments of the gradients. The Adam optimizer combines the strength of two other optimizers: AdaGrad [4] and RMSProp [18].

The hyper-parameters used in Adam are $\beta_1$, $\beta_2$, $\alpha$ and $\epsilon$. The hyper-parameters $\beta_1$ and $\beta_2$ are exponential decay rates for the moment estimates, and $\alpha$ is the learning rate. $\epsilon$ is a constant that can be seen as a stabilizer for the process, helping to avoid division by zero. What numerical values these parameters should have is up to the user of the algorithm, although default values exist [8].

At every time step $t$, the optimizer first calculates the gradient of the loss function, denoted $g_t$. The algorithm then updates exponential moving averages of the gradient and the squared gradient, denoted $m_t$ and $v_t$ respectively. These two moving averages are estimates of the 1st and 2nd moments, where the 1st moment is the mean and the 2nd moment is the uncentered variance. The optimizer also utilizes bias correction, since the moving averages are initialized at 0 and thereby biased towards 0. The corrected moment estimates are denoted $\hat{m}_t$ and $\hat{v}_t$. The algorithm is further explained using pseudo-code in Algorithm 2 below.

Algorithm 2: Adam as presented in [8]

Require: $\alpha$: step size
Require: $\beta_1, \beta_2 \in [0, 1)$: exponential decay rates for the moment estimates
Require: $f(\theta)$: stochastic objective function with parameters $\theta$
Require: $\theta_0$: initial parameter vector
$m_0 \leftarrow 0$ (initialize 1st moment vector)
$v_0 \leftarrow 0$ (initialize 2nd moment vector)
$t \leftarrow 0$ (initialize timestep)
while $\theta_t$ not converged do
  $t \leftarrow t + 1$
  $g_t \leftarrow \nabla_\theta f_t(\theta_{t-1})$ (get gradients w.r.t. stochastic objective at timestep $t$)
  $m_t \leftarrow \beta_1 m_{t-1} + (1 - \beta_1)g_t$ (update biased first moment estimate)
  $v_t \leftarrow \beta_2 v_{t-1} + (1 - \beta_2)g_t^2$ (update biased second raw moment estimate)
  $\hat{m}_t \leftarrow m_t/(1 - \beta_1^t)$ (compute bias-corrected first moment estimate)
  $\hat{v}_t \leftarrow v_t/(1 - \beta_2^t)$ (compute bias-corrected second raw moment estimate)
  $\theta_t \leftarrow \theta_{t-1} - \alpha\,\hat{m}_t/(\sqrt{\hat{v}_t} + \epsilon)$ (update parameters)
end while
return $\theta_t$ (resulting parameters)
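Algorithm 2 translates almost line by line into code; a minimal NumPy sketch for a single parameter vector, with the default hyper-parameter values from [8]:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One iteration of Algorithm 2 for a parameter vector theta."""
    m = beta1 * m + (1 - beta1) * grad           # biased 1st moment estimate
    v = beta2 * v + (1 - beta2) * grad**2        # biased 2nd raw moment estimate
    m_hat = m / (1 - beta1**t)                   # bias-corrected 1st moment
    v_hat = v / (1 - beta2**t)                   # bias-corrected 2nd moment
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)  # parameter update
    return theta, m, v

# Example: minimize f(theta) = ||theta||^2, whose gradient is 2*theta
theta = np.array([1.0, -2.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 201):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
```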


4 Approximating Dynamical Systems With Deep Neural Networks

In this section, the method for approximating the dynamical systems using deep neural networks will be presented. As mentioned in the background, there exist several different methods for approximating dynamical systems. This thesis will use a method built on the principles of the one-step approximation presented in [15].

4.1 One-Step Solution Using Effective Increment

The one-step solution is based on an autonomous system which means that the system (1) has no force input u(t). A later paper [16] also presented the same method for a non-autonomous system, which showed that the method is also valid for a dynamic system with an input force. Here, the general idea for an autonomous system will be presented to give a motivation for the method.

First we introduce a flow map $\Phi : \mathbb{R}^n \to \mathbb{R}^n$ that maps the state of equation (1) at time $t_0$ to the state at time $t$, such that

$$x(t) = \Phi_{t - t_0}(x(t_0)). \tag{42}$$

With this flow map we can get the state after a given time-step $\Delta \geq 0$ as

$$x(\Delta) = \Phi_\Delta(x(0)). \tag{43}$$

Integrating both sides of the differential equation (1) and equating the result to equation (43) gives

$$x(\Delta) = x(0) + \int_0^\Delta f(x(t))\,dt. \tag{44}$$

By utilizing the mean value theorem, the integral can be expressed as

$$x(\Delta) = x(0) + \Delta f(x(\tau)) \tag{45}$$

where $0 \leq \tau \leq \Delta$. Using the flow map (42) to obtain $x(\tau)$, the following expression is obtained:

$$x(\Delta) = x(0) + \Delta f(\Phi_\tau(x(0))). \tag{46}$$

The last part of equation (46) is defined as the effective increment,

$$\phi_\Delta(x; f) = \Delta f(\Phi_\tau(x)). \tag{47}$$

Using this definition, equation (46) can be rewritten as

$$x(t + \Delta) = x(t) + \phi_\Delta(x(t); f). \tag{48}$$

If the flow map $\Phi$, the state $x$ and the function $f$ are known, equation (48) gives the exact system state one time-step $\Delta$ ahead. It also follows that if the effective increment can be approximated, one thereby obtains a one-step approximation of the system state one time-step ahead.

It has now been shown that to make an approximation of the dynamical system described by equation (1), only the effective increment needs to be approximated. Using this approximation, one can march forward in time and thereby create multiple predictions of the system state, giving a discrete solution to the dynamical system.


4.2 One-Step Approximation Using ResNet

When ResNet was first introduced in [6], it was shown how the structure is implemented with multiple connected blocks. Here it will be motivated that a ResNet structure with one block is a suitable way of producing a one-step approximation. Recalling equation (48) and equation (34), we had the following equations:

$$x(t + \Delta) = x(t) + \phi_\Delta(x(t); f), \qquad y_{out} = y_{in} + \mathbf{N}(y_{in}; \theta).$$

By comparing equation (48), describing the exact state, with equation (34), describing the one-block ResNet architecture, one can see the similarity between the output of the residual network and the approximated state. Hence, when the network operator $\mathbf{N}$ is trained the right way, it can give an approximation of the effective increment such that

$$\mathbf{N}(x(t); \theta) \approx \phi_\Delta(x(t); f). \tag{49}$$

It has hereby been motivated that ResNet is a good choice of network structure, since the network operator $\mathbf{N}$ can be trained to produce an approximation of the effective increment, and thereby an approximation of the dynamical system described by equation (1).

4.3 Data Sets

For the application of predicting state variables on a trajectory, which this thesis aims at, the data set is defined in the following way:

$$S = \{x^j_i,\ x^j_{i+1};\ F^j_i,\ \delta^j_i\}, \quad j = 1, \ldots, J, \tag{50}$$

where $J$ is the total number of sampled data pairs, $x^j_i \in \mathbb{R}^d$ denotes the $i$:th state variable $x$ in the $j$:th data pair, and $x^j_{i+1} \in \mathbb{R}^d$ is the corresponding state variable one discrete time-step $\delta^j_i \in \mathbb{R}$ ahead. The force $F^j_i \in \mathbb{R}^b$ is the force acting on the system at the $i$:th time point.

The data can be further divided into training data and testing data. The purpose of this is in general to have one data set on which the network can be trained, and one separate data set on which the model can be validated.

The input to the network is then set as $y_{in} = [x^j_i, F^j_i, \delta^j_i]$, and the target, i.e. the desired value, is set to $y = [x^j_{i+1}]$. Using this training data and the loss function defined in equation (51), the network can be trained to produce approximations of the effective increment:

$$L(\theta) = \frac{1}{J}\sum_{j=1}^{J}\left\| y_{in} + \mathbf{N}(y_{in}; \theta) - y \right\|^2. \tag{51}$$

Equation (51) is a specification of the general MSE loss function defined in equation (33).
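A sketch of how the loss (51) can be computed for a batch, assuming the one-block ResNet model sketched in section 3.4, whose forward pass already returns the skip-connected prediction:

```python
import torch

def resnet_loss(model, y_in, y):
    """MSE loss of equation (51) for a batch of J data pairs.

    y_in:  tensor of shape (J, d + 2) holding [x_i, F_i, delta_i]
    y:     tensor of shape (J, d) holding the targets x_{i+1}
    model: the one-block ResNet from section 3.4, computing
           y_in + N(y_in; theta) in its forward pass.
    """
    prediction = model(y_in)
    return torch.mean(torch.sum((prediction - y) ** 2, dim=1))
```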


5 Implementation

The validation and implementation of the theory have been done through setting up the network in the form of equation (34). The goal is to approximate the solution to dynamical systems as presented in figure 1 and figure 2. This is done by collecting data from the analytical expressions and then feeding the data through the network and updating the parameters θ using an optimization algorithm. This section will explain how the data was gathered, how the network was set up, and how the training was executed to produce the results as presented in section 6.

The implementation has been done using the programming language Python (3.9.4). For implementing the network structure and the training process, many functions from the open-source library PyTorch (1.8.1) [13] have been used. NumPy (1.20.2) is another package that has been heavily utilized in the implementation. All of the code written for the implementation can be found in [1].

5.1 Data Collection

The data is collected in the same way for both systems; the only difference lies in the structure of the data sets and the parameters they contain.

5.1.1 Pendulum

The physical parameters used when creating the data set for the pendulum are presented in table 1.

Parameter:  g [m/s²]   l [m]   m [kg]
Value:      9.82       0.5     0.7

Table 1: Parameters used for the pendulum.

As stated in section 4.3, the data is collected in data pairs, where each pair is a part of the pendulum's trajectory. To create a data set, we first choose a starting angle $\theta_0$ at time $t = 0$, and a time interval, in this case $t \in [0, 2]$.

For $N$ different points, equation (10) is solved with the force $F(t)$ at time $t$ and at time $t + \delta$ for a chosen $\delta$, which gives $\theta(t)$ and $\theta(t + \delta)$. Equation (11) is also solved, which gives $\dot{\theta}(t)$ and $\dot{\theta}(t + \delta)$.

A list is then created, consisting of $[\theta(t_k), \theta(t_k + \delta_k), \dot{\theta}(t_k), \dot{\theta}(t_k + \delta_k), F(t_k), \delta_k]$, where $k$ is the $k$:th data pair, $t_k$ is the corresponding time point, and $\delta_k$ is the corresponding time-step. This list is created for $k = 0, \ldots, K - 1$, where $K$ is the number of data pairs to be included in the data set. The data set is then saved as a .txt file for ease of use and so that the program only needs to be run once to create a data file.

The value of $K$, i.e. the number of data pairs, depends on the application and is therefore up to the constructor of the network to decide.
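A sketch of this data collection, reusing the pendulum_state helper sketched in section 2.1; the file name and the uniform sampling of the time points are our own choices:

```python
import numpy as np

K, delta = 200, 0.1
theta0 = np.pi / 12
t_points = np.random.uniform(0.0, 2.0, size=K)  # K time points in [0, 2]

rows = []
for t_k in t_points:
    th, thd = pendulum_state(t_k, theta0, F0=0.0, alpha=1.0)
    th_next, thd_next = pendulum_state(t_k + delta, theta0, F0=0.0, alpha=1.0)
    # [theta(t), theta(t+d), theta_dot(t), theta_dot(t+d), F, delta]
    rows.append([th, th_next, thd, thd_next, 0.0, delta])

np.savetxt("pendulum_data.txt", np.array(rows))  # saved as a .txt file
```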

5.1.2 Damped Dual-Mass-Spring System

The physical parameters for the damped dual-mass-spring system are presented in table 2.

Parameter:  m₁ [kg]   m₂ [kg]   k₁ [N/m]   k₂ [N/m]   c [Ns/m]
Value:      3         2         250        1000       √500

Table 2: Parameters used for the damped dual-mass-spring system.


When collecting the data, we follow the same principles as for the pendulum. We set initial conditions for the state variables $x_0$ at time $t = 0$. Given the starting point, we march forward in time up to a given end value $t_{end}$ with the time-step $\delta$.

At each point in time $t$, the solution to equation (26) is collected together with the solution at time $t + \delta$ to create the data pair. Each pair makes up one list $[x_1(t_k), x_1(t_k + \delta_k), \dot{x}_1(t_k), \dot{x}_1(t_k + \delta_k), x_2(t_k), x_2(t_k + \delta_k), \dot{x}_2(t_k), \dot{x}_2(t_k + \delta_k), F(t_k), \delta_k]$.

5.2 Network Setup

The network is a fully connected neural network with 3 hidden layers. The number of nodes in the input and output layers depends on which of the systems we seek to approximate. Each hidden layer consists of 60 nodes for both systems.

The connection between each layer in the network is a linear connection. Each layer also has the same activation function, namely $f(x) = \tanh(x)$, which is a variant of the sigmoid function. In this implementation, the weights and biases are not initialized to specific values but are instead initialized by the PyTorch library itself.
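With the OneBlockResNet sketch from section 3.4, this setup would be instantiated as follows; the input sizes, state plus force plus time-step, follow the data sets of section 5.1, and PyTorch's default initialization is used simply by not overriding it:

```python
# Pendulum: state [theta, theta_dot] plus force F and time-step delta as input
pendulum_net = OneBlockResNet(dim_in=4, dim_out=2, width=60)

# Damped dual-mass-spring system: state [x1, x1_dot, x2, x2_dot] plus F and delta
spring_net = OneBlockResNet(dim_in=6, dim_out=4, width=60)
```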

5.3 Training Process

The process of training the network is in principle the same for the two different dynamical systems. The only difference is in the handling of the data. The training was performed with the optimizer Adam, with the default values of the hyper-parameters as presented in [8].

The training utilizes the concepts of batches and epochs as explained in section 3.5. In practice, this means that the collected data pairs are divided into batches of a certain size.

For each batch, the loss is calculated as equation (51) and the parameters, θ are updated through propagating backward and using the Adam optimizer for calculating the new values of θ.

When all batches have been fed through the network, one epoch is done. After each epoch, the data is shuffled so that the batches do not consist of the same pairs in every epoch. The training process is then repeated until either a maximum number of epochs has passed, or a stopping criterion $L \leq \gamma$ on the loss function has been reached, where $\gamma$ is the stopping criterion. The reason for this stopping criterion is to combat overfitting, which occurs when the network has trained too much on the training examples. When overfitting occurs, the network produces a small loss on the training data but performs poorly on the test data and the validation model.
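A sketch of this training loop, using the resnet_loss function from section 4.3; the batch size and the maximum number of epochs are example values:

```python
import torch

def train(model, y_in_all, y_all, gamma=1e-7, max_epochs=50_000, batch_size=32):
    """Train with Adam (default hyper-parameters from [8]) until L <= gamma."""
    optimizer = torch.optim.Adam(model.parameters())
    J = y_in_all.shape[0]
    for epoch in range(max_epochs):
        perm = torch.randperm(J)                 # shuffle the data pairs each epoch
        epoch_loss = 0.0
        for start in range(0, J, batch_size):
            idx = perm[start:start + batch_size]
            loss = resnet_loss(model, y_in_all[idx], y_all[idx])
            optimizer.zero_grad()
            loss.backward()                      # back-propagation
            optimizer.step()                     # Adam parameter update
            epoch_loss += loss.item() * idx.numel()
        if epoch_loss / J <= gamma:              # stopping criterion L <= gamma
            break
```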


5.4 Validation of the Trained Model

The validation process is as follows: for a starting state $x_0$ at time $t = 0$, we input the starting state, a time-step $\delta$, and a force $F_0$ at time $t = 0$. This produces an output that is then used as the input for the next iteration. This process is repeated $M$ times. At each of these points the force $F_i$ is used, as well as a constant $\delta$.

Algorithm 3: Validation model for a trained network operator $\mathbf{N}$

Require: $\mathbf{N}$: trained network operator
Require: $x_0$: state variable at time $t_0$
Require: $\mathbf{F}$: force at $M$ discrete time-steps
Require: $\delta$: time-step
for $i = 1, 2, \ldots, M$ do
  $x_i = x_{i-1} + \mathbf{N}(x_{i-1}, F_i, \delta)$
end for

If the model is well trained, the trajectory this produces is close to the actual trajectory obtained by solving equation (10) or (26) at the $M$ time points.

How to define the approximation accuracy is, like many parts of the implementation process, up to the constructor, and of course depends on the type of problem. In this thesis, the 'closeness' is measured as the difference between each point $\hat{x}_i$ in the approximated trajectory and the corresponding point $x_i$ in the actual trajectory, as seen in equation (52),

$$\epsilon_i = |\hat{x}_i - x_i|. \tag{52}$$
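Algorithm 3 as a rollout loop can be sketched as follows; the model is the trained network and the force sequence a placeholder:

```python
import torch

def rollout(model, x0, forces, delta, M):
    """Repeated one-step predictions, Algorithm 3: each output feeds the next step."""
    x = x0
    trajectory = [x0]
    with torch.no_grad():                        # no gradients needed at validation time
        for i in range(M):
            y_in = torch.cat([x, forces[i:i + 1], torch.tensor([delta])])
            x = model(y_in)                      # x_i = x_{i-1} + N(x_{i-1}, F_i, delta)
            trajectory.append(x)
    return torch.stack(trajectory)
```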


6 Results

All of the approximated trajectories in this section have been produced with the validation method presented in section 5.4. The number of one-step approximations made for each result varies depending on the setup. The following setups will be used and presented: approximation of the pendulum and the damped dual-mass-spring system with no external force, with two different time-steps and no external force, and with a time-variant external force.

In the result figures below, the analytical solution is plotted with a time-step of $\delta = 0.01$.

6.1 Pendulum

In this section, the approximations of the pendulum will be presented. In all figures, both the approximated trajectory and the absolute error as defined in equation (52) will be plotted.

6.1.1 No External Force and Constant Time-Step

The training data set used to produce the results in figure 8 contains only one trajectory, with amplitude $\theta_0 = \pi/12$ and an applied force $F = 0$. The data set contains values of $\theta(t)$ sampled in the time interval $t \in [0, 2]$ with a step size of $\delta = 0.1$. A stopping criterion $\gamma = 10^{-7}$ on the loss function was used when training.

The approximation has then been made with 50 one-step approximations with $\delta = 0.1$ and $F = 0$, giving an approximation in the time interval $t_{approx} \in [0, 5]$. The starting point of the approximation was chosen as $\theta(t = 0)$. In figure 8b the error as defined in equation (52) is shown.

Figure 8: Approximated trajectory and error with $F = 0$ and $\delta = 0.1$. (a) The approximated trajectory and the analytical solution. (b) Error in the approximated trajectory as a function of time.

As can be seen in the results from figure 8a the approximation is close to the trajectory created by the analytical solution, but the accuracy decreases over time. This can be seen more clearly in figure 8b.

6.1.2 No External Force and Varying Time-Step

For the next result, seen in figure 9, two trajectories were sampled for the data sets, each with a different time-step.

In figure 9, three approximations were made using the trained neural network. The approximations were made by choosing the starting point as $\theta(t = 0)$ and making 50 one-step approximations with three different time-steps. Two of the steps used were part of the training set, while the third, $\delta = 0.11$, was not. This gave the approximation intervals $t_{approx} \in [0, 5], [0, 5.5], [0, 6]$ for $\delta = 0.1, 0.11$ and $0.12$ respectively.

Figure 9: Approximated trajectory and error when $F = 0$ and $\delta = 0.1, 0.11$ and $0.12$. (a) Trajectory of the pendulum. (b) Error in the approximation.

As can be seen from the results in figure 9, there is a large difference in the error and in the accuracy of the prediction for the different time-steps. The best approximation is made for the smallest time-step and the largest time-step has the largest error, while the time-step that was not included in the training set produces a more accurate prediction than the largest one.

6.1.3 External Force Input

For the creation of a data set with a time-variant force, the process was much the same as for the data sets used to produce the results in figure 8. The data set contains only one trajectory, with amplitude $\theta_0 = \pi/12$ and an applied force given by the expression $F(t) = \frac{\theta_0}{8}\sin(2\omega t)$. The data set contains values of $\theta(t)$ sampled in the time interval $t \in [0, 2]$ with a step size of $\delta = 0.1$. A stopping criterion $\gamma = 10^{-7}$ on the loss function was used when training.

The approximation has then been made with 50 one-step approximations with $\delta = 0.1$ and $F(t) = \frac{\theta_0}{8}\sin(2\omega t)$, giving an approximation in the time interval $t_{approx} \in [0, 5]$. The starting point of the approximation was chosen as $\theta(t = 0)$.

(27)

Figure 10: Approximated trajectory and error when $F(t) = \frac{\theta_0}{8}\sin(2\omega t)$ and $\delta = 0.1$. (a) The approximated trajectory and the analytical solution. (b) Error in the approximated trajectory as a function of time.

In figure 10 we can see that when a force was applied to the pendulum, the network could not produce a good approximation, and that outside the training domain the approximation diverged from the analytical solution while still retaining its periodicity.

6.1.4 Different Stopping Criteria

In figure 11, the same data sets as for the results in figure 8 were used with three different stopping criteria. The values were $\gamma = 10^{-6}, 10^{-7}, 10^{-8}$.

Figure 11: Approximated trajectory and error when $F(t) = 0$ and $\delta = 0.1$ with three different stopping criteria. (a) The approximated trajectory and the analytical solution for different stopping criteria. (b) Error in the approximated trajectory as a function of time for different stopping criteria.

The results in figure 11 show that the stopping criterion is not necessarily proportional to the validation error; rather, a smaller loss can give a larger error, as seen in figure 11b. The behavior of the approximated trajectories also differs, where one can see in figure 11a that the trajectory approximated with the largest loss behaves differently from the others.

(28)

6.2 Damped Dual-Mass-Spring System

The damped dual-mass-spring system is seen to be a more challenging system to approximate than the pendulum. The training sets for all examples were sampled from trajectories with initial values $x_1 = 0$ and $x_2 = 0.1$ in the time interval $t \in [0, 2]$. The step size is set to $\delta = 0.1$ and the stopping criterion to $\gamma = 10^{-8}$. The network could not be trained to make a good one-step approximation outside of the training domain: the approximation diverges to an extremely large number directly after 2 seconds, as seen in figure 14. To show how well the method approximates the system within the training domain, only the first 2 seconds are presented. In all error plots, equation (52) has been used to produce the error.

6.2.1 No External Force and Constant Time-Step

With no input force to the system, the network was able to obtain a fairly good approximation of the positions $x_1$ and $x_2$, seen in figure 12. The error accumulates over time and reaches its largest value around 1.8 seconds for both masses.

Figure 12: Approximated trajectory and error with $\delta = 0.1$, $F = 0$. (a) The approximated trajectory of mass $m_1$. (b) Error in the approximated trajectory of mass $m_1$. (c) The approximated trajectory of mass $m_2$. (d) Error in the approximated trajectory of mass $m_2$.

(29)

6.2.2 Force Input

The results obtained when training the network with the same time-step as in section 6.2.1, but adding a time-varying input force $F(t)$ to the system, are seen in figure 13. The approximation makes the first step with a very low error, and then at the second step the error increases rapidly. The network seems to have difficulties following the system behavior and instead settles with low variations around 0.04 for $x_1$ and around 0 for $x_2$.

Figure 13: Approximated trajectory and error with $\delta = 0.1$ and $F = 10\sin(2\pi t)$. (a) The approximated trajectory of mass $m_1$. (b) Error in the approximated trajectory of mass $m_1$. (c) The approximated trajectory of mass $m_2$. (d) Error in the approximated trajectory of mass $m_2$.

(30)

6.2.3 Overshooting

The results in figure 14 show how the trained network is not able to handle data outside of the training domain. The solution becomes unstable and increases exponentially immediately after the 2 seconds on which the network has been trained.

Figure 14: Approximated trajectory with $\delta = 0.1$ and $F = 0$. (a) The approximated trajectory of mass $m_1$. (b) The approximated trajectory of mass $m_2$.


7 Discussion

One can ask why one would choose the validation model presented in this thesis when testing the accuracy of a one-step approximator. The reason is to analyze whether the approximations are accurate enough, and the network robust enough, to perform many approximations in a row, where the input to each approximation is itself an approximation and the error of each approximation accumulates. This validation model is thus simply an extension of the simpler one-step validation.

At the beginning of the thesis, we chose to examine this one-step approximation method on two dynamical systems with the intention to evaluate how well it would perform on systems of different complexity. The pendulum has lower complexity, and as seen in the results, the approximation of the pendulum yielded higher accuracy, and the network was better at finding the underlying system behavior than for the more complex damped dual-mass-spring system. In theory, we should be able to find a particular setup of a DNN that can do this approximation, since both papers [15] and [16] present several examples with more accurate approximations using this method, among them a damped pendulum and a forced oscillator. The difficult part of this thesis lies in finding the particular network that can produce an accurate approximation of the two dynamical systems presented here.

As can be seen from figure 11, overfitting was one of the large limitations when training. In these results, it can be seen that when the training loss is too small or too large, i.e. when overfitting or underfitting occurs, the prediction works very well in the time domain that it is trained on, but outside of that domain the accuracy decreases over time. This further increased the difficulty of the implementation. If no overfitting occurred, one could simply increase the number of training epochs until the loss was sufficiently small and the problem well trained. This meant that we not only had to find a network structure that could be trained to perform the task, but also an optimal loss value for each data set. For future implementations of the code, one could verify the model during training. For example, every 100:th epoch, a verification could be done and its error plotted against the epoch loss, to find an optimal stopping criterion.

One of the limitations of the model is the choice of the time-step. In most of the results, the time-step is set to $\delta = 0.1$. It produces sufficient resolution for the pendulum problem, but for the mass-spring system this time-step is seen to be too large to capture all the dynamical behavior seen in the trajectory of figure 12c in the time interval 0.1-0.2 seconds. In the process of training the network, no satisfactory results could be produced with smaller time-steps, even if the model were trained to an arbitrarily small validation error.

Another limitation lies in the Nyquist-Shannon sampling theorem, which states that in order to capture the behavior of a system, it has to be sampled at a frequency at least twice the system's maximal frequency. In this thesis the Nyquist-Shannon criterion was fulfilled, but as seen in [9], in practice one may need to sample at 10 times the system's maximal frequency for good results. This was realized late in the work and has not been taken into consideration.

The reason for not training on smaller time-steps is due to limitations in the ResNet architecture of the DNN. In [15] the authors expand the architecture to a multi-step ResNet to obtain an accurate result with a time-step $\delta = 1/30$ on a damped pendulum. To expand the work in this thesis, it would be interesting to see how the results would differ with, for example, a multi-step recurrent ResNet.


8 Conclusion

In this thesis, it has been shown that approximating a simple dynamical system, such as the linear pendulum, is possible using the proposed network and method, but that approximating a more complex system, such as the damped dual-mass-spring system, with the same network and method does not yield accurate results. For a more accurate prediction, network parameters such as depth and width need to be tuned. More testing should also be done with different values of the hyper-parameters of the optimizer algorithm.


References

[1] Max Berg Wahlström and Axel Dernsjö. KEX. 2021. URL: https://github.com/maxbergw/KEX (visited on 06/08/2021).

[2] Bo Chang et al. Multi-level Residual Networks from Dynamical Systems View. 2018. arXiv: 1710.10348 [stat.ML].

[3] Luca Dieci and Luciano Lopez. "A survey of numerical methods for IVPs of ODEs with discontinuous right-hand side". In: Journal of Computational and Applied Mathematics 236.16 (2012), pp. 3967-3991. ISSN: 0377-0427. DOI: 10.1016/j.cam.2012.02.011. URL: https://www.sciencedirect.com/science/article/pii/S0377042712000684.

[4] John Duchi, Elad Hazan, and Yoram Singer. "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization". In: Journal of Machine Learning Research 12.61 (2011), pp. 2121-2159. URL: http://jmlr.org/papers/v12/duchi11a.html.

[5] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. URL: http://www.deeplearningbook.org.

[6] Kaiming He et al. Deep Residual Learning for Image Recognition. 2015. arXiv: 1512.03385 [cs.CV].

[7] Howard Anton and Robert C. Busby. Contemporary Linear Algebra. Anton Textbooks, Inc, 2003.

[8] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. 2017. arXiv: 1412.6980 [cs.LG].

[9] David Krantz and Olof Månsson. Reduced Order Modelling using Dynamic Mode Decomposition and Koopman Spectral Analysis with Deep Learning. Student paper, 2020.

[10] Alexander LeNail. "NN-SVG: Publication-Ready Neural Network Architecture Schematics". In: Journal of Open Source Software 4.33 (2019), p. 747. DOI: 10.21105/joss.00747. URL: https://doi.org/10.21105/joss.00747.

[11] Michael Nielsen. How the Backpropagation Algorithm Works. Determination Press, 2015. URL: http://neuralnetworksanddeeplearning.com/chap2.html.

[12] Elizabeth Million. The Hadamard Product. 2007.

[13] Adam Paszke et al. "PyTorch: An Imperative Style, High-Performance Deep Learning Library". In: Advances in Neural Information Processing Systems 32. Ed. by H. Wallach et al. Curran Associates, Inc., 2019, pp. 8024-8035. URL: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.

[14] Christian S. Perone. Gradient Descent - Loss surface. Nov. 2020. URL: https://drive.google.com/file/d/1e_9W8q9PL20iqOR-pfK89eILc_VtYaw1/view.

[15] Tong Qin, Kailiang Wu, and Dongbin Xiu. "Data driven governing equations approximation using deep neural networks". In: Journal of Computational Physics 395 (2019), pp. 620-635. ISSN: 0021-9991. DOI: 10.1016/j.jcp.2019.06.042. URL: https://www.sciencedirect.com/science/article/pii/S0021999119304504.

[16] Tong Qin et al. Data-driven learning of non-autonomous systems. 2020. arXiv: 2006.02392 [eess.SP].

[17] Sebastian Ruder. "An overview of gradient descent optimization algorithms". In: CoRR abs/1609.04747 (2016). arXiv: 1609.04747. URL: http://arxiv.org/abs/1609.04747.

[18] Tijmen Tieleman and Geoffrey Hinton. "Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude". In: COURSERA: Neural networks for machine learning 4.2 (2012), pp. 26-31.

[19] Yonghui Wu et al. "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation". In: CoRR abs/1609.08144 (2016). URL: http://arxiv.org/abs/1609.08144.
