Method for Event Detection in Mechatronic Systems Using Deep Learning


STOCKHOLM, SWEDEN 2018

Method for Event Detection in Mechatronic Systems Using Deep Learning


Method for Event Detection in Mechatronic Systems Using Deep Learning

WILLIAM BRUCE, EDVIN VON OTTER

Master's Thesis at ITM. Supervisor: De-Jiu Chen. Examiner: Martin Törngren


Master Thesis MMK 2018:195 Method for Event Detection in Mechatronic Systems Using Deep Learning

William Bruce, Edvin von Otter

Approved: 2018-06-11
Examiner: Martin Törngren
Supervisor: De-Jiu Chen
Commissioner: Atlas Copco
Contact Person: Daniel Lundborg

Abstract

Artificial intelligence and deep learning are new drivers of technological change and find their way into more and more applications. These technologies have the ability to learn complex tasks that were previously hard to automate. In this thesis, deep learning is applied and evaluated in the context of product assembly, where components are joined together. The specific problem studied is the process of clamping using threaded fasteners.

The thesis evaluates several deep learning models, such as Recurrent Neural Networks (RNN), Long Short-Term Memory neural networks (LSTM) and Convolutional Neural Networks (CNN), and presents a new method for estimating the rotational angle at which the fastener mates with the material, also called the snug-angle, using a combined detection-by-classification and regression approach with stacked LSTM neural networks. The method can be implemented to perform precision clamping using angle tightening instead of torque tightening. This tightening method offers an increase in clamp force accuracy, from ±43% to ±17%.

Various estimation methods and inference frequencies are evaluated to offer insight into the limitations of the model. The top method achieves a precision of −0.05° ± 2.35° when estimating the snug-angle and can classify where the snug-angle occurs with 99.26% accuracy.


Method for event detection in mechatronic systems using deep learning (Swedish title: Metod för händelsedetektering i mekatroniska system genom djupinlärning)

William Bruce, Edvin von Otter

Approved: 2018-06-11
Examiner: Martin Törngren
Supervisor: De-Jiu Chen
Commissioner: Atlas Copco
Contact Person: Daniel Lundborg

Sammanfattning (Swedish Abstract)

Artificial intelligence and deep learning are new drivers of technological change and appear in more and more applications. These technologies have the ability to learn complex tasks that were previously hard to automate. This thesis investigates the possibility of using deep learning for the tightening of threaded fasteners.

The thesis evaluates several deep learning models, such as Recurrent Neural Networks (RNN), Long Short-Term Memory neural networks (LSTM) and Convolutional Neural Networks (CNN), and presents a new method for estimating the rotational angle at which the fastener mates with the material, also called the snug-angle. This is achieved with a combined detection-by-classification and regression approach based on stacked LSTM networks. The method enables precision tightening through angle control instead of torque control, improving the accuracy of the achieved clamp force from ±43% to ±17%.

Several estimation methods and output frequencies are evaluated to highlight the limitations of the algorithm. The method achieves an accuracy of −0.05° ± 2.35° and can classify an event with 99.26% precision.


Acknowledgements

First of all, we would like to thank our industrial supervisor Daniel Lundborg for believing in us and making the subject of our thesis possible.

We would like to thank our academic supervisor De-Jiu Chen for the valuable advice and feedback during this project.

To our thesis coordinator at KTH, Damir Nešić, thank you for the planning and execution of this year's thesis projects.

To Adam Klotblixt at Atlas Copco, thank you for the advice and guidance in the realm of threaded tighteners.

To our friends and family, thank you for the love, support and understanding during these five years of studies at KTH, as well as these intense last months.

Lastly, we would like to thank Ulf Samuelsson for your fascinating stories, introducing noise into our days at Atlas Copco, making us generalize better and achieve a deeper learning.


Contents

1 Introduction
1.1 Background
1.1.1 Conversion of clamping force to torque
1.1.2 Tightening methods
1.2 Purpose
1.3 Method
1.4 Delimitations
1.4.1 Tightening strategies
1.4.2 Implementation
1.4.3 Choice of input data
1.5 Ethics and Sustainability

2 Background Study
2.1 Introduction to Tightening
2.2 Introduction to Artificial Intelligence
2.2.1 Regression and Classification
2.2.2 Sequence Labeling
2.3 Event Detection in Sequential Data
2.4 Introduction to Neural Networks
2.5 Neural Network Training
2.5.1 Hyperparameter Search
2.5.2 Learning Rate Change Methods
2.5.3 Ensemble and Dropout
2.5.4 Training, test and validation
2.5.5 Neural Network evaluation
2.6 Recurrent Neural Networks
2.6.1 Long Short-Term Memory Recurrent Neural Networks
2.7 Convolutional Neural Networks
2.7.1 Convolutional layer
2.7.2 Pooling layer
2.7.3 Fully Connected Layer
2.8 Compressing Neural Networks
2.8.1 Pruning
2.8.2 Weight quantization
2.8.3 Huffman Coding

3 Method and Implementation
3.1 Event Detection Strategy
3.2 Models
3.2.1 Multilayered Perceptron (MLP)
3.2.2 RNN
3.2.3 LSTM
3.2.4 LSTM-MLP
3.2.5 Stacked LSTM
3.2.6 LSTM Fully Convolutional Network (LSTM-FCN)
3.3 Method
3.3.1 Data Acquisition and labeling
3.3.2 Setup of training algorithm
3.3.3 Hyperparameter Search Method
3.3.4 Full training of models
3.3.5 Full model prediction
3.4 Hardware used for Implementation
3.4.1 EVGA NVIDIA GTX 1080
3.5 Software, Libraries and Frameworks used for Implementation
3.5.1 Tensorflow and Keras
3.5.2 DEWESoft® and dwdatareader
3.5.3 Numpy and matplotlib

4 Results
4.1 Model Training for First Stage Models (Classification)
4.2 Model Training for Second Stage Models (Regression)
4.3 First Stage Model results on Identifying Snug-Segment
4.4 Second Stage Model Results on Identifying the Snug-Angle
4.5 Full model results

5 Discussion
5.1 The Training Process
5.2 Dataset Dependencies
5.3 Model performance

6 Conclusion
6.1 Deep Learning Architecture for Angle Detection
6.2 Implementation of Deep Learning Based Event Detection on a Mechatronic System

7 Future work
7.1 Implementation of the Proposed Strategy
7.2 Model Architecture and Training

Bibliography

Appendices


List of Figures

1.1 Example of a sectioned threaded joint
2.1 Ideal tightening curve
2.2 Diameter parameters
2.3 Examples of real data
2.4 Flowchart for different AI systems
2.5 Visualization of representations of data in a Convolutional Neural Network
2.6 Illustration of the input segment sliding over the tightening curve
2.7 Biological Neuron
2.8 Computational Model of Neuron
2.9 Structure of a 3-layered feed-forward neural network
2.10 Illustration of how the grid search can conceal the importance of certain hyperparameters
2.11 Dropout applied to a network
2.12 Classification space with two classes
2.13 Recurrent Neural Network
2.14 Different kinds of unfolded RNN architectures
2.15 LSTM block with one cell
2.16 Weights of a 3 × 3 × 3 convolutional filter
2.17 Convolutional operation
2.18 Max Pooling operation
2.19 Compression scheme for deep models
3.1 Flowchart of the snug-detector
3.2 Architecture of MLP model
3.3 Architecture of RNN model
3.4 Architecture of LSTM model
3.5 Architecture of LSTM-MLP model
3.6 Architecture of Stacked LSTM model
3.7 Architecture of LSTM Fully Convolutional Network model
4.1 Validation for each epoch during training of the first stage MLP model
4.2 Validation for each epoch during training of the first stage LSTM-MLP model
4.3 Validation for each epoch during training of the first stage LSTM model
4.4 Validation for each epoch during training of the first stage RNN model
4.7 Validation for each epoch during training of the second stage MLP model
4.8 Validation for each epoch during training of the second stage LSTM-MLP model
4.9 Validation for each epoch during training of the second stage LSTM model
4.10 Validation for each epoch during training of the second stage RNN model
4.11 Validation for each epoch during training of the second stage Stacked-LSTM model
4.12 Validation for each epoch during training of the second stage LSTM-FCN model
4.13 Error of prediction for prediction frequency 8 kHz
4.14 Error of prediction for prediction frequency 80 Hz
4.15 Error of prediction for prediction frequency 16 Hz
4.16 Error of prediction for prediction frequency 5.33 Hz


List of Tables

2.1 A selection of hyperparameters and their influence
4.1 Lowest validation loss for the first stage models
4.2 Lowest validation loss for the second stage models
4.3 Accuracy of first stage models
4.4 Mean absolute error of second stage models
4.5 Full model results
A.1 LSTM-FCN: Fine Search top 5 results
A.2 RNN
A.3 LSTM
A.4 LSTM-MLP
A.5 MLP


Glossary

ANN Artificial Neural Network
CNN Convolutional Neural Networks
CPU Central Processing Unit
FC Fully Connected Layers
GPU Graphics Processing Unit
LSTM Long Short-Term Memory
Prediction Common term for the output of a machine learning algorithm
RNN Recurrent Neural Networks


Chapter 1

Introduction

Efficient and reliable assembly of products is a prerequisite for our modern industrial society and has been a driver of the industrial revolution. Whether it is a smartphone or an airliner being produced, the customer expects a quality commensurate with its cost. Airplanes need to handle hundreds of flights per year, and mobile devices sustain harsher conditions than we give them credit for. As products become more advanced and environmental awareness demands longer life cycles to mitigate ecological impact, assembly processes need to evolve to meet those demands.

1.1 Background


Figure 1.1 – Example of a sectioned threaded joint

1.1.1 Conversion of clamping force to torque

When developing a design, manufacturing engineers establish specifications in the blueprints for the amount of clamping force required in various joints. When the design is received for assembly, the clamping force is converted to a tightening torque using standard tables. This is done because there is no appropriate way to directly measure the clamping force during assembly. The measuring methods available are expensive, require separate equipment or can only be used for testing because the measuring component must be fastened in the joint [1, 2].

1.1.2 Tightening methods

Tightening methods are distinguished by the goal variable of the process; when the goal is reached, the process is finished.

Torque control as a tightening method

Torque control as a tightening method is relatively inaccurate compared to other available methods. The reason for this is the relationship between the torque and the clamping force, which is subject to many material parameters. The clamping force can vary from ±17% to ±43% for a given tightening torque [3]. This results in low bolt utilization, meaning that the bolt needs to be stronger in practice than in theory to provide a significant overhead and minimize the risk of breaking. It also makes plastic region tightening unreliable, due to the unknown stress in the bolt [1, 4].

Angle control as a tightening method


methods among noisy data. The noise can, for example, be due to sensor anomalies or obstructing washers or dirt in the joint or thread. This method achieves a clamping force spread of ±9% to ±17% [1, 3].

1.2 Purpose

Deep learning has shown great promise in the analysis of time series data such as evolution of prices, natural language processing and other sequential data. Given the difficulty of writing rules that generalize well for estimating the angle at which snug occurs with numerical analysis, this thesis will investigate if deep learning can be applied to detect the event of snug during a bolt tightening using the sensor data available in the tool. To determine this, the following questions will be researched:

• How can deep learning be implemented in order to detect events in sensor data from mechatronic systems?

• Which deep learning architectures are suitable, in terms of accuracy or precision, for classifying sequential data to detect the angle where the snug fit occurs?

Mechatronics is "an interdisciplinary design methodology which solves primarily mechanically oriented product functions through the synergistic spatial and functional integration of mechanical, electronic, and information processing subsystems" [5]. In this thesis, the method is applied to sensor data collected on a mechatronic system in the form of the Atlas Copco PowerFocus 6000 [6].

1.3 Method

To address the research questions, a review of existing literature on time series analysis and prediction with deep learning will help narrow down the deep learning architectures that can be applied to time-series data.

The research in this thesis will revolve around evaluating different deep learning architectures found in the literature review. Once a few promising candidates have been chosen, tweaking the parameters or architectures of the algorithms may increase their performance until one candidate is concluded to be best suited for final implementation.

1.4 Delimitations

Some delimitations have been decided upon in order to be able to answer the research questions with as high a quality as possible.

1.4.1 Tightening strategies


The scope is limited to certain tightening strategies, which are discussed below.

The continuous drive strategy tightens the joint in one continuous rotation of the bolt until the objective variable is reached. This produces a smooth curve like the one shown in Figure 2.1.

Turbotight® is a tightening strategy that reduces the reaction torque at the very end of the tightening, making the tool more ergonomic for the operator. This strategy produces curves different from those of continuous drive, but they have the same appearance around the snug-angle. We believe this allows the algorithm to classify even these curves accurately, yet this was not tested in this thesis.

Pulse drive, as the name suggests, applies pulses of torque to tighten the joint. This also produces curves that are different from those of continuous drive, making the snug-angle appear very differently. Pulse drive tightenings are therefore excluded from the scope of this thesis.

1.4.2 Implementation

Due to the complexity of the target hardware and limited time, we will not implement the algorithm on the target hardware. We will, however, evaluate recommended steps towards an implementation with regard to the target hardware, and simulate real-time execution.

1.4.3 Choice of input data

The data acquired for training of the algorithm contains several sequences of data measured by the tool. We have decided to limit the input parameters of the network to torque and angle sequences because they are consistently present in all available data files. This is also the most common way of analyzing a tightening, and the sensors responsible for collecting this data are used in all tightening tools. This excludes any data on tool orientation, speed, current draw, voltage, etc. Our belief is that this will make for a well-generalizing model, applicable to various tightening strategies.

1.5 Ethics and Sustainability

Deep learning applications are highly dependent on large amounts of data. During the collection of the data, ethical aspects such as the privacy of the person providing the data need to be considered. In the case of this implementation, the tool operator's physical abilities might be of interest, since these will affect the tightening. The ethical consideration here is to acknowledge that the operator's identity is not of interest and that, if the data is mishandled, it might have implications for the operator.


that affects the efficiency.


Chapter 2

Background Study

This chapter contains the background study and literature review of previous work essential to the thesis. The chapter presents the theory revolving around tightening, neural networks and the methods associated with the purpose of the thesis.

2.1 Introduction to Tightening

This section will explain the required theory behind tightening and cover two metrics that can be used to measure the target of the tightening.

During a tightening, the screw will go through the different phases depicted in Figure 2.1 and explained below.

Figure 2.1 – Ideal tightening curve (torque plotted against angle, with phases 1–4 marked)

1. Rundown: the rundown or prevailing torque zone that occurs before the fastener head or nut contacts the bearing surface.

2. Snug: the fastener head or nut seats against the bearing surface and the joint components are drawn together; the angle at which this occurs is the snug-angle.

3. Elastic Clamping: the slope of the torque-angle curve is essentially constant as the bolt is elongated.

4. Yield: The stress is now so high that the bolt is deforming plastically and will break if tightened further.

The goal of this process is to tighten the bolt until just before, and in some cases until, it plasticizes, i.e. close to or inside phase 4. From phase 2 and onwards, the bolt head and the pitch of the thread pull the bolt apart and cause it to lengthen. Like a spring, the bolt's elasticity pulls the joint components together with the so-called clamp force.

When converting clamp force $F$ to tightening torque $T$, the relationship between them is approximated by

$$T = F\left(0.16P + 0.58\,d_2\mu_{th} + \frac{D_{Km}\mu_h}{2}\right), \qquad (2.1)$$

where $P$ is the pitch of the thread, $\mu_{th}$ and $\mu_h$ are the friction coefficients related to the specific bolt thread and bolt head respectively, while $d_2 = \frac{d + d_3}{2}$ and $D_{Km} = \frac{d_w + d_h}{2}$ as defined in Figure 2.2a and Figure 2.2b [1].

Figure 2.2 – Diameter parameters: (a) Thread parameters, (b) Bolt head parameters

When using angle control, the relationship between the angle $A$ and the clamp force $F$ is described by the equation

$$A = \frac{F}{k}, \qquad (2.2)$$

where $k$ is a material parameter of the combined joint and screw called the force rate.

The length of the rundown zone varies with several environmental parameters, such as the length of the screw, the extent of it that has entered the threading before the tool is introduced and the thickness of the joint. The snug-angle can therefore not be predetermined for a general joint.

Because zone 3 is distinguished by a sudden increase in $\frac{dT}{dA}$, one could assume it would be easy to pinpoint in the curve shown in Figure 2.1 using thresholds on $T$, $\frac{dT}{dA}$ or $\frac{d^2T}{dA^2}$. That is, however, not the case in practice, where the curve is affected by imperfections in the thread and sensor inaccuracy.

Figure 2.3 – Examples of real data. Left: a sudden increase in torque due to imperfections. Right: a very noisy curve, probably due to a low rotation speed.

Two curves taken from real data can be seen in Figure 2.3. Static analysis of this data could prove complicated and not robust due to variations in the curve caused by environmental characteristics.

2.2 Introduction to Artificial Intelligence

The field of artificial intelligence (AI) is bigger than ever and is being integrated into many applications. Andrew Ng, esteemed AI researcher and professor at Stanford University, has even dubbed AI the New Electricity, saying that "Just as electricity transformed almost everything 100 years ago, today I actually have a hard time thinking of an industry that I don't think AI will transform in the next several years" [7].


Figure 2.4 – Flowchart for different AI systems (rule-based systems, classic machine learning, representation learning and deep learning), with gray boxes indicating components that are able to learn [8].

In machine learning, the algorithm relies heavily on the representation of the data it is given. That means that the data needs to be structured so that each piece of information is a relevant representation that the algorithm can process. Each piece of information, also called a feature, is then correlated to an output. For many tasks, however, feature extraction is hard. For example, an object in an image such as a wheel is hard to describe in terms of pixel values, and even if that could be done, it would be demanding to find the wheel depending on light conditions, obstructions of view in the image or other factors that make the object less clear. An approach to tackle that is to use machine learning to learn the representation itself, not only the mapping between representation and output. This approach, called representation learning, allows greater and faster adaptation to new tasks.

When designing algorithms for learning features in data, the data must be processed to separate the factors of variation. The factors of variation are sources that influence the data and can explain variations in the dataset; e.g., in speech recognition, the speaker's voice is influenced by the speaker's gender and accent. For many tasks, separating these factors can become almost impossible. A factor of variation as sophisticated as the speaker's accent requires deep knowledge about the data and how the factor influences it.

Deep learning approaches this problem by creating representations out of other, simpler representations, as seen in Figure 2.5. In this way, the input is first separated into simpler concepts that are then combined into new concepts, eventually leading to a prediction [8].

2.2.1 Regression and Classification


Figure 2.5 – Visualization of how different layers represent the data in a Convolutional Neural Network [9]

• Classification: an algorithm determines to which of a set of categories $c$ an input belongs. The algorithm learns a function $f : \mathbb{R}^n \rightarrow \{1, ..., c\}$ and outputs a probability for each of the categories. For example, classification is used in image recognition to determine which of the previously learned classes has the highest probability of being present in the image.

• Regression: the algorithm predicts a continuous numerical value given some input. The algorithm learns a function $f : \mathbb{R}^n \rightarrow \mathbb{R}$. Regression can be used, for example, to predict future stock prices or the weather.

Deep learning algorithms are able to handle both of these types of tasks, and many more. For image classification, deep learning implementations surpass any other machine learning algorithm [10].

2.2.2 Sequence Labeling

In machine learning, sequence labeling involves tasks where sequences of data are transcribed with sequences of discrete labels. The goal for the algorithm is to assign a label to a sequence of input data. There are three general types of sequence labeling tasks; temporal classifica-tion, segment classification and sequence classification. Temporal classification is where the only information available is the target sequences, i.e. the input is classified as a whole and alignment is not interesting. Segment classification is a special case of temporal classification where the data is labeled with both targets and input-target alignments. Sequence classifi-cation is an special case of segment classificlassifi-cation and the most strict, where each input is assigned one, and only one, label [11].

2.3 Event Detection in Sequential Data

Event detection in sequential data generally involves two tasks:

• Firstly, to detect whether an event of interest has occurred,

• Secondly, to characterize the event by, for example, time, type or severity

It is generally a more demanding task to detect events than to classify them. Essentially, this is because the classification task has access to the boundaries within which a classification should occur, while in the detection task the boundaries are not known in advance [13].

There are two categories of event detectors: threshold-based and supervised learning-based [14]. Unsupervised learning-based event detection, that is, learning algorithms that find patterns in the data by themselves, has also been applied to event detection tasks. That approach is used where it is not known beforehand what kinds of events are of interest [15].

Threshold-Based Methods

Threshold-based methods use the assumption that an event will result in some change in the data that differs from the normal. The normal behavior can be described by thresholds, e.g. maximum values, rates of increase or combinations of these, based on historical data. If the data contains more than one variable, e.g. several sensors are used to produce the data, it is possible to have individual thresholds that together make up the event detector. An example of such a detector could be a fire alarm, where temperature, carbon monoxide and other appropriate sensors are read and compared to the historical normal. For a specific problem, data variables are weighted depending on their importance and combined to detect the event.

Threshold methods have the advantage of low computational cost and are simple to implement. However, they are highly dependent on the kinds of sensors involved, and it can be hard to specify rules for events. Often, events cannot be fully captured by threshold values [12].

Supervised Learning-based Methods

In supervised learning-based event detectors, the detector has access to annotated sequences where an event has occurred. Sequences from the data are sampled at a constant sampling rate into so-called frames or windows. For each window, characteristics (i.e. features) are extracted. These features are paired with annotation labels to which supervised learning can be applied [16]. However, even though an event is detected, it can occur anywhere in the window. To handle both event detection and timing of the event, there are two common approaches:


Figure 2.6 – Illustration of the input segment sliding over the tightening curve

• Detection-by-classification: a classifier is applied to each window to determine whether an event has occurred in it; the timing of the event is then only known down to the resolution of the window.

• Detection-by-classification with regression: this approach builds on the detection-by-classification approach. A classifier determines whether an event has occurred. Then a second classifier determines what kind of event occurred. Subsequently, a regression algorithm estimates where in the window the event occurred [18]. There are several machine learning algorithms that can be used as classifiers or regressors, some of which are Support Vector Machines (SVM), k-nearest neighbors, naïve Bayes and neural networks [14]. A general disadvantage of this approach is that the detector only focuses on one frame at a time and might, depending on the algorithm, miss information in previous frames [16].

2.4 Introduction to Neural Networks


Figure 2.7 – Biological Neuron [9]

In light of the biological neuron, the computational neuron acts similarly, see Figure 2.8. In the so-called forward pass, a signal $x_0$ travels over the axon and interacts with the dendrites of the other neuron. The level of interaction is determined by the "synaptic" strength $w_0$, which is a learnable variable. $w_0$ essentially determines the influence of the connection and is called a weight.

Figure 2.8 – Computational model of a neuron

The dendrites carry the signal multiplied by the synaptic strength, $w_0 x_0$, to the cell body, where the signals are summed. In the biological neuron, the neuron would "fire" a signal over its axon whenever the sum exceeded a certain threshold. In the computational neuron, however, the precise timing of the firing is unimportant; only the frequency of the firing, that is how often the neuron is activated, communicates information [9]. This concept, called rate coding, stems from the belief that biological neurons partly communicate through the frequency of their firings. The computational neuron therefore lets the sum pass through an activation function, $g$, which models this firing rate.
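As a minimal NumPy sketch (not code from the thesis), the forward pass of such a computational neuron can be written as a weighted sum followed by an activation; the choice of tanh as $g$ and the numbers below are purely illustrative:

```python
import numpy as np

def neuron_forward(x, w, b, g=np.tanh):
    """Forward pass of a single computational neuron.

    The inputs x are scaled by the learnable weights w ("synaptic" strengths),
    summed together with a bias b, and passed through the activation
    function g, which models the firing rate.
    """
    return g(np.dot(w, x) + b)

# Example: three incoming signals and their weights.
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.1, 0.4, -0.2])
print(neuron_forward(x, w, b=0.05))
```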


Figure 2.9 – Structure of a 3-layered feed-forward neural network (input layer, hidden layers, output layer).

By using this approach, an ANN made with these computational neurons, $f(x, W, b)$, where $x$ is the input and $W$ is the set of weights being learned, is able to approximate the function $f^*$. It has been shown that no matter the function $f^*$, there is a neural network $f(x, W, b)$ that for every possible value $x$ will give the output $f^*(x)$. ANNs have therefore earned the attribute of universal approximators [19].

2.5 Neural Network Training

Training of neural networks for learning a certain task can be done with three different approaches: supervised learning (where each input in the dataset has a paired target), reinforcement learning (where scalar reward values are provided for training) or unsupervised learning (where no information is given during training, and the algorithm will try to learn by only observing the data). In this thesis, the data is labeled, i.e. supervised learning can be used.


Algorithm 1 – Stochastic Gradient Descent
Require: Learning rate ε
Require: Initial parameters θ
while training is not done do
    Forward pass with mini-batch x: y_i = f(x_i, θ, b)
    Compute gradient g of the loss function
    Update θ with learning rate ε: θ ← θ − ε g
end while
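The loop of Algorithm 1 can be sketched in NumPy for a simple linear model; the toy data, the mini-batch size of 32 and the MSE loss below are illustrative stand-ins, not the networks or settings used in the thesis:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                 # toy inputs
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1       # toy targets

theta = np.zeros(3)     # initial parameters
lr = 0.01               # learning rate (epsilon in Algorithm 1)

for epoch in range(20):
    for start in range(0, len(X), 32):              # mini-batches of 32
        xb, yb = X[start:start + 32], y[start:start + 32]
        pred = xb @ theta                           # forward pass
        grad = 2 * xb.T @ (pred - yb) / len(xb)     # gradient of the MSE loss
        theta -= lr * grad                          # parameter update
```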

One of the most important parameters to set for the learning algorithm is the learning rate. Even though optimizers with adaptive learning rates seem to perform robustly, no optimizer has been dubbed the best, and the choice can be based on the user's knowledge about the algorithm [8, 20].

2.5.1 Hyperparameter Search

The process of training neural networks is very complex and involves choosing and tuning a large number of so-called hyperparameters. Hyperparameters are parameters external to the model that are most often set manually. There are several hyperparameters that can be considered for tuning; some of the most influential are presented in Table 2.1.

Figure 2.10 – Illustration of how the grid search can conceal the importance of certain hyperparameters

Table 2.1 – A selection of hyperparameters and their influence. With inspiration from [8].

Number of Hidden Units: The model's ability to learn different representations of the data varies with the size of the model. A larger model gives more capacity to learn. However, increasing the number of units increases the time needed to train the model.

Learning Rate: When moving towards a minimum in the cost function, the optimizer takes steps with a step size multiplied by the learning rate. This is a very influential parameter that affects the optimizer's ability to "get stuck" in or "break free" from local minima, with the possibility of finding an even lower minimum [9]. An improper learning rate results in a model with low efficiency.

Regularization: Regularization strategies such as dropout affect the generalization error and the model's ability to make general predictions.

The process of tuning the hyperparameters can be conducted in two ways: random search or grid search. In random search, the parameters are in general chosen randomly from a predetermined range of numbers, or sampled from a list of numbers. The other approach, grid search, only has a list of numbers for each hyperparameter and samples from those. Though the distinction may seem trivial, using the random approach allows for a greater understanding of which parameters influence the model and how. For example, when performing grid search, unimportant values may be over-analyzed, see Figure 2.10 [21]. A disadvantage of automatic hyperparameter search is that it is computationally very heavy.
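A random search over learning rate and model size can be sketched as follows; `build_and_evaluate` and the `dummy` function at the bottom are hypothetical helpers standing in for training a model and returning its validation loss, and the ranges are only examples:

```python
import random

def random_search(build_and_evaluate, n_trials=20):
    """Sample hyperparameters at random and keep the best trial."""
    best = None
    for _ in range(n_trials):
        params = {
            # Learning rate drawn log-uniformly from a predetermined range.
            "learning_rate": 10 ** random.uniform(-5, -2),
            # Number of hidden units sampled from a list.
            "hidden_units": random.choice([32, 64, 128, 256]),
        }
        val_loss = build_and_evaluate(**params)
        if best is None or val_loss < best[0]:
            best = (val_loss, params)
    return best

# Demo with a dummy evaluation function (stands in for actual training).
dummy = lambda learning_rate, hidden_units: abs(learning_rate - 1e-3) + hidden_units * 1e-6
print(random_search(dummy))
```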

2.5.2 Learning Rate Change Methods

During longer training, a common phenomenon is that training gets "stuck" in a local minimum [8]. By reducing the learning rate when stuck, the training can sometimes progress further and get out of that local minimum. This can be done either by applying a learning rate schedule, where the learning rate is changed after a set number of epochs, or by using a "reduce learning rate on plateau" approach [22, 23]. The latter decreases the learning rate when the learning has been stationary over a number of epochs.
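In Keras this behavior is available as a built-in callback; as a sketch, the factor and patience values below mirror those reported in Section 3.3.4 but are otherwise just an example:

```python
from tensorflow.keras.callbacks import ReduceLROnPlateau

# Reduce the learning rate by a factor of 0.2 if the validation loss
# has not improved for 6 consecutive epochs.
reduce_lr = ReduceLROnPlateau(monitor="val_loss", factor=0.2, patience=6)

# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, callbacks=[reduce_lr])
```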

2.5.3 Ensemble and Dropout


One way of reducing generalization error is to use an ensemble method, for example bagging [24]. Ensemble methods involve training several networks and averaging their results to make a final prediction. It can be shown that the ensemble on average performs at least as well as any of its members, and if the members make uncorrelated errors the ensemble will perform significantly better than its members [8]. When performing bagging, each model is trained on an independent subset of the dataset, thus ensuring that each model is missing information that could be found in examples that other models have access to. Small nuances in knowledge will be obtained, and different models will become better at certain types of examples. By averaging the model outputs, the algorithm becomes robust and generalizes well, even if each of its models individually generalizes poorly [24].

Another way of reducing generalization error is to perform what is called dropout. One of the disadvantages of bagging is that it becomes computationally expensive when the models are deep. Dropout solves this by training subnetworks of the underlying base network, created by removing non-output neurons from it, as seen in Figure 2.11. Every time a new minibatch is loaded, a random binary mask is applied to the network. The purpose is to mask the network so that a subnetwork, containing a fraction of the whole network determined by the mask, is created and trained with that minibatch. Unlike bagging, the model represented by the subnetwork shares parameters with the underlying network, thus allowing training of an exponentially growing number of models with a controllable amount of computational power.

         

Figure 2.11 – Dropout applied to a network. The base network is split into subnetworks that are used in training. The subnetworks share parameters, which enables training of many different models with a manageable amount of memory [8].
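In Keras, dropout is applied by inserting Dropout layers between the layers whose activations should be randomly masked during training; the layer sizes, input shape and dropout rate in this sketch are illustrative only:

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential([
    Dense(64, activation="relu", input_shape=(1500,)),
    Dropout(0.3),                  # randomly drop 30% of the activations (training only)
    Dense(64, activation="relu"),
    Dropout(0.3),
    Dense(1, activation="sigmoid"),
])
model.summary()
```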


2.5.4 Training, test and validation

When training a machine learning model, a common mishap is overfitting, where the algorithm learns the training data so well that it does not generalize and makes incorrect predictions on other data. A method to avoid that kind of behavior is using the holdout method, where the dataset is partitioned into three subsets [8]:

• Training set: used by the training algorithm to adjust the parameters of the model.

• Validation set: used to give an estimate during training of how well the model generalizes to other data. This is monitored during training to determine when learning is no longer improving.

• Test set: used to evaluate how well the model generalizes to other data and to compare it to other models.

By using this method, the models will be evaluated accordingly and generalization error can be minimized [8].

2.5.5 Neural Network evaluation

The performance of a machine learning algorithm is commonly evaluated on the basis of some form of metric. There are several to choose from, appropriate for different tasks. A metric function is similar to a loss function, with the difference that its result is not used for training. Metrics are human-interpretable values used to compare different models' accuracies. Since a model that generalizes well is the goal, the metric is calculated on a test set, containing samples randomly drawn from the whole data set and not present in the training set, to make sure that they are new from the model's perspective.

Metric for Classification

A classification made by an algorithm can either be a true positive/true negative (correctly estimated as either positive or negative) or a false positive/false negative (incorrectly estimated as either positive or negative), as visualized in Figure 2.12.

To evaluate the algorithm's ability to make good predictions, there are a number of metrics that can be used. A common choice is classification accuracy, which is simply measured as the proportion of examples for which the model produces the correct output:

$$\text{Accuracy} = \frac{tp + tn}{tp + tn + fp + fn} \qquad (2.3)$$

A problem with classification accuracy is that it does not take into account whether the dataset is balanced or not. The algorithm could learn to only predict positives or negatives and still get a high accuracy. Metrics such as precision and recall are better suited to reveal that kind of behavior. Precision is the algorithm's ability to make relevant predictions, i.e. the fraction of correct positive predictions out of all positive predictions:

$$\text{Precision} = \frac{tp}{tp + fp} \qquad (2.4)$$

Figure 2.12 – Classification space with two classes. The outlined area is the set of classifications made by the algorithm [25].

Recall is the algorithm's ability to retrieve correct predictions, i.e. the fraction of correct positive predictions out of all positive examples:

$$\text{Recall} = \frac{tp}{tp + fn} \qquad (2.5)$$

Another common metric is the F1-score:

$$F_1 = \frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}} \qquad (2.6)$$

which is the harmonic mean of precision and recall.
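The metrics in Equations 2.3–2.6 can be computed directly from the confusion counts; a minimal NumPy sketch (not the evaluation code used in the thesis):

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 from binary labels (0/1)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * recall * precision / (recall + precision)
    return accuracy, precision, recall, f1

print(classification_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))
```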

Metric for Regression

Accuracy, recall or precision is not as telling for regression as it is for classification. It is neither required nor likely for the algorithm to produce an output that equals the target down to the last decimal. Hence, the proportion of "correct" predictions is not a relevant metric for regression. The mean-squared-error (MSE) or mean-absolute-error (MAE) is used instead, signifying the mean numerical difference between the output values and the target values [8]:

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{x}_i - x_i\right)^2, \qquad \text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{x}_i - x_i\right|$$

Depending on the task at hand, either can be used. However, when used as the loss function during training, the implications can be significant depending on which is chosen. Consider the absolute error. Its derivative with respect to $\hat{x}$ is

$$\frac{d(\text{AE})}{d\hat{x}} = \frac{\hat{x} - x}{\sqrt{(\hat{x} - x)^2}} = \begin{cases} 1, & \hat{x} > x \\ -1, & \hat{x} < x \end{cases} \qquad (2.9)$$

preventing gradient descent from updating the parameters in proportion to the magnitude of the error. The squared error, on the other hand, has the derivative

$$\frac{d(\text{SE})}{d\hat{x}} = 2(\hat{x} - x), \qquad (2.10)$$

which changes with respect to the difference between the prediction and the target. This is further discussed in Section 3.3.2.

2.6 Recurrent Neural Networks

Recurrent Neural Networks (RNN) are a family of neural networks that specialize in processing sequential data, e.g. a sequence of values $x^{(1)}, ..., x^{(\tau)}$. An RNN is essentially an ANN where cyclic connections are allowed, which enables the network to have "memory" from previous time steps that persists as an internal state in the network and in turn influences the output of the network [8, 11]. Just as ANNs are universal approximators, an RNN with a sufficient number of hidden units can approximate any measurable sequence-to-sequence mapping to arbitrary accuracy [26].

Figure 2.13 – Recurrent neural network. The input x passes through the network to the output y via the hidden unit h. Hidden states from previous time steps are shared through the weights w2.

One of the strengths of the RNN structure is its flexibility, which allows many different kinds of inputs and outputs and makes it usable for a variety of tasks, from image captioning to machine translation, see Figure 2.14 [27, 28].
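A minimal NumPy sketch of the recurrence in Figure 2.13, with w1, w2 and w3 as the input-to-hidden, hidden-to-hidden and hidden-to-output weights; the dimensions and random initialization are chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hidden, n_out = 2, 8, 1
w1 = rng.normal(size=(n_hidden, n_in))      # input -> hidden
w2 = rng.normal(size=(n_hidden, n_hidden))  # hidden -> hidden (recurrent)
w3 = rng.normal(size=(n_out, n_hidden))     # hidden -> output

def rnn_forward(sequence):
    """Run the sequence through the RNN, returning one output per step."""
    h = np.zeros(n_hidden)                  # internal state ("memory")
    outputs = []
    for x in sequence:
        h = np.tanh(w1 @ x + w2 @ h)        # state depends on the previous state
        outputs.append(w3 @ h)
    return np.array(outputs)

print(rnn_forward(rng.normal(size=(5, n_in))).shape)   # (5, 1)
```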

2.6.1 Long Short-Term Memory Recurrent Neural Networks


Figure 2.14 – Different kinds of unfolded RNN architectures. Red boxes are inputs, green are hidden layers and blue are outputs. From left: regular neural network without recurrence (e.g. image classification), sequence output (e.g. an image is given as input and the output is words describing the image), sequence input (e.g. sentiment analysis of a sentence where the output is whether the sentence is positive or negative), sequence input and sequence output (e.g. machine translation where a sentence in one language is the input and the output, given after the last word of the input, is a sentence in another language), synced sequence input and output (e.g. video classification where each frame is classified) [29].

In practice, however, RNNs have trouble learning long-term dependencies. A popular way to tackle this problem, called the vanishing gradient problem, is to use Long Short-Term Memory (LSTM).

Figure 2.15 – LSTM block with one cell


2.7 Convolutional Neural Networks

Convolutional Neural Networks (CNN) share many of the properties of ordinary neural networks mentioned in Section 2.4: they employ learnable weights and biases, perform a dot product of inputs and said weights, express a differentiable function for their score and use a loss function in the last layer. The full architecture usually consists of multiple Convolutional Layers, Pooling Layers and Fully Connected Layers, described in detail below.

2.7.1 Convolutional layer

Looking back at ANNs, they have one input neuron for each value of the input. Consider a two-layered fully-connected ANN with 64 neurons in the hidden layer. If the input to this network were an image of dimensions 32 × 32 × 3 (the last dimension contains the RGB color channels), the number of weights from the input layer to the hidden layer would be 32 × 32 × 3 × 64 = 196608. A modern computer could handle this fairly well, but as the input image scales to a more common size of 600 × 400 × 3, the number of parameters in the network grows to just over 46 million. This is very computationally heavy to train, but can be decreased using a CNN.

Instead of having a specific weight for each individual input pixel, CNNs apply something called weight sharing. The underlying intuition behind this approach is that if a feature is useful to compute at position (x1, y1), it should also be useful to compute at another position (x2, y2). This is done using reusable filters that recognize the same feature at different positions of the input. A filter is a small matrix, typically of size 3 × 3 × 3, where the last dimension has the same meaning as above, while the first two dimensions are user defined and usually equal. A filter of that size contains 27 weight values, visualized in Figure 2.16. Multiple filters are used so that, during training, each filter tunes to recognize its related feature in the input image [9, 30].

The filter matrices are moved across and dot-multiplied with the input image to produce a weighted output as the sum of the products, illustrated in Figure 2.17. Note that the figure shows an input with one channel, for simplicity. The size of the output is dependent on the spatial parameters of the convolution layer.

Most of these parameters have been covered, while stride and zero-padding remain. The stride, S, is the number of steps the filter is moved in one dimension between each operation, while F is the spatial extent of the filter. A stride S < F means that a pixel value is reused in multiple convolutions, while a stride S = F uses each input pixel only once. S > F would cause the filter to move so much that some pixels would be skipped. In Figure 2.17, S = 1 and F = 3 are used [9].


Figure 2.16 – Weights of a 3 × 3 × 3 convolutional filter

Figure 2.17 – Convolutional operation with stride 1 and no zero-padding (input image 10 × 10 × 1, filter 3 × 3 × 1, output 8 × 8 × 1)

Zero-padding adds zeros around the border of the input so that the output can be kept the same size as the input. This is preferred because of the simplifications it provides when sizing multiple layers to work with each other [9].
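A single-channel version of the operation in Figure 2.17 (stride 1, no zero-padding) can be sketched in NumPy; this is for illustration only, ignores the bias term, and the filter values are arbitrary:

```python
import numpy as np

def conv2d_single_channel(image, kernel, stride=1):
    """Slide the filter over the image and sum the element-wise products."""
    f = kernel.shape[0]
    out_h = (image.shape[0] - f) // stride + 1
    out_w = (image.shape[1] - f) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + f, j * stride:j * stride + f]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.random.randint(0, 2, size=(10, 10))          # 10 x 10 binary input
kernel = np.array([[-1, 0, 2], [3, 0, 0], [0, 2, 1]])   # 3 x 3 filter
print(conv2d_single_channel(image, kernel).shape)        # (8, 8)
```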

2.7.2 Pooling layer


There are several ways of performing downsampling in a pooling layer. The most common method is Max Pooling, where the maximum value of a submatrix of the input is chosen as an element of the output. This process is illustrated in Figure 2.18. The pooling, as the convolution, is also carried out in patches, with a stride deciding how much to move between each operation. The shape of the submatrix can vary, and the most common shape is 2 × 2, thus downsampling and discarding 75% of the input [30].
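Max pooling with a 2 × 2 window and stride 2 reduces each 2 × 2 block to its maximum value; a minimal NumPy sketch, using an input consistent with the values shown in Figure 2.18:

```python
import numpy as np

def max_pool_2x2(x):
    """Downsample by keeping the maximum of each 2 x 2 block."""
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2      # crop odd edges
    blocks = x[:h, :w].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

x = np.array([[6, 1, 4, -2],
              [4, 3, 3, 2],
              [5, 1, 0, 7],
              [8, 0, 3, 5]])
print(max_pool_2x2(x))   # [[6 4]
                         #  [8 7]]
```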

Figure 2.18 – Max Pooling with a 2 × 2 pooling window and stride 2

2.7.3 Fully Connected Layer

The CNN architecture commonly stacks pairs of Convolutional layers and Pooling layers until the input has been reduced to a small number of parameters. The problem mentioned at the beginning of this section can now be avoided by using Fully Connected layers (FC) (described in Section 2.4), which often form the final step of the model. One or a series of FC layers take the output of the last Pooling layer and propagate it to the output of the last FC layer. This output is passed through an activation function and interpreted as the input's association with each class [9].

2.8 Compressing Neural Networks


Figure 2.19 shows the compression scheme used in [33]. These methods are explained briefly below.

Figure 2.19 – Compression scheme from [33]

2.8.1 Pruning

Many models end up with more neurons than required to fit the data. Some of their corresponding weights can therefore take values close to zero during training, i.e. have a low contribution to the output. These weights are still used for computations, but can be removed by pruning to reduce the number of parameters. To ensure the accuracy is not negatively affected, pruning is performed by removing a few weights at a time. After each removal, the model is trained for a few epochs to adjust the remaining weights. Weights can be removed while monitoring the validation loss to maintain performance, or until a certain decrease in size is achieved [36].

By performing pruning, the number of weights is decreased, meaning that fewer operations need to be performed for inference. It also means the model gets smaller as there are fewer parameters to store.

2.8.2 Weight quantization

Weights are usually stored in a high precision format, such as float32 or higher. This is beneficial during training to allow for high precision adjustments of the weights by the optimizer. When the model is fully trained, a lower precision format can be used for inference by quantizing the weights. This is done by extracting the maximum and minimum values of each weight matrix, input and output and storing them with high precision. The rest of the high precision weights are then converted to a lower precision range, for example uint8, which ranges from 0 to 255. 255 would then correspond to the high precision maximum, 0 to the minimum, and everything in between can be derived linearly from the represented range [37]. Converting from float32 to uint8 decreases the model size by almost 4×.
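The linear uint8 quantization described above can be sketched in NumPy; the random weight matrix below stands in for a trained layer and the code is illustrative, not the quantization routine used in the thesis:

```python
import numpy as np

def quantize_uint8(w):
    """Map float32 weights linearly onto the 0-255 range."""
    w_min, w_max = float(w.min()), float(w.max())   # stored in high precision
    scale = (w_max - w_min) / 255.0
    q = np.round((w - w_min) / scale).astype(np.uint8)
    return q, w_min, scale

def dequantize(q, w_min, scale):
    """Recover approximate float32 weights for inference."""
    return q.astype(np.float32) * scale + w_min

w = np.random.randn(256, 256).astype(np.float32)
q, w_min, scale = quantize_uint8(w)
print(np.abs(dequantize(q, w_min, scale) - w).max())   # small reconstruction error
```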


Quantization can also be part of the training loop, as described in [33], which offers the opportunity to use weight sharing and achieve even smaller bit depths.

2.8.3 Huffman Coding

Data structures are often compressed with Huffman code, an optimal prefix code used for lossless data compression. It encodes source symbols based on probability, using fewer bits to encode common characters than less common ones [39]. According to [33], this saves storage by 20-30% depending on weight distribution. The weight distribution can be optimized for compression by rounding the weight values into predetermined step sizes. This results in repeating bit patterns that are highly compressible [40].


Chapter 3

Method and Implementation

3.1 Event Detection Strategy

The strategy used in this thesis is based on the detection-by-classification with regression method discussed in Section 2.3. As the torque and angle sensors collect data, a queue is filled. The latest 1500 datapoints are continuously fed to the classification model, which determines whether the current "window" contains the snug-angle. If so, the window is provided to the regression model, which estimates the angle. A flowchart of the process can be seen in Figure 3.1. The choice of 1500 as the window length was established by an early hyperparameter evaluation not described in this report.

Figure 3.1 – Flowchart of the snug-detector. The classification model determines whether the snug-angle is in the current sliding window; if so, the regression model estimates at what angle in the window snug occurs.
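The strategy in Figure 3.1 can be sketched as a loop over a sliding window; `classifier` and `regressor` are hypothetical stand-ins for the two trained models (e.g. Keras models with a (window, 2) input), and `step` corresponds to the prediction frequencies evaluated in Section 3.3.5:

```python
import numpy as np

WINDOW = 1500   # number of (torque, angle) samples per window

def detect_snug(samples, classifier, regressor, step=100):
    """Slide a window over a run and estimate the snug-angle.

    samples: array of shape (n, 2) with torque and angle values.
    Returns one snug-angle estimate per window classified as positive.
    """
    estimates = []
    for end in range(WINDOW, len(samples) + 1, step):
        window = samples[end - WINDOW:end]                 # latest 1500 points
        x = window[np.newaxis, ...]                        # add batch dimension
        if classifier.predict(x)[0, 0] > 0.5:              # stage 1: is snug in window?
            angle_in_window = regressor.predict(x)[0, 0]   # stage 2: where in window?
            estimates.append(angle_in_window)
    return estimates
```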


3.2 Models

A variety of models were developed for testing on the tasks at hand. All models were trained both for the classification task and the regression task. The layer sizes and other hyperparameters were established during the hyperparameter search described in Section 2.5.1. These models are presented below.

3.2.1 Multilayered Perceptron (MLP)

The MLP, seen in Figure 3.2, is four layers deep to enable it to learn many representations; however, without recurrent connections it is believed to lack the time dependencies that RNNs or LSTMs capture [42]. MLPs have been used with good results on time series and sequence labeling in previous implementations [43, 44].

Figure 3.2 – Architecture of MLP model (Input → FC → FC → FC → Output)

3.2.2 RNN

The RNN model, seen in Figure 3.3 has an RNN block consisting of a number of units and uses a fully connected layer to give output.

Figure 3.3 – Architecture of RNN model (Input → RNN-block → FC → Output)

3.2.3 LSTM

As described in Section 2.6.1, the RNN can have trouble with long-term dependencies. An LSTM block was used to compare these two architectures on the two stages of the proposed model. The architecture is shown in Figure 3.4.

Figure 3.4 – Architecture of LSTM model (Input → LSTM-block → FC → Output)

3.2. MODELS

3.2.4 LSTM-MLP

To allow for more abstract features to be identified in the input data, combined with long-term dependencies, this architecture has the output of the LSTM-block fed into a stack of three FC layers before producing an output.

Figure 3.5 – Architecture of LSTM-MLP model (Input → LSTM-block → FC → FC → FC → Output)

3.2.5 Stacked LSTM

Proposed in [45, 46], stacked LSTM network can be used in temporal tasks. This model stacks three LSTM networks and produces an output with a FC layer.

Figure 3.6 – Architecture of Stacked LSTM model (Input → LSTM-block → LSTM-block → LSTM-block → FC → Output)
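A Keras sketch of the stacked LSTM architecture in Figure 3.6; the unit count and window/feature dimensions are placeholders, since the actual sizes were set by the hyperparameter search:

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dense

def build_stacked_lstm(window=1500, n_features=2, units=64):
    """Three LSTM blocks in series followed by a fully connected output."""
    return Sequential([
        LSTM(units, return_sequences=True, input_shape=(window, n_features)),
        LSTM(units, return_sequences=True),
        LSTM(units),        # last block returns only its final state
        Dense(1),           # FC output (a sigmoid would be added for the classifier)
    ])

model = build_stacked_lstm()
model.summary()
```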

3.2.6 LSTM Fully Convolutional Network (LSTM-FCN)

A parallel-type architecture combining an LSTM and a CNN, proposed in [47], which showed great promise in classification of sequential data. The pooling layer employs global average pooling [48].

Figure 3.7 – Architecture of LSTM Fully Convolutional Network model

3.3 Method

This section describes the method used for evaluation of the deep learning models and the event detector. It can be divided into the following sub-processes:

• Data Acquisition and labeling: the data was preprocessed and labeled so that supervised learning could be conducted.

• Setup of training environment: the cost function and optimizer were chosen so that training could be carried out in an efficient manner.

• Hyperparameter search: hyperparameters were searched for all architectures, on all tasks, to obtain good training and model parameters.

• Full training of models: the best performing models were trained to get the best possible performance on both tasks.

• Full model prediction evaluation: the models with the best performance in the full training on the two tasks were implemented in the event detector and evaluated.

The processes are described in more detail below.

3.3.1 Data Acquisition and labeling

A dataset of tightening runs executed with continuous drive was obtained. Each file of type .dxd contained several runs and was annotated with run-start and run-end events for each of them. These annotations allowed for effortless extraction of each run into separate files. .dxd is a proprietary file format used by a data recording device manufactured by DEWESoft®. It can be interpreted in Python using the dwdatareader module described in Section 3.5.2.

To verify the quality of the data, a script was designed to read each run from file and plot the curve and target angle on screen. This allowed the authors to root out runs that used a tightening strategy not in the scope of this thesis or that were otherwise unfit for training. This reduced the size of the dataset to a final 41 153 runs.

Along with each run, a target value for training, i.e. the ground truth, was determined. This value was the angle at which the clamp force exceeded a threshold of 0.2 Nm and was determined from clamp force measurements in the .dxd file, available since the data was recorded in a test rig capable of measuring clamp force. The threshold of 0.2 Nm was suggested by an advisor with excellent knowledge on the subject. This ground truth could easily be converted depending on the task for which the model was being trained.


3.3.2 Setup of training algorithm

As described in Section 2.5, training of deep neural networks is done by feeding the network an input, predicting an output, computing the gradient of the cost function (a function closely related to some error E) and updating the parameters in the model with the computed gradients. Two choices had to be made: firstly, the choice of cost function and, secondly, the choice of optimization algorithm.

Choice of Cost Function

Gradient descent minimizes a differentiable function, where in machine learning that function is related to the error of prediction. It is also known as the loss function, error function or objective function. It is therefore advantageous for the algorithm to know in which direction the error is decreasing, i.e. the cost function should be differentiable [8].

Cross-entropy loss measures the closeness of the probabilities of class membership output by the algorithm. It is defined as

L = -\left( x \ln(\hat{x}) + (1 - x)\ln(1 - \hat{x}) \right) \qquad (3.1)

for the output x̂ and the target x. Cross-entropy is the most common cost function used for classification, preferred for its differentiability and robustness [49].

For regression, however, the output is not a probability but a numerical prediction. A different cost function is then used, one that relates to the difference between the continuous target and output values.

For this thesis, cross-entropy loss and mean squared error have been used for classification and regression, respectively.
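
For reference, the two cost functions can be written out in a few lines of NumPy. This is only an illustration of the definitions; the actual training used the loss implementations provided by Keras.

```python
import numpy as np

def binary_cross_entropy(x, x_hat, eps=1e-12):
    """Binary cross-entropy, cf. Equation (3.1), averaged over a batch."""
    x_hat = np.clip(x_hat, eps, 1.0 - eps)  # avoid log(0)
    return float(np.mean(-(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))))

def mean_squared_error(y, y_hat):
    """Mean squared error between continuous targets and predictions."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return float(np.mean((y - y_hat) ** 2))
```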

Choice of Optimizer

For this thesis, Adam was chosen as the optimizer. Adam, whose name is derived from adaptive moment estimation, computes individual learning rates for the parameters from estimates of the first and second moments of the gradients [50]. The choice was based on the fact that adaptive learning rate optimizers in general outperform other optimizers, and Adam outperforms the other adaptive learning rate optimizers [50, 20].

3.3.3 Hyperparameter Search Method

In this thesis, a combination of grid and random search was used. The search was conducted in several steps; in each step, the range of values used in the search was narrowed according to which hyperparameters gave the model the best performance in terms of the loss calculated on the validation dataset:

1. The first step is a coarse search with the widest range of hyperparameter values. The models are trained for 5 epochs and the top 10 are selected for evaluation.


3. The top 10 from the finer search are evaluated and a new range is selected. The models are trained for 10 epochs.

The top 3 models from the fine search were then selected to go through a long training. The hyperparameters over which the search was conducted were the learning rate and the size of the model.
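
The sketch below illustrates one coarse step of such a search as a simple random search over discrete candidate values. The value ranges are illustrative, and build_model and train_short are hypothetical helper functions that construct a model and train it briefly, returning the validation loss.

```python
import random

# Illustrative candidate values for one coarse search step.
LEARNING_RATES = [1e-2, 1e-3, 1e-4]
HIDDEN_UNITS = [32, 64, 128, 256]

def coarse_search(build_model, train_short, n_trials=20):
    """Train `n_trials` randomly drawn configurations for a few epochs
    and return (val_loss, learning_rate, units) tuples, best first."""
    results = []
    for _ in range(n_trials):
        lr = random.choice(LEARNING_RATES)
        units = random.choice(HIDDEN_UNITS)
        model = build_model(units=units, learning_rate=lr)
        val_loss = train_short(model, epochs=5)  # short training run
        results.append((val_loss, lr, units))
    return sorted(results)
```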

3.3.4 Full training of models

The top 3 models for each task found in the hyperparameter search were then selected for full training. This was conducted with the ”reduce learning rate on plateau” approach presented in Section 2.5.2, where the learning rate was reduced by a factor of 0.2 if the validation loss did not improve for 6 epochs, and with a dropout rate of 0.3. In general, training was performed for 100-300 epochs, and after each completed epoch the validation loss was computed. If the validation loss improved, all model parameters (i.e. the weights) were saved, to ensure that the best possible model was available for implementation in the full event detector.
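
A minimal Keras sketch of this training setup is given below, assuming the callback-based API described in Section 3.5.1. The checkpoint file name is a placeholder, and the fit call is commented out since the data and model objects are not defined here.

```python
from keras.callbacks import ReduceLROnPlateau, ModelCheckpoint

# Reduce the learning rate by a factor of 0.2 if the validation loss has
# not improved for 6 epochs, and save the weights whenever it improves.
callbacks = [
    ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=6),
    ModelCheckpoint('best_model.h5', monitor='val_loss',
                    save_best_only=True),
]

# model.fit(x_train, y_train, epochs=300,
#           validation_data=(x_val, y_val), callbacks=callbacks)
```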

3.3.5 Full model prediction

The full model, i.e. the classifier and regressor in series, was evaluated by simulation. This was carried out by employing the strategy described in Section 3.1 on the test set. As before, the queue size was 1500 timesteps. The number of new values in the queue each time the model evaluated it was set to 1, 100, 500 or 1500. Note that when this value equals the queue size, 1500, the sliding window is non-overlapping. Since the dataset was collected at a sample rate of 8 kHz, this simulates prediction frequencies of 8 kHz, 80 Hz, 16 Hz and 5.33 Hz, respectively. As predictions were made during simulation, several estimates of the snug-angle were produced by the model for each run. The first and last prediction, as well as the mean and median of the predictions, were computed for each run to evaluate how these estimation metrics compared to each other.
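
The simulation loop can be sketched as below, where predict_fn is a hypothetical wrapper around the classifier-regressor pair that returns one snug-angle estimate per window; the default window and step sizes are those discussed above.

```python
import numpy as np

def simulate_predictions(signal, predict_fn, window=1500, step=100):
    """Slide a window over one run, advancing by `step` new samples per
    evaluation, and summarise the per-window snug-angle estimates."""
    estimates = []
    for end in range(window, len(signal) + 1, step):
        estimates.append(predict_fn(signal[end - window:end]))
    if not estimates:
        return None  # the run is shorter than one window
    estimates = np.asarray(estimates, dtype=float)
    return {
        'first': float(estimates[0]),
        'last': float(estimates[-1]),
        'mean': float(np.mean(estimates)),
        'median': float(np.median(estimates)),
    }
```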

3.4 Hardware used for Implementation

The recent years’ surge in neural network and deep learning applications is largely due to the availability of the required computational power, which enables larger networks and better performance [8]. With investments from the commercial, research and open-source communities, there has been an acceleration in both hardware and software development for machine learning applications.

3.4.1 EVGA NVIDIA GTX 1080


Deep learning workloads have exactly these characteristics, with large sets of parameters and variables that need to be updated at each training step.

For this thesis, the EVGA NVIDIA GeForce GTX 1080 [51] was used. This is a general-purpose GPU that can run code with purposes other than graphics rendering. The GTX 1080 has 8 GB of memory and a base clock of 1708 MHz. NVIDIA provides a parallel computing platform and programming model, CUDA, that can be used to implement deep learning models for training on the GPU. Several software libraries, such as Torch, Theano and TensorFlow, implement and run highly optimized CUDA code.
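
As a practical note, whether TensorFlow actually sees the GPU can be checked with a few lines using the TensorFlow 1.x API that was current at the time; this is a convenience check, not part of the method itself.

```python
import tensorflow as tf
from tensorflow.python.client import device_lib

# List the devices TensorFlow can see; a CUDA-enabled build should report
# the GTX 1080 as a GPU device alongside the CPU.
print(device_lib.list_local_devices())
print('GPU available:', tf.test.is_gpu_available())
```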

3.5 Software, Libraries and Frameworks used for Implementation

The software in this thesis was largely written in Python. Python was chosen because it is by far the most used language for developing machine learning applications [52]. It has large community support, both scientific and commercial, with many libraries and frameworks and easy access to help. This section describes some of the libraries and frameworks that were used in the thesis.

3.5.1 Tensorflow and Keras

As described in Section 3.4.1, there are many software libraries for writing highly efficient code for deep learning training. In this thesis, the TensorFlow library is used. TensorFlow is an open-source library for numerical computations using data flow graphs. The nodes in the graphs represent mathematical operations, and the edges represent the tensors that flow between them. Since neural networks are often described in terms of graphs, this gives a very flexible tool that lets users implement architectures easily. The TensorFlow library is written so that it can be used with NVIDIA GPUs [53].

To simplify the process of building and testing deep learning models, the high-level API Keras is used. Keras retains all the advantages of TensorFlow, as it uses TensorFlow as its backend for computations. Furthermore, Keras comes with many of the most commonly used neural network layers predefined, which allows fast implementation [54].
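
As an example of how quickly a model can be expressed in Keras, the sketch below defines a small stacked LSTM classifier of the kind evaluated in this thesis. The layer sizes, dropout rate and input shape are placeholders rather than the hyperparameters found in the search.

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout

# Placeholder architecture: two stacked LSTM layers followed by a
# sigmoid output giving the probability of the snug segment.
model = Sequential([
    LSTM(64, return_sequences=True, input_shape=(1500, 1)),
    Dropout(0.3),
    LSTM(64),
    Dropout(0.3),
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()
```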

3.5.2 DEWESoft® and dwdatareader

DEWESoft® is a company that provides data acquisition software and test and measurement solutions [55]. All the data used in this thesis was collected with DEWESoft® equipment. To make it possible to read and analyze the data on a large scale using Python, DEWESoft® provides a free library for Linux and Windows users. The open-source Python module dwdatareader interacts with this library and has been used to export the data into a more manageable format, as described in Section 3.3.1.
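
A minimal sketch of such an export is shown below; the file and channel names are placeholders, and CSV is just one possible choice of a more manageable format.

```python
import dwdatareader as dw

# Export every channel of one .dxd file to CSV; file and channel names
# are placeholders.
with dw.open('tightening_runs.dxd') as dxd:
    for channel in dxd.values():
        series = channel.series()          # pandas Series indexed by time
        series.to_csv(channel.name + '.csv')
```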

3.5.3 Numpy and matplotlib


Chapter 4

Results

This chapter presents the results from the training and evaluation of the implementation.

4.1 Model Training for First Stage Models (Classification)

Figures 4.1-4.6 show the training progress for the top 3 first stage models found in the hyperparameter search. The models were trained for 100 epochs and with a dropout rate of 0.3. The stars indicate where the lowest validation loss occurred for each model.


Figure 4.2 – Validation loss for each epoch during training of the first stage LSTM-MLP model.



Figure 4.4 – Validation loss for each epoch during training of the first stage RNN model.


Figure 4.6 – Validation loss for each epoch during training of the first stage LSTM-FCN model.

In Table 4.1, the lowest validation loss scores are presented together with the epoch at which they occurred.

Table 4.1 – Lowest validation loss for the first stage models

Model Lowest val. loss Epoch


4.2 Model Training for Second Stage Models (Regression)

Figures 4.7-4.12 show the training progress for the second stage models. The top 3 models from the hyperparameter search (except for the LSTM-FCN, for which only the top 1 was trained) were trained for 300 epochs with a dropout rate of 0.3. The stars indicate where the minimum validation loss occurred.


Figure 4.8 – Validation loss for each epoch during training of the second stage LSTM-MLP model.


Figure 4.10 – Validation loss for each epoch during training of the second stage RNN model.

Figure 4.11 – Validation loss for each epoch during training of the second stage Stacked-LSTM model.


Figure 4.12 – Validation loss for each epoch during training of the second stage LSTM-FCN model.

In Table 4.2, the lowest validation loss scores are presented together with the epoch at which they occurred.

Table 4.2 – Lowest validation loss for the second stage models

Model Lowest val. loss Epoch


4.3 First Stage Model Results on Identifying Snug-Segment

Table 4.3 shows the final results of the best performing classification models used in the thesis, i.e. the models with the highest F1-scores on the test set. Scores are high for all models, with the Stacked LSTM model scoring the highest on all metrics and reaching an accuracy of 99.26% with high recall and precision. The worst performing model was the LSTM-FCN, with an accuracy of 98.01% and the lowest precision and, in particular, the lowest recall.

Table 4.3 – Results of first stage models

Model          Top 1 F1-score   Top 1 Recall   Top 1 Precision   Top 1 Accuracy

4-layer MLP    98.36%           98.36%         98.37%            98.89%
RNN            97.62%           97.62%         97.63%            98.22%
LSTM           97.65%           97.62%         97.65%            98.18%
LSTM-MLP       97.95%           97.95%         97.96%            98.61%
Stacked LSTM   98.44%           98.43%         98.45%            99.26%
LSTM-FCN       97.04%           96.96%         97.21%            98.01%

4.4 Second Stage Model Results on Identifying the Snug-Angle


Table 4.4 – Mean absolute error of second stage models

Model          Best-of-3 MAE
4-layer MLP    60.67
RNN            85.67
LSTM           67.02
LSTM-MLP       75.43
Stacked LSTM   52.57
LSTM-FCN       128.62

The best models from Tables 4.3 and 4.4 were selected for the evaluation of the full model.

4.5 Full Model Results


Table 4.5 – Full model results, mean error and standard deviation of error

Metric   Step size   Mean Error [°]   Standard Deviation [°]
