Machine Learning for Traffic Classification in Industrial Environments



STOCKHOLM, SWEDEN 2018

Machine Learning for Traffic Classification in Industrial Environments

Degree Project in Electrical Engineering, Second Cycle.

FILIP BYRÉN

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Abstract

Consumption has increased drastically over the years, and consumers place high demands on the quality of products, the time it takes to receive them, and the personalization options. Factories try to scale with consumer demands by removing human labour and deploying automation devices that can produce products more rapidly and with higher precision.

Wireless communication in the factories would help to achieve this goal, by enabling mobility as well as reducing cable reconfiguration/troubleshooting and increasing the utilization of the factories' resources.

This report investigates whether it is possible to achieve beneficial wireless communication in a production line, where the evolved Node B scheduler can prioritize important cyclic Real-Time and alarm packets by using machine learning based classification models. This new prioritization technique would give important factory applications high priority and make sure that important packets get served. We found several useful application classification models for factory environments, but demonstrated that the best model may depend on the factory setup.

Therefore, the report also introduces the idea of automated deep learning model construction, which allows models to improve over time.


Machine Learning for Classification in Industrial Environments

Consumption and consumer demands on products have increased drastically in recent years. Consumers require that products can be shipped shortly after ordering, with options to adapt them to personal preferences. Factories are continuously improved to satisfy these demands. As a result, factories have replaced human labour in production with automated robots that can produce more efficiently and with higher precision.

What the industry is trying to enable in the future is wireless communication between these robots. Success here would make production mobile, with robots moved around to increase productivity. Wireless communication would also mean less cable rerouting/troubleshooting, and costs would decrease.

What this report studies is whether it is possible to prioritize important Real-Time applications and alarms sent in factories, ensuring that these are always served first in wireless networks. The approach is to develop a machine learning model that can classify applications, and then use this model in the "evolved Node B" component. The "evolved Node B" is responsible for assigning clients the frequencies used to transmit data in wireless networks, and the goal is to ensure that the important applications are prioritized.

The result was that several machine learning models could classify the applications, but the best model turned out to depend on which types of applications needed to be classified. The report therefore concludes by discussing a hypothetical future automated model that adapts to the applications of the production line.


Acknowledgment

I am grateful to Filip Mestanov for providing materials and guidelines during uncertainties in the research, and I thank the company HMS for offering a data capture from their industrial environment. I acknowledge professor Viktoria Fodor from the department of network and systems engineering for accepting the offer to be the examiner and supervisor for the thesis. Finally, I appreciate Ericsson AB and Niklas Johansson for setting up the thesis project. The research would not have existed without these people and companies. Thank you,

Filip Byrén


Contents

Acknowledgment
List of Figures
List of Tables

1 Introduction 1

1.1 Production line . . . 1

1.2 Wireless . . . 2

1.3 Problem . . . 2

1.4 Purpose . . . 2

1.5 Goals and limitations . . . 3

1.6 Methodology . . . 4

1.7 Outline . . . 5

2 Background 6
2.1 Profinet . . . 6

2.1.1 Introduction . . . 6

2.1.2 History . . . 6

2.1.3 Profinet protocol . . . 7

2.1.4 Profinet communication . . . 8

2.2 Cellular network system . . . 10

2.2.1 eNodeB . . . 11

2.3 Factory . . . 12

2.3.1 Converged Plant-wide Ethernet . . . 12

2.3.2 Manufacturing Zone . . . 12

2.3.3 Demilitarized Zone . . . 13

2.3.4 Enterprise Zone . . . 14

2.3.5 Wireless . . . 14

2.4 Analytics . . . 14

2.4.1 Traffic capture and DataFrame . . . 15

2.4.2 Evaluate a classification model . . . 15

2.5 Classification using machine learning . . . 17

2.5.1 Introduction to machine learning . . . 17


2.5.2 Data . . . 17

2.5.3 Deep learning . . . 22

2.5.4 Optimizer function . . . 23

2.5.5 Common types of activation functions . . . 25

2.5.6 Convolutional Neural Networks . . . 27

2.5.7 Recurrent neural network . . . 27

2.5.8 Improving the model . . . 28

2.6 Related works . . . 28

2.6.1 Internet Traffic Classification Using Feed-forward Neural Network . . . 29

2.6.2 Network Traffic Classifier With Convolutional and Recurrent Neural Networks for Internet of Things . . . 30

2.6.3 Summary . . . 30

3 Implementation 31
3.1 Goal . . . 31

3.2 Captures . . . 31

3.3 Feature extraction . . . 32

3.4 Empirical observations . . . 34

3.5 Data analytics . . . 39

3.6 Machine learning models . . . 43

4 Results and conclusions 46
4.1 Computer specification . . . 46

4.2 Model evaluation for HMS . . . 46

4.3 Model evaluation for DEFCON . . . 50

4.4 Benefits of using machine learning for the eNodeB scheduler . . . 54

4.4.1 eNodeB scheduler gain . . . 57

4.5 Conclusion . . . 63

5 Discussion 65
5.1 Benefits . . . 65

5.2 Genetic algorithm . . . 65

5.3 Compressed neural networks . . . 67

5.4 Creating new auto AI models . . . 67

5.5 Problems of using machine learning . . . 69

5.6 Concern about AI and the engineer's role in the future society . . . 69

References 71


Abbreviations

ADAM Adaptive Moment Estimation

AI Artificial Intelligence

ARP Address Resolution Protocol

ART Acyclic Real-Time

CPwE Converged Plantwide Ethernet

DNN Deep Neural Network

eNodeB Evolved Node B

HTTP Hypertext Transfer Protocol

IACS Industrial Automation and Control System

IO Input Output

IRT Isochronous Real-Time

IT Information technology

LLDP Link Layer Discovery Protocol

LSTM Long Short Term Memory

ML Machine Learning

MLP Multi Layer Perceptrons

MTCD Machine-Type Communication Device

NRT Non Real-Time

NTC Network Traffic Classification

PN-DCP Profinet Discovery and Configuration Protocol

PN-PTCP Profinet Precision Transparent Clock Protocol

PNIO Profinet Input Output

PNIO-AL Profinet Input Output Alarm

PNIO-CM Profinet Input Output Context Manager

PNIO-PS Profinet Input Output Provider Status

QoS Quality of Service


ReLU Rectified Linear Unit

RT Real-Time

SVM Support Vector Machine

TCP Transmission Control Protocol

UDP User Datagram Protocol

UE User Equipment

List of Figures

2.1 Profinet-IO protocol stack structure, drawn with https://draw.io/. . . . 8
2.2 The Ethernet header for Profinet RT and IRT with the additional 802.1Q frame added, the figure is from [1]. . . . 9
2.3 The scheduling of resource units in the eNodeB for granting UE/MTCDs access to transmit data, the figure is from [2]. . . . 11
2.4 The framework structure for CPwE; notice that no direct communication occurs between the Enterprise Zone and the Manufacturing Zone. The figure is from [3]. . . . 13
2.5 The concept of a Decision Tree as a classification model; in this case the outputs are C and D, and based on the Boolean features A and B the model follows the correct path to determine the output. The figure is done using https://www.draw.io/. . . . 19
2.6 The classification concept using SVM classification; if a new data point is on the left side of the line it is classified with the same label as the other points on the left side, and vice versa. The figure is from [4]. . . . 21
2.7 The two core ideas of a deep neural network: the artificial neuron and the neural network, shown in figure (a) for an artificial neuron and (b) for a deep neural network. . . . 23
2.8 The typical activation functions used in deep learning, the figure is from [5]. . . . 26
3.1 The time series matrix used as input data for the LSTM and Convolutional models, the figure is generated using https://www.draw.io/. . . . 33
3.2 The feature mixture model for each application in the HMS data-set. . . . 35
3.3 The feature mixture model for each application in the DEFCON data-set. . . . 35
3.4 The distribution of source interval times for four application types in the HMS data-set. The applications selected were PNIO-AL, PNIO-CM, PN-DCP and PNIO-PS. . . . 37
3.5 The packet behaviour for each application. Each colour represents one source and how it sends its packets, in order to study whether there is any cyclic or acyclic behaviour. The result is from the HMS data-set. . . . 37
3.6 The distribution of source interval times for four application types in the DEFCON data-set. The applications selected were HTTP, PN-DCP, PN-PTCP and PNIO. . . . 38
3.7 The packet behaviour for each application. Each colour represents one source and how it sends its packets, in order to study whether there is any cyclic or acyclic behaviour. The result is from the DEFCON data-set. . . . 38
3.8 The frequency of each application in the HMS data-set, containing 10140 packets. . . . 40
3.9 The frequency of each application in the DEFCON data-set, containing 1046036 packets. . . . 40
3.10 The correlation for each feature in the HMS data-set, to determine how each feature and application correlate. . . . 41
3.11 The correlation for each feature in the DEFCON data-set, to determine how each feature and application correlate. . . . 41
3.12 The MLP model design for both data-sets. . . . 44
3.13 The Convolutional model design for both data-sets. . . . 44
3.14 The LSTM model design for both data-sets. . . . 45
3.15 The Convolutional LSTM model design for both data-sets. . . . 45
4.1 Model evaluation scores for the HMS test data-set. . . . 47
4.2 The classification model delay for the selected models in the HMS data-set. . . . 48
4.3 The classification delay for the fastest classification models in the HMS data-set. . . . 48
4.4 Confusion matrix result using the HMS test data-set for the Decision Tree (C4.5). . . . 49
4.5 Confusion matrix result using the HMS test data-set for the Convolutional LSTM. . . . 50
4.6 Model evaluation scores for the DEFCON test data-set. . . . 51
4.7 The classification model delay for 1000 packets, for the DEFCON data-set. . . . 52
4.8 The classification delay for the fastest models in the DEFCON data-set. . . . 52
4.9 Confusion matrix result using the DEFCON test data-set for the Decision Tree (C4.5). . . . 53
4.10 Confusion matrix result using the DEFCON test data-set for the Convolutional LSTM model. . . . 53
4.11 HMS traffic streamed to the eNodeB during one second. The eNodeB understands the packets' applications using ideal ML. . . . 55
… understands the packets' applications using ideal ML. . . . 56
4.14 In this example for DEFCON, the eNodeB views all packets as equal. . . . 56
4.15 The accumulated score for different models, showing the potential gain of using a machine learning classifier to prioritize important industrial applications in the HMS data traffic. . . . 58
4.16 The potential buffer benefit of using a machine learning model in the eNodeB vs. without, for the HMS data traffic. . . . 59
4.17 Each application's drop rate, using machine learning, for the HMS data traffic. . . . 59
4.18 Each application's drop rate, without machine learning, for the HMS data traffic. . . . 60
4.19 The serve rate increased by 33.3% for the eNodeB with no ML installed, to make sure as many important applications get served as when using machine learning, for HMS traffic. . . . 60
4.20 The accumulated score for different models, showing the potential gain of using a machine learning classifier to prioritize important industrial applications in the DEFCON data traffic. . . . 61
4.21 The buffer load for the DEFCON traffic, with and without machine learning. . . . 61
4.22 Packet drop rate for the applications, using machine learning, in the DEFCON traffic. . . . 62
4.23 Packet drop rate for the applications, without machine learning, in the DEFCON traffic. . . . 62
4.24 The serve rate increased by 50% for the eNodeB with no ML installed, to make sure as many important applications get served as when using machine learning, for DEFCON traffic. . . . 63
5.1 Illustrative image showing how each factory eNodeB could be updated and improved over time, using some type of automated machine learning that tries to improve model classification performance and latency. The figure is generated using https://www.draw.io/. . . . 66
5.2 Idea of how the future factory eNodeB function in the cellular network can improve over time, the figure is generated using https://www.draw.io/. . . . 68

List of Tables

2.1 The priority level for different traffic class services in the 802.1Q header, table is from [6]. . . . 7
2.2 Common services used in a Profinet communication, with their priority on a 1 (low) to 9 (high) scale. This scale is based on my conclusion motivated in the background, (slide 14 [7]) and (section 2-4 [8]). . . . 10
2.3 The concept of a confusion matrix; here there are four applications, each containing 15 packets in total. The first element of each row in the matrix is the true application for the packet, and each column after the first presents the classification. . . . 16
3.1 The HMS/DEFCON village data-sets split into two sets for training and testing. . . . 42


Introduction

AI is not here to take part, it is here to take over.

Conor McGregor, Twist of original quote

1.1 Production line

Consumption has intensified radically over the years. "Världskoll", a non-profit organisation funded by the United Nations, estimates that consumption in Sweden increased by 46% from 1990 to 2015 [9]. Besides this increase, consumers' expectations of products have also grown: they require higher consistency, faster delivery, better quality and more personalization options than ever before [10].

Factories try to scale with consumer demands by removing human labour and deploying automation devices that can produce products more rapidly and with higher precision. The issues factories face today are the fluctuating demand for different products, and mass production is not always an option as storage capacity is limited. Companies try to solve this by keeping close feedback with the market and producing products based on present needs. This means they need production line devices that can adapt their tasks quickly. Today the industry talks about "Industry 4.0", where artificial intelligence, among other things, is used to create adaptive manufacturing, in which robots can learn for instance from human demonstration and thereby enable a fast relearning process [10]. However, the need does not end there: since storage capacity and material are limited, the production line setup also needs to be adaptive, meaning that devices need to change places in the production line to produce more efficiently. A way to enable this is to make the production line communication wireless; this would enable mobility as well as reduce cable reconfiguration/troubleshooting and allow higher utilization of factory resources.



1.2 Wireless

There are several technologies that enable wireless communication today; WiFi, Bluetooth and cellular networks are a few examples from a large pool of options. WiFi connects devices locally and gives them Internet access through the WiFi router. WiFi focuses on giving high speed to the local devices, using a shared frequency spectrum for sending data. One drawback of WiFi is that it does not use a licensed spectrum; other devices might use the same spectrum for sending data, resulting in interference. A second problem with WiFi is that the signal strength is very dependent on the environment, where surrounding objects can change the signal strength. Finally, WiFi was not designed for scaling, so using it in a factory production line would be a poor solution. Bluetooth is a device-to-device communication technology and, like WiFi, it uses a shared spectrum. There are several disadvantages of using Bluetooth for factory device communication, for instance the range limitations, the interference between devices, and the latency. The technology the industry is looking into is cellular networks. This is because cellular networks use licensed spectrum and are designed to be scalable, robust and mobile. Moreover, future standards of cellular networks can offer the low latencies that most production line devices require.

1.3 Problem

The technical hitch with wireless communication in factories is that the production line communication in general must be sent in a low-latency, periodic manner. Those latency requirements are not achievable with today's cellular networks, but will be in the future. An additional problem the industry is facing is that the automation devices communicate using industrial Ethernet protocols, which were not designed for wireless communication. This becomes an issue during traffic peak loads, when the production line communication needs to be prioritized. If the cellular network cannot distinguish important packets and decides not to serve a production line cyclic packet directly, most industrial protocols view this as the communication being finished. The outcome is that the production halts. This cannot happen, as any production delay translates to major production loss.

This report assesses this problem, and the research question is to find out whether it is possible to achieve an effective application classifier model using machine learning, making sure important industrial packets get prioritized in a cellular network.

1.4 Purpose

If the cellular network knew how important the different packets in the traffic flow are, it could prioritize the ones that matter most for the production. The solution is to build a network traffic classifier (NTC) that can learn the characteristics of data sent from factories and classify them, so that less packet loss occurs for crucial data, thereby improving the quality of service (QoS) for the factories.

1.5 Goals and limitations

The goal of the thesis is to answer the research question: is it possible to achieve an effective application classifier model using machine learning, making sure important packets get prioritized in a cellular network? To achieve this, the thesis will process data from factory enterprises and mimic how the cellular network receives the data. Different machine learning models will be tested, to observe whether applications can be correctly classified using the input the cellular network can extract from the packets during scheduling. The performance of the models' classification will then indicate whether application prioritization is achievable. What the thesis needs to explore is:

1. Industrial Ethernet protocol: understand a common industrial protocol used in production lines. This knowledge will be useful for finding characteristics in the packets that normally relate to a certain application. The limitation is that there are plenty of industrial protocols, and different protocols may have different characteristics relating to different applications.

2. The cellular network transmission scheduler: understand which core function in the cellular network gives clients access to frequencies for transmitting data.

3. General factory setup: understand how devices are set up in a factory, their roles, and how they communicate with the office landscape. This knowledge explains how we select data-sets that mirror reality as well as possible.

4. Data extraction: how the packets sent in a factory can be extracted and how they can be used to understand the behaviour of different applications.

5. Data processing: how we can extract the relevant features from the data. The research also needs to determine which features the cellular network receives when users transmit data.

6. Data analytics: understand the underlying behaviour of the different packets and see whether this confirms the behaviour we expect.

7. Classification models: generate several classification models and study them to show which one is the most suitable for the data-set.

8. Gain: find a way to show the profit of using a classification model in industrial environments.
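Points 4 and 5 can be illustrated with a minimal sketch of turning captured packets into fixed-size feature vectors for a classifier. The packet fields and the three features chosen here are illustrative assumptions, not the exact feature set used later in the thesis.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    timestamp: float   # capture time in seconds
    size: int          # frame length in bytes
    ethertype: int     # e.g. 0x8892 for Profinet RT/IRT

def extract_features(packets):
    """Turn a list of captured packets into per-packet feature vectors.

    Features (illustrative): frame size, inter-arrival time to the
    previous packet, and whether the EtherType marks Profinet RT/IRT.
    """
    features = []
    prev_ts = None
    for p in packets:
        inter_arrival = 0.0 if prev_ts is None else p.timestamp - prev_ts
        is_profinet_rt = 1 if p.ethertype == 0x8892 else 0
        features.append([p.size, inter_arrival, is_profinet_rt])
        prev_ts = p.timestamp
    return features

# Invented three-packet capture: two Profinet RT frames and one IPv4 frame.
capture = [Packet(0.000, 60, 0x8892), Packet(0.005, 60, 0x8892),
           Packet(0.007, 342, 0x0800)]
print(extract_features(capture))
```

The same shape of vector is what a scheduler-side classifier would have to work from, since it only sees packet metadata, not application payloads.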


The outcomes of the work are a report showing how industrial production devices communicate, how a machine learning model can classify the application of each packet, how the machine learning model would be deployed in the cellular network, and the gain of using application classification for the production line.

There are many related topics for improving packet priority in a production line using wireless communication. Topics that are not covered in this research are, for instance, a redesigned industrial Ethernet protocol for wireless communication, other prioritization techniques without machine learning, and the study of multiple industrial Ethernet protocols.

1.6 Methodology

The research will use quantitative measurement, as a large set of data exists in packet captures. Data exploration will be the grounds for reasoning about the applications' behaviour, resulting in empirical observations on whether the data match the hypothesis or the background. If the behaviour of the data does not match the expectations from industrial standards or previous studies, the study will attempt to explain why that might be the case, or will question the provided data. This is an inductive approach, where the thesis can change direction during the research in order to answer the research question.

The goal of the report is to answer the research question: "is it possible to achieve an effective application classifier model using machine learning, making sure important packets get prioritized in a cellular network?". This means that a conclusion indicating that a sufficient classification model is difficult to achieve is as good a result as finding a sufficient model. This encourages unbiased reasoning, where there is no gain, for instance, in manipulating the results.

The methods used for building the classifier models will come from the machine learning area. The reason for machine learning is the adaptability machine learning tools have for different data characteristics. A machine learning model works by learning the mapping from input to output, without anyone declaring it. This allows machine learning to efficiently classify the output of the data without prior knowledge of the data-set. The hypothesis of why machine learning is needed is that different industrial applications in the production line will most likely have similar packet behaviour. This means the applications are difficult to separate manually, whereas a machine learning model can instead find the mapping between input and output. Given the classification strength that different machine learning models provide, the thesis will try out several of them to see whether a handful can classify the applications. This means that several different classification models need to be tested, from simple machine learning models to more complex deep learning methods, to reflect the data-sets' complexity as well as possible. The hypothesis behind deploying many different types of machine learning models is that, together, they will most likely indicate whether a network traffic classification model will perform well or not in the factories.
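As a toy illustration of a model "learning the input mapping to the output without anyone declaring it", the sketch below fits a nearest-centroid classifier on synthetic packet features. The classes, feature values and the choice of nearest-centroid are invented for illustration, and are much simpler than the models tested in the thesis.

```python
# Toy nearest-centroid classifier: it "learns" one centroid per class
# from labelled examples, then labels new points by the closest centroid.
def fit(samples):
    """samples: list of (feature_vector, label). Returns label -> centroid."""
    sums, counts = {}, {}
    for x, y in samples:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict(centroids, x):
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: sq_dist(centroids[y]))

# Synthetic training data: [frame size in bytes, inter-arrival time in ms].
train = [([60, 5.0], "cyclic-RT"), ([64, 5.1], "cyclic-RT"),
         ([300, 800.0], "diagnostics"), ([350, 900.0], "diagnostics")]
model = fit(train)
print(predict(model, [62, 4.9]))    # near the cyclic-RT centroid
```

The point of the sketch is only that the mapping from features to class is extracted from labelled data, never hand-coded; the thesis swaps this toy for decision trees, SVMs and deep networks.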

1.7 Outline

The first chapter after this introduction is the background chapter; it describes a well-used industrial Ethernet protocol and the role of different devices in a production line, and constructs a reasonable priority table for different industrial applications. The chapter then goes through a factory setup, and where the network traffic classifier would be useful in the cellular network. Afterwards it presents different data processing techniques as well as data analytics. Finally, different machine learning models are motivated based on machine learning models previously used for similar classification problems.

The implementation chapter explains how the industrial factory data can be converted to useful data types, and displays analytics on the different data-sets. The implementation chapter uses the background knowledge to select relevant features as input for the machine learning models, where empirical evidence from the data analytics later confirms the relevance and behaviour of the selected features.

The results chapter compares the machine learning models' performance and presents the gains of using machine learning for the production line. The discussion of the thesis is in the final chapter, which explains possible future work and improvements. This chapter also discusses the general ethical concerns about deploying machine learning and AI.


Background

2.1 Profinet

2.1.1 Introduction

Historically, factory production lines have been separated from the rest of the enterprise. The only way to get information about the production progress was to be present at the production line, and the same went for updating machines and handling oversights. The industry aspired to have the factory activity connected with the enterprise, using some type of standard solution within which all automated devices could communicate, allowing someone to oversee the process in the factory without being present. The solution was to introduce industrial Ethernet protocols. The goal of an industrial Ethernet protocol is to enable communication between automation devices, while also being able to communicate with the office landscape over Ethernet cables. One of the most commonly used industrial Ethernet protocols today is Profinet [11]. Profinet allows different input types with different requirements to communicate using the same protocol. The Profinet protocol provides standard communication for the production line's Input-Output Devices (IO-Devices) over Ethernet. Profinet enables connections between the production IO-Devices and the office landscape, with several different quality of service capabilities, including for instance different latency performance guarantees based on the IO-Devices' needs. The protocol's services are Non-Real-Time communication (NRT), Real-Time communication (RT) and Isochronous-Real-Time communication (IRT) [12].

2.1.2 History

The first Profinet protocol arrived in 2000, named Profinet component-based automation (CBA). The goal of this protocol was to use the TCP/IP stack for machine-to-machine communication, offering a periodicity of approximately 100 ms [1]. The downside of this protocol version is that it does not offer Real-Time cyclic communication [12]. Because this version of Profinet cannot handle Real-Time communication, most production IO-Devices cannot use it.

Cyclic Real-Time data means periodic communication with low jitter, which should always be sent and received within a specific periodic time interval. The factory automation line requires Real-Time cyclic communication, as the IO-Devices perform fast operations where each IO-Device has its own task in the production. If this communication misses one period, or is sent or received too late, the communication is terminated [13] and the production stops. An updated version of Profinet, called Profinet-IO, offers this guaranteed low-latency cyclic communication by adding Real-Time and Isochronous-Real-Time service capabilities. It achieves this by bypassing the UDP/IP layers in the protocol stack, which reduces the latency of the packets. The result is a Real-Time service with a periodic message interval of 5-10 ms [1], while Isochronous-Real-Time offers an even lower periodic latency of 0.25-1 ms, useful for instance in motion control IO-Devices. Now all the IO-Devices can communicate using only one type of protocol, and are able to send data to the office landscape.
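The cyclic deadline behaviour described above can be sketched as a watchdog that flags the first cyclic packet arriving too late for its slot. The 50% tolerance factor below is an illustrative assumption; a real Profinet watchdog is configured per connection.

```python
def first_deadline_miss(timestamps, cycle, tolerance=0.5):
    """Return the index of the first cyclic packet that arrives later than
    its expected slot by more than `tolerance * cycle`, or None if the
    cycle was kept throughout.

    timestamps: arrival times in seconds; cycle: expected period in seconds.
    """
    start = timestamps[0]
    for i, t in enumerate(timestamps):
        expected = start + i * cycle          # ideal arrival for packet i
        if t - expected > tolerance * cycle:  # late beyond the tolerance
            return i
    return None

# A 5 ms Profinet RT cycle where the fourth packet is 4 ms late:
arrivals = [0.000, 0.005, 0.010, 0.019, 0.020]
print(first_deadline_miss(arrivals, cycle=0.005))
```

In the text's terms, the index returned is the packet whose lateness would make the IO-Device consider the communication terminated and halt production.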

2.1.3 Profinet protocol

The Non-Real-Time Profinet applications use the first version of Profinet-IO, seen as (A) in figure 2.1. It is used for communication that does not require low latencies, for instance connection establishment, diagnostics and status reports from the production line to the office landscape. The Real-Time production line applications use the second version of Profinet-IO, (B) in figure 2.1, and the Isochronous-Real-Time applications use the third version of Profinet-IO, (C) in figure 2.1.

Based on the application's need, the suitable Profinet-IO version can be selected for transmitting the packet. As mentioned before, the Real-Time and Isochronous-Real-Time protocol stacks are redesigned without the Internet layers, as seen in figure 2.1. What the redesign adds is the 802.1Q block in the Ethernet frame for (B) and (C) in figure 2.1. The Ethernet frame in figure 2.2 has the Type 0x8892, signifying that the packet is a Profinet Real-Time or Isochronous-Real-Time application. The additional 802.1Q header [6], also shown in figure 2.2, adds the priority to the packet. The priority is based on a scale from 0 to 7, where 7 is the highest priority. Table 2.1 shows all the priority levels for 802.1Q services, where all Profinet Real-Time and Isochronous-Real-Time applications have a priority of 6, the same as internetwork control services. This design results in all Profinet services using version 2 and version 3 of Profinet-IO having the same priority, even though the various services have different importance.

Priority  Traffic Class
0         Background
1         Best effort
2         Excellent effort
3         Critical application
4         Video
5         Voice
6         Internetwork control
7         Control data traffic

Table 2.1: The priority level for different traffic class services in the 802.1Q header, table is from [6].

(A) Profinet Version 1. (B) Profinet Version 2. (C) Profinet Version 3.

Figure 2.1: Profinet-IO protocol stack structure, drawn with https://draw.io/.
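The frame layout of figure 2.2 can be decoded directly: after the two 6-byte MAC addresses, a frame tagged with 802.1Q carries the tag identifier 0x8100, a 16-bit tag control field whose top three bits are the priority of table 2.1, and then the real EtherType (0x8892 for Profinet RT/IRT). A minimal parser sketch, where the sample frame bytes are invented for illustration:

```python
import struct

def parse_priority(frame: bytes):
    """Return (priority, ethertype) for a raw Ethernet frame.

    If the frame carries an 802.1Q tag (TPID 0x8100), the priority is the
    3-bit PCP field of the tag control word; untagged frames are treated
    as priority 0 here, which is a simplifying assumption.
    """
    (tpid,) = struct.unpack_from("!H", frame, 12)    # after 2 x 6-byte MACs
    if tpid == 0x8100:
        (tci,) = struct.unpack_from("!H", frame, 14)
        (ethertype,) = struct.unpack_from("!H", frame, 16)
        return tci >> 13, ethertype                  # top 3 bits = priority
    return 0, tpid

# Invented example: a tagged Profinet RT frame with priority 6
# (zeroed MAC addresses, no payload).
frame = bytes(12) + struct.pack("!HHH", 0x8100, 6 << 13, 0x8892)
print(parse_priority(frame))
```

This is exactly the information a scheduler can read cheaply, and it also shows the limitation discussed above: every Profinet RT/IRT frame decodes to the same priority 6, regardless of how important the service actually is.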

2.1.4 Profinet communication

There are three types of roles in a Profinet communication: IO-Devices, IO-Controllers and IO-Supervisors. The IO-Devices are machines in the automation line, for instance robots, sensors, drivers, actuators and fans. They do not operate by themselves; they need some type of programmable logic controller (PLC) to tell them what to do. These programmable logic controllers can for instance instruct the IO-Devices to change the angle of a robot arm, or to perform a particular task on the product. In Profinet, the programmable logic controllers are called IO-Controllers. The last role is the IO-Supervisor. The IO-Supervisor's assignment is to gather status reports and diagnostics from the IO-Devices, for instance how the production is going and whether there is any indication that a machine needs repair or replacement [8].


Figure 2.2: The Ethernet header for Profinet RT and IRT with the additional 802.1Q frame added, the figure is from [1].

The most commonly used services in a Profinet communication are:

1. Link Layer Discovery Protocol (LLDP): informs the network about the device's existence and abilities, as well as learning how the network is set up [14].

2. Address Resolution Protocol (ARP): the goal of this protocol is to broadcast a request to assign the connection a particular IP Address, first checking whether that IP Address is available. The network will then map this IP Address to the connection's unique MAC Address [14]. For a Profinet device the MAC Address is a Profinet id which is unique for every device and is based on [Vendor_ID, Device_ID] [15].

3. Profinet Discovery and Configuration Protocol (PN-DCP): lets the IO-Supervisor allocate a reference to the connection and assign a specific IP Address based on hardware configurations for the IO-Controller; this stage is done together with ARP [14].

4. Profinet Input Output Context Manager (PNIO-CM): this stage sets up a connection between the IO-Device and the IO-Controller, and notifies the connection establishment of the type of traffic and the latency requirements needed for the communication [14].

5. Profinet Precision Transparent Clock Protocol (PN-PTCP): this application maintains synchronization of the production line [14].


Service   Priority  Type          Concept
LLDP      1         NRT, LLDP     Device existence
ARP       1         NRT, ARP      Address look up
PN-DCP    5         RT, 802.1Q    Address assignment step
PNIO-CM   3         NRT, UDP      Connection establishment
PN-PTCP   5         RT, 802.1Q    Synchronization
PNIO      7         RT, 802.1Q    Cyclic data exchange
PNIO-PS   7         RT, 802.1Q    Cyclic Service data unit
PNIO-AL   5,9       ART, 802.1Q   Acyclic process alarm (low, high)

Table 2.2: This table illustrates common services used in a Profinet communication, and the priority on a 1 (low) to 9 (high) scale. This scale is based on my conclusion motivated in the background, (slide 14 [7]) and (section 2-4 [8]).

6. Profinet Input Output (PNIO): is the application where cyclic Real-Time and Isochronous-Real-Time packets are sent between the IO-Controller and IO-Device, using Profinet-IO version two or three.

7. Profinet Input Output Provider Status (PNIO-PS): is similar to PNIO with a service status added to the exchange; this is optionally done for cyclic Real-Time and Isochronous-Real-Time packets between the IO-Controller and IO-Device [8].

8. Profinet Input Output Alarm (PNIO-AL): sends acyclic Real-Time and Isochronous-Real-Time alarms [14]. Alarms can for instance inform an IO-Controller that an IO-Device is getting warm, so that the IO-Controller needs to send a data exchange message to an IO-Device fan to increase the fan speed.

The alarm can have the value high or low, signifying its importance.

Based on our knowledge about Profinet-IO communication we can assign different importance levels to the application types during the data exchange. This is done in table 2.2.
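As a concrete illustration, the importance levels in table 2.2 can be stored as a simple lookup that a scheduler could consult. The sketch below is a minimal Python example; the dictionary and the service-name strings are hypothetical choices made here, not identifiers from any Profinet standard.

```python
# Hypothetical mapping of Profinet services to the 1 (low) - 9 (high)
# importance scale from table 2.2; the key strings are illustrative only.
SERVICE_PRIORITY = {
    "LLDP": 1,          # device existence
    "ARP": 1,           # address look up
    "PNIO-CM": 3,       # connection establishment
    "PN-DCP": 5,        # address assignment step
    "PN-PTCP": 5,       # synchronization
    "PNIO-AL-LOW": 5,   # acyclic alarm, low importance
    "PNIO": 7,          # cyclic data exchange
    "PNIO-PS": 7,       # cyclic service data unit
    "PNIO-AL-HIGH": 9,  # acyclic alarm, high importance
}

def priority_of(service: str) -> int:
    """Return the assumed importance of a service, defaulting to best effort."""
    return SERVICE_PRIORITY.get(service, 1)
```

A scheduler could then, for instance, sort queued packets by `priority_of(service)` before assigning resources.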

2.2 Cellular network system

The latest commonly used standard for wireless cellular network systems is LTE (Long Term Evolution) [16]. The function in the LTE architecture responsible for the wireless transmission of data to the user devices is the evolved Node B (eNodeB). The goal of the eNodeB is to oversee all radio functions, including scheduling, and to provide the User Equipment (UE) or Machine-Type Communication Device (MTCD) [2] with communication to the rest of the system [17].


2.2.1 eNodeB

The eNodeB transports IP user data via the Packet Data Convergence Protocol (PDCP) that encapsulates the data. Encapsulation means the data is opaque, and cannot be processed nor opened by other clients. When the UE/MTCDs want to send data they send a scheduling request to the eNodeB [2]; the process can be seen in figure 2.3. The eNodeB schedules the clients based on three criteria shown in figure 2.3: (1) QoS Requirements, (2) Channel Quality Dynamics and (3) Traffic Dynamics. The first criterion captures the QoS requirements the data has, for instance whether the packet is for uploading, streaming, web surfing, etc [2]. Different QoS cluster levels exist to map this [2]. However, this turns out to be a poor solution, as different packets in the same QoS cluster can have, for example, different tolerance for delay, for instance vehicle sensor data, monitoring and factory Real-Time applications [2]. The second scheduling criterion studies how much resources each UE/MTCD needs and how fast it needs to send them, and the last one is traffic dynamics, taking the load of the system into account for the scheduling. Based on these three requirements a scheduling request can be granted during a transmission time interval (TTI), where the UE/MTCD is assigned Resource Units (RU) for the current TTI, seen in gray in figure 2.3. The data is then sent on defined frequencies that no other client can use during this time, called a Resource Block (RB) [2], seen in figure 2.3.

The goal of the thesis is to investigate whether this scheduler can do internal prioritization of important factory application UEs/MTCDs, using a classification model that classifies the different applications. The network classification model creates a new type of QoS requirement system that can prioritize more exactly based on the applications, overcoming the internal prioritization issue with industrial Ethernet protocols.
Figure 2.3: The scheduling of resource units in the eNodeB for granting UE/MTCDs access to transmit data, the figure is from [2].

5G improvements on the LTE network will for instance allow more connected devices and make it possible to reach lower latencies than before, now allowing communication with a latency of 1-10 ms [18]. It will also make it possible for the eNodeB to process the UE/MTCD's Ethernet type. This will be introduced in future standards to handle QoS mechanism issues, reduce complexity and reduce extra overhead [19].

2.3 Factory

Cisco and Rockwell Automation have provided a white paper [20] about their vision of how a factory topology can be designed when mixing the office landscape and the production lines, naming it Converged Plantwide Ethernet (CPwE) [3]. It is designed to separate and protect different parts of the network by defining levels, where each level in the design has a particular role in the network.

2.3.1 Converged Plant-wide Ethernet

Cisco's and Rockwell Automation's goal with the CPwE solution is a scalable design that handles both small (50 devices or fewer) and large (10 000+ devices) factory networks. It achieves the scalability and security by having specific levels in the design with predefined communication structures between the levels, allowing it to scale horizontally. The CPwE communication can be seen in figure 2.4. The idea of defining levels is that the enterprise never has direct access to the manufacturing, thereby reducing the risk of getting malicious software into the production.

2.3.2 Manufacturing Zone Level 0 to 1

Level 0 is the process level in the CPwE hierarchy, where the underlying production units exist. This level is the lowest level in figure 2.4. The functions done here are for instance welding, painting, 3D printing, measurements and so on. The devices connected at this level receive their instructions from the controller devices in level 1 [3]. A reader of the Profinet-IO section can quickly conclude that this corresponds to the IO-Device and IO-Controller communication in a Profinet setup.

Level 1 has the basic programmable logic controllers for the manufacturing. They can communicate with both the higher level 2 and the devices in level 0. As mentioned before, these basic controllers correspond to the IO-Controllers in Profinet-IO.

Level 2 to 3

Based on the size of the network, level 2 can be merged with level 3. The difference between level 2 and level 3 in CPwE is that level 3 generally has additional services such as file servers and other domain services. The role of the devices in level 2 is to supervise the manufacturing, where the devices in levels 0-1 send feedback to the supervisor devices. Humans can receive this feedback and oversee the production from dedicated control rooms [3]. These devices can be linked to the IO-Supervisor role in Profinet-IO.

Figure 2.4: The framework structure for CPwE; notice that no direct communication occurs between the Enterprise Zone and the Manufacturing Zone. The figure is from [3].

Level 3 serves as the final production layer. The task these devices have is to manage file systems, follow production progress and check material assets [3].

2.3.3 Demilitarized Zone

The goal of the demilitarized zone (DMZ) is to isolate the enterprise and manufacturing, which reduces the risk of harmful interference. This is done by only allowing certain types of traffic to pass in either direction. Several firewalls are set up to make sure only traffic with permission can access the lower levels from the enterprise.

A firewall is a system designed to protect its resources by applying rules to the traffic entering and leaving it [3].


2.3.4 Enterprise Zone Level 4 to 5

Level 4 is an office landscape level. However, not all employees have access to the information one can extract from level 4: summaries from the production line and, with permission, the organization database [3].

Level 5 is the final level in CPwE and is the general office network. External users are connected at this level, and the only way to access lower levels in the CPwE design from level 5 is to get permission through the enterprise's secure applications [3].

2.3.5 Wireless

The CPwE architecture describes a common business setup for a production company, where figure 2.4 shows each level in the architecture. Every line could theoretically be replaced with a wireless transmission medium while keeping the internal architecture, allowing the factory setup to be dynamic and secure using for instance a cellular network. The application classification model would let the eNodeB in the cellular network understand the applications passing through and make sure important applications are served first.

2.4 Analytics

A part of building a good classification model is to understand the data. Analytics is essential when one for instance wants to understand the complexity of the data or wants to confirm a hypothesis with quantitative data analytics. Based on this knowledge one can for example get an idea of how complex the methods for classifying the data need to be, and confirm one's feature selections. To be able to do data analytics the data needs to be processed. This removes some of the irrelevant parts of a data capture, restores damaged parts, and converts the data to useful objects for data analytics and machine learning.

One common technique for understanding the relationship between different characteristics is to normalize the data. The goal of normalization is to show more clearly how large or small a value is in contrast to the rest of the capture. The same kind of analytics can be done with standardization, where one instead is interested in the spread. Both techniques are rooted in statistics and the formulas are the following:

Normalization:    X_normalized = (X − x_min) / (x_max − x_min)    (2.1)

Standardization:  X_standardized = (X − E[X]) / σ    (2.2)
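As a small illustration of equations 2.1 and 2.2, the two transformations can be written directly in Python; this is a minimal sketch using only the standard library, and the example values are made up.

```python
import statistics

def normalize(values):
    """Min-max normalization, equation 2.1: scales values into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

def standardize(values):
    """Standardization, equation 2.2: zero mean, unit standard deviation."""
    mean = statistics.fmean(values)
    sigma = statistics.pstdev(values)   # population standard deviation
    return [(x - mean) / sigma for x in values]

packet_sizes = [60, 64, 1500, 128, 60]   # made-up packet lengths in bytes
print(normalize(packet_sizes))           # 60 maps to 0.0, 1500 maps to 1.0
```

After normalization, the dominance of the 1500-byte packet over the rest of the capture is immediately visible.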


2.4.1 Traffic capture and DataFrame

The communication between a source address and a destination address can be recorded using a packet capture program. Python [21] is the selected programming language for the thesis due to its simplicity and the many machine learning tools that exist for it. Wireshark [22] is the selected capturing program, as its captures can be extracted to Python using a module called pyshark [23].

In Wireshark one can then fetch a summary of each packet's information, for instance the packet size, the Ethernet type and the recorded index of the packet.

These captures need to be extracted, using for instance pyshark, and stored in a way Python can access. One structure type that can handle this data is a DataFrame [24]. A DataFrame is a two-dimensional data structure, with the option of having columns with different data types, which makes it useful for analytics and machine learning. A DataFrame is related to a structured query language (SQL) table. It is stored in Random Access Memory to perform fast searches and operations, with the option of storing the data on disk if it exceeds the available Random Access Memory allocation [24].
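A minimal sketch of this workflow in Python, assuming pandas is available: the packet summaries below are hand-made stand-ins for the fields one might extract with pyshark from a real capture, since an actual capture requires Wireshark/tshark to be installed.

```python
import pandas as pd

# Made-up packet summaries, shaped like the per-packet fields one could
# pull out of a pyshark capture (index, Ethernet type, size, service).
packets = [
    {"index": 1, "eth_type": "0x8892", "length": 60,  "service": "PNIO"},
    {"index": 2, "eth_type": "0x0806", "length": 42,  "service": "ARP"},
    {"index": 3, "eth_type": "0x8892", "length": 64,  "service": "PN-PTCP"},
]

df = pd.DataFrame(packets)                 # columns may hold different dtypes
profinet = df[df["eth_type"] == "0x8892"]  # filter the Profinet RT frames
print(profinet[["index", "service", "length"]])
```

The boolean filter illustrates why the DataFrame is convenient for analytics: selecting all Profinet frames is a single expression.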

2.4.2 Evaluate a classification model

To evaluate the performance of each classification model generated in the research, a type of score will be necessary. We will introduce four different types of scores that are typically used for evaluating classification performances. The four scores are accuracy, precision, recall and f1-score.

Accuracy measures the number of correctly classified packets divided by the total number of packets. For example, if five packets are classified as the correct application out of ten packets tested, the accuracy score is 50%. The drawback of using accuracy as the measurement of classification performance is that it does not express how precise the classification was for each label. For instance, if the majority of the data belongs to one application type and the model classifies all packets as that application type, this results in a high accuracy score even though the model is not precise.

Precision measures, for each application, how many of the packets classified as that application are correct. For example, if ten ARP packets out of one hundred ARP packets in total are classified as ARP, and no other application is classified as ARP, the precision is 100% for the ARP label. This guarantees that whatever is classified as an ARP application really is an ARP application. The drawback is that it ignores that only 10% of the total ARP packets were classified correctly.

Recall handles this issue by instead checking, for every application type, how many of its packets were classified correctly. For instance, the ARP example would give a recall score of 10% for the data that belongs to the ARP application.

The final classification model evaluation score included in this study is the F1 score.

The goal of the F1 score is to combine the recall score and precision score into a harmonic combination of both, using the following equation:

F1 = 2 · (Precision · Recall) / (Precision + Recall)    (2.3)
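The four scores can be illustrated with a short, self-contained Python sketch; the label lists are made-up examples, not thesis data.

```python
def scores(y_true, y_pred, label):
    """Accuracy over all packets plus precision, recall and F1 for one label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    accuracy = correct / len(y_true)
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)   # classified as `label`
    actual = sum(t == label for t in y_true)      # truly `label`
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    # Equation 2.3: harmonic combination of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

y_true = ["ARP", "ARP", "PNIO", "PNIO"]   # made-up true applications
y_pred = ["ARP", "PNIO", "PNIO", "PNIO"]  # made-up model output
acc, prec, rec, f1 = scores(y_true, y_pred, "ARP")
print(acc, prec, rec, f1)  # 0.75 1.0 0.5 0.666...
```

Note how, just as in the ARP example, the precision for ARP is perfect while the recall reveals that half of the true ARP packets were missed.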

Instead of a numeric score of the model's classification performance, one can get a strong understanding of how the model classifies by studying how it classified each data point. This is also very useful from a priority perspective: if two applications are misclassified as each other and both map to the same priority level, this is not a big issue, but if the priority difference is large it is a problem.

One way of doing this type of evaluation is to study a confusion matrix. The goal of a confusion matrix is to show row-wise the true application and column-wise how the model classified the data. Table 2.3 illustrates this, where four classes A-D were selected. Each row represents a class and how that class's data was classified. The first row in table 2.3 is for class label "Application A", which contains 15 packets in total, where 10 of 15 were classified as "Application A", one of 15 was misclassified as "Application B", one of 15 was misclassified as "Application C" and three of 15 were misclassified as "Application D". The second row gives the results for "Application B", the third row for "Application C" and the last row for "Application D".

Class \ Classified   Application A   Application B   Application C   Application D
Application A        10              1               1               3
Application B        0               10              2               3
Application C        1               1               10              3
Application D        0               0               0               15

Table 2.3: The table illustrates the concept of a confusion matrix; here we have four applications where each contains 15 packets in total. The first element in each row in the matrix is the true application for the packet. Each column after the first column element presents the classification.
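A confusion matrix like table 2.3 can be built with a few lines of Python; the sketch below uses made-up labels A and B for brevity.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Rows are the true application, columns how the model classified it."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

labels = ["A", "B"]
y_true = ["A", "A", "B", "B", "B"]   # made-up true labels
y_pred = ["A", "B", "B", "B", "A"]   # made-up model output
print(confusion_matrix(y_true, y_pred, labels))  # [[1, 1], [1, 2]]
```

Off-diagonal entries are the misclassifications; in a priority setting one would inspect exactly those cells to see whether confused applications share a priority level.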


2.5 Classification using machine learning

2.5.1 Introduction to machine learning

The traditional structure of a program is to build a model that produces the outcome one is looking for. Machine learning inverts that concept by defining the outcome and then learning the model mapping between the input and the outcome. This mapping is achieved by using large amounts of data, on which the machine learning algorithm tries different hypotheses to match the desired outcome.

There are three different learning types in machine learning: supervised learning, unsupervised learning and reinforcement learning. Supervised learning is the case where the model knows what is correct or not; for example, one has a set of pictures with and without cars, where every picture is labeled according to whether it contains a car. The model then learns the mapping from the pixels to the output car/not_car. In unsupervised learning the labels are not provided for the data set, so the model cannot get feedback on what is correct and needs to figure out the distribution by itself. The final one is reinforcement learning, where the model gets a reward for the action it has taken and the goal is to maximize the overall total reward. Reinforcement learning is generally used in games, where the model creates a bot that can solve the game. The report will use the supervised learning approach, where each packet's service type will be the output the model classifies.

2.5.2 Data

When a machine learning model is learning the mapping between the input and the desired output, one wants to be sure the mapping is a general solution that will work well when new data is classified. One can test this by splitting the data-set into two or three parts. The largest part of the split is the training data, on which the model trains the mapping. One can then test the performance of the model on the second split, called testing data, which tests how the machine learning model performs on unseen data. The standard ratio between training and testing data is around [80%, 20%], to make sure one has a sufficient amount of data for testing.

The second option is to split the data-set into three parts by introducing a validation set, making the split [60%, 20%, 20%] of the total data. The motivation for a validation set is to allow the model to change after training without introducing bias. This means one validates the model's performance using the validation set and then changes the setup of the model to perform better on it. After the model setup is changed, training is done for the new setup, and the new model is tested once again using the unseen validation data. This continues until one is satisfied with the classification performance. What has happened now is that the setup of the model has been optimized to perform best on the validation data: even if the model has never been trained on this data, the complexity selection of the model has, for instance, been based on the validation data performance. To then test the actual performance of the model, one tests it using the testing data, which the model has never been introduced to before. This result imitates how the model would perform in the future.

A limitation of using validation sets is that they require a large amount of data, as 20% less of the data is available for training. If one believes that will not leave enough training data, one faces a dilemma between bias and too little training data. A solution is random partitioning of the data into training and testing while still keeping the [80%, 20%] ratio. This allows the model to be trained and then tested several times with low bias. The model setup can then be changed and tested, where the new model setup takes another partition of the data as training and testing. This prevents the model from being optimized for any particular testing data. It reduces the bias, but requires a considerable number of iterations to improve the model.
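A minimal sketch of the [60%, 20%, 20%] split in Python; the fixed seed is an illustrative choice for reproducibility, not a requirement.

```python
import random

def split(data, train=0.6, validation=0.2, seed=0):
    """Shuffle and split data into training/validation/testing partitions."""
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = data[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    a = int(n * train)
    b = int(n * (train + validation))
    return shuffled[:a], shuffled[a:b], shuffled[b:]

train, val, test = split(list(range(100)))
print(len(train), len(val), len(test))  # 60 20 20
```

Re-running with a different seed gives the random repartitioning described above, while `train=0.8, validation=0.0` reproduces the plain [80%, 20%] split.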

Naïve Bayes

The Naïve Bayes classification is a supervised learning technique. It is founded on Bayes' theorem, creating a probabilistic classifier for predicting the outcome.

The name "naïve" comes from its assumption that the features are always independent, which is a naive approach. The concept of Naïve Bayes is that one first has a dedicated training set, a data-set used to build the probability space for the different classification labels based on the input features.

For example, let's say one has two input features A and B, where both are Boolean features. The outcome can be one of two labels, for instance C and D. With the training set, Naïve Bayes can find several useful probabilities:

P(C) = 1/3
P(A = True, B = True | C) = 1/9
P(A = False, B = True | C) = 1/2
P(A = True, B = False | C) = 1
P(A = False, B = False | C) = 0

Now when new data is to be classified, the model uses the probability space above to predict the label, where the label with the highest probability becomes the predicted output. Let's say a data point with features A = True and B = True should be classified; one can use Bayes' theorem to solve this.

P(C | A = True, B = True) = P(C) · P(A = True, B = True | C) / P(A = True, B = True)
P(D | A = True, B = True) = P(D) · P(A = True, B = True | D) / P(A = True, B = True)

Classification = Max(P(C | A = True, B = True), P(D | A = True, B = True))
Classification = Max(P(C) · P(A = True, B = True | C) / P(A = True, B = True), P(D) · P(A = True, B = True | D) / P(A = True, B = True))
Classification = Max(P(C) · P(A = True, B = True | C), P(D) · P(A = True, B = True | D))
Classification = Max(1/27, (1 − 1/3)(1 − 1/9))
Classification = D

The classification label becomes D. This is how Naïve Bayes handles classifications.
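The worked example can be checked numerically; the sketch below plugs in the probabilities from the text, with P(D) and P(A = True, B = True | D) taken as the complements used in the derivation above.

```python
# Prior and likelihood for label C, straight from the example.
p_c = 1 / 3
p_atbt_given_c = 1 / 9

# For label D the derivation uses the complements.
p_d = 1 - 1 / 3            # 2/3
p_atbt_given_d = 1 - 1 / 9  # 8/9

# The shared denominator P(A=True, B=True) cancels, so the scores suffice.
score_c = p_c * p_atbt_given_c   # 1/27
score_d = p_d * p_atbt_given_d   # 16/27
prediction = "C" if score_c > score_d else "D"
print(prediction)  # D
```

Since 16/27 is much larger than 1/27, the classifier picks D with a wide margin.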

Decision Tree

The goal of a Decision Tree is to decide, based on the input, where the data should travel in the tree. The tree is a structure with several directed paths, and the path selection is based on the input data values. Based on the decisions taken, the data travels through the tree and ends at a leaf. The leaf represents a classification, which is the output for the data.

This is better explained with an example. Let's continue with the previous example from the Naïve Bayes section, where we start by having a tree already generated, meaning we have already trained the algorithm.

The trained Decision Tree can be seen in figure 2.5.

Figure 2.5: The figure illustrates the concept of a Decision Tree as a classification model; in this case the outputs are C and D, and based on the Boolean features A and B the model will follow the correct path and determine the output. The figure is done using https://www.draw.io/.

Now, like the example before, we have new data to classify, containing the feature values A = True and B = False. Following figure 2.5, we may only take a path whose feature limit matches our feature value. The first path the data will select is A = True, meaning the left path from the root in figure 2.5. It will then select the path where B = False, ending up with a classification of D, classifying the data as label D.

The Decision Tree classification model is built by an algorithm called C4.5; in this report, Decision Tree refers to the C4.5 algorithm. The Decision Tree model is built from the root, top-down. Each feature in the training set will have a discretized value, meaning continuous features will be discretized [25]. The C4.5 algorithm creates paths from the root based on the feature values. If no features are left, or all data in one specific path belongs to the same class label, the path is finished [25]. How a path is separated into multiple paths is based on a heuristic.

The heuristic used in C4.5 tries to maximize the information gain it can get per split. This is done using two entropies, the expected information E(Data) and the information E_feature(Data). The idea is to split a path based on the feature that gives the most information about each class, so that the number of separations needed in the tree is as small as possible. This is achieved by first finding the expected information, meaning the information of the different labels in the training set Data [25]. This becomes equation 2.4 for the example above:

E(Data) = −(count(class C)/count(Data)) · log2(count(class C)/count(Data)) − (count(class D)/count(Data)) · log2(count(class D)/count(Data))    (2.4)

The information for the two features becomes:

E_A(Data) = (count(A = True)/count(Data)) · E(Data | A = True) + (count(A = False)/count(Data)) · E(Data | A = False)

E_B(Data) = (count(B = True)/count(Data)) · E(Data | B = True) + (count(B = False)/count(Data)) · E(Data | B = False)

C4.5 splits on the feature with the highest difference between the expected information and the information [25]:

Max(E(Data) − E_A(Data), E(Data) − E_B(Data))    (2.5)

This continues until the termination requirements are reached. The testing data can then use the generated Decision Tree to test its performance.
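Equations 2.4 and 2.5 can be illustrated with a short Python sketch; the four-row toy data-set below is made up so that feature A separates the classes perfectly.

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Expected information E(Data), equation 2.4."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """E(Data) - E_feature(Data): how much a split on `feature` tells us."""
    n = len(rows)
    split_entropy = 0.0
    for value in {r[feature] for r in rows}:
        subset = [l for r, l in zip(rows, labels) if r[feature] == value]
        split_entropy += len(subset) / n * entropy(subset)
    return entropy(labels) - split_entropy

rows = [{"A": True}, {"A": True}, {"A": False}, {"A": False}]  # toy data
labels = ["C", "C", "D", "D"]
print(information_gain(rows, labels, "A"))  # 1.0 (a perfect split)
```

C4.5 would compute this gain for every remaining feature and split on the one with the highest value, exactly as in equation 2.5.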

Support vector machine

Support vector machine (SVM) uses separation in the hyperplane for classification.

How it achieves this is to transform the input into a higher dimension, resulting in simpler separation [4]; this concept is illustrated in figure 2.6. The separation is done by first finding the maximum margin between the classes, based on the selected kernel function. The kernel function K for transforming the data into higher dimensions can for example be linear, polynomial or radial based [4], as seen here for two vectors x and y:

Linear:      K(x, y) = xᵀy + 1                      (2.6)
Polynomial:  K(x, y) = (xᵀy + 1)^p                  (2.7)
Radial:      K(x, y) = exp(−|x − y|² / (2p²))       (2.8)

Figure 2.6: The figure illustrates the classification concept using SVM; here, if a new data point is on the left side of the line it will be classified with the same label as the other points on the left side, and vice versa. The figure is from [4].
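A minimal Python sketch of the three kernel functions for two vectors; the example vectors and the default parameter values are made up, and the Gaussian form of the radial kernel is an assumption consistent with equation 2.8.

```python
from math import exp

def linear(x, y):
    """Equation 2.6: K(x, y) = x.y + 1."""
    return sum(a * b for a, b in zip(x, y)) + 1

def polynomial(x, y, p=2):
    """Equation 2.7: K(x, y) = (x.y + 1)^p."""
    return linear(x, y) ** p

def radial(x, y, p=1.0):
    """Equation 2.8: K(x, y) = exp(-|x - y|^2 / (2 p^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return exp(-sq_dist / (2 * p * p))

x, y = [1.0, 0.0], [0.0, 1.0]   # made-up example vectors
print(linear(x, y), polynomial(x, y), radial(x, y))
```

Each kernel gives the similarity of the two points in a (possibly implicit) higher-dimensional space, which is what the optimization below operates on.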

This is seen in figure 2.6. How this is done is to find the data points closest between the two classes, called support vectors and make a separation line between. The support vectors are the points in figure 2.6 that are on the dotted lines. To make sure the separation line is the optimal one, α is introduced to guide the direction of the separation line, making the margin between the line and the two classes data points ⃗x:s as large as possible. t is the output for the data point ⃗x. In support vector machine, one can only have a binary output, where one class label output is represented as -1 and the other as value 1. A data-set using several classes uses Support vector classifier instead, which will be discussed later. Maximizing the equation below will make the margin as large as possible between the classes and

(34)

the separation line:

Maximize:  Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j t_i t_j K(x_i, x_j)    (2.9)

Constraint:  0 ≤ α_i  ∀i    (2.10)

When the maximizing α:s are found, the support vectors x_sup are the data points where:

α_i ≠ 0  ⇒  x_i ≡ x_sup,i    (2.11)

When the support vector machine has been generated with the training data, testing data can be used to study the classification performance. This is done with equation 2.12:

Classification:  Σ_i α_i t_i K(x_i, x_sup,i)    (2.12)

If the result of equation 2.12 is negative, the data is classified as the negative class, -1. If it is positive, the data is classified as the class with the number 1. For instance, label C can be -1 and D will then be 1.

To handle more than two classes with SVM, one can use a Support Vector Classifier (SVC). SVC uses SVM with a one-vs-rest approach; the idea is to use several SVMs. Say one has three classes to classify: C, D and E. First, one creates an SVM with C in one group, with t as negative one, and D and E grouped together in the other group, with t as positive one. If test data now gets a negative value, it is classified as C. However, if it gets a positive value it can be D or E. The solution is to make one additional SVM where, for instance, D is in one group and C, E in the other. If the new SVM now gives a negative value, the data is classified as label D; if it gives a positive value, it is classified as label E.

2.5.3 Deep learning

Instead of one complex mapping function as in the methods described above, a deep learning neural network combines several simpler approximations into a complex mapping function f̂(x) ≈ f_n(...(f_2(f_1(x)))...) between input x and output y, y = f̂(x). This allows it to approximate difficult functions while still having a simple structure. These simpler approximation functions f_n(x) are called neural layers, and each neural layer contains several artificial neurons. The idea is inspired by neuroscience, which is why such a network is called a deep neural network (DNN) [26]. It is called "deep" because every layer after the input layer is connected to the previous layer, creating a "deep" mapping function. The network is fed the feature input data x into the first layer. The first four dots on the left side of the network in figure 2.7 (b) correspond to the artificial neurons in the first layer. Each artificial neuron has a weight assigned to it. This weight value changes during the training


(a) Concept of an artificial neuron. Notice that the activation function on the right represents what the sum becomes: if the sum is negative the value will be zero, and if positive the value will be one. Figure is from [4].

(b) The image shows the concept of a deep learning neural network. Notice that the two middle layers depend only on the previous layer; these are called hidden layers. Figure is from [4].

Figure 2.7: The two core ideas of a deep neural network are the artificial neuron and the neural network. These ideas can be seen in (a) an artificial neuron and (b) a deep neural network.

data phase, to approximate the mapping function better; this will be explained in section 2.5.4. When the input data is fed to the first layer's neurons, some of the neurons are activated, based on their weight values and the input values. Activated means those neurons will influence the next layer's neurons, while neurons that are not activated will not influence them for this particular data input. Generally, every artificial neuron in the previous layer is connected to the neurons in the next layer; this can be seen in figure 2.7 (a), where the neuron receives signals from the connected neurons in the previous layer. Whether a neuron is activated is determined by an activation function, for instance a unit step as shown in figure 2.7 (a). The activation function should not be linear, as it introduces non-linearity to the model, allowing it to approximate non-linear functions. As there are several layers, this non-linearity can approximate complex functions [26], which other machine learning models might not be able to do as efficiently. When the final layer in the network receives the signals from the previous neurons, the neuron with the highest signal in the last layer maps the data to an output classification label. For instance, in figure 2.7 (b) there are four neurons in the last layer, the right-side layer in the model. Each of these neurons corresponds to a classification label, for instance C, D, E, F. The neuron with the highest signal decides the classification label for this particular input feature data x. The deep learning model described above is called a feed forward network, also known as a multilayer perceptron (MLP).

2.5.4 Optimizer function

When one trains a deep learning model, one sends a batch of m data inputs x and outputs y through the model ˆf. The model performance is then evaluated using a loss function L, which measures the classification error over the batch. This loss value is propagated back through the network to change the artificial neurons' weights θ; this operation is called backward propagation. This is done by


taking the partial derivative of the loss function with respect to all weights, which becomes the gradient ˆg [26]:

ˆg ← (1/m) ∇_θ Σ_{i=0}^{m} L(ˆf(x_i | θ), y_i)

This gradient ˆg is later used to improve the weights of the model, minimizing the loss function by using an optimizer function [26].
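The batch-gradient definition above can be checked numerically. In this sketch the per-example loss is a squared error on a scalar linear model, an illustrative choice not taken from the thesis; the point is only that the analytic batch gradient agrees with a finite-difference approximation of the mean loss.

```python
import numpy as np

def loss(theta, x, y):
    # Per-example squared loss L(f(x|theta), y) for a scalar linear model.
    return (theta * x - y) ** 2

def grad(theta, x, y):
    # Analytic per-example derivative dL/dtheta.
    return 2.0 * x * (theta * x - y)

x = np.array([1.0, 2.0, -0.5])
y = np.array([2.0, 4.0, -1.0])
theta, m = 0.3, len(x)

# Batch gradient: average of the per-example gradients.
g_hat = (1.0 / m) * np.sum(grad(theta, x, y))

# Finite-difference approximation of the same quantity.
h = 1e-6
num = (np.mean(loss(theta + h, x, y)) - np.mean(loss(theta - h, x, y))) / (2 * h)
```

The two values coincide up to floating-point precision, confirming that ˆg is the gradient of the mean batch loss.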

Stochastic gradient descent

Imagine a function space with one axis for each of the neuron weights θ and one additional axis for the loss function value.

Stochastic gradient descent minimizes the loss function by moving along the gradient in the negative direction in the function space described above, with a step of learning-rate distance ϵk. The model weights are improved for K epochs, where an epoch here means one batch iteration. To make sure the learning-rate step does not overshoot and travel past the minimum, the learning rate declines with a decay α over the epoch iterations. Stochastic gradient descent for improving the deep learning model can be explained with algorithm 1 [26]:

Algorithm 1: Stochastic gradient descent (SGD)
Data: x, y
Result: Updates the model weights θ
for k in K do
    ˆg ← (1/m) ∇_θ Σ_{i=k}^{m+k} L(ˆf(x_i | θ), y_i)
    θ ← θ − ϵ_k ˆg
    ϵ_k ← ϵ_k − α
end
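The SGD loop with a decaying learning rate can be sketched for a simple model. Here the model is linear with a squared loss, and the data, batch size, learning rate, and decay are all made up for the example; the update structure (batch gradient, weight step, learning-rate decay) mirrors algorithm 1.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy noise-free regression data: y = X @ true_theta.
X = rng.normal(size=(100, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta

theta = np.zeros(3)
eps_k, alpha, m, K = 0.1, 0.001, 10, 50  # learning rate, decay, batch size, epochs

for k in range(K):
    idx = rng.integers(0, len(X), size=m)  # sample a mini-batch of size m
    xb, yb = X[idx], y[idx]
    # Gradient of the mean squared loss over the batch.
    g_hat = (2.0 / m) * xb.T @ (xb @ theta - yb)
    theta = theta - eps_k * g_hat          # move against the gradient
    eps_k = max(eps_k - alpha, 1e-4)       # decay the learning rate
```

After K epochs, theta has converged close to true_theta.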

Momentum

A problem with stochastic gradient descent is that it can get stuck in a local minimum. For instance, if the loss function has large spikes, stochastic gradient descent might just move between two spikes and never travel past them to find the true global minimum. A model optimizer function that handles this issue is momentum.

Momentum is inspired by momentum in classical physics [26]. It decreases oscillation by storing the previous gradient direction: if the new gradient direction for minimizing the loss function is the same as the previous one, the learning-rate "speed" v accumulates; if they point in opposite directions, the "speed" decreases. Using momentum for updating the weights for each
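The velocity-accumulation idea can be sketched on a one-dimensional quadratic loss. The update rule v ← βv − ϵˆg is the standard momentum form from [26]; the loss function, the momentum coefficient β, and the learning rate here are illustrative choices.

```python
import numpy as np

def grad(theta):
    # Gradient of the illustrative loss L(theta) = theta^2.
    return 2.0 * theta

theta, v = np.array([5.0]), np.array([0.0])
eps, beta = 0.01, 0.9  # learning rate and momentum coefficient

for _ in range(200):
    v = beta * v - eps * grad(theta)  # "speed" accumulates while the
    theta = theta + v                 # gradient keeps pointing one way
```

The accumulated velocity lets the weights roll through shallow spikes that plain SGD could get stuck between, while opposing gradients damp the oscillation near the minimum.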
