On the Usage of Artificial Neural Networks for Cyber-Physical Threat Detection in DETECT


Author: Samuel BÄCKSTRÖM & Elise ANJEL

Supervisor: Francesco FLAMMINI

Semester: 21VT

Subject: Computer Science

Bachelor Degree Project


Abstract

Keywords: cyber physical system, neural network, security, threat detection, reinforcement learning

This thesis explores how a detection engine using Artificial Neural Networks (ANNs) could be implemented within the DETECT framework. The framework is used for security purposes in cyber-physical systems. These are critical systems, often vital to important infrastructure, so discovering new ways to defend them against threats is of huge importance. However, there are many difficult challenges that need to be addressed before employing an ANN as a threat detection mechanism, most notably which kind of ANN to use, training data, and issues such as over-fitting. These challenges were addressed in the model that was created for this thesis. The model is based on the current literature and previous research: to make informed decisions about the model, a literature review was carried out to gather as much information as possible. A key insight from the review was the use of recurrent neural networks for threat detection.


1 Introduction
1.1 Background
1.2 Related work
1.3 Problem formulation
1.4 Motivation
1.5 Objectives
1.6 Scope/Limitation
1.7 Target group

2 Method
2.1 Reliability and Validity

3 Literature study
3.1 Literature review
3.1.1 Deep Learning for Detection of Routing Attacks in the Internet of Things
3.1.2 A deep learning model for secure cyber-physical transportation systems
3.1.3 Intelligent Sensor Attack Detection and Identification for Automotive Cyber-Physical Systems
3.1.4 Multilayer Perceptron with Binary Weights and Activations for Intrusion Detection of Cyber-Physical Systems
3.1.5 Cloud-Based Cyber-Physical Intrusion Detection for Vehicles Using Deep Learning
3.1.6 Real-Time Cyber-Physical False Data Attack Detection in Smart Grids Using Neural Networks
3.1.7 MAD-GAN: Multivariate Anomaly Detection for Time Series Data with Generative Adversarial Networks
3.1.8 Enhanced Cyber-Physical Security in Internet of Things Through Energy Auditing
3.1.9 Real-Time Sensor Anomaly Detection and Identification in Automated Vehicles
3.1.10 Artificial Intelligence for Detection, Estimation, and Compensation of Malicious Attacks in Nonlinear Cyber-Physical Systems and Industrial IoT
3.1.11 Anomaly Detection in Cyber Physical Systems Using Recurrent Neural Networks
3.1.12 SCO-RNN: A behavioral-based intrusion detection approach for cyber physical attacks in SCADA systems
3.1.13 Detecting control system misbehavior by fingerprinting programmable logic controller functionality
3.1.14 On-Line Error Detection and Mitigation for Time-Series Data of Cyber-Physical Systems using Deep Learning Based Methods
3.1.15 Deep Learning for Secure Mobile Edge Computing in Cyber-Physical Transportation Systems
3.1.16 Mitigating Data Integrity Attacks in Building Automation Systems using Denoising Autoencoders
3.1.17 Learning-Based Adversarial Agent Detection and Identification in Cyber Physical Systems Applied to Autonomous Vehicular Platoon
3.1.18 Cyber security of smart grids in the context of big data and machine learning
3.1.19 Dimensionality reduction and anomaly detection for cpps data using autoencoder

4 Results
4.1 Statistics
4.1.1 CPS industry
4.1.2 Neural Network classes
4.1.3 Learning methods
4.1.4 Citations
4.1.5 Year of publication
4.2 DETECT detection engine
4.2.1 The model feeder
4.3 ANN Models
4.3.1 Neural Network Types
4.3.2 Data preprocessing
4.3.3 Hidden layers
4.3.4 Training Method
4.3.5 Architecture
4.3.6 Structure of Anomaly Detector
4.3.7 Structure of Anomaly Classifier
4.3.8 Data Fusion

5 Implementation
5.1 Mock dataset
5.2 Model feeder

6 Analysis
6.1 Limitations
6.1.1 Data
6.1.2 Overfitting
6.1.3 Changes
6.1.4 Transparency
6.1.5 Performance
6.2 Neural Networks
6.2.1 Recurrent Neural Networks
6.2.2 Convolutional Neural Networks

7 Discussion
7.1 The literature review, its analysis and subsequent design artifacts
7.2 Model of a detection engine proof of concept
7.3 Partial implementation of the model feeder

8 Conclusion
8.1 Future work

References

Appendices
A Github repository
B Glossary


1 Introduction

This thesis explores the possibility of using an Artificial Neural Network (ANN) as a module inside the DETECT framework. DETECT (DEcision Triggering Event Composer & Tracker) is a framework for detecting threats against critical infrastructure. The ANN would serve as a threat-detection mechanism. While modelling the ANN is the focus of this thesis, a Model Feeder has also been implemented. The model of the ANN is based on the findings from a literature survey of the use of ANNs as the detection mechanism in a cyber-physical security context.

1.1 Background

Software systems together with physical infrastructure form what is called a cyber-physical system (CPS), which includes both physical and software sensors (such as door lock systems and intrusion detection systems) [1]. These systems are found in a wide variety of often critical domains, from energy infrastructure and communication to transportation. Because failure in these systems can have dire consequences (loss of life, significant loss of capital and so on), security is of huge importance. For example, a failure in the security systems of an autonomous vehicle may result in property damage or the loss of life of the driver or other road users. In recent years there has been an interest in using Artificial Intelligence (AI) techniques to solve this problem [2]. The main AI technique that has been used is the Artificial Neural Network (ANN), which has been employed mainly to classify and detect incoming threats. The rationale is that ANNs could adapt dynamically to existing threats to enhance the security of a system and analyze the data more heuristically, potentially alerting security operation center personnel (the amount of data is too large for security personnel to manage manually). In addition, unsupervised learning techniques can be used to detect unknown threats and adapt to them if encountered. The challenge is that ANNs often work as a "black box", meaning it can be very difficult to diagnose bugs and problems if they arise.

Artificial Neural Networks (ANNs) are computing systems that loosely resemble animal brains: they replicate, in simplified form, how neurons interact with each other inside a biological brain. This allows for complex computer systems that are capable of nonlinear computation, pattern recognition, generalization, learning and adaptation, and highly parallel (fast) processing [3]. The basic building block of an ANN is the perceptron, which is meant to mimic a neuron inside a brain. It is modeled as a simple unit of computation that holds a value, usually a real number between 0 and 1. The perceptron combines its incoming values, typically as a weighted sum passed through an activation function, to produce an output. The perceptrons are organized into layers, where the output from the perceptrons in one layer serves as input to the next layer, as seen in Figure 1.1.


Figure 1.1: (a) Represents a perceptron with incoming and outgoing values. (b) Represents a neural network with two hidden layers.

Source: [4]

The first layer in the network is called the input layer, the last layer is called the output layer and the intermediate layers are called hidden layers. There can be an arbitrary number of hidden layers. Because the hidden layers can be very numerous, it can be close to impossible to fully understand the contribution of each layer and neuron. This creates the "black box" scenario described earlier [5].
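The layered structure described above can be illustrated with a minimal sketch of a forward pass through a network with two hidden layers. The sigmoid activation, the layer sizes and the random weights are assumptions made for illustration only; they are not taken from the cited sources, and no training is performed.

```python
import math
import random


def sigmoid(x):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def perceptron(inputs, weights, bias):
    """Weighted sum of the incoming values followed by an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)


def forward(inputs, network):
    """Feed the inputs through every layer; each layer's output feeds the next."""
    activations = inputs
    for weight_rows, biases in network:
        activations = [
            perceptron(activations, weights, bias)
            for weights, bias in zip(weight_rows, biases)
        ]
    return activations


def random_network(sizes, seed=0):
    """Build untrained layers with random weights (illustrative values only)."""
    rng = random.Random(seed)
    network = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weight_rows = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
        biases = [rng.uniform(-1, 1) for _ in range(n_out)]
        network.append((weight_rows, biases))
    return network


# Hypothetical sizes: 3 inputs, two hidden layers of 4 perceptrons, 2 outputs.
net = random_network([3, 4, 4, 2])
output = forward([0.2, 0.7, 0.1], net)
```

Even in this small example the output depends on every weight in every layer, which hints at why the role of an individual neuron becomes hard to interpret as the number of layers grows.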

The DETECT (DEcision Triggering Event Composer & Tracker) framework has been developed to detect potential threats in cyber-physical space using a module-based software architecture [6]. To detect ongoing threats and alarm or notify cyber-physical security control rooms, the framework uses an event-history database containing the cyber and/or physical events that occur within a system, together with a repository of attack scenarios written in a custom Event Description Language.


Figure 1.2: The DETECT architecture Source: [6]

Because of the module-based architecture DETECT can use different techniques to serve as the detection module. These techniques include both Artificial Neural Networks and Bayesian Networks.

1.2 Related work

The DETECT framework was developed to deal with the increasing threats against cyber-physical systems and critical infrastructures. Its goal is to automate the early detection of threats within these systems. It uses non-trivial attack scenarios, which consist of attacks that are carried out in a predictable sequence; the attack scenarios are produced in a vulnerability assessment of the security process. The motivation, working principles and software architecture of the DETECT framework are presented in reference [6].

Previous research on DETECT detection engine modules has focused on using Bayesian Networks (BNs) as a detection module for detecting cyber-physical threats. The specific attack scenarios from the attack scenario repository within the DETECT framework were modeled as attack trees and then transformed into Bayesian Networks using a proposed Model-to-Model transformation. The transformed Bayesian Networks were then translated into machine-readable XML code using a proposed algorithm. The problem with this approach, as opposed to using an ANN, is that a BN model is much less dynamic: it can only respond to certain attack scenarios. This means that threats that have not been represented as an attack scenario by the system designers can go unnoticed. This risk is lessened by using an ANN [7].

1.3 Problem formulation

How would a detection engine using an ANN be modeled in the context of the DETECT framework? By surveying the current use of Artificial Neural Networks (ANNs) in cyber-physical security, this thesis aims to suggest how a DETECT detection module could be designed using an artificial neural network. Datasets for threat scenarios and security events would have to be developed as well. To limit the scope of the thesis, the ANN will not be implemented.


1.4 Motivation

As the digital transformation of society continues, the need for a focus on security in these systems becomes more and more pressing. Since incorporating security in a cyber-physical system is a difficult problem, DETECT aims to address the detection of threats in critical systems [6]. By leveraging the ability of a neural network to respond to novel scenarios, a more adaptable and powerful heuristic detection engine for use in DETECT could be possible, in comparison to a deterministic detection engine or one based on another heuristic (for example, Bayesian networks).

1.5 Objectives

O1 Literature Review
O2 Analysis of the findings
O3 Model of a detection engine proof of concept developed
O4 Partial implementation

1.6 Scope/Limitation

The biggest limitation of this thesis is the lack of an implementation of the ANN model. This means the model is untested and is backed up by previous research rather than experimentation. The reason for this is the lack of datasets available for model training, which makes the implementation of an ANN very difficult. Because there is no implementation to perform experiments with, the model is based on previous research on ANNs that were used within cyber-physical security. The model can therefore be used as a starting point for developing a functioning detection module using an ANN. The model is tailored to be used in the DETECT framework.

1.7 Target group

Groups that might be interested in this work include researchers who want to develop and implement a heuristic detection module inside the DETECT framework; cyber-security professionals, specifically in the domain of cyber-physical systems, who may want to implement an artificial neural network in a cyber-physical system; and computer science students and professors.


2 Method

The method used for this thesis is a literature review of the state of the art in the use of neural networks within cyber-physical systems. The knowledge and design artifacts gathered from the literature review were used to define solution characteristics for the DETECT detection engine. The knowledge regarding DETECT was used specifically for the detection engine components and the mock dataset in DETECT. The knowledge from the literature review was used to define the architecture, model and training methods for the ANNs.

The search terms were derived from the terms specific to this thesis. Since the problem is how a DETECT detection engine could be modeled using ANNs, terms such as "Artificial Neural Network" become relevant. DETECT is also a cyber-physical security system, so how other researchers have modeled ANNs in other cyber-physical systems is relevant as well. Several search strings were developed to retrieve previous research. The search strings were used in Elsevier's abstract and citation database Scopus to retrieve the literature.

Search Terms
Artificial Neural Network
Security
Cyber Physical System

Table 2.1: Base search terms.

Search string

TITLE-ABS-KEY ( "neural network" OR "deep learning" OR "reinforcement learning" ) AND TITLE-ABS-KEY ( "security" OR "detection" OR "attack*" ) AND ( LIMIT-TO ( SUBJAREA , "COMP" ) ) AND ( LIMIT-TO ( LANGUAGE , "English" ) ) AND ( LIMIT-TO ( EXACTKEYWORD , "Cyber Physical System" ) )

Table 2.2: Search string used in SCOPUS for the literature review.

Both inclusion and exclusion criteria were defined. These limitations were developed to restrict the scope of the literature review to the context of the thesis.

Research that was relevant to the thesis (based on the search terms) was manually selected from the search results. To see the full list of all the papers, see the Github repository in Appendix A.

The implementation of the model feeder was based on the developed detection engine model. The mock datasets in combination with the model feeder were used as a basis for experimentation.

2.1 Reliability and Validity

Since the thesis defines a general model that can be used as a basis for implementing an ANN-based DETECT detection engine, some implementation details are not defined, and some may change depending on the specific DETECT implementation and the CPS it is implemented within. Additional ANN implementation details may also change depending on the patterns in the datasets used to train the ANNs, for example to avoid over- or under-fitting the ANN.

Include Criteria
Models and implements a neural network
Subject Area: Computer Science
Language: English
Document Type: Any

Exclude Criteria
Year: Older than 10 years
Peer Reviewed: No
Modeling/implementation: No

Table 2.3: Inclusion/Exclusion criteria used to manually filter out literature irrelevant to the thesis.


3 Literature study

The search string yielded 81 document results in Scopus. After reading the abstract and applying the inclusion/exclusion criteria 33 documents were deemed relevant.

Search string results: 81
Relevant literature: 33
Reviewed literature: 19

Table 3.4: Literature review data.

Some of the relevant documents were excluded at the review stage because they did not fulfil the "Models and implements a neural network" inclusion criterion.

3.1 Literature review

3.1.1 Deep Learning for Detection of Routing Attacks in the Internet of Things

In this field there is a lack of datasets, so the authors of this study generated datasets of three specific routing attacks on IoT devices. The datasets were produced using the Cooja IoT simulator, simulating IoT networks containing 10 to 1000 nodes. The three attacks used to generate the datasets were the Decreased Rank attack, the Hello-flood attack and the Version Number Modification attack [8].

In the pre-processing of the raw datasets, some features were extracted or converted, the features were normalized and feature selection was applied. Data that the machine learning algorithm could not handle (such as IPv6 addresses) was converted to a Node ID instead, and the data was labeled. Quantile transform and min-max scaling were applied to the datasets to improve the performance of the ANN. Feature selection was applied to eliminate irrelevant and less relevant features and to obtain the optimal subset of features. It was implemented using random forests to reduce the noise and lower the variance of the datasets, histograms to determine the lowest and highest feature importances, and Pearson correlation coefficients to further normalize the data. The random forest feature selection rates the importance of each feature by assigning it a coefficient; the features of highest and lowest importance are discarded and the process is then repeated, in order to prevent over- and under-fitting (removing noise in the data) [8].
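Two of the pre-processing steps mentioned above, min-max scaling and the use of Pearson correlation coefficients, can be sketched as follows. This is a simplified illustration and not the pipeline from [8]: the feature names, the sample values, the use of the correlation with the label as a relevance filter and the 0.5 cut-off are all assumptions, and the random forest and quantile transform steps are omitted.

```python
import math


def min_max_scale(column):
    """Rescale a column of numbers linearly into the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant column carries no information
    return [(value - lo) / (hi - lo) for value in column]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant sequence
    return cov / math.sqrt(var_x * var_y)


# Hypothetical features for six traffic samples; label 1 marks an attack.
features = {
    "packets_per_second": [10, 12, 11, 95, 102, 99],
    "rank_changes": [0, 1, 0, 7, 9, 8],
    "node_id": [3, 3, 3, 3, 3, 3],
}
labels = [0, 0, 0, 1, 1, 1]

scaled = {name: min_max_scale(column) for name, column in features.items()}
relevance = {name: abs(pearson(column, labels)) for name, column in scaled.items()}

# Keep features whose correlation with the label exceeds an assumed cut-off.
selected = [name for name, score in relevance.items() if score >= 0.5]
```

In this toy data the constant `node_id` column is dropped because it cannot help to separate attack samples from normal ones.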

Deep learning was used to generate a predictive model for IoT attack detection. The ANN used to generate the predictive model had 7 hidden layers, making it a deep learning algorithm. The number of input nodes was equal to the number of features selected in the feature selection step of the pre-processing, and the output layer is a single node (regression mode). The hidden layers have 50, 100, 300, 100 and 50 nodes respectively. The ANN architecture was arrived at by trial and error (experimentation); the final architecture was decided on when satisfactory performance was achieved without over-fitting. The datasets were split into two sets, one X set and one Y set. The Y set remained labeled while the X set had the labels removed. The sets were then divided further into training data and test data to produce the sets X train, X test, Y train and Y test. Supervised learning was used with the labeled set and unsupervised learning was used with the unlabeled set. The test sets were used to test the generated predictive model [8].

The performance was measured using a confusion matrix built from real positive/negative and predicted positive/negative values. From this table the accuracy, precision, recall, F1-score, sensitivity and specificity of the predictive model were calculated. The techniques used were able to generate predictive models with high training and prediction performance (circa 95% and above) [8].
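As a minimal sketch of how such metrics follow from the four cells of a binary confusion matrix, the function below uses the standard textbook definitions. The counts in the example call are hypothetical and are not results from [8].

```python
def classification_metrics(tp, fp, fn, tn):
    """Derive common metrics from the four cells of a binary confusion matrix."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0  # also called sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts: 90 attacks caught, 10 missed, 5 false alarms, 95 true negatives.
metrics = classification_metrics(tp=90, fp=5, fn=10, tn=95)
```

Sensitivity and recall are the same quantity, which is why only one of them is computed.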

The biggest limitation is the lack of datasets, so only the generated datasets were used to train and generate the predictive models. The resulting models are only able to predict the three attacks that the datasets represent [8].

3.1.2 A deep learning model for secure cyber-physical transportation systems

The paper sets out to develop a deep learning model that can detect eavesdropping and jamming attacks on the wireless networks used in cyber-physical transportation systems. The model pre-processes data from traffic sensors and from mobile devices' APK files (Android applications) and their log files to select the features to use in the learning process. The system uses a combination of Restricted Boltzmann Machines (RBMs) and a deep belief network: several RBMs are stacked upon each other, so that the output of one RBM is the input to the next. Unlabeled data is used to pre-train an RBM (unsupervised), and labeled data is then used to fine-tune it (supervised, using back-propagation). The deep belief network is trained using the output of the stacked RBMs in combination with labeled data (supervised) [9].

The evaluation of the model shows an average gain of 6% in the accuracy of detected attacks in comparison to other machine learning models. The model has a 12.61% higher accuracy than softmax regression and a 2.61% higher accuracy than random forests [9].

3.1.3 Intelligent Sensor Attack Detection and Identification for Automotive Cyber-Physical Systems

In automotive cyber-physical systems, a heterogeneous set of sensors might be used for braking and steering a vehicle. The sensors are vectors for several types of attacks, such as physical attacks (physically damaging or stealing a sensor), Denial of Service attacks and deception attacks (changing or inserting data). The research focuses on detecting deception attacks. To detect these types of attacks, Recurrent Neural Networks (RNNs) can be employed, more specifically Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) RNNs. These types of RNN address the vanishing gradient problem that RNNs may have (with gradient-based learning methods, back-propagation may fail to adjust the weights in the ANN) [10].

The proposed detection strategy is to train an LSTM or GRU RNN that represents the normal operation of the sensors. The data from a left-wheel encoder, a right-wheel encoder and the Inertial Measurement Unit (IMU) is pre-processed with min-max normalization and then fed into the trained RNN, which outputs the state (normal state, IMU attack, left-wheel encoder attack, right-wheel encoder attack or a combination of the attacks). The architecture consists of an RNN (LSTM or GRU) layer, a batch normalization layer and a softmax layer. The output of the RNN is normalized to reduce overfitting of the model. The softmax layer classifies the data, employing cross-entropy methods. The architecture has 90 hidden neurons [10].
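The role of the final softmax layer and the cross-entropy loss can be illustrated with a small sketch. The raw scores below stand in for the output of the recurrent and batch normalization layers and are made up; the state names are simplified from the description above and the combined-attack states are left out.

```python
import math

# Simplified state labels (assumed names; combined attacks are omitted).
STATES = ["normal", "imu_attack", "left_encoder_attack", "right_encoder_attack"]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def cross_entropy(probabilities, true_index):
    """Loss for one sample: small when the true class gets high probability."""
    return -math.log(probabilities[true_index])


# Hypothetical raw scores, standing in for the output of the RNN layers.
scores = [0.3, 2.1, -0.5, 0.0]
probabilities = softmax(scores)
predicted_state = STATES[probabilities.index(max(probabilities))]
loss = cross_entropy(probabilities, STATES.index("imu_attack"))
```

During training the loss would be averaged over many samples and minimized by back-propagation; here it is only evaluated once to show the calculation.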

The experiments showed that the GRU and LSTM implementations had higher precision than other machine learning models such as a standard RNN, a plain neural network, support-vector machines and Principal Component Analysis. The results of the experiments are presented using a confusion matrix showing the actual class (attack) and the predicted class [10].

3.1.4 Multilayer Perceptron with Binary Weights and Activations for Intrusion Detection of Cyber-Physical Systems

Various host-based intrusion detection techniques have been applied to CPSs, but a problem with host-based intrusion detection is that if the host itself is compromised, other parts of the CPS are not able to detect intrusions. The research focuses on developing an intrusion detection system implemented using a Multilayer Perceptron (MLP) ANN that can run on low-powered devices. Since the MLP needs to run on low-powered devices, the computational resources it requires need to be low. Using binary weights and activations can reduce the computational resources required without losing accuracy [11].

The KDD'99 dataset from the Fifth International Conference on Knowledge Discovery and Data Mining was used to train the MLP using supervised learning. The dataset contains 22 different attacks as well as normal data and has 316174 samples ($N_s$). The data was split into 80% training and validation data and 20% test data, and was normalized before the network was trained. The model consisted of 41 input nodes (the number of features in the dataset, $N_i$) and 23 output nodes (the number of attacks in the dataset plus normalcy, $N_o$). It had 3 hidden layers, and the number of hidden nodes ($N_h$) was calculated using the following formula, where $\varphi$ is a scaling factor between 2 and 10 [11].

$$N_h = \frac{N_s}{\varphi \cdot (N_i + N_o)} \tag{1}$$

Figure 3.3: Formula used to determine the amount of hidden nodes [11]
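Plugging the figures quoted above into formula (1) gives a feel for the range of network sizes it produces. The sketch below is only a worked calculation; the function name is an assumption, and whether the result is spread over the three hidden layers or applies to each of them is not stated here.

```python
def hidden_nodes(n_samples, n_inputs, n_outputs, phi):
    """Formula (1): N_h = N_s / (phi * (N_i + N_o)), with phi between 2 and 10."""
    if not 2 <= phi <= 10:
        raise ValueError("phi is expected to lie between 2 and 10")
    return n_samples / (phi * (n_inputs + n_outputs))


# Values quoted in the text: 316174 samples, 41 inputs, 23 outputs.
for phi in (2, 5, 10):
    print(phi, round(hidden_nodes(316174, 41, 23, phi)))
```

A larger scaling factor yields a smaller network, so the factor acts as a knob that trades model capacity against the risk of over-fitting.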

The results show that the ANN model works: it is able to predict the attacks from the dataset. However, it has a higher error rate than other types of machine learning techniques. The model has an error rate of 1.241 whilst a non-binarized MLP has an error rate of 0.307. The benefit of this model is that it can run on low-powered devices [11].

3.1.5 Cloud-Based Cyber-Physical Intrusion Detection for Vehicles Using Deep Learning

Since the on-board devices in vehicles tend to have low computing power, this research focuses on offloading intrusion detection onto a cloud-based intrusion detection system (IDS) that implements artificial neural networks. A robotic vehicle was engineered to act as a testbed for the cloud-based IDS. The datasets were collected from the test vehicle by defining specific features for the neural network to use as input: incoming network traffic rate, outgoing network traffic rate, total CPU utilisation, disk write rate, accelerometer output, power consumption, current used by the vehicle, readings from the sensors that measure wheel positions, and a label (for supervised learning). Data was collected for normalcy as well as for several attacks, including a Denial of Service (DoS) attack, a command injection attack and a malware attack (which adds network delay) [12].

Two deep neural networks were developed, one MLP network and one RNN with LSTM. The networks were tested with 600, 800 and 1000 hidden nodes (for experimental purposes). They use a dropout ratio of 0.3 to reduce overfitting, a validation ratio of 0.3 and binary cross entropy as the loss function. The RNN uses a sigmoid activation function and the MLP uses a Leaky Rectified Linear Unit (LReLU) activation function [12].
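The building blocks named above can each be written in a few lines. The sketch below shows the two activation functions, the binary cross-entropy loss and dropout with the quoted ratio of 0.3. The negative slope of 0.01 for the leaky unit and the rescaling of kept activations ("inverted" dropout) are common conventions assumed here; they are not stated in the text.

```python
import math
import random


def sigmoid(x):
    """Sigmoid activation: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def leaky_relu(x, negative_slope=0.01):
    """Leaky ReLU: identity for positive inputs, small slope for negative ones."""
    return x if x > 0 else negative_slope * x


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Loss for one binary prediction; y_pred is clipped to avoid log(0)."""
    p = min(max(y_pred, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


def dropout(activations, rate=0.3, rng=None):
    """Zero each activation with probability `rate`; rescale the survivors."""
    rng = rng or random.Random(0)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


hidden = [leaky_relu(v) for v in (-2.0, 0.5, 3.0)]
dropped = dropout(hidden, rate=0.3)
loss_good = binary_cross_entropy(1, sigmoid(2.0))   # confident and correct
loss_bad = binary_cross_entropy(1, sigmoid(-2.0))   # confident and wrong
```

Dropout is only applied during training; at prediction time all activations are kept.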

The infrastructure used to offload the data implemented HTTP as the application protocol (and therefore TCP as the transport protocol). The data was sent to a local gateway, which used an SSH tunnel through a WAN to reach the cloud environment where the IDS was running. This introduced some network latency [12].

The deep neural network implemented as an RNN with LSTM achieved the best overall accuracy for detecting the known attacks (86.9% accuracy). The number of hidden neurons that produced the best result whilst still adding only a small amount of detection latency was 800. A test in which an unseen malware attack was employed was also carried out, and the RNN performed best at predicting that attack as well (66.9%) [12].

3.1.6 Real-Time Cyber-Physical False Data Attack Detection in Smart Grids Using Neural Networks

The researchers set out to create a cyber-physical attack detection (CPAD) system, implemented using neural networks, that can detect data integrity errors and attacks (such as replay attacks) from sensors in an electric grid cyber-physical system. The electric grid used in the research follows the IEEE 30-bus power system standard. Instead of teaching the neural network through physics-based features (the physics that a power system follows), data is used to allow the neural network to learn the patterns of the system [13].

The dataset used to train the network was generated by reading sensors from a system in a state of normalcy. An attack generation engine was created that generated a new set of data representing spoofed sensor data (which may represent replay attacks and other data integrity attacks). The final dataset contained labeled data consisting of 10000 normal system states and 300000 system states with spoofed sensor values [13].

The neural network was implemented as a feedforward neural network with 150 input nodes (of which 30 represented physical features) and 31 output nodes (one node for each bus in the IEEE 30-bus power system as well as a node that represented normalcy). Softmax was used to classify the outputs. Experiments were run with 0 hidden nodes, a layer of 20 hidden nodes and a layer of 60 hidden nodes. The network was trained on the generated labeled dataset using supervised learning [13].

The results showed that the neural network was able to detect data integrity attacks at a higher rate than other machine learning techniques (such as SVM with or without physics-based features). The neural network with no hidden layers had a very low accuracy of 8%, whereas the implementations with a 20 hidden-node layer and an 80 hidden-node layer had accuracy rates of 97% and 99% respectively [13].

3.1.7 MAD-GAN: Multivariate Anomaly Detection for Time Series Data with Generative Adversarial Networks

The paper focuses on anomaly detection in cyber-physical systems. Since traditional anomaly detection techniques such as threshold-based detection tend to be inadequate in complex cyber-physical systems, a new framework for detecting anomalies (with a focus on cyber attacks on the CPS) was proposed, developed and tested. The framework combines several machine learning and neural network techniques, such as Generative Adversarial Networks (GANs) and a Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) architecture, to detect anomalies within a time series distribution and find the temporal correlations. The proposed framework goes by the name Multivariate Anomaly Detection with GAN (MAD-GAN) [14].

The MAD-GAN framework consists of two LSTM-RNNs, one for the discriminator (D) and one for the generator (G). The generator generates fake data and the discriminator attempts to distinguish the fake data from the real data. The output of D (how accurately fake and real data are discriminated) is what triggers the backpropagation of the neural networks. The networks are trained using unsupervised learning. The data (real or fake) is not handled as independent points but is divided into sub-sequences using a sliding window. The sliding window size is determined by the equation $s_w = 30 \cdot i$, $i = 1, 2, \ldots, 10$; the value of $i$ was varied in the experiments to determine the sliding window size that produced the best results. The number of sub-sequences $m$ derived from the dataset is given by $m = \frac{M - s_w}{s_s}$, where $s_s$ is the step size of the sliding window. Anomaly detection relies on two properties: the discriminator can distinguish fake data produced by the generator, which represents an anomaly, from real data, which represents the state of normalcy in the CPS, and the generator can produce samples that are similar to the real data. The space between the real data and the data produced by the generator can then be mapped to anomalies in the CPS [14].
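The sliding-window step can be sketched as follows. The sketch assumes that M denotes the number of time steps in the series, which is not spelled out above, and it uses a made-up series and step size. Because it includes the window that starts at index 0, it produces one window more than the expression (M - s_w) / s_s.

```python
def window_size(i):
    """Window size s_w = 30 * i for i = 1, ..., 10, as stated in the text."""
    if not 1 <= i <= 10:
        raise ValueError("i is expected to lie between 1 and 10")
    return 30 * i


def sliding_windows(series, window, step):
    """Split a series into overlapping sub-sequences of length `window`."""
    last_start = len(series) - window
    return [series[start:start + window] for start in range(0, last_start + 1, step)]


# Hypothetical series with M = 300 time steps and an assumed step size of 10.
series = list(range(300))
s_w = window_size(1)
s_s = 10
windows = sliding_windows(series, s_w, s_s)
```

Each sub-sequence would then be passed to the generator and discriminator as one training sample, so that temporal context within the window is preserved.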

Datasets were generated using both a Secure Water Treatment (SWaT) system and a Water Distribution (WADI) testbed (which is a subsection of the SWaT). These cyber-physical systems had components such as sensors and actuators from which data was collected. The systems were also attacked with different cyber attacks for varying durations to generate the attack data [14].

The metrics used to measure the framework's performance were Precision, Recall and Accuracy (F1). The metrics were calculated from true positives/negatives (TP/TN) and false positives/negatives (FP/FN) using the following equations [14].

P recision = T P

T P + F P (2)

Recall = T P

T P + F N (3)

(17)

Accuracy = 2 ∗ P recision ∗ Recall

P recision + Recall (4)
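As a worked illustration of equations (2) to (4), the following sketch computes the three metrics from raw counts. The function names and the example counts are made up for illustration and are not taken from [14]. Returning 0.0 when a denominator is zero is an assumption of this sketch.

```python
def precision(tp, fp):
    """Equation (2): share of reported attacks that were real attacks."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Equation (3): share of real attacks that were reported."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(tp, fp, fn):
    """Equation (4): harmonic mean of precision and recall."""
    p = precision(tp, fp)
    r = recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0


if __name__ == "__main__":
    # Illustrative counts only.
    print(precision(80, 20), recall(80, 20), f1_score(80, 20, 20))
```

The true negatives do not appear in any of the three equations, which is why these metrics remain informative on unbalanced datasets where normal samples dominate.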

The results showed that the proposed framework generally performed better compared to other anomaly detection techniques. The framework was compared to techniques such as Principal Component Analysis (PCA) and K-Nearest Neighbors (KNN). The optimal sliding window size for the SWaT dataset was i = 3. The framework did not perform as well on the WADI dataset, due to the highly unbalanced data within the set [14].

3.1.8 Enhanced Cyber-Physical Security in Internet of Things Through Energy Auditing

Since IoT devices are vulnerable to both physical and cyber attacks, attack detection needs to be able to detect both types. The proposed mechanism employs energy auditing, because a device shows a pattern of increased energy use when it is subjected to a physical or cyber attack. The energy pattern can also indicate the specific attack being detected [15].

The algorithm in the proposed system uses energy audits as input and outputs whether the device is compromised or not. Before the data is fed into the ANN, it is preprocessed.

The preprocessing consists of conditioning, interpolation, noise removal and normalization. Conditioning is applied to compensate for the potential loss of data samples and to sustain a stable sampling frequency. Noise removal is done with a median filter because it preserves sharp edges in the data. The deep learning ANN is implemented using two Convolutional Neural Networks (CNNs), one for disaggregation and one for aggregation. The disaggregation CNN model predicts the power consumption of the CPU, the network transmit (TX) and the disk from the general power usage, and the aggregation CNN model predicts the general power consumption from the current power consumption readings of the CPU, TX and disk. Anomalies are detected when the actual power consumption deviates from the predictions by more than a threshold. The choice of threshold determines the sensitivity of the security system [15].
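Two of the steps described above, median filtering and threshold-based detection, can be sketched as follows. This is an illustration under assumptions and not the system from [15]: the kernel size, the threshold and the predicted values are made up, and the predictions would in practice come from the CNN models.

```python
def median_filter(samples, kernel=3):
    """Replace each sample by the median of its neighbourhood.

    Near the edges the neighbourhood is truncated; for an even-sized
    neighbourhood the upper median is used.
    """
    half = kernel // 2
    filtered = []
    for i in range(len(samples)):
        neighbourhood = sorted(samples[max(0, i - half):i + half + 1])
        filtered.append(neighbourhood[len(neighbourhood) // 2])
    return filtered


def flag_anomalies(actual, predicted, threshold):
    """Flag samples whose actual power deviates from the prediction by more than threshold."""
    return [abs(a - p) > threshold for a, p in zip(actual, predicted)]


if __name__ == "__main__":
    print(median_filter([1, 1, 9, 1, 1]))   # the isolated spike is removed
    print(median_filter([0, 0, 5, 5, 5]))   # the step edge is preserved
    print(flag_anomalies([2.0, 2.1, 3.5], [2.0, 2.0, 2.0], threshold=0.5))
```

Lowering the threshold makes the detector more sensitive but also more likely to raise false alarms on ordinary fluctuations in power consumption.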

The testbed for collecting the datasets and running the experiments was built using 12 Raspberry Pi units connected in a mesh network. The CNN models were trained using data collected from the testbed.

In experiments where both physical and cyber attacks were employed on the testbed, the security system was able to detect the attacks and determine the type of attack being employed [15].

3.1.9 Real-Time Sensor Anomaly Detection and Identification in Automated Vehicles

Since sensor anomalies are a challenge in Connected and Automated Vehicles (CAV), this research sets out to detect sensor anomalies in real time using artificial neural networks, amongst other techniques. Sensor anomalies can come from faulty sensors, cyber attacks or physical attacks. Sensor behaviour that can be considered anomalous is divided into the following types [16].


Miss           No data
Instant        Instant change between two sensor readings
Constant       Constant change that differs from normalcy
Gradual drift  A gradual shift from normal behaviour to abnormal behaviour
Bias           Offset data by a certain bias

The research focuses on all of these anomalies excluding the miss anomaly [16].

The proposed sensor anomaly detection system uses a Convolutional Neural Network (CNN) in combination with Kalman Filters. Data is read into the CNN using a sliding window with a fixed length. After the CNN has labeled the data, any sensor readings detected as anomalous are discarded and the normal readings are fed into the Kalman Filters, which perform further anomaly detection. The readings that remain normal are output as a fusion and the anomalous readings are discarded [16].

The CNN is trained using a dataset of labeled data, i.e. supervised learning. The CNN was implemented using three convolutional layers with max-pooling (40, 60 and 60 filters), two layers of random dropout with a 0.1 rate (to avoid overfitting) and a Rectified Linear Unit (ReLU) activation layer. The network was trained using a batch size of 128 [16].

The datasets used were from the research data exchange database for the Safety Pilot Model Deployment program, and some were generated specifically for this research. Several attacks/misreadings were injected while the data was being collected. They were meant to represent four sensor anomalies: instant, constant, gradual drift and bias [16].

The results show that the proposed system scores high in accuracy, sensitivity, precision and F1. In F1 it generally outscores anomaly detection using just Kalman filters or just CNNs for all anomaly types. The exception is the constant anomaly type, where Kalman filters outperform the system when the anomaly magnitude is lower [16].

3.1.10 Artificial Intelligence for Detection, Estimation, and Compensation of Malicious Attacks in Nonlinear Cyber-Physical Systems and Industrial IoT

Cyber-security alone is not enough to ensure the safety of a CPS. The paper therefore employs a control system to compensate for the physical nature of the CPS. The control system contains a nonlinear controller that uses a variable structure method. The estimator that estimates attacks online uses a Gaussian Radial Basis Function Neural Network (GRBFNN) [17].

Two types of attacks are considered, deception and stealth. Deception attacks alter the control packets sent to the CPS over the network. In stealth attacks the sensors are interfered with, either by altering them physically or by modifying the communication packets they use [17].

For a CPS such as a heavy duty vehicle with cruise control, the purpose is to keep the velocity of the vehicle stable. This system keeps the vehicle stable even in the presence of attacks by compensating for the anomalies detected. This is done by using the Lyapunov stability function as the learning method for the GRBFNN [17].

3.1.11 Anomaly Detection in Cyber Physical Systems Using Recurrent Neural Networks

Labelled attack data for supervised learning can be difficult to obtain. Unsupervised learning has the advantage in that it can use unlabeled data, but early implementations of this approach have proven to have a high false positive rate. This paper uses a Long Short Term Memory Recurrent Neural Network [18].

Time-series data can be useful in detecting anomalies since cyber-attacks on CPSs usually occur over a period of time. RNNs have been used previously in applications where the integration of time-series data has been crucial. An example of this is an Intrusion Detection System that was used for monitoring network traffic. In that case a static threshold was used to determine if an event could be considered an attack, which had the unfortunate consequence of producing many false positives. The LSTM-RNN in this paper does not just look at a single point in time but takes into consideration a sequence of data over a period of time. This results in a lower false positive rate [18].

The dataset used to train the network was an open source dataset called SWaT. It contains data collected from a water treatment plant, more specifically the data generated from the sensors controlling the pumps and valves. The network was trained on this data to recognize normal behaviour so that any deviations from it could be detected. The attacks happened consecutively within 10 minutes of each other. They consisted of fooling the PLC controlling the valves, meaning the sensors sent spoofed values. According to the authors, 9 out of 10 attacks were successfully detected [18].

3.1.12 SCO-RNN: A behavioral-based intrusion detection approach for cyber physical attacks in SCADA systems

Supervisory control and data acquisition (SCADA) systems monitor and control critical infrastructure such as power grids and nuclear power plants. The researchers in this paper proposed a sine-cosine optimization based recurrent neural network (SCO-RNN) to detect cyber-physical attacks against SCADA systems. In this approach the parameters of the network are optimized with a sine-cosine algorithm. This resulted in a higher accuracy rate compared to other neural networks [19].

3.1.13 Detecting control system misbehavior by fingerprinting programmable logic controller functionality

This paper uses power analysis of PLCs to detect intrusions. Both current and voltage data are used for the analysis. The idea is to identify any abnormal behaviour of a specific program by analysing the current and voltage data during runtime. Two different machine learning models were used to classify each program and to detect any deviations from the normal operations of each program. Random Forests (RFs) were used to analyze large data sets and CNNs were used to classify data [20].


The advantage of using a CNN is that it does not require much feature engineering and the data can be used as input “as is”. The results from the paper indicate that it is possible to classify PLC programs with a fairly high degree of accuracy, at 99.08% from a single run of 10 programs. The RF method performed better thanks to the rolling average of the data. This was not done when testing the CNN, as the input was in an image format. It was difficult to classify programs that were similar [20].

3.1.14 On-Line Error Detection and Mitigation for Time-Series Data of Cyber-Physical Systems using Deep Learning Based Methods

Since the complexity of CPSs has increased over time, the traditional methods of finding faults have become less effective. New methods using machine learning have an advantage over traditional methods in that they do not require domain knowledge, since they can extract features from the data by training. This in part has led to deep learning techniques dominating in terms of accuracy when used for finding anomalies among data sets [21].

The researchers in this paper propose to use a Long Short Term Memory neural network for error detection. The reason given for this is that LSTM networks work well with time-series and sequential data, which the input to the CPS usually consists of. Two networks were trained and used in this paper, an LSTM network for single-step prediction and an LSTM encoder-decoder network for multi-step prediction. LSTMs, which are a variant of RNNs, are used because of their ability to find patterns over a long series of sequential data. They do this by using memory cells to store information about previous steps. This makes them suitable to use for time-series data [21].

3.1.15 Deep Learning for Secure Mobile Edge Computing in Cyber-Physical Transportation Systems

The paper presents a model based on deep learning that uses unsupervised learning as the training method to detect intrusions in a Mobile Edge Computing (MEC) environment. This allows the use of unlabeled data [22].

The deep belief neural network is trained to detect attack features/patterns in an MEC environment where many heterogeneous wireless devices are connected. Unsupervised learning was used to learn the features of attacks against systems in this environment [22].

The model consisted of two parts, a feature preprocessing engine and a detection engine. The feature preprocessing engine uses dynamic learning to learn attack features. The features used in this study were required permissions, sensitive APIs and dynamic behaviour. The features are extracted from APK files and are used as input into the attack detection engine [22].

In the attack detection engine there are two modules, a feature learning module and a prediction output module. The feature learning module is based on a deep belief network and the prediction output module is based on a softmax function [22].

The deep belief network combines a set of unsupervised networks. In this case they are so-called “restricted Boltzmann machines (RBM)”. These networks can be stacked and the output from one can be used as input in another. 512 of them were used in total, a number that was arrived at by experimentation. Unlabeled, labeled and location features were used for training [22].

During the experimental testing of the network, ten different data sets were used. There were 500 malicious attack samples in each data set and 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500 and 5000 non-malicious samples respectively. The model was compared to 4 other algorithms using the same data: softmax regression, decision tree, support vector machine and random forest. The output from the softmax function was either a 1 or a 0 depending on whether the sample was a malicious attack or not [22].

The accuracy was defined as the number of correctly classified samples divided by the total number of samples. The accuracy results show that the proposed model's accuracy is 12.6% higher than softmax regression, 5.76% higher than decision tree, 3.20% higher than support vector machine and 2.61% higher than random forest. The performance of the model improved as the size of the training dataset increased. Another factor was the ratio of malicious to benign samples in the dataset. If the difference between the two types of samples was large, the model performed better in detecting attacks [22].

Another factor that greatly improved the model was the number of iterations or “epochs”. The error rate rapidly decreased as the number of epochs approached 200. However, this runs the risk of overfitting the model to the training data, which would make it perform worse when trying to detect attacks among novel samples. The model has an accuracy improvement over 4 other machine-learning based algorithms [22].

3.1.16 Mitigating Data Integrity Attacks in Building Automation Systems using Denoising Autoencoders

The paper presents a deep learning framework for defending against attacks in a building automation system (BAS). A BAS uses sensor data to regulate the building's heating, ventilation and air conditioning, which means that it is vulnerable to attacks that change this data [23].

A denoising autoencoder (DAE) was developed for sensor data correction. The testing was conducted using real-world data gathered from a retrofitted AC, with injected data that simulated attacks. The model was able to mitigate the attacks, compared to when no correction was used [23].

An autoencoder is a three layer feedforward neural network that is composed of an input layer, a hidden layer and an output layer. The input and output layers have the same number of nodes, while the hidden layer typically has fewer. The idea is to encode the input into the hidden layer and decode it at the output layer. This transforms the data and forces the network to learn its most noticeable features [23].

A denoising autoencoder works the same as an autoencoder with the exception that a certain random percentage of the input is modified to make it more resilient when exposed to data that is partially corrupted. The K-fold cross validation method was used for training and testing. The data was separated into 7 groups with data from 5 random days each [23].
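The corruption step that distinguishes a denoising autoencoder can be sketched as follows. This is only an illustration: setting the chosen inputs to zero (masking noise) is an assumption of this sketch, as the kind of modification used in [23] is not described here, and the function name is hypothetical.

```python
import random


def corrupt(inputs, fraction, rng):
    """Return a copy of inputs where a random fraction of the values is set to 0.0.

    The autoencoder is then trained to reproduce the clean inputs from
    this corrupted copy. Masking with zeros is an assumed noise model.
    """
    count = round(len(inputs) * fraction)
    chosen = set(rng.sample(range(len(inputs)), count))
    return [0.0 if i in chosen else value for i, value in enumerate(inputs)]


if __name__ == "__main__":
    rng = random.Random(0)               # fixed seed for a repeatable example
    clean = [21.5, 21.7, 21.6, 21.9, 22.0, 22.1, 21.8, 21.6, 21.4, 21.5]
    noisy = corrupt(clean, fraction=0.3, rng=rng)
    print(noisy)                         # training pair: (noisy, clean)
```

Because the training target is the clean input, the network has to learn the relations between sensor values instead of copying its input, which is what lets it correct partially corrupted data.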


3.1.17 Learning-Based Adversarial Agent Detection and Identification in Cyber Physical Systems Applied to Autonomous Vehicular Platoon

This paper investigates the effectiveness of using fully connected deep neural networks (FCDNN) and convolutional neural networks (CNN) to detect attacks on a fixed size vehicle platoon using only the data from the sensors in the platoon. The networks need to not only detect the attacks but also identify which vehicle they came from [24].

Two different scenarios were under consideration. In the first one, called the ‘global scenario’, the network receives data from all the vehicles to determine the vehicle that is executing the attack. In the second scenario each vehicle had its own independent network and analyzed the incoming data from each of its immediate neighbors. The experimental results show that both the FCDNN and the CNN are successful in identifying an attack, with an accuracy approaching 97.5% in the case of the CNN using only local sensor information. The limitation of the networks in this paper is that they can only be used for a vehicle platoon of 10 vehicles. For any other size a new network would have to be built and trained [24].

An attack consisted of a vehicle causing instability in the platoon, which makes the vehicles accelerate and decelerate abruptly. To detect this, the distance between the vehicles and their velocity were measured. This turns the problem into a pattern recognition problem, which neural networks are well suited to solve. Two neural networks were used, FCDNN and CNN [24].

A total of 1000 training samples were used for attack detection, of which 500 were attacks. For attack identification 4000 samples were used as training data for each vehicle. All the data was normalized. The accuracy was calculated using a confusion matrix, where the sum of the diagonal elements (SDE) was divided by the total number of test samples (TTS). Another metric that was used was precision. For the CNN in the global scenario, using range data with maximum noise produced an average precision of 96% in identifying the attacking vehicle. With the velocity data this was lowered to 93%. In the local case, where only the sensor data from the neighbors was used, the precision was 91% using range data and 90% using velocity data [24].
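The accuracy calculation described above, SDE divided by TTS, can be sketched as follows. The labels and predictions are made up for illustration and are not results from [24], and the function names are hypothetical.

```python
def confusion_matrix(actual, predicted, n_classes):
    """Build a matrix where row = actual class and column = predicted class."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix


def accuracy(matrix):
    """Sum of the diagonal elements (SDE) divided by the total number of test samples (TTS)."""
    sde = sum(matrix[i][i] for i in range(len(matrix)))
    tts = sum(sum(row) for row in matrix)
    return sde / tts


if __name__ == "__main__":
    # Class index = index of the attacking vehicle (illustrative data).
    actual = [0, 0, 1, 1, 2, 2]
    predicted = [0, 1, 1, 1, 2, 0]
    matrix = confusion_matrix(actual, predicted, n_classes=3)
    print(matrix, accuracy(matrix))
```

The off-diagonal entries show which vehicles are confused with each other, which a single accuracy figure hides.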

The FCDNN produced worse results than the CNN in every single case. In all cases the global scenario had a better success rate but this is not always a realistic scenario [24].

3.1.18 Cyber security of smart grids in the context of big data and machine learning

Cyber-security is increasingly becoming a crucial part of smart grids. This paper investigates using machine learning, and specifically artificial neural networks, for that purpose. The chosen learning method in the case study was supervised learning. The reasoning for this was that a power grid generates a huge amount of data that can be used to classify the normal behaviour of the system. This data can then be labeled and used to train the neural network to learn the baseline behaviour of the system. It can then detect anomalies in this behaviour [25].

The chosen neural network was a nonlinear autoregressive neural network because it supports dynamic input without prior knowledge. The network consisted of 12 layers with the Levenberg-Marquardt training function. It was trained in MATLAB for 500 epochs. Once it was trained it was given two sets of data, the first with data that represents the system under normal behaviour and the second with data that represents the behaviour when the system is under attack. The results were promising and showed that the model could correctly classify normal and attack behaviour, although the exact accuracy was not mentioned [25].

3.1.19 Dimensionality reduction and anomaly detection for cpps data using autoencoder

This paper focuses on the DR/AD approach for cyber-physical security, which combines unsupervised anomaly detection (AD) with dimensionality reduction (DR), the latter often being used as a preprocessing step to anomaly detection. A nonlinear autoencoder (AE) is used to apply the DR/AD approach for security. Cyber-physical production systems usually deal with data that is highly dimensional, which anomaly detection systems often have a hard time handling. DR can alleviate this problem by reducing the number of dimensions that the AD solution has to work with. The paper tries to implement this approach to AD with an autoencoder, which serves both functions of dimensionality reduction and anomaly detection through reconstruction error [26].

The goal of DR is to transform high-dimensional data to a dataset of lower dimensions but which can be reversed so that the reduced dimension dataset can be decoded to the original dataset. The autoencoder that performs this work can be represented as a neural network with a p-dimensional hidden layer and an m-dimensional output layer, where p is the number of original dimensions in the dataset and m is the reduced number of dimensions [26].

Anomaly detection can be performed by establishing a baseline that the AD solution can learn and contrast with the real-time behaviour of the system. Through this technique it can then discover anomalies that do not conform to normal system behaviour. In this paper they use a semi-supervised learning approach by feeding the neural network mostly normal behaviour data with some abnormalities. This is done to guide the system to learn some patterns of anomalies. The danger with this technique is that anomalies are by their very nature diverse and hard to predict, which means that if the system is focused on a set of them it can more easily miss other anomalies that are not part of that set [26].

The AD solution in this paper uses a two-phased approach. In the first phase the autoencoder reconstruction of the low-dimensional data is analyzed and compared to the input. If the reconstruction error is higher than a determined threshold the behaviour is determined as anomalous. The error rate is a representation of how much the input deviates from the learned baseline behaviour. Unconformity to other aspects of the system behaviour is detected in the second phase. In the second phase any already established AD method that works well with low-dimensional data is applied [26].
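The first phase can be sketched as follows. The mean squared error is used as the reconstruction error here, which is an assumption of this sketch and not necessarily the measure used in [26]. The reconstruction would in practice be produced by the trained autoencoder; the values and function names below are made up.

```python
def reconstruction_error(original, reconstructed):
    """Mean squared error between an input and its reconstruction (assumed measure)."""
    squared = [(a - b) ** 2 for a, b in zip(original, reconstructed)]
    return sum(squared) / len(squared)


def is_anomalous(original, reconstructed, threshold):
    """Phase one: flag the input if its reconstruction error exceeds the threshold."""
    return reconstruction_error(original, reconstructed) > threshold


if __name__ == "__main__":
    sample = [1.0, 2.0, 3.0]
    good_reconstruction = [1.0, 2.1, 2.9]   # close to the learned baseline
    poor_reconstruction = [1.0, 2.0, 5.0]   # deviates from the baseline
    print(is_anomalous(sample, good_reconstruction, threshold=1.0))
    print(is_anomalous(sample, poor_reconstruction, threshold=1.0))
```

An input that resembles the training data is reconstructed well and yields a small error, while an input that deviates from the learned baseline is reconstructed poorly and exceeds the threshold.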


To test the solution the dataset MNIST was employed. Both phases were tested and compared. The results from the experiments show that if the reduced dimensionality of the data is too small, too much information is lost to even reconstruct it. By increasing the reduced dimensions gradually the AD-performance improves. However, this trend reverses at a certain number of dimensions where the data volume becomes too large. In conclusion the AE technique is effective in improving AD-performance if tuned properly to the optimal number of dimensions for any particular dataset [26].

4 Results

4.1 Statistics

4.1.1 CPS industry

A plurality of the literature did not specify the industry that the research belonged to. That research was developed for cyber-physical systems in general.

• Unspecified: 3

• Transportation: 3

• Autonomous vehicles: 3

• IoT: 3

• Electric power: 2

• Water industry: 2

• Machine industry: 1

• Building automation: 1

• Production systems: 1

For some research that did not specify an industry, the dataset used was the basis for deriving what industry the research belonged to. For example, some research used datasets from water treatment plants [14], [18] and water distribution plants [14]. This research has been labelled as Water industry.

4.1.2 Neural Network classes

The class of neural networks used to implement the models in the research literature was measured. Note that some of the literature [9], [10], [14], [22], [24] used several classes of neural networks.


Figure 4.4: Classes of neural networks used in implementations

Recurrent neural networks were the most commonly used neural network class, appearing in 7 of the papers. All recurrent neural network sub-classes were grouped together. The sub-classes of recurrent neural network that were used in the papers are the following.

Figure 4.5: RNN sub-classes

4.1.3 Learning methods

The type of learning that was used in the papers that implemented a neural network was extracted as data. Some papers specified neither the type of learning used nor whether the dataset used to train the network was labeled.

Figure 4.6: Types of learning used in the literature

Supervised learning was the most common type of learning method used. Unsupervised learning was also common. Some papers used both methods when training their network.

4.1.4 Citations

The following graph shows how many citations the literature had, generated by the search string in Scopus.


Figure 4.7: Literature citations

4.1.5 Year of publication

The first paper in the literature search results was published in 2015. The following graph shows the year of publications for all papers.

Figure 4.8: Literature publication year


4.2 DETECT detection engine

The model consists of two ANNs that work together. The first network is used for detecting anomalies among the events that are fed to it. It is a Long Short Term Memory (LSTM) network since it fits well with time-series data, which is the type of data that will be used as input. The LSTM network is trained on high level events that are generated by the CPS that the model is integrated with. The events are fed one at a time and thanks to the properties of an LSTM it can recognize patterns as more events are fed to it over time.

The job of the second network is to classify the series of events as a type of attack. This is done by first teaching the network what type of attack a series of events can correspond to. If a series of events fits a certain attack that the network knows about, the network will classify it as that attack.

Figure 4.9: The architecture of the detection engine represented as a high-level diagram

4.2.1 The model feeder

The preprocessing of the data will be done in the model feeder within the DETECT engine at run time. Since the data is a time series, the model feeder will fetch events from the DETECT event-history repository using a sliding window, as used in [14] for time-series data. The size of the sliding window is left to specific implementations, because some attacks involve more events than others. This depends on the content of the DETECT attack repository (determined during risk analysis). However, as in [14], $s_w = 30 \times i$, $i = 1, 2, \ldots, 10$ can be used to set the size of the sliding window, where the value of i is determined by the contents of the attack repository and by experimentation (for best performance) when implementing the proposed DETECT module.

4.3 ANN Models

4.3.1 Neural Network Types

This section explains the reasoning behind the choice of the two neural networks. LSTM networks are designed in such a way that they can utilize time-series data very well. This is because they take previous input into account when they make a prediction. This is very relevant to a CPS, since patterns of normal system behaviour emerge through the regular activities that occur in the system. For a neural network to detect these patterns, considering a single event in isolation is not enough. This is what makes LSTM networks a good candidate for use in the CPS domain. The papers [10], [12], [14], [18], [21] reference this as one of the reasons for choosing to use an LSTM network for their solution.

Another important benefit that an LSTM network provides compared to a standard RNN is that it mitigates the vanishing gradient problem, which has been described previously.

4.3.2 Data preprocessing

The DETECT event data that will be used as input for the ANNs has the following schema [27].

Field Name   Field Description         Field format
IDev         Event Identifier          Ex
IDs          Sensor Identifier         Sx
IDg          Sensor Group Identifier   Gx
Tp           Timestamp                 yyyy-mm-dd hh:mm:ss

Table 4.5: DETECT event schema. Source: [27]

The fields Event Identifier, Sensor Identifier and Sensor Group Identifier in the event schema are nominal data (categorical data in no specific order) [28]. Since input nodes in ANNs require numeric data, the nominal data have to be converted into numeric data. There are several ways of encoding nominal data into numeric data. One-hot encoding is a common encoding technique in machine learning, with generally better results compared to binary encoding or feature hashing [29]. The same encoding has to be used both when training the ANN models and when using them.

Feature selection [30], as applied in [8], [22], is used to reduce the data to the essential fields and thereby avoid overfitting. The Event Identifier and Sensor Identifier fields are the two essential fields required to see the patterns in the data.

Events will be sorted (ascending) using the timestamp field. This is to ensure that they are delivered to the ANNs in the correct order, since the events in the event-history may not always be stored in chronological order.
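The preprocessing steps in this section can be sketched as follows. The field names IDev, IDs and Tp and the timestamp format come from the event schema, while the function names, the dictionary representation of an event and the example values are assumptions made for illustration. The sketch is not part of DETECT.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"   # yyyy-mm-dd hh:mm:ss from the schema


def sort_events(events):
    """Return the events in ascending timestamp order."""
    return sorted(events, key=lambda e: datetime.strptime(e["Tp"], TIMESTAMP_FORMAT))


def build_vocabulary(values):
    """Map each distinct identifier to a fixed position, reused for training and inference."""
    return {value: index for index, value in enumerate(sorted(set(values)))}


def one_hot(value, vocabulary):
    """Encode one nominal value as a vector with a single 1."""
    vector = [0] * len(vocabulary)
    vector[vocabulary[value]] = 1
    return vector


def encode_event(event, event_vocab, sensor_vocab):
    """Concatenate the one-hot encodings of the event and sensor identifiers."""
    return one_hot(event["IDev"], event_vocab) + one_hot(event["IDs"], sensor_vocab)


if __name__ == "__main__":
    history = [
        {"IDev": "E2", "IDs": "S3", "IDg": "G1", "Tp": "2021-05-01 10:00:05"},
        {"IDev": "E1", "IDs": "S1", "IDg": "G1", "Tp": "2021-05-01 10:00:01"},
    ]
    ordered = sort_events(history)
    event_vocab = build_vocabulary(e["IDev"] for e in history)
    sensor_vocab = build_vocabulary(e["IDs"] for e in history)
    print([encode_event(e, event_vocab, sensor_vocab) for e in ordered])
```

Building the vocabularies once and storing them alongside the trained models is one way to guarantee that the same encoding is used in both training and operation.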

4.3.3 Hidden layers

To determine the number of hidden nodes in the model, formula 1 used in [11] can be applied. As mentioned in the literature review, φ is a factor that varies from 2 to 10. The number of nodes and the number of layers that the hidden nodes are spread over are subject to trial and error when implementing the model, and are determined by the best results (accuracy, precision, recall, F1 score and so on).


4.3.4 Training Method

The ANN that is used for anomaly detection will employ unsupervised training. Since the general patterns in the dataset of events (where DETECT will be implemented) can be learned without labeling the data, as in [12], unsupervised training can be used to detect anomalies.

For the ANN that is used to classify specific attacks, supervised learning will be used. The dataset that is used to train the network will have to contain series of events that correspond to specific attacks (from the DETECT attack repository, gathered during risk analysis), each labeled with the attack it represents. The network will be trained with supervised learning using this dataset. This method for detecting specific attacks has successfully been used in [8].

As mentioned in the analysis sub-section about overfitting, normalizing the data in the preprocessing stage and adding dropout layers to the model with a ratio of 0.1 to 0.3 will help reduce overfitting.

4.3.5 Architecture

There are many different threats that a CPS can be exploited by. Through risk analysis many of these can be exposed but there is no guarantee that all of them will be. The threats are also not static, but rather they evolve over time. This is why it is important to guard against both current and future threats [31].

As mentioned, the ANN model consists of two neural networks. These networks serve the same function of detecting an anomaly but they use different training methods. In this model they are called Anomaly Detector (AD) and Anomaly Classifier (AC). The AD is using unsupervised learning while the AC is using supervised learning. These two networks run in parallel receiving the same input of events. Both of them process the data to produce their predictions which are then fed to the data fusion module.

Using this kind of architecture combines the advantages of unsupervised and supervised learning. The unsupervised approach is suitable for detecting unknown anomalies, meaning it can detect behaviour in a system that does not fit with its standard operations. However, a network trained with unsupervised learning in this way cannot discern what kind of anomaly it has detected [31].

The supervised approach works by training a network on known threats. With this information the network can recognize and classify an anomaly. This means that more information about the anomaly can be accessed in case of an intrusion into the system. This can be of benefit when managing a breach as an operator would already know what kind of attack is occurring, and can thus act accordingly. The disadvantage of this approach is that threats not known by the network would go unnoticed.

Combining two networks in this way does bring its own set of disadvantages, the first being that the architecture is more complex. This paper will not describe in detail how to run two networks in parallel, but suffice it to say that it introduces additional complexity. Running two networks at the same time is also more resource intensive than running just one.

The two networks do not necessarily produce a prediction in the same amount of time. This can introduce a delay in the data fusion module as it has to wait for both predictions before producing a final result.

The predictions from both networks are used in the data fusion module to determine a course of action.
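As a minimal sketch of the parallel arrangement described above, the following Python example submits the same event window to two predictors and waits for both results. The function names and the dummy predictors are hypothetical stand-ins for the trained networks, and the use of a thread pool is an assumption made for illustration, not a description of how the model must be deployed.

```python
from concurrent.futures import ThreadPoolExecutor


def predict_in_parallel(detector, classifier, event_window):
    """Run both predictors on the same event window and wait for both results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        ad_future = pool.submit(detector, event_window)
        ac_future = pool.submit(classifier, event_window)
        # result() blocks until the prediction is ready, so the slower
        # network determines when the data fusion module can proceed.
        return ad_future.result(), ac_future.result()


# Hypothetical stand-ins for the trained networks (assumptions).
def dummy_detector(event_window):
    return 0.9  # anomaly likelihood


def dummy_classifier(event_window):
    return [0.1, 0.7, 0.2]  # one score per class


ad_score, ac_scores = predict_in_parallel(
    dummy_detector, dummy_classifier, ["E1", "E2"]
)
```

The blocking calls to `result()` illustrate the delay mentioned above: the data fusion step cannot start before the slower of the two predictions has arrived.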

4.3.6 Structure of Anomaly Detector

The anomaly detector is an LSTM network with N + 1 input nodes in the input layer and one output node in the output layer. N is the combined number of event identifiers, sensor identifiers and sensor group identifiers, which are encoded using one-hot encoding. The additional input node carries the timestamp. Events are fed to the network one at a time from the event window, each with its event and sensor identifiers, ordered by the timestamp of each event. The output node produces a real number between 0 and 1. This indicates the likelihood that a series of events is an anomaly.

For the hidden layers of the network, refer to section 3.4.4. Since the output is a number between 0 and 1, a threshold must be set to determine what should be considered an anomaly. This threshold is determined during the implementation and experimentation phase of the network, which means it is not a number that can be decided upon during modelling.
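The input layout and the thresholding step can be illustrated with a short Python sketch. The identifier lists, the field names of the event record and the threshold value are hypothetical and chosen only for this example; the timestamp is passed through unscaled, which is also an assumption.

```python
def one_hot(value, categories):
    """Return a one-hot list with a 1.0 at the position of value."""
    return [1.0 if value == category else 0.0 for category in categories]


def encode_event(event, events, sensors, groups):
    """Build the N + 1 input vector: one-hot identifiers plus the timestamp."""
    return (
        one_hot(event["event"], events)
        + one_hot(event["sensor"], sensors)
        + one_hot(event["group"], groups)
        + [float(event["timestamp"])]  # the additional input node
    )


def is_anomaly(score, threshold):
    """Apply the threshold to the detector output (a number between 0 and 1)."""
    return score >= threshold


# Hypothetical identifiers: N = 3 + 2 + 1 = 6, so the vector has 7 entries.
events = ["E1", "E2", "E3"]
sensors = ["S1", "S2"]
groups = ["G1"]

vector = encode_event(
    {"event": "E2", "sensor": "S1", "group": "G1", "timestamp": 5},
    events, sensors, groups,
)
```

In this sketch the threshold is a plain parameter, reflecting that its value is found experimentally rather than fixed by the model.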

4.3.7 Structure of Anomaly Classifier

Like the Anomaly Detector, the Anomaly Classifier is an LSTM network with N + 1 input nodes. Its output layer also has N + 1 nodes, but here N is the number of identified threats to the system, and the additional node indicates no threat.

Each output node produces a real number between 0 and 1 that expresses how certain the network is that a series of events belongs to the corresponding class. This means that a threshold value is needed here as well. For the hidden layers of this network, refer to section 3.4.4.
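One possible way of turning the classifier outputs into a decision is sketched below in Python. The rule used here (take the highest-scoring node and accept it only if it is a threat class whose score reaches the threshold) is an assumption made for illustration; the class labels and threshold are hypothetical.

```python
NO_THREAT = "no threat"


def classify(scores, labels, threshold):
    """Pick the most certain class; fall back to NO_THREAT below the threshold."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    if labels[best] != NO_THREAT and scores[best] >= threshold:
        return labels[best]
    return NO_THREAT


# Hypothetical output layer: two identified threats plus the no-threat node.
labels = ["threat A", "threat B", NO_THREAT]

confident = classify([0.1, 0.8, 0.1], labels, threshold=0.5)
uncertain = classify([0.4, 0.3, 0.3], labels, threshold=0.5)
```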

4.3.8 Data Fusion

Combining the predictions from the two networks creates a matrix with four different outcomes, as seen below.

|  | Detected Anomaly | No Anomaly Detected |
| --- | --- | --- |
| Classified Anomaly | Known Threat Detected | Known Threat Undetected |
| No Anomaly Classified | Unknown Threat Detected | No Threat |

Table 4.6: Detection/classification confusion matrix

In this module, several features can be implemented. Thresholds can be used to determine the outcome in the confusion matrix and to decide whether to send an alarm to the PSIM (Physical Security Information Management) / SIEM (Security Information and Event Management). To avoid false positives being sent to the alarm center (PSIM/SIEM), these thresholds are determined by the specific implementation of the proposed model, the specific CPS that the model is implemented in and the datasets used to train the ANN models.

| Result | Description |
| --- | --- |
| Known Threat Detected | The Anomaly Detector detected an anomaly and the Anomaly Classifier could recognize it. |
| Known Threat Undetected | The Anomaly Classifier could classify the threat but the Anomaly Detector did not detect it. |
| Unknown Threat Detected | The Anomaly Detector detected an anomaly but the Anomaly Classifier could not classify it as a known attack. |
| No Threat | Neither the Anomaly Classifier nor the Anomaly Detector considers the events a threat. |

Table 4.7: Data fusion results
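The four outcomes of the data fusion module can be sketched as a small Python function. The parameter names and the two separate thresholds are assumptions for illustration; `ac_threat_score` is taken to be the classifier's highest score among the threat classes.

```python
def fuse(ad_score, ac_threat_score, ad_threshold, ac_threshold):
    """Map the two predictions onto the four outcomes of the confusion matrix."""
    anomaly_detected = ad_score >= ad_threshold
    threat_classified = ac_threat_score >= ac_threshold

    if anomaly_detected and threat_classified:
        return "Known Threat Detected"
    if threat_classified:
        return "Known Threat Undetected"
    if anomaly_detected:
        return "Unknown Threat Detected"
    return "No Threat"


# Hypothetical scores, with both thresholds set to 0.5.
outcomes = [
    fuse(0.9, 0.8, 0.5, 0.5),
    fuse(0.1, 0.8, 0.5, 0.5),
    fuse(0.9, 0.2, 0.5, 0.5),
    fuse(0.1, 0.2, 0.5, 0.5),
]
```

Whether a given outcome should raise an alarm towards the PSIM/SIEM is a separate decision that would sit on top of this mapping.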


5 Implementation

5.1 Mock dataset

Since there are no readily available datasets of threat scenarios or event collections generated by a CPS, a mock dataset was constructed, loosely based on the 1999 DARPA Intrusion Detection Evaluation [32] and the Malware Training Sets: a machine learning dataset for everyone [33] datasets. Only cyber threats were used in the generated mock dataset. This does not, however, affect the framework’s capacity to deal with both cyber and physical threats (which may be known from risk analysis performed prior to implementation, or from known datasets).

Figure 5.10: The sensor classes and the amount of events from each sensor in the dataset

The dataset was constructed by using the labeled attacks from the 1999 DARPA dataset to represent two sensors (one sensor group) and the malware dataset to represent 10 sensors (the second sensor group).


Figure 5.11: The event classes and the amount of events of each event class

One event was created for each labeled attack from the 1999 DARPA dataset, and a random event between 1 and 43 was assigned to each type of malware in the malware dataset. Timestamps were derived from the 1999 DARPA dataset, and random timestamps within the same interval were created for the malware dataset. The created dataset contains 276 records.
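The random part of this construction could look roughly like the following Python sketch. The record fields, the malware type names, the integer timestamps and the fixed seed are all hypothetical; only the event range of 1 to 43 and the idea of drawing timestamps from a given interval come from the description above.

```python
import random


def mock_malware_events(malware_types, t_min, t_max, seed=0):
    """Create one record per malware type with a random event and timestamp."""
    rng = random.Random(seed)  # fixed seed so the mock data is reproducible
    records = []
    for malware in malware_types:
        records.append({
            "source": malware,
            "event": rng.randint(1, 43),  # random event between 1 and 43
            "timestamp": rng.randint(t_min, t_max),  # within the given interval
        })
    return records


# Hypothetical malware types and timestamp interval.
records = mock_malware_events(["type_a", "type_b", "type_c"], t_min=1000, t_max=2000)
```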

5.2 Model feeder

One way of applying one-hot encoding is to use Python libraries such as Categorical Encoding Methods (category-encoders) [34] or scikit-learn’s preprocessing OneHotEncoder [35]. The encoder first fits (models) the nominal data and can then transform (encode) the data into numerical data. It should be noted that all possible events (E1 to En) and all possible sensors (S1 to Sn) need to be fitted (modeled) with the one-hot encoder before the data is transformed (encoded). The following table shows how the 12 sensors in the mock dataset are encoded using the category-encoders one-hot encoder.

Sensor 1 Sensor 2 Sensor 3 Sensor 4 Sensor 5 Sensor 6 Sensor 7 Sensor 8 Sensor 9 Sensor 10 Sensor 11 Sensor 12

0 1 0 0 0 0 0 0 0 0 0 0 0

1 0 1 0 0 0 0 0 0 0 0 0 0

2 0 0 1 0 0 0 0 0 0 0 0 0

3 0 0 0 1 0 0 0 0 0 0 0 0

4 0 0 0 0 1 0 0 0 0 0 0 0

5 0 0 0 0 0 1 0 0 0 0 0 0

6 0 0 0 0 0 0 1 0 0 0 0 0

7 0 0 0 0 0 0 0 1 0 0 0 0

8 0 0 0 0 0 0 0 0 1 0 0 0

9 0 0 0 0 0 0 0 0 0 1 0 0

10 0 0 0 0 0 0 0 0 0 0 1 0

11 0 0 0 0 0 0 0 0 0 0 0 1

Table 5.8: Table showing how the sensors are encoded using one-hot encoding.

Sensor 1 is encoded as "100000000000", sensor 2 is encoded as "010000000000", sensor 3 is encoded as "001000000000" and so on. All the events are encoded in a similar fashion.
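The fit-then-transform pattern and the resulting bit strings can be reproduced with a small standard-library sketch. The class below is a hypothetical stand-in written for illustration; it is not the category-encoders or scikit-learn encoder, and it only mirrors their two-step usage.

```python
class OneHotSketch:
    """Minimal one-hot encoder that mirrors the fit/transform workflow."""

    def fit(self, categories):
        # Remember every possible category once, in first-seen order.
        self.categories_ = list(dict.fromkeys(categories))
        return self

    def transform(self, value):
        # Values that were not seen during fit cannot be encoded.
        if value not in self.categories_:
            raise ValueError(f"unknown category: {value!r}")
        return "".join("1" if value == c else "0" for c in self.categories_)


# All 12 sensors must be fitted before any of them is encoded.
sensors = [f"Sensor {i}" for i in range(1, 13)]
encoder = OneHotSketch().fit(sensors)

print(encoder.transform("Sensor 1"))  # 100000000000
print(encoder.transform("Sensor 3"))  # 001000000000
```

Raising an error for unseen values reflects the requirement that every possible event and sensor is fitted before encoding starts.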

The model feeder is implemented using Python. It loads the data from a JSON file using the built-in json library. This data is then processed to be in the correct format for