
School of Innovation Design and Engineering

Västerås, Sweden

Thesis for the Degree of Master of Science in Computer Science, 30.0 credits

ANOMALY DETECTION USING

MACHINE LEARNING APPROACHES

IN HVDC POWER SYSTEM

Mohammad Borhani

mbi19007@student.mdh.se

Examiner: Ning Xiong

Mälardalen University, Västerås, Sweden

Supervisor: Johan Holmberg

Mälardalen University, Västerås, Sweden

Company supervisor: Michele Luvisotto, Sarala Mogan

ABB HVDC/PGR, Västerås, Sweden


Abstract

With the continuously increasing energy demand of modern cities, the efficient generation, transmission, and distribution of electricity has gained much attention in the last decade. Reliability and efficiency play a crucial role in designing modern power transmission lines. Asynchronous interconnection and the ability to transmit bulk power over long distances paved the way for High Voltage Direct Current (HVDC) to be used in modern transmission lines. Any anomalous data pattern in an HVDC system may indicate an upcoming failure, which puts the reliability and efficiency of the system at risk. In this thesis, a data analysis framework is presented. The framework, which is divided into different submodules, focuses on preprocessing of data, dimensionality reduction, and time series forecasting. For dimensionality reduction, different approaches and criteria were defined for HVDC experts that can be used as prior steps to establish machine learning approaches. Moreover, the central part of the framework considers the comparison between time series prediction using traditional modeling and an ensemble machine learning approach called Random Forest. Using mathematical foundations and proofs, the intuition behind using Random Forest for time series forecasting is justified. The contribution of the thesis then focuses on transforming the univariate time series forecasting problem into a supervised learning formulation that can be evaluated mathematically. Experimental results show that the Random Forest model outperforms the traditional forecasting approach and can be deployed to predict time series with high accuracy. This model helps HVDC experts to predict the near future so as to provide a guideline for proactively spotting any anomalous pattern. Moreover, Isolation Forest is used for the prediction of anomalies in the HVDC data set. Finally, by considering the regression mode of Random Forest, an approach for predicting a target time series using an independent time series is also included in the thesis.


Contents

1. Introduction
2. Background and Related Work
   2.1 DC Transmission
   2.2 Anomaly Definition
   2.3 Machine Learning
3. The Problem Formulation
4. Methodology
   4.1 Time Series Analysis
   4.2 Time Series Component
   4.3 Time Series and Stochastic Process
   4.4 Statistical Properties of Stochastic Process
   4.5 Concept of Stationarity
   4.6 Stochastic Model for Time Series Forecasting
   4.7 ARMA Model
   4.8 Random Forest
   4.9 PCA
   4.10 Multidimensional Scaling (MDS)
   4.11 Isomap
   4.12 t-SNE
5. Experimental Analysis
   5.1 Development Environment
      5.1.1 Jupyter
      5.1.2 Anomaly Detection in Raw Data Module
   5.2 Evaluation Metrics
      5.2.1 MAE
      5.2.2 RMSE
      5.2.3 MSE
6. Experimental Results and Discussion
   6.1 Time series Forecasting
      6.1.1 Univariate Forecasting
      6.1.2 Multivariate Time series Forecasting
   6.2 Answers to RQs
   6.3 Confidentiality Concerns
   6.4 Threats to Research Validity
7. Conclusions
8. Future Works


1. Introduction

Global energy demand has increased significantly in the last decade [1]. Moreover, reliable transmission between two points over very long distances has intrigued both academia and industry. In this regard, the authors in [2] expect that renewable energy will account for a 63% share of the primary energy supply in 2025, up from only 14% in 2015. Many facilities responsible for energy generation, specifically renewable power plants, may be distributed across geographical locations far away from end-users. Thus, efficient transmission of the generated electrical power seems to be crucial in modern environments [3].

Equipment used in a transmission environment needs precise maintenance so as to prevent possible failures. The authors in [4] argued that the role of failure prediction in electrical utilities should be taken seriously, as accurate prediction makes it possible to define optimal decisions for the future. For instance, any pattern of possible failure helps the industry to proactively plan so as to reduce the overall cost of the failure.

Due to advantages such as the capability of long-distance transmission and asynchronous interconnection, High Voltage Direct Current (HVDC) is widely used in bulk transmission [5]. Nevertheless, as HVDC systems are located in the wilderness, they may be prone to faults more easily. Thus, anomaly detection mechanisms that recognize anomalies in a timely fashion seem to be crucial for the industry. Moreover, recent enhancements in Artificial Intelligence (AI) and machine learning pave the way for the design and implementation of anomaly detection systems in HVDC. In chapter 2 of the thesis, the background and related work concerning anomaly detection are presented. After the problem formulation, the methodology of the thesis is given in chapter 4. Experimental analysis of the proposed method and a discussion of the results are given afterward. Finally, the conclusion and a path for future work are proposed in the final chapters.


2. Background and Related Work

2.1 DC Transmission

Defining a transmission line as a system consisting of a conductor that transmits electrical power from an initial point, A1, to a destination point, A2, and considering that Direct Current (DC), as the name indicates, offers a constant current, while Alternating Current (AC) reverses its direction periodically, two types of transmission can be defined: AC transmission and DC transmission [6]. Although, owing to the availability of transformers and poly-phase circuits, the usage of AC overcame DC in the late 1890s, DC has regained much attention for long-distance transmission due to the technical, environmental, and economical advantages it offers [7].

There exist environmental constraints that force the electrical industry to deal with the separation between power generating plants and load centers [8]. Since this separation can be in terms of hundreds of kilometers, efficient and reliable transmission is needed. In DC transmission, a cycle of conversion between AC and DC is performed: a rectifier on one side converts AC to DC to be carried over the transmission line, while the receiving side executes the inverse conversion (DC to AC).

Assuming $V_p$ to be the conductor's peak voltage (with respect to ground) and $\cos\phi$ the power factor, the power levels can be calculated by the following equations:

$P_{AC} = \frac{3}{\sqrt{2}} V_p I \cos\phi \qquad (1)$

and

$P_{DC} = 2 V_p I \qquad (2)$

where the effective current rating of the conductor is denoted by $I$. The calculation yields:

$P_{AC} \cong P_{DC} \qquad (3)$
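As a quick sanity check of the reconstruction above, dividing Eq. (1) by Eq. (2) gives

$\frac{P_{AC}}{P_{DC}} = \frac{3}{2\sqrt{2}}\cos\phi \approx 1.06\cos\phi$

which is close to unity for power factors near one, consistent with Eq. (3).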

Now, assuming the characteristics of the insulators are identical for both AC and DC transmission, the author in [8] reported that a DC line is capable of carrying the power with two conductors, while AC lines need three conductors to carry the same amount of power. Using Eq. (2), it can be noted that DC lines offer lower Right of Way (RoW) costs. Moreover, expensive and complex tower installations can be avoided in DC line transmission.

In a technical evaluation, the angle between the voltage phasors at the two ends affects the power transmission [8]. In Figure 1, the power transfer capability is drawn against distance. As can be seen from Figure 1, the DC line carries the same amount of power as the distance increases, while the AC line's capability drops gradually.

The authors in [9] defined the reliability of a transmission system as the fraction of time in which the system is able to operate adequately so as to deliver the required functionality under the stated conditions. Moreover, adequacy and security constitute the underlying concepts of reliability in a transmission system. Supplying the total aggregated energy demanded from the system is defined as adequacy, and the ability of the system to withstand sudden disturbances is regarded as security [9]. In this regard, exact performance metrics can be defined to evaluate any prototype of a transmission system. The researcher in [8] introduced the concept of energy availability as follows:

$\text{Energy availability} = 100\left(1 - \frac{\text{Equivalent Outage Time}}{\text{Total Time}}\right)\% \qquad (4)$

Figure 1: Power Transfer, Source: Adopted From [8]

and

$\text{Equivalent Outage Time} = \text{Actual Outage Time} \times \text{Fraction of System Capacity Lost} \qquad (5)$

Moreover, transient reliability, as another performance metric, could be defined as:

$\text{Transient Reliability} = 100 \times \frac{\text{Number of times the transmission system performed as designated}}{\text{Number of recordable faults}} \qquad (6)$

A recordable fault denotes a disturbance in which the voltage drops below 90% of the appropriate level. It is assumed that the energy availability or transient reliability of a transmission network should be in the range of 95% or more [8].

High voltage transmission can be classified into High Voltage Direct Current (HVDC) and High Voltage Alternating Current (HVAC). Direct current, as the name indicates, offers a constant current, while AC is an electrical current that reverses its direction periodically [10]. The authors in [11] reported that the first usage of HVDC for bulk transmission dates back to 1954, by ABB in Sweden.

The advantages of using HVDC for power systems can be classified as follows:

• HVDC can connect two power networks with asynchronous AC systems [12].

• When power needs to travel a very long distance, HVDC offers the more economical choice, because it incurs smaller transmission losses than AC lines and HVDC lines are cheaper per kilometer [12].

• Since HVDC needs smaller conductor cross-sections, HVDC transmission is advantageous compared to older AC lines [13].

• In terms of utilization, HVAC suffers from the skin effect in AC lines, while HVDC lines can be utilized fully, restricted only by thermal limits [14].


The authors in [7] considered the high cost and complexity of converters as pitfalls of HVDC systems; multi-terminal networks also seem complex to manage. The researcher in [8] likewise listed expensive converters, the need for DC and AC filters, and DC breakers as cons of HVDC.

Considering long-distance transmission, it should be noted that, in order to reduce the overall losses in the system, it needs to operate in high-voltage mode, in practice above 100 kilovolts [6].

Figure 2: Cost Variation, Source: Adopted From [8]

Figure 2 compares the cost of deployment of HVDC and HVAC with respect to long-distance transmission. Although HVAC initially offers a lower deployment cost, its cost grows more steeply as the line length on the x axis increases. Hence, for long-distance transmission, HVDC presents the lower cost.

An HVDC system can best be described as a system for bulk transmission based on the DC concept. In the simplest case, the back-to-back interconnection of two converters on the same site constitutes an HVDC system [7]. HVDC systems can be classified into the following categories:

- Mono-polar: Two converters are connected by a single conductor, where the earth can be used as the return path.

- Bi-polar: In contrast to the mono-polar case, two conductors are used, each with its own ground return.

- Multi-terminal configuration: In this system, more than two converter stations are present. Thus, the complexity of the system increases compared to the above-mentioned types.

- Homopolar: An integration of two or more conductors that share identical polarity.


Figure 3: HVDC Types, Source: Adopted From [8]

2.2 Anomaly Definition

An outlier is defined as an observation that deviates considerably from other observations. This deviation arouses suspicion that the outlier was generated by a different mechanism [15]. The authors in [16] reported the following terms that may be used interchangeably in the literature:

• abnormality
• discordant
• deviants
• anomalies

There is an argument that an outlier may be a point that could be interpreted either as an abnormality or as noise [16]. Figure 4 represents the importance of considering noise and anomalies in one context: Figure 4(a) illustrates an output without noise, while Figure 4(b) shows the same one with noise. Thus, choosing the threshold used to evaluate the context seems to be crucial for case studies. For instance, in supervised learning, without the existence of labelled data, a criterion needs to be defined to distinguish noise from anomalies [16]. Moreover, the authors in [17, 18] introduced the terms Strong Outlier and Weak Outlier to tackle the above-mentioned issue. The anomaly detection procedure can vary when deployed for specific domain needs. The authors in [19] reported that


Figure 4: Anomalies Vs Noises, Source: Adopted From [16]

• Availability of target column labels
• Domain-specific requirements

can alter the way the best anomaly detection technique is selected in each use case. Each data row can be described by a specific number of attributes. Variables, dimensions, and features are other words used interchangeably with attributes. The features can exist in different formats:

• Binary
• Continuous
• Categorical

It should be noted that a data row can be described using one feature (univariate), or a data row can contain multiple features, which is known as multivariate analysis. The overall nature of the input data set affects the choice of the proper anomaly detection technique in a specific application domain. For instance, the distance measure to be used in a nearest-neighbour anomaly detection technique is defined by the characteristics of the attributes in the data set. In a more general view, depending on the domain, data rows can be related to each other. Sequence data is an example of this relation, where data rows are ordered, e.g., time-series data. Moreover, anomalies, based on their nature, can be classified as follows:

• Contextual anomalies: The existence of anomalies has a direct relationship with the underlying context. The term context is defined by a specific structure in the data set [20]. Figure 5 represents the notion of contextual anomalies by considering the monthly temperature.

• Point anomalies: Considering the overall data, point anomalies are defined by the anomalous behavior a data point shows with respect to the other data.

• Collective anomalies: As depicted in Figure 6, a collection of data points is said to represent a collective anomaly if the overall pattern of those points significantly differs from the behavior of the other data rows. Researchers need to note that an individual point inside the collective anomaly set might not be an anomaly by itself.


Figure 5: Contextual Anomalies [19]

Figure 6: Collective Anomalies [19]

Regarding the availability of the target column, one concern is that labeling the whole data set with the target column as normal or anomalous would be an expensive task for companies. Typically, the existence of labels divides anomaly detection approaches into the following categories:

• Unsupervised detection: In the unsupervised setting, the label is missing, and the model does not demand a training data set. It should be noted that unsupervised detection relies on the prior assumption that anomalous data populate only a very small portion of the test data. Moreover, because the unsupervised technique does not impose an expensive labeling task on companies, it tends to be used more often.

• Supervised detection: Unlike unsupervised detection mechanisms, in a supervised detection technique there exists a label column for the normal and anomaly classes. Thus, the proposed model is built and trained on the training data set. Afterward, the model is ready to output its prediction on unseen data. Due to the expensive task of labeling, building supervised detection models can be challenging.

• Semi-supervised detection: In this setting, only labels for the normal class exist in the training data set.


After applying these detection techniques, anomaly detection usually outputs its result as scores or labels. With a score output, each data point is assigned a score indicating its degree of being anomalous, while with a label output, a simple category (anomaly or normal) is assigned to each data row.

The importance of the data model in constructing an anomaly detection system was investigated in [16]. Nearly all anomaly detection techniques construct a model of normal data, which helps them recognize any deviation from normal behaviour. Hence, any error in the model construction steps results in a poor fit of the data, which consequently ends in poor recognition of anomalies. Moreover, any statistical inference from the data model needs to be supported by an evident test. Considering point B in Figure 4(b), the definition of anomaly affects the way B is treated: if extreme points are defined as anomalies, then B will be flagged as an anomaly, while if an anomaly is defined by the distance from its neighbours, then B might be flagged as noise. A common method for anomaly detection involves the following steps:

• The behaviour and pattern of normal activity is modelled.

• The deviation from normal behaviour is then used to identify anomalies.

A hierarchical structure of anomaly detection techniques can be summarized as follows [21]:

• Statistical Based
  – Parametric
    ∗ Gaussian
    ∗ Regression
    ∗ Mixture
  – Non-Parametric
    ∗ Histogram
    ∗ Kernel Based
• Machine Learning Based
  – Classification
    ∗ Neural Network
    ∗ Bayesian Statistics
    ∗ SVM
    ∗ Rule Based
  – Nearest Neighbour
    ∗ Distance Based
    ∗ Density Based
  – Clustering

In this regard, the statistical-based approach often considers the statistical behaviour of the data set (e.g., mean, standard deviation, probability density function). Statistical tests are then used to calculate the deviation of abnormal data from normal data. Specifically, in parametric methods there is an assumption that the agent (user) has prior information regarding the distribution of the data, while in the non-parametric ones the system lacks prior information [22]. One of the non-parametric methods is to calculate and draw histogram bins and check whether new data falls into any of the histogram categories [21].


2.3 Machine Learning

By defining a specific task and proper performance evaluation measures, the author in [23] defines machine learning as a computer program that aims to learn from experience and improve with respect to the performance metric. The nature of learning introduces different kinds of learning strategies as follows:

• Supervised Learning: The learning system can be evaluated by comparing actual responses with desired responses so as to minimize the error between them.

• Unsupervised Learning: There exists no desired output in the experience.

• Reinforcement Learning: The aim would be to obtain an optimal action strategy with an exploration of states and actions.

• Semi-supervised Learning: It falls between supervised learning and unsupervised learning, in which the goal is to address the given problem by using unlabeled data, as well as labeled data.

The underlying concept of classification is to allocate given data instances to a preset of classes based on their specific features. In the neural network method, inspired by the human nervous system, the learning agent can be trained to work in a supervised or unsupervised fashion. Bayesian statistics deploys a posterior probability calculation based on the prior probability of the data set using Bayes' rule in the probability space [24]. Support Vector Machine (SVM) constructs disjoint groups by applying a multi-dimensional plane to the data set [21]. Finally, within classification, decision tree techniques illustrate rule-based learning, in which the normal behaviour of data is learned in a supervised manner. Classification-based methods usually suffer from requiring accurate class labels. Nearest-neighbour techniques use a density or distance measure to construct the neighbourhood of specific data points. While this approach is distribution-free, accurately labeling the data becomes difficult when the anomalies build a region with a close neighbourhood [21]. In clustering techniques, the given data set is divided into different clusters in an unsupervised manner. A cluster consists of a colony of instances that share the same attributes. While clustering may offer a fast execution time compared to distance-based methods, it may not achieve high accuracy for small data sets [21]. The authors in [21] reported that detection rate, accuracy, and scalability can be used to measure the performance of an anomaly detection system.

Researchers in [25] proposed a deep learning approach to identify cyber-physical security anomalies in power grid controllers; the model learns normal data, and any deviation from normal behaviour is analysed. A novel approach in which the convergence of the estimation error is guaranteed, for detecting patterns of operational changes in power system monitoring, is proposed in [26]. The authors in [27] reviewed recent enhancements in DC fault protection techniques for HVDC systems. Machine learning techniques such as Artificial Neural Networks and Fuzzy systems were considered in [28] for detecting faults in HVDC transmission lines. A new structure for secure connectivity in an industrial environment is proposed in [29]. The authors in [30] proposed a reinforcement learning structure in a communication network to predict the normal requests of the system. Considering an industrial environment, the authors in [31] proposed a resource provisioning approach to improve the reliability of the communication. Moreover, with recent enhancements in mobile robots [32], the detection of anomalous behaviors could be extended to other realms as well.


3. The Problem Formulation

Acquiring knowledge for the prediction of asset maintenance times, and for possible anomaly detection in an HVDC environment, from a raw data set can be time consuming, as it involves significant effort by domain experts. Thus, the thesis considers the value of employing machine learning algorithms to provide an automatic and accurate model for anomaly detection and time series prediction.

• RQ1: Which machine learning strategy (supervised, unsupervised, or semi-supervised) is suitable, in terms of higher accuracy, for anomaly detection in HVDC systems? Higher accuracy is defined as the detection rate of anomalies in the system, as well as a low false alarm rate in the recognition phase, where a false alarm is the case in which the learning system wrongly reports an anomaly.

• RQ2: Does a traditional stochastic time series forecasting model outperform a machine learning time series predictor in terms of the generated error?

• RQ3: Which features of the HVDC data set can be used to identify a model for anomaly detection?


4. Methodology

The thesis environment offers the ability to manipulate the behavior of systems precisely and systematically. Hence, experiments in an offline setting are a proper research methodology for a quantitative thesis. In this regard, the thesis aims to explore relationships, evaluate the accuracy of machine learning models, and validate the measures. The following steps constitute the methods that will be used in the thesis:

• Systematic literature review: The main goal is to fully identify the gap that justifies the research problem.

• Selecting a sample: This will be done in order to observe data for further machine learning feature extraction.

• Proposing different Machine Learning strategies and collecting data: This step will be considered to feed the machine learning algorithms.

• Interpreting and analyzing data: Comparison between different data groups will be made. Moreover, the relationship between variables will be investigated, and statistical analysis will be done.

• Evaluating research based on criteria and reporting

Figure 7 illustrates the overall steps in the research method that will be used in the thesis.

Figure 7: Overall research method (Research Problem Identification, Literature Review (Systematic Mapping Study Perspective), Selecting Sample, Data Collection, Data Analysis, Evaluating Research and Reporting).


In the context of research methodology, when constructing an experiment we are faced with the following variables:

• Independent Variable: The variables that we control are called independent variables.

• Dependent Variable: The variables on which we want to see the effect of changes; often, there exists only one dependent variable.

Moreover, the sets of independent variables are called factors, wherein a particular value of a factor is called a treatment. Often, in an experiment, the other independent variables are controlled at a fixed level so as to see the cause-and-effect result of adopting a particular treatment. According to [33], an experiment consists of a set of trials, where each trial is a combination of treatment, subjects, and objects. The subject is the agent (e.g., people, smart agents) that applies the treatment to the object; the object could be a document or even a computer program.

The authors in [33] define a controlled experiment as measuring the possible effect on the dependent variable from the manipulation of one factor, as can be seen in Figure 8.

Figure 8: Experiment Process, Adopted from [33]

Moreover, there exist treatments in a controlled experiment so as to compare the possible outcomes. Theory confirmation, confirmation of conventional wisdom, evaluation of model accuracy, and measure validation are among the settings in which an experiment can be used as the research methodology [33]. The overall process of the experiment can be organized in the following order [33]:

• Experiment Scoping
• Experiment Planning
• Experiment Operation
• Experiment Analysis
• Experiment Presentation

Figure 9 depicts the detailed version of the experiment process that will be used in this thesis.

4.1 Time Series Analysis

The authors in [34] defined a time series as "a time-oriented or chronological sequence of observations on the variable of interest". Many business applications tend to collect their data using a timestamp. An example of a time series is depicted in Figure 10.


Figure 9: Experimentation process (detailed activities within experiment inception, scoping, planning, operation, analysis, and presentation).

4.2 Time Series Component

Time series component can be classified into following categories:

• Trend: A pattern, often long term, that exists in the time series data is called a trend. A trend can be linear, nonlinear, positive, or negative.

• Level: The average value of time series is called level.

• Seasonality: Changes that tend to represent regular or predictable behavior in the short term are called seasonality.

• Residual: Random variations in a process that do not repeat are called residuals.

Figure 11 illustrates the main components of time series analysis [35].

Time series problem formulation can be classified into univariate analysis or multivariate analysis. In univariate analysis, the data set contains a single time-dependent variable, e.g., data collected from a specific sensor in an HVDC environment.


Figure 10: Example of time series plotting [35]

Figure 11: Component of time series [35]

In contrast, multivariate analysis involves multiple attributes that are time-dependent. An example of multivariate analysis would be forecasting rainfall using temperature and humidity [35].

4.3 Time Series and Stochastic Process

The combination of a sample space, a time-indexed function, and a probability measure is called a stochastic process, denoted by X(t). A stochastic process can serve as a representation model for an observed time series.

4.4 Statistical Properties of Stochastic Process

For a stochastic process X(t) : t = 0, 1, 2, 3, ..., the mean is calculated as:

$\mu_t = E(X_t); \quad t = 0, 1, 2, \dots \qquad (7)$

where $\mu_t$ is the expected value of the process at time $t$. By defining $\mathrm{Cov}(X_t, X_s)$ as

$\mathrm{Cov}(X_t, X_s) = E[(X_t - \mu_t)(X_s - \mu_s)] = E(X_t X_s) - \mu_t \mu_s \qquad (8)$

then the autocorrelation function is defined as

$\rho_{t,s} = \mathrm{Corr}(X_t, X_s) \qquad (9)$

where

$\mathrm{Corr}(X_t, X_s) = \frac{\mathrm{Cov}(X_t, X_s)}{\sqrt{\mathrm{Var}(X_t)\,\mathrm{Var}(X_s)}} \qquad (10)$

It should be noted that both covariance and correlation are used to indicate the linear dependence between random variables.

4.5 Concept of Stationarity

The mathematical idea of stationarity in a stochastic process states that a process $X_t$ is called strictly stationary if its distribution function

$F_X(x; t) = P[X(t) \leq x], \quad t \in \tau \qquad (11)$

does not change across the selected timestamps. In other words, if a process is stationary, then its distribution function does not depend on time; otherwise, we call the process non-stationary. It is important to note that stationarity is used as a prior assumption for some of the statistical inference in time series prediction. For many use cases, the restriction imposed by strict stationarity makes time series forecasting impossible. Thus, a weaker assumption is considered as well. A stochastic process $X_t$ is called weakly stationary if

• The mean does not change through time

• $\mathrm{Cov}(X_t, X_{t-k}) = \mathrm{Cov}(X_0, X_k)$ for all times $t$ and lags $k$

For instance, consider the random cosine wave, defined as

$X_t = \cos\!\left(2\pi\left(\frac{t}{12} + \psi\right)\right) \quad \text{for } t = 0, 1, 2, \dots$

where $\psi$ is uniformly distributed on $[0, 1]$. In order to check the weak stationarity of this random cosine wave, we need to calculate $\mu_t$ and $\mathrm{Cov}(X_t, X_s)$.

$E(X_t) = E\left\{\cos\!\left[2\pi\left(\tfrac{t}{12} + \psi\right)\right]\right\} = \int_0^1 \cos\!\left[2\pi\left(\tfrac{t}{12} + \psi\right)\right] d\psi = \frac{1}{2\pi}\sin\!\left[2\pi\left(\tfrac{t}{12} + \psi\right)\right]\Big|_{\psi=0}^{1} = \frac{1}{2\pi}\left\{\sin\!\left(2\pi\tfrac{t}{12} + 2\pi\right) - \sin\!\left(2\pi\tfrac{t}{12}\right)\right\} = 0 \qquad (12)$

Thus, the first condition of weak stationarity is satisfied, since the mean is constant. The second condition yields:

$\mathrm{Cov}(X_t, X_s) = E\left\{\cos\!\left[2\pi\left(\tfrac{t}{12}+\psi\right)\right]\cos\!\left[2\pi\left(\tfrac{s}{12}+\psi\right)\right]\right\} = \int_0^1 \cos\!\left[2\pi\left(\tfrac{t}{12}+\psi\right)\right]\cos\!\left[2\pi\left(\tfrac{s}{12}+\psi\right)\right] d\psi$
$= \frac{1}{2}\int_0^1 \left\{\cos\!\left(2\pi\tfrac{t-s}{12}\right) + \cos\!\left[2\pi\left(\tfrac{t+s}{12}+2\psi\right)\right]\right\} d\psi = \frac{1}{2}\left\{\cos\!\left(2\pi\tfrac{t-s}{12}\right) + \frac{1}{4\pi}\sin\!\left[2\pi\left(\tfrac{t+s}{12}+2\psi\right)\right]\Big|_{\psi=0}^{1}\right\} = \frac{1}{2}\cos\!\left(2\pi\tfrac{t-s}{12}\right) \qquad (13)$

The covariance depends only on the time lag $t - s$, and the autocorrelation function can be calculated using Eq. (13). Thus, the process is weakly stationary.


4.6 Stochastic Model for Time Series Forecasting

Different stochastic processes can be used to model the problem of time series forecasting. Two linear models, known as Autoregressive (AR) and Moving Average (MA), are widely used in the literature [36]. A combination of these models, the Auto Regressive Integrated Moving Average (ARIMA), has been proposed as well [37]. The simplicity and ease of implementation offered by linear models paved the way for their extensive usage in industry. In contrast, volatility in time series is predicted more appropriately by non-linear models.

4.7 ARMA Model

The ARMA model, parameterized by p and q, is a combination of the AR(p) and MA(q) models. The underlying assumption of AR(p) models the future as a linear combination of prior lags (observations), plus a constant and a random error term. The mathematical expression of the AR(p) model is as follows:

$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + \epsilon_t = c + \sum_{i=1}^{p} \phi_i y_{t-i} + \epsilon_t \qquad (14)$

where $y_t$ is the value at time $t$, and $\epsilon_t$ denotes the random error term at time $t$. All $\phi_i$ that contribute to the equation are model parameters, the constant is denoted by $c$, and $p$ denotes the order of the AR process.

While AR(p) regresses on past values of the time series, the MA(q) model uses prior errors as explanatory variables [36]. Thus, the following equation describes the MA(q) model:

$y_t = \mu + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \dots + \theta_q \epsilon_{t-q} + \epsilon_t = \mu + \sum_{j=1}^{q} \theta_j \epsilon_{t-j} + \epsilon_t \qquad (15)$

where the mean of the series is denoted by $\mu$, and the $\theta_j$ are the parameters of the MA(q) model. It should be noted that the random errors are assumed to be independent and identically distributed (iid) random variables.

As the name suggests, the ARMA model is an integration of the AR and MA models, parameterized by p and q, as follows:

$y_t = c + \sum_{i=1}^{p} \phi_i y_{t-i} + \sum_{j=1}^{q} \theta_j \epsilon_{t-j} + \epsilon_t \qquad (16)$

Now, let $X_1, X_2, \dots$ be a sequence of iid random variables with variance $\sigma^2$ and mean $\mu = 0$. We can then construct the time series $Y_t$ as:

$Y_1 = X_1$
$Y_2 = X_1 + X_2$
$\vdots$
$Y_t = X_1 + X_2 + X_3 + \dots + X_t \qquad (17)$

Using a recurrence, $Y_t$ can be written as:

$Y_t = Y_{t-1} + X_t \qquad (18)$

The mean of such time series could be calculated as follows.

$\mu_t = E(Y_t) = E(X_1 + X_2 + \dots + X_t) = E(X_1) + E(X_2) + \dots + E(X_t) \qquad (19)$

We can simply infer that $\mu_t = 0$ for all times $t$. Moreover, we have

$\mathrm{Var}(Y_t) = \mathrm{Var}(X_1) + \mathrm{Var}(X_2) + \dots + \mathrm{Var}(X_t) = \sigma^2 + \sigma^2 + \dots = t\sigma^2 \qquad (20)$

With the same intuition, the covariance function $\gamma_{t,s} = \mathrm{Cov}(Y_t, Y_s)$ can simply be calculated as:

$\gamma_{t,s} = \mathrm{Cov}(Y_t, Y_s) = \sum_{i=1}^{s}\sum_{j=1}^{t} \mathrm{Cov}(X_i, X_j) \qquad (21)$

This process is called a random walk. Now, AR(1) model could be related to the random walk as follows.

• If $\phi_1 = 1$ and $c \neq 0$, then $y_t$ is a random walk with drift.

• If $\phi_1 = 1$ and $c = 0$, then $y_t$ is a random walk.

• If $\phi_1 = 0$, then $y_t$ is white noise (a time series that shows no pattern of autocorrelation is called white noise).

• If $\phi_1 < 0$, then $y_t$ tends to oscillate around its mean.

Using the same intuition, the ARIMA model, which adds an integrated part to make a series stationary, can be related to autoregression, moving average, and the random walk as given below.

White noise: ARIMA(0, 0, 0)
Autoregression: ARIMA(p, 0, 0)
Moving average: ARIMA(0, 0, q)
Random walk: ARIMA(0, 1, 0)
Random walk with drift: ARIMA(0, 1, 0) with constant

Fitting an ARMA(p, q) model to a data set is a challenging task. In other words, finding appropriate values for p and q affects the result of the forecast for specific requirements. One of the most important approaches in this field is the Box and Jenkins approach [38]. The method consists of the following three steps:

1. Identification
2. Estimation
3. Diagnostic Checking

Step 1 deals with the identification of the order of the required model, which is mainly done by plotting the series, the autocorrelation function, and the partial autocorrelation function. In step 2, different candidate models are estimated and compared via information criteria. In the next step, model checking is done by calculating residual diagnostics. Inspired by the main concept of the Box and Jenkins method, a pseudo-code for tuning the parameters of the ARIMA model is presented in this thesis to fit the best model for ABB's requirements.
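As an illustration only, and not the exact helper used in the thesis, a minimal sketch of such a tuning loop in Python could look as follows. The statsmodels library, the search bounds, and the use of the AIC as the selection criterion are assumptions on my part:

```python
import itertools
import warnings

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

def tune_arima(series: pd.Series, max_p: int = 3, max_d: int = 2, max_q: int = 3):
    """Grid-search ARIMA(p, d, q) orders and keep the fit with the lowest AIC."""
    best_aic, best_order, best_fit = float("inf"), None, None
    for p, d, q in itertools.product(range(max_p + 1), range(max_d + 1), range(max_q + 1)):
        try:
            with warnings.catch_warnings():
                warnings.simplefilter("ignore")
                fit = ARIMA(series, order=(p, d, q)).fit()
        except Exception:
            continue  # orders that fail to converge are skipped
        if fit.aic < best_aic:
            best_aic, best_order, best_fit = fit.aic, (p, d, q), fit
    return best_order, best_fit
```

Diagnostic checking of the selected model, e.g., inspecting the residual autocorrelation, would still follow, as in the original Box and Jenkins procedure.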

Defining the classification task as finding a target function, also called a classification model, that maps a data row to a predefined label, a tree data structure can be used to solve a classification problem. The desired tree has the following structure:


• Root node: Accepts no incoming edges, but usually has outgoing edges.

• Leaf (terminal node): A node that has no outgoing edges.

• Internal node: A node that has two or more outgoing edges, while it has only one incoming edge.

In the design of a decision tree for classification, the non-terminal nodes represent a test condition that separates the records, and each terminal node is designated by a class label. An illustration of a decision tree in a classification scenario is depicted in Figure 12 [39].

Figure 12: Decision tree in classification

Mathematical intuition yields that, for a class variable denoted by $Y$ which can take values $1, 2, \dots, k$, as well as $p$ predictor variables, a tree-based classification solution partitions the variable space $X$ into $k$ disjoint partitions $A_1, A_2, \dots, A_k$. If $X$ belongs to $A_j$, then the predicted value of $Y$ is $j$. This process is demonstrated in Figure 13 [40].


The same intuition as for the classification tree can be deployed for the regression tree, with the modification that the $Y$ variable can take continuous values. The pseudo-code for the regression tree is presented below.

Algorithm 1 Regression Tree

procedure Regression(A)
    at each iteration:
    for each variable x_k do
        find the optimal cutting point s:
            min_s [ MSE(y_i | x_ik < s) + MSE(y_i | x_ik >= s) ]
    end for
    choose the variable k yielding the lowest mean squared error
    (terminate when the MSE becomes sufficiently small)
end procedure
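As a small illustration of the MSE criterion in Algorithm 1 (a sketch of my own, not the thesis implementation), the search for the best cut point of a single variable could be written as:

```python
import numpy as np

def best_split(x: np.ndarray, y: np.ndarray):
    """Return the cut point s minimizing MSE(y | x < s) + MSE(y | x >= s)."""
    best_s, best_cost = None, np.inf
    for s in np.unique(x)[1:]:                 # candidate cut points
        left, right = y[x < s], y[x >= s]
        cost = ((left - left.mean()) ** 2).mean() + ((right - right.mean()) ** 2).mean()
        if cost < best_cost:
            best_s, best_cost = s, cost
    return best_s, best_cost
```

Repeating this search over all variables and choosing the one with the lowest cost gives the split used at each node.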

4.8 Random Forest

Random Forest builds on the idea of bagging (bootstrap aggregation):

• $B$ bootstrap data sets are drawn from the training data $\{y_i^{(b)}, x_i^{(b)}\}_{i=1}^{N}$ by sampling with replacement.

• For $b = 1, 2, \dots, B$, the prediction $\hat{y}_i^{(b)}$ is calculated on the data set.

• The final prediction is the average over the bootstraps: $\hat{y}_i = \frac{1}{B}\sum_{b=1}^{B} \hat{y}_i^{(b)}$.

The idea of bagging helps us with bringing down the variance of the model as follows:

$\mathrm{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B}\hat{y}_i^{(b)}\right) = \frac{1}{B^2}\sum_{b=1}^{B}\mathrm{Var}(\hat{y}_i^{(b)}) = \frac{1}{B}\mathrm{Var}(\hat{y}_i^{(b)}) \qquad (22)$

Note that this holds when the $\hat{y}_i^{(b)}$ are iid over $b$. This is specifically useful for tree-based structures, which exhibit high variance and low bias.

Before relaxing the iid assumption, we calculate the variance of a sum of random variables $X_1, \dots, X_n$ as follows:

$\mathrm{Var}\!\left(\sum_{i=1}^{n} X_i\right) = E\!\left[\left(\sum_{i=1}^{n} X_i\right)^2\right] - \left[E\!\left(\sum_{i=1}^{n} X_i\right)\right]^2, \qquad \left[E\!\left(\sum_{i=1}^{n} X_i\right)\right]^2 = \left(\sum_{i=1}^{n} E(X_i)\right)^2 = \sum_{i=1}^{n}\sum_{j=1}^{n} E(X_i)E(X_j) \qquad (23)$


$E\!\left[\left(\sum_{i=1}^{n} X_i\right)^2\right] = E\!\left(\sum_{i=1}^{n}\sum_{j=1}^{n} X_i X_j\right) = \sum_{i=1}^{n}\sum_{j=1}^{n} E(X_i X_j)$

$\mathrm{Var}\!\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n}\sum_{j=1}^{n}\left(E(X_i X_j) - E(X_i)E(X_j)\right) = \sum_{i=1}^{n}\sum_{j=1}^{n}\mathrm{Cov}(X_i, X_j) = \sum_{i=1}^{n}\mathrm{Var}(X_i) + \sum_{i \neq j}\mathrm{Cov}(X_i, X_j) \qquad (24)$

Now, we need to relax the iid assumption, as the bootstraps can be highly correlated. Assuming $\mathrm{Corr}(\hat{y}^{(b_1)}, \hat{y}^{(b_2)}) = \rho$, where $\rho \neq 0$, and $\mathrm{Var}(\hat{y}^{(b)}) = \sigma^2$, we can calculate

$\mathrm{Var}\!\left(\frac{1}{B}\sum_{b}\hat{y}_i^{(b)}\right) = \frac{1}{B^2}\left[\sum_{b_1}\sum_{b_2 \neq b_1}\mathrm{Cov}(\hat{y}^{(b_1)}, \hat{y}^{(b_2)}) + \sum_{b}\mathrm{Var}(\hat{y}^{(b)})\right] = \frac{1}{B^2}\left[\sigma^2 \rho B(B-1) + B\sigma^2\right] = (1 - B^{-1})\rho\sigma^2 + \sigma^2 B^{-1} = \rho\sigma^2 + (1-\rho)\sigma^2 B^{-1} \qquad (25)$

The above calculation shows that, as $B \to \infty$, $\mathrm{Var}(\bar{\hat{y}}) \to \rho\sigma^2$, so there is a lower bound on the variance. A lower value of $\rho$ results in a lower variance. Thus, when the individual regression trees are correlated, bagging is less helpful. Hence, randomizing the variable selection helps the model. The Random Forest model that will be used in the thesis is illustrated in Figure 14.


4.9 PCA

Principal Component Analysis (PCA) is a method for dimensionality reduction that is used to transform a data set from a high-dimensional space to a lower-dimensional space without losing much information. PCA assumes that the data reside in a subspace of their representation in the high-dimensional space. For instance, in Figure 15, although the data points can be identified by their (x, y) coordinates, since the data reside in a subspace of the 2D space, PCA can appropriately reduce the dimensionality of the data points, i.e., from a two-dimensional space to a one-dimensional space denoted by a line. More specifically, PCA accepts an input $x \in \mathbb{R}^d$ and outputs $y \in \mathbb{R}^p$ where $p \ll d$. $u_1$ in Figure 15 is called the first principal component. Moreover, there exists another principal component that is orthogonal to the first one. Intuitively, PCA seeks a lower-dimensional vector space of the data that preserves the maximum variation for modeling, classification, or regression scenarios. Now, consider $n$ data points in $d$-dimensional space, $X = [X_1, X_2, \dots, X_n]_{d \times n}$. The goal is to project the points onto a vector called the first component $U_1$, i.e., to find $U_1^T X$, where $U_1$ is a $d \times 1$ vector.

Figure 15: PCA

The result of $U_1^T X$ is a $1 \times n$ vector, which is the final output of the method. The underlying goal of PCA is to maximize $\mathrm{Var}(U_1^T X)$. The problem formulation can be represented as an optimization problem in which we want to

$\max_{U_1} \mathrm{Var}(U_1^T X) \qquad (26)$

where $\mathrm{Var}(U_1^T X) = U_1^T S U_1$ and $S$ is the $d \times d$ covariance matrix. Thus, we need to maximize $U_1^T S U_1$. Adding the constraint $U_1^T U_1 = 1$, we can define the Lagrangian

$L(U_1, \lambda) = U_1^T S U_1 - \lambda(U_1^T U_1 - 1) \qquad (27)$


This corresponds to the method of Lagrange multipliers, i.e., maximizing $f(x, y)$ subject to a constraint $g(x, y) = c$. The problem can be defined as:

$L = f(x, y) - \lambda(g(x, y) - c), \qquad \nabla f(x, y) - \lambda \nabla g(x, y) = 0 \;\Rightarrow\; \nabla f(x, y) = \lambda \nabla g(x, y) \qquad (28)$

Now, by differentiating Eq. (27) with respect to $U_1$, we have

$\frac{\partial L}{\partial U_1} = 2SU_1 - 2\lambda U_1 = 0 \;\Rightarrow\; 2SU_1 = 2\lambda U_1 \;\Rightarrow\; SU_1 = \lambda U_1 \qquad (29)$

The first component is the vector that captures the maximum variation of the data. Basically, the principal components can be found by the eigendecomposition of the covariance matrix $S$, considering the following definition:

• Let $A$ denote an $n \times n$ matrix. A scalar $\lambda$ is called an eigenvalue of $A$ if there exists a nonzero vector $\bar{X}$ such that $A\bar{X} = \lambda\bar{X}$. In this case, the vector $\bar{X}$ is known as an eigenvector of $A$ corresponding to $\lambda$.

The advantages of PCA can be summarized as follows:

• PCA extracts the principal components.
• PCA can also be used for feature extraction.
• The result explains most of the variation present in the data.
• It improves the performance of other algorithms executed on the data set, specifically in the realm of big data.

The disadvantage of PCA is as follows:

• Non-linear structure cannot be captured by PCA.
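A minimal sketch of PCA via eigendecomposition of the covariance matrix, assuming a numpy array X of shape (n_samples, n_features); in practice scikit-learn's PCA class would likely be used and gives equivalent results:

```python
import numpy as np

def pca(X: np.ndarray, n_components: int = 2) -> np.ndarray:
    """Project X onto its top principal components."""
    Xc = X - X.mean(axis=0)               # center the data
    S = np.cov(Xc, rowvar=False)          # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)  # eigh suits the symmetric matrix S
    order = np.argsort(eigvals)[::-1]     # sort components by decreasing variance
    return Xc @ eigvecs[:, order[:n_components]]
```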

4.10 Multidimensional Scaling (MDS)

Similar to the objective of PCA, multidimensional scaling (MDS) also seeks to map the original high-dimensional space to a lower one. The main difference from PCA is that the objective of MDS is to preserve the pairwise distances. Consider the case that $D$ is an affinity matrix, i.e., $d_{ii} = 0$ and $d_{ij} > 0$ for $i \neq j$.

MDS accepts an affinity matrix $D^{(X)}$ and tries to find $t$ points $y_1, \dots, y_t$ in the lower-dimensional space such that $D^{(X)}$ is similar to $D^{(Y)}$. Mathematically, we have an optimization problem in which we want to minimize

$\min_Y \sum_{i=1}^{t}\sum_{j=1}^{t}\left(d_{ij}^{(X)} - d_{ij}^{(Y)}\right)^2, \qquad d_{ij}^{(Y)} = \|y_i - y_j\|, \quad d_{ij}^{(X)} = \|x_i - x_j\|$

A similar intuition as for PCA can be used to solve the MDS optimization problem. The advantage of MDS is that the distance constraints can be relaxed, while the procedure of choosing a distance transformation is sometimes complex.


4.11 Isomap

As discussed earlier, both PCA and MDS seek to provide a lower-dimensional representation of the data under different assumptions. Isomap is an extension of MDS to the geodesic space of a nonlinear manifold. It should be noted that the geodesic distance is defined as the shortest path along the curved surface of the manifold. The Isomap algorithm can be represented by the following workflow:

• Calculate the neighbours of each data point (in the high-dimensional space).

• Calculate the pairwise geodesic distances between all points using a graph representation.

• Use MDS to preserve these distances.

An example of Isomap, fetched from the original paper, is illustrated in Figure 16 [41].

Figure 16: Isomap

4.12 t-SNE

t-distributed stochastic neighbour embedding, known as t-SNE, is another dimensionality reduction algorithm that focuses on data visualization. In order to illustrate the mathematical intuition, we first consider stochastic neighbour embedding (SNE). Let the initial data be in the form $X = [x_1, \dots, x_n]_{d \times n}$ (in $d$-dimensional space). The goal is to find a representation $Y = [y_1, \dots, y_n]_{p \times n}$ where $p$ is less than $d$. The overall procedure of SNE is as follows.

1. Convert distances to probabilities such that, for any $x_i$ and $x_j$:

$p_{j|i} = \frac{\exp\!\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i}\exp\!\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)} \qquad (30)$

where $p_{i|i} = 0$, and $p_{j|i}$ denotes the probability that $i$ chooses $j$ as its neighbour.

2. The definition of $q_{j|i}$ in the $Y$ space is

$q_{j|i} = \frac{\exp\!\left(-\|y_i - y_j\|^2\right)}{\sum_{k \neq i}\exp\!\left(-\|y_i - y_k\|^2\right)} \qquad (31)$

Based on Eq. (30), points that are close to each other are granted a high probability. Moreover, by considering Eq. (31) together with Eq. (30), once the values of $y_i$ are calculated, we can define a cost function with respect to the distance between the distributions in the $p$-dimensional and the $d$-dimensional space. An appropriate $Y$ would be the one that minimizes this distance. Thus, a proper cost function is the KL divergence:

$\mathrm{Cost} = \min_{y_i, y_j} \mathrm{KL}(P\|Q) = \sum_{i}\sum_{j} p_{j|i}\log\frac{p_{j|i}}{q_{j|i}}$

By taking the derivative of the cost function with respect to the unknowns and applying gradient descent, a solution path is obtained.

One important notice is that t-SNE differs from SNE in two ways. First, the probability $p_{j|i}$ is made symmetric, such that

$p_{ij} = \frac{\exp\!\left(-\|x_i - x_j\|^2 / 2\sigma^2\right)}{\sum_{k \neq i}\exp\!\left(-\|x_i - x_k\|^2 / 2\sigma^2\right)}$

where $\sigma$ is the same for the whole space. Second, the distribution $q_{j|i}$ is replaced by a t-distribution, due to its longer tail. Since t-SNE is a powerful technique for preserving the local structure of the data points, it has been widely used for data visualization purposes [42]. Nevertheless, being a non-convex optimization problem, t-SNE suffers from high computational cost for large data sets.
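For reference, all three manifold methods discussed in this and the previous subsections are available in scikit-learn; a minimal sketch, with a random placeholder matrix standing in for the preprocessed HVDC features and with arbitrary parameter values, could look as follows:

```python
import numpy as np
from sklearn.manifold import MDS, Isomap, TSNE

X = np.random.RandomState(0).rand(200, 10)   # placeholder for preprocessed HVDC features

X_mds = MDS(n_components=2).fit_transform(X)                      # preserves pairwise distances
X_iso = Isomap(n_components=2, n_neighbors=10).fit_transform(X)   # preserves geodesic distances
X_tsne = TSNE(n_components=2, perplexity=30).fit_transform(X)     # preserves local neighbourhoods
```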


5. Experimental Analysis

The methods discussed in the previous sections have been tested on real data logged on HVDC transmission stations. This section discusses the tools and methodology considered in this experimental analysis, while a discussion on the outcomes is provided in the next section. The overall structure of the experimental section is given in a framework which is capable of fulfilling a specific range of requirements needed by HVDC experts. In the case of big data, the modules regarding dimensionality reduction and feature selection are provided in the framework that could be used as prior steps for other machine learning approaches as well.

5.1 Development Environment

5.1.1 Jupyter

The experimental result, specifically for industrial use cases, is the most crucial part of the research on the industry side. Although using different programming languages and their Integrated Development Environments (IDEs) seems easy for the user, there exist many pitfalls when developing experimental cases in the traditional way of computing. Difficult migration from one environment to another, dependence on the user's computational power, text-based output, and preserving privacy are among the disadvantages of old programming environments. Considering the ability to export reproducible experiments and to easily share documentation, Jupyter Notebook is a system designed for interactive programming [43]. Jupyter Notebook supports Python, C, Julia, Javascript, and R [43] in an integrated interactive environment served via a web server. The advantages of using Jupyter Notebook in data science projects are listed as follows.

• The ability to share code easily

• The environment could be language-independent
• It is a tool for interactive data visualization

• The ability to implement a private server that runs distributed clients using Jupyter Hub

Considering all the features offered by Jupyter Notebook, the experimental section was developed in the Jupyter environment, which makes future work reproducible. An illustration of the Jupyter Notebook used in the thesis can be seen in Figure 17.


Figure 17: Jupyter Notebook environment

In the previous section, the analysis of the mathematical foundation of methods was given. In this section, a prototype of the experimental phase will be given. Then, the model will be described, and the result of the implementation will be presented. The overall workflow of the time series forecasting module is illustrated in Figure 18.


Figure 18: Time series forecasting module

In this module, the univariate time series data is provided to the time series preprocessing step. The following procedures are then performed.

• Stationarity checking: Using the notion of weak stationarity and the Dickey-Fuller test, all time series are checked for stationarity.

• Index Checking: This is a proactive procedure to be sure about the integrity of time series indexes.

• Confidentiality checking: In this stage, rules imposed by the company are considered to remove confidential data.

• Splitting Phase: The data set is divided into a training set and a testing set, while considering the intrinsic characteristics of time series.

It should be noted that the experimental part of the thesis contains a function that accepts a data frame object, preprocessed by the prior steps, and executes the stationarity test on all attributes of the data frame object. Then, two models are ready to forecast the time series data. In the ARIMAX setting, hyperparameter tuning using a modified version of Box-Jenkins is used to find appropriate parameters for the ARIMAX model. The model is then fitted on the training data set and is ready to forecast unseen future data.
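A minimal sketch of such a stationarity check, assuming the statsmodels library and a pandas DataFrame df whose columns are the individual time series; the 0.05 significance level is a common but arbitrary choice, not necessarily the one used in the thesis:

```python
import pandas as pd
from statsmodels.tsa.stattools import adfuller

def check_stationarity(df: pd.DataFrame, alpha: float = 0.05) -> dict:
    """Run the augmented Dickey-Fuller test on every column of df."""
    report = {}
    for col in df.columns:
        stat, p_value = adfuller(df[col].dropna())[:2]   # test statistic and p-value
        report[col] = {"adf": stat, "p_value": p_value, "stationary": p_value < alpha}
    return report
```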


The contribution of the thesis in this section is to visualize the autocorrelation function for each univariate time series analysis so as to acquire prior knowledge. In this case, a prior assumption of how the previous lags of the time series affect the prediction at the current time t becomes clear to the company expert. An example of an autocorrelation function (ACF) is illustrated in Figure 19. It should be noted that the experimental result of this section is anonymized based on the regulations for protecting private data at HVDC ABB.

Figure 19: Autocorrelation Function for univariate time series

As can be seen in Figure 19, the prior lags, up to time t − 11, show a significantly positive correlation, such that all prior lags are greater than the confidence interval of the blue region. In the same manner, the univariate time series is fed into the Random Forest Regression ensemble learner. At this juncture, for parameter tuning, the thesis contains a helper function to tune the following parameters using a grid search.

• The number of trees in the forest

• The value of the maximum depth

• The maximum number of attributes to consider for splitting

A helper function is written to run a grid search for the Random Forest model. In the next step, using Python data manipulation techniques, a new data set is initialized based on the prior lags of the time series. For instance, a new data frame object contains 13 columns, in which 12 columns represent the prior lags (t − 1, t − 2, ..., t − 12) and the target column represents the current time t. It should be noted that, by shifting the data, some rows are filled with NaN, which means the prior lags of these rows were not available. One representation of the new data frame containing all prior lags is depicted in Figure 20.


Figure 20: Data frame with prior lags
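A minimal sketch of this lag-based reformulation together with the grid search, assuming pandas and scikit-learn; the series below is a synthetic placeholder and the parameter grid values are illustrative, not the ones used in the thesis:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

def make_lag_frame(series: pd.Series, n_lags: int = 12) -> pd.DataFrame:
    """Build a supervised-learning frame with columns t-1 .. t-n_lags plus the target t."""
    frame = pd.concat({f"t-{k}": series.shift(k) for k in range(1, n_lags + 1)}, axis=1)
    frame["t"] = series
    return frame.dropna()          # rows whose prior lags are unavailable contain NaN

series = pd.Series(np.sin(np.arange(200) / 5.0))    # placeholder for an HVDC signal
frame = make_lag_frame(series)
X, y = frame.drop(columns="t"), frame["t"]

param_grid = {"n_estimators": [100, 300], "max_depth": [5, 10, None], "max_features": ["sqrt", 1.0]}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```

For a stricter evaluation, a time-ordered split (e.g., scikit-learn's TimeSeriesSplit) could replace the default cross-validation.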

After fitting the random forest regressor on the training data, it is ready to produce one-step predictions for the next value in the testing data set. Finally, we are able to measure the performance of the model using different metrics.

5.1.2 Anomaly Detection in Raw Data Module

Considering an anomaly as a data point that is "few and different" [44], Isolation Forest, a concept similar to Random Forest, constructs randomly generated binary trees that seek to isolate the data points in a data set. In this manner, a tree used for the isolation process is created by the following steps [44].

1. The model samples N data points randomly.

2. A feature of the data set is randomly chosen to be processed.

3. Considering the minimum and maximum values of the selected feature, a split value is chosen to split the data set.

4. Steps 2 and 3 are repeated until the N instances are isolated.

The outlier score depends on the number of iterations needed to isolate each row of data. For instance, isolating a non-outlier point, which is assumed to be surrounded by other points, needs more iterations. In contrast, as anomalies are assumed to be few and different, the isolation process needs fewer steps. Figure 21 depicts the execution of Isolation Forest on a normal and an anomalous point: $x_0$ is an anomaly point that was isolated using 3 partitions, while $x_1$, as a normal point, needs 9 partitions to be isolated.


Figure 21: Isolation Forest - Anomaly Detection [45]
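A minimal sketch of this procedure with scikit-learn's IsolationForest on synthetic placeholder data (the real module works on the labelled HVDC data set; the contamination value is an assumption):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import confusion_matrix

rng = np.random.RandomState(0)
X_train = rng.normal(size=(500, 4))                               # placeholder normal rows
X_test = np.vstack([rng.normal(size=(95, 4)), rng.normal(6.0, 1.0, size=(5, 4))])
y_test = np.array([0] * 95 + [1] * 5)                             # 1 marks an anomalous row

model = IsolationForest(n_estimators=100, contamination=0.05, random_state=0).fit(X_train)
pred = (model.predict(X_test) == -1).astype(int)                  # -1 means isolated quickly, i.e. outlier
print(confusion_matrix(y_test, pred))
```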

In the Raw Data module, the HVDC expert imports the data set into the module in a supervised manner. Similar to the Random Forest procedure, the model is trained on the training data set and then used for prediction on the test data set. The performance of the model on the training data set is depicted as a confusion matrix in Figure 22. In the same manner, as can be seen from Figure 23, the model accurately predicted 58 rows of data as anomalies, while it reported 13 rows as false negatives.


Figure 23: Isolation Forest - confusion matrix on the testing data set

5.2 Evaluation Metrics

5.2.1 MAE

By defining $y_i$ as the actual value of the time series and $\hat{y}_i$ as the estimated value, the Mean Absolute Error (MAE) is defined as follows.

$MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i| \qquad (32)$

MAE is the average of the absolute residuals, i.e., of the differences between the true values and the predictions.

5.2.2 RMSE

Using the same intuition, Root Mean Square Error(RMSE) is defined as

$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2} \qquad (33)$

Since the errors are squared before the average is taken, RMSE gives more weight to large errors.

5.2.3 MSE

The Mean Squared Error (MSE) is defined as the sum of squared errors divided by the number of observations, $MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$.
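For reference, the three metrics can be computed directly with numpy (a small sketch, assuming arrays of true values y and forecasts y_hat):

```python
import numpy as np

def evaluate(y, y_hat):
    """Return MAE, RMSE and MSE for a forecast."""
    err = np.asarray(y) - np.asarray(y_hat)
    mae = np.mean(np.abs(err))
    mse = np.mean(err ** 2)
    return mae, np.sqrt(mse), mse
```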


6. Experimental Results and Discussion

6.1 Time series Forecasting

6.1.1 Univariate Forecasting

One of the challenges of adopting Random Forest for predicting time series was that the model did not show significantly accurate results when the input time series was non-stationary. A transformation can be used to remove the temporal dependence. Differencing is defined as

$\mathrm{difference}(t) = \mathrm{observation}(t) - \mathrm{observation}(t-1) \qquad (34)$

where the order of differencing may start from 1. There is a possibility that the time series is still temporally dependent after differencing of order 1; thus, in some cases, a higher order of differencing is needed. The experimental section of the thesis contains a helper function implementing differencing, as well as an inverse method to roll the transformed values back to the original scale, which is included in the Appendix. Upon receiving a data set containing a univariate time series, the data set went through the modules represented in Figure 18 incrementally. The model received the univariate time series shown in Figure 24.

Figure 24: Input time series
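A minimal sketch of the differencing and inverse-differencing helpers mentioned above (the actual helpers are in the thesis appendix; this version assumes pandas Series input and a first-order roll-back):

```python
import pandas as pd

def difference(series: pd.Series, order: int = 1) -> pd.Series:
    """Apply order-k differencing: difference(t) = observation(t) - observation(t - 1)."""
    for _ in range(order):
        series = series.diff().dropna()
    return series

def inverse_difference(last_observation: float, diff_value: float) -> float:
    """Roll a single first-order differenced value back to the original scale."""
    return last_observation + diff_value
```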

The final output of the ARIMA model is shown in Figure 25.

Figure 25: ARIMA output

In Figure 25, the blue line represents the actual values, while the red line is the output of the ARIMA forecast. The traditional ARIMA model seems to start forecasting the future in a proper way. There is an attempt to forecast the volatility of the data until the time index of March; afterward, the model shows only a very small amount of fluctuation and seems to output only an average line for the multi-step prediction. In the same manner, the same input data was given to the Random Forest predictor. The final result of the Random Forest model is shown in Figure 26.

Figure 26: Random Forest output

The comparison of the performance metrics of ARIMA forecasting and Random Forest is represented in Figure 27. As can be seen in Figure 27, the Random Forest outperforms the traditional stochastic forecasting of the ARIMA model. Since the performance metrics represent the error term, the ideal output of a forecasting model is an error of 0.0. It should be noted that the performance metrics defined earlier are scale-dependent. Moreover, the values of the time series lags are bounded inside a small interval; thus, the error values need to be considered with caution. Although Random Forest outperforms the ARIMA model in predicting the time series, the experimental result shows that, if the input time series is non-stationary and the stationarity test yields a high p-value, then the ARIMA model intrinsically provides an appropriate solution.


Figure 27: Performance evaluation

As can be seen in Figure 26, the machine learning approach can potentially capture the volatility of the data better. This helps the expert in the HVDC environment to detect anomalies by defining a threshold used to compare the forecast against the actual values that will be generated in the future. Nevertheless, the Random Forest model does not offer 100% accuracy; for example, the model shows lower predictive accuracy in the time frames prior to 2019-03 in Figure 26.
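To make the workflow concrete, the following is a minimal sketch of the supervised reframing described earlier: prior lags serve as features, a RandomForestRegressor produces the forecast, and a simple threshold on the absolute residual flags candidate anomalies. The synthetic series, number of lags, forest size, and the 3-sigma threshold are assumptions for illustration only, not the thesis configuration.

    # Minimal sketch: lags as features, Random Forest forecast, threshold flagging.
    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor

    def make_lagged(series, n_lags=3):
        # build a supervised table: columns lag_1..lag_n, target y
        df = pd.DataFrame({"y": series})
        for lag in range(1, n_lags + 1):
            df[f"lag_{lag}"] = df["y"].shift(lag)
        return df.dropna()

    series = pd.Series(np.sin(np.linspace(0, 20, 200)) + 0.1 * np.random.randn(200))
    data = make_lagged(series, n_lags=3)

    split = int(len(data) * 0.8)
    train, test = data.iloc[:split], data.iloc[split:]
    features = [c for c in data.columns if c.startswith("lag_")]

    model = RandomForestRegressor(n_estimators=200, random_state=42)
    model.fit(train[features], train["y"])

    forecast = model.predict(test[features])
    residual = np.abs(test["y"].values - forecast)

    threshold = 3 * residual.std()          # assumption: 3-sigma rule
    anomalies = test.index[residual > threshold]
    print("flagged time steps:", list(anomalies))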


6.1.2 Multivariate Time Series Forecasting

Machine learning approaches in the realm of time series forecasting can also model multivariate time series. One of the benefits that Random Forest offers is the capability of using one time series to predict another, a topic that intrigued the company as well. In this scheme, a system expert reports each univariate time series that could be used for forecasting the target. It should be taken into consideration that the traditional stochastic forecasting was not introduced to fulfil this specific requirement. The proposed Random Forest model in the framework is able to receive a time series X and predict the value of another time series Y. The same intuition is used for this requirement, and the final result is shown in Figure 28.

Figure 28: Random Forest output for multivariate time series forecasting

HVDC experts reported the time series that could be used for the prediction of a target time series. In this model, the proposed Random Forest model is trained on the training data set and outputs its prediction on the test data set. In Figure 28, the target time series in the test data set is represented by the red line, while the blue line shows the prediction obtained from the X values of the test data set.


Moreover, the difference between the actual value of the target time series and the predicted time series is calculated for the test data set; the result is represented in Figure 29, in which the X axis represents time and the Y axis the difference value. The prediction error is close to zero at many time steps. It should be kept in mind that, since the target time series may be influenced by other variables, i.e., its own prior lags and other time series and metrics, the prediction cannot be expected to reach high accuracy. Nevertheless, it is useful for predicting the spikes and trends of the target using the input time series.
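A minimal sketch of this cross-series prediction is given below, assuming a data set with two numeric columns x (input series) and y (target series); the file and column names are placeholders.

    # Minimal sketch: predict target series y from input series x and compute residuals.
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor

    df = pd.read_csv("hvdc_pair.csv")        # hypothetical file with columns x, y
    split = int(len(df) * 0.8)
    train, test = df.iloc[:split], df.iloc[split:]

    model = RandomForestRegressor(n_estimators=200, random_state=42)
    model.fit(train[["x"]], train["y"])

    prediction = model.predict(test[["x"]])
    residual = test["y"].values - prediction   # difference plotted over time
    print(residual[:10])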

6.2 Answers to RQs

The answers to the research questions are as follows.

• RQ1: Deploying an unsupervised, supervised, or semi-supervised approach depends on the nature of the company's data set. When the data set contains a target value filled in by an industry expert, supervised learning is possible; otherwise, when the target attribute is missing, the method needs to work in an unsupervised manner as well. In this thesis, since target labels were not available, directly developing supervised learning was not possible; the contribution of the thesis was instead to formulate the problem as supervised learning using the prior lags of the time series. This formulation brings the advantage of accurate performance metrics for evaluation, whereas a purely unsupervised approach did not provide much information gain for the company. Finally, the proposed supervised learning model was shown to be appropriate using mathematical notation. By extending the idea of the Random Forest model, a tree-based anomaly detection technique called Isolation Forest was also considered for supervised anomaly detection, which could be evaluated using a confusion matrix.

• RQ2: Using an experimental research methodology, a comparison of performance between the machine learning methods and the traditional stochastic model was conducted. Based on the evaluation metrics, the results show that the machine learning methods provide the prediction accuracy that the industry needs.

• RQ3: Parts of the experiment and the background section deal with dimensionality reduction techniques for large data sets. The thesis proactively considers the case where the input data set is too large for training the model, resulting in lower running speed and accuracy. For this case, the criteria for the different dimensionality reduction techniques were provided to the company. In the thesis, the number of attributes in the raw data was 113. Using

– Preprocessing method: dealing with null values
– Variance thresholding
– Dimensionality reduction criteria

the data was transformed to contain 50 columns. A minimal sketch of these steps is given below.
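The sketch assumes numeric sensor columns; the null-fraction cut-off and the variance threshold are illustrative values, not the ones used to reach the 50 columns reported above.

    # Minimal sketch: drop null-heavy columns, then apply variance thresholding.
    import pandas as pd
    from sklearn.feature_selection import VarianceThreshold

    raw = pd.read_csv("hvdc_raw.csv")                   # hypothetical raw export

    # Drop columns where more than half of the values are missing (assumed cut-off),
    # keep only numeric attributes, and fill remaining gaps with the column mean.
    cleaned = raw.loc[:, raw.isna().mean() <= 0.5]
    numeric = cleaned.select_dtypes(include="number")
    numeric = numeric.fillna(numeric.mean())

    # Remove near-constant attributes whose variance is below a small threshold.
    selector = VarianceThreshold(threshold=0.01)
    selector.fit(numeric)
    kept_columns = numeric.columns[selector.get_support()]
    reduced = numeric[kept_columns]
    print(f"{raw.shape[1]} -> {reduced.shape[1]} columns")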

6.3 Confidentiality Concerns

The following measures were taken in the thesis to protect the privacy of the company's data and prevent any possible information leakage.


• An agreement was signed between the two parties (student and company) to protect the data from unauthorized access

• All intermediary steps that contain data were uploaded to the company web portal which was an internal portal

• Column names indicating the properties of the data were removed; moreover, any possible disclosure that puts privacy at risk was avoided

• Some indices of the data were transformed to arbitrary values to make any inference of the real occurrence time more difficult

• The student prepared two distinct presentations: one designed for academia and another for the company.

6.4 Threats to Research Validity

Considering validity as the trustworthiness of the outcome of the research, and as the scope within which the results can be regarded as true, three types of validity are defined as follows [33].

• Construct Validity: This validity concerns whether the measures studied and the thesis assumptions are interpreted identically by all parties. In this respect, the student and the company might interpret an anomaly differently. To mitigate this issue, both parties agreed to consider anomalies as point anomalies.

• Internal Validity: This validity considers whether there is sufficient evidence to support the treatment. The main concern is that the thesis deals with univariate analysis of time series, whereas taking exogenous variables into consideration might increase the accuracy of the results.

• External Validity: This validity focuses on the generalizability of the outcomes. Since the data selected for the experiment was chosen by the company according to its regulations, there is a threat that the results should only be transferred to other areas of HVDC systems with further consideration.


7. Conclusions

The thesis aimed to introduce machine learning approaches to HVDC anomaly detection. Firstly, the mathematical foundations of the proposed methods were given, and in some cases the intuition behind the benefit of deploying a machine learning method was examined. Secondly, an experimental framework for data analysis in HVDC was proposed. In this scheme, the input data set provided by HVDC sensors is fed into the preprocessing step, where the data is prepared for the other modules. The dimensionality reduction module in the framework introduced different criteria for the HVDC expert to lower the dimensionality of the data when dealing with big data; PCA, MDS, Isomap, and t-SNE were amongst the models that could be used. The next module, time series forecasting, presented an approach to compare the accuracy of traditional time series forecasting with the ensemble learning of Random Forest, a machine learning approach. Upon completion of parameter tuning for both models, the forecasts were computed on the test data, enabling the framework to evaluate each forecast. Afterward, the best model can be used for future prediction, which enables the HVDC expert to identify any deviation from the predicted output as an anomaly. The performance evaluation shows that ensemble learning with bagging outperforms the traditional method for prediction. Finally, the Isolation Forest model was used to predict anomalies on the test data set, and its performance evaluation was reported.


8. Future Works

This thesis applied anomaly detection methods offline to data sets that had been logged in HVDC stations in the past. A valuable direction for continuing this work is to apply similar methods online, supporting HVDC experts in taking decisions in real time. It should be noted that treating data online will introduce new constraints on the current algorithms. A demo of possible tools and the deployment of an online anomaly detection technique was reported to the company. Moreover, the current time series forecasting module can be used for multivariate time series prediction: two or more time series that are deemed relevant and initially selected by HVDC experts could be used for predicting future trends of target variables.


References

[1] J. Sun, M. Li, Z. Zhang, T. Xu, J. He, H. Wang, and G. Li, “Renewable energy transmission by hvdc across the continent: system challenges and opportunities,” CSEE Journal of Power and Energy Systems, vol. 3, no. 4, pp. 353–364, Dec 2017.

[2] D. Gielen, F. Boshell, D. Saygin, M. D. Bazilian, N. Wagner, and R. Gorini, “The role of renewable energy in the global energy transformation,” Energy Strategy Reviews, vol. 24, pp. 38–50, 2019. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S2211467X19300082

[3] H. Zhang and S. Zhang, “A new strategy of hvdc operation for maximizing renewable energy accommodation,” in 2017 IEEE Power Energy Society General Meeting, July 2017, pp. 1–6.

[4] M. Dong, “Combining unsupervised and supervised learning for asset class failure prediction in power systems,” IEEE Transactions on Power Systems, vol. 34, no. 6, pp. 5033–5043, Nov 2019. [Online]. Available: http://dx.doi.org/10.1109/TPWRS.2019.2920915

[5] M. Chen, S. Lan, and D. Chen, “Machine learning based one-terminal fault areas detection in hvdc transmission system,” in 2018 8th International Conference on Power and Energy Systems (ICPES), Dec 2018, pp. 278–282.

[6] AC Power. John Wiley and Sons, Ltd, 2006, ch. 3, pp. 49–84. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/0470036427.ch3

[7] M. H. Okba, M. H. Saied, M. Z. Mostafa, and T. M. Abdel-Moneim, “High voltage direct current transmission - a review, part i,” in 2012 IEEE Energytech, May 2012, pp. 1–7.

[8] K. R. Padiyar, HVDC power transmission systems, 2nd ed. Kent, [England]: New Academic Science Limited, 2013.

[9] R. Karki, P. Hu, and R. Billinton, “Reliability evaluation considering wind and hydro power coordination (technical report),” IEEE Transactions on Power Systems, vol. 25, no. 2, pp. 685–693, May 2010.

[10] Electric Power Systems. John Wiley and Sons, Ltd, 2006. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/0470036427.ch3

[11] D. Tiku, “dc power transmission: Mercury-arc to thyristor hvdc valves [history],” IEEE Power and Energy Magazine, vol. 12, no. 2, pp. 76–96, March 2014.

[12] M. Barnes, D. Van Hertem, S. P. Teeuwsen, and M. Callavik, “Hvdc systems in smart grids,” Proceedings of the IEEE, vol. 105, no. 11, pp. 2082–2098, Nov 2017.

[13] M. Eremia, C. Liu, and A. Edris, Advanced Solutions in Power Systems: HVDC, FACTS, and Artificial Intelligence. IEEE, 2016, pp. 1–7. [Online]. Available: https://ieeexplore.ieee.org/document/7656871

[14] A. Alassi, S. Bañales, O. Ellabban, G. Adam, and C. MacIver, “Hvdc transmission: Technology review, market trends and future outlook,” Renewable and Sustainable Energy Reviews, vol. 112, pp. 530–554, 2019. [Online]. Available:
