
DEGREE PROJECT IN ELECTRICAL ENGINEERING,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2016

Using Unsupervised Machine Learning for Outlier Detection in Data to Improve Wind Power Production Prediction

LUDVIG ÅKERBERG


Using Unsupervised Machine Learning for Outlier Detection in Data to Improve Wind Power Production Prediction

Master’s Degree Project in Computer Science

Degree Program:

Master of Science in Systems Control and Robotics

Author:

Ludvig Åkerberg
ludake@kth.se

Supervisors:

Pawel Herman, KTH

Mattias Jonsson, Expektra

Examiner:

Anders Lansner


This Master’s Thesis was carried out at the company Expektra and the School of Computer Science and Communication at KTH.

I would like to thank my supervisor at Expektra, Mattias Jonsson, for the many intriguing discussions about machine learning, which were crucial for the success of this thesis.

I would also like to thank my supervisor at KTH, Pawel Herman, for his professional and encouraging way of supervising this thesis, which has taught me a lot about scientific research.


Abstract

The expansion of wind power for electrical energy production has increased in recent years and shows no signs of slowing down. This unpredictable source of energy has contributed to destabilization of the electrical grid causing the energy market prices to vary significantly on a daily basis. For energy producers and consumers to make good investments, methods have been developed to make predictions of wind power production.

These methods are often based on machine learning, where historical weather prognoses and wind power production data are used. However, the data often contain outliers, causing the machine learning methods to create inaccurate predictions.

The goal of this Master's Thesis was to identify and remove these outliers from the data so that the accuracy of machine learning predictions can improve. To do this, an outlier detection method using unsupervised clustering has been developed, and research has been conducted on the use of machine learning for outlier identification and wind power production prediction.


Sammanfattning

Användning av oövervakad maskininlärning för outlier-identifikation i data för att förbättra prediktioner av vindkraftsproduktion

Vindkraftsproduktion som källa för hållbar elektrisk energi har på senare år ökat och visar inga tecken på att sakta in. Den här oförutsägbara källan till energi har bidragit till att destabilisera elnätet vilket orsakat dagliga kraftiga svängningar i priser på elmarknaden. För att elproducenter och konsumenter ska kunna göra bra investeringar har metoder för att prediktera vindkraftsproduktionen utvecklats.

Dessa metoder är ofta baserade på maskininlärning där historiska data från väderleksprognoser och vindkraftsproduktion använts. Denna data kan innehålla så kallade outliers, vilket resulterar i försämrade prediktioner från maskininlärningsmetoderna.

Målet med det här examensarbetet var att identifiera och ta bort outliers från data så att prediktionerna från dessa metoder kan förbättras. För att göra det har en metod för outlier-identifikation utvecklats baserad på oövervakad maskininlärning och forskning har genomförts på områdena inom maskininlärning för att identifiera outliers samt prediktion för vindkraftsproduktion.


Contents

1 Introduction
  1.1 Problem formulation
  1.2 Outline

2 Background
  2.1 Related Work
    2.1.1 Wind power prediction and feature selection
    2.1.2 Outlier identification in general
    2.1.3 Outlier identification for wind power production forecasting
  2.2 Theoretical background
    2.2.1 The Artificial Neural Network
    2.2.2 K-means

3 Method
  3.1 Feature selection
  3.2 Clustering weather data
  3.3 Removing outliers
  3.4 Parameter selection
    3.4.1 The number of clusters, K
    3.4.2 Outlier classification stopping criterion
  3.5 Evaluation
    3.5.1 Experimental data
    3.5.2 Test environment
  3.6 Statistical hypothesis testing

4 Results
  4.1 Feature selection
  4.2 Parameter sensitivity
    4.2.1 Number of hidden nodes
    4.2.2 Manual parameter selection
  4.3 Evaluation results for the outlier detector for manually selected parameters
  4.4 Testing the method for automatically selecting the number of clusters, K
  4.5 Testing the outlier stop criterion

5 Discussion & Conclusion
  5.1 Machine learning methodology
    5.1.1 Parameters
    5.1.2 Test environment
  5.2 Comparisons to previous research
  5.3 Ethical review
  5.4 Future work


Chapter 1

Introduction

Over the years there has been an increasing awareness of the negative effect mankind has on the environment. This has caused a growing interest in renewable and more environmentally friendly sources of energy. In the last decades, wind power has emerged as one of the strongest and fastest growing candidates as a source of renewable energy [1]. Between 2014 and 2015 alone, the total capacity of installed wind power in the world increased from 370 GW to 443 GW [2]. The Paris Agreement [3], negotiated by 195 countries in December 2015 and signed by 178 countries in April 2016, whose main goal is to limit the temperature increase caused by global warming to well below 2 degrees, indicates that there will be no slowing down in the increase of wind power usage worldwide.

However, wind power is dependent on weather and it is therefore difficult to regulate the power output. This destabilizes the electrical grid and other sources of energy have to compensate for the energy drop when there is no wind. One solution to this is to regulate the prices such that energy prices increase when energy production is low and vice versa [4]. This encourages consumers, both companies and households, to consume energy when the wind power output is high making the energy price low and save energy when the wind power output is low making the prices high.

A useful tool to make this work is to provide reliable forecasts such that producers and consumers can make predictions of the wind power production and thereby plan ahead. Methods for this already exist and are widely used. These methods commonly rely on statistical models, and in recent years machine learning approaches such as Support Vector Machines (SVM) and Artificial Neural Networks (ANN) have received increased attention [5]. These methods utilise historical weather and wind power data to infer wind power data in the near future. Weather prognoses are then used for predicting the power output. For the prognosis to be accurate, it is important that the historical data are reliable. The data can contain outliers which misrepresent the relationship between weather and wind power production. These outliers can occur, for instance, when a wind power plant is shut down due to maintenance while wind speed is high, or when the energy production is capped due to low energy prices. If these outliers are identified and removed, the predictions made with data-driven machine learning methods can be more accurate. With better predictions, better investments due to better planning can be made by power consumers and producers.


Figure 1.1: The figure shows wind speed vs power output where the values are extracted from the provided data. Data points with suspicious behaviour are encircled and have the potential of being considered as outliers.

Outliers are common in many real-world problems and are often difficult to describe precisely. It often varies depending on the problem and method at hand. Hawkins [6] provided a general definition and describes an outlier as ”an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism”.

In wind power production prediction data, the characteristics of outliers can be identified by unexpected behaviour. The most obvious one is when the production output is low and the wind speed high, and vice versa, but there are also occasions where the wind power production is capped at a certain value. This can occur, for instance, when the energy price is so low that there is no point in producing more energy. Examples of such outliers are shown in Fig. 1.1, which displays a power curve, i.e. wind power as a function of wind speed, for data from a certain geographical location in Sweden.

An outlier measure can be seen as how much a data point deviates from the power curve. However, if more features than wind speed are to be considered, there might be a relationship with these features which explains why a data point deviates from the power curve. This is why non-linear supervised machine learning methods such as the ANN can be used, since they are able to identify these complex relationships.

The issue of outlier identification in feature data has been raised in several recent scientific articles. Ghoting et al. [7] developed a method for outlier detection using unsupervised machine learning, and Kusiak et al. [8] used K-nearest neighbour search for outlier detection within data used for the specific case of wind power prediction.

The outcome of this thesis is an outlier detector able to identify and remove outliers. The outlier detector was able to improve the accuracy of predictions of wind power production made by an ANN multilayer perceptron (MLP).

1.1 Problem formulation

The research problem behind this Master's Thesis is whether unsupervised machine learning methods can be used for outlier identification and whether this can improve the prediction quality of an MLP.

The hypothesis is that if the data can be pre-processed so that data points which exhibit unexplainable behaviour are identified as outliers and removed, this could result in more accurate predictions from the MLP.

This Master's Thesis was provided by the School of Computer Science and Communication (CSC) at KTH and the company Expektra. Expektra develops IT tools for energy market analysis. One of their products is a prediction tool for day-ahead wind power production prediction. The predictions are made using an MLP and data provided by Expektra. The data consists of weather forecasts provided by weather prediction suppliers as well as the measured wind power production outputs at certain geographical locations in Sweden.

1.2 Outline

In Chapter 2 the background of the thesis project is described. The chapter starts by presenting related work within the scientific field in Section 2.1, containing work performed on feature selection, wind power prediction and outlier detection. In Section 2.2.1 and Section 2.2.2 the theories behind the ANN and K-means are described. In Chapter 3 the method is outlined, starting with the approach for feature selection in Section 3.1. It is followed by Section 3.2, describing how K-means was used for clustering the data, and Section 3.3, describing how the clusters are processed to detect outliers. In Section 3.4 two methods are presented for selecting the parameters: the number of clusters, K, and the amount of data to be removed, p. Section 3.5.2 describes the test environment used to evaluate the performance of the outlier detector. The results are presented in Chapter 4, followed by discussion and conclusion in Chapter 5.


Chapter 2

Background

2.1 Related Work

2.1.1 Wind power prediction and feature selection

Vladislavleva et al. [9] used symbolic regression to make predictions of wind power production, as well as a feature selection method to identify which features had the strongest connection to the targets. Selecting which features to use is an important step, for which Amjady et al. [10, 11] used Mutual Information (MI) when predicting both day-ahead energy market prices and short-term wind power production. When developing a feature selection method, inspiration was taken from Kemp et al. [12] and Verikas et al. [13], who both use the accuracy of ANN models when selecting features. The main idea behind the two articles is to reduce the number of features to a subset of the original data. This is often useful since it reduces the dimensionality of the problem, resulting in drastically decreased computation times.

2.1.2 Outlier identification in general

Outlier identification and removal is a broad field within machine learning. The characteristics of an outlier differ depending on the case being observed. Gupta et al. [14] conducted a literature study in 2014 on previous work within outlier detection in temporal data. Ben-Gal [15] suggested that outlier detection can be performed by clustering data and removing certain clusters. In this thesis, instead of labelling whole clusters as outliers, data points within the clusters are analysed and labelled as outliers depending on their deviation from the rest of the data points within the cluster.

Ghoting et al. [7] developed a method for outlier detection using unsupervised machine learning by dividing the data feature space into clusters and using distance measures within each cluster to identify outliers. This has inspired the outlier detector method used in this Master's Thesis, but instead of looking at distances in feature space, distances in the target series within each cluster have been used.


2.1.3 Outlier identification for wind power production forecasting

Since some prior knowledge about the relationship between the features and the wind power production exists, this information can be used to choose and modify suitable outlier detection methods. The most obvious relation is between wind speed and power output, which can be visualized as a power curve. Wan et al. [16] model wind power curves based on historical wind power and wind speed data for future predictions. Liu et al. [17] trained a Probabilistic Neural Network (PNN) to classify data based on the power curve. It was fed with data where the power output was low and the wind speed high, and vice versa, and thereby learned to classify this type of data as outliers.

Kusiak et al. [8] used principal component analysis (PCA) and K-nearest neighbour search (KNN) to identify outliers in data used for wind power prediction. The method focused on fitting a power curve onto the data and then filtering out the data points whose power output differs from the fitted power curve. A recurring observation from these articles is that they assume that data points that do not fit a power curve can be considered outliers. The same assumption is made in this thesis, since it is well known that wind speed is a dominant feature in relation to the power output. Another point often made in the articles is the lack of data. This is where this Master's Thesis can contribute, since there is plenty of data available from many different locations, meaning the results are strengthened by tests made with the same method on data from many different sources and with different shapes.

2.2 Theoretical background

2.2.1 The Artificial Neural Network

The theory behind the ANN is inspired by how the brain uses neurons in complex networks for learning. It consists of a network of nodes organized in layers, where each layer consists of a number of nodes. Each node represents a neuron and has several weights attached to it. Each weight, ω_ij, is a connection between the current node, j, and a node, i, located in the previous layer. The features are fed into the network at the input layer, where each feature is represented as one input node. The targets, which the network is learning to reproduce based on the inputs, are represented as the output layer, with one node for each target feature.

The MLP is an ANN used for modelling non-linear functions [18]. It has an input and an output layer as well as a number of hidden layers in between. The MLP algorithm initializes by setting all the weights to small random numbers. The network is then trained using backpropagation [18]. The training has two phases: the forward phase and the backward phase. In the forward phase the input features are fed through the network, starting with the hidden layer. There the activation, a_j, for each neuron j within the layer is calculated from the inputs, x_i, and the weights of the hidden layer, ν_ij, as:

    h_j = \sum_i x_i \nu_{ij},    (2.1)

    a_j = g(h_j) = \frac{1}{1 + e^{-\beta h_j}}.    (2.2)

(16)

Figure 2.1: An MLP model with three input nodes, one output node and one hidden layer containing three hidden nodes. An extra input with a constant value of -1 is added to each node in the input and hidden layers; these are used to handle the bias.

g is the activation function, which determines whether the neuron fires or not, and has the form of a sigmoid function. The activations are then passed along to the output layer, where the final activations are computed:

    h_k = \sum_j a_j w_{jk},    (2.3)

    y_k = f(h_k) = h_k.    (2.4)

Here another activation function, f , is used. This is a linear function which is used when a continuous output is desired. When dealing with a classification problem, a sigmoid function, f = g, is preferred [18].

In the backward phase, the errors, \delta_k^{out}, on the outputs, y_k, are calculated using the given targets, t_k:

    \delta_k^{out} = t_k - y_k.    (2.5)

Another error formula is used for classification problems:

    \delta_k^{out} = (t_k - y_k)\, y_k (1 - y_k).    (2.6)

The error is fed backwards through the network to calculate the errors in the hidden layer, \delta_j^{hidden}, and adjust the weights:

    \delta_j^{hidden} = a_j (1 - a_j) \sum_k \omega_{jk} \delta_k^{out},    (2.7)

    \omega_{jk} \leftarrow \omega_{jk} + \eta \delta_k^{out} a_j,    (2.8)

    \nu_{ij} \leftarrow \nu_{ij} + \eta \delta_j^{hidden} x_i.    (2.9)


The training procedure is repeated until a certain stop criterion is met. More on this in Section 3.5.2.

When the training is complete, the network can be used as a model to predict outputs using the forward phase.
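To make the two phases concrete, below is a minimal NumPy sketch of one backpropagation step implementing Eqs. (2.1)-(2.9) for a single hidden layer with a linear output. It is a simplified illustration, not Expektra's implementation: the bias inputs of Fig. 2.1 are omitted, and the array shapes of `X`, `T` and the weight matrices `V`, `W` are assumptions chosen for the example.

```python
import numpy as np

def train_step(X, T, V, W, eta=0.1, beta=1.0):
    """One backpropagation step for a single-hidden-layer MLP with sigmoid
    hidden units and a linear output. X: (n, inputs), T: (n, outputs),
    V: hidden weights (inputs, hidden), W: output weights (hidden, outputs)."""
    # Forward phase: hidden activations, Eqs. (2.1)-(2.2)
    A = 1.0 / (1.0 + np.exp(-beta * (X @ V)))
    # Forward phase: linear output, Eqs. (2.3)-(2.4)
    Y = A @ W
    # Backward phase: output and hidden errors, Eqs. (2.5) and (2.7)
    d_out = T - Y
    d_hid = A * (1.0 - A) * (d_out @ W.T)
    # Weight updates, Eqs. (2.8)-(2.9)
    W += eta * A.T @ d_out
    V += eta * X.T @ d_hid
    return V, W
```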

2.2.2 K-means

The K-means clustering algorithm described by J. MacQueen [19] is one of the first and most general feature-based clustering methods. The algorithm consists of three steps: an initialization, an assignment and an update step. It initializes by dividing the feature data into K clusters, where each cluster has an, often randomly, assigned cluster center value. In the assignment step, each data point is assigned to the cluster whose center is closest. When all data points have been assigned to a cluster, the algorithm updates the cluster centers by calculating the mean of the data points assigned to each cluster. The algorithm then repeats from the assignment step until the cluster centers have converged and no longer change position in the update step.

The K-means procedure can be described as follows (a minimal code sketch is given after the list):

• Initialize:
  – Initialize each cluster center by giving it the same value as a randomly selected data point.
• Assignment step:
  – Calculate distances from each data point to each cluster center.
  – Assign the data points to the clusters with the closest cluster center.
• Update step:
  – Calculate the mean value of all data points within each cluster and set these as the new cluster centers.
• If any cluster center changed value, repeat from the assignment step.
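The sketch below is one straightforward reading of the procedure in Python with NumPy; the `init` parameter is an assumption added here so that later sketches can warm-start the algorithm from given centers.

```python
import numpy as np

def kmeans(X, K, rng=np.random.default_rng(0), init=None, max_iter=100):
    """Plain K-means: alternate assignment and update steps until the
    cluster centers converge. X has one row per data point."""
    centers = init if init is not None else X[rng.choice(len(X), K, replace=False)]
    for _ in range(max_iter):
        # Assignment step: index of the closest center for each data point
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points
        new_centers = np.array([X[labels == k].mean(axis=0)
                                if np.any(labels == k) else centers[k]
                                for k in range(K)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```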

Davies Bouldin Index

A common problem for the K-means method is to decide the number of clusters, K, to use. A useful method for solving this issue is the Davies Bouldin Index (DBI) [20]. This index is a function describing the compactness and separation of clusters [21]. It has two measures: the scatter within the ith cluster, S_i, and the distance, d_ij, between the cluster centers, µ_i and µ_j, of two clusters C_i and C_j. The two measures can be calculated as:

    S_i = \frac{1}{N_i} \sum_{x \in C_i} \| x - \mu_i \|,    (2.10)

    d_{ij} = \| \mu_i - \mu_j \|.    (2.11)

(18)

N_i is the number of data points, x, within C_i. The DBI can then be described as:

    DBI = \frac{1}{K} \sum_{i=1}^{K} \max_{j, j \neq i} \left( \frac{S_i + S_j}{d_{ij}} \right).    (2.12)

To find a suitable number of clusters, this method can be used by finding the K which minimizes the DBI, indicating that the clusters are compact and well separated.
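A direct translation of Eqs. (2.10)-(2.12) into Python might look as follows; it reuses the labels and centers produced by the `kmeans()` sketch above.

```python
import numpy as np

def davies_bouldin(X, labels, centers):
    """Davies Bouldin Index, Eqs. (2.10)-(2.12). Lower values indicate
    compact, well-separated clusters."""
    K = len(centers)
    # S_i: mean distance from the points of cluster i to its center
    S = np.array([np.linalg.norm(X[labels == i] - centers[i], axis=1).mean()
                  for i in range(K)])
    total = 0.0
    for i in range(K):
        # Worst-case ratio of joint scatter to center distance, Eq. (2.12)
        total += max((S[i] + S[j]) / np.linalg.norm(centers[i] - centers[j])
                     for j in range(K) if j != i)
    return total / K
```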


Chapter 3

Method

3.1 Feature selection

Generally, when working with clustering it is often essential to make a decision about which data to use for training the algorithms, especially if there is plenty of data to choose from. Theoretically, neural networks and clustering methods can handle many features, but in practice this is not ideal due to the curse of dimensionality: with every added feature, the computational complexity grows exponentially, which can result in very long computation times [22]. The demand for more data also grows with higher dimensionality due to the risk of poor generalization. In this section a method for feature selection is presented, the feature ranker. The method ranks the different features based on their relevance for the problem. When the ranking is complete, the best performing features can be selected for further use.

The feature ranker aims at ranking the features so that the most relevant features for the problem can be identified. The proposed algorithm starts by using all the features and then iteratively removes the least important feature until only one feature remains. First a test set is separated from the data. Then a number of models are trained with an MLP, where in each model one of the features is replaced with white noise. The models are trained as explained further down in Section 3.5.2. The model which performs best on the test set decides which feature is to be eliminated. This feature is the one which affected the performance the least when replaced with noise, meaning it had little relation to the target series. The worst feature is removed and the procedure is repeated until only one feature remains, which will be ranked as the best performing one. To summarize, the procedure can be explained as follows (a code sketch is given after the list):

1. Randomly pick a subset of test data from the data set.

2. Train an MLP model for each case when a feature is replaced with white noise and test the model on the test set.

3. For the model that performs best on the test set, remove the feature replaced with white noise.

4. Repeat from step 2 until only one feature remains.

5. The features are now ranked according to the order they were removed, with the first one removed being the worst.
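A sketch of this backward ranking loop is shown below. The helper `train_and_score`, which trains an MLP and returns its test MSE, is a hypothetical stand-in for the training procedure of Section 3.5.2, and the 25% test split is an assumption made for the example.

```python
import numpy as np

def rank_features(X, y, train_and_score, rng=np.random.default_rng(0)):
    """Rank features by noise substitution: the feature whose replacement
    with white noise hurts the test error the least is removed first.
    train_and_score(X_tr, y_tr, X_te, y_te) -> test MSE (hypothetical helper)."""
    n_points, n_features = X.shape
    test = rng.choice(n_points, size=n_points // 4, replace=False)
    train = np.setdiff1d(np.arange(n_points), test)
    remaining = list(range(n_features))
    ranking = []  # filled worst feature first
    while len(remaining) > 1:
        scores = []
        for f in remaining:
            Xn = X[:, remaining].copy()
            # Replace feature f with white noise and retrain the model
            Xn[:, remaining.index(f)] = rng.standard_normal(n_points)
            scores.append(train_and_score(Xn[train], y[train], Xn[test], y[test]))
        worst = remaining[int(np.argmin(scores))]  # best model -> least useful feature
        ranking.append(worst)
        remaining.remove(worst)
    ranking.append(remaining[0])
    return ranking[::-1]  # best feature first
```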

3.2 Clustering weather data

The K-means algorithm is commonly used to automatically divide data into K clusters [23]. The algorithm is here used to divide the data points in the feature space into clusters. The features used are wind speed, humidity, pressure, precipitation and temperature. The resulting clusters can be plotted using the target series as a function of the different features. The result can be a scatter plot which at first glance does not make much sense and shows no signs of trends between the features and the target series. The problem was to determine how to shape the clusters and how many clusters to use.

The approach was to cluster the input data based on the features selected by the previously mentioned feature ranker. The number of clusters was decided in two ways: either by applying the DBI [20] or by simply picking a suitable number, which makes the algorithm much faster since it does not have to re-cluster every time a new cluster is added. The approach for using the DBI for deciding the number of clusters is described in Section 3.4.

When the number of clusters, K, is chosen, K-means clusters the data based on the selected features. Since the clustering is performed on weather data, each cluster can be seen as a representation of a certain weather condition. For instance, one cluster can represent high wind speed, low temperature and normal pressure (if the clustering was performed on these features, that is).
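As an illustration of this step, the short sketch below clusters the selected weather features with the `kmeans()` sketch from Section 2.2.2 and groups the production targets per weather cluster; the placeholder arrays and the choice K = 60 are assumptions made for the example.

```python
import numpy as np

# Placeholder data: one row of selected weather features (e.g. wind speed,
# pressure, temperature) and one production target per hourly observation.
features = np.random.rand(7000, 3)
production = np.random.rand(7000)

K = 60
labels, centers = kmeans(features, K)  # kmeans() sketch from Section 2.2.2
# Group the wind power targets by weather cluster for the outlier detector
cluster_targets = [production[labels == k] for k in range(K)]
```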

When the clustering is complete, the clusters are analysed to find and remove outliers. This is explained in the following section.

3.3 Removing outliers

The proposed method for identifying and removing outliers, the outlier detector, uses the information each cluster provides about the targets of the data points within the cluster. When identifying the outliers, the clusters need to be connected to the wind power production target series. Each data point within each cluster contains a target value, which is the wind power production, while the clusters are formed according to the given features such as wind speed, temperature etc. To see how each cluster relates to the targets, the mean µ_i and the variance σ_i of the targets within each cluster are calculated, where i marks the cluster index.

The assumption made about the characteristics of an outlier is that the more a data point's target value deviates from those of the rest of the data points within the same cluster, the more likely it is that the data point is an outlier. So the distance d_ij between a data point's target value, t_ij, and the mean of the whole cluster's targets, µ_i, can be seen as an outlier measure:

    \mu_i = \frac{1}{N} \sum_{j=1}^{N} t_{ij},    (3.1)

    d_{ij} = \| t_{ij} - \mu_i \|.    (3.2)

(21)

Figure 3.1: Two clusters in a graph of wind speed vs power output. In the left graph it is easy to identify data points which deviate from the rest on the power output axis. In the right graph, however, it is more difficult to identify the outliers.

However, each cluster can have a different tolerance when it comes to outliers. For instance, in this case when predicting wind power production, when the wind speed is below 3 m/s the power output is likely to be close to zero, but when the wind speed is between 4-7 m/s the power output variance is significantly higher, which makes it difficult to predict future power outputs, see Fig. 3.1. Therefore some clusters require a greater tolerance on d_ij, which makes it a less than ideal choice for an outlier measure.

To handle the mentioned problem with different tolerances on d_ij, the proposed method operates on the variance rather than the mean. The method considers one data point at a time as a candidate outlier. The candidate is removed from its cluster and the variance of the targets within the cluster is recalculated. The variance drop δ_ij between the variances before and after the candidate was removed is calculated. When every δ_ij has been calculated within a cluster i, the data point with the worst outlier measure is singled out:

    \hat{\delta}_i = \frac{\max_j(\delta_{ij})}{\sigma_i},    (3.3)

    ID_i = \arg\max_j(\delta_{ij}).    (3.4)

The reason for dividing by σ_i, which is the variance of all the data points' target values within cluster i, is to calculate the relative drop the data point causes when removed. This solves the previously mentioned issue that each cluster needs a different tolerance when calculating the outlier measure. When the worst data point in every cluster has been singled out, the one with the biggest variance drop relative to its cluster's own target variance is removed. A new worst data point is then calculated for the cluster of the removed data point and the procedure is repeated. This continues until a selected stopping criterion, which needs to be determined, is fulfilled. For example, one stop criterion can be to stop when a certain percentage of the whole data set has been labelled as outliers. In Section 3.4 another method is described for determining the stop criterion. The outlier detector can be outlined as in Algorithm 1:

(22)

Data: List of clusters
Result: Outliers

1  Initialization;
2  for each cluster with index i do
3      σ_i = variance(targets within cluster i);
4      for each data point with index j do
5          σ_temp = variance(targets within cluster i without the current data point's, j, target value);
6          δ_ij = σ_i − σ_temp;
7      end
8      δ̂_i = max_j(δ_ij) / σ_i;
9      ID_i = argmax_j(δ_ij);
10 end
11 Removing outliers;
12 while criterion not fulfilled do
13     c = argmax_i(δ̂_i);
14     mark data point with index ID_c in cluster with index c as outlier;
15     remove this data point from cluster c;
16     σ_c = variance(cluster targets);
17     for each data point with index j in cluster with index c do
18         σ_temp = variance(cluster targets without current data point);
19         δ_cj = σ_c − σ_temp;
20     end
21     δ̂_c = max_j(δ_cj) / σ_c;
22     ID_c = argmax_j(δ_cj);
23 end

Algorithm 1: The outlier detector
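Below is a compact Python sketch of Algorithm 1. It is a simplified reading, not the thesis implementation: the stopping criterion is replaced with a plain fraction p of points to remove (the criterion of Section 3.4.2 could be plugged in instead), and clusters are represented simply as lists of target values.

```python
import numpy as np

def detect_outliers(cluster_targets, p=0.15):
    """Variance-drop outlier detector (Algorithm 1). cluster_targets is a
    list of 1-D arrays with the wind power targets of each cluster; points
    are removed until a fraction p of all points is labelled as outliers."""
    targets = [list(t) for t in cluster_targets]
    n_total = sum(len(t) for t in targets)
    outliers = []

    def worst(t):
        # Relative variance drop of the worst candidate in one cluster
        t = np.asarray(t)
        if len(t) < 3 or t.var() == 0:
            return 0.0, 0
        drops = [t.var() - np.var(np.delete(t, j)) for j in range(len(t))]
        j = int(np.argmax(drops))
        return drops[j] / t.var(), j

    scores = [worst(t) for t in targets]
    while len(outliers) < p * n_total:
        c = int(np.argmax([s for s, _ in scores]))  # cluster with the worst point
        j = scores[c][1]
        outliers.append((c, targets[c].pop(j)))     # label and remove the outlier
        scores[c] = worst(targets[c])               # re-score only that cluster
    return outliers
```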

3.4 Parameter selection

The developed outlier detector has two parameters that should be optimized. The first one is the number of clusters, K, to be used, and the second one, p, determines how much data is to be identified as outliers and removed from the dataset. Two approaches have been considered for setting these parameters automatically, depending on the data at hand.

3.4.1 The number of clusters, K

When deciding the number of clusters to be used it is useful to have a quality measure which tells how well the clustering has performed. The DBI described in Section 2.2.2 is used for this purpose. It gives a measure of separation between clusters as well as the compactness within.

This approach is implemented by modifying the K-means algorithm used. Instead of running it once, the algorithm starts with a set number of clusters and divides the data with K-means. Then the quality of the clusters is calculated using the DBI. The DBI value is saved and the variances of the clusters are calculated. The cluster with the highest variance is split into two by replacing its mean, x̄, with two new ones:

    x_1 = \bar{x} + \alpha, \qquad x_2 = \bar{x} - \alpha,    (3.5)

where α is a small-valued vector with the same dimensions as the means. The reason for doing this is that K-means converges quicker than when the algorithm starts all over by randomly setting new means [21]. The data is then re-clustered with the K-means algorithm and this procedure is repeated until the number of clusters reaches a pre-set maximum value. The K which resulted in the lowest DBI value is picked as the number of clusters to be used, and the data is again clustered with K-means using this value of K. The approach can be summarized as follows (a code sketch is given after the list):

1. Select a minimum value for K.
2. Run K-means.
3. Calculate and save the DBI value for the resulting clusters.
4. Calculate the cluster variances.
5. Split the cluster with the highest variance into two new ones.
6. If the maximum value of K is not reached, go to step 2.
7. Pick the K which resulted in the lowest DBI value and run K-means.
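A sketch of this loop, reusing the `kmeans()` and `davies_bouldin()` sketches from Section 2.2.2, might look as follows; the default range of K and the magnitude of α are assumptions made for the example. Step 7 of the summary then amounts to one final `kmeans(X, best_k)` call.

```python
import numpy as np

def select_k(X, k_min=2, k_max=100, alpha=1e-3, rng=np.random.default_rng(0)):
    """Grow the clustering one cluster at a time by splitting the
    highest-variance cluster, and return the K with the lowest DBI."""
    labels, centers = kmeans(X, k_min, rng)
    best_k, best_dbi = k_min, davies_bouldin(X, labels, centers)
    while len(centers) < k_max:
        # Split the cluster whose points have the largest variance
        variances = [X[labels == k].var() if np.any(labels == k) else 0.0
                     for k in range(len(centers))]
        s = int(np.argmax(variances))
        offset = alpha * np.ones(X.shape[1])
        centers = np.vstack([np.delete(centers, s, axis=0),
                             centers[s] + offset, centers[s] - offset])
        # Warm-start K-means from the split centers and score with the DBI
        labels, centers = kmeans(X, len(centers), init=centers)
        dbi = davies_bouldin(X, labels, centers)
        if dbi < best_dbi:
            best_k, best_dbi = len(centers), dbi
    return best_k
```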

3.4.2 Outlier classification stopping criterion

When classifying outliers it is important to distinguish what is and what is not an outlier. In the developed outlier detector this decision can be made by introducing a stopping criterion while the algorithm identifies outliers, more specifically in the while condition on Line 12 of Algorithm 1.

In the algorithm, the outlier measure is the relative variance drop a data point causes within a cluster when it is removed, see Eq. (3.3) in Section 3.3. For each iteration a data point is removed and a new worst-case variance drop is calculated. While testing, it was found that the worst-case variance drop seems to converge after a number of iterations. Therefore, the outlier classification stop criterion was set to the point where the worst-case variance drop has converged within a certain tolerance, ε.

The stop criterion is met when the average deviation of the variance drops, δ_i, of the 10 most recent outliers removed from their mean, δ̄, is below the value of ε:

    \frac{1}{10} \sum_{i=N-9}^{N} \frac{\left| \bar{\delta} - \delta_i \right|}{\bar{\delta}} \leq \epsilon,    (3.6)

where N is the index of the most recent outlier removed. The division by 10 and δ̄ normalizes the measure.
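A small sketch of this convergence test is given below. Since the extraction of Eq. (3.6) leaves the exact normalization ambiguous, the sketch takes δ̄ as the mean of the same 10-drop window and uses the absolute deviation, which is one reasonable reading (a signed average over the window's own mean would vanish identically).

```python
import numpy as np

def stop_criterion(drops, eps=0.0014):
    """Convergence test for the outlier loop, Eq. (3.6): stop when the
    normalized average deviation of the 10 most recent variance drops
    from their mean falls below the tolerance eps."""
    if len(drops) < 10:
        return False
    recent = np.asarray(drops[-10:])
    mean = recent.mean()
    return bool(np.mean(np.abs(mean - recent) / mean) <= eps)
```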


3.5 Evaluation

The purpose of this thesis was to find out if clustering can be used for outlier identification to improve the performance of an MLP ANN. In order to find out if this is possible, a test environment was developed. This test needed to be robust enough to determine whether the developed outlier detector performed well both in general cases and in more specific ones, such as when the number of data samples is small or large, or when the data is very noisy. The test method used is outlined below.

3.5.1 Experimental data

The data used for testing were measurements from six different geographical locations, each containing around 6000-7000 data points. The data consisted of hourly 24-hour-ahead prognoses of wind speed, humidity, pressure, precipitation and temperature, as well as the measured wind power production for the time each prognosis was made. The targets are the measured wind power production and the features are the weather prognosis data.

The first thing to be tested was which features were to be used. This was performed by running the feature ranker described in Section 3.1 on five features, with five hidden nodes for the MLP, and letting it single out the three most important ones. The input features were wind speed, humidity, pressure, precipitation and temperature, and the three features ranked the highest by the feature ranker were picked for further testing.

3.5.2 Test environment

The developed test method uses the percentage difference in MSE of an MLP prediction as performance measure, i.e. the difference between before and after the outlier detector has been used on a data set:

    MSE = \frac{1}{N} \sum_{i=1}^{N} (\text{targets}_i - \text{prediction}_i)^2,    (3.7)

    \text{error (\%)} = \frac{MSE_{\text{before}} - MSE_{\text{after}}}{MSE_{\text{before}}} \times 100\%.    (3.8)

The test method splits the data into a set number of segments. Each segment will in turn be used as the test set of the MLP. It should be noted that these test sets will not be processed by the outlier detector, so they will always contain outliers. For each iteration, one segment is removed and the rest is used for training. The outlier detector described in Section 3.3 removes the outliers from the remaining data, which is then shuffled and split at a chosen ratio into training data and validation data, the latter used for cross-validation for MLP early stopping [18], see Fig. 3.2.


Figure 3.2: The figure illustrates a classical cross-validation split. The test method splits the data into a number of segments and uses one as test data. The rest is shuffled and split into training and validation. The validation set is used for early stopping for the MLP.

The MLP is trained using the training data and uses early stopping, meaning that it stops when the error on the prediction of the validation data has reached a local minimum. The MSE of the prediction on the whole training and validation set is calculated and stored in a list together with the model:

    model_j = \text{MLP-early-stopping}(\text{training}, \text{validation}),    (3.9)

    error_j = \text{MSE}(model_j(\text{trainingFeatures} + \text{validationFeatures}) - (\text{trainingTargets} + \text{validationTargets})).    (3.10)

The algorithm shuffles the data and picks new training and validation datasets. This procedure is repeated a set number of times, and afterwards the model with the lowest MSE is picked,

    b = \arg\min_j(error_j),    (3.11)

i.e., the best model is model_b. The test data is now used for measuring the performance of the model and the MSE is stored in a list,

    error_b = \text{MSE}(model_b(\text{testFeatures}) - \text{testTargets}).    (3.12)

The second data segment is now picked and used as a test set, and the whole procedure is repeated until all segments have been used as test sets. When the algorithm is finished, a list of performance measures is presented showing the MSE on the various test sets. The test method is presented in Algorithm 2.

When testing the performance of the outlier detector, the algorithm first runs as depicted in Algorithm 2 and then runs without Line 5. The performance can then clearly be seen by checking the difference between the outputs of the two cases.


Data: data
Result: list of MSE, testError

1  split data into a set number of segments i;
2  for each segment i do
3      tempData ← all data except segment i;
4      test ← segment i;
5      run the outlier detector to remove outliers from tempData;
6      for j = 1 to N do
7          randomly divide tempData into train and valid;
8          train an MLP using train for training and valid for early stopping;
9          model_j = model of the trained MLP;
10         error_j = model MSE on both train and valid;
11     end
12     b = argmin_j(error_j);
13     testError_i = MSE with model_b on test;
14 end

Algorithm 2: Test environment.
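The sketch below mirrors Algorithm 2 in Python. The helpers `remove_outliers(X, y)` and `train_mlp(train, valid)` are hypothetical stand-ins for the outlier detector and for MLP training with early stopping (the latter returning a model and its MSE on the training and validation data); passing `remove_outliers=None` reproduces the baseline run without Line 5.

```python
import numpy as np

def evaluate(X, y, n_segments=4, n_restarts=10,
             remove_outliers=None, train_mlp=None,
             rng=np.random.default_rng(0)):
    """Cross-validated evaluation of the outlier detector (Algorithm 2)."""
    segments = np.array_split(rng.permutation(len(X)), n_segments)
    test_errors = []
    for i, test_idx in enumerate(segments):
        rest = np.concatenate([s for k, s in enumerate(segments) if k != i])
        Xr, yr = X[rest], y[rest]
        if remove_outliers is not None:   # Line 5; skip for the baseline run
            Xr, yr = remove_outliers(Xr, yr)
        best_model, best_err = None, np.inf
        for _ in range(n_restarts):       # N random train/validation splits
            perm = rng.permutation(len(Xr))
            cut = int(2 / 3 * len(Xr))    # e.g. 50%/25% of the full data set
            model, err = train_mlp((Xr[perm[:cut]], yr[perm[:cut]]),
                                   (Xr[perm[cut:]], yr[perm[cut:]]))
            if err < best_err:
                best_model, best_err = model, err
        pred = best_model(X[test_idx])    # score the best model on the test set
        test_errors.append(float(np.mean((y[test_idx] - pred) ** 2)))
    return test_errors
```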

Number of hidden nodes

In order for the MLP to make accurate predictions, the number of hidden nodes and the number of hidden layers needed to be selected appropriately. The number of hidden layers rarely needs to be more than one, so this parameter was simply set to one [18]. The number of hidden nodes was decided by using the test environment described in Section 3.5.2 on data from one of the geographical locations. However, since this is a sensitivity test, and the number of nodes is of lower importance among the parameters, it was deemed sufficient to use only one test set. The number of nodes which performed best was picked as the number to use in further tests. The tested numbers of nodes were 5, 10, 20 and 35.

3.6 Statistical hypothesis testing

The statistical testing consists of tests made to determine which features and parameters to use, as well as of the performance of the outlier detector when selecting parameters manually and when using the developed parameter setting methods. When using the developed evaluation method for testing, the data was divided into test, validation and training data with the proportions 25%, 25% and 50% respectively.


The main goal of the statistical testing was to show that unsupervised clustering can be used for outlier identification and improve the prediction results of an ANN. The strategy was to reject the null hypothesis that using the outlier detector has no effect on the MSE of the ANN predictions,

    H_0: E[\text{MSE without outlier detector}] = E[\text{MSE with outlier detector}].    (3.13)

In order to test this hypothesis, a one-sample Student's t-test was performed on the results from the different test sets. The t-test results are presented in Section 4.3.
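Such a test is a one-liner with SciPy; the sketch below uses a few illustrative MSE-reduction percentages (placeholder values, not the thesis results) as the sample and tests them against a zero population mean.

```python
import numpy as np
from scipy import stats

# Illustrative per-test-set MSE reductions in percent (placeholder values)
reductions = np.array([-0.5, 1.8, 0.1, 2.4, 3.1, 0.8])
t_stat, p_value = stats.ttest_1samp(reductions, popmean=0.0)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # reject H0 if p is small
```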


Chapter 4

Results

The outlier detector was tested using the developed evaluation method described in Section 3.5.2. The results from the feature ranker will be presented as well as results from sensitivity tests for the parameters.

4.1 Feature selection

The feature ranker was tested using the following features: wind speed, humidity, pressure, precipitation and temperature. It was found that the three best performing features were wind speed, temperature and pressure, see Table 4.1. These three features are the ones used in the following experiments to test the outlier detector.

Features        1st  2nd  3rd  4th  5th
Wind speed       10    0    0    0    0
Humidity          0    1    1    5    3
Pressure          0    0    8    2    0
Precipitation     0    0    0    3    7
Temperature       0    9    1    0    0

Table 4.1: The results of the feature ranker, showing how many times a feature was ranked in each position; 1st is the best and 5th the worst. The test was run 10 times.

4.2 Parameter sensitivity

4.2.1 Number of hidden nodes

The number of hidden nodes was examined by running the MLP test method, but only on one test set, as explained in Section 3.5.2. The test location was Location 2. The test set contained 25% of the total data, 50% was used for training and 25% for validation. The numbers of nodes tested were 5, 10, 20 and 35. The resulting MSE on the test set is displayed in Table 4.2.

Number of nodes   MSE
5                 0.3969
10                0.3809
20                0.3763
35                0.3724

Table 4.2: The MSE using the developed test method on one test set for different numbers of hidden nodes in the MLP.

As can be seen in the table, there seems to be a saturation tendency in the MSE when adding more nodes. However, the error did not decrease significantly when using 35 nodes compared to when 20 nodes were used, so to reduce computation time and limit the risk of over-fitting, the number of hidden nodes was set to 20 when testing the outlier detector.

4.2.2 Manual parameter selection

The two parameters to be set are the number of clusters, K, and the amount of data to be labelled as outliers and removed from the total data set, p. To manually select these parameters, a sensitivity test was made where different values of these parameters were tried. The test was made using the developed evaluation environment for one of the locations, Location 2. The results are shown in Table 4.3. The outlier identification is visualized in Fig. 4.1 for some different parameters. What should be noted when studying the figure is how the outlier detector removes points where the wind speed is high and the energy production low, and vice versa. It is also important to make sure the method does not remove too many points where both energy production and wind speed are high. As can be seen in Table 4.3, the results trend towards a local minimum at K = 60 and p = 15%. These parameters were chosen when testing the outlier detector with manually selected parameters.

(30)

Parameters         Test 1   Test 2   Test 3   Test 4   Average ± σ
K = 20, p = 5%     0.4824   0.4658   0.4776   0.4838   0.4774 ± 0.0071
K = 20, p = 10%    0.4880   0.4640   0.4735   0.4813   0.4767 ± 0.0090
K = 20, p = 15%    0.4933   0.4627   0.4814   0.4808   0.4796 ± 0.0109
K = 20, p = 20%    0.4899   0.4690   0.4805   0.4877   0.4818 ± 0.0081
K = 40, p = 5%     0.4824   0.4689   0.4789   0.4877   0.4781 ± 0.0054
K = 40, p = 10%    0.4829   0.4649   0.4758   0.4820   0.4764 ± 0.0072
K = 40, p = 15%    0.4936   0.4630   0.4709   0.4795   0.4767 ± 0.0114
K = 40, p = 20%    0.4909   0.4651   0.4737   0.4815   0.4778 ± 0.0095
K = 60, p = 5%     0.4866   0.4581   0.4769   0.4784   0.4750 ± 0.0104
K = 60, p = 10%    0.4820   0.4600   0.4750   0.4794   0.4741 ± 0.0086
K = 60, p = 15%    0.4819   0.4570   0.4742   0.4817   0.4737 ± 0.0101
K = 60, p = 20%    0.4858   0.4650   0.4734   0.4800   0.4760 ± 0.0077
K = 80, p = 5%     0.4864   0.4693   0.4790   0.4874   0.4805 ± 0.0072
K = 80, p = 10%    0.4849   0.4585   0.4747   0.4812   0.4748 ± 0.0101
K = 80, p = 15%    0.4821   0.4619   0.4753   0.4760   0.4739 ± 0.0074
K = 80, p = 20%    0.4951   0.4651   0.4789   0.4863   0.4813 ± 0.0110

Table 4.3: The resulting MSE on the four test sets used by the test method. Each row shows the result when using different values of the number of clusters, K, and the amount of data labelled as outliers and removed from the whole dataset, p. The final column shows the average and standard deviation over the four test sets. The lowest average error, at K = 60 and p = 15%, indicates which parameters were used for further testing.

(31)

Figure 4.1: The figure shows the effect the different parameters K and p have when using the outlier detector. The data is from Location 3 and is visualized as a power curve. The red dots are the outliers removed by the outlier detector. The figure only displays one of the features used for the clustering (wind speed); it should be noted that two more features, pressure and temperature, were also used by the outlier detector.

4.3 Evaluation results for the outlier detector for manually selected parameters

The outlier detector was tested using the developed test environment with the manually selected parameter values K = 60 and p = 15% based on the results from Section 4.2.2. The results are displayed in Table 4.4 and in Table 4.5 where two cases are tested. In the first case, all data points available from the geographic locations are used. In the second case, 3000 samples are randomly selected for testing to evaluate the performance of the outlier detector when less data is used.

As can be seen in the tables, the improvement in MSE for the MLP predictions varies between locations, but in general the outlier detection followed by outlier removal improves the predictions. A one-sample t-test was used to statistically test whether the MSE decrease can be shown to differ from a zero-mean population. Two independent tests were made: one for the results where all samples are used (Table 4.4) and one for the results where a subset of the original data is used (Table 4.5). The samples used for the t-test are the values from each test result for every location shown in the tables, resulting in a sample set of 24 samples for each test. The p-values of the t-tests are shown below.

(32)

Locations    # samples   Test 1   Test 2   Test 3   Test 4   Average ± σ
Location 1   6911        -0.47%    1.78%   -0.50%   -0.55%    0.07% ± 0.99%
Location 2   7048        -0.81%    1.07%    2.45%    3.10%    1.45% ± 1.50%
Location 3   7049        -0.24%    1.14%   -0.35%    2.80%    0.84% ± 1.28%
Location 4   6915        -0.08%   -0.53%   -0.83%    1.36%   -0.02% ± 0.84%
Location 5   7048         1.30%    3.00%    3.28%    1.52%    2.28% ± 0.87%
Location 6   7049         0.77%    1.71%    1.62%    3.11%    1.80% ± 0.84%

Table 4.4: The MSE reduction in percentage when using the outlier detector compared to when not using it. The four tests are the test results from the developed evaluation method. For each geographic location, every available data point was used; the number of samples is displayed in the second column. The final column shows the average and standard deviation of the four test results.

Locations    Test 1   Test 2   Test 3   Test 4   Average ± σ
Location 1   -2.32%    0.15%   -2.38%    0.77%   -0.95% ± 1.42%
Location 2    2.36%    2.44%    0.34%    1.98%    1.78% ± 0.85%
Location 3    0.05%    4.35%    3.54%    1.90%    2.46% ± 1.65%
Location 4   -0.39%    1.31%   -0.36%   -0.49%    0.02% ± 0.75%
Location 5    1.93%    1.41%    3.38%   -0.36%    1.59% ± 1.34%
Location 6    1.76%    2.91%    2.53%    2.20%    2.35% ± 0.42%

Table 4.5: The same as Table 4.4, except that instead of using every available sample, 3000 randomly selected samples from each location were used. This was done to test whether the result would differ if fewer data points were used.

• For the samples from Table 4.4, where all data points were used: p_all = 0.0011.

• For the samples from Table 4.5, where 3000 randomly sampled data points were used: p_3000 = 0.0024.

This provides convincing evidence that the outlier detector has some effect on the prediction results.

The data are plotted in Fig. 4.2, showing the power curves of the data before and after the outliers have been removed. The figure shows that the outlier detector manages to mark the data points with low wind speed and high power output, and vice versa. The data becomes more shaped like a power curve, which is reasonable since this relation between wind power production and wind speed is typical. It should be noted, however, that the figure displays the targets (wind power production) against only one feature, wind speed, even though the outlier identifier also used two more features: pressure and temperature. This explains why data points in the middle of the power curves are sometimes marked as outliers: in relation to some other feature, they differ from the rest within their clusters.

(33)

Figure 4.2: Power curves showing the data before and after the identified outliers have been removed for the six geographical locations (panels (a)-(f): Locations 1-6). The left column shows the original data, the middle column shows the outliers marked in red, and the right column shows the data after the outliers have been removed.

(34)

4.4 Testing the method for automatically selecting the number of clusters, K

The method for selecting K was tested by comparing the test results when the method was used to determine the value of K in the outlier detector with those when the parameter was manually set to K = 60. The amount of data to be removed was set to p = 15% according to the test results in Section 4.2.2. The test was made on all six geographical locations.

As can be seen in Table 4.6, the method did not seem to improve the predictions compared to when K was manually selected, since in most of the cases the MSE increased when using the method for automatically selecting K. Since so many test sets indicated worsened predictions, it was deemed unnecessary to statistically test the results further.

Locations    # samples   Test 1   Test 2   Test 3   Test 4   Average ± σ
Location 1   6911         0.90%   -0.07%   -2.17%   -3.40%   -1.18% ± 1.69%
Location 2   7048        -1.91%   -2.79%    0.71%   -0.50%   -1.12% ± 1.33%
Location 3   7049        -1.00%    0.32%   -4.10%   -3.91%   -2.17% ± 1.89%
Location 4   6915         0.51%   -0.02%    1.22%   -0.73%    0.25% ± 0.71%
Location 5   7048         0.77%   -0.21%    1.29%   -0.11%    0.44% ± 0.62%
Location 6   7049        -1.68%   -1.37%    0.67%   -0.63%   -0.75% ± 0.90%

Table 4.6: The MSE decrease in percentage for the outlier detector when using the developed method for selecting K, compared to when manually setting the parameter to K = 60. The four tests are the results from the different test data sets used in the developed evaluation method. The final column shows the average and the standard deviation.

4.5 Testing the outlier stop criterion

A test was made for different values of ε using the evaluation method on one geographic location, Location 6. The results are presented in Table 4.7. As can be seen, ε = 0.0014 gave the best performance and was used for testing the outlier stop criterion method.

ε        Test 1   Test 2   Test 3   Test 4   Average ± σ
0.0006   0.5882   0.6014   0.6889   0.6152   0.6235 ± 0.0390
0.0010   0.5923   0.5858   0.5796   0.5886   0.5866 ± 0.0047
0.0014   0.5969   0.5819   0.5712   0.5859   0.5840 ± 0.0092
0.0018   0.5738   0.5961   0.5750   0.6044   0.5873 ± 0.0133
0.0022   0.5771   0.5871   0.5990   0.5968   0.5900 ± 0.0087

Table 4.7: The MSE of the four test sets of the evaluation method when testing on data from one location, Location 6, for different tolerances, ε, on the stop criterion. It should be noted that these MSE values differ from those in Table 4.3 since these tests were performed on a different location.

The outlier stop criterion method was tested in the same way as the method for selecting K, simply by first testing with the outlier stop criterion and then with manually setting p = 15%. The MSE decrease for each geographic location and test set is shown in Table 4.8.

As can be seen in the table, the method does not improve the results in most of the cases compared to when p was manually set to 15%. The results seemed conclusive enough to conclude that no statistical testing would be necessary to show that the outlier stop criterion does not improve the predictions.

Locations    Test 1          Test 2          Test 3          Test 4          Average ± σ
             d       p       d       p       d       p       d       p
Location 1   -1.19%  20.8%   -0.71%  12.8%    0.10%  16.2%   -2.92%  15.7%   -1.18% ± 1.10%
Location 2   -2.67%  19.3%   -0.24%  19.4%   -2.16%  10.7%   -1.24%   9.3%   -1.58% ± 0.93%
Location 3    0.32%  15.2%   -0.48%   2.6%    0.02%   5.1%    0.12%   7.6%   -0.01% ± 0.30%
Location 4    1.82%  11.8%    2.39%   4.5%    1.29%   4.1%   -0.19%   7.0%    1.33% ± 0.96%
Location 5   -1.49%   3.1%   -0.01%  11.5%   -0.09%   3.2%    0.79%  10.2%   -0.20% ± 0.82%
Location 6   -2.54%   8.1%   -0.94%  12.8%    1.07%  14.3%   -1.78%  28.1%   -1.05% ± 1.35%

Table 4.8: The MSE decrease in percentage when using the outlier stop criterion to determine the amount of data to be removed, p, with respect to when manually setting the parameter to p = 15%. For each of the four tests, the MSE decrease, d, in percentage is shown together with the value of p determined by the stop criterion. The final column shows the average and standard deviation of the MSE decrease over the four test sets.


Chapter 5

Discussion & Conclusion

In this Master's Thesis a method for detecting outliers has been developed. The proposed outlier detector uses unsupervised machine learning clustering to identify anomalies in weather and wind power production data. The outlier detector has been tested using an advanced test environment developed to provide a reliable test result based on the MSE of wind power predictions made by an MLP model. In addition, a feature ranking method has been developed to find relevant input features for the clustering, as well as two parameter tuning techniques for automatically setting the parameters: the number of clusters to be used and how much data is to be removed.

Data from six different geographical locations was used to test the performance of the outlier detector. The results provide evidence that the outlier detector is able to reduce the MSE of predictions made by the MLP by removing the data identified as outliers. However, neither of the two parameter tuners was shown to improve the results compared to the case when the parameters were manually selected beforehand.

5.1 Machine learning methodology

5.1.1 Parameters

The outlier detector is shown to work properly when used on day-ahead wind power production prediction. However, many parameters were manually selected. Apart from the number of clusters and the amount of data to be removed from the original set, the number of layers and the number of hidden nodes in the MLP were manually set to one and 20 respectively. According to Stephen Marsland [18], the number of layers allows for more complex functions to be modelled, but more than two layers is rarely necessary. Marsland also claims that there is no obvious way to decide the number of hidden nodes and that one simply has to try different amounts and see how they affect the results.

The convergence test for the outlier stop criterion, seen in Eq. (3.6), also has two parameters that were not fully evaluated: how many samples are used to measure convergence, and which tolerance, ε, the criterion should have. If these were properly analysed, the outlier stop criterion method could generate better results. However, the observations made while testing indicated that the result would not improve significantly enough to outperform the case when the amount of data to remove, p, was manually selected.

5.1.2 Test environment

Developing a reliable test environment for testing machine learning predictions using big data requires a lot of consideration. Much time and effort was put into developing a sound method for robust and reliable evaluation. Even though the test environment is able to provide reliable evidence, there still exists an uncertainty in the exact percentage improvement for the different locations. Since the data is divided randomly into test, validation and training sets, the test will give inconsistent percentage improvements for each location if run several times. What the test actually is consistent about is that the overall improvement is positive.

Further statistical testing should help to resolve the issue of this probabilistic uncertainty. However, the scarcity of the data poses a problem for further improvement of the statistical test results.

5.2 Comparisons to previous research

This Master's Thesis has contributed new methodology to the field of outlier detection in data used for wind power production prediction. There exist several methods for this purpose, such as the ones developed by Kusiak et al. [8] and Liu et al. [17]. However, there seems to be a lack of test results, and a limited amount of data has been used for testing. For instance, Kusiak et al. used a data set containing 3460 observations, while in this thesis data from six geographical locations, each containing around 7000 observations, were used. This can contribute to the scientific field since the test results from the different locations show how the performance of a method can differ depending on the data used.

The evaluation method used in this thesis can prove useful in further research. Much effort was put into making the evaluation method reliable. In existing papers, it seems to be common to separate a test set from the original data, use the rest for training a model, and then rely solely on the model's performance on that test set. This method is mentioned by Marsland [18] and used by Kusiak et al. [8]. The method proposed in Section 3.5.2 does not only use a subset of the original data for testing, but lets all data in turn act as both test and training data. This is a more effective way to make use of all available data and provides a more reliable and accurate result.

5.3 Ethical review

The strongest ethical aspect of this Master's Thesis is that it contributes to making wind power a more predictable and therefore more practical source of energy. It becomes easier for electrical power producers and consumers to know when to produce and when to consume energy depending on the energy prices.

Another aspect is the fact that by identifying and removing outliers, the quality of existing data can increase, which limits the need for acquiring more data, a process that in some cases can be costly and damaging. The method can also be used to clean databases of unwanted data and thereby free up digital storage space, limiting the need for ever increasing storage capacity.

5.4 Future work

Since the two parameter tuners did not prove to increase the accuracy of the predictions, there is still room for optimization of the outlier detector. For instance, a clustering method other than K-means could be used: one which is able to select the number of clusters depending on the shape of the data. Fraley and Raftery [24] developed a methodology for this using multivariate Gaussian Mixture Models with Expectation Maximization and Bayesian model selection. Using a similar approach instead of K-means for the clustering could result in a more general solution. This could work as long as the method is able to settle on many clusters, since it is important to divide the data so that it properly partitions the multivariate feature space and narrows down the spread of each feature within the clusters, making the clusters tighter.

The outlier detector was also unable to identify the data points where the energy production was capped. This can be seen as an outlier behaviour for the specific case of wind power production data. These outliers could be identified by finding time series segments within the data where the wind power production is capped.

A way to further strengthen the evaluation of the method would be to test whether the outlier detector can improve the results of a machine learning method other than an ANN, such as an SVM. These algorithms have different methodologies and can therefore behave differently when identified outliers are removed from the data set.


Bibliography

[1] T. Ackermann, Wind power in power systems, vol. 140. Wiley Online Library, 2005.

[2] REN21, "Renewables 2016: Global status report," REN21 Renewable Energy Policy Network/Worldwatch Institute, 2016.

[3] United Nations Framework Convention on Climate Change (UNFCCC), "COP21 Paris Agreement."

[4] P. Menanteau, D. Finon, and M.-L. Lamy, "Prices versus quantities: choosing policies for promoting the development of renewable energy," Energy Policy, vol. 31, no. 8, pp. 799–812, 2003.

[5] C. Monteiro, R. Bessa, V. Miranda, A. Botterud, J. Wang, G. Conzelmann, et al., "Wind power forecasting: state-of-the-art 2009," tech. rep., Argonne National Laboratory (ANL), 2009.

[6] D. M. Hawkins, Identification of outliers, vol. 11. Springer, 1980.

[7] A. Ghoting, S. Parthasarathy, and M. E. Otey, “Fast mining of distance-based outliers in high-dimensional datasets,” Data Mining and Knowledge Discovery, vol. 16, no. 3, pp. 349–364, 2008.

[8] A. Kusiak, H. Zheng, and Z. Song, “Models for monitoring wind farm power,” Renewable Energy, vol. 34, no. 3, pp. 583–590, 2009.

[9] E. Vladislavleva, T. Friedrich, F. Neumann, and M. Wagner, "Predicting the energy output of wind farms based on weather data: Important variables and their correlation," Renewable Energy, vol. 50, pp. 236–243, 2013.

[10] N. Amjady and F. Keynia, "Day-ahead price forecasting of electricity markets by mutual information technique and cascaded neuro-evolutionary algorithm," IEEE Transactions on Power Systems, vol. 24, no. 1, pp. 306–318, 2009.

[11] N. Amjady, F. Keynia, and H. Zareipour, "Short-term wind power forecasting using ridgelet neural network," Electric Power Systems Research, vol. 81, no. 12, pp. 2099–2107, 2011.

[12] S. J. Kemp, P. Zaradic, and F. Hansen, “An approach for determining relative input parameter importance and significance in artificial neural networks,” Ecological modelling, vol. 204, no. 3, pp. 326–334, 2007.

(40)

[13] A. Verikas and M. Bacauskiene, “Feature selection with neural networks,” Pattern Recognition Letters, vol. 23, no. 11, pp. 1323–1335, 2002.

[14] M. Gupta, J. Gao, C. C. Aggarwal, and J. Han, “Outlier detection for temporal data: A survey,” IEEE Transactions on Knowledge and Data Engineering, vol. 26, pp. 2250–2267, Sept 2014.

[15] I. Ben-Gal, “Outlier detection,” in Data mining and knowledge discovery handbook, pp. 131–146, Springer, 2005.

[16] Y. Wan, E. Ela, and K. Orwig, “Development of an equivalent wind plant power curve,” in Proc. WindPower, pp. 1–20, 2010.

[17] Z. Liu, W. Gao, Y.-H. Wan, and E. Muljadi, “Wind power plant prediction by using neural networks,” in 2012 IEEE Energy Conversion Congress and Exposition (ECCE), pp. 3154–3160, IEEE, 2012.

[18] S. Marsland, Machine learning: an algorithmic perspective. CRC press, 2015.

[19] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, (Berkeley, Calif.), pp. 281–297, University of California Press, 1967.

[20] D. L. Davies and D. W. Bouldin, "A cluster separation measure," IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 2, pp. 224–227, 1979.

[21] J. C. Bezdek and N. R. Pal, “Some new indexes of cluster validity,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 28, no. 3, pp. 301–315, 1998.

[22] H. Liu and L. Yu, “Toward integrating feature selection algorithms for classification and clustering,” IEEE Transactions on knowledge and data engineering, vol. 17, no. 4, pp. 491–502, 2005.

[23] K. Wagstaff, C. Cardie, S. Rogers, S. Schrödl, et al., "Constrained k-means clustering with background knowledge," in ICML, vol. 1, pp. 577–584, 2001.

[24] C. Fraley and A. E. Raftery, "How many clusters? Which clustering method? Answers via model-based cluster analysis," The Computer Journal, vol. 41, no. 8, pp. 578–588, 1998.

