
DEGREE PROJECT IN COMPUTER SCIENCE, SECOND LEVEL

STOCKHOLM, SWEDEN 2015

Short-term wind power forecasting

using artificial neural networks

MORGAN SVENSSON


Short-term wind power forecasting using

artificial neural networks

Närtidsprognos av vindkraftsproduktion genom användandet av artificiella neurala nätverk

MORGAN SVENSSON

Master’s Thesis at the School of Computer Science and Communication,
KTH Royal Institute of Technology

Machine Learning Programme

Supervisor Pawel Herman

Examiner Anders Lansner

Expektra

Supervisor Niclas Ehn


Abstract

Wind power has seen tremendous growth in recent years and is expected to grow even more in years to come. In order to better schedule and utilize this energy source, good forecasting techniques are necessary. This thesis investigates the use of artificial neural networks for short-term wind power prediction. It compares two different networks, the so-called Multilayer Perceptron and the Hierarchical Temporal Memory / Cortical Learning Algorithm. These two networks are validated and compared on a benchmark dataset published in the Global Energy Forecasting Competition, a competition that includes a short-term wind power prediction track. The results of this study show that the Multilayer Perceptron is able to compete with previously published models and that the Hierarchical Temporal Memory / Cortical Learning Algorithm is able to beat the reference model.

Keywords: neural networks, wind power generation forecasting, machine learning


Närtidsprognos av vindkraftsproduktion genom

användandet av artificiella neurala nätverk

Wind power is currently the fastest-growing renewable energy source in the world, and with this growth it is important that we develop good forecasting tools. This thesis investigates the use of artificial neural networks applied to short-term forecasts of wind power production. The algorithms investigated are built on the so-called multilayer perceptron (MLP) and the hierarchical temporal memory (HTM/CLA). These methods are validated and compared using data published in GEFCom, a competition for energy forecasting. The results of the study show that the MLP method can compete with other published methods and that HTM/CLA can beat the reference model.

Keywords: multilayer perceptron, machine learning, hierarchical temporal memory


Acknowledgements

I wish to thank, first and foremost, my great parents Yvonne Svensson and Benkt Svensson for always being there to support my interests.

Secondly, I want to thank my supervisor Pawel Herman, not just for the help I have received during this thesis but for the things I have learned from him while studying at KTH.

I also want to thank the people at Expektra, Niclas Ehn, Gustav Bergman, Mattias Jonsson, Andreas Johansson, Per Åslund and Joel Ekelöf, for introducing me to their area of expertise and energy forecasting in general.

A special thanks to my good friends and classmates, Andrea de Giorgio and Vanya Avramova for all the ideas and discussions we have shared.


Indices, Constants and Variables

X1:n A sequence of values X = [x1, x2, ..., xn].
k = 1 : kmax Lead time or look-ahead time.
kmax Maximum prediction horizon.
N Total number of data points.
e Prediction error.
ε Normalized prediction error.
pt Measured power generation at time t.
p̂t+k|t Forecast of power generation made at time t for look-ahead time t + k.
wij Weight of a synapse in the neural network, row i, layer j.
b Binary value.
s Weighted sum, including bias, of a perceptron.
x · y Dot product between x and y.
x ∘ y Row-by-row element-wise multiplication.

Units of measurement

MW Megawatts
GW Gigawatts

Notes


Contents

1 Introduction
1.1 Problem Formulation
1.2 The scope of the problem
2 Background
2.1 Neural Networks and Time Series Prediction
3 Method and Materials
3.1 Preliminaries
3.1.1 Remarks
3.1.2 Definitions
3.1.3 Reference models
3.1.4 Error metrics
3.1.5 Model selection
3.1.6 Evaluation
3.2 Experiments
3.3 Holdback Input Randomization
3.4 Optimization methods
3.5 Neural Networks
3.5.1 Multilayer Perceptron
3.5.2 Numenta Platform for Intelligent Computing
4 Result
4.1 Experimental results
4.2 Input Importance
4.2.1 Adaptation and Optimization


Chapter 1

Introduction

It has been estimated by the World Wind Energy Association (WWEA) that by the year 2020 around 12% of the world’s electricity will be available through wind power, making wind energy one of the fastest growing energy resources [WWEA, 2014; Fan et al., 2009]. Integrating wind energy into existing electricity supply systems has, however, been a challenge, and numerous objections have been put forward by traditional energy suppliers and grid operators, especially for large-scale use of this energy source. The biggest concern is that availability mainly depends on meteorological conditions, and production cannot be adjusted as conveniently as with other, more conventional energy sources, because of our inability to control the wind. A single Wind Turbine (WT) is highly variable, and its dependency on wind conditions can result in zero output for thousands of hours during the course of a year; aggregating wind power generation over bigger areas, however, decreases this chance.

This is where wind power forecasting systems come into play, a technology that can greatly improve the integration of wind energy into electricity supply systems, as forecasting systems provide information on how much wind power can be expected at any given point within the next few days. This removes some of the randomness attributed to wind energy and allows a more accurate way to utilize this clean energy source, while offsetting some of our dependency on other, less environmentally friendly sources, which in the long run will cause a smaller degenerative impact on the environment.

There are many commercial forecasting models available. Prediktor1 [Landberg and Watson, 1994] is a physical model developed by the Risø National Laboratory, Denmark. It is constructed to refine Numerical Weather Prediction (NWP) data and transform it through a power curve to produce the forecast, while improving the error rate with Model Output Statistics (MOS). Previento [Focken et al., 2001], developed at the University of Oldenburg, Germany, uses a similar approach to Prediktor but adds regional forecasting and uncertainty estimation. The Wind Power Prediction Tool (WPPT)2 [Nielsen et al., 2002] is a statistical model developed by the Technical University of Denmark; it consists of a semi-parametric power curve model for wind farms that takes both wind speed and direction into account, using dynamical prediction models that describe the dynamics of the wind power and any diurnal variations. Zephyr [Giebel et al., 2000] is a hybrid model combining WPPT and Prediktor; in this model each wind farm is assigned a forecast model according to the available data. Sipreólico [Gonzalez et al., 2004], developed by Red Eléctrica de España, is a statistical model designed to be highly flexible with respect to the available data, which it achieves by switching between 9 different models. Aiolos Wind3 is a hybrid model developed by Vitec that creates forecasts by combining a statistical model with physical factors such as wind speed at different altitudes, wind direction and air density.

Expektra4 is a service-based company founded in 2010 that provides and develops a new method for short-term demand and supply forecasting based on an Artificial Neural Network (ANN) and traditional time series analysis. They are currently expanding into the area of wind power forecasting and looking for suitable methods. ANNs have been used successfully for both wind speed forecasting [Lawan et al., 2014] and wind power forecasting [Kariniotakis et al., 1996]. It was demonstrated in Liu et al. [2012] that a complex-valued recurrent neural network was able to predict output with high accuracy, and Huang et al. [2015] showed that optimizing the initial weights of a back-propagated neural network with a genetic algorithm gave impressive results, so there are good reasons to think that part of the method Expektra is developing can be used for wind power forecasting.

Neural networks in general are highly flexible, and recent advancements in deep learning have been shown to outperform previous models in different domains [Schmidhuber, 2015]. These networks are very good at automatically finding features that would take a lot of time and effort to engineer by hand. One network that shares similarities with deep learning but has received less attention is the Cortical Learning Algorithm (CLA) / Hierarchical Temporal Memory (HTM) developed by Numenta [Numenta, 2011]. This network is also built around the idea of having hierarchical structures, creating a deep neural network. CLA / HTM is

2 http://www.enfor.eu/


currently tailored very specifically to time-series problems and has, at the moment, little research published around it, making it an ideal candidate for time series prediction with unknown potential on wind power forecasting problems.

Even though there are a lot of prominent methods already developed for wind power forecasting, there is still room for improvement. Energy forecasting is such an important topic that competitions have been developed around it. The Global Energy Forecasting Competition (GEFCom) [Hong et al., 2014] is a competition that can be used to help evaluate the performance of new forecasting models. It has attracted hundreds of contestants from all over the world, which has resulted in the contribution of many new and novel ideas. The GEFCom was created in order to:

1. Bring together state-of-the-art techniques for energy forecasting.

2. Bridge the gap between academic research and industry practice.

3. Promote analytical approaches in power / energy education.

4. Prepare the industry to overcome forecasting challenges posed by the smart grid world.

5. Improve energy forecasting practices.

Benchmark datasets and competitions are a valuable resource when evaluating new models, as they provide a common ground to stand on. With the publication of Hong et al. [2014] the dataset in GEFCom2012 was also published. This dataset consists of data from 7 different wind farms spanning a time period of three years, with observational data from the energy production and weather forecasts.

The GEFCom is divided into four tracks: load forecasting, price forecasting, wind power forecasting and solar power forecasting. The specified problem in the wind power forecasting track is built around the real-time operation of wind farms, and the dataset is structured accordingly. Participants try to predict hourly power generation 1–48 hours ahead given meteorological forecasts and historically produced power.

1.1 Problem Formulation


power generation, i.e. What is needed to adapt this model to wind power forecasting problems?

The second objective of this thesis is to investigate HTM / CLA [Numenta, 2011], a modern state-of-the-art computational theory of the neocortex, i.e. Is it possible to use the Numenta Platform for Intelligent Computing (NuPIC) for wind power forecasting problems?

These questions will be addressed by evaluating the models against other models published in the area of short-term wind power forecasting, more specifically, those that have been published in GEFCom.

1.2 The scope of the problem

1. This study will focus on short-term forecasting, i.e. forecasts done for 1–48 hours ahead. How to perform well on longer horizons is left to further investigation.

2. Wind power forecasting is closely related to wind speed forecasting, i.e. trying to use local information at the wind turbine instead of NWP data to forecast wind power. This thesis will not go into any specific details on how to forecast wind speeds. Readers can take a look at Li and Shi [2010]; Cadenas and Rivera [2009]; Akinci [2015].


Chapter 2

Background


Figure 2.1: A figure that presents the general outline when forecasting using the statistical approach

Forecast models that use SCADA data as their primary input source usually have good forecast accuracy, at least for the first few hours, but they tend to be less useful for longer prediction horizons [Giebel et al., 2011]. SCADA data can also be used to detect problems in WTs, something that can be helpful for improving the reliability of WTs [Yang et al., 2013; Wang et al., 2014].

Statistical models, seen in figure 2.1, are usually built around Numerical Weather Prediction (NWP) and SCADA data. NWP models are often used to build forecasting models as they introduce weather forecasts for the region where the wind turbines are located. NWP data usually contains information about things like wind speed, wind direction, temperature and humidity. These models are operated two or four times a day by a number of large weather services. The main forecasts usually start at 00 and 12 UTC, corresponding to the world radiosonde launching, which is the only direct observation of the atmospheric state and has been the backbone of atmospheric monitoring for many decades1; extra forecasts usually start at 06 and 18 UTC. Physical models, as seen in figure 2.2, include additional information about the physical characteristics of the wind turbine and its surroundings, i.e. terrain data, information about obstacles, and the capacity and layout of the site where the turbine is located. Other useful information for physical models includes the theoretical

1 Some problems with this approach are that it results in less information over large oceans and


power curve; how much power is expected to be produced given a specific wind speed.

The time scales of WPF methods are generally divided into 3 main groups: very short-term (up to 9 hours), short-term (up to 72 hours) and medium-term (up to 7 days) [Costa et al., 2008], while the time step of these models is in the range of seconds to days depending on the application. Very short-term models used for wind power forecasting consist of statistical methods like Kalman filters, Auto-Regressive Moving Average (ARMA), Auto-Regressive with Exogenous Input (ARX), Box-Jenkins, etc. Input to these models are historical observations of wind speed, wind direction, temperature, etc., and common applications for this forecast horizon include things like intraday market trading. Since these methods are based merely on past production they are generally not useful for longer horizons.

[Figure 2.2 block diagram: SCADA data and NWP data are downscaled and transformed to hub height, spatially refined, converted to power using the WT power curve, and corrected with Model Output Statistics (MOS) to produce the wind power generation forecast.]

Figure 2.2: A figure that presents the general steps when forecasting using a physical model.


2.1 Neural Networks and Time Series Prediction

One of the most common ANNs is the so-called Multilayer Perceptron (MLP), a network built around some very simple properties of a cortical neuron. This network has been around since the 50’s and has matured with a solid mathematical foundation, and it has been applied successfully to many different applications. It is built around the idea of having multiple layers of neurons, each layer feeding information forward to the next layer, i.e. a feed-forward neural network.

The main advantage of using multiple layers instead of just a single one is to overcome limitations pointed out by Minsky and Selfridge [1960]; Minsky and Seymour [1969], where a single-layer perceptron is only capable of learning linearly separable patterns and thus unable to learn all functions. The MLP is not limited in this way and has been shown to be able to represent a wide variety of functions given appropriate parameters [Hornik et al., 1989].

NuPIC is a platform actively developed and maintained by Numenta. It introduces a collection of ideas and algorithms that are inspired by the structural organization of the neocortex. Two of the core concepts found inside NuPIC are the so-called Hierarchical Temporal Memory and the Cortical Learning Algorithm. HTM was introduced in Hawkins and Blakeslee [2007] and refers to a hierarchy of cortical regions in the brain.

The neocortex is classically divided into 6 different layers2; the current implementation in NuPIC is focused on emulating layers 3–4 (specifically layer 3), with extensions and research code being written to include more layers.

A CLA region consists of a collection of columns, and each column consists of a handful of cells. This structure is based on the minicolumn hypothesis [Buxhoeveden and Casanova, 2002]. Each column in the CLA region has its own semantic meaning, and the sparse activity of a handful of active columns will tell us something about the input. The CLA is modelled so that specific cells within each column reflect the temporal context of a pattern, and a single cortical region (a CLA region) is trained with the CLA algorithm.

A typical CLA region consists of around 2K columns containing around 60K neurons in total, while a typical MLP may have fewer than 100 neurons. Each neuron in an HTM network grows new synapses over time, and it is not uncommon to have around 5K synapses per neuron, meaning we would have around 300M synapses in total. A single region of this size uses around 100 MB of memory, and it takes around 10 msec to do one inference and learning step3.

2 These layers should not be confused with the hierarchy of regions


The HTM/CLA also differs from the MLP in that it does not directly use scalar weights to represent synaptic connectivity. Synapses inside an HTM have binary weights with a scalar permanence: each synapse is either connected or disconnected, while the MLP uses scalar weights. The connectedness in HTM/CLA is based on a permanence, which is a value between 0.0 and 1.0. If the permanence is over a certain threshold, we have a connected synapse, i.e. we have weights of either 1 or 0. This binary mechanism is there to simulate synapses that are able to form and unform during learning. So essentially we have a weight-change network vs a wiring-change network [Chklovskii et al., 2004].
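The permanence mechanism described above can be illustrated with a short sketch. The threshold and increment values here are placeholders chosen for illustration, not values taken from the NuPIC implementation:

```python
# Sketch of an HTM-style synapse: a scalar permanence in [0.0, 1.0] whose
# effective weight is binary. The threshold and learning increments below
# are illustrative assumptions, not NuPIC's actual defaults.
CONNECTED_THRESHOLD = 0.2
PERM_INC, PERM_DEC = 0.05, 0.03

def weight(permanence):
    """Binary weight of a synapse: 1 if connected, 0 if disconnected."""
    return 1 if permanence >= CONNECTED_THRESHOLD else 0

def learn(permanence, active):
    """Reinforce synapses on active inputs, decay the rest (wiring change)."""
    p = permanence + PERM_INC if active else permanence - PERM_DEC
    return min(1.0, max(0.0, p))
```

Learning therefore moves synapses across the threshold over time, forming and unforming connections rather than fine-tuning scalar weights.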


Chapter 3

Method and Materials

The purpose of this chapter is to provide an explanation of the method used in this thesis. It presents terms and concepts needed and used in the field of Machine Learning, with a focus on how to create forecasts based on time series, and gives motivational reasons for why these methods are used as well as how they are used. This chapter also contains a description of the implemented artificial neural network, how it is structured and optimized. It provides details about the experimental setup and how these experiments relate to GEFCom, and finally a description of how the dataset is structured and processed.

3.1 Preliminaries

3.1.1 Remarks

The methodology used to evaluate the prediction models presented in this thesis is based on the protocol presented in Madsen et al. [2005]1, a complete protocol for evaluating the performance of short-term WPF and Wind-to-Power (W2P) models. This protocol was chosen because it has been used successfully before as a guideline to evaluate a wide variety of forecast models, such as AWPPS, Prediktor, Previento, Sipreólico and WPPT. The protocol has been used for both on-shore and off-shore wind farms and was developed in the frame of the ANEMOS research project [Kariniotakis et al., 2006], which brought together many relevant groups involved in the field. The aim of ANEMOS was to develop accurate and robust models that substantially outperform current state-of-the-art methods, and one of its goals was to establish a common set


of performance measures that can be used to compare forecasts across systems and locations.

3.1.2 Definitions

Time-series

A time series is a time-dependent collection of variables: a sequence X of observations xt with a particular time-stamp t, i.e. X = [x1, x2, ..., xt], or in short X1,...,t = X1:t.

Forecast horizon

A multi-step-ahead prediction is the assignment of predicting k forecasts Xt+1:t+k given a collection of historical observations Xt−p+1:t. In the case of GEFCom we want a forecast for 1–48 steps ahead.

Point or spot forecast

In this thesis we model the forecast p̂t+k|t as a so-called point forecast (or spot forecast), i.e. a single value for each forecast (as opposed to a probability distribution).

Prediction error

The prediction error e for lead time t + k is defined in equation 3.1 as the difference between the forecast and the actual value, where t denotes the time index, k is the look-ahead time, p is the actual/measured wind power and p̂ is the predicted wind power.

et+k|t = pt+k − p̂t+k|t (3.1)

The normalized prediction error ε is seen in equation 3.2

εt+k|t = (1/pinst) et+k|t = (1/pinst)(pt+k − p̂t+k|t) (3.2)

where pinst is the installed capacity of the wind farm (in kW or MW) which is


3.1.3 Reference models

Persistence (also called a naïve or plain predictor), as seen in equation 3.3, is the model most commonly used for benchmarking WPF models. This simple model states that the future wind generation value will be the same as the last measured value.

p̂persistence,t+k|t = pt (3.3)

An alternative would be to use an even simpler model, a climatology prediction (equation 3.4), i.e. predicting the mean, a value approximated from the training set (see section 3.1.5).

p̂mean,t+k|t = p̄ = (1/N) ∑t=1..N pt (3.4)
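The two reference models can be stated in a couple of lines; the function names are ours, for illustration:

```python
def persistence(history, k):
    """Persistence forecast (eq. 3.3): every look-ahead k gets the last measured value."""
    return history[-1]

def climatology(training_set, k):
    """Climatology forecast (eq. 3.4): every look-ahead k gets the training-set mean."""
    return sum(training_set) / len(training_set)
```

Note that neither model actually uses k: persistence repeats the last observation for every horizon, and climatology returns one constant.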

There are other reference models, like the one suggested by Nielsen et al. [1998], which have advantages over persistence, but it was never widely adopted and is not used by GEFCom.

3.1.4 Error metrics

In order to understand why specific models perform well, it is usually a good idea to evaluate them against a wide variety of different criteria, as is emphasised by Kariniotakis [1997]. This section describes the error measures used; throughout, N denotes the size of the test set.

Forecast Bias

The Normalized Forecast Bias (NBIAS) describes the systematic error and is defined in equation 3.5; it is estimated by calculating the average error for each step ahead. It gives an indication of the direction of the error.

NBIASk = (1/N) ∑t=1..N εt+k|t (3.5)


Mean Absolute Error

The Normalized Mean Absolute Error (NMAE) is an error quantity that looks at the average of the absolute error of the prediction and is defined in equation 3.6

NMAEk = (1/N) ∑t=1..N |εt+k|t| (3.6)

Another common name used instead of Mean Absolute Error (MAE) is Mean Absolute Deviation (MAD). This value shows the magnitude of the overall error that has occurred due to forecasting and thus should be as small as possible. This error is also scale dependent and will be affected by data transformations and the scale of measurements.

Mean Squared Error

The Normalized Mean Squared Error (NMSE) is an error quantity that looks at the average of the squared errors ε²t+k|t and is built using the Normalized Sum of Squared Errors (NSSE) defined in 3.7

NSSEk = ∑t=1..N ε²t+k|t (3.7)

NMSE is defined in equation 3.8

NMSEk = (1/N) NSSEk = (1/N) ∑t=1..N ε²t+k|t (3.8)

In this error, positive and negative errors do not cancel each other out, and large individual errors are penalized more harshly.

Root Mean Squared Error

The Normalized Root Mean Squared Error (NRMSE) is the square root of the NMSE; this error is defined in equation 3.9

NRMSEk = NMSEk^(1/2) = ((1/N) ∑t=1..N ε²t+k|t)^(1/2) (3.9)
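The error measures above are straightforward to compute in a few lines; the function and variable names here are ours, for illustration:

```python
import math

def error_metrics(p_true, p_pred, p_inst):
    """NBIAS, NMAE, NMSE and NRMSE for one look-ahead time k (eqs. 3.5-3.9).

    p_true, p_pred: sequences of measured and forecast power over the test set.
    p_inst: installed capacity of the wind farm, used for normalization (eq. 3.2).
    """
    n = len(p_true)
    # normalized prediction errors, eq. 3.2
    eps = [(pt - pf) / p_inst for pt, pf in zip(p_true, p_pred)]
    nbias = sum(eps) / n                 # eq. 3.5
    nmae = sum(abs(e) for e in eps) / n  # eq. 3.6
    nmse = sum(e * e for e in eps) / n   # eq. 3.8
    nrmse = math.sqrt(nmse)              # eq. 3.9
    return nbias, nmae, nmse, nrmse
```

A symmetric over- and under-prediction of the same size gives an NBIAS of zero while NMAE and NRMSE stay positive, which is why the bias alone is not a sufficient criterion.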


3.1.5 Model selection

In regression and classification, one of the main issues we are faced with is “How do we create a good model?”. One way to define this “goodness” is to look at the model’s generalization ability, i.e. its performance on unseen data.

To make sure the model we are building will generalize to new unseen data, it is important how we select and build the model in the first place; what we have control over is the data we have at hand and how we use that data to build our model.

Training, testing and validating

In the case of supervised learning, we have access to a target series and its associated features, i.e. the production series (our target) and wind speed and wind direction as features.

Common practice within Machine Learning, which is adhered to in this thesis, is to split the whole dataset into 3 smaller subsets. Two of these subsets are used to find a good model (the training set and the validation set) and the remaining one (the test set) is used for evaluation, i.e. to estimate how accurate the model would be on “unseen data”. With highly flexible models like an artificial neural network we need to be careful not to overfit the data, which is one of the reasons why we have the validation set: we don’t want a model that generalizes poorly because it is fitted to every minor variation, i.e. has captured a lot of noise.
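The three-way split can be sketched as below; the helper name and the default 60:20:20 proportions (the split used later in section 3.5.1) are stated here for illustration:

```python
def split_dataset(data, train=0.6, val=0.2):
    """Split a dataset into training, validation and test subsets.

    The defaults give a 60:20:20 split; whatever remains after the
    training and validation fractions becomes the test set.
    """
    n = len(data)
    n_train = int(n * train)
    n_val = int(n * val)
    return data[:n_train], data[n_train:n_train + n_val], data[n_train + n_val:]
```

The training set fits the weights, the validation set guards against overfitting, and the test set is touched only once, for the final evaluation.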

3.1.6 Evaluation

The improvement with respect to a considered reference model ref is defined in equation 3.10

Ik(ref,EC) = 100 · (ECk(ref) − ECk) / ECk(ref) (%) (3.10)


Testing period

Date Time | Forecast
2011-01-01 01:00 | 1–48 hours
2011-01-04 13:00 | 1–48 hours
2011-01-08 01:00 | 1–48 hours
2011-01-11 13:00 | 1–48 hours
... | ...
2012-06-23 01:00 | 1–48 hours
2012-06-26 13:00 | 1–48 hours

Table 3.1: The first repetition of the first period is 8 January 2011 at 01:00 to 10 January 2011 at 00:00. The second repetition of the first period is 15 January 2011 at 01:00 to 17 January 2011 at 00:00. In between these periods, data with power observations are available for updating the models.

3.2 Experiments

Training and testing are structured based on the structure of the GEFCom. The dataset spans from midnight on the 1st of July 2009 to noon on the 26th of June 2012. The period from the 1st of July 2009 to the 1st of January 2011 at 01:00 is used for training and validating, while the rest is used for testing. In the testing range a number of 48-hour periods are defined (see table 3.1); in between these testing periods exists additional training data, which enables the models to be updated between forecasts.

The testing periods repeat every 7 days until the end of the dataset, and only meteorological forecasts that were relevant for the periods with missing power data are given; this was done in order to be consistent.


No Category Parameter Alias Type

1 Date Date date String

2 Date Year year Integer

3 Date Month month Integer

4 Date Day day Integer

5 Date Hour hours Integer

6 Date Week week Integer

7 Forecast Wind Speed ws Real

8 Forecast Wind Direction (°) wd Real

9 Forecast Wind U u Real

10 Forecast Wind V v Real

11 Forecast Issued hp Integer

12 SCADA Production wp Real


The database for the meteorological forecasts given by the GEFCom dataset contains sections of missing data, each corresponding to the dates for which the missing power information exists; these sections were filled out with the previous best available forecast in a pre-processing step. For example, if we do not have data for the forecast issued at 2011-01-01 12:00, we use data from 2011-01-01 00:00, as a 48-hours-ahead forecast is available for that date. If the previous section also contained missing data, we would go back an additional section and use those forecasts. If no forecasts are available 48 hours back, we use the best known forecast and extend it in the same fashion as the persistence model.

Any predictions made by these models should fall within a certain range, so any forecast outside this range will be clamped. This is the main post-processing step, i.e. there is an upper limit on what can be produced.
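The pre- and post-processing rules above can be sketched as follows. The data layout (a dict from issue-time index to a 48-value forecast, with None where a section is missing) and all names are assumptions made for illustration:

```python
def best_available_forecast(forecasts, issue_times, t):
    """Pre-processing: walk back, section by section, to the most recent
    available forecast at or before issue time index t."""
    for i in range(t, -1, -1):
        f = forecasts.get(issue_times[i])
        if f is not None:
            return f
    return None

def clamp_forecast(forecast, p_inst):
    """Post-processing: clamp predictions to [0, installed capacity]."""
    return [min(max(p, 0.0), p_inst) for p in forecast]
```

The clamp reflects the physical constraint that a wind farm can produce neither negative power nor more than its installed capacity.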

3.3 Holdback Input Randomization

The Holdback Input Randomization (HIPR) method described in Kemp et al. [2007] can be used to investigate the importance of the input parameters. It works by sequentially feeding each data point in the test set to the neural network while replacing the values of one input parameter at a time. The replacement values are uniformly distributed random values in the range the neural network was originally trained on, i.e. (−1, 1). An NRMSE score is calculated for each replacement, which gives us information about the relevance of each input.
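The HIPR procedure can be sketched as below; `predict` and `nrmse` are assumed callables (the trained network and the error metric of section 3.1.4), and all names are ours:

```python
import random

def hipr_scores(predict, nrmse, X_test, y_test, seed=0):
    """Holdback Input Randomization: one NRMSE score per input parameter.

    For each input column, replace it with uniform random values in (-1, 1)
    (the network's training range) and record the resulting NRMSE; a large
    increase marks that input as important.
    """
    rng = random.Random(seed)
    n_features = len(X_test[0])
    scores = []
    for j in range(n_features):
        X_rand = [row[:] for row in X_test]  # copy, randomize only column j
        for row in X_rand:
            row[j] = rng.uniform(-1.0, 1.0)
        scores.append(nrmse(y_test, predict(X_rand)))
    return scores
```

An input whose randomization barely changes the NRMSE contributes little to the model's predictions.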

3.4 Optimization methods

The two main optimization algorithms used in this thesis are the Particle Swarm Optimization (PSO) algorithm presented by Eberhart and Kennedy [1995] and the Levenberg-Marquardt (LM)2 algorithm, independently developed by Kenneth Levenberg and Donald Marquardt [Marquardt, 1963; Levenberg, 1944]. The PSO algorithm is a population-based stochastic algorithm similar to a Genetic Algorithm (GA) but based on social-psychological principles instead of evolution. Each network was trained multiple times in order to avoid local minima. PSO can be summarized by the following steps:

2 The LM-algorithm was used in the beginning of the project but the result reported ended up

• Step 1: Initialize particles with random velocities and accelerations.

• Step 2: Determine which particle is closest to the goal.

• Step 3: Adjust accelerations toward that particle.

• Step 4: Update particles positions based on their velocity and update velocity based on acceleration.

• Step 5: go to step 2.
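The steps above can be sketched as a standard global-best PSO. The velocity update below uses cognitive and social pull terms, a common concrete form of steps 2–4; all parameter values and names are illustrative, not those used in the thesis:

```python
import random

def pso(objective, dim, n_particles=20, iters=100, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `objective` over [-1, 1]^dim with a basic global-best PSO."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                 # each particle's best position
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # swarm's best (step 2)
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # steps 3-4: accelerate toward personal and global bests
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val
```

In the training context, `objective` would be the network's validation error as a function of its weight vector.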

The LM algorithm is a combination of the Error Back-Propagation (EBP) method [Rumelhart et al., 1988] and the Gauss-Newton method. This algorithm has the speed advantage of Gauss-Newton and the stability of the EBP [Yu and Wilamowski, 2011]. A detailed treatment of LM can be found in Moré [1978] and of PSO in Poli et al. [2007].

3.5 Neural Networks

3.5.1 Multilayer Perceptron

The perceptron is built around a nonlinear model of a neuron, the McCulloch-Pitts model [McCulloch and Pitts, 1943]. It basically consists of a 2-step process: the cell body contains a summation function of the weighted sum of all inputs, including a bias, and the perceptron is described by equation 3.11. The sum s is passed through an activation function (see section 3.5.1) which mimics the activation or firing of the neuron.

s = ∑i=1..M wi xi + x0 (3.11)

wi is the weight of the “synapse” of the input channel and is the parameter we want


Figure 3.1: The perceptron.

The structure of the MLP consists of many perceptrons and is shown in figure 3.2. The input signals flow from the input layer at the bottom to the output layer at the top. We have a bias signal, seen on the left side of the diagram, which is set to a fixed number.
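A forward pass through such a fully connected network can be sketched as follows, assuming the tanh hidden layer and linear output used in section 3.5.1; sizes, weights and names are illustrative:

```python
import math

def layer(inputs, weights, biases, activation):
    """One fully connected layer: weighted sum plus bias (eq. 3.11), then f(s)."""
    return [activation(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

def mlp_forward(x, hidden_w, hidden_b, out_w, out_b):
    """Feed input x through a tanh hidden layer and a linear output neuron."""
    h = layer(x, hidden_w, hidden_b, math.tanh)
    out = layer(h, out_w, out_b, lambda s: s)
    return out[0]
```

Each row of a weight matrix holds one neuron's wij values, so the inner `sum` is exactly the per-neuron sum s of equation 3.11.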

MLPs are fully connected networks, meaning that neurons in any layer of the network are connected to all the neurons in the previous layer. Each connection in the network has a weight wij associated with it. The initialization process of these


[Figure 3.2 diagram: input layer (hours, u, v, week, ws, ws−1, ws−2, ws+1, ws+2), tanh hidden layers, linear output layer, with bias signals.]

Figure 3.2: Architectural graph of the neural network that will produce a single output value. It consists of a collection of hidden neurons in each of H hidden layers as well as M input connections. Each edge seen in this graph has a wij associated with it.

The performance of neural networks generally improves if the data is normalised, because using the original data directly can cause convergence problems. Normalization is done using the mapminmax function seen in equation 3.12.

y = \frac{(y_{max} - y_{min}) \cdot (x - x_{min})}{x_{max} - x_{min}} + y_{min} \quad (3.12)

where y_max is the maximum of the target range (here 1) and y_min the minimum (here −1), x is the value to be scaled, and x_max and x_min are the maximum and minimum of the values to be scaled.
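Equation 3.12 corresponds to MATLAB's mapminmax; a Python equivalent might look like this (function name kept for clarity):

```python
def mapminmax(x, xmin, xmax, ymin=-1.0, ymax=1.0):
    """Linearly rescale x from [xmin, xmax] to [ymin, ymax] (equation 3.12)."""
    return (ymax - ymin) * (x - xmin) / (xmax - xmin) + ymin

mid = mapminmax(5.0, 0.0, 10.0)   # midpoint of the input range maps to 0.0
```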

The forecasting model consists of 7 different networks, one for each wind farm. These networks are trained on the first section of the dataset, before the test period. A random 60:20:20 split is created, where 60% of the available data is used for training, 20% is used to validate the network, and 20% is set aside as a hold-out set for the hyperparameters. The input features³ fed into the models are


ws, u, v, hours, ws+1, ws+2, ws−1, ws−2, where ws+x and ws−x denote a time shift of x. We can use ws+x as input into the model because ws is a forecast in itself.
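Such shifted features can be built as in this sketch (pure Python; the column names are illustrative):

```python
def shifted(series, k, pad=None):
    """Element t of the result holds series[t + k]: k > 0 gives the
    forecast k steps ahead (ws+k), k < 0 the value k steps back (ws-k)."""
    n = len(series)
    return [series[t + k] if 0 <= t + k < n else pad for t in range(n)]

ws = [3.0, 4.0, 5.0, 6.0]  # a toy wind-speed forecast series
features = {f"ws{k:+d}": shifted(ws, k) for k in (-2, -1, 1, 2)}
```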

Activation Functions

The computation done for each neuron in the multilayer perceptron requires knowledge of the derivative of the activation function; in other words, the activation function we pick needs to be differentiable. In this thesis we use the following two activation functions: the hyperbolic tangent function seen in equation 3.13, and

f(s) = \tanh(s) = \frac{e^{s} - e^{-s}}{e^{s} + e^{-s}} \quad (3.13)

the linear transfer function seen in equation 3.14.

f(s) = \begin{cases} +1, & \text{if } s \ge 1 \\ s, & \text{if } -1 < s < 1 \\ -1, & \text{if } s \le -1 \end{cases} \quad (3.14)

Hyperparameter optimization

In order to obtain the hyperparameters for the respective model for each wind farm, a random hyperparameter search was performed for all models, and a hold-out validation set was used to pick the best hyperparameters. Random search has been shown to work better than grid search when not all parameters are equally important [Bergstra and Bengio, 2012].
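Random search can be sketched as follows (the parameter names and ranges are illustrative, not the exact search space used here):

```python
import random

random.seed(0)

def random_search(train_and_score, n_trials=20):
    """Sample hyperparameters at random and keep the configuration with
    the lowest hold-out error (cf. Bergstra and Bengio, 2012)."""
    best_score, best_params = float("inf"), None
    for _ in range(n_trials):
        params = {
            "hidden": random.choice([5, 10, 15, 20, 25]),  # hidden units
            "lr": 10 ** random.uniform(-4, -1),            # learning rate
        }
        score = train_and_score(params)
        if score < best_score:
            best_score, best_params = score, params
    return best_score, best_params

# toy objective standing in for "train the MLP, return validation NRMSE"
score, params = random_search(lambda p: abs(p["hidden"] - 15) + p["lr"])
```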

3.5.2 Numenta Platform for Intelligent Computing

This section describes the key principles introduced with NuPIC. It explains in general terms the theory behind the Online Prediction Framework (OPF)⁴, which uses the CLA and HTM algorithms; the OPF works as an API for creating predictive models. The OPF consists of five major types of components: Encoders, the Spatial Pooler (SP), the Temporal Memory (TM), the Temporal Pooler (TP) and Classifiers. Together these components construct a single CLA region⁵, and figure 3.3 demonstrates the information flow through a single region.

⁴ The OPF is used with Numenta's commercial product GROK.

⁵ Currently, models created with the OPF do not use a TP, nor does this client allow creation of a


Another central concept in NuPIC is the Sparse Distributed Representation (SDR), which refers to the activation of a small percentage of the neurons at any given time. This neuronal activity is represented as an n-dimensional sparse binary vector x = [b_0, \dots, b_n] with around 2% active cells. The outputs from the spatial pooler and the temporal memory are both SDRs, while the output from the encoder does not enforce this and is just a normal binary vector. A general overview of the properties of SDRs is given by Ahmad and Hawkins [2015].

Figure 3.3: Information flow of a single-region predictive model created with the OPF: Encoder → Spatial Pooler → Temporal Memory → Classifier.

The overlap between two different SDRs is defined by

o(x, y) = x \cdot y \quad (3.15)

A match between two SDRs is defined by

m(x, y) := o(x, y) \geq \theta \quad (3.16)
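Equations 3.15 and 3.16 translate directly to code; a small sketch with toy SDRs:

```python
import numpy as np

def overlap(x, y):
    """Number of active bits shared by two SDRs (equation 3.15)."""
    return int(np.dot(x, y))

def match(x, y, theta):
    """Two SDRs match when their overlap reaches the threshold (equation 3.16)."""
    return overlap(x, y) >= theta

x = np.array([1, 0, 1, 0, 1, 0, 0, 1])
y = np.array([1, 0, 1, 0, 0, 0, 0, 1])
shared = overlap(x, y)          # 3 shared active bits
matched = match(x, y, theta=3)  # True
```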


Scalar   Encoding
1        11111000000000
2        01111100000000
...      ...
10       00000000011111

Table 3.3: Examples of scalar values encoded with a ScalarEncoder, where n = 14, r = 5, ψ = 1.

new vector from a set of SDRs. The downside of storing patterns this way is that the more patterns we store, the higher the probability of false positives.

Encoders

NuPIC contains many different encoders⁶; the job of an encoder is to convert raw input into a more suitable representation (i.e., a binary vector). The raw input is fed into the model using a dictionary data structure.

Useful encoders include scalar and categorical encoders, while more exotic encoders include one for Global Positioning System (GPS) coordinates, which can be used to extract information about anomalous movements. The entries in the dictionary of raw inputs are each encoded separately and concatenated using a multi-encoder.

One property of the spatial pooler is that overlapping input patterns are mapped to the same SDR. This means that we want the encoder to encode input so that similar inputs share bits. The ScalarEncoder fulfil this property by the process illustrated in table 3.3.

We calculate the range of values to encode using equation 3.17, where v_min represents the minimum value of the input signal and v_max its upper bound:

v_{range} = v_{max} - v_{min} \quad (3.17)

There are three mutually exclusive parameters that determine the overall size of the output from a scalar encoder: n, r and ψ. n directly represents the total number of bits in the output, and it must be greater than or equal to w, the number of bits that are set to encode a single value, i.e. the “width” of the output signal⁷. r and ψ are specified w.r.t. the input, while w is specified w.r.t. the output. Two inputs separated by more than the radius r have non-overlapping representations; two inputs separated by less than the radius will in general overlap in at least some of their bits. Two inputs separated by at least the resolution ψ are guaranteed to have different representations. ψ can be calculated using equation 3.18:

\psi = \frac{r}{w} \quad (3.18)

Depending on whether we want a periodic behaviour or not, n is calculated slightly differently; see the scalar encoder for implementation details⁸.
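A simplified, non-periodic scalar encoder reproducing the behaviour of table 3.3 might look like this (NuPIC's actual ScalarEncoder also handles periodic ranges and parameter validation):

```python
import numpy as np

def scalar_encode(value, vmin, vmax, n=14, w=5):
    """Encode a scalar as a contiguous run of w active bits inside an
    n-bit vector, so that nearby values share bits (cf. table 3.3)."""
    value = min(max(value, vmin), vmax)              # clip to [vmin, vmax]
    start = int(round((value - vmin) / (vmax - vmin) * (n - w)))
    out = np.zeros(n, dtype=int)
    out[start:start + w] = 1
    return out

low = scalar_encode(1, 1, 10)    # 11111000000000
high = scalar_encode(10, 1, 10)  # 00000000011111
```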

Spatial Pooler

The spatial pooler receives a binary vector as its input from an encoder and outputs an SDR. The structure of the spatial pooler consists of an input space and a set of columns; the output SDR represents which of the columns in a region are active. Each column has synapses connected to the input space. For each column, around 50% of the input space is randomly chosen as potential connections, the so-called “potential pool”. Each synapse connects and disconnects with the input space during learning.

The general information flow of the spatial pooler consists of the following steps. Input stimuli from lower regions activate the input space to which each mini-column is synaptically connected. From this activation, an “overlap score” is computed for each column, based on the total sum of the neurons that try to influence that column, weighted with a “boosting factor” that increases the chance for certain columns to win more easily. A final top percentage (usually around 2%) of the columns with the highest influence (biggest overlap score) is chosen; columns also inhibit nearby columns, and the result of this process is a binary vector with few active bits: an SDR. This is illustrated in equation 3.19, where each b is either 0 or 1 and s is a value representing the score.


\underbrace{[\,b_1\ b_2\ b_3\ \dots\ b_n\,]}_{\text{input vector}}
\cdot
\underbrace{\begin{bmatrix} b_{11} & b_{12} & \dots & b_{1n} \\ b_{21} & b_{22} & \dots & b_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ b_{d1} & b_{d2} & \dots & b_{dn} \end{bmatrix}}_{\text{connected synapses for each column}}
=
\underbrace{[\,s_1\ s_2\ \dots\ s_n\,]}_{\text{overlap score}}
\xrightarrow{\text{inhibition}}
\underbrace{[\,b_1\ b_2\ \dots\ b_n\,]}_{\text{output SDR}}
\quad (3.19)

Learning in this structure is done by adjusting the permanences of each winning column's proximal dendrite to better match the input: the permanence is increased for the synapses that correctly matched the input and decreased for the rest. We also increase the boosting factor of losing columns to give them a bigger chance of winning next time.
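The overlap-and-inhibition step of equation 3.19 can be sketched as follows (the dimensions, the 50% potential-pool density and the crude global inhibition are illustrative; the real SP also tracks permanences and updates boosting):

```python
import numpy as np

rng = np.random.default_rng(0)

def spatial_pool(x, synapses, boost, sparsity=0.02):
    """Compute each column's overlap with the input (equation 3.19),
    weight it by the boosting factor, then keep only the top ~2% of
    columns active (global inhibition)."""
    scores = boost * (synapses @ x)          # overlap score per column
    k = max(1, int(sparsity * scores.size))  # number of winning columns
    sdr = np.zeros(scores.size, dtype=int)
    sdr[np.argsort(scores)[-k:]] = 1
    return sdr

x = rng.integers(0, 2, 200)                              # encoder output
synapses = (rng.random((1024, 200)) < 0.5).astype(int)   # ~50% potential pool
sdr = spatial_pool(x, synapses, boost=np.ones(1024))
```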

Temporal Memory

The temporal memory receives an SDR from the spatial pooler, which represents the active columns; the output of the temporal memory is the activity of the whole CLA region. The general idea of the TM⁹ is that the cells in each mini-column provide temporal context to the pattern identified by the spatial pooler. The TM's job is to form transitions between different SDRs, and it achieves this by allowing active cells to form connections to previously active cells, so that in a future setting each cell is able to predict its own activity.

Each cell in a region has several distal segments, ideally one for each pattern the cell has transitioned from; in practice each segment is connected to a couple of patterns. A cell enters a predictive state if there is enough activity on a segment, and every segment consists of a collection of synapses that have been formed to a subset of previously active cells (typically around 10–15 cells).

The Temporal Memory receives a sparse binary vector from the spatial pooler, and the general information flow for this part of the CLA goes through two phases. The first phase has the following steps. 1) For each active column, check whether there is any cell in a predictive state; if there is a cell in a predictive

⁹ There are many details in how the Temporal Memory is implemented; pseudo-code and more details can be found in [Numenta, 2011]. The NuPIC git repository is the best source for finer details.


state, then 2) activate that particular cell. If no cell was found in a predictive state, then 3) activate all cells in that particular column, a process called bursting.

Bursting is analogous to saying: we do not know which cell to activate because we have not seen this instance of the sequence of patterns before, and we are unable to put the column into the correct temporal context, so we activate all cells to reflect this uncertainty. Bursting does not occur when the temporal context has been predicted by the CLA. Equation 3.20 shows the first phase.

\underbrace{[\,b_1\ b_2\ b_3\ \dots\ b_n\,]}_{\text{SP SDR}}
\circ
\underbrace{\begin{bmatrix} b_{11} & b_{12} & \dots & b_{1n} \\ b_{21} & b_{22} & \dots & b_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ b_{d1} & b_{d2} & \dots & b_{dn} \end{bmatrix}}_{\text{predictive state}}
=
\underbrace{\begin{bmatrix} b_{11} & b_{12} & \dots & b_{1n} \\ b_{21} & b_{22} & \dots & b_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ b_{d1} & b_{d2} & \dots & b_{dn} \end{bmatrix}}_{\text{active state}}
\quad (3.20)

The second phase of the algorithm figures out which cells should be put into a predictive state for the next time step. This is achieved by checking every distal segment on every cell for activity above a certain threshold, i.e., checking for active segments. One active segment is enough to put a cell into a predictive state. The output from the temporal memory is a vector representing the state of all cells in that region. Equation 3.21 shows phase two.

\underbrace{[\,b_1\ b_2\ b_3\ \dots\ b_n\,]}_{\text{active state}}
\cdot
\underbrace{\begin{bmatrix} b_{11} & b_{12} & \dots & b_{1n} \\ b_{21} & b_{22} & \dots & b_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ b_{d1} & b_{d2} & \dots & b_{dn} \end{bmatrix}_X}_{\text{segment } X}
=
\underbrace{[\,b_1\ b_2\ b_3\ \dots\ b_n\,]_X}_{\text{segment activation } X}
> \tau \rightarrow
\underbrace{[\,s_1\ s_2\ s_3\ \dots\ s_n\,]}_{\text{predictive state}}
\quad (3.21)
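Phase two can be sketched as follows (a simplified segment representation; real TM segments store per-synapse permanences):

```python
import numpy as np

def predictive_cells(active_state, segments, theta):
    """Phase-two sketch: a cell enters the predictive state if any of its
    distal segments has at least theta synapses onto currently active
    cells (equation 3.21)."""
    predictive = np.zeros(active_state.size, dtype=int)
    for cell, cell_segments in enumerate(segments):
        for seg in cell_segments:        # seg: indices of presynaptic cells
            if active_state[seg].sum() >= theta:
                predictive[cell] = 1
                break
    return predictive

active = np.array([1, 1, 0, 1, 0, 0])
segments = [[[0, 1]], [], [[1, 3]], [], [[2, 4]], []]
pred = predictive_cells(active, segments, theta=2)   # cells 0 and 2 predicted
```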


NuPIC Classifiers

There are many different classifiers included with NuPIC, such as a k-Nearest Neighbour (kNN) classifier and a Support Vector Machine (SVM); the OPF in particular uses a custom-built classifier called the “CLAClassifier”. The purpose of this classifier is to decode predictions made by the CLA. Classification with the CLAClassifier is performed in different ways depending on how the input has been encoded.

Scalar values are decoded using a process where each cell in a CLA region is paired with two histograms. One histogram keeps track of the frequency of encountered patterns associated with each cell, and the other keeps track of a moving average for each bucket. This process is illustrated in figure 3.4.

Figure 3.4: The CLAClassifier (two histograms per cell: one for the likelihood per bucket, one for a moving average of the bucket values between minvalue and maxvalue).

Training in NuPIC

Training the NuPIC model is done using online learning algorithms. The schema seen in figure 3.5 is used to train and test these models. We have 7 different wind farms, so this schema is repeated for each wind farm.


every step ahead in the CLAClassifier, which would give us 48 · 7 = 336 models in total (for all the wind farms). Not only does this change match Expektra's approach, but it also reduces the memory requirements for the CLAClassifier.

Figure 3.5: Training an OPF model. A pre-training data chunk and a hyperparameter setup (PSO swarming or manual) initialize the model; online learning is activated during the training phase (training data stream) and deactivated during the testing phase, which produces multi-step predictions on the test data.

Input and hyperparameter selection


Chapter 4

Results

This chapter contains the primary results obtained in this study. Figures 4.4–4.10 contain different error measurements on the test set for each wind farm. We see that Expektra's ANN model performs well over all wind farms, while the NuPIC model performs worse than Expektra but in general still better than our reference model. NuPIC is off target, with a bias error on most wind farms. The graphs also indicate that NuPIC is unable to pick up some trends in the cumulated squared-error graph. The Expektra model, on the other hand, shows no clear problems in the cumulated squared error and is on target on all wind farms; appendix C has been included to reflect this at different lead times.

[Figure 4.1: cumulative probability distribution of wind speed (m/s) for Wind Farm 1.]


Figure 4.2: Left diagram: wind speed vs. production. Right diagram: wind speed vs. direction. Wind Farm 1, GEFCom dataset (hourly data from January 1, 2010 to January 1, 2012).


[Figure 4.4: NBIAS, NRMSE and NMAE vs. look-ahead time k (in hours), and cumulated squared error over time (in hours), for Wind Farm 1 (Expektra, NuPIC, Persistence).]


[Figure 4.5: NBIAS, NRMSE and NMAE vs. look-ahead time k (in hours), and cumulated squared error over time (in hours), for Wind Farm 2 (Expektra, NuPIC, Persistence).]


[Figure 4.6: NBIAS, NRMSE and NMAE vs. look-ahead time k (in hours), and cumulated squared error over time (in hours), for Wind Farm 3 (Expektra, NuPIC, Persistence).]


[Figure 4.7: NBIAS, NRMSE and NMAE vs. look-ahead time k (in hours), and cumulated squared error over time (in hours), for Wind Farm 4 (Expektra, NuPIC, Persistence).]


[Figure 4.8: NBIAS, NRMSE and NMAE vs. look-ahead time k (in hours), and cumulated squared error over time (in hours), for Wind Farm 5 (Expektra, NuPIC, Persistence).]


[Figure 4.9: NBIAS, NRMSE and NMAE vs. look-ahead time k (in hours), and cumulated squared error over time (in hours), for Wind Farm 6 (Expektra, NuPIC, Persistence).]


[Figure 4.10: NBIAS, NRMSE and NMAE vs. look-ahead time k (in hours), and cumulated squared error over time (in hours), for Wind Farm 7 (Expektra, NuPIC, Persistence).]


User          | 1     | 2     | 3     | 4     | 5     | 6     | 7     | All
Leustagos     | 0.145 | 0.138 | 0.168 | 0.144 | 0.158 | 0.133 | 0.140 | 0.146
DuckTile      | 0.143 | 0.145 | 0.172 | 0.145 | 0.165 | 0.137 | 0.146 | 0.148
MZ            | 0.141 | 0.151 | 0.174 | 0.145 | 0.167 | 0.141 | 0.145 | 0.149
Propeller     | 0.144 | 0.153 | 0.177 | 0.147 | 0.175 | 0.141 | 0.147 | 0.152
Duehee Lee    | 0.157 | 0.144 | 0.176 | 0.160 | 0.169 | 0.154 | 0.148 | 0.155
Expektra      | 0.165 | 0.158 | 0.184 | 0.164 | 0.179 | 0.153 | 0.153 | 0.165
MTU EE5260    | 0.161 | 0.172 | 0.193 | 0.162 | 0.192 | 0.156 | 0.160 | 0.168
SunWind       | 0.174 | 0.177 | 0.193 | 0.176 | 0.179 | 0.157 | 0.162 | 0.172
ymzsmsd       | 0.163 | 0.186 | 0.200 | 0.164 | 0.192 | 0.162 | 0.167 | 0.174
4138 Kalchas  | 0.180 | 0.179 | 0.197 | 0.175 | 0.200 | 0.160 | 0.165 | 0.177
NuPIC         | 0.243 | 0.254 | 0.264 | 0.310 | 0.290 | 0.224 | 0.240 | 0.264
Persistence   | 0.302 | 0.338 | 0.373 | 0.364 | 0.388 | 0.341 | 0.361 | 0.355

Table 4.1: NRMSE score per wind farm (columns 1–7) of the entries published in [Hong et al., 2014]; the NuPIC and Expektra models are added so the results can easily be compared.

4.1 Experimental results


4.2 Input Importance

For interpretation purposes, in order to understand the model better, an analysis of the relative importance of the input parameters was performed. This analysis is illustrated in figure 4.11. Each box represents added noise on that channel, which results in a higher NRMSE score if that feature was important. A reference point, “all-channels”, represents the error distribution of the model with no input replacement. We clearly see in this figure that the wind speed channel ws is the most important attribute: adding noise to this channel greatly affects the NRMSE score. We also see that the wind components u and v show little to no influence, and that the timestamp-related inputs hours and week both indicate that there are seasonal and daily trends present in the dataset.

[Figure 4.11: NRMSE distribution when noise replaces each input channel (all-channels, hours, u, v, week, ws, ws−1, ws−2, ws−3, ws+1, ws+2, ws+3).]
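The noise-replacement analysis can be sketched as follows (cf. Kemp et al. [2007]; the toy model below stands in for the trained MLP):

```python
import numpy as np

rng = np.random.default_rng(0)

def nrmse(y_true, y_pred):
    """Root-mean-square error normalized by the range of the target."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2)) / (y_true.max() - y_true.min())

def input_importance(model, X, y, n_repeats=30):
    """Replace one input channel at a time with noise and measure how much
    the NRMSE degrades; important channels degrade the score the most."""
    base = nrmse(y, model(X))
    scores = []
    for j in range(X.shape[1]):
        errs = []
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.normal(size=len(X))   # destroy channel j
            errs.append(nrmse(y, model(Xp)))
        scores.append(np.mean(errs) - base)
    return scores

model = lambda X: X[:, 0]          # toy "trained model": output = channel 0
X = rng.normal(size=(100, 2))
scores = input_importance(model, X, y=model(X))
```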


4.2.1 Adaptation and Optimization

All training for Expektra's model was performed on a MacBook Pro with an Intel Core 2 Duo processor (P8600 @ 2.40 GHz) and a 64-bit operating system with 4 GB of RAM. After adapting the code to run experiments on the GEFCom dataset, an investigation was performed to evaluate the performance of the core source code provided by Expektra. This investigation identified some slow parts of the code; after addressing these issues, mainly by using a Math.NET native library instead of the managed provider, a speed test was performed comparing the unoptimized version with the optimized one. Figure 4.12 shows the speed-up achieved during training with the LM algorithm.

Figure 4.12: Training time for the unoptimized vs. the optimized version, for networks with 10 input neurons; 10, 15, 20 or 25 hidden neurons; and 1 output neuron. Training was done for 100 epochs using LM.

4.3 Summary


horizon, and NuPIC performs better than persistence towards the end of the forecast.

[Figure: improvement over persistence, measured as improvement (%) in NRMSE vs. look-ahead time (in hours), for Expektra and NuPIC.]
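The improvement measure plotted above can be computed as follows (using the overall NRMSE scores from table 4.1 as a toy input):

```python
import numpy as np

def improvement_over_persistence(nrmse_model, nrmse_persistence):
    """Percentage improvement in NRMSE over the persistence reference,
    per look-ahead time (cf. Madsen et al., 2005)."""
    m = np.asarray(nrmse_model, dtype=float)
    p = np.asarray(nrmse_persistence, dtype=float)
    return 100.0 * (p - m) / p

imp = improvement_over_persistence([0.165], [0.355])  # about 53.5%
```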


Chapter 5

Discussion

With the exponential growth of computing power it becomes easier to study deeper and more complex networks, and with recent advances in the area of deep learning a new interest in ANNs has resurged. In practice, ANNs have been used by most groups in the field of WPF, but these networks never caught on, as it was argued that the improvements made by ANNs were usually not enough to justify the extra effort of training them [Giebel et al., 2011]. This is steadily becoming less of an issue as computation becomes cheaper and bigger networks perform better.

By using a simple MLP network we observe that we are able to obtain results similar in performance to the other models published in Hong et al. [2014]. This can be used as a starting point for further investigations. The GEFCom dataset has a limited number of input features; if we had access to a wider range of them, the power of neural networks could be studied more in depth. Given the number of features in the dataset, this is harder to do.

Using persistence as the reference model was done in order to have the same baseline as the one used in GEFCom, but a better reference model should be considered in the future, such as the one presented in Nielsen et al. [1998], especially if longer forecasts are to be considered.

5.1 Method development issues

Working with Expektra's model is very straightforward and no major issues were encountered during development, but some issues were encountered working with NuPIC.¹

1It should also be pointed out that it helps to have the people who developed the code on the


1. Installing NuPIC is difficult. This seems to be a general problem, judging by posts on the nupic mailing list². A very important issue to fix if they want more people to work with this model.

2. Using the built-in swarming (PSO) did not work well, especially for slightly larger datasets and for 48 prediction steps. The main problems were fitting everything into memory and that the code ran too slowly when swarming over multiple steps ahead, making it practically impossible to use. Scaling down the problem to just a few prediction steps and multiple models helped, but the swarm model kept dismissing wind speed as an important feature, which was fixed by explicitly telling it not to touch that encoder.

3. Inconsistencies between the white-paper documentation of HTM/CLA and the actual implementation make it hard to understand the underlying principles of NuPIC. NuPIC presents a very complex network, and any inconsistencies in the documentation make it hard to understand. The code is quite well documented, but the additional material is a bit sparse.

These issues are all understandable and are continuously being improved upon; NuPIC is still a young platform, so issues are to be expected given the complexity and research nature of the project. Training the CLAClassifier to use many-step predictions instead of just one will most likely reduce the bias error seen in the results. To investigate this, a more powerful computer is needed.

There are some differences in the inputs between the models. This setup was created because an OPF model needs to encode the value it is going to predict. SCADA data is available, so this is not an issue, but it may introduce some unfairness between NuPIC and Expektra. In the current setup NuPIC does not seem to gain anything extra from the additional information, and other models in the competition most likely used different inputs as well.

In general, NuPIC is different in the way you feed it data: you send in the front of the signal, and temporal context is achieved through the temporal memory.

5.2 Future improvements and directions

One issue with the GEFCom dataset is that it does not contain any specific information about the locations of the 7 wind farms. It is worth considering how to handle different models that are specialized for different types of terrain, as this has been shown to increase performance [Kariniotakis et al., 2006]. Another thing to investigate could be a more advanced schema for hyperparameter selection. Jursa and Rohrig [2008] show that the WPF error can be reduced by smarter use of optimization algorithms for feature and hyperparameter selection, which should improve performance.

In general, having more types of inputs, from sources like SCADA and NWP systems, could give a lot of valuable information for better predictions. It is clear that wind speed is a key feature for producing good forecasts, which is also supported by the input analysis in section 4.2. It has been argued that the error in WPF models stems largely from wrong weather forecasts [Giebel et al., 2011], so identifying good sources of weather forecasts would be desirable for more effective model learning. Combining several different weather forecasts could also be investigated, as Nielsen et al. [2007] showed that power forecasts based on a number of different meteorological forecasts were better than those from a single source.

Regarding HTM/CLA, it is worth considering implementing a custom encoder specifically targeted at wind farm data. SCADA data could also be streamed directly into a NuPIC model, which would allow for some interesting online predictions and anomaly detection.

Training times are always an issue, especially for very large models. Both NuPIC and Expektra's model would benefit from being implemented in a more parallel fashion, as these algorithms are very well suited for parallel computing. The speed-up would be of great value, so implementing a GPU version of the algorithms could be worthwhile.

5.3 Conclusions


Bibliography

Subutai Ahmad and Jeff Hawkins. Properties of sparse distributed representations and their application to hierarchical temporal memory. arXiv preprint arXiv:1503.07469, 2015.

TC Akinci. Short term wind speed forecasting with ann in batman, turkey. Elektronika ir Elektrotechnika, 107(1):41–45, 2015.

E Michael Azoff. Neural network time series forecasting of financial markets. John Wiley & Sons, Inc., 1994.

James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 13(1):281–305, 2012.

Daniel P Buxhoeveden and Manuel F Casanova. The minicolumn hypothesis in neuroscience. Brain, 125(5):935–951, 2002.

Erasmo Cadenas and Wilfrido Rivera. Short term wind speed forecasting in la venta, oaxaca, méxico, using artificial neural networks. Renewable Energy, 34(1): 274–278, 2009.

Dmitri B Chklovskii, BW Mel, and K Svoboda. Cortical rewiring and information storage. Nature, 431(7010):782–788, 2004.

A. Costa, A. Crespo, J. Navarro, G. Lizcano, H. Madsen, and E. Feitosa. A review on the young history of the wind power short-term prediction. Renewable and Sustainable Energy Reviews, 12(6):1725–1744, August 2008. ISSN 13640321. doi: 10.1016/j.rser.2007.01.015. URL http://dx.doi.org/10.1016/j.rser.2007.01.015.

Russ C Eberhart and James Kennedy. A new optimizer using particle swarm theory. In Proceedings of the sixth international symposium on micro machine and human science. IEEE, 1995.

Shu Fan, James R Liao, Ryuichi Yokoyama, Luonan Chen, and Wei-Jen Lee. Forecasting the wind generation using a two-stage network based on meteorological information. Energy Conversion, IEEE Transactions on, 24(2):474–482, 2009.

Ulrich Focken, Matthias Lange, and Hans-Peter Waldl. Previento - a wind power prediction system with an innovative upscaling algorithm. In Proceedings of the European Wind Energy Conference, Copenhagen, Denmark, volume 276. Citeseer, 2001.

Lionel Fugon, Jérémie Juban, and Georges Kariniotakis. Data mining for wind power forecasting. In European Wind Energy Conference & Exhibition EWEC 2008, pages 6–pages. EWEC, 2008.

Patrick Gabrielsson, Rikard König, and Ulf Johansson. Evolving hierarchical temporal memory-based trading models. Springer, 2013.

MA Gaertner, C Gallardo, C Tejeda, N Martínez, S Calabria, N Martínez, and B Fernández. The casandra project: results of wind power 72-h range daily operational forecasting in spain. In European Wind Energy Conference, 2003.

Gregor Giebel, Lars Landberg, Alfred Joensen, Torben Skov Nielsen, and Henrik Madsen. The zephyr project, the next generation prediction system. Wind Power for the 21st Century, Kassel, 2000.

Gregor Giebel, Lars Landberg, Torben Skov Nielsen, and Henrik Madsen. The zephyr-project: The next generation prediction system. In Proc. of the 2001 European Wind Energy Conference, EWEC'01, Copenhagen, Denmark, pages 777–780, 2001.

Gregor Giebel, Richard Brownsword, George Kariniotakis, Michael Denhard, and Caroline Draxl. The state-of-the-art in short-term prediction of wind power: A literature overview. Technical report, ANEMOS. plus, 2011.

Gerardo Gonzalez, Belen Diaz-Guerra, Fernando Soto, Sara Lopez, Ismael Sanchez, Julio Usaola, Monica Alonso, and Miguel G Lobo. Sipreólico - wind power prediction tool for the spanish peninsular power system. Proceedings of the CIGRÉ 40th general session and exhibition, Paris, France, 2004.

Jeff Hawkins and Sandra Blakeslee. On intelligence. Macmillan, 2007.

Tao Hong, Pierre Pinson, and Shu Fan. Global energy forecasting competition 2012. International Journal of Forecasting, 30(2):357–363, 2014.

Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989.

Daizheng Huang, Renxi Gong, and Shu Gong. Prediction of wind power by chaos and bp artificial neural networks approach based on genetic algorithm. Journal of Electrical Engineering & Technology, 10(1):41–46, 2015.

Ashu Jain and Avadhnam Madhav Kumar. Hybrid neural network models for hydrologic time series forecasting. Applied Soft Computing, 7(2):585–592, 2007.

René Jursa and Kurt Rohrig. Short-term wind power forecasting using evolutionary algorithms for the automated specification of artificial intelligence models. International Journal of Forecasting, 24(4):694–709, 2008.

René Jursa et al. Wind power prediction with different artificial intelligence models. In Proceedings of the European Wind Energy Conference EWEC’07, 2007.

G Kariniotakis. Position paper on joule project jor3-ct96-0119. 1997.

Georges Kariniotakis, J Halliday, R Brownsword, Ignacio Marti, Ana Maria Palomares, I Cruz, H Madsen, TS Nielsen, Henrik Aa Nielsen, Ulrich Focken, et al. Next generation short-term forecasting of wind power - overview of the anemos project. In European Wind Energy Conference, EWEC 2006, pages 10–pages, 2006.

GN Kariniotakis, GS Stavrakakis, and EF Nogaret. Wind power forecasting using advanced neural networks models. Energy conversion, ieee transactions on, 11(4):762–767, 1996.

Stanley J Kemp, Patricia Zaradic, and Frank Hansen. An approach for determining relative input parameter importance and significance in artificial neural networks. Ecological Modelling, 204(3):326–334, 2007.

Andrew Kusiak, Haiyang Zheng, and Zhe Song. Wind farm power prediction: a data-mining approach. Wind Energy, 12(3):275–293, 2009.

Lars Landberg and Simon J Watson. Short-term prediction of local wind conditions. Boundary-Layer Meteorology, 70(1-2):171–195, 1994.


SM Lawan, WAWZ Abidin, WY Chai, A Baharun, and T Masri. Different models of wind speed prediction; a comprehensive review. International Journal of Scientific & Engineering Research, 5(1):1760–1768, 2014.

Duehee Lee and Ross Baldick. Short-term wind power ensemble prediction based on gaussian processes and neural networks. IEEE Transactions on Smart Grid, 5(1):501–510, 2014.

Ritchie Lee and Mariam Rajabi. Assessing NuPIC and CLA in a machine learning context using NASA aviation datasets. 2014.

Kenneth Levenberg. A method for the solution of certain problems in least squares. Quarterly of Applied Mathematics, 2:164–168, 1944.

Gong Li and Jing Shi. On comparing three artificial neural networks for wind speed forecasting. Applied Energy, 87(7):2313–2320, 2010.

Ziqiao Liu, Wenzhong Gao, Yih-Huei Wan, and Eduard Muljadi. Wind power plant prediction by using neural networks. In Energy Conversion Congress and Exposition (ECCE), 2012 IEEE, pages 3154–3160. IEEE, 2012.

Henrik Madsen, Pierre Pinson, George Kariniotakis, Henrik Aa Nielsen, and Torben S Nielsen. Standardizing the performance evaluation of short-term wind power prediction models. Wind Engineering, 29(6):475–489, 2005.

Donald W Marquardt. An algorithm for least-squares estimation of nonlinear parameters. Journal of the Society for Industrial & Applied Mathematics, 11(2):431–441, 1963.

Warren S McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5(4):115–133, 1943.

Marvin Minsky and Seymour Papert. Perceptrons. 1969.

Marvin Lee Minsky and Oliver G Selfridge. Learning in random nets. MIT Lincoln Laboratory, 1960.

Jorge J Moré. The Levenberg-Marquardt algorithm: implementation and theory. In Numerical Analysis, pages 105–116. Springer, 1978.
