The relationship between weather forecasts and observations for predicting electricity output from wind turbines

ALEXANDER STAMP

TSCRM

Date: October 5, 2017
Supervisor: Pawel Herman
Examiner: Olov Engwall
Principal: Expektra AB

Swedish title: Förhållandet mellan väderprognoser och observationer för att förutsäga elproduktion från vindkraftverk

CSC


Abstract

Wind power production is of growing importance to many countries around the world. To improve the reliability and stability of power grids that include wind power, wind power forecasting is becoming an important commercial and research area. Machine learning methods are considered highly valuable for making predictions on time series data and have therefore become prominent within wind forecasting as well.

This thesis extends an existing neural network prediction system with new input data series, in particular the observed wind speed from the wind farm itself. The goal was to investigate the effect this new data series has, and whether it could be used to improve predictions compared to the baseline prediction system defined within this thesis.

To do this, multiple methods of including the observed wind speed are developed, including a multi-stage network concept. The results are statistically tested to strengthen their comparison to the baseline. They show that the multi-stage network concept can use the observed wind speed to improve performance over the baseline case for specific prediction horizons.


Sammanfattning

The importance of wind power production is growing in countries around the world. To improve the reliability and grid stability of wind power, forecasting it is becoming commercially important and an area of research. Machine learning methods are considered highly valuable for making predictions on time series data and have thereby become prominent within wind forecasting.

This work extends an existing neural network prediction system with new input data, in particular the observed wind speed from the wind farm itself. The goal was to investigate the effect of this new data series, and whether it could be used to improve the predictions compared to the baseline forecasting system defined in this thesis.

To make this possible, several methods for including the observed wind speed are developed, including a multi-stage network concept. These results are statistically tested to give a firmer basis for their comparison with the reference model. The results show that the multi-stage network concept can use the observed wind speed to improve performance over the reference model for specific prediction horizons.


Acknowledgements

This Master’s Thesis was performed at Expektra AB and the CSC department at KTH. I would like to thank all of the people at Expektra, but in particular my supervisor at Expektra, Mattias Jonsson, for the practical and theoretical assistance that allowed me to carry out the thesis work. Thank you for answering my many questions. I would also like to thank my supervisor at KTH, Pawel Herman, for his deep insight into machine learning and excellent advice on scientific matters.

Finally, I would like to thank my parents for supporting me during the Master’s degree programme at KTH.


Contents

List of Figures
List of Tables

1 Introduction
1.1 Problem Formulation
1.2 Problem Scope and Limitations

2 Background
2.1 Time Series Modelling of Wind Power Predictions
2.2 Neural Network Wind Power Forecasting
2.3 Advanced Machine Learning Methods for Time Series Prediction

3 Methods
3.1 Preliminaries
3.1.1 Definitions
3.1.2 Reference Models
3.1.3 Error Metrics
3.1.4 Model Selection and Evaluation
3.2 Experiments
3.2.1 Data Import
3.2.2 Equipment
3.3 Optimization Methods
3.4 ANN Models
3.4.1 Data Scaling
3.4.2 Considered Extensions to MLP
3.5 Statistical Testing of Performance

4 Results
4.1 Results of the Baseline Models
4.2 Results of Augmented Baseline Models
4.2.1 Replacing Forecasts with Observations
4.2.2 Observations as Extra Model Input
4.2.3 Correlation Analysis of Inputs and Outputs
4.3 Results of Multiple Stage Models
4.3.1 First Stage Models
4.3.2 Second Stage Models - Extra Input
4.3.3 Second Stage Models - Replacing Forecasts
4.3.4 Comparison to All-Inputs Model
4.4 Statistical Testing of Model Distributions
4.4.1 Kruskal-Wallis Test
4.4.2 Friedman Test
4.5 Summary

5 Discussion
5.1 Interpretation of Results
5.1.1 Sustainability and Ethics of Results
5.2 Future Work
5.3 Final Conclusions

Bibliography

A Supplementary Material
A.1 All Model Trials Performance Data
A.2 Scatter Plots of NWP Forecasts vs Observed Wind Speed
A.2.1 1 and 6 Hours Ahead
A.2.2 24 Hours Ahead


List of Figures

1.1 General statistical forecasting approach ([Svensson, 2015])
2.1 Summary of Different Approaches to Forecasting Wind Power ([Giebel et al., 2011])
2.2 General Architecture of MLP ([Catalão et al., 2011])
2.3 Cascade Network Structures for Transformer Condition Regression ([Shaban et al., 2009])
3.1 Block Diagram for Training of ANN using Levenberg-Marquardt Algorithm ([Yu and Wilamowski, 2011])
3.2 Difference between Model Type 4 and 5
4.1 Training and Testing Error vs Neuron Number for Multiple Trials - (a) 1 hour ahead Model 1.1 (b) 6 hours ahead Model 1.2 (c) 24 hours ahead Model 1.3. Red is training performance and blue is test performance.
4.2 Training and Testing Error vs Neuron Number for Multiple Trials - (a) 1 hour ahead Model 2.1 (b) 6 hours ahead Model 2.2 (c) 24 hours ahead Model 2.3. Red is training performance and blue is test performance.
4.3 Training and Testing Error vs Neuron Number for Multiple Trials - (a) 1 hour ahead Model 3.1 (b) 6 hours ahead Model 3.2 (c) 24 hours ahead Model 3.3. Red is training performance and blue is test performance.
4.4 Training and Testing Error vs Neuron Number for Multiple Trials - (a) 1 and 6 hours ahead, Models 4.11-4.12 (b) 24 hours ahead, Model 4.13. Red is training performance and blue is test performance.
4.5 Scatter Plots of NWP vs Observed Wind Speed - (a) 1 and 6 hour ahead plot of best single NWP vs observed wind speed (b) 1 and 6 hour ahead plot of network stage 1 vs observed wind speed (c) 24 hour ahead plot of best single NWP vs observed wind speed (d) 24 hour ahead plot of network stage 1 vs observed wind speed
4.6 Training and Testing Error vs Neuron Number for Multiple Trials - (a) 1 hour ahead Model 4.21 (b) 6 hours ahead Model 4.22 (c) 24 hours ahead Model 4.23. Red is training performance and blue is test performance.
4.7 Training and Testing Error vs Neuron Number for Multiple Trials - (a) 1 hour ahead Model 4.31 (b) 6 hours ahead Model 4.32 (c) 24 hours ahead Model 4.33. Red is training performance and blue is test performance.
4.8 Training and Testing Error vs Neuron Number for Multiple Trials - (a) All inputs model 1 hour ahead (b) All inputs model 6 hours ahead (c) All inputs model 24 hours ahead. Red is training performance and blue is test performance.
4.9 1 Hour Ahead, Multi Comparison Test Results Per Model Class based on Kruskal-Wallis Test. 2 groups have means significantly different from Model 1.1. Y axis is model category and x axis is column ranking.
4.10 6 Hours Ahead, Multi Comparison Test Results Per Model Class based on Kruskal-Wallis Test. 3 groups have means significantly different from Model 1.2. Y axis is model category and x axis is column ranking.
4.11 24 Hours Ahead, Multi Comparison Test Results Per Model Class based on Kruskal-Wallis Test. 3 groups have means significantly different from Model 1.3. Y axis is model category and x axis is column ranking.
4.12 All time horizons unified, Post-hoc Comparison Test Results Per Model Class based on Friedman Test with Bonferroni-Dunn correction. Y axis is model category and x axis is column ranking.
A.1-A.36 Scatter Plots of NWP Forecast vs Observed Wind Speed (A.17 and A.35 plot the observed wind speed against itself)

List of Tables

4.1 1 Hour Ahead Correlation between NWP Predictions and Observed Wind Speed to 0.95 Confidence
4.2 6 Hour Ahead Correlation between NWP Predictions and Observed Wind Speed to 0.95 Confidence
4.3 24 Hour Ahead Correlation between NWP Predictions and Observed Wind Speed to 0.95 Confidence
4.4 Linear Correlation Coefficients to Observed Wind Speeds - NWP versus Combined NWP
4.5 Comparison of Multi-stage Models with Baseline and All Inputs Model by Coefficient of Determination
4.6 Summary of Absolute Maximum Coefficient of Determination Values Per Model Type and Prediction Horizon
4.7 Summary of Best Hidden Layer Size Per Model Type and Prediction Horizon
4.8 Summary of Differences between Model Maximum Coefficient of Determination and Baseline Values Per Model Type and Prediction Horizon
A.1-A.2 1 Hour Ahead Prediction Model Trials Coefficient of Determination: Best Model Structure
A.3-A.4 6 Hours Ahead Prediction Model Trials Coefficient of Determination: Best Model Structure
A.5-A.6 24 Hours Ahead Prediction Model Trials Coefficient of Determination: Best Model Structure


Chapter 1

Introduction

Since the 1990s, wind turbines and other sustainable energy sources have become of increasing importance in efforts to reduce carbon emissions and meet other environmental goals. However, one of the key problems of wind power, and of many other such energy production methods such as solar photovoltaics, is that they are quite unpredictable and generally remain undispatchable.

There are many approaches to dealing with the unpredictability of wind power, but one interesting method is to use advanced prediction systems to enhance grid planning. As [Giebel et al., 2011] puts it, “short-term prediction, in sync with the rise of wind power penetration in more and more countries, has risen from being a fringe topic for the few utilities with high levels of wind power in the grid, to being a central tool to many. . . ”.

Academic interest around weather and power output prediction has grown at the same time. [Landberg et al., 2003] produced an earlier academic overview of the wind power prediction field in 2003. [Giebel et al., 2011] considered it very impractical to review all papers in the wind forecasting field due to the sheer increase in the volume of published papers. They also considered the field to be maturing, though in their view optimal usage of numerical weather prediction (NWP) forecasts by end users still remained a large problem.

Expektra AB is a company focused on using machine learning methods, predominantly artificial neural networks (ANNs), to make predictions in the power industry. They use weather forecasts for their day-ahead predictions of wind turbine power output; achieving very high accuracy with this approach is difficult, given the long prediction horizon for what are primarily statistical methods and the high variability of wind power output.

This service is useful to operators and their customers, who can act with greater confidence in their power market transactions by having a better idea of the likely power output of the turbine farm. Expektra's objective is to improve its predictions to offer more value to its customers, with secondary benefits of increased grid stability and predictability. Expektra has access to some weather observation data, specifically wind speed, from the turbine Supervisory Control And Data Acquisition (SCADA) system, in addition to forecast data. This data is not currently being used, and there is little research on how such data can be used within wind forecasting systems. As such, it remains an interesting research question how to utilise this data within prediction systems and whether it has any impact on prediction system performance.

There are several ways to produce wind power output predictions (Fig. 1.1).


When using statistical models, the wind power output predictions can be produced with the following methods:

1. Train the model on weather forecasts of the same quality as the ones used when making live electricity predictions.

2. Train the model on the best available historical weather forecasts, in order to better incorporate the “real” weather dependency, and let the worse weather forecasts used when making live predictions result in an analogous electricity prediction error.

3. Train on observations of the same weather phenomena, such as wind speed observations, but use weather forecasts, such as wind speed forecasts, instead of observations when making the live electricity predictions.

The thesis project will investigate some of these approaches and consider different methods of integrating Supervisory Control And Data Acquisition (SCADA) data, most specifically weather observation data, into Expektra's existing modelling system. The intuition is that using this new data, specifically wind speed observations, will improve the models, as it provides additional knowledge of the relationship between the NWP data, which is forecast data, and the actual site winds, one of the main determinants of the actual wind power produced.

Figure 1.1: General statistical forecasting approach ([Svensson, 2015])

1.1 Problem Formulation

The leading question of this project can be stated as follows:

“What is the effect of incorporating wind speed observation data from the site SCADA system into a wind turbine prediction model that relies upon weather forecasts?”

It is worth noting that the baseline wind turbine prediction model here is considered to be a model structurally similar to Expektra's highest performing models over different prediction horizons, i.e. with the same inputs and time delays but not the exact same neuron numbers.

The project implies a requirement to establish a suitable neural network approach that effectively incorporates the SCADA wind speed observation data. The question will also be examined at a range of prediction time horizons (i.e. 1 hour ahead, 6 hours ahead and 24 hours ahead).

Addressing the proposed research question requires solving a number of practical challenges. Data shifting and cleaning, based on observations of the data, is required to obtain useful data from the many hundreds of dimensions present in the raw data.

1.2 Problem Scope and Limitations

The project will be limited in the types of machine learning methods considered to investigate the research question: artificial neural networks (ANNs) will be the only machine learning method used within this project.

Furthermore, time horizons for predictions will be limited to at most three classes: one hour ahead, six hours ahead and 24 hours ahead.

In terms of data, limited numbers of delay terms on the inputs will be considered, and only a limited set of features will be considered from the available data series. This thesis focuses on the addition of the SCADA observed wind speed, so other data series are only used where they relate to the observed wind speed, i.e. NWP wind speed forecast data.


Chapter 2

Background

2.1 Time Series Modelling of Wind Power Predictions

The context for the research question is delimited by the time series modelling approach to analysing wind power, as opposed to the NWP modelling domain (Fig. 2.1). NWP modelling and time series modelling are two subsets of the field of wind forecasting, which is widely covered in the literature review of [Giebel et al., 2011]. One of the very first papers in the time series wind power modelling domain came in 1984 with [Brown et al., 1984] and, as [Giebel et al., 2011] explains, it was “using a transformation to a Gaussian distribution of the wind speeds, forecasting with a AR (AutoRegressive) process, upscaling with the power law, and then predicting power using a measured power curve.”

Power curve modelling, i.e. the output process that takes the wind speed and direction predictions and converts them into power, is a whole research field in itself.

Figure 2.1: Summary of Different Approaches to Forecasting Wind Power ([Giebel et al., 2011])

These methods were expanded by others in the field to use Kalman filters, autoregressive integrated moving average (ARIMA) models, Box-Jenkins models and other techniques as time passed. The time period for forecasting was up to 48 hours ahead [Kavasseri and Seetharaman, 2009], though some of the methods were used with only a few hours' time horizon and so are likely not as relevant to the specific research question here. These methods were typically analysed against the persistence method (persisting the previous time period's wind speed for the forecast) or the improved persistence method, which uses a moving average term and is otherwise known as the New Reference Model [Nielsen et al., 1998].
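The two reference methods can be sketched as follows. The function names, the 24-hour window and the fixed linear decay toward the mean are illustrative assumptions only; the actual New Reference Model of [Nielsen et al., 1998] weights the last observation by the autocorrelation at each lag.

```python
def persistence_forecast(history, k):
    """Persistence: repeat the last observed value for all k steps ahead."""
    return [history[-1]] * k

def improved_persistence_forecast(history, k, window=24):
    """Simplified New-Reference-Model-style baseline: blend the last
    observation with a moving average of the recent window.  The real model
    weights by the autocorrelation at lag k; the fixed linear decay used
    here is purely for illustration."""
    mean = sum(history[-window:]) / min(window, len(history))
    forecasts = []
    for step in range(1, k + 1):
        # Weight on the last observation decays as the horizon grows
        w = max(0.0, 1.0 - step / window)
        forecasts.append(w * history[-1] + (1.0 - w) * mean)
    return forecasts
```

With a rising history, the improved variant relaxes the forecast back toward the recent mean as the horizon grows, while plain persistence stays flat.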

Though those methods can accurately forecast wind speeds, they do not produce the S-shaped power curve function used to convert between wind speed and power output. Other papers in the field have found the power curve approach not to be sufficient in the case of extremely large wind turbine farms over 100 MW [Collins et al., 2009], potentially due to the geographical spread of the wind farm. It is worth noting that some prediction methods do not use an explicit power curve conversion step and instead have it as an implicit part of the prediction.
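As a rough illustration of the S-shaped conversion, a logistic curve with cut-in and cut-out limits can stand in for a measured power curve. The logistic form and all parameter values here (cut-in 3 m/s, rated speed 12 m/s, cut-out 25 m/s, 2 MW rated power, steepness 0.8) are assumptions for illustration, not any real turbine's curve.

```python
import math

def power_curve(wind_speed, rated_power=2000.0, cut_in=3.0,
                rated_speed=12.0, cut_out=25.0, steepness=0.8):
    """Idealised S-shaped power curve in kW.  Real curves come from the
    turbine manufacturer or are fitted to site measurements; this logistic
    shape is an assumed stand-in."""
    if wind_speed < cut_in or wind_speed >= cut_out:
        return 0.0
    # Logistic rise centred between cut-in and rated speed
    midpoint = (cut_in + rated_speed) / 2.0
    return rated_power / (1.0 + math.exp(-steepness * (wind_speed - midpoint)))
```

Below cut-in and above cut-out the turbine produces nothing; in between, output rises steeply and then saturates near rated power.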

NWP is a narrower subset of the general field of wind predictions, but most of the time series modelling techniques with longer time horizons rely on NWP inputs to make their predictions, though some just use the most recently observed values if using a short time horizon. NWP is the use of complicated physics models describing solar input, atmospheric fluid and heat flows, moisture and other factors to predict the weather. It is typically conducted on supercomputing systems and NWP services are provided by many national meteorology bureaus, such as Meteo France, Deutscher Wetterdienst or the European Center for Medium Range Weather Forecast [Giebel et al., 2011].

This field is rapidly developing in both research and operations. At the time of [Giebel et al., 2011], there were a few global weather models produced by a few large weather bureaus, with individual countries providing limited area models (LAMs) with a 7-12 km horizontal resolution. Since then, global models have progressed down to 10 km resolution, with individual LAMs progressing below 4 km, and to 1.5 km in the case of the UK Met Office (Met Office, 2017).

2.2 Neural Network Wind Power Forecasting

Artificial neural networks (ANN), commonly used in time series prediction problems [Box et al., 2015], are generally considered an effective approach to wind power prediction. The first reference to the usage of ANN in wind power prediction, for very short term forecasts only, comes from around 1993 [Tande and Landberg, 1993].

Potentially the most common ANN structure is the MultiLayer Perceptron (MLP), a network that uses an input layer of neurons, one or more hidden layers of neurons and an output layer of neurons (Fig. 2.2). These neurons have some of the basic properties of cortical neurons, such as summation, weighting and activation. The main reason that multiple layers are used is that single layer perceptrons were proved by [Minsky and Selfridge, 1960] and [Minsky and Papert, 1969] to be capable of learning only linearly separable patterns; MLPs have been shown by [Hornik et al., 1989] to be universal function approximators.
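A minimal sketch of the MLP forward pass just described, with one hidden layer: each hidden neuron computes a weighted sum of the inputs plus a bias and applies an activation, and a linear output neuron combines the hidden activations. The function name and the choice of tanh activation are illustrative assumptions.

```python
import math

def mlp_forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass of a single-hidden-layer MLP with tanh hidden units
    and a linear output neuron (a common regression configuration)."""
    hidden = []
    for w_row, b in zip(hidden_weights, hidden_biases):
        # Weighted summation plus bias, then nonlinear activation
        s = sum(wi * xi for wi, xi in zip(w_row, x)) + b
        hidden.append(math.tanh(s))
    # Linear combination of hidden activations
    return sum(wo * h for wo, h in zip(output_weights, hidden)) + output_bias
```

The nonlinearity in the hidden layer is what lifts the model beyond the linearly separable patterns a single-layer perceptron can learn.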

Many different architectures involving ANN are discussed in a paper reviewing the wind forecasting academic field, including hybrid models between ANN, fuzzy logic and Support Vector Machines (SVM) [Soman et al., 2010]. It is clear that ANN are very commonly used and there is a large volume of research on the topic. [Soman et al., 2010] concludes that ANN give very good forecasts and that they in particular solve some of the downscaling problems of NWP methods used for predicting wind conditions at a given location.

Figure 2.2: General Architecture of MLP ([Catalão et al., 2011])

Svensson ([Svensson, 2015]) found that ANN of the type used by Expektra outperformed the persistence method by a very large margin in terms of normalised root mean square error (NRMSE) for the specific 24-hour-ahead prediction scenario considered in this research topic. Specifically, they achieved an NRMSE of 0.165 across the target data set, as compared to an NRMSE of 0.355 for the persistence method.
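The metric itself is straightforward to compute. The sketch below normalises the root mean square error by installed capacity, a common convention in wind power forecasting; the exact normalisation used in [Svensson, 2015] is not restated here, so treat that choice as an assumption.

```python
import math

def nrmse(actual, predicted, capacity):
    """Root mean square error normalised by installed capacity, so that
    errors from farms of different sizes are comparable on a 0-1 scale."""
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse) / capacity
```

A perfect forecast scores 0; the reported values of 0.165 (ANN) versus 0.355 (persistence) would mean typical errors of roughly 16.5% and 35.5% of capacity respectively.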

Next we consider a more specific area of research within ANN wind energy predictions. The interest in investigating the effect of the SCADA wind speed observations is inspired by past research by Morgan Svensson for Expektra. He looked at different types of neural networks for predicting wind power output solely from the forecast data [Svensson, 2015], concluding that having more types of inputs, from sources like SCADA and NWP systems, could give a lot of valuable information that could be used to make better predictions. That work also concludes that wind speed is a key determinant of predictor output, so one potential use for these other data series is identifying the best weather forecasts.

The paper [Svensson, 2015] goes on to note that using several different weather forecasts together could also be investigated, as [Nielsen et al., 2007] showed that power forecasts based on a number of different meteorological forecasts were better than a single source. Specifically, [Nielsen et al., 2007] demonstrates that wind power forecasts obtained using different meteorological models may have comparable performance even if the correlation of their forecast errors is low. Due to this somewhat nonintuitive fact, the individual forecasts can potentially be bettered by weighted and bias-corrected sums of the forecasts. It is a reasonable extension to consider that ANN could be used for this purpose as well, if trained on the weather observation data from the actual turbine site.
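The weighted, bias-corrected sum can be sketched as below. The function name is hypothetical, and in practice the weights and bias terms would be fitted against historical observations (e.g. by least squares, or by the ANN extension suggested above); here they are simply taken as given.

```python
def combine_forecasts(forecast_sets, weights, biases):
    """Weighted, bias-corrected combination of several forecast series,
    in the spirit of [Nielsen et al., 2007]: subtract each source's bias,
    then form a weighted sum at every time step."""
    combined = []
    for values in zip(*forecast_sets):
        combined.append(sum(w * (v - b) for w, v, b in zip(weights, values, biases)))
    return combined
```

With low error correlation between sources, such a combination can beat every individual forecast even when their standalone accuracies are similar.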

Relatively few papers deal with the concurrent usage of observations and forecasts, so this is an interesting research topic. [Tesfaye et al., 2016] used observation data to train an ANN on power output targets, with the test data being NWP forecasts.

However, the NWP and weather forecast data used in [Tesfaye et al., 2016] has 10-minute resolution, and the system generates up to one-day-ahead forecasts based on this data. This differs from the data available to Expektra, which has a one-hour resolution. They ([Tesfaye et al., 2016]) achieve highly accurate forecasts using feed forward ANN, whilst testing on the available NWP data, using the Matlab feed forward ANN fitting toolbox.


2.3 Advanced Machine Learning Methods for Time Series Prediction

One possible method to integrate additional forecast information, as [Svensson, 2015] suggests, is a hierarchical network structure. This has the advantage of reducing the effective dimensionality of the individual networks.

These combined network structures have a relatively long academic history. [Ginzburg and Horn, 1994] showed improved performance with a structure where a first network was trained to predict sunspot data and a second network was then trained on the residuals of the first network's predictions. The second network uses linear transfer functions, so it is effectively a linear regression on a series of residuals, but it can still reduce first stage errors by 20 percent when used for long-term prediction.

Different combinations of machine learning methods in different structures have been attempted for time series forecasting, and wind forecasting has used similar hybrid approaches [Soman et al., 2010]. According to that paper, hybrid approaches actually dominate the field of short-term and medium-term wind power forecasting.

One interesting example is [Zhang, 2003] (N.B. the paper was actually written in 1999), who uses a hybrid methodology of ARIMA and ANN models, motivated by the observation that real datasets often show both linear and non-linear trends. Furthermore, the paper notes as additional motivation that ARIMA models cannot deal with non-linear relationships, while a neural network model alone is not able to model both linear and non-linear trends equally well. The method is relatively similar to [Ginzburg and Horn, 1994] in that the ARIMA model is fitted to the data first and the ANN is then trained on the residuals of the ARIMA model output. [Zhang, 2003] models sunspot data, Canadian lynx data and British Pound/US Dollar exchange rate data to show the application of the method to various problem sets. On these three data sets the hybrid approach outperforms both the individual ANN and ARIMA approaches.
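The two-stage residual structure shared by [Ginzburg and Horn, 1994] and [Zhang, 2003] can be sketched with deliberately simple stand-in models: a mean predictor in place of the ARIMA model and a lag-one carry-over rule in place of the second network. Only the structure, not the component models, reflects those papers.

```python
def fit_mean_model(series):
    """Stage one: a deliberately simple 'model' that predicts the series
    mean.  In [Zhang, 2003] this role is played by an ARIMA model."""
    mean = sum(series) / len(series)
    return lambda _t: mean

def fit_residual_model(series, stage_one):
    """Stage two: model the stage-one residuals.  [Ginzburg and Horn, 1994]
    and [Zhang, 2003] use a second network here; a lag-one carry-over of
    the last residual stands in for it in this sketch."""
    residuals = [x - stage_one(t) for t, x in enumerate(series)]
    last_residual = residuals[-1]
    return lambda _t: last_residual

series = [2.0, 4.0, 2.0, 4.0, 2.0, 4.0]
stage_one = fit_mean_model(series)
stage_two = fit_residual_model(series, stage_one)
# The hybrid prediction is stage one plus the modelled residual
prediction = stage_one(len(series)) + stage_two(len(series))
```

The key design point is that the second model never sees the raw series, only what the first model failed to explain, so its corrections add to rather than compete with the first stage.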

Apart from modelling the linear and non-linear components separately, there are other hybrid approaches for time series forecasting, such as unsupervised clustering to form regions of similar data and then training different machine learning models on each of these regions. [Cao, 2003] does this by using Self Organising Map (SOM) neural networks to cluster the data and then uses SVM to train upon each region. The method significantly decreased the required training time and number of support vectors as compared to the single best SVM model over the global input space.

Instead of separating data points into different clusters, other research in regression machine learning (similar to time series prediction when using ANN methods) separates the individual parameters within each data vector across multiple networks. In [Shaban et al., 2009], individual parameters within transformer condition data with known relationships to each other were combined in a multiple network structure. These relationships are not temporal in nature, like time series prediction, but are known through high voltage engineering practice and theory to affect each other.

For instance, electrical resistance at different voltage levels was used to predict breakdown voltage in one first stage network and interfacial tension in another first stage network. From there, these two values and the colour of the insulation are used to predict the water content of the transformer oil in a second stage network. Finally, the water output of the second stage network and the breakdown voltage output of one of the first stage networks are used to predict acidity in a third stage network (Fig. 2.3).

Figure 2.3: Cascade Network Structures for Transformer Condition Regression ([Shaban et al., 2009])

There are even hybrid structures for time series prediction based upon the idea of selecting the best inputs from a large set and then training a later stage ANN on those best inputs. A representative example is Fuzzy Cognitive Maps (FCMs) trained upon time series data using structure optimisation genetic algorithms, followed by ANN with the Levenberg-Marquardt training algorithm [Papageorgiou and Poczeta, 2016].

Beyond hybrid structures, other methods for time series prediction have recently been proposed, such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), long short-term memory (LSTM) and gated recurrent units (GRU). The impetus behind these network structures is that, except for CNN, they exhibit some concept of pattern recognition over time periods, or memory [Lipton et al., 2015]. CNN instead offers the ability to reliably train very deep networks and recognise data structure.

These have found many applications, even within such fields as petroleum wells [Garcia et al.]. Interestingly enough, that paper faces problems similar to wind turbine data: some data series have missing points that must be cleared from the data, and the amount of data available is of similar magnitude to this project.


Chapter 3

Methods

3.1 Preliminaries

The purpose of this chapter is to explain the methods used in this thesis to investigate the research topic. It presents terms and concepts required in the field of machine learning, with a focus on time series forecasts, along with the motivation behind using these types of methods.

This chapter also contains a description of the implemented ANN and how it is structured and optimized. It provides details about the experimental setup and how these experiments relate to Expektra's baseline wind power forecasting models, and finally a description of the structuring and processing of the dataset.

3.1.1 Definitions

Time-series A time series is a sequence X of values x_t, each with a particular time-stamp t, i.e. X = [x_1, x_2, ..., x_t], or in short X_{1:t}; a time-dependent collection of variables.

Forecast Horizon A multi-step-ahead time series prediction is the task of predicting k forecasts X_{t+1:t+k} given a collection of historical observations X_{t-p+1:t}. This thesis considers 1 time step ahead, 6 time steps ahead and 24 time steps ahead. The time steps within the context of Expektra's data are hours.

Point or Spot Forecast In this thesis the forecast p̂_{t+k|t} is modelled as a so-called point forecast (or spot forecast), i.e. a single value for each forecast rather than a probability distribution of forecasts.

Prediction Error The prediction error e for lead time t+k is defined in Eqn. 3.1 as the difference between the forecast and the actual value, where t denotes the time index, k is the look-ahead time from the current time, p is the actual wind power and p̂ is the predicted wind power.

e_{t+k|t} = p_{t+k} − p̂_{t+k|t}    (3.1)

Open Set If we consider a time series set 1:t of n-dimensional vectors, X, mapped to a function output Y:

F(X_{1:t}) ⇒ Y_{1:t}    (3.2)

Indices ∈ ℕ | Indices = 1:t    (3.3)

OpenSet ⊂ Indices    (3.4)

HiddenSet ⊂ Indices    (3.5)

OpenSet ∩ HiddenSet = {}    (3.6)

Expektra's open set is a subset of the whole dataset, in terms of timestamps, that is used within the training process for the ANN models but can later also be used for the testing process if desired. It is chosen using proprietary clustering and has the same set indices no matter which input data series are used to produce the n-dimensional input vectors.

Hidden Set The hidden set is very similar to the open set, except that it is not used in the training process for the ANN models. Measures of performance are evaluated upon the hidden set.

Data Series Expektra has a very large number of different time series. If we consider a time series of one data type A and a time series of another data type B, it is not guaranteed that:

Indices(A) = Indices(B) (3.8)

The practical consequence of this is that the different data series must be intersected by timestamp, and that the interval between timestamps or indices of data is not always regular. Some data series can have multiple weeks of gaps within them.

This thesis does not consider interpolation or other models to allow time series data points of n-missing dimensions to be used. They are discarded here.

These data series are referred to by number where necessary.
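As an illustration of the timestamp intersection and discarding of incomplete samples described above, a minimal sketch in Python (the thesis pipeline itself used SQL and Matlab; the data here is invented):

```python
# Sketch: intersect irregular time series by timestamp, keeping only
# samples present in every series (incomplete samples are discarded).
def intersect_by_timestamp(*series):
    """Each series is a dict {timestamp: value}; keep only timestamps
    present in every series, returned in sorted order."""
    common = set(series[0])
    for s in series[1:]:
        common &= set(s)
    return [(t, tuple(s[t] for s in series)) for t in sorted(common)]

a = {1: 0.5, 2: 0.7, 5: 0.9}          # e.g. an NWP wind speed forecast
b = {2: 1.1, 3: 1.2, 5: 1.3}          # e.g. a SCADA observation series
print(intersect_by_timestamp(a, b))   # [(2, (0.7, 1.1)), (5, (0.9, 1.3))]
```

Timestamps missing from either series (1 and 3 above) are dropped entirely, mirroring the decision not to interpolate.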

Data Availability The two main classes of data within Expektra’s set of data provided for this thesis are NWP data and SCADA data. SCADA data can of course not be used to make predictions over the hidden set, representing the actual process of making predictions in reality. It can only be used if time delays are applied to it related to the forecast horizon; i.e. the data must be available when predictions are being made.

NWP forecast data can be used based on its data availability, which ranges from zero to D. There are 16 discrete NWP wind speed forecast series from different locations relative to the wind farm for each availability horizon, meaning that 32 NWP forecasts for wind speed are considered in this thesis. The availability integer represents the number of days ahead for which this data can be used to make predictions.

For instance, data of availability 1 can be used to make predictions for the current day, though it will be less accurate, but the maximum number of days ahead it can be used to make predictions for is 1 day.

It is important to select only data that would actually be available when making predictions; a model built on unavailable data might perform excellently, but it could never be used in practice.


3.1.2 Reference Models

The persistence models [Soman et al., 2010] for predicting wind speed, often cited as a classical benchmark, will not be used for comparison, as it is preferred to compare the new solutions incorporating the observations against the current process, which uses only the forecasts.

It is further noted that there is a reference ANN model, structurally similar to Expektra's best performing model over the hidden set (c.f. section 3.1.1), for each of the three forecast horizons (c.f. section 3.1.1). This model has its performance calculated over the hidden set to present a baseline for comparison to other models.

3.1.3 Error Metrics Mean Squared Error

Mean Squared Error (MSE) is an important error quantity that looks at the average of the squared errors. This means that positive and negative errors do not cancel each other out, which would produce an artificially high performance metric. Large individual errors are penalised more harshly than in other comparison metrics. For a set of N predictions it is described by Eqn. 3.9:

MSE_k = (1/N) · Σ_{t=1}^{N} e²_{t+k|t}    (3.9)

The NRMSE, discussed earlier in reference to the results of [Svensson, 2015], is simply the square root of this quantity.

Coefficient of Determination

The R squared value, also known as the coefficient of determination (related to the fraction of unexplained variance produced by the modelling process), is the basis for comparison between the different models [Glantz and Slinker, 1990]. It is based on the MSE (Section 3.1.3) and is shown in Eqn. 3.10:

R²_k = 1 − MSE_k / var(p_k)    (3.10)

The coefficient of determination uses the population variance to normalise the variability in the predictions, i.e. the MSE. The advantage of this approach, and why Expektra uses it, is that it allows reasonable comparisons between datasets which might have different levels of variability. This is certainly the case when considering different wind farms, of different sizes and climatic influences.

It also retains the advantages of the MSE, as it is simply a linearly scaled version of it. This thesis uses it because it allows Expektra to interpret the results in comparison to their current models.
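The two metrics above can be sketched directly from Eqns 3.9 and 3.10; a minimal Python illustration (the thesis used Matlab, and the data here is invented):

```python
# Sketch of the error metrics: MSE (Eqn 3.9) over a set of predictions,
# and the coefficient of determination (Eqn 3.10) computed from the MSE
# and the population variance of the actual power values.
def mse(actual, predicted):
    n = len(actual)
    return sum((p - q) ** 2 for p, q in zip(actual, predicted)) / n

def r_squared(actual, predicted):
    n = len(actual)
    mean = sum(actual) / n
    variance = sum((p - mean) ** 2 for p in actual) / n  # population variance
    return 1 - mse(actual, predicted) / variance

actual    = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(round(r_squared(actual, predicted), 4))  # 0.98
```

Note that a perfect model gives R² = 1, while a model no better than predicting the mean gives R² = 0, and R² can be negative for worse models.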

3.1.4 Model Selection and Evaluation

In regression and classification applications of machine learning, creating a good model has many challenges. One of the most common is the "generalisation" problem, i.e. whether the model performs well when applied to data that has not been seen previously during training. This is especially a problem with neural networks due to their extremely flexible nature as universal approximators [Hornik et al., 1989].

Following the best practices of supervised machine learning, as the ANN methods shown here are, to avoid this problem the data is divided into training, validation and test data sets. However, the test data set is selected as the hidden set, c.f. Section 3.1.1, and not at random; the remaining data is divided at random into training and validation data.

An additional model validation procedure followed in this thesis is that the ANN models used are randomly initialised and fully trained 30 times for each potential hidden layer size. Sizes from 1 to 30 neurons are trialled, making for a total of 900 trials for each model type. The layer size with the maximum coefficient of determination, for any one result, is presented as the best group from that model type. These 30 coefficient of determination values per model type are used as the basis for the statistical testing detailed in Section 3.5.

The final models for statistical testing and numerical comparison are selected using the maximum coefficient of determination over the many individual runs. This means the absolute maximum is used for model selection, not the maximum average per specific number of nodes.
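The selection procedure above can be sketched as follows; `train_once` is a hypothetical stand-in for one full training run returning a coefficient of determination, so only the selection logic, not the dummy scores, reflects the thesis:

```python
# Sketch of the model-selection procedure: 30 randomly initialised training
# runs per hidden-layer size (1..30), keeping the 30 R^2 values of the size
# whose single best run is highest, for later statistical testing.
import random

def train_once(hidden_size, seed):
    random.seed(seed)                   # stands in for random weight init
    return 0.7 + 0.2 * random.random()  # dummy R^2 in [0.7, 0.9)

def select_best_group(n_sizes=30, n_trials=30):
    best_size, best_r2, best_group = None, float("-inf"), None
    for size in range(1, n_sizes + 1):
        group = [train_once(size, seed=size * 1000 + t) for t in range(n_trials)]
        if max(group) > best_r2:        # absolute maximum, not the mean
            best_size, best_r2, best_group = size, max(group), group
    return best_size, best_r2, best_group

size, r2, group = select_best_group()
print(size, round(r2, 3), len(group))
```

The `group` of 30 R² values for the winning layer size is exactly what feeds the Friedman test described in Section 3.5.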

3.2 Experiments

3.2.1 Data Import

Data was provided by Expektra in the form of 533 CSV files for data and a JSON schema for the open and hidden set. The data represents all available weather forecast and SCADA (Supervisory Control and Data Acquisition) data since January 2014 for a small wind farm of 16 wind turbines in coastal Norway. This was imported into SQLite 3.18.0. An SQL interface was written in Matlab, which performed data intersection and queries natively in SQL, and the finished datasets, relatively small in size compared to the overall dataset, were stored as Matlab .mat files.

The SQL interface took data series IDs as one matrix and corresponding time delays as another matrix; time shifting by minutes or other units is a native SQLite function. This gave confidence that the finished data series were shifted by the correct amount of time.
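A minimal sketch of this kind of native SQLite time shifting, using Python's sqlite3 module instead of the Matlab interface the thesis used; the table and column names are invented:

```python
# Sketch of SQL-based time shifting using SQLite's native datetime()
# modifiers. The schema here is illustrative, not Expektra's actual one.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE series (ts TEXT, value REAL)")
con.execute("INSERT INTO series VALUES ('2014-01-01 05:00:00', 7.3)")

# Shift the timestamp back by 60 minutes directly in SQLite.
row = con.execute(
    "SELECT datetime(ts, '-60 minutes'), value FROM series"
).fetchone()
print(row)  # ('2014-01-01 04:00:00', 7.3)
```

Doing the shift inside the database, rather than in application code, is what gives confidence that every series is offset by exactly the intended amount.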

3.2.2 Equipment

Personal Computer, running Matlab 2015A for ANN experiments.

ANN training in Matlab supports both multi-core CPU training and GPU training, but the computer used had only an integrated GPU, so CPU training was performed. The Matlab feed-forward ANN fitting toolbox is used.

3.3 Optimization Methods

The Levenberg-Marquardt (LM) algorithm is a hybrid of the Error Back-Propagation (EBP) method and the Gauss–Newton method. This algorithm is fast like Gauss–Newton but is far more stable, like EBP. It is used to train the ANN models in this thesis to achieve the minimum possible MSE.

A detailed discussion of LM and its characteristics can be found in [Yu and Wilamowski, 2011], but the key to the algorithm is that "it performs a combined training process: around the area with complex curvature, the Levenberg–Marquardt algorithm switches to the steepest descent algorithm, until the local curvature is proper to make a quadratic approximation; then it approximately becomes the Gauss–Newton algorithm...". An excellent flow chart from [Yu and Wilamowski, 2011] is presented in Fig. 3.1.

Figure 3.1: Block Diagram for Training of ANN using Levenberg-Marquardt Algorithm: [Yu and Wilamowski, 2011]

It is noted that the advantages of LM are domain limited and in some applications it is not a suitable training algorithm. [Yu and Wilamowski, 2011] specifically identify cases where network size is large, e.g. image recognition, as not being suitable for the algorithm. This is because calculating the inverse of the Hessian matrix, despite the use of the Jacobian approximation to the Hessian, is extremely complicated and thus slow for large networks. In Fig. 3.1 this is the step following the Jacobian matrix computation.

Furthermore, memory requirements of LM can be large, due to the need to store the Jacobian matrix in memory, and this can be impractical when used with larger network sizes.

Nonetheless, it was found to be a very suitable algorithm for the problems presented in this thesis.


3.4 ANN Models

The ANN models used in this thesis are of the fully connected or MLP style, with the neurons within modelled after the McCulloch-Pitts model [McCulloch and Pitts, 1943].

Each neuron has multiple inputs controlled by weights and a bias, followed by a summing function and finally an activation function, similar in function to the firing of a biological neuron. Eqn. 3.11 shows this, where w_i is the weight on input line i, x_i is the input value, M is the dimensionality of the inputs and x_0 is the bias that can be applied to the neuron.

s = Σ_{i=1}^{M} w_i x_i + x_0    (3.11)

These weights are initialised randomly to small values before each trial of the network at a given hidden layer size.

An ANN of MLP type consists of multiple layers, with typically multiple neurons in each layer. As per Fig. 2.2 the models used here have 3 layers: input, hidden and output layers. Each neuron is connected to each neuron in the following layer, thus the previous remark about this being a fully connected network type. The networks in this thesis only use one output, the prediction of electrical power output for the wind farm, so only have one output neuron.

When using these models to do regression, or time series prediction as a special case of regression, the most appropriate activation function on the output neurons is a pure linear activation function (Eqn. 3.12). In the hidden layer, this thesis uses the tansig activation function, described in Eqn. 3.17, as it gave the best results during testing.

purelin(x) = +1 if x ≥ 1;  x if −1 < x < 1;  −1 if x ≤ −1    (3.12)

tansig(x) = 2 / (1 + e^{−2x}) − 1    (3.17)
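A minimal Python sketch of a single neuron with the tansig activation (Eqns 3.11 and 3.17); the weights and inputs are invented:

```python
# Sketch of a single McCulloch-Pitts-style neuron (Eqn 3.11) with the
# tansig activation (Eqn 3.17). tansig is mathematically equivalent to tanh.
import math

def tansig(x):
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0

def neuron(weights, inputs, bias):
    s = sum(w * x for w, x in zip(weights, inputs)) + bias  # Eqn 3.11
    return tansig(s)                                        # Eqn 3.17

print(round(neuron([0.5, -0.25], [1.0, 2.0], 0.1), 4))
```

Stacking many such neurons in a hidden layer, followed by one linear output neuron, gives exactly the MLP regression structure described above.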

3.4.1 Data Scaling

Data must be scaled before it is fed to the network, as this significantly improves performance. It is extremely important in this application, due to the large values of certain data series, such as electrical power production. Data scale affects neural networks and other machine learning methods in the following ways:

• The random weight initialisation in ANN is partially dependent on the scale of the network inputs. ([Sarle, 2002])


• With no scaling, input attributes with greater numeric ranges can dominate those with smaller numeric ranges. ([Sarle, 2002], [Hsu et al., 2003])

• Large variations between variables may cause problems with the training process, and prevent solution convergence. ([Sarle, 2002], [Hsu et al., 2003])

• Regularisation algorithms, such as weight decay, benefit from using inputs with a similar scale. ([Sarle, 2002])

• Optimisation algorithms can also be sensitive to scaling issues. ([Sarle, 2002])

This scaling is done using the inbuilt mapminmax function in Matlab, which is described by Eqn. 3.18. The maximum and minimum scaled values are assumed to be 1 and -1 respectively.

x_scaled = (x_scaled_max − x_scaled_min) · (x − x_min) / (x_max − x_min) + x_scaled_min    (3.18)

The network output is rescaled back to the original units to produce the power prediction.
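A minimal Python sketch of min-max scaling to [-1, 1] in the style of Eqn. 3.18, together with the inverse mapping used to rescale network output back to power units (the function names are illustrative, not Matlab's API):

```python
# Sketch of min-max scaling to [-1, 1] (Eqn 3.18, the mapminmax default
# range) and its inverse, which maps scaled network output back to power.
def minmax_scale(x, x_min, x_max, lo=-1.0, hi=1.0):
    return (hi - lo) * (x - x_min) / (x_max - x_min) + lo

def minmax_inverse(y, x_min, x_max, lo=-1.0, hi=1.0):
    return (y - lo) * (x_max - x_min) / (hi - lo) + x_min

data = [0.0, 5.0, 10.0]
scaled = [minmax_scale(v, min(data), max(data)) for v in data]
print(scaled)  # [-1.0, 0.0, 1.0]
```

The same `x_min`/`x_max` computed on the training data must be reused at test time, otherwise train and test inputs land on different scales.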

3.4.2 Considered Extensions to MLP

Although all ANN models used are of an MLP structure, integrating the SCADA data into them requires adjustments due to the forecast horizons used. This is because SCADA wind speed data is not available until the hour after it has occurred, making it unsuitable for predictions without delays or structural innovation.

The naive way to integrate SCADA data would be as in [Tesfaye et al., 2016], or by the method mentioned in Section 1, and that is the first model attempted after the baseline models, which represent Expektra's current process. In this second model type, the inputs that were wind speed forecasts are replaced with inputs of wind speed observations for training, and then switched back to the original NWP forecast series used in the baseline phase for the test phase.

An alternative to this is the third model type tested. In this model, the observed wind speed data is added as an extra input to the model, trained upon observed wind speed and then tested upon forecasts. Furthermore, all 16 available NWP forecasts are used as potential replacements for the wind speed observation. This leads to 16*900 trials of this model type, compared to 900 for all other model types.

It is interesting to consider the correlation between each replacement forecast and the actual observations when it comes to replacing the observations in the third model. The question of whether the most highly correlated series achieves the best results is also examined.

From there, and considering [Svensson, 2015], [Shaban et al., 2009] and [Nielsen et al., 2007], a multiple stage network design is considered, referred to as the "multi-stage" network in the text. Although [Shaban et al., 2009] refers to this concept as a cascade network, there are multiple concepts with that same name. Specifically, the multi-stage network considered here is a multi-stage series-connected network (MSSCN) with separate training procedures.

As there are many forecast variables, it was decided to use the forecasts together with their time-shifted versions (1 hour before and 1 hour after), for a total of 48 input dimensions, to produce a composite forecast. This composite-forecast ANN model is trained on the observed wind speed. This is done to give more sources of information to the model.

This first stage model can either be used as an extra input to the baseline Expektra model (model type 4), or it can replace the existing wind speed forecast inputs to the baseline Expektra model (model type 5). Unlike model types 2 and 3, these multi-stage network structures do not require separate training and test input configurations, but the first stage model must typically be trained before the second stage network.

The difference is illustrated in Fig. 3.2.

Figure 3.2: Difference between Model Type 4 and 5
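The wiring difference between model types 4 and 5 can be sketched as follows; `stage1` and `stage2` are hypothetical stand-ins for the trained networks, so only the data flow, not the arithmetic, is meaningful:

```python
# Conceptual sketch of the two multi-stage variants: the first-stage
# network combines many NWP forecasts into a composite forecast; model
# type 4 adds it as an extra input to the baseline-style second stage,
# while model type 5 uses it to replace the NWP wind-speed inputs.
def stage1(nwp_forecasts):
    # stand-in for the composite-forecast network trained on observations
    return sum(nwp_forecasts) / len(nwp_forecasts)

def stage2(inputs):
    return sum(inputs)  # stand-in for the baseline-style second network

def model_type_4(nwp_inputs, other_inputs):
    composite = stage1(nwp_inputs)
    return stage2(other_inputs + nwp_inputs + [composite])  # extra input

def model_type_5(nwp_inputs, other_inputs):
    composite = stage1(nwp_inputs)
    return stage2(other_inputs + [composite])  # replaces NWP inputs

nwp = [6.0, 8.0]
print(model_type_4(nwp, [1.0]), model_type_5(nwp, [1.0]))  # 22.0 8.0
```

In both variants the first stage is trained first, against observed wind speed, and then frozen while the second stage is trained against power output.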

Please note that although there are 5 model types, the individual models have different names in the results section, due to there being multiple models per model type, i.e. one per time horizon.

3.5 Statistical Testing of Performance

It is possible to further compare the different models obtained by applying statistical testing to their results. There are many potential types of tests that can be used for this task, but some common examples are the Two Matched Samples t-Test, Sign Test, McNemar's Test, Wilcoxon's Signed Rank Test, Repeated Measures One-way Analysis of Variance (ANOVA) and Friedman's Test [Japkowicz, 2012].

Typically these tests are used first to establish whether or not models differ, and from there further tests can be employed to find the differences between models. Some tests are parametric, such as the t-Test, and some are non-parametric, such as Friedman's test. There is another constraint, in that some tests can only compare two models at once, requiring multiple pair-wise comparisons.

As it is not clear whether the coefficient of determination results of the 5 model types over 3 time horizons can be considered to have a normal distribution over their trials, this thesis uses non-parametric methods. Furthermore, as there are significantly more than 2 model types to be tested, this thesis uses testing methods that can compare more than 2 model types at once [Japkowicz, 2012].

By these criteria, the Friedman test followed by a Bonferroni-Dunn post-hoc test will be used in this thesis. However, [Åkerberg, 2017] used the t-Test to compare the performance of an outlier detection system against a baseline model without outlier detection, within the area of wind forecasting; multiple types of tests can clearly be used in this domain of comparing model results.

The null hypothesis for the Friedman test is that all the models perform equally; rejection thus means that there is at least one pair of models within the set considered that have significantly different performance [Japkowicz, 2012]. All algorithms are ranked on each domain separately, and then for each model the sum of the ranks obtained on all domains is computed [Friedman, 1937].

These sums are compared to the Friedman statistic, using the 0.05 level of significance for the null hypothesis, as shown in Eqn. 3.19, where n is the number of domains and k the number of compared models:

χ²_F = [ 12 / (n · k · (k + 1)) · Σ_{j=1}^{k} R_j² ] − 3 · n · (k + 1)    (3.19)

This Friedman test, however, only shows whether at least one model is significantly different from the others tested. The post-hoc test is required to show which models differ from each other, based on the results of the Friedman test.
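A minimal Python sketch of Eqn. 3.19: rank the k models within each of the n domains, sum the ranks per model, and apply the formula (ties are ignored for simplicity, and the scores are invented):

```python
# Sketch of the Friedman statistic (Eqn 3.19). scores[d][j] is the
# performance of model j in domain d, with higher taken as better.
def friedman_statistic(scores):
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for domain in scores:
        # best model in this domain gets rank 1, next gets 2, ...
        order = sorted(range(k), key=lambda j: domain[j], reverse=True)
        for rank, j in enumerate(order, start=1):
            rank_sums[j] += rank
    return 12.0 * sum(r * r for r in rank_sums) / (n * k * (k + 1)) \
        - 3.0 * n * (k + 1)

# Three domains, three models; model 0 always best, model 2 always worst.
scores = [[0.9, 0.8, 0.7], [0.92, 0.81, 0.75], [0.88, 0.79, 0.71]]
print(friedman_statistic(scores))  # 6.0
```

The resulting statistic is compared against a chi-squared critical value (or, in practice, a library routine such as scipy's `friedmanchisquare` handles ranking, ties and the p-value together).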

One way to apply the testing in this thesis is to consider each of the 30 trials of each model a separate domain, due to the random initialisation of each trial network. This leads to 30 domains, comparing 5 models by their best performing group of 30. In this test design, each time horizon for predictions has a separate analysis performed. As there are no repeated measures of the same model within the same domain, this design is quite similar to the Kruskal-Wallis test [Kruskal and Wallis, 1952].

Following the Friedman test, a Bonferroni-Dunn post-hoc test can be used. This is a Dunn test [Dunn, 1964] with a Bonferroni correction applied to the significance cutoff; this thesis uses 0.05 for this value. The correction sets the significance level for the Dunn test to the significance cutoff level divided by the number of tests, i.e. alpha divided by 30 in this case.

According to [Goldman, 2008] this avoids the problem of multiple testing, wherein many test results can give false confidence in a significant result. It can be considered conservative in its assumptions, but this thesis considers that an advantage.

The Dunn test itself, [Dunn, 1964], is an approximation to the rank statistics using the z-test statistic.

Another test design is to consider each prediction horizon to be a domain within the Friedman test. The outcome is still a comparison of the five model types to each other, but it is now a global comparison over all time horizons, rather than a horizon-specific comparison. It is necessary to consider each trial a repeated measure in this case.


Chapter 4

Results

This chapter contains the results obtained in this study. Each section considers the related model across all three time categories used for predictions, i.e. 1 hour ahead, 6 hours ahead and 24 hours ahead.

4.1 Results of the Baseline Models

For 1 hour ahead predictions using the baseline Expektra model, Model 1.1, the maximum coefficient of determination achieved was 0.9255, with 6 neurons in a single hidden layer. Training performance increased with more neurons in the hidden layer but test performance decreased (Fig. 4.1). There were 11 inputs to the model, consisting of time delayed power production data and intra-day NWP forecasts.

Figure 4.1: Training and Testing Error vs Neuron Number for Multiple Trials - (a) 1 hour ahead Model 1.1 (b) 6 hours ahead Model 1.2 (c) 24 hours ahead Model 1.3. Red is training perf. and blue is test perf.

For 6 hours ahead predictions using the baseline Expektra model, Model 1.2, the maximum coefficient of determination achieved was 0.7925, with 19 neurons in a single hidden layer. Training performance trends were the same as for Model 1.1 (Fig. 4.1). There were 17 inputs to the model, consisting of intra-day NWP forecasts.

For 24 hours ahead predictions using the baseline Expektra model, Model 1.3, the maximum coefficient of determination achieved was 0.7290, with 19 neurons in a single hidden layer. Training performance trends were the same as for Model 1.1 (Fig. 4.1).

There were 9 inputs to the model, consisting of inter-day NWP forecasts.

4.2 Results of Augmented Baseline Models

4.2.1 Replacing Forecasts with Observations

The first model considered after the baseline Expektra model replaces the NWP forecast inputs of the baseline Expektra model with the wind speed observation, including of course the same time shift as the original forecast data input.

Figure 4.2: Training and Testing Error vs Neuron Number for Multiple Trials - (a) 1 hour ahead Model 2.1 (b) 6 hours ahead Model 2.2 (c) 24 hours ahead Model 2.3. Red is training perf. and blue is test perf.

During the training process the ANN is trained on one set of data, but during the testing process the NWP forecast inputs replace the wind speed observations again. This represents the real-life constraint that, when making predictions, the wind speed observations are not known until the hour after the wind speed occurs.

The results, in general, show a large decline in test performance from the baseline models (Models 1.1-1.3). Performance on the observations is high; it is significantly higher than test performance.

For 1 hour ahead predictions using model 2.1, the maximum coefficient of determination achieved was 0.9045, with 4 neurons in a single hidden layer (Fig. 4.2). There were 11 inputs to the model, consisting of time delayed power production data, wind speed observations and intra-day NWP forecasts.


For 6 hours ahead predictions using model 2.2, the maximum coefficient of determination achieved was 0.6236, with 18 neurons in a single hidden layer (Fig. 4.2). There were 17 inputs to the model, consisting of wind speed observations and intra-day NWP forecasts.

For 24 hours ahead predictions using model 2.3, the maximum coefficient of determination achieved was 0.4400, with 13 neurons in a single hidden layer (Fig. 4.2). There were 9 inputs to the model, consisting of wind speed observations and inter-day NWP forecasts.

4.2.2 Observations as Extra Model Input

An alternative way to introduce the extra information represented by the wind speed observations is to add them as an extra input to the baseline models (Models 1.1-1.3). This input could have arbitrary time delays applied and arbitrarily complicated training methods.

Figure 4.3: Training and Testing Error vs Neuron Number for Multiple Trials - (a) 1 hour ahead Model 3.1 (b) 6 hours ahead Model 3.2 (c) 24 hours ahead Model 3.3. Red is training perf. and blue is test perf.

This study chooses to apply one extra input, the wind speed observation, with no time delay; when testing the model on the test data set, each possible NWP forecast replacement is applied instead of the wind speed observation. 16 different NWP forecasts can be applied to this extra input, c.f. Section 3.1.1. Due to this, the training of this model type took longer than the others, as each network hidden layer size, multiple trials and multiple forecast replacements had to be tested.

Results were even lower in terms of coefficient of determination than for Models 2.1-2.3, and for the 24 hour ahead predictions some neuron-size and feature-replacement combinations even produced negative coefficient of determination values. The coefficient of determination on the observations approached 1, i.e. possible overfitting occurred.

For 1 hour ahead predictions using model 3.1, the maximum coefficient of determination achieved on the test data was 0.8590, with 1 neuron in a single hidden layer (Fig. 4.3). There were 11 inputs to the model, consisting of time delayed power production data, wind speed observations and intra-day NWP forecasts.

Table 4.1: 1 Hour Ahead Correlation between NWP Predictions and Observed Wind Speed to 0.95 Confidence

Feature Id | Lower Bound | R Value | Upper Bound
745 | 0.8220 | 0.8294 | 0.8365
553 | 0.7392 | 0.7495 | 0.7595
595 | 0.7407 | 0.7510 | 0.7609
637 | 0.7419 | 0.7522 | 0.7621
679 | 0.7307 | 0.7413 | 0.7516
721 | 0.7289 | 0.7396 | 0.7499
555 | 0.7460 | 0.7561 | 0.7659
597 | 0.7468 | 0.7569 | 0.7666
639 | 0.7384 | 0.7488 | 0.7588
681 | 0.7331 | 0.7437 | 0.7539
723 | 0.7415 | 0.7518 | 0.7617
557 | 0.7490 | 0.7590 | 0.7687
599 | 0.7511 | 0.7610 | 0.7706
641 | 0.7385 | 0.7489 | 0.7589
683 | 0.7355 | 0.7460 | 0.7561
725 | 0.7490 | 0.7590 | 0.7687

For 6 hours ahead predictions using model 3.2, the maximum coefficient of determination achieved was 0.5791, with 24 neurons in a single hidden layer (Fig. 4.3). There were 18 inputs to the model, consisting of wind speed observations and intra-day NWP forecasts.

For 24 hours ahead predictions using model 3.3, the maximum coefficient of determination achieved was 0.4303, with 15 neurons in a single hidden layer (Fig. 4.3). There were 10 inputs to the model, consisting of wind speed observations and inter-day NWP forecasts.

4.2.3 Correlation Analysis of Inputs and Outputs

When considering the most suitable replacement NWP forecast feature for the weather observations, as in Models 3.1-3.3, it is interesting to observe the correlation between the replacement feature and the wind speed observation.

According to the linear correlation analysis, feature 745 is most highly correlated with the observed wind speed for the 1 hour ahead feature set (c.f. Table 4.1), feature 745 for the 6 hour ahead feature set (c.f. Table 4.2) and feature 746 for the 24 hour ahead feature set (c.f. Table 4.3). The logical assumption would be that the most highly correlated features produce the best results when replacing the weather observation input during the testing phase.

However, it is actually found that features 595, 745 and 746 are the best replacements in terms of coefficient of determination performance over the test set, for 1 hour ahead, 6 hours ahead and 24 hours ahead predictions respectively using Models 3.1-3.3. The specific coefficient of determination values are 0.8590, 0.5791 and 0.4303 for 1, 6 and 24 hour ahead predictions. Indeed, despite the high linear correlation between all the replacement features, the non-linear ANN modelling experiences a large decline in performance when using the replacement features in every case (c.f. Section 4.2.2).

Table 4.2: 6 Hour Ahead Correlation between NWP Predictions and Observed Wind Speed to 0.95 Confidence

Feature Id | Lower Bound | R Value | Upper Bound
745 | 0.8206 | 0.8279 | 0.8351
553 | 0.7391 | 0.7494 | 0.7593
595 | 0.7408 | 0.7511 | 0.7609
637 | 0.7419 | 0.7521 | 0.7620
679 | 0.7311 | 0.7417 | 0.7519
721 | 0.7292 | 0.7398 | 0.7501
555 | 0.7460 | 0.7561 | 0.7658
597 | 0.7470 | 0.7570 | 0.7666
639 | 0.7384 | 0.7487 | 0.7586
681 | 0.7335 | 0.7440 | 0.7541
723 | 0.7418 | 0.7520 | 0.7618
557 | 0.7491 | 0.7590 | 0.7686
599 | 0.7512 | 0.7611 | 0.7706
641 | 0.7385 | 0.7488 | 0.7587
683 | 0.7358 | 0.7462 | 0.7562
725 | 0.7493 | 0.7592 | 0.7688

Table 4.3: 24 Hour Ahead Correlation between NWP Predictions and Observed Wind Speed to 0.95 Confidence

Feature Id | Lower Bound | R Value | Upper Bound
746 | 0.7747 | 0.7838 | 0.7925
554 | 0.6617 | 0.6744 | 0.6868
596 | 0.6596 | 0.6724 | 0.6848
638 | 0.6621 | 0.6749 | 0.6872
680 | 0.6578 | 0.6707 | 0.6831
722 | 0.6540 | 0.6670 | 0.6796
556 | 0.6677 | 0.6803 | 0.6925
598 | 0.6649 | 0.6775 | 0.6898
640 | 0.6584 | 0.6713 | 0.6837
682 | 0.6584 | 0.6713 | 0.6837
724 | 0.6650 | 0.6776 | 0.6899
558 | 0.6709 | 0.6833 | 0.6954
600 | 0.6688 | 0.6813 | 0.6935
642 | 0.6592 | 0.6721 | 0.6845
684 | 0.6607 | 0.6735 | 0.6859
726 | 0.6718 | 0.6843 | 0.6963
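The confidence intervals in Tables 4.1-4.3 are consistent with the standard Fisher z-transform construction for a Pearson correlation; a minimal Python sketch on invented data (it is an assumption that the thesis used this exact construction):

```python
# Sketch: Pearson r between a forecast series and observations, with an
# approximate 0.95 confidence interval via the Fisher z-transform.
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def r_confidence(x, y, z_crit=1.96):
    r, n = pearson_r(x, y), len(x)
    z = math.atanh(r)                 # Fisher transform of r
    se = 1.0 / math.sqrt(n - 3)       # approximate standard error of z
    return math.tanh(z - z_crit * se), r, math.tanh(z + z_crit * se)

forecast = [5.1, 6.0, 7.2, 8.1, 9.0, 10.2, 11.1, 12.0]
observed = [5.0, 6.2, 7.0, 8.3, 9.1, 10.0, 11.3, 12.1]
lo, r, hi = r_confidence(forecast, observed)
print(lo < r < hi)  # True
```

With the thousands of intersected hourly samples available here, the interval is naturally much tighter than on this toy sample.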

4.3 Results of Multiple Stage Models

The previous naive model sets were not found to be effective in improving overall model performance over the baseline models (Models 1.1-1.3). The chief theoretical problem appears to be training on one extremely useful feature set (wind speed observations, which are obviously highly related to the wind power output) and then exchanging it at test time for a less useful feature set with lower accuracy than the actual observations.

4.3.1 First Stage Models

Another possible way to use the wind speed observation data is as a filter for the many available wind forecasts, as discussed in the methods section 3.4.2. This leads to a two-stage network, with the second stage being close to the baseline Expektra model and the first stage being a filtering network for the NWP forecasts. This filter is trained on the weather observations and produces a new composite forecast [Nielsen et al., 2007].

The baseline Expektra model uses relatively few of the 16 different forecast series available per availability horizon; that being the availability of forecast data when predictions are made. As such the first stage networks can use all of these, and then apply multiple time delays to them.

Two first stage models were created, one for 24 hours ahead and one for 6 hours and 1 hour ahead. This is because the 6 and 1 hour ahead cases share NWP forecast data with the same availability horizon, and can be reused. This is because the forecasts are provided by the number of days in advance they can be used. 1 and 6 hours ahead are within the 0 days ahead availability horizon whilst 24 hours ahead is in the 1 day ahead availability horizon. The first stage model contains no inputs that are different for the 1 and 6 hour ahead cases so can be reused; however, with further optimisation and feature engineering different features would probably be removed from the 1 and 6 hour ahead cases, leading to different first stage models.

One enhancement that is possible for the first stage network in the 1 hour ahead case, and that would differentiate it from the 6 hour ahead first stage model, is to use the observed wind speed from the previous hour; this is left for future work.

The 1 hour ahead and 6 hour ahead first stage model (Model 4.11, and thus also Model 4.12) has a maximum coefficient of determination of 0.8094 when targeting the wind observations, with 13 neurons in a single hidden layer (Fig. 4.4). There were 48 inputs to the model, consisting of time delayed intra-day NWP forecasts.

The first stage model for 24 hour ahead predictions, Model 4.13, has a maximum coefficient of determination of 0.7412 when targeting the wind observations, with 17 neurons in a single hidden layer (Fig. 4.4). There were 48 inputs to the model, consisting of time delayed inter-day NWP forecasts.

A useful output of the first stage networks is one that linearly represents the actual observations of wind speed; the second stage network then does not have to model the additional complexities it does in the baseline models, i.e. Models 1.1-1.3.

Figure 4.4: Training and Testing Error vs Neuron Number for Multiple Trials - (a) 1 and 6 hours ahead, Models 4.11-4.12. (b) 24 hours ahead, Model 4.13. Red is training performance and blue is test performance.

Table 4.4: Linear Correlation Coefficients to Observed Wind Speeds - NWP versus Combined NWP

Horizon           Most highly correlated single NWP feature (745, 745, 746)   Output of Network Stage 1
1 hour ahead      0.83                                                        0.90
6 hours ahead     0.83                                                        0.90
24 hours ahead    0.78                                                        0.86

Considering the weather forecast series judged best, series 745 for same day forecasts (1 hour ahead and 6 hours ahead) and series 746 for inter-day forecasts (24 hours ahead), these have correlations with the observed wind speed of only 0.8279 and 0.7838 respectively. Note that this is computed over both the test and training indices of the data to show the overall improvement in linearity.

The output of the first stage ANN, however, has a correlation with the actual wind speed observations of 0.8978 in the 1 hour ahead and 6 hours ahead cases, and of 0.8637 in the 24 hour ahead case. This is significantly better than the baseline case, even with the best single forecast. See Table 4.4.
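The figures compared here are plain Pearson correlation coefficients between a forecast series and the observations. A minimal sketch of that comparison, with synthetic series standing in for the best single NWP series and the stage-1 output (the noise levels are chosen only to roughly reproduce the 0.83 vs 0.90 gap, not taken from the data):

```python
import numpy as np

rng = np.random.default_rng(1)
obs = rng.normal(8.0, 2.5, 1000)             # observed wind speed (m/s)
best_nwp = obs + rng.normal(0, 1.7, 1000)    # a noisy single NWP series
stage1_out = obs + rng.normal(0, 1.2, 1000)  # less noisy composite forecast

# Pearson correlation of each candidate forecast with the observations.
r_nwp = np.corrcoef(best_nwp, obs)[0, 1]
r_stage1 = np.corrcoef(stage1_out, obs)[0, 1]
```

With these noise levels the composite forecast tracks the observations more closely, mirroring the improvement reported in Table 4.4.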

The qualitative view also confirms the improved linearity of the relationship between forecast and actual observation after using the first stage ANN targeting wind speed observations. See Fig. 4.5.


Figure 4.5: Scatter Plots of NWP vs Observed Wind Speed - (a) 1 and 6 hour ahead plot of best single NWP vs observed wind speed (b) 1 and 6 hour ahead plot of network stage 1 vs observed wind speed (c) 24 hour ahead plot of best single NWP vs observed wind speed (d) 24 hour ahead plot of network stage 1 vs observed wind speed


4.3.2 Second Stage Models - Extra Input

The first type of second stage network introduces the output of the first stage network as an extra input. Unlike Models 2.1-2.3 and Models 3.1-3.3, the training and test data are of the same type; i.e. no training on observations and testing on forecast data is performed.
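Adding the stage-1 output as an extra input amounts to column-concatenating it with the existing feature matrix before training the second stage. A schematic sketch; the split between lagged power and NWP columns is an assumption chosen so the total matches Model 4.21's 12 inputs:

```python
import numpy as np

n = 100  # number of training samples (placeholder)
nwp_features = np.ones((n, 9))       # intra-day NWP inputs (placeholder values)
lagged_power = np.ones((n, 2))       # time-delayed power production inputs
stage1_forecast = np.ones((n, 1))    # composite forecast from the first stage

# Baseline input matrix vs. the extended one with the extra input column.
x_baseline = np.hstack([lagged_power, nwp_features])
x_extended = np.hstack([lagged_power, nwp_features, stage1_forecast])
```

Because the extra column is produced by a network trained on the same historical period, no observation/forecast mismatch between training and test inputs is introduced.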

Figure 4.6: Training and Testing Error vs Neuron Number for Multiple Trials - (a) 1 hour ahead, Model 4.21 (b) 6 hours ahead, Model 4.22 (c) 24 hours ahead, Model 4.23. Red is training perf. and blue is test perf.

For 1 hour ahead predictions using Model 4.21, the maximum coefficient of determination achieved was 0.9273, with 5 neurons in a single hidden layer (Fig. 4.6). There were 12 inputs to the model, consisting of time delayed power production data, first stage network forecasts and intra-day NWP forecasts.

For 6 hours ahead predictions using Model 4.22, the maximum coefficient of determination achieved was 0.7990, with 25 neurons in a single hidden layer (Fig. 4.6). There were 18 inputs to the model, consisting of first stage network forecasts and intra-day NWP forecasts.

For 24 hours ahead predictions using Model 4.23, the maximum coefficient of determination achieved was 0.7396, with 20 neurons in a single hidden layer (Fig. 4.6). There were 10 inputs to the model, consisting of first stage network forecasts and inter-day NWP forecasts.

These were positive results, yielding improvements of 0.0018, 0.0065 and 0.0106 in coefficient of determination over the baseline Expektra models, Models 1.1-1.3. The improvement is most pronounced in the 24 hour ahead case; this is not surprising, since it is prior knowledge that the 1 hour ahead model is dominated by the previous hour's power production input.
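The improvements quoted are simple differences of coefficient-of-determination scores. Assuming the thesis uses the standard definition, R² = 1 − SS_res/SS_tot, it can be computed as:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Standard coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])
perfect = r_squared(y, y)                        # perfect prediction scores 1
mean_only = r_squared(y, np.full(4, y.mean()))   # mean predictor scores 0
```

Under this definition a model that only predicts the historical mean scores 0, so even small positive differences over a strong baseline represent genuine variance explained.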

4.3.3 Second Stage Models - Replacing Forecasts

Taking the results of the previous model set into consideration, it seemed possible to introduce more improvement to performance by replacing all baseline model NWP forecast
