
DEGREE PROJECT IN MATHEMATICS, SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2018

A machine-learning approach to estimating the performance and stability of the electric frequency containment reserves

HENRIK EKESTAM

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ENGINEERING SCIENCES


A machine-learning approach to estimating the performance and stability of the electric frequency containment reserves

HENRIK EKESTAM

Degree Projects in Optimization and Systems Theory (30 ECTS credits)
Degree Programme in Applied and Computational Mathematics (120 credits)
KTH Royal Institute of Technology, year 2018

Supervisor at Svenska Kraftnät: Andreas Westberg
Supervisor at KTH: Anders Forsgren

Examiner at KTH: Anders Forsgren


TRITA-SCI-GRU 2018:283
MAT-E 2018:64

KTH Royal Institute of Technology
School of Engineering Sciences
KTH SCI
SE-100 44 Stockholm, Sweden
URL: www.kth.se/sci


Abstract

The stability and reliability of the power system are of utmost importance, with the frequency quality being one measure. For a number of years, the frequency quality has been decreasing in the Nordic synchronous area. The Revision of the Nordic Frequency Containment Process project has introduced a proposed set of pre-qualification requirements to ensure the stability and performance of frequency containment reserves. The purpose of this thesis has been to examine the potential of complementing the evaluation of the requirements through the use of machine-learning methods applied to signals sampled during normal operation of a power plant providing frequency containment. Several simulation models have been developed to generate such signals, with the results fed into five machine-learning algorithms for classification: decision tree, adaboost of decision trees, random forest, support vector machine, and a deep neural network. The results show that, on all of the simulation models, it is possible to extract information regarding stability and performance while preserving, with high accuracy, the distribution of physical parameters of the approved samples. The conclusion is that machine-learning methods can be used to extract information from operation signals, and further research is recommended to determine how this could be put into practice and what precision is needed.

Sammanfattning

Stabilitet och pålitlighet hos kraftsystemet är av yttersta vikt, med frekvenskvaliteten som en indikator. Under ett antal år har frekvenskvaliteten sjunkit inom det nordiska synkronområdet. Projektet The Revision of the Nordic Frequency Containment Process har föreslagit nya pre-kvalificeringskrav syftande till att säkerställa stabilitet och prestanda hos frequency containment reserves. Syftet med detta examensarbete har varit att utforska möjligheterna att komplettera utvärderingen av dessa krav genom att använda maskininlärningsmetoder applicerade på signaler hämtade från normal drift av ett kraftverk som levererar frequency containment. Flera simuleringsmodeller har utvecklats för att generera sådana signaler, som sedan har analyserats av fem olika maskininlärningsmetoder för klassificering: beslutsträd, adaboost av beslutsträd, random forest, stödvektormaskin samt ett djupt neuralt nätverk. Resultaten visar att det för samtliga simuleringsmodeller har varit möjligt att extrahera information kring stabilitet och prestanda och samtidigt med hög noggrannhet bevara fördelningen av fysikaliska parametrar hos godkända prover.

Slutsatsen är att maskininlärningsmetoder kan användas för att extrahera information från driftsignaler samt att fortsatta undersökningar rekommenderas för att avgöra hur denna information kan användas praktiskt, och vilken precision i bedömningarna som då skulle krävas.


Contents

1 Introduction
  1.1 Aim
  1.2 Method
  1.3 Limitations in scope
  1.4 Structure of the report

2 Theory - Frequency containment reserves
  2.1 Power usage and frequency
  2.2 Frequency containment reserves
    2.2.1 Pre-qualification requirements for FCR-N
    2.2.2 Pre-qualification procedure for FCR-N
    2.2.3 Simulation models for hydro power plants
  2.3 Per unit scaling
    2.3.1 Machine base
    2.3.2 FCP base
    2.3.3 Conversion table

3 Theory - Machine learning
  3.1 Introduction to machine learning
  3.2 Estimating the quality of classification
    3.2.1 Bias – Variance trade-off
    3.2.2 Training, validation and test sets
    3.2.3 Cross validation
  3.3 Classifiers
    3.3.1 Decision tree
    3.3.2 Random forest
    3.3.3 Adaboost
    3.3.4 Support vector machine
    3.3.5 Neural network

4 Methodology
  4.1 Simulations
  4.2 Key performance indicators
    4.2.1 Time domain
    4.2.2 Frequency domain
    4.2.3 Mixed domain
  4.3 Hyperparameter tuning
  4.4 Data management
    4.4.1 Feature selection
    4.4.2 Scaling and centring
  4.5 Evaluation of classification quality
    4.5.1 Bootstrapping
    4.5.2 Accuracy assessment table
    4.5.3 Confusion matrix
    4.5.4 Classifier parameter distributions

5 Linear model
  5.1 Simulations
  5.2 Results
    5.2.1 Feature selection
    5.2.2 Comparison of classifiers
    5.2.3 Confusion matrices
    5.2.4 Classifier parameter distributions
  5.3 Discussion of model and results

6 Non-linear model
  6.1 Simulations
  6.2 Results
    6.2.1 Feature selection
    6.2.2 Comparison of classifiers
    6.2.3 Confusion matrices
    6.2.4 Classifier parameter distributions
  6.3 Discussion of model and results

7 Non-linear model with noise
  7.1 Simulations
  7.2 Results
    7.2.1 Feature selection
    7.2.2 Comparison of classifiers
    7.2.3 Confusion matrices
    7.2.4 Classifier parameter distributions
  7.3 Discussion of model and results

8 Sub-sampled non-linear model with noise
  8.1 Simulations
  8.2 Results
    8.2.1 Feature selection
    8.2.2 Comparison of classifiers
    8.2.3 Classifier parameter distributions
  8.3 Discussion of model and results

9 Discussion
  9.1 Suggestions for further research

10 Conclusions

11 Literature

Appendix A Code libraries

Appendix B Full feature selection
  B.1 Linear model
  B.2 Non-linear model
  B.3 Non-linear model with noise


List of symbols

Symbol   Physical unit   Description

b        %        Backlash
C        MW       Available FCR-N capacity for a power plant
df       Hz       Scale factor for the normal frequency band
dP       MW       Scale factor for the total FCR-N capacity
2D       MW       Estimated backlash
∆f       Hz       Frequency deviation from nominal
∆P       MW       Delivered FCR-N
∆t       s        Sample interval
ep       %        Droop, inverse static gain
Ek       J        Kinetic energy
ϕ        –        Phase angle
f        Hz       Grid frequency
fn       Hz       Nominal grid frequency
F        –        Transfer function of controller
G        –        Transfer function of system
H        s        Inertia constant
J        kg·m²    Moment of inertia
k        %/Hz     Load frequency dependency
Ki       s⁻¹      Integral gain of controller
Kp       1        Proportional gain of controller
Mp       1        Maximal sensitivity peak value for performance
Ms       1        Maximal sensitivity peak value for stability
ω        rad/s    Angular frequency
P        MW       Power
r        Hz       Reference value for frequency deviation
s        –        Laplace variable
S        –        Sensitivity transfer function
Sn       MW       Rated power
T        s        Period, inverse of frequency
T        s        Simulated time interval
Ti       s        Integration time constant
Tw       s        Water time constant
Ts       s        Servo time constant
w        MW       Power disturbance
Y0       %        Gate set point

In many cases the parameters have been scaled to use per unit values. For the used per unit systems, refer to section 2.3.


List of abbreviations

Abbreviation   Description

AB       Adaboost – a machine learning classifier
ARX      Autoregressive exogenous (-model)
CI       Confidence interval
CV       Cross validation
DT       Decision tree – a machine learning classifier
ER       Error rate
FCP      Frequency containment process (-project)
FCR      Frequency containment reserves
FCR-D    Frequency containment reserves, disturbed operation
FCR-N    Frequency containment reserves, normal operation
FRR      Frequency restoration reserves
KPI      Key performance indicator
ML       Machine learning
MSE      Mean square error
NN       Neural network – a machine learning classifier
PI       Proportional, integral (-controller)
RAR      Analysis and review of Requirements for Automatic Reserves (-project)
ReLU     Rectified linear unit – a neural network activation function
RF       Random forest – a machine learning classifier
SVM      Support vector machine – a machine learning classifier
TSO      Transmission system operator


1 Introduction

The stability and reliability of the power system are of utmost importance. By Kirchhoff's first law [1], the power production and consumption in the power system will be in balance. One consequence of this balance in an alternating current system is that frequency deviations will occur if an imbalance between production and demand exists. Such frequency deviations may be seen as an indicator of decreased reliability of the system and are thus to be stabilised and corrected by dedicated frequency containment reserves (FCR). During normal – undisturbed – operation such containment reserves are denoted FCR-N.

For a number of years, the frequency quality has been decreasing in the Nordic synchronous area [2].

To remedy this decrease in frequency quality, the Nordic transmission system operators in 2014 initiated the Revision of the Nordic Frequency Containment Process (FCP) project. As part of the revision, the FCP project has introduced a proposed set of pre-qualification requirements to ensure the stability and performance of FCR providers [3]. The purpose of this Master's thesis has been to examine the potential of complementing the evaluation of the FCP requirements through the use of machine-learning methods applied to the input and output signals sampled during normal operation of an FCR-N providing power plant.

1.1 Aim

The aim of the project is to examine the potential of complementing the evaluation of the FCP requirements on performance and stability through the use of machine-learning methods applied to the input and output signals sampled during normal operation of an FCR-N providing power plant. The examined methods should be non-invasive while keeping physical interpretability and transparency with regard to the information handled by the algorithm and how it is used.

1.2 Method

Outline:

• Perform simulations on existing models of a hydro power plant.

• Take the generated input and output signals and calculate key performance indicators with physical interpretability.

• Use the key performance indicators as input to the examined machine-learning algorithms to train models for evaluation.

• Evaluate the accuracy of the resulting machine-learning models.

• Assess the potential of the examined methods and the requisites for successful operation, and make suggestions for further research.

The methodology of the thesis is further described in chapter 4.

1.3 Limitations in scope

The project is to:

• Constitute only an initial examination of the potential of the proposed methods to complement the FCP pre-qualification.


• Not change or evaluate the established models of a hydro power plant. The model is taken to be equivalent to reality.

• Take the FCP pre-qualification requirements [3] as given, using them as the answer key for which power plants are to be seen as qualified.

1.4 Structure of the report

Chapter 1 is an introduction to this report. Chapter 2 contains a depiction of the theoretical background regarding frequency containment of the power grid and the proposed requirements on stability and performance. A corresponding introduction to the machine-learning setting and the applied classifiers is given in chapter 3. The methodology of this thesis – including the simulations, the calculation of key performance indicators and the construction of the machine-learning models, as well as the evaluation of the models – is presented in chapter 4. Four variants of the simulation model with gradually increasing complexity have been applied; chapters 5, 6, 7, and 8 respectively contain a detailed description of each simulation model, the results on that model, and a shorter discussion of the respective model and corresponding results. A more in-depth discussion of the methodology and the general results, as well as suggestions for further research, is given in chapter 9. The conclusions of the thesis are presented in chapter 10.


2 Theory - Frequency containment reserves

2.1 Power usage and frequency

The reliability of the power system is of paramount importance. By Kirchhoff's first law [1], every amount of power consumed at the edge of the power grid has to be supplied by the grid at the same instant as it is used by the consumer. Hence, the power grid reacts instantaneously to the demands of the market at every point in time. The first line of defence against changes in demand is the inertia of the synchronous machines in the system. The kinetic energy Ek of a generator with angular frequency ω is

Ek = Jω² / 2 ,   (1)

where J is the moment of inertia of the rotating mass. At nominal frequency fn = 50 Hz the relation may be taken to be

Ek,n = Jωn² / 2 = H · Sn ,   (2)

where Sn is the rated power of the generator at nominal frequency and H is the inertia constant [4]. By taking the time derivative of the preceding equation, a relation between power usage and angular velocity may be derived as

Pt − Pg = Jω dω/dt .   (3)

This is the so-called swing equation, where Pt is the power injected into the generator by the turbine and Pg is the power extracted from the generator by the power grid [5]. It is readily seen that an imbalance between the power produced by the turbines and the power consumed by the grid leads to a change in angular frequency, with a corresponding change in kinetic energy of the generator. A synchronous generator is synchronised with the power grid, i.e. the frequency of the generator is the same as the frequency of the voltage in the power grid. Thus, when energy is taken from the generators in the system in order to increase supply, a corresponding dip in frequency is seen across the power grid. By the same reasoning, a decrease in demand from the consumers will be compensated by the system as increased kinetic energy of the generators, with a corresponding increase in frequency. It can thus be concluded that changes in the electric frequency of the power grid may be used as a measure of the imbalances in power production and consumption throughout the power grid.
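As a concrete illustration of equation (3), the sketch below integrates the swing equation with a forward Euler step for a single machine under a constant power deficit. All numerical values are hypothetical and chosen only to show the resulting frequency dip; this is not a model from the thesis.

```python
# Forward-Euler integration of the swing equation (3):
#   Pt - Pg = J * omega * d(omega)/dt
# Hypothetical single-machine example; all numbers are illustrative only.
import math

J = 50_000.0                  # moment of inertia [kg m^2], assumed value
omega = 2 * math.pi * 50.0    # initial angular frequency [rad/s]
P_t = 100e6                   # turbine power [W]
P_g = 101e6                   # grid load [W], i.e. a 1 MW deficit
dt = 0.01                     # time step [s]

for _ in range(int(5.0 / dt)):           # simulate 5 seconds
    domega = (P_t - P_g) / (J * omega)   # d(omega)/dt from equation (3)
    omega += domega * dt

f = omega / (2 * math.pi)     # electric frequency [Hz]
print(f"frequency after 5 s: {f:.3f} Hz")  # below 50 Hz, since Pg > Pt
```

A sustained deficit of 1 MW drains kinetic energy from the rotating mass, so the frequency drifts below 50 Hz until some reserve restores the balance.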

2.2 Frequency containment reserves

The frequency of the system needs to be stabilised at a nominal value as it indicates the production balance in the grid. Devices connected to the grid may also be harmed if the electric frequency drifts outside of the design specifications for the device [4]. Thus deviations in frequency need to be contained in magnitude. By the preceding section, a deviation from nominal frequency arises when the kinetic energy of the generators is used to even out imbalances in supply and demand of power. Special power sources and/or power sinks are deployed within the system in order to react to sudden changes in frequency and try to contain them by providing additional supply or demand. Such reserves are called frequency containment reserves (FCR). Under nominal circumstances the frequency is to be held at f = 50.0 ± 0.1 Hz; the reserves responsible under these conditions are called FCR-N. When the system is disturbed, i.e. when the frequency is outside of the nominal band, another set of reserves by the name FCR-D is activated. The two reserves work together to stabilise the power imbalance in the system as well as the frequency deviation. The frequency may then be restored by the slower frequency restoration reserves (FRR), which also restore the FCR capacity [6].

2.2.1 Pre-qualification requirements for FCR-N

FCR-N capacity is supplied by actors on the market on behalf of the transmission system operator (TSO).

Since 2017, a new scheme for pre-qualification of FCR-N capacity has been proposed by the Nordic TSOs in order to ensure the quality of the frequency reserves [3]. The Frequency Containment Process (FCP) project has established two conditions, on performance and stability respectively, to be fulfilled by the FCR-N supplier in a pre-qualification process. The two conditions are understood by regarding the system as being of the general form depicted in Figure 1 below, realised as a non-linear model of a hydro power plant in the following section.

[Block diagram: reference r and control unit F(s); disturbance d added between F(s) and the system G(s); output y.]

Figure 1. General overview of a feedback system with disturbance.

Two versions of the system are considered when stipulating the necessary conditions on performance and stability. The first version corresponds to a scenario where the amount of inertia within the system is taken to be at a minimal value, and thus models the worst case scenario. This scenario is used for the stability criterion in order to ensure robust stability, i.e. the system should be stable even under model uncertainties. The transfer function representing the system is in this case denoted Gmin(s). In the second scenario, the amount of inertia within the system is set at the average value, which is used to establish the performance criterion. Thus, the performance condition states that the system should have some set performance at nominal conditions, without regard to uncertainties in model and control. The transfer function for the system with average inertia is denoted Gavg(s). [3]

By defining the sensitivity transfer function as

S(s = jω) = 1 / (1 + F(s)G(s)) ,   (4)

the stability criterion from the FCP project may be formulated as

||Smin(s)||∞ < Ms ,   (5)

i.e. the supremum of the magnitude of the sensitivity function for the minimal-inertia system is to be below a threshold taken to be Ms = 2.31 [3]. Meanwhile, the performance requirement is stated as

||Savg(s)||∞ < σf / ||D(s)Gavg(s)||∞ ,   (6)

or equivalently


||Savg(s)Gavg(s)D(s)||∞ < σf ,   (7)

where Gavg is the transfer function of the average-inertia system and D(s) is the transfer function from unfiltered white noise to the disturbance [3], such that

d = D(s) · w, (8)

where w is specified to be a white noise source and d is the disturbance depicted in figure 1 above. The performance and stability conditions are illustrated in figure 2 below. The constant σf represents the power spectral density of the frequency deviation and scales to 1 in per unit scaling [3], further described in section 2.3 below.

[Log-log plot "Requirements by the FCP-Project": magnitude (abs) versus frequency (rad/s) for Smin(s) and Savg(s) against the stability and performance requirement curves.]

Figure 2. Illustration of the stability and performance requirements in equations (5) and (6) respectively. The solid blue line is to be below the dashed blue line at all frequencies for the stability requirement to be fulfilled. Similarly, the red solid line is to be below the dashed line of the same colour for fulfilment of the performance requirement.
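Numerically, the stability check of equation (5) amounts to sweeping the sensitivity magnitude over a frequency grid and comparing the peak against Ms. The sketch below does this for a simple PI controller and a first-order system chosen purely for illustration; it is not the FCP reference model, and all gains and time constants are assumed values.

```python
# Numerical check of the stability criterion (5), ||S||_inf < Ms, on a
# logarithmic frequency grid. F and G are simple illustrative transfer
# functions, NOT the FCP reference model.

M_s = 2.31  # stability threshold from the FCP project [3]

def F(s):   # PI controller, assumed gains
    Kp, Ki = 1.0, 0.5
    return Kp + Ki / s

def G(s):   # first-order system, assumed time constant
    return 1.0 / (1.0 + 10.0 * s)

def sensitivity_peak(F, G, n=2000):
    """Largest |S(jw)| over a log-spaced grid from 1e-4 to 1e2 rad/s."""
    peak = 0.0
    for k in range(n):
        w = 10 ** (-4 + 6 * k / (n - 1))
        s = complex(0.0, w)
        S = 1.0 / (1.0 + F(s) * G(s))   # equation (4)
        peak = max(peak, abs(S))
    return peak

peak = sensitivity_peak(F, G)
print(f"||S||_inf ~= {peak:.3f}, requirement fulfilled: {peak < M_s}")
```

The grid approach mirrors how the sine tests below probe the response at a finite set of frequencies; a fine enough grid is needed to catch the sensitivity peak.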

2.2.2 Pre-qualification procedure for FCR-N

The performance and stability requirements described in the preceding section are evaluated for a power plant by performing sine-in-sine-out tests to estimate the response of the control unit F(s) and, by extension, the sensitivity transfer function S(s). The tests are conducted by disconnecting the feedback loop in Figure 1 above, replacing the feedback signal with a sine wave of a specific frequency superimposed on a nominal signal with fn = 50 Hz, and measuring the corresponding output. By measuring the gain and phase shift of the signal, the transfer function at that specific frequency may be estimated. By performing these sine tests at a representative range of frequencies, the total response of the sensitivity function may be approximated and evaluated against the performance and stability requirements.

By the FCP project, the following time periods are to be used for the sine tests [7], with T = 2π/ω:

T (seconds): 10, 15, 25, 40, 50, 60, 70, 90, 150, 300

Here the time periods are approximately evenly spaced on a logarithmic scale.
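How one sine test yields the gain and phase shift at one frequency can be sketched by correlating the sampled output with sine and cosine at the test frequency, a standard Fourier-coefficient technique (not necessarily the exact estimator used in the pre-qualification procedure). The output signal below is synthetic, with a hypothetical gain and phase lag.

```python
# Estimate gain and phase shift at one sine-test period by correlating the
# sampled output with sin/cos at the test frequency. Synthetic signal.
import math

T = 60.0                 # test period [s], one of the FCP periods
w = 2 * math.pi / T      # angular frequency [rad/s]
dt = 0.1                 # sample interval [s]
n = int(5 * T / dt)      # five full periods of samples

# Synthetic "measured" output: gain 0.8, phase lag 30 degrees (hypothetical)
true_gain, true_phase = 0.8, math.radians(-30)
t = [k * dt for k in range(n)]
y = [true_gain * math.sin(w * tk + true_phase) for tk in t]

# Fourier coefficients at the test frequency (exact over whole periods)
a = 2 / n * sum(yk * math.sin(w * tk) for yk, tk in zip(y, t))
b = 2 / n * sum(yk * math.cos(w * tk) for yk, tk in zip(y, t))

gain = math.hypot(a, b)                 # |response| at frequency w
phase = math.degrees(math.atan2(b, a))  # phase shift in degrees
print(f"gain ~= {gain:.3f}, phase ~= {phase:.1f} deg")
```

Averaging over whole periods makes the sin/cos correlations orthogonal, so the estimate recovers the gain 0.8 and the 30 degree lag exactly for a noise-free signal.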

In addition to the sine-in-sine-out test, a step response test is to be performed to establish the maximal capacity of the power plant for providing FCR-N [7], as illustrated in Figure 3 below.


[Plot "FCR-N Normalisation step sequence": input frequency f (Hz) stepping within 49.8–50.2 Hz and output power P (MW), with the step responses ∆P1, ∆P2, ∆P3, ∆P4 marked.]

Figure 3. Pre-qualification step sequence with corresponding definitions of ∆P1, ∆P2, ∆P3, ∆P4.

From the step sequence the backlash of the power plant is estimated as the parameter 2D, which is used to determine the available FCR-N capacity C the power plant may deliver. The parameters 2D and C are defined as:

2D = ( ||∆P1| − |∆P2|| + ||∆P3| − |∆P4|| ) / 2 ,   (9)

C = ( |∆P1| + |∆P3| − 2D ) / 2 .   (10)
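Equations (9) and (10) translate directly into code; the step responses below are hypothetical values standing in for measurements from the step sequence in Figure 3.

```python
# Backlash estimate 2D and available FCR-N capacity C from the step
# sequence, per equations (9) and (10). Delta-P values are hypothetical.

dP1, dP2, dP3, dP4 = 2.6, -2.4, -2.5, 2.3   # measured step responses [MW]

two_D = (abs(abs(dP1) - abs(dP2)) + abs(abs(dP3) - abs(dP4))) / 2  # eq. (9)
C = (abs(dP1) + abs(dP3) - two_D) / 2                              # eq. (10)

print(f"2D = {two_D:.2f} MW, C = {C:.2f} MW")
```

The asymmetry between the up and down steps is attributed to backlash, and the capacity credited to the plant is reduced accordingly.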

In addition, the FCP requirements state that the results from the sine sweep should be interpolated between data points, thus increasing the difficulty of fulfilment. The requirements also allow for a reduction of the requirements by 5 % to account for measurement uncertainties [3]. Neither of these conditions has been considered for this thesis; their effects partially cancel.

2.2.3 Simulation models for hydro power plants

The simulation model used throughout this thesis is based on the reference model of a hydropower plant defined by the FCP project. The simulation model is presented in Figure 4 below and will be briefly discussed in this section together with the corresponding pre-qualification model in Figure 5.


[Block diagram of the simulation model: PI controller with droop ep, servo, gate rate limiter (GR), backlash (BL) and penstock inside the control unit F(s); a system inertia block with power disturbance w [MW]; scaling blocks 1/f0, 1/df, 1/C and Pn2R · 600/C (Local2Global) between per unit systems; signals ∆f (frequency, p.u.fcp) and ∆P (FCR, p.u.fcp).]

Figure 4. Model used for simulation of the operation of a hydro power plant and its FCR-N characteristics. The dashed box corresponds to the control unit F(s) in Figure 1.

The simulation model consists of two sub-systems, the power system G(s) and the control system F(s). The power system represents the frequency deviation that arises when inertia is used to even out imbalances in power production and consumption, while the control system models the inflow of water into the power system and the resulting power generation by the turbines. The power system transfer function may be derived by taking the Laplace transform of the swing equation. The control system involves a proportional-integral (PI) regulator with parameters Kp and Ki, representing proportional and integral gain respectively, a servomotor with time constant Ts that regulates the water flow into the turbines, and a gate rate limiter that limits the rate of change of the gate servo signal. The water flow from the gate servo is modelled with a backlash, i.e. hysteresis, and sent into a system representing the waterways, with water time constant Tw. The waterways block has a zero in the right half of the complex plane and is thus a non-minimum phase system. The parameter ep, called droop, limits the static gain of the PI controller, which becomes 1/ep [5]. The static gain represents the system behaviour for a constant deviation from the set point. Increasing values of the droop decrease the gain and thus decrease the response from the system for a given deviation in frequency. The parameter Pn2R is a scaling factor introduced to scale the signal between per unit systems, further discussed in section 2.3.

The backlash models how the system resists acting on small changes in input, i.e. no change in water flow happens at all until the desired change is above some threshold. This is a highly non-linear behaviour that introduces hysteresis. Thus, it is harder to react to small changes in frequency within the non-linear model that includes backlash than it would be in a strictly linear model. A semi-linearised model is achieved by setting the backlash between the servomotor and the waterways to zero. This version of the simulation model is in the following referred to as the linear model. The semi-linearised model still contains some non-linearities in the form of the gate rate limiter.
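A minimal sketch of the backlash behaviour, assuming the standard backlash (play) operator: the output only moves once the input has traversed half the dead band in either direction. This stand-alone version is for illustration only and is not taken from the thesis's simulation model.

```python
# Minimal backlash (hysteresis) operator: the output stays unchanged
# until the input has moved through half the dead band b on either side.

def backlash(inputs, b, y0=0.0):
    """Apply a backlash of total width b to a sequence of input values."""
    y, out = y0, []
    for u in inputs:
        if u - y > b / 2:      # input pushes the upper edge of the gap
            y = u - b / 2
        elif y - u > b / 2:    # input pushes the lower edge of the gap
            y = u + b / 2
        # otherwise the input moves inside the dead band: y is unchanged
        out.append(y)
    return out

# A small oscillation inside the dead band produces no output movement:
print(backlash([0.0, 0.04, -0.04, 0.04, -0.04], b=0.1))  # all 0.0

# A larger excursion is transmitted, minus the half-gap:
print(backlash([0.0, 0.5], b=0.1))  # [0.0, 0.45]
```

This is exactly the property the text describes: small frequency deviations are swallowed by the dead band, which is what makes the non-linear model harder to control, and harder to classify from operation signals.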

[Block diagram of the pre-qualification model: a sine-wave frequency input [Hz] scaled by 1/f0 feeds the PI controller with droop ep, servo, gate rate limiter (GR), backlash (BL) and penstock, producing the FCR response ∆P [MW] via the Pn2R · 600/C (Local2Global) scaling; no feedback loop.]

Figure 5. Model used in simulated pre-qualification tests to evaluate the FCR-N stability and performance requirements.

The model for simulated pre-qualification tests is based on the simulation model but with the feedback loop disconnected, and is depicted in Figure 5 above.


2.3 Per unit scaling

Quantities are sometimes scaled to express them as fractions of typical values for a production unit, with the intent of making production units comparable. For example, the power output may be divided by the rated power to get an expression for the utilisation of the plant. The rated power Sn is then used as the base for power, expressed per unit by calculating

Ppu = P / Pbase = P / Sn .   (11)

The FCP project defines two such bases, the machine base and the FCP base, introduced below. [3]

2.3.1 Machine base

The machine base is obtained by taking the summed rated power of all FCR-N providing plants as the power base. The frequency base is taken as the nominal frequency, i.e. 50 Hz. Hence the power and frequency bases become:

Pbase = Sn = (ep · dP · f0 / df) · (dP / C) = 1 p.u. ,   fbase = f0 = 50 Hz = 1 p.u.   (12)

The expression for the power base may be somewhat simplified by introducing the parameter Pn2R,

Pn2R = ep · dP · f0 / df ,   (13)

such that the power base becomes Pbase = Pn2R · dP/C. Since the droop ep per the FCP project [3] is defined as

ep = (df / f0) / (dP / Sn-FCR) ,   (14)

the parameter Pn2R is equivalent to the rated capacity Sn-FCR of an FCR-N providing plant. It is also seen that with an FCR-N capacity of C per plant, and a total capacity of dP, the number n of such plants has to be

n = dP / C ,   (15)

and thus

Pn2R · dP / C = n · Sn-FCR .   (16)


2.3.2 FCP base

The purpose of the FCP base is to scale the FCR-N contribution from a power plant by the maximal allowed contribution C, per the FCP project pre-qualification step sequence illustrated in Figure 3. The frequency is scaled with the full activation frequency deviation for FCR-N, i.e. 0.1 Hz. Thus the power and frequency bases become:

Pbase = C = 1 p.u. ,   fbase = df = 0.1 Hz = 1 p.u.   (17)

2.3.3 Conversion table

Desired quantity = Scale factor × Given quantity   (18)

                          Given quantity
  Desired quantity    P            PMB              PFCP
  P                   1            Pn2R · dP/C      C
  PMB                              1                C²/(Pn2R · dP)
  PFCP                                              1

Here the following relation holds, per equation (16):

C² / (Pn2R · dP) = C / (n · Sn-FCR) .   (19)
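The two bases and the conversion factor can be sketched numerically; all plant parameters below are hypothetical and serve only to check that equations (12), (13), (17) and (19) fit together.

```python
# Per unit scaling of a delivered power P under the machine base and the
# FCP base of section 2.3. All plant numbers are hypothetical.

C = 2.0              # available FCR-N capacity of the plant [MW]
dP = 600.0           # scale factor for the total FCR-N capacity [MW]
f0, df = 50.0, 0.1   # nominal frequency and normal-band scale factor [Hz]
ep = 0.04            # droop, assumed per unit value

Pn2R = ep * dP * f0 / df       # equation (13)
P_base_MB = Pn2R * dP / C      # machine-base power, equation (12)
P_base_FCP = C                 # FCP-base power, equation (17)

P = 1.5                        # a delivered power [MW]
P_MB = P / P_base_MB           # power in machine base [p.u.]
P_FCP = P / P_base_FCP         # power in FCP base [p.u.]

# Conversion-table consistency: FCP base -> machine base scale factor
factor = C**2 / (Pn2R * dP)    # left-hand side of equation (19)
print(P_MB, P_FCP, factor)
```

Multiplying the FCP-base value by the factor of equation (19) reproduces the machine-base value, which is the consistency the conversion table expresses.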


3 Theory - Machine learning

In this chapter the machine-learning setting is introduced in section 3.1, and an explanation of how the accuracy of classification can be assessed is given in section 3.2. The classifiers that have been applied are discussed in section 3.3.

3.1 Introduction to machine learning

The machine-learning problem is to establish a mathematical model that expresses the relation between some input and output well, i.e. to find a function that maps the input to the output. The model may be used for either prediction or inference. In the prediction setting the aim is to predict properties of the unknown output from given inputs. This can mathematically be seen as finding correlations in a known data set without regard to causation. In the inference setting the aim is instead to find such causations in the data set. The main difference between prediction and inference is thus that when performing predictions the aim is to estimate the output corresponding to a given input, i.e. make as correct predictions as often as possible, while the aim when performing inference is to explain why the output changes in that specific way. Hence a prediction model focuses on achieving good estimates, while an inference model puts more emphasis on the interpretation of the connections that arise within the model. The result of this difference is that a model created for performing predictions should be evaluated by its prediction accuracy on independent, previously unseen, data; the sources of the correlations found by the model are of less interest. An inference model is harder to evaluate because of the need to discern the difference between correlations and causations. All models explored within this thesis are in the prediction setting.

The model is created by a specified machine-learning algorithm that is supplied with a set of known historical inputs with corresponding outputs. The inputs as well as the outputs may be quantitative or qualitative, i.e. categorical. If the output to be estimated by the model is quantitative the problem is of the regression type, while categorical outputs correspond to a classification problem. As the aim of this thesis is to predict whether or not a power plant fulfils the FCP requirements, all models explored will be classification models.

The process of creating the model from historical data is called training the model. The inputs and outputs corresponding to the same historical event may be collected into a set denoted a training sample. The sample consists of a vector of inputs, denoted the features of the sample, as well as the output, which is denoted a label in the classification setting. The data set for training the model consists of all the historical samples that are known at the time of training. Some of the samples in the data set are withheld from the model at the time of training, to instead be used as independent data for estimating the classification accuracy of the trained model. The samples withheld from training constitute the test set, while the samples used for training are denoted the training set. In addition to enabling an assessment of the classification accuracy, the partitioning enables a method to reduce the amount of overfitting in the constructed model. Overfitting occurs when the model is permitted to learn from irrelevant information in the historical data set, for example random noise in the data or statistical anomalies in the distribution of the samples. Such dependence on irregularities in the training data will appear as decreased test set accuracy even while the training set accuracy remains high.

All of the classifiers examined for this thesis are represented by the choices of parameters and hyperparameters of the respective machine-learning model. The hyperparameters describe the general structure of the model, while the parameters represent the precise details within that structure. For example, if polynomial regression is applied, the hyperparameter would be the degree of the polynomial while the parameters of the model correspond to the coefficients of the polynomial. The training process introduced above only decides on the choices of parameters. If hyperparameter selection is also to be applied, an additional validation set of samples withheld from the training process is needed to independently decide on hyperparameter choices. Alternatively, cross validation may be applied on the training set, at the cost of increased computational complexity.
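The partitioning into training, validation and test sets described above can be sketched in plain Python. The data and the split fractions below are hypothetical (a common 60/20/20 choice, not taken from the thesis).

```python
# Partition a data set of (features, label) samples into training,
# validation and test sets, as described above. Hypothetical data.
import random

random.seed(0)
samples = [([random.random() for _ in range(5)], random.randint(0, 1))
           for _ in range(100)]

random.shuffle(samples)            # avoid any ordering effects in the data
n_train = int(0.6 * len(samples))  # 60 % for fitting parameters
n_val = int(0.2 * len(samples))    # 20 % for choosing hyperparameters

train = samples[:n_train]
val = samples[n_train:n_train + n_val]
test = samples[n_train + n_val:]   # held out until the final evaluation

print(len(train), len(val), len(test))  # 60 20 20
```

The key discipline is that the test set is touched exactly once, for the final accuracy estimate; reusing it for model or hyperparameter choices would turn it into a second validation set and bias the estimate.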


The machine-learning setting is further examined in the sections that follow, with special regard given to methods for estimating the classification accuracy as well as increasing it.

3.2 Estimating the quality of classification

When training on the historical samples of features – as described in the preceding section – the quality of the resulting classifier has to be examined. Section 3.2.1 establishes a limit on the accuracy that exists in a setting with random noise, and the resulting bias-variance trade-off that is to be considered when designing a classifier. Methods to perform this trade-off are introduced in section 3.2.2, by dividing the available observations into training, validation and test sets, and in section 3.2.3, by performing cross validation.

3.2.1 Bias – Variance trade-off

In regression and classification problems the objective is to construct a machine-learning model that fits the known training data and generalizes well to unknown test data. The bias is a measure of how well the model accommodates the supplied training data, where a low bias means that the model is well fitted to the data. The variance measures the extent to which the model changes when a different set of training data is supplied, e.g. the impact that noise in the data has on the model. In general it can be said that a model that fits the training data well will also be fitted to the accompanying noise in the data. Hence, a low bias model will typically be of high variance, and vice versa. Thus it is in general impossible to minimize both the bias and the variance of a model at the same time, resulting in a bias-variance trade-off that has to be made when constructing the model. This phenomenon is illustrated in Figure 6 below.


Figure 6. Illustration of the bias-variance trade-off, here shown for a linear regression on a data set with unknown noise. The low order polynomial does not fit the data very well, i.e. is biased, but is also not very sensitive to changes in the data and noise, and is thus of low variance. The high order polynomial fits the data well but is very sensitive to changes in the data or noise. Hence, the high order polynomial is of low bias but high variance. The bias-variance trade-off is to find a regression model that balances these two phenomena.

The bias-variance trade-off may be studied mathematically by introducing the setting $y = f(x) + \varepsilon$, where $y$ is the true output, $x$ is the input vector of features, $f$ is some deterministic but unknown function and $\varepsilon$ is random noise, independent of the input, with mean zero and variance $\sigma^2$. Let $S$ be a given set of pairs of inputs and outputs, $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, denoted the training set. The training set is a subset of all possible, perhaps infinite, pairs of input and output data. From this set an estimator $\hat{f}$ of the function $f$ is to be made using some specified method of estimation. Because of the limited training set and the random noise, the estimator is to be seen as a realisation from an infinite class of possible estimators that could be created by that specific method. It is thus desirable to determine some general properties of the estimator $\hat{f}$ with regard to unseen pairs of input and output data $(x_0, y_0) \notin S$.


One measure of the quality of the class of estimators that has been used is the mean square error (MSE), defined as

\[
\mathrm{MSE} = E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right]. \tag{20}
\]

It is readily seen that the MSE is non-negative, and equal to zero only for an ideal estimator. Another quality measure that is useful in discrete classification is the error rate (ER), here taken to be

\[
\mathrm{ER} = \frac{1}{n} \sum_{i=1}^{n} I(y_i \neq \hat{y}_i), \tag{21}
\]

where $I$ is an indicator function that is equal to 0 when the classification is correct and equal to 1 otherwise. Both measures can be decomposed into the variance and bias of the estimator together with the variance of the noise term, here shown for the MSE. First some preliminary results:

1. $\mathrm{Var}(X) = E[(X - E[X])^2] = E[X^2] - E[X]^2 \;\Rightarrow\; E[X^2] = \mathrm{Var}(X) + E[X]^2$, for some random variable $X$.

2. $\mathrm{Var}(y) = E[(y - E[y])^2] = E[(y - E[f])^2] = E[(y - f)^2] = E[\varepsilon^2] = \mathrm{Var}(\varepsilon) + E[\varepsilon]^2 = \mathrm{Var}(\varepsilon) = \sigma^2$.

   Here it has been used that $f$ is deterministic, i.e. that $E[f(x_0)] = f(x_0)$, and that the noise has zero mean: $E[\varepsilon] = 0$.

3. $E[y] = E[f(x)] + E[\varepsilon] = f(x)$.

From these results it is possible to find a decomposition of the MSE according to

\[
\begin{aligned}
\mathrm{MSE} &= E[(y - \hat{f})^2] = E[y^2 - 2y\hat{f} + \hat{f}^2] \\
&= E[y^2] - 2E[y\hat{f}] + E[\hat{f}^2] \\
&= \mathrm{Var}(y) + E[y]^2 - 2E[y\hat{f}] + \mathrm{Var}(\hat{f}) + E[\hat{f}]^2 \\
&= \mathrm{Var}(y) + \mathrm{Var}(\hat{f}) + \left(E[y]^2 - 2E[y\hat{f}] + E[\hat{f}]^2\right) \\
&= \mathrm{Var}(y) + \mathrm{Var}(\hat{f}) + \left(f^2 - 2f \cdot E[\hat{f}] + E[\hat{f}]^2\right) \\
&= \mathrm{Var}(y) + \mathrm{Var}(\hat{f}) + \left(f - E[\hat{f}]\right)^2 \\
&= \sigma^2 + \mathrm{Var}(\hat{f}) + \mathrm{Bias}(\hat{f})^2.
\end{aligned}
\]

Here $\mathrm{Var}(\hat{f}) = E[(\hat{f}(x) - E[\hat{f}(x)])^2]$ is a measure of the spread in the estimator when varying the training data, while $\mathrm{Bias}(\hat{f}(x)) = E[f(x) - \hat{f}(x)]$ is a measure of the mean deviation of the estimator from the true function with $x$ held constant. $\sigma^2$ represents the irreducible error and arises from the variance of the noise. Note that in the derivation given above it has been assumed that the variables exist in a continuous setting and that $\varepsilon$ and $\hat{f}$ are independent; the step replacing $E[y\hat{f}]$ by $f \cdot E[\hat{f}]$ uses this independence together with result 3. A derivation in the more general setting is found in [8].
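The decomposition can be checked numerically by Monte Carlo simulation: the three terms are estimated separately and compared against a direct estimate of the MSE. The true function, the noise level and the simple one-parameter estimator below are hypothetical choices made only for this illustration.

```python
import random

random.seed(0)

SIGMA = 0.5             # noise standard deviation (hypothetical)

def f(x):
    """True, but in practice unknown, function (hypothetical choice)."""
    return 2.0 * x

def train_estimator(n=20):
    """Fit y ~ c*x by least squares on a freshly drawn noisy training set."""
    xs = [random.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [f(x) + random.gauss(0.0, SIGMA) for x in xs]
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

x0 = 0.7                                               # fixed unseen input
preds = [train_estimator() * x0 for _ in range(5000)]  # estimator at x0, many training sets

mean_pred = sum(preds) / len(preds)
var_fhat = sum((p - mean_pred) ** 2 for p in preds) / len(preds)   # Var of estimator
bias2 = (f(x0) - mean_pred) ** 2                                   # squared bias

# Direct Monte Carlo estimate of the MSE at x0, drawing a fresh noisy y0 each time.
mse = sum((f(x0) + random.gauss(0.0, SIGMA) - p) ** 2 for p in preds) / len(preds)

decomposed = SIGMA ** 2 + var_fhat + bias2     # sigma^2 + Var + Bias^2
```

Up to Monte Carlo error, `mse` and `decomposed` agree, as the derivation predicts.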

In general it can be said that a more complex model for the estimator, i.e. a model with more degrees of freedom, will be better suited to explain the training data. The bias term arises when a model is used for estimation that is less complex than the function to be estimated, e.g. if a linear estimator $\hat{f}$ is used to estimate a quadratic function $f$. Thus, increasing the complexity of the model will typically lead to a decrease in bias. Variance, on the other hand, arises when the estimator tries to fit the noise term $\varepsilon$, as well as the input $x$, to the output $y$. A more complex model will be better able to adjust to the error term, a phenomenon called overfitting. Hence, increasing model complexity leads to increasing variance of the estimator.


Since the bias term decreases while the variance increases with increasing complexity, a trade-off between minimising bias and variance has to be made in order to minimize the MSE. This bias-variance trade-off in the error needs to be considered when designing and evaluating classifiers, and is illustrated in Figure 7 below. Two methods that have been used are dividing the available data into training and test sets, and cross validation, respectively. The methods are introduced in the sections that follow.


Figure 7. Simulated contribution to the mean square error from the variance of the estimator, the bias squared and the irreducible error $\sigma^2$ due to the variance of the noise, as the complexity of the model increases.

3.2.2 Training, validation and test sets

One way to perform the bias-variance trade-off is to randomly partition the available data into two sets: a training set and a test set. The training set is used to fit the model parameters, while the test set is used to evaluate the classification error of the resulting trained model. This method reduces the risk of overfitting, since the data set used to train the model is different from the set where the errors are evaluated, and it is computationally efficient. When training and test sets have been used in this project, a 70/30 split has typically been performed, i.e. 70 % of the data goes into the training set and 30 % into the test set.

One disadvantage of performing this partitioning is that the available data points for fitting parameters are reduced as some data points are withheld for verification. Another disadvantage of partitioning the data arises when hyperparameters of a method need to be chosen in addition to the ordinary parameters.

For example, when fitting a polynomial

\[
p_n(x) = c_0 + c_1 x + c_2 x^2 + \ldots + c_n x^n, \tag{22}
\]

the parameters are the coefficients $c_i$ of the polynomial, while the degree $n$ of the polynomial is a hyperparameter. If the coefficients are fitted for polynomials of various degrees on the training set, a second validation set is needed to determine the optimal value of the hyperparameter, i.e. which of the polynomials performs best. A third test set is then needed to evaluate the error of the final model. If only two sets are used, either the parameters or the hyperparameters are at risk of overfitting. Hence, a further partitioning beyond the training and test sets is needed when fitting hyperparameters, further reducing the amount of data available for training the model, as illustrated in Figure 8 below. To avoid this problem, cross validation may instead be performed, as described in the next section.
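The three-way workflow can be sketched with the polynomial in equation (22): coefficients are fitted on the training set for each candidate degree, the degree is chosen on the validation set, and the error is estimated on the test set. The synthetic quadratic data and the candidate degrees 0 to 5 below are chosen arbitrarily for illustration.

```python
import random

random.seed(2)

# Synthetic data from a noisy quadratic (assumed only for this illustration).
xs = [i / 30 - 0.5 for i in range(60)]
data = [(x, 1.0 + 2.0 * x - 3.0 * x ** 2 + random.gauss(0.0, 0.1)) for x in xs]
random.shuffle(data)
train, validation, test = data[:36], data[36:48], data[48:]

def fit_poly(points, n):
    """Least-squares fit of the degree-n coefficients via the normal equations."""
    A = [[sum(x ** (i + j) for x, _ in points) for j in range(n + 1)]
         for i in range(n + 1)]
    b = [sum(y * x ** i for x, y in points) for i in range(n + 1)]
    for col in range(n + 1):                 # Gaussian elimination, partial pivoting
        piv = max(range(col, n + 1), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n + 1):
            m = A[r][col] / A[col][col]
            for c2 in range(col, n + 1):
                A[r][c2] -= m * A[col][c2]
            b[r] -= m * b[col]
    coeffs = [0.0] * (n + 1)
    for i in range(n, -1, -1):               # back substitution
        coeffs[i] = (b[i] - sum(A[i][j] * coeffs[j]
                                for j in range(i + 1, n + 1))) / A[i][i]
    return coeffs

def mse(points, coeffs):
    return sum((y - sum(c * x ** i for i, c in enumerate(coeffs))) ** 2
               for x, y in points) / len(points)

# Parameters on the training set, hyperparameter n on the validation set,
# final error estimate on the test set.
best_n = min(range(6), key=lambda n: mse(validation, fit_poly(train, n)))
test_error = mse(test, fit_poly(train, best_n))
```

Note that the test set enters only in the last line, after both the parameters and the hyperparameter have been fixed.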



Figure 8. To independently determine parameters, determine hyperparameters and estimate classification errors, the set of all observations has to be divided into three separate sets if cross validation is not applied.

3.2.3 Cross validation

As an alternative to partitioning the data set into training, validation and test sets, k-fold cross validation may be performed. The idea is to randomly partition the set of observations into $k$ folds, use one of the folds as the validation set and train the model on the remaining $k - 1$ folds. Using the validation set, some specified error measure is calculated. The process is then repeated so that each of the $k$ folds is used once as a validation set and $k - 1$ times as part of the training set. The k-fold cross validation ($\mathrm{CV}_k$) estimate of the error is then achieved by calculating the average over the chosen error measure:

\[
\mathrm{CV}_k = \frac{1}{k} \sum_{i=1}^{k} \mathrm{Error}_i. \tag{23}
\]

For this thesis, the error rate in equation (21) has been chosen as the error measure when performing cross validation. By using cross validation, more data can be used for training the model and less for validation, since averaging the results increases the accuracy of the error estimates. This comes at the expense of having to retrain and retest the ML model $k$ times, once for each fold. For this project, a value of $k = 10$ has been used when performing k-fold cross validation.
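The k-fold procedure and the average in equation (23) can be sketched as follows, with the error rate of equation (21) as the error measure. The one-dimensional synthetic data and the nearest-neighbour classifier are stand-ins chosen only so the example is self-contained.

```python
import random

random.seed(3)

# Synthetic labelled samples: (feature, label), with two noisy classes.
centres = [random.choice([-1.0, 1.0]) for _ in range(90)]
samples = [(random.gauss(c, 0.5), c > 0) for c in centres]
random.shuffle(samples)

def nearest_neighbour(train, x):
    """Stand-in classifier: label of the closest training feature."""
    return min(train, key=lambda s: abs(s[0] - x))[1]

def error_rate(train, validation):
    # Equation (21): fraction of misclassified validation samples.
    wrong = sum(nearest_neighbour(train, x) != y for x, y in validation)
    return wrong / len(validation)

k = 10
folds = [samples[i::k] for i in range(k)]    # k folds of equal size
errors = []
for i in range(k):
    validation = folds[i]                    # each fold validates exactly once
    train = [s for j, fold in enumerate(folds) if j != i for s in fold]
    errors.append(error_rate(train, validation))

cv_k = sum(errors) / k                       # equation (23)
```

Each sample is thus used $k - 1$ times for training and once for validation, and the model is retrained $k$ times.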

Cross validation is used to combine the training set with the validation set, i.e. the sets used to fit the parameters and the hyperparameters, respectively. To accurately estimate the final error rate, it is still necessary to have a set of data that is independent of the training and validation data, to counteract overfitting. Hence, a separate test set is still required for error estimation on the final model.


Figure 9. Example of partitioning the original training data set into k = 3 folds to perform 3-fold cross validation. In each iteration the training subset is used to train the model, while the validation subset is used to evaluate the hyperparameters of the model. The results are then averaged over the folds.


3.3 Classifiers

In this section the classifiers that have been considered in this project are introduced together with their respective parameters and hyperparameters.

3.3.1 Decision tree

The decision tree algorithm works by performing recursive binary splitting of the feature space, as illustrated in Figure 10 below. At each node in the tree, a single feature $x_i$ from the input vector $x$ is chosen, and a threshold parameter $a$ is chosen to split the feature space into two halves according to

\[
H_1 = \{X \mid X_i < a\}, \qquad H_2 = \{X \mid X_i \geq a\}. \tag{24}
\]

This process is repeated recursively until all training samples at a node have the same classification, or a threshold for the height of the tree is reached. Note that the same feature $x_i$ may be split several times, at different nodes in the tree and for different values of $a$. To assess which feature to split at a node, and at which value, the Gini index has been used:

\[
G = \sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk}) = \sum_{k=1}^{K} \hat{p}_{mk} - \sum_{k=1}^{K} \hat{p}_{mk}^2 = 1 - \sum_{k=1}^{K} \hat{p}_{mk}^2. \tag{25}
\]

Here $K$ is the total number of classes and $\hat{p}_{mk}$ is the proportion of data points currently in region $m$ that are of class $k$. It is readily seen that the Gini index is non-negative and approaches zero as more and more of the samples in region $m$ have the same classification, i.e. if one of the ratios $\hat{p}_{mk}$ approaches one and the others go to zero. Thus, the Gini index is a measure of impurity, such that a lower index at a node means that a larger fraction of the samples at that node have the same classification. The feature $x_i$ to split and the threshold $a$ to split at are chosen greedily at each node to achieve the maximal reduction in Gini index.
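One greedy split, as described above, can be sketched as follows: for every feature and candidate threshold, the samples are divided according to equation (24) and scored by the size-weighted Gini impurity of equation (25). The four samples and their labels are hypothetical.

```python
def gini(labels):
    """Gini impurity, equation (25): 1 minus the sum of squared class proportions."""
    if not labels:
        return 0.0
    props = [labels.count(c) / len(labels) for c in set(labels)]
    return 1.0 - sum(p * p for p in props)

def best_split(samples):
    """Greedy choice of feature index i and threshold a minimising the
    size-weighted Gini impurity of H1 (X_i < a) and H2 (X_i >= a)."""
    n = len(samples)
    best = None
    for i in range(len(samples[0][0])):
        for a in sorted({x[i] for x, _ in samples}):
            h1 = [y for x, y in samples if x[i] < a]
            h2 = [y for x, y in samples if x[i] >= a]
            score = len(h1) / n * gini(h1) + len(h2) / n * gini(h2)
            if best is None or score < best[0]:
                best = (score, i, a)
    return best

# Hypothetical samples: feature vector and an approved / not-approved label.
samples = [([0.2, 1.0], "approved"), ([0.3, 2.0], "approved"),
           ([0.8, 1.5], "not approved"), ([0.9, 0.5], "not approved")]
score, feature, threshold = best_split(samples)
```

On this toy data the split on feature 0 at $a = 0.8$ separates the classes exactly, so the weighted impurity drops to zero; a full tree builder would recurse on each half until the nodes are pure or a height limit is reached.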

If the tree is allowed to grow without bounds it may cause severe overfitting on the training data. One way to prevent this is to use a hyperparameter that limits the maximal height of the classification tree. Another possibility is to let the tree grow until no further reduction in Gini index can be achieved, and then post-prune the tree by using the validation set to remove nodes such that the validation set error decreases.

Figure 10. Example of a decision tree as created by the decision tree classifier. When a new data point is to be classified, the algorithm starts at the top of the illustration and follows a path by answering yes or no to the questions stated in each box within the path. The final classification is found when reaching one of the end nodes of the tree, i.e. approved or not approved.
