DEGREE PROJECT IN ELECTRICAL ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2019

Short-term Power Load Forecasting Based on Machine Learning

YUSEN WANG

KTH ROYAL INSTITUTE OF TECHNOLOGY


Master in Electric Power Engineering
Date: December 16, 2019
Supervisor: Yu Ye
Examiner: Ming Xiao


Abstract


Sammanfattning


Contents

1 Introduction
  1.1 Objectives
  1.2 Thesis Outline
2 Literature Overview
  2.1 Classification of power load forecasting
  2.2 Traditional methods for short-term load forecasting
    2.2.1 Time Series Model
    2.2.2 Multiple Linear Regressions
    2.2.3 Grey model
    2.2.4 Expert system
  2.3 Machine Learning Methods for Load Forecasting
    2.3.1 Support Vector Machine
    2.3.2 Gradient boosting decision tree
    2.3.3 Deep Learning Methods
  2.4 Challenges
3 Methodology
  3.1 Auto-Regressive Integrated Moving Average (ARIMA) Model
  3.2 Analysis of Environmental Factors that Influence Power Load
  3.3 Analysis of power load data characteristics
    3.3.1 Power load data preprocessing
    3.3.2 Correction of abnormal data
    3.3.3 Filling Missing Data
    3.3.4 Data normalization and quantification
  3.4 Gradient Boosting Decision Tree
    3.4.1 XGBoost Algorithm
    3.4.2 LightGBM Algorithm
  3.5 Structure of recurrent neural network
    3.5.1 Basic recurrent neural network
    3.5.2 Long short-term memory (LSTM) network
    3.5.3 Gated Recurrent Unit (GRU) Network
  3.6 Indicators for Evaluating the performance
4 Experiments and Results
  4.1 Feature Selection
  4.2 ARIMA model Implementation
  4.3 XGBoost Algorithm Results
  4.4 LightGBM Algorithm Results
  4.5 LSTM Network Implementation
  4.6 GRU Network Implementation
5 Discussion and Future Work
  5.1 Summary of findings
  5.2 Optimal number of groups
  5.3 The effect of different lookback periods
  5.4 Limitations
  5.5 Future work
6 Conclusions


Chapter 1

Introduction

The power industry is an important part of society, with great significance for national security, social stability and people's lives. Electric energy is difficult to store, which sets high requirements for power generation, transmission and sale. Power should not be supplied in excess of demand, since that wastes energy resources; nor should it be in short supply, since that may cause outages in some districts. Therefore, power load forecasting is very important to maintain the balance of power supply and demand [1]. Environmental factors and historical data are used to predict future loads, which helps in making plans for power generation and transmission. Power load forecasting began in the 1980s, but the original forecasting work did not use complex methods: it was mainly done through manual calculation by experienced people, and the forecasting results differed considerably from the actual situation. With the development of society, people's demand for electricity keeps growing, which puts forward higher requirements for the accuracy of load forecasting. What we can do is everything possible to ensure a real-time balance between electricity supply and demand. Unfortunately, a complete supply-demand balance cannot be achieved by forecasting alone, because of emergencies and the influence of various factors. Therefore, power grids keep some reserve capacity to achieve a dynamic balance of power generation and demand. Accurate forecasting of the power load can reduce the reserve capacity of the power grid and contribute to a better utilization of electricity. For short-term electric load forecasting methods, whether based on statistical learning or on artificial intelligence, the basic idea is to explore the potential value of historical electricity loads and to establish a mathematical model that makes use of the historical data. The mathematical model is then used to predict the power load at a certain time in the future. In the past ten years, smart grid technology has developed vigorously. Smart meters have replaced traditional meters, and the large number of sensors in smart meters has dramatically improved the observability of the power grid. After years of development, the electric power industry has accumulated a large amount of historical load data, which lays a solid foundation for the application of various forecasting models. With the development of smart grids and information technology, computers have undergone many updates and their computing performance has experienced explosive growth. At the same time, the extensive application of graphics processors also provides powerful computing ability for deep neural networks.

In recent years, with the rapid development of machine learning, artificial intelligence has made great breakthroughs in many fields. At present, many artificial intelligence algorithms play an important role in improving prediction accuracy, which makes high-precision short-term load forecasting possible.

1.1 Objectives

Due to the low accuracy of traditional load forecasting methods, this thesis focuses on proposing short-term load forecasting methods based on machine learning and comparing their performance. This thesis mainly completes the following work:

1) In order to compare the performance of traditional load forecasting methods and machine learning methods, a baseline model is built using a traditional method. In this thesis, an auto-regressive integrated moving average (ARIMA) model is built as the baseline.

2) Pearson coefficients are calculated to select features that have a large impact on the accuracy of the prediction, and the importance of each feature is visualized.

4) The methods are tested on an actual data set, and their performance is compared with the traditional ARIMA model to verify the superiority of the proposed methods.

1.2 Thesis Outline


Chapter 2

Literature Overview

This chapter introduces the classification of power load forecasting problems and summarizes the short-term power load forecasting methods.

2.1 Classification of power load forecasting

With respect to the time scale of prediction, power load forecasting can be roughly divided into long-term and short-term forecasting, although the definitions of long-term and short-term differ somewhat between application scenarios. According to the time scale, power load prediction can be further divided into four classes [2]: ultra-short-term, short-term, medium-term and long-term load forecasting. The corresponding time scales are minutes, hours, months and years, respectively. Figure 2.1 shows the classification of power load forecasting based on time scale.

Figure 2.1: Classification of power load forecasting

Ultra-short-term load forecasting aims to predict the power load in the next few minutes, which is mainly used to monitor the operation of the power grid [3]. Short-term load forecasting predicts the power load in the next few hours, which mainly provides data for the optimal dispatch of power plants [4]. Medium-term load forecasting predicts the load in the next few months, which is used to make maintenance plans [5]. Long-term power load forecasting predicts the load in the next few years, which is used to guide the transformation of the power grid [5]. Each kind of power load forecasting has its own application scenario. This thesis mainly studies short-term power load forecasting, which is also the basis for power companies to dynamically adjust their generation and transaction plans in the market.

2.2 Traditional methods for short-term load forecasting

In the 1990s, computers gradually entered all walks of life, but their computing ability was very limited, and most people mainly used statistical methods to predict short-term load [6]. Traditional methods mainly include the following:

2.2.1 Time Series Model

Time series models use characteristics such as auto-correlation, trend and seasonal variation to predict short-term load [7], and have been studied for decades. The most popular prediction methods are ARIMA and ARMA [8]. A time series model assumes a linear relationship between future loads, the historical loads over the past few hours, and stochastic error terms. Both the ARMA and ARIMA models have achieved good prediction results.

2.2.2 Multiple Linear Regressions

Multivariate regression uses a linear function to map the relationship between the output y and multiple independent variables x1, x2, ..., xk. The purpose of multiple linear regression is to find a function that describes the relationship between the output and the variables, and to make predictions from those independent variables. Power load can be affected by various environmental features such as temperature, humidity and precipitation. The multivariate linear regression prediction model can be expressed as:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon$$

where $y$ is the short-term load to be predicted, $x_i$ is a feature that affects the power load, $\beta_i$ is the regression parameter of $x_i$, and $\varepsilon$ is the random error. Multiple linear regression has been widely used for smooth time series prediction. However, for time series with strong fluctuations, its accuracy is very low.

2.2.3 Grey model

The grey model (GM) uses a small amount of historical data to build differential equations that predict short-term loads. First, the historical data are accumulated to generate new sequences, which weakens the randomness of the original data. Second, differential equations are established from the generated sequence; for example, the GM(1,1) model is a differential equation of first order in one variable. Other traditional methods need a lot of historical data to train the model, while the grey model needs relatively little. The grey model is suitable for load curves that grow exponentially; for a stationary power load curve, its prediction accuracy is limited.

2.2.4 Expert system

An expert system is a computer program that can analyze the current circumstances and expand its knowledge base as new information emerges. The basic structure of an expert system is shown in figure 2.2, where the arrows show the direction of data flow. An expert system simulates the human decision-making process and tries to provide the optimal solution in the current circumstances.

Figure 2.2: Basic structure of the expert system

Finally, the expert system makes its prediction based on the stored conclusions. Expert systems are easy to maintain; however, the process of acquiring knowledge is their main obstacle.

2.3 Machine Learning Methods for Load Forecasting

Computers have undergone many updates, and their computing ability has been greatly improved. The emergence of massive data sets and GPUs makes it possible to train deep neural networks. Popular methods mainly include the following [9]:

2.3.1 Support Vector Machine

A support vector machine (SVM) builds a hyper-plane in a high-dimensional space, which is used for classification or regression [10]. If the samples are linearly inseparable, the features can be mapped to a high-dimensional space using a kernel function, and a linear classifier is then established. For regression, the normal vector of the hyper-plane defines a function that makes the estimate as close as possible to the target, and this hyper-plane should accurately predict the distribution of the data. Compared with traditional methods, SVM makes no prior assumption about the data, so it can deal with both stationary and non-stationary sequences. However, if the training set has a large number of samples, SVM training will be very slow [11].

2.3.2 Gradient boosting decision tree

2.3.3 Deep Learning Methods

Deep learning is a machine learning technique that uses neural networks with several hidden layers. With a more complex model architecture, it usually performs better than shallow learning on complicated problems. The advantages of deep learning can be summarized as follows [14]:

1) With more hidden layers in the architecture, a deep neural network captures non-linear data characteristics more easily, and it has a stronger ability to analyze the internal correlation between the input vector and the corresponding output.

2) Deep neural networks learn features of the samples more easily. Feature information is transformed from one layer to another to form a new feature space, which makes the model easier to learn.

3) Deep learning uses large data sets to train the model, and is therefore capable of extracting more intrinsic information from the training data.

Typical deep learning models include convolutional neural networks (CNN), stacked auto-encoder networks, deep belief networks (DBN) and recurrent neural networks (RNN) [15].

2.4 Challenges

Chapter 3

Methodology

3.1 Auto-Regressive Integrated Moving Average (ARIMA) Model

The ARIMA model is widely used for forecasting a time series from its historical values [18]. An ARIMA model is characterized by three terms p, d, q, and is often written ARIMA(p, d, q). p is the order of the auto-regressive term and represents the number of lags used to forecast; d is the minimum number of differencing operations needed to make the original time series stationary (if the original series is already stationary, then d = 0); q is the order of the moving average (MA) term, i.e. the number of lagged forecast errors taken into consideration. If the values of (p, d, q) are known, the ARIMA model can be expressed as:

$$Y_t = \alpha + \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \dots + \beta_p Y_{t-p} + \epsilon_t + \phi_1 \epsilon_{t-1} + \phi_2 \epsilon_{t-2} + \dots + \phi_q \epsilon_{t-q} \quad (3.1)$$

In equation (3.1), $\alpha$ is the intercept estimated by the model, $\beta_p$ is the coefficient of lag $Y_{t-p}$, $\epsilon_t$ is the auto-regressive error of the corresponding lag, and $\phi_q$ is the coefficient of error $\epsilon_{t-q}$. ARIMA-based forecasting is simple to implement; however, it requires the time series to be stationary, either directly or after differencing. The Dickey-Fuller test can be performed to check whether a time series is stationary.
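For illustration, the following is a minimal sketch of this workflow using the statsmodels library. The file name "daily_load.csv" and the order (1, d, 1) are assumptions for the example, not values taken from this thesis.

```python
# A minimal ARIMA sketch with statsmodels; file name and order are
# illustrative assumptions.
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller

load = pd.read_csv("daily_load.csv", index_col=0, parse_dates=True).squeeze()

# Dickey-Fuller test: a p-value below 0.05 suggests the series is stationary.
p_value = adfuller(load.dropna())[1]
d = 0 if p_value < 0.05 else 1  # difference once if non-stationary

model = ARIMA(load, order=(1, d, 1))  # p and q chosen from PACF/ACF plots
fitted = model.fit()
forecast = fitted.forecast(steps=25)  # predict the next 25 days
```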


3.2 Analysis of Environmental Factors that Influence Power Load

In the electricity market, power load is influenced by various factors such as demand, weather conditions, seasonality, region and residents' living habits [19]. Among these, environmental factors play an important role in power consumption; they mainly include temperature, humidity, wind speed, UV index, pressure and so on [20]. When modeling power system load forecasting, selecting effective environmental factors as model inputs is of vital importance.

In order to identify the correlation between each environmental factor and the power load, the Pearson similarity method is adopted to analyze the correlation between each factor and the corresponding load consumption. The Pearson coefficient is calculated as:

$$r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} \quad (3.2)$$

In equation (3.2), $x_i$ is the environmental factor and $y_i$ is the power load consumption. In this thesis, maximum temperature, minimum temperature, dew point, cloud cover, wind speed, pressure, visibility, humidity and UV index are considered as environmental factors.
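As a sketch of this step, the correlation between every environmental factor and the load can be computed with pandas; the file name and the column name "load" are assumptions for the example.

```python
# Pearson correlation between each environmental factor and the load,
# assuming a DataFrame with one column per factor plus a "load" column.
import pandas as pd

df = pd.read_csv("load_and_weather.csv")          # hypothetical file name
corr = df.corr(method="pearson")["load"].drop("load")
print(corr.sort_values(ascending=False))
```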

3.3 Analysis of power load data characteristics

3.3.1 Power load data preprocessing

3.3.2 Correction of abnormal data

There are two correction methods for abnormal data [23]:

(1) Horizontal processing method

Normally, the power load within a day changes smoothly and continuously, which means that the difference between the load at a certain moment and the loads just before and after it should not be too large. If a large difference is observed within a short time, a deviation has probably occurred due to equipment records or human factors. In this case, outliers can be handled with the horizontal processing method, described by:

$$|Y(d,t) - Y(d,t-1)| > \alpha(t) \quad (3.3)$$
$$|Y(d,t) - Y(d,t+1)| > \beta(t) \quad (3.4)$$
$$Y(d,t) = \frac{Y(d,t-1) + Y(d,t+1)}{2} \quad (3.5)$$

where $\alpha(t)$ and $\beta(t)$ are the difference thresholds, $Y(d,t)$ is the power load on day d at moment t, $Y(d,t-1)$ is the load on day d at moment t-1, and $Y(d,t+1)$ is the load on day d at moment t+1.

(2) Vertical processing method

Load characteristics show daily periodicity. That is, the load data at the same time on adjacent days can be considered similar, and the difference between the two should stay within a certain range; if it exceeds this range, the value can also be regarded as bad data. In this case, outliers can be handled by the vertical processing method, described as follows. If

$$|Y(d,t) - m(t)| > r(t) \quad (3.6)$$

then

$$Y(d,t) = \begin{cases} m(t) + r(t), & Y(d,t) > m(t) \\ m(t) - r(t), & Y(d,t) < m(t) \end{cases} \quad (3.7)$$

where $r(t)$ is the threshold, $m(t)$ is the mean power load at moment t over recent days, and $Y(d,t)$ is the power load on day d at moment t.
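A small sketch of both correction rules is shown below; the array layout (days by moments per day) and the constant thresholds are assumptions to be adapted to real data.

```python
# Sketch of the two correction rules for abnormal load data.
import numpy as np

def horizontal_correction(Y, alpha, beta):
    """Equations (3.3)-(3.5): replace a point by the mean of its
    neighbours when it jumps by more than the thresholds on both sides."""
    Y = Y.copy()
    for d in range(Y.shape[0]):
        for t in range(1, Y.shape[1] - 1):
            if (abs(Y[d, t] - Y[d, t - 1]) > alpha
                    and abs(Y[d, t] - Y[d, t + 1]) > beta):
                Y[d, t] = (Y[d, t - 1] + Y[d, t + 1]) / 2
    return Y

def vertical_correction(Y, r):
    """Equations (3.6)-(3.7): clip a point to m(t) +/- r(t) when it
    deviates too far from the mean load at the same moment."""
    m = Y.mean(axis=0)                 # mean load at each moment t
    return np.clip(Y, m - r, m + r)    # only out-of-range points change
```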

3.3.3 Filling Missing Data

3.3.4 Data normalization and quantification

After missing-value filling and outlier processing, numerical data should be normalized and categorical data should be encoded. The normalization formula is:

$$Y' = \frac{Y - Y_{min}}{Y_{max} - Y_{min}} \quad (3.8)$$

In equation (3.8), $Y'$ is the power load data after normalization, $Y$ is the power load data before normalization, $Y_{min}$ is the minimum value of the historical power load data, and $Y_{max}$ is its maximum value. Category-type data, such as day-of-week data and holiday identifiers, need to be encoded; for example, the holiday flag is coded as 0 or 1, where 0 means a non-holiday and 1 means a holiday. After the data is encoded and normalized, it can be divided into a training set and a test set for model training and performance evaluation. Predictions obtained by training on normalized data must subsequently be de-normalized in order to obtain the real power load forecast value; the de-normalization operation is:

$$Y = (Y_{max} - Y_{min})\,Y_p + Y_{min} \quad (3.9)$$

In equation (3.9), $Y$ is the actual predicted value and $Y_p$ is the normalized predicted value.
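The two transforms are a few lines of NumPy; the sketch below assumes the load is held in a NumPy array.

```python
# Min-max normalization (3.8) and de-normalization (3.9).
import numpy as np

def normalize(y):
    y_min, y_max = y.min(), y.max()
    return (y - y_min) / (y_max - y_min), y_min, y_max

def denormalize(y_pred, y_min, y_max):
    return (y_max - y_min) * y_pred + y_min
```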

3.4 Gradient Boosting Decision Tree

3.4.1 XGBoost Algorithm

XGBoost is a decision tree based algorithm using the gradient boosting framework, proposed by Tianqi Chen and Carlos Guestrin in 2016. Like many other gradient boosting methods, XGBoost follows the idea of ensembling weak learners within a gradient descent architecture. However, XGBoost distinguishes itself in the following ways [24]:

1) Better regularization capability: XGBoost penalizes the parameters of complex models through L1 and L2 regularization to avoid overfitting.

2) Sparsity awareness: missing values are handled automatically based on the training loss.

4) Parallelization: XGBoost uses a parallel implementation to handle the process of sequential tree building.

3.4.2 LightGBM Algorithm

Light gradient boosting machine (LightGBM) is a decision tree based gradient boosting framework. Unlike other decision tree algorithms, LightGBM employs a novel method called gradient-based one-side sampling to find the most suitable split for the data samples. As shown in figure 3.1, the LightGBM decision tree grows leaf-wise, while other boosting algorithms grow trees level-wise (depth-wise) [25]. In this way, more loss can be reduced, which results in better performance.

Figure 3.1: Leaf-wise growth for LightGBM

The advantages of the LightGBM algorithm can be summarized as follows [25]:

1) Shorter training time: LightGBM converts continuous feature values into discrete values, which results in a faster training process.

2) Lower memory usage: discrete values require less memory than continuous values.

3) High accuracy: a decision tree built by the leaf-wise growth approach is more complex than one built level-wise. This may sometimes result in overfitting, which can be addressed by tuning the hyperparameters of the model.


3.5 Structure of recurrent neural network

3.5.1 Basic recurrent neural network

In a multi-layer perceptron based neural network, the inputs and outputs are independent of each other. However, in some cases there is a strong correlation between input and output. For example, to predict which word will appear next in a sentence, we must know each preceding word and the order in which those words appear. The RNN is a sequence-based model that can establish the relationship between historical and current information. For time series problems, this means that a decision made by the RNN at the current moment may affect inputs that arrive later.

An unfolded basic recurrent neural network is shown in figure 3.2.

Figure 3.2: Unfolded basic recurrent neural network

Here $x_t$ is the input vector at moment t, $s_t$ is the hidden layer vector, and $o_t$ is the output vector. $U$, $W$, $V$ are parameter matrices. The update of $s_t$ is based on the hidden layer vector at the previous moment, $s_{t-1}$, and the input vector at the current moment, $x_t$. The updating process and the output vector can be written as:

$$s_t = \sigma_s(U x_t + W s_{t-1} + b_s) \quad (3.10)$$
$$o_t = \sigma_o(V s_t + b_o) \quad (3.11)$$


3.5.2 Long short-term memory (LSTM) network

An RNN is trained with the back-propagation algorithm; the error keeps shrinking during back-propagation training, which results in a gradually vanishing gradient. The vanishing gradient makes network training difficult and eventually makes the model hard to converge. LSTM solves this problem by establishing a long time lag between input and feedback. Each layer of neurons in the LSTM has multiple "gate" structures. This unique structure eliminates the need for layer-by-layer error propagation: part of the error can be transmitted directly to subsequent network layers through the "gate" structure. In this way, no matter how deep the network is, or how long the input sequence is, the error will not disappear [26]. An RNN realizes parameter sharing through the use of repeated network modules, and the internal structure of each repeated RNN module is very simple. LSTM has a similar chain structure, but the structure of the repeating module is different: it is no longer a single neural network layer, but is composed of multiple components that interact in a special way. The structure of the LSTM is shown in figure 3.3 [27].

Figure 3.3: LSTM structure


The forget gate controls what state information should be discarded. The inputs of the forget gate are $h_{t-1}$ and $x_t$, and the output is a value between 0 and 1 applied to each number in $C_{t-1}$. The calculation is shown in equation (3.12):

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \quad (3.12)$$

Once the information to be discarded is determined, it is necessary to decide what information should be added to the state. First, the input gate determines which values to update, and then the candidate layer creates a candidate state value that will be added to the new state. The operations are shown in equations (3.13) and (3.14):

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \quad (3.13)$$
$$\tilde{C}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c) \quad (3.14)$$

The updated cell state combines two parts: the retained history information and the newly added state information:

$$C_t = f_t \times C_{t-1} + i_t \times \tilde{C}_t \quad (3.15)$$

Finally, the output value of the unit is determined, controlled by the output gate. First, the output gate determines which part of the cell state information will be output; then the cell state is processed by the tanh function and multiplied by the output gate:

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \quad (3.16)$$
$$h_t = o_t \times \tanh(C_t) \quad (3.17)$$
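For concreteness, a single LSTM cell step can be written directly from equations (3.12)-(3.17); in the sketch below, the weight and bias dictionaries (keys "f", "i", "c", "o", each applied to the concatenation [h, x]) are assumptions of the example.

```python
# One LSTM cell step implementing equations (3.12)-(3.17) in NumPy.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ z + b["f"])        # forget gate, eq. (3.12)
    i_t = sigmoid(W["i"] @ z + b["i"])        # input gate, eq. (3.13)
    C_cand = np.tanh(W["c"] @ z + b["c"])     # candidate state, eq. (3.14)
    C_t = f_t * C_prev + i_t * C_cand         # cell state update, eq. (3.15)
    o_t = sigmoid(W["o"] @ z + b["o"])        # output gate, eq. (3.16)
    h_t = o_t * np.tanh(C_t)                  # hidden state, eq. (3.17)
    return h_t, C_t
```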

3.5.3 Gated Recurrent Unit (GRU) Network

The GRU simplifies the gate structure into two gates: an update gate $z_t$ and a reset gate $r_t$. The structure of the GRU network is shown in figure 3.4 [28].

Figure 3.4: GRU structure

The update gate $z_t$ changes the update speed of the state information by controlling how much historical state information is brought into the current state. The value of the update gate is proportional to the amount of information brought in: a larger value indicates that more historical state information is carried over, which results in a faster update. The reset gate $r_t$ controls the extent to which historical state information is ignored. The value of the reset gate is inversely proportional to the ignored information: the smaller the value, the more information is ignored.


3.6 Indicators for Evaluating the performance

There are several popular indicators for evaluating prediction performance, mainly the root mean square error (RMSE), mean absolute error (MAE), mean squared error (MSE) and mean absolute percentage error (MAPE). In our case, MAE and MAPE are the most suitable indicators: MAE is a linear score, meaning that all individual differences are weighted equally in the average, while MAPE simply measures the percentage error of each measurement and is commonly used due to its simplicity. MAE and MAPE are calculated as in equations (3.23) and (3.24):

$$MAE = \frac{1}{N}\sum_{i=1}^{N}|y_i - \hat{y}_i| \quad (3.23)$$

$$MAPE = \frac{1}{N}\sum_{i=1}^{N}\frac{|y_i - \hat{y}_i|}{y_i} \quad (3.24)$$
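Both indicators are straightforward to implement; the sketch below multiplies MAPE by 100 only so it is reported in percent, as in the result tables later.

```python
# The two evaluation metrics, equations (3.23) and (3.24).
import numpy as np

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def mape(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred) / y_true) * 100  # in percent
```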


Chapter 4

Experiments and Results

In this thesis, all algorithms are tested on a data set containing the energy consumption readings of a sample of 5,567 London households between November 2011 and February 2014. The corresponding weather data set for the London district has 9 features: maximum temperature, minimum temperature, dew point, cloud cover, wind speed, pressure, visibility, humidity and uv-index. The power load consumption of all households is averaged. There are 825 available daily power load samples; the first 800 are used for training and validation, and the loads of the last 25 days are used as the test set on which the performance of each method is evaluated and compared.

4.1 Feature Selection

Before building the models, it is important to analyze which features have a large impact on the corresponding power load consumption. Even though the data set has explicit weather information, we cannot feed all the features to the model: too many features result in longer training time and make the model computationally expensive. Besides, not all environmental features affect the power load consumption; feeding irrelevant information to the model not only increases the complexity of the algorithm, but also impairs its performance [29].

Figure 4.1 shows the relationship between power load consumption and temperature. It can be observed that power load and temperature are highly correlated; the reason may be that more electricity is consumed for heating when the temperature is low. Temperature is therefore an important environmental feature and should be taken into consideration. However, some features do not contribute to the forecasting result. Figure 4.2 shows the relationship between power load and wind speed. There is no clear pattern indicating that these two parameters are correlated; in this case, wind speed is an irrelevant feature and should be excluded. Thus, feature selection is an important step before making predictions.

Figure 4.1: Power load & Temperature

Figure 4.2: Power load & Wind speed

To quantify this, the Pearson coefficient method [30] is adopted to evaluate the correlation between each feature and the corresponding power load consumption. Before the analysis, each factor is normalized; then the Pearson similarity method is applied. The results are shown in the correlation matrix in figure 4.3.


Figure 4.3: Correlation matrix of Pearson similarity method

According to the correlation matrix, maximum temperature, minimum temperature, dew point and uv-index are selected as input features to our model.

4.2 ARIMA model Implementation

The ARIMA model is a linear regression model that uses its own lags to predict the time series. The prerequisite for ARIMA forecasting is that the time series must be stationary. A common way to test whether a time series is stationary is the Dickey-Fuller test: if the p-value of the test is below a significance level (normally 0.05), we can infer that the time series is stationary. Otherwise, we have to make the time series stationary, for example by differencing it, i.e. subtracting from each value the value at the previous moment. The d value in the ARIMA model refers to the number of differencing operations needed to make the time series stationary; if the series is already stationary, then d = 0. The original time series over three years is shown in figure 4.4. In this case, the time series is stationary after one differencing.


The value of p can be determined from the partial auto-correlation (PACF) plot. The PACF shows the correlation between the time series and its lags, so we can see which lags are needed in the auto-regressive term. The PACF plot is shown in figure 4.5. The last step is to determine the value of q, which can be obtained from the auto-correlation plot. Similar to the PACF plot, the auto-correlation plot shown in figure 4.6 indicates the number of moving average (MA) terms needed to remove auto-correlation from the stationary time series.

It can be observed from figures 4.5 and 4.6 that the auto-correlation plot decays gradually, while the partial auto-correlation plot shows a sharp drop after the first lag, which means most of the higher-order auto-correlations are well explained by the first lag.

Figure 4.5: Partial auto-correlation plot

Figure 4.6: Auto-correlation plot
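The order-selection step sketched above can be reproduced with statsmodels; here `load` is assumed to be the daily load series from the earlier ARIMA sketch in section 3.1, and the lag count of 30 is an assumption.

```python
# ACF/PACF plots on the once-differenced series to pick p and q.
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

load_diff = load.diff().dropna()                # d = 1: first difference
fig, axes = plt.subplots(1, 2, figsize=(10, 3))
plot_pacf(load_diff, ax=axes[0], lags=30)       # sharp cut-off after lag 1 -> p = 1
plot_acf(load_diff, ax=axes[1], lags=30)        # gradual decay guides the choice of q
plt.show()
```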

After determining the values of p, d and q, we can fit the ARIMA model on the training set and compare the predicted time series with the true values; the result is shown in figure 4.7.

Finally, we can make a power load prediction with the trained ARIMA model on the test set. The forecasting result is shown in figure 4.8; it has a mean absolute error of 0.514 and a mean absolute percentage error of 4.654%.

Figure 4.7: Result of model fitting

Figure 4.8: Forecasting result of ARIMA model

4.3 XGBoost Algorithm Results

XGBoost can also be used for feature selection: in this way, irrelevant features are excluded and the final model is simpler and less likely to overfit. Besides, reducing the feature dimension also accelerates the training process and improves the forecasting performance. In this thesis, feature selection has already been done with the Pearson similarity method in the previous section, so feature selection using XGBoost is omitted here. Following the conclusion of section 4.1, temperature, dew point, uv-index and holiday index are selected as input features to train the XGBoost model. The hyperparameter settings of the XGBoost model are described in table 4.1.

Table 4.1: Hyperparameter settings for the XGBoost model

  max depth: 5
  learning rate: 0.1
  number of estimators: 160

The forecasting result and the feature importance are shown in figures 4.9 and 4.10; each bar in figure 4.10 corresponds to temperature, dew point, uv-index and holiday index. In the Pearson coefficient matrix shown in figure 4.3, temperature has the largest Pearson coefficient, followed by dew point and uv-index; this is consistent with the XGBoost feature importance result. Interestingly, in this case the holiday index seems to have very little impact on the forecasting result. However, the holiday index is a very important feature in the field of power load forecasting and should be taken into consideration in most cases.

Figure 4.9: XGBoost forecasting result

Figure 4.10: Feature importance
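A sketch of this experiment is given below with the table 4.1 hyperparameters; the synthetic arrays are assumptions that stand in for the real features (temperature, dew point, uv-index, holiday index) and the averaged daily load.

```python
# XGBoost experiment sketch; synthetic data replaces the real data set.
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(825, 4))                      # 825 daily samples, 4 features
y = X @ np.array([0.6, 0.3, 0.2, 0.05]) + rng.normal(scale=0.1, size=825)

model = XGBRegressor(max_depth=5, learning_rate=0.1, n_estimators=160)
model.fit(X[:800], y[:800])                        # first 800 days for training
y_pred = model.predict(X[800:])                    # last 25 days as test set
print("feature importances:", model.feature_importances_)
```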

4.4 LightGBM Algorithm Results

Historical power load data and the corresponding temperature, dew point, uv-index and holiday index are collected to train the LightGBM model. The hyperparameter settings of the LightGBM model are shown in table 4.2.

Table 4.2: Hyperparameter settings for the LightGBM model

  max depth: 5
  learning rate: 0.07
  number of estimators: 1000
  min_child_samples: 80
  subsample: 0.8

Figure 4.11: LightGBM forecasting result
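The corresponding LightGBM sketch, with the table 4.2 hyperparameters, assumes X and y are prepared as in the XGBoost sketch above.

```python
# LightGBM experiment sketch; reuses X and y from the XGBoost sketch.
from lightgbm import LGBMRegressor

model = LGBMRegressor(max_depth=5, learning_rate=0.07, n_estimators=1000,
                      min_child_samples=80, subsample=0.8)
model.fit(X[:800], y[:800])
y_pred = model.predict(X[800:])
```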

4.5 LSTM Network Implementation

The LSTM network is capable of learning both long-term and short-term features of the training data, so the quality of the input data is relevant to the performance of LSTM prediction [31]. It is indicated in [32] that clustering the data according to their similarities before training the model helps improve the forecasting performance. Following the same idea, in this thesis the data samples are first divided into several groups before training. Among the various clustering methods, the K-means algorithm is adopted here due to its robustness [33]. The specific steps of K-means clustering are as follows [34] (a brief clustering sketch follows below):

1) Initialize the K centroids of the K groups by randomly selecting K samples as centroids.

2) Calculate the Euclidean distance between each sample and each centroid, and allocate each sample to the nearest group.

3) Update the centroids by averaging each group. If no centroid changes, the clustering process is over; otherwise, return to step (2).

After clustering the training data, we can feed the K groups of training data to K different LSTM networks. When a new data sample arrives, the Euclidean distance between the sample and each centroid is calculated, and the sample is allocated to the nearest group. The corresponding LSTM network is then used to forecast the power load consumption at the next moment. The process of power load forecasting based on the LSTM network is shown in figure 4.12.
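The grouping step can be sketched with scikit-learn's KMeans; the group count K = 4 and the synthetic samples are assumptions of the example.

```python
# Cluster the training samples and route new samples to a group.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
samples = rng.normal(size=(800, 5))      # stand-in: load value + 4 features

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(samples)
groups = [samples[kmeans.labels_ == k] for k in range(4)]   # one per network

new_sample = rng.normal(size=(1, 5))
nearest_group = kmeans.predict(new_sample)[0]  # index of the network to use
```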

The LSTM architecture is built in Python; backpropagation through time (BPTT) and the Adam optimizer are used to train the network. The parameter settings of the LSTM network are shown in table 4.3.

Table 4.3: Parameter settings for the LSTM network

  model: sequential
  activation: sigmoid
  loss: MAE
  optimizer: adam
  epochs: 50
  batch size: 10

The forecasting result of the LSTM network is shown in figure 4.13; in this case, the mean absolute error is 0.462 and the mean absolute percentage error is 4.061%.

Figure 4.13: Forecasting result of LSTM network
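A Keras sketch of one such LSTM network with the table 4.3 settings is shown below; the lookback length, layer width and synthetic data are assumptions not stated in the thesis.

```python
# LSTM network sketch with the table 4.3 settings.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

n_steps, n_features = 7, 5                 # assumed lookback and feature count
rng = np.random.default_rng(0)
X = rng.normal(size=(800, n_steps, n_features)).astype("float32")
y = rng.normal(size=(800, 1)).astype("float32")

model = Sequential([
    LSTM(32, activation="sigmoid", input_shape=(n_steps, n_features)),
    Dense(1),
])
model.compile(loss="mae", optimizer="adam")
model.fit(X, y, epochs=50, batch_size=10, verbose=0)
```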

4.6 GRU Network Implementation

1) Set a historical power load series P = (P_{t-1}, P_{t-2}, ..., P_{t-n}) to be used to predict the power load consumption at the next moment. Each sample contains the historical power load value as well as its corresponding 4 features: temperature, dew point, uv-index and holiday index.

2) Each sample is scaled, and each time step is fed into the corresponding GRU block in the GRU layer.

3) The outputs of the GRU layer are fed into a feed-forward neural network to predict the power load consumption at the next moment.

Similar to the LSTM architecture, the GRU network is also trained with BPTT and the Adam optimizer. The parameter settings of the GRU network are shown in table 4.4.

Table 4.4: Parameter settings for the GRU network

  model: sequential
  activation: sigmoid
  loss: MAE
  optimizer: adam
  epochs: 50
  batch size: 10
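The GRU counterpart with the table 4.4 settings is a one-line swap of the recurrent layer; this sketch reuses the data shapes (X, y, n_steps, n_features) assumed in the LSTM sketch above.

```python
# GRU network sketch with the table 4.4 settings.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GRU, Dense

model = Sequential([
    GRU(32, activation="sigmoid", input_shape=(n_steps, n_features)),
    Dense(1),
])
model.compile(loss="mae", optimizer="adam")
model.fit(X, y, epochs=50, batch_size=10, verbose=0)
```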


Chapter 5

Discussion and Future Work

5.1 Summary of findings

In this case study, the forecasting results of all proposed methods are listed in table 5.1.

Table 5.1: Forecasting performance of all methods

  Method     MAE    MAPE
  ARIMA      0.514  4.654%
  XGBoost    0.441  3.866%
  LightGBM   0.392  3.435%
  LSTM       0.462  4.061%
  GRU        0.439  3.821%

Compared to the traditional ARIMA model, all machine learning based methods have overall better forecasting performance in terms of MAE and MAPE. One reason may be that the ARIMA model can only use the historical time series to make predictions and has no capability of analyzing environmental factors. Among the four machine learning methods, LightGBM yields the best performance in this case. The other gradient boosting decision tree algorithm, XGBoost, also performs well, but the GRU network performs slightly better than XGBoost. Thus, it is hard to determine whether gradient boosting machines or RNNs stand out for this forecasting problem. Before making the prediction, it is difficult to determine the most suitable method in advance; therefore, we should test several forecasting methods and pick the one with the best performance.

The feature selection results suggest that temperature, dew point and uv-index are highly correlated with the power load consumption. The Pearson similarity analysis did not include the holiday index in the feature extraction process; however, we added this feature to train the models. The XGBoost feature importance analysis indicates that the holiday index does not contribute much to the forecasting result in this case; the reason may be that the power load consumption of the analyzed district is not strongly influenced by holidays.

5.2 Optimal number of groups

For the LSTM and GRU networks, the data are grouped before training the model. The idea is to group the data according to their similarities and train a network for each group; in this way, test samples are allocated to the most suitable model, which increases the forecasting accuracy. However, finding the optimal number of groups is a problem that needs to be addressed. In the experiments, several group numbers were tested and the corresponding forecasting performance compared. As the group number is increased from 1 to 9, the results show that the mean absolute error decreases as the group number increases; however, the training time also increases with more groups. There is a trade-off between accuracy and efficiency: with a large number of groups, the decrease in mean absolute error is no longer substantial. Thus, we should consider both the training time and the forecasting performance, and choose the number of groups wisely.

5.3 The effect of different lookback periods

Since LSTM and GRU networks can learn both short-term and long-term features of the training data, the lookback period plays an important role in network training. With a longer lookback period, the network can learn more historical information and make better predictions. However, a longer lookback period makes the data more complex and increases the training time. Besides, not all historical information is useful: too long a lookback period may only increase the model complexity without contributing to the forecasting results.

The Pearson coefficients between the current load and the loads of previous days were studied; the results show that the Pearson coefficient first decreases and then increases. If t - n is the last moment whose Pearson coefficient is larger than a threshold, then n is chosen as the optimal lookback period. This approach is fast, but it only considers the correlation between power load values and ignores the effect of environmental features.

5.4 Limitations

Among traditional forecasting methods, only the ARIMA model is studied. Due to its model limitations, ARIMA cannot make use of environmental features, while the other proposed methods can. Thus, it is hard to conclude that machine learning forecasting methods outperform traditional forecasting methods in general; other traditional methods should also be tested and compared to draw a strong conclusion. Generally, RNNs have high accuracy in time series forecasting; however, gradient boosting machine based algorithms perform equally well in this case. LSTM and GRU networks have been demonstrated to perform better when time series are longer. Therefore, it is necessary to investigate whether LSTM and GRU are able to learn long-term dependencies in time series data.

5.5 Future work

More traditional power load forecasting methods should be tested, and the performance of traditional and machine learning methods should be evaluated in several scenarios, for example using industrial versus residential power load data, or using limited data (without environmental features). Some methods suit certain circumstances, and it would be interesting to investigate the optimal forecasting method under a given circumstance. It would also be worth identifying the optimal algorithm and configuration for hourly power load forecasting.


Chapter 6

Conclusions

In this thesis, a commonly used traditional time series forecasting method (the ARIMA model) and four machine learning based forecasting methods (XGBoost, LightGBM, LSTM and GRU) are tested and compared for predicting the average daily residential power load consumption of a community in the London district. The results show that the machine learning based methods have an overall better performance than the ARIMA model. Since only one traditional forecasting method was tested, this is insufficient to conclude that machine learning methods outperform traditional methods in general. Regarding environmental features, the Pearson similarity analysis indicates that temperature, dew point and uv-index are highly correlated with the power load consumption. In the case study, among all proposed methods, the gradient boosting tree based algorithm LightGBM yields the best performance; the GRU network has a slightly higher mean absolute error than LightGBM, but it also performs well. It is hard to determine which method outperforms another in the field of power load forecasting; several methods should be tested and compared in a given circumstance to select the most suitable one.


Bibliography

[1] B. F. Hobbs et al. "Analysis of the value for unit commitment of improved load forecasts". In: IEEE Transactions on Power Systems 14.4 (Nov. 1999), pp. 1342-1348. doi: 10.1109/59.801894.

[2] Yuan-Yih Hsu and Chien-Chun Yang. "Electrical Load Forecasting". In: Applications of Neural Networks. Ed. by Alan F. Murray. Boston, MA: Springer US, 1995, pp. 157-189. doi: 10.1007/978-1-4757-2379-3_7.

[3] VietCuong Ngo et al. "Ultra-short-term load forecasting using robust exponentially weighted method in distribution networks". In: 2015 IEEE Power & Energy Society General Meeting. July 2015, pp. 1-5. doi: 10.1109/PESGM.2015.7286602.

[4] S. Singh, S. Hussain, and M. A. Bazaz. "Short term load forecasting using artificial neural network". In: 2017 Fourth International Conference on Image Information Processing (ICIIP). Dec. 2017, pp. 1-5. doi: 10.1109/ICIIP.2017.8313703.

[5] L. Duan, D. Niu, and Z. Gu. "Long and Medium Term Power Load Forecasting with Multi-Level Recursive Regression Analysis". In: 2008 Second International Symposium on Intelligent Information Technology Application. Vol. 1. Dec. 2008, pp. 514-518. doi: 10.1109/IITA.2008.397.

[6] I. Moghram and S. Rahman. "Analysis and evaluation of five short-term load forecasting techniques". In: IEEE Transactions on Power Systems 4.4 (Nov. 1989), pp. 1484-1491. doi: 10.1109/59.41700.

[7] N. Amjady. "Short-term hourly load forecasting using time-series modeling with peak load estimation capability". In: IEEE Transactions on Power Systems 16.4 (Nov. 2001), pp. 798-805. doi: 10.1109/59.962429.

[8] M. Y. Cho, J. C. Hwang, and C. S. Chen. "Customer short term load forecasting by using ARIMA transfer function model". In: Proceedings 1995 International Conference on Energy Management and Power Delivery (EMPD '95). Vol. 1. Nov. 1995, pp. 317-322. doi: 10.1109/EMPD.1995.500746.

[9] A. Almalaq and G. Edwards. "A Review of Deep Learning Methods Applied on Load Forecasting". In: 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA). Dec. 2017, pp. 511-516. doi: 10.1109/ICMLA.2017.0-110.

[10] Ervin Ceperic, Vladimir Ceperic, and Adrijan Barić. "A Strategy for Short-Term Load Forecasting by Support Vector Regression Machines". In: IEEE Transactions on Power Systems 28 (2013), pp. 4356-4364.

[11] N. I. Sapankevych and R. Sankar. "Time Series Prediction Using Support Vector Machines: A Survey". In: IEEE Computational Intelligence Magazine 4.2 (May 2009), pp. 24-38. doi: 10.1109/MCI.2009.932254.

[12] Jerome H. Friedman. "Stochastic Gradient Boosting". In: Computational Statistics & Data Analysis 38.4 (Feb. 2002), pp. 367-378. doi: 10.1016/S0167-9473(01)00065-2.

[13] Jerome H. Friedman. "Greedy function approximation: A gradient boosting machine". In: The Annals of Statistics 29.5 (Oct. 2001), pp. 1189-1232. doi: 10.1214/aos/1013203451.

[14] Seunghyoung Ryu, Jaekoo Noh, and Hongseok Kim. "Deep neural network based demand side short term load forecasting". In: 2016 IEEE International Conference on Smart Grid Communications (SmartGridComm). Nov. 2016, pp. 308-313. doi: 10.1109/SmartGridComm.2016.7778779.

[16] K. Amarasinghe, D. L. Marino, and M. Manic. "Deep neural networks for energy load forecasting". In: 2017 IEEE 26th International Symposium on Industrial Electronics (ISIE). June 2017, pp. 1483-1488. doi: 10.1109/ISIE.2017.8001465.

[17] Daniel L. Marino, Kasun Amarasinghe, and Milos Manic. "Building Energy Load Forecasting using Deep Neural Networks". In: CoRR abs/1610.09460 (2016). arXiv: 1610.09460. url: http://arxiv.org/abs/1610.09460.

[18] George Edward Pelham Box and Gwilym Jenkins. Time Series Analysis, Forecasting and Control. USA: Holden-Day, Inc., 1990. isbn: 0816211043.

[19] S. Parkpoom, G. P. Harrison, and J. W. Bialek. "Climate change impacts on electricity demand". In: 39th International Universities Power Engineering Conference (UPEC 2004). Vol. 3. Sept. 2004, pp. 1342-1346.

[20] S. Khatoon et al. "Effects of various factors on electric load forecasting: An overview". In: 2014 6th IEEE Power India International Conference (PIICON). Dec. 2014, pp. 1-5. doi: 10.1109/POWERI.2014.7117763.

[21] S. Zuozhi et al. "The Study of SCADA System-Based Continuous Data Protection Technology". In: 2012 Fourth International Conference on Computational and Information Sciences. Aug. 2012, pp. 400-403. doi: 10.1109/ICCIS.2012.353.

[22] Y. Cao, Z. J. Zhang, and C. Zhou. "Data Processing Strategies in Short Term Electric Load Forecasting". In: 2012 International Conference on Computer Science and Service System. Aug. 2012, pp. 174-177. doi: 10.1109/CSSS.2012.51.

[23] N. V. Kumar. "Detection and correction of time-skew in SCADA measurement". In: 2017 IEEE Power & Energy Society Innovative Smart Grid Technologies Conference (ISGT). Apr. 2017, pp. 1-5. doi: 10.1109/ISGT.2017.8086014.

[24] Tianqi Chen and Carlos Guestrin. "XGBoost: A Scalable Tree Boosting System". In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16). 2016, pp. 785-794. doi: 10.1145/2939672.2939785.

[25] Guolin Ke et al. "LightGBM: A Highly Efficient Gradient Boosting Decision Tree". In: Advances in Neural Information Processing Systems 30. Ed. by I. Guyon et al. Curran Associates, Inc., 2017, pp. 3146-3154. url: http://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree.pdf.

[26] F. A. Gers, J. Schmidhuber, and F. Cummins. "Learning to forget: continual prediction with LSTM". In: 1999 Ninth International Conference on Artificial Neural Networks (ICANN 99). Vol. 2. Sept. 1999, pp. 850-855. doi: 10.1049/cp:19991218.

[27] Sepp Hochreiter and Jürgen Schmidhuber. "Long Short-term Memory". In: Neural Computation 9.8 (Dec. 1997), pp. 1735-1780. doi: 10.1162/neco.1997.9.8.1735.

[28] Junyoung Chung et al. "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling". In: CoRR abs/1412.3555 (2014). arXiv: 1412.3555. url: http://arxiv.org/abs/1412.3555.

[29] A. M. Pirbazari, A. Chakravorty, and C. Rong. "Evaluating Feature Selection Methods for Short-Term Load Forecasting". In: 2019 IEEE International Conference on Big Data and Smart Computing (BigComp). Feb. 2019, pp. 1-8. doi: 10.1109/BIGCOMP.2019.8679188.

[30] Philip Sedgwick. "Pearson's correlation coefficient". In: BMJ 345 (2012). doi: 10.1136/bmj.e4483. url: https://www.bmj.com/content/345/bmj.e4483.

[31] W. Kong et al. "Short-Term Residential Load Forecasting Based on LSTM Recurrent Neural Network". In: IEEE Transactions on Smart Grid 10.1 (Jan. 2019), pp. 841-851. doi: 10.1109/TSG.2017.2753802.

[32] X. Wang et al. "Factors that Impact the Accuracy of Clustering-Based Load Forecasting". In: IEEE Transactions on Industry Applications 52.5 (Sept. 2016), pp. 3625-3630. doi: 10.1109/TIA.2016.2558563.

[33] Q. Xu et al. "A Short-Term Wind Power Forecasting Approach With Adjustment of Numerical Weather Prediction Input by Data Mining". In: IEEE Transactions on Sustainable Energy 6.4 (Oct. 2015), pp. 1283-1291.

[34] Saeed Reza Aghabozorgi, Ali Seyed Shirkhorshidi, and Teh Ying Wah. "Time-series clustering - A decade review". In: Information Systems 53 (2015), pp. 16-38.

[35] Jui-Sheng Chou and Ngo Ngoc-Tri. "Smart grid data analytics framework for increasing energy savings in residential buildings". In: Automation in Construction 72 (Dec. 2016), pp. 247-257.
