
Deep Neural Networks Based

Disaggregation of Swedish Household Energy Consumption

Praneeth Varma Bhupathiraju

Faculty of Computing

Blekinge Institute of Technology SE-371 79 Karlskrona Sweden


This thesis is submitted to the Faculty of Computing at Blekinge Institute of Technology in partial fulfillment of the requirements for the degree of Master of Science in Computer science. The thesis is equivalent to 20 weeks of full-time studies.

Contact Information:

Author(s):

Praneeth Varma Bhupathiraju E-mail: prbh16@student.bth.se

External advisor:

Susheel Sagar Data Scientist

Eliq AB, Gothenburg

University Advisor:

Dr. Huseyin Kusetogullari

Department: Computer Science and Engineering, Blekinge Institute of Technology, Karlskrona, Sweden

Faculty of Computing
Blekinge Institute of Technology
SE-371 79 Karlskrona, Sweden

Internet: www.bth.se
Phone: +46 455 38 50 00
Fax: +46 455 38 50 57


ABSTRACT

Context: In recent years, household energy consumption has increased to levels that are no longer sustainable, and there is a dire need to find ways to use energy more sustainably. One of the main causes of this unsustainable usage is that users are not well acquainted with the energy consumed by the appliances (dishwasher, refrigerator, washing machine, etc.) in their households. To let household users know how much energy each appliance consumes, energy analytics companies must analyze the energy consumed by the appliances present in a house. To achieve this, Kelly et al. [7] performed the task of energy disaggregation using deep neural networks, producing good results. Zhang et al. [8] went a step further and improved on the deep neural networks proposed by Kelly et al. The task was performed using the Non-Intrusive Load Monitoring (NILM) technique.

Objectives: The thesis aims to assess the performance of the deep neural networks proposed by Kelly et al. [7] and Zhang et al. [8]. We use deep neural networks for disaggregating the dishwasher's energy consumption, in the presence of vampire loads such as electric heaters, in a Swedish household setting. We also measure the training time of the proposed deep neural networks.

Methods: An intensive literature review is done to identify state-of-the-art deep neural network techniques used for energy disaggregation. The algorithms chosen from the literature review are the Simple Recurrent Neural Network (SRN), Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), and Recurrent Convolutional Neural Network (RCNN). All the experiments are performed on a dataset provided by the energy analytics company Eliq AB. The data is collected from 4 households in Sweden. All the households contain a vampire load, an electric heater, whose power consumption is visible in the mains power sensor. A separate smart plug is used to collect the dishwasher power consumption data. Each algorithm is trained on data from two of the houses; the remaining two houses are used for testing. The metrics used for analyzing the algorithms are Accuracy, Recall, Precision, Root Mean Square Error (RMSE), and F1 score. These metrics help us identify the algorithm best suited for disaggregation of dishwasher energy in our case.

Results: The results of our study show that the Gated Recurrent Unit (GRU) performed best compared to the other neural networks in our study, namely the Simple Recurrent Neural Network (SRN), Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), and Recurrent Convolutional Neural Network (RCNN). The Accuracy, RMSE, and F1 score of the GRU algorithm are higher than those of the other algorithms. However, if the user does not consider F1 score and RMSE as evaluation metrics and instead considers training time, then the simple recurrent neural network outperforms all the other neural networks with an average training time of 19.34 minutes.

Conclusions: The thesis assesses the performance and potential of deep neural networks on the Non-Intrusive Load Monitoring (NILM) problem. To assess the performance of the proposed algorithms, the critical metrics selected are RMSE and F1 score. From our study, we conclude that GRU is the best-performing algorithm in the proposed set for disaggregating the dishwasher's energy consumption in the presence of vampire loads in a Swedish household setting. GRU is the best neural network when we consider RMSE and F1 score as the metrics; if we take training time as the metric, the simple recurrent neural network is the best, with an average training time of 19.34 minutes.

Keywords: Deep learning, Non-intrusive load monitoring, disaggregation.


Acknowledgements

I would like to thank my supervisor, Dr. Huseyin Kusetogullari, for his excellent supervision of my study. The study would not have been possible without his great support; his excellent guidance and knowledge have made me more knowledgeable by the end of my thesis. I also thank my external supervisor at Eliq AB, Gothenburg, Susheel Sagar, for his continuous support during the thesis.

Finally, I thank my family and friends for their unconditional love and support.


CONTENTS

1 INTRODUCTION ... 1

1.2 Research gap... 2

1.3 Aim and objectives ... 2

1.4 Research Questions ... 3

1.5 Outline ... 3

2 RELATED WORK ... 4

3 DEEP LEARNING METHODS USED IN OUR STUDY ... 5

3.1 Background ... 5

3.2 Recurrent Neural Network ... 5

3.3 Simple Recurrent Network (SRN) ... 6

3.4 Long short-term memory (LSTM) ... 7

3.5 Gated Recurrent Unit (GRU) ... 9

3.6 Convolution Neural Network (CNN) ... 10

3.7 Recurrent Convolutional Neural Network (RCNN) ... 11

3.8 Optimization Methods ... 12

4 METHODOLOGY ... 14

4.1 Literature Review ... 14

4.2 Experimental Environment ... 14

4.3 Dataset ... 15

4.4 Pre-Processing ... 16

4.5 Adopting deep neural networks on NILM ... 16

4.6 Metrics ... 21

4.6.1 Root Mean Square Error (RMSE) ... 22

4.6.2 F1 Score ... 22

4.7 Training time ... 23

4.8 Statistical Tests ... 23

5 RESULTS & ANALYSIS ... 24

5.1 House Validation seen during the training on houses 1 and 2 ... 24

5.2 Statistical tests ... 29

5.2.1 Friedman and Nemenyi test performed on accuracy for House 1 validation ... 29

5.2.2 Friedman and Nemenyi test performed on F1 score for House 1 validation ... 30

5.2.3 Friedman and Nemenyi test performed on training time for House 1 (in minutes) ... 30

5.2.4 Friedman and Nemenyi test performed on accuracy for House 2 validation ... 31

5.2.5 Friedman and Nemenyi test performed on F1 score for House 2 validation ... 31

5.2.6 Friedman and Nemenyi test performed on training time for House 2 (in minutes) ... 32

5.3 Testing on unseen houses 3 and 4 ... 32

6 DISCUSSION ... 38


6.1 Research question Answers ... 38

6.2 Contributions ... 38

6.3 Threats to Validity ... 39

6.3.1 Internal Validity: ... 39

6.3.2 External Validity: ... 39

6.3.3 Conclusion Validity: ... 39

6.4 Limitations... 40

7 CONCLUSIONS and FUTURE WORK ... 41

7.1 Conclusions from the study ... 41

7.2 Future Work ... 41

8 REFERENCES ... 42


List of tables

Table 4.1 Data Set... 15

Table 4.2 Window length of the dishwasher ... 18

Table 4.3 CNN Architecture ... 19

Table 4.4 RCNN Architecture ... 20

Table 4.5 LSTM Architecture ... 20

Table 4.6 SRN Architecture ... 20

Table 4.7 GRU Architecture ... 21

Table 5.1 Accuracy ranks for deep neural networks on 10-fold cross-validation ... 29

Table 5.2 F1 score ranks for deep neural networks on 10-fold cross-validation ... 30

Table 5.3 Training time ranks for deep neural networks on 10-fold cross-validation ... 30

Table 5.4 Accuracy ranks for deep neural networks on 10-fold cross-validation ... 31

Table 5.5 F1 score ranks for deep neural networks on 10-fold cross-validation ... 31

Table 5.6 Training time ranks for deep neural networks on 10-fold cross-validation ... 32

Table 5.7 Average training time of the algorithms on house 1 and 2 ... 32

Table 5.8 Average testing time of the algorithms on house 3 and 4 ... 36


List of Figures

Figure 3.1: Elman Network ... 7

Figure 3.2 Basic LSTM structure ... 8

Figure 3.3 Gated Recurrent Unit ... 9

Figure 3.4 EHNET Architecture ... 11

Figure 4.1 Sliding Window Approach ... 16

Figure 4.2 System Operation Flow ... 17

Figure 4.3 Window length impact on disaggregation ... 18

Figure 5.1 House 1 Validation ... 26

Figure 5.2 House 2 Validation ... 28

Figure 5.3 House 3 Testing ... 34

Figure 5.4 House 4 Testing ... 36


1 INTRODUCTION

Buildings account for 30% of total energy usage around the world, of which up to 93% is due to residential buildings [1-3]. Energy consumption has increased to a level that is neither economically nor environmentally sustainable, and the growing supply of electrical appliances on the consumer market has only increased the demand for electricity. According to the U.S. Energy Information Administration, residential electricity usage in 2008 amounted to about 8.359×10 million kWh, representing about 22% of the electricity consumption within the nation [39]. If this electricity consumption were reduced by just one percent, we could reduce CO2 emissions by about 6.269×10 tons and save about 91 trillion dollars [39]. Consumers need a cheap Energy Management System (EMS) for their homes to fill a range of needs. Management of loads supports conservation efforts and may provide warnings of abnormal or unwanted energy use. When the local utility implements Time-of-Use (ToU) pricing, an EMS is crucial in aiding the active response of the consumer. Central to an EMS is real-time information on which devices are on at any moment and how much energy they are consuming. Real-time knowledge of the energy consumption of the appliances can give experts insights into how to use power sustainably.

Non-Intrusive Load Monitoring (NILM) makes it possible to use one (or a few) measurements of energy at an electrical panel; from the disaggregation of these measurements, the consumption of individual devices is determined [23]. Studies have shown that a residential building's energy consumption can be reduced by 5-15% by providing the householders with information corresponding to the energy breakdown: the quantity of energy consumed by the individual household appliances [4-5]. With such an analysis, utility companies can target conservation programs at homes that have inefficient appliances like fridges, dishwashers, water heaters, etc. [11].

Two types of techniques exist for monitoring the individual appliance’s power consumption, namely, Intrusive Load Monitoring (ILM) and Non-Intrusive Load Monitoring (NILM). In the ILM approach, each appliance’s energy consumption is monitored by installing a sensor for each device. However, installing a sensor for every device makes this approach an extremely costly affair [5, 9]. Non-Intrusive Load Monitoring (NILM) is considered, by many, one of the most promising technologies to unobtrusively identify and monitor not only the overall energy consumption but also individual appliances that co-exist in a building, without the need to instrument every electrical device [38].

NILM (also called energy disaggregation) is a computational technique for estimating the power demand of individual appliances in a household from a single meter (called the mains meter), which measures the whole house's power consumption [7]. In broad terms, it can be defined as a set of signal-processing and machine-learning techniques used to estimate the whole-house and individual appliance electricity consumption from current and voltage measurements taken at a limited number of locations in the electric distribution of a house (optimally the mains, hence covering the demand of the entire house) [38]. For this reason, NILM is considered a low-cost alternative to the ILM technique.

Recently, deep neural networks have been applied to energy disaggregation by Kelly et al.; the neural networks presented in [7] have outperformed all other state-of-the-art approaches. Furthermore, an extension of Kelly et al.'s neural network approach, presented in [8], has been shown to outperform the work done in [7].


1.2 Research gap

All the approaches mentioned above have worked with open-source datasets that contain data collected from households in different countries such as the UK, the US, India, etc. None of these datasets contains vampire loads. Vampire loads, or always-on loads, are appliances that continuously consume some power and are never turned off. For example, the electric heater in most Swedish households is never turned off (except in summer).

Among all the above-mentioned approaches, the neural network-based algorithms presented in [7] and [8] have shown significantly higher performance than the others. These neural networks have been trained and tested on an open-source dataset called UK-DALE, which contains data from 5 households in the UK. These households do not have vampire loads, such as electric heaters, whereas most typical Swedish households (excluding district-heating-based households) have an electric heater connected to their mains meter, which measures the whole-house power consumption. Due to the low temperatures in Sweden, the radiators are turned on throughout the year, except in summer. The heater power consumption adds noise to the mains power signal and thus makes the disaggregation task hard to perform [7-8]. In the literature, as explained in [7], the disaggregation task has never been attempted in the presence of vampire loads such as electric heaters. Additionally, the previous research work corresponding to NILM worked with UK, US, and Indian datasets, but no research has been performed in a Scandinavian or Swedish setting.

Furthermore, as the neural networks require very high computational power, the authors in [7-8] have trained all the algorithms on powerful graphical processing units (GPUs). However, they have not analyzed and presented the time complexity of each of these algorithms that they have proposed.

1.3 Aim and objectives

Aim

The research aims to assess the performance and time complexity of the deep neural networks presented in [7] and [8] for disaggregating the dishwasher's energy consumption, in the presence of vampire loads such as electric heaters, in a Swedish household setting.

Objectives

• Preprocess the Swedish household data (which includes electric heaters in the mains reading) to remove missing values and errors in the dataset, and separate the training and testing data.

• Code and build the neural network algorithms presented in [7] and [8] so that the algorithms can learn from the preprocessed data on a GPU.

• Test and compare the performance of the implemented algorithms on the Swedish dataset (with vampire loads).


1.4 Research Questions

RQ1

How well do the algorithms presented in [7] and [8] perform the task of disaggregating dishwasher energy consumption in a Swedish household setting with vampire loads?

Motivation:

As explained in the research gap in Section 1.2, the task of disaggregation has never been performed in a Swedish setting in the presence of vampire loads. Additionally, the extraordinary success of the neural networks presented in [7] and [8] in the UK setting, without vampire loads, provides the primary motivation for this research investigation.

The motivation for choosing the dishwasher as the appliance to be disaggregated is that the dishwasher is one of the most common household appliances that consume a significant amount of power, as it is used daily. Additionally, multi-state appliances such as dishwashers or washing machines are harder to disaggregate than two-state (ON-OFF) appliances such as a television [17].

RQ2

What is the time complexity of each of the algorithms presented in [7] and [8] on the Swedish household data, when implemented on a GPU?

Motivation:

In [7] and [8], the authors have not discussed the time complexity, i.e., the training time and inference time of the neural network models. However, it is crucial to analyze the time complexity of these algorithms to understand their scalability when deployed across many households.

1.5 Outline

Chapter 1 introduces what the research is about and the gaps in earlier studies, and presents the research questions and their motivations. Chapter 2 introduces the previous work related to this research, followed by Chapter 3, where we give a brief introduction to machine learning and describe all the neural network algorithms used in our study. Chapter 4 first describes the dataset briefly and then explains the methodology and the metrics used to answer the research questions. Chapter 5 presents the results of the experiments and analyzes them in detail. In Chapter 6, we discuss the results and answer the research questions, along with the contributions made by this work; we also discuss various validity threats and the limitations of this work. Finally, we present some future work that could be carried out in this field of study.


2 RELATED WORK

Research on NILM started with the seminal work of George Hart, who first introduced the concept of NILM in 1984 [17]. Since then, several approaches have been proposed to solve the problem of energy disaggregation. In the early days, researchers analyzed both transient and steady states to disaggregate the energy consumed by each appliance. Active power, reactive power, current waveforms, and harmonic components were used as features for load disaggregation in the steady-state analysis [40, 41]. On the other hand, in the transient-state analysis, the transient shape and transient energy were used to disaggregate energy consumption [42].

In the above cases, Fourier transforms, wavelet transforms, and other electrical parameters were used. Such hand-constructed features take a lot of time to construct and are prone to errors. To mitigate these errors, researchers have started using deep neural networks so that features can be learned automatically from the signal itself. A few signal processing-based approaches were also presented in [10] [13-14]. Additionally, some researchers tried to include domain knowledge (such as On-Off state changes, total energy consumption, etc.) in the models [12].

Recently, deep neural networks were applied to energy disaggregation by Kelly et al.; the neural networks presented in [7] outperformed all other state-of-the-art approaches. Kelly et al. used deep neural networks such as Long Short-Term Memory and Denoising Autoencoders (DAE) for energy disaggregation, and it was observed that denoising autoencoders performed better than the other networks. Several favored approaches presented in [6] [15-16] were based on Factorial Hidden Markov Models (FHMM).

Jacob et al. [6] proposed hidden Markov models with which they are able to determine the operational states of a finite-state machine (FSM) device. Zhang et al. [8] proposed sequence-to-point learning with a CNN, in which a single-point representation is produced for each appliance from the input aggregate sequence; their approach outperformed Additive Factorial Hidden Markov Model (AFHMM) based approaches and denoising autoencoders in energy disaggregation [40].

Bin et al. [43] propose a deep recurrent neural network based on LSTM, in which the network can estimate the power signal of a target appliance or any sub-circuit from the aggregate signal, after supervised training of the network using a submeter measurement of the target appliance.

Yang et al. [44] used a deep convolutional neural network architecture to disaggregate energy consumption, and the results were good across various loads. The proposed method had a fixed architecture and a fixed set of hyperparameters that produced acceptable results.

Due to the recent advancements in the field of deep neural networks, they have not only been used in energy disaggregation but have also been extensively used in image classification, pattern recognition, and machine translation [33, 34, 46, 47]. It is worthwhile to explore the recent advancements made in the field of deep neural networks and the way they can be applied to energy disaggregation.


3 DEEP LEARNING METHODS USED IN OUR STUDY

3.1 Background

Learning is the main hallmark of human intelligence and the primary means of obtaining knowledge. Machine learning is a fundamental way to make a computer intelligent. R. Schank has said: "If a computer cannot learn, it will not be called intelligent." Learning is an integrative mental activity, closely related to memory, thinking, perception, feeling, and other mental activities [19].

Machine learning is the study of how to use computers to simulate human learning activities; it is the study of how computers gain new knowledge and new skills, identify existing knowledge, and continuously improve their performance to achieve self-improvement [18].

Different Machine learning categories are explained below.

Supervised Learning: In this sort of learning, algorithms are generally given a certain quantity of inputs together with their corresponding outputs. The goal of this learning is to learn the pattern which enables the inputs to map to their corresponding outputs.

Unsupervised Learning: In this type of learning, algorithms are not provided with a certain amount of inputs along with their corresponding outputs. The algorithm needs to figure out the pattern from the input itself since there is no output available for the algorithm. The goal for this kind of learning is to discover the trends which are present in the inputs.

Semi-Supervised Learning: In this type of learning, algorithms are provided with a certain amount of inputs along with their corresponding outputs, as well as additional inputs without corresponding outputs. The goal of this kind of learning is to find the pattern using both the labeled (input and output) data and the unlabeled (input only) data.

Deep learning is an umbrella term for a family of machine learning techniques based on deep neural networks. Deep neural networks are systems that consist of many layers (as opposed to shallow neural networks). The main objective of these kinds of networks is to learn from the input features: each layer processes the data it receives and sends a better representation of the data to the next layer. Adding layers to a neural network can exponentially increase the representational power of that network.

3.2 Recurrent Neural Network

This type of network has gained plenty of attention during recent years. It differs from the traditional Feedforward Neural Network (FNN) because it introduces a recurrent structure implementing a memory mechanism, which is absent in FNNs [32]. These networks have proven attractive for solving machine learning tasks in recent times. In this type of network, the output sometimes depends on the previous state. An RNN is an extension of a traditional neural network which can handle variable-length sequence inputs; variable-length sequences are handled by recurrent hidden layers whose activation at each time step depends on that of the previous time step [49]. Through this structure, neurons keep track of past information and use it to influence the output at the current moment, making the network suitable for predicting time-series data [32]. The advantage of this sort of network is the feedback loop. There are many reasons for using this sort of network for predictions, several of which are as follows:


• These kinds of networks provide end-to-end learning, which helps us in applying them to NILM. The amount of information that must be passed to the system can also be minimized, which in turn helps in adopting these networks for large-scale use.

• RNNs have a vast untapped potential because we still know little about these networks, how to train them, and their architectures. So, due to the complexity involved in understanding RNNs, there is potential for improvements in the coming years.

The training of RNNs is similar to that of feed-forward neural networks: the networks are trained via backpropagation through time. In the past, due to the presence of many local minima on the error surface of these networks, it was tough to achieve good results. The main drawback of this kind of network is the vanishing gradient problem, which prevents these networks from learning long-range dependencies. Advances in this field have led to new architectures and powerful GPUs, which have contributed to reducing the problem of the vanishing gradient to a large extent and to attaining good results.

3.3 Simple Recurrent Network (SRN)

As the name suggests, the architecture of this network is a simple recurrent neural network, also known as a vanilla RNN. The Elman network is one form of vanilla RNN.

• The Elman network (figure 3.1) consists of four layers: an input layer, a hidden layer, a context layer, and an output layer. The rule for updating the parameters is defined in equations 3.1 and 3.2. In figure 3.1, the input layer is given by x_t, h_t represents the hidden layer, y_t represents the output layer, the context layer holds the previous hidden state, and b is the bias. These layers are interconnected and consist of adjustable weights.

• Generally, it is considered a special kind of feed-forward neural network with additional memory neurons and local feedback [35]. One drawback of the SRN is that it suffers from the vanishing and exploding gradient problems. When we train the network using backpropagation, the error can vanish because of the repeated multiplications across many layers; this is known as the vanishing gradient problem.

• The opposite of the vanishing gradient problem is the exploding gradient problem. In this scenario, the error keeps increasing while we perform training via backpropagation, causing the neural network to diverge.

Due to recent developments in this field, researchers have tried to mitigate this problem without changing the architecture of the network. Navdeep et al. [36] proposed initializing the recurrent weight matrix with the identity matrix or a scaled version of it. The activation function used was the rectified linear unit (ReLU). By making these changes to the network, they could solve the problem of long-range dependencies.

The equation of updating the parameters in the Elman network is represented as follows

h_t = \sigma(W_{xh} x_t + W_{hh} h_{t-1} + b_h)    (3.1)

y_t = \sigma(W_{hy} h_t + b_y)    (3.2)

In these equations, x_t represents the input, h_t represents the hidden state, \sigma represents the nonlinearity, W_{xh} is the input-to-hidden matrix, W_{hh} is the hidden-to-hidden matrix, W_{hy} is the hidden-to-output matrix, b represents the bias, and t represents the time instant.
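To make the recurrence concrete, below is a minimal NumPy sketch of an Elman (SRN) forward pass over one input window, following equations 3.1 and 3.2. The use of tanh as the nonlinearity, the weight shapes, and the random initialization are illustrative assumptions, not the configuration used in the experiments of this thesis.

```python
import numpy as np

def elman_forward(x_seq, W_xh, W_hh, W_hy, b_h, b_y):
    """Run an Elman (SRN) cell over a sequence, following equations 3.1 and 3.2.

    x_seq has shape (T, input_dim); the function returns outputs of shape (T, output_dim).
    """
    h = np.zeros(W_hh.shape[0])       # context layer: previous hidden state, initially zero
    outputs = []
    for x_t in x_seq:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)   # equation 3.1: new hidden state
        y_t = np.tanh(W_hy @ h + b_y)              # equation 3.2: output from hidden state
        outputs.append(y_t)
    return np.stack(outputs)

# Illustrative example: a 500-sample window of mains power, 16 hidden units, 1 output
rng = np.random.default_rng(0)
T, d_in, d_h, d_out = 500, 1, 16, 1
y = elman_forward(rng.normal(size=(T, d_in)),
                  rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)),
                  rng.normal(size=(d_out, d_h)), np.zeros(d_h), np.zeros(d_out))
print(y.shape)  # (500, 1)
```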


Figure 3.1: Elman Network

3.4 Long short-term memory (LSTM)

A recurrent neural network consisting of LSTM units is called an LSTM network. These kinds of networks have been used for various problems with excellent success. The LSTM can be used as a sophisticated nonlinear unit to construct a larger deep neural network, which can reflect the effect of long-term memory and has the ability of deep learning. An LSTM network consists of an input layer, an output layer, and a number of hidden layers, where the hidden layer is composed of memory cells. The structure of a basic LSTM is shown in figure 3.2. One cell consists of three gates (input, forget, output) and a recurrent connection unit [21]. The input to the LSTM cell is the current input x_t at step t together with the previous hidden state h_{t-1}; the output of the unit is given by h_t, and c_t represents the internal memory. LSTM networks were designed to overcome the problem of the vanishing gradient in recurrent neural networks. They use a concept called gates, which allows a smooth gradient flow. Gates in an LSTM output a value between 0 and 1, and in most cases the value is close to either 0 or 1. We use a sigmoid function for the gates because we want a gate to give only positive values and to provide a clear-cut answer on whether a particular feature should be kept or discarded. The following set of equations determines the working of an LSTM cell.

Input gate:

i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)    (3.3)

i_t represents the input gate, and \sigma represents the sigmoid function. x_t is the input at the current timestamp, h_{t-1} denotes the output of the previous LSTM block (at timestamp t-1), W_{xi} and W_{hi} represent the weights for the input gate neurons, and b_i denotes the bias of the input gate. The input gate tells us what new information will be stored in the cell state.

Forget gate:

f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)    (3.4)



f_t represents the forget gate, and \sigma represents the sigmoid function. x_t is the input at the current timestamp, h_{t-1} denotes the output of the previous LSTM block (at timestamp t-1), W_{xf} and W_{hf} represent the weights for the forget gate neurons, and b_f denotes the bias of the forget gate. The forget gate tells us which information to discard from the cell state.

Output gate:

o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)    (3.5)

o_t represents the output gate, and \sigma represents the sigmoid function. x_t is the input at the current timestamp, h_{t-1} denotes the output of the previous LSTM block (at timestamp t-1), W_{xo} and W_{ho} represent the weights for the output gate neurons, and b_o denotes the bias of the output gate. The output gate determines how much of the cell state contributes to the final output of the LSTM block at timestamp t.

Input transform:

\tilde{c}_t = \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)    (3.6)

\tilde{c}_t represents the candidate for the cell state, and tanh represents the activation function. x_t is the input at the current timestamp, h_{t-1} denotes the output of the previous LSTM block (at timestamp t-1), W_{xc} and W_{hc} represent the weights for the candidate cell state neurons, and b_c denotes the bias for the candidate cell state. The candidate cell state acts as an input for the state update.

State Update:

c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t    (3.7)

c_t represents the cell state (memory) at timestamp t. In equation (3.7) we can see that the cell state combines what needs to be forgotten from the previous state c_{t-1} with what needs to be taken from the current candidate \tilde{c}_t.
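To illustrate equations 3.3-3.7, a minimal NumPy sketch of a single LSTM cell step is given below. The final output step h_t = o_t ⊙ tanh(c_t), the sigmoid helper, and the weight shapes are standard assumptions added for completeness; they are not taken from the architectures used later in this thesis.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM cell step implementing equations 3.3-3.7.

    W, U and b are dicts holding input weights, recurrent weights and biases
    for the gates 'i', 'f', 'o' and the candidate 'c'.
    """
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])    # input gate, eq. 3.3
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])    # forget gate, eq. 3.4
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])    # output gate, eq. 3.5
    c_hat = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])  # candidate, eq. 3.6
    c_t = f_t * c_prev + i_t * c_hat                          # state update, eq. 3.7
    h_t = o_t * np.tanh(c_t)                                  # standard output step
    return h_t, c_t

# Illustrative example with random weights: 1-dimensional input, 8 memory cells
rng = np.random.default_rng(0)
d_in, d_h = 1, 8
W = {k: rng.normal(size=(d_h, d_in)) for k in 'ifoc'}
U = {k: rng.normal(size=(d_h, d_h)) for k in 'ifoc'}
b = {k: np.zeros(d_h) for k in 'ifoc'}
h_t, c_t = lstm_step(np.array([0.5]), np.zeros(d_h), np.zeros(d_h), W, U, b)
```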


Figure 3.2 Basic LSTM structure


3.5 Gated Recurrent Unit (GRU)

The GRU was proposed by Cho et al. [48] to make each recurrent unit (or neuron) capable of adaptively capturing dependencies on different time scales [32]. The usual RNN suffers from the vanishing gradient problem; to mitigate this problem, we replace the hidden node with a GRU node. Figure 3.3 illustrates the functioning of the GRU, and the equations below describe its working. A GRU node consists of two gates, an update gate and a reset gate [49]. The update gate decides how much a unit updates its activation; it is given by equation (3.8). The reset gate helps to forget the previous state; it is calculated by equation (3.9). The hidden state of this network is calculated by equation (3.11) using the candidate activation in equation (3.10).

Figure 3.3 Gated Recurrent Unit

We use the following model parameters in the GRU-RNN: x_t represents the input to the network at time t, and the weight matrices are denoted by W_z, U_z, W_r, U_r, W, and U. The gates are calculated using the equations below:

z_t = \sigma(W_z x_t + U_z h_{t-1})    (3.8)

z_t is the update gate, which decides what proportion of its activation or content the unit updates. W_z is the weight matrix for the input and U_z is the weight matrix for the previous time step.

r_t = \sigma(W_r x_t + U_r h_{t-1})    (3.9)

The reset gate r_t is computed in a similar way to the update gate.

\tilde{h}_t = \tanh(W x_t + U(r_t \odot h_{t-1}))    (3.10)

Here r_t represents the reset gate and \odot is an element-wise multiplication. When the reset gate is off (r_t close to 0), it effectively makes the unit act as if it is reading the first symbol of an input sequence, allowing it to forget the previously computed state [49].

h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t    (3.11)

The output of the unit (or activation) h_t at time step t is a linear interpolation between the previous activation h_{t-1} and the candidate activation \tilde{h}_t.
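For completeness, a minimal NumPy sketch of one GRU step implementing equations 3.8-3.11 is shown below; the sigmoid helper and the omission of bias terms follow the equations above, while the weight shapes are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, U_z, W_r, U_r, W, U):
    """One GRU step following equations 3.8-3.11 (biases omitted, as in the text)."""
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev)        # update gate, eq. 3.8
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev)        # reset gate, eq. 3.9
    h_hat = np.tanh(W @ x_t + U @ (r_t * h_prev))  # candidate activation, eq. 3.10
    return (1.0 - z_t) * h_prev + z_t * h_hat      # new activation, eq. 3.11
```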


3.6 Convolution Neural Network (CNN)

The Convolutional Neural Network is a particular type of deep learning technique designed to recognize visual patterns from image pixels; it can recognize patterns with extreme variability [22]. Initially, this sort of network was used mainly in the image recognition domain.

Each convolution layer of a CNN is composed of multiple feature maps. Feature maps are in the form of a plane, and all the neurons of the feature map are constrained to share the same set of synaptic weights. Each neuron in CNN takes inputs from a receptive field in the previous layer, which enables it to extract local features [22].

These kinds of networks consist of many layers, unlike recurrent neural networks. The most important layer types are the following.

Convolution Layer: This layer is the primary building block of the convolutional neural network. It consists of convolutional kernels that are applied across the whole image. These convolutional kernels help in extracting features from the images. By adding more layers, we can extract more complex features, since the output of the previous layers is passed as input to the current layer. Instead of being hard-coded, the kernels are learned by the backpropagation algorithm.

ReLU Layer: ReLU stands for rectified linear unit. In this layer, an element-wise activation function is applied, which has certain advantages compared to traditional activation functions, such as:

• It is very fast to compute.

• Due to its constant derivative, it reduces the problem of the vanishing gradient compared to the tanh and sigmoid functions.

• It helps in producing sparse representations.

• The representation range ([0, ∞)) of the ReLU activation function is larger than the representation range ([0, 1]) of the sigmoid activation function.

Pooling layer: Downsampling of the data is done in this layer; we take the average of a small region of the input data. Downsampling the information helps the network become invariant to translations: even if we translate the input data, the result will be the same. Another significant advantage is the speed gained by the network due to the downsampling of the data between the different layers of the network.

Dropout Layer: Overfitting is a significant problem in deep neural networks because of the presence of many parameters. The reason for introducing this layer is to randomly drop some units during the training period. Dropping is generally done with some probability; this probability is chosen empirically, but it is typically around 50%. Dropping units not only reduces the problem of overfitting, it also helps in reducing the bias towards the weights in the specific layer where dropout is applied.

Dense layer: This dense layer is simply a fully connected feed-forward layer.


3.7 Recurrent Convolutional Neural Network (RCNN)

To improve convolutional neural networks, many deep learning architectures have been tried, and one of them is the recurrent convolutional neural network (RCNN). Many of them use skip connections to ease the flow of information from one layer to another. Han et al. [37] proposed an end-to-end model based on convolutional and recurrent neural networks for speech enhancement, which they term EHNET. It consists of three components: the convolutional part exploits the local patterns in the spectrogram in both the frequency and temporal domains, followed by a bidirectional recurrent component that models the dynamic correlations between consecutive frames [37]; the final layer is a fully connected layer that predicts the spectrogram. Compared with existing models such as MLPs and RNNs, due to the sparse nature of convolutional kernels, EHNET is much more data-efficient and computationally tractable [37]. As mentioned earlier, easing the flow of information from one layer to another reduces the number of parameters that need to be configured; once the number of parameters is cut, it becomes easier to use deeper networks. In the EHNET architecture (figure 3.4), the noisy spectrogram is first convolved with kernels to form feature maps, which are then concatenated to create a 2D feature map [37]; the 2D feature map is transformed by the bidirectional RNN along the time dimension; and the last component is a fully connected network that predicts the spectrogram frame by frame. EHNET is trained end to end with a loss function between the predicted spectrogram and the clean spectrogram. In a similar way, we apply the concept of the RCNN in our work.

Figure 3.4 EHNET Architecture [37]


3.8 Optimization Methods

There are various optimization techniques that have shown excellent results on deep neural networks, which is the main reason they are so widely used. Some of them are presented in this section. Stochastic gradient descent is the basis for all the other optimizers.

Stochastic Gradient descent

In a backpropagation algorithm, the neural network weights are adjusted by computing the gradient a number of times. If the training set is large, computing the gradient over the complete dataset is impractical: it would be too slow and generally requires a lot of memory. In stochastic gradient descent, the gradient is computed over a few examples (not the entire dataset). One added advantage of using stochastic gradient descent is its ability to converge to a local minimum. Instead of using large batches of data, small batches are used to reach a stable convergence and to reduce the learning variance. Another reason for using small batches of data is that a GPU, with its high computational power, processes small batches quickly since the operations are parallelized. The equation below represents the update step of stochastic gradient descent.

\theta = \theta - \alpha \nabla_\theta J(\theta; x^{(i:i+m)}, y^{(i:i+m)})    (4.1)

Here \theta is the parameter that needs to be updated, J(\theta; x^{(i)}, y^{(i)}) is the loss associated with the i-th observation in the dataset (used for training), \alpha represents the learning rate, i.e., how fast the gradient descends towards the local minimum, and m represents the mini-batch size.
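As a small sketch of equation 4.1, a mini-batch SGD update could look as follows in Python; the gradient function passed in is a hypothetical placeholder, not part of the actual training code of this thesis (in practice the framework's built-in optimizer performs this step).

```python
def sgd_step(theta, grad_fn, x_batch, y_batch, lr=0.01):
    """One mini-batch SGD update (equation 4.1).

    grad_fn(theta, x_batch, y_batch) is assumed to return the gradient of the
    loss J with respect to theta, averaged over the mini-batch.
    """
    grad = grad_fn(theta, x_batch, y_batch)
    return theta - lr * grad    # step against the gradient, scaled by the learning rate
```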

Nesterov Accelerated Gradient (NAG)

Momentum-based stochastic gradient methods such as Nesterov's accelerated gradient descent (NAG) are widely used in training deep networks, since they perform significantly better than plain stochastic gradient descent (SGD) [24]. The equation below represents the update step of the Nesterov accelerated gradient.

In most cases, the momentum \mu and the learning rate \alpha are chosen by the user. The momentum schedule considered is generally quite strict, in order to achieve fast convergence.

v_{t+1} = \mu v_t - \alpha \nabla f(\theta_t + \mu v_t)

\theta_{t+1} = \theta_t + v_{t+1}    (4.2)

For convex functions with a Lipschitz-continuous derivative (a function is Lipschitz-continuous if it satisfies equation 4.3), NAG satisfies the convergence bound in equation 4.4, i.e., the error after N iterations decreases on the order of 1/N^2.

|f'(x) - f'(y)| \le L\,|x - y|, \quad \forall x, y    (4.3)

f(\theta_N) - f(\theta^*) \le \mathcal{O}\!\left(\frac{1}{N^2}\right)    (4.4)

The NAG optimization method is widely used for optimizing deep neural networks.

ADAM

Adam is a method for automatically tuning the learning rate used in stochastic gradient descent learning [25]. To cope with the difficulty of finding appropriate learning rates, various approaches have been proposed for automatically tuning the learning rates; such methods include AdaGrad, RMSProp, AdaDelta, and so on. It has been shown that better models can be obtained by automatically tuning the learning rates. The advantage of these methods is their robustness against complex problems such as non-convex optimization and non-stationary problems. Adam, proposed by Kingma and Ba, is one such method for automatically tuning the learning rate in SGD [25]. Its equations are given below.

g_t = \nabla_\theta f_t(\theta_{t-1})

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t

v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2

\hat{m}_t = \frac{m_t}{1 - \beta_1^t}

\hat{v}_t = \frac{v_t}{1 - \beta_2^t}

\theta_t = \theta_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}    (4.5)

\alpha represents the step size, and \beta_1, \beta_2 \in [0, 1) represent the exponential decay rates for the moment estimates. At the initial stage, m_0 and v_0 are initialized to zero.

In many cases, Adam has been shown to have superior performance compared to other optimization methods like AdaGrad, RMSProp, SGD, and NAG. Adam is a good choice for this study because it converges quickly and does not require a strict learning-rate schedule.
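To make the update rule in equation 4.5 explicit, a minimal NumPy sketch of one Adam step is shown below; in the experiments the framework's built-in Adam optimizer is what would actually be used, so this is only illustrative.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (equation 4.5); m and v are the first and second moment estimates."""
    m = beta1 * m + (1.0 - beta1) * grad          # biased first moment estimate
    v = beta2 * v + (1.0 - beta2) * grad ** 2     # biased second moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)                # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```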


4 METHODOLOGY

4.1 Literature Review

A literature review helps the researcher to know what research has been pursued in this field of electricity disaggregation and what practices are applied to conduct this kind of experiment. A few guidelines presented in [51] have been followed to obtain the state-of-the-art approaches in our research. So, to get an idea of the state-of-the-art procedures, we conduct a literature review. The following inclusion and exclusion criteria are used to filter the literature:

Is it related to Neural networks?

Is the full text of the article available?

Was the paper published in the last 20 years?

Is the paper in English?

Is it related to Non-intrusive Load Monitoring?

Is the GPU performance mentioned in the study?

By applying the above inclusion and exclusion criteria in our literature study, we extracted 35 papers for our research. The literature review is carried out in the stepwise manner described below.

4.1.1 Selection of the search strings

Various databases such as Inspec, Google Scholar, IEEE Xplore, and Scopus are used to perform this literature review. Searching in multiple databases yields many results, so we choose keywords like machine learning, deep learning, and non-intrusive load monitoring, and in turn use search strings like dishwasher, energy disaggregation, and electric heater to obtain the relevant literature. From these searches, we obtain new keywords that we in turn use for new searches to find more research related to our study.

4.1.2 Analyzing the Literature

In this phase, we separate, compare, and explain the literature obtained.

4.1.3 Evaluation of the Literature

In this phase, we further narrow down the literature. We assess only the writing that is related to our topic of energy disaggregation. We also narrow it down to the present research and explain only those concepts that are necessary for our study.

4.1.4 Writing a Review of Literature

In this final phase, we organize and write down the information extracted from the literature review in a systematic manner, so the reader can understand it easily. We also write in such a way that the reader does not lose interest while reading this research.

4.2 Experimental Environment

To answer the research questions, we have chosen an experimental approach. The experimental approach is chosen because we have the data available and must perform experiments to obtain the desired results. The aim of this study is to assess the performance and time complexity of deep neural networks for disaggregating the dishwasher's energy consumption, in the presence of vampire loads such as electric heaters, in a Swedish household setting. The results obtained from the experiments are thoroughly analyzed and evaluated. The deep neural networks are compared with one another using the accuracy and F1 score metrics to find the most effective algorithm for disaggregating the dishwasher's energy consumption; this helps in answering RQ1. The training time of these neural networks is also recorded while they are trained on a GPU; this helps in answering RQ2.

Dependent and Independent Variables

To begin the experiment, we need to first identify what the dependent and independent variables are.

Independent variables: The algorithms we have chosen for our experiments on disaggregating the dishwasher's energy consumption are CNN, LSTM, SRN, RCNN, and GRU. These are the independent variables in our study.

Dependent variables: The evaluation metrics utilized in our experiment are Root Mean Square Error (RMSE), F1 Score, Accuracy, Recall, and Precision. All these, along with training time, come under the dependent variable’s category in our study.

The experiment is performed in a Python environment. We used an Ubuntu virtual machine hired from Amazon Web Services, with 64 GB of RAM and a 1.6 GHz, 12-core processor. 64 GB of RAM is needed because neural networks generally consume plenty of memory. Neural networks are difficult to train on a CPU, so a GPU is used instead.

4.3 Dataset

The dataset for this research is provided by the energy analytics company Eliq AB. The data was collected from 4 households in Sweden. All these households have a vampire load, an electric heater, whose power consumption is always present in the mains power signal. A separate sensing device called a "smart plug" was installed in each of these four households to collect the dishwashers' power consumption data. The dishwasher data is collected initially to train the network on how to disaggregate the dishwasher's energy consumption; once trained, the dishwasher data is no longer required for energy disaggregation.

The goal is to use this dataset to compare the different algorithms. The UK-DALE dataset has been one of the benchmark datasets for applying the concept of NILM, so, to be sure of the quality of our dataset, we collected the data in a similar form to UK-DALE, except that the data collected in our case covers 4 houses. A similar number of dishwasher activations was also collected in the Swedish household data collection process; the description of the dataset is given in table 4.1. This ensures that both datasets are of equal size in terms of the number of dishwasher activations. The data from the mains sensor (which records the total household power consumption) and the smart plugs (which record the dishwasher power) is used to train and test the NILM algorithms.

The dishwasher power consumption of the four households is recorded every 8 seconds, and the sampling frequency of the mains is 1 Hz. Each file of our dataset contains the mains power, the dishwasher power, and the timestamp at which each sample was collected. The number of dishwasher samples in table 4.1 is lower than the number of mains samples because the number of dishwasher activations in a house is small compared to the continuous mains measurements. The dataset also had a problem with measurement gaps, which might be due to malfunctioning equipment or because the equipment was turned off for some time. These problems are not handled by the NILM algorithms; instead, they are taken care of during the preprocessing of the dataset.

House    Mains samples    Dishwasher samples
1        1210857          895648
2        1025567          723576
3        854798           87954
4        105736           78576

Table 4.1 Data Set


4.4 Pre-Processing

The dataset has some problems, as described in the dataset section. To reduce these issues, some preprocessing is done before feeding the dataset to the system. Missing points in the dataset are filled by forward filling; this is done when the interval of missing points is shorter than 20 seconds (otherwise, a gap is left in the dataset), and the data is resampled to an 8-second interval by averaging the points within each interval. We create mini-batches to feed our neural networks. Intervals that still contain gaps are not used in our dataset.

While training the convolutional neural network, the layers expect equally spaced samples, which is especially important for a convolutional neural network. By using equally spaced samples, we avoid the problem of wrongly training the filters of the convolutional neural network; unevenly spaced samples would result in different learning each time. The inputs passed to the system are also shifted so that they have zero mean.
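A minimal pandas sketch of the preprocessing steps described above is given below. The column names (timestamp, mains, dishwasher) and the CSV input format are assumptions made for illustration; they are not the exact layout of the Eliq dataset.

```python
import pandas as pd

def preprocess(csv_path, period="8s", max_gap="20s"):
    """Resample to an 8-second grid, forward-fill only short gaps, and zero-mean the mains."""
    df = pd.read_csv(csv_path, parse_dates=["timestamp"]).set_index("timestamp")
    # Average all readings that fall inside each 8-second interval.
    df = df.resample(period).mean()
    # Forward-fill gaps shorter than 20 seconds; longer gaps stay as NaN and are dropped,
    # leaving a gap in the dataset as described above.
    fill_limit = int(pd.Timedelta(max_gap) / pd.Timedelta(period))
    df = df.ffill(limit=fill_limit).dropna()
    # Shift the network input so that it has zero mean.
    df["mains"] = df["mains"] - df["mains"].mean()
    return df
```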

4.5 Adopting deep neural networks on NILM

In this approach, each algorithm is trained with the data from two households and tested on the other two households. Since the current dataset contains four households, each algorithm is trained on two houses and tested on the other two, thus showing the generalization capability of the proposed approach. The input to the networks is a time series corresponding to the power consumption of the whole house, from which the networks infer the actual power consumption of the dishwasher.

The consumption signal is divided into windows, and these windows are passed to all the neural networks as inputs. From these input windows, the neural network extracts the data consumed by the dishwasher, and accordingly, disaggregation is performed. All the output windows are combined to form the inferred consumption over the entire signal. The system operation flow is shown in figure 4.2.

Figure 4.1 Sliding Window Approach – The windows act as input to all the networks. At each moment in time, the window strides one sample to the right and disaggregation continues.
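A small NumPy sketch of the window extraction is given below; the function name and the stride-one default are our own illustration of the approach in figure 4.1, not code taken from the implementation.

```python
import numpy as np

def sliding_windows(mains, window_length=500, stride=1):
    """Cut the mains power series into (overlapping) windows used as network inputs."""
    n_windows = (len(mains) - window_length) // stride + 1
    starts = np.arange(n_windows) * stride
    windows = np.stack([mains[s:s + window_length] for s in starts])
    return windows, starts

# Illustrative example on a stand-in mains series of 10,000 samples
mains = np.random.rand(10_000)
windows, starts = sliding_windows(mains)
print(windows.shape)  # (9501, 500): one window per time step, stride one
```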


Figure 4.2 System Operation Flow

Window length selection

The size of the input fed to the neural network is known as the window length, i.e., the number of samples fed to the neural networks. In our case, the samples are collected every 8 seconds. For a dishwasher, a very large window length can hurt the disaggregation, but at the same time the window needs to capture complete activations, so the window length cannot be too small either. The logic behind choosing a suitable window length is illustrated in figure 4.3: the window length must be longer than the maximum dishwasher activation, but it should not be too large, because that can affect learning. Based on this reasoning, the window length selected for the dishwasher is shown in table 4.2.

Figure 4.3 Window length impact on disaggregation: (a) a large window length enables the network to capture all the state changes; (b) after striding samples to the right, the large window length is still capable of capturing all changes; (c) a smaller window length is not capable of capturing the state change.

Appliance     Window length
Dishwasher    500

Table 4.2 Window length of the dishwasher


Several architectures were evaluated while performing this work; the ones shown below performed best in our experiments. In the convolutional neural networks, batch normalization was applied after every convolutional layer. At the end of all fully connected layers, a dropout layer was added with a dropout rate of 0.5. In this work, the recurrent neural networks are bidirectional [45] because this tends to improve the inference step.

CNN

The CNN architecture used in our study can be seen in table 4.3. The architecture consists of convolutional layers with an increasing number of filters. The total power consumption of the mains is divided into windows of 500 samples each, as shown in table 4.2, and fed as input to the network, as shown in figure 4.2. Initially, eight filters of size five and then eight filters of size three are used to form a 2D feature map. Max pooling is applied after each pair of convolutional layers to reduce the number of parameters that need to be processed and to provide the network with translation-invariance capability; the stride is set to 2. The same process continues with 16, 32, and 64 filters of size three, and the result is transformed by the CNN over the time dimension. The last layer of the network is a fully connected layer that predicts the disaggregated windows frame by frame. The CNN used in our thesis is trained end to end with a loss function between the disaggregated windows and the dishwasher signal.

Layer Type         Size
Convolutional      8 filters of size 5
Convolutional      8 filters of size 3
Max Pooling        Pool size 4, stride 2
Convolutional      16 filters of size 3
Convolutional      16 filters of size 3
Max Pooling        Pool size 4, stride 2
Convolutional      32 filters of size 3
Convolutional      32 filters of size 3
Max Pooling        Pool size 4, stride 2
Convolutional      64 filters of size 3
Convolutional      64 filters of size 3
Max Pooling        Pool size 4, stride 2
Fully Connected    Dense layer

Table 4.3 CNN Architecture
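Assuming the models are built with Keras/TensorFlow (the thesis only states that a Python environment and a GPU were used), the CNN in table 4.3 could be sketched roughly as follows. The padding, activations, loss, and the placement of batch normalization and dropout follow the description above, but the exact hyperparameters are illustrative, not a verbatim reproduction of the trained model.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_cnn(window_length=500):
    """Rough sketch of the CNN in table 4.3 for dishwasher disaggregation."""
    inputs = keras.Input(shape=(window_length, 1))   # one window of mains power
    x = inputs
    for filters in (8, 16, 32, 64):
        first_kernel = 5 if filters == 8 else 3      # only the very first layer uses size-5 filters
        x = layers.Conv1D(filters, first_kernel, padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Conv1D(filters, 3, padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)
        x = layers.MaxPooling1D(pool_size=4, strides=2, padding="same")(x)
    x = layers.Flatten()(x)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(window_length)(x)         # disaggregated window, frame by frame
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")
    return model
```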


RCNN

The RCNN architecture used in our study can be seen in table 4.4. It consists of several recurrent convolutional layers (RCLs) with a number of filters. The total power consumption of the mains is divided into windows of 500 samples each, as shown in table 4.2, and fed as input to the network, as shown in figure 4.2. Initially, eight filters of size five and later 32 filters of size three are used to form a 2D feature map. Max pooling is applied after each recurrent convolutional layer in the same way as in the CNN, and the stride in the RCNN is also 2. The recurrent convolutional layers are implemented in a feed-forward manner, adding the necessary skip connections and tying the weights of the previous layers. The last layer is a fully connected layer that predicts the disaggregated windows frame by frame. The RCNN used in our thesis is trained in the same way as the CNN, with a loss function to disaggregate the windows.

Layer Type         Size
Convolutional      8 filters of size 5
RCL                32 filters of size 3, 3 iterations
Max Pooling        Pool size 4, stride 2
RCL                32 filters of size 3, 3 iterations
Max Pooling        Pool size 4, stride 2
RCL                32 filters of size 3, 3 iterations
Max Pooling        Pool size 4, stride 2
Fully Connected    Dense layer

Table 4.4 RCNN Architecture

LSTM

The LSTM architecture used in our study can be seen in table 4.5. The total power consumption of the mains is divided into windows of 500 samples each, as shown in table 4.2, and fed as input to the network, as shown in figure 4.2. Initially, eight filters of size five are convolved with the input to form a feature map. The 2D feature map is transformed by bidirectional LSTM layers of 50 and 150 units over the time dimension. The last layer is a fully connected layer that predicts the disaggregated windows frame by frame. The additional dropout in the final layer of the network worked well because it helped improve the generalization error. The LSTM used in our thesis is trained with a loss function to disaggregate the windows, like the other neural networks used in this study.

Layer Type           Size
Convolutional        8 filters of size 5
Bidirectional LSTM   50 units
Bidirectional LSTM   150 units
Fully Connected      Dense layer

Table 4.5 LSTM Architecture

SRN

The SRN architecture used in our study can be seen in table 4.6. The total power consumption of the mains is divided into windows of 500 samples each, as shown in table 4.2, and fed as input to the network, as shown in figure 4.2. In a similar way to the LSTM, eight filters of size five are convolved with the input to form a feature map. The 2D feature map is transformed by the SRN over the time dimension. Since it is a simple recurrent network, it suffers from the vanishing gradient problem, which was discussed in section 3.3. The last layer is a fully connected layer that predicts the disaggregated windows frame by frame. The SRN used in our thesis is trained with a loss function to disaggregate the windows, like the other neural networks used in this study.

Layer Type                Size
Convolutional             8 filters of size 5
Simple recurrent network  50 units
Simple recurrent network  64 units
Fully Connected           Dense layer

Table 4.6 SRN Architecture


GRU

The GRU architecture used in our study can be seen in table 4.7. The total power consumption of the mains is divided into windows of length 500 each, as shown in table 4.2, and fed as input to the network, as shown in figure 4.2. Initially, eight filters of size five are convolved with the input to form a feature map. The 2D feature map is then transformed over the time dimension by bidirectional GRU layers of 64 and 128 units, respectively. The last layer is a fully connected layer that predicts the disaggregated window frame by frame. An additional dropout layer before the final layer worked well, as it reduced the generalization error. The GRU used in our thesis is trained with the same loss function as the other neural networks in this study.

Layer Type         Size
Convolutional      8 filters of size 5
Bidirectional GRU  64 units
Bidirectional GRU  128 units
Fully Connected    Dense layer

Table 4.7 GRU Architecture

Sliding window ensembling during test time

Since the neural networks infer over small windows of data, their outputs must be combined to reconstruct the entire disaggregated signal. If the windows do not overlap, we only need to make sure that the whole signal is covered; no further processing is required, because every time instant belongs to exactly one window and the prediction for each window is the prediction for that part of the signal.

When the windows overlap (which is always the case in our work), the output windows must be combined at each instant of time. To achieve this, the inferred consumption is averaged over all windows that cover a given time instant. This averaging reduces the errors and improves the network performance considerably, since large mistakes made by an individual window are cancelled out by the correct windows; this is the basic principle behind ensemble methods. In our study, the windows have a stride of one, which maximizes the number of windows overlapping at any given time instant, as shown in figure 4.1, and hence improves the overall reliability of the system.
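A minimal sketch of this averaging step, assuming NumPy and windows generated with a fixed stride, is shown below; the function name and its arguments are illustrative.

```python
# Minimal sketch of sliding-window ensembling: average all window predictions that
# cover a given time instant (stride 1 in our study).
import numpy as np

def ensemble_windows(window_preds, signal_length, stride=1):
    """window_preds has shape (num_windows, window_length); window i starts at i * stride."""
    window_length = window_preds.shape[1]
    total = np.zeros(signal_length)
    counts = np.zeros(signal_length)
    for i, pred in enumerate(window_preds):
        start = i * stride
        total[start:start + window_length] += pred
        counts[start:start + window_length] += 1
    counts[counts == 0] = 1           # guard against uncovered samples
    return total / counts             # mean prediction at every time instant
```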

4.6 Metrics

Many different metrics are used to evaluate NILM algorithms, which makes it difficult to compare algorithms designed for different load-monitoring purposes. Early work treated appliances as having only two states (on/off), and the metric used was the percentage of correct classifications of the load associated with a change in the total power consumption. Many different metrics are in use today.

Some variables are defined before introducing the metrics:

• TP – Total number of true positives – counted when the appliance is inferred to be ON and the ground truth is ON.

• FP – Total number of false positives – counted when the appliance is inferred to be ON and the ground truth is OFF.


• TN – Total number of true negatives – counted when the appliance is inferred to be OFF and the ground truth is OFF.

• FN – Total number of false negatives – counted when the appliance is inferred to be OFF and the ground truth is ON.

• P – Total positives in the ground truth

• N – Total negatives in the ground truth

The variables defined above count a time instant as positive when the appliance consumption is higher than a certain threshold, y(t) > 10 W, and as negative when it is less than or equal to this limit, where y(t) is the power consumption of the dishwasher at time t and 10 W is the threshold. The threshold is chosen manually for the dishwasher: it indicates that the dishwasher is considered to be ON, without taking its standby consumption into account, whenever its power draw exceeds 10 W.
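A minimal sketch of this thresholding and of the resulting confusion counts, assuming NumPy arrays of predicted and measured dishwasher power, is shown below; the function and variable names are illustrative.

```python
# Minimal sketch of the on/off decision and the confusion counts with the 10 W threshold.
import numpy as np

THRESHOLD = 10.0  # watts

def confusion_counts(y_pred, y_true, threshold=THRESHOLD):
    """y_pred and y_true are power series in watts; ON means consumption above the threshold."""
    pred_on = np.asarray(y_pred) > threshold
    true_on = np.asarray(y_true) > threshold
    tp = int(np.sum(pred_on & true_on))    # inferred ON,  ground truth ON
    fp = int(np.sum(pred_on & ~true_on))   # inferred ON,  ground truth OFF
    tn = int(np.sum(~pred_on & ~true_on))  # inferred OFF, ground truth OFF
    fn = int(np.sum(~pred_on & true_on))   # inferred OFF, ground truth ON
    return tp, fp, tn, fn
```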

Each metric captures a different aspect of an algorithm's behaviour, and no single metric alone can decide whether an algorithm is good or not. Hence, it is reasonable to use multiple metrics when evaluating an algorithm. Some of the most commonly used metrics are described below.

The F1 score, the root mean square error (RMSE), and various other metrics are used to compare model accuracy. Lower values are better for the RMSE (indicating higher accuracy), whereas a higher F1 score indicates a better algorithm. The advantage of the F1 score is that it provides a single measure of quality that is easier for end-users to understand, since it combines precision and recall [27]. The RMSE is suited to settings where large errors are particularly undesirable, and MAE and RMSE can be used together to diagnose the variation of the errors in a set of forecasts [26]. The three metrics are computed as follows.

4.6.1 Root Mean Square Error (RMSE)

The RMSE is a quadratic scoring rule which measures the average magnitude of the error [26].

$RMSE = \sqrt{\frac{1}{N}\sum_{t=1}^{N}\left(\hat{y}_t - y_t\right)^2}$   (4.6)

$\hat{y}_t$ represents the predicted value (the disaggregated dishwasher signal) and $y_t$ represents the actual value (the measured consumption of the dishwasher); $t$ denotes the time instant and $N$ the number of samples.
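A minimal sketch of equation 4.6, assuming NumPy arrays for the predicted and measured signals, could look as follows.

```python
# Minimal sketch of equation 4.6.
import numpy as np

def rmse(y_pred, y_true):
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    return np.sqrt(np.mean((y_pred - y_true) ** 2))
```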

4.6.2 F1 Score

The F1 score combines precision and recall, and it indicates the test's accuracy.

4.6.2.1 Precision:

In the case of energy disaggregation, precision tells us how much of the energy that the model attributes to the dishwasher truly belongs to the dishwasher.

$Precision = \frac{TP}{TP + FP} \times 100$   (4.7)

4.6.2.2 Recall:

In the case of energy disaggregation, recall tells us how much of the energy actually consumed by the dishwasher is correctly classified as dishwasher consumption.


$Recall = \frac{TP}{TP + FN} \times 100$   (4.8)

The harmonic mean of precision and recall gives the F1 score:

$F_1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$   (4.9)
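A minimal sketch of equations 4.7–4.9, assuming the confusion counts defined at the start of section 4.6 and ignoring the degenerate case of zero denominators, could look as follows.

```python
# Minimal sketch of equations 4.7-4.9, using the confusion counts from section 4.6.
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) * 100                    # equation 4.7, in percent
    recall = tp / (tp + fn) * 100                       # equation 4.8, in percent
    f1 = 2 * precision * recall / (precision + recall)  # equation 4.9, harmonic mean
    return precision, recall, f1
```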

4.7 Training time

Training time is the time taken by an algorithm to train on the data from houses 1 and 2. Training time is measured in minutes.
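A minimal sketch of how such a measurement can be taken, assuming a Keras-style model.fit training call, is shown below; the call itself is a placeholder.

```python
# Minimal sketch of measuring training time in minutes around a (hypothetical) fit call.
import time

start = time.perf_counter()
# model.fit(mains_windows, dishwasher_windows, epochs=..., batch_size=...)  # training step
training_minutes = (time.perf_counter() - start) / 60.0
print(f"Training time: {training_minutes:.1f} minutes")
```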

4.8 Statistical Tests

Statistical tests are performed to validate the performance of the algorithms. The results of the algorithms are presented as graphs. Although these graphs can be compared by eye, such manual comparison can lead to wrong results and conclusions. To avoid this kind of misinterpretation, we perform statistical tests, which let us clearly identify the differences between the deep neural network algorithms and determine the best one among them [50]. The Friedman and Nemenyi tests are performed on the deep neural networks presented in the results section in order to identify the best deep neural network.

Friedman Test: The Friedman test is a statistical method for comparing the performance of several algorithms. The test assumes that the same dataset has been used for all the deep neural network algorithms, preferably with the same splits for training and testing. The best-performing algorithm is ranked 1, the next best 2, and so on. If algorithms tie, they are assigned the average of the tied ranks.

Let $r_i^j$ be the rank of the $j$-th of $k$ algorithms on the $i$-th of $N$ data sets. The Friedman test compares the average ranks of the algorithms, $\bar{R}_j = \frac{1}{N}\sum_{i} r_i^j$ [50].

The null hypothesis states that all the algorithms perform equally well. The simplified form of the Friedman statistic is given by the formula

$\chi_F^2 = \frac{12}{N\,k\,(k+1)} \sum_{j=1}^{k} R_j^2 - 3\,N\,(k+1)$   (4.10)

$k$ – the number of algorithms compared in our scenario

$R_j$ – the sum of the ranks assigned to algorithm $j$ over all samples

$N$ – the number of cross-validation samples in our case

The critical value is taken at alpha = 0.05. If the Friedman statistic is less than the critical value, the null hypothesis holds; if it is greater than the critical value, the null hypothesis is rejected, meaning that the algorithms differ from each other. Once the null hypothesis is rejected, we can proceed with the Nemenyi test to determine which neural networks differ significantly from each other.
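A minimal sketch of the Friedman test, assuming SciPy and hypothetical cross-validation scores, is shown below; scipy.stats.friedmanchisquare computes the statistic and its p-value directly.

```python
# Minimal sketch of the Friedman test on hypothetical cross-validation scores
# (rows = cross-validation samples, columns = algorithms).
import numpy as np
from scipy import stats

scores = np.array([[0.81, 0.74, 0.78, 0.70, 0.76],   # hypothetical values
                   [0.79, 0.71, 0.80, 0.69, 0.74],
                   [0.83, 0.73, 0.77, 0.68, 0.75]])

statistic, p_value = stats.friedmanchisquare(*scores.T)
# Reject the null hypothesis at alpha = 0.05 when p_value < 0.05.
print(statistic, p_value)
```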

Nemenyi Test: The Nemenyi test is used when all classifiers are compared with each other [50]. It is performed after the null hypothesis of the Friedman test has been rejected. Two classifiers are considered significantly different when their average ranks differ by at least the critical difference [50].

The formula for the critical difference is

$CD = q_{\alpha}\sqrt{\frac{k(k+1)}{6N}}$   (4.11)

where $q_{\alpha}$ is the critical value based on the Studentized range statistic for the chosen significance level [50].
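A minimal sketch of equation 4.11, with assumed example values for $k$, $N$, and $q_{\alpha}$, could look as follows.

```python
# Minimal sketch of the Nemenyi critical difference (equation 4.11).
import math

def critical_difference(k, n, q_alpha):
    """k algorithms, n samples; q_alpha is read from a Nemenyi table for the chosen alpha [50]."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))

# Example with assumed values (k = 5 algorithms, n = 10 samples, q_0.05 about 2.728 for k = 5):
cd = critical_difference(k=5, n=10, q_alpha=2.728)
# Two algorithms differ significantly if their average ranks differ by at least cd.
```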
