A Disaggregation Model for Studying Behaviours in Power Consumption

UPTEC ES 17 040

Degree project (Examensarbete), 30 credits

October 2017

Faculty of Science and Technology, UTH Division
Visiting address: Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0
Postal address: Box 536, 751 21 Uppsala
Telephone: 018 – 471 30 03
Fax: 018 – 471 30 00
Web page: http://www.teknat.uu.se/student

Abstract

A Disaggregation Model for Studying Behaviours in Power Consumption

Ellika Wik

A feature of the Smart Grid is the utilization of flexible load in the power system. The presence of flexible load allows part of the power consumption to be shifted from peak hours to off-peak hours; this change in power consumption is called a load shift. If the usage patterns of appliances are identified, it is possible to estimate the capacity of a potential load shift as well as to evaluate whether the utilization of flexible load in the power system results in a load shift. This master thesis project aims to create a model which works as an aid when studying usage patterns by identifying when appliances that contribute to the load shift are active. The model should be able to give approximations of the switch-on and switch-off times of the appliances using only information from a single meter that measures the total power consumption of the entire household.

Recently, artificial neural networks have been successfully applied to these kinds of problems. The constructed model thus includes neural networks which regress the start time and end time of a target appliance. The networks are trained and evaluated both on simulated data and on real measured data from the Stockholm Royal Seaport project. The model is able to give highly accurate estimates of the start and stop times when trained with simulated data. When using real data, the accuracy of the model is relatively low; in order to increase the performance, the neural network part of the model has to be trained on a larger dataset.

A study of how the sampling time of the input affects the performance of the model is also carried out. The results show no evidence that the sampling time affects the accuracy of the model. However, the architectures of the neural networks trained to recognize data with different sampling frequencies are not identical; if the pooling layers of all networks were removed, it might be possible to establish a connection between sampling time and performance.

ISSN: 1650-8300, UPTEC ES 17 040
Examiner: Petra Jönsson
Subject reader: Thomas Schön


Popular science summary (Populärvetenskaplig sammanfattning)

In connection with the construction of the new district Norra Djurgårdsstaden (the Stockholm Royal Seaport), an investment in Smart Grids has been made. Part of the initiative consists of smart washing machines and tumble dryers that have been installed in most of the new apartments. The purpose is to move part of the customers' load on the power grid from times of day when the demand for electricity is high to times when household electricity consumption is lower. The hours when electricity consumption peaks often coincide across households, which means that the consumption of the power system as a whole can contain peaks during which the grid is heavily loaded. The smart appliances allow the user to set a start time at which the appliance starts automatically, or to start the machine through an app on a phone or tablet. Since the price of electricity follows demand, customers can save money by moving part of their electricity consumption to times when demand is lower, which would reduce the power peaks in the system. Reducing power peaks gives a more efficient grid, which benefits both electricity consumers and the environment.

In this master thesis project a model is created that serves as an aid for evaluating whether the implementation of smart grids leads to a change in the customers' electricity consumption patterns. The model should be able to identify when an appliance is running by studying electricity consumption data from a single meter that measures the total consumption of a household. Such a model is needed because when electricity consumption is measured in a household, usually only the total consumption is recorded. In addition, both time and money can be saved by letting an algorithm identify when an appliance is active instead of installing separate meters at every smart appliance in the household.


… the increase in the household's total power consumption to be the combined electricity consumption of the appliances.

Instead of manually setting up conditions for registering activations of an appliance, it is today possible, with the help of machine learning, to use an algorithm that itself learns a set of rules for identifying different appliances. Artificial neural networks are a branch of machine learning that has previously been used to solve similar problems. Neural networks learn by studying examples; if a neural network is to learn to identify handwritten digits, it first needs to see images of a large number of digits together with information about which digit each image shows. The neural network trained in this project is provided with examples of what the total power consumption looks like when the appliance is running, together with information about the times between which the appliance was running. From this, the network learns to identify the start and stop times of an activation of the appliance in question. The appliances studied in this project are the dishwasher, the tumble dryer, and the washing machine.

To examine how well the developed model works, it is trained and tested both with simulated power consumption data and with real data measured in apartments belonging to Norra Djurgårdsstaden. The simulated data contains some simplifications; for example, the energy signatures of the dishwasher, washing machine, and tumble dryer look the same at every activation of the appliance.


Executive summary

The purpose of the project was to create a model which performs nonintrusive load disaggregation using artificial neural networks. When studying whether installing a smart appliance, for example a smart washing machine, will affect when the end consumers are using the appliance, it is necessary to be able to determine when that appliance is running. This information can be obtained through a load disaggregation. The constructed model is trained to give approximations of the switch-on and switch-off times of a given appliance, using only information from a single meter that measures the total active power consumption of the entire household.

Artificial neural networks learn by studying examples. Different versions of the model are created by training the neural networks in the model on different datasets containing examples of the power consumption pattern of the appliance of interest. When creating the examples, both artificial data generated by a stochastic load model and data based on measurements from apartments in the Stockholm Royal Seaport are used.

The results from when the model is trained using the artificially generated dataset show that it is possible to use neural networks to perform an energy disaggregation, provided two things: firstly, that the power consumption pattern of the appliance looks roughly the same every time it is activated, and secondly, that the model is shown a sufficiently large number of examples during training. When using the dataset based on measurements from actual apartments to train the model, the accuracy is relatively low. The reason for this is believed to be that the power consumption pattern varied a lot between activations, and that the dataset used for training the neural networks in the model consisted of too few examples.


Acknowledgements

This report presents the results from my master thesis project for the master programme in energy systems engineering at Uppsala University and the Swedish University of Agricultural Sciences. The project was carried out at KTH Royal Institute of Technology and Ellevio in Stockholm during the spring and summer of 2017.

First, I would like to thank my supervisors Meng Song and Olle Hansson for all the help you have provided during the project and for patiently answering all my questions. A special thanks to Meng Song for taking the time to discuss all the problems I ran into during the project; your advice has been essential and I could not have finished this project without you.

I would also like to thank Johan Aspenberg from Ericsson and Hans Nottehed from Tingcore for providing important data to this project. I want to thank my subject reader Thomas Schön for your advice on the project and Carl Andersson for answering my questions about TensorFlow.

Finally, a big thank you to my parents for putting up with all my talk about neural networks and energy disaggregation. You have been a fantastic sounding board for all my thoughts and ideas.


Table of contents

List of abbreviations
1. Introduction
1.1 Stockholm Royal Seaport Urban Development Project
1.2 Objective
1.3 Load Disaggregation
2. Deep learning
2.1 Machine Learning
2.2 Artificial Neural Networks
2.2.1 The forwards pass
2.2.2 The backwards pass
2.2.3 Stochastic gradient descent
2.2.4 Weight initialization
2.2.5 Adam optimization
2.3 Overfitting
2.3.1 Regularization
2.3.2 L2 regularization
2.3.3 Dropout
2.4 Convolutional neural networks
2.4.1 Convolutional layer
2.4.2 Pooling layer
2.4.3 Fully connected layer
3. Method
3.1 Creating the training dataset
3.2 Neural network architecture
3.3 Training the neural network
3.4 Applying the neural network to input data
3.5 Evaluating the NILM model
3.5.1 Accuracy
4. Results
4.1 Data
4.1.1 Simulated data
4.1.2 Data from the SRS project
4.2 The model trained on simulated data
4.2.1 The results when training with a small dataset
4.2.2 The results when training with different sampling frequencies
4.3 The model trained on data from the SRS project
5. Discussion
5.1 General discussion
5.2 How the sampling frequency affects the result
5.3 Other possible approaches to energy disaggregation
6. Conclusions
6.1 General conclusions
6.2 Future work


List of abbreviations

ANN Artificial Neural Network

CNN Convolutional Neural Network

CPU Central Processing Unit

HMM Hidden Markov Model

MIT Massachusetts Institute of Technology

NILM Nonintrusive Load Monitoring

ReLU Rectified Linear Unit (activation function)


1. Introduction

Using energy efficiently is a crucial part of limiting the world's ever-increasing energy consumption. Energy is too often used inefficiently, but by evaluating our energy usage we can identify problem areas and invent smart energy solutions for conserving energy and thus the resources of planet Earth. One way to decrease energy consumption is simply to change our consumption patterns. We all tend to use electricity at the same time; the hours when the energy consumption is especially high are called peak hours. During these peak hours the electricity production has to increase in order to cover the demand. Because of this, additional, less efficient power plants must sometimes be run to cover the unusually high electricity demand (Nylén 2011; U.S. department of energy n.d.). If part of the power consumption could somehow be shifted from peak hours to off-peak hours, these energy losses would decrease and the electrical grid would become more efficient.

The Smart Grid is a new type of electricity transmission and distribution system which aims to remodel the power grid of today in order to make it more efficient. It consists of a collection of new technologies, features and regulations (Bollen 2010). A feature of the Smart Grid is the utilization of flexible loads: loads that can react to external parameters and adapt their consumption behaviour accordingly (Nylén 2011). An example of a flexible load is a smart washing machine which can be programmed to run when the electricity price is low; since the price of electricity follows the demand, this means moving a load to off-peak hours.

Renewable energy sources such as wind power and solar power are intermittent energy sources. This means that their output is not continuously available (Vattenfall 2015, 2017) and does not necessarily coincide with peak hours. The introduction of the Smart Grid concept makes it possible for the end users to adapt their consumption behaviour so that they consume more power when the renewable energy sources produce energy (Nylén 2011). By changing when they consume energy, the end users can help reduce carbon dioxide emissions.

To evaluate the effects of the utilization of flexible loads, the change in the usage patterns of household appliances is studied. Identifying how different appliances in a household could contribute to a potential load shift is of interest when estimating the potential of flexible load in the residential sector. Examining whether the implementation of the Smart Grid concept results in customers changing their consumption behaviours, thus reducing their environmental impact, is also of interest for the evaluation.


Nonintrusive load disaggregation relies on continuous measurements of the total power consumption over time for a household in order to find the time interval during which a given appliance present in the household is consuming power. In this master thesis project a nonintrusive load disaggregation model capable of identifying appliances that constitute the flexible load will be developed and evaluated.

The project is a collaboration between KTH and Ellevio. It concerns the Smart Grid initiative which is a part of the Stockholm Royal Seaport (SRS) urban development project. The resulting model is used to perform an energy disaggregation on energy consumption data from households in the Stockholm Royal Seaport which all have smart washers and dryers installed.

1.1 Stockholm Royal Seaport Urban Development Project

The Stockholm Royal Seaport is an urban development project with the purpose of creating a new district in Stockholm. Planning started in the early 2000s and the district is to be fully developed around 2030. The district will comprise both residential areas and new office buildings; the aim is to build 12,000 new homes and 35,000 workplaces (Stockholms stad n.d.).

Sustainability will be a key feature of the new district. The goal is for the SRS to become an international model for sustainable urban planning. A part of the sustainability initiative is a pilot project for Smart Grids in urban environment (Energimyndigheten 2015). The Urban Smart Grid Program in the SRS is a joint initiative by Ellevio, ABB, Ericsson, Electrolux, KTH, Swedish Energy Agency and other partners.

As a part of the Urban Smart Grid Program, the effects of implementing a Smart Grid in an urban area are evaluated. Smart washing machines and dishwashers have been installed in about 154 apartments in the area. Dynamic electricity and environmental signals will also be provided to the apartments through an in-home display. This is expected to result in residents reducing their energy consumption during peak hours.

1.2 Objective


How the performance of the disaggregation model is affected by the sampling time by which the household’s electricity consumption is measured will also be examined. The sampling time is the time interval between the measurement points in the data.
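To make the notion of sampling time concrete, a measured series can be brought to a longer sampling time by block averaging; the snippet below is a sketch, and the power values and averaging factor are made up for the example:

```python
import numpy as np

# Hypothetical total-power series sampled at 1 Hz (values in W, illustrative only).
power_1hz = np.array([120.0, 118.0, 2120.0, 2115.0, 2118.0, 125.0] * 10)

def downsample(series, factor):
    """Increase the sampling time by an integer factor via block averaging."""
    n = len(series) // factor * factor          # drop any trailing remainder
    return series[:n].reshape(-1, factor).mean(axis=1)

power_10s = downsample(power_1hz, 10)           # sampling time of 10 s
print(len(power_1hz), "->", len(power_10s))
```

Averaging (rather than simply discarding samples) keeps the energy content of the series roughly consistent across sampling times.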

Figure 1. The total power consumption of a household shown together with the simultaneous power consumption of the washing machine and dryer in the household.

Nonintrusive load disaggregation relies on locating so-called "signature features" of the appliances present in the total load. These signature features present as patterns in the data. In Figure 1, an example of how active power consumption can be used as the signature feature is shown. The power consumption pattern of the dishwasher can clearly be seen in the total power consumption of the household.

An artificial neural network is a machine learning model loosely based on the human brain (Bishop 2010, p. 226). ANNs can be trained to perform a number of tasks by learning to recognize patterns. In this project an artificial neural network is taught to recognize the patterns of specific appliances in the aggregated load.

To summarize, the project aims to answer the following questions:

• Is it possible to use a deep neural network to perform an energy disaggregation which identifies the time during which a specific appliance is running?
• What accuracy can be achieved using the model developed during the project?
• How well does the developed model work for the SRS data?
• How does the sampling time of the data affect the accuracy of the energy disaggregation?


1.3 Load Disaggregation

There are two primary techniques for measuring disaggregated energy usage down to the individual appliance: distributed direct sensing and single-point sensing (Froehlich et al. 2011). Distributed direct sensing requires the installation of one sensor at each device or appliance. The sensor only measures the power consumed by the appliance it is connected to, which results in highly accurate measurements of the disaggregated energy usage. However, installing one sensor for every device in a household can be expensive, time-consuming, and complicated (Froehlich et al. 2011).

Single-point sensing is both cheaper and easier to install since it only requires one sensor (Froehlich et al. 2011). The sensor measures the total energy consumption of the household. Information collected by the sensor is processed by a computer in order to identify the power consumption and running time of individual appliances in the household (Zeifman & Roth 2011).

Nonintrusive load monitoring (NILM) relies on single-point sensing. The information collected by the sensor is analysed using some form of pattern recognition algorithm. The algorithm locates and identifies the signature features of the individual loads present in the total energy consumption in order to produce information about which appliances are switched on at a given point in time.

The first NILM method was developed at MIT in the 1980s (Hart 1992). It uses step changes in real and reactive power as its signature features; a step change in power is assumed to indicate an appliance being switched on or off. For example, if a refrigerator which consumes 250 W and 200 VAR when running is present in the household, a step increase of this size indicates that the refrigerator was switched on. Step changes in the power demand profile are a commonly used signature feature for NILM. Powers, Margossian and Smith (1991), Farinaccio and Zmeureanu (1999) and Marceau and Zmeureanu (2000) all propose methods for energy disaggregation where step changes in the real power are used as the signature feature.

When step changes in the power demand profile are used as the signature feature, low-frequency hardware is usually sufficient for measuring the aggregated load data. Low-frequency hardware is hardware that can record the power and voltage at a frequency of 1 Hz or below. Since the fundamental period of the voltage is either 1/50 s or 1/60 s, depending on which country's power grid is being studied, NILM algorithms can only use relatively "macroscopic" features with this kind of hardware. With high-frequency hardware, it is possible to increase the sampling rate in order to capture "microscopic" features in the data.
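As a sketch of the step-change idea, the snippet below flags large jumps between consecutive samples of an aggregate load. The 100 W threshold and the load values are illustrative assumptions, not parameters taken from any of the cited methods:

```python
import numpy as np

def detect_step_changes(power, threshold=100.0):
    """Return (index, delta) pairs where consecutive samples differ by more
    than `threshold` watts -- a crude stand-in for the step-change signature
    features used by the early NILM methods."""
    deltas = np.diff(power)
    return [(i + 1, float(d)) for i, d in enumerate(deltas) if abs(d) > threshold]

# Illustrative aggregate load: a ~250 W appliance switching on and then off.
load = np.array([80.0, 82.0, 81.0, 331.0, 330.0, 332.0, 81.0, 80.0])
print(detect_step_changes(load))   # an on-event at index 3, an off-event at index 6
```

A real method would additionally match positive and negative steps of similar magnitude to pair switch-on and switch-off events of the same appliance.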


Another microscopic feature that requires high-frequency hardware if it is to be monitored is the voltage noise. Appliances conduct a variety of voltage noise back onto the power wiring of a household, which is measurable using appropriate hardware. An example of where the voltage noise is used as the signature feature can be found in the article written by Froehlich et al. (2011); both the transient voltage noise and the steady state voltage are used as the signature feature in the NILM method presented.

The problem with NILM methods that depend on measurement data with a high sampling frequency is that they require high-frequency hardware. The currently available smart meters usually have a sampling frequency of 1 Hz or below, which is much too slow (Kong et al. 2016). To perform nonintrusive load disaggregation with data from the current smart meters, one of the methods using macroscopic signature features would have to be applied.

The early NILM methods, for example the one developed at MIT (Hart 1992) and the methods presented by Powers, Margossian and Smith (1991), Farinaccio and Zmeureanu (1999) and Marceau and Zmeureanu (2000), are only able to identify large loads which have a clearly identifiable load step when switched on. Appliances with a power consumption that varies continuously cannot be identified, since no step change associated with the appliance being switched on can be recorded. The early methods also have difficulty identifying appliances with multiple phases within their running time.

In the articles written by Kelly and Knottenbelt (2015) and Kong et al. (2016), the authors each suggest a way to overcome these limitations. Both methods use a machine learning approach when performing the pattern recognition. In the model described by Kong et al., the appliances were modelled as hidden Markov models (HMM), whereas Kelly and Knottenbelt (2015) use artificial neural networks for performing the load disaggregation. Both methods rely on measurements from low-frequency hardware. Neural networks were also used when performing the pattern recognition in the models suggested by Srinivasan et al. (2006) and Lin and Tsai (2010), where the data is recorded using high-frequency hardware.


2. Deep learning

2.1 Machine Learning

“Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed”

Arthur Samuel, 1959

In the past decade, machine learning has been receiving an increasing amount of attention from both the research community and industry at large. There are many areas of application for machine learning algorithms today: speech recognition (Microsoft n.d., Google n.d.), image recognition (Szegedy et al. 2015), self-driving cars, and effective web search (Stanford University 2014), to mention only a few. New areas of application are constantly being thought of.

Machine learning uses algorithms that can learn from and make predictions based on data. The algorithms operate by constructing a model from a dataset of training examples (Bishop 2006, p. 2). It is this ability to learn and make decisions based on experience that makes people consider machine learning an approach to artificial intelligence (AI). Using machine learning, computers can build models capable of performing complicated tasks, like recognizing numbers and faces. This may seem trivial to humans, since we do it intuitively, but it is difficult to formally describe how it is done and even more difficult to describe it to a computer.

With machine learning the computer is taught to understand the world in terms of a hierarchy of concepts, with each concept defined in terms of its relation to simpler ones (Goodfellow, Bengio & Courville 2016, p. 1). Goodfellow, Bengio and Courville (2016, p. 1-2) explain this relation between simpler and more complex concepts by asking the reader to imagine it as a graph with many layers built on top of each other. This approach to machine learning, where the algorithm makes use of layers of concepts, is called deep learning.

In actuality, machine learning is an umbrella term for a number of different algorithms, one of them being artificial neural networks (ANN). The ANN algorithm utilizes a network structure consisting of a large number of interconnected simple units called artificial neurons, which are loosely based on the neurons in the brain, hence the name neural networks (Goodfellow, Bengio & Courville 2016, p. 169, MIT 2016).

2.2 Artificial Neural Networks


approximation of the mapping. The training of the model is performed by studying a number of training examples consisting of both the values of the input 𝒙 and output 𝒚 to the model.

As previously stated, the network consists of artificial neurons. A neuron can be represented as shown in Figure 2; it receives multiple inputs x_1, x_2, ..., x_m and produces a single output:

z(x) = b + \sum_{i=1}^{m} w_i x_i        (1)

Figure 2. An artificial neuron.

In Figure 2 the lines represent synapses. Each synapse has a weight w. A bias variable b is also associated with each neuron. During training of the ANN the weights and biases are adjusted in order to correctly model the output; they correspond to the parameters earlier denoted as θ. Neural networks use a learning algorithm for training the network which can automatically tune the weights and biases.

The variable z(x) is known as the pre-activation. The pre-activation is transformed using a nonlinear activation function to produce the output of the neuron. The outputs of the artificial neurons are called activations and denoted a(x).

a(x) = g(z(x))        (2)

Activation functions allow artificial neural networks to model complex non-linear relationships. The traditional choice of activation function is a sigmoidal function, such as the logistic sigmoid:

g(z(x)) = \frac{1}{1 + \exp(-z(x))}        (3)

However, recent research on image recognition has found it beneficial to use the so-called rectified linear activation function, Equation 4, when training neural networks (Glorot & Bengio 2010, Krizhevsky et al. 2012).
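The pre-activation of Equation 1 and the two activation functions discussed above can be sketched in a few lines of NumPy; the input and weight values below are arbitrary:

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, Equation 3."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Rectified linear activation: max(0, z)."""
    return np.maximum(0.0, z)

def neuron(x, w, b, g=sigmoid):
    """Pre-activation z = b + w.x (Equation 1) passed through g (Equation 2)."""
    return g(b + np.dot(w, x))

x = np.array([1.0, -2.0])
w = np.array([0.5, 0.25])
print(neuron(x, w, b=0.0))          # pre-activation is 0, so sigmoid gives 0.5
print(relu(np.array([-1.0, 2.0])))  # negative inputs are clipped to zero
```

Note how the sigmoid squashes any pre-activation into (0, 1), which is what allows a single neuron to act as a binary classifier, while the ReLU simply passes positive pre-activations through unchanged.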


A single artificial neuron has the capacity to favour or suppress certain linear regions of its input space. If the activation function used is the logistic sigmoid and a linear combination of the inputs to the neuron results in a large pre-activation, the activation will be near one. On the other hand, if the linear combination of the inputs is a small value, the activation will be near zero. This behaviour means that it is possible to use a single neuron as a binary classifier. Single artificial neurons are also capable of solving linearly separable problems (Freund & Schapire 1999).

By placing the artificial neurons in multiple nonlinear layers, deep nonlinear networks with a great deal of expressive power can be constructed (Kelly & Knottenbelt 2015). An example of a neural network with four layers can be seen in Figure 3; the neurons are represented by the circles. The first layer in the network is called the input layer. As shown in Figure 3, the neurons in that layer have no input; the output of the neurons in the input layer is simply the input fed to the network. The rightmost layer contains the output neurons, whose outputs are the result of the entire network. In between the input and output layers are the so-called hidden layers. A network can have one or many hidden layers; the network in Figure 3, for example, has two hidden layers. Networks with many hidden layers are called deep neural networks.

Figure 3. An example of an artificial neural network with two hidden layers, six input neurons, and one output neuron.


network has one input and one output layer but, as previously stated, the number of hidden layers can vary.

The networks used for this project are all feedforward neural networks. This means that the output from one layer is used as input to the next and so the information only flows in one direction at a time. There are no information loops in a feedforward neural network (Nielsen 2015).

The information can flow either in the forward direction (green arrow above the network in Figure 3) or backwards (red arrow below the network in Figure 3). When information flows in the forward direction the network is calculating the output based on the inputs to the net, this is called the forward pass. The backwards pass is when information flows in the opposite direction; the learning process of the ANN takes place during the backwards pass (Kelly & Knottenbelt 2015).

2.2.1 The forwards pass

During the forward pass information moves forward through the network and each artificial neuron calculates its activation based on its inputs. Moving from one layer to the next during the forward pass can be described as follows:

a_j^{(l)} = g\left( b_j^{(l)} + \sum_k a_k^{(l-1)} w_{jk}^{(l)} \right)        (5)

where a_j^{(l)} is the activation of the j:th neuron of layer l, b_j^{(l)} is the bias of the j:th neuron of layer l, and w_{jk}^{(l)} is the weight of the synapse between neuron j in layer l and neuron k in layer (l − 1) (Bishop 2006, p. 227). Using matrix notation, Equation 5 can be rewritten as:

\mathbf{a}^{(l)} = g\left( \mathbf{a}^{(l-1)} \mathbf{w}^{(l)} + \mathbf{b}^{(l)} \right)        (6)

where \mathbf{a}^{(l)} is a vector containing all activations of layer l, \mathbf{b}^{(l)} is a vector containing the bias of every neuron in layer l, and \mathbf{w}^{(l)} is a matrix containing the weights of the synapses between layer (l − 1) and layer l.
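A minimal sketch of the forward pass in the matrix form of Equation 6; the layer sizes, the use of ReLU in every layer, and the random weights are illustrative assumptions:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward_pass(x, weights, biases, g=relu):
    """Propagate the input through the layers using Equation 6:
    a^(l) = g(a^(l-1) w^(l) + b^(l))."""
    a = x
    for w, b in zip(weights, biases):
        a = g(a @ w + b)
    return a

rng = np.random.default_rng(0)
# A toy 6-4-1 network (cf. the six input neurons and single output of Figure 3).
weights = [rng.normal(0, 0.1, (6, 4)), rng.normal(0, 0.1, (4, 1))]
biases = [np.zeros(4), np.zeros(1)]
print(forward_pass(np.ones(6), weights, biases).shape)  # (1,)
```

In practice the output layer often uses a different activation (or none at all) than the hidden layers; a single g is used here only to keep the sketch short.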

2.2.2 The backwards pass

The learning, which consists of updating the weights and biases, takes place during the backwards pass (Kelly & Knottenbelt 2015). In order for the network to learn, it has to be provided with training data. For example, in image recognition the training data would be a picture together with the correct label. The network is presented with the picture as the input and, using that input, performs a forward pass in order to classify the image. The label the neural network assigned to the picture is then compared to the correct answer, also called the target.


example inputs x_1, x_2, ..., x_n, a cost function is required. The purpose of the cost function is to quantify the performance of the network (Bishop 2006, p. 232-233). A commonly used cost function is the quadratic cost:

C(W, B) \equiv \frac{1}{2n} \sum_{i=1}^{n} \| y(x_i) - \hat{y}(x_i) \|^2        (7)

The first step in training the artificial neural network is initializing the weights and biases. After the parameters have been initialized, a forward pass through the entire network is performed in order to get the network's output for a specific input. The cost function is then used to calculate the error of the output of the neural network relative to the target. The goal of the learning algorithm is to find values of the weights and biases which minimize the cost function. This is achieved using gradient descent. The gradient descent algorithm updates the parameters in such a way that the model iteratively moves towards a set of parameter values that minimize the cost function. Since the cost C(W, B) is a smooth continuous function of W and B, its smallest value will occur when the gradient of the error function vanishes, so that

\nabla C(W, B) = 0        (8)

Thus, by computing the gradient and taking a small step in the direction of −\nabla C(W, B), the error can be reduced (Bishop 2006, p. 236-237). The size of the step is controlled by a hyperparameter called the learning rate, denoted η. The gradient descent update rule is as follows:

w_{jk}^{(l)} \rightarrow w_{jk}^{(l)\prime} = w_{jk}^{(l)} - \eta \frac{\partial C}{\partial w_{jk}^{(l)}}        (9a)

b_j^{(l)} \rightarrow b_j^{(l)\prime} = b_j^{(l)} - \eta \frac{\partial C}{\partial b_j^{(l)}}        (9b)

Error functions typically have a highly nonlinear dependence on the weight and bias parameters. Because of this, there will be many points in the parameter space at which Equation 8 holds. For any local minimum found, there will be other points in the parameter space that constitute equivalent minima, stationary points or inequivalent minima. This means that, when using gradient descent, the learning algorithm may get stuck in a local minimum instead of finding the global minimum. However, for a successful application of neural networks it may not be necessary to find the global minimum, although it might be necessary to compare several local minima in order to find a sufficiently good model (Bishop 2006, p. 237).
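The update rule of Equations 9a and 9b can be illustrated on a one-parameter toy problem; the cost function and learning rate below are chosen only for the example:

```python
import numpy as np

def gradient_descent_step(params, grads, eta=0.1):
    """One update per Equations 9a/9b: each parameter moves a small step
    of size eta in the direction of the negative gradient."""
    return [p - eta * g for p, g in zip(params, grads)]

# Minimise the toy cost C(w) = (w - 3)^2, whose gradient is 2(w - 3).
w = np.array([0.0])
for _ in range(100):
    (w,) = gradient_descent_step([w], [2.0 * (w - 3.0)], eta=0.1)
print(w)   # converges towards the minimum at w = 3
```

Because this toy cost is convex it has a single minimum; the local-minima issue described above only arises for the nonlinear cost surfaces of real networks.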

2.2.3 Stochastic gradient descent


parameter update is highly impractical (Karpathy n.d. a). Instead, the gradient is computed over smaller batches of examples and the parameters are updated for each mini-batch. This approach to gradient descent is called stochastic gradient descent (Stanford University n.d.).

By using stochastic gradient descent the variance in the parameter update can be reduced, which can lead to a more stable convergence towards good parameter values (Stanford University n.d.). However, if the mini-batch is too small the algorithm does not take advantage of good matrix libraries and optimized fast hardware which could speed up the learning process (Nielsen 2015). Choosing an appropriate mini-batch size is thus a compromise between the two.
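As an illustration, one epoch of mini-batch stochastic gradient descent can be sketched as follows. This is a minimal example on a linear model with the quadratic cost; the model, learning rate and batch size are illustrative and not taken from the thesis:

```python
import numpy as np

def sgd_epoch(w, b, X, y, eta=0.01, batch_size=32, rng=None):
    """One epoch of mini-batch SGD on a linear model with quadratic cost.
    The gradient is computed per mini-batch rather than over the full set."""
    if rng is None:
        rng = np.random.default_rng(0)
    idx = rng.permutation(len(X))               # shuffle examples each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        err = Xb @ w + b - yb                   # prediction error on the batch
        w -= eta * Xb.T @ err / len(batch)      # gradient of the quadratic cost
        b -= eta * err.mean()
    return w, b
```

Each pass over the shuffled data performs one cheap parameter update per mini-batch instead of one expensive update per full pass, which is the point of the technique.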

2.2.4 Weight initialization

Before the training of an artificial neural network can begin the parameters are initialized in order to give the stochastic gradient descent algorithm a place to start from. The initialization is important: choosing a good initialization method helps the network avoid getting stuck in a poor local minimum during training and also makes the network converge to good parameter values in a shorter amount of time. The biases can generally be initialized to zero, but the weights need to be initialized carefully in order to break symmetry between hidden units of the same layer (Bengio 2012). A simple approach is to initialize all weights from a zero-mean Gaussian with a small standard deviation (Bengio 2012).
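A minimal sketch of this initialization scheme, with illustrative layer sizes and standard deviation:

```python
import numpy as np

def init_layer(n_in, n_out, std=0.01, rng=None):
    """Zero-mean Gaussian weights break symmetry between units of the
    same layer; biases can start at zero (section 2.2.4)."""
    if rng is None:
        rng = np.random.default_rng()
    W = rng.normal(0.0, std, size=(n_in, n_out))
    b = np.zeros(n_out)
    return W, b
```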

2.2.5 Adam optimization

Adaptive Moment Estimation (Adam) is a method for efficient stochastic optimization used when training artificial neural networks. The method was proposed by Kingma and Lei Ba in the paper Adam: A method for stochastic optimization (2015).

When using the gradient descent algorithm for training the neural network, the update of the parameters θ is performed using the same learning rate η for every update. This, however, might not be ideal. In the beginning of training the weights and biases are often far from a minimum of the error function, which means that a large learning rate is preferable in order to change the parameters quickly. When the parameters are closing in on a minimum a smaller learning rate is preferable in order to make fine-tuned adjustments to the parameters (Nielsen 2015).

The Adam optimization algorithm adapts the learning rate to the parameters. It relies on the computation and update of the exponential moving averages of the gradient (𝒎𝒕) and the squared gradient (𝒗𝒕):

m_t = β₁ · m_{t−1} + (1 − β₁) · g_t    (10a)

v_t = β₂ · v_{t−1} + (1 − β₂) · g_t²    (10b)

where the exponential decay rates are controlled by the hyperparameters β₁, β₂ ∈ [0, 1) and t is the time step. The moving averages are estimates of the 1st moment (the mean) and the 2nd raw moment (the uncentered variance) of the gradient

g_t = ∇_θ f_t(θ_{t−1})    (10c)

The moving averages m_t and v_t are initialized as zero-vectors. Kingma and Lei Ba (2015) note that this leads to moment estimates that are biased towards zero. However, this initialization bias can be counteracted by calculating the bias-corrected estimates m̂_t and v̂_t:

m̂_t = m_t / (1 − β₁ᵗ)    (10d)

v̂_t = v_t / (1 − β₂ᵗ)    (10e)

The calculated values of 𝒎̂𝒕 and 𝒗̂𝒕 are then used to perform the parameter update:

θ_t = θ_{t−1} − η · m̂_t / (√v̂_t + ε)    (10f)

where all operations on vectors are performed element-wise. In Equation 10f, ε is a very small number used to prevent division by zero. Kingma and Lei Ba (2015) have shown that Adam produces better results than other methods when training neural networks.
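For illustration, a single Adam update following Equations 10a-10f might look like this. It is a sketch using the default hyperparameter values suggested by Kingma and Lei Ba, not the project's implementation (the project uses TensorFlow):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eta=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (Equations 10a-10f); all operations element-wise."""
    m = beta1 * m + (1 - beta1) * grad             # (10a) first-moment average
    v = beta2 * v + (1 - beta2) * grad ** 2        # (10b) second-moment average
    m_hat = m / (1 - beta1 ** t)                   # (10d) bias correction
    v_hat = v / (1 - beta2 ** t)                   # (10e)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)  # (10f)
    return theta, m, v
```

Because the step is divided by √v̂_t, parameters whose gradients have been consistently large take smaller effective steps, which is the adaptive behaviour described above.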

2.3 Overfitting

Deep neural networks contain multiple layers, and each layer usually has more than one artificial neuron. Because of this, the number of parameters in the model is quite large. Imagine a neural network with one input, one output, and three hidden layers with five artificial neurons in each hidden layer: the total number of parameters in such a model is 76 (60 weights and 16 biases), and this is still a comparatively small neural network.

When training a model with a large number of parameters, overfitting can become a problem. Overfitting means that the parameters in the model are learning the specific examples in the training set rather than the underlying relationship (Ljung & Glad 1991). When an artificial neural network starts overfitting on a training set it is no longer learning anything useful (Nielsen 2015).


Overfitting can be detected by monitoring the error on a separate validation set during training; training should be stopped when the error on the validation set stops decreasing while the error on the training set is still getting smaller (Nielsen 2015).

Overfitting is a major problem when training neural networks, especially if the training set is small. The network is able to generalize much better if it is provided with a large training set. One of the best ways of preventing networks from overfitting is to increase the size of the training dataset (Nielsen 2015).

2.3.1 Regularization

There are, however, other ways besides increasing the training set to prevent a model from overfitting. Regularization techniques can be used to reduce overfitting even when the network architecture and the size of the training data are fixed (Nielsen 2015). Two of the most common regularization techniques are L2 regularization and dropout.

2.3.2 L2 regularization

The idea of L2 regularization is to add a penalty term, called the regularization term, to the cost function (Nielsen 2015). The regularization term in L2 regularization is the sum of squares of all the weights in the network scaled by a factor 𝜆/2𝑛, where 𝜆 is called the regularization parameter and 𝑛 is the size of the training set evaluated by the cost function (Nielsen 2015). The regularization parameter is another design parameter which is decided upon during training.

When the quadratic cost, Equation 7, is used together with L2 regularization the resulting cost function is written as:

C = 1/(2n) ∑_{i=1}^{n} ‖y(x_i) − ŷ(x_i)‖² + λ/(2n) ∑_{i=1}^{k} w_i²    (11)

where the last summation adds together the squares of all the k weights in the network. L2 regularization heavily penalizes peaky weight vectors and instead makes the network prefer diffuse weight vectors (Karpathy n.d. b). Large weights are only accepted if they considerably improve the first part of the cost function (Nielsen 2015). By adding a penalty term for large weights the regularized network learns to respond to patterns which are seen often across the training set. This prevents it from learning the local noise in the data, thus reducing overfitting (Nielsen 2015).
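As a small illustration, Equation 11 can be computed as follows (illustrative code, not from the thesis):

```python
import numpy as np

def l2_cost(y_true, y_pred, weights, lam):
    """Quadratic cost plus the L2 penalty of Equation 11.
    `weights` is a list of the network's weight arrays; biases are not
    included in the penalty."""
    n = len(y_true)
    quadratic = np.sum((y_true - y_pred) ** 2) / (2 * n)
    penalty = lam / (2 * n) * sum(np.sum(W ** 2) for W in weights)
    return quadratic + penalty
```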

2.3.3 Dropout


backpropagation is performed. The parameters are updated and then the process is repeated, first restoring the neurons that were dropped, then deleting a new subset of neurons (Nielsen 2015). Training neural networks like this prevents units from co-adapting too much.

When the trained artificial network is tested the thinned networks are combined to form an “unthinned” network (Srinivasan, Ng & Liew 2006). One way to think about dropout is that it approximately combines many different neural network architectures, and model combination nearly always improves the performance of neural networks and other machine learning methods (Srinivasan, Ng & Liew 2006).

Testing the neural network is done with all the artificial neurons active (Nielsen 2015). However, the parameters in the network have been learned under conditions in which a number of the neurons were dropped out. To compensate for this, the weights have to be scaled down: if a unit is kept with probability p during training, the outgoing weights of that unit are multiplied by p when testing the neural network (Srinivasan, Ng & Liew 2006).

Dropout has proven very successful in reducing overfitting and thus improving the performance of neural networks. Srinivasan, Ng and Liew (2006) describe how the technique was applied to a number of different tasks.
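A sketch of the two phases described above. Here the test-time scaling is applied to the activations rather than the outgoing weights, which has the same effect on the next layer's input; names and the keep-probability are illustrative:

```python
import numpy as np

def dropout_train(a, p, rng):
    """Training: each unit is kept with probability p, dropped otherwise."""
    mask = rng.random(a.shape) < p
    return a * mask

def dropout_test(a, p):
    """Testing: all units are active, so activations are scaled by p to
    match the expected activation level seen during training."""
    return a * p
```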

2.4 Convolutional neural networks

A convolutional neural network (CNN) is a kind of feed-forward neural network that is especially common in the field of computer vision (Deshpande 2016). The neural network architectures described in section 2.2 do not take the spatial structure of the input data into account. If an image is used as input to one of those networks, and every neuron in the input layer receives one pixel value, it would not matter which input neuron received which pixel value, because networks with this kind of architecture treat pixels that are far apart the same as pixels that are close together. CNNs use an architecture which tries to take advantage of the spatial structure.

CNNs are composed of a number of different layer types of which the convolutional layer, pooling layer and fully connected layer are the main ones.

2.4.1 Convolutional layer


entire input image, moving with a set stride. The pixel values in the receptive field are multiplied element-wise with the weights of the filter, and the products are summed up to produce a single number. This process is mathematically expressed as:

output = b + ∑_{l=0}^{4} ∑_{m=0}^{4} w_{l,m} a_{j+l,k+m}    (12)

where a_{x,y} is the input activation to the convolutional layer, b is the bias and w_{l,m} is one of the weights in the filter (Nielsen 2015). The weights defining the filter are sometimes called shared weights; they do not change as the receptive field slides across the image (Bishop 2006, p. 268). Every unique positioning of the filter on the input volume produces a scalar value, and that scalar is the output of the convolutional layer. This means that not every input pixel is connected to every hidden neuron, see Figure 4.

Figure 4. The figure shows how the filter moves along the image when the receptive field is 4 × 4 and the stride is 1.


Directly after the convolutional layer it is conventional to apply a nonlinear layer, using one of the activation functions mentioned in section 2.2. The output of a convolutional layer with a filter of size L × M and a stride equal to 1, followed by a nonlinear layer, is:

output = g(b + ∑_{l=0}^{L} ∑_{m=0}^{M} w_{l,m} a_{j+l,k+m})    (13)

where g(x) is the activation function (Nielsen 2015).
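As a concrete illustration of the sliding filter with shared weights followed by a ReLU nonlinearity, a naive 2-D convolution can be written as below. This is a sketch for clarity; real implementations use optimized library routines:

```python
import numpy as np

def conv2d(a, W, b, stride=1):
    """Valid 2-D convolution followed by ReLU.
    a: input image, W: shared-weight filter, b: scalar bias."""
    L, M = W.shape
    H = (a.shape[0] - L) // stride + 1
    Wd = (a.shape[1] - M) // stride + 1
    out = np.empty((H, Wd))
    for j in range(H):
        for k in range(Wd):
            patch = a[j*stride:j*stride+L, k*stride:k*stride+M]
            out[j, k] = b + np.sum(W * patch)   # element-wise product, summed
    return np.maximum(out, 0.0)                 # ReLU nonlinearity
```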

2.4.2 Pooling layer

A pooling layer is sometimes added after the nonlinear layer that follows a convolutional layer. The pooling layer basically simplifies the output from the convolutional layer. Pooling layers can also be referred to as down-sampling layers.

There are several kinds of pooling layers, however, the most popular one is probably the max-pooling layer. A max-pooling layer takes a region of the input and outputs the maximum value of that region. The regions are similar to the receptive field discussed in 2.4.1. An example of how a max-pooling layer works is shown in Figure 5.

Figure 5. An example of max-pooling where the regions in the image are 2 × 2.

Adding a pooling layer drastically reduces the spatial dimension of the input volume to the next layer in the network. This means that the number of parameters in the network can be reduced, which speeds up the training process (Nielsen 2015). Adding a pooling layer also forces the feature detectors to become more broadly applicable, which helps reduce overfitting (Masci, Meier & Ciresan 2011).
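The max-pooling operation can be sketched in a few lines (illustrative code, non-overlapping regions):

```python
import numpy as np

def max_pool(a, size=2):
    """Non-overlapping max-pooling: each size x size region of the input
    is replaced by its maximum value."""
    H, W = a.shape[0] // size, a.shape[1] // size
    # group the array into (H, size, W, size) blocks, then take block maxima
    return a[:H*size, :W*size].reshape(H, size, W, size).max(axis=(1, 3))
```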

2.4.3 Fully connected layer


3. Method

Artificial neural networks have proven useful in many areas of application, including energy disaggregation. Nonintrusive load monitoring methods that use neural networks for pattern recognition have previously proven capable of identifying both appliances with multiple on-states and appliances with continuously varying power consumption, as stated in section 1.3. Since the aim of the project is to identify washing machines, dishwashers and dryers in the total load, and those kinds of appliances usually have multiple on-states, a neural network is applied to perform the pattern recognition. The real power is used as the signature feature since it is a feature already measured by today's smart meters. Measurements of the active power from the apartments in the SRS are available.

The following approach to constructing a neural network capable of performing load disaggregation is heavily inspired by one of the algorithms described by Kelly and Knottenbelt (2015); a convolutional regression network is used to calculate the time when the appliance of interest was switched on and the time it was switched off. For every appliance that is to be identified when performing load disaggregation, a separate neural network is trained so that it is able to recognize the power consumption pattern of that specific appliance. This means that for a model which is to identify three different appliances, three neural networks are trained: one which recognizes the power consumption pattern of the washing machine, one which identifies the dryer, and one which identifies the dishwasher.

The neural network models are written in the programming language Python using the software library TensorFlow. TensorFlow is an open source software library for numerical computation developed by researchers working on the Google Brain Team within Google’s Machine Intelligence research organization (Google Brain Team n.d.). TensorFlow was developed in order to conduct machine learning and deep neural network research (Google Brain Team n.d.). The networks are trained on the CPU on an ordinary laptop.

3.1 Creating the training dataset

As discussed in section 2.2.2 an artificial neural network must be provided with examples in the form of a training dataset in order to learn values for the parameters in the model. When training a network to perform an energy disaggregation the training dataset should be constructed from continuous measurements of the signature feature(s) of the aggregate load in which the target appliance(s) are present. The training dataset has to contain target values and thus information about the state of the target appliance is required and has to be recorded.


window length is chosen in a way so that it is able to capture the entirety of the target appliance activation; it thus depends on the running time of the appliance. An activation is a time interval during which the appliance is running: the activation starts when the appliance is switched on and stops when the appliance is switched off. ANNs designed to recognize different appliances will consequently be trained with time windows of different sizes and have a different number of neurons in the input layer. Each time window has a corresponding vector of target values. The ANN is designed to calculate the time when an appliance is switched on and when it is switched off given an input time window. The target values thus become the switch-on and switch-off times relative to the window presented to the network, as a percentage of the length of the window. For example, imagine a window that starts at 07:00 and ends at 08:00; the target appliance is switched on at 07:30 and switched off at 07:45. The target value for the network to find with that particular time window as input becomes [0.5, 0.75].
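The target encoding in the example above can be sketched as follows (times in minutes since midnight; purely illustrative):

```python
def encode_targets(window_start, window_len, on_time, off_time):
    """Switch-on/off times expressed as fractions of the input window."""
    return [(on_time - window_start) / window_len,
            (off_time - window_start) / window_len]

# Window 07:00-08:00 (minutes 420-480), activation 07:30-07:45:
print(encode_targets(420, 60, 450, 465))  # → [0.5, 0.75]
```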

In order for the network to learn to tell the difference between when the target activation is present in the example and when it is not, the training set is made up of both windows with activations and windows without activations of the target appliance. If the window does not contain an activation the target value is given as [0, 0]. If possible, the network is also provided with windows containing activations of the other appliances in order to teach the network the difference between different activations; for example, a network designed to recognize a washing machine is provided with training data containing activations of the dishwasher with the target [0, 0]. The dataset used for training consists of 50 % windows with an activation of the target appliance present and 50 % windows without.

Neural networks learn best if the input to the network has zero mean (Kelly & Knottenbelt 2015). Before providing the network with a time window the mean of the window is thus subtracted. Another benefit of training the model on input data with zero mean is that the network does not need to consider loads that are always on.

To help with initialization the input window of aggregated active power data is also scaled by dividing it by a constant. The constant corresponds to the standard deviation of a random example window containing an activation of the target appliance. The reason for not dividing the input to the network by its own standard deviation is that it would change the scaling of the data which is likely to be important when performing load disaggregation. If every sample is divided by its own standard deviation the network cannot associate a step change of a certain size to a specific appliance since that step change will be different between examples.
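A sketch of this pre-processing step; the scale constant would be the standard deviation of one chosen example window containing an activation of the target appliance:

```python
import numpy as np

def normalise_window(window, scale_constant):
    """Subtract the window's own mean (zero-mean input), then divide by a
    constant shared across all examples, so that the relative size of
    step changes is preserved between examples."""
    return (window - window.mean()) / scale_constant
```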


consisting of example windows are then divided into training data and validation data; 75 % of the example windows are used for training and 25 % for validation.

3.2 Neural network architecture

The network architecture applied is basically a convolutional network which performs a regression in order to estimate two scalar, real-valued outputs. The outputs represent the switch-on and switch-off times of the target appliance and will be in the range of [0, 1]. If the input to the net does not contain an activation of the target appliance the outputs of the net should be approximately [0, 0]. Since the network estimates two scalar values the output layer has two neurons: one for the value of the start-time of the appliance and one for the stop-time.

The number of neurons in the input layer is equal to the window length, which in turn depends on the running time of the appliance. Each input neuron will contain the power consumption at a given point in the input window. For example, if the input window is 200 minutes long with a sampling time of two minutes, the first neuron receives the active power measured at the first minute of the window, the second neuron the active power at the third minute, and so on.

The first layers in the networks are convolutional layers. Max pooling is applied after some of the convolutional layers. This is done in order to reduce the number of parameters the network trains and to help reduce overfitting. The purpose of the convolutional layers is for the network to learn high-level features as described in section 2.4. After the last convolutional layer is a fully connected layer. Adding a fully-connected layer to a convolutional network allows the network to learn non-linear combinations of the features learned by the convolutional layers. Essentially the convolutional layers are providing a meaningful low-dimensional feature space and the fully connected layers are learning a function in that space.

Deep neural networks with a sigmoid activation function, see Equation 3, have been proven to take a comparatively long time to train (Glorot & Bengio 2010), and result in models with larger estimation errors compared to when other activation functions are used (Krizhevsky et al. 2012). If instead the rectified linear activation function (ReLU), see Equation 4, is used the network will be able to train several times faster (Krizhevsky et al. 2012). Hence, all artificial neurons in the final network architecture use the rectified linear activation function.


3.3 Training the neural network

The architecture of the final network is decided upon during the training of the neural network. Different combinations of convolutional and fully connected layers are evaluated in an iterative process in order to find one which produces a good result. All networks are trained end-to-end from a random initialization of the weights; the biases are initialized as zero. A schematic picture of how the training is carried out is shown in Figure 6; the process is repeated for each set of hyperparameters.

Figure 6. The training process of a neural network. An example window is used as input to the network (in reality a batch of example windows is used) and the window is processed and evaluated.

Adam optimization, described by Equations 10a-10f, is used as the learning algorithm, and mini-batches are used when training the network. The size of the mini-batches is decided upon during the training process by plotting the value of the cost function for the validation set versus real elapsed time (Bengio 2012). The mini-batch size which gives the most rapid improvement is then chosen (Nielsen 2015).


3.4 Applying the neural network to input data

When performing NILM the input to the model is an arbitrarily long sequence of measured aggregate data. The measured data is analyzed by the trained ANNs; however, the networks have an input window with a fixed size of at most a few hours. In order to disaggregate arbitrarily long sequences of data the ANN must be fed time windows with the same length as the input of the neural network, which are then processed by the network individually. This is accomplished by sliding the ANN along the input sequence. The input window to the network is placed at the beginning of the input sequence, where the network calculates its output. The window is then moved forward along the sequence by a set stride and the ANN is applied in the new position. This procedure is repeated until the entire sequence has been processed. Figure 7 shows an example of how the network slides across the input.

Figure 7. The input window to the ANN is sliding across the sequence by moving it in strides.

Before sliding the network over the input sequence the beginning and the end of the sequence are padded with zeros, so that the first input shown to the network is all zeros. By doing this the network will see each instance in the sequence the same number of times, which is important when combining the outputs from the networks, especially when the outputs overlap; the reason is described in more detail below. Padding the input with zeros also ensures that the entirety of the sequence is processed.
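The padding-and-sliding procedure can be sketched as follows (window length and stride illustrative):

```python
import numpy as np

def sliding_windows(sequence, window_len, stride):
    """Zero-pad both ends of the sequence so the first window the network
    sees is all zeros and every instant is covered by the same number of
    windows, then slide a fixed-length window along in steps of `stride`.
    Returns (start index in padded coordinates, window) pairs."""
    padded = np.concatenate([np.zeros(window_len), sequence,
                             np.zeros(window_len)])
    starts = range(0, len(padded) - window_len + 1, stride)
    return [(s, padded[s:s + window_len]) for s in starts]
```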


make multiple attempts at processing every instance of the input sequence. If an error is made by the net on one window it can be diminished if the nearby windows give a correct output.

The overlapping input sequences to the network will produce multiple outputs, which means that each time step has multiple estimations of whether the target appliance is running or not. For the model to produce a disaggregated time series these estimations must be combined. The switch-on and switch-off times, which are the outputs of the network as it slides along the sequence, are interpreted as activations. The network is supposed to output [0, 0] if there is no activation of the target appliance within the window, but it sometimes outputs a minuscule value which could be interpreted as a very short activation of the target appliance of perhaps one or two minutes. To avoid this, a minimum activation time is applied before combining the outputs.

When the network receives overlapping input sequences it will produce overlapping activations. All these activations are layered on top of each other and the overlap is measured. A threshold for the overlap is introduced: a set number of windows have to agree that the target appliance is running in order for the model to register it as being switched on at that particular point in time in the final output. The threshold is a design parameter of the model which is chosen during the training process. The introduction of a threshold for the overlap is the reason that zero padding the input sequence is necessary; if a point in time is present in fewer windows it is more difficult for it to reach the threshold, which makes it harder for the model to identify activations in the beginning and end of the input sequence. Zero padding the sequence prevents this.
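The combination step can be sketched as follows; the threshold and minimum activation time are illustrative values:

```python
import numpy as np

def combine_activations(activations, seq_len, threshold, min_duration=3):
    """Combine overlapping per-window activations into one on/off series.
    `activations` holds (start, stop) index pairs in sequence coordinates.
    A time step counts as 'on' only if at least `threshold` windows agree,
    and activations shorter than `min_duration` steps are discarded first."""
    votes = np.zeros(seq_len, dtype=int)
    for start, stop in activations:
        if stop - start >= min_duration:    # minimum activation time
            votes[start:stop] += 1
    return votes >= threshold               # boolean on/off series
```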

The complete disaggregation model is described in Figure 8.


3.5 Evaluating the NILM model

The data held back from the training process is used for the evaluation. When evaluating the entire model, what is of interest is the percentage of correct classifications of the activations of the target appliances. The disaggregation model can only output whether the appliance is on or off at a given instance in time; metrics that evaluate the percentage of correct on/off classifications will thus be used. Before the metrics can be introduced, some quantities have to be defined:

• TP = number of true positives. A true positive means that the model output indicated that the target appliance was on when it was actually on.

• FP = number of false positives. A false positive means that the model output indicated that the target appliance was on when it was actually off.

• TN = number of true negatives. A true negative means that the model output indicated that the target appliance was off when it was actually off.

• FN = number of false negatives. A false negative means that the model output indicated that the target appliance was off when it was actually on.

• P = total number of positives in the evaluation data. A positive means that the appliance was on.

• N = total number of negatives in the evaluation data. A negative means that the appliance was off.

These values are accumulations over a given time period.

Makonin and Popowich (2015) suggest several ways of measuring the performance of a NILM model, four of which are the accuracy, precision, recall and the F1-score.

3.5.1 Accuracy

The accuracy measures how often the output of the network is correct. It is computed by dividing the number of correct classifications by the total number of time instances classified.

Accuracy = correct matches / total possible matches = (TP + TN) / (P + N)    (14)


3.5.2 Precision, recall and F1-score

The precision in this context is a measurement of what percentage of the detected on-states are actual on-states:

precision = TP / (TP + FP)    (15)

The recall measures what percentage of the on-states are correctly classified:

recall = TP / (TP + FN)    (16)

The F1-score is the harmonic mean of precision and recall. It measures how accurately the NILM method can predict if the target appliance is on:

F1 = 2 · precision · recall / (precision + recall)    (17)
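The metrics above can be computed directly from the accumulated counts. An illustrative helper, assuming P = TP + FN and N = TN + FP, with the F1-score taken as the harmonic mean of precision and recall:

```python
def nilm_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1-score from accumulated counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # (TP + TN) / (P + N)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```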


4. Results

4.1 Data

The approach to energy disaggregation described in section 3 was applied to two datasets: one consisting of artificially created data and one of real measurements from the SRS project. The reason for performing an energy disaggregation on an artificial dataset was to examine how well the method performed on ideal data. Artificial data was also used when examining how the sampling frequency of the data affects the performance of the model.

4.1.1 Simulated data

Data for the artificial dataset is taken from the simulation results in Song (2013) created using the high-resolution stochastic model of domestic activity patterns and electricity demand described by Widen and Wäckelgård (2010). The power consumption for 50 households over one year is simulated with a time resolution of 1 sample per minute. Each sample value is the instantaneous power consumption in that moment in time.

The resulting dataset consists of power measurements for all the individual appliances that make up the total household load together with the aggregated power consumption of the household, which is created by adding the power consumption of the individual appliances together. There is no measurement noise present in the simulated data. All activations of the same target appliance (the washer, dryer, and dishwasher) look identical. This is of course a simplification of reality; not only may the appliances have different settings and programs that affect the power consumption, but even if an appliance is run with the exact same settings the power signature may vary slightly. Moreover, it is also assumed that all 50 households have the exact same appliances installed, which means that the power signatures of the appliances are identical for all households.


Table 1. Information about the small dataset made up of simulated data. The dataset has the same number of activations as the dataset used to train the network with real measured data.

Number of activations | Number of training examples | Window length [min] | Window length [time units]
30 | 180 | 300 | 300

One of the objectives of the project is to examine how the sampling frequency affects the result. In order to get training data with a different sampling frequency the simulated dataset is resampled. Four datasets with different sampling times are created: one dataset contains the power measured every minute, one every second minute, one every fifth minute and one that measures the power once every 10 minutes.

The resampling is done in a way that preserves the simplification that all activations of the target appliances look the same. If this is not done, the models using the resampled data will most likely produce inferior results compared to the one using the original dataset, since the activations will be identical in the original dataset while they will be distorted by the resampling. In reality, when using a lower sampling frequency the signal will become slightly distorted and the power signature pattern might vary more than it would at a higher sampling frequency. However, since the original dataset contained the simplification that all activations look the same, and the data is thus an idealization of reality to begin with, that simplification was kept when resampling the data. What is studied in this project, when examining how the sampling frequency affects the model, is whether a shorter signal with less information is more difficult to identify in the aggregated data.

Before resampling the aggregated consumption the power consumption of the target appliances are subtracted, they are then resampled separately. The resampling is done by grouping the values by the new frequency and then calculating the mean. After resampling the power consumption of the target appliances each activation is replaced by a default resampled signature. The default resampled signature is created by resampling a single activation of the target appliance. By inserting a default signature each activation of the target appliance in the resampled data will look identical. After the aggregated power consumption and that of the individual target appliances are resampled they are added together to form the total resampled power consumption of the household.
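The grouping-and-averaging rule can be sketched as below. This is a simplified version; the pipeline described above additionally replaces each resampled activation of a target appliance with a default resampled signature before adding the series back together:

```python
import numpy as np

def resample_mean(power, factor):
    """Down-sample a 1-minute power series by averaging groups of `factor`
    consecutive samples (group by the new frequency, take the mean)."""
    n = len(power) // factor * factor        # drop any incomplete tail group
    return power[:n].reshape(-1, factor).mean(axis=1)
```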


training examples. Specifications for the dataset created with a sampling time of 1 minute are shown in Table 2. For specifications of the datasets for other sampling times see Appendix A.

Table 2. Specifications for the training and validation dataset with a sampling time of 1 minute. The time unit is one minute.

Appliance | Number of activations | Number of training examples | Activation length [time units] | Window length [time units]
Dryer | 2680 | 5339 | 121 | 200
Washer | 2595 | 5169 | 163 | 300
Dishwasher | 1390 | 2777 | 131 | 250

4.1.2 Data from the SRS project

The power consumption of the apartments in the SRS is continuously being measured. I have had access to the instantaneous active power measured with a sampling time of 30 seconds during January-March 2017. During this period of time there might have been a move-in effect, and thus no “regular” use of apartments and appliances can be expected. All apartments have the same washing machine and dryer installed. No individual power measurements for any of the target appliances have been made; however, the status of two of the target appliances, the washer and the dryer, at given points in time has been recorded. The status can, among other things, indicate whether the machine is on or off.

There are missing values in the data recorded from the apartments, which made it difficult to process the data automatically. The status that indicates that the machine switched off or on is sometimes missing, and there are gaps in the measurements of aggregate power consumption. Another problem with the available data is that if the user enters settings into the machine but then changes their mind and never starts the washing or drying cycle, it will still register as an activation even though it should not. Because of this, and the limited time frame of the project, the data could not be pre-processed automatically. Instead, intervals where no data points are missing and the recorded activations correspond to actual usage of the appliances were chosen manually. This is quite time consuming, so only the washer was studied, and the resulting dataset created for training the model is small and contains few activations of the target appliance.



Table 3. Information about the real measured data from the SRS project.

Appliance   Number of activations   Max on duration [min]   Min on duration [min]
Washer      41                      290                     26

Two datasets are created for training and evaluating the disaggregation model. In the first dataset, dataset 1, some activations from each apartment are held back during training and are instead used for evaluating the model. In the second dataset, dataset 2, the last apartment is not seen by the model during training. The data from that apartment is used for evaluating the model in order to determine how well the model performs on apartments not seen during training.
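The two evaluation splits can be sketched as follows, using a placeholder data structure (a dict mapping an apartment identifier to its list of activations); all names are illustrative assumptions, not the thesis code.

```python
import random

def split_within_apartments(activations_by_apt, holdout_fraction=0.2, seed=0):
    """Dataset 1: hold back some activations from every apartment for validation."""
    rng = random.Random(seed)
    train, val = [], []
    for apt, acts in activations_by_apt.items():
        acts = acts[:]
        rng.shuffle(acts)
        k = max(1, int(len(acts) * holdout_fraction))
        val.extend(acts[:k])
        train.extend(acts[k:])
    return train, val

def split_leave_one_apartment_out(activations_by_apt, test_apt):
    """Dataset 2: evaluate on an apartment never seen during training."""
    train = [a for apt, acts in activations_by_apt.items()
             if apt != test_apt for a in acts]
    val = activations_by_apt[test_apt]
    return train, val

# Toy usage with two apartments and five activations each.
apartments = {"apt1": [1, 2, 3, 4, 5], "apt2": [6, 7, 8, 9, 10]}
train1, val1 = split_within_apartments(apartments)
train2, val2 = split_leave_one_apartment_out(apartments, "apt2")
```

The second split gives the more pessimistic (and more realistic) estimate of generalisation, since no data from the evaluation apartment leaks into training.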

Activations are identified within the datasets and “cut out” to create example time windows. Since there are so few activations, each activation is used multiple times in the training dataset, but with a different displacement within the time window. Windows that contain no activations are also included among the training examples. The specifications for the datasets used for training the neural networks are presented in Table 4.
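The windowing augmentation described above can be sketched as follows: each activation (a start/stop index pair in the aggregate signal) is placed at several random offsets inside a fixed-length window, and the regression targets become the start and stop positions relative to that window. All names here are illustrative assumptions, not the actual implementation.

```python
import random

def make_windows(signal, activations, window_len=600, copies=6, seed=0):
    """Cut out `copies` windows per activation, each with a different
    displacement of the activation inside the window."""
    rng = random.Random(seed)
    examples = []
    for start, stop in activations:
        span = stop - start
        for _ in range(copies):
            # Choose a displacement such that the whole activation fits.
            offset = rng.randint(0, max(0, window_len - span))
            w_start = max(0, min(start - offset, len(signal) - window_len))
            window = signal[w_start:w_start + window_len]
            # Targets: activation start/stop relative to the window.
            examples.append((window, start - w_start, stop - w_start))
    return examples

signal = list(range(2000))   # stand-in for aggregate power samples
acts = [(400, 550)]          # one activation of length 150 time units
examples = make_windows(signal, acts)
# Each window has 600 samples and contains the full activation.
```

Windows containing no activation would be added in the same way as negative examples, with targets indicating the absence of the appliance.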

Table 4. Information about the datasets used for training and validating the neural networks used in the model.

Dataset   Number of activations   Number of training examples   Window length [min]   Window length [time units]
1         31                      186                           300                   600
2         30                      180                           300                   600

4.2 The model trained on simulated data

4.2.1 The results when training with a small dataset
