Automating dataflow for a machine learning algorithm

KTH Bachelor Thesis Report

Degree Project in Technology, First Cycle, 15 Credits
Stockholm, Sweden 2019

Albert Gunneström
Erik Bauer

KTH Royal Institute of Technology

Abstract

Machine learning algorithms can be used to predict the future demand for heat in buildings. This can be used as a decision basis by district heating plants when deciding an appropriate heat output for the plant. This project is based on an existing machine learning model that uses temperature data and the previous heat demand as input data. The model has to be able to make new predictions and display the results continuously in order to be useful for heating plant operators.

In this project, a program was developed that automatically collects input data, uses this data with the machine learning model and displays the predicted heat demand in a graph. One of the sources of input data does not always provide reliable data, and in order to ensure that the program runs continuously and in a robust way, approximations of missing data have to be made. The result is a program that runs continuously but with some constraints on the input data. The input data needs to contain some correct values within the last two days in order for the program to run continuously. A comparison between calculated predictions and the actual measured heat demand showed that the predictions were in general higher than the actual values. Some possible causes and solutions were identified but are left for future work.

Keywords

Automation, Dataflow, Machine learning, District heating


Sammanfattning

Machine learning algorithms can be used to make predictions of the future demand for heat in buildings. This can be used as a decision basis by district heating plants to determine an appropriate heat output. This degree project is based on an existing machine learning model that uses temperature data and previous heat demand data as input parameters. The model must be able to make new predictions and display the results continuously in order to be useful for operating staff at district heating plants. In this project, a program was developed that automatically collects input data, uses this data in the machine learning model and displays the results in a graph. One of the sources of input data does not always provide reliable data, and in order to guarantee that the program runs continuously and robustly, incorrect data must be approximated. The result is a program that runs continuously, but with some restrictions on the input data. The input data must contain at least some correct values within the last two days for the program to run continuously. A comparison between calculated predictions and the actual measured heat demand showed that the predictions were generally higher than the actual values. Some possible causes and solutions were identified but are left for future work.

Nyckelord

Automation, Dataflow, Machine learning, District heating


Authors

Albert Gunneström and Erik Bauer

Information and Communication Technology
KTH Royal Institute of Technology

Place for Project

Research Institutes of Sweden (RISE)
Stockholm, Sweden

Examiner

Johan Montelius

KTH Royal Institute of Technology

Supervisor

Alf Thomas Sjöland

KTH Royal Institute of Technology


Contents

1 Introduction
1.1 Background
1.2 Problem
1.3 Purpose
1.4 Goal
1.5 Methodology
1.6 Stakeholders
1.7 Delimitations
1.8 Outline
2 Theoretical Background
2.1 Smart flows
2.2 Predicting heat demand
2.3 Getting the measured heat demand to Azure
2.4 Regarding temperature data
3 Methods
3.1 Fulfilling specification goals
3.2 Programming language and computing platform
3.3 Prototyping
3.4 Testing
4 The Work
4.1 System architecture
4.2 Collecting temperature data
4.3 Collecting the demand from a cloud service
4.4 Displaying predictions
4.5 Displaying multiple predictions
4.6 Displaying historical and predicted demand
5 Result
6 Discussion and conclusion
6.1 Discussion
6.2 Conclusion
6.3 Future work
References


1 Introduction

1.1 Background

Countries located in cold climates have a demand for heat. One way of providing this heat is by using district heating. District heating plants produce heat by burning fuel. The heat generated is transferred to households and facilities (clients) in the form of warm water [15].

Client demand for heat depends on many different factors, such as outside temperature and client behaviour. Most clients use warm water at approximately the same time of day, which leads to a spike in demand at these hours. For example, many people shower in the morning, whereas very few shower in the middle of the night.

It is a challenging task for district heating plants to provide an appropriate temperature for their clients. If the provided temperature is too hot, unnecessary amounts of fuel will have been consumed, which leads to higher costs and emissions for the plant. If the output temperature is too low, clients will not get the service that they pay for, and the low indoor temperatures may also have adverse effects on facilities.

It can take up to several hours for warm water to be transported to clients; therefore, there exists a need to predict what the future demand for heat will be. By correctly predicting future demand for heat, the plant can modify its fuel consumption accordingly in order to provide an appropriate output temperature.

Many district heating plants produce heat by burning different types of fuels. An example of this is Sweden’s largest district heating facility, located in Västerås.

The plant is able to burn biomass, coal and different types of oils in order to produce heat for its clients [6]. Fuel consumption is a cost for district heating plants and the incineration process leads to CO2 emissions, which can have adverse effects on the environment.

A research project called ”Smarta flöden” [13], which addresses the problems associated with predicting the future heat demand, has been launched by Research Institutes of Sweden (RISE). The project is organized in conjunction with Mälarenergi and the district heating plant in Västerås. Sensors that measure the use of heat have been installed in Västerås. This sensor data is used to get the client heat demand, which is measured in watt hours (Wh). The sensor data has been recorded for more than two years, and with this data RISE has developed a machine learning model that can predict the heat demand.

Plant operators are interested in what the heat demand will be for the next couple of hours. The machine learning model can be used to predict the future heat demand by utilizing weather forecasts as input data. Performing all the manual steps needed to make a heat demand prediction is a time intensive process. The predictions therefore need to update automatically when new input data is collected in order to provide useful information for plant operators. In this project we tackle the problem of creating a system where new input data is automatically fed into the machine learning algorithm so that the predictions update automatically. A system diagram of this is seen in Figure 1.1.

Figure 1.1: System diagram

1.2 Problem

RISE has in their project ”Smarta flöden” developed a machine learning algorithm that predicts demand for heat [13]. In order to output a graph of predictions for the future demand, which may be used by a district heating plant operator, many sequential processing steps are needed.

The model makes a prediction for a given time based on the outside temperature and the average temperature and heat demand for the previous 24 to 48 hours. One problem in the project is how to collect this data, and from where. If input data to the machine learning model is missing or somehow distorted, it needs to be handled to ensure that the program is robust and does not crash easily. Another problem in the project is how to format the collected input data so that it can be used by the algorithm. Finally, after the algorithm makes a prediction, the results need to be plotted in a graph. All of these steps have to be performed automatically. How can this be implemented in a robust way?

1.3 Purpose

The purpose of the project is to collect and send data in real-time to a machine learning model and to display the predictions produced by the model in real-time, according to specifications given by RISE. The purpose of the report is to document and describe the developed software solution that fulfills these specifications.

1.4 Goal

The main goal of the project is to use the machine-learning algorithm from RISE, and implement an application that automatically collects data from different sources, runs the data through the model and displays the results in a graph.

The graph automatically updates as new predictions are made. This main goal is divided into sub-goals according to RISE specifications:

1. Collect the following data automatically:
   (a) Temperature forecast
   (b) Historical temperature
   (c) Historical heat demand
2. Handle missing and faulty data
3. Format the data for the machine learning model
4. Automate the process of making a prediction
5. Display predictions in a graph that automatically updates

Further features requested by RISE are to display alternative heat demand predictions and old heat demand data. The specifications for these secondary goals are listed below.

1. Display alternative predictions
   (a) Calculate and display predictions made on modified temperature data
   (b) Display several different predictions in the same graph
2. Display historical data
   (a) Save previously recorded predictions from the primary goal
   (b) Display demand from recorded sensors and predicted demand in the same graph

1.4.1 Benefits, Ethics and Sustainability

The software developed in this project can be beneficial for operators at the district heating plant in Västerås since it provides them with a decision basis for how much heat to produce for their clients. It can also be beneficial with regard to sustainability, since it can reduce the amount of heat that is produced at the plant and therefore reduce emissions. There are also potential financial benefits of using the decision basis, since it may help reduce the amount of fuel used.

Heat demand sensors are used to measure the heat use of individual households. This information may be used to deduce behaviours of the people living in a household, such as whether they are on vacation or not. The data from the sensors used in this project is aggregated, and information on individual households is therefore lost. A confidentiality agreement has been signed to ensure that no client data in the project is released.

1.5 Methodology

The project work focuses on the development of a software solution and uses a combination of quantitative and qualitative methodologies. The purpose is to develop this software such that it fulfills a set of functional as well as non-functional requirements, as seen in the goals (section 1.4). Prototyping is used to reason about the requirements and to test out possible ways of implementing solutions. To ensure that the implementation is correct, software testing is conducted, which includes unit testing, integration testing and system testing.

1.6 Stakeholders

The degree project is given as a part of the ”Smarta flöden” project by Research Institutes of Sweden (RISE).

1.7 Delimitations

Only the effectiveness of the implementation is considered, i.e. only that it fulfills the functional requirements. The efficiency with which it achieves the goal is not considered, unless it hampers the effectiveness. For example, if the model used for calculating predictions is fed new data once every minute, it cannot take more than one minute to calculate the prediction based on that data. Otherwise the model will not be able to output the prediction in time before the next data point arrives. In this example the efficiency has to be considered in order for the program to function properly, but any further optimization will not be performed. The interface part of the implementation is not analyzed in terms of its usability.

The district heating plant provides heat to several districts. This project only considers data and predictions for the district of Surahammar.

Other machine learning models developed by RISE, such as models that predict the supplied water temperature for clients and the transportation delay for heat, are not considered in this project.


1.8 Outline

Section 2 explains background information and earlier work in RISE’s project ”Smarta flöden”. Section 3 describes the methods used, and section 4 describes how the project work is carried out. Results from the project are found in section 5. Finally, discussion, conclusions and future work are provided in section 6.


2 Theoretical Background

2.1 Smart flows

The project ”Smarta flöden”, or Smart flows in English, is focused on improving district heating plants by using machine learning, IoT and cloud technologies. The project has been in progress for over two years and is a collaboration between RISE, Mälardalshögskolan, SigholmKonsult, ABB, Evothings and Mälarenergi [13].

Mälarenergi is the owner of a district heating plant in Västerås [6]. The plant mostly uses biomass as fuel to generate heat, but during certain peaks in heat demand it has to burn fossil fuels in order to ensure that enough heat is produced. Smart flows could potentially help Mälarenergi by providing better tools for planning their heat output. Several parameters are needed in order to plan an appropriate heat output, such as the heat demand from the clients, the time it takes for the heated water to be transported to the clients and the temperature of the supplied water when it arrives at the clients. The supplied water temperature and the transportation delay both depend on the distance between the plant and the client. Other factors that affect the result are the outside temperature and the behaviour of the clients. The clients’ heat demand differs depending on whether it is night or day, the day of the week and the time of year. For example, the demand for heat is lower during the summer than in the winter.

The plant is currently using a software system called Energy Optima 3 as a decision support system when planning the daily production. The support system makes suggestions based on electricity trading, weather forecasts, historical produced heat and consumed heat. The consumed heat is based on the warm water that is returned to the plant. As opposed to Energy Optima 3, the goal of Smart flows is to collect data on the consumed heat directly from the buildings [9]. The data is collected telemetrically using IoT devices and is stored in a cloud service.


2.2 Predicting heat demand

The machine learning algorithm used to create the model that predicts the heat demand is a so-called support vector regression algorithm. In short, regression is about finding a mathematical function that approximates a relationship between pairs of random variables [2]. One is an input to the algorithm, also known as a feature, and the other is the output. The output from the machine learning model is the heat demand, and the features are the time, the outside temperature, and the average temperature and demand for the previous 24-48 hours. When these features are given to the model, it outputs the heat demand for the given time.

In order to find the relationship between the features and the output, the machine learning model requires training. Training is done by supplying the algorithm with historical data (that has been collected previously) together with a correct value for the output, also known as a label. The algorithm tries to find and adjust a function to the data in such a way as to minimize the difference between its output and the label [2]. Once the model is trained, it can be applied to new data to make predictions for heat demand.

The algorithm requires that the input data is scaled down to a range between zero and one [5]. The input is scaled using a transform which works by first finding the minimum and maximum values of the input data and then applying the following function to each value in the input data [11]:

$$f(x) = \frac{x - \min}{\max - \min}$$

The scaled result is in the range zero to one. This means that the output from the model has to be transformed back to its original range in order to produce results in watt hours (Wh). This is done by applying the inverse of the scaling function on the scaled data.
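Since the report points to scikit-learn's MinMaxScaler [11] for this step, a minimal sketch of the scaling and the inverse transform (with illustrative demand values, not project data) could look like the following:

```python
from sklearn.preprocessing import MinMaxScaler
import numpy as np

# Illustrative heat demand values in Wh (not project data).
demand = np.array([[12000.0], [18000.0], [9000.0], [15000.0]])

scaler = MinMaxScaler()                  # scales each feature to [0, 1]
scaled = scaler.fit_transform(demand)    # applies f(x) = (x - min) / (max - min)

# A model output in the scaled range must be mapped back to Wh
# with the inverse of the scaling function.
prediction_scaled = np.array([[0.42]])
prediction_wh = scaler.inverse_transform(prediction_scaled)
print(prediction_wh)
```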

The support vector regression algorithm is implemented in the Python programming language. It makes use of a library for machine learning called Scikit-learn [10] which provides an implementation of the algorithm as well as functions for scaling and for calculating the accuracy of the trained model.


The algorithm that trains the model reads historical data from files and performs preprocessing on it. The timestamp for each data point is in the format ”year-month-day hour:minute:second”. Only the month, day and hour are used by the algorithm. The day is converted to its corresponding day of the week, expressed as a number, where Monday is 0 and Sunday is 6. The average temperature and heat demand for the previous 24 to 48 hours are calculated for each timestamp. This data, together with the temperature for the timestamp, is used as the features for the model. The heat demand for a timestamp is used as the label.

This dataset is scaled and split into two sets, one that is used to train the model and another for testing it. The model is trained on 85% of the original dataset and test predictions are made on the remaining 15%. The resulting predictions are scaled back and compared to the actual historical demand using accuracy measurements.
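A minimal sketch of this training procedure, using scikit-learn's SVR with default hyperparameters and randomly generated stand-in data (the project's actual features and settings are not given in the report):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Stand-in feature matrix: [month, weekday, hour, temperature,
# avg temperature 24-48 h back, avg demand 24-48 h back],
# already scaled to [0, 1] as the algorithm requires.
X = np.random.rand(1000, 6)
y = np.random.rand(1000)          # heat demand labels (scaled)

# 85% of the data is used for training, 15% for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)

model = SVR()                     # support vector regression
model.fit(X_train, y_train)
print("R^2 on the test set:", model.score(X_test, y_test))
```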

Finally, the predictions are plotted against the test data in a graph. An excerpt of this graph can be seen in figure 2.1; the full graph covers a longer period of time. The dots are the predictions made by the algorithm, and they match the actual demand fairly well.

Figure 2.1: Graph showing the actual and predicted heat demand.


2.3 Getting the measured heat demand to Azure

The households’ and facilities’ heat demands are measured by sensors which are installed in client buildings. The sensors keep track of the heat consumption, measured in watt hours (Wh). The values are continuously updated as heat is consumed. There are two different types of sensors installed in Västerås. The more modern sensor can send measurements every 15 minutes, whilst the older sensor only sends data once every hour. Surahammar is a district with many modern sensors, which is one of the reasons it is the district that Smart flows uses to test its machine learning models.

The data sent from the sensors is the consumed heat since the last measurement and the time the measurement was taken.

These measurements are collected and sent to the district heating plant, where they are stored on a server. From the server the data is sent to Azure, a cloud-based computing platform provided by Microsoft [7]. In Azure, the heat demand from each building is aggregated for every district. Finally, the aggregated demand is stored to file in a blob storage container. The blob storage is used to store unstructured data in Azure [8]. The files are stored in a folder structure according to when they arrived at the blob storage. The recorded arrival time is formatted in coordinated universal time (UTC). The path to a specific file looks like the following: "year/month/day/minute/filename". There is only one file at the end of a file path. Each file contains data for one or several measurements, each one consisting of a timestamp, the district, a sum of the heat demands and the number of buildings that the sum is aggregated from, referred to as the count.

The number of buildings is given because it is not ensured that all data from one measurement arrives in Azure at the same time. The data from some sensors might even be missing. By looking at the timestamp for the demand data, it is possible to see if files that arrive later have the same timestamp; if so, the demands can be summed. If the number of sensor values adds up to the total number of sensors for that district, then all data has been received for that district. It is also possible for the count to have a value higher than the total number of sensors.
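As a minimal sketch of this summing logic, assuming a hypothetical in-memory record format of (timestamp, district, demand sum, count) and an assumed sensor total:

```python
from collections import defaultdict

TOTAL_SENSORS = 200   # hypothetical number of sensors in the district

# Each record: (timestamp, district, aggregated demand in Wh, count).
measurements = [
    ("2019-05-01 12:00:00", "Surahammar", 150000, 120),
    ("2019-05-01 12:00:00", "Surahammar",  95000,  78),  # late-arriving part
    ("2019-05-01 12:15:00", "Surahammar", 240000, 198),
]

# timestamp -> [summed demand, summed count]; parts of one measurement
# that arrive in separate files are added together per timestamp.
demand = defaultdict(lambda: [0, 0])
for ts, district, wh, count in measurements:
    if district == "Surahammar":
        demand[ts][0] += wh
        demand[ts][1] += count

for ts, (wh, count) in sorted(demand.items()):
    status = "complete" if count == TOTAL_SENSORS else f"partial ({count}/{TOTAL_SENSORS})"
    print(ts, wh, status)
```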


2.4 Regarding temperature data

The heat demand prediction for a given time requires the outdoor temperature at that time and the average temperature 24 to 48 hours before as mentioned in ”Predicting heat demand” (section 2.2). By making predictions on weather forecasts, it is possible to predict the future heat demand.

The Swedish Meteorological and Hydrological Institute (SMHI) provides an application programming interface (API) for accessing their meteorological forecasts [12]. The data covers weather forecasts for the next 10 days and includes several meteorological parameters, such as the outside temperature.

The Swedish Transport Administration, Trafikverket, provides an API for accessing data related to traffic. One type of data is the measured temperature from weather stations placed at roads around Sweden. A registration is required to access the data; registering provides a key that is used when sending requests to the API.


3 Methods

3.1 Fulfilling specification goals

The purpose of the project is to develop a software solution that fulfills a set of specifications. Prototyping and software testing are used to determine if the specifications are met.

The final product specified by RISE is a complex system consisting of features for data collection, analysis and graphical feedback. Many of the goals (section 1.4) have behaviours that are disjoint from each other and can be implemented separately without interference between different parts. We developed a system architecture consisting of smaller program modules, where each program module solves part of the overall system requirements. With all the modules implemented, the system architecture fulfills the product specifications.

One benefit of using this approach is that it is easier to modify parts of the program without having to rewrite large sections of code.

3.2 Programming language and computing platform

Programs and program modules are written in Python since it has an extensive set of libraries that are relevant for the project work. Microsoft provides a set of libraries for working with Azure using Python [17]. There are also libraries for sending HTTP requests and parsing JSON objects which are needed for downloading the data needed to run the machine learning model.

The model from RISE was developed in Python using scikit-learn, which restricts the program to use this library for making predictions.

The programs are developed and tested on laptops running flavors of the Ubuntu operating system.


3.3 Prototyping

The prototype uses a pre-processed version of the dataset of historical heat demand and temperatures. This data is used to test the model after it has been trained. The pre-processing involves formatting the data and scaling it, and the result is saved to a CSV file. The prototype reads data from this file one line at a time and makes a prediction on it. The prediction is plotted in a graph which is automatically updated as new predictions are made. A short artificial delay between the prediction calculations is introduced to emulate the longer time between updates of the real-time data. The prototype is then changed and built upon further in order to fulfill the requirements.

3.4 Testing

Testing is used to verify that the developed software functions as intended. Testing is carried out in an incremental manner during the development process. Smaller pieces of code are tested as they are developed using unit testing. Groups of these pieces of code are then tested to see that they interact correctly which is known as integration testing. Finally, the system as a whole is tested. An incremental testing strategy allows for easy troubleshooting of errors, since the program logic is less complex for smaller software modules [4]. The alternative would be to test everything towards the end, when everything is put together. This would most likely make it harder to find causes of potential errors. Another benefit of incremental testing is that it makes it easier to spot potential architectural problems during the development process.

Automated unit tests are written for some of the modules. Others are tested manually. All integration and system testing is performed manually. Program modules are tested using black box testing. A black box test only considers the input and output of the program module.

A blob storage on Azure is set up in order to test the module that downloads heat demand from Azure. Files are uploaded to this storage in different sequences, some with complete data and others lacking data. The program module depends on the data that has been written to file previously, especially the timestamp of the data that was last written to file. Different states of this file are also part of the tests.


4 The Work

4.1 System architecture

The system architecture consists of several program modules that together achieve the goals of the project. Although this project is limited to only work for the region Surahammar, a modular approach to the architecture is followed in order to simplify future modifications of the program, such as changing the program to work for multiple districts. An overview of this system is seen in figure 4.1. In the system, all the data collection program modules store the data to files, which function as an intermediary stage for other program modules that utilize this data. This simplifies the process of adding new program modules and features which also utilize this data.

The final program is intended to be used by plant operators, which means that the program has to be easy to start and exit for someone without much computer knowledge. In figure 4.1 one can see that the system architecture is composed of two programs, one that controls three data collection program modules and another that makes predictions from the data and displays the results. The program that schedules the data collection program modules is written in order to simplify how the system is executed and to make sure that all the data collection programs are started.

Figure 4.1: System diagram
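The scheduling program itself is not shown in the report; a minimal sketch of such a scheduler, with placeholder functions standing in for the three collection modules, might look like this:

```python
import time
from multiprocessing import Process

def collect_forecast():      # placeholder for the SMHI forecast module
    pass

def collect_temperature():   # placeholder for the Trafikverket module
    pass

def collect_demand():        # placeholder for the Azure demand module
    pass

COLLECTORS = [collect_forecast, collect_temperature, collect_demand]
INTERVAL_SECONDS = 600       # each collector runs once every ten minutes

if __name__ == "__main__":
    while True:
        # Run the collectors as separate processes so that one slow or
        # crashing module does not block the others.
        workers = [Process(target=fn) for fn in COLLECTORS]
        for w in workers:
            w.start()
        for w in workers:
            w.join()
        time.sleep(INTERVAL_SECONDS)
```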


4.2 Collecting temperature data

No single weather information service is able to provide all the weather data needed for the machine learning model. Instead, two different services, namely Sveriges meteorologiska och hydrologiska institut (SMHI) [14] and Trafikverket [16], are used to collect all necessary weather data. Recorded temperatures and temperature forecasts are gathered using open APIs from their respective websites. SMHI’s API is used to find weather forecasts, and Trafikverket’s is used to find the actual temperature data (as opposed to forecasted temperature). An overview of what data is collected, and from which service, can be seen in figure 4.1. SMHI also provides recorded temperature data, but not for Surahammar. Trafikverket happens to have a weather station there, which is why the actual temperature data is collected from them.

SMHI’s data is accessed by sending an HTTP GET request that specifies the longitude and latitude of the location for the requested forecasts. The geographical area for which the forecasts are made is divided into a grid, and the forecasts for the gridpoint closest to the requested coordinates are returned. The forecasts for the first two days have temperature data for every hour and are structured in JSON text format. The result from the GET request is a set of timestamps and the weather forecast data for each timestamp. An overview of the structure of the returned result is seen in figure 4.2. Each forecast is made up of a set of parameters, one of which is the temperature.

Figure 4.2: The structure of the returned JSON object

The program module that collects the forecasts sends a request to SMHI and parses the returned JSON object into a dictionary. A dictionary in Python is a set of ”unordered key-value pairs” [3]. A key might, for example, be a timestamp, which has its set of corresponding parameters as a value. The module then reads the temperature forecast for each timestamp in the dictionary and saves it to file. One detail that is not mentioned in SMHI’s description of the returned data format, but that affects how the data is handled, is that the position of the different parameters in the returned JSON object changes after the first five timestamps. This means that the module has to change where in the dictionary it reads from in order to get the temperature and not some other parameter.
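A minimal sketch of the forecast collection, assuming SMHI's point forecast endpoint and illustrative coordinates near Surahammar (the exact endpoint version and coordinates used in the project are not stated in the report); note how searching the parameter list by name sidesteps the position quirk described above:

```python
import requests

# Assumed SMHI point forecast endpoint; the coordinates are
# approximate for Surahammar and only illustrative.
URL = ("https://opendata-download-metfcst.smhi.se/api/category/pmp3g/"
       "version/2/geotype/point/lon/16.22/lat/59.71/data.json")

response = requests.get(URL)
response.raise_for_status()
forecast = response.json()

# Look up the temperature parameter by name rather than by position,
# since the parameter order can change between timestamps.
for entry in forecast["timeSeries"]:
    temp = next(p["values"][0] for p in entry["parameters"] if p["name"] == "t")
    print(entry["validTime"], temp)
```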

A new file is created every time the program module is executed and the file is named after the time it is executed. Each file contains all forecasts that are available at that time, which means that if the program module is executed before SMHI has made new forecasts, the new file will contain the same forecasts as the last file.

Trafikverket records temperature data every ten minutes for their weather stations. Actual temperature data is needed when calculating the average temperature for the past 24-48 hours, which is used by the machine learning model.

Similarly to how SMHI’s API works, an HTTP GET request is sent to Trafikverket’s API and the result is sent back in JSON format. The request specifies what kind of data is requested and for which district; in this case it is the air temperature from the weather station in Surahammar. The result that is sent back is the last five temperature measurements, taken at ten-minute intervals, together with the times they were taken. The JSON object is parsed into a dictionary, and the temperatures and timestamps are extracted and saved to a file. Unlike the forecast data, where new data is saved to a new file, there is only one file for the actual temperature data, with new data being appended at the end of the file.

Each line in this file is made up of the five temperature measurements, each with a timestamp. If the module runs before new measurements are taken, the measurements will be the same as the last ones, and the file might therefore contain duplicate data.
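One way to append new measurements while avoiding such duplicates, sketched here under the assumption of a simple timestamp,temperature CSV layout and a hypothetical file name:

```python
import csv
from pathlib import Path

TEMP_FILE = Path("actual_temperature.csv")   # hypothetical file name

def append_measurements(measurements):
    """measurements: list of (timestamp, temperature) tuples."""
    # Collect timestamps already in the file so duplicates can be skipped.
    seen = set()
    if TEMP_FILE.exists():
        with TEMP_FILE.open() as f:
            seen = {row[0] for row in csv.reader(f) if row}

    with TEMP_FILE.open("a", newline="") as f:
        writer = csv.writer(f)
        for ts, temp in measurements:
            if ts not in seen:   # only append measurements not seen before
                writer.writerow([ts, temp])

append_measurements([("2019-05-01 12:00:00", 14.2),
                     ("2019-05-01 12:10:00", 14.0)])
```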

Both Trafikverket and SMHI discard project-relevant weather data every hour, making it inaccessible through their APIs. Therefore, the weather data has to be saved at least once every hour in order not to lose it. One program module saves the current forecast every ten minutes, and another program module saves the current actual outside temperature. It is assumed that the weather data collection manages to collect the weather data successfully at least once every hour. This assumption is necessary for the heat demand predicting program module to function properly. The weather forecast data is also used by the machine learning model in order to make future heat demand predictions.

4.3 Collecting the demand from a cloud service

As explained in section 2.3, the heat demand data is continuously sent to and stored in a blob storage in Azure. This data is not guaranteed to arrive in order, and data from different times can be interleaved. Moreover, some sensor values might be completely missed, never to arrive at the Azure blob storage. These circumstances make it necessary to cover many different corner cases for how the data is received.

The program module that downloads data from Azure tries to solve these problems by downloading several files and then checking if they contain enough data to write to file. It downloads all files in the blob storage from the last three days and searches each file for data coming from the district Surahammar. This data is put into a list, which is sorted according to time. Another list with all timestamps for the last three days is created, one timestamp for every quarter of an hour. Each timestamp in this list also has fields for the heat demand and the number of buildings the heat demand is aggregated from, called the count. These fields are initially set to zero. The list with data from Azure is iterated through, and the demand and the count are added to the fields in the other list for each timestamp. In this way, parts of a measurement that arrived at the blob storage out of order are captured.
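A minimal sketch of the blob listing and download step, using the current azure-storage-blob package (the report does not state which SDK version the project used; the connection string and container name are placeholders):

```python
from datetime import datetime, timedelta
from azure.storage.blob import ContainerClient

# Placeholder credentials; the real connection string is project-specific.
container = ContainerClient.from_connection_string(
    "<connection-string>", container_name="demand-data")

# Blobs are stored under paths based on their arrival time (UTC),
# so the last three days can be fetched one day prefix at a time.
for days_back in range(3):
    day = datetime.utcnow() - timedelta(days=days_back)
    prefix = day.strftime("%Y/%m/%d/")
    for blob in container.list_blobs(name_starts_with=prefix):
        data = container.download_blob(blob.name).readall()
        # ... parse the measurements for Surahammar out of `data` ...
```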

Before the data can be saved to file, there is the problem of potentially missing data, signalled by a count different from the number of sensors in the district. Demand data is invalidated if the recorded number of sensor values differs from the actual number of sensors by more than 5%. This is indicated in the program by setting the value of the demand field in the list to NaN (not-a-number). The heat demand data for the past 24-48 hours is necessary for predicting future demand, and a solution to fill in these missing data points is therefore needed.

To solve this problem, a combination of two different approaches for handling missing data is used, namely interpolation and predictions. The machine learning model that predicts heat demand can be used to approximate the missing demand data. The reason for using model predictions for missing data is that they are more precise than interpolation. An underlying problem of using model predictions is that they require data for the average demand for the past 24-48 hours. If demand data is missing for that time interval, the predicted heat demand will be incorrect. One could imagine that these missing data points could be filled in by the model as well, but then those predictions require the average demand for the same interval further back, and the problem repeats recursively. To avoid this, missing demand is interpolated from valid heat demand data (less than 5% difference between the number of sensor values and the actual number of sensors) if there is not enough data 48 hours back to make predictions. The interpolation assumes that there exists at least one valid heat demand value for the past 48-72 hours. This assumption is needed in order for the interpolation to work in the interval for the past 24-48 hours.
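A minimal sketch of this fallback logic using pandas, with a hypothetical predict_missing helper standing in for the model-based fill:

```python
import numpy as np
import pandas as pd

def predict_missing(demand: pd.Series) -> pd.Series:
    # Placeholder: the real implementation would call the trained
    # scikit-learn model with temperature and average-demand features.
    return demand.fillna(demand.mean())

def fill_missing(demand: pd.Series, have_48h_history: bool) -> pd.Series:
    """Fill NaN demand values, preferring model predictions.

    `demand` is indexed by quarter-hour timestamps; `have_48h_history`
    says whether enough valid data exists 48 hours back for the model.
    """
    if have_48h_history:
        return predict_missing(demand)
    # Otherwise fall back to linear interpolation between valid values.
    return demand.interpolate(method="linear", limit_direction="both")

idx = pd.date_range("2019-05-01", periods=8, freq="15min")
series = pd.Series([210.0, np.nan, 205.0, np.nan, np.nan, 190.0, 188.0, np.nan],
                   index=idx)
print(fill_missing(series, have_48h_history=False))
```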

The machine learning model prediction requires the temperature as input: both an average temperature, as with the demand, and a temperature for the time of the prediction, which is extracted from the file of actual temperatures. If there is enough data 48 hours back, then the machine learning prediction is used to fill in the missing data.

Decreased precision may become a problem if many of the demand data values are approximated using interpolation or predictions. The program responsible for displaying demand predictions shows a warning to the user if too many values within the past 24-48 hours are approximated. This feature is achieved by storing a flag when saving the demand data. The flag can be either interpolated, predicted or real. The warning is shown if the program that displays demands reads too many values flagged as either interpolated or predicted.

The heat demand data from Azure can also be used to determine how close the actual demand is to the prediction. The model has been trained on heat demand data using fewer sensor values than what is collected from Azure. This means that the predictions from the automated program are not as precise as they would have been if both programs used the same heat demand sensors.

4.4 Displaying predictions

The primary goal in the project is a program that reads saved data, makes predictions and displays the results. This program combines all the weather and heat demand data that has been produced by the other modules and uses this data to make predictions for the future heat demand.

The machine learning model has several input parameters such as the date, weather forecast temperatures, average demand and outside temperature for the previous 24-48 hours.

The temperature for the past 24-48 hours is read from file, and the average temperature is calculated and used as input to the machine learning model. The average demand is calculated in a similar manner, with the difference that it also checks a flag for whether each value is approximated or not. The demand can consist of data collected from the Azure cloud server, but it is also possible that a value is either interpolated or predicted. If the number of interpolated or predicted demand values in the 24-48 hour interval exceeds a threshold percentage, a message to the console notifies plant operators that the current demand predictions are based on approximated input parameters. The threshold percentage is set to 25% when calculating the average demand, but this parameter can easily be adjusted.
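A minimal sketch of the averaging and warning logic, assuming a hypothetical list of (value, flag) pairs for the 24-48 hour window:

```python
APPROXIMATION_THRESHOLD = 0.25   # 25%, adjustable

def average_demand(window):
    """window: list of (demand_wh, flag) pairs for the past 24-48 hours,
    where flag is one of "real", "interpolated" or "predicted"."""
    approximated = sum(1 for _, flag in window if flag != "real")
    if approximated / len(window) > APPROXIMATION_THRESHOLD:
        print("Warning: current demand predictions are based on "
              "approximated input parameters.")
    return sum(value for value, _ in window) / len(window)

window = [(210.0, "real"), (205.0, "interpolated"),
          (200.0, "real"), (195.0, "real")]
print(average_demand(window))
```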

Training a machine learning model is a time intensive process compared to making a demand prediction [1]. The model is trained in a program developed by RISE. The model is not altered in our project and can therefore be trained once, saved to file and imported in the program that makes and displays predictions.
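The report does not name the persistence mechanism; a common approach for scikit-learn models, sketched here with joblib and a hypothetical file name, is:

```python
import numpy as np
from joblib import dump, load
from sklearn.svm import SVR

# In the training program (run once), with stand-in training data:
model = SVR().fit(np.random.rand(100, 6), np.random.rand(100))
dump(model, "heat_demand_model.joblib")     # hypothetical file name

# In the prediction program (run continuously):
model = load("heat_demand_model.joblib")
prediction = model.predict(np.random.rand(1, 6))
print(prediction)
```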

The model uses input parameters that are scaled to be between zero and one. The output from the model is also scaled to the same interval. When making predictions using the model, scaling needs to be applied in order to get correct predictions. The scaling factors are exported to file in the model training program and can be imported by programs that make model predictions.

All the input parameters for the machine learning model are formatted and scaled to the same interval as is used to train the model, as explained in section 2.2. The model predicts the heat demand from these input parameters, and the output is scaled back to get the unit in watt hours (Wh). The prediction of future heat demand is saved to file in order to allow further data analysis.

Heat demand predictions for the following eight hours are displayed in a graph, which automatically updates every hour with the latest demand predictions made on the latest weather forecast. Predictions even further into the future are possible, but this information is not useful for plant operators, who are only concerned with heat demands less than eight hours ahead. It is worth noting that the weather forecast is more precise the shorter into the future it is.
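A minimal sketch of such an hourly-updating graph with matplotlib, assuming a hypothetical latest_predictions helper that returns timestamps and predicted demand for the next eight hours:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def latest_predictions():
    # Placeholder: the real program reads the saved input data and runs
    # the trained model; illustrative values are returned here.
    times = pd.date_range(pd.Timestamp.now().floor("h"), periods=8, freq="h")
    return times, 200 + 20 * np.random.rand(8)

plt.ion()                          # interactive mode: redraw without blocking
fig, ax = plt.subplots()
while True:
    times, demand = latest_predictions()
    ax.clear()
    ax.plot(times, demand, marker="o")
    ax.set_xlabel("Time")
    ax.set_ylabel("Predicted heat demand (Wh)")
    fig.canvas.draw()
    plt.pause(3600)                # refresh once every hour
```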

4.5 Displaying multiple predictions

The impact of changes in weather forecasts may be important for plant operators to take into consideration when deciding a proper heat output from the plant. By implementing a program that displays alternative predictions, the impact of changes in temperature predictions can be estimated. This is achieved by adding or subtracting a constant temperature to the weather forecast and then running these alternative temperatures through the machine learning model to predict the heat demand. This produces several predictions with different heat demands, which are displayed in the same graph. From this information it may be possible for plant operators to see if a small change in temperature in the forecast results in a large change in the predicted heat demand.
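A minimal sketch of computing these alternative predictions, with a hypothetical predict_demand helper standing in for the trained model:

```python
import numpy as np

OFFSETS = [-2.0, 0.0, +2.0]    # degrees Celsius added to the forecast

def predict_demand(forecast_temps):
    # Placeholder for running the trained model on a temperature forecast;
    # a crude linear relation (colder means more demand) stands in here.
    return 300 - 5 * np.asarray(forecast_temps)

forecast = np.array([12.0, 11.5, 10.8, 10.2])   # illustrative forecast
for offset in OFFSETS:
    demand = predict_demand(forecast + offset)
    print(f"forecast {offset:+.0f} degrees:", demand)
```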

4.6 Displaying historical and predicted demand

A goal defined by RISE is to display historical and predicted demand in the same graph. This information is interesting since it provides a basis for analyzing the precision of the machine learning model predictions. This is implemented by saving heat predictions from the model to file, which is done in the program responsible for displaying predictions. These files are read, and the predictions for a certain forecast time, for example one hour ahead, are extracted and displayed in a graph together with the recorded historical demand.


5 Result

In the project several program modules are developed in order to fulfill the project goals. Three program modules are used to gather the data that is needed as input to the machine learning model: one for gathering temperature forecasts, one for the actual temperatures and one for downloading heat demand from Azure. These program modules are scheduled to run once every ten minutes by a scheduling program. Each of the modules saves the data it collects to file. The files are in comma-separated values (CSV) format and contain timestamps together with a value, where the timestamp is the time that the value is valid for.

The program module that downloads historical demand from Azure can handle cases where data is missing. It does this by either using the machine learning model to make predictions for the missing values or, if it lacks enough data to make a prediction, interpolating from demand data that it has received. When it runs, the program checks the data files to determine whether there is enough data to make predictions for missing values or whether interpolation is needed. The model needs temperature and demand data for the past 24 to 48 hours in order to make a demand prediction, so the program module tries to make sure that there is data in the file for this period. This is done by downloading everything in the Azure blob storage for the past 72 hours. By downloading 72 hours instead of 48, one can ensure that demand values exist that can be used to interpolate missing data.

The primary goal in the project is a program that reads input data from different files, makes predictions for the coming eight hours and plots the result. This produces a graph like the one in figure 5.1. The latest received input data is used to make new heat demand predictions every hour. The graph of predicted heat demand is updated with the latest predicted heat demands automatically. The predictions from this program are saved to file, which allows the results to be analyzed and used by other programs.

Figure 5.1: Future demand calculated from weather forecast

A secondary goal of the project is to display predictions made on modified temperature data. This feature is added by modifying the program that displays and makes predictions so that it also makes alternative predictions on the weather forecasts. Alternative predictions are made on the temperature forecast with a temperature offset of plus or minus two degrees Celsius. These alternative predictions are plotted in a graph, as can be seen in figure 5.2. The dotted line shows the predictions on the unmodified weather forecast, whereas the top boundary is predicted on the forecast minus two degrees, and the bottom boundary on the forecast plus two degrees.

Figure 5.2: Heat demand predictions for a +/- 2 degrees Celsius change in forecast temperature. The area above the line shows predictions for colder temperatures and the area below for warmer.


Another program that plots data was also developed. It reads the predictions that were computed and saved to file by the first plotting program and compares them with the historical heat demand downloaded from Azure. A graph of this is seen in figure 5.3, in which one can see that the predicted heat demand is, on average, higher than the actual heat demand. The heat demand coming into Azure was inspected, and it was found that some of the sensors in Surahammar did not deliver correct heat demand values to Azure, causing the aggregated heat demand to be lower than it should be. Some of the values were zero when arriving at the Azure server.

Figure 5.3: Graph of actual and predicted heat demand

The weather forecast temperature gathered from SMHI is different from the recorded temperature from Trafikverket. The temperature data from Trafikverket seemed to be consistently lower than the SMHI predictions, by as much as two degrees Celsius.


6 Discussion and conclusion

6.1 Discussion

The machine learning model developed by RISE in the project ”Smarta flöden” can be useful to district heating operators in a real-time scenario if heat demand predictions are automated. Heat demand predictions can be made without automation, but they require several processing steps, which is time intensive if done manually. Tools for automating these processing steps have been implemented and combined to create a system which displays the predicted heat demand for the following eight hours.

The machine learning model predictions need input data in order to work. If input data to the model does not exist, the program will not function properly. This puts constraints on how the data collection works and how it handles problems, such as missing or delayed input data.

Creating a robust system that is able to function without requiring all input data is a difficult task and is highly dependent on what assumptions are made about the accessibility of input data. The system developed in this project makes the assumption that it is possible to access and collect weather data from Trafikverket and SMHI at least once every hour. The assumption for heat demand data collection is that there will exist data for the past 48 to 72 hours in the Azure blob storage. This assumption is necessary in order for interpolation of missing data to work. By approximating missing input data, the system is more robust, at the cost of being more susceptible to incorrect heat demand predictions that arise from uncertainties in the approximations.

The demand predictions made by the demand predicting program are, on average, higher than the recorded heat demand, as seen in figure 5.3. This is most likely due to the fact that some of the measured demand is missing when arriving in Azure. It could also be that the model has been trained on a smaller number of heat demand sensor values than what the automated program uses. In order to improve the precision of the predictions, the prediction model needs to have the same input data as the machine learning model is trained on. Figure 2.1 shows what the precision can be if the model makes predictions on the same type of data that it has been trained on, which is significantly better than what is seen in figure 5.3.

The precision of the predictions can be improved by changing the data that arrives at the Azure blob storage and making sure that it is the same data as the model is trained on. Alternatively, the precision can be improved by retraining the model on data gathered throughout this project. The original model is trained on weather and demand data that was collected over a period of more than two years. The six weeks of data collected throughout this project are not enough for the model to reach as good precision as the original model.

If the model had been trained on the same data as is being collected from SMHI, Trafikverket and Azure, it would be possible to retrain the model to incorporate new data and thereby possibly improving the precision for the model predictions.

There are other possible explanations for why the predictions deviate from the recorded heat demand. One is that the weather data used to train the model comes from a temperature sensor whose data is not the same as the data collected from SMHI and Trafikverket. This might affect the precision of the model if the different temperature sensors record different temperatures.

The consistent difference in temperature between Trafikverket and SMHI indicates that there might be a systematic error in either the forecast or the temperature sensor. This might impact calculations of the average temperature and thereby affect the precision of the machine learning model.

The graph of predictions on modified forecast temperatures (figure 5.2) shows that the impact of a temperature change on the predictions depends on more variables than the time. The maximum and minimum predicted heat demand converge or diverge depending on what the other input variables are. The purpose of this graph is to let district heating plant operators better estimate how a temperature change in the forecast will affect the predicted demand for that time.

This is a crude way of approximating confidence intervals for the weather forecast. If SMHI had provided data about the confidence of their weather forecasts in their open API, this data could have been incorporated in the heat demand prediction graph. Unfortunately this is not the case, and the approximation of using a two degree difference in the forecast is used instead.

6.2 Conclusion

In conclusion, it is possible to automate a program that makes predictions of the future heat demand, and all of the goals of the project have been achieved. Modules for collecting data from different sources have been developed, as well as a program that makes predictions based on the collected data and displays this in a graph. The assumption of accessibility of input data can be weakened by approximating missing input data. This might impact the precision of the predictions in a negative way, but it also makes the system more resilient to crashing due to missing input data.

6.3 Future work

The developed system is limited to the region Surahammar. The program modules for collecting weather and heat demand data are implemented in such a way that they can easily be modified to work for other regions as well. This allows expansion of the program to work for several districts in Västerås.

The heat demand that is collected could be made more accurate. Instead of aggregating the heat demand measurements in Azure, the individual sensor values could be sent and stored in Azure. Some of the demand values arriving at Azure were incorrect, and by skipping a step in the dataflow and sending the sensor values directly to Azure, it might be possible to eliminate this problem.

The sensors record the total use of heat since they started measuring. To calculate the demand for a period of time, the sensor value at the start of the period has to be subtracted from the value at the end of the period. This means that a program calculating this difference has to keep track of the demand from each individual sensor. Each sensor has an individual id, so this could be done. A problem with this approach, and the reason the values were aggregated, is that the demand from individual buildings can be seen. This could, at least partly, be solved by hashing the sensor ids to make them anonymous.


References

[1] Chen, Chia-Yu. Deep Learning Training Times Get Significant Reduction. 2018. URL: https://www.ibm.com/blogs/research/2018/02/deep-learning-training (visited on 05/21/2019).

[2] Christmann, Andreas and Steinwart, Ingo. Support Vector Machines. Springer, New York, NY, 2008.

[3] Python documentation. 5.5. Dictionaries. URL: https://docs.python.org/3.7/tutorial/datastructures.html#dictionaries (visited on 05/27/2019).

[4] Galin, Daniel. Software Quality - Concepts and Practice. John Wiley and Sons, 2018.

[5] Hsu, Chih-Wei, Chang, Chih-Chung, and Lin, Chih-Jen. A Practical Guide to Support Vector Classification. Guide. Department of Computer Science, National Taiwan University, 2010.

[6] MälarEnergi. Kraftvärmeverket - Västerås hjärta. 2018. URL: https://www.malarenergi.se/om-malarenergi/vara-anlaggningar/kraftvarmeverket/kraftvarmeverket-vasteras/ (visited on 03/01/2019).

[7] Microsoft. What is Azure? 2019. URL: https://azure.microsoft.com/en-us/overview/what-is-azure/ (visited on 04/05/2019).

[8] Myers, Tamra, Wooley, Tessa, and Sharkey, Kent. What is Azure Blob storage? 2018. URL: https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blobs-overview (visited on 04/12/2019).

[9] Rise Sics Västerås. Smarta Flöden - State of the art. 2017.

[10] Scikit-learn developers. Scikit-learn. URL: https://scikit-learn.org/stable/index.html (visited on 05/14/2019).

[11] Scikit-learn developers. sklearn.preprocessing.MinMaxScaler. URL: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html (visited on 05/14/2019).

[12] SMHI. SMHI Open Data API Docs - Meteorological Forecasts. URL: https://opendata.smhi.se/apidocs/metfcst/index.html (visited on 05/16/2019).

[13] Svenman Wiker, Linnéa. Optimerad fjärrvärme med rätt produktionsmängd genom smarta flöden. 2016. URL: https://www.sics.se/media/news/optimerad-fjarrvarme-med-ratt-produktionsmangd-genom-smarta-floden (visited on 02/27/2019).

[14] Sveriges meteorologiska och hydrologiska institut. Väder Väderprognoser Klimat och Vädertjänster i Sverige. 2019. URL: https://www.smhi.se/ (visited on 04/09/2019).

[15] Swedish Energy Agency. District heating. 2015. URL: http://www.energimyndigheten.se/en/sustainability/households/heating-your-home/district-heating/ (visited on 03/26/2019).

[16] Trafikverket. Trafikverket Sverige. 2019. URL: https://www.trafikverket.se/ (visited on 04/09/2019).

[17] Wong, Lisa and Park, Jino. Azure libraries for Python. 2017. URL: https://docs.microsoft.com/en-us/python/azure/python-sdk-azure-overview?view=azure-python (visited on 05/28/2019).


TRITA-EECS-EX-2019:165
