Predicting Container-Level Power Consumption in Data Centers using Machine Learning Approaches


Predicting Container-Level Power
Consumption in Data Centers using
Machine Learning Approaches

Rasmus Bergström

Computer Science and Engineering, master's level 2020

Luleå University of Technology

Department of Computer Science, Electrical and Space Engineering


Due to the ongoing climate crisis, reducing waste and carbon emissions has become a hot topic in many fields of study. Cloud data centers account for a large portion of the world’s energy consumption. In this work, methodologies are developed using machine learning algorithms to improve prediction of the energy consumption of a container in a data center. The goal is to share this information with users ahead of time, so that they can make educated decisions about their environmental footprint.

This work differentiates itself in its sole focus on optimizing prediction, as opposed to other approaches in the field where energy modeling and prediction have been studied as a means of building advanced scheduling policies in data centers.

In this thesis, a qualitative comparison between various machine learning approaches to energy modeling and prediction is put forward. These approaches include Linear, Polynomial Linear and Polynomial Random Forest Regression, as well as a Genetic Algorithm, LSTM Neural Networks and Reinforcement Learning.

The best results were obtained using Polynomial Random Forest Regression, which produced a Mean Absolute Error of 26.48 % when run against data center metrics gathered after the model was built. This prediction engine was then integrated into a Proof of Concept application as an educative tool for estimating which metrics of a cloud job have what impact on the container power consumption.


This work was performed at Xarepo AB. It was made possible by access to anonymized data from the Oulu and RISE Luleå data centers, as part of the cooperation in the ArctiqDC project, funded by Interreg North.

Special thanks to Marcus Liwicki, Saleha Javed and Olov Schelén for great input and support throughout the entire duration of the thesis.


Contents

Chapter 1 – Introduction
  1.1 Research Question
  1.2 Delimitations
  1.3 Evaluation
  1.4 Thesis Structure

Chapter 2 – Related Work
  2.1 Green Computing
  2.2 Resource Allocation Policy
  2.3 Workload Analysis and Prediction
  2.4 Energy Consumption Analysis and Prediction

Chapter 3 – Theory
  3.1 Contributing Factors
    3.1.1 Data Center Related
    3.1.2 Workload-Related
    3.1.3 Environment Related
  3.2 Candidate Models
    3.2.1 Bayesian Inference
    3.2.2 Genetic Algorithms
    3.2.3 Artificial Neural Networks
    3.2.4 Reinforcement Learning

Chapter 4 – Method
  4.1 Data Collection
    4.1.1 Metrics from Data Centers
    4.1.2 Weather Data
  4.2 Data Exploration
  4.3 General Model Setup
  4.4 Bayesian Inference
    4.4.1 PyMC3
    4.4.2 Linear Regression
    4.4.3 Polynomial Linear Regression
    4.4.4 Polynomial Random Forest Regression
  4.5 Genetic Algorithm
    4.5.1 Individuals
    4.5.2 Fitness
    4.5.3 Selection
    4.5.4 Mutation & Crossover
  4.6 Artificial Neural Networks
    4.6.1 Basic Model
    4.6.2 Long Short-Term Memory
  4.7 Reinforcement Learning
  4.8 Proof of Concept

Chapter 5 – Results
  5.1 Data Exploration
  5.2 Bayesian Inference
    5.2.1 Linear Regression
    5.2.2 Polynomial Linear Regression
    5.2.3 Random Forest Regression
  5.3 Genetic Algorithm
  5.4 Artificial Neural Networks
    5.4.1 Basic ANN
    5.4.2 Long Short-Term Memory
  5.5 Reinforcement Learning

Chapter 6 – Evaluation
  6.1 Qualitative Model Comparison
    6.1.1 Prediction Accuracy
    6.1.2 Prediction Time
    6.1.3 Summary
  6.2 Effectiveness

Chapter 7 – Discussion
  7.1 Research Question
  7.2 Comparison with Previous Work
  7.3 Usefulness of the Findings

Chapter 8 – Conclusion and Future Work
  8.1 Conclusion
  8.2 Future Work


List of Figures

2.4.1 Power consumption prediction results shown in papers over the last decade.
4.1.1 Python string containing a PromQL query.
4.1.2 PromQL query expressed using the Python adapter.
4.3.1 General model setup; the model should find a relationship between the metrics and the power consumption.
4.5.1 The first version of an individual in the Genetic Algorithm. b is the bias, n is the number of degrees in the polynomial, m is the number of parameters.
4.6.1 The architecture schema for the basic ANN model.
4.6.2 An overview of the Encoder-Decoder LSTM chain.
4.7.1 How NAF updates parameters.
4.7.2 The main loop of the Reinforcement Learning approach.
5.1.1 The raw data obtained from the data center. It is difficult to detect correlations by eye.
5.1.2 Principal Component Analysis of container_power with 6 parameters.
5.1.3 Correlation matrix of the parameters and output. The stronger correlation between the power consumption and the network metrics could partly result from the fact that both are averaged node metrics.
5.2.1 Test sample accuracy of simple linear regression between container_cpu_seconds and average node_power per container.
5.2.2 Polynomial linear regression.
5.2.3 Polynomial Random Forest Regression.
5.3.1 The training accuracy per generation when running the Genetic Algorithm. Due to the large error in the beginning it is very hard to visualize, but the decrease in error was quite gradual, and the best level of accuracy was achieved around generation 1000.
5.4.1 The training and validation errors over the course of the training. They show that, unexpectedly, validation errors are lower than training errors.
5.4.2 The training and validation predictions compared to the actual values of the container_power. Note: the values on the x-axis are indexes, not epochs.
5.4.3 The training and validation predictions at epoch 3000, compared to the actual container_power. The values on the y-axis are the container_power in Watt-Hours, with the indexes on the x-axis.
5.5.1 Results of the Reinforcement Learning per epoch. As can be clearly seen on the graph, the result did not converge towards a low absolute error.


List of Tables

3.1.1 Data center metrics
3.1.2 Workload metrics
3.1.3 Environment metrics
4.1.1 Data center metric granularity
5.2.1 Prediction accuracy of linear regression between container_cpu_seconds and container_power
5.2.2 Results of polynomial linear regression. The leftmost column contains the details for the actual values that the regression is trying to predict; the rest of the columns show the distribution of the absolute error achieved with Kth-degree polynomial linear regression. The best accuracy is marked with boldface.
5.2.3 Results of polynomial random forest regression. The leftmost column contains the details for the actual values that the regression is trying to predict; the rest of the columns show the distribution of the absolute error achieved with Kth-degree polynomial random forest regression. The best accuracy is marked with boldface.
6.1.1 The attempted approaches and their outcomes.


Chapter 1 – Introduction

A large portion of today’s computation is carried out in data centers. Cloud providers transparently manage the entire infrastructure, from cooling and hardware to virtualization and horizontal scaling. While this greatly reduces the operational burden on the user, the opacity of the underlying resources can weaken the association between the jobs submitted to the data center on the one hand, and the pollution they cause on the other.

When browsing the offers of the main cloud vendors it is clear that energy consumption is not meant to drive the customer’s choice of cloud provider.

Foundational to this work is the belief that people are eager to make a difference in the fight against the climate crisis. With access to advanced models for predicting and understanding the energy consumption of cloud systems, they could be informed of the size of their environmental footprint ahead of time. This would empower users to make educated choices with regard to the green footprint of their technology usage.

Most research in the field [1][2][3] is related to making data centers more energy efficient, meaning that energy modeling and prediction have been studied as a means of building advanced scheduling policies for data centers. This work differentiates itself from other research in the field in that it deals solely with improving the prediction accuracy itself, with the intention of informing users.

This chapter outlines the goal of the project by stating the research question and breaking it down into actionable steps. It also covers practicalities such as the delimitations and evaluation of the project.


1.1 Research Question

Can the prediction of energy consumption of a job in a data center be opti- mized using machine learning methodologies?

In the context of this thesis, prediction does not refer to foretelling the power consumption at a different (later) time. Instead, the term refers to estimating the consumption given a collection of other metrics. This is seen as a necessary step towards later being able to forecast future energy consumption (see Section 8.2).

The research question can be broken down into the following steps:

1. Select metrics that impact the energy consumption of a container running in a data center.

2. Figure out how to attain data for these metrics and clean it for use.

3. Develop multiple models to predict the energy consumption of a container based on the metrics chosen.

4. Evaluate the different models and select the best one.

5. Present the metrics and predictions to the user in a way that enables and motivates them to reduce the environmental impact of their cloud usage.

1.2 Delimitations

This project is focused solely on making accurate predictions, with the goal of helping users make educated decisions about their cloud energy performance. What follows is a list of concerns that could be meaningful to explore, but that are considered outside the scope of the project, with motivations as to why.

• Taking any (scheduling) actions based on the results of the predictions. This is because the goal is to inform users, not to make a more efficient data center.

• Forecasting energy consumption into the future, since it is a related topic but requires a different approach.

• Changing the prediction model during run-time, since it would require much more engineering and it is unclear what advantages it would have.


• Varied climates and seasons, because only data gathered during the course of the project and from the available data centers will be used.

• User research into what kind of presentation has the greatest impact on people’s desire to sacrifice comfort and ease-of-use for a better environmental footprint, since that research would preferably be performed when it is better known what the nature of possible predictions is.

1.3 Evaluation

Since it was unknown from the beginning of the project whether it would be possible to derive the desired prediction accuracy from the data set, the project had a nature of research and exploration. The following points were to be guiding when evaluating the quality of the findings:

1. A qualitative comparison between the different models attempted in the course of the project (See Section 6.1).

2. An evaluation of the effectiveness of the model when faced with data that it has not yet encountered (See Section 6.2).

3. A critical reasoning about how useful the information obtained is to reduce the environmental footprint of the individual users (See Chapter 7.3).

1.4 Thesis Structure

The chapters of this thesis are structured according to convention, with related work and theory followed by method, results and evaluation. In cases where the theory was considered too short to warrant its own section, it was included directly in the method section. Since the work itself is split over the following parts:

• Contributing Factors

• Data collection and exploration

• Candidate Models (Ordered by complexity)

many of the chapters follow that same division. In order to help the flow of the report, these parts appear in the same order in each chapter.


Chapter 2 – Related Work

The project started with an extensive literature review. Because the problem statement differs from most related research, it was deemed necessary to cast a wide net, researching many adjacent problems to get a comprehensive understanding of the field. The chapter is split into four sections.

First, papers regarding Green Computing were explored. Efforts to discover more green ways to develop and deploy software are not a new phenomenon. This research was aimed at discovering to what extent cloud emissions had been explored.

Then, in order to learn more about data centers and the ways energy efficiency is currently measured and optimized, a large collection of papers suggesting scheduling optimizations was read. Though some of these documents mention energy-awareness, they did not suggest actual power modeling schemes. Instead, they were valuable for the insights they gave into the world of data center energy consumption as a whole.

One big difference between the approach taken in this thesis and those of most other studies is that in those studies, the user-provided workload is considered an input of unknown weight. In the case of the problem domain explored in this project the user’s input could be considered known, since the goal is to inform the user of their own impact.

That said, many efforts have been expended to model the expected workload in a data center at a given time. Though the goal is different, theirs was also a prediction endeavor, and could therefore provide useful learnings to help in this project.

Finally, the papers most relevant to this research are covered. These are papers that deal directly or indirectly with cloud center energy consumption analysis and prediction.

Many of these papers still do not discuss the accuracy of their predictions, since prediction is seen as a means to improve scheduling algorithms, and those algorithms are the main purpose of the paper.


2.1 Green Computing

Though energy consumption has been on the minds of hardware manufacturers from the beginning, it has rarely been a main focus for software. In recent years, the intensifying climate crisis and the proliferation of cloud computing, with its virtualization and containerization, have started to change this mindset. Though some findings have been made, Hindle [4] has outlined the continuing large need for research in the space.

In an instructive article from 2013, Chauhan et al. [5] introduced a framework for thinking holistically about green infrastructure throughout an organization. They postulated that in order to achieve lasting impact in a large corporation, the green mindset has to be present from requirements and design all the way to test and deployment.

They also suggested that the customers should be given the ability to hold cloud vendors accountable by tracking energy consumption limits in the Service-Level Agreements.

Ardito et al. [6] used power profiles from mobile devices to show that, more often than not, performant code is green code. They argue that established practices such as refactoring and eliminating dead code can have a large impact on device energy consumption.

2.2 Resource Allocation Policy

The main goal of most energy-aware techniques used in data centers is to improve energy efficiency by implementing better scheduling and resource allocation. This is a hot topic, with an explosion of papers over the last decade, all suggesting innovative optimization techniques.

Of these, many have achieved promising results with regards to energy consumption by presenting novel algorithms for Virtual Machine (VM) allocation (See Berral et al. [7, 8], Qiu et al. [2], Portaluri et al. [9], Fang et al. [10]) that take the maximum energy consumption of the Physical Machines (PM) into account when deciding where to put the VMs. Most of these algorithms operate under the assumption that the best way to improve energy efficiency is to place the VMs as densely as possible on a subset of the PMs, so that the rest of the PMs can be turned off completely, thus saving energy.

He et al. [11] decided to also account for the energy price and the potential availability of renewable energy sources. They registered a 60 % improvement compared with a previous solution (though theirs is the only paper that refers to that previous solution).

Wang et al. [12] used thermal data in addition to the energy consumption data and were able to improve energy efficiency, though with a slight rise in SLA violations.

Zhou et al. [13], Radhakrishnan et al. [14], Kar et al. [15] and Javed et al. [16] all used Genetic Algorithms to improve the energy efficiency of clouds when subjected to various workloads. The difference between many of those solutions and the subject of this thesis is that their algorithms change the behavior of the data center in response to the workloads. Shaw et al. [1] and Zhou et al. [17] used Reinforcement Learning to allocate resources for optimal energy efficiency, also with good results.


2.3 Workload Analysis and Prediction

One of the main challenges with estimating energy consumption in data centers is the heterogeneous and dynamic nature of the workloads that the cloud is expected to handle.

Many attempts have been made to accurately and confidently estimate such workloads in order to optimize resource allocation.

Rajarathinam et al. [18] used a non-linear auto-regressive network with exogenous input and were able to show that their method was superior to the purely statistical method they used as reference. Qazi et al. [19] based their approach on Chaos Theory and nearest-neighbors classification in order to develop a framework that allowed them to make fine-grained predictions.

Ramezani et al. [20] applied fuzzy workload prediction and a fuzzy logic system to predict and control future changes in CPU workload. They were able to predict which PMs would become hotspots by continuously looking for poor VM performance. Kalyampudi et al. [21] used a Moving Error Rate to predict the workload of various nodes. They were able to obtain an average error rate of 6.18 % on data sets from 5 different data centers.

Zhang et al. [22] applied deep learning to predict the CPU utilization of VMs, both for the next hour and the next day. They also discuss ways to speed up training using Polyadic Decomposition. They were able to obtain a Mean Absolute Percentage Error of 0.26 and a Root Mean Square Error of 9.97 for 60-minute predictions.

Nwanganga et al. [23] classified a given workload according to its nearest-neighbor structural similarity to historical workloads and used those previous workloads to predict the behavior of the new workload, with successful results in some cases but with varied support and confidence values. They propose introducing more features into the specification.

Ding et al. [24] combined Moving Average and Median Absolute Deviation in order to predict the workload. Unfortunately, they did not focus on revealing the accuracy of that model, but they stated that it was effective relative to other models.


2.4 Energy Consumption Analysis and Prediction

An important component of forecasting is the analysis of past data in order to gain valuable insights. This section covers energy consumption modeling and estimation.

While there have been numerous papers in that domain, it is worth noting that the field has improved rapidly over the last decade and that recent results are far better than older ones. Figure 2.4.1 shows some of the previous results over the last 10 years.

Figure 2.4.1: Power consumption prediction results shown in papers over the last decade.

A large problem when exploring the previous work is the lack of access to benchmark metric and power consumption traces. Each paper seems to refer to different data centers and/or datasets. Most authors state that the traces were from peak data center performance, which is not the case for the datasets used in the course of this thesis. It is also not the norm to provide access to the traces for reproduction of the results. This makes it very difficult to establish what the current state of the art actually is.


Earlier attempts, dating back to 2010, can be found in Meisner et al. [25], who modeled peak power consumption by characterizing the relationship between server utilization and power supply behavior. They were able to predict the peak power trace with an error below 20 %. Meanwhile Dhiman et al. [26], using Gaussian Mixture Vector Quantization, achieved an average error of less than 10 %.

Jaiantilal et al. [3] used linear as well as random forest regression to model energy consumption for scheduling purposes. They did not explicitly state the error they obtained, but from their graphs it looks like the random forest regression was more effective.

In 2016, Dayarathna et al. [27] performed an in-depth study of the existing literature on data center power modeling available at that time, and emphasized taking the entire data center system into account when modeling energy consumption.

Canuto et al. [28] proposed deriving a single model per platform to account for heterogeneity in cloud systems. They surmised that the correlation between certain metrics and energy consumption will vary between platforms, and used a minimum set of indicators for each platform, based on that correlation. At the time, their results were very promising.

Borghesi et al. [29] used random forest regression to predict job power consumption in high-power computing scenarios. They reported that training and predicting went very fast, with a mean error of around 8–9 % over the entire test period (15 % when including outliers).

Li et al. [30] used extensive power dynamic profiling, auto-encoders and deep learning models to try and optimize the accuracy of predictions. They presented two models, one coarse and one fine-grained, and reported 79 % error reduction for certain cases.

Kistowski et al. [31] used multiple linear regression to show that the power consumption of CPU and storage loads could be predicted with a prediction error of less than 15 % across a number of virtualized environment configurations. They further introduced a heuristic for pruning workloads to avoid using workloads that may lead to a decrease in prediction accuracy.

Liu et al. [32] used an LSTM-based approach, landing at a mean absolute error rate of 4.42 % on data center power consumption. Ferroni et al. [33] used a divide and conquer approach to model power consumption of heterogeneous data centers. They were able to achieve a relative error of 2 % on average and under 4 % in almost all cases. Instead of building one comprehensive model they identified distinct working states of the system and built a model for each of them.

Rayan et al. [34] used polynomial regression to predict power consumption as well as the number of physical machines needed, all based on the daily workload. They did not share numbers for the error but the graphs seemed to show good results.

Hsu et al. [35] made a feature selection from over 4000 operational trace data variables and ran them through a non-linear auto-regressive exogenous model. They used sliding-window and validation data sets for model building and were able to achieve a mean squared error of 1.13 %.


Khan et al. [36] studied node power consumption and discussed approaches to future estimation. They covered vast amounts of log data with statistical and machine learning analysis and were able to estimate plug energy consumption with a mean absolute error rate of 1.97 %. They found that the biggest impact came from failed jobs, as well as from the CPU and Memory metrics.

Patil et al. [37] suggested forwarding an ensemble of base predictors (Exponential Smoothing, Auto-Regressive Integrated Moving Average, Nonlinear Neural Network and Trigonometric Box-Cox Auto-Regressive Moving Average Trend Seasonal Model) to a fuzzy neural network with a self-adjusting learning rate and momentum weight.

Yi et al. [38] used two LSTMs in tandem to predict the temperature and energy consumption of the processor in the next step of their resource allocation algorithm. They found that a single LSTM yielded inferior prediction accuracy. With the tandem approach they achieved a root mean square error of 3 %.

Kistowski et al. [39] introduced an off-line power prediction method that used the results of standard power rating tools. They used a selection of four different formalisms, from which they attempted to automatically select the best one. They were able to achieve an average error of 9.49 % for three workloads running on real-world, physical servers.

Yi et al. [40] showed that deep reinforcement learning can be effective when allocating compute-intensive jobs in data centers. They used an expectation-maximization algorithm to construct a Gaussian mixture model. They found that constructing separate LSTM networks for each of the clusters led to higher prediction accuracy.


Chapter 3 – Theory

In this chapter the supporting theory is put forth. It builds heavily on the research conducted in Chapter 2 and on other sources. Section 3.1 walks through the various data metrics that could be important for the project, while Section 3.2 outlines the theory behind the various methodologies used to perform data analysis and prediction.

Due to the lack of related work in exactly the same problem domain, the theory portion of this thesis is limited. Most of the models used were developed by trial and error, and are covered in Chapter 4.

3.1 Contributing Factors

This section outlines the selection of metrics used as parameters when analyzing and predicting power consumption. Section 3.1.1 describes actual metrics from the data center. Section 3.1.2 covers factors that are related to the workload. Section 3.1.3 contains factors related to the environment in which the data center operates.


Table 3.1.1: Data center metrics

Name            Description          Unit
cpu_seconds     CPU Processing Time  Seconds
memory_bytes    Memory Usage         Bytes
read_bytes      File System Read     Bytes
write_bytes     File System Write    Bytes
receive_bytes   Network Receive      Bytes
transmit_bytes  Network Transmit     Bytes
power           Node Plug Power      Watt-Hours

Table 3.1.2: Workload metrics

Name            Description                      Unit
payload_bytes   The size of the payload          Bytes
payload_cycles  Number of cycles to perform job  Scalar

3.1.1 Data Center Related

There are hundreds of metrics available from most data center monitoring systems. What can be difficult when switching from one distributed setup to another is comparing the metrics; it is easy to end up with an apples-to-oranges comparison. To combat this, the focus was put on the most straightforward measurements: data that the users themselves could experiment with.

The metrics chosen can be seen in Table 3.1.1. It is worth noting that many pa- pers [26][32] focused predominantly on cpu_seconds and memory_bytes for prediction purposes.
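Since the figure captions in Chapter 4 show these metrics being collected with PromQL, the sketch below illustrates how such a query could be issued against a Prometheus HTTP API. The base URL and the exact metric name are illustrative assumptions, not the actual setup used in this thesis.

```python
import json
import urllib.parse
import urllib.request

def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return base_url + "/api/v1/query?" + urllib.parse.urlencode({"query": promql})

def query_prometheus(base_url, promql):
    """Run the query and return the parsed result list."""
    with urllib.request.urlopen(build_query_url(base_url, promql)) as resp:
        return json.load(resp)["data"]["result"]

# e.g. per-container CPU usage rate over the last 5 minutes
# (hypothetical server address and metric name):
url = build_query_url("http://prometheus:9090",
                      'rate(container_cpu_usage_seconds_total[5m])')
print(url)
```

In a real deployment, `query_prometheus` would return one time series per container, which can then be joined with the other metrics in Table 3.1.1.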

3.1.2 Workload-Related

The properties of the job the user submits impact its performance. Armed with knowledge about what impact their decisions make, the user can make educated decisions about how to optimize. In Table 3.1.2, metrics are highlighted which describe the workload.


Table 3.1.3: Environment metrics

Name                 Description                                Unit
temperature          Air temperature                            Celsius
wind_speed           Wind speed                                 km/h
weather_description  Human-readable description of the weather  String
pressure             Air pressure                               hPa
humidity             Air humidity                               %
time_of_day          Time of day                                Seconds
month                The number of the month                    [1-12]
power_price          The average price of power that day        SEK

3.1.3 Environment Related

A data center does not operate in a vacuum. There are a number of factors in the environment, and many of them might impact the performance of the data center. The metrics chosen for the various factors for use in training prediction models can be found in Table 3.1.3.

3.2 Candidate Models

There are numerous ways to perform data analysis and prediction, ranging from simple linear regression to more advanced approaches. Indeed, as can be seen in Chapter 2, many different approaches have been tried in adjacent problem domains with notable success. In order to make an interesting study, it was deemed wise to try a variety of different approaches and perform a comparative analysis between them.

This section outlines the supporting theory for the approaches attempted during the project. Section 3.2.1 covers Bayesian Inference, which was chosen as a statistical baseline to build upon. Section 3.2.2 explores an evolutionary approach through Genetic Algorithms. Section 3.2.3 covers Artificial Neural Networks, Recurrent Neural Networks and Long Short-Term Memory, exploring how they can be optimized for the task at hand. Finally, Section 3.2.4 evaluates the merits of Reinforcement Learning as applied to the problem formulation.
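As one concrete illustration of the simpler end of this spectrum, a polynomial regression between a single metric and the power consumption can be fitted by least squares. The sketch below uses synthetic, noise-free data and a hypothetical cpu_seconds/power relationship; it is not the actual model or data used in this thesis.

```python
import numpy as np

# Hypothetical sketch: fit container power as a K-th degree polynomial of a
# single metric by least squares. The "true" relationship is made up.
rng = np.random.default_rng(0)
cpu_seconds = rng.uniform(0, 10, 100)
power = 0.5 * cpu_seconds ** 2 + 2 * cpu_seconds + 5   # synthetic ground truth

coeffs = np.polyfit(cpu_seconds, power, deg=2)          # K = 2
predicted = np.polyval(coeffs, cpu_seconds)

mae = np.mean(np.abs(predicted - power))                # near zero on noise-free data
print(round(float(mae), 6))
```

On real, noisy metrics the degree K becomes a tuning parameter, which is exactly the comparison made for the polynomial models in Chapter 5.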


3.2.1 Bayesian Inference

Bayes’ Theorem [41] is a mathematical framework for estimating the probability of an event based on some initial belief or knowledge that we have, commonly known as the prior. The scenario is the following: we have just observed event B, and we are trying to estimate P(A|B). According to Bayes’ Theorem (see Equation 3.1), we can then use the prior P(A) to estimate it.

P(A|B) = P(B|A) P(A) / P(B)    (3.1)

Bayes’ Theorem has many applications, one of which is Bayesian Inference, which refers to the process of extracting properties from data using Bayes’ Theorem. Equation 3.1 can then be rewritten as shown in Equation 3.2 (Θ represents the prior distribution).

P(Θ|data) = P(data|Θ) P(Θ) / P(data)    (3.2)

In other words, one makes an initial assumption about the distribution of the data given a set of parameters. One uses this prior to make a prediction based on the next data point observed. The actual value can then be compared with the prediction in order to find the error, which is then used to update the prior distribution. With more observations, the prior becomes more and more accurate and eventually becomes the final prediction of the algorithm. Bayesian inference has the advantage that it can be performed on-line and is relatively quick in most cases.
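A minimal sketch of this updating process, assuming a single slope parameter and Gaussian noise (both illustrative choices, not the setup used in this thesis), could look as follows, using a coarse grid approximation of Equation 3.2:

```python
import numpy as np

# Hypothetical sketch: Bayesian inference of a slope theta in
# y = theta * x + noise, via a grid approximation of the posterior.
rng = np.random.default_rng(0)
true_theta = 2.0
x = rng.uniform(0, 10, size=50)
y = true_theta * x + rng.normal(0, 1.0, size=50)

thetas = np.linspace(0, 4, 401)           # candidate values of theta
log_prior = np.zeros_like(thetas)         # flat prior P(theta)

# Log-likelihood P(data | theta), assuming Gaussian noise with sigma = 1.
residuals = y[None, :] - thetas[:, None] * x[None, :]
log_lik = -0.5 * (residuals ** 2).sum(axis=1)

log_post = log_prior + log_lik            # unnormalized log posterior
post = np.exp(log_post - log_post.max())
post /= post.sum()

estimate = thetas[np.argmax(post)]        # MAP estimate, close to true_theta
print(round(float(estimate), 2))
```

In the actual experiments, PyMC3 (Section 4.4.1) plays the role of this grid: it samples from the posterior instead of enumerating it.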

3.2.2 Genetic Algorithms

Genetic Algorithms is the name for a large group of algorithms inspired by Darwinian evolution and molecular genetics, more specifically by the biological processes in chromosomes [14]. In essence, Genetic Algorithms are random search algorithms with the ability to self-organize, adapt and learn [13].

The methodology was originally introduced [42] as a probabilistic optimization algorithm. To apply the terms used by Darwin [43]: nature (the environment) is represented by the problem definition, and individuals (chromosomes) are represented by candidate solutions. A set of individuals is known as a population.

Genetic Algorithms work as follows. To start the process, a population is initialized in a way that in some manner maps to the problem definition. The individuals are then scored using a fitness function that evaluates how well they solve the problem; this is known as selection. The fittest individuals are then allowed to reproduce, exchanging genes and then splitting to create a new generation in the crossover step.

Finally, mutation is allowed to take place by arbitrarily changing a subset of individuals, after which the new generation is ready to take on nature. This process continues until some predefined fitness criterion has been met.
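The initialize–select–crossover–mutate loop above can be sketched in plain Python. The toy fitness function, operators and hyperparameters below are illustrative only; they are not the DEAP-based implementation used later in this thesis.

```python
# Illustrative generational loop: selection -> crossover -> mutation.
import random

random.seed(42)

def fitness(ind):
    # Toy fitness: maximize the sum of the genes.
    return sum(ind)

def evolve(pop, generations=50, mut_prob=0.1):
    for _ in range(generations):
        # Selection: keep the fitter half of the population.
        pop.sort(key=fitness, reverse=True)
        parents = pop[: len(pop) // 2]
        # Crossover: single-point recombination of random parent pairs.
        children = []
        while len(children) < len(pop) - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, len(a))
            children.append(a[:cut] + b[cut:])
        pop = parents + children
        # Mutation: randomly perturb a subset of genes.
        for ind in pop:
            for i in range(len(ind)):
                if random.random() < mut_prob:
                    ind[i] += random.uniform(-0.5, 0.5)
    return max(pop, key=fitness)

population = [[random.uniform(0, 1) for _ in range(5)] for _ in range(20)]
best = evolve(population)
```

A real implementation would stop when a predefined fitness criterion is met rather than after a fixed number of generations.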


3.2.3 Artificial Neural Networks

As the name suggests, Artificial Neural Networks (ANN) take inspiration from the behavior of biological neurons in order to perform learning tasks. In its simplest form, an ANN is a layered system. At each layer, the neurons assign weights to the inputs from the previous layer. By running an experiment many times, one can then let the error propagate back through the system, constantly reassigning the weights to improve the output.

A Recurrent Neural Network (RNN) is an ANN where the result of the previous training step is taken into account when making the next prediction. RNNs have been proven to be successful in solving problems in a wide range of domains. One of their major shortcomings is their inability to remember features further in the past, since more recent results tend to cloud earlier ones.

A Long Short-Term Memory RNN (LSTM) is an attempt to combat this problem by adding channels for accessing such memory further in the past. This approach has been used to address similar prediction problems before. A common thread was to incorporate a pair of chained LSTM networks, known as an autoencoder, where one network is responsible for encoding the historical data, and the second for recreating the original representation based on the encoding. This approach leads to a desired loss between the decoding and encoding, known as a drop-off, that reduces overfitting on a subset of the data features.

3.2.4 Reinforcement Learning

The goal of regular reinforcement learning is to explore and learn from an environment.

There are two main kinds of reinforcement learning: model-based and model-free. In model-based reinforcement learning, supervised learning is used to learn about a domain that is already at least partly known. In the model-free approach, it is assumed that no knowledge of the environment is available ahead of time. Instead, the algorithm works by giving every state in the environment a so-called Q-score. This Q-score is an estimation of the highest possible reward obtainable from that state.

Model-free learning (or Q-learning) is then performed by going through the possible actions, one by one, estimating the state that would result from each action. The algorithm then selects the action that would give the highest Q-score. Whenever the algorithm interacts with the environment, it remembers the outcomes that came from taking a certain action in a certain state and uses them to improve the Q-scores. This is the essence of Reinforcement Learning.
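The Q-score update described above can be sketched with a toy tabular example. The environment here (a short one-dimensional walk with a reward at the rightmost state) and all hyperparameters are illustrative; they have nothing to do with the data center problem itself.

```python
# Illustrative tabular Q-learning on a toy 1-D walk (states 0..4;
# reaching state 4 ends an episode and pays reward 1).
import random

random.seed(0)
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.2
N_STATES = 5
Q = {(s, a): 0.0 for s in range(N_STATES) for a in (-1, 1)}

def step(state, action):
    nxt = max(0, min(N_STATES - 1, state + action))
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward

for _ in range(200):                       # episodes
    s = 0
    while s != N_STATES - 1:
        # Epsilon-greedy: mostly pick the action with the best Q-score.
        if random.random() < EPS:
            a = random.choice((-1, 1))
        else:
            a = max((-1, 1), key=lambda act: Q[(s, act)])
        nxt, r = step(s, a)
        best_next = max(Q[(nxt, -1)], Q[(nxt, 1)])
        # Q-score update: move toward reward + discounted future value.
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        s = nxt

# After training, moving right scores higher than moving left.
```

Note that this only works because the action space is a small finite set; the next section explains why that assumption fails for the problem studied here.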


There are two main challenges with applying Reinforcement Learning to predicting energy consumption.

1. The algorithm as described above is based on the premise that one can cycle through the list of possible actions in a given state and compare all the outcomes. In other words, it is assumed that the action space is finite. This is rarely the case for the physical world; in the problem described in this thesis, the action space is continuous.

This problem can be addressed. In Gu et al. [44, 45], two algorithms were presented that use normalization techniques to make the approach described above applicable to problems with a continuous action space.

2. Reinforcement Learning is essentially about finding causality between correlated actions and rewards. In this problem, however, we cannot actually change the behavior of the data center in order to reduce energy consumption. In the current problem definition, the actions do not impact the state (i.e. the accuracy of our prediction does not change what the next value will be). Thus the algorithm will most likely not converge.


Chapter 4

Method

This chapter outlines the way data was collected and explored, as well as the implementation of the various models used to predict power consumption. These are organized first by general approach and then split into subsections based on individual models.

4.1 Data Collection

In Section 3.1, the various metrics that were thought to impact power consumption were described. This section covers to what extent those metrics were available, and how they were collected and stored. All the data gathered throughout the thesis was anonymized and made available at https://github.com/Xarepo/green-data.

Both the data centers supplying data for the project were running Rancher [46] on top of Kubernetes [47]. This was fortunate, since that setup provides data monitoring out of the box. This data is gathered in real time and stored in a Prometheus [48] time-series database for up to 7 days before being discarded. Accordingly, it had to be gathered continuously throughout the project and stored separately. Section 4.1.1 describes that process in detail, including the production of an adapter for effective extraction of Prometheus data into a Python-friendly format.

The plug power of the nodes in the data center was not part of the monitoring data provided out of the box by Rancher. Instead, plug power consumption was measured separately and added to a separate database for simple extraction. This was considered straightforward enough to not warrant its own section.

The collection of environmental data is covered in Section 4.1.2. Sadly, no metrics were obtainable for the characteristics of the actual jobs running in the data center (the ones discussed in Section 3.1.2).


4.1.1 Metrics from Data Centers

Prometheus allows for queries using PromQL, a Domain-Specific Language for time-series queries. To mitigate the impracticalities of building large query strings, a wrapper layer was built to allow for rapid query composition using Python syntax.

Prometheus data is queried by metric, with an optional subfield to filter the results of the query. To get support from Python introspection, all the available metrics were added to a Python class, providing quick in-editor completion. In order to facilitate filtering, a class attribute lookup was used to convert each metric into a function accepting a list of filters.

The result of the adapter layer was that queries that would previously have been written as Python strings (Figure 4.1.1) could now be written as Python code (Figure 4.1.2), vastly improving productivity, since PromQL syntax errors were now Python syntax errors.

query = (
    'sum(rate(container_cpu_usage_seconds_total'
    + '{name!~".*prometheus.*", image!="", container_name!="POD"}'
    + '[5m])) by (node)'
)

Figure 4.1.1: Python string containing a PromQL query

query = p_sum(
    p_rate(
        p.container_cpu_usage_seconds_total([p_ignore_k8s()]),
        "5m",
    ),
    ["node"],
)

Figure 4.1.2: PromQL query expressed using the Python adapter
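One way such an adapter could be structured is sketched below. The names p, p_sum, p_rate and p_ignore_k8s mirror Figure 4.1.2, but the implementation itself is a hypothetical illustration (using `__getattr__` rather than an explicit metric class), not the thesis code.

```python
# Illustrative sketch of a PromQL query builder in the spirit of the
# adapter described above. Names follow Figure 4.1.2; implementation is
# an assumption, not the thesis code.

class Metrics:
    """Turns attribute access into a function that renders a PromQL selector."""
    def __getattr__(self, metric):
        def build(filters=()):
            body = ", ".join(filters)
            return f"{metric}{{{body}}}" if body else metric
        return build

p = Metrics()

def p_rate(expr, window):
    return f"rate({expr}[{window}])"

def p_sum(expr, by):
    return f"sum({expr}) by ({', '.join(by)})"

def p_ignore_k8s():
    # Exclude Prometheus's own containers and Kubernetes pause containers.
    return 'name!~".*prometheus.*", image!="", container_name!="POD"'

query = p_sum(
    p_rate(p.container_cpu_usage_seconds_total([p_ignore_k8s()]), "5m"),
    ["node"],
)
```

Composing the query in Python this way is what turns PromQL syntax errors into Python syntax errors, as noted above.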


Metric Granularity

Available information granularity differed by metric. Some data could be obtained for each container, some at pod level, and some data was only available at the node level. In Table 4.1.1, the granularity of the different metrics is listed (compare to Table 3.1.1).

Table 4.1.1: Data center metric granularity

Name             Granularity
cpu_seconds      Container
memory_bytes     Container
read_bytes       Container
write_bytes      Container
receive_bytes    Pod
transmit_bytes   Pod
power            Node

It was decided to gather data at two levels. All the available data was added up and gathered at the node level; the names of these data points were prefixed with node (node_cpu_seconds, node_power etc.). Additionally, all the data that could be gathered at container level was also gathered at that granularity, and the names of those data points were prefixed with container. No data was gathered at the pod granularity.

Whenever a data point is used at a finer granularity than was available, as was the case with container_power, container_receive_bytes and container_transmit_bytes, the per-container average of that metric on that node is meant. This could lead to some outliers on sparsely used nodes, but was considered the best way to facilitate the use of those metrics when modeling.


4.1.2 Weather Data

It was considered interesting whether environmental data such as the weather had an impact on power consumption. To investigate this, accurate weather data was needed. After some research about the available weather data APIs, it was decided to use the Weather Underground API [49].

Their API, among other things, gives access to the conditions at Luleå Airport every half hour. Through it, all the data points in Table 3.1.3 except power_price were obtainable. For the sake of simplicity, the weather conditions were then extrapolated to all timestamps within the half hour.
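The extrapolation step amounts to flooring each metric timestamp to the most recent half hour so it can be joined with the half-hourly weather observations. The sketch below is illustrative, not the thesis code.

```python
# Illustrative sketch: snap a timestamp to the most recent half hour so
# metrics can be joined with half-hourly weather observations.
from datetime import datetime

def floor_to_half_hour(ts: datetime) -> datetime:
    return ts.replace(minute=0 if ts.minute < 30 else 30,
                      second=0, microsecond=0)

ts = datetime(2020, 5, 17, 14, 47, 12)
key = floor_to_half_hour(ts)   # joins with the 14:30 weather observation
```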

4.2 Data Exploration

The first approach to data exploration was to make scatter plots of the container_power against each of the parameters available. Unfortunately, on these plots, it was very difficult to discern any correlations with the human eye.

For this reason, Principal Component Analysis (PCA) was performed to try and visualize deeper patterns in the data. Principal components are vectors where the parameters have been encoded in a way that retains meaningful information about the relationship between said parameters. The goal of PCA, therefore, is to reduce the number of dimensions in order to visualize relationships between principal components and the output without losing important data.

A correlation matrix was also made, which is a table that is used to show the correlation between the different parameters.
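Both exploration steps can be sketched with scikit-learn and NumPy. The data below is synthetic and stands in for the six container metrics; one correlated pair is injected so the correlation matrix has something to show.

```python
# Illustrative sketch of the PCA and correlation-matrix exploration on
# synthetic data standing in for the container metrics.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))                               # six synthetic metrics
X[:, 1] = 0.8 * X[:, 0] + rng.normal(scale=0.2, size=500)   # inject a correlation

# Project the 6-dimensional parameter space onto 2 principal components,
# so clusters can be plotted against container_power.
components = PCA(n_components=2).fit_transform(X)

# Correlation matrix between all parameter pairs.
corr = np.corrcoef(X, rowvar=False)
```

In the thesis, the projected points would then be scatter-plotted colored by container_power, and the correlation matrix rendered as a heatmap.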


4.3 General Model Setup

The general setup of all the models was the following. A collection of metrics was submitted as parameters to the model (See Figure 4.3.1). In order to prevent overfitting on past data and to decouple the user impact from the time when the job was submitted, container names and timestamps were not used as parameters. Instead, relationships were sought between the parameters and the power consumption of the container.

The goal of each model was thus to take the list of parameters and, using only that information, make a prediction as to how much power a container with those parameter values is expected to consume.

Figure 4.3.1: General model setup, the model should find a relationship between the metrics and the power consumption.


4.4 Bayesian Inference

The Bayesian modeling commenced with reading up on various applications of Bayesian Inference to prediction problems in Python. From those sources it was surmised that PyMC3 would be the right library for the job; that approach is covered in Section 4.4.1.

When those approaches struggled to handle the large amount of data used in the modeling, the rest of the attempts were performed using scikit-learn [50]. Sections 4.4.2 – 4.4.3 describe the progression from a simple linear regression to more powerful, polynomial models.

In all the regression attempts using the Bayesian modeling techniques, the dataset was split into two smaller sets. The first, roughly 90% of the points, was used to train the model. The other 10% was kept back for testing. In all the attempts made using Bayesian Inference, these latter 10% were used to produce the actual prediction results.

4.4.1 PyMC3

The main sources for the initial implementation were the PyMC3 [51] getting started guides [52], [53] as well as a related blog post [54].

Initially, it was believed that investigating the node data might be sufficient to make extrapolations about the power consumption. A simple linear regression was attempted; then Gaussian Mixture Models (GMM) were tried, as well as ordering the data and introducing switchpoints between the nodes. It was determined that the node data did not contain significant enough insights, so the node models were discarded and the focus moved to examining the data on a container level.

Making that shift meant dealing with data that was roughly 20 times larger than the node data, which made PyMC3 feel very slow. Therefore, it was decided to move the statistical modeling to scikit-learn instead.

4.4.2 Linear Regression

Scikit-learn has a built-in model for Linear Regression. All that was needed to perform linear regression was to provide the input/output pairs and to fit the linear regression model to the data. In order to judge prediction accuracy the dataset was split into a training and a test set. Consistently for the regression models, the training was only performed on the training set, and the final accuracy estimate only calculated using the test set.

There was only so much information that could be extracted from the dataset using linear modeling. Thus, the next step was to introduce the other parameters and to perform polynomial regression.
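A minimal sketch of the setup described above, with a 90/10 train/test split and synthetic data in place of the container metrics, could look as follows. The data-generating relationship is made up for illustration.

```python
# Illustrative sketch: scikit-learn linear regression with a 90/10
# train/test split, on synthetic data standing in for the metrics.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(0, 100, size=(1000, 1))                   # e.g. container_cpu_seconds
y = 0.5 * X[:, 0] + 50 + rng.normal(scale=5, size=1000)   # synthetic container_power

# 90% training, 10% held back for the final accuracy estimate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
mae = np.mean(np.abs(model.predict(X_test) - y_test))
```

As in the thesis, the model is fit only on the training set and the mean absolute error is computed only on the held-back test set.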


4.4.3 Polynomial Linear Regression

Scikit-learn has a preprocessing module for expanding a dataset into its polynomial features. It takes as parameters the dataset and the degree (called K here) of the expansion. For example, given the list [a, b] and K = 2, the list [1, a, b, a², ab, b²] would be returned.

For the polynomial prediction, a pipeline was built that took as input the data, a list of the desired parameters and K. This algorithm first performed a Kth degree polynomial expansion and then passed the result through the Linear Regression module discussed in Section 4.4.2.

4.4.4 Polynomial Random Forest Regression

The Polynomial Random Forest Regression worked in the same way as the Polynomial Linear Regression described in Section 4.4.3. A pipeline was built that took as input the data, a list of the desired parameters and K. This algorithm first performed a Kth degree polynomial expansion and then passed it through the RandomForestRegressor module from scikit-learn.
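A sketch of this pipeline on synthetic data could look like the following; the target function, sample sizes and hyperparameters are made up for illustration and are not those used in the thesis.

```python
# Illustrative sketch: polynomial expansion feeding a random forest.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(300, 2))
y = 10 * X[:, 0] * X[:, 1] + X[:, 1] ** 2     # nonlinear synthetic target

K = 3
model = make_pipeline(
    PolynomialFeatures(degree=K),
    RandomForestRegressor(n_estimators=50, random_state=0),
)
model.fit(X, y)
train_mae = np.mean(np.abs(model.predict(X) - y))
```

The only change from Section 4.4.3 is the final estimator, which is what makes the two pipelines directly comparable.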


4.5 Genetic Algorithm

The Genetic Algorithm modeling commenced with reading up on various implementations of Genetic Algorithms in Python. Different libraries were considered, and DEAP [55] was chosen as a helpful framework for building genetic models. Sections 4.5.1 – 4.5.4 cover the definition of individuals, fitness, selection and crossover respectively.

4.5.1 Individuals

In an approach very similar to the polynomial regression used in the Bayesian Inference model, it was decided to define an individual (see Figure 4.5.1) as a bias b and a two-dimensional (m × n) list W, where m signified the number of parameters supplied to the model, and n the degree of the polynomial expansion. These values were initialized according to a uniform distribution.

Figure 4.5.1: The first version of an individual in the Genetic Algorithm. b is the bias, n is the number of degrees in the polynomial, m is the number of parameters.

Later, using the same idea as in Section 4.4.3, the individuals were simplified to be defined as a one-dimensional list with the same length as the number of variables obtained by running PolynomialFeatures from scikit-learn on the parameters.


4.5.2 Fitness

A prediction for a point I = [i1, . . . , im] was defined as the result of Equation 4.1, using the values from the individual whose fitness was to be determined. The fitness of an individual was defined as its average prediction error over a sample of the input; a sample of 500 points was randomly selected for each generation.

prediction = b
           + w1,1 i1 + w1,2 (i1)² + . . . + w1,n (i1)ⁿ
           + w2,1 i2 + w2,2 (i2)² + . . . + w2,n (i2)ⁿ
           + . . .
           + wm,1 im + wm,2 (im)² + . . . + wm,n (im)ⁿ    (4.1)

Later, when the individuals were simplified to a one-dimensional list, a prediction was redefined as the dot product between the individual and the polynomial expansion of I.

4.5.3 Selection

Tournament selection was used to determine which individuals to bring into the next generation. It refers to the procedure of repeatedly choosing a small number (3 in this case) of random individuals from the population and comparing their fitness. At each step, the individual with the best fitness is selected for the next generation. This process continues until the population of the next generation has reached the same size as the population of the previous one.
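Tournament selection with tournament size 3 can be sketched in plain Python. This is an illustration of the procedure above, not the DEAP implementation used in the thesis; the toy fitness function treats lower error as fitter.

```python
# Illustrative sketch of 3-way tournament selection.
import random

random.seed(3)

def tournament_select(population, error, k=3):
    """Pick len(population) winners, each from a random k-way tournament."""
    selected = []
    for _ in range(len(population)):
        contenders = random.sample(population, k)
        selected.append(min(contenders, key=error))  # lower error = fitter
    return selected

pop = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(10)]
error = lambda ind: sum(abs(g) for g in ind)         # toy fitness: absolute error
next_gen = tournament_select(pop, error)
```

Because each tournament keeps the best of its 3 contenders, fitter individuals tend to appear more often in the next generation while weaker ones are gradually squeezed out.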

4.5.4 Mutation & Crossover

The uniform initialization of the weights had as a consequence that the initial fitness was very bad. To combat this, mutation was performed very aggressively: the mutation probability was set very high (80%) and a custom mutation algorithm was introduced.

This aggressive approach was a trade-off. It had the benefit that the results would start converging quickly towards the best possible outcome, though it also meant that the final value might lack some precision.


The custom mutation worked as follows. Each value in the individual went through a step where it could be altered in one of the following four ways:

• 1/4 chance — it would stay the same

• 1/4 chance — it would be doubled

• 1/4 chance — it would be halved

• 1/4 chance — its sign would be inverted

For mating, two-point crossover was used between each of the rows in the 2-dimensional list. The bias was left unchanged by crossover.
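The four-way mutation step can be written down directly; the sketch below is an illustration of the rule above, not the thesis code.

```python
# Illustrative sketch of the custom mutation: each weight stays the same,
# doubles, halves, or flips sign with equal (1/4) probability.
import random

random.seed(4)

def mutate(individual):
    mutated = []
    for w in individual:
        choice = random.randrange(4)
        if choice == 0:
            mutated.append(w)        # stay the same
        elif choice == 1:
            mutated.append(w * 2)    # double
        elif choice == 2:
            mutated.append(w / 2)    # halve
        else:
            mutated.append(-w)       # invert the sign
    return mutated

ind = [1.0, -2.0, 0.5, 4.0]
child = mutate(ind)
```

Doubling and halving let weights sweep across orders of magnitude quickly, which is what makes this mutation so aggressive compared to small additive perturbations.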

4.6 Artificial Neural Networks

The initial approach with regards to building a prediction engine was to look at previous approaches and try to reproduce the state of the art. In essence, this meant starting with a chained LSTM approach, as used in some of the papers covered in Section 2.4.

When this approach did not yield great results, it was decided to start from the beginning and build up increasingly complex models in order to gain understanding and thus be able to make more intelligent decisions going forward. Thus, this section covers the progression from a basic ANN model to the more advanced models.


4.6.1 Basic Model

The architecture of the first model attempted can be found in Figure 4.6.1. It was implemented as a basic ANN, with one hidden layer and using ReLU (see Equation 4.2) as the activation function.

ReLU(x) = 0 if x ≤ 0, x otherwise    (4.2)

Over time this approach grew to be seen as a very useful starting point, since it allowed for implementing saving, loading and good plotting without being as computationally heavy as the more complicated approaches.

Figure 4.6.1: The architecture schema for the basic ANN model.
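A forward pass through such a one-hidden-layer network with the ReLU activation of Equation 4.2 can be sketched in NumPy. The layer sizes and random weights here are illustrative; the thesis does not list its actual implementation.

```python
# Illustrative sketch: forward pass of a one-hidden-layer ANN with ReLU.
import numpy as np

def relu(x):
    return np.maximum(0.0, x)   # 0 for x <= 0, x otherwise (Equation 4.2)

rng = np.random.default_rng(5)
W1, b1 = rng.normal(size=(6, 16)), np.zeros(16)   # 6 metrics -> 16 hidden units
W2, b2 = rng.normal(size=(16, 1)), np.zeros(1)    # hidden -> predicted power

def predict(x):
    hidden = relu(x @ W1 + b1)
    return (hidden @ W2 + b2)[0]

x = rng.uniform(size=6)          # one container's (synthetic) metrics
power_estimate = predict(x)
```

Training would then repeatedly run such forward passes and backpropagate the prediction error to update W1, b1, W2 and b2, as described in Section 3.2.3.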


4.6.2 Long Short-Time Memory

The LSTM modeling commenced with reading up on LSTM in general, specifically on Colah’s blog [56]. Various implementations of LSTM in Python were considered, and the first implementation was inspired by [57]. That model is based on two chained LSTM networks. The first one takes the input parameters and encodes the time series into a fixed-length vector. The second takes this vector and interprets it back to a prediction in the desired domain. An overview of the architecture can be found in Figure 4.6.2.

Figure 4.6.2: An overview of the Encoder-Decoder LSTM chain.

4.7 Reinforcement Learning

The idea was to tackle the continuous action space by basing the reinforcement learning model on an implementation [58] of Normalized Advantage Functions [44].

A data center environment was created for the agent to explore. This environment returned the actual data center values grouped by container and sorted by timestamp.

The agent was then to make a guess at the next container_power, and the loss was defined as the absolute error of the guess. For visualizations of the parameter optimization and the reinforcement learning process, please refer to Figures 4.7.1 – 4.7.2.


Figure 4.7.1: How NAF updates parameters.

Figure 4.7.2: The main loop of the Reinforcement Learning approach.


4.8 Proof of Concept

The purpose of this work is to help educate users on the effects of their power consumption and to motivate them to reduce their environmental footprint. As part of the project, a Proof of Concept application was built in order to demonstrate how this could be done.

The Proof of Concept has three parts, which are described below.

The first part is a monitoring engine where the user can run queries against the cloud center in real time in order to gain insights about the distribution of the metrics studied. This part does not necessitate a prediction engine but is useful in giving an introduction as to what the metrics are.

The second part is the power consumption estimation engine. It allows the user to submit six different metrics that describe the characteristics of their planned job, and to see an estimation as to what the power consumption of that job could be given those characteristics.

The third part is very similar to the second. Instead of submitting six metrics, however, the user is asked to submit five. A graph is then shown of the estimated power consumption over the sixth parameter, given the five fixed parameters. It is believed that this view could help a user understand in which scenarios what metrics have the most impact.


Chapter 5

Results

This chapter shows the results of the various approaches outlined in Chapter 4. In Section 5.1, the graphs resulting from the data exploration can be found. Then in Section 5.2, the results of the different regression attempts are covered. In Sections 5.3 – 5.5, the results of the Genetic Algorithm, Neural Networks and Reinforcement Learning respectively are outlined.

5.1 Data Exploration

This section contains the results of the data exploration performed on the raw data as described in Section 4.2. The dependencies between the container_power and the data center metrics can be found in Figure 5.1.1. It is very hard to detect any correlations among these metrics with the human eye.

The result of the PCA can be found in Figure 5.1.2. It shows that there seem to be clusters within the parameter space that have the same or similar container_power. The correlation matrix can be found in Figure 5.1.3. It shows that the strongest correlation is with the network metrics, which could be partly due to the fact that they are averaged node metrics.


Figure 5.1.1: The raw data obtained from the data center. It is difficult to detect correlations by eye.


Figure 5.1.2: Principal Component Analysis of container_power with 6 parameters.


Figure 5.1.3: Correlation matrix of the parameters and output. The stronger correlation between the power consumption and the network metrics could partly be resulting from the fact that both are averaged node metrics.


Table 5.2.1: Prediction accuracy of linear regression between container_cpu_seconds and container_power

        Actual        Predicted     Error
count   96287.000000  96287.000000  96287.000000
mean    74.961997     74.797272     22.084664
std     49.375480     1.463845      44.141939
min     12.923077     74.522052     0.006016
25%     54.090909     74.523434     7.649254
50%     69.222222     74.530191     20.432554
75%     91.538462     74.565144     28.015783
max     1890.000000   94.599362     1815.477917

5.2 Bayesian Inference

This section contains the results for the Bayesian Analysis. Section 5.2.1 shows the results for the linear regression outlined in Section 4.4.2 and Sections 5.2.2 – 5.2.3 contain the results for the polynomial regression attempts described in Sections 4.4.3 – 4.4.4.

5.2.1 Linear Regression

In Table 5.2.1, the prediction accuracy of linear regression between container_cpu_seconds and the average node_power per container is displayed. Figure 5.2.1 shows that there seems to be a very slight correlation between higher CPU usage and power consumption (though there were some quite impactful outliers).

5.2.2 Polynomial Linear Regression

The result of performing Kth degree polynomial linear regression on the metrics to predict container_power can be found in Table 5.2.2 for K ∈ [1..7]. It shows that K = 4 yielded the lowest mean absolute error, but that the overall accuracy was best at K = 3. The graphs for the K = 3 polynomial linear regression can be found in Figure 5.2.2.

It is worth noting that the improvements made by going from linear to polynomial linear regression were very small. Although these prediction results were precise enough to be guiding in many applications of such predictions, they were still quite far from the current state of the art.


Figure 5.2.1: Test sample accuracy of simple linear regression between container_cpu_seconds and average node_power per container.

Table 5.2.2: Results of polynomial linear regression. The leftmost column contains the details for the actual values that the regression is trying to predict, the rest of the columns show the distribution of the absolute error achieved with Kth degree polynomial linear regression.

The best accuracy is marked with boldface.

        Actual   K = 1    K = 2    K = 3    K = 4    K = 5    K = 6    K = 7
count   96287    96287    96287    96287    96287    96287    96287    96287
mean    74.96    23.06    23.04    21.54    21.53    21.61    22.05    23.74
std     49.38    41.31    41.31    37.94    38.16    41.39    86.05    596.99
min     12.92    6.38e−4  6.38e−4  1.23e−4  7.12e−3  8.85e−4  5.64e−4  8.59e−4
25%     54.09    8.70     8.70     7.32     7.77     7.75     7.83     7.84
50%     69.22    18.55    18.55    20.66    19.79    20.26    20.25    20.24
75%     91.54    30.14    30.14    27.38    28.11    28.09    28.31    28.18
max     1890.0   1819.8   1819.8   1815.1   1816.1   5126.4   20967    84823


Figure 5.2.2: Polynomial linear regression


Table 5.2.3: Results of polynomial random forest regression. The leftmost column contains the details for the actual values that the regression is trying to predict, the rest of the columns show the distribution of the absolute error achieved with Kth degree polynomial random forest regression. The best accuracy is marked with boldface.

        Actual    K = 2      K = 3      K = 4
count   365655    365655     365655     365655
mean    57.48     0.7180     0.6443     0.6670
std     31.51     7.132      7.129      7.431
min     6.30      0          0          0
25%     36.06     2.487e−14  2.132e−14  1.421e−14
50%     58.00     8.527e−14  7.816e−14  7.105e−14
75%     70.00     0.167      0.094      0.079
max     1260.00   1134       1144       1145

5.2.3 Random Forest Regression

The result of performing Kth degree polynomial random forest regression on the metrics to predict container_power can be found in Table 5.2.3 for K ∈ [2..4]. There were some issues with displaying the results of K > 4, and since the performance beyond that point degraded for each K it was decided to leave those values out of the report.

The table shows that in general, K = 3 yielded the lowest mean absolute error, and the best overall accuracy. The graphs for the K = 3 polynomial random forest regression can be found in Figure 5.2.3.

The results for the Polynomial Random Forest Regression were really good, with an error percentage of 1.10%. In order to validate these findings on new data, test data was collected from 3 new days and the trained model exposed to that data instead. Sadly, the model performed quite poorly in that case, with an error percentage of 26.48%.


Figure 5.2.3: Polynomial Random Forest Regression


5.3 Genetic Algorithm

This section shows the results of the Genetic Algorithm described in Section 4.5. It was possible to get the Genetic Algorithm to converge at a mean absolute training error of around 68, which represents an error of around 89%. After the change was made to the individuals to use PolynomialFeatures, the algorithm converged at an error of around 28%, though with very small differences between the individual predictions.

In Figure 5.3.1, the training accuracy per generation is shown. Due to the large error in the beginning it is very hard to visualize, but the decrease in error was quite gradual, and the best level of accuracy was achieved around generation 1000. This figure is representative of both representations of individuals.

Figure 5.3.1: The training accuracy per generation when running the Genetic Algorithm. Due to the large error in the beginning it is very hard to visualize, but the decrease in error was quite gradual, and the best level of accuracy was achieved around generation 1000.


5.4 Artificial Neural Networks

This section contains the results for the Artificial Neural Networks. Section 5.4.1 shows the results for the basic ANN described in Section 4.6.1 and Section 5.4.2 contains the results for the LSTM approach detailed in Section 4.6.2.

5.4.1 Basic ANN

Something was going very wrong with the training: the training predictions were improving at a much slower rate than the validation predictions, as can be seen in Figure 5.4.1. It is believed that this is due to a bug somewhere in the model, and locating that bug is part of the next steps. The accuracy after 9900 epochs can be found in Figure 5.4.2.

The lowest error was found in Epoch 9995 (out of 10000 total epochs). The error kept getting smaller, but any tests run yielded very poor results. Remember that this inaccuracy probably is due to the training/validation issues described earlier.

Figure 5.4.1: The training and validation errors over the course of the training. They show that, unexpectedly, validation errors are lower than training errors.


Figure 5.4.2: The training and validation predictions compared to the actual values of the container_power. Note: The values on the x-axis are indexes, not epochs.


5.4.2 Long Short-Time Memory

The lowest error at that point was found in Epoch 3056. The mean absolute error at that point was around 70, which is an error of almost 100%. Figure 5.4.3 shows the result of the prediction in Epoch 3000.

Figure 5.4.3: The training and validation predictions at epoch 3000, compared to the actual container_power. The values on the y-axis are the container_power in Watt-Hours, with the indexes on the x-axis.


5.5 Reinforcement Learning

This section shows the results of the Reinforcement Learning approach outlined in Section 4.7. Unfortunately, due to the issues discussed in Section 3.2.4, no converging model using Reinforcement Learning was obtained. Please refer to Figure 5.5.1 for the accuracy over time when running the Reinforcement Learning. Due to time constraints, and since it was suspected from the beginning that the methodology was unsuited for the task, it was decided not to continue improving the model beyond this point.

Figure 5.5.1: Results of the Reinforcement Learning per Epoch. As can be clearly seen on the graph, the result did not converge towards a low absolute error.


Chapter 6

Evaluation

This chapter contains the evaluation of the results. The criteria for the evaluation can be found in Section 1.3. Section 6.1 contains a qualitative comparison between the models tried in this thesis and their results. Section 6.2 then evaluates the effectiveness of the best model against the state of the art.

6.1 Qualitative Model Comparison

This section discusses the advantages and disadvantages of each of the approaches tried throughout the project. The approaches are then scored according to prediction accuracy, prediction time and overall potential; an evaluation of these properties can be found in Sections 6.1.1 – 6.1.3.

6.1.1 Prediction Accuracy

This is the most important criterion: how accurate the prediction is. The spectrum of success was very wide, with a number of solutions not converging at all or converging on very poor estimates (more than 100% wrong). These included the Neural Network and Reinforcement Learning approaches. The Neural Network approach was not studied very extensively as part of the project, so it is possible that there are more gains to be had there.

Among the working solutions, the Genetic Algorithm never amounted to more than a glorified linear regression, potentially due to the similarities between the two approaches. It achieved an error of around 23%, but the characteristics of the predictions were such that the algorithm was more or less making the same guess every time, which is not ideal since the goal was to learn from the parameters.
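Why a Genetic Algorithm can end up equivalent to a linear regression becomes clear when the individuals being evolved are coefficient vectors of a linear model: the GA can at best rediscover a linear fit. The following is a minimal sketch of that setup; the data, mutation scheme and parameters are illustrative, not the thesis's actual configuration.

```python
import random

# Toy dataset: (features, target), generated from y = 2*x1 + 3*x2.
data = [([1.0, 2.0], 8.0), ([2.0, 1.0], 7.0), ([3.0, 3.0], 15.0)]

def fitness(weights):
    # Negative mean absolute error of the linear model: higher is better.
    err = sum(abs(sum(w * xi for w, xi in zip(weights, x)) - y) for x, y in data)
    return -err / len(data)

def mutate(weights, scale=0.1):
    # Gaussian perturbation of every coefficient.
    return [w + random.gauss(0, scale) for w in weights]

# Simple (1+1) evolutionary loop: keep the best individual, mutate it.
random.seed(0)
best = [random.uniform(-1, 1) for _ in range(2)]
for _ in range(2000):
    child = mutate(best)
    if fitness(child) > fitness(best):
        best = child

print("weights:", [round(w, 2) for w in best], "MAE:", round(-fitness(best), 3))
```

Because the hypothesis space here is exactly the space of linear models, the evolved solution cannot outperform an ordinary least-squares fit; it can only approach it more slowly.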



Then there were the Bayesian Inference methods. It should be said up front that these outperformed the rest in prediction accuracy, with the Polynomial Random Forest Regression clearly being the winner, though the difference is much less clear when running the models on completely new test data.

The reason the Bayesian Inference methods performed best is believed to be that the approach is more straightforward: since it works in a purely statistical manner, the risk of overfitting is reduced. Though LSTM-based approaches proved unsuccessful in this project, it is believed that with enough experimentation they could be leveraged to achieve even better results.
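The winning model combines polynomial feature expansion with a random forest. A sketch of such a pipeline, using scikit-learn on synthetic data standing in for the data-center metrics (the feature names, degree and forest size are assumptions, not the thesis's exact settings):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in for container metrics (e.g. CPU, memory, network)
# with a nonlinear power-like target.
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(500, 3))
y = 5 * X[:, 0] ** 2 + 2 * X[:, 1] * X[:, 2] + rng.normal(0, 0.05, 500)

# "Polynomial Random Forest Regression" as the term is used above:
# degree-2 feature expansion feeding a random forest regressor.
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    RandomForestRegressor(n_estimators=100, random_state=0),
)
model.fit(X[:400], y[:400])                  # train on the first 400 points
mae = mean_absolute_error(y[400:], model.predict(X[400:]))
print(f"held-out MAE: {mae:.3f}")
```

The purely statistical character mentioned above shows in the fit: the forest averages piecewise-constant estimates over the expanded features, with no iterative loss landscape to get stuck in.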

6.1.2 Prediction Time

This section deals only with how long a model takes to reach its optimal prediction strength. This is important since the goal of the project is related to green energy, and it therefore behooves the algorithm to be energy efficient as well.

Due to the train/validation issue described in Section 5.4.1, the basic ANN model never converged (in 10,000 epochs), probably because it was only ever considering the training set.

The slowest algorithm of them all was by far the LSTM approach, which would converge after roughly three days of computation (running on an Ubuntu 18.04 VM with 16 GB of RAM).

Then came the Genetic Algorithm which was optimized using the techniques described in Section 4.5.4 and after that would converge in around 12 hours.

The Polynomial Random Forest Regression also took quite some time to compute: roughly 8 hours on a dataset of around 3.5 million points. The Polynomial Linear Regression was a lot faster, at only around 20 minutes. The fastest algorithms were the Simple Linear Regression and the Reinforcement Learning approach (though the latter converged to an error of around 4285%, and it is unclear whether that should be called convergence).



Table 6.1.1: The attempted approaches and their outcomes.

Name                           Build Time         Mean Absolute Error
Linear Regression              10 seconds         29.46%
Polynomial Linear Regression   20 minutes         28.72%
Polynomial Random Forest       8 hours            26.48%
Genetic Algorithm              12 hours           Around 28%
Basic ANN                      Did not converge   —
LSTM                           3 days             Almost 100%
Reinforcement Learning         Did not converge   —

6.1.3 Summary

From Sections 6.1.1 – 6.1.2, and as shown in Table 6.1.1, we see that the Polynomial Random Forest Regression was the most accurate of the solutions, but that it only outperformed the other regression approaches by a few percentage points when run against the final test set.

It can thus be concluded that if the best absolute prediction is wanted, the Polynomial Random Forest Regression is recommended, and that the Polynomial Linear Regression can be seen as a happy medium between model build time and prediction accuracy.
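The accuracy/build-time trade-off in Table 6.1.1 can be expressed as data, picking either the most accurate converging model overall or the most accurate one within a build-time budget. The one-hour budget below is an illustrative assumption, not a figure from the thesis.

```python
# Converging approaches from Table 6.1.1, with build times in seconds.
results = {
    "Linear Regression":            {"build_s": 10,        "mae_pct": 29.46},
    "Polynomial Linear Regression": {"build_s": 20 * 60,   "mae_pct": 28.72},
    "Polynomial Random Forest":     {"build_s": 8 * 3600,  "mae_pct": 26.48},
    "Genetic Algorithm":            {"build_s": 12 * 3600, "mae_pct": 28.0},
}

# Most accurate regardless of cost.
best_overall = min(results, key=lambda m: results[m]["mae_pct"])

# Most accurate among models buildable within one hour (assumed budget).
within_hour = min(
    (m for m in results if results[m]["build_s"] <= 3600),
    key=lambda m: results[m]["mae_pct"],
)

print(best_overall)  # → Polynomial Random Forest
print(within_hour)   # → Polynomial Linear Regression
```

The selection reproduces the conclusion above: the Random Forest wins on pure accuracy, while the Polynomial Linear Regression is the happy medium once build time is constrained.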
