
Dynamic allocation of servers for large scale rendering application

Samuel Andersson

Computer Science and Engineering, master's level 2021

Luleå University of Technology

Department of Computer Science, Electrical and Space Engineering


Abstract

Cloud computing has been widely used for some time now, and its area of use grows larger year by year. It is very convenient for companies to use cloud computing when creating certain products, but it comes at a great price. This thesis evaluates whether one could optimize the expenses for a product regardless of which platform is used. Would it be possible to anticipate how many resources a product will need, and to allocate those machines in a dynamic fashion?

This thesis evaluates predicting the need for rendering machines based on the response times of user requests, and dynamically allocating rendering machines to a product based on this need. The solution is based on machine learning, where different types of regression models try to predict future response times and evaluate whether or not they are acceptable. During the thesis both a simulation and a replica of the real architecture were implemented. The replica of the real architecture was implemented using AWS cloud services.

The regression model that turned out to be best was the simplest possible: a linear regression model with response time as the dependent variable and queue size per rendering machine as the independent variable. The model performed very well in the region of realistic response times, but not necessarily as well at very high or very low response times. That is not considered a problem, though, since response times in those regions are not of concern for the purpose of the regression model.

The effects of using the regression model seem to be better than those of a completely reactive scaling method, although the effects are not entirely clear, since no user data is available. For the effects to be evaluated fairly, user patterns in terms of daily usage of the product are needed. Because the requests in the simulation are purely random, there is no correlation between what happened 10 minutes back in the simulation and what will happen 10 minutes into the future. The consequence is that it is very hard to estimate how the independent variable will change over time. And if that cannot be estimated properly, the results with the inclusion of the regression model cannot be tested in a realistic scenario either.


Contents

1 Introduction
    1.1 Background
    1.2 Motivation
    1.3 Problem definition
    1.4 Sustainability
    1.5 Delimitations
    1.6 Thesis structure
2 Related work
3 Theory
    3.1 Regression analysis
        3.1.1 Linear regression
        3.1.2 Multi-linear regression
    3.2 Erlang
    3.3 Service rate
    3.4 Server utilization
    3.5 Response time
    3.6 R2 score
    3.7 Mean squared error
    3.8 Mean absolute error
    3.9 Root mean squared error
4 Implementation
    4.1 System architecture
    4.2 System components
        4.2.1 Web server
        4.2.2 SQS-queue
        4.2.3 Rendering machine
    4.3 Simulation
        4.3.1 Idea of the simulation
        4.3.2 Settings in the simulation
        4.3.3 Data in the simulation
        4.3.4 Outputs from the simulation
    4.4 Single request in AWS-architecture
    4.5 Multi request implementation
    4.6 Regression model
    4.7 Scaling
5 Evaluation
    5.1 Regression models before delimitation
    5.2 Regression model after delimitation
    5.3 Final solution
6 Discussion
    6.1 Reactive vs proactive
    6.2 Solution
    6.3 What could have been done differently?
        6.3.1 Planning
        6.3.2 Multi requests from client
    6.4 Problems
        6.4.1 Response from rendering machines
        6.4.2 Map response to correct client
        6.4.3 Multi request
        6.4.4 Timing in Node.js
7 Conclusions and future work
    7.1 Conclusions
    7.2 Future work


Acronyms and Abbreviations

Abbreviation   Description
IaaS           Infrastructure as a Service
PaaS           Platform as a Service
SQS            Simple Queue Service
GPU            Graphics Processing Unit
LSTM           Long Short-Term Memory
Std            Standard deviation
R2             R-squared
MSE            Mean Squared Error
MAE            Mean Absolute Error
RMSE           Root Mean Squared Error
FIFO           First In, First Out
API            Application Programming Interface
URL            Uniform Resource Locator


1 Introduction

1.1 Background

How many servers do you really need in order to guarantee a user friendly product for the lowest possible price? That is a question every company that provides a service asks itself. There is no easy answer either, as it depends on a lot of different factors and parameters. What resources in terms of hardware are needed? How many users is the service estimated to serve? What is a user friendly product, really? As you may understand, the questions are many, and the answers vary for every specific product or service. It is known that renting servers for cloud computing purposes is pretty pricey; by analyzing the need for server power required to satisfy the users, and dynamically allocating these servers, these costs could be reduced by quite a margin.

If, in addition to that, it were possible to predict the need and counteract increasing demand in a proactive manner, then nothing would impact the user experience either, at the same time as you would not pay for more resources than you actually need.

The company issuing the project will soon release a product to its customers. In that product, many different factors influence the response times of the requests that users send. Rendering times differ per request, as some requests might only need a very tiny render, while others might need to render a whole scene or a complete view. Thus, the rendering times are very widely spread.

Since the product is not released yet, there is no user data or user patterns available either. How should you then be able to anticipate how many rendering machines will be needed in order to provide a user friendly product? The purpose of this thesis is to find out if there is a way to anticipate and predict how many rendering machines will be needed at a specific time.

1.2 Motivation

By only renting the servers that are needed, the expenses for a specific service can be reduced, as stated earlier. However, it is not only a question of economy but also of the environment. A server, or in this case a rendering machine, performs very heavy computations that require a lot of hardware. Not using more rendering machines than your service actually needs therefore not only generates greater economic profit from your service, but also a profit for the well-being of the planet.

There are a lot of existing solutions to this problem. The existing solutions are often provided by the landlord of the servers/machines (AWS for instance), which means that you cannot change your supplier of rendering machines and expect the exact same result afterwards, neither in terms of price nor performance. If you could scale the number of rendering machines up and down by yourself, that problem would disappear.

In addition to the above mentioned motivations there is also another big factor that motivates this work, namely the progression of cloud computing. The usage of cloud computing has increased by a lot [16] over the last 10 years and is expected to increase even more in the future. The number of IaaS and PaaS providers is expected to increase, and having a general, provider-independent solution to a substantial problem in the hosting of a service is right in time.

1.3 Problem definition

The number of users using a product varies a lot depending on many factors. Are all the users located in the same country? Do all the users have a job? Do all the users have the same habits? No one could possibly answer these questions without knowing each and every customer very well. How should it then be possible to figure out how many resources are needed in order to provide a user friendly service? That is simply not possible. With dynamic allocation of rendering machines in the cloud, these problems could disappear.

During the course of this thesis project, an algorithm will be created that dynamically regulates the number of rendering machines that are needed. The algorithm will analyze the evolution of response times of the requests sent from the users to the rendering machines. If the response times grow and seem to be reaching a state where they are unacceptable, it will allocate another rendering machine to sustain the system that provides the service.

The algorithm will be based on machine learning and regression analysis of the response times. The algorithm will continually need to be adjusted and analyzed in order to hopefully provide an acceptable prediction of the response times. By testing, adjusting and continually analyzing the correctness of the predictions, the expectation is to be able to deliver a regression model capable of predicting the response time with great results.

The algorithm will be tested in an environment implemented in AWS that represents the architecture of the currently running service. In order to also test the algorithm fast and often, there will be a need to implement a simulation of the process. In the simulation, the real scenario, with users sending requests to a rendering machine, will be imitated, and the response time correlated to each request will need to be realistic and match the response times generated from the architecture implemented in AWS.


All of the above paragraphs can be combined into the problem definition of this thesis: "Is it possible to create a model that can be used to predict the response times of requests? If so, could that model be used in order to proactively scale up or down the number of rendering machines? How does it compare to a reactive solution?"

1.4 Sustainability

From a sustainability standpoint, the thesis works, as stated earlier, towards decreasing the power consumption needed to provide as user friendly and energy efficient a service as possible. The rendering machines need a very powerful GPU to be able to render a picture from a polygon, and therefore it also takes a lot of power to supply the GPU.

If more rendering machines are active than necessarily needed, then one rendering machine should be taken down. Providing these powerful machines with power that is not needed is not good for the environment. Hence, the earth is also in need of a solution like this, since everybody needs to do what they can in order to save the earth from the situation it is in right now.

1.5 Delimitations

Some limitations need to be made since the time is limited. For example, in the real service that will soon be launched, there is a cache storing some requests so that many requests do not have to go to a rendering machine at all. Since only the response times of the requests that need a rendering machine are of interest, and since implementing the architecture and simulation will take a lot of time, the cache can be disregarded in this thesis. This does not change the result either way, since the algorithm would be placed "behind" the cache; the only concern is the requests that surpass the cache in the real system as well, since the cached requests will have a very low response time.

Another delimitation in the project is the load generator that will be used. It will not be very advanced, as it is not the main focus of this project. The load generator will need to be able to generate different types of requests, however, as they vary quite a lot in terms of render times. The load generator will be based on some data that the product owner will provide, which corresponds somewhat to reality. One pattern that the load generator could use is for instance a normal distribution over the different types of requests; another is a completely randomized load within some frames and intervals.

However, it should already be stated that the differences in rendering times with respect to different request types was another delimitation made during the work of the thesis. The different types of requests were later discarded, and instead a rectangular distribution within some intervals of the rendering times was used. By doing so, all the requests had the same "type" and used the same time interval. This is explained further in the evaluation section of the thesis.

1.6 Thesis structure

In this thesis, section 2 brings you some information about related work on the subject, and section 3 focuses completely on the theoretical parts of the thesis: things that might be of use to understand before proceeding further in the report, or things that might be good to go back to in order to remember what they actually are when they are referred to. Section 4 describes how the implementation of different things has been done, what the different components are, what the ideas behind some things are, what settings are possible, what data is generated and so on. The evaluation of the solution is covered in section 5, where a lot of different graphs are presented, along with explanations of why the solution behaves in a certain way. In section 6, the discussion takes place; most of all, the problems that arose during the thesis and what could and should have been done differently are discussed. Last but not least, section 7 rounds up everything presented before with some conclusions and possible future work for this thesis.


2 Related work

When it comes to forecasting and machine learning there are numerous different types of solutions to be found. Forecasting and predictive regression are used in many different areas, and a lot of different articles have been read. Some examples of such work are: using multi-linear regression to predict the load capacity of reinforced bridges [10], forecasting stock prices using an LSTM neural network [20], and work that aims to forecast the weather in a data driven approach [17].

Although there is a lot of work surrounding forecasting and prediction of different things, not many articles have been found that are similar to what this thesis aims to accomplish. That is most likely due to the fact that there already are solutions to this problem, as long as you use the service provided by the "landlord" of the servers. AWS Auto Scaling [2], for instance, is mostly reactive and acts, via a trigger, when the problem has already occurred or is occurring. The desired solution is to be able to predict whether a problem will occur and take action before that happens. Although one can use AWS Auto Scaling for predictive scaling [3], it needs to be provided with historical data of user activity, and that is not something that is available by now. In addition, it would be beneficial to have a general solution that could be applied to other server providers than AWS, since the rendering machines have been rented at several different places. The need for proactive scaling is very important in this specific case, since the boot time for the machine itself and all of its software is estimated to be somewhere around 5-10 minutes.

This thesis is not only about regression and machine learning, as there are a lot of other concepts that needed to be learned and understood. A lot of them surround queueing networks [12] and the correlated topics. There are also different queueing models; the simplest one is the M/M/1 queueing model [1], and the one worked with in this thesis is the M/M/C queueing model [13]. The M/M/C queueing model is just an extension of the M/M/1 queueing model, with the difference that there are c servers instead of a single server.


3 Theory

During the thesis there has been some theoretical work that it is advantageous to explain a little bit more about. Some of the theory explained in this section has had a larger part in the thesis, and some might not have been used that much at all. It is nevertheless important to explain what the concepts are, so that it becomes clear why they have been used in the thesis.

3.1 Regression analysis

Before the actual regression models are discussed, regression analysis [11] itself should be discussed, as it might not be familiar to everyone. Regression analysis is a technique or method used to describe a dependent variable based on one or more independent variables. This might sound abstract, but in reality it is not very abstract at all. Regression analysis is often used in statistics and is a statistical process for estimating the relationship between a dependent variable and one or more independent variables, as stated earlier.

In our case, regression analysis is used in a machine learning context. That is done by providing a data set to a regression model, and when this model has been trained with respect to the given data set, it can determine the correlation between different attributes in the data set. What this gives you is the possibility to determine or even predict the value of the dependent variable, given the value of the independent variable(s).

3.1.1 Linear regression

Linear regression [7] is the simplest model possible in a regression analysis. However, if the data set is well suited and the independent variable is wisely picked, it could also be the most precise model. The resulting model is probably very familiar, namely:

Y = βX + m    (1)

Where Y is the dependent variable and X is the independent variable. As this is a very simple model, the challenge in using it is to pick some attribute X that describes Y as well as possible. Many times there is no variable initially that suits the model. However, in some cases one can derive data from the already existing data that describes the dependent variable pretty well.
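As a concrete sketch of equation 1, the closed-form least-squares estimates for β and m can be computed directly. The queue-size and response-time numbers below are made up for illustration; this is not data from the thesis.

```python
def fit_linear(xs, ys):
    """Ordinary least squares for Y = beta*X + m (equation 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # beta = covariance(X, Y) / variance(X)
    beta = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    m = mean_y - beta * mean_x  # intercept
    return beta, m

# Hypothetical data: queue size per rendering machine (X)
# versus observed response time in seconds (Y).
queue_sizes = [0, 1, 2, 3, 4]
response_times = [0.5, 1.0, 1.5, 2.0, 2.5]
beta, m = fit_linear(queue_sizes, response_times)
predicted = beta * 6 + m  # predicted response time at queue size 6
```

The same closed-form solution underlies library implementations such as scikit-learn's `LinearRegression`.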

(13)

3.1 Regression analysis 3 THEORY

There are some metrics that could (and should) be used when picking an independent variable: kurtosis and skew. Both kurtosis and skew measure "the quality of the data" in terms of how many and how big the outliers are. The data should be as similar to a normal distribution as possible, because most underlying statistical models assume the data provided to the model is normally distributed [9]. Skew measures how the overall shape of the data compares to a normal distribution, which is illustrated in Figure 1.

Figure 1: Positive and negative skewness (Source: Wikimedia Commons under CC BY-SA 3.0).

Kurtosis, on the other hand, only measures outliers with respect to the height of the data. The kurtosis of a normal distribution is equal to 3. Hence, one often speaks about excess kurtosis instead of kurtosis. The excess kurtosis is the measured kurtosis of a variable minus 3, as the only interest is in how the data differs from normally distributed data.
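As a sketch of how these two metrics can be computed, using the standard moment-based (population) definitions rather than any thesis code:

```python
def skew_and_excess_kurtosis(data):
    """Moment-based (population) skewness and excess kurtosis."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # variance
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3  # a normal distribution scores 0
    return skew, excess_kurtosis

# A symmetric sample has zero skew; this flat one also has
# negative excess kurtosis (lighter tails than a normal).
skew, ek = skew_and_excess_kurtosis([1, 2, 3, 4, 5])
```

Libraries such as `scipy.stats.skew` and `scipy.stats.kurtosis` compute the same quantities (the latter reports excess kurtosis by default).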

There are some debates regarding which intervals of skew and kurtosis are acceptable when constructing a linear regression model. In some cases the data might be highly skewed or have high values of excess kurtosis, and the model may still perform very well. There are no guarantees that your model will perform better with no skew and no excess kurtosis. If that is the case, however, the shape of the data will fit your model better, and you are more likely to get a better result.

In the case of high skew values (normally if skew is ≤ −1 or ≥ 1) or high excess kurtosis (normally if excess kurtosis is ≤ −1 or ≥ 1), the most common solution is to transform the data. The most common transformation is probably the logarithmic transformation, with either the natural logarithm or the logarithm with base 10. There are numerous kinds of transformations that could be used, but when you use transformations on the data in your regression model you need to remember both to transform the input and to invert the transformation when extracting your prediction from the model. That is because the answer will also be in terms of the transformed data, which means that the answer needs to be transformed back before it is interpretable.
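A minimal sketch of that round trip: log-transform the target, fit in the transformed space, then invert the prediction with the corresponding power. The data is synthetic and exactly exponential, so the fit is perfect:

```python
import math

xs = [0, 1, 2]
ys = [1.0, 10.0, 100.0]               # grows exponentially: skewed target
log_ys = [math.log10(y) for y in ys]  # transform: now perfectly linear

# Least-squares fit in the transformed space.
n = len(xs)
mx = sum(xs) / n
my = sum(log_ys) / n
beta = sum((x - mx) * (y - my) for x, y in zip(xs, log_ys)) \
    / sum((x - mx) ** 2 for x in xs)
m = my - beta * mx

# Predict at x = 3, then invert the transformation (10 ** prediction).
prediction = 10 ** (beta * 3 + m)
```

Forgetting the final inversion is the typical mistake: the raw model output would be 3 (a logarithm), not a value in the original units.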

3.1.2 Multi-linear regression

Multi-linear regression [14] is actually precisely the same as the above mentioned linear regression model, with one big difference: in multi-linear regression there are several independent variables that together describe the dependent variable. A multi-linear regression model could look something like this:

Y = β0·X0 + β1·X1 + … + βn·Xn + m    (2)

As can be seen in the equation above, there are many more variables compared to the equation described under the linear regression model, and that is very natural, as there are more variables explaining the dependent variable. These variables all cooperate in order to describe the dependent variable. Some may be more significant, while some may not be very significant at all (although in that case they should not have been picked).

In multi-linear regression there are also a couple of things that should be checked for in your data before picking the independent variables. In a multi-linear regression model there should be no multicollinearity. Multicollinearity is when one of the independent variables is highly correlated with another independent variable, and this is not wanted in the model, because the coefficient estimates tend to be unreliable for correlated variables. A good metaphor for this is attending a concert where multiple artists sing the same song at the same time, with the job of determining which artist is the best singer. To eliminate multicollinearity one could use a correlation matrix of the data and remove one of the highly correlated variables. An even better way would be to use something called the variance inflation factor, which is a measurement of how much a particular variable contributes to the standard error in the regression model. This can be used since the variance inflation factor for a variable will be very big if there exists high multicollinearity. A rule of thumb is to eliminate variables until the variance inflation factor is ≤ 5 for all variables.
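For the special case of exactly two candidate predictors, the variance inflation factor reduces to 1/(1 − r²), where r is the Pearson correlation between them; the general case regresses each predictor on all the others. A sketch of that two-predictor special case:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two samples."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def vif_two_predictors(x1, x2):
    """VIF when there are only two predictors: 1 / (1 - r^2)."""
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r * r)

# Weakly correlated predictors -> VIF close to 1, well below the
# rule-of-thumb threshold of 5.
vif = vif_two_predictors([1, 2, 3, 4], [1, 0, 1, 0])
```

For more than two predictors, `statsmodels`' `variance_inflation_factor` implements the general definition.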

After you have created your model you should also evaluate whether there is any heteroscedasticity or autocorrelation. Heteroscedasticity is basically a term meaning that the errors of the model do not have constant variance, and that is not something you want in your model, since it implies that there may be some significant error occurring every time for some specific input, creating an unreliable model. Testing your model for heteroscedasticity can be done using the Breusch-Pagan test [6] or the White test [18].

Heteroscedasticity often occurs when there are outliers in the data set, which is something you should also test your data set for. A general definition of an outlier is described in equation 3, where Mean is the mean of the data and Std is the standard deviation: a data point x is considered an outlier if

|x − Mean| ≥ 3 · Std    (3)

When it comes to outliers there are also some tradeoffs to take into account. Some might say that outliers should be removed from the data set, as that makes the model better and more precise. Although that might be true in some cases, it is not always true, since outliers are actual data describing events that have occurred, and they might be very important for the model. Some models should be used to detect when "outlying data" might be coming up, and then you need to train the model to be able to spot and recognize patterns where there are outliers.
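A sketch of the three-sigma rule from equation 3, flagging points that lie at least three standard deviations from the mean (the sample values are made up):

```python
def find_outliers(data):
    """Return points at least 3 standard deviations from the mean (eq. 3)."""
    n = len(data)
    mean = sum(data) / n
    std = (sum((x - mean) ** 2 for x in data) / n) ** 0.5
    return [x for x in data if abs(x - mean) >= 3 * std]

# Twenty ordinary response times and one extreme one.
samples = [10.0] * 20 + [100.0]
outliers = find_outliers(samples)
```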

When speaking about autocorrelation in regression, one generally speaks about autocorrelation of the residuals. Residuals are described in equation 4, where R is the residual, Y_p is the predicted value and Y_o is the observed value. The residual of a data point is the vertical difference between the regression line and the actual value of the data point, and it could be explained as the "error of the prediction".

R = Y_o − Y_p    (4)

Autocorrelation is when values of the same variable are based on related observations, meaning that you can almost "predict" future values based on the preceding values in a series. A good example would be the measured temperature during a month. You would expect the temperature on the second day of a month to be more similar to the temperature on the first day than to that on the last day of the month. If this holds for the data set of temperatures, it exhibits autocorrelation.

Merging these two definitions gives autocorrelation of the residuals, and this is not wanted in the model, because the computed standard errors and p-values are misleading when there is autocorrelation. When there is autocorrelation in your model, it is often a sign that the model used might not be the correct one given your data set. Often when time-series data is used, one ends up having autocorrelation in the model, since the observations will depend on the preceding values. To test the model for autocorrelation one can use the Ljung-Box test [5].
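A rough sketch of checking residuals for lag-1 autocorrelation; the Ljung-Box test aggregates such correlations over several lags, but the single-lag version already illustrates the idea:

```python
def lag1_autocorrelation(residuals):
    """Correlation between each residual and the next one in the series."""
    n = len(residuals)
    mean = sum(residuals) / n
    num = sum((residuals[i] - mean) * (residuals[i + 1] - mean)
              for i in range(n - 1))
    den = sum((r - mean) ** 2 for r in residuals)
    return num / den

# A trending residual series is positively autocorrelated (a bad sign
# for the model); an alternating one is strongly negatively autocorrelated.
trending = lag1_autocorrelation([1, 2, 3, 4, 5, 6])
alternating = lag1_autocorrelation([1, -1, 1, -1, 1, -1])
```

In practice one would use `statsmodels`' `acorr_ljungbox` on the residuals rather than a hand-rolled check like this.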

3.2 Erlang

In the thesis work, Erlang's C-formula has been used in order to evaluate the probability of a request having to queue before being processed. Erlang is primarily used as an indication of traffic intensity, and the most widely used formula involving Erlangs is probably Erlang B, which originally described the probability of a phone call being blocked, making the phone call completely disregarded or dropped. Erlang C is, as just mentioned, an estimate of the probability of a phone call having to queue before being processed.

Erlang can be used in many different areas, and it can be used in this thesis as well. For example, if servers are substituted for rendering machines and a phone call for a rendering request, the formula ends up describing the exact same thing, but for this problem instead of the phone call problem. Equation 5 shows how the offered traffic E (in Erlangs) is calculated, where λ is the arrival rate and h is the average service time. Equation 6 then shows the Erlang C formula, where P_w is the probability that a request needs to be queued, E is the offered traffic in Erlangs and c is the number of servers.

E = λh    (5)

P_w = [ (E^c / c!) · (c / (c − E)) ] / [ Σ_{i=0}^{c−1} (E^i / i!) + (E^c / c!) · (c / (c − E)) ]    (6)
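A direct translation of equations 5 and 6 into code (my own sketch, not the thesis implementation):

```python
import math

def erlang_c(arrival_rate, avg_service_time, c):
    """Probability that a request must queue, per equations 5 and 6."""
    E = arrival_rate * avg_service_time  # offered traffic in Erlangs (eq. 5)
    assert E < c, "system must be stable (E < c)"
    top = (E ** c / math.factorial(c)) * (c / (c - E))
    bottom = sum(E ** i / math.factorial(i) for i in range(c)) + top
    return top / bottom

# One server at 50% load: a request has to queue half the time.
p_wait = erlang_c(arrival_rate=0.5, avg_service_time=1.0, c=1)
```

For c = 1 the formula collapses to the M/M/1 result P_w = E, which makes a convenient sanity check.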


3.3 Service rate

Service rate is a measurement of how many services can be handled given a specific service time. The term service corresponds, in our case, to a render on a rendering machine. The service rate is calculated as shown in equation 7, where µ is the service rate, λ is the arrival rate of an operation and t is the service time (the time it takes to perform the service). The service rate is pretty much a measurement of how many services you can perform per time unit (λ and t must be in the same unit).

µ = λ / t    (7)

Please note that this is not the service rate for the entire system, but only for one server. The system service rate, however, describes the entire system's service rate, and is calculated as shown in equation 8, where c is the number of servers and µ is the service rate of one server.

µ_s = µ · c = (λ / t) · c    (8)

3.4 Server utilization

Server utilization is a measurement of how much of the time, on average, the servers of the entire system are "busy" or in use. The server utilization ρ is calculated according to equation 9, where λ is the arrival rate, c is the number of servers and µ is the service rate of one server.

ρ = λ / (c·µ)    (9)

Given ρ, one can get some information regarding the stability of the system. If ρ > 1, the queue of the system will grow and will eventually be out of control. If ρ < 1, some services might still have to queue (since the service times are not exactly the same for every service). If all service times were the same, the queue would on average be empty, because according to ρ the system service rate is greater than the arrival rate. However, as mentioned above, this is not very usual in practice, since the service times often vary and the arrival rate is an average measured over time.
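Equation 9 amounts to a one-line stability check; a small sketch with made-up numbers:

```python
def server_utilization(arrival_rate, service_rate, c):
    """rho = lambda / (c * mu), per equation 9."""
    return arrival_rate / (c * service_rate)

# 8 requests/s arriving at 2 machines that each serve 5 requests/s:
rho = server_utilization(arrival_rate=8.0, service_rate=5.0, c=2)
stable = rho < 1  # the queue stays under control on average
```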


3.5 Response time

Using the above mentioned calculations, there is actually a way to theoretically calculate the average response time. The response time is the time a customer spends in the queue plus the time in the service itself. The average response time can be calculated as shown in equation 10, where T is the average response time, P_w is the probability of a service being queued (Erlang C), c is the number of servers, λ is the arrival rate and µ is the service rate.

T = P_w / (cµ − λ) + 1/µ    (10)
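Putting equations 6 and 10 together gives a self-contained sketch for the theoretical average response time, taking the average service time h as 1/µ; the M/M/1 special case, where T = 1/(µ − λ), serves as a sanity check:

```python
import math

def avg_response_time(arrival_rate, service_rate, c):
    """Equation 10: T = P_w / (c*mu - lambda) + 1/mu."""
    E = arrival_rate / service_rate  # offered traffic, with h = 1/mu
    # Erlang C (equation 6):
    top = (E ** c / math.factorial(c)) * (c / (c - E))
    p_wait = top / (sum(E ** i / math.factorial(i) for i in range(c)) + top)
    # Mean time in queue plus mean time in service:
    return p_wait / (c * service_rate - arrival_rate) + 1.0 / service_rate

# M/M/1 sanity check: lambda = 0.5, mu = 1 gives T = 1 / (1 - 0.5) = 2.
t = avg_response_time(arrival_rate=0.5, service_rate=1.0, c=1)
```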

3.6 R2 score

The R2 score [8] is a simple, yet widely used metric to determine "how good a model is". It basically measures how far off the predictions are from the observed values in the test data, compared to just using the mean of the observed data as a prediction. It is calculated as shown in equation 13, where SS_tot and SS_res are described in equations 11 and 12 respectively.

SS_tot = Σ_i (y_i − ȳ)²    (11)

Where y_i is the observed value at position i and ȳ is the mean of the observed values.

SS_res = Σ_i (y_i − f_i)²    (12)

Where y_i is the observed value at position i and f_i is the predicted value at position i.

R² = 1 − SS_res / SS_tot    (13)
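Equations 11-13 translate directly into a few lines (a sketch, not the thesis code):

```python
def r2_score(observed, predicted):
    """R^2 per equations 11-13."""
    mean = sum(observed) / len(observed)
    ss_tot = sum((y - mean) ** 2 for y in observed)                 # eq. 11
    ss_res = sum((y - f) ** 2 for y, f in zip(observed, predicted)) # eq. 12
    return 1.0 - ss_res / ss_tot                                   # eq. 13

# Perfect predictions score 1.0; always predicting the mean scores 0.0.
perfect = r2_score([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0])
baseline = r2_score([1.0, 2.0, 3.0, 4.0], [2.5, 2.5, 2.5, 2.5])
```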

3.7 Mean squared error

Mean squared error (MSE) [4] is a metric describing how far, on average, the model's predictions are from the correct values. MSE is calculated as shown in equation 14, where MSE is the mean squared error, Y_i is the observed value at position i, Ŷ_i is the predicted value at position i, and n is the number of entries in the test data. The reason for squaring the difference between the observed and the predicted value is to "punish" larger errors.

MSE = (1/n) Σ_i (Y_i − Ŷ_i)²    (14)

3.8 Mean absolute error

Mean absolute error (MAE) [19] is a metric describing how far, on average, the model's predictions are from the correct values. MAE is calculated as shown in equation 15, where MAE is the mean absolute error, Y_i is the observed value at position i, Ŷ_i is the predicted value at position i, and n is the number of entries in the test data. MAE only measures the mean magnitude of the errors, not their direction, and it does not punish larger errors extra.

MAE = (1/n) Σ_i |Y_i − Ŷ_i|    (15)

3.9 Root mean squared error

The root mean squared error (RMSE) [15] is a metric describing how far, on average, the model's predictions are from the correct values. RMSE is calculated as shown in equation 16, where RMSE is the root mean squared error, Y_i is the observed value at position i, Ŷ_i is the predicted value at position i, and n is the number of entries in the test data. The main purpose of this metric is to make the MSE a little more interpretable; since the MSE sums the squares of the differences between the observed and the predicted values, it is hard to interpret. RMSE thus still punishes larger errors, just as MSE does, but the answer is easier to interpret.

RMSE = √MSE = √( (1/n) Σ_i (Y_i − Ŷ_i)² )    (16)
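The three error metrics in equations 14-16 can be sketched as follows (function names are mine):

```python
from math import sqrt

def mse(y, y_hat):
    """Equation 14: mean of the squared errors; punishes large errors."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

def mae(y, y_hat):
    """Equation 15: mean magnitude of the errors, direction ignored."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

def rmse(y, y_hat):
    """Equation 16: square root of MSE, in the same unit as the data."""
    return sqrt(mse(y, y_hat))
```

Note that one large error raises MSE more than several small ones of the same total magnitude, while MAE treats them the same.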


4 Implementation

4.1 System architecture

The system architecture consists of four core parts. There is a client that uses the web application, which in turn uses the SQS-queue for queueing the requests coming from the client. From the SQS-queue, messages are pulled by rendering machines that are located within an auto scaling group, such that the number of rendering machines can be increased or decreased (scaled up and down). In Figure 2 below, the workflow of the architecture is described.

Figure 2: Current system architecture in AWS.

4.2 System components

4.2.1 Web server

The web server, implemented using Node.js, hosts a web application from which you can send requests. This corresponds to the service that the product owners are supplying to their customers, but it is not anywhere near as advanced. You can send three different types of requests, and the transport time, render time, and total time of the request will be displayed on the web site. You can also select to send multiple requests using the multi request feature. With this feature, all the responses to each request will be stored temporarily in the web server. When the response to the last request is received, all the data will be sent to the client, and some graphs will appear that display the development of different data attributes.


4.2.2 SQS-queue

The SQS-queue is a service provided by AWS. The web server sends the request to the SQS-queue, which is of FIFO-fashion. That is, the first request that comes in is the first request to get pulled from the queue. The SQS-queue stores the requests that are to be processed by the rendering machines. From this queue you can also get all the necessary attributes, such as queue length. All use of the SQS-queue is handled via the AWS API.

4.2.3 Rendering machine

The rendering machines are located inside an Auto Scaling-group, basically because it is necessary to be able to increase or decrease the number of rendering machines in use. The rendering machines themselves are built as a simple python script that polls for messages (requests) from the queue, processes the request, and waits/sleeps for the time that the rendering request is supposed to take. The amount of time a render takes depends on the type of request. For example, initially a fast request takes between 0.3 and 0.7 seconds, a normal request between 0.7 and 1.5 seconds, and a heavy request between 1.5 and 8 seconds. These times were all based on numbers that the product owner had provided me with at the start of the thesis. Due to some delimitations in the thesis, and also due to optimization of the rendering times, these times were later changed. But that will be explained later. After the "rendering" is done, the machine adds some information to a message body and sends it back to the web server.
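The polling loop can be sketched as below; a plain `queue.Queue` stands in for the SQS-queue here, and the intervals are the initial values mentioned above (the real script talks to SQS via the AWS API instead):

```python
import queue
import random
import time

# Initial render-time intervals in seconds per request type (later revised).
RENDER_INTERVALS = {"FAST": (0.3, 0.7), "NORMAL": (0.7, 1.5), "HEAVY": (1.5, 8.0)}

def render_worker(job_queue, responses, do_sleep=True):
    """Pull messages until the queue is empty and 'render' by sleeping."""
    while True:
        try:
            msg = job_queue.get_nowait()    # the real machine polls SQS here
        except queue.Empty:
            return
        low, high = RENDER_INTERVALS[msg["type"]]
        t = random.uniform(low, high)       # the randomized render time
        if do_sleep:
            time.sleep(t)                   # the 'render' itself
        responses.append({"id": msg["id"], "render_time": t})

# Demo run on an in-memory queue, without actually sleeping.
jobs = queue.Queue()
for i, typ in enumerate(["FAST", "NORMAL", "HEAVY"]):
    jobs.put({"id": i, "type": typ})
results = []
render_worker(jobs, results, do_sleep=False)
```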

4.3 Simulation

In addition to making a ”real architecture” there is also a need for a simulation of the process. The reason is that if you want to test your regression model or your solution, you might not want to do it in real time, as that would be very time consuming. If you have a simulation that is good and produces results comparable to those from the real scenario, you can instead use the simulation for testing purposes. The results from a simulation corresponding to an hour of user activity can be obtained in only 15-30 seconds, instead of waiting for that whole hour.

A simulation is also very useful as it gives me the advantage of testing different settings in a way that would not be possible in the AWS-architecture. New settings could be implemented and new data could quickly be derived. Also, different parameters that might not be easy to replicate in the AWS-architecture, but play a big part in the real system, could be added in the simulation as well.

But the main reason for using a simulation is that it is easy to test different settings and run multiple simulations over a day. That would not be possible using only the AWS-architecture, due to the time it would require.

4.3.1 Idea of the simulation

The idea of the simulation is to base it on a table of entries, where each entry corresponds to a time slot. In this simulation each time slot is 0.1 seconds. You can then choose for how long you would like to run the simulation, which corresponds to how many time slots you would like to have in your table. Now, during the simulation there is, for each time slot, a certain probability that a request will be sent to the system. That means that you can also test the system's performance under different user loads. In the simulation there are also different types of requests: one ”easy type”, one ”normal type” and one ”heavy type”. Each has a different probability of occurring, and each has a different interval of times it can take. But as already mentioned, this changed later on in the thesis.

Applying a certain request to a render machine is done by going through all the rendering machines that the simulation is currently using, and evaluating which one of them will be available first. The one that is available first is allocated to perform the incoming request, just as in a real scenario, where each rendering machine pulls jobs from a queue as soon as it is idle. Now, since the rendering machine might be busy when the request comes in, and should perform the new request as soon as its previous ones have executed, the starting time of the request T_s needs to be calculated. That is done by the following equation:

T_s = max(T_a, T_i)    (17)

Where T_a is the time where the rendering machine will become available again, T_i is the current time slot of the simulation, and T_s is the time slot where the render request starts being processed. The max is needed because there might be scenarios where a render machine is idle when a request comes in, and then it should start executing at the current time slot, not at a time slot that has already passed. After this is done, T_a needs to be updated. That is done according to the following equation, where T_r is the time it takes to execute the render in the machine:

T_a = T_r + T_s    (18)
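Equations 17 and 18 together give the machine-assignment step; a sketch with names of my own choosing:

```python
def assign_request(available_at, current_slot, render_time):
    """Assign a request to the machine that is free earliest.

    available_at[i] holds T_a for machine i; returns (machine index, T_s).
    """
    m = min(range(len(available_at)), key=lambda i: available_at[i])
    t_s = max(available_at[m], current_slot)   # equation 17
    available_at[m] = render_time + t_s        # equation 18
    return m, t_s
```

If the chosen machine is idle (T_a ≤ T_i), the request starts in the current slot; otherwise it is effectively queued until T_a.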


In order for the simulation to be similar to the real architecture it also needs a queue. The queue does not really hold any requests, but it still works in the same way as it would in the real architecture. For every time slot with an incoming request, the start time of the request is calculated according to Equation 17. If the calculated start time is larger than the time slot the request was sent in, the request is put in the queue. For each time slot there is also a check whether there are any pending requests in the queue that are supposed to start at this given time slot. If that is the case, the request is removed from the queue. Otherwise the queue is left untouched.

The main functionality of the simulation is briefly described above, and as you can probably tell, it is a very simple way of simulating a rather complex issue. It is in my opinion a very good way to do the simulation, as it requires a lot of thinking about what is really happening and which things depend on each other in the real scenario, and the results are presented in a very clear way. Another benefit of this simulation is that you can store and reuse the data that it provides.

4.3.2 Settings in the simulation

In the simulation there are various settings that can be used in order to generate and compare different data. As mentioned above, you can choose how many entries the simulation should be based on, and what load the simulation should use. But in addition to this, you have many more possibilities. If wanted, the simulation can be executed without any scaling whatsoever, in which case the initial number of rendering machines specified is used throughout the whole simulation time.

There is also a parameter that specifies the least number of rendering machines the simulation may have. For example, you could say that scaling is allowed, but that you are never allowed to go further down than, say, x rendering machines. This was implemented because it is necessary in an architecture like this; there might be a tight lower bound on how many rendering machines you want allocated, as there might be severe consequences if you go below that limit.

When it comes to load you can choose the load that you want. The load corresponds to the probability of an entry turning into a request. Hence, for example, a static load of 30% indicates that on average 3 requests per second will be sent. You can also have a dynamic, varying load during the execution of the simulation, which was implemented to test how the results turned out with a dynamic instead of a static load. This is done by setting an interval in which the load can vary, and then specifying at what rate the load should change. For example, you can say that for every 200 seconds in the simulation the load should increase by 1%. The variation of the load is constructed in such a way that it changes roughly like a sinusoid. In an interval of load between x and y, the load starts at x, increasing by 1% every t seconds until y is reached. After y is reached, the load decreases by 1% every t seconds until x is reached again, and it proceeds like that during the whole simulation.
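The load variation described above can be sketched as a generator (names and granularity are my own choices; here one sample per t seconds):

```python
def load_schedule(low, high, step_seconds, total_seconds):
    """Yield (time, load %) ramping low -> high -> low by 1% every step_seconds."""
    load, direction = low, 1
    for t in range(0, total_seconds, step_seconds):
        yield t, load
        load += direction
        if load >= high:
            load, direction = high, -1   # turn around at the upper bound
        elif load <= low:
            load, direction = low, 1     # turn around at the lower bound
```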

There are also a lot of settings when it comes to delays and times. For example, you can specify how long the boot up time of a rendering machine is. Let's say that you decide to scale up the number of rendering machines at time t, and the boot up delay of a rendering machine is t_b. Then the time when the rendering machine is actually up and running is t_s = t + t_b. There is also a corresponding shut down delay of a rendering machine that you can specify, which is exactly the same thing but comes into action when you are scaling down the number of rendering machines in the architecture. This is a very important implementation, as there is a pretty hefty boot up time on the rendering machines, since they need to download a lot of files before they are ready for rendering.

Another time setting is an attribute that determines how much time must pass between evaluations of the architecture. This means that after a machine has been added to or removed from the architecture, some time must pass before the next evaluation is done, as you want to wait a moment so that you can actually see what impact the change had on the response times, the queue size, and so on.

4.3.3 Data in the simulation

In this section, all the data that is generated and used for each request in the simulation will be listed. After the listing it will also be stated why these data attributes were generated.

1. Time

• The time at which a request came into the system

2. Load type

• What load type the incoming request is of

3. Render machine

• Which render machine that is doing the render

4. Start time

• What time the request will start being processed by a rendering machine

5. Rendering time

• How long the rendering will take for the rendering machine

• Depends on the load type

• Randomized within some time intervals

6. Done time

• What time the request will be returned with a rendered picture

• Calculated by: start time + rendering time

7. Response time

• How long it took from an incoming request to an outgoing response

• Calculated by: done time - time of request

8. Number of rendering machines

• The number of rendering machines that are currently active

9. Queue size

• The number of requests that are currently queued, waiting to start

10. Queue size per rendering machine

• An estimation of how many requests that are queued to each rendering machine

• Calculated by: queue size / number of rendering machines

11. Mean response time over the last 10 requests

12. Erlang C

• The value obtained from equation 6

13. Service rate

• The value obtained from equation 7

14. Server utilization

• The value obtained from equation 9

15. Average response time

• The value obtained from equation 10

16. Load recent 10 seconds

• The load in percent over the recent 10 seconds

• Calculated by: number of requests / number of entries over the latest 10 seconds

17. Change in queue size per rendering machine

• The change in percent of the variable described above (queue size per rendering machine)

• Measured over a specified time interval, usually 10 minutes

All these attributes were generated with the regression model in mind. The desire was that some of these attributes could be used in a multiple linear regression model as dependent variables, in order to model the response times more accurately. However, the correlation was not the best for some of them, and the ones that actually were good for this purpose often violated the multicollinearity property of a regression model, which is quite understandable since most of the theoretical attributes are highly dependent on the number of rendering machines.

4.3.4 Outputs from the simulation

The primary output from the simulation is the table that is used in the simulation. In the output table, all of the time slots with no requests are stripped off, as they are not very interesting. A couple of plots are also generated, all based on the contents of the table. The table is saved as an excel file and the plots are saved as pictures.

4.4 Single request in AWS-architecture

When performing a single request in the ”real system”, the execution starts with the client choosing a specific type of request. As has been mentioned a couple of times already, there are ”FAST”, ”NORMAL” and ”HEAVY” requests. Before the request is sent from the client, the request body is created. That body consists of two things: the current time in milliseconds since 1970 and the type of the rendering request. The request is then sent to the web server, which processes the information in the request body, creates a new message body that includes the information sent from the client, and forwards the message to the SQS-queue.

Since the response will not come from the SQS-queue, but from a rendering machine to which there is no active connection, it is necessary, as stated in the paragraph above, to include some more information that will be sent to the rendering machines via the queue. This information is for example the IP-address of the web server, such that the rendering machine knows where to send the ”response”, and also the size of the queue (how many messages are located in the queue by the time the message is sent to the queue).

In addition to creating a new message body, the response object to the client needs to be stored with some identifier. That is because, as described earlier, the response from the rendering machine will come in at a specific route in the web server. In order to determine which client the information received in the request should be sent to, the response object from the client's request is stored in a data structure along with the Message ID that is returned from a successful publication to the SQS-queue.

From the rendering machines' perspective, they continually poll for messages from the SQS-queue when they are not doing work. Hence, as soon as a message arrives in the queue, it will be received and processed by a rendering machine. The rendering machine will evaluate what type of rendering it should do, and do the render, which in this case corresponds to randomizing a time within the interval that maps to the request type and sleeping for that time. After that, the rendering is considered done, and the time it took to perform (sleep) the request is stored in the response body along with some other information, for instance the ID of the pulled message.

When a request from a rendering machine comes in at the web server, the first thing that is done is a lookup in the data structure that stores the response objects. Since there is an identifier (message ID) both in the data structure and in the incoming request body from the rendering machine, the response object belonging to the client can easily be located, and the web server responds to the client using that response object.

When the response finally arrives at the client, some computations are done using the times stored in the request body and the time when the response arrived at the client (in milliseconds since 1970). The rendering time does not need to be calculated, as it is already in the response body, and the transport time can be calculated using T_A, T_S and T_R, where T_A is the time of the arrival of the response, T_S is the time when the request was sent, and T_R is the rendering time. The transport time is then calculated by the following equation:

T_T = T_A − T_S − T_R    (19)

And the total time of the request T can be calculated by using the following equation:

T = T_A − T_S    (20)
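The client-side computations are just these two subtractions (equations 19 and 20); a trivial sketch with names of my own choosing:

```python
def transport_time(t_arrival, t_sent, t_render):
    """Equation 19: T_T = T_A - T_S - T_R (all times in milliseconds)."""
    return t_arrival - t_sent - t_render

def total_time(t_arrival, t_sent):
    """Equation 20: T = T_A - T_S."""
    return t_arrival - t_sent
```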


In Figure 3 below, it can be seen what the output in the web application looks like.

Figure 3: The result from a single request.

4.5 Multi request implementation

When a multi request is executed, the flow of the requests is visualized in Figure 4. In that figure, the notation on each edge corresponds to how many requests are done between the connected instances during the execution of a multi request. x is a positive number, where x ≥ 1.

Figure 4: Request flow of multi request.

When implementing this feature, the goal was to make it as similar as possible to the simulation, such that a specific load is decided by the user from the client, and the execution is based on entries (basically a measurement of how long the execution will go on). The reason it would be advantageous for the multi request feature to be as similar to the simulation as possible is that then the outputs from the two can be compared.

First of all, the client sends a request to the web server indicating that a multi request should be performed. From the client it is also stated how many entries are wanted during the execution, and what static load percentage is wanted, which can be seen in Figure 5. A spinner with a ”Loading”-text is then placed on the web page, indicating that the request is being handled and that the user should wait for the request to return. This can also be seen in Figure 6.

Figure 5: The page of the multi request feature.

Figure 6: The multi request-page after the request has been sent.

At the web server, the request from the client is received, indicating that multiple requests should be sent to the rendering machines. Just as in the case of a single request, the response object needs to be stored along with a specific identifier. The identifier is now not the message ID, since there will be multiple messages sent to the queue, but just a random ID that is associated with the multi request. In addition to the response object, the data produced by the rendering machines also needs to be stored, so empty lists that should hold that data also go into the data structure.

From this point the web server starts sending requests, which is done by recursively calling a method x times, with a delay of 100 milliseconds between each call, where x is the number of entries that the user wants to run. For each of those entries a random number between 1 and 100 is generated, and it is checked whether that number is less than or equal to the load percentage that the user provided. If it is, a request is sent to the queue. If not, nothing is done on that specific entry. This gives a probability of z percent that an entry results in a request, where z is the load percentage that the user specified. For each entry that turns into a request, a body is generated with some data. The queue size and the number of active rendering machines are for instance some of the data included in the body.
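The per-entry decision can be sketched like this (names are mine; the real server does this in Node.js with the 100 ms delay between entries):

```python
import random

def run_entries(n_entries, load_percent, send_request):
    """For each entry, send a request with probability load_percent %."""
    sent = 0
    for i in range(n_entries):
        if random.randint(1, 100) <= load_percent:  # z % chance per entry
            send_request(i)
            sent += 1
    return sent
```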

For every response/request that is received from a rendering machine, implying that a render has been done, the results of the render are mapped to the supplied ID that was included in the body of that request, such that they will be mapped to the correct client. The multi request is considered done when the last entry has been received at the web server; hence, the last entry must turn into a request before it can be guaranteed that the process is finished. So, when the last entry is received at the web server, it is known that the process is done, and by using the response object stored under the ID in the request body, it is easy to locate both the data and the response object. The data that has been generated is then sent back to the client.

When the response finally arrives at the client, the received data is used to plot some graphs. The graphs are produced using Chart.js, which is a convenient module for plotting graphs in a javascript web application. The results of this can be seen in Figure 7; please note that the data shown can be changed by using the buttons below the graph.


Figure 7: Client side after response.

4.6 Regression model

The implementation itself of the regression model is very easy nowadays. There are so many libraries available, both for constructing models and for evaluating them. There was also a need for some libraries for handling the data set and plotting graphs. The libraries that were used for all these things will be gone through one by one, with some comments about why and what they were used for.

Pandas is a library that most readers will probably be familiar with, as it is the most widely used library for dealing with data frames of any kind in python. This library was used for so many things that it would be ridiculous to bullet point all of them. In general it was used to pre-process the data, which corresponds to loading the data from the excel file into a pandas data frame, checking and converting the types of the data in the data frame, checking if there are any null-values or corrupted rows in the table, and so on. Pandas was also used to evaluate the data in the data frame, as the library holds useful functionality such as the possibility to generate a correlation matrix of the columns in a data frame. There are also functions to describe the values of the columns in the data frame, from which you get statistical data such as the mean, standard deviation, minimum value, maximum value, etc. There are numerous functions in the pandas library that have been very useful in the implementation of the regression model, and the ones mentioned are just a few of those that were used.
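A condensed version of that pre-processing and evaluation might look like this; the thesis loaded the frame with `pd.read_excel` on the simulation output, while a small in-memory frame with made-up numbers stands in here:

```python
import pandas as pd

# Stand-in for: df = pd.read_excel("simulation_output.xlsx")
df = pd.DataFrame({
    "response_time": [0.9, 1.4, 2.1, 3.0, 4.2],   # made-up values
    "queue_per_rm": [0.5, 1.0, 1.6, 2.4, 3.3],
})

assert not df.isnull().values.any()   # no missing/corrupted rows
stats = df.describe()                 # mean, std, min, max, quartiles, ...
corr = df.corr()                      # correlation matrix of the columns
```

A high correlation between two columns, as between these two, is what hints that an attribute could be useful in the regression model.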

Matplotlib is another library that most readers are probably familiar with. One very important aspect when creating a regression model is to understand and know your data, and what better way could there possibly be than to use plots in order to actually see what you are working with? Without plotting graphs of your results and progress along the way, in combination with statistics and reading material, you would never be able to understand what is happening and why you might not get the results that you wanted. At least that was a big problem for me in the beginning, as my experience in the subject of machine learning was extremely low. A very useful way to start, when the dependent variable should be picked, is to plot all the columns of your data frame along with the independent variable. From these plots one can evaluate whether there are any interesting patterns or correlations between the variables. An example of such a plot is shown in Figure 8, where there clearly is a linear relationship between the independent variable response time and the dependent variable Queue/RM. These kinds of plots were used before any verification was done with statistical tools, such as correlation matrices or the variance inflation factor. The decision to do this was for me to understand, and to be able to predict, what attribute would be good as an independent variable. The statistical tools could then be used to verify (or disregard) my own predictions.

Figure 8: Linear relationship between independent variable and dependent vari- able.

This was just one example of how plots have helped in the process of implementing the regression model. More or less every statistical calculation has had a plot trying to explain or predict its outcome, as a way of learning what everything is describing and doing. Plots in general have been a key factor during the whole process and have been used in many cases, and some of the plots will be shown later in this report, such as the comparison between the predictions and the observed values, which has been made for every created model, as that is in my opinion the best way to see how your model is performing. That is because many of the statistical measurements will not tell you the whole story; if you calculate the mean squared error, mean absolute error, root mean squared error, r2-score or whatever for your model, it might give you the impression that your model is completely useless. That might not always be the case, even if the results on those metrics are bad. From this plot you might also see anomalies and be able to draw conclusions on why your model is being punished/performing badly in certain situations.

When it comes to the statistical analysis of a model, one of the libraries that has been used is Statsmodel.api, as it provides tests for some of the things explained in the theory section, such as kurtosis, skew and so on. It was very easy to use and to get started with; rather than implementing the maths by yourself, it was very neat to use an existing library.

The creation of the model is done using another library that many people are familiar with: sklearn, from scikit-learn, was used for this purpose, as it provides all the necessary but very time-consuming operations to split your data into training and testing sets, and to create, train and use your model. The whole process of doing these things is just a few lines of code, which is very convenient.

When the model is evaluated and everything looks fine, the model is saved to a binary file using the python built-in library pickle. This is also very convenient, as it makes it possible to save different models and re-use them or import them in other python projects, such as in my simulation that was going to use the model later on.
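A sketch of that workflow, on made-up data with roughly the linear relationship from Figure 8 (the thesis' actual data came from the simulation's excel output):

```python
import pickle
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Toy data: response time grows roughly linearly with queue size per
# rendering machine (coefficients here are made up for illustration).
rng = np.random.default_rng(0)
queue_per_rm = rng.uniform(0, 5, size=200).reshape(-1, 1)
response_time = 0.8 + 1.1 * queue_per_rm.ravel() + rng.normal(0, 0.05, 200)

X_train, X_test, y_train, y_test = train_test_split(
    queue_per_rm, response_time, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
score = model.score(X_test, y_test)   # R^2 on held-out data

# Persist the trained model so another project (e.g. the simulation)
# can load it back later.
blob = pickle.dumps(model)
restored = pickle.loads(blob)
```

The restored model makes exactly the same predictions as the original, which is what allows a model trained in a notebook to be imported into the simulation.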

After creating a notebook-file that did all the above described work, all that needed to be done was to change what data set should be used. Since each run of the simulation stores the result in an excel-file, all that was needed was to copy this excel-file and import it in the notebook. The solution turned out to be very neat and made it possible to change and compare results from different data sets in a very short time.

4.7 Scaling

During the simulation there were some decisions to be made regarding the scaling. The most crucial decision was when scaling should be done during the run time of the simulation, when generating data. Different models/rules have been used throughout the thesis, so they might be of interest when discussing the overall implementation of the solution.

There are different rules for up and down-scaling, as down scaling is more sensitive to wrong decisions in terms of their effect on the response times. In general, down scaling has been done when the queue size has been 0 for a specified amount of time, since you really do not want to scale down the number of rendering machines before you have worked down the size of the queue and kept it in the region of 0 for a while. Scaling down when the server utilization has been below a certain level for a specific amount of time has also been tested. The reason for checking the chosen parameter over time is that it eliminates the risk of making a rushed decision that might not be the correct one.

For up scaling there have also been some different implementations. Before the regression model itself was used in the simulation, the queue per rendering machine parameter was used for the most part. When that parameter had been averaging above a certain threshold over a certain time period, an up scaling was done. Server utilization has also been used as an up scaling parameter, as it describes the current situation of the system, in terms of load and available capacity, pretty well. In that case you also look at the server utilization over a certain time period and compare it to a threshold that has been picked, for instance 80%. Usually the time period used for up scaling has been shorter than the one used for down scaling, as it is not as important to make the correct decision; it would just be better, although more pricey, with more rendering machines. It is also more important to make a quick decision in the case of up scaling than in the case of down scaling.
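These threshold-over-a-window rules can be sketched as below; the window lengths and thresholds are illustrative, not the thesis' exact values:

```python
from collections import deque

class Scaler:
    """Scaling rules as used before the regression model: scale up when the
    queue per rendering machine averages above a threshold over a short
    window, scale down when the queue has been empty for a longer period."""

    def __init__(self, up_threshold=2.0, up_window=30, down_window=120):
        self.up_threshold = up_threshold
        self.recent_queue_per_rm = deque(maxlen=up_window)  # recent samples
        self.empty_queue_streak = 0
        self.down_window = down_window

    def decide(self, queue_per_rm, queue_size):
        self.recent_queue_per_rm.append(queue_per_rm)
        self.empty_queue_streak = self.empty_queue_streak + 1 if queue_size == 0 else 0
        window = self.recent_queue_per_rm
        if len(window) == window.maxlen and sum(window) / len(window) > self.up_threshold:
            return "scale_up"
        if self.empty_queue_streak >= self.down_window:
            return "scale_down"
        return "hold"
```

Note how the down-scaling window is longer than the up-scaling one, reflecting that a wrong down-scaling decision is costlier than a wrong up-scaling one.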

When the regression model came into use in this scenario, there was a need to understand how the dependent variable would change over time. Without that information there would be no value in using the regression model, as the whole intention is to forecast what the response times will be in the future.

This was done by using the change in percent of the dependent variable: by looking back, say x minutes in time, and comparing that value of the dependent variable to its current value, the change in percent becomes, say, h. By doing this it was possible to forecast, by some means at least, the change in response times by feeding the regression model the dependent variable increased or decreased by h%. In the coming section it will be described why this is not a feasible solution in my case.
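A sketch of this forecasting step, under my own assumptions about names and sampling (one sample per minute, and a toy linear model standing in for the fitted regression model):

```python
def forecast_response_time(history, lookback, model):
    """Forecast a response time by extrapolating the dependent variable's
    recent percentage change and feeding the extrapolated value into the
    regression model.

    history:  samples of the dependent variable (queue size per rendering
              machine), oldest first, one sample per time step
    lookback: how many steps back to measure the change over ("x minutes")
    model:    maps the dependent variable to a predicted response time
    """
    current = history[-1]
    past = history[-1 - lookback]
    h = (current - past) / past      # fractional change over the lookback, "h"
    projected = current * (1 + h)    # assume the same change continues forward
    return model(projected)

# Toy stand-in for the fitted linear regression model (assumed coefficients).
model = lambda q: 0.5 + 0.1 * q
history = [10, 11, 12]               # queue per machine over the last 3 minutes
print(forecast_response_time(history, lookback=2, model=model))
```

With these toy numbers the change over the lookback is h = 20%, so the model is evaluated at 12 × 1.2 = 14.4 instead of the current value of 12.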


5 Evaluation

5.1 Regression models before delimitation

The initial regression models were not performing sufficiently whatsoever. Pretty much none of the requirements described in the theory section were met (kurtosis, skew, autocorrelation, heteroscedasticity). A lot of different models were tested, but none of them seemed to work. The reason for this result was pretty obvious, but it was not anything that could be changed at the time.

As mentioned above, the reason was pretty clear. The interval of the rendering times was way too large, as they could vary from 0.5 seconds all the way up to 8.0 seconds. This caused the results from the simulation to be very random, as one could in theory end up being ”unlucky” by having several 8.0-second jobs in a row. Even if the queue was not large when those ”unlucky” jobs came in, they would become a bottleneck for the entire system, as they take so long to complete. Imagine a situation with 30% load, 3 rendering machines that are all available, and an initial queue size of 0, where 3 of the 8.0-second rendering requests come into the system at the same time. In this situation, on average 3 new requests would come in for every second that those render jobs occupy the rendering machines, leaving us with an average queue size of 24 when those jobs are finished. One then also needs to take into account that some of those 24 queued jobs are probably other ”heavy” jobs, while the majority are probably ones that take below 1.5 seconds to complete.
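The queue size in the worst-case example above follows from simple arithmetic (the variable names are mine, the numbers are from the text):

```python
# Three 8.0-second jobs occupy all three rendering machines, while requests
# keep arriving at an average of 3 per second during that time.
job_duration = 8.0   # seconds each heavy job blocks a machine
arrival_rate = 3     # average new requests per second at 30% load
machines = 3         # all busy with the heavy jobs, so nothing is drained

queued = arrival_rate * job_duration  # requests piling up until machines free up
print(queued)  # 24.0
```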

Because of the combination of the arrival patterns of the different rendering types and the wide interval of rendering times, there are simply too many random factors that play a part in the response times, which makes it very hard for a regression model to find any correlation between different attributes. Looking back at the results, the highest correlated attribute from one simulation was in fact only 33% correlated with the response times. In short, there was basically too much randomness with a severe impact on the results, such that the regression model failed to make any predictions that were anywhere near acceptable.

When this insight was reached, the decision to shrink the interval of render times was made, in order to see if that would help the performance of the model. At the company there was also ongoing work to optimize the rendering times, something that was completed about a week after the decision was taken. Instead of having different types of requests that take different amounts of time, all the requests from this point take between 0.5 and 1.5 seconds, and the distribution of the rendering times is rectangular (uniform) within that range.
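In the simulation, drawing render times from this rectangular distribution is a one-liner (a sketch, assuming the thesis simulation samples per request; the function name is my own):

```python
import random

def render_time():
    """Draw a rendering time from the rectangular (uniform) distribution
    on [0.5, 1.5] seconds used after the delimitation."""
    return random.uniform(0.5, 1.5)

samples = [render_time() for _ in range(10_000)]
print(min(samples) >= 0.5 and max(samples) <= 1.5)  # True
```

With the 0.5–8.0 s interval removed, the variance of individual render times shrinks drastically, which is what made the response times predictable enough for regression.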
