
IT 19 041

Degree project 30 credits, July 2019

Service Metric Prediction in Clouds using Transfer Learning

Nevine Gouda

Department of Information Technology



Abstract

Service Metric Prediction in Clouds using Transfer Learning

Nevine Gouda

Cloud service management for telecommunication operators is crucial and challenging, especially in a constantly changing operational execution environment. Performance models can therefore be used to maintain service quality. However, a traditional performance model that predicts the clients' service quality may have to be re-trained from scratch whenever the environment changes in order to maintain prediction accuracy. This introduces a large data-collection overhead that can significantly degrade performance, especially if the system needs to make accurate real-time predictions. To improve prediction accuracy in dynamic environments, we use transfer learning, which re-uses knowledge obtained in one domain (the source domain) in another (the target domain). It is, however, important to determine how transferable a source domain is for learning a task in a specific target domain. In this thesis, we use an information-theoretic approach for estimating the performance of transferring representations between domains in classification problems. Using neural network models, we show a negative correlation between a novel metric called the H-score and the prediction loss. This yields a significant speed-up when the H-score is used to select the best representation to transfer between domains, compared to traditional transfer learning approaches. We also find that manual feature selection is a more viable approach than automatic feature selection, especially for transfer learning. We collected and evaluated the transfer learning approaches using traces from Video-on-Demand and Key-Value Store services executed on different testbeds.

Subject reader: Salman Toor. Supervisor: Andreas Johnsson.


ACKNOWLEDGEMENT

The author is grateful for the opportunity to be part of the great team at Ericsson Research, Research Area Artificial Intelligence - Data Science and Automation. The deepest appreciation goes to Andreas Johnsson, Farnaz Moradi, Jörgen Gustafsson and Christofer Flinta for providing the author with constructive discussions, constant support, and guidance throughout the thesis.

They all contributed to the results of this project. The author also wants to extend her appreciation to her reviewer and professor from Uppsala University, Salman Toor, who introduced her to this opportunity and helped her stay on track. She would finally like to dedicate this work to her family, who supported her every step of the way through her Master's degree and thesis.


Contents

Acknowledgement

1 Introduction

1.1 Background

1.2 Project Scope and Main Contributions

2 Literature Review

2.1 Cloud & Service Management

2.2 Machine Learning

2.3 Related Work

3 Problem Specification

3.1 Motivation

3.2 Problem Setting

4 Approach

4.1 Approaches Overview

4.2 Feature Selection

4.3 H-score

4.4 Partial Vs. Full Transfer

4.5 Tools

5 Testbed & Data Traces

5.1 Testbed & Services

5.2 Load Patterns

5.3 Fault Patterns

5.4 Used Data Traces

6 Scenarios

6.1 Scenarios Overview

6.2 Transfer-Learning Scenario Configurations

7 Evaluation

7.1 Work Flow

7.2 H-score Experiments

7.3 Results

8 Discussion

9 Conclusion

10 Future Work

11 References

A Appendix


1 Introduction

1.1 Background

Telecommunications and mobile services are an essential part of everyone's life. With their exceedingly high demand, next-generation telecom and internet services are migrating their execution to the cloud.

Moreover, service quality and assurance are crucial for achieving high-quality services for end-users, as well as profitability and productivity for the telecom companies. However, the management of such cloud-based systems is both challenging and demanding. Performance models are therefore deemed vital for predicting the service quality experienced on the end-user's side during execution. These performance models come in many forms and types depending on the complexity of the problem to be solved, but they all share one goal: the ability to predict the service quality by learning from observations of the service's infrastructure, such as CPU usage, memory usage, etc. However, a key challenge that arises with dynamic cloud environments, as shown in Figure 1, is the limited number of observations and labeled data. This is where transfer learning comes in.

We, as humans, have the inherent ability to transfer and convey our knowledge across new tasks. What we learn about a certain task can naturally be carried over when attempting to learn or solve a new, yet related, task. In fact, the more related the tasks, the easier it is for us to transfer what we learned from one task to another.


Figure 1: Service Execution in a Dynamic Cloud. Picture from [1].

Thus, instead of learning how to solve a task from scratch, one can transfer what was learned in the past. Equivalently, learning a performance model for cloud services from scratch, in isolation, is very expensive in terms of time and data-collection overhead, especially when the service's operational environment changes during execution, since such changes alter the feature-space distribution and force model re-training to maintain prediction accuracy. Transfer learning therefore attempts to overcome the isolated learning paradigm and to utilize knowledge learned from one feature-space problem in a new yet related one.


1.2 Project Scope and Main Contributions

This M.Sc. thesis project aims at pushing the state-of-the-art for service metric estimation in cloud and data center environments using transfer learning, by tackling the following research challenge:

Investigate and develop approaches for enhanced prediction of cloud service metrics by utilizing cross-domain knowledge through transfer learning.

In other words, this thesis aims to find approaches for predicting the service quality a user device experiences, denoted Y in Figure 1, based on the infrastructure observations, denoted X, by transferring knowledge obtained in one execution environment (Exec env. 1) to another (Exec env. 2) in a cloud-based setup.

Thus the thesis involves experimentation, evaluation, and development of the following tasks:

1. Expand on existing scenarios in [2] in order to further investigate the effectiveness of transfer learning.

2. Investigate metrics for measuring transferability of features from the source to the target domain.

3. Develop feature processing and selection strategies for increasing positive transfer learning.


This report addresses the above challenges and tasks and makes the following contributions:

i. For Task 1, a total of 16 scenarios are designed to investigate the effectiveness of transfer learning.

ii. For Task 2, a transferability metric called the H-score is shown to be a good estimator of transfer learning performance and of the best feature representation to transfer from a specific source domain to a target domain. We discover the following:

(a) Using the H-score to find the best feature representation gives a minimum of 2x speed-up, and up to 96x, compared to using the log-loss after training and testing all possible feature representations for transfer.

(b) The H-score is negatively correlated with the log-loss of the feature representation transferred using partial transfer.

(c) The feature representation with the highest H-score value corresponds to the lowest log-loss value when using full transfer.

iii. Finally, for Task 3, we find that Manual Feature Selection (MFS) is a more viable approach for increasing positive transfer learning than Automatic Feature Selection (AFS).


2 Literature Review

This section provides an overview of the literature relevant to this thesis' work. We first give a brief survey of cloud service management and machine learning, followed by a more detailed review of transfer learning and its related works. Note that the literature in each of these areas is far richer than can be covered within the purposes of this thesis; a high-level overview is therefore given, and the reader can refer to [3], [4] for a more in-depth review of machine learning.

2.1 Cloud & Service Management

Nowadays, the world has been moving rapidly towards the adoption of cloud computing services in a wide variety of applications.

What is cloud computing exactly? It is the delivery of ubiquitous IT and telecom services to clients via the network, such as applications, platforms, storage, networking, computing resources, etc. The aim is to offer flexible, easy-to-access, self-service, scalable distributed resources on demand. It thus removes the necessity for clients (companies and/or end-users) to deploy and manage their own resources, which can be very expensive compared to using cloud services instead [5]. As defined by NIST [6], there are four deployment models: private, public, hybrid and community.

• Private cloud is provisioned to be used only for a single organization.

• Public cloud is designed for open public use. Google App Engine is an example of a Platform as a Service (PaaS) public cloud, OpenStack [7] of an Infrastructure as a Service (IaaS) public cloud, and Office 365 and Salesforce are considered Software as a Service (SaaS) public clouds.

• Community cloud is designed for a specific community that shares a common vision, mission, location, etc. An example of a community cloud is SNIC [8], with Swedish PIs and their collaborators as the base of its community.

• Hybrid cloud is the combination of more than one of the above deployment models.

When services are made available to clients through a network on demand, they are known as cloud services. They are purposefully created to be self-provisioning, scalable, elastic and easy to access for the clients' desired resources and applications. These services thus have the ability to dynamically scale up or down based on demand, which raises the need for cloud service management. That management can be quite challenging and complicated, and this is where performance models come in handy: they can predict the quality of the services as seen from the client's side, which ultimately helps in managing and adjusting the service and infrastructure accordingly.

Cloud computing has been the focus of many researchers, and a number of related works [9], [10], [11], [12], [13] present interesting research topics and promising solutions and approaches in areas such as grid, distributed and cluster computing.

2.2 Machine Learning

Figure 2: Machine learning broad categorization of algorithms.

2.2.1 Overview of Machine Learning

Machine learning has proven to be a very powerful scientific method for data analysis for decades. It aims at building models that learn from data to classify it, recognize patterns, make decisions, learn to play a game, and even make predictions, depending on the problem to be solved. Machine learning algorithms are split into several broad categories, as shown in Figure 2: supervised learning, unsupervised learning, and reinforcement learning. Each category differs in approach, in the task's aim and in the type of data (for instance labeled versus unlabeled data).


Supervised learning is the approach of building a model that learns how to map input data to labeled output data. A basic supervised learning algorithm such as linear regression or Naive Bayes classification involves training the model on training data consisting of one or more input features and the desired output. These models iteratively optimize an objective function (which differs between algorithms) to predict the desired output. Test data is then used to measure how well the model predicts new, unseen data [14]. Supervised learning has two types of tasks, classification and regression, distinguished by the nature of the output Y. If Y is binary categorical, such as Spam or Not-Spam, or multi-class categorical, such as blood type (A, B, AB, O), then the problem is a classification task. If Y is a continuous value, such as time, height or cost, then the problem is a regression task.

Unsupervised learning differs from supervised learning mainly in the absence of labeled output data. Unsupervised algorithms aim at finding structure and patterns in the distribution of the input data [3]. For instance, a clustering algorithm such as K-Means partitions the input data based on closeness to K centroids that define the clusters [15].

Reinforcement learning, as illustrated in Figure 3, aims to learn from experience. This is achieved by having an agent that learns to take suitable actions that maximize the reward in a particular state of the environment. The agent performs an action and receives feedback from the environment that indicates, to some extent, how good or bad the taken action was [4].

Figure 3: Machine Learning Types.

Figure 4: The pipeline flow for a supervised machine learning problem as illustrated in [16].

Due to the nature of the data used in this thesis, illustrated in Figure 1, where samples are presented as pairs of input X and output Y, we focus only on supervised learning for the remainder of the report. We assume n features in the input data and m data records mapped to their desired outputs, for which a model is to be built that predicts said output. The process is illustrated in Figure 4.


Figure 5: The basic structure of a Neural Network with 1 Hidden Layer.

2.2.2 Artificial Neural Networks

Neural networks, as illustrated in Figure 5, form one of the frameworks widely used today for classification tasks. They comprise a large variety of algorithms and have proven very useful in fields such as pattern and image recognition. So what is an artificial neural network? It is a computational adaptation of the structure and functions of our biological neural network: information flows in and out of each neuron for the sake of "learning", thus mimicking how our brains work. The basic unit of a neural network is a neuron, as illustrated in Figure 6. A neuron takes in any number of inputs x_i, weights w_i that specify the importance of each input for the output, and an activation function that produces the output value. A group of neurons together forms a layer, whether an input, output or hidden layer, as shown in Figure 5. Each layer is independent of the others in terms of the number of neurons, activation function, etc. However, the output of each layer serves as the input of its subsequent layer, as illustrated in Figure 5.

Figure 6: Neural Network's basic unit: A Neuron.

In a supervised problem, the whole neural network is trained to learn the weights W = w_1, w_2, ..., w_n that predict the desired output value(s), where the correct class for each record is known (hence the term supervised). The performance of the predicted values can then be determined by how close the predicted output is to the real, correct one, using an error term. This error is used to adjust the weights in the hidden layers in an iterative fashion, with the aim of producing new predicted values that are closer to the real, correct values, and so on [17]. The iterative approach varies depending on the algorithm used, such as backpropagation, resilient backpropagation, etc.

Because of their natural ability to handle a huge number of inputs and infer complex relationships between them, neural networks have proven to be a very successful and powerful framework with a wide range of applications, such as character or image recognition, which comes in handy for real-life problems such as fraud detection, weather forecasting and stock forecasting.


Figure 7: Illustration of differences in the approach of traditional Machine Learning algorithms Vs. Transfer Learning.

However, traditional supervised neural network algorithms work well under a specific assumption: the training and testing data are drawn from the same feature space and distribution. When this assumption fails, most models are re-built and re-trained on the new data, which in the real world might not be possible, or could be very expensive in terms of time, cost or data availability [18].

2.2.3 Transfer Learning

Inspired by our inherent ability to transfer knowledge from a task that we know well to a new but similar or related task, the various transfer learning approaches aim to solve the previously mentioned key challenges of traditional machine learning algorithms: a limited number of observations and labeled data, which can result in very poor performance, and the need to rebuild models from scratch for new tasks with a different feature-space distribution. The main difference between traditional machine learning approaches and transfer learning is illustrated in Figure 7. Below, a formal definition from [18] is provided.

Definition 2.1. Given a source domain D_S and learning task T_S, and a target domain D_T and learning task T_T, transfer learning aims at reducing the cost of learning the predictive model M_T in D_T using the knowledge in D_S and T_S, where D_S ≠ D_T or T_S ≠ T_T [18].

Transfer learning is thus the process of sharing knowledge between domains with different distributions, with the purpose of finding an objective predictive model that leverages both the target domain's data and the source domain's knowledge. Transfer learning can be applied in many areas, such as pattern and image recognition, robotics (where knowledge can be transferred from simulations to real-life robots) and cloud service metric prediction, which is the scope of this thesis.

In terms of knowledge transfer, there are 4 types of knowledge that can be transferred [18]:

i. Parameter transfer: transferring hyper-parameters such as learning rate, number of epochs, batch size, etc.

ii. Instance transfer: re-weighting the source domain's labeled data, which is then transferred for training in the target domain.

iii. Feature-representation transfer: learning a well-suited feature representation for the target domain, where the knowledge to be transferred is embedded in said feature representation.

iv. Relational knowledge transfer: transferring the statistical relationships between data records to the target domain.

The performance of transfer learning is highly dependent on the similarity between the source and target domains. As the distance between them increases, transfer learning becomes more challenging, which in turn affects the prediction performance in the target domain. To the extent of the author's knowledge, there has been a shortage of research on how to measure the similarity between domains for the sake of estimating the performance of transfer learning. However, the authors in [2] present a novel metric, called the H-score, that estimates the performance of transferred representations between domains, and which we use and experiment with in our approach.

This thesis is a continuation of the work presented in [1], [19], [20] and [21]. Many adaptations of the various types of transfer learning are emerging in different applications; the next section presents some of these related works.

2.3 Related Work

Transfer learning has been the focus of numerous papers in research fields such as pattern recognition, image classification, Convolutional Neural Networks (CNNs) and many others. This section therefore provides a brief review of relevant literature.

In [22], the authors present a novel feature selection approach that identifies both variant and invariant features between two different data sets, aiming to improve positive transfer learning for prediction problems, and they show that prediction accuracy improves significantly compared to other transfer learning approaches. Further, in [23], an image classification task is used to investigate how transferable certain features/weights of a neural network are. The authors found that the performance of transfer learning can be strongly affected by the position of the transferred layers, i.e. whether they come from the first, generic layers or the last, task-specific layers. They also show that the performance of transfer learning is affected by the distance between the source and target domains.

Moreover, transfer learning has been used in other machine learning problems, such as reinforcement learning. For instance, in [24], the authors separate the visual transfer task from the control policy and achieve better performance, demonstrating their approach on the Breakout game and Nintendo's car driving game.

In addition, performance prediction has proven well suited to utilizing transfer learning for achieving better results. The authors of [25] use transfer learning with random forests for predicting and classifying server behaviors. They propose building a model from all the training samples collected from multiple small IT environments, which helps a target model with few data samples learn from a transferred model trained on a large data sample.

Finally, the authors in [26] aim at selecting the best source domain for a specific target domain in a natural language processing question-answering problem. They achieve this by using document vector distance (DVD) and term frequency-inverse document frequency (TF-IDF) to measure the distance/similarity between data sets, and they found a correlation between very similar data sets (small distances) and transfer learning performance.

Research in transfer learning is broad and spread across many fields and approaches, and it still receives a great deal of attention today. Since exploring it all goes beyond the purpose of this thesis, we simply point to further promising and intriguing related works [27], [28], [29], [30], [31], [32].


3 Problem Specification

In this section, we describe the motivation and problem specification of the thesis and illustrate the nature of the system under investigation, along with its key challenges. We also define the basic notation that clearly delimits the scope of the project.

3.1 Motivation

The main challenge of this thesis is to investigate solutions and approaches for the system illustrated in Figure 1, where clients access services executed on servers and data centers through the network. These services can be anything, such as a Video-on-Demand (VoD) service or a Key-Value Store (KVS) service, and they can be executed on a stand-alone server or a cluster of servers, in a bare-metal or a virtualized environment. The system under investigation faces several challenges that arise when the execution environment is based on virtual machines or containers that:

i. are short-lived

ii. can migrate from one environment to another

iii. can be scaled up or down depending on demand

When a new execution environment starts for one of the above reasons, there is initially a very limited number of samples and/or labeled data for that environment; more data becomes available only as time passes. Training a model with very few samples, caused by migration or scaling, would yield poor performance and would tend to over-fit due to the lack of data. The need for transfer learning therefore arises, especially when the new environment is still fresh with only a small amount of available data. One can then take a model that is fully trained on the source domain, with good prediction accuracy, and perform transfer learning to improve performance in the new environment (the target domain), which has few samples and/or little labeled data.

3.2 Problem Setting

3.2.1 Aim & Research Challenges

The aim of this thesis is to find a method for estimating the transferability/similarity of a trained source domain to a specific target domain, indicating how well transfer learning would perform. This comes with its own research questions and challenges. Would such a method generalize to any pair of data sets? Can it be used for selecting the best feature representation to be transferred? Given this transferability method, is it possible to select the best source domain for a specific target domain?

This M.Sc. thesis project attempts to tackle these challenges in order to push the state-of-the-art for estimating service metrics in a dynamic cloud environment. This is achieved by the following tasks:

i. Designing cloud service scenarios that show how effective transfer learning is.

• 6 scenarios were previously designed in [2], and the task is to expand on these scenarios to further investigate new use cases.

ii. Finding a method for measuring transferability/similarity between the source and target domains.

• Given a model trained on a source domain that will be transferred to a target domain, the task is to find a transferability or similarity metric that estimates how well transfer learning would improve the target model's performance.

• Compare the transferability metric with transfer learning's performance.

• Use the transferability metric for selecting the best feature representation given a specific source and target domain.

iii. Processing and selecting features to increase positive transfer learning.

• Automatic and manual feature selection methods are investigated for improving transfer learning performance in the target domain.

iv. Investigating the effectiveness of selecting the optimum source domain for a specific target domain.

• Given a fixed target domain, the task attempts to find and select, among the available source domains, the best source domain to increase positive transfer learning and achieve the best performance.


Figure 8: Demonstration of Transferring knowledge between the Source and Target Domains.

As stated in the literature review in Section 2, there is more than one approach and more than one algorithm for transferring knowledge from one domain to another, ranging from transferring instances and parameters to feature representations. Note that many architectures can be used for building the models, such as Random Forests [33], Bayesian Networks [34] and Neural Networks [35].

In this thesis we focus on supervised classification with neural network architectures, for the following reasons:

i. The nature of the problem and the data traces, as previously stated in Section 3, categorizes the problem as a supervised task.

ii. As a continuation of the scenarios, experiments, and results of the previous work in [1], we proceed with neural networks as the base of the models' architectures.

iii. For certain tasks, such as predicting conditional distributions (e.g., p(y|x)), deep architectures are state-of-the-art [36], which makes studying transfer learning in the context of neural networks an important topic.

iv. The findings and contributions in [2] have so far been achieved only for classification problems. In order to be able to use the H-score method, we therefore quantize all the output labels by binning them, turning the problem from a regression task into a classification task.

3.2.2 Notations

In this section, we introduce the notation used for the remainder of the report.

First of all, a domain, D = {X, Y, M, Ŷ}, consists of four components:

i. Input feature space X: where X = {x_1, x_2, ..., x_n}, for n features, is the feature space used to train the model.

ii. Output label Y: where Y is the output variable to be predicted.

iii. Objective predictive model M : X → Y: where M is the model used to predict the output labels, trained with training data from the input feature space.

iv. Predicted label Ŷ: where Ŷ is the set of predicted labels obtained from the predictive model.

Thus a source domain with A samples at time τ consists of an input feature space X_S, an output label Y_S, an objective predictive model M_S : X_S → Y_S, and a predicted label Ŷ_S. Likewise, a target domain with B samples at time τ consists of an input feature space X_T, an output label Y_T, an objective predictive model M_T : X_T → Y_T, and a predicted label Ŷ_T. Transfer learning then aims to learn, at a reduced cost, the predictive model M_T using knowledge obtained both from the target's own domain and from the source domain, as illustrated in Figure 8.


Figure 9: Illustration of our approach’s process to be used when we conduct our transfer-learning scenarios experiments.

4 Approach

In this section, we introduce our approach to the challenges defined in Section 3. We first give an overview of the approach, which is used and elaborated further in the experiments conducted in Section 7. We then dig deeper into the feature selection phase and into estimating transferability using a novel metric called the H-score. We also describe how the target model is built using the partial and full transfer approaches. Finally, we describe the tools used in developing our methods.


4.1 Approaches Overview

Our main contribution is using the H-score to measure transferability and to find the best feature representation to transfer between the source and target domains.

To do so, we test it on 10 different scenarios (see Section 6 for details on the scenarios), and in an attempt to improve performance we use different feature selection and transfer methods. An overview of our approach is illustrated in Figure 9; in general, we do the following for each scenario:

i. Collect and gather the raw data for the source and target domains.

ii. Clean the data from missing values, duplicates, etc.

iii. Split the source and target data into training, test and validation sets.

iv. Perform feature selection, either automatically or manually.

v. Build, train and test a neural network model using the source domain data.

vi. Transfer the source model and its trained weights to the target domain using the partial and full transfer approaches.

vii. Split the data in the target domain into 8 samples of different sizes, ranging from 100 to 20,000 training samples.

viii. Predict the transferability using the H-score for each layer from the source domain with respect to the sample size, for both the partial and the full approach.

ix. Test the transferability by performing transfer learning using partial and full transfer, and compare it with the H-score of the respective layer.

For the purposes of elaboration, assume the source model is a neural network with 5 layers, and that both domains have the same input feature space (i.e. the same features) but different data samples and input feature distributions. The source model is trained and tested on its own training and testing data traces to predict the output label Ŷ_S. To perform transfer learning, the first l layers (and their respective weights) are chosen to be transferred to the target model as the feature representation f_S(X), and r of these transferred layers can be selected to be frozen. One can thus define f_S(X) as an n-dimensional functional representation of the input X obtained from the source domain; when taken at the output layer, it can be viewed as the discriminative predictive feature representation of X, i.e. Ŷ.
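As a concrete illustration, the sketch below (a minimal example, not the thesis code itself) builds a small Keras source model and extracts the feature representation f_S(X) by feeding inputs through its first l layers; the layer sizes, class counts and the commented-out data arrays (X_source, y_source, X_target) are illustrative assumptions.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_features, n_classes = 18, 28      # e.g. the MFS feature set and VoD frame-rate classes

# Source model M_S: a 5-layer fully connected classifier (hypothetical sizes)
source_model = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(n_features,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(32, activation="relu"),
    layers.Dense(32, activation="relu"),
    layers.Dense(n_classes, activation="softmax"),
])
source_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# source_model.fit(X_source, y_source, epochs=20, batch_size=64)   # train on D_S

# f_S(X): output of the first l transferred layers, evaluated on target inputs
l = 3
feature_extractor = keras.Model(
    inputs=source_model.inputs,
    outputs=source_model.layers[l - 1].output,
)
# f_S_of_X_T = feature_extractor.predict(X_target)   # shape (m, 32)
```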

4.2 Feature Selection

4.2.1 Theoretical Background

Feature selection is the process of choosing a subset of features that positively contribute to the prediction of the desired outputs, which in turn can affect the performance of our models [37]. This is because there may be redundant or noisy features that in some cases cause over-fitting and thus low accuracy on the test data. In addition, feature selection helps avoid or reduce the curse of dimensionality by reducing the space complexity, and it can save time thanks to the reduced number of features. There are two types of feature selection: manual and automatic. Manual feature selection is the process of inspecting the feature space and selecting the relevant features using domain knowledge and expertise. Automatic feature selection uses an algorithm that attempts to find a subset of the features that avoids over-fitting, reduces the curse of dimensionality and reduces the model's error. Automatic feature selection methods are categorized into three main types: wrappers (which measure the "usefulness" of each feature), filters (which filter the features based on their "relevance") and embedded methods (which measure the "usefulness" of the features using a guided learning process) [37].

4.2.2 Approach

For the purposes of this thesis, we use both approaches, manual and automatic feature selection. Embedded methods are used for automatic feature selection, based on feature importances obtained from a tree-based model on each source domain. The main reason for using both is that we want to study the impact of each method on transfer learning, and to see whether one of them generally outperforms the other, and why.


4.3 H-score

4.3.1 Theoretical Background

An important question arises when dealing with transfer learning: when is transfer learning effective enough to result in positive transfer, and to what extent? Transfer learning's performance depends heavily on the similarity between the source and target domains: the higher the similarity, the better the transferability and the accuracy, while the larger the differences, the more challenging it is to achieve positive transfer. It is therefore important to be able to estimate how well transfer learning would perform based on the domains' similarity. The most common approach for determining the effectiveness of transfer is experimental: one trains the objective predictive model M_S, calculates its accuracy, transfers knowledge to the target domain D_T, measures the accuracy after transfer learning (TL), and checks whether there is an improvement. This approach is time-consuming, expensive, and might not be very informative about why transfer learning works or does not work. However, a novel, easily computable evaluation function that estimates the performance of a specific feature representation transferred from one domain to the other in classification problems was presented in [2].

This method is highly promising, as it simplifies the transfer learning problem to finding the best source domain for a given target domain, and vice versa. The evaluation function measures how effective the functional representation to be transferred is in representing the target's input feature space with respect to the output space. It thus predicts whether transfer learning would be effective before actually performing any transfer learning, which saves time and potentially helps in locating the best source model, as well as the best source domain, for a specific target model. Below, we define the H-score evaluation function.

Definition 4.1. H-score: Given input data X = x_1, x_2, ..., x_i and output label Y, let f(x) be a k-dimensional, zero-mean feature function as shown in Figure 8. The H-score of f(x) with respect to the learning task represented by P_{Y,X} is [2]:

H(f) = tr( cov(f(X))^{-1} cov( E_{P_{X|Y}}[ f(X) | Y ] ) )    (1)

The H-score for transferring the feature representation f_S from the source domain to the target domain can then be defined as:

H_T(f_S) = tr( cov(f_S(X_T))^{-1} cov( E_{P_{X_T|Y_T}}[ f_S(X_T) | Y_T ] ) )    (2)

The aim of this definition of the H-score is to select the best feature representation for a specific source and target domain, using:

i. the input data X from the target domain,

ii. the output data Y from the target domain, and

iii. the feature representation f_S(X) to be transferred, which can be obtained simply by feeding the target's input data into the source model's neural network, shown in Figure 8, and capturing the output of the desired layer.
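As an illustration of Equation (2), the following minimal NumPy sketch computes an H-score from a feature representation and labels; the function name and the estimation of the conditional expectation by per-class means are our own illustrative choices, not code from the thesis.

```python
import numpy as np

def h_score(f_x, y):
    """H(f) = tr( cov(f(X))^{-1} cov( E[f(X)|Y] ) ), cf. Equations (1) and (2).

    f_x : (m, k) array, feature representation f_S(X_T) of the target inputs
    y   : (m,)   array, discretized target labels Y_T
    """
    f_x = f_x - f_x.mean(axis=0)                  # enforce a zero-mean feature function
    cov_f = np.cov(f_x, rowvar=False)             # k x k covariance of f(X)

    # Estimate E[f(X) | Y = y] by the per-class mean of f(X)
    cond_mean = np.zeros_like(f_x)
    for label in np.unique(y):
        idx = (y == label)
        cond_mean[idx] = f_x[idx].mean(axis=0)
    cov_cond = np.cov(cond_mean, rowvar=False)

    # Pseudo-inverse guards against a singular covariance matrix
    return float(np.trace(np.linalg.pinv(cov_f) @ cov_cond))

# score = h_score(feature_extractor.predict(X_target), y_target)   # hypothetical usage
```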


A high H-score implies a large inter-class variance under the feature representation f_S and a small feature redundancy. One can thus conclude that the higher the H-score, the better the transferred layer is at representing the input data with respect to the output data. In addition, the H-score can be used to select the best source domain for a given target-domain task, which brings us to the next definition.

Definition 4.2. Source task selection: Given N source domains with their respective trained models and labels, and a fixed target domain with its respective input data and output label, let f_{S_1}, f_{S_2}, ..., f_{S_N} be the minimum-error-probability feature functions of the source domains. Selection of the source domain can then be defined as [2]:

T(S_i, T) = argmax_i H(f_{S_i})    (3)

That is, the optimum source domain for the target task is the one whose best feature representation yields the highest H-score with respect to the target domain.
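A minimal sketch of this selection rule, assuming the h_score helper above and a hypothetical list of candidate source representations evaluated on the same target data:

```python
import numpy as np

# candidate_reprs: list of (m, k_i) arrays, one f_{S_i}(X_T) per source domain
# y_target:        (m,) array of target labels Y_T
def select_source(candidate_reprs, y_target):
    scores = [h_score(f, y_target) for f in candidate_reprs]
    return int(np.argmax(scores)), scores    # index of the best source domain
```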

4.3.2 Approach

In order to utilize the H-score in our transfer-learning scenario experiments, we do the following:

i. Build and train the source model with its own data traces.

ii. Split the data in the target domain into 8 samples of different sizes, ranging from 100 to 20,000 training samples.


Figure 10: Demonstration of Transferring knowledge between the Source and Target Domains using Partial Transfer.

iii. Calculate the H-score using Equation (2) for every possible number of frozen layers (1 frozen layer, 2 frozen layers, etc.) in the source model, for each target-domain sample size.

Thus, if we have a source model M_S consisting of 5 layers and a target domain with 8 different sample sizes, we calculate a total of 40 H-score values for one scenario. The main purpose of this is to see the effect of each layer and sample size on the H-score.
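The 5 × 8 grid of H-scores can be computed with a simple double loop; the sketch below assumes the source_model and h_score helpers from the previous examples and a hypothetical list target_samples of (X, y) pairs of increasing size.

```python
from tensorflow import keras

h_grid = {}                                   # (layer index, sample size) -> H-score
for l in range(1, len(source_model.layers) + 1):
    extractor = keras.Model(source_model.inputs,
                            source_model.layers[l - 1].output)
    for X_t, y_t in target_samples:           # 8 target samples, 100 .. 20,000 rows
        h_grid[(l, len(X_t))] = h_score(extractor.predict(X_t), y_t)
```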

4.4 Partial Vs. Full Transfer

For a supervised classification problem, the paper [2] found a positive correlation between the H-score and the accuracy on the training and testing data. However, the authors suggest that this correlation only holds for partial transfer, meaning that all transferred layers and weights are frozen and only the final layer is retrained. Full transfer can still be performed, but then the H-score is used only to select the best feature representation (the one with the highest H-score), even though the correlation no longer holds. Our transfer setup therefore tries both approaches, partial and full transfer, when building the target model: partial transfer to validate the results of [2] and check that the correlation holds for our scenarios as well, and full transfer because we aim for good prediction performance, which its added flexibility may provide. The differences between them are the following:

i. Partial transfer is illustrated in Figure 10. All transferred layers are frozen except the final output layer, which is retrained (e.g., with a softmax output). In other words, l ≤ the number of layers in M_S and r = l, where l is the number of layers transferred from the source model and r is the number of frozen layers.

ii. Full transfer is illustrated in Figure 11. All layers are transferred from the source domain, but only the first r layers are frozen, while the remaining transferred layers are retrained with the target domain's training data. In other words, l = the number of layers in M_S and r ≤ l. The retrained layers can have non-linear activation functions such as the Rectified Linear Unit (ReLU). If the output task differs between the domains (Y_S ≠ Y_T), the output layer in the target domain is replaced with its own layer.


Figure 11: Demonstration of our approach of Transfer Learning using Full Transfer Approach.
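A minimal Keras sketch of the two transfer setups described above, reusing the hypothetical source_model from the earlier example; the layer counts, optimizer and the replacement output layer are illustrative assumptions rather than the thesis implementation.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_target_model(source_model, l, r, n_target_classes=None):
    """Transfer the first l layers of the source model and freeze the first r.

    Partial transfer: l <= len(source_model.layers) and r = l
    Full transfer:    l == len(source_model.layers) and r <= l
    If n_target_classes is given (Y_T differs from Y_S), a fresh output layer is appended.
    """
    target_model = keras.Sequential()
    for i, layer in enumerate(source_model.layers[:l]):
        layer.trainable = i >= r              # freeze the first r transferred layers
        target_model.add(layer)
    if n_target_classes is not None:          # replace the output layer if needed
        target_model.add(layers.Dense(n_target_classes, activation="softmax"))
    target_model.compile(optimizer="adam",
                         loss="sparse_categorical_crossentropy")
    return target_model

# Partial transfer: freeze all l transferred layers, retrain only a new output layer
# partial = build_target_model(source_model, l=4, r=4, n_target_classes=28)
# Full transfer: transfer every layer, freeze the first 3, fine-tune the rest
# full = build_target_model(source_model, l=5, r=3)
```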

Due to the differences between them and the limitations of partial transfer, full transfer in most cases has the potential to outperform partial transfer. The H-score is used in this thesis to predict the transferability between the source and target domains: we first test it with partial transfer only in the target domain, to validate the correlation and check that it also holds for our scenarios. For the remainder of the experiments, full transfer is used to achieve the best transfer learning performance whilst utilizing the knowledge obtained from the H-score.

4.5 Tools

We use Python 3.6.8 for developing the experiments and results of this thesis. The tools used for developing the neural networks, feature selection, H-score calculations, data processing and cleaning, and data splitting are the following:

i. The Pandas library [38] is used for the basic data structures in the data analysis.

ii. The NumPy library [39] is used for mathematical and statistical operations, such as calculating the H-score.

iii. Keras [40] is used for building, training, manipulating and transferring layers in our neural network architectures.

iv. TensorFlow [41] is used as the back-end library for Keras.

v. The scikit-learn library [42] is used together with Keras for our neural networks, as well as for splitting the data into training and testing sets, measuring the accuracy of our models, and pre-processing the data.

vi. The Matplotlib library [43] is used for plotting results such as loss values, H-scores, etc.


5 Testbed & Data Traces

The traces used in this work for building models and running experiments were collected from testbeds located at Ericsson Research and at KTH Royal Institute of Technology in Stockholm. This section elaborates on the traces collected, the setup of the running services, the infrastructure of the servers, the desired output variables, and the generation of load and fault patterns.

5.1 Testbed & Services

The testbed used for creating the traces consists of a server or a cluster of servers and any number of client devices accessing services running on the server(s), as shown in Figure 1. For more details about the testbed's infrastructure, such as the number of cores per server, RAM, processors, hard disks and network interface cards, see [1], [19], [20].

Two types of services are used in the data-collection experiments, a Video-on-Demand (VoD) and a Key-Value Store (KVS) service. They serve as a proof of concept; the approach is not conceptually limited to these two, but for the scope of this project we focus only on VoD and KVS. The VoD service runs on the server(s), where clients play videos through a modified version of the VLC media player; it can be executed in a single-server environment or on a cluster of 6 servers. The KVS service uses Voldemort to let clients read and/or write key-values in a distributed data store. The two services can run separately or simultaneously, and we have data traces for both cases.

5.2 Load Patterns

To test the performance of our servers and to test transfer learning under different distributions of the output space, it is important to try more than one load pattern. In previous works, some of the traces contained up to 5 different load patterns, but we focus on only 2 types in this thesis. Both patterns were previously generated and collected in [44]. The patterns are produced by a load generator issuing requests according to a Poisson process; a small sketch of such load generation is given after the list below.

i. Periodic Load: The arrival rate simulates a sinusoidal function with a period of 60 minutes.

ii. Flash-Crowd Load: The arrival rate simulates a Flash-Crowd model [45]. The load starts off at a low level and reaches peak levels at flash events.
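As a rough illustration (not the actual load generator of [44], [45]), the sketch below draws per-minute request counts from a Poisson distribution whose rate follows a 60-minute sinusoid; the base rate and amplitude are made-up parameters.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

minutes = np.arange(0, 180)                    # three hours of simulated load
base, amplitude, period = 50.0, 30.0, 60.0     # illustrative values (requests/min)
rate = base + amplitude * np.sin(2 * np.pi * minutes / period)   # periodic rate
requests_per_minute = rng.poisson(rate)        # Poisson arrivals at that rate
```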

5.3 Fault Patterns

Similar to load patterns, fault patterns are important in our investigations and experiments, as they simulate a realistic execution environment and thereby make the experimental results more realistic than those of a fault-free system. Faults were injected by the load generator into the host machines to simulate a faulty server execution environment.

In previous work [19], several fault types were introduced: memory, CPU and I/O hogs, where a fault is injected in a specific time slot with probability p (a binomial distribution). In this thesis, we re-use the fault patterns previously generated and collected, and we use all three types of hogs when studying the effect of fault patterns on transfer learning and service metric prediction performance.

5.4 Used Data Traces

In order to perform experiments with these different setups of services, hardware, load and fault patterns, for various use cases, we first need to collect the data and traces. Since this thesis is a continuation of the previous works [1], [19], [20], [21], we re-use the traces previously collected and expand on them. In [1], 6 different traces were used to conduct the experiments; in this thesis, we re-use all of the same traces plus one extra trace for a virtualized server setup, as listed in Table 1. Note that the term fault pattern in Table 1 means that a fault occurred and caused an overload affecting the distribution of the data.

All the traces collected and listed in Table 1 consist of two main components:


Data Trace ID | Service(s) | Server Architecture | Load Pattern | Fault Patterns | Number of Samples
1 | KVS | Bare-Metal | Periodic Load | No Faults | 28,962
2 | VoD | Bare-Metal | Periodic Load | No Faults | 37,036
3 | KVS + VoD | Bare-Metal | Periodic Load | No Faults | 26,488
4 | VoD + KVS | Bare-Metal | Periodic Load | No Faults | 27,699
5 | VoD + KVS | Bare-Metal | FlashCrowd Load | No Faults | 29,151
6 | VoD | Bare-Metal (Single Server) | Periodic Load | No Faults | 51,043
7 | VoD | Virtualized (Single Server) | Periodic Load | (CPU + Memory + I/O) Faults | 35,617

Table 1: Listing of the testbeds used for the evaluation of the scenarios in Table 2.

i. Feature space X

ii. Output variables Y

The feature space X consists of device statistics obtained from the kernel of the Linux operating system running on the server(s), as illustrated in Figure 1. The open-source System Activity Report (SAR) Linux tool [46] is used to collect these kernel data, and it can provide approximately 1700 features per server, including memory utilization, network statistics, I/O operations, and CPU utilization statistics.

The output variables Y are service-level metrics measured at the clients' devices (mobile phones, personal computers, etc.). VoD services have frame rate, audio rate and read operations as service metrics, which are provided and collected using the VLC player [47]; in this thesis, we mainly focus on the display frame rate for VoD services. KVS services have output variables such as Reads Average (average read response time per second) and Writes Average (average write response time per second), which are provided and collected using the customized Voldemort benchmark tool [48].

These traces are used in various setups in our scenario experiments, and a trace can serve as the source or the target domain depending on the scenario under investigation. The main purpose of using several traces is to explore various use cases that capture the effectiveness of transfer learning in advancing state-of-the-art cloud service management.

6 Scenarios

In this section, we use the traces introduced in Section 5 to define several scenarios that help investigate the effectiveness of transfer learning in advancing state-of-the-art cloud service metric estimation. We first categorize the scenarios based on what they attempt to study and test, and we then define the notation for the scenarios and list them with respect to the changes between the source and target domains.

6.1 Scenarios Overview

In this thesis, 16 scenarios are designed to measure several factors that may affect the distance or similarity between the source and target domains, which in turn affects the overall transferability across domains. These factors include the effects of:


i. Feature processing and selection

The hypothesis is that transferability is affected by the feature processing and selection methods applied. Feature processing is applied to all traces, and all scenarios are tested using both manual and automatic feature selection. In addition, 3 scenarios are designed to test the effect of feature processing by:

(a) Varying the feature space size across domains

(b) Introducing network parameters

ii. Changing the hardware environmental conditions

The hypothesis is that changes in hardware conditions can have a large effect on a model's transferability. Therefore, 5 scenarios are designed to test the effect of:

(a) Scaling down the number of servers

(b) Scaling up the number of servers

(c) Moving from a bare-metal server source to a virtualized server target

(d) Moving from a virtualized server source to a bare-metal server target

iii. Changing the software environmental conditions

As with the hardware changes, the hypothesis is that changes in software conditions can also have a significant effect on a model's transferability. Therefore, 10 scenarios are designed to test the effect of:

(a) Moving from a shared to a dedicated environment

(b) Moving from a dedicated to a shared environment

iv. Different workload patterns

The workload affects the service metrics and the X parameters; it is therefore important to test the effect of varying workload patterns between the source and target domains. 3 scenarios are designed to test the load's effect on transferability by:

(a) Moving from a periodic-load source domain to a flash-crowd-load target domain

(b) Moving from a flash-crowd-load source domain to a periodic-load target domain

v. Different fault patterns

Since fault patterns help simulate a realistic system, it is important to test the effect of faults on transferability. The hypothesis is that transferability is low from a fault-free source domain to a faulty target domain, and that the more faults are present in the source domain, the smaller the distance between the domains and the better the transferability. 2 scenarios are designed to test the effect of faults on transferability by varying fault patterns between the source and target domains:

(a) Fault-free environment → all-faults environment

(b) All-faults environment → fault-free environment

Note that the term all faults above means CPU, memory and I/O fault types, or at least overload types.


Finally, we classify each of the 16 scenarios into 4 severity levels. The term severity here refers to how strongly the scenario's changes are expected to affect its transferability. Our approach to classifying the scenarios is intuitive: depending on each scenario's changes in the configuration parameters mentioned above, we hypothesize whether these changes are of low, medium, high or very high severity. The hypothesis depends on the impact each change would have on transferability, as well as on the number of changes between the source and target domains. For instance:

• Moving from a shared to a dedicated environment, or vice versa, is categorized as a low-severity factor.

• Moving from a stand-alone server to a cluster of 6 servers, or vice versa, is categorized as a high-severity scenario.

• Scenarios moving from a faulty, virtualized, stand-alone server to a fault-free, bare-metal cluster of servers, or vice versa, are categorized as very-high-severity scenarios.

6.2 Transfer-Learning Scenario Configurations

Each transfer-learning scenario investigates the effect of changing specific configuration parameters. These changes affect the distance/similarity between the source and target domains, which in turn is the basis for the intuitive severity-of-changes hypothesis. The configuration parameters are the following:


i. The input service metric (X) parameters, I, defined as:

I = {i1: Compute, i2: Compute & Network}    (4)

ii. The server hardware types, H, defined as:

H = {h1: Bare-Metal, h2: Virtualized}    (5)

iii. The software service(s), S, running on the server(s), defined as:

S = {s1: Key-Value Store (KVS), s2: Video-on-Demand (VoD), s3: VoD (single server), s4: (KVS + VoD), s5: (VoD + KVS)}    (6)

iv. The output variable (Y) parameters, O, defined as:

O = {o1: ReadsAvg, o2: WritesAvg, o3: FrameRate}    (7)

v. The workload patterns, W, simulated on the server, defined as:

W = {w1: Periodic Load, w2: Flash-Crowd Load}    (8)

vi. The fault patterns, F, occurring on the server, defined as:

F = {f1: No faults, f2: (CPU + Memory + I/O) faults}    (9)

Table 2 lists the environmental changes for each scenario and their expected severity. For instance, scenario 1 (a low-severity scenario) has the following fixed parameters in both the source and target domains:

i. i1: Compute service metrics are used for the input feature space X.

ii. h1: Bare-metal hardware is used for the server cluster.

iii. o1: Reads Average is used as the prediction output variable Y.

iv. w1: Periodic load is simulated as the workload pattern.

v. f1: No fault patterns are used in the environment.

On the other hand, the running service moves from a dedicated environment (only the Key-Value Store service running) in the source domain to a shared environment (both the Key-Value Store and Video-on-Demand services running) in the target domain.


Scenario | Configuration Pattern Changes (Source → Target) | Severity
1* | {i1, h1, s1 → s4, o1, w1, f1} | Low
2* | {i1, h1, s1 → s4, o1 → o2, w1, f1} | Low
3* | {i1, h1, s2 → s5, o3, w1, f1} | Low
4 | {i1, h1, s4 → s1, o1, w1, f1} | Low
5 | {i1, h1, s5 → s2, o3, w1, f1} | Low
6 | {i2, h1, s5 → s2, o3, w1, f1} | Low
7* | {i1, h1, s2 → s5, o3, w1 → w2, f1} | Medium
8 | {i1, h1, s5 → s2, o3, w2 → w1, f1} | Medium
9 | {i1 → i2, h1, s1, o1, w1, f1} | Medium
10 | {i2 → i1, h1, s2, o3, w1, f1} | Medium
11 | {i2, h1, s2 → s5, o3, w1, f1} | Medium
12* | {i1, h1, s3 → s2, o3, w1, f1} | High
13* | {i1, h1, s3 → s2, o3, w1 → w2, f1} | High
14 | {i1, h1, s2 → s3, o3, w1, f1} | High
15 | {i1, h2 → h1, s3 → s2, o3, w1, f2} | Very High
16 | {i1, h1 → h2, s2 → s3, o3, w1, f1 → f2} | Very High

Table 2: Listing of Scenarios.

*: Scenarios previously used in [1].


Figure 12: The distribution of the 16 scenarios with respect to severity: 37.5% low, 31.25% medium, 18.75% high and 12.5% very high.

7 Evaluation

In this section, we investigate the problem setting defined in Section 3 by applying the approaches described in Section 4 on the scenarios listed in Table 4, using the traces found in Table 1 to conduct our experiments. We first describe the workflow leading to the experiments, starting with the pre-processing of the data and the feature selection methods used, followed by building the ANN models and the H-score experiments, and wrapping up with the results reached.


7.1 Work Flow

7.1.1 Data Pre-Processing

An essential part of machine learning analysis, as illustrated in Figure 4, is pre-processing and cleaning the data traces before applying any algorithms, as it can drastically improve performance in terms of accuracy and training time. We therefore performed several pre-processing steps to prepare the traces in the source and target domains for the transfer learning phase. These pre-processing steps are:

i. Cleaning the traces by dropping any rows with missing values.

ii. Renaming features to make sure all the traces have the same input feature space, X.

iii. Splitting the data into training, testing, and validation sets.

• For the source domain, the data is randomly split into 80% training data and 20% testing data, with no validation set.

• For the target domain, the data is split into 80% training & testing data, while the remaining 20% is used as the validation set. The 80% is further split into training and testing based on the training sample size, which ranges from 100 to 20,000 samples.

This is illustrated in Figure 13.

Figure 13: The splitting of the Target Domain data into training, testing and validation sets for different sample sizes.

iv. The input data for both domains are scaled using the StandardScaler available in the scikit-learn library [42]. This ensures that the data has a distribution with mean value ≈ 0 and standard deviation ≈ 1.

v. The output variable is discretized in order to define the task as a classification task. Discretization is performed by rounding the decimal values to whole numbers, resulting in 28 output classes/categories for the VoD Display Frame Rate task and 99 output classes/categories for the KVS Reads Average task. A minimal code sketch of these pre-processing steps follows below.
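The following is a minimal sketch of the target-domain pre-processing described above, not the thesis code. It assumes the traces are loaded into a pandas DataFrame with a column named "target" holding the output metric; the column name and the sample_size parameter are illustrative assumptions.

```python
# Hypothetical sketch of the target-domain pre-processing: cleaning,
# splitting, scaling and discretization as described above.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def preprocess_target_domain(df: pd.DataFrame, sample_size: int = 1000):
    # i. Drop rows with missing values.
    df = df.dropna()

    # v. Discretize the output by rounding to whole numbers (class labels).
    y = df["target"].round().astype(int).values
    X = df.drop(columns=["target"]).values

    # iii. Hold out 20% as the validation set; the remaining 80% is further
    # split into a training set of `sample_size` samples and a test set.
    X_rest, X_val, y_rest, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X_rest, y_rest, train_size=sample_size, random_state=0)

    # iv. Scale the input features to zero mean and unit standard deviation.
    scaler = StandardScaler().fit(X_train)
    X_train, X_test, X_val = (scaler.transform(a) for a in (X_train, X_test, X_val))

    return (X_train, y_train), (X_test, y_test), (X_val, y_val)
```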

7.1.2 Feature Selection

Feature selection is used to reduce the curse of dimensionality, avoid overfitting, and remove noisy or redundant features. As previously mentioned in Section 4, we use both Manual and Automatic Feature Selection methods.

For Manual Feature Selection (MFS), we use domain knowledge and insights from previous work [19] to select the most relevant features; a total of 18 features are selected and listed in Table 3.


CPU: CPU container, CPU host

Memory: Memory Used, Committed Memory, Swap Used, Cached Swap, Page Fault

I/O per second: Total Transactions, Read Bytes, Write Bytes

Block per second: Block Reads, Block Writes

Network Statistics per second: Received Packets, Transmitted Packets, Received Data (KB), Transmitted Data, TCP sockets, File handles

Table 3: The 18 reduced feature set obtained from manual feature selection.

Whereas for Automatic Feature Selection (AFS), we used a similar approach as in [44], which uses three embedded algorithms: XGBoost, and the RandomForestRegressor and ExtraTreesRegressor algorithms from the scikit-learn library [42]. Since these result in three feature subsets, the union of the subsets is collected and used for the remainder of the experiments. A total of 45 features is used for scenarios that contain single-server traces in either the source or the target domain, while a total of 49 features is used for the remaining scenarios. A minimal sketch of this union-based selection is shown below.
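The sketch below illustrates the union-of-subsets idea only; the importance threshold and the estimator hyper-parameters are illustrative assumptions, not the exact settings used in [44].

```python
# Hypothetical sketch of automatic feature selection: take the union of the
# features selected by three embedded (tree-based) algorithms.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from xgboost import XGBRegressor

def automatic_feature_selection(X: pd.DataFrame, y, threshold_factor: float = 1.0):
    selected = set()
    for model in (XGBRegressor(n_estimators=100),
                  RandomForestRegressor(n_estimators=100),
                  ExtraTreesRegressor(n_estimators=100)):
        model.fit(X, y)
        importances = model.feature_importances_
        # Keep features whose importance exceeds the (scaled) mean importance.
        keep = importances > threshold_factor * importances.mean()
        selected |= set(X.columns[keep])
    return sorted(selected)
```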

7.1.3 Building Neural Network Models

To be able to test the effectiveness of transfer learning, we need to build a model for both domains. Due to the nature of our problem, a supervised classification task, and with the aim of continuing the work from previous papers, we use Deep Neural Networks as the base of our models. The architecture of the model differs based on the service running in the scenario. For Video-on-Demand services, we have a 5-layered network where the number of nodes in the input layer equals the number of features (18 in MFS and 45 or 49 in AFS), each hidden layer has 256 nodes, and the output layer has 28 nodes, based on the quantization/binning performed in the pre-processing phase. For Key-Value-Store services, on the other hand, we have a 4-layered neural network where the number of nodes in the input layer equals the number of features (18 in MFS and 45 or 49 in AFS), each hidden layer has 64 nodes, and the output layer has 99 nodes.

Keras with TensorFlow as the back-end was used for implementing the neural networks. For building and training the Source Model MS, the following parameters were used (a minimal code sketch is given below):

i. Adam optimizer with a learning rate of 0.001

ii. L2 regularization of 0.001

iii. Number of epochs set to 200

iv. Batch size set to 256

v. Cross Entropy (Log-Loss) is used as the loss function

vi. Rectified Linear Unit (ReLU) activation function is used for all the layers.

However, for the partial transfer experiments, the output layer in the Source Model MS has Softmax as the activation function instead of ReLU.
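The following is a minimal sketch of the VoD source model under the stated assumptions, not the exact thesis code: how the 5 layers split into hidden layers is an assumption (input + three hidden + output), as is the use of sparse categorical cross-entropy as the Keras form of the log-loss.

```python
# Hypothetical sketch of the VoD source model described above, built with Keras.
# For the KVS model the hidden width would be 64 and the output size 99.
from tensorflow import keras
from tensorflow.keras import layers, regularizers

def build_source_model(n_features=18, n_hidden=3, hidden_units=256,
                       n_classes=28, partial_transfer=False):
    l2 = regularizers.l2(0.001)
    inputs = keras.Input(shape=(n_features,))
    x = inputs
    for _ in range(n_hidden):
        x = layers.Dense(hidden_units, activation="relu", kernel_regularizer=l2)(x)
    # ReLU on the output layer as stated above; Softmax for partial transfer.
    activation = "softmax" if partial_transfer else "relu"
    outputs = layers.Dense(n_classes, activation=activation, kernel_regularizer=l2)(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Training the source model with the listed hyper-parameters:
# model = build_source_model()
# model.fit(X_train, y_train, epochs=200, batch_size=256)
```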


7.2 H-score Experiments

The experiments are designed to test the H-score and study its effectiveness with our traces and scenarios, aiming to understand the transferability between domains. We prioritized 10 out of the 16 scenarios in Table 2, with varying severities, to test the H-score with respect to accuracy. These prioritized scenarios are listed in Table 4.

For each scenario, the target domain’s data is split into various sample sizes ranging from 100 to 20,000: {100, 200, 500, 1000, 5000, 10000, 20000}. The H-score is computed for every possible number of layers to be transferred from the source to the target model for a given target domain sample size.

For instance, if we have 5 layers in the source model, then we calculate the H-score when transferring only the first layer, the first 2 layers, the first 3 layers, and so on. We then record the correlation between the H-score and the target domain’s log-loss after transferring the source model’s layers. We test this using both the partial and full transfer approaches mentioned in Section 4. Partial transfer is used to test the negative correlation between the H-score and the log-loss, while full transfer is used to improve prediction performance at the target model, since the best feature representation still has the highest H-score.
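As a concrete illustration, the H-score of a candidate representation (e.g. the activations of the transferred layers evaluated on target-domain data) can be computed as in the sketch below, following the information-theoretic definition H(f) = tr(cov(f(X))^{-1} cov(E[f(X)|Y])). This is a simplified sketch rather than the thesis implementation; the use of a pseudo-inverse is an assumption made here for numerical robustness.

```python
import numpy as np

def h_score(features, labels):
    """H-score of a representation f(X) with respect to labels Y:
    H(f) = tr( cov(f(X))^{-1} cov( E[f(X) | Y] ) ).
    `features` is an (n_samples, n_dims) array of layer activations and
    `labels` is an (n_samples,) array of class indices."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    cov_f = np.cov(features, rowvar=False)
    # Conditional expectation E[f(X) | Y = y]: replace each sample's features
    # with the mean feature vector of its class.
    cond_mean = np.zeros_like(features)
    for y in np.unique(labels):
        idx = labels == y
        cond_mean[idx] = features[idx].mean(axis=0)
    cov_cond = np.cov(cond_mean, rowvar=False)
    # Pseudo-inverse guards against a singular feature covariance matrix.
    return float(np.trace(np.linalg.pinv(cov_f) @ cov_cond))
```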

7.3 Results

After conducting the experiments described in the previous subsection with the approaches mentioned in Section 4 on the scenarios listed in Table 4, we

