IT 20 085

Degree project, 30 credits, November 2020

Distributed Edge Cloud Availability

Jiayi Yang

Department of Information Technology


Abstract

Distributed Edge Cloud Availability

Jiayi Yang

With 5G being rolled out across the world, new performance-sensitive applications are emerging in various domains. To meet their performance goals, such as availability and latency, a new distributed edge cloud will be set up near the logical edge of the telecom operator's network. However, lower availability inside the distributed edge cloud could cause significant economic losses to the cloud provider.

An application's availability requirements may be expressed in terms of a Service Level Agreement (SLA). The availability information could therefore be exposed to an application orchestrator to ensure that the applications' SLAs are met. Some early works have addressed availability inside data centers; however, none of the existing works considers availability over a distributed edge cloud system. To cover this gap, this thesis proposes a methodology to assess edge cloud availability. The methodology contains a system model of distributed edge clouds to solve the availability estimation problem. Availability metrics are also defined from the perspective of both edge cloud operators and application developers. Moreover, a simulation framework is extended to evaluate this methodology. The thesis also presents a sensitivity analysis of availability with respect to various parameters. The sensitivity results show that overload delay affects system availability, along with node and link failures, and that node failure is a more dominant factor than link failure. The thesis also makes recommendations for further research.

Examiner: Stefan Engblom

Subject reviewer: Salman Toor

Supervisor: Amardeep Mehta


Acknowledgements

In the name of Allah, Who gives me health, strength and courage to complete my master thesis and my whole master study. I would like to thank and remember Him first and foremost.

This master thesis work was provided by and performed at Ericsson Research, Stockholm. I wish to express my sincere appreciation to my supervisor, Amardeep Mehta, for his tremendous support, guiding inputs and vital ideas. I deeply appreciate the time he spent guiding me, especially in this tough and unusual time.

I would also like to thank Mattias Wildeman and Beatriz Grafulla for their help and support as team leaders.

I would like to thank Salman Toor for his valuable feedback and suggestions while reviewing my thesis. I would also like to show my gratitude to all my colleagues and friends who helped me, directly or indirectly, in achieving this goal.

Last but most important, I wish to express my deepest gratitude to my parents; heartfelt thanks to them for their love and support, both financial and moral, throughout the past 2 years, the past 6 years, and the past 24 years.

And the last of our call will be, “Praise to Allah, Lord of the worlds!”


Contents

1 Introduction

2 Background
2.1 Distributed edge cloud
2.2 Availability
2.3 The model used in this thesis
2.3.1 Different types of models related to this topic
2.3.2 Markov Chains

3 Related Work
3.1 Estimating the availability of a data center
3.2 Modeling the availability of a distributed system
3.3 Modelling the edge cloud

4 System Model
4.1 Topology
4.2 Device model
4.3 Network link model
4.4 Application model
4.5 User demand model
4.6 Failure model

5 Problem Description
5.1 Placement policy
5.2 Metrics definition

6 Simulation framework
6.1 Motivation
6.2 Simulator architecture
6.3 Result evaluation

7 Sensitivity Analysis
7.1 Simulation setup
7.1.1 Infrastructure
7.1.2 Application
7.1.3 Application demand
7.2 Sensitivity results due to workload variation
7.2.1 High locality
7.2.2 Medium locality
7.2.3 Low locality
7.2.4 Comparison of user availability
7.3 Sensitivity results due to placement configurations
7.4 Sensitivity results due to failure types

8 Conclusion & Future Work


List of Symbols

G(V, E)    Topology of an edge cloud system
V          A set of nodes
E          A set of edges
v_i        A node device in the node set V
c_i        The resource capacity of device v_i
ω_i        The processing speed of device v_i
ζ_i        The processing cost of device v_i
e_j        A network link in the edge set E, also denoted as (u, v), representing the network link between node u and node v
d_j        The propagation delay of link e_j
b_j        The bandwidth of link e_j
l_j        The network latency of link e_j
l̄_j        A simplified network latency of link e_j, when bandwidth is omitted
a_x        An application in the set of all applications A
S_x        A set of services which are requested by application a_x
r_x        Resource consumption of application a_x
ξ_x        The size of the request message of application a_x
u_k        A user in the set of all users U
v_k        The gateway of user u_k
t_{x,k}    The maximum acceptable response time by user u_k when requesting application a_x
t̄_{x,k}    The user's maximum acceptable latency, a simplified application deadline when execution time is omitted
A_k        A set of all the applications requested by user u_k
t_0        The user-perceived response time
MTTF       Mean time to failure
MTTR       Mean time to repair
MTBF       Mean time between failures

1 Introduction

In the last decade, cloud computing has emerged as a new paradigm of computing. The vast computing resources of the cloud are leveraged to deliver elastic computing power and storage to support resource-constrained end-user devices. Cloud computing has been driving the rapid growth of many internet companies. For example, the cloud business has risen to be the most profitable sector for Amazon [1], and Dropbox's success depended heavily on Amazon's cloud service.

However, in recent years a new trend has emerged in cloud computing. With 5G being rolled out across the world, the functions of clouds are increasingly moving towards the network edges [2]. It was predicted by Ericsson that 5G will make up around one-fifth of all mobile data traffic by 2023, where 25% of the use-cases will depend on edge computing capabilities [3]. The majority of the new 5G revenue potential is expected to come from enterprise and Internet of Things (IoT) services, of which many will rely on edge computing. It is estimated that more than 24.9 billion IoT-connected devices will be in use by 2025 [4], most of which have limited resources for computing, communication and storage, and have to rely on edge computing for enhancing their capabilities [5]. According to the Ericsson Mobility Report released in June 2020 [6], the total number of mobile subscriptions was about 7.9 billion across the globe, where 5G subscriptions reached 80 million in Q2 2020. By the end of 2025, 5G subscriptions are forecast to reach 2.8 billion. This increase in mobile usage is fundamentally driven by the growth in mobile users and mobile application development [7] [8].

In terms of applications, a wide range of new performance-sensitive applications are emerging in various domains, such as healthcare, transportation, entertainment, etc. An increasing amount of data will be produced and consumed at the edge of the network, i.e. between the mobile devices and the data centers. Furthermore, edge-based applications often have strict performance requirements, such as availability, latency, etc. However, the traditional centralised infrastructure paradigm geographically separates users and infrastructure, and thus does not accommodate mission-critical and performance-sensitive applications.

Therefore, to meet the applications' performance goals, a new distributed edge cloud will be set up near the logical edge of the telecom operator's network.

The distributed edge cloud provides execution resources (compute and storage) for applications, with networking close to the end users. It delivers low-latency, bandwidth-efficient, and resilient end-user services with a global footprint. Thereby, latency-sensitive computation and user interaction components can be placed at the edge nodes, while additional heavy-duty computation and database storage are hosted in the data center nodes. Compared to the traditional centralized cloud, the main benefits that edge solutions provide include low latency, high bandwidth, device processing and data offload, as well as trusted computing and storage [3].

Despite these advantages, edge cloud operators still have to face large-scale infrastructure management challenges. According to Fareghzadeh et al. [9], performance management is a big challenge, because it is fundamental for auditing service quality objectives and policies. For example, applications' availability requirements are typically expressed in terms of the Service Level Agreement (SLA). Lower availability inside a distributed edge cloud could cause significant economic losses to the cloud provider; even the shortest downtime of a distributed edge cloud system can result in significant financial and reputation damage to the providers. For instance, according to the International Working Group on Cloud Computing Resiliency (IWGCR), an outage of one hour can result in a financial cost between $89,000 and $225,000 for highly critical applications [10]. Therefore, the availability information could be exposed to an application orchestrator to ensure the applications' SLAs are met.

Some early works have addressed the availability of data centers, designed with different availability levels and infrastructures [11–15]. To the best of our knowledge, none of the existing works considers the availability of a distributed edge cloud system. In this context, this thesis work addresses the following main question: How to generate computational models to evaluate distributed edge cloud infrastructure availability? To answer this question, we propose a methodology that generates computational models to assess the availability of the edge cloud. Moreover, we extend a simulation framework to implement this methodology, and analyse the sensitivity of the dominant factors that affect edge cloud availability.

The main contributions of this work are:

• proposing a system model to estimate distributed edge cloud availability;

• extending a simulation framework with the system model; and

• performing a sensitivity analysis of the underlying parameters that affect edge cloud availability.

This thesis report is organized as follows:

• Section 2 presents some technical background that is necessary to better understand the proposal;

• Section 3 reviews related works;

• Section 4 proposes the system models associated with the distributed edge cloud;

• Section 5 addresses the main problem and the metrics to be measured;

• Section 6 evaluates the models by a simulation framework;

• Section 7 presents the sensitivity analysis on the edge cloud availability;

• Section 8 concludes this thesis work and delineates future works.


2 Background

2.1 Distributed edge cloud

Since data is increasingly produced at the edge of the network, it would be more efficient to also process the data at the edge, which is not supported by the centralized, data-center-oriented cloud computing model. Edge computing refers to the enabling technologies allowing computation to be hosted at the edge of the network, for downstream data coming from cloud services and for upstream data coming from IoT services [16]. The edge here is defined as any computing and network resources along the path between data sources and cloud data centers. Therefore, the edge cloud is the federation of the data center nodes along with all the edge zones [17]. The edge cloud operator is assumed to have an existing traditional IaaS data center cloud; by adding edge cloud functionality, the operator can extend the cloud's capabilities to deploy applications at the edge networks. Figure 1 shows the presence of the various edge zones in the edge cloud.

Figure 1: The distributed edge cloud (a central data center connected through the Internet and several ISPs to multiple edge zones)

Distributed edge computing in telecom is often referred to as Multi-access Edge Computing (MEC, formerly Mobile Edge Computing). It provides execution resources (compute and storage) for applications, with networking close to the end users, typically within or at the boundary of operator networks. According to the European Telecommunications Standards Institute (ETSI), Multi-access Edge Computing offers application developers and content providers cloud-computing capabilities and an IT service environment at the edge of the network.

According to the white paper published by ETSI [18], MEC can be characterized by:

• On-premises: Mobile edge computing runs in segregated environments, which enhances its performance in machine-to-machine settings. MEC's segregation from other networks also makes it less vulnerable.

• Proximity: Being deployed at the nearest location, mobile edge computing is well placed to analyze and act on big data. It is also beneficial for compute-hungry applications, such as augmented reality, video analytics, etc.

• Lower latency: Mobile edge computing services are deployed at the location nearest to the user devices, which isolates network data movement from the core network. Hence, users experience high quality of service with ultra-low latency and high bandwidth.

• Location awareness: Edge-distributed devices utilize low-level signalling for information sharing. MEC receives information from edge devices within the local access network to discover device location.

• Network context information: Applications providing network information and services based on real-time network data can benefit businesses and events by implementing MEC in their business model. Based on real-time RAN information, these applications can judge the congestion of the radio cell and the network bandwidth, which can help them make smarter decisions for better customer service.

2.2 Availability

According to Toeroe and Tam [19], availability is the percentage of time during which the service is up in a given interval. The QuEST Forum¹ describes availability as the probability that a system is running when it is required. More precisely, [20] introduces the availability of an item or system as the combination of its reliability and maintainability to perform its required function at a stated instant of time or period, as shown in Table 1. Reliability can be defined as the ability of an item to perform its required functions for a stated time and under operational conditions [11], while maintainability can be defined as the ability of an item or system to be restored, using prescribed procedures and resources, to a state in which it can perform its required functions [20]. In short, an item is more available when it is harder to fail, i.e. more reliable, and has a higher recovery rate, i.e. is more maintainable.

Availability can also be calculated as service availability, as described in Equation 1. During the service uptime, the service is operational. The service total time denotes the period over which the system is evaluated, whether operational or not. Therefore, the service total time is the sum of the operational time and the service downtime, as Equation 2 shows. During the downtime, the service is not operational and stays in the repair process until it is finished.

    availability_service = upTime_service / totalTime_service    (1)

¹ QuEST Forum is a global association of companies dedicated to impacting the quality and sustainability of products and services in the ICT (information and communications technology) industry.

Reliability   Maintainability   Availability
Constant      Decreases         Decreases
Constant      Increases         Increases
Increases     Constant          Increases
Decreases     Constant          Decreases

Table 1: Dependency of availability concerning reliability and maintainability [21]

    totalTime_service = upTime_service + downTime_service    (2)

These concepts can be associated with the average behaviour of the system for the availability calculation. In the following formula, availability is calculated as the division of the MTTF (Mean Time To Failure) by the MTBF (Mean Time Between Failures). The MTBF is defined as the sum of MTTF and MTTR (Mean Time To Repair), indicating the time between the detection of a failure and the detection of the next failure, as shown in Equation 3 [11].

    availability = MTTF / MTBF = MTTF / (MTTF + MTTR)    (3)

Some studies, such as [22] and [23], do not employ the MTTF metric; instead, they replace the MTTF definition by MTBF as the mean service uptime, referring to the time after failure recovery until the next failure. In those studies, availability is calculated using Equation 4. The result of this calculation is similar to Equation 3, because MTBF assumes the same meaning as MTTF in these studies.

    availability = MTBF / (MTBF + MTTR)    (4)
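As an illustration of Equations 3 and 4, the following minimal Python sketch computes the steady-state availability from MTTF and MTTR values; the function name and units are our own, not from any referenced work.

    def availability(mttf_hours, mttr_hours):
        # Steady-state availability = MTTF / (MTTF + MTTR), Equation 3.
        return mttf_hours / (mttf_hours + mttr_hours)

    # Example: a component with MTTF = 6000 h and MTTR = 14 h is available
    # 6000/6014, about 99.77% of the time, i.e. roughly 20 hours of downtime
    # per year.
    print(availability(6000, 14))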

2.3 The model used in this thesis

2.3.1 Different types of models related to this topic

The various models used to solve real-life cloud high availability problems can be classified into three categories: non-combinatorial or state-space models; combinatorial or non-state-space models; and hierarchical and fixed-point iterative models [11].

The non-combinatorial or state-space models are good for analyzing and verifying system behaviors. These models can be built using approaches like Markov Chains, semi-Markov processes, Markov regenerative processes, Stochastic Petri Nets (SPN), or Stochastic Reward Nets [11]. These strategies build a state-space structure that models all states and transitions that a system can reach, e.g. failures and repair operations. They permit the representation of complex systems with resource constraints and dependencies between subsystems [24]. However, these models face the state explosion problem, related to the huge number of states in a system, which makes the built model difficult to solve with analytical tools [25].

The combinatorial or non-state-space models enable a high-level and concise representation of the relationship between the components of systems and subsystems [26]. In this class we find Reliability Block Diagrams (RBD), Fault Trees, and Reliability Graphs [11]. Differently from the state-space methods, these methods are free of the state-space explosion problem. The main disadvantage of the combinatorial models is that they do not enable the representation of system activities and processes, such as rejuvenation, repairs, and failures [25].

The hierarchical and fixed-point iterative models mitigate the weaknesses and combine the advantages of the non-combinatorial and combinatorial methods to leverage the analysis and modeling of many kinds of systems. This hybrid approach is commonly used to model systems with multiple components. To model such systems, it is recommended to combine various simple methods to build multiple simple models, rather than using a single sophisticated model [27].

By comparing the advantages and disadvantages of the different types of models, we choose Markov Chains to model the failure of components, since our assumption is that all components fail independently.

2.3.2 Markov Chains

The Markov chain is a mathematical model that experiences transitions from one state to another according to certain probabilistic rules. The defining characteristic of a Markov chain is that no matter how the process arrived at its present state, the possible future states are fixed. In other words, the probability of transitioning to any particular state depends solely on the current state and the time elapsed. The state space, or set of all possible states, can be anything: letters, numbers, weather conditions, baseball scores, or stock performances; in this case, it represents the up state or down state of a component in the edge cloud.

A Markov chain is a stochastic process, but it differs from a general stochastic process in that a Markov chain must be "memory-less": the probabilities of future actions do not depend on the previous steps. Therefore, the key assumption is that events occur independently and with constant rates over time [15]. In this thesis, we assume that the individual components fail independently of each other and recover independently of each other.

The details of the Markov Chain model we introduce are explained in Section 4.

3 Related Work

The related work can be divided into three aspects: the availability of a data center, the availability of a distributed system, and the infrastructure of an edge cloud.

3.1 Estimating the availability of a data center

Several works have introduced methodologies to estimate the availability of data centers. Endo et al. [11] introduced concepts and metrics, e.g. service availability, MTTF, MTTR, etc., reviewed computational modeling techniques, and described the basic models used to estimate and model cloud availability.

Rosendo et al. [12] proposed a methodology to assess data center availability through RBD models. In their approach, the hardware configuration of the data center, as well as the MTTF/MTTR values of all the components, are required as input; the components are then represented as blocks in the RBD, and the RBD model is solved to calculate the overall availability. The reason for using RBD in their work is that it is possible to derive closed-form equations to calculate dependability metrics, and the results are usually obtained faster than with other methods, such as SPN or Markov chains [28]. However, RBD models can represent neither the dynamic behaviour of a system, such as active redundancy mechanisms (warm standby) [22], nor the repair dependencies between different system components [29].

Similarly, Santos et al. [13] introduced models to assess the availability of services running in a cloud data center infrastructure; however, RBD is used to represent the server components, i.e. hardware, operating system, virtual machine, and application instance, while a Stochastic Petri Net (SPN) is used to model the failure behavior of the components. Gomes et al. [14] focused on the cooling system availability in a cloud data center, which is modeled by Petri Net models.

Instead of modeling, Ahmed et al. [30] used a machine-learning-based approach to compute availability, automatically detecting and localizing performance faults for data center services running under dynamic load conditions.

Although the works mentioned above only address the availability inside data centers, the methodologies and techniques they introduce are helpful for our work on the availability of edge clouds.

3.2 Modeling the availability of a distributed system

There are also works focused on the availability of distributed systems, but most of them focus on a specific aspect or a subsystem, instead of the whole system.

Ford et al. [15] proposed a Markov model for data availability in globally distributed storage systems. The authors classified and grouped the failures on storage nodes, analysed their characteristics and contribution to overall unavailability, and then formulated Markov models for the failures on the file system instances, which gives us the idea of how to model node failures in edge clouds.

Meza et al. [31] addressed the reliability characteristics of the network both inter and intra data center, analysed the different types of network failures, and presented the link failure distribution, which helps us model link failures in edge clouds.

3.3 Modelling the edge cloud

Regarding the edge cloud, research suggests that the infrastructure of an edge cloud system is in a multi-rooted forest topology [32], and more usually in a tree topology [33–35]. Tärneberg et al. [36] proposed a system model for the infrastructure, including features and measurements in the data center model, network model, application model, and user model, which provides the main idea of how to model an edge cloud system. An optimistic algorithm for application placement in the Mobile Cloud Network is also presented, which will be used in our simulation to allocate the microservices. In addition, Mehta et al. [37] enriched the application demand model by considering different demand localities, which helps us model the user demand and evaluate our model in different user demand scenarios.

Lera et al. [38] used simulations to evaluate the performance of fog computing; however, when estimating availability they randomly removed nodes as node failures, instead of modeling the failures. This is useful from the edge cloud provider's perspective, when the provider wants to know how the overall availability will be affected if a node has to be shut down. However, since the failures on nodes and links follow an exponential distribution [31], from the perspective of the application developer and user it is necessary to model the failures when assessing the availability of the edge cloud system. Therefore, we use the simulator developed by Lera et al. [39] in our simulation evaluation, and extend the simulation framework with our proposed system models, which include a failure model.

4 System Model

In order to assess the availability of a distributed edge cloud, a system model is proposed as follows:

4.1 Topology

According to the undirected forest topology [32], a general distributed edge cloud architecture is presented in Figure 2. Three layers can be identified: 1) the cloud layer; 2) the edge layer; and 3) the user layer. Three types of devices can also be identified: the devices of the cloud provider on the cloud layer; the user gateways, through which the clients access the system; and the edge devices, i.e. the network devices between the cloud and the user gateways. A device here can refer to a data center, a cloudlet subcenter, a fog device, or an IoT device; all devices have resources to allocate and execute services. The devices communicate with each other through network links, which can be either wired or wireless.

In this work, we want to estimate the availability of edge clouds owned by mobile network operators. As far as we know, existing 3rd and 4th generation mobile access networks are generally tree-structured [40], and 5th generation networks are expected to follow the same structure [36]. Therefore, in the experiments, the infrastructure is composed of a set of edge devices distributed in a tree graph of a certain depth. Furthermore, the capacity, reliability, processing speed and cost of the nodes vary with depth [37]. The nodes are connected with each other through network links, whose latency progressively decreases with depth.

Application demand originates from the leaf nodes in the network and propagates to the devices where the application services are hosted. We assume that the cloud device stores the images of all the services or their encapsulating elements, for example containers [41]. If it is necessary to deploy and execute a new instance of a service on an edge device, this device only needs to request the image of that service from the cloud. The cloud can also execute an instance of a service if the placement algorithm considers that this can improve the performance. Note that we also assume that the services are shared between the users, so it is not necessary to deploy one service per user. Thus, a user can request any of the instances of a service in the edge cloud architecture, but usually requests the closest one.

Figure 2: The edge cloud architecture (cloud layer, edge layer, and user layer)

The topology can be represented as a network graph G = (V, E), in which the nodes V = {v_i | i = 0, 1, ..., n − 1} represent the edge cloud devices, and the edges E = {e_j | j = 0, 1, ..., m − 1} represent the network links between the devices. Detailed models of the devices and network links are explained in the following sections.

4.2 Device model

Each device is denoted as v_i, where i ∈ {0, 1, ..., n − 1}, and is defined by the following features:

• available capacity of resources c_i;

• processing speed ω_i, measured in terms of instructions per unit of time;

• cost per unit of time ζ_i.

The available resource capacity c_i is a vector containing the capacities of each physical component. For simplicity, we consider a scalar value to measure the resource units. It represents the capacity of one component, for example, the number of cores for the Central Processing Unit (CPU), gigabytes (GB) for the main memory, terabytes (TB) for the hard disk, etc. This resource unit could easily be extended by including as many units as necessary and considering a vector; for example, c_{i,p} denotes the CPU capacity and c_{i,m} denotes the memory capacity, etc.

4.3 Network link model

A network link e_j = (u, v) is identified by the two connected nodes u and v, i.e. u, v ∈ V. We consider bidirectional communication, i.e. (u, v) = (v, u). The network links are defined by the following features:

• propagation delay d_j;

• network bandwidth b_j;

• network latency l_j, which can be calculated as follows:

    l_j = d_j + ξ / b_j,    (5)

where ξ is the size of the packet to be transmitted.
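As a minimal sketch, Equation 5 translates directly into code; the parameter names and units (milliseconds and bytes) are illustrative assumptions, not from the thesis.

    def link_latency(d_j_ms, b_j_bytes_per_ms, xi_bytes):
        # Equation 5: propagation delay plus transmission time of the packet.
        return d_j_ms + xi_bytes / b_j_bytes_per_ms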

4.4 Application model

The application model is based on a distributed data flow (DDF) [42]. An application is defined by microservices and the messages among microservices [43–45].

Microservices is an architectural style that structures an application as a collection of single-purpose, loosely-coupled services [46]. Compared to traditional monolithic services, which encompass the entire application functionality in a single binary, microservices allow individual components of the application to be elastically scaled, simplifying and accelerating development, with each microservice being responsible for a small subset of the application's functionality. Since microservices enable rapid, frequent and reliable delivery of large, complex applications, more and more large cloud providers, such as Amazon, Twitter, Netflix, Apple, and eBay, have adopted the microservices application model [47].

Thus a DDF is represented with a directed cyclic graph where nodes denote microservices and edges denote the transmissions between microservices.

There are different types of applications. For instance, some requests traverse more than one component of the application with a certain probability, while some requests only require one component; the transmission between the services follows a Markov process. Furthermore, multi-component applications also differ in their task goals: for example, they can be compute-intensive or I/O-intensive [48–52]. Sysbench is an example of a compute intensive benchmark application, while PostMark is a benchmark for I/O intensive applications.

Therefore, we model each application a_x with the following features:

• services S_x = {s_y | y = 1, 2, ..., Y}, the set of services s_y which are required by application a_x;

• resource consumption r_x, a vector that measures the consumption of each physical component, similar to the available resources of a device;

• the size of the request message ξ_x.

Let A = {a_x | x = 1, 2, ..., X} denote the set of all the applications hosted in the edge cloud system.

4.5 User demand model

The user u_k is defined by:

• the user gateway v_k ∈ V, the gateway device from which the user requests applications;

• the maximum acceptable response time t_{x,k}, defined by user u_k when requesting application a_x.

Let A_k denote the set of all the applications requested by user u_k.

The application demand model is constructed using two parameters that enable us to capture time variation and spatial variation of each scenario.

According to Mehta et al. [37], the application demand locality can be modeled by the relative demand (µ) from leaf nodes that aggregates at specific layers in the edge network. In the experiments, we will consider three types of application demand locality: high locality, medium locality, and low locality, which are described in the following sections.

The relative demand (µ) of an application can be chosen in the range from 0 to 1 when simulating the application end-user distribution scenarios. For example, when application demand originates from two different leaf nodes, if µ represents the demand coming from the first leaf node, then (1 − µ) is the demand coming from the second leaf node. If the value of µ is 1, the whole demand comes from the first leaf node only, while if µ is 0.5, the demand comes equally from the two leaf nodes.

Each user workload follows an exponential distribution with an arrival rate λ.
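A minimal sketch of this demand model is shown below, assuming two leaf-node gateways and a global arrival rate; the function and variable names are illustrative, not taken from the simulator.

    import random

    def generate_workload(lam, mu, horizon_ms, gateways=("v3", "v5")):
        # Exponential interarrival times with rate lam; a fraction mu of the
        # requests originates at the first gateway, (1 - mu) at the second.
        t = 0.0
        while True:
            t += random.expovariate(lam)
            if t > horizon_ms:
                return
            gateway = gateways[0] if random.random() < mu else gateways[1]
            yield (t, gateway)

    # Example: 1/lambda = 10 ms, mu = 0.7, over 100,000 ms of simulated time.
    events = list(generate_workload(lam=1 / 10, mu=0.7, horizon_ms=100_000))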

4.6 Failure model

We model the failure of components as a Markov Chain, with the assumption that the individual components fail independently of each other, and recover independently of each other.

Each device and network link has two states, up and down. When a device (i.e. node) is down, it is not accessible by any link, it cannot participate in any computation, and the services on it are unavailable. When a network link is down, devices cannot communicate with each other through this link.

We assume that the MTTF and MTTR of each component are known. Let η denote the rate of failure:

    η_i = 1 − MTTF_i / (MTTF_i + MTTR_i) = MTTR_i / (MTTF_i + MTTR_i).    (6)

Similarly, let ρ denote the rate of repair:

    ρ_i = 1 − MTTR_i / (MTTF_i + MTTR_i) = MTTF_i / (MTTF_i + MTTR_i).    (7)

Therefore, if component i is up at moment t, then the probability of i failing (i.e. its state changing from up to down) at moment t + 1 is:

    P_i(up → down) = η_i,    (8)

otherwise, component i stays up.

Similarly, if component i is down at moment t, then the probability of i being repaired (i.e. its state changing from down to up) at moment t + 1 is:

    P_i(down → up) = ρ_i,    (9)

otherwise, component i stays down.

The Markov Chain model is shown in Figure 3.

Figure 3: Failure model (a two-state Markov chain: up transitions to down with probability η, down transitions to up with probability ρ)
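A minimal sketch of this two-state failure model, assuming one transition per discrete simulation tick; the class name and the tick granularity are our own assumptions.

    import random

    class Component:
        # Two-state Markov chain of Figure 3, built from Equations 6 and 7.
        def __init__(self, mttf, mttr):
            self.eta = mttr / (mttf + mttr)   # failure rate, Equation 6
            self.rho = mttf / (mttf + mttr)   # repair rate, Equation 7
            self.up = True

        def step(self):
            # Equations 8 and 9: fail with probability eta while up,
            # recover with probability rho while down.
            if self.up:
                self.up = random.random() >= self.eta
            else:
                self.up = random.random() < self.rho

    # The long-run fraction of up ticks approaches MTTF / (MTTF + MTTR).
    leaf = Component(mttf=10000, mttr=10)
    ups = 0
    for _ in range(1_000_000):
        leaf.step()
        ups += leaf.up
    print(ups / 1_000_000)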


5 Problem Description

This thesis aims to develop a method for estimating the availability of a distributed edge cloud system. In this section, we first discuss the service placement policy, then we introduce the availability metrics that will be measured.

5.1 Placement policy

The placement configuration is selected from the application developer's perspective, to minimize the total cost while meeting the performance constraints. According to the heuristic and the optimisation algorithm introduced by Tärneberg et al. [36], the optimisation problem can be formulated as

    min J(s_k) = Σ_{i=0}^{n−1} (ζ_i · r_k / ω_i)   s.t.   ( Σ_{l_j ∈ path(u_x, s_k)} l_j ) / l̄ ≤ 1,    (10)

where ω_i and ζ_i are the speed and cost per time unit of node v_i, r_k is the resource consumption of service s_k, and l_j denotes the latency of link e_j. That is, we search for the placement configuration of service s_k with the lowest total cost J(s_k), such that the total latency from user u_x is no longer than the maximum acceptable latency l̄.
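The following brute-force sketch illustrates the structure of Equation 10: enumerate candidate nodes, discard those whose path latency violates the constraint, and keep the cheapest. It is a simplification for a single service, not the actual heuristic of [36], and all names and values are illustrative assumptions.

    def place_service(nodes, path_latency, r_k, l_bar):
        # nodes: node id -> (omega_i, zeta_i); path_latency: node id ->
        # summed link latency from the user gateway to that node.
        best, best_cost = None, float("inf")
        for v, (omega, zeta) in nodes.items():
            if path_latency[v] / l_bar > 1:      # latency constraint of Eq. 10
                continue
            cost = zeta * r_k / omega            # cost term of Eq. 10
            if cost < best_cost:
                best, best_cost = v, cost
        return best, best_cost

    # Toy instance with the speeds and costs of Table 2, user at a leaf node:
    nodes = {"v0": (1000, 1.0), "v1": (500, 1.5), "v3": (200, 2.0)}
    path_latency = {"v0": 8, "v1": 3, "v3": 1}
    print(place_service(nodes, path_latency, r_k=1.0, l_bar=5))  # picks v1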

5.2 Metrics definition

We define an application a_x requested by user u_k as available when a cluster of devices and network links, C(V, E), exists whose total resource capacity meets the application requirement r_x, while the user-perceived response time, i.e. the time between a specific application request being sent by the user and the application finishing execution, does not exceed the application deadline t_{x,k}. The user-perceived response time includes the network delay of the requests between services and the response times (execution and waiting time) of the services. The available state can be expressed as

    ∃ C(V, E)  s.t.  Σ c_i ≥ r_x  and  Σ (r_x / ω_i + l_j) ≤ t_{x,k},    (11)

i.e. the total response time does not exceed the application deadline while the resource requirements are met.

Based on the service availability introduced in Section 2, we define two types of availability assessment: from a user perspective and from a system perspective.

From a user perspective, a user considers the system available if he or she can communicate with the system, send a request, and get feedback within an acceptable time. Therefore, we define the system availability as a deadline satisfaction ratio, the percentage of application requests that are processed before the application deadline t_{x,k}. t_0 is used to denote the user-perceived response time, i.e. t_0 = Σ (r_x / ω_i + l_j). The equation for the system availability is then:

    sysAvailability(A_k) = |t_0 ≤ t_{x,k}| / |A_k|,    (12)

where |A_k| is the number of times that user u_k requests applications a_x ∈ A_k, and |t_0 ≤ t_{x,k}| is the number of those requests that satisfied the deadline.

From a system perspective, the user availability is defined as the ratio of the users that are able to reach all the services of their applications out of all the users. For example, if no failure exists in any component, the user availability is 1.0. However, devices and network links commonly fail, cutting the shortest paths between the users and the services, and shutting application services down. At best, such failures only increase the network delay (due to requests going through another, longer path) and degrade the deadline satisfaction ratio. However, the effects can also prevent a user from reaching all the application services, which degrades the user availability ratio. The equation for the user availability is:

    userAvailability(a_x) = |{u_k : ∃ path from u_k to a_x}| / |{u_k : u_k requests a_x}|.    (13)
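A minimal sketch of how Equations 12 and 13 can be computed over a log of simulated requests; the record fields used here are illustrative assumptions, not the simulator's actual schema.

    def sys_availability(records):
        # Equation 12: fraction of requests that met their deadline.
        met = sum(1 for r in records
                  if r["reached"] and r["response_time"] <= r["deadline"])
        return met / len(records)

    def user_availability(records, app):
        # Equation 13: users with a working path to all services of `app`,
        # out of all users that requested `app`.
        requesters = {r["user"] for r in records if r["app"] == app}
        reached = {r["user"] for r in records
                   if r["app"] == app and r["reached"]}
        return len(reached) / len(requesters)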


6 Simulation framework

In this section, a simulation framework is extended to evaluate the proposed model.

6.1 Motivation

To evaluate the proposed model, an analytical approach could be used; however, to deal with the randomness and uncertainty due to the applications' workload, a lot of extra work is needed, for example using a penalty-barrier function to penalize compute usage beyond capacity [36]. Instead, an event-driven simulator can handle the complex scenarios more simply and helps us understand the sensitivity due to the applications' probabilistic workload distribution.

Existing simulation frameworks such as CloudSim [53] and YAFS (Yet Another Fog Simulator) [39] offer competent network and data center models. However, neither framework models failures on components. Therefore, a simulation framework based on YAFS was extended with the system model detailed in Section 4.

6.2 Simulator architecture

The extended simulator utilises SimPy [54] as the underlying event-driven framework, and is constructed around the proposed system model detailed in Section 4, which represents an edge cloud topology with edge devices, network links, applications, failures on the components, and user demands.

The simulator is defined by six main classes: core, topology, selection, placement, population, and application. The simulator architecture is shown in Figure 4.

Topology is the main element of the core class; the core class manages the simulation execution and controls the life cycle of the processes. The customized policies (selection, placement, and workload population) can interact dynamically with the simulation execution through customized algorithms.

Figure 4: The simulator architecture (the core deploys applications together with the selection, placement and population policies on a networkx-based topology, runs on SimPy, and outputs message execution and link transmission logs as Pandas DataFrames, from which the system availability is computed)

After defining the topology and the application, the application microservices are allocated under the placement policy specified in Section 5. The input to the simulator is a time series workload composed of a quantity of demand for each application at each leaf node for each time instance, following an exponential distribution. The workload is propagated through the network to the devices on which the required microservices are deployed. During the simulation, the availability status of each node and link follows the failure model detailed in Section 4.6: when a node or link fails, it is removed from the topology until it is repaired.

The output results are stored in CSV-based files, which record detailed information about every application message. In addition to basic information such as the message id and message type, the output also contains the routing information, i.e. source node and destination node, etc., and the timestamps during the life cycle of the message.

Figure 5: Timestamps generated by the discrete-event simulator (emit, reception, in, out) and the post-computed times of a message (latency, waiting time, service time, response time, and total response time)

Figure 5 shows the four timestamps involved in the transmission of a message from the source device to the destination device where the application service is located. The timestamps are relative to the simulation time, since it is a discrete-event simulator. The emit time represents the emission time of a message at the source device. The reception time is recorded when the message arrives at the destination node. When the message arrives, it may be held in a waiting queue; we record the entry into and the exit from the service (the in time and out time, respectively).

6.3 Result evaluation

With the timestamps described above, useful measures such as the latency, waiting time, response time and total response time can be computed. An application request is considered a Success if and only if the total response time is not longer than the user's maximum acceptable response time, or deadline, i.e. t_0 ≤ t_{x,k}. Furthermore, messages that fail to arrive at the destination node are marked, and application requests that fail to return a result to the user (due to a node or link failure) are noted.

Therefore, the system availability can be calculated by:

    System Availability = Success Requests / Total Requests.    (14)
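As a minimal sketch, assuming the CSV output has been loaded into a pandas DataFrame with one row per request and illustrative column names (returned, time_total_response, deadline), Equation 14 becomes a one-liner:

    import pandas as pd

    df = pd.read_csv("simulation_results.csv")
    success = df["returned"] & (df["time_total_response"] <= df["deadline"])
    print("System availability:", success.mean())   # Success / Total, Eq. 14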

The validity of the simulator has been verified by producing the same results as the analytical approach when solving simple scenarios; the simulation results are accurate and reliable.

7 Sensitivity Analysis

In this section, we perform a sensitivity analysis of the parameters that affect the availability, namely the demand variations, the placement configuration, and the failure types, using the proposed simulation method.

7.1 Simulation setup

In the experiments, we run simulations over 100,000 milliseconds of simulated time, with the following setup:

7.1.1 Infrastructure

The infrastructure employed in the experiments is distributed in a tree graph as in Figure 6, and the parameters are presented in Table 2.

Figure 6: Infrastructure topology (root layer: v_0; intermediate layer: v_1, v_2; leaf layer: v_3, v_4, v_5, v_6; the user layer attaches below the leaf layer)

Property                           Value
Node Capacity [37]
  Root                             1000
  Intermediate                     10
  Leaf                             1 (Reference)
MTTF/MTTR (h) [31, 55–57]
  Root                             6000/14
  Intermediate                     8000/12
  Leaf                             10000/10
  Link                             2500/13
Processing Speed (Intrs/ms) [38]
  Root                             1000
  Intermediate                     500
  Leaf                             200
Cost [37]
  Root                             1 (Reference)
  Intermediate                     1.5
  Leaf                             2
Link Latency (ms) [38]
  Root ↔ Intermediate              5
  Intermediate ↔ Leaf              2
  Leaf ↔ User                      1

Table 2: Infrastructure parameters
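As a minimal sketch, the topology of Figure 6 with the parameters of Table 2 can be expressed as a networkx graph, the library the simulator builds on; the attribute names are illustrative assumptions.

    import networkx as nx

    G = nx.Graph()
    layer = {"v0": "root", "v1": "intermediate", "v2": "intermediate",
             "v3": "leaf", "v4": "leaf", "v5": "leaf", "v6": "leaf"}
    capacity = {"root": 1000, "intermediate": 10, "leaf": 1}
    speed = {"root": 1000, "intermediate": 500, "leaf": 200}  # Intrs/ms
    for v in layer:
        G.add_node(v, layer=layer[v], capacity=capacity[layer[v]],
                   speed=speed[layer[v]])

    # Tree edges with the link latencies of Table 2 (ms).
    for u, v, lat in [("v0", "v1", 5), ("v0", "v2", 5), ("v1", "v3", 2),
                      ("v1", "v4", 2), ("v2", "v5", 2), ("v2", "v6", 2)]:
        G.add_edge(u, v, latency=lat)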

7.1.2 Application

The application is modeled by two microservices, s_1 and s_2, as shown in Figure 7, and the parameters are listed in Table 4. The user workload follows an exponential distribution.

As discussed before, there are two types of application request: I/O intensive and compute intensive. For an I/O intensive request, once service s_1 receives message m_{0-1} from the user, it consumes the message and sends the result back to the user. For a compute intensive request, after service s_1 processes message m_{0-1}, another message m_{1-2} is triggered and sent to service s_2, and service s_2 then sends the result back to the user. As an example, with the fraction of both request types set to 50%, the transmission can be represented by the Markov matrix in Table 3. The execution of an I/O task is 500 instructions per request, while a compute task requires a further 2000 instructions per request at s_2 (see Table 4).

Figure 7: Application topology (the user sends m_{0-1} to s_1; s_1 either replies with m_{1-0} or forwards m_{1-2} to s_2, which replies with m_{2-0})

        user   s_1   s_2
user    0      1     0
s_1     0.5    0     0.5
s_2     1      0     0

Table 3: Application requests transmission

Property                 I/O        Compute
Required Services        {s_1}      {s_1, s_2}
Request Fraction         50%        50%
Execution (Intrs/req)    500        500 + 2000
Deadline (ms)            6          20

Table 4: Application parameters
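A minimal sketch of the request flow encoded by the Markov matrix of Table 3; the function name is illustrative. Each walk returns to the user either directly after s_1 (an I/O request) or via s_2 (a compute request), each with probability 0.5.

    import random

    TRANSITIONS = {"user": {"s1": 1.0},
                   "s1": {"user": 0.5, "s2": 0.5},
                   "s2": {"user": 1.0}}

    def simulate_request():
        # Walk the Markov matrix of Table 3 starting from the user until
        # the reply returns; the visited services identify the request type.
        state, visited = "user", []
        while True:
            options = TRANSITIONS[state]
            nxt = random.choices(list(options),
                                 weights=list(options.values()))[0]
            if nxt == "user":
                return visited   # ["s1"] for I/O, ["s1", "s2"] for compute
            visited.append(nxt)
            state = nxt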

7.1.3 Application demand

In this experiment, we consider three types of application demand locality. Low locality applications are modeled with demand that aggregates at the root layer; leaf nodes v_3 and v_5 are chosen for this purpose. Similarly, medium locality applications are modeled with user demand that aggregates at the intermediate layer; we choose leaf nodes v_3 and v_4 for this purpose. We also model high locality applications, with the user demand aggregated at leaf node v_3. The settings are listed in Table 5.

Locality   User Gateways
High       v_3
Medium     v_3, v_4
Low        v_3, v_5

Table 5: Application demand settings

Furthermore, we choose values of µ in the range from 0.5 to 1 for the experiments in the medium locality and low locality scenarios. For the high locality scenario, we instead vary the arrival rate λ of the user workload distribution.

7.2 Sensitivity results due to workload variation

In this part, we analyse the system availability under different demand locality scenarios. For high locality, we focus on the sensitivity due to the workload variation over time, while for medium and low locality, we analyse the sensitivity due to the spatial variation of the workload. In addition, a comparison of the user availability is also presented.

Note that in the experiments, the same number of users is used and the same number of services is placed in all scenarios; only the time or spatial distribution of the user demand is changed.

7.2.1 High locality

To analyse how overload affects the system availability, we first assume a cost-sensitive scenario, where the users have a maximum acceptable network latency of 5 ms for I/O intensive tasks and 10 ms for compute intensive tasks. According to the search algorithm in [36], the placement configuration shown in Figure 8 is selected for the high locality scenario, as the most economical configuration that meets the latency requirement: service s_2 is allocated on node v_0 and service s_1 on node v_1.

Figure 8: Microservices placement for the high locality scenario (s_2 on v_0, s_1 on v_1, users at v_3)

With this placement configuration, according to the results of the user availability, for I/O intensive tasks the user availability is 1 during 99.66% of the simulation time; in other words, the user availability is under 1 for 28.9 hours per year. Similarly, for compute intensive tasks, the user availability is 1 during 99.34% of the simulation time, corresponding to 56.4 hours of downtime per year.

To estimate the system availability with the message queues taken into account, we simulate with different values of the arrival rate λ, with the interarrival time 1/λ ranging from 4 to 25 ms. The results are shown in Table 6 and Figure 9.

1/λ (ms)   System Availability
4          74.84%
6          90.14%
8          94.66%
10         96.37%
15         97.89%
20         98.51%
25         98.56%

Table 6: System availability in the high locality scenario

Figure 9: System availability in the high locality scenario (system availability versus interarrival time 1/λ)

The time variation results suggest that when the interarrival time 1/λ is long enough, there is no overload and no waiting queues in the system, and the system availability stabilizes at a high level, around 98.5% in this case. But if the interarrival time 1/λ is shorter than 10 ms, the heavy demand may decrease the system availability, since the increasing waiting time leads to longer response times. Hence, overload is a main factor that affects edge cloud availability.

However, since system overload can be overcome by adding more capacity or using an auto-scaling placement algorithm, we focus on the impact of failures on availability in the following experiments. We therefore analyse the factors affecting availability under spatially varying demand with a longer interarrival time 1/λ, higher than 10 ms according to the time variation results, in order to reduce the influence of overload.

7.2.2 Medium locality

To analyse how the factors affect the availability when the user demands vary spatially, we run simulations with cost-sensitive applications, the same as the applications in the high locality experiments, and choose 10 ms as the global interarrival time 1/λ.

For the medium locality scenario, the placement configuration shown in Figure 10 is selected as the economy and performance trade-off: service s_2 is allocated on node v_0 and service s_1 on node v_1.

Figure 10: Microservices placement for the medium locality scenario (s_2 on v_0, s_1 on v_1, users at v_3 and v_4)

With this placement configuration, according to the results of the user availability, for I/O intensive tasks the user availability is 1 during 99.47% of the simulation time; in other words, the user availability is under 1 for 45.7 hours per year. Similarly, for compute intensive tasks, the user availability is 1 during 99.15% of the simulation time, corresponding to 73.2 hours of downtime per year.

To compare the system availability between local demand distributions, we choose values of the relative demand µ in the range from 0.5 to 1. The results are presented in Table 7 and Figure 11, which show that when the user demand follows a more equal distribution, i.e. µ close to 0.5, the system availability is higher than when the user demand aggregates at one single leaf node.

µ     System Availability
0.5   94.62%
0.6   94.76%
0.7   93.27%
0.8   92.37%
0.9   91.10%
1     89.03%

Table 7: System availability in the medium locality scenario

Figure 11: System availability in the medium locality scenario (system availability versus relative demand µ)

To find the reason for this result, we observe the response time of each application request and count the number of failed tasks, as shown in Figure 12.

Figure 12: Reasons that affect the medium demand locality availability, plotted against µ: (a) average response time of I/O intensive tasks; (b) average response time of compute intensive tasks; (c) number of failed requests

In addition to the increasing number of failed tasks, which may lead to lower availability, the response time also increases as µ approaches 1. That might be because, when the user demand aggregates, there is a heavy workload at one node or one link in the system, which may increase the response time and thereby affect the system availability. Therefore, both failures and overload delay affect the system availability when the user demand varies spatially.

7.2.3 Low locality

For the low locality scenario, the placement configuration shown in Figure 13 is selected, where service s_2 is allocated on node v_0 and service s_1 on both node v_1 and node v_2. The total number of application requests is the same as in the medium demand locality experiments.

Figure 13: Microservices placement for the low locality scenario (s_2 on v_0, s_1 on both v_1 and v_2, users at v_3 and v_5)

With this placement configuration, according to the results of the user availability, for I/O intensive tasks the user availability is 1 during 99.32% of the simulation time; in other words, the user availability is under 1 for 58.9 hours per year. Similarly, for compute intensive tasks, the user availability is 1 during 98.96% of the simulation time, corresponding to 90.3 hours of downtime per year.

Turning to the system availability, we choose the same values of the relative demand µ as in the medium locality experiments; the results are presented in Table 8 and Figure 14.

µ     System Availability
0.5   96.23%
0.6   95.85%
0.7   94.34%
0.8   92.48%
0.9   89.79%
1     86.16%

Table 8: System availability in the low locality scenario

Figure 14: System availability in the low locality scenario, compared with the medium locality scenario (system availability versus relative demand µ)

We also observe the response time of each application request and count the number of failed tasks, as shown in Figure 15.

Figure 15: Reasons that affect the low demand locality availability, plotted against µ: (a) average response time of I/O intensive tasks; (b) average response time of compute intensive tasks; (c) number of failed requests

The results in the low locality scenario show a similar trend to the medium locality scenario. Comparing the values, the system availability in the low locality scenario is higher than in the medium locality scenario when the demand is more evenly distributed (µ ≤ 0.8), although it drops below the medium locality values when the demand aggregates at one leaf node (µ ≥ 0.9). Since the total number of user requests is the same according to our assumption, the higher availability might be explained by the fact that in the medium locality scenario the workloads aggregate at the intermediate node v_1, so the overload is heavier than in the low locality scenario. Furthermore, the increasing trend of the average response time and of the number of failed tasks shows that the system availability in low demand locality is also affected by both failures and overload delay.

7.2.4 Comparison of user availability

The user availability at every time unit of the simulation is calculated. Since the user availability is 100% most of the time, we instead use the "downtime", i.e. the total time during which the user availability is under 100%, as the metric to compare. The results are shown in Table 9 and Figure 16, and suggest that lower demand locality leads to lower user availability.

Figure 16: Total time per year during which the user availability is under 100%, for I/O and compute tasks in the high, medium and low locality scenarios

Type of task   High Locality   Medium Locality   Low Locality
I/O            28.9            45.7              58.9
Compute        56.4            73.2              90.3

Table 9: Total time during which the user availability is under 100% (hours per year)

A possible explanation for this might be that, for low demand locality, more nodes and links are required to be available in order to maintain the user availability. For example, nodes v_0, v_1, v_2 and edges (v_0, v_1), (v_0, v_2), (v_1, v_3), (v_2, v_5) are required to be available to keep the user availability in low demand locality at 100%; in total, 3 nodes and 4 links are needed, and every single node or link failure might prevent a group of users from accessing the application services, which decreases the user availability. In medium demand locality, however, only 2 nodes and 3 links are needed to achieve the same user availability level: nodes v_0, v_1 and edges (v_0, v_1), (v_1, v_3), (v_1, v_4). Therefore, the user availability of lower demand locality is more easily affected by failures.

7.3 Sensitivity results due to placement configurations

In addition to the cost-sensitive application discussed before, here we also consider a latency-sensitive application, where the maximum acceptable network latency of I/O intensive tasks decreases to 2 ms, which means that service s_1 has to be allocated on the leaf nodes to meet the latency requirement.

To compare the system availability of the two configuration types, for the cost-sensitive and the latency-sensitive application, we run simulations for three user demand scenarios: high locality, medium locality, and low locality. To exclude the influence of overload, according to Figure 9, we choose 1/λ = 15 for high locality, and 1/λ = 30 for each of the two users in medium locality and low locality. The user demand and microservice placement configurations are shown in Figure 17.

Figure 17: Configurations with the different user demand scenarios and application types: (a) cost-sensitive, high locality; (b) latency-sensitive, high locality; (c) cost-sensitive, medium locality; (d) latency-sensitive, medium locality; (e) cost-sensitive, low locality; (f) latency-sensitive, low locality
