
Linköpings universitet, SE-581 83 Linköping

2019 | LIU-IDA/LITH-EX-A--19/088--SE

Latency and Traffic Aware Container Placement in Distributed Cloud

(Latens- och trafikmedveten placering av containrar i distribuerad molnmiljö)

Lenny Johansson

Supervisor: Ahmed Rezine
Examiner: Cyrille Berger


Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

Abstract

Distributed cloud is a key technology for 5G networks and is emerging as an alternative cloud infrastructure for hosting latency critical and traffic-intense applications. Placing computational resources on the edge of networks allows applications to be hosted closer to end users and traffic generating sources, which will reduce latency and traffic deeper in the network. This thesis presents a two-phase approach to solve the combinatorial optimization problem of latency and traffic aware container placement in distributed cloud. Each phase is evaluated using a phase-specific simulated environment. The first phase involves placing containers in data centers and is solved using an integer programming model. Three different objective functions are presented and evaluated using acceptance ratio and average cost as performance metrics. In the second phase, containers are placed in servers. A traffic-aware heuristic is presented and evaluated against traditional bin packing heuristics. The traffic-aware heuristic managed to drastically reduce all traffic-related metrics at the cost of a few additional active servers in comparison to the bin packing heuristics. The traffic-aware heuristic can therefore be a good approach when placing traffic-intense applications in data centers in order to avoid network congestion.


Contents

1 Introduction
   1.1 Aim
   1.2 Research questions
   1.3 Delimitations
2 Theory
   2.1 Distributed Cloud and Edge Computing
   2.2 Microservices
   2.3 Virtualization
   2.4 Combinatorial Optimization
   2.5 Minimum cut
3 Related Work
   3.1 VM placement
   3.2 Virtual Network Embedding
   3.3 VNF Placement
4 Method
   4.1 Problem Description
   4.2 DC Selection
   4.3 Server Selection
   4.4 Evaluation
   4.5 Solving
5 Results
   5.1 DC Selection
   5.2 Server Selection
6 Discussion
   6.1 Results
   6.2 Method
   6.3 The work in a wider context
7 Conclusion

List of Figures

2.1 Monolithic app vs microservices
2.2 VM vs Container
2.3 Minimum cut
2.4 Gomory-Hu tree
4.1 Distributed cloud
4.2 Data center topology
4.3 Example requests
5.1 Acceptance ratio
5.2 Average node load
5.3 Average inter-DC link load
5.4 Average cost
5.5 Inter-rack traffic
5.6 Total traffic
5.7 MLU
5.8 Average link utilization
5.9 Active PMs
5.10 Active links

List of Tables

4.1 Network Parameters
4.2 Request Parameters
4.3 Variables
4.4 Network configuration
4.5 DC configuration
4.6 Request configuration

1 Introduction

Cloud environments are being used to host an increasing workload of network and application services. Cloud computing is made possible by virtualization technology, which is used to virtualize physical resources into a pool of virtual resources [1]. Cloud providers can offer infrastructure as a service (IaaS) to their customers, allowing service providers and organizations to host their workloads on the cloud. Hosting services on the cloud therefore decreases capital expenditures (CAPEX) for application service providers, since they avoid the initial cost of purchasing IT infrastructure. Resources can often be dynamically allocated to match changing service demands. To benefit from this flexibility, application providers need to design their applications to be scalable, and services must be placed in a resource-efficient way while also avoiding service level agreement (SLA) violations.

One software architecture that takes advantage of cloud computing is the microservice architecture, where monolithic applications are broken down into smaller components (microservices) that communicate using application programming interfaces (API). One of the main benefits of the microservice architecture is that each microservice can potentially be scaled independently, so when demand for an application increases it is only necessary to scale the components of the application that are overutilized [2]. The microservice architecture is greatly enhanced by a lightweight virtualization technology: containers. Unlike traditional virtual machines that run their own operating system (OS), multiple containers can share a host OS. This leads to a smaller resource overhead and faster startup times. The low resource overhead makes it possible to run more services on the same hardware, while the faster startup times allow for faster migrations and scaling [3].

Latency critical applications are traditionally not suited for cloud deployment. Cloud data centers (DC) are often located in central locations far away from end users, so increased latency is unavoidable. One of the promises of the coming 5G networks is to bring compute resources closer to users to support applications that need ultra-reliable low-latency communication (URLLC) [4]. Instead of using centralized large-scale data centers, compute resources are spread across the network edge, creating a distributed cloud. 5G networks are further enabled by virtualizing radio access network (RAN) and core network (CN) functions [5, 6]. The virtualization of network functions, called network function virtualization (NFV), allows certain network functions to be hosted on commercial off-the-shelf (COTS) servers instead of specialized hardware [7]. Virtual network functions (VNFs)


can therefore be hosted in data centers in a flexible and coordinated manner. This increased flexibility brings many benefits but also comes with an increased scheduling complexity.

Microservice architecture has been proposed for both edge applications and VNFs [8, 9]. Placing workloads with high bandwidth demands on the network edge reduces the bandwidth pressure on the core network, which could otherwise lead to congestion. However, using compute resources in small data centers incurs a higher operational expenditure (OPEX) than in larger data centers due to economies of scale [10]. Data centers at the edge will also have a more limited amount of compute resources. Therefore, only bandwidth and latency critical parts of applications should be hosted on the edge, while other workloads should be placed in larger data centers.

A complete service (application or network service) based on the microservice architecture can be seen as a graph where the nodes are microservices with resource requirements and the edges are virtual links with bandwidth and latency requirements. Deciding how to optimally embed service graphs over a distributed cloud can be seen as a combinatorial optimization problem (COP). COPs can be modelled and optimally solved with integer programming. The service graph embedding problem requires services to first be mapped to data centers and then be mapped to servers inside data centers. The first phase is similar to the virtual network embedding (VNE) problem, which is known to be NP-hard [11]. The second phase is closely related to the virtual machine placement (VMP) problem, which often involves solving one or more NP-hard problems [12]. Finding an optimal solution for this type of problem is infeasible in the general case when the problem instance is large. To tackle the combinatorial explosion of the problem, heuristics or metaheuristics are often used to find good, but sub-optimal, solutions in reasonable time. In this thesis we evaluate a two-phase approach where the first phase is solved optimally for a small distributed cloud using different objectives, and the second phase is solved using heuristics. The optimization approaches in each phase are evaluated using simulations.

1.1 Aim

This thesis aims at describing, modelling and evaluating optimization techniques for latency and network aware container placement in distributed cloud. The thesis will investigate optimization strategies used to solve similar problems and how they are evaluated.

1.2 Research questions

1. How can the container placement problem in distributed cloud be modelled and solved?

2. How can the container placement optimization techniques be evaluated, and how well do different optimization techniques perform?

1.3 Delimitations

The scope of the distributed cloud in the simulation environment is limited to 10 data centers and 1000 servers. Resource requirements are limited to CPU, RAM and bandwidth for simplicity. Additionally, the network topology is simplified to data centers and data center interconnections, ignoring intermediate network nodes and links. A single distributed cloud topology is used in the evaluation. An online but static approach to placing containers is used, i.e. no dynamic aspects such as load fluctuations or container migrations are taken into account.

2 Theory

This chapter presents background and theory on concepts related to the problem.

2.1 Distributed Cloud and Edge Computing

The term distributed cloud has recently emerged to describe geographically distributed computational resources that are connected and pooled together to create a single coherent execution environment, similarly to resources in traditional large-scale centralized cloud data centers [13]. Distributed cloud in 5G networks will likely be a composition of micro data centers close to the edge of the network and larger data centers in more central locations [14, 15]. Many telecommunication networks already have the necessary infrastructure, with locations that could host smaller data centers to support a distributed cloud. The transition to NFV will eventually require mobile network operators to deploy and run data centers as distributed clouds, to benefit from the coordinated resource allocation and automation that distributed cloud solutions can provide.

Edge computing means that computational resources are placed closer to end users and data generating sources, in such a way that the computational effort is performed on the edge of the network. The traffic therefore never has to travel further than the edge, which lowers response time and reduces network traffic deeper in the network. Common use cases for edge computing are augmented reality, vehicle-to-vehicle communication and tactile internet (e.g. remote surgery).

2.2 Microservices

Microservice architecture is a software architecture where applications are built from many small independent services [16]. Fig 2.1 illustrates the difference between a traditional monolithic application and a microservice-based application. A microservice is a small service that has a single responsibility and can be scaled and deployed independently of other services. Microservices communicate using standardized APIs (e.g. over HTTP).

One of the main benefits of microservices is the scalability of the approach. The modularity between the services allows only bottleneck services to be scaled. Another benefit is maintainability. If a service needs to be updated, only that particular service needs to be updated


instead of an entire application. There are also some possible disadvantages of the microservice architecture [17]. One of the drawbacks is the increased orchestration complexity that arises when deploying multiple services instead of a single large application.

Figure 2.1: Illustrative example of the difference between a monolithic application (a) and a microservice-based application (b).

2.3 Virtualization

In its most general form, virtualization is the creation of a virtual representation of a component or a system [18]. A virtual component creates a resource and interface mapping between itself and an underlying real component. Virtualization enables resource segmentation, which makes it possible to divide real resources between multiple instances of virtual resources. Virtualization is a key enabling technology of cloud computing, where hardware is virtualized to create a pool of virtual resources [1].

Virtual Machines

The virtual representation of a machine is called a virtual machine (VM). Virtualization makes it possible to share resources and run multiple isolated VMs on a single physical machine. In the context of cloud, VM most commonly refers to system VMs that run complete operating systems on top of a hypervisor. Hypervisors can be placed either directly on top of the hardware (type 1) or on top of a host OS (type 2) [19]. The hypervisor emulates all the underlying hardware, which allows emulation of complete computers. Resources can thus be partitioned between multiple VMs.

Containers

Containers are often viewed as lightweight VMs. Containers use OS-level virtualization and communicate directly with the kernel of their host OS. Containers can be seen as isolated applications or processes which run neither their own OS nor on emulated hardware [19]. Because containers do not run their own OS, they carry low resource overhead and have low startup times [3]. Startup time is further reduced since containers do not need to virtualize hardware. One of the main drawbacks of containers is that they share the host OS, so they have less isolation compared to VMs. This is advantageous from a resource perspective, but it can also be a security risk.

Containers have a small resource overhead which makes it possible to run many small containerized workloads in parallel. This makes container-based virtualization useful for


microservice-based applications where many small services are deployed independently. The low startup time for containers makes it fast to scale services as needed. Figure 2.2 illustrates the difference between VMs and containers.

Figure 2.2: Illustrative example of the difference between VMs (a) and containers (b).

Network Function Virtualization

Network functions, e.g. firewalls or load balancers, are traditionally sold as hardware units with specialized hardware. Network function virtualization (NFV) separates network functions from the underlying hardware. In the coming 5G networks, many network functions are targeted for virtualization, both core network functions and radio access network functions [5, 6]. Certain virtual network functions (VNF) can be placed on COTS servers in data centers to reduce CAPEX and OPEX. Similar to the microservice architecture, VNFs can be chained together to create service function chains (SFC), where each service could potentially be scaled independently. Microservice architecture has also been suggested for large VNFs, which could benefit from being broken down into smaller components and chained together. Container-based virtualization has been suggested for NFV [9]. Using containers, VNFs could be managed with container orchestration tools such as Kubernetes [20].

Container Orchestration

Being able to automatically deploy, scale and manage containerized applications is crucial for large workloads. This has led to the development of container orchestration tools.

Kubernetes

Kubernetes [21] is currently one of the most popular container orchestration tools [22]. It is a highly extensible and configurable open-source container management system developed by Google. Kubernetes can be configured to handle automatic fail-over, so that a given number of containers are always running, and automatic horizontal scaling.

In Kubernetes jargon, computers that can host containers are referred to as nodes and a unit of computation is called a pod. A pod is a collection of one or more containers that are grouped together and deployed as a unit. A cluster is a collection of nodes with one or more master nodes that manage deployment and scheduling of pods on the cluster. There are different controllers for different types of workloads, e.g. controllers for stateless and stateful applications, batch jobs, and daemons. Applications can be seen as long-lived services, while batch jobs run to completion.

The Kubernetes scheduler takes pods as input and selects the most appropriate node for placement. The task of the Kubernetes scheduler is to decide which node a pod should be placed


on with regard to the state of the system and placement policies. The pods are queued and scheduled sequentially. Pods can be deployed either in a declarative or an imperative manner. When deploying declaratively, a configuration file is passed to a Kubernetes master node through an API. The configuration file specifies scaling policies and constraints, e.g. pod-to-node or pod-to-pod affinity/anti-affinity. The Kubernetes scheduler can be replaced with a custom scheduler. It is also possible to add multiple schedulers, which can be used to have specialized schedulers for specific types of workloads, for example one scheduler for jobs that run to completion and another for long-lived applications. A minimal sketch of such a declarative specification is shown below.
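To make the declarative style concrete, the sketch below expresses a minimal Deployment manifest as a Python dictionary, as it could be submitted to the Kubernetes API (for example via the official Python client). The service name, image and labels are hypothetical placeholders; the field structure follows the apps/v1 Deployment schema.

```python
# A minimal declarative Deployment spec expressed as a Python dict.
# The name, image and labels are hypothetical; the field names follow
# the Kubernetes apps/v1 Deployment schema.
deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "example-service"},
    "spec": {
        "replicas": 3,  # desired number of pod replicas (horizontal scaling)
        "selector": {"matchLabels": {"app": "example-service"}},
        "template": {
            "metadata": {"labels": {"app": "example-service"}},
            "spec": {
                "containers": [
                    {"name": "example-service", "image": "example/image:1.0"}
                ],
                # Pod-to-pod anti-affinity: never co-locate two replicas
                # on the same node (topologyKey = hostname).
                "affinity": {
                    "podAntiAffinity": {
                        "requiredDuringSchedulingIgnoredDuringExecution": [{
                            "labelSelector": {"matchLabels": {"app": "example-service"}},
                            "topologyKey": "kubernetes.io/hostname",
                        }]
                    }
                },
            },
        },
    },
}
```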

2.4 Combinatorial Optimization

Combinatorial optimization is a subfield of mathematical optimization that encompasses problems where the solution set is discrete. The goal in combinatorial optimization is to find an optimal solution in the discrete space of feasible solutions. Combinatorial optimization problems (COP) appear in areas such as operations research, artificial intelligence and computer science. Many real world problems, e.g. planning, scheduling and assignment, exhibit combinatorial properties. Constraint programming (CP) and integer programming (IP) are two techniques used to model and solve COPs. Both CP and IP can be used to find optimal solutions to COPs and prove that the solutions are optimal. However, general COPs are often NP-hard and require some form of heuristic, metaheuristic or approximation algorithm to find solutions when the problem instances are large. Generally, these techniques cannot guarantee optimality but, if applied correctly, can find good (near-optimal) solutions to COPs in reasonable time [23, 24].

Generalized assignment problem

Generalized assignment problems (GAP) are a class of NP-hard combinatorial optimization problems where m tasks are to be assigned to n agents such that some cost is minimized [25]. Each task must be assigned to exactly one agent, and each agent can potentially be assigned multiple tasks. Agents have an amount of available resources and a cost associated with being assigned a task. Let c_ij be the cost associated with assigning task j to agent i. Let x_ij be a mapping variable where x_ij = 1 if agent i is assigned task j, 0 otherwise. Let a_i be the amount of available resource of agent i and let t_j be the amount of required resource of task j. A model of the problem is shown below.

Minimize
\sum_{i=1}^{n} \sum_{j=1}^{m} x_{ij} c_{ij}    (2.1)

Subject to
\sum_{i=1}^{n} x_{ij} = 1    \forall j \in [1..m]    (2.2)
\sum_{j=1}^{m} x_{ij} t_j \le a_i    \forall i \in [1..n]    (2.3)
x_{ij} \in \{0, 1\}    \forall i \in [1..n], \forall j \in [1..m]    (2.4)
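To illustrate how directly such a model translates into code, the sketch below builds eqs. 2.1-2.4 with PuLP (an assumption; any MILP modeller works the same way). The input structures cost, resource and capacity are hypothetical.

```python
from pulp import LpProblem, LpMinimize, LpVariable, lpSum

def solve_gap(cost, resource, capacity):
    """Solve the GAP model of eqs. 2.1-2.4.
    cost[i][j]  : cost of assigning task j to agent i
    resource[j] : resource demand t_j of task j
    capacity[i] : available resource a_i of agent i"""
    n, m = len(capacity), len(resource)
    prob = LpProblem("gap", LpMinimize)
    # x[i, j] = 1 if agent i is assigned task j (eq. 2.4)
    x = LpVariable.dicts("x", [(i, j) for i in range(n) for j in range(m)],
                         cat="Binary")
    # Objective: total assignment cost (eq. 2.1)
    prob += lpSum(cost[i][j] * x[i, j] for i in range(n) for j in range(m))
    # Each task is assigned to exactly one agent (eq. 2.2)
    for j in range(m):
        prob += lpSum(x[i, j] for i in range(n)) == 1
    # Agents cannot exceed their available resources (eq. 2.3)
    for i in range(n):
        prob += lpSum(resource[j] * x[i, j] for j in range(m)) <= capacity[i]
    prob.solve()
    return [(i, j) for i in range(n) for j in range(m) if x[i, j].value() > 0.5]

# Example: 2 agents, 3 tasks
print(solve_gap(cost=[[4, 2, 5], [3, 6, 1]], resource=[2, 3, 2], capacity=[5, 4]))
```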

Bin packing problem

Bin packing problems are similar to GAP, but instead of minimizing the cost, the number of agents (bins) is minimized. Bin packing problems can contain multiple dimensions (multidimensional bin packing problems) and are often solved with heuristics [26]. First-fit decreasing (FFD) and best-fit decreasing (BFD) are two bin packing heuristics that first sort the


items that are to be placed in decreasing order. In multidimensional problems, sorting items can be done in many different ways and is often problem dependent. One way to sort the items is by weighted sum, which means that a weight must be assigned to each dimension reflecting the importance of that dimension. After sorting, FFD tries to fit each item in the first bin that can fit it. A new bin is only opened when no previous bin can fit the current item. In BFD, the best fit (tightest fit) is selected from all available bins. The best-fit bin can be selected in many different ways in multidimensional problems, and the choice can be highly problem dependent.

An example bin packing problem is the problem of placing VMs on servers with the aim to minimize the number of active servers. The dimensions of the items in this problem could e.g. be CPU and RAM.
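For concreteness, here is a minimal FFD sketch for exactly this two-dimensional (CPU, RAM) case. Sorting by an equal-weight sum of normalized demands is one possible weighting choice, not a prescribed one.

```python
def ffd_place(items, capacity):
    """First-fit decreasing for 2-D (CPU, RAM) bin packing.
    items    : list of (cpu, ram) demands, e.g. VMs
    capacity : (cpu, ram) capacity of every bin (server)
    Returns the number of opened bins and an item -> bin assignment."""
    cpu_cap, ram_cap = capacity
    # Sort items by a weighted sum of normalized demands (equal weights here).
    order = sorted(items, key=lambda d: d[0] / cpu_cap + d[1] / ram_cap,
                   reverse=True)
    bins = []        # residual (cpu, ram) per opened bin
    assignment = []  # (bin index, item) pairs
    for cpu, ram in order:
        for idx, (rc, rr) in enumerate(bins):
            if cpu <= rc and ram <= rr:        # first open bin that fits
                bins[idx] = (rc - cpu, rr - ram)
                assignment.append((idx, (cpu, ram)))
                break
        else:                                  # no open bin fits: open a new one
            bins.append((cpu_cap - cpu, ram_cap - ram))
            assignment.append((len(bins) - 1, (cpu, ram)))
    return len(bins), assignment

n_bins, plan = ffd_place([(4, 8), (2, 2), (3, 6), (1, 4)], capacity=(8, 16))
print(n_bins, plan)
```

Replacing the inner first-fit loop with a search for the tightest-fitting open bin would turn this sketch into BFD.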

2.5 Minimum cut

This section will explain and define minimum cuts and minimum s-t cuts to give a theoretical introduction to concepts used in the method chapter. We will also introduce Gomory-Hu trees, which are tree structures that store minimum s-t cuts for all vertex pairs in a graph. An illustrative example of a min cut and a min s-t cut is shown in fig. 2.3.

Figure 2.3: An illustrative example of a min s-t cut and a min cut. The left graph shows the weight and cut of a min s-t cut between vertex a and vertex b. The right graph shows the weight and cut of a min cut of the same graph. Both cuts partition the graph into two disjoint sets of vertices, one set on each side of the cut.

Min cut

Let G(V, E) be a graph where V is a set of vertices and E is a set of edges. A cut in G partitions the vertices in V into two disjoint sets; call these sets S ⊂ V and T = V \ S. Any edge that crosses the cut has one vertex in S and one vertex in T. The weight of the cut is the sum of the capacities of all edges in the cut. A minimum cut is a cut that has the lowest weight of any possible cut in G. Note that a minimum cut is not necessarily unique: there can be multiple different cuts with the same weight.

Stoer-Wagner algorithm

The minimum cut of a graph can be found with the Stoer-Wagner algorithm presented in [27]. Let G(V, E, w) be a weighted undirected graph where V is a set of vertices, E a set of edges and w a weight function on the edges in E. The algorithm finds the minimum cut of G by performing |V| maximum cardinality searches, where each search


finds an arbitrary min s-t cut. The algorithm recursively contracts vertices of G until only two vertex sets remain.

The algorithm starts by selecting any vertex a ∈ V (a remains the same throughout the entire algorithm). Next, a set A is created and a is added to it. This starts a maximum cardinality search to find an arbitrary min s-t cut in the graph. While A does not equal V, the most connected vertex y ∈ V \ A is added to A, i.e. the vertex with the highest edge-weight sum between itself and A among the vertices not already in A. The min s-t cut (referred to as the cut-of-the-phase in [27]) of the current graph is the weight of the cut between the last vertex added and the rest of the graph. The last two vertices added to A are then merged by replacing them with a new vertex. If both of them have edges to a common vertex, those edges are replaced by a new edge whose weight is the sum of the two removed edges. If there is an edge between the contracted vertices, it is removed.

Either the calculated min s-t cut is the minimum cut, or s and t are in the same partition of the minimum cut; in the latter case, merging them does not affect the calculation of the minimum cut. The smallest min s-t cut in the graph is always the minimum cut, so the smallest cut-of-the-phase is stored and returned at the end of the algorithm. Pseudocode of the algorithm is shown in algorithm 1.

Algorithm 1: Stoer-Wagner algorithm
input:  graph G(V, E), weight function w
output: minimum cut minCut

a ← any vertex in G.V
minCut ← INF
while |V| > 1 do
    A ← {a}
    while A ≠ V do
        add the most connected vertex to A
    end
    calculate cutOfPhase
    update G by contracting the last two vertices added to A
    if cutOfPhase < minCut then
        minCut ← cutOfPhase
    end
end
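In practice the algorithm rarely needs to be re-implemented; for instance, NetworkX ships a Stoer-Wagner implementation. A minimal usage sketch on an arbitrary example graph:

```python
import networkx as nx

# Weighted undirected graph; the edge attribute 'weight' is used by default.
G = nx.Graph()
G.add_weighted_edges_from([
    ("a", "b", 3), ("a", "c", 4), ("b", "c", 2),
    ("b", "d", 5), ("c", "d", 1),
])

# Returns the weight of the global minimum cut and the two vertex sets.
cut_value, (left, right) = nx.stoer_wagner(G)
print(cut_value, left, right)
```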

Karger's algorithm

Karger's algorithm, presented in [28], is a randomized approach to finding the minimum cut of a graph. It works by randomly selecting and contracting edges until only two vertices remain. Restarting and repeating this process \binom{n}{2} \ln n times gives a probability of at most 1/n of not finding a minimum cut. With sufficiently large n, the probability of finding a minimum cut is therefore very high.
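A minimal sketch of the contraction idea for an unweighted multigraph, using a union-find structure to contract edges; the default trial count follows the repetition bound above:

```python
import math
import random

def karger_min_cut(edges, n, trials=None):
    """Randomized min cut of an unweighted multigraph (Karger's contraction).
    edges: list of (u, v) pairs over vertices 0..n-1."""
    if trials is None:
        trials = int(n * (n - 1) / 2 * math.log(n)) + 1  # ~ C(n,2) ln n repetitions
    best = float("inf")
    for _ in range(trials):
        parent = list(range(n))  # union-find over vertices

        def find(a):
            while parent[a] != a:
                parent[a] = parent[parent[a]]  # path halving
                a = parent[a]
            return a

        groups = n
        pool = edges[:]
        random.shuffle(pool)
        # Contract random edges until only two vertex groups remain.
        for u, v in pool:
            if groups == 2:
                break
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                groups -= 1
        # Edges whose endpoints ended up in different groups form the cut.
        best = min(best, sum(1 for u, v in edges if find(u) != find(v)))
    return best

print(karger_min_cut([(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 2)], n=5, trials=200))
```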

Min s-t cut

Let s ∈ V be a source node and t ∈ V a sink node. A minimum s-t cut is a cut where s ∈ S and t ∈ T, and the weight of the set of edges in the cut is the minimum of any possible cut between s and t. The weight of a minimum s-t cut is equal to the maximum flow between s and t [29]. To find the minimum s-t cut we can therefore use either a min s-t cut algorithm or a max-flow algorithm. The residual graph generated by passing as much flow as possible from s to t can be used to partition V into S and T. This is achieved by traversing the residual graph from s (e.g. in breadth-first order): if v ∈ V is reachable from s, then v ∈ S, else v ∈ T.


Push-relabel algorithm

The push-relabel algorithm, presented in [30], is a maximum flow algorithm with time complexity O(n^2 m), where n is the number of vertices and m is the number of edges in a graph. (The time complexity can be improved to O(nm log(n^2/m)) by using the dynamic tree data structure; furthermore, the base time complexity of O(n^2 m) can be improved to O(n^3) by using a first-in, first-out ordering of the active vertices.) The algorithm finds the maximum flow between a source and a sink vertex by trying to push as much flow as possible from the source towards the sink (called a preflow). The flow is moved locally between vertices, and the amount of incoming flow is allowed to exceed the amount of outgoing flow. During its execution, vertices are labeled based on their distance to the sink or source. The label is used to select edges to push flow through. If the flow being pushed saturates a vertex, the excess flow is pushed closer to the sink as long as the sink is reachable from the vertex; otherwise the excess flow is returned to the source using the vertex labels. At the end of the algorithm, all intermediate vertices have zero excess and the amount of flow sent from the source is the maximum flow.

Let G(V, E, c) be a graph with vertex set V, edge set E and capacity function c. Any edge (a, b) ∈ E has a corresponding capacity c(a, b), which indicates the maximum amount of flow that can be passed on the edge (a, b). Let f(a, b) denote the amount of flow currently on (a, b), i.e. the flow going from vertex a to vertex b. Let e(a) be a function that accounts for the amount of excess flow being passed through vertex a, formally defined in eq. 2.5. Let r(a, b) denote the residual capacity of link (a, b), shown in eq. 2.6. A residual graph G_r can be created from the original graph G, with the same vertices V and a set of residual edges E_r. In G_r, some vertices can be inaccessible due to saturated edges, i.e. edges where r(a, b) = 0.

e(b) = \sum_{a \in V} f(a, b) - \sum_{a \in V} f(b, a)    \forall b \in V    (2.5)

r(a, b) = c(a, b) - f(a, b)    \forall (a, b) \in E    (2.6)

Moving flow between two vertices a and b is done by adding δ = min(e(a), r(a, b)) to f(a, b) and subtracting δ from f(b, a), given that δ > 0. Vertices are labeled based on the distance from the sink using a labeling function l(a). The labeling function has two static labels for the source and the sink: l(s) = n and l(t) = 0. Given a residual edge (a, b) ∈ E_r where r(a, b) > 0, then l(a) ≤ l(b) + 1. If l(a) < n, then the label is a lower bound on the distance from a to t in G_r. If l(a) > n, then t is not reachable from a and instead l(a) - n is a lower bound on the distance between a and s in G_r. The initial labels can be set to 0 for all vertices except the source vertex s, whose label is set to n.

After initializing all necessary data structures, the algorithm runs a loop until there are no active vertices. A vertex a ∈ V \ {s, t} is considered active if e(a) > 0 and l(a) < ∞. Two procedures are performed during each iteration: push and relabel. The push procedure, presented in algorithm 2, pushes preflow on an edge (a, b) if a is active, r(a, b) > 0 and l(a) = l(b) + 1. The relabel procedure, presented in algorithm 3, relabels a vertex a if it is active and ∀b ∈ V, r(a, b) > 0 and l(a) ≤ l(b).

In the generic description of the algorithm, the vertex and operation order is not specified and is assumed to be arbitrary. For empirical efficiency, it is important in practical implementations of the algorithm to order vertices and keep track of current edges. One way to achieve this is to add active vertices to a FIFO queue and to store the edge targeted to be processed for each vertex in linked lists. At each iteration, the first element in the queue is processed, and if any new active vertices arise during processing they are added to the back of the queue. Likewise, if the vertex still remains active after being processed, it is also added to the back of the queue. Let H be a data structure that stores the current edge of each vertex: H(a) = b means that the current edge for vertex a is (a, b). The neighbours of each vertex are stored in arbitrary order, and if all neighbours of a have been checked and e(a) > 0, then


a is relabeled. The discharge procedure, presented in algorithm 4, is used to process the first vertex in the queue and order the push and relabel operations. The discharge operation is run until the queue is empty. Finally, the max flow (min cut) is returned. The main algorithm is presented in algorithm 5, which also shows the initialization of the different data structures used throughout the algorithm.

Algorithm 2: Push operation
input: edge (a, b) ∈ E

δ ← min(e(a), r(a, b))
f(a, b) ← f(a, b) + δ
f(b, a) ← f(b, a) - δ
e(a) ← e(a) - δ
e(b) ← e(b) + δ

Algorithm 3: Relabel operation
input: vertex a ∈ V

if ∃ r(a, b) > 0 then
    l(a) ← min(l(b) + 1, ∀(a, b) ∈ E_r)
else
    l(a) ← ∞
end

Algorithm 4: Discharge operation

a ← pop first element of Q
label ← l(a)
while e(a) > 0 and l(a) ≤ label do
    (a, b) ← current edge of a
    if r(a, b) > 0 and l(a) = l(b) + 1 then
        run push(a, b)
    else
        update H(a) to the next neighbour
        if H(a) is the first vertex in the list then
            run relabel(a)
        end
    end
    if b becomes active and is not in Q then
        add b to Q
    end
end
if a is still active then
    add a to Q
end


Algorithm 5: Push-relabel algorithm
input:  graph G(V, E, c), source s, sink t
output: flow f

initialize f, e, l, Q, H
for all (a, b) ∈ E do
    f(a, b) ← 0
    f(b, a) ← 0
end
l(s) ← n
for all a ∈ V \ {s} do
    l(a) ← 0
    e(a) ← 0
end
for all neighbours b of s do
    f(s, b) ← c(s, b)
    f(b, s) ← -c(s, b)
    e(b) ← c(s, b)
    add b to Q
end
while Q is not empty do
    run discharge
end
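Rather than re-implementing algorithm 5, one can use NetworkX's push-relabel (preflow-push) implementation, both for the maximum flow and, via max-flow/min-cut duality, for the min s-t cut partition. A minimal sketch on an arbitrary graph:

```python
import networkx as nx
from networkx.algorithms.flow import preflow_push

G = nx.DiGraph()
G.add_edge("s", "a", capacity=10)
G.add_edge("s", "b", capacity=5)
G.add_edge("a", "b", capacity=4)
G.add_edge("a", "t", capacity=6)
G.add_edge("b", "t", capacity=8)

# Max flow via push-relabel; the returned residual network carries the value.
R = preflow_push(G, "s", "t")
print(R.graph["flow_value"])

# The same flow function yields the min s-t cut and its two partitions S and T.
cut_value, (S, T) = nx.minimum_cut(G, "s", "t", flow_func=preflow_push)
print(cut_value, S, T)
```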

Gomory-Hu Tree

In some problems, we are interested in finding minimum s-t cuts between every pair of nodes in a graph. A naive way to generate all minimum s-t cuts would require |V|^2 min s-t cut computations. A more efficient way to generate and represent all minimum s-t cuts is with a Gomory-Hu tree [31]. Given an undirected graph G(V_G, E_G), where V_G is a set of vertices and E_G is a set of edges, a Gomory-Hu tree T(V_T, E_T) is a weighted tree generated from G where the edges E_T represent minimum s-t cuts between pairs of vertices in V_G. An example of a graph with its corresponding Gomory-Hu tree is shown in fig. 2.4.

Figure 2.4: A weighted undirected graph (left) and the Gomory-Hu tree constructed from it (right). The min s-t cut between two vertices is the smallest edge weight on the path between the two vertices in the Gomory-Hu tree, e.g. the min s-t cut between a and f is 17. The min cut is the smallest edge weight in the tree, which is 13.

A Gomory-Hu tree can be constructed with |V| - 1 minimum s-t cut computations and is therefore an efficient way to compute and represent all minimum s-t cuts in a graph. A minimum cut in G is an edge in E_T with the lowest weight. A minimum s-t cut between any pair of vertices in V_G is the lightest edge on the path between the corresponding nodes in T.


The original Gomory-Hu method of constructing min s-t cut trees works by first creating a supernode that contains all vertices of the graph G, such that V_T only contains a single node holding the set of all vertices in V_G. The algorithm then iteratively expands the tree T until every node in V_T holds a single vertex from V_G, i.e. until all vertices from the original graph G have been given their own nodes in T.

At each iteration, a node X ∈ V_T is selected that has two or more vertices from V_G in its set. Next, two vertices s and t are selected from X, and all connected components of V_T \ X are computed and contracted into single vertices in a new graph G′ created from the original graph G. A connected component is any node in T except the current node X; vertices in X are not contracted in G′. The min s-t cut between s and t is then computed in G′. Any vertex in X that is in the s partition will be in a new node X_s, and any vertex in X that is in the t partition will be in a new node X_t. X is then replaced by X_s and X_t. An edge with the weight of the min s-t cut is added between X_s and X_t in T. The edges from other nodes in V_T to X are moved to either X_s or X_t, depending on which side of the cut they belong to.

Gusfield's algorithm

The original Gomory-Hu method suffers from implementation complexity. A simpler Gomory-Hu tree construction algorithm, called Gusfield's algorithm, is described in [32]. The algorithm avoids node contractions and the maintenance of non-crossing cuts, which greatly simplifies its implementation. The downside of the algorithm is that the min s-t cut is computed on the entire graph at each iteration instead of on a contracted graph. The algorithm is presented in algorithm 6 and works as follows. Let n = |V_G| be the number of vertices in the graph G for which we want to compute the Gomory-Hu tree. Two vectors of length n are initialized: p, which stores representative nodes in the Gomory-Hu tree T, and fl, which stores the weights of edges in T, i.e. fl stores the maximum flow (min s-t cut) between two adjacent nodes in T. Following the description of the original algorithm, all indices start at 1. p is initialized to 1 for all indices. This means that all vertices from G are contained in a single node in T, with value 1 as the representative of a supernode in the original Gomory-Hu method. At each iteration, the iterator s represents both a vertex in V_G and a node in V_T. The p vector contains pointers to the neighbour of the current node, so p[s] represents both the node that vertex s is contained in and the neighbour of node s in T. Any p[i] that contains s will also be a neighbour of s in T. fl[i] represents the weight of an edge (i, p[i]), i.e. an edge between node i and its neighbour found in p[i].

The first min s-t cut computed will be between vertices 1 and 2, which splits node 1 into two nodes, each containing the vertices on its side of the cut. Let s and t be two vertices in V_G contained in the same node in V_T. An edge between s and t is added, with the weight of the cut stored in fl[s]. After the cut, any node on the s side of the cut that points to t will instead point to s. However, if the node p[t] points to is on the s side of the cut, then T is updated accordingly: let x = p[t] be the node t points to; if x is on the s side of the cut, then p[s] is updated to point to x and p[t] instead points to s. The flow in fl must also be updated to reflect the structural change of T: fl[s], which is the weight of the edge originally between s and t, instead takes the value of the edge between t and x, and the edge from s to t is instead added as an edge from t to s.


Algorithm 6: Gusfield's algorithm
input:  undirected graph G
output: Gomory-Hu tree T

n ← |V_G|
initialize p, fl with length n
set all values in p to 1
for s in range 2 to n do
    t ← p[s]
    cut ← compute min s-t cut in G between the vertices s and t
    S ← vertices on the s side of cut
    fl[s] ← weight of cut
    for i in range 1 to n do
        if i ≠ s and i is in S and p[i] = t then
            p[i] ← s
        end
    end
    if p[t] is in S then
        p[s] ← p[t]
        p[t] ← s
        fl[s] ← fl[t]
        fl[t] ← weight of cut
    end
end
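NetworkX provides a Gomory-Hu tree construction based on this scheme. The sketch below builds the tree for an arbitrary capacitated graph and queries a min s-t cut as the lightest edge on the tree path, exactly as described for fig. 2.4:

```python
import networkx as nx

G = nx.Graph()
G.add_edge("a", "b", capacity=10)
G.add_edge("a", "c", capacity=8)
G.add_edge("b", "c", capacity=4)
G.add_edge("b", "d", capacity=6)
G.add_edge("c", "d", capacity=2)

# Builds the tree with |V| - 1 max-flow computations.
T = nx.gomory_hu_tree(G)

def min_st_cut(tree, s, t):
    """Min s-t cut = lightest edge on the unique s-t path in the tree."""
    path = nx.shortest_path(tree, s, t, weight=None)
    return min(tree[u][v]["weight"] for u, v in zip(path, path[1:]))

print(min_st_cut(T, "a", "d"))
```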

3 Related Work

Container placement research is still in its infancy. Even so, container placement shares many of the same objectives and constraints as VM placement problems, which have been studied extensively. Placing latency critical services on the edge is also strongly related to VNF placement which, although recent, has received significant attention from researchers. In this thesis a containerized service could be a VNF or an application service. Virtual network embedding (VNE) also shares many similarities with our problem definition, which makes it another research area worth exploring. Therefore, this chapter will present some of the more relevant and recent research that has been done on VNE and on the placement of VMs and VNFs. We will focus on the optimization techniques and performance metrics used for evaluation.

3.1 VM placement

VM placement (VMP) in cloud environments has been an active area of research [33, 12, 34]. VMP is broad and encompasses many different problem formulations and objectives. VM placement problems are often formulated as vector bin packing problems. Bin packing is used to minimize the number of running servers, which reduces the energy wasted by running idle or underutilized servers. Common VMP objectives are to minimize energy, cost, network utilization or SLA violations. Some studies look at optimizing multiple, often conflicting, objective functions to find a balance between OPEX, energy and application performance. The authors in [34] describe optimization techniques commonly used for energy efficient VMP and migration. The techniques are divided into four groups: constraint programming, bin packing heuristics, stochastic integer programming and genetic algorithms. VMP objectives and solution strategies are divided into multiple categories in [33]. The objectives are separated into mono- and multi-objective research and further separated into studies on energy, cost, performance, network traffic and QoS optimization. Solution techniques are divided into heuristic, metaheuristic, deterministic and approximation algorithms.

A recent but important objective of VMP is to handle intra-DC traffic in a resource and energy efficient way. Handling intra-DC traffic is important in our problem definition because of the possibly large traffic flows between services. To achieve energy efficiency in VMP, the placement strategy should minimize the number of active servers, network devices and network links. Furthermore, to achieve application performance in data centers, strongly connected services should be placed close together to avoid congestion on links. This approach


was used in [35], where the authors presented an energy-saving VMP heuristic based on minimum cut combined with a best-fit heuristic. The problem was abstracted as a bin packing problem combined with a quadratic assignment problem, both of which are NP-hard. A similar approach was used in [36], where the problem was formulated as a balanced minimum k-cut problem (the minimum k-cut problem consists of partitioning the vertices of a graph into k disjoint subsets while minimizing the total weight of the edges between the subsets). The solution strategy first partitioned VMs into equal-sized groups and then

VM-groups were mapped to server racks. Lastly, inter-VM traffic routing optimization was performed with the objective to minimize the number of active network devices. In [38], VMs were clustered using the concept of community structures. The strategy created smaller VM communities as needed by trying to greedily fit communities into partitions of servers using bin packing heuristics.

The authors in [39] looked at VMP for geographically distributed data centers. They used approximation and heuristic algorithms to perform VMP in a distributed cloud with the objective to minimize latency between communicating VMs. The resource allocation was divided into multiple phases: DC selection, request partitioning, server selection and VM placement. DCs were selected to minimize the maximum distance between candidate DCs. Machines were selected based on the distance between racks, with the objective to minimize inter-rack communication. We formulate our problem similarly to [39], but instead of minimizing the maximum distance between candidate DCs, our problem includes latency constraints on the communication between containers, link bandwidth limitations, deployment cost and load balancing.

Because VMP over distributed data centers can be abstracted into a combination of multiple NP-hard problems, finding the optimal solution becomes infeasible for non-trivial problem instances. When resources are not abundant and the placement problem contains many hard constraints, heuristic algorithms can quickly become complex. One way to utilize mathematical optimization to solve larger instances of VMP is to separate the problem into multiple steps. This approach was used in [40], where the placement was divided into two separate steps: sub-network placement and host placement. The first step clusters VMs to parts of the network and the second step solves VMP on the sub-networks. The objective was to minimize the worst cut load ratio, and the results from the two-step integer programming approach were close to optimal.

3.2 Virtual Network Embedding

Network virtualization has been identified as an enabling technology for the future internet [41]. The goal is to be able to easily and dynamically instantiate and run multiple virtual networks over a shared physical network. A virtual network is a network with virtual network nodes and virtual links. Embedding virtual networks onto a physical network is known as virtual network embedding (VNE). Sharing infrastructure between multiple tenants requires resources to be allocated efficiently, which turns VNE into an optimization problem. The VNE problem formulation is closely related to network slice embedding [42]. In 5G networks, network slicing will enable customers to create isolated virtual networks (slices) over a shared physical network. A slice could be defined by business requirements such as latency, reliability and throughput. The goal of the network operator is to embed slices in a resource efficient way while satisfying business requirements. The mathematical formulation of these constraints turns network slice embedding into an optimization problem that closely resembles the more well-studied VNE problem.

VNE has received significant attention from the research community [43, 44, 45]. A categorized collection of exact solutions to VNE is presented in [44]. However, the VNE problem is NP-hard and most research has thus focused on heuristic or metaheuristic approaches to solve the problem for large instances [45]. One way to reduce the size of a problem instance is


to select a subset of the nodes and links in the network for each request, as seen in [46]. After reducing the size of the problem it can be solved using integer programming.

In [43], VNE problems are divided into six categories: static, dynamic, centralized, distributed, concise and redundant. In static problems, online or offline, embedded virtual networks are never re-configured, while dynamic solutions handle re-configurations when new requests arrive. Centralized approaches calculate the embedding from a single entity, while multiple entities collaborate in distributed approaches. Redundancy is meant to add reliability and fault-tolerance at the cost of additional resources. Problems that do not take redundancy into consideration are referred to as concise. Using this categorization, our problem formulation follows a concise, centralized and online static approach. However, re-configurations can be realized as a separate step, e.g. by running a re-optimization process when some trigger occurs, for instance when a VN request is rejected [47].

The authors in [43] categorized VNE objectives into QoS, profit maximization and VNE resilience. QoS can describe a number of different metrics, such as latency, throughput and jitter. Profit is often specified as maximizing the acceptance ratio of VNE requests or minimizing cost. VNE resilience is achieved by allocating backup resources.

The VNE problem consists of two sub-problems: virtual node embedding and virtual link embedding, where a single virtual link might be embedded on multiple physical links, i.e. a physical path. The sub-problems can be solved sequentially, or jointly to get better resource allocation at the cost of additional computational effort. The authors in [43] refer to these approaches as uncoordinated and coordinated VNE, respectively.

By viewing data centers as nodes in a network, many of the performance metrics and constraints used for VNE can be applied in the data center selection phase. Common objectives in VNE are link and node balancing [48, 46]. Creating a balanced utilization of resources can increase the acceptance ratio over time by reducing resource fragmentation. We will also explore using link and node balancing for data center selection, with the goal of increasing the acceptance ratio.

Some VNE research solves the virtual link embedding sub-problem using splittable flows, where the flow on a virtual link is allowed to be split over multiple physical paths. However, VNE often consists of embedding virtual links with unsplittable flow. Flow can be made splittable or unsplittable by using real or binary values for the virtual link mapping variables. In this thesis we will use unsplittable flow between data centers.

VNE problems often place constraints on node mapping where virtual nodes from the same request are not allowed to share physical nodes. This differs from our problem, where physical nodes are allowed, even encouraged, to host multiple virtual nodes from the same request in order to reduce traffic on physical links.

3.3 VNF Placement

VNF placement (VNFP) has recently emerged as an important and active area of research [49]. VNFP shares many similarities with VNE and is sometimes presented as a VNE problem. VNFP problems are most commonly formulated as embedding problems where requests for service function chains (SFC) need to be placed on a physical network. SFCs are complete network services and can be seen as forwarding graphs that need to be embedded on a physical network. VNFP problems are sometimes based on demands from end-users, where service demands are routed to VNFs that have enough capacity to satisfy the demands. In these types of VNFP problems, VNFs are mapped to hosts and demands are mapped to VNFs. Demands can then either be split between multiple VNFs or constrained to single instances. VNFs are scaled, moved, placed or removed as demand changes.

VNFP problems often include latency constraints. In [50], VNFP is formulated as an integer linear program with the objective to minimize total round-trip delay and deployment cost. An affinity-based heuristic is included for large problem instances. The model allows


multiple instances of each service type to be placed on the same node. Every user request for a service is mapped to a single instance such that user traffic flows are not allowed to be split between multiple instances of the same service type.

The objective of the VNFP problem formulation in [51] is to minimize OPEX and resource fragmentation. OPEX is a combination of resource and network costs together with penalty fees for SLA violations. The aim of the resource fragmentation objective is to minimize the number of active servers and links, which is similar to the energy minimization objective used in traditional VMP. A dynamic programming based heuristic is suggested to deal with larger instances. The approach was evaluated with real data traces.

In [52], a latency-aware VNFP problem was described in the context of a distributed cloud in 5G, and a realistic topology was presented in the evaluation. Data centers were only seen as computation nodes and no specific server was selected for placement. The problem was modelled as a resource constrained shortest path problem with unsplittable flow. The objective was to minimize total delay, calculated as the sum of latency and processing time.

4 Method

This chapter will formally define the problem and present a two-phase solution strategy. In the first phase, containers are mapped to data centers. An integer linear programming (ILP) model and three different objectives are presented, together with a greedy node selection algorithm. In the second phase, containers are mapped to physical machines (PMs) inside data centers. To solve the server selection phase, a traffic-aware heuristic is presented together with three other heuristics. Lastly, the evaluation process is described.

4.1 Problem Description

The problem consists of placing a number of containers (services) onto a number of physical machines (servers) while optimizing operational cost, resource utilization and application performance. The servers are located in geographically distributed data centers with paths connecting them, i.e. a distributed cloud. Each path consists of a number of links, and each link has a cost, a capacity and a latency associated with it. Servers have a limited amount of resources of different resource types. We will use CPU and RAM, but additional resources could easily be added if necessary. Each service requires some amount of these resources. A service can have a specific data center it must be placed in, or a set of allowed data centers. Requested applications are described as forwarding graphs where the nodes are services and the edges are virtual links. Virtual links between services require some amount of bandwidth and can have latency constraints, which specify the minimum allowed bandwidth and the maximum allowed latency.

The objective of the first phase is to optimize for cost and resource utilization in such a way that future requests have a higher chance of being accepted. OPEX is calculated as the cost of allocated bandwidth on links and CPU on nodes. OPEX is also affected by where applications are placed, e.g. edge data centers have a higher operational cost than centralized data centers. Latency is treated as a hard constraint but can optionally be added to the objective. The objective of the second phase is to minimize inter-rack traffic inside data centers while also minimizing the number of active servers, i.e. servers with one or more containers running on them.


4.2 DC Selection

In this section the latency constrained DC selection problem is formalized as an ILP problem. Three different objectives are presented: Cost, RLB and NLB. A greedy node selection algorithm is also implemented to use as a baseline when evaluating the performance of the model. The notation [a..b] is used to denote that a variable can take any integer value between a and b; the meaning of z ∈ [a..b] is thus {z ∈ ℤ : a ≤ z ≤ b}.

Network Parameters

Let G^p = (N^p, L^p) be an undirected graph that depicts the physical network. N^p is a set of nodes representing data centers and L^p is a set of physical links connecting the data centers. Nodes have CPU, RAM and a resource allocation cost. Links have bandwidth, latency and a bandwidth allocation cost. Every resource type has a corresponding residual variable that stores the remaining resources of nodes and links. Table 4.1 presents the network parameters. There are source nodes connected to edge data centers that can generate traffic. Source nodes are access points but are treated as data centers with no resources. We use notations with p in the exponent position (e.g. G^p) to denote physical network parameters when the notation would otherwise overlap with virtual parameters.

G^p(N^p, L^p): an undirected graph representing the physical network, where N^p is a set of nodes representing data centers and L^p is a set of data center interconnection links
N^p: set of data centers, where i ∈ N^p means that there is a node with index i
L^p: set of physical links, where (u, v) ∈ L^p means that there is a link connecting node u with node v, and (u, v) = (v, u)
β: number of DCs in N^p
cpu^p_i: number of CPU cores of node i ∈ N^p
ram^p_i: amount of RAM of node i ∈ N^p
bw^p_uv: capacity of link (u, v) ∈ L^p
lat^p_uv: latency of link (u, v) ∈ L^p
R^cpu_i: residual CPU of node i ∈ N^p
R^ram_i: residual RAM of node i ∈ N^p
R^bw_uv: residual bandwidth of link (u, v) ∈ L^p
linkCost_uv: cost of transferring data over link (u, v) ∈ L^p
nodeCost_i: cost of allocating CPU on node i ∈ N^p

Table 4.1: Network Parameters

Request parameters

Requested applications can be seen as forwarding graphs where the nodes are services, i.e. containers, and the links are virtual links. Let G^v = (N^v, L^v) be a service graph where N^v is a set of services and L^v a set of virtual links. Virtual links represent communication between services. Each service has resource requirements, and each link has a maximum allowed latency and a minimum allowed bandwidth. Table 4.2 shows the service parameters.


G^v(N^v, L^v): a forwarding graph where the nodes N^v are services (containers) and L^v are virtual links connecting the services
N^v: set of containers, where i ∈ N^v means that there is a node with index i
L^v: set of virtual links, where (s, t) ∈ L^v means that there is a link going from node s to node t
α: number of containers in N^v
cpu^v_i: CPU demand of service i ∈ N^v
ram^v_i: RAM demand of service i ∈ N^v
loc_i: set of allowed locations (DCs) for service i ∈ N^v
bw^v_st: bandwidth demand of link (s, t) ∈ L^v
lat^v_st: maximum allowed latency on (s, t) ∈ L^v

Table 4.2: Request Parameters

Variables

The variables and their domains are presented in table 4.3. The two decision variables are x and y: x decides which DC a container is placed in, and y creates the path between embedded nodes. The variables cpuLoad and linkLoad are used in the objective functions and are constrained during initialization by the residual capacity of physical nodes and links.

x_ij ∈ {0, 1}: 1 if virtual node i is placed on physical node j, 0 otherwise
y^st_uv ∈ {0, 1}: 1 if virtual link (s, t) ∈ L^v is embedded on physical link (u, v) ∈ L^p, 0 otherwise
cpuLoad_j ∈ [0..R^cpu_j]: amount of CPU allocated on physical node j
linkLoad_uv ∈ [0, R^bw_uv]: amount of traffic assigned to physical link (u, v) ∈ L^p

Table 4.3: Variables

Optimization Model

The DC selection problem is modelled with three different objectives.

Cost The first objective is to minimize cost without taking any other factors into consideration. Since edge nodes are more expensive to allocate resources on, some virtual nodes will be pushed deeper into the network depending on their traffic patterns and the amount of CPU they require. The Cost objective is stated in eq. 4.1 and is the sum of all allocated resources on nodes and links multiplied by their cost.

$$Cost = \sum_{j=1}^{\beta} nodeCost_j \cdot cpuLoad_j + \sum_{(u,v) \in L^p} linkCost_{uv} \cdot linkLoad_{uv} \quad (4.1)$$


RLB A common objective in VNE is to balance loads on links and nodes [46, 48]. The residual load balance (RLB) objective is expressed in eq. 4.2 and also includes the cost of resources.¹ $\epsilon$ is a small value to avoid division by zero.

$$RLB = \sum_{j=1}^{\beta} \frac{nodeCost_j \cdot cpuLoad_j}{R^{cpu}_j + \epsilon} + \sum_{(u,v) \in L^p} \frac{linkCost_{uv} \cdot linkLoad_{uv}}{R^{bw}_{uv} + \epsilon} \quad (4.2)$$

NLB In eq. 4.3, RLB is extended by multiplying the node and link terms by their capacity to create a more normalized load balancing (NLB), i.e. load balancing by fractional residual capacity instead of only residual capacity.

$$NLB = \sum_{j=1}^{\beta} \frac{nodeCost_j \cdot cpuLoad_j \cdot cpu^p_j}{R^{cpu}_j + \epsilon} + \sum_{(u,v) \in L^p} \frac{linkCost_{uv} \cdot linkLoad_{uv} \cdot bw^p_{uv}}{R^{bw}_{uv} + \epsilon} \quad (4.3)$$

ILP formulation Let $Objective$ be one of the objectives described above; the model is then formulated as follows:

$$\text{Minimize } Objective \quad (4.4)$$

Subject to

$$\sum_{j=1}^{\beta} x_{ij} = 1 \quad \forall i \in [1..\alpha] \quad (4.5)$$

$$x_{ij} = 0 \quad \forall i \in [1..\alpha],\ \forall j \in anti_i \quad (4.6)$$

$$cpuLoad_j = \sum_{i=1}^{\alpha} x_{ij} \cdot cpu^v_i \quad \forall j \in [1..\beta] \quad (4.7)$$

$$\sum_{i=1}^{\alpha} x_{ij} \cdot ram^v_i \le R^{ram}_j \quad \forall j \in [1..\beta] \quad (4.8)$$

$$\sum_{v \in adj_u} y^{st}_{uv} - \sum_{v \in adj_u} y^{st}_{vu} = x_{su} - x_{tu} \quad \forall (s,t) \in L^v,\ \forall u \in [1..\beta] \quad (4.9)$$

$$linkLoad_{uv} = \sum_{(s,t) \in L^v} (y^{st}_{uv} + y^{st}_{vu}) \cdot bw^v_{st} \quad \forall (u,v) \in L^p \quad (4.10)$$

$$\sum_{(u,v) \in L^p} (y^{st}_{uv} + y^{st}_{vu}) \cdot lat^p_{uv} \le lat^v_{st} \quad \forall (s,t) \in L^v \quad (4.11)$$

Each virtual node must be placed on exactly one physical node; this is enforced by eq. 4.5. Some virtual nodes can be constrained to specific locations, i.e. to specific physical nodes. The mapping must therefore be 0 between a virtual node and all physical nodes not in its list of allowed locations. Let $anti_i$ be a list of all physical nodes not in $loc_i$ for virtual node $i$. Eq. 4.6 then ensures that virtual node $i$ is not placed on a physical node outside its allowed locations. Because eq. 4.5 forces all virtual nodes to be mapped to some physical node, $x$ will only contain allowed container-DC mappings. When a virtual node is mapped to a physical node, some of its resources are allocated to it.

¹In VNE research the load balance objective is often stated with weights for each node and link. It is common to …


Eqs. 4.7 and 4.8 ensure that the mapping does not consume more than the available (residual) resources in the DCs. Allocated CPU is used in the objective function and is therefore assigned to a variable that is constrained by the residual CPU of each physical node. Eq. 4.9 is the classic flow conservation constraint, which ensures that the sum of flow entering a node equals the sum of flow leaving the node, except for sources that produce flow and sinks that consume flow. Let $adj_u$ be all adjacent nodes (neighbours) of $u \in N^p$. The constraint then states that the sum of all incoming links minus the sum of all outgoing links is 1 if source $s$ is mapped to $u$, $-1$ if sink $t$ is mapped to $u$, and 0 otherwise. Since $y$ is binary, the flow is unsplittable; to allow splittable flows, $y$ can be relaxed to floating-point values. Eq. 4.10 assigns the amount of load on each link to the corresponding variable in $linkLoad$, i.e. the sum of the bandwidth demands of all virtual links assigned to that physical link. $linkLoad$ is constrained by residual bandwidth during variable declaration. Since traffic can move both ways in the network, the sum of both directions is taken. Eq. 4.11 ensures that the maximum allowed latency between containers is not violated: if a virtual link is embedded on a physical path, the latency sum of all physical links on that path must be lower than the maximum allowed latency of the virtual link.
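As a concrete illustration, the following is a minimal sketch of the model in Python using PuLP, with the Cost objective (eq. 4.1). The dict layout and attribute names are the hypothetical ones from the sketch after table 4.2, not the solver code used in the thesis.

# A hedged PuLP sketch of eqs. 4.4-4.11 with the Cost objective (eq. 4.1).
import pulp

def solve_dc_selection(phys, req):
    N = phys["nodes"]    # j -> {"r_cpu", "r_ram", "node_cost"}
    L = phys["links"]    # (u, v) -> {"r_bw", "lat", "link_cost"}, one direction stored
    S = req["services"]  # i -> {"cpu", "ram", "allowed"}
    V = req["vlinks"]    # (s, t) -> {"bw", "max_lat"}

    arcs = list(L) + [(v, u) for (u, v) in L]   # both directions of each link
    adj = {u: [w for (a, w) in arcs if a == u] for u in N}

    prob = pulp.LpProblem("dc_selection", pulp.LpMinimize)
    x = {(i, j): pulp.LpVariable(f"x_{i}_{j}", cat="Binary")
         for i in S for j in N}
    y = {(st, a): pulp.LpVariable(f"y_{st[0]}_{st[1]}_{a[0]}_{a[1]}",
                                  cat="Binary")
         for st in V for a in arcs}

    cpu_load = {j: pulp.lpSum(x[i, j] * S[i]["cpu"] for i in S) for j in N}
    link_load = {(u, v): pulp.lpSum((y[st, (u, v)] + y[st, (v, u)]) * V[st]["bw"]
                                    for st in V) for (u, v) in L}

    # Objective (4.1): allocated node and link resources times their cost.
    prob += (pulp.lpSum(N[j]["node_cost"] * cpu_load[j] for j in N) +
             pulp.lpSum(L[e]["link_cost"] * link_load[e] for e in L))

    for i in S:
        prob += pulp.lpSum(x[i, j] for j in N) == 1            # (4.5)
        for j in N:
            if j not in S[i]["allowed"]:
                prob += x[i, j] == 0                           # (4.6)
    for j in N:
        prob += cpu_load[j] <= N[j]["r_cpu"]                   # (4.7)
        prob += pulp.lpSum(x[i, j] * S[i]["ram"]
                           for i in S) <= N[j]["r_ram"]        # (4.8)
    for (s, t) in V:
        for u in N:                                            # (4.9) flow conservation
            prob += (pulp.lpSum(y[(s, t), (u, w)] for w in adj[u]) -
                     pulp.lpSum(y[(s, t), (w, u)] for w in adj[u])
                     == x[s, u] - x[t, u])
        prob += pulp.lpSum((y[(s, t), e] + y[(s, t), (e[1], e[0])]) * L[e]["lat"]
                           for e in L) <= V[(s, t)]["max_lat"] # (4.11)
    for e in L:
        prob += link_load[e] <= L[e]["r_bw"]                   # (4.10) as a bound

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return {i: next(j for j in N if pulp.value(x[i, j]) > 0.5) for i in S}

Swapping in the RLB or NLB objective would only replace the first expression added to prob (dividing each term by the corresponding residual capacity plus a small $\epsilon$); the constraints stay the same.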

Greedy node selection

A greedy node selection algorithm was implemented as a baseline during evaluation. The algorithm is described in Algorithm 7 and works as follows: the nodes in the request are first sorted by traversing the forwarding graph in breadth-first order, starting at the source node. Each virtual node is then sequentially mapped to a physical node (i.e. DC) by a simple filter-and-rank approach. The filtering phase removes physical nodes that are not in the list of allowed locations, have insufficient resources, or fail to satisfy latency or bandwidth requirements to previously mapped nodes. The remaining physical nodes are then ranked based on normalized load and cost, and the virtual node is mapped to the best-ranked physical node. If all physical nodes are removed during the filtering stage, an exception occurs. The placement decision is then passed to the integer programming solver, which uses the same optimization model as above for route optimization.

input  : Network: nw, Request: req
output : Container to DC mapping: mapping
1  Initialize: mapping
2  order ← breadth-first order of nodes in req.nodes
3  for node in order do
4      filtered ← filter nw.nodes w.r.t. node and mapping
5      ranked ← ordered ranking of nodes in filtered w.r.t. node and mapping
6      add a mapping between node and the best node in ranked to mapping
7  end

Algorithm 7: Greedy node selection
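The following is a small Python sketch of the filter-and-rank loop, reusing the hypothetical data structures from the earlier sketches. The ranking formula combining normalized CPU load and cost is an illustrative stand-in for the thesis's actual ranking, and the latency/bandwidth filter is only indicated by a comment.

# Greedy baseline (Algorithm 7) as a hedged Python sketch.
from collections import deque

def greedy_dc_selection(phys, req, source):
    N = phys["nodes"]
    # Breadth-first order over the forwarding graph, starting at the source.
    order, seen, queue = [], {source}, deque([source])
    while queue:
        s = queue.popleft()
        order.append(s)
        for t in req["adjacency"][s]:      # service -> neighbouring services
            if t not in seen:
                seen.add(t)
                queue.append(t)

    mapping = {}
    for s in order:
        svc = req["services"][s]
        # Filter: allowed location and sufficient residual CPU/RAM.
        # (Latency/bandwidth feasibility towards already-mapped neighbours
        # would also be checked here.)
        feasible = [j for j in N
                    if j in svc["allowed"]
                    and N[j]["r_cpu"] >= svc["cpu"]
                    and N[j]["r_ram"] >= svc["ram"]]
        if not feasible:
            raise RuntimeError(f"no feasible DC for service {s}")
        # Rank: prefer low normalized CPU load and low allocation cost.
        best = min(feasible,
                   key=lambda j: (1 - N[j]["r_cpu"] / N[j]["cpu"])
                                 + N[j]["node_cost"])
        mapping[s] = best
        N[best]["r_cpu"] -= svc["cpu"]
        N[best]["r_ram"] -= svc["ram"]
    return mapping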

4.3 Server Selection

After services have been mapped to data centers they need to be mapped to servers. The problem is very similar to the DC selection problem and many of the parameters from tables 4.1 and 4.2 remain the same in this phase. However, rather than load balancing, the objective is instead to minimize the number of active servers and the amount of traffic between servers and racks in the DC. We use a mathematical model similar to the one in the DC selection phase, but we relax eq. 4.10 in order to investigate link overutilization of each heuristic. We also assume that intra-DC latency is low enough to satisfy any latency constraint, so eq. 4.11 is ignored.


Solving server selection optimally for non-trivial problem instances is infeasible. Based on the assumption that microservice-based applications and service function chains can have high inter-service traffic, the solution needs to be traffic-aware. Previous work in traffic-aware VMP often takes advantage of the hierarchical structure of DCs to partition resources, and then maps VM clusters to partitions using minimum cuts in the VM communication graph. This is also the approach taken in this thesis.

Server selection heuristic

The server selection heuristic presented here is inspired by techniques described in [35, 36, 38]. The main idea behind the algorithm is the same as in [38], namely traffic-aware container clustering, partitioning servers into components, and assigning clusters to components. The partitioning phase can be done in multiple hierarchical steps to increase the closeness of communicating containers, assuming the DC topology allows for hierarchical resource partitioning. The main steps are described below and the main algorithm is presented in algorithm 8.

Step 1: Partition the servers in the DC into sub-components, e.g. per rack. This step can create a hierarchical partitioning with recursively smaller sub-components to increase the proximity of the placed services in a request.

Step 2: Create a Gomory-Hu tree from the communication graph in the request. The Gomory-Hu tree contains minimum s-t cuts for every pair of services in the request.

Step 3: Find the smallest component the entire request can fit into (e.g. a server or rack). The smallest component could be the entire data center if resources are fragmented, or if the request is very large and servers are only partitioned at the rack level.

Step 4: Recursively divide the request into smaller and smaller clusters, assigning each cluster to a smaller and smaller component until all containers are assigned to a server. If a cluster cannot fit into a single component, the cluster is split into two new clusters at the minimum cut of the cluster. During the split, the minimum cut in the Gomory-Hu tree of the cluster is removed, and each new cluster is assigned its corresponding part of the tree. The clusters are finally added to a priority queue that ranks clusters by highest intra-cluster traffic, i.e. the sum of all links between nodes in the cluster (a sketch of the tree construction and split step is given below).
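To illustrate steps 2 and 4, here is a minimal sketch using networkx, whose gomory_hu_tree function computes a tree encoding all-pairs minimum s-t cuts. The toy communication graph and the bw edge attribute are hypothetical examples, not data from the thesis.

# Steps 2 and 4 sketched with networkx: build the Gomory-Hu tree of a
# container communication graph and split a cluster at its minimum cut.
import networkx as nx
from networkx.algorithms.flow import gomory_hu_tree

def split_at_min_cut(comm_graph):
    # Step 2: the Gomory-Hu tree encodes minimum s-t cuts for all pairs;
    # virtual-link bandwidths act as capacities.
    tree = gomory_hu_tree(comm_graph, capacity="bw")
    # Step 4: the lightest tree edge is the cluster's global minimum cut.
    u, v = min(tree.edges, key=lambda e: tree.edges[e]["weight"])
    cut_value = tree.edges[u, v]["weight"]
    tree.remove_edge(u, v)
    # Removing the edge splits the tree, and hence the containers, in two.
    left, right = (set(c) for c in nx.connected_components(tree))
    return cut_value, left, right

# Toy request: a chain of four containers with one weak link.
g = nx.Graph()
g.add_edge("a", "b", bw=10)
g.add_edge("b", "c", bw=2)   # lightest link -> expected cut point
g.add_edge("c", "d", bw=8)
print(split_at_min_cut(g))   # (2, {'a', 'b'}, {'c', 'd'}), up to ordering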

Description of heuristic

Assuming the DC has been partitioned into components, the algorithm begins by generating a Gomory-Hu tree from the request. The next step finds the smallest component in the DC that can fit the entire request. Depending on how the DC has been partitioned, the smallest component could be e.g. a server or a rack. If the request is very large, or resources are scarce and fragmented, the smallest component could be the entire DC, which means that the request must be split to fit into any sub-component. A priority queue is used to order container clusters by intra-cluster link sum, i.e. the sum of traffic between all nodes in a cluster. Let $LS(nodes)$ be a function that returns the link sum of the nodes in $nodes$. The Gomory-Hu tree and the current target component are passed along with the nodes in the priority queue. At line 9 in algorithm 8, the best-fit sub-component is found by looking at all child nodes of the component, e.g. PMs if $C$ is a rack. If there is no sub-component that can fit all nodes, the cluster is split into two new clusters at the minimum cut in the cluster's Gomory-Hu tree, and the two new clusters are pushed onto the priority queue. However, if there is only a single node in $nodes$, then we have failed to find a mapping for it. This can be caused by resource fragmentation, since the resources of higher-level components are calculated
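A small sketch of the $LS$ function and the priority queue, assuming Python's heapq (a min-heap, so pushing the negated link sum pops the highest-traffic cluster first) and the same hypothetical communication-graph representation as above:

# Priority queue over clusters, ordered by intra-cluster link sum LS(nodes).
import heapq
import itertools

def link_sum(comm_graph, nodes):
    # Sum of bandwidth on virtual links with both endpoints in the cluster.
    return sum(d["bw"] for u, v, d in comm_graph.edges(data=True)
               if u in nodes and v in nodes)

counter = itertools.count()  # tie-breaker so heapq never compares payloads

def push_cluster(q, comm_graph, tree, component, nodes):
    ls = link_sum(comm_graph, nodes)
    # Negate ls: heapq is a min-heap, but the highest-traffic cluster
    # should be handled first.
    heapq.heappush(q, (-ls, next(counter), tree, component, nodes))

def pop_cluster(q):
    neg_ls, _, tree, component, nodes = heapq.heappop(q)
    return -neg_ls, tree, component, nodes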


input  : Data center: dc, Request: req
output : Container to server mapping: mapping
1   Initialize: mapping, unmapped
2   tree ← Gomory-Hu tree generated from req
3   C ← find smallest best-fit component in dc that can fit req
4   remove req.nodes resources from C
5   initialize a priority queue q
6   push (LS(req.nodes), tree, C, req.nodes) on q
7   while q is not empty do
8       ls, tree, C, nodes ← q.pop()
9       C' ← find best-fit sub-component of C that can fit nodes
10      if C' exists then
11          remove resources of nodes from C'
12          if C' is a PM then
13              add a mapping between nodes and C' to mapping
14          end
15          else
16              push (ls, tree, C', nodes) on q
17          end
18      end
19      else
20          if tree is empty then
21              add resources from nodes back to C and its parents
22              add nodes to unmapped
23          end
24          else
25              lt, ln, rt, rn ← split request at the smallest edge in tree
26              push (LS(ln), lt, C, ln) on q
27              push (LS(rn), rt, C, rn) on q
28          end
29      end
30  end
31  map all nodes in unmapped using BFD and add results to mapping

Algorithm 8: Server selection heuristic

by taking the sum of the residual resources of their children. If $C'$ is a PM, then a mapping is made between all nodes in $nodes$ and $C'$ (line 13).

At line 25, the request and the Gomory-Hu tree are split. Nodes are partitioned based on the two subtrees produced by removing the minimum cut edge in the Gomory-Hu tree. Because of the properties of Gomory-Hu trees, no minimum cuts ever have to be re-computed. When a request is split repeatedly, eventually only single nodes remain in each cluster.

Time complexity

Let $n$ be the number of virtual nodes and $m$ the number of physical components (i.e. the number of PMs and partitions). Best-fit is linear in the number of components, $O(m)$, and in the worst case all components must be examined for every item. Running best-fit for all items therefore has time complexity $O(nm)$.

The time complexity of the proposed heuristic is also bounded by the time complexity of constructing a Gomory-Hu tree. A Gomory-Hu tree can be constructed with $n - 1$ minimum s-t cut computations; the time complexity of constructing the Gomory-Hu tree is therefore $n - 1$ multiplied by the cost of a single minimum s-t cut computation.
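Combining the two parts gives the following hedged summary, where $T_{cut}$ denotes the cost of one minimum s-t cut computation on the request's communication graph:

$$O\!\left(nm + (n-1) \cdot T_{cut}\right)$$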
