
UPTEC IT 19013

Degree project, 30 credits (Examensarbete 30 hp)

August 2019

A Service for Provisioning Compute Infrastructure in the Cloud

Tony Wang



Abstract

A Service for Provisioning Compute Infrastructure in the Cloud

Tony Wang

The amount of data has grown tremendously over the last decade. Cloud computing is a solution for handling large-scale computations and immense data sets. However, cloud computing comes with a multitude of challenges that the scientists who use the data have to tackle. Provisioning and orchestrating cloud infrastructure is a challenge in itself, given the wide variety of applications and cloud providers that are available. This thesis explores the idea of simplifying the provisioning of compute applications in the cloud. The result of this work is a service that can seamlessly provision and execute cloud computations using different applications and cloud providers.


Contents

1 Introduction
2 Background
  2.1 Cloud Computing Concepts and Obstacles
  2.2 Scientific Computing
  2.3 HASTE Project
  2.4 Motivation
  2.5 Purpose
3 Related Work
4 System Implementation
  4.1 System Overview
  4.2 Terraform
  4.3 REST Service
  4.4 Message Queue
  4.5 Data Aware Functionality
  4.6 Negotiator Module
    4.6.1 Resource Availability
    4.6.2 Terraform Configuration Generation
    4.6.3 Executing Terraform Scripts
  4.7 Tracing
  4.8 Infrastructure Implementations
    4.8.1 Spark Standalone Cluster
    4.8.2 HarmonicIO Cluster
    4.8.3 Loading Microscopy Images
    4.8.4 Single Container Application
  4.9 Simple Web User Interface
5 Results
  5.1 Spark Standalone Cluster
  5.2 HarmonicIO Cluster
  5.3 Image Loader
  5.4 Running a Trivial Container
6 Discussion & Evaluation
  6.1 Comparison Against Other Methods
    6.1.1 SparkNow
    6.1.2 KubeSpray
    6.1.3 Manual Provisioning
  6.2 Future Development Complexity
  6.3 Tracing
  6.4 Data Aware Function
  6.5 Security Issues
  6.6 Limitations of This Service
7 Future Work


1 Introduction

There has been tremendous growth in data over the past decade. This trend can be observed in almost every field; the Large Hadron Collider experiment at CERN [2] and the Square Kilometre Array project [7] are examples of scientific experiments dealing with data beyond the petascale. This requires efficient, scalable and resilient platforms for the management of large datasets. Furthermore, to carry out the analysis, these large datasets must be made available to the computational resources. Recently, together with cloud infrastructures, a new concept has emerged: Infrastructure as Code (IaC). IaC enables run-time orchestration, contextualization and high availability of resources using programmable interfaces [4]. The concept allows mobility and high availability of customized computational environments. AWS Cloud Foundry, OpenStack HOT and Google AppEngine are platforms aligned with the concept of IaC. However, it is still overwhelming and time-consuming to capitalize on this concept. In order to satisfy researchers and to seamlessly access customized computational environments for analysis, a level of abstraction is required that hides the platform-specific details and intelligently places the computational environment close to the datasets required for the analysis.

This thesis proposes a software service that aims to support the researchers in the Hierarchical Analysis of Spatial and Temporal Data (HASTE [3]) project in seamlessly running compute applications on different cloud services. The main capabilities of the software are that it is cloud agnostic, that it traces the build process of the compute infrastructure and that it is data aware, meaning that it can locate the data resource used in the proposed computation.

2 Background

Cloud computing is appearing as a new trend in the ICT sector due to the wide array of services the cloud can provide. Many companies, such as Google and Amazon, offer different kinds of cloud services, Google App Engine and Amazon Web Services (AWS) respectively. Each provider manages its own infrastructure in its own fashion. The cloud providers control large pools of computers and profit from the cloud by renting out user requested resources. Users are billed either on a time basis, such as per month, or on a usage basis, where they pay depending on the workload of the rented resources. Beyond commercial use, cloud computing is expanding in scientific research using platforms such as OpenStack to provide computation. However, cloud computing comes with many challenges that must be tackled both by businesses using the cloud commercially and by scientists who look to the cloud to run scientific computations.

2.1 Cloud Computing Concepts and Obstacles

The term cloud computing has existed since the 1960s; however, the concept gained popularity in 2006. There is no single clear definition of the term. The National Institute of Standards and Technology (NIST) [13] describes cloud computing as a model for enabling convenient, on-demand network access to a shared set of configurable computing resources that can be rapidly provisioned and released with minimal effort.

Generally speaking, the cloud can be divided into four architectural layers, which Zhang et al. [24] describe in the following way. The lowest level is the hardware layer, where the bare metal resides: routers, switches, power and cooling systems. Next is the infrastructure layer, which creates a set of resources configured on the hardware through virtualization technologies. Above the infrastructure layer is the platform layer, where operating systems and application frameworks lie. The final layer is the application layer, where software applications are deployed.

The business model of the cloud can be categorized into different services derived from the architectural layers. NIST defines the services as follows. Infrastructure as a Service (IaaS) provides processing, storage and networks; the user has the ability to deploy and run software on the infrastructure, such as operating systems and applications. Examples of IaaS providers are Google and Amazon. Platform as a Service (PaaS) allows the user to use the cloud infrastructure through provided tools; the user does not control the underlying networks, operating systems or storage, but only the self-deployed applications. Software as a Service (SaaS) is the highest level of user abstraction, where the user only accesses the cloud through the provider's interface, commonly a thin client interface or a web browser.

As mentioned in the introduction, the recently coined concept Infrastructure as Code (IaC) is on the rise. The principle of IaC is to treat the infrastructure as code and then use that code to provision and configure the infrastructure, most importantly for provisioning virtual machines (VMs) in IaaS. The code represents the desired state of the infrastructure without having to walk through manual steps and previous configurations [16]. This concept allows similar software engineering techniques related to programming and software development to be applied when building one's infrastructure, which means that a blueprint or state of the infrastructure can be version controlled, shared and re-used. The end purpose of IaC is to improve the quality of one's infrastructure [21].

Another important concept is the container, which is growing in the cloud computing field; containers are most often used at the application level to replace virtual machines. There are many advantages of using containers: containers are more lightweight than VMs, and start time and resource usage are decreased [14]. Docker is one of the most well-known and widely used tools for containerizing applications. Docker provides services that build and assemble applications. Each Docker container is based on a system image, a static snapshot of a system configuration. A Docker container is run by the client; when a container is to be run, Docker looks for the image on the machine or downloads it from a remote registry. Once the image is ready, Docker creates a container, allocates a file system with a read and write layer and creates a network to interact with the host machine [19]. The main principle of using Docker containers is to avoid conflicting dependencies; for example, if two websites need to run two different versions of a framework, each version can be installed in a separate container. Also, all dependencies are bound to a container, which removes the need to re-install dependencies if the application is re-deployed. Furthermore, Docker containers are not very platform dependent; the only requirement is that the operating system runs Docker [14].

As a consequence of multiple providers, many individuals who use cloud services face the issue of adapting to each and every cloud provider. One of the main obstacles is vendor lock-in [11], meaning that the cost of changing vendor is too high to justify the change, which leads to being locked into one vendor. The lack of standards increases the difficulty of managing interfaces to different cloud vendors. Multiple recent works have tackled the problem of vendor lock-in by developing APIs that interface to various types of cloud. Developing standards could be a good solution; however, the larger cloud vendors who are leading the cloud business do not seem to agree on proposed standards.


2.2 Scientific Computing

Scientific computing is a research field that uses computer science to solve problems. The research can involve large scale computing and simulation, which often requires large amounts of computer resources. Recently, scientific computing has progressively required more and more computing to cope with the immense amount of data that is generated. The amount of data produced by massive-scale simulations, sensor deployments, high throughput lab equipment and so on has increased in recent years; already in 2012, it was predicted that the amount of data generated would pass 7 trillion gigabytes [17]. When the amount of data used for computing exceeds the power of an individual computer, distributed computing systems are used in some cases to counteract this problem. Cloud computing proposes an alternative for running scientific computations which can be specifically beneficial for scientific computing. Researchers can take advantage of the potentially lower cost of running cloud computations by reducing administration costs and taking advantage of flexible cloud scalability. Cloud computing also gives researchers located in different areas an opportunity to ease collaboration. Compare this to running computations on personal computers or campus exclusive resources, where there may be limited resources, security issues and difficulties in sharing data.

In the large scale scientific computing field, one of the most popular frameworks is Apache Spark (Spark) [22]. Spark is the largest open source project for unified programming and big data. Spark has a programming model called Resilient Distributed Datasets (RDD) that can support a wide range of processing techniques including SQL, machine learning and graph processing. The key point of RDDs is that they are collections of data partitioned across a compute cluster, on which functions can be run in parallel. This of course requires that the user has access to cloud infrastructure. Users operate on RDDs by applying specific functions, for example map, filter and groupBy. The main speedup of Spark comes from its data sharing capabilities: instead of storing its data on disk, it stores the data in memory to allow faster shareability. Spark was developed as a tool for users, including scientists.
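To illustrate the RDD programming model, the following minimal PySpark sketch (run locally here rather than on a provisioned cluster, and not taken from the thesis) applies filter and map to a distributed collection:

from pyspark import SparkContext

sc = SparkContext(master="local[2]", appName="rdd-sketch")
# Distribute a range of numbers across the cluster (here: two local threads).
numbers = sc.parallelize(range(1000000))
# Transformations such as filter and map run in parallel on the partitions.
even_squares = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(even_squares.take(5))
sc.stop()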

2.3 HASTE Project

Part of the work of this thesis is to assist the HASTE project, which aims to intelligently process and manage microscopy image data. The HASTE project is funded by the Swedish Foundation for Strategic Research [1]. Its main objectives are to discover interestingness in image data using data mining and machine learning techniques and to develop intelligent and efficient cloud systems. To find the interestingness of a microscopy image, different machine learning techniques are used that are processed in the cloud. The HASTE project pipeline involves scientists with different academic backgrounds, and the whole project consists of smaller related projects, several of which use the cloud to store data and execute computations.

SNIC Science Cloud is a community cloud run by the Swedish National Infrastructure for Computing with the purpose of providing large scale computing and storage for research in Sweden. SNIC mainly provides IaaS and higher level PaaS. The HASTE project runs its computations on the SNIC Science Cloud, and the work in this thesis exclusively uses the SNIC Science Cloud to provision infrastructure for the users.

2.4 Motivation

To demonstrate the demand for this service, a motivational example is presented. As of now, researchers in the HASTE project, and potentially other individuals who have to run scientific cloud computations, have their own procedures to provision and cluster their own infrastructure using their own scripts and different command line and graphical interfaces. This can be rather time consuming, and a researcher would perhaps rather spend time doing actual research instead of infrastructure provisioning. The problem of vendor lock-in enters here, since scripts and interfaces may vary a lot depending on cloud vendor: a scientist may have successfully executed their experiments on one cloud infrastructure, but when the need to change cloud provider arises, the process has to be repeated. Another important factor to consider is the physical placement of the data; a scientist may have to manually look up the metadata inside a potentially hidden or difficult to access file. Another difficulty in cloud orchestration is the error prone nature of the long provisioning process: various errors can arise from the orchestration process that are difficult to find and debug. An example scenario of a HASTE researcher who would like to run a machine learning computation plays out in the following fashion. The researcher has to locate the credentials and other metadata regarding the cloud provider before starting a machine. The next step is to provision the requested data to the machine, and to execute the code the researcher has to install all the required packages. This is a lengthy process that is preferable not to have to repeat. Another example is the orchestration of multiple instances: if a researcher wants to run a compute cluster, the researcher has to start multiple machines and then, with much effort, connect them together into a cluster. This process can arguably be even more time consuming than the previous example.


2.5 Purpose

The purpose of this thesis is to support the researchers in the HASTE project and to build a general software service for automatic provisioning of cloud infrastructure with intelligent data aware aspects. The data aware aspects come from supplying pre-provisioned metadata to the service, allowing the users to skip setting cloud metadata variables. The main purpose is to provide options for the HASTE researchers to seamlessly run HASTE relevant software on the SNIC Science Cloud through the service, and to simplify the provisioning process compared to running HASTE cloud projects through manual provisioning. A general use case of provisioning a Spark compute cluster and a case where a container application is run are also provided to exemplify a process which a non-HASTE scientist can benefit from. The researchers should have the ability to provision compute infrastructure through easily accessible command line and graphical interfaces. The service includes the potential to provision not only to the SNIC Science Cloud but also to other OpenStack cloud projects and other non-OpenStack cloud providers. Additionally, a tracing mechanism is implemented to provide transparency and feedback on the underlying orchestration process, to give the user more insight into any potential errors during the process. Furthermore, a conceptual web interface is created for the purpose of granting the user a simple graphical interface for creating their infrastructure.

3 Related Work

Several other cloud computing frameworks exist for the purpose of abstracting the cloud orchestration layer and counteracting vendor lock-in, where it is too burdensome and difficult to deploy applications on different cloud providers while keeping important aspects such as security and quality of service consistent. A few use model-driven design as their main method for designing the framework. Model-driven design is not the focus of this work, but it is an interesting take on development that one can draw inspiration from. Other frameworks and software applications have also been developed and published as open source to help the developer community deploy infrastructure.

… are to run MODACLOUDS as a platform for deployment, development, monitoring and adaptation of applications in the cloud.

Chen et al. present MORE [10], a framework that uses model-driven design to ease the challenges of deployment and configuration of a system. MORE provides the user with a tool to model the topology of a system structure without demanding much domain knowledge. The model is then transformed into executable code to abstract the orchestration of the system, and the user eventually gains access to the cloud infrastructure. Non model-driven tools also exist. For example, Sandobalin, Insfran & Abrahao present an infrastructure modelling tool for cloud provisioning called ARGON [15]. The tool is meant to address the management of infrastructure as code (IaC); their goal is to take the DevOps concept and apply it to IaC. Through a domain specific language, ARGON reduces the workload for operations personnel. With ARGON, developers have the opportunity to version control and manage their infrastructure without needing to consider the interoperability of different cloud providers.

To further investigate cloud interoperability and approaches to avoid vendor lock-in, Repschlaeger, Wind, Zarnekow & Turowski [23] implemented a classification framework for comparing different cloud providers. Their purpose was to help e-governments with the problem of selecting an appropriate cloud vendor with regard to prices, security and other important features. Their method of development was to investigate through literature surveys and expert interviews.

Furthermore, Capuccini, Larsson, Toor & Spjuth developed KubeNow [9], a framework for rapid deployment of cloud infrastructure using the concept of IaC through the Kubernetes framework. The goal of KubeNow is to deliver cloud infrastructure for on-demand scientific applications. Specifically, KubeNow offers deployment on Amazon Web Services, OpenStack and Google Compute Engine.

Additionally, there are other more well known frameworks that bring the benefits of IaC; examples include Ansible (https://www.ansible.com/), Puppet (https://puppet.com/), AWS OpsWorks (which uses Chef and Puppet) and Terraform.

Unruh, Bardas, Zhuang, Ou & DeLoach present ANCOR [20], a prototype of a system built from their specification. The specification is designed to separate user requirements from the underlying infrastructure and to be cloud agnostic. ANCOR uses Puppet as a configuration management tool; however, ANCOR supports other configuration management tools such as Chef, SaltStack, bcfg2 and CFEngine. ANCOR mainly targets OpenStack, although there is a possibility of using AWS as well. ANCOR was developed with a domain specific language based on YAML. Their conclusions show that ANCOR can improve manageability and maintainability and enable dynamic cloud configuration during deployment without performance loss.

SparkNow (https://github.com/mcapuccini/SparkNow) is provisioning software that focuses on rapid deployment and teardown of Spark clusters on OpenStack. It simplifies the provisioning process by providing pre-written provisioning scripts; through user arguments it can provision the requested infrastructure without requiring the user to learn the orchestration process. KubeSpray is similar to SparkNow in the sense that it simplifies infrastructure provisioning; however, its focus is on rapid deployment of Kubernetes clusters instead of Spark clusters, on OpenStack and AWS clouds.

The related work mentioned in this section focuses on developing standalone tools and different domain specific languages for creating infrastructure. These works put a lot of effort into making it easy to deploy cloud applications through their tools, reducing the complexity of creating cloud infrastructure. The IaC concept is again explored and used efficiently to provision infrastructure. Cloud vendor lock-in is also discussed, concerning the ability to deploy applications on different providers, which is important for the users. This thesis proposes the ability for a user to request cloud infrastructure using less domain knowledge, options to choose which provider to deploy infrastructure on, and monitoring of the provisioning process. Another proposal is to explore further infrastructure abstraction that requires even less knowledge, adding another abstraction layer over existing software and using data-aware aspects that take advantage of metadata to pre-provision the orchestration service. Furthermore, this work implements a tracer for the orchestration process to track the orchestration flow.

4 System Implementation

To develop the provisioning software, numerous technologies were used. The service is split into different modular parts that communicate with each other. The user communicates with a server through interfaces, which in turn communicates with another module that uses external frameworks to provision infrastructure. Overall, the system can be seen as a client-server application. The whole system is also traced using external libraries that are integrated throughout the system.

4.1 System Overview

To start off, there is a conceptual graphical user interface built as a web interface using the common web languages HTML, CSS and JavaScript. The React library (https://reactjs.org/) is used as the main library for writing and structuring the interface. Using React, the business logic and markup are split into components, which allows for more flexibility and re-usability.

The service that handles the requests and provisions the infrastructure is called the negotiator. Between the user and the negotiator lies a Representational State Transfer (REST) service, written in Python 2.7 with the Flask library, that can be called for communication. The middleman, or broker, between the REST server and the negotiator is RabbitMQ, a message broker that handles the requests from the client and sends them to the negotiator. The negotiator is designed so that new calls to different cloud providers can be integrated into the module by constructing new classes for each provider. Applying a REST service gives the system an interface between the user and the negotiator, which makes it possible to seamlessly alter the communication with the negotiator. By defining the REST endpoints, the module can consistently accept the expected arguments to create infrastructure.

SNIC Science Cloud implements OpenStack, which is a platform for orchestrating and provisioning infrastructure. This project uses Terraform in conjunction with OpenStack to provision the compute infrastructure on SNIC Science Cloud, where Terraform is used as the framework that provides IaC. Tracing is performed with OpenTracing, an open source tracing API that is available in multiple languages.

The general step by step process for a user to request infrastructure can be described as the following steps:

(a) The user creates a POST request to the REST service from any interface, which can be a web interface or a command line interface.

(b) The request arrives at the REST server, where it is forwarded to the message broker, and the server returns the web URL to the tracing interface.

(c) The message broker receives the request and puts it, now a message, in the queue for consumption.

(d) The consumer forwards the request to the negotiator module, which handles the request and provisions the infrastructure.

(e) After orchestration, the user is sent feedback regarding the infrastructure.

(f) The process can be traced during and after each request.

Figure 1: A high level overview of the system.

A high level overview of the system is shown in Figure 1 to give a better abstract understanding of how the system communicates. The user interacts with the system through a REST service, implemented as a Flask server. The request is forwarded to the negotiator, which then, depending on the request, provisions infrastructure in the user requested cloud provider.
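As an illustration of step (a), the request below is a sketch of what a command line client could send. The endpoint path and field names are assumptions for illustration; the exact request schema is not listed here.

import requests

payload = {
    'provider': 'openstack',            # selects the provider implementation
    'configuration': 'spark-standalone',
    'worker_count': 2,
    'flavor': 'ssc.small',
    'public_key': 'ssh-rsa AAAA...',
    'email': 'researcher@example.org',  # used for the completion notification
}
response = requests.post('http://negotiator.example.org:5000/infrastructure',
                         json=payload)
# The reply contains the request id and the URL to the tracing interface.
print(response.json())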


4.2 Terraform

Terraform is a tool that applies IaC to provision infrastructure. Terraform can be used to build, change and version control cloud infrastructure. The desired state of the cloud infrastructure is described in Terraform configuration files written by the users; after successfully executing the configuration with the Terraform binary, the infrastructure requested in the configuration file is provisioned. The main motivations for using Terraform are its simplicity in changing and adding new infrastructure for different providers and the power of IaC, which is used to dynamically provision infrastructure. Furthermore, using Terraform may avoid the problem of vendor lock-in because of the multitude of providers Terraform supports. Software such as Heat works similarly; however, it only supports one platform (OpenStack), while Terraform can perform the same tasks for multiple providers, for example orchestrating an AWS and an OpenStack cluster at the same time. Terraform is cloud agnostic in the sense that the Terraform software can be used with various providers; one might think that a single configuration can be used by different providers, but that is not the case. To create an equivalent copy of an infrastructure on two different providers, one has to write two different configurations, although some parts, such as variables, can be shared. Still, it is simple to change provider: the syntax, functions and thought process for writing the code stay the same. The configuration files are written in HCL (HashiCorp Configuration Language), a configuration language built by HashiCorp, the founders of Terraform. The same language is used for all the providers that Terraform supports, and HCL can also be expressed in the JSON format to allow for more flexibility. An example configuration can be seen in Listing 1; when executed, Terraform creates one AWS instance of type t2.micro in the us-east-1 region using the user's access and secret key. The provider block determines the provider and the resource block describes which resources are provisioned. Additionally, Terraform can provision more than just compute instances, for example storage, networking, DNS entries and SaaS features.

Listing 1:
provider "aws" {
  access_key = "ACCESS_KEY"
  secret_key = "SECRET_KEY"
  region     = "us-east-1"
}

resource "aws_instance" "example" {
  ami           = "ami-2757f631"
  instance_type = "t2.micro"
  ...#Additional blocks
}

This work uses Terraform's OpenStack provider to provision the OpenStack based infrastructure used by SNIC Science Cloud. The basic configuration for Terraform's OpenStack provider consists of a provider block, similar to the AWS example above, which determines the OpenStack provider, and resource blocks that describe the provisioned resources. Listing 2 is an example of an OpenStack configuration where a single instance is created under a specific user. Additional connection variables (auth_url, tenant_name, tenant_id, user_domain_name) are given to connect to the specific cloud. The single instance is created using the given parameters that specify the image name, the flavor, the key pair and the security groups. In this example, variables are used as input parameters instead of static strings; each parameter references a variable that stores the argument. Using this method, variables can be set from exterior sources, for example from the command line interface, environment variables or external executable files.

Listing 2:
provider "openstack" {
  user_name        = "${var.user_name}"
  password         = "${var.password}"
  tenant_id        = "${var.tenant_id}"
  tenant_name      = "${var.project_name}"
  auth_url         = "${var.auth_url}"
  user_domain_name = "${var.user_domain_name}"
}

resource "openstack_compute_instance_v2" "example" {
  image_name      = "${var.image_name}"
  flavor_name     = "${var.flavor_name}"
  key_pair        = "${var.key_pair_name}"
  security_groups = ["default"]
  ...#Additional instance variables
}
...#Additional blocks

4.3 REST Service

REST is an architectural design pattern for machine to machine communication [12]. By applying a REST architecture, the separation of concerns principle is applied, that is, the separation of the user interface and the system back-end. The result is that the portability and scalability of the system are improved, and the REST service and the rest of the system can be developed independently. A REST service requires that the client makes requests to the service. A request contains an HTTP verb, which defines the operation to perform, and a header containing the data to pass to an endpoint. The four basic verbs are POST, PUT, DELETE and GET. The negotiator REST service has two callable endpoints, POST and DELETE.

Using the POST request, the endpoint accepts the user arguments for provisioning the infrastructure. The DELETE endpoint is then used to delete existing infrastructure. The endpoints themselves call the functions of the negotiator module, which allows a flexible REST implementation where changes, such as new endpoints, can be made to the REST service without affecting the negotiator module. The REST service expects the data in the request to be in JSON format and replies with data in JSON format; JSON is human readable, simple to use and supported by most languages for easier integration. The REST server must be asynchronous; otherwise the user would have to make a request and then wait for the result, considering that provisioning a cluster may take several minutes. To solve this problem, the user is returned an id for the request. The id is bound to the request, and any future calls on the requested infrastructure are made in conjunction with the id.

Listing 3:
{
  "provider": "some_provider (openstack, aws, google etc.)"
}
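A minimal Flask sketch of such an asynchronous endpoint is shown below. The route names, the id generation and the queue helper are assumptions for illustration, not the exact implementation used in the service.

from flask import Flask, jsonify, request
import uuid

app = Flask(__name__)
JAEGER_UI = 'http://tracing.example.org:16686'   # assumed address of the trace web UI

def publish_to_queue(request_id, payload):
    # Placeholder: the real service publishes the message to RabbitMQ here.
    print('queued request %s' % request_id)

@app.route('/infrastructure', methods=['POST'])
def create_infrastructure():
    body = request.get_json(force=True)          # e.g. {"provider": "openstack", ...}
    request_id = str(uuid.uuid4())               # id returned for future DELETE calls
    publish_to_queue(request_id, body)
    # Return immediately; provisioning continues asynchronously via the queue.
    return jsonify({'id': request_id, 'trace_url': JAEGER_UI}), 202

@app.route('/infrastructure/<request_id>', methods=['DELETE'])
def delete_infrastructure(request_id):
    publish_to_queue(request_id, {'action': 'delete'})
    return jsonify({'id': request_id}), 202

if __name__ == '__main__':
    app.run()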

4.4 Message Queue

The system uses RabbitMQ, which implements the Advanced Message Queuing Protocol (AMQP). The message queue is placed between the REST service and the negotiator. The idea behind using a message queue is to avoid long running and resource intensive tasks blocking the service; tasks are instead scheduled to be executed when ready. To summarize the process, tasks are turned into messages and put in a queue until they are ready to be executed. RabbitMQ itself is the broker that receives and delivers messages. A queue lives inside RabbitMQ; producers are programs that send messages, which the broker stores in its queue, and a consumer program consumes the messages in the queue to handle the producers' messages. Figure 3 depicts the workflow of the message queue. The producer, which in this work is the REST service, puts new messages (requests from the users) into the queue. The consumer side of the system is then ready to execute the requests from the queue.

The benefit of a message broker is that it can accept messages and thus reduce the load on the other programs, such as the REST service. Consider the fact that the provisioning process takes several minutes: a synchronously implemented service would be on hold for the whole process and therefore lock out other clients from connecting. The message queue avoids this problem, meaning that clients can send a request to the service and then continue with something else. Another important benefit of message queues is modularity: the queue is developed separately from the rest of the system, can be written in any language and can be started and run separately from the REST server and the negotiator [5].

The system's message queue is the middleman between the REST service and the negotiator. The REST service sends the parsed POST or DELETE request as a message to the broker, which stores the message in the queue and waits for it to be consumed. After consuming the message, the receiving part of the message queue calls the negotiator to start the requested provisioning process.
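The following sketch shows the producer and consumer sides using the pika client; the queue name and message layout are assumptions, not the service's actual names.

import json
import pika

QUEUE = 'provision_requests'   # assumed queue name

def publish(request_id, payload):
    # Producer side, called from the REST service.
    connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
    channel = connection.channel()
    channel.queue_declare(queue=QUEUE, durable=True)
    channel.basic_publish(exchange='', routing_key=QUEUE,
                          body=json.dumps({'id': request_id, 'request': payload}),
                          properties=pika.BasicProperties(delivery_mode=2))
    connection.close()

def on_message(channel, method, properties, body):
    # Consumer side: hand the request over to the negotiator module here.
    message = json.loads(body)
    print('provisioning request %s' % message['id'])
    channel.basic_ack(delivery_tag=method.delivery_tag)

def consume():
    connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
    channel = connection.channel()
    channel.queue_declare(queue=QUEUE, durable=True)
    channel.basic_consume(queue=QUEUE, on_message_callback=on_message)
    channel.start_consuming()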

Figure 3: Producers add tasks to the queue, which consumers consume [6].

4.5 Data Aware Functionality

The data aware aspect is one of the main characteristics of the negotiator. The purpose is to pre-store metadata regarding the cloud provider to avoid having the user configure infrastructure metadata arguments. During run-time, the metadata is fetched from a metadata store that holds values pre-stored by the user or another user. The metadata store uses key-value based storage, where the key is the name of the data and the value contains the relevant metadata that the negotiator module needs to locate the data. Since different providers require different metadata, the metadata is stored under a provider key, which could be, for example, aws or openstack.

Listing 4 shows an example of metadata for an OpenStack provider. These variables are required to start an instance on an OpenStack cloud. They are tedious to handle, are mostly kept the same and are rarely changed. The external_network_id and tenant_id, for example, are two variables that a user most probably does not want to be responsible for. By pre-storing these variables, the users of this cloud do not have to keep track of them, which reduces the number of input parameters on the user side. When a metadata parameter does change, a user has to change it manually, but this also has a positive effect for multiple users of the system: one change means that the other users do not have to change the same variable. Compare this to storing the variables locally on each user's machine, where every user of the system has to update the variable when a change happens.

Listing 4:
"openstack": {
  "example_data": {
    "external_network_id": "...",
    "floating_ip_pool": "Network pool",
    "image_name": "Ubuntu",
    "auth_url": "https://cloud.se:5000/v3",
    "user_domain_name": "cloud",
    "region": "Region One",
    "tenant_id": "r2039rsovbobsaboeeubocacce",
    "project_name": "tenant"
  }
}

4.6 Negotiator Module

The negotiator module accepts the user arguments from the message broker, which received the request from the user via the REST service, and then provisions the requested infrastructure according to the arguments. To start, the module reads the provider argument, which was mentioned in Section 4.3. By looking at the provider argument, the module can find the implementation corresponding to the provider, and the provisioning can begin.

4.6.1 Resource Availability

The first part of provisioning is determining beforehand whether the provisioning is possible with regard to the available resources. What available resources means depends on the provider and the implementation. For cost based providers, which have virtually unlimited computation, the module can check whether there is enough balance to provision the resources, while for non cost based providers the general determining factor is the amount of computing that is available. By pre-determining the available resources, the negotiator can detect whether the process would be stopped later due to an insufficient resources error. Using the provider value, the negotiator finds the file that corresponds to the provider. The file must contain the function check_resources(resources), which determines whether the resources are available; this is similar to how interfaces are built in object oriented design. Each provider file must implement the check_resources(resources) function. As an example, if the user requests resources and uses OpenStack as the provider value, then the module will look for the file called OpenStack. This step is skipped if there is no implementation of the check_resources(resources) function or if the file does not exist.

… if there are existing resources available.
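A sketch of the provider lookup described above could look like the following; the package layout and module names are assumptions for illustration.

import importlib

def check_resources(provider, resources):
    # Look for a provider module (e.g. providers/openstack.py) and call its
    # check_resources(resources) function if it exists; otherwise skip the check.
    try:
        module = importlib.import_module('providers.' + provider.lower())
    except ImportError:
        return True
    if not hasattr(module, 'check_resources'):
        return True
    return module.check_resources(resources)

# providers/openstack.py would then implement, for example, a check that queries
# the OpenStack API and compares the available quota against the request.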

4.6.2 Terraform Configuration Generation

The second step of the provisioning process is to generate a Terraform configuration file that represents the infrastructure the user requested. Similar to the first step, where resources are checked, the module looks for the folder and file that share the provider's name and calls a function in that file. The requirement for the implementation is that the file must be placed in a folder with the same name as the provider and must have a function called orchestrate_resources(request) that accepts the request as a parameter and returns a valid Terraform JSON configuration. The Terraform configuration is generated programmatically depending on the user's request; different infrastructure implementations use the user input differently.

One of the core strengths of the module is its use of metadata bound to certain data blobs, that is, the data awareness function. The metadata is collected from the name of the data that the user has requested, in the cases where the user passes the name of the data. Using the potential metadata and the user data, a Python dictionary that corresponds to a valid Terraform JSON configuration is generated and returned to the module, which later converts it into a Terraform JSON file. The Terraform configuration implementations are described in Section 4.8.
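As a sketch of what a provider implementation of orchestrate_resources(request) could return, the dictionary below mirrors the Terraform JSON shape of Listing 2, with metadata values stubbed in from Listing 4; the exact keys used by the service are assumptions.

def orchestrate_resources(request):
    # Stubbed metadata; the real module fetches this from the metadata store.
    meta = {
        'auth_url': 'https://cloud.se:5000/v3',
        'user_domain_name': 'cloud',
        'tenant_id': 'r2039rsovbobsaboeeubocacce',
        'image_name': 'Ubuntu',
    }
    # A Python dictionary that corresponds to a valid Terraform JSON configuration.
    return {
        'provider': {'openstack': {
            'auth_url': meta['auth_url'],
            'tenant_id': meta['tenant_id'],
            'user_domain_name': meta['user_domain_name'],
        }},
        'resource': {'openstack_compute_instance_v2': {'example': {
            'name': request.get('name', 'single-instance'),
            'image_name': meta['image_name'],
            'flavor_name': request['flavor'],
            'key_pair': request['key_pair'],
            'security_groups': ['default'],
        }}},
    }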

4.6.3 Executing Terraform Scripts
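The generated configuration is executed with the Terraform binary, as described in Section 4.2. A minimal sketch of how the negotiator could invoke Terraform on the generated JSON configuration, with the working directory layout and command flags as assumptions, is:

import json
import os
import subprocess

def execute_terraform(config_dict, workdir):
    # Write the generated configuration as Terraform JSON (see Section 4.6.2) ...
    with open(os.path.join(workdir, 'main.tf.json'), 'w') as handle:
        json.dump(config_dict, handle, indent=2)
    # ... then download the provider plugins and apply without prompting.
    subprocess.check_call(['terraform', 'init', '-input=false'], cwd=workdir)
    subprocess.check_call(['terraform', 'apply', '-input=false', '-auto-approve'],
                          cwd=workdir)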


4.7 Tracing

Considering that many kinds of errors can occur in the provisioning process, everything from name errors to network errors, debugging and finding errors is a time consuming process. Integrating a tracing system can assist in locating the errors in the process. This work implements a tracer to trace the provisioning process from top to bottom. Each unique request is traced, starting from the user request until the process is complete in the cloud. Tracing through the REST server and the negotiator is the same for all requests; however, the tracing is implemented differently for each type of orchestration. There are different methods to track the process; this project uses OpenTracing in combination with Jaeger, using the Jaeger bindings for Python, to trace the whole process and present it in a web interface.

OpenTracing is an open source tracing API used to trace distributed systems, and Jaeger provides a Python client library implementing the API. A trace describes the flow of a process as a whole; it propagates through a system and creates a tree-like graph of spans, each of which represents a segment of processing or work. Using a tracing framework, one can then trace the error prone or time consuming processes. The trace is implemented so as to create spans for each part of the system.

Listing 5 shows the initialization of a tracer. A tracer object is created from the jaeger_client Python library, which is Jaeger's Python implementation. The important parameter to look at is the reporting host, which is the address of the machine hosting the Jaeger server; the tracer object forwards the traces to that server.

Listing 5:
from jaeger_client import Config

def init_tracer(service):
    config = Config(
        config={
            'local_agent': {'reporting_host': 'jaeger-host.example.org'},
            'logging': True,
            'reporter_batch_size': 1,
        },
        service_name=service,
    )
    return config.new_tracer()

tracer = init_tracer('Trace')

To trace a distributed system, each part of the system must be bound to the same trace. A span object, carried as key-value pairs, is sent through the process starting from where the trace is created. The trace begins when the REST server accepts the user request. The span object is forwarded through the system: when the REST server reaches the RabbitMQ sender, the span object is forwarded in the header properties of the message sent to the receiver. The span continues after the message is received, until Terraform provisions the infrastructure. The span object is sent to the requested infrastructure by writing its value to a Terraform variable; the negotiator can then use this Terraform variable to pass the span over to the infrastructure and continue the trace there. By running Python scripts inside the machines in the infrastructure, the trace is continued.

4.8 Infrastructure Implementations

Four different configurations are implemented. However, as mentioned in previous sections, it is possible to create more implementations as long as the implementation rules are followed and the requested configuration is supported by Terraform. The configurations implemented in this work are the following, each described in the next sections:

• A general Spark Standalone cluster
• A HASTE specific HarmonicIO cluster
• A HASTE specific application to load microscopy images
• A configuration to run a container application

A technique common to the Spark cluster and HarmonicIO is the use of Docker containers. By tying the applications inside Docker containers, the difficulty of deploying the applications on different operating systems is solved. The only limitation for Docker containers is that the machine must be able to run Docker, and most common Linux distributions can. This allows ease of deployment across different operating systems and operating system versions; the same cluster can, for example, be run on Ubuntu and CentOS. Using Docker containers in combination with Docker Compose, deployment is eased down to configuring the compose file that runs the container. After Terraform is complete, the negotiator sends an email to the user to notify them of completion; the email address is given in the request to the service.

Figure 4: Example figure of the orchestration of a compute cluster.

Figure 4 shows a typical example of how a compute cluster is created using Docker containers. The system accesses the machines in the cloud and starts the orchestration by communicating with the machines, which then use Docker images from a remote Docker repository to download the containers that contain the programs for deploying distributed applications. The Spark standalone cluster, the HarmonicIO cluster and the data image loading application (with only one machine) use the same method.

The Terraform techniques used to execute scripts inside the virtual machines are provisioner blocks, which include methods to upload files and to execute commands through ssh. Additionally, the data block is used to render run-time variables into script files.

4.8.1 Spark Standalone Cluster

The main idea behind a Spark cluster is to run functions on a distributed compute cluster, meaning a cluster that spans several machines to increase the processing power. A Spark Standalone cluster (https://spark.apache.org/docs/latest/spark-standalone.html) is a Spark cluster that does not use additional tools such as YARN or Mesos. The required parameters for this configuration are the worker count, which is the number of Spark workers the cluster runs; the data, which is used to determine in which region the cluster is to be placed; the public key of the user, to later access the machines; and lastly, the name of the flavor to be used for the virtual machines.

To provide a Spark cluster on the infrastructure, Docker containers were used. By tying the Spark application inside a Docker container, the difficulty of deploying the Spark cluster is reduced. To fetch and start the containers, two separate Docker Compose files were used: one to start the Spark master and the other to start the Spark workers.

The Terraform configuration file for the Spark cluster is pre-written, meaning that the configuration representing the Spark infrastructure is already in place except for some variables that adjust the cluster to the user's request. To configure the cluster according to the user's request, these variables in the pre-written configuration file are set through variable interpolation.

The first step of creating the Spark cluster is to spawn the master, giving it a floating IP for outside access, and to spawn the requested number of worker machines; this number is interpolated through a variable set by the user request. After the machines are spawned, the scripts used in the machines are uploaded to the master through Terraform's file provisioner. A snippet of how the scripts are uploaded can be seen in Listing 6: all files in the scripts folder are uploaded to the host machine given in the connection block, using a private key to ssh to the machine. The master machine is given multiple scripts: one bash script to run its commands, the two previously mentioned Docker Compose files for master and slave, and one script that starts the worker machines.

Listing 6:
provisioner "file" {
  connection {
    host        = "${openstack_networking_floatingip_v2.floating_ip.address}"
    type        = "ssh"
    user        = "ubuntu"
    private_key = "${file("${var.ssh_key_file}")}"
  }
  source      = "./scripts/"
  destination = "scripts"
}

After the master machine is instantiated, it executes one of the bash scripts to download Docker and Docker Compose and starts the Spark master container using the Spark master Docker Compose file. It then transfers the worker Docker Compose file and the worker script to the worker machines using the scp command. Finally, it executes the worker script inside the worker machines through the ssh command to start the Spark worker containers. The worker script downloads Docker and Docker Compose like the master, but uses the worker Docker Compose file to start the Spark worker, which connects to the master to form a cluster.

… the machines, which in turn communicate with a remote Docker repository to form a cluster.

Listing 7:
data "template_file" "master_template" {
  template = "${file("scripts/master_script.sh")}"
  vars {
    slave_adresses = "${join(",", openstack_compute_instance_v2.slave.*.access_ip_v4)}"
  }
}

Listing 8:
#master_script.sh
slaves=${slave_adresses}

4.8.2 HarmonicIO Cluster

HarmonicIO [18] is a streaming framework developed within the HASTE project. In summary, HarmonicIO is a peer-to-peer distributed processing framework. Its purpose is to let users stream any data to HarmonicIO, have the data processed directly in HarmonicIO worker nodes and then store the processed data in data repositories, which also lets users preview the data before the process is complete. A HarmonicIO cluster operates with a master-worker architecture similar to a Spark cluster: there is an individual master machine and one or multiple worker machines that handle the data processing. Manual orchestration of a HarmonicIO cluster is similar to orchestrating a Spark cluster. The important steps are the following, starting with the master node:

1. Instantiate a master node machine.
2. Download the HarmonicIO remote repository.
3. Run the bash script to install dependencies.
4. Set the configuration file.
5. Run the script to start the master node.

Then, for each worker:

1. Instantiate a worker machine.
2. Download the HarmonicIO remote repository.
3. Install Docker.
4. Change the master address and the internal address in the configuration file.
5. Finally, run the worker script to start the worker node.

The implementation that deploys a HarmonicIO cluster is similar to the Spark cluster. The user parameters for the HarmonicIO cluster are the number of workers, the flavor of the instances, the region in which to place the cluster and, lastly, the public key of the user to later gain access to the master node.

The Terraform configuration file is again pre-written to fit a HarmonicIO cluster, with dynamic variables that are input from the user. The configuration is similar to the Spark cluster's: a resource block is used for the master and another for the workers, and a count variable in the worker block determines how many workers are to be deployed. Listings 9 and 10 show the configuration for the nodes. The master node is a single machine, while the number of workers is determined by the count variable. Other variables are shown as well: the flavor, image and key pair for the instances.

Listing 9:
resource "openstack_compute_instance_v2" "master" {
  name            = "HIO Master"
  image_name      = "${var.image_name}"
  flavor_name     = "${var.flavor_name}"
  key_pair        = "${var.key_pair_name}"
  security_groups = ["default", "Tony"]
  network {
    name = "${var.network_name}"
  }
}

Listing 10:
resource "openstack_compute_instance_v2" "worker" {
  count           = "${var.instance_count}"
  name            = "${format("HIO-Worker-%02d", count.index + 1)}"
  image_name      = "${var.image_name}"
  flavor_name     = "${var.flavor_name}"
  key_pair        = "${var.key_pair_name}"
  security_groups = ["default"]
  network {
    name = "${var.network_name}"
  }
}

Once the machines are created, the next step is to connect them into a HarmonicIO cluster. Python scripts are provided, one for the master and another for the worker machines. The master Python script follows steps (2) and onwards in Section 4.8.2: it downloads HarmonicIO from a remote repository, installs the dependencies, sets the configuration file and executes the script that starts the master. It then uses the scp command to transfer the Python script and a worker bash script to the workers, one at a time. The bash script for the worker installs the Python dependencies, and the Python script runs steps (2) and onwards. Both the master and worker Python scripts initiate a Jaeger tracer object which continues the trace, and each step is wrapped inside a trace span to inform the trace server and the user that the steps are running. To ensure that the trace is a continuation of the request trace from the negotiator, the Terraform variable that contains the span value is interpolated into the Python scripts.
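A sketch of how such a script could pick up the span passed in through a Terraform variable is shown below; the carrier format and the variable name span_context are assumptions, since the exact serialization is not shown here.

from jaeger_client import Config
from opentracing.propagation import Format

def init_tracer(service):
    # Same style of initialization as in Listing 5 (abbreviated).
    config = Config(config={'logging': True}, service_name=service)
    return config.new_tracer()

def continue_trace(tracer, injected_context):
    # Rebuild a carrier from the serialized span context and extract the parent,
    # so spans created here become children of the negotiator's request trace.
    key, _, value = injected_context.partition('=')
    parent = tracer.extract(Format.TEXT_MAP, {key: value})
    return tracer.start_span('worker setup', child_of=parent)

# In the rendered worker script the context arrives through a Terraform template
# variable, e.g.:  span = continue_trace(init_tracer('HIO-worker'), '${span_context}')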

4.8.3 Loading Microscopy Images

This configuration implements the deployment and execution of a HASTE specific program that loads a certain set of microscopy images from a larger set of images.

… using a boolean value.

Since this infrastructure is not a cluster but rather a single machine, the Terraform configuration becomes simpler. The pre-written Terraform configuration file has one resource block that creates a single machine. A Python script used for downloading dependencies, downloading object store files, tracing and running a container is provided, along with a Docker Compose file used to download and run the container.

After the machine is created, the Python script is uploaded and installs its own dependencies. The script then runs in the following way. It starts a tracer with a span continuing from the negotiator, and every following step is wrapped in a span to create a trace around each step. The script downloads Docker and Docker Compose, which are used for the container, and downloads the object store files from the given object store container. It then starts the container image using Docker Compose, with the object store files mounted as container volumes.

The application inside the Docker container is written to read files from the mounted directory, execute its program and then store the result in a result file. The host machine can then take the result file and upload it to the object store. After this step, the host machine has finished executing, and the negotiator checks whether the user requested that the VM should be terminated when finished. If so, the VM is destroyed.

4.8.4 Single Container Application

Lastly, there is a configuration that deploys a given container application from Docker Hub on OpenStack. The purpose of this configuration is to provide a general configuration for deploying containers, and it is similar to the previous one. The parameters are the URL to the Docker container on Docker Hub, the commands that the user wants to execute, the region, the return address and the public key. This configuration creates a single machine in the requested region, using the metadata store to fetch the region data. The machine runs a Python script to trace and install the required dependencies, downloads Docker and then executes all the requested commands.


4.9 Simple Web User Interface

To improve the user experience, a graphical user interface was developed. The web interface is built using the React library as a simple single page application and is essentially a proof of concept. The interface contains three steps to simplify the process for the user. Figure 5 shows three images, one for each step: the first lets the user choose to delete or create new infrastructure, in the second the user chooses the configuration, and lastly the input parameters are given for the configuration before pressing create, which sends a REST request to start the process.

1. Select whether to create new infrastructure or delete existing infrastructure.
2. Select which configuration to request.
3. Fill in the parameters for the chosen configuration and create the infrastructure.
4. The trace URL is returned to the user, giving access to the trace of the requested infrastructure.

5 Results

The result of this work is a service that allows users, with a few clicks or a simple REST request, to create infrastructure for scientific computing in the cloud. The benefit of the service is the ability to, in principle, create any type of infrastructure on any provider that is supported by Terraform; that is, it is possible to further extend the system with more infrastructure provisioning configurations than the four mentioned in the method section. This work implements four different options for infrastructure using the OpenStack provider: an option to create a Spark cluster, an option to create a cluster for the HASTE specific HarmonicIO application, an option to run a single container and, lastly, an option to run a green channel application used for image processing on a single machine, with extra features such as automatic execution, storing of the result files and automatic tear-down of the machine.


Figure 5: The three steps of the web user interface, shown in sub-figures (a), (b) and (c).


Figure 6: Example trace of the Spark Standalone configuration

5.1 Spark Standalone Cluster

A request containing one worker with a trivial data set in a region inside SNIC Science Cloud was sent to the service together with the user's public key, a trivial flavor and the user's email address. The trace id and the URL to the web interface are returned immediately after the request, and the service starts orchestrating the cluster through Terraform using the pre-configured scripts. The result is three created machines, where one is the master with a floating IP attached and two are worker machines. The trace in Figure 6 shows how the master is created, which in turn starts a worker.

5.2 HarmonicIO Cluster

A HarmonicIO cluster was requested with two workers. The request includes the worker count, the data, the flavor, the public key and the user's email address. After the request is sent, the trace id and the URL to the web interface containing the trace are returned. The REST server accepts the request and starts orchestration of a cluster with three machines, where one machine is the master with a floating IP attached and the other two are worker machines.



Figure 7: Spans of the HarmonicIO trace.

Sub-figure 7(a) shows the starting point of the trace. The negotiator gets the request and executes the Terraform configuration to start the orchestration. The master machine accepts the continuation of the trace and starts its own scripts to start the HarmonicIO master. Sub-figure 7(b) shows how the workers are started. The first worker receives the continuation of the trace and uses its script to start the HarmonicIO worker process. When the first worker has finished, the second worker goes through the same process and the orchestration is complete.

5.3 Image Loader


Figure 8: Trace including spans and time of the image loader configuration.

The negotiator creates the VM in the UPPMAX region, because it understands from the metadata that the process should be executed in UPPMAX. The result of this execution is a set of images loaded into the same container and a notification to the email address explaining that the process is finished.

A full trace of the whole process is available in the Jaeger client interface, which can be accessed with the trace id. The trace can be seen in Figure 8. It shows that the REST server receives the request and pushes it to the receiver of the message broker. The negotiator then handles the request and begins creating the infrastructure. The generated Terraform configuration is executed to provision the infrastructure, and the trace is passed on to the VM, where the execution of the script can be seen. The dependencies and container objects are downloaded and the Docker container is run. Finally, a notification is sent to the user by email.

5.4 Running a Trivial Container


Figure 9: Spans of the container application.

The trace shows that the machine is provisioned, the requested container is started and some commands are executed.

6 Discussion & Evaluation

This section evaluates the service developed in this work. Comparisons are made against other software that uses similar methods, and the selling points of this work, the tracing and the data aware function, are evaluated. Some of the system's drawbacks and weaknesses are also reviewed.

6.1 Comparison Against Other Methods


6.1.1 SparkNow

SparkNow, as mentioned in the related work section, is an open source project used to deploy a Spark cluster on OpenStack. The summarized workflow for deploying a Spark cluster with SparkNow is to download the repository, export a set of environment variables that come from OpenStack metadata, use the source Linux command on the OpenStack RC file to set additional environment variables, use Packer (an image building tool) to build an image, configure additional metadata and variables for the cluster architecture, and finally orchestrate a cluster using Terraform.

SparkNow is perhaps more difficult to deploy for the average user. It requires a considerable amount of OpenStack knowledge to set the environment variables, as well as knowing exactly which variables should be used and where. The user also has to install multiple binaries, including Terraform, Packer and Git. Some Linux familiarity is required to deploy with SparkNow as well.

The work of this thesis also provides a Spark cluster, but with additional features and the ability to skip most of the required deployment steps. The main differences are the data aware function, which lets the user provide the metadata or have it provided by someone else, and the tracing mechanism. In addition, zero installations are required, because this work provides a REST service accessible from the web or from the command line. However, SparkNow provides many options for configuring the Spark cluster differently; depending on the user's needs, there are more ways to configure the cluster with SparkNow. The configuration that this work provides is limited to the worker count and flavor.

6.1.2 KubeSpray

KubeSpray is also an open source project that similarly uses Terraform configurations to provision a Kubernetes cluster. It supports different deployment methods, allowing for more options: one uses Terraform and another uses Ansible. Using Terraform, it is possible to deploy on both AWS and OpenStack.


With KubeSpray, installing the software is only required once, and providing the metadata is also only required once. Just like SparkNow, KubeSpray offers much more configuration than this work.

6.1.3 Manual Provisioning

There are multiple ways to manually deploy any type of cluster involving multiple machines, or to run computations inside a single VM. However, the manual process is laborious compared to the multitude of solutions that have been developed so far. Compared to this work and most other works, the manual process requires much more extensive knowledge of the deployment process. Not only is it required to know how to deploy a Spark cluster, that is, installing dependencies and the required software on both master and worker machines, but also how to use the cloud provider, which could be OpenStack, AWS, Google App Engine or any other provider. Deploying the HarmonicIO or Spark cluster manually is no easy feat either. A few HASTE members know how to deploy it; otherwise there are manual instructions. Consider a new member who does not know how to deploy HarmonicIO: if they had to do it manually, it would be difficult and perhaps troublesome for other HASTE members. The work of this thesis could then reduce the workload of the members of HASTE.

6.2 Future Development Complexity

The main selling point of this service is the potential of being cloud agnostic. It is already theoretically possible for the service to provide infrastructure for different providers, as long as Terraform supports the provider. However, to achieve this potential, the service needs to be further developed by adding more configurations than the four that have been mentioned before, and also by adding the same configurations for different providers. Adding more configurations is not necessarily easy. As of now, the minimum requirement to add a new configuration is to add a folder with a file that returns a valid Terraform configuration, or to add a folder that has a Terraform configuration ready. However, to develop a new configuration that is useful, the developer needs enough knowledge about Terraform and the Terraform language to create new configurations.

The diagram in Figure 10 describes the current configurations and how to add a new one. The requirement for adding a new configuration is to add a new class that implements the negotiator interface, which has the function orchestrate resources that returns a Python dictionary. Since Python does not explicitly have interfaces, the system is programmed in a way that simulates interfaces.


Figure 10: New configurations are created by implementing the interface.

The difficulty is that the returned dictionary has to be a valid Terraform configuration, equivalent to a Terraform configuration in JSON format. Programming the Terraform configuration is not necessarily easy, and the programmer must have sufficient knowledge about the provider and Terraform to create a configuration. However, a simple Terraform configuration is most often not enough to create a full infrastructure. It is often important to provide exterior scripts along with the Terraform configuration to execute commands or install dependencies inside the machines of the infrastructure, most importantly a Python script to support the tracing mechanism during the orchestration process. In addition, running the Python script with Jaeger tracing implemented requires the machines to have the corresponding dependencies installed.
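To make the extension point concrete, the sketch below shows what such a configuration class could look like, assuming the interface method is named orchestrate_resources and returns a dictionary in Terraform's JSON syntax. The class name, the request fields and the chosen resource attributes are illustrative assumptions, not the exact implementation of this work.

```python
# Sketch of plugging a new configuration into the negotiator. The returned dict
# must be equivalent to a Terraform configuration in JSON format;
# openstack_compute_instance_v2 is the OpenStack provider's compute resource.


class NegotiatorInterface(object):
    """Simulated interface: every configuration must provide this method."""

    def orchestrate_resources(self, request):
        raise NotImplementedError


class SingleVmConfiguration(NegotiatorInterface):
    """Provisions one OpenStack instance for the requested flavor and key pair."""

    def orchestrate_resources(self, request):
        return {
            "resource": {
                "openstack_compute_instance_v2": {
                    "worker": {
                        "name": "haste-worker",
                        "image_name": request["image"],
                        "flavor_name": request["flavor"],
                        "key_pair": request["key_pair"],
                        "network": {"name": request["network"]},
                    }
                }
            }
        }
```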

6.3 Tracing


Because each provisioning configuration has a different type of implementation and different scripts, each script needs to include tracing in its own way. However, it is possible to skip the tracing part for future configurations. Adding a trace that is actually useful is time consuming as well. It is interesting to discuss the usefulness of the trace to different types of users. To the scientist, the trace might be incomprehensible and essentially useless. On the other hand, someone who understands the provisioning process well could certainly use the trace to understand eventual issues during the process.

6.4 Data Aware Function

The data aware function reduces the metadata required from the user. The issue with the data aware function is that the required metadata has to be pre-provided. This does not defeat the whole purpose of the function, but there has to be someone, at some point in time, who adds the required metadata. Adding the required metadata also requires knowledge about the negotiator and the configuration implementation, in particular how the metadata is to be stored. Any change to the implementation of a configuration may require a change in the metadata as well, which could cause problems where a change in the configuration breaks the service. This means that the users must rely on someone, or themselves, to provide the metadata; the service works well in the scenario where someone provides the metadata for the user, but otherwise the point of the data aware function is lost.
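To illustrate the kind of pre-provided metadata the data aware function depends on, the sketch below maps an object store container to the region where the data resides, so that the negotiator can place compute next to the data. The field names and the lookup are assumptions chosen for illustration, not the actual metadata store of this work.

```python
# Hypothetical metadata entry and lookup used by the data aware function.
dataset_metadata = {
    "container": "microscopy-images-2019",  # object store container holding the data
    "region": "UPPMAX",                     # SNIC Science Cloud region hosting the data
    "object_store_url": "https://swift.example.org/v1/AUTH_project",
}


def resolve_region(metadata, requested_container):
    """Return the region for the requested container, so compute is placed near the data."""
    if metadata["container"] == requested_container:
        return metadata["region"]
    return None
```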

6.5 Security Issues


6.6 Limitations of This Service

The main limitations of this service lie in the implementations of the four different configurations. For each configuration, the user is locked into a configuration that uses a certain set of parameters with a specific infrastructure layout. It is important to note that the main change the user can make in the Spark and HarmonicIO clusters is the number of workers. If the user requires other functionality or changes to the cluster, it is not possible unless they do manual configuration after the service has completed the initial creation, for example if the user wants to use another Spark version, another Spark configuration, or include external tools such as YARN or Mesos. This limitation exists for the image uploader as well; it only has one purpose, which is to run the Docker container.

This can be solved by changing the configurations to allow for more parameters. However, with such changes, the complexity of the configuration grows because more code has to be added. For each new parameter, more code has to be added in the form of Terraform configuration and, depending on the change, the Python code may also change, which means that more tracing code has to be added as well.

Because the foundation of this work is built on Terraform, the complete service is limited by Terraform. Terraform is, however, an open source tool that is continuously updated. There is of course the risk that if Terraform becomes obsolete, this work cannot progress any further unless it is extended to implement additional tools. It is not easily possible for someone to extend the software outside Terraform; adding configurations outside Terraform's scope would be increasingly difficult.


7 Future Work

Because this work focused on OpenStack deployment, it would be interesting to implement the clusters for different providers. Even though it is theoretically possible to run on, for example, Amazon, because Terraform supports the Amazon provider, actually implementing it and seeing it work is desirable. This would mean that a user could deploy a single cluster configuration on two different cloud providers, have data on different cloud providers, or eventually create a cluster on the most appropriate provider depending on other parameters such as cost or availability.

To further add more configurations and parameter options to each configuration, it is important to keep the design simple. One of the main points of discussion is how difficult it is to further develop the system. From a design perspective, it is possible to treat this work like any other software system and apply more design principles, such as interfaces, to add to the system's longevity.

There are other tools and frameworks besides Terraform that this thesis has not explored. Another future implementation would be to write an abstraction layer over multiple different infrastructure provisioning tools, for example to combine the capabilities of Terraform and Ansible under one layer so that the user can interact with both tools instead of just one, or to use the previously mentioned SparkNow and KubeNow in combination with this system to create Spark and Kubernetes clusters.

The previously mentioned limitation of requiring the machines to be able to run a Python 2.x version could be solved by having Docker containers that include all the required dependencies, such as the Python libraries and Docker itself. However, this comes with development complexity: the initiation of the machine would require a Docker container, and the application inside that container would itself have to start Docker containers. For example, a Spark cluster must run a Docker container which itself starts the Docker master container.

8 Conclusion

References
