
Dynamic Configuration of a Relocatable Driver and Code Generator for Continuous Deep Analytics

OSCAR BJUHR

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ARCHITECTURE AND THE BUILT ENVIRONMENT


Master's Programme, Software Engineering of Distributed Systems
Date: June 24, 2018

Supervisor: Paris Carbone, Lars Kroll
Examiner: Christian Schulte

Swedish title: Dynamisk Konfigurering av en Omlokaliseringsbar Driver och Kod Genererare för Continuous Deep Analytics
School of Electrical Engineering and Computer Science



Abstract

Modern stream processing engines usually use the Java virtual machine (JVM) as execution platform. The JVM increases portability and safety of applications at the cost of not fully utilising the performance of the physical machines. Being able to use hardware accelerators such as GPUs for computationally heavy analysis of data streams is also restricted when using the JVM. The project Continuous Deep Analytics (CDA) explores the possibility of a stream processor executing native code directly on the underlying hardware using Rust.

Rust is a young programming language which can statically guarantee the absence of memory errors and data races in programs without incurring performance penalties during runtime. Rust is built on top of LLVM, which gives Rust a theoretical possibility to compile to a large set of target platforms. Each specific target platform does however require a specifically configured runtime environment for Rust's compiler to work properly.

The CDA compiler will run in a distributed setting where the compiler has to be able to reallocate to different nodes to handle node failures. Setting up a reassignable Rust compiler in such a setting can be error prone, and Docker is explored as a solution to this problem.

A concurrent, thread-based system is implemented in Scala for building Docker images and compiling Rust in containers.

Docker shows potential for enabling easy reallocation of the driver without manual configuration. Docker has no major effect on Rust's compile time. The large Docker images required to compile Rust are a drawback of the solution: they require substantial network traffic to reallocate the driver. Reducing the size of the images would therefore make the solution more responsive.

Keywords: Stream Processing, Heterogeneous Cluster, Big Data, Rust, Cargo, Docker


Sammanfattning

Modern stream processors usually use the Java virtual machine (JVM) as their execution platform. This makes the stream processors portable and safe, but limits how well they can use the capacity of the underlying physical machine. Being able to use hardware accelerators, such as GPUs, for heavy computation and analysis of data streams is one reason why the project Continuous Deep Analytics (CDA) explores the possibility of instead executing a stream processor directly on the underlying machine.

Rust is a young programming language which can statically guarantee that programs contain neither memory errors nor race conditions, without negatively affecting performance during execution. Rust is built on LLVM, which gives Rust a theoretical possibility to compile to a large set of machine architectures. Each specific machine architecture does however require the compilation environment to be configured in a specific way.

CDA's compiler will reside in a distributed system where the compiler can be moved to different machines in order to handle machine failures. Dynamically configuring the compiler in such an environment can lead to problems, and Docker is therefore tested as a solution to the problem.

A thread-based system for parallel execution is implemented in Scala for building Docker images and compiling Rust in containers.

Docker shows potential for enabling easy reallocation of the driver without manual configuration. Docker has no major effect on Rust's compile time. The large sizes of the Docker images required to compile Rust are a drawback of the solution. They mean that reallocating the driver requires substantial network traffic and can therefore take a long time. To make the solution more responsive, the size of the images can be reduced.

Keywords: Stream Processing, Heterogeneous Cluster, Big Data, Rust, Cargo, Docker


Contents

1 Introduction
1.1 Background
1.1.1 Generating Native Code
1.1.2 Continuous Deep Analytics
1.2 Problem
1.3 Purpose
1.4 Goal
1.4.1 Benefits, Ethics and Sustainability
1.5 Related Work
1.5.1 Big Data Analytics and Native Code Generation
1.5.2 Stream Processing
1.6 Methodology
1.7 Delimitations
1.8 Outline
2 Theoretic Background
2.1 Flink Architecture and Fault Tolerance
2.1.1 CDA and Fault Tolerance
2.2 Rust the Programming Language
2.2.1 Unsafe Rust
2.2.2 Compiling Rust with Cargo
2.3 LLVM
2.4 Cross Platform Compiling
2.5 Rust Linking and Binaries
2.5.1 Size of Rust Binaries
2.6 Dynamic Resource Allocation in a Cluster
2.6.1 Node Configuration
2.7 Docker Containers
2.7.1 Building Docker Images
2.7.2 Performance of Containers
2.7.3 Cross Compiling Rust with Docker
3 Design
3.1 Setup of the Generated Rust Crate
3.2 The Cross Compiler
4 Implementation
4.1 Docker Images for Cross Compiling
4.2 Execution of the Cross Compiler
5 Evaluation and Discussion
5.1 Docker's Impact on Cargo's Compile Time
5.2 Docker Image Sizes
5.3 Future Work
6 Conclusion
6.1 Acknowledgement
Bibliography
A Docker Image Sizes


List of Figures

1.1 Overview of the Continuous Deep Analytics system
2.1 Rust compiler phases
3.1 Overview of the Rust code generator's distinct phases
3.2 Overview of the Rust cross compiler
5.1 Time to fetch remote dependencies; the y axis is fetch time in seconds; the blue box plot represents a Docker container with a virtual network adapter, the orange box plot a Docker container using the host's network adapter directly, and the grey box plot fetching directly on the host platform
5.2 Compile times when compiling ripgrep; the y axis represents compile time in seconds
5.3 Compile times when compiling kompics; the y axis represents compile time in seconds
5.4 Compile times when compiling kompics without the extreme point; the y axis represents compile time in seconds
5.5 Compile times when compiling the generated task; the y axis represents compile time in seconds
5.6 Diagram of Docker image sizes in gigabytes using the slim or regular Rust image as base image
5.7 Collective size of Docker images with the regular rust image as base image
5.8 Collective size of Docker images with the rust:slim image as base image
A.1 Sizes of Docker images with the regular Rust image as base image


List of Tables

5.1 Execution time in seconds for copying ripgrep into a container

Listings

2.1 Example of a crate's Cargo.toml
2.2 An example of the usage of RUN in a Dockerfile
4.1 A Dockerfile for building a cross compiler image with armv7-unknown-linux-gnueabihf as the target platform
4.2 Snippet of the Scala implementation for setting up a container and executing the cross compiler


Chapter 1

Introduction

The amount of data produced at any given moment is huge, and it is growing every year. The data contains everything from sensor measurements to consumer transactions [46]. Being able to analyse, gain knowledge from and make predictions based on the available data benefits both commercial and societal well-being. Therefore, a growing number of organisations have adopted big data analytics into their business model [20].

Big data analytics requires high performance execution platforms [33]. This is commonly achieved through scale-out performance. Instead of executing on a single supercomputer, which would be called scale-up, the application runs on a cluster of distributed machines composed of low-cost commodity hardware [66]. The benefit of using a cluster as execution platform is performance scalability [22]. However, coordination in such a parallel distributed setting is non-trivial and error prone. Many developers therefore turn to an abstract high-level framework for distributed parallel programming where the distribution is implemented implicitly by the framework [8] [15].

Using such frameworks makes it easier for developers with limited experience of distributed parallel programming to implement applications which execute on a cluster [22]. Most of the commonly used frameworks for distributed parallel computing are built on top of the Java Virtual Machine (JVM) [8] [15]. Applications developed with such frameworks are compiled to JVM bytecode before execution.

Bytecode is code executable by a virtual machine (VM), whilst native code consists of instructions which are executable directly by the underlying physical hardware. Running bytecode on the JVM has the benefit of being portable in comparison to programming languages compiled to native code. Native code has to be compiled ahead of time for all specific hardware platforms on which the application should be executed, whilst the bytecode only has to be compiled for execution on the JVM [4]. Native code does however execute faster than bytecode. The JVM is also hampered by some of its core design decisions, one of them being the garbage collector. These design decisions provide great portability and safety guarantees but make the JVM hard to migrate to and execute on accelerators such as GPUs. Thus far, the JVM is not able to analyse applications and offload execution of arbitrary parts such as loops to GPUs. The JVM can only use GPUs through explicitly implemented parallelisable code, e.g. applications compiled to the GPU's native code and executed through the Java Native Interface (JNI). Doing this does however break the JVM's portability [34]. Using the JNI can also cause performance penalties [35].

Using the JVM as the execution platform for a distributed parallel computing engine therefore simplifies its development and deployment. The downside is that it restricts the utilisation of a cluster, even more so if it is executed on a heterogeneous cluster. A heterogeneous cluster can contain different types of hardware accelerators, such as GPUs and FPGAs combined with regular CPUs. The reason for using different accelerators is their different capabilities. GPUs are commonly used in e.g. machine learning for their superior parallel computing power [55]. FPGAs have the benefit of being more energy efficient compared to GPUs [10]. To fully utilise the potential of a heterogeneous cluster, distributed parallel computing applications must be compiled to native code for the underlying hardware of each individual node [40].

1.1 Background

The rise of distributed heterogeneous execution platforms brings the need for abstract metaprogramming languages [68].

Metaprogramming languages used to solve problems in a specific domain, called domain specific languages (DSLs), enable experts with limited programming experience to develop and understand software applications. Thus, DSLs help bridge the gap between problem domains [39]; e.g. MATLAB is a DSL which makes it possible for scientists to easily build interactive numerical applications [13]. DSLs achieve this by supplying the developer with abstract concepts relevant to the problem domain [39]. The domain specific concepts are in turn interpreted into a more concrete low-level implementation and finally compiled to executable code. A DSL program can therefore also be viewed as a program generator [68].

Flink [15] can be viewed as a DSL for building stream processing pipelines and executing them on distributed clusters. A Flink pipeline will be compiled and scheduled once and then executed for a long unspecified period of time. Flink therefore has to be adaptive and fault tolerant during runtime.

Flink's system architecture is composed of three actors: the client, the job manager and the task managers. The client is not part of the cluster; it is a front-end used for specifying the pipeline. The client will compile and optimise the logical pipeline before sending it to the job manager. The job manager is responsible for coordination, physical deployment and fault tolerance. It can be viewed as the master of the cluster. The task managers are workers, each using the JVM to execute assigned tasks. The system is very resilient and can reconfigure the pipeline during runtime, either due to task manager failures or if new task managers join the cluster. Even job manager failures are handled, by reallocating the job manager to a new machine in the cluster [16].

Since Flink uses the JVM as execution engine, the job manager does not have to compile tasks for different target platforms. All tasks are ready to be deployed to and executed by any task manager in the cluster [15].

1.1.1 Generating Native Code

Transforming and compiling high-level abstractions in a metaprogramming language to concrete executable code will most likely depend on a set of runtime tools being correctly set up, e.g. a compiler and other necessary tools such as libc [69]. To avoid requiring extensive configuration of the client, dedicated nodes in the cluster can be assigned to generate and compile code. Thus, the client can generate a platform independent representation of the intended program, which the dedicated compiler nodes have to interpret to a concrete implementation and compile to executables for the platforms existing in the cluster.

This will however require that the dedicated compiler nodes are able to generate executables for a number of different execution platforms. Compilation of applications which are intended to execute on another platform is called cross compiling [69]. A cross compiler is a compiler which can be executed by the host platform but emits executables intended to run on another target platform. A platform is a joint description of a machine's underlying hardware architecture and software, such as the operating system and instruction set.

The Delite compiler framework [40] is a framework for building DSLs which generate native code for a heterogeneous cluster consisting of GPUs and CPUs. A DSL implemented with Delite is intended to expose high-level abstractions to the user which still enable utilisation of a heterogeneous platform. The DSL program is converted to an intermediate representation (IR) which is subject to domain specific optimisations. The optimised IR is then used to emit code for the specific target platforms in the cluster. For CPUs, Delite emits Scala and C++ code which is compiled to either JVM bytecode or native code for the target platform. This makes Delite portable: it can always fall back to the JVM if the C++ cross compiler for a specific target platform does not work. For GPUs, Delite emits CUDA code which is compiled to native code for the specific target GPU architecture using CUDA's compiler. The CUDA compiler is natively a cross compiler since it is not intended to be executed on a GPU [56]. Thus, the tooling for cross compiling CUDA code is supplied by Nvidia. Delite is used by the Flare project [26] to transform relational operations on big data stored in a table structure to native code. This is explained in section 1.5.1.

1.1.2 Continuous Deep Analytics

The thesis is a part of the project Continuous Deep Analytics (CDA) which aims to solve the need for continuous computationally heavy analysis of data streams. Therefore, CDA needs to be able to utilise the performance capabilities of a heterogeneous cluster. This should not be achieved by requiring manual low-level implementation by the user. CDA will resemble Flink by supplying high-level abstractions to the user and implicitly handling the parallelization and distribution of the work effort.

CDA is in an early phase; therefore the conceptual architecture of the system which is available at the time of writing may not be the final architecture. The conceptual architecture is illustrated in figure 1.1. It is similar to Flink's architecture and is composed of three actors: a client, a driver and a worker.

Figure 1.1: Overview of the Continuous Deep Analytics system

The client is a front-end for the user. The front-end framework will be available in several programming languages. The user will specify the stream processing pipeline in the front-end. The pipeline will then be converted to a platform independent intermediate representation (IR) which will be sent to the driver. The IR is going to be optimised, e.g. using dataflow optimisations. This can happen either at the client or at the driver.

The driver is responsible for splitting the pipeline into tasks, mapping the tasks to specific workers in the cluster, interpreting the IR of each task to a concrete implementation and compiling it to hardware specific executables for the workers. It should also monitor the cluster and the pipeline.

Workers will be responsible for executing the tasks.

1.2 Problem

CDA aims to migrate from the JVM to increase performance. The problem with shifting to programming languages running on bare metal with low-level control is usually the loss of portability [4] and safety guarantees [42].

Rust is a relatively young programming language which is compiled to native code and achieves high performance [53]. It is also statically safe; a safe Rust application is guaranteed not to contain e.g. memory errors [58]. Rust's compiler is built on top of LLVM, which gives Rust great theoretical possibilities to cross compile and emit native code for a wide range of platforms [30]. LLVM will be further discussed in section 2.3.

Therefore, CDA wants to be able to generate Rust programs from the IR. Just like Flink, CDA has to be fault tolerant. It should be possible to handle driver failures by reassigning the driver to new nodes; otherwise the driver would become a single point of failure for the whole system.

Pipeline reconfiguration during runtime has to be possible as well.

The research question is therefore: how can a relocatable driver generate source code and compile it to executables for a heterogeneous cluster?

1.3 Purpose

The thesis describes the work of designing and evaluating a prototype for a dynamic code generator which generates and cross compiles Rust code.

The main reasons for using Rust are its high performance, good static safety guarantees and ability to target a large set of platforms. This will make CDA a high-performing, portable stream processor whilst reducing the risk of errors during runtime.

1.4 Goal

The goal of the thesis is a prototype Rust cross compiler which combines generated Rust source code and emits executables for a set of target platforms.

1.4.1 Benefits, Ethics and Sustainability

As massive internet of things (IoT) becomes part of reality, frameworks for handling huge amounts of data become essential. Analysing and reacting to the data in a correct way and adapting to performance needs on demand will be an important aspect of massive IoT [73].

Being able to think at a high level of abstraction will enable the developers of such systems to reason about and prove the correctness of their application. Higher-level abstractions also make it easier for domain experts with limited experience of developing distributed applications to understand and implement such software [39].

This will increase the correctness and reliability of applications. Reliability will be crucial as software applications become responsible for critical systems in society, as in the smart cities concept [73]. Applications supporting the core of society must be highly reliable. Rust, with its safety guarantees [58] and high performance [11], is a good programming language for such applications.

Combining CPUs with GPUs to form a heterogeneous cluster has proven to be more energy efficient than using a homogeneous cluster of CPUs as execution platform [57]. A distributed parallel computing engine targeting heterogeneous clusters can therefore reduce energy consumption.

Big data analytics has also given rise to important ethical questions about integrity, such as:

Who has the right to access the available data? To what ends should it be allowed to analyse the data? How should the deployment of the analysis be restricted? [14]

Deploying data analysis on cloud systems can affect how responsibly the data is being handled. Using cloud services with data centers located in other countries can result in a lack of legal control of the data [12].

These questions are indirectly connected to CDA. They are more of a concern for the user of the framework. It is however important to highlight the issue.

1.5 Related Work

Most popular distributed computing frameworks are based on the MapReduce model [22]. MapReduce operates by splitting the logic of a distributed computing application into two phases, map and reduce. Map applies a function and generates key-value pairs which are piped to a reduce function. A reduce function then merges values with the same key. This enables easy parallelization of large computations. MapReduce's initial solution for fault tolerance was to move all intermediate results to external stable storage. This impeded the execution and decreased performance significantly.
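As an illustration of the two phases, the following is a minimal, single-machine word-count sketch in Rust written for this text; it is not taken from any MapReduce implementation, and the function names and input are hypothetical.

use std::collections::HashMap;

// Map phase: apply a function to the input and emit key-value pairs,
// here one (word, 1) pair per word in a line.
fn map(line: &str) -> Vec<(String, u64)> {
    line.split_whitespace().map(|w| (w.to_string(), 1)).collect()
}

// Reduce phase: merge all values that share the same key.
fn reduce(pairs: Vec<(String, u64)>) -> HashMap<String, u64> {
    let mut counts = HashMap::new();
    for (word, count) in pairs {
        *counts.entry(word).or_insert(0) += count;
    }
    counts
}

fn main() {
    let input = vec!["big data", "big analytics"];
    let pairs = input.iter().flat_map(|line| map(line)).collect();
    println!("{:?}", reduce(pairs));
}

In an actual MapReduce framework, the map and reduce invocations are distributed over the cluster and the emitted pairs are shuffled between machines by key, which is what enables the parallelization described above.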

One of the most widely used frameworks for distributed parallel computing is Spark. Spark is an expressive framework running on the JVM [8]. Spark combines ideas from predecessors such as MapReduce [22], FlumeJava [18] and Dryad [41].

Spark introduced Resilient Distributed Datasets (RDDs) as a new core concept to avoid eagerly moving data to external storage. RDDs represent an abstract execution plan of operations on data stored in stable storage. Operations on RDDs are lazily evaluated. No intermediate result is calculated before the final result from an RDD is explicitly requested by the user. At the time when it is requested, Spark will optimise the execution plan and distribute the workload over the cluster.

Spark improved performance significantly compared to other popular distributed parallel computing frameworks. Spark performed up to 20 times better than Hadoop, the open-source implementation of MapReduce [76].

1.5.1 Big Data Analytics and Native Code Generation

Applications analysing big data tend to use a mixture of procedural algorithms and relational queries. Using Spark's RDDs or other distributed computing frameworks for relational queries can be tedious. It requires a lot of manual optimisation to match the performance of frameworks specialised for relational queries. To tackle this problem, Spark introduced the DataFrame concept along with an optimiser specifically developed for query optimisation [8]. DataFrames represent a distributed set of rows with a uniform schema, such as a table in a relational database. These rows can be manipulated using the DataFrame API, which consists of a set of lazily evaluated relational operators. The operations performed on a DataFrame will construct an abstract syntax tree (AST). Using an AST representation for the staged operations enables optimisation using AST transformation rules. Finally, the AST is converted into a Scala AST at compile time using quasiquotes and compiled to JVM bytecode. Compiling to bytecode removes the need for runtime interpretation of the AST, which would decrease the framework's performance.

Spark's performance still suffered radically due to the focus on scale-out performance. Using the JVM as a solution for easy deployment on a cluster comes with the inherent restriction of not utilising each machine's maximum potential. Operating on a virtual machine is inherently slower and less energy efficient than operating on "bare metal". Flare [26] is an attempt to fix parts of the issue. Flare transforms relational operations on DataFrames into native code for both CPUs and GPUs using Delite [40]. Compiling to native code increased performance significantly. Flare performed 2 to 3 times better than regular Spark when using a cluster only containing CPUs. Flare performed more than 7 times better than regular Spark when GPUs were introduced to the cluster. Flare is a showcase of the potential for better performance by moving from running bytecode on a VM to executing native code on a heterogeneous cluster.

1.5.2 Stream Processing

The first concrete stream processing engine for handling large scale data streams with low latency was developed at Google as a tool for internal use. It was called MillWheel [3] and was based on the out-of-order processing (OOP) concept. OOP enables correct stream processing without impeding the flow of the stream as much as a strict total order stream processor does [49]. MillWheel therefore acted as a proof of concept for low latency exactly-once stream processing. Apache Flink was the first open-source project for both stream and batch processing which achieved low latency, scalable performance, reliable state management and exactly-once processing of stream input [15]. Flink was based on ideas from MillWheel and OOP. Flink is a good starting point and source of knowledge for the CDA project due to its low latency, scalability, reliability and the fact that the implementation is open source.

1.6 Methodology

The project will be based on a literature study of scientific writings in the subject field. Discussions about ideas and solutions will be held with the supervisors at SICS who are working on the Continuous Deep Analytics project.

The concrete requirements for the solution are then derived from the background study and the discussions. These requirements will reflect how the solution will cooperate with the rest of the Continuous Deep Analytics system in the future.


1.7 Delimitations

The result of the thesis will be a prototype, a proof of concept, and will therefore be limited in scope. The front-ends and the IR of CDA are not yet developed and will not be developed in this thesis either.

The development of the driver, such as state management and task assignment to nodes, is outside the scope of this thesis. The driver is purely incorporated as a conceptual proof that the code generator can cooperate with a theoretical driver. Generating Rust source code is not part of this thesis either. The focus of this thesis is how to combine generated Rust source code, set up Rust metadata, configure a correct runtime environment to enable cross platform compilation and collect the necessary emitted files to run the executables.

As the supported platforms have not been decided yet, the platforms which are chosen to generate executables for act as a proof of the compiler's capability to target other platforms. These platforms may not be chosen as part of CDA's supported platforms when CDA matures.

1.8 Outline

The rest of the report is structured as follows: processing big data, the Rust programming language, cross compiling and Docker are discussed in chapter 2. The conceptual ideas of the cross compiler are described in chapter 3. The concrete implementation is presented in chapter 4. The design decisions are evaluated and future work is proposed in chapter 5. Finally, a concise conclusion is given in chapter 6.


Chapter 2

Theoretic Background

Batch- and stream processing are two techniques to process big data.

Batch processing operates by dividing a large set of data into small batches, distributing the computation of each batch to a cluster of machines and collecting the results. The batch processing job is completed when the final result has been collected.

Spark is one of the most commonly used engines for batch processing. Spark's main abstraction is called the resilient distributed dataset (RDD) [76]. It is used to construct and optimise a logical execution plan. The workload is then divided into computations of small batches which are distributed among machines in a cluster for data parallel execution. No direct synchronisation between the executions is needed.

RDDs combine ideas from MapReduce [22] and FlumeJava [18].

In reality, many use cases of batch processing operate on data which is continuously created over time [15]. This is in essence what stream processing is intended to do: continuously accept new input. Therefore, Spark introduced an API called discretized streams (D-Streams) to enable stream processing in Spark. D-Streams operate by organising input from the continuous stream into batches spanning short time intervals. It is possible for the user to define transformations of these batches, and the result will be available as RDDs [75].

Other stream processing engines usually construct a pipeline and continuously accept input as a stream, and do not organise the data as batches [3] [15]. Stream processing does not require that the data which should be processed is finite and fully available at the start either, in contrast to batch processing. The stream processing pipeline does not have a specific end condition and will run continuously [65].


Flink [15] is a popular open-source stream processing engine. A Flink pipeline processes data in an out-of-order processing (OOP) [49] fashion. This means that total order of input data is not guaranteed, but a notion of order is kept with the help of markers to ensure pipeline progress.

Markers will be put into all of a pipeline's input sources to mark the end of a collection of streamed data called an epoch. All data received between the previous marker and the current marker belongs to the current epoch. Flink freezes the next epoch from being processed until all markers from the current epoch have been received, indicating that the current epoch has been fully processed before the processing of the next epoch begins. This does not mean that the whole pipeline is frozen until an epoch is fully processed. Each machine has the ability to unfreeze its own input channels when it has received a marker for the current epoch from all input channels. Thus, each machine guarantees that it has received all data from the epoch before continuing to the next epoch.

Flink also saves the pipeline's current state at the end of each epoch to be able to roll back and reinstantiate the pipeline in case of failure [16].
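To make the marker-based alignment concrete, the following is a rough, simplified Rust sketch written for this text (it is not Flink's implementation): an operator tracks which input channels have delivered the marker for the current epoch and only advances once all of them have.

use std::collections::HashSet;

// Simplified alignment state for one operator.
struct Alignment {
    current_epoch: u64,
    num_channels: usize,
    marked: HashSet<usize>, // channels that delivered the current epoch's marker
}

impl Alignment {
    // Called when a marker for `epoch` arrives on `channel`. Returns true
    // when markers have arrived on all input channels, i.e. the operator
    // has seen all data of the epoch and may move on to the next one.
    fn on_marker(&mut self, channel: usize, epoch: u64) -> bool {
        if epoch != self.current_epoch {
            return false; // markers for other epochs are ignored in this sketch
        }
        self.marked.insert(channel);
        if self.marked.len() == self.num_channels {
            self.current_epoch += 1;
            self.marked.clear();
            true
        } else {
            false
        }
    }
}

In Flink, data arriving on a channel that has already delivered its marker is buffered (frozen) until the epoch completes, as described above.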

2.1 Flink Architecture and Fault Tolerance

Using discrete batches for stream processing, as D-Streams do, introduces high latency and some inaccuracy, since the time dimension is not explicitly part of the application code. The difference in arrival time between two messages in the same batch is discarded since they are processed simultaneously. Time has to be included in the logic, through timestamps or similar, to allow for higher accuracy [15].

Flink was developed with this in mind and does not collect stream input in batches prior to computation. Instead, Flink deploys a pipeline architecture which allows individual events to flow more unconstrained.

Flink has an API in Java, Scala and Python which is used to specify the pipeline. The pipeline is used to generate a logical dataflow graph.

The logical dataflow graph is optimised before being sent from the client, the machine on which the user specified the pipeline, to the job manager [16].


The job manager receives a representation of the pipeline in the form of a logical dataflow graph. The graph consists of operators and edges, where operators are logical computations on data and edges are input and output channels. The logical dataflow graph is used to create a physically deployable graph for the cluster by parallelizing operators into subtasks which are ready for physical execution [15].

Flink’s execution engine is composed of the job manager and task managers. The job manager will keep track of the pipeline and act as the master whilst the task managers execute tasks. As described earlier, Flink uses markers and epochs to guarantee that the pipeline makes progress. A persistent state of the pipeline is periodically saved for fault tolerance.

Flink has to be robust since pipelines are meant to run continuously for undefined periods of time. In case of task manager failure, the whole pipeline is redeployed to the last saved persistent state. Flink manages this by assuming that the input streams are able to roll back as well. Flink also manages job manager failure by electing a new job manager [16].

2.1.1 CDA and Fault Tolerance

CDA also has to be able to handle driver failure since it is going to be a long-lived service. Otherwise the driver will become a single point of failure, meaning that the whole pipeline will fail if the driver fails.

One key difference between CDA and Flink is that the client in CDA will be oblivious to which platforms are used to execute the pipeline. The Flink client knows that the pipeline is only going to be executed by the JVM. Therefore, the logical dataflow graph which is shipped from the client to the job manager is already set for physical deployment on the JVM [16]. In CDA, the driver will receive a platform independent IR from the client. The driver then has to interpret the IR to an executable physical graph for the concrete platforms available in the cluster. CDA wants to interpret the IR to a programming language which can take better advantage of the performance capabilities of a heterogeneous cluster. Compiling another programming language requires setting up the driver correctly to emit executable code for the target platforms available in the cluster.

At the same time, CDA has to be able to reallocate the driver to handle driver failures. Driver reallocation will therefore depend on the machine which is elected driver being correctly set up. Requiring manual setup of the driver's runtime environment at all nodes, in case they are elected leader, introduces three problems:

1. If a new platform is introduced to the cluster, e.g. if a GPU is added, the nodes have to be reconfigured to be able to compile to that platform.

2. If CDA evolves and requires configuration of new tools in the runtime environment, e.g. if a new hardware accelerator is supported by CDA, the clusters currently running CDA will have to be manually updated.

3. Incorrect or inconsistent manual configuration can produce obscure errors which will be hard to debug.

Being able to deploy a consistent runtime environment for the driver is therefore desirable and is discussed in section 2.6.

One additional advantage of using the JVM as Flink's execution platform is the safety guarantees provided by the managed runtime, specifically the memory safety which is provided by the garbage collector (GC). The GC does however introduce a performance overhead which makes it unsuitable for performance critical applications. To remove this overhead, most developers of performance critical applications choose to use an unsafe language, e.g. C, which enables more low-level control [11].

2.2 Rust the Programming Language

The trade-off between giving developers low-level control and being able to guarantee safety properties of programs, such as memory safety, is a common problem for general purpose languages (GPLs). It has been a prioritised problem in the programming language research domain [42].

Rust, a programming language developed at Mozilla Research, claims to have solved the problem without incurring performance penalties due to garbage collection or other runtime checks [53]. Rust's method for handling memory, the borrow checker, has been proven to be logically safe [58]. The logical proof validates that safe Rust code cannot contain behaviour which will cause the heap to leak memory or allow access to uninitialised memory. Well-typed safe Rust programs are therefore statically guaranteed to be memory safe.

Rust is introduced in "The Rust Programming Language" [71]. Rust is released through three channels: stable, beta and nightly. Nightly is created each night and can introduce breaking changes. Therefore, a program which works today with nightly Rust may not work tomorrow. Every six weeks a beta version of Rust is created from the current nightly version. The beta version is subject to tests for the coming six weeks to locate bugs. After the six weeks have passed, the beta version is made into the next stable version of Rust. Nightly Rust therefore has features which will first appear in stable Rust within the coming six to twelve weeks. Many developers choose to use the nightly release channel for this reason [37].

Rust supplies the developer with high-level abstractions without incurring performance penalties during runtime, similar to what C++ does. Two of Rust's most distinguishing concepts compared to other popular GPLs are ownership and lifetimes of resources.

Ownership is used in Rust to ensure memory safety and to avoid data races. Variables can be owned by a restricted set of pointers, and special rules are enforced when accessing the data. Only one pointer can have the right to modify the data at any given time. Several pointers can use references for read access, but not while one pointer has write access. Lifetimes are used to avoid dangling pointers, which occur when freed memory is accessed.
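A minimal example, written for this text, illustrating how the ownership and borrowing rules appear in code:

fn main() {
    let owner = String::from("data");   // `owner` owns the heap allocation
    let moved = owner;                  // ownership moves to `moved`
    // println!("{}", owner);           // compile error: `owner` has been moved
    println!("{}", moved);

    let mut value = vec![1, 2, 3];
    let r1 = &value;                    // several shared (read-only) references are allowed
    let r2 = &value;
    println!("{} {}", r1.len(), r2.len());
    let w = &mut value;                 // but only one mutable reference at a time,
    w.push(4);                          // and not while shared references are still in use
    println!("{:?}", w);
}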

2.2.1 Unsafe Rust

There are however some necessary actions which cannot be statically verified to be safe by Rust. Therefore, Rust had to include a construct for unsafe scopes. In an unsafe scope, Rust will allow actions deemed unsafe by the regular Rust rules, such as raw pointer manipulation and manipulation of mutable static fields.

An example of an unsafe operation is calling functions in dynamically linked libraries. The compiler can never guarantee that the dynamically linked library is called correctly. Dynamic linking will be explained in section 2.5.

The unsafe scope will however not turn off other safety checks. Breaking the ownership rules is still not allowed and will yield static errors at compile time. The unsafe scope is extensively used, e.g. in the standard library, but it is usually wrapped in a safe API [11]. By marking the API as safe, the developer assures that no unsafe or undefined behaviour can occur when using the API. The Rust compiler will not be able to statically check this safety, so all safety guarantees are provided by the developer.
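A small sketch, written for this text, of the pattern described above: an unsafe operation hidden behind a safe function, where the author of the function provides the safety argument. The function itself is made up for illustration.

// Safe wrapper around an unsafe raw pointer read. The caller never sees the
// unsafe block; the wrapper's author guarantees the read is valid because the
// pointer is derived from a live, non-empty slice.
fn first_element(values: &[u32]) -> Option<u32> {
    if values.is_empty() {
        return None;
    }
    let ptr = values.as_ptr();
    // Reading through a raw pointer is not checked by the compiler and must
    // therefore happen inside an unsafe block.
    Some(unsafe { *ptr })
}

fn main() {
    let empty: [u32; 0] = [];
    println!("{:?}", first_element(&[7, 8, 9])); // Some(7)
    println!("{:?}", first_element(&empty));     // None
}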

This is in essence what other GPLs with low-level control do: they rely on the user logically reasoning about and validating the correctness of their own code. This creates the possibility of errors. The difference is that whilst errors may originate from any part of e.g. a C program, a runtime memory error caused in a Rust program has to originate from an unsafe scope. Debugging of runtime errors is thus simplified in Rust.

RustBelt [42] defines a semantic model for proving soundness of Rust modules which use the unsafe clause but claim to expose a safe API. RustBelt helped extend Rust's claim of statically checked safety and supported the Rust community by locating bugs in the standard library.

2.2.2 Compiling Rust with Cargo

Most Rust projects, which are called Rust crates, use Cargo for building. Cargo is Rust's build tool and package manager. It helps maintain repeatable builds for crates.

[package]

name = "hello_world"

version = "0.1.0"

authors = ["Default author <default@author.com>"]

[dependencies]

rand = { git = "https://github.com/rust-lang-nursery/rand.git" }
time = "0.1.12"

hello_utils = { path = "path/to/utils" }

Listing 2.1: Example of a crate’s Cargo.toml

Cargo requires a set of metafiles in the .toml format and a specific directory structure when building a crate.

The Cargo.toml file is the main metafile and contains crate settings such as dependencies, crate name and authors. Cargo is documented online in "the Cargo Book" [70].


Figure 2.1: Rust compiler phases

Cargo fetches remote dependencies, e.g. from the Rust community's package registry crates.io [17]. An example of a small Cargo.toml file can be seen in listing 2.1.

A Cargo.toml file will consist of key-value pairs grouped under tables. A table is indicated by the surrounding brackets, e.g. [dependencies]. In listing 2.1, all three dependencies for the crate will be fetched differently by Cargo.

The first will be fetched from GitHub, from the specified URL. The second dependency will be fetched from crates.io; "0.1.12" specifies which version to fetch. The last dependency is a path to a crate available on the local machine.

Cargo will automatically fetch the remote dependencies and then build all dependencies before compiling the main crate. Building a dependency may cause more remote crates to be fetched by Cargo.
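For completeness, a hypothetical src/main.rs that could accompany the Cargo.toml in listing 2.1 might look as follows. It is only a sketch: the greet function of the local hello_utils crate is made up, and rand::random is assumed to be available in the fetched version of rand.

// Rust 2015, the edition current at the time of writing, requires the
// external crates to be declared explicitly.
extern crate hello_utils;
extern crate rand;
extern crate time;

fn main() {
    // Assumed API of the local hello_utils crate.
    println!("{}", hello_utils::greet("world"));
    // Use the dependency fetched from GitHub.
    let n: u8 = rand::random();
    println!("random byte: {}", n);
}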

The compiling process of most GPLs can be divided into two coarse-grained parts. The first part is the transition from source code to an IR and the second part is from IR to executable code.

The IR avoids as many assumptions as possible regarding the target platform. It should also be decoupled from the actual source code, being an abstract representation of the source code's meaning, i.e. the code's semantics. The IR is used to emit native code or bytecode for a specific target platform [7]. This step is composed of two parts: generating object files of compiled code and linking object files together to create an executable [63].

Rust's compiler will parse source code and go through two stages before translating to LLVM IR, see figure 2.1. The two stages are the high IR and the middle IR, which are subject to static checking and transformations such as macro expansion. LLVM will optimise the generated LLVM IR and emit a number of object files containing compiled code. These are then linked together to form an executable [62].

The LLVM IR is created for one Rust crate at a time. LLVM's optimisation and generation of object files from the IR is usually responsible for the majority of the compile time of a Rust crate. Work is being done to try and reduce the Rust compiler's compile time [19]. One way to reduce compile time is distributed compilation, where the workload of compiling a project's modules is split across a cluster [1].

2.3 LLVM

LLVM [45] is a compiler framework which consists of a virtual low-level instruction set. The virtual instruction set captures the primitives commonly used to implement features in high-level languages. This enables a large set of different high-level languages to target the LLVM bytecode during compilation. The bytecode resembles assembly code and does not guarantee any type or memory safety. LLVM assumes that the high-level programming language will decide to which degree type safety and memory safety should be enforced. The LLVM bytecode is virtual, meaning that it tries to be as platform independent as possible. One example of this is that the number of available registers is unbounded in the bytecode. The number of registers is specified only when the bytecode is compiled to native code for a specific target platform.

LLVM will optimise a program before it is compiled to native code.

An IR is created from the bytecode and safe optimisation techniques are applied to it, thus not altering the semantics of the program.

LLVM also supports profile-directed optimisations at runtime. LLVM can take feedback from the executed binary to find hot paths. A hot path represents an execution path which is frequently used during runtime, e.g. which branch of an if-else statement is mostly taken. Using this feedback, LLVM restructures the instructions to improve runtime performance and recompiles the optimised program to native code. This is part of what is called just-in-time (JIT) compilation and is done by other VMs as well, e.g. the JVM's JIT compiler optimises at runtime using profile-directed optimisations [9].

It is required that the executing platform has LLVM installed if LLVM should be able to perform profile-directed optimisations at runtime. In the context of CDA this means that the workers have to have LLVM installed.

2.4 Cross Platform Compiling

Using LLVM as a back-end enables Rust to theoretically cross compile to a large set of target platforms [69]. Rust does support a large number of host platforms, each to a varying degree. Some are officially guaranteed to work whilst others depend on community efforts to work properly [61].

Rust uses an external linker for creating executables from LLVM's emitted units of compiled code. Thus, cross compiling Rust requires the linker to be explicitly set for the target platform [69]. Apart from the linker, cross compiled versions of Rust's and C's standard libraries, plus additional C tools used in the application, are needed to cross compile a Rust program successfully [28].

The default Rust installer, called "rustup", helps with fetching compiled versions of Rust's standard library for different target platforms.

Finding a correct version of C's standard library, other C tools and the linker still remains a problem to be solved manually. It can be a tedious process since "each combination of host and target platform requires something slightly different" and finding the correct setup for each combination "typically involves pouring over various blog posts and package installers" [69].

Thus the linker has to be chosen correctly for each specific pair of host and target platform. The linker has to be compiled to execute on the host platform but link compiled code units and emit executables for the target platform. Given that the Rust compiler supports 27 host platforms and Rust's standard library is cross compiled for 63 target platforms [61], the number of possible host and target platform pairs is theoretically 1,701. In the context of CDA, the number of pairs can be restricted by only supporting some architectures as host platforms for the cross compiler. Also, not all of Rust's target platforms are of interest for CDA.
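As a sketch of what such a setup can look like (written for this text, not taken from the thesis; the exact package names depend on the host distribution), the linker for an ARM target can be declared in the crate's .cargo/config file:

# .cargo/config -- tell Cargo which linker to use for the ARM target.
# Assumes the gcc-arm-linux-gnueabihf package from listing 2.2 is installed.
[target.armv7-unknown-linux-gnueabihf]
linker = "arm-linux-gnueabihf-gcc"

With the target's standard library added through rustup ("rustup target add armv7-unknown-linux-gnueabihf"), running "cargo build --target armv7-unknown-linux-gnueabihf" will then use this linker to emit executables for the target platform.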

2.5 Rust Linking and Binaries

External libraries used by an application have to be available during runtime for the application's binary to be able to execute. This is done by linking the libraries to the application. There are two ways external libraries can be linked: statically or dynamically.

Static linking is done by the compiler: the compiled code of the used libraries is combined with the compiled application into a single binary file. Symbolic references to the libraries are then replaced by machine addresses during compile time. Therefore, the binary produced will contain everything necessary to execute the program.

Dynamic linking, on the other hand, will not replace the symbolic links prior to runtime, and the used external libraries will not be copied into the binary file. Instead, libraries are loaded into memory during runtime and can be shared between several running applications. The advantages of dynamic linking are a reduced binary size and reduced memory usage during runtime, since libraries can be shared between applications. A dynamically linked binary is however not standalone and requires that the dynamically linked libraries are available on the host platform [21].

Rust depends on libc for compiling, and the implementation usually used by Rust on Linux is glibc. Usage of glibc has technical difficulties when it comes to statically linking everything into the binary to make it completely standalone. Rust binaries compiled with glibc therefore dynamically link libc. This requires that libc is installed on the host platform when executing the binary. A small implementation of libc called musl does exist as an alternative which can be statically linked. Compiling Rust with musl enables completely standalone binaries [69]. In the CDA project, dynamic linking of libraries could help to reduce the size of task binaries which are transmitted over the network between the driver and workers. Instead, CDA could depend on the worker storing the dynamically linked libraries, such as glibc.

The libraries would then be reused when the task binary is updated for a worker. This does however require some setup of the worker's runtime environment.

2.5.1 Size of Rust Binaries

When deploying a CDA pipeline, a reduced binary size would mean less network traffic and faster deployment. The size of Rust binaries tends to be large compared to other popular GPLs' binaries. The Rust compiler usually opts for better performance, static safety and standalone executables instead of a reduced binary size. The contributing factors for Rust's large binaries are [31]:

Monomorphization - Rust generates a concrete version for each unique usage of a generic type or function. This is done for runtime performance but comes at a price of longer compile time and larger binaries.

Optimisations for runtime performance - Rust does many static transformations for runtime performance besides the aforementioned monomorphization. Loop unrolling and inlining functions where they are called are examples where a smaller binary size is traded for runtime performance. These can be disabled by using the Oz optimisation flag. Oz will "optimize for size at the expense of everything else" [2].

Debug Symbols - To enable backtraces in case of errors during runtime, which are called panics in Rust, debug symbols are kept in the binaries even when building in release mode. They can be removed using tools like strip [32]. This will however break Rust's backtraces and make debugging more difficult.

Static Linking of Allocator - Rust uses Jemalloc as its default memory allocator. Jemalloc is reliable and delivers high performance allocations at the expense of larger binaries. It is by default statically linked into the binary. Effort is being spent on enabling custom allocators [5], which could be used to reduce binary size.

No Link Time Optimisation - Rust does not do link time optimisation (LTO) by default, but can be instructed to do so by putting

[profile.release]
lto = true

in the Cargo.toml meta file. LTO may reduce the binary size by optimising across compilation units and eliminating dead code, i.e. unused code. This will increase the duration of the compilation, possibly significantly prolonging the compiling process, which is why it is disabled by default [23].

Static Linking of the Standard Library - Parts of the standard library are statically linked into all binaries, such as the library for backtracing. This will increase the binary size. To avoid this, the

#![no_std]

attribute can be used. This will disable Rust's standard library in the application. The binary size may shrink due to this, but at the expense of lost static safety and more complexity for the Rust programmer.

The standard library’s safe APIs, which are extensively checked by the community for safety bugs e.g. with RustBelt, are then circumvented.

Using the flag "-C prefer-dynamic" will however make Rust’s compiler prefer dynamic linking. Static linking is then performed only when the compiler can’t find a copy of a dependency which can be dynamically linked [51]. The result of this is that Rust’s standard library will be dy- namically linked for the platforms which a dynamically linkable copy of the standard library exist. Setting the "RUSTC_FLAGS" environ- ment variable to "-C prefer-dynamic" can also be used to achieve the same results.
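Putting the points above together, a minimal sketch of a Cargo.toml release profile that trades runtime performance for binary size could look like this (opt-level = "z" corresponds to the Oz flag mentioned above; whether the trade-off pays off has to be measured for each crate):

[profile.release]
# Optimise for size rather than speed (the Oz flag).
opt-level = "z"
# Optimise across compilation units and remove dead code,
# at the cost of a longer compile time.
lto = true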

2.6 Dynamic Resource Allocation in a Clus- ter

A compiler requires knowledge about which specific platforms it is compiling executables for. In the context of CDA this means that the compiler must know which platforms are available in the cluster. The set of platforms available to a distributed application can change dynamically during runtime due to node failures or new nodes joining the cluster.

Moreover, multiple applications developed in different frameworks could coexist in the same cluster. In such a scenario, the cluster's resources are typically distributed between the different applications using a resource negotiator such as YARN [72] or Mesos [36]. The benefit of using an external resource negotiator is that the resource management is decoupled from the application logic.

Most resource negotiators are however developed mainly for batch processing jobs, which are short-lived and static. Usually, a short-lived application requests a static set of a cluster's resources from the resource negotiator. When they are available, the application will use them as an execution platform until the job is finished. Then the resources are all released and can be reassigned to other applications. This does not fit the dynamic nature of long-lived services such as stream processors. Stream processors should be able to dynamically adapt to spikes in data streams. When a spike has passed, unused resources should be released to allow for better utilisation of the cluster [47] [54]. YARN supports dynamic resource reallocation through the use of Apache Slider [48].

2.6.1 Node Configuration

Setting up the correct runtime environment for an application is often hard by itself. Libraries, system tools and other necessary applications have to be configured for the application to work properly. The complexity is further increased when the application is intended to run on a cluster of multiple different host platforms [44] [74].

A scenario where nodes are constantly reassigned to different applications makes the problem even more complex. Obscure errors can occur in applications due to inconsistent and incorrect manual configuration of nodes. This is called the consistency problem [29]. To avoid such misconfiguration, the correct configuration can be shipped in an image using Docker [24]. A Docker image is used to initiate and execute a container. Containers will guarantee applications a more consistent runtime environment when executing on distributed nodes.

2.7 Docker Containers

A container can be compared to a VM. The difference is that containers execute applications directly in the host operating system’s (OS) kernel and do not simulate a guest OS kernel, which a VM does. This makes containers more lightweight compared to VMs [29].

Each Docker container does however run in its own isolated environment with its own filesystem and environment variables. The reason Docker was made more lightweight compared to VMs was to achieve better performance. Docker containers can be started in less than a second and do not inherit the performance penalties of executing on top of a VM [6].

Another problem which Docker solves is the responsiveness problem. Dynamic scaling of performance according to demand has traditionally been hard to make highly responsive. Adding more computing power to a cluster usually required substantial effort to configure new machines and their runtime environments. This either led to too little capacity during peaks, due to a lack of hardware resources in the cluster, or too much capacity outside of peaks in order to be ready and able to perform well during peaks. The latter may not look like a problem, but it means that hardware resources are tied up when they could be used for other services. Thus, the cluster is not fully utilised and the runtime cost for the application is increased [29].

Docker solves this problem by packing the correct runtime environment inside a Docker image, which eliminates the need for manual configuration. The Docker container can be started quickly to respond to a peak. After the peak has passed, the Docker container can be stopped to free the hardware resources. This ties in to the problem of reallocating the driver: it enables quick configuration of a newly elected driver's runtime environment as a response to a driver failure, without any manual effort.

Containers have been proven to be a feasible solution to ease deployment whilst still achieving good scale-out performance in a number of cases:

The deployment of new software radio access networks (RAN) can be simplified whilst still fulfilling the high performance demands of RANs [52]. Distributed forensic processing showed a potential of almost linear speedup when using containers to deploy new nodes [67].

Complex optimisation problems may be solved quicker using distributed genetic algorithms deployed with containers [64].

2.7.1 Building Docker Images

A Docker image is specified with a Dockerfile. The image is built by calling "docker build <path-to-directory>" where the directory contains a "Dockerfile" file.

Dockerfile instructions are specified at [25]. There are four instructions which will be used in this project: FROM, RUN, ADD and ENV.

All Dockerfiles have to start with a FROM statement. It specifies which image should be the base image for the file. "FROM ubuntu" will build upon the latest available version of the Ubuntu image. Alternatively, "FROM ubuntu:16.04" can be used to consistently fetch version 16.04 of the Ubuntu image.

RUN apt-get update && \
    apt-get -y install -qq gcc-arm-linux-gnueabihf

Listing 2.2: An example of the usage of RUN in a Dockerfile

RUN will execute a command and create a new layer of the image, which will be built upon by the later stages in the Dockerfile. Listing 2.2 is an example of how to use the RUN instruction.

This would fetch the gcc-arm-linux-gnueabihf package from the apt repository. The backslash is used to escape newline characters to make RUN instructions span multiple lines.

ADD copies files from the host platform to the Docker image. Docker can only add files located in the same directory as the Dockerfile or its subdirectories. The Docker daemon only has access to the context directory and its subdirectories. Therefore, "ADD dir/config <destination>" is a valid instruction whilst "ADD ../config <destination>" is not.

ENV is used to set environment variables for a Docker image.

2.7.2 Performance of Containers

In the context of a high performance computing engine such as CDA, the performance penalties of using containers should be evaluated.

Even if the deployed tasks can be executed without Docker, a substantially increased compile time due to Docker may render it an infeasible solution.

Using containers incurs essentially zero performance overhead for computationally heavy executions [74]. Local I/O operations, such as reads and writes to files, do not incur large overheads either [27].

The concern is however the performance overhead when communicating across container boundaries with virtual network devices. During high traffic loads over the network, a performance overhead of using containers appears [60]. The host platform's native network device can be used directly to circumvent the virtual network device and reduce the performance overhead [44].

Overpopulation of nodes, i.e. deploying multiple containers per physical core, also creates a risk of performance degradation [60].

2.7.3 Cross Compiling Rust with Docker

The Rust crate called "cross" aims to supply "Zero setup cross compila- tion and cross testing of Rust crates" [77]. It depends on a Docker con- tainer with the Ubuntu image as base image. The image is built to fetch the appropriate linker, C’s standard library and other C tools e.g.

openSSL for the target platform with 64-bit Linux as host platform.

(35)

"cross" uses rustup to fetch cross compiled versions of Rust’s standard library.

Rust does have an official repository with a Rust Docker image on the Docker HUB [50]. It is intended to be used both to compile and execute Rust code. Rustc, Cargo and Rustup are all installed in the image.

This reduces the Dockerfile implementation; the only remaining requirement is to configure Cargo for the target platform and fetch the correct linker, Rust's and C's standard libraries and C tools to enable cross compilation.

The Rust Docker repository also contains a rust:slim Docker image which is a reduced version of the regular Rust image. The slim image should only be used in certain cases:

"Unless you are working in an environment where only the rust image will be deployed and you have space constraints, we highly recommend using the default image of this repository." [50]

The slim image could therefore be a good option for the code generator. Rust is possibly the only thing which is going to be used in the environment. Space is probably not constrained, but it is desirable to reduce the network traffic required to initialise the code generator and cross compiler. Reducing the size of the base image would help with this.


Chapter 3

Design

The code generator and cross compiler will go through five distinct phases, depicted in figure 3.1.

The first phase will generate the concrete Rust source code for the tasks. This is not part of this thesis.

The second phase is the specification of the project meta data. Most importantly, the required dependencies have to be specified. The dependencies will be fetched at compile time.

The third phase will configure the cross compiler's runtime environment with the correct linker, all required tools and settings.

The fourth phase will invoke the compiler. During this phase the required dependencies will be fetched.

The fifth and final phase will collect all generated binaries and dynamically linked dependencies.
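
As a rough illustration, the five phases could be chained sequentially in the Scala driver as sketched below. The object and method names are purely illustrative assumptions and not part of the actual CDA implementation.

// Illustrative sketch only: the names below are assumptions, not CDA's API.
object PhasePipeline {
  def generateSource(ir: String, crateDir: String): Unit = ???       // phase 1 (other thesis)
  def writeProjectMetaData(crateDir: String): Unit = ???             // phase 2: Cargo.toml etc.
  def configureCrossCompiler(target: String): Unit = ???             // phase 3: Docker image
  def invokeCompiler(crateDir: String, target: String): Unit = ???   // phase 4: Cargo build
  def collectArtifacts(crateDir: String, outDir: String): Unit = ??? // phase 5: binaries + .so files

  def run(ir: String, crateDir: String, target: String, outDir: String): Unit = {
    generateSource(ir, crateDir)
    writeProjectMetaData(crateDir)
    configureCrossCompiler(target)
    invokeCompiler(crateDir, target)
    collectArtifacts(crateDir, outDir)
  }
}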

Figure 3.1: Overview of the Rust code generator's distinct phases

The code generator and cross compiler are two independent parts.

The Rust source code generator will be a Scala application running on the JVM. It will theoretically receive the IR from the client and interpret it into a Rust program. The Rust program will consist of source code files and a directory tree. The source code generator is another student's thesis.


3.1 Setup of the Generated Rust Crate

To assemble the required dependencies, Rust's default package manager and builder Cargo will be used. The reason for using Cargo is that it is actively maintained and developed by the Rust community. Cargo is highly configurable and does not restrict how the Rust compiler is invoked. Therefore, the maintenance effort does not fall on CDA and the usage of Rust is not restricted.

Cargo requires a project specific Cargo.toml file to fetch dependencies and build a project. The Cargo.toml file is generated using a meta toml file where all possible dependencies for CDA tasks will be specified. Each time a remote dependency is accessed in the generated Rust source code, the dependency's toml entry in the meta toml will be added to the project's Cargo.toml. Thus, remote dependencies which are specified in the meta toml but are not used by the project will not be fetched. This will reduce the network traffic and compile time of tasks.
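
A minimal sketch of this selection step is given below, assuming the meta toml has already been parsed into a map from crate name to its full toml entry and that the set of dependencies used by the generated source is known; all names and the package header fields are illustrative assumptions.

import scala.collection.immutable.{Map, Set}

object CargoTomlGen {
  // Sketch: emit a Cargo.toml that only lists the dependencies actually used
  // by the generated Rust source. Parsing the meta toml and detecting the
  // used crates are assumed to happen elsewhere.
  def generateCargoToml(
      crateName: String,
      metaDeps: Map[String, String], // crate name -> full entry from the meta toml
      usedDeps: Set[String]          // crates referenced by the generated source
  ): String = {
    val header =
      s"""[package]
         |name = "$crateName"
         |version = "0.1.0"
         |
         |[dependencies]
         |""".stripMargin
    val entries = usedDeps.toSeq.sorted
      .flatMap(metaDeps.get)         // unused meta toml entries are never emitted
      .mkString("\n")
    header + entries + "\n"
  }
}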

Cargo will automatically build binaries for all .rs files in the src/bin/ directory of the project directory tree. Therefore, the first task's source code is generated as a task1.rs file, the second's as a task2.rs file, and so on, in the src/bin/ directory.

To reduce binary size and the network traffic required for the driver to distribute the binaries, dependencies can be dynamically linked using "-C prefer-dynamic". Dependencies for which the compiler finds a dynamically linkable version, e.g. .so files on Linux, will not be included in the binary itself. All used dynamically linked dependencies have to be available on the workers if the workers are to be able to run the task executables. Therefore, the driver has to store and be able to distribute both the binaries and the dynamically linked dependencies.

Finding the dynamically linkable dependencies will be done after the cross compiler has compiled the source code.
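
One way this could be wired up, sketched below, is to pass the flag through the RUSTFLAGS environment variable when Cargo is invoked; whether the implementation uses RUSTFLAGS or a Cargo configuration entry is not prescribed here, and the method name is an assumption.

import scala.sys.process._

object CargoInvoke {
  // Sketch: build in release mode for a given target, preferring dynamic
  // linking. Cargo forwards the flags set in RUSTFLAGS to rustc.
  def buildDynamic(crateDir: java.io.File, target: String): Int =
    Process(
      Seq("cargo", "build", "--release", "--target", target),
      crateDir,
      "RUSTFLAGS" -> "-C prefer-dynamic"
    ).!
}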

3.2 The Cross Compiler

The cross compiler is started after the Rust source code and meta files have been generated. The cross compiler will go through three distinct phases: building the required Docker images, using containers to compile the Rust source code and finally collecting executables and dependencies to the host machine. The workflow of the cross compiler can be seen in figure 3.2.

Figure 3.2: Overview of the Rust cross compiler

The cross compiler is executed as a Docker container to consistently configure Cargo's runtime environment for each specific target platform. Each target platform is going to have a unique Docker image. Since Docker works in layers, different target specific images will reuse common layers, e.g. the base image will be reused by all images.

This means that the amount of network traffic and storage capacity needed for instantiation is decreased: only the images for the platforms available in the cluster need to be fetched or built. The images will also reuse common layers from already built images.

A container will be created for each target specific image and Rust crate being compiled. The container is specified to use the host network adapter directly to reduce the risk of extending the time required to fetch remote dependencies. The Rust crate's directory tree will be copied to the container. Then Cargo will be invoked in the container in release mode to fully optimise the code. Target specific executables are thus produced, which are subsequently copied back from the container to the host machine. Dynamically linked libraries, if available, are copied to the host machine along with the executables.
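
A sketch of how such a compilation container could be driven from the Scala side is given below. The image naming scheme, the /build path inside the container and the use of docker create/cp/start are assumptions made for illustration, not the actual implementation.

import scala.sys.process._

object ContainerCompile {
  // Sketch: compile one crate for one target inside a prebuilt cross-compiler
  // image, using the host network adapter, then copy the artefacts back.
  def compile(crateDir: String, target: String, outDir: String): Unit = {
    val image = s"cda/cross-$target" // hypothetical image name
    // Create a container that runs Cargo in release mode inside /build.
    val id = Seq("docker", "create", "--network", "host", image,
                 "sh", "-c", s"cd /build && cargo build --release --target $target").!!.trim
    Seq("docker", "cp", crateDir, s"$id:/build").! // copy the crate's directory tree in
    Seq("docker", "start", "-a", id).!             // run the compilation, stream output
    // Copy executables (and any dynamically linked libraries placed there) back.
    Seq("docker", "cp", s"$id:/build/target/$target/release", outDir).!
    Seq("docker", "rm", id).!
  }
}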

Additionally, Cargo's cache of fetched remote dependencies' source code and all the resulting compiled code can be copied to the host machine. These files can be copied into the container before the next compilation to avoid refetching and recompiling remote dependencies. This will however require some storage on the host platform; how much depends on the number of remote dependencies used.
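
As a sketch, this caching could also be done with docker cp. The path /usr/local/cargo is CARGO_HOME in the official Rust image, but the host side cache directory and the /build path are assumptions for illustration.

import scala.sys.process._

object DependencyCache {
  // Sketch: save Cargo's registry and the compiled target directory after a
  // build, so they can be restored into the next container before it starts.
  def save(containerId: String, cacheDir: String): Unit = {
    Seq("docker", "cp", s"$containerId:/usr/local/cargo/registry", cacheDir).!
    Seq("docker", "cp", s"$containerId:/build/target", cacheDir).!
  }

  // Assumes the crate has already been copied to /build in the new container.
  def restore(containerId: String, cacheDir: String): Unit = {
    Seq("docker", "cp", s"$cacheDir/registry", s"$containerId:/usr/local/cargo").!
    Seq("docker", "cp", s"$cacheDir/target", s"$containerId:/build").!
  }
}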

If any errors occur, either when building Docker images or when executing containers, log files are created. The error messages in the log files should make it clear where the cross compiler or code generator failed.


Chapter 4

Implementation

The Cargo.toml generator starts by parsing the meta toml file. Parsed entries from the meta toml are stored in a nested hashmap structure.

Each inner hashmap represents one toml table, e.g. [dependencies] has one specific hashmap. The entries in the inner hashmap are key-value pairs where the key is the toml key and the value is the whole entry from the meta toml.

The outer hashmap contains key-value pairs where the key is the table name, e.g. "dependencies" in the earlier example, and the value is the inner hashmap. Thus duplicates are avoided, the implementation is simple and lookup time is constant.
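
A sketch of this structure is given below; the method names and the "serde" example key are illustrative assumptions.

import scala.collection.mutable

object MetaTomlStore {
  // Outer map: table name (e.g. "dependencies") -> inner map.
  // Inner map: toml key -> the complete entry copied from the meta toml.
  type MetaToml = mutable.Map[String, mutable.Map[String, String]]

  def addEntry(meta: MetaToml, table: String, key: String, entry: String): Unit =
    meta.getOrElseUpdate(table, mutable.Map.empty).update(key, entry)

  // Average constant-time lookup of a single entry,
  // e.g. lookup(meta, "dependencies", "serde").
  def lookup(meta: MetaToml, table: String, key: String): Option[String] =
    meta.get(table).flatMap(_.get(key))
}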

The compiler is implemented as a concurrent thread based system. Threads can be created either for building specific Docker images or for compiling Rust in containers.
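
A sketch of this dispatching is shown below, with hypothetical buildImage and compileCrate work functions standing in for the real image builds and container runs.

object CompilerThreads extends App {
  // Placeholder work functions; the real ones build Docker images and run
  // compilation containers as described above.
  def buildImage(target: String): Unit   = println(s"building image for $target")
  def compileCrate(target: String): Unit = println(s"compiling crate for $target")

  def inThread(work: => Unit): Thread = {
    val t = new Thread(() => work)
    t.start()
    t
  }

  val targets = Seq("x86_64-unknown-linux-gnu", "armv7-unknown-linux-musleabihf")
  // One thread per Docker image build, then one thread per crate compilation.
  targets.map(t => inThread(buildImage(t))).foreach(_.join())
  targets.map(t => inThread(compileCrate(t))).foreach(_.join())
}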

4.1 Docker Images for Cross Compiling

The cross compiler images are based on the rust or rust:slim image available on Rust’s Docker HUB. The cross compiler will construct a set of target specific Docker images which will each be used to cross compile to a specific target platform. The target specific images each contain a linker, Cargo settings for cross compiling, a cross compiled version of Rust’s and C’s standard libraries and potentially settings to dynamically link dependencies. The images can be further extended with other cross compiled C tools and settings.

Docker images for cross compiling are created for the following Rust target triples:


1. armv7-unknown-linux-musleabihf
2. armv7-unknown-linux-gnueabihf
3. i686-unknown-linux-musl
4. i686-unknown-linux-gnu
5. x86_64-unknown-linux-musl
6. x86_64-unknown-linux-gnu
7. x86_64-pc-windows-gnu

x86_64-unknown-linux-XXX is 64-bit Linux whilst i686-unknown-linux-XXX is 32-bit Linux. The gnu targets dynamically link the C standard library whilst the musl targets statically link it and therefore create fully standalone binaries.

# Use Rust's slim image as base image
FROM rust:slim

# Install nightly and set it as default
RUN rustup install nightly && \
    rustup default nightly

# Copy linker setting for cross compiler
RUN mkdir -p /.cargo/
ADD config /.cargo/

######## armv7-unknown-linux-musleabihf ###########
RUN apt-get update && \
    apt-get -y install -qq \
    gcc-arm-linux-gnueabihf

# Point Cargo to the linker
ENV CARGO_TARGET_ARMV7_UNKNOWN_LINUX_MUSLEABIHF_LINKER=arm-linux-gnueabihf-gcc \
    CC_armv7_unknown_linux_musleabihf=arm-linux-gnueabihf-gcc

RUN rustup target add armv7-unknown-linux-musleabihf
###################################################

Listing 4.1: A Dockerfile for building a cross compiler image with armv7-unknown-linux-musleabihf as the target platform

All images share some attributes, but some cross compiling scenarios require more configuration than others. One example that requires more setup is cross compiling for the armv7-unknown-linux-musleabihf platform, see listing 4.1. It requires a correct linker, Cargo settings, cross compiled versions of C's and Rust's standard libraries and some additional environment flags to use the correct linker. The generated Rust example tasks depend on the Rust version of kompics [43]. Kompics is a programming model for distributed systems. As of
