
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020

A comparison of compiler strategies for serverless functions written in Kotlin

KIM BJÖRK


A comparison of compiler strategies for serverless functions written in Kotlin

-

En jämförelse av kompilatorstrategier för serverless-funktioner skrivna i Kotlin

Kim Björk - kimbjork@kth.se
Royal Institute of Technology

Stockholm, Sweden

Supervisor: Cyrille Artho - artho@kth.se
Examiner: Pontus Johnson - pontusj@kth.se

January 2020


Abstract

Hosting options for software have become more adaptable over time, from requiring on-premises hardware to now being able to tailor a flexible hosting solution in a public cloud. One of the latest hosting options is the serverless architecture, which entails running software only when it is invoked.

Public cloud providers such as Amazon, Google and IBM provide serverless solutions, yet none of them provides official support for the popular language Kotlin. This may be one of the reasons why the performance of Kotlin in a serverless environment is, to our knowledge, relatively undocumented. This thesis investigates the performance of serverless functions written in Kotlin when run with different compiler strategies, with the purpose of contributing knowledge on this subject. One Just-in-Time compiler, the Hotspot Java Virtual Machine (JVM), is set against an Ahead-of-Time compiler, GraalVM.

A benchmark suite was constructed and two serverless functions were created for each benchmark: one run with the JVM and one run as a native image created by GraalVM. The benchmark tests are divided into two categories. The first consists of cold starts, which occur the first time a serverless function is invoked, or when it has not been invoked for a longer period of time, and which require certain start-up actions. The other category is warm starts, runs where the function has recently been invoked and the cold-start start-up actions are not needed.

The results showed faster total runtimes and lower memory requirements for the GraalVM-enabled functions during cold starts. During warm starts the GraalVM-enabled functions still required less memory, but the JVM functions showed large improvements over time, making their total runtimes more similar to those of their GraalVM-enabled counterparts.


Sammanfattning

The options for hosting software have become more numerous and more adaptable, from having to own all the hardware oneself to now being able to tailor a flexible solution in the cloud. Serverless is one of the latest of these solutions.

Providers of public cloud services, such as Amazon, Google and IBM, offer serverless solutions. However, none of these providers officially supports the popular programming language Kotlin. This may be one of the reasons why the language's performance in a serverless environment is, as far as we know, relatively unknown. The purpose of this thesis is to contribute knowledge in precisely this area.

Two different compiler strategies are compared: a JIT (Just-in-Time) compiler and an AOT (Ahead-of-Time) compiler. The JIT compiler used is the Hotspot Java Virtual Machine (JVM). The AOT compiler used is GraalVM.

For this work a benchmark suite was created, and for every test in the suite two serverless functions were implemented: one compiled for the JVM and one run as a ready-made binary created by GraalVM. The tests are divided into two categories. In the first, all tests go through cold starts, which occur the first time a function is invoked or when a long time has passed since it was last invoked. The second category covers tests that do not need to go through a cold start because the function has been invoked recently; the run can then skip certain steps that a cold start requires.

The results showed that for the tests in the cold-start category, the runtimes were shorter and the memory usage was lower for the functions compiled by GraalVM. In the second category, when the tests did not go through a cold start, the GraalVM functions still required less memory, but the JVM functions showed a large improvement in execution time. The total runtimes of the two compiler strategies were then more similar.


Contents

1 Introduction
1.1 Problem and Research Question
1.2 Contributions and Scope
1.3 Ethics and Sustainability
1.4 Outline

2 Background
2.1 Serverless
2.1.1 The Attributes of Serverless
2.1.2 Use Cases for Serverless Functions
2.2 Kotlin
2.3 Types of Compilers
2.3.1 Ahead-of-Time Compiler (AOT)
2.3.2 Just-In-Time Compiler (JIT)
2.4 The JVM Compiler
2.5 The GraalVM Compilation Infrastructure
2.6 Performing Benchmark Tests
2.7 Related Work
2.7.1 Solutions Similar to Serverless
2.7.2 GraalVM at Twitter
2.7.3 Benchmark Environment and the Cloud
2.8 Summary

3 Method
3.1 Metrics
3.1.1 Latency
3.1.2 Response time
3.1.3 Memory consumption
3.1.4 Execution time
3.2 Benchmarks
3.2.1 Real benchmarks
3.2.2 Complementary benchmarks
3.3 Environment and Setup
3.4 Sampling Strategy and Calculations
3.5 Summary

4 Result
4.1 Static metrics
4.2 Latency
4.3 Application Runtime
4.4 Response Time
4.5 Memory Consumption

5 Discussion
5.1 Latency
5.1.1 Cold start
5.1.2 Warm start
5.2 Application Runtime
5.2.1 Cold start
5.2.2 Warm start
5.3 Response Time
5.3.1 Cold start
5.3.2 Warm start
5.4 Memory Consumption
5.4.1 Cold start
5.4.2 Warm start
5.5 Threats to validity

6 Conclusion
6.1 Performance
6.1.1 Latency
6.1.2 Application Runtime
6.1.3 Response Time
6.1.4 Memory Consumption
6.2 Future work


Chapter 1

Introduction

Companies are constantly looking to digitize and are conceiving new use cases they want to explore every day. This is preferably done in an agile and modular way. The key factors making this possible are a reasonable cost, fast realization time and flexibility.

Hosting is an area that has followed this trend. As a response to this need for a more agile way of working, companies have moved from bare-metal on-premises hosting to cloud hosting. In a cloud adoption survey by IDG done in 2018, 73 % of companies stated that they had already adopted cloud technology and 17 % said they intended to do so within a year [1]. Another survey predicts that 83 % of enterprise workloads will be in the cloud [2].

By using cloud computing, companies can allocate just the amount of computation power they need to host their solutions. A cloud solution can also easily be scaled up or down when the need changes. Cloud computing also makes it possible for small-scale solutions to be hosted with great flexibility and be economically defensible.

The next step in this development toward more agile and modular hosting options could be claimed to be the serverless architecture. A serverless architecture lets customers run code without having to buy, rent or provision servers or virtual machines. In fact, a serverless architecture also relieves a client of everything that is connected to servers and more traditional hosting, such as maintenance, monitoring and everything infrastructure related. All that the clients need to concern themselves with is the actual code. These attributes enable a more fine-grained billing method, where clients get charged solely for the resources used. These resources are the time it takes, as well as the memory needed, to execute the serverless function. The vendors providing the serverless solution, such as Amazon (AWS Lambda [3]) and Google (Google Cloud Functions [4]), also provide automatic scaling for their serverless solutions, enabling a steady high availability. These are presumably among the top reasons why serverless is rapidly increasing in usage. According to Serverless, the usage of serverless functions has almost doubled among their respondents, from 45 % in 2017 to 82 % in 2018 [5]. Notable is also that 53.2 % stated that serverless technology is critical for their job.

During the growth of the serverless architecture, cloud providers have added support for more languages. AWS, for example, has gone from only supporting Node.js to now also supporting Python, Ruby, Java, Go and C# [6]. But one language that is still lacking official support, from any cloud provider that offers a serverless solution, is Kotlin.

Kotlin is a programming language developed by JetBrains and was first released in February 2016. Kotlin is mostly run on the JVM but can also be compiled into JavaScript or native code (utilizing LLVM) [7]. Despite being a newer language, it has already gained a lot of traction by being adopted by large companies and is currently used in production by Pinterest [8] and Uber [9], among others. Kotlin is also, as of 7 May 2019, Google's preferred language for Android app development [10] and has been among the top "most loved" languages according to Stack Overflow developer survey reports in recent years [11, 12]. One of the reasons for its popularity is Kotlin's interoperability with Java, meaning it is possible to continue to work on an already existing Java project using Kotlin. Other popular attributes are the readability of Kotlin as well as its null safety, facilitated by the language's ability to distinguish between non-null types and nullable types.

Seeing as Kotlin is such a widely used and favored language, it would be of interest to developers and companies to continue utilizing their knowledge of the language in more parts of their work, such as in a serverless context.

The rest of this chapter contains an introduction to this thesis. It explains the problem that brought about the subject of this thesis and specifies the research questions that are to be answered. Moreover, this chapter also includes the intended contributions as well as a section covering ethics and sustainability connected to this thesis. Concluding this chapter is a section describing the outline of this report.

1.1 Problem and Research Question

Kotlin is a popular language that is increasing in usage; however, it is not yet officially supported in a serverless solution provided by any public cloud provider. Since the serverless architecture is also being utilized more, companies might be looking to apply a language that they already know. It could also be that a company already has an application written in Kotlin that it would like to convert into a serverless function.

Since Kotlin is able to run on the JVM, it is possible to package a Kotlin application as a JAR file and run it as a serverless function. However, is that the optimal option? An application written in Kotlin could likewise be converted into a native image and run as a serverless function.

Since the payment plans of serverless solutions are based on resource usage, where every millisecond is counted and billed for, there is a possible cost saving to be had from optimizing the serverless function's execution.

The aim of this thesis is to find out how Kotlin performs in a serverless environment and what the best way is to run a serverless function written in Kotlin. From this statement two research questions can be extracted:

• What is the difference, if any, between running a serverless function written in Kotlin with a Just-in-Time compiler and running the same function as a binary?

• How do cold starts affect the performance of a serverless function written in Kotlin? Does it matter if the function is run with a JIT compiler or as a binary?

1.2 Contributions and Scope

Kotlin is not officially supported by any public cloud provider that offers serverless solutions. To the best of our knowledge, there exists scant knowledge on how Kotlin performs in a serverless environment. This thesis aims to contribute more knowledge on this subject. The expanded knowledge could serve as a foundation, should a company be looking into utilizing Kotlin for writing serverless functions. The work done for this thesis both gives information about the performance of serverless functions written in Kotlin in general and provides an understanding of the better way to run a serverless function written in Kotlin.

Only one public cloud provider will be tested, the reason being that only one public cloud provider, Amazon, offers the possibility of custom runtimes.

The runtimes that will be compared in this thesis are the JVM and GraalVM. The JVM will represent JIT compilers and GraalVM will represent AOT compilers.

1.3 Ethics and Sustainability

From a sustainability standpoint, the cloud and the serverless architecture are both environmentally defensible. To begin with, users of the cloud do not have to buy their own hardware. This means that they also do not have to estimate how much computation power they need, and therefore the risk of buying more than what they actually need is eliminated. Since computing power is shared in the cloud, the usage of the cloud's resources can be optimized: the same hardware used by one client one day can be used by another client another day. This entails power savings and less impact on the environment.

Due to serverless being more lightweight than other, more traditional hosting options, it is also more attainable. Clients can host their applications at a lower price, which means more have the opportunity to host applications.

There is a given ethics perspective to this thesis, as with any investigative report. It is of great importance that the work being performed is unbiased. One way to create confidence in this is to only use open source code and tools available to the general public, to ensure repeatability. Results will also be reported in their raw form to ensure readers have the opportunity to perform their own calculations or verify the ones presented in this thesis.

1.4 Outline

Chapter 2 contains necessary background information, such as explanations regarding the different compilers and an in-depth clarification of what a serverless architecture is and what it entails. This chapter also contains research concerning related work.

Chapter 3 incorporates a description of the methodology used to perform the work done for this thesis. It includes how and which benchmarks were chosen. It also gives an explanation of which metrics were used and why they were chosen.

The result of the work is presented in Chapter 4, and a discussion regarding the result can be found in Chapter 5. Finally, Chapter 6 contains the conclusions drawn from the result and the discussion; it also contains a section on possible future work.


Chapter 2

Background

This chapter contains useful background information about this thesis's main subjects: serverless, Kotlin, compilers and benchmarks. It also incorporates a section that presents and discusses related work. A summary of the chapter's key points concludes the chapter.

2.1 Serverless

Serverless is a concept that was first commercialized by Amazon's service AWS Lambda in 2014 [13]; the company was the first public cloud provider to offer serverless computing in the way it is known today. Since then serverless has gained a great deal of traction. Google [14], IBM [15] and Microsoft [16] now also provide their own serverless services.

Serverless refers to a programming model and an architecture aimed at executing a modest amount of code in a cloud environment where the users do not have any control over the hardware or software that runs the code. Despite the name, there are still servers executing the code; however, the servers have been abstracted away from the developers, to the point where the developers do not need to concern themselves with operational tasks associated with the server, e.g., maintenance, scalability and monitoring.

The provided function is executed in response to a trigger. A trigger is an event that can arise from various sources, e.g., a database change, a sensor capture, an API call or a scheduled job. After the trigger has been received, a container is instantiated, and the code provided by the developer is then executed inside that container.

2.1.1 The Attributes of Serverless

A serverless architecture does not imply the same infrastructure-related concerns and dilemmas, such as capacity, scaling and setup, as more traditional architectures do. This enables developers to achieve a much shorter time to market, a very important factor in the software development industry, where changes happen at a rapid pace and where market windows can open and close fast. The ability to launch code quickly also enables prototypes to be created and tested at a lower cost and therefore at a lower risk. Furthermore, this benefit implies that there can be a larger focus on the product itself, giving developers the opportunity to concentrate on application design and new features instead of spending time on the infrastructure.

Cloud providers that offer a serverless solution charge only for what the function utilizes, in terms of execution time and memory. The owner of the function is therefore only billed when the function is invoked. This entails, given the right use case, that the infrastructural cost can be reduced compared to a more traditional hosting option. Since there is no need to maintain a hosting solution, it can also lead to developers being able to take over the entire deployment chain, rendering the operations role more obsolete and, by extension, enabling an additional cost saving.

A serverless solution can bring many benefits to a project; however, it is not an appropriate solution for all projects. A function is executed only when triggered by an event; nothing is running when the function is not needed. The result of this is that when a function is invoked for the first time, or after a long time of no invocations, the cloud provider needs to go through more steps in order to start executing the function. An invocation of this type is called a cold start. During a cold start at AWS Lambda, the additional phases that need to be executed, before the invoked function starts executing, are: downloading the function, starting a new container and bootstrapping the runtime [17]. The outcome of this is that during a cold start the execution time, and by extension the response time, will be noticeably longer. A longer execution time also entails a greater cost.

To prevent cold starts, and spare end users a long response time, one option is to trigger the function periodically to keep it "warm". Amazon provides one such solution, CloudWatch [18], where it is possible to schedule triggers with a certain interval. There are also third-party tools serving as warmers [19, 20]. Some tools also analyze the usage of a serverless function and claim to predict when a trigger is needed, making the function warm upon real invocations [21]. Keeping a function warm may be an option for functions that are triggered fairly often or that are triggered with some predictable consistency. Otherwise there is a possibility that the warm-up triggers deplete the cost savings that a serverless solution otherwise would provide. In that case a more traditional hosting solution might be a better option, if response time is a decisive factor.

Another approach to reducing the impact of a cold start is to reduce the response time during a cold start. In this case there are two configurable parameters. The first parameter is the code. One way to optimize the code is to carefully choose the programming language; different languages have varying start-up times [22]. Furthermore, Amazon recommends a few design approaches that could help optimize the performance of a function. Amazon suggests avoiding large monolithic functions and instead dividing the code into smaller, more specialized functions. Only loading necessary dependencies, rather than entire libraries, is also good practice. Amazon also recommends using language optimization tools such as Browserify and Minify [17]. If an AWS Lambda function is reading from another service, Amazon emphasises the importance of only fetching what is actually needed. That way both runtime and memory usage can be reduced.

The second configurable parameter is the runtime, which is the focus of this thesis.

Resource limitations are, like cold starts, a restraint on serverless solutions. Cloud providers limit the resources a function can allocate. However, AWS Lambda has continuously increased these limits. In November 2017 Amazon doubled the memory capacity that a Lambda function can allocate, from 1.5 GB to 3 GB [23]. In October 2018 they tripled the time limitation, from 5 to 15 minutes per execution [24]. There is a possibility this trend will continue in the future and facilitate additional use cases for serverless functions.

In a serverless architecture a third party, the public cloud provider, has taken over a great deal of the responsibility related to hosting compared to a more traditional architecture. This entails that a great deal of trust has to be placed in the provider, especially since a serverless solution implies a vendor lock-in, where a migration can be problematic and require multiple adjustments, due to the code not only being tied to specific hardware but also to a specific data center.

Further trust also has to be put in the cloud provider on account of security. In a public cloud, where many users' arbitrary functions are running at the same time, security has to be a high priority in order to prevent interception of remote procedure calls and to ensure container security.

To fully take advantage of the benefits a serverless solution can bring, as well as avoid the consequential implications of its various drawbacks, it can be concluded that not just any use case can be applied in a favorable way.

2.1.2 Use Cases for Serverless Functions

For many applications, from a functionality perspective, a serverless architecture and more traditional architectures could be used interchangeably. Other factors, such as the solution's need for control over the infrastructure, cost and the application's expected workload, are the determining factors when considering using a serverless architecture.

From a cost perspective, serverless performs well when invocations occur in bursts. This is because a burst implies many invocations happening close to each other, time-wise, and therefore entails that only the first execution will have to go through a more expensive cold start. The other calls in the burst will thereafter use the same container and will therefore execute faster. When the burst has ended, a serverless architecture will let the infrastructure scale down to zero, during which time there is no charge.

Computation-heavy applications could, under the right circumstances, also be a good fit, since the cost of other infrastructure solutions grows in proportion to the computing power needed. However, to keep in consideration is that if a public cloud provider is used, limitations on computing exist, such as memory and time limits. This could mean that, from a performance perspective, a computation-heavy application might not be an appropriate use case for a serverless architecture.

From a developer perspective, serverless would be a good option in the cases where the drawbacks of lacking control over the infrastructure are outweighed by the fact that there is no need to maintain the infrastructure or worry about scaling.

Based on the characteristics and limitations of a serverless architecture, such as the basis for the cost and resource limitations, the general usage for a serverless solution has a few common characteristics: lightweight, scalable and single-purposed.

IoT and mobile device backend

When it comes to IoT and backend solutions for mobile devices, a serverless approach could be advantageous. It could offload burdens from a device with limited resources, such as computing power and battery time. Internet connection is also a limited resource on IoT and mobile devices. By using a serverless solution as an API aggregator, the required connection time could be reduced due to a reduced number of API calls.

There could also be a benefit from a developer perspective, since mobile applications are developed mostly by front-end skilled people, some of whom may therefore lack the experience and knowledge of developing back-end components. Creating a serverless back end simplifies both its creation and setup, as well as eliminates the need for maintenance. All this enables mobile apps and IoT devices that are fast and consistent in their performance, independent of unpredictable peak usage.

iRobot, the developers of the internet-connected Roomba vacuums, is one of the companies that are using a serverless architecture as an IoT backend [25].

Event triggered computing

Event-driven applications are ideal for a serverless architecture. AWS Lambda has many ways a user can trigger its functions. One of them is events that happen in their storage solution S3.

One company that is taking advantage of this solution is Netflix. Before a video can be streamed by end users, Netflix needs to encode and sort the video files. The process begins with a publisher uploading their video files to Netflix's S3 database. That triggers Lambda functions that handle splitting up the files and process them all in parallel. Thereafter Lambda aggregates, validates and tags the video files before the files are ultimately published [26].

Another company that is utilizing the same type of solution is Auger Labs, which focuses on custom-branded apps for artists. The intent of Auger Labs' founder and CEO has been to remain NoOps, where no configuration or managing of back-end infrastructure is needed. Among other use cases, Auger Labs is using its serverless architecture of choice, Google's Cloud Functions, in combination with Google's Firebase Storage. When an image is uploaded to their storage, a function is triggered to create thumbnails in order to enhance mobile app responsiveness. They also use Cloud Functions to send notifications via Slack to handle monitoring [27].

Scaling solutions

Since scaling is handled automatically, developers do not have to worry about how the infrastructure is going to perform in case an expected, or unexpected, burst of requests occurs. The service provider will make sure to start enough containers to support all the heavy traffic being generated.

Hosting an application with a low everyday usage, where heavy spikes occur very rarely, could lead to a high hosting cost, where the clients pay for computing power that is unused most of the time in order to maintain high availability even during the spikes.

One such use case is presented by Amazon, regarding Alameda County in California. Their problem was a huge spike in usage during elections. Their previous solution included on-premises servers that did not measure up. By moving to the cloud and utilizing AWS Lambda, the application could easily scale at a satisfactory rate. Alameda County could avoid buying more expensive hardware that would not be used the rest of the year, at the same time as they could serve all their users during their peak [28].

2.2 Kotlin

Kotlin is a statically typed programming language developed by JetBrains and was first released in February 2016. Kotlin is most commonly run on the JVM but can also be compiled into JavaScript or native code (utilizing LLVM) [7].

Despite being a newer language, it has already gained a lot of traction. Large companies such as Pinterest [8] and Uber [9] are currently using Kotlin in production. Kotlin is also, as of May 2019, Google's preferred language for Android app development [10] and has been among the top "most loved" languages according to Stack Overflow developer survey reports in recent years [11, 12].

The reasons behind Kotlin's success may be many. One contributor could be its interoperability with Java: it is possible to continue to work on an already existing Java project using Kotlin. Other praised features are Kotlin's readability as well as its null safety, facilitated by the language's ability to distinguish between non-null types and nullable types, as the example below illustrates.
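The following minimal sketch (our own illustration, not taken from any of the cited projects) shows how Kotlin separates non-null from nullable types:

    fun main() {
        val name: String = "Kotlin"   // non-null type; assigning null here is a compile-time error
        var nickname: String? = null  // nullable type, marked with '?'

        // Accessing nickname.length directly does not compile.
        // The safe-call operator yields null instead of throwing:
        println(nickname?.length)     // prints "null"

        // The Elvis operator supplies a fallback for the null case:
        val length = nickname?.length ?: 0
        println("$name has a nickname of length $length")
    }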

2.3 Types of Compilers

A compiler is a program that translates code written in a higher-level language to a lower-level language in order to make the code readable and executable by a computer. A compiler's type is defined by when this translation is made. An Ahead-of-Time compiler performs the conversion before the code is run, while a Just-in-Time compiler translates the high-level code at runtime.

2.3.1 Ahead-of-Time Compiler (AOT)

An Ahead-of-Time compiler does precisely what the name suggests: it compiles code ahead of time, i.e., before runtime. When an application is compiled with an AOT compiler, no more optimizations are done after the compilation phase.

There are both benefits and drawbacks to an AOT compiler. One benefit is that the runtime overhead is smaller, since there are no optimizations during runtime. It is therefore also possible that an AOT-compiled application is less demanding when it comes to computer resources such as RAM. The drawback is that the compiler knows nothing about the workload of the application or how it will be used. There is therefore a risk that the compiler spends time on optimizing, for example, methods that are rarely used.

2.3.2 Just-In-Time Compiler (JIT)

A Just-in-Time compiler offers a dynamic compilation process, meaning blocks of code are translated into native code during runtime rather than prior to execution as with an AOT compiler [29].

A JIT compiler optimizes code during runtime using profiling, meaning that the program is analysed to determine which optimizations would be profitable to carry out. A JIT compiler will therefore perform well-informed optimizations and will not waste time on compiling parts of an application that won't lead to an increase in performance. Examples of metrics that a JIT profiler is based on are method invocation count and loop detection [30]. A high method invocation count means that the method is a good candidate for compilation into native code to speed up execution. Loops can be optimized in many ways; one favorable way is to unroll a loop. Unrolling a loop entails an increase in the operations performed by each iteration of the loop: steps that would be performed in subsequent iterations are merged into earlier iterations.
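To make the unrolling idea concrete, the sketch below shows the transformation manually at the source level; a JIT compiler such as Hotspot applies it internally to hot compiled code, not to the source, so this is only a conceptual illustration:

    // Original loop: one addition and one loop-control check per element.
    fun sum(xs: IntArray): Int {
        var total = 0
        for (x in xs) total += x
        return total
    }

    // Unrolled by a factor of four: each iteration does the work of four,
    // reducing the relative cost of the loop-control overhead.
    fun sumUnrolled(xs: IntArray): Int {
        var total = 0
        var i = 0
        while (i + 4 <= xs.size) {
            total += xs[i] + xs[i + 1] + xs[i + 2] + xs[i + 3]
            i += 4
        }
        while (i < xs.size) {   // leftover elements
            total += xs[i]
            i++
        }
        return total
    }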

The drawback to these specialized optimisations is the fact that execution time during the first runs will be longer. Performance will, however, improve over time as more parts of the code get translated into native code and the compiler gets more execution history to base its optimizations on.

2.4 The JVM Compiler

Even though all CPUs are very similar, e.g., they have the same functionalities, such as performing calculations and controlling memory access, programs that are designed for one CPU cannot be executed on another. The developers of the Java programming language wanted a solution to this problem. They decided to design an abstraction of a CPU, a virtual computer that could run all programs written for it on any system; the result was the Just-in-Time compiler: the Java Virtual Machine (JVM). This idea was the basis for the slogan created for Java by its developer, Sun Microsystems: write once, run anywhere.

Another benefit facilitated by the JVM's abstraction of a CPU is the JVM's abstract view of the memory. Since the JVM treats the memory as a collection of objects, it has more control over which programs are allowed to access which parts of the memory. That way the JVM can prevent harmful programs from accessing sensitive memory.

The JVM also includes an algorithm called verification, which contains rules every program has to follow and which aims to detect malicious code and prevent it from running [31]. This algorithm is one of the three cornerstones of the JVM, stated in the Java Virtual Machine Specification [32]:

• An algorithm for identifying programs that cannot compromise the in- tegrity of the JVM. This algorithm is called verification.

• A set of instructions and a definition of the meanings of those instructions. These instructions are called bytecodes.

• A binary format called the class file format (CFF), which is used to con- vey bytecodes and related class infrastructure in a platform-independent manner.

The JVM was developed primarily for the Java programming language, but it has the ability to execute any language that can be compiled into bytecode. The JVM, in fact, knows nothing of the Java programming language, only of the binary format CFF, which is the result of compiled Java code.

Some of the more well-known languages that can be executed by the JVM, aside from Java, are Kotlin, Scala and Groovy [33]. These languages, and all others that can be executed on the JVM, also get the JVM's benefits, such as its debugging features and its garbage collection, which prevents memory leaks.

The most used JVM is the Java Hotspot Performance Engine, which is maintained and distributed by Oracle and is included in their JDK and JRE. The Hotspot JVM continuously analyses the program for code that is executed repeatedly, so-called hot spots, and aims to optimize these blocks, aspiring to facilitate a high-performance execution.

The Hotspot JVM has two different flavors, the Client and the Server VM. The two modes run different compilers that are individually tuned to benefit the different use cases and characteristics of a server and a client application. Compilation inlining policy and heap defaults are examples of these differences.

Since the characteristics of a server include a long run time, the Server VM aims to optimize running speed. This comes at the cost of a slower start-up time and a larger runtime memory footprint. On the opposite side, the Client VM does not try to execute some of the more complex optimizations that the Server VM performs. This enables a faster start-up time and is not as memory demanding [34].


2.5 The GraalVM Compilation Infrastructure

GraalVM is a compilation infrastructure that started out as a research project at Oracle and was released as a production-ready beta in May 2019 [35].

GraalVM contains the Graal compiler, a dynamic just-in-time (JIT) compiler that utilizes novel code analysis and optimizations. The compiler transforms bytecode into machine code. GraalVM is then dependent on a JVM to install the machine code in. The JVM that is used also needs to support the JVM Compiler Interface in order for the Graal compiler to interact with the JVM. One JVM that does this is the Java Hotspot VM, which is included in the GraalVM Enterprise Edition.

Before the Graal compiler translates the bytecode into machine code, it is converted into an intermediate representation, Graal IR [36]. In this representation, optimizations are made.

One goal of GraalVM is to enable performance advantages for JVM-based languages, such as minimizing memory footprint through its ability to avoid costly object allocations. This is done by a new type of Escape Analysis that, instead of using an all-or-nothing approach, uses Partial Escape Analysis [37]. A more traditional Escape Analysis would check for all objects that are accessible outside their allocating method or thread and move these objects to the heap in order to make them accessible in other contexts. Partial Escape Analysis, however, is a flow-sensitive Escape Analysis that takes into account whether the object escapes only rarely, for example in one single unlikely branch. Partial Escape Analysis can therefore facilitate optimizations in cases where a traditional Escape Analysis cannot, enabling memory savings. In an evaluation done in a collaboration between Oracle Labs and the Johannes Kepler University, a memory allocation reduction of up to 58.5 % and a performance increase of 33 % were observed [37]. Notably, a performance decrease of 2.1 % was also seen on one particular benchmark, indicating, not surprisingly, that Partial Escape Analysis is not the best solution in every case. Overall, however, all other benchmarks had an increase in performance and a decrease in memory allocation.
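The following Kotlin fragment sketches the kind of code Partial Escape Analysis targets (a constructed example, not taken from the cited evaluation): the allocated object escapes only on one rare branch, so on the common path its fields can stay in registers and the heap allocation can be avoided:

    class Position(val x: Int, val y: Int)

    val trace = mutableListOf<Position>()

    fun step(x: Int, y: Int, debug: Boolean): Int {
        // p escapes only when debug is true. A traditional escape analysis
        // must heap-allocate it unconditionally; Partial Escape Analysis can
        // materialize the object only inside the unlikely branch.
        val p = Position(x, y)
        if (debug) {
            trace.add(p)       // the only point where p escapes
        }
        return p.x + p.y
    }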

Another goal of GraalVM is to reduce the start-up time of JVM-based applications, through a GraalVM feature that creates native images by performing a full ahead-of-time (AOT) compilation. The result is a native binary that contains the whole program and is ready for immediate execution. Through this, Graal states that the program will not only have a faster startup time but also a lower runtime memory overhead when compared to a Java VM [38].

With the help of the language implementation framework Truffle, GraalVM is able to execute more than JVM-based languages: JavaScript, Python and Ruby can also be run with the GraalVM compilation infrastructure [39]. LLVM-based languages such as C and C++ can also be executed by GraalVM thanks to Sulong [40]. Since the GraalVM ecosystem is language-agnostic, developers can create cross-language implementations where they have the ability to choose languages based on what is suitable for each component.


2.6 Performing Benchmark Tests

In the field of benchmarks, much research has been done and several open source benchmark suites have been constructed. There are multiple suites targeting the Java Virtual Machine, e.g., SPECjvm2008 [41], DaCapo [42] and Renaissance [43]. DaCapo was developed to expand the SPECjvm2008 suite by targeting more modern functions [44], and the Renaissance suite focuses on benchmarks using parallel programming and concurrent primitives [45].

Looking at the thought processes behind building these suites, certain common requirements can be identified. Only open source benchmarks and libraries have been selected; one of the benefits of this is that it enables inspection of the code and the workload. Diversity is also a common attribute these benchmark suites strive for, a good feature in principle but one that is harder to put into practice. The Renaissance suite's interpretation of, and approach to achieving, diversity is to include different concurrency-related features of the JVM. Object orientation is also mentioned as an important factor in the Renaissance suite, since it exercises the JVM parts that are responsible for efficient execution of code patterns commonly associated with object-oriented features, e.g., frequent object allocation and virtual dispatch. The developers of the DaCapo suite strived to achieve diversity by maximizing coverage of application domains and application behavior.

Another type of benchmark suite is The Computer Language Benchmarks Game [46]. The aim of the suite is to provide a number of algorithms written in different languages; Kotlin, however, is not one of them. The suite is used, for example, in an evaluation of various JVM languages made by Li et al. [47], where the authors categorized the benchmarks depending on whether the program mostly manipulated integers, floating-point numbers, pointers or strings. The Computer Language Benchmarks Game has also been used by Schwermer [48]. In his paper a subset of benchmarks was chosen, one benchmark for each type of manipulation focus, i.e., integers, floating-point numbers, pointers and strings. The chosen benchmarks were translated to Kotlin to be compared with the Java implementations provided by The Computer Language Benchmarks Game. The Kotlin-translated suite serves as a complementary part of the benchmark suite used in this thesis.

When creating a benchmark suite, it would be preferable if there existed a tool like the one described by Dyer et al. [49], which is under construction: a tool where it would be possible to search for open source benchmarks given certain requirements and where researchers could contribute their own benchmarks, the vision being faster and more transparent research.

Traditionally, performance tests are run in a dedicated environment where as much as possible is done to minimize external impact on the result. Factors such as hardware configurations are kept static, all background services are turned off and the machine should be single-tenant. None of this can be found in a serverless solution hosted in a public cloud. Configurations are unknown and made by the cloud provider, and the machines hosting the functions are exclusively multi-tenant. This entails an unpredictable environment where there will always be uncertainties. The benefit, however, of performing tests in the public cloud is that it is easy to set up and comes at a low cost, where a more traditional approach would mean a higher cost and an environment that requires considerable effort to maintain.

A study by Laaber et al. [50] investigates the effect of running microbenchmarks in the cloud. The focus of their study consisted of measuring to what extent slowdowns are detected in a public cloud environment, where the tests were run on server instances hosted by different public cloud providers.

One of the problems the authors address is that the instances might be upgraded by the provider between test executions, which can result in inexplicable differences in the results. However, if tests are done during a short period of time, to avoid such changes by the provider, the results will only represent a specific snapshot of the public cloud. It can then be argued that tests run over a longer period, e.g., a year, would result in a better representation. However, this large amount of time is, in many cases, an unobtainable asset.

The authors also mention the difference between private and public cloud testing and emphasize that the two cannot be compared. This is due to the possibility of noisy neighbours in a public cloud but also due to hardware heterogeneity [51], where different hardware configurations are used for instances of the same type.

Furthermore, the authors acknowledge that even though it is possible to make reasonable model assumptions about the underlying software and hardware in a public cloud, based on literature and information published by the providers, when experiments are done in the public cloud the provider should always be considered a black box that cannot be controlled.

The paper concludes that slowdowns below 10 % can be reliably detected 77–83 % of the time, and the authors therefore consider microbenchmark experiments possible on instances hosted in a public cloud. They also concluded that there were no big differences between instance types for the same provider.

According to Alexandrov et al. [52], there are four key factors in building a good benchmark suite and running the benchmarks in the cloud: (1) meaningful metrics, (2) workload design, (3) workload implementation and (4) creating trust.

When considering meaningful metrics, the example of runtime is given as a natural and undebatable metric. Furthermore, cost is discussed as an interesting factor, but one mostly relevant in research that is meant as support for business decisions. Although the cloud can be seen as infinitely scalable, this is only an illusion, and therefore throughput can be seen as a valuable metric.

The workload has to be designed with the metrics in mind, where the application should be modeled as a real-world scenario with a plausible workload.

One of the important factors mentioned, when it comes to workload implementation, is workload generation. The recommendation is that this is done with pseudo-random number generators to ensure repeatability, as sketched below. A pseudo-random number generator also has the benefit of being much more accessible than gathering the same amount of real data.
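A minimal sketch of that recommendation in Kotlin: fixing the generator's seed makes every run produce identical input data, so a benchmark can be repeated exactly (the seed and the value range here are arbitrary):

    import kotlin.random.Random

    // Repeatable workload generation: the same seed always yields the same data.
    fun generateWorkload(size: Int, seed: Long = 42L): IntArray {
        val rng = Random(seed)
        return IntArray(size) { rng.nextInt(0, 1_000_000) }
    }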

Creating trust is considered especially important when it comes to running benchmark tests in the public cloud, the reason being the public cloud's black-box property. As a client of a public cloud, one can never be certain about the underlying software or hardware. To create trust, the authors recommend executing the previously mentioned aspects well, along with choosing a representative benchmark scenario.

2.7 Related Work

In this section, previous work that relates to the work done for this thesis is discussed. It starts with a discussion of solutions that have similar attributes to the serverless architecture, followed by a section about how GraalVM is used at Twitter. To conclude the topic of related work, there is a section covering how benchmark environments affect the results and what is thought of running benchmarks in the cloud.

2.7.1 Solutions Similar to Serverless

The idea of starting a process only once it is called upon is not unique to the serverless architecture. Super-servers, or service dispatchers, are based on the same principle. A super-server is a type of daemon whose job is to start other services when needed. Examples of super-servers are launchd, systemd and inetd.

inetd is an internet service daemon in Unix systems that was first introduced in 4.3BSD, 1986 [53]. The inetd super-server listens on certain predefined ports, and when a connection is made on one of them, inetd starts the corresponding service that will handle the request. These ports support the protocols TCP and UDP, and examples of services that inetd can call are FTP and telnet. For services that do not expect high loads, this solution is a favorable option, since such services do not have to run continuously, resulting in a reduced system load. Another benefit is that services connected to inetd do not have to provide any network code, since inetd links the socket directly to the service's standard input, standard output and standard error.

To create an inetd service, developers only need to provide the code, specify where the file containing the code will be located and state which port should trigger the service.

Similarly to the serverless architecture, not needing to care about servers is also a principle of agent-based application mobility, where an application is wrapped by a mobile agent that has full control over the application. The mobility of the agent lets the application migrate from one host to another, where the application can resume its execution [54]. Instead of abstracting away the server from the developer, as in the serverless solution, this approach lets the developer implement services and avoid servers altogether.

Although this approach can bring many benefits, such as reduced network load and latency due to local execution, agent-based application mobility also has its drawbacks. One of the drawbacks is the high complexity of developing the application. The application needs to be delicately designed in order to be device-independent and able to be migrated between devices [55]. The solution many apply to this problem is to use an underlying infrastructure or middleware [56, 57].

2.7.2 GraalVM at Twitter

Despite GraalVM only having a beta release, Twitter is already using it in production. Their purpose in adopting GraalVM is to save money through the decrease in computing power needed. Another motivation was that the Hotspot Server VM is old and complex, while GraalVM is easier to understand [58].

By switching to GraalVM, the VM team at Twitter saw a decrease of 11 % in CPU time used by their tweet service, compared to running the Hotspot Server VM. Twitter also discovered that they could decrease CPU time further by tuning some of GraalVM's parameters. One of these parameters was TrivialInliningSize: graphs with fewer nodes than the number represented by this parameter are always inlined. With their machine-learning-based tuner, Autotuner, which automatically adjusts these parameters, CPU time dropped another 6 % [59].

To take into consideration is that the Hotspot JVM is tuned for the Java language, and Twitter mainly uses Scala in its services. The same code base written in Java might not have produced the same dramatic improvements.

2.7.3 Benchmark Environment and the Cloud

When analysing the results of this thesis, it is important to take impacting error sources into consideration. One such error source is the hardware the functions run on. Since there will be no indication of which CPU is used for any execution, nothing can be said about its impact on performance.

In a runtime comparison made by Hitoshi Oi, three different processors were used [60]. All three were made by Intel and based on the NetBurst microarchitecture, but they had different clock speeds and cache hierarchies. Despite being from the same manufacturer and based on the same architecture, varied performance could still be seen in almost all use cases. In AWS, no guarantee is given that any feature of the different processors used will be the same. This fact, and the study made by Hitoshi Oi, give an indication of the possible impact this factor can have on the results.

This is further emphasised in a conference talk in which John Chapin shares his investigation into AWS performance [61]. Among other topics, he speaks about the difference in performance in relation to how much memory the user specifies as the maximum. Since AWS Lambda allocates CPU in proportion to the maximum memory usage specified, it would be logical that a lower amount of allocated memory always leads to inferior performance. However, in Chapin's experiments he found that this is not always the case: in some instances he got almost the same performance independent of the available memory allocation. He draws the conclusion that this is connected to the randomness of the container distribution; some containers may be placed on less busy servers and can therefore deliver better performance. This emphasises the importance of rigorous performance testing, where the testing is well distributed, time-wise, to get the best possible representation of the overall performance of the given function in the public cloud.

A comparison of public cloud providers by Hyungro Lee et al. can give an indication of how AWS will perform when testing its throughput [62].

Martin Maas et al. suggest that the runtimes used in the serverless context should be rethought [63], based on the fact that most runtimes today are not optimized for modern cloud-related use cases. They envision a generic managed runtime framework that supports different languages, front ends and back ends, for various CPU instruction sets, FPGAs, GPUs and other accelerators. Graal/Truffle is mentioned as a good example of a framework that can provide high performance and maintainability through its ability to execute several different languages.

2.8 Summary

Serverless refers to a programming model and an architecture aimed at executing a modest amount of code in a cloud environment where the users do not have any control over the hardware or software that runs the code. The servers are abstracted away from the developer, and the only thing the developer needs to be concerned about is the code. Every task related to maintaining servers is taken care of by the cloud provider. Therefore, solutions that require scaling, for example, are a good fit for the serverless architecture.

The provided code only runs when the serverless function is invoked, meaning that nothing connected to the function is running when it is not invoked. This also entails that the first time the function is invoked, and every time it has not been invoked for a while, start-up actions, such as starting a container, need to be performed. An execution containing these start-up actions is said to have gone through a cold start; otherwise it is a so-called warm start.

There are two types of compilers compared in this thesis: a Just-in-Time (JIT) compiler and an Ahead-of-Time (AOT) compiler. An AOT compiler compiles code before it is run and creates an executable file. The AOT compiler used in this thesis is GraalVM, which started out as a research project at Oracle and was released as a production-ready beta in May 2019. A JIT compiler compiles the code during runtime. The JIT compiler used in this thesis is the Hotspot JVM, which is maintained and distributed by Oracle.

When running benchmarks, dedicated and isolated environments are usually used to minimize external impact on the results. The public cloud is eminently unlike such an environment. One reason is that the hardware and its configurations are hidden from the user. The fact that the public cloud is shared also opens the possibility of a neighbour having an effect on the performance of one's function. These factors have to be taken into account when analysing the results.


Chapter 3

Method

A benchmark suite was created for this thesis. For every benchmark, two corresponding serverless functions were implemented in Amazon Web Services' serverless solution Lambda: one that runs with the Hotspot JVM provided by Amazon and one that runs as a native image created with the tool GraalVM CE. These functions were then invoked through AWS's command line interface. The commands were run locally to simulate a more real-world scenario where network latency can impact the result. All programs return a JSON containing information about the execution.

We grouped the tests into two categories: one containing the executions that went through a cold start, and one containing the executions that reused already started containers, i.e., warm starts.

The arithmetic mean of the different metrics was calculated, along with a two-sided confidence interval, in order to analyse the results fairly.
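A sketch of those calculations, assuming a normal approximation with z = 1.96 for a two-sided 95 % interval (the exact interval procedure used in the thesis may differ):

    import kotlin.math.sqrt

    // Arithmetic mean and the half-width of a two-sided 95 % confidence
    // interval under the normal approximation; the interval is mean ± half-width.
    fun meanWithConfidence(samples: DoubleArray): Pair<Double, Double> {
        val n = samples.size
        val mean = samples.average()
        var squaredDeviations = 0.0
        for (x in samples) squaredDeviations += (x - mean) * (x - mean)
        val variance = squaredDeviations / (n - 1)
        val halfWidth = 1.96 * sqrt(variance / n)
        return mean to halfWidth
    }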

This chapter describes in more detail how the work for this thesis was carried out and the motivations behind the choices made. The last section of the chapter summarizes its key points.

3.1 Metrics

The metrics focused on in this thesis are mainly dynamic metrics [64], meaning metrics that are to a higher degree based on the execution of code rather than on the code itself [65]. This is because the interest of this thesis lies in the performance of code given different runtimes. Which applications are used and which techniques were used to develop them, factors that are connected to static metrics, are secondary. Some static metrics will, however, be collected.

The static metrics used in this thesis were chosen with the purpose of giving the reader an indication of the overall size of the different benchmarks. Four static metrics will be documented. Two of them are the sizes of the JVM and the GraalVM functions, collected from the Amazon Console. The other two are lines of code and the number of Kotlin files.

The dynamic metrics chosen for this thesis are based on what would be of interest to a developer who is considering using Kotlin in a serverless context. We hypothesised that the factors a developer would be most interested in are a comparison of performance as well as cost.

When the performance of software is measured, one of the most interesting elements to attain is knowledge about how many resources are being used. Since cost, in this case, is exclusively based on the resources used, there is no need to add specific cost-related metrics. The second element of interest is what is causing these resource allocations. An example of a factor affecting the performance of a program is garbage collection.

In this thesis the public cloud is used, which can be seen as a black box, since users cannot be certain what environment their code is executed in. This entails that there is a large number of factors that can affect the performance of the functions, such as the hardware configuration. A choice has therefore been made to focus only on measuring the resources that are being used and not to measure the factors that are believed to cause these performance changes.

The resources that will be measured are latency, execution time, response time and memory consumption. Figure 3.1 illustrates the dynamic metrics that are measured in time.

Figure 3.1: An illustration of the metrics measured in time

3.1.1 Latency

Latency is measured by subtracting the start time recorded by the locally executed script invoking the Lambda function from the start time recorded by the function itself, which is returned in the response JSON.
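As a minimal sketch, assuming both timestamps are available as epoch milliseconds (the names below are illustrative, not taken from the actual measurement scripts), the computation is simply:

```kotlin
// Sketch of the latency computation; both values are epoch milliseconds.
// invocationStart is recorded locally by the invoking script,
// functionStart is the startTime field returned in the response JSON.
fun latencyMillis(invocationStart: Long, functionStart: Long): Long =
    functionStart - invocationStart
```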

Latency can be important in cases where data becomes stale fast and therefore needs to be processed quickly. One example is a navigation system that gets location data from a car and needs to update its directions accordingly.

3.1.2 Response time

Response time is measured by subtracting the start time recorded by the invocation script from the end time recorded when the response is returned from Amazon. Response time is a meaningful metric in multiple use cases.

One example is user interfaces. A study from 1968 [66] and a complementary study from 1991 [67] summarize three types of limits for human-computer interaction. For a user to experience that a system is reacting instantaneously, the requested result should be delivered within 0.1 seconds.

To ensure a user's continuous, uninterrupted thought process, the response time should not exceed 1.0 second. If the response time surpasses a limit of 10 seconds, users will want to switch to another task during the execution.

Even though these studies were written several decades ago, there is no indication that users have raised their tolerance. With faster internet speeds and more powerful computers, the opposite is presumably closer to the truth.

3.1.3 Memory consumption

Memory consumption is another essential factor. As always in software development, developers and operators are looking to optimize execution. One simple reason is that the more memory an application uses, the more expensive it is to run. If a developer is running an application on an on-premises system, the effect might not be as palpable until the need to buy more RAM arises. In a serverless context, however, optimization of memory usage can easily lead to a visible cost reduction.

The memory consumption of a function execution is recorded by AWS Lambda and will be retrieved from its logs.

3.1.4 Execution time

The response time might be the most interesting time metric in this work. However, it is also of interest to see how much of the total time consists of actual application execution time and how that time changes given different circumstances. Execution time is also unaffected by external factors, such as the internet connection, and is only a result of the characteristics of AWS Lambda. This makes it a good measure of the performance of AWS Lambda.

3.2 Benchmarks

Every benchmark has two different Lambda functions: one that is run with the JIT compiler Hotspot JVM and one that is run as a native image created with GraalVM.


All benchmarks are open source and have a separate repository on GitHub [68].

Each benchmark has a main class that contains a main function and a function called handler. The handler function is used as the entry point for the serverless functions using the JVM, and the main function is used for the serverless functions compiled with GraalVM.
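A hypothetical skeleton of such a main class is sketched below; the handler signature is simplified (real AWS Lambda Java handlers also receive a context object), and runBenchmark stands in for the actual workload:

```kotlin
// Hypothetical skeleton of a benchmark's main class. The real classes also
// record timing information and build the response JSON.
class Benchmark {
    // Entry point for the Lambda functions running on the JVM
    // (signature simplified).
    fun handler(input: Map<String, Any>): String = runBenchmark()

    companion object {
        // Entry point for the native-image functions created with GraalVM,
        // invoked from the custom runtime's bootstrap file.
        @JvmStatic
        fun main(args: Array<String>) {
            println(runBenchmark())
        }
    }
}

// Placeholder for the actual benchmark workload.
private fun runBenchmark(): String = "{}"
```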

3.2.1 Real benchmarks

Real tests are to be preferred when performing benchmarks. These tests are real in the sense that they are actual repositories acquired from GitHub. They were not originally intended as serverless applications, and a discussion could be had whether any of them would fit in a serverless context. Nevertheless, they still represent real workloads of real applications and will therefore presumably be a better indicator than artificial applications of the performance of GraalVM and the Hotspot JVM, respectively, in a serverless environment.

Each of these real benchmarks contains tests written using JUnit. To simulate workload, some or all of the tests in the repository are invoked when running the benchmark. After invoking an AWS Lambda function, a response is sent back in the form of a JSON containing the following fields (a corresponding Kotlin data class is sketched after the list):

• coldStart : Boolean - Indicates if the run has gone through a cold start or not.

• startTime : Long - The time when the function's code starts to execute, represented in milliseconds since the UNIX epoch (January 1, 1970 00:00:00 UTC).

• runTime : Long - Runtime of the application in milliseconds.

• wasSuccess : Boolean - Indicates if the tests were a success; used for debugging purposes.

• failures : List - Contains the reasons why tests failed, if there were failures; otherwise the list is empty. Used only for debugging purposes.
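The fields map naturally onto a Kotlin data class. The sketch below is illustrative; the actual serialization code in the benchmarks may differ:

```kotlin
// Data class mirroring the response JSON (names taken from the list above).
data class BenchmarkResponse(
    val coldStart: Boolean,      // whether the run went through a cold start
    val startTime: Long,         // epoch milliseconds when execution began
    val runTime: Long,           // application runtime in milliseconds
    val wasSuccess: Boolean,     // whether all JUnit tests passed
    val failures: List<String>   // failure reasons; empty when all tests pass
)
```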

To determine if a run was a cold start or not, a search is made for a specific file in the /tmp folder (where Amazon lets users write files with a combined size of up to 512 MB). If the file is not there, it is created. Since the file is removed when the container is, the application will know if the container is new (the file does not exist), meaning a cold start, or if it has been used before (the file exists), meaning a warm start.
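The detection logic can be sketched as follows (the marker file name is illustrative):

```kotlin
import java.io.File

// Cold-start detection as described above: /tmp persists for the lifetime
// of the container, so the marker file is absent exactly when the container
// is new.
fun isColdStart(): Boolean {
    val marker = File("/tmp/container-marker")
    if (marker.exists()) return false  // container reused: warm start
    marker.createNewFile()             // remember this container
    return true                        // new container: cold start
}
```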

Kakomu

Kakomu is a repository that contains a Go simulator [69]. The repository enables a user to play a game of Go against a bot, but it can also simulate a game between two bots. The 18 tests used for this thesis focus on the game model, ensuring that games are evaluated correctly.


State machine

The state machine benchmark is taken from a repository containing a Kotlin DSL for finite state machines, developed by Tinder [70]. This benchmark contains 13 tests.

Math functionalities

This repository provides discrete math functionalities as extension functions [71]. Some examples of its capabilities are permutations and combinations of sets, the factorial function and iterable multiplication. The benchmark implementation based on this repository runs 55 individual tests that ensure all computations are performed correctly; most of them are mathematical equations and set operations.

3.2.2 Complementary benchmarks

Finding suitable real benchmarks proved to be challenging; therefore, the benchmark suite is supplemented with additional artificial benchmarks. One of them is a simple "Hello world" example, whose only purpose is to return the basic information the other benchmarks do: the start time of the function and whether it went through a cold start or not.

The other complementary benchmarks are algorithms from the benchmark suite The Computer Language Benchmarks Game [46], implemented in Kotlin for the purpose of the paper "Performance Evaluation of Kotlin and Java on Android Runtime" [48]. These benchmarks were all categorized by Li et al. [47] according to the data type they mainly manipulate. The categorizations can be seen in Table 3.1.

Benchmark            Data type
Fasta                Pointer
N-body               Floating-point
Fannkuch-Redux       Integer
Reverse-Complement   String

Table 3.1: Mainly manipulated data types

The benchmarks that originate from The Computer Language Benchmarks Game also return a JSON, but since no JUnit tests are run, the fields wasSuccess and failures are omitted; otherwise the fields are the same as in the real tests, i.e., coldStart, startTime and runTime.

Fasta

The Fasta benchmark is categorised as a pointer-intensive algorithm that also produces a large amount of output. Running the algorithm results in three generated DNA sequences; the length of the sequences is decided by an input parameter represented as an integer. The length used in this thesis is 5 × 10⁶.

The generated output is written to a file and consists of three parts. The first part of the DNA sequence is a repetition of a predefined sequence, and the last two are generated pseudo-randomly using a seeded random generator.

After the file has been generated, it is removed so as not to affect the following tests, since some are run in sequence.

Reverse-Complement

The Reverse-Complement benchmark takes its input from a file containing the output from a run of the Fasta application, which in turn had an input of 10⁶.

The aim is for the Reverse-Complement program to calculate the complementing DNA strands that fit the three DNA sequences the input file contains. The complement is calculated with the help of a predefined translation table. Since the input file being processed consists of strings, this benchmark is categorized as mainly handling strings. Another attribute of this benchmark to keep in mind is that it is also both input and output heavy.
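A minimal sketch of such a translation-table lookup is shown below; the benchmark's actual table also covers IUPAC ambiguity codes, whereas this version handles only the four standard bases:

```kotlin
// Complementing a DNA strand with a predefined translation table.
val complement = mapOf('A' to 'T', 'T' to 'A', 'G' to 'C', 'C' to 'G')

fun reverseComplement(strand: String): String =
    strand.reversed()
        .map { complement.getValue(it) }  // throws on unknown bases
        .joinToString("")
```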

N-body

The N-body benchmark simulates the movements of planets and for the most part manipulates floating-point values. It requires an integer as input, representing the number of simulation steps to be taken. The input used for this benchmark in this thesis is 10⁶.

Fannkuch-Redux

The Fannkuch-Redux benchmark permutes a set of numbers S = {1, ..., n}, where n is the input value, in this case 10. In a permutation P of the set S, the first k elements of P are reversed, where k is the first element in P. This is repeated until the first element of the permuted list is a 1, and it is done for all n-length permutations P of S.

Since all the elements in the list are integers, this benchmark is classified as an application that mostly handles integers.
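As a minimal sketch under the description above, the flip-counting loop at the core of the benchmark could look like this:

```kotlin
// Counts the flips for one permutation: repeatedly reverse the first k
// elements, where k is the current first element, until a 1 is in front.
fun countFlips(permutation: IntArray): Int {
    val p = permutation.copyOf()
    var flips = 0
    while (p[0] != 1) {
        val k = p[0]
        var i = 0
        var j = k - 1
        while (i < j) {  // reverse p[0..k-1] in place
            val t = p[i]; p[i] = p[j]; p[j] = t
            i++; j--
        }
        flips++
    }
    return flips
}
```

For example, countFlips(intArrayOf(3, 1, 2)) performs two flips: [3, 1, 2] → [2, 1, 3] → [1, 2, 3].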

3.3 Environment and Setup

For this thesis, Amazon Web Services is chosen as the public cloud provider, on account of Amazon being the only provider that offers customers the possibility to provide a custom runtime. AWS's serverless solution is called Lambda. A user of Lambda can create and manipulate Lambda functions using a CLI provided by Amazon, which we used for both creation and invocation in this thesis.

A Lambda function that should run on the JVM requires a so-called uber JAR: a JAR file that contains not only the program but also its dependencies. That way, the JAR file only requires a JVM to run. The JAR files used in this thesis are created with the help of a Gradle plug-in called Shadow [72] and OpenJDK version 1.8.0_222. The Lambda functions that execute these JAR files use the runtime Amazon calls java8, which is based on the JDK java-1.8.0-openjdk. When using the java8 runtime, Amazon utilizes its own operating system, Amazon Linux, on the containers executing that function.
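As an illustration, a minimal build.gradle.kts using the Shadow plug-in could look as follows; the version numbers are illustrative and not necessarily the exact ones used for the benchmarks:

```kotlin
// build.gradle.kts -- minimal sketch of producing an uber JAR with Shadow.
plugins {
    kotlin("jvm") version "1.3.61"
    id("com.github.johnrengelman.shadow") version "5.2.0"
}

repositories {
    mavenCentral()
}

dependencies {
    implementation(kotlin("stdlib"))
}

// `gradle shadowJar` now bundles the compiled program together with all of
// its dependencies into a single JAR that only needs a JVM to run.
```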

Using GraalVM CE, a native image is created from the JAR generated by Gradle. The latest release of GraalVM CE is 19.3, but it contains a known bug where it is unable to create native images [73]. Therefore, the previous version, 19.2.1, is used. The Community Edition is used in this thesis due to its availability.

To create a Lambda function with a custom runtime, a bootstrap file is needed in addition to the executable file. This bootstrap file needs to invoke the executable as well as report its result. The bootstrap file and the executable are then compressed into a zip file and pushed to Lambda to create a function.

All the Lambda functions that were created run with a maximum memory size of 256 MB and a timeout of 100 seconds: a program cannot use more than 256 MB of memory, otherwise the invocation fails, and it is interrupted if it runs for more than 100 seconds.

3.4 Sampling Strategy and Calculations

Since the benchmarks are executed in a public cloud, where the results can be affected by factors such as noisy neighbours, it is reasonable to be mindful of the selection of execution times in order to achieve a representative result.

Two different aspects of time were taken into consideration: day versus night and weekday versus weekend. Although the region chosen for hosting the AWS Lambda functions was us-east-1, there is no guarantee that all users of that region are in a timezone used in the eastern parts of the United States; these tests, for example, were made from the Central European Time zone (GMT+1). Since no distinction can be made between day and night for the users of a region, an interval covering both was chosen.

The tests done in sequence were performed 8 hours apart: 12 PM, 8 PM and 4 AM (CET). The benchmarks ran 6 times over the span of three weekdays, from Tuesday 10/12 12:00 PM to Thursday 12/12 4:00 AM. In order to cover the weekend as well, three runs were made, with an 8-hour interval, from Saturday 14/12 12:00 PM to Sunday 15/12 4:00 AM. Since these tests were not meant to go through a cold start, they could be done in sequence.

When deciding how many invocations each sequence should contain, previous work was consulted. When a JVM is used for running benchmarks, a warm-up sequence is commonly defined and used in order to ensure that the JVM has achieved the so-called steady state when samples are acquired [74] [75] [76].

The optimum would be to get both warm-up instances and instances where the JVM has reached a steady state, in order to get a fair representation. The number of invocations required for each benchmark to achieve a steady state could be examined, but that is out of scope for this thesis. Therefore, a report by Lengauer et al. was used to determine a reasonable sample size. In that report, three different benchmark suites were used, and the number of warm-up instances was base-lined at 20, due to the built-in mechanism in the DaCapo suite that requires a maximum of 20 warm-ups to automatically detect a steady state [75]. The suite used for this thesis is undoubtedly different in many ways, but this still gives an indication of how many invocations are required before a steady state is reached. We hypothesize that a steady state is reached after 20 invocations, but we also want some samples capturing the steady state. The number was therefore doubled, and it was reasoned that 40 invocations would presumably suffice. The first invocation, however, will inevitably include a cold start and is excluded, entailing 39 usable executions per sequence.

To get measurements of executions including cold starts, invocations have to be made with a large enough gap. After some trials, 20 minutes was found to be an adequate gap. Benchmarks were executed with a 20-minute interval between Tuesday 10/12 16:40 and Wednesday 11/12 09:00, as well as between Sunday 15/12 10:40 and Monday 16/12 10:20.

When the results have been gathered, the raw data has to be condensed in some way in order to make it presentable and comprehensible. For this, the arithmetic mean is chosen as a first step. To be able to argue for the accuracy of the result, the confidence interval is also calculated. The confidence level used in this work is 95 %, meaning that the level of confidence one can have that the actual value is within the given interval is 95 %. This confidence level was chosen on account of it being one of the most commonly used [77], which contributes to high credibility.
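As a sketch, assuming the normal approximation (a Student-t quantile would be more appropriate for small samples), the mean and the half-width of the 95 % confidence interval can be computed as:

```kotlin
import kotlin.math.sqrt

// Arithmetic mean and half-width of a two-sided 95 % confidence interval,
// using the normal-approximation quantile z = 1.96.
fun meanAndHalfWidth(samples: List<Double>): Pair<Double, Double> {
    val n = samples.size
    val mean = samples.sum() / n
    val variance = samples.map { (it - mean) * (it - mean) }.sum() / (n - 1)
    val halfWidth = 1.96 * sqrt(variance / n)
    return mean to halfWidth  // the interval is mean ± halfWidth
}
```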

3.5 Summary

For this thesis, we create a benchmark suite. The goal of the benchmarks is to simulate a real workload. The suite consists of three benchmarks that are real applications, one simple Hello world benchmark, and four smaller complementary benchmarks. Each of the complementary benchmarks focuses on manipulating a different data type.

Each benchmark has two AWS Lambda functions: one that runs on the JVM and one that is run as a native image created with GraalVM. The times when these benchmarks are run are selected with the intent of creating a fair representation of the cloud's performance. The metrics collected from each run are latency, execution time, response time and memory consumption.
