Increasing the Throughput of a Node.js Application: Running on the Heroku Cloud App Platform

DEGREE PROJECT IN COMPUTER ENGINEERING, FIRST CYCLE, 15 CREDITS

STOCKHOLM, SWEDEN 2016

NIKLAS ANDERSSON

Abstract

The purpose of this thesis was to investigate whether utilization of the Node.js Cluster module within a web application in an environment with limited resources (the Heroku Cloud App Platform) could lead to an increase in throughput of the application and, in the case of an increase, how substantial it was. This was done by load testing an example application with and without the module. In both scenarios, the traffic sent to the application varied from 10 requests/second to 100 requests/second. For the tests conducted on the application utilizing the module, the number of worker processes used within the application varied between 1 and 16. Furthermore, the tests were first conducted in a local environment in order to establish any increases in throughput in a stable environment and, in case there were notable differences in throughput of the application, the same tests were conducted on the Heroku Cloud App Platform. Each test was also aimed at one of two different types of tasks performed by the application: I/O bound or CPU bound. From the test results, it could be derived that utilization of the Cluster module did not lead to any increase in throughput when the application was doing I/O bound tasks in either of the environments. However, when doing CPU bound tasks, it led to a ≥20% increase when the traffic sent to the application in the local environment was 10 requests/second or higher. The same increase could be seen when the traffic sent to the application was 50 requests/second or higher in the Heroku environment. The conclusion was, thus, that utilization of the module would be useful for the company (at which this thesis took place) in case an application installed on Heroku was exposed to higher traffic.

Keywords: Throughput, Node.js, Heroku, Performance, Increasing

Abstract (translated from Swedish)

The purpose of this degree project was to investigate whether use of the Node.js module Cluster in a web application, in an environment with limited resources (the Heroku cloud app platform), could lead to an increase in the application's throughput and, if there was an increase, how large it was. This was done by load testing an example application with and without the module. In both scenarios, the traffic sent to the application varied between 10 and 100 requests/second. For the tests performed on the application that used the module, the number of worker processes varied between 1 and 16. Furthermore, the tests were first performed in the local environment, with the aim of establishing any possible throughput increase in a stable environment, and if there were any notable differences in the application's throughput, the same tests would also be performed on the Heroku cloud app platform. Each test also aimed to test one of two different types of tasks performed by the application: I/O bound or CPU bound. From the test results it could be established that the Cluster module did not lead to any increase in throughput when the application performed I/O bound tasks in either of the environments. When the application performed CPU bound tasks, however, it led to an increase of ≥20% when the traffic was 10 requests/second or higher. The same increase could only be seen once the traffic reached 50 requests/second or higher in the Heroku environment. The conclusion was therefore that use of the module would be useful for the company at which the work was carried out if an application installed on Heroku was exposed to what was considered higher traffic.

Keywords: Throughput, Node.js, Heroku, Performance, Increase

Table of Contents

Abstract (in English)
Abstract (in Swedish)
1 Introduction
  1.1 Background
    1.1.1 Increasing Throughput
    1.1.2 Node.js
    1.1.3 The Heroku Cloud App Platform
    1.1.4 Web Applications
  1.2 Problem
  1.3 Research Questions
  1.4 Purpose
  1.5 Delimitations
  1.6 Disposition
2 Theoretical Background
  2.1 The Company Platform
  2.2 Heroku Dyno
  2.3 I/O vs. CPU bound
  2.4 The Inner Workings of Node.js
  2.5 Increasing Throughput in Node.js Using the Cluster Module
  2.6 Related Work
3 Research Process
  3.1 Research Methodology
  3.2 Process Overview
    3.2.1 Problem Definition
    3.2.2 Data Collection
    3.2.3 Design & Implementation
    3.2.4 Defining the Testing Environments
    3.2.5 Creating Test Plan
    3.2.6 Results and Analysis
    3.2.7 Evaluation
  3.3 Hypotheses
4 Analysis: How to Increase Throughput
  4.1 Our approach
    4.1.1 Different Implementations of the Cluster Module
    4.1.2 Clustering Method Chosen When Creating the Application Template
  4.2 The Application Template
    4.2.1 CPU Usage
    4.2.2 Workload
    4.2.3 Memory Usage
  4.3 Test Application
5 Analysis: Benchmarking the Test Application
  5.1 Testing Environment
    5.1.1 Local Environment
    5.1.2 Heroku Environment
    5.1.3 The Test Application’s Memory Usage
  5.2 Testing Tools
    5.2.1 Apache JMeter
    5.2.2 Heroku Metrics
  5.3 Creating the Test Plan
  5.4 Local Tests
    5.4.1 I/O Bound
    5.4.2 CPU Bound
  5.5 Heroku Tests
    5.5.1 Throughput Rates
    5.5.2 Memory Usage
    5.5.3 Median Response Times
    5.5.4 Analysis of Heroku Test Results
6 Discussion
  6.1 Our Methodology and Consequences of the Study
  6.2 Discussion and Conclusions
    6.2.1 Recommendations Concerning the Application Template
  6.3 Ethics
  6.4 Sustainability
  6.5 Future Work
References
Appendix 1 Heroku Dyno CPU Information
Appendix 2 The Test Application
Appendix 3 The Application Template
Appendix 4 The Local Server CPU Specifications
Appendix 5 Results from I/O Bound Tests in Local Environment
Appendix 6 Results from CPU Bound Tests in Local Environment
Appendix 7 Results from CPU Bound Tests on Heroku

1 Introduction

Today, virtually every company with a presence on the Internet collects data concerning its customers in some form[1]. With a large collection of customer profiles it is possible to collect information concerning the customer’s geographical area, what products the customer has viewed, what devices the customer is using, etc. With this data, customer communication can be improved, marketing can be optimized (through a more well-targeted informational flow), and all customer information can be stored in one single virtual space.

Data can come from different sources: web analytics tools, login processes, email, etc. It may also have to be collected from different physical nodes; it might be located in different data warehouses, and can even be administered by different third-party companies. For a large company the collected data may grow very large and there might be a lot of daily transactions. It is therefore important that these transactions are consistent, that data is preserved, and that the application can handle as much traffic as possible.

One way of making sure that the application is adapted to do this is by ensuring that it can handle as many requests per time unit as possible. This leads to the application being able to serve more clients, thus lowering the risk of a client not receiving the requested data.

1.1 Background

Innometrics, the company at which the project took place, is active within the area just described. Their product helps other companies personalize their marketing strategies by collecting data from a customer’s different data warehouses, and creating a customer profile out of this data. They were in need of increasing the throughput of Node.js applications used for intra-system communication between their system and other systems. These applications were installed on an external cloud platform (Heroku Cloud App Platform or Amazon Web Services), and thus restricted by each platform’s individual specifications.

1.1.1 Increasing Throughput

Throughput is a measurement used for describing the number of requests per time unit handled by any given web service or application. One of the ways of increasing throughput is by making the application more concurrent, that is, making it process more requests simultaneously[2]. This can be achieved by adding extra hardware resources or by maximizing utilization of the available resources.

1.1.2 Node.js

Node.js is a runtime environment based on the programming language JavaScript, a programming language most well-known as the scripting language for web pages. A runtime environment deals with a variety of issues such as the layout and allocation of storage locations for the objects specified in the source code, the mechanisms used by the target program to access variables and for passing variables, etc.[3]. Node.js ships with a collection of modules, which basically encapsulate related code, as in Java or any other programming language with a set of standard libraries. New modules can also be installed, managed and published through the Node Package Manager to provide further functionality. A more detailed specification of Node.js is given in chapter 2.

1.1.3 The Heroku Cloud App Platform

In order to describe what cloud computing is, Eric Griffith states in his article[4]: “In the simplest terms, cloud computing means storing and accessing data and programs over the Internet instead of your computer’s hard drive”. Heroku belongs to a type of cloud computing known as Platform as a Service (PaaS)[5]. This type of service removes the need for organizations to manage the underlying infrastructure (usually hardware and operating systems) and allows users to focus on the deployment and management of their applications[6].

The Heroku platform allows users to install and execute applications isolated from one another. It provides functionality such as a database management system and application monitoring. The platform’s execution environment also enables the user to write applications in several different programming languages and runtimes, such as Node.js, Ruby, Java and PHP.

1.1.4 Web Applications

An application is a stored set of instructions that directs a computer to do some specific task[7]. Web applications are distributed client-server applications in which a web browser provides the user interface[8]. The client browser and the server side exchange protocol messages represented as HTTP requests and responses. In the case of cloud computing, web applications no longer reside on a dedicated server; instead they reside on a cloud platform.

1.2 Problem

During periods of high traffic towards a web application, it is essential that the system can handle the increased demand of service. Exposing an inefficient web application to high traffic can cause individual requests not to receive their corresponding responses. It can also lead to response times – the total time it takes from when a user makes a request until they receive a response – being longer than desired. In order to fulfill the need of service to as many clients as possible, it is important that the web application can provide a large throughput.

1.3 Research Questions

The main questions of this thesis narrow down to: How can the throughput of a Node.js application, running on the Heroku Platform, be increased by taking advantage of the available system resources? In case of an increased throughput, how substantial will it be?

1.4 Purpose

The application’s performance is limited by the cloud platform where the application is installed. The purpose of this thesis is to show how to increase throughput of the company’s applications running on the Heroku Platform. The intention is to develop a generic application template in Node.js that can be used when creating new applications within the company’s Node.js application platform. Applications that utilize this template should be installable on the Heroku Cloud App Platform. The template should increase the throughput of each individual application, and thus increase the performance of the system as a whole.

Although the application was primarily aimed at the Heroku platform, it should be possible to migrate it to other existing cloud platforms. Therefore, the solution should be as general as possible. Furthermore, we are to implement functionality that takes full advantage of the available system resources in order to increase the number of requests handled by the application per time unit. This is to be done without adding any additional hardware resources.

Best practices in increasing throughput of a Node.js application deployed on the Heroku Cloud App Platform, without adding hardware resources, will be investigated and evaluated. Hopefully, this will lead to an increase in throughput of each individual application on Innometrics’ Node.js application platform.

1.5 Delimitations

This thesis will focus solely on increasing the throughput by taking advantage of the available system resources. Also, it is only concerned with the increase of throughput in an application running on the Heroku Platform, not on an arbitrary cloud platform. Furthermore, we were limited to using only the free account level of Heroku (specifications of machines on this level are given in chapter 2).

1.6 Disposition

The thesis is outlined as follows. Firstly, a theoretical background is presented, giving a brief insight into the specific technologies that are needed in order to understand the approach to the problem and the thesis results. Node.js, the Heroku environment, and increasing throughput in Node.js specifically are discussed here in more detail. After that, the research process is treated. The chapter starts by describing our information gathering process, and continues with a review of existing literature, a description of our research methodology and the requirements specification. The next chapter describes how the template for the applications is created. This chapter is then followed by a chapter devoted to the tests. Here, the testing environment is described and the results from the tests are evaluated. Lastly, in the Discussion chapter, we reflect on our methods and results, future work, and the topics of ethics and sustainability within the area.

2 Theoretical Background

This chapter gives a deeper insight into the more theoretical parts of the problem area that are essential in order to understand the problem and its solution. It describes the Innometrics system, the Heroku dyno, how Node.js works in more detail, how to increase throughput within the runtime environment using the Cluster module, and related work done within the area.

2.1 The Company Platform

As mentioned in section 1.1, the customer’s (the company buying Innometrics’ product) data warehouse or their system for tracking and managing existing or potential customers (Customer Relationship Management system, or CRM system)[9] is connected to the company platform. With the data retrieved from the customer’s data warehouse or CRM system, Innometrics initially puts together a profile for each of the customer’s clients (the visitors to the customer’s website), which is then stored in Innometrics’ own data warehouse. The Innometrics system then continually adds data to this profile, containing information on any website interaction that the client in question has made with the customer’s website. The website interactions to listen for are specified by the customer through the Innometrics system.

All client interactions that have been specified to listen for are logged in an event stream in the form of data objects known as events. An event is, in turn, a collection of data containing information on an action that has been taken by a client on the customer website. For example, as a client clicks a banner or a link, an event could be generated containing information on which banner or link was clicked, the time when the click was made, etc.

In order to enable Innometrics to retrieve resources from third-party sources, Node.js applications (deployed on a cloud platform) are used. Each application has been set to listen for one or several events. In case any of these events is triggered by a visitor on the customer website (e.g. by the client clicking on a link), the Innometrics system sends a request containing the client’s profile (with the event added to it) to the application.

An example of this type of communication is shown in figure 2.1.1. As the client visits a website, an event is generated by the Innometrics system containing information on the IP address of the client. A request containing the client profile is then sent by the Innometrics system to the application. The application extracts the IP address contained in the event data of the profile just received with the request, and sends it onwards to an IP lookup service that retrieves further information on the IP address in question. Then, as the response is received from the lookup service, the application saves this data to Innometrics’ own data warehouse.

Figure 2.1.1: A flow chart describing an example case of communication between different actors as an event is triggered on a customer website.
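The request flow in the IP lookup example can be sketched in Node.js as follows. This is only a minimal illustration: the profile layout (`events`, `data.ip`), the function names, and the stubbed lookup and storage services are all assumptions made for the sketch, since the actual Innometrics profile format and APIs are not given here.

```javascript
'use strict';

// Hypothetical sketch of the flow in figure 2.1.1. All names and the profile
// layout are illustrative assumptions, not the Innometrics API.

// Pull the IP address out of the most recent event on a profile.
function extractIp(profile) {
  const events = profile.events || [];
  const last = events[events.length - 1];
  return last && last.data ? last.data.ip : undefined;
}

// Handle one request: extract the IP, look it up, store the result.
// lookupIp and saveToProfile are injected so that no real service is needed.
async function handleProfileRequest(profile, lookupIp, saveToProfile) {
  const ip = extractIp(profile);
  if (!ip) {
    throw new Error('profile contains no event with an IP address');
  }
  const info = await lookupIp(ip);       // I/O: call to the lookup service
  await saveToProfile(profile.id, info); // I/O: write to the data warehouse
  return info;
}

// Example run with stubbed services and a made-up profile.
const exampleProfile = {
  id: 'profile-1',
  events: [{ name: 'page-view', data: { ip: '203.0.113.7' } }],
};
const stored = new Map();
handleProfileRequest(
  exampleProfile,
  async (ip) => ({ ip, country: 'unknown' }),   // stub lookup service
  async (id, info) => { stored.set(id, info); } // stub data warehouse
).then((info) => console.log('stored', info));
```

Both calls to external services are awaited, which makes this kind of application predominantly I/O bound.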

2.2 Heroku Dyno

Each application on the Heroku platform is running on a dyno. Each dyno is a lightweight Linux container that runs a single command provided by the user. A dyno can run any command available in its environment like restart, stop, scale, etc.

According to Heroku’s official documentation[10], containerization is a virtualization technology that allows multiple isolated operating system containers to be run on a shared host. All dynos are isolated from one another for security purposes. Dynos on the free account level are limited to 512 MB of RAM[11]. Heroku has (for unknown reasons) decided not to reveal the CPU specifications to the user, but by accessing the application’s shell environment it was clear that the dyno resided on a machine that had access to one physical unit consisting of 4 cores with 8 hardware threads each (see Appendix 1). However, it seemed[10] that the dyno’s access to these resources varied depending on the number of other dynos currently active on the shared host. A hardware thread is one out of two execution threads per core that execute simultaneously in order to hide latencies when retrieving data from memory caches on the CPU, and is implemented by Intel Hyper-Threading Technology[12].
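The thesis inspected the dyno through its shell environment. As a complementary sketch, the same kind of information can be read from within a Node.js process using the standard `os` module. The helper `describeCpus` is a name introduced here for illustration. Note that inside a container the reported values may describe the shared host rather than the share of it that is actually available to the dyno.

```javascript
'use strict';
const os = require('os');

// Summarize the logical CPUs reported by the operating system.
// describeCpus is a hypothetical helper, not part of any library.
function describeCpus(cpus) {
  const models = new Set(cpus.map((cpu) => cpu.model));
  return { logicalCpus: cpus.length, models: Array.from(models) };
}

const summary = describeCpus(os.cpus());
console.log('Logical CPUs visible to Node.js:', summary.logicalCpus);
console.log('CPU model(s):', summary.models.join(', '));
console.log('Total memory (MB):', Math.round(os.totalmem() / (1024 * 1024)));
```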

2.3 I/O vs. CPU bound

Tasks performed by an application or a system can be I/O bound or CPU bound. An I/O bound task (I/O is shorthand for Input/Output) performs operations associated with I/O communications. Examples of I/O communications are HTTP requests, database operations, and disk reads and writes[13]. CPU bound tasks are mainly performed by the CPU. In this case the CPU spends its time mostly on computing. Examples of these types of tasks are calculating a hash, searching for an item, and performing mathematical calculations.

Figure 2.3.1: A CPU (a) vs. I/O bound (b) application

An application can also be either CPU bound or I/O bound. In the case of a CPU bound application, a majority of the tasks done within the application are CPU bound. In the case of an I/O bound application it is the other way around: a majority of the tasks are I/O bound. Both types of applications are depicted in figure 2.3.1. Here it can be seen how the CPU bound application (application ‘a’) spends more time doing calculations, and less time handling I/O. It can also be seen, in application ‘b’, how an I/O bound application spends its time doing the opposite: more time waiting for I/O, and less time doing calculations[14].
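To make the distinction concrete, the following sketch contrasts one task of each kind. The function names are introduced here for illustration, and the choice of repeated SHA-256 hashing and of listing the temporary directory are assumptions; they are not the tasks used by the test application in this thesis.

```javascript
'use strict';
const crypto = require('crypto');
const fs = require('fs');
const os = require('os');

// CPU bound: repeatedly hash a value; the CPU is busy the whole time.
function cpuBoundTask(rounds) {
  let digest = 'seed';
  for (let i = 0; i < rounds; i++) {
    digest = crypto.createHash('sha256').update(digest).digest('hex');
  }
  return digest;
}

// I/O bound: list a directory; most of the time is spent waiting for the
// file system, during which the CPU is free to do other work.
function ioBoundTask(directory) {
  return fs.promises.readdir(directory);
}

console.log('hash after 1000 rounds:', cpuBoundTask(1000));
ioBoundTask(os.tmpdir()).then((entries) => {
  console.log('entries in temporary directory:', entries.length);
});
```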

2.4 The Inner Workings of Node.js

One of the main strengths of Node.js is its method for treating I/O calls. This is largely because I/O calls are handled by background threads, while the main thread of the application, known as the event loop, can treat and process any other requests sent to the application. Figure 2.4.1 gives a detailed overview of the inner workings of Node.js.

The Node.js runtime runs on a single core[15] and contains an event queue, which stores a list of events, each consisting of a name describing the event and a callback function[16] (a function to be run after the initial function has finished its execution). An example of an event is when an HTTP request is sent to the server. This request is placed in the event queue. The event loop starts by picking up an event containing an I/O call that is to be executed from the queue and then delegates the job to the operating system via an internal thread pool[17]. The thread that receives the job then executes the function associated with the event without blocking the event loop, while the event loop continues treating the next event in the queue. After the thread in the internal thread pool has finished its execution, the callback function is placed in the event queue. The callback function is later retrieved from the queue and processed by the event loop. If another event occurs, a new event is placed in the event queue, and the procedure is repeated. This way the event loop can handle all incoming requests asynchronously in a non-blocking way.

However, Node.js is not as good at treating CPU intensive tasks[18]. When Node.js performs a CPU intensive task, all other requests are held up, because the event loop runs on a single thread and the CPU is occupied with working on this thread. One of the strategies for handling this problem is to use the Cluster module[13].
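The effect of a CPU intensive task on the event loop can be observed with a small experiment. In the sketch below, a timer is scheduled to fire after 10 ms, but a busy loop (standing in for a CPU intensive task) occupies the single thread for about 200 ms, so the timer callback cannot run until the loop has finished. The function names and the durations are assumptions chosen for illustration.

```javascript
'use strict';

// Busy-wait for roughly `ms` milliseconds; stands in for a CPU intensive task.
function blockFor(ms) {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // spin: the event loop cannot process any other event meanwhile
  }
}

// Schedule a timer, then block the event loop, and report how long it
// actually took before the timer callback was run.
function measureTimerDelay(timerMs, blockMs) {
  return new Promise((resolve) => {
    const scheduled = Date.now();
    setTimeout(() => resolve(Date.now() - scheduled), timerMs);
    blockFor(blockMs);
  });
}

measureTimerDelay(10, 200).then((elapsed) => {
  console.log(`timer set for 10 ms fired after ${elapsed} ms`);
});
```

An incoming HTTP request would be delayed in the same way as the timer callback, which is why a single CPU bound request can hold up all other clients.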

2.5 Increasing Throughput in Node.js Using the Cluster Module

In order to improve Node.js’s ability to treat CPU intensive tasks, worker processes can be forked. That is, the main process of the application is duplicated into new processes referred to as worker processes[19]. The main process is then referred to as the master process. This functionality is provided by the Cluster module, which is a part of the standard library in Node.js[15]. When worker processes have been forked, all new connections are first received by the master process and then handed over to a worker. Which worker gets the connection is decided through a round-robin approach, which essentially means that connections are handed to the workers in turn[20].

Best practice is to bind each worker to its own logical CPU core. This lets the application utilize more of the CPU’s capacity when processing requests, thus increasing its effectiveness and throughput[21]. This essentially means that each Node.js instance (figure 2.4.1) is replicated into its own server instance, where each instance, known as a worker process, listens to the same socket. Here, the master process works as a load balancer by receiving all incoming connections and distributing them among the worker processes[15]. The resulting architecture of the application when implementing the Cluster module is depicted in figure 2.5.1.

Figure 2.5.1: The desired application architecture for this thesis, with each worker representing the Node.js instance depicted in figure 2.4.1
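The master/worker architecture described above can be sketched with the standard `cluster` module as follows. This is a minimal illustration and not the application template developed in this thesis. The helper names, the `--serve` command line flag, the default of 4 workers, the fallback port 3000, and the decision to cap the worker count at the number of logical CPUs are all assumptions made for the sketch; `PORT` and `WEB_CONCURRENCY` are read from the environment if present.

```javascript
'use strict';
const cluster = require('cluster');
const http = require('http');
const os = require('os');

// Decide how many workers to fork: at least 1, at most one per logical CPU.
// (The cap is an assumption that follows the one-worker-per-core practice.)
function workerCount(requested, logicalCpus) {
  return Math.max(1, Math.min(requested, logicalCpus));
}

// Master process: fork the workers and replace any worker that exits.
function startMaster(workers) {
  for (let i = 0; i < workers; i++) {
    cluster.fork();
  }
  cluster.on('exit', (worker) => {
    console.log(`worker ${worker.process.pid} exited, forking a replacement`);
    cluster.fork();
  });
}

// Worker process: every worker calls listen() on the same port; the Cluster
// module lets the master accept connections and hand them to the workers.
function startWorker(port) {
  http.createServer((req, res) => {
    res.end(`handled by worker ${process.pid}\n`);
  }).listen(port);
}

// Only start servers when explicitly asked to, e.g. `node app.js --serve`.
if (process.argv.includes('--serve')) {
  if (cluster.isWorker) {
    startWorker(Number(process.env.PORT) || 3000);
  } else {
    const requested = Number(process.env.WEB_CONCURRENCY) || 4;
    startMaster(workerCount(requested, os.cpus().length));
  }
}
```

Started with `--serve`, repeated requests to the port should be answered by different worker process IDs, which shows the master distributing connections among the workers.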

2.6 Related Work

The Node.js platform is still rather new and evolving rapidly. Because of that, it is not easy to find articles that are still up-to-date. Some of the articles are reviewed in this section.

The article “Optimizing Node.js Application Concurrency”, provided by Heroku’s official website, explains how to regulate the number of worker processes[22]. It also recommends creating worker processes and binding each of them to its own logical CPU core, thus making the application take full advantage of the available system resources. One interesting thing that they mention is that each app has unique memory, CPU and I/O requirements, and that there is no solution that fits every app. However, they do not provide any benchmark results.

Rowan Manning describes in his blog how to implement the Cluster module[23]. He also states that creating multiple processes for a Node.js application can dramatically improve the amount of load the application can handle. He provides some simple benchmarking in order to illustrate the improvement. The app is installed on a local machine without involving the Heroku platform, and the benchmarked function is doing CPU bound tasks.

Neil Kandalgaonkar argues that “Node.js can be a great choice for computation heavy services”[24]. He clarifies that it can be suitable for some occasional CPU-bound tasks, though not for too many or too heavy tasks. The Heroku platform is mentioned in the article as well, but because the application tested in the article was too big (~200 Mb), it was not possible to perform thorough tests on that platform. He names the Cluster module as one of the possible solutions.

3 Research Process

This chapter will describe our research process. It will provide a description of the methodology used in solving the problem, give an overview of the overall process, and lay down the hypotheses for this thesis.

3.1 Research Methodology

Since the thesis consisted of two separate research questions, two different research strategies were used. In order to answer the first research question, how to increase the throughput of the application, practices on how to create a server in Node.js were investigated. The solution was determined through a combination of quantitative and qualitative methods, where a form of applied research based on existing theories and research[25] was used to create a test application, which could then be evaluated by answering the second research question. If answering the second question showed a substantial increase (≳20%), the result of evaluating the first question would be considered positive. If not, the first question would need to be reevaluated based on another existing theory.

When answering the second research question, how substantial the increase in throughput would be, two different methods were followed. Experimental research formed the foundation: different test results were compared with one variable changing per test. In our case, these variables were 1) the load sent to the application during the test, 2) the number of workers used by the application in the test, and 3) the environment the test was conducted in (local or Heroku). Hypotheses could also be defined in advance for the outcome of the comparison, and thereby a method of the analytical kind was also used[25]. Thus, the methodology used for answering the second research question was a combination of two research methods: experimental and analytical.

3.2 Process Overview

The methods listed in this section are described in order to give an understanding of how this thesis was structured to achieve the goal and answer the research questions defined in chapter 1. The overall research process is illustrated in figure 3.2.1, and is described in detail below.

Figure 3.2.1: The research process

3.2.1 Problem Definition

This was the phase where the problem was defined out of the requirement specification received from the company.

3.2.2 Data Collection

Data collection consists of two different types of data: primary data and secondary data. Primary data is most generally described as data collected from the information source, and is most often retrieved through interviews, observations and discussions with members of the company[26]. Secondary data, in turn, is typically gathered by persons not involved in the current research. The sources of this kind of data can be technical and statistical records, newspaper articles, etc.[26].

The primary data that the qualitative part of this thesis relies on mainly consists of a task overview given by Innometrics’ supervisor of this thesis, and of informal interviews given by the employees of the company. The overview given by the company consisted of recommendations on what modules to use for the thesis: partly modules used by the company daily when designing applications for the platform, and partly modules that could contribute to this thesis. Recommendations on what tools might be used when performing the tests were also given. The informal interviews given by employees consisted of recommendations on how to set up the remote environment, and information on the average traffic that the Innometrics system is exposed to. This primary data was then complemented by document studies in the form of company documentation on the platform, technical reports, and articles on the subject. Such materials can give a better and deeper understanding of the subject.

The primary data that the quantitative part of this thesis relies on mainly consists of test results obtained from tests conducted in order to answer the second research question on how substantial the increase in throughput was (in case there was an increase).

3.2.3 Design & Implementation

In this phase, a test application is to be designed and implemented based on a known method for increasing throughput in Node.js. The initial design of the test application is the result of the primary data obtained through qualitative methods just described, and it defines the architecture and functionality of the application.

3.2.4 Defining the Testing Environments

In this phase, the specifications for the machines, of both the local and the Heroku testing environments, were laid down.

3.2.5 Creating Test Plan

During this phase, the focus lay on creating a test plan that included tests for both the local and the Heroku testing environment. The test plan was to be designed to test the application’s throughput in the case of both I/O and CPU bound tasks, different rates of traffic sent to the application, and different numbers of worker processes for each traffic rate and type of task. We had been informed about the structure of the requests being sent to the application, and by reusing that structure we only needed to adapt the request’s body to contain data relevant for the test application. The body data that was relevant for this thesis was simply a string used to determine which function to call (I/O or CPU bound).
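The dimensions of such a test plan can be enumerated programmatically. The sketch below builds every combination of environment, task type, traffic rate and worker count. The function name and the concrete intermediate values are assumptions made for illustration; only the ranges (10 to 100 requests/second and 1 to 16 worker processes) are taken from this thesis, and the baseline runs without the Cluster module are not modelled.

```javascript
'use strict';

// Build every combination of environment, task type, load and worker count.
// buildTestPlan is a hypothetical helper introduced for this sketch.
function buildTestPlan(environments, taskTypes, loads, workerCounts) {
  const plan = [];
  for (const environment of environments) {
    for (const taskType of taskTypes) {
      for (const requestsPerSecond of loads) {
        for (const workers of workerCounts) {
          plan.push({ environment, taskType, requestsPerSecond, workers });
        }
      }
    }
  }
  return plan;
}

// Assumed example values; the intermediate steps are not taken from the thesis.
const plan = buildTestPlan(
  ['local', 'heroku'],
  ['io', 'cpu'],
  [10, 50, 100],
  [1, 2, 4, 8, 16]
);
console.log(plan.length, 'test cases, first:', plan[0]);
```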

3.2.6 Results and Analysis

This phase consisted of two iterations: one for local tests, and one for Heroku tests. In both iterations, the test application was benchmarked, and the results of the benchmark were then analyzed. The results were presented in the form of tables and graphs. Increases in throughput were expressed in terms of a percentage increase between each test.

The analysis consisted of a type of formative evaluation, where the focus lay on examining and changing processes as they occur. The last iteration was evaluated: in case it had provided positive results, the process would continue to a final evaluation of the solution. In case it had provided negative results, a new iteration would be initiated.
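The percentage increase between two tests, and the comparison against the threshold of roughly 20% from section 3.1, can be computed as in the sketch below. The function names and the request counts are made up for illustration and are not results from this thesis.

```javascript
'use strict';

// Throughput: requests handled per second over a measurement window.
function throughput(requestsHandled, seconds) {
  return requestsHandled / seconds;
}

// Percentage increase of `clustered` relative to `baseline`.
function percentIncrease(baseline, clustered) {
  return ((clustered - baseline) / baseline) * 100;
}

// The default threshold mirrors the "substantial increase" criterion (about 20%).
function isSubstantial(baseline, clustered, thresholdPercent = 20) {
  return percentIncrease(baseline, clustered) >= thresholdPercent;
}

// Made-up numbers: 2400 vs. 3000 requests handled during 60 seconds.
const baseline = throughput(2400, 60);  // 40 requests/second
const clustered = throughput(3000, 60); // 50 requests/second
console.log(
  percentIncrease(baseline, clustered).toFixed(1) + '% increase, substantial:',
  isSubstantial(baseline, clustered)
);
```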

3.2.7 Evaluation

The evaluation of the solution to this thesis was to have a summative approach, providing an overall description of the application’s performance increases. It was to describe whether the objectives of the thesis had been fulfilled, and to comment on the future direction of the product. Here, a secondary analysis was also to be given, reexamining existing data to address new questions or methods not employed.

3.3 Hypotheses

Our hypotheses were that the results would show performance increases for the test application in areas where Node.js is usually weak. In other words: when benchmarking one and the same Node.js application with and without our application template, a performance increase, in the form of a higher throughput, should be apparent when doing CPU-heavy tasks, such as calculating a hash or doing other arithmetic calculations. However, when doing I/O bound tasks, the throughput should remain unchanged.

4 Analysis: How to Increase Throughput

This chapter will provide the answer to the first research question of this thesis: how can the throughput of a Node.js application, running on the Heroku Platform, be increased by taking advantage of the available system resources? The answer was obtained by using the qualitative methods described in section 3.2.2. The chapter will also provide a description of the test application used in this thesis. The application was to consist partly of the throughput increasing template that was to be the product of this thesis, and partly of functionality for testing two different aspects of the test application in its current environment: its capabilities of fulfilling I/O and CPU bound tasks.

4.1 Our approach

When analyzing the data retrieved during the data collection phase, we found that there were few ways to increase the throughput of a Node.js application. The main method for increasing throughput in Node.js is to create multiple processes for the application, thus utilizing more of the available system resources. This is known as clustering the application, and is mainly implemented through the Cluster module described in section 2.5.

4.1.1 Different Implementations of the Cluster Module

There are several alternatives for implementing worker processes in Node.js [27]. One of these is to simply use the standard Cluster module, which comes as a standard library in Node.js and provides the most basic mechanisms for implementing worker processes. More about the implementation of the Cluster module for this thesis can be found in section 4.2. There is also the alternative of using the Throng module [28], which is used by Heroku in their own example on how to cluster. This module is also implemented on top of the Cluster module. It is advertised as “a simple worker manager for clustered Node.js apps”, and it obscures large parts of the master/worker logic when clustering the application in order to make things easier for the developer. Instead, the developer mainly has to focus on setting the number of workers, configuring the master process, etc. Another alternative is PM2, a program which is also implemented on top of the Cluster module. It is similar to the Throng module in that it obscures large parts of the master/worker logic from the developer, but it does so to an even larger extent. It also provides the application with some additional functionality such as real-time process management [29] (e.g. adding workers), basic system monitoring, log aggregation, etc. [30]. Lastly, there is the alternative of using the StrongLoop Cluster Management Tool [27], which is also based on the Cluster module and provides basically the same functionality as PM2, with some smaller differences (such as profiling).

4.1.2 Clustering Method Chosen When Creating the Application Template

For this thesis, the standard Cluster module was found to be the most appropriate way to implement clustering in the application. The alternatives (Throng, PM2, and StrongLoop) either tended to hide larger parts of the cluster related code from the developer, or offered functionality not relevant for this thesis, which might have led to a larger memory usage (higher memory allocation) for the processes. They are also all built on top of the Cluster module, and it seemed easier to adapt the standard Cluster module to different cloud platforms than the other alternatives [31]. While other alternatives, with additional functionality, might be useful in a live-case scenario, they were not appropriate for this study, where it was desired to evaluate the effects of clustering on the most basic level.

4.2 The Application Template

When creating the template, we relied on the official Node.js documentation, and the description of the Cluster module in particular, on how to create and cluster a web application. This led to a template realizing the server model described in section 2.5 (and depicted in figure 2.5.1). When developing the template, it was important to keep the master process as light as possible, by keeping the allocated memory for the process at a minimum and by not including any server related code, or any other code that was not relevant to its task of managing the workers. The reason for this was to optimize memory usage on the cloud platform: since it was the workers that handled the requests, it was important that they had a maximum amount of memory available.

In order to change the number of worker processes dynamically for each application instance, an environment variable that could be set via the command line was used. On Heroku, this variable held the number of workers appropriate for the number of dynos used for the application. According to the official Node.js Cluster documentation, the default strategy when creating worker processes in an application is to use the worker processes as request handlers (receiving and treating requests), and to use the master process for creating workers and handing sockets to them through inter-process communication (IPC), a mechanism for sharing data among multiple processes [4].

Figure 4.2.1: The master related code of the template

As the application instance is started, the master creates a number of workers equal to the value of the environment variable mentioned earlier (see figure 4.2.1, rows 25 to 27). Rows 29 to 31 show how a new worker is generated by the master in case a worker dies (i.e. its process somehow shuts down).

Figure 4.2.2: The worker related code of the template

Figure 4.2.2 shows the worker related code of the template. The worker starts by instantiating the Express framework (row 34), a Node.js framework for creating web applications, which provides the process with the necessary server functionality. Rows 39 to 41 show how each worker process listens to the same port. The code for handling each request sent to the application is shown on rows 44 to 51. As a request is sent to the application, the request is treated and a response is generated in the callback of this method. The complete template of this thesis can be found in Appendix 3.
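The master and worker roles described above can be sketched roughly as follows. This is a minimal illustration and not the template from Appendix 3: the environment variable name `WORKERS`, the helper names, the fallback to the number of logical cores, and the use of the built-in `http` module in place of Express are all assumptions made to keep the example self-contained.

```javascript
const cluster = require('cluster');
const http = require('http');
const os = require('os');

// Read the desired worker count from an environment variable.
// WORKERS is a hypothetical name; the thesis does not state the actual one.
function resolveWorkerCount(env, fallback) {
  const parsed = parseInt(env.WORKERS, 10);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : fallback;
}

// Master: only forks workers and replaces any worker that dies.
// The cluster API is passed in so the logic can be exercised in isolation.
function startMaster(clusterApi, workerCount) {
  for (let i = 0; i < workerCount; i += 1) {
    clusterApi.fork();
  }
  clusterApi.on('exit', () => {
    clusterApi.fork();
  });
}

// Worker: every worker listens on the same port; the cluster module
// distributes incoming connections between the workers.
function startWorker(port) {
  const server = http.createServer((req, res) => {
    res.writeHead(200, { 'Content-Type': 'text/plain' });
    res.end('handled by worker ' + process.pid + '\n');
  });
  return server.listen(port);
}

// Entry point: the same file runs either as master or as worker.
function main() {
  if (cluster.isMaster) {
    startMaster(cluster, resolveWorkerCount(process.env, os.cpus().length));
  } else {
    startWorker(process.env.PORT || 3000);
  }
}
```

Calling `main()` at the end of the file and starting it with, for example, `WORKERS=4 node server.js` (a hypothetical file name) would fork four workers. Keeping the master free of server code mirrors the goal stated above of keeping its memory footprint small. Newer Node.js versions expose the same flag under the name `cluster.isPrimary`.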

4.2.1 CPU Usage

As mentioned earlier, a Node.js application has a single-threaded event loop, utilizing only one of the available CPU cores. To increase throughput using only the available system resources, the application should specify how many worker processes are to be created. As mentioned in section 2.5, best practice in determining how many worker processes should be created for a particular application is to base it on the number of cores available to the system. That way, each process can be scheduled on its own logical core. The desired CPU usage can be seen in figure 4.2.1.1.

Figure 4.2.1.1: Regular vs. desired Node.js CPU usage

Using the Cluster module, this can easily be implemented on a physical machine, where the exact specifications of the machine are known. When it comes to a cloud platform, however, not much information is revealed about container specifications. A single Heroku dyno shares access to system resources with other dynos, and the performance of a single dyno can vary depending on the total load on the underlying machine. Therefore, according to Heroku’s article “Optimizing Node.js Application Concurrency” [21], clustering more than one worker on a standard single dyno may hurt, rather than help, performance. This was one of the things to be considered when performing the tests.

4.2.2 Workload

Analyzing the information received from observations and recommendations, it was clear that the application should be able to handle different amounts of simultaneous requests. The customers that use the application are of different kinds: they can be either large or small companies. Thus, the application should take those differences into consideration, i.e. it should be able to handle both larger and smaller numbers of clients. Therefore, finding the right balance of workers is very important.

4.2.3 Memory Usage

Applications can differ in memory usage. Some applications, in need of larger memory allocation (≳200 MB for a single application), might suffer from implementing worker processes on a single Heroku dyno (due to exceeding the memory limit). Exceeding the memory limit could lead to the application not performing as desired, with requests timing out (not receiving responses). Therefore, when clustering an application, the memory usage of each process has to be kept in mind: the application’s overall memory usage must not exceed the dyno’s memory limit.

4.3 Test Application

A template was created, and by that the first research question was partly answered. The next step was to verify whether or not the template would give the desired increase in throughput on the Heroku platform. In order to do that, a test application was to be created and the second research question answered. The application had to provide means for testing its capabilities of doing different tasks. From discussions with people at the company, it was discovered that the system sometimes calculates hashes when creating new profiles. Therefore, the test application needed to provide the ability to run two different types of tasks: CPU and I/O bound. The CPU bound function calculated a hash, while the I/O bound function simulated an I/O call through a timeout of 300 ms, during which the application simply waited without blocking the event loop. The function to run was parsed from the HTTP request that the application received. In each of the two functions, an appropriate response was generated. The same test application (see Appendix 2) was used in both the local and the Heroku tests.
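The two task types can be illustrated with a small sketch. It is not the test application from Appendix 2: the hash algorithm (SHA-256), the number of hashing rounds, the task names `'cpu'` and `'io'`, and all function names are assumptions chosen for illustration.

```javascript
const crypto = require('crypto');

// CPU bound task: hash synchronously, which occupies the event loop
// for the whole computation. SHA-256 and the round count are assumptions.
function cpuTask(input, rounds) {
  let digest = input;
  for (let i = 0; i < rounds; i += 1) {
    digest = crypto.createHash('sha256').update(digest).digest('hex');
  }
  return digest;
}

// I/O bound task: simulate an I/O call by waiting on a timer,
// which leaves the event loop free to serve other requests.
function ioTask(delayMs) {
  return new Promise((resolve) => {
    setTimeout(() => resolve('io done'), delayMs);
  });
}

// Select the task from the string carried in the request body.
// The values 'cpu' and 'io' are hypothetical.
function runTask(taskName) {
  if (taskName === 'cpu') {
    return Promise.resolve(cpuTask('profile', 1000));
  }
  if (taskName === 'io') {
    return ioTask(300);
  }
  return Promise.reject(new Error('unknown task: ' + taskName));
}
```

The difference that matters for clustering is visible in the structure: `cpuTask` runs to completion before anything else in the same process can be handled, whereas `ioTask` returns immediately and resolves later.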

5 Analysis: Benchmarking the Test Application

This section will focus on describing the testing environment, the test plan, and the results obtained from the tests, in a local environment and on Heroku. It will provide an analysis of the results, in order to answer the second research question of this thesis on how substantial the increase in throughput of the application can be when clustering functionality has been added.

5.1 Testing Environment

Analyzing the information gathered during the data collection phase when attempting to answer the first research question, we came to the conclusion that both a local and a Heroku testing environment needed to be defined. Heroku themselves state [21] that an application might suffer from being clustered when running on a free account. The tests were thus first conducted on the test application locally, with the goal of acquiring the expected results in a stable environment. Tests were then conducted on the same application, but installed on Heroku, with the expectation of obtaining similar results. In both testing environments we used six different versions of our application: one without clustering functionality, and five with clustering functionality, each with a given number of workers available to the application instance (1, 2, 4, 8 or 16). The version without clustering functionality was needed in order to confirm that the added functionality would not affect the performance of the application. For both environments the same machine was used as client. The specifications of the client machine were:

● MacBook Air (13-inch, Mid 2013)
● CPU: 1.7 GHz Intel Core i7
● Memory: 8 GB 1600 MHz DDR3
● OS: Mac OS X El Capitan, Version 10.11.4
● 100/10 Mbit/s Ethernet connection

5.1.1 Local Environment

The local testing environment consisted of two machines: one client (with the specifications given above), and one server with the following specifications:

● MacBook Pro (13-inch, Mid 2012)
● CPU: 2.5 GHz Intel Core i5
● Memory: 4 GB 1600 MHz DDR3
● OS: Mac OS X El Capitan, Version 10.11.5
● 100/10 Mbit/s Ethernet connection

Through a terminal command in Mac OS X, the specifications for the Intel Core i5 CPU could be retrieved (see Appendix 4). Here, it could be seen that the CPU had 2 cores and 4 hardware threads. This later became a determining factor when deciding on the most appropriate number of worker processes to use when running a local server.

5.1.2 Heroku Environment

Summarizing the specifications given for the Heroku dyno in section 2.2:

● CPU: a varied share, depending on how many other dynos are currently active on the shared host
● Memory: 512 MB

Because significantly less memory was available in the Heroku environment than in the local one, and because a dyno had a varied share of the CPU, the results needed to be established locally first. We reasoned that if the expected results from the hypotheses (i.e. a throughput increase only for CPU bound tasks) could be obtained in a local environment, it would be worth testing on Heroku as well. If not, the expected results would most likely not be obtained on the lower performing machines of our Heroku environment either.

5.1.3 The Test Application’s Memory Usage

By monitoring the application’s memory usage locally through the Mac OS X terminal command “top”, we could see that it used 20 MB without any requests being sent to it. When the application was requested to run CPU bound tasks, its memory usage could climb up to 85 MB, but averaged around 65 MB. When requesting the application to run I/O bound tasks, on the other hand, its memory usage could climb up to around 80 MB, but averaged around 60 MB.

Since we had a memory quota of 512 MB on Heroku, this gave us an equation for calculating the appropriate number of workers for the application. With 512 MB of total memory available, and the application having a maximum memory usage of around 85 MB when performing CPU bound tasks (the most memory demanding task), the most appropriate number of workers would be around 512 MB / 85 MB ≈ 6 workers. Considering that the master process would also need some memory, the appropriate number of workers would most likely be slightly below 6. Among our different versions of the application, we could thereby predict that the one with 4 workers would produce the best results, giving an increased throughput while not exceeding the memory limit of the dyno (and still leaving a margin to it). The application utilizing 4 workers would have a memory quota of 512 MB / 4 workers = 128 MB available for each worker (minus the memory usage of the master process). This meant that when the application was exposed to high traffic ordering it to perform CPU bound tasks, each worker would still have 128 - 85 = 43 MB available, which should be considered a good margin that does not leave a significant amount of memory unused on the dyno.

Summarizing, it is important that the memory usage of the application’s processes does not exceed the available memory of the Heroku dyno, and that it ideally lies a good margin below this value, though not so far below that a large amount of memory goes unused. The problem concerning the Heroku environment had thus become memory related as well (not only CPU related).
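The arithmetic in this section can be written out as a short calculation. The function names are hypothetical, and the sketch deliberately ignores the memory used by the master process, which the text notes would lower the bound slightly.

```javascript
// Largest number of workers whose peak memory usage fits within the quota.
function maxWorkersForQuota(quotaMb, peakPerWorkerMb) {
  return Math.floor(quotaMb / peakPerWorkerMb);
}

// Memory share per worker, and the margin left above its peak usage.
function perWorkerMargin(quotaMb, workers, peakPerWorkerMb) {
  const shareMb = quotaMb / workers;
  return { shareMb: shareMb, marginMb: shareMb - peakPerWorkerMb };
}

const DYNO_QUOTA_MB = 512; // Heroku dyno memory quota (section 5.1.2)
const PEAK_CPU_TASK_MB = 85; // peak usage measured for CPU bound tasks

// Upper bound on the worker count: floor(512 / 85) = 6
const upperBound = maxWorkersForQuota(DYNO_QUOTA_MB, PEAK_CPU_TASK_MB);

// Tested worker counts whose combined peak usage stays within the quota
const fitting = [1, 2, 4, 8, 16].filter(
  (workers) => workers * PEAK_CPU_TASK_MB <= DYNO_QUOTA_MB
);

// With 4 workers: 128 MB share and 43 MB margin per worker
const fourWorkers = perWorkerMargin(DYNO_QUOTA_MB, 4, PEAK_CPU_TASK_MB);
```

Of the tested worker counts, only 1, 2 and 4 stay within the quota at peak usage (8 workers would need 8 × 85 = 680 MB), which is consistent with the prediction that the 4-worker version would perform best on a single dyno.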

5.2 Testing Tools

This section will describe testing tools used when running tests locally and on Heroku.

5.2.1 Apache JMeter

JMeter is a Java application designed to load test functional behavior and measure performance. It provides means for simulating a heavy load on a server, a group of servers, or a network, to test its strength or to analyze its performance under different load types. It can load and performance test many different server/protocol types: HTTP/HTTPS, FTP, TCP, etc.

Figure 5.2.1.1: Example properties of a thread group

With each test plan, the user creates a thread group, specifying a thread number, a ramp-up period and a loop count. The thread number specifies how many threads are to be started at the beginning of each ramp-up period (specified in seconds), and the loop count specifies how many times this procedure should be repeated. Figure 5.2.1.1 gives an example of the properties that can be set for a thread group. Here, 10 threads are initiated each second, and this is looped 320 times.

Figure 5.2.1.2: An example of the properties of an HTTP Request Sampler

Within each thread group, in turn, several elements can be included. In our case it was relevant to include an HTTP Request Sampler, an object that contains information on an HTTP request that is to be sent with each thread in the thread group. Figure 5.2.1.2 shows an example of properties set for an HTTP Request Sampler that sends requests to port number 8887 on IP 192.168.1.104. The body data can also be set; this is, however, something that we could not show without risking infringement of company policy. It is also possible to generate aggregated reports. This type of report forms the basis for presenting the results of the tests performed in the local environment.

5.2.2 Heroku Metrics

When running the tests on Heroku, we used JMeter for sending the requests, but not for measuring the application’s performance. This was due to JMeter having a different measurement of throughput, based on the number of samples divided by the total time of the test. This meant that the time for the request to be sent to, and received by, the server, and the time for the response to be sent to, and received by, the client, were included in the measurement as well. This was an acceptable measurement in the local environment, since the distance between client and host was small. With the application deployed on an external host, however, we had to take into consideration that there might be a significant distance between the client and the host. Therefore, it was decided that these transport times should not be a part of the application’s performance evaluation. In order to measure the application’s performance as accurately as possible, it was important to do the measurements as close to the application as possible. This could be done by relying on the metrics tool which Heroku has made available for developers. The tool consisted of a collection of graphs covering the same measurements for the application as those retrieved from the JMeter reports used in the previous section, namely throughput, average and median response times, and error rates.
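The client-side throughput measure described above amounts to a single division. The sketch below uses a hypothetical function name and hypothetical durations to show why added transport time lowers the figure even when the application itself performs the same.

```javascript
// Throughput as measured from the client: samples divided by total test time.
// The total time includes the transport of requests and responses.
function throughputPerSecond(sampleCount, totalTimeMs) {
  return sampleCount / (totalTimeMs / 1000);
}

// Hypothetical durations, for illustration only:
// 15000 samples completed in 150 s, measured close to the server
const nearServer = throughputPerSecond(15000, 150000);

// The same 15000 samples, but with the test taking 180 s as seen
// from a distant client because of transport time
const distantClient = throughputPerSecond(15000, 180000);
```

With these illustrative numbers, the measured throughput differs between the two cases purely because of where the measurement was taken, which is the reason for measuring as close to the application as possible.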

5.3 Creating the Test Plan

The testing procedures of the application followed a pattern where, for each type of task (I/O or CPU bound), the number of requests sent to the application was gradually increased in order to evaluate how well the application performed different tasks at different traffic rates. The load rate for each test varied between 10 and 100 requests per second. Rates of 10 to 25 requests per second were to simulate low traffic, 25 to 50 medium traffic, and 50 to 100 high traffic. For all tests, 15000 samples were sent to the application.

Figure 5.3.1: Example of six thread groups each containing one HTTP Request Sampler

In order to test each application sequentially, there was one thread group (see figure 5.3.1) for each number of workers available for each application. Within each thread group, we had specified an HTTP Request Sampler (described in section 5.2.1) sending to that application's specific endpoint. As mentioned earlier, regardless of whether the application was running locally or remotely on Heroku, we set up the same samplers in each of the thread groups, only changing the names of the thread groups and the host URL for the samplers.

5.4 Local Tests

This section describes the evaluation of the application’s capabilities in performing I/O and CPU bound tasks in the local environment. The focus was on differences in throughput between tests, but average and median response times are also noted.

5.4.1 I/O Bound

Starting with the I/O bound tests and sending in 10 requests per second (see figure 5.4.1.1), it was noted that the results were similar between the thread groups, no matter the number of worker processes used. Both the average and the median response time were close to the same for each of the thread groups. The throughput (number of requests handled by the application per second) was also similar between the thread groups.

Label                            # Samples  Average (ms)  Median (ms)  Error %  Throughput (rps)
10/s With Cluster, 2 workers     15000      317           308          0.00%    30.8
10/s With Cluster, 4 workers     15000      308           309          0.00%    31.7
10/s With Cluster, 16 workers    15000      307           308          0.00%    31.8
10/s Without Cluster             15000      307           307          0.00%    31.9
10/s With Cluster, 8 workers     15000      307           308          0.00%    31.9
10/s With Cluster, 1 worker      15000      307           307          0.00%    31.9

Label                            # Samples  Average (ms)  Median (ms)  Error %  Throughput (rps)
100/s With Cluster, 16 workers   15000      306           305          0.00%    309.3
100/s With Cluster, 4 workers    15000      305           305          0.00%    310.3
100/s Without Cluster            15000      305           305          0.00%    310.8
100/s With Cluster, 8 workers    15000      305           305          0.00%    310.9
100/s With Cluster, 2 workers    15000      306           305          0.00%    311.5
100/s With Cluster, 1 worker     15000      306           305          0.00%    312.0

Figure 5.4.1.1: Results from local I/O bound tests at 10 and 100 requests per second (sorted by throughput)

Looking at the results from the test where 10 requests were sent per second, there is barely a difference between the application without clustering and the ones utilizing it. The results from the other test (100 requests per second) were the same. The difference in throughput between the thread group without clustering and the highest performing thread group with clustering was ~0.4%, which is not a significant difference (and does not pass the bar of 20%).

To summarize, when the application was performing I/O bound tasks in a local environment, the test results did not show a significant difference in terms of throughput, and none of the results passed the bar of an increased throughput of 20%. In accordance with the hypotheses (described in section 3.3), no increase in throughput could be seen when the application performed I/O bound tasks. Following the research process laid out for this thesis (section 3.2.6), in which only tests that showed an increase in the local environment were moved onto Heroku, the I/O bound tests were therefore not carried over to Heroku. All of the results obtained from testing the application’s capabilities of performing I/O bound tasks in the local environment can be found in Appendix 5.

5.4.2 CPU Bound

As can be seen in figure 5.4.2.1, when simulating low traffic (10 requests per second) the results showed a significant difference between the thread groups. When comparing the thread group without clustering to the highest performing one with clustering (4 workers), the throughput differed by ~24.4%. A decrease of 25% in average and median response times could also be noted between the two thread groups.

Label                            # Samples  Average (ms)  Median (ms)  Error %  Throughput (rps)
10/s With Cluster, 1 worker      15000      6             6            0.00%    787.5
10/s Without Cluster             15000      4             4            0.00%    827.6
10/s With Cluster, 16 workers    15000      4             3            0.00%    892.9
10/s With Cluster, 8 workers     15000      3             3            0.00%    972.3
10/s With Cluster, 2 workers     15000      3             3            0.00%    994.6
10/s With Cluster, 4 workers     15000      3             3            0.00%    1029.9

Label                            # Samples  Average (ms)  Median (ms)  Error %  Throughput (rps)
50/s With Cluster, 1 worker      15000      51            53           0.00%    815.5
50/s Without Cluster             15000      37            39           0.00%    954.5
50/s With Cluster, 2 workers     15000      25            27           0.00%    1211.9
50/s With Cluster, 16 workers    15000      15            15           0.00%    1269.7
50/s With Cluster, 4 workers     15000      16            15           0.00%    1320.3
50/s With Cluster, 8 workers     15000      13            13           0.00%    1355.9

Figure 5.4.2.1: The results of the CPU bound tests at 10 and 50 requests per second (sorted by throughput)

Moving on to the test where requests were sent to the application at 50 requests per second, the results showed an increase in throughput of ~42.1% when comparing the application without clustering to the highest performing clustered application (8 workers). In this case, the decrease in average response time between the two thread groups was ~64.9%, and the decrease in median response time was ~61.5%.

Label                            # Samples  Average (ms)  Median (ms)  Error %  Throughput (rps)
100/s With Cluster, 1 worker     15000      107           110          0.00%    793.6
100/s Without Cluster            15000      82            86           0.00%    926.8
100/s With Cluster, 16 workers   15000      19            18           0.00%    1151.8
100/s With Cluster, 2 workers    15000      63            67           0.00%    1208.1
100/s With Cluster, 4 workers    15000      49            52           0.00%    1278.3
100/s With Cluster, 8 workers    15000      39            44           0.00%    1326.5

Figure 5.4.2.2: The results of the CPU bound tests at 100 requests per second (sorted by throughput)

Looking at the test where requests were sent in at 100 requests per second (see figure 5.4.2.2), an increase in throughput of ~43.1% could be noted between the non-clustered and the best performing clustered application (8 workers). In this case, a decrease of ~52.4% in average and ~48.8% in median response times could also be seen. Lastly, it was noted that the application with 1 worker performed ~4.8% to 14.4% lower than the application not implementing clustering.

In conclusion, the results obtained from testing the application’s capabilities of performing CPU bound tasks locally showed increases in throughput above the bar of 20%. Because the results showed this increase, CPU bound tests were to be conducted in the Heroku environment as well. For the application utilizing only 1 worker, however, the throughput was ~4.8% to 14.4% lower than when not utilizing clustering at all. The full results of the CPU bound tests can be seen in Appendix 6.
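The percentage figures quoted in this section follow from the throughput and response time columns of the tables. A brief sketch of the calculation, with hypothetical function names, is given here; the 20% bar is the threshold used throughout the thesis.

```javascript
// Relative change from a baseline value, in percent.
function percentChange(baseline, value) {
  return ((value - baseline) / baseline) * 100;
}

// Round to one decimal place, as in the figures quoted in the text.
function roundTo1(x) {
  return Math.round(x * 10) / 10;
}

// An increase counts as a positive result if it reaches the 20% bar.
const BAR_PERCENT = 20;
function passesBar(baseline, value) {
  return percentChange(baseline, value) >= BAR_PERCENT;
}

// Throughput (rps): without clustering vs. best clustered version
const at10 = roundTo1(percentChange(827.6, 1029.9)); // 10 rps, 4 workers
const at50 = roundTo1(percentChange(954.5, 1355.9)); // 50 rps, 8 workers
const at100 = roundTo1(percentChange(926.8, 1326.5)); // 100 rps, 8 workers

// Average response time (ms) at 100 rps: a negative value is a decrease
const avgAt100 = roundTo1(percentChange(82, 39));
```

Evaluated on the table values, this gives 24.4, 42.1 and 43.1 for the three throughput comparisons and -52.4 for the average response time at 100 requests per second, matching the figures in the text.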

5.5 Heroku Tests

This section presents the evaluation results of the same application used in the local tests, but deployed on the Heroku platform instead. Worth noting is that it was decided not to test the I/O bound function on Heroku, as the results obtained from the local tests indicated that the cluster package did not contribute to any increase in throughput in that environment. Here, the test results are based on the output of the Heroku metrics. Each bar in the diagrams represents the performance of the application during a given minute in time. The vertical line apparent in most of the diagrams (e.g. figure 5.5.1.1) marks a specific minute, chosen by us for analysis. This is the minute with the highest throughput value obtained during each test. All test results in this section are based on results from when the application was performing CPU bound tasks on Heroku.

5.5.1 Throughput Rates

The results obtained from simulating low traffic for an application deployed on the Heroku platform did not show any significant difference. In figure 5.5.1.1 it can be seen that the throughput is about the same (~10000 requests/min) for an application without clustering and for the highest performing application with clustering (4 workers).

Figure 5.5.1.1: A comparison in throughput between tests at 10 rps without clustering (upper) and with

Continuing with the results obtained from the tests running at 50 requests per second (see figure 5.5.1.2), a difference of ~20.6% in throughput could be noted when comparing the application without clustering to the best performing application with clustering (4 workers). This passes the bar of 20% set out as the prerequisite for being considered a positive result.

Figure 5.5.1.2: A comparison in throughput between tests at 50 rps without clustering (upper) and 4 (lower) workers

When looking at the results obtained from the tests running at 100 requests per second (see figure 5.5.1.3), a significant difference in throughput could be seen. Here, there is a difference of ~54.9%, which also passes the bar of 20%.