
Investigate potential performance improvements in a Real Time Clearing system using Content Delivery Network-based technology

Henrik Erskérs

Degree Project in Computing Science Engineering, 30 ECTS Credits
Spring 2019

Supervisor: Jan Erik Moström
External supervisor: Pål Forsberg
Examiner: Henrik Björklund

Master of Science Programme in Computing Science and Engineering, 300 ECTS Credits


Abstract

The demands from customers and users on systems are constantly increasing, calling for faster and more robust solutions. Clearing technology in the financial industry is no exception, as the system needs to handle information quickly and be responsive. This thesis has investigated the potential performance improvements of using Content Delivery Network (CDN) based technology on Cinnober's Real Time Clearing system, RTC. The work is based on constructing a functional CDN node and examining the impact it has on the system with regard to handling reference data: could this approach improve the performance and scalability of the system with respect to reference data handling? Based on the results gathered by comparing the node implementation with the original system, there are clear indications of performance improvements. With the CDN node implementation, the fetch time when requesting reference data was reduced by a factor of 6. Using the results gathered in the thesis, a simulation was created to estimate the effect of a fully scaled CDN. The simulation concluded that the implementation could reduce latency by 44 minutes during a day of use.


Acknowledgements

First of all, I would like to thank Cinnober for giving me the opportunity to write the thesis at their office, and for the help they provided me in developing it. A special thanks to my supervisor Pål Forsberg at Cinnober for helping me with all the uncertainties and problems I encountered during the thesis. I would also like to thank my supervisor Jan Erik Moström at the university for helping me with questions regarding the thesis and giving feedback on the report. Lastly, I would like to thank Linn, family and friends for proofreading the thesis and discussing different ideas and questions.


Abbreviations

CDN - Content Delivery Network
RTC - Real Time Clearing system
SharedData - A collection of reference data in the system
DDoS - Distributed Denial of Service
CSP - Cloud Service Provider
PoP - Point of presence
Legacy code - Code that is not written from scratch; functionality is added on an existing code base


Contents

1 Introduction
1.1 Background
1.1.1 Real Time Clearing System
1.2 Purpose
1.3 Goal
1.4 Limitations
1.4.1 Testing
2 Methods
2.1 Research
2.2 Simulations
2.3 Testing
2.3.1 Scripts
2.3.2 Scenarios
2.3.3 Performance testing
2.3.4 Monitoring tests
2.4 Environment setups
2.5 Analysis and discussion
2.6 Evaluation methods
3 Theory
3.1 Cloud
3.2 Content Delivery Network
3.2.1 CDN Node
3.2.2 Scaling
3.2.3 Benefits
3.2.4 Disadvantage
3.3 Data caching
3.3.1 Caching Strategies
3.3.2 Eviction policies
3.4 Reference Data
3.4.1 Account
3.4.2 Tradable Instruments
4 Implementations
4.1 Node implementation
4.2 Monitoring implementations
4.2.1 Monitoring by logging
4.2.2 Monitoring by process observations
4.3 Simulations and tests
4.3.1 CDN node active in Stockholm
4.3.2 CDN node active in Stockholm and Washington D.C
5 Result
5.1 Scenarios
5.1.1 Time to retrieve data from SharedData with 100% miss (No caching)
5.1.2 Time to retrieve data from SharedData with CDN node
5.1.3 Test the miss ratio on the caching solution
5.2 Measure the load on the origin server (SharedData)
6 Discussion
6.1 Analyze results
6.2 Fully scaled CDN implementation
6.2.1 Content Delivery Network Scenario
7 Conclusion
8 Future work


1 Introduction

The traffic load on the global internet increases each year, with a higher amount of data circulating and users demanding faster response times from services. This thesis looks at one approach to reducing server load and improving the throughput of user requests.

1.1 Background

A connection to the internet is something that many think of as a human right. We adapt more and more things to the internet, and with that the amount of traffic over the internet is constantly increasing. Total internet traffic has seen dramatic growth during the past two decades. In 1992 the global internet carried approximately 100 GB of traffic a day; ten years later, in 2002, that figure was 100 GB per second; and in 2017 the traffic amounted to 46,600 GB per second [4]. The trend indicates that it will continue to grow: the Cisco Visual Networking Index: Forecast and Trends estimates that traffic will reach 150,700 GB per second by the year 2022 [4]. Since most of the transferred data is static [4], different technologies have emerged to reduce the load on the networks; one is the Content Delivery Network, or CDN for short. The technology aims to handle the increasing traffic by eliminating unnecessary paths over the network when retrieving data.

1.1.1 Real Time Clearing System

Cinnober has a system for clearing of financial transactions, TRADExpress™ Real Time Clearing (RTC), which is used by a clearing house, a type of financial institution. Given a trade on an exchange, the clearing house inserts itself as the counterparty to both the buyer and the seller. In other words, the clearing house acts as the buyer for the seller and the seller for the buyer. In doing so, each trading party is exposed only to the clearing house, so traders do not need to care about who they are trading with. RTC is a distributed system that spans anywhere from 5 to 50 Java processes depending on the implementation. Its software design aims to excel in robustness, low latency and high throughput for clearing house real-time operations and calculations.


1.2 Purpose

RTC contains a master of reference data called Shared Data, which clients of the system currently access directly. To relieve stress on RTC and increase throughput, a content delivery network solution is to be investigated. The purpose of the project is to apply the basics of CDN technology to a local point in the system and, in a local sandbox environment, observe the effects on performance and robustness: how can CDN technology be applied to a clearing system to improve throughput and reduce stress on the system? The thesis will investigate different approaches to CDN solutions, determine which is best suited for the problem, and tailor that solution to the RTC system by modifying the caching to meet the requirements of the system and obtain optimal performance.

1.3 Goal

The goal of this thesis is to implement a solution based on CDN technology and investigate the difference between the solutions with regard to the system's throughput and load. The thesis aims to address the following questions:

• Can a CDN implementation on the RTC system improve throughput of reference data? By how much?

• Can a CDN implementation to the RTC system relieve stress?

• Can the implementation improve robustness and scalability of the system?

• Is the CDN implementation beneficial for the RTC system?

1.4 Limitations

The thesis will only cover the implementation of one edge node (CDN node). The reason is that Cinnober does not want to execute their code on various cloud platforms without further investigation into what is deployed in the cloud and how. Based on that, Cinnober agreed that the thesis should cover the impact of one node rather than a complete network. This limitation is also based on available resources, such as time and relevance: given one complete node, a network can be created, but because of the limited resources and the added complications of a full implementation, the focus lies on implementing one node and testing whether the solution fulfills the goals.

1.4.1 Testing

Because of the limitations, the geographical significance of the edge nodes cannot be tested in a real environment. Thus the node created will be tested in different scenarios to simulate the geographical impact.


2 Methods

This section will cover the different methods used to implement the solution.

2.1 Research

To answer the questions formulated in section 1.3 about the thesis goals, literature studies will be used to support or reject claims about the more abstract questions, such as robustness, scalability and the general benefits to the system of using this technology.

2.2 Simulations

As mentioned in section 1.4, there are limitations to the thesis. Because of these limitations it was not possible to set up the system in a complete cloud-based environment; instead, the implementation was created and tested in a local environment. Simulations are applied to the results to demonstrate the effects of running in a cloud environment. The simulations are based on modifying the user's location and the geographical impact on retrieval times towards the system. Latency statistics are used to calculate the result [8]: the values measured from the different tests are recalculated with added latency based on where the user is located, to illustrate the retrieval times for users interacting with the system. The latency statistics from [8] provide geographical ping statistics for the world. Because the requests sent to the system when requesting reference data are small, and ping requests do not differ much in size from them, ping statistics provide a good measurement of the latency and were an ideal option for simulating parts of the measured values. The purpose of the simulations is to provide a better understanding of how the results would look if the system were using CDN technology.
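In practice, the recalculation amounts to adding a location-dependent round-trip time to each locally measured fetch time. The sketch below is a minimal illustration of this post-processing; the class and the RTT values are hypothetical stand-ins for the ping statistics from [8], not actual thesis code.

```java
import java.util.Map;

// Minimal sketch: adjust locally measured fetch times with simulated
// geographical round-trip latency. The RTT values are placeholders for
// the ping statistics from [8], not measured figures.
public class LatencySimulation {

    // Hypothetical round-trip times (ms) from each user location to Stockholm.
    private static final Map<String, Double> RTT_MS = Map.of(
            "Stockholm", 0.0,
            "Washington D.C", 102.3); // placeholder value

    // A locally measured fetch time plus the simulated network round trip.
    static double simulatedFetchTimeMs(double measuredMs, String userLocation) {
        return measuredMs + RTT_MS.getOrDefault(userLocation, 0.0);
    }

    public static void main(String[] args) {
        double measured = 1.6; // measured fetch time per request in Stockholm (ms)
        System.out.printf("Simulated fetch time from Washington D.C: %.1f ms%n",
                simulatedFetchTimeMs(measured, "Washington D.C"));
    }
}
```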

2.3 Testing

To answer the questions about throughput, different tests were constructed to compare the results of using the implementation with not using it. When testing the relief of stress on the system, different types of stress tests were made to measure the effects of the implementation. During the stress tests, both the load on the master and the load on the CDN node were measured. All of the different tests and scenarios were run both with and without the implementation, to better compare the solutions and draw conclusions. For a more detailed description of the testing methods, see the sections below.

2.3.1 Scripts

Scripts were used to simulate many users interacting with the system and to run the different scenarios. Monitoring scripts were created to measure the load on different services during the scenarios, to clarify the effects of the implementation.

2.3.2 Scenarios

To get a better understanding of the performance of the implementation, a number of scenarios were created. The purpose of the scenarios was to find and simulate common interactions and use cases on the system.

Time to retrieve data from SharedData

The speed of retrieving data from SharedData is key to keeping the system as fast and responsive as possible. Therefore this test was created to measure the time to retrieve data with and without the CDN node implementation, e.g. 10,000 requests to retrieve accounts, where each request is timed. The scenario gives an indication of whether the implementation improves the retrieval speed of the system.
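A minimal sketch of how such a timed run can be scripted is shown below. The client interface is a hypothetical stand-in for the system's API instance, not actual RTC code.

```java
// Minimal sketch of the timing scenario: issue N account requests and record
// the duration of each. AccountClient is a hypothetical stand-in for the
// system's API instance.
public class RetrievalTimingTest {

    interface AccountClient {
        Object getAccount(long id); // assumed blocking request/response call
    }

    static long[] timeRequests(AccountClient client, int n) {
        long[] durationsNs = new long[n];
        for (int i = 0; i < n; i++) {
            long start = System.nanoTime();
            client.getAccount(i % 5000); // cycle over a fixed set of account ids
            durationsNs[i] = System.nanoTime() - start;
        }
        return durationsNs;
    }
}
```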

Simulate the effects of a fully scaled CDN

Because this thesis does not cover a fully scaled CDN implementation, the purpose of this scenario is to simulate how one would look and perform. Based on the data provided by the other scenarios, this scenario simulates the potential effect of a fully scaled implementation. The scenario provides a better understanding of how the CDN would operate in terms of performance and cost.

Test the miss ratio on the caching solution

For the CDN node that was created, this scenario investigates the impact of the miss ratio of the cache: how much of the content can be stored in the node, and how the performance varies with the amount cached. To test this, the CDN node logs the current caching percentage for each request it receives, and the user process times each request. With this information, the time for each request can be compared against the caching percentage. The scenario provides indicators of how to tune the caching percentage to reach optimal performance, but also a foundation for weighing the cost of having more data stored in the cache against the performance improvements.

Measure the load on the origin server (SharedData)

Because of SharedData's central role in the system, it is important to keep its load as low as possible. This scenario investigates the impact of the implementation on the load of SharedData. The scenario monitors the system and inspects the logs of the origin server to determine what kinds of requests it receives and how long it takes to handle them. The scenario was created to help answer one goal of the thesis: "Can the implementation relieve stress to the system?"

2.3.3 Performance testing

One of the questions the thesis set out to answer concerns the performance aspect of the implementation. Financial systems place heavy emphasis on performance: if information is not updated and distributed instantly, people may unknowingly take risky positions based on old information. Therefore, a lot of emphasis has been put on creating tests that illustrate and measure the performance with and without the implementation.

2.3.4 Monitoring tests

One important aspect of running the tests is getting the relevant data from them. To secure the test data from the test scenarios, two different gathering methods were used.

Monitoring by logging

The first way of gathering information from the tests was to add logging at various locations in the system, from logging the current caching percentage to the time to retrieve one request. Most of the logging was done in the node implementation or in modified Java classes in the RTC system.
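As a sketch of this kind of instrumentation: the snippet below appends one line per request to a file. The file name and record format are illustrative assumptions, not the actual RTC log layout.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Minimal logging sketch: append one line per request with the measured
// duration and the node's current cache fill percentage.
public class RequestLogger {

    private static final Path LOG_FILE = Path.of("cdn-node-requests.log");

    static synchronized void log(String requestType, long durationNs, double cachePercent) {
        String line = String.format("%s,%d,%.2f%n", requestType, durationNs, cachePercent);
        try {
            Files.writeString(LOG_FILE, line,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new RuntimeException("Failed to write log entry", e);
        }
    }
}
```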

Monitoring by process observation

The second way of collecting information from the different test scenarios was process observation. When running the complete system, many of the processes that make up the system affect each other. To get a good understanding of how the implementation affected the different processes in the system, they were monitored while the scenarios ran.

2.4 Environment setups

The tests described below were run in a local environment with the following hardware:

• Memory: 32 GB

• Processor: Intel® Xeon® CPU E5-2630 @ 2.30GHz × 12

• Disk: 235.2 GB

• Operating system: Ubuntu 18.10 (64-bit)

The different environments described in the scenarios are simulated, e.g. with simulated delay, both in the form of latency corresponding to geographical distance and as scenarios with different bandwidth.


2.5 Analysis and discussion

The results from the different test scenarios will be analyzed to better understand the effects of the implementation, and how Cinnober can use that information to decide whether to move forward with the implementation or not. The discussion will cover how to interpret the results from the different scenarios, what information can be used moving forward, and which aspects the scenarios have not taken into account.

2.6 Evaluation methods

The results will be evaluated partly based on the performance of the solution in different aspects compared to the original implementation. Aside from the more measurable results, the more abstract results and conclusions will be evaluated based on correlation with literature studies and papers.


3 Theory

3.1 Cloud

"Cloud" can mean different things; it is a broad categorization of many different functionalities. People without a computing science background might think of the cloud as something like image storage, or backups for computers and mobile devices. That is not wrong, one big aspect of the cloud is storage, but that is just scratching the surface of what the cloud really is. Instead of thinking of the cloud as storage, think of it as data centers located around the globe, containing large quantities of servers that can be used as you like. Companies like Google, Amazon and Microsoft offer access to their data centers, often on a "pay-as-you-go" model where you only pay for the resources that you use [2]. This provides a more robust and scalable option for companies than buying and maintaining the servers themselves.

3.2 Content Delivery Network

A so-called content delivery network (CDN) is a highly distributed, cloud-based platform of servers optimized to deliver content to users as fast as possible. These networks are commonly used for hosting websites as well as static content such as images and videos. The popularity of these networks continues to grow, and the majority of internet traffic is served through CDNs, including sites like Facebook, Netflix and Amazon [5]. The basic idea behind the technology is simple. The system illustration in Figure 3.1 shows how a company could set up its network: there are numerous CDN nodes, or PoPs (points of presence), in different geographical locations, and these nodes are the key component of the technology. When a user interacts with the system, the user connects to a CDN node that is geographically close to the user. The CDN node works as an extension of the origin server, storing the most commonly requested information, and its contents change dynamically based on the patterns of the user requests towards that node. Using this structure, a company like Netflix can reach users globally and scale to new regions simply by adding additional CDN nodes in those regions. The CDN nodes handle most of the incoming traffic and provide faster response times for users requesting information, and the network reduces load on the origin servers because it eliminates an overflow of connections to the origin.

3.2.1 CDN Node

The key component in a content delivery network is the edge nodes, or PoPs. These nodes are essentially lightweight images of the origin server. The nodes' purpose is to store content that is frequently requested by the users, and to reduce the connections and traffic load towards the origin server. If the edge node does not contain the information that the user is requesting, it uses the private network to retrieve the information from the origin server (see Figure 3.1).

Figure 3.1: Illustration of a CDN structure
Source: http://www.liberaldictionary.com/wp-content/uploads/2019/02/cdn-3983.jpg

3.2.2 Scaling

One of the key aspects of a CDN, or of having a service running in the cloud, is the scaling possibility of the service. Depending on the incoming traffic, the service can scale up or down to match the user requests and reduce the cost of the hosted service [12]. Say that Cinnober has their system live in Europe and in Asia; depending on the time of day, the traffic in the regions will differ. This enables Cinnober to optimize the resources spent on the system: when it is night time in Europe the traffic is low, so they can scale down the service there and put those resources in Asia, where it is morning and the traffic is much higher.


3.2.3 Benefits

Running a service on a content delivery network is in most cases very beneficial; the different things a CDN can contribute are the following [1]:

• Protect the website/service from DDoS attacks

• Reduce bandwidth consumption

• Handle high traffic load

• Improve load speed

These benefits could be very useful for the system: one study found that as little as one second of delay waiting on a service/website decreases customer satisfaction by 16%, and 40% of users will abandon a website if it takes more than three seconds to load [9]. The network can also help improve the fault-tolerance of the system: users can continue to use the system even if the node they are connected to shuts down. The users would be redirected to another node, or to the origin, and would not notice any difference besides the added latency of connecting to the closest available node.

3.2.4 Disadvantage

As with all technologies, there are downsides to using them, and CDNs have a few as well [3]:

• Additional complexity of the system

• Additional cost

• Geo-location limitations

Adding new things will usually add complexity to the system, but that does not have to be a bad thing; depending on what you get in return, one needs to decide whether the solution is beneficial to the system or not. When the new technology is a CDN, the cost needs to be taken into account as well: having your system, or parts of it, executing on a CDN is not free. One option is to use one of the CDN providers, such as Akamai or Cloudflare, which handle the scaling and geo-location distribution of your program. Another option is to run the service through a Cloud Service Provider (CSP) such as Google Cloud or Amazon AWS, where you decide how the scaling should work and at which geo-locations instances of the service should be located. But by using a CSP or a CDN provider, there is a limitation on the geo-location of your deployment: you are limited by their range of deployment areas. For example, by creating your own CDN implementation of some aspect of the system and deploying it on Google Cloud, you have limited reach in a country like Russia [7], and that needs to be taken into account when deciding how to set up a CDN, or whether it is beneficial at all.


3.3 Data caching

The data in a cache is generally stored in fast-access memory such as RAM (random-access memory). The primary purpose of caching is to increase data retrieval performance by reducing the need to access the underlying, slower storage layer. The benefits of caching are increased read throughput, reduced load on the backend, and the elimination of database hotspots [13]. The average memory reference time is [10]:

T = m × T_m + T_h + E    (3.1)

Hit ratio = H / (H + M)    (3.2)

where

m = miss ratio = 1 − (hit ratio)
T_m = time to make a main memory access when there is a miss
T_h = the latency: the time to reference the cache
E = various secondary effects, such as queuing effects in multiprocessor systems
H = cache hits
M = cache misses

We want to achieve as small an m as possible to minimize the time spent accessing main memory.
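As a worked illustration of equation 3.1, with assumed numbers rather than measurements: for a hit ratio of 0.98 (m = 0.02), T_m = 1.6 ms, T_h = 0.2 ms and E ≈ 0,

T = 0.02 × 1.6 + 0.2 ≈ 0.23 ms

which is in line with the 0.2-0.25 ms retrieval times observed at 98-100% hit rates in section 5.1.3.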

3.3.1 Caching Strategies

There are many different types of caching strategies; below are the two most interesting for this project.

Cache-Aside

This strategy is also referred to as lazy loading, because it loads data lazily on the first read. In this strategy the application talks to both the cache and the database (see Figure 3.2). The design works as follows: when the service requests some data, it first checks whether the cache has it, and there are two possible scenarios (depicted in Figures 3.2 and 3.3). Either the cache returns the data and the service can return it instantly, or the cache does not have the data; in that case the service retrieves the data from the database, or wherever the data is stored, and then saves it to the cache.
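A minimal sketch of the cache-aside flow is shown below, assuming a generic key-value interface; the types are illustrative, not from RTC.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Minimal cache-aside sketch: the service itself talks to both the cache and
// the backing store. The in-memory map stands in for a real cache, and the
// loader function stands in for the database access.
public class CacheAside<K, V> {

    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> database; // stand-in for the real data source

    public CacheAside(Function<K, V> database) {
        this.database = database;
    }

    public V get(K key) {
        V value = cache.get(key);          // 1. check the cache first (hit path)
        if (value == null) {
            value = database.apply(key);   // 2. miss: load from the database
            cache.put(key, value);         // 3. populate the cache for next time
        }
        return value;
    }
}
```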

Read Through

This strategy is similar to the cache-aside strategy; both are of the lazy-loading type, meaning that data is only loaded once it is requested. The main difference is that the service always goes through the cache, as depicted in Figure 3.4. The service requests some data from the cache; depending on whether the cache has the data or not, it either returns it directly or requests it from the database. If the data is not present in the cache, the request simply has a prolonged response time. The main difference between the two strategies is thus that in cache-aside the service is responsible for communicating with the cache and the database, while in read-through the cache is responsible for communicating with the database. A benefit of the read-through strategy is that the cached data cannot diverge from that of the database.

Figure 3.2: Communication path in a hit scenario using cache-aside

Figure 3.3: Communication path in a miss scenario using cache-aside

Figure 3.4: Communication paths on cache hit and miss in the read-through strategy


3.3.2 Eviction policies

Every cache has a finite amount of memory at its disposal, and if the cache uses all of its given memory pool, action needs to be taken. The rules for this are called eviction policies, and two such policies are described below.

Least Recently Used (LRU)

This caching algorithm keeps recently used items near the top of the cache. Whenever a new item is accessed, LRU places it at the top of the cache. When the cache limit has been reached, the items that have been accessed least recently are removed, starting from the bottom of the cache. This can be an expensive algorithm to use, as it needs to keep "age bits" that show exactly when each item was accessed.
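In Java, a small LRU cache can be sketched with LinkedHashMap in access order; a minimal illustrative example (not RTC code):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache sketch: LinkedHashMap in access order moves each accessed
// entry to the end, and removeEldestEntry evicts the least recently used
// entry once the capacity is exceeded.
public class LruCache<K, V> extends LinkedHashMap<K, V> {

    private final int capacity;

    public LruCache(int capacity) {
        super(16, 0.75f, true); // accessOrder = true gives LRU ordering
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict when the cache grows past capacity
    }
}
```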

Least Frequently Used (LFU)

The LFU algorithm uses a counter to keep track of how many times an entry has been accessed, so that when entries are removed from the cache, the entry with the lowest count is removed first.

3.4 Reference Data

Reference data is information that is used to structure and constrain other information. Examples are π in mathematics, or a structure for calendars such as a list of valid months and days of the week. This form of data rarely changes and can therefore be cached to improve retrieval of information. In computer science, reference data is often defined as a special subset of master data, where master data is where all the information is stored and changes. This subset is used for classification, like postal codes, financial hierarchies or countries. The two primary types of reference data that will be looked at during the project are described below.

3.4.1 Account

The accounts that exist within a clearing system do not change very often. Each account follows the same structure; only the content within them differs. Because of this, the account structure is going to be cached.

3.4.2 Tradable Instruments

The different types of tradable instruments of a clearing house rarely change, and they contain the structure that each instrument has, so instead of retrieving that information every time, it will be cached.


Table 3.1: The structure of an account

Account:
State
Clearing Member Code
Trading Member Code
Name
ID
Classification
Omnibus/ISA
Gross/Net
Automatic Give Up Trading Member Code
Automatic Give Up Trading Member
Automatic Give Up Customer Confirmation Code
Automatic Give Up Supplementary Code

Table 3.2: The structure of a tradable instrument

Tradable Instruments:
State
Instrument ID
Name
Symbol
Underlying ID
Instrument Type
Currency


4 Implementations

The implementations described below are based on the information provided in sections 2.3, 3.2.1 and 3.3, which describe the different aspects that the solution needs to consist of and handle. The implementation can be broken up into two parts: the implementation of the CDN node, and the code to monitor the node and the overall system.

4.1 Node implementation

Because the implementation was done on legacy code provided by Cinnober, the complete node was not implemented from scratch during the thesis. Instead, modifications were made to the collection of classes that make up their API instance, and the classes missing for the implementation to work were created. When working with legacy code there can be limitations constraining the possibilities of adding new functionality, and some approaches were not optimal because of the structure of the existing code base. Modifications were made to the API instance to make it usable as a CDN node, primarily by creating a caching layer, depicted in Figure 4.1. As seen in Figure 4.1, the caching layer intercepts the data flow of the API instance, and because of the added complexity of adding functionality to legacy code, the read-through strategy shown in Figure 3.4 was the best option when implementing the caching layer.
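A minimal sketch of such a read-through layer intercepting requests is shown below; the interfaces are hypothetical stand-ins for the RTC API classes, not the actual implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal read-through sketch of the caching layer: all calls go through the
// cache, which loads from the origin (SharedData) only on a miss. The
// interfaces are hypothetical stand-ins for the RTC API classes.
public class ReadThroughCachingLayer {

    interface ReferenceDataSource {
        Object fetch(String key); // e.g. a request towards SharedData
    }

    private final Map<String, Object> cache = new ConcurrentHashMap<>();
    private final ReferenceDataSource origin;

    public ReadThroughCachingLayer(ReferenceDataSource origin) {
        this.origin = origin;
    }

    // The caller never talks to the origin directly; the cache is responsible
    // for loading missing entries, which is what distinguishes read-through
    // from cache-aside.
    public Object get(String key) {
        return cache.computeIfAbsent(key, origin::fetch);
    }
}
```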

4.2 Monitoring implementations

One important aspect of the work was the monitoring functionality. Even though it did not contribute to the implementation itself, it enabled observing the effects of the implementation on the system.

4.2.1 Monitoring by logging

Logging run-time information was the biggest source of information gathering; relevant information was written to file from different Java classes. The logged files could later be used when running the simulations, by modifying the values based on the scenario. The logged files were also used to compare different measured values against each other.

4.2.2 Monitoring by process observations

The other source of information gathering was process monitoring scripts. These scripts monitored different key processes in the system to observe how they behaved while the different scenarios were run.


Figure 4.1: Data flow of the implemented CDN node

4.3 Simulations and tests

To understand how the system behaves in its original state, the scenarios mentioned in section 2.3 were run on the system. The results gathered from these tests provide a solid reference point to compare with the results from the CDN node implementation. With the two test cases described below, the goal is to provide a solid foundation for illustrating the effects of adding one CDN node, by showing the retrieval times as well as the load on the origin server, and the potential benefits of being able to add new CDN nodes in different geographical locations.

4.3.1 CDN node active in Stockholm

To illustrate the effect of an active CDN node, one node was started in Stockholm. The scenarios were run to investigate the differences in system response time. The test illustrates the performance for users based in Stockholm and Washington D.C. The performance from Washington D.C is calculated from the results measured in Stockholm with a latency delay added, as mentioned in section 2.2. The test shows the impact of running one CDN node in the system.


4.3.2 CDN node active in Stockholm and Washington D.C

This test is a simulated extension of the previously described test. In this test another CDN node is active in Washington D.C and becomes the access point to the system for users located in that area. The goal of the test is to illustrate the effect on retrieval time for the users located in the Washington D.C area.


5 Result

The main goal of the different scenarios is to produce results that can be used to answer the goals set for the master's thesis, or to provide data from which conclusions can be drawn. This section covers the measured values from testing the different scenarios described in section 2.3.

5.1 Scenarios

The results from the different scenarios tested are described and analysed below.

5.1.1 Time to retrieve data from SharedData with 100% miss (No caching)

When running tests without any modifications to the system, we can see in Figure 5.1 that retrieving 10,000 Accounts from SharedData takes around 15-17 seconds, which means that retrieving one Account takes around 0.0016 seconds, or 1.6 milliseconds (ms). This assumes the user interacting with the system is located in Stockholm, where the node is active. When comparing the impact of the user location in Figure 5.2, we can clearly see in Figure 5.2a the added latency when sending from Washington D.C: the average fetch time increases from 1.6 ms to 103.9 ms per request.

Figure 5.1: The retrieval of reference data without caching


Figure 5.2: The retrieval of reference data without caching: (a) from Washington D.C (simulated), (b) from Stockholm.

5.1.2 Time to retrieve data from SharedData with CDN node

This scenario runs the same test suite as the scenario without caching, but with the CDN node implementation active in different locations.

Active CDN node in Stockholm

When running a CDN node in Stockholm, we can see that the difference is immense compared with not running the implementation, as shown in Figure 5.4, where Figure 5.2b and Figure 5.3b are plotted against each other to clearly illustrate the effect. The fetch time for the reference data drops from an average of 15-17 seconds for fetching 10,000 Accounts to an average just under 3 seconds. This means that per account the fetch time is roughly reduced from 1.6 milliseconds to 0.2638 milliseconds, a reduction in fetch time by a factor of 6. Comparing the simulated times from Washington D.C in Figures 5.2a and 5.3a, we can see that with the active node in Stockholm there are improvements, but not significant ones.

Figure 5.3: With CDN node active in Stockholm, the retrieval of reference data: (a) Washington D.C (simulated), (b) Stockholm.


Figure 5.4: The retrieval of reference data with and without caching.

Active CDN node in Stockholm and Washington D.C

When adding another CDN node, this time in Washington D.C, the retrieval time is significantly reduced, as shown in Figure 5.5. We can see that the retrieval time is higher at the start of the simulation and then gets lower and lower. The reason for the higher retrieval time at the start is cache misses: the node needs to retrieve the data from the origin server located in Stockholm. By adding the additional node in Washington D.C, the average retrieval time goes from 1025 seconds to around 5 seconds, a reduction by a factor of 205.

5.1.3 Test the miss ratio on the caching solution

In this scenario, the correlation between the retrieval time for reference data and the cache hit ratio is investigated. From the tests we can see in Figure 5.6 how the retrieval time for the reference data correlates with the cache hit ratio. The figure shows that the retrieval time decreases drastically as the cache hit ratio increases: at a hit rate of 86% the time is approximately 1.2 ms, and at a hit rate of 98-100% it is around 0.2-0.25 ms, a time reduction by a factor of 6. The results were gathered by measuring retrieval times from Stockholm towards the CDN node in the same location.


Figure 5.5: The retrieval of reference data with CDN node in Stockholm and Washington D.C.

Figure 5.6: The retrieval of reference data in comparison to the hit ratio of the cache

5.2 Measure the load on the origin server (SharedData)

When creating the scenario of measuring the impact on the origin server with and without the CDN implementation, a couple of different aspects were investigated. When monitoring the origin server, which in this case is the SharedData node, we can see that the load on the server is many times higher when the CDN node is not active. In Figure 5.7 we can see that the CPU usage spikes when the different requests arrive at the server, and reaches a fairly stable level around 150-200%. This puts high demands on the server being able to handle the load, and increases the risk of slower handling of external requests that arrive at the same time as the original requests. We can also see in the figure that the CPU load when the CDN node is active is significantly lower, averaging around 2-4%, as highlighted from Figure 5.7 in Figure 5.8. The origin server, in this case called SharedData, contains all the reference data for the clearing system, so conserving CPU usage is essential to ensure that resources are available when needed. The memory consumption of the SharedData node should naturally be fairly constant when interacting with it through an active CDN node, because the information is stored at the edge node, reducing the requests to the origin server, depending on the available storage capacity on the edge node and the size of the information stored at the origin server.

Figure 5.7: CPU usage of SharedData with and without CDN Node.


Figure 5.8: CPU usage of SharedData with and without CDN Node.

Figure 5.9: Memory usage of SharedData with and without CDN Node.


When a user wants to retrieve information about an account, the user sends a getAccount request to the access point of the system. The access point then propagates that request down to the SharedData node, which contains all the reference data in the clearing system. The getAccount request can in turn trigger two additional requests, getTradingMembers and getClearingMembers, depending on the type of account requested. All these requests are sent to the SharedData node and processed; Figure 5.10 shows the logs for the incoming requests sent during a scenario. Figure 5.11, the moving average of the first 100 requests from Figure 5.10, shows that the different requests vary between 0.15 and 0.4 milliseconds, with a few deviations. Given these handling times, we can clearly see that performance improvements can be made to reduce the time spent handling requests. Taking the correlation between requests into account, the time begins to add up when one request can add around one extra millisecond: one millisecond might not sound like much, but one request can trigger two or more additional requests, so the extra delay starts to stack up.
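To make the fan-out concrete (a rough illustration using the handling times above, not an additional measurement): a single getAccount that triggers both getTradingMembers and getClearingMembers occupies SharedData for roughly 3 × 0.15 ms to 3 × 0.4 ms, i.e. about 0.45-1.2 ms of handling time per user action, before any network latency is added.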

Figure 5.10: The times for different requests handled by the SharedData node.


Figure 5.11: The moving average of the first 100 requests on the SharedData node.


6 Discussion

This chapter will analyze the results and discuss how a fully scaled implementation would look in a cloud environment.

6.1 Analyze results

From the results in chapter 5 we can see the impact of the implementation on the system. The results reflect the impact of one node at one access point for users interacting with the system. This means that the results do not depict a complete CDN implementation, but illustrate the effects on the system with one CDN node active, where the CDN node covers a geographical area as seen in Figure 3.1 (in the bottom right corner, where the node covers the area of Australia). A single CDN node can vary in composition depending on the geographical area and the traffic load in that area: in one location the node can be small and still handle the traffic and caching for that area, while in areas where the load on the system is high, the CDN node would be much larger to handle all the user requests. So depending on the load on the node, it scales the active instances of the virtual machines handling the user requests up and down to provide a better experience for the users.

6.2 Fully scaled CDN implementation

As mentioned in section 6.1 above, a PoP or CDN node can be constructed in different ways in terms of scale. The scale of the node can be determined by budget restrictions or by the load on the system. If the node is structured to keep costs down, it will not scale up when experiencing high traffic load; the consequence would be increased response times on requests. Even so, such an implementation would probably perform better than the previous implementation without the scalable edge nodes. Another approach is to treat each node as a scalable cluster of virtual machines; this way the network will always meet the requirements of the users, but the solution would cost significantly more than the previously mentioned one.

6.2.1 Content Delivery Network Scenario

From the results in sections 5 and 6.1, we can construct a fully scaled implementation of a CDN, using the gathered results as reference points for the potential implementation. We need to start by looking at what kind of resources the implementation needs to function properly.

Background

The goal of this scenario is to get a better understanding of how a fully scaled CDN implementation for handling reference data on Cinnober's RTC system would look. Cinnober has built the clearing system for many well-known markets, and the structure of the clearing house can of course vary, as can the number of clearing members, trading members and tradable instruments. To give some idea of the volumes at a clearing house: one clearing house has roughly 45,000 tradable instruments, 5,000 accounts, 100 clearing members, and 400 trading members. All this information is reference data, and information that could potentially be handled in a CDN.

Cloud deployment

The focus during the thesis has been to ignore existing packaged CDN solutions and instead look at how Cinnober could construct their own CDN based on their needs. Looking at two big leaders among cloud service providers, Google and Amazon, both offer the geographical coverage that would be interesting in this scenario, where the CDN nodes could be located in a wide range of locations suitable for the clearing system [6, 7]. Spot prices (2019-04-16) are around $0.0523/hour at Google for a standard virtual machine (VM) with one virtual CPU (vCPU) and 3.75 GB of memory [11], and around $0.0554/hour at Amazon for a VM with two vCPUs and 3.75 GB of memory [2].

System interactions

To understand how the information stored in the system is used by the users, we need to lay out some basic behaviors of users interacting with the system and how they correlate with fetching reference data. There are between 2,000 and 5,000 users interacting with the system daily. This might not seem like a lot, but when looking at how one user's presence in the system correlates with how much reference data needs to be requested, we begin to understand the potential for improvement. When a user interacts with the clearing house user interface, reference data is requested constantly: whenever a user loads the interface, all the accounts and tradable instruments need to be fetched. That is just when the user opens the user interface; the user might then want to look at a wide range of instruments or accounts that operate within the clearing house.

Setup

By assuming a user's interaction patterns during a day, we can get a better understanding of the effect the solution could have on the system. Users can of course interact differently with the system, and a user's interaction patterns can vary on a daily basis, but for this scenario we assume that all users interact in the same way. The calculations below concern the cost of running this implementation during one day. Assume that the clearing house is located in central Europe; we then set up CDN nodes in Europe (London), Asia (Tokyo, Singapore) and the USA (Virginia) [6, 7]. With these four nodes we cover a large geographical area. The base cost for running these instances would be:

TC = Instances × 24 × HP    (6.1)

where TC = total cost and HP = the hourly spot price of the VM instance. Equation 6.1 gives a total cost for one day of $5.3184; this value does not take into account potential scaling of the different edge nodes.
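As a concrete check (assuming, for illustration, that the Amazon spot price quoted above is used for all four nodes):

TC = 4 × 24 × $0.0554 = $5.3184 per day

which reproduces the figure above.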

Calculations

Given the information from section 6.2.1, we assume that 1,000 of the users interacting with the system are located in the USA, and that each user interacts with the system as follows during one hour of the day:

• Website refresh, resulting in fetching all reference data.

• Fetches 10,000 instruments.

• Checking the latest trades = 10,000 rows, where one row contains two additional requests.

• Checking trade history → fetches all accounts and instruments, and for each row in the 10,000 results → six additional requests.

Based on that information, we construct the following equations for the users' behavior during an hour of the day:

TR = 1 × FA + 10000 × I + 10000 × T + 10000 × TH + 1 × FA + 10000 × P    (6.2)

Row_TH = Account × 3 + Instrument × 3    (6.3)

Row_T = Account + Instrument    (6.4)

Row_P = Account + Tradingmember + Clearingmember    (6.5)

where

TR = total number of requests
FA = fetching all reference data
I = instrument
TH = trade history, where each row is defined by equation 6.3
T = trade, where each row is defined by equation 6.4
P = positions, where each row is defined by equation 6.5

The total number of requests sent during one hour amounts to around 220,000 requests/hour → 1,980,000 requests/day. Using the data provided in section 5 and inserting the measured times for the requests with and without the implementation, we get the following values: without the CDN implementation, 3,168 seconds, or 52.8 minutes, during one day; with the CDN node implementation, 522.3 seconds, or 8.7 minutes. So with the implementation, the system would reduce latency and waiting time for the users by roughly 44 minutes during one day of work.
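As a worked check of these figures, using FA = 45,000 instruments + 5,000 accounts = 50,000 requests (section 6.2.1) and the row definitions in equations 6.3-6.5:

TR = 2 × 50,000 + 10,000 × 1 + 10,000 × 2 + 10,000 × 6 + 10,000 × 3 = 220,000 requests/hour

The daily total of 1,980,000 requests corresponds to nine such hours. At the measured 1.6 ms per request this gives 1,980,000 × 0.0016 s = 3,168 s ≈ 52.8 minutes, and at 0.2638 ms it gives 1,980,000 × 0.0002638 s ≈ 522.3 s ≈ 8.7 minutes, reproducing the totals above.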


Scenario limitations

It is important to note that the users' interactions with the system are approximations based on limited knowledge, so the scenario could be an underestimation just as much as an overestimation.


7 Conclusion

The purpose of the thesis was to answer the questions/goals described in section 1.3. What conclusions can be drawn from sections 5 and 6, and can all the questions be answered?

The first question the thesis wanted to answer was whether a CDN implementation on the RTC system can improve throughput of reference data. This is clearly shown in the result section 5, where the system with the implementation outperforms the system without it by a factor of 6. And based on the fully scaled scenario described in section 6.2, we can see what that would mean during a complete day of operation.

As to whether the implementation can relieve stress on the system, we can see clearly in Figure 5.7 that the load on the SharedData node decreased drastically with the implementation, which is a very important aspect of the solution: the SharedData node communicates with other parts of the system, and it is important that the node is not slowed down by reference data requests when business-crucial communication needs to be propagated to it.

The question of whether the solution is beneficial to the RTC system might be the most complex one to answer conclusively. Based on the results and the information provided in the discussion section, the solution would provide increased throughput and relieve stress on the system, which would be very beneficial; so in theory the implementation would be very beneficial given the cost of the cloud resources needed. Whether the implementation is beneficial to the RTC system ultimately depends on how you look at it: from a technical standpoint it would improve the system with regard to scalability, throughput and stress relief, but there will be added complexity to the RTC system as a whole, and added costs to set up and maintain. This is something Cinnober needs to investigate further to determine whether the solution is cost effective and provides a business advantage for the company.

The thesis answered the questions it set out to answer and, on the question of benefit, provided a basis for deciding whether the company wants to further investigate this kind of solution.


8 Future work

One question we might ask at this stage of the thesis is: are we done here? Is the work accomplished? Well, yes and no. With respect to what the master's thesis set out to accomplish, the answer is yes; I have managed to find the answers I was looking for when I started this thesis and set the goals for the project. On the other hand, this paper is only a pilot study to find out whether the approach and the potential solution to the problem are feasible and possible to implement, and if so, what the benefit of implementing such a solution would be. With that answer I have also motivated that the work is still not completely done. Future work, perhaps in the form of another master's thesis extending this one, could create a more technical implementation on the basis of this thesis, looking at the optimal solution for a fully scaled implementation of a content delivery network to handle the reference data in Cinnober's RTC system.


Bibliography

[1] Akamai. What are the benefits of a CDN. URL: https://www.akamai.com/us/en/cdn/what-are-the-benefits-of-a-cdn.jsp.

[2] Amazon EC2 Pricing - Amazon Web Services. URL: https://aws.amazon.com/ec2/pricing/ (visited on 04/16/2019).

[3] Josh Carlyle. What Are the Advantages and Disadvantages of Using a CDN? Nov. 2018. URL: https://www.colocationamerica.com/blog/cdn-advantages-and-disadvantages.

[4] Cisco Visual Networking Index: Forecast and Trends, 2017–2022. Jan. 2019. URL: https://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/white-paper-c11-741490.html.

[5] Cloudflare. What is a CDN? URL: https://www.cloudflare.com/learning/cdn/what-is-a-cdn/.

[6] Global Cloud Infrastructure — Regions and Availability Zones — AWS. URL: https://aws.amazon.com/about-aws/global-infrastructure/.

[7] Global Locations - Regions and Zones — Google Cloud. URL: https://cloud.google.com/about/locations/.

[8] Global Ping Statistics. URL: https://wondernetwork.com/pings.

[9] Viki Green. "Impact of slow page load time on website performance". In: Medium (Jan. 24, 2016). URL: https://medium.com/@vikigreen/impact-of-slow-page-load-time-on-website-performance-40d5c9ce568a (visited on 02/04/2019).

[10] Orion Sky Lawlor. Performance Modeling: Amdahl, AMAT, and Alpha-Beta. URL: https://www.cs.uaf.edu/2011/spring/cs641/lecture/04_05_modeling.html.

[11] Pricing — Compute Engine Documentation — Google Cloud. URL: https://cloud.google.com/compute/pricing (visited on 04/16/2019).

[12] Scaling Based on CPU or Load Balancing Serving Capacity — Compute Engine Documentation — Google Cloud. URL: https://cloud.google.com/compute/docs/autoscaler/scaling-cpu-load-balancing.

[13] What is Caching and How it Works — AWS. URL: https://aws.amazon.com/caching/.
