Replica selection in Apache Cassandra

REDUCING THE TAIL LATENCY FOR READS

USING THE C3 ALGORITHM

SOFIE THORSEN

Val av replikor i Apache Cassandra

sthorsen@kth.se

Master’s Thesis at CSC

Supervisor: Per Austrin

Examiner: Johan Håstad

Employer: Spotify

Abstract

Keeping response times low is crucial in order to provide a good user experience. The tail latency in particular proves challenging to keep low as the size, complexity and overall use of services scale up. In this thesis we look at reducing the tail latency for reads in the Apache Cassandra database system by implementing a new replica selection algorithm called C3, recently developed by Lalith Suresh, Marco Canini, Stefan Schmid and Anja Feldmann.

Through extensive benchmarks with several stress tools, we find that C3 indeed decreases the tail latencies of Cassandra on generated load. However, when evaluating C3 on production load, the results do not show any particular improvement. We argue that this is mostly due to the variable-size records in the data set and token awareness in the production client. We also present a client-side implementation of C3 in the DataStax Java driver in an attempt to remove the caveat of token-aware clients.

Replica selection in Apache Cassandra

To offer a good user experience, it is of the utmost importance to keep the response time low. The tail latency in particular is a challenge to keep low as today’s applications grow in size, complexity and usage. In this report we examine the tail latency for reads in the Apache Cassandra database system and whether it can be improved. We do this by implementing the new replica selection algorithm called C3, recently developed by Lalith Suresh, Marco Canini, Stefan Schmid and Anja Feldmann.

Through extensive tests with several different stress tools, we find that C3 indeed improves Cassandra’s tail latencies on generated load. However, using C3 on production load showed no major improvement. We argue that this is primarily due to variable sizes in the data set and to the production client being token aware. We also present a client implementation of C3 in the Java driver from DataStax, in an attempt to address the problem of token-aware clients.

Acknowledgements

I want to thank Lalith Suresh and Marco Canini for continuously discussing thoughts and sharing ideas throughout this project.

Chapter 1

Introduction

For all service-oriented applications, fast response times are vital for a good user experience. To examine the exact impact of server delays, Amazon and Google conducted experiments where they added extra delays on every query before sending back results to users [21]. One of their findings was that an extra delay of only 500 milliseconds per query resulted in a 1.2% loss of users and revenue, with the effect persisting even after the delay was removed.

However, keeping response times low is not an easy task. As Google reported [12], especially the tail latency is challenging to keep low as size, complexity and overall use of services scale up. When serving a single user request, multiple servers can be involved. Bad latency on a few machines then quickly results in higher overall latencies, and the more machines, the worse the tail latency. To illustrate why, consider a client making a request to a single server. Suppose that the server has an acceptable response time in 99% of the requests, but the last 1% of the requests takes a second or more to serve. This scenario is not too bad, as it only means that one client gets a slightly slower response every now and then.

Consider instead a hundred servers like this, and that a request requires a response from all of them. This greatly changes the responsiveness of the system. From 1% of the requests being slow, suddenly 63%¹ of the requests will take more than a second to serve.

It is then apparent that the tail latency must be taken seriously in order to provide a good service.
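The arithmetic behind the 63% figure extends to any number of servers. The following minimal sketch assumes, like the text, that servers are slow independently of each other and with the same probability; the function name `prob_any_slow` is ours and purely illustrative.

```python
def prob_any_slow(p_slow: float, n_servers: int) -> float:
    """Probability that at least one of n_servers responses is slow.

    Assumes each server is slow with probability p_slow and that
    response times are independent of each other.
    """
    return 1.0 - (1.0 - p_slow) ** n_servers


if __name__ == "__main__":
    # One slow response in a hundred per server, for growing fan-out.
    for n in (1, 10, 100):
        print(n, round(prob_any_slow(0.01, n), 3))
```

Even a modest fan-out of ten servers already pushes the share of slow requests to roughly one in ten under these assumptions.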

¹ Assuming independence between response times, the probability that at least one response takes more than a second is $1 - 0.99^{100} \approx 0.63$.

Apache Cassandra is the database of choice at Spotify for end user facing features. Spotify runs more than 80 Cassandra clusters on over 650 servers, managing important data such as playlists, music collections, account information, user/artist followers and more. Since an end user request often involves reading from several databases, poor tail latencies will affect the user experience negatively for a large number of users.

In this thesis a replica selection algorithm for use with Cassandra was implemented and evaluated, with a focus on reducing the tail latency for reads.

1.1 Problem statement

The data in Cassandra is replicated to several nodes in the cluster to provide high availability. The performance of the nodes in the cluster varies over time though, for instance due to internal data maintenance operations and Java garbage collections. When data is read, a replica selection algorithm in Cassandra determines which node in the cluster the request should be sent to. The built-in replica selection algorithm provides good median latency, but the tail latency is often an order of magnitude worse than the median, which leads to the following question:

Chapter 2

Background

2.1 Terminology and definitions

In this section we discuss concepts and technology necessary to follow the thesis. The reader familiar with the concepts can skip this section.

2.1.1 Load balancing and replica selection

Load balancing is the process of distributing workload across multiple computing resources, such as servers. Replica selection is a form of load balancing as it tries to balance requests across the set of nodes that own the requested data.

2.1.2 Percentiles and tail latency

A percentile is a statistical measure that indicates the value below which a given percentage of observations in a group of observations fall. For example, the 95th percentile is the smallest value which is greater than or equal to 95% of the observations.

In the context of latencies, percentiles are important measures when analyzing data. For example, if only using mean and median in analysis, outliers can remain hidden. In contrast, the maximum value gives a pessimistic view since it can be distorted by a single data point.
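The definition above corresponds to what is often called the nearest-rank method. A minimal sketch follows; the latency samples are made up for illustration and the function name `percentile` is ours.

```python
import math


def percentile(observations, p):
    """Nearest-rank percentile: the smallest observation that is greater
    than or equal to p percent of all observations (0 < p <= 100)."""
    if not observations:
        raise ValueError("need at least one observation")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(observations)
    # 1-based rank; multiply before dividing to limit rounding surprises.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical latency samples in milliseconds with a single outlier.
latencies_ms = [12, 11, 13, 12, 14, 11, 12, 13, 12, 950]

if __name__ == "__main__":
    print("median:", percentile(latencies_ms, 50))
    print("99th:  ", percentile(latencies_ms, 99))
```

With these samples the median stays at the typical value while the 99th percentile exposes the outlier, which is the reason percentiles are preferred when looking at the tail.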

Figure 2.1: Latencies over time. The plot shows the mean, median and 99th percentile latency (ms) against time (hours).

2.1.3 CAP theorem

The CAP theorem, also known as Brewer’s theorem [14], states that for a distributed computer system it is impossible to simultaneously provide all three of the following:

• Consistency - all nodes see the same data at the same time.

• Availability - every request receives a response about whether it succeeded or failed.

• Partition tolerance - the system continues to operate despite arbitrary message loss or failure of part of the system.

2.1.4 Eventual consistency

Eventual consistency is a consistency model used in distributed systems to achieve high availability. The consistency model informally guarantees that, if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value.

2.1.5 SQL

columns, with rows containing information about one specific entity and columns being the separate data points. For example, a row could represent a specific car, in which the columns are “Model”, “Color” and so on. The tables can have relationships between each other and the data is queried using SQL.

2.1.6 NoSQL

NoSQL databases are an alternative to the tabular relations used in relational databases. The motivation for this approach includes simplicity of design, horizontal scaling and availability. The data structures used by NoSQL databases (e.g. column, document, key-value or graph) differ from those used in relational databases, making some operations faster in NoSQL and others faster in relational databases. The suitability of a particular database, regardless of it being relational or NoSQL, depends on the problem it must solve.

There are many different distributed NoSQL databases and their functionality can differ a lot depending on which two properties from the CAP theorem they support.

2.1.7 Accrual failure detection

In distributed systems, a failure detector is an application or a subsystem that is responsible for detecting slow or failing nodes. This mechanism is important to detect situations where the system would perform better by excluding the culprit node or putting it on probation. To decide if a node is subject to exclusion/probation a suspicion level is used. For example, traditional failure detectors use boolean information as the suspicion level: a node is simply suspected or not suspected. Accrual failure detectors are a class of failure detectors where the information is a value on a continuous scale rather than a boolean value. The higher the value, the higher the confidence that the monitored node has failed. If an actual crash occurs, the output of the accrual failure detector will accumulate over time and tend towards infinity (hence the name). This model provides more flexibility as the application itself can decide an appropriate suspicion threshold. Note that a low threshold means quick detection in the event of a real crash, but also increases the likelihood of incorrect suspicion. On the other hand, a high threshold makes fewer mistakes but makes the failure detector slower to detect failing nodes.

Hayashibara et al. describe an implementation of such an accrual failure detector in [17], called the φ accrual failure detector. In the φ failure detector the arrival times of heartbeats are used to approximate the probabilistic distribution of future heartbeat messages. With this information, a value φ is calculated with a scale that changes dynamically to match recent network conditions.
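To make the idea concrete, the sketch below estimates φ for one monitored node. It is an illustration under assumptions of our own, not the implementation from [17] or from Cassandra: inter-arrival times are modelled as normally distributed, φ is taken as the negative base-10 logarithm of the probability that a heartbeat arrives later than the time already waited, and the class name, window size and `min_std` guard are hypothetical.

```python
import math
from collections import deque


class PhiAccrualDetector:
    """Toy phi accrual failure detector for a single monitored node."""

    def __init__(self, window_size=100, min_std=0.01):
        self.intervals = deque(maxlen=window_size)  # recent inter-arrival times
        self.last_heartbeat = None
        self.min_std = min_std  # guard against perfectly regular heartbeats

    def heartbeat(self, now):
        """Record the arrival time of a heartbeat."""
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        """Suspicion level at time `now`; grows the longer we wait."""
        if len(self.intervals) < 2:
            return 0.0  # not enough history to suspect anything
        mean = sum(self.intervals) / len(self.intervals)
        variance = sum((x - mean) ** 2 for x in self.intervals) / len(self.intervals)
        std = max(math.sqrt(variance), self.min_std)
        elapsed = now - self.last_heartbeat
        # Tail of the normal distribution: P(next heartbeat arrives later).
        p_later = 0.5 * math.erfc((elapsed - mean) / (std * math.sqrt(2)))
        if p_later <= 0.0:
            return math.inf
        return -math.log10(p_later)
```

An application would compare `phi(now)` against its own threshold, which is exactly the flexibility the accrual model is meant to provide.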

2.1.8 Exponentially weighted moving averages (EWMA)

A moving average (also known as rolling average or running average) is a technique used to analyze trends in a data set by creating a series of averages of different subsets of the full data set. Given a sequence of numbers and a fixed subset size, the first element of the moving average sequence is obtained by taking the average of the initial fixed subset of the number sequence. Then the subset is modified by excluding the first number of the series and including the next number following the original subset in the series. This creates a new averaged subset of numbers. More mathematically formulated:

Given a sequence $\{a_i\}_{i=1}^{N}$, an $n$-moving average is a new sequence $\{s_i\}_{i=1}^{N-n+1}$ defined from the $a_i$ sequence by taking the mean of subsequences of $n$ terms:

$$s_i = \frac{1}{n} \sum_{j=i}^{i+n-1} a_j$$

The sequences $S_n$ giving $n$-moving averages then are:

$$S_2 = \frac{1}{2}(a_1 + a_2,\; a_2 + a_3,\; \ldots,\; a_{N-1} + a_N)$$

$$S_3 = \frac{1}{3}(a_1 + a_2 + a_3,\; a_2 + a_3 + a_4,\; \ldots,\; a_{N-2} + a_{N-1} + a_N)$$

and so on. An example of different moving averages can be seen in Figure 2.2.
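The definition translates directly into code. A minimal sketch, with a function name of our own choosing:

```python
def moving_average(values, n):
    """Return the n-moving average of `values`.

    Element i of the result is the mean of values[i], ..., values[i + n - 1],
    so the result has len(values) - n + 1 elements.
    """
    if n < 1 or n > len(values):
        raise ValueError("n must be between 1 and len(values)")
    return [sum(values[i:i + n]) / n for i in range(len(values) - n + 1)]


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5]
    print(moving_average(data, 2))
    print(moving_average(data, 3))
```

A larger window smooths more strongly but produces a shorter sequence and reacts more slowly to changes.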

An exponentially weighted moving average (EWMA), instead of using the average of a fixed subset of data points, applies weighting factors to the data points. The weighting for each older data point decreases exponentially, never reaching zero. The EWMA for a series $Y$ can be calculated as:

$$S_1 = Y_1$$

$$S_t = \alpha \cdot Y_t + (1 - \alpha) \cdot S_{t-1} \quad \text{for } t > 1$$

where $\alpha$ represents the degree of weighting decrease, a constant smoothing factor between 0 and 1. A higher value of $\alpha$ discounts older observations faster. $Y_t$ is the value at a time period $t$, and $S_t$ is the value of the EWMA at a time period $t$.
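The recurrence needs only the previous smoothed value, which is what makes it cheap to maintain per node. A minimal sketch with an illustrative function name:

```python
def ewma(series, alpha):
    """Exponentially weighted moving average of `series`.

    The first smoothed value equals the first observation; each later value
    is alpha * observation + (1 - alpha) * previous smoothed value.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    for y in series:
        if not smoothed:
            smoothed.append(float(y))
        else:
            smoothed.append(alpha * y + (1 - alpha) * smoothed[-1])
    return smoothed


if __name__ == "__main__":
    print(ewma([10, 20, 30], alpha=0.5))
```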

2.1.9 RAID

RAID³ is a virtualization technology for data storage which combines multiple disk drives into one logical unit.

Data is distributed across the drives in different ways called RAID levels, depending on the specific level of redundancy and performance wanted. The different schemes are named by the word RAID followed by a number (e.g. RAID 0, RAID 1). Each scheme provides a different balance between the key goals: reliability, availability, performance and capacity.

RAID 10, or RAID 1+0, is a scheme where throughput and latency are prioritized, and is therefore the preferable RAID level for I/O-intensive applications such as databases.

2.1.10 Apache Cassandra

Apache Cassandra, born at Facebook [18] and built on ideas from Amazon’s Dynamo [13] and Google’s BigTable [3], is an open source NoSQL distributed database system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.

DataStax

DataStax is a computer software company whose business model centers around selling an enterprise distribution of the Cassandra project which includes extensions to Cassandra, analytics and search functionality. DataStax also employs more than ninety percent of the Cassandra committers.

³ Originally redundant array of inexpensive disks, now commonly redundant array of independent disks.

Replication

To ensure fault tolerance and reliability, Cassandra stores copies of data, called replicas, on multiple nodes. The total number of replicas across the cluster is referred to as the replication factor. A replication factor of 1 means that there is one copy of each row on one node. A factor of two means two copies of each row, where each copy is on a different node [7].

When a client read or write request is issued, it can go to any node in the cluster since all nodes in Cassandra are peers. When a client connects to a node, that node serves as the coordinator for that particular client operation. What the coordinator then does is to act as a proxy between the client application and the nodes that own the requested data. The coordinator is responsible for determining which node should get the request based on the cluster configuration and replica placement strategy.

Partitioners and tokens

A partitioner in Cassandra determines how the data is distributed across the nodes in a cluster, including replicas. In essence, the partitioner is a hash function for deriving a token, representing a row, from its partition key⁴ [9].

The basic idea is that each node in the Cassandra cluster is assigned a token that determines what data in the cluster it is responsible for [2]. The tokens assigned to the nodes need to be distributed throughout the entire possible range of tokens. As a simple example, consider a cluster with four nodes and a possible token range of 0-80. Then you would want the tokens for the nodes to be 0, 20, 40, 60, making each node responsible for an equal portion of the data.
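The four-node example can be sketched as a lookup on a sorted ring of tokens. The sketch assumes that a node is responsible for the tokens after the previous node's token up to and including its own, wrapping around at the end of the range; `toy_partitioner` is a hypothetical stand-in for a real partitioner and only maps keys into the 0-80 range of the example.

```python
import bisect
import hashlib

NODE_TOKENS = [0, 20, 40, 60]  # node tokens from the example in the text
TOKEN_RANGE = 80               # tokens fall in [0, 80)


def toy_partitioner(partition_key: str) -> int:
    """Hypothetical partitioner: hash a partition key into the token range."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % TOKEN_RANGE


def owner_token(row_token: int, node_tokens=NODE_TOKENS) -> int:
    """Token of the node responsible for `row_token`.

    Assumption of this sketch: a node owns the range from the previous
    node's token (exclusive) to its own token (inclusive), with wrap-around.
    """
    tokens = sorted(node_tokens)
    index = bisect.bisect_left(tokens, row_token)
    return tokens[index % len(tokens)]


if __name__ == "__main__":
    for key in ("playlist:42", "user:sofie"):
        token = toy_partitioner(key)
        print(key, "-> token", token, "-> node with token", owner_token(token))
```

Because the node tokens are evenly spaced, each node ends up with a quarter of the token range, which is the balance the example in the text is after.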

Data consistency

As Cassandra sacrifices consistency for availability and partition tolerance, making it an AP system in the CAP theorem sense, replicas may not always be synchronized. Cassandra extends the concept of eventual consistency by offering tunable consistency, meaning that the client application can decide how consistent the requested data must be.

In the context of read requests, the consistency level specifies how many replicas must respond to a read request before data is returned to the client application. Examples of consistency levels can be seen in Table 2.1.

⁴ The partition key is the first column declared in the PRIMARY KEY definition. Each row of

Level    Description
ALL      Returns the data after all replicas have responded. The read operation fails if a replica does not respond.
QUORUM   Returns the data once a quorum, i.e. a majority, of replicas has responded.
ONE      Returns the data from the closest replica.
TWO      Returns the data from the two closest replicas.

Table 2.1: Examples of read consistency levels.
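As a small illustration of Table 2.1, the number of replica responses a read waits for can be derived from the consistency level and the replication factor. The function below is a sketch with a name of our own; it covers only the four levels in the table and takes a quorum to mean a strict majority.

```python
def required_responses(level: str, replication_factor: int) -> int:
    """Number of replicas that must respond before a read returns."""
    if level == "ONE":
        needed = 1
    elif level == "TWO":
        needed = 2
    elif level == "QUORUM":
        needed = replication_factor // 2 + 1  # strict majority
    elif level == "ALL":
        needed = replication_factor
    else:
        raise ValueError(f"unknown consistency level: {level}")
    if needed > replication_factor:
        raise ValueError("not enough replicas for this consistency level")
    return needed


if __name__ == "__main__":
    for level in ("ONE", "TWO", "QUORUM", "ALL"):
        print(level, required_responses(level, replication_factor=3))
```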

To minimize the amount of data sent over the network when doing reads with a consistency level above ONE, Cassandra makes use of “digest requests”. A digest request is just like a regular read request except that instead of the node actually sending the data it only returns a digest, i.e. a hash of the data.

The intent is to discover whether two or more nodes agree on what the current data is, without actually sending the data over the network and therefore save bandwidth. Cassandra sends one data request to one replica and digest requests to the remaining replicas. Note that the digest queried nodes still will do all the work of fetching data, they will just not return it.
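A digest comparison of this kind can be sketched as follows. MD5 is used here only as an example hash, the function names are hypothetical, and the sketch does not model Cassandra's actual message format.

```python
import hashlib


def digest(row_bytes: bytes) -> bytes:
    """Hash of the row data, standing in for what a digest request returns."""
    return hashlib.md5(row_bytes).digest()


def replicas_agree(data_response: bytes, digest_responses) -> bool:
    """True if every digest matches the hash of the full data response."""
    expected = digest(data_response)
    return all(d == expected for d in digest_responses)


if __name__ == "__main__":
    fresh, stale = b"playlist-v2", b"playlist-v1"
    print(replicas_agree(fresh, [digest(fresh), digest(fresh)]))
    print(replicas_agree(fresh, [digest(fresh), digest(stale)]))
```

Only the fixed-size hashes cross the network for the digest replicas, regardless of how large the row is.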

Replica selection

In order for the coordinator node to route requests efficiently it makes use of a snitch. A snitch informs Cassandra about the network topology and determines which data centers and racks nodes belong to. This information allows Cassandra to distribute replicas according to the replication strategy [11] by grouping machines into data centers and racks.

In addition, all snitches also use a dynamic snitch layer that provides an adaptive behaviour when performing reads [24]. It uses an accrual failure detection mechanism, based on the φ failure detector discussed in section 2.1.7, to calculate a per node threshold that takes into account network performance, workload and historical latency conditions. This information is used to detect failing or slow nodes, but also for calculating the best host in terms of latency, i.e. selecting the best replica. However, calculating the best host is expensive. If too much CPU time is spent on calculations it would become counterproductive as it would sacrifice overall read throughput. The dynamic snitch therefore adopts two separate operations. One is receiving the updates, which is cheap, and the other is calculating scores for each host, which is more expensive.

find the worst latency as a measure for the scoring. After finding the worst latency it makes a second pass over the hosts and scores them against the maximum value. This calculation has been configured to only run once every 100 ms to reduce the cost. As hosts cannot inform the system of their recovery once put on probation, all computed scores are reset once every ten minutes as well.
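The two-pass scoring can be pictured with the sketch below. It is one possible reading of "scoring against the maximum value", namely dividing each host's latency by the worst latency, and is not taken from the Cassandra source; the function names and the sample latencies are hypothetical.

```python
def score_hosts(latencies_ms):
    """Score hosts relative to the worst observed latency.

    `latencies_ms` maps host name to a recent latency measure. The worst
    host gets score 1.0 and faster hosts get proportionally lower scores.
    """
    if not latencies_ms:
        return {}
    worst = max(latencies_ms.values())  # first pass: find the worst latency
    if worst <= 0:
        return {host: 0.0 for host in latencies_ms}
    # Second pass: score every host against the maximum.
    return {host: latency / worst for host, latency in latencies_ms.items()}


def best_host(latencies_ms):
    """Host with the lowest score, i.e. the lowest relative latency."""
    scores = score_hosts(latencies_ms)
    return min(scores, key=scores.get)


if __name__ == "__main__":
    samples = {"node-a": 10, "node-b": 40, "node-c": 20}
    print(score_hosts(samples))
    print(best_host(samples))
```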

Client drivers and token awareness

To enable communication between client applications and a Cassandra cluster, multiple client drivers for Cassandra exist. Cassandra supports two communication protocols, the legacy Thrift interface [22], and the newer native binary protocol that enables use of the Cassandra Query Language (CQL) [6], resembling SQL. Different drivers can therefore use different protocols.

Popular drivers include Astyanax, which uses the Thrift interface, and the Java driver from DataStax, which only supports CQL. As these drivers can get the token information from the nodes during initialization, they can be configured to be token aware. This means that the client driver can make a qualified choice about which nodes to issue requests to, based on the data requested.

2.2 Load balancing techniques in distributed systems

There exist numerous ideas and techniques to improve load balancing in distributed systems. The problem is often to decide on a good trade-off between exchanging a lot of communication between servers and clients and making guesses and approximations on the traffic. Intuitively, more information makes it easier to make good decisions, but information passing can be costly. This section briefly discusses previous work, ideas and algorithms for load balancing techniques in distributed systems, not necessarily with focus on improving the tail latency.

2.2.1 The power of d choices

Consider a system with $n$ requests and $n$ servers to serve them. If each request is dispatched independently and uniformly at random to a server then the maximum load, or the largest number of requests at any server, is approximately $\frac{\log n}{\log \log n}$. Suppose instead that each request gets placed sequentially onto the least loaded (in terms of number of requests enqueued on a server) of $d \geq 2$ servers chosen independently and uniformly at random. It has then been shown that with high probability⁵ the maximum load is instead only $\frac{\log \log n}{\log d} + C$, where $C$ is a constant [1] [20]. This means that getting two choices instead of just one leads to an exponential improvement in the maximum load.

⁵ High probability here means at least 1 − 1/…

This result demonstrates the power of two choices, which is a commonly used property in load balancing strategies. When referring to this idea the common way to denote it is by SQ(d), meaning shortest-queue-of-d-choices.
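The effect is easy to observe in a small simulation. The sketch below places n requests on n servers, each time picking the least loaded of d servers sampled uniformly at random (with replacement); the function name and the fixed seed are our own choices, and a single run only illustrates the result, it does not prove it.

```python
import random


def max_load(n: int, d: int, seed: int = 0) -> int:
    """Maximum number of requests on any server after placing n requests
    on n servers, each on the least loaded of d randomly chosen servers."""
    rng = random.Random(seed)
    load = [0] * n
    for _ in range(n):
        candidates = [rng.randrange(n) for _ in range(d)]
        target = min(candidates, key=lambda server: load[server])
        load[target] += 1
    return max(load)


if __name__ == "__main__":
    n = 10_000
    for d in (1, 2, 3):
        print("d =", d, "max load =", max_load(n, d, seed=1))
```

Going from d = 1 to d = 2 typically gives the large drop in maximum load, while further choices add comparatively little.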

2.2.2 Join-Shortest-Queue

The Join-Shortest-Queue (JSQ) algorithm is a popular routing policy used in processor sharing server clusters. In JSQ, an incoming request gets dispatched to the server with the least number of currently active requests. Ties are broken by choosing randomly between the tied servers. JSQ therefore tries to load balance across servers by reducing the chance of one server having multiple jobs while another server has none. This is a greedy policy since the incoming request prefers sharing a server with as few jobs as possible.

Figure 2.3 illustrates the algorithm, with the clients at the top, A-C being servers with their respective queues and pending jobs.
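The dispatch rule itself fits in a few lines. A minimal sketch, assuming the dispatcher already knows the number of active requests at every server; the function name and the server labels are illustrative.

```python
import random


def join_shortest_queue(active_requests, rng=random):
    """Pick the server with the fewest active requests, ties broken at random.

    `active_requests` maps server name to its number of active requests.
    """
    shortest = min(active_requests.values())
    tied = [server for server, count in active_requests.items() if count == shortest]
    return rng.choice(tied)


if __name__ == "__main__":
    print(join_shortest_queue({"A": 3, "B": 1, "C": 2}))
```

The sketch hides the costly part: keeping `active_requests` up to date for every dispatcher is where the communication overhead of JSQ comes from.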

An interesting result shown by Gupta et al. [15] is that the performance of JSQ on a processor sharing system shows near insensitivity to differences in the job size distribution. This is different from similar routing policies like Least-Work-Left (send the job to the host with the least total work) or Round-Robin, which are highly sensitive to the job size distribution.

JSQ is not optimal⁶, but was still shown to have great performance in comparison to algorithms with much higher complexity. A potential drawback with JSQ though, is that as the system grows, the amount of communication over the network between dispatchers and servers could get overwhelming, given that each of the distributed dispatchers will need to obtain the number of jobs at every server before every job assignment.

⁶ In the optimal solution, each incoming job is assigned so as to minimize the mean response time

Figure 2.3: The join-shortest-queue algorithm. Clients prefer the server with the shortest queue.

2.2.3 Join-Idle-Queue

The Join-Idle-Queue (JIQ) algorithm, described in [19], tries to decouple detection of lightly loaded servers from the job assignment. The idea is to have idle processors inform the dispatchers as they become idle, without interfering with job arrivals. This removes the load balancing work from request processing.

JIQ consists of two parts, the primary and the secondary load balancing problem, which communicate via a data structure called an I-queue. An I-queue is a list of processors that have reported themselves as idle. When a processor becomes idle it joins an I-queue based on a load balancing algorithm. Two load balancing algorithms for this purpose were considered in [19]: Random and SQ(d). With JIQ-Random an idle processor joins an I-queue uniformly at random, and with JIQ-SQ(d) an idle processor chooses d random I-queues and joins the one with the shortest queue length. If a client does not have any servers in its I-queue it will in turn make a choice based on the SQ(d) algorithm. Figure 2.4 illustrates the algorithm, again with the clients at the top with their respective I-queues, A-F being servers with their respective queues and pending jobs. It is worth noting that JIQ-Random has the additional advantage of having one-way communication, without requiring messages from the I-queues.
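The JIQ-Random variant can be sketched as below. The class and method names are hypothetical, servers are identified by index, and the fallback when an I-queue is empty follows the SQ(d) rule mentioned in the text; the sketch leaves out everything about how servers actually process jobs.

```python
import random
from collections import deque


class JiqRandom:
    """Toy JIQ-Random dispatcher state: one I-queue per dispatcher."""

    def __init__(self, n_dispatchers: int, n_servers: int, seed: int = 0):
        self.rng = random.Random(seed)
        self.i_queues = [deque() for _ in range(n_dispatchers)]
        self.n_servers = n_servers

    def report_idle(self, server: int) -> None:
        """Secondary problem: an idle server joins an I-queue chosen
        uniformly at random."""
        self.rng.choice(self.i_queues).append(server)

    def dispatch(self, dispatcher: int, queue_lengths, d: int = 2) -> int:
        """Primary problem: use an idle server if the dispatcher's I-queue
        has one, otherwise fall back to the shortest queue of d random servers."""
        i_queue = self.i_queues[dispatcher]
        if i_queue:
            return i_queue.popleft()
        candidates = [self.rng.randrange(self.n_servers) for _ in range(d)]
        return min(candidates, key=lambda server: queue_lengths[server])
```

Note that `report_idle` happens off the request path, which is the decoupling JIQ is designed around.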

Lu et al. showed three interesting results:

• JIQ-Random outperforms traditional SQ(2), with respect to mean response time.

• JIQ-SQ(2) achieves close to the minimum possible mean response time.

• Both JIQ-Random and JIQ-SQ(2) are near-insensitive to job size distribution with processor sharing in a finite system.

2.2.4 Speculative retries

Figure 2.4: The join-idle-queue algorithm. Servers join an I-queue based on the power of d choices algorithm. If a client does not have any servers in its I-queue it will in turn make a choice based on the power of d choices algorithm.

Implementing speculative retries adds some overhead, but can still give latency-reduction effects while increasing load only modestly. A way to achieve this is by waiting to send a second request until the first one has been outstanding for more than the 95th or 99th percentile expected latency for that type of request. This limits the additional load to only a few percent (~1-5%) while substantially reducing the tail latency, since the pending request might otherwise be waiting on a timeout of several seconds, for example.

Speculative retries were implemented in Cassandra 2.0.2, with the default of sending the next request at the 99th percentile [10].
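The percentile-based trigger can be sketched as follows. This is an illustration of the idea only and not Cassandra's implementation; the function names are hypothetical and the threshold is computed with the nearest-rank method over a list of previously observed latencies.

```python
import math


def retry_threshold(latencies_ms, pct: float = 99) -> float:
    """Latency at the given percentile of previously observed latencies."""
    if not latencies_ms:
        raise ValueError("need at least one observed latency")
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))  # 1-based nearest rank
    return ordered[rank - 1]


def should_send_retry(outstanding_ms: float, latencies_ms, pct: float = 99) -> bool:
    """True once a request has been outstanding longer than the threshold."""
    return outstanding_ms > retry_threshold(latencies_ms, pct)


if __name__ == "__main__":
    history = list(range(1, 101))  # made-up latencies of 1..100 ms
    print(retry_threshold(history, 99))
    print(should_send_retry(150, history))
```

By construction only about one request in a hundred exceeds the 99th percentile, which is why the extra load stays small.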

2.2.5 Tied requests

2.3 The C3 algorithm

The C3 algorithm, described in [23], is a replica selection algorithm for use with Cassandra. Suresh et al. argue that replica selection is an overlooked process which should be a cause for concern. They argue that putting mechanisms such as speculative retries on top of bad replica selection may increase system utilization for little benefit.

C3 tries to solve the problem by using two concepts. Firstly it uses additional feedback from server nodes in order for the clients to rank them and prefer faster ones. Secondly, the clients implement a rate control mechanism to prevent nodes from being overwhelmed. A note worth making is that a client in the C3 design is actually the coordinator node in Cassandra, so the entire algorithm is implemented server side. The current implementation is in Cassandra version 2.0.0.

2.3.1 Replica ranking

In the C3 replica ranking, the clients rank the server nodes using a scoring function, just like the dynamic snitch, with the score working as a measure of the latency to expect from the node in question. Clients prefer lower scores, which correspond to faster nodes for each request.

Instead of only using the latency, the C3 scoring function tries to minimize the product of the job queue size $q_s$ and the service time $1/\mu_s$ (the time to fetch the requested rows) across every server $s$.

Along with each response to a client, the servers send back additional information about their queue sizes and service times. The queue size is recorded after a request has been served and when the response is about to be returned. To make a better forecast, the values are smoothed with EWMAs, denoting the new values $\bar{q}_s$ and $\bar{\mu}_s$. In addition to these values, the response time $R_s$ (i.e. the difference between the latency for the entire request and the service time) is also recorded and smoothed. To account for other clients in the system as well as ongoing requests, each client also maintains, for each server $s$, an instantaneous count of its outstanding requests $os_s$ (requests for which a response is yet to be received). It is assumed that each client knows how many other clients there are in the system ($n$). The clients then make an estimate of the queue size of each server as:

$$\hat{q}_s = os_s \cdot n + \bar{q}_s + 1 \tag{2.1}$$

where the $os_s \cdot n$ term is referred to as the concurrency compensation.

The idea behind the concurrency compensation is that clients will account for the scenario of multiple clients concurrently issuing requests to the same server. Clients with a higher value of $os_s$ will therefore give a higher estimate of the queue size at $s$ and rank it lower than a client with fewer requests to $s$. As a result, clients with a higher demand are more likely to rank $s$ lower than clients with a lighter demand.
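The queue estimate of equation 2.1 and the way it feeds the scoring function given as equation 2.2 can be sketched as below. The function and parameter names are ours, the numbers in the usage example are invented, and the exponent defaults to the cubic case used by C3.

```python
def queue_estimate(outstanding: int, n_clients: int, ewma_queue: float) -> float:
    """Equation 2.1: outstanding requests times the number of clients
    (the concurrency compensation), plus the smoothed queue size, plus one."""
    return outstanding * n_clients + ewma_queue + 1


def c3_score(ewma_response_time: float, queue_hat: float,
             ewma_service_rate: float, b: int = 3) -> float:
    """Scoring function of equation 2.2 with exponent b: the smoothed
    response time plus queue_hat ** b divided by the smoothed service rate."""
    return ewma_response_time + queue_hat ** b / ewma_service_rate


if __name__ == "__main__":
    # Two clients looking at the same server, whose smoothed queue size is 4.
    busy_client = queue_estimate(outstanding=3, n_clients=2, ewma_queue=4)
    light_client = queue_estimate(outstanding=0, n_clients=2, ewma_queue=4)
    print(busy_client, light_client)
    print(c3_score(1.0, busy_client, 4.0), c3_score(1.0, light_client, 4.0))
```

The client with three outstanding requests sees a much larger estimate, and hence a much worse score, for the same server than the client with none, which is the intended effect of the concurrency compensation.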

Using this estimation, clients compute the queue size to service rate ratio ($\hat{q}_s/\bar{\mu}_s$) of each server and rank them accordingly. However, a function linear in $\hat{q}_s$ is not sufficient as it would demand a rather large increase in queue size in order for a client to switch back to a slower server again, which could result in accumulation of jobs at the faster nodes. Instead, C3 penalizes longer queue lengths by raising the $\hat{q}_s$ term to a higher power, $b$: $(\hat{q}_s)^b/\bar{\mu}_s$. For higher values of $b$, clients are less greedy about preferring a server with a lower service time as the $(\hat{q}_s)^b$ term will dominate the scoring function more strongly. In C3, $b$ is set to 3, yielding a cubic function. This results in a final scoring function:

$$\Psi_s = \bar{R}_s + \frac{(\hat{q}_s)^3}{\bar{\mu}_s} \tag{2.2}$$

where $\bar{R}_s$ and $\bar{\mu}_s$ are the EWMAs of the response time and service rate and $\hat{q}_s$ is the queue size estimate described in equation 2.1.

2.3.2 Rate control

To prevent exceeding server capacity, clients incorporate a rate limiting mechanism inspired by the congestion control in the CUBIC TCP implementation [16]. This mechanism is decentralized as clients do not inform each other of their demands of a server.

Every client uses a rate limiter for each server which limits the number of requests sent within a configured time window of $\delta$ ms. The limit is referred to as the sending rate (srate). By letting the clients track the number of responses received from a server in the $\delta$ ms interval (the receive rate, rrate) the rate limiter adapts and adjusts srate to match the rrate of the server.

When a client receives a response from a server $s$, the client compares the current srate and rrate for $s$. If srate is found to be lower than rrate, the client increases its rate according to:

$$\text{srate} \leftarrow \gamma \cdot \left( T - \sqrt[3]{\frac{\beta \cdot R_0}{\gamma}} \right)^3 + R_0 \tag{2.3}$$

where $T$ is the elapsed time since the last rate decrease, and $R_0$ is the rate at the time of the last rate decrease. If the rrate is lower than the srate, the client instead decreases its srate multiplicatively by $\beta$, in C3 set to 0.2. The $\gamma$ value represents a scaling factor and is used to set the desired duration of the saddle region. Additionally a cap for the step size is set by a parameter $s_{max}$. The scaling factor in C3 is set to 100 milliseconds and the cap size is set to 10.

To get a better understanding of the properties of the rate controlling function, consider Figure 2.5. The main proposed benefit of using this function is the configurable saddle region. While the sending rate is significantly lower than the saturation rate, the client will increase the rate aggressively (low rate region). When the sending rate is close to the perceived saturation point of the server, that is, $R_0$, the client stabilizes its sending rate and increases it conservatively (saddle region). Lastly, when the client has spent enough time in the stable region, it will again increase its rate aggressively, probing for more capacity (optimistic probing region).
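The three regions can be reproduced numerically from the cubic growth function of equation 2.3. The sketch below is illustrative only: the function names are ours, the value of gamma in the usage example is a made-up number chosen so that the curve reaches $R_0$ after about 100 ms, and `capped_increase` reflects our reading of the step size cap as a limit on how much the rate may grow in one update.

```python
def cubic_rate(t_ms: float, r0: float, gamma: float, beta: float = 0.2) -> float:
    """Cubic growth function: gamma * (T - cbrt(beta * r0 / gamma)) ** 3 + r0."""
    k = (beta * r0 / gamma) ** (1.0 / 3.0)  # time at which the curve reaches r0
    return gamma * (t_ms - k) ** 3 + r0


def capped_increase(srate: float, target: float, s_max: float = 10) -> float:
    """Move srate towards target, growing by at most s_max per update."""
    return min(target, srate + s_max)


if __name__ == "__main__":
    r0, gamma = 100.0, 2e-5  # hypothetical values for illustration
    for t in (0, 50, 100, 150, 200):
        print(t, round(cubic_rate(t, r0, gamma), 2))
```

With these values the rate starts at 80% of $R_0$, flattens out around $R_0$ near 100 ms (the saddle region) and then grows faster again, matching the shape described above.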

Figure 2.5: Growth curve for the rate control function. The rate (requests per $\delta$ ms) is plotted against $T$ (ms), with the low rate region, the saddle region around $R_0$ and the optimistic probing region marked.

2.3.3 Notes on the C3 implementation

Chapter 3

Method

Evaluating performance is not a trivial task. While the focus in this thesis was on improving the tail latency, it was important to not achieve this by sacrificing the average case performance, i.e. the average latency of a request.

A good starting point was to implement C3 in Cassandra 2.0.11 (the version that Spotify uses), to try and verify if the performance gains seen by the C3 authors in version 2.0.0, could also be seen in the newer version, despite the version gap.

3.1 Tools for testing

This section describes tools used while implementing the algorithm and evaluating Cassandra performance. In the process of benchmarking, guidelines and advice from DataStax [8] were adhered to.

3.1.1 The cassandra-stress tool

The cassandra-stress tool is a stress testing utility for Cassandra clusters, written in Java and included in the Cassandra installation [5]. It has three modes of operation: inserting data, reading data and indexed range slicing. For the purpose of this thesis the read mode was used for analysis.

During a run, the cassandra-stress tool reports information at a configurable interval. Example output can be seen below:


Here, each line reports data for the interval between the last elapsed time and the current elapsed time (default is 10 seconds). The columns of particular interest are latency, 95th and 99.9th. The latency column gives the average latency in milliseconds per operation during that interval. The 95th and 99.9th columns give the percentiles, i.e. 95% and 99.9% of the operations had a latency below the number displayed.
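To make the meaning of these columns concrete, the sketch below computes the same three measures from a list of latency samples. It uses the nearest-rank definition of a percentile, which is an assumption: the exact method used inside cassandra-stress is not described here, and the function names are hypothetical.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100.0))
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Average, 95th and 99.9th percentile latency of one reporting interval."""
    return {
        "mean": sum(latencies_ms) / len(latencies_ms),
        "95th": percentile(latencies_ms, 95),
        "99.9th": percentile(latencies_ms, 99.9),
    }
```

A mean that stays low while the 99.9th percentile is high is exactly the situation that tail latency work targets, which is why both are tracked.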

The cassandra-stress tool is highly configurable. For example, it is possible to specify the number of threads, the read and write consistency and the size of the records.

3.1.2 The Yahoo Cloud Serving Benchmark

The Yahoo Cloud Serving Benchmark (YCSB) is a framework for benchmarking various cloud serving systems [4]. The YCSB client is a workload generator, and the core workloads included in the installation are a set of workload scenarios to be executed by the generator.

Just like the cassandra-stress tool, the YCSB client is highly configurable. For example it is possible to specify the number of threads, read and write consistency, size of the records and format of the output. Below is example output where the format is a time series:

. . .
[READ] , 40 , 27509.0
[READ] , 50 , 31255.0
[READ] , 60 , 12345.5
[READ] , 70 , 15203.66
[READ] , 80 , 20668.25
. . .

Here, each line reports the average read latency (in microseconds) at an interval of ten milliseconds.
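Output in this format is straightforward to post-process. The sketch below parses time series lines like the ones shown above and converts the latencies from microseconds to milliseconds. The function name is hypothetical, and skipping every line that does not have exactly three fields with numeric values is an assumption about how other lines in the output should be treated.

```python
def parse_ycsb_timeseries(lines, operation="READ"):
    """Return (time offset, average latency in ms) pairs for one operation."""
    prefix = "[" + operation + "]"
    points = []
    for line in lines:
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 3 or fields[0] != prefix:
            continue  # other operations, separators, malformed lines
        try:
            offset = int(fields[1])
            latency_us = float(fields[2])
        except ValueError:
            continue  # non-numeric lines are ignored
        points.append((offset, latency_us / 1000.0))
    return points


if __name__ == "__main__":
    sample = ["[READ] , 40 , 27509.0", "[READ] , 50 , 31255.0"]
    for offset, latency_ms in parse_ycsb_timeseries(sample):
        print(offset, latency_ms)
```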

3.1.3 The Java driver stress tool

The Java driver stress tool is a simple example application that uses the DataStax Java driver to stress test Cassandra, which as a result also stress tests the Java driver.

The example tool is by no means a complete stress application and supports only a very limited number of stress scenarios.

3.1.4 Darkloading


This is done by snooping on the traffic to the original system and then making a duplicate request to another system.

3.2 Test environment setup

In the process of evaluating the performance of different Cassandra versions, the task was divided into two parts. The first was evaluating performance using stress tools such as cassandra-stress, YCSB and the Java driver stress tool, which generate the workload and traffic themselves. The other part was evaluating performance on production workload and traffic, which was obtained with the Darkloading strategy.

Testing on dedicated hardware is preferable, as it removes the risk of skewed results due to resource sharing. Therefore, dedicated hardware was used in both cases. For the Cassandra cluster, machines suited for databases were provisioned, with 16 cores, 32 GB of RAM and spinning disks in a RAID 10 configuration. For the machines sending the traffic, dedicated service machines with 32 cores and 64 GB of RAM were used instead.

In the benchmarks, both consistency level ONE and QUORUM were tested.

Testing with speculative retries both enabled and disabled was tried, but as this did not yield any interesting results,¹ it was omitted as a testing parameter.

3.2.1 Testing on generated load

When testing on generated workload, two things in particular were desirable. The first was inserting enough data to ensure that the entire dataset did not fit in memory. The other was running the test long enough, since a cluster performs very badly at the start of a run (due to the Java Virtual Machine warming up). For this reason, the first 15% of all recorded values were discarded, so that only values from after the cluster performance had stabilized were used. The 15% breakpoint was not thoroughly analyzed, but was simply deemed appropriate when looking at the raw output from test runs.
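The warm-up cut can be expressed in a few lines. This is a minimal sketch with a hypothetical function name; it assumes the recorded values are kept in the order in which they were measured.

```python
def discard_warmup(values, percent=15):
    """Drop the first `percent` percent of the recorded values.

    The values must be in recording order, so that the dropped prefix
    corresponds to the start of the run.
    """
    if not 0 <= percent < 100:
        raise ValueError("percent must be in [0, 100)")
    skip = len(values) * percent // 100  # integer arithmetic, rounds down
    return values[skip:]
```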

3.2.2 Testing on production load

To try and make the comparison between different Cassandra versions as fair as possible, the same production traffic was used in each test run. The data was sampled from the real service and saved to file, making it possible to replay the same data multiple times.

¹ A slight improvement could be seen in the higher percentiles, but as this improved performance


Chapter 4

Implementation

4.1 Implementing C3 in Cassandra 2.0.11

As Spotify uses Cassandra 2.0.11 (and above) for new applications, their development environment is also suited for those versions. Since Cassandra 2.0.0 and Cassandra 2.0.11 are incompatible, C3 was instead implemented directly in Cassandra 2.0.11, making the comparisons and cluster setup easier in the Spotify environment.

The implementation did not need much additional reworking of the newer code,¹ making the process simple.

4.2 Implementing C3 in the DataStax Java driver 2.1.5

As previously mentioned, the entire C3 algorithm is implemented server side. However, a client implementation may be preferable, as many newer Cassandra client drivers are token aware, meaning that the coordinator node will be able to serve the requested data directly. By implementing C3 in the client, we can send the request to the best replica in the first step, removing the need to go through the coordinator node just to rank the replicas.

With that in mind, the C3 algorithm was implemented in the DataStax Java driver. The Java driver was chosen since it is actively maintained, uses the newer communication protocol and has good support for implementing new load balancing policies. There were some impediments along the way, though. Firstly, the queue size and service time as recorded by the server could not be used, as this is an extension in the C3 server code. This means that the replica scoring only used metrics as seen by the clients, which might have had a significant impact on the performance. Secondly, as the driver code is substantially different from the server code, the parameters set in the C3 server code might not have been suitable values for the client.

4.2.1 Naive implementation

To decide which hosts to send a request to, the driver makes use of a load balancing policy. For each request, the load balancing policy is responsible for returning an iterator containing the hosts to query.

This served as a suitable place to implement the replica scoring part of C3. Therefore a new policy called HostScoringPolicy was implemented, responsible for the logic of ranking hosts.

As mentioned earlier, the scoring function was simplified, as the server metrics used in the original C3 version were not available. The metrics used in the client-side ranking are the latency of the entire request ($L_s$), the queue size ($q_s$) and the outstanding requests to a host ($os_s$), all as seen by the client. Just like in the server implementation, EWMAs (exponentially weighted moving averages) were used to smooth the values. The client version of $\hat{q}_s$ is therefore defined just as before:

$$\hat{q}_s = os_s \cdot n + \bar{q}_s + 1 \qquad (4.1)$$

The difference is that the queue size here is recorded from the client perspective and not by the server itself, as in the original C3 implementation.

This results in the final client scoring function:

$$s = L_s + (\hat{q}_s)^3 \cdot L_s \qquad (4.2)$$

Note the important difference that the service time metric is not available, which leaves the latency of the entire request as the only measure.
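Equations 4.1 and 4.2 can be illustrated with a small sketch. It is written in Python for brevity and is not the code of the HostScoringPolicy in the Java driver. The class and function names, the smoothing weight `alpha`, and the choice to try hosts without samples first are all assumptions. The weight n from equation 4.1 is passed in by the caller, since its definition lies outside this section.

```python
class Ewma:
    """Exponentially weighted moving average of a stream of samples."""

    def __init__(self, alpha):
        self.alpha = alpha  # weight of the newest sample (hypothetical value)
        self.value = None

    def update(self, sample):
        if self.value is None:
            self.value = float(sample)  # the first sample seeds the average
        else:
            self.value = self.alpha * sample + (1.0 - self.alpha) * self.value
        return self.value


class HostStats:
    """Metrics for one host, all as seen by the client."""

    def __init__(self, alpha=0.9):
        self.latency = Ewma(alpha)     # smoothed request latency, L_s
        self.queue_size = Ewma(alpha)  # smoothed queue size, q-bar_s
        self.outstanding = 0           # requests in flight, os_s

    def score(self, n):
        """Score from equations 4.1 and 4.2; a lower score is better."""
        if self.latency.value is None:
            return 0.0  # assumption: hosts without samples are tried first
        q_bar = self.queue_size.value or 0.0
        q_hat = self.outstanding * n + q_bar + 1.0                    # (4.1)
        return self.latency.value + q_hat ** 3 * self.latency.value  # (4.2)


def rank_hosts(stats_by_host, n):
    """Return the host names ordered from best (lowest score) to worst."""
    return sorted(stats_by_host, key=lambda host: stats_by_host[host].score(n))
```

Because the queue estimate is cubed, a host with low latency but a few queued or outstanding requests can rank below a slower host with an empty queue, which is the intended penalty on long queues.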

The rate limiting part of C3 was, however, easily plugged in, as the functionality is self-contained and does not rely on external metrics.

4.3 Benchmarking with YCSB


by Suresh et al. in [23]. To achieve this, the YCSB framework was used, just like in the original paper.

The test scenario with a read-heavy workload (95% reads, 5% writes) was chosen to be reproduced. In the original experiment, 15 Cassandra nodes were used, with a replication factor of 3. 500 million records of 1 KB each were inserted across the nodes, yielding approximately 100 GB of data per machine. Since the test setup only had 8 Cassandra nodes, the record count was modified to give a load similar to that of the original experiment. Therefore, 250 million records of 1 KB each were inserted, yielding close to 100 GB of data per machine.
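The per-machine figures follow from simple arithmetic, sketched below. Two things are assumptions here: that 1 KB is counted as 1000 bytes and 1 GB as 10^9 bytes, and that the 8-node test setup also used a replication factor of 3, which is only stated for the original experiment but is consistent with the "close to 100 GB" figure. Storage overhead is ignored.

```python
def data_per_node_gb(records, record_size_bytes, replication_factor, nodes):
    """Approximate amount of data stored per node, in GB (10^9 bytes)."""
    total_bytes = records * record_size_bytes * replication_factor
    return total_bytes / nodes / 1e9


if __name__ == "__main__":
    # Original experiment: 500 million records on 15 nodes.
    print(data_per_node_gb(500_000_000, 1000, 3, 15))
    # Test setup in this thesis: 250 million records on 8 nodes.
    print(data_per_node_gb(250_000_000, 1000, 3, 8))
```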

Just like the original test scenario three YCSB instances were used (running on separate machines) each running 40 threads, yielding a total of 120 generators. Then for each Cassandra version and consistency level, just like in the original test, two million rows were read, five times. The duration of a read run was about 30-60 minutes depending on consistency level.

4.4 Benchmarking with cassandra-stress

As the cassandra-stress tool already comes packaged together with the Cassandra installation, C3 was also tested with this tool, to gain further confidence about the performance of C3.

The deployment again consisted of the 8 Cassandra nodes, and one separate service machine, running the cassandra-stress tool with the default of 50 threads. 250 million records of 1KB each were inserted across the cluster.

Due to a design choice in the cassandra-stress tool,² 100 million rows were read. The duration of a read run was about 5-7 hours depending on consistency level.

4.5 Benchmarking with the java-driver stress tool

As creating a custom stress tool for the purpose of client evaluation is outside the scope of this thesis, the stress application that comes together with the Java driver was used to evaluate the client implementation of C3.

With some small modifications to the source code of the stress application, it was possible to test the different load balancing policies with different consistency levels.

² For example, inserting 100000 rows will write rows with key values 000000-099999, meaning


The deployment again consisted of the 8 Cassandra nodes, and 6 service machines, each running 100 threads. 250 million records of 1KB each were inserted across the cluster.

For each Cassandra version and consistency level, 100 million rows were read. The duration of a read run was about 5-7 hours depending on consistency level.

4.6 Darkloading

In order to benchmark the performance of C3 under production load, a cluster had to be duplicated. A suitable cluster was chosen on the recommendation of Jimmy Mårdell at Spotify. The chosen cluster consists of 8 Cassandra nodes with approximately 130 GB of data per node and 6 service machines sending traffic to the cluster. The read/write ratio of the incoming requests to the service is approximately 97% reads and 3% writes.

To send traffic to the test cluster, two versions of the service client were used. The first version was token aware and used consistency level QUORUM, just like the original service. In the other version, token awareness was replaced by plain round robin and the consistency level was set to ONE, to try and match the settings that the original C3 was developed with. Because the service client uses the Astyanax client and not the Java driver, it was unfortunately not possible to Darkload the C3 client. Although Astyanax supports a beta version that uses the Java driver under the hood, it only does so for older versions of the Java driver.

For each setup, the sampled traffic was replayed at a configured rate that resulted in a disk I/O utilization of around 50-60%, ensuring that the cluster had as much traffic as possible without choking the disks. Note, however, that even though the same traffic was replayed, writes altered the data in the cluster, potentially affecting some reads. Given the low number of writes, this was deemed negligible.


Chapter 5

Results

Here we present the results from our different benchmarks. The standard deviation for each measure is marked in all charts. In some charts, where the difference was small, we have omitted the average latencies, as the focus lies on improving the tail latency. All exact numbers, including averages, are available in Appendix A.

5.1 Benchmarking with YCSB

Here we present the results from the YCSB runs. The results are the averages of the combined values outputted from the three YCSB instances. In Figure 5.1 we have consistency level ONE to the left and QUORUM to the right.

[Figure 5.1: mean, 95th, 99th and 99.9th percentile latency (ms) for 2.0.11 and C3. Left panel: consistency level ONE. Right panel: consistency level QUORUM.]

5.2 Benchmarking with cassandra-stress

Here we present the results from the cassandra-stress runs. The results are the averages from the single cassandra-stress instance. In Figure 5.2 we have the results from the 95th and 99.9th percentile latencies, with consistency level ONE to the left and QUORUM to the right.

[Figure 5.2: 95th and 99.9th percentile latency (ms) for 2.0.11 and C3.]

5.3 Benchmarking with the java-driver stress tool

Here we present the results from the java-driver stress runs. The default we compare against is the java-driver 2.1.5 with the default LoadBalancingPolicy that is token aware.

5.3.1 Performance of the C3 client

In Figure 5.3 we have the results for the mean, 95th and 99th percentile latencies, with consistency level ONE to the left and QUORUM to the right. For both the 2.1.5 and the C3 client, Cassandra 2.0.11 was running server side.

[Figure 5.3: mean, 95th and 99th percentile latency (ms) for the 2.1.5 client and the C3 client.]

5.4 Darkloading

Here we present the results from the Darkloading runs. First we present the performance with token awareness in the client, followed by the performance with plain round robin.

5.4.1 Performance with token awareness

In Figure 5.4 we have the results for the 95th, 98th, 99th and 99.9th percentile latencies.

[Figure: 95th, 98th, 99th and 99.9th percentile latency (ms) for 2.0.11 and C3.]

Figure 5.4: Darkloading with token awareness.

5.4.2 Performance with round robin

[Figure: 95th, 98th, 99th and 99.9th percentile latency for 2.0.11 and C3.]

Chapter 6

Discussion

6.1 Performance of server side C3

6.1.1 YCSB vs. cassandra-stress

The YCSB stress runs confirm the results from the original experiment, that C3 is superior to the original dynamic snitch. Furthermore, we found that regardless of whether consistency level ONE or QUORUM was used (in the original experiment only consistency level ONE was evaluated), C3 reduced both latency and variance across all percentiles.
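To put a number on the size of the reduction, the relative improvement can be computed directly from the values in Table A.1 (consistency level ONE). The sketch below only restates those published numbers; the names `TABLE_A1` and `relative_reduction` are hypothetical.

```python
# (2.0.11, C3) read latencies in ms, copied from Table A.1.
TABLE_A1 = {
    "mean": (11.59, 7.81),
    "95th": (21.22, 11.92),
    "99th": (30.28, 15.36),
    "99.9th": (54.85, 24.64),
}


def relative_reduction(baseline, candidate):
    """Reduction of candidate relative to baseline, in percent."""
    return (baseline - candidate) / baseline * 100.0


if __name__ == "__main__":
    for measure, (baseline, c3) in TABLE_A1.items():
        print(measure, round(relative_reduction(baseline, c3), 1))
```

With these numbers the reduction grows with the percentile, from roughly 33% for the mean to roughly 55% for the 99.9th percentile.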

In our cassandra-stress runs, results were again positive, but not with the same confidence as in the YCSB runs. Although it would have been reassuring to get more similar results between the tools, we want to emphasize the differences between the setups. The cassandra-stress runs were read only, whereas the YCSB runs were read heavy. We had a different number of instances running, as well as a different thread count. We also have no control over the read patterns in cassandra-stress, which could also contribute to the differing results.

Additional YCSB runs with a setup similar to the cassandra-stress one would be desirable, to see whether the difference between the results decreases, but due to time constraints we leave this to future work.

6.1.2 Darkloading


We believe that one reason for not seeing much improvement in this case is the fact that the client is token aware. The client will therefore already send the request to a node that has the data, meaning that C3 in some cases will not be able to improve the routing. Darkloading C3 with round robin in the client (and consistency level ONE) actually did improve the results, supporting this claim. Although the small 100 µs overhead in the average case remained, we could now see an improvement already at the 95th percentile, with the 99.9th percentile improving by about 20%. Even though we did see this improvement, there is still a big gap in performance gain compared to the stress tool results.

This could have several reasons. Firstly, when generating workload, all the records were of equal size (1KB), meaning that all read requests are equally large. In the case of production load, some rows might contain more data than others due to the nature of the Darkloaded service. This means that some reads will have higher latencies, not due to slow servers but due to how the data is structured. The result of this would be that C3 might rank fast servers as slow ones just because they happen to get heavier reads.

Another point worth making is that problems such as garbage collections, where C3 really could improve the performance, commonly do not occur until the cluster has been running for a couple of weeks, which makes this a hard scenario to simulate within the scope of this thesis.

6.2 Performance of client side C3

Although it lacks the exact metrics available to the server-side C3, the C3 client implementation did lower the tail latency. However, the benchmark showed a lot of variance, making the results inconclusive. Since the variance was present in both the default java-driver version and the C3 implementation, we deem this to be a fault in the benchmark setup and not in the implementation.

We suggest that making repeated benchmarks and perhaps tweaking the parameters could give a more conclusive result. However, we are under the impression that C3 in the client could work well, and perhaps be a substitute for token aware clients.

6.3 Conclusion

Given the right conditions, the C3 algorithm has proven to be an effective way to decrease tail latencies in Cassandra. We would recommend the current implementation in systems where row sizes are homogeneous, as variable size records are not taken into account in the scoring function.


variable size rows. Given that one can obtain the size of the data requested, it should be possible to make a weighted scoring function, but this is outside the scope of this thesis.


Appendix A

Results from benchmarks

A.1 YCSB

System                   2.0.11           C3
Average latency (ms)     11.59, σ=1.74    7.81, σ=0.97
95th percentile (ms)     21.22, σ=3.84    11.92, σ=1.81
99th percentile (ms)     30.28, σ=4.65    15.36, σ=2.10
99.9th percentile (ms)   54.85, σ=5.25    24.64, σ=2.32

Table A.1: YCSB read latencies with consistency level ONE.

System 2.0.11 C3


A.2 cassandra-stress

System                   2.0.11            C3
Average latency (ms)     3.93, σ=0.71      3.87, σ=0.71
95th percentile (ms)     27.12, σ=1.83     23.05, σ=1.53
99.9th percentile (ms)   113.57, σ=37.86   82.66, σ=23.89

Table A.3: cassandra-stress read latencies with consistency level ONE.

System                   2.0.11            C3
Average latency (ms)     8.44, σ=0.29      8.42, σ=0.32
95th percentile (ms)     34.54, σ=1.94     31.57, σ=2.30
99.9th percentile (ms)   131.58, σ=35.85   105.59, σ=32.18

Table A.4: cassandra-stress read latencies with consistency level QUORUM.

A.3 java-driver stress

System                   java-driver 2.1.5   client C3
Average latency (ms)     8.75, σ=1.11        10.40, σ=0.34
95th percentile (ms)     75.05, σ=28.99      33.04, σ=26.17
99th percentile (ms)     149.44, σ=44.38     67.76, σ=55.80

Table A.5: java-driver stress read latencies with consistency level ONE.

System                   java-driver 2.1.5   client C3
Average latency (ms)     14.95, σ=1.56       15.18, σ=1.18
95th percentile (ms)     104.42, σ=27.80     100.43, σ=37.88
99th percentile (ms)     169.03, σ=38.65     158.42, σ=42.77

A.4 Darkloading

A.4.1 Token aware

System                   2.0.11           C3
50th percentile (ms)     0.90, σ=0.01     1.01, σ=0.01
75th percentile (ms)     1.12, σ=0.04     1.27, σ=0.05
95th percentile (ms)     14.59, σ=3.93    15.41, σ=3.82
98th percentile (ms)     27.31, σ=5.48    26.91, σ=4.71
99th percentile (ms)     36.97, σ=6.26    35.11, σ=4.94
99.9th percentile (ms)   70.61, σ=10.11   63.65, σ=7.47

Table A.7: Darkloading read latencies with consistency level QUORUM.

A.4.2 Round robin


Bibliography

[1] Yossi Azar, Andrei Z Broder, Anna R Karlin, and Eli Upfal. Balanced allocations. SIAM journal on computing, 29(1):180–200, 1999.

[2] Nick Bailey. Balancing your Cassandra cluster. http://www.datastax.com/dev/blog/balancing-your-cassandra-cluster. Accessed: 2015-06-12.

[3] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C Hsieh, Deborah A Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E Gruber. Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems (TOCS), 26(2):4, 2008.

[4] Brian F Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM symposium on Cloud computing, pages 143–154. ACM, 2010.

[5] DataStax. The cassandra-stress tool. http://www.datastax.com/documentation/cassandra/2.0/cassandra/tools/toolsCStress_t.html. Accessed: 2015-03-16.

[6] DataStax. Coming in Cassandra 1.2: binary CQL protocol. http://www.datastax.com/dev/blog/binary-protocol. Accessed: 2015-05-24.

[7] DataStax. Data replication. http://www.datastax.com/documentation/cassandra/2.1/cassandra/architecture/architectureDataDistributeReplication_c.html. Accessed: 2015-01-27.

[8] DataStax. How not to benchmark cassandra. http://www.datastax.com/dev/blog/how-not-to-benchmark-cassandra. Accessed: 2015-03-16.

[9] DataStax. Partitioners.


[10] DataStax. Rapid read protection in cassandra 2.0.2. http://www.datastax.com/dev/blog/rapid-read-protection-in-cassandra-2-0-2. Accessed: 2015-01-28.

[11] DataStax. Snitches. http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architectureSnitchesAbout_c.html. Accessed: 2015-01-23.

[12] Jeffrey Dean and Luiz André Barroso. The tail at scale. Communications of the ACM, 56(2):74–80, 2013.

[13] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: amazon’s highly available key-value store. In ACM SIGOPS Operating Systems Review, volume 41, pages 205–220. ACM, 2007.

[14] Seth Gilbert and Nancy Lynch. Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33(2):51–59, 2002.

[15] Varun Gupta, Mor Harchol Balter, Karl Sigman, and Ward Whitt. Analysis of join-the-shortest-queue routing for web server farms. Performance Evaluation, 64(9):1062–1081, 2007.

[16] Sangtae Ha, Injong Rhee, and Lisong Xu. Cubic: a new tcp-friendly high-speed tcp variant. ACM SIGOPS Operating Systems Review, 42(5):64–74, 2008.

[17] Naohiro Hayashibara, Xavier Defago, Rami Yared, and Takuya Katayama. The φ accrual failure detector. In Reliable Distributed Systems, 2004. Proceedings of the 23rd IEEE International Symposium on, pages 66–78. IEEE, 2004.

[18] Avinash Lakshman and Prashant Malik. Cassandra: a decentralized structured storage system. ACM SIGOPS Operating Systems Review, 44(2):35–40, 2010.

[19] Yi Lu, Qiaomin Xie, Gabriel Kliot, Alan Geller, James R Larus, and Albert Greenberg. Join-idle-queue: A novel load balancing algorithm for dynamically scalable web services. Performance Evaluation, 68(11):1056–1071, 2011.

[20] Michael Mitzenmacher. The power of two choices in randomized load balancing. Parallel and Distributed Systems, IEEE Transactions on, 12(10):1094–1104, 2001.

[21] Eric Schurman and Jake Brutlag. The user and business impact of server delays, additional bytes, and http chunking in web search. In Velocity Web


[22] Mark Slee, Aditya Agarwal, and Marc Kwiatkowski. Thrift: Scalable cross-language services implementation. Facebook White Paper, 5(8), 2007.

[23] Lalith Suresh, Marco Canini, Stefan Schmid, and Anja Feldmann. C3: Cutting tail latency in cloud data stores via adaptive replica selection. In Proceedings of the 12th USENIX Conference on Networked Systems Design and Implementation, 2015.
