
ADVANCED LEVEL, 30 CREDITS
STOCKHOLM, SWEDEN 2016

Multi Data center Transaction Chain

Achieving ACID for cross data center multi-key transactions

QINJIN WANG


Multi Data center Transaction Chain

Achieving ACID for cross data center multi-key transactions

QINJIN WANG

Master's Thesis at KTH ICT
Supervisor: Ying Liu

Examiner: Associate Prof. Vladimir Vlassov


Abstract

Transaction support for Geo-replicated storage systems has been one of the most prominent challenges of recent years. Some systems give up on supporting transactions and leave them to the upper application layer, while others try different solutions for guaranteeing transaction correctness and invest some effort in performance improvements. However, very few systems claim to support ACID at global scale.

In this thesis, we have studied various data consistency and transaction design theories such as Paxos, transaction chopping, and transaction chains, and we have analyzed several recent distributed transactional systems. As a result, a Geo-replicated transactional framework, namely Multi Data center Transaction Chain (MDTC), is designed and implemented. MDTC adopts the transaction chopping approach, which allows more concurrency by chopping transactions into pieces. A two-phase traversal mechanism is designed to validate and maintain dependencies. For cross data center consistency, a Paxos-like majority vote protocol is designed and implemented as a state machine. Moreover, tuning such as executing read-only transactions locally helps improve the performance of MDTC in different scenarios.


Acknowledgments

Hereby, I would like to thank my supervisor Ying Liu immensely. Without his careful advice, this thesis work could not have been finished. I have learned a lot from him during discussions throughout the whole thesis period.

Xi Guan is a classmate as well as a good friend of mine, who gave me many helpful suggestions during the thesis work. I would like to thank him very much.

I would also like to express my appreciation to Prof. Vladimir Vlassov, the examiner of this thesis, not only for providing such a great thesis opportunity but also for the knowledge he taught in the classroom.


Contents

List of Figures
List of Tables
1 Introduction
  1.1 Motivation
  1.2 Goal
  1.3 Methodologies
  1.4 Contributions
  1.5 Organization
2 Background
  2.1 Transactions
  2.2 Distributed transactions
    2.2.1 Two phase commit
    2.2.2 Concurrent handling for distributed transactions
  2.3 Paxos
    2.3.1 Consensus
    2.3.2 Paxos algorithm
    2.3.3 Paxos variants
  2.4 Transaction chopping
  2.5 Epochs
3 Related works
4 Design
  4.1 Architecture
  4.2 Transaction client API
  4.3 Single data center MDTC
  4.4 MDTC in multi data centers
    4.4.1 Transaction views
    4.4.2 Transaction ID
    4.4.3 Read-only transactions
    4.4.4 Coordinator
    4.4.5 Forward Pass
    4.4.6 Backward Pass
    4.4.7 Majority vote
  4.5 Fault tolerance
  4.6 Discussions
5 Implementation
  5.1 Overview
  5.2 Modules
  5.3 Transaction life cycle
6 Evaluation
  6.1 TPC-C benchmark
    6.1.1 TPC-C
    6.1.2 TPC-C extended
  6.2 OLTP-Bench
  6.3 Experimental Setup
    6.3.1 Google Compute engine
    6.3.2 Configurations
  6.4 Result and analysis


List of Figures

2.1 Paxos Algorithm
2.2 SC Graph without cycle
2.3 SC Graph with cycle
4.1 Transaction reorder rate
4.2 MDTC transactions in chain
4.3 MDTC System architecture
4.4 MDTC in single data center
4.5 Example for MDTC in a single data center
4.6 Cyclic dependencies
4.7 Catch up
4.8 Coordinator failure
4.9 Transaction received
4.10 Forward pass started
4.11 Backward pass
4.12 Done majority vote
5.1 MDTC modules
5.2 Read-only transaction
5.3 Read-write transaction
6.1 OLTP-Bench with Cassandra integrated Architecture
6.2 Read-only transaction latency
6.3 Write-only transaction latency
6.4 Read-intensive transaction latency
6.5 Write-intensive transaction latency
6.6 Read-only transaction throughput
6.7 Write-only transaction throughput
6.8 Read-intensive transaction throughput
6.9 Write-intensive transaction throughput
6.10 Read-only transaction abort rate
6.11 Write-only transaction abort rate
6.12 Read-intensive transaction abort rate
6.13 Write-intensive transaction abort rate
6.14 Read-only transaction scalability
6.15 Write-only transaction scalability
6.16 Read-intensive transaction scalability
6.17 Write-intensive transaction scalability
6.18 Various throughput in 10ms view length
6.19 Various abort rate in 10ms view length


List of Tables

2.1 Various Paxos extensions
3.1 Related works comparison
6.1 Google compute engine instance system properties


Chapter 1

Introduction

Over the last decades, more and more internet-scale applications have built their infrastructures on Geo-replicated storage systems. Such systems bring enormous benefits as well as great challenges. Distributed transaction support is one of the most prominent challenges among them.

According to the CAP theorem[26], only two of the following three properties can hold at the same time: consistency, availability and partition tolerance. Since partition tolerance is a must-have property in distributed systems, it is likely that most systems choose availability over consistency. Among the popular NoSQL stores, MongoDB[29] states that it does not support multi-document transactions. Cassandra[32] has the lightweight transaction[23] feature, but it is not fully ACID: it uses Paxos[34] to guarantee atomicity for these 'lightweight' transactions, but it lacks an isolation solution for handling concurrent transactions. Several research works have been done in the distributed transactions area recently [21, 19, 16, 20, 31, 38, 46, 25, 44]. Some solutions do not fully support the ACID properties, and some do not perform well in terms of achieving high throughput and low latency. There is still a long journey towards a better solution for distributed transactions. Yet, we would like to try our approach, namely MDTC, which is based on the ideas and designs of several existing distributed transactional systems.

1.1 Motivation

Our motivation comes from the existing problems in distributed transactions for Geo-replicated storage systems: how to achieve ACID properties for multi-key concurrent transactions.


when involving multi-key transactions. Furthermore, some systems even give up supporting multi-key transactions altogether.

Two-phase commit is commonly used in distributed systems for achieving atomicity and consistent states among replicas. However, the problem with two phase commit is that it is a blocking protocol: it can block if the coordinator or a replica fails during the commit phase. Three-phase commit[43] was designed to solve this problem in two-phase commit, but it is not resilient to partition failures. Paxos[34] is a popular algorithm for achieving consistency in distributed systems. Various Paxos extensions, such as [15, 36, 37, 35], as well as the Paxos commit[28] protocol, are the sources and motivations for our solution for achieving atomicity and consistency in distributed systems.

Locking is one of the most common solutions for handling concurrency in transaction systems. Two phase locking is designed to prevent other transactions from accessing the same resource during a transaction's lifetime, which guarantees correctness but hurts performance. On the other hand, as opposed to the pessimistic locking approach, optimistic concurrency control has gained more attention recently. It executes transactions in parallel and resolves conflicts whenever needed before committing a transaction. Moreover, chopping transactions into pieces and executing the pieces with respect to the dependencies between transactions is another optimistic approach for achieving isolation in distributed transactions.

1.2 Goal

The goal of MDTC is to create a framework for distributed transactions in Geo-replicated storage systems that guarantees ACID properties and achieves high throughput and low latency for concurrent multi-key transactions.

MDTC has the following sub-goals:

• The framework should guarantee the highest isolation level, also known as one-copy serializability, for concurrent transactions. Other isolation levels could also be supported, possibly with better performance.

• The framework is modularized and layered, with the transaction layer and the key-value storage layer independent of each other. This helps MDTC adapt to more than one distributed storage solution.

• The framework should also adapt to different read-write workloads. By default, the framework makes no assumptions about the expected workload.



1.3 Methodologies

The following methodologies are taken for the thesis work of MDTC:

• Quantitative study of existing research on distributed transactions and distributed systems which provide transaction support.

• Design and conduct experiments to verify functionality and correctness of MDTC.

• Design and conduct experiments to verify the performance improvement brought by MDTC.

• Overhead analysis on different system setups to adapt to different use cases.

1.4 Contributions

The main contribution of the thesis work is to design and implement a distributed transaction framework which aims to provide better solutions in the realm of building distributed transactional systems.

Alongside the main contribution, the following work has also been performed and is worth mentioning:

• A distributed transactional framework for global scale is created which adapts to different storage solutions. A layered architecture is adopted so that it supports distributed transactions on top of different distributed stores.

• MDTC achieves globally distributed transactions while requiring only one message round among data centers. The latency is low compared to some existing solutions.

• MDTC also has a very low abort rate and high throughput, and it scales well.

• As far as the author knows, we are the first to create a TPC-C[14] benchmark with Cassandra support.

• All the contributions are open sourced.

1.5 Organization

The remainder of the thesis is organized as follows:


• Chapter 3 describes the related work of MDTC. The pros and cons for each of the related works are discussed.

• Chapter 4 describes the design of MDTC. First, the system architecture of MDTC is presented, followed by the transaction API design. Then the whole design is presented in an in-depth way: from the simple single data center scenario to the complicated multi data center scenario. Later on, the design for fault tolerance is presented. Last, a discussion of how MDTC achieves ACID is presented together with an example.

• Chapter 5 describes the implementation of MDTC. The necessary implementation details of the different modules in MDTC are presented. Lastly, the transaction life cycle for two different transaction types is presented.

• Chapter 6 describes the evaluation of MDTC. It starts with an introduction to TPC-C and our extended implementation. Then the evaluation benchmark framework and setup are described. Lastly, the benchmark results are presented and analyzed.


Chapter 2

Background

2.1 Transactions

The transaction concept was derived from contract law: when a contract is signed, it needs the joint signature of the parties to make a deal[27]. In database systems, transactions are abstractions of procedures whose operations are guaranteed by the database to be either all done or not done at all, even in the presence of failures.

Atomicity, consistency, isolation and durability are four properties of modern transaction systems:

• Atomicity specifies the all-or-nothing property of a transaction. The successful execution of a transaction guarantees that all actions of the transaction are executed; if the transaction is aborted, the system behaves as if the transaction never happened.

• Consistency specifies that the database is moved from one consistent state to another with respect to the constraints of the database.

• Isolation describes to what degree concurrent transactions may affect each other.

• Durability specifies that once the transaction is committed, the result is permanent and can be seen by all other transactions.

Concurrent transactions

In the ideal case, transactions in a database system should be executed as if they happened serially. Achieving serializability while handling concurrent transactions is therefore important for database systems. Several isolation levels are defined, corresponding to different concurrency levels:


However, the transaction whose uncommitted data has been read might be aborted, which is known as a 'dirty read'.

• The read committed level avoids 'dirty reads' since it does not allow a transaction to read uncommitted data from another transaction. It could, however, exhibit 'non-repeatable reads', since read committed isolation does not guarantee that a value is not updated by another transaction before the reading transaction commits.

• The repeatable read isolation level avoids 'non-repeatable reads', as read locks can be held until the reading transaction commits, so a value cannot be updated by another transaction until the reading transaction finishes. However, repeatable read does not guarantee that the result set for the same selection criteria is static: a transaction reading twice with the same criteria might get different result sets. This is known as 'phantom reads'.

• The serializable isolation level guarantees that all interleaved transactions in the system produce an execution result equivalent to executing them serially.

2.2 Distributed transactions

Transactions in distributed systems are far more complicated than transactions in a traditional database system. The atomicity property cannot be guaranteed if two or more servers cannot reach a joint decision. Two phase commit is the most commonly used commit protocol in distributed transactions; it helps achieve the all-or-nothing property in a distributed transaction system. Concurrency handling for distributed transactions includes the pessimistic locking approach and the optimistic concurrency control approach, both with pros and cons, which are discussed in this section.

2.2.1 Two phase commit

As the name suggests, there are two phases in the two phase commit protocol: the proposal phase and the commit phase. A transaction manager in the system takes the role of gathering and broadcasting commit decisions. There are also resource managers, which can propose a transaction commit and decide whether the commit decision received from the transaction manager should be committed or aborted.

In the proposal phase, a resource manager (note that a resource manager could also act as the transaction manager) proposes the value to commit to the transaction manager. Upon receiving the proposal, the transaction manager broadcasts it to each of the resource managers and waits for their replies. Each resource manager then replies to the transaction manager that it is prepared.


When all resource managers have replied that they are prepared, the transaction manager will broadcast the commit message to all resource managers, so that all resource managers commit and move to the committed stage.

In the normal case, there are 3N-1 messages (where N is the number of resource managers) for a successful execution of two phase commit: 1 proposal message is sent from one of the resource managers to the transaction manager; the transaction manager then sends a prepare message to the other N-1 resource managers; those N-1 resource managers reply to the transaction manager with prepared messages; lastly, the transaction manager sends the commit message to all N resource managers. A minor improvement is to let one of the resource managers act as the transaction manager, which saves two messages, leaving 3N-3 messages.

The system comes to the aborted stage if the transaction manager does not receive replies from all of the resource managers, which may be due to several reasons: a resource manager aborted, network delay or failure, or one or more resource managers being down. However, once the transaction manager is down, a resource manager that has already replied with a prepared message cannot know whether the transaction is committed or aborted. In this sense, two phase commit is a blocking protocol. Other protocols such as three phase commit[43] solve the blocking in two phase commit, but three phase commit cannot tolerate partition failures.
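To make the message pattern concrete, the following Python sketch simulates the manager side of two phase commit and counts the messages exchanged; the class and method names are our own illustration, not part of any system described here.

# Minimal sketch of the two phase commit message flow, counting messages.
# The names are illustrative only, not MDTC's implementation.

class TwoPhaseCommitManager:
    def __init__(self, resource_managers):
        self.rms = resource_managers          # participant ids
        self.messages = 0                     # total messages exchanged

    def run(self, proposer, value):
        # Proposal: one resource manager sends the value to the manager.
        self.messages += 1
        # Prepare: the manager asks the other N-1 resource managers.
        prepared = []
        for rm in self.rms:
            if rm != proposer:
                self.messages += 1            # prepare message
                self.messages += 1            # prepared reply
                prepared.append(rm)
        if len(prepared) != len(self.rms) - 1:
            return "aborted"                  # a missing reply blocks/aborts
        # Commit: the manager broadcasts the decision to all N resource managers.
        self.messages += len(self.rms)
        return "committed"

mgr = TwoPhaseCommitManager(["rm1", "rm2", "rm3", "rm4"])
print(mgr.run("rm1", value=42), mgr.messages)   # committed, 3N-1 = 11 messages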

2.2.2 Concurrent handling for distributed transactions

Two phase locking

Two phase locking (2PL) utilizes locks to guarantee the serializability of transactions. There are two types of locks: write locks and read locks. The former is associated with a resource before performing a write on it, and the latter before performing a read on it. A write lock blocks a resource from being read or written by other transactions until the lock is released, while a read lock blocks a resource from being written but does not block concurrent reads from other transactions.

2PL consists of two phases: the expanding phase and the shrinking phase. In the expanding phase, locks are acquired and no locks are released. In the shrinking phase, locks are released and no locks are acquired. There are also some variants of 2PL. In strict two phase locking, the transaction follows 2PL and does not release write locks until it has ended, while read locks may be released in the second phase before the transaction ends. Strong strict two phase locking follows strict 2PL but does not release either write or read locks until the transaction has ended.
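The core 2PL rule — no lock may be acquired after the first lock has been released — can be sketched as follows; this is a minimal illustration with invented names, not a full lock manager.

# Sketch of the two-phase locking rule: once a transaction releases any lock
# (shrinking phase), it may not acquire new ones. Illustrative only.

class TwoPhaseLockingTxn:
    def __init__(self, name):
        self.name = name
        self.held = set()
        self.shrinking = False      # becomes True after the first release

    def acquire(self, key, mode):
        if self.shrinking:
            raise RuntimeError("2PL violation: acquire after first release")
        self.held.add((key, mode))

    def release(self, key, mode):
        self.shrinking = True       # entering the shrinking phase
        self.held.discard((key, mode))

t = TwoPhaseLockingTxn("T1")
t.acquire("x", "read")
t.acquire("x", "write")
t.release("x", "read")
try:
    t.acquire("y", "read")          # not allowed once shrinking has started
except RuntimeError as e:
    print(e)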


Optimistic concurrency control

Optimistic concurrency control (OCC) approaches concurrency handling for distributed transactions from a different perspective than 2PL. In OCC, a transaction can proceed without locking resources when it modifies a value. Before committing, the transaction validates whether other transactions have modified the resources it has read; if so, the committing transaction is rolled back.

To achieve this, the resources have to be validated before the transaction is committed; timestamps or vector clocks are used to remember the version of each resource.

The advantage of OCC is that it improves transaction throughput for concurrent transactions when conflicts are rare. However, the abort rate increases as the number of conflicting transactions grows.
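A minimal sketch of the OCC validation step described above, assuming a simple versioned key-value store; the names are illustrative only.

# Sketch of optimistic concurrency control validation: each key carries a
# version; at commit time a transaction aborts if any key it read has been
# updated since it read it. Names are our own, not MDTC's code.

store = {"x": (0, 10), "y": (0, 20)}      # key -> (version, value)

class OccTxn:
    def __init__(self):
        self.read_set = {}                 # key -> version observed
        self.write_set = {}                # key -> new value

    def read(self, key):
        version, value = store[key]
        self.read_set[key] = version
        return value

    def write(self, key, value):
        self.write_set[key] = value

    def commit(self):
        # Validation: abort if any read key was updated concurrently.
        for key, seen_version in self.read_set.items():
            if store[key][0] != seen_version:
                return False
        for key, value in self.write_set.items():
            version, _ = store.get(key, (0, None))
            store[key] = (version + 1, value)
        return True

t1 = OccTxn()
t1.write("x", t1.read("x") + 1)
store["x"] = (1, 11)                       # a concurrent update sneaks in
print(t1.commit())                         # False: validation fails, t1 aborts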

2.3 Paxos

Indeed, all working protocols for asynchronous consensus we have encountered so far have Paxos at their core[17]. Paxos is the de-facto consensus protocol, which can handle up to N/2 − 1 node failures in a system of N nodes, and it is widely used for achieving consensus in distributed systems.

2.3.1 Consensus

Generally, there are two requirements for consensus protocols: safety and liveness. Safety is about the correctness of the protocol: only a value that has been proposed can be chosen, only a single value is chosen, and a process never learns that a value has been chosen unless it actually has been. Liveness is about the protocol eventually behaving as expected: some proposed value is eventually chosen, and if a value is chosen, a process will eventually learn it.

We assume the following model for the consensus protocol: there is a set of nodes that can propose values; any node can crash and recover; nodes have access to stable storage; messages are passed among nodes asynchronously; and messages can be lost or duplicated but never corrupted.

A naive approach to achieving consensus is to assign a single acceptor node which chooses the first value it receives among the proposals. This is easy to implement, but it has drawbacks such as a single point of failure and high load on the single acceptor.

2.3.2 Paxos algorithm

In Paxos (Fig 2.1), each node can take one of three roles: proposer, acceptor, and learner. The consensus procedure starts with a proposer picking a unique proposal number n and sending a prepare request with n to the acceptors. Upon receiving the request, an acceptor promises not to accept proposals whose number is smaller than n, and then sends back to the proposer the highest-numbered proposal it has accepted with a number less than n (say n'). This is the first phase of Paxos; since it proposes the value and collects the promises that the proposed value will be the one being chosen, we call it the propose phase.

The second phase starts when the proposer receives promises from a majority of the acceptors. The proposer picks the value v attached to the highest proposal number among the received promises and issues an accept request for number n and value v to all acceptors. Note that if no such value exists yet, the proposer may freely pick a new value. Whenever an acceptor receives the accept message for number n and value v, it sends back an accepted response if it has not responded to any proposal whose number is bigger than n; otherwise, it sends a reject response to the proposer. Once the proposer receives accepted responses from a majority of the acceptors, it decides the consensus value v and broadcasts the decision to all learners; otherwise, the consensus attempt is aborted by the proposer. The second phase writes the highest proposed value to all nodes, so we call it the write phase.

Figure 2.1. Paxos Algorithm (message exchange between proposer and acceptors: Prepare(n), Ack(n'), Accept(n, v), Accepted(n), Decide(v))


that the system proposes a single value at a time.
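The two phases described above can be sketched as follows for a single consensus instance; this is a didactic single-decree Paxos in Python with in-memory calls standing in for messages, not an implementation taken from the thesis.

# Minimal single-decree Paxos sketch following the propose and write phases.
# Synchronous calls stand in for messages; illustrative only.

class Acceptor:
    def __init__(self):
        self.promised_n = -1          # highest prepare number promised
        self.accepted_n = -1          # highest proposal number accepted
        self.accepted_v = None

    def prepare(self, n):
        if n > self.promised_n:
            self.promised_n = n
            return True, self.accepted_n, self.accepted_v   # Ack(n')
        return False, None, None

    def accept(self, n, v):
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n, self.accepted_v = n, v
            return True                                      # Accepted(n)
        return False

def propose(acceptors, n, value):
    # Phase 1: collect promises from a majority of acceptors.
    promises = [a.prepare(n) for a in acceptors]
    granted = [(an, av) for ok, an, av in promises if ok]
    if len(granted) <= len(acceptors) // 2:
        return None                                          # aborted
    # Use the value of the highest accepted proposal, or pick freely.
    prior = max(granted, key=lambda p: p[0])
    v = prior[1] if prior[0] >= 0 else value
    # Phase 2: ask acceptors to accept <n, v>; decide upon a majority.
    accepted = sum(a.accept(n, v) for a in acceptors)
    return v if accepted > len(acceptors) // 2 else None     # Decide(v)

acceptors = [Acceptor() for _ in range(5)]
print(propose(acceptors, n=1, value="commit T1"))            # "commit T1"
print(propose(acceptors, n=2, value="commit T2"))            # still "commit T1"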

2.3.3 Paxos variants

In this section, several Paxos variants are presented in Table 2.1. Multi Paxos, Cheap Paxos[37], Vertical Paxos[36] and Fast Paxos[15] are chosen and their pros and cons are analyzed.

2.4 Transaction chopping

Transaction chopping splits a long transaction into shorter ones without affecting the correctness of the execution, so that different transactions can be reordered to achieve maximum performance during execution.

Transactions can be represented as an undirected graph in which each node is a piece of a transaction. There are two types of edges:

• C-edges: C stands for conflict; a C-edge connects two conflicting pieces. Two pieces conflict if and only if they belong to different transactions and at least one of them writes to a shared item.

• S-edges: S stands for sibling; an S-edge connects two sibling pieces. Two pieces are siblings if and only if they belong to the same transaction.

A chopping of a group of transactions T1, T2, ..., Tn is said to be correct if any serial execution of the chopped pieces is equivalent to some serial execution of the original transaction instances. The transaction chopping theorem states: a chopping is correct if it is rollback-safe and its chopping graph contains no SC-cycles[41]. Rollback-safe means that a transaction either contains no rollback statements or has all rollback statements in the first piece of the chopping. Imagine that a transaction could be rolled back in a piece other than the first one: when that piece is rolled back, the pieces committed before it are not, and thus the transaction is broken.

An example of transaction chopping is presented below. There are three concurrent transactions, where R denotes a read and W denotes a write.

T1: R(x) W(x) R(y) W(y)
T2: R(x) W(x)
T3: R(y) W(y)

T1 could be chopped into:


Table 2.1. Various Paxos extensions

Multi Paxos
  Main idea: Assume the leader is stable; the leader controls the ballot number increase, so P1 (the first phase) of basic Paxos can be ignored.
  Pros: 2 messages if there is no failure.
  Cons: What if the leader is not stable.
  Failure handling: -
  Used by: Most Paxos implementations, including MDCC.

Cheap Paxos
  Main idea: Main processors are normal Paxos nodes; auxiliary processors only work when a main processor has failed.
  Pros: Tolerates F failures in an F+1 node system.
  Cons: The system must halt if too many main processors have failed.
  Failure handling: Use auxiliary processors to vote; reconfigure so that a new quorum is formed.
  Used by: -

Vertical Paxos
  Main idea: Reconfigure the system vertically when the round is changed; different read and write quorums; configurable master.
  Pros: Tolerates F failures in an F+1 node system.
  Cons: Primary-backup protocol, not a state machine; needs a stable configuration master.
  Failure handling: -
  Used by: -

Fast Paxos
  Main idea: Classical round: proposer -> leader -> acceptors -> learners; fast round: proposer -> acceptors -> learners.
  Pros: 2 message delays for the fast round.
  Cons: Needs collision handling.
  Failure handling: Collision handling: coordinated recovery and uncoordinated recovery.
  Used by: MDCC.


Figure 2.2. SC Graph without cycle (nodes T11, T12, T2, T3; an S-edge between T11 and T12; C-edges from T2 to T11 and from T3 to T12)

T11: R(x) W(x)
T12: R(y) W(y)

This gives the SC-graph shown in Fig 2.2, which does not contain any cycles. Furthermore, if we chop T11 into two pieces:

T111: R(x)
T112: W(x)

then an SC-cycle can be found in the graph, as Fig 2.3 shows.

Recalling the definition of a correct chopping, a group of transactions can in such cases achieve better concurrent execution by reordering commutative operations.

Transaction chopping theory provides a guide for processing multiple concurrent transactions while guaranteeing correctness and not losing performance. Recently, more and more distributed transactional systems have adopted this theory, including [46], [38], [18], etc.
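The SC-cycle condition of the theorem can be checked mechanically. The following sketch enumerates simple cycles in a small chopping graph and reports whether any cycle mixes S- and C-edges; it is a brute-force illustration suitable only for small graphs, not MDTC's implementation.

# Sketch of an SC-cycle check for a chopping graph: a chopping is correct
# (per the theorem above) only if no cycle contains both S- and C-edges.

def has_sc_cycle(nodes, edges):
    # edges: dict mapping frozenset({a, b}) -> 'S' or 'C'
    adj = {n: set() for n in nodes}
    for e in edges:
        a, b = tuple(e)
        adj[a].add(b)
        adj[b].add(a)

    def dfs(start, current, visited, labels):
        for nxt in adj[current]:
            edge = frozenset({current, nxt})
            if nxt == start and len(visited) >= 3:
                seen = labels | {edges[edge]}
                if 'S' in seen and 'C' in seen:
                    return True
            elif nxt not in visited:
                if dfs(start, nxt, visited | {nxt}, labels | {edges[edge]}):
                    return True
        return False

    return any(dfs(n, n, {n}, set()) for n in nodes)

# Fig 2.2: no cycle at all, hence no SC-cycle.
g1 = {frozenset({'T11', 'T12'}): 'S',
      frozenset({'T11', 'T2'}): 'C',
      frozenset({'T12', 'T3'}): 'C'}
print(has_sc_cycle(['T11', 'T12', 'T2', 'T3'], g1))                 # False

# Fig 2.3: chopping T11 into T111/T112 closes the cycle T111-T112-T2,
# which contains both S- and C-edges, so the chopping is not correct.
g2 = {frozenset({'T111', 'T112'}): 'S',
      frozenset({'T111', 'T12'}): 'S',
      frozenset({'T112', 'T12'}): 'S',
      frozenset({'T111', 'T2'}): 'C',
      frozenset({'T112', 'T2'}): 'C',
      frozenset({'T12', 'T3'}): 'C'}
print(has_sc_cycle(['T111', 'T112', 'T12', 'T2', 'T3'], g2))        # True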

2.5 Epochs


Figure 2.3. SC Graph with cycle (chopping T11 into T111 and T112 introduces the cycle T111–T112–T2, which contains both S- and C-edges)

node. Such an event also includes a message from one node being received on another node. Logical clocks can be used for totally ordering events in a distributed system.

In Paxos and Raft[39], there is a similar concept of an operation period called an epoch (called a term in Raft). An epoch is a period in which one leader is elected and acts as the leader for consensus. The epoch number is monotonically increasing, which means that when a leader fails, the new leader can propose a higher number, starting a new epoch.

Recently, the epoch concept has been adopted more and more in distributed transaction systems. Calvin uses 10 ms as the length of the epoch: at the end of each epoch, transaction requests are batched and sent to remote replicas, and the epoch number is synchronized by Paxos. Rococo is another system which adopts epochs: the epoch number increases only after all transactions in the last epoch are committed and no server falls behind or has ongoing transactions from the previous epoch or earlier. Also, dependencies from two epochs ago are discarded[38].
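The epoch idea — buffer requests and ship them as one batch per fixed-length epoch with a monotonically increasing epoch number — can be sketched as follows; the class name and fields are our own illustration, not Calvin's or MDTC's code.

# Illustration of epoch-based batching: requests are buffered and shipped as
# one batch per fixed-length epoch (Calvin uses 10 ms).

import time

class EpochBatcher:
    def __init__(self, epoch_ms=10):
        self.epoch_ms = epoch_ms
        self.start = time.monotonic()
        self.current_epoch = 0
        self.buffer = []
        self.shipped = []                 # list of (epoch_number, batch)

    def _epoch_now(self):
        return int((time.monotonic() - self.start) * 1000) // self.epoch_ms

    def submit(self, request):
        epoch = self._epoch_now()
        if epoch != self.current_epoch:   # epoch boundary: ship the old batch
            self.shipped.append((self.current_epoch, self.buffer))
            self.buffer = []
            self.current_epoch = epoch
        self.buffer.append(request)

b = EpochBatcher()
b.submit("T1")
b.submit("T2")
time.sleep(0.02)                          # cross at least one epoch boundary
b.submit("T3")
print(b.shipped)                          # [(0, ['T1', 'T2'])]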


Chapter 3

Related works

During the thesis work, quite a few systems were studied and analyzed. Among them, we think the following systems are the most important and helpful for MDTC:

• Spanner Google Spanner is the first truly distributed database at global scale which supports externally-consistent distributed transactions. Spanner uses Paxos to guarantee consistent replication within one shard. Two phase commit is used to guarantee atomicity among shards, and the TrueTime API together with two phase locking is used to guarantee that transactions are executed concurrently with respect to linearizability. The biggest invention is TrueTime, which makes synchronizing time at global scale possible by accepting and waiting out a very small uncertainty. Spanner is used in production at Google, and systems such as F1[42] are built on top of it. However, as we understand it, Spanner might need 3 message rounds between data centers for Geo-replicated distributed transactions, which could possibly be improved.

• MDCC MDCC implements optimistic concurrency control for Geo-replicated transactions, which guarantees strong consistency at a cost similar to eventually consistent protocols. In the common case, only one cross data center round trip is needed. It is a quorum based system which combines three Paxos protocols:

– Multi Paxos Multi Paxos assumes that there is a stable leader, so the first phase of Paxos can be skipped. With Multi Paxos, read committed isolation can be guaranteed in MDCC.

– Fast Paxos The idea of Fast Paxos is to bypass the leader: a proposal can be sent directly to the acceptors, saving one message delay in the fast round.

– Generalized Paxos With Generalized Paxos, commutative operations can be reordered so that concurrency is improved.

MDCC's performance in the normal case is outstanding. However, collisions need to be handled, which introduces additional cross data center latency. It is also important to mention that MDCC can only guarantee read committed isolation, which means non-repeatable reads are not avoided.

• Warp Warp achieves serializable transactions with a transaction chain approach. Each transaction has two phases: a forward pass through the chain to validate conflicts and dependencies, and a backward pass through the chain to propagate the commit or abort result of the transaction.

Warp guarantees ACID on top of HyperDex[24] with 3.2 times higher throughput than two phase commit. However, the design of Warp does not properly handle chain node failures, and it is not optimized for the Geo-distributed scale.

• Calvin Calvin is a deterministic concurrency control framework for partitioned database systems. The transaction requests, instead of the transaction execution results, are batched and replicated synchronously or asynchronously between different data centers.

Transactions are batched per 10 ms epoch, ordered by epoch number by a sequencer, and then sent to the same replication group in remote data centers. The deterministic execution guarantees that all data centers will have the same execution result.

However, Calvin only supports read or write transactions; it does not support read-write transactions. A static analysis needs to be performed before the transaction request is replicated between data centers.

• Lynx Lynx is a Geo-replicated distributed storage system which obtains both serializable transactions and low latency. The main idea is to perform a static analysis of transactions, chop each transaction into hops while preserving serializability, and return to the client quickly after the first hop. This is the so-called transaction chain.

Lynx takes advantage of transaction chopping, but it needs the static analysis of transaction dependencies prior to chopping, which might become a bottleneck for the system.


• Rococo Rococo chops transactions into two-phase pieces which are distributed to servers that will reach a deterministic serializable order for executing conflicting transactions.

Rococo achieves serializability when handling concurrent transactions with low latency. However, in the committing phase, servers need to exchange dependency information with each other to reach the deterministic serialized order. This is not optimized for Geo-distributed transactions, as the latency between servers in different data centers would make the system suffer.


Table 3.1. Related works comparison

Spanner
  Main idea: TrueTime for solving global time uncertainty; 2PC, 2PL and Paxos for achieving distributed transactions.
  Pros: The first globally distributed transactional system which provides serializable isolation; TrueTime solves global time uncertainty.
  Cons: 3 cross data center messages for globalized transactions.

MDCC
  Main idea: Multi Paxos, Fast Paxos, and Generalized Paxos.
  Pros: 1 cross data center round trip in the normal case.
  Cons: Only guarantees read committed isolation.

Warp
  Main idea: Forward and backward passes to validate and execute the transaction in a chain.
  Pros: 3.2 times higher throughput than 2PC.
  Cons: Not designed for the Geo-replicated transactional scale.

Calvin
  Main idea: Deterministic concurrency control for partitioned databases.
  Pros: Achieves serializability with a deterministic approach.
  Cons: No support for read-write transactions.

Lynx
  Main idea: Statically analyze transactions and chop them.
  Pros: Low latency, as it returns quickly after the first hop.
  Cons: The static analysis could become a bottleneck.

Rococo
  Main idea: Two phase transaction chopping.
  Pros: Achieves serializability for concurrent transactions.
  Cons: Not optimized for Geo-replicated systems.


Chapter 4

Design

This chapter explains the system design of MDTC. The main design goal for MDTC is to build a Geo-replicated distributed transaction framework which provides one-copy serializability and ACID properties with minimum sacrifice of latency and throughput.

Model Assumptions

We define the following assumptions as the execution model of MDTC, in order to narrow down the scope of the problems MDTC aims to solve. Scenarios outside this model might be addressed and improved in the future, or there may be solutions from other systems that solve those problems for MDTC.


• Membership Since membership management is out of the scope of this thesis, we assume that the servers in the system are configured so that they can join and leave.

• Failures A server can crash and be recovered at any time afterwards. However, there are no Byzantine failures in the system.

• Asynchronous messages The servers communicate with each other by asynchronous message passing. Messages can be delayed or lost but will never be corrupted.

• K-v store feature The k-v store layer should have an API that lets MDTC know which server holds which partition key ranges, so that when the keys involved in a transaction are given, the involved servers are known.

• Transaction type A transaction can be a read, write or read-write transaction and might involve one or more items. Fig 4.2 presents two transactions T1 and T2, where T1 needs to sequentially access servers s0, s1 and s3, while T2 needs to sequentially access s4 and s7. A transaction should not have keys that need to be accessed non-contiguously more than once. With that said, a transaction such as r(x), r(y), w(x) is not valid for MDTC, but r(x), w(x), r(y) is valid.

• Transaction processing Transaction processing times are not necessarily the same, but they should be on the same scale for all transactions, such as tens of milliseconds.

Design inspirations

In previous chapters, we have seen several theories and technologies that are increasingly popular and essential for making a distributed transaction framework correct and efficient.

Paxos is widely used for consensus in distributed systems; the various Paxos extensions and the Paxos commit protocol are helpful for building a correct distributed transaction system with respect to atomicity and consistency. Transaction chopping and transaction chains appear more and more frequently for handling concurrency in distributed transaction systems, since the divided transaction pieces can be reordered and grouped, which eliminates some unnecessary waits and minimizes abort rates in the presence of concurrent conflicting transactions. Epochs are useful for achieving correctness when there is uncertainty in global clocks. A deterministic approach for reaching consensus on transaction input is a new and effective way to get Geo-replicated transactions correct in all data centers.


Figure 4.2. MDTC transactions in chain (two transactions T1 and T2 traversing chains over storage servers S0–S7)

and send them to one of the three servers. In the evaluation, each transaction is designed to have the same processing time (1 ms). The evaluation is done by running 10 rounds of the same transaction input. After execution, we calculated the similarity of the transaction output orders on all servers. The result shows that determinism (transaction output without reordering) is achieved if the load is around 100 transactions per second, but at higher load the reordering increases. That means that with a high transaction load in the system, the deterministic execution approach cannot hold (Figure 4.1).

Given the above evaluation result, it is clear that MDTC should not rely on deterministic execution among different data centers. Instead, the design needs to handle such inconsistency between data centers with minimum compromise on throughput and latency. Therefore, a Paxos-like majority vote plus a catch-up mechanism has been designed for handling cross data center inconsistency.


Figure 4.3. MDTC System architecture (a transaction client on top of two data centers, each with a coordinator and three servers; each server fronts a storage partition, e.g. keys 1-1999, 2000-3999, 4000-5999; layers: transaction client API, transaction layer, K-V store layer)

4.1 Architecture

In MDTC, a layered architecture has been adopted. There are three layers in MDTC: the transaction client API layer, the transaction layer, and the key-value storage layer (Figure 4.3).

The transaction client API exposes a common transaction interface to other systems. With the transaction client API, a transaction request can be sent to MDTC. The caller gets the transaction execution result when the transaction is successfully committed, and an aborted message when the transaction is aborted. The isolation level parameter specifies what kind of isolation applies to the transaction; only one-copy serializability is implemented so far.

The transaction layer is the core part of MDTC. The main roles in MDTC are coordinators and servers. Any server can receive transaction requests from the client API. The roles of servers and coordinators are explained in the following sections. When the transaction finishes with either 'Succeed' or 'Aborted', the transaction response is sent back to the original server that received the request.


distributions in the storage, which is necessary for building up transaction items. The only requirement MDTC places on the storage is full replication in each data center, so that each MDTC server has the same key ranges as its replicas in other data centers.

4.2 Transaction client API

In contrast to traditional databases, any server in a distributed database like MDTC can act as the interface for a transaction. Thus, we call the server involved in receiving a transaction the 'transaction client'. A minimal transaction client contains two procedures: one to prepare a transaction and one to start a transaction. The former allows any other system to provide a schema for preparing the execution, such as 'PreparedStatement' in SQL. The latter starts executing the transaction and guarantees that the transaction will be committed or aborted in an asynchronous callback. The transaction client procedures are presented in Algorithm 1.

Algorithm 1 Transaction Client
1: procedure PrepareTransaction(T)
2:   Execute(T, Prepare)
3: procedure StartTransaction(T)
4:   T.TID ← generateTID()
5:   Send (T, ForwardPass) to T.HeadServer
6:   Async Return (T, Committed/Aborted)
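As a usage illustration of Algorithm 1, the sketch below wraps the two procedures in a hypothetical Python client; the names (MdtcClient, FakeHeadServer, prepare, start) are assumptions for the example and not MDTC's actual API.

# Hypothetical usage sketch of the transaction client in Algorithm 1.
# All names here are illustrative; the thesis does not define this exact API.

import uuid

class FakeHeadServer:
    # Stand-in for a real MDTC head server: immediately reports a commit.
    def send(self, txn, phase, callback):
        callback((txn["tid"], "Committed"))

class MdtcClient:
    def __init__(self, head_server):
        self.head_server = head_server

    def prepare(self, operations):
        # Corresponds to PrepareTransaction(T): build the transaction object.
        return {"ops": operations, "tid": None}

    def start(self, txn, on_done):
        # Corresponds to StartTransaction(T): assign a TID, send the
        # transaction to its head server, return the result asynchronously.
        txn["tid"] = str(uuid.uuid4())
        self.head_server.send(txn, phase="ForwardPass", callback=on_done)

client = MdtcClient(FakeHeadServer())
txn = client.prepare([("r", "x"), ("w", "x", 42), ("r", "y")])
client.start(txn, on_done=lambda result: print(result))   # (tid, 'Committed')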

4.3 Single data center MDTC

In this section, we present how MDTC works in a single data center, so that we can understand why and how transactions are chopped and concurrently executed in MDTC (Figure 4.4).

Once a transaction is received by a server in MDTC, that server (which we also call the transaction client) is responsible for the transaction. First, the transaction is chopped by its involved partition keys. This is possible because each transaction client knows which server holds which partitioned key ranges. As the key partitioning information is very important, a key-value store which provides consistent partitioning is a necessity for MDTC.
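To make the chopping step concrete, the following sketch maps each operation's key to its server using the key-range layout of Figure 4.3 (keys 1-1999, 2000-3999, 4000-5999) and groups the operations into per-server pieces; the helper names are illustrative, not MDTC's implementation.

# Sketch of chopping a transaction by partition key ranges.

from bisect import bisect_right

# (upper bound inclusive, server) pairs sorted by key range.
RANGES = [(1999, "server1"), (3999, "server2"), (5999, "server3")]

def server_for_key(key):
    # Find the first range whose upper bound covers the key.
    idx = bisect_right([hi for hi, _ in RANGES], key - 1)
    return RANGES[idx][1]

def chop(operations):
    # Group a transaction's operations into per-server pieces, preserving
    # the order in which servers are first touched (the chain order).
    pieces = {}
    for op in operations:                 # op = (kind, key, ...)
        pieces.setdefault(server_for_key(op[1]), []).append(op)
    return pieces

txn = [("r", 10), ("w", 10, 1), ("r", 2500), ("w", 4100, 7)]
print(chop(txn))
# {'server1': [('r', 10), ('w', 10, 1)], 'server2': [('r', 2500)],
#  'server3': [('w', 4100, 7)]}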

When the transaction client server has the information about all involved servers in the chain, it sends the transaction to the first server to start the transaction execution. The input of the transaction execution stage is the transaction sent to the first server; the output is a committed or aborted status, as well as the execution result returned by the first server to the transaction client.


Figure 4.4. MDTC in a single data center (flow: transaction request → chop transaction items → forward pass → backward pass → commit or abort → return result)

phase (which is also known as the backward pass). Both phases traverse all involved servers, but in reversed orders.

In the forward pass, the transaction checks whether any other transactions sharing the same keys have been executed on the server but not yet committed. It also checks whether any conflicting transactions exist on that server. In the former case, the transaction adds those executed transactions as its dependencies, so that on all servers the sequential execution order between the executed transactions and the current transaction is guaranteed. In the latter case, the conflicting transaction pairs are equally likely to execute before each other; in the backward pass phase, the dependencies for the conflicting transactions are therefore generated based on the execution order on the first shared server that both transactions pass through (which is called the decider server).

In the backward pass, the transaction traverses from the last server back to the first server. Whenever the transaction arrives at a server, it first checks whether there are transactions in its dependency list that have not yet been executed on that server. In that case, the transaction waits until its depended-on transactions have been executed. Once all of its depended-on transactions have been executed, the transaction can be executed on that server.


Figure 4.5. Example for MDTC in a single data center

all servers simultaneously, and all involved servers fire an asynchronous update to the k-v storage as well as change the transaction status to committed. The transaction client holds a timer on the transaction execution, so that it aborts the transaction if the maximum timeout is reached. This does not happen in the normal case, as the timeout value is configured as a loose bound which should cover most transaction execution delays under correct execution. Lastly, the transaction client server returns the transaction response to the external requester.

An example

Figure 4.5 shows an example of transaction chains when four transactions are executed concurrently in a data center. A data center can contain thousands of servers, while one transaction's keys are distributed over only one or a few servers. In the example, transaction T1 involves servers s0, s1, s2 and s4, transaction T2 involves servers s1, s2, s3, transaction T3 involves servers s1, s2, s3, s4, and transaction T4 only involves one server, s0. A transaction is received by the first server in its chain and starts its forward pass there; on s0, T1 first checks the executed


transactions (which have backward passed through s0 but are not yet committed) to see whether any of them share the same keys with T1. If so, T1 has to add those transactions to its dependency list. This guarantees that T1 will be executed after its depended-on transactions on all other servers whenever a decision needs to be taken about the order between T1 and its depended-on transactions on those servers.

T1 also checks whether there are any conflicting transactions on s0 which have passed s0 but are not yet executed on it (i.e., they have forward passed through s0 but have not backward passed to it yet). Two transactions conflict if and only if both operate on the same key and at least one of the operations is a write. When there are such conflicting transactions, T1 adds them to its conflict list. From Figure 4.5 we can see that there are no other conflicting transactions on s0 (the conflicting transactions are marked in colors other than light blue). But T1 has conflicting transactions T2, T3 on s2 (red), and it also conflicts with T3 on s4 (blue).

After the aforementioned dependency and conflict checks, T1 can forward pass to the next server s1, and finally it reaches the last server in the chain, which is s4. When T1 finishes the forward pass on s4, it starts the backward pass on s4 right away. The backward pass is more complicated than the forward pass: it executes the transaction and guarantees that all conflicting concurrent transactions are executed in the same sequence on all servers. With that, we can be sure that one-copy serializability isolation is guaranteed.

When T1 starts the backward pass on s4, it first checks whether any depended-on transactions in its list have not been executed on s4 yet. If so, T1 needs to wait for those transactions before starting execution. The wait time t is a parameter which needs to be evaluated and tuned: too long a waiting time introduces unnecessary delays, while too short a waiting time causes high CPU usage, which affects the transaction execution rate. After the first wait is finished, T1 continues to check and wait for the remaining unexecuted transactions in the dependency list until the dependency list is empty.

As soon as the dependency list is empty, T1 starts to check the conflict list, to see whether any conflicting transactions need to be waited for so that cyclic dependencies among several transactions are avoided. An example of a cyclic dependency can be generated as follows. Assume that T1 and T3 conflict on s4 for key x, T2 and T3 conflict on s3 for key y, and T1, T2 and T3 conflict with each other on s2 for key z. Suppose that on s4, T1 is executed prior to T3, and on s3, T3 is executed prior to T2, so that there is a dependency chain T1->T3->T2. When T1 and T2 start the backward pass on s2, the dependency T2->T1 must be avoided, otherwise a cyclic dependency would be created (Fig 4.6). In such a case, T2 must add T1 to its dependency list and wait until T1 is executed.


Figure 4.6. Cyclic dependencies (T1, T2 and T3 conflicting on keys x, y and z form a dependency cycle)

to avoid transaction dependency cycles.

4.4 MDTC in multi data centers

The aforementioned MDTC design guarantees one-copy serializability isolation without replication in a single data center. However, in a distributed transaction system, transactions might be generated in any data center by any client. This is beneficial, since Geo-replication moves data close to clients and provides lower access latency, but it also introduces a problem we have to solve: how to guarantee one-copy serializability isolation across multiple data centers for concurrent transactions without sacrificing too much performance.

Intuitively, we could build the transaction chain regardless of data centers. This approach is simple and easy to implement but not realistic: the high cross data center latency would lead to very slow forward and backward passes over servers located in different data centers.


data center partition failures must be handled gracefully. Since all concurrent transactions must be serialized and queued in one data center, the master data center then becomes a bottleneck.

So far, we can say that it is important to find a solution which achieves low latency and high throughput for multiple concurrent transactions. To keep the latency low, we need a solution which only passes through servers inside a single data center. Meanwhile, to keep the throughput high, we need a solution which supports transactions being concurrently executed in several data centers.

Based on the aforementioned assumptions, we have designed a solution which allows transactions to be executed concurrently within a data center while the execution results from different data centers reach consensus with minimum cost.

Generally, we have extended the single data center design to three phases: the forward pass phase, the backward pass phase and the majority vote phase. To realize this solution, we have also introduced some necessary objects and roles into MDTC. It is also worth mentioning that read-only transactions are designed to be handled locally so that we achieve maximum throughput for them. The following sections describe the design which supports multiple data centers in MDTC.

4.4.1 Transaction views

For handling concurrent transactions, the first question is how to define them. We consider transactions that happen in the same short epoch to be concurrent. Such an epoch is called a transaction view in MDTC. We take 10 ms as the default length of the transaction view; however, the value is configurable via the transaction API to adapt to different transaction rates.

The transaction view is the minimum time unit in MDTC. When a transaction is started by the transaction client, there are three special views in its life cycle: the received view, the started view, and the vote view.

In single data center MDTC, a transaction can start executing as soon as it is received by a server; the view in which this happens is the received view. In the multi data center design, however, we employ a wound-wait-like design for non read-only transactions: a transaction does not start executing until the transaction batches for its received view have arrived from the other data centers. The view in which this happens is called the started view. A transaction starts to perform the forward pass as soon as it is started. When the backward pass is finished, the transaction also needs to wait until it has received majority vote messages from a majority of the data centers. Once this condition is fulfilled, the majority vote is started, and the transaction moves to its vote view. As soon as the majority vote is finished, the data center which issued the transaction request delivers the transaction result to the transaction client. Meanwhile, the commit decision is broadcast to all involved servers.


assume that the maximum number of views covering the communication latency between two data centers is x. Then we have:

Received view: v
Started view: v + x
Vote view: v + 2x

However, there might be interference such as network delay or transaction execution delay which causes some data center to lag behind the expected started view or vote view. In that case, serializable execution is guaranteed by the rule that a transaction issued in view v does not start execution until the view-v transactions from all other data centers have been received. If the lag delays the vote view, the delay for the majority vote should still not be unbounded; therefore the max vote view is another special view, which determines whether a transaction that has not been voted on should be aborted.
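The view bookkeeping above can be summarized in a few lines; the bound on how long a transaction may stay unvoted (max_extra_views below) is an assumed parameter name standing in for the max vote view, not one fixed by the thesis.

# Small sketch of the view arithmetic: given a received view v, a cross data
# center latency of x views, and a bound, compute the started view, the
# expected vote view, and the abort deadline (max vote view).

VIEW_LENGTH_MS = 10          # default transaction view length

def view_schedule(received_view, x, max_extra_views=5):
    started_view = received_view + x              # remote batches have arrived
    vote_view = received_view + 2 * x             # majority vote can start
    max_vote_view = vote_view + max_extra_views   # abort if not voted by then
    return started_view, vote_view, max_vote_view

def view_of(timestamp_ms):
    # Map a local timestamp to its transaction view id.
    return timestamp_ms // VIEW_LENGTH_MS

# A transaction received at t = 12345 ms, with one view of cross-DC latency:
v = view_of(12345)
print(v, view_schedule(v, x=1))    # 1234 (1235, 1236, 1241)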

4.4.2 Transaction ID

A transaction is identified by a transaction id (TID) in MDTC. Each TID is a tuple {DID, VID, LST}, where DID is the data center id, VID is the view id and LST is the local timestamp.
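As an illustration, the TID tuple can be represented as follows; the particular sort order shown (view id, then local timestamp, then data center id) is our own choice for the example and is not prescribed by the thesis.

# Sketch of the TID tuple {DID, VID, LST}.

from collections import namedtuple
import time

TID = namedtuple("TID", ["did", "vid", "lst"])

def generate_tid(data_center_id, view_length_ms=10):
    now_ms = int(time.time() * 1000)
    return TID(did=data_center_id, vid=now_ms // view_length_ms, lst=now_ms)

t = generate_tid(data_center_id=1)
sort_key = (t.vid, t.lst, t.did)      # e.g. order transactions within a view
print(t, sort_key)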

4.4.3 Read-only transactions

Intuitively, read latency is critical for some applications on top of MDTC, and read-only transactions do not aggressively create conflicts. Thus we would like read-only transactions to execute as fast as possible while keeping the ACID properties.

Basically, read-only transactions (Algorithm 2) are designed to read from the local data center. When a read-only transaction is received, the receiving server sends a read request for the transaction to each involved server in the local data center (procedure onReceivedTxn()).


Algorithm 2 Read-only transactions
1: procedure onReceivedTxn(T, Server)
2:   Add T to Transactions[]
3:   if T.isReadOnly() then
4:     for S' in T.servers() do
5:       Send (T, Read, Server) to S'
6:   else
7:     Send (T, ForwardPass) to Coordinator
8: procedure onRead(T, Server)
9:   T.result ← Read(T)
10:   Send (T, ReadDone) to Server
11: procedure onReadDone(T)
12:   merge T.result
13:   Add T to Transactions[]
14:   if numReadDone == T.numKeys then
15:     Send (T, ReadonlyReturnConditionCheck) to VoteManager
16: procedure onReadonlyReturnConditionCheckDone(T)
17:   if hasConflictTxnExecuted() then
18:     Send (T, Aborted) to T.Client
19:   else
20:     Send (T, Committed) to T.Client
21:     Add T to Transactions[]

4.4.4 Coordinator

In a single data center, it is enough for all participating servers to execute concurrent transactions using the forward pass and backward pass phases to determine transaction dependencies and guarantee the commit order. However, across multiple data centers, the transaction execution result in one data center could conflict with that of other data centers without coordination, so we need a coordinator role in MDTC.

Each data center has one coordinator (Algorithm 3), which is responsible for synchronizing views and transaction inputs among data centers. Recall from the previous sections that a non read-only transaction needs a wound-wait before starting execution. A server should nevertheless dispatch a transaction to the coordinator as soon as it is received. The coordinator batches the transactions received from local servers per view (procedure onReceivedTxn()). When a view ends, the coordinator sends the batched transactions for the last three views to the remote coordinators in the other data centers (procedure onViewEnd()).


When a coordinator receives a transaction batch from a remote coordinator, it adds the batch to its execution list (procedure onTxnBatch()). When a view starts, the coordinator starts the execution (forward pass) for those transactions by sending each transaction to its head server (procedure onViewStart()).

What if a coordinator lags because a transaction batch is not received or is reordered? The batch id makes sure that the coordinator does not start to dispatch the transactions in a batch until all preceding batches have been dispatched. Moreover, if no transactions are received in a view, an empty batch with that batch id is sent to the remote coordinators.

For coordinator failover, there is also a standby coordinator in each data center, which will be described in later sections.

Algorithm 3 Coordinator
1: procedure onReceivedTxn(T)
2:   V ← getCurrentView()
3:   TxnBatch[] ← findTxnBatchForView(V)
4:   add(TxnBatch[], T)
5: procedure onViewEnd(V)
6:   RemoteCoordinators[] ← findRemoteCoordinators()
7:   TxnBatch[] ← findTxnForLastThreeViews()
8:   for RC in RemoteCoordinators[] do
9:     Send (TxnBatch[], IP) to RC
10: procedure onTxnBatch(TxnBatch[])
11:   add(ExecutionList, TxnBatch[])
12: procedure onViewStart(V)
13:   nextExecutionView ← getLastExecutedView()
14:   nextExecutionView ← nextExecutionView + 1
15:   TobeExecutedTxns[] ← findTxnBatchForView(nextExecutionView)
16:   while TobeExecutedTxns[] == φ and nextExecutionView ≤ currentView do
17:     nextExecutionView ← nextExecutionView + 1
18:     TobeExecutedTxns[] ← findTxnBatchForView(nextExecutionView)
19:   for T' in TobeExecutedTxns[] do
20:     Send (T', ForwardPass) to T'.HeadServer

4.4.5 Forward Pass

Algorithm 4 describes the forward pass phase for a transaction in MDTC. When a transaction (T) forward passes to a server (S), the server first adds the transaction to its local transaction list. Then the server iterates through its transaction list to check whether there are any transactions which have been executed or which conflict with that transaction. If there is a transaction T' which has been executed on the server, a dependency pair in which T depends on T' is recorded on that server and added to both T and T', so that whenever the dependency between them needs to be validated on any other server, MDTC knows that T depends on T'. If there is a transaction T'' which conflicts with T, a conflict pair [T, T''] is also added to the server, as well as to T and T''. Two transactions conflict when at least one of them writes to a shared key (procedure isConflict()). After the validation, if the server is the last server of transaction T, the forward pass is finished and the backward pass starts right away on that server. Otherwise, the forward pass continues to traverse to the next server in the chain (procedure onForwardPass()).

Algorithm 4 Forward Pass
 1: procedure onForwardPass(T, Server)
 2:     for T' in Transactions[] do
 3:         if T'.isCommittedOn(Server) then
 4:             Add T' to T.DependedTransactions[]
 5:         else if isConflict(T', T, Server) then
 6:             Add T' to T.ConflictTransactions[]
 7:     Update T in Transactions[]
 8:     if Server == T.TailServer then
 9:         onBackwardPass(T, BackwardPass)
10:     else
11:         Send (T, ForwardPass) to T.NextServer
12: procedure isConflict(T1, T2, Server)
13:     if T1.isReadOnly() and T2.isReadOnly() then
14:         return false
15:     else
16:         for K in T1.keyOnServer(Server) do
17:             if T2.hasKey(K) and (T1.isWrite(K) or T2.isWrite(K)) then
18:                 return true
19:     return false
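To make the conflict rule concrete, the following is a small Java sketch of the isConflict check from Algorithm 4. The Txn interface and its methods are hypothetical names introduced only for illustration.

import java.util.Map;
import java.util.Set;

// Minimal sketch of the read/write conflict check used during the forward pass.
public final class ConflictCheck {

    public interface Txn {
        boolean isReadOnly();
        // Keys this transaction accesses on the given server, mapped to whether the access is a write.
        Map<String, Boolean> accessesOnServer(String server);
        Set<String> allKeys();
        boolean isWrite(String key);
    }

    // Two transactions conflict when they share a key on this server and at least
    // one of them writes to that key; two read-only transactions never conflict.
    public static boolean isConflict(Txn t1, Txn t2, String server) {
        if (t1.isReadOnly() && t2.isReadOnly()) {
            return false;
        }
        for (Map.Entry<String, Boolean> access : t1.accessesOnServer(server).entrySet()) {
            String key = access.getKey();
            boolean t1Writes = access.getValue();
            if (t2.allKeys().contains(key) && (t1Writes || t2.isWrite(key))) {
                return true;
            }
        }
        return false;
    }
}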

4.4.6 Backward Pass

Algorithm 5 describes the backward pass phase, which traverses the transaction chain in reverse order, starting from the tail server where the forward pass finished. On each server, the backward pass checks whether all transactions that T depends on have already been executed; if so, T is executed by writing to the commit log cache on the server. Otherwise, it should wait, unless it is aborted by a timeout (procedure onBackwardPass()).

A single server might not have complete knowledge of all conflicts in the system at a given point in time. Therefore, a dependency resolver is used to analyze acyclic dependencies. As described in Algorithm 6, the dependency resolver is only responsible for two things: collecting dependencies reported by the servers, and performing a topological sort whenever acyclic dependencies need to be analyzed. When cycles need to be analyzed, a request is sent to the dependency resolver. Upon receiving the analysis result, the transaction adds all transactions that lie on cycles starting from it as its acyclic dependencies, so that no cycle can be formed among them (procedure analyzeAcyclicDependents()).
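The sketch below illustrates one possible way the dependency resolver could collect dependency pairs and find the transactions lying on cycles through a given transaction. It uses a depth-first search rather than the topological sort named in the text, and all identifiers are assumptions made for illustration.

import java.util.*;

// Sketch of the dependency resolver's cycle analysis: it keeps the dependency
// pairs reported by servers and, on request, returns every transaction found on
// a cycle passing through the queried transaction.
public class DependencyResolverSketch {

    // dependsOn.get(a) contains b whenever the pair [a, b] (a depends on b) was reported.
    private final Map<String, Set<String>> dependsOn = new HashMap<>();

    public void onReceivedDependency(String t, String dependedOn) {
        dependsOn.computeIfAbsent(t, k -> new HashSet<>()).add(dependedOn);
    }

    // Returns the transactions found on cycles that start (and end) at txn.
    public Set<String> transactionsOnCyclesThrough(String txn) {
        Set<String> onCycles = new HashSet<>();
        dfs(txn, txn, new HashSet<>(), new ArrayDeque<>(), onCycles);
        return onCycles;
    }

    private void dfs(String start, String current, Set<String> visited,
                     Deque<String> path, Set<String> onCycles) {
        for (String next : dependsOn.getOrDefault(current, Collections.emptySet())) {
            if (next.equals(start)) {
                onCycles.addAll(path); // every transaction on the path closes a cycle with start
            } else if (visited.add(next)) {
                // the visited set keeps this sketch simple and terminating; a full
                // analysis may need to revisit nodes to enumerate every cycle
                path.push(next);
                dfs(start, next, visited, path, onCycles);
                path.pop();
            }
        }
    }
}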

When a transaction is aborted by a timeout, all of its dependent transactions need to be aborted in a cascading way. The transaction's execution result is discarded, and all of the transaction's involved servers as well as the transaction client are notified of the abort.

4.4.7 Majority vote

The purpose of the majority vote phase is to reach consensus on the transaction execution results in all data centers. Paxos is a popular choice for achieving consensus in distributed systems, but a full Paxos round needs 5 message delays, and even optimized variants such as Multi-Paxos still need 3 message delays. In MDTC, we have designed a majority vote procedure that needs only one cross data center message delay to build a consistent result in all data centers. In short, after the majority vote either all data centers have the same execution result or the transaction is aborted (Algorithm 7).

When the backward pass finishes on the last server, that server sends a StartMajorityVote message to the coordinator. The coordinator dispatches the message to the remote coordinators and adds it to the local majority vote transaction list (procedure onStartVote()).


Algorithm 5 Backward Pass
 1: procedure onBackwardPass(T, Server)
 2:     if hasNotExecutedDependency(T, Server) then
 3:         wait(t)
 4:     for T' in Transactions[] do
 5:         if isExecuted(T') and T' ∉ T.DependedTransactions[] then
 6:             Add T' to T.DependedTransactions[]
 7:         else if !isExecuted(T') and T' ∉ T.ConflictTransactions[] and isConflict(T', T) then
 8:             Add T' to T.ConflictTransactions[]
 9:     if isNeedAnalyzeAcyclicDependencies(T) then
10:         AcyclicDependents[] = analyzeAcyclicDependents(T, Server)
11:         if AcyclicDependents != φ then
12:             for T'' in AcyclicDependents[] do
13:                 Add T'' to T.DependedTransactions[]
14:     else if ¬needtoWaitforUnexecutedDependency(T, Server) then
15:         txnResult = writeToCache(T)
16:         addResult(txnResult, timestamp) to T
17:     else
18:         goto 2: checkForDependency(T, Server)
        ▷ Executed
19:     if i == 0 then
20:         Send (T, StartVote) to VoteManager
21:     else
22:         Send (T, BackwardPass) to T.nextServer
23: procedure needtoWaitforUnexecutedDependency(T, Server)
24:     for T' in T.DependedTransactions[] do
25:         if T' is not committed on Server then
26:             return true
27:     return false
28: procedure waiting(t)
29:     repeat
30:         sleep()
31:     until currentTimestamp ≥ t
32: procedure analyzeAcyclicDependents(T, Server)
33:     Send (T, AnalyzeAcyclicDependencies) to DependencyResolver


Algorithm 6 DependencyResolver
1: procedure onReceivedDependencies(T, T')
2:     Add [T, T'] to Dependencies[]
3: procedure onTopologicalSort(T, Server)
4:     Cycles[] ← topologicalSort(T, Dependencies[])
5:     Send (T, Cycles[]) to Server

Upon receiving a vote, the vote manager counts the votes and, once a majority of data centers have replied, determines the majority result and commits or aborts the transaction accordingly; on an abort, the dependency resolver will check whether any cascading aborts need to be invoked (procedure onReceivedVote()).

The majority result is calculated by checking whether the transaction dependencies reported by two data centers for the same transaction are identical. If so, the two votes count as the same result; otherwise, they count as different results. If a transaction that T depends on is still in the catch up phase, T should never be voted for commit, so an empty vote result is returned (procedure findMajorityResult()).
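A minimal sketch of this comparison is shown below, assuming that each vote carries the set of dependency identifiers observed in its data center; the Vote type and its fields are hypothetical.

import java.util.*;

// Sketch of how a majority result could be determined from the per data center
// votes: two votes agree when they report the same dependency set for the
// transaction, and a result is chosen once more than half of the data centers agree.
public final class MajorityVoteSketch {

    public record Vote(String dataCenter, Set<String> dependencyIds) {}

    // Returns the agreed dependency set, or Optional.empty() if no majority exists yet.
    public static Optional<Set<String>> findMajorityResult(List<Vote> votes, int numDataCenters) {
        Map<Set<String>, Integer> counts = new HashMap<>();
        for (Vote v : votes) {
            int c = counts.merge(v.dependencyIds(), 1, Integer::sum);
            if (c > numDataCenters / 2) {
                return Optional.of(v.dependencyIds());
            }
        }
        return Optional.empty();
    }
}

For example, with three data centers, two matching votes already fix the result, and the third data center can catch up later if its local result differs.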

Read-only transactions are executed locally and do not need a majority vote among the data centers. However, a dirty read could occur if the read-only transaction returns its result while another transaction that writes to the same key should be ordered before it. We must guarantee that such a write transaction returns before the read-only transaction; if this cannot be guaranteed, the read-only transaction is aborted to prevent the dirty read (procedure onReadonlyReturnConditionCheck()).
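The following sketch shows this abort condition for locally executed read-only transactions, under the assumption that each transaction records its execution view and the lower bound of its voting period; the class and field names are illustrative, not the actual MDTC types.

// Sketch of the dirty-read guard for locally executed read-only transactions.
public final class ReadOnlyGuard {

    public static final class TxnInfo {
        long executionView;            // view in which the transaction executed
        long votingPeriodLowerBound;   // earliest view in which its vote can commit
    }

    // Abort the read-only transaction if any conflicting write transaction could
    // still be committed before the read-only result is returned.
    public static boolean shouldAbortReadOnly(TxnInfo readOnly, Iterable<TxnInfo> conflictingWrites) {
        for (TxnInfo w : conflictingWrites) {
            if (readOnly.executionView > w.votingPeriodLowerBound) {
                return true;
            }
        }
        return false;
    }
}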

Finally, the transaction commits after all of its dependencies have been committed, or it is aborted upon a timeout (procedure commit()). It is also worth mentioning that if the majority result differs from the result in the current data center, the current data center needs to catch up. This is achieved by sending the majority result to each involved server and overwriting the local results on them. Note that, because of the transaction dependencies, the catch up procedure may have to be done in a cascading way, starting from the first transaction that needs it and proceeding up to the one for which the majority vote was performed.
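A possible shape of this cascading catch up is sketched below: dependencies are overwritten before the transaction itself, so servers receive results in dependency order. The types and method names are assumptions made for illustration only.

import java.util.*;

// Sketch of the cascading catch up after a majority vote disagrees with the local result.
public final class CatchUpSketch {

    public interface Server { void overwrite(String txnId, byte[] majorityResult); }

    public static final class Txn {
        final String id;
        final List<Txn> dependedTransactions = new ArrayList<>();
        final List<Server> involvedServers = new ArrayList<>();
        byte[] majorityResult;
        boolean needsCatchUp;
        Txn(String id) { this.id = id; }
    }

    // Depth-first catch up: transactions that T depends on are overwritten first,
    // then T's own majority result is pushed to every involved server.
    public static void catchUp(Txn t, Set<String> done) {
        if (!done.add(t.id)) return; // already caught up
        for (Txn dep : t.dependedTransactions) {
            if (dep.needsCatchUp) {
                catchUp(dep, done);
            }
        }
        for (Server s : t.involvedServers) {
            s.overwrite(t.id, t.majorityResult);
        }
    }
}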


Algorithm 7 Majority vote
 1: procedure onReadonlyReturnConditionCheck(T, Server)
 2:     T.isAbort ← shouldAbortReadonlyTxn(T)
 3:     Send (T) to Server
 4: procedure shouldAbortReadonlyTxn(T)
 5:     ConflictTxns[] ← findConflictTxns(T)
 6:     for T' in ConflictTxns[] do
 7:         if T.ExecutionView > T'.votingPeriodLowerBound then
 8:             return true
 9:     return false
10: procedure onStartVote(T)
11:     for VM in RemoteVoteManagers[] do
12:         Send (Vote, T) to VM
13:     onReceivedVote(T)
14: procedure onReceivedVote(T, Vote)
15:     if numVoteReceived > numDCs/2 then
16:         MajorityResult ← findMajorityResult(T, Vote[])
17:         if MajorityResult == φ and numVoteReceived == numDCs then
18:             Send (T, Aborted) to T.Client
19:             Send (T, Aborted) to DependencyResolver
20:         else if MajorityResult ≠ φ then
21:             commit(T)
22:             Send (T, Committed) to DependencyResolver
23:             Server.CommittedTransaction = T
24:             if LocalResult != MajorityResult then
25:                 Add T to CatchUpList
26:                 catchUp(T)
27: procedure findMajorityResult(T, Vote[])
28:     if T.Dependency ∈ CatchUpList then
29:         return φ
30:     MajorityResult ← majority(Vote[])
31:     return MajorityResult
32: procedure catchUp(T)
33:     for Server in T.Servers[] do
34:         Server.Overwrite(T.Server.CommitResult())
35: procedure commit(T)
36:     if T.issuedDC == currentDC then


Figure 4.7. Catch up

4.5 Fault tolerance

Three kinds of failures are critical in MDTC and are handled explicitly: coordinator failure, cross data center message loss, and cross data center message delay. Coordinator failure is handled with a standby coordinator, while message loss and delay are handled by adding redundancy to the transaction batches.

Figure 4.8 shows an example in which the coordinator of DC1 fails while transaction T1 is received locally in DC1 and another transaction T2 is sent from DC2 to DC1. In both cases, the message is sent to the coordinator of DC1. Firstly, the coordinator failure needs to be detected; heartbeat messages can be used for this. Secondly, messages from remote data centers must not be lost even if the local coordinator has failed and a new coordinator election is not yet finished, so a standby is needed in MDTC to take over the coordinator role until a new coordinator is elected. Lastly, a new coordinator election must result in exactly one coordinator; even if the previous coordinator comes back to life, it should learn that there is already a new coordinator. The Paxos algorithm is the best candidate for coordinator election, while from an implementation point of view, Raft could be adopted.
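As an illustration of the heartbeat-based detection mentioned above, the sketch below shows a standby that takes over when the coordinator's heartbeats stop; the timeout value and all names are assumptions, not part of the MDTC implementation.

import java.util.concurrent.*;

// Sketch of heartbeat based failure detection for the coordinator: the standby
// expects a heartbeat within a timeout and takes over the coordinator role
// (triggering an election) when heartbeats stop.
public class StandbyWatchdog {

    private static final long HEARTBEAT_TIMEOUT_MS = 3_000;

    private volatile long lastHeartbeatMs = System.currentTimeMillis();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    // Called whenever a heartbeat message from the coordinator is received.
    public void onHeartbeat() {
        lastHeartbeatMs = System.currentTimeMillis();
    }

    // Periodically check whether the coordinator is still alive.
    public void start() {
        scheduler.scheduleAtFixedRate(() -> {
            if (System.currentTimeMillis() - lastHeartbeatMs > HEARTBEAT_TIMEOUT_MS) {
                takeOverAsCoordinator();
            }
        }, HEARTBEAT_TIMEOUT_MS, HEARTBEAT_TIMEOUT_MS, TimeUnit.MILLISECONDS);
    }

    // The standby serves the coordinator role until a new coordinator is elected
    // (for example through a Raft style election, as discussed above).
    private void takeOverAsCoordinator() {
        System.out.println("coordinator heartbeat missed, standby taking over");
    }
}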


Figure 4.8. Coordinator failure

To tolerate lost or delayed cross data center messages, the transactions from the last three views are put into the same batch, so that a batch which goes missing is covered by the batches that follow it.

In terms of partition tolerance, MDTC tolerates n out of 2n + 1 partition failures, since the majority vote lets the system proceed as long as a majority result can be found. For example, with three data centers, MDTC continues to commit transactions when one data center is unreachable.

4.6 Discussions

In this section, we walk through an example of two transactions executed concurrently in three data centers to discuss how the ACID properties are achieved in MDTC.

Example

Two transactions T1 and T2 are received in two data centers, DC1 and DC2, respectively. The items of the two transactions are:

• T1: Read A, update A = A + 1. Update C = 1. Read E.
• T2: Read A. Read E, update E = E + 1.

Where A, B, C, D, and E are different keys which are distributed on server S1,
