A CONFLICT DETECTION AND RESOLUTION MECHANISM FOR BOUNDED-DELAY REPLICATION

Johan Lundström

Submitted by Johan Lundström to the University of Skövde as a dissertation towards the degree of M.Sc. by examination and dissertation in the Department of Computer Science. September 1997.

I hereby certify that all material in this dissertation which is not my own work has been identified and that no work is included for which a degree has already been conferred on me.

Johan Lundström

Abstract

One way of avoiding unpredictable delays in a distributed real-time database is to allow transactions to commit locally. In a system supporting local commit, and delayed propagation of updates, the replication protocol must be based on eventual consistency. In this thesis, we present a bounded-delay replication method which is based on eventual consistency. The approach used is to divide the replication protocol into three different problems: propagation, conflict detection, and conflict resolution, where we focus on the conflict detection and resolution mechanism. We have evaluated different eventual consistency protocols and chosen version vectors as the base for the conflict detection algorithm. We introduce a method of separating policy and mechanism in the conflict resolution mechanism, which is based on forward recovery to avoid unnecessary computation. The protocols presented in this work are intended for use in the distributed active real-time database system DeeDS. We conclude that the proposed protocol can be used in DeeDS under the assumption that no partition failures occur.

Contents

1 Introduction
  1.1 Real-Time Systems
  1.2 Distributed Real-Time Systems
  1.3 Replication in Distributed Database Systems
  1.4 Definitions and Assumptions
    1.4.1 Concurrency control
    1.4.2 Replication
    1.4.3 Eventual consistency
    1.4.4 Bounded-delay replication
  1.5 Overview of the Dissertation

2 Background
  2.1 Distributed Database Systems
    2.1.1 Transactions
    2.1.2 Concurrency control
  2.2 Replication in Distributed Database Systems
    2.2.1 Eventual consistency
  2.3 Real-Time Systems
    2.3.1 Real-time databases
  2.4 DeeDS Architecture

3 Problem
  3.1 Motivation
  3.2 Purpose of this Dissertation
  3.3 Description of the Problem
    3.3.1 Problems of eventual consistency protocols
    3.3.2 Our problem
    3.3.3 Replication requirements by DeeDS applications

4 Evaluation of existing protocols
  4.1 Definition of Replication Protocol
  4.2 Evaluation
    4.2.1 Version vector algorithm
    4.2.2 Log transformation protocol
    4.2.3 Precedence graph protocols
  4.3 Evaluation Results

5 Bounded-Delay Replication
  5.1 Bounded propagation
  5.2 Conflict Detection
    5.2.1 Enhanced version vector algorithm
    5.2.2 Bounded Conflict Detection
  5.3 Conflict Resolution Mechanism
    5.3.1 Access Patterns
    5.3.2 Resolution Policies

6 Results
  6.1 Summary
    6.1.1 Bounded propagation
    6.1.2 Bounded conflict detection
    6.1.3 Bounded conflict resolution
    6.1.4 Simulation
    6.1.5 Discussion of problems with version vectors
  6.2 Discussion
    6.2.1 ASAP versus Bounded-delay replication
    6.2.2 Discussion of cost of version vector algorithm
    6.2.3 Cost of replicating data

7 Conclusions
  7.1 Summary
    7.1.1 Eventual consistency
    7.1.2 Evaluation
    7.1.3 Conclusions
  7.2 Discussion
    7.2.1 Relaxing serializability
  7.3 Contributions
  7.4 Related work
  7.5 Future Work

8 Acknowledgments

Bibliography

List of Figures

Chapter 1 Introduction

Fault tolerance and timely behavior are becoming more important features when designing large systems. Distribution, by replicating data, processes, or functions, is essential to achieving fault-tolerant systems. This thesis centers around replication of data in a distributed real-time database. This chapter first covers real-time systems, and then expands the discussion to distributed real-time systems. Next, we introduce different approaches to replication in distributed database systems. We then describe the definitions and assumptions made concerning distributed real-time databases.

1.1 Real-Time Systems

Nuclear plants and airplane tracking systems have been controlled by real-time systems for a long time. Due to the complexity of controlling such systems, it is impossible for human operators to do it by themselves. In such systems the correctness of the system relies not only on the correctness of the produced information but also on when it is produced [Mul93]. Another definition of a real-time system is given by [BW90]: "any information processing activity or system which has to respond to externally generated input stimuli within a finite and specified period."

In this work we regard systems which have to react in a timely manner to external stimuli, but in which there can also exist activities without any time bounds. A real-time system generally consists of one or more controlled objects which are controlled by a computer system and a human operator [Mul93]. In order to guarantee timeliness, a set of assumptions has to be made about the environment in which the system should operate. These assumptions are [Mul93]: i) the load hypothesis, which defines the assumed peak load to be delivered to the system (or the maximum transaction rate with worst-case execution time); ii) the fault hypothesis, which defines the types and frequencies of faults introduced to the system. If under any conditions these assumptions are not realistic, the functionality of the real-time system cannot be guaranteed. The assumption coverage is a measure of the accuracy of the assumptions made. In this thesis, real-time systems with mixed-criticality transactions are considered, i.e., real-time systems consisting of transactions of mixed criticality, which can guarantee hard real-time transactions. A real-time system can be based on two different design approaches: either the system is event-triggered or it is time-triggered [Mul93]. Event-triggered systems react to external events immediately. A time-triggered system reacts to external events at pre-specified instants in time. In this thesis we only consider event-triggered systems with transactions of mixed criticality.

1.2 Distributed Real-Time Systems

If real-time systems are to be fault-tolerant, distribution is inevitable [Mul93]. Distributed real-time systems can be classified depending on: (1) how fail-safe the system is, and (2) whether they are based on a best-effort response implementation or a guaranteed response implementation.

Some systems are inherently distributed, such as plant automation systems. These systems have, in the past, been implemented with centralized control and would benefit greatly if the control system could be distributed. Distributed control would lead to better fault tolerance and, if designed carefully, such systems could also be more maintainable. Other examples of distributed real-time systems are automotive control and intelligent products [Mul93]. Designing a distributed real-time system is a complex task. The resulting system needs to have the property of guaranteed timeliness [Mul93]. In order to verify the design under the load and fault hypotheses, the system needs to be testable. When testing a real-time system, not only the value domain needs to be tested but also the time domain (i.e., the times when the values are produced). In the DeeDS project [ABE+95], developed at the University of Skövde, such testing techniques are developed. A central part of testing is the event monitor [Mel96], which monitors the events produced externally and internally (see section 2.4 for an explanation of the DeeDS architecture). Every system evolves with time, and changes need to be incorporated. Maintainability is the property which gives a measure of how easy it is to maintain the system (correct errors and incorporate new code). If the real-time system is time-triggered, the whole system needs to be re-evaluated and tested, since the original calculations no longer hold. If the system is event-triggered, and given that new code is encapsulated and does not introduce unpredictability, then it is easier to modify the system. Any man-made system will eventually fail because of physical faults [Mul93]. If the system's mission time is longer than, or of the same order as, the components' mean time to failure (MTTF), then the system needs to be designed to tolerate failures. Replication of components and software is an essential way of achieving fault tolerance. In order to replicate software, a replication protocol needs to be specified which supports the requirements that an application may put on the system.

This thesis describes a replication protocol intended for the DeeDS prototype, but it should be suitable for any system with similar requirements. The protocol described provides bounded delay, i.e., propagation of updates is made after local transaction commit, within bounded time. In order to achieve bounded-delay replication, a deterministic conflict detection algorithm and a deterministic conflict resolution mechanism are necessary. To limit the amount of development work, an existing conflict detection algorithm is enhanced, and a methodology for forward resolution is described. The resolution mechanism allows specification of resolution policies based on the semantics of objects stored in a distributed real-time database.

1.3 Replication in Distributed Database Systems

Redundancy is necessary to achieve fault-tolerant systems. In such systems, data or processes are duplicated and the replicas are situated at different nodes. If a failure happens, a replica can assume the work. The functionality of the replication protocol is to propagate updates or events so that all replicas have the same system state, i.e., updating one replica implies updating all replicas in the system. Replication protocols can be based on different approaches depending on the behavior wanted. If mutually consistent databases are the main requirement, then every update in the system needs to be synchronized and controlled by some concurrency control mechanism. In those cases the replication protocol and the concurrency control algorithm can be combined. However, if the requirements of the distributed database are instead based on high availability and fast response time, the replication algorithm and the concurrency control algorithm can be separated, and the database consistency requirements relaxed. Distributed databases are often created to enhance fault tolerance, i.e., to tolerate different problems such as site failures and different classes of network failures, of which the most severe is partition failure.

Distributed databases are also created to avoid long communication delays.

Definition 1.1 A partition failure is a communication failure where any two nodes within a partition can communicate and no two nodes in different partitions can communicate [Dav84].

The problem with partition failures in a distributed database is how to handle running transactions that are updating data items. Some replication protocols handle partition failures by disallowing updates during the partition, thereby not introducing any inconsistency. Other protocols do not limit transaction execution during partitioning. Instead, they are based on detecting and resolving the inconsistencies after merging the partitions.

1.4 Definitions and Assumptions

In this work, operations on a database are grouped into an atomic action, called a transaction.

Definition 1.2 A transaction is a set of operations that modifies or inserts data in a database.

A transaction is the smallest executable unit seen by another user or application. In the case where many users can access the database concurrently, each transaction must be designed to ensure that it completes and that the operations performed on the database are correct. Changes made must be permanent or not stored at all. This can be expressed as:

Assumption 1.3 A general DBMS should guarantee the ACID properties for transactions (Atomicity, Consistency, Isolation, Durability) [EN94].
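As a toy illustration of these properties (not DeeDS code; all names are hypothetical), a transaction can buffer its writes and apply them in one step at commit, so that either all of its changes become visible or none do:

```python
class Transaction:
    """Toy transaction: buffer writes, then apply all of them at commit."""

    def __init__(self, db):
        self.db = db          # the shared database, here a plain dict
        self.writes = {}      # buffered key -> value updates

    def write(self, key, value):
        self.writes[key] = value   # not yet visible to others (isolation)

    def commit(self):
        # All buffered writes are applied in one step (atomicity); once
        # applied they are part of the database state (durability would
        # additionally require stable storage).
        self.db.update(self.writes)

    def abort(self):
        self.writes.clear()        # nothing was ever applied

db = {"x": 1}
t = Transaction(db)
t.write("x", 2)
t.write("y", 7)
t.commit()
assert db == {"x": 2, "y": 7}
```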

In section 2.1.1 a formal definition of the properties is given. When transaction executions are interleaved, in order to achieve higher throughput and shorter delays, some ordering of the transactions is needed.

1.4.1 Concurrency control

Based on the type of concurrency control used, replication protocols can be divided into different classes. We will differentiate between optimistic and pessimistic protocols, depending on how restrictive the control is, as well as between syntactic and semantic protocols, depending on how much of the semantics of the application is taken into account. When several users are allowed to execute transactions in an interleaved fashion, concurrency control is needed to be able to guarantee that the resulting database state is correct. Concurrency control thus synchronizes and interleaves the operations of many concurrent transactions in a controlled and correct manner. There exist a number of concurrency control techniques that are used to ensure isolation of concurrently executed transactions. Most of these techniques ensure serializability of the execution history, using sets of rules or protocols [EN94]. In a distributed database the notion of concurrency control is more complex, since it entails concurrent transactions at multiple sites. The resulting concurrency control algorithm needs not only to restrict operations to ensure serializability, but also to synchronize the operations on the replicas to ensure mutually consistent data items (i.e., all replicas reveal the same value at all times). In this thesis, a protocol ensuring mutually consistent operations on replicas as well as one-copy serializability (see definition 1.6) consists of both a replication protocol and a distributed concurrency control algorithm. Next we define a schedule, leading up to a definition of one-copy serializability.

Definition 1.4 A schedule S of n transactions T1, T2, ..., Tn is an ordering of the operations of the transactions, subject to the constraint that, for each transaction Ti that participates in S, the operations of Ti in S must appear in the same order in which they occur in Ti [EN94].

A correctness criterion can be based on the schedule a database creates. Serializability is the most widely used correctness criterion.

Definition 1.5 A schedule S of n transactions is serializable if it is equivalent to some serial schedule of the same n transactions [EN94].

The above definition does not include the features of a distributed database, where synchronization of operations is needed. To use serializability as a correctness criterion in a distributed database, it needs to be strengthened to include synchronization.

Definition 1.6 A one-copy serializable schedule is equivalent to all transactions executing at a single site, resulting in a serializable schedule [DGMS85].

Concurrency control defines the ordering of transactions in a database, and can be either optimistic or pessimistic. Optimistic concurrency control techniques, also called validation or certification techniques, do not check the database before any operations, but instead check the database when the transaction has executed. No locking is performed during the transaction execution. Pessimistic concurrency control algorithms control the execution of a transaction before the execution. An example of a pessimistic algorithm is the two-phase locking protocol, where a transaction is divided into two phases: an expanding phase, during which the transaction can acquire locks but not release them, and a shrinking phase, during which the transaction can release existing locks but not acquire new ones. The two-phase locking protocol enforces serializability.
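As an illustration of the two-phase rule, here is a minimal sketch (hypothetical names, not taken from the dissertation) that enforces an expanding phase followed by a shrinking phase:

```python
import threading

class TwoPhaseTransaction:
    """Toy two-phase locking: an expanding phase (acquire only),
    followed by a shrinking phase (release only)."""

    def __init__(self):
        self.held = []          # locks currently held
        self.shrinking = False  # once True, no further acquire is legal

    def acquire(self, lock):
        assert not self.shrinking, "2PL violated: acquire after a release"
        lock.acquire()
        self.held.append(lock)

    def release_all(self):
        self.shrinking = True   # entering the shrinking phase
        while self.held:
            self.held.pop().release()

# Two data items, each protected by its own lock.
lock_a, lock_b = threading.Lock(), threading.Lock()

t = TwoPhaseTransaction()
t.acquire(lock_a)   # expanding phase
t.acquire(lock_b)
# ... read and write the items protected by lock_a and lock_b ...
t.release_all()     # shrinking phase; the transaction can now commit
```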

To avoid aborting scheduled transactions, the pessimistic approach can be made conservative: if the concurrency control algorithm is locking-based, the transaction acquires all its locks before starting execution. In this way, any transaction scheduled for execution will execute until commit time.

1.4.2 Replication

In addition to the different concurrency control strategies above, replication protocols can also be distinguished along another dimension, based on the type of information used in determining the correctness of the database operations [DGMS85].

Syntactic replication: Syntactic replication algorithms use serializability as the correctness criterion. Syntactic approaches use neither the semantics of the transactions nor the semantics of the data items to ensure correctness. One protocol often used in implementing a syntactic pessimistic approach is the two-phase locking protocol [EN94].

Semantic replication: Semantic replication uses either the semantics of the transactions or the semantics of the data items themselves to control the correctness of the execution schedule. Semantic approaches can be divided into two subcategories. The first uses serializability as the correctness criterion and uses the semantic information to test serializability; for example, some replication protocols distinguish transactions that overwrite previous values from transactions that are incremental. The second abandons serializability altogether and defines correctness in terms of correct database states.

Hence, a strategy for replication in a distributed database can be classified as [DGMS85]: (1) pessimistic-syntactic, (2) pessimistic-semantic, (3) optimistic-syntactic, or (4) optimistic-semantic. In a distributed database these strategies take on somewhat different meanings, since database operations need to be synchronized over a communication network.

1.4.3 Eventual consistency

In this work, the dimensions of replication strategies are divided between the different parts of the replication protocol. The conflict detection algorithm follows an optimistic-syntactic strategy, in which updates are propagated after local commit and neither transaction nor data item semantics are relied upon. The strategy of propagating updates after local commit is often called eventual consistency or optimistic replication. In this thesis this is a desired behavior, since it allows transactions to commit locally and thus not depend on any network delays. Inconsistencies will occur temporarily in any distributed database based on optimistic replication. If the semantics of the data items or the transactions are known, it is possible to perform forward recovery instead of applying undo and redo of transactions to correct the conflicts. For instance, if two sensors read a temperature in an engine, the semantics may be that one of the sensors is more reliable. In a situation where the replicas are inconsistent (i.e., not mutually consistent), the conflict resolution policy chosen could be: choose the value from the more reliable sensor. Many situations in manufacturing and process control can be designed in such a way that forward recovery is possible. In real-time databases the most common approach is to use pessimistic concurrency control, since aborting and restarting transactions results in unpredictable behavior. If a pessimistic concurrency control algorithm is used, transactions scheduled for execution will commit when done. In a distributed database this approach is not suitable: if failures can happen independently, i.e., one node or a communication link can fail independently of the others, then pessimistic concurrency control would be too restrictive, since all operations on the database must be synchronized. In this work it is assumed that the distributed real-time database has the following properties:

Assumption 1.7 A distributed real-time database has the following properties:

1. Fully replicated: all data items are replicated to all sites in the system.
2. Memory resident: the entire database is stored in main memory to avoid unpredictable disk delays.
3. Delayed replication allowed: to avoid unpredictable transaction delays, data is updated locally and then propagated to all other sites.
4. Temporarily inconsistent data allowed: the database needs to tolerate temporarily inconsistent data during propagation of updates.

The real-time database should also support transactions of mixed criticality, both critical (hard deadlines) and non-critical (firm and soft deadlines), and be event-triggered instead of time-triggered. The main requirements of the DeeDS architecture are to support mixed-criticality transactions and to dynamically guarantee transactions with hard deadlines.

Assumption 1.8 In order to achieve deterministic local transactions, every node has a pessimistic concurrency control algorithm working together with the dynamic scheduler.

1.4.4 Bounded-delay replication

In order to make any guarantees regarding transactions that depend on updates written by another node, the replication protocol should support bounded-delay replication. Bounded-delay replication requires a real-time network protocol which can guarantee bounded message delivery.

Assumption 1.9 A real-time network exists which, given that all nodes are available, guarantees bounded message delivery, i.e., the network protocol handles network failures such as partitions.

In section 5.1 the requirements for such a network protocol are given. In this thesis we define bounded-delay replication to be:

Definition 1.10 A bounded-delay replication protocol guarantees that the delay of any update from one node Nx to any other available node Ny is bounded.

Bounded-delay replication requires deterministic conflict detection and conflict resolution, which are the main requirements of this work.
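Given the three-part decomposition of the replication protocol used in this thesis, one way to read Definition 1.10 is as a sum of per-phase bounds (a sketch of the intended guarantee, not a formula from the dissertation):

```latex
% If propagation, conflict detection, and conflict resolution each have
% a worst-case bound, then the replication delay of an update from node
% N_x to any available node N_y is bounded by their sum:
t_{\mathrm{repl}}(N_x \rightarrow N_y) \le t_{\mathrm{prop}} + t_{\mathrm{detect}} + t_{\mathrm{resolve}}
```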

1.5 Overview of the Dissertation

In the next chapter, an introduction to databases and replication is given. Chapter 3 introduces the problem and the approach used to solve it, by dividing it into sub-problems. In chapter 4, different eventual consistency protocols are described and an evaluation of them is given. Chapter 5 covers the proposed solutions to the sub-problems. Results and discussion are given in chapter 6, and finally, in chapter 7, conclusions and contributions are described.

Chapter 2 Background

2.1 Distributed Database Systems

Distributed database systems are often the solution when requirements dictate an available and fault-tolerant multi-user system, e.g., airline ticket reservation systems or train ticket systems. In these systems, many different branches wish to reserve tickets simultaneously. This places different requirements on the system compared to centralized systems, where all requests are serviced by a single server. To enhance the availability and fault tolerance of the system, the database items are replicated to all sites in the system. This requires some synchronization between the sites, so that the replicas in the system contain the same data. When an update is made at one site, this update has to be propagated to all sites containing a replica of the data item. In distributed database systems the problem of concurrency control increases, since the serializability of a transaction depends not only on other transactions residing on the same node but also on transactions on other nodes interconnected by a communication network.

In a distributed system, additional problems such as site failures and communication failures cause special difficulties, since the transactions must be synchronized over the network. In order to summarize the different problems with distributed database systems and concurrency control, we need to discuss transactions and concurrency control in more detail.

2.1.1 Transactions

The set of operations that modifies or inserts data in a database is called a transaction [Elm92]. In a database system there can be many simultaneous users. To be able to execute different user transactions at the same time, synchronization between the transactions is needed to ensure that the database is left in a consistent state. Almost all concurrency control algorithms (see section 2.1.2) restrict the execution of a transaction to uphold the ACID properties [EN94, HHB96, GR93]:

Atomicity: A transaction is an atomic unit of execution and should be performed entirely or not performed at all.

Consistency: A transaction is a correct transformation of the database state. The correct execution of a transaction takes the database from one consistent state to another without violating any of the defined integrity constraints.

Isolation: A transaction should not make its updates visible to other transactions until it has committed; only the updates of committed transactions are visible.

Durability: Once a transaction completes successfully (commits), the changes made should be permanent and must not be undone by any subsequent failure.

Consistency is the only property that falls on the shoulders of the application programmer. In a distributed database management system (DDBMS) the ACID properties may be overly restrictive, since they often limit concurrency.

There has been work on specifying ways to relax the consistency property without violating the correctness of the database. The correctness criterion for a database state is that it does not violate the integrity constraints. Integrity constraints can be static or dynamic. Static integrity constraints restrict the set of valid states the database can be in. Dynamic integrity constraints restrict the set of valid state transitions that may be made from a given database state, e.g., the salary of a worker may not become more than that of the executive. Transactions that violate the ACID properties may break a dynamic integrity constraint and end up in an inconsistent database state. Not all applications need to restrict the concurrency of transactions to enforce the ACID properties. A transaction may be allowed to read uncommitted data, or may not be required to support repeatable reads, in order to get higher throughput or greater concurrency. Gray and Reuter [GR93] have classified transactions into different degrees of isolation (originally called degrees of consistency [GLPT76]):

Degree 3: A third-degree isolated transaction has no lost updates and has repeatable reads. This is "true" isolation.

Degree 2: A second-degree isolated transaction has no lost updates and no dirty reads, but not necessarily repeatable reads.

Degree 1: A first-degree isolated transaction has no lost updates.

Degree 0: A zero-degree isolated transaction does not overwrite another transaction's dirty data (changed but uncommitted data), if the other transaction is in degree 1 or greater.

This classification makes it possible to specify transactions with different degrees of isolation in the system, and hence possibly introduce greater concurrency. Concurrency control is discussed in the next section.
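The degrees are cumulative: each degree adds one guarantee on top of the previous one. A small sketch (a hypothetical encoding, not from the dissertation) that maps a transaction's guarantees to its degree:

```python
def isolation_degree(no_dirty_writes, no_lost_updates,
                     no_dirty_reads, repeatable_reads):
    """Map Gray and Reuter's guarantees to an isolation degree (0-3)."""
    if not no_dirty_writes:
        raise ValueError("below degree 0: overwrites dirty data")
    if not no_lost_updates:
        return 0
    if not no_dirty_reads:
        return 1
    if not repeatable_reads:
        return 2
    return 3

# A transaction with no lost updates and no dirty reads, but whose
# reads are not repeatable, runs at degree 2.
assert isolation_degree(True, True, True, False) == 2
```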

2.1.2 Concurrency control

Concurrency control techniques are used to ensure non-interference, or isolation, of concurrently executing transactions. When transactions execute concurrently, the order of the different operations from the different transactions forms a schedule. A transaction schedule is serializable if it produces the same effect on the database as if the constituent transactions were executed in some serial order. The theory of serializability formally defines the requirements for achieving serializable executions [HHB96, EN94]. Concurrency control is needed because, if transactions were executed in an uncontrolled manner, problems such as the following could arise:

The Lost Update Problem: This occurs when two or more transactions read and modify the same item and their executions are interleaved in such a way that the changes leave the database in an inconsistent state. Suppose two transactions T1 and T2 are interleaved in such a way that T1 reads a value A before T2 reads A. If T1 then writes the value before T2, but after T2 has read A, the update will be lost when T2 writes the new value of A based on the original value of A (a small sketch follows after this list).

The Temporary Update (or Dirty Read) Problem: This problem occurs when a transaction T1 modifies some item that is then read by a transaction T2. If transaction T1 must then roll back, T2 will have read dirty data (i.e., uncommitted changes).

The Incorrect Summary Problem: This problem occurs when one transaction is summing up values in a list while, at the same time, some other transaction updates items in the list. This would leave the result in an inconsistent and non-deterministic state.
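The lost update anomaly is easy to reproduce. A minimal sketch (illustrative only) of the interleaving described above:

```python
db = {"A": 100}              # shared data item

a_read_by_t1 = db["A"]       # T1 reads A = 100
a_read_by_t2 = db["A"]       # T2 reads A = 100, before T1 writes

db["A"] = a_read_by_t1 + 10  # T1 writes 110
db["A"] = a_read_by_t2 + 20  # T2 writes 120, based on its stale read

# T1's update is lost: a serial execution would have produced 130.
assert db["A"] == 120
```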

Concurrency control protocols fall into two categories: optimistic and pessimistic [EN94, HHB96, GR93]. Pessimistic concurrency control prevents inconsistencies by disallowing potentially non-serializable executions and by ensuring that committed transactions never need to roll back. An example is the two-phase locking protocol [EN94], which is widely implemented in commercial database systems. Optimistic concurrency control permits non-serializable executions to occur; when anomalies are detected, the transaction is aborted. This is done during the validation phase (also known as the pre-commit phase). In a distributed database, replicas need to be synchronized, and the concurrency control needs to be integrated with the replication protocol. Replication in distributed databases is discussed next.

2.2 Replication in Distributed Database Systems

Replicating data and applications is the most powerful way of achieving high availability. Replication of data is not a new field: we have long replicated databases onto backup tapes that are used as recovery data in case of database crashes. In a distributed database, the concurrency control protocol has to be extended to also ensure atomic execution of a transaction, i.e., either all operations execute or none does. This is often done by ensuring one-copy serializability [HHB96, EN94], which produces a serial schedule as if all transactions were executed at one site. A distributed transaction is usually initiated at a coordinator site. The read and write operations are forwarded to all participants (sites holding replicas). This type of protocol is called an atomicity protocol, and it ensures that all participants agree on commit or abort of the distributed transaction. The most popular atomicity protocol is the Two-Phase Commit protocol [HHB96]. In this thesis, protocols which enforce immediate consistency are called classical protocols; they fall into different categories, such as [HHB96]:

Read One Write All (ROWA): This is the simplest family of replication protocols. A read operation on item d is executed locally on one replica, and a write operation is translated into a write to all n replicas (in this thesis, n denotes the number of replicas or sites). The underlying concurrency control system must ensure mutual consistency of all replicas, i.e., changes are made visible at all sites or not at all.

ROWA-Available: This replication protocol tries to enhance the performance of the original ROWA protocol by allowing sites to fail. A write operation is translated into a write to all available replicas instead of all replicas. ROWA-A tolerates site failures, but not communication failures such as network partitioning.

Primary Copy: A primary copy protocol designates one copy as the primary; the rest of the replicas are called backups. Write operations are issued at the primary copy, but read operations are executed locally. A transaction commits when the primary and all backups have acknowledged the write. When a primary copy fails, a new primary has to be selected. This requires that failures can be detected and distinguished from slow communication. Since it is very difficult to distinguish site failure from slow communication, it is hard for these protocols to withstand network partitioning. The most familiar primary copy protocol is the Two-Phase Commit protocol [HHB96].

Quorum Consensus or Voting: The above protocols favor read operations over writes: reads are done on one replica, but writes are done to a major part of the replicas. Quorum consensus (QC) protocols (often called voting protocols) normally only need to write to a subset of the replicas called a write quorum. A read quorum is guaranteed to intersect the write quorum, which guarantees that a read operation always returns the latest value. QC protocols can tolerate both site and communication failures and do not need a special recovery protocol.
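The quorum intersection property mentioned above is purely arithmetic: with n replicas, a read quorum of size r, and a write quorum of size w, requiring r + w > n forces every read quorum to overlap every write quorum. A minimal sketch (illustrative, not from the dissertation):

```python
def valid_quorums(n, r, w):
    """Check the classic quorum consensus conditions for n replicas."""
    reads_see_latest_write = r + w > n   # read and write quorums intersect
    writes_are_ordered = 2 * w > n       # any two write quorums intersect
    return reads_see_latest_write and writes_are_ordered

# With n = 5: r = 2 and w = 4 work, while r = 2 and w = 3 would let a
# read quorum miss the latest write quorum entirely.
assert valid_quorums(5, 2, 4)
assert not valid_quorums(5, 2, 3)
```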

The protocols found in the literature combine different techniques to enhance different features, such as tolerating failures, limiting the number of messages, etc. Most replication protocols which allow network and site failures assume that failures occur in isolation, i.e., fault isolation. This is not entirely correct, since incorrect propagation of updates may distribute incorrect updates, which can cause the whole system to crash. This can only happen if a transaction does not follow the ACID properties, especially the consistency property, which cannot be guaranteed by any database system (the designer is responsible for implementing correct transactions). A distributed system is expected to tolerate independent failures such as a site crash, i.e., to be fault-tolerant. To achieve a fault-tolerant system, some redundancy is required [Mul93]. Depending on the characteristics of an application, different emphasis may be put on different attributes of dependability. The International Federation for Information Processing defines the following attributes of dependability [IFI94]:

Availability: A measure of the probability that the system is ready to deliver a service at time t.

Reliability: A measure of continuous service delivery.

Safety: A measure of the probability that the system does not fail catastrophically within a time interval t.

Maintainability: A measure of the time to restoration after the last failure.

Note that these are only the most important attributes. They cover the non-functional properties of a system that relate to its quality of service, i.e., the service as seen by the user.

2.2.1 Eventual consistency

Most replication protocols are based on strict serializability of transactions, which guarantees immediate consistency, i.e., when a transaction commits, its changes are reflected in all copies before any other transaction can read the updated data. Among the first systems to introduce eventual consistency were Grapevine [BLNS82] and Clearinghouse [OD83], in which a site distributes updates after it has updated its local database. In this way, the global database state is inconsistent during the time it takes for an update to propagate through the distributed system. Other eventual consistency algorithms, collectively referred to as epidemic algorithms, are introduced in [DGH+87]. Epidemic algorithms are based on a probability distribution in which an update spreads like a rumor (or disease), and eventually all (or most) sites are infected with the latest update. A special protocol, anti-entropy, is used to ensure that updates are guaranteed to eventually reach all sites. Eventual consistency replication protocols do not rely on any special algorithm to detect conflicts; instead, these protocols rely on the user to find and resolve conflicts.

Weak consistency or optimistic replication protocols allow local commits and, instead of conflict avoidance, use conflict detection. These protocols are designed to tolerate network partitioning, and they work differently from eventual consistency protocols such as Grapevine. Optimistic replication protocols can be combined with classical immediate consistency protocols within one partition. The common denominator of these replication protocols is that they allow transactions to commit during a partition. Conflict detection protocols need to resolve conflicts in some deterministic way. Some optimistic replication protocols, such as that of LOCUS [PR82], involve manual intervention to resolve inconsistencies. Optimistic protocols found in the literature are:

Version Vectors [PR82]: A version vector, consisting of n pairs (Si, Vi), is associated with every copy of the file (or item, depending on the granularity of locks), where Vi is the number of updates originated from site Si. A version vector V dominates another version vector V' if every entry in V has a version number greater than or equal to the corresponding entry in V'. A conflict is detected if neither V nor V' dominates the other (a small sketch of the dominance test is given at the end of this section). This technique is simple, but it can only detect write-write conflicts, and hence it is not well suited for database use.

Log Transformation: During partitioning, information about the transactions executed, and the order in which they are executed, is recorded in logs. Upon reconnection of the partitions, the logs are exchanged and a rerun log is created, which indicates what should be considered as having executed during the partition. The rerun log is created by changing the order of transactions, and it may then be necessary to roll back and rerun transactions. Transactions have to be predefined, and the semantic properties of the transactions have to be known, to avoid rolling back unnecessary transactions. The properties include:

- Commutative: transaction pairs whose read sets and write sets do not overlap.
- Overwriting: transactions that overwrite previously stored data items.
- Quasi-commutative: transactions which, by invoking some extra transactions, become commutative.

Precedence Graph: An example of a precedence graph protocol is the optimistic protocol. All read and write operations are written to a log. Each partition (or site) constructs a graph where the nodes represent transactions and the edges represent interactions between the transactions. These interactions can be found in the read sets and write sets of the transactions, which are stored in the log. The first step in the construction of the graph is to model the interactions between transactions in the same partition. There are three types of edges: i) dependency edges, where one transaction reads a value written by another transaction; ii) precedence edges, where one transaction reads a value which is later modified by another transaction; iii) interference edges, where a transaction in one partition reads a value written by a transaction in another partition. Inconsistencies are resolved by rolling back transactions until the resulting graph does not contain any cycles; this can lead to cascading rollbacks. Merging partitions is done by forwarding the updates after the graph is acyclic.

In [Gus95], Gustavsson discusses these three protocols as ways of achieving lazy replication (the same as optimistic replication; see section 2.2) for DeeDS. Enhancements are made to modify the existing protocols to tolerate optimistic replication. Gustavsson concludes: "The major weakness of LVV [Lazy Version Vectors] is that there may be many version copies in the network. Knowledge about the transactions is not used. ... LLT [Lazy Log Transformation] requires that all transactions are known a priori and that inverse and compensating transactions are defined. ... We suggest that the DeeDS project use the LLT, since LVV works on the tdbm level with no knowledge of the overlaying transactions, which makes it difficult to analyze." [Gus95]

These conclusions do not match the intentions for DeeDS, since flexibility and simplicity require replication at the tdbm level instead of at the OBST layer (see section 2.4). If only data items, instead of transactions, are replicated, then new sites can be added without knowledge of the transactions on other sites, and the functionality of a site can be finely tuned to incorporate only the functions needed for some specified task, e.g., a new machine in a computer-controlled manufacturing environment, or a new node such as when installing air conditioning in a car.
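To make the dominance test concrete, here is a minimal sketch of version vector comparison and write-write conflict detection (illustrative only; the enhanced algorithm actually proposed for DeeDS is developed in section 5.2.1):

```python
def dominates(v, w):
    """True if vector v dominates w: v[i] >= w[i] for every site i."""
    return all(v[i] >= w[i] for i in range(len(v)))

def compare(v, w):
    """Classify two version vectors of the same item."""
    if dominates(v, w) and dominates(w, v):
        return "identical"
    if dominates(v, w):
        return "v newer"   # w's history is a prefix of v's
    if dominates(w, v):
        return "w newer"
    return "conflict"      # concurrent updates: neither dominates

# Three sites; entry i counts the updates originated at site i.
assert compare([2, 1, 0], [1, 1, 0]) == "v newer"
assert compare([2, 0, 0], [1, 1, 0]) == "conflict"  # write-write conflict
```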

2.3 Real-Time Systems

A real-time system is any system in which the time at which output is produced is significant. Usually a real-time system is designed to control or supervise some process in the physical world. The output of the system relates to the input, and there is a bound on the reaction time. The time by which the output must be produced is often called a deadline. Deadlines can be classified into different classes of criticality depending on the severity of missing them. A system is a hard real-time system if the consequences of a timing failure are catastrophic. A soft real-time system is one where the consequence of a timing error is of the same magnitude as the utility of the operational system [Mul93]. The classification into hard or soft real-time systems is sometimes not descriptive enough. Another classification of real-time systems, based on the consequences of timing failures, is:

Hard critical: missing a deadline would lead to catastrophic consequences such as great material loss or even death, e.g., nuclear plants.

Hard essential: costly penalty, e.g., train control systems.

Firm: loss of service, e.g., missed trains.

Soft: degraded service, e.g., telephone switching.

Real-time systems can be divided into two categories depending on their design paradigm. Event-triggered systems react immediately to external events, whereas time-triggered systems react to such events in predefined time slices.

The work in this thesis focuses on event-triggered systems, which can exhibit a mixture of both hard-essential and soft-deadline tasks. Real-time systems are in essence reactive, i.e., they react to some external stimulus from the environment and produce some result, which in turn can control the environment. For a real-time system to be predictable and dependable, it has to complete its computations even in situations where a failure has happened; this requires fault tolerance. Fault tolerance itself requires redundancy [Mul93]. Redundancy can be put into hardware, into software, or into the communication between different nodes.

2.3.1 Real-time databases

Distribution is useful in achieving fault tolerance, and it is also inherent in many real-time systems, e.g., weather forecasting and train or flight control. The growing amount of data that needs to be stored in these real-time systems, and the complicated data dependencies between components, make it desirable to use a database system. Traditionally, database systems do not focus on real-time response but on integrity and consistency of data. Some approaches to making database systems uphold real-time behavior are [Sin88]:

Main memory residency: Avoids unnecessary I/O delays due to disk reads and writes. This approach introduces some special problems, such as crash recovery, and requires special features with regard to concurrency control. It has been shown that a larger locking granularity actually increases concurrency in main-memory-resident databases [Sin88].

Trading a feature: Another way of enhancing ordinary databases into real-time databases is to sacrifice serializability, increasing availability and performance by decreasing the consistency requirements on the database.

In several cases, serializability can be sacrificed if semantic information about the transactions is known in advance [Son88].

There are special problems with both approaches. Crash recovery is a special case where main memory databases suffer, since in the case of a site crash the recovery procedure has to have some information from which to recover. Crash recovery techniques for main memory databases often introduce some form of stable storage, e.g., a log kept in stable storage to enable recovery from a crash. This storage increases the overhead and introduces some nondeterminism due to I/O delays. The access methods and query processing techniques developed for traditional stable-storage databases may also be inadequate when designing main memory databases. In traditional databases, concurrency control algorithms have used serializability as the criterion for concurrency. Even though serializability is generally too restrictive as a concurrency control criterion, there is no generalized criterion that offers greater concurrency without jeopardizing the correctness of the database. Relaxing serializability requires a priori knowledge about the underlying application. This knowledge can be used to relax serializability in order to enhance performance, i.e., to increase the concurrency of transactions.

2.4 Distributed Active Real-Time Database System (DeeDS) Architecture

Increasingly, public services and consumer products demand timely behavior and larger storage capacity.

The Distributed activE rEal-time Database System (DeeDS), developed by the DRTS research group at the University of Skövde [ABE+95], is one attempt to design and implement tomorrow's database systems.

[Figure 1: The DeeDS architecture. Real-time applications sit on top of the DeeDS services (rule manager, event monitor, event criticality checker, replication and concurrency, the OBST object store, the tdbm storage manager, scheduler, and dispatcher), which run on OSE Delta with an extended file system and are split across an application processor and a service processor; the legend distinguishes loose coupling, tight coupling, and distributed communication to DeeDS on other nodes.]

DeeDS is an event-triggered database system which uses dynamic scheduling of transactions. Transactions can be of mixed criticality (see section 2.3). The reactive behavior in DeeDS is modeled using Event-Condition-Action rules. Figure 1 shows the architecture of DeeDS. DeeDS separates application-related functions from critical system services, where each type executes on a separate processor. This work focuses on the OBST object store layer [CRS+92], which is a public-domain object-oriented database, and the tdbm layer [Bra92], which is a storage manager. These layers reside on the application processor. On the service processor reside the event monitor and the scheduler. The reactive behavior of DeeDS is modeled after the event-condition-action (ECA) paradigm [CBB+89].

The rule manager executes rules which are triggered by events signaled by the event monitor. DeeDS restricts the number of coupling modes to make rule execution more predictable. Cascaded triggering is limited in order to attain an upper bound on the maximum execution time of transactions. The nested transaction model makes replication of data difficult, since sub-transactions can be aborted even if the root transaction commits. This makes eventual propagation more suitable, since, at root-transaction commit time, all committed objects which need to be propagated will have been logged. If replication were immediate, there would be additional synchronization between all replicas and the different sub-transactions. The DeeDS platform consists of:

Active Database Functionality: The consistency of the database has to be controlled by integrity constraints. More complex consistency checks are desirable, and a way of achieving this is by incorporating active behavior in the database. Active behavior takes the form of rules which are stored in the database; the rules are triggered by pre-specified events, and if some condition holds, some action is performed (a toy sketch of this ECA pattern follows after this list).

Event Monitoring: The function of an event monitor is dual. First, the vital part of an event monitor is to detect events when they occur and pass them on to the modules that are interested. Second, an event monitor can be used to test the system. Testing a real-time system requires instrumentation of the system, i.e., a sensor configuration together with a monitoring facility [Mel96]. In [Mel96], different ways of instrumenting a real-time system are discussed. When testing a real-time system by means of an incorporated monitor, the special code inserted for testing purposes must be left intact, since the predictability of the tested system would be affected if this special code were later removed. This phenomenon is called the probe effect.

Scheduler: DeeDS incorporates a scheduling monitor in which different scheduling algorithms can be used. Deadline- and value-driven heuristic algorithms are used to be able to dynamically guarantee that hard-deadline transactions meet their deadlines.
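As a toy sketch of the ECA pattern referred to above (hypothetical rule, event names, and engine; in DeeDS the rules are stored in the database and executed by the rule manager):

```python
rules = []

def add_rule(event_name, condition, action):
    """Register an Event-Condition-Action rule."""
    rules.append((event_name, condition, action))

def signal(event_name, event):
    """Called by the event monitor when an event is detected."""
    for name, condition, action in rules:
        if name == event_name and condition(event):  # condition holds?
            action(event)                            # then run the action

# Hypothetical rule: raise an alarm when a reported engine temperature
# exceeds a limit.
add_rule("temperature_update",
         lambda e: e["value"] > 90.0,
         lambda e: print("ALARM: engine at", e["value"], "degrees"))

signal("temperature_update", {"value": 95.5})  # triggers the alarm
```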

Chapter 3 Problem

This chapter introduces the problems faced and the purpose of the work. We separate the problem into subproblems and try to highlight the difficulty of solving it.

3.1 Motivation

Most replication protocols today, e.g., two-phase commit [EN94, Mul93] as well as the ROWA and Quorum Consensus protocol families (see section 2.2), are based on atomic transactions which uphold serializability and mutual exclusion. The common definition of such transactions is that they support the ACID properties (section 2.1.1). The main differences between these protocols are:

- The number of sites that must participate to ensure mutual exclusion.
- The number of messages that are needed to ensure mutual exclusion (a rough cost illustration follows after this list).
- How site failures are handled.
- How communication failures, such as partitioning, are handled.
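For the second point, a rough textbook illustration (centralized two-phase commit with one coordinator and n participants; a standard cost model, not a figure from this dissertation):

```latex
% n prepare requests + n votes + n decision messages (acks excluded),
% and at least two network round trips of latency per commit:
\mathrm{messages}_{\mathrm{2PC}} \approx 3n, \qquad
\mathrm{latency}_{\mathrm{2PC}} \ge 2 \cdot \mathrm{RTT}
```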

These replication protocols all require synchronization between replicas at the different sites in the network. Synchronizing decisions between such replicas among a large number of sites introduces considerable message overhead. Communication over a network increases the transaction duration, since network delays, potentially lost messages, and other failures must be dealt with. In many real-time database systems, availability and predictability are more important than immediate consistency. The paradigm behind the DeeDS database prototype is based on main memory residency, to eliminate I/O indeterminism [Sin88], and full replication of data, to eliminate distributed transactions. The consistency model in DeeDS is that every update is made locally, i.e., local transactions commit. During this time the node is seen as partitioned from the network, and updates are then replicated as soon as possible in a way that guarantees eventual consistency. The concurrency control used on a local node is pessimistic, to be able to guarantee deadlines on hard real-time transactions. If the local concurrency control were optimistic, there could be no such guarantees, since a transaction scheduled for execution could be aborted due to a conflicting transaction in the commit phase. Allowing local commit (without interference) enhances the availability and predictability of a single node, but may introduce inconsistencies. Such replication protocols are called optimistic replication protocols or weak consistency protocols, since they allow replicas to be temporarily inconsistent. They use conflict detection algorithms and conflict resolution instead of conflict avoidance.

3.2 Purpose of this Dissertation

This thesis focuses on how to achieve bounded-delay replication based on bounded propagation, conflict detection, and conflict resolution. The replication protocol is intended to be used in the DeeDS prototype. Our main goals are to:

1. Determine which requirements are placed on the replication protocol by the target application areas for DeeDS.

2. Design a replication protocol, including a bounded propagation protocol, a conflict detection algorithm, and a conflict resolution mechanism.

3. Determine how to achieve bounded propagation of updates in a distributed real-time database system.

4. Separate policy and mechanism in the conflict detection and resolution mechanism of the replication protocol.

3.3 Description of the Problem

The problem faced when we want to replicate data in a distributed real-time database is that most of the commonly used protocols use serializability as the means for ensuring correctness of the database. This is too restrictive for distributed real-time databases, since network delays may cause transactions to miss their deadlines. Site failures and partitioning may also cause transactions to wait indefinitely, due to deadlocks or livelocks [BW90, Mul93]. To overcome these problems we need a replication protocol which allows transactions to commit locally and then propagate the updates to all other nodes. Full replication is required, since otherwise transactions may need to distribute some database operations to other nodes. Replication protocols supporting such transactions are called eventual consistency protocols.

3.3.1 Problems of eventual consistency protocols

In short, we will have the following problems:

Bounded propagation: The protocols that support eventual consistency are targeted above the network layer in the ISO OSI model (see section 5.1) and do not address the time it takes for a message to be delivered. In a real-time system, bounded time on message delivery is essential in order to predict when an update is reflected in all replicas. We have decided that the distributed real-time database will be fully replicated, which means that, to support real-time requirements, we need a broadcast protocol. The broadcast protocol must support bounded message delivery.

Conflict detection: Most replication techniques used today apply serializability as a sufficient criterion for consistency. They are based on conflict avoidance, e.g., hierarchical locking or distributed locking. The best known propagation protocol, two-phase commit [EN94, HHB96], is based on serializability across all participating nodes. This forces a transaction to wait for all participating nodes to be prepared to commit before being able to commit. If conflict avoidance methods cannot be used, due to the unpredictable delays involved, conflict detection and resolution is necessary. A conflict detection algorithm must detect conflicts originating from different types of operations (or situations). Such algorithms are especially useful if the database designer allows a lower degree of isolation on a transaction (see section 2.1.1), since this will increase concurrency in the system. Consider two transactions T1 and T2 modifying two separate data items, A and B. If T1 and T2 execute on different sites and are allowed to execute in degree 1 isolation (a degree 1 isolation transaction has no lost updates and does not overwrite another transaction's updated data), then no conflict will occur even if the transactions were to read, and base their values on, "old" values of the data items B and A, respectively.

Conflict resolution: Conflict resolution techniques often depend on the policies allowed by the conflict detection protocols, as well as on the data items stored in the database. Since a real-time database can be used in many different application areas, the policy requirements may differ; e.g., one may want to use the latest update for conflicting updates of one class of data items and the average value for conflicting updates of another. One possibility is to use compensating or corrective transactions [KLS90] to reach a consistent database state. This means separating the policy for resolving conflicts from the mechanism, allowing the designer of the database to specify the policy for each type of data item (see the sketch at the end of this section).

In this work, optimistic replication protocols are the basis for a bounded-delay replication protocol. The definition of an optimistic replication protocol is that it does not limit transactions during network partitioning. In DeeDS, when local updates are made, the node can be viewed as temporarily partitioned. There exist protocol families which support weak consistency or eventual consistency, e.g., Version Vectors [PR82], Partition Logs such as the log transformation protocol [GM83, BGMR+83], and Precedence Graphs such as the optimistic protocol [DGMS85]. These protocols do not, however, address the predictability required in a real-time application.
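To make the separation of policy and mechanism concrete, consider the following sketch. The mechanism applies whichever resolution policy the database designer registered for a class of data items, without knowing the policy's semantics. This is our illustration only: the class names, the Version record, and both policies are assumptions, not part of the DeeDS design.

```python
from dataclasses import dataclass

@dataclass
class Version:
    value: float
    timestamp: int  # time of the local commit that produced this version

# --- Policies: chosen per data-item class by the database designer ---
def latest_update(versions):
    """Keep the most recently committed value."""
    return max(versions, key=lambda v: v.timestamp)

def average_value(versions):
    """Merge conflicting numeric updates by averaging them."""
    return Version(sum(v.value for v in versions) / len(versions),
                   max(v.timestamp for v in versions))

# --- Mechanism: fixed code that knows nothing about individual policies ---
POLICY = {"sensor_reading": average_value, "name_entry": latest_update}

def resolve(item_class, conflicting_versions):
    """Forward recovery: produce one merged version instead of undo/redo."""
    return POLICY[item_class](conflicting_versions)

# Two nodes committed different sensor readings locally; merge them.
print(resolve("sensor_reading", [Version(10.0, 5), Version(14.0, 7)]))
# -> Version(value=12.0, timestamp=7)
```

With this structure, supporting a new application area amounts to registering new policies; the resolution mechanism itself is unchanged.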

3.3.2 Our problem

The problem to be solved is to modify or enhance some protocol to make it bounded, i.e., to achieve bounded replication. Each part of the replication protocol needs to be deterministic and separate, to make the protocol more flexible. In this work, a replication protocol consists of a propagation protocol, a conflict detection algorithm, and a conflict resolution mechanism. In order to simplify the development, each part will be studied separately.

Any man-made system built for use over a long period will need maintenance [Pre92]. Maintaining a real-time database requires renewed testing and evaluation of both the correctness of new algorithms and their timing behaviour. An event-triggered system is less complex to extend than a time-triggered one, but problems remain: in a time-triggered system, the whole schedule must be recomputed and rerun, since every operation executes in lockstep. If the replication protocol is simple and flexible, extending it will be less problematic. There exist methods for bounded message delivery in a distributed real-time environment, e.g., CSMA-DCR and DOD/CSMA-CD [LR93]. An application may have different requirements on different kinds of data items, e.g., when using degrees of isolation in transactions (section 2.1.1).

3.3.3 Replication requirements by DeeDS applications

Today, DeeDS focuses on three different application areas, which place slightly different demands on the replication protocol. The differences consist in applications having read-only transactions, one writer and many readers, or multiple writers. These patterns differ in the type of conflict detection and conflict resolution needed. The application areas in more detail are:

Naming Service: A distributed naming service, such as Grapevine [BLNS82, SBN84], needs to be highly available and must thus ensure continued operation even if some site fails. Distributed real-time operating systems need a naming service to be able to grow over time and to be flexible; many real-time systems built today are not as flexible as needed. Video-on-demand is an example of a distributed real-time application which could benefit from a bounded naming service: if the server distributing a film crashes, a new server distributing the same film needs to be located quickly, within a bounded time.

Cheriton and Mann [CM89] discuss an architecture for a decentralized naming service with requirements that improve performance and fault tolerance. Another decentralized naming service, related to Grapevine, is the Clearinghouse [OD83]. Grapevine and Clearinghouse both use post-commit replication, i.e., they are based on eventual consistency, where conflicts often must be resolved manually. If manual intervention can be avoided, due to policy decisions made at design time, the flexibility of the naming service improves.

Manufacturing Cells (CIM): Computer-based manufacturing is a common solution today, where computers control different manufacturing processes and sample information during operation. The information sampled during operation in a certain cell is often wanted in other cells in the system; one such use is statistical analysis of the production for early detection of potential deviations from specifications. This is an area where the DeeDS prototype could be used to control and integrate all functions needed, by storing the data in the distributed database under hard real-time constraints and later replicating it to other parts of the system under less strict real-time constraints (e.g., with a later deadline or a softer time constraint).

Vehicle Control: Today, almost all parts of a modern vehicle are controlled by computers, but there is no simple connection between the different systems. In a joint project, several manufacturers have defined a bus, the controller area network (CAN), for communication between subsystems. DeeDS could be used as the interconnecting agent in such systems, where every subsystem has its own copy of the database and all communication between the subsystems goes through DeeDS (the CAN bus would be the communication medium that the DeeDS databases use for intercommunication). This would make DeeDS behave homogeneously, like a Multi-Database Management System (MDBMS), where each

site has its own defined transactions and only the data objects are replicated. (MDBMSs are out of the scope of this work; for an explanation, see [Kim94].) One of the problems faced in this thesis is to evaluate the different application areas and to specify which requirements they place on DeeDS, and on the replication protocol in particular.

Chapter 4

Evaluation of existing protocols

In order to reach the stated goals, we need a replication protocol which is predictable, with a bounded propagation time and a well-defined conflict detection and resolution algorithm. Bounded propagation is required in order to predict when replication of a data item has been performed at the other sites. In section 3.3, the characteristic features of DeeDS were stated. To support these features, the replication protocol must be flexible and simple; a flexible and simple replication protocol increases the likelihood of supporting the differing semantics of different application areas.

4.1 Definition of Replication Protocol

In the literature [HHB96, EN94, Elm92], most replication protocols combine networking principles and replication principles, i.e., the way of handling requests is incorporated in the replication protocol and, in the case of immediate replication, even in the concurrency control. The concurrency control in these systems ensures both mutual consistency among data items in the replicas and synchronization of operations on these data items. In

this thesis, a delayed replication protocol is defined as consisting of three parts: a propagation protocol, a conflict detection algorithm, and a conflict resolution mechanism. In order to limit the extent of this work, the treatment of networking is restricted to a discussion of network primitives and the properties required of the real-time network protocol. The design decisions taken for DeeDS impose constraints on the replication protocol, i.e., it has to be optimistic, since transactions should be controlled locally and must not depend on any communication network for synchronization of operations. When optimistic replication is performed, inconsistencies must be tolerated and conflicts need to be detected. The optimistic protocols regarded in this thesis do not include concurrency control enforcing strict serialization. There exist replication protocols which tolerate partitions while preserving serializability, but they limit transactions during partitions by only allowing read operations [HHB96]. If a conflict is detected, resolution needs to be performed. The optimistic protocols considered in this thesis all have their own way of performing conflict resolution. The resolution mechanism can be based on serializability, i.e., the mechanism undoes/redoes transactions in order to end up with a serializable execution history. Another approach is to resolve the conflicts by letting the users themselves correct the inconsistencies, which, for obvious reasons, is not suitable for a real-time system. The final approach described is to base the conflict resolution mechanism on transaction semantics, which minimizes the number of transactions that need to be undone and redone.

4.2 Evaluation

Our policy of allowing local commit of transactions and bounded propagation of the resulting changes can be viewed as a limited-duration partition failure. In the following section, some protocols based on optimistic replication, tolerating partition failures,

are described and evaluated according to the following criteria:

1. Algorithm complexity: One requirement stated in section 3.3.2 is that the replication algorithm must be simple and flexible. It is easier to validate a simple algorithm and hence easier to verify its predictability.

2. Completeness of the algorithm: For certain applications, it may be possible to use a simple algorithm which only detects a subset of all inconsistencies. This could lead to execution-time gains.

3. Support for conflict resolution: We want a replication protocol which supports forward recovery, to avoid unpredictable undo/redo. By this we mean that conflict resolution is based on the semantics of data items and transactions, instead of on correcting the execution history in order to obtain a serializable schedule.

Properties 1 and 3 are required since they are the main requirements listed in section 3.3.2. Property 2 must be fulfilled for a general replication protocol, since limiting conflict detection to, e.g., only write-write conflicts is a serious drawback for many applications. In the following sections, three different algorithms are described in more depth, starting with the version vector algorithm, followed by a description of the log transformation algorithm, and ending with the optimistic protocol. After the descriptions of the algorithms, the results of the evaluation are presented.

4.2.1 Version vector algorithm

The version vector algorithm was designed to be used in LOCUS, a homogeneous distributed operating system which gives special emphasis to network transparency and high availability [PR82]. Originally, the version vector algorithm was a technique

for detecting mutual inconsistency of multiple copies of a single file. The single-file algorithm (see section 2.2.1) is described in detail in [PPR81]. Whereas timestamps detect sufficient conditions for a conflict to exist, version vectors seek to provide necessary and sufficient conditions for a conflict. The main design goal when developing the version vector algorithm was to not delay any computation or work while the network was partitioned. Every file in the network has an origin point OP(f) associated with it, which is a unique identifier. A name conflict is detected when two or more files are found which have the same name but different origin points. A version conflict is detected if two or more files have the same origin point but incompatible version vectors. Parker et al. [PR82] define a file conflict to be either a name conflict or a version conflict. Name conflicts and version conflicts are two separate problems, detected by different methods. A version vector is associated with each file f in the network. Each site which stores the file has an entry in the vector, and each entry is a pair Si : Vi, where Si is a site identifier, unique in the network, and Vi is the corresponding version number. Every time the file f is updated at site Si, the version number Vi is incremented by one. When two partitions merge, the version vectors are compared, and if one partition's version vector dominates the other's, the dominating partition's version is distributed.

Definition 4.1 A version vector V dominates a version vector V′ if Vi ≥ V′i for every entry i = 1, …, n.

Example: Consider the partition graph G(f) for file f shown in Figure 2, taken from [PR82]. From the start, sites A, B and C have the same version vector for file f. Then the network is partitioned into partitions AB and C. During this partitioning, A updates f twice, giving the version vector ⟨A:2, B:0, C:0⟩. Later, B and C merge into a partition BC, where B's version is adopted without conflict, and C then updates f once, giving the version vector ⟨A:2, B:0, C:1⟩.

During this partitioning, A updates f once more, resulting in the version vector ⟨A:3, B:0, C:0⟩. At the final merge, a conflict is detected, since 3 > 2 but 0 < 1.

Figure 2: Partition graph for file f, showing a conflict. Starting from ABC with ⟨A:0, B:0, C:0⟩, the network splits into AB and C; A updates f twice (⟨A:2, B:0, C:0⟩). AB then splits into A and BC; A updates f once more (⟨A:3, B:0, C:0⟩), while in BC B's version is adopted without conflict and C updates f once (⟨A:2, B:0, C:1⟩). The final merge into ABC yields a conflict, since 3 > 2 and 0 = 0 but 0 < 1.
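The comparison in Definition 4.1 is straightforward to implement. The sketch below is our illustration (not code from [PR82]): it represents a version vector as a map from site identifier to version number, classifies a pair of vectors, and reproduces the conflict at the final merge of Figure 2.

```python
def compare(v1, v2):
    """Classify two version vectors: dominance, equality, or conflict."""
    sites = set(v1) | set(v2)
    greater = any(v1.get(s, 0) > v2.get(s, 0) for s in sites)
    less = any(v1.get(s, 0) < v2.get(s, 0) for s in sites)
    if greater and less:
        return "conflict"       # incompatible vectors: resolution needed
    if greater:
        return "v1 dominates"   # v1's copy can safely be distributed
    if less:
        return "v2 dominates"
    return "equal"

# Final merge of Figure 2: A holds <A:3,B:0,C:0>, BC holds <A:2,B:0,C:1>.
print(compare({"A": 3, "B": 0, "C": 0}, {"A": 2, "B": 0, "C": 1}))
# -> conflict, since 3 > 2 but 0 < 1
```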

Multi-file Inconsistency Detection

The original version vector algorithm only detected write-write conflicts on a single file. The need for detecting multi-file conflicts made Parker et al. [PR82] enhance the original algorithm to detect cases where multiple files are in conflict. In [PR82], the write-set of a transaction is defined to be a subset of the read-set, to avoid certain NP-hard problems. To be able to detect multi-file conflicts, the algorithm logs both the read-set and the write-set. By logging all version vectors associated with a transaction, all conflicts will be detected [PR82]; the difference from other weak consistency replication protocols is that version vectors detect serialization errors without restricting operations during network partitioning. Version vectors can easily be used for individual database items or, at a higher level, for classes of database objects. Since DeeDS is built to meet object store requirements, the rest of this thesis will discuss object-oriented databases, but the same reasoning should apply to relational databases as well. There are various ways to implement

version vectors. One could keep a version number on each object, but this would lead to an increasingly large database, where each object must be represented by an array with one entry per node, each entry containing an integer giving that node's version. This approach would not scale well, since the space needed for version information grows as O(n · m), where n is the number of sites and m is the number of objects in the database. In the multi-file version vector algorithm, each site instead records a log which consists of the sequences of version vectors involved in the operations on a file f, called the extent(f) [PR82].

Figure 3: Multi-file conflict of files f and g. Initially both sites hold f ⟨A:0, B:0⟩ and g ⟨A:0, B:0⟩; after the partition, site A holds f′ ⟨A:1, B:0⟩ and site B holds g′ ⟨A:0, B:1⟩.

Definition 4.2 If a file g ∈ extent(f), then extent(f) = extent(g).

Definition 4.3 A log filter is a group of sets S, S = {f1, …, fm}, where for any two sets S and T in the group, if S ≠ T then S ∩ T = ∅.

Example: Consider Figure 3, taken from [PR82]. Initially we have two files f and g with version vectors ⟨A:0, B:0⟩; then a partition separates the two sites. At site A, both f and g are read and then file f is written, generating the version vectors f = ⟨A:1, B:0⟩, g = ⟨A:0, B:0⟩. At site B, f and g are read and then file g is written, generating the version vectors g = ⟨A:0, B:1⟩, f = ⟨A:0, B:0⟩. If only the modified files' version vectors were compared, no conflict would be detected, but if both version vectors were logged in a log file (log filter), comparing both version vectors would detect the inconsistency.
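To illustrate why logging the whole extent matters, the sketch below compares the two sites' logs from Figure 3 file by file. This is our simplified illustration of the idea, not the algorithm of [PR82]: the divergence test used here merely checks whether the two transactions each hold the newer version of a different file in the same extent.

```python
def strictly_newer(v1, v2, sites):
    """True if v1 dominates v2 and differs from it (Definition 4.1)."""
    return (all(v1.get(s, 0) >= v2.get(s, 0) for s in sites) and
            any(v1.get(s, 0) > v2.get(s, 0) for s in sites))

def extent_conflict(log1, log2):
    """True if the two logged extents cannot be ordered consistently."""
    newer_1 = newer_2 = False
    for f in set(log1) | set(log2):
        v1, v2 = log1.get(f, {}), log2.get(f, {})
        sites = set(v1) | set(v2)
        if strictly_newer(v1, v2, sites):
            newer_1 = True    # site 1 holds the newer version of f
        elif strictly_newer(v2, v1, sites):
            newer_2 = True    # site 2 holds the newer version of f
        elif any(v1.get(s, 0) != v2.get(s, 0) for s in sites):
            return True       # per-file version conflict
    return newer_1 and newer_2  # updates diverged within one extent

# Figure 3: A read f and g, then wrote f; B read f and g, then wrote g.
log_A = {"f": {"A": 1, "B": 0}, "g": {"A": 0, "B": 0}}
log_B = {"f": {"A": 0, "B": 0}, "g": {"A": 0, "B": 1}}
print(extent_conflict(log_A, log_B))  # -> True
```

Comparing only the written files (f at A against f at B, or g against g) finds dominance in both cases and therefore no conflict; only the logged extent reveals the serialization error.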

4.2.2 Log transformation protocol

A semantic replication approach such as log transformation must support the standard database properties of atomicity, consistency, isolation and durability. Note that semantic approaches do not necessarily share the definition of consistency used by syntactic approaches: a semantic approach does not use serializability as the correctness criterion. A schedule, or execution history, is semantically consistent if it takes the database from one consistent state to another. Such a schedule can be the same as a serializable schedule, but need not be. A sensitive transaction is a transaction whose output, after commit, is revealed to the user. A sensitive transaction needs to be semantically consistent. Not all transactions seen by a user are sensitive, however: if the semantics of the transaction is such that the user tolerates inconsistent data (e.g., a non-serializable schedule), then the transaction is not sensitive. For instance, consider a transaction calculating the average of a large number of data items. If the execution must be serializable and follow the ACID properties, then the transaction would limit the concurrency in the system by locking all data items it intends to read. If the calculated average would not be drastically altered by the transaction reading an "old" value, then other transactions updating the data items could execute concurrently, thus increasing the availability of the database. When checking the correctness of semantically consistent schedules, we first look at the transactions. A transaction can be viewed as a set of operations grouped into steps, which must be stepwise serializable.
