
Royal Institute of Technology

Design of a Distributed Transactional Memory for Many-core systems

Vasileios Trigonakis

vasileios.trigonakis(@)epfl.ch

vtri(@)kth.se

9 September, 2011

A master thesis project conducted at


Sweden TRITA-ICT-EX-2011:220


Abstract

The emergence of Multi/Many-core systems signified an increasing need for parallel programming. Transactional Memory (TM) is a promising programming paradigm for creating concurrent applications. To date, the design of Distributed TM (DTM) tailored for non-coherent Many-core architectures is largely unexplored. This thesis addresses this topic by analysing, designing, and implementing a DTM system suitable for low-latency message-passing platforms. The resulting system, named SC-TM, the Single-Chip Cloud TM, is a fully decentralized and scalable DTM, implemented on Intel's SCC processor; a 48-core 'concept vehicle' created by Intel Labs as a platform for Many-core software research. SC-TM is one of the first fully decentralized DTMs that guarantees starvation-freedom and the first to use an actual pluggable Contention Manager (CM) to ensure liveness. Finally, this thesis introduces three completely decentralized CMs: Offset-Greedy, a decentralized version of Greedy; Wholly, which relies on the number of completed transactions; and FairCM, which makes use of the effective transactional time. The evaluation showed that the latter outperformed the other two.


Acknowledgements

I would like to give special thanks to...

Vincent Gramoli, Post-Doc Fellow at EPFL

for being an always helpful and interested advisor

Rachid Guerraoui, Professor at EPFL

for giving me the opportunity to conduct this thesis

Seif Haridi, Professor at KTH

for being my examiner and the professor who made me interested in Distributed Systems

My family

for supporting me throughout the many years of my studies


Contents

1 Introduction
1.1 Motivation
1.2 Contributions
1.3 Structure of the Document

2 Background & Related Work
2.1 Background
2.1.1 Hardware, Software, or Hybrid TM
2.1.2 Conflict
2.1.3 Irrecoverable Actions
2.1.4 Interactions with non-transactional code
2.1.5 Data Versioning
2.1.6 Conflict Detection
2.1.7 Conflict Detection Granularity
2.1.8 Static or Dynamic
2.1.9 Lock-based or Non-blocking
2.1.10 Contention Management
2.1.11 Transaction Nesting
2.1.12 Liveness Guarantees
2.1.13 Safety Guarantees
2.2 Software Transactional Memory
2.2.1 Software transactional memory for dynamic-sized data structures
2.2.2 On the correctness of transactional memory
2.2.3 McRT-STM: A High Performance Software Transactional Memory System for a Multi-Core Runtime
2.2.4 Transactional memory
2.2.5 Elastic Transactions
2.3 Contention Management
2.3.1 Contention Management in Dynamic Software Transactional Memory
2.3.2 Advanced contention management for dynamic software transactional memory
2.3.3 Toward a theory of transactional contention managers
2.3.4 Transactional Contention Management as a Non-Clairvoyant Scheduling Problem
2.4 Distributed Software Transactional Memory
2.4.1 Distributed Multi-Versioning (DMV)
2.4.2 Sinfonia
2.4.3 Cluster-TM
2.4.4 DiSTM
2.4.5 DSM
2.4.6 D2STM
2.4.7 On the Design of Contention Managers and Cache-Coherence Protocols for Distributed Transactional Memory
2.4.8 FTDMT
2.4.9 D-TL2
2.5 Cache Coherence Protocols for Distributed Transactional Memory
2.5.1 Ballistic
2.5.2 Relay
2.5.3 COMBINE
2.6 Cache Coherence Protocols for Shared Memory
2.6.1 An evaluation of directory schemes for cache coherence
2.6.2 Directory-Based Cache Coherence in Large-Scale Multiprocessors
2.6.3 The directory-based cache coherence protocol for the DASH multiprocessor
2.6.4 Software cache coherence for large scale multiprocessors

3 SC-TM, the Single-chip Cloud TM
3.1 Introduction
3.2 System Model
3.3 System Design
3.3.1 Application part
3.3.2 DTM part
    TX Interface
    DS-Lock
    Contention Manager
    Object Locating
3.4 Transactional Operations
3.4.1 Transactional Read
3.4.2 Transactional Write
3.4.3 Transaction Start
3.4.4 Transaction Commit
3.4.5 Transaction Abort
3.5 Contention Management
3.5.1 Back-off and Retry
3.5.2 Offset-Greedy
    Steps
    Accuracy
3.5.3 Wholly
3.5.4 FairCM
3.6 Elastic Model
3.6.1 Elastic Model Implementation
3.7 Target Platform
3.7.1 Single-Chip Cloud Computer (SCC)
3.7.2 SCC Hardware Overview
3.7.3 SCC Memory Hierarchy
    Private DRAM
    Shared DRAM
    Message Passing Buffer (MPB)
3.7.4 SCC Programmability
    RCCE Library
    iRCCE Library
3.8 Implementation
3.8.1 Multitasking
    POSIX threads
    Libtask
3.8.2 Dedicated Cores
3.9 SCC-Related Problems
3.9.1 Programming model
3.9.2 Messaging
    Blocking
    Deterministic
    Unreliable

4 SC-TM Evaluation
4.1 Introduction
4.1.1 SCC Settings
4.2 Multitasking vs. Dedicated DS-Lock Service
4.3 Linked-list Benchmark
4.4 Hashtable Benchmark
4.5 Bank Benchmark

5 Conclusions
5.1 Summary
5.2 Future work
5.2.1 Write-lock Batching
5.2.2 Asynchronous Read Locking
5.2.3 Eager Write-lock Acquisition
5.2.4 Profiling & Refactoring
5.2.5 Applications & Benchmarks

List of Figures

2.1 State diagram of the life-cycle of a transaction.
2.2 Possible problematic case for a Software Transactional Memory (STM) system.
3.1 Abstract architecture of the SC-TM system.
3.2 Pseudo-code for the Read-lock acquire (dsl_read_lock) operation.
3.3 Pseudo-code for the Read-lock release (dsl_read_lock_release) operation.
3.4 Pseudo-code for the Write-lock acquire (dsl_write_lock) operation.
3.5 Pseudo-code for the Write-lock release (dsl_write_lock_release) operation.
3.6 Pseudo-code for the Transactional Read (txread) operation.
3.7 Pseudo-code for the Transactional Write (txwrite) operation.
3.8 Pseudo-code for the Transaction Start (txstart) operation.
3.9 Pseudo-code for the Transaction Commit (txcommit) operation.
3.10 Pseudo-code for the Transaction Abort (txabort) operation.
3.11 Offset-based timestamps calculation.
3.12 Offset-Greedy: contradicting views of timestamps for two transactions.
3.13 Single-Chip Cloud Computer (SCC) processor layout. Adapted from [Hel10], page 8.
3.14 SCC memory spaces. Adapted from [Hel10], page 52.
3.15 Round-trip latency for a 32-byte message on the SCC.
3.16 RCCE Application Programming Interface (API): core utilities.
3.17 RCCE API: memory management functions.
3.18 RCCE API: communication.
3.19 RCCE API: synchronization.
3.20 RCCE API: power management.
3.21 The allocation of the Application and DS-Lock service parts on the SCC's 48 cores.
3.22 Activity diagram of the multitasking between the Application and the DS-Lock service on a single core.
3.23 Multitasking: an example where the scheduling of Core m affects the execution of Core n.
3.24 The allocation of the Application and the dedicated DS-Lock service on the 48 cores of the SCC.
3.25 Implementing data exchange between cores 0 and 1 using RCCE.
3.26 The RCCE send and receive operations interface.
4.1 Available performance settings for Intel's SCC processor.
4.2 Throughput of read-only transactions for the multitask-based and the dedicated DS-Lock versions of the Single-chip Cloud TM (SC-TM).
4.3 Latency of read-only transactions for the multitask-based and the dedicated DS-Lock versions of SC-TM.
4.4 Throughput of the linked-list running only contains operations in sequential mode.
4.5 Throughput of the linked-list for normal and elastic-early transactions.
4.6 Commit rate of normal and elastic-early transactions on the linked-list micro-benchmark.
4.7 Throughput of the sequential and transactional (elastic-read) linked-list versions under different list sizes.
4.8 Ratio of transactional (elastic-read) throughput compared to the sequential under different list sizes.
4.9 Throughput of the sequential and transactional versions on the Hashtable benchmark under different load factor values.
4.10 Ratio of transactional performance compared to the sequential on the Hashtable benchmark under different load factor values.
4.11 Throughput of normal and elastic-read versions on the Hashtable benchmark under load factor 4.
4.12 Ratio of throughput for normal and elastic-read versions on the Hashtable benchmark compared to the throughput of 2 cores.
4.13 Throughput of SC-TM running the Bank benchmark with different Contention Managers.

1 Introduction

Over the last years, there has been a shift of hardware architectural design towards Multi-core systems. Multi-cores consist of two or more processing units and are nowadays the de facto processors for almost every computer system. In order to take full advantage of these systems, parallel/concurrent programming is a necessity [LS08]. One of the most difficult and error-prone tasks in concurrent programming is the synchronization of shared memory accesses. One has to be careful, or else problems such as data races and deadlocks may appear.

Transactional Memory (TM) emerged as a promising solution to the aforementioned problem [HM93]. TM allows the programmer to define a sequence of commands, called a transaction, which will be executed atomically with respect to the shared memory. TM seamlessly handles the synchronization between concurrent transactions. Programming using transactions is simpler and easier to debug than using lower-level abstractions, such as locks, because it resembles the sequential way of programming.

On a Distributed System (DS) the synchronization problem is aggravated. The asynchronous nature of such systems, combined with the limited debugging and monitoring facilities, makes programming on such platforms a cumbersome process. Distributed Transactional Memory (DTM) systems aim to provide the programming abstraction of transactions on DSs. The benefits of Transactional Memory, combined with the difficulties of distributed programming, position Distributed Transactional Memory as a very appealing programming approach for such platforms.

Although present hardware architectures typically incorporate efficient Cache-Coherent (CC) shared memory, this may not be the case for future general purpose systems. Contemporary Many-core processors consist of up to 100 cores, but they are soon expected to scale up to 1000 cores. In such systems, providing full hardware cache-coherency may be unaffordable in terms of memory and time costs; consequently, it is probable that Many-cores will have limited, or no, support for hardware cache-coherency [BBD+09]. These systems will rely on Message Passing (MP) and coherency will be handled in software.

Accordingly, Distributed Transactional Memory on top of Message Passing is a very promising programming model. The main goal of this thesis was the design and implementation of a Distributed Transactional Memory system suitable for Many-core architectures. The resulting DTM system is called SC-TM, the Single-chip Cloud TM.


1.1 Motivation

As Chapter 2 of this report shows, there is extensive work on the design of DTM systems [HS05, MMA06, AMVK07, BAC08, RCR08, KAJ+08, ZR09b, DD09, ZR09a, CRCR09, AGM10, LDT+10, KLA+10]. Although all the solutions aim to provide the Transactional Memory abstraction, they differ in three major points:

i. they optimize for different workloads,
ii. they provide different guarantees,
iii. and/or they target different platforms.

While points (i) and (ii) are specific to TM, point (iii) is generic to Distributed Systems. For example, a system designed for a large-scale cluster may be significantly different from one targeting a Many-core, mainly because of the different underlying system properties. Many-core systems emerged only recently, so, to my knowledge, this is the first work that tries to tailor a DTM algorithm specifically for Many-core processors.

Current Many-core processors consist of less than one hundred cores, but in the near future the number is expected to increase up to one thousand. Therefore, one of the most wanted characteristics for the SC-TM system was scalability. In order to achieve scalability, a fully decentralized solution has to be applied. Many of the existing solutions introduce a centralization point in order to accomplish some other wanted characteristics. SC-TM is fully decentralized and, as the evaluation (Chapter 4) shows, scales particularly well.

As I already mentioned, another differentiation point among the solutions is the safety and liveness guarantees they provide. The safety guarantee is almost universal in all the systems: opacity [GK08]. Deadlock-freedom is the most common liveness guarantee used by the existing DTMs. The goal for SC-TM is to provide a stronger guarantee: starvation-freedom. A starvation-free system is protected from both deadlocks and live-locks.

In the STM world, starvation-freedom can be "easily" achieved by using a CM [HLM03]. On the other hand, in a decentralized DTM, contention management is not trivial. The lack of a module which has a global view of the system makes contention management difficult. Moreover, the design of most DTMs is not suitable for contention management. For the aforementioned reasons, none of the existing solutions employs an actual CM. SC-TM relies on contention management for providing lock-freedom. Different contention management policies can easily be applied, since the CM is a separate module of the system.


1.2 Contributions

Firstly, this thesis presents the first DTM system specifically designed for non-coherent Many-core architectures. The pre-existing DTM solutions mainly target Distributed Systems, such as clusters and Local Area Networks (LANs).

Secondly, SC-TM is the first DTM system that uses an actual Contention Manager in order to provide the wanted liveness guarantees. Three different CMs were developed to be used with SC-TM. To our knowledge, SC-TM is one of the first systems that is both fully decentralized and guarantees starvation-freedom. The practical evaluation revealed that strong liveness guarantees can be essential under certain workloads.

Finally, the experience gained while implementing and tuning the algorithm on the SCC processor can be seen as an extensive study of the programmability of a truly Message Passing Many-core system.

1.3 Structure of the Document

The remainder of this document is structured as follows:

• Chapter 2 describes the background of TM and some work related to SC-TM.

• Chapter 3 presents the DTM Algorithm, the SC-TM system design, the target platform, and some important implementation decisions.

• Chapter 4 evaluates the SC-TM system.


2 Background & Related Work

This chapter consists of two main parts: the Background and the Related Work. The former (Section 2.1) aims to familiarize the reader with TM, while the latter (Sections 2.2 to 2.6) presents some work that influenced, and is closely related to, this thesis.

Section 2.2 presents some of the most influential research done in the area of Software Transactional Memories. The theory of STMs is the foundation behind building Distributed Transactional Memory systems. Section 2.3 discusses Contention Managers, a module that can be used to guarantee the progressiveness of an STM system. Section 2.4 introduces several existing DTM systems developed over the last years. Section 2.5 describes three Cache-Coherence protocols that were designed to be used in the context of DTM. Finally, Section 2.6 presents some work on Shared Memory Cache-Coherence protocols, particularly solutions that are Directory-based. Some of these protocols employ an approach to coherency quite similar to the one used in SC-TM.

2.1 Background

Concurrent programming is essential for increasing the performance of a single application on modern processor architectures. A concurrent application consists of more than one thread of execution which run simultaneously. The parallel execution threads may share some memory objects. Accesses to these objects have to be synchronized in order to avoid problems such as data races. The most typical solution to the aforementioned problem is using low-level mechanisms, such as locks.

Lock programming is a cumbersome and error-prone process. Moreover, ensuring the correctness of a lock-based application is rather difficult. Apart from that, lock-based synchronization is prone to several problems:

• deadlocks: two or more threads are each waiting for the other to release a lock and thus neither proceeds.

• priority inversion: a higher priority thread needs a lock which is held by a lower priority thread, thus is blocked.

• preemption: a thread may be preempted while holding some locks, thereby "spending" valuable resources.

• lock convoying: a lock can be acquired by only one of the threads contending for it. Upon acquisition failure, the remaining threads perform an explicit context switch, leading to underutilization of scheduling quotas and thus to overall performance degradation.


Figure 2.1: State diagram of the life-cycle of a transaction.

A transaction is a sequence of commands that is executed atomically. The purpose of a transaction is thus similar to that of a critical section. However, unlike critical sections, transactions can abort, in which case all their operations are rolled back and are never visible to other transactions.

Also, transactions only appear as if they executed sequentially. TM is free to run them concurrently, as long as the illusion of atomicity is preserved. Using a TM is, in principle, very easy: the programmer simply converts those blocks of code that should be executed atomically into transactions [GK10]. Finally, the transactions should operate in isolation relative to the other transactions; no other thread should observe their writes before commit.
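To make the abstraction more concrete, the sketch below shows what an atomic bank transfer could look like with a word-based transactional interface. The TX_START/TX_LOAD/TX_STORE/TX_COMMIT names follow the style of the code in Figure 2.2, but they are illustrative assumptions, not the actual SC-TM API; to keep the sketch self-contained and runnable, the "transactions" are emulated with a single global lock, whereas a real TM executes them speculatively and aborts/retries on conflicts.

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

/* Minimal, illustrative TM interface. Implemented here with one global
 * lock only so that the sketch compiles and runs; a real STM executes
 * transactions speculatively and aborts/retries them on conflicts. */
static pthread_mutex_t tm_lock = PTHREAD_MUTEX_INITIALIZER;

static void TX_START(void)  { pthread_mutex_lock(&tm_lock); }
static void TX_COMMIT(void) { pthread_mutex_unlock(&tm_lock); }
static intptr_t TX_LOAD(void *addr)            { return *(intptr_t *) addr; }
static void TX_STORE(void *addr, intptr_t val) { *(intptr_t *) addr = val; }

/* Atomic transfer between two accounts: with a real TM, either both
 * stores become visible at commit or the transaction aborts. */
static void transfer(intptr_t *from, intptr_t *to, intptr_t amount)
{
    TX_START();
    intptr_t f = TX_LOAD(from);
    intptr_t t = TX_LOAD(to);
    TX_STORE(from, f - amount);
    TX_STORE(to,   t + amount);
    TX_COMMIT();
}

int main(void)
{
    intptr_t a = 100, b = 0;
    transfer(&a, &b, 30);
    printf("a=%ld b=%ld\n", (long) a, (long) b);
    return 0;
}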

Figure 2.1 depicts the state-chart diagram of the life-cycle of a transaction. A transaction may be aborted for one of the following reasons:

• A transactional operation cannot be completed due to a conflict.


If a CM is used, the CONFLICT transition in the diagram means that there was a conflict and the CM decided to abort the current transaction. The NO_CONFLICT transition has two different meanings: either there was no conflict, or there was a conflict and the CM decided to abort the enemy transactions. The following subsections present some aspects of TM systems that are necessary in order to understand how such a system operates.

2.1.1 Hardware, Software, or Hybrid TM

TM was initially proposed as a solution [HM93] implemented on hardware, but soon expanded to software [ST97]. Moreover, hybrid solutions exist, which are based on software/hardware co-design [DMF+06, RHL05]. In the present work, I consider only Software TM and thus the following sections basically refer to STM systems.

2.1.2 Conflict

Two or more live transactions conflict on a memory object M if one of the following happens:

• One has written to M and the other tries to read it (Read After Write (RAW) conflict).

• One has written to M and the other tries to write it (Write After Write (WAW) conflict).

• One or more have read from M and another tries to write it (Write After Read (WAR) conflict).

Every conflict has to be resolved in order for the STM to keep its semantics. The resolution is performed by aborting one or more of the involved transactions.

2.1.3 Irrecoverable Actions

TM has some difficulties when it comes to irrecoverable actions such as input, output, and non-catchable exceptions. For example, if a transaction prints something in the standard output, then it is not acceptable to be aborted and restarted because the output cannot be reverted.

2.1.4 Interactions with non-transactional code

There are two basic alternatives on how transactionally used memory objects should be accessed by non-transactional code.

Weak atomicity. Transactions are serializable against other transactions, but the system provides no guarantees about interactions with non-transactional code. In other words, memory objects used by transactions should not be accessed by non-transactional code.


Strong atomicity. Transactions are serializable against all memory accesses. One can consider non-transactional loads and stores as single instruction transactions.

Most STM systems provide weak atomicity, because strong atomicity is very "expensive" to guarantee; all memory accesses need to be intercepted.

2.1.5 Data Versioning

Data versioning determines how uncommitted and committed values of memory objects are managed. Three basic approaches are used.

Eager versioning. The updates of the memory objects are immediately written in the shared memory. In order to be able to revert to the old values in case of abort, the transaction keeps an undo-log with the initial values of the objects written.

Lazy versioning. The updates are not persisted in the shared memory, but buffered in a write-buffer. Upon commit, the actual memory locations are updated.
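As a rough illustration of the difference between the first two approaches, the sketch below contrasts an eager (undo-log) and a lazy (write-buffer) transactional store in C. The data structures and function names are hypothetical and heavily simplified (no locking or conflict detection), meant only to show where the old and new values live in each scheme.

#include <stddef.h>

#define MAX_ENTRIES 64

/* Eager versioning: write directly to shared memory, remember the old
 * value in an undo-log so an abort can roll it back. */
typedef struct { long *addr; long old_val; } undo_entry_t;

typedef struct {
    undo_entry_t undo[MAX_ENTRIES];
    size_t       n_undo;
} eager_tx_t;

static void eager_store(eager_tx_t *tx, long *addr, long val)
{
    tx->undo[tx->n_undo++] = (undo_entry_t){ addr, *addr }; /* save old value */
    *addr = val;                                            /* update in place */
}

static void eager_abort(eager_tx_t *tx)
{
    while (tx->n_undo > 0) {            /* restore old values in reverse order */
        undo_entry_t e = tx->undo[--tx->n_undo];
        *e.addr = e.old_val;
    }
}

/* Lazy versioning: buffer writes locally and publish them only at commit. */
typedef struct { long *addr; long new_val; } wb_entry_t;

typedef struct {
    wb_entry_t wbuf[MAX_ENTRIES];
    size_t     n_wbuf;
} lazy_tx_t;

static void lazy_store(lazy_tx_t *tx, long *addr, long val)
{
    tx->wbuf[tx->n_wbuf++] = (wb_entry_t){ addr, val };  /* no shared write yet */
}

static void lazy_commit(lazy_tx_t *tx)
{
    for (size_t i = 0; i < tx->n_wbuf; i++)   /* publish the buffered writes */
        *tx->wbuf[i].addr = tx->wbuf[i].new_val;
}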

Multi-versioning. The system keeps multiple versions of the same memory object in order to allow read operations to be able to access the "correct" snapshot of the shared memory. For example, assume the following history of transactional events.

r_i(x) → ... → w_j(y) → ... → C_j → r_i(y)

where r_i(x) / w_i(x) means that transaction i reads/writes memory object x, and C_i that it commits.

It should be obvious that there is a RAW conflict on object y. Normally, transaction i would not be able to commit, because it would violate the real-time order. Multi-versioning solves this problem by keeping the necessary older memory object versions. So, in our example, transaction i would read the version of y prior to j updating it.

2.1.6 Conflict Detection

This aspect determines how (and when) conflicts are detected.

Pessimistic detection. Check for conflicts during transactional loads and stores. Also called encounter or eager conflict detection.

Optimistic detection. Detect the conflicts when the transaction tries to commit. Also called commit or lazy conflict detection.


Combination. An STM can apply different policies for reads and writes. A typical example is optimistic reads with pessimistic writes.
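The following sketch illustrates where a conflict can be noticed in each case, assuming per-object version numbers, invisible reads recorded in a read-set, and eager locking for writes. All names and structures are illustrative assumptions, not the design of any particular STM (in a real implementation the lock acquisition would also have to be atomic).

#include <stdbool.h>
#include <stddef.h>

#define MAX_READS 64

/* Per-object metadata: a version counter and a simple lock flag
 * (hypothetical layout, for illustration only). */
typedef struct { unsigned version; bool locked; long value; } object_t;

typedef struct {
    object_t *read_objs[MAX_READS];   /* read-set */
    unsigned  read_vers[MAX_READS];   /* versions observed at read time */
    size_t    n_reads;
} tx_t;

/* Optimistic (lazy) detection for reads: just record what was read. */
static long tx_read(tx_t *tx, object_t *o)
{
    tx->read_objs[tx->n_reads] = o;
    tx->read_vers[tx->n_reads] = o->version;
    tx->n_reads++;
    return o->value;
}

/* Pessimistic (eager) detection for writes: the conflict is detected at
 * the store itself, when the lock cannot be acquired. */
static bool tx_write(tx_t *tx, object_t *o, long v)
{
    (void) tx;
    if (o->locked)            /* conflict detected at encounter time */
        return false;         /* caller aborts (or consults the CM) */
    o->locked = true;
    o->value = v;
    return true;
}

/* At commit the read-set is validated: if any object changed version,
 * the conflict is detected only now (lazy detection). */
static bool tx_validate(tx_t *tx)
{
    for (size_t i = 0; i < tx->n_reads; i++)
        if (tx->read_objs[i]->version != tx->read_vers[i])
            return false;
    return true;
}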

2.1.7 Conflict Detection Granularity

Similar to lock granularity, conflict detection granularity defines the minimum size of memory that can be transactionally acquired through a TM system. Even if a transaction loads or stores a smaller part of the memory, it will be considered as loading this minimum conflict detection unit. So, if the TM supports word granularity and two transactions simultaneously write to different bytes of the same word, a conflict is detected. A conflict which is only detected due to the granularity of conflict detection and is not an actual conflict is called a false conflict.

Object. The conflict detection is done with memory object granularity.

Word. The conflict detection is done with single word granularity. In typical processor architectures a word is 4 or 8 bytes.

Cache line. The conflict detection is done with one cache line granularity.

2.1.8 Static or Dynamic

If all transactional memory accesses should be statically predefined, then the STM is called static. If the system handles the memory accesses dynamically, the STM is called dynamic. Almost all STM systems are dynamic.

2.1.9 Lock-based or Non-blocking

There are two major STM implementation approaches: lock-based and non-blocking schemes. A lock-based STM internally uses a blocking locking mechanism to implement the transactional semantics. On the other hand, a non-blocking STM relies on a non-blocking algorithm, such as versioning.

2.1.10 Contention Management

Resolving conflicts is achieved by aborting one or more of the conflicting transactions in order to get a serializable execution. A CM is the module which, by implementing a contention management policy, decides which transaction should be aborted. For example, one such policy could be to abort all others but the oldest transaction.
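Since SC-TM later relies on a pluggable CM, the following minimal sketch shows what such a module essentially boils down to: a decision function that is invoked on every conflict. The interface and the example oldest-transaction-wins policy are illustrative assumptions, not the actual SC-TM code.

#include <stdint.h>

/* The per-transaction state a contention manager typically looks at
 * (hypothetical fields, for illustration). */
typedef struct {
    uint64_t start_time;   /* timestamp taken at transaction start */
    int      id;
} tx_desc_t;

typedef enum { ABORT_SELF, ABORT_ENEMY } cm_decision_t;

/* A contention management policy is just a function of the two
 * conflicting transactions; different policies can be plugged in. */
typedef cm_decision_t (*cm_policy_t)(const tx_desc_t *self,
                                     const tx_desc_t *enemy);

/* Example policy: "abort all others but the oldest transaction". */
static cm_decision_t cm_oldest_wins(const tx_desc_t *self,
                                    const tx_desc_t *enemy)
{
    return (self->start_time < enemy->start_time) ? ABORT_ENEMY : ABORT_SELF;
}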


2.1.11 Transaction Nesting

A transaction that includes other transactions in its body is called a nested transaction. STM systems use nested transactions for achieving composability; creating a new transactional operation by using two or more transactional operations within a nested transaction.

A typical and naive approach to composability is called flat-nesting; the whole code between the outermost transactional start and end operations is considered as one "flat" transaction.

2.1.12 Liveness Guarantees

Every STM system should provide some liveness guarantees which make certain that the system progresses.

Wait-freedom. Wait-freedom [Her88] guarantees that all threads contending for a memory object eventually make progress.

Starvation-freedom. Starvation-freedom or lock-freedom [Fra03] guarantees that at least one of the contending threads eventually progresses.

Obstruction-freedom. Obstruction-freedom [HLM03] guarantees that if a thread does not face contention, it will eventually make progress.

Obviously, Wait-freedom ⊇ Starvation-freedom ⊇ Obstruction-freedom, therefore wait-freedom is the strongest guarantee.

2.1.13 Safety Guarantees

Several correctness criteria have been proposed for TM, most of them taken from other fields such as Databases. A short description of these criteria is given later in Subsection 2.2.2. STM transactions have the following peculiarity compared to classic database transactions: even live transactions are not allowed to access an inconsistent state of the shared memory. In databases, a live transaction that accesses an inconsistent state will simply be aborted, but in an STM system the irrecoverability of some events may cause a problem.

1 int x = (int) TX_LOAD(address1);
2 int y = (int) TX_LOAD(address2);
3 int z = 1 / (y - x); /* if (x == y) { runtime exception } */

Figure 2.2: Possible problematic case for an STM system if a transaction accesses an inconsistent state.

Figure 2.2 illustrates such a problematic case. Assume that, because of the semantics of the application, x ≠ y; therefore the application programmer does not perform an equality check in line 3. If the transaction accesses an inconsistent state, it may be that x = y, and thus line 3 will throw a division-by-zero runtime exception which cannot be handled and will make the application hang.

This intuition explains the need for a stricter safety guarantee for STM; opacity [GK08]. Informally, opacity is a safety property that captures the intuitive requirements that:

1. all operations performed by every committed transaction appear as if they happened at some single, indivisible point during the transaction lifetime,

2. no operation performed by any aborted transaction is ever visible to other transactions (including live ones),

3. and every transaction always observes a consistent state of the system.


2.2 Software Transactional Memory

This section presents relevant work on STM.

2.2.1 Software transactional memory for dynamic-sized data structures [HLMS03]

This paper was the first to introduce an STM system with support for dynamic-sized data structures. Prior to this, TM systems required the programmer to statically declare the memory locations that a transaction would use. The solution presented, called DSTM, guarantees obstruction-freedom and was the first system to use a contention manager to resolve conflicts and guarantee progressiveness. Although DSTM provides linearisability of transactions, the authors recognized that it is not a strong enough guarantee for a TM and introduced the problem of the consistency of aborted transactions. DSTM works with multiversioning of both read and write objects and solves the consistency problem by validating the transaction on every object acquisition. Moreover, DSTM provides an explicit release method for read-objects, which aims to increase the concurrency in certain applications.

Finally, a simple correctness criterion for contention managers is presented: a CM should guarantee that eventually every transaction is granted the right to abort another conflicting transaction. Two novel contention managers were suggested: aggressive and polite. Aggressive simply grants every transaction the permission to preempt a conflicting one, while polite uses a controlled back-off (similar to the TCP congestion algorithm) so as to give the live transaction the opportunity to commit.

2.2.2 On the correctness of transactional memory [GK08]

This paper introduces some formal guarantees that a Transactional Memory (should) provide. Specifically, the authors introduce opacity, a correctness criterion for TMs, which is the de facto safety property the majority of TM systems have adopted. Opacity is an extension of the classic database serializability property. In a TM system, there is the need to state whether an execution with more than one transaction executing in parallel "looks like" a sequential one. The major difference between a memory transaction (a transaction executed by a TM) and a database transaction is that in the former, unlike the latter, a live transaction (one that is neither committed nor aborted) should not access an inconsistent state, even if it will later be aborted.

The most prominent consistency criteria are the following:

• Linearisability. In the TM terminology, linearisability means that, intuitively, every transaction should appear as if it took place at some single, unique point in time during its lifespan.

• Serializability. A history H of transactions (i.e., the sequence of operations performed by all transactions in a given execution) is serializable if all committed transactions in H issue the same operations and receive the same responses as in some sequential history S that consists only of the transactions committed in H.


• 1-Copy Serializability. 1-copy serializability [BG83] is similar to serializability, but allows for multiple versions of any shared object, while giving the user an illusion that, at any given time, only one copy of each shared object is accessible to transactions.

• Global Atomicity. Global atomicity [Wei89] is a general form of serializability that (a) is not restricted only to read-write objects, and (b) does not preclude several versions of the same shared object.

• Recoverability. Recoverability [Had88] puts restrictions on the state accessed by every transaction, including a live one. In its strongest form, recoverability requires, intuitively, that if a transaction T_i updates a shared object x, then no other transaction can perform an operation on x until T_i commits or aborts.

• Rigorous Scheduling. A correctness criterion precluding any two transactions from concurrently accessing an object if one of them updates that object. Restricted to read-write objects (registers), this resembles the notion of rigorous scheduling [BGRS91] in database systems.

After motivating why the aforementioned criteria are not suitable for TMs, the authors formally introduce opacity. Informally, opacity is a safety property that captures the intuitive requirements that:

1. all operations performed by every committed transaction appear as if they happened at some single, indivisible point during the transaction lifetime,

2. no operation performed by any aborted transaction is ever visible to other transactions (including live ones),

3. and every transaction always observes a consistent state of the system.

Then, they provide some characterizations of a TM implementation I:

• Progressive. I is progressive if it forcefully aborts a transaction T_i only when there is a time t at which T_i conflicts with another, concurrent transaction T_k that is not committed or aborted by time t (i.e., T_k is live at t); we say that two transactions conflict if they access some common shared object.

• Single-version. I is single-version if it stores only the latest committed state of any given shared object in base shared objects (as opposed to multi-version TM implementations).

• Invisible reads. I uses invisible reads if no base shared object is modified when a transaction performs a read-only operation on a shared object.


2.2.3 McRT-STM: A High Performance Software Transactional Memory System for a Multi-Core Runtime [SATH+06]

This paper presents the McRT-STM system for the C and C++ programming languages. McRT-STM uses cache-line conflict resolution for big objects and object-level granularity for smaller ones. It also supports nested transactions with partial aborts. The authors implemented and evaluated McRT-STM with several different design choices. They concluded that:

• The lock-based approach is faster than the non-blocking one, because the latter incurs a bigger overhead and more aborts. They proposed that deadlock avoidance can be achieved with the use of timeouts. Though, since McRT-STM does not have any "clever" contention management scheme, both live-lock and starvation are possible.

• On the specifics of locking, the evaluation showed that read-versioning/write-locking is better than read/write-locking with undo-logging. Their explanation is based on the cache problems that visible reads introduce.

• Undo-logging outperforms write-buffering, because of the overhead of searching the write-buffers in every read operation.

2.2.4 Transactional memory [Gra10]

This paper does an extensive review of the TM systems up to late 2009. It presents Software, Hardware, and Hybrid TM systems. After introducing the theory behind TM and the specific TM systems, the author makes the following conclusions about the trends in TM design:

• Most recent Hardware Transactional Memory systems favour eager conflict detection over lazy.

• Eager data version management seems to attract more focus in Hardware Transactional Memory systems than lazy version management does.

• A majority of STM systems favour optimistic concurrency control over pessimistic.

• Early STM systems usually employ non-blocking synchronization, while more recent proposals usually employ blocking synchronization.

• Recent STMs usually have a more flexible approach regarding concurrency control, conflict detec-tion, and conflict resolution than older ones.

2.2.5 Elastic Transactions [FGG09]


Then, they propose ε-STM as an implementation supporting elastic transactions. ε-STM uses timestamps, two-phase locking, atomic primitives (compare-and-swap and fetch-and-increment), atomic loads and stores, and is built on top of the Tiny-STM system [FFR08]. Finally, ε-STM is elastic-opaque.


2.3 Contention Management

This section presents some important work on Contention Managers.

2.3.1 Contention Management in Dynamic Software Transactional Memory [SS04]

This paper presents a plethora of contention managers and benchmarks them on top of the DSTM [HLMS03] system. According to the authors, a contention manager should provide two guarantees: non-blocking operations and always eventually aborting a transaction (in order to provide obstruction freedom).

In the following, the enemy is the transaction that holds some resource needed by the current transaction. The various contention management algorithms evaluated are:

• Aggressive. Always abort the enemy.

• Polite. Back-off an exponentially increasing amount of time in order to give the enemy the possibility to complete. If after n ≥ 1 back-offs the enemy still holds the required resource, abort it.

• Randomized. Throw a (biased) coin to decide if the enemy should be aborted, or wait for a random interval (limited to an upper value).

• Karma. Karma tries to use an estimation of the amount of resources already used by the enemies in order to select which transaction to abort. Whenever a transaction commits, the thread's karma is set to 0. Then, when a transaction opens some resource it collects karma (≡ priority). Upon conflict, if the enemy has lower priority it is aborted; else the transaction waits for a fixed interval until either the enemy completes, or the sum of its karma with the number of retries is greater than the enemy's karma, in which case the transaction aborts the enemy. Whenever a transaction is aborted, it keeps the gathered karma so it will have greater chances to complete next time. Finally, every transaction gains one point upon retry, so that short transactions will eventually gather enough karma to commit (a measure for avoiding live-lock; a sketch of this priority comparison is given after this list).

• Eruption. The Eruption algorithm is similar to Karma. The main difference is that whenever a transaction finds an enemy with higher priority (called momentum, similar to the karma points of Karma), it adds its momentum points to the enemy and waits. The motivation behind adding the points to the enemy is to "help" a transaction that blocks many other transactions to complete. Eruption also halves the momentum points of an aborted transaction, in order to avoid the mutual exclusion problem.

• KillBlocked. Every transaction that does not manage to open a resource is marked as blocked. On contention, the manager aborts the enemy when it is either blocked, or a maximum waiting time has passed.


• Kindergarden. The transactions take turns accessing a block. The manager keeps a list (for each transaction) with the enemies in favour of which the transaction aborted before. Upon conflict, the manager checks the list and if the enemy is in the list, it aborts it, else it backs-off for a fixed amount of time. If after a number of retries the enemy remains the same transaction, the manager aborts it.

• Timestamp. Timestamp is similar to Greedy [GHP05]. Each transaction gets a fixed timestamp and, upon conflict, if the enemy is younger, it is aborted; else the transaction waits for a series of fixed intervals. After half of the maximum number of these intervals, it flags the enemy as potentially failed. If the enemy proceeds with some transactional operations, its manager will clear the flag. When the transaction completes waiting the maximum number of intervals, the enemy gets aborted if its flag is set. Otherwise, the manager doubles the waiting interval and backs off the transaction again.

• QueueOnBlock. Each transaction holds a queue, where any conflicting transactions subscribe. Upon completion the enemy sets a finished flag in the queue, so that the waiting transactions get the resources. Of course, if more than one transaction waits for the same resource, the enemy allocates the resource to one of them and the others subscribe, for that resource, in the new holder's queue.
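As an illustration of how a Karma-style priority comparison could be written, the sketch below captures the rules described in the Karma item above. The per-thread state and the exact arithmetic are assumptions; the actual implementation in [SS04] is richer.

#include <stdbool.h>

/* Hypothetical per-thread state for a Karma-like policy. */
typedef struct {
    int karma;     /* grows with each resource opened; kept across aborts */
    int retries;   /* attempts for the current conflict */
} karma_tx_t;

/* Called when 'self' finds a resource held by 'enemy'. Returns true if
 * the enemy should be aborted now, false if 'self' should back off for
 * a fixed interval and try again. */
static bool karma_should_abort_enemy(karma_tx_t *self, const karma_tx_t *enemy)
{
    if (enemy->karma < self->karma)
        return true;                            /* lower priority: abort it */
    if (self->karma + self->retries > enemy->karma)
        return true;                            /* waited long enough */
    self->retries++;                            /* one point per retry */
    return false;                               /* wait a fixed interval */
}

/* On commit the thread's karma is reset; on abort it is kept, so short
 * transactions eventually gather enough karma to win. */
static void karma_on_commit(karma_tx_t *self) { self->karma = 0; self->retries = 0; }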

The outcome of the evaluation revealed that there is no universally good contention manager. The performance of every manager is closely related to each benchmark and in many cases the performance of some contention managers is unacceptable.

2.3.2 Advanced contention management for dynamic software transactional memory [SS05]

This work is a continuation of the evaluation done in [SS04]. The authors introduce two new contention managers and benchmark them against the ones proved to have the best performance from their previous study (Polite, Karma, Eruption, Kindergarden, and Timestamp).

• PublishedTimestamp. Like Timestamp, but uses a heuristic to estimate if a transaction is active. A transaction updates a "recency" timestamp each time it proceeds with a transactional operation. A thread is assumed active unless its timestamp value lags the system's global time by more than a threshold. PublishedTimestamp aborts an enemy E whose recency timestamp exceeds its own (E's) inactivity threshold. The value of the threshold is reset to an initial value when a thread's transaction commits, while it is doubled (up to an upper bound) when a transaction is aborted and restarted.

• Polka. Polka is a combination of the Polite and Karma algorithms. Upon contention, Polka backs off for a number n of exponentially increasing intervals, where n equals the priority difference between the transaction and the enemy. Moreover, Polka unconditionally aborts any set of readers that holds some resources needed for a read-write access, in order to give priority to writes. This mechanism, though, makes Polka prone to live-locks.


The evaluation results once again suggested that there is no "universal" contention manager that performs the best in every workload. Though, they concluded that Polka performs well even in its worst case and thus it could be a good choice as a default contention manager.

2.3.3 Toward a theory of transactional contention managers [GHP05]

This paper introduces some foundational theory behind contention management for STMs. Contention managers are different from classical scheduling algorithms mainly because they are decentralized (the decision about which of the two transactions to abort is mainly local) and dynamic (there is no prior knowledge about the duration and the size of the transactions, which does not permit off-line scheduling).

The authors present the Greedy contention manager and prove that it provides the following non-trivial properties:

• every transaction commits within bounded time,

• and if n concurrent transactions share s objects, then the makespan (the time to commit all transactions) of the execution is within a factor of s(s+1)/2 of the time needed by an off-line list scheduler (a known NP-complete problem, but any list schedule is within a factor of (s+1) of the optimal [GG75]).

Greedy uses the following three components of a transaction's state:

1. Timestamp. Each transaction is assigned a global timestamp, which it retains in case of abort and retry. A lower timestamp implies higher priority.

2. Status. An attribute with a value of either active, committed, or aborted. This attribute is changed via an atomic compare-and-swap operation, either from active to committed, or from active to aborted.

3. Waiting. An attribute that indicates if the transaction waits for another transaction.

Greedy uses the aforementioned components by applying the following simple contention management rules (transaction A wants to access an object held by transaction B):

• If priority B < priority A, or if B is in waiting mode, then A aborts B.

• If priority B > priority A and A is not waiting, then A waits for B to commit, abort, or start waiting.

Finally, the authors prove that any on-line contention manager that guarantees that at least one running transaction will execute uninterrupted at any time until it commits (a property named pending commit) is, like Greedy, within a factor of s(s+1)/2 of optimal. The practical evaluation revealed that Greedy is mostly suitable in a low-contention environment.
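The two rules above can be captured in a few lines; the sketch below is an illustrative rendering with assumed types, not the authors' code.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical transaction state used by Greedy (cf. the three
 * components listed above). */
typedef struct {
    uint64_t timestamp;   /* lower value => higher priority, kept across retries */
    bool     waiting;     /* set while the transaction waits for another one */
} greedy_tx_t;

typedef enum { GREEDY_ABORT_ENEMY, GREEDY_WAIT } greedy_decision_t;

/* Transaction A wants an object held by transaction B. */
static greedy_decision_t greedy_resolve(const greedy_tx_t *a, const greedy_tx_t *b)
{
    /* Rule 1: if B has lower priority (larger timestamp) or is itself
     * waiting, A aborts B. */
    if (b->timestamp > a->timestamp || b->waiting)
        return GREEDY_ABORT_ENEMY;

    /* Rule 2: otherwise A waits until B commits, aborts, or starts waiting. */
    return GREEDY_WAIT;
}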

2.3.4 Transactional Contention Management as a Non-Clairvoyant Scheduling Problem [AEST08]

This paper analyses the performance of contention managers in terms of their competitive ratio compared to an optimal contention manager that knows the resources that each transaction will use.


The competitive ratio is the makespan for completing the transactions using the current contention manager divided by the time needed under the optimal manager. They proved that every contention manager having the following two properties:

1. A CM is work conserving if it always lets a maximal set of non-conflicting transactions run.

2. A CM obeys the pending commit property [GHP05] if, at any time, some running transaction will execute uninterrupted until it commits.


2.4 Distributed Software Transactional Memory

This section presents several Distributed Software Transactional Memory systems.

2.4.1 Distributed Multi-Versioning (DMV) [MMA06]

DMV stands for Distributed Multi-Versioning and is a distributed concurrency control algorithm. The data are replicated across the nodes of the cluster and the algorithm ensures 1-copy serializability, while using page-level conflict detection. The motivation behind DMV is to take advantage of the multiple versions appearing due to the data replication, instead of using explicit multi-versioning. An update transaction proceeds in the following steps:

1. The writes are deferred until the commit phase.

2. As a pre-commit action the transaction node broadcasts the differences that it caused on the data set.

3. The receiving nodes do not apply these differences, but buffer them.

4. If a receiving node detects a conflict, the local transaction is aborted. This scheme is live-lock prone; in order to avoid this problem, the system uses a system-wide token (a centralization point) that has to be acquired by any transaction that wants to commit.

5. The receiving nodes reply to the sender immediately. The differences will only be applied when another transaction requires a newer version of the data.

The goal of DMV is to allow read-only transactions to proceed independently by operating on their own data snapshot. DMV uses a conflict-aware scheduler (it is assumed to know which memory pages every transaction will access, which is a strong assumption) in order to minimize conflicts, and a master replica (a potential bottleneck) where all update transactions run.

2.4.2 Sinfonia [AMVK07]

Sinfonia aims to provide developers with the ability to program distributed applications without the need to explicitly use message passing primitives, but rather by designing the data structures needed. It stores the data on memory nodes, providing a linear address space and minitransaction primitives. A minitransaction consists of read, write, and compare items. Every item specifies a memory node to be accessed and an address range within that node. The advantage of minitransactions is that, in the best case, they can be started, executed, and committed in two message round-trips. Sinfonia's minitransactions ensure atomicity, consistency, and isolation. The system uses a 2 Phase-Commit protocol, where an application node has the role of the transaction's coordinator and the memory nodes are the participants. The motivation behind minitransactions is to embed the whole transaction's execution in the first phase of the 2PC.
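A rough sketch of how a minitransaction could be represented follows; the field names are assumptions derived from the description above, not Sinfonia's actual API.

#include <stddef.h>
#include <stdint.h>

/* One item of a minitransaction: a memory node and an address range on it. */
typedef struct {
    int      mem_node;     /* which memory node to access */
    uint64_t addr;         /* start address within that node */
    size_t   len;          /* length of the range */
    void    *data;         /* value to compare against, or to write */
} mt_item_t;

/* A minitransaction is three sets of items. Semantics: if all compare
 * items match, all read items are returned and all write items are
 * applied atomically; otherwise the minitransaction aborts. The whole
 * execution can be piggybacked on the first phase of 2PC. */
typedef struct {
    mt_item_t *compare_items; size_t n_compare;
    mt_item_t *read_items;    size_t n_read;
    mt_item_t *write_items;   size_t n_write;
} minitransaction_t;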


2.4.3 Cluster-TM [BAC08]

Cluster-TM is a DTM design targeting large-scale clusters. Cluster-TM uses a PGAS (Partitioned Global Address Space) memory model and serializability guarantees. For performance boosting, Cluster-TM uses multi-word data movement, a transactional "on" construct (moving a computation block to another node instead of moving the data, in order to take advantage of data locality), and software-controlled data caching.

The authors also partitioned the TM design decision space into the following eight categories (Cluster-TM was tested with the alternatives in boldface or the ones explicitly mentioned):

1. Transactional view of the heap. "word-based" or "object-based".

2. Read synchronization. Read-validation on commit time, or read-locks.

3. Write synchronization. Exclusive access, or both "before" and "after" states of the written location are kept, so that other transactions are able to read the value while it is locked.

4. Recovery mechanism. Write buffering, or Undo-log.

5. Time of acquire for write. On write time (early acquire), or on commit time (late acquire).

6. Size of conflict detection unit (CDU). Cache lines, objects, or groups of words. Cluster-TM used 2^n words, where n ≥ 0 is an initialization parameter.

7. Progress guarantee. Deadlock avoidance, obstruction freedom, lock freedom.

8. Where the metadata are stored. Stored in program data objects, in transaction descriptors, or inside data structures. Cluster-TM uses some globally stored metadata (one word per CDU) and transactional local metadata (transactional descriptor where the metadata concerning some data are stored on the home node for that data).

Several design alternatives were explored and the results suggested that both read-locking and write-buffering provide acceptable performance in a DTM.

2.4.4 DiSTM [KAJ+08]

DiSTM is a framework for prototyping and testing software cache-coherence protocols for DTM. Three protocols were predefined; one decentralized, called Transactional Coherence and Consistency (TCC), and two centralized, based on leases. All protocols use object-level conflict detection granularity. The TCC allows a transaction to proceed locally and broadcasts the read and write sets as a pre-commit validation action. Although it is presented as decentralized, it uses a global ticket for entering the pre-commit phase, which is a centralization point. The other two protocols use the notion of a lease that has to be acquired before trying to commit. Two alternatives are possible; a system with one global lease, in which case no validation is needed, and a system with multiple leases and validation among the lease holders.


2.4.5 DSM [DD09]

This paper presented a 2-Phase Commit algorithm for preserving the transactional consistency model. Every data object permanently resides on its creator's node (authoritative copy). The algorithm uses version numbers to verify that a transaction has accessed only the latest versions of the objects (first phase of the 2PC). The version number increases when the authoritative copy of the object is changed. Two performance optimization techniques are used; object caching and object prefetching. Object caching locally caches the remote objects accessed, while object prefetching is done in terms of probabilistically calculated paths in the memory heap. The validity of the objects accessed by these techniques is checked by the 2PC protocol.

2.4.6 D2STM [CRCR09]

D2STM is a replicated STM system that remains consistent even in the presence of failures. Total dataset replication is used to achieve performance and dependability of the system. D2STM builds on top of the JVSTM system, which is a multiversion STM that guarantees the local execution of read-only transactions. D2STM inherits the weak atomicity and opacity guarantees from JVSTM. Regarding consistency, D2STM provides 1-copy serializability. Apart from the functionality and guarantees that JVSTM provides, D2STM uses atomic broadcast in order to achieve the consistency of the replicas.

Each transaction proceeds autonomously in a node and the atomic broadcast is used to agree on a common transaction serialization order (the commit order). Moreover, the atomic broadcast provides non-blocking guarantees in the presence of failures. As mentioned before, the read-only transactions need no validation and can commit without any remote communication. On the other hand, each update transaction, after executing locally, needs to do a local (first) and a global (afterwards) conflict validation in order to commit or abort. In order to reduce the overhead of the distributed validation, D2STM uses a scheme called Bloom Filter Certification (BFC), which is a novel non-voting certification scheme that exploits a space-efficient Bloom Filter-based encoding. The BFC scheme is used to encode the read set of the transaction and provides a configurable trade-off between the data compression percentage and the increase in the risk of a false transaction abort.
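To illustrate the idea behind BFC, the sketch below encodes transactional read identifiers into a fixed-size Bloom filter and checks them against a set of written identifiers; a positive answer may be a false positive, which translates into an unnecessary abort. The hash functions and sizes are arbitrary assumptions, not the encoding used by D2STM.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BF_BITS   1024u
#define BF_HASHES 4u

typedef struct { uint8_t bits[BF_BITS / 8]; } bloom_t;

static void bf_init(bloom_t *bf) { memset(bf->bits, 0, sizeof bf->bits); }

/* Simple multiplicative hashing, used here only for illustration. */
static uint32_t bf_hash(uint64_t key, uint32_t i)
{
    uint64_t h = key * 0x9E3779B97F4A7C15ull + i * 0xBF58476D1CE4E5B9ull;
    return (uint32_t)(h >> 32) % BF_BITS;
}

/* Add a read-set item identifier to the filter. */
static void bf_add(bloom_t *bf, uint64_t item_id)
{
    for (uint32_t i = 0; i < BF_HASHES; i++) {
        uint32_t b = bf_hash(item_id, i);
        bf->bits[b / 8] |= (uint8_t)(1u << (b % 8));
    }
}

/* May return true for items never added (false positive => spurious
 * abort), but never false for items that were added. */
static bool bf_maybe_contains(const bloom_t *bf, uint64_t item_id)
{
    for (uint32_t i = 0; i < BF_HASHES; i++) {
        uint32_t b = bf_hash(item_id, i);
        if (!(bf->bits[b / 8] & (1u << (b % 8))))
            return false;
    }
    return true;
}

/* Certification sketch: the broadcast read-set filter of an update
 * transaction is checked against the identifiers written by already
 * ordered transactions. */
static bool certify(const bloom_t *readset_bf,
                    const uint64_t *written_ids, size_t n_written)
{
    for (size_t i = 0; i < n_written; i++)
        if (bf_maybe_contains(readset_bf, written_ids[i]))
            return false;   /* possible read-write conflict: abort */
    return true;
}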

2.4.7 On the Design of Contention Managers and Cache-Coherence Protocols for Distributed Transactional Memory [Zha09]

In his PhD dissertation, Zhang focuses on the Greedy Contention Manager and presents several cache-coherence alternatives to combine it with, with the goal of achieving better DTM characteristics.

Initially, the Greedy algorithm is combined with a class of location-aware cache-coherence protocols (called LAC) and it is proven that these protocols improve Greedy's performance. The solution is based on a hierarchical tree overlay, similar to the one used in the BALLISTIC protocol [HS05].


Then, a DHT (Distributed Hash Table) based cache-coherence solution is applied. DHTs have good scalability and load balancing characteristics, but are mostly designed for immovable data objects. Zhang used a simple extension to the DHT (a pointer from the normal host of an item to the node that actually holds it) in order to allow object mobility.

Finally, he presents a cache-coherence protocol called DHCB which is based on a quorum system (DHB-grid) in order to allow node joins, departures, and failures.

2.4.8 FTDMT [LDT+10]

FTDMT focuses on providing a DTM with fault-tolerance properties. The DTM runtime replicates each shared object in one backup copy, so in case of failures the object will not be lost. Since only one backup copy is kept, the object remains "safe" only if the two nodes that hold a copy do not fail "simultaneously" (i.e., the second failure occurring before the recovery process re-establishes a backup of the object).

Under this assumption, FTDMT provides atomicity, isolation, and durability. FTDMT also uses the approximately coherent caching and symbolic prefetching techniques presented in DSM [DD09]. FTDMT provides object-level granularity and uses a Perfect Failure Detector (PFD, a very strong assumption that is not applicable in an asynchronous DS) to detect node failures and a Leader Election algorithm to select a leader to control the recovery process in case of a failure (in which case the operation of the DTM is halted until the replication is restored). Finally, FTDMT uses optimistic concurrency with versioning and an adapted version of 2 Phase-Commit that facilitates the failure recovery process.

2.4.9 D-TL2 [SR11]

Distributed Transactional Locking II (D-TL2) is a distributed locking algorithm based on the Transactional Locking II (TL2) algorithm [DSS06] (TL2 itself is not distributed). D-TL2 provides opacity and strong progressiveness guarantees. Both TL2 and D-TL2 use versioning, but D-TL2 uses Lamport-like non-global clocks, while TL2 uses a global one. The proposed algorithm is an object-level lock-based algorithm with lazy acquisition and limits broadcasting to just the object identifiers. Transactions are immobile, objects are replicated and detached from any home node, and a single writable copy of each object exists in the network.

When a transaction attempts to access an object, a cache-coherence protocol locates the current cached copy of the object in the network and moves it to the requesting node's cache. The ownership of an object changes at the successful commit of the object-modifying transaction. At that time, the new owner broadcasts a publish message with the owned object identifier.


2.5 Cache Coherence Protocols for Distributed Transactional Memory

2.5.1 Ballistic [HS05]

Ballistic is a cache-coherence protocol for tracking and moving up-to-date cached objects. Ballistic is location aware and works over a deterministic hierarchical overlay tree structure. Ballistic is used as the cache-coherence protocol of a DTM for Distributed Systems where the communication costs form a metric. Every node has a TM proxy which is responsible for communicating with other proxies and for providing the interface to the applications that use the DTM. A typical transaction consists of the following steps:

1. The application starts a transaction.

2. The application opens an object (using the TM proxy). If the object is not local, the Ballistic protocol is used.

3. The application gets the copy of the object from the proxy.

4. The application works with the copy and possibly fetches and updates/reads more objects.

5. When the application wants to commit, the proxy handles the validation.

6. If the transaction can commit, the proxy persists the updates; otherwise, the changes are discarded.

Conflicts are handled by a contention manager, which applies specific policies in order to avoid deadlocks and live-locks. In this solution, when a remote proxy asks for an object, the object’s local proxy checks whether the object is being used by any local transactions and, if it is, applies the policy that the contention manager implements. Generally, reliable communication should be used between the nodes so that no data is lost. The DTM works in an exclusive-write/shared-read mode, keeping only one copy of each object. Ballistic cannot operate properly over non-FIFO links (in the face of message reordering, Ballistic may get stuck).
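The six steps above can be condensed into the following proxy-centric sketch; every name is a placeholder chosen for illustration, not part of the Ballistic interface.

/* Placeholder types and functions; none of these names come from the
 * Ballistic paper, they only illustrate the control flow of the six steps. */
typedef struct proxy  proxy_t;
typedef struct tx     tx_t;
typedef struct object object_t;
enum access_mode { READ_MODE, WRITE_MODE };

tx_t     *proxy_tx_start(proxy_t *p);
object_t *proxy_open(proxy_t *p, tx_t *tx, long obj_id, enum access_mode m);
int       proxy_try_commit(proxy_t *p, tx_t *tx);
void      modify(object_t *o);

int run_transaction(proxy_t *proxy, long obj_id) {
    tx_t *tx = proxy_tx_start(proxy);                          /* step 1 */
    object_t *obj = proxy_open(proxy, tx, obj_id, WRITE_MODE); /* steps 2-3: Ballistic
                                                                  locates a non-local object */
    modify(obj);                                               /* step 4: work with the copy */
    if (proxy_try_commit(proxy, tx))                           /* step 5: the proxy validates */
        return 1;                                              /* step 6: updates persisted   */
    return 0;                                                  /* step 6: changes discarded   */
}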

2.5.2 Relay [ZR09b]

This paper introduces a DTM based on a cache-coherence protocol called Relay. Relay is based on a distributed queuing protocol, the arrow protocol [Ray89], which works with path reversal over a network spanning tree. Although the arrow protocol guarantees good upper bounds on the locating and moving stretch, it does not take into account possible contention on an object. Contention can cause several aborts, which in turn make the queue grow longer. Relay delays the pointer reversal until the object has actually moved to the new node, reducing the number of aborts by a factor of O(N), where N transactions operate simultaneously on the object. Relay does not operate properly over non-FIFO links (it may route messages incorrectly in case of message reordering).
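The following much-simplified sketch, seen only from the node that currently holds the object, illustrates the idea of delaying the link update until the object actually moves; the names and the single-node view are simplifications for illustration, not the actual Relay protocol.

#include <stdint.h>

typedef uint32_t node_id_t;

/* Per-node state in an arrow/Relay-style spanning-tree directory; the names
 * are placeholders, not taken from the Relay paper. */
typedef struct {
    node_id_t next_link;     /* the node's outgoing pointer in the spanning tree */
    int       holds_object;  /* 1 if the single writable copy is cached here     */
} relay_state_t;

/* Assumed messaging helpers. */
void forward_request(node_id_t to, node_id_t requester);
void transfer_object_after_commit(node_id_t to);

/* Route a request toward the current holder. The plain arrow protocol reverses
 * links eagerly as the request passes; in this sketch of Relay's idea, the
 * holder reverses its link only when the object is actually handed over, so
 * transactions that still abort do not redirect the path prematurely. */
void relay_handle_request(relay_state_t *s, node_id_t requester) {
    if (s->holds_object) {
        transfer_object_after_commit(requester);  /* move the object once it is safe */
        s->holds_object = 0;
        s->next_link    = requester;              /* reverse the pointer only now */
    } else {
        forward_request(s->next_link, requester);
    }
}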


2.5.3 COMBINE [AGM10]

COMBINE is a directory-based consistency protocol for shared objects. It is designed for large-scale Distributed Systems with unreliable links. COMBINE operates on an overlay tree whose leaves are the nodes of the system. The overlay tree is similar to the one in the BALLISTIC protocol [HS05], but simpler, since it does not use the shortcut links used in BALLISTIC. The advantages of COMBINE are the ability to operate over non-FIFO links and to handle concurrent requests without degrading performance. COMBINE provides these characteristics by combining requests that overtake each other while passing through the same node. At the same time, COMBINE avoids race conditions while guaranteeing that the cost of a request is proportional to the cost of the shortest path between the two nodes.
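The combining idea can be sketched as follows for a single overlay-tree node; the data structures and forward_to_parent are illustrative assumptions rather than COMBINE's real messages.

#include <stdint.h>
#include <stddef.h>

typedef uint64_t obj_id_t;
typedef uint32_t node_id_t;

/* A request waiting at (or passing through) this overlay-tree node. */
typedef struct pending_req {
    obj_id_t            obj;
    node_id_t           requester;
    struct pending_req *combined;   /* later requests piggy-backed on this one */
} pending_req_t;

void forward_to_parent(pending_req_t *req);   /* assumed messaging primitive */

/* If a request for the same object is already in flight through this node,
 * attach the new one to it instead of sending a second message up the tree;
 * when the reply comes back, it is relayed to every combined requester.
 * (Table-overflow handling is omitted for brevity.) */
void combine_or_forward(pending_req_t *inflight[], size_t n, pending_req_t *req) {
    for (size_t i = 0; i < n; i++) {
        if (inflight[i] != NULL && inflight[i]->obj == req->obj) {
            req->combined         = inflight[i]->combined;   /* combine the requests */
            inflight[i]->combined = req;
            return;
        }
    }
    for (size_t i = 0; i < n; i++) {          /* first request for this object */
        if (inflight[i] == NULL) {
            inflight[i] = req;
            break;
        }
    }
    forward_to_parent(req);
}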


2.6 Cache Coherence Protocols for Shared Memory

This section presents some cache-coherence protocols designed for shared memory.

2.6.1 An evaluation of directory schemes for cache coherence [ASHH88]

This paper evaluates shared-memory cache-coherence protocols. There are two basic approaches to the problem: snoopy cache schemes and directory schemes. With the former, each cache monitors all shared-memory-related operations in order to determine whether coherence actions should be taken. With the latter, a separate metadata directory (describing the state of the blocks of shared memory) is kept.

While snoopy protocols use broadcast to disseminate the information, directory-based protocols hold enough information about which caches keep a block, so no broadcast is needed to locate the shared copies. The authors presented and evaluated the following directory-based solutions.

Tang’s method [Tan76] allows each memory block to reside in several caches, as long as the copies are not dirty. Only one cache can hold a dirty entry of a block. The protocol takes the following actions (a sketch in code follows the list):

• On a read-miss, if there is a dirty copy of this block (checked through the directory), it is first written back to the shared memory.

• On a write-miss, if there is a dirty copy, it is flushed to the shared memory. If not, all the cached copies are invalidated.

• On a write-hit, if the block is already dirty, there is no need for further action. If not, all the other cached copies of the block must be invalidated.
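The three actions could be expressed as the following directory handlers; the types and helper functions are placeholders used only to make the protocol actions concrete, not Tang's original design.

/* Placeholder types and directory helpers. */
typedef struct directory directory_t;
typedef unsigned block_t;
typedef unsigned cache_id_t;

int  dir_has_dirty_copy(directory_t *d, block_t b, cache_id_t *owner);
int  dir_is_dirty(directory_t *d, block_t b);
void dir_add_sharer(directory_t *d, block_t b, cache_id_t c);
void dir_set_dirty_owner(directory_t *d, block_t b, cache_id_t c);
void dir_mark_clean(directory_t *d, block_t b);
void write_back_to_memory(cache_id_t owner, block_t b);
void invalidate_all_sharers(directory_t *d, block_t b);
void invalidate_all_sharers_except(directory_t *d, block_t b, cache_id_t c);

/* Read miss: a dirty copy (if any) is written back first, then the requester
 * is added as one more clean sharer. */
void on_read_miss(directory_t *d, block_t b, cache_id_t c) {
    cache_id_t owner;
    if (dir_has_dirty_copy(d, b, &owner)) {
        write_back_to_memory(owner, b);
        dir_mark_clean(d, b);
    }
    dir_add_sharer(d, b, c);
}

/* Write miss: a dirty copy is flushed; otherwise all clean copies are
 * invalidated. The writer becomes the single dirty holder. */
void on_write_miss(directory_t *d, block_t b, cache_id_t c) {
    cache_id_t owner;
    if (dir_has_dirty_copy(d, b, &owner))
        write_back_to_memory(owner, b);
    else
        invalidate_all_sharers(d, b);
    dir_set_dirty_owner(d, b, c);
}

/* Write hit: nothing to do if the block is already dirty here; otherwise the
 * other cached copies must be invalidated before the block becomes dirty. */
void on_write_hit(directory_t *d, block_t b, cache_id_t c) {
    if (!dir_is_dirty(d, b)) {
        invalidate_all_sharers_except(d, b, c);
        dir_set_dirty_owner(d, b, c);
    }
}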

Censier and Feautrier [CF78] proposed a mechanism similar to Tang’s. Tang’s method duplicates every individual cache directory in the main one; therefore, in order to find where a block resides, all the duplicated directories must be searched. Censier and Feautrier used a centralized directory with some additional metadata to alleviate this overhead.

Yen and Fu [YYF85] suggested a refinement of the Censier and Feautrier technique. The same central directory is used, but an extra flag is kept that designates whether a cache is the only holder of a block. With this flag set, a write to a clean block does not require searching the directory, since the block is not cached anywhere else.

Archibald and Baer [AB84] suggested a broadcast-based solution that does not need to keep any extra metadata. They also used the single-holder technique of Yen and Fu. This solution inherits the scaling problem of snoop-based protocols due to the use of broadcasting.

Finally, the scheme that holds full information about where each block resides is discussed. With this scheme, broadcast is not needed, but the directory’s size grows proportionally to the number of processors. The authors suggest a modification to the full-map directory that maintains a fixed amount of metadata: the directory keeps only one pointer to a cache and a broadcast bit per block. If there is a single holder, the pointer points to it; otherwise, the broadcast bit is set and broadcasting is used to keep coherence.

2.6.2 Directory-Based Cache Coherence in Large-Scale Multiprocessors [CFKA90]

This paper, similarly to [ASHH88], evaluates different directory-based cache-coherence protocols. It presents a categorization of directory protocols according to how the metadata is stored; illustrative entry layouts are sketched after the list. The different classes are:

• Full-map directories. The directory keeps complete information about where each memory block is cached. No broadcasting is ever needed.

• Limited directories. The directory records where each memory block is cached, up to a fixed limit; for example, it could store a single pointer. If the limit is exceeded, a flag is set and broadcasting is used for invalidation.

• Chained directories. A chain of pointers is created from the memory through the caches that hold a copy of a block. For example, if initially no cache holds a block and cache a and then cache b read it, the resulting chain looks like (memory) → b → a → null. Invalidation is achieved by traversing the chain.
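The three classes roughly correspond to the following per-block directory entry layouts; the structures are hypothetical and only illustrate the trade-offs described above.

#include <stdint.h>

#define NUM_CACHES 48              /* illustrative system size */

/* Full-map entry: one presence bit per cache, so invalidations never need
 * broadcast but the entry grows with the number of caches. */
typedef struct {
    uint64_t present[(NUM_CACHES + 63) / 64];
    uint8_t  dirty;
} full_map_entry_t;

/* Limited entry: a fixed number of sharer pointers plus an overflow flag;
 * once the limit is exceeded, invalidation falls back to broadcast. */
typedef struct {
    uint8_t sharers[1];            /* e.g. a single pointer, as in the text */
    uint8_t broadcast;             /* set when more caches hold the block   */
} limited_entry_t;

/* Chained entry: memory points at the most recent sharer, and every cache
 * keeps a pointer to the next one, e.g. (memory) -> b -> a -> null. */
typedef struct {
    uint8_t head;                  /* cache id of the first sharer in the chain */
    uint8_t valid;
} chained_entry_t;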

2.6.3 The directory-based cache coherence protocol for the DASH multiprocessor [LLG+90]

The DASH system is a scalable shared-memory multiprocessor developed at Stanford. The system uses the DASH distributed directory-based cache-coherence protocol. DASH consists of several processing nodes organized into clusters, each of which holds a portion of the shared memory. The protocol uses point-to-point communication instead of broadcast and is hardware-based.

Every processing node holds the portion of the directory metadata that corresponds to the shared memory it physically hosts. For each memory block, the directory memory maintains a list of the nodes that possess a cached copy. In this way, point-to-point invalidation can be achieved.

The authors recognized three main issues in designing a cache-coherence protocol: memory consistency, deadlock avoidance, and error handling. DASH guarantees release consistency [GLL+90]. Release consistency is an extension of weak consistency, where the memory operations of one processor may appear out of order with respect to other processors. The ordering of memory operations is enforced only when completing synchronization or explicit ordering operations.

The DASH cache-coherence protocol is an ownership protocol based on invalidation. A memory block may be in one of three states: (i) uncached remote, (ii) shared remote, or (iii) dirty remote. Coherence within a processing cluster is guaranteed by a snooping cache-coherence protocol and not by the directory protocol.


2.6.4 Software cache coherence for large scale multiprocessors [KS95]

This paper introduces a software-based cache-coherence protocol. The protocol assumes some hardware support (an intermediate hardware option: memory-mapped network interfaces that support a global physical address space), but was adjusted to work purely in software. The protocol allows more than one processor to write to a cached memory page concurrently and provides a modification of release consistency.

The protocol uses a distributed, non-replicated full-map data structure, in which every cached copy of a page is logged. A memory page can be in one of the following states: uncached, shared, dirty, or weak. Each processor holds the portion of the directory map that corresponds to the physical memory that it locally possesses. Moreover, each processor keeps a weak list, i.e., a list of the pages that are marked as weak.
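An illustrative data layout for this scheme might look as follows; the names are placeholders, not taken from the paper.

#include <stddef.h>
#include <stdint.h>

typedef enum { PAGE_UNCACHED, PAGE_SHARED, PAGE_DIRTY, PAGE_WEAK } page_state_t;

/* Full-map entry: every cached copy of the page is logged here. */
typedef struct {
    page_state_t state;
    uint64_t     sharers;            /* one bit per processor holding a copy */
} page_entry_t;

/* Each processor keeps the directory slice for its own physical pages plus a
 * weak list of pages that must be dealt with at the next synchronization point. */
typedef struct {
    page_entry_t *local_directory;   /* entries for locally owned pages   */
    unsigned     *weak_list;         /* page numbers currently marked weak */
    size_t        weak_count;
} processor_coherence_state_t;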


3 SC-TM, the Single-chip Cloud TM

3.1 Introduction

This chapter introduces SC-TM, the Single-chip Cloud TM: a Distributed Software Transactional Memory algorithm that allows the programmer to use its transactional interface to easily and efficiently take advantage of the inherent parallelism that a Many-core processor exposes, while providing strong liveness and safety guarantees.

The SC-TM algorithm aims to provide a fully decentralized, modular, and scalable DTM system, suitable for Many-core processors, that guarantees opacity and starvation-freedom (the latter has to be provided by the contention manager). The algorithm consists of three distinct functional parts (object locating, distributed locking, and contention management), so that different design choices and extensions can be easily accommodated. For example, data replication could be achieved with an additional network step by configuring the distributed locking service to publish the writes. The algorithm relies on the message-passing support that every future Many-core processor is expected to provide. SC-TM was implemented, tested, and evaluated on Intel’s SCC. The SCC experimental processor is a 48-core ’concept vehicle’ created by Intel Labs as a platform for Many-core software research. It does not provide any hardware cache coherence, but supports message passing.

Section 3.2 describes the assumptions made about the underlying system. Sections 3.3 to 3.6 describe the design of the SC-TM algorithm, while Sections 3.7 and 3.8 present the target platform, Intel’s Single-Chip Cloud Computer, and the specifics of porting SC-TM to the SCC. Finally, Section 3.9 describes several problems that emerged during the thesis purely because of the SCC processor.

3.2 System Model

The underlying platform that SC-TM assumes should comply with the following properties. Firstly, regarding the process failure model, SC-TM presumes that processes do not fail.

The target system is fully connected (every node can communicate with every other node) and the links are reliable [GR06]. Consequently, every message sent is eventually delivered once by the target node.

Finally, regarding the timing assumptions of the system, I consider an asynchronous system; I do not make any timing assumptions about the processes or the communication channels. Only in the case of the Offset-Greedy CM (Section 3.5) does SC-TM assume a partially synchronous system model [GR06]. Under partial synchrony, one can define physical time bounds for the system that are respected most of the time.


Figure 3.1: Abstract Architecture of the SC-TM system

3.3 System Design

Figure 3.1 depicts the overall architecture of the SC-TM system. One of the major design goals was modularity: the system should be flexible and adjustable, so that different design choices and extensions can be easily engineered. In order to achieve this, the system is separated into two major parts and several components communicating through well-defined interfaces. The two parts are the Application and the DTM system.

3.3.1 Application part

The Application part is nothing more than the application code that the application programmer has developed and that makes use of the SC-TM system. This part can be as complex as the application programmer wants; from the DTM point of view, it consists of a single component.

3.3.2 DTM part

The DTM part is the one implementing the SC-TM algorithm. It consists of the following components:

• TX Interface

• DS-Lock

• Object Locating

Each of the aforementioned components will be described in detail in one of the following subsections.

TX Interface

The Transactional (TX) Interface component is the interface that the SC-TM system exports to the applications. Using these functions is the only way an application can interact with the DTM system. It includes functions to perform the following operations:

• Initialize the SC-TM system.

• Finalize the SC-TM system.

• Start a transaction.

• End (try commit) a transaction.

• Perform a transactional read.

• Perform a transactional write.

• Perform a transactional memory allocation.

• Perform a transactional memory freeing.

For more details on the transactional interface as implemented in the SC-TM system, see Section 3.4.
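As a rough illustration, the interface could take the shape of the following C declarations; these signatures are hypothetical and the actual names and prototypes are the ones defined in Section 3.4.

#include <stddef.h>
#include <stdint.h>

void    tx_sys_init(int argc, char **argv);  /* initialize the SC-TM system              */
void    tx_sys_shutdown(void);               /* finalize the SC-TM system                */

void    tx_start(void);                      /* start a transaction                      */
int     tx_commit(void);                     /* end (try to commit); non-zero on success */

int64_t tx_read(volatile int64_t *addr);     /* transactional read                       */
void    tx_write(volatile int64_t *addr, int64_t value);   /* transactional write        */

void   *tx_malloc(size_t size);              /* transactional memory allocation          */
void    tx_free(void *ptr);                  /* transactional memory freeing             */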

DS-Lock

The Distributed (DS) Lock component is the heart of the SC-TM system. SC-TM is a lock-based TM system and DS-Lock is responsible for providing a multiple-readers/unique-writer locking service. The service is collectively implemented by some, or all, nodes of the system. Consequently, each node running a part of the DS-Lock service is responsible for keeping and handling the locking metadata for a partition of the shared memory. In this respect, the DS-Lock service is similar to some directory-based cache-coherence solutions [LLG+90, KS95].

Of course, the locking service is not a simple blocking or try-lock one, but is extended to include the transactional semantics: Read-after-Write, Write-after-Read, and Write-after-Write conflicts. Whenever one of these conflicts is detected, the component makes use of the Contention Manager, which is responsible for resolving the conflict.
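A rough sketch of how such a conflict-aware write-lock request might be handled on the responsible node is given below; the types, the contention-manager interface, and the single-conflicting-holder simplification are assumptions made for illustration and do not reproduce the actual SC-TM code described later in this chapter.

#include <stdint.h>

/* Placeholder types; the real SC-TM metadata layout is not shown here. */
typedef uint32_t tx_id_t;
#define NO_TX       ((tx_id_t) -1)
#define MAX_READERS 48

typedef struct {
    tx_id_t writer;                 /* unique writer, or NO_TX        */
    tx_id_t readers[MAX_READERS];   /* current readers of the address */
    int     num_readers;
} lock_entry_t;

typedef enum { ABORT_REQUESTER, ABORT_HOLDER } cm_decision_t;

/* Assumed interfaces: the pluggable CM and the reply/abort messaging. */
cm_decision_t contention_manager_resolve(tx_id_t requester, tx_id_t holder);
void reply_granted(tx_id_t tx);
void reply_abort(tx_id_t tx);
void abort_holder_and_release(lock_entry_t *e, tx_id_t holder);

/* Write-lock acquisition: a Write-after-Write or Write-after-Read conflict is
 * handed to the contention manager, which decides which transaction survives.
 * For brevity only one conflicting holder is considered. */
void handle_write_lock_request(lock_entry_t *e, tx_id_t requester) {
    if (e->writer != NO_TX || e->num_readers > 0) {
        tx_id_t holder = (e->writer != NO_TX) ? e->writer : e->readers[0];
        if (contention_manager_resolve(requester, holder) == ABORT_REQUESTER) {
            reply_abort(requester);
            return;
        }
        abort_holder_and_release(e, holder);
    }
    e->writer = requester;          /* unique-writer semantics */
    reply_granted(requester);
}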

The operations DS-Lock implements are basically the following four:

1. Read-lock acquire
