
Porting a C Library for Fine Grained Independent

Task Parallelism to Enea OSE RTOS

Master of Science Thesis

YIHUI WANG

Master’s Thesis at Xdin AB
Supervisors: Barbro Claesson, Xdin AB; Detlef Scholle, Xdin AB
Examiner: Ingo Sander, KTH


Acknowledgements

I would like to express my sincere gratitude to my supervisors, Detlef Scholle and Barbro Claesson from Xdin AB, for giving me the precious opportunity to accomplish this master thesis and for their constant support. I would also like to thank Ingo Sander, my examiner from KTH, for his useful suggestions and important guidance in writing this report.

I would like to give my thanks to the other thesis workers at Xdin, especially Sebastian Ullström and Johan Sundman Norberg, who helped me a lot with both the project and the thesis. The amazing time that we spent together became one of my most precious memories in life.

My thanks extend to everyone at Xdin AB and Enea AB for being friendly and hospitable.

Finally, I would like to thank my beloved family for their support throughout my entire life; I could not have come this far academically without them. In particular, I must acknowledge my friend Junzhe Tian for his encouragement and constant assistance throughout my master studies.


Abstract

Multi-core starts an era of improving the performance of computations by executing instructions in parallel. However, the improvement in performance is not linear in the number of cores, because of the overhead caused by inter-core communication and unbalanced load over the cores. Wool provides a solution to improve the performance of multi-core systems. It is a C library for fine grained independent task parallelism developed by Karl-Filip Faxén at SICS, which helps to keep the load balanced over cores by work stealing and leap frogging.


Contents

List of Figures
List of Tables
List of Abbreviations

1 Introduction
  1.1 Background
  1.2 Problem statement
  1.3 Goals
  1.4 Method
  1.5 Contributions

I Theoretical Study

2 Multi-core
  2.1 Multi-core operating system
  2.2 Comparison
  2.3 Summary

3 Parallel Computing
  3.1 Parallel performance metrics
  3.2 Levels of parallelism
    3.2.1 Instruction-level parallelism (ILP)
    3.2.2 Thread-level parallelism (TLP)
  3.3 Parallel programming models
  3.4 Design in parallel
  3.5 Summary

4 Threads
  4.1 Thread overview
    4.1.1 Process & thread
  4.2 Thread management
  4.3 Mutual exclusion
  4.4 Conditional variables
  4.5 Synchronization
    4.5.1 Atomic operations
    4.5.2 Hardware primitives
  4.6 Threads on multi-core
  4.7 Summary

5 Wool
  5.1 Basic concepts
    5.1.1 Worker
    5.1.2 Task pool
    5.1.3 Granularity in Wool
    5.1.4 Load balance in Wool
    5.1.5 Structure
  5.2 Work stealing
  5.3 Leap frogging
  5.4 Direct task stack algorithm
    5.4.1 Spawn
    5.4.2 Sync
  5.5 Wool optimization
    5.5.1 Sampling victim selection
    5.5.2 Set based victim selection
  5.6 Other multi-threaded scheduler
  5.7 Summary

II Implementation

6 OSE
  6.1 OSE fundamentals
  6.2 OSE process & IPC
  6.3 OSE for multi-core
    6.3.1 OSE load balancing
    6.3.2 Message passing APIs
  6.4 OSE pthreads
  6.5 Summary

7 P4080 and TILE64
  7.1 Features
    7.1.1 Instruction set
    7.1.2 Memory consistency
    7.1.3 Memory barrier
  7.2 Freescale QorIQ P4080 platform
    7.2.1 P4080 architecture
    7.2.2 e500mc core
  7.3 Tilera’s TILE64
    7.3.1 TILE64 architecture
    7.3.2 TILE

8 Porting Wool to P4080
  8.1 Design
    8.1.1 Storage access ordering
    8.1.2 Atomic update primitives
  8.2 Implementation
    8.2.1 Prefetch & CAS
    8.2.2 Gcc inline assembler code
    8.2.3 Enea Linux
  8.3 Experiment
    8.3.1 Experiment: Fibonacci code with and without Wool
    8.3.2 Analysis

9 Porting Wool to OSE
  9.1 Design
    9.1.1 Malloc library
    9.1.2 Pthread
  9.2 Implementation
    9.2.1 Load module configuration
    9.2.2 Makefile
    9.2.3 System information
    9.2.4 Application on OSE multi-core
  9.3 Experiments
    9.3.1 Experiment on OSE & Linux


List of Figures

1.1 Implementation Steps
2.1 Multi-core Architecture
2.2 Architecture of SMP System
2.3 Architecture of AMP System
4.1 Mutex Structure [14]
4.2 Condition Variable [15]
5.1 Work Stealing Structure
5.2 Work Stealing
5.3 Leap Frogging
5.4 Spawn
6.1 Processes in Blocks with Pools in Domains [26]
6.2 Message Passing
6.3 MCE: Hybrid AMP/SMP Multiprocessing [31]
7.1 P4080
7.2 Tilera’s TILE64 Architecture [38]
8.1 Enea Linux
8.2 Fib (5)
9.1 Fib(43) with Wool Library on the Linux & OSE on P4080


List of Tables

2.1 Comparison of Multiple Processors and Multiple Cores
2.2 Comparison between Multiple Threads and Multiple Cores
3.1 Comparison of Shared Memory Model and Message Passing Model
4.1 Processes and Threads
4.2 Advantages and Disadvantages of Threads on Multi-core
7.1 Instruction Set Comparison
7.2 Memory Consistency Models
7.3 Memory Barriers & Instructions
8.1 Time for Fib(43) with Wool Library on the Linux X86 Platform
9.1 Fib(46) with Memalign, Malloc and Calloc
9.2 Fib(43) with Wool Library on Multiple Cores


List of Abbreviations

Pthread   POSIX thread
OSE       Operating System Embedded
P4080     QorIQ P4080 Eight-Core Communications Processor
CMP       Chip Multi-Processor
SMP       Symmetric Multi-Processing
OS        Operating System
SMT       Simultaneous Multi-Threading
ILP       Instruction-Level Parallelism
TLP       Thread-Level Parallelism
API       Application Programming Interface
LIFO      Last In First Out
FIFO      First In First Out
IPC       Inter Process Communication
MCE       Multi-core Embedded
ISA       Instruction Set Architecture
CISC      Complex Instruction Set Computer
RISC      Reduced Instruction Set Computing
MPPA      Massively Parallel Processor Array
Mutex     Mutual Exclusion


Chapter 1

Introduction

1.1 Background

This thesis project is conducted at Xdin AB (http://xdin.com/), a technology and IT consulting company that develops and delivers competence for world-leading companies. The thesis is part of the MANY project (Many-core programming and resource management for high-performance Embedded Systems) hosted by ITEA2 (http://www.itea2.org). The objective of MANY is to provide scalable, reusable and rapidly developed software for embedded systems. Multi-core became popular in the recent decade, combining the advantages of single-core processors with parallel computing. However, the overhead caused by synchronization among processors becomes a problem as the number of cores keeps increasing. An efficient task-parallelism scheduler is needed to decrease the communication overhead and improve performance.

Wool (homepage: http://www.sics.se/ kff/wool/) is a C library for fine grained independent task parallelism on top of POSIX threads (pthreads) [1], developed at SICS (Swedish Institute of Computer Science) by Karl-Filip Faxén. It can be regarded as a low overhead scheduler for concurrent programs. The demand for efficient parallel computing makes it important to port Wool to OSE multi-core platforms to gain better performance. This master thesis focuses on porting Wool to the Enea OSE real-time operating system.

1.2 Problem statement

Wool is a C library concentrating on reducing communication overhead and keeping the load balanced across multiple cores. It is designed to give better performance for multi-core systems, like the P4080 (http://www.freescale.com), because it helps to distribute the work load to each core with low overhead. Wool currently works on Linux on X86. To port Wool to Enea OSE on the P4080, good knowledge of Wool task parallelism, the PowerPC instruction set and the OSE operating system is required. Differences between the operating systems and the hardware targets should be considered before the porting. The main problems are listed below.

P1: Wool task parallelism

Wool works on task parallelism. How tasks are assigned to different cores and cooperate to achieve more efficient parallelism is an interesting topic. Modifying the Wool source code to fit the new platform is the main part of the project, so it is important to find out which parts of the source code must be modified to port Wool to the new platform without changing the task parallelism.

P2: Pthread

Wool is a C library based on POSIX threads (pthreads), a thread library provided by the operating system. The Linux operating system implements pthreads, and all the thread functions and data types are declared in the pthread header file. However, the Enea OSE operating system is based on OSE processes and message passing instead of pthreads. Though most pthread APIs are implemented and supported by OSE to make it easier to port pthread based applications, changes and reconfiguration of OSE pthreads remain problems in this project.

P3: Hardware synchronization primitives

Hardware synchronization primitives are used in the implementation of Wool for its optimization. These hardware primitives are defined by processor vendors. To port Wool to the new target P4080, one must add these primitives for the specified target. This hardware dependent code relies on the memory consistency model, the memory barriers and the type of instruction set. The low level assembly code has to be embedded in the C program.

1.3 Goals

This project is conducted in two phases.

Step 1: Port Wool to P4080.

1. Configure the P4080 with a Linux kernel.
2. Modify the hardware dependent code of Wool using gcc inline assembler.
3. Verify Wool on the P4080.

Step 2: Port Wool to OSE.

1. Configure OSE on the P4080.
2. Modify the OSE libraries, including the pthread library.
3. Verify Wool on top of OSE pthreads.

Figure 1.1. Implementation Steps (X86/Linux → P4080/Linux (pthread) → P4080/OSE)

1.4 Method

This thesis project lasts 20 weeks and comprises two phases: theoretical study and implementation. The theoretical study is conducted during the first 10 weeks; background knowledge of Wool, OSE and the P4080 is acquired by reading manuals and papers.

In the second phase, the design and implementation of Wool on the P4080 are conducted first, followed by tests verifying the implementation. Finally, the performance of Wool is measured on the new platform and a demonstration is conducted. The platform is a Freescale board with the QorIQ P4080 Eight-Core Communications Processor, and the implementation is done according to Xdin AB standards.

1.5 Contributions

This master thesis involves the study of parallel programming on multi-core, POSIX threads and Wool task parallelism strategies. Enea Linux has been set up on the Freescale P4080 platform. The source code for the hardware dependent primitives has been changed to fit the new platform, and a set of tests has been performed to verify the results.


Part I

Theoretical Study


Chapter 2

Multi-core

According to Moore’s Law, chip performance doubles every 18 months [2]. To keep this trend, micro-processor vendors used to improve computer performance by increasing the clock frequency of single-core processors. However, the increase in performance reached a bottleneck due to power consumption and heat dissipation, which grow exponentially with the clock frequency. Multi-core therefore came as a solution, improving performance with multiple execution units running in parallel [3]. However, it increases the difficulty of programming for the massive parallelism, and inter-core dependencies can decrease performance as well.

Multi-core is a processor which integrates more than one processing unit on a single chip [4]. Each unit is called a "core" and is independent. A multi-core processor is also called a Chip Multi-Processor (CMP) because the cores fit on a single processor. The architecture of a multi-core processor generally includes several cores, one system bus, private or shared caches (there is always at least one cache level that is shared, which helps to speed up data transfers, reduce cache-coherency complexity and reduce data-storage redundancy) and a shared memory, as shown in Figure 2.1. It differs from a single core processor in both architecture and parallel computing.

2.1 Multi-core operating system

Figure 2.1. Multi-core Architecture (each core has a CPU with private L1 instruction and data caches; the cores share an L2 cache, a system bus and shared memory)

SMP systems

SMP systems (see Figure 2.2) are the most commonly used form; they allow any core to work on any task regardless of where the data is located in memory. There is a single operating system (OS) image for an SMP. Each processor is able to access the shared memory and complete a given task, in contrast to master/slave processor designs [5]. This makes it easy to distribute tasks among cores to balance the workload dynamically.

Figure 2.2. Architecture of SMP System (a single operating system schedules applications across Core1 ... CoreN)

AMP systems

Figure 2.3. Architecture of AMP System (each core runs its own operating system image and application, e.g. OSE on one core and Linux on the others)

2.2 Comparison with multi-processor & multi-threading

Multi-core systems are unlike multi-processor systems and multi-threading. To characterize multi-core systems clearly, we compare them below.

Multi-processor system

Multi-processor systems require more power because signals between cores are routed off-chip. Compared with a multi-processor system, a multi-core system is faster thanks to the inter-core bus and cache snooping [4]. Differences between them are shown in Table 2.1.

Multiple Processors                          | Multiple Cores
Two different chips connected by a bus       | Connected within a chip
Parallelism needs external software support  | Multiple processes run in parallel automatically
Heat consumption                             | Less heat consumption
-                                            | Lower package cost

Table 2.1. Comparison of Multiple Processors and Multiple Cores.

Multi-threading

Multiple Threads               | Multiple Cores
Instruction level parallelism  | Thread level parallelism
Share one core and L1 cache    | More than one set of cores and L1 caches
One large super-scalar core    | Several cores

Table 2.2. Comparison between Multiple Threads and Multiple Cores

2.3 Summary


Chapter 3

Parallel Computing

Parallel computing executes computations simultaneously, in contrast to serial computing, which executes one instruction at a time. The main idea of parallelism is to break a problem into small parts, assign them to different execution units, execute them simultaneously and complete the problem faster [6]. Highly efficient parallel computation can improve the performance of the system, but programming parallel hardware is difficult because of the massive parallelism. Parallel computing models and challenges are discussed in this chapter.

3.1 Parallel performance metrics

There are various ways to evaluate the improvement of performance in a parallel computation. The actual speedup of a parallel computation can be measured as:

\[ S_p = \frac{T_{\mathrm{serial}}}{T_{\mathrm{parallel}}} \]

This formula shows that the upper bound of the speedup equals the number of CPUs. However, the speedup is not only limited by the number of CPUs, but also by the proportion of the program that can be parallelized, which can be calculated with Amdahl's law. Splitting a program into a serial part and a parallel part, the maximum speedup is:

\[ S_p = \frac{1}{(1 - P) + \frac{P}{N}} \]

where P is the fraction of the program that can be parallelized and N is the number of processors.


The upper limit of the speedup is determined by the serial fraction of the code, so by parallelizing as much as possible we can get better performance [4]. The basic idea of parallelism is to break a program of instructions into small pieces and execute them simultaneously. However, as the number of cores increases, the performance does not increase linearly, due to the overhead caused by cache effects, extra synchronization and bus contention [7]. Therefore tasks (pieces of a program) need to be large enough to reduce overhead, but not so large that there is not enough work to run in parallel to keep the load balanced.
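As a concrete illustration of Amdahl's law above, the following minimal C sketch evaluates the maximum speedup for a parallel fraction P on N cores; the function name and the example values are illustrative, not taken from the thesis.

#include <stdio.h>

/* Maximum speedup according to Amdahl's law: S = 1 / ((1 - P) + P/N). */
static double amdahl_speedup(double P, int N)
{
    return 1.0 / ((1.0 - P) + P / (double)N);
}

int main(void)
{
    /* Illustrative values: 95% parallelizable code on 8 cores (as on the P4080). */
    printf("Speedup on 8 cores: %.2f\n", amdahl_speedup(0.95, 8));
    return 0;
}

For these illustrative values the result is roughly 5.9, well below the 8x upper bound, showing how the serial fraction limits the achievable speedup.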

3.2 Levels of parallelism

There are different ways to implement parallelism to improve performance. By using parallel programming, we can decrease the time needed for a single problem and increase the throughput at the same time.

3.2.1 Instruction-level parallelism (ILP)

Instructions can be overlapped and executed in parallel when they are independent. With ILP, processors reorder the instruction pipeline, decompose instructions into sub-instructions and execute multiple instructions simultaneously. ILP is implicit parallelism; it reduces the latency of memory accesses by applying pipelining techniques and super-scalar architectures. Furthermore, by reordering instructions, processors can perform useful work instead of stalling on data and instruction dependencies [8].

3.2.2 Thread-level parallelism (TLP)

TLP is the main architecture for high performance multi-core or multi-processor systems. TLP means that when a thread is idle waiting for a memory access, another thread is initialized and run immediately, so that the pipeline can stay busy all the time. In addition, the throughput of the system is increased.

3.3 Parallel programming models

Parallel programming models exist as an abstraction above hardware and memory architectures [9]. They can be applied to any type of hardware or memory architecture. For example, a shared memory model on a distributed memory machine appears as shared memory to the users but is physically distributed. Several parallel programming models are described below.

Shared memory

In this model, programs share a common piece of memory, which is accessed by programs asynchronously. A typical example of this model is a global variable, which can be accessed and modified by all the programs. Shared variables can be protected by various mechanisms like locks (see Section 4.5), which help to control the accesses to them [9].

Message passing

A distributed memory model is used in this model. Each processor owns its data in a private memory, and different processors exchange data by sending and receiving messages. Processors need to cooperate with each other; for example, a send must match a receive operation [9]. A comparison of the shared memory model and the message passing model is shown in Table 3.1.

                 | Shared memory               | Message passing
communication    | load and store              | send and receive
memory           | shared or private           | private
synchronization  | required when sharing data  | implicit

Table 3.1. Comparison of Shared Memory Model and Message Passing Model

3.4 Design in parallel

Though parallel computing improves performance, challenges like deadlocks and race conditions exist because of competition for shared resources (processor, memory or devices) and sequential dependencies (data dependency, communication and synchronization). To cope with these challenges, synchronization primitives are used, see Section 4.5. When designing in parallel, attention should be paid to the following aspects.

Decomposition


Inter-communication

If the partitioned tasks are assigned to different processors and they share data, communication must be performed, which introduces both latency and overhead. The cost of communication needs to be considered before decomposition. A process that stalls and waits for an event or some data is called blocking. Otherwise, the process is called non-blocking, and can make progress without suspending while waiting.

Synchronization & data dependency

Synchronization is used to synchronize different tasks and ensure data correctness. Most synchronization operations, like semaphores, are based on memory barriers. A memory barrier makes the memory operations before the barrier visible to all tasks before execution continues. Applications with little data dependency benefit most from a multi-core system, because fewer of the costly synchronizations are needed.

Load balancing

In a multi-core system, the overall performance is limited by the slowest CPU. So it is important for each task to have a similar work load, and to keep the workload on each CPU balanced. Dynamic work assignment can be used to keep the load balanced.

Granularity

Granularity is a quantitative measure of the ratio of computation to communication. According to the amount of work done between communication events, granularity is divided into fine-grain parallelism and coarse-grain parallelism. Fine-grain parallelism consists of a higher number of small tasks, and more communication is needed. It helps to keep the load balanced, at the cost of higher overhead.

3.5 Summary

To achieve better performance on multi-core, one should decompose a serial program into several fragments and execute them concurrently. Parallel computing enables concurrent computation but increases the programming complexity. Different levels of parallelism can be combined to gain better performance.


Chapter 4

Threads

Threads are commonly used to implement parallelism in a multi-core system with shared memory. They are used to increase the throughput and decrease the idle time of processing units. There are several versions of threads; the most commonly used are POSIX threads (also called pthreads), a standardized C-language thread API specified by the IEEE POSIX 1003.1c standard. Linux implements native POSIX threads, and our discussion is limited to pthreads in this chapter [11].

The pthread APIs can be grouped into four categories: thread management, mutex variables, condition variables and synchronization. The features of the pthread APIs make it simpler to guarantee memory coherency and implement synchronization.

4.1 Thread overview

A thread is an independent stream of instructions that can be scheduled to run as such by the operating system [12]. Threads are defined in the program environment and initialized by the compiler; they run simultaneously and independently within a process and are managed by the operating system. Threads are used to improve performance with less overhead and fewer system resources compared with processes. The actual execution unit of a thread is a processor [11].

4.1.1 Process & thread


as a logic flow. There can be several threads executing in parallel within one process, which is called a multi-threaded process. Each thread is independent because it has its own thread context: thread ID, stack pointer, stack, registers and condition word; threads are identified by their thread IDs. A comparison between threads and processes is given in Table 4.1.

Category          | Process                                                                | Thread
Address space     | A process has its own address space protected by the operating system | Threads in the same process share the same address space
Interaction       | Shared locations in the operating system                               | Shared memory within the process
Context switching | Heavy: the entire process state must be preserved                      | Light: only the current register state needs to be saved

Table 4.1. Processes and Threads

Managing a thread requires fewer system resources and less overhead than a process. Multiple threads can overlap CPU work with I/O operations (while one thread waits for I/O, another thread runs on the CPU). On the other hand, because of the threads, shared resources within a process must be synchronized, which increases the difficulty of writing and debugging the program.

4.1.2 Threaded program models

Parallel programming is well suited to threaded programs. Methods of designing parallel programs (like decomposing a serial application into independent small tasks) are given in Section 3.4. Here we discuss threaded program models.

Manager/worker: A manager is a single thread, which assigns work to other threads (workers). The manager takes charge of task assignments. The size of the worker pool can be either static or dynamic.

Pipeline: By breaking a task into smaller pieces, each thread takes care of one piece of code and executes in series. Different threads work concurrently and the task is executed in parallel.


4.2 Thread management

A thread has four states: ready, running, blocked and terminated, which can be changed by the thread management APIs and by the status of shared resources. Threads can be created, and attributes (joinable, scheduling) can be assigned, with these routines.

Creating and terminating threads: There is one default thread per process, while other threads must be created and initialized by the default one. Each thread is named by a unique ID (identifier). The threads become peers after they are created, which means there is neither hierarchy nor dependency between them. Each thread can create a new thread. Once its tasks are done, a thread terminates itself. Threads can also be terminated by other threads or by the main function. Thread attributes can be set by the arguments of the routines [11].

Joining and detaching threads: Joining is a way to synchronize threads. Once worker threads complete their tasks, they are joined with their master thread. Some threads need not be joined, so we detach them to free system resources [11].
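A minimal sketch of the thread-management routines just described, using standard pthread calls; the worker function and its argument are illustrative.

#include <pthread.h>
#include <stdio.h>

/* Illustrative worker: each thread prints its argument and returns. */
static void *worker(void *arg)
{
    printf("worker %ld running\n", (long)arg);
    return NULL;
}

int main(void)
{
    pthread_t tid[4];
    for (long i = 0; i < 4; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);  /* create joinable threads */
    for (int i = 0; i < 4; i++)
        pthread_join(tid[i], NULL);                        /* wait for each to finish */
    return 0;
}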

4.3 Mutual exclusion

Mutex is short for "mutual exclusion", which is used to protect data when multiple write operations occur. A mutex is one way to implement thread synchronization. Once a thread enters a critical section, other threads must wait until it finishes. Even if multiple threads ask for the same mutex, only one of them will succeed. It is useful in parallel execution because data on the critical path can only be modified by one thread at a time, so that races are avoided [11].
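A minimal sketch of how a pthread mutex protects a critical section; the shared counter is illustrative.

#include <pthread.h>

static long counter = 0;                                  /* shared data               */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  /* protects the shared data  */

static void *increment(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* enter the critical section          */
        counter++;                    /* only one thread modifies at a time  */
        pthread_mutex_unlock(&lock);  /* leave the critical section          */
    }
    return NULL;
}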


Figure 4.1. Mutex Structure [14]

4.4 Conditional variables

Conditional synchronization means a thread stays blocked until the system satisfies a condition. In other words, it makes the thread wait until the condition is satisfied, like waiting for a notification. It is also a way to implement synchronization. Condition variables differ from mutexes because conditional synchronization synchronizes threads based on variable values instead of controlling thread access to protected data. What is more, multiple threads may be permitted access to the condition at the same time. The conditions are specified by the programmer [11].

Figure 4.2. Condition Variable [15]

There are three operations on condition variables, all of which are atomic: wait, which releases the lock and blocks the calling thread; signal, which wakes up one thread waiting on the condition and then continues execution, while the lock is still held by the signaling thread; and broadcast, which wakes up all threads in a blocking wait state [11]. While a thread is waiting on a condition variable, it is blocked until the condition is signaled. In Figure 4.2, one thread waits on the condition ready, then wakes up and proceeds, while the other thread signals the condition ready.
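A minimal sketch of the wait/signal pattern described above, using pthread condition variables; the ready flag and function names are illustrative.

#include <pthread.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;
static int data_ready = 0;

/* Waiting thread: blocks until the condition is signaled. */
static void wait_for_data(void)
{
    pthread_mutex_lock(&m);
    while (!data_ready)                  /* guard against spurious wakeups       */
        pthread_cond_wait(&ready, &m);   /* releases m while waiting, reacquires on wakeup */
    pthread_mutex_unlock(&m);
}

/* Signaling thread: sets the condition and wakes one waiter. */
static void publish_data(void)
{
    pthread_mutex_lock(&m);
    data_ready = 1;
    pthread_cond_signal(&ready);         /* or pthread_cond_broadcast(&ready) to wake all */
    pthread_mutex_unlock(&m);
}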

4.5 Synchronization

In a multi-core system, threads share memory and other resources, which requires synchronization to coordinate parallel tasks, including serializing memory instructions, protecting shared resources and waiting for multiple tasks to reach a specified point. Synchronization keeps threads coherent and protects shared memory by constraining relative instruction orderings. However, synchronization can be a major factor in decreasing parallel speedup, because tasks have to wait for each other's completion.

4.5.1 Atomic operations

An atomic operation is a simple way to achieve synchronization by working on data types [13]. Atomic operations are performed with no possibility of interruption; they are visible as either completed or not started, with no intermediate state [16]. Atomic operations are used to implement other synchronization primitives, like semaphores, and do not block competing threads when they access shared data, which makes it possible to achieve better performance [13].
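A minimal sketch of an atomic operation in C, using the gcc compare-and-swap builtin; the counter and function name are illustrative, and this is not Wool's code.

/* Lock-free increment: retry until the compare-and-swap succeeds.
   __sync_val_compare_and_swap writes old+1 only if *p still equals old,
   and returns the value it actually observed, all as one atomic step. */
static void atomic_increment(volatile int *p)
{
    int old;
    do {
        old = *p;
    } while (__sync_val_compare_and_swap(p, old, old + 1) != old);
}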

4.5.2 Hardware primitives

Memory barriers

To achieve better performance, both CPUs and compilers reorder instructions to obtain a fully utilized pipeline. However, after such reordering optimizations, the instructions which access shared memory may be performed out of order, which can lead to incorrect results, especially when there are data dependencies.

A memory barrier is a common way to synchronize threads on multi-core.

It is a non-blocking mechanism, implemented by instructions that ensure memory accesses are performed in the expected order, by forcing the processor to see a load or store positioned before the barrier before it sees the ones after the barrier [13]. In a word, it enforces ordering on the memory operations of one thread and guarantees that all other threads have a consistent view of memory in the system. Memory barriers are always defined and specified by the processor.
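A minimal sketch of the publish pattern that a memory barrier enables, using the gcc full-barrier builtin; the variable and function names are illustrative.

static int payload = 0;
static volatile int flag = 0;

/* Producer: make the payload visible before publishing the flag. */
static void producer(void)
{
    payload = 42;
    __sync_synchronize();   /* full memory barrier: the store to payload is ordered before flag */
    flag = 1;
}

/* Consumer: once flag == 1 is observed, the barrier guarantees payload is visible too. */
static void consumer(void)
{
    while (flag == 0)
        ;                   /* spin until the flag is published */
    __sync_synchronize();
    /* payload can now safely be read as 42 */
}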

4.6 Threads on multi-core

Multiple threads on multi-core differ from those on a single-core system, see Table 4.2. Threads need not wait for resources (like CPUs) on multi-core, because they run independently on their own cores and do not compete for resources such as the floating point unit. There are two main differences between multi-threading on single-core and on multi-core: caching and priority.

Advantages                                   | Disadvantages
Increased performance                        | Data races
Better resource utilization                  | Deadlocks
Efficient data sharing                       | Code complexity
Fewer resources needed for context switches  | Portability issues
Simple communication between tasks           | Testing and debugging difficulty

Table 4.2. Advantages and Disadvantages of Threads on Multi-core

Cache synchronization becomes an issue for threads on multi-core, because two threads modifying shared memory may interfere with each other [13]. Besides, threads with higher priority cannot ignore lower priority ones in a multi-core system, because they can execute in parallel, which may lead the system into an unstable state.

4.7 Summary


Chapter 5

Wool — A low overhead C scheduler

To achieve good performance on multi-core, problems are broken down into independent tasks that execute concurrently. It is of great importance to apply an efficient scheduler to multi-threaded computations to keep the load balanced with low overhead. Work stealing and leap frogging are dedicated to scheduling multi-threaded computations, and help the system achieve good performance by distributing work to underutilized processors with minimal communication overhead between threads.

Wool is a C library applying work stealing and leap frogging, developed by Karl-Filip Faxén at SICS, aiming at improving the performance of multi-core systems by distributing sequential programs over multiple cores. Wool is a low overhead user level task scheduler, providing lightweight tasks on top of pthreads [1] (see Section 3.3). According to the test results shown in his paper [17], the performance of Wool is comparable with that of Cilk, Intel TBB and OpenMP on an eight core system.

5.1 Basic concepts

5.1.1 Worker


5.1.2 Task pool

A task pool is a pool of tasks managed by a worker. As the task pool grows and shrinks dynamically, Wool implements it as a stack (deque) [17]. Newly spawned tasks are placed on top of the pool, and old tasks can be stolen by other workers from the bottom of the queue; more details are given in Section 5.1.5.

5.1.3 Granularity in Wool

As mentioned in Section 3.4, granularity measures the ratio of computation to communication, which reflects the performance of task parallelism. The efficiency of parallelism can be measured in two ways: task granularity

\[ G_t = \frac{T_s}{N_t} \tag{5.1} \]

and load balancing granularity

\[ G_l = \frac{T_s}{N_m} \tag{5.2} \]

where

T_s: sequential execution time (with no task overheads),
N_t: the number of tasks spawned,
N_m: the number of task migrations, in our case the number of steals.

Task granularity represents the average useful work per task. Lower task granularity means higher overhead. Load balancing granularity measures the average useful work per steal [1].

Wool uses a fine grained parallelism strategy, which is good for load balancing. There is a finest grain constant, which defines the smallest sequential program to execute. It is mainly used in the loop functions, where more than one task may be executed on one worker as sequential tasks instead of being executed as parallel tasks, if the cost of one iteration is less than the finest grain constant. By defining this constant, Wool increases the load balancing granularity with lower task granularity.

5.1.4 Load balance in Wool


could distribute work evenly to each processor. In Wool, work stealing steals tasks from the bottom of the task pool, which is located closest to the root. In this way, fewer steals are needed to distribute work evenly, because of the relatively larger parallel regions [18]. A deeper discussion is given in Section 5.2.

5.1.5 Structure

A multi-threaded computation consists of multiple threads, with one thread per processor. A thread (usually running one worker) takes care of a pool of tasks, which is organized as a doubly-ended queue. Two pointers point to the top and the bottom of the queue respectively. Newly created tasks are added to the top of the queue and are ready to execute, so the queue is also called a ready queue. A ready queue is treated as a stack by its own processor, which pushes and pops tasks at the top of the queue (LIFO, last in first out). But it is regarded as a queue by other processors (which attempt to steal work), and tasks are stolen from the bottom of the queue (FIFO, first in first out) [19]. Figure 5.1 shows the work stealing structure.

Tasks are initialized and ready to execute when they are spawned. Newly created tasks are always inserted at the top of the task pool and are called child tasks, while the tasks that create them are called parent tasks. Parent tasks are located closer to the root than child tasks in a given task pool. The model of Wool is shown in Figure 5.1.

5.2 Work stealing

Work stealing is an algorithm for task parallelism and load balancing across processors [20]. Whenever a worker has no tasks to do, it attempts to steal tasks from another worker, the victim. The victim is chosen randomly; if it has no tasks in its own task pool, the steal attempt fails and the thief (the stealing worker) chooses another victim. In Wool, workers choose victims linearly starting from a random first choice of victim [20]. Once a thief steals successfully, it executes the task and returns the result to the victim,

then the steal process ends. In this way, we can keep underutilized processors working instead of idling while other processors are busy working.
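The stealing loop just described can be sketched as follows. This is illustrative C, not Wool's implementation: the Worker and Task types and the deque_steal_bottom helper are simplified stand-ins (the real deque needs atomic operations).

#include <stdlib.h>

typedef struct Task Task;            /* opaque task descriptor                  */
typedef struct {
    Task **bottom, **top;            /* simplified deque bounds                 */
} TaskPool;
typedef struct {
    TaskPool pool;
} Worker;

/* Hypothetical helper: take the oldest task from the bottom of a pool,
   or return NULL if the pool is empty (synchronization omitted). */
static Task *deque_steal_bottom(TaskPool *p)
{
    if (p->bottom == p->top)
        return NULL;
    return *(p->bottom++);
}

/* An idle worker picks a random first victim, then scans the other
   workers linearly, trying to take a task from the bottom of each
   victim's task pool. */
static Task *try_steal(Worker *thief, Worker *workers, int n_workers)
{
    int start = rand() % n_workers;                  /* random starting victim    */
    for (int i = 0; i < n_workers; i++) {
        Worker *victim = &workers[(start + i) % n_workers];
        if (victim == thief)
            continue;                                /* don't steal from yourself */
        Task *t = deque_steal_bottom(&victim->pool);
        if (t != NULL)
            return t;                                /* successful steal          */
    }
    return NULL;                                     /* every steal attempt failed */
}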


Figure 5.1. Work Stealing Structure (each processor runs one worker with a task pool; tasks are pushed and popped at the top and stolen from the bottom)

A worker can be in one of three states.

Working: the worker is working on a task.

Stealing: the worker has no task to do and starts to steal; it keeps stealing until it gets a task to do.

Spin: the worker stalls and waits for the result of a stolen task which has not been finished yet.

The work stealing algorithm starts with only the main task in the ready queue. The main task spawns tasks, and other workers start to work by stealing tasks. The work stealing process is shown in Figure 5.2.

Figure 5.2. Work Stealing (Worker2, the thief, steals Task A from the bottom of Worker1's task pool)

Compared with work sharing

Work stealing is lazy scheduling: a processor does not assign tasks to others until they are idle. It achieves lower overhead than eager scheduling, which keeps assigning tasks to keep the load balanced. A typical eager scheduling is work sharing, which attempts to migrate some of the tasks to other processors as soon as new tasks are generated, in order to distribute the work load [21].

One advantage of a work stealing algorithm is that tasks migrate less frequently than with work sharing. Another advantage is that, by stealing tasks from the bottom of the queue, parent tasks are stolen, which will generate child tasks later. Because parent tasks are close to the root, it is more efficient to steal one parent task than to steal several child tasks. This helps to keep the load balanced and reduces overhead by reducing the number of steals.

5.3 Leap frogging


problem, since the task that the victim steals back is always a task that the victim would have executed if no steals had happened [23]. Stealing work from a random worker instead has a number of drawbacks; for instance, a task pool may grow beyond its size, since stealing adds a new stack on top of the blocked task.

The leap frogging process is shown in Figure 5.3, which shows the next time step after Figure 5.2. After Worker2 steals Task A, Task A spawns two more tasks: Task F and Task G. At the same time, Worker1 completes Task E, Task C and Task B, and WorkerN still works on Task D.

As there is no task left in Worker1, it begins to synchronize with Task A, which has been stolen by Worker2 and is still in execution. Instead of waiting for Task A to complete, Worker1 tries to steal tasks back from the thief and help finish Task A (a parent task can only finish once all its child tasks are done). So Task F is stolen back, because it is at the bottom of the ready queue.

In this way, workers can work in parallel and the load balance is improved. However, if the task which the victim steals back is really small, and no other tasks can be stolen back, the victim will spin and wait for the result to come out, so parallelism is lost in this extreme situation.

Figure 5.3. Leap Frogging (Worker1, the victim, steals Task F back from Worker2, the thief, while the stolen Task A is still in execution)


5.4 Direct task stack algorithm

The Wool scheduler uses macros and inline functions to implement independent task parallelism. The basic operations are spawn and sync, which are like an asynchronous function call. With a spawn, a task is created and put into the ready queue, but it executes only when control reaches the corresponding sync, rather than being executed immediately [17]. The task may be done either by the processor which spawned it, or by a thief which steals it.

5.4.1 Spawn

By spawning a task, the task is initialized and its space is allocated. The task is put into the task pool of the worker that spawns it. The task is neither executed nor returns a value until it reaches a sync. Any task in the task pool can be stolen by other workers. If it has not been stolen when control reaches the corresponding sync, it is executed inline by the same worker that spawned it. Otherwise, the worker gets the result back from the thief [1].

Figure 5.4. Spawn (a newly spawned task is placed on top of the worker's task pool and the top pointer moves up)

The worker will then either synchronize the task or leap frog. A task with a wrapper function can either be stolen by others or inlined by the worker itself. At the end of the process, the top pointer is moved one step up and points to the first blank space, where the next spawned task will be put [19].

5.4.2 Sync

A pair of spawn and sync behaves like a LIFO (last in, first out): a spawn pushes a task onto the task pool, while a sync pops the top task off. When the code reaches a sync, it finds the most recently spawned and unsynchronized task and executes it. By synchronizing, a task is popped from the task pool and handled in one of the following ways, depending on the situation.

Case 1: If the task has not been stolen, the worker itself executes it by a call.

Case 2: If the task has been stolen and finished, the result is returned to the victim.

Case 3: If the task has been stolen but not finished, leap frogging is used.
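To make the spawn/sync programming model concrete, below is a sketch of the classic Fibonacci example in Wool's macro style, based on the description above and on Faxén's paper [17]. The header name and the exact macro and argument conventions may differ between Wool versions, so treat this as illustrative rather than as the definitive API.

#include "wool.h"   /* Wool's header (name assumed) */

/* A task taking one int argument and returning an int. */
TASK_1( int, fib, int, n )
{
    if( n < 2 ) {
        return n;
    } else {
        int a, b;
        SPAWN( fib, n - 1 );      /* push fib(n-1) onto this worker's task pool      */
        b = CALL( fib, n - 2 );   /* run fib(n-2) directly, like an ordinary call    */
        a = SYNC( fib );          /* join fib(n-1): inline it, or leap frog if stolen */
        return a + b;
    }
}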

5.5 Wool optimization

Work stealing and leap frogging reduce the communication overhead and keep the load balanced in a multi-core system. However, as the number of cores increases to 50 or more, the random work stealing algorithm leads to significant overhead, which affects the work stealing efficiency and cannot be ignored. Nonrandom victim selection is proposed to solve this problem. Two advanced algorithms are discussed in this section: sampling victim selection and set based victim selection.

5.5.1 Sampling victim selection

Instead of stealing work randomly from the first victim found by the thief, a thief samples several workers before each steal operation and chooses the task which is closest to the root to steal. In the task tree, the tasks closer to the root are typically larger, so stealing tasks close to the root contributes to load balance with smaller overhead (fewer steal operations take place, because larger tasks are stolen). Sampling victims and selecting among them take extra time, but not as much as completing one steal operation. When the number of cores increases to a larger number, like 128 cores, the performance is much improved [18].


5.5.2 Set based victim selection

When the number of thieves is significant, the contention for task pools is fierce: there are many more steal attempts than successful steals. In this case, each thief only steals from a subset of workers, given by a private random permutation P of the indices of the other workers (these workers are its potential victims) [18]. When a steal starts, the thief picks a random starting point in the permutation and proceeds through it attempting to steal. If there is no work to steal, it starts over from the starting point [18].

5.6 Other multi-threaded scheduler

Cilk

Cilk (http://supertech.csail.mit.edu/cilk/) was designed by the MIT Laboratory for Computer Science and is a general-purpose programming language for multi-threaded parallel computing. Cilk provides simple linguistic extensions to ANSI C, for example the keywords spawn and sync, to implement a work stealing scheduler. Cilk++ is open source, but the compiler (the ICC compiler with Intel Cilk Plus) and tool suite are commercial.

Intel TBB

Intel TBB (Threading Building Blocks, http://threadingbuildingblocks.org/) is an open source library providing task-based parallelism in C++. The TBB library manages threads dynamically and provides templates for common parallel patterns. It is implemented with a work stealing scheduler and can keep the load balanced automatically.

OpenMP

The OpenMP API is dedicated to shared memory parallel programming and is built on top of native threads. A number of compilers implement the OpenMP API, such as gcc and Visual Studio. Supported languages include C/C++ and Fortran. However, it is designed for shared memory systems and does not work for distributed memory systems.


Wool Optimizations & Limitations

Compared with other multi-threaded schedulers, Wool is much simpler and has good performance, at the cost of some limitations; for instance, only independent tasks may be executed in parallel [22]. In addition, task descriptors have a fixed size in the current version of Wool [24, 25].

5.7 Summary

Wool is a macro-based parallelism library in C. It is based on two basic operations, spawn and sync, and focuses on independent fine grained parallelism using work stealing and leap frogging. Its performance is comparable with that of Cilk, TBB and OpenMP. However, it is limited to independent tasks, which makes it less flexible.


Part II

Implementation


Chapter 6

OSE

Enea OSE is short for "Operating System Embedded". It has two types of kernels: the OSE Real Time Kernel for embedded systems, and the OSE Soft Kernel for host computers (like UNIX workstations and PCs). Both kernels can work in single-core and multi-core systems [26]. OSE is a distributed operating system based on message passing. This chapter is based on Enea documents and white papers [26, 27, 28, 29, 30].

6.1 OSE fundamentals

Load module

Modules are applications. They can be included in the core module (the core module is the module that is linked with the kernel at compile time), or they can be built as separately linked load modules and loaded at runtime [27]. Each load module is assigned a piece of memory in the system. It consists of a block of processes and a block pool. Memory is shared by processes within a block. Communication within or across load modules is done by message passing. Each load module is linked to one core dynamically at runtime, and parallelism is realized at the load module level [26]. Figure 6.1 shows the structure of OSE.

Memory pool


Figure 6.1. Processes in Blocks with Pools in Domains [26]

Block

A block is used to group OSE processes together, and acts like a process in that it can be started, stopped and killed. Each block may have its own memory pool. If a pool becomes corrupted, this only affects the block connected to it, without any influence on other blocks. Normally, a new process (child process) is part of the same block as its parent, unless otherwise specified when it is created.

Domain

A domain is a memory region containing programs, formed by grouping one or several pools into a separate "domain". This avoids the danger that arises when a signal contains a pointer into the memory pool of the sender, since the receiving process would have the ability to destroy the pool. By forming a domain, the user can choose to copy the signal buffer from the sender when sending a signal across segment boundaries.

6.2 OSE process & IPC

The OSE kernel is based on message passing between processes. This IPC (inter-process communication) is implemented as a simple API between processes in a distributed single-core or multi-core system [29]. OSE processes are equivalent to POSIX threads with some special features, including an individual stack, a set of specific variables and register values. Processes within a load module share the CPU time allocated by the kernel based on their priorities. So a process is not always running; instead it has three states: ready, waiting and running (almost the same as a thread, see Chapter 4).


not allowed to be killed during the existence of the load module. In contrast, dynamic processes can be created, configured and killed at runtime. Each process is assigned a process identity (ID) for its lifetime.

Figure 6.2. Message Passing

Inter-process communication and synchronization are handled by signals. A signal is used as an acknowledgement message (semaphores, barriers or monitors) or a data-carrying message. Each process is assigned one unique signal queue. Each message carries information about the sender, receiver and owner, which makes it easier to trace and redirect. A message can only be received by the dedicated process it is sent to, and it is up to that process to choose which messages to accept. This scheme is called Direct Asynchronous Message Passing (DAMP). Message passing between cores is supported by OSE using the same message passing API as in the single-core version of OSE [28]. The message passing process is shown in Figure 6.2.
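As a sketch of the signal-based message passing just described, the fragment below uses the core OSE signal calls (alloc, send, receive, free_buf). The signal number, struct layout, function names and header name are illustrative assumptions, and error handling is omitted.

#include "ose.h"          /* OSE system calls (header name may differ per delivery) */

#define DATA_SIG 1001     /* illustrative signal number */

struct data_sig {
    SIGSELECT sig_no;     /* every OSE signal starts with its signal number */
    int       payload;
};

union SIGNAL {
    SIGSELECT       sig_no;
    struct data_sig data;
};

/* Sender process: allocate a signal buffer from the pool and send it. */
void send_data(PROCESS receiver, int value)
{
    union SIGNAL *sig = alloc(sizeof(struct data_sig), DATA_SIG);
    sig->data.payload = value;
    send(&sig, receiver);                 /* ownership of the buffer is transferred */
}

/* Receiver process: wait for DATA_SIG, read it, and free the buffer. */
int receive_data(void)
{
    static const SIGSELECT sel[] = { 1, DATA_SIG };   /* accept only DATA_SIG */
    union SIGNAL *sig = receive((SIGSELECT *)sel);
    int value = sig->data.payload;
    free_buf(&sig);
    return value;
}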

6.3 OSE for multi-core

The OSE multi-core version, called MCE (Multi-core Embedded), is available for PowerPC from OSE 5.4 and later. The architecture of MCE is shown in Figure 6.3. MCE is a hybrid of SMP and AMP (see Section 2.1): it performs as SMP on the application level, with a homogeneous single OS image and a shared memory programming model, and it has AMP characteristics on the kernel level, with multiple schedulers for multiple cores and support for message passing [30].

while running, apart from interrupt or timer interrupt processes. Programs can also be locked to a given execution unit by the designer [26].

6.3.1 OSE load balancing

OSE offers signal interfaces in the program manager to measure load statistics and move programs between execution units. Applications regard OSE as an SMP system, since it handles work load distribution over the cores, but to achieve determinism requirements it is up to the applications to control load balancing. Parallelism is at the load module level [33]. So thread level parallelism (see Section 3.2.2) and dynamic load balancing are desirable in OSE, which is what Wool provides.

6.3.2 Message passing APIs

OSE message passing APIs are supported in MCE to communicate across cores. The message passing model fits multi-core better, because in the shared memory model the inter-communication overhead grows with the number of cores; with a large number of cores, the shared memory model has higher overhead than message passing. Moreover, explicit synchronization is not needed, because it is implicit in message passing. However, message passing may also induce higher latency when parallel programs have many dependencies.

6.4 OSE pthreads

POSIX threads are supported by OSE. The OSE pthread is a pthread emulation layer that implements most of the thread standard in POSIX 9945-1 (1996), which is essentially 1003.1c [27]. OSE implements a subset of the POSIX thread standard, dedicated to simplifying the porting process for POSIX thread based applications. It is recommended to use native OSE processes, which are more efficient than OSE pthreads.

An OSE pthread can be regarded as a prioritized OSE process, except that killing a thread is not allowed (it can be terminated instead). Fast semaphores are used to implement OSE pthread mutexes, so they cannot be used for other purposes. Pthreads in the core module need no special configuration, while pthreads in load modules require that the program is configured in shared mode [27].

6.5 Summary


Chapter 7

P4080 and TILE64

The multi-core platforms Freescale QorIQ P4080, with eight cores, and Tilera TILE64, with 64 cores, are discussed in this chapter with regard to their features and architectures. The Freescale P4080 is the target board for the porting; TILE64 is discussed for comparison.

7.1 Features

7.1.1 Instruction set

Instruction set architecture (ISA) is one of the most important features of a computer architecture related to programming. An instruction set includes a specification of a set of instructions, data types, addressing modes, etc. The operation bits of the instruction set describe the number of bits an instruction takes on a specific CPU; mainstream CPUs are either 32-bit or 64-bit. Instruction sets can also be classified into CISC and RISC. RISC includes the most frequently used instructions, with the same number of bits in every instruction, while CISC emphasizes hardware with complex instructions so that the software code size can be smaller. Several ISAs are compared in Table 7.1.

Architecture   Bits    Instruction set
PowerPC        32/64   RISC
Tile           32      RISC
IA-64          64      EPIC
SPARC          64      RISC
X86-64         64      CISC
X86            32      CISC

Table 7.1. Instruction Set Comparison


Programs built on different instruction sets differ a lot. One CISC-style instruction may need to be implemented by several RISC-style instructions.

7.1.2 Memory consistency

To enhance performance on multi-core systems, large caches are used to reduce the overhead of memory accesses, but this raises a problem: memory operations may be performed out of order, and the data in the same memory location may present different values to different processors [32]. Different computer architectures build different models addressing how memory accesses are ordered, which is called memory consistency.

The memory consistency model is a contract between the shared memory and the programs running on it [33]. Models are mainly divided into sequential consistency and relaxed consistency. Sequential consistency performs the accesses to memory in the same order as the program order of each individual processor. However, sequential consistency is too complicated and causes a lot of overhead. Relaxed consistency means that instruction reorderings are permitted, but synchronization is required when there are data dependencies. In a relaxed consistency model, memory barriers are needed to constrain the memory access order. Most processors define a relaxed memory consistency model.

Out-of-order

Out-of-order execution means that the execution order (the order in which the individual instructions are executed on a given CPU) of a program differs from the program order (the order specified in the code) due to both compiler and CPU implementation optimizations. The perceived order, which is determined by caching, interconnect and memory-system optimization, may in turn differ from the execution order. Out-of-order execution is an optimization that gives better performance, but it may lead to unexpected results. Here is an example of out-of-order execution.

Initial state: x = 0, s = 0

P1                      P2
while (s == 0)          x = 42;
    ;                   /* memory barrier */
print x;                s = 1;


If this piece of code is executed in program order, the result will be 42, for x is printed after the execution of s = 1. If the program is executed or perceived out-of-order, the result may be 0, because P1 may see s = 1 before x = 42. Memory barriers are needed for protection in this case.

Memory consistency models for different microprocessors

Different processors supply different memory consistency models. For instance, X86 and SPARC have strict constraints and do not reorder write operations, while PowerPC and IA64 have weaker constraints and need memory barriers to constrain the execution order and the perceived order. Table 7.2 lists the memory consistency models defined by different processors [32, 36, 38].

Order IA64 PowerPC Sparc X86 TILE

Read After Read? Y Y N N Y

Read After Write? Y Y N N Y

Write After Write? Y Y N N Y

Write After Read? Y Y Y Y Y

Table 7.2. Memory Consistency Models

7.1.3 Memory barrier

Memory barriers (memory fences) are mainly used with weak consistency models to constrain memory access order and ensure data correctness. There are different kinds of barriers, for example the store fence (SFENCE), which constrains the execution order of write operations to be the same as the program order. Generally, the stricter the memory barrier, the more costly it is (the pipeline is forced to drop the prefetched instructions). Developers should only use memory barriers when necessary. Memory barriers are hardware dependent and are defined by processor vendors. Table 7.3 lists some types of memory barriers and their instructions.

             SFENCE              MFENCE
    sparc    membar(StoreStore)  membar(StoreLoad or StoreStore)
    x86_64   sfence              mfence
    powerpc  lwsync              msync
    TILE     MF                  MF

Table 7.3. Memory barrier instructions
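As an illustration of how such hardware-dependent instructions end up in C code, the entries of Table 7.3 could be wrapped in preprocessor macros using gcc inline assembly. The sketch below is an assumption for illustration only; the macro names SFENCE and MFENCE mirror the table headers and are not taken from any particular library.

    /* Hypothetical barrier macros, selected per architecture at compile time.
     * The "memory" clobber keeps the compiler from reordering memory
     * accesses across the barrier. */
    #if defined(__x86_64__)
      #define SFENCE() __asm__ volatile ("sfence" ::: "memory")
      #define MFENCE() __asm__ volatile ("mfence" ::: "memory")
    #elif defined(__sparc__)
      #define SFENCE() __asm__ volatile ("membar #StoreStore" ::: "memory")
      #define MFENCE() __asm__ volatile ("membar #StoreLoad | #StoreStore" ::: "memory")
    #endif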


7.2 Freescale QorIQ P4080 platform

7.2.1 P4080 architecture

The Freescale P4080 QorIQ integrated multi-core communication processor is based on eight Power Architecture processor cores, the e500mc. Power Architecture is a broad term for a family of similar RISC instruction sets developed and manufactured by companies such as IBM, Freescale, AMCC, Tundra and P.A. Semi.

The P4080 is a high-performance networking platform, which can be used in routers, switches, base station controllers, and general-purpose embedded computing systems. Compared with multiple discrete devices, it offers better performance and simplifies the board design [35].

Figure 7.1. P4080.

7.2.2 e500mc core

The P4080 is based on eight e500mc cores and supports both symmetric and asymmetric modes. Each core is a 32-bit low-power processor based on the Power Architecture technology dedicated to embedded systems. The e500mc is a super-scalar dual-issue processor (two instructions per clock cycle) that supports out-of-order execution and in-order completion. With a seven-stage pipeline, e500mc cores are able to perform more instructions per clock [35].


Memory consistency model

The Power ISA provides a weakly consistent memory access model to create opportunities for processors to reschedule memory transactions [36]:

1. Read after read may be reordered unless caching-inhibited storage is in use.

2. Write after write may be reordered.

3. Read after write and write after read can only be ordered by msync.

Memory barrier primitives

msync: This instruction ensures that all reads and writes preceding msync have completed before subsequent instructions.

lwarx and stwcx: These instructions are used to perform a read-modify-write operation to memory. They ensure that only one processor modifies the memory location between the execution of the lwarx instruction and the stwcx instruction. With the combination of lwarx and stwcx, operations like prefetch and compare-and-exchange can be implemented.

7.3 Tilera's TILE64

7.3.1 TILE64 architecture

TILE64 is a multi-core processor with a mesh network of 64 cores (each core is called a TILE). It can be classified as a Massively Parallel Processor Array (MPPA). TILE64 delivers scalable performance, power efficiency, and low processing latency [38].

7.3.2 TILE

A TILE core has a 32-bit RISC instruction set with a three-way VLIW (Very Long Instruction Word) pipeline for instruction-level parallelism.

Memory consistency model

There are two properties of the TILE memory consistency model: instruction reordering rules and store atomicity [38]. The model is a relaxed memory model (see Section 7.1.2). For a given processor, memory accesses, as well as their order as visible to other processors, may be reordered, except in the following cases [38]:

1. Data dependencies within a single processor are enforced, including read after write, write after write and write after read.

2. The local visible order is determined by data dependencies through registers or memory.

3. The global visible order cannot be determined from the local visible order.

Figure 7.2. Tilera's TILE64 Architecture [38]

Memory barrier primitives

TILE processors define a relaxed consistency model, which needs memory barriers to make memory updates visible and to guarantee that inter-network communications occur in order [37]. The TILE64 processor provides memory fence instructions and global barrier syncs.

MF: The memory fence (MF) instruction is provided to ensure that memory operations preceding the fence are completed before subsequent memory operations are performed.


Chapter 8

Porting Wool to P4080

In the design of porting Wool to the Freescale P4080 platform with the Linux operating system, the idea is to walk through the source code and modify the code that uses hardware primitives. As Wool is a C library built on top of the operating system, no parts other than the hardware-dependent code have to be changed. The operating system running on the P4080 is Enea Linux, which supports all the libraries used in Wool.

8.1 Design

The hardware synchronization primitives applied in Wool are mainly memory-related code, like memory fences, atomic operations and memory allocation functions, which have been discussed in Chapters 4 and 7. The objective is to add the hardware synchronization primitives defined by the e500mc processor to Wool's source code, written in gcc inline assembly format.

Freescale introduced the e500mc processor, based on Power ISA v.2.06, in the QorIQ family of chips. The e500mc is a 32-bit processor supporting shared storage between processors [39]. Its predominant memory model is weakly consistent, which allows the reordering of code and provides an opportunity to improve performance over a stronger consistency model. In this case, it is up to the programmer to take care of the ordering and synchronization instructions when shared storage is used among multiple cores.

8.1.1 Storage access ordering

The different memory fences used in Wool should be defined in the header file. SFENCE is short for store fence, which ensures that all write operations preceding the barrier are committed before the subsequent write operations are issued. MFENCE is short for memory fence, which controls all memory accesses (both read and write operations); it ensures that memory accesses before the barrier are committed before the subsequent memory accesses are initiated. A memory fence induces more overhead than a load fence or a store fence, because the CPU discards all pre-fetched instructions and empties the pipeline. The e500mc core defines the following operations, which are implemented in Wool; a sketch of possible macro definitions follows the list.

sync: provides full sequential consistency, so it is used as MFENCE in Wool; it is strict with respect to any instruction ordering.

msync: ensures that all instructions preceding msync have completed before msync completes. It also orders data accesses across all storage classes, so it can also be used as MFENCE; one can regard it as an alternative to sync.

lwsync: is a low-overhead memory fence, which orders the following accesses: read after read, write after write and write after read. It is used as SFENCE in Wool.

isync: causes the prefetched instructions to be discarded by the core, which ensures that the instructions preceding isync are fetched and executed before the subsequent instructions. However, isync does not order data accesses.
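As mentioned above, the following is a minimal sketch of how these e500mc instructions might be mapped to the fence macros in a header file, written as gcc inline assembly. The exact definitions are an assumption for illustration, not a copy of Wool's source.

    /* Hypothetical e500mc fence macros; the instruction choice follows the
     * list above, the macro bodies are illustrative. The "memory" clobber
     * prevents the compiler from reordering memory accesses across them. */
    #define SFENCE() __asm__ volatile ("lwsync" ::: "memory")  /* store fence */
    #define MFENCE() __asm__ volatile ("sync"   ::: "memory")  /* full fence  */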

8.1.2 Atomic update primitives

The atomic update primitives used in Wool are the prefetch and CAS operations, which are based on lwarx and stwcx. lwarx creates a reservation, and stwcx stores a word on the condition that a reservation created by lwarx on the same storage location still exists. The reservation is lost if another processor modifies the same storage location before stwcx executes; in that case a new reservation must be established to perform the atomic operation.

8.2 Implementation

8.2.1 Prefetch & CAS

The prefetch primitive loads and replaces a word in storage atomically [39]. In the following assembly code, the processor stores the new value in r4 to the storage location addressed by r3, and the old value from that location is returned in r5. The key property of this operation is that it updates a memory value atomically [16].

    loop:
        lwarx   r5,0,r3     # load the old value into r5 from the address in r3 and reserve
        stwcx.  r4,0,r3     # store the new value from r4 if still reserved
        bne-    loop        # loop if the reservation was lost

The Compare and Swap (CAS) primitive compares a value in a register with a word in storage [39]. It loads the old value from the location addressed by r3 into r6 and compares it with the expected value in r4. If they are equal, the contents of the storage location addressed by r3 are replaced by the new value in r5. The old value is returned in any case. This is a typical atomic CPU operation used to achieve synchronization: it prevents two threads from updating the value at the same time, because the second one will fail and must re-compute [16].

    loop:
        lwarx   r6,0,r3     # load the old value and reserve
        cmpw    r4,r6       # does r4 equal r6?
        bne-    exit        # skip the store if not
        stwcx.  r5,0,r3     # store the new value if still reserved
        bne-    loop        # loop if the reservation was lost
    exit:
        mr      r4,r6       # return the value from storage

8.2.2 Gcc inline assembler code

The hardware primitives (assembly code) described in the design section should be embedded into Wool's source code, which is written in C. Therefore the assembly code is written as gcc inline assembler, which is supported by the gcc compiler. Gcc inline assembler must follow its basic rules. The basic format is:

    asm("instructions" : outputs : inputs : registers-modified);

The first parameter contains the instructions, separated by ";". The second and third parameters can be either a storage location or a register, and may be left empty when no input or output is used. The assembly instructions used in Wool operate on memory, so the last parameter should include "memory" to declare that memory has been changed.

8.2.3 Enea Linux

The operating system used is Enea Linux, which is powered by the Yocto Project open source configuration. The Yocto Project provides standard tools and ensures quick access to the latest Board Support Packages (BSPs) for the most common hardware architectures [40]. The architecture of Enea Linux is shown in Figure 8.1. The operating system can assign tasks to different cores automatically.

References
