Theoretical Aspects on Performance Bounds and Fault Tolerance in Parallel Computing


Blekinge Institute of Technology

Doctoral Dissertation Series No. 2007:18

School of Engineering

THEORETICAL ASPECTS ON

PERFORMANCE BOUNDS AND FAULT

TOLERANCE IN PARALLEL COMPUTING

Kamilla Klonowska


ISSN 1653-2090

This thesis consists of two parts: performance bounds for scheduling algorithms for parallel programs in multiprocessor systems, and recovery schemes for fault tolerant distributed systems when one or more computers go down.

In the first part we deliver tight bounds on the ratio for the minimal completion time of a parallel program executed in a parallel system in two scenarios. Scenario one: the ratio for minimal completion time when processes can be reallocated compared to when they cannot be reallocated to other processors during their execution time. Scenario two: when a schedule is preemptive, the ratio for the minimal completion time when we use two different numbers of preemptions.

The second part discusses the problem of redistribution of the load among running computers in a parallel system. The goal is to find a redistribution scheme that maintains high performance even when one or more computers go down. Here we deliver four different redistribution algorithms.

In both parts we use theoretical techniques that lead to explicit worst-case programs and scenarios. The correctness is based on mathematical proofs.

ABSTRACT


Performance Bounds and Fault Tolerance

in Parallel Computing


Theoretical Aspects on

Performance Bounds and Fault Tolerance in

Parallel Computing

Kamilla Klonowska

ISSN 1653-2090

ISBN 978-91-7295-126-6

Department of Systems and Software Engineering

School of Engineering

Blekinge Institute of Technology

SWEDEN


© 2007 Kamilla Klonowska

Department of Systems and Software Engineering
School of Engineering

Publisher: Blekinge Institute of Technology
Printed by Printfabriken, Karlskrona, Sweden 2007
ISBN 978-91-7295-126-6


This thesis has been submitted to the Faculty of Technology at Blekinge Institute of Technology in partial fulfilment of the requirements for the Degree of Doctor of Philosophy in Computer Systems Engineering.

Contact information:

Kamilla Klonowska

Department of Systems and Software Engineering
School of Engineering, Blekinge Institute of Technology
Box 520
SE-372 25 Ronneby, Sweden


This thesis consists of two parts: performance bounds for scheduling algorithms for parallel programs in multiprocessor systems, and recovery schemes for fault tolerant distributed systems when one or more computers go down.

In the first part we deliver tight bounds on the ratio for the minimal completion time of a parallel program executed in a parallel system in two scenarios. Scenario one: the ratio for minimal completion time when processes can be reallocated compared to when they cannot be reallocated to other processors during their execution time. Scenario two: when a schedule is preemptive, the ratio for the minimal completion time when we use two different numbers of preemptions.

The second part discusses the problem of redistribution of the load among running computers in a parallel system. The goal is to find a redistribution scheme that maintains high performance even when one or more computers go down. Here we deliver four different redistribution algorithms.

In both parts we use theoretical techniques that lead to explicit worst-case programs and scenarios. The correctness is based on mathematical proofs.


This Ph.D. thesis may never have been realized without my supervisors, colleagues, friends and family members, near and far. Thank you.

I especially thank my supervisors, Professor Lars Lundberg and Docent Håkan Lennerstad, for their contributions to all the papers, comments, ideas, and encouragement during this work; Lars for his patience and Håkan for many dialogues and discussions about life and mathematics.

I also thank the members of my research group: Charlie Svahnberg for answering all my questions and his contributions to the Golomb-series papers, Dawit Mengistu for a lot of discussions about the complexity of life and scheduling, Simon Kågström for helping me understand Assembly and other strange things, Peter Tröger for deep discussions about fault, error and failure, and for keeping my family busy during the last, most stressful weekends, Mia Persson for discussions about NP-complexity, Göran Fries for being my room-neighbour throughout my Ph.D. studies, Bengt Aspvall, Håkan Grahn and Magnus Broberg, an ex-member, for his contribution to the first paper.

I would also like to thank my colleagues at the department, especially Linda Ramstedt, Johanna Törnkvist, Lawrence Henesey, Jenny Lundberg, Nina Dzamashvili-Fogelström, Guohua Bai, Maddeleine Pettersson, Monica Nilsson, May-Louise Andersson, Petra Nilsson.

Finally, I want to express my gratitude to my family for their support and encouragement from afar during hard times. Special thanks to Jacek and Wiktor, my Mother, Ciocia Wera, Basia and Ciocia Lodzia, Ciocia Eleonora, Iwonka and Krzysztof for being with me during that time. And my friends: Radek and Gosia Szymanek, Ewa and Jonta Andersson, Jelena Zaicenoka, Magdalena Urbanska, Viola and Tomek Murawscy, Karina Stachowiak and Rafal Lacny.


Papers included in this thesis:

I “Comparing the Optimal Performance of Parallel Architectures”

Kamilla Klonowska, Lars Lundberg, Håkan Lennerstad, Magnus Broberg

The Computer Journal, Vol. 47, No. 5, 2004

II “The Maximum Gain of Increasing the Number of Preemptions in Multiprocessor Scheduling”

Kamilla Klonowska, Lars Lundberg, Håkan Lennerstad

submitted for publication

III “Using Golomb Rulers for Optimal Recovery Schemes in Fault Tolerant Distributed Computing”

Kamilla Klonowska, Lars Lundberg, Håkan Lennerstad

Proceedings of the 17th International Parallel & Distributed Processing Symposium IPDPS 2003, Nice, France, April 2003

IV “Using Modulo Rulers for Optimal Recovery Schemes in Distributed Computing”

Kamilla Klonowska, Lars Lundberg, Håkan Lennerstad, Charlie Svahnberg

Proceedings of the 10th International Symposium PRDC 2004, Papeete, Tahiti, French Polynesia, March 2004

V “Extended Golomb Rulers as the New Recovery Schemes in Distributed Dependable Computing”

Kamilla Klonowska, Lars Lundberg, Håkan Lennerstad, Charlie Svahnberg

Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05), Denver, Colorado, April 2005

VI “Optimal Recovery Schemes in Fault Tolerant Distributed Computing”

Kamilla Klonowska, Lars Lundberg, Håkan Lennerstad, Charlie Svahnberg

Acta Informatica, 41(6), 2005

Publications that are related but not included in this thesis:

VII “Using Optimal Golomb Rulers for Minimizing Collisions in Closed Hashing”

Lars Lundberg, Håkan Lennerstad, Kamilla Klonowska, Göran Gustafsson

Proceedings of Advances in Computer Science - ASIAN 2004, Higher-Level Decision Making, 9th Asian Computing Science Conference, Thailand, December 2004; Lecture Notes in Computer Science, 3321, Springer 2004, ISBN 3-540-24087-X


VIII “Bounding the Minimal Completion Time in High Performance Parallel Processing”

Lars Lundberg, Magnus Broberg, Kamilla Klonowska

International Journal of High Performance Computing and Networking, Vol. 2, No. 1, 2004

IX “Comparing the Optimal Performance of Multiprocessor Architectures”

Lars Lundberg, Kamilla Klonowska, Magnus Broberg, Håkan Lennerstad

Proceedings of the twenty-first IASTED International Multi-Conference Applied Informatics AI 2003, Innsbruck, Austria, February 2003

X “Recovery Schemes for High Availability and High Performance Distributed Real-Time Computing”

Lars Lundberg, Daniel Häggander, Kamilla Klonowska, Charlie Svahnberg

Proceedings of the 17th International Parallel & Distributed Processing Symposium IPDPS 2003, Nice, France, April 2003

XI “Evaluating Heuristic Scheduling Algorithms for High Performance Parallel Processing”

Lars Lundberg, Magnus Broberg, Kamilla Klonowska

Proceedings of the fifth International Symposium on High Performance Computing ISHPC-V 2003, Tokyo, Japan, October 2003

XII “A Method for Bounding the Minimal Completion Time in Multiprocessors”

Magnus Broberg, Lars Lundberg, Kamilla Klonowska

Technical Report, Blekinge Institute of Technology, 2002

XIII “Optimal Performance Comparisons of Massively Parallel Multiprocessors”

Håkan Lennerstad, Lars Lundberg, Kamilla Klonowska


Theoretical Aspects on Performance Bounds and Fault Tolerance

in Parallel Computing

1. Introduction ... 3

1.1 Research Questions ... 5
1.2 Research Methodology ... 6
1.3 Research Contribution ... 6

2. Multiprocessor Scheduling (Part I) ... 8

2.1 Classification of scheduling problems ... 8

2.2 Bounds and Complexity on Multiprocessor Scheduling ... 10

3. Load Balancing and Fault Tolerance (Part II) ... 12

3.1 Fault Model ... 12

3.2 Reliability vs. Availability ... 13

4. Summary of the papers ... 15

4.1 Part I ... 15

4.2 Part II ... 17

5. Future Work ... 21

6. References ... 22

Paper Section

Paper I

Comparing the Optimal Performance of Parallel Architectures 29

Paper II

The Maximum Gain of Increasing the Number of Preemptions

in Multiprocessor Scheduling... 67

Paper III

Using Golomb Rulers for Optimal Recovery Schemes in Fault

Tolerant Distributed Computing ... 87

Paper IV

Using Modulo Rulers for Optimal Recovery Schemes in

Distributed Computing ... 105

Paper V

Extended Golomb Rulers as the New Recovery Schemes in

Distributed Dependable Computing ... 125

Paper VI

Optimal Recovery Schemes in Fault Tolerant Distributed

Computing ... 143


Theoretical Aspects on Performance Bounds and

Fault Tolerance in Parallel Computing

1 Introduction

If one single processor does not give the required performance, one improvement alternative may be to execute the application on several processors that work in parallel. To increase the performance in parallel computing, we would like to spread the work between the processors (computers) as evenly as possible. This technique is called load balancing. There are many ways to implement parallel computing with multiple processors. One of them is a loosely coupled parallel system consisting of a number of stand-alone computers, connected by a network, where a process can only be executed by the processor on which it was started. This allocation of processes to processors is called static allocation.

A loosely coupled parallel system is very attractive due to its low cost and the potential improvement in availability and fault-tolerance. If one computer fails, the work on the failed computer can be taken over by another computer. However, one challenging problem in loosely coupled systems is to schedule processes among processors to achieve some performance goals, such as minimizing communication delays and completion time. The completion time of a program is also called the makespan.

Another parallel system is a tightly coupled Symmetric MultiProcessor system (SMP), consisting of multiple similar processors within the same computer, interconnected by a bus or some other fast interconnection network. Here a process may be executed by different processors during different time periods. This allocation of processes to processors is called dynamic allocation. An SMP system offers high performance and efficient load balancing, but does not offer availability. If one processor fails the entire application will usually fail.

One is often interested in improving the performance by reducing the completion time of a parallel program consisting of a number of synchronizing processes (static allocation). There is a conflict between the execution of the tasks and the communication between them. One extreme example is a parallel program that executes on only one processor. This program is not affected by communication/synchronization overhead, but it suffers from serious load imbalance. On the other hand, if the same parallel program is executed on many processors, then the load may be evenly spread among the processors, but the communication cost can be very high.

Finding a scheduling algorithm that minimizes the completion time for a parallel program consisting of a number of processes is one of the classical computer science problems, and it has been shown to be NP-hard (Garey et al., [21]). A number of good heuristic methods have been suggested, but it is difficult to know when to stop the heuristic search for better schedules. Therefore it is important to know the optimal bounds, to figure out how close to or far from the optimal results the algorithms are.


An important question is how much performance one can gain by allowing dynamic allocation, provided that we are able to find (almost) optimal scheduling and allocation algorithms. The answer to this question would provide important input when we want to balance the additional cost and complexity of allowing dynamic allocation against the performance loss of restricting ourselves to static allocation. In Paper I we define a function that answers this question for a very wide range of multiprocessors and parallel programs.

Another possibility to increase the performance is to allow preemptions. In a preemptive schedule, a process can be interrupted by other processes and then resumed on the same or on another processor. Scheduling with preemptions is more flexible, but it is not always possible, and preemptions can be costly due to overhead for context switching. A preemption can be made at any point in time. Here, the state must be saved before a process is preempted and then restored when it resumes. This means that there is a trade-off: on the one hand one needs preemptions in order to obtain a balanced load, but on the other hand one would like to limit the number of preemptions in order to minimize the overhead costs. The optimal solution to this trade-off problem depends on a number of parameters, such as the cost for preempting and then later restarting a process. One crucial piece of information for making a well informed trade-off decision is the possible performance gain if the number of preemptions is increased, assuming that there are no overhead costs. In Paper II we present a tight upper bound on the maximal gain of increasing the number of preemptions for any parallel program. This means that we compare the minimal makespan for extremal (worst-case) parallel programs consisting of a set of independent jobs on a multiprocessor when allowing different numbers of preemptions.

In fault tolerant parallel systems, like clusters, the availability is obtained by failover techniques. In its simplest form cluster availability is obtained by having two computers, one active and one stand-by. If the primary computer fails, the secondary simply takes over the work. In order to obtain higher availability, one may want to use more than just two computers [Pfister, 54]. However, it can be very costly to build a large cluster system with many stand-by computers. It is often more attractive to fail over to the computers that are already active in the system, but it is difficult to decide on which computer the work on the failing computer should be executed. This is particularly challenging if this decision has to be made statically before the program starts executing. The goal here is to find a redistribution scheme that maintains high performance even when one or more computers go down. In Papers III - VI we present four different redistribution schemes.

This thesis is divided in two parts. Fig. 1 presents an overview of the thesis as two sets of papers corresponding to Part I and Part II and their intersection of common concepts. In both parts we use theoretical techniques that lead to explicit worst-case programs and scenarios. The correctness is based on mathematical proofs.

In Part I we present performance bounds on the ratio for minimal completion time of a parallel program executed in two scenarios. Scenario one: the ratio for minimal completion time when processes can be reallocated compared to when they cannot be reallocated to other processors during their execution time (Paper I). Scenario two:


when a schedule is preemptive, the ratio for minimal completion time when we use two different numbers of preemptions (Paper II). In Section 2 the classification of multiprocessor scheduling and related work on some bounds and complexity are presented. In Part II we discuss the problem of redistribution of the load among the running computers in a parallel system if one or more computers in the system go down. Here we deliver four different redistribution algorithms (Papers III, IV, V and VI). In Section 3 the classification of faults and related work on fault tolerant systems are presented.

1.1 Research Questions

In this thesis three main questions are in focus.

Part I: Performance Bounds in Parallel Computing

1. Consider two parallel architectures: a Symmetric MultiProcessor (SMP) with dynamic allocation and a distributed system (or cluster) with static allocation. How much shorter can the completion time of a parallel program be on the SMP compared to the distributed system provided that we use optimal schedules?

2. Consider a multiprocessor with identical processors, a parallel program consisting of independent processes, and an optimal schedule with preemptions. If the number of preemptions is increased, how large can the gain in completion time of a parallel program be?

Part II: Load balancing in Parallel Computing

3. Consider a cluster with a number of computers. Consider a worst-case crash scenario, i.e. a scenario where the most unfavorable combination of computers goes down. How can the load be evenly and efficiently redistributed among the running computers (using static redistribution)?

[Fig. 1. Overview of the thesis. Part I - Performance: Papers I, II (and VIII, IX, XI, XII, XIII). Part II - Fault Tolerance: Papers III, IV, V, VI (and VII, X). Shared concepts: worst-case mathematics, optimal bounds, algorithms, multiprocessors, the Stern-Brocot tree and Golomb rulers.]


1.2 Research Methodology

Since we are interested in worst case/extreme case scenarios in infinite sets, we cannot use empirical methods. To find the answers to the research questions, theoretical techniques that lead to explicit worst-case programs are used. The correctness is based on mathematical proofs (Papers I and II). Furthermore, a number of tests on multiprocessor and distributed Sun/Solaris environments have been done to validate and illustrate some of the results in Paper I. In Paper II we take advantage of a number theoretic formula based on the Stern-Brocot tree. Papers III and IV are also based on number theory. In Paper III we present schemes that are based on so called Golomb rulers. Golomb rulers have previously been used in rather different contexts, e.g. radio astronomy (placement of antennas), X-ray crystallography, data encryption, geographical mapping and in hash-tables [Lundberg et al., 46]. In Paper IV we extend the results from Paper III by adding a new type of ruler, a “modulo ruler”, giving a result that is valid in more cases than Paper III. In Paper V we again extend the Golomb rulers to construct a new recovery scheme (trapezium) which gives better performance. In Paper VI we exhaust the cases by calculating the best possible recovery schemes for any number of crashed computers by a branch-and-bound algorithm.
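The Golomb ruler property itself is easy to state: a set of integer marks is a Golomb ruler if all pairwise differences between marks are distinct. A minimal sketch of this property check (an illustration of the general concept only, not of the recovery schemes of Papers III - VI):

```python
from itertools import combinations

def is_golomb_ruler(marks):
    """Check that all pairwise differences between marks are distinct."""
    diffs = [b - a for a, b in combinations(sorted(marks), 2)]
    return len(diffs) == len(set(diffs))

# The classic perfect ruler {0, 1, 4, 6} measures every distance 1..6
# exactly once, so it is a Golomb ruler.
print(is_golomb_ruler([0, 1, 4, 6]))   # True
print(is_golomb_ruler([0, 1, 2, 4]))   # False: 2 - 1 == 1 - 0
```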

1.3 Research Contribution

The main contributions in this thesis are:

Part I: The optimal bounds:

• an optimal bound on the minimal completion time for all parallel programs with n processes and a synchronization granularity z executed in the multiprocessor system with static allocation with k processors and a communication delay t, compared to the system with dynamic allocation with q processors (see Paper I and Section 4.1.1):

H(n, k, q, t, z) = min_A H(A, n, k, q, t, z).

Here A denotes an allocation of processes to processors, where a_j processes are allocated to the j:th computer. Hence, the allocation is given by the allocation sequence (a_1, ..., a_k), where a_1 + ... + a_k = n.

Furthermore, H(A, n, k, q, t, z) = g(A, n, k, q) + z · r(A, n, k, t), where

g(A, n, k, q) = (1/⌈n/q⌉) Σ_I max(i_1, ..., i_k) · C(a_1, i_1) ··· C(a_k, i_k) / C(n, q),

with C(a, i) denoting a binomial coefficient, and

r(A, n, k, t) = ((n(n−1) − Σ_{i=1}^{k} a_i(a_i−1)) / (n(n−1))) · t.


The sum is taken over all decreasing sequences I = (i_1, ..., i_k) of nonnegative integers such that i_1 + ... + i_k = q.

• an optimal bound on the minimal completion time of a parallel program executed in the multiprocessor system with m processors, comparing the schedules with two different numbers of preemptions, i and j, where i < j (see Paper II and Section 4.1.2):

G(m, i, j) = 2 · min(m, i + j + 1) / (min(m, i + 1) + min(m, i + j + 1)).

Here the notation comes from the following definition:

Def: (m/n)_c = max min(m_1/n_1, ..., m_c/n_c), where the maximum is taken over all sets of integers m_1, ..., m_c and n_1, ..., n_c so that m_1 + ... + m_c = m, n_1 + ... + n_c = n, m_k > 0 and n_k ≥ 0 for all k (Lennerstad and Lundberg, [40]).
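For small arguments, this quantity — a maximum over all ways of splitting m and n into c integer parts of the minimum ratio m_k/n_k — can be evaluated by brute force. A sketch under that reading of the definition; the function and helper names are hypothetical:

```python
def partitions(total, parts, positive):
    """Yield all ordered integer tuples of length `parts` summing to
    `total`; entries must be > 0 if positive, else >= 0."""
    lo = 1 if positive else 0
    if parts == 1:
        if total >= lo:
            yield (total,)
        return
    for first in range(lo, total + 1):
        for rest in partitions(total - first, parts - 1, positive):
            yield (first,) + rest

def max_min_ratio(m, n, c):
    """Brute-force max over (m_1..m_c), (n_1..n_c) of min_k m_k/n_k,
    with sum m_k = m (m_k > 0) and sum n_k = n (n_k >= 0)."""
    best = 0.0
    for ms in partitions(m, c, True):
        for ns in partitions(n, c, False):
            ratio = min(mk / nk if nk > 0 else float('inf')
                        for mk, nk in zip(ms, ns))
            best = max(best, ratio)
    return best

print(max_min_ratio(4, 2, 2))  # 2.0, attained by m=(2,2), n=(1,1)
```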

Part II: The algorithms:

• Golomb Recovery Scheme (see Paper III and Section 4.2.1)
• Modulo Recovery Scheme (see Paper IV and Section 4.2.2)
• Trapezium Recovery Scheme (see Paper V and Section 4.2.4)
• Optimal Recovery Scheme (see Paper VI and Section 4.2.3)


2 Multiprocessor Scheduling (Part I)

In this section we introduce a classification of scheduling problems and present related work with bounds and complexity for some of them.

A very good taxonomy of task scheduling in distributed computing systems is presented by Casavant and Kuhl in [8] (see Fig. 2). Here, static means off-line, i.e. the decision for a process is made before it is executed, while dynamic means on-line, i.e. it is unknown when and where the process will execute during its lifetime. In Papers I and II we consider global optimal static scheduling, where we know the information about the parallel program in advance.

2.1 Classification of scheduling problems

We start with a definition of scheduling developed by Graham, Lawler, Lenstra and Rinnooy Kan [28]:

Consider m machines Mi (i = 1,...,m) that have to process n jobs Jj (j = 1,...,n).

“A schedule is an allocation of one or more time intervals on one or more machines to each job. A schedule is feasible if no two time intervals on the same machine overlap, if no two time intervals allocated to the same job overlap, and if, in addition, it meets a number of specific requirements concerning the machine environment and the job characteristics. A schedule is optimal if it minimizes a given optimality criterion.”
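The two overlap conditions in this definition translate directly into a check over a schedule's time intervals. A minimal sketch, where the tuple representation of a schedule is our own assumption (the quoted definition's additional machine- and job-specific requirements are not modeled):

```python
def feasible(intervals):
    """intervals: list of (machine, job, start, end) tuples.
    Feasible means: no two intervals on the same machine overlap,
    and no two intervals allocated to the same job overlap."""
    def overlaps(a, b):
        return a[2] < b[3] and b[2] < a[3]  # strict interval overlap
    for i, a in enumerate(intervals):
        for b in intervals[i + 1:]:
            if (a[0] == b[0] or a[1] == b[1]) and overlaps(a, b):
                return False
    return True

# Job 1 preempted on machine 1 and resumed on machine 2 - feasible:
print(feasible([(1, 1, 0, 2), (2, 1, 2, 4), (1, 2, 2, 4)]))  # True
# Job 1 running on two machines at the same time - infeasible:
print(feasible([(1, 1, 0, 2), (2, 1, 1, 3)]))                # False
```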

Furthermore, the authors (Graham et al. [28]) proposed a classification of scheduling problems, which has been widely used in the literature, e.g. Lawler et al. [38], Blazewicz et al. [4], Pinedo [55]. The classification shows how many scheduling problems there can be if we look at all combinations. This thesis is limited to one type of machines and one criterion, i.e. the minimal completion time.

[Fig. 2. Task Scheduling Characteristics (Casavant and Kuhl, [8]): a taxonomy tree dividing scheduling into local (single processor) and global (multiprocessors); global into static and dynamic; static into optimal (enumerative, graph theory, mathematical programming, queuing theory) and suboptimal (approximate, heuristic); dynamic into physically distributed (cooperative, non-cooperative) and physically non-distributed.]


A classification of scheduling problems is presented as a three-field classification α | β | γ, where α represents the processor environment, β the job characteristics and γ the optimality criterion, as follows.

The first field consists of two parameters: α = α₁α₂, where α₁ ∈ {∘, P, Q, R, O, F, J}. Here ∘ (denotes an empty symbol) represents a single processor; P: identical processors; Q: uniform processors, i.e. processors with a given speed; R: unrelated processors, i.e. processors with job-dependent speeds; O: dedicated processors: open shop sequencing, in which each job has to be processed again on each one of the m processors and there are no restrictions regarding the order of each job; F: dedicated processors: flow shop sequencing, in which each job has to be processed on each one of the m processors in a special order, i.e. first on processor one, then on processor two, and so on; after completion on one processor, a job joins the queue at the next processor; J: dedicated processors: job shop sequencing, where each job has its own order to follow; and α₂: the number of processors (Pinedo, [55]).

In Paper I we describe a scenario of identical machines with different numbers of processors, i.e. a set Pk for the parallel architecture with k identical processors and static allocation, and a set Pq with q identical processors and dynamic allocation. Then α₁ = P and α₂ = k (respectively q). In Paper II we describe a scenario of identical machines with m processors, so also here α₁ = P.

The second field β describes the job characteristics. Here we describe only some of the possible values: β ⊆ {∘, pmtn, prec, tree, res, rj, pj = 1}, where ∘ means that there are no restrictions on the jobs; pmtn means that preemptions are allowed; prec is a precedence relation between the jobs; tree is a rooted tree representation; res specifies resource constraints; rj is release dates; pj = 1: each job has a unit processing requirement (occurs only if α₁ ∈ {∘, P, Q}).

In Paper I we do not have any job characteristics, so for both cases (Pk and Pq) β = {∘}. In Paper II we present a scenario with a limited number of preemptions, i.e. β = {pmtn}.

The last field γ describes an optimality criterion. The most commonly chosen are the maximum completion time or makespan (Cmax), the total completion time (Σ Cj), or the maximum lateness (Lmax), where the lateness of a task is its completion time minus its deadline [Graham et al., 28].

In Papers I and II we are interested in minimizing the makespan, i.e. to minimize the maximum completion time, so γ = Cmax.

Using this notation we can represent the problem of minimizing maximum completion time on identical parallel machines allowing preemption as P | pmtn | Cmax.


2.2 Bounds and Complexity on Multiprocessor Scheduling

The problem of scheduling a parallel program on m machines has been shown to be NP-hard (Garey et al., [21]). Many scheduling problems are easier to solve when we allow preemptions, e.g. the problem of minimizing maximum completion time (Cmax) without preemptions on two identical processors (P2 | | Cmax) is NP-hard (Karp [33]), while the problem of preemptive parallel scheduling for more than 2 processors (P | pmtn | Cmax) is solvable in O(n) time (McNaughton, [50]). However, there is still a practical question of allowing or not allowing preemptions.

In 1966, Graham [26], using the List Scheduling (LS) rule, proved that Cmax(LS)/C*max ≤ 2 − 1/m, where m is the number of processors and C*max denotes the maximum completion time of the optimal schedule. The idea of list scheduling is to make an ordered list of processes by assigning them some priorities. The next job from the list is assigned to the first available machine. One of the most often used algorithms for solving the P | | Cmax problem is the Longest Processing Time (LPT) algorithm, which is a kind of list scheduling. Here, the jobs are arranged in decreasing processing time order and when a machine becomes available, the largest remaining job begins processing. The complexity of this algorithm is O(n log n) and the upper bound established by Graham ([27]) is Cmax(LPT)/C*max ≤ 4/3 − 1/(3m). However, in the LPT algorithm, the cost for process reallocation and synchronization is neglected. To achieve better performance, Coffman et al. [12] introduced another approximation algorithm, called Multifit (MF), with makespan Cmax(MFk)/C*max ≤ 1.22 + 2^(−k). This algorithm is based on bin-packing techniques, where the jobs are taken in non-increasing order and each job is placed into the first processor into which it will fit. This bound was further improved by Friesen [19] to 1.20, by Friesen and Langston [20] to 1.18, and then by Yue [62] to 13/11, which is tight. Hochbaum and Shmoys [29] have developed a polynomial approximation scheme, based on the multifit approach, with computational complexity O((n/ε)^(1/ε²)) for n processes.
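As an illustration, the LPT rule described above fits in a few lines using a min-heap of machine loads (a generic sketch, not code from any of the included papers):

```python
import heapq

def lpt_makespan(jobs, m):
    """Longest Processing Time list scheduling on m identical machines:
    sort jobs by decreasing processing time and always assign the next
    job to the currently least-loaded machine; return the makespan."""
    loads = [0] * m                      # min-heap of machine loads
    for p in sorted(jobs, reverse=True):
        heapq.heappush(loads, heapq.heappop(loads) + p)
    return max(loads)

# Graham's bound guarantees LPT is within 4/3 - 1/(3m) of the optimum.
print(lpt_makespan([7, 7, 6, 6, 5, 5, 4, 4], 3))  # 16
```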

Lennerstad and Lundberg in [39] established an optimal upper bound on the gain of using two different parallel architectures, with and without migration of processes, where the numbers of processors in both systems are equal. In [41] the result is extended to a scenario with different numbers of processors. Paper I is an extension of those papers, i.e. the cost for communication and synchronization between processors is included (see Paper I and Section 4.1.1).

The problem of minimizing the makespan with preemptions, P | pmtn | Cmax, can be solved very efficiently with time complexity O(n). The makespan of any preemptive schedule is at least

C*max = max{ max_j pj , (1/m) Σ_{j=1}^{n} pj }

(McNaughton, [50]). However, by the McNaughton rule, no more than m − 1 preemptions are needed in the unlimited case.
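McNaughton's wrap-around rule itself is short: fill the machines one after another up to the level C*max = max(max_j pj, Σ_j pj / m), splitting a job over the machine boundary when it does not fit, which introduces at most m − 1 preemptions. A sketch (the schedule representation is our own choice):

```python
def mcnaughton(jobs, m):
    """McNaughton's wrap-around rule for P|pmtn|Cmax.
    Returns (C*, schedule), where schedule[i] is a list of
    (job, start, end) pieces on machine i with makespan
    C* = max(max_j p_j, sum_j p_j / m)."""
    cstar = max(max(jobs), sum(jobs) / m)
    schedule, machine, t = [[] for _ in range(m)], 0, 0.0
    for j, p in enumerate(jobs):
        remaining = p
        while remaining > 1e-12:
            piece = min(remaining, cstar - t)   # cut at machine boundary
            schedule[machine].append((j, t, t + piece))
            t += piece
            remaining -= piece
            if cstar - t <= 1e-12:              # machine full: wrap around
                machine, t = machine + 1, 0.0
    return cstar, schedule

cs, sched = mcnaughton([5, 4, 3, 3, 3], 3)
print(cs)   # 6.0, i.e. max(5, 18/3); job 1 is split across two machines
```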

For scheduling with precedence constraints, i.e. P | prec | Cmax, Graham's [26] bound from 1966 (Cmax(LS)/C*max ≤ 2 − 1/m) still holds. If the jobs have unit length, the problem P | prec, pj = 1 | Cmax is NP-hard (Ullman, [60]). However, if the number of processors is limited to 2, then P2 | prec, pj = 1 | Cmax is solvable in O(n²) time (Coffman & Graham, [13]). Coffman and Garey [11] proved that the least makespan achievable by a non-preemptive schedule is no more than 4/3 of the least makespan achievable when preemptions are allowed.

In 1972, Liu [43] conjectured that for any set of tasks and precedence constraints among them, running on two processors, the least makespan achievable by a nonpreemptive schedule is no more than 4/3 of the least makespan achievable by a preemptive schedule. The conjecture was proved in 1993 by Coffman and Garey [11]. Here the authors generalize the results to the numbers 4/3, 3/2, 8/5, ..., i.e. to the numbers 2k/(k+1) for some k ≥ 2. The number k depends on the relative number of preemptions available.

In 2003, Braun and Schmidt [5] proved a formula that compares a preemptive schedule with i preemptions to a schedule with an unlimited number of preemptions in the worst case. They generalized the bound 4/3 to the formula C*max,i / C*max ≤ 2 − 2/(m/(i+1) + 1).

In Paper II we extend the results of Braun and Schmidt [5] by comparing a preemptive schedule with i preemptions to a schedule with j preemptions, where i < j (see Paper II and Section 4.1.2).


3 Load Balancing and Fault Tolerance (Part II)

This section presents related work on redistributing the workload in fault tolerant systems, and a definition and classification of faults.

An important issue in a multiprocessor system is how to redistribute the workload in the case of a fault in one or more processors. In that case the load should be redistributed to the other processors in the system while maintaining maximal load balance.

The load balancing problem can be handled by dynamic policies, where transfer decisions depend on the actual current system state, or by static policies, which are generally based on information about the average behavior of the system. The transfer decisions are then independent of the actual current system state. This makes them less complex than dynamic policies [Kameda et al., 32].

A system is fault tolerance “if its programs can be properly executed despite the occurrence of logic faults” (Avizienis, 1967, [2]). Krishna and Shin [35] define the fault tolerance as “an ability of a system to respond gracefully to an unexpected hard-ware or softhard-ware failure”. Therefore, many fault-tolerant computer systems mirror all operations, e.g. every operation is performed on two or more duplicate systems in that sense, that if one fails the other can take over its job. This technique is also used in clusters [Pfister, 54]. In [Cristian, 14] fault tolerance for distributed computing is discussed from a wide viewpoint.

Besides hardware failures, the intermittent failures can occur due to software events. In [Vaidya, 61] a “two-level” recovery scheme is presented and evaluated. The recovery scheme has been implemented on a cluster. The authors evaluate the impact of checkpoint latency on the performance of the recovery scheme. For transaction-ori-ented systems, Gelenbe [22] has proposed “multiple check pointing” approach that is similar to the multi-level recovery scheme presented in [61]. A mathematical model of transaction-oriented system under intermittent failures is proposed in [Gelenbe and Derochete, 24]. Here, the system is assumed to operate in a standard checkpoint-roll-back-recovery scheme. In [Chabridon and Gelenbe, 9] and [Gelenbe and Chabridon, 23] the authors propose several algorithms which can detect tasks failures and restart failed tasks. They analyze the behavior of parallel programs represented by a random task graph in a multiprocessor environment. However, all these algorithms act dynam-ically.

If a single element of hardware or software fails and brings down the entire com-puter system we talk about single point of failure (Pfister, [54]). In the thesis we look at more advanced scenario, where an arbitrary number of faults can occur, i.e. we look at the system with no single point of failure.

3.1 Fault Model

A failure is "an event that occurs when the delivered service deviates from correct service". "The deviation is called an error." "The adjudged or hypothesized cause of an error is called a fault." (Avizienis et al., [3]) The relation between these events is shown in Fig. 3.


Avizienis et al. [3] present the elementary fault classes according to eight basic viewpoints: phase of creation or occurrence, system boundaries, phenomenological cause, dimension, objective, intent, capability and persistence, where each viewpoint consists of two alternatives. If all combinations were possible, there would be 256 different combined fault classes ([3]).

A classification of the fault behaviour of processors or communication controllers is the following:

- crash fault - the processor stops computing or transmitting messages;
- omission fault - a message is missed, or a processor loses its content;
- timing fault - the message arrives too late or too early;
- fail-stop fault - a process/message is stopped from executing, possibly forever;
- Byzantine fault - a processor may do anything (Cristian, [14]).

From the point of view of occurrence, faults can be classified into:

- transient fault - a fault starts at a particular time, remains in the system for some period and then disappears;
- permanent fault - a fault starts at a particular time and remains in the system until it is repaired;
- intermittent fault - a fault occurs from time to time (Burns and Wellings, [8]).

Our main focus in Papers III-VI is on permanent crash faults.

3.2 Reliability vs. Availability

Computing and communication systems are characterized by fundamental properties: functionality, performance, dependability and security, and cost (Avizienis, [3]). Reliability and availability are two of the five characteristics of dependability. Both are measured using two quantities: the mean-time-between-failures (MTBF) and the mean-time-to-repair (MTTR).

Reliability (R) is the ability of a computing system to operate without failing, while availability (A) is its readiness for correct service.

Availability is defined as the proportion of time that the system is up (Laprie et al., [37]): A = MTBF/(MTBF + MTTR) ≈ 1 - MTTR/MTBF, if MTBF is very much greater than MTTR.

Fig. 3. Representation of a fault, an error and a failure: fault → error → failure ([3])


A system with 0.999 availability is more reliable than a system with 0.99 availability. Furthermore, a system with 0.99 availability has 1 - 0.99 = 0.01 probability of failure. Hence, the failure probability is F = 1 - A and the reliability is R = 1/(1 - A) (Laprie et al., [37]).

The availability of a system can be measured by the number of 9s, e.g. a system with 0.999 availability is called three 9s and belongs to class 3. Fig. 4 presents a classification of the availability of a system converted to an average down time in a given time period (Pfister [54] and Laprie et al., [37]). The systems of class 4-6 are called high availability systems (Pfister, [54]).

Class / nr of 9s   2      3      4       5        6
% Available        99     99.9   99.99   99.999   99.9999
Hours / Year       87.60  8.76   0.88    0.09     0.01
Minutes / Month    438    43.8   4.38    0.44     0.04

Fig. 4. Availability classes converted to average down time
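The arithmetic behind the table is simple to reproduce. The following sketch (an illustration, not part of the thesis) computes availability from MTBF and MTTR and converts it into expected downtime per year:

```python
def availability(mtbf: float, mttr: float) -> float:
    """A = MTBF / (MTBF + MTTR)."""
    return mtbf / (mtbf + mttr)

def downtime_hours_per_year(a: float) -> float:
    """Expected downtime over a year of 8760 hours: (1 - A) * 8760."""
    return (1.0 - a) * 8760.0

# A system that fails on average every 999 hours and takes 1 hour to repair
# has A = 0.999 ("three 9s", class 3), i.e. about 8.76 hours down per year.
a = availability(999.0, 1.0)
print(a, downtime_hours_per_year(a))
```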


4 Summary of the papers

This section summarizes the papers included in the thesis.

4.1 Part I

4.1.1 Comparing the optimal performance of parallel architectures (Paper I)

This paper extends the results in [Lennerstad and Lundberg, 39] and [Lennerstad and Lundberg, 42], where the synchronization cost was neglected. Here, a bound on the gain of using a system with q processors and run-time process reallocation compared to using a system with k processors, no reallocation and a communication delay t, for a program with n processes and a synchronization granularity z, is presented. The main contribution of this paper is that we handle and validate a more realistic computer model, where the communication delay t for each synchronization signal and the granularity z of the program are taken into consideration. We present transformations which enable us to separate execution and synchronization. Analyzing the parts separately and comparing them, we found a function H(n,k,q,t,z) that produces the minimal completion time:

H(n,k,q,t,z) = min_A H_A(n,k,q,t,z).

Here A denotes an allocation of processes to processors, where a_j processes are allocated to the j:th computer. Hence, the allocation is given by the allocation sequence (a_1, ..., a_k), where a_1 + ... + a_k = n. Furthermore, the sum in g_A below is taken over all sequences I = (i_1, ..., i_k) of nonnegative integers such that i_1 + ... + i_k = q:

H_A(n,k,q,t,z) = g_A(n,k,q) + z·r_A(n,k,t), where

g_A(n,k,q) = (1/C(n,q)) · Σ_I max(i_1, ..., i_k) · C(a_1,i_1) · ... · C(a_k,i_k),

r_A(n,k,t) = ((n(n-1) - Σ_{i=1}^{k} a_i(a_i-1)) / (n(n-1))) · t,

and C(a,i) denotes the binomial coefficient. The execution part, represented by g_A(n,k,q), corresponds to the previous results (Lundberg and Lennerstad [39, 42]). The function g_A(n,k,q) is convex in the sense that its value decreases if the load of a worst case program is distributed more evenly among the computers. However, the synchronization part, represented by r_A(n,k,t), is concave: this quantity increases if the load of a worst case program is distributed more evenly. This makes the minimization of the sum H a delicate matter. The type of allocation which is optimal depends strongly on the value of tz. If tz is small, then the

execution dominates, and partitions representing even distributions, uniform partitions, are optimal. If tz is large, then the synchronization dominates, and partitions where all processes are allocated to the same computer are optimal.

Here is an example of the calculation of the execution part of a program P, calculated first by the formula and then using the vector representation. Let n = 3, q = 2 and k = 2. Then A = {(a_1,a_2) ; a_1 + a_2 = 3} = {(1,2), (2,1)} and I = {(i_1,i_2) ; i_1 + i_2 = 2} = {(1,1), (2,0), (0,2)}, and for the allocation (1,2):

g_A(3,2,2) = (1/C(3,2)) · (max(1,1)·C(1,1)·C(2,1) + max(2,0)·C(1,2)·C(2,0) + max(0,2)·C(1,0)·C(2,2)) = 4/3.

Using the vector representation of program P with the allocation (a_1, a_2) = (1, 2), the rows are (1,1,0), (1,0,1) and (0,1,1), with (i_1,i_2) = (1,1), (1,1) and (0,2) and hence max(i_1,i_2) = 1, 1 and 2, respectively. This gives

g_A(3,2,2) = (sum of max(i_1,i_2)) / (number of rows in the vector representation) = 4/3.

4.1.2 The Maximum Gain of Increasing the Number of Preemptions in Multiprocessor Scheduling (Paper II)

This paper generalizes the results by Braun and Schmidt [5]. We present a tight upper bound on the maximal gain of increasing the number of preemptions for any parallel program consisting of a number of independent jobs when using m identical processors. We calculate how large the ratio of the minimal makespans using i and j preemptions respectively, C*_max,i / C*_max,j, can be, i.e. we compare i preemptions with j preemptions in the worst case. We thus allow j to range from i+1 to m-1, while the problem solved in [5] corresponds to j = m-1. In the case m ≥ i+j+1, which does not coincide with j = m-1 unless i = 0, we obtain the optimal bound

C*_max,i / C*_max,j ≤ 2(⌊j/(i+1)⌋ + 1) / (⌊j/(i+1)⌋ + 2).

For example, excluding one preemption (i = j-1) can never deteriorate the makespan by more than a factor 4/3, but may do so. This argument cannot be iterated, since different sets of jobs are worst case depending on the parameters i and j. In the case m < i+j+1 we present a formula and a fast algorithm based on the Stern-Brocot tree.
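The worked example for the execution part g_A in Section 4.1.1 can be checked numerically. The sketch below is a direct transcription of the formula (an illustration; following the worked example, the sum is taken over all index sequences (i_1, ..., i_k) with sum q):

```python
from fractions import Fraction
from itertools import product
from math import comb

def g(a: tuple, q: int) -> Fraction:
    """Execution part g_A(n,k,q): the average over all index sequences
    (i_1, ..., i_k) with i_1 + ... + i_k = q of max(i_j), weighted by
    C(a_1,i_1) * ... * C(a_k,i_k) and normalized by C(n,q)."""
    n, total = sum(a), 0
    for i in product(range(q + 1), repeat=len(a)):
        if sum(i) == q:
            term = max(i)
            for aj, ij in zip(a, i):
                term *= comb(aj, ij)
            total += term
    return Fraction(total, comb(n, q))

# Reproduces the worked example: n = 3 processes, k = 2 computers,
# allocation (1, 2) and q = 2 processors.
print(g((1, 2), 2))   # 4/3
```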


4.2 Part II

In fault tolerant distributed systems it is difficult to decide on which processor the processes should be executed. When all computers are up and running, we would like the load to be evenly distributed. The load on some processors will, however, increase when one or more processors are down, but also under these conditions we would like to distribute the load as evenly as possible over the remaining processors. The distribution of the load when a computer goes down is decided by the recovery lists of the processes running on the faulty processor. The set of all recovery lists is referred to as the recovery scheme (RS). Hence the load distribution is completely determined by the recovery scheme for any set of processors that are down.

The computer science problem is reformulated in the following papers into different mathematical problems that produce different static recovery schemes. The corresponding mathematical problems, with the corresponding recovery schemes, turn out to be the following:

- Paper III (Log RS, Golomb RS, Greedy RS): Given a number n, find the longest sequence of positive integers such that the sum of the sequence is smaller than or equal to n and all sums of subsequences (including subsequences of length one) are unique.

- Papers IV and VI (Modulo RS): Given a number n, find the longest sequence of positive integers such that the sum and the sums of all subsequences (including subsequences of length one) modulo n are unique.

- Paper V (Optimal RS): Given a number n, find the longest sequence S = <s_1, s_2, ..., s_m> of positive integers such that: a) the sum of the elements in the sequence is less than or equal to n; b) l is an integer such that l > (√(8k+9) - 3)/2, where k is the number of crashed nodes; c) the first l crash routes are disjoint; d) the following crash routes have at least l different values compared to the previous crash routes.
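The first of these formulations, used for the Golomb and Greedy schemes, is easy to state as a predicate. A minimal sketch (the function name is mine; "subsequences" is read as consecutive runs of elements):

```python
def valid_sequence(seq, n) -> bool:
    """Paper III formulation: the total sum is at most n and the sums of all
    consecutive subsequences (including length one) are pairwise distinct."""
    m = len(seq)
    sums = [sum(seq[a:b]) for a in range(m) for b in range(a + 1, m + 1)]
    return sum(seq) <= n and len(sums) == len(set(sums))

print(valid_sequence([1, 3, 5, 2], 12))   # True: sums 1,3,5,2,4,8,7,9,10,11
print(valid_sequence([1, 2, 3], 12))      # False: 1 + 2 collides with 3
```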


4.2.1 Using Golomb Rulers for Optimal Recovery Schemes in Fault Tolerant Distributed Computing (Paper III)

This paper extends the results presented in Lundberg and Svahnberg [47]. Here we show how a Golomb ruler (a sequence of non-negative integers such that no two distinct pairs of numbers from the set have the same difference, see Fig. 5) is applied to the problem of finding an optimal recovery scheme. Golomb rulers are known for lengths up to 41912 (with 211 marks). Of these, the first 373 (with up to 23 marks) are known to be optimal.

The Golomb recovery lists are built as follows. Let G_n be the Golomb ruler with sum n + 1 and let G_n(x) be the x:th entry in G_n, e.g. G_12 = <1,4,9,11> and G_12(1) = 1, G_12(2) = 4 and so on. Let then g_n = x be the number of crashed computers with optimal behavior when we have n computers, e.g. g_12 = 4. For intermediate values of k we use a smaller Golomb ruler, and the rest of the recovery list is filled with the remaining numbers up to k-1. For example, by filling with the remaining numbers, the ruler G_12 gives the list {1,4,9,11,2,3,5,6,7,8,10}.

The sequence corresponding to the Golomb recovery scheme is in this case <1,3,5,2> (i.e. the differences between consecutive numbers in the list). All other recovery lists are obtained from the first one by adding a fixed number to all entries in the modulo sense, i.e. R_i = {(i+1) mod n, (i+2) mod n, (i+3) mod n, ..., (i+n-1) mod n}, where R_i is the recovery list for process i. If n = 12 we get R_0 = <1,3,5,2>, R_1 = <2,4,6,3>, R_2 = <3,5,7,4> and so on.

In this paper we also present a greedy algorithm, which constructs a sequence with distinct partial sums. It guarantees optimal behavior until ⌊log2 n⌋ computers break down, but it can easily be calculated also for large n, where no Golomb rulers are known.
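The construction above can be sketched in a few lines of Python (an illustration; the data is the G12 example from the text, the helper names are mine):

```python
from itertools import combinations

def is_golomb(marks) -> bool:
    """A Golomb ruler: all pairwise differences between marks are distinct."""
    diffs = [b - a for a, b in combinations(sorted(marks), 2)]
    return len(diffs) == len(set(diffs))

n = 12
ruler = [1, 4, 9, 11]          # G12: the nonzero marks of the ruler <0,1,4,9,11>
assert is_golomb([0] + ruler)

# Fill the recovery list with the remaining numbers, then obtain every other
# list by shifting all entries in the modulo sense.
r0 = ruler + [x for x in range(1, n) if x not in ruler]
recovery_lists = [[(i + x) % n for x in r0] for i in range(n)]
print(r0)   # [1, 4, 9, 11, 2, 3, 5, 6, 7, 8, 10]
```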

4.2.2 Using Modulo Rulers for Optimal Recovery Schemes in Distributed Computing (Paper IV)

Paper IV extends the Golomb recovery scheme. In the formulation that can be handled by Golomb rulers the wrap-arounds are ignored, i.e. the situations when the total number of "jumps" for a process is larger than the number of computers in the cluster. This problem gives a new mathematical formulation: finding the longest sequence of positive integers such that the sum and the sums of all subsequences (including subsequences of length one) modulo n are unique (for a given n). This

Fig. 5. The Golomb ruler <0,1,4,9,11> with all pairwise differences


mathematical formulation of the computer science problem gives new, more powerful recovery schemes, called Modulo schemes, that are optimal for a larger number of crashed computers. Fig. 6 presents an example of a modulo-11 sequence with all modulo differences, which do not exist in the previous case. The recovery lists (and the recovery scheme) are constructed in the same way as the Greedy or Golomb recovery lists (and schemes).
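The defining property of a Modulo scheme is mechanical to verify. A small sketch (the sequence <1, 5, 8, 7> with n = 11 is my reading of the Fig. 6 example, whose ten subsequence sums cover every nonzero residue modulo 11):

```python
def modulo_sums_unique(seq, n) -> bool:
    """True if the sums of all consecutive subsequences of seq (including
    the single elements) are pairwise distinct modulo n."""
    m = len(seq)
    sums = [sum(seq[a:b]) % n for a in range(m) for b in range(a + 1, m + 1)]
    return len(sums) == len(set(sums))

print(modulo_sums_unique([1, 5, 8, 7], 11))   # True
```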

4.2.3 Extended Golomb Rulers as the New Recovery Schemes in Distributed Dependable Computing (Paper V)

The contribution of this paper is a new family of recovery schemes, called trapezium recovery schemes, where the first part of the scheme is based on known Golomb rulers (i.e. the crash routes are disjoint) and the second part is constructed so that the following crash routes have at least l unique values compared to the previous crash routes. The trapezium recovery schemes guarantee better performance than the Golomb schemes and are simple to calculate. Fig. 7 compares the number of crashes handled by the trapezium scheme ("trapezium") with the performance of the scheme using Golomb rulers ("golomb") as a function of the number of nodes in a cluster (n), up to n = 1024.

The goal of this paper was to find good recovery schemes that are better than those already known and that are easy to calculate also for large n.

Fig. 6. Modulo sequence for n = 11 with all differences

Fig. 7. The difference between the trapezium scheme ("trapezium") and the OGR scheme ("golomb")

In Paper V we have found the best possible recovery schemes for any number of crashed nodes in a cluster. Finding such a recovery scheme is a computationally very complex task; due to the complexity of the problem we have only been able to present optimal recovery schemes for a maximum of 21 nodes in a cluster.

4.2.4 Optimal Recovery Schemes in Fault Tolerant Distributed Computing (Paper VI)

In Paper VI we calculate the best possible recovery schemes for any number of crashed computers. We give strict priority to a small number of computers down compared to a large number: we select the set of recovery schemes with optimal worst-case behavior when two computers are down, among these we select the recovery schemes that have optimal behavior when three computers are down, and so on. The set R(n,p) of recovery schemes that minimizes the maximal load for 1, 2, ..., p computers down is defined by

R(n,p) = { R in R(n) : L(n,i,R) = min_{P in R(n)} L(n,i,P), i = 1, ..., p },

where L(n,p,R) is a load sequence that defines the worst-case behavior after p crashes when using the recovery scheme R. The optimal load sequence is denoted by SV, and a lower bound is given by MV = max_j (BV(j) · n/(n-j)), where BV is a bound vector that contains exactly k entries equal to k for all k ≥ 2.

It is not known whether the lower bound MV is tight. In this paper we investigate the tightness of MV by comparing it with the optimal load sequence SV of R(n,p). We present an algorithm with which we calculate SV. In many instances, when we have a larger number of crashed computers, SV does not coincide with MV. Fig. 8 shows to which extent MV is tight: in the grey area MV is tight, since MV = SV there; for larger numbers of crashed computers q, MV < SV, so MV is not tight.

Fig. 8. Comparison of the optimal recovery scheme sequences with the modulo sequences

5 Future Work

The bounds obtained in Part I (Papers I and II) are tight within the problem formulation. However, there is a universe of possibilities for new formulations of both more specific and more general practical situations that have no answer yet, but where the present results may provide a useful foundation for finding answers. Both Paper I and Paper II are in fact extensions of previously studied (and optimally solved) problems. Apart from the results themselves, this is the major contribution of the thesis to the research domain.

From a mathematical point of view, the results have opened new connections between parallel computer performance and combinatorics. Golomb rulers have found new applications, and new combinatorial problems have arisen and been solved. These problems may be studied and refined as mathematical problems, which then can be interpreted in a computer setting. The new application of the Stern-Brocot tree is one example. I believe that it is possible to find more such connections between computer systems engineering and combinatorics.


6 References

1. Ahmad, I., Kwok, Y.-K., and Wu, M.-Y., Analysis, Evaluation, and Comparison of Algorithms for Scheduling Task Graphs on Parallel Processors, in Proceedings of the International Symposium on Parallel Architectures, Algorithms, and Networks, Beijing, China, June 1996, pp. 207-213
2. Avizienis, A., Design of Fault-Tolerant Computers, in Proceedings of the Fall Joint Computer Conference, AFIPS Conference Proceedings, Vol. 31, Thompson Books, Washington, D.C., 1967, pp. 733-743
3. Avizienis, A., Laprie, J.-C., Randell, B., and Landwehr, C., Basic Concepts and Taxonomy of Dependable and Secure Computing, IEEE Transactions on Dependable and Secure Computing, Vol. 1, No. 1, January-March 2004, pp. 11-33
4. Blazewicz, J., Ecker, K. H., Pesch, E., Schmidt, G., Weglarz, J., Scheduling Computer and Manufacturing Processes, Springer Verlag, New York, NY, 1996, ISBN 3-540-61496-6
5. Braun, O., Schmidt, G., Parallel Processor Scheduling with Limited Number of Preemptions, SIAM Journal on Computing, Vol. 32, No. 3, 2003, pp. 671-680
6. Bruno, J., Coffman Jr., E. G., and Sethi, R., Scheduling Independent Tasks To Reduce Mean Finishing Time, Communications of the ACM, Vol. 17, No. 7, July 1974, pp. 382-387
7. Burns, A., and Wellings, A., Real-Time Systems and Programming Languages, Third Edition, Pearson, Addison Wesley, ISBN 0-201-72988-1

8. Casavant, T. L., Kuhl, J. G., A Taxonomy of Scheduling in General-Purpose Distributed Computing Systems, IEEE Transactions on Software Engineering, Vol. 14, No. 2, February 1988, pp. 141-154
9. Chabridon, S., Gelenbe, E., Failure Detection Algorithms for a Reliable Execution of Parallel Programs, in Proceedings of the 14th Symposium on Reliable Distributed Systems, SRDS'14, Bad Neuenahr, Germany, September 1995
10. Coffman Jr., E. G., Computer and Job-Shop Scheduling Theory, John Wiley and Sons, Inc., New York, NY, 1976
11. Coffman Jr., E. G., Garey, M. R., Proof of the 4/3 Conjecture for Preemptive vs. Nonpreemptive Two-Processor Scheduling, Journal of the ACM, Vol. 40, No. 5, November 1993, pp. 991-1018, ISSN 0004-5411
12. Coffman Jr., E. G., Garey, M. R., Johnson, D. S., An Application of Bin Packing to Multiprocessor Scheduling, SIAM Journal on Computing, 7 (1978), pp. 1-17
13. Coffman Jr., E. G., Graham, R., Optimal Scheduling for Two-Processor Systems, Acta Informatica, 1 (1972), pp. 200-213
14. Cristian, F., Understanding Fault-Tolerant Distributed Systems, Communications of the ACM, Vol. 34, Issue 2, 1991
15. Dollas, A., Rankin, W. T., and McCracken, D., A New Algorithm for Golomb Ruler Derivation and Proof of the 19 Mark Ruler, IEEE Transactions on Information Theory, Vol. 44, No. 1, January 1998, pp. 379-382
16. El-Rewini, H., Ali, H. H., and Lewis, T. G., Task Scheduling in Multiprocessor


17. El-Rewini, H., Ali, H. H., Static Scheduling of Conditional Branches in Parallel Programs, Journal of Parallel and Distributed Computing, Vol. 24, No. 1, January 1995, pp. 41-54
18. Flynn, M. J., Very High-Speed Computing Systems, Proceedings of the IEEE, Vol. 54, No. 12, 1966, pp. 1901-1909
19. Friesen, D. K., Tighter Bounds for the Multifit Processor Scheduling Algorithm, SIAM Journal on Computing, 13 (1984), pp. 170-181
20. Friesen, D. K., and Langston, M. A., Evaluation of a MULTIFIT-Based Scheduling Algorithm, Journal of Algorithms, Vol. 7, Issue 1, March 1986, pp. 35-59
21. Garey, M. R., and Johnson, D. S., Computers and Intractability - A Guide to the Theory of NP-Completeness, W. H. Freeman and Company, New York, 1979
22. Gelenbe, E., A Model for Roll-back Recovery With Multiple Checkpoints, in Proceedings of the 2nd ACM Conference on Software Engineering, October 1976, pp. 251-255
23. Gelenbe, E., Chabridon, S., Dependable Execution of Distributed Programs, Simulation Practice and Theory, Vol. 3, No. 1, Elsevier, 1995, pp. 1-16
24. Gelenbe, E., Derochette, D., Performance of Rollback Recovery Systems under Intermittent Failures, Communications of the ACM, Vol. 21, No. 6, June 1978, pp. 493-499
25. Gerasoulis, A., Yang, T., A Comparison of Clustering Heuristics for Scheduling DAGs on Multiprocessors, Journal of Parallel and Distributed Computing, 16, 1992, pp. 276-291
26. Graham, R. L., Bounds for Certain Multiprocessing Anomalies, Bell System Technical Journal, Vol. 45, No. 9, November 1966, pp. 1563-1581
27. Graham, R. L., Bounds on Multiprocessing Timing Anomalies, SIAM Journal on Applied Mathematics, Vol. 17, No. 2, 1969, pp. 416-429
28. Graham, R. L., Lawler, E. L., Lenstra, J. K., Rinnooy Kan, A. H. G., Optimization and Approximation in Deterministic Sequencing and Scheduling: A Survey, Annals of Discrete Mathematics, 5, 1979, pp. 287-326
29. Hochbaum, D. S., Shmoys, D. B., Using Dual Approximation Algorithms for Scheduling Problems: Theoretical and Practical Results, Journal of the ACM, Vol. 34, Issue 1, January 1987, pp. 144-162
30. Hou, E. S. H., Hong, R., and Ansari, N., Efficient Multiprocessor Scheduling Based on Genetic Algorithms, in Proceedings of the 16th Annual Conference of the IEEE Industrial Electronics Society - IECON'90, Vol. II, Pacific Grove, CA, USA, IEEE, New York, November 1990, pp. 1239-1243
31. Hu, T. C., Parallel Sequencing and Assembly Line Problems, Operations Research, Vol. 19, No. 6, November 1961, pp. 841-848
32. Kameda, H., Fathy, E.-Z. S., Ryu, I., Li, J., A Performance Comparison of Dynamic vs. Static Load Balancing Policies in a Mainframe - Personal Computer Network Model, Information: An International Journal, Vol. 5, No. 4, December 2002


33. Karp, R. M., Reducibility Among Combinatorial Problems, in R. E. Miller and J. W. Thatcher (eds.): Complexity of Computer Computations, New York, 1972, pp. 85-104
34. Khan, A., McCreary, C. L., and Jones, M. S., A Comparison of Multiprocessor Scheduling Heuristics, in Proceedings of the 1994 International Conference on Parallel Processing, CRC Press, Inc., Boca Raton, FL, pp. 243-250
35. Krishna, C. M., Shin, K. G., Real-Time Systems, McGraw-Hill International Editions, Computer Science Series, 1997, ISBN 0-07-114243-6
36. Kwok, Y.-K., and Ahmad, I., Static Scheduling Algorithms for Allocating Directed Task Graphs to Multiprocessors, ACM Computing Surveys, Vol. 31, No. 4, December 1999, pp. 406-471
37. Laprie, J.-C., Avizienis, A., Kopetz, H., Dependability: Basic Concepts and Terminology, 1992, ISBN 0387822968
38. Lawler, E. G., Lenstra, J. K., Rinnooy Kan, A. H. G., Shmoys, D. B., Sequencing and Scheduling: Algorithms and Complexity, in S. C. Graves et al. (eds.), Handbooks in Operations Research and Management Science, Vol. 4, Chapter 9, Elsevier, Amsterdam, 1993, pp. 445-522
39. Lennerstad, H., Lundberg, L., An Optimal Execution Time Estimate of Static versus Dynamic Allocation in Multiprocessor Systems, SIAM Journal on Computing, 24 (4), 1995, pp. 751-764
40. Lennerstad, H., Lundberg, L., Generalizations of the Floor and Ceiling Functions Using the Stern-Brocot Tree, Research Report No. 2006:02, Blekinge Institute of Technology, Karlskrona, 2006
41. Lennerstad, H., Lundberg, L., Optimal Combinatorial Functions Comparing Multiprocessor Allocation Performance in Multiprocessor Systems, SIAM Journal on Computing, 29 (6), 2000, pp. 1816-1838
42. Lennerstad, H., Lundberg, L., Optimal Scheduling Combinatorics, Electronic Notes in Discrete Mathematics, Vol. 14, Elsevier, May 2003
43. Liu, C. L., Optimal Scheduling on Multiprocessor Computing Systems, in Proceedings of the 13th Annual Symposium on Switching and Automata Theory, IEEE Computer Society, Los Alamitos, CA, 1972, pp. 155-160
44. Lennerstad, H., and Lundberg, L., An Optimal Execution Time Estimate of Static versus Dynamic Allocation in Multiprocessor Systems, SIAM Journal on Computing, 24 (4), 1995, pp. 751-764
45. Lundberg, L., Performance Bounds on Multiprocessor Scheduling Strategies for Chain Structured Programs, BIT (Computer Science Section), Vol. 33, No. 2, 1993, pp. 190-213
46. Lundberg, L., Lennerstad, H., Klonowska, K., Gustafsson, G., Using Optimal Golomb Rulers for Minimizing Collisions in Closed Hashing, in Proceedings of Advances in Computer Science - ASIAN 2004, Thailand, December 2004; Lecture Notes in Computer Science 3321, Springer, 2004, pp. 157-168
47. Lundberg, L., and Svahnberg, C., Optimal Recovery Schemes for High-Availability Cluster and Distributed Computing, Journal of Parallel and Distributed Computing


48. Markatos, E. P., and LeBlanc, T. B., Locality-Based Scheduling for Shared Memory Multiprocessors, Technical Report TR93-0094, ICS-FORTH, Heraklion, Crete, Greece, August 1993, pp. 2-3
49. McCreary, C., Khan, A. A., Thompson, J. J., and McArdle, M. E., A Comparison of Heuristics for Scheduling DAGs on Multiprocessors, in Proceedings of the Eighth International Parallel Processing Symposium, April 1994, pp. 446-451
50. McNaughton, R., Scheduling with Deadlines and Loss Functions, Management Science, 6, 1959, pp. 1-12
51. Nanda, A. K., DeGroot, D., and Stenger, D. L., Scheduling Directed Task Graphs on Multiprocessors Using Simulated Annealing, in Proceedings of the IEEE 12th International Conference on Distributed Computing Systems, Yokohama, Japan, IEEE Computer Society, Los Alamitos, June 1992, pp. 20-27
52. Pande, S. S., Agrawal, D. P., and Mauney, J., A Threshold Scheduling Strategy for Sisal on Distributed Memory Machines, Journal of Parallel and Distributed Computing, Vol. 21, No. 2, May 1994, pp. 223-236
53. Papadimitriou, C. H., and Yannakakis, M., Scheduling Interval-Ordered Tasks, SIAM Journal on Computing, Vol. 8, 1979, pp. 405-409
54. Pfister, G. F., In Search of Clusters: The Ongoing Battle in Lowly Parallel Computing, Prentice Hall PTR, 1998, ISBN 0-13-899709-8
55. Pinedo, M., Scheduling: Theory, Algorithms, and Systems, 2nd Edition, Prentice Hall, 2001, ISBN 0-13-028138-7
56. Shirazi, B., Kavi, K., Hurson, A. R., and Biswas, P., PARSA: A Parallel Program Scheduling and Assessment Environment, in Proceedings of the International Conference on Parallel Processing, ICPP 1993, August 1993, Vol. 2, pp. 68-72
57. Shirazi, B., Wang, M., Analysis and Evaluation of Heuristic Methods for Static Task Scheduling, Journal of Parallel and Distributed Computing, Vol. 10, 1990, pp. 222-232
58. Soliday, S. W., Homaifar, A., Lebby, G. L., Genetic Algorithm Approach to the Search for Golomb Rulers, in Proceedings of the International Conference on Genetic Algorithms, Pittsburgh, PA, USA, 1995, pp. 528-535
59. Tang, P., Yew, P.-C., and Zhu, C.-Q., Impact of Self-Scheduling Order on Performance of Multiprocessor Systems, in Proceedings of the 2nd International Conference on Supercomputing, St. Malo, France, June 1988, pp. 593-603
60. Ullman, J. D., NP-Complete Scheduling Problems, Journal of Computer and System Sciences, Vol. 10, 1975, pp. 384-393
61. Vaidya, N. H., Another Two-Level Failure Recovery Scheme: Performance Impact of Checkpoint Placement and Checkpoint Latency, Technical Report 94-068, Department of Computer Science, Texas A&M University, December 1994
62. Yue, M., On the Exact Upper Bound for the Multifit Processor Scheduling Algorithm, Annals of Operations Research, Vol. 24, No. 1, December 1990, pp. 233-259
63. Zapata, O. U. P., Alvarez, P. M., EDF and RM Multiprocessor Scheduling Algorithms: Survey and Performance Evaluation, Report No.

Comparing the Optimal Performance

of Parallel Architectures

Kamilla Klonowska, Lars Lundberg, Håkan Lennerstad and Magnus Broberg
The Computer Journal, Vol. 47, No. 5, 2004

Abstract

Consider a parallel program with n processes and a synchronization granularity z. Consider also two parallel architectures: an SMP with q processors and run-time reallocation of processes to processors, and a distributed system (or cluster) with k processors and no run-time reallocation. There is an inter-processor communication delay of t time units for the system with no run-time reallocation. In this paper we define a function H(n,k,q,t,z) such that the minimum completion time for all programs with n processes and a granularity z is at most H(n,k,q,t,z) times longer using the system with no reallocation and k processors compared to using the system with q processors and run-time reallocation. We assume optimal allocation and scheduling of processes to processors. The function H(n,k,q,t,z) is optimal in the sense that there is at least one program, with n processes and a granularity z, such that the ratio is exactly H(n,k,q,t,z). We also validate our results using measurements on distributed and multiprocessor Sun/Solaris environments. The function H(n,k,q,t,z) provides important insights regarding the performance implications of the fundamental design decision of whether to allow run-time reallocation of processes or not. These insights can be used when doing the proper cost/benefit trade-offs when designing parallel execution platforms.


Fig. 17.  The function H(n,k,q,t,z) for different values of tz and the intervals   1 d n, k   50, and q = kd159131721252933 37 41 45 4911 01 92 83 74 6 1 1 , 52 2 , 53 3 , 54 4 , 5kno p t i m a l   w i t h   t * z = 0
Fig. 17. The function H(n,k,q,t,z) for different values of tz and the intervals 1 d n, k 50, and q = kd159131721252933 37 41 45 4911 01 92 83 74 6 1 1 , 52 2 , 53 3 , 54 4 , 5kno p t i m a l w i t h t * z = 0 p.76
Fig. 1. An example of a box schedule ( a) and a box schedule with preemption clusters  ( b)
Fig. 1. An example of a box schedule ( a) and a box schedule with preemption clusters ( b) p.87
Fig. 2. Stern-Brocot tree with the decomposition of  2  and  5 2 min 13--- 12---© ¹§·= 7 4 3 min 21--- 21--- 55---©  ¹§·= m n e = k m n c k= m n m x m y
Fig. 2. Stern-Brocot tree with the decomposition of 2 and 5 2 min 13--- 12---© ¹§·= 7 4 3 min 21--- 21--- 55---©  ¹§·= m n e = k m n c k= m n m x m y p.94
Fig. 3. Function  G for m = 200
Fig. 3. Function G for m = 200 p.98
Fig. 4. Function  G for m = 400
Fig. 4. Function G for m = 400 p.98
Fig. 1. An application executing on a cluster  with four computers
Fig. 1. An application executing on a cluster with four computers p.106
Table 1. All sums of subsequences of the reduced step length vector for  n = 97

Table 1.

All sums of subsequences of the reduced step length vector for n = 97 p.111
Fig. 2. The ruler presentation of the OGR with  four marks
Fig. 2. The ruler presentation of the OGR with four marks p.113
Table 2. Comparing the optimal sequences with the greedy sequencesLength of

Table 2.

Comparing the optimal sequences with the greedy sequencesLength of p.115
Fig. 4. The “performance” difference between the OGRs scheme ('golomb'),  greedy version of the recovery scheme (‘greedy’) and the old version of the
Fig. 4. The “performance” difference between the OGRs scheme ('golomb'), greedy version of the recovery scheme (‘greedy’) and the old version of the p.116
Table 3. The known Optimal Golomb RulersLength of  sequence  (n)Sum of sequencem Sequence of Marks11112311,23611,3,241121,3,5,23,1,5,251741,3,6,2,51,3,6,5,21,7,3,2,41,7,4,2,362551,3,6,8,5,21,6,4,9,3,22,1,7,6,5,42,4,3,5,10,13,1,8,6,5,273411,3,5,6,7,10,28441

Table 3.

The known Optimal Golomb RulersLength of sequence (n)Sum of sequencem Sequence of Marks11112311,23611,3,241121,3,5,23,1,5,251741,3,6,2,51,3,6,5,21,7,3,2,41,7,4,2,362551,3,6,8,5,21,6,4,9,3,22,1,7,6,5,42,4,3,5,10,13,1,8,6,5,273411,3,5,6,7,10,28441 p.118
Fig. 1. Outline of the problem formulation
Fig. 1. Outline of the problem formulation p.124
Figure 5 presents an example of recovery list in a cluster with 11 computers when computer zero has crashed.

Figure 5

presents an example of recovery list in a cluster with 11 computers when computer zero has crashed. p.130
Table 1. Comparing modulo sequences with optimal Golomb sequences m = number of computers in a cluster that we can guarantee optimality         using scheme modulo  m;

Table 1.

Comparing modulo sequences with optimal Golomb sequences m = number of computers in a cluster that we can guarantee optimality using scheme modulo m; p.133
Fig. 2. The ruler presentation of the OGR with four marks
Fig. 2. The ruler presentation of the OGR with four marks p.146
Fig. 5 presents an outline when using the sequence &lt;1,3,2,5,2,5&gt;. The load on Z increases by one when all nodes from the crash route are down
Fig. 5 presents an outline when using the sequence &lt;1,3,2,5,2,5&gt;. The load on Z increases by one when all nodes from the crash route are down p.148
Fig. 5. Example of using a sequence &lt;1,3,2,5,2,5&gt;. Here the load on Z after 9 crashes  equals 5
Fig. 5. Example of using a sequence &lt;1,3,2,5,2,5&gt;. Here the load on Z after 9 crashes equals 5 p.149
Table 1. Comparing the trapezium sequences with the optimal Golomb sequences Figure 7 compares the performance of the trapezium scheme (“trapezium”), with the performance of the scheme using Golomb rulers (“golomb”) as a function of the number of nodes in

Table 1.

Comparing the trapezium sequences with the optimal Golomb sequences Figure 7 compares the performance of the trapezium scheme (“trapezium”), with the performance of the scheme using Golomb rulers (“golomb”) as a function of the number of nodes in p.153
Fig. 7. The “performance” difference between the trapezium scheme (“trapezium”)  and OGRs scheme ('golomb')
Fig. 7. The “performance” difference between the trapezium scheme (“trapezium”) and OGRs scheme ('golomb') p.153
Fig. 3. Example of a sequence tree for n = 5
Fig. 3. Example of a sequence tree for n = 5 p.174
Fig. 10. Comparison of the optimal recovery schemes sequences with the modulo sequences
Fig. 10. Comparison of the optimal recovery schemes sequences with the modulo sequences p.179
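Several of the figures and tables above concern Optimal Golomb Rulers (OGRs), which the recovery schemes build on. As an illustrative sketch (not the thesis's algorithms), the defining property of a Golomb ruler, that all pairwise differences between marks are distinct, can be checked in a few lines; the step sequence 1, 3, 2 from Table 3 yields the four-mark ruler 0, 1, 4, 6.

```python
def marks_from_steps(steps):
    """Turn a step-length sequence (as listed in Table 3) into ruler marks starting at 0."""
    marks = [0]
    for s in steps:
        marks.append(marks[-1] + s)
    return marks

def is_golomb(marks):
    """Golomb property: every pairwise difference between marks occurs exactly once."""
    diffs = [b - a for i, a in enumerate(marks) for b in marks[i + 1:]]
    return len(diffs) == len(set(diffs))

print(marks_from_steps([1, 3, 2]))              # [0, 1, 4, 6]
print(is_golomb(marks_from_steps([1, 3, 2])))   # True
print(is_golomb([0, 1, 2]))                     # False: difference 1 repeats
```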