Blekinge Institute of Technology
Doctoral Dissertation Series No. 2007:18
School of Engineering
THEORETICAL ASPECTS ON
PERFORMANCE BOUNDS AND FAULT
TOLERANCE IN PARALLEL COMPUTING
Kamilla Klonowska
ABSTRACT
Theoretical Aspects on
Performance Bounds and Fault Tolerance in
Parallel Computing
Kamilla Klonowska
ISSN 1653-2090
ISBN 978-91-7295-126-6
Department of Systems and Software Engineering
School of Engineering
Blekinge Institute of Technology
SWEDEN
© 2007 Kamilla Klonowska
Department of Systems and Software Engineering
School of Engineering
Publisher: Blekinge Institute of Technology
Printed by Printfabriken, Karlskrona, Sweden 2007
ISBN 978-91-7295-126-6
This thesis has been submitted to the Faculty of Technology at Blekinge Institute of Technology in partial fulfilment of the requirements for the Degree of Doctor of Philosophy in Computer Systems Engineering.
Contact information:
Kamilla Klonowska
Department of Systems and Software Engineering
School of Engineering, Blekinge Institute of Technology
Box 520
SE-372 25 Ronneby
Sweden
This thesis consists of two parts: performance bounds for scheduling algorithms for parallel programs in multiprocessor systems, and recovery schemes for fault tolerant distributed systems when one or more computers go down.
In the first part we deliver tight bounds on the ratio for the minimal completion time of a parallel program executed in a parallel system in two scenarios. Scenario one: the ratio for the minimal completion time when processes can be reallocated to other processors during their execution time compared to when they cannot. Scenario two: when a schedule is preemptive, the ratio for the minimal completion time when we use two different numbers of preemptions.
The second part discusses the problem of redistribution of the load among running computers in a parallel system. The goal is to find a redistribution scheme that maintains high performance even when one or more computers go down. Here we deliver four different redistribution algorithms.
In both parts we use theoretical techniques that lead to explicit worst-case programs and scenarios. The correctness is based on mathematical proofs.
This Ph.D. thesis might never have been realized without my supervisors, colleagues, friends and family members from here and apart. Thank you.
I especially thank my supervisors, Professor Lars Lundberg and Docent Håkan Lennerstad for their contribution in all the papers, comments, ideas, and encouragement during this work; Lars for his patience and Håkan for a lot of dialogues and discussions about life and mathematics.
I also thank the members of my research group: Charlie Svahnberg for answering all my questions and for his contributions to the Golomb-series papers, Dawit Mengistu for many discussions about the complexity of life and scheduling, Simon Kågström for helping me understand assembly and other strange things, Peter Tröger for deep discussions about fault, error and failure, and for keeping my family busy during the last, most stressful weekends, Mia Persson for discussions about NP-completeness, Göran Fries for being my room-neighbour throughout my Ph.D. studies, Bengt Aspvall, Håkan Grahn, and Magnus Broberg, an ex-member, for his contribution to the first paper.
I would also like to thank my colleagues at the department, especially Linda Ramstedt, Johanna Törnkvist, Lawrence Henesey, Jenny Lundberg, Nina Dzamashvili-Fogelström, Guohua Bai, Maddeleine Pettersson, Monica Nilsson, May-Louise Andersson, Petra Nilsson.
Finally, I want to express my gratitude to my family for their support and encouragement from afar during hard times. Special thanks to Jacek and Wiktor, my Mother, Ciocia Wera, Basia and Ciocia Lodzia, Ciocia Eleonora, Iwonka and Krzysztof for being with me during that time. And my friends: Radek and Gosia Szymanek, Ewa and Jonta Andersson, Jelena Zaicenoka, Magdalena Urbanska, Viola and Tomek Murawscy, Karina Stachowiak and Rafal Lacny.
Papers included in this thesis:
I “Comparing the Optimal Performance of Parallel Architectures”
Kamilla Klonowska, Lars Lundberg, Håkan Lennerstad, Magnus Broberg
The Computer Journal, Vol. 47, No. 5, 2004
II “The Maximum Gain of Increasing the Number of Preemptions in
Multiprocessor Scheduling”
Kamilla Klonowska, Lars Lundberg, Håkan Lennerstad
submitted for publication
III “Using Golomb Rulers for Optimal Recovery Schemes in Fault Tolerant
Distributed Computing”
Kamilla Klonowska, Lars Lundberg, Håkan Lennerstad
Proceedings of the 17th International Parallel & Distributed Processing Symposium IPDPS 2003, Nice, France, April 2003
IV “Using Modulo Rulers for Optimal Recovery Schemes in Distributed
Computing”
Kamilla Klonowska, Lars Lundberg, Håkan Lennerstad, Charlie Svahnberg
Proceedings of the 10th Pacific Rim International Symposium on Dependable Computing (PRDC 2004), Papeete, Tahiti, French Polynesia, March 2004
V “Extended Golomb Rulers as the New Recovery Schemes in Distributed
Dependable Computing”
Kamilla Klonowska, Lars Lundberg, Håkan Lennerstad, Charlie Svahnberg
Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05), Denver, Colorado, April 2005
VI “Optimal Recovery Schemes in Fault Tolerant Distributed Computing”
Kamilla Klonowska, Lars Lundberg, Håkan Lennerstad, Charlie Svahnberg
Acta Informatica, 41(6), 2005
Publications that are related but not included in this thesis:
VII “Using Optimal Golomb Rulers for Minimizing Collisions in Closed Hashing”
Lars Lundberg, Håkan Lennerstad, Kamilla Klonowska, Göran Gustafsson
Proceedings of Advances in Computer Science - ASIAN 2004, Higher-Level Decision Making, 9th Asian Computing Science Conference, Thailand, December 2004; Lecture Notes in Computer Science, 3321 Springer 2004, ISBN 3-540-24087-X
VIII “Bounding the Minimal Completion Time in High Performance Parallel
Processing”
Lars Lundberg, Magnus Broberg, Kamilla Klonowska
International Journal of High Performance Computing and Networking, Vol. 2, No. 1, 2004
IX “Comparing the Optimal Performance of Multiprocessor Architectures”
Lars Lundberg, Kamilla Klonowska, Magnus Broberg, Håkan Lennerstad
Proceedings of the twenty-first IASTED International Multi-Conference Applied Informatics AI 2003, Innsbruck, Austria, February 2003
X “Recovery Schemes for High Availability and High Performance Distributed
Real-Time Computing”
Lars Lundberg, Daniel Häggander, Kamilla Klonowska, Charlie Svahnberg
Proceedings of the 17th International Parallel & Distributed Processing Sympo-sium IPDPS 2003, Nice, France, April 2003
XI “Evaluating Heuristic Scheduling Algorithms for High Performance Parallel
Processing”
Lars Lundberg, Magnus Broberg, Kamilla Klonowska
Proceedings of the fifth International Symposium on High Performance Computing ISHPC-V 2003, Tokyo, Japan, October 2003
XII “A Method for Bounding the Minimal Completion Time in Multiprocessors”
Magnus Broberg, Lars Lundberg, Kamilla Klonowska
Technical Report, Blekinge Institute of Technology, 2002
XIII “Optimal Performance Comparisons of Massively Parallel Multiprocessors”
Håkan Lennerstad, Lars Lundberg, Kamilla Klonowska
Theoretical Aspects on Performance Bounds and Fault Tolerance
in Parallel Computing
1. Introduction ... 3
1.1 Research Questions ... 5
1.2 Research Methodology ... 6
1.3 Research Contribution ... 6
2. Multiprocessor Scheduling (Part I) ... 8
2.1 Classification of scheduling problems ... 8
2.2 Bounds and Complexity on Multiprocessor Scheduling ... 10
3. Load Balancing and Fault Tolerance (Part II) ... 12
3.1 Fault Model ... 12
3.2 Reliability vs. Availability ... 13
4. Summarizing of the papers ... 15
4.1 Part I ... 15
4.2 Part II ... 17
5. Future Work ... 21
6. References ... 22
Paper Section
Paper I
Comparing the Optimal Performance of Parallel Architectures 29
Paper II
The Maximum Gain of Increasing the Number of Preemptions
in Multiprocessor Scheduling... 67
Paper III
Using Golomb Rulers for Optimal Recovery Schemes in Fault
Tolerant Distributed Computing ... 87
Paper IV
Using Modulo Rulers for Optimal Recovery Schemes in
Distributed Computing ... 105
Paper V
Extended Golomb Rulers as the New Recovery Schemes in
Distributed Dependable Computing ... 125
Paper VI
Optimal Recovery Schemes in Fault Tolerant Distributed
Computing ... 143
Theoretical Aspects on Performance Bounds and
Fault Tolerance in Parallel Computing
1 Introduction
If one single processor does not give the required performance, one alternative is to execute the application on several processors that work in parallel. To increase the performance in parallel computing, we would like to spread the work between the processors (computers) as evenly as possible. This technique is called load balancing. There are many ways to implement parallel computing with multiple processors. One of them is a loosely coupled parallel system consisting of a number of stand-alone computers, connected by a network, where a process can only be executed by the processor on which it was started. This allocation of processes to processors is called static allocation.
A loosely coupled parallel system is very attractive due to its low cost and the potential improvement in availability and fault tolerance. If one computer fails, the work on the failed computer can be taken over by another computer. However, one challenging problem in loosely coupled systems is to schedule processes among processors to achieve performance goals such as minimizing communication delays and completion time. The completion time of a program is also called the makespan.
Another parallel system is a tightly coupled Symmetric MultiProcessor system (SMP), consisting of multiple similar processors within the same computer, interconnected by a bus or some other fast interconnection network. Here a process may be executed by different processors during different time periods. This allocation of processes to processors is called dynamic allocation. An SMP system offers high performance and efficient load balancing, but it does not offer availability: if one processor fails, the entire application will usually fail.
One is often interested in improving the performance by reducing the completion time of a parallel program consisting of a number of synchronizing processes (static allocation). There is a conflict between the execution of the tasks and the communication between them. One extreme example is a parallel program that executes on only one processor. This program is not affected by communication/synchronization overhead, but it suffers from serious load imbalance. On the other hand, if the same parallel program is executed on many processors, then the load may be evenly spread among the processors, but the communication cost can be very high.
Finding a scheduling algorithm that minimizes the completion time for a parallel program consisting of a number of processes is one of the classical computer science problems, and it has been shown to be NP-hard (Garey et al., [21]). A number of good heuristic methods have been suggested, but it is difficult to know when to stop the heuristic search for better schedules. Therefore it is important to know the optimal bounds, in order to judge how close the algorithms are to the optimal results.
An important question is how much performance one can gain by allowing dynamic allocation, provided that we are able to find (almost) optimal scheduling and allocation algorithms. The answer to this question provides important input when we want to balance the additional cost and complexity of allowing dynamic allocation against the performance loss of restricting ourselves to static allocation. In Paper I we define a function that answers this question for a very wide range of multiprocessors and parallel programs.
Another possibility to increase the performance is to allow preemptions. In a preemptive schedule, a process can be interrupted by other processes and then resumed on the same or on another processor. Scheduling with preemptions is more flexible, but it is not always possible, and preemptions can be costly due to the overhead of context switching. A preemption can be made at any point in time. Here, the state must be saved before a process is preempted and then restored when it resumes. This means that there is a trade-off: on the one hand one needs preemptions in order to obtain a balanced load, but on the other hand one would like to limit the number of preemptions in order to minimize the overhead costs. The optimal solution to this trade-off problem depends on a number of parameters, such as the cost for preempting and later restarting a process. One crucial piece of information for making a well-informed trade-off decision is the possible performance gain if the number of preemptions is increased, assuming that there are no overhead costs. In Paper II we present a tight upper bound on the maximal gain of increasing the number of preemptions for any parallel program. This means that we compare the minimal makespans for extremal (worst-case) parallel programs consisting of a set of independent jobs on a multiprocessor when allowing different numbers of preemptions.
In fault tolerant parallel systems, like clusters, availability is obtained by failover techniques. In its simplest form, cluster availability is obtained by having two computers, one active and one stand-by. If the primary computer fails, the secondary simply takes over the work. In order to obtain higher availability, one may want to use more than just two computers [Pfister, 54]. However, it can be very costly to build a large cluster system with many stand-by computers. It is often more attractive to fail over to the computers that already are active in the system, but it is difficult to decide on which computer the work on the failing computer should be executed. This is particularly challenging if this decision has to be made statically before the program starts executing. The goal here is to find a redistribution scheme that maintains high performance even when one or more computers go down. In Papers III - VI we present four different redistribution schemes.
This thesis is divided into two parts. Fig. 1 presents an overview of the thesis as two sets of papers corresponding to Part I and Part II and their intersection of common concepts. In both parts we use theoretical techniques that lead to explicit worst-case programs and scenarios. The correctness is based on mathematical proofs.
In Part I we present performance bounds on the ratio for the minimal completion time of a parallel program executed in two scenarios. Scenario one: the ratio for the minimal completion time when processes can be reallocated to other processors during their execution time compared to when they cannot (Paper I). Scenario two: when a schedule is preemptive, the ratio for the minimal completion time when we use two different numbers of preemptions (Paper II). In Section 2 the classification of multiprocessor scheduling and related work on some bounds and complexity are presented. In Part II we discuss the problem of redistribution of the load among the running computers in a parallel system when one or more computers in the system go down. Here we deliver four different redistribution algorithms (Papers III, IV, V and VI). In Section 3 the classification of faults and related work on fault-tolerant systems are presented.
1.1 Research Questions
In this thesis three main questions are in focus.
Part I: Performance Bounds in Parallel Computing
1. Consider two parallel architectures: a Symmetric MultiProcessor (SMP) with dynamic allocation and a distributed system (or cluster) with static allocation. How much shorter can the completion time of a parallel program be on the SMP compared to the distributed system, provided that we use optimal schedules?

2. Consider a multiprocessor with identical processors, a parallel program consisting of independent processes, and an optimal schedule with preemptions. If the number of preemptions is increased, how large can the gain in completion time of a parallel program be?
Part II: Load balancing in Parallel Computing
3. Consider a cluster with a number of computers. Consider a worst-case crash scenario, i.e. a scenario where the most unfavorable combination of computers goes down. How can the load be evenly and efficiently redistributed among the running computers (using static redistribution)?
[Fig. 1. Overview of the thesis. Part I - Performance: Papers I, II (and VIII, IX, XI, XII, XIII); Part II - Fault Tolerance: Papers III, IV, V, VI (and VII, X). Shared concepts: worst-case mathematics, optimal bounds, algorithms, multiprocessors, the Stern-Brocot tree, and Golomb rulers.]
1.2 Research Methodology
Since we are interested in worst-case/extreme-case scenarios in infinite sets, we cannot use empirical methods. To find the answers to the research questions, theoretical techniques that lead to explicit worst-case programs are used. The correctness is based on mathematical proofs (Papers I and II). Furthermore, a number of tests on multiprocessor and distributed Sun/Solaris environments have been done to validate and illustrate some of the results in Paper I. In Paper II we take advantage of a number-theoretic formula based on the Stern-Brocot tree. Papers III and IV are also based on number theory. In Paper III we present schemes that are based on so-called Golomb rulers. Golomb rulers have previously been used in rather different contexts, e.g. radio astronomy (placement of antennas), X-ray crystallography, data encryption, geographical mapping and hash tables [Lundberg et al., 46]. In Paper IV we extend the results from Paper III by adding a new type of ruler, a “modulo ruler”, giving a result that is valid in more cases than that of Paper III. In Paper V we again extend the Golomb rulers to construct a new recovery scheme (trapezium) which gives better performance. In Paper VI we exhaust the cases by calculating the best possible recovery schemes for any number of crashed computers by a branch-and-bound algorithm.
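The defining property of the Golomb rulers mentioned above is that all pairwise differences between marks are distinct. As a concrete illustration, the following minimal sketch checks this property (the function name is ours, not from the papers):

```python
# A Golomb ruler is a set of integer marks such that all pairwise
# differences between marks are distinct.
from itertools import combinations

def is_golomb_ruler(marks):
    """Return True if all pairwise differences between marks are distinct."""
    diffs = [b - a for a, b in combinations(sorted(marks), 2)]
    return len(diffs) == len(set(diffs))

# The optimal (shortest) 4-mark Golomb ruler has length 6:
print(is_golomb_ruler([0, 1, 4, 6]))   # True: differences 1, 4, 6, 3, 5, 2
print(is_golomb_ruler([0, 1, 2, 4]))   # False: the difference 2 repeats
```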
1.3 Research Contribution
The main contributions in this thesis are:
Part I: The optimal bounds:
• an optimal bound on the minimal completion time for all parallel programs with n processes and a synchronization granularity z executed in the multiprocessor system with static allocation with k processors and a communication delay t, compared to the system with dynamic allocation with q processors (see Paper I and Section 4.1.1):
H(n, k, q, t, z) = min_A H_A(n, k, q, t, z).

Here A denotes an allocation of processes to processors, where a_j processes are allocated to the j:th computer. Hence, the allocation is given by the allocation sequence (a_1, ..., a_k), where a_1 + ... + a_k = n. Furthermore,

H_A(n, k, q, t, z) = g(A, n, k, q) + z · r(A, n, k, t),

where

g(A, n, k, q) = (1 / C(n, q)) Σ_I max(i_1, ..., i_k) · C(a_1, i_1) · ... · C(a_k, i_k),

r(A, n, k, t) = [n(n − 1) − Σ_{i=1..k} a_i(a_i − 1)] / [n(n − 1)] · t,

and C(a, b) denotes the binomial coefficient. The sum is taken over all decreasing sequences I = (i_1, ..., i_k) of nonnegative integers such that i_1 + ... + i_k = q.
• an optimal bound on the minimal completion time of a parallel program executed in the multiprocessor system with m processors, comparing the schedules with two different numbers of preemptions, i and j, where j > i (see Paper II and Section 4.1.2). The bound G(m, i, j) derived there is expressed in terms of a max-min notation (m; n; c), which comes from the following definition:

Def: (m; n; c) = max min(m_1/n_1, ..., m_c/n_c), where the maximum is taken over all sets of integers m_1, ..., m_c and n_1, ..., n_c so that m_1 + ... + m_c = m, n_1 + ... + n_c = n, m_k > 0 and n_k ≥ 0 for all k (Lennerstad and Lundberg, [40]).
Part II: The algorithms:
• Golomb Recovery Scheme (see Paper III and Section 4.2.1)
• Modulo Recovery Scheme (see Paper IV and Section 4.2.2)
• Trapezium Recovery Scheme (see Paper V and Section 4.2.4)
• Optimal Recovery Scheme (see Paper VI and Section 4.2.3)
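To make the idea of a static recovery scheme concrete, the following is a hypothetical sketch (the function, the jump list and the load model are ours for illustration, not the exact constructions of Papers III - VI): each crashed computer's load follows a precomputed list of jump distances to the first surviving computer.

```python
def redistribute(n, jumps, crashed):
    """Static failover among n computers numbered 0..n-1.

    The load of crashed computer c moves to (c + d) mod n for the first
    jump distance d in `jumps` that lands on a surviving computer.
    `jumps` plays the role of the precomputed recovery scheme; choosing
    the distances from a Golomb ruler (distinct pairwise differences)
    is the idea explored in Part II to keep the load spread evenly.
    """
    load = {c: 1 for c in range(n) if c not in crashed}  # one unit each
    for c in crashed:
        for d in jumps:
            target = (c + d) % n
            if target not in crashed:
                load[target] += 1          # failover to first survivor
                break
    return load

# 8 computers, jump distances 1, 3, 7 (all pairwise differences distinct):
print(redistribute(8, [1, 3, 7], crashed={2, 3}))
# {0: 1, 1: 1, 4: 2, 5: 2, 6: 1, 7: 1}
```

Note how the distinct jump distances send the loads of the two adjacent crashed computers to two different survivors instead of piling them onto one.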
2 Multiprocessor Scheduling (Part I)
In this section we introduce a classification of scheduling problems and present related work with bounds and complexity for some of them.
A very good taxonomy of task scheduling in distributed computing systems is presented by Casavant and Kuhl in [8] (see Fig. 2). Here, static means off-line, i.e. the decision for a process is made before it is executed, while dynamic means on-line, i.e. it is unknown when and where the process will execute during its lifetime. In Papers I and II we consider global optimal static scheduling, where we know the information about the parallel program in advance.
2.1 Classification of scheduling problems
We start with a definition of scheduling developed by Graham, Lawler, Lenstra and Rinnooy Kan [28]:
Consider m machines Mi (i = 1,...,m) that have to process n jobs Jj (j = 1,...,n).
“A schedule is an allocation of one or more time intervals on one or more machines to each job. A schedule is feasible if no two time intervals on the same machine overlap, if no two time intervals allocated to the same job overlap, and if, in addition, it meets a number of specific requirements concerning the machine environment and the job characteristics. A schedule is optimal if it minimizes a given optimality criterion.”
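The two overlap conditions in this definition can be checked mechanically. A minimal sketch (the interval representation and names are ours; the extra machine- and job-specific requirements are ignored):

```python
def is_feasible(schedule):
    """schedule: list of (job, machine, start, end) intervals.

    Feasible iff no two intervals on the same machine overlap and
    no two intervals of the same job overlap.
    """
    def overlaps(a, b):
        return a[2] < b[3] and b[2] < a[3]   # half-open interval test
    for i, x in enumerate(schedule):
        for y in schedule[i + 1:]:
            same_machine = x[1] == y[1]
            same_job = x[0] == y[0]
            if (same_machine or same_job) and overlaps(x, y):
                return False
    return True

# Job 1 runs on M1 then M2; job 2 uses M1 only after job 1 has left it.
ok = [(1, 'M1', 0, 2), (1, 'M2', 2, 5), (2, 'M1', 2, 4)]
bad = [(1, 'M1', 0, 3), (2, 'M1', 2, 4)]        # M1 double-booked in [2, 3)
print(is_feasible(ok), is_feasible(bad))         # True False
```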
Furthermore, the authors (Graham et al. [28]) proposed a classification of scheduling problems which has been widely used in the literature, e.g. Lawler et al. [38], Blazewicz et al. [4], Pinedo [55]. The classification shows how many scheduling problems arise if one considers all combinations. This thesis is limited to one type of machines and one criterion, i.e. the minimal completion time.
[Fig. 2. Task Scheduling Characteristics (Casavant and Kuhl, [8]): a hierarchical taxonomy dividing scheduling into local (single processor) and global (multiprocessors); global into static and dynamic; static into optimal (enumerative, graph theory, mathematical programming, queuing theory) and suboptimal (approximate, heuristic); dynamic into physically distributed (cooperative, with optimal/suboptimal branches, and non-cooperative) and physically non-distributed.]
A classification of scheduling problems is presented as a three-field notation α | β | γ, where α represents the processor environment, β the job characteristics and γ the optimality criterion, as follows.

The first field consists of two parameters, α = α1 α2, where α1 ∈ {∘, P, Q, R, O, F, J}: ∘ (an empty symbol) represents a single processor; P: identical processors; Q: uniform processors, i.e. processors with a given speed; R: unrelated processors, i.e. processors with job-dependent speeds; O: dedicated processors, open shop sequencing, in which each job has to be processed again on each one of the m processors and there are no restrictions regarding the order of each job; F: dedicated processors, flow shop sequencing, in which each job has to be processed on each one of the m processors in a fixed order, i.e. first on processor one, then on processor two, and so on; after completion on one processor, a job joins the queue at the next processor; J: dedicated processors, job shop sequencing, in which each job has its own order to follow; and α2 = m: the number of processors (Pinedo, [55]).
In Paper I we describe a scenario of identical machines with different numbers of processors, i.e. a set Pk for the parallel architecture with k identical processors and static allocation, and a set Pq with q identical processors and dynamic allocation. Then α1 = P. In Paper II we describe a scenario of identical machines with m processors, so also here α1 = P.
The second field β describes the job characteristics. Here we describe only some of the possible values, β ⊆ {pmtn, prec, tree, res, r_j, p_j = 1}, where β = ∘ means that there are no restrictions on the jobs; pmtn means that preemptions are allowed; prec is a precedence relation between the jobs; tree is a rooted tree representation; res specifies resource constraints; r_j is release dates; p_j = 1: each job has a unit processing requirement (occurs only if α1 ∈ {∘, P, Q}).

In Paper I we do not have any job characteristics, so for both cases (Pk and Pq) β = ∘. In Paper II we present the scenario with a limited number of preemptions, i.e. β = {pmtn}.
The last field γ describes the optimality criterion. The most commonly chosen are the maximum completion time or makespan (Cmax), the total completion time (Σ C_j), or the maximum lateness (Lmax), where the lateness of a task is its completion time minus its deadline [Graham et al., 28].

In Papers I and II we are interested in minimizing the makespan, i.e. in minimizing the maximum completion time, so γ = Cmax.

Using this notation we can represent the problem of minimizing the maximum completion time on identical parallel machines allowing preemption as P | pmtn | Cmax.
2.2 Bounds and Complexity on Multiprocessor Scheduling
The problem of scheduling a parallel program on m machines has been shown to be NP-hard (Garey et al., [21]). Many scheduling problems are easier to solve when we allow preemptions, e.g. the problem of minimizing the maximum completion time (Cmax) without preemptions on two identical processors (P2 || Cmax) is NP-hard (Karp [33]), while the preemptive parallel scheduling problem for more than 2 processors (P | pmtn | Cmax) is solvable in O(n) time (McNaughton, [50]). However, there is still the practical question of whether or not to allow preemptions.
In 1966, Graham [26], using the List Scheduling (LS) rule, proved that Cmax^LS / Cmax* ≤ 2 − 1/m, where m is the number of processors and Cmax* denotes the maximum completion time for the optimal schedule. The idea of list scheduling is to make an ordered list of processes by assigning them some priorities. The next job from the list is assigned to the first available machine. One of the most often used algorithms for solving the P || Cmax problem is the Longest Processing Time (LPT) algorithm, which is a kind of list scheduling. Here, the jobs are arranged in decreasing processing time order, and when a machine becomes available, the largest remaining job begins processing. The complexity of this algorithm is O(n log n) and the upper bound established by Graham ([27]) is Cmax^LPT / Cmax* ≤ 4/3 − 1/(3m). However, in the LPT algorithm the cost for process reallocation and synchronization is neglected. To achieve better performance, Coffman et al. [12] introduced another approximation algorithm, called Multifit (MF), with makespan Cmax^MFk / Cmax* ≤ 1.22 + 2^(−k). This algorithm is based on bin-packing techniques, where the jobs are taken in non-increasing order and each job is placed into the first processor into which it fits. This bound was further improved by Friesen [19] to 1.20, by Friesen and Langston [20] to 1.18, and then by Yue [62] to 13/11, which is tight. Hochbaum and Shmoys [29] have developed a polynomial approximation scheme, based on the multifit approach, with computational complexity O((n/ε)^(1/ε²)) for n processes.
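The LPT rule described above can be sketched in a few lines; a minimal illustration (not code from the thesis), using a heap of machine loads for the O(n log n) behavior:

```python
import heapq

def lpt_makespan(processing_times, m):
    """Longest Processing Time list scheduling on m identical machines.

    Jobs are sorted in decreasing order; each job goes to the currently
    least-loaded machine. Runs in O(n log n).
    """
    loads = [0.0] * m                       # min-heap of machine loads
    heapq.heapify(loads)
    for p in sorted(processing_times, reverse=True):
        heapq.heappush(loads, heapq.heappop(loads) + p)
    return max(loads)

# A classical worst case for m = 2: jobs 3, 3, 2, 2, 2.
# LPT gives makespan 7 while the optimum is 6 ({3,3} vs {2,2,2}),
# so the ratio 7/6 matches Graham's bound 4/3 - 1/(3*2) = 7/6 exactly.
print(lpt_makespan([3, 3, 2, 2, 2], 2))    # 7.0
```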
Lennerstad and Lundberg [39] established an optimal upper bound on the gain of using two different parallel architectures, with and without migration of processes, where the numbers of processors in both systems are equal. In [41] the result is extended to a scenario with different numbers of processors. Paper I is an extension of those papers, i.e. the cost for communication and synchronization between processors is included (see Paper I and Section 4.1.1).
The problem of minimizing the makespan with preemptions, P | pmtn | Cmax, can be solved very efficiently with time complexity O(n). The makespan of any preemptive schedule is at least

Cmax* = max( max_j p_j , (1/m) Σ_{j=1..n} p_j )

(McNaughton, [50]). However, by the McNaughton rule, no more than m − 1 preemptions are needed in the unlimited case.
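McNaughton's wrap-around rule is simple enough to sketch; a minimal illustration (function and variable names are ours), which fills machines up to the optimal makespan one after another, splitting a job across two machines where necessary:

```python
def mcnaughton(p, m):
    """McNaughton's wrap-around rule for P | pmtn | Cmax.

    p: list of processing times, m: number of identical machines.
    Returns (makespan, schedule) where schedule[i] is a list of
    (job, start, end) pieces on machine i. Uses at most m - 1 preemptions.
    """
    cmax = max(max(p), sum(p) / m)          # optimal preemptive makespan
    schedule = [[] for _ in range(m)]
    machine, t = 0, 0.0
    for job, pj in enumerate(p):
        remaining = pj
        while remaining > 1e-12:
            piece = min(remaining, cmax - t)
            schedule[machine].append((job, t, t + piece))
            t += piece
            remaining -= piece
            if cmax - t <= 1e-12:           # machine full: wrap to next one
                machine, t = machine + 1, 0.0
    return cmax, schedule

cmax, sched = mcnaughton([4, 3, 3, 2], 2)   # sum 12 on 2 machines
print(cmax)                                  # 6.0 (job 1 is split once)
```

A split job never overlaps itself: its first piece ends at time cmax on one machine and its second piece starts at time 0 on the next, and no job is longer than cmax.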
For scheduling with precedence constraints, i.e. P | prec | Cmax, Graham's [26] bound from 1966 (Cmax^LS / Cmax* ≤ 2 − 1/m) still holds. If the jobs have unit length, the problem P | prec, p_j = 1 | Cmax is NP-hard (Ullman, [60]). However, if the number of processors is limited to 2, then P2 | prec, p_j = 1 | Cmax is solvable in O(n²) time (Coffman & Graham, [13]). Coffman and Garey [11] proved that the least makespan achievable by a non-preemptive schedule is no more than 4/3 the least makespan achievable when preemptions are allowed.

In 1972, Liu [43] conjectured that for any set of tasks and precedence constraints among them, running on two processors, the least makespan achievable by a nonpreemptive schedule is no more than 4/3 the least makespan achievable by a preemptive schedule. The conjecture was proved in 1993 by Coffman and Garey [11]. The authors also generalize the result to the numbers 4/3, 3/2, 8/5, ..., i.e. to the numbers 2k/(k+1) for some k ≥ 2. The number k depends on the relative number of preemptions available.

In 2003, Braun and Schmidt [5] proved a formula that compares a preemptive schedule with i preemptions to a schedule with an unlimited number of preemptions in the worst case. They generalized the bound 4/3 to the formula Cmax^(i,p)* / Cmax^(p)* ≤ 2 − 2/(⌈m/(i+1)⌉ + 1).

In Paper II we extend the results of Braun and Schmidt [5] by comparing a preemptive schedule with i preemptions to a schedule with j preemptions, where j > i (see Paper II and Section 4.1.2).
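The Coffman-Garey sequence 4/3, 3/2, 8/5, ... mentioned above can be generated directly from the closed form 2k/(k+1); a minimal illustration:

```python
from fractions import Fraction

# The Coffman-Garey numbers 2k/(k+1) for k = 2, 3, 4, ...: the worst-case
# nonpreemptive/preemptive makespan ratios 4/3, 3/2, 8/5, 5/3, ...,
# approaching 2 as k grows.
ratios = [Fraction(2 * k, k + 1) for k in range(2, 6)]
print(ratios)   # [Fraction(4, 3), Fraction(3, 2), Fraction(8, 5), Fraction(5, 3)]
```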
3 Load Balancing and Fault Tolerance (Part II)
This section presents related work on redistributing the workload in fault-tolerant systems, and a definition and classification of faults.
An important issue in a multiprocessor system is how to redistribute the workload in the case of a fault in one or more processors. In that case the load should be redistributed to the other processors in the system while maintaining maximal load balance.
The load balancing problem can be handled by dynamic policies, where transfer decisions depend on the actual current system state, or by static policies, which are generally based on information about the average behavior of the system. The transfer decisions are then independent of the actual current system state, which makes static policies less complex than dynamic ones [Kameda et al., 32].
A system is fault tolerant “if its programs can be properly executed despite the occurrence of logic faults” (Avizienis, 1967, [2]). Krishna and Shin [35] define fault tolerance as “an ability of a system to respond gracefully to an unexpected hardware or software failure”. Therefore, many fault-tolerant computer systems mirror all operations, i.e. every operation is performed on two or more duplicate systems, in the sense that if one fails the other can take over its job. This technique is also used in clusters [Pfister, 54]. In [Cristian, 14] fault tolerance for distributed computing is discussed from a wide viewpoint.
Besides hardware failures, intermittent failures can occur due to software events. In [Vaidya, 61] a "two-level" recovery scheme is presented and evaluated. The recovery scheme has been implemented on a cluster, and the authors evaluate the impact of checkpoint latency on the performance of the recovery scheme. For transaction-oriented systems, Gelenbe [22] has proposed a "multiple checkpointing" approach that is similar to the multi-level recovery scheme presented in [61]. A mathematical model of transaction-oriented systems under intermittent failures is proposed in [Gelenbe and Derochete, 24]. Here, the system is assumed to operate with a standard checkpoint-rollback-recovery scheme. In [Chabridon and Gelenbe, 9] and [Gelenbe and Chabridon, 23] the authors propose several algorithms which can detect task failures and restart failed tasks. They analyze the behavior of parallel programs represented by a random task graph in a multiprocessor environment. However, all these algorithms act dynamically.
If a single element of hardware or software fails and brings down the entire computer system, we talk about a single point of failure (Pfister, [54]). In the thesis we look at a more advanced scenario, where an arbitrary number of faults can occur, i.e. we look at systems with no single point of failure.
3.1 Fault Model
A failure is "an event that occurs when the delivered service deviates from correct service". "The deviation is called an error." "The adjudged or hypothesized cause of an error is called a fault." (Avizienis et al., [3]) The relation between these events is presented in Fig. 3.
Avizienis et al. [3] present the elementary fault classes according to eight basic viewpoints: phase of creation or occurrence, system boundaries, phenomenological causes, dimension, objective, intent, capability and persistence, where each viewpoint comprises two elementary fault classes. If all combinations were possible, there would be 2^8 = 256 different combined fault classes ([3]).
A classification of faults according to the behaviour of processors or communication controllers is the following:
- crash fault - the processor stops computing/transmitting messages;
- omission fault - a message is missed, or a processor loses its content;
- timing fault - the message arrives too late or too early;
- fail-stop fault - a process/message is stopped from executing, possibly forever;
- Byzantine fault - a processor may do anything (Cristian, [14]).
From the point of view of occurrence, faults can be classified into:
- transient fault - a fault starts at a particular time, remains in the system for some period and then disappears;
- permanent fault - a fault starts at a particular time and remains in the system until it is repaired;
- intermittent fault - a fault occurs from time to time (Burns and Wellings, [8]).
Our main focus in Papers III-VI is on permanent crash faults.
3.2 Reliability vs. Availability
Computing and communication systems are characterized by fundamental properties: functionality, performance, dependability and security, and cost (Avizienis, [3]). Reliability and availability are two of the five characteristics of dependability. Both are measured by two components: a mean-time-between-failures (MTBF) and a mean-time-to-repair (MTTR).
Reliability (R) is the ability of a computing system to operate without failing, while availability (A) is a readiness for correct service.
Availability is defined as the proportion of time that the system is up (Laprie et al., [37]): A = MTBF / (MTBF + MTTR) ≈ 1 − MTTR/MTBF, when MTBF is very much greater than MTTR.
Fig. 3. Representation of the chain fault → error → failure ([3])
A system with 0.999 availability is more reliable than a system with 0.99 availability. Furthermore, a system with 0.99 availability has 1 - 0.99 = 0.01 probability of failure. Hence, the failure probability is F = 1 - A and the reliability is R = 1/(1 - A) (Laprie et al., [37]).
The availability of the system can be measured by its number of 9s, e.g. a system with 0.999 availability is called three 9s and belongs to class 3. Fig. 4 presents a classification of the availability of the system, converted to an average down time in a given time period (Pfister [54] and Laprie et al., [37]). The systems of class 4-6 are called high availability systems (Pfister, [54]).

Class / nr of 9s:   2      3      4      5       6
% Available:        99     99.9   99.99  99.999  99.9999
Hours / Year:       87.60  8.76   0.88   0.09    0.01
Minutes / Month:    438    43.8   4.38   0.44    0.04

Fig. 4. Availability classes converted to average down time
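The MTBF/MTTR formula and the down-time conversion in Fig. 4 can be checked with a short sketch (function names are illustrative, not from the thesis):

```python
HOURS_PER_YEAR = 24 * 365            # 8760

def availability(mtbf, mttr):
    """A = MTBF / (MTBF + MTTR)."""
    return mtbf / (mtbf + mttr)

def downtime(avail_pct):
    """Average downtime implied by an availability percentage,
    returned as (hours per year, minutes per month), as in Fig. 4."""
    down = 1.0 - avail_pct / 100.0
    return down * HOURS_PER_YEAR, down * HOURS_PER_YEAR * 60 / 12
```

For example, an MTBF of 999 hours with an MTTR of 1 hour gives A = 0.999 (class 3), which corresponds to about 8.76 hours of downtime per year.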
4 Summary of the papers
This section summarizes the papers included in the thesis.
4.1 Part I
4.1.1 Comparing the optimal performance of parallel architectures (Paper I)
This paper extends the results in [Lennerstad and Lundberg, 39] and [Lennerstad and Lundberg, 42], where the synchronization cost was neglected. Here, a bound on the gain of using a system with q processors and run-time process reallocation compared to using a system with k processors, no reallocation and a communication delay t, for a program with n processes and a synchronization granularity z, is presented. The main contribution in this paper is that we handle and validate a more realistic computer model, where the communication delay t for each synchronization signal and the granularity z of the program are taken into consideration. We present transformations which enable us to separate execution and synchronization. Analyzing the parts separately and comparing them, we found a formula that produces the minimal completion time: H(n, k, q, t, z) = min_A H_A(n, k, q, t, z). Here A denotes an allocation of processes to processors, where a_j processes are allocated to the j:th computer. Hence, the allocation is given by the allocation sequence (a_1, …, a_k), where Σ_{j=1}^{k} a_j = n. Furthermore, the sum below is taken over all sequences I = (i_1, …, i_k) of nonnegative integers such that Σ_{j=1}^{k} i_j = q.

H_A(n, k, q, t, z) = g_A(n, k, q) + z·r_A(n, k, t), where

g_A(n, k, q) = (1 / C(n, q)) · Σ_I max(i_1, …, i_k) · C(a_1, i_1) · … · C(a_k, i_k) and

r_A(n, k, t) = t · (n(n − 1) − Σ_{i=1}^{k} a_i(a_i − 1)) / (n(n − 1)).

The execution part, represented by g_A(n, k, q), corresponds to the previous results (Lundberg and Lennerstad [39, 42]). The function g_A(n, k, q) is convex in the sense that its value decreases if the load of a worst case program is distributed more evenly among the computers. However, the synchronization part, represented by r_A(n, k, t), is concave: this quantity increases if the load of a worst case program is distributed more evenly. This makes the minimization of the sum H a delicate matter. The type of allocation which is optimal depends strongly on the value of tz. If tz is small, then the execution dominates, and partitions representing even distributions, uniform partitions, are optimal. If tz is large, then the synchronization dominates, and partitions where all processes are allocated to the same computer are optimal.
Here is an example of the calculation of the execution part of a program P, calculated first by the formula and then using the vector representation.
Let n = 3, q = 2 and k = 2. Then A = {(a_1, a_2); a_1 + a_2 = 3} = {(1, 2), (2, 1)} and I = {(i_1, i_2); i_1 + i_2 = 2} = {(1, 1), (2, 0), (0, 2)}, and then

g_A(3, 2, 2) = (1·C(1,1)·C(2,1) + 2·C(1,2)·C(2,0) + 2·C(1,0)·C(2,2)) / C(3,2) = (2 + 0 + 2) / 3 = 4/3.

Using the vector representation of program P with the allocation (1, 2) we have:

(a_1, a_2) = (1, 2):    1 1 0    (i_1, i_2) = (1, 1)    max of (i_1, i_2): 1
                        1 0 1    (i_1, i_2) = (1, 1)    max of (i_1, i_2): 1
                        0 1 1    (i_1, i_2) = (0, 2)    max of (i_1, i_2): 2

g_A(3, 2, 2) = Σ max(i_1, i_2) / (nr of rows in the vector representation) = (1 + 1 + 2) / 3 = 4/3.
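The two parts can also be evaluated numerically. The sketch below implements the combinatorial sums as I read them from the (partly garbled) source, and reproduces g_A(3, 2, 2) = 4/3 for the allocation (1, 2); treat the exact formulas as a reconstruction, not as the paper's verbatim definitions:

```python
from math import comb
from itertools import product

def g(alloc, q):
    """Execution part g_A(n, k, q): the average, over all ways of choosing
    q active processes out of n, of the maximum number of active processes
    on any one computer (reconstructed from Paper I's formula)."""
    n = sum(alloc)
    total = 0
    for I in product(*(range(a + 1) for a in alloc)):
        if sum(I) != q:
            continue
        ways = 1
        for a, i in zip(alloc, I):
            ways *= comb(a, i)      # ways to pick i of the a processes on this computer
        total += max(I) * ways
    return total / comb(n, q)

def r(alloc, t):
    """Synchronization part r_A(n, k, t): the fraction of ordered process
    pairs placed on different computers, times the delay t."""
    n = sum(alloc)
    return t * (n * (n - 1) - sum(a * (a - 1) for a in alloc)) / (n * (n - 1))
```

Note that r vanishes when all processes share one computer and grows as the load spreads out, which is exactly the convex-versus-concave tension discussed above.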
4.1.2 The Maximum Gain of Increasing the Number of Preemptions in Multiprocessor
Scheduling (Paper II)
This paper generalizes the results by Braun and Schmidt [5]. We present a tight upper bound on the maximal gain of increasing the number of preemptions for any parallel program consisting of a number of independent jobs when using m identical processors. We calculate how large the ratio of the minimal makespans using i and j preemptions respectively, C_max^{i,p*} / C_max^{j,p*} with i < j, can be. We compare i preemptions with j preemptions in the worst case. We thus allow j from i + 1 to m − 1, while the problem solved in [5] corresponds to j = m − 1. In the case m ≥ i + j + 1, which does not coincide with j = m − 1 unless i = 0, we obtain the optimal bound

C_max^{i,p*} / C_max^{j,p*} ≤ 2(⌊j/(i + 1)⌋ + 1) / (⌊j/(i + 1)⌋ + 2).

For example, excluding one preemption (i = j − 1) can never deteriorate the makespan more than a factor 4/3, but excluding further preemptions may do so. This argument cannot be iterated, since different sets of jobs are worst case, depending on the parameters i and j. In the case m < i + j + 1 we present a formula and a fast algorithm based on the Stern-Brocot tree.
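The Stern-Brocot tree underlying the fast algorithm enumerates all positive rationals by repeated mediants. The sketch below only generates the tree's levels; it is an illustration of the structure, not the paper's algorithm:

```python
def stern_brocot_levels(depth):
    """Levels of the Stern-Brocot tree as (numerator, denominator) pairs.
    Each node is the mediant (a+c)/(b+d) of its bracketing fractions
    a/b and c/d; the tree contains every positive rational exactly once."""
    intervals = [((0, 1), (1, 0))]      # start between 0/1 and the formal 1/0
    levels = []
    for _ in range(depth):
        level, nxt = [], []
        for (a, b), (c, d) in intervals:
            m = (a + c, b + d)          # the mediant
            level.append(m)
            nxt.append(((a, b), m))     # left subinterval
            nxt.append((m, (c, d)))     # right subinterval
        levels.append(level)
        intervals = nxt
    return levels
```

The first levels are 1/1; 1/2, 2/1; 1/3, 2/3, 3/2, 3/1; and so on.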
4.2 Part II
In fault tolerant distributed systems it is difficult to decide on which processor the processes should be executed. When all computers are up and running, we would like the load to be evenly distributed. The load on some processors will, however, increase when one or more processors are down, but also under these conditions we would like to distribute the load as evenly as possible over the remaining processors. The distribution of the load when a computer goes down is decided by the recovery lists of the processes running on the faulty processor. The set of all recovery lists is referred to as the recovery scheme (RS). Hence the load distribution is completely determined by the recovery scheme for any set of processors that are down.
The computer science problem is reformulated in the following papers into different mathematical problems that produce different static recovery schemes. The corresponding mathematical problems, with the corresponding recovery schemes, turn out to be the following:

Paper III (Log RS, Golomb RS, Greedy RS): Given a number n, find the longest sequence of positive integers such that the sum of the sequence is smaller than or equal to n and all sums of subsequences (including subsequences of length one) are unique.

Paper IV and VI (Modulo RS, Optimal RS): Given a number n, find the longest sequence of positive integers such that the sum and the sums of all subsequences (including subsequences of length one) modulo n are unique.

Paper V (Trapezium RS): Given a number n, find the longest sequence S = ⟨s_1, s_2, …, s_m⟩ of positive integers such that:
a) the sum of the elements in the sequence is less than or equal to n;
b) l is an integer such that l ≥ (√(8k + 9) − 3)/2, where k is the number of crashed nodes;
c) the first l crash routes are disjoint;
d) the following crash routes have at least l different values compared to the previous crash routes.
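The Paper III property, that all contiguous-subsequence sums are distinct, is easy to test exhaustively for short sequences, and a greedy search in that spirit can be sketched as follows (my own illustration of the idea, not the thesis' exact Greedy RS construction):

```python
def distinct_partial_sums(seq):
    """Paper III property: all sums of contiguous subsequences
    (including subsequences of length one) are distinct."""
    sums = [sum(seq[i:j]) for i in range(len(seq))
            for j in range(i + 1, len(seq) + 1)]
    return len(sums) == len(set(sums))

def greedy_sequence(n):
    """Greedily extend a sequence with the distinct-sums property
    while the total stays <= n."""
    seq = []
    while True:
        x = 1
        # smallest extension that keeps all partial sums distinct
        while sum(seq) + x <= n and not distinct_partial_sums(seq + [x]):
            x += 1
        if sum(seq) + x > n:        # budget exhausted
            return seq
        seq.append(x)
```

For example, the Golomb-derived sequence <1, 3, 5, 2> from Paper III passes the test, while <1, 2, 3> fails because 3 = 1 + 2.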
4.2.1 Using Golomb Rulers for Optimal Recovery Schemes in Fault Tolerant
Distributed Computing (Paper III)
This paper extends the results presented in Lundberg and Svahnberg [47]. Here we present how the Golomb ruler (a sequence of non-negative integers such that no two distinct pairs of numbers from the set have the same difference; see Fig. 5) is applied to the problem of finding an optimal recovery scheme. Golomb rulers are known for lengths up to 41912 (with 211 marks). Of these, rulers up to length 373 (with 23 marks) are known to be optimal.
The Golomb recovery lists are built as follows:
Let G_n be the Golomb ruler with sum n + 1 and let G_n(x) be the x:th entry in G_n, e.g. G_12 = <1,4,9,11> and G_12(1) = 1, G_12(2) = 4 and so on. Let g_n = x be the number of crashed computers with optimal behavior when we have n computers, e.g. g_12 = 4. For intermediate values of k we use the smaller Golomb ruler, and the rest of the recovery list is filled with the remaining numbers up to n − 1. For example, by filling with the remaining numbers, the ruler G_12 gives the list {1,4,9,11,2,3,5,6,7,8,10}.
The sequence referred to as the Golomb recovery scheme is in this case <1,3,5,2> (i.e. the differences between the numbers in the list). All other recovery lists are obtained from this one by adding a fixed number to all entries in the modulo sense, i.e. R_i = {(i+1) mod n, (i+2) mod n, (i+3) mod n, …, (i+n−1) mod n}, where R_i is the recovery list for process i. If n = 12 we get: R_0 = <1,3,5,2>, R_1 = <2,4,6,3>, R_2 = <3,5,7,4> and so on.
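The two construction steps above, filling the ruler out to a full list and rotating it per process, can be sketched as follows (function names are mine):

```python
def golomb_recovery_list(ruler, n):
    """Recovery list from the Golomb ruler entries, e.g. G12 = <1, 4, 9, 11>:
    the ruler entries first, then the remaining numbers 1..n-1 in order."""
    return list(ruler) + [x for x in range(1, n) if x not in ruler]

def rotate(recovery_list, i, n):
    """Recovery list for process i: add i to every entry modulo n,
    e.g. R_1 = <2, 4, 6, 3> is obtained from R_0 = <1, 3, 5, 2> for n = 12."""
    return [(i + x) % n for x in recovery_list]
```

This reproduces the list {1,4,9,11,2,3,5,6,7,8,10} and the rotated lists R_1, R_2, … from the example above.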
In this paper we also present the greedy algorithm, which is constructed from a sequence with distinct partial sums. It can guarantee optimal behavior until ⌊log_2 n⌋ computers break down, but we can easily calculate it also for large n, where no Golomb rulers are known.
4.2.2 Using Modulo Rulers for Optimal Recovery Schemes in Distributed Computing (Paper IV)
Paper IV extends the Golomb recovery scheme. In the formulation which can be handled by the Golomb rulers, the wrap-arounds are ignored, i.e. the situations when the total number of "jumps" for a process is larger than the number of computers in the cluster. This problem gives a new mathematical formulation: finding the longest sequence of positive integers such that the sum and the sums of all subsequences (including subsequences of length one) modulo n are unique (for a given n). This
Fig. 5. The Golomb ruler with marks 0, 1, 4, 9, 11: consecutive differences 1, 3, 5, 2 and all pairwise differences 1, 3, 5, 2, 4, 8, 7, 9, 10, 11
mathematical formulation of the computer science problem gives new, more powerful recovery schemes, called Modulo schemes, that are optimal for a larger number of crashed computers. Fig. 6 presents an example of a modulo-11 sequence with all modulo differences, which does not exist in the previous case. The recovery lists (and recovery scheme) are constructed in the same way as the Greedy or Golomb recovery lists (and scheme).
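The Modulo property can be checked the same way as the Paper III property, now taking all sums modulo n so that wrap-arounds count (the sequence <1, 5, 8, 7> for n = 11 is my reading of Fig. 6):

```python
def distinct_partial_sums_mod(seq, n):
    """Paper IV property: all contiguous-subsequence sums
    (including subsequences of length one) are distinct modulo n."""
    sums = [sum(seq[i:j]) % n for i in range(len(seq))
            for j in range(i + 1, len(seq) + 1)]
    return len(sums) == len(set(sums))
```

The Golomb-derived sequence <1, 3, 5, 2> also satisfies the modulo property for n = 12, but in general the modulo condition is stricter, since sums that were distinct may collide after reduction modulo n.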
4.2.3 Extended Golomb Rulers as the New Recovery Schemes in Distributed
Dependable Computing (Paper V)
A contribution of this paper is that we present new recovery schemes, called trapezium recovery schemes, where the first part of the scheme is based on the known Golomb rulers (i.e. the crash routes are disjoint) and the second part is constructed such that the following crash routes have at least l unique values compared to the previous crash routes. The trapezium recovery schemes guarantee better performance than the Golomb schemes and are simple to calculate. Fig. 7 compares the number of crashes handled by the trapezium scheme ("trapezium") with the performance of the scheme using Golomb rulers ("golomb") as a function of the number of nodes in a cluster (n), up to n = 1024.
The goal of this paper was to find good recovery schemes that are better than those already known and that are easy to calculate also for large n. In Paper V we have found
Sequence <1, 5, 8, 7> with marks 0, 1, 6, 3, 10 and modulo differences 1, 5, 8, 7, 6, 2, 4, 3, 9, 10
Fig. 6. Modulo sequence for n = 11 with all differences
Fig. 7. The difference between the trapezium scheme (“trapezium”) and OGRs scheme (“golomb”)
the best possible recovery schemes for any number of crashed nodes in a cluster. Finding such a recovery scheme is a very computationally complex task. Due to the complexity of the problem we have only been able to present optimal recovery schemes for a maximum of 21 nodes in a cluster.
4.2.4 Optimal Recovery Schemes in Fault Tolerant Distributed Computing (Paper VI)
In Paper VI we calculate the best possible recovery schemes for any number of crashed computers. We give strict priority to a small number of computers down compared to a large number. That means that we select the set of recovery schemes with optimal worst case behavior when two computers are down, and among these select the recovery schemes that have optimal behavior when three computers are down, and so on. We define the set R(n, p) of recovery schemes that minimizes the maximal load for 1, 2, …, p computers down by the formula

R(n, p) = { R : L(n, i, R) = min_P L(n, i, P), i = 1, …, p },

where L(n, p, R) is a load sequence that defines the worst-case behavior after p crashes when using the recovery scheme R. The optimal load sequence is denoted by SV. The lower bound is MV = max_j (BV(j) · n/(n − j)), where BV is a bound vector that contains exactly k entries that equal k, for all k ≥ 2.
It is not known whether the lower bound MV is tight. In this paper we investigate the tightness of the bound MV using the optimal load sequence SV. We present an algorithm with which we calculate the optimal bound SV. In many instances, when we have
Fig. 8. Comparison of the optimal recovery schemes sequences with the modulo sequences
a larger number of crashed computers, SV does not coincide with MV. Fig. 8 shows to what extent MV is tight. In the grey area MV is tight, since MV = SV there. For larger numbers of crashed computers q, MV < SV, so MV is not tight.
5 Future Work
The bounds obtained in Part I (Papers I and II) are tight within the problem formulation. However, there is a universe of possibilities to find new formulations of both more specific and more general practical situations that have no answer yet, but where the present results may provide a useful foundation for finding answers. Both Paper I and Paper II are in fact extensions of previously studied (and optimally solved) problems. Apart from the results themselves, this is the major contribution of the thesis to the research domain.
From a mathematical point of view, the results have opened new connections between parallel computer performance and combinatorics. Golomb rulers have found new applications, and new combinatorial problems have arisen and been solved. These problems may be studied and refined as mathematical problems, which then can be interpreted in a computer setting. The new application of the Stern-Brocot tree is one example. I believe that it is possible to find more such connections between computer systems engineering and combinatorics.
6 References
1. Ahmad, I., Kwok, Y.-K., and Wu, M.-Y., Analysis, Evaluation, and Comparison of
Algorithms for Scheduling Task Graphs on Parallel Processors, in Proceedings of
the International Symposium on Parallel Architectures, Algorithms, and Networks, Beijing, China, June 1996, pp. 207-213
2. Avizienis, A., Design of Fault-Tolerant Computers, in Proceedings of Fall Joint Computer Conference, AFIPS Conference Proceeding, Vol. 31, Thompson Books, Washington, D.C., 1967, pp. 733-743
3. Avizienis, A., Laprie, J.-C., Randell, B., and Landwehr, C., Basic Concepts and
Taxonomy of Dependable and Secure Computing, IEEE Transactions on Dependable
and Secure Computing, Vol. 1, No. 1, January-March 2004, pp. 11-33
4. Blazewicz, J., Ecker, K. H., Pesch, E., Schmidt, G., Weglarz, J., Scheduling
Computer and Manufacturing Processes, Springer Verlag, New York, NY 1996, ISBN
3-540-61496-6
5. Braun, O., Schmidt, G., Parallel Processor Scheduling with Limited Number of
Preemptions, SIAM Journal of Computing, Vol. 32, No. 3, 2003, pp. 671-680
6. Bruno, J., Coffman Jr., E.G., and Sethi, R., Scheduling Independent Tasks To Reduce
Mean Finishing Time, Communications of the ACM, Vol. 17, No. 7, July 1974, pp.
382-387
7. Burns, A., and Wellings, A., Real-Time Systems and Programming Languages, Third Edition, Pearson, Addison Wesley, ISBN: 0-201-72988-1
8. Casavant, T. L., Kuhl, J. G., A Taxonomy of Scheduling in General-Purpose
Distributed Computing Systems, IEEE Transactions on Software Engineering, Vol. 14,
No. 2, February 1988, pp. 141-154
9. Chabridon, S., Gelenbe, E., Failure Detection Algorithms for a Reliable Execution
of Parallel Programs, in Proceedings of the 14th Symposium on Reliable
Distributed Systems, SRDS'14, Bad Neuenahr, Germany, September 1995
10. Coffman Jr., E. G., Computer and Job-Scheduling Theory, John Wiley and Sons, Inc., New York, NY, 1976
11. Coffman Jr., E. G., Garey, M. R., Proof of the 4/3 Conjecture for Preemptive vs.
Nonpreemptive Two-Processor Scheduling, Journal of the ACM, Vol. 40, No. 5,
November 1993, pp. 991-1018, ISSN:0004-5411
12. Coffman Jr., E. G., Garey, M. R., Johnson, D. S., An Application of Bin Packing to
Multiprocessor Scheduling, SIAM J. Computing, 7 (1978), pp. 1-17
13. Coffman Jr., E. G., Graham, R., Optimal Scheduling for Two-Processor Systems, Acta Informatica, 1 (1972), pp. 200-213
14. Cristian, F., Understanding Fault Tolerant Distributed Systems, Journal of the ACM, Vol. 34, Issue 2, 1991
15. Dollas, A., Rankin, W. T. and McCracken, D., A New Algorithm for Golomb Ruler
Derivation and Proof of the 19 Marker Ruler, IEEE Trans. Inform. Theory, Vol. 44,
No. 1, January 1998, pp. 379-382
16. El-Rewini, H., Ali, H. H., and Lewis, T. G., Task Scheduling in Multiprocessor
17. El-Rewini, H., Ali, H. H., Static Scheduling of Conditional Branches in Parallel
Programs, Journal of Parallel and Distributed Computing, Vol. 24, No. 1, January
1995, pp. 41-54
18. Flynn, M.J., Very High-Speed Computing Systems, Proceedings of the IEEE, Vol. 54, No. 12, 1966, pp. 1901-1909
19. Friesen, D. K., Tighter Bounds for the Multifit Processor Scheduling Algorithm, SIAM Journal of Computing, 13 (1984), pp. 170-181
20. Friesen, D. K., and Langston, M. A., Evaluation of a MULTIFIT-Based Scheduling
Algorithm, Journal of Algorithms, Vol. 7, Issue 1, March 1986, pp. 35-59
21. Garey, M. R. and Johnson, D. S., Computers and Intractability - A Guide to the
Theory of NP-Completeness, W. H. Freeman and Company, New York, 1979
22. Gelenbe, E., A Model for Roll-back Recovery With Multiple Checkpoints, in Proceedings of the 2nd ACM Conference on Software Engineering, October 1976, pp. 251-255
23. Gelenbe, E., Chabridon, S., Dependable Execution of Distributed Programs, Elsevier, Simulation Practice and Theory, Vol. 3, No. 1, 1995, pp. 1-16
24. Gelenbe, E., Derochete, D., Performance of Rollback Recovery Systems under
Intermittent Failures, Communication of the ACM, Vol. 21, No. 6, June 1978, pp.
493-499
25. Gerasoulis, A., Yang, T., A Comparison of Clustering Heuristics for Scheduling
DAG’s on Multiprocessors, Journal of Parallel and Distributed Computing, 16,
1992, pp. 276-291
26. Graham, R. L., Bounds for Certain Multiprocessing Anomalies, Bell System Technical Journal, Vol. 45, No. 9, November 1966, pp. 1563-1581
27. Graham, R. L., Bounds on Multiprocessing Timing Anomalies, SIAM Journal of Applied Mathematics, Vol. 17, No. 2, 1969, pp. 416-429
28. Graham, R. L., Lawler, E. L., Lenstra, J. K., Rinnooy Kan, A. H. G., Optimization
and Approximation in Deterministic Sequencing and Scheduling: A Survey, Annals
of Discrete Mathematics 5, 1979, pp. 287-326
29. Hochbaum, D. S., Shmoys, D. B., Using Dual Approximation Algorithms for
Scheduling Problems Theoretical and Practical Results, Journal of the ACM
(JACM), Vol. 34, Issue 1, January 1987, pp. 144-162
30. Hou, E. S. H., Hong, R., and Ansari, N., Efficient Multiprocessor Scheduling
Based on Genetic Algorithms, Proceedings of the 16th Annual Conference of the
IEEE Industrial Electronics Society - IECON'90, Vol II, Pacific Grove, CA, USA, IEEE, New York, November 1990, pp. 1239-1243
31. Hu, T. C., Parallel Sequencing and Assembly Line Problems, Operations Research, Vol. 19, No. 6, November 1961, pp. 841-848
32. Kameda, H., Fathy, E.-Z. S., Ryu, I., Li, J., A performance Comparison of Dynamic
vs. Static Load Balancing Policies in a Mainframe - Personal Computer Network Model, Information: an International Journal, Vol. 5, No. 4, December 2002, pp.
33. Karp, R. M., Reducibility Among Combinatorial Problems, in R. E. Miller and J. W. Thatcher (editors): Complexity of Computer Computations. New York, 1972, pp. 85–104
34. Khan, A., McCreary, C. L., and Jones, M. S., A Comparison of Multiprocessor
Scheduling Heuristics, in Proceedings of the 1994 International Conference on
Parallel Processing, CRC Press, Inc., Boca Raton, FL, pp. 243-250
35. Krishna, C. M., Shin, K. G., Real-Time Systems, McGraw-Hill International Editions, Computer Science Series, 1997, ISBN 0-07-114243-6
36. Kwok, Y-K., and Ahmad, I., Static Scheduling Algorithms for Allocating Directed
Task Graphs to Multiprocessors, ACM Computing Surveys, Vol. 31, No. 4,
December 1999, pp. 406-471
37. Laprie, J.-C., Avizienis, A., Kopetz, H., Dependability: Basic Concepts and
Terminology, 1992, ISBN: 0387822968
38. Lawler, E. G., Lenstra, J. K., Rinnooy Kan A. H. G., Shmoys, D. B., Sequencing
and Scheduling: Algorithms and Complexity, S.C. Graves et al., Eds., Handbooks
in Operations Research and Management Science, Vol. 4, Chapter 9, Elsevier, Amsterdam, 1993, pp. 445-522
39. Lennerstad, H., Lundberg, L., An Optimal Execution Time Estimate of Static versus
Dynamic Allocation in Multiprocessor Systems, SIAM Journal of Computing, 24
(4), 1995, pp. 751-764
40. Lennerstad, H., Lundberg, L., Generalizations of the Floor and Ceiling Functions
Using the Stern-Brocot Tree, Research Report No. 2006:02, Blekinge Institute of
Technology, Karlskrona 2006
41. Lennerstad, H., Lundberg, L., Optimal Combinatorial Functions Comparing
Multiprocessor Allocation Performance in Multiprocessor Systems, SIAM Journal of
Computing, 29 (6), 2000, pp. 1816-1838
42. Lennerstad, L., Lundberg, L., Optimal Scheduling Combinatorics, Electronic Notes in Discrete Mathematics, Vol. 14, Elsevier, May 2003
43. Liu, C. L., Optimal Scheduling on Multiprocessor Computing Systems, in Proceed-ings of the 13th Annual Symposium on Switching and Automata Theory, IEEE Computer Society, Los Alamitos, CA, 1972, pp. 155-160
44. Lennerstad, H. and Lundberg, L., An Optimal Execution Time Estimate of Static
versus Dynamic Allocation in Multiprocessor Systems, SIAM Journal of
Computing, 24 (4), 1995, pp. 751-764
45. Lundberg, L., Performance Bounds on Multiprocessor Scheduling Strategies for
Chain Structured Programs, Computer Science Section of BIT, Vol. 33, No. 2,
1993, pp. 190-213
46. Lundberg, L., Lennerstad, H., Klonowska, K., Gustafsson, G., Using Optimal
Golomb Rulers for Minimizing Collisions in Closed Hashing, Proceedings of
Advances in Computer Science - ASIAN 2004, Thailand, December 2004; Lecture Notes in Computer Science, 3321 Springer 2004, pp. 157-168
47. Lundberg, L., and Svahnberg, C., Optimal Recovery Schemes for High-Availability
Cluster and Distributed Computing, Journal of Parallel and Distributed Computing
48. Markatos, E. P., and LeBlanc, T. B., Locality-Based Scheduling for Shared
Memory Multiprocessors, Technical Report TR93-0094, ICS-FORTH, Heraklio, Crete,
Greece, August 1993, pp. 2-3
49. McCreary, C., Khan, A. A., Thompson, J. J., and McArdle, M. E., A Comparison
of Heuristics for Scheduling DAG’s on Multiprocessors, in Proceedings of the
Eighth International Parallel Processing Symposium, April 1994, pp. 446-451
50. McNaughton, R., Scheduling with Deadlines and Loss Functions, Management
Science, 6, 1959, pp. 1-12
51. Nanda, A. K., DeGroot, D. and Stenger, D. L., Scheduling Directed Task Graphs
on Multiprocessors Using Simulated Annealing, Proceedings of the IEEE 12th
International Conference on Distributed Computing Systems, Yokohama, Japan, IEEE Computer Society, Los Alamitos, June 1992, pp. 20-27
52. Pande, S. S., Agrawal, D. P., and Mauney, J., A Threshold Scheduling Strategy for
Sisal on Distributed Memory Machines, Journal on Parallel and Distributed
Computing, Vol. 21, No. 2, May 1994, pp. 223-236
53. Papadimitriou, C. H., and Yannakakis, M., Scheduling Interval-Ordered Tasks, SIAM Journal on Computing, Vol. 8, 1979, pp. 405-409
54. Pfister, G. F., In Search of Clusters, The Ongoing Battle in Lowly Parallel
Computing, Prentice Hall PTR, 1998, ISBN 0-13-899709-8
55. Pinedo, M., Scheduling: Theory, Algorithms, and Systems (2nd Edition), Prentice Hall; 2 edition, 2001, ISBN 0-13-028138-7
56. Shirazi, B., Kavi, K., Hurson, A. R., and Biswas, P., PARSA: A Parallel Program
Scheduling and Assessment Environment, in Proceedings of International
Conference on Parallel Processing, ICPP 1993, August 1993, Vol. 2, pp. 68-72
57. Shirazi, B., Wang, M., Analysis and Evaluation of Heuristic Methods for Static
Task Scheduling, Journal of Parallel and Distributed Computing, Vol. 10, 1990,
pp. 222-232
58. Soliday, S. W., Homaifar, A., Lebby, G. L., Genetic Algorithm Approach to the
Search for Golomb Rulers, in Proceedings of the International Conference on
Genetic Algorithms, Pittsburg, PA, USA, 1995, pp. 528-535
59. Tang, P., Yew P.-C., and Zhu, C.-Q., Impact of Self-Scheduling Order on
Performance of Multiprocessor Systems, in Proceedings of the 2nd International
Conference on Supercomputing, June 1988, St. Malo, France, pp. 593-603
60. Ullman, J. D., NP-Complete Scheduling Problems, Journal of Computer and System Sciences, Vol. 10, 1975, pp. 384-393
61. Vaidya, N. H., Another Two-Level Failure Recovery Scheme: Performance Impact
of Checkpoint Placement and Checkpoint Latency, Technical Report 94-068,
Department of Computer Science, Texas A&M University, December 1994
62. Yue, M., On the Exact Upper Bound for the Multifit Processor Scheduling
Algorithm, Annals of Operations Research, Vol. 24, No. 1, December 1990, pp. 233-259
63. Zapata, O. U. P., Alvarez, P. M., EDF and RM Multiprocessor Scheduling
Algorithms: Survey and Performance Evaluation, Report No.
Comparing the Optimal Performance
of Parallel Architectures
Kamilla Klonowska, Lars Lundberg, Håkan Lennerstad and Magnus Broberg
The Computer Journal, Vol. 47, No. 5, 2004
Abstract
Consider a parallel program with n processes and a synchronization granularity z. Consider also two parallel architectures: an SMP with q processors and run-time reallocation of processes to processors, and a distributed system (or cluster) with k processors and no run-time reallocation. There is an inter-processor communication delay of t time units for the system with no run-time reallocation. In this paper we define a function H(n, k, q, t, z) such that the minimum completion time for all programs with n processes and a granularity z is at most H(n, k, q, t, z) times longer using the system with no reallocation and k processors compared to using the system with q processors and run-time reallocation. We assume optimal allocation and scheduling of processes to processors. The function H(n, k, q, t, z) is optimal in the sense that there is at least one program, with n processes and a granularity z, such that the ratio is exactly H(n, k, q, t, z). We also validate our results using measurements on distributed and multiprocessor Sun/Solaris environments. The function H(n, k, q, t, z) provides important insights regarding the performance implications of the fundamental design decision of whether to allow run-time reallocation of processes or not. These insights can be used when doing the proper cost/benefit trade-offs when designing parallel execution platforms.