
KTH Information and Communication Technology

Multiprocessor Scheduling of

Synchronous Data Flow Graphs using Local Search Algorithms

EMMA CALLERSTRÖM
KAJSA ELFSTRÖM

Bachelor’s Thesis at the School of Information and Communication Technology
Supervisor and Examiner: Ingo Sander

TRITA-ICT-EX-2014: 62


Abstract

Design space exploration (DSE) is the process of exploring design alternatives before implementing real-time multiprocessor systems. One part of DSE is scheduling the applications the system is developed for and evaluating their performance to ensure that the real-time requirements are satisfied. Many real-time systems today use multiprocessors, and finding the optimal schedule for an application on a multiprocessor system is known to be an NP-hard problem. Such an optimization problem can be time-consuming, which justifies the use of heuristics. This thesis presents an approach for scheduling applications onto multiprocessors using local search algorithms. The applications are represented by SDF graphs, and the assumed platform has homogeneous processors without constraints regarding communication delays, memory consumption or buffer sizes. The goal of this thesis project was to investigate whether heuristic search algorithms could find sufficiently good solutions in a reasonable amount of time. Experimental results show that local search algorithms have the potential to contribute to DSE by finding high-performance schedules with reduced search time compared to algorithms that search for the optimal scheduling solution.


Referat

Scheduling of Synchronous Data Flow Graphs on Multiprocessors using Local Search Algorithms

The process that precedes the implementation of a real-time system, in which different design alternatives are explored, is called design space exploration (DSE). One part of DSE is scheduling the applications the system is intended for and evaluating their performance to ensure that the real-time requirements are met. Many real-time systems today use multiprocessors, and finding the optimal schedule for an application on a multiprocessor system is considered an NP-hard problem. Such an optimization problem is often time-consuming, which motivates the use of a heuristic. This report presents an approach for scheduling applications onto multiprocessors using local search algorithms. The applications are represented as SDF graphs, and the platform is assumed to have homogeneous processors that communicate without delay and have no restrictions on buffer sizes or memory consumption. The goal of the bachelor's degree project was to investigate whether heuristic search algorithms can find sufficiently good schedules within a reasonable search time. The results show that local search algorithms have the potential to contribute to DSE by producing good solutions in less time than algorithms that search for the optimal solution.


Contents

List of Figures
List of Tables
Abbreviations

1 Introduction
  1.1 Background
    1.1.1 Multiprocessor Scheduling
    1.1.2 Application Modelling and Optimization Approach
  1.2 Problem
  1.3 Purpose
  1.4 Goal, Benefits, Ethics and Sustainability
  1.5 Introduction Methods
  1.6 Delimitations
  1.7 Outline

2 Schedule Optimization in Multiprocessor Systems
  2.1 Multiprocessor Schedules
    2.1.1 Self-timed Schedules
    2.1.2 Synchronous Data Flow Graph
    2.1.3 Time-Based and Order-Based Schedules
    2.1.4 Evaluating Schedules
  2.2 The Optimization Problem
    2.2.1 NP-completeness
    2.2.2 Heuristic Search
  2.3 Local Search Algorithms
    2.3.1 Random-Restart Hill Climbing
    2.3.2 Simulated Annealing
    2.3.3 Genetic Algorithms
  2.4 Related Work

3 Methods
  3.1 Applying Local Search on Scheduling of Homogeneous Synchronous Data Flow Graphs
    3.1.1 Algorithms
  3.2 Program Implementation of Local Search Approach
  3.3 Evaluating Results of Local Search

4 Local Search Approach
  4.1 Common Concepts
    4.1.1 Input and Platform Description
    4.1.2 Definition of a State
    4.1.3 Creating a Random Initial State
    4.1.4 Definition of a Neighbour State
    4.1.5 Handling Deadlocks
    4.1.6 Output
  4.2 Approach using Random Restart Hill Climbing
    4.2.1 Process and Flow Chart
  4.3 Approach using Simulated Annealing
    4.3.1 Process and Flow Chart
  4.4 Possibilities for Genetic Algorithm

5 Results
  5.1 Test Results of Approach using Hill Climbing
    5.1.1 Hill Climbing without Sideways Moves
    5.1.2 Sideways Moves allowed
  5.2 Test Results of Approach using Simulated Annealing
  5.3 Comparisons of Local Search Algorithms
  5.4 Comparison of Local Search Approach with Constraint Programming Approach

6 Conclusions and Future Work
  6.1 Reflection of Thesis Purpose and Goal of this Thesis Project
  6.2 Evaluating results
    6.2.1 Local Search compared to constraint programming
    6.2.2 Hill Climbing compared to Simulated Annealing
  6.3 Effects of this thesis project
  6.4 Future Work

Bibliography


List of Figures

1.1 Scheduling of an application on a platform with three processors.
1.2 A synchronous data flow graph (SDFG) and the same graph converted into a homogeneous SDFG.
1.3 A state space landscape for a local search algorithm. An NP-hard problem usually contains a large number of local maxima.
2.1 A synchronous data flow graph (SDFG) with initial tokens.
2.2 Transformation from synchronous data flow graph (SDFG) to homogeneous synchronous data flow graph (HSDFG).
2.3 A Gantt chart of the graph in Figure 2.1.
2.4 Draft of a state space landscape.
2.5 Representation of a state in the 8-queens problem and a possible neighbour state.
2.6 In simulated annealing the search should converge to the highest area of the solution space.
2.7 The different steps of a basic genetic algorithm.
3.1 Relations between order-based schedule and time-based schedule.
3.2 Time-based schedules illustrating two neighbour-states.
4.1 The assumed multiprocessor platform.
4.2 An HSDFG, an example of a state representing an order-based schedule and the graph representation of a state.
4.3 Example of a scheduling of an HSDFG where deadlock occurs.
4.4 Example of a scheduling of an HSDFG where deadlock occurs.
4.5 Flow chart for hill climbing.
4.6 Temperature and probability according to cooling schedule.
4.7 Flow chart for simulated annealing.
4.8 State representation in the genetic algorithm.
5.1 Four of the HSDFGs used for testing the local search approach.
5.2 Two of the HSDFGs used for testing the local search approach.
5.3 A hill climbing search without sideways moves. A move means the current state is modified into a neighbour state.
5.4 A hill climbing search with 10 sideways moves allowed. The down-hill move is due to a restart.
5.5 A simulated annealing search.
5.6 Graph showing the change in throughput for hill climbing (HC) and the simulated annealing (SA) algorithms.
5.7 Graph showing the improvement of found states over time for the HC and the SA algorithms.
5.8 Graph showing the change in throughput for every made move of the hill climbing and the simulated annealing algorithms.
5.9 Graph showing the improvement of found states over time for the hill climbing and the simulated annealing algorithms.
6.1 A state space with throughput requirement.


List of Tables

2.1 Calculation of probability with different deteriorations and temperatures.
4.1 Calculation of probability with changes of different orders of magnitude.
5.1 Comparison of local search algorithms on search time and throughput for the first and the best solution.
5.2 Comparison of local search algorithms on search time and throughput for the first and the best solution.
5.3 Comparisons of throughput and search time for the first and best solutions produced by the different approaches.
6.1 Throughput and search time scheduling graphs G3 and G4.


Abbreviations

CP      constraint programming
DSE     design space exploration
HC      hill climbing
HSDFG   homogeneous synchronous data flow graph
NP      nondeterministic polynomial time
RTS     real-time system
SA      simulated annealing
SDFG    synchronous data flow graph
SSE     state space exploration
WCET    worst case execution time


Chapter 1

Introduction

This chapter presents the background and purpose of the thesis. It introduces the reader to scheduling on multiprocessor systems, the challenges within the field and the heuristic approach for optimization used in this project. One section is devoted to reflection upon ethics and sustainability issues. The chapter also gives a brief description of the methods used in the project.

1.1 Background

Today embedded systems are found in a wide range of applications, from consumer electronics, household appliances and medical equipment to automotive, avionics and space systems. Simply stated, an embedded system is a computing system dedicated to a specific application, or part of an application, that could be part of a larger system. Typical characteristics are that the application is known beforehand, and that the system handles one or a few functions and has minimal or no user interface [3].

Many systems have strict deadlines for the finishing times of executions of applications and are known as real-time systems (RTSs). The deadlines are called real-time requirements and can be further divided into soft and hard [11]. An example of a soft RTS is audio or video streaming where missing deadlines could cause lower quality which is undesirable but could be acceptable as long as the media does not lag. On the other hand, missing a deadline in a hard RTS could have disastrous consequences. Imagine a pacemaker or a driverless car, both examples of hard RTSs, and what could happen if certain actions were not performed in time.

The rapidly increasing complexity of embedded systems creates a constant need for faster and more efficient hardware. Until recently, the main issue was to produce smaller chips that could run at higher clock frequencies. But since the early years of the 2000s, the trend has shifted towards multiprocessors due to the extensive heat dissipation of the compact chips [2].

1.1.1 Multiprocessor Scheduling

Utilizing multiple processors effectively requires that computations in an application can be executed in parallel on different cores. The level of available parallelism in an application is limited if the computations are dependent on each other. Having the processor cores work concurrently, while respecting a precedence order and varying execution times of the computations, calls for good scheduling. Figure 1.1 shows a simple schedule for an application consisting of five different computations, in the figure called tasks, on a platform with three processors. The schedule tells when and on which processor a task is executed. Task 2 and Task 3 cannot execute initially since they depend on Task 1. Task 4 and Task 5 depend on Task 3.

Another schedule could have had Task 3 placed on processor 2 after Task 2. This would have been a worse schedule since it would cause the tasks to miss their deadline. A better schedule could have put Task 3 on processor 1, because it would give the same performance with fewer processors.

Figure 1.1. Scheduling of an application on a platform with three processors.

In addition to deadlines there could be other constraints regarding memory consumption and communication delays depending on the platform design.

Optimizing schedules for applications on multiprocessors is known to be an NP-hard problem [5]. Nondeterministic polynomial time (NP) hardness implies that no polynomial-time algorithm is known to exist for the problem [10]. This means that there is no known polynomial function of the input size describing an upper limit on the time it could take to solve the problem. NP is further described in Section 2.2.1. For large applications, finding the optimal schedule could therefore be very time-consuming.


1.1.2 Application Modelling and Optimization Approach

When designing a real-time system (RTS), scheduling of the applications the system is built for can show whether the real-time requirements are met and contribute to evaluating the hardware resource efficiency. The process of exploring design alternatives for software components and hardware architecture before implementing a system is called design space exploration (DSE) [9]. Scheduling an application requires a model of the application.

Many applications can be represented as data flow graphs. A specific data flow graph model suitable for multiprocessor scheduling of these applications, especially digital signal processing applications, is the synchronous data flow graph (SDFG) [10]. A description of this model can be found in Embedded Multiprocessors by Sundararajan Sriram and Shuvra S. Bhattacharyya [10]. The vertices in SDFGs are called actors and represent computations. The edges, usually called channels or arcs, represent data precedence constraints between actors. The actors produce data on their output channels and consume data from their input channels; these chunks of data are called tokens. An SDFG can easily be converted into an HSDFG for the purposes of multiprocessor scheduling. Figure 1.2 shows an SDFG and the same graph converted into an HSDFG. Homogeneous refers to the fact that each actor produces and consumes exactly one token on each of its output and input channels. This entails that an actor produces the exact number of tokens that the receiving actor needs. HSDFGs are further described in Section 2.1.2.

Figure 1.2. A synchronous data flow graph (SDFG) and the same graph converted into a homogeneous SDFG.


The only performance metric of schedules regarded in this thesis was the throughput, which describes the number of iterations of the graph per time unit [10]. An iteration implies that all actors in the graph have executed once. Thus optimizing a schedule for an HSDFG meant improving the throughput.

The approach for optimizing schedules was to use local search algorithms. Local search is a subcategory of heuristic, also known as informative, search. The objective of a heuristic search is to find a sufficiently good solution in a reasonable amount of time when methods searching for the optimal solution are too time-consuming [8]. The price for the reduced search time is that heuristics yield an approximation of the optimal solution [8]. Since the scheduling problem is NP-hard, the use of heuristics is justified.

Local search algorithms are used when the goal state is what matters and not the path to reach it [8]. The algorithm does not keep the computation path in memory but typically focuses only on the current state. The search progresses by modifying the current state into a neighbour state to try to improve it. The neighbour state is evaluated with a specified evaluation function. In this thesis the evaluation function corresponded to the performance metric of the schedules which was the throughput. With throughput as the performance metric the optimum was a maximum but for other evaluation functions, where the best solution has the lowest value, the optimum would be a minimum. If the small modification into a neighbour state yields a better throughput, the neighbour becomes the new current state. If not, another modification could be tried. And so the local search algorithm continues until no better neighbours are found.

Figure 1.3 shows how the current state is moving through the state-space landscape by being modified into a neighbour state. An up-hill move implies a modification of the current schedule into a new schedule with better throughput. If the current state reaches a peak in the landscape where there is no better neighbour to move to, the algorithm has found a maximum.

The global maximum is the highest peak in the landscape and represents the optimal schedule. A local maximum is any of the other lower peaks. As seen in the landscape in Figure 1.3 a local maximum could be close to equal in throughput to the global maximum. Such a schedule could be considered an approximation of the optimal solution and be sufficiently good if it meets the throughput requirement. Local search algorithms are further described in Section 2.3.


Figure 1.3. A state space landscape for a local search algorithm. An NP-hard problem usually contains a large number of local maxima.

1.2 Problem

The problem addressed in this thesis was optimization of multiprocessor schedules for HSDFGs. This optimization problem is considered NP-hard which entails that the required time to find the optimal schedule could be unreasonable for large and complex graphs. Since scheduling of applications is a significant phase in the DSE process requiring a lot of testing it is desirable to reduce the time needed for optimizing the schedule. The optimization approach in this thesis project was to use local search algorithms. The approach assumed that the worst case execution times (WCETs) of actors were known and that the only performance metric was the throughput. The platform was assumed to have homogeneous processors without constraints regarding communication delays, memory consumption or buffer sizes.

1.3 Purpose

The purpose of this thesis was to present an approach for scheduling of HSDFGs onto multiprocessors using local search algorithms. The evaluation of the approach would show if heuristics and especially local search algorithms could provide sufficiently good solutions with reduced search time when approaches searching for the optimal solution are too time-consuming.

The definition of a sufficiently good solution in this thesis was a valid schedule that met the throughput requirement. The throughput requirements used when testing the approach would be based on the throughput of optimal schedules for selected HSDFGs on an 8-core platform, produced by other optimization approaches that guarantee optimal solutions.

Three different types of local search algorithms called hill climbing, simulated annealing and genetic algorithms would be considered when choosing suitable algorithms for the approach. Implementing a local search algorithm required defining a solution state, a neighbour state and an evaluation function.

1.4 Goal, Benefits, Ethics and Sustainability

The goal of this thesis project was to present two heuristic search algorithms and investigate whether they could complement other, more time-consuming approaches to scheduling HSDFGs onto multiprocessors. The results of scheduling HSDFGs with the algorithms could be compared with the results of another, more time-consuming approach applied to the same graphs. The algorithms used in the project are heuristic, and the ambition was to find a solution that met the throughput requirements, or the global maximum, in reasonable time. However, the algorithms were not meant to generate a solution that is guaranteed to be the global maximum. A solution at the global maximum is the optimal solution.

Hopefully this thesis can contribute to the research in the area of DSE with regard to multiprocessor schedules. It may be useful for research not only in the area of DSE but also in areas dealing with logistics-related problems.

Optimization in DSE may bring not only better schedules but also positive environmental effects. One example is that algorithms that solve a problem faster are generally less energy-consuming; generating a subset of the solutions ought to consume less energy than finding all solutions. From a sustainability perspective, such time reductions are an improvement for energy-consuming algorithms and worth striving for. If optimized schedules contribute to resource efficiency, sustainable design decisions can be made already in the manufacturing phase. The design decisions are sustainable in the sense of minimizing material and thereby reducing environmental impact. In the case that the materials are metals containing rare earth elements, the environmental impact is even more relevant.

Rare earth elements are costly to extract and difficult to recycle [1].

There is a risk of a rebound effect if an optimized algorithm is used for other types of software expansion instead of hardware reduction. A rebound effect is an unexpected consequence that is more adverse than predicted [6]. These effects are important to take into consideration when dealing with sustainability.

1.5 Introduction Methods

The working process of this thesis project is presented thoroughly in the Methods chapter (Chapter 3); this section is a summary and an introduction to the processes used. The work of the thesis project is meant to solve the problem described in Section 1.2 and can be divided into different parts, presented here in chronological order.

The initial part was to understand the optimization problem. For this, information was gathered from books, papers and internet databases. After the initial research the purpose was formulated.

The next part concerned the implementation of the program. The tools used were HSDFGs, WCETs, throughput requirements, and order-based and time-based schedules. Realizing the design alternatives of the algorithms could improve DSE.

The program was implemented in Java. The implementation included two different local search algorithms, a hill climbing algorithm and a simulated annealing algorithm; further information about their characteristics and functions can be found in Section 2.3. The scheduling was developed with a self-timed approach, see Section 2.1.

After implementing the algorithms, experiments applying them to different HSDFGs were carried out. These consisted of executions of both local search algorithms scheduling different graphs, and of a comparison between the local search approaches and a constraint programming approach.

The experiments consisted of both primary measurements, such as throughput and whether the global maximum was found, and secondary measurements, such as the number of sideways moves and the number of visited states. The last part of the thesis project was the evaluation. By comparing the purpose of the thesis with the evaluation of the results it was possible to determine whether the purpose was satisfied or not.

1.6 Delimitations

The aim of the project was to see if an approach using local search algorithms could provide a noteworthy alternative to more time-consuming approaches generating the optimal solution. In that case it could motivate developing the approach further to manage more complex graphs and platform constraints.

Therefore the delimitations described could be seen as possible areas for future work.

The thesis project was delimited to assume a simple multiprocessor platform without constraints regarding communication delays or memory consumption. A more realistic platform model could have included communication delays between actors and between different processors. Moreover, the processors could have had a limited amount of memory and a limited buffer size for storing the tokens sent between actors.

The research of suitable local search algorithms for the approach was delimited to hill climbing, simulated annealing and genetic algorithms. Two of these algorithms would be selected for implementation and evaluation.

The approach would not handle cyclic HSDFGs. It would be able to map and schedule multiple graphs onto one shared platform but not to set or analyse individual performance requirements for each graph.

1.7 Outline

The following chapters of the thesis include the theoretical background with the important concepts and knowledge required to understand the content of the thesis. Then the working process, with the associated tools used in the scheduling problem of the graphs, is presented. The approach of this thesis project is described in a chapter of its own, covering common concepts of local search and the two algorithms concerned; the genetic algorithm approach is also presented there. The two last chapters include the results and the conclusions that can be drawn from the outcome of the thesis project.


Chapter 2

Schedule Optimization in Multiprocessor Systems

In the following sections the basic theoretical background of the degree project is described. The initial sections introduce scheduling, SDFGs and the optimization problem concerned in the project.

2.1 Multiprocessor Schedules

The main idea of multiprocessor scheduling is to make use of the multiple processors by running tasks that are independent of each other in parallel. There are different algorithms that can be applied to scheduling onto multiprocessors, each with its advantages and disadvantages depending on the characteristics of the problem. In the following sections the approach applied in the degree project is described, followed by how schedules are evaluated.

2.1.1 Self-timed Schedules

Self-timed scheduling is one of six strategies described in Embedded Multiprocessors written by Sundararajan Sriram and Shuvra S. Bhattacharyya [10]. The strategies are categorized by their more or less dynamic approach.

In fully static scheduling, all scheduling decisions are made at compile time, compared to dynamic scheduling, where scheduling decisions are made at run time [10]. Self-timed scheduling, used in this degree project, is almost fully static, with the difference that the mapping and ordering of the actors on the processors are known at compile time while the start times of the actors are determined at run time. Within the scope of this thesis project, no communication delays between the actors were handled; such delays may require flexible communication times between the actors, and with self-timed scheduling no such communication times are known at compile time. Specifically, in this thesis project there are communication dependencies but no communication times. Communication is a demanding factor in hard real-time systems, and some real-time systems require communication with flexible delays.

Dynamic strategies are demanding due to the scheduling decisions made at run time and the time required to perform them. It is desirable to reduce run-time computations and decisions in hard real-time systems. Therefore a more static approach is considered.

The parameters known at compile time in this study include the HSDFG, the WCETs, the number of processors and the throughput requirement. The input to the self-timed schedule is given at compile time and is described further in the following section.

2.1.2 Synchronous Data Flow Graph

Figure 2.1. A synchronous data flow graph (SDFG) with initial tokens.

An SDFG is a graph describing the flow of data through a system and can graphically illustrate the dependencies of the actors. The actors produce and consume data, called tokens, and in an SDFG the numbers of tokens consumed and produced by each actor are known at compile time [10].

Execution of the data can be illustrated as a graph of vertices (called actors) and edges (called channels) [10], as described earlier in this thesis. The edges represent directed dependencies between two actors. In Figure 2.1 three actors A, B and C are shown. In the example, actors B and C can only be executed after actor A. The initial tokens on each actor show that the actors are executed on different processors. A time-based schedule of this graph is described in Section 2.1.3, Time-Based and Order-Based Schedules. An actor can only execute when it has received a token from all of its input channels. It is possible to see that actor B and actor C could only start to execute once actor A had fired. The actor producing data is called the source vertex and the actor that consumes data is called the sink vertex [10]. A channel from an actor to itself carrying an initial token is called a self-edge with an initial token, but initial tokens (drawn as bullets) do not have to be placed on a self-edge; they can be placed on a channel between two different actors.

Figure 2.2. Transformation from synchronous data flow graph (SDFG) to homogeneous synchronous data flow graph (HSDFG).

If the number of tokens produced at a source is higher than the number consumed at the sink, a buffer can be implemented for the tokens that are not consumed directly. It is possible to implement buffers that keep all the data waiting to be consumed in a queue. In this thesis project, graphs with equal token rates are handled as HSDFGs after transformation from SDFGs. In Figure 2.2, actor X produces two tokens while actor Y requires three tokens. In an HSDFG, the number of tokens on an edge is the same at the source as at the sink [4].
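As a minimal illustration of these token rates (a sketch with assumed names, not code from the thesis project), a channel can be modelled as carrying a production rate, a consumption rate and a number of initial tokens; the homogeneous case is exactly the one where both rates are 1:

    import java.util.Objects;

    // Sketch of an SDF channel: token rates are fixed and known at compile time.
    final class Channel {
        final String source;        // producing actor (source vertex)
        final String sink;          // consuming actor (sink vertex)
        final int produced;         // tokens produced per firing of the source
        final int consumed;         // tokens consumed per firing of the sink
        final int initialTokens;    // tokens on the channel before the first firing

        Channel(String source, String sink, int produced, int consumed, int initialTokens) {
            this.source = Objects.requireNonNull(source);
            this.sink = Objects.requireNonNull(sink);
            this.produced = produced;
            this.consumed = consumed;
            this.initialTokens = initialTokens;
        }

        // In an HSDFG every channel produces and consumes exactly one token.
        boolean isHomogeneous() {
            return produced == 1 && consumed == 1;
        }
    }

Under this sketch, the channel from actor X to actor Y in Figure 2.2 would be new Channel("X", "Y", 2, 3, 0) before the transformation; after conversion to an HSDFG every channel satisfies isHomogeneous().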

An SDFG or an HSDFG may have one or more initial tokens, representing a data dependence between actors across iterations [10]. After the whole first iteration of a graph, the actors with initial tokens can be executed again in a blocked schedule. In a pipelined schedule, it is possible to begin the second iteration before the end of the first one and in that way use the otherwise idle processors. Figure 2.3 illustrates a pipelined schedule, where actor A executes a second time while actors B and C are executing their first time.


2.1.3 Time-Based and Order-Based Schedules

Self-timed schedules, described earlier, can be either order-based schedules or a combination of an order-based schedule and a time-based schedule.

Order-based schedules only consist of the mapping and ordering of the actors on the processors, without time information. Time-based versions use the dependencies as well as the execution times of the actors on the processors during the scheduling. The main goal of a time-based schedule is to determine the execution times of the tasks distributed on the processors, based on the order-based schedule and the execution time of each actor.

A decision that distributes tasks on the available processors and reduces the run time for the whole execution of the graph is preferable. The different measurable values of concern are described in the next section. One decision that increases the throughput is to let tasks that are not dependent on each other and belong to different iterations execute at the same time on different processors. This approach is called an overlapped (or non-blocked) schedule and is used between one iteration of the HSDFG and the following iteration [10]. The overlapped schedule is used in the implementation of this degree project. The simple graph in Figure 2.1 can represent an overlapped schedule. The schedule is based on an environment with three available processors and actors with the same execution times, without communication delays.

Figure 2.3. A Gantt chart of the graph in Figure 2.1.

A is the first and only actor to execute in the first time slot; then actors B and C, followed by A (again), execute in the second time slot. In this simple example the schedule then repeats with the three actors distributed on the three processors. Note that actor A executes the second time at the time when B and C start executing. This is because of the overlapped schedule.


In Figure 2.3 the schedule is represented with a Gantt chart of a time-based schedule with an overlapped approach. The initial phase and the periodic phase are marked at the times of one and two iterations of the graph, respectively. The vertical axis in a Gantt chart represents the processors and the horizontal axis the time [10]. Time-based schedules require the hardware to handle measurement of time during execution.

2.1.4 Evaluating Schedules

The main measurement in the degree project is based on the periodic phase and is the keystone for evaluating the schedules. The Gantt chart in Figure 2.3 is marked with two phases called the initial phase and the periodic phase. The definition of the initial phase in this degree project is the time in which all tasks have executed once, in other words, the first iteration of the graph.

The periodic phase is the time by which all tasks have executed twice, whether or not some actors have executed more times. Throughput is the inverse of the period and is used to characterize a schedule for a specific input graph. The duration of the initial phase provides information about how many time units the start-up phase took and can be used in the evaluation of the schedules.
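Stated as a short formula (with T as an assumed symbol for the period length; the notation is illustrative, not taken from the thesis):

    throughput = 1 / T

For example, with invented numbers, a schedule whose periodic phase completes one iteration of the graph every 4 time units has a throughput of 1/4 = 0.25 iterations per time unit.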

A new schedule of an HSDFG, generating a new solution, is called a neighbour and is an essential part of creating new solutions. When evaluating schedules, the number of equivalent solutions can be an interesting aspect, for example when deciding whether or not to accept equivalent solutions.

2.2 The Optimization Problem

In this thesis the only performance metric of the schedules was the throughput, which is the number of iterations of the graph per time unit. Thus optimization meant finding schedules with better throughput. The challenge was to find a schedule in which all dependencies between actors were respected, no deadlocks occurred, and maximum use was made of the opportunities for parallelism that came with the multiple processors.

Finding a high-performance schedule for an HSDFG on a multiprocessor is known to be an NP-hard problem. The following sections describe what an NP-hard problem is and motivate the use of heuristics for problem-solving and optimization.


2.2.1 NP-completeness

The time complexity of an algorithm describes the required run time as a function f of the input size n. A commonly used notation for this is Big O notation, which describes an asymptotic upper bound of the complexity. In Big O notation, an algorithm has time complexity f(n) ∈ O(g(n)), where g(n) bounds the growth of f as the input size n goes to infinity. If g(n) is a polynomial, the algorithm is said to run in polynomial time. An efficient algorithm, with low run time, has time complexity O(g(n)) with a low-degree polynomial g.

An important set of problems, polynomially equivalent with regard to complexity, is the NP-complete problems. These are problems for which polynomial-time algorithms are not known and are believed not to exist [10]. Without a polynomial upper bound on the time complexity, an algorithm could be impossible to run in a reasonable amount of time.

A problem that is at least as hard as any NP-complete problem is referred to as NP-hard. The problem of optimizing the mapping and scheduling of tasks in an arbitrary dependency graph is known to be NP-hard [5]. Since no efficient optimal algorithm is known, finding the optimal solution for scheduling and mapping could mean searching the whole solution space. If the input graph is large and complex, it could take an unreasonably long time to find the optimum. In such cases, it is justified to use heuristics to find a sufficiently good solution.

2.2.2 Heuristic Search

Search algorithms are used for problem-solving and optimization when systematically testing all solutions is too time-consuming. The algorithm leads the search towards the relevant solutions and the usual goal is to find the optimum. The algorithms define how to make decisions based on some evaluation of which way to continue during the search process.

Heuristic or informative search refers to using additional information that is not part of the problem definition when making these decisions [8]. This information could come from experience, rules of thumb or educated guesses.

The aim is to narrow the search to reach the best solution faster. However, since the educated guesses are not always correct, the algorithm sometimes chooses the wrong direction. This could cause it to miss the path to the optimal solution. Later on the search stops when the algorithm believes it has found the best solution, even though it might be only second best or even worse.

A common example used to describe heuristic search is climbing in the mountains in thick fog, trying to find the highest peak in the area. The added information, based on an educated guess, could be to always choose the steepest upward path. The climber will probably reach a peak in a short time, but it might not be the highest.

When dealing with NP-hard problems the solution space is usually very large. Since it is known that the optimal solution is hard to find, sometimes an approximation of the optimal solution could be good enough. In that case heuristic search is a good option.

2.3 Local Search Algorithms

Local search is a subgroup of heuristic search [8], which implies that the resulting solution state might be an approximation of the optimal solution. Local search algorithms are special because they only keep track of one or a few current states in the search process rather than systematically testing different paths from the initial state. The search progresses by exploring the local neighbourhood of the current state and choosing a good neighbour to set as the new current state. A neighbour is a small modification of the current state. Since the preceding states are not kept in memory, local search should only be used when it is the solution state that matters and not the path to get there. This gives local search the properties it is known for: it uses a small amount of memory and is often able to find a sufficiently good solution relatively fast in a large solution space where a systematic search would have been too extensive.

Figure 2.4 shows how the algorithm searches the solution space, moving forward by making small modifications of the current state into neighbours. The state space has several maxima, where the global maximum represents the optimal solution. If the current state has reached a maximum it will not find any better states in its neighbourhood, because all neighbours would cause a down-hill move. Here the algorithm might think it has reached the best solution and terminate, even if it is only a local maximum and not the global one. This is why local search algorithms might produce an approximation of the optimal solution.

There are several choices to make when implementing this kind of algorithm. For NP-hard problems, the solution space could look more like Figure 1.3 in the previous chapter. Many of the local maxima could be very low compared to the global maximum. If the local search algorithm gets stuck on a local maximum that is not sufficiently good, it needs a way to get away from this point and continue the search. One way to do this is to restart the algorithm with a new initial state. Creating a new initial state can be done systematically or randomly. Another way to continue is to accept neighbours worse than the current state and continue from there. This is done in a local search algorithm called simulated annealing, discussed in Section 2.3.2.

Figure 2.4. Draft of a state space landscape.

Furthermore, as seen in Figure 2.4, the solution space could contain plateaus and shoulders that can be handled differently. A plateau is a local maximum with several neighbours of the same value, creating a flat area in the state space. If the algorithm is implemented not to accept neighbours with the same value, it will terminate on the plateau. However, as seen in Figure 2.4, what appears to be a plateau could actually be a shoulder leading to better solutions. Therefore the algorithm could be allowed to make sideways moves to be able to continue from the shoulder. With sideways moves comes a risk that the algorithm gets lost on the plateau or shoulder and ends up moving back and forth, since it does not remember which states it has visited before. A common solution is to set an upper limit on the number of sideways moves allowed. The red circle in Figure 2.4 illustrates the current state. Which peak the current search will find depends on where the initial state was located. The goal is to reach the global maximum.

Figure 2.5. Representation of a state in the 8-queens problem and a possible neighbour state.

Implementation of a local search for problem-solving or optimization requires a definition and representation of a state and a neighbour. A common problem example is the 8-queens problem, which amounts to placing eight queens on a chess board, one per column, without any queens attacking each other. One way to represent a state in this problem is a string of length eight where each position represents a column on the chess board and the number represents the row in which a queen is placed in that column. A neighbour state can then be created by modifying one of these numbers, which implies moving one single queen. This is illustrated in Figure 2.5. The representation of a state and a neighbour is not always as straightforward as in this example. Moreover, the modification of the current state into a neighbour has to be relevant for the progress of the search.
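To make the representation concrete, here is a minimal sketch (assumed names, not code from the thesis) of the 8-queens state and the neighbour move just described, using an integer array instead of a string:

    import java.util.Arrays;
    import java.util.Random;

    final class EightQueens {
        private static final Random RNG = new Random();

        // board[c] is the row of the queen in column c; a neighbour moves
        // exactly one queen to a different row within its column.
        static int[] neighbour(int[] board) {
            int[] next = Arrays.copyOf(board, board.length);
            int column = RNG.nextInt(next.length);
            int row;
            do {
                row = RNG.nextInt(next.length);
            } while (row == next[column]);  // force an actual modification
            next[column] = row;
            return next;
        }
    }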

The following sections describe the local search algorithms used in this project. The descriptions are based on the book Artificial Intelligence: A Modern Approach by Stuart J. Russell and Peter Norvig [8].

2.3.1 Random-Restart Hill Climbing

The hill climbing algorithm only keeps track of one current state. From there it only makes up-hill moves to better states in the neighbourhood. The algorithm practically consists of a loop of finding a better neighbour and making it the current state. Normally the algorithm starts at a random initial state and stops when it has reached a peak where no better neighbours surround the current state.

Since there could be more than one better neighbour state, the algorithm needs to know which one to choose. It could produce all neighbour states and choose the best, which is known as steepest-ascent hill climbing. Another variant, called stochastic hill climbing, chooses randomly among the better neighbours. In first-choice hill climbing the algorithm picks the first neighbour created that is better than the current state. This is suitable if there is a large number (e.g. thousands) of neighbours.

Figure 2.4 pictures a one-dimensional state space landscape where the current state is the red circle climbing a hill. Once the algorithm has started on an up-hill path, it will climb to the top without considering whether it is a local maximum or the global one, and there it gets stuck.

In random-restart hill climbing the algorithm makes a restart whenever it gets stuck on a local maximum. The search is restarted from a new position in the state space by creating a new random initial state. If the optimal solution is known, the search can continue restarting until it finds the global maximum.

If it is not, the search could have a time limit or a limited number of restarts.
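A minimal sketch of the random-restart loop (a generic skeleton under assumed interfaces, not the thesis implementation; the inner climb uses the steepest-ascent variant):

    import java.util.function.Supplier;
    import java.util.function.ToDoubleFunction;
    import java.util.function.UnaryOperator;

    final class RandomRestartHillClimbing {
        // Climb until no better neighbour exists, then restart from a new
        // random initial state; the best state over all restarts is returned.
        static <S> S search(Supplier<S> randomState,
                            UnaryOperator<S> bestNeighbour,
                            ToDoubleFunction<S> evaluate,
                            int restarts) {
            S best = randomState.get();
            for (int r = 0; r < restarts; r++) {
                S current = randomState.get();
                while (true) {
                    S next = bestNeighbour.apply(current);
                    if (evaluate.applyAsDouble(next) <= evaluate.applyAsDouble(current)) {
                        break;           // local (or global) maximum reached
                    }
                    current = next;      // up-hill move
                }
                if (evaluate.applyAsDouble(current) > evaluate.applyAsDouble(best)) {
                    best = current;
                }
            }
            return best;
        }
    }

A time limit or a check against a known optimum could replace the fixed restart count, as described above.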

Since hill climbing only makes uphill and possibly sideways moves, it is doomed to get stuck on a local maximum without being able to get away from there unless it does a restart. This is a disadvantage, since the search could be really close to the global maximum but is forced to restart in order to continue. When restarting the search, all effort to get this far is thrown away. If the algorithm could accept worse neighbours as well, it might be able to continue without restarting. This is the purpose of the simulated annealing algorithm described in the next section.

2.3.2 Simulated Annealing

This algorithm springs from the process of annealing metals, which improves the structure of the metal and makes it easier to work with. The algorithm simulates the process in which the metal is heated above a critical temperature and then cooled down slowly.

Simulated annealing has two big differences from hill climbing. Firstly, it allows moves to worse neighbours. Secondly, it creates a neighbour randomly and chooses whether to accept or reject it. A better neighbour is always accepted, and a worse neighbour is accepted with some probability less than one. The chance for a worse neighbour to be accepted decreases as the simulated temperature goes down. The temperature is lowered slowly with time according to a cooling schedule. Since there are whole books written only about simulated annealing, there are of course many ways to implement the algorithm.

One common method to calculate the probability of accepting a move to a worse neighbour is the Metropolis algorithm, which originates from the first paper on the subject, Equation of State Calculations by Fast Computing Machines, written by Metropolis et al. in 1953. The probability reads P = exp(−c/t); a random value r between 0 and 1 is drawn, and if P > r the neighbour is accepted. Here c represents the change of the evaluation function compared to the current state. If the optimum is a maximum, this is usually calculated as c = E(current) − E(neighbour), where E(state) is the evaluation function, and t is the temperature mapped to elapsed time according to the cooling schedule.
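As a small sketch of this acceptance test (assuming a maximization problem, as in the thesis, so that c = E(current) − E(neighbour) is non-negative for a worse neighbour):

    import java.util.Random;

    final class Metropolis {
        // Accept a worse neighbour with probability P = exp(-c/t):
        // draw r uniformly in [0, 1) and accept if P > r.
        static boolean acceptWorse(double c, double t, Random rng) {
            double p = Math.exp(-c / t);
            return p > rng.nextDouble();
        }
    }

With c = 0.2 and t = 0.95 this accepts roughly 81% of the time, matching the first row of Table 2.1 below.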

The table below shows examples of the probability with different temperatures and evaluation differences.

c      t      P = exp(−c/t)
0.2    0.95   0.810157735
0.4    0.95   0.656355555
0.6    0.95   0.531751530
0.2    0.10   0.135335283
0.4    0.10   0.018315639
0.6    0.10   0.002478752

Table 2.1. Calculation of probability with different deteriorations and temperatures.

The tricky part is often to construct the cooling schedule. If the temperature is too high, the algorithm performs a random search, accepting all neighbours, good as well as bad. If the temperature is too low, it acts like hill climbing. The aim is to have a high temperature in the beginning, accepting large decrements in the evaluation function, to be able to reach the highest area of the solution space landscape. Then the temperature should cool slowly, making the search converge to searching only the highest peaks in the local area. In the illustrated example in Figure 2.6 the initial temperature is high enough to accept decrements in the evaluation function of about 6 units.

This enables the current state, marked with the red circle, to get past the first dips at points A, B, C and D. When the current state tries to get past point F, the temperature is too low to accept such a deterioration. When it moves backwards again it cannot get back past point C. As the temperature decreases, the search converges to the highest peak at point E and stays there until the temperature freezes and the search terminates. In a state space with a different landscape, the highest peak could have been beyond point F, which would have required a higher start temperature to be found.

Figure 2.6. In simulated annealing the search should converge to the highest area of the solution space.

There is no ultimate cooling schedule suitable for all search problems; it has to be adjusted to the specific problem. The main things defining a cooling schedule are the start temperature, the final temperature, the temperature decrement schedule and the number of iterations at each level. These parameters should be based on statistics from running the search.
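One common choice (an assumption for illustration; the thesis's own schedule is described in Section 4.3) is a geometric schedule that multiplies the temperature by a constant factor after a fixed number of iterations:

    // Geometric cooling: t <- alpha * t every iterationsPerLevel moves,
    // freezing once the temperature drops below finalTemperature.
    final class GeometricCooling {
        private double temperature;
        private int movesAtLevel = 0;
        private final double alpha;              // e.g. 0.95
        private final double finalTemperature;   // e.g. 0.01
        private final int iterationsPerLevel;    // moves per temperature level

        GeometricCooling(double startTemperature, double alpha,
                         double finalTemperature, int iterationsPerLevel) {
            this.temperature = startTemperature;
            this.alpha = alpha;
            this.finalTemperature = finalTemperature;
            this.iterationsPerLevel = iterationsPerLevel;
        }

        double temperature() { return temperature; }

        boolean frozen() { return temperature < finalTemperature; }

        void step() {
            if (++movesAtLevel >= iterationsPerLevel) {
                temperature *= alpha;    // cool one level
                movesAtLevel = 0;
            }
        }
    }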

2.3.3 Genetic Algorithms

Genetic algorithms are inspired by evolution theory and are analogous to the natural selection process. Instead of modifying one current state into a neighbour, parent states are combined to create child states. The algorithm starts with a set of randomly generated initial states, called the initial population. These are evaluated with the specified evaluation function, in genetic terms called the fitness function. The next step is selecting pairs from the population for mating with each other; the better the fitness, the higher the chance of being selected for mating. All selected pairs are mated, which means they are combined into two new states. Then, in line with evolution theory, a mutation can occur. This means that a part of a child state could, with a low probability, get a small random modification, positive or negative. The algorithm continues mating the child states, developing new generations, and stops when it has found a satisfactory or optimal solution.

As a simple example, recall the case from Section 2.3 of placing eight queens on a chess board without any of them threatening each other. States in this problem can appropriately be represented with an integer array, and the steps of the algorithm can be implemented as in Figure 2.7. These steps are performed in a loop until a satisfactory or optimal solution is found. The colours show how the states are combined and mutated.

However, representing a state is not always as simple as in the eight-queens problem. Furthermore, a prerequisite for the genetic algorithm to produce good results is that the state is defined in a way that makes the combinations of parents meaningful, because the potential of genetic algorithms over other algorithms lies in the crossovers. If a state is represented by a finite string of letters, the crossovers combine blocks of letters and thereby raise the level of granularity of the search. As the algorithm proceeds it will favour blocks that contribute to good solutions, and these will survive the evolution.
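A sketch of the two genetic operators on such a string-like state (integer arrays as in the eight-queens example; the names and the single-point crossover choice are assumptions for illustration, not the thesis's operators):

    import java.util.Arrays;
    import java.util.Random;

    final class GeneticOperators {
        private static final Random RNG = new Random();

        // Single-point crossover: the child takes a prefix block from one
        // parent and the remaining block from the other.
        static int[] crossover(int[] parentA, int[] parentB) {
            int cut = 1 + RNG.nextInt(parentA.length - 1);
            int[] child = Arrays.copyOf(parentA, parentA.length);
            System.arraycopy(parentB, cut, child, cut, parentB.length - cut);
            return child;
        }

        // With a low probability, one position gets a random new value.
        static void mutate(int[] child, double mutationRate, int valueRange) {
            if (RNG.nextDouble() < mutationRate) {
                child[RNG.nextInt(child.length)] = RNG.nextInt(valueRange);
            }
        }
    }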

Genetic algorithms are popular for optimization problems but more research remains to be done in order to find the situations when they perform well [8].


Figure 2.7. The different steps of a basic genetic algorithm.

2.4 Related Work

There are numerous approaches to solve the complex problem of scheduling applications onto multiprocessors. Among these are the approaches based on constraint programming (CP). A part of the evaluation of the local search approach developed in this project is a comparison with a CP-approach. There is also a number of heuristic approaches including local search algorithms.

These approaches often divide the scheduling problem into three steps performed with different methods [7]: mapping, scheduling and performance analysis. This decomposition can prevent the algorithms from exploring all possible solution schedules. The local search approach presented in this thesis avoids decomposing the problem and is therefore capable of exploring the whole solution space.


Chapter 3

Methods

In this chapter, the working process is described and the tools used during the project are presented. The implementations of the algorithms are presented in the following chapter. To make the content easier to follow, the working process in this chapter is divided into sections that describe the way of working through the thesis project.

The theoretical background in this thesis is based on information obtained from books, journals, articles and conference proceedings on the subject. The aim was to find sources with high reliability and up-to-date information. The following parts of this chapter cover the design decisions for the algorithms applied to HSDFGs, the implementation of the algorithms, and the evaluation of the results generated from the tests of the algorithms.

3.1 Applying Local Search on Scheduling of Homogeneous Synchronous Data Flow Graphs

The aim of the working process was to present the approach for scheduling SDFGs onto multiprocessors and apply heuristics to the scheduling. The first algorithm used in the approach was the well-known local search algorithm hill climbing. The reason for starting with hill climbing was partly its simplicity and straightforward approach, and partly the possibilities for further development from a hill climbing algorithm.

The second algorithm used was simulated annealing. The strength of this algorithm is that it permits downward moves in the state space landscape, enabling the search to continue and reach another peak, hopefully a maximum higher than the already visited local maximum. The reason for choosing simulated annealing was the characteristic shape of the state space landscape of NP-complete problems. As described earlier, the shape is typically seen as a bed of nails, and with that knowledge simulated annealing can advantageously be adapted to reach a maximum located close to the current position.

Figure 3.1. Relations between order-based schedule and time-based schedule.

The tools chosen for the approach were HSDFGs, WCETs, throughput, and order-based and time-based schedules. Some of them are described earlier and are associated with the graph and the evaluation of the schedule. The HSDFG was imported as an extensible markup language (XML) file, see Figure 3.1. The time-based schedule was developed from an order-based schedule in which only the dependencies of the actors are represented; in the time-based schedule the execution times of the actors (WCETs) are known as well. The way of implementing the order-based and time-based schedules with self-timed execution is described in Section 3.2. The throughput measurement is central for a real-time system, in which the execution times of the tasks matter. By generating a time-based schedule, both the throughput and the initial latency are obtained. The initial latency can be of interest to ensure that this phase is not too long. It is, however, possible to generate only the throughput directly from the order-based schedule. The relations between the terms described above are shown in Figure 3.1.

Some of the considerations in the process of design and implementation concerned how to handle plateaus, shoulders and ridges, which are described in Section 2.3. When dealing with plateaus it is important not to get stuck while continuing to generate neighbour solutions. The tolerance of the algorithm is worth considering when handling possible plateaus and shoulders: a very tolerant algorithm that accepts many sideways moves takes longer and may still not produce a good enough solution, while a non-tolerant algorithm that accepts few sideways moves may miss possible local maxima close to the current position.

Ridges are another trap to take into account. They are characterized by a sequence of local maxima where each neighbour solution points downhill, which means worse throughput values [8]. The only way to reach another maximum is either to go back a few steps or to generate a new initial solution. One problem with ridges is that it is not obvious how close the next local maximum is located, so it is not an easy task to decide how much downhill movement to tolerate. Simulated annealing is one strategy where the temperature regulates the possible down-hill moves and is therefore able to handle ridges.

Figure 3.2. Time-based schedules illustrating two neighbour-states.

3.1.1 Algorithms

The algorithms used the same platform and had the same input parameters. The graph was described in XML format and imported into the program. Dependencies between the actors were represented by an order list, with the first element having higher priority than the second, and so on. The list was the basis from which each new solution was generated in the hill climbing algorithm through a time-based schedule, see Section 2.3.

To understand the concept of up-hill move and down-hill move the concept of a neighbour has to be re-described. In section 2.3 neighbours were described by the 8-queens problem on a chess board. In Figure 3.2 two different time-based schedules are graphically illustrated. The description of a neighbour in the queens problem was described as a small modification and one move for one of the queens to a new possible position. The definition of a neighbour when it comes to scheduling of HSDFGs is a modified order-based schedule changing the order of one actor. As seen in Figure 3.2 the first mappings of the actors onto the processors take for example two time units.

When the order of actor B is changed, a less appealing schedule occurs and the execution now takes three time units. The illustration is meant to show the consequence of a neighbour state going up-hill or down-hill in the state space landscape. In Section 4.1.2 neighbour states are described further and different schedulings of another graph are presented.


3.2 Program Implementation of Local Search Approach

The scheduling of HSDFGs was realized by implementing a Java program. The input to the program consisted of HSDFGs imported from XML files.

As shown in Figure 3.1, the order-based schedule was generated first and then the time-based schedule. When the time-based schedule was created, the execution times of the actors were added to the information about their execution order. State space exploration (SSE) is the part of the program that uses the WCETs to generate a time-based schedule from the order-based one; a minimal sketch of this step is given below.
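As an illustration, the following sketch derives a time-based schedule for one iteration of the graph from an order-based schedule and the WCETs: an actor starts as soon as all its predecessors in the graph have finished and its processor is free. All names are illustrative, and preds is assumed to contain an entry (possibly empty) for every actor; the actual SSE implementation additionally repeats the execution until the periodic phase is found, from which the throughput is read.

```java
import java.util.*;

public class SelfTimed {
    /**
     * schedule: one execution queue per processor (the order-based schedule).
     * preds:    actor -> its predecessors in the HSDFG.
     * wcet:     actor -> worst-case execution time.
     * Returns actor -> start time for one iteration of the graph.
     */
    static Map<String, Long> startTimes(List<List<String>> schedule,
                                        Map<String, Set<String>> preds,
                                        Map<String, Integer> wcet) {
        Map<String, Long> start = new HashMap<>();
        Map<String, Long> finish = new HashMap<>();
        boolean progress = true;
        while (progress) {
            progress = false;
            for (List<String> queue : schedule) {
                long procFree = 0; // time at which this processor becomes idle
                for (String actor : queue) {
                    if (start.containsKey(actor)) { // already scheduled
                        procFree = finish.get(actor);
                        continue;
                    }
                    Set<String> p = preds.getOrDefault(actor, Set.of());
                    if (!finish.keySet().containsAll(p)) {
                        break; // self-timed: wait until all predecessors finished
                    }
                    long ready = procFree;
                    for (String pre : p) {
                        ready = Math.max(ready, finish.get(pre));
                    }
                    start.put(actor, ready);
                    finish.put(actor, ready + wcet.get(actor));
                    procFree = finish.get(actor);
                    progress = true;
                }
            }
        }
        return start; // actors missing from the map indicate a deadlock
    }
}
```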

3.3 Evaluating Results of Local Search

The evaluation of the results in this thesis project is of interest when concluding the work, since it connects the problem formulation at the beginning of the thesis with the outcome of the project. Six graphs were scheduled by two different approaches, hill climbing and simulated annealing. These two algorithms were compared to a constraint programming approach run on the same graphs. The measurements of interest were the search time, the number of visited states during the search, the throughput and whether or not the global maximum was found. The number of visited states was of interest when evaluating the efficiency of the algorithms. The global maximum for each graph was obtained from the constraint programming approach before execution, provided that it was found within a certain time. To avoid errors originating from different computers, the hill climbing algorithm and the simulated annealing algorithm were compared on the same computer.


Chapter 4

Local Search Approach

This chapter presents the approach for scheduling HSDFGs onto multiprocessors developed in this thesis project. The local search algorithms used were random restart hill climbing and simulated annealing. The representation of a state and a neighbour was the same for both algorithms.

The chapter also includes a section describing a third approach, where a genetic algorithm was used with a state representation inspired by the initial state creation for the other two algorithms. The genetic algorithm approach was not completed, but is included to discuss the potential of such an approach.

4.1 Common Concepts

This section describes the general concepts of the local search approach such as the input and output, the assumed platform and the definition of a state and a neighbour.

4.1.1 Input and Platform Description

The input needed for the algorithms is an HSDFG, WCETs for the actors in the graph, the number of processor cores and, if available, a throughput requirement. The approach assumes a simple platform, seen in Figure 4.1, where the only variable is the number of processors. Each of the homogeneous processors has its own local memory and there are no communication delays between them. As soon as an actor has produced a token on its output channel, it is available to the receiving actor, no matter on which processor. This allows for remapping without affecting the communication time between actors, which reduces the complexity of the scheduling problem.

Figure 4.1. The assumed multiprocessor platform.

4.1.2 Definition of a State

A state represents an order-based schedule. This means that the actors are mapped to processors and the internal execution order of the actors on each processor is decided. A valid state is one that respects the dependencies according to the HSDFG. Figure 4.2 shows a simple HSDFG and an example of a state representing an order-based schedule. Due to the internal execution order on each processor, new dependencies between actors arise. In Figure 4.2 actor C cannot execute until actor B is finished. In the graph representation of the state this is shown by the dotted arrow from B to C. Neither can actor B execute again before C is finished. This is represented in the graph by the arrow back from C to B, creating a cycle for that processor. The circle on this arrow depicts an initial token allowing actor B to execute the very first time. Likewise actor A needs a channel to itself with an initial token to create a cycle for its processor. This shows that actor A cannot execute again before its last execution is finished. A sketch of how these extra edges can be derived is given below.
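The extra dependencies a state introduces can be derived mechanically from each processor queue. The following sketch (with illustrative names, assuming Java 16+ for records) adds a forward edge between consecutive actors on a processor and a back edge carrying one initial token, which closes the cycle and allows the first actor to fire initially; a single actor gets a self-edge with an initial token.

```java
import java.util.*;

public class StateEdges {
    /** A channel in the graph representation of a state. */
    record Edge(String from, String to, int initialTokens) {}

    static List<Edge> processorEdges(List<String> queue) {
        List<Edge> edges = new ArrayList<>();
        // Forward edges: each actor depends on its predecessor in the queue.
        for (int i = 0; i + 1 < queue.size(); i++) {
            edges.add(new Edge(queue.get(i), queue.get(i + 1), 0));
        }
        // A back edge with one initial token closes the cycle (a self-edge
        // if the processor runs a single actor).
        if (!queue.isEmpty()) {
            edges.add(new Edge(queue.get(queue.size() - 1), queue.get(0), 1));
        }
        return edges;
    }
}
```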

A state is evaluated by the throughput of the schedule. This is calculated using state space exploration (SSE) during self-timed execution as described in Section 3.2.

4.1.3 Creating a Random Initial State

The first step of creating a random initial state is to create an execution order list of all actors in the graph. This is done by repeating the following three steps until all actors are added to the list (adding an actor to the list simulates executing it):

1. Find all actors ready to execute
2. Shuffle them
3. Add them to the order list

Figure 4.2. An HSDFG, an example of a state representing an order-based schedule and the graph representation of a state.

Next, the state is built by pulling the actors one by one from the list and mapping them to a random processor. The internal execution order on each processor is given by the order in which the actors are added to the processor. A minimal sketch of the whole procedure is given below.
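The sketch below uses illustrative names and assumes that preds contains an entry (possibly empty) for every actor: the shuffled order list is built first, then the actors are mapped to random processors in that order.

```java
import java.util.*;

public class RandomInitialState {
    /** Build a random but dependency-respecting execution order list.
     *  preds: actor -> its predecessors in the HSDFG. */
    static List<String> randomOrderList(Map<String, Set<String>> preds, Random rnd) {
        List<String> order = new ArrayList<>();
        Set<String> executed = new HashSet<>();
        while (order.size() < preds.size()) {
            // 1. Find all actors ready to execute (all predecessors executed)
            List<String> ready = new ArrayList<>();
            for (Map.Entry<String, Set<String>> e : preds.entrySet()) {
                if (!executed.contains(e.getKey()) && executed.containsAll(e.getValue())) {
                    ready.add(e.getKey());
                }
            }
            Collections.shuffle(ready, rnd); // 2. Shuffle them
            order.addAll(ready);             // 3. Add them to the order list
            executed.addAll(ready);
        }
        return order;
    }

    /** Map the actors, in order, onto random processors. */
    static List<List<String>> randomState(List<String> order, int processors, Random rnd) {
        List<List<String>> state = new ArrayList<>();
        for (int p = 0; p < processors; p++) {
            state.add(new ArrayList<>());
        }
        for (String actor : order) {
            state.get(rnd.nextInt(processors)).add(actor);
        }
        return state;
    }
}
```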

4.1.4 Definition of a Neighbour State

A neighbour state is created by moving one single actor to another position in the order-based schedule. The actor can be moved to any processor and take any valid position in the execution queue. A valid position implies that the dependencies between actors according to the original graph are still respected. An example of a state and all its possible neighbours is shown in Figure 4.3. Worth noting is that the modification of a state into a neighbour can change both the mapping and the ordering of the actors in one move.

To make sure that dependencies are respected when making moves, each actor is provided with input and output lists. The input list for an actor contains all actors that must be executed before it. The output list similarly contains all actors dependent on tokens from it. This includes not only the closest actors connected to the actor in question by channels, but also all actors preceding or succeeding them. For the graph in Figure 4.4 the input list for actor C3 would include A and B, and the output list would include D3, E and F. A sketch of how such a list can be computed is given below.
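Such lists amount to the transitive closure over the original graph. The following sketch (illustrative names) builds the input list by a depth-first walk over the predecessor relation; the output list is built analogously over successors.

```java
import java.util.*;

public class DependencyLists {
    /** All transitive predecessors of an actor in the original HSDFG. */
    static Set<String> inputList(String actor, Map<String, Set<String>> preds) {
        Set<String> result = new HashSet<>();
        Deque<String> stack = new ArrayDeque<>(preds.getOrDefault(actor, Set.of()));
        while (!stack.isEmpty()) {
            String p = stack.pop();
            if (result.add(p)) { // visit each predecessor only once
                stack.addAll(preds.getOrDefault(p, Set.of()));
            }
        }
        return result;
    }
}
```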


Figure 4.3. Example of a state and all its possible neighbours.

When moving an actor to a new position, it is checked that no preceding actors on the same processor are part of the actor's output list and that no succeeding actors are part of the actor's input list. A sketch of this check is given below. However, even when using these simple rules, there is still a risk for a deadlock to occur. An example of this is shown in Figure 4.4 in the next section.
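A minimal sketch of the check, with illustrative names: inserting an actor at a given index in a processor queue is valid only if nothing before the index is in the actor's output list and nothing at or after the index is in its input list.

```java
import java.util.*;

public class MoveValidation {
    static boolean isValidPosition(List<String> processorQueue, int insertIndex,
                                   Set<String> inputList, Set<String> outputList) {
        // No preceding actor on the processor may depend on the moved actor.
        for (int i = 0; i < insertIndex; i++) {
            if (outputList.contains(processorQueue.get(i))) return false;
        }
        // No succeeding actor may be one the moved actor depends on.
        for (int i = insertIndex; i < processorQueue.size(); i++) {
            if (inputList.contains(processorQueue.get(i))) return false;
        }
        return true;
    }
}
```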

4.1.5 Handling Deadlocks

Figure 4.4 shows an example of a state with a deadlock. This situation can occur even if the rules of respecting dependencies are followed as described in the previous section. The C-actors are resources for the D-actors. D4 is blocking C3 and waiting for C4, while D3 is blocking C4 and waiting for C3. This satisfies the conditions for a deadlock.

In this approach deadlock detection is performed during the simulated self-timed execution. If the runtime has passed the sum of all WCETs and some actor has still not executed, a deadlock is detected. This method is motivated by its simplicity; a more complex alternative would have been to prevent deadlocks when looking for a new valid position in the schedule for an actor. Running the algorithm on all graphs used in this degree project shows that the deadlock states usually constitute less than 5 percent, with peaks of 20 percent, of the total number of tested states. A sketch of the detection test is given below.
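A sketch of the test, assuming the simulator tracks the simulated time and which actors have fired: if the simulated time exceeds the sum of all WCETs while some actor has never executed, the schedule is reported as deadlocked.

```java
import java.util.*;

public class DeadlockDetection {
    static boolean isDeadlocked(long simulatedTime,
                                Map<String, Integer> wcet,
                                Set<String> actorsExecuted) {
        // One iteration can never legitimately take longer than all WCETs summed.
        long bound = wcet.values().stream().mapToLong(Integer::longValue).sum();
        return simulatedTime > bound && actorsExecuted.size() < wcet.size();
    }
}
```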


Figure 4.4. Example of a scheduling of an HSDFG where deadlock occurs.

4.1.6 Output

The output is an order-based schedule together with a static time-based schedule produced by self-timed execution using the WCETs of the actors. The time-based schedule gives information about the initial latency and the throughput. For both algorithms the output is the best schedule encountered during the whole search, saved as the globally best state. If a throughput requirement was given as input, a state only counts as a solution if it satisfies the requirement.

4.2 Approach using Random Restart Hill Climbing

As described in Section 2.3.1 there are several ways to choose how many neighbours to evaluate before making a move. The steepest-ascent version evaluates all neighbours and chooses the best one. The first-choice version randomly creates neighbours until a better one is found and takes it. The first-choice method usually works well when a state has a large number, for instance thousands, of neighbours [8]. The algorithm in this approach uses a combination of both versions, since creating a neighbour takes two steps. The first step is to choose an actor to move; the second step is to find a new position in the schedule for this actor. The algorithm randomly chooses an actor to move, then tries all possible moves for this actor and chooses the best one, which gives a steepest-ascent choice for this actor. The first actor that can present an up-hill move from the current state is chosen, which gives the first-choice method over actors. This is motivated by the potentially large number of neighbours of one state: if there are fifty actors, and each actor can make 20 different moves, the state has one thousand neighbours. If no actor can present a move that gives better throughput, there are no up-hill moves and the search has reached a local maximum. When stuck on a local maximum, the algorithm restarts with a new random initial state to continue the search.

An alternative to restarting immediately is to allow sideways moves. If sideways moves are possible from a state, there is a chance that the search has entered a shoulder, which means there could be up-hill moves nearby. In this approach the user can choose how many sideways moves to allow. There must be a limit because the algorithm does not keep track of which states have already been tested, which could cause the search to jump back and forth between a small number of states for the rest of the search time. With this alternative, the restart occurs when no up-hill or sideways moves are possible or when the maximal number of sideways moves is reached. The user can choose when the hill climbing search should terminate: either after a certain amount of time or after a certain number of restarts. If the optimal solution is known and given as input, the search stops if it is found. A sketch of the neighbour selection strategy is given below.
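The combination of first-choice (over actors) and steepest ascent (over one actor's moves) can be sketched as follows. The State interface and its methods are hypothetical stand-ins for the thesis implementation; neighboursByMoving is assumed to return all valid neighbour states obtained by moving the given actor.

```java
import java.util.*;

public class NeighbourSelection {
    /** Hypothetical state abstraction. */
    interface State {
        double throughput();
        List<State> neighboursByMoving(String actor);
    }

    /** Returns an up-hill neighbour, or null at a local maximum. */
    static State tryUphillMove(State current, List<String> actors, Random rnd) {
        List<String> untried = new ArrayList<>(actors);
        while (!untried.isEmpty()) {
            // First-choice over actors: pick a random actor not yet tried.
            String actor = untried.remove(rnd.nextInt(untried.size()));
            // Steepest ascent for this actor: evaluate all its moves.
            State best = null;
            for (State n : current.neighboursByMoving(actor)) {
                if (best == null || n.throughput() > best.throughput()) {
                    best = n;
                }
            }
            if (best != null && best.throughput() > current.throughput()) {
                return best; // the first actor offering an up-hill move wins
            }
        }
        return null; // no actor offers an up-hill move: local maximum
    }
}
```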

4.2.1 Process and Flow Chart

Figure 4.5 shows a flow chart of the search process of the hill climbing algorithm. This section gives a description of the steps; a sketch of the resulting search loop is given after the list.

1. Create a new random initial state as described in Section 4.1.3.

2. A list is kept of the actors that have already been tried. Choose an actor randomly among those that are not on the list.

3. Test all possible moves of this actor to other positions in the schedule. Each move of the actor to a new position represents a neighbour state. Evaluate all neighbours created by moving this single actor by calculating the throughput of their schedules with SSE. Choose the best neighbour. If several neighbours give the same best throughput, choose the first one encountered.


Figure 4.5. Flow chart for hill climbing.

4. If the chosen neighbour has a better throughput than the current state, go to step 5. If not, go to step 6.

5. Accept the move of the actor by setting the current state to the neighbour state. Empty the list of actors tried. Reset the sideways moves counter to zero. Repeat from step 2.

6. If there are still actors left to try in the search for an up-hill move, repeat from step 2. If all actors on the list have been tried, there are no possible up-hill moves from this state. Go to step 7.
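Putting the steps together, the outer loop of the search might look like the following sketch. It reuses the hypothetical State and tryUphillMove from the sketch in Section 4.2, and step 7 (which falls outside this excerpt) is assumed here to check the termination criterion and restart with a new random initial state, as described in Section 4.2.

```java
import java.util.List;
import java.util.Random;
import java.util.function.Supplier;

public class HillClimbSearch {
    /** Hypothetical search loop: repeat random-restart climbs until the
     *  time limit is reached, keeping the globally best state found. */
    static NeighbourSelection.State search(Supplier<NeighbourSelection.State> randomInitial,
                                           List<String> actors,
                                           long timeLimitMillis, Random rnd) {
        NeighbourSelection.State globalBest = null;
        long deadline = System.currentTimeMillis() + timeLimitMillis;
        while (System.currentTimeMillis() < deadline) {
            NeighbourSelection.State current = randomInitial.get(); // step 1
            NeighbourSelection.State next;
            // Steps 2-6: climb until no actor offers an up-hill move.
            while ((next = NeighbourSelection.tryUphillMove(current, actors, rnd)) != null) {
                current = next;
            }
            // Keep the best schedule encountered during the whole search.
            if (globalBest == null || current.throughput() > globalBest.throughput()) {
                globalBest = current;
            }
        }
        return globalBest;
    }
}
```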
