

Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer and Information Science

Master thesis, 30 ECTS | Datateknik

2018 | LIU-IDA/LITH-EX-A--18/044--SE

Parallelization of Aggregated FMUs using Static Scheduling

Mattias Hammar

Supervisor: Lennart Ochel
Examiner: Peter Fritzson



Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/hers own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Mattias Hammar


Abstract

This thesis implements and evaluates static scheduling for aggregated FMUs. An aggregated FMU is several coupled FMUs packaged in a single FMU. The implementation creates task graphs from the internal dependencies of and connections between the coupled FMUs. These task graphs are then scheduled using two different list scheduling heuristics, MCP and HLFET, and the resulting schedules are executed in parallel using OpenMP in the runtime. The implementation is evaluated by looking at the utilization of the schedule, the execution time of the scheduling and the speedup of the simulation. These measurements are taken on three different test models. With model exchange FMUs only a very small speedup is observed. With co-simulation models the speedup varies considerably depending on the model; the highest achieved speedup was 2.8 running on four cores.


Preface


Acknowledgments

Firstly, I would like to thank everyone at Modelon, especially my external supervisor Labinot, for all their help in making this thesis possible. I would also like to thank the staff at Linköping University for giving me a great education, and all of my classmates and the members of the academic computer club Lysator for making these five years some of the best of my life.

Lastly, I want to thank all my family and friends who have supported me in many different ways. I would not have managed this without your support.


Contents

Abstract
Preface
Acknowledgments
Contents
List of Figures
List of Tables
Listings
1 Introduction
1.1 Motivation
1.2 Aim
1.3 Research questions
1.4 Delimitations
2 Background
2.1 Functional Mockup Interface
3 Theory
3.1 Coupled Systems
3.2 Functional Mockup Interface
3.3 Aggregated FMU
3.4 Parallel Computing Theory
3.5 Multiprocessor Scheduling
3.6 Related Work
4 Method
4.1 Workflow
4.2 Implementation
4.3 Heuristics
4.4 Test Models
4.5 Evaluation
4.6 Technical Limitations
5 Results
5.1 Weights
5.2 Evaluation
6 Discussion
6.1 Results
6.2 Method
6.3 The work in a wider context
7 Conclusion
7.1 Summary
7.2 Future Work


List of Figures

3.1 The different model descriptions of two subsystems, coupled at the behavioral description.
3.2 Two coupled systems, the dashed lines are direct feed-through and the red lines show an algebraic loop. The outputs and inputs are denoted with y and u respectively.
3.3 A simple system.
3.4 An example of an FMU.
3.5 Two coupled systems, one with Co-simulation FMUs and one with Model Exchange FMUs.
3.6 The FMI Model Exchange FMU state machine.
3.7 The FMI Co-Simulation state machine.
3.8 An example of an aggregated FMU.
3.9 Graph showing the theoretical speedup according to Amdahl's law.
3.10 A simple Task Graph.
4.1 The workflow that was used during this thesis.
4.2 Task graph for a step sequence for two FMUs.
4.3 Two coupled FMUs.
4.4 An example task graph from the FMUs in figure 4.3.
4.5 The task graph for the step sequence of the FMUs in figure 4.3.
4.6 Overview of the Race Car model. Note that each wheel has direct feed-through and that the connections are vector valued. © Modelon
4.7 Overview of the FMUs in the balanced car aggregate.
5.1 Speedup graph of the Co-Simulation Race Car Model.
5.2 Speedup graph of the Co-Simulation Four Race Cars Model.


List of Tables

3.1 An example of a schedule.
3.2 Attributes of the task graph in figure 3.10.
5.1 Average execution time of the Signal and Wait operations in milliseconds.
5.2 Average execution times in milliseconds for the Co-Simulation Race Car model with a step size of 2 ms.
5.3 Task graph weights for the Co-Simulation Race Car model.
5.4 Average execution times in milliseconds for the Model Exchange Race Car model with a step size of 2 ms.
5.5 Task graph weights for the Model Exchange Race Car model.
5.6 Average execution times in milliseconds for the Co-Simulation Four Race Cars model with a step size of 2 ms.
5.7 Task graph weights for the Co-Simulation Four Race Cars model.
5.8 Average execution times in milliseconds for the Co-Simulation Balanced Car model with a step size of 0.5 ms.
5.9 Task graph weights for the Co-Simulation Balanced Car model.
5.10 Utilization of the Co-Simulation Race Car model schedules with and without pinning an FMU to a specific core.
5.11 Runtimes of the Co-Simulation Race Car Model in seconds.
5.12 Total execution time for each FMU's step sequence in milliseconds.
5.13 Utilization of the Model Exchange Race Car model schedules with and without pinning an FMU to a specific core.
5.14 Runtimes of the Model Exchange Race Car Model in seconds.
5.15 Utilization of the Co-Simulation Four Race Cars model schedules with and without pinning an FMU to a specific core.
5.16 Runtimes of the Co-Simulation Four Race Cars Model in seconds.
5.17 Total execution time for each FMU's step sequence in milliseconds.
5.18 Utilization of the Co-Simulation Balanced Car model schedules with and without pinning an FMU to a specific core.
5.19 Runtimes of the Co-Simulation Balanced Car Model in seconds.
5.20 Total execution time for each FMU's step sequence in milliseconds.
5.21 Average runtimes from all test models per graph in milliseconds.


Listings

3.1 An example of a Step Sequence in an aggregate description with two FMUs.
3.2 Pseudo code for a simple semaphore implementation.
3.3 A small OpenMP example.
3.4 Pseudo code for the general list scheduling algorithm.
3.5 The general cluster scheduling algorithm.
4.1 An example of a step sequence in the aggregate description with two FMUs.
4.2 A divided step sequence without synchronization.
4.3 A divided step sequence with synchronization.
4.4 Pseudo code for the execution of the step sequence in parallel.
4.5 Pseudo code to calculate the static b-level of a task graph.
4.6 Pseudo code for the Highest Level First With Estimated Times heuristic.
4.7 Pseudo code to calculate the critical path of a task graph.
4.8 Pseudo code to calculate the ALAP of a task graph.


1 Introduction

Modeling and simulation is a technique to represent physical systems as abstract models and perform experiments on them. It has become a useful part of many engineering processes as a step between concept design and prototype. Creating real prototypes can be an expensive and long process; it is often cheaper to create a digital model first. The model can then be used to verify and optimize the design before a prototype is built.

In 2010 the first version of the Functional Mockup Interface (FMI) standard was published. It defines a standard way to represent models. A model that follows the FMI standard is called a functional mockup unit (FMU). In this thesis an aggregated FMU refers to an FMU that contains several coupled FMUs internally. The thesis implements and evaluates the use of static scheduling to execute the simulation of aggregated FMUs in parallel on symmetric multiprocessor systems.

1.1 Motivation

A big problem with performing simulation on digital models is that it can be computationally heavy. With complex models it can take several minutes to calculate one second of simulation time. For example, the model described in section 4.6 takes about 20 minutes to simulate 25 seconds. The faster the simulations can run, the less time is wasted; the simulations can also run on less expensive hardware and the power consumption can be decreased. Therefore, it is important that simulation tool vendors design their tools to be as efficient as possible.

A big part of this is optimizing the program for modern hardware. The trend in computer architecture for the last decade has been to increase the number of cores in each CPU, because increasing the clock frequency of a single core has seen diminishing returns. The diminishing returns are due to what is known as the three walls in computer architecture: the memory wall, the instruction-level parallelism (ILP) wall and the power wall. The memory wall refers to the fact that memory speeds have not kept up with CPU performance. The ILP wall is the increasing difficulty of finding enough parallelism in a single instruction stream to keep the CPU cores busy. The power wall refers to the fact that a small increase in clock frequency can increase the power consumption a lot.

To fully utilize a modern CPU it is therefore necessary to design simulation software to use multiple cores. This thesis presents a solution for how this can be achieved within the FMI standard using static scheduling techniques.


1.2 Aim

The aim of this thesis is twofold. The first part is to design and demonstrate a good way to implement static scheduling for aggregated FMUs. The second part is to evaluate whether static scheduling is a good method for simulating aggregated FMUs in parallel. This includes exploring what advantages and disadvantages the approach has and what kind of speedups can be expected.

1.3 Research questions

This thesis strives to answer two research questions. The first is answered by presenting an implementation of static scheduling for aggregated FMUs; the second by evaluating how well that implementation performs.

1. How can static scheduling be used to simulate an aggregated FMU in parallel?
2. How big of a speedup can be expected by executing an aggregated FMU in parallel?

1.4 Delimitations

This thesis assumes that the simulations are executed on a shared memory processor. It only discusses parallel simulation in the context of the FMI standard, and the implementation does not support models that have algebraic loops.


2 Background

This chapter introduces additional background information as a complement to the introduction.

2.1 Functional Mockup Interface

Functional Mockup Interface (FMI) is a standard created by the ITEA2 (Information Technology for European Advancement) project MODELISAR. The goal of the project was to support the AUTomotive Open System ARchitecture (AUTOSAR) and to develop the FMI standard. The purpose of FMI is to standardize model exchange and co-simulation of dynamic models, instead of every tool using its own solution. The standard specifies an interface to represent models. An instance of a model that follows this specification is called a Functional Mockup Unit (FMU). [1]

One common use case is that an OEM wants to simulate several models from different suppliers coupled together into one larger system. If each supplier uses a different modeling tool to create their models, it can cause incompatibilities between the models and obstruct a successful simulation of the whole system. Hence it was necessary to create an open standard that all tools could follow, which became FMI.

The MODELISAR project began in July 2008 and ended in December 2011. The first version, 1.0, of FMI was published in 2010 and the latest version, 2.0, was published in 2014 [2]. Since the MODELISAR project ended, FMI has been maintained and developed by the Modelica Association. [1]


3 Theory

This chapter introduces the theory needed to understand the rest of the thesis. It first introduces coupled systems from a general perspective, then goes into the specific treatment of coupled systems within the FMI standard. After that it introduces the concept of an aggregated FMU, which couples several regular FMUs and encapsulates them in a single FMU. The following section covers the parallel computing theory needed to understand the parallelization parts of this thesis. Then the multiprocessor scheduling problem is introduced, and the last section covers related work.

3.1 Coupled Systems

Simulating and modeling complex engineering systems is a difficult task that requires a lot of work. To simplify the workflow, it is often preferable to divide a large system into several smaller subsystems and couple them together later on. This modular approach makes it possible to model each subsystem independently and in parallel. There are many advantages to this approach: the subsystems can be reused with little work, modelers only need to focus on areas of their expertise, and the internal workings of each subsystem can be hidden.

Figure 3.1: The different model descriptions of two subsystems, coupled at the behavioral description.

However, the modular approach also comes with the problem of coupling the subsystems together efficiently without compromising the stability of the system. The coupling can be done at three different abstraction levels: the physical, mathematical and behavioral model descriptions, see figure 3.1. In the physical model description, the system is modeled with physical parameters, such as mass, resistance, velocity etc. In the mathematical description the system is described by mathematical equations, and in the behavioral description the system is described by the simulation results from the mathematical equations. Coupling the subsystems at the physical description is basically the same as the non-modular approach.


Only the mathematical and behavioral model descriptions will be considered in this thesis. [3]

No matter at which abstraction level the system is coupled, some consideration has to be taken in how this is done. The output of a subsystem can be directly dependent on one or several inputs of the same subsystem. This is called direct feed-through.

Definition 3.1. Direct Feed-through is when an output of a system is directly dependent on an input. [4]

It means that the value of the input directly controls the value of the output. One example of direct feed-through is when the output is simply a constant added to an input; if the output is instead delayed by one time step, it is not direct feed-through. In figure 3.2 the dashed lines denote direct feed-through; $y^{[1]}_1$ is for example directly dependent on $u^{[1]}_2$. This becomes a problem if the system contains a loop of direct feed-through connections. It is necessary to know the value of all directly dependent inputs to calculate the value of an output, which is not possible if the output is part of a direct feed-through loop. Such a loop is also known as an algebraic loop.

Definition 3.2. An algebraic loop is a loop of connections between the subsystems with direct feed-through.

Figure 3.2: Two coupled systems, the dashed lines are direct feed-through and the red lines show an algebraic loop. The outputs and inputs are denoted with $y$ and $u$ respectively.

See figure 3.2 for an example of an algebraic loop. There are ways to eliminate loops, for example by adding a filter that removes the direct feed-through property on one of the outputs in the loop [3]. There are also ways to numerically solve the loops, which in general is a non-trivial task. As stated in section 1.4, algebraic loops are not considered in this thesis.

Mathematical Description

A system coupled at the mathematical description is often called a strongly coupled system. This is only possible if the subsystems expose their internal equations. This section introduces a mathematical description of a coupled system.

A subsystem $i$ can be described by the differential algebraic equation (DAE)

$$\dot{x}^{[i]} = f^{[i]}(x^{[i]}, u^{[i]}, t) \qquad (3.1)$$
$$y^{[i]} = g^{[i]}(x^{[i]}, u^{[i]}, t), \qquad (3.2)$$

where $x^{[i]}$ is the vector of state variables, $u^{[i]}$ is the vector of inputs, $y^{[i]}$ is the vector of outputs and $t$ is time [3]. Note that the superscript $i$ denotes which subsystem the function or variable belongs to. To describe a global system of $N$ subsystems, the global input and output vectors are

$$y = \begin{bmatrix} y^{[1]} & y^{[2]} & \dots & y^{[N]} \end{bmatrix}^T \qquad (3.3)$$
$$u = \begin{bmatrix} u^{[1]} & u^{[2]} & \dots & u^{[N]} \end{bmatrix}^T. \qquad (3.4)$$

To describe the connections between the subsystems, each subsystem's input vector $u^{[i]}$ is defined as a function of the global output vector $y$:

$$u^{[i]} = L^{[i]} y = \begin{bmatrix} L^{[i]}_1 & \dots & L^{[i]}_{i-1} & 0 & L^{[i]}_{i+1} & \dots & L^{[i]}_N \end{bmatrix} \begin{bmatrix} y^{[1]} \\ \vdots \\ y^{[i-1]} \\ y^{[i]} \\ y^{[i+1]} \\ \vdots \\ y^{[N]} \end{bmatrix} \qquad (3.5)$$

where $N$ is the number of subsystems and $L^{[i]}$ is the coupling matrix in which each element is either 0 or 1 [3, 4]. Now the global system can be described by the vectors of all the subsystems' state variables $x$, outputs $y$ and inputs $u$.

Simulating a strongly coupled system is equivalent to solving the initial value problem, i.e. solving ordinary differential equations given an initial state. There are several numerical methods to do this; one example is the Runge-Kutta method. [3]
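As an illustration, the following is a minimal sketch in C of one classical fourth-order Runge-Kutta step for a single subsystem on the form of equation (3.1). The function type and names are illustrative, not taken from the thesis implementation:

#include <stddef.h>

/* Right-hand side xdot = f(x, u, t) of one subsystem, cf. equation (3.1). */
typedef void (*rhs_fn)(const double *x, const double *u, double t,
                       double *xdot, size_t nx);

/* One classical RK4 step of size h; advances x in place. A sketch only:
   a real solver would also handle events, error control and step-size
   selection. For brevity the state dimension is assumed to be at most 8. */
static void rk4_step(rhs_fn f, double *x, const double *u,
                     double t, double h, size_t nx) {
    double k1[8], k2[8], k3[8], k4[8], tmp[8];
    size_t i;
    f(x, u, t, k1, nx);
    for (i = 0; i < nx; i++) tmp[i] = x[i] + 0.5 * h * k1[i];
    f(tmp, u, t + 0.5 * h, k2, nx);
    for (i = 0; i < nx; i++) tmp[i] = x[i] + 0.5 * h * k2[i];
    f(tmp, u, t + 0.5 * h, k3, nx);
    for (i = 0; i < nx; i++) tmp[i] = x[i] + h * k3[i];
    f(tmp, u, t + h, k4, nx);
    for (i = 0; i < nx; i++)
        x[i] += (h / 6.0) * (k1[i] + 2.0 * k2[i] + 2.0 * k3[i] + k4[i]);
}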

Co-Simulation

Coupling at the behavioral description is most often referred to as weakly coupled systems, simulator coupling or co-simulation. Here each subsystem uses its own integrator to solve the internal system and then communicates the result to the other subsystems. As a result, the communication between the subsystems is not continuous; it occurs at fixed discrete points, usually called global steps and denoted $T_n$. There is a need to distinguish between global steps $T_n$ and local steps $t^{[i]}_{n,m}$, where $T_n = t^{[i]}_{n,0} < t^{[i]}_{n,1} < \dots < t^{[i]}_{n,m} = T_{n+1}$. Since a subsystem only knows the value of its inputs at $t^{[i]}_{n,0}$ for each global step $n$, the question arises of how to deal with the inputs. There are a couple of different ways of solving this problem. The easy answer is to keep the inputs constant between each global step. Another solution is to extrapolate the input from its previous values. In either case, it is worth noting that this might reduce the accuracy of the simulation and may introduce instability. [4] Weakly coupled systems introduce many advantages over strongly coupled systems. The fact that all subsystems have their own integrator makes it possible to use a domain-specific integrator for each subsystem. More importantly for this thesis, since the systems are more divided, each subsystem can be simulated in parallel.

Simulation of a co-simulation system is done with something usually called a master algorithm. The master algorithm controls the communication between the different subsystems and the order of execution. There are two main types, the Jacobi type and the Gauß-Seidel type. When stepping the system forward one step from $T_i$ to $T_{i+1}$, every system $n$ has to know or make an assumption about its inputs $u^{[n]}(T_i)$ to calculate its outputs $y^{[n]}(T_{i+1})$, where $u^{[n]}(T_i)$ denotes the input vector of system $n$ at time $T_i$. With an algorithm of the Jacobi type, the input is calculated from the connected system's previous output; for example, in figure 3.3 with constant interpolation, system 2 would use $u^{[2]}(T_{i+1}) = y^{[1]}(T_i)$ to calculate $y^{[2]}(T_{i+1})$. With a Gauß-Seidel type master algorithm the systems would instead use the actual input values; for example, system 2 would use $u^{[2]}(T_{i+1}) = y^{[1]}(T_{i+1})$, however this means that system 1 has to be solved before system 2 or 3 can know their inputs. Jacobi type master algorithms have a bigger potential for simulating the systems in parallel; however, there exist cases where the Gauß-Seidel type converges and the Jacobi type does not. [5]

Figure 3.3: A simple system.
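To make the difference concrete, the sketch below shows one global step under each type of master algorithm for two connected systems. The helpers get_output, set_input and do_step are hypothetical stand-ins, not FMI functions:

/* Hypothetical helpers; in a real master they would wrap the Get, Set
   and DoStep operations on the respective subsystem. */
double get_output(int system);
void   set_input(int system, double value);
void   do_step(int system);

/* Jacobi: inputs are taken from the outputs of the PREVIOUS global step,
   so the two do_step calls are independent and could run in parallel. */
void jacobi_step(void) {
    set_input(2, get_output(1));  /* u[2](T_i+1) = y[1](T_i)         */
    do_step(1);                   /* these two calls do not depend   */
    do_step(2);                   /* on each other within the step   */
}

/* Gauss-Seidel: system 2 uses the CURRENT output of system 1, so
   system 1 must be solved first; the step is inherently sequential. */
void gauss_seidel_step(void) {
    do_step(1);                   /* produces y[1](T_i+1)            */
    set_input(2, get_output(1));  /* u[2](T_i+1) = y[1](T_i+1)       */
    do_step(2);
}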

3.2 Functional Mockup Interface

The Functional Mockup Interface (FMI) is a standard that describes a way to represent models, see section 2.1. A model that follows the standard is called a Functional Mockup Unit (FMU). An FMU is a ZIP archive that contains an XML file named the model description, a set of C functions usually called the runtime, and other optional data such as icons, documentation files, libraries etc. The XML file contains information about the model, such as exposed variables, unit definitions, model structure and more. The set of C functions can be provided either as C code or in binary format, and it defines a standardized API to interact with the model. Figure 3.4 shows an example of an FMU's file structure. [1]

[Figure: the file structure of Model.fmu, containing Binaries (Win32/Model.dll, Linux32/Model.so, ...), Documentation (index.html), Icon.png, ModelDescription.xml and Resources.]

Figure 3.4: An example of an FMU.

Two of the most important exposed functions for this thesis are fmi2GetXXX(. . . ) and fmi2SetXXX(. . . ), where XXX denotes the variable type. These functions will from now on be referred to as Set and Get. The Set function sets the value of an FMU's input and Get retrieves the value of an output from an FMU.
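For illustration, here is a minimal sketch of how these functions are called for real-valued variables, following the FMI 2.0 C API. The component is assumed to be already instantiated, and the value references are made up rather than taken from a real model description:

#include "fmi2Functions.h"   /* standard FMI 2.0 header */

void exchange_values(fmi2Component comp) {
    /* Value references identify variables; real ones come from the
       FMU's model description (these numbers are hypothetical). */
    const fmi2ValueReference vr_in[1]  = { 0 };
    const fmi2ValueReference vr_out[1] = { 1 };
    fmi2Real in_value[1] = { 3.14 };
    fmi2Real out_value[1];

    fmi2SetReal(comp, vr_in, 1, in_value);    /* "Set": write an input */
    fmi2GetReal(comp, vr_out, 1, out_value);  /* "Get": read an output */
}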


There are two different types of FMUs: FMUs for model exchange (ME) and FMUs for co-simulation (CS). ME FMUs are intended for strongly coupled systems; they expose their derivatives so that an external tool can simulate the model. CS FMUs are intended for weakly coupled systems; they include a solver in the FMU, see figure 3.5 for an overview of the difference. Both types of FMUs have some common features and restrictions which are important for this thesis. [1]

Figure 3.5: Two coupled systems, one with Co-simulation FMUs (a) and one with Model Exchange FMUs (b).

Feature 3.1. It is possible to get dependency information from an FMU about which outputs directly depend on which inputs.

Restriction 3.1. The FMI standard does not guarantee that an FMU's operations are thread safe.

Feature 3.1 is essential for parallelizing the simulation of FMUs: if the dependency information is not available, there is no way of knowing in which order the operations can safely be executed. On the other hand, restriction 3.1 limits how well the simulations can be parallelized. It means that it is not possible to execute two operations on the same instance of an FMU at the same time without risking invalid results.

FMU for Model Exchange

The FMUs for model exchange expose their internal equations to enable an external solver to simulate the system. This can be used to strongly couple several FMUs together. The FMU interface handles ordinary differential equations (ODE) with events, also known as hybrid ODEs. The model is described as a piecewise continuous-time system; discontinuities can occur and are called events. In between the events the variables are either continuous or constant. During the simulation an ME FMU can be in five different states, see figure 3.6 for the state machine. [6]

Instantiated: The FMU has just been loaded or reset; variables that have an exact or approximate initial value can be set using the Set operations. To exit this mode fmi2EnterInitializationMode(. . . ) is called and the FMU enters the initialization state.

Initialization: During this state the initial values for inputs can be calculated using extra equations that are not accessible in other states. Outputs can also be retrieved using the Get operations. To exit this state fmi2ExitInitializationMode(. . . ) is called and the FMU enters the event mode.

Event Mode: In this state the events are processed; new values for all continuous-time variables and activated discrete-time variables are computed. When fmi2NewDiscreteStates(. . . ) is called the FMU will either enter the continuous-time mode or do another iteration of the event mode.

Continuous-Time Mode: In this state the external solver calculates the ODEs to step the system forward. The integration is stopped when an event is triggered; the solver checks the event indicators after each completed integrator step. After an event is triggered the FMU enters the event mode. All discrete-time variables are fixed during this state.

Terminated: The simulation has ended and the solution at the last time step can be retrieved.

There are four different types of events in an ME FMU.

External Event: An external event is triggered when a continuous-time input has a discontinuous change, a discrete-time input changes value or a parameter is changed.

Time Event: A time event is triggered at a predefined time step.

State Event: A state event is triggered if the FMU enters a specific state; this is done by setting event indicators.

Step Event: A step event can occur after an integrator step has finished. Step events typically do not influence the model behavior; they are used to ease the numerical integration.

Figure 3.6: The FMI Model Exchange FMU state machine [6]

FMU for Co-Simulation

The second type of FMU is for co-simulation (CS); these are used in weakly coupled systems. In other words, the communication is only done at fixed discrete time points, and each FMU is packaged with its own solver so the FMUs can be solved independently of each other. The master algorithm steps each subsystem forward one step in time by calling fmi2DoStep(. . . ) on the FMUs; from here on fmi2DoStep(. . . ) will be shortened to DoStep. To control the communication between the FMUs, the Set and Get functions are used. During simulation a CS FMU has four different states; instantiated, initialization and terminated are basically the same as in the ME case, see the previous section 3.2. However, instead of the continuous-time and event modes it has an initialized mode. [6]

Initialized: This is where the actual simulation takes place. DoStep is used to calculate the state for the next global time step, and Get and Set are used to pass values between models. The FMU can be in three different sub-states, step complete, step failed or step canceled, depending on the return value of a DoStep call.

See the state machine in figure 3.7 for a more detailed description.

Figure 3.7: The FMI Co-Simulation state machine [6]

3.3 Aggregated FMU

FMI is great for representing models, but the standard does not specify how the information describing the coupling of several FMUs should be stored. One way to solve this problem is by using aggregated FMUs. An aggregated FMU is a single FMU that contains several FMUs that are coupled together. This works by creating a new FMU for the aggregate and placing all the FMUs that are coupled together in its resource directory. Then an aggregate description XML file is added; this file specifies how the FMUs are coupled together. See figure 3.8 for an example. When the aggregate is used, a simulation tool will load the runtime (binaries) of the aggregate; the aggregate runtime will parse the aggregate description, load each FMU's binary files and run the simulation. It is important to note that the aggregate acts just like a regular FMU: any simulation tool that supports FMI is able to use an aggregated FMU.


[Figure: the file structure of Aggregate.fmu, containing Binaries (Win32/Aggregate.dll), ModelDescription.xml and a Resources directory holding AggregateDescription.xml and the complete Model1, Model2, ... FMU directories.]

Figure 3.8: An example of an aggregated FMU.

Aggregate for Co-Simulation

An aggregate for co-simulation is an FMU that consists of several CS FMUs. Remember that an aggregate looks like a regular FMU to the simulation tool; a DoStep call on the aggregate therefore needs to simulate the entire coupled system one step forward. In other words, the aggregate runtime needs some kind of master algorithm to simulate the coupled system. The aggregate description contains call sequences that the aggregate will execute. Each call sequence contains operations that should be executed on a specific FMU. In the CS case an aggregate contains three call sequences.

Enter Initialization Sequence: This call sequence is executed when the FMUs are in the instantiated state. It sets variables with approximate or exact initial values and makes sure all FMUs enter the initialization state.

Initialization Sequence: This call sequence is executed when the FMUs are in the initialization state. It moves the FMUs from the initialization state to the initialized state.

Step Sequence: This call sequence is executed when the FMUs are in the initialized state. It moves the aggregate forward one global step in time. This is done by using Get and Set operations to control the communication between the FMUs and DoStep to move each FMU in the aggregate forward in time.

This makes the runtime very flexible; for instance, both Jacobi and Gauß-Seidel master algorithms can be used by changing the step sequence. A simple example of a step sequence with two FMUs can be seen in listing 3.1.


StepSequence:
    Set( u[1]_1 )
    Set( u[1]_2 )
    Set( u[2]_1 )
    DoStep( 1 )
    DoStep( 2 )
    Get( y[1]_1 )
    Get( y[2]_1 )

Listing 3.1: An example of a Step Sequence in an aggregate description with two FMUs.

Aggregate for Model Exchange

Coupling model exchange (ME) FMUs and saving them in an aggregate works in a similar way as in the CS case. The aggregate description defines call sequences that propagate calls on the aggregate to all FMUs inside it. It has the same enter initialization and initialization sequences as the CS case, but it also has three other call sequences. Remember that in the model exchange case the solver is not included in the FMU; these call sequences are only for moving the FMUs between different states.

Continuous Sequence: Is executed when the FMU is in the continuous-time mode state. It updates all inputs that are continuous.

Event Sequence: Is executed when the FMU is in the event mode state.

New Discrete State Sequence: Is executed when fmi2NewDiscreteStates(. . . ) is called.

3.4 Parallel Computing Theory

This section will describe some basic parallel computing theory needed for this thesis.

Amdahl’s law

In parallel computing it is necessary to measure how effective a parallel algorithm is compared to the sequential algorithm. This is often done by measuring the speedup of the parallel algorithm, defined as the ratio between the sequential and parallel execution times.

Definition 3.3. Let $T_s$ denote the execution time of the fastest sequential algorithm and $T_p$ the execution time of a parallel algorithm on $p$ processors. Speedup is then defined as

$$S = \frac{T_s}{T_p}.$$

In the ideal case the speedup would be equal to $p$, although in most practical cases it is not possible to reach the ideal speedup. This is because parallel algorithms add some overhead and it is usually only possible to run certain parts of an algorithm in parallel. It is possible to calculate the theoretical best possible speedup if the fraction of the algorithm that can be executed in parallel is known. This is known as Amdahl's law.

Definition 3.4. Let $p$ denote the fraction of the algorithm that can be parallelized, so $1 - p$ is the fraction that cannot, and let $s$ be the speedup of the parallelized fraction. Then Amdahl's law states that the maximum possible speedup for the entire algorithm is

$$S(s) = \frac{1}{(1 - p) + \frac{p}{s}}. \quad [7]$$
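As a quick worked example (numbers chosen only for illustration): if 95% of an algorithm can be parallelized ($p = 0.95$) and that part is sped up by a factor $s = 16$, the overall speedup is

$$S(16) = \frac{1}{(1 - 0.95) + \frac{0.95}{16}} = \frac{1}{0.05 + 0.059375} \approx 9.1,$$

far below the ideal factor of 16.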


[Figure: theoretical speedup curves $S(s)$ for $p = 0.95$, $p = 0.90$ and $p = 0.80$.]

Figure 3.9: Graph showing the theoretical speedup according to Amdahl's law.

When programming for a specific architecture $s$ is usually fixed, which makes $1 - p$ the limiting factor for the possible speedup. Even a small $1 - p$ factor can be detrimental to the performance, see figure 3.9. It is however worth noting that these are simplifications; Amdahl's law does not, for example, take into account the added overhead of parallel algorithms.

Semaphore

In multi-threaded programs it is necessary to have some kind of synchronization between the threads. It can for instance be because of communication between the threads, code that only one thread can execute at once, or limitations in the algorithm. To accomplish this there are several synchronization primitives; one of these is the semaphore.

A semaphore is a synchronization primitive that is used to control how many threads can access a shared resource concurrently. It has an internal counter and two available operations, Wait and Signal. The Wait operation will put the thread to sleep if the counter is zero or less; otherwise the thread will decrement the counter and continue. The Signal operation will increment the counter and check if there are any threads waiting for the resource; if so, it will wake the first one up. See listing 3.2 for pseudo code.

counter = initial_value

function Wait()
    if counter is equal to 0
        add thread to wait queue and sleep
    decrement counter

function Signal()
    increment counter
    if wait queue is not empty
        wake up the first thread in the queue

Listing 3.2: Pseudo code for a simple semaphore implementation.

One easy way to think about semaphores is to imagine the counter as the number of available resources, for example the number of copies of a book in a library. The Wait operation is then equivalent to either taking a book or waiting until one is available, and the Signal operation to someone returning a book.
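Section 4.2 mentions that the thesis implementation realizes these operations with the Windows API. A minimal sketch of such a wrapper (not the thesis's actual code) could look as follows; note that ReleaseSemaphore takes a count, which is what makes an increment parameter cheap to support:

#include <windows.h>
#include <limits.h>

typedef struct { HANDLE handle; } semaphore_t;

void sem_init(semaphore_t *s, LONG initial) {
    /* Arguments: security attributes, initial count, maximum count, name. */
    s->handle = CreateSemaphore(NULL, initial, LONG_MAX, NULL);
}

void sem_wait(semaphore_t *s) {
    /* Blocks until the counter is positive, then decrements it. */
    WaitForSingleObject(s->handle, INFINITE);
}

void sem_signal(semaphore_t *s, LONG increment) {
    /* Increments the counter by 'increment', waking sleeping waiters. */
    ReleaseSemaphore(s->handle, increment, NULL);
}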


OpenMP

When developing a multi-threaded program it is often preferred to use a parallel framework. One of the de facto standard parallel frameworks for shared memory systems is OpenMP. OpenMP specifies a set of library routines, environment variables and compiler directives for C, C++ and Fortran [8].

OpenMP uses a fork-join model, i.e. the program is started with a master thread that forks (creates new) threads when it encounters an OpenMP directive. The library routines and environment variables can be used to control the runtime, e.g. how many threads to run and which schedules to use. For more information about OpenMP see the OpenMP specification; this section will only discuss the directive that is most relevant for this thesis, the Parallel For directive. [8]

The Parallel For directive is used to execute for loops in parallel and is of the form

#pragma omp parallel for [clause[[,] clause] ...]
    for-loop

where a clause is an option. When a thread encounters the directive, it will create a team of threads and divide the loop iterations into chunks. How the chunks are divided and executed among the threads depends on which schedule is used. OpenMP supports three types of schedules: static, dynamic and guided. With static scheduling the iterations are divided into equally sized chunks and each thread is assigned chunks in a round-robin fashion. With dynamic scheduling the iterations are also divided into equally sized chunks; however, each thread is only assigned one chunk at a time and has to request a new chunk when it has finished its previous one. The guided schedule is similar to dynamic, but the chunks are not all the same size; they start out large and get smaller and smaller. [8]

A = {1, 2, ..., 1000}
#pragma omp parallel for num_threads(4) schedule(static, 20)
for each elm in A
    elm = elm * 2
end for

Listing 3.3: A small OpenMP example.

For a small example see listing 3.3. It doubles the value of all elements in an array using 4 threads and a static schedule with a chunk size of 20.
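Listing 3.3 is pseudo code; a compilable C equivalent (an illustrative sketch, not code from the thesis) could look like this:

#include <stdio.h>

int main(void) {
    int a[1000];
    int i;
    for (i = 0; i < 1000; i++) a[i] = i + 1;

    /* Four threads; iterations handed out statically in chunks of 20. */
    #pragma omp parallel for num_threads(4) schedule(static, 20)
    for (i = 0; i < 1000; i++)
        a[i] = a[i] * 2;

    printf("a[0] = %d, a[999] = %d\n", a[0], a[999]);
    return 0;
}

It can be built with, for example, gcc -fopenmp or the MSVC /openmp flag.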

Task Graphs

Task graphs are a way to represent computations and their dependencies. A task graph is a directed acyclic graph (DAG) where each node represents a task, denoted $n_i$, which is a set of instructions that must be executed sequentially.

Definition 3.5. A directed acyclic graph (DAG) $G$ is a set of nodes $V$ and directed edges $E$ that contains no directed cycles.

The edges represent dependencies between the computations and are denoted $(n_i, n_j)$ for an edge from node $n_i$ to $n_j$. Consider the simple task graph in figure 3.10: tasks B and C can be executed in parallel, but neither of them can be executed before task A has finished. Nodes that have a directed edge to another node are usually called the parents of that node; for example, node A is a parent of nodes B and C. In the same manner, nodes B and C are called children of A. Nodes without any parents are called entry nodes and nodes without any children are called exit nodes. There is also a need to represent the computation time of a task and the communication costs of the edges.

[Figure: four tasks with node weights A = 2, B = 1, C = 2, D = 2 and edge weights c(A,B) = 1, c(A,C) = 1, c(B,D) = 3, c(C,D) = 1.]

Figure 3.10: A simple Task Graph.

Definition 3.6. The computation time of a task is represented by the weight of its node $n_i$ in the task graph and is denoted $w(n_i)$.

Definition 3.7. The communication cost between two nodes $n_i$, $n_j$ is represented by the weight of the edge between them and is denoted $c(n_i, n_j)$.

The numbers inside the nodes in figure 3.10 are the computation times and the numbers on the edges are the communication times. Another important aspect of task graphs is the critical path.

Definition 3.8. The critical path $cp$ is the path in a task graph $G$ with the largest sum of edge and node weights.

The critical path corresponds to the longest sequence of tasks that must be executed sequentially. There can be more than one critical path if several paths have the same length. No matter how many processing cores are available, it is not possible to execute a task graph faster than it takes to execute its critical path. For example, in figure 3.10 the critical path is A → B → D with a length of 9.
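This can be checked directly from the weights shown in figure 3.10:

$$w(A) + c(A,B) + w(B) + c(B,D) + w(D) = 2 + 1 + 1 + 3 + 2 = 9,$$

while the alternative path A → C → D only sums to $2 + 1 + 2 + 1 + 2 = 8$, so A → B → D is indeed the critical path.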

3.5 Multiprocessor Scheduling

Finding an optimal schedule to execute a task graph on a number of processing cores is known as the multiprocessor scheduling problem. DAG scheduling algorithms (DSAs) can be divided into two categories: static and dynamic. A static scheduling algorithm creates the schedule before the program is executed, and the runtime blindly follows that schedule. A dynamic scheduling algorithm schedules the tasks during execution; dynamic algorithms are more flexible as they can measure the execution times of the tasks and adapt during runtime, but they are also more complicated and add overhead. In this thesis only static schedules are considered. The general multiprocessor scheduling problem is NP-complete and is defined as follows [9].

Definition 3.9. Given a finite set $A$ of tasks, a length $l(a) \in \mathbb{Z}^+$ for each $a \in A$, a number of processors $m \in \mathbb{Z}^+$ and a deadline $D \in \mathbb{Z}^+$, find a partition $A = A_1 \cup A_2 \cup \dots \cup A_m$ of $A$ into $m$ disjoint sets such that

$$\max\left\{ \sum_{a \in A_i} l(a) : 1 \le i \le m \right\} \le D.$$


For an example of a schedule, see table 3.1, which shows an optimal schedule of the task graph in figure 3.10.

Table 3.1: An example of a schedule

Step   Core 1   Core 2
0      A        -
1      A        -
2      -        -
3      B        C
4      -        C
5      -        -
6      -        -
7      D        -
8      D        -

There exist some special cases of the problem that can be solved in polynomial time, but they are quite restrictive [10]. Most DSAs instead use heuristics to achieve near-optimal solutions in polynomial time. Kwok and Ahmad have written a thorough summary of 27 different DSAs that use heuristics [10]. The two most common types of DSAs are list and cluster scheduling algorithms.

List Scheduling

One of the simplest types of static task graph scheduling is list scheduling. There are many different list scheduling algorithms, but in the general case the approach involves two steps. The first is to sort the nodes according to some priority. The second is to loop through the sorted nodes and choose a processor to schedule each one on according to some strategy, see listing 3.4. [11]

Sort nodes n ∈ V into a list L according to some priority
for each n in L do
    choose a processor P according to some strategy
    schedule n on P
end

Listing 3.4: Pseudo code for the general list scheduling algorithm.

The scheduling step is usually done in one of two ways: insertion or non-insertion. In the non-insertion approach, the task is appended to the schedule of the chosen processor. If insertion is used, the task can instead be inserted into a hole in the processor's schedule. There are many different ways to calculate the node priority; two of the most common are bottom-level and top-level [10].

Definition 3.10. The top-level (t-level) of a node $n_i$ is defined as the length of one of the longest paths from an entry node to $n_i$ [10].

Definition 3.11. The bottom-level (b-level) of a node $n_i$ is defined as the length of one of the longest paths from $n_i$ to an exit node. The static b-level is a b-level that does not take the communication costs into account when calculating the longest path. [10]

Longest path refers to the path with the largest sum of node and edge weights. The t-level is highly correlated with the earliest possible start time of a node. The b-level is highly correlated with the critical path; the node with the highest b-level is by definition part of a critical path. Another attribute that some algorithms use is as late as possible (ALAP).


Definition 3.12. As late as possible (ALAP) of a node is how far the start time of the node can be delayed without changing the length of the schedule.

Table 3.2: Attributes of the task graph in figure 3.10

Node   T-level   B-level   Static B-level   ALAP
A      0         9         6                0
B      3         6         3                3
C      3         5         4                4
D      7         2         2                7

See table 3.2 for attributes corresponding to the graph in figure 3.10. A few examples of list scheduling algorithms are the following.

HLFET Highest Level First with Estimated Times is one of the simplest list scheduling algorithms. The list is sorted by static b-level in descending order and starts with only the entry nodes. Each node is scheduled on the processor that allows the earliest start time; after a node has been scheduled, all of its children are added to the list (a sketch of computing the static b-level appears after this list). The complexity of HLFET is $O(v^2)$, where $v$ denotes the number of nodes. [10]

ETF Earliest Time First is quite similar to the HLFET algorithm. The list is also sorted by static b-level in descending order and starts with only the entry nodes. Then, for each node in the list, the earliest start time is calculated for each processor. The node-processor pair with the lowest start time is chosen; ties are broken by static b-level. All the children of the chosen node are added to the list. The complexity of ETF is $O(pv^2)$, where $p$ denotes the number of processors. [12]

MCP Modified Critical Path works by first calculating the ALAP of all nodes. Then, for each node, a sorted list $l(n_i)$ is created containing the node's own ALAP and the ALAPs of all its children. All of these lists are then sorted in ascending order and a node list $L$ is created from that order. This list is iterated over and each node is scheduled on the processor that allows the earliest start time, with insertion. MCP has a complexity of $O(v^2 \log v)$. [13]
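As referenced above, here is a sketch in C of computing the static b-level used by HLFET (and for tie-breaking in ETF), written as a memoized depth-first search. The node representation is illustrative, not the thesis's data structure:

typedef struct node {
    int weight;              /* w(n_i), the computation time            */
    int n_children;
    struct node **children;
    int sbl;                 /* cached static b-level, -1 while unknown */
} node_t;

/* Static b-level: w(n_i) plus the largest static b-level of any child.
   Edge weights are deliberately ignored (definition 3.11). */
int static_b_level(node_t *n) {
    int best = 0, i, child;
    if (n->sbl >= 0) return n->sbl;   /* already computed */
    for (i = 0; i < n->n_children; i++) {
        child = static_b_level(n->children[i]);
        if (child > best) best = child;
    }
    n->sbl = n->weight + best;        /* exit nodes get just w(n_i) */
    return n->sbl;
}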

Kwok and Ahmad have also done a comprehensive benchmark of static scheduling algorithms. They used randomized graphs to test a suite of different DSAs, comparing each DSA with respect to the speedup achieved, how effectively the cores were used and the running time for finding a solution. Of the list scheduling algorithms, the clear winner when taking all these measurements into account was MCP. However, the difference in speedup between the different list scheduling algorithms was very small; it was mostly MCP's low complexity that made it scale better than the other algorithms. ETF found shorter schedules than MCP in some cases, but its complexity of $O(pv^2)$ caused long running times. HLFET had similar speedups compared to ETF and MCP but did not scale as well. [14]

Cluster Scheduling

Another common static scheduling approach is cluster scheduling. In the general case it works by first creating a cluster for each node in the DAG. Then it makes incremental improvements by merging clusters without increasing the total schedule length. It keeps merging clusters until it cannot find any more merges that do not increase the schedule length, see listing 3.5. Since the number of clusters can be larger than the number of available processors, a post-processing step of mapping the clusters to processors is also required. In fact, without the post-processing step this problem is not NP-complete; it is possible to find an unbounded set of clusters that gives an optimal schedule in polynomial time. However, any realization of the scheduling algorithm will need to map the clusters to a bounded set of processors. [11]

Create a set of clusters
Assign each node to a unique cluster
repeat
    choose and merge clusters
    if the schedule length increased after the merge:
        reject the merge
until no valid merge can be found
Map clusters to processors

Listing 3.5: The general cluster scheduling algorithm.

One important concept in cluster scheduling is zeroing edges. If two nodes with a connection between them are scheduled on the same core, the weight of their edge can be zeroed, because there is no communication cost between two operations on the same core. This gives an advantage over list scheduling algorithms such as HLFET and MCP: they can follow the critical path well by prioritizing nodes on ALAP and static b-level, but they do not take into account how the critical path changes if edges are zeroed. [11]

3.6 Related Work

Khaled et al. have written a paper about multi-core simulation using co-simulation FMUs [15]. They used the dependency information given by FMI to build a DAG where each node is an operation and each edge is a data dependency between two nodes. Then they used an offline scheduling heuristic, based on one created by Grandpierre and Sorel, that they call RCOSIM [16]. To solve the problem of FMI not being thread safe, they scheduled all operations of an FMU on the same core. They also did a case study testing this algorithm on a Spark Ignition RENAULT F4RT engine divided into five FMUs. The model was tested on two Intel Xeon processors with 8 cores each running at 3.1 GHz. They got a 10.87 times speedup compared to running the entire model in a monolithic approach. However, they only received about a 1.4 times speedup compared to running the divided model on a single thread.

Another piece of related work is the paper "Acceleration of FMU Co-Simulation On Multi-core Architecture" by Saidi et al. [17]. In this paper they discuss and implement a parallel method to accelerate the simulation of CS FMUs in xMOD. They used the same method as Khaled et al. to build graphs and schedule them using RCOSIM [15]. In their first approach they used estimations of the execution times when scheduling the DAG. Not satisfied with this, they implemented a profiler to get more realistic execution time estimations before running the scheduling heuristic. They also tried to solve restriction 3.1 in two different ways: the first solution used a mutex lock for each FMU; the second modified the scheduling heuristic so that all operations of an FMU were scheduled on the same core. They tested their solution in xMOD with a simulation of a Spark Ignition RENAULT F4RT engine implemented as 5 coupled co-simulation FMUs. They received an approximately 1.3 times speedup with the mutex lock and a 2.4 times speedup with the second solution using 5 cores. They also tried it with more than 5 cores, but the speedup barely changed, most likely due to the fact that they used 5 FMUs.

Saidi et al. have also written a paper on using the RCOSIM approach to parallelize the simulation of co-simulation FMUs [18]. They performed two different graph transformations on the task graph before scheduling it. The first is to allow for so-called multi-rate simulation, which means that each FMU can run with a different step size. The second transformation they performed was to solve the problem of FMI not guaranteeing thread safety: they added edges between operations on the same FMU so that only one operation can run at a time. They then compared RCOSIM and their own version of RCOSIM on a Spark Ignition RENAULT F4RT engine on an Intel Core i7 with eight cores at 2.7 GHz. Their modified approach got about a 2.9 times speedup while RCOSIM only got about 1.9 on four cores.


4 Method

This chapter describes the methodology used during this thesis. It first describes the workflow that was used during the implementation, then shows how the parallel solution was implemented. After that it describes how the MCP and HLFET heuristics were implemented. The next section presents the models used for evaluation, followed by a section that describes how the evaluation of the implementation was done. The last section discusses the technical limitations of this method.

4.1 Workflow

This section describes the workflow that was used during this thesis. Everything starts out with an SSP file. SSP, or System Structure and Parameterization, is a companion standard to FMI; it is a format for describing the coupling and parameterization of several interconnected FMUs. To import the SSP files, the backend of Modelon's product FMI Composer (FMIC) was used. It parses the SSP file and creates a new aggregated FMU from the FMUs and the coupling information described in the SSP file. To create the aggregate, it creates a model description and an aggregate description and includes the runtime etc. To simulate the aggregated FMU, PyFMI was used. PyFMI is a Python package for interacting with FMUs. See figure 4.1 for an overview of the workflow.

Figure 4.1: The workflow that was used during this thesis. [Figure: SSP → Backend (Java) → Aggregated FMU → Runtime (C) → PyFMI → Simulation results.]

4.2 Implementation

This section describes how the backend and runtime were modified to parallelize the simulation of aggregated FMUs using static scheduling. Looking at the workflow in figure 4.1, there were two possible locations where the static scheduling could have been implemented: the runtime or the backend. Implementing the static scheduling in the runtime would add overhead since, unless some type of caching was implemented, the scheduling would be done every time the aggregated FMU was instantiated. Implementing the static scheduling in the backend would only result in overhead during the creation of an aggregated FMU, since the schedule can be saved in the aggregate. The downside of implementing the scheduling in the backend is that the number of cores available during the simulation is unknown.


This is because the computer that creates the aggregate might not be the same computer that runs the simulation. If the scheduling were implemented in the runtime, the scheduling algorithm could always schedule for the correct number of cores. It was nevertheless decided to implement the scheduling in the backend, due to the reduced overhead and simpler runtime binaries.

Runtime

The runtime is the binary that is placed in the aggregated FMU. It will parse the aggregate description, load all FMUs and expose the FMI API to the simulation tools.

[Figure: a six-node task graph over the nodes u[1], u[2], Step[1], Step[2], y[1] and y[2].]

Figure 4.2: Task graph for a step sequence for two FMUs.

Since the scheduling was implemented in the backend, it was necessary to save the resulting schedule in the aggregated FMU. In the previous sequential solution the aggregate description contained several call sequences; it was decided to modify these call sequences to allow for a parallel schedule. In the old implementation the step sequence for the task graph in figure 4.2 looked like:

StepSequence:
    Set( u[1] )
    Set( u[2] )
    DoStep( 1 )
    DoStep( 2 )
    Get( y[1] )
    Get( y[2] )

Listing 4.1: An example of a step sequence in the aggregate description with two FMUs.

In order to parallelize this, it seemed natural to divide the call sequence into several call sequences, one for each core. Dividing the previous step sequence for a parallel solution using two cores would then result in the following.

StepSequence 1:
    Set( u[1] )
    DoStep( 1 )
    Get( y[1] )

StepSequence 2:
    Set( u[2] )
    DoStep( 2 )
    Get( y[2] )

Listing 4.2: A divided step sequence without synchronization.

This corresponds well to the output of a static scheduling algorithm: it contains the execution order for each core. However, it was also necessary to add synchronization between the different cores. Otherwise there is, for example, nothing that guarantees that the Set(u[1]) operation is executed before Get(y[2]), even though there is a dependency between them. To accomplish this, a semaphore structure was implemented using the Windows API. Two new operations were added to the call sequences, Signal and Wait. Both semaphore operations have a semaphore id as a parameter, and the Signal operation also has an increment parameter.


The increment parameter is an integer that will be added to the semaphore counter when the Signal operation is called. Adding synchronization to the previous example would result in:

StepSequence 1:
    Set( u[1] )
    Signal( id = 0, increment = 1 )
    DoStep( 1 )
    Get( y[1] )

StepSequence 2:
    Set( u[2] )
    DoStep( 2 )
    Wait( id = 0 )
    Get( y[2] )

Listing 4.3: A divided step sequence with synchronization.

The Signal operation in step sequence 1 has the same id as the Wait operation in step sequence 2. This ensures that Set(u[1]) is finished before Get(y[2]) is executed. The increment parameter of the Signal is set to one since there is only one corresponding Wait operation; if there had been two Wait operations with id = 0, the increment parameter would have been set to two. The addition of synchronization ensures that all dependencies in a task graph are enforced, as long as the scheduling algorithm inserts the semaphore operations correctly. Even though this example shows a step sequence, the same procedure was used for all the different call sequences. The big difference between the call sequences is how the task graphs are created, which is explained in the next section.

The next step was to execute the divided call sequences in parallel. To accomplish this it was necessary to use either native threads or a parallel framework. Native threads usually give the programmer more control, but this problem is quite coarse grained and there was no need for finer-grained control. It was therefore decided to use OpenMP, the de facto standard for parallel programming in C, whose coarser granularity fits this problem well. More specifically, OpenMP 2.0 was used; although it is 16 years old, the Microsoft Visual Studio compiler does not support any newer version.

To execute the call sequences in parallel an exec_callseq(...) function was implemented. It propagates the operations in the call sequence to the correct FMU. Once the sequential implementation was in place, the parallelization with OpenMP was trivial: it was enough to create a for loop over the call sequences and add an OpenMP directive. Because each call sequence should be executed on its own thread, the parallel for directive was used with a static schedule and a block size of one. Since the exec_callseq(...) function returns a status code it was also necessary to reduce the results to the highest status code. For a pseudo code example of the step sequence, see listing 4.4.

parse step_seqs from description
#pragma omp parallel for num_threads(size(step_seqs)) schedule(static, 1)
for each step_seq in step_seqs
    result = exec_callseq(step_seq, ...)
    reduce result
end for

Listing 4.4: Pseudo code for the execution of the step sequence in parallel.
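For concreteness, the following is a minimal C sketch of this dispatch loop. The call_seq_t type and the exact signature of exec_callseq(...) are assumptions, since only the function's name and its status code return value are given above; and because OpenMP 2.0 offers no max reduction for C, the reduction to the highest status code is done in a critical section here.

#include <omp.h>

typedef struct call_seq call_seq_t;        /* one call sequence per core (assumed type) */
extern int exec_callseq(call_seq_t *seq);  /* returns an FMI status code (assumed signature) */

int run_step_sequences(call_seq_t **seqs, int n)
{
    int status = 0;  /* fmi2OK */
    int i;
    /* One thread per call sequence: schedule(static, 1) with as many
     * threads as sequences pins sequence i to thread i. */
    #pragma omp parallel for num_threads(n) schedule(static, 1)
    for (i = 0; i < n; ++i) {
        int s = exec_callseq(seqs[i]);
        /* Reduce to the highest (most severe) status code. */
        #pragma omp critical
        {
            if (s > status)
                status = s;
        }
    }
    return status;
}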

Backend

The backend imports an SSP file, creates an aggregated FMU and saves it. This is where the scheduling was done. The scheduling algorithm has three inputs: a converter, the number of cores to schedule for, and a task graph. The converter was used to convert the nodes to call sequence operations that can then be exported to the aggregate description. The number of cores is given as a command line argument to the backend.

Figure 4.3: Two coupled FMUs.

The first step was to create the task graphs. In order to accomplish this, two different types of dependencies had to be taken into account: the connections between the FMUs and the internal dependencies in each FMU. The internal dependencies go from an input to an output on the same FMU; the connections go from an output to an input. In figure 4.3 the internal dependencies are depicted as dashed arrows and the connections as solid arrows. Feature 3.1 makes it possible to parse the internal dependencies from each FMU's model description. The connection dependencies were parsed from the SSP file.

Figure 4.4: An example task graph from the FMUs in figure 4.3.

When the variables, dependencies and connections were known it was possible to create the task graphs. One task graph was created for each call sequence. All call sequences except the step sequence were created in a very similar manner; the biggest difference between them was which variables were included when the graph was created. Once the correct variables were chosen, creating the graph was trivial: for each variable a node was created, and edges corresponding to the internal dependencies and connections were added. See figure 4.4 for a simple example task graph of the aggregate in figure 4.3 where all inputs and outputs are included.
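As a sketch of this construction, the following assumes a hypothetical node type; the actual backend data structures are not described in this thesis.

#include <stdlib.h>

/* Hypothetical task graph node: one node per included variable. */
typedef struct node {
    int fmu;                 /* index of the FMU the variable belongs to */
    double weight;           /* estimated execution time, discussed below */
    struct node **children;  /* outgoing edges */
    int n_children;
} node_t;

/* Add a directed edge from parent to child. */
void add_edge(node_t *parent, node_t *child)
{
    parent->children = realloc(parent->children,
                               (parent->n_children + 1) * sizeof *parent->children);
    parent->children[parent->n_children++] = child;
}

/* Internal dependencies go from an input to an output on the same FMU,
 * connections from an output to an input on another FMU; both simply
 * become edges between the corresponding variable nodes:
 *
 *   for each internal dependency (u -> y): add_edge(node(u), node(y))
 *   for each connection (y -> u):          add_edge(node(y), node(u))
 */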

Enter Initialization: This call sequence accomplishes three different steps. The first step is to set all internal inputs that have a starting value; to do this it includes in the task graph all non-constant variables with an exact or approximate initial value. After this it has to call EnterInitializationMode() on all sub FMUs; however, this is not included in the task graph, instead it is added as a post-processing step after the scheduling is done. Once EnterInitializationMode() has been called on each sub FMU, the call sequence has to make sure that each input connected to another FMU is updated, which it does by adding all internal outputs and their corresponding inputs to the task graph. See the state machine in figure 3.7 for more details on which variables are included.

Initialization: This call sequence makes sure that all internal inputs and outputs are updated after any changes to the external outputs that the simulation tool might have made. To accomplish this, the task graph includes all internal outputs and their corresponding inputs.


Continuous: This call sequence is used when the aggregate is in the Continuous time mode state. The task graph includes all continuous outputs and their corresponding inputs.

Event: The event call sequence will be triggered when EnterEventMode() is called. Here the task graph contains both the discrete and continuous outputs and their corresponding inputs.

New Discrete State: This call sequence also updates all variables, both continuous and discrete. The difference between this and the Event sequence is that NewDiscreteState() is called on each sub FMU if any of its discrete inputs has changed. This is however not represented in the task graph; it is added in a post-processing step after the scheduling has been completed.

Figure 4.5: The task graph for the step sequence of the FMUs in figure 4.3.

The step sequence task graph had to be handled differently from all other call sequences. The step sequence was implemented using a Jacobi-type master algorithm: all inputs are delayed one time step, which means that the step sequence can begin by setting all inputs to their corresponding output's value from the previous time step.

All inputs have to be set before DoStep is called on the same FMU, which means that every input has an edge to its FMU's DoStep node. Since the outputs can only be fetched after their FMU's DoStep operation has completed, there is also an edge between the DoStep node and all outputs on the same FMU. Because the internal dependencies go from an input to an output on the same FMU, there was no reason to take them into account in this task graph: all inputs already had to be scheduled before any outputs from the same FMU could be.

There should not have been any reason to take the connections into account either, since the threads are synchronized between each execution of a call sequence. But due to an implementation detail in the runtime it was necessary to add dependencies that are the reverse of the connections. This is because the Set and Get operations that share a connection use the same memory location to store intermediate results. The result of all this can be seen in figure 4.5; the edge from u[1]2 to y[2]2 is there because of their connection.
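The construction can be sketched as follows, reusing the hypothetical node_t and add_edge from above; build_step_graph is an illustrative helper, not a function from the implementation.

/* Build the step sequence task graph: one DoStep node per FMU, plus one
 * node per input and output. */
void build_step_graph(node_t *inputs, int n_in,
                      node_t *outputs, int n_out,
                      node_t *steps /* one DoStep node per FMU */)
{
    /* Every input must be set before its FMU's DoStep is called. */
    for (int i = 0; i < n_in; ++i)
        add_edge(&inputs[i], &steps[inputs[i].fmu]);

    /* Every output can only be fetched after its FMU's DoStep has completed. */
    for (int i = 0; i < n_out; ++i)
        add_edge(&steps[outputs[i].fmu], &outputs[i]);

    /* Reversed connection edges: the Set of an input must precede the Get
     * of the output it is connected to, because both operations share one
     * intermediate memory location:
     *
     *   for each connection (y -> u): add_edge(node(u), node(y))
     */
}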

The last step before scheduling was adding weights to the task graphs. It is very difficult to estimate the weights: the execution time of an FMU operation depends on how the model is designed and how the runtime of the FMU is implemented, and neither factor is known during the scheduling. To get accurate weights, a profiler could be used to measure the execution times. However, there was not enough time to implement an automatic profiler, so each test model was profiled manually. This was done by creating an aggregate scheduled with uniform weights; this aggregate was then simulated with a modified runtime that measured the execution time per operation.

When the average execution times of a model had been measured, the weights were calculated manually. The operations with the lowest average execution time were assigned a weight of one, and all other weights were scaled from that. The execution times of the semaphore operations do not depend on the FMU and were only measured once. When the weights had been calculated it was time for the scheduling, which is described in the next section.
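This scaling can be written compactly as follows (a formalization of the above, where \(\bar{t}(\mathrm{op})\) is taken to denote the measured average execution time of an operation):

\[ w(\mathrm{op}) = \frac{\bar{t}(\mathrm{op})}{\min_{o}\,\bar{t}(o)} \]

so that the cheapest operation receives weight one and all other weights are expressed relative to it.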

4.3 Heuristics

When the estimation of node and edge weights was done, the only part left was the scheduling. Two different list scheduling algorithms were implemented: HLFET and MCP.

Highest Level First with Estimated Times

The first heuristic to be implemented was HLFET. It follows the basic list scheduling pattern discussed in section 3.5, but with some added implementation details. The first step of HLFET is to create a list of nodes sorted by static b-level. To calculate the static b-level, a recursive function was used, see listing 4.5. It traverses the graph in a depth-first manner and sets each exit node's static b-level to its own weight. All other nodes' static b-levels are set to their own weight plus the largest static b-level of their children.

function sb_level(node)
    node.sb_level = w(node)
    for each child in node.children
        node.sb_level = max(node.sb_level, sb_level(child) + w(node))
    end for
    return node.sb_level

for each node in entry_nodes
    sb_level(node)
end for

Listing 4.5: Pseudo code to calculate the static b-level of a task graph.
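As written, the recursion recomputes the static b-level of a node once for every parent it has. A memoized variant (a sketch under the assumption that node_t is extended with a double sb_level field initialized to -1) visits each node only once:

/* Memoized static b-level: exit nodes keep their own weight, all other
 * nodes take their own weight plus the largest b-level of their children. */
double sb_level(node_t *node)
{
    if (node->sb_level >= 0.0)      /* already computed via another parent */
        return node->sb_level;
    node->sb_level = node->weight;
    for (int i = 0; i < node->n_children; ++i) {
        double s = sb_level(node->children[i]) + node->weight;
        if (s > node->sb_level)
            node->sb_level = s;
    }
    return node->sb_level;
}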

When the static b-level had been calculated it was time to implement the actual scheduling algorithm, see listing 4.6. First a list of nodes sorted on the static b-level in descending order was created, and each node in this list was then iterated over. The first step for each node was choosing which core the node should be scheduled on; in HLFET this is done by iterating over the cores and finding the one with the earliest possible start time. In this case, however, there was one big and important exception. Due to restriction 3.1 it is not possible to execute several operations on the same FMU in parallel. To solve this, all operations on a specific FMU were scheduled onto the same core. This was done by keeping a map from FMUs to cores and checking if a node's FMU was already assigned to a specific core, see line 9 in listing 4.6.

The next step after choosing a core was adding synchronization to the node's parents. Since the node cannot be executed before its parents, a Wait operation was added for each parent. However, if the parent was scheduled to the same core as the node it was not necessary to add one, since they are already guaranteed to be executed in order; but it was then necessary to decrement the increment parameter of the parent's corresponding Signal operation. After the wait semaphores were added, the node was converted to one or several operations and added to the core's schedule. If the node had any children, a signal semaphore with increment equal to the number of children was also added after the operation.

(36)

4.3. Heuristics

In the final step all children of the node are updated and added to the sorted list. Since a child is not allowed to be scheduled before the node, its earliest start time is set to the end time of the node plus the edge weight. Since the list is sorted by static b-level it is guaranteed that all parents of a node will be scheduled before the node itself.

 1  calculate static b-level of all nodes
 2  sorted_nodes = entry_nodes sorted on b-level
 3  cores = list of empty cores
 4  fmu_core_map = empty map
 5
 6  for each node in sorted_nodes
 7      chosen_core = empty core
 8
 9      if node.fmu in fmu_core_map
10          chosen_core = fmu_core_map.get(node.fmu)
11      else
12          // Find the earliest possible start time without insertion
13          for each core in cores
14              if core.available < chosen_core.available
15                  chosen_core = core
16              end if
17          end for
18          fmu_core_map.add(node.fmu, chosen_core)
19      end if
20      chosen_core.available = max(chosen_core.available, node.earliest) + w(node)
21
22      // Add wait semaphores for all parents scheduled on different cores
23      for each parent in node.parents
24          if parent.core != chosen_core
25              chosen_core.add(wait operation)
26          else
27              parent.semaphore.increment--
28          end if
29      end for
30
31      // Convert node to operations and add to schedule
32      chosen_core.add(convert(node))
33      chosen_core.add(signal operation(increment = size(node.children)))
34
35      for each child in node.children
36          child.earliest = max(child.earliest, chosen_core.available + c(node, child))
37          if child not in sorted_nodes
38              sorted_nodes.add(child)
39          end if
40      end for
41  end for

Listing 4.6: Pseudo code for the Highest Level First with Estimated Times heuristic.

Modified Critical Path

The second heuristic to be implemented was modified critical path (MCP). MCP is similar to HLFET but has a few important differences. The first step is to calculate the graph's critical path, since it is needed in order to calculate the ALAP time of each node. To calculate the critical path, a recursive function traverses the graph depth first. Each exit node returns its own weight; all other nodes return the largest critical path of their children plus their own node weight. This recursive function is called with the entry nodes as input, and the largest value returned is the critical path; see listing 4.7 for pseudo code.
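The same computation can be sketched in C, reusing the hypothetical node_t from the earlier sketches (this is an illustration of the description above, not a reproduction of listing 4.7):

/* Depth-first critical path: exit nodes return their own weight, all
 * other nodes return the largest critical path among their children
 * plus their own weight. */
double critical_path(node_t *node)
{
    double longest = 0.0;
    for (int i = 0; i < node->n_children; ++i) {
        double cp = critical_path(node->children[i]);
        if (cp > longest)
            longest = cp;
    }
    return longest + node->weight;
}

/* The graph's critical path is the largest value returned when calling
 * critical_path on each entry node. */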
