Mälardalen University Press Licentiate Theses No. 186

**STATIC TIMING ANALYSIS OF PARALLEL**

**SYSTEMS USING ABSTRACT EXECUTION**

**Andreas Gustavsson**

**2014**

School of Innovation, Design and Engineering

Copyright © Andreas Gustavsson, 2014

ISBN 978-91-7485-170-0

ISSN 1651-9256

Printed by Arkitektkopia, Västerås, Sweden

## Abstract

The Power Wall has stopped the past trend of increasing processor throughput by increasing the clock frequency and the instruction-level parallelism. Therefore, the current trend in computer hardware design is to expose explicit parallelism to the software level. This is most often done using multiple processing cores situated on a single processor chip. The cores usually share some resources on the chip, such as some level of cache memory (which means that they also share the interconnect, e.g. a bus, to that memory and all higher levels of memory), and to fully exploit this type of parallel processor chip, programs running on it have to be concurrent. Since multi-core processors are the new standard, even embedded real-time systems will (and some already do) incorporate this kind of processor and concurrent code.

A real-time system is any system whose correctness depends both on its functional and temporal output. For some real-time systems, a failure to meet the temporal requirements can have catastrophic consequences. Therefore, it is of utmost importance that methods to analyze and derive safe estimations on the timing properties of parallel computer systems are developed.

This thesis presents an analysis that derives safe (lower and upper) bounds on the execution time of a given parallel system. The interface to the analysis is a small concurrent programming language, based on communicating and synchronizing threads, that is formally (syntactically and semantically) defined in the thesis. The analysis is based on abstract execution, which is itself based on abstract interpretation techniques that have been commonly used within the field of timing analysis of single-core computer systems, to derive safe timing bounds in an efficient (although over-approximative) way. Basically, abstract execution simulates several real executions of the analyzed program in one go. The thesis also proves the soundness of the presented analysis (i.e. that the estimated timing bounds are indeed safe) and includes some examples, each showing different features or characteristics of the analysis.

## Acknowledgments

I would like to express my deepest gratitude to my advisors, Björn Lisper, Andreas Ermedahl and Jan Gustafsson, for accepting me as a doctoral student and also for their patience and invaluable guidance during my education so far. Without you, this thesis would not exist. A special thanks goes to Vesa Hirvisalo for putting a lot of energy and time into getting acquainted with, and suggesting improvements on, my research. Last, but far from least, I would like to thank everybody with whom I have shared many laughs and experiences during coffee breaks, trips, parties, after-works and other activities. Thank you all!

The research presented in this thesis was funded partly by the Swedish Research Council (Vetenskapsrådet) through the project “Worst-Case Execution Time Analysis of Parallel Systems” and partly by the Swedish Foundation for Strategic Research (SSF) through the project “RALF3 – Software for Embedded High Performance Architectures”.

Andreas Gustavsson

Västerås, October 2014

## Contents

1 Introduction 1

1.1 Real-Time Systems . . . 1

1.2 Execution Time Analysis . . . 3

1.3 Research Questions . . . 7

1.4 Pilot Study . . . 8

1.5 Approach . . . 9

1.6 Contribution . . . 11

1.7 Included Publications . . . 12

1.8 Thesis Outline . . . 13

2 Related Work 15

2.1 Static WCET Analysis . . . 15

2.2 Static WCET Analysis for Multi-Processors . . . 16

2.3 WCET Analysis Using Model Checking . . . 18

2.4 Multi-Core Analyzability . . . 19

3 Preliminaries 21

3.1 Partially Ordered Sets & Complete Lattices . . . 22

3.2 Constructing Complete Lattices . . . 24

3.3 Galois Connections & Galois Insertions . . . 26

3.4 Constructing Galois Connections . . . 30

3.5 Constructing Galois Insertions . . . 37

3.6 The Interval Domain . . . 39

4 PPL: A Concurrent Programming Language 43

4.1 States & Configurations . . . 45

4.2 Semantics . . . 47

4.3 Collecting Semantics . . . 57

5 Abstractly Interpreting PPL 59

5.1 Arithmetical Operators for Intervals . . . 60

5.2 Abstract Register States . . . 60

5.3 Abstract Evaluation of Arithmetic Expressions . . . 63

5.4 Boolean Restriction for Intervals . . . 63

5.5 Abstract Variable States . . . 73

5.6 Abstract Lock States . . . 91

5.7 Abstract Configurations . . . 95

5.8 Abstract Semantics . . . 101

6 Safe Execution Time Analysis by Abstract Execution 155

6.1 Abstract Execution . . . 155

6.2 Execution Time Analysis . . . 177

7 Examples 181

7.1 Communication . . . 181

7.2 Synchronization – Deadlock . . . 186

7.3 Synchronization – Deadline Miss . . . 189

7.4 Parallel Loop . . . 190

8 Conclusions 197

8.1 The Underlying Architecture . . . 197

8.2 Algorithmic Structure & Complexity . . . 198

8.3 Nonterminating Transition Sequences . . . 202

8.4 The Research Questions . . . 203

8.5 Other Applications of the Analysis . . . 204

8.6 Future Work . . . 205

Bibliography 207

A Notation & Nomenclature 221

B List of Assumptions 225

C List of Definitions 227

D List of Figures 229

E List of Tables 231

F List of Algorithms 233

G List of Lemmas 235

H List of Theorems 237

Index 239

### Chapter 1

## Introduction

This chapter starts by introducing the fundamental concepts used within the field of the thesis. It then states the research questions asked, the approach used to answer them, and the resulting contributions of the thesis. This chapter also presents the papers included in the thesis and a pilot study on using model checking for timing analysis of parallel real-time systems.

### 1.1 Real-Time Systems

As computers have become smaller, faster, cheaper and more reliable, their range of use has rapidly increased. Today, virtually every technical item, from wrist watches to airplanes, is computer-controlled. Computers of this type are commonly referred to as *embedded computers* or *embedded systems*; i.e. one or more controller chips with accompanying software are embedded within the product. It has been estimated that over 99 percent of the worldwide production of computer chips is destined for embedded systems [15].

A *real-time system* is often an embedded system for which the timing behavior is of great importance. More formally, the Oxford Dictionary of Computing gives the following definition of a real-time system [54].

“Any system in which the time at which output is produced is significant. This is usually because the input corresponds to some movement in the physical world, and the output has to relate to that same movement. The lag from input time to output time must be sufficiently small for acceptable timeliness.”

The word “timeliness” refers to the total system and can depend on mechanical properties like inertia. One example is the compensation of temporary deviations in the supporting structure (e.g. a twisting frame) when firing a missile, to keep the missile’s exit path constant throughout the process. Another example is firing the airbag in a colliding car. This should not be done too soon, or the airbag will have lost too much pressure upon the human impact, and not too late, or the airbag could cause additional damage upon impact; i.e. the inertia of the human body and the retardation of the colliding car both affect the timeliness of the airbag system. It should thus be apparent that the correctness of a real-time system depends both on the logical result of the performed computations and the time at which the result is produced.

Real-time systems can be divided into two categories: *hard* and *soft* real-time systems. Hard real-time systems are such that failure to produce the computational result within certain timing bounds could have catastrophic consequences. One example of a hard real-time system is the above-mentioned airbag system. Soft real-time systems, on the other hand, can tolerate missing these deadlines to some extent and still function properly. One example of a soft real-time system is a video displaying device. Failing to display a video frame within the given bounds will not be catastrophic, but perhaps annoying to the viewer if it occurs too often. The video will still continue to play, although with reduced display quality.

The ever increasing demand for performance in computer systems has historically been satisfied by increasing the speed (clock frequency) and complexity (e.g. using pipelines and caches) of the processor. It is, however, no longer possible to continue on this path due to the high power consumption and heat dissipation that these techniques incur. Instead, the current trend in computer hardware design is to make parallelism explicitly available to the programmer. This is often done by placing multiple processing cores on the same chip while keeping the complexity of each core relatively low. This strategy helps increase the chip’s throughput (performance) without hitting the Power Wall, since the individual processing cores on the multi-core chip are usually much simpler than a single core implemented on the equivalent chip area [89].

A problem with the multi-core design is that the cores typically share some resources, such as some level of on-chip cache memory. This introduces dependencies and conflicts between the cores; e.g. simultaneous accesses from two or more cores to shared resources will introduce delays for some of the cores. Processor chips of this kind of multi-core architecture are currently being used in real-time systems within, for example, the automotive industry.

To fully utilize the multi-core architecture, algorithms will have to be parallelized over multiple tasks, e.g. threads. This means that the tasks will have to share resources and communicate and synchronize with each other. There already exist software libraries for automatically parallelizing sequential code. One example of such a library, available for C/C++ and Fortran code running on shared-memory machines, is OpenMP [83]. The conclusion is that concurrent software running on parallel hardware is already available today and will probably be the standard way of computing in the future, also for real-time systems.

When proving the correctness of, and/or the schedulability of the tasks in, a real-time system, it is, as far as the author knows, always assumed that safe (i.e. not under-approximated) bounds on the timing behavior of all tasks in the system are known. The timing bounds are, for example, used as input to algorithms that prove or falsify the schedulability of the tasks in the system [5, 34, 70]. Therefore, it is of crucial importance that methods for deriving safe timing bounds for this type of parallel computational system are defined.

This thesis presents a method that derives safe estimates of the timing bounds for parallel systems in which tasks share memory and can execute blocks of code in a mutually exclusive manner. The method mainly targets hard real-time systems. However, it can be applied to any computer system fitting the assumptions made in the upcoming chapters.
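The kind of system the analysis targets, i.e. threads that share memory and execute blocks of code in a mutually exclusive manner, can be sketched as follows. The sketch uses Python rather than the PPL language defined in the thesis, and all names and counts are invented for the illustration:

```python
import threading

# Two threads share the variable counter; a lock ensures that the
# read-modify-write sequence is executed mutually exclusively.
counter = 0
lock = threading.Lock()

def worker(increments):
    global counter
    for _ in range(increments):
        with lock:              # critical section: one thread at a time
            counter += 1

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()                    # synchronization: wait for both threads

assert counter == 2000          # deterministic thanks to the lock
```

Without the lock, the final value of `counter` could depend on the interleaving of the threads; exactly this kind of interaction is what makes both the functional and the timing behavior of parallel systems hard to analyze.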

### 1.2 Execution Time Analysis

A program’s *execution time* (i.e. the amount of time it takes to execute the entire program from its entry point to its exit point) on a given processor is not constant in the general case; the execution time depends on the initial system state. This state includes the input to the program (i.e. the values of its arguments), the hardware state (e.g. cache memory contents) and the state of any other software that is executing on the same hardware. However, for any program and any set of initial states, at least one of the resulting execution times will be equal to the shortest execution time for the given program and set of initial states. The shortest execution time is referred to as the *Best-Case Execution Time* (BCET). Likewise, at least one of the resulting execution times will be equal to the longest execution time for the given program and set of initial states. The longest execution time is referred to as the *Worst-Case Execution Time* (WCET). Note that both the BCET and the WCET could

2 Chapter 1. Introduction

The word “timeliness” refers to the total system and can be dependent on me-chanical properties like inertia. One example is the compensation of temporary deviations in the supporting structure (e.g. a twisting frame) when firing a mis-sile to keep the mismis-sile’s exit path constant throughout the process. Another example is to fire the airbag in a colliding car. This should not be done too soon, or the airbag will have lost too much pressure upon the human impact, and not too late, or the airbag could cause additional damage upon impact; i.e. the inertia of the human body and the retardation of the colliding car both impact on the timeliness of the airbag system. It should thus be apparent that the correctness of a real-time system depends both on the logical result of the performed computations and the time at which the result is produced.

*Real-time systems can be divided into two categories: hard and soft *
real-time systems. Hard real-real-time systems are such that failure to produce the
com-putational result within certain timing bounds could have catastrophic
con-sequences. One example of a hard real-time system is the above-mentioned
airbag system. Soft real-time systems, on the other hand, can tolerate missing
these deadlines to some extent and still function properly. One example of a
soft real-time system is a video displaying device. Missing to display a video
frame within the given bounds will not be catastrophic, but perhaps annoying to
the viewer if it occurs too often. The video will still continue to play, although
with reduced displaying quality.

The ever increasing demand for performance in computer systems has his-torically been satisfied by increasing the speed (clock frequency) and complex-ity (e.g. using pipelines and caches) of the processor. It is however no longer possible to continue on this path due to the high power consumption and heat dissipation that these techniques infer. Instead, the current trend in computer hardware design is to make parallelism explicitly available to the programmer. This is often done by placing multiple processing cores on the same chip while keeping the complexity of each core relatively low. This strategy helps increas-ing the chip’s throughput (performance) without hittincreas-ing the power wall since the individual processing cores on the multi-core chip are usually much simpler than a single core implemented on the equivalent chip area [89].

A problem with the multi-core design is that the cores typically share some
resources, such as some level of on-chip cache memory. This introduces
depen-dencies and conflicts between the cores; e.g. simultaneous accesses from two
or more cores to shared resources will introduce delays for some of the cores.
*Processor chips of this kind of multi-core architecture are currently being used*
in real-time systems within, for example, the automotive industry.

To fully utilize the multi-core architecture, algorithms will have to be

par-1.2 Execution Time Analysis 3

allelized over multiple tasks, e.g. threads. This means that the tasks will have to share resources and communicate and synchronize with each other. There already exist software libraries for explicitly parallelizing sequential code au-tomatically. One example of such a library available for C/C++ and Fortran code running on shared-memory machines is OpenMP [83]. The conclusion is that concurrent software running on parallel hardware is already available today and will probably be the standard way of computing in the future, also for real-time systems.

When proving the correctness of, and/or the schedulability of the tasks in,
*a real-time system, it is, as far as the author knows, always assumed that safe*
(i.e. not under-approximated) bounds on the timing behavior of all tasks in
the system are known. The timing bounds are, for example, used as input to
algorithms that prove or falsify the schedulability of the tasks in the system
[5, 34, 70]. Therefore, it is of crucial importance that methods for deriving
safe timing bounds for this type of parallel computational systems are defined.
This thesis presents a method that derives safe estimates on the timing
bounds for parallel systems in which tasks share memory and can execute
blocks of code in a mutually exclusive manner. The method mainly targets
hard real-time systems. However, it can be applied to any computer system
fitting the assumptions made in the upcoming chapters.

### 1.2 Execution Time Analysis

*A program’s execution time (i.e. the amount of time it takes to execute the*
entire program from its entry point to its exit point) on a given processor is
not constant in the general case; the execution time is dependent on the initial
system state. This state includes the input to the program (i.e. the values of
its arguments), the hardware state (e.g. cache memory contents) and the state
of any other software that is executing on the same hardware. However, for
any program and any set of initial states, at least one of the resulting execution
times will be equal to the shortest execution time for the given program and
set of initial states. The shortest execution time is referred to as the
*Best-Case Execution Time (BCET). Likewise, at least one of the resulting execution*
times will be equal to the longest execution time for the given program and
set of initial states. The longest execution time is referred to as the
*Worst-Case Execution Time (WCET). Note that both the BCET and the WCET could*

possibly be infinite, though.¹

Figure 1.1: Execution time distribution of some program. [113]
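The dependence of the execution time on the initial state, and the meaning of safe bounds, can be illustrated with a toy Python model that counts abstract time units instead of measuring real time; the program, its cost per iteration and the set of initial states are all invented for the example:

```python
# Hypothetical program: the number of loop iterations, and hence the
# execution time (in abstract time units), depends on the input n.
def execution_time(n):
    t = 0
    for _ in range(n):
        t += 1                  # one time unit per iteration
    return t

initial_states = range(3, 11)   # the considered set of initial states
times = [execution_time(n) for n in initial_states]
bcet, wcet = min(times), max(times)   # BCET = 3, WCET = 10 for this set

# Any estimation below the BCET and above the WCET is a safe bound:
lower_bound, upper_bound = 0, 15
assert lower_bound <= bcet <= wcet <= upper_bound
```

Measuring `execution_time` for a few sampled inputs can only yield values between the BCET and the WCET, which is why measurements alone cannot guarantee that the actual bounds are found.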

Figure 1.1 illustrates the relation between the possible execution times a program might have and safe bounds on those execution times: any estimation of the WCET that is greater than the actual WCET is a safe bound on the actual WCET; likewise, any estimation of the BCET that is smaller than the actual BCET is a safe bound on the actual BCET. The figure also shows that measuring the execution time will always give a time between, and including, the BCET and WCET of the considered program. It is thus very difficult to guarantee that the actual BCET and WCET are found by measuring the execution time of the program. This is because a huge number of possible initial system states must typically be considered in the general case.

¹ One example for which both the BCET and WCET of a program are infinite is when the program always enters some nonterminating loop along all possible paths. Another example of an infinite WCET is when a program deadlocks.

When considering simple-enough (most often sequential) hardware, i.e. hardware that is free from *timing anomalies* [72], research on execution time analysis can result in very efficient methods for tight (i.e. not too over-approximate) estimation of the (BCET or) WCET. This is because tight estimations of the best-case and worst-case execution times for each single instruction, or block of instructions, can be derived in isolation from other statements. However, when introducing multi-core architectures with shared memory, the hardware most likely does suffer from timing anomalies, regardless of how simple the processor cores are [2, 72, 97]. Practically, this means that an execution time for a statement that lies in between the statement’s BCET and WCET could result in the global (BCET and) WCET. The consequence is that the only safe option is to take all the possible execution scenarios into account when estimating the global timing bounds.
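A timing anomaly can be made concrete with a deliberately contrived bus-arbitration scheme (invented for this sketch, not taken from any real architecture): a statement finishing in its local best-case time makes the core lose a shared-bus arbitration and wait, so the local best case yields the global worst case:

```python
# Core 2 is assumed to hold the shared bus during the interval [1, 4);
# a request from core 1 inside that window loses arbitration and is
# also beaten by core 2's follow-up transaction, getting the bus only
# at time 8. All numbers are invented for the illustration.
def total_time(stmt_time):
    request = stmt_time         # core 1 requests the bus after its statement
    grant = 8 if 1 <= request < 4 else request
    return grant + 1            # the memory access itself takes 1 unit

assert total_time(1) == 9      # statement's local BEST case: global time 9
assert total_time(4) == 5      # statement's local WORST case: global time 5
```

Hence an analysis that only considers each statement's local worst case would miss the global worst case; all execution scenarios must be explored.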

Today, there exist several algorithms and tools that strive to derive a safe and tight estimate of the WCET of a sequential task targeted for sequential hardware. Some examples of such tools are aiT [30, 113], Bound-T [49, 113], Chronos [65, 113], Heptane [113], OTAWA [8], RapiTime [96, 113], SWEET [27, 113], SymTA/P [113] and TuBound [91, 113]. aiT, Bound-T and RapiTime are commercial tools while the others are primarily research prototypes. aiT, Bound-T, Chronos, Heptane, OTAWA and TuBound are purely static tools, while SWEET and SymTA/P mainly use static WCET analysis techniques but also dynamic techniques to some extent. RapiTime is heavily based on dynamic techniques.

In *dynamic WCET analysis*, measurements of the actual execution time of the software running on the target hardware are performed. This method is not guaranteed to execute the program’s worst-case path, though, which could, for example, include some error-handling routine that is only rarely executed. Thus, the WCET might be gravely under-estimated; i.e. there might exist paths through the code with considerably worse (longer) execution times than the worst execution time detected by the measurements.

In *static WCET analysis*, the program code and the properties of the target hardware are analyzed without actually executing the program. Instead, the analysis is based on the semantics of the programming language constructs used to define the program and a (timing) model of the target hardware. Static methods usually try to find a tight estimation of the WCET, but always safely over-estimate it.

Static WCET analyses are normally split into three subtasks: the *flow analysis* (formerly known as the high-level analysis), which constrains the possible paths through the code; the *processor-behavior analysis* (formerly known as the low-level analysis), which attempts to find safe timing estimates for executions of code sequences based on the considered hardware; and the *calculation*, where the most time-consuming path is found, using information derived in the first two phases. This is illustrated in Figure 1.2.

The flow analysis phase takes as input some form of representation of the analyzed program’s control flow structure (e.g. a Control Flow Graph, CFG [104]), and possibly additional information such as input data ranges and

4 Chapter 1. Introduction 0 Time Probability Lower timing bound BCET Minimal observed execution time Maximal observed execution time WCET Upper timing bound The actual WCET

must be found or upper bounded

Measured execution times Possible execution times Derived bounds on the execution times

Worst-case performance Worst-case guarantee

Figure 1.1: Execution time distribution of some program. [113]

possibly be infinite,1_{though.}

Figure 1.1 illustrates the relation between the possible execution times a program might have, and safe bounds on those execution times: any estima-tion of the WCET that is greater than the actual WCET is a safe bound on the actual WCET; likewise, any estimation of the BCET that is smaller than the actual BCET is a safe bound on the actual BCET. The figure also shows that measuring the execution time will always give a time between, and including, the BCET and WCET of the considered program. It is thus very difficult to guarantee that the actual BCET and WCET are found by measuring the execu-tion time of the program. This is since a huge number of possible initial system states must typically be considered in the general case.

When considering simple-enough (most often sequential) hardware, i.e.
*hardware that is free from timing anomalies [72], research on execution*

*time analysis can result in very efficient methods for tight (i.e. not too*

over-approximate) estimation of the (BCET or) WCET. This is because
tight estimations of the best-case and worst-case execution times for each
single instruction, or a block of instructions, can be derived in isolation from
other statements. However, when introducing multi-core architectures with
1_{One example for which both the BCET and WCET of a program are infinite is when the}

program always enters some nonterminating loop along all possible paths. Another example of an infinite WCET is when a program deadlocks.

1.2 Execution Time Analysis 5

shared memory, the hardware does most likely suffer from timing anomalies regardless of how simple the processor cores are [2, 72, 97]. Practically, this means that an execution time for a statement that lies in-between the state-ment’s BCET and WCET could result in the global (BCET and) WCET. The consequence is that the only safe option is to take all the possible execution scenarios into account when estimating the global timing bounds.

Today, there exist several algorithms and tools that strive to derive a safe and tight estimate of the WCET of a sequential task targeted for sequential hardware. Some examples of such tools are aiT [30, 113], Bound-T [49, 113], Chronos [65, 113], Heptane [113], OTAWA [8], RapiTime [96, 113], SWEET [27, 113], SymTA/P [113] and TuBound [91, 113]. aiT, Bound-T and RapiTime are commercial tools while the others are primarily research prototypes. aiT, Bound-T, Chronos, Heptane, OTAWA and TuBound are purely static tools, while SWEET and SymTA/P mainly use static WCET analysis techniques, but also dynamic techniques to some extent. RapiTime is heavily based on dynamic techniques.

In *dynamic WCET analysis*, measurements of the actual execution time of the software running on the target hardware are performed. This method is not guaranteed to execute the program's worst-case path, though, which could, for example, include some error-handling routine that is only rarely executed. Thus, the WCET might be gravely under-estimated; i.e. there might exist paths through the code with considerably worse (longer) execution times than the worst execution time detected by the measurements.

In *static WCET analysis*, the program code and the properties of the target hardware are analyzed without actually executing the program. Instead, the analysis is based on the semantics of the programming language constructs used to define the program and a (timing) model of the target hardware. Static methods usually try to find a tight estimation of the WCET, but always safely over-estimate it.

Static WCET analyses are normally split into three subtasks: the *flow analysis* (formerly known as the high-level analysis), which constrains the possible paths through the code; the *processor-behavior analysis* (formerly known as the low-level analysis), which attempts to find safe timing estimates for executions of code sequences based on the considered hardware; and the *calculation*, where the most time-consuming path is found, using information derived in the first two phases. This is illustrated in Figure 1.2.

Figure 1.2: The three phases in traditional WCET analysis as commonly implemented in WCET analysis tools. [113]

The flow analysis phase takes as input some form of representation of the analyzed program's control flow structure (e.g. a Control Flow Graph, CFG [104]), and possibly additional information such as input data ranges and bounds on the number of iterations for some loops. The additional information is often either provided manually by annotating the code, or derived using a preceding value analysis which statically finds information about the values of the processor registers and program variables etc. at every program point. The flow analysis outputs constraints on the dynamic behavior of the analyzed program, such as bounds on the number of loop iterations, (in)feasible paths in the control flow structure, dependencies between conditional statements, and which functions may be called.

The processor-behavior analysis phase takes as input the compiled and linked program binary and uses a model of the processor, memory subsystem, buses, all peripherals etc. to derive safe timing information for the execution of the different instructions found in the program binary. The execution time of a given instruction is most often dependent on the occupancy state of the different hardware components; i.e. on the execution context. To derive tight timing information, it is therefore necessary to derive the possible execution contexts for a given instruction; i.e. the possible hardware states in which the instruction can be executed. The processor-behavior analysis outputs such information.

For the calculation phase, there exist several possible strategies for combining the information retrieved from the flow analysis and processor-behavior analysis to derive a safe estimation of the WCET. These strategies are further discussed and referenced in Section 2.1.
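As a simple illustration of one classical calculation strategy, the path-based approach, the following sketch finds the most expensive path through a tiny, hypothetical CFG. The CFG and the per-block costs are made up for illustration (a real tool would obtain local bounds from the processor-behavior analysis and would also have to honor flow-analysis constraints such as loop bounds):

```python
from functools import lru_cache

# Hypothetical acyclic CFG: entry -> cond -> (then | else) -> exit.
# A loop would first be bounded/unrolled using flow-analysis info.
CFG = {
    "entry": ["cond"],
    "cond": ["then", "else"],
    "then": ["exit"],
    "else": ["exit"],
    "exit": [],
}

# Local WCET bound (cycles) per basic block, as a processor-behavior
# analysis might report them (made-up numbers).
COST = {"entry": 2, "cond": 3, "then": 10, "else": 4, "exit": 1}

@lru_cache(maxsize=None)
def wcet(node):
    # Longest (most expensive) path from `node` to the program exit.
    succs = CFG[node]
    return COST[node] + (max(map(wcet, succs)) if succs else 0)

print(wcet("entry"))  # 2 + 3 + max(10, 4) + 1 = 16
```

Other calculation strategies (e.g. IPET, which encodes the same problem as an integer linear program) trade this simple path search for better handling of global flow constraints.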

The traditional three-phase approach assumes that the analyzed program consists of a single flow of control; i.e. is sequential. In a concurrent program, there are several flows of control (commonly referred to as threads or processes), possibly with dependencies among them. Such dependencies typically occur when the threads or processes communicate or synchronize with each other. Thus, it should be obvious that problems such as race conditions, blocking of threads accessing shared resources, and deadlocks can occur. The consequence is that the processor-behavior analysis is no longer compositional, which means that the traditional three-phase approach is not directly applicable when analyzing arbitrary concurrent programs executing on parallel shared-memory architectures.

This thesis presents a static method that derives safe estimations of the BCET and WCET of a concurrent program consisting of dependent threads, for which race conditions, blocking of threads and deadlocks hence can occur. The three traditional analysis phases are combined into one single phase; i.e. the method directly calculates the timing bound estimates while analyzing the semantic behavior of the program, based on a (safe) timing model of the underlying architecture. The definition of the timing model is outside the scope of this thesis, but it is assumed to safely approximate the timing of all possible phenomena, including timing anomalies.

Note that solving the problem of finding the actual WCET in the general case is comparable to solving the halting problem (i.e. determining whether the program will terminate), which is an undecidable problem (cf. [59]). Thus, the space of possible system states that a WCET analysis must search through could be extremely large, or even infinite, in the general case. This means that the analysis itself might not terminate in the general case. Therefore, techniques that increase the probability of, or even better guarantee, analysis termination must be derived. For many of the traditional methods for analyzing sequential programs, termination can be guaranteed using widening/narrowing techniques [81]. These techniques are not directly applicable to the method presented in this thesis, though. Therefore, other techniques will be presented.
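To illustrate the widening idea (not the termination techniques of this thesis), the following minimal sketch analyzes a loop counter in the interval domain. Without widening, the ascending chain (0,0), (0,1), (0,2), … would never stabilize when the loop bound is unknown; widening jumps any unstable bound to infinity, forcing stabilization in finitely many steps:

```python
# Interval widening: force an ascending chain of intervals to
# stabilize in finitely many steps, guaranteeing termination of a
# fixed-point iteration. Minimal sketch for a loop counter.

INF = float("inf")

def join(a, b):
    # Least upper bound of two intervals.
    return (min(a[0], b[0]), max(a[1], b[1]))

def widen(old, new):
    # Any bound that is still moving jumps straight to infinity.
    lo = old[0] if new[0] >= old[0] else -INF
    hi = old[1] if new[1] <= old[1] else INF
    return (lo, hi)

# Analyze:  i = 0; while i < n: i = i + 1   (n unknown)
i = (0, 0)
while True:
    body = (i[0] + 1, i[1] + 1)    # abstract effect of i = i + 1
    new = widen(i, join(i, body))  # widen at the loop head
    if new == i:
        break
    i = new

print(i)  # (0, inf) after only two iterations
```

The price of guaranteed termination is precision: (0, ∞) is safe but loses the information that i never exceeds n; a subsequent narrowing pass can often recover some of it.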

### 1.3 Research Questions

This thesis mainly tries to answer the following questions. The overall question to be answered is Question 1. The other questions concern specific problems arising when analyzing concurrent programs consisting of dependent tasks.



*Question 1: “How can safe and tight bounds on the execution time of a concurrent program consisting of dependent tasks be derived?”*

*Question 2: “How can the timing of synchronizing tasks be safely and tightly estimated?”*

*Question 3: “How can programs suffering from deadlocks and other types of nonterminating programs be handled?”*

*Question 4: “How can the timing of communicating tasks be safely and tightly estimated?”*

### 1.4 Pilot Study

Model checking is a technique for verifying properties of a model of some system. The idea of using model checking to perform WCET analysis has been investigated and shown to be adequate for analyzing parts of a single-core system [24, 52, 78].

Timed automata² can be used to model real-time systems [4]. An automaton can be viewed as a state machine with locations and edges [57]. A state represents certain values of the variables in the system and which location of an automaton is active, while the edges represent the possible transitions from one state to another [57]. (Continuous) time is expressed as a set of real-valued variables called clocks. UPPAAL³ [9, 63, 111] is a tool used to model, simulate and verify networks of timed automata [9, 10, 57].

Preceding the work presented in this thesis, an initial study [42] was performed in which UPPAAL was used to model, and derive high-precision estimates on the timing bounds of, a small parallel real-time system. The paper shows that timing analysis of parallel real-time systems can be performed using the model checking techniques available in, for example, UPPAAL. However, the proposed method (i.e. the way the system was modeled and analyzed) did not scale very well, for example with respect to the number of threads in the analyzed program. Therefore, it was decided not to continue on the pure model checking path (although there might be other ways to model the system that would succeed better).

² The formal syntax and semantics of timed automata can be found in [3] and [57].

³ An introduction to UPPAAL and the formal semantics of networks of timed automata are given in [9] and [57], respectively.


### 1.5 Approach

*Abstract interpretation* [23, 35, 81] is a method for safely approximating the semantics of a program and can be used to obtain a set of possible abstract states for each point in the program. An abstract entity collects, and most often over-approximates, the information given by a set of concrete entities. An entity could for example be the value of a register – which in the abstract domain is often referred to as an abstract value; a collection of such information (e.g. a mapping from register names to their corresponding values) – which is often referred to as a state; or even a transition between states. By collecting the information given by a set of concrete entities into a single abstract entity, an analysis based on the abstract entities (i.e. an analysis based on abstractly interpreting the semantics of a program) can become less complex and more efficient, but might suffer from imprecision, compared to an analysis based on the concrete entities. Note that, in general, some form of abstraction of the concrete semantics has to be made, since the analysis would otherwise become too complex due to the enormous number of entities/states that must be handled.

The concrete semantics of an arbitrary programming language can be abstracted in many different ways. The choice of abstraction is done by defining an *abstract domain*. An abstract domain is essentially the set of all possible abstract states that fit the definition of the domain. A provably safe abstraction is often achieved by establishing a *Galois connection* between a concrete domain, C, and an abstract domain, A, as depicted in Figure 1.3. A Galois connection is basically a pair of functions: the abstraction function, α, and the concretization function, γ. The essence of Galois connections is that an abstraction of a concrete entity always safely approximates the information given by the concrete entity: if an abstraction of a concrete entity within the concrete domain is performed, followed by a concretization of the resulting abstract entity, then the resulting concrete entity will contain at least the information given by the original concrete entity. The details and properties of Galois connections are presented in Section 3.3.

The semantics of a program is basically a set of equations based on concrete states. A solution to these equations can be found by iterating on transitions between states until the least fixed point is found; this solution is often referred to as the *collecting semantics* of the program. Given a safe abstraction of the program semantics, the equations can always be defined and solved in the abstract domain. The resulting abstract solution is a safe approximation of the concrete solution (i.e. of the concrete collecting semantics).


*Figure 1.3: Galois connection, (α, γ), between a concrete (C) and an abstract (A) domain.*

An example of an abstract domain is *Intv*, defined as {[z₁, z₂] | −∞ ≤ z₁ ≤ z₂ ≤ ∞ ∧ z₁, z₂ ∈ ℤ ∪ {−∞, ∞}}; i.e. the set of all integer intervals that “fit inside” [−∞, ∞]. This domain can be used to over-approximate the concrete domain P({z ∈ ℤ ∪ {−∞, ∞} | −∞ ≤ z ≤ ∞}) = P(ℤ ∪ {−∞, ∞}); i.e. the set of all possible sets of integers between (and including) −∞ and ∞. In other words, a set of integers can be approximated using an interval. Note that *Intv* is completely defined, and a Galois connection between *Intv* and the concrete domain mentioned above is established, in Section 3.6.

Assume that the program variable x can have the value v, such that v ∈ {1, 2, 5, 8}, at a given point of the program according to the concrete semantics (i.e. x has four possible values in the given program point). In the abstract domain, the value of x could safely be represented by [1, 8]. This is an over-approximation since turning the abstract value into a set of concrete values yields [1, 8] → {1, 2, 3, 4, 5, 6, 7, 8} ⊇ {1, 2, 5, 8}. It can be noted that [1, 8] is the best (tightest) approximation of the values of x, since [1, 8] is the smallest interval containing all the possible concrete values of x.

*Abstract execution* (AE) [35, 40] was originally designed as a method to derive program flow constraints [113] on imperative sequential programs, like bounds on the number of iterations in loops and infeasible program path constraints. This information can be used by a subsequent execution time (WCET) analysis [113] to compute a safe WCET bound. AE is based on abstract interpretation, and is basically a very context-sensitive value analysis [81, 113] which can be seen as a form of symbolic execution [35] (i.e. sets of possible abstract values for the program variables etc. in the visited program points are found). Note that AE is in fact a technique for iterating on semantic transitions until a fixed point is found; i.e. a technique based on fixed-point iteration. AE is very context-sensitive because the possible states at a specific program point considered in different iterations of the analysis do not necessarily have any obvious correlation to each other (e.g. the derived states of a given program point are not necessarily joined before being used in future iterations). The program is hence executed in the abstract domain; i.e. abstract versions of the program operators are executed and the program variables have abstract values, which thus correspond to sets of concrete values.

The main difference between AE and a traditional value analysis is that in the former, an abstract state is not necessarily calculated for each program point. Instead, the abstract state is propagated on transitions in a way similar to the concrete state for concrete executions of the program. Note that since values are abstracted, a state can propagate to several new states on a single transition, e.g. when both branches of a conditional statement could be taken given the abstract values of the program variables in the current abstract state. Therefore, a worklist algorithm that collects all possible transitions is needed to safely approximate all concrete executions.
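The worklist idea can be sketched as follows for a hypothetical one-conditional program; the abstract input x = [3, 7] makes both branches of the condition possible, so one abstract state splits into two:

```python
# Worklist-based abstract execution: when the abstract value of a
# branch condition cannot be decided, the state propagates along
# both branches. Sketch for
#   if x < 5: y = 1
#   else:     y = 2
# with abstract input x = [3, 7].

worklist = [("if", {"x": (3, 7)})]
final = []

while worklist:
    point, env = worklist.pop()
    if point == "if":
        lo, hi = env["x"]
        if lo < 5:   # the true branch is possible
            worklist.append(("then", {**env, "x": (lo, min(hi, 4))}))
        if hi >= 5:  # the false branch is possible
            worklist.append(("else", {**env, "x": (max(lo, 5), hi)}))
    elif point == "then":
        final.append({**env, "y": (1, 1)})
    elif point == "else":
        final.append({**env, "y": (2, 2)})

print(final)
# Two final abstract states:
#   {'x': (3, 4), 'y': (1, 1)} and {'x': (5, 7), 'y': (2, 2)}
```

Note that the condition also refines x on each branch, which is part of what makes AE more precise than a value analysis that joins all states at each program point.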

There is a risk that AE does not terminate. However, if it terminates then all final states of the concrete executions have been safely approximated [35]. Nontermination can be dealt with by setting a “timeout”, e.g. as an upper limit on the number of abstract transitions.

If timing bounds on the statements of the program are known, then AE is easily extended to calculate BCET and WCET bounds by treating time as a regular program variable that is updated on each state transition – as with all other variables, its set of possible final values is then safely approximated when the algorithm terminates [28].
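Treating time as an interval-valued variable can be sketched as follows; the per-statement timing bounds are made-up cycle counts, not taken from any real timing model:

```python
# Time as an abstract (interval) variable: each executed statement
# adds its [best, worst] execution time bound to `t`, so the final
# values of `t` safely bound the BCET and WCET. Made-up cycle counts.

def add(t, bounds):
    return (t[0] + bounds[0], t[1] + bounds[1])

TIME = {"init": (2, 3), "cond": (1, 2), "then": (5, 9), "else": (3, 4)}

# Abstractly execute:  init; if <undecidable>: then else else
results = []
t = add((0, 0), TIME["init"])
t = add(t, TIME["cond"])
for branch in ("then", "else"):  # both branches are possible
    results.append(add(t, TIME[branch]))

# The timing bounds cover the cheapest and the most expensive
# final state over all abstract executions.
bcet = min(lo for lo, _ in results)
wcet = max(hi for _, hi in results)
print(bcet, wcet)  # 6 14
```

The two final states carry the time intervals [8, 14] (then-branch) and [6, 9] (else-branch), so [6, 14] safely encloses every concrete execution time of this toy program.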

The approach used in this thesis is to statically calculate safe BCET and WCET estimations by abstractly executing the analyzed program using a safe timing model of the underlying architecture. Basically, the only assumption made on the underlying architecture is that it provides (or can simulate) a shared memory address space, which can be used for communication, and shared resources, which can be used for synchronization. One example of such an architecture is a multi-core CPU. Another example is a virtualization environment that runs on top of a distributed system and provides a shared memory view. Yet another example is any real-time operating system, e.g. VxWorks [115].

### 1.6 Contribution


The main contributions of this thesis are the following:

1. PPL: a formally defined, rudimentary, concurrent programming language for real-time systems, including shared memory and synchronization on locks. The semantics of PPL includes timing behavior and is defined based on the familiar notation of operational semantics (cf. [82]).

2. An abstraction of the PPL semantics where values and concrete points in time are abstracted using intervals.

3. A safe timing analysis based on the abstract semantics of PPL. A complete correctness/soundness proof is provided.

### 1.7 Included Publications

This thesis includes the material presented in the following papers. Andreas Gustavsson is the main author of all the listed publications and has alone contributed all the technical material presented in them.

Paper A

*Worst-Case Execution Time Analysis of Parallel Systems*

Andreas Gustavsson.

Presented at the RTiS workshop, 2011 [41].

This paper addresses contribution 1 and presents the first definition of PPL and a very simple (non-generalized) timing model.

Paper B

*Toward Static Timing Analysis of Parallel Software*

Andreas Gustavsson, Jan Gustafsson and Björn Lisper. Presented at the WCET workshop, 2012 [43].

This paper addresses contributions 2 and 3 and presents a work-in-progress timing analysis that can analyze all aspects of PPL, except synchronization. The presented analysis uses abstract execution to derive safe estimations of the BCET and WCET of the analyzed program.


Paper C

*Toward Static Timing Analysis of Parallel Software - Technical Report*

Andreas Gustavsson, Jan Gustafsson and Björn Lisper. Technical report, 2011 [44].

This paper addresses contributions 2 and 3 and is an extended version of Paper B. The paper includes all the mathematical details and a sketch of the correctness/soundness proof.

Paper D

*Timing Analysis of Parallel Software Using Abstract Execution*

Andreas Gustavsson, Jan Gustafsson and Björn Lisper. Presented at the VMCAI conference, 2014 [45].

This paper addresses contributions 1, 2 and 3 and summarizes the work presented in this thesis. It presents a timing analysis that is based on the analysis defined in Papers B and C. The presented analysis derives safe estimations of the BCET and WCET for any program defined using a slightly modified version of PPL as presented in Paper A, given a (safe) timing model of the underlying architecture.

Paper E

*Towards WCET analysis of multicore architectures using UPPAAL*

Andreas Gustavsson, Andreas Ermedahl, Björn Lisper and Paul Pettersson. Presented at the WCET workshop, 2010 [42].

This paper does not address any of the main contributions of this thesis. However, this paper contains the pilot study discussed in Section 1.4.

### 1.8 Thesis Outline

The rest of this thesis is organized as follows.

Chapter 2 presents some research that is closely related to the material presented in this thesis. It also presents a brief introduction to the strategies traditionally used in WCET analysis.

Chapter 3 introduces the reader to the fundamental concepts and theories needed to understand the contents of the following chapters.



Chapter 4 formally defines PPL, a concurrent programming language.

Chapter 5 presents a semi-safe abstraction of the PPL semantics. Note that the abstraction is not safe for arbitrary PPL programs and that special care must be taken if using it (cf. Chapter 6).

Chapter 6 defines a safe timing analysis using abstract execution based on the abstraction made in Chapter 5.

Chapter 7 presents some examples that show how the analysis presented in Chapter 6 handles communication and synchronization in PPL programs.

Chapter 8 discusses the research questions and the analysis presented in Chapter 6. The chapter also gives pointers to future work.

For the reader’s convenience, the following appendices are provided.

Appendix A summarizes the notations and nomenclature used in this thesis.

Appendices B-H present listings of the assumptions, definitions, figures, tables, algorithms, lemmas and theorems defined in this thesis, respectively.

### Chapter 2

## Related Work

WCET-related research started with the introduction of timing schemas by Shaw in 1989 [104]. Shaw presents rules to collapse the CFG (Control Flow Graph) of a program until a final single value represents the WCET. This chapter presents some research related to this thesis and also to the traditional three-phase WCET analysis. Excellent overviews of the WCET research from the years 2000 and 2008 can be found in [92] and [113], respectively.
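Shaw-style timing schemas can be sketched as a recursive collapse of a structured program's syntax tree: sequencing sums costs, a conditional takes the more expensive branch, and a loop multiplies its body cost by a known iteration bound. The node encoding and the atomic costs below are hypothetical, chosen only to illustrate the collapsing rules.

```python
# Sketch of timing schemas: WCET of a structured program computed by
# recursively collapsing its syntax tree. Atomic costs are hypothetical;
# real analyses attach measured or analyzed costs to basic blocks.

def wcet(node):
    kind = node[0]
    if kind == "atom":            # ("atom", cost)
        return node[1]
    if kind == "seq":             # ("seq", s1, s2, ...): sum of parts
        return sum(wcet(s) for s in node[1:])
    if kind == "if":              # ("if", cond_cost, then, else): worst branch
        return node[1] + max(wcet(node[2]), wcet(node[3]))
    if kind == "loop":            # ("loop", bound, body): bound * body cost
        return node[1] * wcet(node[2])
    raise ValueError(f"unknown node kind: {kind}")

prog = ("seq",
        ("atom", 5),
        ("if", 2, ("atom", 10), ("atom", 3)),
        ("loop", 4, ("atom", 6)))
print(wcet(prog))  # 5 + (2 + 10) + 4*6 = 41
```

The simplicity of the schema also shows its main weakness: the `max` over branches ignores path feasibility, which is precisely what the flow analysis discussed below tries to recover.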

### 2.1 Static WCET Analysis

In this thesis, an approach is presented for static analysis of the timing behavior of arbitrary concurrent programs based on threads, shared memory and synchronization on locks, as given by a small concurrent programming language. The field of static WCET analysis has, until recently, mainly focused on sequential programs executing on single-processor systems. This is the kind of research referenced in this section.

In the field of processor-behavior (low-level) analysis, most research efforts have been dedicated to analyzing the effects of different hardware features, including pipelines [26, 47, 68, 105, 110], caches [66, 68, 110, 112], branch predictors [21], and super-scalar CPUs [67, 101].

Within flow (high-level) analysis, most research has been dedicated to loop bound analysis. Flow analysis can also identify infeasible paths, i.e. paths which are executable according to the program control flow graph structure, but not feasible when considering the semantics of the program and the