
Linköpings universitet, SE-581 83 Linköping
Linköping University | Department of Computer Science
Master thesis, 30 ECTS | Datateknik
2016 | LIU-IDA/LITH-EX-A--16/033--SE

Partitioning methodology validation for embedded systems design

Validering av partitioneringsmetodik för design av inbyggda system

Jonas Eriksson

Supervisor: Tommy Färnqvist
Examiner: Ola Leifler


Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its home page: http://www.ep.liu.se/.

Abstract

As modern embedded systems become more sophisticated, the demands on their applications increase significantly. A current trend is to utilize heterogeneous platforms, i.e. platforms consisting of different computational units (e.g. CPU, FPGA or GPU), where different parts of the application can be distributed among the computational units as software and hardware implementations. This technology can improve the application characteristics to meet requirements (e.g. execution time, power consumption and design cost), but it leads to a new challenge: finding the best combination of hardware and software implementations (referred to as a system configuration). The decision whether a part of the application should be implemented in software (e.g. as C code) or hardware (e.g. as VHDL code) affects the entire product life-cycle. Traditionally this decision is made manually by the developers in the early stage of the design phase. However, due to the increasing complexity of the application, the need rises for a systematic process that aids the developers in making these decisions. Prior to this work a methodology called MULTIPAR was designed to address this problem. MULTIPAR applies component- and model-based techniques to design the application, i.e. the application is modeled as a number of interconnected components, where some of the components will be implemented as software and the remaining ones as hardware. To perform the partitioning decisions, i.e. determining for each component whether it should be implemented as software or hardware, MULTIPAR proposes a set of formulas to calculate the properties of the entire system based on the properties of each component working in isolation.

This thesis aims to show to what extent the proposed system formulas are valid. In particular it focuses on validating the formulas that calculate the system response time, system power consumption, system static memory and system FPGA area. The formulas were validated through an industrial case study, where the system properties for different system configurations were both measured and calculated by applying the formulas. The measured and calculated values of the system properties were then compared through a statistical analysis. The case study demonstrated that the system properties can be accurately calculated by applying the system formulas.


Acknowledgments

I would especially like to thank my talented supervisor Gaetana Sapienza for providing me with great help and support throughout the project. I would also like to thank my colleagues at ABB; Tiberiu Seceleanu, Roger Jansson and Nunzio Meli, for providing me with valuable assistance.

Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables

1 Introduction
  1.1 Research problem and research questions
  1.2 Research methodology
  1.3 Research scope
  1.4 Report structure
  1.5 Abbreviations

2 Research context
  2.1 Preliminaries
  2.2 MULTIPAR
  2.3 Measurements
  2.4 System level formulas
  2.5 Related work

3 Evaluation of methods to measure hardware/software EFPs
  3.1 Execution time
  3.2 Power consumption

4 Selection of methods to measure hardware/software EFPs
  4.1 Execution time
  4.2 Power consumption
  4.3 Resource utilization

5 Industrial case study
  5.1 Platform and IDE
  5.2 Measure EFPs on a component level
  5.3 Solutions ranking and system configurations
  5.4 System deployment model
  5.5 System EFP based on formulas
  5.6 Measure EFP on a system level
  5.7 Accuracy calculation of the calculated & measured EFPs

6 Discussion
  6.1 Method selection
  6.2 System formulas
  6.3 Method
  6.4 Work in a wider context

7 Conclusion

Bibliography

List of Figures

2.1 Heterogeneous platform
2.2 Component in isolation and CBA
2.3 Example of a CBA
2.4 Pipelined design
2.5 Different execution types
2.6 System configuration
3.1 Simplified structure of a heterogeneous system powered by a power controller
3.2 External hardware setup
5.1 Simulink model with 7 components
5.2 Figure illustrating the execution flow of the Simulink model in C
5.3 Timestamp method implemented in C
5.4 Simplified schematic of the shunt resistor
5.5 Content from a map file generated by IAR Embedded Workbench
5.6 Elicitation of system configuration to be used in the validation
5.7 Overview of the system deployment model
5.8 Topology of the Simulink model
5.9 Topology of the system after composition of C1, C2, C3 and C4
5.10 Topology of the system after composition of C3,5, C4, C6 and C7
5.11 Execution flow of a system configuration
5.12 Differences between the calculated and measured EFPs
6.1 System response time measured and calculated for each system configuration
6.2 System power consumption measured and calculated for each system configuration
6.3 Example illustrating the sequential and parallel interpretation of the average power consumption
6.4 System static memory measured and calculated for each system configuration
6.5 System FPGA area measured and calculated for each system configuration
A.1 Measured EFPs for all component variants
A.2 Selection of the best component variants (SW and HW) with respect to the measured EFPs
A.3 Calculated SRT for the selected system configurations
A.4 Calculated SPC for the selected system configurations
A.5 Calculated SSM for the selected system configurations
A.6 Calculated SFA for the selected system configurations
A.7 Measured SRT for the selected system configurations
A.8 Measured SPC for the selected system configurations
A.9 Measured SSM for the selected system configurations

List of Tables

1.1 Abbreviations introduced in the thesis
2.1 Decision matrix example
2.2 System response time formulas
3.1 Execution time criteria
3.2 Power consumption criteria
3.3 Power consumption criteria
4.1 Execution time methods for software versus criteria
4.2 Power consumption methods for software versus criteria
4.3 Power consumption methods for hardware versus criteria
5.1 Component variants generated from Simulink
5.2 Measured execution time of components implemented as software
5.3 Measured execution time of components implemented as hardware - measured in microseconds
5.4 Voltage supplied to each power rail - measured in volts
5.5 Different power rails on the ZC702 platform
5.6 Power consumption of the different power rails in idle state - measured in watts
5.7 Power consumed by the different SW component variants
5.8 Power consumed by the different HW component variants
5.9 Memory footprint - measured in bytes
5.10 FPGA area utilization - measured in quantity
5.11 Selected variants for each component

1 Introduction

Today embedded systems are widely used, and we can find an embedded system within almost every electrical product. They exist in cars, microwaves, airplanes, routers, DVD players etc. Embedded systems are becoming more advanced and require more sophisticated applications. A current trend is that embedded applications are deployed on heterogeneous platforms. A heterogeneous platform contains different kinds of computational units, e.g. a central processing unit (CPU) (one or several cores), a field-programmable gate array (FPGA) and a graphics processing unit (GPU). This gives the developers the opportunity to implement and distribute different parts of the application among the computational units as either hardware (HW) or software (SW) in order to meet requirements such as execution time, power consumption, reliability and design cost. To do this, the designers must decide at the design phase which parts of the application should be implemented as SW or HW executable units. This is known as the partitioning problem [61][77][10] and it has an impact on the application performance and quality, the overall development process and the application life-cycle [55]. In particular, the partitioning problem can be defined as "finding those parts of the model best implemented in hardware and those best implemented in software" [55]. This must be done with respect to the requirements of the application and the project constraints. The partitioning problem is considered one of the main challenges when developing embedded systems [55]. Traditionally the partitioning decisions are made manually in the design phase by experts within the field (SW or HW). It is therefore of key importance that the designers make the correct partitioning decisions in the design phase; if this is not the case, the development can be negatively affected by unplanned process interrupts that require redesigns and iterations of the design, which has a negative impact on the overall development life-cycle. As systems get more advanced, the partitioning problem becomes more complex and requires designers to make decisions that are the result of considering many requirements and project constraints. This increases the need for a methodology that aids the designers in partitioning the application in the design phase.

To address this problem a partitioning methodology, MULTIPAR, has recently been designed [54]. MULTIPAR is based on multi-criteria decision analysis (MCDA) techniques, which enable the designers to take many different kinds of requirements and project constraints into consideration when designing an embedded system. MULTIPAR applies a component-based approach when modelling the system architecture; thus the application is seen as a number of interconnected components. Each component is characterized by a set of extra-functional properties (EFPs). A component can be implemented as either SW or HW, i.e. two different variants. Each variant exhibits different values of the EFPs depending on the implementation type (i.e. SW or HW). Examples of EFPs are the execution time, reliability, design cost and power consumption. Under certain constraints [56] that are typical of embedded systems, a set of EFPs related to the entire system can be composed from the EFPs of the interconnected components. MULTIPAR computes these system EFPs based on a set of formulas [56]. A system partitioning solution refers here to one way of implementing the system (e.g. all components as SW or HW). In order to choose the best system partitioning, three main steps are carried out:

1. Select the best SW variant and the best HW variant for each component.

2. List all the possible system partitioning solutions based on the variants selected in the previous step.

3. Select the best system partitioning solution by considering the system EFPs of interest and MCDA. Some of the system EFP values will be calculated using the formulas proposed in [56].

Currently this methodology is in the research stage and it is of interest to validate it. This thesis work aimed to validate a part of the proposed methodology. In particular, the main goal was to validate the proposed system formulas for the execution time, power consumption and resource utilization. These formulas calculate the system EFPs based on the EFPs of the components working in isolation. The outcomes of the formulas are used in MULTIPAR to find the best system partitioning solution.

1.1 Research problem and research questions

1.1.1 Research related concepts

In order to understand the aim of the presented thesis work and the main research questions, some key concepts are described in the following part.

• Component variants

The functionality of a component can be realized, i.e. implemented, as either SW or HW. Multiple implementations of the same type (e.g. SW) can exist that implement the same functionality in different ways (e.g. different sorting algorithms); however, for a given component the exposed interface is the same among all variants. One component variant refers to one implementation of the component.

• Extra-functional properties

Each component is characterized by a set of EFPs. Examples of EFPs are execution time, memory footprint and reliability. Depending on how each component is implemented, i.e. as SW or HW, it will exhibit different values for the different EFPs. For example, a component implemented in HW can, if possible, utilize the parallelism of FPGA designs and therefore decrease the execution time compared to a SW implementation. On the other hand, the power consumption might increase due to that design decision. Deciding whether a component should be implemented as SW or HW is often a trade-off between different EFPs [55]. EFPs can be of different types, and in [53] they are divided into three main categories: runtime, life-cycle and project/business-related EFPs. In MULTIPAR, EFPs are handled at the system level or the component level. Under certain constraints, some of the system EFPs can be analytically derived from the component EFPs [56].


• System configuration

Considering the components and their variants, an application that contains a set of components and a set of bindings connecting them can be deployed in different ways on the platform depending on the chosen component variants. Each such deployment is referred to as a system configuration. For instance, one system configuration is when all components are implemented and deployed as software; another is when all components are implemented and deployed as hardware. In addition there are all the hybrid configurations, i.e. when some components are implemented and deployed as software and some as hardware. For instance, if a system contains 5 components there are in total 2^5 = 32 different system configurations.
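For illustration, the configuration space can be enumerated by encoding each configuration as a bitmask, where bit i decides whether component i is deployed as SW or HW. The following C sketch is illustrative only and not part of MULTIPAR:

    #include <stdio.h>

    #define N_COMPONENTS 5

    int main(void) {
        /* Each of the 2^5 = 32 bitmasks encodes one system configuration:
           bit i = 0 deploys component i+1 as SW, bit i = 1 as HW. */
        for (unsigned cfg = 0; cfg < (1u << N_COMPONENTS); cfg++) {
            printf("configuration %2u:", cfg);
            for (int i = 0; i < N_COMPONENTS; i++)
                printf(" C%d=%s", i + 1, ((cfg >> i) & 1u) ? "HW" : "SW");
            printf("\n");
        }
        return 0;
    }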

1.1.2 Problem definition and research questions

G. Sapienza et al. [56] have proposed a set of formulas to calculate different types of system EFPs. They can be applied to calculate some of the system EFPs when applying MULTIPAR, and in relation to this the main goal of this thesis is:

To validate the proposed system EFP formulas for the runtime category. Specifically, it is of interest to validate the formulas for all of the proposed runtime EFPs, i.e. execution time, power consumption and resource utilization.

In order to achieve the main goal, the following two research questions have been stated:

RQ1: What methods are suitable to measure the runtime EFPs of interest?

There exist several methods that can be used for measuring the component EFPs of particular interest for MULTIPAR; however, not all of them are suitable. As a consequence, the aim is to analyse the current state of the art and practice in order to select the most adequate methods.

RQ2: To what extent can the runtime system EFPs of interest be derived from the EFPs of the components involved?

G. Sapienza et al. [56] have proposed formulas to analytically compute the execution time, power consumption and resource utilization EFPs. However, these formulas have not been validated.

1.2 Research methodology

In order to answer the research questions, the following sections describe the applied research methodology.

1.2.1 Analysis of methods to measure extra-functional properties

An evaluation of state-of-the-art and state-of-the-practice methods to measure EFPs was conducted. This was done for the runtime EFPs, specifically the execution time for software and the power consumption for software and hardware.

1.2.2 Suitability analysis of methods to measure extra-functional properties and selection of the most appropriate ones

For each EFP of interest, a method was selected based on the evaluations and a set of criteria related to that EFP. The criteria for the selection were determined per EFP. For measuring the execution time for hardware, the memory footprint and the FPGA area utilization, appropriate methods were selected based on the available development environments.


1.2.3 Validation through an industrial case study

The formulas of interest were validated through an industrial case study. Specifically, the case study was developed within the context of a European project named CLERECO - Cross-Layer Early Reliability Evaluation for the Computing cOntinuum [12] and it focused on the realization of a controller application for a brushless direct current (BLDC) motor.

1.2.4 Calculation of the theoretical values for the system properties

Based on the measured EFPs for each component and variant, the system EFPs were calculated by applying the proposed system formulas.

1.2.5 Measurement of the actual values for the system properties

In order to validate the formulas, the actual values were derived from the same industrial case study: the same system configurations used in the previous step were deployed on a heterogeneous platform, from which the system EFPs were measured.

1.2.6 Comparison of the measured values with the calculated values

The validation is based on a comparison of the values calculated from the system formulas and the measured values for each system configuration. A statistical analysis was performed on the measured and the calculated values to show the accuracy of the system formulas.

1.3 Research scope

This part describes the scope of this thesis.

1.3.1 Hardware

The heterogeneous platform (also referred to as the platform) under analysis shall consist of a CPU and an FPGA. Components deployed on the CPU and the FPGA should be able to communicate through a shared communication channel.

1.3.2 Execution environment

The application should be deployed on the platform as a standalone application. Applications that are deployed on top of an operating system, e.g. Linux, are not considered in this research. The standalone application should execute uninterrupted (with the exception of user-defined interrupts). Only one standalone application can execute at a time, and only single-core CPUs are considered. In the case where a CPU has multiple cores, the assumption is that only one of them is utilized, i.e. the SW executes on one core only.

1.3.3 Programming languages

When referring to software in this report, only software written in C and C++ is considered. When referring to synthesizable hardware, the hardware description languages (HDLs) are limited to VHDL and Verilog.


1.4 Report structure

The rest of the thesis is structured into six chapters. Chapter 2 describes the theory needed to understand the rest of the thesis, as well as a section discussing related work on methods to measure EFPs. Chapter 3 evaluates different methods to measure runtime EFPs, and chapter 4 selects the most appropriate method for each EFP based on the evaluation; in the latter chapter, the results for RQ1 are acquired. In chapter 5 an industrial case study is performed, from which the results needed to answer RQ2 are gathered. Chapter 6 discusses the results acquired from the industrial case study. Chapter 7 states the conclusions for each research question. At the end of the report an appendix is included with figures showing the results from the measurements.

1.5 Abbreviations


Abbreviation  Meaning

ALF      ARTIST2 Language for WCET Flow Analysis
AXI      Advanced eXtensible Interface
BCET     Best-case execution time
BLDC     Brushless direct current
CBA      Component-based application
CFG      Control flow graph
CLERECO  Cross-Layer Early Reliability Evaluation for the Computing cOntinuum
CPU      Central processing unit
DM       Decision matrix
EFP      Extra-functional property
FF       Flip-flop
FPGA     Field-programmable gate array
GPU      Graphics processing unit
HDL      Hardware description language
HW       Hardware
IDE      Integrated development environment
ILP      Integer linear programming
IRQ      Interrupt request
LUT      Look-up table
MCDA     Multi-criteria decision analysis
NOP      No operation
SFA      System FPGA area
SPC      System power consumption
SRT      System response time
SSM      System static memory
SW       Software
SWEET    SWEdish Execution Time Analysis Tool
TI       Texas Instruments
WCET     Worst-case execution time

2 Research context

This chapter describes the basic concepts related to the research of this thesis.

2.1 Preliminaries

2.1.1 Heterogeneous platforms

To be able to implement and deploy a part of the system as either software or hardware, a so-called heterogeneous platform is required. Such platforms consist of different computational units (e.g. CPU, GPU or FPGA) on which the system can be deployed. In this thesis, platforms that consist of a CPU and an FPGA are considered. On the CPU, the system or a part of it can be implemented and deployed as software (e.g. compiled C or C++ code). On the FPGA, the system or a part of it can be implemented and deployed as hardware (e.g. synthesized from VHDL code). Heterogeneous platforms provide a communication channel between the CPU and the FPGA, which enables communication between the parts of the system deployed on the CPU and on the FPGA. Figure 2.1 shows a simplified architecture of a typical heterogeneous system.

Figure 2.1: Heterogeneous platform

2.1.2 System: Platform & Application

A system is defined such that it consists of a heterogeneous platform and an application. In this thesis we are limited to platforms according to the previous description of a heterogeneous platform. An application can be built as a number of interconnected new and/or existing components, here referred to as a component-based application (CBA). A CBA can be seen as a set of components and bindings. A component implements functionality corresponding to some part of the application. A binding is a link between two different components and it is used for communication (i.e. sending data from one component to another). Depending on the variants that the binding connects, it is realized in different ways. For example, if one variant is implemented as SW and the other as HW, the binding in between can be realized through a shared communication channel on the platform over which the communication is established. If both variants are implemented as SW, the binding could for instance be realized by passing parameters via function calls. Each component requires a set of input signals and provides a set of output signals. The input signals and output signals correspond to the interface of the component.

Each component is implemented and deployed as either SW or HW. If it is deployed as SW, the component is implemented in for example C code and compiled into an executable that is deployed on the CPU. If a component is deployed as HW, it can for example be implemented in VHDL code and synthesized for deployment on the FPGA. The bindings between the components depend on the implementation types. If both components of a binding are implemented as SW, the binding can for example be represented by passing parameters in a function call, where the parameters correspond to the information sent on the binding, or by setting global variables shared by the components. If one of the components is implemented as SW and the other as HW, the binding corresponds to some physical communication channel over which the information is transferred between the components (see Figure 2.1). Representing the bindings can be done in several different ways. [55]

2.1.3 EFPs at component and system level

In this research the EFPs are of interest at both the component and the system level. EFPs at the component level refer to the EFPs exhibited by a component working in isolation. A component that belongs to a CBA but is working in isolation is completely disconnected from the rest of the CBA and thus not affected by the other components. The EFPs at the system level refer to the EFPs exhibited by the CBA, which depend on the EFPs exhibited by each component. However, the system EFPs are not only dependent on the component EFPs; one also has to consider the integration of the components, which affects the EFPs. For example, the execution time at the system level depends on each component's execution time, but when connecting the components an extra communication delay is introduced between the interconnected components that has to be considered when measuring the execution time at the system level. Figure 2.2 shows the structure of a component in isolation and a CBA.

Figure 2.2: Component in isolation and CBA

2.2 MULTIPAR

The objective of MULTIPAR is to provide a partitioning solution for the application, i.e. to decide which parts of the application should be implemented as SW or HW, based on multiple criteria. The application is modelled as a CBA where each component is platform-independent, i.e. the underlying platform is not specified. The CBA can be modelled using different tools, e.g. MathWorks Matlab [37] and Simulink [58]. The EFPs related to the components are the criteria upon which the decisions are made. The MULTIPAR process contains a set of activities that lead to the partitioning of the components into SW or HW. The main activities are described below.

2.2.1 Application modelling and component selection

The first step is to define the application based on application requirements and project constraints. Application requirements are related to what functionality the application should implement. The project constraints can for example concern effort (e.g. cost or time). From the application requirements and project constraints, the application is modelled as a CBA, e.g. using Simulink, with a set of components and a set of bindings connecting the components. The use of a CBA enables reuse of components in the model, so existing components and variants can be reused in this activity. The components of interest will be part of the so-called decision matrix (DM). If already implemented components exist in the library corresponding to some component in the CBA, they will be included in the DM. For one component in the DM, several different implementations (i.e. variants) can be included that exhibit different EFPs but have the same interface. For each component in the CBA where an already implemented component is available in the library, a new component entry is included in the DM. Each component entry specifies the component id (C_ID), implementation id (I_ID) and type (SW/HW). For each non-existing component, two virtual component variants (one SW and one HW) are included in the DM.

As can be seen in the example in Table 2.1, a DM is created based on the CBA shown in Figure 2.3, which consists of four different components. Two of them, C1 and C2, are available in the component library for reuse. The component library contains different implementations of the reused components; e.g. C1 has two software implementations and one hardware implementation, where each implementation exhibits different values for the properties. For the non-existing components in Table 2.1, two virtual implementations for each component are included in the DM, one SW implementation and one HW implementation.

Figure 2.3: Example of a CBA

         C_ID  I_ID  Type  Execution time  Power consumption  Max execution time
Existing C1    C1.1  SW    12us            4mW                5us
               C1.2  SW    7us             3mW
               C1.3  HW    2us             8mW
         C2    C2.1  SW    24us            8mW                25us
               C2.2  HW    10us            18mW
New      C3    C3.1  SW_v  2us             8mW                1us
               C3.2  HW_v  0.5us           2mW
         C4    C4.1  SW_v  50us            20mW               30us
               C4.3  HW_v  25us            40mW

(Execution time and Power consumption are EFPs; Max execution time is an application constraint.)

Table 2.1: Decision matrix example

2.2.2 Define component related extra-functional properties and constraints

EFPs are derived from the application requirements and project constraints and inserted into the DM. Each EFP is prioritized, e.g. by assigning a weight to each property (similar to how it is done by G. Sapienza [55]). If values for the EFPs, i.e. the worst-case execution time (WCET) and power consumption in Table 2.1, are available they are inserted into the DM in this phase; otherwise they have to be estimated by an expert or measured. For existing components residing in the component library, values for the EFPs can be derived. Application and/or project constraints are considered and inserted into the DM. For instance, the maximum execution time is defined in Table 2.1 as an application constraint.


2.2.3 Component variants filtering

In this activity the component variants that do not fulfil the application or project constraints are filtered out. It could be the case that one application constraint limits the power consumption of component Ci and one variant of Ci exceeds this constraint. For example, in Table 2.1 the maximum execution time for C1 is set to 5us. The component variants C1.1 and C1.2 do not fulfil this requirement (12us and 7us) and are thus filtered out from the DM.

2.2.4 Solutions ranking

In this activity the best software variant and the best hardware variant (in the sense of the best trade-off between application and project constraints) are selected from the variants of each component. That means that at the end of this activity one software variant and one hardware variant should be selected as the best candidates for the partitioning. The selection is based on MCDA, which can be defined as "an umbrella term to describe a collection of formal approaches which seek to take explicit account of multiple criteria in helping individuals or groups explore decisions that matter" [8]. Different MCDA methods can be applied in this activity; the best known and simplest one is the weighted sum model (WSM) [65]. As the outcome, for each component the two best ranked variants are selected where they exist: the best ranked SW variant and the best ranked HW variant.
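As an illustration of activities 2.2.3 and 2.2.4, the C sketch below filters the C1 variants from Table 2.1 against the maximum execution time constraint and scores the survivors with a weighted sum. The weights and the normalization reference values are assumed for the example and are not prescribed by MULTIPAR.

    #include <stdio.h>

    typedef struct {
        const char *i_id;
        double exec_time_us;  /* EFP: execution time    */
        double power_mw;      /* EFP: power consumption */
    } Variant;

    /* Activity 2.2.3: filter out variants violating an application
       constraint (here: maximum execution time). */
    static int fulfils_constraint(const Variant *v, double max_time_us) {
        return v->exec_time_us <= max_time_us;
    }

    /* Activity 2.2.4: weighted sum model. EFPs are normalized against
       reference values so they can be summed; weights and references
       are made up for this sketch. Lower score is better. */
    static double wsm_score(const Variant *v) {
        const double w_time = 0.7, w_power = 0.3;             /* assumed */
        const double ref_time_us = 10.0, ref_power_mw = 10.0; /* assumed */
        return w_time * (v->exec_time_us / ref_time_us)
             + w_power * (v->power_mw / ref_power_mw);
    }

    int main(void) {
        /* The three C1 variants from Table 2.1; max execution time 5us. */
        Variant c1[] = { {"C1.1", 12.0, 4.0}, {"C1.2", 7.0, 3.0},
                         {"C1.3", 2.0, 8.0} };
        for (int i = 0; i < 3; i++)
            if (fulfils_constraint(&c1[i], 5.0))
                printf("%s passes the filter, WSM score %.3f\n",
                       c1[i].i_id, wsm_score(&c1[i]));
        return 0;
    }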

2.2.5 System configuration ranking

Based on the best traded-off component variants, the system configurations are ranked using MCDA methods in order to find the best traded-off system configuration. In the worst case, given n components with 2 variants each (one SW and one HW), 2^n system configurations have to be ranked, in a way similar to the activities described above. The main difference is that application and project constraints are considered at the system level and, where possible, system EFPs are derived from the components working in isolation.

2.3 Measurements

In this section the execution time, power consumption and area utilization EFPs are described. Each EFP is divided into two subsections, SW and HW, where each EFP is described with respect to the implementation type.

2.3.1 Execution time

2.3.1.1 Software

The execution time of a given task refers to the time spent by a CPU executing that task. Usually the WCET, especially in real-time systems, is of interest in order to determine the worst-case scenario. X. Li et al. [35] define the WCET as follows: "WCET is defined as the upper bound b on the execution time of a program P on a processor X such that for any input the execution time of P on X is guaranteed to not exceed b". This is vital when developing real-time systems, to be sure that each task meets its deadline. The execution time can be measured either in time or in clock cycles. [76]

Today the most common approach is to determine an upper bound on the execution time instead of calculating the exact WCET [76]. This is because the WCET problem is so complex due to its multiple dependencies. The WCET is not only dependent on the input to the software but also on the underlying hardware. A typical processor consists of several components that make the WCET context dependent, such as memory, cache memory, pipelines and branch prediction. Below, the key factors influencing the execution time are described. [76]

• Memory & cache-memory

As CPU frequencies grew rapidly in the early 21st century, the frequency ratio between the CPU and the memory increased in favour of the CPU. This led to the widespread use of cache memories. The primary memory is large and slow, while the cache memory is smaller but much faster and located close to the CPU, giving it a lower latency than the primary memory. The cache memory is used for temporary storage to gain quicker access during execution. If the next instruction is not stored in the cache memory, it must first be loaded from the primary memory into the cache memory before it can be used by the CPU, which adds extra latency to the total execution. On the other hand, if the instruction is already located in the cache memory it can be fetched directly, and a lower execution time can be expected. The WCET is therefore dependent on the memory. Memory hierarchies can be very complex on modern processors and thus very hard to take into account when determining the WCET. One also has to consider the initial state of the memories and the context, i.e. that some data/instructions used by the software might already be loaded into the cache memory. [76]

• Branch prediction

Some CPUs predict the outcome of a conditional statement, and the next instruction in one of the branches of the predicted outcome is pre-fetched. If the prediction was incorrect, the CPU has to undo all work done from the pre-fetch until the incorrect prediction was detected. This takes extra time and thus increases the total execution time. In addition, branch prediction affects the content of the cache memory, since the pre-fetched instruction is loaded into it; even if the prediction turns out to be incorrect and the fetch is undone, the fetched instruction of the incorrect branch has still affected the content of the cache memory. The extra latency added by branch prediction thus depends both on a) the restoration of the work done after the branch prediction and b) the modification of the cache memory. [59][76]

• Pipeline

Instead of finishing one instruction before executing the next one, pipeline stages utilize parallelism to increase the throughput of the CPU. The behaviour of the pipeline is dependent on the software structure. The processor fetches an instruction before executing it; while the current instruction is executing, the next instruction can be fetched into the pipeline to increase the throughput. However, assume that the next instruction depends on the result of the currently executing instruction. The next instruction then has to wait for the currently executing instruction to write back its result before it can proceed. This causes the pipeline stage of the next instruction to stall, which introduces extra latency. Depending on how the code is written, dependencies between adjacent instructions cause pipeline stalls, and as a result the total execution time is increased. [76]

2.3.1.2 Hardware

Calculating the execution time for HW differs slightly from calculating it for SW. For hardware, the propagation delay and/or the pipeline width have to be considered. The propagation delay refers to the amount of time it takes for a signal to traverse the logic gates. The result should therefore not be read from the output before the propagation time has passed since the data was put on the input ports of the design. This is usually not something the designer has to consider manually when the target FPGA board is known, since today's IDEs, e.g. Vivado, provide timing information. [67]

The execution time for a pipelined design driven by a clock, consisting of a set of data-processing elements (e.g. registers) connected in series, is dependent on the width of the pipeline, see Figure 2.4. For each clock cycle the data is moved from one element to another, and the execution time is therefore given by the width of the design, which corresponds to the longest pipeline path through the HW. This assumes that the clock cycle is longer than the propagation delay from the input of an element to the input of its successor element.

Figure 2.4: Pipelined design
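As an illustrative worked example (the numbers are assumed, not taken from the case study): a pipelined design whose longest path is 5 register stages, driven by a 100 MHz clock (T_clk = 10 ns), has an execution time of

    ExT = pipeline width × T_clk = 5 × 10 ns = 50 ns

provided that the propagation delay between each pair of adjacent registers is below 10 ns.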

2.3.2 Power consumption

It is not uncommon that the terms energy and power are used interchangeably. They have two different meanings, and in this research the power consumption is of interest, i.e. the rate at which a component consumes energy. Power is measured in watts and can be expressed by equation 2.1:

    P = IV    (2.1)

where P is power, V is voltage and I is current.

2.3.2.1 Software

Previous research by V. Tiwari et al. [63] shows that the power consumption depends on the instruction being executed. Different instructions utilize different parts of the processor, and thus different power consumption can be expected during runtime. To accurately calculate the power consumption of software, auxiliary hardware has to be considered; depending on the architecture, this could for instance be communication hardware (e.g. Ethernet), cache memory or an HDD. The power consumption is also dependent on the frequency and the temperature of the platform. [34]

2.3.2.2 Hardware

The total power consumption of the FPGA is the sum of the static and the dynamic power consumption, where the static power is caused by leakage currents inside the transistors and the dynamic power depends on the switching activity of the circuits within the FPGA. The power consumed by the FPGA depends for instance on the frequency, the supplied voltage, the temperature and the toggle rate of the circuit. The toggle rate of the circuits depends on the input to the circuits, and the power consumption is therefore input dependent. [42][26][63]


2.3.3 Resource utilization

Depending on the implementation type (SW or HW), resource utilization refers to different kinds of resources. For SW, the resource utilization corresponds to the amount of memory utilized by the SW, referred to as the memory footprint. The HW resource utilization corresponds to the amount of resources utilized on the FPGA, referred to as the FPGA area utilization.

2.3.3.1 Software

Memory footprint refers to the amount of primary memory the software utilizes during runtime. The memory footprint of a program can be divided into five parts, together known as the memory model [39]. The main segments are described below:

• Text segment

Sometimes referred to as the code segment; it contains the executable instructions. It is usually read-only to prevent write operations from modifying the code. The text segment is preferably placed at a high address, above the stack, or at a low address, below the heap, to prevent heap and stack overflows from modifying the code (assuming it is not set to read-only). [39][38]

• Initialized data segment

This is where all global, static, constant and external variables that are initialized at compile time reside. This segment can be divided into two smaller segments: a read-only and a read-write segment. This is because some variables are declared as constant and will not change during run-time; these reside in the read-only segment, while variables whose values may be overwritten during run-time are placed in the read-write segment. [39][38]

• Uninitialized data segment

In this part of the memory, all static and global variables that are not initialized reside. If a variable is declared but not defined, it will be set to 0 by default before execution of the software. [39][38]

• The stack

The stack is used to store data with a limited lifetime, i.e. temporary data. The stack size increases and decreases during run-time depending on the behaviour of the software, and the stack is crucial for subroutines and interrupts to work properly. All local variables within a subroutine, along with the parameters to the subroutine, are pushed onto the stack whenever the subroutine is called. The stack also stores the address in memory where execution should continue when the subroutine returns, which is vital for the execution flow to continue correctly. Depending on the system, the stack might reside at a lower address and grow upwards, or vice versa. The size of the stack depends on the execution path of the software; it is thus dynamic, and the worst-case usage is harder to determine. [39][38][30][29]

• The heap

This is the segment in memory where dynamic allocation takes place. In C, this is done for instance by using the functions malloc and calloc. The allocated memory resides in the heap until the software releases it by using the function free. This means that even if memory is allocated within a subroutine, it will still reside in the heap when the subroutine returns (assuming free is not used). As for the stack, the heap is either located at a low address and grows upwards or vice versa, depending on the architecture. The size of the heap depends on the execution path of the software, which makes the worst-case heap size during execution harder to determine. [39][38]

To calculate the exact memory footprint, all of the aforementioned memory segments have to be considered. However, this can be troublesome due to the possibly non-deterministic behaviour of the stack and the heap. Determining the static memory footprint is much simpler (compared to the dynamic memory footprint), since the static segments are available at compile time. The dynamic memory footprint is more complex because its behaviour is execution dependent and can thus only be calculated/estimated, if at all possible, during runtime.
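As an illustrative C fragment (not from the case study), the comments below indicate in which of the segments described above each object resides:

    #include <stdlib.h>

    const int table[4] = {1, 2, 3, 4}; /* initialized data, read-only part  */
    int counter = 42;                  /* initialized data, read-write part */
    static int buffer[256];            /* uninitialized data segment        */

    int accumulate(int x)              /* the code itself: text segment     */
    {
        int local = x + counter + buffer[0]; /* local variable: the stack   */
        int *dyn = malloc(sizeof *dyn);      /* allocated block: the heap   */
        if (dyn != NULL) {
            *dyn = local;
            local = *dyn;
            free(dyn); /* without free(), the block would remain on the
                          heap after the subroutine returns */
        }
        return local + table[0];
    }

    int main(void)
    {
        (void)accumulate(1); /* trivial use to make the example runnable */
        return 0;
    }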

2.3.3.2 Hardware

FPGAs have a limited number of reprogrammable resources. The FPGA area utilization refers to the amount of resources utilized on the FPGA chip. A resource can for example be a look-up table (LUT), a flip-flop (FF) or a slice. A slice is a logic block that contains a set of LUTs, FFs and multiplexers; the number of LUTs, FFs and multiplexers in a slice depends on the FPGA brand. When measuring the FPGA area utilization, one of the resource types has to be selected, since for example FFs and LUTs are two different kinds of resources and cannot be compared. [32]

2.4 System level formulas

In MULTIPAR some of the system EFPs can be derived by applying a set of formulas that utilize the EFPs of the component variants [56]. In this research, the system formulas to calculate the total execution time of the system (system response time), the power consumption of the system (system power consumption), the memory utilized by the system (system static memory) and the FPGA utilization of the system (system FPGA area) are analysed.

2.4.1 System response time (SRT)

G. Sapienza et al. [56] define the SRT as "the time needed for a signal coming from the input to produce a change at the system output", see Figure 2.6. The SRT depends on a) the execution time of each component in the system, b) the way the components are executed (sequentially or in parallel, see Figure 2.5) and c) the binding types between the components.

Consider Figure 2.6, which shows an example system configuration with 4 different components, implemented partly in SW and partly in HW. This means that there are different binding types between the components. For example, C1 to C2 (SW-HW) and C1 to C3 (SW-SW) have different binding types, and therefore different communication times can be expected. Components C1 and C2 are executed sequentially, i.e. C2 depends on C1's output and its execution can therefore not begin before the execution of C1 is completed. C2 and C3 are considered to be executed in parallel.

Execution order   Binding type   Formula

Sequential        SW-SW          RT_SW-SW = ExT_Ca + ExT_Cb + CT_Ca,Cb
Sequential        SW-HW          RT_SW-HW = ExT_Ca + ExT_Cb + CT_Ca,Cb
Sequential        HW-SW          RT_HW-SW = ExT_Ca + ExT_Cb + CT_Ca,Cb
Sequential        HW-HW          RT_HW-HW = ExT_Ca + ExT_Cb + CT_Ca,Cb
Parallel          SW-SW          RT_SW-SW = ExT_Ca + ExT_Cb + CT_Ca,Cb
Parallel          SW-HW          RT_SW-HW = max(ExT_Ca, ExT_Cb) + CT_Ca,Cb
Parallel          HW-SW          RT_HW-SW = max(ExT_Ca, ExT_Cb) + CT_Ca,Cb
Parallel          HW-HW          RT_HW-HW = max(ExT_Ca, ExT_Cb) + CT_Ca,Cb

Table 2.2: System response time formulas

Table 2.2 shows the formulas to be applied when calculating the SRT, depending on the execution order and binding type, where ExT_Ca is the execution time of component Ca and CT_Ca,Cb is the communication time between components Ca and Cb. To calculate the SRT, the formulas in Table 2.2 are applied to each connected component pair (e.g. Ca and Cb in Figure 2.5) so that the pair can be considered as one component. The formulas are applied recursively until all components are composed into one component that corresponds to the entire system. To calculate the SRT in the example shown in Figure 2.6, one approach would be to first calculate the response time (RT) of C2 and C3 by applying the parallel SW-HW formula, composing C2 and C3 into one component whose execution time is updated according to the formula. After the composition of C2 and C3 the system consists of three components executed in sequential order; the sequential SW-HW and HW-SW formulas are then applied to compose them into one component whose response time corresponds to the SRT.

In Table 2.2 the formula for the parallel execution order with the SW-SW binding type differs from the other parallel formulas. As previously stated, this research is limited to one SW component executing at a time; even if two SW components are considered parallel in the CBA, they will still be executed sequentially on the CPU.
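A minimal C sketch of the recursive pairwise composition described above; the execution times and communication times are assumed numbers, and the topology loosely follows Figure 2.6 (C2 and C3 in parallel, the rest sequential):

    #include <stdio.h>

    typedef enum { SW, HW } Impl;

    typedef struct {
        double exec_time_us; /* ExT of the (possibly composed) component */
        Impl impl;
    } Comp;

    static double max2(double a, double b) { return a > b ? a : b; }

    /* Compose two connected components into one, following Table 2.2.
       ct_us is the communication time CT_Ca,Cb of the binding. */
    static Comp compose(Comp a, Comp b, int parallel, double ct_us) {
        Comp r;
        /* Parallel SW-SW still executes sequentially on the single core. */
        if (parallel && !(a.impl == SW && b.impl == SW))
            r.exec_time_us = max2(a.exec_time_us, b.exec_time_us) + ct_us;
        else
            r.exec_time_us = a.exec_time_us + b.exec_time_us + ct_us;
        /* The type of the composite is a modelling choice; here we
           arbitrarily keep the first component's type. */
        r.impl = a.impl;
        return r;
    }

    int main(void) {
        Comp c1 = {4.0, SW}, c2 = {2.0, HW}, c3 = {3.0, SW}, c4 = {5.0, HW};
        Comp c23  = compose(c2, c3, 1, 0.5);   /* parallel composition   */
        Comp c123 = compose(c1, c23, 0, 0.5);  /* sequential composition */
        Comp srt  = compose(c123, c4, 0, 0.5); /* sequential composition */
        printf("SRT = %.1f us\n", srt.exec_time_us);
        return 0;
    }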

2.4.2 System power consumption (SPC)

To compute the total SPC, the power consumption of each component plus additional power consumption related to the platform and the system application has to be considered. An example of additional power consumption is the static power consumed by the FPGA card, which is present even without any design programmed on the FPGA. To compute the SPC, formula 2.2 is proposed [56]:

    SPC = Σ_{i=1..n} CPC_Ci + EP    (2.2)

where CPC_Ci is the power consumption of the i-th component, EP is the extra power consumption and n is the total number of components.

Figure 2.6: System configuration

2.4.3 System static memory (SSM)

This system EFP is only relevant for the components that are implemented as software. To compute the SSM, the static memory of each component and the additional system-related static memory have to be considered. The extra static memory is the additional memory used when deploying the system, for example for the configuration of a communication channel or the instructions needed to pass parameters between components. To compute the SSM, formula 2.3 is proposed [56]:

    SSM = Σ_{i=1..n} CSM_Ci + EM    (2.3)

where CSM_Ci is the static memory usage of the i-th component, EM is the extra memory used by the system and n is the total number of components.

2.4.4 System FPGA area (SFA)

As for the SSM, the SFA is only relevant for the components of the system implemented as hardware. The SFA depends on how much area each component utilizes plus the additional area required by the system. An example of extra area is the logic required to establish a communication link between different components. To compute the SFA, formula 2.4 is proposed [56]:

    SFA = Σ_{i=1..n} CFA_Ci + EA    (2.4)

where CFA_Ci is the FPGA area utilization of the i-th component, EA is the extra FPGA area required by the system and n is the total number of components.
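Formulas 2.2-2.4 share the same additive shape, so a single helper suffices to illustrate them in C; the values in the example are made up:

    #include <stdio.h>

    /* Additive system EFP formulas (2.2)-(2.4): sum the component EFPs
       and add the system-level extra term (EP, EM or EA). */
    static double system_efp(const double comp_efp[], int n, double extra) {
        double sum = extra;
        for (int i = 0; i < n; i++)
            sum += comp_efp[i];
        return sum;
    }

    int main(void) {
        /* Made-up values: three components consuming 0.10, 0.25 and
           0.05 W, with 0.20 W of platform overhead (EP). */
        double cpc[] = {0.10, 0.25, 0.05};
        printf("SPC = %.2f W\n", system_efp(cpc, 3, 0.20)); /* 0.60 W */
        return 0;
    }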


2.5 Related work

This section presents the related work, specifically work on methods to measure the aforementioned EFPs of interest. The section is divided as follows.

2.5.1 Methods to measure EFP

In this thesis, different methods to measure EFPs are of interest. Prior to this work, a large body of research already existed on methods to measure different EFPs.

2.5.2 Execution time

2.5.2.1 Software

The WCET problem has been widely studied and several methods to estimate the WCET have been proposed. A survey conducted in 2008 describes different tools to estimate the WCET [76]. In general, there are three types of methods to calculate the execution time: static analysis, measurement-based methods and a hybrid version of the two that estimates the WCET.

The first type of method is based on static analysis, which statically analyses the code, source code (e.g. C code) or executable, and from that determines an upper bound or estimation of the WCET. The analysis is usually performed together with a model of the target architecture. This is used to model architectural effects, e.g. cache misses and pipeline stalls, to increase the accuracy of the estimation. AbsInt (aiT) [2] and Bound-T [9] are two commercial tools used by industry [76]; aiT has been used by Airbus and DAIMLER [3], while Bound-T was traditionally used in the spacecraft domain [76]. They work in a similar fashion: a control flow graph (CFG) is created and the WCET is estimated based on the CFG together with a model of the architecture.

Different research groups at universities around the world perform research on the WCET problem. Some examples of tools developed by research groups are SWEET [74], developed by the WCET group at Mälardalen University, HEPTANE [24], developed by the IRISA project-team, and Chronos [35], developed at the National University of Singapore. The estimation approach is similar to that of the commercial tools: a static analysis is performed together with some kind of model of the architecture to calculate an upper bound on the WCET. The model used by SWEET is slightly different and simplified compared to the other tools, e.g. HEPTANE and Chronos: it describes the number of clock cycles needed by each instruction, while HEPTANE and Chronos model the architecture to some extent.

The second type of method is measurement-based, i.e. the WCET is estimated by collecting execution traces from the software during run-time. A rudimentary way to do this is described by Rapita Systems [75] and A. Roscoe et al. [50], where the average execution time is calculated; it can also be done in a similar way by storing the worst measured execution time. As described in [17], Embedded Coder instruments the code with execution traces and automatically calculates the WCET and average execution time based on the execution traces from several runs. This assumes that the developers are working with a model-based design in Simulink and that C code is generated using the built-in tool Embedded Coder. Another approach, described in a survey by J. Tong et al. [64], determines the execution time by monitoring dedicated hardware counters.
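As a sketch of the measurement-based approach (illustrative only; read_cycle_counter and component_under_test are assumed platform-specific functions, and the clock frequency is made up):

    #include <stdint.h>
    #include <stdio.h>

    extern uint64_t read_cycle_counter(void); /* hypothetical timer read */
    extern void component_under_test(void);   /* hypothetical component  */

    #define RUNS   1000
    #define CPU_HZ 100000000.0 /* assumed 100 MHz CPU clock */

    int main(void) {
        uint64_t worst = 0, total = 0;
        for (int i = 0; i < RUNS; i++) {
            uint64_t t0 = read_cycle_counter();
            component_under_test();
            uint64_t dt = read_cycle_counter() - t0;
            total += dt;
            if (dt > worst)
                worst = dt;
        }
        /* The worst observed time is only a lower bound on the true WCET. */
        printf("avg %.2f us, worst observed %.2f us\n",
               1e6 * ((double)total / RUNS) / CPU_HZ,
               1e6 * (double)worst / CPU_HZ);
        return 0;
    }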

The third type of method combines static analysis with measurements and is therefore also called a hybrid method. An example of this approach is RapiTime [45], developed by Rapita Systems, a commercial tool that utilizes static analysis in combination with measurements. Measurements are captured for each basic block in a CFG, and the static analysis determines the worst-case path through the software based on the timing information for each basic block in the CFG.

2.5.2.2 Hardware

For measuring the execution time, the propagation delay can be derived by using the IDEs provided by Xilinx [72] and Altera [62], which calculate the timing delays of a design. To identify the pipeline width of a design, different simulators can be used, e.g. ModelSim [40], Vivado Simulator [72] and the Quartus simulator [57], to simulate the behaviour of the design and derive the execution time.

2.5.3 Power consumption

When measuring the power consumption, the options differ depending on what measurement capabilities the target platform offers. As described by V. Tiwari et al. [63], the power can be measured with high resolution, i.e. at the clock cycle level, by measuring the current drawn by the processor. The power consumption can also be monitored using, for example, an oscilloscope connected to the board, as done by J. Russel et al. [52]. This method assumes that the target platform provides measurement points corresponding to the part of the platform that is of interest.

PowerScope, introduced by J. Flinn et al. [21], is an energy profiling tool for mobile applications. The system's consumption is measured and collected during run-time and analysed in a later stage, when the sampling is done, where the power consumption is computed for each process in the running system. The measurements themselves are collected in a way similar to the methods above.

Some power measurements can be done with external hardware and ad-hoc tools on the host machine, for example by using IAR I-jet [27] with IAR Embedded Workbench [28], where the I-jet supplies the target platform with power during run-time, measures the power consumed by the board and sends the data to the host machine for further analysis. Lauterbach [33] provides external hardware and tools to measure the power consumption in a similar way. Texas Instruments (TI) manufactures power controllers that distribute power to different parts of a chip. Via the I2C bus the power consumption can be monitored for different rails, either by creating a communication component or by using a TI USB adapter; this method is described by A. Farhadi et al. [6] and is feasible for both hardware and software measurements.
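
As an illustration of the I2C-based rail monitoring, the sketch below polls a power-monitor register from a Linux host using the i2c-dev interface. The bus path, device address and register layout are assumptions for the example (loosely modelled on an INA226-style monitor); the scaling from the raw register value to watts depends on the device's calibration.

/* Polling a power-monitor register over I2C from a Linux host (sketch).
   Bus path, device address and register layout are assumptions. */
#include <fcntl.h>
#include <linux/i2c-dev.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define I2C_BUS   "/dev/i2c-1"  /* hypothetical bus device             */
#define DEV_ADDR  0x40          /* hypothetical 7-bit monitor address  */
#define REG_POWER 0x03          /* power register, INA226-style layout */

int main(void)
{
    int fd = open(I2C_BUS, O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    /* Select the monitor as the active slave on the bus. */
    if (ioctl(fd, I2C_SLAVE, DEV_ADDR) < 0) { perror("ioctl"); return 1; }

    /* Write the register pointer, then read the 16-bit big-endian value. */
    uint8_t reg = REG_POWER, buf[2];
    if (write(fd, &reg, 1) != 1 || read(fd, buf, 2) != 2) {
        perror("i2c transfer");
        return 1;
    }
    uint16_t raw = (uint16_t)((buf[0] << 8) | buf[1]);

    /* Conversion to watts depends on the device calibration (assumed). */
    printf("raw power register: 0x%04x\n", raw);
    close(fd);
    return 0;
}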

Major chip producers, such as Xilinx and Altera, provide tools for estimating the power consumption. The estimation is based on the VHDL design, and the user can provide the tool with extra information, such as simulation data, to improve the accuracy. Xilinx [41] and Altera [43] implement power estimators as part of their development environments. This method is used for estimating the consumption of an HDL design.

2.5.4 Resource utilization

2.5.4.1 Software

Profiling the memory can be done in different ways. Two examples are the Valgrind [69] and Rational Purify [68] tools. These can for example be used to detect use of a released (freed) variable, memory leaks and use of uninitialized memory, and are primarily used to detect run-time errors.
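The classes of run-time errors these tools detect can be illustrated with a deliberately faulty C fragment; running a build of it under, for example, Valgrind's default memcheck tool reports each of the three defects:

/* Deliberately faulty fragment: each defect below is of a kind that
   memory profilers such as Valgrind are designed to report. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int *p = malloc(4 * sizeof *p);
    int  u;                      /* never initialized */

    p[0] = 1;
    free(p);
    printf("%d\n", p[0]);        /* use after free: invalid read */

    if (u > 0)                   /* branch on uninitialized memory */
        puts("positive");

    (void)malloc(16);            /* never freed: memory leak */
    return 0;
}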

Some integrated development environments (IDEs) produce map-files during the compile and linking stages. A map-file contains information about where the code and data are located, from which the size of the static memory can be calculated. IDEs such as IAR Embedded Workbench [28], Microsoft Visual Studio [71] and Xilinx SDK [79] create such files for debug purposes.

2.5.4.2 Hardware

Measuring the FPGA area utilization can be done by using IDEs when synthesizing a design into a hardware representation. This can for instance be done by using the utilization reports produced during synthesis by Vivado [72] or Altera Cyclone [14].


3 Evaluation of methods to measure hardware/software EFPs

As one of the main goals of this work, an evaluation was performed to find possible ways to measure the different EFPs for software and hardware. Tools and methods were identified by using search engines, e.g. Google Scholar and Google, and based on experiences from senior researchers and practitioners in both industry and academia.

The aim of the evaluation is to find the methods that are most suitable for measuring the execution time and power consumption. For the execution time, only the methods applicable to software will be evaluated; no evaluation will be performed of methods to measure the execution time for HW or the resource utilization for SW and HW. Methods for measuring the power consumption will be evaluated for both software and hardware. Each method in the evaluation will be assessed according to a set of criteria that have been specified in relation to their applicability for partitioning purposes. Methods to measure the execution time for hardware and the resource utilization for HW and SW will instead be selected directly in section 4.

For each evaluated method a set of criteria was defined, based on the needs of this project to measure a component (SW or HW) in isolation.

3.1 Execution time

3.1.1 Software

To find the different methods, a study of the current state of the art and state of the practice was conducted. Different search engines were used to find research papers, tools and state of the practice, and experiences from professors and experts working within the field were gathered. To be able to compare the execution time methods, a set of criteria was defined, shown in Table 3.1.

Applicability: The method shall be applicable to a component in isolation.
Novice skill: The method shall be simple to apply; no expert knowledge about the method or some related technique should be needed.
Effortless: The method shall not require any significant extra effort. E.g. compiling and linking is not considered to be extra effort since it is part of the normal work flow when developing.
Documentation: The method shall provide documentation on how to apply the method and technical details that can be used for analysis.
Hardware considered: The method shall take the underlying hardware into consideration when measuring/estimating the execution time. The provided value should be dependent on the underlying hardware.
Uninterrupted environment: The method shall be able to measure/estimate the execution time as if the component is uninterrupted during run-time.
Windows support: The method shall support usage in Windows.
Linux support: The method shall support usage in Linux.
Non-commercial: The method shall be free.

Table 3.1: Execution time criteria

3.1.1.1 Method 1 - AbsInt (aiT)

The aiT is a commercial tool used by several companies, e.g. Airbus, NASA and DAIMLER [2]. The tool works on executables, meaning that it supports all programming languages as long as they can be compiled for the target hardware. The tool calculates an upper bound of the WCET in several steps:

• In the first step the tool creates a CFG [76] from the provided executable. To do this, the tool requires some knowledge about the targeted hardware in order to identify the instructions in the binary code.

• A value analysis is performed to determine the effective memory addresses of data, which is useful when performing the cache analysis. By analysing the registers the tool can predict some of the memory accesses. This analysis is also used to determine upper loop bounds.

• A cache analysis classifies the memory accesses into different categories in order to determine whether the data is located in the cache or not. This classification is based on the value analysis performed prior to the cache analysis. The result of the cache analysis is used by the next step, the pipeline analysis, to predict the pipeline stalls that occur due to cache misses.

• A pipeline analysis is performed to model the behaviour of the sequential execution of instructions. It estimates the execution time of a basic block, i.e. a set of instructions, in the CFG based on the processor's pipeline behaviour. The pipeline analysis takes both in-order and out-of-order execution into consideration.

• The final step is to perform the path analysis, which determines the worst-case execution path based on the CFG and the timing information from the basic blocks (a common formulation of this step is sketched after the list).
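
Path analysis of this kind is commonly formulated as an implicit path enumeration technique (IPET) integer linear program. The following is a sketch of the standard textbook formulation (the notation is assumed here, not taken from the aiT documentation):

\[
\widehat{\mathrm{WCET}} = \max \sum_{b \in B} c_b\, x_b
\]

subject to the structural constraints

\[
x_b = \sum_{e \in \mathrm{in}(b)} y_e = \sum_{e \in \mathrm{out}(b)} y_e \quad \text{for all } b \in B,
\qquad x_{\mathrm{entry}} = 1,
\qquad x_b \le n_b \text{ for annotated loop headers},
\]

where B is the set of basic blocks, c_b the upper bound on the execution time of block b obtained from the cache and pipeline analyses, x_b the execution count of block b, and y_e the traversal count of CFG edge e.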

In addition to this, the user can provide the tool with annotations that for example specify the loop boundaries or the maximum recursion depth. The analysis assumes that the running software is uninterrupted. [2][19][76]

The tool supports several different architectures (e.g. ARM Cortex-M3 and PowerPC 755). The analysis takes the pipeline and cache behaviour into consideration when estimating an upper bound of the execution time.

Usage

To apply the tool, the isolated component has to be compiled and linked to an executable. The user would use the executable in addition to an optional annotation file to compute the upper boundary of the execution time. The annotation file must be provided in situations where an upper bound of a loop cannot be calculated by the tool. This tool is applicable for estimating the execution time of a SW component but would in some cases require some extra effort, e.g. creating an annotation file.

The tool can be bought and downloaded from the homepage where the user manual is included. The tool can be used in both Windows and Linux environments.

3.1.1.2 Method 2 - Bound-T

This is a commercial tool that is based on static analysis and traditionally used in the spacecraft domain [9]. The tool was developed by Space Systems Finland Ltd and later extended to support other application domains by Tidorum Ltd [76]. The tool calculates an upper bound of the execution time and, in addition, supports stack usage analysis. The input to the tool is an executable and optionally an annotation file describing some limitations, e.g. maximum loop iterations. The tool creates a CFG from the executable in which the pipeline effects are modelled. The graph is analysed in order to find possible execution paths; part of this is to find upper bounds of loops. When the CFG and the possible execution paths are known, the time it takes to execute each instruction in the executable is modelled and an upper bound of the execution time is calculated for each basic block. The upper bound of the WCET is then estimated based on the CFG and the timing bound for each basic block. The analysis assumes that the running software is uninterrupted.

Bound-T supports several different architectures, for instance ARM7, where the hardware is modelled to take the pipeline effects into consideration.

Usage

The isolated component, with corresponding C code, has to be compiled and linked to an executable. The executable, corresponding to the SW component, is used as input to the tool together with extra information to assist the tool with upper bounds on certain loops that it cannot compute by itself. That means that the tool, in some cases, requires some extra effort in order to be applied. A well-detailed user manual and a technical manual are provided on the homepage for extra support. The tool can only be used on Linux.

3.1.1.3 Method 3 - Heptane

This is an open source tool that is based on static analysis. The tool calculates an upper bound of the execution time based on a static analysis of source code (C code) or an executable. The tool comes with two modes: a timing schema-based method and an integer linear programming (ILP) based method for calculating the upper bound of the execution time. The timing schema-based method produces a result quicker, while the ILP-based method produces more detailed results but requires more computation time. The timing schema-based method makes use of the source code to compute the upper boundary, while the ILP-based method makes use of the executable; a sketch of the classic timing schema recursion is given below. The tool analyses the effects of instruction caches, pipelining and branch prediction. In order to find the upper boundaries of loops, the user must specify the upper limits of the loops in the source code. The analysis assumes that the running software is uninterrupted. [24][76]
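
In its classic form, the timing schema composes the bound recursively over the program's syntax tree. A sketch (with W(·) denoting the computed upper bound, c a condition and n the user-supplied loop bound; the exact rules used by Heptane may differ):

\[
\begin{aligned}
W(S_1;\,S_2) &= W(S_1) + W(S_2)\\
W(\mathbf{if}\; c\; \mathbf{then}\; S_1\; \mathbf{else}\; S_2) &= W(c) + \max\bigl(W(S_1),\,W(S_2)\bigr)\\
W(\mathbf{while}\; c\; \mathbf{do}\; S) &= (n+1)\,W(c) + n\,W(S)
\end{aligned}
\]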

Heptane comes with support for MIPS and ARM [24] architectures. The analysis takes the pipeline and cache effects into consideration when estimating an upper bound of the WCET.

Usage


The tool accepts both source code and an executable to perform the analysis. Heptane does not implement any mechanism to find upper boundaries of loops; the user has to manually insert loop bounds as annotations in the source code, which requires extra effort. Support is very limited, as is the provided documentation on how to run the tool. The tool can only be used in a Linux environment.

3.1.1.4 Method 4 - SWEET

The SWEdish Execution Time Analysis Tool (SWEET) was developed by the WCET research team in Västerås, Sweden [74][60]. The main function of SWEET is flow analysis, but it also supports WCET estimation. The input to SWEET for the flow analysis is an ALF-file and an optional annotation file. ALF stands for ARTIST2 Language for Flow Analysis and is a common format for flow analysis [23]. SWEET uses abstract execution to determine the flow facts. This analysis tries to determine the different paths through the software; part of this is deriving loop boundaries, infeasible paths and the execution frequencies of parts of the code. The result of the flow analysis is flow facts, which can be used by other tools, e.g. aiT or RapiTime, to calculate an upper bound of the execution time. The current version of SWEET does not implement a low-level analysis, i.e. it cannot calculate an upper bound of the execution time based on e.g. cache analysis or pipeline analysis. However, it can estimate the WCET by using a more simplified timing model. The abstract execution determines the worst and the best execution paths; by using this in addition to a timing model, the best-case execution time (BCET) and WCET boundaries can be calculated. The tool also supports BCET/WCET calculation from ALF construct costs. This is a less accurate approach where the execution time is estimated at a higher level (source code or ALF format). To achieve this, a timing model must be defined. One method to define such a timing model is described by P. Altenbernd [4]. The analysis assumes that the running software is uninterrupted. [60][36]

The current version of SWEET utilizes a very simple model of the architecture. Instead of modelling the entire architecture, as most of the other static tools do, each instruction is mapped to a number of clock cycles. The number of clock cycles for each instruction in the instruction set constitutes the timing model, which is used in the analysis to determine an upper bound of the WCET. A sketch of how such a model is applied is given below.
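
A timing model of this kind reduces to a lookup table from instruction (or instruction class) to a cycle cost, and the bound for a basic block is the sum of the costs of its instructions. A minimal C sketch under that assumption (the instruction classes and cycle costs below are invented for illustration, not taken from SWEET):

/* SWEET-style simplified timing model (sketch): the model is just a map
   from instruction class to a cycle cost; the classes and costs below
   are invented for illustration. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

enum instr_class { ALU, LOAD, STORE, BRANCH, NUM_CLASSES };

static const uint32_t cycles_per_instr[NUM_CLASSES] = {
    [ALU] = 1, [LOAD] = 3, [STORE] = 2, [BRANCH] = 2   /* assumed costs */
};

/* Upper bound for a basic block: the sum of its instructions' costs. */
static uint64_t block_bound(const enum instr_class *instrs, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += cycles_per_instr[instrs[i]];
    return total;
}

int main(void)
{
    const enum instr_class body[] = { LOAD, ALU, ALU, STORE, BRANCH };
    printf("block bound: %llu cycles\n",
           (unsigned long long)block_bound(body, 5));
    return 0;
}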

Usage

To estimate the execution time, an ALF-file must be generated from the isolated component, i.e. its C code, on which the analysis will be performed. The tool requires the ALF-file, a timing model and optionally an annotation file to estimate an upper boundary of the WCET. The annotation file gives the tool information that cannot be extracted from the source code, e.g. loop boundaries. SWEET provides documentation on how it works and how to use it on the homepage. The tool can be used in both Windows and Linux.

3.1.1.5 Method 5 - Chronos

This tool is developed at the National University of Singapore. It performs a static analysis on both the source code and an executable in order to estimate the WCET. The first step is a lightweight data flow analysis on the source code to determine upper loop bounds; if that fails, the user has to provide annotations. The main part of the analysis is done on the executable, due to the level of detail available in the binary. A path analysis is performed in order to derive the CFG. For each basic block in the CFG, the timing is estimated by considering the behaviour of the processor. Chronos supports out-of-order pipelines, branch prediction and instruction cache analysis. Chronos is based on the open-source SimpleScalar simulation infrastructure.
