
UPTEC IT 16002

Degree project 30 credits, June 2015

Increasing energy efficiency and instruction scheduling by software prefetching

Alexander Fougner

Department of Information Technology


Abstract

Increasing energy efficiency and instruction scheduling by software prefetching

Alexander Fougner

With the increasing problems related to semiconductor process node shrinkage and the expansion of the mobile device market, the requirements on energy efficiency grow continuously stricter. Alternative methods such as Decoupled Access/Execute adapt software to better fit dynamic voltage and frequency scaling. Targeting the energy-inefficient Out-of-Order execution logic, new methods propose to increase energy efficiency by moving the Out-of-Order logic from the hardware level to the software level, enabling reordering of loop iterations. One way to enable reordering of iterations is to transform a loop into a backward recursion. The aim of this thesis is to investigate transformations of loops into recursions and to evaluate the resulting performance impact. This thesis presents a source-language-independent implementation of an LLVM compiler pass transforming loops into forward and backward recursions.

The performance impact is evaluated by choosing parallel loops from the Rodinia benchmark suite and measuring the recursion overhead for different recursion depths. In certain cases, tight loops showed an overhead ranging from 22% to 78% for the backward recursion, depending on recursion depth, whereas for loose loops the observed overhead was as low as 1% regardless of the recursion depth.

Subject reader: Stefanos Kaxiras
Supervisor: Alexandra Jimborean


Summary in Swedish

With the ever finer manufacturing processes for microchips, leakage currents in the circuits increase, as does the amount of so-called dark silicon, the chip area that cannot be used because of the circuit's power and thermal constraints. As the great interest in mobile devices and the Internet of Things places higher demands on the energy efficiency of circuits, new alternative methods must be explored. Dynamic voltage and frequency scaling has long been a popular method for regulating the energy consumption of semiconductor circuits by better matching capacity to workload.

One of the drawbacks of frequency scaling is that its adaptability is too coarse-grained to adjust to a suitable working level at the fast pace demanded by the circuit's instructions. Roughly speaking, the workload of a processor can be divided into two categories. In the first, performance is bounded by instructions that are mostly compute-intensive, meaning the processor cannot work any faster. In the second category the work is dominated by data traffic to and from the chip, which causes a lot of unnecessary waiting time for the processor. When the processor's instruction stream consists of a mixture of these two categories, frequency scaling must compromise and settle at a level that is optimal for neither category. Modern circuits also use a technique called out-of-order execution to handle the delays that arise when the operands of instructions are not yet available. This enables more efficient scheduling of computations, though at a cost: the technique consumes a lot of energy and is therefore an interesting target for optimization.

In the case of frequency scaling, previous research has shown that it is advantageous to split the program into separate phases by first loading data into the cache memories at low frequency and then raising the pace during the computation phase, so-called decoupled access/execute (DAE). One idea for increasing energy efficiency is to replace out-of-order execution with a software variant and to combine this with DAE. Instead of using traditional loop structures for the access and execute phases, recursion is used. This enables separate phases, where one takes place at the recursive calls and the other when the recursion returns and the stack is unwound.

This thesis investigates the recursion transformation and its impact on performance. It is realized as a compiler module based on the LLVM compiler infrastructure, which among other things means that the module is, in theory, source-language independent. The module transforms a loop structure into a recursion, both forward recursion and backward recursion. Performance is evaluated with a set of benchmarks from the Rodinia benchmark suite, examining among other things the performance for different recursion depths and compiler optimization flags.

As expected, execution time increases markedly in many of the cases. The largest effect is generally observed for small loops, partly because of their relatively few computational instructions. In one case, compared with the reference test, the backward recursion introduces between 22% and 78% overhead depending on the recursion depth. For larger loop structures a smaller performance degradation is observed, sometimes as little as 1% regardless of the maximum recursion depth.


Contents

1 Introduction
  1.1 Background
    1.1.1 Decoupled Access/Execute (DAE)
  1.2 Problem description
  1.3 Method
  1.4 Thesis outline

2 Related work
  2.1 Recursion
    2.1.1 Transformation
    2.1.2 Stack optimization
  2.2 In-Order and Out-of-Order execution

3 Compiler concepts
  3.1 Dominators
  3.2 Graphs
    3.2.1 Special edges
  3.3 Loops
    3.3.1 Natural loops
    3.3.2 Headers

4 LLVM
  4.1 LLVM Classes
  4.2 The LLVM IR Language
    4.2.1 Static Single Assignment
    4.2.2 Phi nodes
  4.3 LLVM Passes

5 Implementation
  5.1 Detecting loops to transform
  5.2 Collecting iteration data
    5.2.1 Induction variable
    5.2.2 Finding loop-carried outgoing values
    5.2.3 Closed set of dependent operands in a loop
    5.2.4 Loop exit condition
  5.3 General modifications to the IR
  5.4 Reverse-Order execution
  5.5 In-Order execution
    5.5.1 Prepare for extraction
    5.5.2 Extract to function
    5.5.3 Recursion call

6 Evaluation
  6.1 Benchmark selection
    6.1.1 Hardware platform
    6.1.2 Software platform
    6.1.3 Profiles
  6.2 Important passes
  6.3 Method
  6.4 Results
    6.4.1 Needleman-Wunsch
    6.4.2 Breadth-first search
    6.4.3 CFD solver
    6.4.4 Heart wall
    6.4.5 Hotspot
    6.4.6 Nearest neighbour
    6.4.7 Particlefilter
    6.4.8 Speckle reducing anisotropic diffusion (SRAD)

7 Conclusions
  7.1 Future work

Appendices
A Appendix
  A.1 Needleman-Wunsch
  A.2 Breadth-first search
  A.3 CFD solver
  A.4 Heart wall
  A.5 Hotspot
  A.6 Nearest neighbour
  A.7 Particlefilter
  A.8 Speckle reducing anisotropic diffusion (SRAD) v1
  A.9 Speckle reducing anisotropic diffusion (SRAD) v2

Abbreviations

CIV   Canonical induction variable
CFG   Control-flow graph
DAE   Decoupled access/execute
DVFS  Dynamic voltage and frequency scaling
EDP   Energy delay product
ILP   Instruction-level parallelism
IPC   Instructions per cycle
IR    Intermediate representation
IV    Induction variable
JIT   Just-in-time [compilation]
MLP   Memory-level parallelism
SSA   Static single assignment
TRE   Tail recursion elimination


Chapter 1

Introduction

1.1 Background

Modern processors have a vast range of technologies to speed up execution, such as branch prediction, hardware prefetching and out-of-order execution, all of which make the chip design more complex and increase energy consumption. However, trading energy efficiency for lower latency in the execution pipeline is not beneficial for devices with limited power sources. As more mobile and battery-dependent computers enter the market, a need for better energy efficiency arises. One alternative way to achieve more energy-efficient designs is to decrease the complexity of the chip by replacing hardware features with better software solutions. In a modern CPU a significant part of the energy is consumed by the out-of-order execution logic [1], making it an interesting target for improving energy efficiency. In addition, silicon die shrinkage in semiconductor manufacturing increases the amount of dark silicon, and power density problems arise [2], resulting in further limitations on chip design and energy efficiency.

The primary objective of the dynamic-scheduling logic is to prevent a stalling instruction from blocking the next scheduled instructions. Allowing the CPU to reorder instructions reduces the total memory latency, i.e. increases speed at the cost of energy efficiency. A possible alternative for avoiding stalling instructions is the software approach, one example being the Decoupled Access/Execute model [3]. To reduce latency, soon-to-be-used data is prefetched in an access phase, which is then followed by the execute phase.

1.1.1 Decoupled Access/Execute (DAE)

First presented by Smith [4] in 1982, decoupled access and execute phases have certain exploitable properties, one of them being improved energy efficiency. Koukos et al. [3] managed to separate the access and execute phases and adjust them to a granularity more appropriate for dynamic voltage and frequency scaling (DVFS). By running the access phase at low frequency and the execute phase at higher frequency, they increased the memory-level parallelism (MLP) and instruction-level parallelism (ILP) of the access phase and execute phase, respectively. This resulted in a 25% average reduction of the energy delay product (EDP) on a machine with low-latency per-core DVFS.

Implementations in recent work divide the access and execute phases into two separate loops [5]. This thesis investigates an alternative to this approach by transforming the original loop into a recursion. This makes it possible to hoist the access phase above a recursive call instead of creating two separate loops. It also lets the compiler automatically save values to the stack, which can be used to save and maintain values that are unique to each iteration, an example being the loop-carried induction variables.

1.2 Problem description

As an extension of the DAE model [6], it is proposed to increase energy efficiency by reordering instructions at runtime. By reordering (delaying) instructions whose operands are not yet available in the cache, the memory latency is effectively hidden at runtime as a feature of software, rather than by the more energy-consuming, hardware-based out-of-order execution technology.

One solution to implement out-of-order execution in software is to transform loops into recursion form, thus enabling forward recursion and backward recursion depending on where the recursion call is located. In the backward recursion case iterations will be executed in reverse, i.e. out-of-order. The recursion model brings the advantage of the stack automatically keeping track of loop-carried values.

The described model can be viewed as two separate, smaller problems: the first is transforming iterative loops into recursive functions; the second consists of decoupling the access and execute phases and dynamically determining the optimal recursion depth. However, the goal of this thesis is limited to the implementation and evaluation of the first problem, transforming loop iterations into a recursive form. Integration of the decoupled access and execute phases is outside the scope of this thesis.

The contribution of this thesis primarily consists of two parts, the first being a prototype compiler pass for the LLVM (formerly Low Level Virtual Machine) compiler infrastructure [7]. The presented compiler pass transforms loops, independently of the source language, into recursive versions. The second part evaluates the performance of both the original and the transformed versions, also comparing differences between the two. There are multiple important aspects to evaluate.

• Recursion overhead to measure potential degradation in performance.

• Feasible compiler optimizations after the transformation, specifically if the transform is reversible by the tail recursion elimination in LLVM.


1.3 Method

Using the LLVM suite [7], a compiler pass prototype has been implemented. This compiler pass transforms loops to recursions, allowing iterations to be reordered given that the recursive function contains the entire loop body and under the assumption that dependences between iterations are respected.

The prototype pass is implemented in the C++ programming language, the primary reasons being to maintain performance and modularity and that C++ is the main language the LLVM compiler infrastructure is written in, which simplifies integrating the prototype with the LLVM framework.

To evaluate the performance impact of this proposal, a set of programs from the Rodinia benchmark suite is used [8]. The targets selected for transformation are OpenMP [9] parallelized for-loops that are free of loop-carried dependences, which enables both in-order and out-of-order execution of loop iterations. This allows the reverse recursion to take place without the added complexity of handling dependences between iterations, which would be required for non-parallelized loops.

1.4 Thesis outline

Chapter 2 presents related work in the field.

Chapter 3 introduces common compiler theory and terminology used in this thesis.

Chapter 4 includes an introduction to LLVM concepts and some of its core classes.

Chapter 5 describes the design of the implemented prototype pass.

Chapter 6 discusses the prototype evaluation.

Chapter 7 concludes the work.


Chapter 2

Related work

2.1 Recursion

2.1.1 Transformation

Silva et al. [10] introduce an algorithm for automatic transformation of loop constructs in higher-order languages into recursive versions. They developed a library for the Java language able to transform every kind of Java loop construct (for, while, do-while and foreach), including constructs containing exceptions and return, break and continue statements.

They evaluated the performance of the transformed recursive versions by choosing 14 benchmarks from the Java Grande suite [11] and also adding their loops2recursion implementation to the set of evaluated programs (30 in total, 15 iterative + 15 recursive). Comparing the iterative versions of the benchmarks with the transformed recursive versions, they found, with a 99% confidence level, the recursive version to be less efficient in 11 of the 15 experiments. In one experiment, again at the 99% confidence level, they found the recursive version to be more efficient than the iterative version. However, disabling JIT compilation, and the many optimizations it implies, resulted in the iterative versions being more efficient.

The Java library implemented by Silva et al. only produces tail-recursive methods; however, it is important to note that the JVM used for the experiments did not support tail recursion optimization. This is noteworthy because in most cases tail recursion is considered to perform better than non-tail recursion, due to the decreased overhead in the tail-recursive case.

Although Silva et al. claim that the algorithm is easily adapted to other languages, it is not source-language independent, in contrast to an implementation transforming the intermediate representation, which actually would be language independent.

2.1.2 Stack optimization

Schaeckeler et al. [12] present an algorithm for reducing the stack size of recursive functions. In the first phase they develop a set of theorems to determine when it is legal to change certain stack-allocated variables into globally allocated variables. Formal parameters and locally declared variables considered dead at the recursive call can be declared global, which reduced the stack size in 70% of the tested benchmarks. In 40% of the benchmark functions the savings equalled the maximum theoretical reduction, and in 30% of the cases the size reduction was between 16.7% and 33.3% of the maximum savings.

By splitting the live ranges of formal parameters and local variables at recursive calls, the second phase of the algorithm reduced the stack size even further. All benchmark functions saw a reduction of stack space, even functions left unoptimized by the first phase. Depending on the benchmark, stack space was reduced by between 23.1% and 66.7%.

2.2 In-Order and Out-of-Order execution

Srinivasan et al. [13] discuss load latency tolerance in dynamically scheduled processors. Flexible load completion policies, where load completions are delayed for as long as possible, are compared to an ideal memory hierarchy where every load completes in one cycle. They conclude that most of the observed programs exhibit some degree of load latency tolerance and still perform comparably to the ideal memory system. For the latency-tolerant policies to perform comparably to the ideal memory system, they estimate that between 1% and 71% of load operations must complete within one cycle, depending on benchmark and machine configuration. It is also concluded that up to 74% of the loads can be satisfied in 32 cycles.

Another interesting hardware model, developed by Shioya et al. [14], consists of two different execution units: the out-of-order execution unit (OXU) and the in-order execution unit (IXU). Fetched instructions are first scheduled for execution in the IXU. The IXU executes those instructions that are ready upon arrival, and those whose dependent operands will become available in the IXU pipeline. Not-ready instructions are handled as NOPs (no operation) and later passed on to the OXU, where they queue until their dependences are resolved. Their evaluation results show that the IXU can execute over 50% of the instructions, resulting in 17% lower energy consumption for the whole processor while performing marginally better than a conventional superscalar processor.


Chapter 3

Compiler concepts

This chapter introduces concepts related to graph theory which are used in the thesis.

3.1 Dominators

Definition

A node d dominates a node n (d dom n) if every path from the start node to n contains d. By definition, each node dominates itself. For example, in a CFG with edges A → B, A → C, B → D and C → D, the start node A dominates every node, whereas neither B nor C dominates D, since D can be reached through the other branch.

3.2 Graphs

Numerous compiler optimizations are based on different graph theory concepts, one being the control-flow graph (CFG). A CFG consists of basic blocks (nodes) and legal traversal between blocks (edges), representing the legal flow of a program, i.e. all possible execution paths that might be traversed.

3.2.1 Special edges

Back edge

A back edge is an edge whose head dominates its tail. Note that a directed arc a = (x, y), from x to y, is considered to have x as its tail and y as its head.

Critical edge

A critical edge is an edge directed from a block with multiple successors to a block with multiple predecessors. Thus, inserting computations unique to this edge requires splitting the edge with an intermediate basic block.

Abnormal edge

Exception handling constructs can produce an edge whose destination is unknown; such an edge is called an abnormal edge and tends to inhibit optimizations.


Impossible edge

It is sometimes necessary to add an edge to a CFG solely because of property preservation, one example being that the exit block in a function postdominates all other blocks. Such an edge is called an impossible edge or a fake edge, and it can never be traversed.

3.3 Loops

3.3.1 Natural loops

Optimizations may occasionally be easier to perform when a loop is in a simplified form. One such simplified form is the natural loop, which requires the loop to have only a single entry block, called the loop header. As a result, the header h dominates all blocks in the loop body, including a backedge block b which has an edge l = (b, h). The loop body is then defined to be the set of nodes, including the backedge block and the header, from which b can be reached without passing through h.

3.3.2 Headers

Pre-header

For certain optimizations it is beneficial to have a single basic block preceding the loop header, one example being loop-invariant code motion (LICM). Such a basic block is called the pre-header of the loop and is guaranteed to be executed once prior to entering the loop.

Header

A loop header is the entry block for a loop, meaning it dominates all the blocks in the loop body as well as the back-edge block (also called the latch block) which points back to the header.


Chapter 4

LLVM

This chapter introduces a selection of important concepts and classes from the LLVM compiler infrastructure [7, 15], as well as the LLVM intermediate representation language.

4.1 LLVM Classes

Module

The Module class is the highest level of program structure, representing one or more translation units merged together by the linker. It also keeps track of the program's functions and symbol table and maintains a list of global variables.

Function

A procedure in a translation unit is represented by the Function class, which maintains the contained basic blocks, the arguments to the procedure, and the corresponding symbol table.

BasicBlock

A basic block is an ordered collection of instructions, having only one point of entry and one point of exit. The last instruction in a basic block is called the terminator and can be, e.g., a branch, jump or return instruction. The BasicBlock class is the LLVM representation of a basic block; it keeps track of the contained instructions and the parent function in which the block is embedded.

Instruction

The Instruction class is the base class of a collection of subclasses, each representing a specific type of instruction, e.g. BinaryOperator, CastInst, CmpInst and TerminatorInst. The Instruction class itself keeps track of the instruction type, usually referred to as the opcode, and its parent basic block.


Value

In the intermediate representation, operations and variables have a typed value; this is represented by the Value class. All subclasses inheriting from the Value class may be used, for example, as operands to other instructions and operations. In the case of a branch operation referencing the label of a basic block, the label can be viewed as the value of the basic block.

User

In comparison to the Value class, which is the base class of value-producing classes, User is the base class of operations that use values. Because User is itself a subclass of Value, both are commonly inherited by other subclasses, e.g. an addition instruction, which uses two values (the operands) and produces a value (the sum). User exposes an interface enabling iteration over the operands or fetching them by index.
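For illustration, a minimal sketch of how the User interface might be used to visit the operands of an instruction (the function name is illustrative, not taken from the prototype):

#include "llvm/IR/Instruction.h"
#include "llvm/IR/Value.h"

// Visit each operand of an instruction through the User interface.
// An Instruction is a User (it consumes operands) and also a Value
// (its result may in turn be used by other Users).
void visitOperands(llvm::Instruction *Inst) {
  for (unsigned i = 0, e = Inst->getNumOperands(); i != e; ++i) {
    llvm::Value *Op = Inst->getOperand(i); // fetch operand by index
    // ... inspect or replace Op here ...
    (void)Op;
  }
}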

4.2 The LLVM IR Language

In this section commonly used parts of the LLVM assembly language [16] are introduced.

4.2.1 Static Single Assignment

The LLVM IR language has the property of Static Single Assignment (SSA) form, meaning every variable is assigned only once and needs to be defined before use. In other words, every assignment results in a new variable in the intermediate representation. For example, the assignment of the x variable in fig. 4.1 corresponds to the intermediate form in fig. 4.2, containing x2 and x3.

4.2.2 Phi nodes

if (x < y)
    x = 3
else
    x = 1
z = x

Figure 4.1: A simple branch

       x1 < y1
      /       \
 x2 = 3       x3 = 1
      \       /
       z1 = x?

Figure 4.2: Basic blocks

       x1 < y1
      /       \
 x2 = 3       x3 = 1
      \       /
  z1 = Φ(x2, x3)

Figure 4.3: Basic blocks

In high-level languages it is common to assign the same variable in different branches of the program (Figure 4.1), resulting in multiple definitions of the same variable in the intermediate representation (Figure 4.2). The reason for this is the SSA property, which requires every IR variable to be defined at most once. Consequently, for succeeding basic blocks, a use of the variable is ambiguous. To solve this problem a Phi node is introduced. The main idea of a Phi node is to merge declarations from different execution paths into a single declaration, selecting the correct one depending on the incoming edge (Figure 4.3).

In the LLVM IR language a Phi node is represented by the following syntax.

phi type [ value1, inedge1 ], [ value2, inedge2 ]

The keyword type specifies the bit size of the data, e.g. i8 for a character, and the bracket list defines which value to select for the corresponding incoming edge. Phi nodes are required to have at least one entry, to be placed at the beginning of a basic block, and to have entries for all predecessors. However, a Phi node is allowed to contain the value undef (undefined) for certain incoming edges. For instance, this can occur when a variable is set in one path but not in the other. A Phi node is also allowed to refer to itself in its value list. This is a common occurrence when the CFG contains a loop, in which case the induction variable can be defined as a Phi node referring to itself (the value from the previous iteration) and to a value defined outside the loop, the starting value.
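As a sketch of how such a self-referencing Phi node looks from the C++ API side (block and value names here are illustrative, not taken from the implementation):

#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/Instructions.h"

// Build iv = phi i64 [ 0, %preheader ], [ %iv.next, %latch ] at the top
// of the loop header; iv.next feeds the Phi back over the backedge.
llvm::PHINode *buildIV(llvm::BasicBlock *Header, llvm::BasicBlock *Preheader,
                       llvm::BasicBlock *Latch) {
  llvm::Type *I64 = llvm::Type::getInt64Ty(Header->getContext());
  llvm::PHINode *IV = llvm::PHINode::Create(I64, /*NumReservedValues=*/2,
                                            "iv", &Header->front());
  llvm::Value *Next = llvm::BinaryOperator::CreateAdd(
      IV, llvm::ConstantInt::get(I64, 1), "iv.next", Latch->getTerminator());
  IV->addIncoming(llvm::ConstantInt::get(I64, 0), Preheader); // start value
  IV->addIncoming(Next, Latch); // value from the previous iteration
  return IV;
}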

4.3 LLVM Passes

The implementation described in this thesis is, explicitly or implicitly, dependent on the following passes.

loop-simplify

The loop-simplify pass is used to normalize loops, transforming them into a form satisfying certain conditions. Because of this loop normalization, certain optimizations and transformations can adhere to a common pattern, reducing the complexity of the implementation described in chapter 5. The following conditions are guaranteed to be satisfied by the loop-simplify pass.

• A single, non-critical edge entering the loop header from outside. The unique predecessor to the header is called pre-header.

• Loops are guaranteed to have exactly one backedge.

• All loop exit blocks only have predecessors from inside the loop, and are therefore dominated by the loop header.

mem2reg

Promotes memory-based variables to registers. This causes load and store operations on high-level variables, especially loop-variant variables, to be promoted to intermediate SSA registers. This greatly simplifies the implementation of the transform pass because loop-variant variables are represented as Phi nodes, rather than adding the complexity of loading and storing the variables to memory. Instead, the implemented pass can use the infinite number of virtual registers available and let LLVM handle register allocation during the recursion calls, spilling to memory as needed.

indvars

There might be multiple induction variables in a loop; these are simplified and canonicalized by this pass. This results in the loop having a single canonical induction variable, which in turn simplifies the implemented pass. Specifically, the maximum recursion depth value is derived from the canonical induction variable and is later used during recursion to determine whether to continue or exit the recursion.

loop-rotate

A loop is originally created with the header as an exiting block, meaning the loop condition is evaluated in the header, with a minimum loop trip count of zero as a result. The rotation transforms the loop by placing the condition in the latch block, turning the latch into an exiting block. This guarantees a minimum loop trip count of one; the condition then determines whether to run the loop again after the body has executed.
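The implemented pass itself is a loop pass; below is a minimal sketch of the boilerplate, assuming the legacy pass manager of the LLVM 3.5 era. The pass name is illustrative, and only the loop-simplify dependence is expressed directly here; mem2reg, indvars and loop-rotate are assumed to be scheduled earlier in the pipeline.

#include "llvm/Analysis/LoopPass.h"
#include "llvm/Transforms/Scalar.h" // declares LoopSimplifyID in LLVM 3.5

namespace {
// Illustrative pass skeleton, not the prototype's exact code.
struct RecursionTransform : public llvm::LoopPass {
  static char ID;
  RecursionTransform() : llvm::LoopPass(ID) {}

  void getAnalysisUsage(llvm::AnalysisUsage &AU) const override {
    // Ask the pass manager to run loop-simplify first, so every loop the
    // pass sees has a pre-header, a single backedge and dominated exits.
    AU.addRequiredID(llvm::LoopSimplifyID);
  }

  bool runOnLoop(llvm::Loop *L, llvm::LPPassManager &LPM) override {
    // ... the transformation described in chapter 5 ...
    return true; // report that the IR was modified
  }
};
}

char RecursionTransform::ID = 0;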


Chapter 5

Implementation

This chapter describes the implementation of a prototype LLVM transform pass. Given a loop with certain characteristics, the pass can transform the loop into a recursion.

To simplify the implementation of the transformation pass, certain criteria need to be fulfilled. The first criterion is that the pass can only handle loops with a single exiting block. The second criterion is composite: a set of conditions guaranteed to be fulfilled by the LLVM loop-simplify pass. Thus, this implementation depends on the loop-simplify pass.

Although the actual transformation is performed at the intermediate representation level, the steps described here are illustrated with pseudocode to simplify the description.

5.1 Detecting loops to transform

In the benchmarks targeted by the recursion transform pass there exists a vast set of loops, many of which are not parallelizable and thus not suitable for the transformation. To determine which loops to target and transform, a semi-automatic "tagging" procedure is introduced. By adding a directive pragma to all loops targeted for transformation, the Clang compiler will, during compilation, add metadata nodes to the corresponding loop blocks, which can later be retrieved to decide whether to run the pass on a specific loop.

Clang implements #pragma clang loop, a set of directives used to specify hints, e.g. for the loop vectorization passes. The directive implements options for specifying vectorization, interleaving and unrolling of loop constructs. In this specific case the directive pragma is used as an optional compiler hint and is ignored by all other stages in the compiler pipeline, except for the specific use it was made for. For this purpose the clang loop vectorize_width pragma is chosen, together with a "magic number" explicitly chosen to be ignored by the vectorization passes in case they are enabled (Figure 5.1). To be considered valid by the loop vectorization passes, the vectorization width needs to be a power of two and less than or equal to 64. The integer 1337 is an invalid vectorization width and is therefore chosen as the magic constant for this implementation.


#define MAGIC_TRANSFORM 1337

#pragma clang loop vectorize_width(MAGIC_TRANSFORM)
for (int n = 0; n < cols; n++) {
    ...
}

Figure 5.1: Example of the clang loop pragma

; The latch block contains the terminator referencing the metadata
for.body173:
  ...
  br i1 %cmp172, label %for.body173, label %for.end299, !llvm.loop !1

; The metadata is located at the end of the file
!1 = metadata !{metadata !1, metadata !2}
!2 = metadata !{metadata !"llvm.loop.vectorize.width", i32 1337}

Figure 5.2: Example of metadata attached to a branch

During the transform pass, loops are checked for metadata containing the vectorization width, and if the width equals the magic constant a recursion transform is applied (Figure 5.2). The metadata node is attached to the terminating instruction in either the loop latch block or the loop header. Parsing of the vectorization metadata is normally handled by the LoopVectorizeHints class; however, the original class would invalidate the magic constant. Therefore a copy of the LoopVectorizeHints class is created, with the validation code for the vectorization width disabled.
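A sketch of what that metadata check might look like against the LLVM 3.5-era metadata API (the function name is illustrative; the constant mirrors the description above):

#include "llvm/Analysis/LoopInfo.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/Metadata.h"

static const uint64_t MagicTransform = 1337;

// Returns true if the loop carries the llvm.loop.vectorize.width hint
// with the magic constant that tags it for the recursion transform.
static bool isTaggedForTransform(const llvm::Loop *L) {
  llvm::MDNode *LoopID = L->getLoopID();
  if (!LoopID)
    return false;
  // Operand 0 is the self-referencing node; the hints follow after it.
  for (unsigned i = 1, e = LoopID->getNumOperands(); i != e; ++i) {
    llvm::MDNode *Hint = llvm::dyn_cast<llvm::MDNode>(LoopID->getOperand(i));
    if (!Hint || Hint->getNumOperands() < 2)
      continue;
    llvm::MDString *Name = llvm::dyn_cast<llvm::MDString>(Hint->getOperand(0));
    if (!Name || Name->getString() != "llvm.loop.vectorize.width")
      continue;
    if (llvm::ConstantInt *Width =
            llvm::dyn_cast<llvm::ConstantInt>(Hint->getOperand(1)))
      return Width->getZExtValue() == MagicTransform;
  }
  return false;
}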

5.2 Collecting iteration data

The analysis of the CFG, such as tracking induction variables and their operands, is crucial for the recursion transform. The methods used to collect and analyse induction variables, their operands and their outgoing values are discussed in this section.

5.2.1 Induction variable

Transforming a for loop into recursion form requires a canonical induction variable (CIV) to be identified. In this case the CIV is assumed to start counting at zero and to increase by one for every iteration of the loop. By invoking the LLVM IndVarSimplify pass during the optimization stage, certain loops containing induction variables can be transformed into a canonicalized form, resulting in the loop having a CIV. If the IndVarSimplify pass successfully canonicalizes a loop, the CIV can be obtained by calling the method getCanonicalInductionVariable() of the Loop class, which returns the CIV as a PHINode. The canonical variable can then be used to identify and isolate the loop condition and the loop increment, by collecting the users of the value.

5.2.2 Finding loop-carried outgoing values

To successfully transform the loop, the induction variables carried by the loop need to be found. It is also necessary to find the corresponding increased or, more generally, changed value of each induction variable between iterations. Given the PHINode representing an IV, the value corresponding to the incoming latch edge is the changed value carried by the loop and used in the next iteration.
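A small sketch of both steps with the LLVM API, under the assumption that loop-simplify and indvars have already run so that the latch and the CIV exist (the helper name is illustrative):

#include "llvm/Analysis/LoopInfo.h"
#include "llvm/IR/Instructions.h"

// Fetch the canonical induction variable and the value it carries
// over the backedge into the next iteration.
static llvm::Value *getNextIterationValue(llvm::Loop *L) {
  llvm::PHINode *CIV = L->getCanonicalInductionVariable();
  llvm::BasicBlock *Latch = L->getLoopLatch();
  if (!CIV || !Latch)
    return nullptr; // loop was not canonicalized or simplified
  // The incoming value on the latch edge is the per-iteration update,
  // e.g. the CIV incremented by one.
  return CIV->getIncomingValueForBlock(Latch);
}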

5.2.3 Closed set of dependent operands in a loop

In the case of multiple dependent operands for an outgoing value, it is not sufficient to obtain the incoming latch value alone; therefore a simple traversal algorithm is implemented to collect all loop-variant dependences of the latch value. Built primarily around two data structures, a queue Q and a set S, value operands of queued instructions are expanded and added to the set in the following manner.

Algorithm 1 Dependences collection algorithm

procedure Collect-Deps
    let ExpandInstr ∈ Q
    let ExpandInstr ∈ S
    while Q ≠ ∅ do
        Current ← Q.pop
        for each User u of Current do
            if u is an Instruction and u is in Loop and u ∉ S then
                push u to Q
                add u to S
            end if
        end for
    end while
end procedure

• Only uses that are Instruction values are of interest. Values which do not produce a different SSA value between iterations have no impact on the computation in the way instructions do; i.e., operands in the form of functions and basic blocks are ignored when operand values are added to the queue.

• Instruction operands from outside the loop blocks are not considered, because they are invariant with respect to the loop iterations.

• To guarantee that the algorithm terminates, new operands are added to the queue iff they are not already in the dependency set. This ensures no operand is expanded more than once, which would otherwise put the algorithm into an infinite loop.

Adding an instruction, e.g. an induction variable, to the queue and the dependency set and executing the algorithm results in all loop-dependent instructions being collected in the dependency set. This collection of instructions is then used to hoist only the loop-dependent computations to precede the recursion call.
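A C++ sketch of how Algorithm 1 might look against the LLVM API; the container choices (std::queue, SmallPtrSet) are illustrative and not taken from the prototype:

#include <queue>
#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/IR/Instruction.h"

// Collect the closed set of in-loop instructions that transitively use
// Root (e.g. an induction variable), following Algorithm 1.
static void collectDeps(llvm::Instruction *Root, const llvm::Loop *L,
                        llvm::SmallPtrSet<llvm::Instruction *, 16> &Deps) {
  std::queue<llvm::Instruction *> Work;
  Work.push(Root);
  Deps.insert(Root);
  while (!Work.empty()) {
    llvm::Instruction *Cur = Work.front();
    Work.pop();
    for (llvm::User *U : Cur->users()) {
      llvm::Instruction *I = llvm::dyn_cast<llvm::Instruction>(U);
      // Skip non-instruction users, users outside the loop, and
      // instructions already expanded (guarantees termination).
      if (!I || !L->contains(I) || Deps.count(I))
        continue;
      Deps.insert(I);
      Work.push(I);
    }
  }
}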

5.2.4 Loop exit condition

The exit condition of the loop can be found by inspecting the loop exiting block and its conditional branch instruction. By collecting the operands recursively from the branch instruction, the dependent instructions are found as well.

5.3 General modifications to the IR

This section explains the implemented modifications that are applied in both forms of the transformation, in-order and reverse order.

Recursion step limit

To control the number of recursive calls that are executed consecutively, a recursion step limit is introduced. The limit is defined as recursion_limit = CIV + max_steps, represented by an addition instruction embedded in the original loop header. This value is recalculated every time the original loop is resumed, so it remains unchanged throughout a single recursion instance. On termination, the recursion instance returns to the loop, the recursion limit is recalculated with the updated CIV, and the process is repeated with the new limit.
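A sketch of how the limit computation might be materialized in the header with an IRBuilder (the helper name is illustrative; the CIV is assumed to be of type i64, as produced by indvars in this setup, and MaxSteps is the chosen recursion depth N):

#include "llvm/IR/IRBuilder.h"

// Insert recursion_limit = CIV + max_steps right after the header's
// Phi nodes; recomputed each time the outer loop resumes.
static llvm::Value *insertStepLimit(llvm::BasicBlock *Header,
                                    llvm::PHINode *CIV, uint64_t MaxSteps) {
  llvm::IRBuilder<> B(Header->getFirstNonPHI());
  return B.CreateAdd(CIV, B.getInt64(MaxSteps), "max_steps");
}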

5.4 Reverse-Order execution

This section describes the steps necessary to transform a loop into a reverse recursion, i.e. one where the recursive calls are performed before the corresponding loop bodies. The goal is to transform a normal loop, such as the one shown in figure 5.3, into the recursion shown in figure 5.4. A loop in its simplified form consists of a header, a body, an exiting block and a backedge (latch) block. The block in the CFG where the loop either continues or exits is called the exiting block. This block is located either in the header or in the latch block, depending on whether the loop-rotate pass has been executed. Originally the exiting block is in the header, which is the case used in this implementation for the reverse transform.


idx = 0;
#pragma clang loop vectorize_width(LOOPNUMBER(1))
for ( ; idx <= i; ) {
    index = (idx + 1) * max_cols + (i + 1 - idx);
    input_itemsets[index] = maximum(input_itemsets[index-1-max_cols] + referrence[index],
                                    input_itemsets[index-1] - penalty,
                                    input_itemsets[index-max_cols] - penalty);
    idx = idx + 1;
}

Figure 5.3: Example of a loop before the recursion transform

extracted_loopbody(<parameters>, max_steps, idx, idx_pointer) {
    if (idx <= i && idx < max_steps) {
        // recurse
        extracted_loopbody(<parameters>, max_steps, idx+1, idx_pointer);
    }
    else {
        // stop recursion and save outgoing values
        *idx_pointer = idx;
        return;
    }
    // original loop body
    index = (idx + 1) * max_cols + (i + 1 - idx);
    input_itemsets[index] = maximum(input_itemsets[index-1-max_cols] + referrence[index],
                                    input_itemsets[index-1] - penalty,
                                    input_itemsets[index-max_cols] - penalty);
}

#pragma clang loop vectorize_width(LOOPNUMBER(1))
for (idx = 0; idx <= i; idx++) {
    max_steps = idx + 5;
    extracted_loopbody(<parameters>, max_steps, idx, idx_pointer);
    // reload outgoing values
    idx = *idx_pointer;
}

Figure 5.4: Example of a loop after the reverse order recursion transform


Cloning the header

In addition to the original loop header, where the loop exit condition resides, the recursive version requires a duplicated header and exit condition. This additional header is extracted to the new function and controls the recursion depth. Furthermore, because iterations of the outer loop are skipped as a result of the recursive transformation, the behaviour of the outer loop header needs to be replicated inside the recursion as well. Therefore the complete header block, including its instructions, is cloned, except for its PHI nodes. Because the PHI nodes in the loop header are abstract values carried between iterations, they are not cloned but handled separately, passed as parameters to the recursion function.

Cloned instructions are mapped in a ValueMap, where each original instruction points to its clone. This map is used to remap the operands of the cloned instructions to point to the corresponding local values in the cloned header.

Remapping overridden values

In addition to the local remapping of operands between the cloned header and the original header, there may be instructions in the rest of the loop body still using the now overridden values from the original header. Such instructions need their operands replaced by the new corresponding instructions defined in the cloned header.

However, note that both the original header and the cloned header are exempt from this operation. The original header is explicitly exempted because the relationships of all definitions and uses within that block are correct. In contrast, the cloned header is implicitly exempted: during the header cloning procedure all uses of values defined in the original header are replaced with the corresponding local values, so no uses from the original header can remain.

An example of a cloned header, named for.cond68.clone, is shown in figure 5.9. To control the recursion depth, new instructions handling the control flow are introduced. These include a max-recursion-steps comparison (%maxsteps.cmp) and a composite logical operation (%recursion.cond) consisting of the max-steps comparison and the original loop exit condition.

Recursion condition

The cloned header, which now contains the additional loop exit condition that will control the recursion, can be modified to include the recursion step limit. Depending on whether the condition is exit-on-true or loop-on-true, the following modifications are made to the logical structure.

• Exit on true
If the original condition exits the loop when it evaluates to true, the new recursion condition must also evaluate to true when the limit is reached. The original condition and the recursion limit condition are therefore combined with a logical or, as in table 5.1.


Original condition   Recursion condition   Exit on true   Loop on true
T                    T                     T              T
T                    F                     T              F
F                    T                     T              F
F                    F                     F              F

Table 5.1: Condition table for the new loop condition with recursion limit

• Loop on true
In the opposite case, where the loop continues execution if the condition evaluates to true, the recursion limit condition must evaluate to false when the recursion limit is reached. To correctly merge the two conditions a logical and is used, with the resulting truth table shown in table 5.1. A sketch of how the merged condition might be built follows below.
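The following sketch builds the combined condition for both cases (value names follow figure 5.9; whether the loop is exit-on-true is assumed to be known from analysing the exiting branch):

#include "llvm/IR/IRBuilder.h"

// Combine the original loop condition with the step-limit test, placed
// before the cloned header's terminator. Exit-on-true loops get an OR
// with "limit reached"; loop-on-true loops get an AND with "limit not
// yet reached" (cf. table 5.1).
static llvm::Value *buildRecursionCond(llvm::BasicBlock *ClonedHeader,
                                       llvm::Value *OrigCond,
                                       llvm::Value *CIV, llvm::Value *Limit,
                                       bool ExitOnTrue) {
  llvm::IRBuilder<> B(ClonedHeader->getTerminator());
  if (ExitOnTrue) {
    llvm::Value *Hit = B.CreateICmpEQ(CIV, Limit, "maxsteps.cmp");
    return B.CreateOr(OrigCond, Hit, "recursion.cond");
  }
  llvm::Value *NotHit = B.CreateICmpNE(CIV, Limit, "maxsteps.cmp");
  return B.CreateAnd(OrigCond, NotHit, "recursion.cond");
}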

Recursion skeleton preparation

Prior to inserting the recursion skeleton into the control flow graph, a few modifications are needed. After cloning the header there are two headers, both branching to the entry block and the exit block. In preparation for the skeleton insertion, the cloned header is modified to branch directly to the entry block, by replacing its conditional branch with an unconditional branch. Additionally, the conditional branch in the original header is redirected to the cloned header instead of the loop body entry block. This results in two duplicated headers connected in series, which handle both the original outer loop and a recursion called from inside the loop body.

Creating the recursion skeleton

The recursion call structure consists of a simple if-else branch, controlled by the composite recursion condition (original condition + max-steps condition). This introduces two important basic blocks: the recursion call block, which holds the recursive call to the extracted loop body, and the recursion exit block, which performs a write-back of the loop-dependent values. Figure 5.5 shows the partially transformed loop with the recursion skeleton inserted.

Hoisting loop-carried values and operands

The reverse case requires the loop-carried values to be available at the point where the recursive call is made. This implies that the operands of the loop-carried variables must dominate all their uses; thus, if the operands are defined in the loop body they have to be hoisted to a block dominating the recursion call.

Create PHImap

Maintenance of the loop-carried variables is key to the recursion transform. To keep track of the variables and their relations at the intermediate representation level, a ValueMap is introduced.


idx = 0;
#pragma clang loop vectorize_width(LOOPNUMBER(1))
for ( ; idx <= i; ) {
    max_steps = idx + 5;
    if (idx <= i && idx < max_steps) { }
    else { }
    // original loop body
    index = (idx + 1) * max_cols + (i + 1 - idx);
    input_itemsets[index] = maximum(input_itemsets[index-1-max_cols] + referrence[index],
                                    input_itemsets[index-1] - penalty,
                                    input_itemsets[index-max_cols] - penalty);
    idx = idx + 1;
}

Figure 5.5: Example of a loop with the recursion skeleton inserted

This ValueMap tracks the relation between a PHINode, which is the value used in the current iteration, and its latch value, which is intended for the next iteration. There are two PHI maps in use for the reverse case and one for the in-order case.

The common PHI map, utilised in both cases, maps the PHINodes of the loop header (the loop-carried values) to their corresponding incoming latch values. It is used when creating the arguments to the recursive call, to deduce the outgoing values intended for the next iteration of the recursion.

The second PHI map is only utilised in the reverse case and is the inverse of the first. It is used after the loop body has been extracted, to replace the operands of the automatically created store instructions. Because the recursion condition and exit block are executed before the loop body, the correct values to store on exit are not the outgoing values but the incoming values, i.e. the values from the previous, and final, iteration.

Extraction of loop body

Blocks belonging to the loop body, excluding the original header and the split latch block, are extracted to a new function by the llvm::CodeExtractor class. The basic blocks to extract are collected in an ArrayRef and passed as a parameter to the constructor. Note that the first basic block in the ArrayRef parameter is required to dominate the rest of the block sequence. Figure 5.7 shows the partially transformed loop with part of the loop body extracted into a separate function.
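A sketch of how the extraction might be invoked (LLVM 3.5-era CodeExtractor API; the block list and the dominator tree are assumed to be available, and the helper name is illustrative):

#include <vector>
#include "llvm/IR/Dominators.h"
#include "llvm/Transforms/Utils/CodeExtractor.h"

// Extract the collected loop-body blocks into a fresh function. The
// first block in Blocks must dominate all the others.
static llvm::Function *extractBody(std::vector<llvm::BasicBlock *> &Blocks,
                                   llvm::DominatorTree &DT) {
  llvm::CodeExtractor CE(Blocks, &DT);
  return CE.extractCodeRegion(); // returns the new function, or null
}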


extracted_loopbody(<parameters>, max_steps, idx, *idx_pointer) {
    if (idx <= i && idx < max_steps) { }
    else { }
    // original loop body
    index = (idx + 1) * max_cols + (i + 1 - idx);
    input_itemsets[index] = maximum(input_itemsets[index-1-max_cols] + referrence[index],
                                    input_itemsets[index-1] - penalty,
                                    input_itemsets[index-max_cols] - penalty);
    *idx_pointer = idx + 1;
}

idx = 0;
#pragma clang loop vectorize_width(LOOPNUMBER(1))
for ( ; idx <= i; ) {
    max_steps = idx + 5;
    extracted_loopbody(<parameters>, max_steps, idx, idx_pointer);
    // reload outgoing values
    idx = *idx_pointer;
}

Figure 5.6: Example of a loop with parts of the body extracted


extracted_loopbody(<parameters>, max_steps, idx, *idx_pointer) {
    if (idx <= i && idx < max_steps) {
        // recurse
        extracted_loopbody(<parameters>, max_steps, idx+1, idx_pointer);
    }
    else { }
    // original loop body
    index = (idx + 1) * max_cols + (i + 1 - idx);
    input_itemsets[index] = maximum(input_itemsets[index-1-max_cols] + referrence[index],
                                    input_itemsets[index-1] - penalty,
                                    input_itemsets[index-max_cols] - penalty);
    *idx_pointer = idx + 1;
}

idx = 0;
#pragma clang loop vectorize_width(LOOPNUMBER(1))
for ( ; idx <= i; ) {
    max_steps = idx + 5;
    extracted_loopbody(<parameters>, max_steps, idx, idx_pointer);
    // reload outgoing values
    idx = *idx_pointer;
}

Figure 5.7: Example of a loop with parts of the body extracted

Inserting the recursion call

The last phase of transforming the extracted loop body into a recursion consists of inserting a function call and a base case into the recurse and exit blocks, respectively. This includes constructing an argument list, which consists of the original arguments passed by the function call while respecting the loop-carried values. The argument list is constructed by copying the original argument list of the function call generated by llvm::CodeExtractor; however, the loop-carried values are replaced by the corresponding outgoing values tracked by the PHINode map of section 5.4. In addition, for the reverse recursion case another map is created, tracking the relationship between the PHINode outgoing values and their formal parameters. Note that both of these reside inside the extracted function. This map is later used to translate outgoing values into incoming values in the recursion exit block, where the loop-carried values are finally written to memory. After the recursion returns to the original loop, the stored values are reloaded from memory.
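A conceptually simplified sketch of the argument-list construction and call insertion (names are illustrative, and the outgoing-value map stands in for the PHI map of section 5.4; in the real pass the substituted values must of course be ones visible inside the extracted function):

#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Instructions.h"

// Copy the argument list of the call produced by CodeExtractor,
// substituting each loop-carried value with its outgoing (next
// iteration) counterpart, then place the recursive call in the
// recurse block.
static llvm::CallInst *
insertRecursiveCall(llvm::CallInst *OuterCall, llvm::Function *ExtractedFn,
                    llvm::BasicBlock *RecurseBB,
                    llvm::DenseMap<llvm::Value *, llvm::Value *> &Outgoing) {
  llvm::SmallVector<llvm::Value *, 8> Args;
  for (unsigned i = 0, e = OuterCall->getNumArgOperands(); i != e; ++i) {
    llvm::Value *A = OuterCall->getArgOperand(i);
    auto It = Outgoing.find(A);
    Args.push_back(It != Outgoing.end() ? It->second : A);
  }
  return llvm::CallInst::Create(ExtractedFn, Args, "",
                                RecurseBB->getTerminator());
}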


extracted_loopbody(<parameters>, max_steps, idx, *idx_pointer) {
    if (idx <= i && idx < max_steps) {
        // recurse
        extracted_loopbody(<parameters>, max_steps, idx+1, idx_pointer);
    }
    else {
        *idx_pointer = idx;
        return;
    }
    // original loop body
    index = (idx + 1) * max_cols + (i + 1 - idx);
    input_itemsets[index] = maximum(input_itemsets[index-1-max_cols] + referrence[index],
                                    input_itemsets[index-1] - penalty,
                                    input_itemsets[index-max_cols] - penalty);
}

idx = 0;
#pragma clang loop vectorize_width(LOOPNUMBER(1))
for ( ; idx <= i; ) {
    max_steps = idx + 5;
    extracted_loopbody(<parameters>, max_steps, idx, idx_pointer);
    // reload outgoing values
    idx = *idx_pointer;
}

Figure 5.8: Example of a completely transformed loop

Relocating and remapping outgoing values

Because llvm::CodeExtractor only extracts a set of blocks into a new function, without any regard for the recursion transform, the store instructions for the loop-carried values end up immediately before the function exit. Thus, every function call would rewrite values that are not read again until the recursion has completely returned to the original outer loop. By relocating the generated store instructions to the recursion exit block, the unnecessary stores are removed, leaving only the base-case stores needed before returning from the recursion. However, at this point the store instructions refer to the loop body's outgoing values, which would be used in the next iteration, whereas the recursion base case should return without executing another loop body instance. The outgoing values are therefore replaced with the corresponding incoming values mapped by the PHINode map created in section 5.4, and the base case is made to return directly after the remapped store instructions.


Figure 5.9: CFG with cloned header

for.cond68:
  %indvars.iv3 = phi i64 [ %indvars.iv.next4, %for.cond68.backedge ], [ 0, %for.cond68.preheader ]
  %max_steps = add i64 %indvars.iv3, 5
  %43 = trunc i64 %indvars.iv3 to i32
  %cmp69 = icmp slt i32 %43, %28
  br i1 %cmp69, label %for.body70, label %do.end.loopexit94

for.cond68.clone:
  %59 = trunc i64 %indvars.iv3 to i32
  %60 = icmp slt i32 %59, %28
  %maxsteps.cmp = icmp ne i64 %max_steps, %indvars.iv3
  %recursion.cond = and i1 %maxsteps.cmp, %cmp69
  br i1 %60, label %for.body70, label %do.end.loopexit94

do.end.loopexit94:
  br label %do.end

; loop body blocks: for.body70, for.body3, for.body4, for.body5

for.cond68.backedge:
  %indvars.iv.next4 = add nuw nsw i64 %indvars.iv3, 1
  br label %for.cond68


5.5 In-Order execution

This section describes the process of converting a loop into an in-order recursion. The implementation requires the loop to exit through the latch block, which is accomplished by running the LLVM loop-rotate pass.

The in-order transform is effectively built upon the same principles as the reverse transform, although because of the in-order transform's reduced complexity some operations are only required in the reverse case. This section therefore focuses primarily on the parts where operations are omitted or differ from the reverse case.

5.5.1 Prepare for extraction

First, the recursion step limit and recursion condition are inserted into the loop latch, before the terminating instruction. Secondly, the recursion skeleton is introduced. Contrary to the reverse case, where the skeleton is placed between the header and the entry to the loop body, in the in-order case the skeleton is placed before the terminator instruction in the loop latch. Thus, for the in-order transform the loop body is executed before the recursive call. An example of the CFG with the added recursion skeleton and modified exit condition is shown in figure 5.11.

Similar to the reverse case, the loop-variant arguments are managed by a PHINode map. This is done by mapping the PHINode values to their corresponding incoming latch values.

5.5.2 Extract to function

The extraction of the loop blocks is performed in the same manner as in the reverse case. However, contrary to the reverse case, which requires the loop-carried values to be available before the recursive call and the loop body, the argument construction for the in-order recursion does not need to map the formal parameters to the loop-carried outgoing values. This is because the reverse-case loop condition is determined before the loop body executes, while the in-order loop condition is determined in the loop latch after the loop body executes. Thus, for the latch condition the correct values to save are the outgoing values of the final iteration.

5.5.3 Recursion call

The last step in the transformation is to embed the recursive call into the created skeleton and hoist the store instructions to the recursion exit block.


extracted_loopbody(<parameters>, max_steps, idx, idx_pointer) {
    // original loop body
    index = (idx + 1) * max_cols + (i + 1 - idx);
    input_itemsets[index] = maximum(input_itemsets[index-1-max_cols] + referrence[index],
                                    input_itemsets[index-1] - penalty,
                                    input_itemsets[index-max_cols] - penalty);
    if (idx <= i && idx < max_steps) {
        // recurse
        extracted_loopbody(<parameters>, max_steps, idx+1, idx_pointer);
    }
    else {
        *idx_pointer = idx + 1;
    }
}

#pragma clang loop vectorize_width(LOOPNUMBER(1))
for (idx = 0; idx <= i; idx++) {
    max_steps = idx + 5;
    extracted_loopbody(<parameters>, max_steps, idx, idx_pointer);
}

Figure 5.10: Example of the in-order recursion


Figure 5.11: CFG in-order recursion skeleton

for.body69.lr.ph:
  %fp.194 = phi %struct._IO_FILE* [ %fp.1, %if.end65 ], [ %fp.067, %while.body ]
  %done.192 = phi i32 [ %done.1, %if.end65 ], [ 0, %while.body ]
  %16 = trunc i64 %call37 to i32
  %17 = add i32 %16, -1
  br label %for.body69

for.body69:
  %indvars.iv70 = phi i64 [ 0, %for.body69.lr.ph ], [ %indvars.iv.next71, %28 ]
  %max_steps = add i64 %indvars.iv70, 5
  br label %for.body69.split

for.body69.split:
  %sub = mul i64 %indvars.iv70, 210453397504
  %sext = add i64 %sub, 115964116992
  %idx.ext = ashr exact i64 %sext, 32
  %add.ptr = getelementptr inbounds [490 x i8]* %sandbox, i64 0, i64 %idx.ext
  %call72 = call i32 (i8*, i8*, ...)* @__isoc99_sscanf(i8* %add.ptr, i8* getelementptr inbounds ([6 x i8]* @.str8, i64 0, i64 0), float* %tmp_lat, float* %tmp_long) #5
  %26 = load float* %tmp_lat, align 4
  %sub73 = fsub float %26, %conv
  %mul75 = fmul float %sub73, %sub73
  %27 = load float* %tmp_long, align 4
  %sub76 = fsub float %27, %conv11
  %mul78 = fmul float %sub76, %sub76
  %add79 = fadd float %mul75, %mul78
  %conv82 = call float @sqrtf(float %add79) #1
  %arrayidx84 = getelementptr inbounds float* %12, i64 %indvars.iv70
  store float %conv82, float* %arrayidx84, align 4
  %indvars.iv.next71 = add nuw nsw i64 %indvars.iv70, 1
  %lftr.wideiv = trunc i64 %indvars.iv70 to i32
  %exitcond = icmp eq i32 %lftr.wideiv, %17
  br label %for.body69.split.split

for.cond92.preheader.lr.ph:
  %18 = add i32 %conv38, -1
  br i1 %cmp1968, label %for.body95.lr.ph.us.preheader, label %for.end110.preheader

for.body69.split.split:
  %max_steps.cmp = icmp eq i64 %max_steps, %indvars.iv70
  %recurs.ec = or i1 %max_steps.cmp, %exitcond
  br i1 %recurs.ec, label %recursion.exit, label %recursion.skel

recursion.exit:
  br label %28

recursion.skel:
  br label %28

%28:
  br i1 %exitcond, label %for.cond92.preheader.lr.ph, label %for.body69, !llvm.loop !1


Chapter 6

Evaluation

6.1 Benchmark selection

To evaluate the performance of the recursion transform, a set of benchmarks from the Rodinia [8] suite is used. The loops targeted for transformation are parallel, which satisfies the out-of-order iteration constraint that the reverse recursion implies. Unfortunately, not all of the benchmarks are suitable for testing the recursion overhead; a number of benchmarks are either avoided or simply not possible to include in the evaluation. The following cases, discovered during the analysis, are reasons for exclusion.

• No CIV, which is a prerequisite for the implemented pass.

• Cold loops are excluded if they are statistically insignificant compared to the total benchmark runtime. For example, this would apply to a loop running for less than a second in a benchmark running for a minute or longer.

• Low iteration count is seen in, among others, some of the benchmarks related to image processing. For instance, the loop body might consist of only a function call and the number of loop iterations is very low, in which case the recursion overhead would be too small to be significant.

6.1.1 Hardware platform

The experiments are run on an Intel Core i5-4690K CPU with the cache configuration specified in table 6.1, using 2x4GB 1600MHz RAM and with Intel Turbo Boost disabled to minimize noise throughout the testing phase. The performance CPU governor is used to lock the scaling frequency to the specified maximum. Additionally, the level 2 prefetcher is enabled, as is the ”fetch adjacent cacheline” option.


L1 instruction cache    4 x 32KB
L1 data cache           4 x 32KB
L2 cache                4 x 256KB
LL cache                6MB 12-way

Table 6.1: Intel Core i5-4690K caches

6.1.2 Software platform

Building upon the foundation of LLVM, the transform pass is developed for and used together with LLVM 3.5.1, which is also the version of the clang/clang++ [17] frontend used to compile the benchmark programs. Both the Linux kernel and the perf tool [18] are version 3.19.

6.1.3 Profiles

To measure the impact of the recursion transform, a set of profiles is introduced, each representing a different set of optimization passes invoked during the compilation process.

Any discrepancies revealed between profiles in the performance evaluation are explained by comparing which passes are applied during compilation and by consulting statistics from the tools perf and Valgrind. In the following sections the maximum number of recursions, i.e. the recursion depth, is represented by the letter N. For example, the profile inorder-256 represents an in-order transformation with a maximum step of 256.

reference

The reference profile applies optimization aggressively, by adding the -O3 flag throughout the compilation process. As the name suggests, this profile exists to provide a reference to which the other profiles are compared.

ref-custO3

This profile is a modified version of reference which excludes the loop-rotate pass, such that the reverse-order transform is applied rather than the in-order transform. The purpose is to measure the impact of removing the loop rotation, before any further optimizations are applied.

inorder-N

The inorder profile transforms a latch-exiting loop into a recursion by applying the customized-O3 set, loop-rotate and the recursion transform. No other optimizations are applied after the recursion transform, which allows measuring the extra overhead the recursion adds.

inorder-O3-N

Similar to the inorder profile, but adds an extra set of customized-O3 optimization passes after the recursion transform is applied, to verify whether any new optimizations are enabled or disabled with respect to the recursion transform. The customized optimizations include all level 3 optimizations but exclude the tail call elimination pass (tailcallelim), described in section 6.2. This prevents the transform from being reversed into a loop again, which would otherwise disable the possibility to measure the impact of the recursion.

reverse-N

This profile applies the customized-O3 set, but excludes the loop-rotate pass to ensure all loops are in header-exiting form. This enables the transform pass to output reverse execution recursions.

reverse-O3-N

This profile exists for the same purpose as inorder-O3-N, the only difference being the reverse execution rather than in-order execution.

6.2 Important passes

Tail call elimination (tailcallelim)

This pass transforms self-recursions into iterations by branching from the return instruction to the entry of the function, thus creating a loop. The current implementation has several limitations. First, instructions located between the recursive call and the return instruction prevent tail recursion elimination (TRE) unless they are dead code; however, the implemented in-order recursion transform does not place any instructions between these locations, so it does not prevent TRE. Second, TRE is possible for functions returning void; since the recursion transform uses llvm::CodeExtractor, this constraint is also fulfilled. In addition, if it can be proved that the callee does not access the stack frame of its caller, the function is marked as eligible for TRE by the code generator.
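The following hypothetical example (not taken from the benchmarks; do_work is a placeholder) illustrates the shape of code tailcallelim handles and the loop it conceptually produces:

// The recursive call is the last action before return, i.e. in tail
// position, so tailcallelim can eliminate it.
extern void do_work(int idx);

void count_up(int idx, int limit) {
  if (idx >= limit)
    return;
  do_work(idx);
  count_up(idx + 1, limit); // tail position: nothing follows the call
}

// After tailcallelim the function is equivalent to this loop: the call
// is replaced by a branch back to the entry of the function.
void count_up_loop(int idx, int limit) {
  while (idx < limit) {
    do_work(idx);
    ++idx;
  }
}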

6.3 Method

To measure the overhead of the recursion transform, a set of timing procedures based on the clock_gettime() function is introduced. To avoid time drift caused by e.g. NTP and timezone-specific settings, the CLOCK_MONOTONIC_RAW clock is used. For each targeted loop there are two calls to a clock_gettime() wrapper, one before the loop and one after, from which the elapsed time is calculated. In case the measured loop is entered more than once during a single run, the timings are summed. Note that the timers are represented with 64 bits and nanosecond precision. When the actual benchmark is finished, the time measurement is converted from nanoseconds to seconds, represented as a single-precision float.
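A minimal sketch of such a wrapper, assuming Linux and illustrative names, is shown below:

// CLOCK_MONOTONIC_RAW is Linux-specific and unaffected by NTP and
// timezone adjustments, matching the requirements described above.
#include <stdint.h>
#include <time.h>

static uint64_t loop_ns; // accumulated over repeated loop entries

static uint64_t now_ns(void) {
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
  return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

// Around each targeted loop:
//   uint64_t t0 = now_ns();
//   /* ... measured loop ... */
//   loop_ns += now_ns() - t0;
//
// When the benchmark finishes, convert to single-precision seconds:
//   float seconds = (float)loop_ns / 1e9f;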

References
