
UPTEC IT18 022

Degree project, 30 credits. November 2018

Store prefetch policies

Analysis and new proposals

Carl Boström

Computer and Information Engineering Programme


Department of Information Technology
Visiting address: ITC, Polacksbacken, Lägerhyddsvägen 2
Postal address: Box 337, 751 05 Uppsala
Website: http://www.it.uu.se

Abstract

Store prefetch policies

Analysis and new proposals

Carl Boström

This thesis focuses on how to gain performance when executing programs on a CPU. More specifically, the store instructions are studied.

These instructions often cause large delays while waiting for write permission for a particular data block. If an out-of-order (OoO) CPU fills up the entire store buffer with stores waiting for write permission, the CPU has to stall, and cycles are wasted. To overcome this issue, the idea is to use predictors to predict which write permissions will be needed in the future and bring them into the L1 cache in advance. This method is similar to that of the branch predictor, where the CPU is fed with instructions to start working on in advance, based on a prediction of whether or not a branch will be taken. Naturally, one wants the prediction to be correct, but even if we assume that the predictor always identifies the write permissions that will be needed, there is still a problem: when to grant write permission for the data block. Too late, and one still has to wait, since the need for it occurs before it is in place. Too early, and it is going to be evicted from the L1 cache due to space issues, and there will be idle time while bringing it back in when the need occurs. Furthermore, it is also a waste of energy, since we brought in something only for it to be evicted before use. The question I aim to answer with this master's thesis is when to prefetch write permissions to gain optimal performance.

Supervisor: Alberto Ros

Subject reader: Stefanos Kaxiras
Examiner: Lars-Åke Nordén
UPTEC IT18 022

Printed by: Reprocentralen ITC


Sammanfattning

This work focuses on how the performance of programs running on a CPU can be increased. More specifically, the store instructions are studied. These instructions often cause large delays while waiting for data to be transferred from main memory to the L1 cache. If an out-of-order CPU cannot find other instructions to work on while waiting for the data, those cycles are wasted. One idea for overcoming this problem is to use predictors to guess which data will be used in the near future and transfer it to the L1 cache in advance. This resembles branch predictors, where the CPU is fed instructions to start working on based on a guess of whether a branch will be taken or not. Here, we instead guess which write permissions should be brought to the L1 cache in advance, rather than which instructions follow a branch. It is of course desirable that the guess is correct, but even if it is, there can still be a problem: when should the data be transferred? Too late, and we still have to wait, since the need for the write permission arises before it is in place. Too early, and the write permission may be evicted from the L1 cache due to lack of space, causing a delay while the data is transferred again when the need actually arises. The latter is also a waste of energy, since we transfer something that will be evicted before it is used. The goal of this master's thesis is to determine when data should be transferred for optimal performance.


Acknowledgment

I want to thank Alberto Ros from Murcia University and Stefanos Kaxiras from Uppsala University for presenting the fascinating topic of this thesis to me. Alberto and Stefanos have given me the information required to carry out this work. They have also shared their versions of Sniper (see 2.2.2) and GEMS (see 2.2.3) with me, and even edited them with this work in mind. The ideas behind the new policies have been developed in cooperation with Alberto.


Contents

1 Introduction
  1.1 Motivation
  1.2 Scope
  1.3 Related work
  1.4 Structure of the report

2 Background
  2.1 Theoretical Background
    2.1.1 A theoretical Architecture
    2.1.2 State-of-the-art prefetch policies
  2.2 Simulation infrastructure
    2.2.1 Benchmarks
    2.2.2 Sniper
    2.2.3 GEMS

3 Setting Up The Testbed
  3.1 Changes to Sniper
  3.2 Configuration
  3.3 CPU architecture
  3.4 Trace: Interface Sniper-GEMS
  3.5 Metrics for evaluation
    3.5.1 Energy graphs

4 Proposed Store Prefetch Policies
  4.1 Techniques to reduce speculation effect
    4.1.1 OnNonBranchSpeculative
    4.1.2 Re-Execute
  4.2 Techniques to filter unnecessary prefetches
    4.2.1 SameCacheLine
    4.2.2 PCbasedPredictor
  4.3 Techniques to adapt to timeliness
    4.3.1 PCbasedTimelinessPredictor
  4.4 General remarks

5 Results
  5.1 State-of-the-art policies
    5.1.1 Execution time
    5.1.2 L1 accesses
    5.1.3 Store prefetches
    5.1.4 Energy consumption
  5.2 Techniques to reduce speculation effect
    5.2.1 Execution time
    5.2.2 L1 accesses
    5.2.3 Store prefetches
    5.2.4 Energy consumption
    5.2.5 Conclusion
  5.3 Techniques to filter unnecessary prefetches
    5.3.1 Execution time
    5.3.2 L1 accesses
    5.3.3 Store prefetches
    5.3.4 Energy consumption
    5.3.5 Conclusion
  5.4 Techniques for timeliness
    5.4.1 Execution time
    5.4.2 L1 accesses
    5.4.3 Store prefetches
    5.4.4 Energy consumption
    5.4.5 Conclusion
  5.5 Combined results

6 Discussion
  6.1 Energy calculations
  6.2 Simulation debugging and correctness check

7 Conclusion
  7.1 State-of-the-art
  7.2 Techniques to reduce speculation effect
  7.3 Techniques to filter unnecessary speculations
  7.4 Techniques for timeliness

8 Future Work
  8.1 Bigger traces
  8.2 Are all filters necessary?
  8.3 More filters and policies


List of Figures

2.1 Our architecture
3.1 Trace example: Starting PC
3.2 Trace example: Anonymous instructions and a Load
3.3 Trace example: Branch and store
3.4 Trace example: Branch to a previous instruction
5.1 Execution time for all benchmarks (in percent weighted against OnCommit)
5.2 Number of L1 accesses for all benchmarks (in percent weighted against OnCommit)
5.3 Number of store prefetches for all benchmarks (in percent weighted against OnCommit)
5.4 Energy consumption for all benchmarks (in percent weighted against OnCommit)
5.5 Execution time for all benchmarks (in percent weighted against OnCommit)
5.6 Number of L1 accesses for all benchmarks (in percent weighted against OnCommit)
5.7 Number of store prefetches for all benchmarks (in percent weighted against OnCommit)
5.8 Energy consumption for all benchmarks (in percent weighted against OnCommit)
5.9 Execution time for all benchmarks (in percent weighted against OnCommit)
5.10 Number of L1 accesses for all benchmarks (in percent weighted against OnCommit)
5.11 Number of store prefetches for all benchmarks (in percent weighted against OnCommit)
5.12 Energy consumption for all benchmarks (in percent weighted against OnCommit)
5.13 Execution time for all benchmarks (in percent weighted against OnCommit)
5.14 Number of L1 accesses for all benchmarks (in percent weighted against OnCommit)
5.15 Number of store prefetches for all benchmarks (in percent weighted against OnCommit)
5.16 Energy consumption for all benchmarks (in percent weighted against OnCommit)


List of Tables

3.1 Simulated system configuration
3.2 Energy measurements for different CPU parts
5.1 Average execution time
5.2 The benchmarks that are most affected in execution time (in percentage)
5.3 Average number of L1 accesses
5.4 The benchmarks that are most affected in number of L1 accesses (in percentage)
5.5 Average number of store prefetches
5.6 The benchmarks that are most affected in number of store prefetches (in percentage)
5.7 Average energy consumption
5.8 The benchmarks that are most affected in energy consumption (in percentage)
5.9 Average execution time
5.10 The benchmarks that are most affected in execution time (in percentage)
5.11 Average number of L1 accesses
5.12 The benchmarks that are most affected in number of L1 accesses (in percentage)
5.13 Average number of store prefetches
5.14 The benchmarks that are most affected in number of store prefetches (in percentage)
5.15 Average energy consumption
5.16 The benchmarks that are most affected in energy consumption (in percentage)
5.17 Average execution time
5.18 The benchmarks that are most affected in execution time (in percentage)
5.19 Average number of L1 accesses
5.20 The benchmarks that are most affected in number of L1 accesses (in percentage)
5.21 Average number of store prefetches
5.22 The benchmarks that are most affected in number of store prefetches (in percentage)
5.23 Average energy consumption
5.24 The benchmarks that are most affected in energy consumption (in percentage)
5.25 Comparison between PCBased 4 and OnNonBSpec, both with SameCacheLine
5.26 Average execution time
5.27 The benchmarks that are most affected in execution time (in percentage)
5.28 Average number of L1 accesses
5.29 The benchmarks that are most affected in number of L1 accesses (in percentage)
5.30 Average number of store prefetches
5.31 The benchmarks that are most affected in number of store prefetches (in percentage)
5.32 Average energy consumption
5.33 The benchmarks that are most affected in energy consumption (in percentage)
5.34 The average numbers for all policies for all four aspects


1 Introduction

1.1 Motivation

We are always in need of faster and smaller computers that burn less energy. Over the second half of the twentieth century, the speed of our CPUs (central processing units) was increased by letting them run at higher and higher frequencies. However, increasing the frequency causes a much more than linear increase in energy and heat. In fact, a rule of thumb is that as much energy is spent on cooling the circuit as on the computation itself. Since it has become desirable to make smaller and smaller units, it becomes harder and harder to cool the circuits down. At the same time, software developers still demand more and more of the hardware. When CPUs cannot be sped up in the same way as before, they need to become more efficient. That is why pipelining was introduced into CPUs. For example, if one instruction takes six seconds to compute, then two instructions are completed after twelve seconds. If the work is instead divided into three stages, where an instruction spends two seconds in each stage, the CPU can begin working on a new instruction every two seconds.

Given this improvement, four instructions are completed in twelve seconds. Twice the amount of work can be done in the same period without increasing the speed of the CPU, and therefore the energy consumption remains roughly the same.

Another opportunity to take advantage of is that computing some instructions, especially memory instructions (loads and stores), involves long waiting times. If, instead of waiting, work is done on the next instructions, the waiting time ultimately has little effect. Because the computation of the instructions next in line is started while waiting, this is called out-of-order execution. It is up to the CPU to decide whether the work can be done in another order and still produce the same outcome.

As mentioned above, the computation of memory instructions includes waiting for write permission, and useful work can be done while waiting. Still, it is advantageous to decrease the waiting time, since there might not always be enough useful work to cover the entire waiting time. This thesis aims to investigate whether we can shorten the waiting time by prefetching the data needed for store instructions in advance. The earlier a write permission is granted, the earlier it is ready to use.

1.2 Scope

This master's thesis introduces and evaluates three state-of-the-art policies (see 2.1.2) for prefetching write permission for store instructions. There has not been much research focusing on the acceleration of store instructions in particular. The three state-of-the-art policies (OnExecute, OnCommit, and NoPrefetch) are combined in different ways to try to arrive at a more optimal policy concerning speed-up (number of cycles), number of prefetches, L1 accesses, and power consumption. Furthermore, some of the preparation work concerning editing and understanding traces from benchmark programs is covered as well.

1.3 Related work

The related work for this master's thesis is covered in subsection 2.1.2, where the state-of-the-art store prefetch policies are described.

1.4 Structure of the report

This report consists of eight chapters.

Chapter 1 - Introduction This chapter gives the motivation for the work along with its scope.

Chapter 2 - Background This chapter is divided into two parts. The first part covers the CPU architecture used and related terms that appear throughout the report, and introduces the state-of-the-art policies. The second part covers the methodology employed in this thesis.

Chapter 3 - Setting Up The Testbed This chapter can be seen as a continuation of the second part of chapter 2. Here the modifications of the existing simulation tools are covered, along with the trace format used to connect the simulation infrastructure.

Chapter 4 - Proposed Store Prefetch Policies This chapter introduces all store prefetch policies that are proposed within the work of this thesis.

Chapter 5 - Results This chapter includes graphs comparing the different policies with different settings concerning execution times, L1 accesses, store prefetches, and energy.

Chapter 6 - Discussion Issues with the setup that can have an impact on the results are covered in this chapter.

Chapter 7 - Conclusions This chapter will offer conclusions about the state-of- the-art and proposed store prefetch policies.

Chapter 8 - Future work This chapter presents ideas on future work that can be built upon the work of this thesis.


2 Background

This chapter is divided into two sections. The first one, 2.1 Theoretical Background, describes a simple CPU architecture, focusing on the parts used for executing memory instructions, i.e., stores. It is followed by an introduction of the three state-of-the-art prefetching strategies. Section 2.2 goes through the workflow of this thesis and introduces the tools that are used, modified or created, as well as how they are combined.

2.1 Theoretical Background

2.1.1 A theoretical Architecture

Figure 2.1: Our architecture

Figure 2.1 shows the architecture in use. This image displays almost all of the features and parts of the CPU that are going to be discussed and analyzed from different perspectives in this thesis. The stars symbolize where a prefetch is issued for the three state-of-the-art store prefetch policies evaluated in this thesis (section 2.1.2). The structure is the one used for the simulated CPU (see section 3.3). In modern CPUs, there is often more than one core; every core has a processor pipeline with several stages. The pipeline is split into stages to increase the throughput of executed instructions. One can compare it with a car factory where the cars travel on a conveyor belt through different stations, and at each station something is installed in the car, e.g., seats, wheels or the engine.


The simulated CPU has a seven-stage pipeline (for simplicity, Decode and Allocation are combined here). In the following sections, some key parts of the CPU are introduced while covering these stages. One thing to mention before moving on is that the CPU executes instructions out of order, but with the promise that the effect is always the same as if the instructions had been executed in order.

Fetch

In this stage, a new instruction (see 3.4) is added to the processor. In the most common case, the next instruction is brought in from the I-cache (Instruction cache).

An I-cache is a cache that holds the instructions to be executed shortly (caches are covered in 2.1.1). The CPU has a PC (program counter) which keeps track of the address of the instruction currently being fetched. When fetching the next instruction, the PC is incremented by the length of the current instruction. Loops are common in programming, and a loop determines whether there should be a jump back to the beginning of the loop body or whether the loop should terminate and the lines below it should execute. A CPU cannot, for example, execute C code directly; it needs to be compiled first. Compilation is a translation from source code to instructions within the instruction set (see 3.4) supported by the CPU in question. A loop is compiled to a branch instruction which has a target instruction address and a condition to be satisfied if the branch is taken (take one more lap in the loop). In order to not stall the CPU, it needs to continue bringing in new instructions before the condition has been computed. The question is which instructions to bring in.

Here the branch predictor is used to predict the outcome of a branch, which enables the CPU to load instructions based on that prediction. If a branch is predicted to be taken, the PC is set to the branch target address. If a prediction turns out to be wrong, the CPU has to remove the work and all the side effects produced by the wrongly executed instructions. Fetched instructions are placed in the Fetch queue (FQ) (ours has 60 entries).

Decode (and Allocation)

The instruction (from the instruction queue) is interpreted here. In the Allocation phase, resources such as entries in the Issue queue (IQ), Reorder buffer (ROB), and Store buffer (SB) are allocated [4]. The latter two buffers are introduced later on.

Dispatch

In this stage, dependencies are taken care of. If two numbers are added and the result is then multiplied by a third number, the addition has to be finished before the multiplication can be executed. Thus the multiplication has a dependency on the addition. If the result of the multiplication has to be stored, then the multiplication needs to be computed before its result can be stored. This gives a dependency between the store and the multiplication. When the dependencies have been worked out, the instructions are put in the Instruction queue (IQ) (which has 60 entries as well).

Execute

Here the arithmetic computation is performed. It takes place in one of the arithmetic logic units (ALUs); for store instructions this means calculating the memory address. The instructions may be executed out of order.

Commit

Commit is the last step of the pipeline, and it shares the ROB (ReOrder Buffer), with 192 entries, with some of the previous stages. This buffer puts the out-of-order executed instructions back into order. Given the size of the ROB, it is possible to execute up to 191 instructions ahead while waiting for an instruction to finish. When all instructions in a sequence are done, they can leave the pipeline. The 'x' in figure 2.1 illustrates a squashed instruction. If a branch prediction turns out to be wrong, then all the instructions that should never have been executed, but were executed based on that incorrect prediction, have to be squashed.

Returning to the example where the sum of two numbers is multiplied by a third number: when the addition is done and its result is added to the ROB, the multiplication instruction in the IQ gets informed that the instruction it depends on is done (the arrow from ROB to IQ in figure 2.1). Finally, the arrow from ROB to ALU provides the result of the addition to the multiplication.

Store Buffer (SB)

Every instruction takes the path that has been described until now; the path described beyond this point is only taken by store instructions. When a store instruction leaves the ROB, it goes to the store buffer (also called store queue). The store instruction is now represented by the memory address to write to and the data to be stored. The buffer is FIFO-ordered (first in, first out), which means that the first store in the buffer waits for its data block to be ready in the L1 cache (see 2.1.1) and blocks all stores behind it in the buffer. The store buffer interacts only with the L1 cache.

Memory and caches

The memory structure (at the bottom of figure 2.1) is an essential part when talking about store instructions. It begins with the main memory, which has a storage capacity of some gigabytes, a latency of 160 cycles on a hit, and is located on the motherboard. A cache is a quick and small memory unit placed on the CPU chip, in which the data that is likely to be used soon is placed. In this architecture, there are three caches, named L1 (32 KiB, 8-way, 4 hit cycles), L2 (128 KiB, 8-way, 12 hit cycles) and L3 (1 MB per bank, 8-way, 35 hit cycles). L stands for level, and the higher the digit, the bigger the cache is and the further away from the core it is located. A bigger cache means that it takes a longer time to find specific data in it. The time is measured in cycles, and in every cycle something can occur in the CPU, e.g., an instruction can be placed in the FQ and/or another one can be placed in the ROB.

If our CPU runs at a frequency of 2.2 GHz, there are

2.2 \times 10^{9} \ \text{cycles/second}

so one cycle takes

\frac{1}{2.2 \times 10^{9}}\,\text{s} \approx 0.45 \times 10^{-9}\,\text{s} = 0.45\,\text{ns}

This gives us that it takes about 4 \times 0.45 \approx 1.8 ns to retrieve data from the L1 cache.

A cache keeps copies of needed data, and the data is saved in different regions within the cache depending on the data address. The number of ways a cache has is the number of data chunks from every address range that can be kept simultaneously within the cache. A higher number of ways means that the cache is more complex to build. A smaller number of ways means a higher risk of conflict misses, which occur when the cache runs out of places for data from a specific address range. Our caches are all 8-way, which means that a piece of data can be in one of eight places (if it is in the cache) and that a cache can hold no more than eight data chunks from a given address range. The I-cache is like another L1 cache that only holds instructions, while the L2 and L3 caches hold both data and instructions. When data is loaded into the L1 cache from the main memory, it is often loaded into all the other caches at the same time.
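As an illustration of how an address maps into such a set-associative cache, the following minimal Python sketch (written for this report, not taken from the simulators; the 64-byte line size is an assumption, since the thesis only specifies capacity and associativity) computes the set index and tag for the simulated 32 KiB, 8-way L1 cache:

# Minimal sketch: address decomposition in a set-associative cache.
# LINE_SIZE is assumed; only capacity (32 KiB) and ways (8) come from the text.
LINE_SIZE = 64          # bytes per cache line (assumed)
CACHE_SIZE = 32 * 1024  # L1 capacity
WAYS = 8                # 8-way set associative

NUM_SETS = CACHE_SIZE // (LINE_SIZE * WAYS)  # 64 sets for these parameters

def decompose(address: int):
    """Split a byte address into (tag, set index, offset within the line)."""
    offset = address % LINE_SIZE
    line_number = address // LINE_SIZE
    set_index = line_number % NUM_SETS   # lines from the same address range share a set
    tag = line_number // NUM_SETS        # identifies which line occupies a way
    return tag, set_index, offset

if __name__ == "__main__":
    # Two addresses exactly NUM_SETS lines apart map to the same set; with only
    # eight ways per set, nine such lines cannot all reside in the cache at the
    # same time, which is what the text calls a conflict miss.
    a = 0x7fff8368acd8
    b = a + NUM_SETS * LINE_SIZE
    print(decompose(a), decompose(b))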


2.1.2 State-of-the-art prefetch policies

Three approaches were already implemented in the GEMS simulator (see 2.2.3) at the beginning of this thesis and are also covered in the literature. Their pros and cons are described below based on the architecture introduced in section 2.1.1. The "stars" mentioned in the paragraph headings are the ones in figure 2.1.

OnExecute (star 1) OnExecute [7] is the earliest and most speculative store prefetch policy: the operation to bring data with store permission into the L1 cache is issued in the execute stage. It ensures that the data arrives in L1 before it is needed, i.e., before its instruction is at the head of the store buffer. The downside is that several events which impact the prefetched data can occur:

First, if the store instruction is affected by a branch, that branch can be mispredicted, which means that energy and space are wasted in the small L1 cache by bringing in unneeded data. Bringing data into a cache can cause eviction of other data. When a store instruction reaches the head of the buffer and finds that its data has been evicted from the L1 cache, it has to wait for the data to be brought to the L1 again. This might take less time than the original fetch, since the data may still be in the L2 cache and can be brought from there instead of from the main memory.

Second, even if the prediction is correct, there might still be trouble: from when the prefetch is issued, the instruction has to finish the pipeline and pass through the queue in the store buffer. This process might take a long time, during which the data can have arrived in the L1 cache and been evicted due to lack of space, because more data has been prefetched between the time it arrives in L1 and the time the instruction reaches the head of the store buffer. Prefetching too early can also become a vulnerability, since malicious software can cause a prefetch of illegal data before the core figures out that it is illegal, and when it does, the data has already been exposed. To conclude, this alternative is the best if nothing goes wrong, but many things can go wrong.

OnCommit (star 2) In this case, the prefetch is issued in the commit stage (when passing the store to the store buffer). Unneeded data will not be prefetched. There is still a possibility that the data will be brought into the L1 cache and evicted due to lack of space, as described in the previous paragraph. Prefetching data on commit might still mean that the instruction has to wait at the head of the store buffer. This waiting can, in the worst case, fill up the entire buffer, which can stall the processor, i.e., block it from executing any other instruction. The Intel 64 and IA-32 architectures [8] use OnCommit. They do not use the name OnCommit, but they briefly describe the behavior with the following sentence: "Reading for ownership and storing the data happens after instruction retirement and follows the order of store instruction retirement." "Reading for ownership" is the same as a prefetch with write permission, and "instruction retirement" is another name for instruction commit. To conclude, OnCommit is the one in the middle between the two extremes, OnExecute (star 1) and NoPrefetch (star 3).

NoPrefetch (star 3) NoPrefetch does no prefetching; the data is brought into the L1 cache when the instruction is at the head of the store buffer. All permissions granted will be needed and will not be evicted before use. No energy is wasted on bringing in unneeded data. The downside is that every instruction will block the buffer for a long time while waiting for its data to become available in the L1 cache. The risk of filling up the buffer and stalling the processor (see 2.1.1) is therefore high.

To conclude, this is the most energy-efficient, but also the most time-consuming trade-off. Upon analyzing the source code of gem5 [6], it was discovered that NoPrefetch was the default implemented policy.
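To summarize, the three policies differ only in when the write-permission request is sent to the L1 cache. The following Python sketch illustrates that difference only; the event names and the request_write_permission call are invented for this example and are not taken from GEMS:

# Illustration only -- not the GEMS implementation.
def issue_prefetch_for(policy: str, event: str, l1, store_address: int) -> None:
    """Send the write-permission request at the event each policy prescribes."""
    issue_point = {
        "OnExecute": "execute",   # star 1: as soon as the store address is computed
        "OnCommit": "commit",     # star 2: when the store is passed to the store buffer
        "NoPrefetch": "sb_head",  # star 3: no prefetch; data is requested at the SB head
    }[policy]
    if event == issue_point:
        l1.request_write_permission(store_address)  # hypothetical L1 interface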

2.2 Simulation infrastructure

Here, the tools that have been used for this thesis are covered. The tools that have been modified or created within the work of this master's thesis are found in chapter 3.

2.2.1 Benchmarks

To examine how a CPU behaves and how a certain store prefetch policy will work, one has to run, or in this case simulate the run of, a program. These types of programs are often called benchmarks. For this work, the SPEC CPU 2006 industry-standardized benchmark suite [3] was used. It is designed to stress both the CPU and the memory subsystem, and it consists of 55 different benchmark workloads, which have been used to evaluate the different store prefetch policies.

One million instructions from every benchmark are simulated. A warmup of 10% is used, i.e., the measurements start after the execution of the first one hundred thousand instructions. A warmup is needed at the beginning of an execution, since the initial accesses always miss in the caches and we do not want that phase to affect the result.

2.2.2 Sniper

Sniper is a parallel, high-speed and accurate x86 simulator with support for multicore, according to its website [2]. Sniper takes two things as input: a set of configurations that describes the architecture to simulate, and a command line with the call to the application we want to simulate, for example one of the programs in 2.2.1. Sniper can then produce graphs and other documents that describe and measure the simulation of the chosen program on the chosen CPU configuration. Two examples of plots that can be generated are CPI stacks [1] and energy stacks: bar diagrams with one bar per thread in the program. These bars are divided into different regions based on the percentage of time or energy spent in that region. The regions can be: Ifetch, mem-l1, mem-l2 or branch.

2.2.3 GEMS

The Wisconsin Multifacet Project at the University of Wisconsin has released a General Execution-driven Multiprocessor Simulator (GEMS). The simulator is written in C/C++ and is open-source software. The version used in this work was provided by my supervisor (Alberto Ros), who had implemented OnCommit and OnExecute, along with one policy that he has come up with but not yet published (OnNonBSpeculative). He has also created support for the changes to the traces in Sniper that were made during this thesis (see 3.1). Support for the policies proposed in this work was implemented based on the received source code. GEMS was modified to take a path to a folder containing traces (like in 3.4), an array of configuration settings (see 3.2 and 2.1), and a path for writing the output stats-file. A stats-file consists of around two thousand lines and is divided into two parts: a configuration part, where you find the information given to GEMS in the configuration array, and a stats part with many measurements, such as the number of cycles and the number of accesses to the caches.


3 Setting Up The Testbed

3.1 Changes to Sniper

In this chapter, there is more focus on the traces, the interface between the Sniper and GEMS simulators. Traces are simply a list of the CPU instructions that will be simulated, a subject covered in more detail in section 3.4. Sniper produces the trace, which in turn is used by GEMS. Some minor updates have been made to the parts of Sniper that print the trace:

• Writing the start PC (program counter) at first line in the trace.

• Writing the length (PC difference) of the previous instruction together with every instruction.

• Writing the target PC of a branch together with a "*" if the branch is taken.

• Implementing a new flag, –insert-clear-stat-by-icount=n, where n is a positive integer, which inserts a line containing just a "C" after n lines of instructions.

3.2 Configuration

A configuration is a list of keys with assigned values that sets out the characteristics of the CPU to be simulated. Some examples are the number and sizes of caches, the sizes of buffers, the number of cores, the frequency, and a lot more. The PROCESSOR STORE PREFETCH key is the one used to set the store prefetch policy. Another key to keep track of is SIMULATION BENCHMARK, which holds the name of the simulated benchmark program so that it can be added to the stats-file (see 2.2.3). Between the different simulations, only the values of the two described keys are changed; the rest remain the same and set out the architecture of the CPU, which is described in the following subsection.
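A rough sketch of how the per-run configuration could be assembled is given below in Python. Only the two key names above come from the text; the dictionary layout, the remaining keys, and the example values are assumptions for illustration, not the actual GEMS configuration format:

# Hypothetical sketch of the configuration array passed to GEMS.
base_config = {
    "PROCESSOR STORE PREFETCH": "OnCommit",  # store prefetch policy under test
    "SIMULATION BENCHMARK": "bzip2 1",       # benchmark name copied into the stats-file
    # The remaining keys (cache sizes, buffer sizes, cores, frequency, ...)
    # stay identical between simulations and define the architecture.
}

def config_for(policy: str, benchmark: str) -> dict:
    """Vary only the two keys that change between simulations."""
    cfg = dict(base_config)
    cfg["PROCESSOR STORE PREFETCH"] = policy
    cfg["SIMULATION BENCHMARK"] = benchmark
    return cfg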


3.3 CPU architecture

The CPU design simulated for this master's thesis (with the numbers introduced in figure 2.1) is listed in the table below; the same configuration is used in similar research [10].

Processor (HSW-class)
  Issue and commit width               4
  Instruction queue (IQ)               60 entries
  Reorder buffer (ROB) size            192 entries
  Load queue (LQ)                      72 entries
  Store queue (SQ), Store buffer (SB)  42 entries
  Lockdown table (LDT)                 32 entries
Memory
  Private L1 cache                     32KB, 8-way, 4 hit cycles
  Private L2 cache                     128KB, 8-way, 12 hit cycles
  Shared L3 cache                      1MB per bank, 8-way, 35 hit cycles
  Memory access time                   160 cycles
Network
  Topology / routing                   2D mesh / Deterministic X-Y
  Data / Control msg size              5 / 1 flits
  Switch-to-switch time                6 cycles

Table 3.1: Simulated system configuration

3.4 Trace: Interface Sniper-GEMS

A CPU cannot read programming languages like C or Java directly. Instead, the CPU has a set of instructions, called the instruction set, that it can compute. The instruction set can differ from one type of CPU to another, but three types of instructions that can be found in some form in any instruction set are: arithmetic operations such as addition, subtraction, multiplication and division; memory instructions such as loads and stores; and branch instructions. Branches tell the CPU whether it should jump to another instruction or not. Consider a for-loop: that loop will be represented by a branch instruction telling the CPU whether it should jump back to the beginning of the loop or continue below it. When compiling a C or Java file, you end up with a binary file that contains the translation from the source code to a particular instruction set. This file is the one you use to run the application in question. A trace file is a human-readable "binary file".

PC stands for program counter, and it keeps track of which instruction in a program is to be executed next. When taking a branch, the CPU changes the PC to the address of the branch's target instruction and continues from there.

The trace files used here cover memory instructions, that is, loads and stores, in more detail. That is why a trace file has a line break after each memory instruction. Instructions that are neither memory instructions nor branches will be referred to as anonymous instructions. Below follow some examples of lines from traces, which will be explained and translated into English.

4030fc

The first line of a trace contains the starting PC.

Figure 3.1: Trace example: Starting PC

The trace starts with the value of the starting PC in hexadecimal format. From this point on, the difference in the PC from one instruction to the next is written in decimal format along with the instruction. Note that the second instruction shows the length (the change to the PC) of the first one, and so on.


0 4 L4 ee7e220 4

Two anonymous instructions, both with length 4, followed by a load of 4 bytes from memory address ee7e220.

Figure 3.2: Trace example: Anonymous instructions and a Load

The 0 and the 4 denote two anonymous instructions (loads, stores and branches are of greater interest). Since the first digit of each entry denotes the size of the previous instruction, it shows that the two anonymous instructions both have the size of four (the "4" after the first space and the "4" after the "L"). The "L" means that the instruction is a load instruction, which loads data from the hexadecimal memory address ee7e220. The number of bytes to be loaded is written after the last space, in this case four.

0d1d3 b4d1t99* S99 7fff8368acd8 8

An anonymous instruction of size 4 that is dependent on the first and the third instruction before it. This instruction is followed by a taken branch to PC+99 with a dependency on the prior instruction. Last is a store of 8 bytes to the address 7fff8368acd8.

Figure 3.3: Trace example: Branch and store

As explained in the previous examples, this line starts with one anonymous instruction of size four. In this case, the first instruction is also dependent on the results of two previous instructions, denoted by the two "d"s before the first space. The digit after each "d" points out how many instructions back the producer is, in this case one and three. The next instruction starts with a "b", which means that it is a branch. The branch is dependent on the instruction before it, the anonymous one discussed at the beginning of this example. The "t" denotes the branch target, given as the difference between the current PC and the target address; here 99 should be added to the PC if the branch is taken. The "*" denotes that the branch is taken. After that there is an "S", namely a store instruction. Here one can once again see that the branch is taken, since the PC difference from the branch instruction is 99 (directly after the "S"), the same as the branch target. A store instruction follows the same pattern as a load; here it stores 8 bytes to the memory address 7fff8368acd8.

0m3 b4t-166* L-166 7fff8368acd8 8

An anonymous instruction of size 4 that is dependent on data from memory retrieved by the third prior instruction. After that there is a taken branch with a negative PC difference (-166) as the target, which leads to a load of 8 bytes from the address 7fff8368acd8.

Figure 3.4: Trace example: Branch to a previous instruction

In this example, there are only two new things to cover. The "m" denotes a memory dependency: data that is loaded by the third previous instruction. The "d"s depend on data to be computed by the instruction in question, while the "m"s depend on data that needs to be loaded into the CPU. The other new thing is that there is a negative PC difference for the target address. This is all the knowledge required to translate the line directly into English: an anonymous instruction of size 4 that is dependent on data from memory, retrieved by the third prior instruction; after that, a branch with a negative PC difference (-166) as the target, which leads to a load of 8 bytes from the address 7fff8368acd8.
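To make the format concrete, the following Python sketch (written for this report, not part of the Sniper or GEMS code; it only recognizes the fields shown in figures 3.1 to 3.4 and makes no claim about the full trace grammar) tokenizes one trace line into its instructions:

import re

# Illustrative parser for the trace lines shown in figures 3.1-3.4.
TOKEN = re.compile(
    r"(?P<kind>[LSb]?)"          # L = load, S = store, b = branch, "" = anonymous
    r"(?P<prev_len>-?\d+)"       # PC difference from the previous instruction
    r"(?P<deps>(?:[dm]\d+)*)"    # d<n>/m<n> = register/memory dependency n back
    r"(?:t(?P<target>-?\d+))?"   # branch target as a PC difference
    r"(?P<taken>\*?)"            # "*" marks a taken branch
)

def parse_line(line: str):
    """Split one trace line into instruction entries plus address/size fields."""
    parts = line.split()
    entries = []
    i = 0
    while i < len(parts):
        m = TOKEN.fullmatch(parts[i])
        if m is None:
            raise ValueError(f"unrecognized token: {parts[i]}")
        entry = m.groupdict()
        if entry["kind"] in ("L", "S"):          # memory ops carry address + size
            entry["address"] = parts[i + 1]      # hexadecimal byte address
            entry["size"] = int(parts[i + 2])    # number of bytes accessed
            i += 3
        else:
            i += 1
        entries.append(entry)
    return entries

print(parse_line("0d1d3 b4d1t99* S99 7fff8368acd8 8"))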


3.5 Metrics for evaluation

During this thesis, a Python script was written that automates running the simulation on every subfolder of a predefined folder (all subfolders have to contain traces from a benchmark program). In the source code, there is a predefined array with the names of the store prefetch policies to run. The script then runs GEMS on every trace with every store prefetch policy and names the stats-file in the following way: [name of benchmark] [name of store prefetch policy].stats

The script also generates the results by reading some values from every stats-file; the values to be read are hard-coded into an array. After collecting these values from every file, the values from all benchmarks for every store prefetch policy are summed. The values for OnCommit are set to 100%, and the others get their percentages relative to that. These percentages are then written into tables (which can be found in the results).

The average results shown in the results chapter 5 are calculated in the following way:

\text{Average} = \frac{1}{\text{NumberOfBenchmarks}} \sum_{\text{benchmarks}} \frac{\text{policy}}{\text{OnCommit}}
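A minimal sketch of that normalization step is shown below (the stats-file parsing is omitted and the dictionary layout is assumed; this is not the actual script from the thesis):

# Hypothetical sketch of the averaging described above.
# values[policy][benchmark] is assumed to hold one raw metric (e.g. cycles).
def average_vs_oncommit(values: dict, policy: str) -> float:
    """Average of (policy / OnCommit) over all benchmarks, as a percentage."""
    benchmarks = values["OnCommit"].keys()
    ratios = [values[policy][b] / values["OnCommit"][b] for b in benchmarks]
    return 100.0 * sum(ratios) / len(ratios)

values = {
    "OnCommit": {"bzip2 1": 1000.0, "gcc 3": 2000.0},
    "OnExecute": {"bzip2 1": 1100.0, "gcc 3": 1900.0},
}
print(f"{average_vs_oncommit(values, 'OnExecute'):.2f}%")  # prints 102.50%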

3.5.1 Energy graphs

Table 3.2 lists the energy consumption of accessing our three caches and of the network flits, i.e., the energy needed for the transmission of one data unit between caches and the pipeline. One flit is used for the transmission of control messages and five for data [10]. To retrieve the energy consumption of the caches, we use the CACTI-P tool [9], which estimates the energy consumption of the different cache structures, assuming a 22 nm technology node. To estimate the dynamic energy consumption of the interconnection network, we assume that it is proportional to the data transferred [5] and that each flit transmitted through the network consumes the same amount of energy as reading one word from an L1 cache each time it crosses a link (link and router energy). These numbers are used by the Python script (section 3.5) to generate the energy graphs in the results chapter 5.

CPU part       Energy [nJ]
Accessing L1   0.013343
Accessing L2   0.0214929
Accessing L3   0.0454353
Network flit   0.0033357

Table 3.2: Energy measurements for different CPU parts.
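A small sketch of how such per-access energies could be combined into a total estimate is given below. The counter names are placeholders of mine; the actual script reads the corresponding counts from the GEMS stats-files:

# Sketch of an energy estimate built from Table 3.2 -- counter names are placeholders.
ENERGY_NJ = {
    "l1_access": 0.013343,
    "l2_access": 0.0214929,
    "l3_access": 0.0454353,
    "network_flit": 0.0033357,
}

def total_energy_nj(counts: dict) -> float:
    """Weight each event count by its per-event energy from Table 3.2."""
    return sum(ENERGY_NJ[event] * n for event, n in counts.items())

counts = {"l1_access": 1_000_000, "l2_access": 50_000, "l3_access": 10_000,
          "network_flit": 5 * 10_000}  # five flits per data message
print(f"{total_energy_nj(counts):.1f} nJ")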


4 Proposed Store Prefetch Policies

This chapter will introduce the new store prefetch policies proposed by this thesis.

Three different sets of techniques aiming to decrease the runtime and energy consumption are presented, one in each section below.

From now on the terms policy and filter are used. A policy is a store prefetch policy, while a filter is a less complex policy that can more easily be added on top of another policy and that prevents some prefetches from occurring at all.

4.1 Techniques to reduce speculation effect

These techniques aim to reduce the speculation effect by not prefetching permission for stores that may not commit or whose data is already in the L1 cache. If the requested permission is still in the L1 cache due to a previous instruction, there is no point in prefetching it again.

4.1.1 OnNonBranchSpeculative

If a store instruction follows a branch, it is better to wait until the outcome of the branch is known, to make sure that all data brought into the L1 cache will be needed.

If the store is not affected by a branch, there is a performance benefit from an early prefetch. The difference between OnNonBSpeculative and OnCommit is that the prefetch, if not affected by a branch, is issued when the store arrives in the ROB. Compared to OnCommit, the time the instruction spends waiting in the ROB is thereby gained.
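A sketch of this decision is shown below, under the assumption (mine, for illustration) that the core can tell whether any unresolved branch precedes the store in the ROB:

# Illustration of the OnNonBSpeculative idea -- not the GEMS implementation.
# older_unresolved_branch is an assumed query on the ROB state.
def on_store_enters_rob(store, rob, l1) -> None:
    """Issue the write-permission prefetch early only for non-speculative stores."""
    if not rob.older_unresolved_branch(store):
        # No branch ahead of this store is still unresolved, so the store will
        # commit and the prefetch cannot be wasted on a mispredicted path.
        l1.request_write_permission(store.address)
    # Otherwise fall back to OnCommit behaviour: prefetch when the store commits.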

4.1.2 Re-Execute

Re-Execute is not a full prefetch policy. It is more a question of how to handle the re-execution of an instruction. One reason for re-execution can be that a load follows a store with an unknown address which is later computed to be the same as the load's; in this case, the load has to be re-executed to get the data that was updated by the previous store. Another reason can be that the load in question has an earlier, not yet resolved load before it: if the load is invalidated or evicted from the cache, then the load (and all subsequent instructions) have to be re-executed.

A comparison can be made between issuing a prefetch on re-execution or not. If the re-execution takes place close in time to the first execution, then the data for that store is likely to be in L1 and/or in another cache; hence prefetching again is not necessary and is just a waste of energy. Re-Execute is considered to be a filter.


4.2 Techniques to filter unnecessary prefetches

Here the memory address of the store is evaluated. The prefetch only occurs if there is a need for it, based on experience from handling stores to this address previously. Another thing to consider is that if the data shares a cache line with data that has already been prefetched, there is no need to prefetch it again, since the entire cache line is always prefetched.

4.2.1 SameCacheLine

When retrieving data from the main memory to a cache, a chunk of adjacent fields is read into the cache. The reason for this is that related data (e.g., the contents of a file) is usually stored next to each other in memory, making this chunk of data (called a cache line) likely to contain the next data that will be requested as well. The entire cache line has the same physical line address. SameCacheLine is, like Re-Execute, a primitive policy that can be called a filter. It has a register that keeps the previous address so that the current one can be checked against it for equality. If they are equal and the line has already been prefetched (given that it should be prefetched according to the policy used in combination with SameCacheLine), it does not need to be prefetched again.
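A minimal sketch of such a filter is given below, assuming a 64-byte line size (the line size is not stated in the thesis) and a single register holding the last line considered:

# Sketch of the SameCacheLine filter -- the 64-byte line size is an assumption.
LINE_SIZE = 64

class SameCacheLine:
    def __init__(self):
        self.last_line = None  # the single register described in the text

    def should_prefetch(self, address: int) -> bool:
        """Suppress the prefetch if it targets the same line as the previous one."""
        line = address // LINE_SIZE
        if line == self.last_line:
            return False          # already handled together with the previous store
        self.last_line = line
        return True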

4.2.2 PCbasedPredictor

The idea of PCbasedPredictor comes from branch prediction, which, given previous observations on whether or not a branch was taken, predicts whether it will be taken again. The prediction is based on information about the target PC (the address of the target instruction) and whether or not branches to this instruction are usually taken. If it is likely that a branch will be taken, the predictor predicts that it is going to be taken this time as well, and the CPU is fed with instructions from the target of the branch. If it is predicted that the branch will not be taken, the CPU gets the instructions that come after the branch.

PCbasedPredictor instead keeps track of the memory addresses of the store instructions. It uses a buffer in which each entry maps to a number of memory addresses; the more entries, the fewer addresses map to each one. The remainder you get when dividing the address (interpreted as a number) by the number of entries is the index representing that address. That means that two adjacent memory addresses map to different entries.

Every entry contains an integer that can take values between 0 and 3 and is preset to 2. When seeing a store to a memory address that maps to an entry with a number greater than one, the data will be prefetched. If a store instruction hits in the L1 cache (the data was already there), the number in the corresponding buffer entry is decreased by one (if it is not 0), making it less likely to be prefetched next time. If a store misses in the L1 cache (the data is not yet there), the number in the corresponding buffer entry is increased by one (if it is not 3), making it more likely to be prefetched next time.

To evaluate the impact of the buffer size, different sizes were tested and named PCBased X, where X to the power of two gives the number of entries.
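The per-entry behaviour is that of a 2-bit saturating counter. The following Python sketch illustrates the description above; the indexing by remainder and the update rules follow the text, while the class interface itself is invented here:

# Sketch of the PCbasedPredictor described above -- not the GEMS source.
class PCbasedPredictor:
    def __init__(self, entries: int = 16):    # e.g. PCBased 4 has 16 entries
        self.counters = [2] * entries          # counters between 0 and 3, preset to 2

    def _index(self, address: int) -> int:
        return address % len(self.counters)    # remainder of the address selects the entry

    def should_prefetch(self, address: int) -> bool:
        return self.counters[self._index(address)] > 1

    def train(self, address: int, l1_hit: bool) -> None:
        """Stores that hit in L1 make a prefetch less likely; misses make it more likely."""
        i = self._index(address)
        if l1_hit and self.counters[i] > 0:
            self.counters[i] -= 1
        elif not l1_hit and self.counters[i] < 3:
            self.counters[i] += 1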


4.3 Techniques to adapt to timeliness

Techniques for timeliness focus not on whether, but on when to prefetch. An early, not yet used, prefetch might be evicted by another prefetch due to space issues. To avoid this, the prefetch can be delayed so that the data is used before it would have been evicted.

4.3.1 PCbasedTimelinessPredictor

PCbasedTimelinessPredictor uses two buffers: one is the buffer used in PCbasedPredictor, and the other keeps track of when to issue the prefetch; this is called the Timeliness buffer. The Timeliness buffer is implemented in the same way (with an integer between 0 and 3 and the same mapping from the data address of the store) as the one in PCbasedPredictor. The Timeliness buffer stores information on what type of miss occurred. A late miss means that the write permission is not granted in L1 when it is needed; in this case, the number is increased by one (if it is not 3). An early miss means that the permission was granted early to the L1 cache and has been evicted by a data block that was prefetched later to the same location in the L1 cache.

In the same way that an entry in one of the buffers is statically mapped to more than one memory address, a cache has a limited number of slots for every address range: the number of ways. If all the places are taken, it might (depending on the replacement policy in place) replace old but not yet used data. An early miss decreases the corresponding value in the Timeliness buffer (if it is not 0). This policy is the only one that not only tries to predict whether data should be prefetched, but also when.

When deciding if and when to prefetch, the first buffer is used in the same way as in PCbasedPredictor to predict whether a prefetch should be done at all. If there will be a prefetch, the value in the Timeliness buffer is checked to decide when. A value below Z means that it will be prefetched later, as in OnCommit. A value above or equal to Z causes an earlier prefetch, as in OnNonBranchSpeculative. Different values of Z (2 and 3) are implemented together with different sizes of the two buffers (the two buffers always have the same size). The versions are named PCbasedTimelinessPredictorZ X, where X denotes the size of the buffers in the same way as for PCbasedPredictor.
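Combining the two buffers, a sketch of the decision could look as follows (illustrative only; the threshold Z and the buffer update rules follow the text, while the class interface and the "early"/"late"/None labels are invented here, and the assumption that timeliness is only trained on misses is mine):

# Sketch of PCbasedTimelinessPredictor -- not the GEMS source.
class PCbasedTimelinessPredictor:
    def __init__(self, entries: int = 16, z: int = 2):
        self.predict = [2] * entries     # first buffer: prefetch or not (as in PCbasedPredictor)
        self.timeliness = [2] * entries  # second buffer: when to prefetch
        self.z = z                       # threshold Z (2 or 3 in the thesis)
        self.entries = entries

    def _index(self, address: int) -> int:
        return address % self.entries

    def decide(self, address: int):
        """None = no prefetch, 'late' = as OnCommit, 'early' = as OnNonBranchSpeculative."""
        i = self._index(address)
        if self.predict[i] <= 1:
            return None
        return "early" if self.timeliness[i] >= self.z else "late"

    def train(self, address: int, l1_hit: bool, late_miss: bool = False) -> None:
        """Hits/misses train the first buffer; early/late misses train the second."""
        i = self._index(address)
        if l1_hit:
            self.predict[i] = max(0, self.predict[i] - 1)
        else:
            self.predict[i] = min(3, self.predict[i] + 1)
            if late_miss:   # permission arrived too late: push towards earlier prefetches
                self.timeliness[i] = min(3, self.timeliness[i] + 1)
            else:           # early miss: the block was evicted before use, prefetch later
                self.timeliness[i] = max(0, self.timeliness[i] - 1)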

4.4 General remarks

The new policies and filters can be built on top of each other to gain better performance.

In the results (chapter 5) you can see how filters and policies have been used on top of each other.


5 Results

This chapter shows the results of the different filters and store prefetch policies concerning execution time, number of L1 accesses, number of store prefetches, and energy consumption, which are referred to as metrics. The chapter is divided into four sections, one for the state-of-the-art policies (see 2.1.2) and one for each set of techniques introduced in chapter 4. In each of these sections, there is one subsection for each metric. Every subsection presents the average values for each compared policy combination in a table. The table is followed by a graph with the values of all benchmarks for all compared policy combinations. The values are given in percentages normalized to OnCommit. Finally, the five benchmarks that benefit the most and the least from a certain policy combination, compared with OnCommit, are presented.

5.1 State-of-the-art policies

First, the state-of-the-art store prefetch policies will be analyzed:

• NoPrefetch

• OnExecute

• OnCommit

5.1.1 Execution time

The average execution time is measured in cycles.

                 NoPrefetch   OnExecute   OnCommit
Execution time      146.50%     102.35%    100.00%

Table 5.1: Average execution time

This table shows the impact of the prefetch policies on average runtime. Using any type of prefetch policy (OnCommit or OnExecute) roughly cuts the runtime by a third. This result shows that bringing data into the L1 cache for store instructions is a bottleneck. Still, one cannot simply prefetch data for every store that may be executed. Comparing OnExecute and OnCommit, one can see that OnExecute is 2.35 percent units slower. Since OnExecute issues its prefetches earlier, it prefetches more data that later turns out not to be used. Unnecessary prefetches slow down the execution and should be minimized.


Figure 5.1: Execution time for all benchmarks (in percent weighted against OnCommit)

The numbers are based on a comparison between OnCommit and OnExecute.

OnCommit better:
  benchmark    difference
  gcc 3        22.06
  gobmk 4      21.48
  gobmk 3      21.41
  gcc 5        17.81
  gobmk 1      17.06

OnExecute better:
  benchmark    difference
  omnetpp      -24.32
  cactusADM    -14.95
  soplex       -9.43
  lbm          -9.28
  milc         -6.85

The differences between the lowest and highest values are given in percent units.

Table 5.2: The benchmarks that are most affected in execution time (in percentage)

The table above presents the difference in percent units in the number of cycles for a particular benchmark when comparing OnExecute and OnCommit. Gcc 3 and Gobmk 4 gain over 20% in reduced number of cycles when running OnCommit (being less aggressive with the prefetch). On the other hand, OnCommit increases the number of cycles for Omnetpp and CactusADM by 24 and 15 percent units, respectively. The benchmarks that go well with OnCommit are the ones that do not have much need for prefetching; these also give the best results for NoPrefetch. The ones that are very sensitive to prefetching go well with OnExecute.

5.1.2 L1 accesses

The average number of accesses to cache L1 is measured here.

              NoPrefetch   OnExecute   OnCommit
L1 accesses       84.82%     121.86%    100.00%

Table 5.3: Average number of L1 accesses

This is an interesting table, which shows the amount of unnecessary work caused by prefetching. NoPrefetch does, as the name suggests, no prefetches, which means that all its L1 accesses are caused by store instructions that write to the cache. Everything above 84.82% is additional work caused by prefetching and wastes energy. OnExecute does, as expected, more accesses than OnCommit, since it prefetches earlier and is more speculative.


Figure 5.2: Number of L1 accesses for all benchmarks (in percent weighted against OnCommit)

The numbers are based on a comparison between OnCommit and OnExecute.

OnCommit better:
  benchmark    difference
  bzip2 2      96.64
  bzip2 1      95.13
  gcc 4        48.71
  dealII       43.51
  bzip2 3      36.02

OnExecute better:
  benchmark    difference
  soplex 1     -0.22
  bwaves       0.00
  calculix     0.22
  milc         0.24
  leslie3d     0.25

The differences between the lowest and highest values are given in percent units.

Table 5.4: The benchmarks that are most affected in number of L1 accesses (in percentage)

The table shows that OnCommit decreases the L1 accesses for all but two of the benchmarks, and the difference can be over 90 percent units (Bzip2 1 and Bzip2 2). This behavior shows once again that an early prefetch triggers unneeded accesses to the L1 cache. It is interesting that Soplex 1 has more accesses for OnCommit than for OnExecute; this might be something to investigate further.

5.1.3 Store prefetches

The average number of store prefetches is measured here.

                   NoPrefetch   OnExecute   OnCommit
Store prefetches        0.00%     399.72%    100.00%

Table 5.5: Average number of store prefetches

The number of store prefetches confirms the behaviors and conclusions drawn from the tables of execution times and L1 accesses. NoPrefetch does no store prefetches, while OnExecute does the most, 299.72 percent units more than OnCommit. This is also what to expect, since OnExecute is more speculative than OnCommit. Another thing to notice here is that sometimes, after commit, the store instruction is already at the head of the Store Buffer, and it therefore issues the write instead of the prefetch. The effect is then equivalent to NoPrefetch.


Figure 5.3: Number of store prefetches for all benchmarks (in percent weighted against OnCommit)

The numbers are based on a comparison between OnCommit and OnExecute.

OnCommit better:
  benchmark    difference
  bzip2 1      4731.44
  bzip2 2      4011.72
  bzip2 5      876.07
  libquantum   513.31
  bzip2 4      418.57

OnExecute better:
  benchmark    difference
  bwaves       0.00
  leslie3d     0.58
  gcc 5        3.82
  gcc 2        5.06
  gcc 3        5.89

The differences between the lowest and highest values are given in percent units.

Table 5.6: The benchmarks that are most affected in number of store prefetches (in percentage)

OnExecute issues more store prefetches than OnCommit for all the benchmarks except Bwaves, which has the same number of prefetches for OnCommit and OnExecute.

5.1.4 Energy consumption

The average energy consumption is here measured in nJ .

         NoPrefetch   OnExecute   OnCommit
Energy       90.95%     133.13%    100.00%

Table 5.7: Average energy consumption

This table shows the energy consumption of each prefetch technique. First of all, NoPrefetch consumes the smallest amount of energy, while OnExecute consumes the most. The extra consumption comes from unnecessary prefetches. It is also worth noting that OnExecute differs from OnCommit by only 2.35 percent units in the number of cycles, but by 33.13 percent units in energy consumption. Using OnCommit instead of NoPrefetch burns 9.05 percent units more energy, but gives a speedup of 46.5 percent units. A prefetch policy can pay off despite the increased energy consumption, but pushing the prefetching too far makes the loss in terms of increased energy consumption huge.


Figure 5.4: Energy consumption for all benchmarks (in percent weighted against OnCommit)

The numbers are based on a comparison between OnCommit and OnExecute.

OnCommit better:
  benchmark    difference
  bzip2 2      109.57
  bzip2 1      96.69
  gobmk 1      67.72
  bzip2 3      61.39
  gcc 7        60.95

OnExecute better:
  benchmark    difference
  milc         -1.45
  soplex 1     -0.46
  libquantum   -0.08
  bwaves       0.00
  calculix     0.22

The differences between the lowest and highest values are given in percent units.

Table 5.8: The benchmarks that are most affected in energy consumption (in percentage)

Using OnCommit rather than OnExecute saves much energy for many benchmarks. Finding Bzip2 in the lead when it comes to saving energy is no surprise, since it is in the lead when it comes to decreasing prefetches and L1 accesses. This correlation also explains why Milc increases a bit in energy consumption when using OnCommit instead of OnExecute: a longer execution time also increases the energy consumption.

5.2 Techniques to reduce speculation effect

Second, all the state-of-the-art prefetch policies and the following three new ones will be analyzed:

• OnNonBSpec (OnNonBSpeculative)

• OnExecute with Re-Execute

• OnNonBSpec with Re-Execute

5.2.1 Execution time

The average execution time is measured in cycles.

                 NoPrefetch   OnExecute                OnCommit
Execution time      146.50%     102.35%                 100.00%

                 OnNonBSpec   OnExecute + Re-Execute   OnNonBSpec + Re-Execute
Execution time       96.66%     100.57%                  96.58%

Table 5.9: Average execution time

OnNonBSpec seems to be the fastest one, 3.34 percent units faster than OnCommit. It makes a difference not to prefetch stores that are affected by a branch. The Re-Execute filter affects OnExecute by 1.22 percent units, but OnNonBSpec by only 0.08 percent units.


Figure 5.5: Execution time for all benchmarks (in percent weighted against OnCommit)

The numbers are based on a comparison between OnCommit and OnNonBSpec with Re-Execute.

OnCommit better:
    benchmark      difference
    gcc 5          1.91
    gcc 1          1.13
    tonto          0.71
    gamess 1       0.62
    gcc            0.62

OnNonBSpec with Re-Execute better:
    benchmark      difference
    cactusADM      -14.97
    omnetpp        -11.05
    soplex         -8.57
    gcc 2          -8.33
    lbm            -8.25

The differences between the lowest and highest values are given in percent units.

Table 5.10: The benchmarks whose execution time is affected the most (in percent units)

Most of the benchmarks benefit from using OnNonBSpec with Re-Execute, led by CactusADM (14.97% units) and Omnetpp (11.05% units). The benchmarks that lose from using OnNonBSpec with Re-Execute do so by less than two percent units.

5.2.2 L1 accesses

The average number of accesses to cache L1 is measured here.

NoPrefetch OnExecute OnCommit

L1 accesses 84.82% 121.86% 100.00%

OnNonBSpec OnExecute + Re-Execute OnNonBSpec + Re-Execute

L1 accesses 113.82% 118.26% 112.87%

Table 5.11: Average number of L1 accesses

OnNonBSpec does 8.04% units fewer L1 accesses than OnExecute. Re-Execute decreases the L1 accesses by 3.60% units on OnExecute and by 0.95% units on OnNonBSpec. The difference can be explained by OnNonBSpec already removing some of the prefetches that Re-Execute would also remove. The totals show that OnNonBSpec with Re-Execute has the lowest number of L1 accesses of the new policies (112.87%), though still above OnCommit (100.00%) and NoPrefetch (84.82%).


Figure 5.6: Number of L1 accesses for all benchmarks (in percent weighted against OnCommit)

The numbers are based on a comparison between OnCommit and OnNonBSpec with Re-Execute.

OnCommit better:
    benchmark      difference
    bzip2 1        94.21
    bzip2 2        92.52
    gcc 4          47.53
    hmmer          33.71
    hmmer 1        32.80

OnNonBSpec with Re-Execute better:
    benchmark      difference
    soplex 1       -0.71
    bwaves         0.00
    gcc 2          0.07
    libquantum     0.16
    leslie3d       0.18

The differences between the lowest and highest values are given in percent units.

Table 5.12: The benchmarks whose number of L1 accesses is affected the most (in percent units)


In this table OnNonBSpec with Re-Execute has more L1 accesses than OnCommit for all benchmarks except Soplex 1 (Bwaves is unchanged). Bzip2 1 and Bzip2 2 have over ninety percent more accesses, which makes them the two benchmarks with the highest increase.

5.2.3 Store prefetches

The average number of store prefetches is measured here.

NoPrefetch OnExecute OnCommit

Store prefetch 0.00% 399.72% 100.00%

OnNonBSpec OnExecute + Re-Execute OnNonBSpec + Re-Execute

Store prefetch 342.12% 384.08% 336.54%

Table 5.13: Average number of store prefetches

OnNonBSpec decreases the number of store prefetches by 57.60% units compared with OnExecute. Re-Execute also decreases the number of prefetches, by 15.64% units on OnExecute and by 5.58% units on OnNonBSpec. The difference can be explained by OnNonBSpec already removing some of the prefetches that Re-Execute would also remove. The totals show that OnNonBSpec with Re-Execute has the lowest number of store prefetches, apart from OnCommit (100%) and NoPrefetch (0.00%), at 336.54% against 384.08% for OnExecute with Re-Execute.


Figure 5.7: Number of store prefetches for all benchmarks (in percent weighted against OnCommit)

The numbers are based on a comparison between OnCommit and OnNonBSpec with Re-Execute.

OnCommit better:
    benchmark      difference
    bzip2 1        4687.26
    bzip2 2        3837.14
    bzip2 5        564.19
    sphinx3        373.41
    h264ref        262.78

OnNonBSpec with Re-Execute better:
    benchmark      difference
    bwaves         0.00
    leslie3d       0.51
    gcc 5          0.84
    gcc 3          1.06
    gcc 2          1.23

The differences between the lowest and highest values are given in percent units.

Table 5.14: The benchmarks whose number of store prefetches is affected the most (in percent units)


OnNonBSpec with Re-Execute issues more prefetches than OnCommit for all benchmarks except Bwaves, which stays the same. In first place is Bzip2 1, which increases the number of prefetches by 4687% units.

5.2.4 Energy consumption

The average energy consumption is measured here in nJ.

NoPrefetch OnExecute OnCommit

Energy 90.95% 133.13% 100.00%

OnNonBSpec OnExecute + Re-Execute OnNonBSpec + Re-Execute

Energy 110.24% 129.81% 109.42%

Table 5.15: Average Energy consumption

What to notice here is that OnNonBSpec burns 22.89% units less energy than OnExecute. Re-Execute decreases the energy consumption by 3.32% units on OnExecute and by 0.82% units on OnNonBSpec. OnNonBSpec with Re-Execute ends up at 109.42%, which is 9.42% units more than OnCommit.


Figure 5.8: Energy consumption for all benchmarks (in percent weighted against OnCommit)

The numbers are based on a comparison between OnCommit and OnNonBSpec with Re-Execute.

OnCommit better:
    benchmark      difference
    bzip2 1        91.17
    bzip2 2        86.41
    gcc 4          45.28
    hmmer 1        30.34
    hmmer          29.72

OnNonBSpec with Re-Execute better:
    benchmark      difference
    gcc 2          -4.37
    milc           -1.45
    soplex 1       -1.40
    soplex         -0.84
    libquantum     -0.70

The differences between the lowest and highest values are given in percent units.

Table 5.16: The benchmarks whose energy consumption is affected the most (in percent units)


OnNonBSpec with Re-Execute decreases the energy consumption for some benchmarks by only a few percent units; the best case is Gcc 2 with 4.37% units less energy consumption. For other benchmarks, like Bzip2 1 and Bzip2 2, the consumption instead increases, in this case by around 90% units. That these two benchmarks consume the most extra energy is no surprise, since they also show the largest increases in L1 accesses and store prefetches.

5.2.5 Conclusion

OnNonBSpec with Re-Execute is the fastest store prefetch policy with a speed-up of 3.42% units compared to OnCommit, but it burns 9.42% units more energy. The question to ask is whether an energy increase of 2.75% units is worth a speed-up of 1% unit. The answer is likely to differ with the type of machine and application. OnNonBSpec with Re-Execute is kept, together with the three state-of-the-art policies, as the reference for the following sections; the new filters and policies to be tested are all put on top of it. All the policies (except NoPrefetch) still consume more energy than OnCommit, and the techniques introduced next aim to reduce this consumption.
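In other words, the 2.75 figure is simply the ratio between the extra energy and the speed-up:

\[
\frac{109.42\,\% - 100\,\%}{100\,\% - 96.58\,\%} = \frac{9.42}{3.42} \approx 2.75
\]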

5.3 Techniques to filter unnecessary prefetches

In this section the sameCacheLine filter and the PCbasedPredictor (PCbased) are introduced. The digit after PCbased denotes the buffer size: the number of entries is two to the power of that digit. Tests have been conducted with PCbased 2, 4, 6, 8, 10 and 16, but it turns out that 2 behaves in the same way as 4, and 8 in the same way as the remaining sizes, in all four metrics (execution time, L1 accesses, number of prefetches and energy consumption).

Therefore, to reduce the number of configurations in the tables of this section, only buffer sizes 4 and 8 are used. The new configurations, all based upon OnNonBSpec with Re-Execute, are the following (a sketch of the PC-based lookup is given after the list):

• OnNonBSpec with sameCacheLine

• PCbasedPredictor 4

• PCbasedPredictor 4 with sameCacheLine

• PCbasedPredictor 8

• PCbasedPredictor 8 with sameCacheLine
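As a rough illustration of the lookup cost discussed below, here is a minimal sketch of a PC-indexed filter buffer with 2^k entries. The decision logic (a hit on the store's PC suppresses the prefetch) is only an assumption for illustration; the actual predictor is the one described earlier in this thesis, not this code.

    // Hypothetical sketch of a PC-based filter with 2^k entries (not the
    // predictor implementation evaluated in this thesis).
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    class PCBasedFilter {
        std::vector<uint64_t> pcs;   // 2^k recorded store PCs, 0 = empty slot
        size_t next = 0;             // FIFO replacement position
    public:
        explicit PCBasedFilter(unsigned k) : pcs(1u << k, 0) {}

        // Linear scan over all entries: this is the iteration cost that grows
        // with the buffer size.
        bool contains(uint64_t pc) const {
            for (uint64_t entry : pcs)
                if (entry == pc) return true;
            return false;
        }

        void record(uint64_t pc) {            // remember a PC, FIFO replacement
            pcs[next] = pc;
            next = (next + 1) % pcs.size();
        }

        // Illustrative use only: drop the prefetch if this store PC is in the buffer.
        bool allowPrefetch(uint64_t pc) const { return !contains(pc); }
    };

    int main() {
        PCBasedFilter pcbased4(4);            // "PCbasedPredictor 4": 2^4 = 16 entries
        uint64_t storePC = 0x400123;
        if (pcbased4.allowPrefetch(storePC))
            pcbased4.record(storePC);         // e.g. remember the PC of a filtered store (assumed policy)
    }

PCbasedPredictor 4 and PCbasedPredictor 8 then correspond to buffers of 16 and 256 entries, respectively.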

5.3.1 Execution time

The average execution time is measured in cycles.

NoPrefetch OnExecute OnCommit

Execution time 146.50% 102.35% 100.00%

OnNonBSpec + Re-Execute OnNonBSpec* PCbased 4

Execution time 96.58% 96.64% 96.96%

PCbased 4* PCbased 8 PCbased 8*

Execution time 96.88% 96.96% 96.88%

* with sameCacheLine

Table 5.17: Average execution time

SameCacheLine has only a marginal effect on execution time: a slowdown of 0.06% units on OnNonBSpec with Re-Execute (the best configuration from Section 5.2) and a speedup of 0.08% units on PCbasedPredictor (4 and 8). One hypothesis for this is that loading the same write permission twice within a short period of time is not really an issue, since the memory subsystem quickly realizes that the requested permission is already in the L1 cache. Blocking the second prefetch of a cache line slightly earlier therefore does not have a big impact, at least not on the execution time.
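A minimal sketch of what such a sameCacheLine filter could look like, assuming it simply remembers the most recently prefetched line and drops a prefetch that targets the same line again (hypothetical names; not necessarily the exact filter implemented here):

    // Hypothetical sketch of a sameCacheLine filter: consecutive stores to the
    // same cache line generate only one prefetch, since the second request would
    // target a line whose write permission is already (being) fetched.
    #include <cstdint>
    #include <cstdio>

    constexpr unsigned kLineBits = 6;                       // 64-byte cache lines assumed

    void issueStorePrefetch(uint64_t lineAddr) {            // stub memory-system call
        std::printf("prefetch line %#llx\n", (unsigned long long)lineAddr);
    }

    class SameCacheLineFilter {
        uint64_t lastLine = ~0ull;                          // last line a prefetch was sent for
    public:
        void maybePrefetch(uint64_t storeAddr) {
            uint64_t line = storeAddr >> kLineBits;
            if (line == lastLine)                           // second prefetch to the same line:
                return;                                     // block it early, it adds nothing
            lastLine = line;
            issueStorePrefetch(line);
        }
    };

    int main() {
        SameCacheLineFilter f;
        f.maybePrefetch(0x1000);   // prefetch issued
        f.maybePrefetch(0x1008);   // same 64-byte line: filtered out
        f.maybePrefetch(0x2000);   // new line: prefetch issued
    }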

PCbased 4 and 8 have the same numbers, 96.96% without sameCacheLine and 96.88% with it. These results can be compared to OnNonBSpec with sameCacheLine at 96.64%. Iterating over a buffer, as PCbased does, takes time, so the prefetches it filters out have to pay off even more at runtime; otherwise the result is a slowdown, as seen here.

Figure 5.9: Execution time for all benchmarks (in percent weighted against OnCommit)


The numbers are based on a comparison between OnCommit and OnNonBSpec with SameCacheLine.

OnCommit better:
    benchmark      difference
    gcc 5          6.58
    gcc 1          1.19
    xalancbmk      0.57
    h264ref 2      0.52
    gamess 1       0.51

OnNonBSpec with SameCacheLine better:
    benchmark      difference
    omnetpp        -16.10
    lbm            -8.43
    soplex         -8.14
    milc           -7.70
    gcc 2          -6.67

The differences between the lowest and highest values are given in percent units.

Table 5.18: The benchmarks whose execution time is affected the most (in percent units)

Here OnCommit is compared with OnNonBSpec with Re-Execute and SameCacheLine, since it is the best of the new configurations in this section with respect to execution time. Some benchmarks benefit from OnNonBSpec with Re-Execute and SameCacheLine while others do not, but the absolute value for the benchmark that benefits the most is higher than for the one that loses the most (|−16.10| > |6.58|).

5.3.2 L1 accesses

The average number of accesses to cache L1 is measured here.

NoPrefetch OnExecute OnCommit

L1 accesses 84.82% 121.86% 100.00%

OnNonBSpec + Re-Execute OnNonBSpec* PCbased 4

L1 accesses 112.87% 111.72% 105.76%

PCbased 4* PCbased 8 PCbased 8*

L1 accesses 105.03% 105.76% 105.03%

* with sameCacheLine

Table 5.19: Average number of L1 accesses

PCbasedPredictor 4 and 8 have the same number of L1 accesses (105.76%). SameCacheLine improves OnNonBSpec by 1.15% units and PCbasedPredictor (4 and 8) by 0.73% units. PCbasedPredictor (4 and 8) with SameCacheLine comes closest to OnCommit, at 105.03%.


Figure 5.10: Number of L1 accesses for all benchmarks (in percent weighted against OnCommit)

The numbers are based on a comparison between OnCommit and OnNonBSpec with SameCacheLine.

OnCommit better:
    benchmark      difference
    hmmer          26.20
    gamess 1       22.99
    perlbench 2    22.02
    hmmer 1        20.36
    GemsFDTD       20.33

OnNonBSpec with SameCacheLine better:
    benchmark      difference
    milc           -3.52
    gcc 2          -0.17
    bwaves         0.00
    soplex 1       0.25
    gcc 3          0.28

The differences between the lowest and highest values are given in percent units.

Table 5.20: The benchmarks whose number of L1 accesses is affected the most (in percent units)


This table shows that OnNonBSpec with Re-Execute and SameCacheLine decreases the L1 accesses for only two benchmarks, Milc (-3.52% units) and Gcc 2 (-0.17% units). Hmmer is the benchmark with the largest increase in L1 accesses, 26.20% units.

5.3.3 Store prefetches

The average number of store prefetches is measured here.

NoPrefetch OnExecute OnCommit

Store prefetch 0.00% 399.72% 100.00%

OnNonBSpec + Re-Execute OnNonBSpec* PCbased 4

Store prefetch 336.54% 330.52% 155.16%

PCbased 4* PCbased 8 PCbased 8*

Store prefetch 151.95% 155.16% 151.95%

* with SameCacheLine

Table 5.21: Average number of store prefetches

The store prefetches show roughly the same pattern as the L1 accesses (see Table 5.19). PCbased (4 and 8) gives the same percentage without SameCacheLine, 155.16%, and with it, 151.95%, a difference of 3.21% units. SameCacheLine also improves OnNonBSpec by 6.02% units. PCbased (4 and 8) with SameCacheLine is once again closest to OnCommit, at 151.95%.


Figure 5.11: Number of store prefetches for all benchmarks (in percent weighted against OnCommit)

The numbers are based on a comparison between OnCommit and OnNonBSpec with SameCacheLine.

OnCommit better:
    benchmark      difference
    libquantum     247.48
    bzip2 5        240.84
    h264ref        217.65
    sphinx3        208.58
    gamess 1       195.76

OnNonBSpec with SameCacheLine better:
    benchmark      difference
    bwaves         0.00
    gcc 5          0.41
    leslie3d       0.50
    gcc 3          0.52
    gcc 2          0.61

The differences between the lowest and highest values are given in percent units.

Table 5.22: The benchmarks whose number of store prefetches is affected the most (in percent units)
