
UPTEC IT 20038

Degree project, 30 credits, November 3, 2020

Implementing the Load Slice Core on a RISC-V based microarchitecture

Axel Dalbom, Tim Svensson

Institutionen för informationsteknologi

Institutionen för informationsteknologi
Enheten för studentservice

Visiting address:
ITC, Polacksbacken
Lägerhyddsvägen 2

Postal address:
Box 337
751 05 Uppsala

Phone:
018-471 30 03

Website:
http://www.it.uu.se

Abstract

Implementing the Load Slice Core on a RISC-V based microarchitecture

Axel Dalbom Tim Svensson

As cores have become better at exposing Instruction-Level Parallelism (ILP), they have become bigger, more complex, and more power-hungry. These cores are quickly approaching the power and memory walls. A new microarchitecture proposed by Carlson et al. claims to solve these problems: they claim that the new microarchitecture, the Load Slice Core, is able to outperform both In-Order and Out-of-Order designs in an area- and power-restricted environment.

Based on Carlson et al.'s work, we have implemented and evaluated a prototype version of their Load Slice Core (LSC) using the In-Order Core Ariane. We evaluated the LSC by comparing it to an In-Order Core (IOC) when running a microbenchmark designed by us, and when running a set of Application Benchmarks.

The results from the Microbenchmark are promising: the LSC outperformed the comparable IOC in every test, although problems related to the configuration of the design were found. The results from the Application Benchmarks are inconclusive; due to time constraints, only a partially functioning LSC could be compared to a comparable IOC.

From these results we found that the LSC performed comparably to, or slightly worse than, its IOC counterpart.

More research on the subject is required before any conclusive statement on the microarchitecture can be made, but it is the opinion of this paper's authors that it does show promise.

Supervisor: David Black-Schaffer
Subject reader: Yuan Yao
Examiner: Lars-Åke Nordén
UPTEC IT 20038

Printed by: Reprocentralen ITC


Contribution

Both authors have contributed equally to this project, although they focused on different areas. Tim Svensson focused on the Register Renaming, the Iterative Backwards Dependency Analysis, and the Microbenchmarks, while Axel Dalbom focused on the Instruction Queues and the logic around them, as well as running the Application Benchmarks and analysing the results.


Contents

1 Introduction

2 Background
  2.1 In-order and Out-of-order cores
  2.2 The Load Slice Core Microarchitecture
    2.2.1 Memory Hierarchy Parallelism
    2.2.2 Address Generating Instruction
    2.2.3 Iterative Backwards Dependency Analysis
    2.2.4 Bypass Instruction Queue
    2.2.5 Register Dependency Table
    2.2.6 Instruction Slice Table
    2.2.7 Register Renaming
    2.2.8 Example execution
    2.2.9 Extending an In-Order Core into a Load Slice Core
  2.3 Ariane
    2.3.1 RISC-V
    2.3.2 Architecture

3 Implementation
  3.1 Iterative Backwards Dependency Analysis
    3.1.1 Register Rename
    3.1.2 Instruction Slice Table
    3.1.3 Register Dependency Table
  3.2 Bypass Instruction Queue
  3.3 Issuing from Instruction queues
    3.3.1 Read and write dependencies
    3.3.2 Branch instructions

4 Experimental Setup
  4.1 Microbenchmarks
  4.2 Application Benchmarks

5 Results and Discussion
  5.1 Microbenchmarks
    5.1.1 Load Slice Core vs. In-Order Core
    5.1.2 Effect of Scoreboard Size
  5.2 Application Benchmarks
    5.2.1 Speculative instructions
    5.2.2 Register renaming
  5.3 Out-of-order commit

6 Conclusion

7 Future Work
  7.1 A full Register Renaming
  7.2 Improved IST
  7.3 Measure the power and the area used

8 References

A ubmk-unroll.S

B util.S


1 Introduction

In this thesis, we present our implementation and evaluation of the Load Slice Core Microarchitecture on Ariane, a RISC-V In-Order Core.

The Load Slice Core Microarchitecture is a new type of microarchitecture proposed by Carlson et al. in their paper The Load Slice Core Microarchitecture. It is a restricted out-of-order core that uses hardware to find program slices that can be prioritized for execution. This allows the LSC to be faster than a comparable IOC while also being more power- and size-efficient than a comparable OoOC. Most notable is that, in a power- and area-constrained many-core design, the LSC outperforms both In-Order and Out-of-Order designs. [1]

In this thesis, the LSC was implemented using the Ariane core. Ariane is an open-source In-Order Core that uses the RISC-V ISA. The core is modeled in a Hardware Description Language and can be simulated in software, enabling developers to modify and test it without having to create any actual hardware.

Due to time constraints and tool licensing issues, the implemented LSC was only evaluated with respect to the achieved performance. Performance was measured using both a microbenchmark and a set of application benchmarks, and the achieved performance was compared between the LSC and an IOC.

From the microbenchmarks, we found that the LSC indeed outperforms an IOC but that the performance gain is dependent on choosing a correct hardware configuration.

Due to time constraints, the findings from the application benchmarks are not conclusive, but they do show that the LSC does not significantly worsen performance.

2 Background

2.1 In-order and Out-of-order cores

An in-order core executes instructions strictly in program order and stalls the pipeline whenever an instruction cannot be executed. This can, for example, be because an instruction needs data from a previous instruction that is still being fetched from memory, or because the functional unit an instruction needs is currently busy with another instruction. This can lead to some parts of the processor not being utilized at all while other parts are so busy that the pipeline stalls.

Fetching data from memory is a situation where this is particularly bad. If a load instruction misses in the cache and needs to fetch the data from memory, and the next instruction needs the data produced by the load, the pipeline will stall until the memory fetch is complete. Another example: if there is a long sequence of instructions that all use the same functional unit, the pipeline will stall even though other functional units could be executing instructions but are instead doing nothing.

An out-of-order core can reduce the impact these problems have on performance by looking for independent instructions that can utilize the idle resources without affecting the perceived order of the instructions. In both of the examples above, an out-of-order core can schedule independent instructions to execute ahead of program order and use idle resources instead of stalling like an in-order core would.

Typically an out-of-order core executes the instructions out of order and then re-orders them before committing, so that the program appears to execute in order.

2.2 The Load Slice Core Microarchitecture

In the paper The Load Slice Core Microarchitecture [1], Carlson et al. describe a new type of processor called a Load Slice Core. According to the paper, the Load Slice Core (LSC) achieves 53% more MIPS than an In-Order Core (IOC) while also being more energy efficient than both an IOC and an Out-of-Order (OoO) core. The premise is to identify memory-dependent instruction slices and execute them out of order. The core receives the benefits of OoO execution on memory accesses while keeping the hardware complexity relatively simple.

This is achieved by replacing the single instruction queue in an IOC with two in-order instruction queues: a main queue and a bypass queue. The memory-dependent instruction slices are put in the bypass queue while the rest of the instructions are put in the main queue. This way, memory-accessing instructions are still executed in order with regard to other memory instructions, but can be executed earlier than regular instructions to minimize the waiting time for the slow memory instructions. The fundamental parts and concepts of the LSC architecture are explained below.

2.2.1 Memory Hierarchy Parallelism

An important concept introduced in the paper is Memory Hierarchy Parallelism (MHP), defined as the number of overlapping memory accesses that hit in the memory hierarchy of the processor. It is used as a measure of how efficient the core is at loading and storing data in its caches. A higher MHP means that a core can read and write data more quickly, which in turn allows the core to perform memory-bound operations faster.

2.2.2 Address Generating Instruction

An Address-Generating Instruction (AGI) is an instruction that produces data that a future load or store instruction needs for address calculation. Identifying these instructions is key to finding memory-dependent instruction slices.

2.2.3 Iterative Backwards Dependency Analysis

The Load Slice Core uses a technique called Iterative Backward Dependency Analysis (IBDA) to find all AGIs in the code. It is implemented in hardware and uses naturally occurring code loops, e.g. for-loops, to find instructions that are to be marked for early execution. Each time an AGI is encountered, its producers are also marked as AGIs. This is performed recursively for each AGI until all have been found. Store and load instructions are always marked as AGIs and can be seen as the roots of the recursive chains.

2.2.4 Bypass Instruction Queue

The Bypass Instruction Queue (B-IQ) is a second instruction queue implemented in parallel with the already existing main queue (A-IQ). As the name suggests, the B-IQ allows certain instructions to bypass the program order, allowing them to potentially be executed out of order. The instructions placed in the B-IQ are either load/store instructions or AGIs. An instruction is recognized as an AGI if it is found in the IST.

2.2.5 Register Dependency Table

The Register Dependency Table (RDT) is a direct-mapped cache that keeps track of the latest instruction that wrote to a specific register and whether that instruction was an AGI. The RDT has one entry per physical register in the core, storing the instruction's PC address and AGI status. The RDT is updated each time a new instruction is put in either the main or the bypass queue. If the instruction is a load or a store, it is marked as an AGI in the RDT and its PC is sent to the IST. If the instruction is a known producer of another AGI, it is also marked as an AGI. An instruction is known to be an AGI if it is found in the IST.

2.2.6 Instruction Slice Table

The Instruction Slice Table (IST) is a cache-like component used by the LSC to keep track of AGIs. It is constructed like a cache except that it does not store any actual data: it stores the PC of each found AGI as well as a valid bit. If a lookup into the IST returns a hit, the instruction is put into the B-IQ instead of the normal instruction queue.
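To make the interplay between the IST and the RDT concrete, the following is a minimal behavioral sketch in Python (our illustration only; the real mechanism is implemented in hardware in the pipeline front end, and the function and variable names below are invented):

# Behavioral sketch of IBDA bookkeeping, not the hardware implementation.
# The IST is modeled as a set of PCs; the RDT as a map from register number
# to (PC of last writer, writer-was-AGI). We also simplify by treating all
# source registers of a store as address producers.

ist = set()   # Instruction Slice Table: PCs of instructions marked as AGIs
rdt = {}      # Register Dependency Table: reg -> (writer_pc, writer_is_agi)

def dispatch(pc, is_mem, srcs, dst):
    """Process one instruction at dispatch; True means it goes to the B-IQ."""
    is_agi = is_mem or pc in ist        # loads/stores are the slice roots
    if is_mem:
        ist.add(pc)                     # per Section 2.2.5, mem PCs enter the IST
    if is_agi:
        # One level of backward marking per encounter: the producers of our
        # source registers enter the IST now, and will be steered to the
        # B-IQ the next time the loop comes around.
        for r in srcs:
            if r in rdt:
                ist.add(rdt[r][0])
    if dst is not None and dst != 0:    # x0 is hardwired to zero, never tracked
        rdt[dst] = (pc, is_agi)
    return is_agi

Feeding the six-instruction loop from Section 2.2.8 through this sketch twice reproduces the behavior described there: instruction 5 is marked during the first iteration and instruction 4 during the second.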

2.2.7 Register Renaming

Register Renaming is used in order to execute instructions speculatively: it allows execution ahead of control-flow instructions, which lets the LSC execute loads speculatively.

Register Renaming is a technique used to increase the Instruction-Level Parallelism in a program. To understand how register renaming achieves this, we must first understand the problem that it solves.

In a program, there exist different kinds of dependencies between its instructions. These dependencies can be classified into three types: data dependencies, name dependencies, and control dependencies. A data dependency (also called a true data dependency) occurs when an instruction x uses the data produced by an instruction y; thus x must occur after y to ensure program correctness. A name dependency occurs either when an instruction x reads a register/memory location that a later instruction y writes (called an antidependency), or when two instructions write to the same register/memory location (called an output dependency). In both cases the ordering of the instructions must be preserved to ensure program correctness. A control dependency determines the ordering of instructions in relation to branch instructions, i.e. code that depends on some condition must not execute before the condition is checked. [2]

Register renaming finds and eliminates the name dependencies between a program's instructions. This is achieved by, as the name suggests, renaming the instructions' source and destination registers in such a way as to expose the true data dependencies. Once the data dependencies have been found, instructions that are not dependent on each other can be executed out of order, as long as their reordering does not violate a control dependency, thus increasing the Instruction-Level Parallelism achieved by the program.

2.2.8 Example execution

The following is an example highlighting the differences in execution between an in-order core, an out-of-order core, and a Load Slice Core. Assume that a memory address is stored in register 30 (x30).

1. lw x1, 0(x30) # long latency load 1

2. addi x10, x30, 4 # gen address load 2

3. addi x2, x1, 15 # consume load 1

4. addi x11, x10, 8 # gen address load 2

5. addi x12, x11, 16 # gen address load 2

6. lw x3, 0(x12) # long latency load 2

The instructions can be divided into two independent slices.


1. Instruction 1 loads a value from memory which instruction 3 uses for some calculation.

2. Instructions 2, 4, and 5 generate a new address that instruction 6 uses to load a value from memory.

In-order core To minimize the time it takes to execute this section of code, ideally the two long-latency loads would be issued as close to each other as possible to overlap the waiting time for the loads.

An in-order stall-on-miss core would execute the first load and then have to wait for the load to finish before issuing the second instruction. A stall-on-use core would be able to execute instruction 2 while waiting for the first load to complete, but would stall on instruction 3, as this instruction uses data produced by the first load.

In both cases, only one load can be issued at a time and there is no overlap between the loads.

Out-of-order core An out-of-order core would look for dependencies, see that there are two independent slices, and order the instructions so that the load instructions are overlapped as much as possible.

The out-of-order core would execute instruction 1 first and while waiting for the load to finish, execute instructions 2, 4, 5 and 6 and then continue waiting for the first load to finish to be able to execute instruction 3. In this case, the waiting times for the two loads will be overlapped which reduces the time spent stalling.

Load Slice Core To see how the Load Slice Core behaves, let’s assume that these six instructions are in a loop that iterates several times.

As shown in Figure 1, the first iteration of the loop would be issued in program order: the load instructions 1 and 6 would go into the B-IQ and the rest would be put into the A-IQ. The IBDA would mark instruction 5 as an Address-Generating Instruction and put the PC of instruction 5 in the Instruction Slice Table.

Instruction 6 cannot be overlapped with instruction 1 as instruction 6 is dependent on instructions 4 and 5 which both are blocked behind instruction 3 in the A-IQ.

Figure 1: First iteration of the loop.

The second iteration (shown in Figure 2) would also be issued in program order, but now instruction 5 would be put into the B-IQ, and the IBDA would mark instruction 4 as an AGI (the producer of instruction 5), putting instruction 4 into the IST as well. As instruction 4 is still blocked behind instruction 3 in the A-IQ, the two long-latency loads cannot be overlapped.

Figure 2: Second iteration of the loop.

From the third iteration (shown in Figure 3) and onward, all Address-Generating Instructions that had been blocked by instruction 3 are now in the B-IQ and can be issued ahead of the A-IQ. As instruction 2 is ahead of instruction 3 in program order, instruction 2 doesn't have to be issued from the B-IQ. If instruction 3 had been ahead of instruction 2, it would have taken another iteration of the loop to find instruction 2 and put it into the B-IQ. The two long-latency loads can now be overlapped to minimize stalling time.

Figure 3: Third iteration of the loop.

2.2.9 Extending an In-Order Core into a Load Slice Core

The Load Slice Core can be implemented by extending an In-Order Core. The LSC pipeline is mostly the same as that of an IOC, as Figure 4 shows. The white structures are the same in both the LSC and the IOC, the striped structures are modified when converting the IOC into the LSC, and the dark structures are present only in the LSC.

The new structures are the Bypass Instruction Queue (B-IQ), the Register Rename, the Register Dependency Table (RDT), and the Instruction Slice Table (IST).

Figure 4: The Load Slice Core pipeline. Highlighted are the structures that are either modified or added to an IOC to implement the LSC.

The B-IQ is a second instruction queue that is placed in parallel with the main instruction queue (A-IQ). Only memory-accessing instructions and AGIs are placed in the B-IQ. This allows the instructions in the B-IQ to be issued when the A-IQ stalls.

The Iterative Backwards Dependency Analysis (IBDA) algorithm is implemented in hardware using the Register Rename, RDT, and IST.

The IST stores all found AGIs and it is used to determine if an instruction should go to the A-IQ or the B-IQ.

The RDT is used to find the producers of memory-accessing instructions and AGIs. When a new producer is found the RDT updates the IST with the new information.

The Register Renaming is used to eliminate naming dependencies in the program. The program slices found using IBDA all have data dependencies between their instructions; if naming dependencies were also present, the algorithm might pick them up as false positives. The Register Renaming also allows the LSC to execute instructions from the B-IQ speculatively.

Finally, memory support structures are enlarged to allow for a larger amount of outstanding misses.

2.3 Ariane

In order to simplify the conversion of an in-order core to an LSC, we chose a baseline core that is open source, so that we could modify it without acquiring a license, and that implements branch prediction, which allows the execution of speculative loads; according to the LSC paper this is very important. This is why we chose the Ariane core.

Ariane is a 64-bit CPU with a 6-stage in-order pipeline that implements the RISC-V ISA and is capable of running UNIX-like operating systems. [3, 4] The core is implemented in SystemVerilog, is open source, and can be found on GitHub (https://github.com/pulp-platform/ariane).

2.3.1 RISC-V

RISC-V is an open-source Instruction Set Architecture (ISA) based on the RISC instruction set. The RISC-V ISA is designed to be flexible and customizable, and consists of a base integer ISA and several optional extensions. The base ISA comprises a minimal set of instructions, usable for simple applications as well as serving as a base from which to build more specialized processors. To further expand the possible usage areas, the RISC-V ISA can be implemented with either a 32-bit or a 64-bit address space.

The RISC-V ISA also describes a general-purpose set of extensions to the base ISA, called RV64G. This set of extensions includes multiplication and division, atomic memory operations, and single- and double-precision floating-point operations. It can be extended further with support for compressed instructions to improve code size, performance, and energy efficiency, at the cost of additional hardware complexity. [5]

2.3.2 Architecture

Following is an overview (see Figure 5) of the Ariane Core.

Figure 5: An overview of the Ariane Core.

PC Generation This stage is responsible for generating the next program counter. In RISC-V, branches are always relative to the current PC. If no valid branch prediction exists, the prediction defaults to always taken for backwards jumps and never taken for forwards jumps. Additionally, unconditional jumps can be either to a specific address or relative to the PC.

The instruction scanner is used to find if the current instruction is a jump and, if that’s the case, what kind of jump instruction it is.

Instruction Fetch The IF stage receives a PC from the PC Gen stage and fetches the instruction from the Instruction Cache. The instruction is placed in a FIFO queue that is used as the interface to the Instruction Decode stage.

Instruction Decode The ID Stage does the following.

1. Scan data stream from the IF Stage and identify the instructions.

2. Re-align any misaligned instructions.

3. Decompress any compressed instructions.

4. Generate a scoreboard entry from the instruction and send it to the Issue Stage.

Issue The Issue stage receives decoded instructions from the ID stage and issues them to the functional units. The Issue stage also keeps track of all issued instructions and the status of the functional units, and receives the write-back data from the Execute stage through the scoreboard.

In order to track the order in which instructions are issued, and to allow for limited out-of-order execution, Ariane implements a Scoreboard. The Scoreboard holds all instructions that are to be executed in the Execute stage, where they wait until the functional unit they require is available. Instructions are committed in the same order they arrive at the Scoreboard.

The Issue stage has a rudimentary register rename module that allows a register to be independently used by two instructions at the same time.

Execute The Execute stage consists of the following:

• ALU

• Branch Unit

• Load Store Unit (LSU)

• Multiplier

• Floating Point Unit (FPU)

• CSR Buffer

The functional units are independent from each other and receive their data from the Issue Stage. After an operation has been completed, the result is returned to the Issue Stage.

Commit The Commit stage receives up to two instructions from the scoreboard in the Issue stage and updates the architectural state. Depending on which instructions are committed, this can mean writing data to one of the register files or committing stores to memory.

3 Implementation

As described in Section 2.2.9, the Load Slice Core can be implemented by modifying an In-Order Core.

In this section, we will describe how we modified the Ariane Core into a Load Slice Core.

There are two slightly different implementations used in this project. The first implementation is made for running general programs and benchmarks correctly. It uses Ariane's register renaming scheme as well as the checks for dependencies between instructions presented in Sections 3.3.1 and 3.3.2. The second implementation is made for running custom-made microbenchmarks, where the program flow can be controlled to make sure it behaves correctly. It uses the custom-made register renaming (see Section 3.1.1); with this register renaming, the check for branch instructions in Section 3.3.2 becomes unnecessary, and as such this implementation does not use those checks.

Do note that no modifications to the memory support structures were needed to implement the LSC on Ariane.

3.1 Iterative Backwards Dependency Analysis

In order to implement the Iterative Backward Dependency Analysis, three components had to be created: the Register Rename, the Instruction Slice Table, and the Register Dependency Table.

3.1.1 Register Rename

In our Load Slice Core, there are two different implementations of Register Renaming: the first is the limited register renaming from the Ariane Core, and the second is a simple Register Renaming built by us.

Ariane's limited register renaming is implemented in its scoreboard. It has a fixed renaming scheme: each entry in the register file can be represented by two different names in the scoreboard. This allows up to two instructions that write to the same register file entry to be present in the scoreboard simultaneously.

The simplified register renaming implemented by us works like a traditional register renaming component: it dynamically assigns physical registers based on which logical registers are being used. It uses 16 logical registers and 32 physical registers. Additionally, due to time constraints, this register renaming cannot handle mispredicted branches. Because of this, only benchmarks and microbenchmarks written in assembly with these limitations in mind work when using this register renaming component.
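As an illustration of the scheme just described, a mapping table plus free list of the stated sizes could behave as in the following Python sketch (our reconstruction for illustration, not the SystemVerilog source; like the implementation above it has no branch-misprediction recovery, and it recycles old physical registers immediately, which a real core would only do at commit):

from collections import deque

N_LOGICAL, N_PHYSICAL = 16, 32              # sizes from this section and Table 1

free_list = deque(range(N_LOGICAL, N_PHYSICAL))   # physical regs not yet mapped
map_table = list(range(N_LOGICAL))                # logical reg -> physical reg

def rename(srcs, dst):
    """Rename one instruction's registers; returns (physical srcs, physical dst)."""
    phys_srcs = [map_table[r] for r in srcs]      # sources read current mappings
    phys_dst = None
    if dst is not None:
        old = map_table[dst]
        phys_dst = free_list.popleft()            # fresh name breaks WAR/WAW
        map_table[dst] = phys_dst                 # (an empty free list would stall)
        free_list.append(old)                     # naive recycling, see above
    return phys_srcs, phys_dst

Renaming the loop shown in Section 5.2.2 with such a scheme gives each iteration fresh destination names, which is exactly what removes the name dependencies between iterations.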

3.1.2 Instruction Slice Table

The IST was implemented as a 128-entry, 2-way set-associative cache using the LRU replacement policy. Each cache entry contains the PC of an instruction that has been marked as an AGI by the RDT.

3.1.3 Register Dependency Table

The RDT was implemented as a direct-mapped cache with one entry per physical register in the processor. Each cache entry contains a valid bit, the PC of the instruction that wrote to the register, and whether that instruction was an AGI. The x0 register is never marked as an AGI since it is hardwired to zero. If an instruction is an AGI, the producers of its source registers are marked as AGIs in the IST.

3.2 Bypass Instruction Queue

Because Ariane's own instruction queue is before the decode stage in the pipeline, the B-IQ was implemented together with an additional main queue (A-IQ) after the decode stage. The two instruction queues were implemented as circular FIFO queues with a variable size set at compile time. Because of the circular implementation of the queues, the push and pop operations each take only one cycle to complete. To keep track of the program order between the two queues, every entry in the queues has a cycle counter that is incremented by one every cycle, from the time the entry is put into a queue until it is issued.


Instructions are put into the two queues based on whether the instruction is a load, a store, or an Address-Generating Instruction. Load instructions and Address-Generating Instructions are put into the B-IQ, stores are put into both queues, and the rest are put into the A-IQ.
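The steering rule can be summarized with a short behavioral sketch (Python, our illustration; the real queues are fixed-size circular FIFOs in SystemVerilog). Tagging each entry with its insertion cycle is equivalent to the per-entry counter described above: comparing tags at issue time tells us which of the two queue heads is older.

from collections import deque

a_iq, b_iq = deque(), deque()   # stand-ins for the circular FIFO queues
cycle = 0                       # advanced once per cycle by the surrounding model

def enqueue(instr, is_load, is_store, is_agi):
    entry = (cycle, instr)               # age tag = cycle of insertion
    if is_store:
        a_iq.append(entry)               # a store goes into both queues and
        b_iq.append(entry)               # later blocks both heads until issued
    elif is_load or is_agi:
        b_iq.append(entry)               # bypassable slice work
    else:
        a_iq.append(entry)               # everything else stays in program order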

3.3 Issuing from Instruction queues

The only instructions considered for issuing are the instructions at the heads of the two queues; this keeps the instructions in each queue in program order. If both instructions are ready to be issued, the instruction in the B-IQ is issued.

An instruction is considered ready to be issued if the functional unit the instruction needs is not busy, the source and destination registers that the instruction uses are not currently being written to by any functional unit, and there are no dependencies between the instruction about to be issued and older instructions that have not been issued yet. This is true for all instructions except store instructions. As explained by Carlson et al. [1], store instructions are put into both queues and are only issued when the same store is at the head of both the main and the bypass queue. This way, a store instruction blocks both queues until it can be issued, ensuring that memory is updated in the correct order.

There are two types of dependencies between instructions that need to be checked before an instruction is issued to ensure a program runs correctly. The first has to do with instructions reading and writing the same registers; the second has to do with branch instructions and is only relevant when running the application benchmarks.

3.3.1 Read and write dependencies

Because either queue can issue instructions ahead of the other, there can exist dependencies between an instruction in one queue and older instructions in the other queue. To illustrate this we will use the example loop from section 2.2.8:

1. lw x1, 0(x30) # long latency load 1

2. addi x10, x30, 4 # gen address load 2

3. addi x2, x1, 15 # consume load 1

4. addi x11, x10, 8 # gen address load 2

5. addi x12, x11, 16 # gen address load 2

6. lw x3, 0(x12) # long latency load 2

In the second iteration of this loop, instruction 5 will be marked as address generating and will be put into the B-IQ. The problem now is that instructions 2 and 4 need to be issued before instruction 5, but they are blocked in the A-IQ behind instructions 1 and 3. Instruction 5, which when executing in program order would be blocked behind instruction 4, is now first in the B-IQ and can be issued before instructions 2 and 4.

In other words, because instruction 5 is no longer blocked by instructions 2 and 4, which it depends upon, there is no way to determine whether instruction 5 can be issued correctly or not. This is explained in more detail with Figure 6 below.


Figure 6: Stepping through second iteration of the loop.

Step 1 From here either instruction 1 or 2 can be issued. Instruction 1 gets issued because it is oldest, its functional unit is not busy, and no functional unit is writing to the instruction's destination register.

Step 2 Instructions 2 and 5 are considered for issuing, and instruction 2 gets issued because it is oldest, its functional unit is not busy, and no functional unit is writing to any register that instruction 2 uses.

Step 3 Instruction 3 cannot be issued because one of its source registers is being written to by instruction 1, a load that has not yet returned from memory. Instruction 5, though, has no functional unit writing to any of its registers, and its functional unit is not busy, so instruction 5 is issued.

Step 4 Instruction 1 has still not finished, so instruction 3 cannot be issued; instead instruction 6 is issued because, again, its functional unit is not busy and no register used by instruction 6 is being written to.

Now there is a problem: both instructions 5 and 6 have been issued before instruction 4, which means that instruction 6 will use the wrong address when loading from memory, as register x11 has not been updated with the correct value by instruction 4.

To prevent this from happening, dependencies between instructions must be resolved before the instructions are issued. This was implemented by adding another criterion that an instruction needs to fulfill to be considered ready for issue.

Below is the criterion for dependencies between instructions that must hold for an instruction to be considered ready. If it does not hold, the instruction is not ready.


let B = the instruction in the B-IQ considered for issue.

Are there instructions in the A-IQ older than B?
  no:  B might be ready
  yes:
    Do any of those write to B's source registers,
    or read from B's destination register?
      yes: B is not ready
      no:  B might be ready

3.3.2 Branch instructions

A side effect of having out-of-order commit (see Section 5.3) and branch prediction is that instructions on a predicted branch path can get issued before the branch instruction they are predicted on. Because the commit order of instructions is the same as the issue order, this means that instructions on a speculative path can be issued and committed before the branch instruction is resolved. In other words, if a branch is mispredicted, instructions from that path may already have been committed when they should have been thrown away.

Because the original Ariane core issues and commits instructions in order, its branch prediction normally does not have to take this into account. To solve this problem, another check was implemented for instructions in the B-IQ: before an instruction from the B-IQ can be issued, there can be no older control-flow instructions in the A-IQ. This means that whenever a branch is encountered, the branch instruction is issued first, and if the branch was mispredicted no instructions from that path get committed.

This will have an impact on the performance of the Application Benchmarks, because having to issue the branch instruction first in every branch means that branch instructions will block instructions in the B-IQ from being issued. This limits the number of instructions that can be issued and executed ahead of program order, which decreases the performance of the implementation.
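Putting Sections 3.3.1 and 3.3.2 together, the full readiness test for the head of the B-IQ looks roughly like the Python sketch below (our pseudocode made concrete; the entry fields are invented for illustration, and the in-flight functional-unit hazard checks inherited from the baseline core are omitted). The check_branches flag corresponds to the application-benchmark configuration; the microbenchmark configuration with the custom register renaming would pass False.

def biq_head_ready(b, a_iq, fu_busy, check_branches=True):
    """May the instruction at the head of the B-IQ issue ahead of the A-IQ?"""
    if fu_busy(b.fu):
        return False
    for a in a_iq:
        if a.age >= b.age:
            continue                  # only A-IQ entries older than b matter
        if check_branches and a.is_branch:
            return False              # never run ahead of an unresolved branch
        if a.dst is not None and a.dst in b.srcs:
            return False              # an older instruction writes b's source
        if b.dst is not None and b.dst in a.srcs:
            return False              # an older instruction reads b's destination
    if b.is_store:
        # A store sits in both queues and issues only once it has reached the
        # head of each, keeping memory updates in program order.
        return len(a_iq) > 0 and a_iq[0].age == b.age
    return True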

4 Experimental Setup

Load Slice Core

Component name                    Organization                        Ports
Instruction Queue                 16 entries                          1r 1w
Bypass Queue                      16 entries                          1r 1w
Instruction Slice Table           128 entries, 2-way set-associative  1r 1w
Register Dependency Table         32 entries x 6 B                    1r 1w
Register Renaming: Mapping Table  16 entries x 6 bits                 1r 1w
Register Renaming: Free List      32 entries x 7 bits                 1r 1w
Scoreboard                        8 entries                           1r 2w

Table 1: The components of the Load Slice Core implemented using the Ariane core, and their organization.


The additional instruction queues are added as a separate stage in the pipeline and are not integrated with the existing instruction queue stage in Ariane. Because of this, there is some extra overhead when inserting and removing instructions from the queues, since there are now two instruction queue stages in the pipeline. It was therefore decided not to compare the results of the LSC implementation against the unmodified Ariane core, which lacks this extra overhead. Instead, the LSC implementation was compared to a slightly modified Ariane core with a single instruction queue added in the same place in the pipeline as the two queues in the LSC implementation, effectively adding the extra overhead without changing the functionality of the Ariane core.

When measuring the performance of our implementation we have run the benchmarks on two different implementations.

1. The Ariane core with the LSC components added, which will be referred to as the LSC.

2. The Ariane core with a single instruction queue added, which will be referred to as the IOC.

Since all of the tests were executed in a simulated environment, there was no significant delay when loading data from memory during the benchmarks and microbenchmarks. To artificially induce a load penalty, the shift register present in the Ariane core's Load Store Unit was used. This artificial load delay simulated a cache-miss delay and was used to explore how longer cache-miss times would impact the performance of the LSC architecture.

4.1 Microbenchmarks

To show the effects of the modifications, the Microbenchmark ubmk-unroll was created. It explores how the performance of the Load Slice Core was affected by loop-unrolling, extra non-AGI work, and load delay.

The Microbenchmark is a simple program that loads a value from memory, sums it with the previously loaded value, does some other non-memory-related work, and finally checks if it should loop again.

Loop unrolling in the Microbenchmark was achieved by simply repeating the same set of instructions 2, 4, or 8 times before checking the loop condition. To ensure a roughly equal workload between tests, even when they had different numbers of loop unrollings, the Microbenchmark always performs the same number of loop iterations: 64, a number chosen because it is evenly divisible by 1, 2, 4, and 8, making it easier to construct the Microbenchmark. The loop unrolling was used to test whether the performance of the Microbenchmark changed significantly when the number of loop-condition checks was reduced.

The extra non-AGI instructions inserted into each loop (and unrolled loop) are used to assess whether the architecture benefits from finding other work that can be performed while waiting for the load instructions to complete.

Extra Load Delay was introduced to assess how the LSC compares to the IOC as the time to complete a load instruction increases.

The performance of the Microbenchmark was measured in cycles per loop, since this highlights the time it takes to perform one workload, i.e. one loop iteration. This was chosen instead of cycles per instruction because of the large difference in instruction counts between the different variations of the Microbenchmark.


4.2 Application Benchmarks

For testing general-purpose scenarios, we used three benchmarks: Dhrystone, Towers, and Multiply. Dhrystone is a general performance benchmark, Towers tests the performance of solving the Towers of Hanoi puzzle, and Multiply tests the performance of multiplying arrays with each other. The Application Benchmarks used are standard benchmarks available for RISC-V processors through the riscv-tests repository (https://github.com/riscv/riscv-tests). We chose these three because they represent a good mix of the available benchmarks and test different cases of our implementation.

5 Results and Discussion

5.1 Microbenchmarks

This section discusses the results of the Microbenchmark ubmk-unroll; the results can be seen in Figure 7 and Figure 8.

5.1.1 Load Slice Core vs. In-Order Core

[Figure: six panels of cycles per loop (y-axis, 10-50) against extra instructions per loop (x-axis, 1-10); top row IOC, bottom row LSC; columns for Load Delays of 1, 8, and 16 cycles; lines for the normal loop and loops unrolled 2, 4, and 8 times.]

Figure 7: The results of the microbenchmark ubmk-unroll when run on both the LSC and the IOC, comparing the performance of the LSC to the IOC.


Figure 7 shows the results of running the microbenchmark on the LSC and the IOC while varying the extra load delay, the number of extra instructions per loop, and the number of loop unrollings. The top row shows the results from the IOC, and the bottom row shows the results from the LSC. The columns, from left to right, show the results for running with 1, 8, and 16 extra cycles of Load Delay. In the figures, the x-axis represents the number of extra instructions per loop, and the y-axis shows how many cycles it took to execute one loop iteration. The color of each line represents how many times the loop was unrolled.

The results of both the IOC and the LSC are mostly as expected. We discuss the results of the IOC first. As expected, we can see that there is a clear relationship between the number of extra instructions per loop and the number of cycles per loop, and that the cycles per loop increase as the Load Delay increases.

However, it was unexpected that loop unrolling had a very minimal impact on performance. The reason is probably how loop unrolling was implemented in the microbenchmark: the same calculate-address, load-data, sum-data, do-extra-work loop body is simply performed another time before the loop condition is checked. Although this decreases the total number of instructions performed overall, the same general instruction flow still occurs. If instead the loop unrolling had been implemented such that all address calculations are done together, then all loads, then all summations, and then all the extra work, a difference would probably have been present in the results. This is because the IOC must perform all instructions in issue order: it cannot start the second load of the twice-unrolled loop until it is done with everything around the first load, its summation, and its extra work. In the alternative version of the benchmark, all load instructions could have been placed in the Scoreboard simultaneously and executed in parallel, drastically decreasing the total time needed to perform each loop.

Here we discuss the results of the LSC. As can be seen in Figure 7, there is a relationship between the number of extra instructions per loop and the number of cycles it took to execute each loop. What can also be seen is that the LSC is more efficient than the IOC: the slope of the LSC's lines is shallower than that of the IOC's, which means that an increase in the number of extra instructions per loop has a smaller effect on the number of cycles per loop on the LSC than on the IOC. This also holds true as the Load Delay is increased. What was not expected was how the lines seem to jump in the results of the LSC, and how the jumps get larger as the Load Delay increases. Additionally, the performance gain relative to the IOC decreases as the Load Delay increases, which is counter to what we expected to see. To try to explain these findings, further testing was done, and the results can be seen in Figure 8.

5.1.2 Effect of Scoreboard Size

A hypothesis for why the microbenchmark did not behave as expected was that the Scoreboard size used was too small to successfully hide the latency of the load instructions in the microbenchmark. To test this, the microbenchmark was run on a version of the LSC with a Scoreboard expanded to hold 32 entries instead of the default 8. Figure 8 shows the results. Do note that the axes in Figure 8 differ from those in Figure 7: in Figure 8, the y-axis is cycles per loop, the x-axis is the number of loop unrollings, and the color denotes the number of extra instructions per loop. Here we can clearly observe that the LSC with a Scoreboard size of 32 performs the best in every case. From this we conclude that, in order for the LSC to improve performance relative to an IOC, a sufficiently large Scoreboard must be used to hide the latency of the memory-accessing instructions.


[Figure: nine panels of cycles per loop (y-axis, 0-50) against the number of loop unrollings (x-axis, 1-8); rows for Load Delays of 1, 8, and 16 cycles; columns for the IOC with scoreboard size 8, the LSC with scoreboard size 8, and the LSC with scoreboard size 32; lines for 1 and 10 extra instructions per loop.]

Figure 8: The results of running the benchmark ubmk-unroll on the LSC, comparing the effect of scoreboard size on performance.

5.2 Application Benchmarks

Apart from the Microbenchmarks, the implementation has also been evaluated using the three Application Benchmarks Dhrystone, Multiply, and Towers. Table 2 below shows, for each benchmark, the percentage of instructions that are load instructions, the percentage that are branch instructions, the percentage that end up in the B-IQ, and how often the scoreboard is full.


Table 2: Statistics about the Application Benchmarks

These statistics give us an indication of how the performance of these benchmarks will be affected by our implementation of the Load Slice Core.

[Figure: three bar charts (towers, dhrystone, multiply), each comparing CPI (0-3) between the IOC and the LSC.]

Figure 9: Measurement of Cycles Per Instruction for three benchmarks.

The performance of each benchmark has been measured on both the LSC and the IOC using an instruction queue depth of 16 and a scoreboard size of 8. As can be seen in Figure 9, the performance of the LSC implementation is actually slightly worse than that of the IOC.

5.2.1 Speculative instructions

One of the main goals of the LSC microarchitecture is to issue and execute long-latency instructions, such as loads from memory, as early as possible while still executing the program correctly (see Section 2.2 for more details). A big part of this is being able to issue and execute loads speculatively, which is limited when running the Application Benchmarks by having to issue the branch instruction first (Section 3.3.2).

Depending on what the branches contain, this can affect performance to different degrees. If there are a lot of branches in a program with few instructions in every branch, the branch instructions will block the B-IQ a large portion of the time. On the other hand, if there are few branches with many instructions in between, the branch instructions will not block the B-IQ; but as the instructions that are put into the B-IQ are found over multiple loop iterations, there will not be as many independent instruction slices in the B-IQ.

In other words, if there are a lot of loops in a program, the branch instructions will block the speculative loads in the B-IQ, and if there are few loops, the IBDA will not find as many independent instruction slices to put into the B-IQ. As can be seen in Table 2, both the Dhrystone and the Towers benchmarks find plenty of instructions to put into the B-IQ (39% and 50% respectively), but as there are a lot of branch instructions (21% and 18% respectively), the instructions in the B-IQ spend a lot of time waiting before they are issued, which is why we don't see any improvement in performance. The Multiply benchmark is a special case: only around 1% of its instructions are load instructions, and as the LSC microarchitecture improves performance by re-ordering load instructions, there should not be any significant difference in performance between the LSC and the IOC, which is also what we observe in the actual results in Figure 9.

5.2.2 Register renaming

To see a performance gain for the application benchmarks, and for general applications, we would need a fully working register rename component. Due to time constraints we were only able to implement a register rename that works with handwritten assembly code; as these benchmarks are compiled from C, we cannot use our register rename implementation when running them.

Using register renaming means that the processor only stalls execution for true dependencies, which means fewer stalls and better performance. Because the LSC microarchitecture can execute load instructions ahead of program order, a processor using the LSC gains a larger benefit from register renaming than a fully in-order processor. Below are two iterations of a loop, with and without register renaming.

                   No register renaming        With register renaming
Loop iteration #1
                   1. lw   x1, 0(x30)          1. lw   x1, 0(x30)
                   2. addi x10, x1, 4          2. addi x10, x1, 4
                   3. addi x11, x10, 16        3. addi x11, x10, 16
                   4. addi x12, x11, 8         4. addi x12, x11, 8
Loop iteration #2
                   5. lw   x1, 4(x30)          5. lw   x33, 4(x30)
                   6. addi x10, x1, 4          6. addi x42, x33, 4
                   7. addi x11, x10, 16        7. addi x43, x42, 16
                   8. addi x12, x11, 8         8. addi x44, x43, 8

If we assume that a load instruction takes 10 cycles to complete and an addi instruction takes 1 cycle, we can see in Figure 10 below how a processor using the LSC and an in-order processor without the LSC benefit from register renaming.


Figure 10: Blue represents loop iteration 1 and green represents loop iteration 2.

The processor without the LSC gets no benefit from register renaming in this case, while the processor using the LSC completes the two iterations in around 44% fewer cycles with register renaming than without. With regard to this, we expect the LSC to benefit more from register renaming than the IOC, which is also why register renaming is such an important part of getting a performance increase from the Load Slice Core.
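The figure can be sanity-checked with a back-of-envelope count under our own rough model (assumptions: one instruction issued per cycle, and with renaming the second load issuing one cycle after the first):

LOAD, ADDI = 10, 1                  # assumed latencies from the text

# Without renaming, the reuse of x1 serializes the two loads:
serial = 2 * (LOAD + 3 * ADDI)      # 26 cycles for the two iterations

# With renaming (and the LSC issuing ahead), the loads overlap:
overlap = 1 + LOAD + 3 * ADDI       # roughly 14 cycles

print(serial, overlap, 1 - overlap / serial)   # 26 14 ~0.46, near the ~44% above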

As mentioned in the previous section, in the Multiply benchmark we should see very little, if any, difference in performance between the LSC and the IOC.

5.3 Out-of-order commit

Ariane implements in-order issuing and out-of-order execution, which means that instructions can return from execution in a different order than they were issued in. To make sure each instruction is committed in the correct order Ariane re-orders the instructions to the same order that they were issued in. This means that because LSC changes the issue order, a side effect is that the commit order is also changed.

Usually when re-ordering instructions, the end goal is to make the program seem like it is running in-order from the outside. But as our implementation makes sure that the program will run correctly before the issue order is changed, having in-order commit is not needed for programs running on our implementation to behave as intended. In fact, according to Alipour et al. [6], resources can be released earlier when committing out-of-order compared to in-order, which can increase performance for some applications.

Having out-of-order commit allows instructions to be committed faster, which means that the scoreboard does not fill as quickly as when committing instructions in order. In the scope of this project, out-of-order commit does not affect the results of the application benchmarks, as the scoreboard size is not a bottleneck: as can be seen in Table 2, the scoreboard is only full between 0% and 1% of the time. In the Microbenchmarks it might cause a performance increase, because in the setup the Microbenchmarks run on, more loads can be executed at the same time, which makes execution units and scoreboard slots more important resources.


6 Conclusion

In this paper, we've presented what a Load Slice Core is, given a brief overview of the Ariane Core, described how we implemented a prototype Load Slice Core using the Ariane Core, and finally presented and discussed our experiments and their results.

We found that the Load Slice Core is a new and promising microarchitecture, but more testing needs to be performed before conclusive statements can be made about it. Our Microbenchmark showed that, under the right circumstances, the Load Slice Core can perform up to 50% better than a comparable In-Order Core. However, due to time constraints, we could not implement a fully functional Load Slice Core for the general case; thus we found no significant difference between our implemented Load Slice Core and the comparable In-Order Core when running the Application Benchmarks.

7 Future Work

7.1 A full Register Renaming

A fully implemented register renaming scheme would allow for accurate testing of the Load Slice Core in a more general application environment, like the Application Benchmarks. Without this, no conclusive claim can be made about how the Load Slice Core performs in comparison to either In-Order Cores or Out-of-Order Cores.

7.2 Improved IST

At the end of testing, an issue was discovered with the implemented IST: it does not take into account the fact that instructions are word (4-byte) aligned rather than byte aligned, so the IST can only hold a quarter of the total number of instructions it should be able to hold. This must be fixed in any future work.
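Our reading of the issue, as a sketch: with 4-byte-aligned instructions the two lowest PC bits are always zero, so indexing the IST with the raw low PC bits reaches only a quarter of the sets (set count per Table 1; function names are ours):

N_SETS = 64   # 128 entries, 2-way set-associative (Table 1)

def ist_index_buggy(pc):
    return pc % N_SETS           # bits [1:0] always zero: only 16 of 64 sets used

def ist_index_fixed(pc):
    return (pc >> 2) % N_SETS    # drop the alignment bits before indexing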

7.3 Measure the power and the area used

Due to time constraints and tool licensing issues, no power and area usage were measured. Since Carlson et al. claim that the Load Slice Core Microarchitecture outperforms both In-Order Cores and Out-of-Order Cores in a power- and area-limited design environment, it is essential that future work tests this claim in practice.


8 References

[1] T. E. Carlson, S. Kaxiras, W. Heirman, O. Allam, and L. Eeckhout, "The load slice core microarchitecture," in ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture, June 2015.

[2] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach. Elsevier Science & Technology, 2014.

[3] Pulp-Platform, "Ariane documentation," https://pulp-platform.github.io/ariane/docs/home/.

[4] F. Zaruba and L. Benini, "The cost of application-class processing: Energy and performance analysis of a Linux-ready 1.7-GHz 64-bit RISC-V core in 22-nm FDSOI technology," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 27, no. 11, pp. 2629-2640, Nov 2019.

[5] The RISC-V Instruction Set Manual, Volume I: User-Level ISA, Document Version 20190608-Base-Ratified, RISC-V Foundation, March 2019.

[6] M. Alipour, T. E. Carlson, and S. Kaxiras, "Exploring the performance limits of out-of-order commit," in Proceedings of the Computing Frontiers Conference, ser. CF'17. New York, NY, USA: Association for Computing Machinery, 2017, pp. 211-220. [Online]. Available: https://doi.org/10.1145/3075564.3075581


A ubmk-unroll.S

#

# ubmk-unroll.S

#

#ifndef NUM_EXTRA

#define NUM_EXTRA 1

#endif

#ifndef NUM_UNROLL

#define NUM_UNROLL 1

#endif

#define TOTAL_LOOPS 64

#define LOOPS (TOTAL_LOOPS / NUM_UNROLL)

#define _work(n) EXTRA_WORK_ ## n

#define work(n) _work(n)

#define _SUM(n) SUM_LOAD_UNROLL_ ## n

#define SUM(n) _SUM(n)

#define EXTRA_WORK_0

#define EXTRA_WORK_1 EXTRA_WORK_0 ; EXTRA_WORK

#define EXTRA_WORK_2 EXTRA_WORK_1 ; EXTRA_WORK

#define EXTRA_WORK_3 EXTRA_WORK_2 ; EXTRA_WORK

#define EXTRA_WORK_4 EXTRA_WORK_3 ; EXTRA_WORK

#define EXTRA_WORK_5 EXTRA_WORK_4 ; EXTRA_WORK

#define EXTRA_WORK_6 EXTRA_WORK_5 ; EXTRA_WORK

#define EXTRA_WORK_7 EXTRA_WORK_6 ; EXTRA_WORK

#define EXTRA_WORK_8 EXTRA_WORK_7 ; EXTRA_WORK

#define EXTRA_WORK_9 EXTRA_WORK_8 ; EXTRA_WORK

#define EXTRA_WORK_10 EXTRA_WORK_9 ; EXTRA_WORK

#define EXTRA_WORK_11 EXTRA_WORK_10 ; EXTRA_WORK

#define EXTRA_WORK_12 EXTRA_WORK_11 ; EXTRA_WORK

#define EXTRA_WORK_13 EXTRA_WORK_12 ; EXTRA_WORK

#define EXTRA_WORK_14 EXTRA_WORK_13 ; EXTRA_WORK

#define EXTRA_WORK_15 EXTRA_WORK_14 ; EXTRA_WORK

#define EXTRA_WORK_16 EXTRA_WORK_15 ; EXTRA_WORK

#define SUM_LOAD_UNROLL_0

#define SUM_LOAD_UNROLL_1 SUM_LOAD_UNROLL_0 ; SUM_LOAD

#define SUM_LOAD_UNROLL_2 SUM_LOAD_UNROLL_1 ; SUM_LOAD

#define SUM_LOAD_UNROLL_3 SUM_LOAD_UNROLL_2 ; SUM_LOAD

#define SUM_LOAD_UNROLL_4 SUM_LOAD_UNROLL_3 ; SUM_LOAD

#define SUM_LOAD_UNROLL_5 SUM_LOAD_UNROLL_4 ; SUM_LOAD

#define SUM_LOAD_UNROLL_6 SUM_LOAD_UNROLL_5 ; SUM_LOAD

#define SUM_LOAD_UNROLL_7 SUM_LOAD_UNROLL_6 ; SUM_LOAD


#define SUM_LOAD_UNROLL_8 SUM_LOAD_UNROLL_7 ; SUM_LOAD

#define SUM_LOAD_UNROLL_9 SUM_LOAD_UNROLL_8 ; SUM_LOAD

#define SUM_LOAD_UNROLL_10 SUM_LOAD_UNROLL_9 ; SUM_LOAD

#define TOT_LOOPS x1

#define CRNT_ITER x2

#define ADDR_DATA x3

#define REG_DATA x4

#define REG_EXTRA x5

#define EXTRA_WORK addi REG_EXTRA, REG_EXTRA, 1

#define SUM_LOAD la ADDR_DATA, d_in ;\
        lw x10, (ADDR_DATA) ;\
        add REG_DATA, REG_DATA, x10 ;\
        work(NUM_EXTRA) ;

.section .text
.global main
.align 2

main:   # setup variables
        li TOT_LOOPS, TOTAL_LOOPS       # Number of iterations
        li CRNT_ITER, 0                 # current iteration
        li REG_EXTRA, 0                 # Extra work store reg
        li REG_DATA, 0

loop:   SUM(NUM_UNROLL)

check:  # check if we should quit
        bge CRNT_ITER, TOT_LOOPS, return
        addi CRNT_ITER, CRNT_ITER, NUM_UNROLL
        j loop

return: # return function, exit handling
        la x10, d_extra
        sw REG_EXTRA, (x10)
        la x10, d_out
        sw REG_DATA, (x10)
        fence
        ecall

.section .data
.align 4
.global begin_signature
begin_signature:

.global c_start, c_end, c_diff
c_start: .word 0
c_end:   .word 0
c_diff:  .word 0
         .word 0

d_in:    .word 1
d_out:   .word 0
d_extra: .word 0
         .word 0

.align 4
.global end_signature
end_signature:


B util.S

#

# util.S

#

#define BYTE 0x1

#define WORD 0x4

#define DWORD 0x8

#define QWORD 0x10

.equ USER_ECALL, 0x8

.section .text.init
.global _start
_start:

        .align 2
        la t0, trap_vector
        csrw mtvec, t0
        #jal reset_vector
        csrwi mstatus, 0
        la t0, main
        csrw mepc, t0

        la a0, c_start          # Get mcycle at start
        call get_mcycle

        mret

end:    la a0, c_end            # Get mcycle at end
        call get_mcycle

        la a0, c_start          # Get num cycles taken for test
        la a1, c_end
        la a2, c_diff
        jal delta_mcycle
        j exit

.global trap_vector
trap_vector:
        csrr t0, mcause
        li t1, USER_ECALL
        beq t0, t1, end
        j exit


/* get_mcycle
   INPUT
   a0 - address to store value at
*/
.global get_mcycle
get_mcycle:
        csrr t0, mcycle
        sw t0, 0(a0)
        ret

/* delta_mcycle
   INPUT
   a0, a1 - address of mcycle data, a0 < a1
   a2 - address to store difference
*/
.global delta_mcycle
delta_mcycle:
        lw t0, 0(a0)
        lw t1, 0(a1)
        sub t2, t1, t0          # t2 = t1 - t0
        sw t2, 0(a2)
        ret

.global exit
exit:   li t0, 1
        li t1, 4
1:      sw t0, tohost, t1
        j 1b

.pushsection .tohost,"aw",@progbits
.align 6
.global tohost
tohost:   .dword 0
.align 6
.global fromhost
fromhost: .dword 0
.popsection
