
IT 18 005

Degree project, 30 credits, February 2018

Modeling Region Granularity of the D2M Memory System

Pin Tool driven test for the Split Cache Hierarchy

Johan Snider


Department of Information Technology



Abstract

Modeling Region Granularity of the D2M Memory System

Johan Snider

Cache simulation is a potentially complex and time-consuming task in the field of computer architecture. Often, only parts of a program are simulated due to practical time constraints. This thesis proposes a way to simulate entire benchmark programs using the Intel Pin platform (PIN) for research into the Direct-to-Master memory system (D2M).

D2M is a design at the forefront of the computer architecture research field, but the granularity of cacheline group sizes has not been fully investigated. We run tests on ten benchmarks from the PARSEC 3.0 suite and five benchmarks from the SPEC CPU 2006 suite to investigate the effects of different cacheline-size groupings, known as region size, in D2M. For each benchmark, a set of statistics is generated for each region size tested. We analyze these results to show the effects of region size on the D2M design. Specifically, we analyze the effects of region size on the metadata (MD) hierarchy, which is the structure responsible for tracking cachelines in the data hierarchy.

Ten out of fifteen applications have peak MD1 traffic with a region size of 16 or 32 cachelines. These applications, however, are also least affected by changes in region size. The applications that are most affected by region size are the ones that have peak first level MD (MD1) traffic with either smaller or larger regions.

When considering overall MD traffic, the region size of 64 cachelines generates the most overall MD traffic and the lowest number of cache misses. This is because of lower overhead in the MD hierarchy which translates to extended MD reach.

In this way, we model and simulate D2M using PIN to generate statistics about the different D2M region sizes. These results can be taken into consideration when running more in-depth simulations which could potentially save researchers time when performing cache simulation.

Printed by: Reprocentralen ITC 18 005

Examiner: Mats Daniels

Subject reader: David Black-Schaffer

Supervisor: Erik Hagersten


Dedicated to Dr. David Kenneally, my first computer science teacher


Contents

1 Introduction
    1.1 Background
    1.2 D2D and D2M

2 Cache Primer
    2.1 Caching Basics
    2.2 Temporal and Spatial Locality
    2.3 Direct-mapped Cache
    2.4 Set-associative Cache
    2.5 Cache Hierarchy

3 D2M Cache Overview
    3.1 D2M Cache Implementation
    3.2 MD1 Search
        3.2.1 L1 Hit
        3.2.2 L2 Hit
        3.2.3 L3 Hit
        3.2.4 Memory Hit
        3.2.5 Remote Hit
    3.3 MD2 Search
    3.4 MD3 Search
        3.4.1 Uncached
        3.4.2 Untracked
        3.4.3 Private
        3.4.4 Shared

4 Modeling the Region Granularity of D2M
    4.1 PIN
    4.2 Benchmark Suites
        4.2.1 PARSEC 3.0
        4.2.2 SPEC CPU 2006
    4.3 Area-neutral Design Comparison

5 Results
    5.1 MD1 Results
    5.2 MD1 Evictions
    5.3 MD3 Evictions
    5.4 Benchmark Evaluation
    5.5 Region size 8
        5.5.1 canneal
    5.6 Region size 16 and 32
        5.6.1 gromacs
        5.6.2 gems
    5.7 Region size 64: bodytrack, fluidanimate, lbm and mcf
        5.7.1 lbm

6 Conclusions and Future Work

7 Appendices
    7.1 Appendix A: MD1 traffic
    7.2 Appendix B: MD1 Use
    7.3 Appendix C: MD1 and MD2 traffic


List of Figures

1   A Direct-mapped Cache
2   A Set-associative Cache
3   Multi-core Cache Hierarchy
4   Simplified Direct-to-Data Cache Hierarchy
5   Simplified Direct-to-Master Cache Hierarchy
6   MD1 Traffic (PKMO), Appendix A
7   MD1 Traffic (PKMO), Appendix A
8   MD1 Use (PKMO), Appendix B
9   MD1 Use (PKMO), Appendix B
10  Combined MD1 and MD2 traffic (PKMO), Appendix C


List of Tables

1   Encoding for LI in the D2M Design
2   Description of benchmarks from the PARSEC 3.0 suite
3   Description of benchmarks from the SPEC CPU 2006 suite
4   Number of MD entries for different region sizes
5   MD reach for different region sizes
6   Matrix of D2M statistics
7   MD1 Traffic (PKMO)
8   PB state on MD3 eviction (PKMO)
9   MD traffic overview (PKMO)
10  MD1 Cacheline Use
11  canneal MD1 traffic
12  canneal MD2 traffic
13  gromacs MD traffic (PKMO)
14  gromacs MD1 traffic (PKMO)
15  gromacs MD2 traffic (PKMO)
16  gems MD traffic
17  gems MD1 traffic (PKMO)
18  gems MD2 traffic (PKMO)
19  lbm MD traffic
20  lbm MD1 traffic (PKMO)
21  lbm MD2 traffic (PKMO)
22  lbm MD1 Cacheline Use (PKMO) and Percentages
23  MD1 Traffic (PKMO)
24  MD1 Traffic (PKMO)
25  PARSEC 3.0 MD1 Cacheline Use per Region Size
26  SPEC MD1 Cacheline Use per Region Size
27  Combined MD1 and MD2 traffic (PKMO)


List of Abbreviations

PIN    Intel Pin Platform, also known as Pintool
D2M    Direct-to-Master memory system
MD     Metadata
MD1    First level of metadata
MD2    Second level of metadata
MD3    Third level of metadata
L1     First level cache
L2     Second level cache
L3     Third level cache
D2D    Direct-to-Data
LLC    Last level cache
TLB    Translation lookaside buffer
LI     Location information, also known as cacheline pointer
VA     Virtual address
PB     Presence bits
PKMO   Per kilo memory operations


1 Introduction

1.1 Background

A classic problem in computer architecture is how to bridge the gap between the fast cores and the relatively slow main memory. While a process waits for memory it usually cannot execute any other instructions. Often the process has to wait several hundred CPU cycles before the data is available, which leads to a longer overall execution time. Caching is an attempt to hide this memory latency by making the processor appear to always have fast access to data.

This problem has been inherent in computer systems since the 1960s, when computer architects first identified the issues of using slower and faster memory technologies together. The first publication on caches describes the problem of moving data from slow magnetic tape to faster core memory, and the possibility that the data might see reuse if it were kept in the faster core memory [9]. The paper goes on to describe a direct-mapped cache and explain how a small cache memory could accumulate data for reuse that would decrease the amount of time spent waiting for memory operations to finish. This paper was published in 1965 by Sir Maurice Vincent Wilkes and is cited as the first publication on caches [4].

Since then, numerous optimizations and improvements have been made to keep up with the growing computational power and data size of modern day computers. However, these improvements introduce their own limitations, and in a field where performance is everything, computer architects are now trying to minimize the latency of the cache.

The problem with modern cache hierarchies is essentially the same as the original problem with memory: it takes too long. Too much time and energy are spent searching for data in the cache, and when the data is not found in the cache that time and energy is wasted. With the increasing sizes and associativity of caches these problems become even worse.

1.2 D2D and D2M

The Direct-to-Data (D2D) [7] and Direct-to-Master (D2M) [5] cache designs address some of the limitations of traditional caches. These designs track data so that on a cache hit the data is located and served to the CPU without searching, and on a cache miss the cache is not searched at all. To accomplish this tracking of data, cachelines are grouped together into regions. This grouping cuts down on the tracking overhead and utilizes temporal and spatial locality.

These regions are grouped together into three levels of metadata that make up the metadata hierarchy. The result is a design that allows the cache to provide the processor with data faster while using less energy.

The research question addressed in this thesis is: What is the optimal region granularity for D2M? As such, the main contribution of this thesis is research into the performance of the D2M design with varying region sizes. In the thesis, four region sizes have been tested: 8, 16, 32, and 64. Looking at trade-offs and performance factors such as MD1 traffic and MD1 cacheline use, we try to determine which region size is the optimal choice for D2M. We analyze the level of MD1 traffic in particular because the number of hits to the first level of the metadata hierarchy gives an indication of the efficiency of the different region sizes.


The amount of overhead used on tag storage is an important factor in the D2M design. With a region size of 8, there is one tag for every 8 cacheline pointers. With a region size of 64, there is one tag for every 64 cacheline pointers.

This means that for larger region sizes there is less tracking overhead. If those cacheline pointers are not used, however, there will be poor MD1 performance despite there being less overhead. For example, with a region size of 64, if only one cacheline pointer is used then 63 cacheline pointers are wasted. In this case, it is better to have smaller region sizes because with a region size of 8, only 7 cacheline pointers can go unused.

To perform this analysis, a D2M simulator has been programmed in C [7]. This simulator keeps track of which cachelines have been accessed and if subsequent accesses would result in cache hits or cache misses. PIN is used to generate an address trace of instructions and that trace is then fed to the simulator. The simulator then returns statistics about the benchmark which show the behavior of D2M. PIN was selected because of its speed and compatibility with cache simulators. This is how we are able to simulate entire program execution and produce performance results for D2M.

2 Cache Primer

2.1 Caching Basics

The purpose of caching is to hide the latency of loading data from memory onto the CPU. As such, the cache physically sits in-between the CPU and memory.

The main concept is that after a piece of data has been loaded onto the CPU, a copy of it is kept in the cache in hopes that it will be accessed again. In this way, subsequent accesses can be served to the CPU faster [4].

In theory the implementation of caches is straightforward: after data has been used by the CPU, put a copy of it on a separate piece of memory close to the CPU so that it can be reused without paying the penalty of being loaded from memory twice. The problem is this separate piece of memory, the cache, has limited space. Therefore, the objective is to use the space in the cache as efficiently as possible. For example, consider a program that accesses four pieces of data in a loop and a cache that can hold four pieces of data. The first time through the loop all of the data will be loaded into the cache and for every access afterwards there will be cache hits. Or, in other words, the data in the cache will be used. This results in the program executing faster.

It is rarely the case, however, that all of the data for a program fits into the cache. When a program uses more data than the cache capacity, decisions have to be made as to which data is kept in the cache and which data is copied back to memory, i.e. evicted from the cache. The risk of this is evicting data from the cache that will then be used later. For example, if we change the program to loop over five pieces of data, then on the fifth access a piece of data will have to be evicted from the cache. If the first piece of data is evicted, then there will be a cache miss when the next iteration starts. Even worse, if this pattern continues, evicting data right before it is used, there will not be any cache hits and there will be no benefit from the cache at all. This is the worst case scenario for caches in general and, in this example, it is the result of a bad eviction policy.


This is the basic idea behind caching. These small examples show how different access patterns can result in either dramatic program speedup, or no speedup whatsoever, depending on the type of cache being used.

2.2 Temporal and Spatial Locality

Two important concepts in caching are temporal and spatial locality. The property of temporal locality states that data that has just been used will most likely be used again relatively soon. Therefore it is beneficial to load data into the cache after it has just been used. Similarly, the property of spatial locality states that data located near data that is currently being accessed will most likely be accessed soon. For example, arrays of data are likely to be accessed sequentially.

Since arrays are usually physically stored together, it makes sense to load data into the cache in chunks. This is why it is common practice to group pieces of data together in cachelines. In traditional cache design when a word of data is requested the entire cacheline of either four or eight words is brought into the cache [4].

2.3 Direct-mapped Cache

A direct-mapped cache is one of the earliest and simplest cache designs. It illustrates how data can be organized on a separate piece of memory close to the CPU. Instead of going straight to memory to load data, the CPU first sends the memory address to the cache, and if the cache has a valid entry it returns the data to the CPU. The data that is returned to the CPU is the data that would be found at that address in memory. If the data is not found in the cache, then the data is loaded from memory and copied into the cache in hopes that it will be used.

As seen in Figure 1, a direct-mapped cache has three fields: a valid field, a tag field and a data field. The valid field determines if the tag and data fields are valid and requires one bit. The tag field contains a section of the memory address called the address tag. The length of the tag field depends on the cache capacity. In this example, the address tag is 20 bits and it is the leftmost 20 bits of the memory address. The data field holds a copy of the data found at the memory address. In this example the data field is 32 bits. Additionally, 10 bits from the memory address are used to select a position in the cache, known as the cache index. The cache index and address tag are used together to uniquely identify each memory address. To illustrate, a memory address will always map to the same index; because of this, the entire memory address does not need to be stored in the cache. Only the bits needed to identify each cacheline uniquely need to be stored in the cache, which, in this case, is the rest of the memory address. The two rightmost bits are used as a byte offset and are not needed by the cache.

Figure 1: A Direct-mapped Cache
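To make the address breakdown concrete, the sketch below mirrors the fields in Figure 1: a 32-bit address split into a 20-bit tag, a 10-bit index and a 2-bit byte offset, with 1024 entries of one 32-bit word each. It is an illustrative model only, not code from the thesis simulator; all names are hypothetical.

#include <cstdint>
#include <optional>

// One entry of the direct-mapped cache in Figure 1.
struct DMEntry {
    bool     valid = false;
    uint32_t tag   = 0;    // upper 20 bits of the address
    uint32_t data  = 0;    // one 32-bit word
};

struct DirectMappedCache {
    DMEntry entries[1024];                      // 2^10 indexable positions

    std::optional<uint32_t> lookup(uint32_t addr) const {
        uint32_t index = (addr >> 2) & 0x3FF;   // bits [11:2]: cache index
        uint32_t tag   = addr >> 12;            // bits [31:12]: address tag
        const DMEntry& e = entries[index];
        if (e.valid && e.tag == tag)
            return e.data;                      // hit: word served to the CPU
        return std::nullopt;                    // miss: load from memory, then fill
    }
};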

2.4 Set-associative Cache

One drawback to the direct-mapped cache design is that if two memory addresses index to the same position in the cache, only one of those entries can be kept at a time. This is known as a cache conflict. In the event of a conflict, one entry has to be evicted to make room for the other. If the CPU goes back and forth requesting two conflicting addresses, they will take turns evicting each other.

This is bad for cache performance, because the data in the cache is never used.

Figure 2: A Set-associative Cache

The problem of cache conflicts can be mitigated by using multiple direct-mapped caches together, as shown in Figure 2. In this way a set-associative cache can be built. This set-associative cache can handle up to four cache conflicts before data has to be evicted. This means that the cache is more flexible in terms of where it can place data, but this also requires that all four entries be searched. This typically can be done in parallel to minimize cache latency, but requires more energy [4].

Since there are four possible locations for each cache index, the cache only needs eight bits of the memory address to index into the cache. This means that 22 bits are needed in the tag field to uniquely identify each cacheline in the cache.
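Following the same illustrative style, a 4-way version of the lookup only changes the index and tag widths (8 and 22 bits) and probes all four ways of the selected set; in hardware the four comparisons happen in parallel, and the loop below is just a sequential stand-in. Again, the names are hypothetical.

#include <cstdint>
#include <optional>

struct Way { bool valid = false; uint32_t tag = 0; uint32_t data = 0; };

struct SetAssociativeCache {
    static constexpr int kWays = 4;
    Way sets[256][kWays];                       // 2^8 sets of four ways

    std::optional<uint32_t> lookup(uint32_t addr) const {
        uint32_t index = (addr >> 2) & 0xFF;    // bits [9:2]: set index
        uint32_t tag   = addr >> 10;            // bits [31:10]: 22-bit tag
        for (int w = 0; w < kWays; ++w) {       // probe every way of the set
            const Way& way = sets[index][w];
            if (way.valid && way.tag == tag)
                return way.data;                // hit in way w
        }
        return std::nullopt;                    // conflict or capacity miss
    }
};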


2.5 Cache Hierarchy

A challenge in cache design is to design a cache that is both large and fast. These properties, however, oppose each other. The larger the cache is, the longer it takes to search.

The standard solution to this problem is to combine several caches together to form a cache hierarchy. A cache hierarchy is usually built with a small, fast cache at the top level and several larger, slower levels underneath. The first level gives the CPU fast access to data, and the other levels act as backup storage for the first level. The benefit comes from the fact that memory is so slow that it is faster to load data from the last level of the cache hierarchy than from memory.

This is how the cache hierarchy attempts to act as a large and fast cache.

A common practice to increase performance on multi-core chips is to give each core its own local cache. These local caches usually have a one- or two-level cache hierarchy that works together with a larger shared third-level or last-level cache (LLC). An illustration of this is shown in Figure 3.

Figure 3: Multi-core Cache Hierarchy

3 D2M Cache Overview

This thesis is based on the paper A Split Cache Hierarchy for Enabling Data-Oriented Optimizations, published in 2017 at the HPCA conference in Austin, Texas by members of the UART group in Uppsala, Sweden [5]. The Split Cache Hierarchy paper builds on two earlier publications: (1) TLC: A Tag-less Cache for Reducing Dynamic First Level Cache Energy [6] and (2) The Direct-to-Data (D2D) Cache: Navigating the Cache Hierarchy with a Single Lookup [7].

The Tag-less Cache paper introduces the concept of adding information to the translation lookaside buffer (TLB) to eliminate tag comparisons, and the D2D paper introduces the concept of a single direct lookup to navigate across the cache hierarchy. The Split Cache Hierarchy paper combines and extends these principles to a multicore design called D2M.

The D2M design splits the traditional cache hierarchy into two parts, one part for storing data called the data hierarchy, and another part for keeping track of that data, the metadata hierarchy (MD). This is why the D2M design is referred to as a split cache.

Tracking in the MD hierarchy is accomplished by grouping cachelines together into regions. This grouping reduces the tracking overhead in the MD hierarchy because cachelines grouped into the same region are tracked by a single MD entry. Each MD entry has a tag and a set of cacheline pointers which point to the location of each cacheline in the data hierarchy. These cacheline pointers are referred to as Location Information (LI) pointers. In this way, a cacheline’s location can be determined across the entire data hierarchy by a single lookup in the MD hierarchy, removing the need for searches and tags in the data hierarchy.
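A minimal sketch of what one MD entry could look like under this description: a single region tag shared by a region's worth of LI pointers. The field names, widths and the default region size of 16 (the value used in the D2M paper) are illustrative assumptions, not the thesis' actual data layout.

#include <cstdint>
#include <bitset>

constexpr int kRegionSize = 16;        // cachelines per region; an input parameter in this thesis

struct MDEntry {
    uint64_t region_tag = 0;           // one tag for the whole region
    bool     valid      = false;
    bool     private_region = false;   // private/shared classification (Section 3)
    uint8_t  li[kRegionSize] = {};     // 6-bit location information per cacheline (Table 1)
    std::bitset<kRegionSize> used;     // pointers touched since creation (Section 5.2)
};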

Table 1 shows an example encoding from the Split Cache Hierarchy Paper [5].

Using 6 bits, an encoding can be made to identify the cache level and associative way where data can be found in the data hierarchy.

Encoding   Meaning
000NNN     In NodeID: NNN
001WWW     In L1, way = WWW
010WWW     In L2, way = WWW
011SSS     Encoding of eight symbols, one of which is for memory
1WWWWW     In LLC, way = WWWWW

Table 1: Encoding for LI in the D2M Design

Additionally, the LI in the MD hierarchy is deterministic. This means that the LI always points to valid data in the data hierarchy. If the MD for a cacheline does not exist then it is guaranteed not to be in the data hierarchy. This property gives the D2M design lower latency on cache misses.
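Decoding the 6-bit LI field of Table 1 can be sketched as below. The thesis does not spell out the eight "011SSS" symbols beyond the fact that one of them denotes memory, so treating symbol 0 as memory is an assumption made purely for illustration.

#include <cstdint>

enum class Loc { RemoteNode, L1, L2, Memory, OtherSymbol, LLC };

struct DecodedLI {
    Loc     where;
    uint8_t arg;     // node id, way number, or symbol, depending on `where`
};

DecodedLI decode_li(uint8_t li) {               // li holds 6 valid bits
    if (li & 0x20)                              // 1WWWWW: in the LLC, way WWWWW
        return {Loc::LLC, static_cast<uint8_t>(li & 0x1F)};
    uint8_t low = li & 0x07;                    // NNN / WWW / SSS field
    switch ((li >> 3) & 0x03) {
        case 0:  return {Loc::RemoteNode, low}; // 000NNN: in node NNN
        case 1:  return {Loc::L1, low};         // 001WWW: in L1, way WWW
        case 2:  return {Loc::L2, low};         // 010WWW: in L2, way WWW
        default: return {low == 0 ? Loc::Memory : Loc::OtherSymbol, low}; // 011SSS
    }
}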

Figure 4 shows a simplified illustration of the Direct-to-Data design, with two levels of MD, two levels of data and an example of what a region looks like with one virtual address (VA) tag and three pieces of LI. Here we see that the LI from the region points to different places where data can be found. Even though the data can be found in different places, the tracking information is bundled together in the MD hierarchy into one region.

D2D is extended to a multicore design by giving each core a local D2D cache and adding a shared third level of metadata (MD3) along with a shared LLC.

As illustrated in Figure 5, data can be tracked by a local MD across the entire data hierarchy. The LI can even point to data in a remote core as indicated by the red arrow.

One extension to the MD3 is made to track the cores with active MD entries for a region; these are called presence bits (PB). The presence bits are responsible for keeping track of which cores have active MD entries for a region so that invalidation messages and coherence traffic do not have to be broadcast to every core. This is represented by the grey arrows in Figure 5.

Figure 4: Simplified Direct-to-Data Cache Hierarchy

Figure 5: Simplified Direct-to-Master Cache Hierarchy

Additionally, the D2M design implements a dynamic coherence optimization on top of standard coherence protocols. This is accomplished by classifying each region as either private or shared. Regions tracked by only one core are private, and regions tracked by multiple cores are shared. If the region is private, operations can be made on this region without synchronizing with the other cores. If the region is shared, then coherence measures have to be followed to ensure the consistency of data.

This is how D2M removes the need for traditional searching found in other cache designs. To recap, D2M uses the LI stored in the MD hierarchy to locate cachelines in the data hierarchy.

3.1 D2M Cache Implementation

The following sections give a more detailed description of D2M. These explanations are specific to the D2M implementation used for this thesis. There are some differences between this implementation and the specification in the D2M paper: the addresses used in the simulator are all virtual addresses, and the L1 instruction cache and L1 data cache are not modeled separately.

These modifications were accepted because it was determined that they would not be dominating factors in the results.

3.2 MD1 Search

When the cache simulator is called the local MD1 is searched first. If there is no hit, the local MD2 is searched. If there is no hit that means the data is not present in the core’s local cache and therefore the shared MD3 is searched. If there is no hit in MD3, then a MD3 entry is created. Then MD entries are created for MD1 and the data is installed in L1 on the local core. If there is no space in the MD1 or L1, entries can be evicted to the lower levels of the hierarchies.
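The search order described above can be summarized by the following control-flow sketch. The helper functions are empty stand-ins for the simulator's internal state; only the order of the lookups and the actions on a complete miss follow the text.

#include <cstdint>

// Hypothetical stubs; the real simulator indexes its MD structures here.
static bool md1_hit(int core, uint64_t va)         { (void)core; (void)va; return false; }
static bool md2_hit(int core, uint64_t va)         { (void)core; (void)va; return false; }
static bool md3_hit(uint64_t va)                   { (void)va; return false; }
static void create_md3_entry(uint64_t va)          { (void)va; }
static void create_md1_entry(int core, uint64_t va){ (void)core; (void)va; }
static void install_in_l1(int core, uint64_t va)   { (void)core; (void)va; }

enum class MDLevel { MD1, MD2, MD3, Miss };

MDLevel md_lookup(int core, uint64_t va) {
    if (md1_hit(core, va)) return MDLevel::MD1;   // local first-level metadata
    if (md2_hit(core, va)) return MDLevel::MD2;   // local second level; entry moves up on a hit
    if (md3_hit(va))       return MDLevel::MD3;   // shared level; PB consulted for coherence
    create_md3_entry(va);                         // region was uncached
    create_md1_entry(core, va);                   // may force evictions if MD1 is full
    install_in_l1(core, va);                      // data installed on the local core
    return MDLevel::Miss;
}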

A search in MD1 starts by finding the MD1 index for the virtual address.

The tag from the memory access is compared with the tags stored in MD1 and if they match the LI for that cacheline is extracted from the MD1 entry. The LI is then decoded to determine where in the data hierarchy the data is located.

Data can be found in one of five places:

• L1: first level local cache

• L2: second level local cache

• L3: third level shared cache

• Memory

• A remote core

3.2.1 L1 Hit

If the data is found in L1, which is the most common case, then the private bit for the region is checked. If the region is private, the L1 associative way is extracted from the LI, the L1 index is computed, and then that data entry is located and validated, regardless of whether the access is a read or a write. Then a MD1 L1 hit is returned for the access.

If the region is shared, then the access type has to be taken into account.

If the access type is a read, then the data can be validated as if the region were private. This is because even though the region is shared, there is a copy of the data in the local core which is guaranteed to be valid. If another core were to write to that cacheline, it would have to invalidate it first. Therefore, if the data is in the local L1 cache, it can be read. If the access is a write, however, then invalidation messages have to be sent to the other cores. This is done by removing the cacheline from the remote core and changing the MD entry LI to point to the core making the invalidation. This is how coherency is maintained for shared writes in the D2M design.

3.2.2 L2 Hit

The L2 hit works exactly like the L1 hit, except that at the end of the access the data is moved up into the L1 and the MD entry is updated. This operation does not require any coherence messages to be sent because cores can freely move data in their local L1 and L2 caches.

3.2.3 L3 Hit

An L3 hit is similar to an L1 and L2 hit. First the L3 associative way is decoded from the MD1 LI, after which special attention has to be paid to maintain coherence.

Since the cacheline is located in L3, no core can have a local copy of the cacheline. If a core did have a local copy of the cacheline, the LI would have to point to that core. However, other cores may have MD entries for this region. So when a cacheline is moved from L3 to a local core, the PB of the MD3 entry have to be checked to see if any other cores have MD entries for that region. If no other cores have MD entries, then no invalidations need to be sent. If there are other cores with MD entries for this region, then the LI for that cacheline has to be updated to point to the core that has the data. Finally, the core also has to be added to the MD3 entry's PB.

3.2.4 Memory Hit

A hit to memory works the same way as a hit to the L3. If the LI points to memory then the data is simply added to the L1 and the MD entry is updated.

Similarly to an L3 hit, since the data could still be tracked by MD entries in other cores, the PB in the MD3 entry need to be examined. If there are any other cores with MD entries, they have to be updated to point to the core with the data.

3.2.5 Remote Hit

If the LI from MD1 points to a remote core this means that the cacheline is in a region where data is shared with another core. Since the MD entries keep track of private and shared cachelines on a per region basis, it is impossible to know which cacheline or cachelines are shared. Therefore the PB of the MD3 entry need to be examined to see which other cores have copies of the region. Based on the PB and the access type the cacheline can either be copied to the local cache, or invalidation messages have to be sent.

If the access type is a read then the cacheline can be duplicated in both cores, and as long as they only read the cacheline there is no need to send invalidations.

If the access type is a write then the cacheline will become invalid on the other cores. Therefore the core has to send invalidations to any other core that might have a copy of the cacheline.

3.3 MD2 Search

The MD2 search is similar to the MD1 search. First the MD2 index is computed from the virtual address. If a MD2 entry is found then this is a MD2 hit. Next the LI is extracted from the MD2 entry to determine where the data is located in the data hierarchy. These options are identical to the MD1 search and the logic is the same. The only difference between a MD1 hit and a MD2 hit is that after the MD2 hit the MD2 entry is moved up into the MD1. This does not require any movement of data in the cache because all the LI pointers stay the same.

3.4 MD3 Search

The MD3 search is more complicated than the other MD searches because of the complexity of maintaining coherence between all of the cores, in particular writes to shared regions. First the MD3 entries are searched. If an entry is found the PB are extracted from the entry to determine what state the region is in.

The PB are an encoding to show which cores have local MD2 entries for that region. So the number and position of set bits indicates which cores have copies of the region. The region can be in one of four states:

• Uncached: There is no MD3 entry.

• Untracked: The number of PB set is zero and no cores have any MD entries for that region.

• Private: The number of PB set is one and one core has a MD entry for that region.

• Shared: The number of PB is greater than one and multiple cores have MD entries for that region.
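The classification above boils down to a popcount over the presence bits, as in the sketch below. An 8-core configuration is assumed here because that is the design modeled in this thesis; everything else is illustrative.

#include <bitset>

enum class RegionState { Uncached, Untracked, Private, Shared };

RegionState classify_region(bool has_md3_entry, std::bitset<8> pb) {
    if (!has_md3_entry) return RegionState::Uncached;   // no MD3 entry at all
    switch (pb.count()) {                                // number of cores tracking the region
        case 0:  return RegionState::Untracked;
        case 1:  return RegionState::Private;
        default: return RegionState::Shared;
    }
}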

3.4.1 Uncached

Uncached means that there is no MD3 entry for this region. This is when the MD3 search returns a miss and a MD3 entry is created.

3.4.2 Untracked

If there are no PB set in the MD3 region, this means that no cores have any copies of this region. This can happen when all of the MD2 entries for a region have been evicted, leaving only an active MD3 entry.

Even though none of the cores have data, there are still two possibilities of where data can be, either in memory or in L3. To determine which is the case the LI is extracted from the MD3 entry. If the cacheline is found in memory, a MD1 entry is created from the MD3 entry and the cacheline is installed into L1.

Then the core is added to the PB in the MD3 entry and the private bit for MD1 is set. If the cacheline is found in L3, the same procedure is followed as before, plus the L3 entry is removed afterwards.


3.4.3 Private

If the MD3 region has one bit set in the PB that means that only one core has a MD entry for that region. Furthermore it also means that that core has private access to that region. However, since MD3 is being searched that means that the core where the access started does not have a valid MD for this region.

Therefore, this is the point where a region transitions from a private region to a shared region.

The first thing to check when moving from a private to a shared region is whether or not the access is a read or a write. If it is a read then the data can be copied to both cores where it will hopefully see reuse. To do this the MD is copied from the remote core to the MD3 entry. Then a MD1 entry is created in the local core from the MD3 entry and the cacheline is installed in L1.

If it is a write then invalidation messages need to be sent to the other core.

First the PB is decoded from the MD3 entry to see which core has an active MD entry. Then the LI is updated in the remote core to point to the local core. After that, the same procedure is followed as if the access were a read.

3.4.4 Shared

If more than one PB is set in the MD3, the region is already in a shared state, meaning that multiple cores can have copies of the cachelines in that region. As before, the access type is checked to determine if it is a read or a write.

If the access is a read the MD entry is copied from the remote core to MD3 where it is used to create a MD1 entry for the local core. Then the cacheline is replicated and added to L1. Finally the core is added to the PB for the MD3 entry.

If the access is a write then invalidation messages are sent out to all of the cores that could have a copy of the cacheline, and then the same procedure takes place as if it were a read.

4 Modeling the Region Granularity of D2M

Modeling the region granularity of D2M is accomplished by implementing a model of the D2M cache as specified in A Split Cache Hierarchy for Enabling Data-Oriented Optimizations [5]. The paper outlines the design for D2M with 8 cores, and specifies the behavior of the cache. In the paper D2M uses a region size of 16. Here the region size is an input parameter to the cache simulator.

PIN is used to collect memory accesses into an address trace. This trace is then fed to the D2M simulator. At the end of a simulation, a series of statistics are generated to show the behavior of the program. From these statistics we are able to reason about the behavior of D2M.

4.1 PIN

PIN works by inserting extra instructions into an executable. These PIN instructions are added dynamically while the program is running and execute with the program. In this way, PIN can take a benchmark program and insert instructions to call the cache simulator before every memory access that occurs.
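A minimal Pin-tool sketch of this idea is shown below. The instrumentation calls follow Pin's public API (as in Pin's own memory-trace example); the analysis routine feed_simulator() is a hypothetical stand-in for the entry point of the D2M simulator described in this thesis.

#include "pin.H"

// Called before every memory operand of every executed instruction.
VOID feed_simulator(VOID* addr, BOOL is_write) {
    // Here the address and access type would be handed to the D2M cache model.
}

VOID Instruction(INS ins, VOID* v) {
    UINT32 memOperands = INS_MemoryOperandCount(ins);
    for (UINT32 op = 0; op < memOperands; op++) {
        if (INS_MemoryOperandIsRead(ins, op))
            INS_InsertPredicatedCall(ins, IPOINT_BEFORE, (AFUNPTR)feed_simulator,
                                     IARG_MEMORYOP_EA, op, IARG_BOOL, FALSE, IARG_END);
        if (INS_MemoryOperandIsWritten(ins, op))
            INS_InsertPredicatedCall(ins, IPOINT_BEFORE, (AFUNPTR)feed_simulator,
                                     IARG_MEMORYOP_EA, op, IARG_BOOL, TRUE, IARG_END);
    }
}

int main(int argc, char* argv[]) {
    if (PIN_Init(argc, argv)) return 1;
    INS_AddInstrumentFunction(Instruction, 0);
    PIN_StartProgram();   // never returns
    return 0;
}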


The motivation for using PIN came from the desire to simulate benchmark programs in a manageable amount of time. PIN is faster than full-system simulators like Gem5 because it does not model the full system. PIN collects the address trace of an application as it executes, without modeling all of the hardware interactions of a processor. Generally, the task of simulating an entire benchmark with Gem5 would take multiple days to finish, which is too long for prototyping cache designs [2] [8].

PIN could be set up to model the D2M behavior being studied. This method was more convenient and faster than setting up a Gem5 simulation. In this way, we were able to simulate the benchmark programs from start to finish. On average each simulation took about four hours to run, with some applications taking up to eight hours. This is how we were able to generate statistics about D2M behavior and evaluate the effects of the different region sizes.

4.2 Benchmark Suites

Traditionally benchmarks are used to time how long different architectures take to finish executing a program. We are not interested in the amount of time the programs take to finish. We are interested in the behavior of D2M. For this reason we do not actually time the benchmarks, we only run them with the simulator to generate statistics.

4.2.1 PARSEC 3.0

The PARSEC 3.0 benchmark suite is a well-known and widely used series of multithreaded programs used in the computer architecture research community [1].

This benchmark suite comprises a series of programs which cover a variety of different topics in computing.

Benchmark      Description                              Input Size
blackscholes   Partial differential equation solver     simlarge
bodytrack      Body tracking of a person on video       simlarge
canneal        Simulated cache-aware annealing          simsmall
facesim        Simulates the motions of a human face    simsmall
ferret         Content similarity search server         simmedium
fluidanimate   Fluid dynamics for animation purposes    simmedium
freqmine       Frequent itemset mining                  simmedium
raytrace       Real-time raytracing for graphics        simsmall
streamcluster  Clustering of an input stream            simlarge
vips           Image processing                         simlarge

Table 2: Description of benchmarks from the PARSEC 3.0 suite

The included benchmarks from the PARSEC 3.0 suite along with the input size options are shown in Table 2. For more information about each benchmark and the details about the different input options, please refer to the PARSEC 3.0 manual [1].


4.2.2 SPEC CPU 2006

The SPEC CPU 2006 benchmark suite is another well-known set of real-life applications [3]. The benchmarks included in this thesis are shown in Table 3. All of the SPEC benchmarks were run with the test input size. For more information about these benchmarks, please refer to the SPEC CPU 2006 publication [3].

Benchmark   Description
gcc         C language optimizing compiler
gemsFDTD    Computational Electromagnetics solver
gromacs     Chemical and molecular dynamics simulator
lbm         Computational fluid dynamics simulator
mcf         Combinatorial optimization

Table 3: Description of benchmarks from the SPEC CPU 2006 suite

4.3 Area-neutral Design Comparison

To simulate D2M with different region sizes, each benchmark was tested against four versions of the D2M simulator, each with a different number of cachelines per region. Groups of 8, 16, 32, and 64 cachelines per region were tested. In order to compare the different region sizes with roughly the same implementation cost, the capacities for MD1, MD2 and MD3 were adjusted to achieve an area-neutral design comparison. This comparison assumed 56 bits for a physical address and 38 bits for a virtual address in the MD hierarchy.

Region size   MD1 entries   MD2 entries   MD3 entries
8             192           6k            24.5k
16            128           4k            16k
32            80            2.5k          10k
64            48            1.5k          6k

Table 4: Number of MD entries for different region sizes

Table 4 shows the number of entries for the MD hierarchies with different region sizes. The number of MD entries does not scale linearly with the changes in region size. This is because there are different overheads associated with different region sizes.

The dominating factor in the MD overhead is the tag. To illustrate, with a region size of 8, each MD entry has one tag for every 8 cacheline pointers. For a region size of 64, each MD entry still has one tag, but that one tag is attached to 64 cacheline pointers. This means that for a region size of 64, D2M uses about an eighth as much space on MD tags as it would with a region size of 8. All of this space that is not used on tags can be used for more total MD entries which gives a larger MD reach.
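As a back-of-the-envelope illustration of this amortization (the tag width t is left symbolic and is an assumption of this sketch; the 6-bit LI width follows Table 1):

\[
  \text{bits per MD entry} \approx t + 6R
  \quad\Longrightarrow\quad
  \text{bits per tracked cacheline} \approx \frac{t}{R} + 6 ,
\]

so going from R = 8 to R = 64 divides the amortized tag cost t/R by eight, which is what lets the area-neutral configurations in Table 4 trade fewer, wider entries for a larger total reach.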

Even though there are only 48 entries in MD1 with a region size of 64, each MD entry keeps track of 64 cachelines. This means that a total of 3072 cachelines are tracked from MD1, whereas the MD1 with a region size of 8 can only track 1536 cachelines. Table 5 shows the MD reach in megabytes for each D2M configuration.

Region size   Region reach   MD1 reach   MD2 reach   MD3 reach
8             0.5 KB         1 MB        3 MB        12 MB
16            1 KB           1.3 MB      4 MB        16 MB
32            2 KB           1.6 MB      5 MB        20 MB
64            4 KB           1.9 MB      6 MB        24 MB

Table 5: MD reach for different region sizes

5 Results

The D2M simulator generates a variety of statistics, such as reads, read misses, writes, and write misses, that are used to evaluate traditional caches. Additionally, hits to L1, L2, L3, memory and remote cores from MD1, MD2 and MD3 are calculated. An overview of the statistics is displayed in Table 6. The D2M simulator also gathers statistics about used cachelines on MD1 evictions and PB state on MD3 evictions.

MD1 hits to:   MD2 hits to:   MD3 hits to:
L1             L1             L3
L2             L2             n/a
L3             L3             n/a
mem            mem            mem
rem            rem            rem

Table 6: Matrix of D2M statistics

In the following sections the results are measured in hits per kilo memory operations (PKMO).
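Explicitly, for any event e counted by the simulator:

\[
  \text{PKMO}(e) = 1000 \times \frac{\text{number of occurrences of } e}{\text{total number of memory operations}}
\]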

5.1 MD1 Results

Each benchmark can be classified in one of three ways in terms of MD1 performance: a benchmark sees more MD1 traffic as the region size decreases, more as it increases, or there is a peak somewhere in between. Out of the fifteen benchmarks tested from the PARSEC 3.0 and SPEC CPU 2006 suites, three of the benchmarks have more MD1 traffic with the region size 8. Four of the benchmarks have more traffic with the region size 64, and the other eight benchmarks peaked with a region size of 16 or 32. These results show that the common case is to have the most MD1 traffic with a region size of either 16 or 32.

Figure 6 shows the results of the benchmarks that have peak MD1 traffic with either the region size 8 or 64. The four benchmarks on the left have more MD1 traffic with the region size 64. They are: mcf, lbm, bodytrack and fluidanimate.

Figure 6: MD1 Traffic (PKMO) Appendix A

These applications benefit from the increased reach in the MD hierarchy. They access data in a sequential manner, which means that the larger the region size, the more cachelines are accessed on average.

The three benchmarks on the right perform best with the region size 8. They are: blackscholes, raytrace, and canneal. These applications do not benefit from the extra reach in the MD hierarchy. In fact they pay for it. These applications access data with a sparse access pattern, which means that with larger region sizes they do not use as many cachelines per region.


Figure 7: MD1 Traffic (PKMO) Appendix A

The benchmarks shown in Figure 7 see a peak in MD1 traffic with a region size of either 16 or 32. Interestingly, these are also the benchmarks that are least affected by changes in region size in terms of MD1 traffic.

Table 7 shows the differences in MD1 traffic for every benchmark regardless of region size. These numbers show the MD1 activity in the best case, the MD1 activity in the worst case and the range between them. In particular, the range shows the effect that the different region sizes can have on a benchmark. The benchmarks that have the most MD1 traffic with a region size of 16 or 32 are also the benchmarks that are least affected by changes in region size, highlighted in red.

Overall, ten of the fifteen benchmarks had more than 980 hits from MD1 for every 1000 memory operations on average in the worst case. That means that regardless of region size, most of the benchmarks tested saw 98% of their activity from MD1. Four benchmarks: lbm, canneal, bodytrack and gems, highlighted in blue, fell into the range of 940 to 980 MD1 hits PKMO. One benchmark, mcf, highlighted in grey, was an outlier in terms of MD1 performance with all of the region size tests falling in the range of 780 to 800 MD1 hits PKMO.


Benchmark Best Case Worst Case Range

lbm 998 963 35

mcf 808 780 28

canneal 971 945 26

black 1000 983 16

body 994 978 16

raytrace 999 988 11

gcc 994 983 11

freqmine 998 989 9

gems 973 967 5

facesim 998 993 5

ferret 997 994 3

gromacs 992 990 3

streamcluster 996 993 3

fluidanimate 999 997 2

vips 998 997 2

Table 7: MD1 Traffic (PKMO)

5.2 MD1 Evictions

For every MD1 hit to L1, L2, L3, or memory, the LI for each cacheline is marked as used. When the region is evicted the D2M simulator computes how many cacheline pointers have been used and increments a count of total MD1 evictions.

In this way, the average number of used MD1 cacheline pointers is calculated.

To clarify, the first access when the region is created is not counted; only hits after the region has been created are computed.
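A sketch of this bookkeeping, under the same illustrative assumptions as the earlier MD entry sketch (the real simulator's structures may differ):

#include <bitset>

constexpr int kRegionSize = 16;

struct MD1UseTracker {
    std::bitset<kRegionSize> used;               // set on every MD1 hit to L1/L2/L3/memory
    // Running totals across all evicted MD1 entries.
    static inline unsigned long long evictions = 0;
    static inline unsigned long long used_pointers = 0;

    void on_md1_hit(int cacheline_in_region) {
        used.set(cacheline_in_region);           // the installing access itself is not counted
    }
    void on_eviction() {
        ++evictions;
        used_pointers += used.count();
        used.reset();
    }
    static double average_use() {                // e.g. the values in Tables 10 and 22
        return evictions ? double(used_pointers) / double(evictions) : 0.0;
    }
};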

Figure 8 and Figure 9 show that the number of used cachelines on MD1 evictions reflects the amount of MD1 traffic. For example, streamcluster has the highest number of used cacheline pointers with a region size of 32 and it also has the most MD1 traffic with a region size of 32. Two outliers that do not follow this pattern are gromacs and canneal.


Figure 8: MD1 Use (PKMO) Appendix B

Figure 9: MD1 Use (PKMO) Appendix B


5.3 MD3 Evictions

Due to the deterministic property of the metadata and the forced eviction policy of D2M, a MD3 eviction means that all of the data for that region has to be removed from the data hierarchy. On MD3 evictions the D2M simulator records if the region is untracked, tracked by one node, or multiple nodes. This information about PB state on MD3 evictions shows the penalty D2M pays for having forced evictions and deterministic metadata.

rg MD3 evictions Untracked Private Shared

8 7.9 7.7 0.2 0

16 4.3 4.2 0.1 0

32 2.3 2.2 0.2 0

64 1.3 1.1 0.2 0

Table 8: PB state on MD3 eviction (PKMO)

Table 8 shows the average number of MD3 evictions and the state that the regions were in when they were evicted. On average there were more MD3 evictions for the smaller region sizes. This is because the smaller region sizes have a smaller MD1/MD2 reach. Regardless of region size, however, most of the regions were untracked at the time of eviction, meaning that no forced evictions had to take place. The column labeled private shows that approximately 0.2 of every 1000 memory operations resulted in a forced eviction from one core. The column for shared evictions shows that there are statistically no regions that are tracked by more than one core at the time of eviction. These results point to the conclusion that D2M does not often pay the penalty for its forced eviction policy.

5.4 Benchmark Evaluation

The benchmarks are evaluated in three groups. The first group consists of the benchmarks that had the most MD1 traffic with the region size of 8. The second group consists of the benchmarks that had the most MD1 traffic with a region size of 16 or 32. These two region sizes are evaluated together because the difference between the two is always less than 2 PKMO. The third group consists of the benchmarks that had the most MD1 traffic with the region size 64. In each group we evaluate trends, identify outliers and reason about the D2M behavior.

When evaluating the MD traffic, the hits from MD1, MD2, MD3 and cache misses will always add up to 1000. This is because the measurements are normalized to hits per 1000 memory operations (PKMO). In this way, the distribution of hits to the different MD structures is analyzed. Additionally, since these are hierarchical structures, MD2 hits can be seen as MD1 misses. Similarly, MD3 hits can be seen as MD1 and MD2 misses. Also, some of the results may be omitted if they do not show meaningful data; for example, if there are no results for MD3 traffic, the table is not shown.


5.5 Region size 8

The tests for blackscholes, raytrace and canneal all see the most MD1 activity with a region size of 8 cachelines. This means these benchmarks have a sparse access pattern. In other words they need many MD entries to keep track of data with high granularity. Table 9 shows the distribution of hits for these benchmarks across the MD hierarchy.

Notably, blackscholes and raytrace see 99.9% of their MD activity from MD1 with the region size 8. This shows that the MD1 reach of 1 MB is enough to effectively track the data in the data hierarchy. With larger region sizes, however, MD1 is not able to track the data as efficiently. We see this in the correlation between MD2 traffic and region size. With larger region sizes, the MD entries are not able to use all of their LI pointers, and when this happens the MD1 is filled with unused space and begins to miss.

blackscholes raytrace canneal

rg md1 md2 md3 miss md1 md2 md3 miss md1 md2 md3 miss

8 999 0.23 0.01 0.01 999 0.4 0.07 0.15 965 22 8.5 4.6

16 999 0.99 0 0 999 0.8 0.04 0.07 960 29 8.5 2.6

32 992 7.59 0 0 996 3.8 0.02 0.03 949 41 8.2 1.6

64 982 17.3 0 0 993 7.2 0.01 0.02 938 53 7.8 0.9

Table 9: MD traffic overview (PKMO)

To illustrate this, Table 10 shows the average number of used cachelines for MD1 for these benchmarks. These results show that blackscholes uses all of the cacheline pointers on average with a region size of 8, but as the region size increases the average drops. Similarly for raytrace, these results show that as the region size increases the average declines, though not as dramatically as with blackscholes.

rg black raytrace canneal

8 8.0 4.8 1.3

16 2.4 4.4 1.3

32 0.7 2.3 1.3

64 0.3 1.5 1.2

Table 10: MD1 Cacheline Use

The MD1 cacheline use for canneal shows that something different is happening. Regardless of region size, canneal uses between 1.2 and 1.3 cachelines on average. Despite the number of used cachelines being about the same, MD1 traffic is 27 PKMO higher with the region size 8 than with 64.

To understand the behavior of this benchmark, as well as to show the general behavior of this group of benchmarks, we perform a more detailed evaluation of the canneal benchmark.

5.5.1 canneal

From the canneal results in Table 9 three characteristics can be observed: (1) the MD traffic from MD1 moves to MD2 as the region size increases, (2) the traffic to MD3 stays within the range 7.8 to 8.5 and (3) there are more cache misses with the smaller region sizes, despite there being more MD1 traffic.

The MD1 results in Table 11 show the breakdown of MD1 hits to L1, L2, L3, and memory. The MD1 results are primarily made up of L1 hits, with the region size of 8 having the most L1 hits. As the region size increases, D2M is unable to track data in L1 as well and this is the reason why there is an overall decrease in MD1 traffic with larger region sizes. Also, the MD1 hits to memory show the benefit of spatial locality from larger region sizes. However, this does not make up for the loss of traffic to L1.

rg md1 md1 l1 md1 l2 md1 l3 md1 mem

8 965 952 3.8 3.6 5.3

16 960 946 4.5 3.9 5.7

32 949 935 4.6 4.2 5.9

64 938 923 4.8 4.2 6.0

Table 11: canneal MD1 traffic

The MD2 hits in Table 12 show more L1 hits with the larger region sizes.

This means that when there is a miss in MD1 the data is usually found in L1.

This hints that if the MD1 reach were increased there would be better tracking of data in L1; however, this is not the case. The MD1 reach does not need to be extended; the regions need to be tracked with higher granularity.

rg md2 md2 l1 md2 l2 md2 l3 md2 mem

8 22 8 4.5 7.1 2.0

16 29 14 4.2 7.3 3.0

32 41 25 4.1 7.3 4.0

64 53 36 4.3 7.3 5.1

Table 12: canneal MD2 traffic

This information does not explain why canneal uses roughly the same number of cachelines on average regardless of region size. What is most likely happening is that canneal is accessing the same number of cachelines on average, but with smaller region sizes those cachelines are accessed more often. Since the D2M simulator only records whether a cacheline pointer has been used, not how many times, the volume of cacheline pointer accesses is not shown in the results. It is most likely this effect that leads to more MD1 traffic with smaller region sizes despite the average MD1 cacheline use being about the same.

To sum up, these three benchmarks have the most MD1 traffic with smaller region sizes and are therefore more likely to perform better with a region size of 8. These benchmarks do not benefit from the extended reach of the MD hierarchy with larger region sizes. They require a large number of MD entries to track data in small sections. As such, it is worth the extra overhead in the MD hierarchy to have smaller region sizes for these benchmarks.

5.6 Region size 16 and 32

The benchmarks facesim, ferret, freqmine, streamcluster, vips, gcc, gems, and gromacs all see the most MD1 activity with the region size 16 or 32. These region sizes are evaluated together because the MD1 traffic is less than 2 PKMO apart for all of the benchmarks. Also, these benchmarks are the ones least affected by change in region size. With eight of the fifteen benchmarks falling into this category, this leads to the conclusion that most benchmarks see peak MD1 traffic with a region size of 16 or 32.

Figure 7 shows two subgroups in this set of benchmarks: one group that performs slightly worse with a region size of 8 and one that performs slightly worse with a region size of 64. gromacs, streamcluster and vips perform worse with a region size of 8, and gems, gcc, ferret, facesim and freqmine perform worse with a region size of 64.

From the first subgroup, we will look in detail at gromacs to show an example of how D2M suffers from small region sizes when dealing with L2 sensitive applications. From the second subgroup we will look in detail at gems to show how D2M can turn MD3 traffic and misses into MD2 traffic with larger region sizes.

5.6.1 gromacs

The results in Table 13 show that gromacs has nearly the same amount of MD1 and MD2 traffic with the region sizes 16, 32, and 64. With the region size of 8, however, there is less MD1 traffic and more MD2 traffic.

rg   md1     md2   md3    miss
8    989.9   10    0.11   0.02
16   992.1   7.8   0.05   0.01
32   992.4   7.6   0.02   0
64   992.1   7.9   0.01   0

Table 13: gromacs MD traffic (PKMO)

Table 14 shows how the MD1 hits are distributed to the data hierarchy. The smaller region sizes have slightly better tracking to L1, but the larger region sizes have better tracking to L2, L3 and memory. Specifically with a region size of 8, D2M does not have the reach from MD1 to cover the data in L2.

rg md1 md1 l1 md1 l2 md1 l3 md1 mem

8 989.9 974.5 12.3 2.3 0.78

16 992.1 974.2 14.6 2.5 0.83

32 992.4 973.2 15.6 2.7 0.85

64 992.1 972.2 16.2 2.8 0.85

Table 14: gromacs MD1 traffic (PKMO)

Table 15 shows how the MD2 hits are distributed to the data hierarchy. The smaller the region size, the more often D2M has to go to MD2, and from MD2 data is most likely to be found in L2. As the region size increases, D2M has to go to MD2 less often. Specifically, with larger region sizes there are more hits to L1 and fewer hits to L2 from MD2.

This shows that L2 sensitive applications can benefit from the extended reach of MD1 with large region sizes. Specifically, if the reach from MD1 can be extended to cover data in L2 then there will be more MD1 traffic.


rg md2 md2 l1 md2 l2 md2 l3 md2 mem

8 10 1.4 7.5 1.1 0.01

16 7.8 1.7 5.3 0.8 0.02

32 7.6 2.6 4.4 0.7 0.03

64 7.9 3.6 3.8 0.6 0.04

Table 15: gromacs MD2 traffic (PKMO)

5.6.2 gems

The MD traffic in Table 16 shows a peak in MD1 activity with the region sizes 16 and 32. Noticeably, there is an increase in MD2 traffic with larger region sizes as the MD3 traffic and the number of misses decrease. This is because of the extended reach of the MD hierarchy with larger region sizes. As the reach of the MD hierarchy increases, D2M is able to turn MD3 traffic and cache misses into MD2 traffic. There is, however, some drawback in terms of MD1 traffic to tracking cachelines at this coarser granularity. In this test there is a decrease of MD1 traffic by 3 PKMO but an overall increase in MD2 traffic by 13 PKMO.

rg md1 md2 md3 miss

8 970 16 6.1 7.6

16 973 19 3.2 5.2

32 973 21 2.8 3.6

64 967 29 1.7 1.8

Table 16: gems MD traffic

Table 17 shows that the smaller the region size, the more MD1 traffic there is to L1; as the region size increases, D2M is not able to track data in L1 as well because the region granularity is too coarse. The larger the region size, however, the more MD1 traffic to L2, L3 and memory. By increasing the MD reach with the larger region sizes, D2M trades better L1 locality for more overall tracking across the data hierarchy. This is what makes the MD1 traffic peak between the region sizes 16 and 32.

rg md1 md1 l1 md1 l2 md1 l3 md1 mem

8 970 926 5.2 11.8 27.7

16 973 924 6.3 12.8 29.8

32 973 922 7.2 12.9 31.1

64 967 915 8.0 12.4 32.2

Table 17: gems MD1 traffic (PKMO)

Table 18 shows that the MD2 traffic increases as the region size increases.

The test with the region size 64 shows that there are 10.6 hits PKMO to L1. This is reflected in the drop of L1 traffic in Table 17. This means that with a region size of 64, D2M has to go to MD2 more often to find data in L1. Notably, with larger region sizes D2M tracks more data in L3 and memory, even though the hits to L2 are higher with a region size of 8. This shows that with larger reach in the MD hierarchy, D2M is able to more efficiently track data across the entire data hierarchy.

rg md2 md2 l1 md2 l2 md2 l3 md2 mem

8 16 0.5 7.4 7.2 0.9

16 19 1.8 6.5 9.4 1.3

32 21 3.9 4.6 11.0 1.4

64 29 10.6 3.8 12.6 2.2

Table 18: gems MD2 traffic (PKMO)

These results show that MD1 traffic is not the only important factor in D2M evaluation. By increasing the region size, and thus extending the reach of the MD hierarchy, D2M is able to increase the overall traffic to MD1 and MD2. In fact, the combined MD1 and MD2 traffic shows that there is always an increase in MD traffic and a decrease in cache misses. This can be seen in Figure 10.

Figure 10: Combined MD1 and MD2 traffic (PKMO) Appendix C

5.7 Region size 64: bodytrack, fluidanimate, lbm and mcf

The benchmarks bodytrack, fluidanimate, lbm and mcf have the most MD1 traffic for the region size of 64. These benchmarks benefit the most from the increased reach of the MD hierarchy because they have more sequential access patterns than the other benchmarks. To represent this group we make an evaluation of the lbm benchmark.

5.7.1 lbm

Here there is a stark difference in the behavior of this benchmark with different region sizes. Table 19 shows that with a region size of 64, D2M is able to move almost all of the MD traffic to MD1, while lowering the frequency of cache misses.

This shows that this benchmark benefits from the extended MD1 reach offered by tracking cachelines in region sizes of 64.


rg md1 md2 md3 miss

8 963 22 7.3 8.3

16 983 9.7 3.7 4.2

32 995 1.3 1.9 2.1

64 998 1.1 0.02 1.1

Table 19: lbm MD traffic

The MD1 activity in Table 20 shows that there is slightly better tracking of data in L1 with the region size of 8, but there is a larger increase in L2 hits and hits to memory with the region size of 64.

rg md1 md1 l1 md1 l2 md1 l3 md1 mem

8 963 795 63 0.14 104

16 983 793 75 0.14 114

32 995 793 82 0.14 120

64 998 792 83 0.13 122

Table 20: lbm MD1 traffic (PKMO)

The MD2 activity in Table 21 shows that with a region size of 8, D2M is still able to track the data in L2, but when the region size is increased, D2M is able to migrate that L2 traffic to MD1.

rg md2 md2 l1 md2 l2 md2 l3 md2 mem

8 21.8 0.03 17.4 0 4.4

16 9.7 0.01 7.6 0 2.1

32 1.3 0 0.9 0 0.4

64 1.1 0.02 0.2 0 0.97

Table 21: lbm MD2 traffic (PKMO)

These results are reflected by the number of MD1 cacheline pointers used by lbm. Table 22 shows the average cacheline use for the different region sizes along with those numbers as percentages. With the region size of 64, lbm uses 91% of the cacheline pointers. This is why we see such a dramatic increase in MD1 traffic with this region size.


Cachelines per region Avg use Percentage

8 4.9 61%

16 10.5 66%

32 27.9 87%

64 58.2 91%

Table 22: lbm MD1 Cacheline Use (PKMO) and Percentages

This benchmark shows the best case performance of D2M with larger region sizes. Although this is not the common case, these results show that if D2M is able to use the LI pointers in MD1 there will be increased MD1 traffic.


6 Conclusions and Future Work

In this thesis we introduced a method to evaluate different region sizes of D2M.

We used a D2M simulator and PIN to test different benchmarks from the SPEC CPU 2006 and PARSEC 3.0 benchmark suites and generated statistics about the different region sizes of D2M.

This evaluation has shown a number of different behavior patterns in D2M and several different ways to reason about those behaviors. From the results of the benchmarks examined, most of the benchmarks have peak MD1 traffic with region sizes of 16 and 32 cachelines. These benchmarks, however, are least affected by the change in region size. In general, when considering the combined traffic for MD1 and MD2, most benchmarks saw an overall increase in MD traffic and a decrease in cache misses with larger region sizes.

These results lead to the conclusions that: (1) the optimal region sizes in terms of MD1 traffic are 16 or 32 cachelines and (2) the optimal region size in terms of overall MD traffic is 64 cachelines.

Future work with PIN and D2M should focus on using D2M as a framework to implement other cache optimizations [7]. Specifically the D2M simulator could be extended to support different algorithms for data placement/replication and cache bypassing. These tests in PIN could provide researchers with a better understanding of D2M behavior so that time spent on in-depth simulations can be used efficiently and effectively.


7 Appendices

7.1 Appendix A: MD1 traffic

rg   mcf     lbm     body    fluid   black   raytrace   canneal
8    780.1   962.6   977.8   997.1   999.8   999.4      971.3
16   792.4   982.5   983.1   998.3   999.0   999.2      968.0
32   799.6   994.8   988.7   999.2   992.2   996.8      961.2
64   808.4   997.8   994.2   999.4   983.3   987.9      945.5

Table 23: MD1 Traffic (PKMO)

rg   gems    gromacs   gcc     stream   ferret   facesim   vips    freqmine
8    970.4   989.9     993.2   993.2    994.6    996.6     996.7   997.8
16   972.6   992.1     994.2   995.0    996.6    997.7     997.9   998.0
32   972.8   992.4     992.1   995.7    996.7    997.3     998.5   998.0
64   967.3   992.1     983.1   995.2    993.6    993.0     997.7   989.3

Table 24: MD1 Traffic (PKMO)

7.2 Appendix B: MD1 Use

rg black body canneal facesim ferret fluid freqmine ray stream vips avg

8 8.0 1.2 1.3 5.9 3.6 2.1 2.1 4.8 4.6 4.5 3.8

16 2.4 1.8 1.3 8.7 4.7 3.1 2.4 4.4 6.2 7.1 4.2

32 0.7 2.9 1.3 7.7 5.1 5.6 2.6 2.3 7.4 9.0 4.5

64 0.3 6.2 1.2 3.6 2.9 8.3 1.4 1.5 6.6 6.1 3.8

Table 25: PARSEC 3.0 MD1 Cacheline Use per Region Size

rg gcc gems gromacs lbm mcf avg

8 2.6 1.7 2.2 4.9 1.0 2.5

16 3.4 1.9 2.8 10.5 1.1 3.9

32 3.0 2.0 3.0 27.9 1.2 7.4

64 2.2 1.8 3.1 58.2 1.2 13.3

Table 26: SPEC MD1 Cacheline Use per Region Size


7.3 Appendix C: MD1 and MD2 traffic

rg canneal facesim ferret fluidanimate gems lbm mcf streamcluster

8 987 998 999 998 986 984 986 995

16 989 999 999 999 992 992 993 997

32 990 1000 999 999 994 996 996 998

64 991 1000 999 1000 997 999 998 999

Table 27: Combined MD1 and MD2 traffic (PKMO)


References

[1] Christian Bienia. Benchmarking Modern Multiprocessors. PhD thesis, Princeton University, January 2011.

[2] Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay Vaish, Mark D. Hill, and David A. Wood. The gem5 simulator. SIGARCH Comput. Archit. News, 39(2):1–7, August 2011.

[3] John L. Henning. SPEC CPU2006 benchmark descriptions. SIGARCH Comput. Archit. News, 34(4):1–17, September 2006.

[4] David A. Patterson and John L. Hennessy. Computer Organization and Design: The Hardware/Software Interface (The Morgan Kaufmann Series in Computer Architecture and Design). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 4th edition, 2008.

[5] A. Sembrant, E. Hagersten, and D. Black-Schaffer. A split cache hierarchy for enabling data-oriented optimizations. In 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 133–144, February 2017.

[6] A. Sembrant, E. Hagersten, and D. Black-Schaffer. TLC: A tag-less cache for reducing dynamic first level cache energy. In 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 49–61, December 2013.

[7] Andreas Sembrant, Erik Hagersten, and David Black-Schaffer. The Direct-to-Data (D2D) cache: Navigating the cache hierarchy with a single lookup. SIGARCH Comput. Archit. News, 42(3):133–144, June 2014.

[8] Tim Weidner. Investigating the scalability of Direct-to-Master caches, 2017.

[9] M. V. Wilkes. Slave memories and dynamic storage allocation. IEEE Transactions on Electronic Computers, EC-14(2):270–271, April 1965.
