
Institutionen för datavetenskap

Department of Computer and Information Science

Final thesis

Measuring the effect of memory bandwidth contention in applications on multi-core processors

by

Emil Lindberg

LIU-IDA/LITH-EX-A--15/002--SE

2015-02-10


Supervisor: Erik Hansson

Examiner: Christoph Kessler


Abstract

In this thesis we design and implement a benchmarking tool for applications’ sensitivity to main memory bandwidth contention, in a multi-core environment, on an ARM Cortex-A15 CPU. The tool is supposed to minimize usage of shared resources, except for the main memory bandwidth, allowing it to isolate the effects of the bandwidth contention only. The difficulty in doing this lies in using a correct memory access pattern for this purpose, i.e. which memory addresses to access, in which order and at what rate in order to minimize cache usage while generating a high and controllable main memory bandwidth usage.

We manage to implement a tool with low cache memory usage while still being able to saturate the main memory bandwidth. The tool uses a proportional-integral controller to control the amount of bandwidth it uses. We then use the tool to investigate the memory behaviour of the platform and of some applications when the tool is using a variable amount of bandwidth. However, we have some difficulties in analyzing the results due to the lack of support for hardware performance counters in the operating system we are using and are forced to rely on hardware timers for our data gathering. Another difficulty is the platform’s limited L2 cache bandwidth, which leads to a heavy impact on L2 cache read latency by the tool. Despite this, we are able to draw some conclusions on the bandwidth usage of other applications in optimal cases with the help of the tool.


Contents

1 Introduction
   1.1 Motivation
   1.2 Problem Description
   1.3 Results
   1.4 Structure

2 Background
   2.1 Memory Hierarchy
   2.2 SDRAM
   2.3 Bandwidth Usage Properties
   2.4 Paged Virtual Memory
   2.5 Experimental Platform

3 Our Bandit
   3.1 Design
      3.1.1 Main objectives
      3.1.2 Floating point data
      3.1.3 Location of Memory Accesses
      3.1.4 Controlling Bandwidth Usage
      3.1.5 Measuring Target Application Bandwidth Usage
      3.1.6 Multithreading the Bandit
      3.1.7 Portability
      3.1.8 Usability
   3.2 Implementation
      3.2.1 Location of Memory Accesses
      3.2.2 Bandwidth delay implementation
      3.2.3 PI-Controller implementation
      3.2.4 Measuring Target Application Bandwidth Usage
      3.2.5 Multithreading the Bandit
      3.2.6 Portability
      3.2.7 User interface
      3.2.8 Low Overhead Bandit

4 Evaluation
   4.1 Method
   4.2 Platform Performance
      4.2.1 Memory Access Latency
      4.2.2 Off-Chip Memory Bandwidth
   4.3 Bandit Performance
      4.3.1 Cache Miss Generation
      4.3.2 PI-Controller Performance
   4.4 Measuring Memory Latency
   4.5 Measuring Bandwidth
   4.6 Bandit’s Effect on Programs
      4.6.1 Overview
      4.6.2 Micro-Benchmarks
      4.6.3 Telecom Application
      4.6.4 MiBench
   4.7 Summary
      4.7.1 Platform evaluation
      4.7.2 Bandit evaluation
      4.7.3 Closing remarks

5 Related Work
   5.1 Prior Work
   5.2 Main Memory Bandwidth Contention Measurement
   5.3 Main Memory Bandwidth Contention Mitigation

6 Conclusions
   6.1 Limitations
   6.2 Future work

Appendices
   A Cache hit simulator
   B Physical address lookup
   C Memory Access Iteration Code
   D PI-Controller Code
   E The Reference Program
   F Synthetic benchmarks


Chapter 1

Introduction

1.1 Motivation

Chip multiprocessors, CMPs, have rapidly become the standard in laptop and desktop PCs. This is due to multiple reasons, one being the unfeasible energy and cooling requirements of running a processor at high frequencies, as an increase in frequency increases the energy consumption approximately by the cube [3]. For the same reason, a decrease in frequency leads to large energy savings, allowing processor manufacturers to keep up with Moore’s law by constructing CMPs with several cores that each run at a lower frequency, resulting in a higher total theoretical computational power.

While the potential gains of utilizing CMPs are high, they present several new challenges and considerations, which prevent or delay them from being adopted and utilized in every computer. Applications using multiple cores are more difficult to develop and require more from programmers, but there is another major issue introduced by CMPs, namely shared resource contention. This means that different cores contend for limited resources, such as caches, main memory bandwidth and network resources. An application running on one core can therefore affect the execution time of an application running on a different core by contending for the same shared resources. Due to this contention, running two seemingly independent tasks on different cores in a real-time system (a system where processes have deadlines) can be a problem [11]. To mitigate resource contention, we need to know the applications’ characteristics with regard to shared resources. With that information, we can decide if we can run them together or if we need to redesign either our platform or our applications.

1.2 Problem Description

In this thesis we are going to look specifically at the contention for main memory bandwidth, using methods previously described by Eklöv et al. [7].


They generated traffic on the main memory bus with an application they called Bandwidth Bandit and then measured how various applications’ performance was affected.

The Bandit is a program that generates a variable amount of load on the main memory bus. The load it generates should be as realistic as possible when compared to real applications, and it should have minimal effect on the other shared resources, which in this case mostly means the shared caches. We have some additional demands on the Bandit compared to our predecessors. First, we want to investigate if it is possible to gather memory bandwidth usage data on the application affected by the Bandit directly with the Bandit itself. The previous Bandit relied on external monitoring with hardware performance counters to gather data, but we want to rely as little as possible on other software and specific hardware. Secondly, we want the Bandit to be a usable tool in regression testing to verify an application’s characteristics regarding memory bandwidth usage. Finally, our target platform is an embedded system with an ARM Cortex-A15 CPU, while the previous Bandit used an Intel system.

As an added requirement, we want to look into the portability aspect of the Bandit and try to isolate its hardware dependent features. This would allow for easier porting to different architectures and keep the Bandit relevant, with less effort, in an environment of ever changing hardware.

The platform we are using is intended for running telecom applications, which we need to take into consideration when evaluating the Bandit.

1.3 Results

We managed to create an application that is able to control the amount of bandwidth it uses, while also keeping the amount of cache memory it uses low. This bandit can then be used to test how other applications function under different contention environments. As we have shown that execution time can be heavily affected by this contention, it can be a useful tool for verifying the robustness of various real-time applications.

Porting the Bandit to different architectures is doable; however, our reliance on a fast hardware timing function is a weakness. It can, however, be replaced by other mechanisms for controlling high precision delays.

The attempt to make the Bandit able to measure other applications’ bandwidth usage did not go entirely as we had wanted. We were able to measure synthetic applications with highly stable bandwidth usage, but our attempt to measure other applications did not work very well. With more time, it could have been possible to implement it.

The low bandwidth beyond the L2 cache on our target architecture, and the fact that it is a bottleneck, made it difficult to separate pure main memory accesses from L2 cache accesses, as the latter also suffered heavily from the contention. This became very obvious when we attempted to implement a bandwidth measuring function in our bandit, as the measurements were affected by L2 cache accesses.

1.4 Structure

In Chapter 2 we present concepts needed for understanding our work. These concepts are mainly focused on the different types of memory in a CPU and their function. We also present our experimental platform here.

In Chapter 3 we present the design and implementation of our bandit and show details of how it works.

In Chapter 4 we evaluate the performance of our platform and the function of the Bandit, and test how it affects other applications.

In Chapter 5 we present work related to our thesis, which mostly consists of articles that focus on memory bandwidth contention in different ways.

We conclude the thesis in Chapter 6 with some outlook on the future of the Bandit.


Chapter 2

Background

2.1 Memory Hierarchy

Modern processors utilize a hierarchy of memories. First, there are several layers of volatile storage, from the registers in the processor to a couple of cache memories, usually two or three (denoted L1, L2, etc.), and finally the main memory. Behind all of those is the permanent but slow storage. The speed of CPUs has been increasing at a much higher rate than that of main memory for a while now [15] and still is [14]. It was first the case that the two operated at the same speed, but they are now separated by orders of magnitude. This is the reason for the memory hierarchy. Data often has the following two properties: temporal locality (data recently used is likely to be used again) and spatial locality (data close to other recently used data is likely to be used). By having faster caches, processors get faster access to the data that is likely to be used, and even though the caches are small, they are large enough to gain the benefits of spatial and temporal locality.


Figure 2.1: Organization and mapping of cache sets in main memory to cache memory.

When data is fetched from memory for the first time, it is read from the main memory and then stored in the different layers of cache. More data than requested, a cache line, is often fetched to take advantage of the spatial locality. In some cases, when the access pattern can be predicted, even more data can be read into the cache, which is called prefetching. When any of this data is read again and it still resides in a cache, a cache hit has occurred. The memory system searches for the requested data in order from the fastest cache to the slowest and returns the first hit. When a new cache line is installed, it usually cannot be stored at any place in the cache, i.e. the cache is usually not fully associative. Instead, cache memories can be n-way, where n denotes in how many places a block of data can be stored in the cache. These n places that can store the same subset of memory blocks are called a cache set. An example of how cache sets map from main memory to cache memory is found in Figure 2.1. If there is no free spot to install the data in, some other cache line in the same cache set must be evicted. The selection of the cache line can be done with different algorithms such as least recently used (LRU) or random selection [12]. In CMPs, caches can be private to a processor or shared between some or all of the processors, as shown in Figure 2.2. Shared caches introduce contention for space in the cache, a phenomenon Eklöv et al. also examined [6]. They also noted that decreased performance of the caches led to an increase in main memory bandwidth usage. This happens because every cache hit removes the need for a read from the main memory. The memory hierarchy is, in other words, coupled: if one part of it is affected, the effect ripples upwards through the hierarchy.
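To make the set mapping concrete, the set index of an address in an n-way set-associative cache can be computed from the cache parameters. The sketch below is our own illustration; the parameter values are those of the L1 cache on our experimental platform (Section 2.5), and the function name is ours:

#include <stdint.h>

/* Illustrative only: which cache set an address maps to.
 * Parameters correspond to the platform's L1 cache: 32 KB, 2-way, 64 B lines. */
#define CACHE_SIZE      (32 * 1024)
#define CACHE_WAYS      2
#define CACHE_LINE_SIZE 64
#define CACHE_SETS      (CACHE_SIZE / (CACHE_WAYS * CACHE_LINE_SIZE)) /* 256 sets */

static unsigned int cache_set_index(uintptr_t addr)
{
    /* Drop the line offset bits, then wrap around the number of sets. */
    return (unsigned int)((addr / CACHE_LINE_SIZE) % CACHE_SETS);
}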


Figure 2.2: Organization of L1 and L2 caches. The back side is internal to the cores and the front side is external to the CPU.

2.2 SDRAM

Synchronous dynamic random access memory, SDRAM, is the most common type of main memory in computers today [12]. Access to the SDRAM is controlled by a memory controller which translates a memory address in the operating system into an address in the SDRAM. An address has to be translated into several different signals. With more than one SDRAM module, a memory channel has to be selected, and every SDRAM is partitioned into banks, rows and columns. All these parts together translate into a unique memory address.

A bank contains different rows and columns. The different banks allow for some parallelism since they can prepare reads independently. When data has to be read, the selected bank must first load the selected row, also known as a page, into a buffer unique for each bank. If this page is ready when the request is made (page hit) the request will be serviced significantly faster than if the wrong page is loaded (page miss) or if no page is loaded (page empty).

Due to the channels’ and banks’ parallel nature the SDRAM has great potential for parallelism [13].

2.3 Bandwidth Usage Properties

The various levels in the memory hierarchy usually allow some form of parallel operation, or at least queues, in order to increase the utilization of the buses in question. An ARM Cortex-A15, for example, can have 16 outstanding loads and stores at any given time, while its cache is able to keep delivering data that is present in the cache while waiting for data that earlier caused a cache miss. The memory controller for the main memory can direct requests to the different banks in an SDRAM, and if multiple channels are available, it can direct the requests over both channels in parallel.

The parallel operation in the hierarchy is key to understanding how an application behaves under different contention for the memory bandwidth resources, as Eklöv et al. demonstrated [7]. They classified applications as either bandwidth sensitive or latency sensitive. When the load on the memory system increases, the latency will gradually increase, even if there is available bandwidth left. This is due to contention for specific parts of the memory system, which results in requests being placed in different queues. A bandwidth sensitive application is good at utilizing prefetching and is able to perform calculations while new data is being fetched. Latency sensitive applications are the opposite and will have to stall in order to wait for new data.

2.4 Paged Virtual Memory

Modern operating systems usually employ a memory management technique called paged virtual memory [12]. This works by giving each process its own memory space, a virtual memory space, independent of the actual physical memory space where only that process operates. A process then sees the available memory as one contiguous block segmented into pages and can usually address 4 GB of memory or more, independent of the actual available memory of the system. The size of these pages is usually 4 KB, but can differ depending on support from hardware and the operating system. When memory is allocated by a process, a binding is created between virtual pages and physical pages, i.e. pages in the physical memory. The operating system keeps track of all these mappings in a page table on a per process basis. The page tables reside in the main memory.

One of the great benefits of paged virtual memory is the ability to scatter the memory space of a process arbitrarily in the physical memory, thus reducing the impact of memory fragmentation. If a process allocates a chunk of memory by the size of 1,000 pages, the addresses of the chunk will be contiguous in the virtual memory space, but 1,000 contiguous pages are not necessary for the allocation. The 1,000 pages may be scattered throughout the physical memory.

When a process accesses a virtual memory address, a lookup is done in the page table by using the higher bits of the virtual memory address. The lower bits, the 12 least significant bits in the case of a 4 KB (2^12 bytes) page size, represent the offset within the page. If the mapping exists, the virtual address is translated into the corresponding physical address by taking the physical page address and using the same offset within it.
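As a small illustration of the split and recombination (assuming the common 4 KB page size; the names below are ours, not taken from any particular page table implementation):

#include <stdint.h>

#define PAGE_SHIFT       12                          /* 4 KB pages */
#define PAGE_OFFSET_MASK ((1u << PAGE_SHIFT) - 1)

/* Combine a looked-up physical page address with the page offset
 * taken from the virtual address. */
static uintptr_t translate(uintptr_t virt_addr, uintptr_t phys_page_addr)
{
    uintptr_t offset = virt_addr & PAGE_OFFSET_MASK;  /* low 12 bits */
    return phys_page_addr | offset;
}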

Modern processors provide hardware support for this type of virtual memory with a memory management unit, MMU. It performs the translation from virtual memory addresses to physical memory addresses.

Figure 2.3: Mapping from virtual addresses to physical addresses.

To speed up the process a translation lookaside buffer, TLB, is used, which is a kind of cache memory that stores recently accessed pages and their corresponding mappings to physical memory. These can operate in a hierarchy in the same way as the normal caches. When a process performs a memory access and hardware support exists, a lookup is first done in the TLB. If the address exists, the physical address is returned and the memory access is performed. Otherwise a so-called page walk is performed, which is the operation of accessing the page table in main memory, retrieving the requested mapping and storing it in the TLB. This is usually performed through dedicated hardware. If no such mapping exists, an exception is raised.

2.5 Experimental Platform

The platform our bandit is being tested on is running two clusters of ARM Cortex-A15 processors with four cores in each, all within a single die. Each core has a private L1 cache memory, each cluster shares an L2 cache memory and the entire system shares an L3 cache memory. The clusters are connected via the L2-cache system to a CoreLink CCN-504 Cache Coherent Network [1] which connects the clusters to each other and to the rest of the system. The main memory is connected to a CoreLink DMC-520 Dynamic Memory Controller [2] supporting two DDR3 modules. The documented bandwidth of the memory controller is around 15 GB/s.

The L1 cache is 32 KB in size, uses 64 B cache lines and is 2-way associative. Its cache replacement policy is least recently used and it can have up to 6 different outstanding memory requests at any time. The L2 cache is 2 MB in size, uses 64 B cache lines and is 16-way associative. Its cache replacement policy is random selection and it can have 16 outstanding writes and 11 outstanding reads. The L3 cache has the same properties as the L2 cache except that it is 8 MB in size. The operating system is a custom Linux distribution patched with real-time patches. The Bandit is cross-compiled using GCC version 4.6.3 with -O2 optimizations active.

Figure 2.4: Architecture of the experimental platform.


Chapter 3

Our Bandit

3.1 Design

3.1.1 Main objectives

In short, the objective of our implementation is to use as few shared resources as possible except for main memory bandwidth, have a sufficiently realistic bandwidth usage pattern, be easy to use and, finally, be easy to port to different architectures. The most relevant shared resource to minimize in our case is the shared L2 cache memory. The L3 cache memory usage is not focused on, to simplify the implementation and evaluation, and it is also not relevant for the final use case of the platform. The Bandit is completely implemented in C as it is close enough to the hardware for our needs.

3.1.2 Floating point data

Floating point operations should be avoided if possible in time critical sections due to them being significantly slower than normal integer operations. For us, the data that would benefit from being represented as floating point is time. In order to use integers instead, we have to represent time in a small enough unit to not lose any significant precision. We therefore chose to represent time as nanoseconds.

Since we have a 32-bit platform, it would be faster for us to save the time in a 32-bit integer. This, however, puts a limit on how long we can time things. Using an unsigned integer we get the maximum value of 4,294,967,295 ns, or roughly 4.3 seconds. Due to the high speed nature of computers, 4.3 seconds should be enough for our needs.


3.1.3 Location of Memory Accesses

In order to achieve the first two objectives, it is necessary to access specific memory addresses in a specific order. To generate a memory access, we need to cause a cache miss. A cache miss can be guaranteed in two ways: either by trying to access memory of a greater size than the target cache, or, if the number of ways in a set is limited, by accessing more cache lines belonging to the same cache set than there are ways. By doing the latter, we do not need to use up the entire cache memory. The usage pattern we aim for is one that utilizes the parallelism in the main memory as uniformly as possible, i.e. utilizes the channels and banks equally, and, if possible, controls the number of page hits and misses in the SDRAM.

To access memory in specific cache sets we need to know how the mapping to cache sets is done. The least significant bits of the address denote the offset within the cache line; in our case this means the 6 least significant bits. Some of the following bits then denote the cache set. To calculate how many bits denote the set we use the following equation:

    cache sets = cache size / (ways * line size)    (3.1)

The L2 cache in our case is 2 MB in size, has 16 ways and its line size is 64 bytes. Equation 3.1 gives us 2048 different cache sets, which need 11 bits to address them all. This means that the 11 bits after the cache line offset encode the cache set. Since they are offset by 6 bits for the addressing within the cache line, the cache sets will repeat every 2^17 bytes, or 128 KB, of memory. If we align memory allocations to this 128 KB boundary we can confine ourselves to a few cache sets and therefore generate cache misses without thrashing the entire cache.
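The same calculation in code form, using the L2 cache parameters stated above (the variable names and the small program around them are our own illustration):

#include <stdio.h>

int main(void)
{
    unsigned int cache_size = 2 * 1024 * 1024;  /* 2 MB */
    unsigned int ways       = 16;
    unsigned int line_size  = 64;

    unsigned int sets       = cache_size / (ways * line_size); /* 2048, Equation 3.1 */
    unsigned int set_period = sets * line_size;                /* 131072 B = 128 KB */

    printf("sets = %u, same set repeats every %u bytes\n", sets, set_period);
    return 0;
}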

An easy way to step through this memory is to create a circularly linked list for each cache set that we use, with each element at a constant offset from the 128 KB boundary. Each element contains only a pointer to the next element. By stepping through these lists we generate the memory traffic.
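A minimal sketch of what such a list element and the pointer chasing could look like (the actual iteration code is in Appendix C; the struct and loop below are our illustration, not the thesis code):

struct list_element {
    struct list_element *next;   /* each element only stores a pointer to the next */
};

/* Chase pointers through a circular list; every dereference is intended
 * to miss in the cache and therefore generate main memory traffic. */
static struct list_element *chase(struct list_element *it, unsigned int steps)
{
    while (steps--)
        it = it->next;
    return it;
}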

Our platform utilizes, as described in Section 2.5, a random eviction strategy for the cache. A perfect LRU strategy would allow us to use just one more element than there are ways in each cache set, i.e. 17 elements in our L2 example. However, the random eviction strategy forces us to use significantly more elements than that. Using a simple simulator we created, described in Appendix A, we determined that with twice as many elements as ways we get a hit rate of about 20%, and with four times as many elements as ways the hit rate is about 2%, which should be enough for our needs.

Now we know what we need to do in order to generate reads beyond the L2 cache, but we also need to read beyond the L3 cache. The requirements for not thrashing the L3 cache are not as strict, though. Since the L3 cache has the same properties as the L2 cache, except that it is four times the size, it should suffice to allocate four times as many elements as we earlier anticipated when calculating for the L2 cache.

Taking all of this into account, the circularly linked lists belonging to the cache sets should have at least 16 * 4 * 4, or 256, elements.

To generate uniform access to the main memory we need to know the mapping from memory addresses to the different parts of the SDRAM. This translation is done by the memory controller; however, we could not find this specific mapping in the technical reference manual for the memory controller [2]. A workaround that improves our chance of uniform accesses is simply to allocate more memory than we need for our cache miss purposes.

3.1.4 Controlling Bandwidth Usage

In order to generate different amounts of bandwidth usage, we take a set number of steps through the linked lists and then delay by a different amount of time depending on how much bandwidth we want to use. The number of steps we take before the delay is important because there is an overhead associated with each delay. Fewer steps give us a finer degree of control over the bandwidth usage, but limit our ability to generate a higher amount of traffic due to the overhead associated with the delays. By using a hardware timer in the ARM Cortex-A15 that has a very low overhead, we get very precise control over this delay.

An alternative would be to use a normal busy loop instead. This solution would however reduce observability, as the delay would not be based directly on time but on loop iterations. It could be solved by calculating how long different iterations take to run, but it would take more effort. An upside would be that the overhead of the delays would be lower because of the simpler construct, but observability is prioritized in this case due to our lack of hardware counter support.

We also want to be able to control the total bandwidth usage. The target can be specified in a number of different ways, although it is most intuitive to just enter the usage in absolute numbers, in our case MB/s. Reaching that target can in turn be done either by using pre-calculated values for the delay corresponding to different bandwidth usages or, as we do, with a proportional-integral controller, PI-controller.

The PI-controller uses the current bandwidth usage as the process variable and the delay as the manipulated variable. A controller works by comparing the process variable, the bandwidth usage, to the target value, the target bandwidth usage, and then taking action depending on the difference. The difference between the observed value and the target value is called the error. The action is to change the manipulated variable so the process variable approaches the target value. In our case that means modifying the delay so we achieve the target bandwidth usage. The proportional part and the integral part give us the change in our manipulated variable:

(21)

    u(t) = Kp * e(t) + Ki * ∫ e(τ) dτ    (3.2)

where

u(t) is the output,
e(t) is the current error,
Kp is the proportional gain constant, and
Ki is the integral gain constant.

The proportional part provides a change proportional to the current error, while the integral part provides a change depending on the accumulated error, thus compensating for long-term errors [8].
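A minimal sketch of the discrete form of Equation 3.2, applied to our variables (bandwidth as the process variable, delay as the manipulated variable). The variable names, the gains and the sign convention are ours and purely illustrative; the actual implementation is described in Section 3.2.3 and Appendix D:

/* Illustrative discrete PI update: adjust the delay based on the
 * difference between target and measured bandwidth (MB/s). */
static int pi_update(int target_mbps, int measured_mbps,
                     float kp, float ki, float *integral, int delay_ns)
{
    float error = (float)(target_mbps - measured_mbps);
    *integral += error;

    /* A positive error (too little bandwidth used) shortens the delay. */
    float change = kp * error + ki * (*integral);
    delay_ns -= (int)change;
    if (delay_ns < 0)
        delay_ns = 0;
    return delay_ns;
}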

The advantage of the controller is a high flexibility and low sensitivity to changes in the memory access latency. This is very helpful when the Bandit is run in multiple threads, as we will look into in the next section.

By timing each full iteration through all the 64 lists, and combining this time information with how much data we access during an iteration, we know how much bandwidth we are currently using. This is the only data we need in order to have a functioning controller. Because we know the length of the delay, we can also separate out the time taken for the actual memory accesses. This allows us to evaluate the current memory access latency in the system.

3.1.5 Measuring Target Application Bandwidth Usage

The basic idea behind measuring the bandwidth usage of the target application is that if the maximum bandwidth that the Bandit can expect to use is known, then the difference between that and the actually accessed bandwidth is the amount that the target is using. The problem is that the target application gets less bandwidth when the Bandit uses bandwidth compared to when the system is silent. However, if we know how much bandwidth the Bandit is using, then we can compare the target’s bandwidth with a bandit’s bandwidth under the same circumstances, and from this it can be possible to gain some information about the application’s actual bandwidth usage.

It would work as following:

1. Run an application with the Bandit and get measurements.

2. Perform another run and replace the application with another bandit and configure it so we get the same measurements as in the previous step.

3. Run a bandit with the same configuration and get information on how much memory bandwidth it is using. This bandwidth usage should be the same as the original application’s bandwidth usage. A small worked example follows below.
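As a rough worked example with made-up round numbers (the real measurements appear in Section 3.2.7 and Chapter 4): suppose a bandit configuration can reach 2700 MB/s when running alone, but with the target application running alongside it the bandits only reach 2200 MB/s. The difference, 2700 - 2200 = 500 MB/s, is then taken as an estimate of the target application’s bandwidth usage under that level of contention.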


3.1.6 Multithreading the Bandit

Multithreading is useful to generate even more bandwidth usage. By using the pthreads library to achieve this, we can run all the Bandits under one process. The benefit of this is that the threads can communicate with each other via the shared memory space of the process; this helps the Bandit monitor its total memory bandwidth usage, and only one controller is needed to regulate it. However, it will also work if the Bandits are run as different processes, as the bandits will then get their own controllers. Each controller will then be able to handle the effect of the other bandits running in parallel.

When started with multiple threads, the Bandit pins the different threads to their own cores so they don’t interfere with each other.

3.1.7 Portability

The main hardware dependencies in the Bandit are the memory layout and the hardware timer used for the delay. In order to avoid thrashing the shared caches, we need to adapt our memory placement to the architecture’s cache size, number of ways and line width. Our implementation has hard coded values for those parameters, so a change of platform should, from the memory layout perspective, only require setting the parameters correctly, but we do not investigate this in practice.

The hardware timer used in the ARM CPU has a very low overhead associated with it, which allows it to be used to time the delay. If another architecture does not have a timer with similar properties, we will have to either switch to the busy loop discussed earlier or accept the increased overhead and therefore a reduced maximum possible stolen bandwidth per bandit.

3.1.8 Usability

The goal of the interface to the Bandit is for it to be usable in a script environment so it can be used for automated tests. A simple command line interface which takes different parameters fulfills this goal. It can be used to control the number of executing threads, the amount of bandwidth that the threads should use in total and a verbosity setting that controls the amount of data output.

3.2 Implementation

3.2.1 Location of Memory Accesses

To implement our design in Section 3.1.3 we have to carefully choose the memory we use for our circularly linked lists.


In order to find addresses fulfilling the cache usage and main memory access demands, Eklöv et al. used a feature called huge pages in Linux to allocate 8 MB chunks of contiguous memory, significantly larger chunks compared to the standard 4 KB pages. In those chunks they placed several linked lists, which they then iterated through to generate memory traffic. However, our Linux installation does not support the huge pages feature. Luckily, we have another feature that allows us to find the physical address corresponding to our virtual addresses. The complete mappings of a process’s virtual addresses can be looked up by using the proc filesystem, specifically /proc/PID/pagemap. The specific way we did it can be found in Appendix B.
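As a rough illustration of the idea (the actual code is in Appendix B; the helper below is our own sketch of the pagemap interface, where each virtual page has a 64-bit entry whose low 55 bits hold the physical frame number and whose bit 63 marks the page as present):

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Sketch: translate a virtual address of the calling process to a
 * physical address by reading /proc/self/pagemap. Error handling omitted. */
static uint64_t lookup_physical(void *vaddr)
{
    long page_size = sysconf(_SC_PAGESIZE);
    uint64_t entry;
    int fd = open("/proc/self/pagemap", O_RDONLY);
    off_t index = ((uintptr_t)vaddr / page_size) * sizeof(entry);

    pread(fd, &entry, sizeof(entry), index);
    close(fd);

    if (!(entry & (1ULL << 63)))            /* page not present in RAM */
        return 0;
    return (entry & ((1ULL << 55) - 1)) * page_size
           + ((uintptr_t)vaddr % page_size);
}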

The first step is to allocate memory that we know is at least aligned to the page size of the system. This is done with the following call:

posix_memalign(&memory_array[k], page_alignment, size);

With the help of that function we can allocate as many pages as we need in order to find pages with the correct alignment. Finding pages with the correct alignment is simply a matter of evaluating a modulo operation:

if ((lookup_address(memory) % alignment) == offset)

The current method we use is to perform one allocation for each memory page, which is very slow as it results in a lot of system calls. The reason we did it this way was to allow us to free the pages we did not need. A major drawback of the freeing was that it fragmented the memory of the system and resulted in a severe slowdown, even after the Bandit had finished executing. The Bandit now keeps all the memory it allocates, which means that it uses around 32 MB of memory at the 256 element minimum we calculated in Section 3.1.3.

A better alternative, given that we cannot free memory, could have been to allocate several pages at once and then test them, thus saving system calls, but we did not do this due to time constraints.

Once we have the memory, we partition the pages into 64 pieces, one for each cache line in the page, and create 64 different circularly linked lists with one element in each page, each list belonging to a separate cache set. This means that we use 64 out of the 2048 available cache sets, or 3% of the available L2 cache. By having these 64 lists we are hopefully able to utilize the parallelism in the various levels of the memory hierarchy; however, we have no control over the page hits and misses in the SDRAM.

3.2.2 Bandwidth delay implementation

We use the implementation described in Section 3.1.4: the hardware timer is used for the delay between reads, and we always perform 16 memory reads at a time, resulting in 64 * 16 bytes, or 1 KB, of memory read per step. The reason we use this number is that it is the lowest number we found that can still effectively generate large enough amounts of bandwidth usage.


With a lower number, we would not be able to leverage the parallelism in the memory hierarchy.

The main function for generating bandwidth usage is shown in Listing C.1. It is mainly made up of two parts, the memory read part, shown in Listing C.3, and the delay part, shown in listing C.2. The memory reading utilizes the circularly linked lists constructed in Section 3.2.1 and keeps an iterator for each list in order to remember the positions. As we can see in Listing C.4, the compiler unrolls the memory read loop, thereby improving the Bandit’s bandwidth usage performance.

3.2.3 PI-Controller implementation

The controller is implemented as described in Section 3.1.4. At first we tried to implement it using only integers, but it was more difficult than using floating point numbers. In the end the controller is not run very frequently, so the extra clock cycles required for floating point numbers are acceptable. The controller also has two modifications in order to improve its speed and reduce oscillations. The first one handles the fact that the delay has to increase exponentially in order to decrease the bandwidth usage. The ratio in the following listing is used to scale the delay modification up or down depending on the delay's size compared to the current memory read step time:

float wait_step_ratio = new_wait_time / (float) step_time;
if (wait_step_ratio > RATIO_MAX) {
    wait_step_ratio = RATIO_MAX
        + (wait_step_ratio - RATIO_MAX) * RATIO_STEP_DOWN;
} else if (wait_step_ratio < RATIO_MIN) {
    wait_step_ratio = RATIO_MIN;
}

The integral part of a PI-controller can lead to overshooting and oscillations [8]. In order to minimize this effect, we reduce and flip the sign of the accumulated error and apply an adjustment to the new delay at the moment we pass the target value, as shown in the following listing:

const float INTEGRAL_MODIFIER = -0.1;
if ((difference < EPSILON && difference > -EPSILON)
    || !((difference >= 0) ^ (old_difference < 0))) {
    int_diff = int_diff * INTEGRAL_MODIFIER;
    float overshoot_factor = difference / (float) total_usage;
    int wait_modification = (float) (wait_time + step_time) * overshoot_factor;
    new_wait_time -= wait_modification;
}


3.2.4 Measuring Target Application Bandwidth Usage

In order to get the target application’s bandwidth usage as described in Section 3.1.5, we first need the bandwidth usage of the Bandit. We get that as follows:

unsigned int mem_usage = number_kilobytes * 1000 * 1000 / (median_time); // KB per ms
my_data->memory_usage[my_data->id] = mem_usage * 1000 / 1024;            // MB per s

Care has been taken to avoid integer overflows in the different operations. However, it was most probably an unnecessary optimization to only use 32-bit integers here rather than 64-bit integers or floating point operations, since the operation is not very frequent.

We then use these values to get the target application’s usage:

app_usage = baseline_memory_usage[my_data->bandit_count - 1] - total_usage;
app_usage = app_usage > 0 ? app_usage : 0;
app_need = (app_usage * my_data->bandit_count * 100 / total_usage);
printf("Total usage: %u MB, App usage: %d MB, \
App need in bandit terms: %d %%\n",
    total_usage, app_usage, app_need);

The app need in bandit terms in the code refers to how much, in percent, the application is using compared to a bandit. The baseline memory usage is the value produced by the benchmark described in Section 3.1.8.

3.2.5 Multithreading the Bandit

The Bandit automatically pins its threads to separate cores in order to isolate them from each other. How this is done is shown in the following code snippet:


pthread_getaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
for (current_core = 0; current_core < cores; ++current_core) {
    if (CPU_ISSET(current_core, &cpuset)) {
        if (nr_cores_found == id)
            found_core = 1;
        else
            ++nr_cores_found;
    }
    if (found_core)
        break;
}
if (found_core) {
    CPU_ZERO(&cpuset);
    CPU_SET(current_core, &cpuset);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
} else {
    fprintf(stderr, "Not enough cores assigned to the process\n");
    exit(-1);
}

The CPU affinity has to be set beforehand, and the process must be assigned at least as many cores as there are threads.

Each thread handles its own measurements of bandwidth usage, but only one thread uses the data for the regulator and eventual printouts to the user. No locks are used for this shared data, as they slowed down the bandit noticeably. Since only one thread writes each data item, only one thread reads it, and the penalty for using stale data is not very significant, this is not a problem.

3.2.6 Portability

The identified portability issues are, as pointed out in Section 3.1.7, the memory layout and the hardware timers. A software portability issue is also created by the method we use to acquire the physical memory addresses, as seen in Appendix B.

The memory layout can at least be broken down into parameters that can be tuned to match the current system. However, for time reasons we only do this for one specific layer of cache, as seen in the code snippet below:


#define LLC_ASSOC 16
#define LLC_SIZE_IN_MB 2
#define LLC_SIZE (LLC_SIZE_IN_MB * MB)
#define LLC_SET_SIZE (LLC_ASSOC * LLC_LINESIZE)
#define LLC_NB_SETS (LLC_SIZE / LLC_SET_SIZE)

LLC_ASSOC in this case refers to the associativity, or the number of ways of the cache as we have referred to it.

The hardware timer is used in the following manner:

static inline uint32_t __get_cpu_time32(void)
{
    __u32 cvall, cvalh;
    asm volatile("mrrc p15, 0, %0, %1, c14" : "=r" (cvall), "=r" (cvalh));
    return cvall;
}

If a low overhead timer is available on the platform, it is just a matter of replacing the contents of this function. That may, however, not always be the case. We still decided that this hardware timer is better for us due to the increased observability we get by using it, especially with the delay represented in actual nanoseconds as described in Section 3.1.4.
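If no such register is available, a portable fallback is to read a monotonic clock through the standard C library. The sketch below is our own assumption of what such a replacement could look like; it keeps the Bandit's 32-bit nanosecond representation but pays the cost of a system or vDSO call per read:

#include <stdint.h>
#include <time.h>

/* Hypothetical drop-in replacement for __get_cpu_time32() on platforms
 * without a cheap hardware timer. Returns a free-running nanosecond
 * counter truncated to 32 bits. */
static inline uint32_t get_time32_fallback(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint32_t)((uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec);
}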

The method used to get the physical memory addresses is not portable to older Linux versions or to other operating systems in general, and it would have to be replaced in its entirety when compiling for another system. We did, however, not find another way to do this on our platform, so it is good enough for us.

3.2.7 User interface

The implementation of the user interface uses getopt for parsing the arguments due to its simplicity. The result is that arguments are always prefaced with a flag, so the ordering of the arguments does not matter. The Bandit is able to produce different output, but the default setting is to be silent.
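A minimal sketch of how such an interface can be parsed with getopt_long; the option names follow the help string shown below, but the structure and variable names are our own illustration, not the Bandit's actual code:

#include <getopt.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int threads = 1, bandwidth = 0, verbose = 0, measure = 0, benchmark = 0;
    static struct option long_opts[] = {
        { "benchmark", no_argument,       0, 'B' },
        { "measure",   no_argument,       0, 'm' },
        { "verbose",   required_argument, 0, 'v' },
        { "threads",   required_argument, 0, 't' },
        { "bandwidth", required_argument, 0, 'b' },
        { 0, 0, 0, 0 }
    };
    int opt;

    while ((opt = getopt_long(argc, argv, "mv:t:b:", long_opts, NULL)) != -1) {
        switch (opt) {
        case 'B': benchmark = 1; break;
        case 'm': measure = 1; break;
        case 'v': verbose = atoi(optarg); break;
        case 't': threads = atoi(optarg); break;
        case 'b': bandwidth = atoi(optarg); break;
        }
    }
    /* ... start the bandit threads using these settings ... */
    (void)benchmark; (void)measure; (void)verbose; (void)threads; (void)bandwidth;
    return 0;
}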

Example usage:

• The Bandit help string:

root@du1:~# ./pbandit --help
Usage:
   --benchmark
-m --measure
-v --verbose=LEVEL
-t --threads=NUM
-b --bandwidth=TARGET

• Usage of a Bandit with 2 threads with the target usage of 2000 MB/s and outputting the total used bandwidth:


root@du1:~# taskset -c 1,2 ./pbandit -t 2 -b 2000 -v 1
Total usage: 1957 MB
Total usage: 2000 MB
Total usage: 2028 MB

• Usage of a Bandit with 3 threads with the target usage of 1000 MB/s, outputting the total used bandwidth, the time of each memory read step, the wait time between read steps and the individual thread bandwidth usage:

root@du1:~# taskset -c 4,5,6 ./pbandit -t 3 -b 1000 -v 3
Thread 2 affinity: 40
Thread 0 affinity: 10
Thread 1 affinity: 20
Thread 2: 3737187 ns, step time: 419.677246 ns, wait time 2500 ns, memory usage 333 MB
Thread 0: 3721406 ns, step time: 407.348389 ns, wait time 2500 ns, memory usage 334 MB
Thread 1: 3803593 ns, step time: 471.557129 ns, wait time 2500 ns, memory usage 328 MB
Total usage: 1000 MB
Thread 2: 3719687 ns, step time: 410.005371 ns, wait time 2496 ns, memory usage 335 MB
Thread 0: 3731921 ns, step time: 419.563232 ns, wait time 2496 ns, memory usage 333 MB
Thread 1: 3723765 ns, step time: 413.191406 ns, wait time 2496 ns, memory usage 334 MB
Total usage: 1001 MB

• Usage of a Bandit with the measuring feature active:

root@du1:~# taskset -c 4,5,6 ./pbandit -t 3 -v 1 --measure
Total usage: 2198 MB, App usage: 543 MB, App need in bandit terms: 74 %
Total usage: 2244 MB, App usage: 497 MB, App need in bandit terms: 66 %
Total usage: 2280 MB, App usage: 461 MB, App need in bandit terms: 60 %

3.2.8 Low Overhead Bandit

At regular intervals, the Bandit performs other tasks than just using bandwidth. The controller, the measurements and the printing all cost the Bandit time. In our evaluation, we will use an alternate bandit that has all those features stripped away, which we call the Low Overhead Bandit. The normal Bandit looks like this:


void *bandit(void *thread_options)
{
    struct bandit_options *options = thread_options;
    struct bandit_data my_data;

    init_bandit(&my_data, options);
    pthread_barrier_wait(&barrier);
    while (1) {
        normal_bandit_iteration(&my_data);
        bandit_update(&my_data);
        bandit_print(&my_data);
        bandit_commit(&my_data);
    }
    free_bandit(&my_data);
}

While the Low Overhead Bandit looks like this:

void *low_overhead_bandit(void *thread_options)
{
    struct bandit_options *options = thread_options;
    struct bandit_data my_data;

    init_bandit(&my_data, options);
    pthread_barrier_wait(&barrier);
    while (1) {
        normal_bandit_iteration(&my_data);
    }
    free_bandit(&my_data);
}

By comparing these bandits to each other we can find out whether the overhead of the tasks beyond the Bandit’s main function interferes with its operation.


Chapter 4

Evaluation

4.1 Method

Using hardware performance counters to measure the performance and function of the Bandit would have been optimal, as they provide accurate counts of many different performance metrics. The ARM Cortex-A15 does support hardware counters that can measure cache hits and misses, off-chip memory accesses and more; however, for software that uses them to gather the metrics to work, the operating system must be compiled with support for it. The Linux installation we are using lacks this feature. Instead, we rely on micro-benchmarks that use the available timing functions, also used by the Bandit, in order to highlight different effects of the Bandit.

In order to verify that the Bandit is indeed using memory bandwidth whilst minimizing the impact on the L2 cache, we use a micro-benchmark that allocates memory normally and then iterates through that memory with read operations while timing it. The program can be run with different sizes of the allocated memory, thus allowing us to run it in a way that only accesses the different caches, or to force it to access only the main memory by iterating over enough memory. It has virtually no overhead when compared to the Bandit as the access loop is unrolled and performs unrelated reads that can be run in parallel. By measuring the total time to iterate over the entire memory block we can get the access time to memory and the total bandwidth when parallel accesses are performed. Since the memory accesses are sequential, prefetching can also occur; however, as long as all the tests are performed this way, the prefetcher should not be a problem since it is present in all the tests. The memory access loop of the reference program is shown in Listing E.1. We refer to this micro-benchmark as the reference program throughout the evaluation.

To be able to execute the Bandit and the reference program on different cores, and to contain them in the separate clusters of the platform, we use the taskset command. The platform has, as described in Section 2.5, two CPU clusters, and the first one contains most of the Linux housekeeping processes. All of the tests, with the exception of a cross-cluster test, are performed in the second cluster in order to minimize noise from other applications.

All values from measurements are an average of 10 runs. When the slowdown of an application’s execution time is of interest, we present it as a relative execution time compared to the original execution time. We calculate it as follows:

    slowdown = new time / base time    (4.1)

The advantage of this value is that it keeps the appearance of the original graph while giving us normalized values that allow us to compare different tests more easily. It is equally applicable when comparing latencies.

4.2 Platform Performance

4.2.1 Memory Access Latency

By using the reference program we can get the numbers for the different memory access latencies of the platform. We just need to select the correct size for the memory block in the reference program. Due to the random replacement policy of the L2 and L3 cache, the memory block needs to be a little bit larger compared to a case with an LRU replacement policy. Using the same simulator as in Section 3.1.3, which is described in Appendix A, gives us again that the hit rate for a memory block twice the size of a cache is about 20%, and if we double once more, for a total of four times the size of the cache, we reach a hit rate around 2%, which is low enough for our purposes. There is also another problem: if the allocated memory is of the same size as the cache, we cannot guarantee that all pages will map uniformly to the different cache sets, which leads to some sets with too many cache lines, which in turn leads to misses, even though the cache is large enough. To compensate for this we use a smaller memory block that is half the size of the target cache type.

Access type   | Memory block size | Access speed
L1 Cache      | 16 KB             | 1.8 ns
L2 Cache      | 1024 KB           | 5.2 ns
L3 Cache      | 4096 KB           | 12.1 ns
Main memory   | 64 MB             | 24.1 ns

Table 4.1: Observed memory access latencies of the different memory types.

Taking these things into consideration we select the sizes 16 KB for L1 access speed tests, 512 KB for L2 access speed tests, 6 MB for L3 access speed tests and 32 MB for main memory access speed tests. The results are shown in Table 4.1. One thing to note is that this result shows the access latency for a series of reads, which means that prefetching is a factor in the measurements, especially in the main memory read case.

4.2.2 Off-Chip Memory Bandwidth

By running multiple instances of the reference program, each iterating over memory blocks of size 32 MB, we can measure the limits of the main memory bandwidth for the cores and clusters. By first testing the available bandwidth for one cluster we find that one core saturates almost all the available bandwidth in one cluster, which is about 2700 MB/s. Next we find out how the two clusters of four cores each affect each other’s bandwidth. In an ideal case with no contention between the clusters, we should be able to achieve about 5400 MB/s. The results are in Table 4.2, and a more gradual bandwidth usage pattern result is found in Figure 4.1.

Figure 4.1: Accessed bandwidth when load is generated in both CPU clusters. (Axes: bandwidth usage of cluster 1 versus bandwidth usage of cluster 2, in MB/s.)

Number of clusters | Bandwidth
One cluster        | 2733 MB/s
Two clusters       | 5142 MB/s

Table 4.2: Total memory bandwidth.

An interesting thing in Table 4.2 is how little the two clusters affect each other. The bandwidth per cluster is only reduced by approximately 6% from the ideal case of 5400 MB/s in total when both clusters are saturated. It is also worth noting how relatively little bandwidth we have access to, especially if we limit ourselves to one core. If we compare this to the platform used by Eklöv et al. [7], they get roughly 12 GB/s from their platform with the same number of active memory channels.

The reason for this low bandwidth could be due to a quality of service feature activated in the CCN-504 [1] not allowing a single cluster to access more bandwidth than about 2700 MB/s. The documented bandwidth of the memory controller is significantly higher than the single cluster speed, which is corroborated by the small reduction of bandwidth when two clusters are running at full capacity.

4.3 Bandit Performance

4.3.1 Cache Miss Generation

What we want to evaluate here is primarily two things:

• The memory allocation method, to ensure minimal cache memory usage.

• The memory access pattern’s ability to generate cache misses.

Firstly, we compare our aligned memory allocation method to normal allocation with malloc. For the normal method we use the reference program, which will act as the reference for normal memory allocation. Our method allocates 4 KB pages and aligns them at the 128 KB boundary as described in the Bandit implementation chapter. The memory is then iterated over with reads in the same fashion as the reference program, with some slight modifications shown in Appendix E.3. We are therefore not using the Bandit directly here, only the Bandit’s memory allocation. We show the normal memory allocation reference result in Figure 4.2 and the aligned memory allocation result in Figure 4.3.


Figure 4.2: Memory access speed with normal memory allocation. A hypothetical case with an optimal LRU cache eviction strategy is given as comparison. (Axes: single access speed in ns versus accessed memory size in KB; curves: normal accesses, optimal LRU accesses.)

If we start by examining Figure 4.2 with the normal memory allocation reference, we can see that the access latency increases around 32 KB, 2 MB and 8 MB. We can therefore see that the latency follows the various cache sizes as expected. The smooth transitions between the limits, compared to the optimal LRU case, are most probably due to the memory not being uniformly allocated over the cache sets, leading to cache misses in some of the sets, and to the fact that cache hits still occur even though the data set does not fit into the cache in its entirety.


Figure 4.3: Memory access speed with aligned versus unaligned memory allocation. The expected result of the memory allocation is given as comparison. (Axes: single access speed in ns versus number of elements per cache way; curves: aligned accesses, expected aligned accesses.)

For our allocation method, aligned memory allocation in Figure 4.3, we can see that it behaves almost exactly as we expected for the L1 and L2 cache accesses, which have 2 and 16 ways respectively. However, we expected the L3 cache, which is four times larger than the L2 cache but still has 16 ways, to act as an L2 cache with 16 * 4 = 64 ways. From the figure we can see that with our method it instead behaves as if it has 128 ways. One possible reason for this could be that the mapping to cache sets is different compared to the other caches. Because our target cache to optimize for is the L2 cache, we can accept this and just allocate the extra pages needed to generate main memory accesses.

Now that it seems like the allocation method is placing the memory in the correct cache sets, we need to test that the Bandit does not invalidate the L2 cache when it is run. In Figure 4.4 we use the reference program with a memory block of 512 KB, i.e. a block that should fit well into the L2 cache.

Figure 4.4: Observed latency of L2 cache access. (The figure shows L2 cache access latency in ns under no load, one instance and two instances, for the no load, L2 reader, Bandit, low overhead bandit and reference configurations.)

We also have a modified bandit here, as described in Section 3.2.8, where we have stripped away the control loop and measurement functions in order to minimize overhead and focus the test on the memory accesses performed by the Bandit. The bar called L2 reader is an identical program to the one being tested. It is used as a comparison for what happens when we have accesses that do not invalidate the cache.

As we can see in the figure, the Bandit has a significantly lower impact on the L2 cache latency than the reference application, which is what we expected. This is most likely due to the low cache overhead of the Bandit; however, it could also be that the Bandit is not generating as much traffic as the reference. To examine this, we study how the Bandit affects main memory access latency compared to the reference in Figure 4.5. The memory accesses we measure are generated by the reference program, and we again use a low overhead bandit as in the previous experiment. What we can see there is that the Bandit affects the memory access latency almost as much as the reference program. The small difference probably comes from the overhead of the delay loop, but it does not seem to be significant.

Figure 4.5: Main memory access latency. (The figure shows main memory access latency in ns under no load, one instance and two instances, for the no load, Bandit, low overhead bandit and reference configurations.)

From these experiments, we can be fairly certain that the Bandit’s basic function is correct; it performs memory accesses whilst keeping the cache memory footprint low.

4.3.2 PI-Controller Performance

The controller is fast for target bandwidth usage above 1%, and generally stable. However, there are some oscillations when the bandwidth usage is above 80%. Example runs are demonstrated in Listing 4.1, where there is a two second delay between each bandwidth output.

root@du1:~# taskset -c 4,5,6 ./pbandit -t 3 -b 2500 -v 1
Total usage: 2530 MB
Total usage: 2496 MB
Total usage: 2522 MB
Total usage: 2488 MB
Total usage: 2501 MB
root@du1:~# taskset -c 4,5,6 ./pbandit -t 3 -b 1000 -v 1
Total usage: 997 MB
Total usage: 999 MB
Total usage: 1000 MB
Total usage: 1001 MB
Total usage: 1000 MB
root@du1:~# taskset -c 4,5,6 ./pbandit -t 3 -b 10 -v 1
Total usage: 28 MB
Total usage: 15 MB
Total usage: 10 MB
Total usage: 10 MB
Total usage: 9 MB

Listing 4.1: Demonstration of regulator performance

This is due to two factors. When the bandwidth usage is high, the delay between memory accesses is short, which means that a small change in delay has a much larger relative effect than in situations with low bandwidth usage. There is also some variation, or noise, in the memory access latency and, because of the same sensitivity to small changes, this noise affects the controller in a measurable way.
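To illustrate the sensitivity, the following is a minimal sketch of a PI-style update of the inter-access delay of the kind described in Section 3.2.3; the gain constants, units and function names are hypothetical and not the values used in the actual Bandit.

#define KP 0.05   /* proportional gain (hypothetical value) */
#define KI 0.01   /* integral gain (hypothetical value)     */

static double integral_error = 0.0;

/* target_mbps and measured_mbps are bandwidths in MB/s; the return value
 * is the new delay (e.g. busy-loop iterations) between memory accesses. */
static double update_delay(double delay, double target_mbps, double measured_mbps)
{
    double error = measured_mbps - target_mbps;   /* positive: stealing too much */
    integral_error += error;

    /* Using more bandwidth than requested means we must wait longer. */
    delay += KP * error + KI * integral_error;
    return delay > 0.0 ? delay : 0.0;
}

At a high target bandwidth the computed delay is small, so a correction of the same absolute size is a much larger relative change of the delay, which is consistent with the oscillations observed above.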

4.4 Measuring Memory Latency

To examine the memory access latencies in greater detail, we run the reference program alongside a bandit and vary the target amount stolen by the Bandit from 0 to 2800 MB/s. In order to steal as much of the available bandwidth as possible, we perform this test with 3 concurrent bandit threads. The reference program is run with 512 KB and 32 MB to test L2 cache access speed and main memory access speed, respectively.

In Figure 4.6 we have the two tests with the measured memory access latency. We can see that the Bandit is able to use more of the bandwidth when run against the L2 cache reader, which is the reason for the graph flattening out.

[Figure 4.6: Memory access latency in ns against target stolen bandwidth in MB/s, for main memory reads and L2 cache reads.]

Figure 4.7: Access latency to main memory and L2 cache converted to a relative slowdown from the base latency of each access. (The figure plots relative access latency against target stolen bandwidth in MB/s, for main memory reads and L2 cache reads.)

To be able to compare the experiments better, we transform the access latency as described in Equation 4.1. This gives us Figure 4.7, where we can see a similar development of the latency in both cases. The slowdown grows slowly and roughly linearly at first, and when the saturation is high the slowdown increases at a higher rate. This could be a sign that the parallelism in the memory hierarchy is used increasingly well in the beginning, while at the end there is contention for the outstanding memory access queues. This is very clear in the L2 cache access case: when the queues are full, a fast L2 access has to wait for many slow main memory accesses and reaches latencies comparable to main memory access latencies. Since there is an order of magnitude of difference between the two types of access, L2 cache accesses suffer more from the contention in relative terms.
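For reference, the conversion amounts to dividing each measured latency by the base latency obtained with no bandit running (this is only a restatement of the idea; the exact form is given by Equation 4.1 earlier in the thesis):

\[ \text{relative slowdown} = \frac{t_{\text{access, with bandit}}}{t_{\text{access, no bandit}}} \]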

4.5 Measuring Bandwidth

In order to test the measurement feature under controlled circumstances, we need to know how much bandwidth the target application is using. We also need to do this without proper measurement facilities. Therefore, we begin testing it by measuring another bandit and the reference application, which both output their achieved bandwidth. Thanks to the Bandit’s controller feature, it is easy to request different bandwidth levels for the target. It also allows us to see the controller feature run under a loaded scenario.


In Figure 4.8 the target application is another bandit. The target attempts to use more and more bandwidth until it maxes out just short of 700 MB/s. This test gives at least some merit to our method: both programs are competing for the same resource and they affect each other in a predictable manner. However, this experiment is optimal for the method, as both programs use the exact same way of accessing memory and the bandwidth usage is stable over a very long duration.

Figure 4.8: Measuring bandwidth usage of another bandit. (The figure plots observed usage against target usage in MB/s, for measured usage and actual usage.)

In the next experiment we use a slightly modified version of the reference program to generate load on the system. We inserted a busy loop after each read, as seen in Listing E.2, and varied the length of the loop in order to generate different loads. Figure 4.9 shows the result. We can see that the Bandit is fairly successful in measuring the bandwidth usage; however, there is an almost constant error in the measurement. This could be due to the difference in the memory access patterns of the two programs, which causes different behavior in the system.
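A minimal sketch of the kind of modification described above is shown below; the real version is Listing E.2, and the array size, stride and names here are placeholders.

#include <stddef.h>

/* Sketch of the modified reference loop: one memory read followed by a
 * busy loop of variable length that throttles the generated bandwidth. */
volatile long sink;

static void read_with_delay(const long *array, size_t n_elements,
                            size_t stride, unsigned busy_iterations)
{
    for (size_t i = 0; i < n_elements; i += stride) {
        sink = array[i];                          /* the measured read        */
        for (volatile unsigned d = 0; d < busy_iterations; d++)
            ;                                     /* busy loop, sets the load */
    }
}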

Figure 4.9: Measuring bandwidth usage of the reference program with the added delay. (The figure plots observed usage in MB/s against the number of busy loop iterations, for measured usage and actual usage.)

Both of these experiments involve a target that generates a stable load and is therefore easy to sample with our method. When attempting to measure an application with more varied bandwidth characteristics, the results become very unreliable. This is because we only output the current measurement once every 2 seconds and therefore fail to gather a detailed picture of the characteristics. Another problem is that accesses beyond the L2 cache are the bottleneck in the memory system, as we established in Section 4.2. This causes reads that hit the L2 cache to have a visible impact on the measurements. As an example, the reference program set to read from the L2 cache gives a measurement of around 120 MB/s even though it never performs a main memory read.

To improve the function we could increase the sampling rate. By taking samples once a millisecond instead of once every other second, as we do now, we could get a more detailed view of the memory usage. However, this does not prevent the issue of L2 accesses affecting the measurement. The Bandit would also need to be calibrated for the new overhead, as it reduces the ability to steal bandwidth.
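As a rough sketch of what a millisecond-rate sampling loop could look like (the timer usage is standard POSIX, but sample_bandwidth() is a hypothetical stand-in for the Bandit's measurement step, not an existing function):

#include <stdint.h>
#include <time.h>

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

static void sampling_loop(void)
{
    const uint64_t period_ns = 1000000;   /* 1 ms between samples */
    uint64_t next = now_ns();
    for (;;) {
        /* sample_bandwidth();  store the sample for later aggregation */
        next += period_ns;
        while (now_ns() < next)
            ;                             /* busy-wait until the next sample */
    }
}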


4.6 Bandit's Effect on Programs

4.6.1 Overview

The intention of the Bandit is to test other applications' properties when running in an environment with main memory contention. To begin with, we will investigate three custom-made and highly synthetic applications, with the goal of highlighting specific behaviours that can occur in applications. Then we will investigate another synthetic benchmark, which is designed to emulate the characteristics of the control plane application in a base station, as used in the telecom industry. Finally, we will run the Bandit with applications from the MiBench suite [9].

The Bandit acts the same as in the previous test of the memory access latency with the target bandwidth usage incrementing in steps of 100 MB/s up to 2800 MB/s.

4.6.2 Micro-Benchmarks

The three custom applications are really simple and they each focus on one trait. The source for them can be found in Appendix F. The execution time in the normal case with no bandit running is shown in Table 4.3.

Program                    Base Execution Time
L2-application             1.01 s
Computation-application    0.84 s
Random-application         0.24 s

Table 4.3: Base execution times without any memory bandwidth contention.

• The L2-application, shown in Listing F.1, is a matrix multiplication test that uses a simple tiled algorithm with a tile size of 16 and performs the calculations on a square matrix with a side of 512 elements, where an element is an integer. This gives us a total size of 1 MB, which fits into the L2 cache. The goal of this application is to test the Bandit's effect on an application that performs many L2 reads without needing any off-chip memory bandwidth. (A sketch of its general structure is given after this list.)

• The Computation-application, shown in Listing F.2, performs calculations on two vectors that are sequential in memory and writes the result to a third vector. Each vector is about 50 MB in size, and after each read many calculations are performed on each element. This gives us a program that should have good potential for prefetching. The goal of this application is to test an application that uses off-chip memory bandwidth while still being insensitive to contention for off-chip memory bandwidth.


• The Random-application, shown in Listing F.3 and F.4, is a pointer chaser application that follows a linked list where the elements are randomized in the memory. This program will have no possibility for prefetching and will also cause many page faults. The goal of this application is to see how an application without any prefetching is affected by the Bandit.
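As an illustration of the structure of the first of these programs, the following is a minimal sketch of a tiled matrix multiplication matching the description of the L2-application; the actual source is Listing F.1 in Appendix F, and the loop order and names here are illustrative.

#define N    512   /* matrix side, as in the description above */
#define TILE 16    /* tile size, as in the description above   */

/* C = A * B with a simple tiled loop nest; C must be zero-initialized by
 * the caller.  The tiling keeps the working set of each tile in cache
 * while it is being reused. */
static void tiled_matmul(const int *A, const int *B, int *C)
{
    for (int ii = 0; ii < N; ii += TILE)
        for (int kk = 0; kk < N; kk += TILE)
            for (int jj = 0; jj < N; jj += TILE)
                for (int i = ii; i < ii + TILE; i++)
                    for (int k = kk; k < kk + TILE; k++)
                        for (int j = jj; j < jj + TILE; j++)
                            C[i * N + j] += A[i * N + k] * B[k * N + j];
}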

Figure 4.10: Custom test programs' relative slowdown by bandit. (The figure plots relative execution time against target stolen bandwidth in MB/s, for the L2-application, Computation-application and Random-application.)

As predicted by the earlier measurements of the L2 access latency, our L2-application, with its many L2 cache accesses, suffers considerable slowdowns when the system is completely saturated. Even when the Bandit only uses around 1,000 MB/s, the execution time increases by about 50%. Even though an application is effective at utilizing the L2 cache, it may still be very sensitive to contention in the system.

The Computation-application is, as expected, not at all sensitive to the contention. This is because the sequential nature of the program allows the prefetcher to work very well in this case. It could also be because the program has a relatively low number of accesses outside the L1 cache, especially compared to the L2-application. In addition, the fact that main memory accesses were slowed down by a factor of about 3 with the Bandit at maximum load means that the memory accesses are not as badly affected as the L2 cache accesses.

The third program, Random-application, is not as badly affected as expected. Due to the lack of computations and prefetching, we expected similar penalties for this program as for the pure memory access evaluation performed earlier, shown in Figure 4.7. However, it is only hit by a slowdown of a factor of 2 at maximum bandit bandwidth usage. The reason for the lower performance hit is possibly the time spent on page faults. The high overhead for performing table walks at each memory access reduces the amount of time that the process is actually accessing main memory.

4.6.3 Telecom Application

Program                 Base Execution Time
Telecom-application     10.28 s

Table 4.4: Base execution times without any memory bandwidth contention.

This benchmark is intended to simulate the behavior of a telecom application. It has a large code size, in our case 36 MB, and attempts to contain parts that are both cache friendly and not so cache friendly. Compared to the other three applications, this one is more balanced in its behavior and more realistic than the previous synthetic benchmarks.

From Figure 4.11 we can see that it is not as extreme in its bandwidth utilization as the memory access test, which is expected. An interesting observation is that the execution time development seems to be separated into two parts with different rates of increase. This could be a development similar to what Eklöv et al. [7] observed: the instructions per clock cycle, IPC, decreased slowly when the bandwidth utilization came closer to the maximum and then took a turn for the worse when the attempted utilization was beyond the maximum and the Bandit stole slots in the queue for the memory resources. If this is the case, and the assumed maximum available bandwidth is 2,800 MB/s, then the bandwidth requirement for this program is around 1,000 MB/s. However, this is not only main memory bandwidth, as the L2 accesses use up bandwidth as well.
