
Optimizing on-chip Machine Learning for Data Prefetching

Bachelor’s thesis in Computer science and engineering

HAMPUS LARSSON
MIRANDA JERNBERG
ALBIN PANSELL
FABIAN STIGSSON
FREDRIK HAMREFORS
PONTUS SÖDERSTRÖM

Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
UNIVERSITY OF GOTHENBURG


Bachelor’s thesis 2022

Optimizing on-chip Machine Learning for Data Prefetching

HAMPUS LARSSON
MIRANDA JERNBERG
ALBIN PANSELL
FABIAN STIGSSON
FREDRIK HAMREFORS
PONTUS SÖDERSTRÖM

Department of Computer Science and Engineering
Chalmers University of Technology
University of Gothenburg
Gothenburg, Sweden 2022


HAMPUS LARSSON MIRANDA JERNBERG ALBIN PANSELL FABIAN STIGSSON FREDRIK HAMREFORS PONTUS SÖDERSTRÖM

© HAMPUS LARSSON, MIRANDA JERNBERG, ALBIN PANSELL, FABIAN STIGSSON, FREDRIK HAMREFORS, PONTUS SÖDERSTRÖM 2022.

Supervisor: Mateo Vázquez Maceiras, Department of Computer Science and Engineering

Examiner: Pedro Petersen Moura Trancoso, Department of Computer Science and Engineering

Bachelor’s Thesis 2022

Department of Computer Science and Engineering

Chalmers University of Technology and University of Gothenburg SE-412 96 Gothenburg

Telephone +46 31 772 1000

Gothenburg, Sweden 2022


HAMPUS LARSSON, MIRANDA JERNBERG, ALBIN PANSELL, FABIAN STIGSSON, FREDRIK HAMREFORS, PONTUS SÖDERSTRÖM

Department of Computer Science and Engineering

Chalmers University of Technology and University of Gothenburg

Abstract

The idea behind data prefetching is to speed up program execution by predicting what data is needed by the processor before it is actually needed. Data prefetching is commonly performed by prefetching the next memory address in line, but there are other, more sophisticated approaches, such as machine learning. A machine learning prefetcher can be highly accurate and the model can be of great size, but applying it to hardware enforces a limit on the size of the model. Therefore, a balance between machine learning model size and performance has to be considered. This paper describes the optimization of a machine learning prefetcher’s size, whilst retaining performance, and how it was achieved by considering Recurrent Neural Networks in hardware. Finally, this paper suggests machine learning prefetcher attributes promoting feasibility in hardware, as well as presenting a machine learning model optimized for prefetching in a hardware setting.

Keywords: Data Prefetching, Machine Learning, HW/SW co-Design, HLS, FPGA


Summary

The idea behind data prefetching is to speed up program execution by predicting what data the processor will need in the future. Data prefetching is usually performed by fetching the next memory block in line, but more complex implementations of data prefetching exist, among them machine learning. There are few limitations inherent to machine learning itself; in the case of a machine learning based data prefetcher, the hardware will be the bottleneck.

Therefore, a trade-off between performance and the size of the machine learning model has to be made. This report covers the optimization of the size of machine learning based prefetchers without sacrificing performance, and how it was carried out using recurrent neural networks realizable in hardware. Finally, this report presents characteristics of a machine learning based prefetcher that promote feasibility in hardware, as well as a specific machine learning based prefetcher optimized for hardware.


Acknowledgements

This project and its underlying research would not have been possible without the support of our supervisor Mateo Vázquez Maceiras. His extensive knowledge of the subject and supporting nature has been of great importance for the completion of the work presented in this report.

Hampus Larsson, Miranda Jernberg, Albin Pansell, Fabian Stigsson, Fredrik Hamrefors, Pontus Söderström, Gothenburg, June 2022


Contents

List of Figures

List of Tables

1 Introduction
1.1 Problem Definition
1.2 Purpose and Goal
1.3 Limitations

2 Background
2.1 Data Prefetching
2.1.1 Hardware Based Data Prefetching
2.1.2 Software Based Data Prefetching
2.2 Machine Learning
2.2.1 Data Preprocessing
2.2.2 Neural Networks
2.2.3 Recurrent Neural Networks
2.2.4 LSTM - Long Short-Term Memory
2.2.5 Output representation
2.2.6 Training
2.2.7 ML Evaluation
2.2.8 Pruning and Quantization
2.3 Machine Learning for Data Prefetching
2.4 FPGA - Field Programmable Gate Arrays
2.5 High-Level Synthesis
2.5.1 HLS Design Flow

3 Setup
3.1 Traces
3.2 Model Design - Languages and Libraries
3.3 ML Development Tool
3.4 FPGA Hardware
3.5 HLS Development
3.5.1 Language Limitations
3.5.2 Test Benches
3.5.3 Synthesis Tools


4 Methodology
4.1 ML Model Design
4.1.1 Baseline
4.1.2 Data exploration
4.1.3 Data pipeline
4.1.4 Address preprocessing
4.1.5 Offset preprocessing
4.1.6 Training ML models
4.1.7 Testing ML Models
4.1.8 Optimization
4.2 Prefetcher Implementation
4.2.1 Layer by Layer Implementation
4.2.2 Single Layer Network - The First ML Model
4.2.3 Adaptable ML Model
4.2.4 Multi Layer Network - The Final Model
4.2.5 Reading the Model File
4.2.6 Stride Prefetching in C++
4.2.7 Markov Table Prefetching Implementation
4.3 ML Model Synthesis
4.3.1 Verification of Function
4.3.2 Synthesis and co-Simulation

5 Results
5.1 ML Model Results
5.1.1 Data Analysis
5.1.2 Metrics
5.1.3 Model Performance
5.1.4 Pruning and Quantization
5.2 C-Simulation Results
5.3 C Synthesis Results
5.4 The Effect of Dense Layers on Area Usage
5.4.1 Dense Layers Compared to LSTM Layers
5.5 RTL Synthesis Results
5.6 Final Model Results
5.6.1 Projected Results Using Pruning and Quantization
5.7 Ethics

6 Conclusion

Bibliography


List of Figures

2.1 A simple feedforward neural network.
2.2 A simple recurrent neural network.
2.3 Long Short-Term Memory architecture.
2.4 The HLS process as presented by Vitis HLS and Xilinx [29]
4.1 Features and labels after windowing
4.2 The procedure of the address preprocessing.
4.3 A parametric search sweep conducted with the Weights & Biases analysis tool. A curve represents a given configuration and the right axis called accuracy represents the performance for that configuration.
5.1 The distribution of offsets within a range of [-10000, 10000] in the 473.astar-s1 trace file that is included in the Spec06 benchmarks. Out of 5317304 offsets, 38.42% of them are within the range displayed in the histogram.
5.2 The distribution of offsets within a range of [-10000, 10000] in the 482.sphinx3-s0 trace file that is included in the Spec06 benchmarks. Out of 6101531 offsets, 68.11% of them are within the range displayed in the histogram.
5.3 The distribution of offsets within a range of [-10000, 10000] in the 470.lbm-s0 trace file that is included in the Spec06 benchmarks. Out of 3923764 offsets, 99.86% of them are within the range displayed in the histogram.
5.4 The distribution of offsets within a range of [-10000, 10000] in the 437.leslie-s0 trace file that is included in the Spec06 benchmarks. Out of 10624283 offsets, 45.43% of them are within the range displayed in the histogram.
5.5 The distribution of offsets within a range of [-10000, 10000] in the 433.milc-s2 trace file that is included in the Spec06 benchmarks. Out of 6743856 offsets, 70.69% of them are within the range displayed in the histogram.
5.6 The distribution of offsets within a range of [-10000, 10000] in the 471.omnetpp-s0 trace file that is included in the Spec06 benchmarks. Out of 8940570 offsets, 70.07% of them are within the range displayed in the histogram.


5.7 Performance for different layer configurations when other hyperparameters were held constant. The configurations were trained and tested on the same trace file.
5.8 Test results after training and testing on data from the same file. Accuracy is displayed in blue and filtered accuracy is displayed in orange. The metrics are defined in 5.1.2.
5.9 Performance compared to next line baseline when trained on 80% of the data in the traces, and tested on the remaining 20% of the data. The bars represent the relative performance compared to the next line prefetcher, in terms of the accuracy of correctly predicted load addresses. The height of the bars is proportional to the performance of the baseline. The horizontal red line indicates the accuracy of a next line prefetcher.
5.10 Test results after testing on a different file. Accuracy is displayed in blue and filtered accuracy is displayed in orange. The metrics are defined in 5.1.2.
5.11 Performance compared to next line baseline when the model was tested on traces it had not been trained on. The bars represent the relative performance compared to the next line prefetcher, in terms of the accuracy of correctly predicted load addresses. The height of the bars is proportional to the performance of the baseline. The horizontal red line indicates the accuracy of a next line prefetcher.
5.12 Usage of resources when adding dense layers
5.13 Visualization of the final model


List of Tables

3.1 PYNQ Z2 resources
4.1 Input and output dimensions of the final model
5.1 Model configuration
5.2 C synthesis results for the different prefetchers.
5.3 Overview of area usage by dense layers
5.4 RTL synthesis results for the different prefetchers.
5.5 C synthesis and RTL synthesis results for the final model.


1 Introduction

Since its introduction in the 1970s, the microprocessor has gained performance every generation. The chip fabrication process has evolved significantly, and transistors have shrunk all the way down to the 5 nm used today. Memory performance (specifically main memory DRAM), on the other hand, has increased at a comparatively slow rate [1]. Thus, to optimize the speed of computer programs there is a need for bridging technologies between the fast microprocessor and the bottleneck memory.

The most employed solution to the processor-memory gap is the use of memory hierarchies [2]. A memory hierarchy can be thought of as a pyramid structure containing different memory technologies of varying speed and capacity. The pyramid therefore usually consists of a stack of low-capacity but high-speed memories together with high-capacity but low-speed memories. More specifically, such a memory hierarchy typically consists of static RAM (SRAM) working together with dynamic RAM (DRAM). SRAM is utilized in processor-near caches as its speed is more in line with that of the processor. However, SRAM is more expensive than DRAM and therefore cannot be used as main memory; DRAM is cheaper per unit of storage at the cost of speed. The configuration of SRAM in tandem with DRAM allows for minimal memory access penalties and large main memory capacities. However, this solution is not perfect. Data-intensive programs are commonplace today and only so much data can be stored at once in the fast memories [3]. Moreover, many of these data-intensive programs do not exhibit locality, and therefore the cache will have to be flushed often, a behavior which leaves the processor stalling for periods of time when instructions could have been executed instead.

If, in theory, the cache memory always contained the data required by the processor, there would be no stalling and no penalty time spent accessing the slow main memory. This requires some mechanism to populate the cache in advance with the required data. This mechanism is called data prefetching. By using data prefetching, the processor can hide memory latency by fetching the required data into the cache while the processor is busy executing another instruction. Then, when the processor has finished executing that instruction, the required data can be loaded from the cache, which prevents the processor from stalling.

Data prefetching is the act of predicting what the next memory address to fetch will be. Today there exists a plethora of different approaches and implementations of data prefetching. Common approaches include the addition of a fetch instruction to the target processor's set of instructions, or the usage of a separate hardware chip dedicated to data prefetching [2]. A particularly popular tool today, which also happens to be applicable to data prefetching, is machine learning (ML).

Machine learning is a sub-category of artificial intelligence (AI) handling computer self-learning [4]. Computers are trained to modify their own actions in order to achieve a specified result. The computer employs self-evaluation of past actions to decide on future actions; a concept applicable to data prefetching. The underlying theory is that the computer itself is well suited to predict what data it will require in the future. Normally, to perform efficient data prefetching, some kind of strategy has to be developed by a human. This strategy should provide a plan for what data to prefetch depending on the current state of the running program. But if the program is complex, it might not be clear what an optimal strategy is. Moreover, the developers will always be limited by their brainpower and knowledge of the subject when designing these prefetching strategies. If ML is introduced into the equation, the computer itself will be the one designing the strategies; computers are extremely well equipped to analyze and handle great loads of complex data. ML for data prefetching is not a total novelty: there is the ML-Based Data Prefetching Competition [5] held at the ISCA conference. In past years this competition has produced many powerful and well-performing ML-based prefetchers. There is, however, a catch: the prefetchers might be suitable in theory, but in reality they are costly and resource heavy. Most might not be realizable in physical hardware.

Today it is not entirely clear what complexity of ML is possible in hardware, or, for that matter, whether certain ML models translate better to hardware.

1.1 Problem Definition

As mentioned, data prefetching exists to tackle the processor-memory gap, with main memory speed being the bottleneck. For data prefetching to be useful, two conditions have to be met: the accuracy of the prefetcher has to cause a performance gain significant enough to justify routing power to the prefetcher, and the completion time of an executed prefetch must be smaller than the period of instruction fetching performed by the processor. If both of these conditions are satisfied, the data prefetcher will result in a net positive performance.

Predicting memory addresses typically requires the prefetcher to detect patterns in the way the processor is accessing the memory. Machine learning models thrive on finding patterns in complex data. Thus, to achieve great prefetching accuracy, machine learning is a well-performing asset. However, as with most technology, there is typically a trend between the size of the ML model and its performance. A more complex ML model will typically be able to identify more complex patterns; to reach great prefetching accuracy it seems natural to maximize the ML model's computational power. As seen in the ML-Based Data Prefetching Competition last year [5], there exist prefetchers easily satisfying the accuracy requirement presented in the previous paragraph. However, the power of these prefetchers also means a great cost if actual hardware implementation were to be pursued; the price of performance has to be paid either in resources, in terms of area on the board, or in speed. Resources are limited in a hardware setting, which in turn means that a sacrifice in speed might be the result of the pursuit of computational power. A sacrifice in speed is in this case synonymous with a slower clock cycle. Not a lot of research has been conducted on the topic, and it is not clear which ML-based prefetchers are both realizable and beneficial in hardware.

1.2 Purpose and Goal

This leads to the purpose of this project, which is to research different ML-based data prefetchers in order to find models that are suited for a real hardware setting. Thus, the performance of ML-based prefetchers will be studied. ML-based prefetchers with optimal performance in terms of predicting future load addresses will be evaluated in terms of how well they might perform in real hardware. The underlying reason is that some well-performing ML-based prefetchers today focus on the prediction accuracy of a model instead of its actual real-life feasibility. This means that these models might not be realisable in any hardware setting.

The goal is therefore to research and develop a suitable, and therefore feasible, ML model that can be used to implement hardware prefetching. Different ML models of varying characteristics are to be tested and judged based on their performance and resource cost. Thus, model characteristics promoting feasibility for hardware implementation can be identified. In turn, the final aim is to implement an optimal ML model for data prefetching in a hardware setting.

1.3 Limitations

As a limitation for this project, the data used to train and test the accuracy of the ML models came from already existing data traces that were generated for The International Symposium on Computer Architecture's (ISCA) ML-Based Data Prefetching Competition from 2021 [5]. The reason for using these traces and not generating new ones is a question of time. Generating high-quality datasets can be tricky and very time consuming. Thus, to be able to stay within the time frame given for this project, as well as to keep the focus on the ML model implementations, the decision was made to use pre-generated datasets. The traces available in this project were traces from program executions originating from the SPEC 2006, SPEC 2017, and GAP benchmark suites [6]. However, only traces from the SPEC 2006 benchmark suite were actually used for training and testing purposes. The reason for this is the lack of computing resources available to train and test on all of the available data.


2 Background

This section contains explanations of the theory behind the methods employed in section 4. Relevant areas are explained in sufficient depth for an understanding of the different processes. Furthermore, this section serves as a base for understanding section 3, where the tools utilized in this project are presented.

2.1 Data Prefetching

The ideal cache is both infinitely large and infinitely fast. With unlimited speed there would be no processor stalling when loading from the cache. Unlimited size would then result in the cache having everything stored and thus handling every possible memory reference. A cache of unlimited size can be mimicked by utilizing data prefetching. Data prefetching is the act of anticipating a memory reference and loading the corresponding memory block into the cache before the microprocessor makes said reference; thus, costly memory fetches can be hidden [7]. There are multiple ways to perform data prefetching, both in hardware and in software. When handling prefetching on a software level, it is common to expand the instruction set to include a 'fetch' instruction [7]. In comparison, hardware prefetching does not add any extra instruction overhead; a hardware prefetch is realised by implementing specific processor-monitoring hardware [2].

2.1.1 Hardware Based Data Prefetching

Usually, when implementing data prefetching in a physical environment, specific processor-monitoring hardware is created to coexist alongside the CPU. As mentioned in the introduction (section 1), different strategies exist where data prefetching is concerned, which in turn results in a plethora of different hardware based data prefetchers. Two examples of hardware based data prefetchers are the Markov prefetcher [8] and the Stride prefetcher [9]. The Markov based prefetcher is implemented by storing a table containing addresses that followed cache misses. The Markov table is then based on Markov probabilities; that is, the address most likely to follow a given address is the one remembered by the prefetcher. The stride prefetcher employs a different strategy: prefetching is implemented using set-offset fetching. At any point in time, the next address to be prefetched is the one located some set offset away.
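For illustration, a minimal software sketch of the two strategies might look as follows; the class names, stride value, and table layout are illustrative and not taken from the cited designs.

    # Illustrative sketch of the two strategies described above (not the cited designs).

    class StridePrefetcher:
        """Prefetch the address located a fixed offset (stride) away."""

        def __init__(self, stride: int = 1):
            self.stride = stride

        def predict(self, address: int) -> int:
            # The next prefetch candidate is the current address plus a set offset.
            return address + self.stride


    class MarkovPrefetcher:
        """Remember, for each miss address, the address most likely to follow it."""

        def __init__(self):
            # miss address -> {successor address: observed count}
            self.table = {}
            self.last_miss = None

        def record_miss(self, address: int) -> None:
            if self.last_miss is not None:
                counts = self.table.setdefault(self.last_miss, {})
                counts[address] = counts.get(address, 0) + 1
            self.last_miss = address

        def predict(self, address: int):
            counts = self.table.get(address)
            if not counts:
                return None
            # Prefetch the successor with the highest observed probability.
            return max(counts, key=counts.get)


    if __name__ == "__main__":
        markov = MarkovPrefetcher()
        for miss in [0x100, 0x200, 0x100, 0x200, 0x100, 0x300]:
            markov.record_miss(miss)
        print(hex(StridePrefetcher(stride=64).predict(0x100)))  # 0x140
        print(hex(markov.predict(0x100)))                       # 0x200, seen most often after 0x100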


2.1.2 Software Based Data Prefetching

As mentioned, in addition to the hardware based data prefetchers, there exists software based data prefetching. Every computer has an instruction set architecture (ISA), which contains instructions for commanding the processor [10]. Thus, the ISA is the interface between the actual computer hardware and its software. The ISA can be expanded by adding further instructions, as in the case of adding a 'fetch' instruction to perform data prefetching. A prominent example of a software based prefetcher is the compiler algorithm designed by Mowry et al. [11]. The compiler algorithm identifies which data references might result in a cache miss. Then, for these references, a 'fetch' is inserted ahead of time. As this example shows, no additional hardware has to be implemented to prefetch on a software level; a benefit compared to hardware based data prefetching.

2.2 Machine Learning

Machine learning allows algorithms to solve problems by learning from data rather than being explicitly programmed to solve some specific problem [12]. Tom Mitchell provides a more technical definition of machine learning: “A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E" [13]. Computers are trained to modify their own actions in order to achieve a specified result. The computer employs self-evaluation of past actions to decide on future actions. Machine learning has shown success in solving certain computationally intensive tasks that traditionally have been hard to solve, such as image recognition. There are many classes of algorithms in machine learning; in this project, supervised learning has primarily been employed. Supervised learning algorithms learn by getting feedback from a loss function measuring the incorrectness of a prediction compared to a target value. By modifying parameters in a way that minimizes the loss function, these algorithms can learn to represent functions that can be used to solve complex problems.

2.2.1 Data Preprocessing

For a machine learning model to be able to perform predictions on data, it is often necessary to transform the raw input data to a format that is applicable to that specific ML model. This is accomplished by transforming the way the data is represented into a new representation, one that is more informative than the original raw representation of the data [14].

2.2.2 Neural Networks

Neural networks (NN) are a type of machine learning model inspired by the neurons in the human brain. The network consists of neurons that are grouped into different layers: the input layer, one or several hidden layers, and the output layer. As can be seen in Figure 2.1, the different layers are connected to each other through complex pathways, and this structure is used to interact between the groups of neurons to then make predictions about the outcome [12]. This architecture is also called a dense neural network. It consists of fully connected layers, i.e. the neurons of every layer are connected to all of the neurons of the following layer. These layers will be referred to as dense layers throughout the report.

Figure 2.1: A simple feedforward neural network.

There are different kinds of layers that can be used in the implementation of an ML model. In this project, dense layers will be especially significant. When used as intended, a dense layer performs the following operation:

output = activation(dot(input, kernel) + bias)

Here, activation is an activation function that scales the output values passed from one layer to another. The purpose of an activation function is to add nonlinearity to the neural network, which allows it to compute nonlinear functions. The dot denotes the dot product of the kernel and the input, where the kernel is the weight matrix that is trained and updated with the backpropagation algorithm [15]. The bias is a vector of weights that is added to the dot product.
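A minimal NumPy sketch of this operation might look as follows; the shapes and the ReLU activation are chosen purely for illustration.

    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)

    def dense_layer(inputs, kernel, bias, activation=relu):
        # output = activation(dot(input, kernel) + bias)
        return activation(np.dot(inputs, kernel) + bias)

    # Example: a layer mapping 4 inputs to 3 neurons.
    rng = np.random.default_rng(0)
    x = rng.normal(size=(1, 4))        # one input example
    kernel = rng.normal(size=(4, 3))   # trainable weight matrix
    bias = np.zeros(3)                 # trainable bias vector
    print(dense_layer(x, kernel, bias).shape)  # (1, 3)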

2.2.3 Recurrent Neural Networks

Many machine learning problems can be thought of as time series forecasting problems. A time series is a sequence of data points that occur in sequential order [16]. The task is to extrapolate future data points based on previous data points.

For machine learning models that are to find patterns in time series, it is important to retain the sequential nature of the data when representing it as input. What this means is that, rather than feeding the model an input example consisting of one data point in time, a series of data points can be fed as one input example.


For time series forecasting, Recurrent Neural Networks (RNN) can be used to retain the sequential nature in the data. These layers allow a type of memory by which the layer can compute its output, not only based on the current input, but also based on previous inputs.

As figure 2.2 displays, an RNN represents history by having recurrent connections to the neurons, which can provide an unlimited history length. This differs from a feedforward neural network, see figure 2.1, which represents history through a fixed context of N-1 inputs and is therefore limited [17].

Figure 2.2: A simple recurrent neural network.
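To make the recurrence concrete, a minimal NumPy sketch of a single recurrent step is shown below; the dimensions, weights, and toy input are illustrative only.

    import numpy as np

    def rnn_step(x_t, h_prev, W_x, W_h, b):
        # The new hidden state depends on the current input and the previous state,
        # which is how the recurrent connection carries history forward.
        return np.tanh(np.dot(x_t, W_x) + np.dot(h_prev, W_h) + b)

    rng = np.random.default_rng(0)
    input_dim, hidden_dim = 42, 8            # 42 mirrors the bit-array width used later
    W_x = rng.normal(scale=0.1, size=(input_dim, hidden_dim))
    W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
    b = np.zeros(hidden_dim)

    h = np.zeros(hidden_dim)
    sequence = rng.integers(0, 2, size=(5, input_dim)).astype(float)  # 5 time steps
    for x_t in sequence:
        h = rnn_step(x_t, h, W_x, W_h, b)    # the same weights are reused at every step
    print(h.shape)  # (8,)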

2.2.4 LSTM - Long Short-Term Memory

An extension of the RNN is the Long Short-Term Memory (LSTM) network, which has been reported to provide great results over a large number of different domains related to sequential data [18]. This network solves a shortcoming of basic recurrent neural networks that makes them inherently hard to train, known as the vanishing gradient problem [19]. LSTMs are also better suited to remembering input values that lie further back in time compared to basic recurrent neural networks.

The Tanh activation function is used within an LSTM cell as an activation function. The architecture of the LSTM can be seen in figure 2.3. It can be described in several steps that start with an input value X_t and the previous hidden state h_(t-1). These two values are combined, and a decision about the state of the forget valve is made when the combined value passes through the sigmoid activation function. The vectors that have been created go through the tanh layer operation, which passes some value on to the memory pipeline, as well as through the sigmoid operation that decides whether the value should be passed on to the pipeline or not [20]. When the memory goes through the top pipeline, the state of the forget valve matters, and it decides what happens to the memory. The last step is that the combined values from earlier decide what part of the pipeline will be the output [20].

Figure 2.3: Long Short-Term Memory architecture.
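For reference, the gate computations described above can be sketched in a few lines of NumPy. This follows the common LSTM formulation with stacked forget, input, candidate, and output weights; the dimensions are illustrative and do not correspond to any model used in this project.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step(x_t, h_prev, c_prev, W, U, b):
        """One LSTM step: W, U, b hold the stacked weights for the
        forget, input, candidate and output parts of the cell."""
        z = np.dot(x_t, W) + np.dot(h_prev, U) + b
        f, i, g, o = np.split(z, 4)
        f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # gates
        g = np.tanh(g)                                 # candidate memory
        c_t = f * c_prev + i * g                       # forget old memory, add new
        h_t = o * np.tanh(c_t)                         # output gate selects what to expose
        return h_t, c_t

    rng = np.random.default_rng(0)
    input_dim, hidden = 42, 16
    W = rng.normal(scale=0.1, size=(input_dim, 4 * hidden))
    U = rng.normal(scale=0.1, size=(hidden, 4 * hidden))
    b = np.zeros(4 * hidden)

    h = c = np.zeros(hidden)
    for x_t in rng.integers(0, 2, size=(5, input_dim)).astype(float):
        h, c = lstm_step(x_t, h, c, W, U, b)
    print(h.shape, c.shape)  # (16,) (16,)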

2.2.5 Output representation

Classification in machine learning is when a model is designed to predict a discrete class from a set, rather than a continuous value. Binary classification is a way of representing an output which could be either true or false; this involves encoding the output of a model as the probability of the output being true. One-hot encoding is a way of representing the output for classification problems. With one-hot encoding, the output is represented as a one dimensional vector where every index in the vector corresponds to a certain class. The value of the vector at a certain index represents the probability of that class being predicted. For the labels in the training data, the output examples are encoded as a one-hot vector where the correct class has a probability of 1.0. It is a popular method to use when processing data that contains categorical variables [21].

2.2.6 Training

A supervised learning based machine learning model learns from its training data. Usually, the data is split up into a training set, a validation set, and a test set [22]. The training set contains the training examples from which the model adjusts its weights in order to minimize the loss function. The data in the training and validation sets is used to analyze the performance of the model. When a model is trained, the training data is fed to the model in batches. A batch is a subset of training examples from a dataset. The model computes an output for each training example in the batch and its loss. With optimization techniques such as gradient descent, the losses computed earlier are averaged and then used by the backpropagation algorithm to update the weights of the model.

The backpropagation algorithm computes the gradient for each weight in the model with respect to the average of the loss function [23]. For recurrent neural networks, a different algorithm called backpropagation through time is used. Subsequently, the model weights are updated so as to minimize the loss function for that batch. When all the batches have been trained on, an epoch has been completed, which is one iteration of training over all the data. This process is repeated for a predefined number of epochs or until a certain performance threshold has been exceeded.
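As a concrete illustration of this batched training procedure, a minimal TensorFlow sketch is shown below; the model shape, optimizer, and randomly generated stand-in data are assumptions for the example only.

    import numpy as np
    import tensorflow as tf

    # Toy stand-ins for the preprocessed data: sequences of 42-bit vectors and
    # one-hot labels over 16 hypothetical offset classes.
    features = np.random.randint(0, 2, size=(512, 8, 42)).astype("float32")
    labels = tf.one_hot(np.random.randint(0, 16, size=512), depth=16)

    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(32, input_shape=(8, 42)),
        tf.keras.layers.Dense(16, activation="softmax"),
    ])
    loss_fn = tf.keras.losses.CategoricalCrossentropy()
    optimizer = tf.keras.optimizers.Adam()

    dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(64)

    EPOCHS = 2
    for epoch in range(EPOCHS):                       # one epoch = one pass over all batches
        for batch_x, batch_y in dataset:
            with tf.GradientTape() as tape:
                predictions = model(batch_x, training=True)
                loss = loss_fn(batch_y, predictions)  # averaged over the batch
            # Backpropagation: gradients of the averaged loss w.r.t. every weight.
            grads = tape.gradient(loss, model.trainable_variables)
            optimizer.apply_gradients(zip(grads, model.trainable_variables))
        print(f"epoch {epoch + 1}: last batch loss {float(loss):.4f}")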

2.2.7 ML Evaluation

To be able to determine the performance of a model, the model is usually evaluated with a validation set after training. This is done by measuring the performance of the trained model on an independent dataset. This provides important information about the model, such as its accuracy and efficiency, whether it is under-fitting or over-fitting, and also the possibility to see how different sizes of training sets affect the performance [24]. The insights gained from this evaluation can be used to adjust the hyperparameters of the model. Hyperparameters are adjustable values that are not trained, such as layer depth and neurons per layer. When a model is deemed performant, a final evaluation is made on the test set, which gives an indication of real world performance.

2.2.8 Pruning and Quantization

Pruning is a method of model compression used to optimize an ML model even further; it is performed by removing neurons or weights that are deemed redundant. This reduces the size of the trained ML model, which uses less memory as well as reduces the runtime. The model is optimized in such a way that the loss in accuracy and performance is minimized [25].

An ML model can be compressed even further after pruning by using quantization and weight sharing, which limits the number of effective weights that have to be stored. This is achieved by introducing multiple shared connections to the same weight and then fine-tuning them [25]. The relevance of using pruning and quantization is, as mentioned in section 1, that the size of the ML model matters for its feasibility on hardware, which makes them interesting concepts to use.
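For illustration, a sketch of how pruning and post-training quantization are commonly applied to a trained Keras model is shown below, using the TensorFlow Model Optimization toolkit and the TensorFlow Lite converter. The toy dense model and random data are placeholders, and the exact procedure used in this project may differ.

    import numpy as np
    import tensorflow as tf
    import tensorflow_model_optimization as tfmot

    # Toy dense model and data standing in for the trained prefetcher model.
    x = np.random.rand(256, 42).astype("float32")
    y = tf.one_hot(np.random.randint(0, 16, size=256), depth=16)
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(42,)),
        tf.keras.layers.Dense(16, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    model.fit(x, y, epochs=1, verbose=0)

    # Pruning: progressively zero low-magnitude weights, then fine-tune briefly.
    pruned = tfmot.sparsity.keras.prune_low_magnitude(model)
    pruned.compile(optimizer="adam", loss="categorical_crossentropy")
    pruned.fit(x, y, epochs=1, verbose=0,
               callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])
    final = tfmot.sparsity.keras.strip_pruning(pruned)

    # Post-training quantization: store the weights at reduced precision.
    converter = tf.lite.TFLiteConverter.from_keras_model(final)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    tflite_bytes = converter.convert()
    print(len(tflite_bytes), "bytes after pruning and quantization")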

2.3 Machine Learning for Data Prefetching

A sub-category of the hardware based data prefetchers are ML-based data prefetchers, where machine learning is utilized to predict future memory references. A prominent example of this is the transformer-based prefetcher TransforMAP [26], which was presented at the ML-Based Data Prefetching Competition in 2021 [5]. TransforMAP displays a 20.55% improvement over state-of-the-art Best-Offset prefetchers (non ML) and is based on the transformer model [26]. However, the transformer model, like most ML models, is both demanding and costly in reality. In fact, the ML-Based Data Prefetching Competition aims to provide the chance to build and test any ML-based approach without hardware constraints, thus encouraging new ML-based solutions. Therefore, TransforMAP exists today only as a theory-crafted and simulator-tested prefetcher. As such, it is not known to what degree TransforMAP is realizable in hardware. What complexity of ML is suitable to be implemented on a chip? Or, for that matter, do certain ML models translate better to hardware?

2.4 FPGA - Field Programmable Gate Arrays

FPGA stands for Field-Programmable Gate Array: an integrated circuit featuring a matrix of reconfigurable logic blocks connected via programmable interconnects. An FPGA therefore allows its function to be altered after the manufacturing stage, as opposed to an Application Specific Integrated Circuit (ASIC) [27]. As such, FPGAs are commonplace when it is beneficial to change the characteristics of the board after production, e.g. when prototyping. The ability to change the function of the FPGA allows companies to save money, as the same FPGA can be reused as new designs are developed. ASICs, on the other hand, are designed and produced for a specific purpose. They are expensive to produce in smaller quantities because of the higher non-recurring engineering (NRE) cost (compared to an FPGA), and are therefore only viable in larger quantities [28]. As such, an advantage of the FPGA is the flexibility of programming the IC structure (integrated circuit structure) and therefore the relatively low cost for small quantities, compared to the changes in manufacturing needed to produce an ASIC.

2.5 High-Level Synthesis

High-level synthesis (HLS) is an automated design process where a Register Transfer Level (RTL) design is generated from a high level behavioral description. An HLS tool, for example Vitis HLS, will generate an RTL design matching the functionality of code written in a high level language. As such, the design process can be shortened compared to a design process based on the use of a Hardware Description Language (HDL).

2.5.1 HLS Design Flow

The HLS design flow (presuming the use of Vitis HLS) consists of five steps (see figure 2.4), the first of which is the development of the intended functionality. The intended functionality is modelled in a high level language, such as C or any of its derivatives. Finally, the code can be run alongside a test bench file to ensure correct high level functionality.


When correct high-level functionality is achieved, the second step is C synthesis (or high-level synthesis). In this step of the design flow, the high level code is synthesised into an HDL implementation. The designer is able to choose a suitable HDL, for example Verilog or VHDL. The products of the second step are a 'core' together with timing and resource estimates.

If the core's timing and resource estimates are within the acceptable threshold, the third step can commence: C/RTL co-simulation. By running the RTL alongside the C test bench, correctness of function can be verified for the RTL design.

A design which passes the C/RTL co-simulation can move on to the next step: RTL export, where the RTL is exported to be used in the final step. Finally, the last step in the HLS design flow is RTL synthesis. In the case of a chip, RTL synthesis allows the RTL to be converted into a gate-level design, which is the design that will then be implemented on the chip. In the case of an FPGA, RTL synthesis will provide a netlist that, via 'Place and Route', will be mapped onto the fabric.

Figure 2.4: The HLS process as presented by Vitis HLS and Xilinx [29]


3 Setup

This section provides a description of the different tools used in realizing the concepts as described in Methodology (section 4).

3.1 Traces

It is important to use traces related to prefetching for proper training and evaluation of the ML models. As mentioned in section 1.3, pre-generated datasets from the SPEC 2006, SPEC 2017, and GAP benchmark suites were used. The Standard Performance Evaluation Corporation (SPEC) is a corporation that provides maintained and established benchmarks that can be used to evaluate a system's performance and energy use [30]. The benchmark packages are CPU-intensive, which stresses a system's processor and memory subsystem.

The SPEC CPU 2006 benchmark is a now retired benchmark package with a total of 31 different benchmark tests, including speed measurements, measurements of compute-intensive performance, and other ways to measure computer performance [31]. The SPEC CPU 2017 benchmark package has a total of 43 benchmarks organized into suites used for completion time comparisons and for throughput, or work per unit of time, measurements [32]. The GAP benchmark suite includes specifications of graph kernels, input graphs, and measurement methodologies, as well as an optimized reference implementation representing state-of-the-art performance [6].

3.2 Model Design - Languages and Libraries

For the high-level design of the ML model, Python [33] was picked as the language for development. Python was chosen due to its great library support for data science and ML, which was needed in order to develop ML models efficiently. For designing the model structure, the libraries TensorFlow [34] and PyTorch [35] were considered. TensorFlow was selected as the ML library to use due to its ready-to-use processes for training and testing and the minimal amount of boilerplate code needed. TensorFlow provides tools to assemble deep neural networks by combining common layers and operations. TensorFlow also provides support for building efficient data input pipelines.


3.3 ML Development Tool

To be able to perform research and machine learning experiments in an organized and efficient way, a tool for logging results was needed. Weights & Biases is a service mainly used for that purpose [36]. The service can log all relevant data from a machine learning run (a training and testing round). This includes the hyperparameter configuration, performance metrics, command-line output, and system metrics for CPU and GPU. This is crucial in order to log the progress of the project in an organized way, and to debug and analyse previous machine learning runs.

Weights & Biases also offers tools for doing parameter searches, so-called sweeps [37]. This makes it possible to pick a set of parameters to vary each run and then perform multiple runs one after another. This efficiently explores a large search space of training parameters. Weights & Biases has different schemes for varying parameters, varying them randomly being one of them.

Finally, Weights & Biases offers a ready-to-use distributed system for performing machine learning runs. When a parameter sweep is started, multiple computers can connect to that sweep with a 'sweep id'. The connected computers listen to requests from the Weights & Biases server and perform training and testing locally when they receive a new parameter configuration.
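As an illustration, a random-search sweep could be configured roughly as below; the parameter names and values are placeholders, and train() stands in for the project's actual training routine.

    import wandb

    sweep_config = {
        "method": "random",             # randomly sample hyperparameter combinations
        "metric": {"name": "accuracy", "goal": "maximize"},
        "parameters": {
            "lstm_units":    {"values": [16, 32, 64]},
            "dense_layers":  {"values": [1, 2, 3]},
            "dropout":       {"values": [0.0, 0.2, 0.4]},
            "learning_rate": {"values": [1e-2, 1e-3, 1e-4]},
        },
    }

    def train():
        with wandb.init() as run:
            config = run.config
            # ... build, train, and evaluate a model using `config` here ...
            wandb.log({"accuracy": 0.0})  # placeholder metric

    sweep_id = wandb.sweep(sweep_config, project="ml-prefetching")
    # Any machine that knows the sweep id can connect and run configurations.
    wandb.agent(sweep_id, function=train, count=10)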

3.4 FPGA Hardware

With the goal presented as 'implementation of an optimal ML model for data prefetching in a hardware setting' (see section 1.2), there was a need for high volume testing; many different ML models of varying characteristics had to be tested and evaluated in a hardware setting. Therefore, as a result of its flexibility, an FPGA came to be utilized. More specifically, the PYNQ Z2 was utilized to satisfy the high volume testing demand. The PYNQ Z2 is an FPGA board developed to take advantage of PYNQ, an open source project from Xilinx using Python based programming for easier implementation on Xilinx platforms. The PYNQ Z2 uses a ZYNQ based architecture and is based on the ZYNQ 7000-series SoC. As seen in table 3.1, the PYNQ Z2 sports a total of 53,200 6-input Lookup Tables (LUT) and 106,400 Flip-Flops (FF). Moreover, the PYNQ Z2 has 220 Digital Signal Processing slices (DSP) available, which are utilized when implementing floating point operations. For memory, the PYNQ Z2 has a total of 630 KB of Block RAM.

3.5 HLS Development

A typical ML model is of great size. Furthermore, the inner workings of such a model are usually treated like a black box by developers. Thus, it is not feasible within the time frame of this project to construct an ML model bottom up by means of an HDL. Instead, it is beneficial to mimic the functionality of the ML model in a high level language, i.e. to use high-level synthesis (see section 2.5). Vitis HLS is a product supplied by Xilinx [29] for compiling hardware kernels for the Vitis environment tools via high-level synthesis. That is, Vitis HLS is a tool employed in the development of IP blocks (reusable units of logic or functionality in hardware) for programmable logic such as FPGAs (section 2.4).

Table 3.1: PYNQ Z2 resources

    Resource         Total Amount
    Flip Flops       106,400
    Lookup Tables    53,200
    DSP              220
    BRAM             630 KB

3.5.1 Language Limitations

Vitis HLS provides developers with a higher level of abstraction: instead of HDL, C-style algorithm-focused code is written. Developers do not have to spend time converting their algorithms into a hardware description; rather, the intended functionality is scripted in C-style code and Vitis performs high-level synthesis. However, there are limitations to what can be interpreted by Vitis HLS [38]. Supported languages are C and C++, with C++ being the recommendation. The support for C is small compared to C++ in Vitis HLS; the language capabilities of C++ trump those of C. A further limitation is the available library support. For the synthesis tools to understand the high level code, no unknown libraries can be employed. Thus, instead of third-party libraries, Xilinx and Vitis HLS promote the usage of their internal libraries [39].

3.5.2 Test Benches

When designing in Vitis HLS there are a few tools at one's disposal during the HLS process. To streamline the verification of functional correctness, Vitis HLS utilizes two types of simulation: C simulation and C/RTL co-simulation [40]. Both types of simulation are performed by comparison with a test bench written in C/C++. A test bench for a Vitis HLS project will feature a main() top-level function, which calls the function to be synthesized. The test bench is able to include other functionality, but the Xilinx documentation proposes the following properties of a 'well written' test bench:

• The test bench should verify the correctness of the function to be synthesised.

• The test bench should be self-checking

• main() should return ’0’ when the correct result is achieved

• An incorrect result should produce a non-zero return value of main()


3.5.3 Synthesis Tools

Vitis HLS provides tools for optimizing the workflow of developers [29]. In order to control the result of C synthesis, a target clock period is provided by the developer. Vitis HLS will try to reach this target and afterwards report success, or alternatively failure, in reaching the specified target. A timing estimate is provided by Vitis HLS. While not a performance guarantee in any way, as the actual timing will depend on place and route, it can serve as a tool for comparing design decisions.

The timing estimate mentioned in the previous paragraph is presented after successful C synthesis along with a synthesis report. The synthesis report includes metrics of the resources used: the numbers of Lookup tables, Flip-flops, DSPs, and BRAM used are presented. Actual timing will depend on the 'Place and Route' and is presented after successful RTL synthesis. More details also become available to the developer post RTL synthesis; it is possible to discern the accurate usage of Lookup tables, Flip-flops, DSPs, and BRAM.


4 Methodology

This section aims to provide a description of the techniques employed in reaching the results specified in section 5. Utilizing the theories and tools described in sections 2 and 3, the workflow is explained in detail.

4.1 ML Model Design

The initial idea when designing an ML model suitable for predicting the next load address was that sequences of previous addresses would be important for a model to be able to predict the next load address. This is similar to how natural language processing machine learning problems are formulated [41]. Therefore, the models that were developed received a sequence of previous and current load addresses as input, and the problem was treated as time series forecasting. LSTM layers were chosen for testing because of their ability to take the sequentiality of the input data into account.

Regular deep neural networks were also considered. A framework for training and evaluating different models was established in order to study which hyperparameters and layer types are important for a model to be performant. All of the training examples represented offset inputs and output offsets, where an offset is the step between load address n and load address n+1. As such, the output layer of each model was capped by the number of output units needed to represent the largest offset in the training and evaluation set.

By using offsets, rather than whole addresses, to represent the data, the data represents the change of the fetched load address relative to the previous memory access. Thus, the data can be more informative with regard to pattern recognition in the memory accesses made by the processor.

4.1.1 Baseline

The next line prefetcher was selected as the main baseline metric of performance due to its simple structure but still reasonable performance. Next line is a commonly used and widely known data prefetch strategy. The next line prefetcher works by always prefetching the next load address relative to the most recently fetched load address.
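A minimal sketch of this behaviour, assuming a 64-byte cache block size purely for illustration:

    CACHE_BLOCK_SIZE = 64  # bytes; assumed block size for illustration

    def next_line_prefetch(load_address: int) -> int:
        """Always prefetch the cache block that follows the most recent load."""
        return load_address + CACHE_BLOCK_SIZE

    print(hex(next_line_prefetch(0x7f001040)))  # 0x7f001080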


4.1.2 Data exploration

The available data provided by the MLArchSys competition [5] consisted of multiple trace files in a csv file format. Each trace file contained a series of load addresses originating from some program execution. These program executions were well-known benchmarks [32] for evaluating the performance of processors. A column in the csv files held the load addresses in a hexadecimal format. The load addresses were physical memory addresses.

Given the multitude of trace files provided by the competition, it was deemed unfeasible to train and test on all of the data. In addition, some of the trace files were too large to be considered. As a result, three trace files were chosen as the source for training and three others were chosen for testing the models. The ones chosen for training were analysed and used to develop the machine learning models, while the three chosen for testing were only used for evaluating the performance of the models. All the traces were part of the SPEC 2006 benchmark [31]. For further details about the files used for training and testing, see section 5.1.1.

4.1.3 Data pipeline

In order to process the large amounts of data in the trace files, an input pipeline was built in TensorFlow. This made it possible to process a small amount of data at a time while streaming data from a file continuously. The whole process of reading, preprocessing, and training or testing could be performed continuously in the pipeline. The pipeline also made it possible to train and test on multiple trace files in the same run.

Because the pipeline was written with TensorFlow Core operations, the process of preprocessing, training, and testing could be performed efficiently, a necessity when scaling up to larger amounts of data. The TensorFlow Core operations are functions specified in Python, but with underlying code written in C++ to be efficient during runtime.
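A sketch of what such a streaming pipeline can look like with the tf.data API is shown below; the column layout of the trace files, the sequence length, and the batch size are assumptions for the example, not the project's actual pipeline.

    import tensorflow as tf

    SEQ_LEN = 8      # illustrative input sequence length
    BATCH = 64

    def parse_line(line):
        # Assumed layout: the load address is the third comma-separated column,
        # written in hexadecimal; parsing is done in Python via py_function.
        def to_address(raw):
            return int(raw.numpy().decode().split(",")[2], 16)
        return tf.py_function(to_address, [line], tf.int64)

    def make_dataset(trace_file):
        addresses = tf.data.TextLineDataset(trace_file).map(parse_line)
        # Offsets between consecutive addresses, computed by pairing the stream
        # with itself shifted by one element.
        offsets = tf.data.Dataset.zip((addresses, addresses.skip(1))).map(lambda a, b: b - a)
        # Sliding windows: the first SEQ_LEN offsets form the feature,
        # the following offset is the label.
        windows = offsets.window(SEQ_LEN + 1, shift=1, drop_remainder=True)
        windows = windows.flat_map(lambda w: w.batch(SEQ_LEN + 1))
        pairs = windows.map(lambda w: (w[:SEQ_LEN], w[SEQ_LEN]))
        return pairs.batch(BATCH).prefetch(tf.data.AUTOTUNE)

    # Example use (hypothetical path): dataset = make_dataset("traces/trace.csv")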

4.1.4 Address preprocessing

To feed the data to the ML model, preprocessing was needed. Given the sequential nature of the training data, i.e. that load addresses are loaded sequentially into the cache memory from the main memory, the data was always preprocessed so as to allow sequential models to work with the data. This was achieved by representing each sample's input (feature) as a sequence of load addresses of length i, as shown in figure 4.1. The output (label) then became the next load address in the sequence. Several different ways of preprocessing the data were explored.

The trace files contained data with 48 bit long load addresses. The 6 least significant bits referred to the exact address within a cache block. Since data prefetching is done in whole cache blocks, only 42 bits needed to be predicted. One way of formulating the prediction task was to convert each load address to a binary array, with each value representing a bit, i.e. one or zero. The input to the model then became an i × 42 matrix, where i represented the input sequence length and 42 the number of bits needed to identify a specific cache block. This is the process displayed in figure 4.2.

Figure 4.1: Features and labels after windowing

The output load address was also represented as a binary array in this case. When trying to predict addresses, there were 2^42 possible addresses for a cache block, which made the prediction task unfeasible to treat as a classification problem. The loss function used for this representation was the binary cross entropy function, and since each output in the bit array is treated as an individual output, this loss function was applied to each of them.
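A small sketch of this bit-array preprocessing and the windowing from figure 4.1 is shown below; the sequence length and the toy addresses are illustrative.

    import numpy as np

    CACHE_OFFSET_BITS = 6   # the 6 least significant bits address bytes within a block
    BLOCK_ID_BITS = 42      # bits left to identify a cache block (48 - 6)
    SEQ_LEN = 4             # illustrative sequence length i

    def address_to_bits(address: int) -> np.ndarray:
        """Drop the in-block bits and expand the cache-block id into a 42-bit array."""
        block_id = address >> CACHE_OFFSET_BITS
        return np.array([(block_id >> b) & 1 for b in reversed(range(BLOCK_ID_BITS))],
                        dtype=np.float32)

    def window(addresses, seq_len=SEQ_LEN):
        """Each feature is a seq_len x 42 matrix; the label is the next address."""
        features, labels = [], []
        for i in range(len(addresses) - seq_len):
            features.append(np.stack([address_to_bits(a) for a in addresses[i:i + seq_len]]))
            labels.append(address_to_bits(addresses[i + seq_len]))
        return np.array(features), np.array(labels)

    addrs = [0x7f1a2b3c4000 + 64 * k for k in range(8)]  # toy sequential accesses
    X, y = window(addrs)
    print(X.shape, y.shape)   # (4, 4, 42) (4, 42)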

4.1.5 Offset preprocessing

The addresses could also be represented as offsets. E.g. if the following access pattern of addresses occurs: "1, 2, 3, 2", the sequence could be represented by offsets as: "1, 1, -1"; that is, simply the difference between each pair of consecutive addresses. This would allow the model to be oblivious to the position of the access pattern in memory, since it only considers offsets. Hence, it was presumed that the learning process would be facilitated. Furthermore, the principle of locality, which states that the processor often accesses memory addresses nearby relative to the last accessed address, suggests that many offsets would occur within a smaller range compared to the entire address range [42]. Thus, the prediction problem could be narrowed to only predicting offsets within a certain range, without disregarding too many offsets that were out of range. Out-of-range patterns could be harder to predict for the machine learning model, which is another reason to focus on the ones occurring within an offset range. The offset is calculated as:

o_n = a_n - a_(n-1)


Figure 4.2: The procedure of the address preprocessing.

Another benefit of this representation is that the output could be treated as a classification problem on the offsets themselves and not on the individual address bits in an output array. By one-hot encoding each of the possible offsets, each class could be represented as an index in the one-hot encoded vector, as shown below:

Offset range from r_min to r_max, bit in the one-hot encoding: b

o_n => [b_(r_min), b_(r_min+1), ..., b_(-1), b_0, b_1, ..., b_(r_max-1), b_(r_max)]

Example: 2 => [0, 0, ..., 1, ..., 0, 0] where b_2 = 1

When offset preprocessing was used, an offset range was always set in order to limit the number of possible classes. It was also hypothesised that limiting the training data to be within an offset range would filter out large context jumps that could happen in processor access patterns. This would be beneficial because the models would not be trained to predict stochastic patterns, but instead more common patterns. What this meant was that when models were trained with an output representation of a one-hot encoded load address or a bit array of an offset, the training data was filtered to only contain examples where the offset output was within a certain range. When offset preprocessing was used, the binary cross entropy loss function was applied individually in the case of a bit array as output. If one-hot encoding was used as the output representation, however, categorical cross entropy was used.
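A minimal sketch of the offset computation and the one-hot encoding over a bounded offset range is shown below; the range limits here are illustrative, not the ones used in the project.

    import numpy as np

    R_MIN, R_MAX = -16, 16                     # illustrative offset range
    NUM_CLASSES = R_MAX - R_MIN + 1            # one class per offset in the range

    def to_offsets(addresses):
        # o_n = a_n - a_(n-1)
        return [b - a for a, b in zip(addresses, addresses[1:])]

    def one_hot(offset):
        """Encode an in-range offset as a one-hot vector; out-of-range offsets are filtered out."""
        if offset < R_MIN or offset > R_MAX:
            return None
        vec = np.zeros(NUM_CLASSES, dtype=np.float32)
        vec[offset - R_MIN] = 1.0
        return vec

    offsets = to_offsets([1, 2, 3, 2, 100])
    encoded = [one_hot(o) for o in offsets]    # the offset of 98 falls outside the range -> None
    print(offsets)                                                               # [1, 1, -1, 98]
    print([None if e is None else int(np.argmax(e)) + R_MIN for e in encoded])   # [1, 1, -1, None]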


4.1.6 Training ML models

In order to find an optimal machine learning model, a base model was defined which could be configured in different ways and subsequently trained and tested. Weights & Biases (see section 3.3) was used to systematically iterate through hyperparameters and test how they affected the performance of the model. The logs and graphs of training and validation results for different configurations were useful for drawing conclusions about which hyperparameters positively contributed to the performance of the model. Each model had an input layer with the dimensions i × 42, where i was the sequence length and 42 the minimum number of bits needed to represent a load address. In this case, the load address refers to a physical address in memory. This input was passed to the neural network's input layer and propagated to the output layer. The LSTM layers were always put closest to the input layer and took the sequence of input data, or the previous layer's data, as input. The first dense layer in the network flattened the data from the previous layer to a one dimensional vector of inputs. The activation functions between each layer were either the ReLU or the sigmoid activation. ReLU is considered a standard activation function and is also deemed to be the most used one for the implementation of neural networks that consist of several layers [43]. Due to its simplicity and its performance, the ReLU function seemed to be the most suitable one for the design of this model. Dropout was applied between each layer, except between the last layer and the output layer. In some configurations, batch normalization was used in order to normalize the output of each layer; this was especially beneficial for the outputs from LSTM layers, since the weights had a tendency to become very large when trained for many epochs.
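For illustration, a Keras sketch of a base model with this layer ordering is shown below; the units, dropout rate, sequence length, and number of offset classes are placeholder values, not the configuration selected in this project.

    import tensorflow as tf

    SEQ_LEN, INPUT_BITS, NUM_OFFSET_CLASSES = 8, 42, 33   # illustrative dimensions

    def build_model(lstm_units=32, dense_units=64, dropout=0.2):
        """Base model shape described above: LSTM closest to the input,
        followed by dense layers, with dropout and batch normalization between them."""
        return tf.keras.Sequential([
            tf.keras.Input(shape=(SEQ_LEN, INPUT_BITS)),
            tf.keras.layers.LSTM(lstm_units),
            tf.keras.layers.BatchNormalization(),   # keeps LSTM outputs from growing large
            tf.keras.layers.Dropout(dropout),
            tf.keras.layers.Dense(dense_units, activation="relu"),
            tf.keras.layers.Dropout(dropout),
            tf.keras.layers.Dense(NUM_OFFSET_CLASSES, activation="softmax"),
        ])

    model = build_model()
    model.summary()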

4.1.7 Testing ML Models

Testing and evaluating a model is of great importance to be able to discern if the ML model displays performance aligned with the purpose of this project. The theory behind the evaluation is mentioned in section 2.2.7.

The models were trained on 90% of the load addresses in one trace file at a time. The remaining 10% of the data was used to validate the performance of the model. This split was used to test different model configurations, while a split of 80% training data and 20% validation data was used when training the model with different files. These are both common ways of splitting training and validation data within the machine learning research field. The reason why both splits were used was to increase the validation set when testing the model, to reduce the risk of choosing data with low variance. When testing the different configurations, a set of values was defined for each hyperparameter. These were to be varied, and a random combination of these hyperparameters was sampled for a model to be trained and tested.

Given the large number of combinations that could be sampled and the long run time to train and test each model, only a subset of the possible model configurations were tried. As described in section 3.3, an analysis tool called Weights & Biases was utilized to track the performance and model configurations. The graph in figure 4.3 shows the different model configurations that were tested, with the hyperparameters deemed most important along with the performance of that configuration (please note that the exact values in the graph are of no importance in this section). Each curve represents a training and validation run with the preconditions described in section 4.1.6 and the beginning of this section. In the rightmost column of the graph, titled "Accuracy", the accuracy metric described in section 5.1.2 indicates how well the model performed with that configuration. The other columns display the different hyperparameters that were varied. This tool allowed for logging and filtering runs based on hyperparameters, and served as a basis for selecting a configuration that would be studied further.

Figure 4.3: A parametric search sweep conducted with the Weights & Biases analysis tool. Each curve represents a given configuration and the rightmost axis, called accuracy, represents the performance of that configuration.

When a model configuration was selected for further testing, it was tested in two ways. The first test examined how well the model was able to learn different access patterns in a trace file and then predict them correctly on data from the same trace file that it had not been trained on. For this test, the input and output examples in the test set and training set were randomized. The accuracy metric, together with a comparison to the baseline metrics, gave an indication of which patterns the model could learn and how well it could be trained in general. The second test trained a model with the selected configuration on a larger amount of training data, namely all the training data in the three trace files used in the parametric sweep. The trained model was then tested by predicting load addresses on each of the three test trace files individually, and the metrics described in section 5.1.2 were logged. The purpose of this test was to see how well the chosen model configuration generalizes when trained on more data and tested on completely different traces it had not seen before, giving an approximate indication of how the model might perform in a general context.

4.1.8 Optimization

Pruning was chosen as an optimization because of the size of the model. It was applied to the trained model to minimize its size and make it easier to fit the model onto the hardware.


The FPGA had limited resources, so the ML model had to be as small as possible without losing performance. This is especially important when LSTM layers are used, since they have a high computational cost: each gate within the LSTM has its own weights and biases [44]. For the same reason, quantization was applied to compress the trained model even further.

To summarize, the goal of using a pruned and quantized model is to make it more feasible for a hardware implementation without losing any performance, and possibly even to minimize the model's loss of accuracy.

4.2 Prefetcher Implementation

In order to implement a data prefetcher in hardware, HLS (see section 2.5) was employed with the tool Vitis HLS (see section 3.5). The starting point for HLS is code written in a high level language. Thus, in order to implement and test different ML models in hardware, a conversion from Python to C++ was necessary. Furthermore, in order to have a comparison baseline, two non-ML prefetchers were implemented in C++.

Considering the circumstances of this project, together with the fact that the data prefetchers are to be tested in hardware in their entirety, one important factor shapes the development: every layer of the ML models has to fit in hardware at the same time. Normally this is not the case when implementing an ML model in hardware. However, as the ML model aims to speed up program execution through prefetching, it is not feasible to store parts of the model in memory. Consequently, the layers cannot be of unreasonable sizes, and methods for minimizing the layers have to be taken into consideration.

4.2.1 Layer by Layer Implementation

Conversion of an ML model to C++ was performed by creating the desired functionality of each network layer, one by one. Normally when creating a neural network, libraries and frameworks are used extensively. However, as per the restrictions imposed by Vitis HLS (see section 3.5.1), no external libraries or frameworks can be employed;

The neural network models had to be constructed from the ground up using only native C++ functionality. Every layer in the neural networks was designed as a function with a number of inputs and one output, where the first input was an array consisting of the previous layer's values and the sole output was an array of output values computed by the layer. The other input parameters were layer dependent, meaning they varied with the layer type. The layer-function then performed layer-specific calculations and manipulation of the input. In contrast to general software development, where it is usually a good idea to write code that is as general as possible, the layer-functions were made as exact and precise as possible. The reason for this exactness is that the ML prefetchers will be implemented in hardware for one single task: to prefetch addresses based on their training. To achieve this exactness, the expected sizes of the weights and biases were declared beforehand for each layer-function. As such, when a layer-function is executed it already knows what to expect in terms of weights and biases; the code can be optimized for its purpose instead of trying to be general and multi-purpose.

The weights and biases were deduced by reading the model file (produced by Keras when saving the model) as described in section 4.2.5. The data read from the model file was stored in arrays, which ultimately are mapped to BRAM in the synthesis process.

A requirement for realizing the different ML models in C++ is that the necessary network layers exist. As such, two different layers had to be constructed before the ML model implementation could start. The first layer implemented was a fully connected layer (also known as a dense layer, see section 2.2.2). Dense layers together with LSTM layers were the main components of the models. Therefore, if configurable and reusable versions of these two layers were implemented, any version of the different models could be constructed. To create a specific model, the correct layers only had to be stacked and configured in the order corresponding to that particular model. Configuring a layer in this case meant loading it with the correct weight matrices and setting up the layer-specific properties, such as the number of units in a dense layer. As mentioned above, a dense layer was the first layer to be implemented. The dense layer was configured to receive the following input:

void denseLayer(int in[N], int out[M], float w[N][M], float bias[M])

where in denotes the input from the previous layer, out denotes the output of the current layer, w denotes the weight matrix, and bias denotes the array of biases to be added; N is the size of the input and M is the size of the output.

To satisfy the functionality of the dense layer, it has to perform the output calculation described in section 2.2.2:

output = activation(dot(input, kernel) + bias)

As a result, the dot product was implemented in C++ using an iterative approach:

for (int i = 0; i < size_of_matrix_row; i++) {
    dotProduct += input[i] * kernel[i][column];  // accumulate input times column of the weight matrix
}
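Combining this dot product with the bias addition and an activation yields the complete layer-function. The following is a minimal sketch under the assumption of fixed compile-time dimensions and ReLu as the activation; the example sizes and the use of float for the input and output arrays are illustrative choices, not the exact thesis code:

// A minimal sketch of a complete dense layer-function with fixed dimensions.
constexpr int N = 84;  // example input size (illustrative)
constexpr int M = 5;   // example output size (illustrative)

void denseLayerSketch(const float in[N], float out[M],
                      const float w[N][M], const float bias[M]) {
    for (int column = 0; column < M; column++) {
        float dotProduct = 0.0f;
        // dot(input, kernel): input vector times column 'column' of the weight matrix
        for (int i = 0; i < N; i++) {
            dotProduct += in[i] * w[i][column];
        }
        dotProduct += bias[column];
        // ReLu activation: max(0, x)
        out[column] = (dotProduct > 0.0f) ? dotProduct : 0.0f;
    }
}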

For the activation function in the dense layer equation, three possible functions were implemented: ReLu, Sigmoid, and Tanh. As these functions are usable regardless of layer type, they could be reused in the LSTM implementation. The LSTM layer was set up according to figure 2.3 and with the following top level structure:


void lstm_layer(float cell_state[H], float hidden_state[1][H], float x[1][D],
                float next_cell[H], float next_hidden[H]);

Where x denotes the current input to the layer, with a predetermined dimension D, while H denotes the size of the cell and hidden states.

The LSTM layer was connected according to the previously mentioned figure 2.3.

As seen, the LSTM structure includes matrix multiplications between inputs and weights. The need for matrix multiplication was resolved with an iterative approach, where a single output element, for some indices i, j and k, is accumulated as follows:

matrixOut[i][j] += matrixInA[i][k] * matrixInB[k][j];

As implied by the += in the statement above, the supporting structure consists of nested loops bounded by the maximum values of i, j and k (the maximum matrix dimensions).
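A minimal sketch of that supporting loop structure could look as follows; the dimension constants are placeholder values, not the ones used in the thesis:

// Illustrative sketch: matrixOut (I x J) = matrixInA (I x K) * matrixInB (K x J).
constexpr int I = 1, K = 42, J = 64;  // placeholder dimensions

void matrixMultiply(const float matrixInA[I][K],
                    const float matrixInB[K][J],
                    float matrixOut[I][J]) {
    for (int i = 0; i < I; i++) {
        for (int j = 0; j < J; j++) {
            matrixOut[i][j] = 0.0f;  // clear the accumulator before summing over k
            for (int k = 0; k < K; k++) {
                matrixOut[i][j] += matrixInA[i][k] * matrixInB[k][j];
            }
        }
    }
}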

Finally, the LSTM layer was completed by connecting the matrix multiplication into the structure.
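For reference, the computation that such a structure performs in a single time step follows the standard LSTM equations. The sketch below is an illustration of those equations only, with placeholder names and dimensions; it is not the exact thesis code, whose wiring follows figure 2.3 and which reuses the shared activation and matrix multiplication functions:

#include <cmath>

// Illustrative single-time-step LSTM update (standard LSTM equations).
// Wf/Wi/Wo/Wc are input weights, Uf/Ui/Uo/Uc recurrent weights, bf/bi/bo/bc biases.
constexpr int D = 42;  // example input dimension (illustrative)
constexpr int H = 64;  // example number of hidden units (illustrative)

static float sigmoidf(float x) { return 1.0f / (1.0f + std::exp(-x)); }

void lstm_step(const float x[D], const float hidden_state[H], const float cell_state[H],
               const float Wf[D][H], const float Uf[H][H], const float bf[H],
               const float Wi[D][H], const float Ui[H][H], const float bi[H],
               const float Wo[D][H], const float Uo[H][H], const float bo[H],
               const float Wc[D][H], const float Uc[H][H], const float bc[H],
               float next_cell[H], float next_hidden[H]) {
    for (int j = 0; j < H; j++) {
        // start from the bias of each gate
        float f = bf[j], i = bi[j], o = bo[j], g = bc[j];
        // add the input contributions (x * W)
        for (int k = 0; k < D; k++) {
            f += x[k] * Wf[k][j];
            i += x[k] * Wi[k][j];
            o += x[k] * Wo[k][j];
            g += x[k] * Wc[k][j];
        }
        // add the recurrent contributions (previous hidden state * U)
        for (int k = 0; k < H; k++) {
            f += hidden_state[k] * Uf[k][j];
            i += hidden_state[k] * Ui[k][j];
            o += hidden_state[k] * Uo[k][j];
            g += hidden_state[k] * Uc[k][j];
        }
        f = sigmoidf(f);            // forget gate
        i = sigmoidf(i);            // input gate
        o = sigmoidf(o);            // output gate
        g = std::tanh(g);           // candidate cell value
        next_cell[j]   = f * cell_state[j] + i * g;    // new cell state
        next_hidden[j] = o * std::tanh(next_cell[j]);  // new hidden state
    }
}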

4.2.2 Single Layer Network - The First ML Model

Utilizing the implemented dense layer, a simple neural network was constructed, from which conclusions were to be drawn. By monitoring the resource usage of this simple model in the following steps (the HLS process), it is possible to estimate how a more complex network would fare; an understanding of what is realistic in hardware and on the Pynq Z2 board can be gained. The simple model consisted of a single dense layer with a total of 84 inputs and 5 outputs. Multiplying the input size by the output size, it can be deduced that the weight matrix of this dense layer had a total of 420 weights.

4.2.3 Adaptable ML Model

To further explore the boundaries of what is feasible in hardware, an adaptable ML model was constructed. The adaptable ML model consisted of a configurable number of dense layers and was set up by stacking the desired layers (see section 4.2.1). The size of the dense layers was fixed to 50 inputs and 50 outputs to keep the results consistent. Calculating the resulting size of the weight matrices, a total of 2500 weights per layer can be deduced. The purpose of this adaptable model is the following: by examining the effect of increasing the total number of layers, a mathematical relation can be found. For example, if the resource-usage difference between one and two layers equals the resource-usage difference between two and three layers, the relation is linear, and adding an extra layer will increase resource usage by a constant, predictable amount. A sketch of how such a stacked model could be expressed is shown below.
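As a rough illustration (not the exact thesis code), stacking the fixed size 50-by-50 dense layers could look as follows, where increasing NUM_LAYERS adds one more layer and therefore 2500 more weights:

// Illustrative sketch of the adaptable model: NUM_LAYERS dense layers,
// each with 50 inputs and 50 outputs (2500 weights per layer), chained in series.
constexpr int SIZE = 50;
constexpr int NUM_LAYERS = 3;  // varied between synthesis runs (illustrative value)

void adaptableModel(const float in[SIZE], float out[SIZE],
                    const float w[NUM_LAYERS][SIZE][SIZE],
                    const float bias[NUM_LAYERS][SIZE]) {
    float cur[SIZE], next[SIZE];
    for (int i = 0; i < SIZE; i++) cur[i] = in[i];

    for (int l = 0; l < NUM_LAYERS; l++) {
        // one 50x50 dense layer with ReLu, reading the previous layer's output
        for (int j = 0; j < SIZE; j++) {
            float acc = bias[l][j];
            for (int i = 0; i < SIZE; i++) {
                acc += cur[i] * w[l][i][j];
            }
            next[j] = (acc > 0.0f) ? acc : 0.0f;
        }
        for (int j = 0; j < SIZE; j++) cur[j] = next[j];
    }
    for (int j = 0; j < SIZE; j++) out[j] = cur[j];
}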


4.2.4 Multi Layer Network - The Final Model

After having gathered performance estimates for the single layer model as well as the adaptable network, it was possible to deduce an ML-based prefetcher implementation with optimal characteristics. The final ML model consisted of four dense layers with varying input sizes, as described in table 4.1. Worth noting is that the first layer is the so-called input layer and the last is the output layer.

Table 4.1: Input and output dimensions of the final model

Layer            Input Size    Output Size
Dense Layer 0    215           256
Dense Layer 1    256           256
Dense Layer 2    256           256
Dense Layer 3    256           65

To implement this final model in C++, the weights belonging to each layer were read and imported into the corresponding layer (see section 4.2.5). The four layers were then stacked in order and connected to finalize the C++ implementation of the final ML model.
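A sketch of how the four layers from table 4.1 might be chained is shown below. A small template is used here for brevity, whereas the thesis code used one specialized function per layer; the weight and bias arrays are assumed to already be filled from the converted model file, and ReLu is used as a placeholder activation:

#include <algorithm>

// Illustrative sketch (not the exact thesis code) of the final model's top level.
template <int N, int M>
void dense(const float in[N], float out[M], const float w[N][M], const float b[M]) {
    for (int j = 0; j < M; j++) {
        float acc = b[j];
        for (int i = 0; i < N; i++) acc += in[i] * w[i][j];
        out[j] = std::max(acc, 0.0f);  // placeholder activation (ReLu)
    }
}

void finalModel(const float in[215], float out[65],
                const float w0[215][256], const float b0[256],
                const float w1[256][256], const float b1[256],
                const float w2[256][256], const float b2[256],
                const float w3[256][65],  const float b3[65]) {
    float a0[256], a1[256], a2[256];
    dense<215, 256>(in, a0, w0, b0);   // Dense Layer 0: 215 -> 256
    dense<256, 256>(a0, a1, w1, b1);   // Dense Layer 1: 256 -> 256
    dense<256, 256>(a1, a2, w2, b2);   // Dense Layer 2: 256 -> 256
    dense<256, 65>(a2, out, w3, b3);   // Dense Layer 3: 256 -> 65
}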

4.2.5 Reading the Model File

The model file produced by Keras was an .h5 file (also known as HDF5). This file format could not be read directly into C++, so to overcome this issue, the .h5 model file was converted into a .txt file. A Python script was created which utilized the HDF5 for Python library [45] to convert and format the data in the .h5 file into a .txt file. After this, the data could be read directly into C++.
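The exact text format produced by the conversion script is not described here, so the following is only a sketch of how a converted weight file might be read on the C++ side, assuming one whitespace-separated value per weight and dimensions taken from Dense Layer 0 in table 4.1 as an example:

#include <fstream>

// Illustrative sketch of loading a converted weight file into C++ arrays.
constexpr int IN = 215, OUT = 256;  // example dimensions (Dense Layer 0)

bool loadDenseWeights(const char* path, float w[IN][OUT], float bias[OUT]) {
    std::ifstream file(path);
    if (!file) return false;
    for (int i = 0; i < IN; i++)
        for (int j = 0; j < OUT; j++)
            file >> w[i][j];          // weight matrix, row by row
    for (int j = 0; j < OUT; j++)
        file >> bias[j];              // bias vector after the weights
    return static_cast<bool>(file);
}

Note that file I/O of this kind is not synthesizable; in an HLS flow such loading would typically be done in the C++ testbench or be used to generate pre-initialized arrays.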

4.2.6 Stride Prefetching in C++

In order to create a baseline to which the ML models could be compared in the HLS process (see section 2.5), a stride prefetcher was developed. Stride prefetching, or specifically Next Line prefetching, is trivial to implement in C++, but a working stride prefetcher in C++ is a necessity if HLS is to be invoked. That is, to be able to compare the HLS results of the ML models to those of a Next Line prefetcher, the Next Line prefetcher first has to be implemented in C++. Next Line prefetching was implemented by reading the current address and then fetching the next address.
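As a minimal sketch of this idea (the block size and the alignment handling are illustrative assumptions, not necessarily how the thesis code treats addresses):

#include <cstdint>

// Next Line prefetching sketch: given the address of the current access,
// predict the address of the next cache block. 64-byte blocks are assumed here.
constexpr uint64_t BLOCK_SIZE = 64;

uint64_t nextLinePrefetch(uint64_t currentAddress) {
    // Align to the current block and step one block forward.
    return (currentAddress & ~(BLOCK_SIZE - 1)) + BLOCK_SIZE;
}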

4.2.7 Markov Table Prefetching Implementation

A Markov prefetcher was implemented in C++ for the same reason as the Next Line prefetcher, but with the additional benefit of increased performance and complexity. Markov prefetching relies on storing previous cache misses together with the addresses that preceded them. This address bookkeeping was implemented in C++ using two arrays: one for addresses that have preceded a cache miss, and one for the corresponding cache-miss addresses.
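A minimal sketch of how such a two-array table might behave is shown below; the table size, the linear lookup, and the round-robin replacement are assumptions for illustration, not the exact thesis design:

#include <cstdint>

// Two-array Markov table sketch: one array holds addresses that preceded a
// cache miss, the other holds the miss addresses that followed them.
constexpr int TABLE_SIZE = 256;
static uint64_t precedingAddress[TABLE_SIZE];
static uint64_t followingMiss[TABLE_SIZE];
static int nextSlot = 0;

// Record that 'missAddress' was a cache miss preceded by 'previousAddress'.
void recordMiss(uint64_t previousAddress, uint64_t missAddress) {
    precedingAddress[nextSlot] = previousAddress;
    followingMiss[nextSlot] = missAddress;
    nextSlot = (nextSlot + 1) % TABLE_SIZE;   // simple round-robin replacement
}

// If 'currentAddress' has previously preceded a miss, return that miss address
// as the prefetch candidate; otherwise return 0 to signal "no prediction".
uint64_t markovPrefetch(uint64_t currentAddress) {
    for (int i = 0; i < TABLE_SIZE; i++) {
        if (precedingAddress[i] == currentAddress) {
            return followingMiss[i];
        }
    }
    return 0;
}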
