
The social and ethical aspects that had to be taken into account were those surrounding the machine learning algorithms and data prefetching. Machine learning algorithms are often developed as black boxes: they are programmed to learn, but developers cannot predict what the result of the learning will be. How a machine learning algorithm learns is based on assumptions made by the programmer, so developers need to consider how the algorithm may handle different social and ethical dilemmas. An example is the development of self-driving cars and how the algorithm controlling the vehicle should decide in situations of danger. How should the car choose in a situation where either the car occupants, pedestrians or drivers of other vehicles could die? It is up to the developer of the machine learning algorithm in the self-driving car to decide what is "right" in such difficult situations.

A positive aspect of machine learning is that its capability to analyze huge amounts of data can be used to detect online abuse against female politicians and journalists [47]. By utilising computational statistics and machine learning it is possible to uncover patterns of online abuse. Such methods contribute towards sustainable development according to the United Nations sustainable development goals.

A positive aspect of data prefetching is that it can be implemented to improve energy efficiency in hardware systems [48]. Improving the energy efficiency of the system and optimizing its electricity consumption helps to conserve energy in the long run, which has a positive impact on climate change, a real threat today.

The focus of this project has not been on reducing power consumption, due to the time limit, but implementing a prefetcher reduces execution time since more memory accesses are served from the cache. This in turn reduces static energy [49], although further work on energy reduction should be considered in the future.

6 Conclusion

The purpose of this project, as stated in section 1.2, was to research ML prefetchers with differing attributes in order to identify characteristics suited for implementation in hardware. The aim was for this research to culminate in an ML prefetcher that is optimal with respect to both performance and hardware feasibility.

Testing revealed that dense layers are better suited for implementation in hardware than LSTM layers if chip area is the main concern; the LSTM layers required 599% more lookup tables and 412% more flip-flops than the dense layers. Further supporting the suitability of dense layers for ML-based prefetching, it was found that LSTM layers generally did not improve the models' ability to learn to predict load addresses compared to dense neural networks. The reason for this appeared to be that LSTM layers are in general much harder to train.
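To make the layer substitution concrete, the sketch below outlines a dense-only predictor next to an LSTM-based one in Keras. The layer widths, history length and output width are illustrative assumptions, not the configurations used in this project; the point is only that the LSTM layer is the component whose hardware cost dominates when synthesized.

```python
# Illustrative sketch only: layer sizes, sequence length and output width are
# assumed for demonstration and do not reproduce the project's actual models.
import tensorflow as tf

SEQ_LEN = 16        # assumed history length of previous load-address offsets
OUT_CLASSES = 4097  # assumed one-hot output width (offsets in [-2048, 2048])

def dense_model():
    """Dense-only variant: markedly cheaper in LUTs and flip-flops."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(SEQ_LEN,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(OUT_CLASSES, activation="softmax"),
    ])

def lstm_model():
    """LSTM variant: the recurrent layer dominates the hardware cost."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(SEQ_LEN, 1)),
        tf.keras.layers.LSTM(128),
        tf.keras.layers.Dense(OUT_CLASSES, activation="softmax"),
    ])
```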

In fact, they hampered performance in some cases. Dense neural networks were shown to be able to learn access patterns for the tested traces from the SPEC 2006 benchmark. In the tests performed and presented in chapter 5, a higher accuracy than the Next Line prefetcher was recorded when training the final model on a training set and testing it on a test set within the same trace file. This indicates that the analysed model could predict prefetching patterns that a Next Line prefetcher would be unable to prefetch. When trained on data from three trace files and tested on three other traces individually, however, the model was merely in line with the accuracy of the Next Line prefetcher. This indicates that the final model failed to generalize, beyond the baseline, to patterns in the trace files used for testing.
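As a point of reference for what "in line with the Next Line prefetcher" means, the baseline accuracy on a trace can be estimated as the fraction of loads that land exactly one cache line after the previous load. The sketch below assumes a trace given as a plain list of load addresses and a 64-byte cache line; the project's actual trace format and evaluation code may differ.

```python
# Hedged sketch: assumes a list of load addresses and a 64-byte cache line.
CACHE_LINE = 64

def next_line_accuracy(addresses):
    """Fraction of loads whose address is exactly one cache line after the
    previous load, i.e. what a Next Line prefetcher would get right."""
    hits = sum(
        1 for prev, cur in zip(addresses, addresses[1:])
        if cur - prev == CACHE_LINE
    )
    return hits / max(len(addresses) - 1, 1)
```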

The above findings prompted research into the constraining variables of dense layers in a hardware setting, which found no loss of feasibility from adding as many as 7 dense layers to a network. Thus, a multi-layer model is feasible in hardware as long as the complexity of its layers is monitored; the layers cannot be larger than what the hardware allows for in terms of area. On the other end of the spectrum, it was concluded that a model pursuing area efficiency will not produce results aligned with the goal of this project, as the accuracy goals are not reached. The performance of such a model will be in line with the most basic Next Line prefetcher.

Moreover, the reported area usage still exceeded that of a better-performing non-ML prefetcher (the Markov prefetcher).


Regarding input and output, the best final model configuration was to represent data with address offsets. This made the model independent of where data was placed in memory between different program executions, and a model with this data representation became more general in solving the data prefetching problem. For the output representation, one-hot encoding was considered the best option, allowing load addresses with offsets within [-2048, 2048] to be prefetched. This gave the model the ability to predict addresses locally around the previously fetched address, without an excessively large output space to train on.
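A minimal sketch of this representation is shown below: consecutive load addresses are converted to offsets, and each offset in [-2048, 2048] is one-hot encoded. How offsets outside that range were handled (dropped, clipped, or otherwise) is an assumption here; the sketch simply encodes them as an all-zero vector.

```python
# Minimal sketch of the offset + one-hot representation described above.
# Handling of offsets outside [-2048, 2048] is an assumption, not the
# project's documented behaviour.
import numpy as np

MAX_OFFSET = 2048
NUM_CLASSES = 2 * MAX_OFFSET + 1  # offsets -2048 .. +2048

def to_offsets(addresses):
    """Represent each load as the offset from the previous load address,
    making the model independent of absolute memory placement."""
    return np.diff(np.asarray(addresses, dtype=np.int64))

def one_hot_offset(offset):
    """One-hot encode a single offset within [-2048, 2048]."""
    vec = np.zeros(NUM_CLASSES, dtype=np.float32)
    if -MAX_OFFSET <= offset <= MAX_OFFSET:
        vec[offset + MAX_OFFSET] = 1.0
    return vec
```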

It should be noted that the final model did not satisfy the area constraints imposed by the PYNQ-Z2. However, insights can be gathered from its implementation, and in line with the conclusions stated above, a more general conclusion with respect to the goal of this project can be formed. There are numerous possibilities for implementing ML-based data prefetching, but some model characteristics have been found to promote greater feasibility as well as improved performance in a hardware setting.

To summarize, a suitable model configuration is one with dense layers, offset-based inputs and outputs, and a limited offset range. These characteristics could allow for implementation in hardware with satisfactory performance.

For future work, implementing pruning and quantization to reduce model size could allow a more complex model to fit in hardware alongside a CPU, possibly using the model characteristics found suitable in this project. Such a model could be trained on a larger and more varied dataset of traces. Moreover, other machine learning approaches, such as transformers and reinforcement learning, could be explored to test their suitability compared to the model proposed in this project.
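One possible way to realise the pruning and quantization idea is sketched below using the TensorFlow Model Optimization toolkit and post-training quantization via TFLite. The project itself did not use these tools, and the sparsity target and training schedule are assumptions, so this should be read as an illustration of the future-work direction rather than a tested recipe.

```python
# Hedged sketch of pruning + quantization; library choice, sparsity target
# and schedule are assumptions made for illustration.
import tensorflow as tf
import tensorflow_model_optimization as tfmot

def prune_and_quantize(model, train_ds, epochs=2):
    # Magnitude pruning: gradually zero out 80% of the weights (assumed target).
    pruned = tfmot.sparsity.keras.prune_low_magnitude(
        model,
        pruning_schedule=tfmot.sparsity.keras.PolynomialDecay(
            initial_sparsity=0.0, final_sparsity=0.8,
            begin_step=0, end_step=1000),
    )
    pruned.compile(optimizer="adam",
                   loss="categorical_crossentropy", metrics=["accuracy"])
    pruned.fit(train_ds, epochs=epochs,
               callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])
    stripped = tfmot.sparsity.keras.strip_pruning(pruned)

    # Post-training quantization of the pruned model via the TFLite converter.
    converter = tf.lite.TFLiteConverter.from_keras_model(stripped)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    return converter.convert()
```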

Bibliography

[1] Carlos Carvalho. “The gap between processor and memory speeds”. In: Proc. of IEEE International Conference on Control and Automation. 2002, pp. 27–34.

[2] Steven P. Vanderwiel and David J. Lilja. “Data Prefetch Mechanisms”. In: ACM Computing Surveys 32.2 (June 2000), pp. 174–199. issn: 0360-0300. doi: 10.1145/358923.358939. url: https://doi.org/10.1145/358923.358939.

[3] Top 10 Big Data Applications Examples: Healthcare, Entertainment and More. 2022. url: https://www.simplilearn.com/tutorials/big-data-tutorial/big-data-applications. (accessed: 04.05.2022).

[4] Jafar Alzubi, Anand Nayyar, and Akshi Kumar. “Machine learning from theory to algorithms: an overview”. In: Journal of Physics: Conference Series. Vol. 1142. 1. IOP Publishing. 2018, p. 15.

[5] ML-Based Data Prefetching Competition. url: https://sites.google.com/view/mlarchsys/isca-2021/ml-prefetching-competition. (accessed: 04.02.2022).

[6] Scott Beamer, Krste Asanovic, and David A. Patterson. “The GAP Benchmark Suite”. In: CoRR abs/1508.03619 (2015). arXiv: 1508.03619. url: http://arxiv.org/abs/1508.03619.

[7] S.P. Vander Wiel and D.J. Lilja. “When caches aren’t enough: data prefetching techniques”. In: Computer 30.7 (1997), pp. 23–30. doi: 10.1109/2.596622.

[8] D. Joseph and D. Grunwald. “Prefetching using Markov predictors”. In: IEEE Transactions on Computers 48.2 (1999), pp. 121–133. doi: 10.1109/12.752653.

[9] Youngjoo Shin, Hyung Chan Kim, Dokeun Kwon, et al. “Unveiling Hardware-Based Data Prefetcher, a Hidden Source of Information Leakage”. In: Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security. CCS ’18. Toronto, Canada: Association for Computing Machinery, 2018, pp. 131–145. isbn: 9781450356930. doi: 10.1145/3243734.3243736. url: https://doi.org/10.1145/3243734.3243736.

[10] Instruction Set Architecture (ISA). url: https://www.arm.com/glossary/isa. (accessed: 01.06.2022).


[11] Todd C. Mowry, Monica S. Lam, and Anoop Gupta. “Design and Evaluation of a Compiler Algorithm for Prefetching”. In: Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems. ASPLOS V. Boston, Massachusetts, USA: Association for Computing Machinery, 1992, pp. 62–73. isbn: 0897915348. doi: 10.1145/143365.143488. url: https://doi.org/10.1145/143365.143488.

[12] Qifang Bi, Katherine E Goodman, Joshua Kaminsky, et al. “What is Machine Learning? A Primer for the Epidemiologist”. In: American Journal of Epidemiology 188.12 (Oct. 2019), pp. 2222–2239. issn: 0002-9262. doi: 10.1093/aje/kwz189. eprint: https://academic.oup.com/aje/article-pdf/188/12/2222/32614486/kwz189.pdf. url: https://doi.org/10.1093/aje/kwz189.

[13] Jaime G Carbonell, Ryszard S Michalski, and Tom M Mitchell. “An overview of machine learning”. In: Machine learning (1983), pp. 3–23.

[14] Sotiris B Kotsiantis, Dimitris Kanellopoulos, and Panagiotis E Pintelas. “Data preprocessing for supervised leaning”. In: International journal of computer science 1.2 (2006), pp. 111–117.

[15] Dense layer. url: https://keras.io/api/layers/core_layers/dense/. (accessed: 03.04.2022).

[16] Idit Cohen. “Time series-Introduction”. In: (2019).

[17] Tomáš Mikolov. “Recurrent neural network based language model”. In: Brno University of Technology, Johns Hopkins University, Washington, DC, 2010.

[18] Alex Sherstinsky. “Fundamentals of Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) network”. In: Physica D: Nonlinear Phenomena 404 (2020), p. 132306. issn: 0167-2789. doi: https://doi.org/10.1016/j.physd.2019.132306. url: https://www.sciencedirect.com/science/article/pii/S0167278919305974.

[19] Ampatishan Sivalingam. “Why do we need LSTM”. In: (2020). url: https://towardsdatascience.com/why-do-we-need-lstm-a343836ec4bc.

[20] Rohan Chikorde. Recurrent Neural Networks (RNN) and LSTM - Deep Learning. url: https://www.linkedin.com/pulse/recurrent-neural-networks-rnn-lstm-deep-learning-rohan-chikorde. (accessed: 27.04.2022).

[21] Taher Al-Shehari and Rakan A. Alsowail. “An Insider Data Leakage Detection Using One-Hot Encoding, Synthetic Minority Oversampling and Machine Learning Techniques”. In: Entropy 23.10 (2021). issn: 1099-4300. doi: 10.3390/e23101258. url: https://www.mdpi.com/1099-4300/23/10/1258.

[22] Payam Refaeilzadeh, Lei Tang, and Huan Liu. “Cross-validation.” In: Encyclopedia of Database Systems 5 (2009), pp. 532–538.

[23] Raúl Rojas. “The Backpropagation Algorithm”. In: Neural Networks: A Systematic Introduction. Berlin, Heidelberg: Springer Berlin Heidelberg, 1996, pp. 149–182. isbn: 978-3-642-61068-4. doi: 10.1007/978-3-642-61068-4_7. url: https://doi.org/10.1007/978-3-642-61068-4_7.


[24] Evaluating a Machine Learning Model. 2019. url: https://medium.com/@skyl/evaluating-a-machine-learning-model-7cab1f597046. (accessed: 04.06.2022).

[25] Song Han, Huizi Mao, and William J. Dally. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. 2015. doi: 10.48550/ARXIV.1510.00149. url: https://arxiv.org/abs/1510.00149.

[26] P. Zhang, A. Srivastava, and Kannan R. et al. “TransforMAP: Transformer for Memory Access Prediction”. In: (). url: https://drive.google.com/file/d/188l6-s1hDbjsMgLkrql_b3u4RDKGt6S2/view. (accessed: 04.02.2022).

[27] What is an FPGA? Field Programmable Gate Array. url: https://www.xilinx.com/products/silicon-devices/fpga/what-is-an-fpga.html. (accessed: 27.04.2022).

[28] ASIC vs. FPGA: What’s The Difference? url: https://www.asicnorth.com/blog/asic-vs-fpga-difference/. (accessed: 27.04.2022).

[29] Xilinx Documentation portal - Getting Started with Vitis HLS. url: https://docs.xilinx.com/r/en-US/ug1399-vitis-hls/Getting-Started-with-Vitis-HLS. (accessed: 11.04.2022).

[30] Standard Performance Evaluation Corporation. 2022. url: https://www.spec.org/cpu2017/. (accessed: 18.04.2022).

[31] SPEC CPU 2006. 2022. url: https://www.spec.org/cpu2006/. (accessed: 18.04.2022).

[32] SPEC CPU 2017. 2022. url: https://www.spec.org/cpu2017/. (accessed: 18.04.2022).

[33] Python. 2022. url: https://www.python.org/. (accessed: 28.03.2022).

[34] Tensorflow - About. url: https://www.tensorflow.org/about. (accessed: 19.04.2022).

[35] PyTorch Documentation. url: https://pytorch.org/docs/stable/index.html. (accessed: 04.05.2022).

[36] Weights & Biases - Experiments. url: https://wandb.ai/site/experiment-tracking. (accessed: 28.04.2022).

[37] Weights & Biases - Sweeps. url: https://wandb.ai/site/sweeps. (accessed: 28.04.2022).

[38] Xilinx Support - Vitis HLS language support. url: https://support.xilinx.com/s/article/75770?language=en_US. (accessed: 11.04.2022).

[39] Xilinx Documentation portal - Using Libraries in Vitis HLS. url: https://docs.xilinx.com/r/en-US/ug1399-vitis-hls/Using-Libraries-in-Vitis-HLS. (accessed: 11.04.2022).

[40] Xilinx Documentation portal - Verifying Code with C Simulation. url: https://docs.xilinx.com/r/en-US/ug1399-vitis-hls/Verifying-Code-with-C-Simulation. (accessed: 11.04.2022).
