Swedish Summary - Advances Towards Data-Race-Free Cache Coherence Through Data Classification

Coherent shared memory in parallel processors is an attractive property of a memory architecture from the programmer's point of view. It lets programmers use the shared memory easily, without explicitly managing the movement and replication of data. Nevertheless, cache coherence is a bottleneck that limits the scalability of shared-memory architectures.

Today's parallel programming disciplines, such as the data-race-free paradigm, remove the need for the strictest cache coherence schemes, in which the coherence protocol reacts to every data modification. Instead, the semantics¹ of data-race-free software can be exploited to ensure a uniform and consistent view of the shared memory as needed (where the need is defined by the semantics). The end result is a coherence scheme that enables more scalable systems by reducing how often the coherence mechanisms must be activated: instead of being activated on every data read or write, they need to be activated only at the less frequent synchronizations.

In its simplest form, such a coherence protocol eliminates communication between processor cores and replaces it with coherence activities within the core. Cores are guaranteed to obtain valid data directly from the shared last-level cache (LLC). To achieve this, each core invalidates its own data when a synchronization occurs, since the data could have been modified by another core. In addition, each core updates the shared LLC with its own modified data at synchronization, to guarantee that cores reading the data after the synchronization obtain the updated values. Data classification thus serves as a useful tool that such coherence schemes can exploit: each core invalidates its own data classified as shared, and updates the shared LLC with its own modified data classified as shared. Data classified as private is not affected by the coherence mechanisms. The benefits of such a private/shared data classification for cache coherence are twofold: in addition to eliminating the coherence-traffic overhead for private data, it also eliminates the storage overhead for private data.
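
The self-invalidation and self-downgrade steps described above can be illustrated with a small model. This is only a sketch under stated assumptions, not the protocol evaluated in the thesis: the names `Core`, `Line`, and `synchronize` are hypothetical, the caches are plain dictionaries, and the private/shared classification is assumed to be given as a set of shared addresses.

```python
from dataclasses import dataclass, field


@dataclass
class Line:
    value: int
    dirty: bool = False


@dataclass
class Core:
    llc: dict      # shared last-level cache: address -> value (hypothetical model)
    shared: set    # addresses classified as shared (assumed to be given)
    l1: dict = field(default_factory=dict)  # private cache: address -> Line

    def read(self, addr):
        if addr not in self.l1:
            # Miss: fetch the value directly from the shared LLC.
            self.l1[addr] = Line(self.llc.get(addr, 0))
        return self.l1[addr].value

    def write(self, addr, value):
        # A write stays local; no message is sent to other cores.
        self.l1[addr] = Line(value, dirty=True)

    def synchronize(self):
        """At a synchronization point, handle shared data only."""
        for addr in list(self.l1):
            if addr not in self.shared:
                continue                       # private data is left untouched
            line = self.l1.pop(addr)           # self-invalidation
            if line.dirty:
                self.llc[addr] = line.value    # self-downgrade: update the LLC


llc, shared = {}, {0x100}
a, b = Core(llc, shared), Core(llc, shared)
b.read(0x100)        # b caches the old value
a.write(0x100, 42)   # shared data, modified by a
a.write(0x200, 7)    # private to a
a.synchronize()      # for example, a lock release
b.synchronize()      # for example, the matching lock acquire
print(b.read(0x100))
```

After both cores synchronize, `b` misses on address `0x100` and reads the updated value from the LLC, while the private line at `0x200` stays in `a`'s cache and generates no traffic.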

The effectiveness of coherence protocols that rely on data classification depends on the quality of that classification. Classifying more data than necessary as shared does not affect correctness, but it incurs extra overhead in the form of unnecessary data invalidation, which leads to increased network traffic and ultimately to a longer execution time for the running program. An efficient data classification scheme must therefore not classify private data as shared, for example because an unsuitable classification granularity was used. Furthermore, an efficient data classification scheme should be able to temporarily reclassify shared data as private whenever possible. Without this property, which we call classification adaptivity, data remains shared for the entire run time once it has been classified as shared, which prevents the optimizations enabled by having more private data. Moreover, an adaptive classification should not introduce any overhead of its own.

1 Synchronization by means of barriers and locks.
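
The effect of classification granularity can be made concrete with a toy access trace. The sketch below is an illustration only: the 64-byte line size, the 4096-byte page size, the trace, and the function names are assumptions, and a block counts as shared as soon as more than one core has touched it.

```python
from collections import defaultdict

LINE = 64     # assumed cache-line size in bytes
PAGE = 4096   # assumed page size in bytes


def shared_blocks(accesses, block_size):
    """Return the block numbers touched by more than one core."""
    cores = defaultdict(set)
    for core, addr in accesses:
        cores[addr // block_size].add(core)
    return {blk for blk, who in cores.items() if len(who) > 1}


def shared_lines(accesses, granularity):
    """Count the accessed cache lines that end up classified as shared."""
    shared = shared_blocks(accesses, granularity)
    lines = {addr // LINE * LINE for _, addr in accesses}
    return sum(1 for line in lines if line // granularity in shared)


# (core, address) pairs: only the line at 0x000 is touched by both cores,
# but all four lines lie in the same page.
trace = [(0, 0x000), (1, 0x000), (0, 0x040), (0, 0x080), (1, 0x0C0)]
print(shared_lines(trace, LINE))  # 1
print(shared_lines(trace, PAGE))  # 4
```

With line granularity only the truly shared line is classified as shared; with page granularity the three private lines are misclassified as well, and each of them would be invalidated needlessly at every synchronization.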

In this work we study private/shared data classification for cache coherence based on data-race-free semantics. We show how cache coherence for data-race-free software can benefit from a dynamic, low-overhead private/shared data classification. We study the trade-offs between implementing the classification in software with the help of the operating system and implementing it with dynamic hardware configuration, such as a classification directory. We study the trade-offs between different classification granularities, namely classification at cache-line granularity and at page granularity (Chapter II and Paper II). We also study the effect of adaptive data classification on coherence. Our results confirm that a minimal private/shared classification directory without explicit mechanisms for handling classification adaptivity enables efficient data-race-free coherence protocols, in which eviction of the shared directory entries implicitly provides classification adaptivity (Chapter III and Paper III).
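
The idea that directory evictions implicitly provide classification adaptivity can be sketched as follows. This is a simplified model under assumptions, not the directory design from Paper III: the class name `ClassificationDirectory`, the LRU replacement policy, and the entry layout are hypothetical, and the sketch ignores whatever a real protocol must do with cached copies when an entry is evicted.

```python
from collections import OrderedDict


class ClassificationDirectory:
    """Tiny private/shared directory with LRU replacement (assumed policy)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # block -> [first_core, is_shared]

    def access(self, core, block):
        entry = self.entries.get(block)
        if entry is None:
            if len(self.entries) >= self.capacity:
                # Evicting an entry forgets its classification.
                self.entries.popitem(last=False)
            entry = self.entries[block] = [core, False]   # starts as private
        elif entry[0] != core:
            entry[1] = True   # a second core was seen: shared while the entry lives
        self.entries.move_to_end(block)
        return "shared" if entry[1] else "private"


d = ClassificationDirectory(capacity=2)
print(d.access(0, "A"))   # private
print(d.access(1, "A"))   # shared
d.access(1, "B")
d.access(1, "C")          # directory is full, so the entry for A is evicted
print(d.access(1, "A"))   # private
```

No explicit reclassification mechanism exists in this model; block `A` becomes private again only because its entry was replaced and the next access allocates a fresh one.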

In addition, we show how the private/shared classification can be applied effectively to hierarchical topologies (Chapter IV and Paper IV). Preserving the private classification becomes more important in hierarchies, because invalidating shared data in a hierarchy incurs severe penalties owing to the larger intermediate shared caches in the hierarchy.

We also introduce and study a hybrid data classification scheme that dynamically takes into account the effects of the private/shared data classification on network traffic (Chapter V and Paper V). The network traffic generated by the coherence layer is used as feedback to influence the outcome of the private/shared classification, so that the generated network traffic is minimized. The resulting classification scheme is a hybrid in which the classification outcome depends on both the sharing pattern and the resulting network traffic. In other words, the hybrid scheme provides a low-overhead solution for classification adaptivity, in which shared data can temporarily be classified as private.

Acknowledgements

I would like to thank Prof. Stefanos Kaxiras for agreeing to be my advisor and supporting my research, for introducing me to the topic of memory consistency and cache coherence¹, and for helping me get started with writing my research papers. Moreover, I would like to thank my deputy advisor, Prof. Erik Hagersten, for his tremendous support and enlightening advice throughout my PhD research. Furthermore, I would like to thank Dr. Alberto Ros Bardisa from the University of Murcia, Spain, who assisted us with our simulation infrastructure and experimental setup during his visit to our research group as a postdoctoral researcher in the early phases of my research. I would also like to thank the director of undergraduate studies at Uppsala University, Dr. Mats Daniels, as well as my fellow PhD candidates Jonatan Lindén and Johan Janzén, for assisting me with preparing the “Swedish Summary” chapter of this thesis. My greatest thanks go to, among other colleagues at the Information Technology department, Michael Thuné, the head of the IT department, and Arnold Pears, the head of the Computer Systems division, who always strove to create a healthy and productive working environment. And last but not least, I would like to thank Uppsala University for making all this happen by providing the opportunity for all of us to gather here.

I dedicated this thesis to my parents in the beginning, and I also wish to close this thesis by referring to my parents, to whom I owe a great debt of gratitude for their unlimited support.

1 Starting with the book by Daniel J. Sorin et al. [9].

References

1. BYTE magazine Vol. 10, Num. 05 (May 1985), p.169.

2. The Operational Characteristics of the Processors for the Burroughs B5000, Revision A, 5000-21005. http://bitsavers.informatik.uni-stuttgart.de/pdf/burroughs

3. Alastair J. W. Mayer. 1982. The architecture of the Burroughs B5000: 20 years later and still ahead of the times?. SIGARCH Comput. Archit. News 10, 4 (June 1982), 3-10. DOI=http://dx.doi.org/10.1145/641542.641543

4. William A. Wulf and Samuel P. Harbison. 2000. Reflections in a pool of processors—an experience report on C.mmp/Hydra. In Readings in computer architecture, Mark D. Hill, Norman P. Jouppi, and Gurindar S. Sohi (Eds.). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 561-573.

5. Mark D. Hill, Norman P. Jouppi, and Gurindar S. Sohi (Eds.). 2000. Readings in Computer Architecture. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.

6. Per Stenström. 1990. A Survey of Cache Coherence Schemes for Multiprocessors. Computer 23, 6 (June 1990), 12-24.

7. Andrew S. Tanenbaum and Herbert Bos. 2014. Modern Operating Systems (4th ed.). Prentice Hall Press, Upper Saddle River, NJ, USA.

8. Marshall C. Yovits and Marvin Zelkowitz. 1995. Advances in Computers. Volume 40. Academic Press, San Diego, CA, USA.

9. Daniel J. Sorin, Mark D. Hill, and David A. Wood. 2011. A Primer on Memory Consistency and Cache Coherence (1st ed.). Morgan & Claypool Publishers.

10. Erik G. Hallnor and Steven K. Reinhardt. 2000. A fully associative software-managed cache design. In Proceedings of the 27th annual international symposium on Computer architecture (ISCA '00). ACM, New York, NY, USA, 107-116. DOI=http://dx.doi.org/10.1145/339647.339660

11. Chao Li, Yi Yang, Hongwen Dai, Shengen Yan, Frank Mueller, and Huiyang Zhou. 2014. Understanding the tradeoffs between software-managed vs. hardware-managed caches in GPUs. In Proceedings of the Performance Analysis of Systems and Software (ISPASS). DOI: 10.1109/ISPASS.2014.6844487

12. M. V. Wilkes. 2000. Slave memories and dynamic storage allocation. In Readings in computer architecture, Mark D. Hill, Norman P. Jouppi, and Gurindar S. Sohi (Eds.). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 371-372.

13. Matthew D. Sinclair, Johnathan Alsop, and Sarita V. Adve. 2015. Efficient GPU synchronization without scopes: saying no to complex consistency models. In Proceedings of the 48th International Symposium on Microarchitecture (MICRO-48). ACM, 647-659. DOI: https://doi.org/10.1145/2830772.2830821

14. Sarita V. Adve and Kourosh Gharachorloo. 1996. Shared Memory Consistency Models: A Tutorial. Computer 29, 12 (December 1996), 66-76. DOI=http://dx.doi.org/10.1109/2.546611

15. Kourosh Gharachorloo. 1996. Memory Consistency Models for Shared-Memory Multiprocessors. Ph.D. Dissertation. Stanford University, Stanford, CA, USA. UMI Order No. GAX96-20480.

16. Kourosh Gharachorloo, Daniel Lenoski, James Laudon, Phillip Gibbons, Anoop Gupta, and John Hennessy. 1990. Memory consistency and event ordering in scalable shared-memory multiprocessors. In Proceedings of the 17th annual international symposium on Computer Architecture (ISCA '90). ACM, New York, NY, USA, 15-26. DOI=http://dx.doi.org/10.1145/325164.325102

17. Sarita Vikram Adve. 1993. Designing Memory Consistency Models for Shared-Memory Multiprocessors. Ph.D. Dissertation. University of Wisconsin at Madison, Madison, WI, USA. UMI Order No. GAX94-07354.

18. Robert C. Steinke and Gary J. Nutt. 2004. A unified theory of shared memory consistency. J. ACM 51, 5 (September 2004), 800-849. DOI=http://dx.doi.org/10.1145/1017460.1017464

19. L. Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Trans. Comput., C-28(9):690–691, 1979.

20. Sarita V. Adve, Vikram S. Adve, Mark D. Hill, and Mary K. Vernon. 1991. Comparison of hardware and software cache coherence schemes. In Proceedings of the 18th annual international symposium on Computer architecture (ISCA '91). ACM, 298-308. DOI=http://dx.doi.org/10.1145/115952.115982

21. Ioannis Schoinas, Babak Falsafi, Alvin R. Lebeck, Steve K. Reinhardt, James R. Larus, and David A. Wood. Fine-grain Access Control for Distributed Shared Memory. Submitted for publication, March 1994.

22. David Chaiken, John Kubiatowicz, and Anant Agarwal. LimitLESS Directories: A Scalable Cache Coherence Scheme. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS IV), pages 224–234, April 1991.

23. C. K. Tang. 1976. Cache system design in the tightly coupled multiprocessor system. In Proceedings of the June 7-10, 1976, national computer conference and exposition (AFIPS '76). ACM, New York, NY, USA, 749-753. DOI=http://dx.doi.org/10.1145/1499799.1499901

24. L. M. Censier and P. Feautrier. 1978. A New Solution to Coherence Problems in Multicache Systems. IEEE Trans. Comput. 27, 12 (December 1978), 1112-1118. DOI=http://dx.doi.org/10.1109/TC.1978.1675013

25. W. C. Yen and K. S. Fu. Coherence Problem in a Multicache System. Proc. of 1982 Int. Conf. on Parallel Processing, IEEE, 1982, pp. 332-339.

26. James R. Goodman. 1983. Using cache memory to reduce processor-memory traffic. In Proceedings of the 10th annual international symposium on Computer architecture (ISCA '83). ACM, New York, NY, USA, 124-131. DOI=http://dx.doi.org/10.1145/800046.801647

27. Alvin R. Lebeck and David A. Wood. 1995. Dynamic self-invalidation: reducing coherence overhead in shared-memory multiprocessors. In Proceedings of the 22nd annual international symposium on Computer architecture (ISCA '95). ACM, 48-59. DOI=http://dx.doi.org/10.1145/223982.223995

28. Hoichi Cheong and Alexander V. Veidenbaum. Compiler-Directed Cache Management in Multiprocessors. IEEE Computer, 23(6):39-48, June 1990.

29. R. Cytron, S. Karlovsky, and K.P. McAuliffe, "Automatic Management of Programmable Caches," Proc. 1988 Int'l Conf. Parallel Processing, CS Press, Los Alamitos, Calif., Order No. 889, Vol. II, Aug. 1988, pp. 229-238.

30. E. Darnell and K. Kennedy. 1993. Cache coherence using local knowledge. In Proceedings of the 1993 ACM/IEEE conference on Supercomputing (Supercomputing '93). ACM, 720-729. DOI=http://dx.doi.org/10.1145/169627.169821

31. S. L. Min and J. L. Baer. 1992. Design and Analysis of a Scalable Cache Coherence Scheme Based on Clocks and Timestamps. IEEE Trans. Parallel Distrib. Syst. 3, 1 (January 1992), 25-44. DOI=http://dx.doi.org/10.1109/71.113080

32. Mark D. Hill, James R. Larus, Steven K. Reinhardt, and David A. Wood. 1993. Cooperative shared memory: software and hardware for scalable multiprocessors. ACM Trans. Comput. Syst. 11, 4 (November 1993), 300-318. DOI=http://dx.doi.org/10.1145/161541.161544

33. David A. Wood, Satish Chandra, Babak Falsafi, Mark D. Hill, James R. Larus, Alvin R. Lebeck, James C. Lewis, Shubhendu S. Mukherjee, Subbarao Palacharla, and Steven K. Reinhardt. 1993. Mechanisms for cooperative shared memory. In Proceedings of the 20th annual international symposium on computer architecture (ISCA '93). ACM, New York, NY, USA, 156-167. DOI=http://dx.doi.org/10.1145/165123.165151

34. S. V. Adve and M. D. Hill. Weak Ordering—A New Definition. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pp. 2–14, May 1990. doi:10.1109/ISCA.1990.134502

35. S. V. Adve and M. D. Hill. A Unified Formalization of Four Shared-Memory Models. IEEE Transactions on Parallel and Distributed Systems, June 1993. doi:10.1109/71.242161

36. S. V. Adve, M. D. Hill, B. P. Miller, and R. H. B. Netzer. Detecting Data Races on Weak Memory Systems. In Proceedings of the 18th Annual International Symposium on Computer Architecture, pp. 234–43, May 1991. doi:10.1145/115952.115976

37. L. Valiant, “Bridging model for parallel computation,” Communications of the ACM, vol. 33, pp. 103–111, Aug. 1990.

38. A. Gupta, W.-D. Weber, and T. C. Mowry, “Reducing memory and traffic requirements for scalable directory-based cache coherence schemes,” in Int’l Conf. on Parallel Processing (ICPP), Aug. 1990, pp. 312–321.

39. The Intel® Xeon Phi™ Coprocessor Architecture Overview. Available: https://software.intel.com/sites/default/files/Intel®_Xeon_Phi™_Coprocessor_Architecture_Overview.pdf

40. Blas A. Cuesta, Alberto Ros, María E. Gómez, Antonio Robles, and José F. Duato. 2011. Increasing the effectiveness of directory caches by deactivating coherence for private memory blocks. In Proceedings of the 38th annual international symposium on Computer architecture (ISCA '11). ACM, New York, NY, USA, 93-104. DOI=http://dx.doi.org/10.1145/2000064.2000076

41. Nikos Hardavellas, Michael Ferdman, Babak Falsafi, and Anastasia Ailamaki. 2009. Reactive NUCA: near-optimal block placement and replication in distributed caches. SIGARCH Comput. Archit. News 37, 3 (June 2009), 184-195. DOI=http://dx.doi.org/10.1145/1555815.1555779

42. Byn Choi, Rakesh Komuravelli, Hyojin Sung, Robert Smolinski, Nima Honarmand, Sarita V. Adve, Vikram S. Adve, Nicholas P. Carter, and Ching-Tsun Chou. 2011. DeNovo: Rethinking the Memory Hierarchy for Disciplined Parallelism. In Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques (PACT '11). IEEE Computer Society, Washington, DC, USA, 155-166. DOI=http://dx.doi.org/10.1109/PACT.2011.21

43. Hyojin Sung, Rakesh Komuravelli, and Sarita V. Adve. 2013. DeNovoND: efficient hardware support for disciplined non-determinism. In Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems (ASPLOS '13). ACM, New York, NY, USA, 13-26. DOI=http://dx.doi.org/10.1145/2451116.2451119

44. Hyojin Sung and Sarita V. Adve. 2015. DeNovoSync: Efficient Support for Arbitrary Synchronization without Writer-Initiated Invalidations. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '15). ACM, New York, NY, USA, 545-559. DOI: http://dx.doi.org/10.1145/2694344.2694356

45. DeNovo: Rethinking the Memory Hierarchy for Disciplined Parallelism, Hyojin Sung, Ph.D. thesis, University of Illinois, Urbana-Champaign, 2015.

46. Alberto Ros and Stefanos Kaxiras. 2012. Complexity-effective multicore coherence. In Proceedings of the 21st international conference on Parallel architectures and compilation techniques (PACT '12). ACM, New York, NY, USA, 241-252. DOI=http://dx.doi.org/10.1145/2370816.2370853

47. Stefanos Kaxiras and Alberto Ros. 2013. A new perspective for efficient virtual-cache coherence. In Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA '13). ACM, New York, NY, USA, 535-546. DOI: http://dx.doi.org/10.1145/2485922.2485968

48. David A. Wood, Mark D. Hill, and Richard E. Kessler. 1991. A model for estimating trace-sample miss ratios. In Proceedings of the 19th ACM Special Interest Group on Measurement and Modeling of Computer Systems (SIGMETRICS). 79–89.

49. Mohammad Alisafaee. 2012. Spatiotemporal coherence tracking. In Proceedings of the 45th IEEE/ACM International Symposium on Microarchitecture (MICRO). 341–350.

50. Jason Zebchuk, Babak Falsafi, and Andreas Moshovos. 2013. Multi-grain coherence directories. In Proceedings of the 46th IEEE/ACM International Symposium on Microarchitecture (MICRO). 359–370.

51. An-Chow Lai, Cem Fide, and Babak Falsafi. 2001. Dead-block prediction & dead-block correlating prefetchers. In Proceedings of the 28th annual international symposium on Computer architecture (ISCA '01). ACM, New York, NY, USA, 144-154. DOI=http://dx.doi.org/10.1145/379240.379259

52. Stefanos Kaxiras, Zhigang Hu, and Margaret Martonosi. 2001. Cache decay: exploiting generational behavior to reduce cache leakage power. In Proceedings of the 28th annual international symposium on Computer architecture (ISCA '01). ACM, 240-251. DOI=http://dx.doi.org/10.1145/379240.379268

53. J. Archibald and J. L. Baer, “An economical solution to the cache coherence problem,” in 12th Int’l Symp. on Computer Architecture (ISCA), Jun. 1985, pp. 355–362.

54. A. Gupta, W.-D. Weber, and T. C. Mowry, “Reducing memory and traffic requirements for scalable directory-based cache coherence schemes,” in Int’l Conf. on Parallel Processing (ICPP), Aug. 1990, pp. 312–321.

55. M. Ferdman, P. Lotfi-Kamran, K. Balet, and B. Falsafi, “Cuckoo directory: A scalable directory for many-core systems,” in 17th Int’l Symp. on High-Performance Computer Architecture (HPCA), Feb. 2011, pp. 169–180.

56. S. Demetriades and S. Cho, “Stash directory: A scalable directory for many-core coherence,” in 20th Int’l Symp. on High-Performance Computer Architecture (HPCA), Feb. 2014, pp. 177–188.

57. A. Agarwal, R. Simoni, J. L. Hennessy, and M. A. Horowitz, “An evaluation of directory schemes for cache coherence,” in 15th Int’l Symp. on Computer Architecture (ISCA), May 1988, pp. 280–289.

58. W.-D. Weber and A. Gupta, “Analysis of cache invalidation patterns in multiprocessors,” in 3rd Int’l Conf. on Architectural Support for Programming Language and Operating Systems (ASPLOS), Apr. 1989, pp. 243–256.

59. R. H. Katz, S. J. Eggers, D. A. Wood, C. L. Perkins, and R. G. Sheldon, “Implementing a cache consistency protocol,” in 12th Int’l Symp. on Computer Architecture (ISCA), Jun. 1985, pp. 276–283.

60. M. S. Papamarcos and J. H. Patel, “A low-overhead coherence solution for multiprocessors with private cache memories,” in 11th Int’l Symp. on Computer Architecture (ISCA), Jun. 1984, pp. 348–354.

61. J. A. W. Wilson, “Hierarchical cache/bus architecture for shared memory multiprocessors,” in 14th Int’l Symp. on Computer Architecture (ISCA), Jun. 1987, pp. 244–252.

62. M. Butler, L. Barnes, D. D. Sarma, and B. Gelinas, “Bulldozer: An approach to multithreaded compute performance,” IEEE Micro, vol. 31, no. 2, pp. 6–15, Mar. 2011.

63. N. Eisley, L.-S. Peh, and L. Shang, “In-network cache coherence,” in 39th IEEE/ACM Int’l Symp. on Microarchitecture (MICRO), Dec. 2006, pp. 321–332.

64. D. B. Gustavson, “Scalable coherent interface and related standards,” IEEE Micro, vol. 12, no. 1, pp. 10–22, Jan. 1992.

65. S. Kaxiras and J. R. Goodman, “The GLOW cache coherence protocol extensions for widely shared data,” in 10th Int’l Conf. on Supercomputing (ICS), Jan. 1996, pp. 35–43.

66. J. H. Kelm, D. R. Johnson, M. R. Johnson, N. C. Crago, W. Tuohy, A. Mahesri, S. S. Lumetta, M. I. Frank, and S. J. Patel, “Rigel: an architecture and scalable programming interface for a 1000-core accelerator,” in 36th Int’l Symp. on Computer Architecture (ISCA), Jun. 2009, pp. 140–151.

67. D. Lenoski, J. Laudon, K. Gharachorloo, W.-D. Weber, A. Gupta, J. L. Hennessy, M. A. Horowitz, and M. S. Lam, “The Stanford DASH multiprocessor,” IEEE Computer, vol. 25, no. 3, pp. 63–79, Mar. 1992.

68. Y. Maa, D. Pradhan, and D. Thiebaut, “Two economical directory for large-scale multiprocessors,” ACM SIGARCH Computer Architecture News, vol. 19, p. 10, Sep. 1991.

69. M. M. K. Martin, M. D. Hill, and D. J. Sorin, “Why on-chip cache coherence is here to stay,” Communications of the ACM, vol. 55, pp. 78–89, Jul. 2012.

70. M. M. Martin, M. Hill, and D. Wood, “Token coherence: Decoupling performance and correctness,” in 30th Int’l Symp. on Computer Architecture (ISCA), Jun. 2003, pp. 182–193.

71. J. Nickolls and W. J. Dally, “The GPU computing era,” IEEE Micro, vol. 30, no. 2, pp. 56–69, Mar. 2010.

72. H. Nilsson and P. Stenstrom, “The scalable tree protocol - A cache coherence approach for large-scale multiprocessors,” in 4th Int’l Conference on Parallel and Distributed Computing, Dec. 1992, pp. 498–506.

73. M. Shah, J. Barreh, J. Brooks, R. Golla, G. Grohoski, N. Gura, R. Hetherington, P. Jordan, M. Luttrell, C. Olson, B. Saha, D. Sheahan, L. Spracklen, and A. Wynn, “UltraSPARC T2: A highly-threaded, power-efficient, SPARC SoC,” in IEEE Asian Solid-State Circuits Conference, Nov. 2007, pp. 22–25.

74. A. Ros, M. Davari, and S. Kaxiras, “Hierarchical private/shared classification: key to simple coherence for clustered hierarchies,” in 21st Int’l Symp. on High-Performance Computer Architecture (HPCA), Feb. 2015, pp. 186–197.

75. B. R. Gaster, D. Hower, and L. Howes, “HRF-relaxed: Adapting HRF to the complexities of industrial heterogeneous memory models,” ACM Transactions on Architecture and Code Optimization (TACO), vol. 12, no. 1, pp. 7:1–7:26, 2015. [Online]. Available: http://doi.acm.org/10.1145/2701618

76. D. R. Hower, B. A. Hechtman, B. M. Beckmann, B. R. Gaster, M. D. Hill, S. K. Reinhardt, and D. A. Wood, “Heterogeneous-race-free memory models,” in 19th Int’l Conf. on Architectural Support for Programming Language and Operating Systems (ASPLOS), Feb. 2014, pp. 427–440.

77. M. S. Orr, S. Che, A. Yilmazer, B. M. Beckmann, M. D. Hill, and D. A. Wood, “Synchronization using remote-scope promotion,” in 20th Int’l Conf. on Architectural Support for Programming Language and Operating Systems (ASPLOS), Mar. 2015, pp. 73–86.

78. M. D. Sinclair, J. Alsop, and S. V. Adve, “Efficient GPU synchronization without scopes: Saying no to complex consistency models,” in 48th IEEE/ACM Int’l Symp. on Microarchitecture (MICRO), Dec. 2015, pp. 647–659.

79. P. Stenstrom, M. Brorsson, and L. Sandberg, “An adaptive cache coherence protocol optimized for migratory sharing,” in 20th Int’l Symp. on Computer Architecture (ISCA), May 1993, pp. 109–118.

80. Alan L. Cox and Robert J. Fowler. 1993. Adaptive cache coherency for detecting migratory shared data. In Proceedings of the 20th International Symposium on Computer Architecture (ISCA), May 1993, 98–108.

81. H. Luo, X. Xiang, and C. Ding. Characterizing active data sharing in threaded
