Mälardalen University, Västerås, Sweden

Thesis for the Degree of Master of Science in Computer Science with
Specialization in Embedded Systems 30.0 credits

TIME PREDICTABILITY OF GPU KERNEL ON AN HSA COMPLIANT PLATFORM

Marcus Larsson
larsson.marcus2@gmail.com

Nandinbaatar Tsog
nabarja@gmail.com

Examiner: Mikael Sjödin
Mälardalen University, Västerås, Sweden

Supervisor: Matthias Becker
Mälardalen University, Västerås, Sweden

Company supervisor: Fredrik Bruhn
Bruhnspace, Uppsala


Abstract

During recent years, the importance of utilizing more computational power in smaller computer systems has increased. To fit more computational power into smaller packages, combining more than one type of processor unit has become more popular in the industry. By combining them, one achieves better power efficiency as well as more computational power in a smaller area. However, heterogeneous programming has proved to be difficult, which deters software developers from learning heterogeneous programming languages. This has motivated the HSA Foundation to develop a new hardware architecture, called Heterogeneous System Architecture (HSA). This architecture brings features that make heterogeneous programming more accessible, efficient, and easier for software developers. The purpose of this thesis is to investigate this new architecture and to observe the timing characteristics of a task running a parallel region (a kernel) on a GPU in an HSA compliant system. To gain this knowledge, four test cases have been developed to collect timing data and to analyze the time of the code executed on the GPU. These are: comparison between CPU and GPU, timing predictability of parallel periodic tasks, schedulability in HSA, and memory copy. Based on the results of the analysis, it is concluded that HSA has the potential to be very attractive for developing heterogeneous programs due to its more streamlined infrastructure. It is easier to adopt, requires less knowledge of the underlying hardware, and software developers can use their preferred programming languages instead of learning a new programming framework, such as OpenCL. However, since the architecture is new, there are bugs and HSA features that are yet to be incorporated into the drivers. Performance-wise, HSA is faster compared to legacy methods, but it lacks consistent time predictability, which is important for real-time systems.


Acknowledgements

First, we would like to thank the RTHAS project for providing us with the opportunity to work on this exciting thesis project. Next, we would like to express our sincere gratitude to our two supervisors: Adj. Prof. Fredrik Bruhn for his great knowledge in the hardware area, and Matthias Becker for his expertise in real-time systems. Both our supervisors have helped us a great deal by pointing us in the right direction and by providing an extensive amount of support whenever necessary. We would also like to give a special thank you to the HSA Foundation for affording us the great opportunity to research such a challenging theme and for accepting us, Mälardalen University, as a member of the HSA Foundation.

Finally, we would also like to direct a big thank you to AMD for providing answers to the inquiries we had during the thesis period.

Table of Contents

1 Introduction
  1.1 Motivation
  1.2 Problem formulation
  1.3 Thesis outline
2 Background
  2.1 Real-time system
  2.2 Unpredictability in real-time systems
  2.3 Heterogeneous computing
  2.4 HSA foundation
  2.5 Related work
3 Heterogeneous System Architecture
  3.1 Legacy vs HSA systems
    3.1.1 Memory handling
    3.1.2 Queuing
    3.1.3 Instruction Set Architecture
  3.2 Terminology
  3.3 HSA Intermediate Language
    3.3.1 Compiling and running HSAIL kernels
    3.3.2 Execution of HSAIL kernels
  3.4 Memory model
    3.4.1 Memory structure
    3.4.2 Synchronization and scope
  3.5 Queuing model
    3.5.1 Architected Queuing Language
    3.5.2 Execution of an AQL packet
    3.5.3 Context switching
  3.6 Software stack
4 Limitations
5 Method
  5.1 Research method
  5.2 Experiment method
  5.3 System model
    5.3.1 Definition
    5.3.2 Fork-Join task model
  5.4 Hardware and Software
    5.4.1 Hardware
    5.4.2 Software
6 Experiments
  6.1 GPU kernel
  6.2 Test cases
7 Implementation
  7.1 Configuration
  7.2 Test software
    7.2.1 Comparison testing between CPU and GPU
    7.2.2 Timing predictability of periodic parallel task
    7.2.3 Load generator
    7.2.4 Scheduling mechanism of HSA
8 Evaluation
  8.1 Comparison between CPU and GPU
  8.2 Timing predictability of periodic parallel task
  8.3 Scheduling mechanism of HSA
    8.3.1 Full execution
    8.3.2 Kernel execution
  8.4 Memory copy in HSA
    8.4.1 Comparison between the environments
    8.4.2 Comparison between the two systems
9 Future work
  9.1 Context switching
  9.2 Resource sharing
  9.3 Queue management
  9.4 Real-time middle-layer
  9.5 Other real-time behavior
10 Conclusion
References
Appendix A Install and configuration
  A.1 Install and verify ROC
  A.2 Install CLOC
  A.3 Install HSAILasm
  A.4 Install AMD CodeXL
Appendix B Tools
  B.1 HCC
  B.2 CLOC
  B.3 HSAILasm
  B.4 AMD CodeXL
Appendix C Result tables
  C.1 Test case 1: Comparison between CPU and GPU
  C.2 Test case 2: Timing predictability of periodic parallel task
  C.3 Test case 3: Scheduling mechanism of HSA

1 Introduction

Heterogeneous computing is commonly used in the industry [1] for various applications by combining more than one type of compute unit. One type of compute unit, the Central Processing Unit (CPU), is most commonly used to run tasks for the operating system, for example, web browsers and word processing programs. Another type of compute unit is the Graphics Processing Unit (GPU). The GPU is commonly used when applications require certain types of computations that take advantage of a parallel execution model, for example, applications that process 3D images or games. When combining both CPU and GPU, a software developer has access to more performance when using a programming framework that supports parallel computation, for example, OpenCL or OpenMP. However, it is time consuming to create applications that utilize more than one type of compute unit in an efficient and easy way. The challenges are that the various compute units use different memory spaces and have different instruction set architectures. Therefore, a software developer needs prior knowledge about the hardware varieties. To make it easier to program applications that can assign tasks to different compute units without detailed knowledge about the hardware, several companies (AMD, ARM, Samsung, etc.) formed the HSA foundation^1 to develop a new hardware architecture: Heterogeneous System Architecture (HSA).

HSA is a new hardware architecture that aims to ease the work of creating applications using many compute units, without having to copy memory data between different physical memories [2]. A software developer only needs pointers in a virtual memory space that is shared by all compute units. Figure 1 displays a comparative view of the memory structure of a legacy system and that of an HSA system to depict the main difference between these two systems. In a legacy system the memory data is copied between different physical memories. Meanwhile, in an HSA system the data is kept in a unified coherent memory. Therefore, all devices can access the same physical memory data by only using memory pointers. This saves time, since no memory copying is needed, which in turn lowers the power consumption.

Figure 1: Memory structure of a non-HSA system and an HSA system.

The real-time properties of an HSA system are currently unknown. If it is possible to execute a real-time application in a predictable and timely manner on an HSA system, this can prove very useful when a lot of computational power is available in one System on Chip (SoC) and tasks with deadline requirements can be executed in parallel. For this thesis work, the task is to investigate the timing properties of a real-time application on a system which is HSA compliant. The equipment used in this thesis work is two systems which are HSA compatible. Both systems run an AMD Carrizo APU^2 with different hardware configurations. An APU is an SoC where both a CPU and a GPU reside. AMD Carrizo is the first APU that fully supports the first generation of HSA^3.

^1 http://www.hsafoundation.com
^2 http://support.amd.com/TechDocs/50742 15h Models 60h-6Fh BKDG.pdf
^3 http://www.hsafoundation.com/standards/

1.1 Motivation

The purpose of this thesis work is to investigate the possibilities of running a real-time application on an HSA system. This type of investigation is important because more computational power could be used in an embedded system. Today, more applications require higher performance in smaller packages and lower power consumption, all of which can be achieved by the benefits that HSA brings.

As mentioned in Section 1, the processor that is going to be used is an AMD Carrizo. The operating system used in this thesis work is a Linux based distribution called Ubuntu. The mainline kernel inside Ubuntu generates overhead, since the mainline kernel is not real-time capable. The extra overhead causes issues in achieving hard real-time performance. Therefore, this investigation is going to try to find a feasible way to accomplish a system which handles tasks in a timely manner.

1.2 Problem formulation

To achieve a predictable and timely execution of a real-time application on an HSA system, the thesis will investigate the following:

1. The timing properties of an application running on an HSA system;

2. The unified coherent memory;

3. HSA's hardware and software specifications, to gain knowledge about HSA, for example, the memory model and the queuing model;

4. Time differences compared to legacy heterogeneous computing methods.

The main goal is to look into the timing properties of a task in an HSA compatible system. When investigating the unified coherent memory, the interesting behavior to look for is whether a task with large memory allocation operations saves time by moving from two physical memories in a legacy system to one unified coherent memory in an HSA system. One part of the thesis work is to learn the behaviour of the hardware and the software architecture of HSA, and to study how the memory model and the queuing model are defined. Last, legacy heterogeneous computing methods are compared with HSA by comparing OpenCL with HSA.

1.3 Thesis outline

This thesis report is constructed as follows:

• In Section 2, the background is presented. The background topics covered are real-time systems, unpredictability in real-time systems, heterogeneous computing in general, the HSA foundation, and the related work.

• Section 3 introduces the concept of HSA. This section describes the basic idea behind HSA and outlines its benefits as compared to a legacy system. It also explains the main goal of the memory model and how the queue management works.

• Section 4 explains the limitations of this thesis project.

• Section 5 presents the method of the design and implementation idea and explains the hardware and software used in this thesis project, in addition to the explanation of the test cases.

• Section 6 describes the experiments that are conducted in this thesis project.

• In Section 7, the experiment setup is explained, by algorithms and flow charts.

• Section 8 presents the results from the experiments, as well as the evaluation and discussion of the results.

• Section 9 covers future work: work or ideas that could not be included due to time restrictions of this thesis project and other restrictions as mentioned in this section.

2 Background

This section introduces the background subjects related to this thesis work. Section 2.1 introduces what a real-time system is, and Section 2.2 covers its unpredictable behavior. In Section 2.3 an introduction to heterogeneous computing is given. Section 2.4 describes how the HSA Foundation was founded and what its purpose is. Finally, in Section 2.5, the related work of this thesis project is presented.

2.1 Real-time system

A real-time system is a system that reacts to external events. It executes a function based on the external events and returns a response within a finite and required time. Therefore, not only the accuracy of the result, but also its timeliness is an important factor for the correctness of the system. Real-time systems can be divided into hard, firm, and soft [3] real-time systems from the perspective of their timing constraints (see Figure 2). A hard real-time system must meet all of its timing constraints. If the system misses a deadline even once, this results in a failure that can lead to a fatality and/or large cost damage. Therefore, hard real-time systems are often considered to be safety critical. In a soft real-time system, one or more deadline misses may occur, but they affect the quality of the service. A firm real-time system lies between a hard and a soft real-time system.

Figure 2: Real-time system requirements.

2.2 Unpredictability in real-time systems

Optimality, feasibility, comparability, predictability, sustainability, and anomalies are regarded as the fundamental results in multiprocessor real-time scheduling that are independent of specific scheduling algorithms [4]. Since the beginning of the 1990s, the importance of predictability in real-time scheduling has become widely recognized [5][6][7]. Ha and Liu [6] introduced the concept of predictability for real-time scheduling algorithms for the first time, as noted in the survey performed by [4].

According to [4], a scheduling algorithm is considered predictable as long as the response times of jobs cannot increase due to a decrease in their execution times while all other parameters remain constant. Predictability is an important feature since actual task execution times vary and only reach the worst-case value in some extreme cases. Ha and Liu showed that all priority-driven as well as preemptive scheduling algorithms for multiprocessor systems are predictable. “We note that, for any dynamic priority scheduling algorithm, it is necessary to prove predictability before the algorithm can be considered useful.” As emphasized here, the scheduling algorithm is regarded as one of the main factors in the predictability of a real-time system.
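Stated compactly (our notation, adapted from the description in [4] and not quoted from [6]): let each job $J_i$ have an actual execution time $c_i$ with $c_i^- \le c_i \le c_i^+$, and let $R_i(c)$ be the response time of $J_i$ under a given algorithm for the execution-time vector $c = (c_1, \ldots, c_n)$. The algorithm is predictable if

$R_i(c^-) \le R_i(c) \le R_i(c^+)$ for all $i$ and all $c$ with $c^- \le c \le c^+$ (component-wise),

i.e. the response times obtained with the minimal and maximal execution times bound the response time of every intermediate scenario.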

Furthermore, the architectural development determines the predictability of the system to a great degree. In the study [8], Wilhelm et al. give specific advice regarding future computer architectures for time-critical systems. Analyzability of the performance is counted as an essential criterion among the requirements on future architectures. Consequently, the following design choices are considered negative.

• Although a random-replacement strategy can result in high average performance, it leads to different effects on the caches, different collisions on buses, as well as different execution times.

• “The design of the internal and external buses, if done wrongly, leads to hardly analyzable behavior and great loss in precision.”

Most issues related to timing analysis are caused by interference on shared resources, which are shared for cost, energy, and performance reasons. Various users of a shared resource may access the resource in an unknown order, and these access sequences can lead to different states of the resource. The resulting resource states may cause differences in the future timing behavior, because the different sequences lead to different execution times.

The two types of interference, inherent and virtual, cause two different kinds of non-determinism: the former causes real non-determinism while the latter causes artificial non-determinism. Both types of interference affect predictability in a negative way.

Inherent interference on a shared resource occurs when one user of the resource encounters another user's activity at an unpredictable time. Buses and memory are examples of shared resources with inherent interference.

2.3 Heterogeneous computing

From the perspective of scheduling, a traditional multiprocessor system is classified as a homogeneous multiprocessor system, which uses identical processors for the computation. A number of different processors for different usages, such as GPUs, FPGAs, and others, have been developed as a result of rapid advances in microprocessor technologies. From around 2010, heterogeneous computing has been in the spotlight and related investigations have become widespread. Heterogeneous computing refers to a system that uses a collaboration of different compute units.

New issues and difficulties which cannot be found in conventional homogeneous systems arise in heterogeneous computing systems [9]. The following areas of heterogeneity can be identified [10]:

• Application Binary Interface (ABI) is a low level interface between two program modules. Computing elements may interpret memory in different ways. Endianness (little or big), calling convention, and memory layout may differ, and depend on both the architecture and the compiler in use.

• Application Programming Interface (API): Both library and OS services might be unequally available to the different compute elements.

• Interconnect: The computing elements can have different types of interconnect beyond basic memory/bus interfaces. An interconnect may consist of dedicated network interfaces, direct memory access (DMA) devices, FIFOs, and so on.

• Instruction Set Architecture (ISA) is the part of a computer that is related to programming, i.e., the set of basic instructions in machine language. Computing elements with different instruction set architectures lead to binary incompatibility.

• Low-Level Implementation of Language Features: Function pointers are used for the implementation of language features like functions and threads, and they require extra translation or abstraction when used in heterogeneous situations.

• Memory Interface and Hierarchy: Compute elements may have various cache structures, cache coherency protocols, and either uniform or non-uniform memory access (NUMA). The capability of reading arbitrary data lengths can vary because some processors can execute only byte-, word-, or burst accesses.

In order to establish HSA as a standard for heterogeneous computing, HSA has to address as many of the above mentioned areas as possible.

2.4 HSA foundation

AMD, together with several other companies, created an organization with the goal of building a new architecture for heterogeneous programming. This organization is called the HSA foundation, and its key founders are shown in Figure 3. The foundation creates the royalty-free open-source drivers and software which are needed for the creation of heterogeneous applications.

Figure 3: Key members that formed the HSA foundation.

The foundation also has other backers, for example, other companies and universities. Others can gain membership by applying to the HSA foundation. Members get access to early drafts of specification updates and can join working groups to help develop the standard.

After a few years of developing the standard, the foundation released the first version of the HSA standard in 2015. Three specifications contain the standard definition of HSA version 1.0.

Platform System Architecture Specification The platform system architecture specification [11] defines the hardware requirements that are necessary for an HSA system. It also includes design implementations for adopters to use. In the specification, two chapters describe the requirements for building a valid HSA product.

1. System architecture: This chapter covers the requirements for the hardware architecture of an HSA system. Some of the requirements for an HSA system include a user mode queue, flat addressing, signaling and synchronization, atomic operations, agent scheduling, and an Architected Queuing Language (AQL).

2. HSA consistency model: This chapter explains how to obtain a memory consistency model in an HSA system, by using atomic operations, memory segments, ownership, scopes, and memory fences. It also describes how to avoid data races.


Programmer Reference Manual Specification The programmer reference manual specification [12] covers the Heterogeneous System Architecture Intermediate Language (HSAIL). HSAIL is a virtual language in which the developers define the code which runs on the GPU. The code resembles an assembly language. This manual describes how HSAIL is defined, its syntax and semantics, and how to write kernels with it by describing all the different types of instructions (arithmetic, branch, memory, etc.). The binary representation format of HSAIL code, BRIG, is also defined and explained in this specification.

Runtime Specification The runtime specification [13] defines the API which the developers can use to program heterogeneous applications. The API covers all functions available to the programmer, describing how to use queues, memory, signals, etc. There is functionality that is not included in the runtime API but can be implemented as extensions. Two extensions are explained: HSAIL finalization and images.

1. HSAIL finalization: With this extension, the application can compile HSAIL code in BRIG files and use them in a task dispatch.

2. Images: The image extension handles image data. This extension explains how to use images by specifying the channel type and channel order of an image.

2.5 Related work

In paper [14] the authors discuss different approaches to determine the worst-case execution time (WCET) in a system with more than one compute core. The general conclusion is that it is very difficult to determine the WCET for tasks running on a many-core system. The best approach to obtain a WCET estimate is to run the task without any disturbance for a while and measure the time it takes for the task to execute, and then do the same with other tasks running in the background.

Creating a test environment where we can measure the response time is one of the major parts of this work. In [15] two demand-based schedulability analyses are covered together with a device driver in Linux. They use two variations of the Linux kernel, one normal and one modified for real-time systems [16]. The modified Linux kernel produces better results, which indicates that a real-time kernel is also a better choice for this work. There is a Linux kernel called LITMUS^RT which is specifically created to run real-time tasks on multi-core systems on Linux. LITMUS^RT is very useful; however, since the HSA drivers for Linux are newer, all drivers and libraries need to be up-to-date with the most recent version at all times to receive the latest necessary features. In papers [17] and [18] the authors discuss self-suspending tasks. This is relevant since the hardware platform used for this work has four CPU cores and one GPU with multiple compute units, between which a task can be swapped. If a task is on one CPU core and needs to be switched to one of the compute units on the GPU, the task will self-suspend onto the GPU compute unit (for example, if the task requires more computational power) and will move back after some time period. During the time the task is on the GPU compute unit, no other task can run on the CPU core. The time required for the execution of the task on the GPU compute unit is unknown since the task is self-suspending itself.

Betts and Donaldson [19] discuss the problem of estimating the WCET of GPU accelerated applications. The authors state that the paper is the very first work on applying WCET analysis to GPU accelerated applications, and that the result of the WCET calculation depends on how effectively the concurrency is modelled, which in turn reduces the pessimism of the WCET calculation. The evaluation of the work is based on the CUDA SDK, which is not related to HSA. However, the real-time analysis used, the Control Flow Graph, can be worth using in this thesis.

Hirvisalo [20] has worked on static timing analysis of GPU software based on abstract Cooperative Thread Array (CTA) simulation. The author states, ”The method of the work is very scalable and can be used to analyze the WCET of very large numbers of parallel threads.” However, the method proposed for the kernels is based on the experience of GPGPU programming, which is not designed for HSA. Additionally, the method requires more investigation to address the lack of a formal proof of accuracy.

In the paper [21] and thesis [22], Elliott et al. present real-time scheduling for GPUs based on the CUDA environment. Two real-time analysis methods, the Shared Resource Method and the Container Method, are proposed for a soft real-time system that consists of a single GPU and a multi-core CPU, and a number of observations have been made on the system [21]. As a result of the observations, the authors recommend investigating real-time scheduling for multi-GPU platforms. For the multi-GPU platform, Elliott implements a multi-GPU scheduler, GPUSync, to support work in the GPGPU programming environment. In runtime experiments that executed computer vision programs with and without GPUSync, GPUSync greatly reduced jitter in video processing [22].

Lakshmanan et al. were the first to study the Fork-Join task model in periodic real-time systems [23]. Most of the previous work has focused on many sequential tasks on multiple processor cores in real-time systems. Therefore, parallel programming models are expected to bring a new dimension to multi-core processors. The Fork-Join model in parallel programming is widely used in Java and OpenMP systems. Each basic fork-join task always begins with a single master thread and splits into multiple threads at the fork structure. The sequential part resumes at the join structure once all of the parallel threads complete their computations. In the basic fork-join task, sequential and parallel parts alternate with each other, and the task ends with a sequential part.
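For reference, the basic fork-join task of [23] can be written along the following lines (notation adapted for this summary; the exact symbols in the paper may differ). A task $\tau_i$ with period and implicit deadline $T_i$ is

$\tau_i = \big((C_i^1, P_i^1, C_i^2, P_i^2, \ldots, P_i^{s_i-1}, C_i^{s_i}), T_i\big),$

where the $C_i^j$ are the sequential segments executed by the master thread and the $P_i^j$ are the parallel segments, each of which forks into several threads that must all complete (join) before the next sequential segment starts.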

In the book [24], Jain introduces techniques required to perform systems analysis. In general, there are two common types of analysts: those who can measure but cannot model, and those who can model but cannot measure. The book consists of the following chapters, required by both types of analysts: an overview of performance evaluation, measurement techniques and tools, probability theory and statistics, experimental design and analysis, simulation, and queuing models. A section on common mistakes in performance evaluation introduces primary as well as elementary concepts of system analysis. The book is a distinguished work that contains various realistic cases and comments, such as that full comprehension of the special features of observation data depends on the angle from which the result is viewed.

3 Heterogeneous System Architecture

The purpose of HSA is to simplify the process of creating parallel computations. Instead of having to know the target platform's architecture, the programmer should only need to focus on implementing the code, rather than spending time figuring out how everything is structured. Instead of using complex parallel programming frameworks like OpenCL and CUDA, one can use languages that are commonly used in everyday applications, for example Java, C, and C++. This section explains why HSA is suitable for parallel programming and why it is easier for programmers to adopt than already existing tools. First, in Section 3.1 the challenges with parallel programming are brought up together with the solutions that HSA presents. In Section 3.2 the new and important terminology related to HSA is introduced. In Section 3.3 the kernel language of HSA is introduced. Section 3.4 discusses how the memory model in HSA works, while Section 3.5 discusses how the queuing model in HSA works. Last, Section 3.6 presents an overview of the process flow when creating an HSA application.

3.1 Legacy vs HSA systems

The purpose of HSA is to minimize the amount of work that programmers have to put in when developing applications that utilize parallel computations. The architectural difference between an HSA system and a legacy system is what makes it more inviting. The problems present in a legacy system are described below together with the corresponding solutions that come with the usage of an HSA system.

3.1.1 Memory handling

Legacy In a system with multiple processors and accelerators there usually also exist multiple memory pools. For example, in a common computer there is one memory pool for the CPU and one separate memory pool for the GPU. Since the memory pools are placed in physically different places, the communication between the memory pools is time consuming [citation-to-be-added]. Each time the CPU wants to dispatch a task to the GPU, the associated data in the CPU memory needs to be copied into the GPU memory. After the GPU has performed the computation, the data is copied back into the CPU memory. This process, memory swapping, is not only time consuming but also consumes more power compared to when no memory copying is needed. Thus, copying data between two memory pools is costly when it comes to parallel computing.

The memory swapping between different memory pools does not happen automatically. The programmer needs to keep track of the data, which in turn can generate a lot of extra work, since the programmer has to think about where to put the data and whether the data is already being used by a process or not. The programmer has to predict whether the data they want to use is actually available while the data is being processed.

Visibility of the memory data in a legacy system can also be an issue. A multi-core CPU is cache coherent, meaning that all data writes are visible to all CPU cores as long as the data is in the CPU cache. If one CPU core reads data located in the cache, then the system ensures that the read data is the latest updated version. This cache coherence property does not exist for the GPUs. The GPU can probe the CPU cache, but this procedure does not work the other way around. For the CPU and the GPU to know what is inside the caches, the software, or rather the programmers, need to swap the data back to the system memory to make it visible to the rest of the system.

As described, there are a lot of memory challenges when it comes to working with different memory pools in a legacy system.

HSA One step to close the distance between the different memory pools is to place the CPU and the GPU on the same chip. AMD has developed a new type of hardware architecture called APU, which is a System on Chip (SoC) that houses both a CPU and a GPU. The benefit of this is that both the CPU and the GPU share the same physical memory space. However, while they share the same memory, the memory is still divided into two separate spaces, one for the CPU and one for the GPU. The architectural design behind HSA is what merges the memory spaces into one unified coherent memory space. This unified coherent memory space lets data be shared between GPUs and CPUs without the need to copy the data. All that needs to be passed is a pointer to the memory address that is needed. More details about the HSA memory model are given in Section 3.4.
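As a minimal illustration of this difference, the C sketch below assumes an HSA full-profile system (shared virtual memory) and a hypothetical helper dispatch_kernel() that builds and enqueues the corresponding AQL packet; the host buffer is handed to the GPU by pointer, whereas a legacy flow would need explicit copies:

#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: builds a kernel dispatch packet whose kernarg block
 * points at 'data', enqueues it, and waits for the completion signal. */
extern void dispatch_kernel(float *data, size_t n);

void run_on_gpu(size_t n)
{
    /* Full profile: an ordinary host allocation is visible to the GPU,
     * so the kernel can be handed the pointer directly - no staging copy. */
    float *data = malloc(n * sizeof *data);
    memset(data, 0, n * sizeof *data);

    dispatch_kernel(data, n);      /* GPU reads and writes 'data' in place */

    /* A legacy flow (e.g. OpenCL 1.x) would instead allocate a device
     * buffer, copy host -> device, run the kernel, and copy device -> host. */
    free(data);
}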

3.1.2 Queuing

Legacy When an application wants to send a task to a hardware queue associated with a GPU, the dispatch process generates a large overhead. All data that is needed by the GPU comes from the CPU. Therefore, all required data needs to be copied to different buffers and memory pools. After that, the CPU sends a command to the GPU to actually start the task that is located in the queue. Also, before the task lands in the hardware queue, it has to traverse several software layers. These software layers can be drivers and different buffers in both user mode and kernel mode.

Figure 4: Hardware queue structure in a legacy system.

In Figure 4 the whole trace of a task from the application to the hardware queue can be seen. When more than one application dispatches tasks to the hardware queue, all tasks are going to be held in one single queue. This limits the possibility of running the GPU tasks in a better parallel way. More information regarding the queuing model for HSA is given in Section 3.5.

HSA In HSA the queues are created in user mode space. All processes can dispatch tasks into the hardware queue without the task having to pass through several software layers. Here, the task is put directly into the hardware queue that is associated with an accelerator. Instead of only having one software queue, each accelerator can have multiple software queues attached to it. Also, if one kernel agent (accelerator) has a queue that is full, the kernel agent can move a task from its own queue to another queue that belongs to a different kernel agent.
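A minimal sketch of creating such a user mode queue with the HSA runtime C API could look as follows (error handling omitted; the kernel agent is assumed to have been found earlier, for example with hsa_iterate_agents):

#include <stdint.h>
#include "hsa.h"

/* Create one user mode software queue on a previously selected kernel agent. */
hsa_queue_t *create_dispatch_queue(hsa_agent_t kernel_agent)
{
    hsa_queue_t *queue = NULL;

    /* 256 packet slots, multiple producers allowed, no error callback, and
     * no limit on private/group segment usage per dispatch (UINT32_MAX). */
    hsa_status_t status = hsa_queue_create(kernel_agent, 256,
                                           HSA_QUEUE_TYPE_MULTI,
                                           NULL, NULL,
                                           UINT32_MAX, UINT32_MAX, &queue);
    return (status == HSA_STATUS_SUCCESS) ? queue : NULL;
}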

3.1.3 Instruction Set Architecture

Legacy General-purpose computing on graphics processing units (GPGPU) can be complex and has a high learning curve for programmers. First, a new framework specialized for GPGPU programming needs to be learnt, for example, OpenCL or CUDA. Learning these frameworks can be time consuming since programmers not only need to learn the framework itself, but also the architecture of the targeted platform that the application will run on. There are several different GPU vendors and architectures, where each can have a different Instruction Set Architecture (ISA). If a programmer wants an application to conform to several different platforms, the probability of having to write the same application more than once is high.


HSA Each GPU vendor uses a different ISA; even the same vendor can use different ISAs. There is an intermediate language called the Heterogeneous System Architecture Intermediate Language (HSAIL), which is a low-level intermediate representation that is ISA-independent. HSAIL is generated by a high-level compiler and can later be translated into the targeted ISA by a finalizer. This finalizer translates the HSAIL code into machine code that the targeted platform can execute. The HSAIL code, or kernel, is the parallel region of the code. More about HSAIL can be found in Section 3.3.

Writing programs for an HSA platform can be done in different languages. Programmers can use what they already use, languages such as Java, C, C++, and Python. As long as the high-level compiler supports HSA and can compile the parallel regions into HSAIL, any language can be used to program applications that utilize multiple types of processors and accelerators.

3.2 Terminology

HSA introduces new concepts, the explanation of which can be found in this section. Learning how everything is connected in HSA is important to fully understand later sections.

Kernel A kernel is the HSAIL program, which holds the parallel part of an HSA application.

HSA agent In an HSA system, agents participate in the memory model and are linked to an HSA device, for example a CPU, a GPU, or any other type of accelerator (DSP, FPGA). An agent is a communication bridge between the software and the hardware. The communication works in the same way as how CPU cores communicate with each other, using interconnections [25]. An HSA agent can be a kernel agent if the agent is an accelerator that supports the HSAIL instruction set and can execute Architected Queuing Language (AQL) kernel dispatch packets.

Host CPU A host CPU is an agent that supports normal x86 instructions and runs the operating system and the HSA runtime. This type of agent can also dispatch commands to kernel agents by issuing various memory instructions to build and enqueue AQL packets.

HSAIL The Heterogeneous System Architecture Intermediate Language (HSAIL) is a low-level compiler intermediate language. This language is designed to represent the parallel regions of the code. More about HSAIL can be found in Section 3.3.

Finalizer The finalizer is a just-in-time compiler that extracts the BRIG embedded in the HSA program and translates it into the correct instruction set for the targeted platform.

AQL Architected Queuing Language (AQL) is a binary interface that is used to launch the dispatch packets. An AQL packet holds all necessary information about the code that is going to be executed on the HSA device.

Task dispatch There are two types of task dispatch: kernel and agent. Kernel dispatch is a process which sends a task to a software queue which includes a kernel that is going to be executed on an accelerator. An agent dispatch is a similar process, however, the task contains a function for another agent to run instead of a kernel object.

Kernel dispatch packet A kernel dispatch packet is used to submit kernels to accelerators that support HSAIL. The packet contains the specifications of the kernel, for example the work-group and grid size (more information regarding these in Section 3.3), the kernel object, and the arguments for the kernel.

Agent dispatch packet An agent dispatch packet is used for sending functions to other agents to perform. Usually, the jobs are sent to other agents so that they can be performed while a kernel is executing on a kernel agent. Compared to a kernel dispatch packet, no kernel object is included. However, an ID is included to identify which function to run on the given agent. A typical scenario is when a kernel issues an agent dispatch packet because a work-item needs to run a built-in function.


Packet processor A packet processor is the scheduler which launches dispatch packets into the kernel agent. The task of the packet processor is to wait for a doorbell to be signaled by a queue, indicating that a packet exists to be consumed. The packet processor then validates the packet to determine whether it can be sent to the kernel agent or not.

HSA runtime The HSA runtime is the API that the programmer uses to utilize the functionality of agents, queues, and kernel dispatching. This API is the communication link to the HSA driver and allows running kernels and managing software queues.

3.3 HSA Intermediate Language

HSAIL is a language that is similar to an assembly language [26]. The kernels that run on the accelerators are defined in HSAIL, unlike OpenCL where the kernels can be written in OpenCL C, which is a language similar to C [27]. Both the code that runs on the CPU and the parallel part that runs on the GPU are written in one source file. A high-level compiler detects the part of the code that is going to be parallelized and produces both the host CPU code and a BRIG file for the parallel part. BRIG is a binary format that is embedded into the HSA application and gets translated into the targeted instruction set for the device. The translation is done by the finalizer, which is a just-in-time compiler.

The purpose of HSAIL is to make it easy to port code to different target platforms. The number of architectures for CPUs is limited (x86, ARM, MIPS); however, when it comes to GPUs there are several different architectures. HSAIL exists to make it possible to execute the same kernel code on different hardware, without any need to rewrite the code to fit the other hardware. To make the support for different ISAs possible, the HSA foundation created the concepts of machine models and profiles. There are two machine models, the 32-bit and the 64-bit model, and two profiles, the base and the full profile.

This section covers how the compiling process works and how the kernel gets dispatched into a software queue. Later, the execution of a kernel is explained by describing what is happening during the time a kernel is running on a GPU accelerator.

3.3.1 Compiling and running HSAIL kernels

A kernel is written in HSAIL, which in turn gets compiled into a BRIG file by a high-level compiler. This BRIG file is embedded into the HSA application and is processed by the finalizer once the CPU program counter has reached that section of the code. This execution process of an HSAIL kernel, shown in Figure 5, can be divided into two steps: finalization and loading [26].

Finalization This step generates the ISA for the targeted platform. The finalization step is optional, since the HSA runtime also supports build-time and install-time compilation of HSAIL kernels. Before the finalization process, a BRIG file is needed. When using the HSA runtime API, the BRIG is loaded as a module into an object called program. When creating the program, the machine model, profile, and floating point mode for the targeted agent are specified. By doing so, the loaded kernel module receives the correct instruction set for the targeted device. The last part of this step is the finalization, where a code object gets extracted from the program object. To receive the code object, the programmer specifies the targeted agent's ISA. This code object, which holds all information regarding the kernel that can be executed on a kernel agent, is then inserted into an executable that can be loaded in the next step.
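A condensed sketch of this step with the HSA runtime finalization extension is shown below (a sketch only: error handling is omitted, option strings are left empty, and the exact arguments should be checked against the runtime specification [13]):

#include <string.h>
#include "hsa.h"
#include "hsa_ext_finalize.h"

/* Finalize one BRIG module for a kernel agent and return a frozen executable. */
hsa_executable_t finalize_brig(hsa_agent_t agent, hsa_ext_module_t brig_module)
{
    hsa_isa_t isa;
    hsa_agent_get_info(agent, HSA_AGENT_INFO_ISA, &isa);

    /* Program object: machine model, profile and float mode for the target. */
    hsa_ext_program_t program;
    hsa_ext_program_create(HSA_MACHINE_MODEL_LARGE, HSA_PROFILE_FULL,
                           HSA_DEFAULT_FLOAT_ROUNDING_MODE_DEFAULT, NULL,
                           &program);
    hsa_ext_program_add_module(program, brig_module);

    /* Finalization: extract a code object holding the agent-specific ISA. */
    hsa_ext_control_directives_t directives;
    memset(&directives, 0, sizeof directives);
    hsa_code_object_t code_object;
    hsa_ext_program_finalize(program, isa, 0, directives, NULL,
                             HSA_CODE_OBJECT_TYPE_PROGRAM, &code_object);

    /* Load the code object into an executable and freeze it for querying. */
    hsa_executable_t executable;
    hsa_executable_create(HSA_PROFILE_FULL, HSA_EXECUTABLE_STATE_UNFROZEN,
                          NULL, &executable);
    hsa_executable_load_code_object(executable, agent, code_object, NULL);
    hsa_executable_freeze(executable, NULL);
    return executable;
}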


Figure 5: Finalization and loading of an HSAIL kernel [12]. The processes are titled with the function name from the HSA runtime API.

Loading With the executable object, it is possible to extract some useful and needed information by first creating a symbol. This symbol represents the kernel function that the programmer wants to execute on the kernel agent. The symbol is obtained by indicating the name of that specific kernel function. With the symbol, the application can perform queries on the executable to extract information, such as the memory sizes for the different memory segments (kernel arguments, group, private) and the kernel object. All of this information is needed when dispatching a kernel dispatch packet into a queue.
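The corresponding queries on the executable might look like this (again a sketch; the symbol name "&my_kernel" is a placeholder for the kernel defined in the BRIG):

#include <stdint.h>
#include "hsa.h"

/* Extract the kernel object and segment sizes needed for a dispatch packet. */
void query_kernel(hsa_executable_t executable, hsa_agent_t agent,
                  uint64_t *kernel_object, uint32_t *kernarg_size,
                  uint32_t *group_size, uint32_t *private_size)
{
    hsa_executable_symbol_t symbol;
    hsa_executable_get_symbol(executable, NULL, "&my_kernel", agent, 0, &symbol);

    hsa_executable_symbol_get_info(symbol,
        HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_OBJECT, kernel_object);
    hsa_executable_symbol_get_info(symbol,
        HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_KERNARG_SEGMENT_SIZE, kernarg_size);
    hsa_executable_symbol_get_info(symbol,
        HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_GROUP_SEGMENT_SIZE, group_size);
    hsa_executable_symbol_get_info(symbol,
        HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_PRIVATE_SEGMENT_SIZE, private_size);
}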

After the loading step is finished, and all necessary information is gathered, a dispatch packet object can be created. The dispatch packet holds all information regarding the kernel: the memory regions, the work-group and grid size (Section 3.3.2), the kernel arguments, and a completion signal. This signal is created beforehand and inserted into the dispatch packet. The signal is used to indicate when the kernel is done with its execution. Before sending the dispatch packet into a queue, the programmer needs to indicate the type of the dispatch packet. If a kernel is going to be executed, the type is set to kernel dispatch packet.

When all needed information is set, the packet can be put into the queue. First, the current queue position is needed. If the queue happens to be full, then the application needs to wait. Once the packet is inserted into the queue, the application signals the doorbell. This doorbell tells the packet processor (the scheduler) that a packet exists in the queue for processing.
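Putting the enqueue step together, a single-producer sketch (no wrap-around or full-queue handling, completion signal created earlier with hsa_signal_create, one-dimensional grid) could look roughly like this:

#include <stdint.h>
#include <string.h>
#include "hsa.h"

/* Build one kernel dispatch packet, place it in the queue, ring the doorbell
 * and wait for the completion signal (sketch only). */
void enqueue_dispatch(hsa_queue_t *queue, uint64_t kernel_object,
                      void *kernarg_address, hsa_signal_t completion_signal,
                      uint32_t grid_size, uint16_t workgroup_size)
{
    /* Reserve the next write slot (single producer, queue assumed not full). */
    uint64_t index = hsa_queue_add_write_index_relaxed(queue, 1);
    hsa_kernel_dispatch_packet_t *packet =
        (hsa_kernel_dispatch_packet_t *)queue->base_address
        + (index & (queue->size - 1));

    memset(packet, 0, sizeof *packet);
    packet->setup = 1 << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS; /* 1D */
    packet->workgroup_size_x = workgroup_size;
    packet->workgroup_size_y = 1;
    packet->workgroup_size_z = 1;
    packet->grid_size_x = grid_size;
    packet->grid_size_y = 1;
    packet->grid_size_z = 1;
    packet->kernel_object = kernel_object;
    packet->kernarg_address = kernarg_address;
    packet->completion_signal = completion_signal;

    /* Publish the packet by writing a valid header last (in production code
     * this should be a release-ordered atomic store), then ring the doorbell. */
    packet->header = (HSA_PACKET_TYPE_KERNEL_DISPATCH << HSA_PACKET_HEADER_TYPE)
        | (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_ACQUIRE_FENCE_SCOPE)
        | (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_RELEASE_FENCE_SCOPE);
    hsa_signal_store_release(queue->doorbell_signal, index);

    /* Block until the kernel agent decrements the completion signal to zero. */
    hsa_signal_wait_acquire(completion_signal, HSA_SIGNAL_CONDITION_LT, 1,
                            UINT64_MAX, HSA_WAIT_STATE_BLOCKED);
}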


3.3.2 Execution of HSAIL kernels

Once a kernel is dispatched, a grid is created within the allocated memory region. This grid is divided into work-groups, which are in turn divided into work-items (see Figure 6). A work-item is a single thread of execution. The grid size and the work-group size are defined by assigning values to the x, y, and z axes. A wavefront is a collection of work-items that are scheduled together in a lock-step fashion. Running in a lock-step fashion means that all work-items in one wavefront run the same set of operations. The grid and the work-groups can also be defined by a high-level compiler if the compiler has support for HSAIL.

Figure 6: Grid of work-groups and work-items in an HSAIL kernel [12].

The dimensions in a grid can be specified in three ways:

• 1D is used when calculations are performed on a single dimension, for example on vectors.

• 2D can be used when running computations on an image, where the height and width of the image are specified as the dimensions.

• 3D can be used when 3D models need to be computed, where the depth of the model is also given.

Once a grid is being executed on a kernel agent, the work-groups are distributed among the different compute units. The performance of an HSA application scales with the number of compute units in the heterogeneous device. After all computations are completed on the accelerator, the dispatch completion signal is triggered, which tells the host CPU that the host code can continue.
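For example, dispatching over a 1920x1080 image with 16x16 work-groups would fill the dimension-related fields of the kernel dispatch packet (see the sketch in Section 3.3.1) roughly as follows; the grid is rounded up so that it covers the whole image:

/* 2D dispatch over a 1920x1080 image with 16x16 work-groups (sketch). */
packet->setup = 2 << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS;  /* 2 dimensions */
packet->workgroup_size_x = 16;
packet->workgroup_size_y = 16;
packet->workgroup_size_z = 1;
packet->grid_size_x = 1920;                        /* 1920 = 120 work-groups of 16 */
packet->grid_size_y = ((1080 + 15) / 16) * 16;     /* rounded up to 1088 (68 x 16) */
packet->grid_size_z = 1;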

3.4 Memory model

In HSA the memory model is important because it describes how the communication between HSA agents works. Programming applications that execute in parallel can be difficult when data needs to be synchronized. One common challenge is to avoid a data race. A data race occurs when two tasks access the same memory location at the same time and at least one of the accesses is a write. If a ”read” command takes place before the ”write” command it depends on, the update from the ”write” is missed. This section covers how the HSA memory model takes care of this issue, and how the memory model retains memory consistency and data synchronization between multiple HSA agents.

3.4.1 Memory structure

The memory structure of HSA is designed to be efficient when working with heterogeneous programming. There are three different memory types in HSA [12]: flat memory, registers, and image memory. Each one of the memory types is explained in this section.

Flat memory is organized in one memory space, which is divided into segments. Each segment is a block which has different visibility preferences and properties, such as access speed, ownership, and lifetime. There are seven segment groups [26] [12]:

1. Global This memory segment is visible to all work-items in a grid running on a heterogeneous device as well as other agents. Depending on the memory allocation, other heterogeneous devices can also access the memory.

2. Readonly Readonly memory is similar to global, the only difference is that the memory can only be read and not written to.

3. Group Memory that can only be shared between work-items in the same work-group belongs to the group segment. The lifetime of a memory allocation in the group segment lasts from the creation of the work-group until the work-group finishes its computations.

4. Private The private memory segment works similarly to the group segment; however, its lifetime lasts from when a work-item's execution starts until the work-item finishes its execution.

5. Kernarg The kernarg segment is the memory location in which the arguments for the kernel are stored.

6. Spill The spill segment is used for the HSAIL to store and load register values that can be useful for the finalizer. A high-level compiler can use this memory segment to optimize the code.

7. Arg When calling other functions from inside the kernel, the arguments are stored in the arg segment.

A flat memory address space means that addresses can be reused. For example, both the local memory space and the group memory space can begin at address 0 but point at different physical memory locations. This is useful because each individual address space can be reused.

Shared virtual memory is one of the main features of the memory structure of HSA. One memory address can be accessed by all HSA devices in a system. This is what enables an HSA application to be efficient, since no memory needs to be copied around. In a legacy system, when data is copied from the system memory to another physical memory and back again, the virtual memory address might not be the same as before. Therefore, previous pointers are no longer valid if the physical memory location has changed. HSA works around this issue by using a base address with an offset.

The second memory type is the registers. The registers are fixed-sized and are used by a high-level compiler. With HSAIL there are four different types of registers:

• C-registers (Control registers): 1-bit sized register used for receiving the output from a comparison operation.


• S-registers (32-bit registers): These registers can hold 32-bit signed integers, unsigned integers, and floating point numbers.

• D-registers (64-bit registers): These registers can hold 64-bit signed integers, unsigned integers, and floating point numbers.

• Q-registers (128-bit registers): The Q-registers contain packed data. A packed element can range from 8 bits to 64 bits in size.

The third memory type, image memory, is used when extra performance is needed for images and other graphics [12]. Image handling is not supported by the HSA runtime itself; however, it exists as an extension. The reason is that images do not fit well into the HSA memory model and are instead viewed as an independent part of the model.

3.4.2 Synchronization and scope

This section provides information on how an HSA application can avoid a race. A race occurs when two tasks operate on the same memory data at the same time and at least one of the operations is a ”write”; a race-free application avoids this situation. To achieve a race-free application, the memory model provides scopes and synchronization mechanisms. Whenever there is a race in an HSA application, the result is undefined [26].

Scopes in HSA define where atomic operations can function and which memory addresses can be viewed by others (agents, work-groups). In HSA, there are five scopes: work-item, wavefront, work-group, agent, and system. The hierarchy starts with the smallest scope: work-item. The size of the scope increases with each step. Wavefront is a larger scope than work-item, work-group is a larger scope than wavefront, and so on.

The synchronization between agents in HSA is taken care of by atomic operations. Atomic operations include store, load, and read-write-store operations. In HSA, an atomic can have release, acquire, or release-acquire semantics.

• Release: If a release semantic is used, the operations declared before it are visible to other work-items within the same scope.

• Acquire: This semantic is the opposite of the release semantic. An acquire semantic ensures that the operations after it are visible to other work-items within the same scope.

• Release-acquire: Uses both types of semantics, meaning that operations both before and after are visible to other work-items within the same scope.

When using an atomic semantic, a scope can be specified. It is important to assign the correct scope to the atomic operations so that an HSA application can avoid races. There are two ways to assign scopes to atomics that help to achieve a race-free application. The first one is called inclusive scope synchronization and the second one is called transitive scope synchronization [26].

Inclusive scope synchronization Work-items can be synchronized with each other if the atomic variables are declared under the same scope. However, two work-items can also be synchronized if they belong to two different scopes, as long as they exist in the same work-group. For example, if one work-item is defined in a work-group scope, and another work-item in an agent scope, then the work-item defined in the work-group scope must exist in the same agent scope. If this is true, then both work-items can be synchronized using atomic operations.

Transitive scope synchronization Work-items can also be synchronized even if they are not directly related to each other. If a work-item A is synchronized with work-item C and work-item C is synchronized with work-item B, then work-item A will automatically also be synchronized with work-item B. For example, say that work-items A and C are defined in a work-group scope, and work-item B in an agent scope. Assume that work-items A and C are placed inside the same work-group, and work-item B in another work-group. For work-item B to be synchronized in this case, work-item B can observe any change in work-item A (as long as it does not involve work-item B). Work-item C in this case acts like a bridge which links all three together.

To fully achieve a race-free HSA application, developers need to apply and follow the mechanisms described in this section. It is up to the developers to create a race-free HSA application, and the HSA specification [11] allows them to do so. The main rules to remember when programming race-free applications are to utilize scopes and atomics in a correct way, and to use release and acquire semantics on the smallest possible scope.

3.5 Queuing model

Queues in HSA are defined in user mode. A user mode queue is attached to an agent, which can be a kernel agent. Figure 7 shows how a task can be dispatched without having to pass through driver APIs and buffers. A user mode queue is a software queue, and one agent can have more than one of them. Each kernel agent (i.e. GPU) only has one hardware queue. If many software queues are attached to the kernel agent, the scheduling works in a first-come, first-served fashion. This section covers the mechanism of the HSA queuing model and how a dispatch packet is structured and sent into a software queue.

Figure 7: Hardware queue structure in an HSA system.

3.5.1 Architected Queuing Language

An AQL packet describes the commands for the kernels that are going to be launched. The packet also contains the memory regions that are going to be used. The HSA runtime does not come with functionality to manipulate AQL packets; this is left to the developers or libraries to implement. When creating an AQL packet, the programmer can choose between five different packet types [11]:

• Vendor-specific packet This packet format is for vendors that need a specific implementation for their hardware.

• Agent dispatch packet An agent dispatch packet contains tasks that are for other agents.

• Kernel dispatch packet A kernel dispatch packet contains tasks (kernels) that are for kernel agents.

• Barrier-AND packet The barrier-AND packet can delay later packets using the AND-operand. The maximum number of barrier packets that can delay another packet is set to five [26].

• Barrier-OR packet Similar to the barrier-AND packet, but uses the OR-operand instead of the AND-operand.

There is also a sixth packet type called the invalid packet. At the start of the creation of a dispatch packet, the developer sets the packet type to invalid before continuing. Once all preferences are set, the developer sets the actual type of the packet that is going to be inserted into the queue. While all types of packets include their own unique bits of information, they all share the same 16-bit header. This header includes the type of the packet, a barrier bit, an acquire fence, and a release fence. The barrier bit is set if there are any other packets that need to be finished first. The acquire fence determines the scope and the type of memory that should be initialized before the packet is executed. The release fence bits are similar, but the scope and memory are applied before packet completion, after the kernel has finished its execution.
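A sketch of assembling such a header with the HSA runtime header constants (the fence scopes here are chosen arbitrarily for the example):

#include <stdint.h>
#include "hsa.h"

/* Pack the 16-bit AQL header: packet type, barrier bit and the two fences. */
static uint16_t make_header(hsa_packet_type_t type, int barrier,
                            hsa_fence_scope_t acquire, hsa_fence_scope_t release)
{
    uint16_t header = 0;
    header |= (uint16_t)(type    << HSA_PACKET_HEADER_TYPE);
    header |= (uint16_t)(barrier << HSA_PACKET_HEADER_BARRIER);
    header |= (uint16_t)(acquire << HSA_PACKET_HEADER_ACQUIRE_FENCE_SCOPE);
    header |= (uint16_t)(release << HSA_PACKET_HEADER_RELEASE_FENCE_SCOPE);
    return header;
}

/* Example: a kernel dispatch that must wait for earlier packets (barrier bit
 * set) and uses system-scope acquire and release fences:
 *   make_header(HSA_PACKET_TYPE_KERNEL_DISPATCH, 1,
 *               HSA_FENCE_SCOPE_SYSTEM, HSA_FENCE_SCOPE_SYSTEM);            */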

3.5.2 Execution of an AQL packet

The processing of an AQL packet can be divided into three phases [26]. The first phase is when a packet is placed in the software queue. Once there are no prerequisites left for the packet to run, the packet moves to the launch state. If the launch is successful, the packet moves into the second phase: the active state. From the active state, a packet can reach either an error state or a complete state. The complete state is the third phase, which indicates that the packet has finished its execution successfully and a new packet can get ready to launch (if any). An error state indicates that something erroneous has happened during the execution.

In the beginning, the packet needs a space in a queue, which is obtained by incrementing the writeIndex. Incrementing the writeIndex indicates that the number of packets in the queue will be increased by one, while reading the readIndex value returns the number of packets that have already been consumed from the queue. After a space has been allocated in the queue, the packet is initialized by setting the packet type to invalid. Then the packet is copied into the queue and the correct packet type is set, before the queue's doorbell is rung to indicate that a new packet has arrived and is ready to be consumed by the packet processor.
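This submission sequence can be sketched with the HSA runtime as follows. The sketch assumes a single producer, a queue and completion signal created beforehand, a kernel object obtained from the finalizer (Section 3.6), and reuses the make_dispatch_header helper from Section 3.5.1; the function name dispatch_one and the chosen work sizes are hypothetical, and error handling is omitted.

    #include <stdint.h>
    #include <hsa.h>

    void dispatch_one(hsa_queue_t *queue, uint64_t kernel_object,
                      void *kernarg, hsa_signal_t completion)
    {
        /* 1. Allocate a slot by incrementing the writeIndex (returns the old value). */
        uint64_t index = hsa_queue_add_write_index_relaxed(queue, 1);

        /* 2. Locate the slot in the ring buffer and mark it as invalid while writing. */
        hsa_kernel_dispatch_packet_t *packet =
            (hsa_kernel_dispatch_packet_t *)queue->base_address + (index % queue->size);
        packet->header = HSA_PACKET_TYPE_INVALID;

        /* 3. Fill in the packet body (a one-dimensional grid as an example). */
        packet->setup = 1 << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS;
        packet->workgroup_size_x = 256;
        packet->workgroup_size_y = 1;
        packet->workgroup_size_z = 1;
        packet->grid_size_x = 4096;
        packet->grid_size_y = 1;
        packet->grid_size_z = 1;
        packet->private_segment_size = 0;
        packet->group_segment_size = 0;
        packet->kernel_object = kernel_object;
        packet->kernarg_address = kernarg;
        packet->completion_signal = completion;

        /* 4. Publish the real packet type last (atomic store), then ring the doorbell. */
        __atomic_store_n(&packet->header, make_dispatch_header(0), __ATOMIC_RELEASE);
        hsa_signal_store_relaxed(queue->doorbell_signal, (hsa_signal_value_t)index);
    }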

During execution, it is possible that two HSA applications (producers) submit a packet to the same queue at the same time. If the queue happens to have only one open space left to allocate a packet, there can be unintended consequences. To avoid this kind of situation, the HSA runtime provides a set of compare-and-swap functions. With these operations, a producer can read a writeIndex value and compare it to an expected value. If these two values do not match, the producer can assume that the space in the queue no longer applies. The main feature of the compare-and-swap operations is that they are atomic and can therefore be performed with one instruction.
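When several producers may write to the same queue, the slot allocation in step 1 of the previous sketch can instead be performed with the compare-and-swap operation provided by the runtime, for example as follows (reserve_slot is a name of our own):

    #include <stdint.h>
    #include <hsa.h>

    /* Reserve one packet slot even if other producers race for the same queue. */
    static uint64_t reserve_slot(hsa_queue_t *queue)
    {
        uint64_t expected, observed;
        do {
            expected = hsa_queue_load_write_index_relaxed(queue);
            /* A full queue (expected - readIndex == queue->size) would require backing off here. */
            observed = hsa_queue_cas_write_index_relaxed(queue, expected, expected + 1);
        } while (observed != expected);   /* another producer won the slot, retry */
        return expected;                  /* index of the slot this producer now owns */
    }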

Once a queue's doorbell is signaled, the packet processor wakes up (if the queue was previously empty) and begins to launch the next packet. If the queue contains more than one packet, the packet processor processes them on a first-come, first-served basis. The packet processor simply consists of two nested loops. Loop number one waits until the queue's doorbell is signaled, while loop number two runs the existing packets in the queue. The second loop also checks whether the packet type is valid. If the packet is set to invalid, that means the packet is still being written, and the packet processor waits until a valid packet type has been assigned to it.

3.5.3 Context switching

Legacy GPUs lack quality of service when more than one application generates tasks for the GPU. For that purpose, the HSA specification defines tasks that are able to be preempted. Context switching is a useful feature to support the mechanism of scheduling HSA tasks with different priorities. The HSA specification includes the definition of the need to support scheduling using context switching. However, the exact mechanism is not specified by HSA, since context switching is considered platform-specific [11].

During a context switch, the kernel agent needs to save all state to memory so that it can be restored once the context is switched back in. There are three different types of context switch priorities: switch, preempt, and terminate and context reset.

• Switch: With the priority switch, a context switch does not need to happen immediately. The ongoing work-items inside the kernel agent can finish their execution before a context switch.

• Preempt: A context switch needs to happen as quickly as possible. However, it needs to be guaranteed that the context switch happens within the required latency bound.

• Terminate and context reset: This type means that an HSA task needs to terminate immediately. The state of the context does not need to be saved.

3.6 Software stack

Figure 8 shows the flow of a program from its source code to its compiled form. At the top level, the high-level programming languages reside (Java, C++, Python). The parallel regions can be written in the way the corresponding language specifies. A front-end and a back-end compiler take care of converting the code into both HSAIL and CPU host code.

Figure 8: Software stack: A view from the high-level language down to the HSA runtime.

LLVM is a compiler framework (it includes both a front-end and a back-end compiler) that supports compiling HSA applications. The LLVM compiler generates the HSAIL/BRIG files for the parts that are going to be parallelized, and the object code for the host CPU. Next, as shown in Figure 9, the process in the HSA runtime begins with loading the BRIG and initializing the finalizer, which compiles the BRIG binary code into machine code. The machine code is stored in a kernel object.
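A minimal sketch of this finalization step is given below, assuming the HSA 1.0 finalizer extension API (hsa_ext_finalize.h) and a BRIG module that has already been loaded into memory; the function name finalize_kernel is our own and error handling is omitted.

    #include <string.h>
    #include <stdint.h>
    #include <hsa.h>
    #include <hsa_ext_finalize.h>

    uint64_t finalize_kernel(hsa_agent_t gpu_agent, hsa_ext_module_t brig_module,
                             const char *kernel_symbol)
    {
        hsa_isa_t isa;
        hsa_agent_get_info(gpu_agent, HSA_AGENT_INFO_ISA, &isa);

        /* Build an HSAIL program from the BRIG module. */
        hsa_ext_program_t program;
        hsa_ext_program_create(HSA_MACHINE_MODEL_LARGE, HSA_PROFILE_FULL,
                               HSA_DEFAULT_FLOAT_ROUNDING_MODE_DEFAULT, NULL, &program);
        hsa_ext_program_add_module(program, brig_module);

        /* Finalize: compile BRIG into an ISA-specific code object. */
        hsa_ext_control_directives_t directives;
        memset(&directives, 0, sizeof(directives));
        hsa_code_object_t code_object;
        hsa_ext_program_finalize(program, isa, 0, directives, NULL,
                                 HSA_CODE_OBJECT_TYPE_PROGRAM, &code_object);

        /* Load the code object into an executable and freeze it. */
        hsa_executable_t executable;
        hsa_executable_create(HSA_PROFILE_FULL, HSA_EXECUTABLE_STATE_UNFROZEN,
                              NULL, &executable);
        hsa_executable_load_code_object(executable, gpu_agent, code_object, NULL);
        hsa_executable_freeze(executable, NULL);

        /* The kernel object handle is what the dispatch packet later refers to. */
        hsa_executable_symbol_t symbol;
        hsa_executable_get_symbol(executable, NULL, kernel_symbol, gpu_agent, 0, &symbol);
        uint64_t kernel_object = 0;
        hsa_executable_symbol_get_info(symbol, HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_OBJECT,
                                       &kernel_object);
        return kernel_object;
    }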


Figure 9: Software stack: From HSA runtime down to the accelerator.

With the kernel object and the CPU object code, an HSA object code can be generated. When the HSA object code is executed, a kernel agent is received from a list of available agents. A queue is then created and associated with the received kernel agent. From here, a task can get dispatched to one of the created software queues before the packet processor assigns the next task to the hardware queue dedicated to the accelerator.
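Receiving a kernel agent from the list of available agents is typically done with an iteration callback, for example as in the following sketch (the names find_gpu and setup are our own; create_user_queue is the helper sketched in Section 3.5):

    #include <hsa.h>

    /* Callback for hsa_iterate_agents(): remember the first GPU kernel agent found. */
    static hsa_status_t find_gpu(hsa_agent_t agent, void *data)
    {
        hsa_device_type_t type;
        hsa_agent_get_info(agent, HSA_AGENT_INFO_DEVICE, &type);
        if (type == HSA_DEVICE_TYPE_GPU) {
            *(hsa_agent_t *)data = agent;
            return HSA_STATUS_INFO_BREAK;   /* stop the iteration */
        }
        return HSA_STATUS_SUCCESS;
    }

    /* Typical start-up: initialize the runtime, pick a kernel agent, create its queue. */
    hsa_queue_t *setup(hsa_agent_t *gpu_agent)
    {
        hsa_init();
        hsa_iterate_agents(find_gpu, gpu_agent);
        return create_user_queue(*gpu_agent);
    }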


4 Limitations

HSA is a new hardware architecture whose version 1.0 specification was released in 2015. There are features in the HSA specification that have not yet been fully implemented in the current version. Therefore, it is difficult to implement and test all features related to real-time. This section defines the limitations and differences between the HSA specification and the provided drivers.

Context switching As described in Section 3.5.3, context switching is supported by the specification. However, it is not supported in the drivers. This leads to the following limitations:

• The possibility to test task preemption.

• The ability to use different queues and optimize task dispatch.

Running tasks on the CPU The existing drivers provided for the current HSA version do not support executing parallel regions on the CPU. This has limited the testing focus to running parallel regions only on the GPU.

Sharing GPU resources The ability to share GPU resources among more than one application at the same time is non-existent in the current driver implementation. Only one application can use the GPU at a time; therefore, the next application has to wait until the GPU becomes unoccupied.


5 Method

This section introduces the main idea of this investigation. First, we present the research method in Section 5.1, and Section 5.2 explains the experiment method. Section 5.3 provides an introductory explanation of the system model. Section 5.4 describes both the hardware and the software that are used in the experiments.

5.1 Research method

In Figure 10, the research method used in this thesis is depicted [28]. The arrows in Figure 10 indicate the direction between the different investigation states, i.e., the transition direction. The research method consists of the following activities: literature study, problem formulation, experiment design, experiment conduction, and result analysis.

Figure 10: The research method.

• In the literature study, we study the related work and HSA itself, including its fundamental mechanisms, the current status of its development, and the limitations that apply to the investigation of this thesis work. The aim of the literature study is to identify a clear research problem.

• We state the problem formulation based on the limitations of HSA and the requirements of this thesis. The output of the problem formulation is the research question/aim of the investigation.

• The algorithm and the set of experiments are described in the experiment design (Section 6). In addition, the experiment method is described in Section 5.2.

• The experiment conduction consists of configuration (see Section 7.1), implementation (see Section 7.2) and execution of the experiments (see Section 7.3).

• The results analysis is considered an evaluation of the thesis and is described in Section 8. The aim of the results analysis is to evaluate the implemented system. The result/output of this activity may bring new theories to explain the results of the experiments. In addition, the output may extend to the development of a new system model or to future work [28].

5.2 Experiment method

The experiment technique used in this thesis relies on software probes, which are easy to use and implement, to measure the execution time of the tasks. The measurements are based on an evaluation performed with experiments on an actual hardware platform, i.e., actual experiments are performed and the evaluation is based on the measured results. Theoretical analysis would be useful since no implementation is needed, which would also be helpful in our case where the driver implementations are not fully complete (see Section 4). However, the intention of this thesis is to observe and locate the possible sources of interference for tasks executed on HSA compliant hardware. Such hardware is both available and used throughout the thesis project.

Figure 11 shows where the measurements start and stop in the experiment software. There are two distinct ways of measuring the execution time of a task:


Figure 11: A UML sequence diagram showing when the measurements are performed in the experiments.

1. The first approach (full execution) measures the region that is executed on both the CPU and the GPU device. This region includes the memory allocations, the dispatch process, the queue management, and the kernel execution. The kernel execution is the measured time during which a task is executed on the GPU device. All other operations occur on the CPU.

2. The second approach (kernel execution) only includes the time the task is executed on the GPU device, i.e., everything related to the CPU is excluded.

By subtracting the second approach from the first, we can extract how much of the execution time is dedicated to memory allocations, dispatch, and queue management. Based on the comparison between the two approaches, we can evaluate how much of the execution time is spent on the CPU and how much is spent on the GPU.
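The placement of the software probes for the two approaches can be illustrated with the following C sketch, which uses clock_gettime() on the host and waits on the dispatch packet's completion signal. It only indicates where the probes are placed and is not the exact instrumentation code used in the experiments.

    #include <stdint.h>
    #include <time.h>
    #include <hsa.h>

    static double elapsed_ms(struct timespec a, struct timespec b)
    {
        return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
    }

    /* One measured iteration; 'completion' is the dispatch packet's completion signal. */
    static void measure_once(hsa_signal_t completion)
    {
        struct timespec t0, t1, t2, t3;

        clock_gettime(CLOCK_MONOTONIC, &t0);        /* full execution: start   */
        /* ... memory allocation, queue management, packet set-up (CPU side) ... */
        /* ... dispatch the packet and ring the doorbell (Section 3.5.2) ...     */
        clock_gettime(CLOCK_MONOTONIC, &t1);        /* kernel execution: start */

        hsa_signal_wait_acquire(completion, HSA_SIGNAL_CONDITION_LT, 1,
                                UINT64_MAX, HSA_WAIT_STATE_ACTIVE);
        clock_gettime(CLOCK_MONOTONIC, &t2);        /* kernel execution: stop  */

        /* ... result read-back and clean-up (CPU side) ... */
        clock_gettime(CLOCK_MONOTONIC, &t3);        /* full execution: stop    */

        double full     = elapsed_ms(t0, t3);       /* first approach          */
        double kernel   = elapsed_ms(t1, t2);       /* second approach         */
        double cpu_part = full - kernel;            /* allocation, dispatch, queue management */
        (void)cpu_part;
    }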

As this is the first study of the timing behaviour of HSA, we aim to investigate a basic characteristic of HSA. In order to keep the investigation as simple as possible, we choose a single algorithm for all experiments. The chosen algorithm is from the Basic Linear Algebra Subprograms [29] and is explained in Section 6.1.

5.3 System model

In this section, we discuss the system model of HSA in detail. We explain the definition of the Fork-Join task model, which is often used to represent parallel real-time systems [23]. In addition, we describe an adaptation of the Fork-Join task model to HSA. We consider the Fork-Join task model in HSA with h CUs, which consist of c CPU CU(s) and g GPU CU(s):

h = c + g

5.3.1 Definition

The Fork-Join task model comprises sequential and parallel regions. The task starts with a single sequential master region and executes until the fork structure. The fork structure splits the task into multiple threads in a parallel region, which start to execute on the GPU. Once the parallel region finishes its execution, the join structure synchronizes the threads and resumes the master region [23]. A Fork-Join structure can be executed multiple times.

Figure 12: Fork-Join task model.

The task set based on the Fork-Join structure is defined as follows (shown in Figure 12):

Γ = {τ_1, ..., τ_n | τ_i : ((R_{i,1}, R_{i,2}, R_{i,3}, ..., R_{i,r_i−1}, R_{i,r_i}), D_i, T_i), i = 1, ..., n}

Where,

• τ_i is the task.

• r_i is the total number of regions of task τ_i.

• R_{i,j} is the j-th region of task τ_i; the regions consist of sequential and parallel regions.

• D_i and T_i are, respectively, the deadline and the period of task τ_i.

R_{i,j} = { (S_{i,2k+1}, c_{2k+1})   if j = 2k + 1,  k = 0, ..., (r_i − 1)/2
          { (P_{i,2k}, g_{2k})       if j = 2k,      k = 1, ..., (r_i − 1)/2

Where,

• S_{i,2k+1} is the worst case execution time (WCET) in the (2k + 1)-th sequential region of task τ_i. We assume that a thread in the sequential region executes on only one CPU CU; therefore, c is equal to 1.

• c_{2k+1} is the number of CPU CUs used by R_{i,2k+1} of the task, and c_{2k+1} = 1.

• P_{i,2k} is the WCET in the (2k)-th parallel region of task τ_i and is defined as the maximum WCET of the parallel threads that run in region R_{i,2k}:

P_{i,2k} = max_{1 ≤ j ≤ m_{2k}} WCET(P_{i,2k}^{j})


5.3.2 Fork-Join task model

Based on the mechanism of the HSA system, all tasks first execute on the host CPU, after which the parallel threads execute on the GPU(s). The real-time behaviour of the sequential part is not within the scope of the investigation of this thesis work, because the sequential part of the task uses the host CPU, which is a non-HSA agent. In other words, the real-time properties of HSA depend on the real-time properties of the GPU part. Accordingly, we focus on the details of the parallel part depicted in Figure 13.

Figure 13: Parallel part of Fork-Join task model.

If we have a closer look at the parallel region of Figure 12, the threads P_{i,2k} on g_{2k} GPU CU(s) can be defined as below:

P_{i,2k} = {P_{i,2k}^{1}, ..., P_{i,2k}^{s}, ..., P_{i,2k}^{m_{2k}}};  k = 1, ..., (r_i − 1)/2

Where,

• P_{i,2k}^{s} is the s-th thread in the (2k)-th parallel region of task τ_i.

• s is the order of the thread in the parallel region.

• m_{2k} is the number of threads in the (2k)-th parallel region.

As can be seen in the above model, the aim of the thesis work is to understand the real-time behavior of (m_{2k}, g_{2k}) pairs, where m_{2k} threads are assigned to g_{2k} GPU CU(s). We will consider various combinations of (m_{2k}, g_{2k}) pairs in the test case section.
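As a purely illustrative, hypothetical instance of the model (the numbers below are chosen for illustration and are not taken from the experiments), consider a task τ_1 with r_1 = 3 regions, running on a platform with six GPU CUs:

    τ_1 = ((R_{1,1}, R_{1,2}, R_{1,3}), D_1, T_1),  with D_1 = T_1 = 100 ms
    R_{1,1} = (S_{1,1}, 1)                first sequential region on one CPU CU
    R_{1,2} = (P_{1,2}, g_2), g_2 = 6     parallel region with m_2 = 384 threads on the GPU CUs
    R_{1,3} = (S_{1,3}, 1)                second sequential region on one CPU CU
    P_{1,2} = max_{1 ≤ j ≤ 384} WCET(P_{1,2}^{j})

The pair of interest for such a task is then (m_2, g_2) = (384, 6).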

5.4 Hardware and Software

This section introduces the hardware and the software used during this thesis. Section 5.4.1 covers the hardware, while Section 5.4.2 continues with the necessary software.

5.4.1 Hardware

For this project, two systems with an HSA compatible APU are used. The APU is an AMD Carrizo that has both a CPU and a GPU. In one of the systems, the APU has four CPU cores and six GPU compute units (4 + 6), and the second system has four CPU cores and eight GPU compute units (4 + 8).


Figure 14: The architectural design of an AMD Carrizo SoC.9

The AMD Carrizo is the first chip to support version 1.0 of the HSA specification to its fullest. As seen in Figure 14, the full version of the AMD Carrizo has two memory controllers.

In one of the acquired systems (System 1), only one memory controller exists. However, in the second system (System 2), both memory controllers are available. Further information regarding both acquired systems is listed in Table 1.

Table 1: A list displaying the differences between the two systems.

                        System 1    System 2
CPU Cores               4           4
CPU Clock Speed         1.8 GHz     2.1 GHz
GPU Compute Units       6           8
Shaders                 384         512
GPU Clock Speed         800 MHz     800 MHz
Memory Type             DDR3        DDR4
Memory Controllers      1           2
Memory Capacity         8 GB        8 GB

System 2 is more powerful, with two more GPU compute units and a higher CPU clock speed. Another notable difference is that System 2 has two memory controllers and a different memory type, which is both faster and consumes less power [30]. This divergence in characteristics between the systems is useful when executing experiments that manage large amounts of memory. The GPU computational power of both systems is shown in Table 2.

Results for single-precision and double-precision, as well as the transfer bandwidth, are obtained from a benchmark program running on both systems. The theoretical computational power of both systems is calculated by following Equation 1. The product of the equation is the number of floating-point operations per second (FLOPS) the GPU can perform. In both systems, each compute unit has 64 Arithmetic Logic Units (ALUs), where each can perform two floating-point operations per clock cycle.
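As a worked example, and assuming that Equation 1 is simply the product of the number of compute units, the 64 ALUs per compute unit, the two floating-point operations per ALU and clock cycle, and the GPU clock speed, the theoretical single-precision peak performance of the two systems becomes:

    System 1: 6 CU × 64 ALU × 2 FLOP/cycle × 0.8 GHz ≈ 614.4 GFLOPS
    System 2: 8 CU × 64 ALU × 2 FLOP/cycle × 0.8 GHz ≈ 819.2 GFLOPS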

