
Institutionen för systemteknik
Department of Electrical Engineering

Master's Thesis (Examensarbete)

Parallel gaming related algorithms for an embedded media processor

Thesis carried out in Information Coding (Informationskodning) at Linköping Institute of Technology (Tekniska Högskolan i Linköping)

by
John Tolunay

LiTH-ISY-EX--12/4641--SE

Linköping, 2012

Department of Electrical Engineering
Linköpings tekniska högskola
Linköping University
Institutionen för systemteknik


Parallel gaming related algorithms for an embedded media processor

Thesis carried out in Information Coding (Informationskodning) at Linköping Institute of Technology (Tekniska Högskolan i Linköping)

by
John Tolunay

LiTH-ISY-EX--12/4641--SE

Supervisor: Jens Ogniewski, ISY, Linköping University
Examiner: Ingemar Ragnemalm, ISY, Linköping University


Presentation date: 2012-12-03
Publishing date (electronic version): 2012-12-10
Department and division: ISY, Department of Electrical Engineering, Linköping University, S-581 83 Linköping, Sweden
Language: English
Number of pages: 102
Type of publication: Master Thesis
ISRN: LiTH-ISY-EX--12/4641--SE
URL for electronic version:
http://www.ep.liu.se
http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-86154

Publication title: Parallel gaming related algorithms for an embedded media processor
Author: John Tolunay

Abstract

A new type of computing architecture called ePUMA is under development by the ePUMA Research Team at the Department of Electrical Engineering at Linköping University in Linköping. It contains several single instruction multiple data (SIMD) cores, called SIMD Units, where up to 64 computations can be done in parallel. The goal of the architecture is to create a low-power chip with good performance for embedded applications.

One possible application is video games. In this work we have studied a selected set of video game related algorithms, including a Pseudo-Random Number Generator, Clipping and Rasterization & Fragment Processing, analyzing how well they fit the ePUMA platform.

Keywords

ePUMA, parallelism, clipping, rasterization, random number generator, division, fixed point arithmetic, power consumption



Copyright

The publishers will keep this document online on the Internet – or its possible replacement – from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

A new type of computing architecture called ePUMA is under development by the ePUMA Research Team at the Department of Electrical Engineering at Linköping University in Linköping. It contains several single instruction multiple data (SIMD) cores, called SIMD Units, where up to 64 computations can be done in parallel. The goal of the architecture is to create a low-power chip with good performance for embedded applications. One possible application is video games. In this work we have studied a selected set of video game related algorithms, including a Pseudo-Random Number Generator, Clipping and Rasterization & Fragment Processing, analyzing how well they fit the ePUMA platform.


Acknowledgements

I would like to thank my examiner Ingemar Ragnemalm for proposing this Master Thesis and for all of the support during this work. I also would like to thank Jens Ogniewski and Andréas Karlsson for all of their supervision, recommendations and advice. Last but not least I would like to thank my family and friends, who have given me the encouragement, faith, strength, love and wisdom to never surrender but to keep on fighting with this Master Thesis, and with all the hard work of the past 8 years of studies at Linköping University, to the end. Without them I would never have achieved this.

Thank you very much.
John Tolunay, 2012


Contents

1 Introduction
2 Background
  2.1 The ePUMA architecture
  2.2 Master DSP Processor - 'Senior'
  2.3 SIMD co-processor - 'Sleipnir'
    2.3.1 Data Types
    2.3.2 Memory Hierarchy
    2.3.3 Vector Register Files
    2.3.4 Number Space
    2.3.5 Other properties of the SIMD Unit Co-Processor
  2.4 Programming on ePUMA
    2.4.1 SIMD Unit Programming
    2.4.2 Several SIMD Units Parallel Programming
3 Theory
  3.1 Random Number Generator
  3.2 Pseudo-Random Number Generator
    3.2.1 Linear Feedback Shift Register
  3.3 Graphics Pipeline
    3.3.1 Model Data and the Vertex Processor Stage
    3.3.2 The Clipping Stage
    3.3.3 The Rasterizer Stage
    3.3.4 The Fragment Processor Stage
    3.3.5 The Frame Buffer Stage
    3.3.6 Direct3D and OpenGL Graphics Pipeline
    3.3.7 Selected stages for the ePUMA
  3.4 Clipping
    3.4.1 Cohen-Sutherland Line Clipping
    3.4.2 3D Extension of Cohen-Sutherland
    3.4.3 Sutherland-Hodgman Polygon Clipping
    3.4.4 3D Extension of Sutherland-Hodgman
  3.5 Rasterization & Fragment Processing
    3.5.1 Polygon Fill Technique
    3.5.2 Texture Mapping
    3.5.3 Shading Techniques - Gouraud & Phong
    3.5.4 Blending
  3.6 Summary
4 Parallelism on ePUMA
  4.1 Parallelization of the Linear Feedback Shift Register
  4.2 Parallelization of the Clipping Process
  4.3 Parallelization of the Rasterization & Fragment Process
5 Implementation
  5.1 Pseudo-Random Number Generator
  5.2 Clipping
    5.2.2 Summary of the Clipping part on ePUMA
  5.3 Rasterization & Fragment Processing
    5.3.1 Regular implementation on C++
    5.3.2 Special Case - Fixed-point Number Arithmetic
6 Results
  6.1 Result of the Pseudo-Random Number Generator
  6.2 Result of the Clipping
    6.2.1 Cohen-Sutherland
    6.2.2 Sutherland-Hodgman
  6.3 Result of the Rasterization & Fragment Processing
    6.3.1 Different polygon shapes in Floating Points
    6.3.2 Floating Points versus Fixed Points
7 Conclusions
8 Future Improvements
  8.1 Division
  8.2 Number Range Extension
  8.3 Simpler Tools for Parallelism


List of Figures

1 An overview of the ePUMA hardware architecture. Figure taken from [1].
2 A general overview of the ePUMA block diagram. Figure taken from [6].
3 An overview of the Sleipnir SIMD Unit. Modified figure taken from [1].
4 The data vector format supported by a SIMD unit. Figure taken from [7].
5 Swap design. Figure taken from [1].
6 Rotate design. Figure taken from [1].
7 Simulation result when executing code on a SIMD Unit. Figure taken from [8].
8 Different types of games where a random number generator is applied. Figures taken from [19], [20], [21] and [22] respectively.
9 Terrain Height Map created by the Perlin Noise procedural function. Figures taken from [23] and [24] respectively.
10 A 4-bit Linear Feedback Shift Register.
11 Different representations of the Exclusive OR operator (XOR).
12 A 16-bit Fibonacci LFSR.
13 A general overview of a Graphics Pipeline. Figure taken from [30].
14 Different coordinate spaces a 3D object passes through before the Clipping process. Figures taken from [31].
15 Proceeding from the NDC space to the Screen space during the rasterization process. Figure taken from [31].
16 Comparison of the Direct3D and the OpenGL Graphics Pipeline. Figures taken from [32] and [33] respectively.
17 The outcode template.
18 The line clipping.
19 The polygon clipping.
20 The four cases on how a polygon can be clipped. Figure taken from [41].
21 A triangle polygon which is being rasterized. Figures taken from [45].
22 Filling an equilateral triangle.
23 Filling a quadrilateral.
24 Filling a concave polygon.
25 Mapping an image texture onto a 3D object. Figure taken from [47].
26 A 2D Checkerboard painted in red and white.
27 Difference between the interpolations. Figures taken from [49].
28 Gouraud intensities at each vertex of a polygon.
29 Phong intensities at each normal of a polygon.
30 Sketches of performing the video game related algorithms in parallel on ePUMA.
31 Sketches on a Quadrilateral and an Icosahedron.
32 Finding v and w to calculate the signed area of the triangle.
33 The result of the Texture Mapping implementation (floating point numbers).
34 The result of the Texture Mapping with Gouraud Shading implementation (floating point numbers).
35 The result of the Texture Mapping with Phong Shading implementation (floating point numbers).
36 Comparison of a texture mapped quadrilateral with perspective when rendered with floating point numbers and fixed point numbers.
37 The result of the implementation of an Icosahedron rotating around different axes with Texture Mapping (floating point numbers).
38 The result of the implementation of an Icosahedron rotating around different axes with Texture Mapping (fixed point numbers at Q16.16).
39 Fixed point numbers at Q8.8.
40 Fixed point numbers at Q4.4.


1 Introduction

Smartphones are today the most popular computing devices used by home consumers worldwide. Ever since the introduction of the Apple iPhone back in 2007, which revolutionized the whole smartphone market with its multi-touch screen and its App Store service, the market has been filled with devices that have a corresponding multi-touch screen and software services where apps can be downloaded, either for free or for a fee. Besides the iPhone, there are now several types of Android smartphones manufactured by HTC, Samsung, Motorola and Sony, as well as Windows Phone smartphones like the upcoming Nokia Lumia 920, that are gaining market share.

As these smartphones get thinner and lighter while at the same time becoming more advanced, with more built-in functions and faster processors where both the number of cores and the clock frequencies are increased, their power consumption increases as well in order to handle mobile applications that are becoming more visually and technically advanced. This gives the smartphones a very short battery life, requires them to be charged very often and therefore decreases their mobility, which has become a large issue for the companies manufacturing these smartphones.

One way to solve this issue is to design a processor architecture that has several small processor cores that make computations in parallel. These cores have much lower clock frequencies than a single-core processor with a very high clock frequency, but by taking advantage of parallelism they can outperform the single-core processor in terms of computation time and at the same time reduce the power consumption significantly thanks to their lower clock frequencies.

This is an ideal situation for the smartphone companies, and this is where the ePUMA architecture comes in. The architecture is developed by the ePUMA Research Team at the Department of Electrical Engineering at Linköping University in Linköping, and it is designed to make computations in parallel on its 8 SIMD Units. This has led to an interest in implementing video game related algorithms on this type of architecture, to see if its capabilities are such that it could become part of upcoming smartphone processors in the future.

This report starts with a thorough description of the ePUMA architecture, followed by a theoretical description of the video game related algorithms that were of interest to implement and how they could be parallelized. After that come the implementation of these algorithms, the results from the implementations, the conclusions of this report, and the improvements that could be made in the future regarding the ePUMA architecture.

2 Background

2.1 The ePUMA architecture

Figure 1: An overview of the ePUMA hardware architecture. Figure taken from [1].

The ePUMA¹ architecture is designed for high performance parallel computing at a low power consumption and a low silicon cost for telecommunications and Digital Signal Processing (DSP) applications. The architecture was created by the Department of Electrical Engineering at Linköping University in Linköping, Sweden, and is still under research development.

The architecture follows the master-multi-SIMD DSP processor design, meaning that it has a single master CPU controller, with its main memory and Direct Memory Access (DMA) controller, to which the 8 SIMD² co-processors are connected, as seen in Figure 1.

¹ embedded Parallel DSP platform with Unique Memory Access
² Single Instruction Multiple Data

The main memory of the master CPU controller can be accessed by all the SIMD units through a star-shaped network, depicted by red lines in Figure 1. There is also a ring network inside the architecture that allows fast communication between the SIMD units when they are performing parallel computations. This ring network is marked in blue in Figure 1.

This type of architecture is similar to the CELL architecture, developed by Sony, Toshiba and IBM (STI), which is used as the CPU of the Sony PlayStation 3 video game console, although the CELL architecture has a much higher power consumption in certain types of applications than the ePUMA architecture. The ePUMA architecture can be used not only for telecommunications but also for High Definition TV, and according to Ragnemalm et al. [1] there is a theoretical possibility that graphics can be rendered in real-time on the architecture at a low power consumption. This is something that is highly desired for hand-held devices like the Apple iPhone and iPad and Android- and Windows-based cell phones and tablets, since their main issue is that the battery consumption is very high when they are used for running apps or surfing the Internet. As processors become faster and more powerful through increasing clock frequencies, the batteries must have a large and effective capacity so that the hand-held devices can be used for a fair amount of time in intensive applications. However, the capacity of the batteries is constrained by cost, since the batteries are based on lithium-ion chemistry, which is very expensive to manufacture. This leads to higher prices on smartphones the more battery capacity they have.

In order to decrease the power consumption while maintaining the same clock frequency, the so-called manufacturing process can be shrunk. This means shrinking the die area of a CPU, which can then become more powerful and faster through an increased clock frequency while maintaining the same power consumption as the previous CPU with a larger die area but a lower clock frequency. Alternatively, the die area can be shrunk so that the clock frequency stays the same as in the previous CPU but the power consumption is decreased instead. As of 2012, the manufacturing process of a mobile CPU for the latest hand-held devices is between 32 and 45 nanometers, and the fewer nanometers the CPU is manufactured at, the more efficient and at the same time powerful it becomes.

But since the ePUMA architecture has the capability of executing the current intensive applications of hand-held devices, which have clock frequencies between 800 MHz and 1.5 GHz, at a considerably lower clock frequency of 80 MHz by using parallel computing, this thesis will find out how well the architecture performs when computing video game related algorithms.

An in-depth description of the ePUMA architecture is found in Sections 2.2 and 2.3.

2.2 Master DSP Processor - 'Senior'

The Master DSP Processor in the ePUMA architecture, called 'Senior', is a RISC (Reduced Instruction Set Computer) processor that executes the sequential tasks in an application algorithm and runs code located in the main memory. It is used for tasks that require small computations and cannot be parallelized efficiently on the SIMD unit co-processors, which are described thoroughly in Section 2.3.

Figure 2: A general overview of the ePUMA block diagram. Figure taken from [6].

The Master also coordinates all the 8 SIMD unit co-processors; every SIMD unit is connected to the Master processor through 8 processing lanes, creating a star-shaped network. These processing lanes can provide 8 16-bit data paths or 4 32-bit data paths for every SIMD unit. The code is initially located in the main memory of the Master processor. When the code is going to be executed by some or all of the SIMD unit co-processors for parallelized tasks, it must be transferred to each affected SIMD unit co-processor's local program memory (abbreviated PM), as seen in Figure 2.

When transferring code from the main memory of the Master to the program memory of the SIMD units, Direct Memory Access (DMA) is used to prevent the Master from being unnecessarily loaded with computations. This code transfer is referred to as a DMA transaction, and the input data is also transferred, into the local vector memory of the SIMD units. When the input data and the code are in place in the SIMD units, the execution of the code on its input data can start. After a SIMD unit has completed the execution of the code, it sends an interrupt signal to the Master, and the Master in turn issues another DMA task to store the result from the local vector memory of that SIMD unit to the main memory of the Master processor. The Master can then load new tasks into the SIMD unit, or supply it with new input data and let that SIMD unit run the same task again.

All of the SIMD units are connected to each other in a ring network. This allows data to be transferred from one SIMD unit to another, which enables powerful pipelining of computation chains. For instance, when one SIMD unit has finished some computations and transferred the result to its neighbouring SIMD unit, the first unit can load new input data and perform computations while its neighbour at the same time performs computations on the transferred data. In this way data can be streamed efficiently through the running SIMD units. The Master processor monitors and controls the whole network by reading and writing the node registers of the network.

2.3 SIMD co-processor - 'Sleipnir'

Figure 3: An overview of the Sleipnir SIMD Unit. Modified figure taken from [1]. A SIMD unit - also called ’Slepnir’ - is together with several other SIMD units used for performing parallel tasks computing sent from the Master Processor ’Senior’ though the con-necting processing lanes. The upcoming subsection will further describe its properties like the data types it supports, the structure of the memory hierarchy and the number space it currently can represent numbers in.

2.3.1 Data Types

Figure 4: The data vector format supported by a SIMD unit. Figure taken from [7].

Every SIMD unit in the ePUMA architecture supports fixed-point data types, which consist of (according to the ePUMA Simulator Manual [8]):

• Byte - An 8-bit signed or unsigned value.
• Word - A 16-bit signed or unsigned value.
• Double word - A 32-bit signed or unsigned value.
• Complex word - A 32-bit value with a 16-bit real part and a 16-bit imaginary part. Signed or unsigned.
• Complex double word - A 64-bit value with a 32-bit real part and a 32-bit imaginary part. Signed or unsigned.

Depending on which data type is used, the elements can form a whole vector that is used to process several data items at once in the Local Vector Memories, which are described in the next section. Figure 4 shows the different vector formats of the data types supported by the SIMD unit.


2.3.2 Memory Hierarchy

The SIMD unit has several types of hardware memories built into the unit: a Program Memory (PM), a Constant Memory (CM) and three Local Vector Memories (LVMs).

The PM is the space where the Sleipnir executable is located, and it can contain 1024 instructions.

The CM is where the constant values used by the SIMD unit during code execution are held. These values cannot be overwritten or deleted during the execution time of the SIMD unit, but only when tasks are stored in the SIMD unit before execution. This memory can contain 128 vectors, which corresponds to 128 words (16b).

The LVM is the main place where the input data that comes from the Master processor is processed, and where the output data that will be sent back to the main memory of the Master processor is produced. Each of the three LVMs holds 80 kB and supports 8 memory accesses in parallel, where each access yields a 16-bit data word. Usually the LVMs are organized so that one of them handles the DMA communication, writing results to main memory and reading new data [2], while the other two handle the persistent data and the input and output data respectively. The two LVMs handling the DMA communication and the input and output data can then switch roles between every iteration of the computation of an algorithm.

Figure 5: Swap design. Figure taken from [1].

But the LVMs can also be organized in another way, where every LVM switches roles for every iteration: one handles the input data, the second handles the output data and the third handles the DMA communication, which corresponds to a cyclic scheme instead. This structure is preferable for some algorithms compared to the previously described structure, for example when the output data of the current iteration becomes the input data of the next iteration [2]. A sketch of this rotation is shown below.
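To make the cyclic scheme concrete, here is a minimal C++ sketch (the names and structure are illustrative, not part of the ePUMA tools) of how the three LVM roles can rotate each iteration so that the output buffer of one iteration becomes the input buffer of the next:

// Hypothetical sketch of the cyclic LVM scheme: three buffers rotate
// between the roles input, output and DMA every iteration.
enum Role { INPUT = 0, OUTPUT = 1, DMA = 2 };

struct LvmRoles {
    int lvm[3];  // lvm[INPUT] holds the index (0..2) of the LVM with that role

    // Rotate roles: the buffer just written (OUTPUT) becomes the next INPUT,
    // the buffer that finished its DMA transfer becomes the next OUTPUT,
    // and the consumed INPUT buffer is handed over to the DMA engine.
    void rotate() {
        const int consumed = lvm[INPUT];
        lvm[INPUT]  = lvm[OUTPUT];
        lvm[OUTPUT] = lvm[DMA];
        lvm[DMA]    = consumed;
    }
};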

This type of memory hierarchy is, according to Wang et al., compatible with the Open Computing Language (OpenCL) specification, which is a programming framework for heterogeneous parallel platforms like today's nVidia GeForce and AMD Radeon graphics cards [3].

2.3.3 Vector Register Files

The Vector Register File (VRF) of the ePUMA architecture can also be seen as an LVM, but it has a much lower access latency when its memory space is read or written; for instance, it can perform two reads and one write per clock cycle. The VRF is, however, of smaller size than an LVM since it is much more expensive in hardware. This register file can store up to 8 × 128-bit vectors, or 8 × 8 × 16-bit scalars [1] [4].


Figure 6: Rotate design. Figure taken from [1].

2.3.4 Number Space

The ePUMA architecture has for now a number space that ranges from 0 to 65535 and uses saturation arithmetic. This means that if an operation on two numbers results in a value that exceeds 65535, the result becomes 'saturated', i.e. clamped so that it stays in the range [0, 65535]. The example below demonstrates how the architecture handles an addition of two numbers whose result is larger than 65535:

z ← x + y
z > 65535 ⇒ z ← 65535
⇒ z ∈ [0, 65535]
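For clarity, a small C++ sketch of a 16-bit saturating addition behaving as described above (the helper name is illustrative, not an ePUMA intrinsic):

#include <cstdint>

// Saturating 16-bit addition: results above 65535 are clamped to 65535
// instead of wrapping around.
uint16_t sat_add_u16(uint16_t x, uint16_t y) {
    uint32_t z = static_cast<uint32_t>(x) + y;   // widen so the sum cannot wrap
    return z > 65535u ? 65535u : static_cast<uint16_t>(z);
}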

This is the only number space in the architecture, and it therefore has no unit that supports floating point numbers. When it comes to using decimal numbers, the idea is instead to represent all numbers in a fixed point number format. While the precision is not on the same level, nor as easy to handle, as with floating point numbers, the fixed point representation is not as computationally demanding as the floating point representation, which requires an additional floating point unit (FPU). Most well-known hardware that renders 3D computer graphics, like the Sony PlayStation, Sega Saturn and Nintendo DS, and even today's embedded systems, do in fact use fixed point number representation due to the lack of an FPU [11] [12] [13] [14].
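As a concrete illustration of the fixed point idea, here is a minimal C++ sketch (not taken from the thesis code; Q16.16 is one of the formats used later, in Section 5.3.2), where a value is stored as a 32-bit integer scaled by 2^16:

#include <cstdint>

// Q16.16 fixed point: a 32-bit integer holding the real value scaled by 2^16.
typedef int32_t q16_16;

q16_16 to_q16(double v)    { return static_cast<q16_16>(v * 65536.0); }
double from_q16(q16_16 v)  { return v / 65536.0; }

// Addition works directly on the scaled integers; multiplication needs a
// 64-bit intermediate and a shift back by 16 to restore the scale.
q16_16 q16_add(q16_16 a, q16_16 b) { return a + b; }
q16_16 q16_mul(q16_16 a, q16_16 b) {
    return static_cast<q16_16>((static_cast<int64_t>(a) * b) >> 16);
}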

2.3.5 Other properties of the SIMD Unit Co-Processor

There are other properties and functions of the SIMD Unit that have not been used extensively as a part of this thesis; they are listed below:

• Permutation Tables.

• Special built-in functions like MAC⁴, DCT⁵, FFT⁶, Taylor and Butterflies.

• Vector Accumulator Register (VAC), Vector Flag Register (VFLAG) and Special Purpose Register (SPR).

A full description of the listed items, together with a detailed description of the pipeline of the SIMD Unit, can be found in the recommended papers [1], [2], [3], [4], [5] and [7].

⁴ Multiply and Accumulate
⁵ Discrete Cosine Transform
⁶ Fast Fourier Transform


2.4 Programming on ePUMA

The ePUMA architecture can for now only be programmed with low-level programming languages, which means there is no abstraction between the language and the hardware that would simplify the programming. The programming languages C++ and Java are examples of high-level languages that have several abstractions above the hardware and are therefore much easier and more user-friendly to program in. On the other hand, programs written in a low-level language can be much more optimized, leading to them running more quickly and taking better advantage of the memory of the hardware [15].

The video game consoles Sega Saturn and Sony PlayStation 2, released in 1994 and 2000 respectively, were famous for being very difficult to program games for, and for how hard it was to take advantage of the performance of the hardware when high-level programming languages were used. When video game developers used low-level programming instead, they could implement much more impressive computer graphics and have a large number of physical objects on the screen while maintaining a decent frame rate. This was in particular the case with the Sega Saturn, which competed against the original Sony PlayStation and the Nintendo 64 between 1994 and 2000. The Sega Saturn had the potential to render better graphics than the Sony PlayStation if its games were programmed in its low-level language. But video game developers preferred high-level programming because of its simplicity, and as a result of programming video games in high-level languages on the Sega Saturn and the Sony PlayStation, where the latter had hardware much better suited to high-level programming, the games on the Sega Saturn were not as visually impressive as the games on the Sony PlayStation. The programming difficulty was one of several reasons why the Sega Saturn did not sell as well as the Sony PlayStation and the Nintendo 64 during that time [16] [17].

The low-level programming languages used in the ePUMA architecture are Assembler and the Sleipnir Assembly Language (SAL). Assembler is used for programs that are executed on the Master Processor, as well as for programming the communication between the Master Processor, the DMA controller and all of the SIMD Unit co-processors for the parallel computing functionality of the architecture.

SAL strongly resembles Assembler and is used for writing programs that run on a SIMD Unit. According to Karlsson (2010), today's compiler technology is not sophisticated enough to handle vectorization of high-level language code; the compiled code would probably not be able to utilize all the parallel processing capabilities of Sleipnir. For writing small programs at a manageable level on Sleipnir, SAL is sufficient [4].

2.4.1 SIMD Unit Programming

When writing programs for a SIMD Unit in SAL, the main benefit is to take advantage of the data vector format that every LVM and the VRF support. The SIMD unit is designed for highly parallel computations, so that every element of a vector in any of the built-in memories can be processed at the same time during a clock cycle.

This is the typical instruction format SAL uses, as seen in Listing 1:

iter* operation.cdt <options> dst src0 src1

Listing 1: The general SAL instruction format.


Every term in the instruction format is described in the following list, according to [4] and [7]:

• iter - the number of times the same instruction should be repeated.
• operation - the instruction mnemonic of the operation.
• cdt - makes the instruction condition dependent.
• options - additional options to the instruction.
• dst - data destination.
• src0/src1 - source operands.

As an example of an instruction that performs highly parallel computation, Listing 2 shows an operation in SAL which adds the two vectors m0.v (LVM0) and vr0 (VRF) element-wise and stores the result in m1.v (LVM1), repeated 128 times. Notice how both index variables ar1 and ar0 are incremented by 8: since a vector in a memory position has 8 elements on every row, each addition advances the index position by 8, which represents the next row of the memory, before the next addition is performed. The index positions of the memories m0 and m1 are incremented, while the index position of vr0 stays constant all along, representing the first row of the VRF.

128* addwq m1[ar1+=8].v m0[ar0+=8].v vr0

Listing 2: Add and store 128 vectors in a single assembly line.
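To make explicit what this single assembly line computes, here is an equivalent scalar C++ loop (a sketch for illustration only; on the SIMD Unit the 8 inner additions happen in parallel within one instruction):

#include <cstdint>

// Scalar equivalent of: 128* addwq m1[ar1+=8].v m0[ar0+=8].v vr0
// m0 and m1 stand for the LVMs (128 rows of 8 words), vr0 for one VRF row.
void add_128_vectors(uint16_t m1[128 * 8], const uint16_t m0[128 * 8],
                     const uint16_t vr0[8]) {
    for (int row = 0; row < 128; ++row)         // the 128* iteration count
        for (int lane = 0; lane < 8; ++lane)    // the 8 lanes of one vector
            m1[row * 8 + lane] = m0[row * 8 + lane] + vr0[lane];
}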

This is the syntax structure that all instructions, with some exceptions, follow in SAL. One of those exceptions, but nonetheless an important one, is the NOP operator. NOP stands for No Operation Performed, and it is a statement that does not do anything at all. The reason this remarkable operation is used is to prevent hazards that can occur when executing code on the hardware.

When a single instruction, containing the operation and the output and input operands, is executed by the SIMD Unit, it actually goes through a series of stages, like fetch, decode, selection, formatting, multiply and ALU (Arithmetic Logic Unit) long and short, among others. This is called a pipeline. Not every instruction needs to pass through every stage, depending on which operation is performed and which operands are used as output and input, but the pipeline requires that the instruction has gone through every stage. Otherwise, if an instruction is performed without having gone through every stage of the pipeline and the next instruction is performed right after, a hazard will probably occur.

This is where the NOP comes into the picture: when an instruction has been executed, a NOP passes through a stage the instruction skipped when it was executed. But since one NOP only passes through one stage, the operator must be used several times if there were several unprocessed stages, in order for the pipeline to be 'filled'. The number of NOPs recommended after every instruction therefore varies greatly, but it usually lies between 1 and 10. To be safe, every instruction can be followed by 10 NOPs, but this also leads to unoptimized code, since every NOP requires at least one clock cycle. A detailed description and an overview of the pipeline of the SIMD Unit can be found in [4], [5] and [7].

Listing 3 shows an algorithm that performs some of the most common operations that have been used for different types of algorithms on ePUMA. Each operation has a comment describing what it does, so that the code is easy to follow.

// Start of the algorithm
.main

// Copy the word value stored in index 3 of LVM1
// into index 10 in LVM0
copyq m0[10].w m1[3].w
10* nop

// Subtract the word value in index 5 of LVM1 with the
// word value in index 3 of LVM0 and store in the VRF,
// row 0 and column 0, in other words:
// vr0.0 = m1[5].w - m0[3].w
subwq vr0.0 m1[5].w m0[3].w
10* nop

// vr0.1 = m0[1].w + m1[3].w
addwq vr0.1 m0[1].w m1[3].w

// vr2.0 = m0[2].w * m1[3].w
mulwww <mul=16> vr2.0 m0[2].w m1[3].w

// Store in the index iterator ar2 the value
// from vr2.0, right shifted by 12:
// ar2 = vr2.0 >> 12
lsrwq ar2 vr2.0 12

// Perform bitwise OR between the values of m0[3].w and
// m1[0].w and store it in vr2.1:
// vr2.1 = m0[3].w | m1[0].w
orwq vr2.1 m0[3].w m1[0].w

// Perform bitwise AND:
// vr2.2 = m0[0].w & m1[0].w
andwq vr2.2 m0[0].w m1[0].w

// Entering section part ADDINGUP
ADDINGUP:

// Compare the value in vr2.1 with
// the value in m0[10].w
cmpwq vr2.1 m0[10].w

// Jump to the section part COMPLETE if
// the unsigned word value in vr2.1 is greater
// than the unsigned word value in m0[10].w
// (vr2.1 > m0[10].w)
jmp.ugt COMPLETE
10* nop

// Otherwise, add up the value in vr2.1 with 1
// (vr2.1 = vr2.1 + 1)
addwq vr2.1 vr2.1 1
10* nop

// Jump back up to the section part ADDINGUP
jmp ADDINGUP
10* nop

COMPLETE:

// End of the algorithm
stop

// Initialize constant values in the memory positions of
// the Constant Memory
.cm
25 12 1 0

// Initialize values in the memory positions of
// the Local Vector Memory 0
.m0
37 68 45 13

// Initialize values in the memory positions of
// the Local Vector Memory 1
.m1
8 10 2 7 9 20

Listing 3: Algorithm which performs common operations in Sleipnir Assembly Language.

Code written in SAL is saved in the file format *.sasm. Since the ePUMA architecture has not been implemented in hardware, a Python-based software is used instead to simulate the hardware and execute the code. The software can for now only be used on a Linux-based operating system running in 32-bit mode.

By starting the built-in terminal in a Linux distribution - like Ubuntu, Linux Mint, Fedora, OpenSUSE or Debian - code written in Sleipnir Assembly Language can be executed with the following command:

$ simdexec.py -t code.sasm

where the switch '-t' indicates that a simulation of a single assembly or Python file will be performed [8]. Figure 7 shows how the results from the simulation are displayed in the terminal, where the memory banks of LVM0, LVM1 and the VRF can be observed. The numerical values in those memories represent the values generated after the execution of the code has finished. The result can be expanded to show even more memory banks, and also to display the third memory, LVM2, by adding extra flags to the command in the terminal.

Figure 7: Simulation result when executing code on a SIMD Unit. Figure taken from [8].

Note that this is all the SIMD Unit can display for now through the simulator software. The simulator does not contain any API for visualizing the computations of the implemented algorithms, meaning that the numerical values generated by the algorithms are the only measurable data that can be observed. If these data are to be visualized in some way, they must be exported to an external graphical API. The most common APIs for this are OpenGL and Direct3D, which are described in detail in Section 3.3.6.

As previously mentioned, since this language is a low-level one, there is no possibility to create variables, functions and classes that could form understandable and structured code as in high-level languages like C++ and Java. Instead, the memories are all there is to rely on, and it is therefore very important to keep track of all the memory positions that contain the computed values.

2.4.2 Several SIMD Units Parallel Programming

To write code that utilizes several SIMD Units for parallel computing, the traditional low-level programming language Assembler is used for the Master processor 'Senior' in the ePUMA architecture. As previously mentioned, this processor is connected to all of the 8 SIMD Units in a star-shaped network and handles all the data transactions between the Master processor and the SIMD Units. The main idea is to let the Master processor decide when, and on which of the SIMD Units, the code of interest is executed, so that every SIMD Unit can make computations at the same time. This is done by transferring data between the SIMD Units and the main memory of the Master Processor with the help of DMA transactions, as described earlier in Section 2.2.

However, this transaction system and the connections between the Master and all the SIMD Units must be manually written in Assembler on the Master processor, which according to Karlsson [4] is a very tedious task. The ePUMA Research Team, who are developing the architecture, have the ambition to write tool libraries for the Master processor that handle the transaction system and the connections between the SIMD Units automatically, and to let the main code be written in the high-level language C instead of Assembler. Nevertheless, these tool libraries are still under development and had not been released as of the writing of this thesis, so the plans for implementing algorithms on several SIMD Units as a part of this Master Thesis unfortunately had to be abandoned.

On the other hand, since all of the 8 SIMD Units would perform parallelized computations simultaneously, conclusions can still be drawn theoretically about how fast the ePUMA architecture can process an algorithm per clock cycle when parallel computing is used. When the algorithm has been executed on a single SIMD Unit and the cycle count has been observed in the result display in the terminal, this cycle count can be divided by the number of SIMD Units involved in the parallel computation. If all the SIMD Units are used, the cycle count for the parallel computation is easily found by dividing the cycle count of a single SIMD Unit by 8:

cT_parallel = cT_single / n_SIMD

The ePUMA Simulator can for now execute Assembler code on the simulated Master processor at the same time as SAL code on several SIMD Units. This is an example, taken from the ePUMA Simulator Manual [7], of how code parallelized on two SIMD Units through the Master processor is executed in the terminal:

$ esim.py -m master.asm -k kernel0.sasm -k kernel1.sasm -b indata.bin -a outdata.1024

The Assembler code for the Master processor is represented by the file 'master.asm', preceded by the switch '-m', which tells the simulator that it is a Master program assembly file. Then come the two switches '-k' for the files 'kernel0.sasm' and 'kernel1.sasm', which tell the simulator that they are Kernel program assembly files that will do computations in parallel. The switch '-b' names a binary data file, 'indata.bin', which contains the input data to be used in the computations. Finally, the switch '-a' allocates memory space for the output data of the computations, which will be stored in the file 'outdata.1024'.

As can also be observed from the command line above, the Python script 'esim.py' is used instead of 'simdexec.py' when executing code on several SIMD Units at the same time. This script also contains a menu where several properties of the whole ePUMA architecture can be displayed, like the states of the Master, the SIMD Units and the DMA, and the data residing in all the separate memories of the architecture. More information about this script and the whole menu system can be found in the ePUMA Simulator Manual [7].

3 Theory

The task of this Master Thesis is to implement video game related algorithms on the ePUMA architecture, which can be considered a low-power variant of the CELL architecture that is used as the CPU of the Sony PlayStation 3 video game console and in blade servers manufactured by Mercury Computer Systems and IBM. The ePUMA architecture is designed to be used in future mobile processors of hand-held devices and to deliver high computational performance in intensive applications while keeping a low power consumption at the same time. This is achieved by taking advantage of the 8 SIMD Units of the ePUMA architecture, which are used for parallel computing in intensive applications. By letting every unit take care of a unique task, the computation time of the whole process can be reduced significantly if all 8 SIMD Units are used simultaneously. Every single SIMD Unit can also perform computations on 8 memory bank positions in parallel by using vector operations, which reduces the computation time by a factor of 8 at a given clock frequency. If all 8 SIMD Units are then used to perform computations in parallel with vector operations, the computation time can be reduced by as much as a factor of 64:

cT_parallel = cT_single / (n_Parallel × n_SIMD),  n_Parallel = 8, n_SIMD = 8

But this also leads to another case, where the clock frequency can instead be decreased by as much as a factor of 64 by utilizing all 8 SIMD Units for parallel computing with vector operations: the architecture can then maintain the same computation time as one SIMD Unit performing a single operation on 1 memory bank position at a 64 times higher clock frequency. Decreasing the clock frequency while maintaining the computation time is an ideal scenario that today's consumer electronics companies, like Apple, Samsung, Sony, Motorola, LG and HTC among others, desire. The main problem with today's smartphones and tablets is that they have too little battery life when used for Internet browsing and interaction with mobile applications, since the processors are becoming more powerful but at the same time consume a lot of energy. The most common solution is to manufacture more effective batteries that last longer in smartphones and tablets. But such batteries can be very expensive to manufacture if they have really high capacity, since smartphones and tablets often use lithium-ion chemistry in their batteries, and this type of chemistry has a very high expense per watt-hour manufactured.

Parallel programming has become more popular in the video game and computer industries with the introduction of General Purpose Graphics Processing Unit (GPGPU) programming, which modern graphics cards like the AMD Radeon and nVidia GeForce support. This type of programming allows the graphics card, for instance, to handle physics computations instead of the CPU, in parallel on all the cores the GPU has.

The most frequently used frameworks for GPGPU programming are CUDA (which only works on nVidia-based graphics cards), developed by nVidia, and OpenCL, developed by the Khronos Group. The OpenCL specification in particular is actually compatible with the memory hierarchy of the ePUMA architecture, according to Wang et al. [3]. The ePUMA architecture might thus be a well suited platform for running video games, since it can deliver high performance at a low power consumption. The video game related algorithms that were of interest to implement as a part of this Master Thesis are described in the following sections.

3.1 Random Number Generator

Figure 8: Different types of games where a random number generator is applied: (a) dice, (b) cards, (c) roulette, (d) Tetris. Figures taken from [19], [20], [21] and [22] respectively.

Random Number Generators are used in the fields of simulation, analysis and statistical sampling in science, in cryptography and especially in games. It is their ability to produce numbers that do not follow any pattern at all that has created the genre 'games of chance'. These types of games, as seen in Figure 8, include dice, shuffled playing cards, roulette wheels and slot machines, which often are played in restaurants, bars and casinos. Random Number Generators have their applications in computers as well when it comes to video games, like the popular puzzle game Tetris, created by the Russian computer engineer Alexey Pajitnov in 1984. In Tetris, one of the seven available blocks is randomly chosen to be the next block that falls down the screen; this choice is randomly generated every 5-10 seconds.

Figure 9: Terrain height map created by the Perlin Noise procedural function: (a) Perlin Noise, (b) terrain height map. Figures taken from [23] and [24] respectively.

Another important use of Random Number Generators in video games is generating noise patterns. Figure 9a shows the output of the Perlin Noise procedural function, created by Ken Perlin in 1985. Since this function has a random appearance, generated from random numbers, it can be used to create realistic terrains. This is done by letting the height map of a flat surface be shaped by the random appearance of the Perlin Noise function, which eventually gives a terrain pattern as seen in Figure 9b. Unlike the Tetris case, this requires several millions of random numbers to be generated per second in order to render the right kind of appearance of the terrain with the noise function. These are some examples of where Random Number Generators are used in video games, and it would therefore be valuable for the ePUMA architecture to have hardware support for them. Section 3.2 discusses how Random Number Generators are generally implemented on a computer and which algorithm was implemented on the ePUMA architecture.

3.2 Pseudo-Random Number Generator

Pseudo-Random Number Generators are the most common method of generating random numbers on a computer. They are pseudo in the sense that the algorithms produce numbers that appear to be random but in reality are predetermined. They use mathematical formulae or precalculated tables, and they have the characteristics of being efficient (producing numbers in a short time), deterministic (the same sequence of numbers can be reproduced from the same starting point at a later time) and periodic (the sequence of numbers will eventually repeat itself) [18].

The other method is True Random Number Generators, which by contrast extract true randomness from physical phenomena, with sources ranging from the mouse movements or the time between keystrokes of a user sitting in front of a computer, to radioactive sources and atmospheric noise. These methods differ from the previous method by being inefficient (they take longer to produce a number) and nondeterministic (they cannot reproduce a given sequence of numbers), which also makes them aperiodic. They are suitable for use in real-life games like lotteries and draws, in gambling services and in cryptography [18]. Pseudo-Random Number Generators, however, are suitable for use in video games even though they are deterministic and periodic. If the length of the start value - called a random seed, which is used to initialize the whole generator - is large enough, then several millions of unique digits will be generated in a unique sequence before it finally repeats itself [25]. The random seed must, though, be unique every time a unique sequence of numbers is to be generated; otherwise the same sequence is produced whenever the value of the random seed stays constant. This is due to the fact that computers cannot create random numbers by themselves, since they are only constructed to follow instructions and to execute them quickly. They must rely on a source outside the computer that creates a random pattern. Paradoxically, there is a source located inside the computer that can act unpredictably: the internal clock, which is used to keep track of the current date and time. This internal clock ticks every second, and by taking the time represented in milliseconds as a random seed, the Pseudo-Random Number Generator will produce a unique sequence of numbers every time the generator is used, since the random seed will be different every time [26].
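A minimal C++ sketch of this clock-based seeding (illustrative only; any time source with millisecond resolution works):

#include <chrono>
#include <cstdint>

// Derive a seed from the current time in milliseconds, so that each run of
// the program starts the generator in a different state.
uint16_t seed_from_clock() {
    using namespace std::chrono;
    const auto ms = duration_cast<milliseconds>(
        system_clock::now().time_since_epoch()).count();
    const uint16_t seed = static_cast<uint16_t>(ms);
    return seed != 0 ? seed : 1;  // an all-zero LFSR state must be avoided
}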

One of the fast algorithms based on the principles of the Pseudo-Random Number Generator, which was implemented on the ePUMA architecture, is the Linear Feedback Shift Register, described in the next section.


3.2.1 Linear Feedback Shift Register

Figure 10: A 4-bit Linear Feedback Shift Register.

A Linear Feedback Shift Register (LFSR) is a shift register which takes the output bit from the previous state and uses it as an input bit for the next state. This feedback is a linear function, here consisting of the XOR operator, which takes two input bits and produces one output bit. Figure 10 shows an example of a 4-bit Linear Feedback Shift Register, where the two last bit values are used as the two inputs to the XOR operator. The output bit is placed in the first slot of the register while the rest of the bits in the register move one step to the right; the last bit value is removed completely from the register, and a new output bit can then be produced from the two new bits that are last in the register. This is the whole process of the Linear Feedback Shift Register, and because the register has a finite number of possible states, it will eventually repeat the same states. A 4-bit Linear Feedback Shift Register will generate at most 2^4 = 16 different states before the same states repeat. But the more bits a register contains, the more unique states are generated before the sequence repeats, and therefore the numbers appear more random [27].

Figure 11: Different representations of the Exclusive OR operator (XOR): (a) logic gate, (b) Venn diagram.

XOR is a Boolean operator that stands for Exclusive OR, and it can be represented in many ways. In Figure 10 it is represented with the logic gate notation, also seen in Figure 11a. Another way is to use Venn diagrams, which show the logical relationship between a collection of sets, as seen in Figure 11b. If A and B are two different sets and the operator is applied to them, this is written mathematically as A ⊕ B.

The XOR operator, also known as the 'either or' operator, takes input from two operands; if exactly one of them is true, it outputs true. It outputs false if both operands are false or if both are true. Table 1 summarizes the whole principle of the XOR operator [29].

A  B  A ⊕ B
0  0    0
0  1    1
1  0    1
1  1    0

Table 1: Truth table of the XOR operator.

Figure 12: A 16-bit Fibonacci LFSR.

As mentioned, the more bits a register has, the more random the generated numbers appear, since it takes longer before the same sequence repeats. The register can be further improved by placing not one but several XOR operators at certain positions of the register, acting on input values. These operators are also connected to each other so that they affect the final output, which contributes even more randomness. Figure 12 shows this type of register, which has 16 bits and is called a Fibonacci Linear Feedback Shift Register, named after the Italian mathematician Leonardo Fibonacci. As can be seen in Figure 12, four positions of the register are used as input bits to three XOR operators, producing one output bit for the first position of the register. The input bits are called taps and can be labelled [16, 14, 13, 11].

The arrangement of taps for feedback in an LFSR can be expressed mathematically as a polynomial with different positive integer powers, where every term in the polynomial has a coefficient that is either 1 or 0. With the given taps, the so-called feedback polynomial becomes:

x^16 + x^14 + x^13 + x^11 + 1

where the term '1', equivalent to x^0, corresponds to the input to the first bit of the register and not to any tap. To implement the 16-bit Fibonacci Linear Feedback Shift Register with the taps that form the feedback polynomial above, there are operations built into ePUMA, as well as into any modern computer, called Left Shift and Right Shift.

Left Shift performs an arithmetic shift of an input value x to the left by n positions. This corresponds to multiplying the input value x by 2^n, as seen below:

x << n = x × 2^n

Right Shift performs an arithmetic shift of an input value x to the right by n positions. This corresponds to multiplying the input value x by 2^−n, i.e. dividing it by 2^n, as seen below:

x >> n = x × 2^−n

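As a quick sanity check of the two shift operations, here is a small assumed C++ example:

#include <cstdio>

int main() {
    unsigned x = 5;
    printf("%u\n", x << 2);        // 5 * 2^2 = 20
    printf("%u\n", (x << 2) >> 2); // shifting back right divides by 2^2, giving 5 again
    return 0;
}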


The taps can now be implemented with these shift operators, together with XOR, realizing the LFSR with the particular polynomial x^16 + x^14 + x^13 + x^11 + 1. Algorithm 1 is a pseudo-code version of the 16-bit Fibonacci LFSR algorithm, based on an implementation in the hardware description language VHDL [28]:

Algorithm 1 The Linear Feedback Shift Register Algorithm

1: procedure LFSR
2:     lfsr ← startValue
3:     period ← 0
4:     do
5:         bit ← ((lfsr >> 0) ^ (lfsr >> 2) ^ (lfsr >> 3) ^ (lfsr >> 5)) & 1
6:         lfsr ← (lfsr >> 1) | (bit << 15)
7:         period ← period + 1
8:     while (lfsr ≠ startValue)
9: end procedure
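For reference, a direct C++ translation of Algorithm 1 might look as follows; the start value 0xACE1 is an arbitrary non-zero choice and not taken from the thesis.

#include <cstdint>
#include <cstdio>

int main() {
    // Direct translation of Algorithm 1: a 16-bit Fibonacci LFSR with
    // feedback polynomial x^16 + x^14 + x^13 + x^11 + 1.
    const uint16_t start = 0xACE1u;    // arbitrary non-zero start value
    uint16_t lfsr = start;
    unsigned period = 0;
    do {
        // Shifts 0, 2, 3 and 5 pick out the taps at positions 16, 14, 13 and 11.
        uint16_t bit = ((lfsr >> 0) ^ (lfsr >> 2) ^ (lfsr >> 3) ^ (lfsr >> 5)) & 1u;
        lfsr = (uint16_t)((lfsr >> 1) | (bit << 15));
        ++period;
    } while (lfsr != start);
    printf("period = %u\n", period);   // 65535 = 2^16 - 1 for this maximal polynomial
    return 0;
}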

This is the first video game related algorithm that was implemented on the ePUMA architecture as a part of this Master Thesis. The implementation of the LFSR is thoroughly described in Section 5.1, where the potential of parallelizing the algorithm on ePUMA is discussed. The next section covers the basic theory of the Graphics Pipeline and which of its algorithms were implemented on ePUMA.

3.3 Graphics Pipeline

Figure 13: A general overview of a Graphics Pipeline. Figure taken from [30].

A Graphics Pipeline renders a complex 3D scene with 3D objects onto a 2D pixel display screen by dividing the whole rendering process into several stages. This is nowadays handled by the Graphics Processing Unit (GPU) on a computer, and the benefit of a Graphics Pipeline is that the GPU can render graphics more efficiently when the work is divided into stages: the stages can operate concurrently on different parts of the scene, so a pipeline with more individual stages lets the GPU push data through each stage faster and thereby render the whole scene at a higher rate. A Graphics Pipeline and its stages can be described in many ways, since over time new stages have been added and old stages removed. A general and simplified overview of a Graphics Pipeline can be seen



in Figure 13. It starts by sending the model data of all the 3D objects in the 3D scene as input to the vertex processor.

3.3.1 Model Data and the Vertex Processor Stage

The model data of a 3D object contains information about the vertices, edges, transformations and lighting, among other attributes. The vertex processor performs operations like per-vertex lighting, which applies light sources, reflectance and other surface properties. It also performs linear transformations like rotation, scaling, translation and shearing on every vertex of every object. The 3D object must also pass through several viewing transformations, like the Object-to-World transformation, the World-to-Eye transformation and the Eye-to-Normalized-Device-Coordinates transformation, to make sure that it is visualized correctly on the screen. This means that in order for the 3D object to be viewed correctly on the 2D screen, with its linear transformations and lighting applied, the coordinates of the 3D object must be changed according to the perspective of the camera and then be normalized into values between -1 and 1.

(a) Object space. (b) World space. (c) Eye space. (d) NDC space.

Figure 14: Different coordinate spaces a 3D object passes through before the Clipping process. Figures taken from [31].
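To illustrate this chain of transformations, the C++ sketch below applies a combined Model-View-Projection matrix to one vertex and performs the perspective divide that yields Normalized Device Coordinates. The identity matrix stands in for a real P * V * M product, and the function name and values are illustrative assumptions, not code from the thesis.

#include <cstdio>

// Multiply a 4x4 matrix (row-major) with a 4-component column vector.
void mat4MulVec4(const float m[16], const float v[4], float out[4]) {
    for (int r = 0; r < 4; ++r)
        out[r] = m[4*r+0]*v[0] + m[4*r+1]*v[1] + m[4*r+2]*v[2] + m[4*r+3]*v[3];
}

int main() {
    // A hypothetical object-space vertex in homogeneous coordinates.
    float p[4] = {1.0f, 2.0f, -3.0f, 1.0f};

    // The identity matrix stands in for the combined transformation
    // P * V * M (projection * world-to-eye * object-to-world).
    float mvp[16] = {1,0,0,0, 0,1,0,0, 0,0,1,0, 0,0,0,1};

    float clip[4];
    mat4MulVec4(mvp, p, clip);

    // The perspective divide by w takes the vertex to Normalized
    // Device Coordinates, where visible coordinates lie in [-1, 1].
    printf("NDC: (%f, %f, %f)\n",
           clip[0]/clip[3], clip[1]/clip[3], clip[2]/clip[3]);
    return 0;
}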

3.3.2 The Clipping Stage

When the 3D object is in the Normalized Device Coordinates space, which is the normalized Eye space, the object must be clipped so that the GPU only renders the parts of the object that are visible on the screen. Otherwise the GPU would spend an enormous amount of computation power in vain on rendering segments that would not be visible to the user. The Clipping process makes sure that the GPU only renders the segments of an object that are visible on the screen.

3.3.3 The Rasterizer Stage

Now the 3D object is ready to move into the Rasterizer stage, where the object is transformed into 2D pixels according to Figure 13. The clipped object, which is now in Normalized Device Coordinate space, must be projected into the 2D Screen space, as seen in Figure 15.
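Where exactly a Normalized Device Coordinate lands on the screen is given by the viewport transformation. The C++ sketch below is a minimal assumed version of that mapping; the function name and the y-axis flip convention are illustrative, not taken from the thesis.

#include <cstdio>

// Map NDC coordinates in [-1, 1] to pixel coordinates on a
// width x height screen. The y axis is flipped, since NDC has y
// pointing up while screen coordinates usually grow downwards.
void ndcToScreen(float xNdc, float yNdc, int width, int height,
                 float &xScreen, float &yScreen) {
    xScreen = (xNdc + 1.0f) * 0.5f * width;
    yScreen = (1.0f - yNdc) * 0.5f * height;
}

int main() {
    float xs, ys;
    ndcToScreen(0.5f, 0.5f, 640, 480, xs, ys);
    printf("screen: (%.1f, %.1f)\n", xs, ys);  // (480.0, 120.0)
    return 0;
}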

3.3.4 The Fragment Processor Stage

The 3D object is now projected into a 2D object, and every visible triangle of that object is turned into fragments of pixels. This is done by using scanlines for every row on the screen: when a scanline detects a segment of the triangle, it turns that segment into pixels. When every visible triangle of the object has been filled with pixel color information, the rasterized object moves ahead to the Fragment Processor, and this is where per-pixel lighting and



Figure 15: Proceeding from the NDC space to the Screen space during the rasterization process. Figure taken from [31].

texture mapping can be applied. It blends every filled pixel on the screen with the calculated lighting and with a texture map value.
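As a rough sketch of this blending step, the snippet below modulates a sampled texture color with a computed light intensity; the Color type and the values are assumptions for illustration only, not the thesis implementation.

#include <cstdio>

struct Color { float r, g, b; };

// Modulate a sampled texture color with a computed light intensity.
Color shadeFragment(Color texel, float lightIntensity) {
    return { texel.r * lightIntensity,
             texel.g * lightIntensity,
             texel.b * lightIntensity };
}

int main() {
    Color texel = {0.8f, 0.6f, 0.4f};   // assumed texture map sample
    Color out = shadeFragment(texel, 0.5f);
    printf("(%.2f, %.2f, %.2f)\n", out.r, out.g, out.b);  // (0.40, 0.30, 0.20)
    return 0;
}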

3.3.5 The Frame Buffer Stage

Finally, when the Fragment Processor stage has been executed, the produced image is transferred into the ’Frame Buffer’. This is a 2D data array, stored in the GPU memory, that contains all the generated pixel values from the previous stage, and this is the data that a computer monitor displays on the screen.

3.3.6 Direct3D and OpenGL Graphics Pipeline

(a) Direct 3D. (b) OpenGL.

Figure 16: Comparison of the Direct3D and the OpenGL Graphics Pipeline. Figures taken from [32] and [33] respectively.

The two most popular Application Programming Interfaces (API) for producing 2D and 3D computer graphics are Direct3D and OpenGL. Direct3D is developed by Microsoft and runs only under the Microsoft Windows operating system, whereas OpenGL was developed by Silicon Graphics in 1992 and is now managed by the non-profit technology consortium Khronos Group. OpenGL runs under several operating systems like Windows, Linux and Mac OS X, as well as iOS and Android for touchscreen mobile devices, and it is the main API used for creating games and visual applications for the Linux and Mac OS X platforms.

Figure 16 shows the map scheme of the Graphics Pipeline for the Direct3D and the OpenGL API, and while they differ in names and have their own unique process stages, they still follow the same structure as seen in Figure 13. In the Direct3D map scheme, the first part {Vertex Data + Primitive Data} → {Tessellation} → {Vertex Processing} represents the model data and the Vertex Processor Stage as described in Section 3.3.1, as does the part {Display List} → {Evaluator} → {Per-Vertex Operations, Primitive Assembly} in the OpenGL map scheme.



The second part in the Direct3D map scheme, {Pixel Processing ← Texture Sampler ← Textured Surface}, and in the OpenGL map scheme, {Rasterization ← Texture Memory} → {Per-Fragment Operations}, represents the Clipping, the Rasterizer and the Fragment Processor Stages as described in Sections 3.3.2, 3.3.3 and 3.3.4 respectively.

The final part, {Pixel Rendering} in Direct3D and {Frame Buffer} in OpenGL, represents the Frame Buffer Stage as described in Section 3.3.5, where all the drawn pixels are displayed on the computer monitor.

The Graphics Pipelines of Direct3D and OpenGL continue to be developed further over time. As of now, the Direct3D and OpenGL APIs have reached versions 11.1 and 4.2 respectively. These versions allow the stages Vertex Processing, Geometry Processing and Pixel Processing in Direct3D to be programmable, as well as the stages Per-Vertex Operations and Per-Fragment Operations in OpenGL. The earliest versions of the APIs did not allow this type of functionality; instead the stages were fixed, and that type of pipeline was called a fixed-function pipeline. Now that the stages are programmable, customized transformation and lighting effects can be rendered on the screen. The programs that run these effects on the stages are called shaders (e.g. Vertex Shader, Geometry Shader, Pixel Shader and Fragment Shader).

New programmable stages have also been added over time, like Geometry Processing and Tessellation in the Direct3D pipeline as well as their counterparts in the OpenGL pipeline. The Geometry stage allows entire primitives like triangles and lines to be modified, and the Tessellation stage allows the geometry of an input object to be subdivided into regions in order to create a higher-order representation of the surface [34].

3.3.7 Selected stages for the ePUMA

Parts of the stages of the Graphics Pipeline have been of interest for implementation on the ePUMA hardware as a part of this Master Thesis. The stages of interest are the Clipping, the Rasterizer and the Fragment Processor procedures, as described in Sections 3.3.2, 3.3.3 and 3.3.4. These stages have the potential of being parallelized efficiently on ePUMA. The next section describes the whole Clipping process in detail.



3.4 Clipping

Clipping is the procedure of cutting away whole objects, or parts of objects, that are not visible on the viewing screen before they are rendered into the frame buffer. This is performed so that the display card can render the objects efficiently and does not make unnecessary computations; it would otherwise be a waste for the display card to render every object loaded into memory when it is not visible to the user. The objects to be rendered are usually built up in either of two formats: lines and polygons (which can represent shapes like triangles, quadrilaterals and other many-cornered shapes). There are several algorithms for performing line clipping and polygon clipping. In this report the Cohen-Sutherland algorithm for line clipping and the Sutherland-Hodgman algorithm for polygon clipping have been implemented on the ePUMA hardware, and these are covered in the next two sections.

3.4.1 Cohen-Sutherland Line Clipping

This line clipping algorithm was created by Danny Cohen and Ivan Sutherland as a part of flight simulation work in 1967 [35]. It divides a 2D space into 9 regions, where the middle region represents the viewing screen visible to the user. Every region is labeled with a binary number called an ”outcode”, which can be seen in Figure 17 below:

Figure 17: The outcode template.

With this outcode template, it can now be determined whether a line is completely inside, completely outside, or partially inside and outside of the screen from its end points. The view screen, represented by ’0000’ in Figure 17, is built up by its four corner coordinates in {x, y} notation:

Screen : {xmin, ymin}, {xmax, ymin}, {xmin, ymax}, {xmax, ymax}

while the end points of a line are given by:

Line : {x1, y1}, {x2, y2}

To compare the two end points of a line with the view screen's coordinates and to determine their outcodes, the bitwise operator ’OR’ (represented in C++ as ’|’ and in SAL as ’orwq’) is used to mark whether any of the endpoints is outside the viewing screen. A bitwise operator compares two numerical values in binary format, bit by bit, and outputs a new numerical value based upon the bitwise comparison.



(a) Before (b) After

Figure 18: The line clipping.

For example, take two variables with assigned values a = 5 = 0101_2 and b = 9 = 1001_2. The bitwise operator ’OR’ is performed on every bit according

to the table below:

| x y
0 0 0
1 1 0
1 0 1
1 1 1

In comparison, the bitwise operator ’AND’, which is also going to be used later in this section, is performed according to the table below:

& x y
0 0 0
0 1 0
0 0 1
1 1 1

If the bitwise operator ’OR’ is performed on those two variables, a | b, it yields the following result:

  0101_2
| 1001_2
--------
  1101_2

While the bitwise operator ’AND’ performed on those two variables, a & b, yields:

  0101_2
& 1001_2
--------
  0001_2
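Putting the outcodes and the two bitwise operators together, a minimal C++ sketch of the Cohen-Sutherland bookkeeping could look as follows. The bit assignment of the regions and the type names are assumptions chosen for illustration, not code from the thesis: the OR test marks whether any endpoint lies outside the screen, while the AND test detects when both endpoints share the same outside region, so the line can be trivially rejected.

#include <cstdint>
#include <cstdio>

// Assumed region bits for the outcode template of Figure 17.
enum { INSIDE = 0, LEFT = 1, RIGHT = 2, BOTTOM = 4, TOP = 8 };

struct Point { float x, y; };
struct Rect  { float xmin, ymin, xmax, ymax; };

// Classify a point against the screen rectangle.
uint8_t outcode(Point p, Rect s) {
    uint8_t code = INSIDE;
    if      (p.x < s.xmin) code |= LEFT;
    else if (p.x > s.xmax) code |= RIGHT;
    if      (p.y < s.ymin) code |= BOTTOM;
    else if (p.y > s.ymax) code |= TOP;
    return code;
}

int main() {
    Rect screen = {0, 0, 100, 100};
    Point p1 = {-10, 50}, p2 = {50, 50};
    uint8_t c1 = outcode(p1, screen), c2 = outcode(p2, screen);
    // OR == 0:  both endpoints inside -> trivially accept the line.
    // AND != 0: both endpoints in the same outside region -> trivially reject.
    // Otherwise the line crosses a screen edge and must be clipped.
    printf("accept=%d reject=%d\n", (c1 | c2) == 0, (c1 & c2) != 0);
    return 0;
}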
