
Migrating Mesh Skinning Deformation Functionality from RSX to SPUs on the PlayStation®3

Anders Rånes

September 29, 2010

Master’s Thesis in Computing Science, 30 credits

Supervisor at CS-UmU: Stefan Johansson

Supervisors at Coldwood: Andreas Asplund & Olof Häggström

Examiner: Fredrik Georgsson

Umeå University

Department of Computing Science

SE-901 87 UMEÅ

SWEDEN


Abstract

In game development, performance is everything, and the Playstation 3 provides a unique platform for utilizing parallelization of code to achieve extremely high performance. In this master’s thesis, animation with smooth skinning is migrated from being a GPU process to becoming a parallelized and 358% faster process. This method is incorporated in an existing commercial game engine and integrated in a title currently in development for the Playstation 3. An in-depth study covers parallel processors, the CELL processor used in the Playstation 3, and how contemporary industry leading game developers are utilizing the same unique architecture to increase their own games’ performance.


Contents

1 Introduction

2 Problem Description
2.1 Problem Statement
2.2 Goals
2.3 Purposes
2.4 Methods
2.5 Related Work

3 Creating a frame with skinned objects
3.1 Scene graph
3.2 Vertex data
3.3 Skeleton
3.4 Culling
3.5 Deformation
3.6 Rendering

4 Parallelized computation on the Playstation 3
4.1 Parallel processors
4.2 The Playstation 3 processor and memory layout
4.3 Leading industry Playstation 3 SPU utilization
4.3.1 Santa Monica Studio
4.3.2 Guerrilla Games
4.3.3 Dice
4.4 Emergent
4.4.1 Gamebryo
4.4.2 Floodgate
4.5 Reflections

5 Accomplishment
5.1 Preliminaries
5.2 Existing skinning frameworks
5.2.1 SPU skinning in Gamebryo 2.4
5.2.2 Skinning in Gamebryo 2.3
5.2.3 Proposed SPU-skinning solution
5.3 How the work was done

6 Results
6.1 Performance in a stand alone demo
6.2 Performance gain in the actual game

7 Conclusions
7.1 Limitations
7.2 Future work

References


List of Figures

3.1 Example of a skinned mesh with skeleton, left: bindpose, right: animated.
4.1 A simplified view of the Playstation 3 processor and memory layout.
5.1 SPU-Skinning in Gamebryo 2.4.
5.2 Software Skinning in Gamebryo 2.3.
5.3 Hardware Skinning in Gamebryo 2.3.
5.4 SPU-skinning solution proposed for Gamebryo 2.3.


Chapter 1

Introduction

Parallelization is becoming more and more common in computing science in general and in game development in particular. It is an ever-present area of work in game development, with a constant need for refinement. One part of game development is animation of on-screen objects, which for a long time has been done by skinning. Skinning gets its name from the analogy to the human body, where a collection of connected bones forms a skeleton and each point of the skin is connected to one or several bones. When the bones are moved, the points of skin move along with them. This creates much less work for animators, which has enabled them to animate much more complex models.

In this master’s thesis, I propose a method for smooth skinning and incorporate it as a parallelized process in an existing game engine for the gaming platform Sony PlayStation®3 (hereafter referred to as the Playstation 3). The work is done for the game developer Coldwood Interactive, Umeå, for one of their upcoming commercial titles.

In Chapter 2, the problem is described in more detail and the goals required by Coldwood are presented. In Chapter 3, basic concepts of 3D graphics are explained and the steps to create a frame with regard to skinning are presented. In Chapter 4, an in-depth study of industry utilization of the parallel nature of the Playstation 3 is presented. Chapter 5 describes the existing skinning code in the game engine used, Gamebryo, and my proposed solution. The performance results of the solution are presented in Chapter 6, and in Chapter 7 conclusions are drawn from the work and limitations of the solution are presented along with suggestions for future work.


Chapter 2

Problem Description

Games generally use as much power as possible to show as many and as beautiful graphics as possible in each frame, and the framerate must be high; 30 to 60 fps¹ is standard. For each frame there is very little time in which many tasks have to be handled, such as input, animation, game logic, AI, and rendering. In recent years parallel processor architectures have become more and more common for both PCs and consoles. One current research topic in computing science is to determine which algorithms and processes can be parallelized and which cannot. Rendering has for a long time been processed in parallel because of its natural ability to be split into non-interfering parts. Specialized graphics hardware has been a standard in computer entertainment for a long time, but the rest of the game loop is just now starting to catch up. Most major gaming companies are making most of their performance progress by utilizing the parallel power of modern processors. The animation part is of high priority because of its size compared to the other, non-rendering, events in the game loop. It is, however, not something that can be easily parallelized.

2.1 Problem Statement

Coldwood Interactive AB (hereafter referred to as Coldwood) is an independent game studio, founded in 2003 in Umeå, Sweden. They develop video games for platforms such as PC, Playstation 2, Playstation 3, PSP and XBOX. The studio utilizes the rendering middleware Gamebryo for some projects, which provides solutions to several tasks common in game development, including a scene graph handler. This master’s thesis only considers the use of Gamebryo on PC and the Playstation 3.

The Playstation 3 is equipped with one CPU (PPU), one GPU (RSX), and seven general purpose SIMD processing units (SPUs). When used traditionally the RSX easily becomes choked and the total frame time suffers. The frame time is bound by MAXTIME(PPU, RSX), the longer of the PPU time and the RSX time, so the key is to balance the workload between all units so that no single unit exceeds its given time limit. Thus, on the Playstation 3 specifically, it would make sense to let the SPUs do GPU work, even though they would perform worse in a direct comparison.
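The balancing argument can be illustrated with a small calculation. The sketch below is only an illustration: the function name is hypothetical, all timings are made-up numbers rather than measurements from this thesis, and the SPUs are added as a third term on the assumption that their work fully overlaps with the PPU and the RSX.

```python
def frame_time_ms(ppu_ms, rsx_ms, spu_ms=0.0):
    """Frame time when the units run in parallel: the slowest unit decides."""
    return max(ppu_ms, rsx_ms, spu_ms)


# Hypothetical timings in milliseconds (illustrative only, not measured).
before = frame_time_ms(ppu_ms=12.0, rsx_ms=20.0)

# Assume 6 ms of RSX skinning work is moved to the SPUs, where the same
# work is assumed to take 9 ms because the SPUs are slower at it.
after = frame_time_ms(ppu_ms=12.0, rsx_ms=14.0, spu_ms=9.0)

print(before, after)
```

With these assumed numbers the frame gets shorter even though the moved work takes longer on the SPUs, because the RSX is no longer the unit that bounds the frame.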

Coldwood has identified smooth skinning as a particularly problematic task. The current implementation does one scene traversal to calculate world matrices for every object, but then the RSX is used to do the actual deformation to transform the objects into the correct space for both shadowing and normal rendering. Traversing has a tendency to invoke "pointer chasing" that easily results in L2 cache misses, inflicting PPU penalties, which in turn delays RSX execution. It also affects the RSX negatively directly through extensive use of vertex shaders, because of the need to perform the skinning twice for objects which are both shadow casters and visible in the normal rendering. Finding a way to redirect this work to the SPUs could have a very positive impact on the performance of the RSX.

¹ Frames per second.

2.2 Goals

The following goals are required by Coldwood:

– Implement a smooth skinning solution that can be distributed and run on SPUs.

– Profile, track bottlenecks (stalls, inefficient code, cache misses, etc.) and compare the new implementation to the old.

– The solution should be integrated into the existing code base and will be used in a commercial product on the Playstation 3. The product uses peripherals which require a steady 60Hz refresh rate (60 fps).

Additional optional goals include:

– Reduce cache miss penalties and use interleaved DMA transfers to further optimize memory usage.

– Propose future development to maximize the use of multiple processors.

– Introduce dual quaternion skinning[7] to eliminate skinning artifacts.

All required goals and all but the last optional goal are reached with the proposed and implemented solution in this master’s thesis.

2.3 Purposes

The main purpose of the proposed solution in this master’s thesis is to reduce the time the animation takes each frame so that more PPU and GPU power is free to perform other tasks, such as more advanced graphics or just more animated entities in a scene.

2.4 Methods

The above mentioned goals expressed by Coldwood gave a clear path for the work to follow.

First, the game engine Gamebryo had to be analysed, along with the platform and several techniques for SPU parallelization. The extent of support for smooth skinning and SPU work in the engine would dictate which parts could be reused or reshaped, and which would have to be created from scratch. Once the design phase was complete, the solution would be implemented and integrated into the existing game.


2.5 Related Work

Coldwood does not currently use the latest version of Gamebryo, so more recent Gamebryo versions, which utilize SPU-skinning, are important related work. Other interesting work relating to this master’s thesis in a more general way is the progress other Playstation 3 developers and SCE² themselves are making in regard to SPU utilization. Companies like Santa Monica Studio, Dice, Guerrilla Games, and many more are all trying to harness the powerful SPUs on the Playstation 3.

² Sony Computer Entertainment


Chapter 3

Creating a frame with skinned objects

This chapter introduces the reader to common terminology and techniques in modern 3D graphics. In particular, it focuses on how skinned objects are created and processed for each frame in a computer game. Each frame actually involves many more calculations, such as physics simulation, game logic, sound effects, and input handling. Several books have been written about each of those subjects, so to keep this chapter short, all subjects not related to how skinned objects are processed are omitted completely or mentioned only briefly. For the same reason several subjects are simplified, and are in reality much more complex.

3.1 Scene graph

A very common approach to storing an entire scene of 3D objects is to use a scene graph.

The scene graph starts with a root node positioned at the origin of the world, and every object knows only its position relative to its parent node. This way complex objects can be placed easily in the scene without explicitly placing all the objects they are composed of. For example, a car wheel only needs to know where on the car it is positioned, not where the car itself is positioned in the world.
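The car wheel example can be sketched in a few lines. This is a minimal illustration, not Gamebryo code: the `Node` class and its fields are hypothetical, and only translation is handled, whereas a real scene graph node would also carry rotation and scale in a full transformation matrix.

```python
class Node:
    """Hypothetical scene graph node storing a position relative to its parent."""

    def __init__(self, name, local_pos, parent=None):
        self.name = name
        self.local_pos = local_pos  # (x, y, z) relative to the parent node
        self.parent = parent

    def world_pos(self):
        """Accumulate relative positions from this node up to the root."""
        if self.parent is None:
            return self.local_pos
        px, py, pz = self.parent.world_pos()
        x, y, z = self.local_pos
        return (px + x, py + y, pz + z)


root = Node("root", (0.0, 0.0, 0.0))
car = Node("car", (10.0, 0.0, 5.0), parent=root)
wheel = Node("front_left_wheel", (1.5, -0.5, 2.0), parent=car)

print(wheel.world_pos())
```

Moving the car only requires changing the position stored in `car`; the wheel follows because its world position is derived from its parent.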

3.2 Vertex data

In any application which displays 3D graphics with polygons, each triangle polygon is stored as three points in space. These points are called vertices. Each vertex is stored in an indexed list. To store the polygons, a second list is created which contains indices into the first list, where each 3-tuple describes one polygon. Since most vertices are used by several polygons, this approach ensures that each vertex is only stored once. This collection of polygons is called a mesh.
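As a small illustration of the two lists, the sketch below stores a square as two triangles. The data and the helper name `triangles` are made up for the example.

```python
# Four vertices of a unit square, each stored exactly once.
vertices = [
    (0.0, 0.0, 0.0),
    (1.0, 0.0, 0.0),
    (1.0, 1.0, 0.0),
    (0.0, 1.0, 0.0),
]

# Index list: every group of three indices describes one triangle.
indices = [0, 1, 2, 0, 2, 3]


def triangles(vertices, indices):
    """Expand the index list into a list of triangles (3-tuples of vertices)."""
    return [
        tuple(vertices[i] for i in indices[start:start + 3])
        for start in range(0, len(indices), 3)
    ]


mesh = triangles(vertices, indices)
print(len(mesh), mesh[1])
```

The two triangles reference six corners in total, but only four vertices are stored, since vertices 0 and 2 are shared.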

Depending on how the polygons will be rendered, different kinds of vertex data accompany the vertex position. Very common is the normal vector, which is needed for basic lighting. It is a vector which is orthogonal to the plane of the polygon and gives the polygon an upside and a downside. For more advanced rendering effects other data is needed, most commonly the tangent and binormal vectors. These are orthogonal to each other and to the normal vector, which makes them tangent to the plane of the polygon.

For skinned objects the vertex data is stored in what is called the bindpose. For humanoid objects the bindpose is called the T-pose because of the way humans are positioned with their arms straight out from their bodies. The purpose of the bindpose is to store the vertex data in a neutral pose, which the skinning can manipulate each frame.

3.3 Skeleton

Each skinned mesh has an associated skeleton. This skeleton is composed of a tree of bones all originating from a single root bone which is placed in the scene graph. Thus, traversing the scene graph will traverse each bone in the skeleton. The vertices in the mesh are associated with one or several bones in the skeleton, and when the bones move the associated vertices move according to how the influences of the bones are weighted. See Figure 3.1 for an example of a skeleton illustrated with its associated skinned mesh.

Figure 3.1: Example of a skinned mesh with skeleton, left: bindpose, right: animated.

The animations for the skinned objects are stored in keyframes, each positioning the bones of the skeleton in a specific pose. These keyframes have an associated timing starting at zero for the first keyframe. During an animation, at each frame, the current time is compared with the keyframe data to find the two keyframes closest to the current time in the keyframe timeline. These are then interpolated to find a pose for the skeleton which is a blend of the two keyframes’ poses.
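The keyframe lookup and blend can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `sample_pose` is hypothetical, each pose is reduced to one number per bone (for example a joint angle), and the blend is plain linear interpolation, whereas engines commonly interpolate rotations as quaternions.

```python
from bisect import bisect_right


def sample_pose(times, poses, t):
    """Blend the two keyframes surrounding time t.

    times: keyframe times in ascending order, starting at zero.
    poses: poses[i] is a list with one value per bone for keyframe i.
    """
    if t <= times[0]:
        return list(poses[0])
    if t >= times[-1]:
        return list(poses[-1])
    hi = bisect_right(times, t)  # first keyframe strictly after t
    lo = hi - 1                  # last keyframe at or before t
    alpha = (t - times[lo]) / (times[hi] - times[lo])
    return [(1 - alpha) * a + alpha * b for a, b in zip(poses[lo], poses[hi])]


key_times = [0.0, 1.0, 2.0]
key_poses = [[0.0, 10.0], [90.0, 20.0], [0.0, 10.0]]

print(sample_pose(key_times, key_poses, 0.5))
```

Times before the first or after the last keyframe are clamped to the nearest keyframe here; looping or extrapolating would be other reasonable choices.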


The bone positions from the interpolated keyframe data are relative to parent bones in the skeleton tree and must, after the entire skeleton is updated, be recalculated in absolute, or global, terms to be used in the deformation later on.

3.4 Culling

To render only what is shown on the screen, cullers are used to decide whether an object will be shown or not. If the object is not shown during the current frame there will be no need to deform the vertices and render the polygons included in that object’s meshes. One major culler is the view frustum culler, which culls all objects not inside the view frustum, the space visible from the camera. Other cullers can cull objects occluded by other objects and vertices which face away from the camera. There is a myriad of cullers, and they are executed at different times during the frame calculation to cull vertices as soon as possible and reduce the workload on later systems.
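A view frustum test is often done against a bounding volume rather than against every vertex. The sketch below assumes bounding spheres and a frustum given as planes with inward-pointing normals; the function name and the two-plane "frustum" (near and far plane only) are simplifications for illustration and are not taken from Gamebryo.

```python
def sphere_visible(center, radius, planes):
    """Return False if the sphere lies completely outside any frustum plane.

    planes: list of ((nx, ny, nz), d) with unit normals pointing into the
    frustum, so a point p is inside a plane when dot(n, p) + d >= 0.
    """
    for (nx, ny, nz), d in planes:
        distance = nx * center[0] + ny * center[1] + nz * center[2] + d
        if distance < -radius:
            return False  # entirely on the outside of this plane: cull
    return True


# Simplified frustum: only a near plane (z >= 1) and a far plane (z <= 100).
frustum = [
    ((0.0, 0.0, 1.0), -1.0),
    ((0.0, 0.0, -1.0), 100.0),
]

print(sphere_visible((0.0, 0.0, 50.0), 1.0, frustum))
print(sphere_visible((0.0, 0.0, -5.0), 1.0, frustum))
```

An object whose sphere is rejected here never reaches the deformation step, which is why the cullers are run first.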

3.5 Deformation

The deformation is the most costly step in the skinning animation, and is appropriately placed after most cullers. The deformation takes all vertex data, the bone matrices, and the weights of all bone influences for each vertex. Each vertex knows what position it should have relative to one or several bones. Each vertex also has a weight associated with each bone influencing it, and these weights always sum to one. The weighted mean of these relative vertex positions is calculated and stored as skinned vertex data. For example, vertices in the middle of the forearm of a humanoid skinned mesh are influenced only by the forearm bone in the skeleton. The vertices positioned at the elbow are influenced both by the forearm and by the upper arm, creating a seamless bend in the mesh at the elbow.
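The weighted mean described above can be sketched for a single vertex. This is a minimal illustration and not the Gamebryo kernel: the function names are hypothetical, and each 3x4 skinning matrix is assumed to already combine the bone's current global transform with the transform from bindpose into bone space.

```python
def transform_point(m, p):
    """Apply a 3x4 matrix (rotation/scale columns plus translation) to a point."""
    x, y, z = p
    return tuple(m[r][0] * x + m[r][1] * y + m[r][2] * z + m[r][3] for r in range(3))


def skin_vertex(bind_pos, influences, skin_matrices):
    """Weighted mean of the vertex as moved by each influencing bone.

    influences: list of (bone_index, weight); the weights sum to one.
    """
    out = [0.0, 0.0, 0.0]
    for bone, weight in influences:
        moved = transform_point(skin_matrices[bone], bind_pos)
        for axis in range(3):
            out[axis] += weight * moved[axis]
    return tuple(out)


# Two hypothetical bones: one that has not moved and one lifted 2 units in y.
IDENTITY = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
LIFTED = [[1, 0, 0, 0], [0, 1, 0, 2], [0, 0, 1, 0]]

# A vertex at the joint, influenced equally by both bones.
print(skin_vertex((1.0, 0.0, 0.0), [(0, 0.5), (1, 0.5)], [IDENTITY, LIFTED]))
```

The vertex ends up halfway between the two bones' opinions of where it should be, which is what produces the smooth bend at a joint. The same loop runs independently for every vertex, which is what makes the step a candidate for data-parallel execution.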

3.6 Rendering

Rendering is the process that actually draws the pixels on the screen. It takes the vertices which up until now have been in world space and transforms them to screen space. It is also responsible for applying all effects such as lighting. This step is always performed on the hardware on the GPU, which is highly parallelized in a hierarchical design for vertex and pixel processing. Traditionally the deformation step is also performed on the GPU as it is composed of matrix and vector operations which the GPU is specialized in.


Chapter 4

Parallelized computation on the Playstation 3

This in-depth study was done to gain understanding of how the Playstation 3 works and delves into its hardware, mainly the CELL processor. From there it continues to a survey on how leading Playstation 3 developers and SCE’s own teams are utilizing the unique architecture to create games today.

4.1 Parallel processors

Processors can be divided into four distinct categories with regard to their parallel computation capabilities. The categories state whether the processor can apply a single instruction or multiple instructions each cycle, and whether it applies them to a single piece or multiple pieces of data. This division is called Flynn’s Taxonomy and was proposed by Michael J. Flynn in 1966. The first processor category is SISD, Single Instruction Single Data. This is a processor that each cycle applies a single instruction to a single piece of data. This is often what is thought of when speaking generally of processors.

The second processor category is MISD, Multiple Instruction Single Data. Such processors perform multiple instructions on a single piece of data each cycle. This is not a widely used architecture, since SIMD and MIMD perform most parallel tasks much more efficiently. One could claim that all pipelined processors belong to the MISD category, but since each pipeline stage changes the data, the different instructions are not being applied to the same data.

The third processor category is SIMD, Single Instruction Multiple Data. Such processors perform a single instruction on several different pieces of data each cycle. This kind of processor exploits data level parallelism, which distributes data to different processing units. This is in contrast to task level parallelism, which distributes different executing threads to different processing units, such as in multi-core SISD processors. Modern specialized graphics hardware is an example of SIMD processing, where the input data, which is to be rendered, is divided in each rendering step between a larger and larger number of smaller and smaller specialized processing units.


The fourth processor category is MIMD, Multiple Instruction Multiple Data. Such processors perform multiple instructions on several different pieces of data each cycle. Almost all modern supercomputers are clusters of this kind of processor. This category almost overlaps with having several different processors, which raises a design question: should the processing units share a common memory, or should memory be distributed among them? Most common is a combination of the two, with different levels of the memory hierarchy being shared differently among the processors.

4.2 The Playstation 3 processor and memory layout

The Playstation 3 features two processors, the CELL[9] and the RSX[9], see Figure 4.1. The RSX ’Reality Synthesizer’ was developed by Nvidia and SONY specifically for the Playstation 3 and is based on the NV47 chip from Nvidia’s GeForce 7800 architecture[10]. It is the graphics accelerator of the Playstation 3. It will only be briefly discussed, as it does not feature the unique parallel qualities which this in-depth study covers and plays a peripheral part in the thesis itself.

The main CPU on the Playstation 3 is the CELL Broadband Engine (CELL), which was co-developed by SONY, Toshiba, and IBM. The development started in 2001 and the Playstation 3, released in 2006, was the first commercial product featuring it. The CELL has a unique architecture which incorporates both MIMD and SIMD principles. The main processing unit is the POWER Processing Element (PPE), which incorporates the PowerPC Processing Unit (PPU). When the CELL is used as a standard SISD processor, this is the unit that performs all computation. The communication between all elements in the CELL is done through the Element Interconnect Bus (EIB). The EIB is a ring structured bus with a theoretical bandwidth of 204.8 GB/s and a demonstrated bandwidth of 197 GB/s [11].

The EIB connects the Memory Interface Controller (MIC), which handles the main memory, to the CELL elements. Also connected to the EIB is the Flexible IO system (FlexIO), which acts both like a southbridge that connects to all peripherals and as a northbridge by connecting to the RSX. Additionally, eight Synergistic Processing Elements (SPEs) are present on the CELL chip, each one containing a single Synergistic Processing Unit (SPU). On the Playstation 3 only seven SPEs are enabled, to gain a higher manufacturing yield. One of these is used only by the operating system and is thus never available to developers[11].

At this point it seems that the CELL is a MIMD processor with shared memory through the EIB, but that does not give a complete picture, as each SPE features a 256 KB Local Store (LS), making the memory layout both shared and distributed. The LS should not be confused with an L2 cache, as utilizing it in such a manner is highly inefficient. The greatest performance on an SPU is achieved by transferring a chunk of data sized to fit the 256 KB LS using DMA and then letting the SPU work independently on the data. The SPU then does not have to stall for data fetches the way a processor stalls on L2 cache misses[11] [1].
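Working in LS-sized pieces means a large buffer has to be split into chunks before it is sent to an SPU. The sketch below only plans the chunk boundaries; it performs no DMA. The function name, the vertex size of 32 bytes, and the 64 KB data budget are assumptions for illustration. The budget is smaller than the full LS because the LS also has to hold the program code and any output buffers.

```python
LOCAL_STORE_BYTES = 256 * 1024  # size of the LS on each SPE


def plan_chunks(item_count, bytes_per_item, data_budget_bytes):
    """Split item_count items into (start, count) chunks that fit the budget."""
    if bytes_per_item > data_budget_bytes:
        raise ValueError("a single item does not fit in the data budget")
    per_chunk = data_budget_bytes // bytes_per_item
    return [
        (start, min(per_chunk, item_count - start))
        for start in range(0, item_count, per_chunk)
    ]


# Hypothetical mesh: 10000 vertices of 32 bytes each, with an assumed
# budget of 64 KB of the LS reserved for input data.
chunks = plan_chunks(10000, 32, 64 * 1024)
print(len(chunks), chunks[-1])
```

Each planned chunk would correspond to one transfer into the LS followed by independent work on that chunk. Choosing a budget of at most half the space available for data also leaves room to fetch the next chunk while the current one is being processed.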

The entire CELL architecture resembles a MIMD architecture, but each SPU is in itself a SIMD processor specialized in vector, floating point and integer operations, as well as having a complete instruction set for general purpose computations[11]. It becomes clear that the CELL processor is quite different from almost all other commercial multi-core processors, which feature either a SIMD or a MIMD architecture. This introduces many new parallelization opportunities as well as many challenges.

[Figure: block diagram of the CELL Broadband Engine. Eight SPEs (each an SPU with its LS), the PPE (PPU with L1 and L2 cache), the MIC (to memory), and FlexIO (to the peripherals and the RSX) are connected through the EIB.]

Figure 4.1: A simplified view of the Playstation 3 processor and memory layout.

4.3 Leading industry Playstation 3 SPU utilization

There is certainly scientific research being performed on the parallel computational capabilities of the CELL processor, and quite often it is performed on clusters of Playstation 3 units. Others utilize its processing capabilities by distributing applications to Playstation 3 units all around the world in projects like Folding@Home and Rosetta@Home to solve protein folding problems, which require vast amounts of processing power. The in-depth study of this master’s thesis concerns game development: how market leading developers utilize the parallel capabilities of the Playstation 3. The three developers are Santa Monica Studio, Guerrilla Games, and Dice. Game development is a very competitive market, so it is safe to assume that not all information will be divulged by the developers to anyone outside the company, as it is most likely regarded as highly confidential. They do however still hold talks at game developer conferences, where they describe in general terms how they use the SPUs on the Playstation 3.


4.3.1 Santa Monica Studio

Santa Monica Studio is an internal SCE development studio and was started in 1999. At the Game Developers Conference in San Francisco, 2009, Jim Tilander and Vassily Filippov held a presentation regarding their then current game God of War III, which was later released in March 2010[4]. This presentation showed how they utilized SPUs in their engine.

They identified the three major stages in each frame as Simulation, Scene and Render.

They start out by showing how the frames can be double buffered by pipelining the stages between a multi-core processor and a GPU: rendering on the GPU and simulation and scene stages on separate cores. Since all stages are processed in parallel but for three different frames the total frame time is only bound by the most time-consuming stage.

On the Playstation 3 they argue for using the SPUs as helper CPUs to relieve both the PPU and the RSX. Even though an SPU might be slower than the processor it relieves, the total result is a gain due to the parallelization. The game is profiled and costly operations are identified as candidates for SPU migration. Having code that is easily moved between the PPU and SPU is essential for development time, so they point out that it is important to keep memory behaviour on the PPU limited so it can easily be swapped for DMA calls on the SPU. The RSX runs shader code, so that code will always need to be rewritten for the SPU. This approach enables on-demand optimization utilizing the SPUs.

4.3.2 Guerrilla Games

Guerrilla Games was founded in 2000 and has been a subsidiary of SCE since 2005. It is based in Amsterdam, is most famous for its Killzone series, and is currently developing the third installment. In February 2009, they released Killzone 2, and in March the same year they held a presentation at the Game Developers Conference in San Francisco on the rendering technology used in the game[8].

The presentation mostly concerns the deferred rendering and graphics buffer layout, but in the end they show how they utilize the SPUs to generate display lists. To generate the display lists they saw the choice between a double buffered approach or a ring buffered approach, where the display lists were generated just in time before the RSX needed them.

However, both methods present problems. Double buffering requires a lot of memory, and ring buffering requires a lot of synchronization between the CPU and GPU to prevent data not yet consumed from being overwritten. Moreover, on the Playstation 3 the synchronization cannot be achieved in an effective way, since the SPUs execute asynchronously and possibly out of order.

They instead opted for a dynamic memory block allocation system for the display lists and the rendering resources, having the RSX signal with a simple fence command when a block of memory is free to write. So when an SPU starts a new task, it goes over the blocks and looks for one which is marked free. When it finds one, it locks the block and goes on to perform whichever calculations the task includes. When the task is completed it writes back to the memory block, marks it free and continues to search for another free block. This works in perfect unison with the 256 KB LS on the SPEs, since each block in main memory can be made to fit exactly in the LS.
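The block bookkeeping can be sketched as a small state machine. This is an interpretation for illustration only, not code from the presentation: the class and method names are hypothetical, the sketch is single-threaded, and it adds an explicit "ready" state between the SPU write-back and the RSX fence, which the description above does not spell out.

```python
FREE, LOCKED, READY = "free", "locked", "ready"


class BlockPool:
    """Hypothetical pool of fixed-size memory blocks shared by SPUs and RSX."""

    def __init__(self, block_count):
        self.state = [FREE] * block_count

    def claim(self):
        """SPU side: find the first free block and lock it, or return None."""
        for index, state in enumerate(self.state):
            if state == FREE:
                self.state[index] = LOCKED
                return index
        return None

    def submit(self, index):
        """SPU side: the task has written its result back to the block."""
        assert self.state[index] == LOCKED
        self.state[index] = READY

    def fence(self, index):
        """RSX side: the block has been consumed and may be reused."""
        assert self.state[index] == READY
        self.state[index] = FREE


pool = BlockPool(2)
first = pool.claim()
second = pool.claim()
exhausted = pool.claim()   # no free block left
pool.submit(first)
pool.fence(first)
reused = pool.claim()      # the fenced block is available again
print(first, second, exhausted, reused)
```

On real hardware the lock would have to be taken atomically, since several SPUs search the pool at the same time.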


4.3.3 Dice

Dice was founded in 1988 under the name Digital Illusion by four students in a dorm room at Växjö University and later grew to become Sweden’s biggest game developer. In 2004 EA bought the company and it became EA Dice. Their current main franchise is the Battlefield series, which started in 2002 and continues to be a best seller worldwide with its current installment. At SIGGRAPH 2009 in New Orleans, Johan Andersson gave a presentation called Parallel Graphics in Frostbite - Current & Future, where he gave a glimpse of how their game engine worked, not just on the Playstation 3 but on parallel platforms in general[5].

They have designed the entire engine in terms of async jobs, and all cores execute these jobs, whether on the Playstation 3 with its two executing PPU threads and six SPUs, the XBOX 360 with 6 executing threads, or a PC with 2 to 8 executing threads. This view is very interesting as it compares the SPU setup of the Playstation 3 with other contemporary hardware setups. Some jobs are dependent on earlier jobs, and these dependencies can be used to generate a job graph for the entire engine. This job graph not only describes execution order but also shows sync points and how the workload is balanced at specific times during the execution of a frame. This easily shows bottlenecks, which can then be balanced out between all job consumers.
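How a job graph exposes execution order and sync points can be sketched by grouping jobs into "waves", where every job in a wave depends only on jobs in earlier waves. The function name and the job names are hypothetical and are not taken from the Frostbite presentation.

```python
def schedule_waves(dependencies):
    """Group jobs into waves that can each run in parallel.

    dependencies: dict mapping each job to the set of jobs it depends on.
    Returns a list of waves; the boundary between two waves is a sync point.
    """
    remaining = {job: set(deps) for job, deps in dependencies.items()}
    waves = []
    while remaining:
        ready = sorted(job for job, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("cyclic or unsatisfiable dependency")
        waves.append(ready)
        for job in ready:
            del remaining[job]
        for deps in remaining.values():
            deps.difference_update(ready)
    return waves


# Hypothetical jobs for one frame.
jobs = {
    "animate": set(),
    "cull": set(),
    "skin": {"animate", "cull"},
    "build_display_list": {"skin"},
}

print(schedule_waves(jobs))
```

A wave containing a single job, such as "skin" in this example, is a point where all other job consumers would be idle, which is the kind of bottleneck such a graph makes visible.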

4.4 Emergent

Emergent Game Technologies is a middleware developer, that is to say they only build the infrastructure other games need but no games themselves. This is contrary to most game engine vendors, as they are often part of a gaming company which utilizes and develops the engine for itself as well as selling it to other game developers. Their core product is the game engine Gamebryo and its toolset, a multi-platform engine which has been released in different versions since 1999 (between 1999 and 2005 by Numerical Design Limited).

4.4.1 Gamebryo

The Gamebryo game engine [2, 3] contains all components needed to create a game. Most notable among the engine components are a scene graph and a renderer for each supported console and version of DirectX. It also contains a physics framework, a sound system, and, for smaller projects, a base application which can be used to quickly set up a test environment, demo or small game. The engine is accompanied by several tools for creating graphical assets such as models, levels and animations. In the model creation tool’s exporter it is possible to specify whether a particular model should be hardware or software skinned, as each renderer supports both methods. Gamebryo also includes a streaming framework called Floodgate to take advantage of platforms which offer multi-core processors.

4.4.2 Floodgate

As one of the platforms supported is the Playstation 3, support for parallelized execution on the SPUs has been developed for the Playstation 3 specific parts of Floodgate. Unlike, for example, Insomniac Games’ approach to SPU utilization[6], Floodgate does not take direct control of all SPUs at launch, controlling synchronization and issuing itself. Instead Floodgate utilizes the thread library SPURS¹ developed by SCE. This allows other non-Gamebryo SPURS tasks to be executed alongside the tasks issued by Floodgate.

The main components of Floodgate are the kernel programs which are compiled and run on the SPUs. These accept streams of data as input and return an equally long stream of data as output. A task describes where to get the data for the input, where to store the data from the output, and which kernel is to be run on the SPU when the task is executed. Tasks can be chained together by setting one task’s output stream as the input stream of another, which Floodgate then automatically schedules to run in the logically correct order.

Tasks are stored in workflows which contain one or more linked tasks. These workflows are what is submitted to the SPUs by the streamprocessor, which handles the status polling and eventual return of each workflow. Each workflow is scheduled by the program by setting a specific priority, and also by the streamprocessor according to when the workflow is submitted and when its results are needed. Results which are needed earlier in the game loop are issued before those which are needed later, and within each such synchronization group the priority decides which workflows are issued first.

This explicit synchronization poses a non-trivial problem of avoiding stalls on the PPU by keeping it busy while the kernels are processing the streams on the SPUs. If the streamprocessor is asked for the result too soon, it will stall the PPU while waiting for the kernel to complete. Some reordering of the PPU code might be required to avoid stalls by keeping all processors busy at all times.
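The relationship between kernels, streams, tasks and workflows can be sketched conceptually. The sketch below is not the Floodgate API: the `Task` class and `run_workflow` function are hypothetical names, kernels are plain Python functions, and everything runs sequentially on one processor. It only shows how chaining one task's output to another task's input determines the execution order.

```python
class Task:
    """Hypothetical task: a kernel applied to every element of an input stream."""

    def __init__(self, kernel, source):
        self.kernel = kernel   # function applied to each stream element
        self.source = source   # a list (input stream) or another Task
        self.output = None     # output stream, filled in when the task has run


def run_workflow(tasks):
    """Run linked tasks so that a task only runs once its input stream exists."""
    pending = list(tasks)
    while pending:
        for task in pending:
            source = task.source
            if isinstance(source, Task):
                if source.output is None:
                    continue   # the producing task has not run yet
                data = source.output
            else:
                data = source
            task.output = [task.kernel(item) for item in data]
            pending.remove(task)
            break
        else:
            raise ValueError("a task depends on a task outside the workflow")
    return tasks


scale = Task(lambda v: v * 2, [1, 2, 3])
offset = Task(lambda v: v + 1, scale)   # chained: consumes the output of scale

run_workflow([offset, scale])           # submission order does not matter
print(offset.output)
```

The output stream of each task has the same length as its input stream, matching the description of the kernels above. In the real system the workflow would execute asynchronously on the SPUs, and the point at which the PPU asks for `offset.output` is where a stall could occur.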

4.5 Reflections

It is clear that the parallelization of code and full utilization of the SPUs is the single most important aspect of making Playstation 3 games faster and thus affording a deeper graphical expression, whether it be realism or non-realistic shading. The companies examined are all closely tied to SCE and are thus encouraged to divulge their thoughts on SPU utilization on one premise: if everyone shares their knowledge, all Playstation 3 games get better, which sells more Playstation 3 hardware, which in turn gives a bigger customer base, which enables even more sales for upcoming titles.

Sharing knowledge is weighed against the notion that gaming is a highly competitive market with top tier sales vastly outweighing all other game sales, so being number one often means success and everything else results in failure. This notion allows one to ponder how much they really divulge to each other about their cutting-edge technology, since this is a huge part of what sets their games apart from those of other companies and hopefully ranks them at the top.

Since most top tier games take several years to develop and remain a secret to the world until trademarks, domain addresses, and such are registered, and since all data found covers games which have already been released, I am confident in presuming that the data presented is in fact not their state-of-the-art research or, at most, only conceptually covers their current utilization of the Playstation 3 hardware.

¹ SPURS stands for "SPU Runtime System".

The same is true even for Emergent Game Technologies' Gamebryo Engine. The version Coldwood uses is somewhat outdated and it cannot be determined if their latest software, for example, schedules tasks in a more efficient manner than the version used. That being said, the insight I have gained from Gamebryo is much larger than what I have learned of other companies' engines, mainly because I have had access to the full source code, documentation, and samples for Gamebryo. I also had the opportunity to work with people who have used it for a long time in several commercial game titles.


Chapter 5

Accomplishment

The work proceeded as in most general software development: by finding the most time-efficient way to migrate the functionality without compromising the efficiency of the code.

5.1 Preliminaries

One prerequisite was, of course, to use the game engine already in use, Gamebryo 2.3, which does not include SPU-skinning functionality. The Gamebryo 2.4 engine does, but it takes a radically different approach to handling content, using mesh classes to which modifiers can be applied. So the SPU-kernel code was given, but the framework for using it was not.

However, the threading framework Floodgate of Gamebryo 2.4 was deprecated because of an update to the Playstation 3 firmware, so Floodgate had to be imported from an even newer Gamebryo version, 2.6, which was released after the firmware update.

5.2 Existing skinning frameworks

The conclusion of the in-depth study (see Section 4.5) was that implementing a new threading framework to utilize the EDGE¹ SPU API directly would be too time-consuming and would most likely not give a very noticeable gain in performance. An even lower level API called SPURS², the direct Playstation 3 API from the Playstation 3 SDK, is also available, but trying to re-invent the wheel on top of it would be even more time-consuming.

The smartest and definitely the fastest solution was determined to be importing the SPU-kernel code from Gamebryo 2.4 and Floodgate from Gamebryo 2.6 into Gamebryo 2.3, and then implementing an efficient modifier framework that applies the imported functionality to the Gamebryo 2.3 asset-handling code.

5.2.1 SPU skinning in Gamebryo 2.4

The first step was to analyse Gamebryo 2.4 by stepping through the executing code to see how the kernel jobs were issued and how the modifiers worked. The modifier of interest is the SkinningMeshModifier, which issues SPU jobs that calculate the bone matrices and then skin the model accordingly; see Figure 5.1.

¹ EDGE, or Efficiently Distributed renderinG Engine, was designed by SCE to be an example engine for the Playstation 3, but is now a collection of game engine parts available to licensed Playstation 3 developers.

² SPURS stands for "SPU Runtime System".

[Figure 5.1: sequence diagram across PPU, SPU and RSX. Participants: SceneGraph, Mesh, SkinningMeshModifier, WorkflowManager, CalculateBoneMatrices Task, Deform Task, Renderer. Call labels: Update; UpdateDownWardPass; SubmitTasks; AddRelatedTask(delayed): Calculate Bone Matrices Task Added to Workflow; FlushTaskGroup: Submit Outstanding Workflows to Floodgate; Tasks Executed Asynchronously; OnVisible: Mesh passed culling; CompleteTasks; Wait for task completion; SubmitTasks; AddRelatedTask(immediate): Deform Task Added to Workflow; RenderImmediate: Render the Mesh; RenderMesh.]

Figure 5.1: SPU-Skinning in Gamebryo 2.4.

The first time the SkinningMeshModifier is applied is during the update traversal of the scene graph. For each bone, which is represented by an AVObject (the base object for all scene graph nodes in Gamebryo), the location data is updated by an animation controller.

This data is part of the input to the CalculateBoneMatricesKernel, which calculates a transformation matrix for each bone, but since the data is relative to the parent object in the scene graph the kernel cannot be run until the entire scene graph is traversed. Luckily the global workflow manager allows tasks to be issued with a delay flag, which holds tasks in the workflow until a flush command is issued. This command is issued by the game loop itself after the update traversal is done, and causes the workflow manager to send the workflow with the CalculateBoneMatrices tasks to be executed on the SPUs.
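
The difference between delayed and immediate submission can be modelled in a few lines. The class below is a toy model with hypothetical names; it only records the order in which tasks would be handed over for execution and does not reproduce the Gamebryo workflow manager.

```python
from typing import Callable, List


class WorkflowManager:
    """Toy model of delayed versus immediate task submission (names are illustrative)."""

    def __init__(self) -> None:
        self.held: List[Callable[[], None]] = []
        self.log: List[str] = []

    def add_task(self, name: str, delayed: bool) -> None:
        run = lambda: self.log.append(name)
        if delayed:
            self.held.append(run)   # kept back until flush() is called
        else:
            run()                   # submitted right away

    def flush(self) -> None:
        """Submit every held task in the order it was added."""
        for run in self.held:
            run()
        self.held.clear()


manager = WorkflowManager()
# During the update traversal, one delayed task is added per skinned mesh.
for mesh in ("hero", "villain"):
    manager.add_task(f"bone_matrices:{mesh}", delayed=True)
held_before_flush = list(manager.log)
manager.flush()   # issued by the game loop once the traversal has finished
manager.add_task("deform:hero", delayed=False)
```

Nothing is submitted before the flush, so every bone has its final transform by the time the matrices are calculated.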

After the update of the scene graph is complete, a culler culls each item in the scene graph. If a skinned mesh is deemed visible by the culler, it asks the SkinningMeshModifier, which asks the workflow manager, which in turn asks the kernel whether it has completed all the tasks in the workflow. This operation stalls the PPU, so it is important to give the PPU enough work to complete before trying to get the result from the SPUs. Once the kernel is complete, the SkinningMeshModifier issues a deformation task to the workflow manager; this task can be issued to the SPUs instantly, so the delayed flag is not set. The deformation task deforms each vertex, normal, binormal, and tangent (hereafter referred to as vertex data) according to the weight of each bone influencing that particular vertex data.
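
A weighted deformation of this kind is commonly formulated as linear blend skinning, where each bone transforms the bindpose position and the results are blended by weight. The sketch below assumes that formulation and handles positions only; the function names and the 3x4 matrix layout are illustrative and are not taken from the Gamebryo kernel.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]            # 3x4 affine matrix: rotation part plus translation column
Vec3 = Tuple[float, float, float]


def transform(m: Matrix, v: Vec3) -> Vec3:
    """Apply a 3x4 affine matrix to a point."""
    return tuple(m[r][0] * v[0] + m[r][1] * v[1] + m[r][2] * v[2] + m[r][3] for r in range(3))


def skin_vertex(v: Vec3, influences: Sequence[Tuple[int, float]], bones: Sequence[Matrix]) -> Vec3:
    """Blend the bone-transformed positions by their weights (weights sum to 1)."""
    out = [0.0, 0.0, 0.0]
    for bone_index, weight in influences:
        p = transform(bones[bone_index], v)
        for axis in range(3):
            out[axis] += weight * p[axis]
    return tuple(out)


identity = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
shift_x = [[1.0, 0.0, 0.0, 2.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]

# A vertex influenced equally by a resting bone and a bone moved 2 units along x.
skinned = skin_vertex((1.0, 1.0, 0.0), [(0, 0.5), (1, 0.5)], [identity, shift_x])
```

The vertex ends up halfway between the two bone-transformed positions, which is the smooth behaviour wanted around joints.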

When it becomes time to render the mesh at the end of the game loop, each mesh waits for the completion of its deform task and then sends the deformed vertex data to the renderer, which displays it on the screen. This is possible since each mesh has both bindpose vertex data, which is sent to the deformation kernel each frame, and regular vertex data, which the kernel writes back to.

5.2.2 Skinning in Gamebryo 2.3

[Figure 5.2: sequence diagram across PPU and RSX. Participants: SceneGraph, Geometry, SkinInstance, Renderer. Call labels: Update; UpdateDownWardPass; CalculateBoneMatrices; RenderImmediate: Render the Geometry; Deform; RenderGeometry.]

Figure 5.2: Software Skinning in Gamebryo 2.3.

The SPU-skinning solution in Gamebryo 2.4 utilizes a large set of functionality which does not exist in Gamebryo 2.3, and the features in the two versions are implemented very differently. So the next step was to compare the SPU-skinning in Gamebryo 2.4 with the skinning in Gamebryo 2.3. In Gamebryo the skinning of each geometry can be chosen to be done either by software or hardware skinning; see Figures 5.2 and 5.3.

As in Gamebryo 2.4, the update traversal of the scene graph in Gamebryo 2.3 is performed at the beginning of each game loop. When it comes across a skinned geometry it tells the renderer to perform the PPU-side operation CalculateBoneMatrices to calculate the bone matrices for that geometry. A culler is used in Gamebryo 2.3 too, but it is not important to the skinning and is therefore omitted from the figures.
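
What a bone matrix calculation typically involves can be sketched as follows. The sketch assumes the common formulation in which each bone's parent-relative transform is composed into world space and then multiplied by the inverse of the bone's bindpose transform; whether Gamebryo's CalculateBoneMatrices is organized exactly this way is not stated here, and all names are hypothetical.

```python
from typing import List, Optional

Matrix = List[List[float]]  # 4x4, row-major


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two 4x4 matrices."""
    return [[sum(a[r][k] * b[k][c] for k in range(4)) for c in range(4)] for r in range(4)]


def translation(x: float, y: float, z: float) -> Matrix:
    """Build a 4x4 translation matrix."""
    return [[1.0, 0.0, 0.0, x], [0.0, 1.0, 0.0, y], [0.0, 0.0, 1.0, z], [0.0, 0.0, 0.0, 1.0]]


def bone_matrices(parents: List[Optional[int]], local: List[Matrix],
                  inverse_bind: List[Matrix]) -> List[Matrix]:
    """Compose parent-relative transforms into world space, then map from bind pose.

    Assumes every parent appears before its children in the lists.
    """
    world: List[Matrix] = []
    for i, parent in enumerate(parents):
        world.append(local[i] if parent is None else matmul(world[parent], local[i]))
    return [matmul(world[i], inverse_bind[i]) for i in range(len(parents))]


# Two-bone chain: the child sits 1 unit above the root in the bind pose.
parents = [None, 0]
inverse_bind = [translation(0.0, 0.0, 0.0), translation(0.0, -1.0, 0.0)]
# This frame the root has moved 3 units along x; the child keeps its local offset.
local = [translation(3.0, 0.0, 0.0), translation(0.0, 1.0, 0.0)]
skin = bone_matrices(parents, local, inverse_bind)
```

Because the child inherits the movement of its parent, both resulting matrices move their vertices 3 units along x. This dependency on the parent is why the matrices can only be calculated once the parent transforms are final.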

Later in the game loop, when it is time to render the geometry, either the software or the hardware skinning is called. For software skinning each geometry tells its skin instance to deform each vertex data according to the positions of the bones that influence that vertex data. Since the geometry class only contains bindpose data, the result of the deformation is written directly to the RSX buffer, from where it is sent to the renderer to be drawn on the screen. The renderer renders the data it receives without modification.

[Figure 5.3: sequence diagram across PPU and RSX. Participants: SceneGraph, Geometry, Renderer. Call labels: Update; UpdateDownWardPass; CalculateBoneMatrices; RenderImmediate: Render the Geometry; RenderSkinnedGeometry.]

Figure 5.3: Hardware Skinning in Gamebryo 2.3.

The hardware skinning in Gamebryo 2.3 is even simpler on the PPU side, hiding the more complex parts on the RSX. The calculation of the bone matrices is performed identically to the software skinning. However, when it becomes time to render, a different render call to the RSX is used, which sends the bindpose vertex data of the geometry along with all the bones influencing the vertex data. The deformation is then performed on the RSX. This is a bad design choice for the Playstation 3, even though the RSX has specialized hardware for performing the vector and matrix operations needed by the deformation operation, because the RSX is tasked with all the shader code for a game. When there are SPUs available they should be utilized for exactly these kinds of operations, relieving the PPU and RSX of vector and matrix heavy operations. This skinning design is, however, the natural choice for the standard setup with CPU and GPU, e.g. on a PC or Xbox 360, where the GPU is many times faster than the CPU at dealing with any vector and matrix heavy operations.

5.2.3 Proposed SPU-skinning solution

The game had previously used the hardware implementation. To integrate SPU-skinning in Gamebryo 2.3, either one of the existing skinning methods would need to be modified to perform SPU-skinning, or a third variant would need to be implemented. To introduce as few potential errors as possible, it was decided that making few major changes and additions to the game engine's code-base was the best approach. Implementing a third rendering state would require changes throughout the code-base, so modifying an existing rendering state was chosen. Since the software skinning was not being used, it was the obvious candidate to modify. This also makes it possible to have both implementations, SPU-based and hardware-based, running side by side in the game, leaving it up to the designers to choose which method to use for each element. Another advantage of modifying the software skinning into SPU-skinning is that no changes to the RSX code are needed, since it already only renders the vertices it is given.

The functionality marked with green in Figure 5.4 must be imported from Gamebryo 2.4, the functionality marked with yellow must be heavily modified, and the functionality marked with red must be implemented from scratch. The threading framework Floodgate from Gamebryo 2.6 had to be imported into the solution due to compatibility issues with the Playstation 3 SDK, but the kernels and all the needed specialized classes could be imported from Gamebryo 2.4 since Floodgate showed no noticeable functional difference between versions 2.4 and 2.6.

The main thing lacking from Gamebryo 2.3 which was needed for utilizing the imported SPU skinning was a framework that could feed the data to the kernels each frame. It would need to be attachable to geometries in the same way as SkinningMeshModifiers were attachable to meshes in Gamebryo 2.4, but also incorporate the functionality of meshes which is lacking in the geometry classes in Gamebryo 2.3. The major lacking functionality was a straight memory layout for bone data, such as weights and indices. These are originally placed in a tree structure in the skin instance bone data, making it impossible to send them as a constant stream to the skinning kernel. Thus they had to be extracted, straightened and stored in the kernel feeder at load time for each geometry.
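
The straightening step can be sketched as follows. The input layout (a list of vertex and weight pairs per bone) and the fixed number of four influences per vertex are assumptions made for the example; the actual Gamebryo structures and the kernel's expected layout may differ.

```python
from typing import Dict, List, Tuple

MAX_INFLUENCES = 4  # assumed fixed number of bone slots per vertex


def flatten(per_bone: Dict[int, List[Tuple[int, float]]], vertex_count: int):
    """Turn {bone: [(vertex, weight), ...]} into flat per-vertex index and weight arrays."""
    indices = [0] * (vertex_count * MAX_INFLUENCES)
    weights = [0.0] * (vertex_count * MAX_INFLUENCES)
    used = [0] * vertex_count  # slots already filled for each vertex
    for bone in sorted(per_bone):
        for vertex, weight in per_bone[bone]:
            slot = used[vertex]
            if slot >= MAX_INFLUENCES:
                raise ValueError(f"vertex {vertex} has too many influences")
            indices[vertex * MAX_INFLUENCES + slot] = bone
            weights[vertex * MAX_INFLUENCES + slot] = weight
            used[vertex] += 1
    return indices, weights


# Bone 0 drives vertex 0 fully and vertex 1 partly; bone 1 drives the rest of vertex 1.
per_bone = {0: [(0, 1.0), (1, 0.25)], 1: [(1, 0.75)]}
indices, weights = flatten(per_bone, vertex_count=2)
```

Unused slots keep a weight of zero, so a kernel can read a constant number of entries per vertex without branching. Since the result does not change between frames, it only has to be built once at load time.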

Heavy modifications were also needed in classes already used in Gamebryo 2.3 and in the deformation kernel from 2.6. The geometry classes needed to allow the kernel feeder mentioned previously to be attached. The updating of the geometry classes needed to be modified to issue CalculateBoneMatrices tasks through the kernel feeder instead of calculating the matrices in the renderer class. When a culler deems a geometry visible, the geometry needs to complete that task and then issue a deformation task with the result of the CalculateBoneMatrices kernel together with the vertex data and the bone data as input.

In Gamebryo 2.3 the deform method in the skin instance deforms the different streams of vertex data and braids them into the format that the RSX wants. However, it is possible to write directly to the graphics buffer on the RSX from the deformation kernel, so the deformation kernel had to be modified to write a single braided output stream and also to take new input data for texture coordinates, as they were also needed in the stream. The texture coordinates could not be set up just once as in the original code, because when the LS of an SPU is written back from the kernel the entire memory block is copied to its destination; any holes in the stream not written would contain trash data. Any data that was needed in the braided vertex data stream in the graphics buffer therefore had to go through the deformation kernel.
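
Braiding, that is interleaving, separate attribute streams into a single vertex stream can be sketched with the standard `struct` module. The vertex layout used here (position, normal, one texture coordinate) is an assumption for the example and is not the format used by the title.

```python
import struct

# Assumed vertex layout: position (3 floats), normal (3 floats), texture coordinate (2 floats).
VERTEX_FORMAT = ">3f3f2f"   # big-endian, as on the Cell; the layout itself is illustrative
VERTEX_SIZE = struct.calcsize(VERTEX_FORMAT)


def braid(positions, normals, texcoords) -> bytes:
    """Interleave separate attribute streams into one packed vertex stream."""
    if not (len(positions) == len(normals) == len(texcoords)):
        raise ValueError("all streams must describe the same number of vertices")
    out = bytearray()
    for p, n, t in zip(positions, normals, texcoords):
        # Every field of every vertex is written, so the block has no unwritten holes.
        out += struct.pack(VERTEX_FORMAT, *p, *n, *t)
    return bytes(out)


positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
normals = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]
texcoords = [(0.0, 0.0), (1.0, 0.0)]
stream = braid(positions, normals, texcoords)
second = struct.unpack_from(VERTEX_FORMAT, stream, VERTEX_SIZE)
```

The texture coordinates are passed through unchanged, yet they still have to be part of the output, because the whole block is copied to its destination.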

5.3 How the work was done

A small test environment was created which only included Gamebryo and some helper classes that quickly start up the environment for testing and creating smaller projects. Firstly, all software skinning was stripped from the engine and then the system was implemented for the engine's Microsoft Windows environment. When that was running, a switch to the Playstation 3 environment was made, still in the test environment. When that ran successfully it could be established that Gamebryo now provided the SPU-skinning, and the work continued with making the real game take advantage of the new functionality. During the development on the Playstation 3, profiling was performed to find and fix bottlenecks.


[Figure 5.4: sequence diagram across PPU, SPU and RSX. Participants: SceneGraph, Geometry, TestKernelFeeder, WorkflowMgr, CalcBoneMatrices, Deform, Renderer. Call labels: Update; UpdateDownWardPass; SubmitTasks; AddRelatedTask(delayed): Calculate Bone Matrices Task Added to Workflow; FlushTaskGroup: Submit Outstanding Workflows to Floodgate; Tasks Executed Asynchronously; OnVisible: Geometry passed culling; CompleteTasks; Wait for task completion; SubmitTasks; AddRelatedTask(immediate): Deform Task Added to Workflow; RenderImmediate: Render the Geometry; RenderGeometry; Write to G-buffer.]

Figure 5.4: SPU-skinning solution proposed for Gamebryo 2.3. With regard to Gamebryo 2.3, the colors of the functionalities have the following meaning. White: unmodified; yellow: the functionality is modified; red: new implementation; green: imported from Gamebryo 2.4.


Chapter 6

Results

Performance analysis was performed both in the test environment and in the actual title.

The results in the test environment show how well the parallelization increased performance, and the results from the actual game show what kind of contribution the system gave to the final product.

6.1 Performance in a stand alone demo

In the test environment the application was almost entirely made up of animations: an entirely blank background with 16 non-textured characters with 80 bones each, running a looping animation over and over. The characters were of the same kind as the main characters in the actual game and contained 30428 vertices each, which formed 47184 triangles. So the system would calculate 1280 bone matrices and then deform 486848 vertices. Since the hardware skinning had not been removed, two meshes could be created, one for each skinning system, and the implementations could be examined head-to-head in the same application. With hardware skinning the application ran at 17 fps and with the new SPU-skinning it ran at 61 fps, so the absolute speedup¹ was 358%. As mentioned in the in-depth study there are 7 guaranteed SPUs on the Playstation 3, one of which is claimed by the operating system. Another is claimed by a different system running in Gamebryo, which leaves 5 SPUs. So it is clear that each single SPU is not as fast as the RSX, but combined they outperform it, even when working with a system whose goal is to fill the graphics buffer as fast as possible, work that the RSX was specifically designed for.
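
The figures in this section can be re-derived from the stated inputs. The short script below only repeats that arithmetic; the frame times it computes are derived from the reported frame rates and are not separately measured values.

```python
CHARACTERS = 16
BONES_PER_CHARACTER = 80
VERTICES_PER_CHARACTER = 30428

bone_matrices_total = CHARACTERS * BONES_PER_CHARACTER
vertices_total = CHARACTERS * VERTICES_PER_CHARACTER

FPS_HARDWARE = 17
FPS_SPU = 61
speedup_ratio = FPS_SPU / FPS_HARDWARE          # frame-rate ratio, about 3.59

# Frame times in milliseconds, derived from the frame rates.
frame_ms_hardware = 1000 / FPS_HARDWARE
frame_ms_spu = 1000 / FPS_SPU

# SPUs left for skinning: 7 guaranteed, minus the OS, minus one used elsewhere in Gamebryo.
spus_for_skinning = 7 - 1 - 1
```

The ratio of the two frame rates is what the text reports as a speedup of 358%.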

6.2 Performance gain in the actual game

From the start of the development of the game up until the very last stages of testing and performance tweaking, the SPUs were hardly utilized at all. One of the 6 available SPUs was allocated by Gamebryo, but the other five were mostly idle. When the SPU-skinning solution was about to be merged into the game, a new fullscreen MLAA² algorithm had been implemented which used all the SPUs. The algorithm was originally intended for a game which runs at a lower frame rate than Coldwood's title, which has to run at 60 fps for its advanced physics to work properly. As a result, all the SPUs became occupied for almost the entire frame. There was simply no time to perform skinning for the large main character models at the same time as the MLAA was running. There were, however, a number of smaller models in each scene which could be skinned instead. There was no time to skin them on the RSX, but they had few enough bones and vertices to be skinned next to the MLAA on the SPUs without creating a stall.

¹ Absolute speedup is determined by comparing with the execution time of the best sequential algorithm.

² Morphological Anti-Aliasing.

Before the MLAA was introduced, however, the main characters were for a short while skinned with the new SPU skinning solution, and a comparison was made between the old hardware skinning method and the new SPU method for the main characters. The difference between these actual characters and the versions in the test environment is that in the actual game the models have textures, normals, binormals, and tangents, which also need to be deformed. On average the frame rate increased from 61.7 fps to 65.25 fps, which is a 5.7% increase.
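
Since the title has to hold 60 fps, the gain is perhaps easier to judge as frame time than as frame rate. The script below converts the two reported averages; the frame times and the headroom are derived values under the assumption of a fixed 60 fps budget, not measurements.

```python
FPS_BEFORE = 61.7    # hardware skinning of the main characters
FPS_AFTER = 65.25    # SPU skinning of the main characters
TARGET_FPS = 60      # frame rate the title has to hold

relative_gain = (FPS_AFTER - FPS_BEFORE) / FPS_BEFORE

frame_ms_before = 1000 / FPS_BEFORE
frame_ms_after = 1000 / FPS_AFTER
saved_ms = frame_ms_before - frame_ms_after

# Headroom against the 60 fps budget of 1000 / 60 ms per frame.
budget_ms = 1000 / TARGET_FPS
headroom_before_ms = budget_ms - frame_ms_before
headroom_after_ms = budget_ms - frame_ms_after
```

The saving is a little under one millisecond per frame, which is small in absolute terms but a sizeable share of the margin left under the 60 fps budget.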


Chapter 7

Conclusions

The solution works very well in my opinion, and even though it was not used to its full capacity in the title it still contributes to the title in a smaller way. The actual performance is irrefutable: a speedup of 358% over hardware skinning in a test environment leaves no room for speculation on the value of parallelized code in gaming. All required goals have been reached, and thanks to the Floodgate library all but one of the optional goals have been met; the exception is the introduction of dual quaternion skinning. I regard this as a complete success.

7.1 Limitations

The models which were created for this new SPU skinning were exported as software skinned models, which prevented the exporter from stripifying¹ the models. A possible solution would be to rewrite parts of the exporter to enable stripifying for software skinned models, or to export them as hardware skinned models and create a workaround in the program to identify and perform SPU skinning on those specific hardware skinned models only.
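
Why strips reduce the size of the model data can be shown with a small sketch that expands a strip back into separate triangles. The function name is hypothetical and the winding convention (flipping every second triangle) is the usual one for triangle strips, assumed here rather than taken from the exporter.

```python
from typing import List, Tuple


def strip_to_triangles(strip: List[int]) -> List[Tuple[int, int, int]]:
    """Expand a triangle strip into individual triangles with consistent winding."""
    triangles = []
    for i in range(len(strip) - 2):
        a, b, c = strip[i], strip[i + 1], strip[i + 2]
        # Every second triangle is flipped so that all triangles face the same way.
        triangles.append((a, b, c) if i % 2 == 0 else (b, a, c))
    return triangles


strip = [0, 1, 2, 3, 4, 5]
triangles = strip_to_triangles(strip)
indices_as_list = 3 * len(triangles)   # indices needed without strips
indices_as_strip = len(strip)          # indices needed with one strip
```

A strip of n triangles needs n + 2 indices instead of 3n, so models that cannot be stripified carry noticeably more index data.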

7.2 Future work

Another kind of animation is blendshape animation, which combines different pre-posed shapes to achieve animation. This is performed on the RSX in Gamebryo 2.3, but an SPU solution exists in later versions. It should be possible to migrate that functionality using much the same method as for the SPU smooth skinning, but with a slightly modified kernel feeder. Blendshape animation is used in the title for facial animations. Another thing that could be done is to modify the deformation kernel to perform dual quaternion skinning, a slightly more expensive algorithm that removes almost all of the anomalous artifacts that come from twisting bones too much.
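
As an illustration of blendshape animation, the sketch below uses the common formulation in which each pre-posed shape contributes its weighted offset from a base shape. This formulation, the function name and the example shapes are assumptions; the Gamebryo implementation is not described here.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def blend(base: Sequence[Vec3], targets: Sequence[Sequence[Vec3]],
          weights: Sequence[float]) -> List[Vec3]:
    """Add the weighted offset of every target shape to the base shape."""
    result = []
    for i, vertex in enumerate(base):
        moved = list(vertex)
        for target, weight in zip(targets, weights):
            for axis in range(3):
                moved[axis] += weight * (target[i][axis] - vertex[axis])
        result.append(tuple(moved))
    return result


neutral = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
smile = [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
frown = [(0.0, -1.0, 0.0), (1.0, -1.0, 0.0)]
face = blend(neutral, [smile, frown], [0.5, 0.0])
```

Like skinning, this is a per-vertex operation over straight streams of data, which is why a kernel feeder of much the same kind should be able to serve it.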

¹ Stripifying is the process of storing all triangles in strips so that each new triangle reuses the last two vertices and adds only one new vertex, which helps reduce the size of the model data.


Acknowledgements

Thanks to:

Christopher Holmberg Andreas Asplund

Olof Häggström Stefan Johansson

Dick Adolfsson

The rest of the Coldwood team


