© 2017 Arm Limited

Johan Grönqvist MCC 2017 - Uppsala

Hardware-Software Co-design at Arm GPUs


About Arm


Arm Mali GPUs: The World’s #1 Shipping Graphics Processor

Mali GPUs are in:

~50% of mobile VR…

~80% of DTVs…

~50% of smartphones

1Bn Mali GPUs shipped in 2016

151 total Mali licenses

21 Mali video and display licenses

Mali graphics based IC shipments (units):

2012: 150m

2013: 400m

2014: 550m

2015: 750m

2016: 1Bn


Arm


Graphics


Levels of Abstraction

Standardized APIs allow separation of concerns

Graphics APIs

HW hidden: ISA, memory hierarchy

Portability

HW specific optimizations in the application

Application specific optimizations in the driver

[Figure: graphics software stack, divided into user space and kernel space. Blocks: application, user-space graphics driver, kernel-space graphics driver, OS graphics memory manager, display kernel driver, display subsystem integration, GPU hardware, display subsystem.]


A Graphics Pipeline

Vertices, triangles, fragments and pixels

[Figure: pipeline stages. CPU: application. GPU: vertex shader, primitive assembly & culling, rasterization, fragment shader, ZS test & blending, framebuffer.]


A Graphics Demo by Arm

Ice Cave, created by Arm in 2015 to show what kinds of effects and level of detail a mobile GPU is capable of.

https://www.youtube.com/watch?v=gsyBMHJVhXA


Pipeline & Optimizations


A Schematic OpenGL Implementation

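def shade(vertices, vertex_shader, fragment_shader, blend_shader, vertex_data):
    # Vertex shading: transform every vertex and the data attached to it.
    shaded_vertices, shaded_data = [], []
    for v, vd in zip(vertices, vertex_data):
        v_out, vd_out = vertex_shader(v, vd)
        shaded_vertices.append(v_out)
        shaded_data.append(vd_out)
    # Primitive assembly: group the shaded vertices into triangles.
    triangles, triangle_data = primitive_assembly(shaded_vertices, shaded_data)
    for t, td in zip(triangles, triangle_data):
        # Rasterization: generate the fragments covered by the triangle.
        fragments, fragment_data = rasterize_triangle(t, td)
        for f, fd in zip(fragments, fragment_data):
            depth, color = fragment_shader(f, fd)
            # pos is the pixel position covered by fragment f and output is
            # the framebuffer; the slide leaves both implicit.
            if is_visible(depth, pos):
                output[pos] = blend_shader(output[pos], color)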


Programmable Parts

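The shaders passed in as arguments (vertex_shader, fragment_shader, blend_shader) are programmable. Everything else is fixed function.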


Parallelization Opportunities

We have many shading jobs and compute jobs running at the same time
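One such opportunity can be sketched as follows, on the assumption (as in the schematic implementation) that each vertex is shaded independently of every other vertex. The toy vertex_shader, the helper names, and the thread pool are illustrative assumptions that stand in for hardware threads; this is not a description of how a GPU schedules its work.

```python
from concurrent.futures import ThreadPoolExecutor


def vertex_shader(v, vd):
    # Toy shader (hypothetical): translate the vertex by the offset in vd.
    x, y = v
    dx, dy = vd
    return (x + dx, y + dy), vd


def shade_vertices_serial(vertices, vertex_data):
    # Reference: the sequential vertex loop of the schematic implementation.
    return [vertex_shader(v, vd) for v, vd in zip(vertices, vertex_data)]


def shade_vertices_parallel(vertices, vertex_data, workers=4):
    # No vertex depends on another, so each loop iteration can be handed to
    # an independent worker. Executor.map returns results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(vertex_shader, vertices, vertex_data))


vertices = [(0, 0), (1, 0), (0, 1), (1, 1)]
vertex_data = [(10, 20)] * len(vertices)
serial = shade_vertices_serial(vertices, vertex_data)
parallel = shade_vertices_parallel(vertices, vertex_data)
```

The same reasoning applies to the fragment loop, whereas blending into output[pos] has to respect the order in which fragments arrive at a given pixel.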


Early Visibility Testing

Dependency tracking becomes tricky with visibility tests in many places


Move the visibility test up, ahead of the fragment shader, so that hidden fragments are not shaded.
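As a rough illustration of testing visibility before the fragment shader runs, the sketch below counts shader invocations with and without an early test. It assumes that each fragment's depth is known before shading and that smaller depth means closer; render, the fragment tuples, and the toy fragment_shader are hypothetical names, not part of any real API.

```python
import math


def render(fragments, fragment_shader, early_test):
    # fragments: list of (pos, depth, payload); smaller depth is closer.
    depth_buffer, output, shaded = {}, {}, 0
    for pos, depth, payload in fragments:
        if early_test and depth >= depth_buffer.get(pos, math.inf):
            continue  # hidden: killed before any shading work is done
        color = fragment_shader(payload)
        shaded += 1
        if depth < depth_buffer.get(pos, math.inf):  # late test
            depth_buffer[pos] = depth
            output[pos] = color
    return output, shaded


def fragment_shader(payload):
    # Toy shader (hypothetical): the payload already is the color.
    return payload


# Three fragments on the same pixel, drawn front to back.
frags = [((0, 0), 1.0, "near"), ((0, 0), 2.0, "mid"), ((0, 0), 3.0, "far")]
late_output, late_count = render(frags, fragment_shader, early_test=False)
early_output, early_count = render(frags, fragment_shader, early_test=True)
```

Both runs produce the same image, but the early test shades one fragment instead of three. The early test is only safe when depth is known before shading; shaders that write depth or discard fragments force the test to happen later, which is where the dependency tracking becomes tricky.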


GPUs


Arm Bifrost GPU Architecture

[Figure: Bifrost GPU block diagram. Labels: Job Manager, Tiler, MMU, L2 cache, APB, IRQ, and shader cores SC0, SC1, SC3, SC4, SC5, ...]


A Bifrost Core

[Figure: block diagram of a Bifrost core in four functional groups: thread management, visibility & color buffer, memory ops, and instruction execution. Labelled blocks: compute front-end and fragment front-end (each with a quad creator), quad manager, execution core with execution engines 0, 1 and 3 (each with thread state), message fabric, load/store cache, attribute units, varying unit, texture unit, ZS memory, late ZS unit, blender & tile access, color memory, tile writeback, and two connections to the L2 memory system.]


Fixed Use-Case and Standardized APIs

Enabling hardware-software co-design

A GPU implementation uses

Fixed-function hardware for triangle operations, rasterization, interpolation, and more

Hardware to ensure serialization when necessary

Buffers to reorder threads to enable more parallelism where serialization is not needed

Fixed-function hardware for killing hidden pixels as early as possible


Tuning hardware becomes very content-dependent

It depends on the complexity of the geometry, the number of buffers used, and the amount of data loaded per thread

Software and hardware evolve together over time


The API allows large and invasive hardware changes that remain transparent to the application, because the driver mediates between the two.


Mutual Adaptation Between GPUs and Applications

Both evolve over time

Better Hardware → Heavier Applications

More complex geometries and more complex computations.

More triangles per object

More objects

More complex light effects

More particle effects

Application Changes → Rebalanced GPUs

HW buffers, memory hierarchy and compute density must be balanced for the use-case.

Larger buffers to enable more out-of-order execution and more parallelism

Higher compute performance

Large caches and a more complex memory hierarchy


Index-Driven Position Shading


IDVS – Introduction

Discarding triangles early

A vertex shader typically computes two kinds of things: a position and some data

A coordinate transformation from the “model space” to the “real space”

A coordinate transformation as objects move around

Data transformations for, e.g., lighting calculations

Often, the coordinate transformations are simple.

Sometimes, we can discard triangles based only on the position data.

Executing the data transformations is then useless work that we want to avoid.
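To make "discard based only on the position data" concrete, here is a minimal sketch of two common position-only tests: back-face culling by winding order and rejecting triangles that lie entirely off screen. It assumes 2D screen-space positions and counter-clockwise front faces; the function names are hypothetical and the off-screen test is deliberately simplified.

```python
def signed_area(a, b, c):
    # Twice the signed area of triangle abc in screen space; positive for
    # counter-clockwise winding, negative for clockwise.
    return (b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])


def is_outside(tri, width, height):
    # True when all three vertices lie beyond the same screen edge.
    xs = [p[0] for p in tri]
    ys = [p[1] for p in tri]
    return (max(xs) < 0 or min(xs) > width or
            max(ys) < 0 or min(ys) > height)


def can_discard(tri, width, height):
    # Uses positions only: no lighting or other per-vertex data is needed.
    return signed_area(*tri) <= 0 or is_outside(tri, width, height)


front = [(10, 10), (50, 10), (10, 50)]          # counter-clockwise, on screen
back = [(10, 10), (10, 50), (50, 10)]           # clockwise: faces away
offscreen = [(-30, 10), (-10, 10), (-30, 50)]   # left of the screen
```

For a 100 by 100 screen, can_discard keeps front and rejects both back and offscreen, so any data computed for the vertices of the latter two would have been wasted.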


Splitting the vertex shader

We want to split the vertex shader into two parts

The original vertex shader transformed both position and data

    out_position, out_data = vertex_shader(position, data)

Split into two parts, computing one piece each

    out_position = position_shader(position, data)
    out_data = data_shader(out_position, data)

Discard triangles between the two shaders

Implementation

Let the compiler split the shader into two

Let the hardware cull triangles between position_shader and data_shader

Let the driver manage memory buffers to ensure correctness
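The flow described above can be sketched in software as follows. This is an illustration under stated assumptions, not the hardware or driver implementation: idvs, keep_triangle, and the toy shaders are hypothetical names, and culling here is a simple predicate on the transformed positions.

```python
def idvs(positions, data, indices, position_shader, data_shader, keep_triangle):
    # Position shading: run the position half of the shader for every vertex.
    out_positions = [position_shader(p, d) for p, d in zip(positions, data)]

    # Culling between the two shaders, using transformed positions only.
    triangles = [tri for tri in indices
                 if keep_triangle([out_positions[i] for i in tri])]

    # Data shading: only for vertices that belong to a surviving triangle.
    live = sorted({i for tri in triangles for i in tri})
    out_data = {i: data_shader(out_positions[i], data[i]) for i in live}
    return triangles, out_positions, out_data


data_calls = []


def position_shader(position, offset):
    # Toy position shader (hypothetical): translate by the per-vertex offset.
    return (position[0] + offset[0], position[1] + offset[1])


def data_shader(out_position, offset):
    # Toy data shader (hypothetical); records each invocation.
    data_calls.append(out_position)
    return ("lit", out_position)


def keep_triangle(tri):
    # Toy culling rule (hypothetical): keep if any vertex has x >= 0.
    return any(x >= 0 for x, _ in tri)


positions = [(0, 0), (1, 0), (0, 1), (-5, 0), (-4, 0), (-5, 1)]
data = [(0, 0)] * len(positions)
indices = [(0, 1, 2), (3, 4, 5)]
triangles, out_positions, out_data = idvs(
    positions, data, indices, position_shader, data_shader, keep_triangle)
```

Six vertices go through position shading, but the data shader runs only for the three vertices of the surviving triangle.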


IDVS: Hardware

[Figure: IDVS hardware data flow. Labels: indices, vertex data, position shading, transformed vertex positions, primitive assembly, culling & tiling, varying shading, transformed vertex data.]


A Simple Example

Basic vertex shader

def vertex_shader(position, data):
    transform, buffer = data
    out_position = transform(position)
    out_data = data_transform(out_position, buffer)
    return (out_position, out_data)

Input data in two parts, for position and data.

A simple transform to position the object

A separate computation for, e.g., lighting


Splits nicely into two parts

def position_shader(position, data):
    transform, _ = data
    out_position = transform(position)
    return out_position

def data_shader(out_position, data):
    _, buffer = data
    out_data = data_transform(out_position, buffer)
    return out_data

Input data in two parts, for position and other data.

A simple transform to position the object

A separate computation for, e.g., lighting

Easily split into two parts
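A quick way to convince oneself that such a split preserves behaviour is to run both versions on the same input. In the sketch below, translate and data_transform are hypothetical stand-ins for the object transform and the lighting computation, and data_shader is assumed to unpack the buffer from data.

```python
def data_transform(out_position, buffer):
    # Hypothetical stand-in for a lighting computation.
    return buffer * (out_position[0] + out_position[1])


def vertex_shader(position, data):
    transform, buffer = data
    out_position = transform(position)
    out_data = data_transform(out_position, buffer)
    return (out_position, out_data)


def position_shader(position, data):
    transform, _ = data
    return transform(position)


def data_shader(out_position, data):
    _, buffer = data
    return data_transform(out_position, buffer)


def translate(position):
    # Hypothetical transform: move the object by (2, 3).
    return (position[0] + 2, position[1] + 3)


position, data = (1, 1), (translate, 0.5)
combined = vertex_shader(position, data)
out_position = position_shader(position, data)
split = (out_position, data_shader(out_position, data))
```

Here combined and split hold the same position and data, and the data half can be skipped entirely for vertices whose triangles are culled.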


A More Complex Example

Transformation needed to compute data

def vertex_shader(position, data):
    transform_data = compute_transform(data)
    out_position = transform(transform_data, position)
    out_data = lighting_calculation(out_position, data, transform_data)
    return (out_position, out_data)

Input data in two parts, for position and data.

Complex dependencies on input data


Common subexpression


Common subexpression (transform_data) shared by the position and data computations


Recomputing the same function twice

def position_shader(position, data):
    transform_data = compute_transform(data)
    out_position = transform(transform_data, position)
    return out_position

def data_shader(out_position, data):
    # compute_transform runs a second time here.
    transform_data = compute_transform(data)
    out_data = lighting_calculation(out_position, data, transform_data)
    return out_data


Naïve split requires computing transform twice


Directions for Solution

Intermediate results common for both position and data should not be recomputed

Identify such intermediate results

Store them to additional buffers

Only split shaders when we expect a performance improvement

Needs active, content-aware hardware-software co-design to obtain good performance in general
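One way to read "store them to additional buffers" is sketched below: the position shader writes the shared intermediate into a per-vertex buffer and the data shader reloads it. The intermediates dictionary, the index parameter, and the toy implementations of compute_transform, transform and lighting_calculation are assumptions made for illustration only.

```python
calls = {"compute_transform": 0}


def compute_transform(data):
    # Hypothetical expensive intermediate shared by both halves.
    calls["compute_transform"] += 1
    return data["scale"]


def transform(transform_data, position):
    return tuple(transform_data * x for x in position)


def lighting_calculation(out_position, data, transform_data):
    # Hypothetical lighting term.
    return data["light"] * transform_data * sum(out_position)


def position_shader(index, position, data, intermediates):
    transform_data = compute_transform(data)
    intermediates[index] = transform_data  # store to the additional buffer
    return transform(transform_data, position)


def data_shader(index, out_position, data, intermediates):
    transform_data = intermediates[index]  # reload instead of recomputing
    return lighting_calculation(out_position, data, transform_data)


intermediates = {}
data = {"scale": 2, "light": 0.5}
out_position = position_shader(0, (1, 3), data, intermediates)
out_data = data_shader(0, out_position, data, intermediates)
```

compute_transform now runs once per vertex rather than twice. Whether the extra buffer traffic is cheaper than recomputing depends on the content, which is why the split should only be applied when a performance improvement is expected.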


Summary

Graphics APIs give device designers a HW/SW co-design opportunity

Understanding the use-case is important for successful HW/SW co-design

HW/SW co-design is important for obtaining good performance

Applications and GPUs evolve together


The Arm trademarks featured in this presentation are registered trademarks or trademarks of Arm Limited (or its subsidiaries) in the US and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners.

www.arm.com/company/policies/trademarks


Thank You!

Danke!

Merci!

谢谢!

ありがとう!

Gracias!

Kiitos!

감사합니다 धन्यवाद
