June 2010

Energy efficient graphics

Making the rendering process power aware

Johan Bergman

(2)
(3)

Faculty of Science and Technology, UTH unit, Uppsala University

http://www.teknat.uu.se/student

Energy efficient graphics

Johan Bergman

Today it is possible to produce computer generated graphics with amazing realism, even in embedded systems. Embedded systems such as mobile phones are characterized by limited battery power, and as graphics become more complex it becomes necessary to find a solution that provides the means to control the energy consumption of graphics at run-time. When energy resources are scarce it is desirable to be able to limit how much energy is spent generating graphics, so that other, more important, system components may continue to operate for a longer time. This thesis examines how the rendering process can be made power aware and energy efficient.

The proposed solution to achieve power awareness without modification of existing hardware and software is a library interposer on top of the OpenGL API. The design and implementation of the interposer library show that it is possible to limit energy consumption with high precision through a relatively simple algorithm. The interposer limits the amount of time the processing units are actively rendering graphics, and since energy consumption displays a linear correlation with CPU and GPU utilization, energy is preserved at the expense of frame rate or image quality.

To preserve an acceptable frame rate certain visual effects are turned off to reduce the frame rendering time. Lowering image quality makes it possible to increase frame rate while keeping utilization constant. Measurements show that energy consumption remains stable at lowered image quality and higher frame rate.

The thesis concludes with thoughts on how to incorporate such a system in existing frameworks for power management, and how power management frameworks could be improved to better exploit the possibilities presented by a power aware rendering process. During the research for this master thesis it has become apparent that a scalable rendering process is desirable not only for power management but can be used for other purposes as well.

Subject reviewer: Stefanos Kaxiras

Supervisors: Barbro Claesson, Detlef Scholle


Contents

1 Introduction
  1.1 Background
  1.2 Problem statement
    1.2.1 Energy efficient computer graphics
    1.2.2 Computer graphics and power management
    1.2.3 Design and implementation of a power aware graphics system
  1.3 Method
  1.4 Limitations and time constraints

2 Computer graphics
  2.1 Graphical Processing Unit
  2.2 The graphics pipeline
  2.3 Tile-based rendering
  2.4 Ray casting and ray tracing
  2.5 Graphics library
  2.6 Unified shader architecture and General Purpose GPUs
  2.7 Graphics in embedded systems
  2.8 Summary

3 Power management
  3.1 Static power management
  3.2 Dynamic power management
  3.3 Power management policy
    3.3.1 Break even schemes and time-out policies
    3.3.2 Predictive wake-up
  3.4 The Advanced Configuration and Power Interface (ACPI) power management framework
  3.5 Power management module
  3.6 Operating system power management
    3.6.1 ECOSystem
    3.6.2 EQoS
  3.7 Power manageable components
    3.7.1 Power management at the application level
    3.7.2 Power management at the driver level
  3.8 Workload prediction for graphics applications
    3.8.1 Signal-based workload prediction
  3.9 Hardware support for dynamic power management
    3.9.1 Power modes
    3.9.2 Dynamic Voltage and Frequency Scaling (DVFS)
  3.10 Summary

4 Power management methods in computer graphics software
  4.1 Energy efficient graphics applications
  4.2 Level of detail
    4.2.1 Simplification models and progressive meshes
    4.2.2 Level of detail control
  4.3 Power awareness in the rendering pipeline
    4.3.1 Per-vertex transform and lighting
    4.3.2 Clipping and culling
    4.3.3 Fragment shading
    4.3.4 Fog
    4.3.5 Texturing
    4.3.6 Anti-aliasing
  4.4 Energy efficient tile-based rendering
  4.5 Frame rate
  4.6 Summary
    4.6.1 Reduce computational complexity
    4.6.2 Reduce external memory accesses
    4.6.3 Power efficiency in the graphics pipeline

5 Design and implementation of a power aware graphics system
  5.1 Requirements
  5.2 Specification
    5.2.1 User perspective
    5.2.2 Graphics application perspective
    5.2.3 Power management module perspective
    5.2.4 Hardware perspective
  5.3 Design and implementation
  5.4 The interposer library
  5.5 Energy consumption and workload
  5.6 The utilization limit
  5.7 Optimization of image quality
  5.8 Graphics system operation
    5.8.1 Intra-process communication (IPC)
    5.8.2 Threaded architecture
    5.8.3 Interposed OpenGL functions of note
    5.8.4 Rendering timer
    5.8.5 Rendering buffer
    5.8.6 Measuring energy consumption
    5.8.7 Workload prediction

6 Results
  6.1 Demos
  6.2 Measurements
    6.2.1 Estimating utilization
  6.3 Summary of the measurements

7 Conclusion
    7.0.1 Energy consumption of tile-based rendering
    7.0.2 Library interposition as a method for power management
    7.0.3 Power management and the OpenGL API
    7.0.4 Power management frameworks and power aware rendering
  7.1 Future work
    7.1.1 Extend the functionality of the library interposer
    7.1.2 OpenGL library interposition for other reasons
    7.1.3 Mapping the power / performance trade-off

Bibliography


Chapter 1

Introduction

This master thesis will investigate how to best increase energy efficiency for computer generated graphics in embedded systems, in particular, the possibility of scaling the operation of rendering in order to trade performance for power. This chapter describes the background, goals and limitations of this thesis.

1.1 Background

The performance of computer-generated graphics in embedded systems, such as mobile phones, has increased steadily over the last decade. The limited energy resources of these systems mean that power consumption is becoming a more and more pressing issue.

Many embedded systems today include graphics processing units (GPUs) dedicated to performing the complex computations of computer graphics, as well as high-resolution displays.

3D graphics has become an integral part of mobile phones, but this comes at the cost of increased energy consumption, and reducing the energy footprint of the processing units involved in generating graphics is therefore vital.

Efficient use of limited resources is in the computer science field referred to as Quality of Service (QoS). QoS can have different meanings in different contexts; in this project, QoS refers to the relationship between energy consumption and performance. Another concept used in this thesis work is power awareness. A power aware device has several modes of operation, where each mode offers a different level of energy consumption and performance. Making a device or module power aware provides the system with tools that enable it to use energy more efficiently. Minimizing power consumption while maintaining as much performance as possible is referred to as graceful degradation.

This master thesis was performed at ENEA AB in collaboration with Uppsala University, in the spirit of the GEODES project. The GEODES project aims to provide the design techniques that embedded software needs to face the challenge of long power-autonomy in feature-rich, and possibly life-critical, systems.

GEODES: Global Energy Optimization for Distributed Embedded Systems


1.2 Problem statement

This thesis covers several topics of interest, listed below. Some parts of the thesis work are purely theoretical, while other areas of research form the basis for design and implementation.

1.2.1 Energy efficient computer graphics

The main topic of this thesis work is a comprehensive survey of how to maximize the performance of computer generated graphics under power constraints. As a first step towards power awareness it is necessary to understand how computer graphics actually works. How does computer graphics hardware acceleration work, and what are the benefits of having it? It is also necessary to know when and where energy is being consumed in a graphics system and, perhaps most importantly of all, how energy can be saved.

1.2.2 Computer graphics and power management

A related subject is how rendering of computer graphics can be included in a power management framework. It is possible to design a graphics system that is able to deliver top quality graphics when requested but continues to operate, albeit at reduced performance, when energy resources are scarce. This thesis contains a comparison of different power management frameworks and how these relate to computer graphics. What does a power management framework that includes power aware computer graphics look like? What are the difficulties that need to be overcome in order to achieve a power aware rendering process?

1.2.3 Design and implementation of a power aware graphics system

The practical part of this thesis consists of the specification, design and implementation of a power aware graphics system on the i.MX31 hardware platform. The implementation is a prototype that supports only a subset of the features proposed in the design. It was used to test and demonstrate some of the reasoning behind the design.

The design aims to describe how power awareness can be introduced in a system with minimal changes to existing software and hardware. This thesis work attempts to explain all of the options that are available when it comes to trading performance for power and also how to choose the method that provides the best trade-off.

1.3 Method

This thesis project was conducted in two distinct phases. The work began with a literature study of academic research papers, manuals, technical documentation and reports. To focus the research, research subjects were updated on a weekly basis as new knowledge was gained. The second phase built upon the knowledge gained from the literature study to design and implement the power aware graphics system for the i.MX31 board. Although the implementation is only a prototype, some measurements were carried out in order to validate parts of the design and as a starting point for discussion. This document was continuously written and re-written throughout the entire process.

1.4 Limitations and time constraints

It was decided early on that the development platform for the power aware system would be the i.MX31 development board running Linux 2.6. The only graphics library considered was OpenGL. The time frame for the literature study was ten weeks, after which another ten weeks were set aside for design and implementation.

From the beginning it was also the intention that one of the research subjects would be display power management, since the display is usually the component with the highest energy consumption in an embedded system, but that study was cut from the final report since it was not completed.


Chapter 2

Computer graphics

This chapter contains a brief description of a general computer graphics system: how a combination of dedicated hardware and software enables the projection of a three-dimensional scene onto a screen.

2.1 Graphical Processing Unit

Figure 2.1: Overview of a traditional GPU in an embedded system.

A GPU is a microprocessor dedicated to performing computations associated with computer graphics. Traditionally, the main task of the GPU is to render graphics.

Rendering is the process of synthesizing an image from the description of a scene. Special-purpose hardware can always perform a given task, such as 3D rendering, more efficiently than a general-purpose CPU [8], and using dedicated graphics hardware helps embedded systems get by with lower-clock-rate CPUs. The scene description consists of geometric primitives in Euclidean 3-dimensional space. The light, color and reflection properties and the viewer's position are also taken as input to calculate the image.

The rendering process has several stages where complex calculations can be performed in parallel, and it is this property that makes GPUs so efficient at computer graphics computations. A GPU features a highly parallel structure and fast memory access. Together, these features enable a GPU to speed up the rendering process considerably. GPUs are optimized for high throughput, not for low latency as CPUs are [22].

The GPU may or may not provide acceleration for the entire rendering process. In many systems where the space on chip is limited the GPU only supports a subset of graphics operations and the rest of the rendering process has to be carried out on the CPU.

2.2 The graphics pipeline

Figure 2.2: The stages of a typical graphics pipeline.

Below follows a short description of the stages of the traditional GPU pipeline, which is also referred to as scanline rendering. The graphics pipeline can be divided into two main stages.

• Geometry stage. During the geometry stage the vertices that are the input to the graphics pipeline are transformed into a stream of triangles, in a common 3D space with the viewer located at the origin.

• Fragment stage. The fragment stage is responsible for gen- erating the pixel value of each pixel on the display.

Each stage consists of several sub-stages, which may be performed in a slightly different order depending on the system architecture. A flow-chart of a typical graphics pipeline is presented in figure 2.2. In many embedded systems the geometry stage computations are carried out on the CPU [2]. The possibilities for optimization and graceful degradation in the rendering pipeline are a large part of this thesis work and will be explained in detail in chapter 4.3.

From the GPU's perspective the world is made up of triangles, and before any computations are done every complex shape has to be split into triangles. OpenGL, or some other graphics library, is used to push each triangle into the graphics pipeline one vertex at a time.

To fully take advantage of the parallelism in the GPU, all objects that will be put into the image first have to be transformed into the same coordinate system. Lighting is then added to the scene on a per-vertex basis by combining information about all the light sources in the scene. In the next step the vertices are projected onto the virtual camera's film plane. Ideally, only objects that are visible from the camera are considered when calculating the pixel value.

Figure 2.3: Overview of a tile-based GPU in an embedded system.

When it comes to determining which visible screen-space triangles overlap pixels on the display, each pixel can be treated individually, which allows the GPU to fully utilize its parallelization capabilities. This process is called rasterization.

Lighting alone is not enough to give an image the impression of realism. By draping images called textures over the geometry, an illusion of detail is added to the image. Texturing requires a huge number of memory accesses in quick succession, and GPUs are equipped with high-speed memory as well as excellent caching to provide fast access to textures [22]. Several of these stages require information about how far away the closest object is from the viewpoint of the virtual camera for each pixel. For this reason a depth buffer, called the z-buffer, is kept; the stored minimum distance and the pixel value are updated each time a triangle turns out to be closer than any previous triangle. For complex scenes where many triangles overlap, the number of reads from the z-buffer can become very large.

2.3 Tile-based rendering

Tile-based rendering is a technique where the 3D scene is decomposed into regions, or tiles. It was originally developed to speed up the rendering process by allowing multiple triangles to be rendered in parallel but is now used in low-power embedded graphics systems to provide sequential rendering of large scenes, maximizing utilization of limited hardware acceleration [3].

The geometry stage of tile-based rendering is the same as the traditional graphics pipeline. Geometry operations can be carried out on the CPU or special-purpose hard- ware. The processed vertices are sent to the rasterizer, but before they reach it the triangles are sorted into bins that correspond to different tiles. Since a triangle might span several tiles it may have to be put into multiple bins and sent to the fragment stage repeatedly. The sorting might be performed as part of the rasterization process or it might be performed by the CPU as part of the geometry stage. The tiles are then rendered one by one and the pixel values written to the frame buffer.

In a traditional graphics pipeline the depth values of each fragment, and the textures, have to be placed in off-chip memory, because the depth and texture buffers would otherwise have to be unfeasibly large. The benefit of a tile-based rendering system is that the depth values and textures for one tile can be stored on chip, which allows very fast and efficient memory access without large buffers [2]. The i.MX31 development board used for the implementation part of this thesis uses a tile-based rendering technique.


2.4 Ray casting and ray tracing

The overwhelming majority of computer graphics systems use the rasterization rendering process described above. An alternative rendering process is ray casting and its relative, ray tracing. Ray casting is a technique where the pixel value is calculated by simulating the path of light through a 3D environment. The pixel value is based on the first surface that crosses the path of a ray projected from the viewpoint through each pixel. Ray tracing is a more advanced technique where the ray is allowed to bounce several times within the scene, thereby producing effects such as reflection and shadows. It is used to create images with incredibly realistic lighting and color at the expense of increased computational complexity. Ray traced images typically take several seconds to render and are therefore unsuitable for interactive applications. Hardware acceleration of ray tracing is still in development, and it will take some time before the first interactive ray traced applications hit the market.

2.5 Graphics library

A graphics library enables programmers to utilize the graphics accelerator hardware of the GPU and provides a common API to applications that wish to render 3D graphics.

Several graphics libraries exist. OpenGL ES is a cross-platform API for 2D and 3D graphics applications, adapted to suit the limited hardware resources of embedded systems. It is the graphics library used for the practical part of this master thesis. The graphics library enables developers of graphics applications to access dedicated hardware without having to consider differences between platforms.

The graphics library resides in the operating system and translates API calls into actions executed on either the CPU or the GPU.

2.6 Unified shader architecture and General Purpose GPUs

GPUs today implement the geometry and rasterization stages of the graphics pipeline using programmable hardware called vertex and fragment shaders. A unified shader is created by implementing the functionality of both vertex and fragment shaders in the same hardware.

A unified shader architecture is more desirable for applications with an uneven workload between the different stages of the graphics pipeline.

Moya et al. [25] have measured the benefits of a unified shader architecture using a comprehensive simulation environment. The amount of space needed on chip for shader hardware could be reduced by as much as 30% by a conservative estimate.

Since GPUs outperform CPUs when it comes to raw computational power, a lot of effort has been put into using them for calculations other than graphics. This is not as straightforward as it may sound, and up until a couple of years ago it was not even possible due to hardware constraints. Processes with large computational requirements and high parallelism, whose throughput is more important than low latency, are suitable for running on a GPU [22]. A GPU which performs operations traditionally handled by a CPU is referred to as a General Purpose GPU (GPGPU).

GPGPU computing is outside the scope of this thesis, but the unified shader model and general purpose GPUs serve as examples of the ongoing trend that GPUs are becoming more programmable and versatile. Using hardware acceleration not only improves the execution time of applications but also has a positive impact on power consumption. The development of new hardware is very important for achieving high system performance and low power consumption.

2.7 Graphics in embedded systems

Embedded systems with computer graphics, including mobile phones, differ from more powerful computer systems in several respects. Handheld devices have a very limited power supply compared to a stationary computer, which requires innovation at both the hardware and the software level to increase the life-time of such systems through smart design and energy efficiency. Another limitation of handheld devices is their size. Most mobile phones are small, and even if the power supply is increased the extra power turns into heat, which could potentially damage circuits unless thermal design aspects are considered [8].

Small memory bandwidth and limited chip area for dedicated hardware such as GPUs are some of the challenges when designing embedded systems, apart from the demands for low power consumption. Small displays that are held close to the eye of the viewer actually result in higher demands on image quality than in a desktop system [1].

The limited rendering capabilities of embedded systems are really stretched by applications that are developed for PCs and ported to embedded systems. Computer games are often distributed for PCs and embedded systems simultaneously, which requires the embedded versions to reduce the level of detail in order to provide acceptable frame rates [24, 9].

It is important to make a distinction between interactive and non-interactive graphics. Non-interactive graphics can be rendered as simple bitmaps on other devices and does not usually place such calculation-heavy demands on the GPU [8].

2.8 Summary

The graphics rendering process takes the description of a 3D scene and projects it onto a 2-dimensional screen. Scanline rendering is the name of the process used in almost every system today. Tile-based rendering is an alternative technique which splits the scene into regions and renders each region separately. It is mostly found in embedded systems.

The graphics pipeline can be divided into a number of stages that all contribute to the final image. A GPU provides hardware acceleration of the computations that are needed to render graphics. The reason why a GPU is able to speed up the rendering process by several orders of magnitude is that the rendering pipeline contains several stages where many elements can be processed in parallel. Applications that request computer generated graphics can call a graphics library like OpenGL to access hardware acceleration. Using a GPU increases both the speed and the energy consumption of the rendering process considerably.

A trend in GPU technology is towards more general-purpose hardware that can be used for purposes other than rendering. Energy efficiency and power awareness are especially important in embedded systems, where battery capacity and space on chip are limited.


Chapter 3

Power management

Power management (PM) is the process of efficiently directing power to different components of a system. It is especially important in embedded systems that rely on battery power. This chapter aims to provide definitions for the various power management concepts used throughout this thesis. It also describes the power management control techniques that are available at different levels of the hardware/software stack and how they relate to system components such as computer graphics. Power management is used not only to prolong battery life-time but also to reduce noise and cooling requirements for integrated circuits.

Luca Benini et al. [5] have presented a comprehensive survey of the various system-level power management techniques available. They examine several aspects of dynamic power management, such as what type and how much information should be exchanged between the manager and the system components.

3.1 Static power management

The best solution to power management issues is of course to reduce power consumption without any degradation of performance. Static power management refers to the minimization of leakage current and other power consumption characteristics of hardware circuits at the base power consumption level, but can also refer to a more general minimization of power consumption at various levels of performance.

The base power consumption of a system can be defined as the sum of the power consumed by components without power management and the power of the lowest possible power state of each power manageable component [35].

What all static power management has in common is that it is implemented at design time and usually does not require as much software support, in the shape of complex algorithms and architectures, as dynamic power management. Static and dynamic power management are not mutually exclusive, and both are required to improve the life-time of feature-rich embedded systems.


3.2 Dynamic power management

Any electronic design needs to be able to deliver peak performance when requested [5]. This is why static power management alone cannot suffice. At peak performance, power consumption is high, even though most components will not be running at full capacity all the time. Without some way of restricting the power consumed by inactive or partially active components, the battery will either have to be impractically large or the system will have a very limited life-time.

Dynamic Power Management (DPM) is a way for dynamically reconfigurable systems to provide the requested services with a minimum of power consumption. DPM techniques include methods to turn off components, or reduce their performance, when they are not used to their full capacity. Most systems experience fluctuations in workload during run-time, and any DPM is based on the assumption that it is possible to predict future workload with a degree of certainty [5].

3.3 Power management policy

The way a system is configured using DPM is called its power management policy. A power management policy can be designed off-line by the developer, or it can be implemented as a general adaptive solution which dynamically reconfigures itself at run-time. Many DPM policies are a combination of both.

Predictive techniques for setting power management policy do not guarantee optimal solutions. All predictive techniques use information about past events to make predictions about the future. Regardless of the predictive algorithm used, the quality of the predictions depends on the correlation between past and current events, which is always beyond the control of the designer [5].

Most dynamic power management research has focused on optimizing power under performance constraints. Power is a global system resource, which means that the challenges of power-constrained QoS are different from those of other types of QoS.

3.3.1 Break even schemes and time-out policies

The time-out policy is widely used in laptops and handheld devices. It is a simple policy which shuts down a component after some period of inactivity. This policy is clearly sub-optimal and can even prove counter-productive in specific situations [5]. An example where a simple fixed time-out is ineffective is a system with a periodic workload whose period is slightly longer than the time-out value. A processor in such a system would enter sleep mode when the time-out expires, but the energy savings, if any, would be too small to justify the transition energy overhead.

This example highlights the need to set the time-out correctly. The break even scheme is one of the most commonly used methods to determine the optimal time-out value. The processor is put into a sleep mode when the expected energy gain is greater than the mode transition cost. The scheme is intuitive, easy to implement and performs relatively well for components with few power modes [5].


Using the break even scheme on the GPU, and on the rendering process as a whole, requires that the arrival time of the next request can be predicted accurately. Lu et al. have tested a simple time-out technique using different schemes to determine the optimal time-out. They show that it is possible to save power with an efficient time-out scheme. The adaptive algorithm they use outperforms the other methods, indicating that adaptive algorithms are well suited for power management [21].

3.3.2 Predictive wake-up

When the processor has gone into sleep mode and a request arrives, it takes some time for the processor to get back up to speed. If the power manager gets accurate information about the expected future use of an application, it can employ a predictive wake-up scheme to reduce the performance penalties and energy waste caused by wake-up latency [21]. Rendering of computer graphics is often done at highly periodic intervals, which makes graphics applications prime targets for predictive wake-up schemes.

3.4 The Advanced Configuration and Power Interface (ACPI) power management framework

ACPI defines a platform-independent open interface for device configuration and power management of individual devices and entire systems. ACPI removes device power management responsibilities from firmware interfaces and describes a set of standardized power management states and methods as guidelines for developers [10].

Most platforms implement ACPI [29]. The ACPI specification contains information about how to set the operating modes of many kinds of hardware, such as CPUs and other processing units.

ACPI-based power management algorithms spend less than 1% of computation time on power management, according to Benini et al. [5]. ACPI does not guarantee that power management will be efficient or computationally feasible; ACPI provides the interface, and it is up to the designer to implement the power management.

3.5 Power management module

Most suggested power management design solutions feature a power management mod- ule. The module is responsible for gathering system status information, making in- telligent DPM decisions and actuating commands to the individual subsystems. The module’s three main tasks are sometimes concretized as sub-modules within the mod- ule. The sensor module gathers system information such as which devices are registered and how their power management capabilities are. It also monitors the current state of each system component at run-time. This information is then relayed to the policy man- ager that contains the decision-making algorithms. The output of the policy manager are state change commands. The actuator module takes these commands and relays these to the affected components. In some systems, the policy manager and actuator


are combined into one module. This is by no means the only possible architecture for power management, but it serves as an example and a starting point for further discussion.

Figure 3.1: The basic power management module architecture.

3.6 Operating system power management

Power management can be implemented at various levels in the system architecture.

Simple embedded systems with few features often have power management in the form of a dedicated hardware circuit. It is also possible to implement power management as an application or as middleware. Because most power management policies require the power management module to access low-level drivers and hardware, the vast majority of power management architectures place the policy manager in the operating system.

Operating system power management is a hardware/software co-design challenge, since both hardware resources and applications need to provide power management functionality for the system to be effective[5].

To facilitate power management the platform should ideally provide support for measuring the power consumption of each individual component[35]. To make accurate decisions, the power management module needs accurate and timely updates about system and subsystem power consumption. The power manager can then measure the remaining battery at regular intervals and adjust the power budget continuously to fit the desired life-time of the system. Without this information the manager is left to rely on the feedback provided by the components themselves, which may be delayed or inaccurate.

Accurately measuring current at run-time is costly to implement, so most embedded systems today lack this feature.

3.6.1 ECOSystem

Zeng et al.[35] have produced one of the relatively few research papers that treat energy as the first-priority resource. Since this is also the purpose of this thesis, it is interesting to describe their proposed solution in detail. In the ECOSystem, each application is


given an amount of currentcy that can be used to purchase the right to consume energy on hardware devices. One currentcy unit represents a certain amount of energy within a certain time frame. The ECOSystem resource management policies have two main goals. The policy tries to eliminate waste by using each device as efficiently as possible. At the same time, it limits the offered workload of the system to ensure a minimum life-time. One of the strengths of the ECOSystem and the currentcy approach is that it does not require devices or applications to be power aware, although application involvement in power management is facilitated. Currentcy is allocated to applications at specific time intervals. The authors found that a one-second period is sufficient to achieve smooth energy allocation. The amount of total currentcy available determines the maximal power consumption in that time frame and is proportional to the estimated battery model and the desired life-time of the system. Each task (application) is then given an amount of currentcy depending on its relative priority to other tasks. The ECOSystem is designed to incorporate a device interface flexible enough to support a wide variety of devices. Each device has its own charging policy.
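The per-epoch allocation step described above can be sketched in a few lines. This is an illustrative Python model, not the actual ECOSystem code; the battery figures, epoch length and task priorities are made-up values.

```python
def allocate_currentcy(epoch_budget, tasks):
    """Split the per-epoch currentcy budget among tasks in
    proportion to their relative priorities."""
    total_priority = sum(priority for _, priority in tasks)
    return {name: epoch_budget * priority / total_priority
            for name, priority in tasks}

# Budget per one-second epoch, proportional to the remaining battery
# energy and the desired system life-time (illustrative numbers).
battery_joules = 10_000.0
target_lifetime_s = 3_600.0
epoch_budget = battery_joules / target_lifetime_s

shares = allocate_currentcy(epoch_budget,
                            [("ui", 3), ("audio", 2), ("sync", 1)])
```

A device's charging policy would then debit each task's share as the task consumes energy on that device.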

3.6.2 EQoS

Another example of a system that tries to maximize performance under power constraints is the EQoS framework. A limited energy budget and real-time tasks are both incorporated in the Energy-aware Quality of Service (EQoS) concept described by Pillai et al.[29] They introduce utility as a way of maximizing system performance under power constraints. Utility is a measure of the benefit gained from a particular task or application.

The Quality of Service (QoS) term originates in traffic engineering, where it denotes resource reservation control mechanisms in packet-switched telecommunication networks. In this context, QoS refers to the ability to provide different priorities to different applications, users or data flows; high-priority data flows get a larger share of the network resources. The EQoS concept has the same basic idea, but uses power instead of bandwidth as the primary system resource.

How to assign utility is straightforward for some tasks. For applications that provide interactive graphics the authors admit that the assignment of utility is somewhat arbitrary. They suggest using a combination of objective image quality measures and common sense. Also, the utility assignment mechanism is insensitive to task dependencies, which makes assigning utilities correctly a question of some intricacy.

Tasks running in the EQoS system have to be able to degrade gracefully; how this is achieved is up to the developer. At the most extreme, some tasks will not be allowed to run at all, since their execution would violate the run-time demands. The earliest-deadline-first scheduler used for CPU scheduling in the EQoS system ensures that all tasks above a certain utilization are allowed to run.
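One simple way to maximize total utility under an energy budget is a greedy admission test that ranks tasks by utility per unit of energy. This is only an illustrative sketch of the idea, not the actual EQoS admission algorithm; the task names and numbers are invented.

```python
def admit_tasks(tasks, energy_budget):
    """Greedily admit tasks in order of utility per joule until
    the energy budget is exhausted.
    tasks: list of (name, utility, energy_per_period) tuples."""
    ranked = sorted(tasks, key=lambda t: t[1] / t[2], reverse=True)
    admitted, remaining = [], energy_budget
    for name, utility, energy in ranked:
        if energy <= remaining:
            admitted.append(name)
            remaining -= energy
    return admitted

# Hypothetical task set: (name, utility, energy per period in joules).
tasks = [("video", 10.0, 5.0), ("game", 8.0, 8.0), ("log", 1.0, 1.0)]
```

Tasks that do not fit the budget are simply not run, which mirrors the most extreme form of graceful degradation mentioned above.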


3.7 Power manageable components

Power management on the component level is a feature provided by some electronic equipment, allowing it to turn off or go into a low-power mode when resources are scarce.

A power manageable component (PMC) is defined as a functional block; the power manager makes no assumptions about its internal structure. A PMC can be an application, a driver or some other system component. What separates a PMC from other components is the availability of multiple power modes. Components can have both external and internal power management. An internal power management policy has to be more conservative, since the component usually lacks observability of overall system operation[5], but at the same time internal management has the benefit of direct access to the hardware registers of the managed device, enabling a fast feedback loop. Performance degradation has to be implemented in such a way that it does not impair overall system performance, especially for components that are important to that performance.

Effective power management requires high flexibility in the types of resources that can be integrated as power manageable components[5]. Rendering of computer graphics can be considered such a functional block: it has a clearly defined purpose in a system, even though the rendering process affects several hardware devices.

3.7.1 Power management at the application level

As mentioned above, power management decision-making could potentially be implemented at the application level. This section assumes that the power management module resides in the operating system and that applications exist in the system as PMCs.

Some forms of power management can only be implemented in the application layer.

Carla Schlatter Ellis[12] makes a good argument for higher-level power management.

She claims that the application should play an important role in power management and that the operating system should provide the application with a power management interface.

Figure 3.2: An example of the input/output of the power management module. Lafruit et al. [18]

The operating system should provide the application with updates about the power state of the system. The application itself has the most knowledge about which services


it should offer to the user and the importance of different aspects of those services.

Trade-offs exist at the application layer, specifically but not exclusively for computer graphics. A GPS application, for instance, has a non-trivial trade-off between processor idle time and polling frequency[12].

The power management interface between operating system and application is potentially beneficial to both parties. When an application communicates its future requests, the operating system is able to incorporate this information in the decision-making process. Applications can request task-specific power management by communicating with the PM module[21]. A power-aware application is able to scale its service content in an efficient way since it has full knowledge of its own operation. Power-aware applications can therefore be less conservative with their graceful degradation schemes.

3.7.2 Power management at the driver level

A device driver is relatively easy to model as a PMC. Reducing the maximum utilization or performance of a hardware device often corresponds to separate system features, making drivers ideal places to implement power management. The information the PM needs from a driver includes the power requirements of each state, as well as the transition energy and delay[21].

These parameters can be submitted during a handshake phase at system startup or when a new device is connected to the system. The current status of a device should be made visible to the power manager in order to make accurate power management decisions, even if the device has internal management[35].

3.8 Workload prediction for graphics applications

If it is possible to predict the workload of the rendering process then predictive DPM techniques (Section: 3.3) can be used to save energy. Workload prediction is also vital if the system is to enforce constraints on the amount of rendering that is allowed. The frame structure of a 3D application offers a rich set of structural information which can be used to predict future workload. Gu et al.[17] have shown that it is possible to predict the future workload of a computer game frame with high accuracy.

Important rendering parameters include average triangle area, triangle count, average triangle height and vertex count[24]. According to Lafruit et al.[18] the number of vertices and the number of rendered pixels of an object are the most significant parameters when calculating the complexity of the rendering stages. The workload of processing a frame is almost linearly correlated with its rasterization workload, and all primitives of the same type can be rendered in approximately the same time for 3D games and other applications[17]. An example of on-the-fly calculation time prediction has been implemented by Lafruit et al.[18] Their Computational Graceful Degradation technique uses scene description parameters to predict the workload of each individual scene, and the content is then scaled accordingly to fit the rendering within the allotted time frame. A performance heuristic often used[33] for modeling the rendering pipeline execution time is shown in equation 3.1.


T(x) = max(c1 · V(x), c2 · P(x))    (3.1)

where x is the object being rendered, V(x) is the number of vertices of the object, P(x) is the number of projected fragments of the object, and c1 and c2 are platform-dependent cost constants for the geometry and rasterization stages, respectively.
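In code, the heuristic is a one-liner; the cost constants c1 and c2 below are invented placeholder values that would have to be calibrated for a real platform.

```python
def predicted_render_time(vertices, fragments, c1=2e-7, c2=5e-8):
    """Equation 3.1: the pipeline is limited by the slower of the
    geometry stage (c1 * vertices) and the raster stage (c2 * fragments)."""
    return max(c1 * vertices, c2 * fragments)

geometry_bound = predicted_render_time(100_000, 10_000)   # vertex work dominates
raster_bound = predicted_render_time(1_000, 10_000_000)   # fragment work dominates
```

The max() form captures that the pipeline stages run in parallel: whichever stage is the bottleneck determines the frame time.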

3.8.1 Signal-based workload prediction

Analytical schemes for workload prediction are often computationally expensive. The developer often has to compromise on model complexity; otherwise the computational overhead might cause the rendering to miss frame deadlines. Analytical prediction models may also have problems with specific applications. Workload estimation techniques that take additional input about the executing application are one way to address this[24].

To reduce computational complexity and improve the accuracy of workload estimation, Mochocki et al.[24] suggest using a signal-based prediction scheme. The model does not require any elaborate system model but instead uses cause-and-effect reasoning to assign each frame signature a workload taken from actual measurements. When a new frame arrives in the pipeline, a signature containing a subset of graphics rendering parameters is compared with previously stored signatures, and if there is a good enough match the workload of the stored signature becomes the estimate for the frame to be rendered. The signal-based approach demands a minor adjustment to the regular 3D graphics pipeline.

In order to collect enough information about each frame to calculate its signature, a signature buffer has to be implemented in the middle of the geometry rendering stage.

The benefits compared to analytical methods are substantial in terms of accuracy. The signal-based scheme is easy to understand, has a tolerable computational complexity and is sensitive to application-specific workloads. For a 3D graphics benchmark rendering scene the prediction error never exceeded 3%, which is a substantial improvement over analytical methods[24].
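The core of the scheme can be sketched as a lookup table keyed on frame signatures. The signature contents, distance measure and tolerance below are assumptions made for illustration, not details taken from Mochocki et al.

```python
class SignaturePredictor:
    """Map frame signatures (tuples of rendering parameters, e.g.
    vertex count and rasterized-pixel count) to measured workloads."""

    def __init__(self, tolerance=0.1):
        self.history = []            # list of (signature, workload) pairs
        self.tolerance = tolerance   # max relative difference for a match

    @staticmethod
    def _distance(a, b):
        return max(abs(x - y) / max(abs(y), 1e-9) for x, y in zip(a, b))

    def predict(self, signature, fallback):
        """Return the workload of the closest stored signature within
        tolerance, or the fallback estimate if nothing matches."""
        best = None
        for stored, workload in self.history:
            d = self._distance(signature, stored)
            if d <= self.tolerance and (best is None or d < best[0]):
                best = (d, workload)
        return best[1] if best else fallback

    def record(self, signature, measured_workload):
        self.history.append((signature, measured_workload))
```

After each frame completes, its measured workload is recorded, so recurring frame types are predicted from actual measurements rather than from an analytical model.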

3.9 Hardware support for dynamic power management

Without hardware support the power management capabilities of a system are limited.

Hardware support for power management in embedded systems includes dynamic process and temperature compensation (DPTC), active well-bias, clock gating, sleep modes and dynamic voltage and frequency scaling (DVFS). The platform used for the practical part of this thesis supports all of these features, but DVFS is unfortunately not available for the GPU. The DPTC mechanism measures the circuit's speed dependency on process technology and operating temperature and lowers the voltage to the minimum level needed to support the required operating frequency. Active well bias minimizes leakage current by lowering the well power to the transistors in the circuit.

Both DPTC and active well bias are hardware support for static power management and are considered outside the scope of this thesis, which focuses on the use of different


power modes and, to some extent, DVFS. Equation 3.2 describes the dynamic power consumption of a processor.

P = C · V² · F    (3.2)

where P is power, C is the capacitance switched per clock cycle, V is the supply voltage and F is the switching frequency.
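The quadratic dependence on voltage is what makes combined voltage and frequency scaling so effective. A quick numeric check, with illustrative capacitance, voltage and frequency values:

```python
def dynamic_power(c, v, f):
    """Equation 3.2: dynamic power = capacitance * voltage^2 * frequency."""
    return c * v ** 2 * f

full = dynamic_power(1e-9, 1.2, 500e6)    # 1 nF, 1.2 V, 500 MHz
scaled = dynamic_power(1e-9, 0.6, 250e6)  # halve both voltage and frequency
ratio = full / scaled                      # 2**2 * 2 = 8x lower power
```

Halving both voltage and frequency cuts dynamic power by a factor of eight, at the cost of roughly doubling execution time.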

3.9.1 Power modes

A power mode is a power-performance trade-off mode of operation for a system component, which can be either hardware or software. A PMC could have a continuous range of power modes, enabling the power manager to fully utilize the power saving capabilities of the device and of the system as a whole. In practice, the hardware overhead and increased design complexity that such fine-grained control requires mean that most components offer only a very limited number of power modes. Mode transition typically comes at a non-negligible cost in performance, delay, or both. These transition costs have to be taken into account when designing power management systems. High power states generally have smaller transition latency and higher performance than low power states[5].

Clock gating

Integrated circuits consist of multiple components, each of which may have several power domains. A GPU usually has sub-modules that can be gated individually. Pruning the clock tree so that the flip-flops of a sub-module are halted is called clock gating.

Clock gating is ideally suited for internal power management. The low performance overhead of stopping the clock makes it possible to use clock gating for very short idle periods without affecting performance in any significant way. Clock gating does not, however, eliminate dynamic power dissipation in the external clock circuitry, nor the leakage current[5].

Sleep modes

An ACPI (Section: 3.4) compatible device typically features at least one sleep mode.

When a component is in sleep mode the power is shut off completely to some parts of the circuit. Turning off power completely to a component requires controllable switches as well as handling the potentially large wake-up time. When powering up, the component's operation must be reinstated; this takes even longer if there are mechanical parts which have to come up to speed[5]. Sleep modes are definitely useful for power savings but are less effective in real-time operating systems, which are more or less always in an active state[29].


Mode transition

The increasing need for power management in embedded systems has led to the construction of devices featuring many intermediate power modes, where they previously only had two: on and off. Components such as CPUs and GPUs today typically feature several operational and even non-operational modes. Modern hard drives have dozens of operational modes and some devices feature scalable power levels. Hard timing constraints and subsystem dependencies have to be considered when designing a mode transition algorithm[19].

The optimal mode transition sequence is hard for programmers to identify since it quickly becomes complex as the number of modes increases, and the results can sometimes seem counter-intuitive. Liu et al.[19] have analyzed idle-mode transition sequences on the component and system level. Their algorithm calculates the optimal mode transition sequence for an arbitrary number of devices and low-power modes in logarithmic time. The first stage of the algorithm identifies the optimal mode transition under timing constraints. The second stage calculates the system-wide energy savings potential by using the constraints of all subsystems and choosing the mode transition that provides the largest energy savings at the system level. The power manager is not aware of the optimal sequence for each resource; this information is handled internally by each driver. In systems where the idle energy cost equals the active energy cost the algorithm outperforms traditional optimization schemes by 30%-50% in system-wide energy consumption. The algorithm only considers idle-mode transitions and can be combined with algorithms for active-mode optimization.
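For a single device with one sleep mode, the underlying trade-off reduces to a break-even test: sleeping only pays off if the predicted idle period is long enough to amortize the transition cost. This is a simplified sketch of that trade-off, not Liu et al.'s multi-mode algorithm, and all parameters are illustrative.

```python
def should_sleep(idle_s, p_idle, p_sleep, e_transition, t_transition):
    """Return True if entering sleep saves energy over staying idle.
    idle_s: predicted idle time (s); p_idle, p_sleep: power draw (W);
    e_transition: energy cost of sleeping and waking (J);
    t_transition: total transition time (s)."""
    if idle_s <= t_transition:
        return False                       # no time left to actually sleep
    saved = (idle_s - t_transition) * (p_idle - p_sleep)
    return saved > e_transition
```

With several sleep depths, the same test is applied per mode and the deepest mode that still passes is chosen, which is where the complexity analyzed by Liu et al. comes from.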

3.9.2 Dynamic Voltage and Frequency Scaling (DVFS)

Simultaneous scaling of processor voltage and frequency has been used extensively for many years to reduce the power consumption of CPUs and other processing units. Equation 3.2 indicates that frequency has a linear relationship with power consumption, while the supply voltage has a quadratic relationship. A processor that experiences slack time will have a lower power consumption if it is able to run at a lower voltage and frequency. The time it takes for a calculation to complete will then increase but the overall power consumption will be reduced. If the processor is able to stay in a busy state at a low voltage/frequency then the overhead of mode transition can be avoided.

Using DVFS on a GPU or a CPU performing graphics rendering will increase the frame latency (Section: 4.5) slightly but reduce power consumption significantly. Park et al.[27] have used DVFS on a development board with performance similar to the i.MX31 to achieve energy savings of up to 46% compared to a non-DVFS scheme, taking advantage of both intra-frame and inter-frame slack to scale down the CPU frequency in times of reduced workload. Intra-frame DVFS conservatively identifies rendering slack time caused by the imbalance between different stages in the pipeline, while inter-frame DVFS picks up any remaining slack between frames.
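A minimal inter-frame DVFS policy can be sketched as follows: given a predicted per-frame workload and the frame deadline, pick the lowest discrete frequency level that still finishes in time. The frequency levels and workload numbers are invented, and Park et al.'s actual scheme is considerably more elaborate.

```python
def pick_frequency(workload_cycles, deadline_s, levels_hz):
    """Choose the lowest available frequency that meets the frame deadline."""
    required = workload_cycles / deadline_s
    for f in sorted(levels_hz):
        if f >= required:
            return f
    return max(levels_hz)                  # overloaded: run flat out

levels = [133e6, 266e6, 400e6, 532e6]      # hypothetical CPU frequency levels
f = pick_frequency(10e6, 1 / 30, levels)   # 10 Mcycles per frame at 30 fps
```

Running at the chosen level also allows the corresponding lower voltage, which is where the quadratic savings of equation 3.2 come in.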


3.10 Summary

Embedded systems rely on battery power to operate. A combination of static and dynamic power management is needed to prolong the system life-time. Most power management frameworks use a power management module that resides in the operating system[12]. The power management module is responsible for monitoring system operation and actuating DPM policy. A good PM framework will include both power-aware applications and hardware drivers as power manageable components[5]. How a system is configured to achieve its power consumption objectives is called power management policy. Most power management policies optimize power under performance constraints, but for some systems it is more important to be able to guarantee system life-time rather than performance[29]. Power-aware system components provide several power modes; a low-power mode usually has some form of reduced performance characteristics. The rendering process can be considered a functional block, and if it is able to scale its operation and thereby reduce energy consumption then it can be modeled as a PMC. The power management frameworks studied for this thesis only consider PMCs with a restricted range of functionality; they fail to address how a component as complex as a graphics rendering system can be modeled. Hardware support is required to perform dynamic power management. Dynamic voltage and frequency scaling (Section: 3.9.2) is a technique that enables processing units to run at reduced clock frequency in a low power mode.


Power management methods in computer graphics software

Modern embedded systems require both high performance 3D graphics and low power consumption. The rendering of 3D graphics is a computationally heavy and memory intensive task which consumes a lot of energy on the CPU, the GPU and memory.

The major part of this chapter is dedicated to describing the basic operations that are required to render graphics, and whether there exist alternative rendering techniques that produce similar results. If more than one method exists and the methods differ in computational complexity, then the less expensive method can be used, thereby reducing the rendering workload. Modern rendering solutions consist of an almost infinite number of visual effects. It would not be possible to describe every effect in this thesis; instead, this chapter lists the basic operations of the graphics pipeline in an embedded system, such as the one found on the i.MX31 board.

When designing a power aware graphics driver it is only natural to target the processing units specifically. The GPU consumes power when it is working but also whenever it has to fetch data from external memory. External memory access is one of the most energy consuming operations in embedded systems[3]. To achieve energy efficiency in an embedded system both the rendering workload and the number of memory accesses have to be addressed. This chapter describes how to produce energy efficient graphics in an embedded system. The methods are divided into sections in a top-down approach.

4.1 Energy efficient graphics applications

3D graphics rendering is a powerful tool for creating visually pleasing and effective applications, but rendering 3D content comes at the cost of increased computational workload. When the GPU and CPU are running at high capacity a lot of power is consumed. Developers have to be careful not to use superfluous animations when designing applications, and should treat power consumption as a part of the development process.

This is especially true for embedded systems where battery power is in limited supply.


Adding detail to scenes Most 3D scenes contain more objects than a human can categorize and remember instantaneously[7]. Developers of interactive 3D applications must determine whether adding detail to a scene is justified. More objects in a scene result in higher rendering times and higher energy consumption. Removing objects from a scene will sometimes not only save time and power but also make the scene less cluttered and improve user acceptance.

Graphics application development Developers of graphics applications have the ability to affect the power consumption of 3D rendering by writing efficient code. Developers who are aware of the inner workings of the platform they are developing for will be able to balance the workload properly across the different pipeline stages. How calls are made to the graphics library can affect how fast a scene is rendered, even if the images produced are identical.

For instance, trying to read from a texture or vertex which is currently locked by the GPU causes the CPU to stall. The GPU might also have to stall because it requires a resource that is currently locked by the CPU. In general, applications should avoid accessing resources that the GPU might need during processing.

Also, applications should strive to maximize batch size as much as possible. A batch is a set of primitives submitted to the graphics library in a single API call. Every batch causes a small CPU overhead, and the number of batches can be drastically reduced by intelligent application design[14].

Improving user productivity When designing 3D graphics applications it is not enough to minimize power consumption for a given computational task; improving user productivity is equally important to energy efficiency. Zhong et al.[36] have produced a useful definition of energy efficiency from the user's perspective: not only the lifetime of the battery, but also how much the user is able to accomplish before the battery runs out.

Rendering techniques such as occlusion, shadows, contrast and perspective are powerful tools for creating more user-friendly applications in handheld systems[8]. Both navigation and text readability can be improved by making better use of such methods. There exists a trade-off between user productivity and power consumption. With intelligent use of 3D animations user productivity can be increased: even though the instantaneous power consumption may be higher while the user is active, he is able to accomplish a task in less time, enabling the system to go into sleep mode earlier, which results in lower overall power consumption.

Interactive applications typically have a point where additional power allocation does not add to user-perceived quality, because of the time it takes for the user to respond to visual input[35]. An energy efficient graphical user interface is often one that enables the system to accomplish a task while having to wait as little as possible for user input.

A more aggressive approach is to try to predict user behavior and have the result ready even before the next input. The auto-complete feature in many search fields today is an excellent example of this approach[36].


4.2 Level of detail

Even though unnecessary rendering should be avoided, it is not desirable to avoid rendering completely.

For many applications the whole point is to provide the user with 2D- or 3D graphics.

With the growing complexity of polygonal models comes a need for fidelity and quality control. If visual appearance can be maintained with fewer polygons then a substantial amount of rendering computation can be removed. The level of detail (LoD) concept involves decreasing the complexity of a 3D object representation according to metrics such as object importance or position. Level of detail techniques increase the efficiency of rendering by decreasing the workload on the graphics pipeline stages, especially vertex transformations.
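A common way to apply the concept is to switch between pre-built object representations based on distance to the camera. The switch distances and mesh names below are invented for illustration.

```python
def select_lod(distance, switch_distances):
    """Return a LoD index for an object: 0 is full detail, and each
    ascending switch distance steps down to a coarser representation."""
    for lod, limit in enumerate(switch_distances):
        if distance < limit:
            return lod
    return len(switch_distances)

# Hypothetical representations of one object, finest first.
meshes = ["10000-vertex mesh", "2500-vertex mesh",
          "600-vertex mesh", "billboard"]
chosen = meshes[select_lod(35.0, [10.0, 30.0, 80.0])]
```

Because vertex transformation cost scales with vertex count, each step down the list directly reduces the geometry-stage workload.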

Figure 4.1: Trade-off between the geometric error and computational cost, Tack et al.

Nicolas Tack et al.[33] have written one of the few papers that describe an algorithm combining demands for minimum frame rate and image quality with power management.

The algorithm lets the user set a requested maximum error and exploits the remaining time to reduce power consumption. They use a geometric error model based on the mean squared error (MSE) to approximate the impact each object has on perceived visual quality, and iteratively increase the LoD of the object that yields the highest distortion reduction at the smallest cost until the desired error is obtained. The time remaining after rendering the frame can be spent in idle mode to save power. The algorithm provides energy savings

between 30% and 76% compared to traditional optimization techniques.
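The greedy refinement loop can be sketched like this: each object carries a list of (error, cost) pairs per LoD level, coarsest first, and the loop raises the LoD of whichever object currently buys the most error reduction per unit of extra cost. This is an illustrative reconstruction of the idea with made-up numbers, not Tack et al.'s implementation.

```python
def refine_lods(objects, error_target):
    """objects: per object, a list of (error, render_cost) per LoD level,
    coarsest level first. Returns the chosen LoD index per object."""
    levels = [0] * len(objects)

    def total_error():
        return sum(obj[lvl][0] for obj, lvl in zip(objects, levels))

    while total_error() > error_target:
        best, best_ratio = None, 0.0
        for i, obj in enumerate(objects):
            if levels[i] + 1 < len(obj):
                err_now, _ = obj[levels[i]]
                err_next, cost_next = obj[levels[i] + 1]
                cost_now = obj[levels[i]][1]
                # Error reduction per unit of additional rendering cost.
                ratio = (err_now - err_next) / max(cost_next - cost_now, 1e-9)
                if ratio > best_ratio:
                    best, best_ratio = i, ratio
        if best is None:
            break                          # nothing left to refine
        levels[best] += 1
    return levels

# Two hypothetical objects with two LoD levels each.
objects = [[(10.0, 1.0), (2.0, 3.0)], [(5.0, 1.0), (4.0, 5.0)]]
```

Any frame time left over once the error target is met can be spent idle, which is where the reported energy savings come from.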


Figure 4.2: (a) Spheres rendered with 23284800 vertices. (b) Spheres rendered with 1094400 vertices. The 21 times reduction in level of detail is barely perceptible.

Source: http://en.wikipedia.org/wiki/Level_of_detail (accessed 2010-03-25)


The reason why LoD control has received such attention from the research community is the high potential for time and energy savings. Figure 4.1 serves as an illustrating example: as the number of triangles used to render the bunny object increases, the geometric error curve quickly levels out. A small increase in the visual quality of a scene can thus lead to a large increase in execution time. The reduced visual quality of an object often goes unnoticed because of the small effect on object appearance when the object is distant or moving fast, or simply because the user's attention is focused elsewhere in the scene.

4.2.1 Simplification models and progressive meshes

Figure 4.3: An aeroplane model at different levels of simplification. Watson et al.[34]

The level of detail that is actually necessary to preserve convincing realism varies within a scene. Objects that are close to the camera have to be modeled with very high vertex resolution to be convincing. It is useful to have several representations of the same object with varying degrees of resolution; that way, applications can choose to render some objects at reduced complexity. Simplification models take an object with a high vertex count and try to compute an approximation using fewer vertices while preserving image fidelity[15].

Using a simplification method it is possible to produce progressive meshes. A progressive mesh is a representation of an object at various levels of detail. Even low vertex count progressive meshes describe the overall shape of the object, and as more data is added new vertices increase the level of detail. Producing progressive meshes is computationally heavy and is done offline.


4.2.2 Level of detail control

High-performance graphics applications have used level of detail control for several years to speed up rendering while maintaining image quality[11], and with the advanced rendering capabilities of today's embedded systems, LoD control is making its way into mobile phones.

LoD control typically works in one of two ways: either a method for mesh simplification is built into the graphics application itself, or a scene graph toolkit is used to balance LoD against other parts of the rendering process. Both solutions require the developer to construct a custom system for LoD control, which might be the reason why there have been only partially successful attempts to create a unified interface for LoD control. The GLOD API developed by Cohen et al.[11] can be used to control LoD in a standardized fashion, enabling developers to focus on application development while leaving LoD control to the GLOD system. GLOD is meant to be flexible enough to work alongside different versions of the OpenGL API.

Not only does GLOD provide a method for LoD control on an object-to-object basis, it can also adapt the LoD for separate parts of large objects that span large parts of the display. Adjusting the LoD of such an object will cause distant parts of it to be rendered at lower LoD, thereby saving precious CPU cycles.

LoD control is certainly an excellent example of how graceful degradation exists in the rendering pipeline. The subject is far from trivial and, due to time restrictions, implementation details and a more in-depth study have to be left as future work.

4.3 Power awareness in the rendering pipeline

The application can become power aware by making intelligent use of the limited rendering resources, but the rendering process can also become power aware itself. Power aware rendering requires a combination of dynamic power management methods and optimization without a decrease in performance.

The headings in the following sections do not directly correspond to the pipeline stages mentioned in chapter 2. The different stages and the power-saving methods will be described in the order they are handled by the graphics pipeline (Section: 2.2).

4.3.1 Per-vertex transform and lighting

Vertices in the 3D scene are lit according to color and the light sources within the scene. Vertex lighting is usually computed using the Phong[28] or Blinn-Phong[6] lighting model.

Phong lighting requires that the reflective properties of each object are described in terms of ambient, specular and diffuse reflection. For each light within the scene the specular term is computed from the dot product of the reflection vector R and the viewer direction V. Computing the reflection vector is relatively expensive, and Blinn's modification to Phong avoids this operation by approximating the angle between V and R using the halfway vector H. The halfway vector is found by normalizing the sum of the light direction L and the viewer direction V.
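The difference between the two specular terms can be made concrete in a few lines. This sketch assumes normalized direction vectors L (toward the light), V (toward the viewer) and surface normal N; it is illustrative code, not the pipeline's actual implementation.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def phong_specular(n, l, v, shininess):
    """Classic Phong: reflect L about N, then raise max(R.V, 0)."""
    r = tuple(2 * dot(n, l) * nc - lc for nc, lc in zip(n, l))
    return max(dot(r, v), 0.0) ** shininess

def blinn_phong_specular(n, l, v, shininess):
    """Blinn-Phong: use the halfway vector H = normalize(L + V) and
    raise max(N.H, 0), avoiding the reflection vector entirely."""
    h = normalize(tuple(lc + vc for lc, vc in zip(l, v)))
    return max(dot(n, h), 0.0) ** shininess
```

Both terms peak when the viewer lines up with the mirror reflection; they differ in falloff, which in practice is compensated by choosing a different shininess exponent.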


Power management methods in computer graphics software

Figure 4.4: The vectors needed to calculate Phong and Blinn-Phong lighting.

It is possible to use vertex lighting to interpolate the values of individual pixels, but it is also possible to calculate lighting on a per-pixel basis. Whichever method is used, the process is called shading and is explained in section 4.3.3. Lighting on a per-vertex basis is computationally heavy for scenes containing a lot of geometry data. Calculating vertex normals and tangents on the CPU and storing them reduces the complexity of vertex processing but requires slightly more vertex fetching[14].
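The CPU-side precomputation mentioned above can be sketched as follows: a minimal (and hypothetical) routine that derives per-vertex normals once, by averaging the face normals of all triangles sharing each vertex, so the result can be stored alongside the mesh instead of being recomputed every frame.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(v):
    length = math.sqrt(sum(c * c for c in v)) or 1.0
    return tuple(c / length for c in v)

def compute_vertex_normals(vertices, triangles):
    """Accumulate each triangle's (area-weighted) face normal into its
    three vertices, then normalize the sums to get per-vertex normals."""
    acc = [(0.0, 0.0, 0.0) for _ in vertices]
    for i, j, k in triangles:
        n = cross(sub(vertices[j], vertices[i]),
                  sub(vertices[k], vertices[i]))
        for idx in (i, j, k):
            acc[idx] = tuple(a + b for a, b in zip(acc[idx], n))
    return [normalize(n) for n in acc]
```

Storing the result trades a little extra per-vertex fetch bandwidth for much cheaper vertex processing, as the text notes.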

Vertex caching Vertex caching is a transform and lighting optimization technique employed by, among others, Park et al.[27]. Neighboring triangles often share vertices in the rendering pipeline. Before a vertex enters the transformation stage the vertex cache is searched, and if the same vertex has been processed before the results are reused, thus avoiding performing transformation and lighting for the same vertex twice.
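The idea can be sketched as a small FIFO cache of post-transform results keyed by vertex index (a common hardware organization; the capacity and class name here are illustrative, not from the cited work):

```python
from collections import OrderedDict

class VertexCache:
    """Small FIFO cache of transformed vertices, keyed by vertex index."""

    def __init__(self, capacity=16):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = self.misses = 0

    def transform(self, index, transform_fn):
        # Reuse a previously transformed-and-lit vertex if it is cached.
        if index in self.entries:
            self.hits += 1
            return self.entries[index]
        self.misses += 1
        result = transform_fn(index)          # the expensive T&L work
        self.entries[index] = result
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the oldest entry
        return result
```

For indexed triangles sharing an edge, two of the six index references hit the cache, so a third of the transform work is skipped.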

Many lights For each light source, additional information is required to calculate lighting. Iterating over many light sources can increase the computational requirements for lighting, and using only a subset of the local light sources can speed up the process. Choosing only the lights closest to the viewpoint or the strongest light sources are two possibilities for choosing light sources, based on the assumption that these lights contribute the most to the overall appearance of the scene.
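The distance-based variant of this selection can be sketched in a few lines (the light representation as dicts with a `position` key is purely illustrative):

```python
import math

def select_lights(lights, viewpoint, k):
    """Keep only the k light sources closest to the viewpoint."""
    def distance(light):
        return math.dist(light["position"], viewpoint)
    return sorted(lights, key=distance)[:k]
```

The strongest-lights variant would sort on intensity instead of distance; either way, the lighting loop then iterates over k lights regardless of how many exist in the scene.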

Light maps Lighting calculations per vertex can easily become computationally heavy as the number of vertices in a scene increases. Even with hardware support, per-vertex lighting such as Phong lighting, or even worse, ray tracing might lead to frame rendering times of up to several seconds. Light maps are a per-pixel lighting technique which uses textures containing lighting information and multiple rendering passes to perform another type of hardware accelerated lighting. Illumination maps are best used for global lighting and are not as good at handling local light sources[?]. Light maps can be used as a substitute for vertex lighting, or the lighting methods can work in parallel. Some objects within a scene might be lit using Phong or some other lighting method while other objects are lit using a light map. If the object is stationary and there are no moving light sources within the scene, then a light map can provide excellent lighting at low computational cost.

4.3.2 Clipping and culling

Only objects that are visible to the user need to be rendered. Clipping is the process of removing vertices outside the current view. The more vertices that can be clipped, the less time the rendering process will take and the more power will be saved. Clipping is non-trivial: objects that are partially within the field of vision must be cut off where the object intersects the view boundary, and there are different methods for removing the part of the object that is not visible.

A technique related to clipping is occlusion culling. While clipping is concerned only with removing geometry that is outside the field of vision, culling is the process of identifying objects that can safely be omitted from the rest of the rendering process because they are completely or partially occluded by other objects. It is also unnecessary to draw the parts of an object that are facing away from the camera, since they are occluded by the front of the object. The issue of partially occluded objects applies to culling as well.

The algorithms used to identify content that can be culled are themselves computationally heavy, and programmers have to take care when implementing culling so that the culling process does not require more power than it saves. To reduce the complexity of the culling process, many culling algorithms are heuristics and approximations. One way to achieve graceful degradation is to use a less conservative approximation when identifying objects that can be removed. That way, more computations are saved by not rendering, but sometimes objects that are within the field of vision will be removed as well, sometimes resulting in images of severely lower quality. Culling is useful to both scanline- and tile-based rendering; in both cases it reduces the amount of depth comparisons and depth information that has to be stored in memory.
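The trade-off between a conservative and a less conservative approximation can be illustrated with a bounding-sphere-versus-plane test. In this sketch (hypothetical names; planes given as (a, b, c, d) with the visible side where ax + by + cz + d >= 0), a `slack` factor shrinks the effective radius: higher slack culls more aggressively but may reject objects near the view boundary, exactly the graceful degradation described above.

```python
def sphere_outside_plane(center, radius, plane, slack=0.0):
    a, b, c, d = plane
    distance = a * center[0] + b * center[1] + c * center[2] + d
    # slack > 0 shrinks the effective radius: a less conservative test
    # that may also reject objects just inside the view boundary.
    return distance < -(radius * (1.0 - slack))

def cull(objects, planes, slack=0.0):
    """Keep only objects whose bounding sphere may intersect the view."""
    return [o for o in objects
            if not any(sphere_outside_plane(o["center"], o["radius"], p, slack)
                       for p in planes)]
```

With slack = 0 an object straddling a plane is conservatively kept; raising slack trades that guarantee for fewer rendered objects.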

Z-max culling Z-max culling is a commonly used occlusion culling algorithm. For each tile of e.g. 8 × 8 pixels, the maximum z depth value is stored. If the stored value is smaller than the smallest z value of a triangle, that tile is not rendered. To update z-max for a tile, all the z values of that tile have to be read, which can be expensive[1]. To increase the effectiveness of z-max culling, the scene should be rendered in a roughly front-to-back order[14].



Z-min culling A complementary culling technique is z-min culling. The idea is to determine whether a triangle is definitely in front of all previously rendered geometry. In that case, there is no need to perform z-buffer reads for the tile. It works similarly to z-max culling, but instead of storing the maximum z value of a tile it stores the minimum value.

If the maximum z value of the triangle is smaller than the z-min value of the tile, then the whole triangle is in front and all its pixels in the tile will be rendered. The z-min culling technique reduces the total bandwidth needed for depth reads. It also affects the internal/external memory bandwidth ratio positively, resulting in even lower power consumption[1].
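Both per-tile tests can be summarized in one decision function. This is a sketch under the usual convention that smaller z means nearer the camera; the function and label names are illustrative:

```python
def classify_triangle(tri_zmin, tri_zmax, tile_zmin, tile_zmax):
    """Per-tile occlusion decision for one triangle.
    Returns 'culled', 'no_z_read', or 'z_test'."""
    if tri_zmin > tile_zmax:
        # z-max culling: the whole triangle lies behind everything
        # already drawn in this tile, so skip it entirely.
        return "culled"
    if tri_zmax < tile_zmin:
        # z-min culling: the triangle is definitely in front, so every
        # pixel passes and no per-pixel z-buffer reads are needed.
        return "no_z_read"
    return "z_test"  # depth ranges overlap: per-pixel tests required
```

Only the third case pays for per-pixel depth reads; the first two save either all rendering work or all depth-read bandwidth for the tile.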

4.3.3 Fragment shading

Lighting has to be applied on a per-pixel basis to make a 3D scene more realistic. The rasterization process uses different shading techniques to calculate the impact lighting has on pixel color. The absence of shading is referred to as flat shading, in which case the color is the same for each pixel within a triangle. In OpenGL, per-pixel shading is performed by a fragment shader, which is the term used in this thesis.

The traditional method for fragment shading is to interpolate pixel values using vertex lighting data. Gouraud shading is a technique that is used for this purpose. An alternative to Gouraud shading is Phong shading, which is basically a per-pixel generalization of the Phong vertex lighting algorithm (Section 4.3.1). Phong shading assumes that the curvature of a triangle is uniform between vertices and performs Phong lighting using the approximated normal at each pixel. The computational cost of Gouraud shading is much lower than that of Phong shading. The power consumption ratio can be described using equation 4.1.

P_phong / P_gouraud = (20p + 6x + 24) / (p + 2x + 35)    (4.1)

where x is the number of pixels along a side of the triangle and p is the number of pixels within the triangle[13].
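Equation 4.1 is straightforward to evaluate; the sketch below shows that the ratio grows with triangle size and approaches 20 for large, pixel-heavy triangles, which is why Phong shading dominates the shading power budget.

```python
def phong_gouraud_power_ratio(p, x):
    """Power ratio P_phong / P_gouraud from equation 4.1, where p is the
    number of pixels inside the triangle and x the number of pixels
    along one of its sides."""
    return (20 * p + 6 * x + 24) / (p + 2 * x + 35)
```

For a small triangle (p = 4, x = 2) the ratio is only about 2.7, but for a large one (p = 10000, x = 100) it is already close to the asymptotic factor of 20.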

Depth first rendering Depth first rendering is a technique that can be used to speed up the fragment shading stage of the graphics pipeline. The first rendering pass is done without any color or shading information. When all geometry has been rendered, color and shading are added to the scene. This way, no calculations are wasted shading invisible surfaces. Depth first rendering also has a positive effect on frame-buffer bandwidth[14].
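The saving can be illustrated with a toy model of a single pixel covered by several overlapping fragments (a deliberately simplified sketch, not an actual pipeline implementation):

```python
def shade_invocations(fragments, depth_prepass):
    """Count fragment-shader executions at one pixel.
    fragments: depth values in draw order (smaller = nearer)."""
    if not depth_prepass:
        # Single pass: every fragment that passes the running depth test
        # is shaded, even if a nearer fragment later overwrites it.
        depth, count = float("inf"), 0
        for z in fragments:
            if z < depth:
                depth, count = z, count + 1
        return count
    # Pass 1 (depth only) finds the nearest depth; pass 2 shades only
    # the single fragment that survives it.
    return 1 if fragments else 0
```

Drawing back-to-front without a pre-pass shades every overlapping fragment; with the depth-only first pass, exactly one fragment per covered pixel is shaded regardless of draw order.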

Adaptive shading Jeongseon Euh and Wayne Burleson[13] present an adaptive shading algorithm which enables the rendering process to make an intelligent decision about the optimal shading technique. The algorithm considers both graphics content and human perception, and sets the shading technique individually for each object in a scene. The algorithm provides power savings of around 80% for the shading part of
