
Department of Science and Technology
Linköpings universitet
SE-601 74 Norrköping, Sweden

LITH-ITN-MT-EX--06/020--SE

IBR camera system for live TV production

Andréas Edling
Jonas Franke

IBR camera system for live TV production

Thesis work in Media Technology carried out at Linköping Institute of Technology, Campus Norrköping

Andréas Edling
Jonas Franke

Supervisor: Erik Fägerwall
Examiner: Mark Ollila

Norrköping, 2006-04-21


Keywords: Image Based Rendering, Video Based Rendering, TV Production, Depth Calculation, Scene Reconstruction, Image Processing, Computer Graphics


Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a considerable time from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for your own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its WWW home page: http://www.ep.liu.se/

Abstract

Traditional television and video recordings are limited to showing only views of a scene from where a camera is positioned. For certain situations, positioning a camera at the most attractive viewpoints might not be possible, such as above a crowd at a concert. In this thesis, technologies for alleviating this limitation, and for extending traditional filming processes, by using video-based rendering methods are investigated. With such video-based rendering methods, given a set of existing cameras, images from novel viewpoints can be produced without having any capturing device at the chosen viewpoint.

This thesis investigates the current state of image- and video-based systems for creating virtual camera views, which techniques are used in different approaches, and what the current limitations are. Furthermore, an implementation of a virtual camera system is created which can be run on consumer graphics hardware. By merging geometries from different cameras, calculated with a stereo algorithm, a representation of the current scene can be rendered. The viewpoint can be changed interactively by a user in real time.

Contents

1 Introduction
1.1 Motivation
1.2 Problem description
1.3 Purpose
1.4 Method
1.4.1 Literature review and theoretical background
1.4.2 Implementation
1.4.3 System evaluation
1.5 Limitations
1.6 Disposition of the report

2 Related work
2.1 Virtualized Reality
2.2 Image-based visual hulls
2.3 Relief Texture Mapping
2.4 High quality video view interpolation
2.5 Light field rendering
2.6 Microfacet Billboarding
2.7 3DTV
2.8 Optical flow
2.9 Tour into Video

3 Theoretical background
3.1 Image-based modeling and rendering
3.1.1 Photography vs. computer graphics
3.2 Video-based rendering
3.2.1 VBR for television
3.3 Acquisition
3.3.1 Camera setup
3.3.2 Illumination
3.3.3 Network infrastructure
3.4 Geometric calibration
3.4.1 Small-baseline calibration
3.4.2 Wide-baseline calibration
3.5 Colour calibration
3.6 Camera representation
3.6.1 Camera calibration matrix
3.6.2 The external parameters matrix
3.6.3 Fundamental matrix
3.7 Stereo correspondence
3.7.1 Disparity representation
3.7.2 Multi-Baseline stereo
3.7.3 Matching cost calculation
3.7.4 Matching cost aggregation
3.7.5 Disparity computation
3.7.6 Epipolar Geometry
3.7.7 Image rectification
3.8 Segmentation
3.9 OpenGL geometry pipeline
3.10 Hardware
3.11 GPU programming and GLSL

4 Implementation
4.1 Method
4.2 System overview
4.3 Camera calibration
4.4 Image rectification
4.5 Real-time disparity image calculation
4.5.1 Disparity hypothesis
4.5.2 Winner Take All disparity calculation
4.5.3 Symmetric neighbourhood filtering (SNF)
4.5.4 Max gradient calculation
4.5.5 Aggregation
4.6 Scene reconstruction
4.6.1 Scene geometry artifacts
4.7 Multiple geometries

5 Experiments and results
5.1 Test one - existing multi-camera data set
5.1.1 Depth maps
5.1.2 Scene reconstruction
5.1.3 Results
5.2 Test two - Virtual Scene
5.2.1 Camera calibration
5.2.2 Scene reconstruction
5.2.3 Results
5.3 Test three - Recorded material
5.3.1 Camera calibration
5.3.2 Image rectification
5.3.4 Scene reconstruction
5.3.5 Results

6 Discussion

7 Conclusion and future work
7.1 Conclusions
7.2 Future work

List of Figures

2.1 Capturing device (Image from [1])
2.2 Visual hull volume extracted from silhouettes (Image from [1])
2.3 Ordinary texture mapping versus relief texturing (Image from [2])
2.4 Multi-camera setup (Image from [3])
2.5 The light slab representation and two visualizations of a light field (Image from [4])
2.6 Frontal and side view of microfacets (Image from [5])
2.7 3DTV (Image from [6])
2.8 Scene flow computed from two input images at different times (Image from [7])
2.9 Background model constructed from a video sequence (Image from [8])
3.1 Traditional IBMR approach
3.2 Photography vs. computer graphics
3.3 Mobile setup where each module consists of a FireWire camera connected to a laptop (Image from [9])
3.4 Pinhole camera model
3.5 Triangulation in stereo images
3.6 Correspondence for x in the right image along the epipolar line l'
3.7 Rectification yields a horizontal epipolar line l' with the same y-value as the corresponding pixel x
3.8 Typical example of colour segmentation
3.9 OpenGL geometry pipeline
4.1 Visualization of the system pipeline
4.2 18 checkerboard images used for camera calibration
4.3 Spatial configuration of two cameras and calibration planes
4.4 Intermediate results of the different filtering processes: disparity hypotheses, SNF filter, max gradient filter and depth map aggregation
4.5 12 different disparity hypotheses
4.6 Disparity map from WTA calculation
4.7 The four different pairs of neighbouring pixels
4.8 Original image compared to SNF filtering with one and sixteen iterations
4.9 Thresholded maximum gradient map
4.10 The different combinations of discontinuities (marked by red bars) and the corresponding weights used for aggregation
4.11 Original disparity map and disparity map after two and eight WTA aggregations
4.12 Scene reconstruction by using original image, depth map and a high pass image
4.13 Geometry from one camera view
4.14 Geometry from two camera views
4.15 Geometry from three camera views
4.16 Geometry from three camera views without filtering
5.1 Depth map from Zitnick's data set
5.2 One geometry seen from front, side and top
5.3 Matching of two geometries
5.4 Views from the left cameras of the four camera rigs
5.5 16 images used for camera calibration
5.6 Mismatching of geometries, correct matching without filter, correct matching with filter
5.7 Close-up of three matching geometries where artifacts become visible
5.8 Stereo camera rig used for capturing a static scene
5.9 Camera calibration process
5.10 Original and rectified image
5.11 Original image and a corresponding depth map
5.12 Reconstructed geometry. Inconsistent depth maps lead to quite noticeable artifacts.

List of Tables

5.1 Rendering speed depending on the number of cameras rendered (without online depth map calculation)
5.2 Rendering speed depending on the number of cameras rendered (with online depth map calculation)

1 Introduction

1.1 Motivation

In the past few years, TV broadcasting of dynamic events, such as concerts or sports events, has moved toward providing the spectator with a richer visual experience, such as spectacular camera flights and close-ups from novel views. These demands put a lot of strain on current camera systems. There exist camera systems suspended on wires, which are used to generate spectacular footage to some degree, but such systems have limitations, especially economically.

Instead of using large or cumbersome camera devices, which in many cases proves to be impossible, calculation of novel views using an IBR¹/VBR² framework can be utilized.

1.2 Problem description

As of today, only limited free-viewpoint systems exist on the television broadcasting market. There has been quite a lot of research on the subject, but there are still a couple of problems and bottlenecks that need to be considered.

Most of the work on IBR and view interpolation focuses on rendering static scenes, which can produce high-quality renderings. Extending this to capturing, representing and rendering dynamic scenes introduces some additional problems. Large sets of data must be processed quickly to generate novel views for real-time use, which is generally hard to do without much processing power. More cameras generate more data, but the final result is usually better. By using many cameras lined up closely together it is much easier to achieve a good visual result; however, this might not be a cost-effective solution. Additionally, this approach might not be feasible because such a device is expected to be heavy, cumbersome and in need of a lot of space.

¹ image-based rendering
² video-based rendering

1.3 Purpose

The goal of this thesis work is to produce photorealistic renderings from novel viewpoints using as few cameras as possible. There are two main purposes in this project:

• To investigate, by reviewing existing techniques, how video-based rendering techniques could be used to construct a virtual camera system for live production television

• To implement a system using the most suitable techniques, and to evaluate the system in order to find bottlenecks and difficulties

The purpose of the implementation is to test a video-based rendering approach to find out what the limitations are. Furthermore, the implementation should be seen as a framework, to identify current limitations and to evaluate possible future improvements.

1.4 Method

The work presented in this thesis can be divided into three parts: literature review, implementation and evaluation of the system.

1.4.1 Literature review and theoretical background

The first part of the work was a literature review of previous work related to video-based rendering. This review has mostly focused on scientific reports from the ACM³ and IEEE⁴ databases, and on books covering the fields of computer vision and VBR [10, 11, 12]. Different methods and algorithms have been considered and combined to arrive at the most suitable system implementation.

1.4.2 Implementation

Based upon existing research and the theoretical background covered in this thesis, an implementation has been made that can be run on a computer with a decent graphics board. The implementation has been done using C++ and OpenGL.

1.4.3 System evaluation

The system implementation has been evaluated in three tests. The first test uses multi-camera material capturing a real scene from the work presented in [3]. The second test is based on a virtual scene constructed in 3D Studio Max. The third test uses material filmed by a stereo camera rig shooting a static scene from arbitrary directions. Different aspects of the implementation are put to the test.

³ http://www.acm.org
⁴ http://www.ieee.org

1.5 Limitations

Many different formats for transferring and broadcasting video exist today. How to convert and import video streams into a computer is not considered in this work. A considerable amount of time could be spent on this subject, and if the implementation were to be integrated with a TV production pipeline, the subject would need to be addressed.

1.6 Disposition of the report

Chapter two reviews previous research in the field of video-based rendering. Chapter three explains the theoretical background of different techniques used in video-based rendering and computer vision. Chapter four covers the implementation of the system and describes all the steps included. Chapter five explains the tests conducted and the results achieved, and the discussion of the results takes place in chapter six. Finally, chapter seven concludes the thesis by summing up and suggesting future work that would extend and improve it.

2 Related work

The work presented in this report relates to a number of different fields in computer graphics, computer vision and television broadcasting. There has been a lot of research in the fields of dynamic IBR and stereo vision; stereo correspondence has been one of the most investigated topics in computer vision. Presented here are works that have been helpful in the research and development conducted.

2.1 Virtualized Reality

In [13], a visual event is captured using many cameras that cover the scene from different angles. The time-varying 3D geometry of the event, along with the pixels of the images, is computed using a stereo technique. Triangulation and texture mapping techniques are then used to enable a virtual camera to reconstruct the event from any new viewpoint. The scene description from the camera closest to the virtual camera's position can then be used for rendering. The viewer of the system, wearing a pair of stereo glasses, can then move around in the world and observe it from a chosen viewpoint. This approach is a kind of enhanced virtual reality; instead of moving around in a virtual world, a virtualization of the real world is achieved. An example where the technology from Virtualized Reality has been used is the Eye Vision project. This system involves shooting multiple video images of a dynamic event, such as a football game. A computer processes the recorded images, which can then be combined and displayed to give the viewer a feeling of flying around the scene. The system was used at Super Bowl XXXV, where 30 synchronized cameras were mounted around the stadium. The output was composed by interpolating views between all the cameras, achieving a virtual camera movement covering more than 200° of rotation.

2.2 Image-based visual hulls

The biggest problem with most computer vision algorithms is that they are too slow to achieve good results in real time, which implies off-line pre-processing and computation. One of the best ways to extract geometry from photographic inputs is an approach called visual hulls. Many researchers have used silhouette information to distinguish regions of 3D space where an object is, and is not, present. A visual hull is extracted by using the silhouettes of foreground objects in different input images. Each region outside a silhouette defines where the object cannot be. By carving away these regions, the intersection of the silhouette cones defines an approximation of the geometry of the actual object. In [1], Matusik et al. present a visual-hull rendering algorithm that works for real-time applications. It can be used to process video streams and render the scene from novel viewpoints in real time. The algorithm uses one of the input images as a reference image, where each pixel in this image represents the view along a single line in 3D. This 3D line can then be projected into one of the other images as a 2D line (the epipolar construction). At each pixel a set of depth intervals is computed in which the object must lie. This allows a full-resolution visual hull to be constructed. The visual hull is shaded using the reference images as textures.

Figure 2.1: Capturing device (Image from [1])
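To make the carving idea concrete, the sketch below shows a naive volumetric variant of silhouette carving: a voxel survives only if every camera sees foreground at its projection. This illustrates the general visual-hull concept, not the image-space, full-resolution algorithm of [1]; the Camera struct and the projection layout are assumptions made for the example.

    #include <vector>
    #include <array>

    // Hypothetical camera description used only for this sketch.
    struct Camera {
        std::array<double, 12> P{};               // 3x4 projection matrix, row major
        int width = 0, height = 0;
        std::vector<unsigned char> silhouette;    // 1 = foreground, 0 = background
    };

    static bool insideSilhouette(const Camera& c, double X, double Y, double Z) {
        const auto& P = c.P;
        double u = P[0]*X + P[1]*Y + P[2]*Z  + P[3];
        double v = P[4]*X + P[5]*Y + P[6]*Z  + P[7];
        double w = P[8]*X + P[9]*Y + P[10]*Z + P[11];
        if (w <= 0.0) return false;               // point is behind the camera
        int px = static_cast<int>(u / w), py = static_cast<int>(v / w);
        if (px < 0 || py < 0 || px >= c.width || py >= c.height) return false;
        return c.silhouette[py * c.width + px] != 0;
    }

    // Carve a regular voxel grid: a voxel is kept only if its centre projects
    // inside the foreground silhouette of every camera.
    std::vector<unsigned char> carveVisualHull(const std::vector<Camera>& cams,
                                               int n, double gridSize) {
        std::vector<unsigned char> occupied(static_cast<std::size_t>(n) * n * n, 1);
        for (int k = 0; k < n; ++k)
            for (int j = 0; j < n; ++j)
                for (int i = 0; i < n; ++i) {
                    double X = (i + 0.5) / n * gridSize - gridSize / 2;
                    double Y = (j + 0.5) / n * gridSize - gridSize / 2;
                    double Z = (k + 0.5) / n * gridSize - gridSize / 2;
                    for (const Camera& c : cams)
                        if (!insideSilhouette(c, X, Y, Z)) {
                            occupied[(static_cast<std::size_t>(k) * n + j) * n + i] = 0;
                            break;
                        }
                }
        return occupied;
    }

The cost of this brute-force variant grows with the cube of the grid resolution, which is one reason the image-based formulation of [1] is attractive for real-time use.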

2.3 Relief Texture Mapping

In [2], an extension to ordinary texture mapping is presented, where the relief texture contains information about surface details in three dimensions. This enables the viewer to experience motion parallax. A relief texture contains depth displacement information for each texel. Relief texturing can be decomposed into two steps. First, a 3D warp is applied to the relief texture, depending on the view direction. Secondly, the warped texture is applied to a polygon and rendered as in standard texture mapping. The 3D warp in the first step can be decomposed into two 1D warps, aligned with the texture coordinate axes.

Figure 2.3: Ordinary texture mapping versus relief texturing (Image from [2])

2.4 High quality video view interpolation

One of the most important works in the field of video-based rendering is presented in [3]. This method shows how high-quality rendering of dynamic scenes can be accomplished using multiple synchronized cameras and a view-based approach. The technique uses a 3D reconstruction algorithm that combines information from all cameras and provides much better rendering results than other approaches using the same number of cameras. First, a dynamic scene is captured using eight cameras placed along an arc spanning over 30°, as seen in Figure 2.4. The system acquires the video and computes depth maps of the scene off-line, and the actual rendering is then done in real time. The user can interactively control the viewpoint while watching the video, rendering the scene from arbitrary positions within the arc spanned by the cameras.

Figure 2.4: Multi-camera setup (Image from [3])

2.5 Light field rendering

A technique that can be used for generating new views from arbitrary camera positions, without using depth information or feature matching, is light field rendering [4]. This technique generates novel views by combining and resampling the input images. The light field is defined as the radiance at a point in a given direction, which is equivalent to the plenoptic function as described in [14]. To measure the plenoptic function one can imagine an eye at every possible position (Vx, Vy, Vz) recording the intensity of the light rays passing through that point at every possible angle (Θ, φ), for every wavelength λ, at every time t. The resulting plenoptic function can be described as a 7D function:

P = P (Vx, Vy, Vz, Θ, φ, λ, t) (2.1)

In [4] the plenoptic function is reduced to 4D to simplify the representation and make the calculation more efficient. To generate a new image from a light field is quite different from other interpolation approaches. The proposed solution is to parameterize lines by their intersections with two planes. The coordinate system on the first plane is defined as (u, v) and on the second plane (s, t). A line is then defined by connecting points on both planes. This representation is called a light slab. For a virtual environment, a light slab can be created by rendering a 2D array of images, where each image represents a slice of the 4D light slab. Building a light field for a real scene requires a very large amount of images to obtain convincing rendering results, and a couple of other constraints must also be considered, such as controlled lighting, angle of view, focal length, aperture etc.

Figure 2.5: The light slab representation and two visualizations of a light field (Image from [4])

2.6 Microfacet Billboarding

As with most view interpolation techniques, the issue with billboarding is how to render objects realistically when their geometry cannot be completely acquired. In [5], Yamasaki et al. propose a technique that is advantageous for rendering intricately shaped geometry. The approach can be divided into two major steps, modeling and rendering. In the first step, the surface model is acquired and resampled into a set of voxels. A set of range images is also generated at all camera positions where colour images are taken; these are used for texture clipping in the rendering process. A microfacet billboard is a single geometric primitive: a rectangle parallel to the image plane having the same center as the voxel. In the second step, the actual rendering, the object is rendered as a set of microfacets with colour texture. View-dependent microfacets are generated first, and the colour image of the object is then mapped onto each of them. Since the facets remain perpendicular to the viewing direction, microfacet billboarding can render objects that have detailed shapes, such as trees and fur.

Figure 2.6: Frontal and side view of microfacets (Image from [5])

Another microfacet billboarding approach is presented in [15], but in contrast to [5], only the streamed data from the cameras, together with a voxel model of the visual hull, is used as input. The visual hull serves as a geometric model of the foreground objects in the scene. For each occupied voxel a billboard is rendered, and since the coordinates of the billboard's corners in 3D space are known, their locations in the camera views can be computed and used to texture the billboards.

2.7 3DTV

A project called 3DTV is presented in [6]. It is a system for real-time acquisition, transmission and 3D display of dynamic scenes. The system consists of an array of 16 cameras, a cluster of network-connected PCs, and a multi-projector 3D display. High-resolution (1024 x 768) stereoscopic images are displayed for multiple viewpoints without special glasses. Calibration of the cameras and image alignment are necessary to achieve good image quality. The system has enough views and resolution to provide an immersive 3D experience; however, the cost of this kind of set-up is very high.

Figure 2.7: 3DTV (Image from [6])

2.8 Optical flow

Motion is fundamental in all dynamic scenes, and the optical flow in a scene describes how objects in the scene move from frame to frame when captured by a video camera. Optical flow is a 2D projection of the 3D motion of the real world and represents the movement of brightness patterns in an image. This provides information about the spatial position of objects and their rate of change, which is a helpful and convenient representation of image motion. Just as optical flow describes the 2D motion of the points in an image, the scene flow is defined as the 3D motion of points in the real world [16]. It can be represented as a 3D vector field defined for each point on all surfaces in the scene.

Figure 2.8: Scene flow computed from two input images at different times (Image from [7])

In [7], an image-based approach is proposed to model dynamic events by capturing images of the event from many different viewpoints. The scene can then be recreated from arbitrary viewpoints by creating a true dynamic model of the event, which is called Spatio-Temporal View Interpolation. First, geometric information is required; in this case an extrinsic 3D model is used to understand how the geometry changes over time. At each time instant, the shape of the scene is estimated as a 3D voxel model using the "voxel colouring" model [17]. When it is known how the shapes in the scene move, the scene flow can be calculated. A motion vector for every voxel in the scene is calculated, which results in a 4D model of the scene. The 4D model allows geometric information to be calculated as a continuous function. By using the scene flow and the camera parameters, the algorithm then estimates the interpolated scene for any position and time.

2.9 Tour into Video

Tour into picture (TIP) is a method for making animations from a single 2D picture or photograph of a scene and was first proposed by Horry et al. [18]. TIP can provide convincing 3D effects by constructing a 3D model from the image and generating a sequence of walk-through images. The technique has also been extended to handle video sequences in [8]. However, the technique for TIP cannot be applied directly to video sequences, since TIP only handles still images, so the approach is generalized somewhat. A couple of assumptions are made: first, the input video is composed of a continuous sequence of the scene; second, there cannot be too much motion parallax in the scene; third, the terrain ground in the scene has to be smooth, so that it can be modeled as a plane. The tour into video algorithm is divided into a couple of steps, where the first step is to generate a background image which covers all the regions viewed over the entire sequence. This includes the background and all static foreground objects. For the dynamic foreground objects, a gradient map is used to extract boundary information. Given the background and foreground information, a 3D model is constructed, where the background model is based on a vanishing point circle. All the foreground objects are then modeled as polygons in 3D space and are attached to the background model. When the background and foreground models are connected, the scene can be navigated, and images from novel views can be generated.

Figure 2.9: Background model constructed from a video sequence (Image from [8])

3 Theoretical background

The process of generating novel views from a set of different cameras can be approached in a number of ways. Video-based rendering (VBR) is a sub-category of image-based modeling and rendering (IBMR), and combines many different research disciplines in computer vision [10], computer graphics and video processing. This chapter covers the background theory of video-based rendering.

3.1 Image-based modeling and rendering

Traditional 3D computer graphics is a 3D-to-2D process. A 3D model of a scene is created, animated and rendered into a 2D image. This way, full control over how the scene should look can be achieved. In the last decade, a new field between computer graphics and computer vision has emerged, called image-based modeling and rendering (IBMR). In this field, computer vision techniques are utilized to render graphics directly from images. Instead of going through the 3D modeling process, IBMR starts with 2D images, calculates the underlying 3D structure of the scene and then renders new views of the scene as 2D images. It is thus a 2D-to-2D process with some knowledge of the 3D structure.

Figure 3.1: Traditional IBMR approach

3.1.1 Photography vs. computer graphics

Photographic techniques offer a very fast and inexpensive way to visualize the natural impression of real-world scenes. Photos can be displayed almost immediately, but the scene can only be displayed as recorded; there is no possibility to alter the scene or use new viewpoints of the scene. In contrast, computer graphics allows the scene and the viewpoints to be manipulated arbitrarily. However, this flexibility requires much effort in modeling the scene in terms of geometry, surface reflectance properties, illumination and so on. Despite this effort, computer graphics images may still have a somewhat artificial flavour.

Figure 3.2: Photography vs. computer graphics

The ideal approach would be to combine the best of both worlds, photography and computer graphics; that is, to combine the ease of photographic acquisition with the flexibility of computer graphics. Various image-based modeling and rendering techniques have been developed toward this goal. Based on conventional photos, a description of the scene geometry is derived.

3.2 Video-based rendering

Until recently, research in image-based rendering and modeling has mostly focused on static scenes. The advantage of scenes without dynamic content is that the whole scene can be captured using a single still camera, which can be moved from one viewpoint to another. When extending this to a dynamic scene, many video cameras are needed to record all the data necessary to render the scene from novel viewpoints. The ultimate objective in VBR is to photorealistically render arbitrary views of dynamic events at interactive frame rates. The acquisition of input data is crucial, as it determines the quality of the final rendering; if the input data is not optimal, the output will not be satisfactory. Each camera also generates a huge amount of data that must be processed online or stored on mass storage media for offline processing. This multi-camera acquisition puts high demands on the hardware and the cameras.

3.2.1 VBR for television

Traditional television images are displayed on the screen using a technology known as interlaced scan. Basically, there are two leading interlaced scan systems in use in the world today, NTSC and PAL.

NTSC is based on a 525-line, 30 frames-per-second system. In this interlaced system each frame is scanned in two fields of 262 lines, which are then combined to display a frame with 525 scan lines at 60 Hz. PAL is also interlaced into two fields of 312 lines each, giving an interlaced video image with 625 lines and 25 frames per second at a rate of 50 Hz. PAL has a slightly better image quality than NTSC because of the increased number of scan lines. When the use of computers increased, it was discovered that using interlaced television images for display on computers did not produce good results. To improve this, progressive scan was developed. In contrast to the interlacing methods, progressive scan displays the image on the screen by scanning each line as a row of pixels. So, instead of scanning the lines in alternate order (1, 3, 5, ...) and displaying this interlaced with a scan of the lines (2, 4, 6, ...) every 1/30 second, progressive scan displays every line in the image every 1/60 second.

This is the reason why it is not a good idea to use regular TV cameras for VBR acquisition: they generate interlaced images, and to get all the information needed for each frame the camera must be able to record in progressive scan. If regular TV cameras are used, important properties are a sufficient frame rate and that the pixels can be accessed without being preprocessed internally by the camera hardware. The set-up used in video view interpolation [3] consists of 8 cameras that generate 1024 x 768 pixel colour images at 15 frames per second in progressive scan. To handle real-time storage of the videos, two concentrators synchronize the cameras, and the uncompressed video streams are streamed via fiber optic cables to a bank of hard disks. Almost all cameras used for VBR today use CCD or CMOS imaging chips. However, the limitation in the acquisition process is not really the cameras, but the amount of data that needs to be processed simultaneously.
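As a rough back-of-the-envelope figure (assuming uncompressed 24-bit RGB frames, which is an assumption made here for illustration): one 1024 x 768 camera at 15 progressive frames per second produces about 1024 * 768 * 3 * 15 ≈ 35 MB of raw image data per second, so a rig of eight such cameras delivers close to 300 MB/s that must be synchronized, streamed and either processed on the fly or written to disk. This illustrates why the data path, rather than the cameras themselves, tends to be the bottleneck.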

Other factors to consider for camera selection include frame rate (fast motion vs slow motion), lens (zoom in/out, wide angle vs lens distortion), shutter speed adjustments to avoid motion blur, and the dynamic range of the camera with respect to the lighting of the scene.

3.3 Acquisition

A good acquisition system is especially important in image- and video-based rendering. By using high-quality videos, a better final output can be produced, with fewer artifacts. Building a good acquisition system involves choosing the right cameras and designing the capture infrastructure. Besides choosing a suitable camera, a good infrastructure is also important.

3.3.1 Camera setup

When recording a scene, it is crucial for the final result that the cameras are mounted in optimal positions, so that the geometry of the scene can be recovered as accurately as possible. There are two different VBR acquisition techniques to consider, the small-baseline and the wide-baseline camera setup [11].

Small-baseline setup

When capturing light field data of a scene, the cameras are generally arranged in a grid with equal distances between them, which makes the interpolation between images much easier. If irregular camera placement is employed, it is necessary to calculate scene depth information to correct for parallax effects in the recorded images. Matusik's 3D TV system [6] and the Stanford camera array [19] are examples of setups that use this approach.

Wide-baseline setup

When recording a scene from only a few viewpoints spaced far apart, it is important that the camera positions are chosen carefully, because it is crucial that the acquired video contains as much relevant information as possible. One big issue with the wide-baseline setup is that it is almost impossible to obtain sufficient depth maps of the scene. 3D Dome [20], Visual Hulls [1] and Virtualized Reality [13] are a couple of projects that use the wide-baseline setup.

Mobile setup

Many of the systems mentioned earlier have shown very promising results, but they all have the constraint that they need to be shot in a studio or in a special scene. To be able to record dynamic events using multiple cameras in other environments, it is necessary to use a system that is portable and can be set up anywhere without too much effort. An example of a mobile camera set-up system is described in [9]. Each unit in the system consists of a laptop, a camera, a tripod and a battery pack, and is controlled and synchronized via a wireless LAN. For synchronized recording all cameras must be triggered at the same time, which is done by sending a triggering message to all modules when the recording begins. Starting all modules at exactly the same time is critical, but when recording at 15 fps the error is tolerable. The data is stored in local memory because the bandwidth to the laptop's hard drive is not sufficient to store the data directly; 1 GB of RAM is enough to capture 80 seconds of 640 x 480 resolution video at 15 fps. The videos are then downloaded from the laptops and can be processed offline.

Figure 3.3: Mobile setup where each module consists of a FireWire camera connected to a laptop (Image from [9])

3.3.2 Illumination

Although scene lighting is an important aspect when setting up the recording of a movie scene, there has not been much focus on this subject in VBR research. This might seem a little strange, considering the strict acquisition requirements that VBR imposes to get optimal results. When the recorded video is used as texture, the scene can be re-displayed from arbitrary viewpoints; the scene illumination, however, remains fixed to the conditions present during recording. Ways to additionally recover surface reflectance properties, and possibly also scene illumination, are beyond the scope of this report.

3.3.3 Network infrastructure

When recording with many cameras there are large amounts of image data that need to be streamed, and this requires large bandwidth. To resolve this, multiple computers may be connected in a network. The most demanding VBR applications are those that use online processing, where video is processed on the fly for instant display. In this approach, the synchronized video cameras are connected to and controlled by client PCs, and the data from the client PCs is sent to a display host PC which processes the combined data and renders the output image. The big issue with on-the-fly processing is the computational power of the display host; because of this, as much computational work as possible must be done by the client PCs. Another way to make the network more efficient is to send only those images from the respective cameras that are necessary to perform the target view calculations. VBR applications that store the images for later processing are less demanding than on-the-fly processing, but there are still important issues. Perhaps the most challenging problem is the bandwidth of the computer's internal bus and the storage of uncompressed high-resolution image data.

3.4 Geometric calibration

To recover scene geometry from different video images, some kind of geometric calibration is required. When the cameras used are calibrated, the mapping between image coordinates and directions relative to each camera is known. This mapping is determined by, among other parameters, the camera's focal length and its radial distortion. Calibrating a camera consists of determining the transformation which maps 3D points of a certain scene or object into their corresponding 2D projections onto the image plane of the camera [12]. This transformation depends on two sets of parameters, extrinsic and intrinsic parameters. The extrinsic (or external) parameters define the camera's position and orientation in space with respect to a given reference system. The intrinsic (or internal) parameters describe the characteristics of the camera, i.e. the image formation process through its optical system. To calibrate a multi-camera setup it is necessary to record one or more sequences prior to the actual recording. The literature on geometric camera calibration offers a number of algorithms; one method that provides good calibration results was proposed by Zhang [21]. The calibration procedure is, however, somewhat different depending on the baseline setup.
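As an illustration only (the thesis does not prescribe a particular toolkit), the sketch below shows a Zhang-style planar calibration using OpenCV, which estimates the intrinsic parameters and per-view extrinsics from several images of a checkerboard. Board size, square size and file paths are placeholders.

    #include <opencv2/calib3d.hpp>
    #include <opencv2/imgproc.hpp>
    #include <opencv2/imgcodecs.hpp>
    #include <vector>
    #include <string>

    // Estimate intrinsic parameters from several views of a planar checkerboard,
    // in the spirit of Zhang's method. Paths and board geometry are placeholders.
    void calibrateFromCheckerboards(const std::vector<std::string>& imageFiles,
                                    cv::Size boardCorners,   // e.g. {9, 6} inner corners
                                    float squareSize)        // side length in metres
    {
        std::vector<std::vector<cv::Point3f>> objectPoints;
        std::vector<std::vector<cv::Point2f>> imagePoints;

        // Known 3D corner positions on the planar calibration target (z = 0).
        std::vector<cv::Point3f> board;
        for (int y = 0; y < boardCorners.height; ++y)
            for (int x = 0; x < boardCorners.width; ++x)
                board.emplace_back(x * squareSize, y * squareSize, 0.0f);

        cv::Size imageSize;
        for (const auto& file : imageFiles) {
            cv::Mat img = cv::imread(file, cv::IMREAD_GRAYSCALE);
            if (img.empty()) continue;
            imageSize = img.size();
            std::vector<cv::Point2f> corners;
            if (cv::findChessboardCorners(img, boardCorners, corners)) {
                cv::cornerSubPix(img, corners, cv::Size(11, 11), cv::Size(-1, -1),
                                 cv::TermCriteria(cv::TermCriteria::EPS + cv::TermCriteria::COUNT, 30, 0.01));
                imagePoints.push_back(corners);
                objectPoints.push_back(board);
            }
        }

        cv::Mat K, distCoeffs;                // intrinsics and lens distortion
        std::vector<cv::Mat> rvecs, tvecs;    // per-view extrinsics (rotation as Rodrigues vector, translation)
        double rms = cv::calibrateCamera(objectPoints, imagePoints, imageSize,
                                         K, distCoeffs, rvecs, tvecs);
        (void)rms;  // reprojection error in pixels; useful as a quality check
    }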

3.4.1 Small-baseline calibration

When using a small-baseline setup, the extrinsic and intrinsic parameters can be determined by recording only one sequence, because the fields of view of the different cameras overlap. The calibration object provides feature points which are easy to detect, and whose 3D positions are exactly known with respect to each other. Two important issues need to be considered: first, the calibration object must cover most of the field of view of the cameras to get accurate calibration results; second, at least two of the points of the calibration object must not be coplanar.

3.4.2 Wide-baseline calibration

Here the intrinsic and extrinsic parameters are determined separately, since it is not practical to use a calibration object that is both visible in all cameras and fills most of the field of view of every camera. Instead, intrinsic calibration is done for one camera at a time, using a calibration object that fills the full field of view of that camera. The extrinsic parameters can then be estimated using another calibration object that is placed in the scene and is visible to all cameras. Knowing the intrinsic parameters, the position and orientation of the cameras can be calculated.

3.5 Colour calibration

When matching images from different cameras it is often necessary to match the colours between the cameras. One common approach to this problem is to calibrate each camera independently through comparisons with known colours on a colour calibration object. The colour spaces of the different cameras can then be matched to a standard colour space, or to a colour space derived from the scene [22].

3.6 Camera representation

To understand the mapping between object space and image space, the model of the camera should be considered. The pinhole, or perspective, camera model is the simplest, idealized model of camera function. Such a camera maps a region of R3, i.e. the scene in front of the camera, onto the image plane R2.

Figure 3.4: Pinhole camera model

The projection from object to image space is a projective mapping by a 3x4 matrix P of rank 3, known as the camera matrix or perspective projection matrix. If a point in object space is represented by x = (x, y, z, t)T, and a point in image space is represented by u = (u, v, w)T, then the camera matrix transforms points in object space to points in image space according to

u = Px (3.1)

The perspective projection matrix can be decomposed as P = K[R|t] where:

• K is the 3x3 camera calibration matrix, which depends on the internal parameters of the camera, such as the focal length

• [R|t] is the 3x4 external parameters matrix, and corresponds to the Euclidean transformation from a world coordinate system to the camera coordinate system: R represents a 3x3 rotation matrix, and t a translation

3.6.1 Camera calibration matrix

The camera calibration matrix, K, contains the intrinsic camera parameters and can be written as:

K = \begin{pmatrix} \alpha_u & s & u_0 \\ 0 & \alpha_v & v_0 \\ 0 & 0 & 1 \end{pmatrix}

where

• αu and αv are the scale factors in the u- and v-coordinate directions. They are proportional to the focal length f of the camera: αu = ku f and αv = kv f, where ku and kv are the number of pixels per unit distance in the u and v directions

• c = [u0, v0]T represents the image coordinates of the intersection of the optical axis and the image plane, also known as the principal point

• s is the skew of the camera. The skew is zero as long as the u and v directions are perpendicular

The principal point, c, is often at the image center, and if the pixels are square, αu and αv can be set to be equal.
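A minimal numeric sketch of this projection, u = K Xcam followed by the perspective divide, is given below; all parameter values are made up for illustration and assume zero skew and square pixels.

    #include <array>
    #include <cstdio>

    using Vec3 = std::array<double, 3>;

    // Intrinsic parameters as in the calibration matrix K.
    struct Intrinsics { double au, av, s, u0, v0; };

    // Project a point given in the camera coordinate system (i.e. after applying
    // [R|t]) to pixel coordinates: u = K * Xcam, followed by the perspective divide.
    static std::array<double, 2> project(const Intrinsics& k, const Vec3& Xcam) {
        double u = k.au * Xcam[0] + k.s * Xcam[1] + k.u0 * Xcam[2];
        double v =                  k.av * Xcam[1] + k.v0 * Xcam[2];
        double w =                                          Xcam[2];
        return { u / w, v / w };
    }

    int main() {
        // Made-up values: square pixels (au = av), zero skew, principal point at
        // the centre of a 1024 x 768 image.
        Intrinsics k{ 1200.0, 1200.0, 0.0, 512.0, 384.0 };
        Vec3 Xcam{ 0.2, -0.1, 2.0 };   // a point two metres in front of the camera
        auto px = project(k, Xcam);
        std::printf("pixel: (%.1f, %.1f)\n", px[0], px[1]);   // prints (632.0, 324.0)
    }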

3.6.2 The external parameters matrix

The orientation and position of the camera are defined by the 3x4 external parameters matrix [R|t]. It consists of a rotation matrix R and a translation vector t. The rotation matrix R can be written as the product of three matrices representing rotations around the X, Y and Z axes. For example, taking α, β and γ to be rotation angles around the Z, Y and X axes respectively yields:

R_Z = \begin{pmatrix} \cos\alpha & -\sin\alpha & 0 \\ \sin\alpha & \cos\alpha & 0 \\ 0 & 0 & 1 \end{pmatrix}   (3.2)

R_Y = \begin{pmatrix} \cos\beta & 0 & \sin\beta \\ 0 & 1 & 0 \\ -\sin\beta & 0 & \cos\beta \end{pmatrix}   (3.3)

R_X = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\gamma & -\sin\gamma \\ 0 & \sin\gamma & \cos\gamma \end{pmatrix}   (3.4)
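The thesis does not fix the multiplication order of the three rotations; assuming the common convention R = RZ RY RX, a small helper that composes them from equations (3.2)-(3.4) could look as follows (angles in radians, illustration only).

    #include <array>
    #include <cmath>

    using Mat3 = std::array<std::array<double, 3>, 3>;

    static Mat3 mul(const Mat3& A, const Mat3& B) {
        Mat3 C{};
        for (int i = 0; i < 3; ++i)
            for (int j = 0; j < 3; ++j)
                for (int k = 0; k < 3; ++k)
                    C[i][j] += A[i][k] * B[k][j];
        return C;
    }

    // Rotation matrix R = RZ(alpha) * RY(beta) * RX(gamma); the three factors
    // match equations (3.2)-(3.4) above.
    Mat3 rotationZYX(double alpha, double beta, double gamma) {
        Mat3 RZ{{ { std::cos(alpha), -std::sin(alpha), 0.0 },
                  { std::sin(alpha),  std::cos(alpha), 0.0 },
                  { 0.0, 0.0, 1.0 } }};
        Mat3 RY{{ {  std::cos(beta), 0.0, std::sin(beta) },
                  {  0.0, 1.0, 0.0 },
                  { -std::sin(beta), 0.0, std::cos(beta) } }};
        Mat3 RX{{ { 1.0, 0.0, 0.0 },
                  { 0.0, std::cos(gamma), -std::sin(gamma) },
                  { 0.0, std::sin(gamma),  std::cos(gamma) } }};
        return mul(mul(RZ, RY), RX);
    }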

3.6.3 Fundamental matrix

Consider a scene which is captured by two cameras. There exists a mapping from points in the first image to epipolar lines in the second image. To determine the two-dimensional projective transformation that relates the two stereo images, the fundamental matrix F is used. The fundamental matrix is a 3x3 rank-2 matrix that maps points in the first image to lines in the second image, and points in the second image to lines in the first image. If m is a point in the first image I, then l' = Fm is the corresponding epipolar line in the second image I'. The fundamental matrix is defined by the equation

x_i'^T F x_i = 0   (3.5)

for any pair of corresponding points x_i and x_i' in the two images.
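A small sketch of how this constraint is used in practice is shown below: the epipolar line l' = Fx is computed, and the distance of a candidate point x' to that line serves as a correspondence residual. The fundamental matrix and the two points are placeholders, not values from the thesis.

    #include <array>
    #include <cmath>
    #include <cstdio>

    using Vec3 = std::array<double, 3>;
    using Mat3 = std::array<Vec3, 3>;

    static Vec3 mul(const Mat3& F, const Vec3& x) {
        return { F[0][0]*x[0] + F[0][1]*x[1] + F[0][2]*x[2],
                 F[1][0]*x[0] + F[1][1]*x[1] + F[1][2]*x[2],
                 F[2][0]*x[0] + F[2][1]*x[1] + F[2][2]*x[2] };
    }

    int main() {
        // Placeholder rank-2 fundamental matrix and a pair of candidate
        // correspondences in homogeneous pixel coordinates (u, v, 1).
        Mat3 F = {{ {  0.0,  -1e-6,  1e-3 },
                    {  1e-6,  0.0,  -2e-3 },
                    { -1e-3,  2e-3,  0.0 } }};
        Vec3 x  = { 320.0, 240.0, 1.0 };   // point in the first image
        Vec3 xp = { 331.0, 240.0, 1.0 };   // candidate point in the second image

        Vec3 l = mul(F, x);                // epipolar line l' = F x in image two
        // Signed distance of x' to l' (zero for a perfect correspondence).
        double d = (xp[0]*l[0] + xp[1]*l[1] + xp[2]*l[2])
                 / std::sqrt(l[0]*l[0] + l[1]*l[1]);
        std::printf("epipolar residual: %f pixels\n", d);
    }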

3.7 Stereo correspondence

Stereo matching has long been one of the most central problems in computer vision. Fusion of the pictures recorded by our two eyes and exploiting the difference gives us a strong sense of depth.

Given two images recorded from slightly different views, it is often desirable (for many applications) to recover the disparities between them. Knowing the disparities between objects provides geometric information about the scene. The accuracy of the disparities is very important in many applications, especially in view synthesis and image-based rendering, which require disparity estimates in all regions of the image, even those that are partially occluded or untextured. Many different stereo algorithms exist today, and several problems need to be addressed. The first problem encountered is how to simultaneously obtain sharp depth discontinuities at object boundaries and smooth depth in regions that are textureless; the second is how to deal with occlusions, and the third is how to recover the depth of thin structures.

Any stereo algorithm makes assumptions about the physical world, for example in how it measures matches between points in two images, i.e. whether they are projections of the same point in the scene. Most algorithms assume that the surfaces in the scene are Lambertian, or purely diffuse, i.e. that surface appearance does not vary with the viewpoint. A further assumption about the scene is that surfaces are piecewise smooth, so most algorithms have built-in smoothness assumptions. Without such assumptions, the stereo correspondence problem would be ill-posed or underconstrained. Most algorithms also make assumptions about camera calibration and epipolar geometry.

A taxonomy of dense, two-frame stereo methods is presented by Scharstein and Szeliski [23]. This taxonomy compares existing stereo methods and evaluates the performance of many different variants. Most intensity-based stereo algorithms are based on the following steps:

• Per-pixel matching cost computation
• Aggregation of the matching costs
• Disparity computation/optimization

3.7.1 Disparity representation

Disparity was first introduced in human vision to describe the difference in location of corresponding features seen by the left and right eye. In computer vision, disparity is often referred to as inverse depth. Assuming two rectified stereo images are used as input, the disparity map is a representation of the cost of matching corresponding pixels under a certain hypothesis. It is represented as an image (x, y, d) where d is the disparity. Objects that have high disparity, i.e. are close to the camera, have high values, and background objects that are far away from the camera have lower values. For two images whose planes are aligned in parallel, disparity is the difference in pixel coordinates for the same point in 3D. The amount of disparity is directly proportional to the baseline length and inversely proportional to the distance from the camera pair. Through camera calibration, the baseline, the camera imaging characteristics and also the epipolar geometry are known.

To compensate disparity for many cameras, it is convenient to first determine the distance of the 3D scene point to the cameras. Per-pixel scene depth is estimated by turning disparity compensation around: for each pixel in the left image, the corresponding pixel (along the epipolar line) in the right image is found. The difference in image coordinates is the point's disparity, and via camera calibration the depth is derived. Ideally, there are a couple of desirable properties when calculating depth maps:

• Dense depth maps
• Robust
• Smooth
• Discontinuities preserved
• Globally consistent
• All images considered equally

3.7.2 Multi-Baseline stereo

Consider a multi-camera imaging system in which the image planes of the cameras lie on the same plane, and where the cameras have the same focal length F. For any pair of the cameras, the disparity d and the distance to the scene point z are related by

d = BF \frac{1}{z}   (3.6)

where B is the baseline, i.e. the distance between the two camera centers. The precision of the estimated distance increases as the baseline between the cameras increases. However, increasing the baseline also increases the likelihood of mismatching points between the images. There is thus a trade-off between finding correspondences between images (favouring a small baseline) and estimating depth precisely (favouring a wide baseline).
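A tiny numerical illustration of equation (3.6), with a made-up 10 cm baseline and a focal length of 1200 pixels, shows how quickly depth resolution degrades with distance:

    #include <cstdio>
    #include <initializer_list>

    // Depth from disparity for a rectified pair, z = B * F / d (cf. equation 3.6).
    // B in metres, F and d in pixels; the values below are made up for illustration.
    int main() {
        const double B = 0.10;     // 10 cm baseline
        const double F = 1200.0;   // focal length in pixels
        for (double d : { 60.0, 30.0, 15.0 }) {
            double z = B * F / d;
            std::printf("disparity %5.1f px  ->  depth %.1f m\n", d, z);
        }
        // Halving the disparity doubles the estimated depth, so a one-pixel
        // matching error matters far more for distant points than for near ones.
    }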

3.7.3 Matching cost calculation

Different approaches can be used to evaluate the cost of matching two pixels. The most common are squared intensity differences (SD) and absolute intensity differences (AD).

Using the disparity hypothesis d, the pixel (p, q) in the left image corresponds to the pixel (p - d, q) in the right image. The calculated costs for different pixels express how well the pixels correspond to the disparity hypothesis, and the value at location (p, q, d) holds the cost of matching pixel (p, q) in the left image with pixel (p - d, q) in the right image.
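A minimal sketch of the absolute-difference cost for one disparity hypothesis, on rectified grayscale images stored row major, might look as follows (the image layout and the out-of-range cost are assumptions made for the example):

    #include <vector>
    #include <cstdlib>
    #include <cstdint>
    #include <cstddef>

    // Absolute-difference (AD) matching cost for a single disparity hypothesis d:
    // cost(p, q, d) = | left(p, q) - right(p - d, q) | on rectified grayscale images.
    std::vector<float> matchingCostAD(const std::vector<std::uint8_t>& left,
                                      const std::vector<std::uint8_t>& right,
                                      int width, int height, int d,
                                      float outOfRangeCost = 255.0f)
    {
        std::vector<float> cost(static_cast<std::size_t>(width) * height, outOfRangeCost);
        for (int q = 0; q < height; ++q)
            for (int p = d; p < width; ++p) {
                int l = left[q * width + p];
                int r = right[q * width + (p - d)];
                cost[q * width + p] = static_cast<float>(std::abs(l - r));
            }
        return cost;
    }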

3.7.4 Matching cost aggregation

Local and window-based methods aggregate the matching cost by averaging over a support region in the disparity space. A support region can be either two-dimensional at a fixed disparity or three-dimensional. To keep computational costs low, most approaches use 2D support regions, especially 2D rectangular windows. The size and position of this window are important parameters. Generally, large support windows are good for removing noise, but they produce errors at depth discontinuities, where the pixels in the window have different depths. The same problem arises when using rectangular windows, which otherwise simplify the computation. To deal with this problem, windows with different sizes and shapes can be used, as in the graph cut algorithm [24]. In this approach, connected windows are computed at each disparity. These connected windows contain only the pixels for which that disparity is likely, and the disparity which gives the largest window is assigned to each pixel. This approach provides good results, but its computational cost is rather high.
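The simplest 2D aggregation, a fixed square window summed at a single disparity, can be sketched as below. This is the plain box-filter case discussed above, not the adaptive-window approach of [24].

    #include <vector>
    #include <algorithm>

    // Sum the per-pixel matching costs over a (2r+1) x (2r+1) square window at a
    // fixed disparity. Borders are handled by clamping; windows that adapt their
    // size or shape to depth discontinuities behave better near object edges.
    std::vector<float> aggregateBox(const std::vector<float>& cost,
                                    int width, int height, int r)
    {
        std::vector<float> out(cost.size(), 0.0f);
        for (int y = 0; y < height; ++y)
            for (int x = 0; x < width; ++x) {
                float sum = 0.0f;
                for (int dy = -r; dy <= r; ++dy)
                    for (int dx = -r; dx <= r; ++dx) {
                        int xx = std::clamp(x + dx, 0, width - 1);
                        int yy = std::clamp(y + dy, 0, height - 1);
                        sum += cost[yy * width + xx];
                    }
                out[y * width + x] = sum;
            }
        return out;
    }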

3.7.5 Disparity computation

The disparity computation step takes the 3D disparity space as input and searches for the best disparity value for each pixel. The most common methods for computing the final disparities are local and global methods. For the local methods, the disparities are computed by performing a "winner-take-all" (WTA) optimization: at each pixel, the disparity associated with the minimum cost value is simply chosen.

Local methods thus minimize the matching cost independently at each pixel. In contrast, global methods perform almost all of their work during the disparity computation phase and often skip the aggregation step. For global methods the objective is to minimize a global function which considers both the overall matching cost and the smoothness of the solution.
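A winner-take-all selection over a stack of (aggregated) cost slices, one slice per disparity hypothesis, can then be sketched as:

    #include <vector>
    #include <cstdint>
    #include <cstddef>
    #include <limits>

    // Winner-take-all: for every pixel, pick the disparity whose (aggregated)
    // matching cost is smallest. costVolume[d] holds the cost slice for disparity
    // hypothesis d, each slice stored row major as width * height floats.
    std::vector<std::uint8_t> winnerTakeAll(const std::vector<std::vector<float>>& costVolume,
                                            int width, int height)
    {
        std::vector<std::uint8_t> disparity(static_cast<std::size_t>(width) * height, 0);
        for (int i = 0; i < width * height; ++i) {
            float best = std::numeric_limits<float>::max();
            for (std::size_t d = 0; d < costVolume.size(); ++d)
                if (costVolume[d][i] < best) {
                    best = costVolume[d][i];
                    disparity[i] = static_cast<std::uint8_t>(d);   // keep the cheapest hypothesis
                }
        }
        return disparity;
    }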

3.7.6 Epipolar Geometry

The geometric information that relates two different viewpoints of the same scene can be described by epipolar geometry. The underlying principle is that of binocular vision. Given a single image, the three-dimensional location of any visible object point must lie on the straight line that passes through the center of projection and the image of the object point, see Figure 3.5. With triangulation, the intersection of two such lines generated from two independent images can be determined.


Figure 3.5: Triangulation in stereo images

To determine the scene position of an object point through triangulation, the image location of the object point in one image must be matched to the location of the same object point in the other image. Finding this correspondence, however, does not require a search through the whole image, but only along a line, called an epipolar line, due to the epipolar constraint, as can be seen in Figure 3.6.

The epipoles e and e' are the points where the line between the optical centers C and C', called the baseline, intersects the image planes. Thus the epipole of one camera is the image of the optical centre of the other camera. The epipolar plane is the plane defined by the scene point X and the optical centers C and C'. The epipolar line is then the line of intersection between the image plane and the epipolar plane. All epipolar lines in one image intersect at the epipole of the same camera. An epipolar line in one camera corresponds to a ray through the optical center and the image point in the other camera. Therefore, the point x in one image generates an epipolar line in the other image on which the corresponding point x' must lie.

3.7.7 Image rectification

Rectification is a classical problem of stereo vision. In order to reconstruct a scene, corresponding features in a stereo image pair are required. Given a feature in one image, the problem is to find the corresponding feature in the second image. Since the second feature is known to lie on the epipolar line defined by the first feature, one knows where to search: in order to find the corresponding point x' in the second image for a point x in the first image, a scan along the epipolar line l' is done.

Figure 3.6: Correspondence for x in the right image along the epipolar line l'

To make this search easier and faster, rectified images should be used. The rectification process is a two-dimensional projective transformation that puts the images on the same plane, horizontally aligned. This means that the epipolar line for a corresponding point will in fact be horizontal and at the same y-value. With an appropriate choice of coordinate system, the rectified epipolar lines are scanlines of the new images, and they are also parallel to the baseline. Disparities between the images will then only be in the x-direction. See Figure 3.7.


Figure 3.7: Rectification yields a horizontal epipolar line l’ with the same y-value as the corresponding pixel x

(36)

The ideal epipolar geometry is the one that will be produced by a pair of identical cameras placed side-by side with their principal axes parallel. Such a camera setup is called a rectilinear stereo rig. For other camera setups, where the cameras are placed in arbitrary positions, the epipolar geometry will be more complex. Many of the stereo algorithms described in literature have assumed a rectilinear stereo rig is used. This case makes the search for matching points much easier because of the simple epipolar structure. The theory of projective rectification, without use of camera matrices, is described in [25]. This algorithm performs rectification given a calibrated rig for which only point correspondence between images are given, hence for which the fundamental matrix can be computed. Another method of rectification is presented in [26].

The outline of the image rectification algorithm presented in [25] is as follows.

• Compute the fundamental matrix and the two epipoles e and e' (see 3.6) from the given camera calibration parameters

• Select a projective transformation H' that maps the epipole e' to the point at infinity

• Find the matching projective transformation H that minimizes the least-squares distance between the transformed corresponding points

• Use the projective transformation H to resample the first image, and the projective transformation H' to resample the second image
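To make the outline concrete, the sketch below shows how the same steps could be carried out with OpenCV in C++. This is only an illustration under the assumptions of the example (the image file names and the use of ORB feature matching to obtain point correspondences); it is not the rectification procedure used later in this work, which relies on the Matlab calibration toolbox.

    #include <opencv2/opencv.hpp>
    #include <vector>

    // Sketch: rectify an uncalibrated stereo pair from point correspondences,
    // following the outline above (F -> H, H' -> resampling).
    int main()
    {
        cv::Mat left  = cv::imread("left.png",  cv::IMREAD_GRAYSCALE);   // assumed file names
        cv::Mat right = cv::imread("right.png", cv::IMREAD_GRAYSCALE);

        // 1. Find point correspondences (ORB features, purely for illustration).
        auto orb = cv::ORB::create(2000);
        std::vector<cv::KeyPoint> kpL, kpR;
        cv::Mat descL, descR;
        orb->detectAndCompute(left,  cv::noArray(), kpL, descL);
        orb->detectAndCompute(right, cv::noArray(), kpR, descR);

        cv::BFMatcher matcher(cv::NORM_HAMMING, true);
        std::vector<cv::DMatch> matches;
        matcher.match(descL, descR, matches);

        std::vector<cv::Point2f> ptsL, ptsR;
        for (const auto& m : matches) {
            ptsL.push_back(kpL[m.queryIdx].pt);
            ptsR.push_back(kpR[m.trainIdx].pt);
        }

        // 2. Compute the fundamental matrix from the correspondences.
        cv::Mat F = cv::findFundamentalMat(ptsL, ptsR, cv::FM_RANSAC, 3.0, 0.99);

        // 3. Compute the rectifying homographies H and H' (map the epipoles to infinity).
        cv::Mat H1, H2;
        cv::stereoRectifyUncalibrated(ptsL, ptsR, F, left.size(), H1, H2);

        // 4. Resample both images with the homographies.
        cv::Mat rectL, rectR;
        cv::warpPerspective(left,  rectL, H1, left.size());
        cv::warpPerspective(right, rectR, H2, right.size());

        cv::imwrite("left_rect.png",  rectL);
        cv::imwrite("right_rect.png", rectR);
        return 0;
    }

After this step, corresponding points share the same scanline and the disparity search reduces to a one-dimensional scan.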

3.8 Segmentation

Image segmentation can be a helpful tool when calculating depth maps of a scene. The main idea of image segmentation is that regions with similar features in an image should be connected, and there are different methods available. The most common are edge detection, region growing and clustering techniques (using histogram thresholding). The first two often use syntactical methods, while clustering is based on statistical methods. Region growing is the most frequently used and can be divided into three sub-categories: local, global or split-and-merge. Local techniques measure the similarity between neighbouring pixels with respect to their gray-level or colour values. If the difference is below a specified threshold, the pixels are connected, and all connected pixels form an image region. The biggest benefit of local approaches is that they are simple and fast. However, a problem called chaining may arise, because the decision whether two pixels are connected depends only on pixels in a small neighbourhood. This can lead to pixels that are not very similar being connected through a chain and incorrectly merged into one region.
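A minimal sketch of such a local region-growing scheme is given below, operating on a gray-level image stored as a flat array. The data layout, the four-connected neighbourhood and the threshold are assumptions made only for this illustration.

    #include <cstdint>
    #include <cstdlib>
    #include <queue>
    #include <utility>
    #include <vector>

    // Sketch of local region growing on a gray-level image: neighbouring pixels are
    // connected if their gray-level difference is below a threshold; connected
    // pixels form one region (one integer label per pixel in the result).
    std::vector<int> regionGrow(const std::vector<uint8_t>& img,
                                int width, int height, int threshold)
    {
        std::vector<int> label(img.size(), -1);
        int nextLabel = 0;
        const int dx[4] = {1, -1, 0, 0};
        const int dy[4] = {0, 0, 1, -1};

        for (int y = 0; y < height; ++y) {
            for (int x = 0; x < width; ++x) {
                if (label[y * width + x] != -1) continue;   // already assigned
                // Start a new region from this seed pixel and grow it breadth-first.
                std::queue<std::pair<int, int>> q;
                q.push({x, y});
                label[y * width + x] = nextLabel;
                while (!q.empty()) {
                    auto [cx, cy] = q.front();
                    q.pop();
                    for (int k = 0; k < 4; ++k) {
                        int nx = cx + dx[k], ny = cy + dy[k];
                        if (nx < 0 || ny < 0 || nx >= width || ny >= height) continue;
                        if (label[ny * width + nx] != -1) continue;
                        int diff = std::abs(int(img[cy * width + cx]) -
                                            int(img[ny * width + nx]));
                        if (diff < threshold) {             // similar enough: connect
                            label[ny * width + nx] = nextLabel;
                            q.push({nx, ny});
                        }
                    }
                }
                ++nextLabel;
            }
        }
        return label;
    }

Because each connection decision only involves two neighbouring pixels, the chaining problem described above is visible directly in this sketch: a gradual intensity ramp can link very different pixels into one region.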

For some VBR approaches it is absolutely necessary to distinguish between foreground and background. Visual hulls [1], for example, require accurate image segmentation to achieve good results. Colour segmentation can be seen in Figure 3.8.


Figure 3.8: Typical example of colour segmentation

3.9 OpenGL geometry pipeline

It is important to have an overview of the different stages of vertex transformation in OpenGL. The pipeline is shown in Figure 3.9. Viewing, modeling and projection transformations are each specified by a 4x4 matrix M, which is multiplied with the homogeneous coordinates (x, y, z, w) of each vertex. The viewing and modeling transformations are combined to form the modelview matrix, which is applied to the incoming object coordinates to obtain eye coordinates. The next step is to apply the projection matrix to yield clip coordinates, which define a viewing volume. After that, the perspective division is performed by dividing the coordinates by w, giving normalized device coordinates. The final step is to convert these coordinates to window coordinates by applying a viewport transformation.

Figure 3.9: OpenGL geometry pipeline
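As an illustration of the pipeline in Figure 3.9, the sketch below carries a single vertex through the same chain of transformations on the CPU. The simple vector and matrix types, the identity matrices and the viewport size are assumptions made only for this example.

    #include <array>
    #include <cstdio>

    // Sketch of the OpenGL vertex transformation chain on the CPU.
    // Matrices are 4x4 and column-major as in OpenGL; Vec4 holds (x, y, z, w).
    using Vec4 = std::array<float, 4>;
    using Mat4 = std::array<float, 16>;

    Vec4 mul(const Mat4& m, const Vec4& v)
    {
        Vec4 r{};
        for (int row = 0; row < 4; ++row)
            for (int col = 0; col < 4; ++col)
                r[row] += m[col * 4 + row] * v[col];   // column-major storage
        return r;
    }

    int main()
    {
        // Identity matrices keep the sketch self-contained; a real application
        // would build these from camera pose and projection parameters.
        Mat4 modelview  = {1,0,0,0, 0,1,0,0, 0,0,1,0, 0,0,0,1};
        Mat4 projection = {1,0,0,0, 0,1,0,0, 0,0,1,0, 0,0,0,1};
        Vec4 object = {0.5f, -0.25f, 0.3f, 1.0f};      // object coordinates

        Vec4 eye  = mul(modelview, object);            // eye coordinates
        Vec4 clip = mul(projection, eye);              // clip coordinates

        // Perspective division gives normalized device coordinates in [-1, 1].
        Vec4 ndc = {clip[0] / clip[3], clip[1] / clip[3], clip[2] / clip[3], 1.0f};

        // Viewport transformation maps NDC to window coordinates.
        float vpX = 0, vpY = 0, vpW = 1280, vpH = 720; // assumed viewport
        float winX = vpX + (ndc[0] * 0.5f + 0.5f) * vpW;
        float winY = vpY + (ndc[1] * 0.5f + 0.5f) * vpH;
        float winZ = ndc[2] * 0.5f + 0.5f;             // default depth range [0, 1]

        std::printf("window coordinates: %f %f %f\n", winX, winY, winZ);
        return 0;
    }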

3.10 Hardware

Calculating depth via stereo correspondence is very computationally expensive. One approach is to utilize the latest off-the-shelf graphics hardware.


Recent advances in GPUs have made it possible to change the graphics pipeline programmatically through vertex and pixel shaders. With shaders, a GPU can be used for many different tasks beyond what it was originally intended for. Using the GPU for such calculations has some advantages over using the CPU. First, the GPU processes data in parallel, which is very efficient when the same operations are to be applied to large amounts of data (such as images). Furthermore, by doing computations on the GPU, the CPU is offloaded and left free to perform other tasks.

3.11 GPU programming and GLSL

The wide deployment of GPUs over the last several years has resulted in an increase in experimental research with graphics hardware, and has also opened the door to new real-time applications in many areas. The OpenGL Shading Language, or GLSL for short, is a programming language that complements the OpenGL standard API. GLSL is part of the core OpenGL 2.0 specification and makes writing shaders that change the behaviour of the graphics pipeline rather straightforward.
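As a minimal illustration of how a GLSL shader is attached to an OpenGL 2.0 application, the sketch below compiles a fragment shader that simply inverts the colours of a texture. The choice of extension loader, the shader source and the assumption that a valid OpenGL context and texture already exist are all made only for this example.

    #include <GL/glew.h>   // assumed extension loader; any loader exposing OpenGL 2.0 works
    #include <cstdio>

    // Minimal GLSL fragment shader: samples a texture and inverts its colour.
    static const char* kFragmentSrc =
        "uniform sampler2D image;                           \n"
        "void main()                                        \n"
        "{                                                  \n"
        "    vec4 c = texture2D(image, gl_TexCoord[0].st);  \n"
        "    gl_FragColor = vec4(1.0 - c.rgb, 1.0);         \n"
        "}                                                  \n";

    // Compiles the shader and links it into a program object.
    // Assumes a valid OpenGL >= 2.0 context has already been created.
    GLuint buildInvertProgram()
    {
        GLuint shader = glCreateShader(GL_FRAGMENT_SHADER);
        glShaderSource(shader, 1, &kFragmentSrc, nullptr);
        glCompileShader(shader);

        GLint ok = GL_FALSE;
        glGetShaderiv(shader, GL_COMPILE_STATUS, &ok);
        if (!ok) {
            char log[1024];
            glGetShaderInfoLog(shader, sizeof(log), nullptr, log);
            std::fprintf(stderr, "shader compile error: %s\n", log);
            return 0;
        }

        GLuint program = glCreateProgram();
        glAttachShader(program, shader);
        glLinkProgram(program);
        return program;
    }

    // Usage per frame: glUseProgram(program); draw a textured quad;
    // glUseProgram(0); returns to the fixed-function pipeline.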


4 Implementation

4.1 Method

Using an IBR camera system for live TV productions places great demands on the quality of the output. Using many cameras generates more data and often requires offline processing, but the final renderings are often better. The best solution with regard to the production pipeline would be to use existing cameras. However, to calculate depth maps of the scene, a stereo camera setup is necessary. When choosing a method for implementation, the following properties have been considered:

• Number of cameras needed
• Quality of the output
• Cost
• Robustness
• Speed of the system
• Online/offline processing

The most important property is the quality of the output, especially when the system is to be used for television broadcasting. Real-time processing is another important aspect. Unfortunately, the video-based rendering approaches that provide real-time renderings today do not produce results satisfactory enough for television broadcasting. When implementing the system, all properties have been taken into account and weighted against each other. Throughout this work, many different techniques across the wide range of video-based rendering have been considered. To be able to evaluate the pros and cons of the different approaches, Matlab has been used. This is a good way to get a deeper understanding of how different algorithms work in theory, and it also saves a lot of time compared to directly implementing everything in C++.


4.2 System overview

The system can be divided into two major parts: off-line and on-line processing. The off-line part consists of retrieving the camera parameters through camera calibration. The on-line part consists of depth map recovery, scene reconstruction and output rendering. The depth map calculation is the most computationally demanding part, and since one of the most important properties when designing the system is the ability to run in real time, it is desirable that this part is as optimized as possible. In contrast to many other approaches, where depth map calculation is done off-line using advanced and computationally heavy algorithms, the depth maps are here acquired on-line on the GPU.

A visualization of the complete system pipeline can be seen in Figure 4.1.

4.3 Camera calibration

The first step, which is done off-line, is to record the stereo images necessary for camera calibration. To obtain the correct parameters it is important that the stereo cameras are set up the same way as in the final video acquisition of the scene. By using a planar checkerboard as the known calibration object in the scene, all the required parameters can be retrieved, provided that a sufficient number of images is acquired. It is important that all squares on the checkerboard are visible, and that the checkerboard positions in the different images are not all coplanar. To compute the calibration parameters we used the camera calibration toolbox for Matlab [27]. This approach is based on the calibration method proposed by Zhang [21].
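For illustration only, the sketch below shows how an equivalent planar checkerboard calibration (Zhang's method) could be scripted in C++ with OpenCV instead of the Matlab toolbox. The file names, the 9x6 inner-corner pattern and the square size are assumptions made for the example.

    #include <opencv2/opencv.hpp>
    #include <cstdio>
    #include <string>
    #include <vector>

    // Illustrative sketch of planar checkerboard calibration, analogous to
    // what the Matlab calibration toolbox computes from the recorded images.
    int main()
    {
        const cv::Size patternSize(9, 6);     // inner corners per row/column (assumed)
        const float squareSize = 0.030f;      // checker square side in metres (assumed)

        std::vector<std::vector<cv::Point3f>> objectPoints;
        std::vector<std::vector<cv::Point2f>> imagePoints;
        cv::Size imageSize;

        // Reference 3D corner positions on the planar board (z = 0).
        std::vector<cv::Point3f> board;
        for (int y = 0; y < patternSize.height; ++y)
            for (int x = 0; x < patternSize.width; ++x)
                board.emplace_back(x * squareSize, y * squareSize, 0.0f);

        for (int i = 0; i < 18; ++i) {        // 18 calibration images, as in Figure 4.2
            cv::Mat img = cv::imread("calib_" + std::to_string(i) + ".png",
                                     cv::IMREAD_GRAYSCALE);
            if (img.empty()) continue;
            imageSize = img.size();

            std::vector<cv::Point2f> corners;
            if (cv::findChessboardCorners(img, patternSize, corners)) {
                cv::cornerSubPix(img, corners, cv::Size(11, 11), cv::Size(-1, -1),
                                 cv::TermCriteria(cv::TermCriteria::EPS +
                                                  cv::TermCriteria::COUNT, 30, 0.01));
                imagePoints.push_back(corners);
                objectPoints.push_back(board);
            }
        }

        // Intrinsic matrix, distortion coefficients and per-image extrinsics.
        cv::Mat cameraMatrix, distCoeffs;
        std::vector<cv::Mat> rvecs, tvecs;
        double rms = cv::calibrateCamera(objectPoints, imagePoints, imageSize,
                                         cameraMatrix, distCoeffs, rvecs, tvecs);
        std::printf("reprojection error: %f\n", rms);
        return 0;
    }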

4.4 Image rectification

Since all the recorded stereo images are intended for disparity map calculation and scene recovery, they need to be rectified in order to obtain satisfactory results. After the camera parameters have been retrieved, the rectification can be done in the Matlab calibration toolbox. The individual calibration parameters for each camera, and the images from the cameras, are used as input. Through stereo optimization, the intrinsic and extrinsic parameters of the cameras, together with their uncertainties, are recomputed to minimize the projection errors over all calibration grid locations. The spatial configuration of the two cameras and the calibration planes can be seen in Figure 4.3. Finally, the rectification of the images is done.

4.5 Real-time disparity image calculation

Most of the algorithms used for calculating precise disparity maps have a major disadvantage: they cannot be used for real-time applications. Some disparity map algorithms have obtained excellent results; however, these methods mostly use global optimization techniques which require a lot of computational power and cannot run in real time. Most real-time applications rely on local optimization


Figure 4.1: Overview of the complete system pipeline: camera calibration, image rectification, disparity image calculation (symmetric neighbourhood filtering, max gradient calculation and thresholding, disparity hypothesis calculation, aggregation, winner-take-all disparity calculation) and scene reconstruction (edge detection, quad grid mesh, geometry assemblage and rendering)
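The winner-take-all disparity stage shown in Figure 4.1 can be illustrated, in a simplified CPU form rather than the GPU implementation used in this work, by basic block matching along the rectified scanlines. The window size, disparity range and sum-of-absolute-differences cost are assumptions made only for this sketch.

    #include <cstdint>
    #include <cstdlib>
    #include <limits>
    #include <vector>

    // Simplified winner-take-all block matching on a rectified gray-level pair.
    // For each pixel in the left image, the disparity with the lowest SAD cost
    // over a small window along the same scanline of the right image wins.
    std::vector<int> blockMatchingDisparity(const std::vector<uint8_t>& left,
                                            const std::vector<uint8_t>& right,
                                            int width, int height,
                                            int maxDisparity = 64, int radius = 3)
    {
        std::vector<int> disparity(width * height, 0);

        for (int y = radius; y < height - radius; ++y) {
            for (int x = radius + maxDisparity; x < width - radius; ++x) {
                int bestDisp = 0;
                long bestCost = std::numeric_limits<long>::max();

                for (int d = 0; d < maxDisparity; ++d) {
                    long cost = 0;
                    // Sum of absolute differences over the aggregation window.
                    for (int wy = -radius; wy <= radius; ++wy)
                        for (int wx = -radius; wx <= radius; ++wx) {
                            int l = left [(y + wy) * width + (x + wx)];
                            int r = right[(y + wy) * width + (x + wx - d)];
                            cost += std::abs(l - r);
                        }
                    if (cost < bestCost) {      // winner take all: keep the cheapest
                        bestCost = cost;
                        bestDisp = d;
                    }
                }
                disparity[y * width + x] = bestDisp;
            }
        }
        return disparity;
    }

The purely local decision per pixel is what makes this kind of scheme suitable for a parallel, per-fragment GPU implementation.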


Figure 4.2: 18 checkerboard images used for camera calibration

References
