
Department of Science and Technology

Institutionen för teknik och naturvetenskap

Examensarbete (Thesis work)

LITH-ITN-MT-EX--04/016--SE

Matting of Natural Image Sequences using Bayesian Statistics

Fredrik Karlsson


Thesis work carried out in Media Technology at Linköpings Tekniska Högskola, Campus Norrköping

Fredrik Karlsson

Supervisor: Dr. Douglas Roble

Examiner: Prof. Anders Ynnerman


Report category: Examensarbete
Language: English
Title: Matting of Natural Image Sequences using Bayesian Statistics
Author: Fredrik Karlsson
ISRN: LITH-ITN-MT-EX--04/016--SE
Date: 2004-02-23
URL for electronic version: http://www.ep.liu.se/exjobb/itn/2004/mt/016
Division, Department: Institutionen för teknik och naturvetenskap (Department of Science and Technology)


Abstract

The problem of separating a non-rectangular foreground image from a background image is a classical problem in image processing and analysis, known as matting or keying. A common example is a film frame where an actor is extracted from the background to later be placed on a different background. Compositing of these objects against a new background is one of the most common operations in the creation of visual effects. When the original background is of non-constant color, the matting becomes an underdetermined problem, for which a unique solution cannot be found.

This thesis describes a framework for computing mattes from images with backgrounds of non-constant color, using Bayesian statistics. Foreground and background color distributions are modeled as oriented Gaussians and optimal color and opacity values are determined using a maximum a posteriori approach. Together with information from optical flow algorithms, the framework produces mattes for image sequences without needing user input for each frame.

The approach used in this thesis differs from previous research in a few areas. The optimal order of processing is determined in a different way, and the sampling of color values is changed to work more efficiently on high-resolution images. Finally, a gradient-guided local smoothness constraint can optionally be used to improve results in cases where the normal technique produces poor results.

Keywords:

Matting, alpha estimation, compositing, Bayesian framework, maximum a posteriori, color-vector clustering, optical flow.


Acknowledgments

I would like to thank my academic supervisor Prof. Anders Ynnerman at the Department of Science and Technology, Linköping University, for giving me the opportunity to do this thesis work. Thanks to my supervisor Dr. Douglas Roble at Digital Domain for arranging a very stimulating project and for lots of ideas, feedback and support.

Thanks also to John Flynn of Digital Domain’s computer vision development team for all help on computer vision related topics. I would also like to thank Nafees Bin Zafar, Bill Spitzak and Matthias Melcher for patiently answering all my questions about NUKE.

And finally, thanks to my parents, Jörgen and Monica Karlsson, and Maria Häll for all the support.


Contents

1 Introduction
  1.1 Motivation
  1.2 Problem Description
  1.3 Objectives
  1.4 Thesis Outline

2 Background and Related Work
  2.1 Digital Matting and Compositing
    2.1.1 NUKE Compositing Software
  2.2 Constant Color Matting
  2.3 Natural Image Matting
  2.4 Matting of Image Sequences
  2.5 Bayesian Statistics

3 Natural Image Matting using Bayesian Statistics
  3.1 Overview
  3.2 Using Level Sets for Distance Measurements
  3.3 Pixel Value Sampling Approach
  3.4 Clustering
    3.4.1 Clustering by K-means
    3.4.2 Clustering using a simplified RANSAC algorithm
  3.5 The Matting Problem in a Bayesian Framework
    3.5.1 Computing Expected Likelihood
    3.5.2 Maximization Over Model Parameters
  3.6 Gradient Guided Locality Constraint

4 Matting of Image Sequences
  4.1 Overview
  4.2 Optical Flow
    4.2.1 Optical Flow using a Robust Estimation Framework
    4.2.2 Spline-Based Optical Flow
  4.3 Accuracy Measurement of Optical Flow

5 Implementation
  5.1 Application Environment
  5.2 Natural Image Matting
    5.2.1 User Controls
    5.3.1 User Controls

6 Results and Comparisons
  6.1 Natural Image Matting
  6.2 Image Sequences
  6.3 Processing Times

7 Discussion
  7.1 Conclusion
  7.2 Future Work


List of Figures

1.1 Input image (a), trimap (b), output image (c), output alpha (d)

2.1 Image (a) with matte (b) over a background, forming composite image (c). Girl image courtesy of Corel Knockout tutorial. Composite image courtesy of Yung Yu Chuang.

3.1 Trimap (a), distance to foreground (b), distance to background (c)

3.2 Sample approaches used by [12] (a), [3] (b) and for this thesis (c)

3.3 Distributions in Bayesian framework.

3.4 Image (a), matte containing errors (b), insets (c) and gradient image (d)

4.1 Accumulated error map for frames 32 (a) and 37 (b) using FW flow. Validity bits corresponding to the errors, (c) and (d), with a fairly high error threshold.

5.1 Flowchart of trimap plugin

5.2 Flowchart of image sequence matting

6.1 Image showing performance on image with fine strands of hair. Original image from Corel Knockout tutorial.

6.2 Image with patterns near object edges. Original image courtesy of Anna Yu.

6.3 Image (a), matte with errors (b), gradient map (c), improved matte (d).

6.4 Comparison of Bayesian technique vs. chroma keying with near-constant colored background. Matte (b) and image (c) using approach of thesis work, matte (d) using Primatte. Image courtesy of Anastasia Kapluggin.

6.5 Comparison between results using approach from this thesis and Corel Knockout. Original image from Corel Knockout tutorial.

6.6 Input image to matting performance comparison. Courtesy of Philip Greenspun, http://philip.greenspun.com

6.7 Comparison between matting techniques. From bottom: Knockout, Ruzon and Tomasi, Chuang et al. and method of this thesis work. Lighthouse image courtesy of Philip Greenspun, http://philip.greenspun.com. Matte images 2-4 courtesy of Y.-Y. Chuang.

6.8 Frames 31-34 of video sequence. Images show (from top) input image, trimap, clean object and matte. Original image sequence courtesy of Yung Yu Chuang.


List of Tables

5.1 Natural image matting settings

5.2 Parallelization properties

5.3 Trimap generation settings

6.1 Processing times for image matting and trimap generation


Chapter 1

Introduction

This chapter presents the motivation for this thesis work, followed by a description of the problem to be solved and the objectives. Finally, a section is dedicated to the structure of this report.

1.1

Motivation

This thesis describes research and development conducted at the software department of Digital Domain, Venice, CA, USA from September 2003 until February 2004. It is a part of the fulfillment of a Master of Science degree in Media Technology and Engineering at the Department of Science and Technology at Linköping University, Sweden.

The matting problem of separating a non-rectangular foreground image from a background image is a classical problem in image processing and analysis. A common example is a film frame where an actor is extracted from the background to later be placed on a different background. Compositing of these objects against a new background is one of the most common operations in the creation of visual effects.

When the original background is of non-constant color, the matting becomes an underdetermined problem, for which a unique solution cannot be found. Images of this kind are currently matted by hand, a process that is very time-consuming. In addition, the images used in the visual effects industry are of high resolution and often contain motion blur and grain. The result of these factors is that many pixels contain a mixture of colors, making them very difficult to process manually.

To summarize, there is a need both to increase the quality of mattes shot against arbitrary backgrounds, and to reduce the amount of human interaction required to generate them.

1.2

Problem Description

The problem of acquiring the matte from a photograph can be described as finding the alpha value α and foreground colors F = {Rf, Gf, Bf} given an observed composite color C and background colors B = {Rb, Gb, Bb}. When performed on an entire image, the problem becomes: given an input image (figure 1.1 a) and often also a trimap1 (figure 1.1 b), find the foreground image (figure 1.1 c) and the matte image (figure 1.1 d).

Figure 1.1: Input image (a), trimap (b), output image (c), output alpha (d)

Since at each pixel position there are three constraining equations (the compositing equation for R, G and B respectively) and more unknowns, the problem is underdetermined and has infinitely many solutions. There are however special cases where a single solution exists, as described in section 2.2.

The computed F together with the α is considered a solution to the matting problem and, using it, a composite over a new background can be generated by computing a new C for all points. The matte, or solution, is considered successfully generated if it correctly isolates a separable object from the other objects in the scene, which can be referred to as the background. The measurement of the success of acquiring the matte is highly subjective, and one can in general not automatically determine when a matte is perfectly generated.
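The underdetermined nature of the problem can be made concrete with a small sketch (my own illustration, not from the thesis): for a single observed pixel C, every choice of α and B yields an F that satisfies the compositing equation exactly, so the three equations in seven unknowns admit infinitely many solutions.

```python
# Illustrative sketch: for one observed pixel C, any alpha > 0 and any
# background B can be "explained" by a suitable foreground F -- three
# equations, seven unknowns, hence infinitely many solutions.

def explain(C, alpha, B):
    """Solve C = alpha*F + (1 - alpha)*B for F (requires alpha > 0)."""
    return tuple((c - (1.0 - alpha) * b) / alpha for c, b in zip(C, B))

C = (0.5, 0.3, 0.4)                      # one observed composite pixel
for alpha, B in [(1.0, (0.0, 0.0, 0.0)),
                 (0.5, (0.2, 0.2, 0.2)),
                 (0.8, (1.0, 1.0, 1.0))]:
    F = explain(C, alpha, B)
    # Each (alpha, F, B) triple reproduces C exactly:
    assert all(abs(alpha * f + (1 - alpha) * b - c) < 1e-12
               for f, b, c in zip(F, B, C))
```

This is why extra information, such as a trimap and statistical priors, is needed to pick one plausible solution among the infinitely many.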

1.3

Objectives

Given the presented motivation and description of the problem at hand, the following objectives were decided for this thesis work:

• Implement a program that performs natural image matting for single images

• Extend the single image matting program to automatically or semi-automatically process sequences of images

• Integrate the program(s) into the production pipeline at Digital Domain

The implementation was based on recent research in this area. A review of such previous work is presented in sections 2.3 and 2.4. Integration of the implementation into the pipeline and workflow of Digital Domain meant careful consideration when deciding on inputs and outputs, formats, choice of operating system and so on.

1 Alpha channel containing values of 0.0 for areas that are known to be background, values of 1.0 for known foreground and (around) 0.5 for unknown regions. Sometimes also referred to as a garbage matte.


1.4

Thesis Outline

The rest of the report proceeds as follows. Chapter two discusses previous work done in the image matting area and other relevant findings. Chapter three presents the framework for solving the image matting problem, while chapter four describes how this framework is extended to work on sequences of images. In chapter five the implementation of the image matting is discussed, including properties of the environment in which the program is to be used. In chapter six results are presented, evaluated and compared to results using other approaches. Finally, chapter seven summarizes the report as well as gives some examples of future work that might be done in this area.

The reader of this thesis is expected to have a good knowledge of mathematics, in particular statistics, and of general computer graphics.


Chapter 2

Background and Related

Work

This chapter introduces some important theory and previous work in this area of research. Since this thesis work is an implementation based on results from recent research, the chapter contains fairly detailed descriptions of those methods.

2.1

Digital Matting and Compositing

Matting and compositing have been used in the movie industry since the beginning of the 1970s. The matte in that case was a strip of film, transparent in places, that was placed on the original color filmstrip. When placed together and projected, light passed through the transparent parts and was blocked everywhere else.

With the introduction of the alpha channel in 1984, [11], the digital equivalent of the matte was born. The value of a pixel in the image's alpha channel corresponds to the opacity of that pixel. A pixel with an alpha value of 1 (or 255 in a typical 8-bit implementation) has its full color value (opaque), and zero-valued pixels (transparent) will not be seen. Fractional opacities are represented by values between zero and one, and are important for transparency and motion blur.

Compositing in its most general form is when several pictures, or elements,

are mixed to form a single resulting image. The matte is used to successfully composite the acquired foreground element into a new scene. In compositing there are a wide range of operations that can be used to combine different ele-ments into a single output. The most common one is the over operator, which is summarized by the compositing equation:



C = α F + (1− α) B (2.1) where C , F and B are the pixel’s composite, foreground and background

color value and α is the pixel’s opacity value. Figure 2.1 shows an input image with alpha matte composited over a second image (background), creating a composite image (c).
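As a concrete illustration, equation (2.1) can be applied per pixel along a scanline. The sketch below is my own illustrative Python, not NUKE code; the data layout (lists of RGB tuples plus a separate alpha list) is an assumption for clarity.

```python
# Minimal sketch of the "over" operator applied along one scanline.

def over_scanline(fg, alpha, bg):
    """Composite a foreground scanline over a background scanline.
    fg, bg: lists of (R, G, B) tuples; alpha: list of opacities in [0, 1]."""
    out = []
    for F, a, B in zip(fg, alpha, bg):
        # C = alpha * F + (1 - alpha) * B, evaluated per channel
        out.append(tuple(a * f + (1.0 - a) * b for f, b in zip(F, B)))
    return out

fg    = [(1.0, 0.0, 0.0)] * 3    # red foreground
alpha = [1.0, 0.5, 0.0]          # opaque, soft edge, fully transparent
bg    = [(0.0, 0.0, 1.0)] * 3    # constant blue background
print(over_scanline(fg, alpha, bg))
# [(1.0, 0.0, 0.0), (0.5, 0.0, 0.5), (0.0, 0.0, 1.0)]
```

The fractional alpha in the middle pixel shows how soft edges and motion blur blend foreground and background colors.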


Figure 2.1: Image (a) with matte (b) over a background, forming composite image (c). Girl image courtesy of Corel Knockout tutorial. Composite image courtesy of Yung Yu Chuang.

2.1.1

NUKE Compositing Software

NUKE1 is Digital Domain's Academy Award-winning digital compositing software. The software has been refined over more than 8 years and has recently been made available to the public. One of the more important characteristics of the software is its scanline-based image computation, which makes it very efficient. It features 32 channels of full 32-bit floating point color data per node, making it able to process HDR images2.

NUKE is highly plugin-oriented; object loaders, creators and output viewers are plugins, and in fact all operators in NUKE are essentially plugins. Using the NUKE Software Development Kit (NDK or NUKE SDK), developers have easy access to a wide range of tools for creating custom NUKE plugins.

2.2

Constant Color Matting

As described in section 1.2, a unique solution to the matting problem only exists in certain cases. One of the more common of these special cases is constant color matting, or chroma keying as it is often called. The main idea of this technique is to record the subject against a background of constant color, often green or blue. In their 1996 paper, [13], Smith and Blinn analyzed and summarized several different approaches and provided a more scientific, algorithmic treatment of the findings of Peter Vlahos, an outstanding inventor in the field. If the foreground object is known not to contain the (constant) backing color, and the background is known to contain only that constant color, the system of equations is simplified and becomes solvable. This greatly restricts the possible scenarios, since only one plane of the full RGB colorspace is usable, and perfectly colored backgrounds are hard, if not impossible, to create. A common solution to an imperfect backing color is to film an extra shot of the background without the foreground object (known as a clean plate) that can be used as the background color in the computations. Using other properties of constant color matting, good results can often be obtained for gray tones and flesh tones.

1 http://www.d2software.com/nuke.html

2 High Dynamic Range; images that allow a brightness exceeding 1.0 and preserve negative values.

The paper also discusses the important and notoriously difficult problem of spill or flare. Spill is when blue light from the blue screen is reflected on the foreground object. In [16] Vlahos solved the problem for two important special cases: flesh tones and bright whites. Vlahos' solution comes from years of experience rather than mathematical models, and in their paper Smith and Blinn fail to come up with a mathematical equivalent of Vlahos' empirical findings.
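The simplest solvable case above can be sketched in a few lines. This is my own illustration of the idealized blue-screen assumption, not Smith and Blinn's full algorithm: the backing is pure blue (0, 0, Bb) and the foreground is assumed to contain no blue, so the blue channel alone determines alpha via Cb = (1 − α)Bb.

```python
# Idealized blue-screen matte solve (illustrative sketch only).

def blue_screen_matte(C, Bb=1.0):
    """Return (alpha, premultiplied foreground alpha*F) for observed C,
    assuming backing color (0, 0, Bb) and a foreground with no blue."""
    Cr, Cg, Cb = C
    alpha = 1.0 - Cb / Bb            # invert Cb = (1 - alpha) * Bb
    aF = (Cr, Cg, 0.0)               # alpha*F = C - (1 - alpha)*B
    return alpha, aF

# A half-transparent pixel observed over a full-strength blue screen:
print(blue_screen_matte((0.3, 0.2, 0.5)))   # (0.5, (0.3, 0.2, 0.0))
```

Compositing over a new background B' then only needs C' = αF + (1 − α)B'. The sketch also makes the fragility obvious: any blue in the foreground, or any deviation of the backing from pure blue, breaks the solve, which is why clean plates and Vlahos-style heuristics matter in practice.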

2.3

Natural Image Matting

The research area of natural image matting tries to solve the matting equation without imposing any constraints on the properties of the foreground or background objects. Since the equation is unsolvable, the solution is not guaranteed to be correct and is only an estimate, often based on statistics and local properties of the image. Another factor that makes this a difficult problem is the consideration of natural objects, with boundaries that in general do not project onto simple curves in the image. Examples include strands of hair, plumes of smoke, or trees and leaves. A strand of hair can often be smaller than one pixel in the image, resulting in pixels that receive light and color from more than one object.

Today at Digital Domain, as at most visual effects companies, mattes from images shot against non-constant colored backgrounds are created using rotoscoping. This is a technique where lines, commonly splines of some sort, are fitted around the boundary defining the object. The interior of the curve is assigned an alpha value corresponding to the object (commonly 1.0) and the exterior the alpha of the background (0.0).

This curve is initially drawn for some keyframes, for instance every 5th frame. Next an attempt is made to generate the curve for frames in between keys. This is done by simply interpolating between the shapes of the keyframes, or by extracting the camera motion of the shot using tracking software. Often the interpolated results need to be modified by hand, and keyframes introduced every 3rd frame, or for difficult cases even for each frame.

Apart from requiring time-consuming manual interaction, this approach is more or less incapable of generating mattes for objects with fine details, such as strands of hair or the branches and leaves of a tree, or for semi-transparent regions.

Ruzon and Tomasi's approach to alpha estimation in natural images, [12], requires the user to provide information about what parts of the image are known to be foreground and background, and what parts are unknown, as in figure 1.1 (b). First the unknown region is partitioned into sub-regions using anchor points. Each point serves as the center of a window that defines a region forming a local color distribution. The size of the window is set to definitely encompass portions of the known foreground and background regions. Pixels found in the known regions are considered samples from the background and foreground distributions in color space. The pixels are then divided into clusters and each cluster is modeled as a mixture of axis-aligned Gaussians. The observed (composited) color C is modeled as another distribution located between the foreground and the background distributions. The distance to the other distributions corresponds to the alpha value of the pixel. This alpha value is chosen such that it yields a maximum probability for the observed pixel. When the alpha is chosen, the estimated F and B are computed as weighted sums of cluster mean values.

In a paper from 2001, [3], Chuang et al. build on the work by Ruzon and Tomasi to produce even better mattes. The presented method differs from previous ones in a few but important areas. Firstly, the unknown region is processed starting close to the known regions, marching inward. As an effect of this different processing, the sampling of colors to form the distributions is also changed. The new approach includes not only colors from the known regions but also previously computed values. Furthermore, the optimization framework is changed, in that it computes the alpha value and the estimated F and B simultaneously, instead of in two different steps. Finally, the color distributions are here modeled using oriented Gaussians, instead of Ruzon's axis-aligned Gaussians.

The per-frame computation of mattes throughout this report is mainly based on the method of Chuang et al. and the process is detailed in chapter 3.

A few commercial products for non-constant color matting must also be mentioned. In later versions of Adobe's Photoshop3, a plugin called Extract is available. This works in a fashion similar to the research projects described above in that it operates on three separate regions. The plugin quickly generates a mask, but results are not as good when the image is more complex. Corel offers a plugin to Photoshop in their Knockout program4. They claim it to be faster and more accurate than the Extract tool, but since no license was available, only publicly available results have been evaluated.

2.4

Matting of Image Sequences

The approaches mentioned in the previous section can be extended to handle sequences of images by simply processing each image individually. But since most methods require some initial information to guide the process, user interaction is then required for each frame. Since sequences often span more than one hundred frames, this work is very time-consuming and alternative solutions are most welcome.

Mitsunaga et al. presented a system, [7], where the user provides background and foreground contours, which are then evolved over time. The first step of their process is a boundary search, where a block matching scheme is used to generate an estimate of the object contour, based on information from the previous frame. Next, a gradient vector field is computed, describing the edge of the object. To favor boundaries between objects rather than internal edges, a projected gradient is used. Furthermore, the gradient information is weighted based on the expected orientation of the edge, acquired from the previous frame. Finally, the matte is generated from the gradient map using a contour line described in parametric form. The process allows human interaction and guidance for all steps. This system is mainly designed for use when there is a hard edge between the foreground and background objects, since the process assumes strong smoothness in alpha values near object boundaries. Images in motion pictures often contain motion blur and film grain, and this, in addition to high resolution, often makes edges too soft to locate with such techniques. The process is furthermore not at all suited for finer details such as wispy hair or leaves.

3 http://www.adobe.com/photoshop

4 http://www.corel.com/knockout

Hillman et al. chose a somewhat different approach in their attempt to solve the matting problem, [6]. Similarly to the methods of [12] and [3], their technique requires an initial guide to start the computation. Here, PCA5 is used to find the main axis through each cluster of color samples, which is then used to determine F, B and α. To cope with more difficult cases they also present several extensions to the main method, based on for instance weighted averages and multi-classification. Finally, Hillman et al. also apply their technique to image sequences by matching a low-resolution version of the image to the previous frame and further classifying pixels as background or foreground. For unknown pixels, the edge of the previous frame is examined, and those foreground and background colors are used. Their technique is sensitive to large movements between frames, but requires little or no user input apart from the initial trimap.

In their 2002 paper, [4], Chuang et al. extended their Bayesian framework for matting to process sequences of images. Their approach consists of three main parts: Bayesian matting, optical flow and background estimation. The user provides trimaps for keyframes and optical flow is used to flow the trimap between the keyframes. Bayesian matting is used both to improve the results from optical flow and to compute the final matte for each frame. Using image mosaicking techniques one can construct a background clean plate by drawing a garbage matte over the foreground part and forming a background based on remaining information from the image sequence. Using this estimated background, or a clean plate from using a fixed camera or motion control rig, provides accurate information about the background for the unknown regions. This helps both the quality of the generated matte and the running time of the estimation.

The computation of per-frame trimaps in this report is based on this method of optical flow and the use of Bayesian matting to improve results. This is detailed in chapter 4.

5 Principal Component Analysis, a method for computing the principal, or dominant, axes of a data set.

2.5

Bayesian Statistics

Bayesian analysis is a statistical procedure where the parameters of an underlying distribution are estimated based not only on an observation but also using priors.

The method starts with a prior distribution p(f), based on for example an estimate of the relative likelihoods of parameters or the results of non-Bayesian observations. Given this prior distribution, data is gathered to form an observed distribution. The likelihood of the observed distribution is then calculated as a function of model parameter values, and finally the likelihood function is multiplied by the prior distribution and normalized to form the posterior distribution. The parameter estimate is then the mode6 of this distribution.

With a prior distribution p(f) over parameters f, and a probability distribution p(g|f) of an observation g if f were the true parameters, the Bayesian paradigm gives that p(f|g) ∝ p(f)p(g|f). One common way to present the problem of finding the model parameters is the maximum a posteriori (MAP) estimate, where the MAP estimate of f is the f̂ that maximizes p(f)p(g|f). This can in turn be written as the f̂ that maximizes log p(f) + log p(g|f). Here, the first term corresponds to the prior distribution and can be seen as a penalty term. Parameters f that do not correspond to the prior expectations will generate a small value for p(f), thereby acting as a penalty in the maximization.


Chapter 3

Natural Image Matting

using Bayesian Statistics

This chapter presents the chosen approach to solve the matting problem for a single image. First an overview of the process is given, followed by a more detailed description of the steps involved.

3.1

Overview

To start the computations, the user provides an image containing a trimap in the alpha channel. The trimap provides information about which known foreground and background colors are to be used to estimate the values of the unknown pixels. Furthermore, the unknown region of the trimap helps to reduce the number of pixels to be computed, compared to solving for the entire image. The processing of unknown pixels starts from the edge of the known regions and proceeds inward into the unknown area. For each unknown pixel, samples of the known foreground and background as well as samples from previously estimated pixels are taken. These samples are then grouped into clusters, containing similar foreground and background colors, respectively. The clusters are used to model the distributions of possible foreground and background colors in the neighborhood around the current pixel, in a Bayesian framework.

Next, an iterative approach is used to find the optimal foreground and background colors and alpha value for the current pixel. Optionally, a gradient constraint can be applied once the estimated values are computed, thereby enforcing a local smoothness constraint on the computed alpha values.

This process is iterated for all pixels in the unknown region of the trimap.

3.2

Using Level Sets for Distance Measurements

An important part of the estimation framework is the use of previously estimated pixels. These are close to the current pixel and are more likely to be a good representation of it than values from the known regions, which might be further away. The problem with using estimated values is that one needs to be sure that they are valid; otherwise the error will accumulate and worsen as the image is processed. So, to minimize this problem, unknown pixels should be processed beginning close to the known region, where the known samples accurately model the current pixel, and moving in toward the center of the unknown region.

The approach in this thesis uses two 2D level sets, introduced and described in [10], representing the surfaces between the unknown region and the known foreground and background regions respectively. These level sets are used to determine the optimal order of processing the unknown pixels. Once the level sets have been reinitialized, the distance to both surfaces is known for each point in the image. The distance values for the two level sets can be seen in figure 3.1. Distance values in images (b) and (c) have been normalized to the range [0,1].

Figure 3.1: Trimap (a), distance to foreground (b), distance to background (c)

Next, each point in the unknown region is inserted into a red-black tree, sorted by the shortest distance to either of the two surfaces. Since the red-black tree sorts items on insertion, an order of processing is easily determined by simply traversing the tree.
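The nearest-first ordering can be sketched compactly. For brevity, this illustration substitutes a multi-source BFS for the level-set reinitialization and a binary heap for the red-black tree; both substitutions yield the same "closest to a known region first" visiting order described above, but are not the thesis's actual data structures.

```python
# Sketch: order unknown trimap pixels by distance to the nearest known region.
import heapq
from collections import deque

def processing_order(trimap):
    """trimap[y][x] in {0.0 (background), 1.0 (foreground), 0.5 (unknown)}.
    Returns (distance, y, x) tuples for unknown pixels, nearest-known first."""
    h, w = len(trimap), len(trimap[0])
    dist = [[None] * w for _ in range(h)]
    q = deque()
    for y in range(h):
        for x in range(w):
            if trimap[y][x] != 0.5:          # known pixel: distance 0
                dist[y][x] = 0
                q.append((y, x))
    while q:                                 # multi-source BFS over the grid
        y, x = q.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and dist[ny][nx] is None:
                dist[ny][nx] = dist[y][x] + 1
                q.append((ny, nx))
    unknown = [(dist[y][x], y, x) for y in range(h) for x in range(w)
               if trimap[y][x] == 0.5]
    heapq.heapify(unknown)                   # sorted extraction, like a tree walk
    return [heapq.heappop(unknown) for _ in range(len(unknown))]
```

On a one-row trimap `[1.0, 0.5, 0.5, 0.5, 0.0]` the two edge-adjacent unknown pixels come out first and the middle pixel last, matching the inward-marching order.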

3.3

Pixel Value Sampling Approach

As described in the overview, samples from the known foreground and background as well as samples from previously estimated pixels are needed in order to model the distributions of foreground and background colors in the neighborhood around the current pixel. The approach by Ruzon and Tomasi [12] used only samples from known regions, but if the distance to one of the known regions is rather large, this choice can lead to samples that poorly represent the state of the current pixel.

To overcome this limitation, previously computed foreground and background samples are used in an approach similar to that described in [3]. Although performing much better than the sampling method of Ruzon and Tomasi [12], their approach has a different problem. As described in their paper, a region encompassing the two known regions is created, from which samples are gathered, as shown in figure 3.2 (b). But for images at a resolution of 2048 × 1556, the resolution commonly used for feature film effects, the distance to the closest surface can be on the order of 150 pixels, leading to a sampling neighborhood containing up to 100 000 pixels. Examining and sampling that many pixels for each unknown pixel in the image can be a very time-consuming process, and the large number of samples might in some cases lead to aliasing issues.

The approach used for this work samples colors in three turns, from three different positions: the closest known foreground and background pixels, and the current pixel, as shown in figure 3.2 (c). The aim of this sampling procedure is to keep the number of samples to a minimum, but still make sure that samples are acquired from both known regions and from previously estimated neighboring pixels. As a result of this changed sampling, the processing of an image can in extreme cases be up to 100 times faster compared to sampling the entire region as shown in figure 3.2 (b).

Figure 3.2: Sample approaches used by [12] (a), [3] (b) and for this thesis (c)

Using distance measurements and gradient information from the level sets, the position of the foreground and background pixel that is closest to the current pixel can be computed. Because the gradient information is inaccurate when the distance to the surface is rather large, the computation of the position is iterated:

Algorithm 1 Finding position of closest point on/inside level set
1: PosN is set to current pixel (x,y)
2: while distance to closest level set point from PosN > threshold do
3:   set new PosN = distance(closest LS point, PosN) ∗ gradient(PosN)
4: end while
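A minimal Python sketch of this iteration, assuming `distance` and `gradient` are hypothetical callbacks into the level-set data and that each update steps against the gradient by the current distance estimate (one plausible reading of the update rule above):

```python
def closest_surface_point(pos, distance, gradient, threshold=0.5, max_iters=20):
    """Iteratively walk toward the closest point on a level-set surface.

    distance(p) -> signed distance from point p to the surface.
    gradient(p) -> unit gradient (gx, gy) of the distance field at p.
    Both are assumed accessors wrapping the level-set data.
    """
    x, y = pos
    for _ in range(max_iters):
        d = distance((x, y))
        if abs(d) <= threshold:
            break
        gx, gy = gradient((x, y))
        # Step against the gradient by the current distance estimate;
        # re-evaluating each iteration compensates for gradient
        # inaccuracy far from the surface.
        x, y = x - d * gx, y - d * gy
    return x, y
```

With an exact distance field a single step suffices; the loop only matters where the gradient is unreliable.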

Once the positions are computed, samples are gathered from these locations in a circular fashion. The radius of the sampling neighborhood is increased until a user-specified number of samples have been acquired. This process is repeated for foreground and background regions and finally for estimated values, centered at the position of the current pixel.
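The expanding-radius gathering described above might look as follows in Python. This is a naive sketch; `is_candidate` and `color_at` are hypothetical accessors into the trimap and image, and a production version would avoid rescanning the inner disc on each growth step:

```python
def gather_samples(center, is_candidate, color_at, want, max_radius=200):
    """Collect up to `want` color samples around `center`, growing the
    circular sampling neighborhood until enough candidates are found."""
    cx, cy = center
    samples, r = [], 1
    while len(samples) < want and r <= max_radius:
        samples.clear()
        # Scan the bounding box of the current radius and keep pixels
        # inside the circle that pass the candidate test.
        for x in range(cx - r, cx + r + 1):
            for y in range(cy - r, cy + r + 1):
                if (x - cx) ** 2 + (y - cy) ** 2 <= r * r and is_candidate(x, y):
                    samples.append(color_at(x, y))
        r += 1
    return samples[:want]
```

The same routine is run three times: once each at the closest foreground and background positions (testing membership in the known regions) and once at the current pixel (testing for previously estimated values).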

3.4 Clustering

To reduce the amount of information that needs processing, the foreground and background samples are partitioned into clusters. Clustering is an unsupervised process in which a set of unorganized data is replaced by a number of clusters: collections of data points that are grouped together. The condition that determines which cluster each point should belong to can differ considerably depending on the clustering method being used. Common examples are grouping pixels with similar color, texture and/or position. The purpose of clustering in this project is to have a cluster correctly represent a larger number of samples. Using properties of the cluster, such as its mean value and inverse covariance matrix, one can then model the color distribution of the foreground and background color, as described in section 3.5.1.

When clustering data it is preferable that the domain has a Euclidean metric. This, however, is not the case for the RGB color space. Therefore the input image is converted to the CIE LAB color space, which provides the closest possible Euclidean approximation to the perception of color differences by a human observer. For some odd cases it was observed that the produced matte was better when not converting color spaces. This might occur when the background and foreground colors are more similar in LAB color space than in RGB. For this reason, the choice of color space was left to the user, as shown in table 5.1.

The following subsections describe the clustering methods that have been considered for this thesis.

3.4.1 Clustering by K-means

A common approach to clustering is to design an objective function that expresses how good a representation is, and then build an algorithm that obtains the best representation. The objective function for clustering by K-means is based on assuming that there are K clusters, where K is known. Each cluster i has a center c_i, and each sample j to be clustered is described by a feature vector x_j. This feature vector can, for example, be the coordinate vector of a point or the color vector of a pixel. Using these conventions, the objective function can be described as:

\[
\Phi(\text{clusters}, \text{data}) = \sum_{i \in \text{clusters}} \left[ \sum_{j \in i\text{th cluster}} (x_j - c_i)^T (x_j - c_i) \right] \tag{3.1}
\]

The iteration of points is then based on the following two assumptions:

• Assume the cluster centers are known and allocate each point to the closest center.

• Assume the allocation of each point is known and choose each new cluster center as the mean of all points allocated to that center.

The process is started by randomly selecting K points as cluster centers and then iterating the two assumptions alternately. The iterations are terminated when the cluster centers have moved less than a distance ε from the previous iteration, or a maximum number of allowed iterations has been reached. One of the main drawbacks of the basic K-means clustering approach is that the number of clusters is preset. This means that if there are four natural groups among the points to be clustered, but the user sets K to three, the results will be poor. To overcome this, an iterative K-means clustering algorithm is used, as described by algorithm 2. The main idea is to increase the number of clusters until all points are closer to their cluster center than some threshold.


Algorithm 2 Iterative K-means clustering
1: Set K to 0
2: while max(distance(data point, cluster center)) > threshold do
3:   K = K + 1
4:   Choose K data points to act as cluster centers
5:   while cluster center moves distance > ε from previous iteration do
6:     Allocate each data point to the cluster whose center is closest
7:     Ensure that every cluster has at least one data point
8:     Replace cluster centers with mean of elements in clusters
9:   end while
10: end while
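Algorithm 2 can be sketched compactly in Python. The threshold, ε and the cap on K below are illustrative defaults, not values from the thesis:

```python
import random

def kmeans(points, threshold=5.0, eps=1e-3, max_k=16):
    """Iterative K-means clustering (a sketch of Algorithm 2): grow K
    until every point lies within `threshold` of its cluster center."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    centers = [points[0]]
    for k in range(1, min(max_k, len(points)) + 1):
        centers = random.sample(points, k)
        for _ in range(100):  # Lloyd iterations for this K
            # Allocate each point to the cluster whose center is closest.
            groups = [[] for _ in range(k)]
            for p in points:
                groups[min(range(k), key=lambda i: dist(p, centers[i]))].append(p)
            # Ensure every cluster has at least one data point.
            for i, g in enumerate(groups):
                if not g:
                    g.append(centers[i])
            # Replace cluster centers with the mean of their members.
            new = [tuple(sum(c) / len(g) for c in zip(*g)) for g in groups]
            moved = max(dist(a, b) for a, b in zip(centers, new))
            centers = new
            if moved < eps:
                break
        # Stop growing K once all points are close enough to a center.
        if max(min(dist(p, c) for c in centers) for p in points) < threshold:
            break
    return centers
```

Note that each K restarts from a fresh random seeding, matching the pseudocode rather than reusing the previous centers.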

3.4.2 Clustering using a simplified RANSAC algorithm

As mentioned in the previous section, one major problem with the original K-means clustering algorithm is the need for K to be specified before entering the algorithm. Instead of iteratively running the K-means algorithm with different parameters and evaluating them based on some criterion, the following alternative clustering method can be used.

This algorithm is based on principles from the Random Sample Consensus (RANSAC) algorithm, first introduced in [5]. The RANSAC algorithm fits a model to data that is potentially corrupted by outliers. The algorithm takes repeated samples, and for each sample it computes a candidate model. The model computed from the sample is then evaluated over the whole set of observations. The model that gets the best vote is chosen as the solution. For the algorithm below, the parameters are estimated from N data items, out of M data items in total. The number of iterations, L, can be determined from a ratio of probabilities of the trial being a success versus a failure and the probability that a random data item fits the model.

Algorithm 3 RANSAC algorithm
1: for 1 to L do
2:   Select N data items at random
3:   Estimate parameters
4:   Set K = number of data items that fit the model with parameter vector, within some tolerance
5:   If K > threshold, accept fit and exit with success
6: end for
7: If this point is reached, operation failed

In the current case of clustering, the algorithm becomes greatly simplified. The evaluation criterion is, at this stage at least, only a Euclidean distance. The number of iterations, L, is also simple: continue until all points have been assigned to a cluster.

This alternative clustering algorithm was designed at the end of the project, as a complement to the iterative K-means clustering algorithm. It has therefore not been evaluated or compared to other clustering methods.


Algorithm 4 RANSAC-based clustering
1: while points still remaining do
2:   Choose a point among the remaining to act as new cluster center
3:   while cluster center moves distance > ε from previous iteration do
4:     Assign all points within radius R of cluster center to the cluster
5:     If no points are assigned, choose new point as cluster center
6:     If points have been assigned, move cluster center to mean of points
7:   end while
8:   Remove all points from data set that have been assigned to the cluster
9: end while
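Algorithm 4 translates almost directly into Python. The radius and ε below are illustrative, and the seed is simply the first remaining point rather than a random choice:

```python
def ransac_cluster(points, radius=3.0, eps=1e-3, max_iters=50):
    """Simplified RANSAC-style clustering (a sketch of Algorithm 4):
    grow a cluster around a seed point, re-centering on the mean of the
    points within `radius`, until all points have been consumed.
    Euclidean distance is the only evaluation criterion."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    remaining = list(points)
    clusters = []
    while remaining:
        center = remaining[0]          # seed with any remaining point
        members = [center]
        for _ in range(max_iters):
            near = [p for p in remaining if dist(p, center) <= radius]
            if not near:               # cluster collapsed; keep last members
                break
            members = near
            new_center = tuple(sum(c) / len(near) for c in zip(*near))
            if dist(new_center, center) < eps:
                center = new_center
                break
            center = new_center
        clusters.append((center, members))
        # Remove assigned points from the data set.
        remaining = [p for p in remaining if p not in members]
    return clusters
```

Because the seed always lies within its own radius, every outer iteration removes at least one point, so the loop terminates.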

3.5 The Matting Problem in a Bayesian Framework

As described earlier, the main objective of the matting operation is to find the most likely estimates for F, B and α given an observation C. In [3], Chuang et al. present this problem within a Bayesian framework, using a maximum a posteriori (MAP) technique. That method is also used in this thesis and is detailed in the following section.

Following the MAP approach, the matting problem can be described as a maximization of the model parameters, in our case F, B and α, over a probability distribution P. Using Bayes's theorem, the more complex expression P(·) can be split up into a sum of log-likelihoods L(·):

\[
\arg\max_{F,B,\alpha} P(F, B, \alpha \mid C)
= \arg\max_{F,B,\alpha} \frac{P(C \mid F, B, \alpha)\,P(F)\,P(B)\,P(\alpha)}{P(C)}
= \arg\max_{F,B,\alpha} L(C \mid F, B, \alpha) + L(F) + L(B) + L(\alpha) \tag{3.2}
\]

The term P(C) can be eliminated, since it is constant with respect to the optimization parameters F, B and α.

Finding the optimal model parameters that maximize a given log-posterior function, in our case equation 3.2, is done in two steps: finding the optimal parameters for the current observation, and calculating the likelihood, according to equation 3.2, using the current model parameters. These two steps are detailed in the following two subsections.

3.5.1 Computing Expected Likelihood

The log-likelihood L(C|F, B, α) models the error in the measurement of C; that is, it can be viewed as a measure of how well the current optimization parameters F, B and α fit the observed, composited pixel C. This measurement can be computed as:

\[
L(C \mid F, B, \alpha) = -\frac{\left\| C - \alpha F - (1-\alpha) B \right\|^2}{\sigma_C^2} \tag{3.3}
\]

where σ_C is the camera variance, or camera noise. In terms of distributions, the equation above can be described as a Gaussian distribution centered at C − αF − (1 − α)B with a standard deviation of σ_C, as seen in figure 3.3.

The next task is modeling the probability distributions for the foreground and background colors, L(F) and L(B). This is done using information from the clusters of samples of the known and estimated foreground/background pixels. Following ideas from robust statistics, a weight for each sample is used, such that pixels far from the current unknown pixel and samples with certain characteristics in their alpha values are down-weighted:

\[
w_i = \alpha_i^2 \, \frac{e^{-(x_i - \mu)^2 / 2\sigma^2}}{\sigma\sqrt{2\pi}} \tag{3.4}
\]

The first portion of the weight expression favors samples with large alpha values; when computing weights for background pixels, α_i² is replaced with (1 − α_i)² to favor samples with low alpha values. The remainder of the weight expression corresponds to a Gaussian falloff, which gives points closer to the current pixel position a stronger weight. In the implementation σ was set to 8.
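Equation 3.4 translates directly into code. In this sketch, `pos` is the sample position x_i and `center` is the current pixel position μ; the function names are illustrative:

```python
import math

def sample_weight(alpha, pos, center, sigma=8.0, foreground=True):
    """Weight of one sample (equation 3.4): samples with decisive alpha
    and positions near the current pixel count more."""
    # Alpha term: favor high alpha for foreground, low for background.
    a2 = alpha ** 2 if foreground else (1.0 - alpha) ** 2
    # Gaussian spatial falloff around the current pixel.
    d2 = (pos[0] - center[0]) ** 2 + (pos[1] - center[1]) ** 2
    falloff = math.exp(-d2 / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))
    return a2 * falloff
```

Samples from the known foreground region have α = 1 and are therefore weighted purely by distance.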

With sets of foreground and background points and their corresponding cluster numbers and weights, the weighted mean value \(\bar{M}\) and weighted covariance matrix C of each cluster can be computed:

\[
\bar{M} = \frac{\sum_{i \in N} w_i X_i}{\sum_{i \in N} w_i} \tag{3.5}
\]

\[
C = \frac{\sum_{i \in N} w_i (X_i - \bar{M})(X_i - \bar{M})^T}{\sum_{i \in N} w_i} \tag{3.6}
\]

where the sums are computed over all N points in the current cluster, and w_i and X_i are the weight and color vector of the ith sample in the cluster. When all samples of one cluster have one or more identical color coordinates, the covariance matrix cannot be inverted. To remedy this, some noise must be added to the covariance matrix. Since the foreground and background samples are presumed to be captured with the same camera as the composited image, this noise comes in the form of the camera variance σ_C.

Adding the camera variance is done using Singular Value Decomposition (SVD). SVD decomposes a matrix A into a column-orthogonal matrix U, a diagonal matrix W and an orthogonal matrix V:

\[
A = U \cdot W \cdot V^T, \qquad
W = \begin{pmatrix}
w_1 & 0 & \cdots & 0 \\
0 & w_2 & \cdots & 0 \\
\vdots & & \ddots & \vdots \\
0 & 0 & \cdots & w_n
\end{pmatrix} \tag{3.7}
\]

The non-zero elements of W are the singular values of the matrix, and for covariance matrices these correspond to the main axes of the distribution. Next, the camera variance is added to these dominant components, and finally the covariance matrix is formed again by multiplying the matrices, as in equation 3.7.

Using these cluster properties, the foreground and background color distributions L(F) and L(B) can now be modeled as oriented Gaussian distributions, centered at \(\bar{F}\) and \(\bar{B}\) and with covariances C_F and C_B respectively, as shown in figure 3.3:

\[
L(F) = -\frac{(F - \bar{F})^T \, C_F^{-1} \, (F - \bar{F})}{2} \tag{3.8}
\]

Figure 3.3: Distributions in Bayesian framework.

The final term of the likelihood expression is L(α). For this work, as in [3] and [12], the log-likelihood for the alpha is assumed to be constant.
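The cluster statistics of equations 3.5-3.6 and the SVD-based regularization can be sketched with numpy as follows. This assumes 3-vector color samples; the camera-variance default is illustrative:

```python
import numpy as np

def cluster_stats(colors, weights, camera_variance=1e-4):
    """Weighted mean and covariance of a cluster (equations 3.5-3.6),
    with the camera variance added along the distribution's main axes
    via SVD so the covariance is always invertible."""
    X = np.asarray(colors, dtype=float)          # N x 3 color samples
    w = np.asarray(weights, dtype=float)
    mean = (w[:, None] * X).sum(axis=0) / w.sum()
    D = X - mean
    # Weighted sum of outer products (X_i - M)(X_i - M)^T.
    cov = (w[:, None, None] * (D[:, :, None] * D[:, None, :])).sum(axis=0) / w.sum()
    # Regularize: add camera variance to each singular value, then
    # recompose as in equation 3.7.
    U, s, Vt = np.linalg.svd(cov)
    cov = U @ np.diag(s + camera_variance) @ Vt
    return mean, cov
```

Even when every sample has an identical color coordinate, the recomposed covariance is strictly positive definite and can be inverted safely.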

3.5.2 Maximization Over Model Parameters

The next step is finding model parameters that produce a maximum value for the sum L(C|F, B, α) + L(F) + L(B) + L(α). However, since α is multiplied with the other model parameters in equation 3.2, the optimal values for the complete problem become a bit more complicated to find. For this work, the approach of [3] is chosen, where the problem is divided into two parts. First, α is assumed to be constant and optimal values for F and B are found. Next, the colors are assumed to be fixed and an optimal value for α is computed, given F and B. To find the best combination of F, B and α, the two parts are computed alternately and iterated until convergence. This process is repeated for all pairs of clusters, and the combination that yields the highest likelihood is used to represent the distribution.

To compute the optimal F and B, given a fixed α, the partial derivatives of equation 3.2 with respect to the two variables are set to zero. Expressed in matrix form, this results in:

\[
\begin{pmatrix}
C_F^{-1} + I\alpha^2/\sigma_C^2 & I\alpha(1-\alpha)/\sigma_C^2 \\
I\alpha(1-\alpha)/\sigma_C^2 & C_B^{-1} + I(1-\alpha)^2/\sigma_C^2
\end{pmatrix}
\begin{pmatrix} F \\ B \end{pmatrix}
=
\begin{pmatrix}
C_F^{-1}\bar{F} + C\alpha/\sigma_C^2 \\
C_B^{-1}\bar{B} + C(1-\alpha)/\sigma_C^2
\end{pmatrix} \tag{3.9}
\]

Solving this 6×6 equation system gives the best parameters F and B for a fixed α.

The second part assumes F and B to be constant, resulting in a simpler equation for finding the optimal α:

\[
\alpha = \frac{(C - B) \cdot (F - B)}{\left\| F - B \right\|^2} \tag{3.10}
\]

This equation corresponds to projecting a point (C) onto a line (F − B) in color space. Note that for many cases the point cannot be projected onto the line, resulting in values for α outside the valid [0,1] domain. Such values are simply thresholded to fit [0,1].
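Equation 3.10, including the thresholding to [0,1], is a one-liner in practice. This sketch adds a guard for the degenerate case F = B, which the thesis does not discuss:

```python
def solve_alpha(C, F, B):
    """Optimal alpha for fixed F and B (equation 3.10): project C onto
    the line from B to F in color space and clamp to [0, 1]."""
    num = sum((c - b) * (f - b) for c, f, b in zip(C, F, B))
    den = sum((f - b) ** 2 for f, b in zip(F, B))
    if den == 0.0:       # degenerate: foreground equals background
        return 0.0
    return min(1.0, max(0.0, num / den))
```

Alternating `solve_fb` and `solve_alpha` until α stops changing implements the two-part iteration described above.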

3.6 Gradient Guided Locality Constraint

The high-resolution images commonly used in visual effects production have certain properties that in some cases lead to worse results than might be expected. Often the foreground appears to be of a rather solid color, and there is a very clear edge between the object and the background. One would assume that such a matte would be very simple to process. But a closer look reveals that what appeared to be a solid foreground color in fact contains a multitude of different colors. The result is that for some unknown pixels, the color of the closest background is an almost perfect match, far better than that of either the known foreground or the neighboring pixels. This leads to many pixels being incorrectly labeled as foreground and background, respectively. This scenario can be seen in the left images and the insets of figure 3.4.

To overcome this limitation, a gradient guided locality constraint is introduced. The gradient of an image is a measurement of how much the color intensity changes between neighboring pixels. The gradient operator is commonly used as one step in edge detection algorithms, since edges and boundaries are characterized by large differences in intensity. When examining a generated gradient of an image with these properties, the pixel-level differences that cause the artifacts in the generated matte are often too subtle to affect the gradient image. Instead, most of that region has a very low gradient value, indicating that it is a solid piece of the object. This can be seen in figure 3.4 (d).


Figure 3.4: Image (a), matte containing errors (b), insets (c) and gradient image (d)

Based on this observation, a constraint can be applied to the computed alpha value: if the alpha value of a pixel is significantly different from that of its neighbors, but the gradient value at the corresponding position is very low, it is likely that the computed alpha value is incorrect.

This approach is rather conservative; for pixels that are close to the object edge the gradient will have a substantial value, and those pixels are not treated at all, regardless of whether their alpha values differ greatly from their neighbors'. The values that are computed using the MAP estimation are kept unchanged. Only values that appear to be erroneous, and that at the same time have gradient information showing a region of little or no difference in intensity, are treated.
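The constraint can be sketched as a simple per-pixel test. The two thresholds and the choice of the neighborhood mean as replacement value are illustrative assumptions, not values from the thesis:

```python
def constrain_alpha(alpha, neighbor_alphas, grad, alpha_jump=0.5, grad_low=0.05):
    """Gradient-guided locality constraint: if the computed alpha
    deviates strongly from its neighbors while the image gradient says
    the region is flat, the value is likely wrong and is replaced by
    the neighborhood mean."""
    mean = sum(neighbor_alphas) / len(neighbor_alphas)
    if abs(alpha - mean) > alpha_jump and grad < grad_low:
        return mean          # flat region: trust the neighbors
    return alpha             # near an edge (or consistent): keep MAP value
```

Because both conditions must hold, pixels near genuine edges (high gradient) always keep their MAP-estimated alpha.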


Chapter 4

Matting of Image Sequences

A simple way to use natural image matting on image sequences would be to process each image individually. But since each frame requires a trimap, indicating known foreground and background regions, this results in a lot of manual labor when processing a sequence of more than one hundred frames. This chapter presents methods used to extend the still-image matting technique to work semi-automatically on sequences of images. First an overview of the process is given, followed by a more detailed description of the steps involved.

4.1 Overview

As input to the process the user supplies an image sequence where a trimap has been drawn on selected keyframes. These keyframe trimaps are then flowed through the video sequence to generate trimaps for frames in between keys. For each pair of frames the accuracy of the optical flow is measured to properly handle regions where the flow might be inaccurate.

The flowed trimap is then refined to a complete alpha matte using the natural image matting process as described in the previous chapter. Once completed, the output is converted back to a three-valued trimap.

Next, the image sequence is processed in reverse order using backward optical flow. When the trimaps have been generated using both forward and backward optical flow, the final output trimap value for each pixel is taken from the flowed image that had the highest accuracy measurement.

If the resulting trimap sequence still has errors, possibly due to change in topology of the foreground object, the output image can easily be edited for erroneous frames. These become new keyframes and the process can be run again. After trimaps have been computed for the entire image sequence, a high-quality alpha matte can be computed using the natural image matting technique for each frame.

4.2 Optical Flow

Two-dimensional image motion is the projection of the three-dimensional motion of objects onto the image plane. Sequences of images allow the estimation of projected two-dimensional image motion as either instantaneous image velocities or discrete image displacements. These are commonly referred to as the optical flow field or the image velocity field.

Two main methods exist for finding the motion: feature-based and gradient-based. Using the former, features are extracted from the sequence of images, matched between two neighboring frames and tracked over time. For gradient-based methods the flow is recovered based on constraints formed by local spatial-temporal changes in image intensity. One such constraint is that the intensity of a moving point does not change between timesteps, which gives rise to an equation relating the change in image intensity at a point to the motion of the intensity pattern. However, optical flow computed under the intensity constancy constraint alone is not unique, and additional constraints must be imposed on the flow. An important such additional constraint is smoothness, stating that flow at nearby places of an image will be similar unless discontinuities exist there. The smoothness constraint, however, makes the optical flow algorithm unable to deal with discontinuities, for instance when boundaries between regions move differently. In [1] a robust framework capable of dealing with discontinuities is described, as explained in some detail in section 4.2.1.

During this project two optical flow algorithms have been evaluated: those described in [1] and [14]. The following subsections briefly describe the two optical flow methods, followed by a section on how the accuracy of optical flow computations can be measured and used to improve results within the matting framework.

In all results shown in chapter 6, the spline-based optical flow method was used. In a brief test, that method produced somewhat better results than the approach of [1]. It should be noted, though, that the brief test was in no way detailed enough to conclude that one method of optical flow generally performs better than the other; the spline-based optical flow was simply used throughout all tests to make sure matting results were not affected by switching between methods of optical flow.

4.2.1 Optical Flow using a Robust Estimation Framework

As described in the previous section, constraints are often formed based on there being only a single motion present within a finite region. When transparencies, discontinuities and many other phenomena are present, this assumption is incorrect. The lack of robustness (regarding outliers) of the common least-squares method is what makes previous optical flow algorithms unable to cope with discontinuities.

In [1] a framework is described that relaxes the single motion assumption, resulting in a more robust estimate of optical flow. This framework is based on robust statistics that try to recover the structure that best fits the majority of the data, while identifying and rejecting outliers.

The robust objective function can be non-convex, so finding the globally optimal solution is non-trivial. To find local minima, a Simultaneous Over-Relaxation (SOR) technique is used. This belongs to a family of relaxation techniques that also includes Jacobi's method and Gauss-Seidel relaxation. The main concept of the technique is to use an over-relaxation parameter to overcorrect the estimate of the parameter at stage n + 1. To find the global optimum, a control parameter is used that determines when points are considered outliers. A Graduated Non-Convexity continuation method is used to track the minimum over a sequence of objective functions, with decreasing parameter values. For each parameter value, the SOR method is used to find a corresponding minimum.

4.2.2 Spline-Based Optical Flow

In [14] the authors use a completely different approach to computing optical flow than the one previously explained. Here, motion estimation is viewed as an image registration problem with a fixed optimality criterion.

To represent the displacement fields, 2D splines are used, which are in turn controlled by a smaller number of displacement estimates on a coarse spline control grid. Within this spline control grid, a set of m × m pixels is associated with each spline patch.

This spline-based description removes the need for the overlapping correlation windows that are otherwise commonly used. Using linear or higher-order splines, more complex motion can be represented than with the local translational model of correlation-based methods. Additionally, the spline-based representation has a computational advantage over many other methods: the error of each pixel contributes only to its four spline control vertices, instead of the m × m pixels in an overlapping window. To handle larger displacements, the algorithm is run in a hierarchical way, using Gaussian image pyramids. The algorithm is first run on smaller pyramid levels, and those results are then used to initialize computations on the next finer level.

4.3 Accuracy Measurement of Optical Flow

When using optical flow in a single direction, disocclusion is an important factor. This occurs when a feature suddenly appears in one frame but was not present in the previous one. The optical flow algorithm cannot find any source pixels in the previous image that correspond to this new feature, and hence the algorithm fails. But if the same pair of frames is computed using optical flow in the opposite direction, the problem instead becomes an occlusion: suddenly an object disappears. This phenomenon, however, is handled correctly by the optical flow algorithm. This observation is the reason for computing both forward and backward optical flow in the generation of trimaps. To determine which of the forward- and backward-flowed images should be used to compute the trimap, a measurement of accuracy is needed.

The first measurement is referred to as the error map, E_i^f, describing the error in flow between frames i − 1 and i. This error map is computed by taking the color-space difference between the pixel value of the image generated by optical flow, C_i^f, and the true image of frame i, C_i:

\[
E_i^f = \left\| C_i(x) - C_i^f(x) \right\| \tag{4.1}
\]

Since the trimaps are generated incrementally, errors that occur between two frames are accumulated for subsequent frames. This means that it is not enough to compute the error from optical flow for the current frame alone; the accumulated error from previous frames must also be considered. This accumulation is stored in an accumulated error map A_i^f, where the previous value of A^f is warped along the frames using the optical flow information. If frame i is a keyframe, then the value of A_{i+1}^f is simply E_{i+1}^f. But for frame i + 2, the accumulation must be considered, and A_{i+2}^f becomes the warped A_{i+1}^f + E_{i+2}^f, measuring not only the per-frame error but also errors accumulated from all frames since the last keyframe. The accumulated error map for frames 32 and 37 using forward flow can be seen in figure 4.1 (a) and (b). The sequence contained keyframes at frames 30 and 40. As seen in the images, the errors are accumulated, leading to a much larger error for frame 37 than for frame 32.

In addition to the accumulated error map, a validity map V_i(x, y) is also used. For pixels that have an accumulated error larger than a user-defined threshold, the validity bit is set to 0, indicating that there is a large possibility that the computed value of these pixels cannot be trusted. Like the error map, the validity map is also flowed, and old values are combined with per-frame values. Thus, the validity bit of a pixel in frame i is set to 1 (trusted) if E_i < threshold and the validity value flowed from the previous frame is 1; otherwise the validity is set to 0. The validity map for frames 32 and 37 can be seen in figure 4.1 (c) and (d), together with their corresponding error maps, (a) and (b).

Finally, for pixels whose validity bit is 0, the trimap value is set to unknown (0.5), indicating that whatever trimap value was flowed from the previous frame is not to be trusted and should be treated as unknown.
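One flow step of the error accumulation and validity update can be sketched as follows. All maps are dicts keyed by pixel for brevity, `warp` is a hypothetical lookup following the optical flow back to the previous frame, and the threshold is illustrative:

```python
def update_accuracy(prev_acc, flow_error, warp, threshold=0.1):
    """Per-pixel accumulated error map A^f and validity map for one
    flow step (a sketch of the scheme described above).

    prev_acc: accumulated error map for the previous frame.
    flow_error: per-frame error map E^f (equation 4.1).
    warp: maps a pixel in this frame to its source pixel in the
          previous frame, following the optical flow.
    """
    acc, valid = {}, {}
    for p, e in flow_error.items():
        # Warp the old accumulated error forward and add the new error.
        acc[p] = prev_acc.get(warp(p), 0.0) + e
        # Trust the pixel only while its accumulated error stays small.
        valid[p] = acc[p] < threshold
    return acc, valid
```

At a keyframe, `prev_acc` is reset to all zeros, so the accumulation restarts as described.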

Figure 4.1: Accumulated error map for frames 32 (a) and 37 (b) using FW flow. Validity bits corresponding to the errors, (c) and (d), with a fairly high error threshold.

Next the trimap is refined by performing the natural image matting described in chapter 3. Pixels that are incorrectly labeled as unknown, even though they are distinctly foreground or background, are identified correctly by the matting process, rendering a clean, correct alpha matte. This refined matte is then converted back to the trimap format by thresholding it to the values 0.0, 0.5 and 1.0. The trimap refinement is done on both the forward- and backward-flowed trimaps. Since the computation of the alpha matte is fairly costly, this part of the operation can optionally be done on a down-scaled version of the input image, which is scaled back after the computations are completed. For many images the resulting trimap does not contain much detail, and the down-sampled operation can then be used without visible penalty, thereby saving lots of processing time.

Once both the backward and forward trimaps have been generated, together with the corresponding accumulated accuracy maps, the information from these two processes needs to be combined in order to generate an optimal output trimap. Again, the accumulated error map comes into play: for each pixel i, the trimap value corresponding to the lesser of A_i^f and A_i^b is selected as the output trimap value. Following the suggestions in [3], an additional penalty is added to unknown pixels, based on the observation that unknown pixels are common along depth discontinuities (where the optical flow algorithm performs worse), and thus should not be trusted as much.


Chapter 5

Implementation

This chapter describes the actual implementation of the framework for matting of natural image sequences. First, the application environment is discussed, since there are special requirements related to the intended use of the program. Next, the program for computation of natural image matting is described, followed by a description of the implementation of the program for flowing trimaps between keyframes.

5.1 Application Environment

It was decided early on in the process of implementing the framework for natural image matting that it was to be done as a plugin to NUKE (see section 2.1.1). Developing an application that is to be used in a production pipeline and, in addition, integrated as a plugin to an external program requires special attention to numerous details. What operating systems need to be supported? What input/output interfaces and formats are requested by the end-user as well as by NUKE? How can as much of the work as possible be done in a parallel fashion, using the in-house render-farm¹?

Considering the computers in Digital Domain's render-farm, it was decided that Microsoft Windows, Linux and Irix were the operating systems to be supported. These operating systems are supported by NUKE, and using the NUKE SDK, little effort was needed to support a multi-platform environment.

To be able to make as much of the process as possible run on the render-farm, it was decided early on that the program was to be split into two parts: one plugin for generating trimaps for all frames and one plugin that generates a complete matte given an image with a trimap. This decision and its benefits are explained further in section 5.3.

RACE is Digital Domain's render-farm resource management software. When a user wants something to be run on the render-farm, a job is submitted. The job commonly consists of one or many operations, performed on a per-frame basis. RACE is first and foremost meant for each frame to be processed individually, by one computer. This can be a limiting factor when the input of one frame depends on the output of previous frames, such as in the trimap generation algorithm. Although this limitation can be partially overcome by writing custom scripts, this was one of the main reasons to separate trimap-flowing and matting into two separate components, as explained in section 5.3.

¹Network of computers available as a resource for rendering and other computational tasks. Often consists of both computers used exclusively as render machines and ordinary desktop computers that are available once idle.

5.2 Natural Image Matting

The implementation of the matting algorithm resulted in the per-image algorithm shown below.

Algorithm 5 Image Matting Algorithm

 1: Convert input image to CIE-Lab color space
 2: Create level set distance map
 3: for all pixels marked unknown in trimap do
 4:     Acquire foreground and background samples
 5:     Compute clusters of color samples
 6:     for each pair of foreground and background clusters do
 7:         repeat
 8:             Find best foreground and background color given α
 9:             Find best α given foreground and background color
10:         until α converges
11:         Compute likelihood of converged parameters for cluster pair
12:         if current likelihood > previous best likelihood then
13:             Set best likelihood to current likelihood
14:         end if
15:     end for
16:     if gradient-based locality constraint enabled then
17:         Check computed α against α of neighborhood and gradient information
18:     end if
19: end for
20: Convert output image to RGB color space
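Steps 8–9 have closed-form solutions in the Bayesian matting framework: with α held fixed, the optimal foreground and background colors solve a 6 × 6 linear system built from the cluster means, inverse covariances and the camera variance σC; with F and B held fixed, the optimal α is the projection of the observed color onto the line between B and F. A minimal single-cluster-pair sketch in NumPy follows; the function and parameter names are illustrative and not taken from the plugin.

```python
import numpy as np


def solve_fb(C, alpha, mu_F, inv_Sigma_F, mu_B, inv_Sigma_B, sigma_C):
    """With alpha fixed, the MAP foreground/background colors solve a
    6x6 linear system combining the cluster priors with the compositing
    constraint C = alpha*F + (1 - alpha)*B under camera variance sigma_C."""
    I = np.eye(3)
    s2 = sigma_C ** 2
    A = np.block([
        [inv_Sigma_F + I * alpha ** 2 / s2, I * alpha * (1 - alpha) / s2],
        [I * alpha * (1 - alpha) / s2, inv_Sigma_B + I * (1 - alpha) ** 2 / s2],
    ])
    b = np.concatenate([
        inv_Sigma_F @ mu_F + C * alpha / s2,
        inv_Sigma_B @ mu_B + C * (1 - alpha) / s2,
    ])
    x = np.linalg.solve(A, b)
    return x[:3], x[3:]


def solve_alpha(C, F, B):
    """With F and B fixed, the optimal alpha is the projection of the
    observed color C onto the line segment from B to F."""
    d = F - B
    denom = float(d @ d)
    if denom < 1e-12:
        return 0.0
    return float(np.clip((C - B) @ d / denom, 0.0, 1.0))


def bayesian_matte_pixel(C, mu_F, Sigma_F, mu_B, Sigma_B,
                         sigma_C=0.05, alpha0=0.5, tol=1e-5, max_iter=50):
    """Alternate the two closed-form updates until alpha converges
    (steps 7-10 of Algorithm 5, for a single cluster pair)."""
    inv_F, inv_B = np.linalg.inv(Sigma_F), np.linalg.inv(Sigma_B)
    alpha = alpha0
    for _ in range(max_iter):
        F, B = solve_fb(C, alpha, mu_F, inv_F, mu_B, inv_B, sigma_C)
        new_alpha = solve_alpha(C, F, B)
        if abs(new_alpha - alpha) < tol:
            alpha = new_alpha
            break
        alpha = new_alpha
    return F, B, alpha
```

For a pixel whose color is a 30/70 mix of tight red and blue clusters, the iteration converges to α ≈ 0.3 in a few steps; with broader cluster covariances or heavier camera noise, convergence is slower and the estimate less certain.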

5.2.1 User Controls

When choosing which parameters of the program should be controllable by the end-user, it was decided that more controls are always preferable, as long as they default to reasonable values. Since the Bayesian matting approach is based on statistics, the results vary with the properties of the input image. Parameters that produce good results for one image can produce considerably worse results for another, depending on properties such as patterns near the edge between foreground and background, resolution, and the colors of the known background and foreground regions. It was therefore decided that virtually all parameters that affect the outcome of the process should be accessible to the user. Table 5.1 lists the user-controllable settings for the natural image matting plugin with a brief description of each.


Parameter              Description

variance               Camera variance σC (noise).

max number of samples  Keep increasing the sample window until this many
                       samples have been gathered.

max known samples      When sampling from the known region, gather at most
                       this many samples. Separate settings for known and
                       total sample counts let the user control the ratio
                       between local (estimated) samples and 'known'
                       samples.

max distance           Maximum allowed distance from each point to its
                       cluster center. If the current distance is larger,
                       the number of clusters is increased.

max num clusters       Maximum allowed number of clusters.

only cluster points    When estimating the best foreground/background
                       color, limit output values to points actually in
                       clusters.

noise threshold        If α < this value, set α to 0 to reduce noise.

use Lab                Perform processing in CIE-Lab color space.

check gradient         Enforce a smoothness constraint on computed α
                       values by comparing them to neighboring values and
                       gradient information.

Table 5.1: Natural image matting settings
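For instance, the 'noise threshold' setting amounts to a simple clamp on the computed α channel. A sketch follows; the default value used here is an assumption, not the plugin's actual default.

```python
import numpy as np


def apply_noise_threshold(alpha, threshold=0.02):
    """Clamp small alpha values to 0, as described for the
    'noise threshold' setting in Table 5.1."""
    out = np.asarray(alpha, dtype=float).copy()
    out[out < threshold] = 0.0
    return out
```

Raising the threshold suppresses more background speckle in the matte at the cost of eroding genuinely semi-transparent edge pixels, which is why it is exposed to the user rather than hard-coded.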

5.3 Matting of Image Sequences

Since performing the matting computation on images of resolution 2048 x 1556 and above is a rather time-consuming process, it was important to run as much of the process as possible on the render-farm. Another request was to be as flexible as possible with respect to future development: for example, if a new and much-improved optical flow algorithm were implemented, it should be usable without changing the matting plugins. These objectives of parallelization and flexibility were the driving factors when the video matting plugins were designed.

Given the mentioned wish for flexibility, it was decided that the optical flow computations were to be kept outside of the trimap plugin. Using Digital Domain's software for optical flow computations, this also meant that this first step of the process (see figure 5.2) could be done in parallel. Thus, the trimap plugin simply takes an image sequence with trimaps at keyframes, and two optical flow image sequences, as inputs, as seen in the trimap flowchart in figure 5.1.
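The 'flow trimap' stages in figure 5.1 amount to pushing each labeled trimap pixel along its optical flow vector into the next frame; pixels that receive no label fall back to unknown, to be resolved by the matting step. A rough sketch follows; the label values and the flow array layout are assumptions for illustration, not Digital Domain's implementation.

```python
import numpy as np

# Hypothetical trimap label values; the plugin's actual encoding may differ.
BACKGROUND, UNKNOWN, FOREGROUND = 0.0, 0.5, 1.0


def flow_trimap(trimap, flow):
    """Warp a trimap toward the next frame using a per-pixel flow field.

    flow[y, x] holds the (dy, dx) motion of pixel (y, x). Destination
    pixels that receive no label stay unknown, so the subsequent
    matting step resolves them."""
    h, w = trimap.shape
    out = np.full((h, w), UNKNOWN)
    for y in range(h):
        for x in range(w):
            dy, dx = flow[y, x]
            ny, nx = int(round(y + dy)), int(round(x + dx))
            if 0 <= ny < h and 0 <= nx < w:
                out[ny, nx] = trimap[y, x]
    return out
```

Running the same warp with the backward flow from the next keyframe yields the BW trimap of figure 5.1; merging the two lets conflicting labels be demoted to unknown.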

(38)

Figure 5.1: Flowchart of trimap plugin. [Diagram: the input image sequence, trimap sequence, and forward (FW) and backward (BW) optical flow sequences enter the plugin; each flow direction produces a flowed trimap, which is refined (matted) and reformed into a FW/BW trimap; the two are then merged into the output trimap.]

Looking at the complete image matting flowchart in figure 5.2, one can identify the parts of the process that can run in parallel, listed in table 5.2.

[Figure 5.2: Flowchart of the complete process. The input image sequence feeds optical flow computation (FW and BW flow) and, if gradient checking is used, gradient computation; these feed trimap generation, whose trimap sequence feeds image matting, producing the matte sequence.]
