
Master of Science Thesis in Electrical Engineering

Department of Electrical Engineering, Linköping University, 2017

Real-time image segmentation for augmented reality by combining multi-channel thresholds.

Alexander Poole


Real-time image segmentation for augmented reality by combining multi-channel thresholds.

Alexander Poole
LiTH-ISY-EX--17/5083--SE

Supervisor: Karl Holmquist

isy, Linköpings universitet

Magnus Hammerin

Company

Examiner: Per-Erik Forssén

isy, Linköpings universitet

Computer Vision

Department of Electrical Engineering Linköping University

SE-581 83 Linköping, Sweden Copyright © 2017 Alexander Poole


Abstract

Extracting foreground objects from an image is a hot research topic. Doing this for high-quality real-world images in real time on limited hardware such as a smart phone is a demanding task. This master thesis shows how this problem can be addressed using Otsu's method together with Gaussian probability distributions to create classifiers in different colour channels. We also show how classifiers can be combined, resulting in higher accuracy than using only the individual classifiers. We also propose using inter-class variance together with image variance to estimate classifier quality.

A data set was produced to evaluate performance. The data set features real-world images captured by a smart phone and objects of varying complexity against plain backgrounds that can be found in a typical office or urban space.


Contents

1 Introduction
  1.1 Aim
  1.2 Contributions
  1.3 Problem definition

2 Related Work
  2.1 Matting
  2.2 Semantic segmentation
  2.3 Unsupervised segmentation

3 Theory
  3.1 Colour Spaces
    3.1.1 Principal component analysis (PCA)
  3.2 Thresholding
    3.2.1 Otsu's method
    3.2.2 Circular Otsu's method
  3.3 Probability distributions
    3.3.1 Gauss-likelihood
    3.3.2 Circular Gauss-likelihood
  3.4 Evaluation

4 Method
  4.1 Foreground-background polarity estimate
  4.2 Foreground-background segmentation quality
  4.3 Combining classifiers
  4.4 Multi-channel thresholding
    4.4.1 Parameter tuning
  4.5 Data set

5 Results
  5.1 Foreground-background polarity estimate
  5.2 Failure rate
  5.3 Foreground-background segmentation quality
  5.4 Combining classifiers
    5.4.1 Algorithm examples
    5.4.2 Equal prior
  5.5 Real-time performance
    5.5.1 Complexity

6 Conclusion and future work
  6.1 Conclusion
  6.2 Future work

A Evaluation Results
  A.1 Colour channels parameter tuning
  A.2 Combining classifier result

B Data set


1 Introduction

Remote guidance is a sub-field of augmented reality where remote help is provided as a support solution. This can be done with the help of augmented reality glasses such as Microsoft HoloLens [3]. Augmented reality allows the wearer to see the real world mixed with another reality, which could be either virtual or real. When it is real and taken from a spatially separated location it is referred to as a geographical reality. A crucial aspect of remote guidance is showing objects, tools, hands and more to the person being guided in such a way that they intuitively realise how to solve the problem. This can be done either by virtual visualisations or by mixing two geographical realities.

Image masks are used to mix geographical realities. The simplest image masks are binary; they are obtained by dividing images into two parts, foreground and background. The image mask contains the information on which pixels belong to which class. One of the simplest ways of performing foreground-background segmentation is thresholding, but it struggles with complex backgrounds. To get around this problem the background colours have to be lighter or darker than the foreground object.

To facilitate the use of remote guidance, no special equipment should be required to perform foreground-background segmentation, especially when smart phones are readily available. Ideally a segmentation algorithm should run in real-time on a smart phone and be able to handle complex foreground objects.

Interpreting the foreground-background segmentation problem as a classification problem, the algorithm can be called a classifier. A classifier determines which class an image pixel belongs to. A multi-classification system is a system that combines different classifiers to improve the final classification.

A complex image is an image where lighting, objects and textures can vary greatly and knowledge about them is often not available. A complex object can vary greatly in its contours and colours, e.g. a bush. A simple background does not vary greatly in texture or colour, and if it does vary it does so in a predictable way, e.g. a red wall.

1.1 Aim

This thesis aims to show how thresholding in different colour channels can be used to create Gaussian probability distributions that can be combined to create a fast and robust real-time foreground-background segmentation algorithm. Doing this requires different colour channels to be evaluated in order to find the colour space(s) most suited for combining. Different ways of combining classifiers will also be examined.

1.2 Contributions

• We discuss how inter- and intra-class variance [15] and image variance can be used to determine the quality of a foreground-background segmentation, as seen in section 4.2. In section 4.3 we also show how a quality estimate can be used in tandem with standard methods for combining different classifiers [14].

• In section 3.3 Otsu's method is used to create a Gaussian probability distribution in a general linear or circular colour space.

• How to combine multiple classifiers is presented in section 4.4. The results presented in section 5.4 demonstrate the performance of combining different colour channels.

• Real-time performance is crucial to augmented reality applications. An example Android application is developed to show that the algorithm proposed in section 4.4 can run in real-time on a smart phone, as seen in section 5.5.

1.3 Problem definition

Foreground-background segmentation should ideally be performed on complex objects against complex backgrounds. However, this problem is hard to solve. Instead, the problem of segmenting complex foreground objects from simple backgrounds will be examined. To facilitate the algorithm's use it should be user friendly. User friendly here implies that the algorithm should be able to give feedback if it cannot segment an image. It is also desired that the algorithm is robust enough to function in natural environments with shifting illumination.

Several constraints are set in place to help define what situations the algorithm needs to handle, as can be seen below.


Complex foreground

• The foreground colours should all be different from the background colour.
• The foreground can consist of several different objects.

Simple background

• The background should have one colour.

• The background can have texture; examples of textured backgrounds can be seen in figures B.1-B.5.

• The background cannot have so much texture that it cannot be smoothed without destroying the foreground objects.

Additional specifications

• Varying degrees of illumination are allowed.
• Varying degrees of shadows are allowed.

• The algorithm must have promising real-time performance on a mobile device.
• Pre-calibration against the background is allowed.


2 Related Work

There are different ways of dividing an image into different classes. Here we take a look at semantic segmentation, unsupervised segmentation and matting. All of these methods work very well for dividing images into different classes. They do however have different strengths and weaknesses that are brought up throughout this chapter. In the end, a parametric and a non-parametric unsupervised segmentation method were proposed, developed and tested.

2.1 Matting

Matting is used to separate foreground objects from background in images and videos. Once matting is performed an alpha mask is created. Pixels with value 1 are considered definitely foreground and pixels with value 0 are considered definitely background. Pixels can also be somewhere in between, resulting in a gradient between foreground and background.

Alpha matting was first introduced for 2D images in 1984, when equation 2.1 was presented by Porter and Duff [22]. There also exist variants that take multiple foreground objects into consideration.

I = \alpha F + (1 - \alpha)B \quad (2.1)

where:

• I represents the image
• F represents the foreground image
• B represents the background image
• α is the alpha matte


Image and video matting is an under-defined problem [27]. Most matting algorithms require a trimap to be able to separate foreground from background [27]. A trimap is a map telling the matting algorithm which pixels are foreground, background or a mix of both. The quality of the trimap is crucial to the success of the algorithm. The trimap can be generated either automatically or with the help of user input. When matting is performed offline on videos, trimaps for certain key frames can be created and then propagated with the help of optical flow to get trimaps for all frames [27].

There are some matting techniques that do not require trimaps; one such technique is spectral matting [12]. Even though spectral matting can be used without user input, the result is not always the best. Results can be improved by different grouping algorithms [2]. Even so, it is not possible to run spectral matting in real-time.

One approach that is faster than spectral matting is shared sampling (shared RT) [11], which achieves computational times fast enough for real-time use when implemented on GPUs. This does not account for the time to create a trimap. The time to run the algorithm can be decreased further by the use of preprocessing [29].

Even though shared sampling is fast and the preprocessing step eliminates the need for high quality trimaps, they are still required. One approach to automatically generate trimaps is to use edge detection and morphological operations to generate an initial trimap [24].

2.2 Semantic segmentation

Semantic segmentation clusters parts of the image that belong to the same class together. Foreground-background segmentation can be seen as semantic segmentation with two classes. There is no denying that if the goal is to create a general purpose semantic image classification algorithm, then neural networks are the way to go. COCO [18] is a competition featuring a large data set of complex images with many different classes. Researchers can submit their algorithms and compete. In 2016 the winner of the COCO segmentation challenge was Li et al. [16], who won with a convolutional neural network.

Some older methods for semantic segmentation are Random Decision Forests, Markov Random Fields and Support Vector Machines. These methods are often combined with various preprocessing and postprocessing methods [25].

There are two major problems with the methods mentioned above: they require training data and they are computationally complex. Running them in real-time requires powerful hardware.

Even though neural networks require large amounts of training data, they are becoming more effective. ENet is a deep neural network architecture that is more efficient than traditional architectures [21]. The problem is that in reducing computational complexity, accuracy is also reduced, and it is still not fast enough for real-time applications on a smart phone GPU.

There are surveys covering different types of data sets [10]. Even though they list many good image segmentation data sets, these often feature complex backgrounds, something that is not necessarily the normal use case for remote guidance.

2.3 Unsupervised segmentation

There are also unsupervised methods for segmentation. These methods are not semantic since they do not have any information about the different classes. They can still be used to divide the image into different classes, but they cannot say what those classes are. Non-semantic segmentation algorithms work by finding regions or region boundaries in an image. This can be done in several ways; some of the most common are Clustering, Graph Based, Random Walk, Watershed and Contour Models [25]. All of these methods work by looking at regions in images and finding patterns. This often requires the algorithms to iterate over the image, resulting in higher computational complexity. One of the simplest methods for finding patterns in data is K-means [13]. K-means will later on be used as a benchmark to see what type of performance could be expected if choosing a clustering approach. Another method of clustering, which builds on K-means, is SLIC superpixels [7].

A problem with unsupervised segmentation methods is how to know which part of the image should be classified as foreground and which as background. One way of finding foreground objects in the image is with transition regions [17, 30]. Transition regions, even though good at finding contours around interesting parts of an image, often require morphological operations or other methods to create an image mask [20, 23]. Even though transition regions can be found quickly, using them requires other, often much more computationally complex methods.

Left are statistical approaches, as mentioned by Buxton et al. [8]. These can either be parametric, such as Gaussian mixture models, or non-parametric, such as maximising a statistical criterion. Buxton mentions that the classical Fisher discriminant [9], which can be used to derive Otsu's method [1], belongs to the latter. Buxton also suggested that once Otsu's method has been used to segment individual colour channels, an improved result can be gained by combining the channels. Buxton noted that there are 162 possible ways of combining the eight possible outcomes of performing Otsu's method on three colour channels. Buxton then performs an extensive search over the possible combinations to choose the one that maximises the Fisher discriminant. This is time consuming, taking 57 seconds to segment a single image on a Pentium M 1.8 GHz processor.


3 Theory

This chapter describes the theory behind different colour spaces, thresholding methods, probability distributions and evaluation methods used later on in the thesis.

3.1 Colour Spaces

There exist several different ways of representing colours in an image. The most common one for computer displays is RGB, which is also the format that most cameras produce. RGB can be mapped to a cube where the different axes represent different colours, see figure 3.1. For a lot of applications RGB images are mapped into the grayscale colour space since three dimensions are harder to handle than one. The main problem with RGB is that there is a strong correlation between the axes, making it hard to perform foreground-background segmentation with thresholding, as can be seen in figure 3.5.

Other colour spaces such as HSV [19] are cylindrical, as seen in figure 3.2. Often saturation is replaced with chroma, resulting in a cone-shaped colour space as seen in figure 3.3. In the HSV colour space, similar colours tend to distribute around similar hue values [28]. Hue, saturation and value are the three axes that make up the HSV cylinder. They are calculated according to equations 3.1-3.3. In the HSV colour space the image tends to be distributed along the axes, as can be seen in figure 3.6.

V = \max(R, G, B) \quad (3.1)

S = \begin{cases} \frac{V - \min(R, G, B)}{V} & \text{if } V \neq 0, \\ 0 & \text{else,} \end{cases} \quad (3.2)

H = \begin{cases} \frac{60(G - B)}{V - \min(R, G, B)} & \text{if } V = R, \\ 120 + \frac{60(B - R)}{V - \min(R, G, B)} & \text{if } V = G, \\ 240 + \frac{60(R - G)}{V - \min(R, G, B)} & \text{if } V = B, \end{cases} \quad (3.3)

The hue colour channel is not well defined when V − min(R, G, B) is small. By looking at equations 3.1 and 3.2 it can be noted that this happens when value or saturation is small. This can also be seen by looking at figure 3.3.
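As an illustration of equations 3.1-3.3, the following minimal sketch (not code from the thesis; the thesis implementation runs on Android with OpenCV) converts a single RGB pixel to HSV. Hue is simply set to 0 here when V − min(R, G, B) = 0, where it is undefined.

```python
def rgb_to_hsv_pixel(r, g, b):
    """Convert one RGB pixel with components in [0, 1] to HSV (equations 3.1-3.3)."""
    v = max(r, g, b)                        # equation 3.1
    mn = min(r, g, b)
    s = 0.0 if v == 0 else (v - mn) / v     # equation 3.2
    if v == mn:                             # hue is undefined when V - min(R,G,B) = 0
        h = 0.0
    elif v == r:                            # equation 3.3, case V = R
        h = 60.0 * (g - b) / (v - mn)
    elif v == g:                            # equation 3.3, case V = G
        h = 120.0 + 60.0 * (b - r) / (v - mn)
    else:                                   # equation 3.3, case V = B
        h = 240.0 + 60.0 * (r - g) / (v - mn)
    return h % 360.0, s, v

print(rgb_to_hsv_pixel(1.0, 0.5, 0.0))      # an orange pixel -> (30.0, 1.0, 1.0)
```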

Figure 3.1: The RGB colour space mapped to a cube.

Source: By SharkD - Own work, GFDL, https://commons.wikimedia.org/w/index.php?curid=3375025


Figure 3.2: The HSV colour space mapped to a cylinder.

Source: By SharkD - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=8421538

Figure 3.3: The HSV colour space mapped to a cone.

Source: By SharkD - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=8421620


Figure 3.4: Image used to showcase different colour spaces and thresholding algorithms throughout this chapter.



Figure 3.5: Point cloud of the pixels in figure 3.4, plotted against the RGB axes.


Figure 3.6: Point cloud of the pixels in figure 3.4, plotted against the HSV axes.


3.1.1 Principal component analysis (PCA)

PCA on a data set, X, will result in an orthogonal linear transformation, T, that transforms the data set to a new coordinate system such that the first principal component is aligned with the direction of the greatest variance in the data set. The second principal component is aligned with the second greatest variance in the data set, orthogonal to the already created axis, and so on. In this thesis the pixel values of individual images have made up the data set X.

PCA can be used for several things such as creating a new coordinate system, dimensionality reduction, finding features in the data and much more. Singular value decomposition can be used to create T [26].

PCA can either be used to create a coordinate system where a multi-classification system could be more effective or it could be used to reduce the dimensionality of an image so that Otsu’s method can be used to greater effect.

When PCA is performed in this thesis, T is calculated on a per-image basis. If it were instead used on a video stream, T could be computed once for the whole sequence or updated when required.
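A minimal sketch (not the thesis implementation) of how such a per-image transformation T can be obtained with singular value decomposition: the pixel values are stacked into a data matrix, centred, and projected onto the right singular vectors, giving the PCA1-PCA3 channels used later in the thesis.

```python
import numpy as np

def pca_channels(image):
    """Project an (H, W, 3) image onto its principal components.

    Channel 0 of the result is PCA1 (direction of greatest variance),
    followed by PCA2 and PCA3.
    """
    h, w, c = image.shape
    X = image.reshape(-1, c).astype(np.float64)   # data set X: one row per pixel
    X -= X.mean(axis=0)                           # centre the data
    # SVD of the centred data: rows of Vt are the principal directions,
    # ordered by decreasing singular value, i.e. decreasing variance.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    T = Vt.T                                      # the orthogonal transformation T
    return (X @ T).reshape(h, w, c)

# Example on a random image; for a video stream T could be reused between frames.
print(pca_channels(np.random.rand(4, 4, 3)).shape)
```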

3.2 Thresholding

Binary thresholding is a technique used for segmenting images. A binary segmentation mask is created for each pixel in an image, I(i). The simplest case is to set the mask value to 0 or 1 depending on whether a certain pixel is larger or smaller than a threshold, T. Other common cases are described in OpenCV's documentation [5]. Thresholding has one major drawback: pixels belonging to different classes must have distinctly different pixel intensities in order for them to be correctly classified.

3.2.1 Otsu's method

One of the more successful techniques for choosing an automatic threshold is Otsu's method [1]. Otsu's method maximises or minimises different variance measurements depending on how it is implemented. Three different image variance measurements are often mentioned when discussing how to implement Otsu's method: the total image variance σ², the intra-class variance σ_w² and the inter-class variance σ_b². σ² is calculated according to equation 3.4.

\sigma^2 = \frac{1}{K}\sum_{i=0}^{N-1}(i - \mu)^2 h(i) \quad (3.4)

where µ is the mean pixel value calculated according to equation 3.5, N is the number of pixel values, K is the number of image pixels and h is the image histogram.

\mu = \frac{1}{K}\sum_{i=0}^{N-1} i\,h(i) \quad (3.5)


In order to calculate σ_w² and σ_b² we split the image histogram into two halves, h_1 and h_2, where h_1 contains all pixels larger than T and h_2 all pixels smaller than T. Variance and mean can easily be calculated for these two halves, which allows us to calculate σ_w² and σ_b² according to equations 3.6-3.7.

\sigma_w^2 = w_1\sigma_1^2 + w_2\sigma_2^2 \quad (3.6)

\sigma_b^2 = w_1 w_2 (\mu_1 - \mu_2)^2 \quad (3.7)

where w_n ∈ [0, 1] is a weight showing what fraction of all pixels belongs to class n. Otsu suggests choosing the threshold that maximises any of the three criteria seen in equation 3.8.

\frac{\sigma_b^2}{\sigma_w^2}, \quad \frac{\sigma^2}{\sigma_w^2}, \quad \frac{\sigma_b^2}{\sigma^2} \quad (3.8)

Since σ² is independent of the threshold, σ_b² or 1/σ_w² can be maximised instead. Otsu also notes that the different variances are related according to equation 3.9.

\sigma^2 = \sigma_w^2 + \sigma_b^2 \quad (3.9)

Maximising the inter-class variance is less computationally complex than minimising the intra-class variance, since only the class means are required. In the end, minimising the intra-class variance σ_w² as seen in equation 3.10 is the same as maximising the inter-class variance σ_b² as seen in equation 3.11.

T = \operatorname*{argmin}_{i \in [0, N-1]} \sigma_w^2(i) \quad (3.10)

T = \operatorname*{argmax}_{i \in [0, N-1]} \sigma_b^2(i) \quad (3.11)
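A minimal sketch of Otsu's method as described above (our own NumPy example, not the thesis code): the threshold is chosen by maximising the inter-class variance of equation 3.7 over all candidate thresholds, which by equation 3.9 is equivalent to minimising the intra-class variance.

```python
import numpy as np

def otsu_threshold(channel, bins=256):
    """Return the threshold maximising inter-class variance (equation 3.11)."""
    h, _ = np.histogram(channel, bins=bins, range=(0, bins))
    h = h.astype(np.float64) / h.sum()          # normalised histogram
    i = np.arange(bins)
    best_t, best_sigma_b = 0, -1.0
    for t in range(1, bins):
        w1, w2 = h[:t].sum(), h[t:].sum()       # class weights
        if w1 == 0 or w2 == 0:
            continue
        mu1 = (i[:t] * h[:t]).sum() / w1        # class means
        mu2 = (i[t:] * h[t:]).sum() / w2
        sigma_b = w1 * w2 * (mu1 - mu2) ** 2    # inter-class variance, equation 3.7
        if sigma_b > best_sigma_b:
            best_sigma_b, best_t = sigma_b, t
    return best_t

# Two-mode test image: the threshold lands between the dark and bright populations.
img = np.clip(np.concatenate([np.random.normal(60, 10, 5000),
                              np.random.normal(190, 10, 5000)]), 0, 255)
print(otsu_threshold(img))
```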

3.2.2 Circular Otsu's method

Otsu's method does not work on circular colour channels such as hue. This can be addressed by using a circular version of Otsu's method presented by Lei et al. [15]. Lei presents two methods, one where circular statistics is used and one where the histogram is rotated. Lei notes that the two methods give similar results, although rotating the histogram is less computationally complex.

Lei also notes that maximising σ_b² and 1/σ_w² does not give the same threshold, since σ² changes when rotating the histogram. Lei states that using intra-class variance gives a better threshold than using inter-class variance. Once the histogram has been rotated, linear statistics still hold and equation 3.9 is still valid.

Lei also states that the optimal threshold always divides the image histogram into two equal parts if the number of bins is even, resulting in O(N) complexity. The following quote is taken from Lei, where he explains the relation between the two thresholds i and j that he uses to divide the histogram into two equal parts.


Theorem 2.1. For a circular histogram of length N = 2n, applying the two-class Otsu criterion (σ_w²) using linear statistics extended to circular histograms, assume the best answer is i..j−1, j..i−1. Then, there exists a best solution such that |j − i| = n.

This means that we get two thresholds, T_1 and T_2, where T_2 = T_1 + N/2 and T_1 ∈ [0, N/2 − 1]. The fact that we have a fixed number of bins between the two thresholds means that the only difference between Otsu's method and the circular Otsu's method is that the histogram is rotated before calculating the inter-class variance.

To calculate the total image variance of a circular colour channel, circular statistics is better suited than linear statistics since the total variance does not change when rotating the image. To calculate the variance σ² with circular statistics, image pixels I(i) are converted to angles θ_I = 2πI/N, where N is the number of histogram bins. The total image variance can then be calculated as σ² = 1 − R [15], where R is calculated according to equation 3.12.

R = \sqrt{S^2 + C^2} \quad (3.12)

where C and S can be calculated according to equations 3.13-3.14.

C = \frac{1}{n}\sum_{x=1}^{n} \cos(\theta_{I(x)}) \quad (3.13)

S = \frac{1}{n}\sum_{x=1}^{n} \sin(\theta_{I(x)}) \quad (3.14)

where n is the number of pixels in the image, x is the pixel number and I(x) is the pixel value of pixel x.
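A minimal sketch (ours, assuming an 8-bit circular channel with N = 256 histogram bins) of the circular total variance σ² = 1 − R from equations 3.12-3.14.

```python
import numpy as np

def circular_variance(channel, n_bins=256):
    """Total variance of a circular colour channel, sigma^2 = 1 - R (equations 3.12-3.14)."""
    theta = 2.0 * np.pi * channel.astype(np.float64) / n_bins   # pixel values to angles
    C = np.cos(theta).mean()          # equation 3.13
    S = np.sin(theta).mean()          # equation 3.14
    R = np.sqrt(S ** 2 + C ** 2)      # equation 3.12
    return 1.0 - R

print(circular_variance(np.random.randint(0, 256, (480, 640))))  # uniform hue: close to 1
print(circular_variance(np.full((480, 640), 42)))                # constant hue: exactly 0
```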

3.3 Probability distributions

A thresholding algorithm can be seen as a probability function P(w_i|I) = p_{w_i}, where I is the pixel value and p_{w_i} is the probability that I belongs to class w_i. Since there are only two classes this can be seen as a Bernoulli distribution, resulting in the class probabilities being related as P(w_2|I) = 1 − p_{w_1} = p_{w_2}, where w_1 and w_2 are the different classes. Using thresholding to calculate p_{w_1} can be described by the Heaviside function, H, shifted by T, as seen in equation 3.15.

p_{w_1} = H[I - T] = \begin{cases} 1 & \text{if } I - T > 0, \\ 0 & \text{else,} \end{cases} \quad (3.15)

When looking at the hue colour channel, a boxcar function is better suited to represent its probability, as seen in equation 3.16. N = 256 since we assume an 8-bit image.


p_{w_i} = H[I - T] - H[I - T + N/2] \quad (3.16)

Instead of using the Heaviside and boxcar functions as probability functions, other functions such as Gaussian distributions can be used, as will be done in sections 3.3.1-3.3.2. In figures 3.7-3.9 the histograms for the image shown in figure 3.4 have been created in the HSV, RGB and PCA colour spaces. The histograms have been overlaid with the probability that different pixel values belong to class 1.

Figure 3.7: The blue bars show the histograms for the red, green and blue colour channels. The red curve shows the probability that a pixel belongs to class 1 according to the Heaviside function as seen in equation 3.15. The yellow curve shows the probability that a pixel belongs to class 1 according to the Gauss-likelihood function as seen in equation 3.17.

3.3.1 Gauss-likelihood

One way of getting more information from Otsu's method is to create a Gauss-likelihood function instead of a binary mask. This allows us to calculate the probability of a pixel belonging to a certain class, as can be seen in equation 3.17. One drawback of using a Gaussian probability distribution can be seen in figure 3.8: looking at the saturation histogram, even though there are not many other pixels classified as class i, the probability that the pixels with high pixel values belong to class i is only slightly above 50%. As mentioned above, the two classes make up a Bernoulli distribution related according to P(w_j|I) = 1 − p_{w_i} = p_{w_j}, making it so that the Gauss-likelihood only needs to be calculated for one of the classes.

P(w_i \mid \mu_i, \sigma_T, I) = e^{-\frac{(\mu_i - I)^2}{2\sigma_T^2}} \quad (3.17)

Here µ_i is the mean pixel value of class i and I is the pixel value.


Figure 3.8: The blue bars show the histograms for the hue, saturation and value colour channels. The red curve shows the probability that a pixel belongs to class 1 according to the Heaviside function or boxcar function as seen in equations 3.15-3.16. The yellow curve shows the probability that a pixel belongs to class 1 according to the Gauss-likelihood function or circular Gauss-likelihood function as seen in equations 3.17-3.22.

Figure 3.9: The blue bars show the histograms for the PCA1, PCA2 and PCA3 colour channels. The red curve shows the probability that a pixel belongs to class 1 according to the Heaviside function as seen in equation 3.15. The yellow curve shows the probability that a pixel belongs to class 1 according to the Gauss-likelihood function as seen in equation 3.17.


σ_T can be chosen such that at the decision boundary T there is a 50% chance of I belonging to class i. T is seen as the maximum distance from µ where a pixel I should be classified as foreground. If we want a 50% chance of a pixel being classified as foreground at the decision boundary T, then equation 3.17 can be written as equation 3.18.

2\sigma_T^2 \ln(0.5) = -(\mu - I)^2 \quad (3.18)

From here σ_T can be calculated by following the steps in equations 3.19-3.20.

2\sigma_T^2 \ln(2) = (\mu - I)^2 \quad (3.19)

\sigma_T \sqrt{2\ln(2)} = \pm(\mu - I) = \pm T \quad (3.20)

Since σ_T > 0 the negative solution for T can be ignored. Finally σ_T can be calculated according to equation 3.21.

\sigma_T = \frac{T}{\sqrt{2\ln(2)}} \quad (3.21)
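A minimal sketch (ours, not the thesis code) of equations 3.17 and 3.21: the decision-boundary distance T is turned into σ_T so that a pixel exactly at distance T from the class mean µ gets probability 0.5.

```python
import numpy as np

def sigma_from_threshold(T):
    """Equation 3.21: sigma_T such that the probability is 0.5 at distance T from the mean."""
    return T / np.sqrt(2.0 * np.log(2.0))

def gauss_likelihood(pixels, mu, T):
    """Equation 3.17: probability that each pixel belongs to the class with mean mu."""
    sigma_t = sigma_from_threshold(T)
    return np.exp(-((mu - pixels) ** 2) / (2.0 * sigma_t ** 2))

pixels = np.array([100.0, 130.0, 160.0])
print(gauss_likelihood(pixels, mu=100.0, T=30.0))   # approximately [1.0, 0.5, 0.0625]
```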

3.3.2 Circular Gauss-likelihood

The Gauss-likelihood probability of a pixel can be modified to work in a circular colour channel, as seen in equation 3.22. N = 256 since a colour channel is assumed to be 8 bit. Worth noting is that using a Gaussian probability function works well in a circular colour channel, as can be seen in figure 3.8.

P(w_i \mid \mu, \sigma_T, I) = e^{-\frac{f(I, \mu)^2}{2\sigma_T^2}} \quad (3.22)

f(I, µ) is the distance function presented in equation 3.23. If the distance to the mean is greater than N/2, then a shorter distance to the mean can be found at the negative pixel value I − N.

f(I, \mu) = ((I + N/2 - \mu) \bmod N) - N/2 \quad (3.23)

By the same reasoning as in equations 3.18-3.21 we can say that when f(I, µ) = T then P = 0.5, resulting in equation 3.21 being applicable without modifications even in the circular case.
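The circular distance of equation 3.23 and the circular Gauss-likelihood of equation 3.22 can be sketched as follows (our own minimal example, assuming 8-bit channels so that N = 256).

```python
import numpy as np

N = 256  # number of values in an 8-bit colour channel

def circular_distance(I, mu):
    """Equation 3.23: signed shortest distance from I to mu on a circle of length N."""
    return (I + N / 2 - mu) % N - N / 2

def circular_gauss_likelihood(I, mu, T):
    """Equation 3.22 with sigma_T taken from equation 3.21."""
    sigma_t = T / np.sqrt(2.0 * np.log(2.0))
    d = circular_distance(np.asarray(I, dtype=np.float64), mu)
    return np.exp(-(d ** 2) / (2.0 * sigma_t ** 2))

print(circular_distance(250, 4))                    # -10.0: only 10 steps away across the wrap-around
print(circular_gauss_likelihood(250, 4, T=10.0))    # 0.5: exactly at the decision boundary
```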

3.4 Evaluation

When doing binary segmentation each pixel is classified as either foreground or background. Binary segmentation can be seen as the answer to the question "Is this pixel foreground?", with the answer true if foreground and false if background. In the following paragraphs the different error measurements used will be described. In table 3.1 the meaning of the variables used in the evaluation equations can be seen.

Variable   Explanation
TP         Classified foreground as foreground
TN         Classified background as background
FP         Classified background as foreground
FN         Classified foreground as background

Table 3.1: Evaluation variable explanation.

Accuracy These counts can be used to calculate accuracy, which is the most common evaluation measure when doing segmentation [25]. Accuracy tells us the percentage of pixels classified correctly, as can be seen in equation 3.24.

ACC = \frac{TP + TN}{TP + FP + FN + TN} \quad (3.24)
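A minimal sketch (ours) of equation 3.24 computed from a predicted binary mask and its ground truth.

```python
import numpy as np

def accuracy(pred, gt):
    """Equation 3.24: fraction of pixels classified correctly (foreground = True)."""
    tp = np.sum(pred & gt)        # foreground classified as foreground
    tn = np.sum(~pred & ~gt)      # background classified as background
    fp = np.sum(pred & ~gt)       # background classified as foreground
    fn = np.sum(~pred & gt)       # foreground classified as background
    return (tp + tn) / (tp + fp + fn + tn)

gt = np.zeros((4, 4), dtype=bool); gt[1:3, 1:3] = True
pred = np.zeros((4, 4), dtype=bool); pred[1:3, 1:4] = True
print(accuracy(pred, gt))         # 14 of 16 pixels correct -> 0.875
```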

Failure Failure is used to see if an algorithm has been successful at segmenting an image. If the accuracy is less than a certain value then the algorithm has failed to segment that image; how this value is chosen can be seen in section 5.2.

Discard Discard is a way for an algorithm to say that it cannot segment a certain image. It can then choose to discard that image. If an image has been discarded the algorithm cannot fail at segmenting that image.

Discard rate and failure rate Discard rate and failure rate are calculated as the number of discarded or failed images divided by the total number of images. Both should be low for a good segmentation algorithm.


4 Method

Methods for estimating the quality of a foreground-background segmentation, estimating which class of a foreground-background segmentation contains the foreground, and combining different classifiers are presented in this chapter. Estimating which class contains the foreground will be referred to as finding the polarity of the foreground-background segmentation.

4.1 Foreground-background polarity estimate

Otsu's method leaves no guarantee that the resulting mask has the foreground as white and the background as black. If the foreground is not white the mask is seen as inverted, and a way of estimating whether the mask is inverted is required. We therefore say that the polarity of the mask, P, is True if it is inverted and False otherwise. One way of estimating the polarity is to look at the mask border as a probability. This can be formulated mathematically as in equation 4.1, where P(I(x)) is the probability that I(x) is foreground, x ∈ B where B is the set of border pixels, and |B| is the set cardinality. If there is more than a 50% chance that the border is foreground then the mask should be inverted. This is shown to give good results in section 5.1.

P = \begin{cases} \text{True} & \text{if } \frac{1}{|B|}\sum_{x \in B} P(I(x)) > 0.5, \\ \text{False} & \text{else,} \end{cases} \quad (4.1)
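A minimal sketch (ours) of the polarity estimate in equation 4.1: the foreground probabilities of the pixels on the image border are averaged, and the mask is flagged as inverted when that average exceeds 0.5.

```python
import numpy as np

def is_inverted(prob_fg):
    """Equation 4.1: True if the mask polarity should be flipped.

    prob_fg is an (H, W) array with the probability that each pixel is foreground.
    """
    border = np.concatenate([prob_fg[0, :], prob_fg[-1, :],
                             prob_fg[1:-1, 0], prob_fg[1:-1, -1]])
    return border.mean() > 0.5    # a mostly "foreground" border means the mask is inverted

mask = np.ones((5, 5))            # everything marked as foreground ...
mask[2, 2] = 0.0                  # ... except the centre pixel
print(is_inverted(mask))          # True: the border looks like foreground, so flip the mask
```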

4.2 Foreground-background segmentation quality

When performing foreground-background segmentation in several colour channels we would like to be able to say whether a segmentation has been performed successfully. This requires a quality estimate. Looking at equation 3.9 we note that σ_b² ≤ σ², which in turn can be used to formulate equation 4.2.

q = \frac{\sigma_b^2}{\sigma^2} \leq 1 \quad (4.2)

What this tells us is that if q = 1 then σ_b² = σ², meaning that we have two distinct classes in the image and that the only variance in the image is inter-class variance. In section 5.3 it can be seen that estimating segmentation quality with q gives good results. However, this does not tell us how well the channel represented the foreground and background that we want to segment in the first place.

One way of estimating whether a colour channel contains enough information about the foreground and background for a segmentation to be performed is to look at the total variance in the image. If the colour channel contains distinct foreground and background, the variance should be higher than if the image only contains foreground or background, as can be seen in section 5.3. This requires that the image does not contain large amounts of noise, where a large amount of noise is more noise than a smoothing filter can remove without making the foreground and background indistinguishable.

If circular thresholding is used to calculate the threshold, σ_w² is obtained instead of σ_b². Then q can be calculated according to equation 4.3 by using equation 3.9.

q = \frac{\sigma_b^2}{\sigma^2} = \frac{\sigma^2 - \sigma_w^2}{\sigma^2} = 1 - \frac{\sigma_w^2}{\sigma^2} \leq 1 \quad (4.3)
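A minimal sketch (ours) of the quality estimate: given a threshold, q is the inter-class variance of equation 3.7 divided by the total image variance, per equation 4.2.

```python
import numpy as np

def segmentation_quality(channel, threshold):
    """Equation 4.2: q = inter-class variance / total variance for a given threshold."""
    x = channel.astype(np.float64).ravel()
    total_var = x.var()
    if total_var == 0:
        return 0.0                       # no information in the channel at all
    c1, c2 = x[x > threshold], x[x <= threshold]
    if c1.size == 0 or c2.size == 0:
        return 0.0                       # thresholding produced only one class
    w1, w2 = c1.size / x.size, c2.size / x.size
    sigma_b2 = w1 * w2 * (c1.mean() - c2.mean()) ** 2   # equation 3.7
    return sigma_b2 / total_var          # q <= 1, close to 1 for two distinct classes

x = np.concatenate([np.full(500, 40.0), np.full(500, 200.0)])
print(segmentation_quality(x, 120))      # two perfectly separated classes -> 1.0
```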

4.3 Combining classifiers

The goal of a multi-channel thresholding system is to combine several segmentations in such a way that the final foreground-background segmentation is better than the individual classifiers. There are several ways of combining classifiers, as mentioned by Kittler et al. [14].

The quality estimate mentioned in section 4.2 can be used to change the standard probability equation 4.4 into equation 4.5. Here P(w_i|I) is the probability that a pixel belongs to class w_i given its pixel value I, and P(w_j|I) is the probability that the same pixel belongs to class w_j given its pixel value I.

P(w_i \mid I) + P(w_j \mid I) = 1 \quad (4.4)

qP(w_i \mid I) + qP(w_j \mid I) = q \quad (4.5)

This allows us to easily modify the existing classifier combination methods presented by Kittler, as can be seen below. Z is the pixel being classified; it can be classified as either foreground, f, or background, b, but for the sake of generality we assume m classes, w_j, and R classifiers, x_i. q_i is a scalar representing the trust in classifier x_i. P(w_j) is the prior probability of class w_j, which can be seen as how many pixels were classified as w_j.

One way of estimating the prior is by first assuming equal priors, then performing foreground-background segmentation and estimating the prior from that segmentation. Estimating the prior for class w_j is done by counting the number of pixels classified as w_j divided by the total number of pixels. In a video application the first frame could be segmented with equal priors and the result from that frame could then be used as the prior for the next frame.

Product rule

assign Z → w_j if

P(w_j)\prod_{i=1}^{R} q_i P(w_j \mid I_i) = \max_{k=1}^{m}\left[P(w_k)\prod_{i=1}^{R} q_i P(w_k \mid I_i)\right] \quad (4.6)

If we assume equal priors for each class, equation 4.6 can be simplified to equation 4.7.

assign Z → w_j if

\prod_{i=1}^{R} q_i P(w_j \mid I_i) = \max_{k=1}^{m}\left[\prod_{i=1}^{R} q_i P(w_k \mid I_i)\right] \quad (4.7)

Sum rule

assign Z → w_j if

(1 - R)P(w_j) + \sum_{i=1}^{R} q_i P(w_j \mid I_i) = \max_{k=1}^{m}\left[(1 - R)P(w_k) + \sum_{i=1}^{R} q_i P(w_k \mid I_i)\right] \quad (4.8)

If we assume equal priors for each class, equation 4.8 can be simplified to equation 4.9.

assign Z → w_j if

\sum_{i=1}^{R} q_i P(w_j \mid I_i) = \max_{k=1}^{m}\left[\sum_{i=1}^{R} q_i P(w_k \mid I_i)\right] \quad (4.9)

Max rule

assign Z → w_j if

(1 - R)P(w_j) + \max_{i=1}^{R} q_i P(w_j \mid I_i) = \max_{k=1}^{m}\left[(1 - R)P(w_k) + \max_{i=1}^{R} q_i P(w_k \mid I_i)\right] \quad (4.10)

If we assume equal priors for each class, equation 4.10 can be simplified to equation 4.11.

assign Z → w_j if

\max_{i=1}^{R} q_i P(w_j \mid I_i) = \max_{k=1}^{m}\left[\max_{i=1}^{R} q_i P(w_k \mid I_i)\right] \quad (4.11)

Min rule

assign Z → w_j if

P^{(R-1)}(w_j)\min_{i=1}^{R} q_i P(w_j \mid I_i) = \max_{k=1}^{m}\left[P^{(R-1)}(w_k)\min_{i=1}^{R} q_i P(w_k \mid I_i)\right] \quad (4.12)

If we assume equal priors for each class, equation 4.12 can be simplified to equation 4.13.

assign Z → w_j if

\min_{i=1}^{R} q_i P(w_j \mid I_i) = \max_{k=1}^{m}\left[\min_{i=1}^{R} q_i P(w_k \mid I_i)\right] \quad (4.13)

Majority vote

\delta_{ki} = \begin{cases} 1 & \text{if } P(w_k \mid I_i) = \max_{j=1}^{m} P(w_j \mid I_i), \\ 0 & \text{else,} \end{cases} \quad (4.14)

assign Z → w_j if

\sum_{i=1}^{R} q_i \delta_{ji} = \max_{k=1}^{m}\left[\sum_{i=1}^{R} q_i \delta_{ki}\right] \quad (4.15)
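A minimal sketch (ours) of the equal-prior sum rule of equation 4.9 in the two-class case: per-channel foreground probabilities are weighted by the channel quality q_i and summed, and the pixel is assigned to whichever class has the larger weighted sum.

```python
import numpy as np

def combine_sum_rule(prob_fg_per_channel, qualities):
    """Equation 4.9 with equal priors and two classes; returns a boolean foreground mask.

    prob_fg_per_channel: list of (H, W) arrays with P(foreground | I_i) per channel i.
    qualities: list of scalars q_i rating the trust in each channel.
    """
    fg = sum(q * p for q, p in zip(qualities, prob_fg_per_channel))
    bg = sum(q * (1.0 - p) for q, p in zip(qualities, prob_fg_per_channel))
    return fg >= bg

# Two toy 2x2 "channels"; the first channel is trusted more (higher q) than the second.
p1 = np.array([[0.9, 0.2], [0.8, 0.4]])
p2 = np.array([[0.4, 0.3], [0.9, 0.6]])
print(combine_sum_rule([p1, p2], qualities=[0.8, 0.3]))
```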

4.4 Multi-channel thresholding

An R-dimensional colour space with either linear or circular axes is used to represent an image, where a pixel in colour channel i is denoted I_i. For each channel a threshold value T_i can be set beforehand, stating how much variance, σ_i², must be present in the channel for it to be used as a classifier. Otsu's method can then be used to calculate the probability P(w_j|I_i). If σ_i² < T_i or q_i < T_{q_i} then the channel is discarded. A version of the algorithm being outlined here can be seen in figure 4.1.

If all channels have been discarded then a black mask is returned, indicating that no segmentation could be performed for the current image. The result given by Otsu's method can be either a binary mask or a pixel-wise probability mask, depending on whether the Gauss-likelihood is used or not. Each classifier is assumed to have correct polarity through the method mentioned in section 4.1. Once P(w_j|I_i) has been calculated for each colour channel, the final foreground-background segmentation can easily be done by using the preferred classifier combination method mentioned in section 4.3.
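Putting the pieces together, the multi-channel thresholding algorithm of this section could be sketched roughly as below. This is our own self-contained outline with illustrative parameter values, not the thesis implementation (which runs on Android with OpenCV), and it uses the sum rule with equal priors; any of the other combination rules from section 4.3 could be substituted.

```python
import numpy as np

def otsu(channel, bins=256):
    """Otsu threshold and the inter-class variance at the optimum (section 3.2.1)."""
    h, _ = np.histogram(channel, bins=bins, range=(0, bins))
    h = h.astype(np.float64) / h.sum()
    i = np.arange(bins)
    best_t, best_b = 0, 0.0
    for t in range(1, bins):
        w1, w2 = h[:t].sum(), h[t:].sum()
        if w1 == 0 or w2 == 0:
            continue
        mu1, mu2 = (i[:t] * h[:t]).sum() / w1, (i[t:] * h[t:]).sum() / w2
        b = w1 * w2 * (mu1 - mu2) ** 2
        if b > best_b:
            best_t, best_b = t, b
    return best_t, best_b

def multi_channel_mask(channels, t_var, t_q):
    """Sketch of the multi-channel thresholding algorithm of section 4.4.

    channels: list of (H, W) 8-bit linear colour channels.
    t_var[i], t_q[i]: per-channel variance and quality thresholds.
    Returns a boolean foreground mask, or an all-black mask if every channel
    was discarded.
    """
    fg_sum = np.zeros(channels[0].shape, dtype=np.float64)
    bg_sum = np.zeros_like(fg_sum)
    used = 0
    for ch, tv, tq in zip(channels, t_var, t_q):
        x = ch.astype(np.float64)
        T, sigma_b = otsu(ch)
        var = x.var()
        q = sigma_b / var if var > 0 else 0.0          # quality estimate, equation 4.2
        if var / 255.0 ** 2 < tv or q < tq:
            continue                                    # discard the channel
        mu_fg = x[x >= T].mean()                        # mean of the class at or above T
        sigma_t = max(abs(T - mu_fg), 1e-6) / np.sqrt(2.0 * np.log(2.0))   # equation 3.21
        prob_fg = np.exp(-((x - mu_fg) ** 2) / (2.0 * sigma_t ** 2))       # equation 3.17
        border = np.concatenate([prob_fg[0], prob_fg[-1],
                                 prob_fg[1:-1, 0], prob_fg[1:-1, -1]])
        if border.mean() > 0.5:                         # polarity estimate, equation 4.1
            prob_fg = 1.0 - prob_fg
        fg_sum += q * prob_fg                           # sum rule with equal priors, eq. 4.9
        bg_sum += q * (1.0 - prob_fg)
        used += 1
    if used == 0:
        return np.zeros(channels[0].shape, dtype=bool)  # nothing could be segmented
    return fg_sum >= bg_sum

# Toy example: a bright square on a dark background, repeated in three channels.
img = np.full((64, 64), 50, dtype=np.uint8)
img[20:44, 20:44] = 200
mask = multi_channel_mask([img, img, img], t_var=[0.001] * 3, t_q=[0.2] * 3)
print(mask.sum())   # 576 foreground pixels (the 24 x 24 square)
```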

4.4.1 Parameter tuning

In the multi-channel thresholding algorithm there are 2R + 1 parameters that need to be tuned for it to give good results.

For each colour channel there are two thresholds, T_i and T_{q_i} where i ∈ R, that need to be set. If the image in channel i has variance lower than T_i then it is regarded as having too little information to be properly foreground-background segmented. If the image in channel i has a q-value lower than T_{q_i} then thresholding has not managed to separate it into the foreground and background classes. If either of these two cases occurs, the colour channel is discarded.

The parameters are tuned on a simple training data set where the images should give either good results or bad results. The parameters are set so as to maximise accuracy.


4.5 Data set

A data set was created consisting of 48 images featuring different cameras, lighting conditions, shadows and backgrounds. The data set features real-world images captured by a smart phone, with foreground objects of varying complexity against plain backgrounds that can be found in a typical office or urban space. Ground truth was created by manually classifying each pixel in each image as foreground or background.

Each image in the data set has been labelled according to the labels in table 4.1. These labels aim to show where the algorithm often fails if it does so at all.

The data set together with individual image labels and ground truth can be seen in appendix B.1-B.5. A part of the data set has been labelled as the training data set and can be seen in figure 4.2.

Label          Explanation                              Abbreviation
Shadow         Shadows in image                         Sh
Illumination   Strong illumination in image             Il
Dark-fore      Dark foreground object                   D-f
Complex-fore   Complex foreground                       C-f
Texture-back   Textured background                      T-b
Complex-back   Complex background                       C-b
Reflection     Reflections in image                     Re
Similar        Foreground and background are similar    Si
Training       Image used for training                  Tr

Table 4.1: Labels used for the images in the data set.


Figure 4.2: The training images and their labels: (a) complex-fore, complex-back, texture-back, training; (b) complex-fore, training; (c) dark-fore, shadow, reflection, training; (d) dark-fore, shadow, training; (e) dark-fore, training; (f) complex-fore, illumination, training; (g) shadow, training; (h) training; (i) texture-back; (j) texture-back, similar, training.


5 Results

In this chapter results for polarity estimates, segmentation quality estimates and classifier combination are presented. We evaluate how to combine classifiers in different colour spaces to see what works best for our data set. We show that the best evaluation result for multi-channel thresholding is gained by combining with the sum rule in the HSV colour space with a Gaussian probability mask as seen in table 5.3. Multi-channel thresholding will be referred to as our proposed method.

A comparison between our proposed method in the HSV space, Otsu’s method in the gray scale colour space and K-means in the HSV colour space can be seen in figure 5.1. Further the evaluation results for the three methods can be seen in table 5.1. It is very interesting that using K-means in the HSV colour space without any modifications or improvement and without taking into consideration that hue is a circular colour channel provides better accuracy than our proposed method.

Table 5.1: Evaluation results for our proposed method, K-means and Otsu's method on gray scale. Time is the average segmentation time for one image. D.R. stands for discard rate and F.R. stands for failure rate.

                D.R.   F.R.   Accuracy
HSV C Sum       0      0.21   0.916
K-means         0      0.26   0.918
Otsu's method   0      0.58   0.765



Figure 5.1: Figures 5.1a-5.1d feature original images, figures 5.1e-5.1h feature ground truth, figures 5.1i-5.1l feature results from our proposed method, figures 5.1m-5.1p feature results from K-means and figures 5.1q-5.1t feature results from Otsu's method.


5.1 Foreground-background polarity estimate

In figure 5.2 we show the images that needed to be inverted after segmentation. What should be noted is that there are two types of images that are normally in need of polarity adjustment. Images that cannot be segmented properly as seen in figure 5.2d and 5.2e or images that get the wrong polarity after segmentation as seen in figure 5.2f. The first case needs to be handled by other means such as discarding the segmentation as mentioned in section 4.4.


Figure 5.2: Figures 5.2a-5.2c feature grayscale images, figures 5.2d-5.2f show the image masks after foreground-background segmentation and figures 5.2g-5.2i show the masks after the polarity estimate has been taken into account.

5.2 Failure rate

If a segmentation result has low accuracy the segmentation is considered a failure. In figure 5.3 the image, segmentation result, image overlay and evaluation result together with accuracy can be seen for several images. Worth noting is that for a successful segmentation the accuracy is normally between 0.95 and 0.99. When the segmentation is partially successful it drops to between 0.85 and 0.95. When background starts to get classified as foreground or vice versa, the accuracy drops even lower; this is where the algorithm starts failing and the final result is no longer desirable in an augmented reality application. Therefore an accuracy below 0.85 is seen as a failure.

Figure 5.3: Example of different accuracy (Acc.) on different images. Figures 5.3a-5.3e show the value channel, figures 5.3f-5.3j show the segmentation result together with its accuracy ((f) 0.84, (g) 0.99, (h) 0.91, (i) 0.50, (j) 0.67) and figures 5.3k-5.3o show the original images overlaid on a control panel with the help of the image mask.

5.3 Foreground-background segmentation quality

We are going to show how to choose the quality estimate thresholds, T_q and T, for the different colour channels in the gray, RGB, HSV and PCA colour spaces. The thresholds are chosen so as to maximise accuracy while maintaining a low discard rate.

In figure 5.4 we show the image presented in figure 4.2d in different colour channels with their foreground-background segmentation masks and their different q and σ² values.

Even though accuracy is prioritised over discard rate in individual colour channels, this might not be the best strategy when more channels are combined. Since the end goal is to combine colour channels we might be able to be a bit lenient with accuracy so as to lower discard rates. Therefore two sets of thresholds are chosen, one with lower discard rate and one with higher discard rate. Set 1 is the parameter set that maximises accuracy while set 2 is the set that tries to be more balanced. A third set of parameters, referred to as set 0, was also added; set 0 contains parameters that never discard an image.

Multiple parameters were tested for each colour channel and those with the most impact on the evaluation results are shown in tables A.1-A.4.

The nature of PCA is such that the first colour channel should give much better results, both in regard to accuracy and in regard to discard rate. Therefore higher discard rates for PCA2 and PCA3 are natural.

The results when using the parameters obtained from the training data set on the evaluation data set can be seen in table 5.2. Blue, Saturation and Value have the highest accuracy while keeping discard rates low.

Table 5.2: Evaluation results for the different colour channels on the evaluation data set. The following words have been abbreviated: discard rate (D.R.), fail rate (F.R.) and accuracy (Acc.). Set 0, set 1 and set 2 indicate which set of parameters the channel has been using.

         Set 0               Set 1               Set 2
         D.R.  F.R.  Acc.    D.R.  F.R.  Acc.    D.R.  F.R.  Acc.
Gray     0.00  0.58  0.77    0.45  0.21  0.81    0.16  0.45  0.78
Red      0.00  0.55  0.77    0.47  0.16  0.88    0.34  0.24  0.85
Green    0.00  0.58  0.77    0.61  0.11  0.85    0.32  0.29  0.80
Blue     0.00  0.53  0.79    0.45  0.16  0.86    0.21  0.34  0.82
Hue      0.00  0.66  0.76    0.79  0.08  0.80    0.11  0.58  0.76
Sat.     0.00  0.47  0.84    0.32  0.21  0.90    0.24  0.24  0.89
Value    0.00  0.50  0.78    0.39  0.18  0.86    0.37  0.21  0.84
PCA1     0.00  0.53  0.79    0.39  0.18  0.86    0.37  0.21  0.86
PCA2     0.00  0.68  0.74    0.89  0.03  0.86    0.53  0.21  0.80
PCA3     0.00  0.92  0.64    0.61  0.34  0.63    0.45  0.50  0.66


Figure 5.4: Showing the different colour channels of figure 4.2d together with their foreground-background segmentation masks. Panel values: Red σ² = 0.02, q = 0.66; Green σ² = 0.02, q = 0.73; Blue σ² = 0.02, q = 0.67; Hue σ² = 0.04, q = 0.99; Sat. σ² = 0.02, q = 0.66; Value σ² = 0.03, q = 0.71; PCA1 σ² = 0.02, q = 0.70; PCA2 σ² = 0.05, q = 0.81; PCA3 σ² = 0.01, q = 0.45; Gray σ² = 0.02, q = 0.72.


5.4 Combining classifiers

The best results for each colour space in each parameter set can be seen in table 5.3. The total evaluation results can be seen in tables 5.4-5.5. The best result is gained by using the sum rule on the HSV colour space, since it has a low failure rate, a discard rate of 0 and high accuracy.

Table 5.3: Here we show the best way of combining classifiers from the different colour channels for different parameter sets. The best result for each colour space is highlighted.

Set 0          Discard Rate  Fail Rate  Accuracy
RGB C Max      0.00          0.42       0.856
HSV C Sum      0.00          0.32       0.892
PCA B Max      0.00          0.29       0.871

Set 1          Discard Rate  Fail Rate  Accuracy
RGB C Max      0.32          0.18       0.918
HSV C Sum      0.00          0.21       0.916
PCA B Sum      0.16          0.29       0.851

Set 2          Discard Rate  Fail Rate  Accuracy
RGB C Max      0.05          0.42       0.847
HSV C Sum      0.00          0.29       0.903
PCA B Max      0.03          0.32       0.870

Table 5.4: The discard rate per label for the top classifier combination of each colour space. Im/label shows how many images have been labelled with that label.

Categories     RGB C Max  HSV C Sum  PCA B Max  Im/label
Dark-fore      0.20       0.00       0.00       10
Illumination   0.17       0.00       0.00       6
Complex-back   0.50       0.00       0.00       12
Complex-fore   0.64       0.00       0.00       14
Reflection     0.50       0.00       0.00       2
Shadow         0.27       0.00       0.00       11
Similar        0.20       0.00       0.00       5
Texture-back   0.38       0.00       0.00       16

5.4.1 Algorithm examples

In figures 5.5-5.7 a typical success case can be seen for each colour space. In figures 5.8-5.10 a typical failure case can be seen for each colour space. What is interesting is why different combination rules perform differently on different colour channels and that PCA works best without discarding channels.

Table 5.5: The fail rate per label for the top classifier combination of each colour space. Im/label shows how many images have been labelled with that label.

Categories     RGB C Max  HSV C Sum  PCA B Max  Im/label
Dark-fore      0.00       0.33       0.67       10
Illumination   0.00       0.17       0.17       6
Complex-back   0.00       0.17       0.25       12
Complex-fore   0.00       0.21       0.21       14
Reflection     0.00       0.00       0.50       2
Shadow         0.18       0.18       0.36       11
Similar        0.80       0.80       0.80       5
Texture-back   0.00       0.19       0.13       16

Even though RGB has the lowest fail rate, it does not work well for segmenting the evaluation data set. When the channels are not discarded, the max rule is the combination that does the best job of combining the RGB channels. This is because the channels often do not complement each other; instead it is more about picking the best channel and using it for segmentation.

HSV is the colour space that works best for segmenting the evaluation data set. Even though RGB has better accuracy and a lower fail rate, HSV is still a better colour space since it has no discards. This might sound strange, but 79% of the evaluation data set was segmented successfully in the HSV colour space while only 50% was successfully segmented in the RGB colour space.

The reason why HSV performs well is that the different colour channels measure different things and are more independent of each other than the RGB channels. One of the reasons why HSV does not perform even better has to do with the fact that hue is not well defined for low saturation and value, which is not taken into account in this thesis. This has required the hue discard parameters to be higher than for other channels, and hue is often discarded when it could have been used to improve the segmentation result, as seen in figure 5.9a.

PCA becomes worse when discarding channels simply because the first principal component might be discarded while the other channels might not. Even though the first principal component does not always give a good result, it is clearly superior to the second and third principal components, as seen in table 5.2.


Figure 5.5: Max rule example for an RGB image. Figures 5.5a-5.5c show the red, green and blue colour channels; these are used to calculate the foreground-background probabilities as seen in figures 5.5d-5.5f. The probabilities are then combined with the max rule into the mask seen in figure 5.5g, and the mask is used to overlay the original image as seen in figure 5.5h.


Figure 5.6: Sum rule example for an HSV image. Figures 5.6a-5.6c show the hue, saturation and value colour channels; these are used to calculate the foreground-background probabilities as seen in figures 5.6d-5.6f. The probabilities are then combined with the sum rule into the mask seen in figure 5.6g, and the mask is used to overlay the original image as seen in figure 5.6h.


Figure 5.7: Max rule example for a PCA image. Figures 5.7a-5.7c show the PCA1, PCA2 and PCA3 colour channels; these are used to calculate the foreground-background probabilities as seen in figures 5.7d-5.7f. The probabilities are then combined with the max rule into the mask seen in figure 5.7g, and the mask is used to overlay the original image as seen in figure 5.7h.


Figure 5.8: Max rule example for an RGB image. Figures 5.8a-5.8c show the red, green and blue colour channels; these are used to calculate the foreground-background probabilities as seen in figures 5.8d-5.8f. The probabilities are then combined with the max rule into the mask seen in figure 5.8g, and the mask is used to overlay the original image as seen in figure 5.8h.


Figure 5.9: Sum rule example for an HSV image. Figures 5.9a-5.9c show the hue, saturation and value colour channels. The hue and value colour channels were discarded, resulting in only saturation being used to calculate the foreground-background probability as seen in figure 5.9d. The probability is then combined with the sum rule into the mask seen in figure 5.9e, and the mask is used to overlay the original image as seen in figure 5.9f.


Figure 5.10: Max rule example for a PCA image. Figures 5.10a-5.10c show the PCA1, PCA2 and PCA3 colour channels; these are used to calculate the foreground-background probabilities as seen in figures 5.10d-5.10f. The probabilities are then combined with the max rule into the mask seen in figure 5.10g, and the mask is used to overlay the original image as seen in figure 5.10h.


5.4.2 Equal prior

The results when comparing equal priors to estimated priors can be seen in table 5.6. As can be seen, using prior knowledge does not improve the result.

Table 5.6: Sum rule used to combine HSV classifiers with parameter set 1 on the training data set with and without equal prior. Eq is used to denote equal prior.

               Discard Rate  Fail Rate  Accuracy
HSV B Sum      0.1           0.2        0.92
HSV B Sum eq   0.1           0.1        0.93
HSV C Sum      0.1           0.4        0.87
HSV C Sum eq   0.1           0          0.98

The reason why assuming equal priors gives better results is that estimating the prior assumes that individual pixels follow the same behaviour as the image as a whole, which is not necessarily true.

Below is an example where we try to make it clearer why estimating priors can make the result worse. This example, together with the data in table 5.6, is why equal priors have been assumed in this report.

Example The goal of this example is to show how a single pixel can be misclassified in certain cases when estimating the prior.

We have the outputs of two classifiers, x_1 and y_1, with which we want to classify a pixel. There are two classes, w_1 and w_2. We know beforehand that the pixel is in class 1. Using the sum rule as seen in equation 4.8 and assuming equal priors leads us to equation 5.1.

x_1 + y_1 \geq (1 - x_1) + (1 - y_1) \quad (5.1)

This can be simplified to equation 5.2.

x_1 + y_1 \geq 1 \quad (5.2)

Since x_1 and y_1 are probabilities we also know that equation 5.3 must be valid.

0 \leq x_1, y_1 \leq 1 \quad (5.3)

Using prior estimates and the sum rule we get equation 5.4.

P(w_1) + x_1 + y_1 \geq P(w_2) + (1 - x_1) + (1 - y_1) \quad (5.4)

Here P(w_1) and P(w_2) are the prior estimates for the pixel. P(w_2) = 1 − P(w_1), which can be used to create equation 5.5.

P(w_1) + x_1 + y_1 \geq \frac{3}{2} \quad (5.5)

Values for x_1, y_1 and P(w_1) can be chosen such that equations 5.2 and 5.3 are satisfied while equation 5.5 is not. This shows that using prior estimates does not guarantee better performance than equal priors.
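As a concrete illustration (the numbers below are chosen by us purely to demonstrate the effect; they do not appear in the thesis), a pixel that truly belongs to class 1 can satisfy the equal-prior criterion but fail the estimated-prior criterion:

```latex
% Illustrative values, not from the thesis:
x_1 = 0.6, \quad y_1 = 0.5, \quad P(w_1) = 0.3
% Equal priors, equation 5.2: the pixel is correctly assigned to class 1.
x_1 + y_1 = 1.1 \geq 1
% Estimated prior, equation 5.5: the criterion fails, so the pixel is assigned to class 2.
P(w_1) + x_1 + y_1 = 1.4 < \tfrac{3}{2}
```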

5.5 Real-time performance

To see if real-time performance can be obtained, the max rule was implemented in the RGB colour space with binary probabilities, resulting in 13.8 fps on a 960x540 image running on Android 6 using a Snapdragon 410 with a quad-core 1.2 GHz Cortex-A53 CPU [6]. Only the max rule for RGB was implemented since it is the easiest to implement without having to modify OpenCV4Android [4] functions. The app can be seen in use in figure 5.11.


Figure 5.11: Showing the algorithm described in section 4.4 running on a smart phone in real-time.

5.5.1 Complexity

The complexity of the individual components of our proposed method, using a Gaussian probability distribution for R channels on an N-pixel image, is listed below. Here we only list the parts of the algorithm that work on a per-pixel basis.

• Computing T, q and σ² takes O(N)
• Creating the probability mask takes O(N)
• Combining channels takes O(N)


The result is O(N). The only part of the algorithm that cannot be performed on a per-pixel basis is computing T, q and σ². The rest of the algorithm works on a per-pixel basis and can be performed quickly on a GPU. Figure 5.12 shows how the time increases with the number of pixels for our proposed method compared to the K-means algorithm; both our proposed method and K-means can be seen to have complexity O(N). In table 5.7 we show the time it takes to run our algorithm compared to K-means. Even if K-means can be implemented to run more efficiently, it is still 50 times slower than our proposed method on a 6 megapixel image.

Figure 5.12: The complexity of our proposed method compared to K-means. The graph shows the average time of segmenting an image when run on the training data set. The algorithms were run over several versions of the training data set where the data sets had different image sizes. The time of running each algorithm has been normalised so that it takes one time unit for both on the smallest scale.


Table 5.7: The complexity of our proposed method compared to K-means. The algorithms were run over several versions of the training data set where the data sets had different image sizes. The size of the images in the training data set was varied.

HSV C Sum [s]  Knn [s]  Megapixels [MP]
0.02           0.51     0.06
0.07           2.26     0.2
0.10           5.06     0.5
0.19           8.79     1.0
0.30           14.29    1.5
0.41           21.62    2.2
0.53           30.23    2.9
0.73           37.70    3.8
0.91           47.77    4.9
1.17           61.95    6.0


6 Conclusion and future work

Here we present a conclusion of what has been done and show how our results could be improved upon in the future.

6.1 Conclusion

The goal was to develop a lightweight algorithm capable of performing foreground-background segmentation in real-world scenarios. To test whether this goal was achieved, a labelled data set was created. The labels were chosen so as to verify that the constraints and specifications mentioned in section 1.3 were upheld. Looking at the results in tables 5.4 and 5.5 we can see that the only category where the algorithm systematically fails is when the foreground and background are similar. Ideally this category should be discarded, but thankfully it is possible to tell a user not to use the algorithm on an image where the foreground and background have the same colour.

Using the variance and the quotient, q, between the total variance and the inter-class variance has been shown to be an efficient way to quickly estimate the quality of a classifier. The quotient q has also been shown to be an efficient way of rating classifiers. The result is a robust and efficient foreground-background segmentation algorithm that is not only capable of real-time performance but is also capable of telling a user when a certain image cannot be segmented, allowing the user to take action. We have also shown that several simple classifiers can be combined so as to give better performance than the individual classifiers by themselves.

What can clearly be seen by looking at tables 5.3, 5.2 and A.5 is that combining classifiers with discard is better than the best combination of classifiers without discard, which in turn is better than the individual colour channels with discard. That said, K-means does a very good job of segmenting the data set, but it is not as fast as our proposed method. The problem with both our proposed method and K-means is that they are old techniques that cannot handle complex backgrounds and are not always reliable.

6.2 Future work

One thing that should be examined in the future is how to handle badly defined hue. This is most likely something that should be handled when computing thresholds and quality estimates, and it will most likely result in the hue channel being discarded less often. Only hue pixels that are well defined should be used when combining classifiers; this should increase the overall accuracy. One possible realisation is sketched below.
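As a hedged sketch of one way this could be done (the saturation limit is an illustrative value, not one evaluated in the thesis): only pixels whose saturation is above a minimum would contribute to the hue threshold, the quality estimate and the combined classifier.

import cv2
import numpy as np

def well_defined_hue(bgr_image, min_saturation=30):
    # Hue is poorly defined for near-grey pixels, so mask them out before
    # computing the hue threshold and quality quotient. The saturation
    # limit is an illustrative value.
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    hue = hsv[:, :, 0]
    valid = hsv[:, :, 1] >= min_saturation
    return hue, valid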

Instead of the Gaussian distribution in linear colour space, the step function could be smoothed so as to give uncertainty for pixels close to the decision boundary while maintaining high certainty for high or low pixel values.
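A minimal sketch of that suggestion, using a logistic curve centred on the channel threshold; the width parameter is an illustrative assumption, not a tuned value:

import numpy as np

def smoothed_step_probability(channel, threshold, width=10.0):
    # Soft alternative to the hard step function: pixels far from the
    # threshold get probabilities close to 0 or 1, while pixels close to
    # the decision boundary keep an uncertain probability near 0.5.
    x = channel.astype(np.float64)
    return 1.0 / (1.0 + np.exp(-(x - threshold) / width))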

What this algorithm does, and does well, is to segment images fast. Since combining classifiers and calculating the probability that a pixel belongs to a certain class works on a per-pixel basis, it could be performed on the GPU, resulting in a higher frame rate. This should make real-time performance on a full HD video stream possible.

With mobile devices getting more powerful and neural networks becoming more efficient, it is not unthinkable that they could be used for foreground-background segmentation in the near future. This is most likely the way to develop a foreground-background segmentation algorithm capable of robustly segmenting complex foreground objects against complex backgrounds.


A Evaluation Results

This chapter features evaluation results.

A.1 Colour channels parameter tuning

Below are the results of parameter tuning of the individual colour channels on the training data set.

Table A.1: Evaluation results for gray scale. Set 1 represents the threshold values with the best accuracy at a decent discard rate, while set 2 has a better discard rate while still keeping decent accuracy.

Gray | Discard Rate | Fail Rate | Accuracy | T^q_Gr | T_Gr
 | 0 | 0.5 | 0.76 | 0.50 | 0.006
 | 0.2 | 0.4 | 0.75 | 0.65 | 0.006
 | 0.3 | 0.3 | 0.77 | 0.65 | 0.010
Set 2 | 0.2 | 0.3 | 0.79 | 0.60 | 0.010
Set 1 | 0.5 | 0 | 0.95 | 0.60 | 0.020

A.2 Combining classifier result

Results of running the different classifier combinations on the evaluation data set.


Table A.2: Evaluation results for the different colour channels in the RGB colour space. Set 1 represents the threshold values with the best accuracy at a decent discard rate, while set 2 has a better discard rate while still keeping decent accuracy.

Red | Discard Rate | Fail Rate | Accuracy | T^q_R | T_R
 | 0 | 0.4 | 0.82 | 0.60 | 0.008
 | 0.1 | 0.3 | 0.83 | 0.60 | 0.011
Set 1 | 0.5 | 0 | 0.94 | 0.74 | 0.011
 | 0.6 | 0 | 0.95 | 0.74 | 0.022
Set 2 | 0.3 | 0.1 | 0.90 | 0.68 | 0.011

Green | Discard Rate | Fail Rate | Accuracy | T^q_G | T_G
 | 0.00 | 0.50 | 0.75 | 0.50 | 0.006
 | 0.20 | 0.40 | 0.75 | 0.60 | 0.006
Set 2 | 0.40 | 0.10 | 0.87 | 0.50 | 0.017
 | 0.50 | 0.10 | 0.89 | 0.74 | 0.006
Set 1 | 0.60 | 0.00 | 0.95 | 0.78 | 0.017

Blue | Discard Rate | Fail Rate | Accuracy | T^q_B | T_B
 | 0.00 | 0.40 | 0.80 | 0.60 | 0.006
 | 0.20 | 0.20 | 0.88 | 0.70 | 0.006
Set 2 | 0.30 | 0.10 | 0.93 | 0.70 | 0.008
Set 1 | 0.40 | 0.00 | 0.98 | 0.78 | 0.008
 | 0.30 | 0.10 | 0.92 | 0.68 | 0.019



Table A.3: Evaluation results for the different colour channels in the HSV colour space. Set 1 represents the threshold values with the best accuracy at a decent discard rate, while set 2 has a better discard rate while still keeping decent accuracy.

Hue | Discard Rate | Fail Rate | Accuracy | T^q_H | T_H
 | 0.00 | 0.30 | 0.86 | 0.00 | 0.00
Set 2 | 0.20 | 0.20 | 0.88 | 0.60 | 0.01
 | 0.40 | 0.10 | 0.91 | 0.98 | 0.01
Set 1 | 0.50 | 0.10 | 0.92 | 0.983 | 0.01
 | 0.60 | 0.10 | 0.92 | 0.985 | 0.01
 | 0.20 | 0.20 | 0.88 | 0.60 | 0.10
 | 0.20 | 0.20 | 0.88 | 0.60 | 0.10
 | 0.40 | 0.10 | 0.91 | 0.98 | 0.21

Saturation | Discard Rate | Fail Rate | Accuracy | T^q_S | T_S
 | 0.00 | 0.50 | 0.78 | 0.50 | 0.003
 | 0.20 | 0.30 | 0.83 | 0.50 | 0.006
Set 1 | 0.50 | 0.00 | 0.98 | 0.70 | 0.006
Set 2 | 0.30 | 0.20 | 0.88 | 0.66 | 0.006

Value | Discard Rate | Fail Rate | Accuracy | T^q_V | T_V
 | 0.00 | 0.50 | 0.78 | 0.50 | 0.006
 | 0.20 | 0.30 | 0.82 | 0.65 | 0.006
Set 2 | 0.40 | 0.10 | 0.90 | 0.73 | 0.006
Set 1 | 0.50 | 0.00 | 0.98 | 0.73 | 0.008


Table A.4: Evaluation results for the different colour channels in the PCA colour space. Set 1 represents the threshold values with the best accuracy at a decent discard rate, while set 2 has a better discard rate while still keeping decent accuracy.

PCA1 | Discard Rate | Fail Rate | Accuracy | T^q_PCA1 | T_PCA1
 | 0.00 | 0.30 | 0.86 | 0.65 | 0.008
Set 2 | 0.10 | 0.20 | 0.90 | 0.72 | 0.008
Set 1 | 0.30 | 0.00 | 0.97 | 0.75 | 0.008

PCA2 | Discard Rate | Fail Rate | Accuracy | T^q_PCA2 | T_PCA2
 | 0.00 | 0.70 | 0.72 | 0.50 | 0.011
 | 0.50 | 0.30 | 0.75 | 0.65 | 0.011
 | 0.30 | 0.50 | 0.72 | 0.50 | 0.014
Set 2 | 0.60 | 0.20 | 0.80 | 0.63 | 0.014
Set 1 | 0.60 | 0.20 | 0.80 | 0.78 | 0.014

PCA3 | Discard Rate | Fail Rate | Accuracy | T^q_PCA3 | T_PCA3
 | 0.00 | 0.90 | 0.65 | 0.00 | 0.00
 | 0.50 | 0.40 | 0.70 | 0.50 | 0.007
Set 2 | 0.70 | 0.20 | 0.85 | 0.50 | 0.008
 | 0.70 | 0.20 | 0.73 | 0.60 | 0.011
 | 0.40 | 0.50 | 0.67 | 0.50 | 0.011
Set 1 | 0.80 | 0.10 | 0.85 | 0.60 | 0.008



Table A.5: Combining classifiers with parameter set 0.

Method | Discard Rate | Fail Rate | Accuracy
RGB B Prod | 0.00 | 0.58 | 0.78
RGB C Prod | 0.00 | 0.55 | 0.83
HSV B Prod | 0.00 | 0.76 | 0.75
HSV C Prod | 0.00 | 0.34 | 0.88
PCA B Prod | 0.00 | 0.92 | 0.68
PCA C Prod | 0.00 | 0.45 | 0.83
RGB B Sum | 0.00 | 0.55 | 0.79
RGB C Sum | 0.00 | 0.50 | 0.83
HSV B Sum | 0.00 | 0.34 | 0.87
HSV C Sum | 0.00 | 0.32 | 0.89
PCA B Sum | 0.00 | 0.58 | 0.78
PCA C Sum | 0.00 | 0.39 | 0.85
RGB B Min | 0.00 | 0.58 | 0.78
RGB C Min | 0.00 | 0.53 | 0.83
HSV B Min | 0.00 | 0.76 | 0.75
HSV C Min | 0.00 | 0.42 | 0.86
PCA B Min | 0.00 | 0.92 | 0.68
PCA C Min | 0.00 | 0.42 | 0.82
RGB B Max | 0.00 | 0.42 | 0.81
RGB C Max | 0.00 | 0.42 | 0.86
HSV B Max | 0.00 | 0.58 | 0.79
HSV C Max | 0.00 | 0.37 | 0.89
PCA B Max | 0.00 | 0.29 | 0.87
PCA C Max | 0.00 | 0.32 | 0.86
RGB B Vote | 0.00 | 0.55 | 0.79
RGB C Vote | 0.00 | 0.58 | 0.80
HSV B Vote | 0.00 | 0.34 | 0.87
HSV C Vote | 0.00 | 0.29 | 0.88
PCA B Vote | 0.00 | 0.58 | 0.78
PCA C Vote | 0.00 | 0.58 | 0.78


Table A.6: Combining classifiers with parameter set 1.

Method | Discard Rate | Fail Rate | Accuracy
RGB B Prod | 0.32 | 0.21 | 0.871
RGB C Prod | 0.32 | 0.18 | 0.916
HSV B Prod | 0.00 | 0.29 | 0.880
HSV C Prod | 0.00 | 0.21 | 0.914
PCA B Prod | 0.16 | 0.37 | 0.824
PCA C Prod | 0.16 | 0.32 | 0.831
RGB B Sum | 0.32 | 0.18 | 0.871
RGB C Sum | 0.32 | 0.18 | 0.916
HSV B Sum | 0.00 | 0.24 | 0.888
HSV C Sum | 0.00 | 0.21 | 0.916
PCA B Sum | 0.16 | 0.29 | 0.851
PCA C Sum | 0.16 | 0.29 | 0.833
RGB B Min | 0.32 | 0.21 | 0.871
RGB C Min | 0.32 | 0.18 | 0.915
HSV B Min | 0.00 | 0.29 | 0.880
HSV C Min | 0.00 | 0.24 | 0.910
PCA B Min | 0.16 | 0.37 | 0.824
PCA C Min | 0.16 | 0.32 | 0.831
RGB B Max | 0.32 | 0.18 | 0.871
RGB C Max | 0.32 | 0.18 | 0.918
HSV B Max | 0.00 | 0.26 | 0.870
HSV C Max | 0.00 | 0.24 | 0.915
PCA B Max | 0.16 | 0.29 | 0.851
PCA C Max | 0.16 | 0.29 | 0.834
RGB B Vote | 0.32 | 0.18 | 0.871
RGB C Vote | 0.32 | 0.18 | 0.916
HSV B Vote | 0.00 | 0.24 | 0.888
HSV C Vote | 0.00 | 0.24 | 0.906
PCA B Vote | 0.16 | 0.29 | 0.851
PCA C Vote | 0.16 | 0.29 | 0.834

References
