
Institutionen för systemteknik
Department of Electrical Engineering

Examensarbete (Master's thesis)

Implementation and evaluation of content-aware video retargeting techniques

Master's thesis in image coding, carried out at Tekniska högskolan i Linköping
by
Stefan Holmer

LiTH-ISY-EX--08/4163--SE

Linköping 2008

Department of Electrical Engineering, Linköpings universitet


Handledare (Supervisors): Harald Nautsch, ISY, Linköpings universitet; Nils Andgren, Telestream AB

Examinator (Examiner): Robert Forchheimer, ISY, Linköpings universitet


Avdelning, Institution (Division, Department): Information Coding, Department of Electrical Engineering, Linköpings universitet, SE-581 83 Linköping, Sweden

Datum (Date): 2008-09-21
Språk (Language): Engelska/English
Rapporttyp (Report category): Examensarbete

URL för elektronisk version: http://www.icg.isy.liu.se
http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-ZZZZ
ISRN: LiTH-ISY-EX--08/4163--SE

Titel

Title

Implementation och utvärdering av innehållsstyrd omformatering av videosekvenser

Implementation and evaluation of content-aware video retargeting techniques

Författare

Author

Stefan Holmer

Sammanfattning

Abstract

The purpose of this master thesis was to study different content-aware video retargeting techniques, concentrating on a generalization of seam carving for video. Focus has also been put on the possibility of combining different techniques to achieve better retargeting of both multi-shot and single-shot video. This also involved significant studies of automatic cut detection and different measures of video content. The work resulted in a prototype application for semi-automatic video retargeting, developed in Matlab. Three different retargeting techniques (seam carving, automated pan & scan, and subsampling using bi-cubic interpolation) have been implemented in the prototype. The techniques have been evaluated and compared to each other from a content preservation perspective and a perceived quality perspective.

Nyckelord
Keywords

seam carving, retargeting, aspect ratio, scaling, saliency, cut, shot, detection, widescreen, fullscreen, processing, optical flow, sampling, pan, scan, energy


Abstract

The purpose of this master thesis was to study different content-aware video retargeting techniques, concentrating on a generalization of seam carving for video. Focus has also been put on the possibility of combining different techniques to achieve better retargeting of both multi-shot and single-shot video. This also involved significant studies of automatic cut detection and different measures of video content. The work resulted in a prototype application for semi-automatic video retargeting, developed in Matlab. Three different retargeting techniques (seam carving, automated pan & scan, and subsampling using bi-cubic interpolation) have been implemented in the prototype. The techniques have been evaluated and compared to each other from a content preservation perspective and a perceived quality perspective.

Sammanfattning

Syftet med examensarbetet har varit att studera tekniker för ändring av bredd/höjd-förhållandet i videosekvenser, där hänsyn tas till innehållet i bilderna. Fokus har lagts på en generalisering av "seam carving" för video och möjligheterna att kombinera olika tekniker för att nå bättre kvalitet både för videosekvenser som består av endast ett, eller flera, klipp. Detta innefattade således också omfattande studier av automatisk klippdetektering och olika mått av videoinnehåll. Arbetet har resulterat i en prototypapplikation utvecklad i Matlab för halvautomatisk förändring av bildförhållande där hänsyn tas till innehållet i sekvenserna. I prototypen finns tre metoder implementerade, "seam carving", automatiserad "pan & scan" och nedsampling med bi-kubisk interpolering. Dessa metoder har utvärderats och jämförts med varandra från ett innehållsbevarande perspektiv och ett kvalitetsperspektiv.


Acknowledgments

I would like to thank my supervisors, Nils Andgren at Telestream and Harald Nautsch at ISY, Linköping University, for their ideas, input and for reading the report. A thanks also goes to my opponent Erik Lindblad, Marcus Frödin and Oskar Hermansson for valuable discussions and to family and friends for listening to my complaints during hard times and for being enthusiastic with me during better times.

Stockholm, August 2008
Stefan Holmer


Contents

1 Introduction
   1.1 Problem statement
   1.2 Method
   1.3 Glossary

2 Cut detection
   2.1 Histogram
   2.2 Considered methods
   2.3 Running histogram based cut detection

3 Content-aware video retargeting
   3.1 Saliency maps
       3.1.1 Face detection
       3.1.2 Motion estimation
       3.1.3 Saliency from combined multiscale image features
   3.2 Seam carving
       3.2.1 Video generalization
       3.2.2 Energy functions
       3.2.3 Poisson reconstruction
   3.3 Pan & scan
       3.3.1 Cropping
       3.3.2 Virtual panning
       3.3.3 Virtual cuts
   3.4 Method classification
       3.4.1 Seam carving quality measure
       3.4.2 Pan & scan quality measure

4 Implementation
   4.1 Buffering
   4.2 Cut detection
   4.3 Preprocessing the frames
       4.3.1 Global preprocessing
   4.4 Processing the frames
   4.5 Seam carving
       4.5.1 Seam carving preprocessing
       4.5.2 Method implementation
   4.6 Pan & scan
       4.6.1 Pan & scan preprocessing
       4.6.2 Method implementation
   4.7 Graphical user interface

5 Results
   5.1 Content preservation
   5.2 Perceived quality
   5.3 Combined methods
   5.4 Method throughput
   5.5 Saliency measures
   5.6 Measure throughput

6 Discussion
   6.1 Conclusions
   6.2 Future work

Bibliography


Notation

·       The dot product

∇f      The gradient of the function f

∇·      The divergence operator, giving the divergence of a vector field

∇²      The Laplace operator, giving the Laplacian of a scalar function

⊖       The across-scale difference operator between two maps. It is calculated by interpolating the map at the coarser scale to the same size as the map at the finer scale, and subtracting the interpolated coarser map from the finer map.

N(·)    Normalizing operator. First normalizes all values in a map to the range [0..1]. Then calculates the average m of all local maxima except the global maximum M. Lastly multiplies the map by (M − m)².

h_N(·)  Function which calculates the N-bin histogram of an image


Chapter 1

Introduction

Video retargeting is the problem of changing the aspect ratio of an existing video sequence into the aspect ratio of an arbitrary display, while preserving the viewers' experience [18]. This is often solved by the automatic method letterboxing, where mattes are inserted into the video to reach the target aspect ratio with the original composition preserved. It can also be solved by more or less automatic methods such as pan & scan. Content-aware video retargeting is the collective name for methods that change the aspect ratio of a video while trying to preserve its important parts.

1.1 Problem statement

The purpose of this thesis is to look into the possibilities of doing automatic or semi-automatic retargeting of video sequences built up of more than one shot. Different techniques will be evaluated, with particular focus on a generalization of the seam carving technique proposed by Shai Avidan and Ariel Shamir in their paper on content-aware image resizing, published in 2007 [1]. The problem statement is as follows:

“Can seam carving be generalized to video, and combined with other retargeting techniques, to do automatic or semi-automatic video retargeting?”

1.2 Method

This thesis has been put together by dividing the work into a set of steps. The first step involved a survey of the literature in the areas of cut detection, video and image retargeting, and image and video editing. From these sources a set of techniques was chosen and implemented in a prototype application. The prototype application was used to evaluate the algorithms and to compare their outputs visually. A rather subjective comparison was made, since there was no obvious way to measure the actual quality of the outputs.


In the final chapters the results are presented and discussed, and ideas for future work are given.

1.3 Glossary

Table 1.1. Cut detection glossary

Shot: A sequence of frames recorded uninterrupted by one camera
Cut: A transition from one shot to another
Hard cut: A sudden cut
Soft cut: A gradual transition from one shot to another, often a sequence of frames that belongs to both shots
Dissolve: A soft cut where the frames gradually interpolate from one shot to the next
Wipe: A gradual spatial transition from one shot to the next
Hit: A correctly found cut
Missed hit: A cut that was not found


Table 1.2. Retargeting glossary

Connectivity: The pixels of a seam must be connected
Content-aware retargeting: Retargeting that takes the contents of the image or image sequence into consideration
Monotonicity: The seam must include one and only one pixel in each row in the vertical case, and one and only one pixel in each column in the horizontal case
Pan & scan: A rectangle of the image with the wanted aspect ratio is cut out and used as the new retargeted image; in image sequences this may introduce virtual cuts and pans
Poisson reconstruction: The carved image is reconstructed by solving the Poisson equation, given the pixels at the borders of the image and the derivative estimates of the image
Retargeting: Changing the aspect ratio of an image, or a sequence of images
Saliency map: An image defining the importance of each pixel in the image being retargeted
Seam: An 8-connected path of pixels with width 1 from the top to the bottom or from the left to the right of an image. For the path to be a seam, it must conform to the monotonicity and connectivity constraints.
Seam carving: The removal of the seams with the lowest energy from an image


Chapter 2

Cut detection

To be able to use different retargeting methods for different parts of a video sequence, there is a need to divide the video in a clever way. Frames that are processed with the same method should be similar, since this increases the chances of the same method working well for all of them. Similar frames could be found by doing a full search through the entire video sequence, comparing all frames and grouping them. This would however be extremely time consuming, and would also yield other problems, such as frequent changes of method when watching the sequence in the correct order. A faster method, which does not suffer from the problem of frequent method changes, is locating each shot in the sequence. The frames in a shot have a good chance of being alike and are temporally located next to each other. This method is much faster because we do not have to do a full search; it is enough to find the two frames where a scene change has occurred. This problem is in the literature called cut detection, and several suggested solutions exist, with different advantages and disadvantages.

Cut detection is a classification problem where we want to classify pairs of frames into the classes cut or no cut. Since we have a classification problem, the performance of a method can be quantified by measuring the number of correct hits, missed hits and false hits. A correct hit is a detected cut where a cut is truly present, a missed hit occurs when a cut is present but was not detected, and a false hit occurs when a cut is detected but none was truly present. Recall and precision are often used to evaluate these measures [2].

Recall = \frac{Correct}{Correct + Missed}    (2.1a)

Precision = \frac{Correct}{Correct + False}    (2.1b)

In the case of video retargeting we want to maximize the number of correct hits while minimizing the number of missed hits. On the other hand, it is not crucial to avoid false hits, since these will not make it harder to choose the correct method; but if false hits are too frequent, there will be a problem of too frequent changes of retargeting method. This means we want a recall close to one, while the precision is of less importance.

2.1 Histogram

A histogram of an image is found by counting the number of occurrences of each color in the image. The N-bin histogram is found by putting occurrences of colors in the bin which most closely represents their value; the count of each bin is then plotted. In the case of an image I with 9 pixels and 3 different colors taking values from the set {1, 2, 3}, the 3-bin histogram h_3(I) may look like in Figure 2.1, where five pixels have the value 1, two pixels have the value 2 and two pixels have the value 3. If a 5-bin histogram had been calculated, it would have two empty bins, since only three different pixel values exist in the picture. The 1-bin histogram, on the other hand, would have 9 occurrences in its single bin, since all pixel values would be put in that bin.

Figure 2.1. Example of histogram calculations: (a) a small image with 3 × 3 pixels; (b) a 3-bin histogram of the image in (a), with pixel value on the x-axis and occurrences on the y-axis.


2.2 Considered methods

As mentioned, several methods for solving the cut detection problem have been suggested. Some of the available methods are based on feature tracking [17], correlation [10], histograms, DCT, motion and block matching [2]. As stated earlier, the application of video retargeting values a high recall ratio, while precision is less important. Knowing this, Boreczky and Rowe recommend the use of a running histogram method, supported by motion vectors to reduce the number of false hits, in their comparison of video shot boundary detection techniques [2]. Test results reported by Browne et al in their evaluation of video shot boundary detection algorithms agree that histogram methods are the best choice when trying to achieve a high recall ratio [3].

2.3 Running histogram based cut detection

The running histogram algorithm for cut detection works by computing a gray-scale histogram over each frame of the video sequence. The histogram difference is computed between consecutive frames and is compared to a threshold T_high. If the difference exceeds this threshold, a hard cut has been detected. If the difference exceeds a low threshold T_low at frame i, the start of a gradual transition is flagged. Inside a potential gradual transition, the histogram difference between frame i + j and frame i is computed. If this difference drops below T_low for at least two consecutive frames, the flag is removed; otherwise the gradual transition is over when the difference exceeds the threshold T_high. A disadvantage with this method appears when we have a long gradual transition, which makes the difference between frame i + j and frame i grow larger than T_high before the gradual transition has ended. Due to this, the same gradual transition may be detected as multiple different gradual transitions. One solution could be to increase the value of T_high, lowering the probability that this will happen. However, this also means we will not find all the hard cuts we want.

Rather than using the running histogram method above, we propose an alternative variant. The histogram difference is computed as above, and when the difference rises above T_2, a potential cut is flagged. As long as we have a potential cut, we calculate the difference between frame i and frame j, where j is the first frame of the potential cut and i is the current frame:

d_1(i, j) = \frac{1}{N} \sum_{k=1}^{N} |h_i(k) - h_j(k)|, \quad h_i = h_N(I(i))    (2.2)

h_i(k) is the value of the k:th bin of the histogram h_i, and h_N is the mapping of the image I(i) to its N-bin histogram. We also calculate the difference between the current frame and the previous frame:

d_2(i) = \frac{1}{N} \sum_{k=1}^{N} |h_i(k) - h_{i-1}(k)|    (2.3)


In both Equations 2.2 and 2.3, N is the number of bins of the histogram calculated by the function h_N. If d_2(i) goes below T_2 again within 2 frames, we have found a hard cut. Otherwise, if d_1(i, j) - d_1(i - 1, j) stays below a threshold T_1 for at least 5 consecutive frames, we have found a gradual transition. T_1 tells us when the gradual transition has passed, which will be when the difference between the first frame of the transition and the current frame is no longer increasing.


Chapter 3

Content-aware video retargeting

Literature suggests a wide set of techniques for content-aware video retargeting [12, 7, 18]. There are methods which do nonlinear scaling, cropping and combinations thereof. This thesis will concentrate on a new method based on the work of Shai Avidan and Ariel Shamir on content-aware image retargeting [1]. As a complement, a cropping method – pan & scan – will be used. Comparisons will also be made with bi-cubic scaling. Bi-cubic scaling is a method where the video is spatially resampled to reach the target size, and the new pixel values are interpolated from the old using a bi-cubic interpolation function.

3.1 Saliency maps

All content-aware retargeting techniques depend on knowledge of what is important in a video sequence. If the measure of importance is bad for a given sequence, the retargeting will focus on the wrong parts of the video, and the result will be useless. Seam carving is based on the assumption that interesting pixels of an image have relatively high energy values given some energy measure, and thus are not cut away. This is of course not completely true for all images, and to get a reasonably general video retargeting algorithm there is a need to reduce this problem. Avidan and Shamir suggest the use of saliency measures to guide the resizing process in their paper on seam carving for images [1]. A saliency measure is a way of finding prominent features in an image. A saliency map is an image with saliency measures for each pixel of the original image. Different saliency measures can be combined into the saliency map; examples of saliency measures are feature and object detectors. For video sequences another kind of saliency measure becomes available: the measure of motion. Research has also looked for more general focus-of-attention models which could be used to find saliency maps. As an example, Itti, Koch and Niebur combine a large set of feature maps in different scales to find a general saliency map in their paper on saliency-based visual attention models [4]. Their model is also extendable to additional feature measures, so a face detection algorithm could be included. The saliency map can be constructed from the saliency measures through a weighted sum as in Equation 3.1, where S is the saliency map, S_i is each of the N saliency measures and w_i is a weight associated with S_i. An example of why saliency maps are needed can be seen in Figure 3.1.

S = \sum_{i=1}^{N} w_i S_i    (3.1a)

\sum_{i=1}^{N} w_i = 1    (3.1b)

Figure 3.1. Seam carving applied to one frame of the road sequence: (a) the road image; (b) the road image with the seams to be removed colored in red; (c) the road image with the seams removed.

3.1.1 Face detection

Faces are a kind of object which humans easily and often focus on, and we are sensitive to distortion in faces. For these reasons it is a good idea to include faces in the saliency map, to prevent them from being distorted by the seam carving. In this work we chose to use the object detection algorithm proposed by P. Viola and M. Jones in their paper on object detection, trained for detecting faces, since it is both accurate and fast [14]. The detection works by applying a cascade of increasingly more complex classifiers. In that way the background can be discarded early, and more time can be spent on regions with potential faces. Whenever a classifier fails, the region is discarded. The classifiers are based on simple rectangular, Haar-like features, which can be calculated in constant time at any scale and location from an integral image. Three different directions of the features are used (horizontal, vertical and diagonal), and three kinds of features are looked for: two-rectangle, three-rectangle and four-rectangle features. The features, as shown in Figure 3.2, are calculated as the difference between the sum of the black rectangles and the sum of the white rectangles. When the features have been calculated, a detection sub-window of the image is selected; this sub-window is where we check for faces. If this sub-window is of size 24 × 24, the complete set of features to compute is 180000. Computing this complete set is too expensive, and therefore a small set of features is selected and later combined to form an effective classifier. A weak classifier is made up of one of these features, and all weak classifiers are evaluated on a training set of images of both faces and non-faces. Depending on the precision of a weak classifier, it is assigned a weight, so that when combined into the effective classifier, weak classifiers with low precision are weighted low. The strong classifier is built by adding weak classifiers, evaluated on the training set, until the false detection rate is low enough. This learning process must only be done once, and the resulting detector can then be used in the system. As mentioned earlier, a set of strong classifiers is constructed and applied in a cascade, where a fail means that the region is discarded.


Figure 3.2. Example of Haar features. The large white rectangles illustrate the detection sub-window. Features are calculated as the difference between the sum of the black rectangles and the sum of the white rectangles.

3.1.2 Motion estimation

In video sequences motion is an important factor. Without motion, a video sequence would only be perceived as a set of still images appearing one after another. For example, consider a stationary camera recording a ball. If the ball is not moving either, the shot could be expressed by just a single frame. But if the ball or the camera moves, motion is introduced, and multiple frames are needed to visualize what is happening in the shot. This makes it reasonable to assume that it is often desirable to keep the parts of a frame where we have moving objects. The optical flow M can be used as a measure of motion between the frames. Lucas and Kanade [8] provide a method for finding the optical flow between two frames using the images' spatial and temporal gradients, assuming a locally constant optical flow. This method is derived by minimizing the error ε as in Equation 3.2, where W is the region of integration, i.e. the neighborhood around a pixel which is used to estimate its flow, and w(x) is a weighting function. The weighting function is usually selected to be 1, or a Gaussian bump which is 1 over the center pixel. d is the disparity between the first frame I and the second frame J, a vector which tells us where a pixel has moved from one frame to the other.

\min_d \varepsilon = \iint_W (J(x + d) - I(x))^2 w(x) \, dx    (3.2)

Taylor series expansion followed by differentiation makes it possible to set the derivatives to zero to find the minimum. This leads to Equation 3.3, where the weighting function w(x) has been set to 1.

\iint_W (J(x) - I(x) + d^T \nabla J(x)) \nabla J(x) \, dx = 0    (3.3)

Rearranging these terms, moving the first two terms to the right side, gives Equation 3.5, where T is the structure tensor, see Equation 3.9 [5]. Solving this equation for the disparity d in each point of the frames gives a vector field M, built up by disparity vectors, which we call the optical flow.

\left( \iint_W \nabla J(x) \nabla^T J(x) \, dx \right) d = \iint_W (I(x) - J(x)) \nabla J(x) \, dx    (3.4)

T d = e    (3.5)

Another way to find the optical flow is to do block matching, dividing frame i into N equally sized blocks and trying to find the best match in frame i + 1. The optical flow is then defined by the vectors from the blocks in frame i to the matching blocks in frame i + 1. The advantage of the structure tensor method is that it gives sub-pixel accuracy, while the block matching method only gives integer accuracy.

There is a set of problems that occur when using motion estimation techniques to find moving objects in a video sequence. Often not only the objects are moving, but also the camera, making every pixel in the sequence appear to be moving. This can be addressed by finding the camera's ego-motion and subtracting it from the optical flow. The ego-motion can come from camera panning, zooming, rotation and tilting, each giving a different kind of optical flow. Liu and Gleicher, in their paper on retargeting using a fisheye-view warping technique [7], use an affine model (Equations 3.6 and 3.7) for global motion, proposed by Wang and Huang in their paper on motion analysis in the MPEG domain [16], which they fit to the optical flow. The model can be fit to the optical flow M by solving the least squares problem in Equation 3.8 [16]. This method makes the assumption that the camera's ego-motion actually is the dominant motion.

M_{cam}(x, y) = \begin{pmatrix} zoom & rotate \\ -rotate & zoom \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} pan \\ tilt \end{pmatrix}    (3.6)

M_{cam}(x, y) = \begin{pmatrix} zoom & rotate & pan \\ -rotate & zoom & tilt \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix}    (3.7)

\begin{pmatrix} x_1 & y_1 & 1 & 0 \\ x_2 & y_2 & 1 & 0 \\ \vdots & \vdots & \vdots & \vdots \\ y_1 & -x_1 & 0 & 1 \\ y_2 & -x_2 & 0 & 1 \\ \vdots & \vdots & \vdots & \vdots \end{pmatrix} \begin{pmatrix} zoom \\ rotate \\ pan \\ tilt \end{pmatrix} = \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ v_1 \\ v_2 \\ \vdots \end{pmatrix}    (3.8)

The global motion is then subtracted from the optical flow, and pixels where the difference is less than a threshold T_m are assigned to the set of background pixels; otherwise they are assigned to the foreground. This is repeated until the variance of the model parameters is below a threshold T_var. Liu and Gleicher also suggest that the motion saliency map should be filtered at the end, as pixels cohering to the same object should have the same saliency measure. To preserve the contrast between different objects, a bilateral filter is suggested [7].


Testing this method on a couple of different shots revealed that it is difficult to segment the objects from the background. It is possible to get a rough estimate of the camera ego-motion, but subtracting the ego-motion vector field from the optical flow does not find the objects. This has to do with the perspective and depth in the frames; e.g. the background moves more slowly far from the camera than close to it. Also, if objects whose size is significant compared to the image size are moving in the scene, these will disturb the camera motion estimation. Another problem is estimating the optical flow in neighborhoods of the image which lack texture and where we have the aperture problem. The aperture problem, illustrated in Figure 3.3, occurs because the processing done when estimating the optical flow works in small neighborhoods of the image. This neighborhood can be seen as an aperture floating over the image, inside which the processing is done. This leads to the problem of not being able to find the corresponding points in frame i and frame i + 1 where the neighborhood is too simple. In Figure 3.3 the too simple neighborhood is illustrated as a line; if the neighborhood had contained a corner, this problem would not have appeared. The aperture problem may be partly solved by looking at a certainty measure calculated from the eigenvalues of the structure tensor (see Equation 3.9), as in Equation 3.10. Where the certainty is below a threshold, we have a neighborhood where we cannot estimate the optical flow. Despite these problems, the camera ego-motion estimate is usable to identify shots with significant panning or tilting, thus making it possible to identify shots which should not be retargeted using seam carving.

Figure 3.3. The aperture problem illustrated with a moving line. In A the true motion vectors are found since the ends of the line are visible. In B, where only the part of the line inside the circle (the aperture) is visible, the true motion vectors are impossible to find.

T = \nabla f \nabla^T f = \lambda_1 e_1 e_1^T + \lambda_2 e_2 e_2^T = \begin{pmatrix} \left(\frac{\partial f}{\partial x}\right)^2 & \frac{\partial f}{\partial x}\frac{\partial f}{\partial y} \\ \frac{\partial f}{\partial x}\frac{\partial f}{\partial y} & \left(\frac{\partial f}{\partial y}\right)^2 \end{pmatrix}    (3.9)


c = \left( \frac{\lambda_1 - \lambda_2}{\lambda_1 + \lambda_2} \right)^2    (3.10)

3.1.3 Saliency from combined multiscale image features

Itti, Koch and Niebur proposed a method to generate saliency maps inspired by the behavior and the architecture of the early primate visual system [4]. They fed the saliency map into a winner-take-all neural network to find the areas of attention in the image in a sequential order.

Three different types of features are extracted in multiple spatial scales; color, intensity and orientation. The scales are created using dyadic Gaussian pyramids, i.e. low-pass filtering and subsampling by a factor 2 until the desired scale is reached. The intensity pyramid is found by taking the mean of the color images in each scale. The color pyramid is found by normalizing each color in the pyramid by the intensity, thus separating color from intensity. Red, green and blue color channels are combined into red, green, blue and yellow channels. The orientation pyramid is found by filtering the intensity pyramid with a filter bank of Gabor filters of different orientations.

Gabor filters are the product of a cosine grating and a 2D Gaussian function, as in Equation 3.11. The rotation R^T(θ) defines the orientation of the filter.

g(x', y') = e^{-\frac{x'^2 + \phi^2 y'^2}{2\sigma^2}} \cos\left( \frac{2\pi x'}{\lambda} + \rho \right)    (3.11)

\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = R^T(\theta) \begin{pmatrix} x \\ y \end{pmatrix}

Itti et al recommend using four orientations θ ∈ {0°, 45°, 90°, 135°}. Each feature pyramid is combined using center-surround differences, defined as the difference between fine and coarse scales as in Equation 3.12, where c ∈ {2, 3, 4} and s = c + δ, δ ∈ {3, 4}. Here ⊖ is the across-scale difference operator between two maps. It is calculated by interpolating the coarser scale to the same size as the finer scale, and subtracting the interpolated coarser map from the finer map.

CS_{A,B}(c, s) = |A(c) \ominus B(s)|    (3.12)

For color channels, a special case of center-surround differences is constructed, with the motivation that in the visual cortex the center is excited by one color and inhibited by another, while in the surround the opposite is true [4]. See Equation 3.13 for the case of red and green.

CS_{R-G, G-R}(c, s) = |(R(c) - G(c)) \ominus (G(s) - R(s))|    (3.13)

Finally, these center-surround differences are normalized using a normalization operator which weights maps with few strong peaks high, and maps with many peaks of approximately the same size low. The normalization is done by first normalizing all values in a map to the range [0..1], then calculating the average m of all local maxima except the global maximum M. Lastly the map is multiplied by (M - m)^2. These normalized maps are reduced to the same scale and summed together, giving three conspicuity maps I, C, O at scale 4. These maps are normalized again, and their mean is the saliency map, see Equation 3.14.

S = \frac{1}{3} (N(I) + N(C) + N(O))    (3.14)

3.2 Seam carving

Seam carving is originally a still-image retargeting method proposed by Avidan and Shamir in their paper on content-aware image resizing [1]. The idea of the method is to find horizontal and vertical seams through the image with minimum energy, given an energy measure. A seam is defined as an 8-connected path of width one through the image. In the case of vertical seams, the seam goes from the top to the bottom of the image, and in the case of horizontal seams, from the left to the right. The seams must also conform to two constraints:

Monotonicity: the seam must include one and only one pixel in each row in the vertical case, and one and only one pixel in each column in the horizontal case.

Connectivity: the pixels of a seam must be connected.

Apart from carving seams, it is also possible to insert seams using the same method. All processing is done in the same way, but instead of carving the seam, a new seam is inserted and its pixel values are interpolated from its neighboring pixels. In the case of retargeting video from one aspect ratio to another, it is interesting to combine seam insertion and seam carving in a way that gives the least visual distortion to the viewer.

3.2.1 Video generalization

The simplest way to use the seam carving method on video sequences would be to apply it to each frame separately, considering each frame an independent still image. Doing so ignores the fact that video frames are temporally highly statistically dependent, a dependency which, if destroyed, introduces severe distortion to the video sequence as a whole. To prevent this, the method must be modified in a way that takes this dependency into account. One way of doing this is to divide the frames into clusters and calculate the mean energy of each cluster of N frames as in Equation 3.15.

\bar{e}_i = \frac{1}{N} \sum_{k=0}^{N-1} e(I(i \cdot N + k))    (3.15)


where I(t) is the intensity image of the frame at time t, see Equation 3.16.

I(t) = I(x, y, t)    (3.16)

This mean energy image is used to calculate the cost function for the set of N frames that it represents, as in Equation 3.17. The same equation can be used for calculating the cost map for horizontal seam carving if the energy function is rotated 90° before the cost map is calculated.

M_i(j, k) = \bar{e}_i(j, k) + \min \begin{cases} M_i(j-1, k-1) \\ M_i(j-1, k) \\ M_i(j-1, k+1) \end{cases}    (3.17)

The cost function calculation can, from an image analysis perspective, be seen as a structuring element (see Figure 3.4) applied right to left, bottom-up, iteratively on the energy image, starting at the second to last row. The structuring element floats over the energy image, and the minimum of the three bottom pixels is added to the position of the image where the origin of the structuring element is placed. The ones in the boxes define how the energy values are weighted. Everything outside the energy image is defined as having infinite energy.

Figure 3.4. Min-structuring element
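A compact Matlab sketch of the dynamic program in Equation 3.17, together with the backtracking of one vertical seam, is given below. It scans top-down, which is equivalent to the bottom-up scan described above; the function name is illustrative.

    % Sketch: cost map of Equation 3.17 plus backtracking of one vertical seam.
    % e: energy image (rows x cols). seam(j) is the seam's column in row j.
    function seam = find_vertical_seam(e)
        [rows, cols] = size(e);
        M = e;
        for j = 2:rows
            left  = [inf, M(j-1, 1:cols-1)];   % outside the image: infinite energy
            up    = M(j-1, :);
            right = [M(j-1, 2:cols), inf];
            M(j, :) = e(j, :) + min(min(left, up), right);
        end
        seam = zeros(rows, 1);
        [~, seam(rows)] = min(M(rows, :));     % the cheapest seam ends here
        for j = rows-1:-1:1                    % backtrack, moving at most 1 column
            k = seam(j + 1);
            lo = max(k - 1, 1); hi = min(k + 1, cols);
            [~, idx] = min(M(j, lo:hi));
            seam(j) = lo + idx - 1;
        end
    end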

The same cost function is used to carve seams from each of the N frames, resulting in the same seams being removed from all of them. With a large enough N, this solution makes use of the temporal dependency between the frames. However, as N grows, so does the probability that we will carve seams through important parts of one of the frames. Another advantage of this method is the low computational cost, since we only need to compute one cost map for each set of N unique frames. A better way to take the temporal dependency into account is to compute the cost function from a weighted mean of energy images, where temporally closer frames are weighted higher than frames further away, as in Equation 3.18. This is better since a frame is more likely to look like its neighboring frames than frames further away in time. The drawback is that the cost function must be recomputed for every frame, making it N times slower than the previous suggestion.

e_{weighted}(t) = \frac{1}{\sum_{j=1}^{2N+1} w_j} \sum_{k=-N}^{N} w_{k+N+1} \, e(I(t + k))    (3.18)


Avidan, Rubinstein and Shamir propose in a recent paper on seam carving for video [12] that the seams should be generalized by applying the same constraints (listed in Section 3.2) in the temporal dimension as well as the spatial dimensions. That is, in the case of vertical seam carving, the seams are not allowed to move more than one pixel horizontally between the frame at time t and the frame at time t + 1. In the case of horizontal seam carving, seams must not move more than one pixel vertically between frames. For this to be possible, we need to know the correspondences between seams in frame t and seams in frame t + 1. This is solved by finding the optimal sheet through the video rather than finding one seam at a time in each frame. By doing seam carving this way, we can easily prevent the seams from making spatial skips through time, and flickering is avoided. The advantage of this method is that we get somewhat adapting seams. Finding the optimal sheet given these constraints is a much more difficult problem which cannot be solved efficiently by dynamic programming. Avidan et al propose the use of graph cuts for finding the optimal sheets, modified to take the seam constraints into consideration [12]. A graph is constructed with one node per voxel in the image volume, and each node has edges ("arcs" in their paper) to its neighbors. Each edge is assigned a cost given by the energy function used. In the case of vertical seam carving, a source node S is connected to the leftmost column of nodes and a sink node T is connected to the rightmost column of nodes, with infinite weights on the edges (Figure 3.5). The graph cut is then defined by splitting the graph into two subsets, S_set and T_set, and the cost of the cut is the sum of the edges dividing the subsets. To force the graph cut to follow the seam constraints, the vertical edges are removed and infinite weights are inserted on edges directed from the sink to the source horizontally and diagonally. The graph for horizontal seam carving is constructed in the same way, but with the sink and source nodes connected to the top and bottom rows of the graph. In their paper, Avidan et al prove why this graph construction follows the seam constraints [12].

Figure 3.5. Two dimensional graph with source S and sink T. A potential cut is marked in red. All edges are directed in both ways; their weights are not shown. This graph must be modified for the cut to follow the seam constraints.

Finding the cut with the lowest cost in a video sequence has a computation time which depends on the number of nodes times the number of edges in the graph.

Figure 3.6. An example of a 3D graph cut. This example represents a movie with 5 frames, each with a width and height of 5 pixels.

This means that the computation time is quadratic in the number of voxels in the video sequence. Avidan et al propose improving performance by first solving the problem on an N-times subsampled, coarser graph, refining the solution at each level n until the graph with the original resolution is reached. Even so, this method will be significantly slower than the previously mentioned method. It is also worth mentioning that retargeting a 100-frame (4 seconds at 25 frames per second) 1080p high-definition video sequence (a resolution of 1920x1080 pixels) with this method would require around 1.5 GB of memory. A solution to the memory problem would be to divide the sequence into smaller parts and process one part at a time. But the shorter each part is, the worse the results will be; at some point this method will give worse results than the mean energy method using dynamic programming. Thus we have chosen not to implement and evaluate the graph cut method.


3.2.2 Energy functions

In their earliest paper on seam carving, Avidan and Shamir recommend a set of energy functions to be used with the seam carving operator [1]. The function that generally seems to work best on their test images is the L1-norm of the gradient, defined as in Equation 3.19.

e_1(I) = \left| \frac{\partial}{\partial x} I \right| + \left| \frac{\partial}{\partial y} I \right|    (3.19)

In this work the L2-norm of the image gradient, defined as in Equation 3.20, is also used.

e_2(I) = \sqrt{ \left( \frac{\partial}{\partial x} I \right)^2 + \left( \frac{\partial}{\partial y} I \right)^2 }    (3.20)
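Both energies are essentially one-liners in Matlab; a sketch for a grayscale image I follows.

    % Sketch: gradient-based energies of Equations 3.19 and 3.20.
    [dx, dy] = gradient(double(I));  % central-difference image gradients
    e1 = abs(dx) + abs(dy);          % L1-norm of the gradient (Eq. 3.19)
    e2 = sqrt(dx.^2 + dy.^2);        % L2-norm of the gradient (Eq. 3.20)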

Avidan et al also suggest a new energy measure in their second paper on seam carving [12]. This new measure captures how much new energy the seam carving inserts into the image, rather than how much energy is removed from it, which is how the measures from the earliest article on seam carving work. Minimizing the inserted energy means that we try to minimize the amount of new edges in the image after seam carving. This is a good idea under the assumption that edges induced by seam carving are very noticeable and disturbing to the viewer. The measure is referred to as forward energy, since it is a measure of the energy after a potential seam has been carved. The new energy function takes different values depending on which path the seam takes through the image; it is defined as in Equation 3.21. Since these energy functions depend on the direction of the seam, the horizontal case must be calculated differently: the image can be rotated 90°, after which the same equations can be used.

C_L(j, k) = |I(j, k+1) - I(j, k-1)| + |I(j-1, k) - I(j, k-1)|
C_U(j, k) = |I(j, k+1) - I(j, k-1)|    (3.21)
C_R(j, k) = |I(j, k+1) - I(j, k-1)| + |I(j-1, k) - I(j, k+1)|

Thus the mean of each function over time can be calculated as

\bar{C}_d^i = \frac{1}{N} \sum_{k=0}^{N-1} C_d(I(i \cdot N + k)), \quad d \in \{L, U, R\}    (3.22)

To be able to calculate the cost map from the forward energy function, the cost map must be extended to take into account that the energy function depends on the path the seam takes. The new cost map can be calculated as in Equation 3.23.

M_i(j, k) = P(j, k) + \min \begin{cases} M_i(j-1, k-1) + \bar{C}_L^i(j, k) \\ M_i(j-1, k) + \bar{C}_U^i(j, k) \\ M_i(j-1, k+1) + \bar{C}_R^i(j, k) \end{cases}    (3.23)
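The three forward energy terms of Equation 3.21 can be computed for a whole frame with shifted copies of the image, as in the Matlab sketch below. Border pixels wrap around here, which a real implementation would handle explicitly.

    % Sketch: forward-energy terms of Equation 3.21 for a grayscale frame I.
    I  = double(I);
    IL = circshift(I, [0  1]);    % I(j, k-1), left neighbor
    IR = circshift(I, [0 -1]);    % I(j, k+1), right neighbor
    IU = circshift(I, [1  0]);    % I(j-1, k), neighbor above
    CU = abs(IR - IL);            % cost when the seam continues straight up
    CL = CU + abs(IU - IL);       % cost when the seam comes from the left
    CR = CU + abs(IU - IR);       % cost when the seam comes from the right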


3.2.3 Poisson reconstruction

Removing a large number of seams from an image may result in sharp spatial transitions in color which were not visible in the original image. In their first paper on seam carving, Avidan and Shamir suggest the use of Poisson reconstruction [9] to smooth these transitions. This can be done by carving the seams from the gradient images rather than from the original intensity images. When all seams have been carved, the image is reconstructed from the gradient images by solving the Poisson equation with Dirichlet boundary conditions, shown in Equation 3.25. The boundary values are given by the boundary of the original intensity image. This is essentially a least squares problem: find the image whose gradient minimizes the difference to the carved gradient, see Equation 3.24. The solution to the minimization problem is found by setting the gradient of the expression to zero, which gives the Poisson equation in Equation 3.25. The same solution can be reached by seeing ∇ as a linear mapping which we can transpose; we can then form the normal equations as \nabla^T \nabla f = \nabla^2 f = \nabla^T G = \nabla \cdot G.

\min_f \iint_\Omega \| \nabla f - G \|^2 \, dx \, dy, \quad f|_{\partial\Omega} = f^*|_{\partial\Omega}    (3.24)

\nabla^2 f = \nabla \cdot G \text{ over } \Omega, \quad f|_{\partial\Omega} = f^*|_{\partial\Omega}    (3.25)

In Equations 3.24 and 3.25, f^* is our original intensity image and f holds our unknown intensity values. ∂Ω denotes the boundary of the unknown subset Ω; thus the Dirichlet boundary condition is used. G is a guidance vector field, in this case given by the gradient of the original intensity image f^*.
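Discretized with the standard 5-point Laplacian, Equation 3.25 can be solved by simple fixed-point iteration. The Matlab sketch below uses Jacobi updates; it is an illustrative solver, not the multigrid approach mentioned below for the 3D case, and it assumes f has been initialized with the known boundary values.

    % Sketch: Jacobi iteration for the 2D Poisson equation (Eq. 3.25) with
    % Dirichlet boundaries. divG is the divergence of the guidance field G;
    % the boundary rows and columns of f are kept fixed.
    function f = poisson_jacobi(f, divG, iters)
        for it = 1:iters
            f(2:end-1, 2:end-1) = ( ...
                f(1:end-2, 2:end-1) + f(3:end, 2:end-1) + ...   % vertical neighbors
                f(2:end-1, 1:end-2) + f(2:end-1, 3:end) - ...   % horizontal neighbors
                divG(2:end-1, 2:end-1)) / 4;                    % 5-point Laplacian
        end
    end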

In the case of retargeting video, the video sequence can be seen as a volume of voxels. We can then use the three-dimensional Poisson equation to find the three-variable function which comes closest to having the three-dimensional gradient ∇f(x, y, t). Wang, Raskar and Ahuja have generalized the idea of Poisson image editing [9] to Poisson video editing [15]. They formulate the problem as finding the potential function f whose gradient is closest to the vector field G_{3D}, as in Equation 3.26.

\min_f \iiint \| \nabla f - G_{3D} \|^2 \, dx \, dy \, dt    (3.26)

They derive the solution as the solution of the three-dimensional Poisson equation in Equation 3.27.

\nabla^2 f = \nabla \cdot G_{3D}    (3.27)

\nabla f \cdot \hat{n} = 0    (3.28)

In the paper they propose the use of Neumann boundary conditions, as in Equation 3.28, where \hat{n} is the normal of the boundary. This means that the intensity function we are trying to find should have a gradient that is orthogonal to the boundary normal. The solution is found by solving a large system of linear equations. Since this problem is much more time consuming to solve than the 2D case, they suggest solving it with a 3D multigrid Poisson solver, for example the one proposed by Roberts in his paper on multigrid solutions of Poisson's equation [11]. It is worth noting that solving the 3D Poisson equation also places a much higher demand on memory, since we must store one more dimension of gradients.


3.3 Pan & scan

Pan & scan is a retargeting method based on cropping the video frames. Retargeting using pan & scan can either be done by an operator or automatically. When the retargeting is done automatically, the cropping window is usually found by searching for the window with the wanted size and aspect ratio which maximizes an information measure of a frame. Cropping frames in this way introduces motion which appears like camera motion when the cropping window moves around [7]. This should be prevented and restricted to cinematically plausible motions. Liu and Gleicher, in their paper on automatic pan & scan, suggest that the cropping window motion in a shot is limited to three shot types: cropping, virtual panning and virtual cutting. A shot is handled as a single unit, thus only allowing one type of virtual camera motion per shot.

3.3.1 Cropping

Cropping is done by selecting a single cropping window for each shot. The window is found by maximizing the information measure over the entire shot; it is then cut out of each frame and used as the new frame. Liu and Gleicher use a brute force method for finding the optimal window, iterating through every possible solution and calculating the information loss. Since the same cropping window is used for all frames in a shot, the mean saliency map, potentially together with the mean energy image, can be used to find the optimal window, see Figure 3.7; a sketch of such a search is given after the figure.

Figure 3.7. The optimal window, shown in purple, overlaid on the sum of the mean energy image and the mean saliency map of the second Bo Kasper shot.
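A minimal Matlab sketch of the brute force window search follows, assuming the information measure is simply the sum of the mean saliency map S inside the window; an integral image makes each window sum O(1). Function and variable names are illustrative.

    % Sketch: exhaustive search for the [wh x ww] window maximizing retained
    % saliency. S is the mean saliency map of the shot.
    function [bestR, bestC] = find_crop_window(S, wh, ww)
        ii = cumsum(cumsum(S, 1), 2);                      % integral image of S
        ii = [zeros(1, size(ii, 2) + 1);                   % zero-pad the front
              zeros(size(ii, 1), 1), ii];
        best = -inf; bestR = 1; bestC = 1;
        for r = 1:size(S, 1) - wh + 1
            for c = 1:size(S, 2) - ww + 1
                s = ii(r+wh, c+ww) - ii(r, c+ww) - ii(r+wh, c) + ii(r, c);
                if s > best
                    best = s; bestR = r; bestC = c;        % best top-left corner
                end
            end
        end
    end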

3.3.2 Virtual panning

A linear horizontal virtual pan is introduced by finding the optimal window in each frame and doing a linear interpolation to find the function defining the movement of the window. This function can be found by solving a least squares problem, given that the start time t_1 and end time t_2 of the movement are known. Liu and Gleicher suggest an interpolation function which provides constant acceleration at the beginning and the end of the movement, to simulate a real camera pan, see Figure 3.8 and Equation 3.30. When the optimal pan has been found, a penalty measure is calculated as the sum of the information loss weighted with a pan penalty. Figure 3.9 shows the first and last optimal windows found; one window per frame between the windows shown is also calculated. The window position at each time t is given by x(t), calculated as in Equation 3.29, where t_1 and t_2 are the start and end times of the pan, and x_1 and x_2 are the start and end positions of the pan.

x(t) = \begin{cases} x_1 & t < t_1 \\ \left(1 - d\left(\frac{t - t_1}{t_2 - t_1}\right)\right) x_1 + d\left(\frac{t - t_1}{t_2 - t_1}\right) x_2 & t_1 \le t \le t_2 \\ x_2 & t > t_2 \end{cases}    (3.29)

d(u) is the interpolation function, given by Equation 3.30. To find x(t) we need to find the values of the parameters x_1, x_2, t_1, t_2 and a. We search through all possibilities of t_1, t_2 and a and fit the curve to the optimal window positions, which gives us x_1 and x_2.

d(u) = \begin{cases} \frac{u^2}{2a - 2a^2} & u < a \\ \frac{a}{2 - 2a} + \frac{u - a}{1 - a} & a \le u \le 1 - a \\ 1 - \frac{(1 - u)^2}{2a - 2a^2} & u > 1 - a \end{cases}    (3.30)

Figure 3.8. The interpolation function d(u) used to interpolate the different window positions. This function provides constant acceleration at the beginning and at the end. Time is on the x-axis and position on the y-axis.

The residual between the interpolated positions x(t) and the optimal positions \hat{x}_i is then calculated as in Equation 3.31, and the parameter set giving the smallest residual is selected.

r = \sum_{i=1}^{N} \| \hat{x}_i - x(i) \|    (3.31)

Figure 3.9. The potential first and last window positions in the virtual pan.
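The ease-in, ease-out function d(u) of Equation 3.30 is straightforward to implement; below is a Matlab sketch with acceleration parameter a in (0, 0.5).

    % Sketch: the constant-acceleration interpolation function d(u) (Eq. 3.30).
    function d = pan_interp(u, a)
        d = zeros(size(u));
        lo  = u < a;                  % accelerating phase
        mid = u >= a & u <= 1 - a;    % constant-velocity phase
        hi  = u > 1 - a;              % decelerating phase
        d(lo)  = u(lo).^2 / (2*a - 2*a^2);
        d(mid) = a/(2 - 2*a) + (u(mid) - a)/(1 - a);
        d(hi)  = 1 - (1 - u(hi)).^2 / (2*a - 2*a^2);
    end

The window position then follows as x = (1 - pan_interp(u, a))*x1 + pan_interp(u, a)*x2, with u = (t - t1)/(t2 - t1).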

3.3.3 Virtual cuts

A virtual cut is introduced by finding one window in the left part of a frame and another window in the right part of the frame; these are found by maximizing the information measure in the two windows. The new subshots must be at least L frames long. To find the time of the cut, t_c, and the order of the cut, i.e. whether the left window comes first or the other way around, all possibilities are tried and the penalty is calculated. When the new subshots have been determined, cropping may be done on each of them.

3.4 Method classification

To be able to combine different retargeting methods, we need to be able to classify which method gives the best result for a given frame. This could be done by extending Liu and Gleicher's idea to all methods, minimizing the penalty (or maximizing the information) over all methods. Unfortunately this becomes very time consuming and gives misleading results, since seam carving by construction produces the result with the minimum information loss, given its information measure. Thus seam carving would almost always be chosen as the best method, even though this might not be the case to the human eye. A better solution is to define a different goodness measure for each method. Each measure is compared to a discriminant function T_q, and if above it, the method is a candidate for the shot. The candidate which is furthest above the threshold is used. If there are no candidates, the method with a quality measure closest to the threshold is used.

3.4.1 Seam carving quality measure

Generally, seam carving is a bad choice for images with few low-energy seams. One heuristic way to decide whether seam carving should be used is to count the number of seams whose energy is below a certain level T_e. If an image is to be shrunk by N columns, the goodness of seam carving can be estimated as in Equation 3.32, where S is the set of N x-coordinates giving the smallest values of M(1, x). The same measure can be used for horizontal seam carving where the image is shrunk by N rows, but in that case S contains y-coordinates and the sum is over M(y, 1).

q_{sc} = -\frac{1}{N} \sum_{i \in S} M(1, i)    (3.32)

Another way to decide whether seam carving should be used is to look at the camera motion. In sequences with much camera motion, seam carving generally performs poorly. The parameters estimated for the model in Equation 3.6 can be used for this purpose: if the pan or tilt estimate is larger than a threshold T_translation, we can say we have too much camera ego-motion to use seam carving.
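Given a cost map M accumulated top-down as in the earlier seam carving sketch (so that seam totals end up in the last row), Equation 3.32 reduces to two lines of Matlab; the row index is simply flipped relative to the bottom-up formulation above.

    % Sketch: seam carving quality measure of Equation 3.32.
    costs = sort(M(end, :));    % total cost of the cheapest seam per end column
    qsc = -mean(costs(1:N));    % negated mean cost of the N cheapest seams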

3.4.2 Pan & scan quality measure

Liu and Gleicher describe in their pan & scan paper a combined penalty measure which is used for deciding between cropping, virtual pans and virtual cuts [7]. The combined penalty is a weighted sum of different penalty measures, each in [0, 1]. One minus this penalty is used as a quality measure when comparing with the other methods:

q_{ps} = 1 - \sum_i w_i p_i

where p_i are the individual penalty measures and w_i their weights.


Chapter 4

Implementation

A prototype application was implemented to examine the results of these retargeting techniques. The control structure of the application was implemented in Matlab, due to its strengths in visualizing results and its image and signal processing toolboxes. Parts of the prototype were implemented as Matlab mex files in C++. The optical flow estimation and the face detection were done in C++ to take advantage of the OpenCV library. The cost map calculation and the seam carving and insertion routines were also written in C++ to achieve better performance. Implementing parts of the prototype in C++ will make it easier to reuse code when implementing the technique in a product.

Figure 4.1. Statechart of the application structure: find shots, then preprocess each shot (calculate measures), then process each shot (retarget with method 1..i) and stitch the shots together.

4.1 Buffering

Most video sequences are too large to make it possible to load all frames into memory. Because of this, video processing requires a buffering procedure where the application processes a maximum of F frames at a time. The methods used in this application require that all frames cohering to the same shot are processed as a unit. A solution to this problem is to split the application into two steps: a preprocessing step and a processing step.


During the preprocessing step all frames of the shot are preprocessed, F frames at a time, and the calculations needed for the processing step are executed. In the processing step all frames are loaded into memory again, F frames at a time, to be processed. For this method to work, the calculations must not require more than F continuous frames. In this application the only calculations that require more than one frame in memory at the same time are cut detection and motion estimation.

4.2 Cut detection

The cut detection algorithm described in Section 2.3 was implemented. Pseudo code for the algorithm is given in Algorithm 1.

4.3 Preprocessing the frames

4.3.1 Global preprocessing

During the frame preprocessing, all saliency measures and energy images needed for the retargeting are calculated. To avoid problems with noise in the frames appearing as unwanted energy, it may in some cases be useful to low-pass filter an image before the energy image is computed. Another possibility is to use an edge-preserving noise reducing filter, to avoid blurring the actual edges. To be able to estimate a seam carving quality measure for the shot, the energy image is normalized before further preprocessing. Three different kinds of energy measures have been implemented: the L1- and L2-norms of the gradient (backward energy), and forward energy. All of these can be weighted together and combined with the different saliency measures. As the application has been constructed with modularity in mind, it provides a well defined measure interface for extensions with other kinds of energy or saliency measures. From the GUI the user is able to select one or more measures and assign a weight to each; the measures are then applied to the sequence, and are finally weighted together to form a saliency map and an energy image.

Since the application is supposed to combine different retargeting methods, the preprocessing step must produce a measure which makes it possible to decide whether to use seam carving on a shot or not. The measure used is an estimate of the camera motion in the shot, as described in Section 3.4.1. Figure 4.4 shows the pan and tilt estimates calculated from a sequence divided into four shots, where the first and third shots have noticeable panning and the second and fourth shots have no noticeable camera motion at all. Using the mean of the estimates over each shot, we get an estimate of the mean motion over a shot. The mean estimate is more robust to noise produced by the ego-motion model and the optical flow estimate; Figure 4.5 shows these means. The plots show that it is possible to select a threshold of about 0.5: where the tilt or pan estimate is above this threshold, we can decide that the shot contains too much camera motion for seam carving.


Algorithm 1 Find cuts in a video sequence V
Require: V of size M × N × 3 × F
Require: T_1, T_2, N_bins
for each frame f_color in V, with index i do
  f ⇐ grayscale image of f_color
  if f_color is the first frame in V then
    HF(3) ⇐ histogram(f, N_bins)
    continue with the next frame
  end if
  HF(2) ⇐ HF(3)
  HF(3) ⇐ histogram(f, N_bins)
  d_2(i) ⇐ (1/N_bins) Σ |HF(3) − HF(2)|
  if gradual transition flagged then
    d_1(i) ⇐ (1/N_bins) Σ |HF(3) − HF(1)|
    if d_2(i) < T_2 after being high for less than 2 frames then {hard cut found}
      append(C, i)
      remove gradual transition flag
    else if |d_1(i) − d_1(i − 1)| < T_1 for the first frame then
      t_lowStart ⇐ i
    else if |d_1(i) − d_1(i − 1)| < T_1 for more than 10 frames then {gradual transition found}
      append(C, ⌊(t_lowStart + t_gradStart)/2⌋)
      remove gradual transition flag
    end if
  else if d_2(i) > T_2 then
    flag gradual transition
    HF(1) ⇐ HF(3)
    d_1(i) ⇐ 0
    t_gradStart ⇐ i
  else
    d_1(i) ⇐ 0
  end if
end for
return C


[Figure: two plots of running histogram differences over frames 0–300. Panel (a) shows d1(k) and d2(k) with threshold T2; panel (b) shows d1(k) and d1(k) − d1(k−1) with threshold T1. Axes: frame n versus histogram difference between frames n and n−1.]

Figure 4.2. Running histogram differences over time. One gradual transition and two hard cuts are visible.


[Class diagram: an abstract base class Measure (fields: weight : int, name : string, measureType : string; methods: process(buffer : images, draw : bool) : image, disp(), retrieveSettings(), reset(), result() : image) with subclasses L1Norm, L2Norm, ForwardEnergy, Motion, VisualAttention, UserSaliency and FaceDetector.]

Figure 4.3. Class diagram of the preprocessing modules.

[Figure: two plots of the per-frame pan and tilt motion estimates between frame n and frame n−1, over frames 10–120.]

Figure 4.4. Pan and tilt estimates of the camera ego-motion between each frame in a sequence divided into four shots.


[Figure: plots of the mean pan estimate and the mean tilt estimate for each of the four shots.]

Figure 4.5. Mean of the pan and tilt estimates of each shot.


4.4 Processing the frames

For processing, the user is able to choose between a set of different retargeting methods, which are applied sequentially to the video. The methods implemented are seam carving, pan & scan and downsampling using bicubic interpolation. For example, it is possible to remove 50 seams, insert 20 seams and then downsample the result to the desired aspect ratio, thus getting less carving distortion at the cost of a different type of distortion from the downsampling. Also in this case the application is constructed with modularity in mind: a well defined interface exists, making it easy to extend the application with other methods as long as they follow the interface. Each method has access to both the saliency map and the energy image calculated in the preprocessing step.
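As an illustration, the example above could be organized as follows, where removeSeams and insertSeams are hypothetical stand-ins for the prototype's seam carving module and wTarget is the desired width:

shot = removeSeams(shot, 50);            % carve 50 seams (hypothetical helper)
shot = insertSeams(shot, 20);            % insert 20 seams (hypothetical helper)
[h, ~, ~, F] = size(shot);               % shot is h x w x 3 x F
out = zeros(h, wTarget, 3, F, class(shot));
for k = 1:F
    out(:, :, :, k) = imresize(shot(:, :, :, k), [h wTarget], 'bicubic');
end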

[Class diagram: an abstract base class Method (fields: name : string, settings : Settings; methods: specificPreprocess(filename : string), retarget(buffer : frames, params : RetargetParameters, draw : bool) : image, disp(), retrieveSettings()) with subclasses SeamCarving (method findSeams(params : RetargetParameters); fields sortedSeams, seams, sortedHorSeams, seamsHor : Seams), PanScan (fields cropParams : CropParams, panParams : PanParams) and BiCubicResize.]

Figure 4.6. Class diagram of the processing modules.

4.5 Seam carving

We have chosen not to implement the graph cut-based version of seam carving for video, but the version based on dynamic programming. Poisson reconstruction has been implemented in two dimensions, thus solving a 2D Poisson equation for each frame. The Poisson equation is solved using a direct analytical method based on discrete sine transforms, as described in a paper on direct analytical methods for solving Poisson equations by Simchony, Chellappa and Shao [13].
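As an illustration of the general technique (a sketch, not the prototype's exact implementation), a 2D Poisson equation with zero Dirichlet boundary conditions and unit grid spacing can be solved in Matlab with the dst/idst functions from the Signal Processing Toolbox:

[N, M] = size(f);                          % f: right-hand side of the equation
fhat = dst(dst(f).').';                    % 2D forward discrete sine transform
[jj, ii] = meshgrid(1:M, 1:N);
denom = (2*cos(pi*ii/(N + 1)) - 2) + (2*cos(pi*jj/(M + 1)) - 2);
uhat = fhat ./ denom;                      % divide by the Laplacian eigenvalues
u = idst(idst(uhat).').';                  % u solves the discrete Poisson equation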

Depending on the goal of applying seam carving to a video sequence, the desire may be to add seams, remove seams, or combine the two. For instance, when using seam carving to resize a shot from a widescreen aspect ratio (e.g. 16:9) to an aspect ratio of 4:3, any of these approaches can be used. To retarget a sequence of size w × h pixels to an aspect ratio R_desired, we can remove N seams as in Equation (4.1) or add N seams as in Equation (4.2). It is also possible to combine the two, as in Equation (4.3).

N_{remove} = w − R_{desired} h    (4.1)

N_{add} = w / R_{desired} − h    (4.2)

N_{add} = (w − n) / R_{desired} − h,   n ∈ [0, N_{remove}]    (4.3)
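As a worked example of the equations, consider retargeting a 1280 × 720 (16:9) shot to 4:3:

w = 1280; h = 720;                 % source resolution (16:9)
R = 4/3;                           % desired aspect ratio
Nremove = w - R * h                % (4.1): 320 vertical seams removed
Nadd    = w / R - h                % (4.2): 240 horizontal seams added
n = 160;                           % any n in [0, Nremove]
NaddMix = (w - n) / R - h          % (4.3): 120 seams added after removing n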

4.5.1 Seam carving preprocessing

A costmap is calculated from the saliency map and the mean energy image. From the costmap, the seams to cut and the seams to insert are found and sorted. In the case of seam carving for single images, the time cost of carving seams one at a time, with one pass through the image per seam, is not significantly high. The complexity of carving K seams from an N × M × 3 image is O(3KN(M − K/2)). In the case of video carving, on the other hand, this becomes a problem: when carving K seams from an N × M × 3 × F video, the complexity becomes O(3KN(M − K/2)F). But if the same seams are carved from each of the frames, the seams can be preprocessed before carving without noticeable cost. Consider the array S of size N × K, where row i stores the x-coordinates of the pixels to be removed from row i of the frames. This means that each column of S is a seam to be carved. Also notice that the coordinates of seam i are only correct once seams 1..i − 1 have been removed, since the seams are not sorted. The goal is to preprocess S in a way which makes it possible to carve all pixels from each row in a single pass. This decreases the complexity of video carving to O(3NMF), at the cost of having to preprocess S. The preprocessing is done by rearranging the seams row-wise, so that the first x-coordinate in row i of S is the coordinate of the leftmost pixel being removed from row i in the frames. Pseudo code for this preprocessing algorithm can be seen in Algorithm 2.

Algorithm 2 Rearrange S
Require: S of size N × K
for i = 1 to K do
    for j = 1 to N do
        c ⇐ arg min_k S(j, k)
        for k = c + 1 to K do
            if S(j, k) ≥ S(j, c) then
                S(j, k) ⇐ S(j, k) + 1
            end if
        end for
        R(j, i) ⇐ S(j, c)
        S(j, c) ⇐ ∞
    end for
end for
return R
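A runnable Matlab version of the same rearrangement, as a sketch (the prototype's production implementation is in C++):

function R = rearrangeSeams(S)
% S is N x K; S(j, k) is the x-coordinate of the pixel that seam k
% removes from row j, valid only after seams 1..k-1 have been carved.
[N, K] = size(S);
R = zeros(N, K);
for i = 1:K
    for j = 1:N
        [~, c] = min(S(j, :));       % column holding the leftmost pixel
        for k = c + 1:K              % shift later seams in this row
            if S(j, k) >= S(j, c)
                S(j, k) = S(j, k) + 1;
            end
        end
        R(j, i) = S(j, c);
        S(j, c) = Inf;               % this seam's pixel is now consumed
    end
end
end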


4.5.2 Method implementation

To speed up the seam carving processing, parts of it are implemented in a C++ library which has been connected to Matlab through C++ Mex-functions. The calculation of costmaps has been implemented in C++, since it has to be done by looping through the energy images. Two different functions for calculating the costmap exist: one for calculation from a single energy image, and one for calculation from a pixel energy image and three forward energy images. The seam carving and insertion have also been implemented in C++ and process one frame at a time. They both require a preprocessed seams matrix as input, sorted using Algorithm 2. Two helper wrapper classes have also been implemented for accessing the Matlab image data, Array3dWrap and Array2dWrap, where Array2dWrap is a special case of Array3dWrap in which the size of the third dimension is one (see Figure 4.7).
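As an illustration of the single-pass carving that the rearranged seams matrix makes possible, removing all K seams from one RGB frame could be sketched in Matlab as follows (the prototype performs this step in C++; R is the output of Algorithm 2):

[N, M, ~] = size(frame);
K = size(R, 2);
out = zeros(N, M - K, 3, class(frame));
for j = 1:N
    keep = true(1, M);
    keep(R(j, :)) = false;           % drop the K seam pixels of row j
    out(j, :, :) = frame(j, keep, :);
end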

[Class diagram: Array3dWrap (fields: data : T*, size_ : int*; methods: Array3dWrap(data : T*, size : int[]), set(row : int, col : int, channel : int, value : T), select(row : int, col : int, channel : int) : T, dim(i : int) : int) with subclass Array2dWrap (methods: select(row : int, col : int) : T, set(row : int, col : int, value : T), rows() : int, cols() : int).]

Figure 4.7. Class diagram of the C++ wrapper classes for Matlab image data.

4.6 Pan & scan

To be able to retarget a shot using pan & scan, the max energy window measure must be estimated over the shot in the preprocessing step. This measure finds, in each frame, a window of the desired width and height which maximizes the preceding measures, resulting in a list of window coordinates for each frame of the shot. The windows are restricted to the same height as the original frame, while the width is varied to obtain the desired aspect ratio.
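A minimal sketch of this window search for one frame, assuming a window width wWin and the combined saliency map saliencyMap from the preprocessing step:

colSum = sum(saliencyMap, 1);                    % saliency per column
winSum = conv(colSum, ones(1, wWin), 'valid');   % total saliency per window offset
[~, x0] = max(winSum);                           % left edge of the best window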

4.6.1 Pan & scan preprocessing

The windows returned from the preprocessing step are used by the pan & scan method to calculate the best window through the whole shot, as in Section 3.3.1, which is used as the potential crop. Pan & scan also fits a curve to these window coordinates, as described in Section 3.3.2. Two penalty measures are also calculated, one for cropping and one for panning.
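A sketch of such a curve fit, using a cubic polynomial purely as an illustrative assumption (the actual curve model is the one described in Section 3.3.2), with x0 as the vector of per-frame left edge window coordinates:

t = 1:numel(x0);                % frame indices within the shot
p = polyfit(t, x0, 3);          % cubic fit; the order is an assumption
x0smooth = polyval(p, t);       % smooth virtual pan over the shot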
