
Detection of point-shaped targets

Gunnar Farnebäck, Hans Knutsson, Gösta Granlund

Computer Vision Laboratory

Dept. of EE

Linköping University

December 18, 1996

Abstract

This report documents work done at the request of the Swedish Defence Research Establishment. The studied problem is that of detecting point-shaped targets, i.e. targets whose only significant property is that of being very small, in a cluttered environment.

Three approaches to the problem have been considered. The first one, based on motion compensation, was rejected at an early stage due to expected problems with robustness and computational demands. The second method, based on background modeling with principal components, turned out successful and has been studied in depth, including discussion of various extensions and improvements of the presented algorithm. Finally, a Wiener filter approach has also proved successful, including an approximation with separable filters.

The methods have been tested on sequences obtained by an IR sensor. While both of the latter two approaches work well on the test sequences, the Wiener filter is simpler and computationally less expensive than the background modeling. On the other hand, the background modeling is likely to have better possibilities for extensions and improvements.


Contents

1 Introduction
2 Motion estimation and compensation
  2.1 Advantages
  2.2 Drawbacks
  2.3 Illustrative example
3 Modeling the background with principal components
  3.1 Notation
  3.2 Principal components
  3.3 Computation of principal components
  3.4 Projection into a subspace
  3.5 Preliminary algorithm
    3.5.1 Example
    3.5.2 Number of principal components and choice of error measure
    3.5.3 Drawbacks
  3.6 Improved algorithm
    3.6.1 Principal components for hollow blocks
    3.6.2 Algorithm
    3.6.3 Results
    3.6.4 Time and storage requirements
  3.7 Further discussion
    3.7.1 Structured targets
    3.7.2 Target extraction
    3.7.3 Subpixel positioning of point targets
    3.7.4 Multiple targets
    3.7.5 Tracking
    3.7.6 Multiple background models
    3.7.7 Changed background
    3.7.8 Using DCT instead of principal components
    3.7.9 Modeling the target as well as the background
4 Wiener filter
  4.1 Filter design
  4.2 Separable filters
  4.3 Results
5 Conclusions
References

1 Introduction

This report documents work done at the request of the Swedish Defence Research Establishment. The problem considered is that of detecting point-shaped targets in a cluttered environment. By "point-shaped" it is understood that the only important structure of the target is that it is very small. It does not mean, however, that it necessarily has to be constrained to only one pixel. In fact, the point spread function of the sensory system will in general distribute it over a few pixels.

That the background is assumed to be cluttered means that it is nontrivial to distinguish between targets and background. Simple approaches such as thresholding are far from sufficient.

Three different approaches to the problem have been considered and are presented in turn in sections 2–4. The first approach, motion estimation and compensation, and the second approach, background modeling with principal components, are both inspired by image and video coding techniques. The third approach uses the more conventional idea of Wiener filtering.

The implemented algorithms have been tested on two different test sequences. Both sequences have been obtained by an IR sensor. The background consists of a cloudy sky above the horizon and sea below. A single target moves over the background, with varying velocity and intensity, except for some frames where it is outside the sector covered by the sensor. Each frame is 186 pixels high and 2048 pixels wide. For practical reasons, only a fraction of the width is shown in the figures illustrating the methods. Two sample frames from one of the sequences are shown in figure 1.

Figure 1: Two consecutive frames from a test sequence. Here the target has very high intensity.

In accordance with the test sequences, the case of a single target has primarily been considered. The complications of multiple targets are, however, discussed.

2 Motion estimation and compensation

One obvious approach to the problem would be to assume that the background moves in a regular way. Then, if the motion field of the image could be estimated, the background from the previous frame could be warped according to this motion, generating an estimate of the current background. The idea is of course that the target, moving in a different way, would differ markedly from the estimated background.

For some methods to estimate the motion field, see [1, 2, 3].

The idea is interesting and has some nice properties. It has, however, some serious drawbacks too. The advantages and disadvantages are discussed below, together with an illustrative example. Because of the drawbacks this approach has not been investigated further.

2.1 Advantages

Two good properties are the following:

1. The algorithm is largely independent of the shape and the structure of the target. One only has to avoid letting the target influence the estimation of the motion of the background. As long as the target is assumed to be point-shaped, this property is not very important, but for more general cases it could be quite useful.

2. If the velocity of the target could be estimated, in addition to that of the background, it would be possible to predict its next location. Unfortunately it is hard to estimate the velocity of a small and fast moving object, such as the target in the test sequences.

2.2 Drawbacks

Four drawbacks are presented.

1. Motion estimation is computationally intensive. Many algorithms also make computations based on more than two frames, requiring a great deal of memory as well.

2. It is hard to get good estimates of the motion. If one tries to get independent estimates at each pixel, noise and fundamental limitations, such as the aperture problem, will be significant problems.

More robust results can be obtained by trying to fit a motion model to the whole image. One common model is the assumption that the velocity varies affinely with the image coordinates. The question then is to what extent the motion model actually is able to describe the motion in the image.

It is of course not necessary to fit a single model to the whole image. Different models could be used on different parts of the image. It is not even necessary to have a fixed subdivision of the image, but with motion-based segmentation algorithms the computational cost grows still worse. A specific complication with the available test sequences is that the motion is large. The background typically moves with a speed of a few up to 25 pixels per frame. The target can move as fast as about 100 pixels per frame.

Accurate motion estimation matters, of course, because the goal is a correct estimate of the background after warping the previous image. Otherwise false targets may appear, or at least it will be harder to distinguish the real one.

3. If the target happens to move consistently with the background, it will not be revealed with this method. Note that the condition of consistent motion does not require that the target moves with the same velocity as the background (e.g. clouds), only that the apparent motion, after projection into the image plane, is the same.

4. The image warping will leave empty areas at the border because parts of the image were not present in the previous frame. This effect could be reduced by a bidirectional estimation from a future frame as well, at the cost of lost causality.


2.3 Illustrative example

To illustrate the ideas of the algorithm, but for no other purpose, an example is presented. Figure 1 shows two consecutive frames from one of the test sequences. After estimating the motion of the background, the first of the frames was warped according to this velocity field. The result, a prediction of the background of the second frame, is presented in figure 2. The difference image is plotted in figure 3 together with an image of the absolute values of the difference. Notice that the difference image contains two copies of the target: a negative one at the warped previous position and a positive one at the current position. That the target dominates the difference image does not say much, because it was clearly visible in the original images as well. In fact, the motion estimation used here was far from accurate.

Figure 2: Warped frame.

Figure 3: Top: Plot of the difference image. Bottom: Absolute values of the difference image.

3 Modeling the background with principal components

The idea of this approach is to model the background with a method that is able to describe the background well but hopefully is unable to describe the target. The first step is to make a data compression encoding of the image that is adapted to the background. After the second step, reconstruction, there will be large errors where the target is.

3.1 Notation

The algorithm uses a block coding scheme, where the blocks have arbitrarily been chosen as squares, 15 × 15 pixels large. Each block can be considered as a vector in a 225-dimensional vector space. It is immaterial how the blocks are encoded into vectors; the encoding used here is illustrated in figure 4, for reference.


Figure 4: Encoding of image blocks into vectors.
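For concreteness, the mapping in figure 4 amounts to a row-major flattening of the block. A minimal sketch (NumPy; the scan order is read off the figure):

```python
import numpy as np

def block_to_vector(block):
    """Encode a 15 x 15 image block as a 225-dimensional vector,
    reading the block row by row as in figure 4."""
    assert block.shape == (15, 15)
    return block.reshape(225)      # row-major (C order) flattening

def vector_to_block(v):
    """Inverse mapping: reshape a 225-vector back into a 15 x 15 block."""
    return v.reshape(15, 15)
```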

3.2 Principal components

Considering the blocks as stochastic vectors, there will be certain correlations between the coordinates of the vectors. The idea of principal components is to make a change of basis that yields uncorrelated coordinates. Let $x$ denote a stochastic vector and let $C_x = E(xx^T)$ be the autocorrelation matrix. If $A$ is a matrix which defines a change of basis, $y = Ax$, the autocorrelation matrix for $y$ is $C_y = E(yy^T) = E(Axx^TA^T) = AC_xA^T$. Now the autocorrelation matrix $C_x$ is symmetric, so by the spectral theorem there exists an orthogonal matrix $A$ which makes $C_y$ diagonal. Moreover, this new orthonormal basis is given by the eigenvectors of $C_x$, which are called principal components. By convention, the eigenvectors are ordered so that their corresponding eigenvalues are decreasing. What this means is that in the new basis the coordinates are uncorrelated and ordered with decreasing variance.

3.3 Computation of principal components

We want to find principal components for blocks of size 15 × 15 from the background. To do that we take a large number of samples of this block size from one or more frames. The samples should be unaligned and may well be overlapping. Denote this set of blocks, the training data, with $\{b_i\}_{i=1}^N$, using the vector representation of the blocks from section 3.1.

There are essentially two ways to compute the principal components. One is to estimate the autocorrelation matrix from the training data,

$$\tilde{C}_b = \frac{1}{N} \sum_{i=1}^{N} b_i b_i^T, \qquad (1)$$

and compute its eigenvectors. The other is to form a matrix of all the training data, $B = (b_1\ b_2\ \cdots\ b_N)$, and apply singular value decomposition to $B$. Mathematically these methods are very closely related, but there are some differences in storage and time complexity. The choice generally depends on the vector dimension, the number of training data and the available storage size.

Here the former method is preferable due to storage requirements. The reason is that the dimension of the vectors is fairly large (225) and we need a still larger number of training blocks to get an accurate estimate of the principal components. Hence the estimated autocorrelation matrix has $225^2$ elements, while the matrix $B$ has $225N$ elements, where $N > 225$.
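A sketch of the former method under these assumptions (training blocks already vectorized as in section 3.1; function names are hypothetical):

```python
import numpy as np

def principal_components(blocks):
    """Estimate principal components from training blocks.

    blocks: array of shape (N, 225), one vectorized block per row.
    Returns the eigenvectors of the estimated autocorrelation matrix
    (equation (1)) as columns, ordered by decreasing eigenvalue."""
    N = blocks.shape[0]
    C = blocks.T @ blocks / N              # estimated autocorrelation matrix
    eigvals, eigvecs = np.linalg.eigh(C)   # C is symmetric
    order = np.argsort(eigvals)[::-1]      # decreasing variance
    return eigvecs[:, order], eigvals[order]
```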

With 9847 training blocks taken from the image in figure 5 we obtain a full basis of principal components. The 16 most important of these are visualized in figure 6, and the distribution of eigenvalues, i.e. variances, for the principal components can be found in figure 7. The first principal component is essentially the DC component. Following it are a number of components with mostly low frequencies. It is evident that horizontal structures dominate in the background, which should come as no surprise given figure 5. It should also be mentioned that the sensor seems to add artificial horizontal structures. These are also taken into account by the principal components.

Figure 5: Part of the frame from which the training blocks have been taken.

Figure 6: The 16 most important principal components.

Figure 7: Distribution of the eigenvalues corresponding to the principal components.

3.4 Projection into a subspace

A full basis for the block space is capable of describing any block, not only blocks with background. Therefore we want to restrict ourselves to some subspace of the block space. Let $\{a_i\}_{i=1}^n$ be a basis for an $n$-dimensional subspace, combine these vectors into a matrix $A = (a_1\ a_2\ \cdots\ a_n)$, and let $b$ be a block. In general $b$ is not a vector in the subspace, so we want to find an approximation of $b$ that lies in the subspace. This can be expressed as finding a vector $x$ that makes $\tilde{b} = Ax$ close to $b$. The usual approach is to minimize the residual energy $\|Ax - b\|$, where $\|\cdot\|$ denotes the Euclidean norm. This yields the least mean squares solution


$$x = (A^T A)^{-1} A^T b, \qquad (2)$$

$$\tilde{b} = A(A^T A)^{-1} A^T b. \qquad (3)$$

In the case that the basis vectors are orthonormal the expressions are radically simplified because $A^T A = I$, giving the usual formulas for orthogonal projection

$$x = A^T b, \qquad (4)$$

$$\tilde{b} = Ax = A A^T b. \qquad (5)$$
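In code, both cases reduce to a few lines (a sketch; A holds the basis vectors as columns):

```python
import numpy as np

def project(A, b, orthonormal=True):
    """Project a block vector b into the subspace spanned by the columns
    of A, returning the coordinates x and the reconstruction b_tilde."""
    if orthonormal:
        x = A.T @ b                            # equation (4)
    else:
        x = np.linalg.solve(A.T @ A, A.T @ b)  # equation (2)
    return x, A @ x                            # equations (5) and (3)
```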

It has already been mentioned that the principal components are uncorrelated and ordered with decreasing variance. This means that the $n$-dimensional subspace yielding residuals with minimum variance is spanned by the first $n$ principal components.

In terms of image coding, $x$ is the encoded block and $\tilde{b}$ is the reconstruction. The coding of a whole image is done blockwise. In an image coding context the components of $x$ would also have to be quantized with reasonably few bits prior to transmission or storage. This is of no consequence here, since we want good reconstruction and no transmission is to occur.

The use of principal components for image coding is known as the Hotelling transform (see [4]). This transform differs from the approach here in that the covariance matrix and the mean vector are used rather than the autocorrelation matrix. In the continuous case the corresponding transform is the better known Karhunen-Loève transform.

3.5 Preliminary algorithm

A preliminary algorithm can now be presented; a code sketch follows the steps:

1. Compute a principal component basis as described above and take the first n components as a basis for a subspace.

2. Divide the image into non-overlapping 15 × 15 blocks.

3. Project each block into the subspace and reconstruct it. Note that the principal components are orthonormal, so the simpler formula for orthogonal projection can be used.

4. Compute the residual (difference) image and search for the maximum error.
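A minimal sketch of these four steps (NumPy; names are hypothetical, and A holds the chosen principal components as columns):

```python
import numpy as np

def detect_preliminary(image, A, block=15):
    """Project non-overlapping blocks into the subspace spanned by the
    orthonormal columns of A and return the blockwise Euclidean errors
    together with the index of the block with the largest error."""
    h, w = image.shape
    errors = np.zeros((h // block, w // block))
    for i in range(h // block):
        for j in range(w // block):
            b = image[i*block:(i+1)*block, j*block:(j+1)*block].reshape(-1)
            residual = b - A @ (A.T @ b)      # orthogonal projection residual
            errors[i, j] = np.linalg.norm(residual)
    return errors, np.unravel_index(np.argmax(errors), errors.shape)
```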

3.5.1 Example

Figure 8 shows a sample frame with a faintly visible target in the lower left. The projection into a 56-dimensional subspace can be found in figure 9. Figure 10 holds the residual image and figure 11 gives the blockwise errors, measured with Euclidean norm.

Figure 8: Original frame.

Figure 9: Projection into a 56-dimensional subspace.

Figure 10: Residual image.

Figure 11: Blockwise errors measured with Euclidean norm.

3.5.2 Number of principal components and choice of error measure

There are two important omissions in the given algorithm. The first is the dimensionality of the subspace. The second is how to measure the error in the reconstructed image.

To start with the latter, there are several alternatives for measuring the error. One possibility is to find the point in the residual image with the largest absolute value. One should note, however, that the whole block containing a target will be distorted, not only the pixels where the target is. Hence it makes more sense to measure the errors blockwise instead of pixelwise.

Given that the projection into a subspace was made with the goal of minimizing the Euclidean norm of the residual vector, it is natural to use this norm as the error measure. But the choice of the Euclidean norm was to some extent made to allow for the simple solutions given by least squares and orthogonal projection. Hence there are reasons to look at other measures too. Notably the 1-norm (sum of absolute values) and the infinity norm (maximum of the absolute values) have been studied.

To return to the question of the dimensionality of the subspace, it can be noted that with a very low-dimensional subspace neither the background nor the target can be described. With some more components the background can be described better, but so can the target. What one can hope for is that the ability to describe the background initially increases much faster than the ability to describe the target. If this is the case the signal to noise ratio (considering the target as signal and the background as noise) will rise initially, up to some level, and then start to decline.

The effects of changing the dimensionality of the subspace and measuring the error in three different ways have been tested on a sample frame. The result can be found in figure 12. It is clear that the best results are obtained with a roughly 60-dimensional subspace, using the infinity norm to measure the error.

3.5.3 Drawbacks

There are three important drawbacks to this preliminary algorithm.

1. Not only targets give errors in the reconstructed image. Noise or untypical background also does, and those errors can be comparatively large.


Figure 12: Signal to noise ratio for different dimensionalities of the subspace. The ratio is computed as the block error for the target divided by the maximum block error among all other blocks. The solid curve is for the infinity norm, the dashed for the Euclidean norm and the dotted for the 1-norm.


2. If the target is split between two blocks (i.e. it lies on the border between two blocks), both blocks will get large residuals. These residuals will, however, be smaller than they would have been if concentrated in one block, and the SNR will be considerably smaller.

3. Since the error is spread out over the block it is not possible to reliably position the target with better than block precision.

All these drawbacks can rather easily be overcome, as is shown by the improved algorithm that follows.

3.6 Improved algorithm

The answer to the second drawback is of course to use overlapping blocks instead of non-overlapping ones. The overlap should be as large, or at least nearly as large, as the maximum size of the target. Here an overlap of three pixels has been used. Of course this leads to the possibility that a target may be detected in two adjacent blocks (or possibly four), which one should be aware of.

The first and the third problems can actually be solved at the same time. The key idea is to note that a target is small and localized but that other reasons for large errors are likely to be distributed over the block. Suppose therefore that the same operation of projection into a subspace were additionally to be made on blocks with a hole in the center. Then errors caused by noise or untypical background would tend to be large in both cases. For a block with a target at the center, the error would be large in the first case but small in the second case. If the target were positioned off-center the errors would again be large in both cases.

The conclusion is that it is interesting to look at the ratio of the errors for full blocks to the errors for hollow blocks. Unfortunately this procedure seems to require that blocks at all possible positions in the image be examined, instead of just slightly overlapping blocks, leading to a significant increase in computational complexity. But the situation is not as bad as it looks. The first idea of looking at the error for slightly overlapping blocks can still be used to find a number of candidates for target locations. Then, for all positions in these blocks, the above mentioned ratio measure is computed and the result is used for target detection and accurate localization.

3.6.1 Principal components for hollow blocks

The size of the hole in the hollow blocks should be about the same as the largest allowed size of the target. Here a hole of size five by five has been used. The computation of principal components for hollow blocks is in principle no different from the case of full blocks. The same goes for the projection into a subspace.

A straightforward approach would, however, require a different mapping of the 200-dimensional hollow blocks into vectors than was used for full blocks. This is of course not impossible, but it is somewhat impractical in terms of implementation details and likely to be somewhat inefficient too.

A better idea is to use the same mapping between block and vector as for full blocks but force the 25 center positions to be zero. If the computation of principal components is made from an estimated autocorrelation matrix, this means that the same matrix can be reused after zeroing out the relevant rows and columns.
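A sketch of this reuse (the hole follows the five-by-five center described above; names are hypothetical):

```python
import numpy as np

def hollow_principal_components(C, block=15, hole=5):
    """Reuse the full-block autocorrelation matrix C (225 x 225): zero the
    rows and columns corresponding to the center hole, then redo the
    eigenvalue decomposition."""
    mask = np.ones((block, block))
    lo = (block - hole) // 2
    mask[lo:lo + hole, lo:lo + hole] = 0.0   # 5 x 5 hole at the center
    m = mask.reshape(-1)                      # same encoding as figure 4
    C_hollow = C * np.outer(m, m)             # zero relevant rows and columns
    eigvals, eigvecs = np.linalg.eigh(C_hollow)
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order], m
```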

The 16 first principal components for hollow blocks can be seen in figure 13. An obvious difference from the principal components for full blocks is that these are zero at the 25 center pixels, as should be expected. Except for that, it is hard to see any differences in these first components. The differences will however be more marked for later components, because they have to diverge sooner or later. The reason is that at the very end, the 25 least important components have to form a basis for the 25 dimensions at the "removed" center.

In the projection phase it turns out that no adjustments at all have to be done except that the center of the block should be excluded when computing the block error.


Figure 13: The 16 most important principal components for hollow blocks.

3.6.2 Algorithm

With these explanations the improved algorithm can be presented:

1. Compute principal components for full blocks and for hollow blocks. Use some number of the most important components as basis for subspaces of the two block spaces.

2. Divide the image into slightly overlapping blocks (here 15 × 15 blocks overlapping by three pixels).

3. Compute the residual error for each (full) block after projection into the subspace.

4. Select some of the blocks with the largest errors for closer examination.

5. Place full blocks and hollow blocks centered at each pixel of the blocks selected in the previous step.

6. For each of these blocks, compute the ratio of the two residual errors.

7. The centers of the blocks with the largest ratios are the most likely positions for a target.

It is not mentioned how to determine an appropriate number of blocks to examine further. Simple possibilities are to take a fixed number or use some threshold, but probably some more sophisticated technique is necessary. In the experiments a fixed number of blocks, ten, has been used.
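A hedged sketch of steps 5 and 6 (names are hypothetical; A_full and A_hollow hold the chosen components for the two block spaces, and m is the flattened hole mask from the previous sketch):

```python
import numpy as np

def ratio_measure(image, A_full, A_hollow, m, y, x, block=15):
    """Ratio of the full-block residual error to the hollow-block residual
    error for a block centered at pixel (y, x); the hole is excluded from
    the hollow-block error as described in section 3.6.1."""
    r = block // 2
    b = image[y - r:y + r + 1, x - r:x + r + 1].reshape(-1)
    e_full = np.linalg.norm(b - A_full @ (A_full.T @ b))
    b_hollow = b * m                           # zero out the center hole
    e_hollow = np.linalg.norm(b_hollow - A_hollow @ (A_hollow.T @ b_hollow))
    return e_full / e_hollow                   # large for a centered target
```

(The ratio values plotted in figure 14 are apparently oriented the other way, so that the target obtains the lowest values.)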

The evaluation of the ratio values also requires some heuristics. Here some fixed thresholds have been used to classify targets at three levels of certainty.

3.6.3 Results

The improved algorithm has been tested on the two sequences. The results for 43 frames from one of the sequences are presented. Figure 14 contains the lowest ratio values for each frame. At the beginning of the sequence the target is clearly visible to the eye but not very bright. Gradually the target gets weaker and around frame 15 it is very faint. Then it gets stronger again and around frame 30 it is very bright. From frame 32 and onwards the target is, however, no longer present. The ratio values are in good agreement with this course of events. In all frames where the target is present, it has obtained the lowest ratio values, making the detection successful.

In figure 15 three different signal to noise ratios are presented. The largest residual or smallest ratio in each frame is supposed to be the target, and is considered as the value of the signal. The largest residual or smallest ratio among other blocks than where the target was found is used as the value of the noise.

3.6.4 Time and storage requirements

To determine the time and storage requirements, there are two distinct phases to consider. The first is the training phase, when the principal components are computed. The storage requirement is essentially that of the estimated autocorrelation matrix (size 225 × 225), two sets of principal components ($2 \times 225N$ values, where $N$ is the number of components used), and possibly additional storage needed for the two eigenvalue decompositions. There are two significant parts of the time requirements. The first part is the time it takes to estimate the autocorrelation matrix, which is linear in the number of training blocks. It has not been studied how many blocks are necessary, but it is quite certain that the nearly 10 000 used in section 3.3 are more than necessary. The absolute minimum number of training blocks is the number of principal components to be used, but to get reasonably good estimates it is likely that a far larger number is required.


Figure 14: Ratio values for the most probable target in each frame in a test sequence.

Figure 15: Three signal to noise ratios for each frame of a test sequence. Solid: residual for the target relative to the largest residual among nontargets. Dashed: residual for the target relative to the mean residual. Dotted: ratio for the target relative to the largest ratio among nontargets.


The second part is of course the two eigenvalue decompositions of 225 × 225 matrices.

The second phase is the actual detection of point targets. Excluding the storage of the principal components and the image to be analyzed, the storage requirements are rather moderate. The number of operations required for projection and reconstruction of a block is $2 \times 225N$ multiplications and additions. With a block overlap factor of $(15/12)^2$ this means $3.125N$ multiplications and additions per pixel in the image. The ratio measure requires about $850N$ multiplications and additions per examined point.
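Spelling out the arithmetic: with an overlap of three pixels the blocks advance in steps of 12 pixels, so each block's $2 \times 225N$ operations are shared by $12^2$ pixels,

$$\frac{2 \cdot 225N}{12^2} = 2N\left(\frac{15}{12}\right)^2 = 3.125N.$$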

Two things should be noted about the time requirements. First, the training procedure can be separated from the detection procedure. New principal components can be computed in another process while the detection algorithm is running. When it is necessary to change to new principal components, these can simply be transferred from the other process. Second, the algorithm is very well suited for parallelization.

3.7 Further discussion

Below follow various extensions, improvements, and variations of the algorithm. These have been considered, to a greater or lesser degree, but have not been implemented.

3.7.1 Structured targets

This method can to some extent be adapted to targets with structure. Assuming that the target is still reasonably small, although not point-sized, it should still be possible to find blocks differing from the background. The size of the blocks may have to be altered.

The ratio measure is probably hard to use for larger targets, because it would require very large full blocks and hollow blocks, leading to a large computational cost. Also, the ratio measure only tells where a target probably is, not how it looks. To allow more accurate detection of a structured target, and possibly classification, one would like to extract the target from the background. That is the subject of the next section.

It should be noted that this method does not use any information about the structure of the target. While this in one sense is a shortcoming of the method, it is in another sense its big advantage. One does not need to know the structure of the target, only its maximum size.

3.7.2 Target extraction

Once a target has been located it may be interesting to extract it from the background. This can be done quite easily in the case of a small target that only takes up a moderately large fraction of a block. It is somewhat more involved to extract an object that is larger than about half the block. Both these cases are described below.

Assume first that the target is small and that it has been located to lie within some part of a block. Call that part the target area. It is only known that the target does not extend outside the target area, but it does not have to fill it up. In preparation for the extraction of the target we are going to make an estimation of how the target area would have looked, had it been free from the target. In other words, we are going to extend the background to the target area.

Let $q$ be a block vector defining a mask, i.e. $q$ is zero for the points where the target may be located and one outside. Let $Q$ be a diagonal matrix with $q$ on the diagonal. Let $\{p_i\}_{i=1}^n$ be the principal component basis for the subspace used before and set $P = (p_1\ p_2\ \cdots\ p_n)$. Finally $b$ is the block being examined. Now we want to project $b$ into the subspace, ignoring the target area. This can be expressed as finding $\tilde{b} = Px$ minimizing $(b - \tilde{b})^T Q (b - \tilde{b})$. An equivalent statement is the mean squares problem (note that $Q^TQ = Q$)

$$\min_x \|QPx - Qb\| \qquad (6)$$

with the solution

$$x = (P^TQP)^{-1}P^TQb. \qquad (7)$$

The pixels in the target area that are sufficiently far from the estimated background value can now be classified as belonging to the target. With this the target has been extracted.
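A sketch of equation (7) (names are hypothetical; P holds the subspace basis as columns and q is the mask vector):

```python
import numpy as np

def extend_background(P, q, b):
    """Estimate how the target area would have looked without the target,
    by projecting b into the subspace while ignoring the masked points
    (q is zero over the target area, one elsewhere)."""
    Q = np.diag(q)
    x = np.linalg.solve(P.T @ Q @ P, P.T @ Q @ b)   # equation (7)
    return P @ x         # background estimate, valid also in the target area
```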

If the target area is too large this approach will fail. The reason is simple. Trying to extend a few background pixels to the whole block cannot be done with any accuracy.

In the case of a large target, possibly several blocks large, one has to make the extraction iteratively. The idea is to place a block at the border of the supposed target, so that the intersection between the block and the target is appropriately large. Then the algorithm above can be applied to the block and the result is a better approximation of that part of the target. Repeat this while moving the block around the border of the target until its shape stabilizes.

The method described here, as well as the idea of projection into a subspace in general, is closely related to the technique called normalized convolution, described in [5].

3.7.3 Subpixel positioning of point targets

A point target may be distributed over a few pixels, due to the properties of the sensor. Ideally this distribution is a sampled version of a space-invariant point spread function. Depending on the subpixel positioning of the target, the sampled point spread function will vary, as illustrated in figure 16 for the one-dimensional case.

With an extracted target and knowledge of the point spread function of the sensor, it should be possible to estimate the position of the target with subpixel precision.

3.7.4 Multiple targets

Multiple targets can be detected with the algorithm, as long as the targets are separated far enough. When they are too close it will be impossible to place the hollow blocks so that one target is in the hole and the rest of the block is covered by background only. Then the ratio measure will not indicate a target, but the block error will at least be large.

Figure 16: Dependence on the subpixel positioning when sampling a point spread function.

There are two ways to counter this problem. One is to do something like what is described in section 3.7.1 for structured targets. In the simplest case this means increasing the maximum allowed size for a target (specifically the size of the holes in the hollow blocks), making clusters of point-shaped targets detectable.

The second way is to improve the heuristics for interpreting the ratio values and block errors, preferably by incorporating tracking of targets. Then if two targets are predicted to get close, it is not required that the ratio is large, but only that the block error is. A more sophisticated approach would be to exclude the pixels where other targets are estimated to be, when computing the ratio measure.

3.7.5 Tracking

A very important addition to the presented algorithm is tracking of detected targets from frame to frame. There are at least two reasons why tracking is desirable:

1. False alarms can be avoided if it is required that detected targets move consistently over a few frames. Spurious noise is not likely to behave so well.

2. If it can be estimated where a target is likely to appear in the next frame, one can give higher priority to searching in that area.


3.7.6 Multiple background models

Often it is the case that the background is not homogeneous with respect to its statistical properties. A natural improvement of the algorithm would be to use different sets of principal components for different parts of the background. In the tested sequences for example, the upper part of each frame consists of cloudy sky while in the lower part there is sea. It would then be natural to have different background models above and below the horizon.

Other divisions are of course conceivable. As long as a fixed segmentation of the image can be used it is relatively easy. A worse case is when a segmentation must be estimated online. There are methods for this, but they are in general very computationally intensive.

3.7.7 Changed background

If the background does not change, or rather if its statistical properties do not, the same set of principal components can be used for all frames. This assumption is however quite optimistic in the long run. But since the computation of the principal components is quite expensive one wants to keep the old ones for as long as possible.

There are a number of possible strategies for updating the principal components. One is to precompute a few different sets of principal components to be used under different conditions, such as different types of weather, and use the one that works best at the moment. Another is to recompute the principal components from time to time, either at certain intervals or when the old ones no longer seem to work well. Note that the estimation of the autocorrelation matrix can be done continuously. Of course old data should be given less weight in the estimation of the autocorrelation matrix, which is most easily implemented by using a forgetting factor. As mentioned in section 3.6.4 the eigenvector decomposition can also be done in parallel with the detection. The only slight drawback is that the new principal components may be based on somewhat old data when they are ready to be used.
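A minimal sketch of such a continuous estimate (the value of the forgetting factor is an arbitrary assumption):

```python
import numpy as np

def update_autocorrelation(C, new_blocks, forgetting=0.99):
    """Update the estimated autocorrelation matrix with fresh training
    blocks, giving old data exponentially decreasing weight."""
    N = new_blocks.shape[0]
    C_new = new_blocks.T @ new_blocks / N
    return forgetting * C + (1.0 - forgetting) * C_new
```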

A different, and computationally more attractive, strategy is to use a fixed basis for the whole block space and only vary which of the vectors are used to span the subspace. The following section discusses how this approach works for the special case of the DCT basis.

3.7.8 Using DCT instead of principal components

The use of principal components has two important disadvantages. The first is that their computation is costly. The second is that one is not likely to find an already existing hardware implementation.

The latter is on the other hand very likely for one specific image transform, namely the discrete cosine transform (DCT), which is a core part of many important image coding standards, such as JPEG and MPEG.

The DCT is a block based transform, with the basis vectors given by

$$d_{ij} = \alpha_i \alpha_j \cos\frac{(2x+1)i\pi}{2N} \cos\frac{(2y+1)j\pi}{2N}, \qquad i, j = 0, 1, \ldots, N-1, \qquad (8)$$

$$\alpha_k = \begin{cases} \sqrt{1/N}, & k = 0, \\ \sqrt{2/N}, & k = 1, 2, \ldots, N-1, \end{cases} \qquad (9)$$

where N is the side of the block and x and y are block coordinates. See [4] for more details.

The idea here is to replace the principal components with DCT basis vectors. Instead of computing eigenvectors of the autocorrelation matrix $C$ and ordering them with decreasing eigenvalues, we take the DCT basis vectors $\{d_i\}$ and order them with decreasing values of $d_i^T C d_i$. The construction of a subspace spanned by a number of the most important basis vectors proceeds as before. Using the same $C$ as in section 3.3, the 16 most important DCT basis vectors are shown in figure 17. The effect of the size of the subspace and the choice of error measure is shown in figure 18.
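A sketch of this ordering (N = 15; the scan order of the basis vectors is an implementation choice):

```python
import numpy as np

def dct_basis(N=15):
    """2D DCT basis vectors from equation (8), one per column."""
    k = np.arange(N)
    alpha = np.where(k == 0, np.sqrt(1.0 / N), np.sqrt(2.0 / N))
    # 1D cosine factors: phi[i, x] = alpha_i cos((2x+1) i pi / (2N))
    phi = alpha[:, None] * np.cos((2*k[None, :] + 1) * k[:, None] * np.pi / (2*N))
    # outer products phi_i phi_j^T, vectorized with the encoding of figure 4
    return np.einsum('ix,jy->ijxy', phi, phi).reshape(N*N, N*N).T

def order_dct_by_variance(basis, C, n):
    """Order the DCT basis vectors by decreasing d_i^T C d_i and keep the
    n most important ones as the subspace basis."""
    scores = np.sum(basis * (C @ basis), axis=0)   # d_i^T C d_i per column
    order = np.argsort(scores)[::-1]
    return basis[:, order[:n]]
```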

It is more complicated to replace the principal components for hollow blocks. The reason is that if the center pixels are zeroed from the DCT basis vectors, they will cease to be orthonormal. There are essentially two ways to get around this problem. One way is to take the zeroed vectors and apply an orthogonalization process such as Gram-Schmidt to obtain a new set of orthonormal vectors. This procedure is described in [6]. The second way is to accept that the basis is non-orthonormal and use equation (3) instead of equation (5) for the projection into the subspace.

It has not been studied whether existing DCT hardware can actually be used with the modifications needed for this algorithm, and the outcome is not obvious. But even if it should turn out negative, the DCT approach is still interesting, because it removes the need for the computationally expensive eigenvector factorization.

Experimentally the DCT approach turns out to work well. The results are not significantly different from those obtained with principal components, see figures 19 and 20. This is not very surprising, because the most important principal components in figure 6 are not very different from the low-frequency DCT components in figure 17. If the main structures of the images had not been horizontal or vertical, the DCT basis would probably have been less successful.

3.7.9 Modeling the target as well as the background

A possible extension of the algorithm would be to try to model the target as well as the background. One could also question the fact that a number of the principal components are used with equal weight, while the rest are ignored completely. An alternative would be to see how well each component fits the background and the target, respectively. Based on this, each component gets a weight, and for target detection the weighted sum of the projections of a block onto the components is computed.

This approach has not been pursued further, because it seems to lead to the Wiener filter, which instead has been studied in a conventional filter setting.

Figure 17: The 16 most important DCT basis vectors.

Figure 18: Signal to noise ratios for different sizes of the subspaces, using a DCT basis. The meaning of the curves is the same as in figure 12.


Figure 19: Ratio values, using a DCT basis, for the most probable target in each frame in the same sequence as in figure 14.

Figure 20: Signal to noise ratios, using a DCT basis, for the same sequence as in figure 15. The meaning of the curves is the same as in figure 15.

4 Wiener filter

Considering the target as signal and the background as noise, it is possible to design a filter that improves the signal to noise ratio. In this problem the spectra of the signal and the background can be estimated, making the Wiener filter a suitable choice.

4.1 Filter design

The spectrum of the target is essentially given by the spectrum of the point spread function of the sensor. Here the spectrum has, somewhat arbitrarily, been approximated with a simple Hanning window, shown in figure 21. With good knowledge of the sensor, a more accurate estimate can of course be made. The spectrum of the background has been estimated from a 186 × 256 sample from one frame of a test sequence. A windowed Fourier transform has been applied to the sample. Finally, the result has been smoothed with an averaging filter, giving figure 22.

Denoting the signal spectrum with $S$ and the noise spectrum with $N$, the transfer function of the Wiener filter is given in the Fourier domain by [2, 7]

$$H = \frac{S}{S + N}. \qquad (10)$$

Since we are most interested in getting good results for weak signals, the denominator can be approximated with $N$, i.e.

$$H = \frac{S}{N}. \qquad (11)$$

The result is shown in figure 23.

Inverse transformation of $H$ gives a filter kernel, which has been truncated to 15 × 15 coefficients. The Fourier transform of the realized filter kernel is presented in figure 24.
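A sketch of the design under the stated assumptions (the Hanning approximation of S follows the text; the smoothing of N with an averaging filter is omitted, and the window choices are assumptions):

```python
import numpy as np

def wiener_kernel(background, kernel_size=15):
    """Approximate the target spectrum S with a Hanning window, estimate
    the background spectrum N from an image sample with a windowed
    Fourier transform, and realize H = S/N (equation (11)) as a small
    spatial kernel."""
    rows, cols = background.shape              # e.g. a 186 x 256 sample
    window = np.outer(np.hanning(rows), np.hanning(cols))
    S = np.fft.ifftshift(window)               # Hanning window centered at DC
    N = np.abs(np.fft.fft2(background * window)) ** 2
    H = S / (N + 1e-12)                        # eq. (11), small regularizer
    kernel = np.fft.fftshift(np.real(np.fft.ifft2(H)))
    c0, c1, r = rows // 2, cols // 2, kernel_size // 2
    return kernel[c0 - r:c0 + r + 1, c1 - r:c1 + r + 1]  # truncate to 15 x 15
```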

4.2 Separable filters

To make the computational cost of the convolution smaller, it is possible to approximate the two-dimensional filter with two one-dimensional filters, one vertical and one horizontal. This approximation can be made systematically using the singular value decomposition (SVD). The idea is to apply the SVD to the transfer function in the Fourier domain. The outer product of the most important singular vectors, one from the left and one from the right, gives a separable approximation of the filter. This approximation is the best possible in a least squares sense. The two singular vectors can then be regarded as one-dimensional filters. Inverse transformation and truncation yields a vertical filter kernel of size 3 × 1 and a horizontal filter kernel of size 1 × 11, for a total of 14 coefficients. The separable approximation of the Wiener filter is shown in figure 25, while the realized separable filter is shown in figure 26.
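A sketch of the SVD factorization (truncation lengths follow the report; the layout conventions are assumptions):

```python
import numpy as np

def separable_approximation(H, vertical_taps=3, horizontal_taps=11):
    """Best separable (rank-one) approximation of the 2D transfer function
    H in the least squares sense, via the singular value decomposition,
    realized as two short 1D filter kernels."""
    U, s, Vt = np.linalg.svd(H)
    u = U[:, 0] * np.sqrt(s[0])                # vertical singular vector
    v = Vt[0, :] * np.sqrt(s[0])               # horizontal singular vector

    def to_kernel(f, taps):
        # inverse transform one factor and truncate it around its center
        k = np.fft.fftshift(np.real(np.fft.ifft(f)))
        c = len(k) // 2
        return k[c - taps // 2:c + taps // 2 + 1]

    return to_kernel(u, vertical_taps), to_kernel(v, horizontal_taps)
```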

4.3 Results

The Wiener filter turns out to work at least as well as the principal component method. Figure 27 shows the result of filtering the frame in figure 8 with both of the realized filters. Figure 28 gives signal to noise ratios for the same sequence that was used to produce figure 15. The SNR is computed as the ratio between the magnitude of the largest filter output and the largest magnitude of the filter output elsewhere, excluding a small neighborhood around the supposed target. As for the algorithm in section 3.6, the target is found in all the frames where it is present.

Figure 21: The spectrum of the target.

Figure 22: The spectrum of the background.

Figure 23: Wiener filter in the Fourier domain.

Figure 24: Fourier transform of the realized filter kernel.

Figure 25: Separable approximation of the Wiener filter in the Fourier domain.

Figure 26: Fourier transform of the realized separable filter.

Figure 27: Result of applying the 2D Wiener filter (top) and the separable approximation (bottom) to the image in figure 8.

Figure 28: Signal to noise ratios for the two realized Wiener filters on a test sequence. Solid: filter output for the target relative to the largest filter output elsewhere. Dashed: the corresponding ratio for the separable filter.

5 Conclusions

In conclusion, both of the last two approaches work well enough on the test sequences. For use in a real system, however, many practical details must be added. Especially important is the interpretation of the block errors and ratio values in the background modeling approach, and of the filter outputs in the Wiener filter approach. To get a robust system that avoids false alarms, something more sophisticated than fixed thresholds is necessary. Some kind of tracking to incorporate coherence over time is certainly also necessary.

Although it works well here, it should be noted that the Wiener filter is an inherently linear method. This limits the possibilities for improvements compared to the other two, nonlinear, approaches. Some, but not all, of the discussion in section 3.7 carries over to the Wiener filter approach rather straightforwardly. One difference is the extension to structured targets. While the background modeling did not use, or need to know, the structure of the target, the Wiener filter both uses and requires knowledge of the spectrum of the target.

An advantage of the Wiener filter approach is that it is structurally simpler and computationally less expensive than the background modeling approach.

The motion compensation approach was rejected at an early stage, but if the difficulties of the method can be handled, it may be quite useful. It is possible that certain kinds of backgrounds are well adapted to this approach.

References

[1] F. Dufaux and F. Moscheni. Segmentation-based motion estimation for second generation video coding techniques. In L. Torres and M. Kunt, editors, Video Coding: The Second Generation Approach, chapter 6, pages 219–263. Kluwer Academic Publishers, 1996.

[2] G. H. Granlund and H. Knutsson. Signal Processing for Computer Vision. Kluwer Academic Publishers, 1995. ISBN 0-7923-9530-1.

[3] G. Farnebäck. Motion-based Segmentation of Image Sequences. Master's Thesis LiTH-ISY-EX-1596, Computer Vision Laboratory, S–581 83 Linköping, Sweden, May 1996.

[4] R. C. Gonzalez and R. E. Woods. Digital Image Processing. Addison-Wesley, 1992.

[5] C-F. Westin. A Tensor Framework for Multidimensional Signal Processing. PhD thesis, Linköping University, S–581 83 Linköping, Sweden, 1994. Dissertation No 348, ISBN 91–7871–421–4.

[6] M. Gilge. Region-oriented texture coding. In L. Torres and M. Kunt, editors, Video Coding: The Second Generation Approach, chapter 5, pages 171–218. Kluwer Academic Publishers, 1996.
