Stabilizing Augmented Reality views through object detection


STOCKHOLM SWEDEN 2017

KARL JOHAN ANDREASSON

Computer Science (Datalogi)

Degree Programme in Computer Science and Engineering (Civilingenjör Datateknik), Master's Programme in Computer Science (Datalogi)
Supervisor: Stefan Carlsson

Examiner: Stefan Carlsson
Project sponsor: Bontouch AB


Abstract

In two of the mobile applications developed by Bontouch there exists an AR view which eases the discovery of certain points of interest. In the application developed for the Swedish postal service, PostNord, such an AR view helps the users find the delivery points of parcels handled by PostNord. The AR view does not perform as desired when the user is close to a delivery point, due to inconsistent or faulty sensor data received from the rotation sensors and the GPS.

This thesis is focused on how to make the AR view more stable in these instances. To accomplish this stabilization, the approach tested is to analyze the video feed captured by the camera of the phone to find the logotype of Posten and to fixate the unstable point of interest to this logotype.

Two different detectors are evaluated and tested for the best result. One of the approaches implements the ORB algorithm to find the logotype in the image. The second one utilizes knowledge of the logotype to make assumptions that the contours in the image will need to satisfy; the candidates are then evaluated based on the colors inside the contour.

Once a match has been found it is analysed to determine whether it is a true match of the logotype. If the match is a true positive, the delivery point in the AR view is fixated on this logotype. Three different approaches for how to perform this are discussed and evaluated.


Sammanfattning

In two of the mobile applications developed by Bontouch there is an AR view which makes it easier for the users of the application to find their way to a place of interest. In the application developed for PostNord there is such an AR view to help the users easily find their way to the delivery point their parcel has arrived at. This AR view does not behave as desired when the user is close to a delivery point, because of inconsistent or even faulty data from the rotation sensors and from the GPS in the phone.

This thesis focuses on how one could go about making the AR view more stable. To accomplish this stabilization of the AR view, images captured by the camera are continuously analyzed to find the logotype of Posten. The unstable delivery point is then placed where the logotype is found in the analyzed images.

Two different ways of finding the logotype in the images are tested and evaluated. One uses the ORB algorithm to find the logotype in the images. The other approach uses knowledge of the logotype to make smart assumptions that contours in the images must satisfy; the candidates are then analyzed based on the colors within their contour.

When the detector has found what it believes is a match, the match is then analyzed to see whether there is a logotype at that position, and if so the user is shown that it has been found by fixating the delivery point in the AR view at that position. Three different ways of doing this are described and evaluated.


Contents

1 Introduction
  1.1 Problem Statement

2 Background
  2.1 Augmented Reality
    2.1.1 Obtrusiveness
    2.1.2 Mobile augmented reality
    2.1.3 User expectations on AR
  2.2 Object Recognition
    2.2.1 Feature Detection
    2.2.2 Feature Descriptor
    2.2.3 Existing libraries
    2.2.4 Canny Edge Detection
    2.2.5 SIFT
    2.2.6 SURF
    2.2.7 ORB
    2.2.8 Circle Detection
    2.2.9 Ellipse fitting
  2.3 Color segmentation
    2.3.1 RGB
    2.3.2 HSV
    2.3.3 CIE L*a*b*
    2.3.4 Comparison of color spaces

3 Implementation
  3.1 Present AR view
  3.2 Object recognition
  3.3 Target logotype
  3.4 Custom
  3.5 Feature Detection
  3.6 AR incorporation
  3.7 Heuristic of when to trust the object detector

4 Result
  4.1 Feature Detector
  4.2 Custom Detector
  4.3 AR presentation

5 Discussion

6 Conclusions

Bibliography

Appendices


Introduction

The idea of displaying information over real life is not new. Functional augmented reality, AR, has been a dream for quite some time; it was anticipated as early as when Frank Baum formulated the idea of overlaying data onto real life in one of his works. Augmented reality also features heavily in science fiction movies and novels, and in video games information has been presented to the player in different ways on a head-up display, HUD.

Presenting relevant information in a non-intrusive way in real life is not an easy task. The computational power required to accomplish a functional AR view for the user is substantial. Previously only available in science fiction, it is only in recent years that making this idea a reality has become possible. Researchers at MIT wrote an article in 1997 where they imagined the future of AR but could not realize it with the hardware available to them at the time [30].

Now that hardware is becoming more and more powerful, AR is making an appearance in several commercial products. The most notable of these commercial products is Google Glass [12], where AR is the main focus of the product. When using Google Glass the user is presented with an overlay on the real world through a projector that projects the information onto the glasses. This approach requires the user to purchase additional equipment to get an AR view, and it suffers from being intrusive for the end user. As wearable computing becomes more technically advanced, which enables the devices to shrink in size, intrusive AR solutions could become more common and socially acceptable. However, as it stands now, the fraction of people owning the kind of equipment enabling these AR views is small enough that the commercial interest is quite low.

Another approach to realizing these relatively old ideas of AR views is to use devices the users already own, mainly mobile phones or cars. This is classified as a non-obtrusive variant of augmented reality [22]. The development of mobile phones, and in particular smartphones, has given the average user access to much greater computing power in their pockets.


Figure 1.1: Example of using AR in phone applications.

This rapid increase in the computational power of mobile phones, coupled with the fact that these smartphones tend to have well-performing cameras, has made it possible to deliver reasonably well-performing AR views to the majority of consumers.

The main approach to constructing an appreciated AR view has mostly been to incorporate the video feed from the camera and to use the GPS location together with other sensor information captured by the phone. These sensors are typically the compass, accelerometer and gyroscope. However, the sensor information that the AR views are built on is not perfect and can lead to some peculiar behaviour of the application.

In some of the mobile applications developed by Bontouch there exists an already implemented AR view which performs well when the point of interest is not in the vicinity of the user and the GPS does not report too much uncertainty. However, when the user of the mobile application is close to one of the points of interest in the AR view, the performance sometimes decreases substantially.

In one of the applications developed by Bontouch for PostNord, the Swedish postal service, delivery points can be searched for in an AR view. These delivery points are generally postal offices or grocery stores. This view in the application for PostNord can be seen in figure 1.1. Common to all the delivery points is that the logo of Posten is present in the vicinity of the entrance of the delivery point.


1.1 Problem Statement

Examining whether there is a way to improve the AR views present in projects at Bontouch AB is the main focus of this report. To accomplish this it would be beneficial for the overall feel of the AR view if the precision and stability of the view were improved. This is particularly important when the person is close to one of the points of interest in the AR view.

To accomplish this task, the approach that is examined and evaluated is to use computer vision techniques to analyze the video feed provided by the camera of the phone, find the logotype of Posten, and fix a point of interest in the AR view to its correct location. How to find the points of interest in the video feed is a major problem, as the distance at which this needs to work for it to be a useful addition is quite large.

When and how to override the position of nearby delivery points, where imprecise or even faulty sensor and GPS information has caused the delivery point to be positioned wrongly and/or in an unstable position, needs to be evaluated and tested. How to use this information from the analysis, and more precisely how to incorporate it into the present AR view, is evaluated and tested as well.

The issue of when to use the object detection is evaluated, as this approach requires a great deal of computational power to analyze the video feed in real time with an acceptable number of frames examined per second. Methods for how to incorporate the information given by the computer vision part are formulated and evaluated to give a good trade-off between how well the point of interest follows its real-world delivery point and not allowing false positives to ruin the end user experience of the application.


Background

In this chapter a background is given to what has been done previously in the fields relevant to this report.

First, in section 2.1, an overview of what AR is and of the current state-of-the-art research in the area is presented. The limitations and difficulties are also discussed. Current state-of-the-art research and implementations of object recognition and other relevant computer vision techniques are discussed in section 2.2.

In section 2.3 the most commonly used color spaces are described, and previous work comparing different color spaces against each other is summarized.

2.1 Augmented Reality

AR views have seen huge improvements over the last years. AR is being incorporated into an increasing number of projects, either as the main focus or as a complement to existing solutions. The focus on AR views has become increasingly widespread, as they have been a part of science fiction in different media throughout the 20th century.

The definition of AR is somewhat controversial due to where it lies on the Reality-Virtuality Continuum [18], as shown in figure 2.1. When a view goes from being defined as augmented reality to augmented virtuality is not a well-defined step, and a consensus has not been reached between researchers in the area. This implies that what some researchers call AR might very well be classified as augmented virtuality by another team of researchers.

In their paper, Singh and Singh [29] present their own definition of what AR entails: a view that incorporates additional information to augment a view of the real world.

Figure 2.1: The Reality-Virtuality Continuum, ranging from the real environment through augmented reality and augmented virtuality to a fully virtual environment; the span between the two extremes constitutes mixed reality.


The authors also include a baseline example of what an AR view is, where they describe a view of a city from a bird's eye perspective. In this view additional information is presented, such as street and building names.

AR views, according to Singh and Singh, do not necessarily only augment the graphical view; a view where a certain sound is heard when the user is close to a certain place also qualifies. In fact any media type can be used to augment a view to produce an AR view. This report uses the same definition of AR as Singh and Singh.

Examples of AR with widespread real-world usage are applications aiding navigation by highlighting the current route to a desired target on a map, based on where the user is located at the moment. There are also applications using the camera feed to locate grocery items and display advertisements based on groceries found in the camera feed. Recent popular applications show an altered video feed of the user, where the image of the user has been changed to fit a certain setting, such as how the user would look if she were old or fat.

By far the most common way of implementing an AR view is to implement a magic lens: a way of letting the user view the real world through a camera view which has been augmented [8].

As shown, AR is being used for several different use cases and is seeing increasing usage in mobile phone applications. Common to almost all applications using AR is the demand for a specific environment which helps the AR to function correctly [29]. This means that the applications work quite well in controlled environments, but once lighting and surroundings change, the quality of the AR part of the application starts to decrease. This is a large problem for the adoption of AR in the commercial market, as it will lead to a lesser impression from the users if the application does not perform up to their expectations in all situations [21].

Singh and Singh describe the technologies needed to realise a performant AR view [29]. These requirements are defined in the paper as a high-quality sensing, computing, and communications platform. The paper also notes that these requirements are met by hardware that is becoming increasingly common in today's consumer market. The technologies needed for a well-performing AR view are defined using three steps. The first step consists of sensors of the environment. These sensors are used, for example, to find the user's location or, through computer vision, to find information about the user's environment. The second step is the computational power to create an understanding of the environment around the user, by processing the information gathered by the sensors to figure out relevant information about the current scene. The third and last step is the triggering of when to show information to the end user; overwhelming the end user with information is certainly not desired, and displaying wrong information can be really detrimental for the application. These three steps are connected as shown in figure 2.2.

The precision of the sensors needed for a well-performing AR system is quite high. These requirements are well defined in the article by Azuma [1].


Figure 2.2: The three AR steps: sensors, processing, and displaying.

In the same article the author describes how a small angular error of only 1.5 degrees, when displaying a virtual cup placed on a table two meters away, will produce a projection error of 52 mm in the resulting view.
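As a quick check of the figure quoted from Azuma, the projection error follows directly from the distance multiplied by the tangent of the angular error:

\[
e = d \tan\theta = 2000\,\text{mm} \cdot \tan(1.5^{\circ}) \approx 52\,\text{mm}
\]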

There has been research done on using AR in safety-critical mechanisms; for example, Volvo has sponsored a research project which set out to find pedestrians and obstacles around the car in order to avoid collisions or to mitigate their consequences [20]. The research is not used in the cars produced by Volvo, and the conclusion of said research paper was that further work is needed but that it is achievable. This does indicate that the problems with inaccuracy can be conquered, but it might require high-performance equipment which may or may not be available or feasible to include in products.

Privacy concerns of the user are one of the major obstacles when new technologies are developed and released to the public. This is very much the case for applications with AR as well. Giving out what is, from the user's perspective, sensitive information about the user's location and surroundings places a good amount of trust in the developers of the application. The benefits of the application will have to outweigh the cost the user places on the sensitive information that the application requires to function properly.

2.1.1 Obtrusiveness

In AR, the obtrusiveness of a technology affects the breakthrough this technology will have. One example is Google Glass and the need to have the glasses on at all times, or to put them on whenever the features available through its AR view are wanted [12]. This is a real hurdle, and it has been shown by Olwal [22] that people in general prefer to use technology they already own for AR views instead of views constructed with new hardware.


This might not be very surprising, but it does showcase that the adoption rate of new hardware poses a problem that needs to be overcome, either through general public adoption or through integrating the augmented reality view into present technologies and hardware. This integration will of course limit the capabilities of the AR views compared to what is possible when a well-functioning AR view is one of the main goals of the hardware development project. One example of such a project, where AR and most importantly the computer vision part is the main focus, is the Google Tango project [11]. How much of an impact this particular project will have on the end user market is still to be seen.

Olwal argues further that a hand-held magic lens type of augmented reality system acts as a view which complements the real world rather than replacing it, as head-worn solutions do. This is in line with the definition of AR, where head-worn solutions tend to be closer to augmented virtuality on the Reality-Virtuality continuum [22].

Unobtrusive AR defines a set of new requirements on the AR view compared to obtrusive AR, with a preference for direct views for the user and an avoidance of head-worn displays [22] such as the one shown in Azuma's article from 1993 [1].

2.1.2 Mobile augmented reality

Mobile augmented reality, MAR, is defined as AR created and accessed through a mobile device, for example a mobile phone, a digital camera, or a navigator, in mobile contexts. The mobile phone market in particular has been at the forefront of MAR development. This progress can be credited to a large extent to the development and inclusion of high-end sensors in mobile platforms [21]. This has enabled the first of the three main technologies required for a performant AR view.

This progress of the sensors in mobile phones, coupled with the progress in the computational power of the phones released in the last decade, means that the second of the required technologies is also coming together to make performant AR views on mobile phones realizable.

It has been shown that nontrivial real-time AR views are possible on mobile phone devices where considerable rendering and computing is required. Valero shows that 3D modeling and spatially aware placement in the real world is possible [10].

2.1.3 User expectations on AR

Because of the recent technical progress of mobile phones, the underlying requirements are now met to produce valuable mobile augmented reality views. It has been shown that providing a novel way of displaying information that is met with acceptance and success from the users requires basing the design on knowledge of those users' expectations and requirements.

The expectations from users on mobile AR to help them in the various activities that the applications are set out to support are high [21].


How to design such an interface, where the expectations are so high, is a complex challenge. Because of the futuristic feel of mobile augmented reality, users have preconceived notions of how the views should function and of the quality of said views.

The value for the users of a mobile AR view is, according to Olsson, Lagerstam, and Kärkkäinen, related to experiences of captivation and intuitiveness as well as to playfulness, inspiration, and creativity [21].

It is quite clear that user expectations on AR views are generally high. A failure to live up to the users' expectations will decrease the overall appreciation of the whole application, which of course is not desired. On devices where computational power and battery capacity are especially limited, smart solutions and integrations are required for the application to receive the end users' praise.

2.2 Object Recognition

Object recognition is defined as the act of being able to, through different approaches, differentiate a sought-after object from the background in an image or a video. While humans usually have no problem perceiving and differentiating objects from the background, for computers this is no trivial task and requires quite some computational power to achieve. It is especially hard for computers when false classifications are really disastrous and there need to be close to no such false positives.

Object recognition is a well-known subfield of the computer vision research field, and there has been plenty of work done to track and recognize objects in an image. In this section the most prevalent ways of performing object recognition are discussed.

The most common model for performing object detection is to first find the features in an image using a feature detector and then describe these features using feature descriptors. These descriptors are then matched against pre-calculated descriptors of the object being searched for.
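A minimal sketch of this detect-describe-match pipeline, using OpenCV's Python bindings and the freely available ORB algorithm (the file names and the feature count below are placeholder assumptions, not values from this project):

```python
import cv2

# Pre-compute descriptors for a reference image of the target object.
orb = cv2.ORB_create(nfeatures=500)
reference = cv2.imread("reference_logo.png", cv2.IMREAD_GRAYSCALE)   # placeholder path
ref_kp, ref_des = orb.detectAndCompute(reference, None)

# For each incoming frame: detect features, describe them, match against the reference.
frame = cv2.imread("camera_frame.png", cv2.IMREAD_GRAYSCALE)         # placeholder path
kp, des = orb.detectAndCompute(frame, None)

# ORB descriptors are binary strings, so Hamming distance is the appropriate metric.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(ref_des, des), key=lambda m: m.distance)
print(f"{len(matches)} matches, best distance: {matches[0].distance if matches else None}")
```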

2.2.1 Feature Detection

Feature detection is used extensively as a first step to detect a specific object in an image. The act of feature detection outputs points in the image that are deemed interesting for further usage in many computer vision algorithms. Once feature detection has been performed, these features can be used to compute descriptors, which in turn can be matched against descriptors from reference images. These matches are then evaluated to determine whether the sought-after object is in fact visible in the evaluated image.

There are three classifications of feature detectors based on how feature points are found in the image: feature points can be found at edges in the image, at corners in the image, or in blobs in the image. The feature detectors discussed in this report are classified as shown in table 2.1.


Feature detector       Edge detector   Corner detector   Blob detector
Canny edge detector         X
SIFT                                          X                X
SURF                                          X                X
ORB                                           X                X

Table 2.1: Classification of feature detectors.

2.2.2 Feature Descriptor

Once the features are detected there needs to be a way to describe them that is repeatable between executions. Without repeatability there really is no stability of the descriptor, as the same feature would be described differently from one description to another. Another important thing to consider is the comparability of feature descriptors. Most implementations of feature descriptors accomplish this by using the Euclidean distance between the vector representations of the feature descriptors.

A commonly used way to describe the detected features is to use several points around the feature and describe these. This gives contextual information for the feature point which can easily be compared between descriptors. SIFT, SURF, and ORB all use this approach to describe detected features.

2.2.3 Existing libraries

By far the most prevalent library for computer vision implementations across several platforms today is the Open Source Computer Vision Library, OpenCV.

Set out to increase the pace of adoption of machine perception in commercial products, OpenCV provides a common infrastructure for computer vision applications under a BSD license. This means that it is easy to use for both commercial and educational purposes.

With more than 2500 implemented general-purpose computer vision algorithms, it covers every part of the computer vision field, whether it be facial recognition or object recognition. With over 7 million downloads it is the de facto standard for business and academic purposes.

While written in C++, it has interfaces for C, C++, Matlab, Python, and Java, and supports Windows, Linux, and Mac OS, along with Android and iOS [23].

The existence of such an extensive library for computer vision enables advanced use cases to be implemented and realised much more easily than if everyone had to reinvent the wheel every time.

It is however worth noting that some of the algorithms are patented and distributed in a non-free section of the library. The algorithms in this non-free section are SIFT and SURF. This needs to be taken into consideration when considering using these algorithms in commercial products.


Another computer vision library is Vuforia, which is a proprietary Software Development Kit, SDK, developed by Qualcomm [32].

Vuforia does not expose the developer to the algorithms but provides a black-box solution where the computer vision part of the program is a call to the Vuforia SDK. If the result of said computer vision call is not good enough, there is no way of changing the underlying algorithms, as the choices of algorithms made by the developers are not published.

2.2.4 Canny Edge Detection

Canny edge detection is named after John Canny, who developed the algorithm and proposed the detector in 1986. The detector uses adaptive thresholding with hysteresis based on the amount of noise estimated to be in the image.

An edge point is defined by the Canny algorithm as a local maximum of the Gaussian operator G_n convolved with the processed image. Canny specified G_n to be the derivative of the two-dimensional Gaussian in the direction n.

In two dimensions an edge has a tangent to the contour that the edge defines; this tangent is defined as the direction of the edge. n can be approximated from the smoothed gradient of the Gaussian convolved with the image. The edges can then be found by

\[
\frac{\partial}{\partial n}\, G_n * I = 0 \quad \text{or equivalently} \quad \frac{\partial^2}{\partial n^2}\, G * I = 0
\]

Because the signal-to-noise ratio is different for each edge, and this ratio needs to be above a certain threshold for an edge to be accepted, there is a need for using different widths, or scales, of the Gaussian operator. The edges marked by the smallest operator are used to estimate how the larger operators would respond if these were the only edges in the image. If the output from the larger operators is significantly greater than what was estimated from the smaller operators, a new edge is located at this spot.

Canny notes that most edges are in fact found using the smaller operators and that the ones picked up by the larger operators are mostly shadow or shading edges, but the larger operators do enable finding edges at a larger scale [5].
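A minimal usage sketch of the Canny detector as exposed by OpenCV; the smoothing scale and the two hysteresis thresholds below are assumed starting values that would need tuning per scene:

```python
import cv2

gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)   # placeholder path
# Smooth first; the kernel size and sigma play the role of the operator width discussed above.
blurred = cv2.GaussianBlur(gray, (5, 5), 1.4)
# 50 and 150 are the lower and upper hysteresis thresholds (assumed values).
edges = cv2.Canny(blurred, 50, 150)
cv2.imwrite("edges.png", edges)
```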

2.2.5 SIFT

Developed by David Lowe, the algorithm for Scale-Invariant Feature Transform, abbreviated SIFT [16], is a patented algorithm to find and describe features in an image, which can then be matched against features computed from images of a target to perform object recognition. An image with dimensions 500×500 pixels will, according to the tests performed by Lowe, generate around 2000 stable image features.


Only three features are needed for a correct and reliable object match; of course, more image features will increase the probability of a correct match.

The SIFT algorithm for object detection described in the paper is computed in a four-stage process, which can be described as follows:

1. Extrema detection in all scales over the image.

This is done through implementing a difference-of-Gaussians function which is commonly used throughout computer vision. This difference-of-Gaussians function is used to locate points of interest in the image that are invariant to both scale and orientation.

2. Locate keypoints.

Localization of keypoints is implemented in a way that selects stable candidates from the extrema detection. These keypoints are fitted to a model where scale and location are determined.

3. Keypoint orientation is determined.

Each extremum that is determined to be a keypoint is now assigned one or more orientations. These are used for the rest of the computation, which makes the keypoints invariant to these possible transformations.

4. Keypoint descriptor.

Each keypoint is described based on the gradients in its surroundings in the image. This surrounding region is based on the determined scale of the keypoint. The gradients are then transformed into a 128-dimensional vector that allows for significant change in illumination and shape.

In the extrema detection, the technique used is a scale space function based on the convolution of a scale space Gaussian and the processed image. Note that here ∗ is the convolution operator.

\[
L(x, y, \sigma) = G(x, y, \sigma) * I(x, y)
\]

A difference-of-Gaussians is then calculated as the difference of two scales, where the scales differ by a factor k.

\[
D(x, y, \sigma) = (G(x, y, k\sigma) - G(x, y, \sigma)) * I(x, y) = L(x, y, k\sigma) - L(x, y, \sigma)
\]

This is a well-suited function for evaluating extrema in the image, partly due to its efficiency to calculate but mostly due to being an approximation of the scale-normalized Laplacian of Gaussian.

The difference-of-Gaussians is evaluated in octaves, where the scale is doubled in each step, and three scales are used per octave.


Each sample point is then compared to its neighbors in its own scale and in the scales above and below to find the local extreme points. If and only if the sample point is higher or lower than all of its 26 neighbors (9 in the scale below + 8 in the same scale + 9 in the scale above) is it considered a candidate for a keypoint.

Lowe showed that the optimal number of scales used is three by comparing the repeatability and the correct match to descriptors in a database.

Now that the candidates for keypoints have been found, these candidates need to be evaluated based on scale, location, and ratio of principal curvatures (the ratio describing how much a surface bends in its two principal directions). First, candidates with contrast below a certain value are rejected as they are deemed unstable. Then the remaining candidates whose ratio of principal curvatures exceeds a threshold are removed. This threshold ensures that no candidate has a bad placement, for example on an edge in the image. If a keypoint were located on an edge it would be susceptible to noise, due to the strong difference-of-Gaussians response that edges produce even when the edge is poorly defined.

The orientation of a keypoint is found by estimating the gradient magnitude and orientation from pixel differences around the keypoint. The pixels around the keypoint are evaluated and used to create a histogram where each orientation sample is weighted by its gradient magnitude. The data are collected in 36 bins that cover the possible orientations. The orientation with the highest-valued bin is chosen, along with any orientations whose bins are within 80% of the highest value. Cases with more than one orientation are rare according to the tests conducted by Lowe.

The feature descriptor proposed by Lowe is computed by creating histograms over 4×4 regions of orientations around the keypoint. The end representation of the keypoint is then these 16 histograms with eight bins each, for a resulting descriptor with 128 dimensions (16 · 8 = 128).

These descriptors can then be used to detect a specific object in the image by matching the keypoints against a reference database via nearest-neighbor search. The nearest neighbor is found by calculating the Euclidean distance to the keypoints in the reference database, using a modified version of kd-trees to find close neighbors efficiently. Matches are then thresholded on the distance to the second-closest match to keep only distinct keypoint matches; Lowe used a threshold of 0.8 on the ratio between the distances to the first and second closest matches.
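A sketch of this matching step with Lowe's ratio test, using OpenCV's brute-force matcher. In current OpenCV releases SIFT is created with cv2.SIFT_create, while at the time of this thesis it lived in the non-free module; the file names are placeholders and the 0.8 threshold is the value Lowe reports:

```python
import cv2

sift = cv2.SIFT_create()   # assumes an OpenCV build where SIFT is available
_, des_ref = sift.detectAndCompute(cv2.imread("reference.png", cv2.IMREAD_GRAYSCALE), None)
_, des_img = sift.detectAndCompute(cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE), None)

# SIFT descriptors are real-valued, so the Euclidean (L2) norm is used.
matcher = cv2.BFMatcher(cv2.NORM_L2)
good = [m for m, n in matcher.knnMatch(des_ref, des_img, k=2)
        if m.distance < 0.8 * n.distance]   # keep only distinct matches
print(len(good), "distinct matches")
```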

A Hough transform is used to cluster the keypoints that agree on object location and orientation. These clusters are then used to solve for the affine transformation that best relates the reference image to the processed image.

The affine transformation from reference image coordinates to processed image coordinates can be described as

\[
\begin{bmatrix} u \\ v \end{bmatrix} =
\begin{bmatrix} m_1 & m_2 \\ m_3 & m_4 \end{bmatrix}
\begin{bmatrix} x \\ y \end{bmatrix} +
\begin{bmatrix} t_x \\ t_y \end{bmatrix}
\]


Figure 2.3: Example visualization of SIFT keypoints.

To determine the six unknown parameters, m1, m2, m3, m4, tx, and ty, at least three matches are needed. More matches do improve stability and performance, but three is the absolute minimum number of matches needed. These unknown parameters are found by solving the least squares problem for x in the linear equation system

Ax = b

by writing it as

\[
x = (A^T A)^{-1} A^T b
\]
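A small sketch of solving for the six affine parameters with ordinary least squares in NumPy; the matched coordinates below are made-up illustrative values, and at least three correspondences are assumed:

```python
import numpy as np

# Hypothetical matched points: reference image (x, y) -> processed image (u, v).
ref = np.array([[10.0, 12.0], [40.0, 15.0], [22.0, 48.0], [35.0, 40.0]])
img = np.array([[112.5, 80.1], [171.0, 92.3], [128.4, 148.0], [155.7, 136.2]])

# Build A x = b with x = (m1, m2, m3, m4, tx, ty): two rows per correspondence,
# one for the u equation and one for the v equation.
A = np.zeros((2 * len(ref), 6))
b = img.reshape(-1)
A[0::2, 0:2] = ref   # u = m1*x + m2*y + tx
A[0::2, 4] = 1.0
A[1::2, 2:4] = ref   # v = m3*x + m4*y + ty
A[1::2, 5] = 1.0

x, *_ = np.linalg.lstsq(A, b, rcond=None)   # equivalent to (A^T A)^{-1} A^T b
m1, m2, m3, m4, tx, ty = x
```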

An example of SIFT descriptors extracted using the OpenCV implementation and descriptor visualization are shown in figure 2.3.

The SIFT algorithm is patented in the United States of America by the University of British Columbia, and its implementation is included in the non-free package of OpenCV. This package is not available in the Android or iOS interfaces.

2.2.6 SURF

Speeded Up Robust Features, or SURF, is an algorithm developed to decrease the computation required to detect and describe features in an image. To achieve this goal SURF is based on the Hessian matrix. The Hessian matrix of a function is a square matrix of the second-order partial derivatives of the function.

\[
H(f)(\mathbf{x}) =
\begin{bmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\
\frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2}
\end{bmatrix}
\]


An approximation of the Hessian matrix is used, which decreases computation time. The Hessian matrix for a certain position, x = (x, y), in the image at a scale σ is defined as

\[
\mathcal{H}(\mathbf{x}, \sigma) =
\begin{bmatrix}
L_{xx}(\mathbf{x}, \sigma) & L_{xy}(\mathbf{x}, \sigma) \\
L_{yx}(\mathbf{x}, \sigma) & L_{yy}(\mathbf{x}, \sigma)
\end{bmatrix}
\]

Here L_xx(x, σ) is the convolution of the second-order Gaussian derivative with the image at the point x. This gives, just as the difference-of-Gaussians used in SIFT, an approximation of a detector based on the Laplacian operator.

The detection is done using box filters which approximate the second-order Gaussian derivatives and which can be calculated with great efficiency using integral images. The size of a filter corresponds to the scale of the Gaussian derivative; for example, a filter of size 9×9 corresponds to a scale, σ, of 1.2.

When determining which points are suitable to be considered keypoints for the feature detection, the maxima of the determinant of the approximated Hessian matrix are used.

The SURF descriptor is based on the same properties as the SIFT descriptor: a reproducible orientation is found, and a square rotated to match that orientation is placed around the keypoint to extract features. The size of the square is based on the scale of the keypoint. This square is then split into 4×4 subregions, and in these regions the Haar-wavelet responses, weighted by a Gaussian centered on the keypoint, are summed up both as signed values and as absolute values. This gives a resulting descriptor vector of size 64.

There are special variants of SURF: SURF-36 uses 3×3 subregions, and U-SURF does not consider orientation. U-SURF is recommended where the orientation of the object will most likely not change, and it provides faster execution time as the orientation does not have to be computed [3].

Just as SIFT, SURF is patented in the United States of America, and an implementation is included in the non-free package of the OpenCV library.

An example of SURF descriptors is shown in figure 2.4.

2.2.7 ORB

To increase the performance of object recognition, the algorithm ORB, Oriented FAST and Rotated BRIEF, was developed by OpenCV Labs. As the name hints, ORB builds upon well-known existing algorithms, namely the FAST keypoint detector and BRIEF descriptors. By combining these already known algorithms, the authors of the article where the algorithm was published report performance as good as SIFT and SURF while being two orders of magnitude faster than SIFT and significantly faster than SURF.

FAST, which is an acronym for Features from Accelerated Segment Test, is a corner detection algorithm that focuses on efficiency. Traditionally FAST samples 16 pixels on a circle of radius 3 around a keypoint candidate to determine whether it is a corner.


Figure 2.4: Example visualization of SURF keypoints.

If there exists a contiguous set of N of these sampled pixels that all have either larger intensity than the candidate plus a threshold, or lower intensity than the candidate minus a threshold, then the candidate is determined to be a corner [27].

ORB uses a slightly modified version of FAST where a radius of 9 pixels is used instead of the traditional 3. Additionally, ORB adapts the threshold value to ensure that a sufficient number of corners is detected in this stage of the algorithm. The orientation of a keypoint is determined using the intensity centroid.¹ Binary Robust Independent Elementary Features, BRIEF, describes features using a set of binary intensity tests on an image patch p at two image points x and y:

\[
\tau(p; x, y) =
\begin{cases}
1 & p(x) < p(y) \\
0 & p(x) \ge p(y)
\end{cases}
\]

where p(x) is the intensity measured at point x. The point pairs to measure in order to build the binary test string are shown to produce the best recognition rate when sampled from a Gaussian distribution around the keypoint [4].
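A small NumPy sketch of this binary test, run on a smoothed grayscale patch centred on a keypoint; the patch size, number of tests, and Gaussian spread are assumptions in the spirit of BRIEF rather than the exact values used by ORB:

```python
import numpy as np

PATCH, N_TESTS = 31, 256                       # assumed patch size and descriptor length
rng = np.random.default_rng(0)
# Sample test point pairs from a Gaussian around the patch centre, as suggested in [4].
pairs = rng.normal(0.0, PATCH / 5.0, size=(N_TESTS, 2, 2))
pairs = np.clip(pairs, -(PATCH // 2), PATCH // 2).astype(int)

def brief_descriptor(patch):
    """tau(p; x, y) = 1 if p(x) < p(y) else 0 for each sampled pair, packed into bytes."""
    c = PATCH // 2
    bits = [1 if patch[c + x1, c + y1] < patch[c + x2, c + y2] else 0
            for (x1, y1), (x2, y2) in pairs]
    return np.packbits(bits)
```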

ORB uses a modified version of the BRIEF descriptor which allows for rotations in the plane of the object. This is needed because of the observation that the accuracy of regular BRIEF falls off greatly when a rotation of more than a few degrees is present.

To make this orientation handling possible in a fast manner, the authors define the set of sampled points used in the intensity tests of the normal BRIEF descriptor as

\[
S = \begin{pmatrix} x_1 & \cdots & x_n \\ y_1 & \cdots & y_n \end{pmatrix}
\]

¹ Interested readers are directed to the article describing this corner analysis: P. L. Rosin. Measuring corner properties. Computer Vision and Image Understanding, 73(2):291-307, 1999.


Figure 2.5: Example visualization of oriented FAST keypoints used in the ORB algorithm.

Then the steered BRIEF, sBRIEF, point set is defined as

\[
S_\theta = R_\theta S
\]

where R_θ is the rotation matrix for the angle θ. The angles used are discretized in increments of 2π/30 [28].

An example of the oriented FAST keypoints of the ORB algorithm is shown in figure 2.5.

The ORB algorithm, in contrast to SIFT and SURF, is not patented and is available in the standard OpenCV library. Since it exists in the standard library, it is included in both the Android and the iOS interfaces.

2.2.8 Circle Detection

Detecting circles and circular shapes in an image is a major part of visual navigation and tracking of objects. As an example of an implementation of real-time object tracking in a video feed, there has been successful work identifying the football in a soccer game using computer vision and circle detection. This example is particularly interesting as the ball is not always completely visible and a perfect circle can not always be found. Even though the ball is sometimes occluded from the camera, a success rate of over 92% is reported, with an even higher success rate when the whole ball is visible [7]. To discover potential balls in the soccer game video feed, the technique used with great success was the circle Hough transform. The circle Hough transform identifies the edge pixels and then groups them together to form a desired shape through a voting process. These desired shapes can be of arbitrary type and form after a generalized Hough transform was published in 1981 by D. H. Ballard [2]. For circles the equation used is the standard equation describing a circle,

\[
(x - x_p)^2 + (y - y_p)^2 = r^2
\]


Figure 2.6: Binary thresholding example.

However, for the more general case where the sought-after shape is an ellipse, the equation is

\[
\frac{(x - x_p)^2}{S_x^2} + \frac{(y - y_p)^2}{S_y^2} = 1
\]

with the addition of a possible rotation θ if the major axis of the ellipse is not parallel to the x-axis. For even more complex shapes than an ellipse, template matching is used to find the shapes in an image.
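A usage sketch of the circle Hough transform as exposed by OpenCV; every parameter below (accumulator resolution, minimum distance, Canny threshold, accumulator threshold, radius range) is an assumed starting point that depends heavily on image scale:

```python
import cv2

gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)   # placeholder path
gray = cv2.medianBlur(gray, 5)                          # reduce noise before the voting step
circles = cv2.HoughCircles(gray, cv2.HOUGH_GRADIENT, dp=1, minDist=40,
                           param1=150, param2=40, minRadius=5, maxRadius=120)
if circles is not None:
    for x, y, r in circles[0]:
        print(f"candidate circle at ({x:.0f}, {y:.0f}) with radius {r:.0f}")
```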

There has been additional work done on increasing the performance of the Hough transform for specific shapes. For the particular case where elliptic shapes are sought, another algorithm has been proposed which is based on the straight-line Hough transform. This algorithm finds the centers of the ellipses in the image in the straight-line Hough transformation space. Through knowledge of the center, x_p and y_p, the number of unknown parameters is reduced from five to three. The remaining unknown parameters are the rotation parameter θ and the size parameters S_x and S_y. These unknown parameters are then found using a clustering of the edge pixels in the image space [19].

Another successful circle detection technique is to use geometrical feature constraints to find the circles in the image. This has been shown to be a successful technique when used to find the rims of a car in an image [13]. To find the circle candidates, a binary version of the analyzed image is used. This binary image is produced using a threshold found with Otsu's algorithm [24], where the pixels in a grayscale image above the found optimal threshold are set to the maximum value, 255, and the rest are set to the minimum value, 0. An example of such a thresholding result can be seen in figure 2.6.
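A minimal sketch of producing such a binary image with Otsu's algorithm in OpenCV (the input path is a placeholder):

```python
import cv2

gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)   # placeholder path
# The threshold value passed in (0) is ignored; Otsu's method picks it automatically.
t, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
print("Otsu threshold:", t)
```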

This binary image is then used as a basis to find the contours in the image through the Moore-Neighbor tracing algorithm. Once the contours of the image are found, the circles are identified using a tangent line comparison with a perfect circle.


Because circles are not always perfect and completely visible in the contours, the circles are constructed from partial contours in the image. The final center coordinates and radii of the found circles are then determined using the least squares method.

The authors of the article that found the rims of a car report good precision on the single image tested [13]. The lack of performance testing on a larger scale is noteworthy, as the only information given about speed is the specification of the computer used for testing.

2.2.9 Ellipse fitting

As described, ellipses can be found in an image using the Hough transform, but it is not the only way. Fitzgerald presented, in an article giving a rundown of different conic fitting methods, an algorithm tailored for ellipse fitting called B2AC, which finds the best fit of an ellipse to a set of points using a least squares method with a quadratic constraint [9].

The B2AC algorithm is implemented in the OpenCV library function fitEllipse, which returns the fitted ellipse as a rotated bounding rectangle.
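A sketch of feeding contours from a binary image into fitEllipse; the thresholding choice and the OpenCV 4.x return signature of findContours are assumptions:

```python
import cv2

gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)                         # placeholder path
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)

for c in contours:
    if len(c) >= 5:   # fitEllipse requires at least five contour points
        (cx, cy), (w, h), angle = cv2.fitEllipse(c)   # center, full axis lengths, rotation
        print(f"ellipse at ({cx:.1f}, {cy:.1f}), axes {w:.1f}x{h:.1f}, angle {angle:.1f}")
```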

2.3 Color segmentation

Depending on how a color is represented, the act of classifying nearby colors is more or less difficult and provides varying results. To classify a color in a color space, similar colors should ideally be very close to each other in that color space. There are different approaches to how this is done, for example emulating how humans perceive colors and grouping them accordingly, or focusing on how a color changes depending on the lighting in the image. The color representations are plenty and all have their merits.

Color segmentation is used extensively in face recognition to segment out the interesting parts of an image with regard to potential areas where a face could be located [25], but also for general object recognition where the object has a specific color or colors.

2.3.1 RGB

RGB is the most common color space and is the one used in LCD monitors to show the colors to the end user. In this color space the colors are represented using three separate channels consisting of the red, green and blue part in each pixel. Visually this can be seen as a cube where each axis represents either red, green, or blue.

In OpenCV this is the standard way of interpreting an image with one exception being that the blue and red color channel are swapped. This means that the color space used in practice in OpenCV is BGR and not RGB.


2.3.2 HSV

In the HSV color model the colors are represented as a cylinder where the angle of the cylinder is the hue, the radius of the cylinder is the saturation and the height of the cylinder is the value. This gives a rather easy way to differentiate colors depending on the hue value. The HSV color model was developed for computer graphics applications, mainly for usages such as color picking. It does not see widespread usage in computer vision.

In OpenCV images using the HSV color model, the hue value is represented as discrete values from 0 to 179 instead of the traditional representation, which is the interval [0, 360). Both saturation and value are represented using values from 0 to 255, which means that all three values in the HSV color model can be represented by a char. This means that there are 180 different possible hue values used to represent all the colors in the color model, one for every 2 degrees in the cylinder that represents the color space.
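A short sketch of converting a frame to HSV in OpenCV, where these value ranges apply (the image path is a placeholder):

```python
import cv2

bgr = cv2.imread("frame.png")                  # OpenCV loads color images in BGR order
hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)     # H in [0, 179], S and V in [0, 255]
h, s, v = cv2.split(hsv)
print(h.min(), h.max(), s.max(), v.max())
```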

2.3.3 CIE L*a*b*

In 1948 Hunter defined the Lab color space [14] as a three-dimensional color space. The three dimensions used in the representation are lightness, represented by the L channel, while the a and b channels are used for color opponents. In the modern version, the CIE L*a*b* color space, the difference is that the dimensions are based on a cube root transformation, while the original Lab color space uses square root transformations. However, both Lab color spaces originate from the same CIE XYZ color space.

The lightness dimension in the color space is determined using cubic root of the relative luminance of the image. The a and b channels are as mentioned before the color opposites where a goes from green to red and b from blue to yellow. This is built upon the observation that no color is both blue and yellow at the same time, or red and green.

The CIE in the name of the color space comes from its endorsement from the Commission internationale de l’éclairage as reference model for color spaces.

CIE L*a*b* is a color space designed to emulate how humans perceive colors, to make a more realistic and better representation of the colors. The Euclidean distance in the CIE L*a*b* color space is approximately proportional to the perceived difference between the colors [15].

It has been shown that, using the CIE L*a*b* color space as one parameter to differentiate colors, window panes can be distinguished in images of a building [26]. This was accomplished by running the k-means algorithm on the colors in the images and then comparing the Euclidean distances of the resulting cluster centres to reference values.
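A sketch of this kind of colour clustering in the CIE L*a*b* space with OpenCV's k-means; the cluster count and the reference colour below are hypothetical values, not taken from the cited work:

```python
import cv2
import numpy as np

bgr = cv2.imread("building.png")                                  # placeholder path
lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)
pixels = lab.reshape(-1, 3).astype(np.float32)

criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
_, labels, centers = cv2.kmeans(pixels, 4, None, criteria, 5, cv2.KMEANS_RANDOM_CENTERS)

reference = np.array([140.0, 150.0, 170.0], dtype=np.float32)     # hypothetical target colour
distances = np.linalg.norm(centers - reference, axis=1)           # Euclidean distance in L*a*b*
print("closest cluster centre:", centers[np.argmin(distances)])
```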


2.3.4 Comparison of color spaces

In the article by Raghuvanshi and Agrawal, different color representations are compared for segmenting images for potential face colors [25]. In the report the color segmentation is evaluated using four different color models: RGB, YCbCr, CIE L*a*b*, and HSV. The authors present a detailed comparison where CIE L*a*b* and YCbCr performed the best, with around 10% false face detections and 5% false dismissals. The performance using the HSV color model was reported as almost as good, while RGB was far worse than the other three color models, with a false face detection rate of almost 46% and a false dismissal rate of around 32.5%. The authors also proposed a combination of the available color models which overcomes the shortcomings of using a single color model. In the aforementioned report the published results show the errors cut in half using such a combination.

None of these color models is perfect at representing the different colors independently of the lighting present in the scene. This is an open research field, and in an article from researchers at Harvard [6] the authors Chong, Gortler and Zickler propose a new color model intended to perform better than current color models. Their results are promising: their color model performs better on the Witt database and equally well on RIT-Dupont compared to CIE L*a*b* and CIE L*u*v*. This color model is however not implemented in the OpenCV library, but it does show that the choice of color model is not trivial at all, especially when looking for certain colors independently of the lighting in the scene.

It is shown in the same article by Chong, Gortler and Zickler that out of the existing color models in OpenCV the best results are expected from CIE L*a*b* or CIE L*u*v*.


Implementation

In this section the work done on comparing object recognition methods against each other, and on how to incorporate the information from the object recognition to improve an AR view, is discussed.

First, a breakdown of the current AR view present in the application is given. Then the target logotype of Posten is described and examined. This knowledge of the logotype is used by the custom detector to make assumptions that candidates for the placement of the logotype in the image must satisfy.

The custom detector is compared against another detector based on a more traditional way of image analysis where features of the keypoints in the image are compared against a database of features extracted from images of the logotype.

Both approaches have in common that the result they produce, the matches of the logotype in the image, needs to be interpreted and shown in the augmented reality view in some way. This AR view incorporation is described in section 3.6.

3.1 Present AR view

To get a better understanding of what is being improved upon, an overview of the current AR view present in applications developed by Bontouch is presented.

In this report the Android application for PostNord is considered, and more specifically the AR view present in this application. The choice of the Android version was made because of previous experience with that platform and the lack of an available development environment for Apple's iOS. There are currently no versions of the application for other platforms such as Windows Phone 8. However, targeting one specific platform does not prevent the work from generalising easily to other platforms, as the work is not Android specific. OpenCV exists for both Android and iOS, and the custom detector is implemented using the C++ interface, which enables it to work cross-platform on both Android and iOS.

In the current view the information presented is the direction to the points of interest, which in turn helps the user locate the desired point of interest. In the case of the PostNord application, these points of interest are the delivery points where parcels are available.


Figure 3.1: Screenshot from the AR view in the PostNord application.

The points of interest are presented to the user as markers on the screen, where the ones shown are those located in the direction the screen is being held, up to a maximum distance. For example, if the user of the AR view is facing south, only points of interest located to the south of the user, within the field of view, are presented, together with the distance to each point of interest. A screenshot of the AR view can be seen in figure 3.1.

The AR view can be used to find the direction the user should take to travel to the point of interest, but a point of interest can also be selected to show relevant information about it. In the case of PostNord this information is the name and address of the delivery point and whether the store which maintains the delivery point is open.

To make this AR view as performant as it is in its current state, the PostNord application uses information gathered from the GPS sensor coupled with information from the rotation sensor. This information is then processed to figure out the direction the user is facing and to display the relevant information about the points of interest to the user. This approach is heavily reliant on accurate information from the sensors: if faulty sensor information is given, the information displayed to the user will also be faulty. This hard coupling is really hard to escape. For the GPS information a margin of error is also given,


and this can be used to make the overall experience better for the end user (in figure 3.1 this can be seen in the display of the margin of error, and a warning icon under the compass if the margin of error is large). However, the information from the rotation sensor does not come with such a margin of error and can only be taken as ground truth without any other information about the surroundings that could verify it.

In the current AR view the camera preview is not used for anything other than being displayed as a background for the points of interest. In this report, different options for how to analyze this video feed to locate the points of interest, and how to incorporate this information to make for a better user experience, will be evaluated.
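To make the sensor dependence concrete, the following is a simplified sketch of the kind of bearing computation and field-of-view filtering such a sensor-based AR view relies on; it is not the actual Bontouch implementation, and the 60-degree field of view is an assumption:

```python
import math

def bearing_to(lat1, lon1, lat2, lon2):
    """Initial bearing, in degrees from north, from the user to a point of interest."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    y = math.sin(dlon) * math.cos(phi2)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlon)
    return math.degrees(math.atan2(y, x)) % 360.0

def visible(poi_bearing, device_azimuth, fov_degrees=60.0):
    """True if the point of interest falls inside the camera's horizontal field of view."""
    delta = (poi_bearing - device_azimuth + 180.0) % 360.0 - 180.0
    return abs(delta) <= fov_degrees / 2.0
```

Any error in the GPS position or in the azimuth reported by the rotation sensor propagates directly into the marker placement, which is exactly the instability this thesis sets out to reduce.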

3.2 Object recognition

Two different approaches are implemented and evaluated: one implementing a feature detector using the well-known algorithm ORB, and another approach which heavily uses knowledge of the target logotype to recognize objects in the images. The fact that SIFT and SURF are patented in the United States of America, together with the fact that the OpenCV library interface for Android used in the implementation stage does not, as mentioned, have support for non-free algorithms, unfortunately rules out the usage of both SIFT and SURF.

Both implementations use the observation that, in a normal use case, the user is only going to find the logotype in the upper half of the augmented reality view. This is true because the logotypes are located above the entrance to the delivery points, or at the same height as the user on a wall of the same building as the delivery point. The angle at which the user must hold the phone to make the logotype appear in the bottom half of the screen is quite steep and will not occur under normal use. Discarding the bottom half effectively cuts the size of the input image in half and, as a result, the computation time as well. This allows a larger resolution to be used without increasing the computation time significantly.
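A one-line version of this cropping step, assuming frames arrive as OpenCV images (the file name is a placeholder):

```python
import cv2

frame = cv2.imread("camera_frame.png")        # placeholder path
roi = frame[: frame.shape[0] // 2]            # keep only the upper half before running detection
```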

3.3 Target logotype

When performing the object recognition there needs to be a target to find in the image. In the scope of this project the target is the logotype of Posten. This particular logotype, shown in figure 3.2, is a blue circle with a yellow crown in the upper middle of the logotype and a yellow post horn below it. Because of its shape, the logotype is low on corners, which hinders corner detectors from being truly effective in detecting it. This is not a problem for the algorithms using a corner detection method discussed here, as they all use blob detection in addition to the corner detection. It could however still be something that hinders the performance of these algorithms.

Another very noteworthy observation about the logotype is that it is almost always entirely flat where it appears in the real world.


Figure 3.2: Logotype of Posten.

Figure 3.3: Plot of all colors of the Swedish postal service logotype in the RGB color space.

If the target were a three-dimensional object, the detection would be significantly harder than for objects that exist in a plane.

The colors in the logotype are, as mentioned earlier, yellow and blue, and this combination of colors is quite rare in the real world. The yellow part especially is rare, while blue is more common.

An image containing only the logotype of Posten, taken indoors with little sunlight, is used to compare the different color spaces described in section 2.3. In figure 3.3 the colors of this logotype image are extracted and plotted in the three-dimensional space that the RGB color space describes, in figure 3.4 the colors of the same image are plotted in the HSV color space, and in figure 3.5 the colors are plotted in the L*a*b* color space.


Figure 3.4: Plot of all colors in the HSV of the Swedish postal service logotype.

Figure 3.5: Plot of all colors in the L*a*b* of the Swedish postal service logotype.

In the RGB color space the colors of the logotype show no clear distinction and have a wide spread, which makes this representation very hard and unreliable to use for analyzing the camera preview. Worth noting is the gray region, which is a result of the transition from yellow to blue (the black part comes from the background), and how far these regions lie from the purely yellow and blue parts.

The HSV color space looks very interesting from this initial analysis, as there are two very well defined regions if only the hue value is considered. One extremely well defined region is located around the hue value of 105 (this is using the OpenCV representation; in the traditional representation of HSV it would be 105 ∗ 2 = 210); this region is the blue part of the image. The yellow part is also located in a quite well defined span of the hue axis, namely around the hue values of 10 to 30, with some outliers.

The plot in L*a*b* looks at first glance to be quite spread out as well. However, note the scale of the a-axis and how small the spread is: it ranges from 120 to 134, which is in the very middle of the possible range. If the lightness value is discarded there are two regions, one in the positive b values (above 127 in the OpenCV representation) and one in the negative b values (below 127 in the OpenCV representation).

The HSV color space looks very interesting as the color space of choice when analyzing an image taken indoors. However, that is not the focus of this report; we want to be able to find the logotype outdoors in a range of different lighting conditions. Under different lighting it would be ideal if the new center of the colors in the color space stayed close to the centers obtained under other lighting conditions. In figure 3.6, figure 3.7 and figure 3.8 the colors of images taken in worse lighting conditions are plotted in the same color spaces as earlier. The results shown in these figures are computed from randomly chosen points in the images; this keeps the diagrams readable, and the result does not change significantly between executions.

As seen, the span of hue that needs to be accepted in order not to limit the detection significantly gets quite wide when several different lighting conditions are considered. This holds true especially for the blue in the logotype. This means that in the general case, where the target logotype is searched for in an outdoor environment, the only real way to determine whether a found circle is the logotype of Posten is to look for the yellow hue representation in the image. This severely limits its effectiveness, and even with checks to make sure that there are no far outliers, such as red or other colors distinct from blue, in the image, the task is hard and results in a fair share of false positives.

The usage of the hue values proved to work much better when a threshold on the value axis in the HSV color space is applied. This filters out black pixels that happen to have a certain hue value, since black can take any hue value and still look the same. Another threshold, on the saturation axis, is used to remove the white in the image. This eliminated a large share of the false positives.
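A minimal sketch of this kind of masking, assuming Python/OpenCV and illustrative threshold values rather than the exact bounds used in the detector:

```python
import cv2
import numpy as np


def yellow_mask(bgr):
    """Mask pixels whose hue plausibly belongs to the yellow of the logotype.

    The lower bounds on S and V discard near-white and near-black pixels,
    which otherwise could match any hue.  All numeric bounds are illustrative.
    """
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    lower = np.array([15, 80, 60])    # H, S, V lower bounds (OpenCV scale, H in [0, 179])
    upper = np.array([35, 255, 255])  # H, S, V upper bounds
    return cv2.inRange(hsv, lower, upper)
```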

Another approach is to use the L*a*b* color space and limit the a-axis significantly, together with thresholds on the lower and higher lightness values. The thresholds on the lightness value make sure that no completely dark or white area gets registered as blue or yellow, as these can be represented in several ways in the L*a*b* color space. The a-axis can be limited because the colors of the target logotype are blue and yellow, which are opposites on the b-axis in the L*a*b* color space. This approach proved to produce a good result if a realistic number of outliers was accepted, while a circle is discarded if it contains too many outliers. The points from the tested images of real world appearances of the logotype showed that the amount of yellow and blue along the b-axis was surprisingly small for something that is supposed to be classified as a match. This was worrying but proved not to be an issue, as the spread was consistent across the tested images.

Figure 3.6: Plot of all colors in the RGB of the Swedish postal service logotype in darker settings.

Figure 3.7: Plot of all colors in the HSV of the Swedish postal service logotype in darker settings.

Figure 3.8: Plot of all colors in the L*a*b* of the Swedish postal service logotype in darker settings.

3.4 Custom

One implemented approach is to use the knowledge of the object to locate in the video feed of the camera to build a custom detector. The most important visual features are the shape and the colors. If we can be sure beforehand of what shape the sought logotype has, what would otherwise be an extremely limiting assumption is in fact safe. If we are looking for a square logotype it is possible to limit the possible matches found in the analyzed image by the number of vertices in the contour. In a perfect world where the whole square logotype is visible there should be exactly four vertices in the contour of a possible match. This holds for flat logotypes independently of the rotation of the logotype in the video feed.

When the sought after shape is a circle the reasoning is not as straightforward, as there are no real corners. This is unfortunately the case for the logotype of Posten, which makes it somewhat harder to decide on a value for the number of vertices in the contour.


In the tests conducted, a winning concept is not to search for a specific number of vertices but to set a lower limit on the number of vertices in the contour. In the experiments a lower limit of 6 vertices in the contour has been a successful choice; if a higher number of vertices is required of a circle candidate, too many smaller circles are discarded.
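One way to count contour vertices is to approximate the contour with a polygon; whether the thesis counts vertices exactly this way is not stated, so the sketch below (Python/OpenCV, with an assumed approximation tolerance) is only an illustration of the lower-limit check.

```python
import cv2

MIN_VERTICES = 6  # lower limit reported to work well in the experiments


def enough_vertices(contour, epsilon_factor=0.01):
    """Approximate the contour with a polygon and require at least
    MIN_VERTICES vertices, so that circle-like shapes survive while simple
    shapes such as triangles and squares are discarded."""
    perimeter = cv2.arcLength(contour, True)
    approx = cv2.approxPolyDP(contour, epsilon_factor * perimeter, True)
    return len(approx) >= MIN_VERTICES
```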

As making assumptions on how blobs are detected in the image is hard, and due to the sparsity of corners in the logotype, an edge detector is the preferred way to find interesting parts of the image. The most prevalent way of finding edges in an image is the Canny edge detector, which is the algorithm used in the custom detector.

From the edges detected using the Canny edge detector, the contours in the image are estimated using the OpenCV function findContours, which is implemented using an algorithm published in an article by Suzuki [31]. The algorithm finds the contours in an image by following the borders in the image.

Intermediate testing has shown it fruitful for the execution time of the custom detector to reduce the number of potential circle candidates early on. If all the small contours that come from noise in the image can be removed before any other computation is done, a great deal of performance is gained for the overall detector. The quickest and most successful approach is to set a lower limit on the area of the circle candidate, seeing as noise, even after a Gaussian blur has been performed on the image, generally produces many small contours. Limiting the area to be above a specific number does make the detector unable to detect the sought after logotype from a distance where the logotype itself is filtered out. This is not deemed a problem, as the performance of the custom detector is not good enough to correctly score such small circles anyway. However, this could become a problem in the future when better hardware is available and larger performance demands are placed on the object detection.

It is quite expensive to calculate the exact area of all the contours, including the small ones, especially when the contours are not convex. This is alleviated by calculating the bounding box of the contour and using its area instead, which is really fast. There are cases where a contour coming from noise in an image is almost a long diagonal line, which will not be caught by this method. However, such noise will be caught by the ellipse estimation that is done later, and it is quite rare.

Another discovery when analysing the output from the Canny edge detector is that there will almost always be one or more large contours spanning close to the entire image. These are quite expensive to perform color analysis and general shape analysis on, which makes it beneficial to set an upper limit on the area of a contour as well. These two thresholding operations prove very successful at reducing the noise captured when finding all the contours in the image.

Once all the noise has been removed, an ellipse is fitted to each remaining contour and the area of the resulting ellipse is calculated along with the area of the contour. If the ratio of ellipse area to actual contour area is too large or too small, the circle candidate is discarded. The output from each stage in the procedure can be seen in figure 3.9.
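The filtering chain described above could look roughly as follows in Python/OpenCV; all numeric limits are placeholders rather than the tuned values, and the OpenCV 4 findContours signature is assumed.

```python
import math
import cv2


def circle_candidates(gray, min_area=400, max_area_frac=0.5,
                      ratio_lo=0.75, ratio_hi=1.25):
    """Filter contours down to plausible circle candidates.

    gray is the blurred, single-channel upper half of the camera preview.
    min_area and max_area_frac bound the cheap bounding-box area estimate;
    ratio_lo/ratio_hi bound the fitted-ellipse-to-contour area ratio.
    """
    max_area = max_area_frac * gray.shape[0] * gray.shape[1]
    edges = cv2.Canny(gray, 50, 150)
    contours, _ = cv2.findContours(edges, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)

    candidates = []
    for contour in contours:
        # Cheap area estimate through the bounding box: drops both the tiny
        # noise contours and the huge contours spanning most of the image.
        _, _, w, h = cv2.boundingRect(contour)
        if not (min_area <= w * h <= max_area):
            continue
        if len(contour) < 5:          # fitEllipse needs at least five points
            continue
        ellipse = cv2.fitEllipse(contour)
        (_, _), (major, minor), _ = ellipse
        ellipse_area = math.pi * (major / 2.0) * (minor / 2.0)
        contour_area = cv2.contourArea(contour)
        if contour_area <= 0:
            continue
        ratio = ellipse_area / contour_area
        if ratio_lo <= ratio <= ratio_hi:
            candidates.append((contour, ellipse))
    return candidates
```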


Figure 3.9: The different stages in the circle detector. (a) Original image, (b) Canny edge detection, (c) Contours, (d) Removal of squares, (e) Removal of noise, (f) Circles found.

In the last image of figure 3.9 the four matches are shown. It is obvious that not only ellipses are found. To combat this problem, one course of action would be to set a stricter limit on the area ratio between the fitted ellipse and the actual contour; however, the current limit has been shown to give the best result with respect to the desired true positive rate. Other approaches were to compute the symmetric difference between the fitted ellipse and the contour and limit this to a set value, or to require that the contour is not concave. Eliminating concave contours removes a lot of bad circles, but it starts to eliminate the target logotype in the image when the distance is large enough. This happens because the contour of the outer circle will sometimes blend in with the post horn in the middle of the logotype when the path tracing is performed to find the contours in the image.

It is not enough to return the matches shown in the last image of figure 3.9; some scoring of the ellipses found in the image is needed. As shown in section 3.3, the two candidate color spaces for scoring the ellipses are HSV and L*a*b*. In the end the L*a*b* color space was chosen, as the colors under different lighting conditions lie closer together than in the HSV color space, especially since the blue hue value can differ greatly. This is also supported by the results published by Raghuvanshi and Agrawal [25].

Figure 3.10: Found logotype in the end of the custom detector.

To score the circles, the ratio of blue to yellow found in the logotype is used together with a threshold on the number of outliers in the ellipse, where an outlier is a color which is too bright, too dark, or does not lie in the desired span on the a-axis, the red to green measurement of the color.
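A sketch of such a scoring function in Python/OpenCV is shown below; the lightness and a-axis bounds, the accepted outlier fraction, and the use of 127 as the midpoint of the b-axis are illustrative assumptions rather than the values used in the thesis.

```python
import cv2
import numpy as np


def score_ellipse(bgr, ellipse, a_span=(118, 138), l_span=(30, 225),
                  max_outlier_frac=0.10):
    """Score a circle candidate by the blue-to-yellow content of its interior.

    ellipse is the rotated box returned by cv2.fitEllipse.  Pixels that are
    too dark, too bright, or fall outside a narrow span on the a-axis count
    as outliers; blue and yellow are separated at the midpoint of the b-axis.
    Returns None when the candidate is rejected.
    """
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)
    mask = np.zeros(lab.shape[:2], dtype=np.uint8)
    cv2.ellipse(mask, ellipse, 255, -1)          # filled ellipse as a mask
    inside = mask > 0

    L, a, b = cv2.split(lab)
    valid = (inside & (L > l_span[0]) & (L < l_span[1])
                    & (a > a_span[0]) & (a < a_span[1]))
    outlier_frac = 1.0 - valid.sum() / max(int(inside.sum()), 1)
    if outlier_frac > max_outlier_frac:
        return None

    blue = int(np.count_nonzero(valid & (b < 127)))
    yellow = int(np.count_nonzero(valid & (b >= 127)))
    if yellow == 0:
        return None
    return blue / yellow   # compared against the ratio expected for the logotype
```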

The end result of the custom detector can be seen in figure 3.10, where a green rectangle is drawn around the best match of the logotype found by the custom detector.

3.5 Feature Detection

The feature detector implemented uses the ORB algorithm, as neither the SIFT nor the SURF algorithm is included in the Android version of the OpenCV distribution because of patent issues. This is not a major problem, as the performance results presented in the article describing the ORB algorithm show that ORB is orders of magnitude faster while still providing equivalent performance in finding matches in the image.


To find the target logotype with the feature detector the following steps were followed through (a code sketch of the full matching pipeline is given after the list):

1. Find the keypoints in the image through the appropriate OpenCV method. This method takes an argument which limits the number of keypoints in the image. This can be used to achieve the desired performance and to speed up the execution.

The number of keypoints detected in the images was originally set to be around 2000 because of the observations made by David Lowe when testing his algorithm SIFT. However, this number was decreased to 400 in the testing of the detector, as higher numbers did not increase the probability of finding the logotype in the image.

2. Extract features from the keypoints found in the image. OpenCV provides methods for performing this operation for the ORB algorithm.

3. Match these features with the features computed from the target, which are precomputed to save time. The features are matched using a FLANN based matcher. This matcher finds the approximate nearest neighbour in the dataset compared against, similar to the procedure proposed by David Lowe when testing the SIFT algorithm, where he handles the nearest neighbour problem using a modified kd-tree solution.

4. Use these found matches to determine whether or not a match has been found. This is done by discarding obvious outliers and then computing the homography. The homography is then used to find the position of the logotype in the image through a perspective transformation.

\[
\begin{pmatrix} x'_{\text{object}} \\ y'_{\text{object}} \\ z'_{\text{object}} \end{pmatrix}
= H
\begin{pmatrix} x_{\text{image}} \\ y_{\text{image}} \\ 1 \end{pmatrix},
\qquad
\begin{pmatrix} x_{\text{object}} \\ y_{\text{object}} \\ z_{\text{object}} \end{pmatrix}
=
\begin{pmatrix} x'_{\text{object}}/z'_{\text{object}} \\ y'_{\text{object}}/z'_{\text{object}} \\ z'_{\text{object}}/z'_{\text{object}} \end{pmatrix}
\]

The corners in the target image are used as positions to be transformed. Once they have been translated into the image from the camera, these positions are used to determine where the match was located.
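A Python/OpenCV sketch of steps 1-4 is shown below. The template file name, the FLANN LSH index parameters, the Lowe-style ratio test used to discard obvious outliers, the RANSAC reprojection threshold and the minimum number of good matches are all assumptions for illustration, not taken from the thesis implementation.

```python
import cv2
import numpy as np

# Precompute keypoints and descriptors for the target logotype once.
# "posten_logo.png" is a hypothetical file name for the template image.
orb = cv2.ORB_create(nfeatures=400)
target = cv2.imread("posten_logo.png", cv2.IMREAD_GRAYSCALE)
target_kp, target_desc = orb.detectAndCompute(target, None)

# FLANN configured with an LSH index, the index type suited to binary
# descriptors such as ORB.  The index parameters are illustrative.
FLANN_INDEX_LSH = 6
flann = cv2.FlannBasedMatcher(
    dict(algorithm=FLANN_INDEX_LSH, table_number=6, key_size=12, multi_probe_level=1),
    dict(checks=50))


def locate_logo(gray_frame, min_good=10):
    """Detect and describe keypoints in the frame, match them against the
    precomputed target descriptors, estimate a homography and project the
    target corners into the frame.  Returns the projected corners or None."""
    kp, desc = orb.detectAndCompute(gray_frame, None)
    if desc is None:
        return None

    matches = flann.knnMatch(target_desc, desc, k=2)
    # Ratio test to discard obvious outliers.
    good = [m[0] for m in matches
            if len(m) == 2 and m[0].distance < 0.75 * m[1].distance]
    if len(good) < min_good:
        return None

    src = np.float32([target_kp[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    if H is None:
        return None

    h, w = target.shape
    corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]]).reshape(-1, 1, 2)
    return cv2.perspectiveTransform(corners, H)   # corners of the match in the frame
```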

As ORB uses oriented FAST keypoints and oriented BRIEF descriptors, tests were conducted using plain FAST keypoints and plain BRIEF descriptors, as this would provide a speedup compared to using ORB.
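A sketch of this variant, assuming Python and the opencv-contrib package where the plain BRIEF extractor lives; the FAST threshold is an illustrative value.

```python
import cv2

# Plain FAST keypoints and plain BRIEF descriptors, without the orientation
# handling that ORB adds on top of them.
fast = cv2.FastFeatureDetector_create(threshold=20)
brief = cv2.xfeatures2d.BriefDescriptorExtractor_create()


def describe(gray):
    """Detect FAST keypoints and compute BRIEF descriptors for them."""
    keypoints = fast.detect(gray, None)
    keypoints, descriptors = brief.compute(gray, keypoints)
    return keypoints, descriptors
```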

References
