Markerless Augmented Reality for Visualization of 3D Objects in the Real World

LiU-ITN-TEK-A-14/019-SE

Markerless Augmented Reality for Visualization of 3D Objects in the Real World

Master's thesis in Media Technology, carried out at the Institute of Technology, Linköping University

Semone Kallin Clarke

Supervisor: Ali Samini
Examiner: Karljohan Lundin Palmerius

Department of Science and Technology, Linköping University, SE-601 74 Norrköping, Sweden

Norrköping, 2014-06-11

Copyright

The publishers will keep this document online on the Internet, or its possible replacement, for a considerable time from the date of publication barring exceptional circumstances. The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for your own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its home page: http://www.ep.liu.se/

© Semone Kallin Clarke

Abstract

This report documents the research and experiments carried out to evaluate the possibilities of using OpenCV for developing markerless augmented reality applications based on the structure from motion algorithm. It gives a background on what augmented reality is and how it can be used, and presents theory about camera calibration and the structure from motion algorithm. Based on this theory, the algorithm was implemented using OpenCV and evaluated regarding its performance and its possibilities for creating markerless augmented reality applications.

Contents

1 Introduction
  1.1 Background
  1.2 Motivation
  1.3 Aim and Purpose
  1.4 Problem Formulation

2 Augmented Reality
  2.1 Definition of Augmented Reality
  2.2 Fields of Use
  2.3 Tracking
    2.3.1 Sensor-based Tracking
    2.3.2 Vision-based Tracking
    2.3.3 Hybrid Tracking Techniques
  2.4 Marker-based Techniques
  2.5 Markerless Techniques
    2.5.1 SLAM and Structure from Motion

3 Related Work
  3.1 ArUco
  3.2 Parallel Tracking and Mapping - PTAM
  3.3 CityViewAR
  3.4 Markerless Tracking Using Planar Homographies

4 Camera Calibration
  4.1 Pinhole Camera Model
  4.2 Retrieve Intrinsic Camera Parameters
  4.3 Lens Distortion

5 Structure From Motion Algorithms
  5.1 Feature Detection and Descriptors
    5.1.1 Speeded-Up Robust Features - SURF
    5.1.2 Feature Matching and Outlier Removal
  5.2 RANdom SAmple Consensus - RANSAC
  5.3 Epipolar Geometry
  5.4 Recover Extrinsic Parameters
    5.4.1 The Five-point Algorithm
  5.5 Triangulation
  5.6 Multiple-view SfM
  5.7 Bundle Adjustment

6 Experimental Studies
  6.1 Hardware
  6.2 Libraries and APIs
    6.2.1 OpenCV
    6.2.2 OpenGL
  6.3 Intrinsic Camera Parameters
  6.4 Image Rectification
  6.5 Detecting Features and Extracting Descriptors
  6.6 Feature Matching and Outlier Removal
  6.7 Recover Extrinsic Parameters
  6.8 Triangulation
  6.9 Handling Multiple Views
  6.10 Point Cloud and Camera Pose Visualization
  6.11 Object Rendering

7 Results and Discussion
  7.1 Camera Calibration
  7.2 Feature Detection
  7.3 Feature Description and Matching
  7.4 Outlier Removal
  7.5 Two View Pose Estimation and Triangulation
  7.6 Adding Multiple Views
  7.7 Integrating Virtual Object
  7.8 Algorithm Performance

8 Conclusion

References

List of Abbreviations

AR      Augmented Reality
DoF     Degrees of Freedom
GPS     Global Positioning System
PTAM    Parallel Tracking and Mapping
RANSAC  RANdom SAmple Consensus
SfM     Structure from Motion
SIFT    Scale-Invariant Feature Transform
SLAM    Simultaneous Localization and Mapping
SURF    Speeded-Up Robust Features
VR      Virtual Reality

Chapter 1
Introduction

This report is the result of a master's thesis project conducted at the Department of Science and Technology at Linköping University together with Sweco Position.

1.1 Background

Sweco Position is a company that focuses on IT for urban development and is a subsidiary of Sweco, whose core business is qualified services in the fields of consulting engineering, environmental technology and architecture. Sweco Position describes their work as follows: In IT for urban development, Sweco links together different types of information with a geographic position. Investments in IT can contribute to significant financial and environmental gains. The areas of use are many, and can include everything from boosting efficiency in the transport sector and studying the need for public transport in a specific area, to planning of forest cutting and mapping of water and sewage lines. Sweco's consultants within IT for urban development are active mainly in the areas of energy, infrastructure, public environments, forestry and transport [8].

1.2 Motivation

Today most vision-based augmented reality applications use fiducial markers. This means that the environment needs to be prepared before tracking by placing the markers to be tracked. Marker-based tracking has an occlusion problem: if a marker is occluded by an object in the scene or is outside the

field of view, the tracking stops. The advantages of marker-based tracking are the robustness and the speed of detecting and identifying the markers [28]. The extended use of mobile devices such as smartphones and tablets with sensors such as cameras, GPS receivers, accelerometers and compasses means that a lot of ordinary customers have access to an AR device, but also that the market for AR in unprepared environments is getting larger. Calculating the position and orientation of the camera in an unprepared environment (markerless tracking) is complex and demands more from the camera: only what the camera can naturally observe in the environment can be used for tracking. It is hard to know what a bridge, a sofa or a building will look like in its intended location before it is actually there. Augmented reality can be used to solve this problem. To take advantage of the growing market of mobile devices, the problem should be solved using markerless augmented reality.

1.3 Aim and Purpose

This master's thesis will examine the possibilities of using the off-the-shelf library OpenCV to develop markerless augmented reality applications. The structure from motion algorithm will be implemented using algorithms provided by OpenCV. The aim of this work is to find out how mature OpenCV is when used in this context.

1.4 Problem Formulation

Working with OpenCV, there are lots of algorithms to choose from when implementing structure from motion. The main questions that should be answered are:

• How mature is OpenCV and its standard algorithms for implementing markerless augmented reality using structure from motion? Is it straightforward to determine which is the most suitable algorithm for each step, and how easy is it to deploy and integrate them?

• How feasible is it to deploy OpenCV and its standard algorithms for implementing markerless augmented reality on a mobile platform? Does OpenCV's support for mobile platforms include the necessary algorithms, and will the processing power suffice for adequate results?

Chapter 2
Augmented Reality

This chapter will give a background of what augmented reality is and how it can be used.

2.1 Definition of Augmented Reality

In augmented reality (AR) the real world is superimposed with virtual objects in real time. By doing this, augmented reality enhances the user's perception of and interaction with the real world; it supplements reality by letting virtual and real objects coexist in the same place. The difference between AR and virtual reality (VR) is that in VR the user cannot see the real world and is completely immersed inside a synthetic environment. Azuma [10] defines AR as having the following characteristics:

• It combines real and virtual objects
• You can interact with the application in real time
• It is registered in 3D

This is the definition commonly used in AR research. The second characteristic rules out movies with special effects using computer generated imagery, since you cannot interact with them in real time. One hard thing about AR is registration. Registration refers to the alignment between real world objects and virtual objects, giving the illusion that the objects appear in the same world. To do this the system needs to have continual knowledge of the user's viewpoint. This makes tracking fundamental to every AR application; the registration is only as accurate as the tracking.

2.2 Fields of Use

Augmented reality was first used for medical and military applications. In medical applications AR can be used to aid during surgery by superimposing information on the patient. This can help the doctor to know where to make an incision or where to drill, or even give him/her x-ray vision, showing what is inside the patient. It can also be used to give aspiring doctors visual instructions. In the military, AR can be used to superimpose vector graphics upon the pilot's view of the real world; it can also be used to aid with navigation and aiming [10]. Augmented reality has broadened its use over the years. In table 2.1 some AR fields and their use are mentioned.

Figure 2.1: Showing the CityViewAR application in use. A virtual building is superimposed on the camera view.

Table 2.1: Fields and Usage of AR

Sports and Entertainment: AR can be seen when watching sports on TV. A common example is in American football, where computer generated annotations are used to show the distance the offense has to proceed to earn a first down for each play. The system used to do this is called 1st & Ten [1]. Computer generated annotations in sports can also be of commercial character, overlaying the view with advertisements.

Navigation: AR can be used to make navigation applications more effective. Wikitude Navigation [9] is an AR navigation system which presents the user with turn-by-turn GPS based guidance. The directions are overlaid onto a real world image, so the user does not have to take their eyes off their destination. This kind of application has proven safety benefits and decreases navigation errors [25]. A popular area for AR is sightseeing, where tourists can walk through a city or a historical site filming with their smartphones and see facts and figures overlaid on the screen.

Commercial and Retail: Using AR for commercial purposes is a field which has been growing in recent years. There are several SDKs for overlaying virtual content onto live stream video. Layar is focused on interactive print, which makes it well suited for commercial purposes: you can scan the content of an ad in a magazine and get more information about the product or the campaign. Total Immersion has developed TryLive, which visualizes products in 3D for retailers and the e-commerce sector, allowing people to 'try' a product at home in front of the computer.

Construction: Augmented reality can be used to visualize new buildings. One example of this is the CityViewAR application, which was developed after the earthquakes in Christchurch, New Zealand, in 2011. This application visualized what the buildings used to look like before the earthquakes [20]. An example can be seen in figure 2.1.

2.3 Tracking

Tracking provides information about the user's viewpoint or the camera position and orientation in 6 DoF. There are different approaches to tracking, which can be divided into separate fields. The most common fields are sensor-based, vision-based and hybrid techniques.

2.3.1 Sensor-based Tracking

Sensor-based tracking includes sensors such as magnetic (compass), acoustic, inertial (accelerometer, mechanical gyroscope), optical and time frequency measurements (GPS, ultrasonic). They all have their advantages and disadvantages depending on, for example, which environment you want to track. The sensor-based trackers are good at motion prediction when fast changes occur, though they are not as accurate during slow movement.

2.3.2 Vision-based Tracking

The vision-based approach uses image processing methods, often related to computer vision, to calculate the camera pose. Vision-based tracking is often divided into two fields, model based and feature based. Model based visual tracking uses a model of the object or the scene as a reference for the tracking system. This method may be good if the system depends on a certain place, since it requires knowledge of the scene beforehand. Feature based visual tracking can either use markers placed in the scene or natural features in the image to calculate the camera pose. Features are extracted from the frames and their correspondences are used to estimate the position. In contrast to the sensor-based trackers, the vision-based trackers are sensitive to rapid motion but work better during slow movement.

2.3.3 Hybrid Tracking Techniques

To compensate for the shortcomings of sensor-based and vision-based techniques respectively, a hybrid approach can be used. Hybrid techniques combine different methods to make the tracking more robust. This can mean combining several sensor-based techniques, or combining a vision-based approach with one or several sensor-based ones.

2.4 Marker-based Techniques

When using vision-based tracking techniques, fiducial markers can be placed in the scene to make the tracking easier. The markers are placed in the scene before the tracking starts and the location and pattern/size of the marker is known [10]. Advantages of this approach are the robustness, which means that the registration becomes good, and the speed of detecting and identifying the markers. Disadvantages are that the markers have to be placed in the scene beforehand and that as soon as a marker is occluded or is outside the field of view the tracking stops [28].

2.5 Markerless Techniques

The term markerless tracking is defined differently by different authors. There are those referring to markerless tracking as tracking without fiducial markers (black and white markers), where the marker can instead be a photograph, a magazine, a hand or a human face. There is also the definition referring to markerless tracking as techniques using GPS or geolocation to locate the user's viewpoint. In this report markerless AR refers to tracking natural features in the user's viewpoint, which relates to the first definition. Natural feature tracking is expensive to compute and not as robust as marker-based tracking.

2.5.1 SLAM and Structure from Motion

Simultaneous localization and mapping (SLAM) is a research field in the robotics community. SLAM aims at simultaneously getting the location of a moving object and the map of the scene that it is moving in, from the data stream provided by one or several sensors. The sensors can be laser, radar, sonar or, in monocular SLAM, a camera. The relationship between SLAM and SfM is quite obscure. SfM differs from monocular SLAM in that SfM has dealt with the problem in its most general form, that is for any kind of visual input, while monocular SLAM has focused on sequential approaches for the processing of video input [12]. One way to define the relationship is that SLAM is an application field of SfM which can use other 3D mapping techniques. Monocular SLAM would therefore be an application of SfM where you add other tracking techniques

and filters. Monocular SLAM and SfM can be used in markerless augmented reality to get the camera pose and the coordinates of the real world objects.

Chapter 3
Related Work

This chapter presents a few applications and systems which are of interest considering markerless augmented reality and OpenCV for augmented reality.

3.1 ArUco

ArUco is a minimal AR library which uses OpenCV algorithms to perform marker-based AR. ArUco is interesting since it uses OpenCV for marker detection and is easy to integrate with OpenGL [2]. ArUco shows that marker-based AR can be performed with OpenCV.

3.2 Parallel Tracking and Mapping - PTAM

PTAM is a camera tracking system for augmented reality which tracks the 3D position of the camera in real time. PTAM does not base its tracking on markers, a pre-made map, known templates or inertial sensors, but instead works out the structure of the scene as it goes [7]. PTAM was developed at the University of Oxford and is a promising system for markerless AR, though it is aimed at AR, vision and SLAM researchers. PTAM splits tracking and mapping into two separate tasks, processed in parallel threads: one which handles the tracking of the camera and one which produces a 3D map of point features from previously observed frames [17]. It has been adapted to work on mobile phones (iPhone 3GS); the result presented in [18] shows reduced accuracy and robustness compared to a PC.

3.3 CityViewAR

CityViewAR is a mobile outdoor augmented reality application which allows users to see 3D virtual models in an outdoor environment. This application does not rely on computer vision techniques at all, but instead uses GPS sensor information for tracking the geographical position of the mobile device, and an electronic compass and accelerometer for measuring the view direction of the camera (sensor-based tracking). Since geographical coordinates only provide a two dimensional position for tracking, the system also considers the height of the user's viewpoint based on heuristics assuming that the users are usually standing outdoors. This is an interesting approach to markerless augmented reality, though it is said in [20] that the AR tracking in the future should be improved with computer vision techniques, which would make it a hybrid system.

3.4 Markerless Tracking Using Planar Homographies

One common special case which significantly simplifies tracking is to locate and track planar surfaces in the scene. This technique is described and used in several articles [23][22][13][27]. Using this approach makes the alignment of the real and virtual coordinate systems simpler than in the structure from motion case. It also makes the tracking more robust. It can be assumed that feature positions in adjacent frames of an image sequence are related by a planar homography or projective transformation.

Chapter 4
Camera Calibration

Camera calibration is done in order to get metric information from 2D images. The overall performance of the future application will depend on the accuracy of the camera calibration. One aspect of camera calibration is estimating the intrinsic parameters of the camera, that is the focal length, the principal point and the aspect ratio. There are different camera calibration techniques, and this thesis will use a 2D plane based calibration technique. This technique requires a camera to observe a planar pattern in at least two different orientations. This approach is flexible, easy to use and gives good accuracy [29].

4.1 Pinhole Camera Model

The pinhole camera model is a simple camera model and a good approximation to the behavior of most real cameras, though it is often improved by taking non-linear effects into account such as tangential and radial distortion [26]. According to the pinhole camera model, a point in space with coordinates $X = (X, Y, Z)^T$ is mapped to the point in the image plane where a line joining the point $X$ to the camera center meets the image plane, see figure 4.1a [14]. By using similar triangles (see figure 4.1b) the following relationship is obtained:

\[ x = f \frac{X_{cam}}{Z_{cam}}, \qquad y = f \frac{Y_{cam}}{Z_{cam}} \tag{4.1} \]

Figure 4.1: The pinhole camera geometry illustrated; C is the camera center and p is the principal point. (a) The pinhole camera model in 3D. (b) The pinhole camera model in 2D. In the real case the image plane would be behind the camera center.

Here $f$ is the focal length. Written in homogeneous coordinates this becomes:

\[ \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} \sim \begin{pmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} X_{cam} \\ Y_{cam} \\ Z_{cam} \\ 1 \end{pmatrix} \tag{4.2} \]

where $\sim$ means equal up to scale. The origin of the coordinates in the image plane is here assumed to be at the principal point. This might not be the case, so in general there is a mapping:

\[ x = f \frac{X_{cam}}{Z_{cam}} + c_x, \qquad y = f \frac{Y_{cam}}{Z_{cam}} + c_y \tag{4.3} \]

where $(c_x, c_y)^T$ are the coordinates of the principal point. In homogeneous coordinates this expression becomes:

\[ \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} \sim \begin{pmatrix} f & 0 & c_x & 0 \\ 0 & f & c_y & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} X_{cam} \\ Y_{cam} \\ Z_{cam} \\ 1 \end{pmatrix} \tag{4.4} \]

Now the intrinsic camera calibration matrix, $K$, can be written:

\[ K = \begin{pmatrix} f & 0 & c_x \\ 0 & f & c_y \\ 0 & 0 & 1 \end{pmatrix} \tag{4.5} \]

This model assumes that the image coordinates are Euclidean coordinates having equal scales in both axial directions, i.e. square pixels. If this is not the case, the transformation from world coordinates to pixel coordinates is obtained by multiplying (4.5) with the pixel scale factors $(m_x, m_y, 1)$. This gives the intrinsic camera calibration matrix:

\[ K = \begin{pmatrix} \alpha_x & 0 & c_x \\ 0 & \alpha_y & c_y \\ 0 & 0 & 1 \end{pmatrix} \tag{4.6} \]

where $\alpha_x = f m_x$ and $\alpha_y = f m_y$ are the focal lengths of the camera in terms of pixel dimensions in the x and y directions respectively [14]. The intrinsic camera calibration matrix is sometimes considered as:

\[ K = \begin{pmatrix} \alpha_x & \gamma & c_x \\ 0 & \alpha_y & c_y \\ 0 & 0 & 1 \end{pmatrix} \tag{4.7} \]

for added generality, where $\gamma$ is the skew parameter. The skew parameter refers to the skewing of the pixel elements, meaning that the x and y axes are not perpendicular. For most cameras the skew parameter will be zero. The relationship between a 3D point and its corresponding 2D point in the image plane also depends on the rigid body transformation from the points in the world coordinate system to the points in the camera coordinate system. This relationship can be expressed:

\[ \begin{pmatrix} X_{cam} \\ Y_{cam} \\ Z_{cam} \\ 1 \end{pmatrix} \sim \begin{pmatrix} R & t \\ 0 & 1 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} \tag{4.8} \]

where $R$ is the 3x3 rotation matrix which represents the orientation and $t$ is the 3x1 translation vector which represents the translation. These are the extrinsic camera parameters. The final relationship between a 3D point, $\tilde{X}$, and its pixel position, $\tilde{u}$, is given by:

\[ s\tilde{u} = P\tilde{X} \tag{4.9} \]

where $P \sim K \begin{pmatrix} R & t \end{pmatrix}$ is the projection matrix and $s$ is an arbitrary scaling factor.

4.2 Retrieve Intrinsic Camera Parameters

By assuming that the model plane is located at $Z = 0$ in the world coordinate system, the equation

\[ s\tilde{u} = K \begin{pmatrix} r_1 & r_2 & t \end{pmatrix} \begin{pmatrix} X \\ Y \\ 1 \end{pmatrix} \tag{4.10} \]

is obtained, where $r_1$ and $r_2$ are columns of the rotation matrix $R$ and $t$ is the translation vector. The mapping between a point in the model plane and its image will then be a planar homography:

\[ s\tilde{u} = H \begin{pmatrix} X \\ Y \\ 1 \end{pmatrix} \tag{4.11} \]

In computer vision, a homography expresses the relation between any two images of the same planar surface in space. Given an image of the model plane, a homography can be estimated. From (4.10) and (4.11) the following relationship can be derived:

\[ H = \begin{pmatrix} h_1 & h_2 & h_3 \end{pmatrix} = \lambda K \begin{pmatrix} r_1 & r_2 & t \end{pmatrix} \tag{4.12} \]

where $\lambda$ is an arbitrary scalar. There are two basic constraints on the intrinsic parameters given one homography. These are based on the fact that $r_1$ and $r_2$ are orthonormal:

\[ h_1^T K^{-T} K^{-1} h_2 = 0 \tag{4.13} \]

\[ h_1^T K^{-T} K^{-1} h_1 = h_2^T K^{-T} K^{-1} h_2 \tag{4.14} \]

By letting $B = K^{-T} K^{-1}$ and taking the constraints into consideration, the intrinsic matrix $K$ can be solved for.

4.3 Lens Distortion

The pinhole camera model does not take non-linear effects such as lens distortion into account. This means that the model needs to be improved in order to get a good calibration. This can be done by taking radial and tangential distortion into consideration [15]. The center of the radial distortion is the

same as the principal point. Radial distortion causes the image point to be radially displaced in the image plane. The new coordinates including distortion can then be expressed:

\[ x_d = \left(1 + k_1 r^2 + k_2 r^4 + k_3 r^6\right) x + 2 p_1 x y + p_2 \left(r^2 + 2 x^2\right) \tag{4.15} \]

\[ y_d = \left(1 + k_1 r^2 + k_2 r^4 + k_3 r^6\right) y + p_1 \left(r^2 + 2 y^2\right) + 2 p_2 x y \tag{4.16} \]

where $r^2 = x^2 + y^2$, $k_1 \ldots k_n$ are the radial distortion coefficients and $p_1$ and $p_2$ are the tangential distortion coefficients. These coefficients together with the intrinsic parameters are the results of the intrinsic camera calibration [3]. The new pixel position will then be:

\[ u_p = \alpha_x x_d + c_x, \qquad v_p = \alpha_y y_d + c_y \tag{4.17} \]

Chapter 5
Structure From Motion Algorithms

Structure from motion (SfM) is a computer vision field focusing on calculating the camera pose and the 3D structure of the scene simultaneously from corresponding image points. This can be used to find the geometry of a 3D scene or object from 2D images or, as in this case, to recover the projection matrices of the camera. One important thing to consider is that the scene is only reconstructed up to scale. If the entire scene is scaled by some factor k and the camera matrices are scaled by the factor 1/k, the projections of the scene points will remain the same; hence it is impossible to recover the exact scale of the scene [4]. The scene needs to be static for the algorithm to work properly and the process presumes a calibrated camera. The structure from motion pipeline can be divided into the following steps:

1. Feature Detection
2. Feature Correspondence
3. Pose Estimation
4. Triangulation
5. Bundle Adjustment

All these steps make the structure from motion algorithm computationally heavy. The steps will be explained in the following sections.

5.1 Feature Detection and Descriptors

Feature detection is done in order to find areas of interest in the image. Features can be edges, corners, lines or blobs and describe a part of the image which contains a lot of information. There are a large number of feature detectors developed, and they vary in which features they detect (edges, corners, lines or blobs) but also in computational complexity and repeatability. The feature points are described by an N-dimensional vector which can be compared in Euclidean space. The vector, also called a descriptor, contains information about the image patch around the detected feature. When comparing images based on their descriptors, scaling and rotation can be taken into account. This is not possible when comparing images pixel by pixel, which is also very time consuming.

5.1.1 Speeded-Up Robust Features - SURF

SURF is a local feature detector and descriptor which detects interest points (blobs) and describes these points in 64-dimensional vectors. The SURF detector was developed as an alternative to SIFT (Scale-Invariant Feature Transform) with the goal of being faster to compute, while not sacrificing accuracy [11].

5.1.2 Feature Matching and Outlier Removal

Feature matching is about finding corresponding points in two or more images, points that are projections of the same point in 3D space. The descriptors are matched against each other to find the closest one. When matching points there will always be outliers, points that are mismatched, and these need to be handled. One way to remove these outliers is to use a robust estimator.

5.2 RANdom SAmple Consensus - RANSAC

RANSAC is a robust estimator which is popular in computer vision, robust in the sense that it is tolerant to outliers. The estimator tries to fit a model to some given data. A random sample is selected; this sample consists of the minimal subset of the data sufficient to determine the model. Random

selection is repeated a number of times and the selection with most support is said to be the most robust fit. The selection gets support (inliers) if the data is within a given threshold related to the model. In theory this means that a random selection which contains outliers will not gain much support. The algorithm is described in [14] and the steps can be followed in algorithm 1.

Algorithm 1: Robust fit of a model to a data set S which contains outliers

(I) Randomly select a sample of s data points from S and instantiate the model from this subset.

(II) Determine the set of data points Si which are within a distance threshold t of the model. The set Si is the consensus set of the sample and defines the inliers of S.

(III) If the size of Si (the number of inliers) is greater than some threshold T, re-estimate the model using all the points in Si and terminate.

(IV) If the size of Si is less than T, select a new subset and repeat the above.

(V) After N trials the largest consensus set Si is selected, and the model is re-estimated using all the points in the subset Si.

The three thresholds in the algorithm should be set accordingly:

• The distance threshold, t, should be set so that the probability that a point is an inlier is as high as possible.

• The consensus set threshold, T, should be set similar to the number of believed inliers in the set.

• The sample threshold, N, should be set so that the probability that some subset is free from outliers is high.

5.3 Epipolar Geometry

The epipolar geometry describes the intrinsic projective geometry between two views; it depends on the cameras' internal geometry and relative pose and is independent of the structure of the scene. The fundamental matrix

$F$ encapsulates this intrinsic geometry, and two corresponding image points in two views satisfy the relationship $x'^T F x = 0$. The epipolar geometry assumes the cameras satisfy the pinhole camera model. The geometry can be motivated by considering the search for corresponding image points when matching two views. The corresponding image points, $x$ and $x'$, back-projected from the same point $X$ in 3D space, are coplanar with the camera centers, as can be seen in figure 5.1 (the plane is denoted $\pi$). The rays back-projected from $x$ and $x'$ intersect at $X$, which is of most significance when searching for a correspondence.

Figure 5.1: Illustrating that the camera centers, the 3D point and the corresponding image points all lie in the same plane $\pi$.

The baseline is the line joining the camera centers, as seen in figure 5.2. If $x$ is known, how is then the corresponding point $x'$ constrained? The baseline and the ray defined by $x$ determine the plane $\pi$; the corresponding ray from the unknown point $x'$ will also lie in this plane, and the point $x'$ will lie on the line of intersection of $\pi$ and the image plane. This narrows down the search area for finding the corresponding point to only this line. This provides an epipolar constraint, which can also be described by the fundamental or essential matrix (the essential matrix describes the intrinsic projective geometry between two views where the camera calibration is known) [14]. The terminology of epipolar geometry can be explained as follows:

Figure 5.2: Showing the baseline intersecting the image planes at their epipoles and joining the camera centers.

• The epipole is the point in the image plane which is intersected by the baseline (the line joining the camera centers).

• An epipolar line is a line where an epipolar plane intersects the image plane. All epipolar lines intersect at the epipole.

• An epipolar plane is a plane containing the baseline. All epipolar planes intersect the image planes in epipolar lines.

The epipolar constraint is considered when removing outliers in the feature matching phase. Using RANSAC, the correspondences which do not follow the epipolar constraint (lie close enough to the epipolar line) will be considered as outliers and removed. It is also considered when determining the essential or fundamental matrices from corresponding image points and when triangulating points.

5.4 Recover Extrinsic Parameters

The projection matrix of a camera contains both the intrinsic and the extrinsic parameters of the camera. The intrinsic parameters can be decided by

camera calibration as described in section 4.2; this is done once in the initial setup of the system. The extrinsic parameters, describing the pose of the camera (rotation and translation), need to be decided for every image captured by the camera.

5.4.1 The Five-point Algorithm

The five-point algorithm is used to calculate the relative pose between the cameras. This algorithm needs five point correspondences to calculate the essential matrix. There are other algorithms working with six, seven or eight points. These work with the fundamental matrix, meaning they do not take the calibrated camera into consideration. Fewer points make the algorithm easier to operate with RANSAC [16]. A solver for the five-point algorithm is presented in the paper [21]. The essential matrix can then be decomposed to find the extrinsic parameters of the camera.

5.5 Triangulation

Triangulation refers to the process of finding a point in 3D space given its projection onto two or more images. In order to do this the projection matrices are required, i.e. the intrinsic and extrinsic parameters of the cameras. Corresponding points satisfy the epipolar constraint described in section 5.3. The rays of corresponding points should intersect in the epipolar plane at the point where their corresponding 3D point is located. This problem is trivial in the absence of noise, but due to noise during measurement the rays may not intersect, and then it is necessary to find the best point of intersection, a point which minimizes the error [14]. Triangulation gives the structure of the scene or object as a sparse point cloud representation.

5.6 Multiple-view SfM

When adding more views we need to consider that the reconstruction is only up to scale, which means each image pair gives a different scale. There are several sequential methods to determine the structure and motion for the successive views. The sequential methods incorporate the successive views one at a time. The one used in this thesis is called resection. This method

determines the pose of each additional view using previously reconstructed 3D points. Other sequential methods use the epipolar constraint to relate each view to its predecessor, or merge partial reconstructions using corresponding 3D points. There are some limitations with the sequential registration techniques. Substantial overlap is required, since corresponding points must usually be visible in three or more views. There is also a problem with degenerate structure from motion configurations for which the algorithm might fail. This can be when there is only camera rotation but no translation, in planar scenes, or if a 3D point lies on the baseline of the views it is visible in.

5.7 Bundle Adjustment

Bundle adjustment is the last step of most multiple view 3D reconstruction algorithms. Bundle adjustment is done to minimize an appropriate cost function and refine the estimation to give the optimal estimate from noisy observations. Bundle adjustment refines the 3D coordinates (the structure) and the projection matrices simultaneously, minimizing the re-projection error between the image locations of observed and predicted image points.
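To make the cost function concrete, a standard formulation of the bundle adjustment objective (the notation below is mine and is not taken from the thesis) is

\[ \min_{\{P_i\},\{X_j\}} \sum_{i} \sum_{j} d\!\left(\hat{x}_{ij},\, P_i X_j\right)^2 \]

where $P_i$ is the projection matrix of view $i$, $X_j$ is a reconstructed 3D point, $\hat{x}_{ij}$ is the observed image position of point $j$ in view $i$, $d(\cdot,\cdot)$ is the Euclidean distance in the image plane, and the sum runs over the views in which each point is visible.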

Chapter 6
Experimental Studies

This chapter will give information about how the structure from motion algorithm was implemented.

6.1 Hardware

The camera used is a Logitech e930 web camera. The resolution has been set to 1280x720 and autofocus has been turned off; this is because the focal length is calculated during camera calibration and the autofocus changes the focal length. The hardware used is a PC running Windows 7 (64-bit) with an Intel Core i7-2760QM 2.39 GHz processor and 16.0 GB RAM.

6.2 Libraries and APIs

6.2.1 OpenCV

OpenCV is an open source computer vision library which was designed with computational efficiency and real-time applications in mind. OpenCV supports several platforms like Windows, Linux, Mac OS X, iOS and Android, and has C++, C, Python and Java interfaces [5]. For this project the C++ interface for OpenCV 2.4.8 was used with the Visual Studio 2013 IDE.

6.2.2 OpenGL

OpenGL is a cross-platform API for programming graphics applications both in 2D and 3D [6]. OpenGL is used in this project to visualize the point cloud

and the camera pose, and for rendering the augmented object onto the real world frame.

6.3 Intrinsic Camera Parameters

OpenCV has built-in functions for calibrating the camera. These functions are based on the method described in chapter 4. OpenCV also provides the distortion coefficients for the camera as described in section 4.3. A planar chessboard pattern with a pattern size of 9 × 6 squares, printed on an A3 paper, is captured at 40 different positions and the corner coordinates are calculated. The corners are placed in a 3D coordinate system where z = 0 (all the points lie in the same plane). This information is then used with the calibrateCamera function in OpenCV, which returns both the intrinsic camera matrix and the distortion coefficients. The calibration pattern can be seen in figure 6.1. The chessboard pattern stays fixed while the camera captures it at different positions. Camera calibration is only performed once as an initial step.

Figure 6.1: Camera calibration chessboard pattern.
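As an illustration of this step, a minimal sketch using the OpenCV 2.4 C++ interface could look as follows. The function and variable names are my own, the corner count and square size are illustrative assumptions, and this is not the thesis implementation, only an example of the calls involved.

    #include <opencv2/core/core.hpp>
    #include <opencv2/calib3d/calib3d.hpp>
    #include <vector>

    // Calibrate from a set of chessboard images; returns the 3x3 intrinsic
    // matrix K and fills distCoeffs with the distortion coefficients.
    cv::Mat calibrateFromChessboards(const std::vector<cv::Mat>& boardImages,
                                     cv::Size imageSize, cv::Mat& distCoeffs)
    {
        cv::Size patternSize(9, 6);  // interior corners assumed for the pattern
        std::vector<std::vector<cv::Point3f> > objectPoints;
        std::vector<std::vector<cv::Point2f> > imagePoints;

        // The 3D corner positions all lie in the plane z = 0 (unit square size).
        std::vector<cv::Point3f> corners3d;
        for (int y = 0; y < patternSize.height; ++y)
            for (int x = 0; x < patternSize.width; ++x)
                corners3d.push_back(cv::Point3f((float)x, (float)y, 0.0f));

        for (size_t i = 0; i < boardImages.size(); ++i) {
            std::vector<cv::Point2f> corners2d;
            if (cv::findChessboardCorners(boardImages[i], patternSize, corners2d)) {
                imagePoints.push_back(corners2d);
                objectPoints.push_back(corners3d);
            }
        }

        cv::Mat K;
        std::vector<cv::Mat> rvecs, tvecs;
        cv::calibrateCamera(objectPoints, imagePoints, imageSize, K, distCoeffs,
                            rvecs, tvecs,
                            CV_CALIB_RATIONAL_MODEL);  // eight distortion coefficients
        return K;
    }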

6.4 Image Rectification

The intrinsic parameters and the distortion coefficients are used to rectify, i.e. undistort, the views. This is done with the OpenCV functions initUndistortRectifyMap, which computes the undistortion and rectification transformation map, and remap, which applies a generic geometrical transformation to the view. The views are rectified since the camera model is assumed to be a pinhole camera.

6.5 Detecting Features and Extracting Descriptors

OpenCV has both a SurfFeatureDetector and a SurfDescriptorExtractor class, which detect the SURF points and extract their descriptors. The detector class lets you set the Hessian threshold to decide which detected key points should be retained; a lower threshold retains more key points. The images are undistorted before detecting and extracting features, since most of the algorithms used assume a calibrated camera. The SURF detector and extractor were chosen since they are scale invariant and relatively fast to compute.

6.6 Feature Matching and Outlier Removal

The OpenCV brute force matcher class, BFMatcher, is a descriptor matcher which, for each descriptor in the first set, finds the closest descriptor in the second set by trying each one. The distance is in this case decided using the L2 norm. To remove as many outliers as possible, the BFMatcher knnMatch (k-nearest neighbors) method is called with k = 2. This gives the two best matches for each descriptor. This is done for both views, so the first view is matched to the second view and vice versa. If the two best matches returned by knnMatch are too similar, both matches are discarded, since we cannot be sure which one is true. If there is a larger difference between them, the closest one is kept as a match. This is called the ratio test. When there is only one best match left for each descriptor, a symmetry test is performed to only retain the matches which symmetrically conform from the first view to the second view and vice versa.
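As a short illustration of the rectification and feature extraction steps in sections 6.4 and 6.5, a hedged sketch with the OpenCV 2.4 C++ API (SURF lives in the nonfree module; the function name and the Hessian threshold of 300 are my own choices for the example) might look like:

    #include <opencv2/imgproc/imgproc.hpp>
    #include <opencv2/nonfree/features2d.hpp>
    #include <vector>

    // Undistort a frame, then detect and describe SURF features in it.
    void undistortAndDetect(const cv::Mat& frame, const cv::Mat& K,
                            const cv::Mat& distCoeffs, cv::Mat& undistorted,
                            std::vector<cv::KeyPoint>& keypoints, cv::Mat& descriptors)
    {
        // The maps only depend on K and the distortion coefficients, so in a
        // real application they would be computed once and reused per frame.
        cv::Mat map1, map2;
        cv::initUndistortRectifyMap(K, distCoeffs, cv::Mat(), K, frame.size(),
                                    CV_32FC1, map1, map2);
        cv::remap(frame, undistorted, map1, map2, cv::INTER_LINEAR);

        cv::SurfFeatureDetector detector(300);   // Hessian threshold
        cv::SurfDescriptorExtractor extractor;   // 64-dimensional descriptors
        detector.detect(undistorted, keypoints);
        extractor.compute(undistorted, keypoints, descriptors);
    }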

As a last step in the outlier removal, the matches which do not obey the epipolar constraint are removed. This is done with the help of RANSAC (RANdom SAmple Consensus), which makes it possible to compute the fundamental matrix even in the presence of outliers. OpenCV has a findFundamentalMat function which can be used with the RANSAC method and gives you all the inliers. This feature matching method and outlier removal is described in [19].

6.7 Recover Extrinsic Parameters

The five-point algorithm is not implemented in OpenCV 2.4.8, which is used for this project. The algorithm is implemented in OpenCV 3.0.0, which is still a beta version. There is a C++ implementation available, based on the solver presented in [21], which was used to calculate the essential matrix and decompose it to recover the relative pose between the cameras. This solver is only used in the first step, on the first two initial views. How the pose is calculated for the successive views is explained in section 6.9. When calculating the pose, the first camera is assumed to be placed at (0, 0, 0) in the coordinate system and the pose of the second camera is then calculated relative to that.

6.8 Triangulation

With the projection matrices and the point correspondences of the views, the three dimensional points of the correspondences can be reconstructed. OpenCV provides a function called triangulatePoints which is used for triangulation. This implementation uses the least squares method, described in [14], to minimize the error which occurs due to noise in the point measurements when triangulating the 3D point. The function gives the 3D point coordinates in homogeneous coordinates, which are then converted to Euclidean space. This can easily be done with the OpenCV provided function convertPointsFromHomogeneous.
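The whole matching and outlier removal chain of section 6.6 condensed into one sketch; the ratio threshold of 0.8 and the RANSAC parameters are assumptions of mine, since the thesis does not state the values used:

    #include <opencv2/features2d/features2d.hpp>
    #include <opencv2/calib3d/calib3d.hpp>
    #include <vector>

    // Match two descriptor sets and keep only matches that pass the ratio
    // test, the symmetry test and the epipolar (RANSAC) test.
    std::vector<cv::DMatch> matchAndFilter(const cv::Mat& desc1, const cv::Mat& desc2,
                                           const std::vector<cv::KeyPoint>& kp1,
                                           const std::vector<cv::KeyPoint>& kp2)
    {
        cv::BFMatcher matcher(cv::NORM_L2);
        std::vector<std::vector<cv::DMatch> > m12, m21;
        matcher.knnMatch(desc1, desc2, m12, 2);   // two best matches per descriptor
        matcher.knnMatch(desc2, desc1, m21, 2);

        // Ratio test in both directions.
        std::vector<cv::DMatch> good12, good21;
        for (size_t i = 0; i < m12.size(); ++i)
            if (m12[i].size() == 2 && m12[i][0].distance < 0.8f * m12[i][1].distance)
                good12.push_back(m12[i][0]);
        for (size_t i = 0; i < m21.size(); ++i)
            if (m21[i].size() == 2 && m21[i][0].distance < 0.8f * m21[i][1].distance)
                good21.push_back(m21[i][0]);

        // Symmetry test: keep matches that agree in both directions.
        std::vector<cv::DMatch> symmetric;
        for (size_t i = 0; i < good12.size(); ++i)
            for (size_t j = 0; j < good21.size(); ++j)
                if (good12[i].queryIdx == good21[j].trainIdx &&
                    good12[i].trainIdx == good21[j].queryIdx) {
                    symmetric.push_back(good12[i]);
                    break;
                }
        if (symmetric.size() < 8)   // too few matches to estimate F reliably
            return symmetric;

        // Epipolar test: estimate the fundamental matrix with RANSAC, keep inliers.
        std::vector<cv::Point2f> pts1, pts2;
        for (size_t i = 0; i < symmetric.size(); ++i) {
            pts1.push_back(kp1[symmetric[i].queryIdx].pt);
            pts2.push_back(kp2[symmetric[i].trainIdx].pt);
        }
        std::vector<uchar> inlierMask;
        cv::findFundamentalMat(pts1, pts2, CV_FM_RANSAC, 3.0, 0.99, inlierMask);

        std::vector<cv::DMatch> inliers;
        for (size_t i = 0; i < symmetric.size(); ++i)
            if (inlierMask[i])
                inliers.push_back(symmetric[i]);
        return inliers;
    }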

6.9 Handling Multiple Views

When adding successive frames, their features are extracted and matched to the previous frame in the same way as described in sections 6.5 and 6.6. What differs when adding more views is the pose estimation (recovering the extrinsic parameters). As stated in section 5.6, the pose of each additional view is determined using the previously reconstructed 3D points. To keep track of the previous views, some bookkeeping needs to be done. Information about the previous frames is stored in a class called Storage, which also stores the point cloud. The detected SURF points of the new view are matched to the SURF points of the previous view, and it is then checked whether the matched points in the previous view contributed to the point cloud; if so, there is also a corresponding 3D point for the new view. Having the 2D to 3D correspondences of the new view, its pose can be calculated using the OpenCV algorithm solvePnPRansac. This function takes the aligned corresponding points together with the camera matrix and the distortion coefficients. Once the pose is known, new points are triangulated and added to the point cloud.

6.10 Point Cloud and Camera Pose Visualization

To visualize the structure and the motion of the scene, the point cloud and the camera poses are rendered using OpenGL. This gives a sparse point cloud which shows the structure of the scene and the positions of the cameras for the different views. One thing to consider is the difference between the coordinate systems in OpenCV and OpenGL, which can be seen in figure 6.2.
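A hedged sketch of the resection and triangulation calls described in sections 6.8 and 6.9; the 2D to 3D correspondences are assumed to have been collected by the bookkeeping step, the function names are my own, and the RANSAC parameters shown are the OpenCV 2.4 defaults rather than values taken from the thesis:

    #include <opencv2/calib3d/calib3d.hpp>
    #include <vector>

    // Estimate the pose of a new view from 2D-3D correspondences.
    void recoverPose(const std::vector<cv::Point3f>& points3d,  // reconstructed points
                     const std::vector<cv::Point2f>& points2d,  // their positions in the new view
                     const cv::Mat& K, const cv::Mat& distCoeffs,
                     cv::Mat& R, cv::Mat& t)
    {
        cv::Mat rvec, tvec;
        std::vector<int> inliers;
        cv::solvePnPRansac(points3d, points2d, K, distCoeffs, rvec, tvec,
                           false, 100, 8.0, 100, inliers);
        cv::Rodrigues(rvec, R);   // rotation vector -> 3x3 rotation matrix
        t = tvec;
    }

    // Triangulate matched points from two views given their projection
    // matrices P1 = K[R1|t1] and P2 = K[R2|t2].
    void triangulate(const cv::Mat& P1, const cv::Mat& P2,
                     const std::vector<cv::Point2f>& pts1,
                     const std::vector<cv::Point2f>& pts2,
                     cv::Mat& points3d)   // Nx1 matrix of 3-channel Euclidean points
    {
        cv::Mat points4d;
        cv::triangulatePoints(P1, P2, pts1, pts2, points4d);       // 4xN homogeneous
        cv::convertPointsFromHomogeneous(points4d.t(), points3d);  // to Euclidean 3D
    }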

Figure 6.2: Showing the difference between the OpenGL (a) and OpenCV (b) coordinate systems.

6.11 Object Rendering

As a last part of the experimental studies, a 3D object is rendered onto the views. The virtual camera uses the pose calculated by the structure from motion algorithm, which should place it at the same position as the real camera. The structure from motion algorithm has also given the structure of the scene, but it is not known where the virtual object should be placed. As of now, a point is chosen on which the object should stay in all the successive frames.
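As an illustration of how the OpenCV pose could be handed over to OpenGL for this rendering step (this is not the thesis implementation; it assumes R and t are stored as CV_64F and that the fixed-function OpenGL pipeline is used), the y and z axes are flipped to account for the coordinate system difference in figure 6.2:

    #include <opencv2/core/core.hpp>

    // Build a column-major OpenGL modelview matrix from an OpenCV rotation R
    // and translation t, flipping the y and z axes (OpenCV: y down, z forward;
    // OpenGL: y up, z towards the viewer).
    void poseToModelview(const cv::Mat& R, const cv::Mat& t, double modelview[16])
    {
        cv::Mat flip = (cv::Mat_<double>(3, 3) << 1,  0,  0,
                                                  0, -1,  0,
                                                  0,  0, -1);
        cv::Mat Rgl = flip * R;
        cv::Mat tgl = flip * t;

        for (int col = 0; col < 3; ++col)
            for (int row = 0; row < 3; ++row)
                modelview[col * 4 + row] = Rgl.at<double>(row, col);  // column-major
        modelview[12] = tgl.at<double>(0);
        modelview[13] = tgl.at<double>(1);
        modelview[14] = tgl.at<double>(2);
        modelview[3] = modelview[7] = modelview[11] = 0.0;
        modelview[15] = 1.0;
    }
    // The matrix can then be loaded with glLoadMatrixd(modelview) before
    // drawing the virtual object.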

Chapter 7
Results and Discussion

This chapter will present and discuss the results of the experimental studies.

7.1 Camera Calibration

The camera calibration gave the following intrinsic camera parameters:

\[ K = \begin{pmatrix} 769.42\ldots & 0 & 662.27\ldots \\ 0 & 770.41\ldots & 380.75\ldots \\ 0 & 0 & 1 \end{pmatrix} \tag{7.1} \]

Looking at the principal point values, $c_x = 662.27\ldots$ and $c_y = 380.75\ldots$, they were expected to lie closer to the center of the view, which would be $c_x = 640$ and $c_y = 360$; the result is nevertheless accepted. By default, the calibrateCamera function in OpenCV gives five distortion coefficients, three radial and two tangential. Setting the CV_CALIB_RATIONAL_MODEL flag makes the function return three more distortion coefficients, which was done in this case. In total, eight distortion coefficients were given by the OpenCV calibrateCamera function. OpenCV provides rich information about camera calibration and is easy to use for this purpose. The result of undistorting an image with the intrinsic matrix and the distortion coefficients can be seen in figure 7.1.

Figure 7.1: Original (a) and undistorted (b) image.

7.2 Feature Detection

The SURF feature detector provided by OpenCV is relatively fast and easy to use. The feature detector was tried on different images containing both indoor and outdoor settings and performs well in both regarding finding features. In figure 7.2 the Hessian threshold has been set to different levels to show the difference in accepted key points. Having more key points will later make the point cloud less sparse, but it will also slow down calculations. The time performance of the SURF detector is presented in section 7.8.

Figure 7.2: Detected features using different Hessian thresholds; view one is shown to the left and view two to the right. (a) Threshold 10: 4560 detected features. (b) Threshold 10: 4728 detected features. (c) Threshold 300: 1405 detected features. (d) Threshold 300: 1447 detected features. (e) Threshold 1000: 712 detected features. (f) Threshold 1000: 689 detected features.

7.3 Feature Description and Matching

The process of calculating the descriptors of the feature points is very slow, as can be seen in figure 7.11: the more feature points, the slower it is, which is not surprising since it requires more calculations. This is the part which slows down the application the most. Matching the descriptors is not very fast either, especially if the outlier removal process is taken into account, but it may be possible to make this more effective. Using

knnMatch gives the two best matches for each key point. This is done to be able to avoid as many outliers as possible. There may be other ways to avoid outliers which could then speed up the matching process. A first matching without any outlier removal is shown in figure 7.3. The Hessian threshold is now set to 300, which means there are 1405 feature points in the first view and 1447 feature points in the second (figure 7.2).

Figure 7.3: (a) Matches from the first view to the second and (b) from the second view to the first.

7.4 Outlier Removal

The outlier removal process, including the ratio test, the symmetry test and the epipolar constraint, is very effective. In this example there are 376 matches from view one to view two and 368 matches from view two to view one that survive the ratio test. Figure 7.4 shows that there are still outliers which need to be handled. This is done by the symmetry test, which only keeps those matches that are the same from view one to two and view two to one (figure 7.5); 277 matches survive this test.

Figure 7.4: (a) Matches from the first view to the second and (b) from the second view to the first after the ratio test has been performed.

Figure 7.5: Matches after the symmetry test.

As a last step, only the matches which obey the epipolar constraint are accepted. Starting at 1405 matches, we are now down to 217 after the epipolar test. These matches can be seen in figure 7.6, where it can also be seen that there are no outliers left.

Figure 7.6: Matches after the epipolar constraint test.

7.5 Two View Pose Estimation and Triangulation

The pose estimation and triangulation results are visualized by rendering the point cloud and the camera poses using OpenGL. It should be noted that the first pose estimation, between the first two views, does not use the same algorithm as the successive views. The result of the pose estimation for the successive views can be seen in section 7.6. The OpenCV triangulatePoints function gives an acceptable result judging from the rendered point cloud. In most of the cases the points lie in the correct part of the coordinate system, though for some views a few points are placed on the wrong side of the z-axis. Where these outliers come from is not certain; it could be that some outliers were not removed during the outlier removal process, which then introduces errors into the triangulation. The triangulation only reconstructs the scene up to scale, which means that the size of the objects in the view is not known. An example of a rendered point cloud (triangulating the matches in figure 7.6) can be seen in figure 7.7. The figure shows the triangulated points and the camera poses. Comparing with the image views in figure 7.6, the points in the point cloud seem to fit well, giving an acceptable result. This result depends on how many feature points and, essentially, on how many matches were found in the views.

Figure 7.7: The rendered point cloud and camera poses of two views.

7.6 Adding Multiple Views

When calculating the pose of the successive views, the detection of feature points and the matching follow the same procedure as described in sections 7.2 - 7.4 (matching with the preceding view). In this step it has been shown that it is essential that there are enough matches between views. If not, enough 2D to 3D correspondences may not be found, which means that the solvePnPRansac function from OpenCV cannot calculate the pose of the new view correctly and the algorithm will fail; this has shown to be the case for some test cases. In a real time augmented reality application this could lead to drifting and wrong alignment of the virtual and real world, since the camera pose will be unknown or wrong. There have also been test cases where the algorithm has not failed and the pose of the successive views has been calculated; this is shown in figure 7.8, where five camera poses are shown together with the point cloud. The number of found 2D to 3D correspondences between the views can be seen in table 7.1.

Figure 7.8: The rendered point cloud and camera poses of five views.

Table 7.1: Found 2D → 3D correspondences

Views    Number of 2D → 3D correspondences
2 → 3    91
3 → 4    74
4 → 5    28

7.7 Integrating Virtual Object

Using the calculated camera poses, a virtual cube has been rendered onto the images. Figure 7.9 shows outdoor images where only two views of the same scene have been used. Figure 7.10 shows an indoor image sequence of five views (the same views as in figure 7.8). As described in section 6.11, it is not known where the virtual object should be placed, so the cube is placed on a point in the first view on which it should stay during the successive view(s). Knowing where the object should be placed in the real world is essential for an augmented reality application. When using markers this is straightforward, but in markerless AR applications it is not as obvious. There are different techniques that could be used and may be considered as further work. One option is to have a 3D model of the scene in which the virtual model should be placed. With the SfM technique used in this thesis, the obtained

point cloud could be matched to the 3D model. There is a library called PCL (Point Cloud Library) which could be used for this purpose. It supports model fitting and real time object identification from complex 3D scenes, and it could be used to recognize an object in the real world based on its geometric appearance [24]. Using this approach would give the metrics of the scene, since the metric scale of the 3D model should be known. This solves the problem of the point cloud only being reconstructed up to scale. Another alternative could be to use an image of the scene as a marker. This would help to initialize the world coordinate system, establishing where the object should be placed. If the device has a GPS sensor, this could aid in the placement of the object.

Figure 7.9: Cube rendered onto two successive views showing images taken outside. (a) Set one: first view. (b) Set one: second view. (c) Set two: first view. (d) Set two: second view.

When adding more views there will be an accumulated error in the camera pose (and also in the point cloud points). This error can in theory be handled by performing bundle adjustment (described in section 5.7). Bundle adjustment has not been implemented here, but the result shows that it may be necessary. If adding bundle adjustment, its calculation time needs

to be considered. OpenCV 2.4.8 does not have an implemented algorithm for bundle adjustment. The accumulated error can clearly be seen in the images in figure 7.10, where the cube is more and more displaced in the successive views. Drift is also visible in the images in figure 7.9; the pose calculation is not perfect. This can be due to errors in each part of the algorithm, including the initial camera calibration.

Figure 7.10: Cube rendered onto five successive views showing images taken inside. (a) First view. (b) Second view. (c) Third view. (d) Fourth view. (e) Fifth view.

7.8 Algorithm Performance

A small benchmark test has been performed to see how the different algorithms provided by OpenCV perform. The outlier removal process has also been evaluated regarding the time it consumes. The algorithms have been run five times each and an average time has been calculated. As seen in figure 7.11, the feature descriptor process is the one consuming the most time of the OpenCV functions, and the triangulation is the one running fastest (this of course depends on how many points are triangulated). The SURF feature detector and descriptor may not be suited to augmented reality applications because of their computation time, though they are effective in finding good features. Using SURF as the detector but choosing another descriptor like FREAK or BRISK (also provided by OpenCV) could be a first step towards speeding things up.

Figure 7.11: The average time the OpenCV functions consume.

The outlier removal process is very effective. These functions are not provided by OpenCV, but they are worth evaluating since they could slow down the process. Looking at the result in figure 7.12, the outlier removal process does not consume much time compared to the OpenCV algorithms (figure 7.11), though it may still be considered whether all the steps are necessary. Removing outliers is of importance

Figure 7.12: The average time the outlier operations consume

The outlier removal process is very effective. These functions are not provided by OpenCV, but they are worth evaluating since they could slow down the overall process, and it should be considered whether all of the steps are necessary. Looking at the result in figure 7.12, the outlier removal process does not consume much time compared to the OpenCV algorithms (figure 7.11). Removing outliers is important, since there will be errors in the pose estimation and the triangulation if outliers exist.

Chapter 8 Conclusion

The objective of this thesis is to evaluate the maturity of the off-the-shelf library OpenCV when used to achieve markerless augmented reality by implementing the structure from motion algorithm. The implemented algorithm does not run in real time, which is one of the criteria for augmented reality stated in section 2.1. All parts of the structure from motion algorithm are computationally heavy, and this can be seen in the results, especially for the feature detection and descriptor extraction.

There are successful examples of markerless augmented reality using a structure-from-motion-like approach, such as PTAM, described in section 3.2. PTAM runs on parallel threads, separating the structure estimation and the motion estimation into separate threads. It also uses a different feature detector and descriptor technique, described in [17]. Doing the same in the presented implementation could speed up the algorithm. OpenCV also has support for GPU programming, which could make the implementation faster as well; currently the algorithm is implemented on the CPU.
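As a rough illustration of such a split (not code from this thesis or from PTAM; the Frame and Map types and the loop bodies are placeholders), the tracking work could run on one thread and hand selected keyframes to a mapping thread over a small queue:

    // Sketch only: a PTAM-like division of the work onto two threads. The tracking
    // thread would estimate a pose for every frame, while the mapping thread
    // triangulates new points (and could run bundle adjustment) in the background.
    #include <condition_variable>
    #include <mutex>
    #include <queue>
    #include <thread>

    struct Frame {}; // placeholder for an image plus its features
    struct Map {};   // placeholder for the shared point cloud / keyframes

    std::queue<Frame> keyframeQueue; // frames selected for mapping
    std::mutex mapMutex;
    std::condition_variable mapCv;
    bool done = false;

    void trackingThread()
    {
        // For every camera frame: estimate the pose against the current map and,
        // occasionally, push a new keyframe to the mapping thread.
        for (int i = 0; i < 100; ++i)
        {
            Frame f;
            {
                std::lock_guard<std::mutex> lock(mapMutex);
                keyframeQueue.push(f);
            }
            mapCv.notify_one();
        }
        {
            std::lock_guard<std::mutex> lock(mapMutex);
            done = true;
        }
        mapCv.notify_one();
    }

    void mappingThread(Map& map)
    {
        (void)map; // a real implementation would insert triangulated points here
        std::unique_lock<std::mutex> lock(mapMutex);
        while (!done || !keyframeQueue.empty())
        {
            mapCv.wait(lock, [] { return done || !keyframeQueue.empty(); });
            while (!keyframeQueue.empty())
            {
                Frame f = keyframeQueue.front();
                keyframeQueue.pop();
                // ...triangulate new points from f and refine the map here...
            }
        }
    }

    int main()
    {
        Map map;
        std::thread mapper(mappingThread, std::ref(map));
        std::thread tracker(trackingThread);
        tracker.join();
        mapper.join();
        return 0;
    }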

One big problem occurs when not enough feature matches are found between the views; this makes the algorithm unreliable, and it is something that needs to be handled before the algorithm can be considered for augmented reality. Augmented reality places high demands on the calculations, which need to be as correct as possible for the alignment of the real and the virtual world, and this is hard to achieve with OpenCV and the structure from motion algorithm. Implementing the bundle adjustment algorithm would have improved the alignment, though its calculation speed needs to be considered as well; this should be considered as further work.

OpenCV does support use on mobile devices, but it does not support GPU programming on all mobile devices, since it uses CUDA, which is only supported by devices with NVIDIA graphics cards. This could be a problem when implementing the algorithm on a mobile platform, since GPU programming may be necessary to speed up the calculations. For markerless augmented reality on mobile devices it would be preferable to use the sensors of the device, such as the GPS, gyroscope and accelerometer. Still, a hybrid system that also uses computer vision algorithms may not be far away; the computer vision algorithms would aid in finding the exact position of the user's viewpoint, for example regarding height.

Bibliography

[1] 1st and Ten. https://www.sportvision.com/football/1st-ten%C2%AE-graphics. Accessed: 2014-05-12.

[2] ArUco Website. http://www.uco.es/investiga/grupos/ava/node/26. Accessed: 2014-05-19.

[3] Camera Calibration Toolbox for Matlab. http://www.vision.caltech.edu/bouguetj/calib_doc/. Accessed: 2014-02-17.

[4] Lecture 6: Multi-view stereo & structure from motion. http://cs.nyu.edu/~fergus/teaching/vision/11_12_multiview.pdf. Accessed: 2014-05-20.

[5] OpenCV official webpage. http://www.opencv.org. Accessed: 2014-02-03.

[6] OpenGL Website. http://www.opengl.org/about/. Accessed: 2014-05-19.

[7] Parallel Tracking and Mapping Website. http://www.robots.ox.ac.uk/~gk/PTAM/. Accessed: 2014-05-14.

[8] Sweco Position, IT for urban development. http://www.swecogroup.com/en/Sweco-group/Services/Geographical-IT/. Accessed: 2014-02-12.

[9] Wikitude Navigation. http://www.wikitude.com/showcase/wikitude-navigation/. Accessed: 2014-05-12.

[10] Ronald T. Azuma. A survey of augmented reality. Presence: Teleoperators and Virtual Environments, 6(4):355–385, August 1997.

[11] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. SURF: Speeded up robust features. In ECCV, pages 404–417, 2006.

[12] Javier Civera. Real-Time EKF-Based Structure from Motion. PhD thesis, Universidad de Zaragoza, September 2009.

[13] Gilles Simon, Andrew W. Fitzgibbon, and Andrew Zisserman. Markerless tracking using planar structures in the scene. IEEE and ACM International Symposium on Augmented Reality, 2000.

[14] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, second edition, 2003.

[15] Janne Heikkila and Olli Silven. A four-step camera calibration procedure with implicit image correction. In Proceedings of the 1997 Conference on Computer Vision and Pattern Recognition (CVPR '97), pages 1106–, Washington, DC, USA, 1997. IEEE Computer Society. ISBN 0-8186-7822-4. URL http://dl.acm.org/citation.cfm?id=794189.794489.

[16] Henrik Stewénius, Christopher Engels, and David Nistér. Recent developments on direct relative orientation. 2006.

[17] Georg Klein and David Murray. Parallel tracking and mapping for small AR workspaces. In Proc. Sixth IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR'07), Nara, Japan, November 2007.

[18] Georg Klein and David Murray. Parallel tracking and mapping on a camera phone. In Proc. Eighth IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR'09), Orlando, October 2009.

[19] Robert Laganière. OpenCV 2 Computer Vision Application Programming Cookbook. Packt Publishing, London, 2011.

[20] Gun Lee, Andreas Dünser, Seungwon Kim, and Mark Billinghurst. CityViewAR: A Mobile Outdoor AR Application for City Visualization. In 11th IEEE International Symposium on Mixed and Augmented Reality (ISMAR 2012) – Arts, Media, and Humanities Proceedings, pages 57–64, Atlanta, Georgia, USA, 2012. URL http://ael.gatech.edu/lab/files/2012/10/ISMAR2012-program.pdf.

[21] David Nistér. An efficient solution to the five-point relative pose problem. IEEE Trans. Pattern Anal. Mach. Intell., 26(6):756–777, June 2004. ISSN 0162-8828. doi: 10.1109/TPAMI.2004.17. URL http://dx.doi.org/10.1109/TPAMI.2004.17.

[22] Sang-Cheol Park, Sang-Woong Lee, and Seong-Whan Lee. Superimposing 3D virtual objects using markerless tracking. In ICPR (3), pages 897–900. IEEE Computer Society, 2006. ISBN 0-7695-2521-0. URL http://dblp.uni-trier.de/db/conf/icpr/icpr2006-3.html#ParkLL06.

[23] Simon Prince, Ke Xu, and Adrian David Cheok. Augmented reality camera tracking with homographies. IEEE Computer Graphics and Applications, 22(6):39–45, 2002. URL http://dblp.uni-trier.de/db/journals/cga/cga22.html#PrinceXC02.

[24] Radu Bogdan Rusu and Steve Cousins. 3D is here: Point Cloud Library (PCL). In IEEE International Conference on Robotics and Automation (ICRA), Shanghai, China, May 9–13 2011.

[25] D. W. F. van Krevelen and R. Poelman. A Survey of Augmented Reality Technologies, Applications and Limitations. The International Journal of Virtual Reality, 9(2):1–20, June 2010.

[26] M. Varga. Practical Image Processing And Computer Vision. John Wiley & Sons Australia, Limited, 2008. ISBN 9780470868546. URL http://books.google.se/books?id=LtL0AAAACAAJ.

[27] W. T. Fong, S. K. Ong, and A. Y. C. Nee. Marker-less computer vision tracking for augmented reality. IADIS International Conferences Computer Graphics, Visualization, Computer Vision and Image Processing 2010, 2010.

[28] Chunrong Yuan. Markerless pose tracking for augmented reality. Technical report, Fraunhofer Institute for Applied Information Technology, 2006.

[29] Zhengyou Zhang. A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000.
