
F10063

Degree project, 30 credits (Examensarbete 30 hp)

December 2010

Underwater 3D Surface Scanning using Structured Light



Abstract

Underwater 3D Surface Scanning using Structured Light

Nils Törnblom

In this thesis project, an underwater 3D scanner based on structured light has been constructed and developed. Two other scanners, based on stereoscopy and a line-swept laser, were also tested. The target application is to examine objects inside the water-filled reactor vessel of nuclear power plants. Structured light systems (SLS) use a projector to illuminate the surface of the scanned object, and a camera to capture the surface's reflection. By projecting a series of specific line patterns, the pixel columns of the digital projector can be identified off the scanned surface. 3D points can then be triangulated using ray-plane intersection. These points form the basis of the final 3D model.

To construct an accurate 3D model of the scanned surface, both the projector and the camera need to be calibrated. In the implemented 3D scanner, this was done using the Camera Calibration Toolbox for Matlab. The codebase of this scanner comes from the Matlab implementation by Lanman & Taubin at Brown University. The code has been modified and extended to meet the needs of this project. An examination of the effects of the underwater environment has been performed, both theoretically and experimentally. The performance of the scanner has been analyzed, and different 3D model visualization methods have been tested.

In the constructed scanner, a small pico projector was used together with a high pixel count DSLR camera. Because these are both consumer level products, the cost of this system is just a fraction of that of commercial counterparts, which use professional components. Yet, thanks to the use of a high pixel count camera, the measurement resolution of the scanner is comparable to the high end of industrial structured light scanners.


Popular Science Summary (Populärvetenskaplig sammanfattning)

In this degree project, an underwater 3D scanner has been constructed and developed. The target application is the examination of objects inside the water-filled reactor vessel of nuclear power plants. The project was carried out at the company WesDyne TRC, which performs and develops methods for non-destructive testing of nuclear power plants. Non-destructive testing can be performed in several different ways, for example by analysis of ultrasound, magnetic particles, X-rays and electrical eddy currents, as well as with visual methods. 3D scanners belong to the latter group. The technology has been available for many years, but it is only recently that it has begun to be used in this industry. The potential applications are many, and the technology improves continuously thanks to better cameras and faster computers.

3D scanners are used extensively in the film and video game industries. Other applications include industrial design, the manufacture of prostheses and orthoses, reverse engineering and prototyping, documentation of cultural heritage, and quality control and inspection. Using a 3D scanner results in a 3D model that can be visualized on a computer. It allows the user to rotate the scanned object and examine it from different viewpoints and distances. In many cases the lighting of the object can also be controlled if desired.

3D models are interesting from an engineering perspective because the scanned object has the correct physical dimensions (up to the accuracy of the scanner). This means that measurements can easily be made in the model. The advent of 3D printers also makes it possible to create a physical copy of the scanned object.

In this project, a 3D scanner based on structured light has been constructed. The technique is based on projecting several different line patterns onto the examined object using a projector. The illuminated surface is simultaneously observed by a camera that photographs the object under each of the patterns. Since the camera and the projector are aimed at the object from different positions and angles, triangulation can then be performed. Mathematically, this corresponds to computing the intersection point between a plane and a line. The plane corresponds to a narrow line in the projector's pattern. The line originates from the pixel in the camera's photograph that contains the corresponding projected line on the object. The intersection point between the plane and the line corresponds to a three-dimensional point on the object's surface. If this triangulation is performed many times with different planes and lines, a model of the scanned object emerges. It consists of a large collection of points, called a point cloud, and can be difficult to interpret if it is visualized without post-processing. Therefore, several different ways of visualizing the 3D model from the scanner have been tested during the course of the project.

In order to know the positions and angles of the projector's planes and the camera's lines, the scanner must be calibrated. This is done by photographing both a physical and a projected checkerboard pattern, which allows calibration of the camera and the projector respectively.


The measurement resolution of the constructed scanner is comparable to the best commercial systems on the professional market, thanks to the use of a digital system camera with a high pixel count.


Contents

1 Introduction
  1.1 Project Description
  1.2 Classification of 3D Scanners
  1.3 3D Scanning Problems
  1.4 Structured Light vs. Laser Triangulation

2 Theory
  2.1 Optics
    2.1.1 Pinhole camera model
    2.1.2 Lenses
    2.1.3 Depth of Field
    2.1.4 Refraction
  2.2 Coordinate Systems
  2.3 Stereoscopy
  2.4 Active 3D Scanning
    2.4.1 Triangulation
    2.4.2 Projector Line Patterns
    2.4.3 Depth of Field

3 Methods
  3.1 Hardware
    3.1.1 Computer
    3.1.2 Camera
    3.1.3 Projector
  3.2 Tested 3D Scanners
    3.2.1 Stereoscopy
    3.2.2 Swept-Plane Scanner
    3.2.3 Structured Light Implementation

5 Discussions
  5.1 Calibration
    5.1.1 Extrinsic
    5.1.2 Intrinsic
  5.2 Depth of Field
  5.3 Performance
    5.3.1 Accuracy
    5.3.2 Resolution and Local Accuracy
    5.3.3 Repeatability
    5.3.4 Difference Detection
  5.4 Projector Alternatives
  5.5 Visualization
  5.6 Comparison with Commercial Scanners


1 Introduction

1.1 Project Description

3D scanners are used to collect information about the shape of a real-world object or environment. In addition to shape, they can also capture the appearance of the object, usually its color. This information can be visualized in a digital 3D model. This allows the user to rotate the object and view it from different virtual positions. The model can also be used to make measurements of the object. The output of the 3D scanner is usually in the form of a point cloud, i.e. a large set of sampled 3D points. This point cloud often requires additional processing before it can be visualized appropriately.

3D scanners are used extensively by the movie and video game industry. Other uses include industrial design, orthotics and prosthetics, reverse engineering and prototyping, documentation of cultural artifacts, quality control and inspection [13]. This project was made on behalf of the company WesDyne TRC, a provider of inspection services for the nuclear power industry. They focus on non-destructive examination of objects in the reactor vessel of nuclear power plants. 3D scanners can possibly have many uses in this context, as any defect that is visible to the eye can also be scanned. A primary use of the 3D scanner developed in this project is to construct an accurate blueprint of specific parts in the reactor vessel, in an underwater environment. This can include narrow columns whose dimensions are not known exactly. With exact dimensions of the column, a purpose-built rig can be constructed. This rig then allows other measurement instruments to be used in this specific part of the vessel.

1.2 Classification of 3D Scanners

There is a wide range of different 3D scanning technologies, each with its own advantages and limitations. A taxonomy of visual 3D scanners is shown in Figure 1. Another commonly used classification of 3D scanners is the division into active and passive technologies. Active scanners emit some kind of radiation (usually light) to probe the scanned object, while passive ones do not. The most commonly used technologies in the industry today are laser triangulation, structured light and contact scanning. The latter is the slowest but also the most accurate option. It requires that a probe is swept across every part of the surface that needs examining. Laser triangulation involves sweeping a line laser across the surface, while the location of the line is recorded by a camera. This scanner can be classified into the depth-from-motion category, since it uses the motion of a laser line. The technology that this project has focused on is structured light. Structured light systems (SLS) work similarly to a line-swept laser scanner, but instead of illuminating the surface with a single line and sweeping it, a whole pattern of lines is projected by a projector.



Figure 1: A taxonomy of visual 3D scanners.

Depth-from-focus scanners use the fact that objects lying closer to or further away from the focus distance of the camera will be blurred by a specific amount. Since no light is emitted, the scanner is passive. Due to the nature of depth of field, which describes how large a part of the depth in the image will be sharp, these scanners can only be used at quite short distances. The accuracy is not very good, but the technology allows narrow holes and similar geometries to be scanned, which cannot be achieved with triangulation scanners. This is because the illumination lines will either not be visible to the camera or the illumination will not enter the hole. Either way, narrow holes will be occluded from the scanning coverage.

Stereo vision, or stereoscopy, is another passive technology that uses two cameras to do the triangulation. This is similar to how the human visual depth perception works. By finding matching points in both of the captured images, triangulation can be performed.

Finally, some 3D scanners use the property that surfaces reflect light differently depending on their angle. The surface normal can be computed by measuring the amount of reflected light, if the surface is assumed to be perfectly diffuse (Lambertian reflection). The ambiguity of the problem must be removed by illuminating the scene from several directions. Since the surface normal represents the gradient of the surface, a 3D model can be constructed by integration. This makes the technique sensitive to low-frequency noise, so the model is often distorted. Also, surfaces that are not perfectly Lambertian, but have some amount of specular reflection, will lead to inaccuracies in the 3D model.

1.3 3D Scanning Problems


Therefore, metal surfaces with a mirror finish will be problematic to scan. Fortunately, the objects in a reactor tank mostly have a quite matte finish, facilitating the use of a 3D scanner. Semi-transparent objects can also be difficult to handle since there will not be one but several reflections. Objects that exhibit some degree of subsurface scattering, such as human skin, also degrade the quality of the reflection. Finally, disturbances in the medium in which the scanned object is placed present a problem with 3D scanning. Particles in the medium might stop the emitted light from reaching the surface. If the temperature of the medium varies, it will cause a varying optical index of refraction of the medium. This makes the light rays bend, giving an effect similar to that in a mirage, leading to distortions in the 3D model. This can be a very real problem in an underwater environment with high temperature variations in the water medium.

Since 3D scanners usually have only one camera, they can only collect information about the parts of an object that are visible from that point of view. To cover occluded parts, the scanner must be repositioned and a new scan performed. The different point clouds recorded can then be merged into a complete model.

1.4 Structured Light vs. Laser Triangulation

The most obvious difference between structured light and a line-swept laser is the number of images required to be captured during a scan. A laser scanner uses video to capture the location of the laser line at many time points, while a structured light scanner works well with a still camera. To get a reasonably dense point cloud with a laser scanner, many video frames need to be captured, typically on the order of several hundred to a few thousand frames. An SLS, on the other hand, typically only requires between one and two dozen frames, allowing the use of a still camera. This also means that the scan time is usually longer with a laser scanner than with an SLS. Since digital still cameras generally have a much higher pixel count than video cameras, there should be a potential to get a very high measurement resolution with an SLS, provided that the projector can match the pixel count of the camera. However, most computer projectors only have pixel counts similar to video, limiting the resolution.



Figure 2: The pinhole camera model. Figure reproduced from Lanman & Taubin [10].

2 Theory

2.1 Optics

2.1.1 Pinhole camera model

To be able to convert what is seen by the camera into real-world measurements, a camera model is needed. A simple model describing the geometry of the camera is the pinhole model. The aperture of the camera is represented as a point through which all rays of light pass, called the center of projection. This is an approximation since the real aperture of the camera is not infinitesimally small. The axis originating in the center of projection and pointing in the viewing direction is called the optical axis. A virtual image plane is located at a distance f in the positive direction of the optical axis, where f is the focal length of the camera. Alternatively, the image plane can be placed at −f instead. In this definition the plane more accurately coincides with the actual position of the camera sensor. The image captured on this plane will however be rotated 180°. To simplify matters, the virtual image plane will therefore be used throughout the rest of this report. An illustration of the notation can be seen in Figure 2. Note that references will be made to the camera model throughout this theory section. Because of the principle of reversible optical paths, the projector can be seen as a camera, only it emits light instead of capturing it. Therefore, the same model can be used for the projector, as is done in the implementation.

2.1.2 Lenses

A simple model of how a lens refracts and focuses light is the thin lens model. The thin lens formula describes where an object at a distance s is focused at a distance q, both measured from the lens plane:

\frac{1}{s} + \frac{1}{q} = \frac{1}{f}    (1)
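As a quick numerical illustration (a sketch, not part of the thesis code), Equation 1 can be solved for the image distance q. The focal length and subject distance below are the typical values quoted later in Section 2.1.3.

% Thin lens formula (Eq. 1): solve for the image distance q given the focal
% length f and the subject distance s. Values are the typical ones from
% Section 2.1.3 (f = 55 mm, s = 0.25 m); this is an illustration only.
f = 0.055;            % focal length [m]
s = 0.25;             % subject distance [m]
q = 1/(1/f - 1/s);    % image distance behind the lens plane [m]
fprintf('q = %.1f mm\n', q*1e3);   % prints q = 70.5 mm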


Figure 3: Thin lens model. Figure reproduced from Wikipedia [16].


Figure 4: Light enters a thick lens from a point infinitely far away.

In a real, thick lens, light undergoes two refractions as it passes through the lens, upon both entry and exit. When light enters the thick lens from a point infinitely far away, one can find a plane where an equivalent thin lens would have been located. By extending the lines from the rays entering and leaving the lens into the lens, a point of intersection can be found, as seen in Figure 4. The rear principal plane of the lens contains all such points for different heights h. The focal length f is defined as the distance from the rear principal plane to the focal point, when the rays entering the lens come from a point infinitely far away. The distances s and q are then defined relative to the rear principal plane.

All modern camera lenses contain several lens elements. Instead of describing how each element in the lens refracts the light, the lens can be considered as a black box where only the entrance and exit properties are considered. Since the aperture in a camera lens lies inside the lens, behind one or several lens elements, its apparent position and size will differ from the actual ones when looking through the front of the lens. This apparent aperture is called the entrance pupil of the lens, and its location coincides with the origin in the pinhole camera model.


c      diameter of circle of confusion (CoC)
D      depth of field (DoF)
d_i    object size on sensor
d_o    actual object size
D_ep   diameter of entrance pupil
f      focal length of lens
H      hyperfocal distance
m      magnification
N      f-number of lens
s      subject distance

Table 1: Variables relating to depth of field.

When focused closer than infinity, most lenses appear to have a longer effective focal length. In some optical lens designs, however, the effective focal length can be shortened as the lens is focused closer than infinity.

2.1.3 Depth of Field

The portion of a scene in an image that is acceptably sharp is called the depth of field (DoF). Perfect sharpness will only occur at the (infinitesimally thin) focus plane in the scene. The sharpness will gradually degrade as you move away from this plane. At close focus distances, the distribution of the DoF will be symmetrical around the focus plane, while at longer focus distances a larger portion of the DoF will be on the far side of the focus plane. There are a number of factors that affect how large the DoF will be in a photograph. Most notably, the DoF gets shorter as the camera-to-subject distance s decreases. Furthermore, a higher f-number increases the DoF. The f-number N is defined as

N = \frac{f}{D_{ep}}    (2)

where D_{ep} is the diameter of the entrance pupil. The variables used in this section are listed in Table 1.


The largest possible DoF is achieved when the focus distance is set to the hyperfocal distance H. The DoF then extends to infinity. H is given by

H \approx \frac{f^2}{Nc}    (3)

where c is the diameter of the CoC. For short-to-moderate subject distances s, when the subject distance is much smaller than H, the DoF D is approximately given by

D \approx 2Nc\,\frac{m+1}{m^2}    (4)

where m is the magnification, defined as

m = \frac{d_i}{d_o}    (5)

d_i is the size of an object measured on the sensor, and d_o the actual size of the object. When s does not satisfy s \ll H, Equation 4 is invalid. The DoF is then approximately given by

D \approx \frac{2Ncf^2 s^2}{f^4 - N^2c^2s^2}    (6)

It should be noted that magnification is an alternative way of expressing the subject distance: a smaller s corresponds to a bigger m. In Equation 4 the DoF is independent of the focal length used (assuming a fixed N and m), while in Equation 6 it decreases with increasing f.

Typical values used with the camera in this project are: c = 12 \times 10^{-6} m, N = 20, f = 0.055 m and s = 0.25 m. This translates into a hyperfocal distance of H ≈ 12.6 m. Since s is 12.6/0.25 ≈ 50 times smaller than H in this case, s \ll H might not be fulfilled completely. Therefore, it is not entirely clear which equation should be applied in this case. However, it can still be worth analyzing the properties of the DoF at subject distances either longer or shorter than this.
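A short MATLAB check of Equations 3 and 6 with these values (a sketch added for illustration, not part of the thesis code):

% Hyperfocal distance (Eq. 3) and DoF (Eq. 6) for the values quoted above.
c = 12e-6;  N = 20;  f = 0.055;  s = 0.25;
H = f^2/(N*c);                             % hyperfocal distance
D = 2*N*c*f^2*s^2 / (f^4 - N^2*c^2*s^2);   % depth of field at subject distance s
fprintf('H = %.1f m, D = %.1f mm\n', H, D*1e3);   % H = 12.6 m, D = 9.9 mm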

An interesting topic is how the sensor size affects the DoF at short subject distances. The focal length increases linearly with the sensor size if the same FoV is to be kept. For a fixed sensor pixel count, c increases linearly with increasing sensor size. Keeping N constant, the hyperfocal distance therefore increases linearly with increasing sensor size. Therefore, the subject distance range over which Equation 4 is valid becomes smaller. For a given subject distance, Equation 6 becomes a better approximation as the sensor size is increased, thereby giving a shorter DoF. However, if Equation 4 remains valid, the DoF is increased linearly because of the increased CoC. Furthermore, if the sensor size is increased while the focal length is kept constant, the hyperfocal distance decreases, extending the valid range of Equation 4.


Figure 5: A light ray P enters an interface between two materials with different optical indices at an angle \theta_1. Since n_2 > n_1 it is refracted at a smaller angle \theta_2. Figure reproduced from Wikipedia [17].

If the sensor size is increased by a factor x, diffraction also sets in at a higher f-number with the larger sensor. Specifically, the value of N where it starts to appear is increased by the same factor x. The loss in DoF can therefore be fully compensated with an increased f-number. The cost of increasing N is that less light hits the sensor, which must be compensated for with a longer exposure time. This also explains why lenses for larger sensors have a bigger maximum f-number. Also, the loss of sharpness due to diffraction should not be overrated. It might be that a small decrease of sharpness at the focal plane is acceptable for an increase of sharpness in out-of-focus areas.

In conclusion, a larger sensor size at short subject distances will give a larger DoF due to the increase of c. At longer distances, the DoF can be kept constant if the f -number is also increased.

2.1.4 Refraction

Since the 3D scanner is intended to be used in an underwater environment, the optical light path will contain more refractions than if the scanner were used above water. This is because water has a different optical index than air. Furthermore, the underwater housing has a glass plate preventing the water from entering while letting light pass through. The optical index of this glass is also different from that of air, giving additional refractions. Light which passes from one optical medium to another changes direction according to Snell's law

\frac{\sin\theta_1}{\sin\theta_2} = \frac{v_1}{v_2} = \frac{n_2}{n_1}    (7)

where n_1 and n_2 are the optical indices of the two materials, and v_1 and v_2 are the velocities of light in them. \theta_1 and \theta_2 are defined as shown in Figure 5. Equation 7 can be rewritten as

n_1 \sin\theta_1 = n_2 \sin\theta_2    (8)


Figure 6: Rays of light enter a glass plate from air and exit on the other side. The non-refracted ray is shown in red and the refracted in blue.

Light rays that pass from water to glass to air can then be described by

n_w \sin\theta_w = n_g \sin\theta_g = n_a \sin\theta_a    (9)

where n_w, \theta_w, n_g, \theta_g, n_a and \theta_a are the optical indices and angles in water, glass and air respectively. The values of the optical indices are approximately n_w = 1.33, n_g = 1.50 and n_a = 1.00. It is possible to separate the effects of the refraction in the glass plate and the refraction from the air-water interface. The situation is equivalent to having a direct air-to-water interface and the glass plate, covered with water or air, placed at any distance from the lens. First, let us consider the effects of the glass plate placed in air. From Figure 6 it can be seen that the glass medium keeps the incident and exit angle \theta_a constant. Rather, the glass causes a slight shift of the ray, in the direction parallel to the glass plate, away from the optical axis. This causes the rays to converge, i.e. focus, at a point C further away on the optical axis. This is perceived as if the object is located closer to the lens [8], and introduces no distortions. The deviation of the point of convergence introduced by a glass plate of thickness t with optical index n_g will be

e = t\left(1 - \frac{n_a}{n_g}\right)    (10)

where n_a is the optical index of the medium on both sides of the plate. With n_a = 1 and n_g = 1.50 the deviation is 33 % of the plate thickness, or 3.3 mm with a plate thickness of t = 10 mm.
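These numbers can be reproduced with a few lines of MATLAB (an illustrative sketch; the incidence angle is an arbitrary example value, not one from the thesis):

% Refraction at the flat glass port (Eqs. 7-10) with the indices given above.
na = 1.00;  ng = 1.50;                 % optical indices of air and glass
t  = 0.010;                            % glass plate thickness [m]

theta_a = 20*pi/180;                   % example incidence angle in air [rad]
theta_g = asin(na*sin(theta_a)/ng);    % angle inside the glass, from Eq. 8
theta_out = asin(ng*sin(theta_g)/na);  % exit angle: equals theta_a again

e = t*(1 - na/ng);                     % shift of the convergence point, Eq. 10
fprintf('exit angle = %.1f deg, e = %.1f mm\n', theta_out*180/pi, e*1e3);
% prints: exit angle = 20.0 deg, e = 3.3 mm (33 % of the plate thickness)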

(16)

Figure 7: Light rays from a real point P refract at the water-air interface, giving a continuous distribution of imaginary points. Three of them are illustrated with corresponding dashed rays. Figure reproduced from Chenard & Petron [3].

In addition, the refractions also cause chromatic aberrations, due to light of different wavelengths (i.e. colors) having slightly different velocities and thus different optical indices. This makes light rays of one color refract at a slightly different angle than another color. The result is that colored fringes appear in the image, particularly in high contrast areas and at the edges of the image. As with spherical aberration, chromatic aberration is made worse by an increasing field of view and larger apertures (transverse, or lateral, chromatic aberration is not reduced by decreasing the aperture). However, if a lens with a sufficiently narrow field of view is used, i.e. with a sufficiently long focal length, these effects become very small and cannot be seen in the image.

The biggest effect the water has on an imaging system with a planar water-air interface is that the apparent focal length increases greatly. To see why, Equation 9 can be rewritten using the paraxial approximation \sin\theta \approx \theta and \tan\theta \approx \theta, i.e. for small angles

\theta_w \approx \frac{n_a \theta_a}{n_w} = \frac{\theta_a}{n_w}    (11)

Since n_w > 1, this causes a decrease in the water angle. The relation between the angle \theta, the height h and the focal length f of a lens can be seen in Figure 4. With h being constant, this translates into the following in the case with water

\tan\theta_w = \frac{h}{f_w} \implies \theta_w \approx \frac{h}{f_w}    (12)

where f_w is the prolonged focal length and \theta_w the reduced angle. Equivalently, in air

\theta_a \approx \frac{h}{f_a}    (13)

where f_a is the focal length in air. Inserting this into Equation 11 gives

\frac{h}{f_w} \approx \frac{h}{n_w f_a} \iff f_w \approx n_w f_a    (14)

Therefore the apparent focal length is increased by a factor n_w.
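A numerical illustration of Equation 14, using the index of refraction of water given above and the 55 mm focal length from Section 2.1.3 (example values only):

% Apparent focal length behind a flat port in water, Eq. 14.
nw = 1.33;                          % optical index of water
fa = 0.055;                         % focal length in air [m]
fw = nw*fa;                         % apparent focal length in water [m]
fprintf('fw = %.0f mm\n', fw*1e3);  % prints fw = 73 mm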

Figure 8: When the circular object is viewed from an angle through an air-water interface, it appears distorted as an ellipse with a different position. Figure reproduced from Arizaga et al. [1].

While an increased focal length yields no distortion in the perceived image of an object, there are other effects that will. When a fish is viewed in the water by an observer standing on land, it will appear compressed in the direction perpendicular to the water surface. The position of the fish will also appear to change. The effect is shown in Figure 8. When the angle \theta_a is decreased, the distortion and position change become smaller. When a camera is placed so that its optical axis is orthogonal to the interface, one might think that these distortions would be removed. But since the FoV of the camera is always nonzero, the light rays will hit the interface at a nonzero angle, except for the ray on the optical axis. This will cause a distortion of the image, where distances parallel to the optical axis are compressed. The effect is more pronounced at the edges of the image. A camera with a longer focal length will suffer less from this effect since the biggest angles \theta_a are smaller.

2.2 Coordinate Systems



Figure 9: Ideal pinhole camera coordinates. Figure reproduced from Lanman & Taubin [10].

The camera coordinate system has its origin at the center of projection, and its z-axis is pointing in the same direction as the optical axis. The image plane is placed at a distance of f = 1 from the origin q = (0, 0, 0). A point on this plane has the coordinates u = (u_x, u_y, 1)^t. A point p = (p_x, p_y, p_z)^t on the line containing the origin and u can then be described as

\begin{pmatrix} p_x \\ p_y \\ p_z \end{pmatrix} = \lambda \begin{pmatrix} u_x \\ u_y \\ 1 \end{pmatrix}    (15)

for some scalar \lambda. A point p_W in a world coordinate system can be transformed to camera coordinates p using

p = R\,p_W + T    (16)

where T \in \mathbb{R}^3 is a translation vector and R \in \mathbb{R}^{3 \times 3} is a rotation matrix. The matrices R and T are called the extrinsic parameters of the camera, describing its orientation and location with respect to the world coordinate system. Up to now, we have assumed that the unit of measurement is the distance from the center of projection to the image plane. Furthermore, the origin of the image coordinates (u_x = 0, u_y = 0) is located at the point of intersection between the image plane and the optical axis, called the principal point. In reality, the origin in digital images is usually the top left corner (as in this report), and the unit of length is a pixel. A suitable unit of length in the world coordinate system is meters or millimeters, as used in the implementation. To remedy these limitations, a matrix K \in \mathbb{R}^{3 \times 3} describing the intrinsic parameters of the camera is introduced:

\lambda u = K(R\,p_W + T)    (17)

where K is multiplied with Equation 16 and Equation 15 is inserted. K has the following form:

K = \begin{pmatrix} f s_x & f s_\theta & o_x \\ 0 & f s_y & o_y \\ 0 & 0 & 1 \end{pmatrix}    (18)

s_x and s_y are coordinate scale parameters allowing compensation for non-square pixels, used in some sensors. s_\theta allows correction for a tilted image plane. Finally, (o_x, o_y)^t are the coordinates of the principal point with respect to the image origin, measured in pixels.


These intrinsic parameters depend only on the mechanical and optical design of the camera. They remain constant as long as the zoom setting of the lens is not altered (assuming a zoom lens is used; prime lenses excluded). Note that f represents the effective focal length. Therefore, the intrinsic calibration is also affected by changes in the focus depth setting, though not by much for small focus changes. When doing the calibration of the SLS it is possible to calculate the intrinsic parameters once and for all through an initial calibration. This can then be stored and used with different extrinsic calibrations.

Since real-world lenses also display non-linear lens distortion, this model has to be extended to compensate for it, as is done in the implementation. The distortion coefficients are also considered intrinsic parameters. The distortion can be divided into a radial and a tangential component. Before Equation 17 is applied, these distortions must be removed. Let u_d = (u_{x,d}, u_{y,d})^t be the distorted pixel that should be corrected, measured in pinhole camera coordinates in the image plane, and let the radius r be given by r^2 = u_{x,d}^2 + u_{y,d}^2. The corrected pixel coordinates u are then written

u = \begin{pmatrix} u_x \\ u_y \end{pmatrix} = (1 + k_{r,1} r^2 + k_{r,2} r^4 + k_{r,3} r^6)\,u_d + du    (19)

where du is the tangential distortion vector

du = \begin{pmatrix} 2 k_{t,1} u_{x,d} u_{y,d} + k_{t,2}(r^2 + 2 u_{x,d}^2) \\ k_{t,1}(r^2 + 2 u_{y,d}^2) + 2 k_{t,2} u_{x,d} u_{y,d} \end{pmatrix}    (20)

and where k_{r,i} with i = 1, 2, 3 are the radial distortion coefficients in the polynomial model used, and k_{t,i} with i = 1, 2 are the coefficients used in the tangential "thin prism" model.
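The correction of Equations 19 and 20 amounts to only a few lines of MATLAB. The sketch below uses made-up distortion coefficients and an arbitrary measured point; in the scanner the coefficients come from the Camera Calibration Toolbox.

% Removing lens distortion according to Eqs. 19-20 (illustrative sketch).
% ud is a measured point in pinhole (normalized) image-plane coordinates,
% e.g. obtained from a pixel via K\[column; row; 1]. The coefficient values
% below are placeholders, not the calibrated ones from the project.
kr = [-0.2, 0.05, 0];          % radial coefficients k_r,1..3
kt = [1e-4, 1e-4];             % tangential coefficients k_t,1..2

ud = [0.11; -0.07];            % distorted coordinates (u_x,d, u_y,d)
r2 = ud(1)^2 + ud(2)^2;        % r^2 as defined in the text

du = [2*kt(1)*ud(1)*ud(2) + kt(2)*(r2 + 2*ud(1)^2); ...   % Eq. 20
      kt(1)*(r2 + 2*ud(2)^2) + 2*kt(2)*ud(1)*ud(2)];

u = (1 + kr(1)*r2 + kr(2)*r2^2 + kr(3)*r2^3)*ud + du;     % Eq. 19
disp(u.')                      % corrected coordinates, ready for Eq. 17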

2.3 Stereoscopy

At the start of the project the plan was to use a stereoscopy-based 3D scanner. With that technique two or more cameras observe a scene from different viewpoints. By finding matching pixels, or correspondences, in the images captured by the cameras, triangulation can be performed to find 3D points. Being a passive technique, it has some advantages over its active counterparts. The scanning time is limited to the time it takes to capture two images. Provided the cameras are synchronized properly, the time needed is basically the exposure time of the cameras. Depending on the lighting of the scene and the sensitivity of the camera, this usually only takes a fraction of a second. Compared to active 3D scanners, the resolution is not limited by the projector or the laser, only by the cameras. The scanning range is also very large: provided a sufficiently long baseline can be used, scans of objects several hundred meters away can be performed. The biggest issue with stereoscopy is that the correspondence problem is very hard to solve. The algorithms used to find correspondences are very slow for large images and large point clouds. In addition, the matches found are not always accurate. To see why, a basic description of the correspondence algorithm is given below.


In the simplest configuration, with two identical cameras mounted side by side, the possible matches for a pixel in image A lie along an epipolar line in image B, which in this case is the same horizontal line as the pixel in A. If the cameras have some other position and are rotated, the epipolar line has some other position and angle. Using the epipolar line, the search is reduced from two dimensions to one. This dramatically lowers the computational burden. There are different methods of measuring the likeness between pixels. A popular method is the absolute difference in pixel values. Unfortunately, single pixel-to-pixel comparisons are insufficient: in all likelihood there are many pixels with values very similar to the pixel in A. Therefore a neighborhood, or window, of pixels around each pixel in A and B needs to be considered. This means that not a single pixel value difference but a matrix difference needs to be computed for each pixel pair in A and B, increasing the number of operations needed. The sum of absolute differences (SAD) between the pixels in the windows is a measure of how dissimilar the two windows are. The algorithm chooses the pixel in B which has the smallest SAD value. But since the images are taken from different views, the scene has a different appearance in the images, so the match found may be incorrect. The number of operations needed to match all the pixels in A with B is O(N^3), where N is the image height and width in pixels: the number of pixels to process in A is N^2, and an extra factor of N is needed to process all the pixels on the epipolar line in B. In comparison, an active technique requires only O(N^2) operations.
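For concreteness, a minimal MATLAB sketch of the SAD window search along a horizontal epipolar line is given below. The window size and disparity range are arbitrary example parameters, not those used in the project.

% SAD block matching along a horizontal epipolar line (save as sad_disparity.m).
% A and B are rectified grayscale images (double), (row, col) a pixel in A,
% win an odd window size and maxDisp the largest disparity searched.
function d = sad_disparity(A, B, row, col, win, maxDisp)
    half = floor(win/2);
    refW = A(row-half:row+half, col-half:col+half);   % window around the pixel in A
    best = inf;  d = 0;
    for disp = 0:maxDisp                              % walk along the epipolar line in B
        c = col - disp;
        if c - half < 1, break; end
        candW = B(row-half:row+half, c-half:c+half);
        sad = sum(abs(refW(:) - candW(:)));           % sum of absolute differences
        if sad < best
            best = sad;  d = disp;                    % keep the most similar window
        end
    end
end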

A comparison of different correspondence algorithms was done in [12]. The results are presented as disparity maps in Figure 10. Large disparities, i.e. a big difference in the horizontal location of the pixel matches, mean that the point is near. Conversely, small disparities mean that points are far away. In Figure 10 this is illustrated as large disparities being bright and small disparities being dark. As a reference, the ground truth disparities are also shown. Note that these were acquired using a laser 3D scanner, a testament to how much more accurate active 3D scanners are. Table 2 shows the computation times of some different algorithms, including the ones in Figure 10. Clearly, the better performing algorithm (10 - Graph cuts (GC)) has a quite long computation time even for very small images.

2.4 Active 3D Scanning

2.4.1 Triangulation


Figure 10: Disparity maps in decreasing order of performance. Figure reproduced from Scharstein and Szeliski [12].

                     Tsukuba  Sawtooth    Venus     Map
width                    384       434      434     284
height                   288       380      383     216
disparity levels          16        20       20      30
Time (seconds):
14 – Realtime            0.1       0.2      0.2     0.1
16 – Efficient           0.2       0.3      0.3     0.2
*1 – SSD+MF              1.1       1.5      1.7     0.8
*2 – DP                  1.0       1.8      1.9     0.8
*3 – SO                  1.1       2.2      2.3     1.3
10 – GC                 23.6      48.3     51.3    22.3
11 – GC+occlusions      69.8     154.4    239.9    64.0
*4 – GC                662.0     735.0    829.0   480.0
*5 – Bay              1055.0    2049.0   2047.0  1236.0


Table 2: Computation times of some different correspondence algorithms with four different image sets. Table reproduced from Scharstein and Szeliski [12].



Figure 11: Triangulation by line-plane intersection. Figure reproduced from Lanman & Taubin [10].

An exception occurs when the camera line and the projector plane are divergent, which gives an imaginary intersection behind them.

Mathematically, a line can be represented in parametric form by

L = \{p = q_L + \lambda v : \lambda \in \mathbb{R}\}    (21)

Any point p on this line is then given by a point q_L on the line, a direction vector v and the scalar multiple \lambda. All the different points on the line will then be represented by a unique value of \lambda. Next, a plane can be described in implicit form as

P = \{p : n^t(p - q_P) = 0\}    (22)

Similarly, any point p on the plane is given by a point q_P on the plane and a normal vector n. Since n is orthogonal to (p - q_P), the scalar product must be zero. Substituting Equation 21 into 22 gives

n^t(q_L + \lambda v - q_P) = 0    (23)

Distributing n^t gives

\lambda n^t v + n^t(q_L - q_P) = 0    (24)

and solving for \lambda yields

\lambda = \frac{n^t(q_P - q_L)}{n^t v}    (25)

The triangulated point p is then given by Equation 21.

Now, to be able to find the equations of L and P, they must be extracted from a point in the camera's image plane and a line in the projector's image plane, respectively. Starting with L, Equation 17 can be solved for p_W to find the line

p_W = R^t(\lambda u - T) = \lambda R^t u - R^t T    (26)

where the rotation matrix property R^{-1} = R^t has been used. Thus, the line's origin, i.e. the camera's center of projection, is given by q_L = -R^t T and the direction vector by v = R^t u.

Next, a line in the projector's image plane can be described in implicit form as

L = \{u : l^t u = 0\}    (27)

where l = (l_x, l_y, l_z)^t is a vector with l_x \neq 0 or l_y \neq 0. Taking the scalar product of l and Equation 17 we get

\lambda l^t u = l^t(R p_W + T) = (R^t l)^t(p_W - (-R^t T)) = 0    (28)

where Equation 27 was used in the last equality. Comparing with Equation 22 we can see that this is the equation of a plane with n = R^t l and q_P = -R^t T, the projector's center of projection. The line L and the corresponding plane P are shown in Figure 12. [10]

Figure 12: Ideal pinhole camera coordinates. Figure reproduced from Lanman & Taubin [10].
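As a concrete illustration of Equations 21-28, the MATLAB sketch below triangulates a single point from one camera pixel and one decoded projector column. All calibration matrices and the example pixel/column are placeholder values; in the scanner they are produced by the camera and projector calibration.

% Ray-plane triangulation (Eqs. 21-28) for one camera pixel and one decoded
% projector column. All numbers below are placeholders for illustration;
% in the scanner they come from the camera and projector calibration.
Kc = [2600 0 2256; 0 2600 1504; 0 0 1];      % camera intrinsics (Eq. 18)
Rc = eye(3);            Tc = [0; 0; 0];       % camera extrinsics (world = camera)
Kp = [1400 0 424; 0 1400 240; 0 0 1];         % projector intrinsics
ang = 20*pi/180;                              % projector rotated 20 deg about y
Rp  = [cos(ang) 0 sin(ang); 0 1 0; -sin(ang) 0 cos(ang)];
Tp  = -Rp*[0.2; 0; 0];                        % projector center 0.2 m to the side

% Camera ray: back-project the pixel, Eq. 26 (q_L = -R'T, v = R'u).
pix = [1800; 1200; 1];                        % example camera pixel (homogeneous)
u   = Kc \ pix;                               % pinhole (normalized) coordinates
qL  = -Rc'*Tc;
v   = Rc'*u;

% Projector plane: a decoded column corresponds to the vertical line
% l'u = 0 with l = (1, 0, -x_col) in the projector's pinhole coordinates.
col   = 546;                                  % decoded projector column (example)
x_col = (col - Kp(1,3)) / Kp(1,1);            % remove the projector intrinsics
l     = [1; 0; -x_col];                       % Eq. 27
n     = Rp'*l;                                % plane normal, n = R'l
qP    = -Rp'*Tp;                              % projector center of projection

lambda = (n'*(qP - qL)) / (n'*v);             % Eq. 25
p      = qL + lambda*v;                       % triangulated 3D point, Eq. 21
disp(p.')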

2.4.2 Projector Line Patterns

The object of the projected line pattern is to be able to decode the unique projector pixel columns off the surface of the object. The central question is then which pattern transmits this information in the best way. A good pattern should be insensitive to noise, so that the planes are decoded reliably. A common way to increase the redundancy of the transmission is to project multiple frames with different patterns. These are called temporally encoded patterns. Patterns that consist of only one frame are called spatially encoded patterns, as the spatial information in a single pattern is the only information available. They are useful if the scene one is trying to capture is rapidly changing, i.e. changing faster than the time it takes to project all the different temporally encoded patterns. Furthermore, the projected patterns need not be lines. Instead the information could be encoded into separate projector pixels. Each pixel then corresponds to a ray from the projector. A 3D point then has to be triangulated by ray-ray intersection instead. The information required to be transmitted increases greatly with such an approach. This makes it more difficult to transmit the information with the same number of frames and with the same reliability as with a line pattern.



Figure 13: Binary (top) and Gray (bottom) bit plane sequences. Figure reproduced from Lanman & Taubin [10].

In the simplest scheme, each projector column is encoded by its column number written in binary, with the columns ordered from left to right. For example, column number 643 has the code 1010000011 (2^9 + 2^7 + 2^1 + 2^0 = 643) in the binary numeral system. Each digit in this sequence is captured from a single frame, where the projector illuminates the scene with a unique pattern, called a bit plane. The first four binary bit planes are shown in Figure 13. The projected bit planes consist of black and white fields. A pixel located in a white field will be decoded as 1, or as 0 if it lies in a black field. The first bit plane encodes the most significant digit in the bit sequence, and the following patterns encode the subsequent digits in descending order of significance. This way, 2^m different columns can be encoded by projecting m different patterns.
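As an illustration (a sketch, not the thesis code), the bit planes for a projector with 848 columns, matching the pixel count of the projector used in Section 3.1.3, can be generated as follows:

% Generating the binary bit-plane patterns for an 848 x 480 projector.
% ceil(log2(848)) = 10 bit planes are needed to encode all columns.
w = 848;  h = 480;
nbits = ceil(log2(w));
cols  = 0:w-1;                                  % column numbers
for k = 1:nbits
    bits    = bitget(cols, nbits - k + 1);      % k-th bit, most significant first
    pattern = repmat(uint8(bits)*255, h, 1);    % each column all-white (1) or all-black (0)
    % imwrite(pattern, sprintf('bitplane_%02d.png', k));  % save if desired
end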

An improvement of the simple binary codes is the Gray code, introduced by Frank Gray in 1947 [15]. A key property of the Gray code is that the code words of two neighboring columns differ by only one bit. As a result, the Gray code structured light sequence tends to be more robust to decoding errors than a simple binary encoding. Another important feature of the Gray code is that the projected lines will be wider than in the binary case, except for the two lines at the left and right edge. Yet, the same number of columns will be decoded. The last and finest bit plane pattern will have lines twice as wide as those of the traditional binary pattern, making them less sensitive to defocus and easier to distinguish. Also, if the projector pixel count is close to the camera pixel count, the captured image is less likely to suffer from aliasing effects.
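The conversion between binary and Gray codes is compact enough to show directly; the MATLAB sketch below (illustration only) encodes and decodes one column number.

% Binary-reflected Gray code for projector column numbers (sketch).
% Encoding: XOR the column number with itself shifted one bit to the right.
col  = 546;                               % example projector column
gray = bitxor(col, bitshift(col, -1));    % Gray code of the column
disp(dec2bin(gray, 10))                   % prints 1100110011

% Decoding: recover the binary bits one at a time from the Gray bits.
gbits = dec2bin(gray, 10) - '0';          % Gray code bits, MSB first
bbits = gbits;
for k = 2:numel(gbits)
    bbits(k) = xor(bbits(k-1), gbits(k)); % each bit = previous decoded bit XOR Gray bit
end
disp(bin2dec(char(bbits + '0')))          % prints 546 again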

Because of the distortion in the projector lens, the projected line pattern will be slightly warped. This means that the corresponding projector planes will also be warped. Since the triangulation assumes that the projected planes are perfectly planar, this will create an error in the triangulated points. This can be compensated for by pre-warping the projected patterns so that the projected planes become planar.

2.4.3 Depth of Field


For triangulation, an angle between the scanned surface and the optical axis of the camera and/or the projector needs to be created. If the camera has such an angle \varphi, the DoF will, if it is not large enough, cover only a part of the surface, assuming it is planar. If the width of the surface projected to the focal plane is w_o, the required DoF will be a fraction C of that, depending on \varphi, such that

D_{req} = C w_o    (29)

where C is independent of the magnification. Assuming that Equation 4 applies, the DoF can be expressed in terms of w_o by inserting m = w_i/w_o:

D \approx \frac{2Nc\, w_o(w_o + w_i)}{w_i^2}    (30)

The fraction D/D_{req} shows how much of the required DoF is attained:

\frac{D}{D_{req}} \approx \frac{2Nc(w_o + w_i)}{C w_i^2}    (31)

It is seen that this fraction decreases linearly with decreasing w_o.


3 Methods

3.1 Hardware

3.1.1 Computer

At the start of the project a workstation PC was assembled to meet the needs of the future computations. A high end GPU (nVidia GeForce GTX 480) was selected with the intention of letting it handle some of the heavy correspondence computations. Since the stereoscopy-based scanner was dropped in favor of a structured light scanner, the GPU was never used for that purpose. The visualization of the point clouds is however accelerated by the GPU. A high end CPU was also selected (Intel Core i7), to speed up general calculations. Because the chipset compatible with this CPU uses three channels of DDR3, three modules of 2 GB each were selected for a total of 6 GB RAM. To be able to use all of this memory, a 64-bit version of Windows 7 was selected. As it turned out, the peak memory usage of the system when the Matlab code is running is about 5 GB. This leaves 1 GB for other programs or additional Matlab variables.

3.1.2 Camera

A good camera for a 3D scanner should have a number of different features. Firstly, it should have a high pixel count sensor, which allows for a high measurement resolution in the point cloud. But pixels alone are not enough: the optics of the lens need to be equally good, so that fine details can be resolved. The focal length of the lens should be chosen so that the appropriate FoV at the chosen subject distance is attained. The focal length also affects the depth of field at moderate to long subject distances, as discussed in the theory section. The lens should be able to focus close enough so that the appropriate magnification is given. Also, the sensor size affects the depth of field, which should be as large as possible. A larger sensor requires bigger lenses, increasing the overall dimensions of the camera, which can be a disadvantage if a compact scanner is needed. Larger lenses are usually sharper than smaller ones. Larger sensors also generally have a higher pixel count than small sensors. Therefore, the sensor size should be chosen so that a good balance between the previous characteristics is reached.

Since the camera is supposed to be used in conjunction with a computer, the interface between the two is also important. It should be possible to trigger the camera from the computer and automatically transfer the images to the PC. It is also desirable to be able to change camera settings such as exposure, aperture and focus via the PC. If the camera is supposed to be built into a housing and fitted to a remote controlled robot, it is of paramount importance to be able to get a live view of the camera's perspective, so that the scanner can be positioned and oriented appropriately. Finally, the cost of the camera should not be too high either.


Figure 14: The camera (Canon 500D) used in the scanner. The used kit lens is also shown.

The camera has a live view feature, and can be connected to a PC via USB 2.0. This allows the live view feed to be displayed on the computer monitor in real time. Captured images are transferred to the PC in 1-2 seconds each. The PC needs to run a program that interfaces with the camera. For this purpose the DSLR Remote Pro Multi-Camera software was selected. Almost all settings on the camera can be controlled through this software, such as exposure, ISO value, aperture and focus. However, it is not possible to switch between manual and auto focus remotely, a feature that would have simplified the calibration and use of the final scanner. It is also not possible to change the zoom setting of the lens remotely, since the lens has no such motor, as is the case with most if not all DSLR lenses. Since it is a system camera, the lens can easily be changed to suit the needs of the scene. The camera uses the Canon EF-S mount, for which a wide variety of lenses are available on the market.

Another alternative would be to use a specialized machine vision camera. Some of them have an interface towards the computer that allows the video stream to be directly imported into Matlab. Being video cameras, they generally have a lower pixel count than still cameras. There are some high end machine vision cameras with a high pixel count, close to that of DSLRs. Since the bandwidth required to stream such high pixel count video is so large, the frame rate is usually lower, close to the sequential shooting frame rate of a DSLR. For the purpose of this 3D scanner, a fast scan time is not that important, so a low frame rate video camera would do. The cost of those professional high pixel count video cameras turned out to be much higher than that of consumer DSLRs, which explains why the latter was chosen.

3.1.3 Projector


Figure 15: Optoma PK301, the pico projector used.

Like the camera, the projector should have a high pixel count, which contributes to the measurement resolution. Also, the projector lens needs to be sharp so that the projected lines are clearly distinguished. Next, the projection surface area should match the surface area in view of the camera. Most conventional projectors have a quite big minimum projection size. The first projector that was used had a minimum projection size of about 40 x 30 cm. For some purposes this might be sufficient, while examination of small cracks requires a smaller projection area. It should be noted that the resolution of the scanner increases linearly with decreasing projection size, assuming the camera's FoV is filled with the projection area. Therefore, a projector with a smaller projection area was chosen later on. Unlike traditional projectors, pico projectors generally have a smaller minimum projection size. The outer dimensions of the projector itself are also much smaller. Typical pico projectors are not much bigger than a cell phone, which is useful if they need to be built into a small housing.

The projector finally chosen was an Optoma PK301 pico projector, shown in Figure 15. The minimum projection size is about 7 x 4 cm and the native pixel count is 848 x 480 pixels. This is a decrease from the office projector first tested, with a pixel count of 1024 x 768 pixels. But since the projection size is smaller, the highest possible resolution of the scanner is still better.

3.2 Tested 3D Scanners

3.2.1 Stereoscopy

At the beginning of the project, a simple test of ray-ray triangulation was first implemented in Matlab, to get familiar with a stereo vision scanner. This was done using the description in Lanman & Taubin [10]. Later, two images were captured with one camera at different viewpoints. Correspondences were gathered manually by selecting matching pixels in the images. Assuming an even spread of pixels in the camera’s FoV, i.e. a constant angle per pixel, 3D points could be triangulated. Since this assumption is quite far from the truth, the point cloud was distorted. The rough features of the scene were however easily discernible.


further. Two algorithms for automatic correspondence matching were tested in Matlab: the sum of absolute differences (SAD) and normalized cross correlation. It quickly became apparent that the matches found were not very reliable. Furthermore, because of imperfections in the calibration, the matches could deviate a bit from the epipolar line. In order not to miss those matches, the search had to be widened from a single pixel row to a box, which increases the computational burden further, by a factor equal to the height of the search box. It is also quite safe to assume that the height of this box would increase linearly with increasing image size. This means that the order of operations required in the correspondence search goes up from N^3 to N^4, where N × N is the image size in pixels. The time it took to find one pixel correspondence was about 0.003 seconds with the fastest algorithm, Matlab's built-in normalized cross correlation. If the full pixel count of the camera is to be used, the total time required would be about 0.003 × 15 × 10^6 = 45 000 s = 12.5 h, i.e. impractically slow. If the algorithm had been accelerated on the GPU, it is possible that this computation time could be reduced by up to a few hundred times, bringing it down to a few minutes. But since the normalized cross correlation is implemented in C code in Matlab, it is already quite fast.
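
To give a concrete idea of such a search, the sketch below matches a small block around one pixel in the left image against a band of rows around the corresponding epipolar line in the right image, using Matlab's built-in normalized cross correlation (normxcorr2, Image Processing Toolbox). The image files, the pixel position and the band size are assumed example values, not the ones used in the project.

    % Hypothetical block matching along an epipolar band with normxcorr2.
    imL = rgb2gray(imread('left.jpg'));     % assumed stereo pair
    imR = rgb2gray(imread('right.jpg'));

    rL = 1200; cL = 800;     % pixel in the left image to match (assumed)
    half = 7;                % template is (2*half+1) x (2*half+1) pixels
    band = 10;               % half-height of the search band around the epipolar row

    tmpl  = double(imL(rL-half:rL+half, cL-half:cL+half));
    strip = double(imR(rL-band-half:rL+band+half, :));   % rows around the epipolar line

    C = normxcorr2(tmpl, strip);
    [~, idx] = max(C(:));
    [rPeak, cPeak] = ind2sub(size(C), idx);

    % Convert the correlation peak back to pixel coordinates in the right image.
    rR = (rPeak - half) + (rL - band - half) - 1;
    cR =  cPeak - half;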

A GPU-accelerated implementation in Matlab using the available toolboxes would quite possibly not be fast enough. Instead, an implementation in a fast low-level language such as C could be needed, something that would require much more work than an implementation in Matlab. This problem, together with the unreliable matches of the algorithm, motivated the decision to leave the stereoscopy-based scanner in favor of an active technique.

3.2.2 Swept-Plane Scanner

The web page of the SIGGRAPH 2009 conference course Build Your Own 3D Scanner: Optical Triangulation for Beginners by Lanman & Taubin [9] turned out to be a good resource for learning the basics of active 3D scanners. The web page provides source code for a Matlab implementation of a structured light 3D scanner as well as an implementation of a swept-plane scanner. As the latter seemed easier to get started with, it was tested first. The idea is to sweep the shadow of a stick over an object and record it with a camera. The shadow is divided into two planes, one from the leading edge and one from the trailing edge of the shadow. Both planes are used in the triangulation to filter out noise. To perform ray-plane intersection, it is required to know the equation of the rays from the camera as well as that of the shadow planes at each video frame. The camera rays are acquired by camera calibration using the Camera Calibration Toolbox. The equation of the shadow planes is acquired from two reference planes behind the scanned object. The angle between the reference planes is known and set to 90°. The planes are also fitted with four calibration marks each, from which the equation of each plane can be determined through a calibration. When the shadow sweeps over the scene, it is possible to extract the shadow lines created at the intersection of the shadow planes and the reference planes. From these lines, the equation of the shadow planes at each frame can be obtained.
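
The triangulation itself boils down to intersecting a camera ray with a shadow plane. A generic Matlab sketch of this step is shown below; the camera center c, the ray direction d and the plane coefficients n, w are assumed example values, not those of the actual setup.

    % Ray-plane intersection sketch: intersect the ray p(t) = c + t*d
    % with the plane n'*x + w = 0.
    c = [0; 0; 0];            % camera center (assumed)
    d = [0.05; -0.02; 1.00];  % ray direction through a pixel (assumed)
    n = [0; 0.5; -1];         % shadow plane normal (assumed)
    w = 800;                  % plane offset (assumed)

    denom = n' * d;           % should be checked against zero (ray parallel to plane)
    t = -(n' * c + w) / denom;
    P = c + t * d;            % triangulated 3D point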


point cloud also looked reasonably good. But since the laser plane was swept manually, this led to an uneven and quite sparse distribution of laser plane lines. For the resolution in the sweeping direction (x) to be the same as in the plane direction (y), the laser needs to be swept at a speed of 1 pixel per frame. This is not easy to do by hand, especially if the video pixel count is reasonably high.

3.2.3 Structured Light Implementation

After the swept-plane scanner had been tested, the Matlab code of the structured light 3D scanner from Lanman & Taubin's web page [9] was tested. Since the code was designed to work with machine vision cameras directly interfacing with Matlab, the image acquisition part had to be replaced. Following the calibration instructions described in Lanman & Taubin [10], problems were encountered at the projector calibration: the calibration procedure crashed due to bad input data. Having discussed the issue with the author Douglas Lanman, it was clear that the current implementation had some issues, as this seemed to be a quite common problem among other users. Instead, the projector calibration procedure was replaced with code from the separate projector calibration project procamcalib [6]. It is based on the Camera Calibration Toolbox, but better adapted to projector calibration. The calibration procedure is described in the subsection below. This scanner does not implement pre-warping of the projected patterns, since resampling the patterns would upset the pixel mapping and blur the projected lines, which decreases the contrast in the pattern and impairs the decoding of the projector columns.

The projector is connected to the PC through a VGA cable and is detected as a secondary display by the operating system. To project the required images during a scan, the bit planes are displayed in a full-screen image viewer on the secondary display. Since the pico projector has 848 pixel columns, 10 different bit-plane patterns are required, because 2^9 = 512 is not enough, while 2^10 = 1024 > 848. The scanner is implemented such that each of the bit planes is also displayed with inverted intensity values, i.e. black is displayed as white and vice versa. This is done to more easily be able to determine if a pixel should be set to 1 or 0 in the Gray-code sequence. Since some parts of the scene may receive indirect illumination reflected from directly illuminated parts, a single fixed threshold would result in decoding artifacts. Using an image capture of the original bit plane, A, and an image capture of the inverse, B, per-pixel intensity thresholds can be used to determine if a pixel should be set to 1 or 0. The images A and B are first converted to gray-scale images. If the pixel intensity in A is higher than that of the corresponding pixel in B, it is decoded as 1, and as 0 otherwise. Pixels with a small difference |A - B| are more likely to be erroneously decoded, so pixels with an absolute difference lower than a constant minContrast are filtered out.
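
The Matlab sketch below illustrates the principle: it generates the ten Gray-code bit planes for an 848-column projector and decodes one bit from a captured pattern/inverse pair using the per-pixel threshold described above. It is a simplified illustration, not the project code; the file names and the minContrast value are assumed.

    % --- Generate Gray-code bit-plane patterns for an 848 x 480 projector ---
    cols = 848; rows = 480; nBits = 10;
    colIdx  = uint16(0:cols-1);
    grayIdx = bitxor(colIdx, bitshift(colIdx, -1));    % binary -> Gray code
    for b = 1:nBits
        bitRow  = bitget(grayIdx, nBits - b + 1);      % most significant bit first
        pattern = repmat(uint8(bitRow) * 255, rows, 1);
        imwrite(pattern,       sprintf('pattern_%02d.png', b));     % bit plane
        imwrite(255 - pattern, sprintf('pattern_%02d_inv.png', b)); % inverted bit plane
    end

    % --- Decode one bit from a captured pair (assumed file names) ---
    minContrast = 10;                                   % assumed constant
    A = double(rgb2gray(imread('capture_01.jpg')));     % original bit plane
    B = double(rgb2gray(imread('capture_01_inv.jpg'))); % inverted bit plane
    bit   = A > B;                                      % per-pixel threshold
    valid = abs(A - B) >= minContrast;                  % mask out unreliable pixels

Repeating the decoding for all ten pairs gives a 10-bit Gray code per camera pixel, which is then converted back to an ordinary binary column index.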


Figure 16: A set of pictures captured using the projected Gray code bit planes.

folder beforehand. During the scan, the image capture software needs to be triggered to take a picture, followed by a change of projector pattern and a new camera trigger, and so on. The DSLR Remote Pro software has no native support for scripting, which makes automation trickier. The solution was to use a program called AutoHotKey, which makes it possible to write scripts that execute keyboard commands or mouse clicks in a specific order. A script was written that alternates between pressing F8 to trigger the camera and pressing the right mouse button to change the displayed projector pattern. In order for the captured image to be transferred from the camera to the PC, a time delay T_d is added after the camera shutter release. If the delay is too short, the camera will not have time to capture the next image, resulting in a missed projector pattern. A too long delay means an unnecessarily long scan time. For exposure times shorter than about 0.1 s, a delay of T_d,min = 2.5 s is the shortest possible. For longer exposure times, the delay must be set to T_d = T_d,min + T_e, where T_e is the exposure time, since the exposure itself takes up part of the delay.



Figure 17: A scanned scene (a), and its decoded projector columns (b) using the images in Figure 16.


Figure 18: A small cut-out of Figure 17b. The image in (a) has not been filtered, so there are wide fields of identical column values; (b) shows the image after filtering.


3.3 Calibration

Camera calibration is done in Matlab using the Camera Calibration Toolbox [2]. By taking pictures of a flat chessboard pattern from different angles, the intrinsic and extrinsic parameters can be estimated. The user is required to input the size of the squares in the pattern. This information, together with the assumption that the pattern is planar, makes finding the parameters a uniquely determined mathematical problem. The parameters can be determined more accurately by capturing more pictures of the pattern; it is recommended to use 15-20 pictures in the calibration, as more than 20 pictures seem to yield marginally better results, if any. To perform a calibration, the user has to mark the four outermost corners of the chessboard pattern in each picture. It is not necessary to pinpoint the corners exactly, because the locations are refined by an algorithm that uses the properties of the pattern. As long as the corner is inside a user-specified search box, the correct location should be found. Once the four corners have been marked, the program will try to find the locations of the other square intersections inside the box bounded by the four outer corners. The procedure is illustrated in Figures 19-21.

The actual calibration is implemented as a non-linear optimization problem, where the goal is to find the calibration parameters that minimize an error function. The error can be defined as the distance from the previously estimated corner locations to the reprojected locations obtained from a model described by the calibration parameters. This final error value, measured as the average pixel distance between the estimated and the reprojected corners, is presented to the user when the main calibration procedure is completed, and can be used to judge how good the calibration was. It is also possible to get a 3D plot of the extrinsic variables, giving a visual presentation that can be compared to the actual setup. The distortions can also be visualized using vector field plots.
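
As an illustration of what such a reprojection error means for a plain pinhole model (lens distortion ignored), consider the Matlab sketch below. The intrinsic matrix K, the pose R, t, the chessboard geometry and the detected corners xDet are all placeholder assumptions, not values from the actual calibration.

    % Reprojection error sketch for a pinhole camera model (no distortion).
    K = [2000 0 1500; 0 2000 1000; 0 0 1];       % assumed intrinsic matrix
    R = eye(3);  t = [0; 0; 500];                % assumed extrinsic pose
    [u, v] = meshgrid(0:7, 0:5);                 % 8 x 6 corner grid, 30 mm squares
    X = [30*u(:)'; 30*v(:)'; zeros(1, numel(u))];
    xDet = 1500 + 100*rand(2, size(X, 2));       % stand-in for detected corners

    nPts  = size(X, 2);
    xCam  = K * (R * X + repmat(t, 1, nPts));    % project to the image plane
    xProj = xCam(1:2, :) ./ repmat(xCam(3, :), 2, 1);  % divide by depth

    err = mean(sqrt(sum((xProj - xDet).^2, 1)));  % mean reprojection error [pixels]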

The camera calibration was used for the first two scanners, i.e. the stereoscopic and the swept-plane scanner, as well as for the structured light scanner. The latter also requires projector calibration. First the camera is calibrated. The projector can then be calibrated using pictures of a chessboard pattern projected onto a white plane, in a similar way as the camera. Since the camera distortions have already been removed, the distortions seen in the projected pattern must come from the lens of the projector. To save time during the acquisition of the calibration images, both the projected pattern and the physical pattern are captured at once, positioned side by side on the same plane. That way it is sufficient to capture 15-20 images and not twice as many.

3.4 Underwater Scanning

3.4.1 Experiments


Figure 19: The user selects the four corners in the physical calibration pattern.

Figure 20: The estimated intersections of the squares in the pattern (red) and their associated search boxes (blue).


The projector had its optical axis orthogonal to the wall, while the camera's optical axis was at an oblique angle to the wall to provide a sufficiently large triangulation angle. Let this setup be called UW_oblique for future reference.

To keep the camera's optical axis orthogonal to the water interface, a casing for the lens was constructed. The front end of the casing had a flat glass port, allowing the camera to look through it. The back end was open, so that the lens could be inserted into the casing. By placing the camera just above the water surface, only the lens casing was submerged in the water. The projector, on the other hand, was oriented looking straight down through the water. To remove ripples on the water surface, a Plexiglass plate was placed on top of the water, beneath the projector. This setup is called UW. The same setup was also used to make scans without water; in this case, the calibration was made above water. This version of the setup is called OW.

3.4.2 Underwater Housing

At the end of the project an underwater housing was constructed, allowing the scanner to be fully submerged in water and to function in a more realistic environment. Separate housings were used for the projector and the camera, with the optical axes positioned orthogonal to the glass ports. The two housings were attached to a common rig and rotated slightly in relation to each other to provide an overlapping field of view and a sufficiently long baseline. The projector housing has two watertight lead-throughs: one for the VGA cable and one for the power cable. The camera housing has a lead-through for the USB cable. At the time of writing, the power cable has not yet arrived, so the camera has to be powered by battery. Photographs of the housing can be seen in Figure 22, as well as in Figure 23, where the housing is submerged and the projector is running.

The performance of the scanner turned out to be similar to that obtained with the water tank, as expected. It is, however, difficult not to disturb the calibration when the camera has to be taken out of its housing to charge the batteries. Therefore, any measurement of scanner accuracy with the housing has been postponed.

3.5 Visualization

As mentioned, the output of the 3D scanner is a large set of 3D points, called a point cloud. This point cloud can be quite hard to visualize effectively if left unprocessed. If the points are to be visualized without processing, they need to be viewed such that the density of points is neither too large nor too small. With a too high density, the points will appear as a uniformly textured and colored block, where only the edges of the point cloud are detectable. With a too low density, the gaps between the points become too large for the mind to perceive them as a surface. With the right density, however, the varying surface normal will be clearly visible as either an increase or decrease of point density, giving information about the surface depth and orientation.
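
The effect of point density can be explored with a few lines of Matlab, as sketched below. The variable points is an assumed N x 3 matrix of triangulated coordinates, and the subsampling fraction is an arbitrary example value.

    % Point cloud visualization sketch with adjustable point density.
    points = randn(200000, 3);               % stand-in data for the example

    frac = 0.05;                             % fraction of points to keep
    keep = rand(size(points, 1), 1) < frac;  % random subsampling mask
    p = points(keep, :);

    scatter3(p(:,1), p(:,2), p(:,3), 1, p(:,3), '.');  % color by depth (z)
    axis equal; grid on;
    xlabel('x'); ylabel('y'); zlabel('z');

Lowering frac thins out the cloud, while a value close to 1 shows all points.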


