Master of Science Thesis in Computer Science
Department of Electrical Engineering, Linköping University, 2020
A Comparison of Parallel Algorithms for Calculating the Anti-Aliased Euclidean Distance Transform
Daniel Eriksson LiTH-ISY-EX--19/5269--SE
Supervisor: Stefan Gustavson
isn, Linköping University
Åsa Detterfelt
MindRoad AB
Examiner: Ingemar Ragnemalm
isy, Linköping University
Division of Information Coding, Department of Electrical Engineering
Linköping University, SE-581 83 Linköping, Sweden
Copyright © 2020 Daniel Eriksson
Abstract
This thesis presents a comparison of three different parallel algorithms, adapted to calculate the anti-aliased euclidean distance transform. They were originally designed to calculate the binary euclidean distance transform. The three algorithms are SKW, the Jump Flooding Algorithm (JFA), and the Parallel Banding Algorithm (PBA). The results presented here show that the two simpler algorithms, SKW and JFA, can easily be adapted to calculate the anti-aliased transform rather than the binary transform. These two algorithms show good performance in regards to accuracy and precision. The more complex algorithm, PBA, is not as easily adapted. The design of this algorithm is based on some assumptions about the binary transform, which do not hold true in the case of the anti-aliased transform. Because of this, the algorithm does not produce a transform with the same level of accuracy as the other algorithms.
Acknowledgments
I want to thank my supervisors Åsa and Stefan, for their constructive feedback on this thesis. I would like to thank my examiner Ingemar, for helping me formulate the problem at the center of this thesis. I would also like to thank my colleagues at the office for keeping me just the right amount of distracted for the duration of this project.
Linköping, October 2019 Daniel Eriksson
Contents
List of Figures
Notation
1 Introduction
1.1 Aim & Motivation
1.2 Research Questions
1.3 Delimitations
2 Theory
2.1 Euclidean distance transform
2.2 Anti-Aliased Euclidean distance transform
2.3 Parallel Banding Algorithm
2.4 SKW
2.5 Jump Flooding Algorithm
2.6 Other Parallel Algorithms
3 Method
3.1 Implementation
3.1.1 Parallel Banding Algorithm
3.1.2 SKW
3.1.3 Jump Flooding Algorithm
3.2 Evaluation
3.2.1 Time Complexity & Execution Time
3.2.2 Error in Distance Approximations
4 Results
4.1 Time Complexity & Execution Time
4.2 Distance Error
5 Discussion
5.1 Implementation-Specific Problems
5.2 PBA-Specific Problems
5.3 Errors in Data Collection
5.3.1 Execution Time
5.3.2 Input Data
5.4 Evaluation of Algorithms
5.4.1 PBA
5.4.2 SKW
5.4.3 JFA
6 Conclusions
6.1 Research Questions
6.2 Future Work
A Helper Functions
B SKW Code
C JFA Code
D PBA Code
Bibliography
List of Figures
2.1 Demonstration of pixel domination
2.2 SKW vector template
2.3 JFA vector template
3.1 Edge-point approximation
3.2 Input image
4.1 Execution times
4.2 Average distance errors
4.3 Maximum distance errors
4.4 Proportion of correct pixels
4.5 SKW and JFA error image
4.6 PBA error image
4.7 Error of PBA for different values of m3
5.1 Zoomed PBA error image
Notation
Abbreviations
Abbreviation Meaning
EDT Euclidean Distance Transform
AAEDT Anti-Aliased Euclidean Distance Transform
JFA Jump Flooding Algorithm
PBA Parallel Banding Algorithm
1 Introduction
The Euclidean Distance Transform (EDT) is a fundamental operation in image processing. Algorithms to produce this transform have been widely studied and applied to problems in fields such as medicine [3, 13], computer graphics rendering [6, 11], and robot navigation [9].
Nysjö et al. [13] used EDT to create 3-dimensional models of surgical guides and other tools used in surgery. This was done by applying the transform to CT images and creating a model which follows and fits the anatomy of the patient. Angayarkanni & Kumar [3] used the transform in a process to analyze mammograms and identify cancer cells.
Macedo and Apolinário [11] used the transform to create realistic shadows in real-time computer simulations. Their method created shadows with fewer artifacts than other methods, and with good performance. Green [6] has presented a method, using the EDT of a texture, to recreate a high-resolution texture while requiring less texture memory.
Juelg et al. [9] applied EDT to a voxel representation of an environment. The results were used by a robot to avoid obstacles and navigate towards a goal in the environment.
Another transform which performs a similar operation has been proposed in a paper by Gustavson & Strand [7], in which they show that their transform, the Anti-Aliased Euclidean Distance Transform (AAEDT), has some advantages over the regular EDT.
1.1 Aim & Motivation
Given the advantages AAEDT presents over EDT, it would be beneficial in many cases to use the former over the latter. However, due to the abundance of well-established and highly polished algorithms and implementations of EDT, and
the lack of such algorithms for AAEDT, this is not commonplace. There is also a broad set of areas of application [5] in which such a change could improve the performance of applications.
The aim of this thesis is to research what parallel methods and algorithms, commonly used to calculate EDTs of images, are suitable for adaptation to calculate AAEDTs of images. Specifically, parallel algorithms designed for graphical processing units are explored.
1.2 Research Questions
1. Are there any significant differences between EDT and AAEDT which affect parallelizability?
2. How do the chosen algorithms perform in terms of execution time and accuracy, when calculating AAEDT?
1.3 Delimitations
Three different algorithms are the focus of this thesis. They are: Parallel Banding Algorithm (PBA), Jump Flooding Algorithm (JFA), and SKW.
The distance function presented by Gustavson & Strand [7] for AAEDT is only defined for 2-dimensional images. Deriving a function for higher-dimensional images is complex. Because of this, only 2-dimensional images were used for testing the chosen algorithms.
2 Theory
This chapter outlines the transforms related to this thesis, the Euclidean distance transform and the Anti-Aliased Euclidean distance transform, as well as the algorithms which are the main focus of this report.
2.1 Euclidean distance transform
The EDT is a transform which takes a binary image as input and outputs an image of euclidean distances. The binary input image consists of pixels with a value of ai ∈ {0, 1}. Throughout this paper, the term feature pixel refers to a pixel in an input image with a value of ai = 1. Non-feature pixels are called background pixels and have a value of ai = 0.
The output of the EDT is an image where each pixel is assigned a value equal to the euclidean distance from the corresponding pixel in the input image to the closest feature pixel in the input image.
There are two main categories of algorithms to calculate EDTs: approximate and exact. An exact algorithm guarantees the correct output for any and all input images. Approximate solutions are often easier to implement, but do not guarantee the same accuracy as an exact solution. While approximate algorithms do not give the correct output for all inputs, they do produce correct output for some inputs and close-to-correct output, within certain bounds of error, for others.
Related to the EDT are Voronoi diagrams, which rather than returning the distance to the closest feature pixel give the index of the closest feature pixel. Given the Voronoi diagram of an image, calculating the EDT is a trivial problem. Because of this, some methods of calculating the EDT first calculate the Voronoi diagram and use it to calculate the EDT.
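The step from Voronoi diagram to EDT can be sketched as follows. This is a minimal sequential sketch; the function name and data layout are illustrative, not from the thesis:

```cpp
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// Given a Voronoi diagram (for each pixel, the coordinates of its closest
// feature pixel), the EDT follows directly: each output value is the
// euclidean distance from the pixel to its stored closest feature pixel.
std::vector<float> edtFromVoronoi(
    const std::vector<std::pair<int, int>>& voronoi, int width, int height) {
    std::vector<float> edt(width * height);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            const auto& f = voronoi[y * width + x];
            float dx = static_cast<float>(x - f.first);
            float dy = static_cast<float>(y - f.second);
            edt[y * width + x] = std::sqrt(dx * dx + dy * dy);
        }
    }
    return edt;
}
```

Each output pixel is independent of the others, which is why this final step parallelizes trivially on a GPU.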
The algorithms described in this chapter were all devised to calculate the EDT and Voronoi diagrams of binary images.
2.2 Anti-Aliased Euclidean distance transform
The AAEDT is a modified version of the binary EDT, proposed by Gustavson & Strand [7]. This transform takes an input image consisting of pixels with values ai ∈ [0, 1]. While the domain of this transform is a superset of the domain of the binary transform, it does not produce the same mapping as the binary transform for that specific subset of its domain. When referring to the input of this transform, a feature pixel is any pixel with a value ai ≠ 0.
Given a supersampled, anti-aliased image, the AAEDT returns an image similar to that of the binary transform. While the binary transform returns (for each pixel) the distance to the closest feature pixel, the AAEDT returns (for each pixel) the approximated distance to the closest ideal edge of a feature object in the original image which was sampled.
A new category of feature pixels is edge pixels, which consists of all feature pixels with a value ai ≠ 1. These pixels lie partially on a feature object in the original image and therefore lie on an edge, hence the name.
While plenty of work has been done on parallel implementations of the binary EDT, not much has been done in the realm of parallelizing the AAEDT. Gustavson & Strand [7] presented an implementation using raster scanning, and Linnér & Strand [10] presented an implementation based on wave front propagation. As this thesis is focused on parallelizing the AAEDT, the latter method is, however, of little interest, as it requires arbitrary memory accesses.
2.3 Parallel Banding Algorithm
PBA is a parallel version of earlier sequential methods for calculating the exact EDT, using Voronoi region intersections to calculate the transform. The algorithm was proposed by Cao et al. [18]. It is stated to be based on an algorithm by Hayashi et al. [8], but with a focus on efficient use of parallelization. The algorithm is based on concepts discussed by Saito & Toriwaki [16], as well as Maurer et al. [12].
The algorithm consists of three separate phases. The first phase calculates the one-dimensional Voronoi diagram of the image: for each pixel, it finds the closest feature pixel that lies on the same row of pixels. The second phase creates a stack for each column which exclusively holds references to all feature pixels whose Voronoi regions intersect the column in question. A feature pixel's Voronoi region is the set of points whose closest feature pixel is the pixel in question. The third phase uses the stack created in the second phase to create the Voronoi diagram and EDT by traversing each pixel in a column, looking at the top two pixels on the stack and popping the stack whenever the current pixel is no longer in the Voronoi region of the top pixel.
In order to maximize parallelism each phase implements a parallel divide-and-conquer approach by dividing the image into smaller parts, that are worked
on in parallel, and the results from each part are then combined. The partitioning of the image resembles strips or bands, giving the algorithm its name. For the first phase the bands are vertical, consisting of a number of columns. For the second and third phases they are horizontal, consisting of a number of rows.
In the first phase, for each band in all rows in parallel, the closest feature pixel on the same row and band is found for each pixel. This is done using two simple sweeps, one from left to right and one from right to left, updating the closest feature pixel using the closest one found so far and the closest pixel of the previous pixel in the sweep. The result is then distributed among all outermost pixels of all bands in the same row using a parallel prefix algorithm, updating the closest feature pixels of these outermost pixels. Finally, all other pixels in the band are updated based on the already updated outermost pixels.
Figure 2.1: Pixel b being dominated by pixels a and c.
The second phase creates a stack, for each column, of all proximate feature pixels, where a proximate feature pixel is a feature pixel whose Voronoi region is intersected by the column in question. This is again done separately for each band. A Voronoi region is a convex polygon whose edges are the perpendicular bisectors between its corresponding feature pixel and the feature pixels of adjacent Voronoi regions. Consider then three different feature pixels a, b, c on rows ja > jb > jc, all closest feature pixels to some pixel in column i. If the perpendicular bisectors between a & b and b & c intersect before reaching column i, the Voronoi region of b is not intersected by column i and no pixel in that column belongs to the Voronoi region of b. In the original paper by Cao et al. this is called b being dominated by a and c, and it is demonstrated in figure 2.1. This fact is used when creating the stack: each feature pixel in the column is considered in ascending order and pushed onto the stack. If the top feature pixel on the stack is dominated by the newly considered feature pixel and the second pixel on the stack, the top pixel is popped off the stack. A newly considered pixel is only pushed onto the stack once no more pixels are dominated or there is only one element on the stack. Once the stacks for each column in each band are created, they are merged two by two in each column by pushing the bottom element of one stack onto the top of the stack in the band below, removing any dominated feature pixels in the process. This merger is easily done if the stacks are implemented as doubly linked lists. The stacks are merged until each column has only one single coherent stack.
The third and final phase of the algorithm takes the stack from the second phase and, for each pixel in the same band and column in parallel, assigns it a closest feature pixel by looking at the top two pixels on its column's stack, and popping the stack if the second element on the stack is closer. Once the last pixel in the band and column has found its actual closest feature pixel, the same process is repeated for the next band.
For the first two phases, the number of bands is determined by the parameters m1 and m2, respectively. The number of pixels to handle in parallel in the third phase is determined by the parameter m3. The total work performed by this algorithm is O(n²) for an image of size n × n pixels. Because of this optimal work and focus on parallelism, the algorithm performs well when compared to other algorithms [9, 18]. In order to achieve this high performance, the values of the three parameters m1, m2, and m3 have to be tuned to the size of the input image and the hardware.
2.4 SKW
Schneider et al. [17] have presented a parallel algorithm for calculating the approximate EDT and Voronoi diagram, designed for implementation on GPUs. The algorithm is based on an earlier sequential algorithm using vector propagation to calculate the transform. While it is commonly referred to as SKW after the authors' initials, the algorithm was earlier proposed by Ragnemalm [14] under the name 4H-8SSED/SIMD.
Figure 2.2: Vector templates for SKW, with relative coordinates. The four templates are {(-1,-1), (-1,0), (-1,1), (0,0)}, {(1,-1), (1,0), (1,1), (0,0)}, {(-1,-1), (0,-1), (1,-1), (0,0)}, and {(-1,1), (0,1), (1,1), (0,0)}.
The algorithm performs four scans: right, left, up and down, each using one of the templates in figure 2.2, where the currently considered pixel's closest feature pixel is updated, taking into consideration all closest feature pixels of the pixels in the template. The first two scans apply their vector templates to each pixel in every column, starting with the leftmost and rightmost column respectively. The third and fourth scans similarly apply their vector templates to each pixel in all rows.
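The four scans can be sketched as a single-threaded reference (in the GPU version, all pixels of a column or row within a scan step are processed in parallel; names and data layout here are illustrative):

```cpp
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// Per pixel, the closest feature pixel found so far; (-1,-1) means "none
// yet". Feature pixels are initialized to their own coordinates.
struct Closest { int x = -1, y = -1; };

static float dist2(int x, int y, Closest c) {
    if (c.x < 0) return std::numeric_limits<float>::max();
    float dx = float(x - c.x), dy = float(y - c.y);
    return dx * dx + dy * dy;
}

void skwScans(std::vector<Closest>& v, int w, int h) {
    // The three template offsets per scan (the (0,0) entry is implicit,
    // since the pixel's own candidate is always kept as a baseline).
    const int offs[4][3][2] = {
        {{-1, -1}, {-1, 0}, {-1, 1}},   // left-to-right: column to the left
        {{ 1, -1}, { 1, 0}, { 1, 1}},   // right-to-left: column to the right
        {{-1, -1}, { 0, -1}, { 1, -1}}, // top-to-bottom: row above
        {{-1,  1}, { 0,  1}, { 1,  1}}  // bottom-to-top: row below
    };
    for (int s = 0; s < 4; ++s) {
        for (int i = 0; i < w * h; ++i) {
            int j = (s == 1 || s == 3) ? (w * h - 1 - i) : i;  // scan direction
            int x, y;
            if (s < 2) { x = j / h; y = j % h; }  // column by column
            else       { x = j % w; y = j / w; }  // row by row
            Closest best = v[y * w + x];
            for (auto& o : offs[s]) {
                int nx = x + o[0], ny = y + o[1];
                if (nx < 0 || nx >= w || ny < 0 || ny >= h) continue;
                Closest cand = v[ny * w + nx];
                if (dist2(x, y, cand) < dist2(x, y, best)) best = cand;
            }
            v[y * w + x] = best;
        }
    }
}
```

For the binary transform, dist2 compares plain squared euclidean distances; the AAEDT adaptation in chapter 3 swaps this comparison for the anti-aliased distance measure.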
While this algorithm, just like PBA, performs O(n²) work, it does not perform as well when it comes to execution time, given the sequential nature of the calculations performed by the algorithm.
2.5 Jump Flooding Algorithm
A technique for parallel calculation of the EDT was proposed by Danielsson [4] and much later presented by Rong & Tan [15] under the name Jump Flooding Algorithm. The algorithm performs a number of steps equal to the base-2 logarithm of the side length of the image. In each step, the vector template in figure 2.3 is applied to every pixel in the image, and their closest pixels are propagated throughout the grid. At the start, the size of the template is such that the shortest distance between pixels (denoted s in figure 2.3) is equal to half the side length of the image. In every step this distance is halved until it is equal to one.
Figure 2.3: Vector template for JFA, with relative coordinates {(-s,-s), (-s,0), (-s,s), (0,-s), (0,0), (0,s), (s,-s), (s,0), (s,s)}.
Since reads and writes, to and from the memory containing the intermediate Voronoi diagram, are performed in parallel, two buffers can be used to avoid race conditions. One buffer acts as input and the other as output for each step of the algorithm; these roles are switched between the buffers after every step to ensure that the latest data acts as input to the next iteration.
This algorithm performs O(n² log n) work for images of size n × n. While it is not linear like SKW, this algorithm parallelizes its work more efficiently and does therefore, in the vast majority of cases, outperform SKW in terms of execution time. It is most efficient for small image sizes, but is outperformed by both PBA and SKW for larger images [18].
2.6 Other Parallel Algorithms
All of the algorithms discussed above in some way do or can use the Voronoi diagram to calculate the EDT of an image. There are other algorithms that do not rely on this intermediate representation but rather calculate the EDT purely numerically; one such implementation has been presented by Zampirolli & Filipe [2]. These implementations can be very efficient with little overhead, but they are not suitable for adaptation to AAEDT. This is because the calculations in AAEDT are dependent on the direction of vectors between pixels, as well as the value of the feature pixel, which requires the exact coordinates of the pixel to be known; this is precisely the information stored in a Voronoi diagram. Given the coordinates of all closest feature pixels, the distance to them can be calculated for each pixel using the distance function defined by Gustavson & Strand [7]. An implementation of this function can be found in appendix A.
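For reference, the per-pixel distance to the ideal edge can be sketched following the case analysis published by Gustavson & Strand [7]. The thesis's actual implementation is in appendix A, so the structure below is an approximation of that function, not a reproduction; gx, gy is the local gradient and a the pixel's coverage value:

```cpp
#include <cassert>
#include <cmath>

// Approximate signed distance from a pixel centre to the ideal edge through
// the pixel, given the local gradient (gx, gy) and coverage a in [0, 1].
// Three cases, depending on where the straight edge cuts the pixel.
double edgedf(double gx, double gy, double a) {
    if (gx == 0.0 || gy == 0.0)  // axis-aligned edge: linear in coverage
        return 0.5 - a;
    double len = std::sqrt(gx * gx + gy * gy);
    gx = std::fabs(gx / len);
    gy = std::fabs(gy / len);
    if (gx < gy) std::swap(gx, gy);  // normalize so gx >= gy >= 0
    double a1 = 0.5 * gy / gx;       // coverage where the edge cuts a corner
    if (a < a1)                      // edge clips one corner triangle
        return 0.5 * (gx + gy) - std::sqrt(2.0 * gx * gy * a);
    if (a < 1.0 - a1)                // edge crosses two opposite sides
        return (0.5 - a) * gx;
    return -0.5 * (gx + gy) + std::sqrt(2.0 * gx * gy * (1.0 - a));
}
```

The total anti-aliased distance for a pixel then combines the vector to the closest edge pixel with this within-pixel edge offset, which is why the algorithms must propagate coordinates rather than plain distances.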
3 Method
This chapter describes how this thesis was conducted. The modifications to the chosen algorithms are presented, as well as how these algorithms were evaluated when applied to the problem of calculating the AAEDT.
3.1 Implementation
The changes required to implement the algorithms to calculate the AAEDT instead of the binary EDT are described below.
All implementations were coded in C++, using CUDA runtime 10.1 and Microsoft Visual Studio. All algorithms had in common that they now required an extra initialization of the image's gradient, which is used by the new distance function for AAEDT, found in appendix A. The focus when creating the implementations was readability and correctness, meaning that there is known room for improvement in terms of performance.
3.1.1 Parallel Banding Algorithm
Implementing the first and third phases of PBA required little modification of the original algorithm for the binary transform. All that was required was to use the new distance measure for AAEDT. The second phase needed to be adapted slightly. In order to determine whether or not a pixel is dominated by two other pixels, the original measure using the center point of each pixel is insufficient. A correct method of determining whether a pixel is dominated or not would require knowledge of where the ideal edge falls on the pixel. While this is hard to calculate exactly, an approximate location of the edge can be calculated using the local gradient of the image for the feature pixel (g) and the distance to the edge from the pixel center (df); these measurements are shown in figure 3.1.
Figure 3.1: New point a' for determining domination.
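This edge-point approximation can be sketched as follows. The sketch is hypothetical: the direction convention, moving from the pixel centre along the normalised gradient by df, is an assumption and may differ in sign from the thesis's implementation:

```cpp
#include <cassert>
#include <cmath>

struct Vec2 { double x, y; };

// Replacement point a' for the domination test: the pixel centre a moved
// along the (normalised) local gradient g by the estimated distance df to
// the ideal edge. Sign convention assumed, not taken from the thesis.
Vec2 edgePoint(Vec2 a, Vec2 g, double df) {
    double len = std::sqrt(g.x * g.x + g.y * g.y);
    if (len == 0.0) return a;  // no gradient information: keep the centre
    return { a.x + df * g.x / len, a.y + df * g.y / len };
}
```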
3.1.2 SKW
No major modifications were required for this algorithm to work for AAEDT instead of binary EDT. The Voronoi diagram is created in the same way as in the original algorithm, only now it utilizes the new distance measure to determine which feature pixels are closer.
3.1.3 Jump Flooding Algorithm
Just like with SKW, adapting this algorithm to work for AAEDT instead of binary EDT was more or less trivial. Simply replacing the distance measure with the one for anti-aliased images was sufficient to produce the Voronoi diagram and distance transform.
3.2 Evaluation
The three algorithms were evaluated based on the two categories of metrics described in this section: time complexity and execution time, as well as errors in distance approximations. All tests were run on the same hardware, an Intel Core i7-3770 CPU @ 3.40 GHz and an NVIDIA GeForce GTX 1050, running Windows 7. The algorithms were tested on the image seen in figure 3.2. An image consisting of circles was chosen to simplify calculation of the actual distance to feature objects in the image. The tests were performed on square images with side lengths equal to 64, 128, 256, 512, 1024 and 2048 pixels. The images were created by uniform supersampling of the source image with a resolution of 16 x 16 samples per pixel.
Figure 3.2: Input image for evaluation tests.
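The construction of an anti-aliased input can be sketched as follows, assuming a single circle (the actual test image contains several feature objects; the function name is illustrative):

```cpp
#include <cassert>
#include <vector>

// Each pixel's value is the fraction of 16 x 16 uniform subsamples that
// fall inside the circle (cx, cy, r), giving coverage values in [0, 1].
std::vector<double> circleCoverage(int w, int h, double cx, double cy, double r) {
    const int s = 16;  // samples per pixel side
    std::vector<double> img(w * h);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            int inside = 0;
            for (int j = 0; j < s; ++j)
                for (int i = 0; i < s; ++i) {
                    double sx = x + (i + 0.5) / s;  // subsample position
                    double sy = y + (j + 0.5) / s;
                    double dx = sx - cx, dy = sy - cy;
                    if (dx * dx + dy * dy <= r * r) ++inside;
                }
            img[y * w + x] = double(inside) / (s * s);
        }
    return img;
}
```

Pixels fully inside the circle get value 1, fully outside 0, and pixels cut by the circle boundary get the fractional coverage that the AAEDT distance function relies on.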
3.2.1 Time Complexity & Execution Time
The time complexities of the studied algorithms are known and have been compared for the binary transform by Cao et al. [18] and Juelg et al. [9], and should not change for this new application. What is of more interest in this thesis is how implementations for calculating the anti-aliased transform compare to each other, given the more computationally demanding distance measure.
Time measurements were done using the C++ std::chrono library. Time was measured for the algorithmic calculations only; memory allocation was done before measuring execution time.
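The timing pattern can be sketched with an illustrative helper (not from the thesis). Note that for CUDA kernels, a cudaDeviceSynchronize() call would be needed inside the timed region, since kernel launches are asynchronous:

```cpp
#include <cassert>
#include <chrono>

// Time only the supplied work, in milliseconds; allocation and other setup
// are done by the caller before invoking this.
template <typename F>
double timeMs(F&& work) {
    auto t0 = std::chrono::steady_clock::now();
    work();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

steady_clock is used rather than system_clock because it is monotonic and therefore unaffected by clock adjustments during a run.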
3.2.2 Error in Distance Approximations
Given that all three algorithms use slightly different methods of propagating data throughout the image, different types of errors can potentially occur. The distance approximations of each algorithm were compared to two different oracles: the actual distance in the over-sampled image, consisting of circles, and the result of a brute-force implementation of AAEDT, which always gives the shortest distance given the distance function used by these algorithms. Due to the brute-force algorithm's poor time complexity, it was not used as an oracle for the largest image size.
Three main measures were used to evaluate the correctness of the algorithms: the average and maximum error of pixels in a transformed image were derived when compared to both oracles, and the number of pixels assigned the same distance as when transformed using the brute-force implementation was also counted. These measures were chosen to make a comparison of the algorithms easy, and are similar to the ones used to produce the results by Gustavson & Strand [7].
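The three measures can be computed in one pass over the image; a minimal sketch with illustrative names:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

struct ErrorStats {
    double avg;         // average absolute error against the oracle
    double max;         // maximum absolute error against the oracle
    double exactRatio;  // fraction of pixels matching the oracle exactly
};

// Compare a transformed image against an oracle transform of equal size.
ErrorStats compareToOracle(const std::vector<double>& result,
                           const std::vector<double>& oracle) {
    double sum = 0.0, maxErr = 0.0;
    int exact = 0;
    for (size_t i = 0; i < result.size(); ++i) {
        double e = std::fabs(result[i] - oracle[i]);
        sum += e;
        maxErr = std::max(maxErr, e);
        if (e == 0.0) ++exact;
    }
    return { sum / result.size(), maxErr, double(exact) / result.size() };
}
```

The exact-match count only makes sense against the brute-force oracle, which uses the same distance function and image resolution as the tested algorithms.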
4 Results
This chapter aims to present the results produced in this thesis. These results and their implications will be more thoroughly discussed in the following chapters.
4.1 Time Complexity & Execution Time
Figure 4.1 shows the execution time of the three implemented algorithms. The values are an average of ten runs of each algorithm for each image size. PBA was tuned for each image size using its parameters to ensure the best possible performance. Parameter m3 was fixed to 32 for reasons discussed in section 5.2.
Figure 4.1: Algorithms' execution times for different image sizes. (Axes: image size (side length) [pixels] vs. time [ms]; curves: SKW, JFA, PBA.)
4.2 Distance Error
Figure 4.2 shows the average absolute error produced by the three algorithms, when compared to the two oracles: the ideal distance transform, using the geometric definitions of the source image, and the brute-force implementation of AAEDT.
Figure 4.2: Average distance error compared to the ideal transform (left) and the brute-force implementation (right). (Axes: image size (side length) [pixels] vs. error distance [pixels]; curves: SKW, JFA, PBA.)
Figure 4.3 shows the maximum error produced by the three algorithms, when compared to the two oracles. Note that compared to figure 4.2 the scale of the vertical axis differs by almost two orders of magnitude.
Figure 4.3: Maximum distance error compared to the ideal transform (left) and the brute-force implementation (right). (Axes: image size (side length) [pixels] vs. error distance [pixels]; curves: SKW, JFA, PBA.)
Figure 4.4 shows for what proportion of the total number of pixels the three algorithms produced the exact same result as the brute-force implementation.
Figure 4.4: The ratio of correct (according to the brute-force implementation) pixels to total pixels, for different image sizes. (Axes: image size (side length) [pixels] vs. quotient; curves: SKW, JFA, PBA.)
Figure 4.5 shows the error images of SKW and JFA, when executed on an image of resolution 512 x 512.
Figure 4.5: Error of SKW (left) and JFA (right) on a 512 x 512 image, compared to the ideal transform. Medium gray is zero error, red is an overestimation of +0.1, blue is an underestimation of -0.1. Errors are clamped to [-0.1, +0.1].
Figure 4.6 shows the error image of PBA, when run on an image of resolution 512 x 512. A difference in the output image is present along the upper right edge
of the lowest region when using different values for m3. These artifacts are more thoroughly discussed in section 5.2, and are more pronounced in figure 5.1.
Figure 4.6: Error of PBA on a 512 x 512 image, with m3 = 32 (left) and m3 = 1 (right), compared to the ideal transform. Medium gray is zero error, red is an overestimation of +0.1, blue is an underestimation of -0.1. The maximum absolute error is much greater than 0.1, but errors are clamped to [-0.1, +0.1].
Figure 4.7 shows the correlation between the PBA parameter m3 and the absolute average and maximum error in an image of resolution 512 x 512. It shows that a low value for m3 causes great error in some pixels, but does little to the average error.
Figure 4.7: Average (left) and maximum (right) distance error of PBA for different values of m3, on a 512 x 512 image.
5 Discussion
This chapter will discuss the results and how they relate to the research questions of this thesis. Other points of interest that were discovered during this project will also be addressed.
5.1 Implementation-Specific Problems
In the experiments run, the execution times of the implementations are longer than expected. This is likely due to little effort being made to improve memory coalescing, and to under-utilizing the available CUDA libraries. The implementations are rough, and should be improved upon if used for any real-time execution.
The relative performance of the algorithms is mostly consistent with the results of Cao et al. [18]: JFA and PBA perform best for smaller images. For larger image sizes, SKW becomes more viable, since the greater degree of parallelization utilized by PBA and JFA cannot be achieved on the GPU used. It is also clear that the sub-optimal work performed by JFA results in poor performance for larger images.
5.2 PBA-Specific Problems
The greatest source of error in PBA occurs in the second and third phases of the algorithm. The original calculations, for eliminating candidates from the stack of proximate pixels, rely on every point lying on a strict grid. For the anti-aliased transform this condition is broken, resulting in the mostly vertical regions of greater error seen in figure 4.6. In the non-binary case, pixels in the same column can have different closest feature pixels on the same row. This is because the anti-aliased image can represent edges with a slope that is not strictly horizontal or
vertical, unlike a binary image. This also results in some artifacts around perimeters of Voronoi regions, due to the stack of proximate feature pixels not being in proper order. The parameter m3 changes the severity of these artifacts, once again seen in figure 4.6. With a small value, the stack can be popped prematurely, and pixels in the following band miss some feature pixels which they should not. Alternatively, the stack is not popped for the last pixel in a band, and the following feature pixels are not considered for the pixels in the current band. With greater values for m3 the problem is less prominent, as there are fewer bands. For the same reason, the problem produces greater errors for larger images, as seen in figure 4.3, due to larger images requiring more bands to cover them. Figure 5.1 shows an example of this problem, where PBA has assigned a pixel the wrong closest feature pixel. The same type of error occurs at several points in the image, presenting as the small white spots along a line through the spot circled in blue.
This problem might be solvable with a more rigorous calculation for determining domination of candidates for proximate pixels, possibly requiring a new definition of what constitutes a "proximate feature pixel"; this, however, is beyond the scope of this thesis.
Figure 5.1: The gray-scale error image of PBA, with m3 = 32, of a 2048 x 2048 image, zoomed in on the pixel with greatest absolute error. The pixel circled in blue has been assigned the pixel circled in red as its closest feature pixel. The actual closest pixel is circled in green.
For most of the experiments, m3 was fixed to 32. A smaller value would be utilized better by the implementation. Due to how CUDA distributes work on the GPU [1], the hardware is under-utilized for m3 < 32, unless more advanced synchronization is used to allow a block of threads to work on multiple columns simultaneously. In the implementation used here, no such efforts were made. If this were changed, the execution time of PBA should scale better for larger image sizes. But before the issues with its correctness are addressed, such optimizations are of little interest.
5.3 Errors in Data Collection
This section will discuss sources of errors and biases in the experiments and the attempts to mitigate them.
5.3.1 Execution Time
As mentioned in section 3.2, all experiments were run on the same hardware. The tests were all run ten times to mitigate any errors due to variance in the deployment environment. Because Windows 7 was running on the same hardware, including the GPU, some processes would naturally interfere with the experiments. The environments each test was performed in were made as similar as possible, and background processes were kept to a minimum. While having a GPU dedicated solely to calculations would eliminate some variance in the results, the setup used for these experiments might be a more realistic representation of a typical user environment.
The implementations are, as mentioned, not comparable to other published implementations. This is partly due to known design decisions, such as focusing on readability and not utilizing texture memory to increase performance. The main result is that the relative performance of these implementations of the algorithms is not affected by the new distance measure used in AAEDT.
5.3.2 Input Data
Some feature objects in the source image are too small to be accurately represented in images of lower resolution. The distance measure used in AAEDT assumes that edges through a pixel are approximately straight; for images with smaller-than-pixel features this is not the case and errors occur. This is clearly seen in figure 4.2, where the error of all algorithms is greater for smaller images when compared to the ideal transform. This is not the case when comparing to the brute-force implementation, as it too uses the same resolution images.
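The straight-edge assumption can be checked numerically. The sketch below is a hypothetical host-side port of the edgedf() approximation from appendix A (double precision, with the invented name edgeDistance): for a 45-degree edge covering a = 0.25 of a pixel, it recovers the exact center-to-edge distance (1 - sqrt(2a))/sqrt(2) ≈ 0.2071. That exactness holds only because the edge is straight; a smaller-than-pixel feature with the same coverage a has a different true distance, which is the error source discussed above.

```cpp
#include <cassert>
#include <cmath>
#include <utility>

// Host-side port of the edgedf() approximation from appendix A.
// g = (gx, gy) is the local gradient; a is the pixel coverage in [0, 1].
// Returns the signed distance from the pixel center to the edge, assuming
// the edge through the pixel is a straight line.
double edgeDistance(double gx, double gy, double a) {
    if (gx == 0.0 || gy == 0.0) return 0.5 - a;  // axis-aligned edge
    double len = std::sqrt(gx * gx + gy * gy);
    gx = std::fabs(gx / len);
    gy = std::fabs(gy / len);
    if (gx < gy) std::swap(gx, gy);  // move to first octant, gx >= gy
    double a1 = 0.5 * gy / gx;
    if (a < a1)       return 0.5 * (gx + gy) - std::sqrt(2.0 * gx * gy * a);
    if (a < 1.0 - a1) return (0.5 - a) * gx;
    return -0.5 * (gx + gy) + std::sqrt(2.0 * gx * gy * (1.0 - a));
}
```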
Even though several different resolutions are used for the input images, only one type of source image was used for the experiments. The chosen image contains edges of all orientations as well as concave and convex features; despite this, results may be skewed by the choice of image. It is important to note that approximately 40% of the pixels in the images are feature pixels, which are trivial for all three algorithms to handle correctly, as each is its own closest pixel and no propagation of information is necessary.
5.4 Evaluation of Algorithms
In this section the results presented in chapter 4 are more thoroughly analyzed and summarized.
5.4.1 PBA
The experiments performed in this thesis show that the adaptation of PBA to calculate the AAEDT presented here does not produce results with the same level of correctness as the other, simpler algorithms. Figure 4.4 shows that as many as 30-40% of pixels are assigned the wrong value.
The algorithm still maintains good levels of parallelism, and executes within
reasonable time. However, for larger image sizes this requires the parameter m3
to be assigned a small value, which has been shown to further increase the level of error produced in the output image.
5.4.2 SKW
This algorithm clearly under-utilizes the hardware due to its poor parallelism, but does scale better than JFA for larger images.
It produces results with a low average absolute error, and while it does produce the wrong result for a significant proportion of pixels, the largest error it produces is small.
5.4.3 JFA
This algorithm performs well on smaller images in terms of execution time. It does, however, scale poorly with larger image sizes, just like implementations for the binary transform.
The algorithm produces results with the smallest average absolute error and gets nearly all pixels correct according to the brute-force implementation of AAEDT. Its maximum absolute error is, however, not as good as the average case. While it does not produce errors as large as PBA does, it is not as consistent as SKW, especially for the larger image sizes.
6 Conclusions
This chapter summarizes the findings presented in this thesis. In addition, some areas of work that could continue what has been done here are mentioned.
6.1 Research Questions
The research questions stated in chapter 1 were:
1. Are there any significant differences between EDT and AAEDT which affect parallelizability?
2. How do the chosen algorithms perform in terms of execution time and accuracy, when calculating AAEDT?
In order to answer the first question, PBA is the most interesting algorithm to examine. For the inexact algorithms JFA and SKW, there are no attributes of AAEDT which affect the performance of the algorithms in any significant way, at least none that were discovered during the work on this thesis. However, PBA no longer produces the correct transform as it did for the binary transform. For AAEDT the features no longer lie on a strict grid. Because of this, the calculations used to eliminate features as potential Voronoi pixels, in the second and third phases of the algorithm, no longer work as intended and produce erroneous results. These errors can, to an extent, be mitigated, but only by increasing the work done by the algorithm in the third phase, which is detrimental to its performance and parallelizability.
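The grid assumption can be made concrete with the perpendicular-bisector test from the second phase (ported here to host-side C++ from the lineColumnIntersection() and isDominating() helpers in appendix D; the site coordinates below are invented for illustration). For two sites lying exactly in a pixel column, the bisector crosses that column halfway between them, which is what the binary PBA elimination relies on; displacing one site by a sub-pixel edge offset, as AAEDT effectively does, moves the crossing point, so dominance decisions based on on-grid positions can eliminate the wrong site.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

struct Pt { double x, y; };

// Host-side port of lineColumnIntersection() from appendix D: the
// y-coordinate where the line through a and b crosses the column x = c.
double lineColumnIntersection(Pt a, Pt b, double c) {
    double a1 = b.y - a.y;
    double b1 = a.x - b.x;
    double c1 = a1 * a.x + b1 * a.y;
    double det = -b1;  // a1*b2 - a2*b1 with (a2, b2, c2) = (1, 0, c)
    if (det == 0.0) return std::numeric_limits<double>::max();
    return (a1 * c - c1) / det;
}

// y-coordinate where the perpendicular bisector of sites s and t crosses
// column x = c; below this point t is closer, above it s is closer.
double bisectorCrossing(Pt s, Pt t, double c) {
    Pt mid  = { (s.x + t.x) / 2.0, (s.y + t.y) / 2.0 };
    Pt mid2 = { mid.x - (t.y - s.y), mid.y + (t.x - s.x) };
    return lineColumnIntersection(mid, mid2, c);
}
```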
In terms of execution time the algorithms perform just like the versions for EDT, relative to each other. They are slow compared to implementations used in earlier work, which is partly due to the heavier calculations required by AAEDT, but also due to some less than optimal design present in these specific implementations.
In terms of accuracy, JFA performs best in the general case out of the three tested algorithms. It produces the correct distances for close to all pixels in an image. However, the few errors it does produce can be of significant magnitude, and these errors seem to be more prominent for larger images. SKW produces a greater number of errors than JFA, but is more consistent and produces errors of smaller magnitude. PBA is the least accurate of the three algorithms and produces artifacts with errors of massive magnitude compared to the other algorithms.
6.2 Future Work
There are multiple things that could be done to expand on the work presented in this thesis.
Firstly, the algorithms were only tested on a single source image, at different resolutions. The algorithms might perform differently when the input image consists of other shapes of features and other proportions of feature to non-feature pixels.
Secondly, there is much room for optimization in the implementations evaluated. More efficient implementations of the algorithms could be compared to their binary counterparts to measure how expensive the new distance measure of the AAEDT is.
Lastly, only one possible adaptation of PBA is presented here. Some other method of producing the stack of proximate feature pixels in the algorithm's second phase might solve some of the problems observed in this thesis.
A Helper Functions
/*
Copyright (C) 2009 Stefan Gustavson (stefan.gustavson@gmail.com)

This software is distributed under the permissive "MIT License":

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
*/
/*
 * A somewhat tricky function to approximate the distance to an edge in a
 * certain pixel, with consideration to either the local gradient (gx,gy)
 * or the direction to the pixel (dx,dy) and the pixel greyscale value a.
 * The latter alternative, using (dx,dy), is the metric used by edtaa2().
 * Using a local estimate of the edge gradient (gx,gy) yields much better
 * accuracy at and near edges, and reduces the error even at distant pixels
 * provided that the gradient direction is accurately estimated.
 */
struct fpoint { float x, y; };
struct ipoint { int x, y; };
struct uipoint { unsigned int x, y; };
struct stack { unsigned int top, bot; };
__device__ float edgedf(fpoint g, float a) {
    float df, glength, temp, a1;

    if ((g.x == 0) || (g.y == 0)) { // Either A) gx or gy are zero, or B) both
        df = 0.5 - a; // Linear approximation is A) correct or B) a fair guess
    }
    else {
        glength = sqrtf(g.x * g.x + g.y * g.y);
        if (glength > 0) {
            g.x = g.x / glength;
            g.y = g.y / glength;
        }
        /* Everything is symmetric wrt sign and transposition,
         * so move to first octant (gx>=0, gy>=0, gx>=gy) to
         * avoid handling all possible edge directions.
         */
        g.x = fabs(g.x);
        g.y = fabs(g.y);
        if (g.x < g.y) {
            temp = g.x;
            g.x = g.y;
            g.y = temp;
        }
        a1 = 0.5f * g.y / g.x;
        if (a < a1) { // 0 <= a < a1
            df = 0.5f * (g.x + g.y) - sqrt(2.0f * g.x * g.y * a);
        }
        else if (a < (1.0f - a1)) { // a1 <= a <= 1-a1
            df = (0.5f - a) * g.x;
        }
        else { // 1-a1 < a <= 1
            df = -0.5f * (g.x + g.y) + sqrt(2.0f * g.x * g.y * (1.0f - a));
        }
    }
    return df;
}
__device__ float dist(ipoint from, ipoint to, fpoint grad, float* img, uipoint imgSize) {

    if (to.x < 0 || to.y < 0) return imgSize.x * 3.0f;

    int id = to.y * imgSize.x + to.x;
    float a = img[id];
    if (a > 1.0f) a = 1.0f;
    if (a < 0.0f) a = 0.0f;
    if (a == 0.0f) return imgSize.x * 3.0f;

    float dx = to.x - from.x;
    float dy = to.y - from.y;
    float di = sqrtf(dx * dx + dy * dy);

    float df;
    if (di == 0.0f) {
        df = fmaxf(edgedf(grad, a), 0.0f);
    }
    else {
        df = edgedf({ dx, dy }, a);
    }
    return di + df;
}
/* Initialize gradients and closest pixels */
__global__ void setup(ipoint * closest, fpoint * grad, float* img, uipoint imgSize)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;
    int id = idy * imgSize.x + idx;
#define SQRT2 1.4142136f
    if (img[id] > 0.0f) { /* reconstructed: this line was lost at a page break; only covered pixels are sites */
        closest[id].x = idx;
        closest[id].y = idy;
        if (idx > 0 && idx < imgSize.x - 1 && idy > 0 && idy < imgSize.y - 1) {
            grad[id].x = -img[id - imgSize.x - 1] - SQRT2 * img[id - 1] - img[id + imgSize.x - 1]
                       + img[id - imgSize.x + 1] + SQRT2 * img[id + 1] + img[id + imgSize.x + 1];
            grad[id].y = -img[id - imgSize.x - 1] - SQRT2 * img[id - imgSize.x] - img[id - imgSize.x + 1]
                       + img[id + imgSize.x - 1] + SQRT2 * img[id + imgSize.x] + img[id + imgSize.x + 1];

            float g = grad[id].x * grad[id].x + grad[id].y * grad[id].y;
            if (g > 0.0f) {
                g = sqrtf(g);
                grad[id].x /= g;
                grad[id].y /= g;
            }
        }
        else {
            grad[id].x = 0;
            grad[id].y = 0;
        }
    }
    else {
        closest[id].x = -1;
        closest[id].y = -1;

        grad[id].x = 0;
        grad[id].y = 0;
    }
}
B SKW Code
__global__ void sweep(ipoint * voronoi, fpoint * grad, float* img, float* out, uipoint imgSize, int i, bool isForward, bool isRow) {

    int idx = blockIdx.x * blockDim.x + threadIdx.x + i * isRow;
    int idy = blockIdx.y * blockDim.y + threadIdx.y + i * !isRow;
    int id = idy * imgSize.x + idx;
    ipoint neighbors[4];
    neighbors[0] = { idx, idy };
    if (isForward && isRow) {
        neighbors[1] = { idx - 1, idy + 1 };
        neighbors[2] = { idx - 1, idy };
        neighbors[3] = { idx - 1, idy - 1 };
    }
    else if (!isForward && isRow) {
        neighbors[1] = { idx + 1, idy + 1 };
        neighbors[2] = { idx + 1, idy };
        neighbors[3] = { idx + 1, idy - 1 };
    }
    else if (isForward && !isRow) {
        neighbors[1] = { idx - 1, idy - 1 };
        neighbors[2] = { idx, idy - 1 };
        neighbors[3] = { idx + 1, idy - 1 };
    }
    else if (!isForward && !isRow) {
        neighbors[1] = { idx - 1, idy + 1 };
        neighbors[2] = { idx, idy + 1 };
        neighbors[3] = { idx + 1, idy + 1 };
    }
    float bestDist = imgSize.x * 3.0f;
    ipoint bestSite = { -1, -1 };
    for (int i = 0; i < 4; ++i) {
        ipoint n = neighbors[i];
        if (n.x >= imgSize.x || n.x < 0 || n.y >= imgSize.y || n.y < 0) continue;
        ipoint nSite = voronoi[n.x + n.y * imgSize.x];

        float newDist = (nSite.x < 0) ? imgSize.x * 3.0f : dist({ idx, idy }, nSite, grad[nSite.x + nSite.y * imgSize.x], img, imgSize);

        if (newDist < bestDist) {
            bestDist = newDist;
            bestSite = nSite;
        }
    }
    voronoi[id] = bestSite;
    out[id] = bestDist;
}
void SKW(float* edt, float* img, uipoint size)
{
    float* dev_img = 0;
    fpoint* dev_grad = 0;
    float* dev_edt = 0;
    ipoint* dev_voronoi = 0;

    cudaSetDevice(0);

    cudaMalloc((void**)& dev_grad, size.x * size.y * sizeof(fpoint));
    cudaMalloc((void**)& dev_edt, size.x * size.y * sizeof(float));
    cudaMalloc((void**)& dev_img, size.x * size.y * sizeof(float));

    cudaMemcpy(dev_img, img, size.x * size.y * sizeof(float), cudaMemcpyHostToDevice);

    cudaMalloc((void**)& dev_voronoi, size.x * size.y * sizeof(ipoint));

    dim3 block = { 8, 8 };
    dim3 grid = { size.x / 8, size.y / 8 };

    setup <<<grid, block>>> (dev_voronoi, dev_grad, dev_img, size);
    cudaDeviceSynchronize();

    block = { 1, 32 };
    grid = { 1, size.y / 32 };

    for (int i = 1; i < size.x; ++i) {
        sweep <<<grid, block>>> (dev_voronoi, dev_grad, dev_img, dev_edt, size, i, true, true);
        cudaDeviceSynchronize();
    }
    for (int i = size.x - 2; i >= 0; --i) {
        sweep <<<grid, block>>> (dev_voronoi, dev_grad, dev_img, dev_edt, size, i, false, true);
        cudaDeviceSynchronize();
    }

    block = { 32, 1 };
    grid = { size.x / 32, 1 };
    for (int i = 1; i < size.x; ++i) {
        sweep <<<grid, block>>> (dev_voronoi, dev_grad, dev_img, dev_edt, size, i, true, false);
        cudaDeviceSynchronize();
    }
    for (int i = size.x - 2; i >= 0; --i) {
        sweep <<<grid, block>>> (dev_voronoi, dev_grad, dev_img, dev_edt, size, i, false, false);
        cudaDeviceSynchronize();
    }

    cudaMemcpy(edt, dev_edt, size.x * size.y * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dev_grad);
    cudaFree(dev_img);
    cudaFree(dev_edt); /* reconstructed: this line was lost at a page break */
    cudaFree(dev_voronoi);

    return;
}
C JFA Code
__global__ void propagateSites(ipoint * closest, ipoint * voronoi, fpoint * grad, float* img, float* out, uipoint imgSize, int stepSize) {

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;
    int id = idy * imgSize.x + idx;
    ipoint neighbors[9] = { {idx - stepSize, idy + stepSize},
                            {idx, idy + stepSize},
                            {idx + stepSize, idy + stepSize},
                            {idx - stepSize, idy},
                            {idx, idy},
                            {idx + stepSize, idy},
                            {idx - stepSize, idy - stepSize},
                            {idx, idy - stepSize},
                            {idx + stepSize, idy - stepSize}
    };
    float bestDist = imgSize.x * 3.0f;
    ipoint bestSite = { -1, -1 };
    for (int i = 0; i < 9; ++i) {
        ipoint n = neighbors[i];
        if (n.x >= imgSize.x || n.x < 0 || n.y >= imgSize.y || n.y < 0) continue;
        ipoint nSite = closest[n.x + n.y * imgSize.x];

        float newDist = (nSite.x < 0) ? imgSize.x * 3.0f : dist({ idx, idy }, nSite, grad[nSite.x + nSite.y * imgSize.x], img, imgSize);

        if (newDist < bestDist) {
            bestDist = newDist;
            bestSite = nSite;
        }
    }
    voronoi[id] = bestSite;
    out[id] = bestDist;
}
void JFA(float* edt, float* img, uipoint size)
{
    float* dev_img = 0;
    fpoint* dev_grad = 0;
    float* dev_edt = 0;
    ipoint* dev_closest = 0;
    ipoint* dev_voronoi = 0;

    cudaSetDevice(0);

    cudaMalloc((void**)& dev_grad, size.x * size.y * sizeof(fpoint));
    cudaMalloc((void**)& dev_edt, size.x * size.y * sizeof(float));
    cudaMalloc((void**)& dev_img, size.x * size.y * sizeof(float));

    cudaMemcpy(dev_img, img, size.x * size.y * sizeof(float), cudaMemcpyHostToDevice);

    cudaMalloc((void**)& dev_closest, size.x * size.y * sizeof(ipoint));
    cudaMalloc((void**)& dev_voronoi, size.x * size.y * sizeof(ipoint));

    dim3 block = { 8, 8 };
    dim3 grid = { size.x / 8, size.y / 8 };

    setup <<<grid, block>>> (dev_closest, dev_grad, dev_img, size);
    cudaDeviceSynchronize();

    for (int i = size.x / 2; i > 0; i /= 2) {
        propagateSites <<<grid, block>>> (dev_closest, dev_voronoi, dev_grad, dev_img, dev_edt, size, i);

        ipoint* tmp = dev_closest;
        dev_closest = dev_voronoi;
        dev_voronoi = tmp;
        cudaDeviceSynchronize();
    }

    cudaMemcpy(edt, dev_edt, size.x * size.y * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dev_closest);
    cudaFree(dev_grad);
    cudaFree(dev_img);
    cudaFree(dev_edt);
    cudaFree(dev_voronoi);

    return;
}
D PBA Code
//==================================Phase1==================================

/* Calculate 1d-voronoi diagram for local band */
__global__ void phase1Sweep(ipoint * closest, fpoint * grad, float* img, int m1, uipoint imgSize)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    int idy = threadIdx.y + blockDim.y * blockIdx.y;

    int bandId = idy * imgSize.x + idx * (imgSize.x / m1);
    int pointX = idx * (imgSize.x / m1) + 1;

    // Sweep right
    for (int i = bandId + 1; i < bandId + (imgSize.x / m1); ++i) {

        float oldDist = dist({ pointX, idy }, closest[i], grad[i], img, imgSize);
        float newDist = dist({ pointX, idy }, closest[i - 1], grad[i - 1], img, imgSize);

        if (newDist < oldDist) {
            closest[i] = closest[i - 1];
        }

        ++pointX;
    }

    pointX -= 2;
    // Sweep left
    for (int i = bandId + (imgSize.x / m1) - 2; i >= bandId; --i) {

        float oldDist = dist({ pointX, idy }, closest[i], grad[i], img, imgSize);
        float newDist = dist({ pointX, idy }, closest[i + 1], grad[i + 1], img, imgSize);

        if (newDist < oldDist) {
            closest[i] = closest[i + 1];
        }

        --pointX;
    }
}
/* Propagate closest pixels across pixels on the edges of each band */
__global__ void phase1Prefix(ipoint * closest, fpoint * grad, float* img, int m1, uipoint imgSize, int stepSize, bool up, bool right)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    int idy = threadIdx.y + blockDim.y * blockIdx.y;

    int bid = 2 * (idx + 1) * stepSize - 1;
    int aid = bid - stepSize;

    if (!up) {
        bid += stepSize;
        aid += stepSize;
    }

    if (!right) {
        bid = 2 * m1 - bid - 1;
        aid = 2 * m1 - aid - 1;
    }

    int ax = (aid / 2 + aid % 2) * (imgSize.x / m1) - aid % 2;
    int bx = (bid / 2 + bid % 2) * (imgSize.x / m1) - bid % 2;

    aid = idy * imgSize.x + ax;
    bid = idy * imgSize.x + bx;

    if (dist({ bx, idy }, closest[aid], grad[aid], img, imgSize) < dist({ bx, idy }, closest[bid], grad[bid], img, imgSize)) {
        closest[bid] = closest[aid];
    }
}
/* Updates the bands based on the edge-pixels closest sites */
__global__ void phase1Update(ipoint * closest, fpoint * grad, float* img, int m1, uipoint imgSize) {
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    int idy = threadIdx.y + blockDim.y * blockIdx.y;

    int bandId = idy * imgSize.x + idx * (imgSize.x / m1);
    int pointX = idx * (imgSize.x / m1) + 1;

    for (int i = bandId + 1; i < bandId + (imgSize.x / m1) - 1; ++i) {
        if (dist({ pointX, idy }, closest[bandId], grad[bandId], img, imgSize) < dist({ pointX, idy }, closest[i], grad[i], img, imgSize)) {
            closest[i] = closest[bandId];
        }
        if (dist({ pointX, idy }, closest[bandId + (imgSize.x / m1) - 1], grad[bandId + (imgSize.x / m1) - 1], img, imgSize) < dist({ pointX, idy }, closest[i], grad[i], img, imgSize)) {
            closest[i] = closest[bandId + (imgSize.x / m1) - 1];
        }

        ++pointX;
    }
}

//==================================Phase2==================================
__device__ float lineColumnIntersection(fpoint a, fpoint b, float c) {

    float a1 = b.y - a.y;
    float b1 = a.x - b.x; /* reconstructed: this line was lost at a page break */
    float c1 = a1 * a.x + b1 * a.y;

    float a2 = 1.0f;
    float b2 = 0.0f;
    float c2 = c;

    float det = a1 * b2 - a2 * b1;

    if (det == 0) return FLT_MAX;

    return (a1 * c2 - a2 * c1) / det;
}

__device__ bool isDominating(fpoint a, fpoint b, fpoint c, float column) {
    fpoint p1 = { (a.x + b.x) / 2.0f, (a.y + b.y) / 2.0f };
    fpoint p2 = { p1.x - (b.y - a.y), p1.y + (b.x - a.x) };

    fpoint q1 = { (b.x + c.x) / 2.0f, (b.y + c.y) / 2.0f };
    fpoint q2 = { q1.x - (c.y - b.y), q1.y + (c.x - b.x) };

    float p = lineColumnIntersection(p1, p2, column);
    float q = lineColumnIntersection(q1, q2, column);

    return p <= q;
}
/* Calculate the proximate sites for each band */
__global__ void phase2sweep(ipoint * closest, fpoint * grad, float* img, int m2, uipoint imgSize, stack * stk, stack * stkIndex)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;

    int stkId = idx + idy * imgSize.x;
    int stkStart = idx + idy * imgSize.x * (imgSize.y / m2);
    int stkEnd = stkStart + imgSize.x * (imgSize.y / m2);

    while (closest[stkStart].x == -1 && stkStart < stkEnd) {
        stkStart += imgSize.x;
    }

    if (stkStart >= stkEnd) {
        stkIndex[stkId].top = -1;
        stkIndex[stkId].bot = -1;
        return;
    }

    stkIndex[stkId].bot = stkStart;
    stkIndex[stkId].top = stkStart;

    for (int i = stkStart + imgSize.x; i < stkEnd; i += imgSize.x) {
        int top = stkIndex[stkId].top;
        int second = stk[top].bot;

        if (closest[i].x == -1) {
            continue;
        }

        while (stkIndex[stkId].bot != stkIndex[stkId].top) {

            int aid = closest[i].x + closest[i].y * imgSize.x;
            fpoint ag = grad[aid];
            float adf = edgedf(ag, img[aid]);
            float agLength = sqrtf(ag.x * ag.x + ag.y * ag.y);
            if (agLength > 0.0f) {
                ag.x /= agLength; /* reconstructed: this line was lost at a page break */
                ag.y /= agLength;
            }

            int bid = closest[top].x + closest[top].y * imgSize.x;
            fpoint bg = grad[bid];
            float bdf = edgedf(bg, img[bid]);
            float bgLength = sqrtf(bg.x * bg.x + bg.y * bg.y);
            if (bgLength > 0.0f) {
                bg.x /= bgLength;
                bg.y /= bgLength;
            }

            int cid = closest[second].x + closest[second].y * imgSize.x;
            fpoint cg = grad[cid];
            float cdf = edgedf(cg, img[cid]);
            float cgLength = sqrtf(cg.x * cg.x + cg.y * cg.y);
            if (cgLength > 0.0f) {
                cg.x /= cgLength;
                cg.y /= cgLength;
            }

            fpoint a = { closest[i].x + ag.x * adf, closest[i].y + ag.y * adf };
            fpoint b = { closest[top].x + bg.x * bdf, closest[top].y + bg.y * bdf };
            fpoint c = { closest[second].x + cg.x * cdf, closest[second].y + cg.y * cdf };
            if (isDominating(a, b, c, idx)) {
                stkIndex[stkId].top = second;
                top = second;
                second = stk[top].bot;
            }
            else {
                break;
            }
        }
        stk[i].bot = top;
        stk[top].top = i;
        stkIndex[stkId].top = i;
    }
    stk[stkIndex[stkId].top].top = -1;
}
/* Calculate proximate sites for each column */
__global__ void phase2merge(ipoint * closest, fpoint * grad, float* img, int m2, uipoint imgSize, stack * stk, stack * stkIndex, int stepSize)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;
    int stkLowId = idx + idy * imgSize.x * stepSize * 2;
    int stkHighId = stkLowId + stepSize * imgSize.x;
    int stkStart = stkIndex[stkLowId].bot;
    int stkMid = stkIndex[stkHighId].bot;
    int stkEnd = stkIndex[stkHighId].top;

    if (stkStart == -1) {
        stkIndex[stkLowId] = stkIndex[stkHighId];
        return;
    }
    if (stkEnd == -1) {
        return;
    }
    bool isAlmostMerged = false;
    for (int i = stkMid; i <= stkEnd && stk[i].top > 0; i = stk[i].top) {
        int top = stkIndex[stkLowId].top;
        int second = stk[top].bot;
        while (stkIndex[stkLowId].bot != stkIndex[stkLowId].top) {

            int aid = closest[i].x + closest[i].y * imgSize.x;
            fpoint ag = grad[aid]; /* reconstructed: this line was lost at a page break */
            float adf = edgedf(ag, img[aid]);
            float agLength = sqrtf(ag.x * ag.x + ag.y * ag.y);
            if (agLength > 0.0f) {
                ag.x /= agLength;
                ag.y /= agLength;
            }

            int bid = closest[top].x + closest[top].y * imgSize.x;
            fpoint bg = grad[bid];
            float bdf = edgedf(bg, img[bid]);
            float bgLength = sqrtf(bg.x * bg.x + bg.y * bg.y);
            if (bgLength > 0.0f) {
                bg.x /= bgLength;
                bg.y /= bgLength;
            }

            int cid = closest[second].x + closest[second].y * imgSize.x;
            fpoint cg = grad[cid];
            float cdf = edgedf(cg, img[cid]);
            float cgLength = sqrtf(cg.x * cg.x + cg.y * cg.y);
            if (cgLength > 0.0f) {
                cg.x /= cgLength;
                cg.y /= cgLength;
            }

            fpoint a = { closest[i].x + ag.x * adf, closest[i].y + ag.y * adf };
            fpoint b = { closest[top].x + bg.x * bdf, closest[top].y + bg.y * bdf };
            fpoint c = { closest[second].x + cg.x * cdf, closest[second].y + cg.y * cdf };
            if (isDominating(a, b, c, idx)) {
                isAlmostMerged = false;
                stkIndex[stkLowId].top = second;
                top = second;
                second = stk[top].bot;
            }
            else if (!isAlmostMerged) {
                isAlmostMerged = true;
                break;
            }
            else {
                stk[i].bot = top;
                stk[top].top = i;
                stkIndex[stkLowId].top = stkIndex[stkHighId].top;
                return;
            }
        }
        stk[i].bot = top;
        stk[top].top = i;
        stkIndex[stkHighId].bot = stk[i].top;
        stkIndex[stkLowId].top = i;
    }
    stkIndex[stkLowId].top = stkIndex[stkHighId].top;
}

//==================================Phase3==================================
/* Calculate 2d voronoi diagram */
__global__ void phase3(ipoint * voronoi, ipoint * closest, fpoint * grad, float* out, int m3, float* img, uipoint imgSize, stack * stk, stack * stkIndex, int baseY)
{
    __shared__ bool done;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y + baseY;
    int id = idy * imgSize.x + idx;

    voronoi[id] = closest[stkIndex[idx].bot];
    out[id] = dist({ idx, idy }, closest[stkIndex[idx].bot], grad[closest[stkIndex[idx].bot].x + closest[stkIndex[idx].bot].y * imgSize.x], img, imgSize); /* partially reconstructed: the start of this line was lost at a page break */

    if (threadIdx.y == m3 - 1) done = false;
    __syncthreads();

    while (!done && stkIndex[idx].bot != stkIndex[idx].top) {
        int a = stkIndex[idx].bot;
        int b = stk[a].top;

        float distC = out[id];
        float distA = dist({ idx, idy }, closest[a], grad[closest[a].x + closest[a].y * imgSize.x], img, imgSize);
        float distB = dist({ idx, idy }, closest[b], grad[closest[b].x + closest[b].y * imgSize.x], img, imgSize);

        if (distA < distC && distA < distB) {
            voronoi[id] = closest[a];
            out[id] = distA;
            if (threadIdx.y == m3 - 1) done = true;
        }

        __syncthreads();
        if (distB < distA && distB < distC) {
            voronoi[id] = closest[b];
            out[id] = distB;
            stkIndex[idx].bot = b;
        }
        else if (threadIdx.y == m3 - 1) {
            done = true;
        }
        __syncthreads();
    }
}
void PBA(float* edt, float* img, uipoint size)
{
    float* dev_img = 0;
    fpoint* dev_grad = 0;
    ipoint* dev_closest = 0;
    float* dev_edt = 0;
    stack* dev_stack = 0;
    stack* dev_stackIndex = 0;
    ipoint* dev_voronoi = 0;

    cudaSetDevice(0);

    // Allocate GPU buffers
    cudaMalloc((void**)& dev_closest, size.x * size.y * sizeof(ipoint));
    cudaMalloc((void**)& dev_grad, size.x * size.y * sizeof(fpoint));
    cudaMalloc((void**)& dev_edt, size.x * size.y * sizeof(float));
    cudaMalloc((void**)& dev_img, size.x * size.y * sizeof(float));
    cudaMalloc((void**)& dev_stack, size.x * size.y * sizeof(stack));
    cudaMalloc((void**)& dev_stackIndex, size.x * param2 * sizeof(stack));
    cudaMalloc((void**)& dev_voronoi, size.x * size.y * sizeof(ipoint));

    // Copy image from host to device.
    cudaMemcpy(dev_img, img, size.x * size.y * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block = { 8, 8 };
    dim3 grid = { size.x / 8, size.y / 8 };

    setup <<<grid, block>>> (dev_closest, dev_grad, dev_img, size);
    cudaDeviceSynchronize();

    //==============================Phase1======================================

    block = { param1, 32 };
    grid = { 1, size.y / 32 };

    phase1Sweep <<<grid, block>>> (dev_closest, dev_grad, dev_img, param1, size);
    cudaDeviceSynchronize();

    for (unsigned int i = 1; i <= param1; i *= 2) {
        block = { param1 / i, 32 };
        grid = { 1, size.y / 32 };
        phase1Prefix <<<grid, block>>> (dev_closest, dev_grad, dev_img, param1, size, i, true, true);
        cudaDeviceSynchronize();
    }
    for (unsigned int i = param1 / 2; i >= 1; i /= 2) {
        block = { param1 / i - 1, 32 };
        grid = { 1, size.y / 32 };
        phase1Prefix <<<grid, block>>> (dev_closest, dev_grad, dev_img, param1, size, i, false, true);
        cudaDeviceSynchronize();
    }

    for (unsigned int i = 1; i <= param1; i *= 2) {
        block = { param1 / i, 32 };
        grid = { 1, size.y / 32 };
        phase1Prefix <<<grid, block>>> (dev_closest, dev_grad, dev_img, param1, size, i, true, false);
        cudaDeviceSynchronize();
    }
    for (unsigned int i = param1 / 2; i >= 1; i /= 2) {
        block = { param1 / i - 1, 32 };
        grid = { 1, size.y / 32 };
        phase1Prefix <<<grid, block>>> (dev_closest, dev_grad, dev_img, param1, size, i, false, false);
        cudaDeviceSynchronize();
    }

    block = { param1, 32 };
    grid = { 1, size.y / 32 };
    phase1Update <<<grid, block>>> (dev_closest, dev_grad, dev_img, param1, size);
    cudaDeviceSynchronize();

    //==============================Phase2======================================

    block = { 32, param2 };
    grid = { size.x / 32, 1 };

    phase2sweep <<<grid, block>>> (dev_closest, dev_grad, dev_img, param2, size, dev_stack, dev_stackIndex);
    cudaDeviceSynchronize();

    for (int i = 1; i < param2; i <<= 1) {
        block = { 32, param2 / (unsigned int)(i * 2) };
        grid = { size.x / 32, 1 };

        phase2merge <<<grid, block>>> (dev_closest, dev_grad, dev_img, param2, size, dev_stack, dev_stackIndex, i);
        cudaDeviceSynchronize();
    }

    //==============================Phase3======================================
    block = { 1, param3 };
    grid = { size.x, 1 };
    for (int i = 0; i < size.y / param3; ++i) {
        phase3 <<<grid, block>>> (dev_voronoi, dev_closest, dev_grad, dev_edt, param3, dev_img, size, dev_stack, dev_stackIndex, i * param3);
        cudaDeviceSynchronize();
    }

    cudaMemcpy(edt, dev_edt, size.x * size.y * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dev_closest);
    cudaFree(dev_grad);
    cudaFree(dev_img);
    cudaFree(dev_edt);
    cudaFree(dev_stack);
    cudaFree(dev_stackIndex);
    cudaFree(dev_voronoi);

    return;
}
Bibliography
[1] NVIDIA. CUDA C Programming Guide, v10.1.168. URL https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html. Retrieved 2019-06-05.
[2] F. d. A. Zampirolli and L. Filipe. A fast CUDA-based implementation for the Euclidean distance transform. In 2017 International Conference on High Performance Computing & Simulation (HPCS), pages 815-818, July 2017. doi: 10.1109/HPCS.2017.123.
[3] N. Angayarkanni and D. Kumar. Euclidean distance transform (EDT) algorithm applied to binary image for finding breast cancer. Biomedical & Pharmacology Journal, 8(1):407-411, 2015.
[4] Per-Erik Danielsson. Euclidean distance mapping, 1980.
[5] Ricardo Fabbri, Luciano Da Fontoura Costa, Julio C. Torelli, and Odemir Martinez Bruno. 2D Euclidean distance transform algorithms: A comparative survey. ACM Computing Surveys, 40(1), 2008.
[6] Chris Green. Improved alpha-tested magnification for vector textures and special effects. In ACM SIGGRAPH 2007 Courses: Advanced Real-Time Rendering in 3D Graphics and Games, chapter 2, 2007.
[7] Stefan Gustavson and Robin Strand. Anti-aliased Euclidean distance transform. Pattern Recognition Letters, 32(2):252-257, January 2011. doi: 10.1016/j.patrec.2010.08.010. URL https://doi.org/10.1016/j.patrec.2010.08.010.
[8] Tatsuya Hayashi, Koji Nakano, and Stephan Olariu. Optimal parallel algorithms for finding proximate points, with applications. IEEE Transactions on Parallel and Distributed Systems, 9(12):1153-1166, 1998. ISSN 1045-9219.