Master of Science Thesis in Computer Science
Department of Electrical Engineering, Linköping University, 2020
A Comparison of Parallel Algorithms for Calculating the Anti-Aliased Euclidean Distance Transform
Daniel Eriksson LiTH-ISY-EX--19/5269--SE
Supervisor: Stefan Gustavson
isn, Linköping University
Åsa Detterfelt
MindRoad AB
Examiner: Ingemar Ragnemalm
isy, Linköping University
Division of Information Coding, Department of Electrical Engineering
Linköping University, SE-581 83 Linköping, Sweden
Copyright © 2020 Daniel Eriksson
Abstract
This thesis presents a comparison of three different parallel algorithms, adapted to calculate the anti-aliased euclidean distance transform. They were originally designed to calculate the binary euclidean distance transform. The three algorithms are SKW, the Jump Flooding Algorithm (JFA), and the Parallel Banding Algorithm (PBA). The results presented here show that the two simpler algorithms, SKW and JFA, can easily be adapted to calculate the anti-aliased transform rather than the binary transform. These two algorithms show good performance in regards to accuracy and precision. The more complex algorithm, PBA, is not as easily adapted. The design of this algorithm is based on some assumptions about the binary transform, which do not hold true in the case of the anti-aliased transform. Because of this, the algorithm does not produce a transform with the same level of accuracy as the other algorithms.
Acknowledgments
I want to thank my supervisors Åsa and Stefan, for their constructive feedback on this thesis. I would like to thank my examiner Ingemar, for helping me formulate the problem at the center of this thesis. I would also like to thank my colleagues at the office for keeping me just the right amount of distracted for the duration of this project.
Linköping, October 2019 Daniel Eriksson
Contents
List of Figures
Notation
1 Introduction
1.1 Aim & Motivation
1.2 Research Questions
1.3 Delimitations
2 Theory
2.1 Euclidean distance transform
2.2 Anti-Aliased Euclidean distance transform
2.3 Parallel Banding Algorithm
2.4 SKW
2.5 Jump Flooding Algorithm
2.6 Other Parallel Algorithms
3 Method
3.1 Implementation
3.1.1 Parallel Banding Algorithm
3.1.2 SKW
3.1.3 Jump Flooding Algorithm
3.2 Evaluation
3.2.1 Time Complexity & Execution Time
3.2.2 Error in Distance Approximations
4 Results
4.1 Time Complexity & Execution Time
4.2 Distance Error
5 Discussion
5.1 Implementation-Specific Problems
5.2 PBA-Specific Problems
5.3 Errors in Data Collection
5.3.1 Execution Time
5.3.2 Input Data
5.4 Evaluation of Algorithms
5.4.1 PBA
5.4.2 SKW
5.4.3 JFA
6 Conclusions
6.1 Research Questions
6.2 Future Work
A Helper Functions
B SKW Code
C JFA Code
D PBA Code
Bibliography
List of Figures
2.1 Demonstration of pixel domination
2.2 SKW vector template
2.3 JFA vector template
3.1 Edge-point approximation
3.2 Input image
4.1 Execution times
4.2 Average distance errors
4.3 Maximum distance errors
4.4 Proportion of correct pixels
4.5 SKW and JFA error image
4.6 PBA error image
4.7 Error of PBA for different values of m3
5.1 Zoomed PBA error image
Notation
Abbreviations
Abbreviation Meaning
EDT Euclidean Distance Transform
AAEDT Anti-Aliased Euclidean Distance Transform
JFA Jump Flooding Algorithm
PBA Parallel Banding Algorithm
1 Introduction
The Euclidean Distance Transform (EDT) is a fundamental operation in image processing. Algorithms to produce this transform have been widely studied and applied to problems in fields such as medicine [3, 13], computer graphics rendering [6, 11], and robot navigation [9].
Nysjö et al. [13] used EDT to create 3-dimensional models of surgical guides and other tools used in surgery. This was done by applying the transform to CT images and creating a model which follows and fits the anatomy of the patient. Angayarkanni & Kumar [3] used the transform in a process to analyze mammograms and identify cancer cells.
Macedo and Apolinário [11] used the transform to create realistic shadows in real-time computer simulations. Their method created shadows with fewer artifacts than other methods, and with good performance. Green [6] has presented a method, using the EDT of a texture, to recreate a high-resolution texture while requiring less texture memory.
Juelg et al. [9] applied EDT to a voxel representation of an environment. The results were used by a robot to avoid obstacles and navigate towards a goal in the environment.
Another transform which performs a similar operation has been proposed in a paper by Gustavson & Strand [7], in which they show that their transform, the Anti-Aliased Euclidean Distance Transform (AAEDT), has some advantages over the regular EDT.
1.1 Aim & Motivation
Given the advantages AAEDT presents over EDT, it would be beneficial in many cases to use the former over the latter. However, due to the abundance of well-established and highly polished algorithms and implementations of EDT, and
the lack of such algorithms for AAEDT, this is not commonplace. There is also a broad set of areas of application [5] in which such a change could improve the performance of applications.
The aim of this thesis is to research what parallel methods and algorithms, commonly used to calculate EDTs of images, are suitable for adaptation to calculate AAEDTs of images. Specifically, parallel algorithms designed for graphical processing units are explored.
1.2 Research Questions
1. Are there any significant differences between EDT and AAEDT which affect parallelizability?
2. How do the chosen algorithms perform in terms of execution time and accuracy, when calculating AAEDT?
1.3 Delimitations
Three different algorithms are the focus of this thesis. They are: Parallel Banding Algorithm (PBA), Jump Flooding Algorithm (JFA), and SKW.
The distance function presented by Gustavson & Strand [7] for AAEDT is only defined for 2-dimensional images. Deriving a function for higher-dimensional images is complex. Because of this, only 2-dimensional images were used for testing the chosen algorithms.
2 Theory
This chapter outlines the transforms related to this thesis, the Euclidean distance transform and the Anti-Aliased Euclidean distance transform, as well as the algorithms which are the main focus of this report.
2.1 Euclidean distance transform
The EDT is a transform which takes a binary image as input and outputs an image of euclidean distances. The binary input image consists of pixels with a value of ai ∈ {0, 1}. Throughout this paper, the term feature pixel refers to a pixel in an input image with a value of ai = 1. Non-feature pixels are called background pixels and have a value of ai = 0.
The output of the EDT is an image where each pixel is assigned a value equal to the euclidean distance from the corresponding pixel in the input image to the closest feature pixel in the input image.
There are two main categories of algorithms to calculate EDTs: approximate and exact. An exact algorithm guarantees the correct output for any and all input images. Approximate solutions are often easier to implement, but do not guarantee the same accuracy as an exact solution. While approximate algorithms do not give the correct output for all inputs, they do produce correct output for some inputs and close-to-correct output, within certain bounds of error, for others.
Related to the EDT are Voronoi diagrams, which rather than returning the distance to the closest feature pixel give the index of the closest feature pixel. Given the Voronoi diagram of an image, calculating the EDT is a trivial problem. Because of this, some methods of calculating the EDT first calculate the Voronoi diagram and use it to calculate the EDT.
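The step from Voronoi diagram to EDT can be sketched as follows. This is a minimal sequential sketch; the function name and data layout are illustrative, not from the thesis:

```cpp
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// Given a Voronoi diagram (for each pixel, the coordinates of its closest
// feature pixel), the EDT follows directly: each output value is the
// euclidean distance from the pixel to its stored closest feature pixel.
std::vector<float> edtFromVoronoi(
    const std::vector<std::pair<int, int>>& voronoi, int width, int height) {
    std::vector<float> edt(width * height);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            const auto& f = voronoi[y * width + x];
            float dx = static_cast<float>(x - f.first);
            float dy = static_cast<float>(y - f.second);
            edt[y * width + x] = std::sqrt(dx * dx + dy * dy);
        }
    }
    return edt;
}
```

Each output pixel is independent of the others, which is why this final step parallelizes trivially on a GPU.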
The algorithms described in this chapter were all devised to calculate the EDT and Voronoi diagrams of binary images.
2.2 Anti-Aliased Euclidean distance transform
The AAEDT is a modified version of the binary EDT, proposed by Gustavson & Strand [7]. This transform takes an input image consisting of pixels with values ai ∈ [0, 1]. While the domain of this transform is a superset of the domain of the binary transform, it does not produce the same mapping as the binary transform for that specific subset of its domain. When referring to the input of this transform, a feature pixel is any pixel with a value ai ≠ 0.
Given a supersampled, anti-aliased image, the AAEDT returns an image similar to that of the binary transform. While the binary transform returns (for each pixel) the distance to the closest feature pixel, the AAEDT returns (for each pixel) the approximated distance to the closest ideal edge of a feature object in the original image which was sampled.
A new category of feature pixels is edge pixels, which consists of all feature pixels with a value ai ≠ 1. These pixels lie partially on a feature object in the original image and therefore lie on an edge, hence the name.
While plenty of work has been done on parallel implementations of the binary EDT, not much has been done in the realm of parallelizing the AAEDT. Gustavson & Strand [7] presented an implementation using raster scanning, and Linnér & Strand [10] presented an implementation based on wave front propagation. As this thesis is focused on parallelizing the AAEDT, the latter method is, however, of little interest, as it requires arbitrary memory accesses.
2.3 Parallel Banding Algorithm
PBA is a parallel version of earlier sequential methods for calculating the exact EDT, using Voronoi region intersections to calculate the transform. The algorithm was proposed by Cao et al. [18]. It is stated to be based on an algorithm by Hayashi et al. [8], but with a focus on efficient use of parallelization. The algorithm is based on concepts discussed by Saito & Toriwaki [16], as well as Maurer et al. [12].
The algorithm consists of three separate phases. The first phase calculates the one-dimensional Voronoi diagram of the image: for each pixel, it finds the closest feature pixel that lies on the same row of pixels. The second phase creates a stack for each column which exclusively holds references to all feature pixels whose Voronoi regions intersect the column in question. A feature pixel's Voronoi region is the set of points whose closest feature pixel is the pixel in question. The third phase uses the stack created in the second phase to create the Voronoi diagram and EDT by traversing each pixel in a column, looking at the top two pixels on the stack and popping the stack whenever the current pixel is no longer in the Voronoi region of the top pixel.
In order to maximize parallelism each phase implements a parallel divide-and-conquer approach by dividing the image into smaller parts, that are worked
on in parallel, and the results from each part are then combined. The partitioning of the image resembles strips or bands, giving the algorithm its name. For the first phase the bands are vertical, consisting of a number of columns. For the second and third phases they are horizontal, consisting of a number of rows.
In the first phase, for each band in all rows in parallel, the closest feature pixel on the same row and band is found for each pixel. This is done using two simple sweeps, one from left to right and one from right to left, updating the closest feature pixel using the closest one found so far and the closest pixel of the previous pixel in the sweep. The result is then distributed among all outermost pixels of all bands in the same row using a parallel prefix algorithm, updating the closest feature pixels of these outermost pixels. Finally, all other pixels in the band are updated based on the already updated outermost pixels.
Figure 2.1: Pixel b being dominated by pixels a and c.
The second phase creates a stack, for each column, of all proximate feature pixels, where a proximate feature pixel is a feature pixel whose Voronoi region is intersected by the column in question. This is again done separately for each band. A Voronoi region is a convex polygon whose edges are the perpendicular bisectors between its corresponding feature pixel and the feature pixels of adjacent Voronoi regions. Consider then three different feature pixels a, b, c on rows ja > jb > jc, all closest feature pixels to some pixel in column i. If the perpendicular bisectors between a & b and b & c intersect before reaching column i, the Voronoi region of b is not intersected by column i and no pixel in that column belongs to the Voronoi region of b. In the original paper by Cao et al. this is called b being dominated by a and c, and it is demonstrated in figure 2.1. This fact is used when creating the stack: each feature pixel in the column is considered in ascending order and pushed onto the stack. If the top feature pixel on the stack is dominated by the newly considered feature pixel and the second pixel on the stack, the top pixel is popped off the stack. A newly considered pixel is only pushed onto the stack once no more pixels are dominated or there is only one element on the stack. Once the stacks for each column in each band are created, they are merged two by two in each column by pushing the bottom element of one stack onto the top of the stack in the band below, removing any dominated feature pixels in the process. This merger is easily done if the stacks are implemented as doubly linked lists. The stacks are merged until each column has only one single coherent stack.
The third and final phase of the algorithm takes the stack from the second phase and, for each pixel in the same band and column in parallel, assigns it a closest feature pixel by looking at the top two pixels on its column's stack, and popping the stack if the second element on the stack is closer. Once the last pixel in the band and column has found its actual closest feature pixel, the same process is repeated for the next band.
For the first two phases, the number of bands is determined by the parameters m1 and m2, respectively. The number of pixels to handle in parallel in the third phase is determined by the parameter m3. The total work performed by this algorithm is O(n²) for an image of size n × n pixels. Because of this optimal work and focus on parallelism, the algorithm performs well when compared to other algorithms [9, 18]. In order to achieve this high performance, the values of the three parameters m1, m2, and m3 have to be tuned to the size of the input image and the hardware.
2.4 SKW
Schneider et al. [17] have presented a parallel algorithm for calculating the approximate EDT and Voronoi diagram, designed for implementation on GPUs. The algorithm is based on an earlier sequential algorithm using vector propagation to calculate the transform. While it is commonly referred to as SKW after the authors' initials, the algorithm was earlier proposed by Ragnemalm [14] under the name 4H-8SSED/SIMD.
Figure 2.2: Vector templates for SKW, with relative coordinates. The four templates are {(-1,-1), (-1,0), (-1,1), (0,0)}, {(1,-1), (1,0), (1,1), (0,0)}, {(-1,-1), (0,-1), (1,-1), (0,0)}, and {(-1,1), (0,1), (1,1), (0,0)}.
The algorithm performs four scans: right, left, up and down, each using one of the templates in figure 2.2, where the currently considered pixel's closest feature pixel is updated, taking into consideration all closest feature pixels of the pixels in the template. The first two scans apply their vector templates to each pixel in every column, starting with the leftmost and rightmost column respectively. The third and fourth scans similarly apply their vector templates to each pixel in all rows.
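The four scans can be sketched as a single-threaded reference (in the GPU version, all pixels of a column or row within a scan step are processed in parallel; names and data layout here are illustrative):

```cpp
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// Per pixel, the closest feature pixel found so far; (-1,-1) means "none
// yet". Feature pixels are initialized to their own coordinates.
struct Closest { int x = -1, y = -1; };

static float dist2(int x, int y, Closest c) {
    if (c.x < 0) return std::numeric_limits<float>::max();
    float dx = float(x - c.x), dy = float(y - c.y);
    return dx * dx + dy * dy;
}

void skwScans(std::vector<Closest>& v, int w, int h) {
    // The three template offsets per scan (the (0,0) entry is implicit,
    // since the pixel's own candidate is always kept as a baseline).
    const int offs[4][3][2] = {
        {{-1, -1}, {-1, 0}, {-1, 1}},   // left-to-right: column to the left
        {{ 1, -1}, { 1, 0}, { 1, 1}},   // right-to-left: column to the right
        {{-1, -1}, { 0, -1}, { 1, -1}}, // top-to-bottom: row above
        {{-1,  1}, { 0,  1}, { 1,  1}}  // bottom-to-top: row below
    };
    for (int s = 0; s < 4; ++s) {
        for (int i = 0; i < w * h; ++i) {
            int j = (s == 1 || s == 3) ? (w * h - 1 - i) : i;  // scan direction
            int x, y;
            if (s < 2) { x = j / h; y = j % h; }  // column by column
            else       { x = j % w; y = j / w; }  // row by row
            Closest best = v[y * w + x];
            for (auto& o : offs[s]) {
                int nx = x + o[0], ny = y + o[1];
                if (nx < 0 || nx >= w || ny < 0 || ny >= h) continue;
                Closest cand = v[ny * w + nx];
                if (dist2(x, y, cand) < dist2(x, y, best)) best = cand;
            }
            v[y * w + x] = best;
        }
    }
}
```

For the binary transform, dist2 compares plain squared euclidean distances; the AAEDT adaptation in chapter 3 swaps this comparison for the anti-aliased distance measure.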
While this algorithm, just like PBA, performs O(n²) work, it does not perform as well when it comes to execution time, given the sequential nature of the calculations performed by the algorithm.
2.5 Jump Flooding Algorithm
A technique for parallel calculation of the EDT was proposed by Danielsson [4] and much later presented by Rong & Tan [15] under the name Jump Flooding Algorithm. The algorithm performs a number of steps equal to the base-2 logarithm of the side length of the image. In each step, the vector template in figure 2.3 is applied to every pixel in the image, and their closest pixels are propagated throughout the grid. At the start, the size of the template is such that the shortest distance between pixels (denoted s in figure 2.3) is equal to half the side length of the image. In every step this distance is halved until it is equal to one.
Figure 2.3: Vector template for JFA, with relative coordinates {(-s,-s), (-s,0), (-s,s), (0,-s), (0,0), (0,s), (s,-s), (s,0), (s,s)}.
Since reads and writes, to and from the memory containing the intermediate Voronoi diagram, are performed in parallel, two buffers can be used to avoid race conditions. One buffer acts as input and the other as output for each step of the algorithm; these roles are switched between the buffers after every step to ensure that the latest data acts as input to the next iteration.
This algorithm performs O(n² log n) work for images of size n × n. While it is not linear like SKW, this algorithm parallelizes its work more efficiently and does therefore, in the vast majority of cases, outperform SKW in terms of execution time. It is most efficient for small image sizes, but is outperformed by both PBA and SKW for larger images [18].
2.6 Other Parallel Algorithms
All of the algorithms discussed above in some way do or can use the Voronoi diagram to calculate the EDT of an image. There are other algorithms that do not rely on this intermediate representation but rather calculate the EDT purely numerically; one such implementation has been presented by Zampirolli & Filipe [2]. These implementations can be very efficient with little overhead, but they are not suitable for adaptation to AAEDT. This is because the calculations in AAEDT are dependent on the direction of vectors between pixels, as well as the value of the feature pixel, which requires the exact coordinates of the pixel to be known; this is precisely the information stored in a Voronoi diagram. Given the coordinates of all closest feature pixels, the distance to them can be calculated for each pixel using the distance function defined by Gustavson & Strand [7]. An implementation of this function can be found in appendix A.
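For reference, the per-pixel distance to the ideal edge can be sketched following the case analysis published by Gustavson & Strand [7]. The thesis's actual implementation is in appendix A, so the structure below is an approximation of that function, not a reproduction; gx, gy is the local gradient and a the pixel's coverage value:

```cpp
#include <cassert>
#include <cmath>

// Approximate signed distance from a pixel centre to the ideal edge through
// the pixel, given the local gradient (gx, gy) and coverage a in [0, 1].
// Three cases, depending on where the straight edge cuts the pixel.
double edgedf(double gx, double gy, double a) {
    if (gx == 0.0 || gy == 0.0)  // axis-aligned edge: linear in coverage
        return 0.5 - a;
    double len = std::sqrt(gx * gx + gy * gy);
    gx = std::fabs(gx / len);
    gy = std::fabs(gy / len);
    if (gx < gy) std::swap(gx, gy);  // normalize so gx >= gy >= 0
    double a1 = 0.5 * gy / gx;       // coverage where the edge cuts a corner
    if (a < a1)                      // edge clips one corner triangle
        return 0.5 * (gx + gy) - std::sqrt(2.0 * gx * gy * a);
    if (a < 1.0 - a1)                // edge crosses two opposite sides
        return (0.5 - a) * gx;
    return -0.5 * (gx + gy) + std::sqrt(2.0 * gx * gy * (1.0 - a));
}
```

The total anti-aliased distance for a pixel then combines the vector to the closest edge pixel with this within-pixel edge offset, which is why the algorithms must propagate coordinates rather than plain distances.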
3 Method
This chapter describes how this thesis was conducted. The modifications to the chosen algorithms are presented, as well as how these algorithms were evaluated when applied to the problem of calculating the AAEDT.
3.1 Implementation
The changes required to implement the algorithms to calculate the AAEDT instead of the binary EDT are described below.
All implementations were coded in C++, using CUDA runtime 10.1 and Microsoft Visual Studio. All algorithms had in common that they now required an extra initialization of the image's gradient, which is used by the new distance function for AAEDT, found in appendix A. The focus when creating the implementations was readability and correctness, meaning that there is known room for improvement in terms of performance.
3.1.1 Parallel Banding Algorithm
Implementing the first and third phases of PBA required little modification of the original algorithm for the binary transform. All that was required was to use the new distance measure for AAEDT. The second phase needed to be adapted slightly. In order to determine whether or not a pixel is dominated by two other pixels, the original measure using the center point of each pixel is insufficient. A correct method of determining whether a pixel is dominated or not would require knowledge of where the ideal edge falls on the pixel. While this is hard to calculate exactly, an approximate location of the edge can be calculated using the local gradient of the image for the feature pixel (g) and the distance to the edge from the pixel center (df); these measurements are shown in figure 3.1.
Figure 3.1: New point a' for determining domination.
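This edge-point approximation can be sketched as follows. The sketch is hypothetical: the direction convention, moving from the pixel centre along the normalised gradient by df, is an assumption and may differ in sign from the thesis's implementation:

```cpp
#include <cassert>
#include <cmath>

struct Vec2 { double x, y; };

// Replacement point a' for the domination test: the pixel centre a moved
// along the (normalised) local gradient g by the estimated distance df to
// the ideal edge. Sign convention assumed, not taken from the thesis.
Vec2 edgePoint(Vec2 a, Vec2 g, double df) {
    double len = std::sqrt(g.x * g.x + g.y * g.y);
    if (len == 0.0) return a;  // no gradient information: keep the centre
    return { a.x + df * g.x / len, a.y + df * g.y / len };
}
```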
3.1.2 SKW
No major modifications were required for this algorithm to work for AAEDT instead of binary EDT. The Voronoi diagram is created in the same way as in the original algorithm, only now it utilizes the new distance measure to determine which feature pixels are closer.
3.1.3 Jump Flooding Algorithm
Just like with SKW, adapting this algorithm to work for AAEDT instead of binary EDT was more or less trivial. Simply replacing the distance measure with the one for anti-aliased images was sufficient to produce the Voronoi diagram and distance transform.
3.2 Evaluation
The three algorithms were evaluated based on the two categories of metrics described in this section: time complexity and execution time, as well as errors in distance approximations. All tests were run on the same hardware, an Intel Core i7-3770 CPU @ 3.40 GHz and an NVIDIA GeForce GTX 1050, running Windows 7. The algorithms were tested on the image seen in figure 3.2. An image consisting of circles was chosen to simplify calculation of the actual distance to feature objects in the image. The tests were performed on square images with side lengths equal to 64, 128, 256, 512, 1024 and 2048 pixels. The images were created by uniform supersampling of the source image with a resolution of 16 x 16 samples per pixel.
Figure 3.2: Input image for evaluation tests.
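The construction of an anti-aliased input can be sketched as follows, assuming a single circle (the actual test image contains several feature objects; the function name is illustrative):

```cpp
#include <cassert>
#include <vector>

// Each pixel's value is the fraction of 16 x 16 uniform subsamples that
// fall inside the circle (cx, cy, r), giving coverage values in [0, 1].
std::vector<double> circleCoverage(int w, int h, double cx, double cy, double r) {
    const int s = 16;  // samples per pixel side
    std::vector<double> img(w * h);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            int inside = 0;
            for (int j = 0; j < s; ++j)
                for (int i = 0; i < s; ++i) {
                    double sx = x + (i + 0.5) / s;  // subsample position
                    double sy = y + (j + 0.5) / s;
                    double dx = sx - cx, dy = sy - cy;
                    if (dx * dx + dy * dy <= r * r) ++inside;
                }
            img[y * w + x] = double(inside) / (s * s);
        }
    return img;
}
```

Pixels fully inside the circle get value 1, fully outside 0, and pixels cut by the circle boundary get the fractional coverage that the AAEDT distance function relies on.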
3.2.1 Time Complexity & Execution Time
The time complexities of the studied algorithms are known and have been compared for the binary transform by Cao et al. [18] and Juelg et al. [9], and should not change for this new application. What is of more interest in this thesis is how implementations for calculating the anti-aliased transform compare to each other, given the more computationally demanding distance measure.
Time measurements were done using the C++ std::chrono library. Time was measured for the algorithmic calculations only; memory allocation was done before measuring execution time.
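The timing pattern can be sketched with an illustrative helper (not from the thesis). Note that for CUDA kernels, a cudaDeviceSynchronize() call would be needed inside the timed region, since kernel launches are asynchronous:

```cpp
#include <cassert>
#include <chrono>

// Time only the supplied work, in milliseconds; allocation and other setup
// are done by the caller before invoking this.
template <typename F>
double timeMs(F&& work) {
    auto t0 = std::chrono::steady_clock::now();
    work();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

steady_clock is used rather than system_clock because it is monotonic and therefore unaffected by clock adjustments during a run.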
3.2.2 Error in Distance Approximations
Given that all three algorithms use slightly different methods of propagating data throughout the image, different types of errors can potentially occur. The distance approximations of each algorithm were compared to two different oracles: the actual distance in the over-sampled image, consisting of circles, and the result of a brute-force implementation of AAEDT, which always gives the shortest distance given the distance function used by these algorithms. Due to the brute-force algorithm's poor time complexity, it was not used as an oracle for the largest image size.
Three main measures were used to evaluate the correctness of the algorithms: the average and maximum error of pixels in a transformed image were derived when compared to both oracles, and the number of pixels assigned the same distance as when transformed using the brute-force implementation was also counted. These measures were chosen to make a comparison of the algorithms easy, and are similar to the ones used to produce the results by Gustavson & Strand [7].
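The three measures can be computed in one pass over the image; a minimal sketch with illustrative names:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

struct ErrorStats {
    double avg;         // average absolute error against the oracle
    double max;         // maximum absolute error against the oracle
    double exactRatio;  // fraction of pixels matching the oracle exactly
};

// Compare a transformed image against an oracle transform of equal size.
ErrorStats compareToOracle(const std::vector<double>& result,
                           const std::vector<double>& oracle) {
    double sum = 0.0, maxErr = 0.0;
    int exact = 0;
    for (size_t i = 0; i < result.size(); ++i) {
        double e = std::fabs(result[i] - oracle[i]);
        sum += e;
        maxErr = std::max(maxErr, e);
        if (e == 0.0) ++exact;
    }
    return { sum / result.size(), maxErr, double(exact) / result.size() };
}
```

The exact-match count only makes sense against the brute-force oracle, which uses the same distance function and image resolution as the tested algorithms.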
4 Results
This chapter aims to present the results produced in this thesis. These results and their implications will be more thoroughly discussed in the following chapters.
4.1 Time Complexity & Execution Time
Figure 4.1 shows the execution time of the three implemented algorithms. The values are an average of ten runs of each algorithm for each image size. PBA was tuned for each image size using its parameters to ensure the best possible performance. Parameter m3 was fixed to 32 for reasons discussed in section 5.2.
Figure 4.1: Algorithms' execution times for different image sizes. (Axes: image size (side length) [pixels] vs. time [ms]; curves: SKW, JFA, PBA.)
4.2 Distance Error
Figure 4.2 shows the average absolute error produced by the three algorithms, when compared to the two oracles: the ideal distance transform, using the geometric definitions of the source image, and the brute-force implementation of AAEDT.
Figure 4.2: Average distance error compared to the ideal transform (left) and the brute-force implementation (right). (Axes: image size (side length) [pixels] vs. error distance [pixels]; curves: SKW, JFA, PBA.)
Figure 4.3 shows the maximum error produced by the three algorithms, when compared to the two oracles. Note that compared to figure 4.2 the scale of the vertical axis differs by almost two orders of magnitude.
Figure 4.3: Maximum distance error compared to the ideal transform (left) and the brute-force implementation (right). (Axes: image size (side length) [pixels] vs. error distance [pixels]; curves: SKW, JFA, PBA.)
Figure 4.4 shows for what proportion of the total number of pixels the three algorithms produced the exact same result as the brute-force implementation.
Figure 4.4: The ratio of correct (according to the brute-force implementation) pixels to total pixels, for different image sizes. (Axes: image size (side length) [pixels] vs. quotient; curves: SKW, JFA, PBA.)
Figure 4.5 shows the error images of SKW and JFA, when executed on an image of resolution 512 x 512.
Figure 4.5: Error of SKW (left) and JFA (right) on a 512 x 512 image, compared to the ideal transform. Medium gray is zero error, red is an overestimation of +0.1, blue is an underestimation of -0.1. Errors are clamped to [-0.1, +0.1].
Figure 4.6 shows the error image of PBA, when run on an image of resolution 512 x 512. A difference in the output image is present along the upper right edge
of the lowest region when using different values for m3. These artifacts are more thoroughly discussed in section 5.2, and are more pronounced in figure 5.1.
Figure 4.6: Error of PBA on a 512 x 512 image, with m3 = 32 (left) and m3 = 1 (right), compared to the ideal transform. Medium gray is zero error, red is an overestimation of +0.1, blue is an underestimation of -0.1. The maximum absolute error is much greater than 0.1, but errors are clamped to [-0.1, +0.1].
Figure 4.7 shows the correlation between the PBA parameter m3 and the absolute average and maximum error in an image of resolution 512 x 512. It shows that a low value for m3 causes great error in some pixels, but does little to the average error.
Figure 4.7: Average (left) and maximum (right) distance error of PBA for different values of m3, on a 512 x 512 image.
5 Discussion
This chapter will discuss the results and how they relate to the research questions of this thesis. Other points of interest that were discovered during this project will also be addressed.
5.1 Implementation-Specific Problems
In the experiments run, the execution times of the implementations are longer than expected. This is likely due to little effort being made to improve memory coalescing, and to under-utilizing the available CUDA libraries. The implementations are rough, and should be improved upon if used for any real-time execution.
The relative performance of the algorithms is mostly consistent with the results of Cao et al. [18]: JFA and PBA perform best for smaller images. For larger image sizes, SKW becomes more viable, since the greater degree of parallelization utilized by PBA and JFA cannot be achieved on the GPU used. It is also clear that the sub-optimal work performed by JFA results in poor performance for larger images.
5.2 PBA-Specific Problems
The greatest source of error in PBA occurs in the second and third phases of the algorithm. The original calculations, for eliminating candidates from the stack of proximate pixels, rely on every point lying on a strict grid. For the anti-aliased transform this condition is broken, resulting in the mostly vertical regions of greater error seen in figure 4.6. In the non-binary case, pixels in the same column can have different closest feature pixels on the same row. This is because the anti-aliased image can represent edges with a slope that is not strictly horizontal or
vertical, unlike a binary image. This also results in some artifacts around perimeters of Voronoi regions, due to the stack of proximate feature pixels not being in proper order. The parameter m3 changes the severity of these artifacts, once again seen in figure 4.6. With a small value, the stack can be popped prematurely, and pixels in the following band miss some feature pixels which they should not. Alternatively, the stack is not popped for the last pixel in a band, and the following feature pixels are not considered for the pixels in the current band. With greater values for m3 the problem is less prominent, as there are fewer bands. For the same reason, the problem produces greater errors for larger images, as seen in figure 4.3, due to larger images requiring more bands to cover them. Figure 5.1 shows an example of this problem, where PBA has assigned a pixel the wrong closest feature pixel. The same type of error occurs at several points in the image, presenting as the small white spots along a line through the spot circled in blue.
This problem might be solvable with a more rigorous calculation for determining domination of candidates for proximate pixels, possibly requiring a new definition of what constitutes a "proximate feature pixel"; this, however, is beyond the scope of this thesis.
Figure 5.1: The gray-scale error image of PBA, with m3 = 32, of a 2048 x 2048 image, zoomed in on the pixel with greatest absolute error. The pixel circled in blue has been assigned the pixel circled in red as its closest feature pixel. The actual closest pixel is circled in green.
For most of the experiments, m3 was fixed to 32. A smaller value would be utilized better by the implementation. Due to how CUDA distributes work on the GPU [1], the hardware is under-utilized for m3 < 32, unless more advanced synchronization is used to allow a block of threads to work on multiple columns simultaneously. In the implementation used here, no such efforts were made. If this were changed, the execution time of PBA should scale better for larger image sizes. But before the issues with its correctness are addressed, such optimizations are of little interest.
5.3 Errors in Data Collection
This section will discuss sources of errors and biases in the experiments and the attempts to mitigate them.
5.3.1 Execution Time
As mentioned in section 3.2, all experiments were run on the same hardware. The tests were all run ten times to mitigate any errors due to variance in the deployment environment. Because Windows 7 was running on the same hardware, including the GPU, some processes would naturally interfere with the experiments. The environments each test was performed in were made as similar as possible, and background processes were kept to a minimum. While having a GPU dedicated solely to calculations would eliminate some variance in the results, the setup used for these experiments might be a more realistic representation of a typical user environment.
The implementations are, as mentioned, not comparable to other published implementations. This is partly due to known design decisions, such as focusing on readability and not utilizing texture memory to increase performance. The main result is that the relative performance of these implementations of the algorithms is not affected by the new distance measure used in AAEDT.
5.3.2 Input Data
Some feature objects in the source image are too small to be accurately represented in images of lower resolution. The distance measure used in AAEDT assumes that edges through a pixel are approximately straight; for images with smaller-than-pixel features this is not the case and errors occur. This is clearly seen in figure 4.2, where the error of all algorithms is greater for smaller images when compared to the ideal transform. This is not the case when comparing to the brute-force implementation, as it too uses the same resolution images.
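The straight-edge assumption can be checked numerically. The sketch below is a hypothetical host-side port of the edgedf() approximation from appendix A (double precision, with the invented name edgeDistance): for a 45-degree edge covering a = 0.25 of a pixel, it recovers the exact center-to-edge distance (1 - sqrt(2a))/sqrt(2) ≈ 0.2071. That exactness holds only because the edge is straight; a smaller-than-pixel feature with the same coverage a has a different true distance, which is the error source discussed above.

```cpp
#include <cassert>
#include <cmath>
#include <utility>

// Host-side port of the edgedf() approximation from appendix A.
// g = (gx, gy) is the local gradient; a is the pixel coverage in [0, 1].
// Returns the signed distance from the pixel center to the edge, assuming
// the edge through the pixel is a straight line.
double edgeDistance(double gx, double gy, double a) {
    if (gx == 0.0 || gy == 0.0) return 0.5 - a;  // axis-aligned edge
    double len = std::sqrt(gx * gx + gy * gy);
    gx = std::fabs(gx / len);
    gy = std::fabs(gy / len);
    if (gx < gy) std::swap(gx, gy);  // move to first octant, gx >= gy
    double a1 = 0.5 * gy / gx;
    if (a < a1)       return 0.5 * (gx + gy) - std::sqrt(2.0 * gx * gy * a);
    if (a < 1.0 - a1) return (0.5 - a) * gx;
    return -0.5 * (gx + gy) + std::sqrt(2.0 * gx * gy * (1.0 - a));
}
```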
Even though several different resolutions are used for the input images, only one type of source image was used for the experiments. The chosen image contains edges of all orientations as well as concave and convex features; despite this, results may be skewed by the choice of image. It is important to note that approximately 40% of the pixels in the images are feature pixels, which are trivial for all three algorithms to handle correctly, as each is its own closest pixel and no propagation of information is necessary.
5.4 Evaluation of Algorithms
In this section the results presented in chapter 4 are more thoroughly analyzed and summarized.
5.4.1 PBA
The experiments performed in this thesis show that the adaptation of PBA to calculate the AAEDT presented here does not produce results with the same level of correctness as the other, simpler algorithms. Figure 4.4 shows that as many as 30-40% of pixels are assigned the wrong value.
The algorithm still maintains good levels of parallelism, and executes within
reasonable time. However, for larger image sizes this requires the parameter m3
to be assigned a small value, which has been shown to further increase the level of error produced in the output image.
5.4.2 SKW
This algorithm clearly under-utilizes the hardware due to its poor parallelism, but does scale better than JFA for larger images.
It produces results with a low average absolute error, and while it does produce the wrong result for a significant proportion of pixels, the largest error it produces is small.
5.4.3 JFA
This algorithm performs well on smaller images in terms of execution time. It does, however, scale poorly with larger image sizes, just like implementations for the binary transform.
The algorithm produces results with the smallest average absolute error and gets nearly all pixels correct according to the brute-force implementation of AAEDT. Its maximum absolute error is, however, not as good as the average case. While it does not produce errors as large as PBA does, it is not as consistent as SKW, especially for the larger image sizes.
6 Conclusions
This chapter summarizes the findings presented in this thesis. In addition, some areas of work that could continue what has been done here are mentioned.
6.1 Research Questions
The research questions stated in chapter 1 were:
1. Are there any significant differences between EDT and AAEDT which affect parallelizability?
2. How do the chosen algorithms perform in terms of execution time and accuracy, when calculating AAEDT?
In order to answer the first question, PBA is the most interesting algorithm to examine. For the inexact algorithms JFA and SKW, there are no attributes of AAEDT which affect the performance of the algorithms in any significant way, at least none that were discovered during the work on this thesis. However, PBA no longer produces the correct transform as it did for the binary transform. For AAEDT the features no longer lie on a strict grid. Because of this, the calculations used to eliminate features as potential Voronoi pixels, in the second and third phases of the algorithm, no longer work as intended and produce erroneous results. These errors can, to an extent, be mitigated, but only by increasing the work done by the algorithm in the third phase, which is detrimental to its performance and parallelizability.
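The grid assumption can be made concrete with the perpendicular-bisector test from the second phase (ported here to host-side C++ from the lineColumnIntersection() and isDominating() helpers in appendix D; the site coordinates below are invented for illustration). For two sites lying exactly in a pixel column, the bisector crosses that column halfway between them, which is what the binary PBA elimination relies on; displacing one site by a sub-pixel edge offset, as AAEDT effectively does, moves the crossing point, so dominance decisions based on on-grid positions can eliminate the wrong site.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

struct Pt { double x, y; };

// Host-side port of lineColumnIntersection() from appendix D: the
// y-coordinate where the line through a and b crosses the column x = c.
double lineColumnIntersection(Pt a, Pt b, double c) {
    double a1 = b.y - a.y;
    double b1 = a.x - b.x;
    double c1 = a1 * a.x + b1 * a.y;
    double det = -b1;  // a1*b2 - a2*b1 with (a2, b2, c2) = (1, 0, c)
    if (det == 0.0) return std::numeric_limits<double>::max();
    return (a1 * c - c1) / det;
}

// y-coordinate where the perpendicular bisector of sites s and t crosses
// column x = c; below this point t is closer, above it s is closer.
double bisectorCrossing(Pt s, Pt t, double c) {
    Pt mid  = { (s.x + t.x) / 2.0, (s.y + t.y) / 2.0 };
    Pt mid2 = { mid.x - (t.y - s.y), mid.y + (t.x - s.x) };
    return lineColumnIntersection(mid, mid2, c);
}
```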
In terms of execution time the algorithms perform just like the versions for EDT, relative to each other. They are slow compared to implementations used in earlier work, which is partly due to the heavier calculations required by AAEDT, but also due to some less than optimal design present in these specific implementations.
In terms of accuracy, JFA performs best in the general case out of the three tested algorithms. It produces the correct distances for close to all pixels in an image. However, the few errors it does produce can be of significant magnitude, and these errors seem to be more prominent for larger images. SKW produces a greater number of errors than JFA, but is more consistent and produces errors of smaller magnitude. PBA is the least accurate of the three algorithms and produces artifacts with errors of massive magnitude compared to the other algorithms.
6.2 Future Work
There are multiple things that could be done to expand on the work presented in this thesis.
Firstly, the algorithms were only tested on a single source image, at different resolutions. The algorithms might perform differently when the input image consists of other shapes of features and other proportions of feature to non-feature pixels.
Secondly, there is much room for optimization in the implementations evaluated. More efficient implementations of the algorithms could be compared to their binary counterparts to measure how expensive the new distance measure of the AAEDT is.
Lastly, only one possible adaptation of PBA is presented here. Some other method of producing the stack of proximate feature pixels in the algorithm's second phase might solve some of the problems observed in this thesis.
A Helper Functions
/*
Copyright (C) 2009 Stefan Gustavson (stefan.gustavson@gmail.com)

This software is distributed under the permissive "MIT License":

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
*/
/*
 * A somewhat tricky function to approximate the distance to an edge in a
 * certain pixel, with consideration to either the local gradient (gx,gy)
 * or the direction to the pixel (dx,dy) and the pixel greyscale value a.
 * The latter alternative, using (dx,dy), is the metric used by edtaa2().
 * Using a local estimate of the edge gradient (gx,gy) yields much better
 * accuracy at and near edges, and reduces the error even at distant pixels
 * provided that the gradient direction is accurately estimated.
 */
struct fpoint { float x, y; };
struct ipoint { int x, y; };
struct uipoint { unsigned int x, y; };
struct stack { unsigned int top, bot; };
__device__ float edgedf(fpoint g, float a) {
    float df, glength, temp, a1;

    if ((g.x == 0) || (g.y == 0)) { // Either A) gx or gy are zero, or B) both
        df = 0.5 - a; // Linear approximation is A) correct or B) a fair guess
    }
    else {
        glength = sqrtf(g.x * g.x + g.y * g.y);
        if (glength > 0) {
            g.x = g.x / glength;
            g.y = g.y / glength;
        }
        /* Everything is symmetric wrt sign and transposition,
         * so move to first octant (gx>=0, gy>=0, gx>=gy) to
         * avoid handling all possible edge directions.
         */
        g.x = fabs(g.x);
        g.y = fabs(g.y);
        if (g.x < g.y) {
            temp = g.x;
            g.x = g.y;
            g.y = temp;
        }
        a1 = 0.5f * g.y / g.x;
        if (a < a1) { // 0 <= a < a1
            df = 0.5f * (g.x + g.y) - sqrt(2.0f * g.x * g.y * a);
        }
        else if (a < (1.0f - a1)) { // a1 <= a <= 1-a1
            df = (0.5f - a) * g.x;
        }
        else { // 1-a1 < a <= 1
            df = -0.5f * (g.x + g.y) + sqrt(2.0f * g.x * g.y * (1.0f - a));
        }
    }
    return df;
}
__device__ float dist(ipoint from, ipoint to, fpoint grad, float* img, uipoint imgSize) {

    if (to.x < 0 || to.y < 0) return imgSize.x * 3.0f;

    int id = to.y * imgSize.x + to.x;
    float a = img[id];
    if (a > 1.0f) a = 1.0f;
    if (a < 0.0f) a = 0.0f;
    if (a == 0.0f) return imgSize.x * 3.0f;

    float dx = to.x - from.x;
    float dy = to.y - from.y;
    float di = sqrtf(dx * dx + dy * dy);

    float df;
    if (di == 0.0f) {
        df = fmaxf(edgedf(grad, a), 0.0f);
    }
    else {
        df = edgedf({ dx, dy }, a);
    }
    return di + df;
}
/* Initialize gradients and closest pixels */
__global__ void setup(ipoint * closest, fpoint * grad, float* img, uipoint imgSize)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;
    int id = idy * imgSize.x + idx;
#define SQRT2 1.4142136f
    if (img[id] > 0.0f) { /* reconstructed: this line was lost at a page break; only covered pixels are sites */
        closest[id].x = idx;
        closest[id].y = idy;
        if (idx > 0 && idx < imgSize.x - 1 && idy > 0 && idy < imgSize.y - 1) {
            grad[id].x = -img[id - imgSize.x - 1] - SQRT2 * img[id - 1] - img[id + imgSize.x - 1]
                       + img[id - imgSize.x + 1] + SQRT2 * img[id + 1] + img[id + imgSize.x + 1];
            grad[id].y = -img[id - imgSize.x - 1] - SQRT2 * img[id - imgSize.x] - img[id - imgSize.x + 1]
                       + img[id + imgSize.x - 1] + SQRT2 * img[id + imgSize.x] + img[id + imgSize.x + 1];

            float g = grad[id].x * grad[id].x + grad[id].y * grad[id].y;
            if (g > 0.0f) {
                g = sqrtf(g);
                grad[id].x /= g;
                grad[id].y /= g;
            }
        }
        else {
            grad[id].x = 0;
            grad[id].y = 0;
        }
    }
    else {
        closest[id].x = -1;
        closest[id].y = -1;

        grad[id].x = 0;
        grad[id].y = 0;
    }
}
B SKW Code
__global__ void sweep(ipoint * voronoi, fpoint * grad, float* img, float* out, uipoint imgSize, int i, bool isForward, bool isRow) {

    int idx = blockIdx.x * blockDim.x + threadIdx.x + i * isRow;
    int idy = blockIdx.y * blockDim.y + threadIdx.y + i * !isRow;
    int id = idy * imgSize.x + idx;
    ipoint neighbors[4];
    neighbors[0] = { idx, idy };
    if (isForward && isRow) {
        neighbors[1] = { idx - 1, idy + 1 };
        neighbors[2] = { idx - 1, idy };
        neighbors[3] = { idx - 1, idy - 1 };
    }
    else if (!isForward && isRow) {
        neighbors[1] = { idx + 1, idy + 1 };
        neighbors[2] = { idx + 1, idy };
        neighbors[3] = { idx + 1, idy - 1 };
    }
    else if (isForward && !isRow) {
        neighbors[1] = { idx - 1, idy - 1 };
        neighbors[2] = { idx, idy - 1 };
        neighbors[3] = { idx + 1, idy - 1 };
    }
    else if (!isForward && !isRow) {
        neighbors[1] = { idx - 1, idy + 1 };
        neighbors[2] = { idx, idy + 1 };
        neighbors[3] = { idx + 1, idy + 1 };
    }
    float bestDist = imgSize.x * 3.0f;
    ipoint bestSite = { -1, -1 };
    for (int i = 0; i < 4; ++i) {
        ipoint n = neighbors[i];
        if (n.x >= imgSize.x || n.x < 0 || n.y >= imgSize.y || n.y < 0) continue;
        ipoint nSite = voronoi[n.x + n.y * imgSize.x];

        float newDist = (nSite.x < 0) ? imgSize.x * 3.0f : dist({ idx, idy }, nSite, grad[nSite.x + nSite.y * imgSize.x], img, imgSize);

        if (newDist < bestDist) {
            bestDist = newDist;
            bestSite = nSite;
        }
    }
    voronoi[id] = bestSite;
    out[id] = bestDist;
}
void SKW(float* edt, float* img, uipoint size)
{
    float* dev_img = 0;
    fpoint* dev_grad = 0;
    float* dev_edt = 0;
    ipoint* dev_voronoi = 0;

    cudaSetDevice(0);

    cudaMalloc((void**)& dev_grad, size.x * size.y * sizeof(fpoint));
    cudaMalloc((void**)& dev_edt, size.x * size.y * sizeof(float));
    cudaMalloc((void**)& dev_img, size.x * size.y * sizeof(float));

    cudaMemcpy(dev_img, img, size.x * size.y * sizeof(float), cudaMemcpyHostToDevice);

    cudaMalloc((void**)& dev_voronoi, size.x * size.y * sizeof(ipoint));

    dim3 block = { 8, 8 };
    dim3 grid = { size.x / 8, size.y / 8 };

    setup <<<grid, block>>> (dev_voronoi, dev_grad, dev_img, size);
    cudaDeviceSynchronize();

    block = { 1, 32 };
    grid = { 1, size.y / 32 };

    for (int i = 1; i < size.x; ++i) {
        sweep <<<grid, block>>> (dev_voronoi, dev_grad, dev_img, dev_edt, size, i, true, true);
        cudaDeviceSynchronize();
    }
    for (int i = size.x - 2; i >= 0; --i) {
        sweep <<<grid, block>>> (dev_voronoi, dev_grad, dev_img, dev_edt, size, i, false, true);
        cudaDeviceSynchronize();
    }

    block = { 32, 1 };
    grid = { size.x / 32, 1 };
    for (int i = 1; i < size.x; ++i) {
        sweep <<<grid, block>>> (dev_voronoi, dev_grad, dev_img, dev_edt, size, i, true, false);
        cudaDeviceSynchronize();
    }
    for (int i = size.x - 2; i >= 0; --i) {
        sweep <<<grid, block>>> (dev_voronoi, dev_grad, dev_img, dev_edt, size, i, false, false);
        cudaDeviceSynchronize();
    }

    cudaMemcpy(edt, dev_edt, size.x * size.y * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dev_grad);
    cudaFree(dev_img);
    cudaFree(dev_edt); /* reconstructed: this line was lost at a page break */
    cudaFree(dev_voronoi);

    return;
}
C JFA Code
__global__ void propagateSites(ipoint * closest, ipoint * voronoi, fpoint * grad, float* img, float* out, uipoint imgSize, int stepSize) {

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;
    int id = idy * imgSize.x + idx;
    ipoint neighbors[9] = { {idx - stepSize, idy + stepSize},
                            {idx, idy + stepSize},
                            {idx + stepSize, idy + stepSize},
                            {idx - stepSize, idy},
                            {idx, idy},
                            {idx + stepSize, idy},
                            {idx - stepSize, idy - stepSize},
                            {idx, idy - stepSize},
                            {idx + stepSize, idy - stepSize}
    };
    float bestDist = imgSize.x * 3.0f;
    ipoint bestSite = { -1, -1 };
    for (int i = 0; i < 9; ++i) {
        ipoint n = neighbors[i];
        if (n.x >= imgSize.x || n.x < 0 || n.y >= imgSize.y || n.y < 0) continue;
        ipoint nSite = closest[n.x + n.y * imgSize.x];

        float newDist = (nSite.x < 0) ? imgSize.x * 3.0f : dist({ idx, idy }, nSite, grad[nSite.x + nSite.y * imgSize.x], img, imgSize);

        if (newDist < bestDist) {
            bestDist = newDist;
            bestSite = nSite;
        }
    }
    voronoi[id] = bestSite;
    out[id] = bestDist;
}
void JFA(float* edt, float* img, uipoint size)
{
    float* dev_img = 0;
    fpoint* dev_grad = 0;
    float* dev_edt = 0;
    ipoint* dev_closest = 0;
    ipoint* dev_voronoi = 0;

    cudaSetDevice(0);

    cudaMalloc((void**)& dev_grad, size.x * size.y * sizeof(fpoint));
    cudaMalloc((void**)& dev_edt, size.x * size.y * sizeof(float));
    cudaMalloc((void**)& dev_img, size.x * size.y * sizeof(float));

    cudaMemcpy(dev_img, img, size.x * size.y * sizeof(float), cudaMemcpyHostToDevice);

    cudaMalloc((void**)& dev_closest, size.x * size.y * sizeof(ipoint));
    cudaMalloc((void**)& dev_voronoi, size.x * size.y * sizeof(ipoint));

    dim3 block = { 8, 8 };
    dim3 grid = { size.x / 8, size.y / 8 };

    setup <<<grid, block>>> (dev_closest, dev_grad, dev_img, size);
    cudaDeviceSynchronize();

    for (int i = size.x / 2; i > 0; i /= 2) {
        propagateSites <<<grid, block>>> (dev_closest, dev_voronoi, dev_grad, dev_img, dev_edt, size, i);

        ipoint* tmp = dev_closest;
        dev_closest = dev_voronoi;
        dev_voronoi = tmp;
        cudaDeviceSynchronize();
    }

    cudaMemcpy(edt, dev_edt, size.x * size.y * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dev_closest);
    cudaFree(dev_grad);
    cudaFree(dev_img);
    cudaFree(dev_edt);
    cudaFree(dev_voronoi);

    return;
}
D PBA Code
//==================================Phase1==================================

/* Calculate 1d-voronoi diagram for local band */
__global__ void phase1Sweep(ipoint * closest, fpoint * grad, float* img, int m1, uipoint imgSize)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    int idy = threadIdx.y + blockDim.y * blockIdx.y;

    int bandId = idy * imgSize.x + idx * (imgSize.x / m1);
    int pointX = idx * (imgSize.x / m1) + 1;

    // Sweep right
    for (int i = bandId + 1; i < bandId + (imgSize.x / m1); ++i) {

        float oldDist = dist({ pointX, idy }, closest[i], grad[i], img, imgSize);
        float newDist = dist({ pointX, idy }, closest[i - 1], grad[i - 1], img, imgSize);

        if (newDist < oldDist) {
            closest[i] = closest[i - 1];
        }

        ++pointX;
    }

    pointX -= 2;
    // Sweep left
    for (int i = bandId + (imgSize.x / m1) - 2; i >= bandId; --i) {

        float oldDist = dist({ pointX, idy }, closest[i], grad[i], img, imgSize);
        float newDist = dist({ pointX, idy }, closest[i + 1], grad[i + 1], img, imgSize);

        if (newDist < oldDist) {
            closest[i] = closest[i + 1];
        }

        --pointX;
    }
}
/* Propagate closest pixels across pixels on the edges of each band */
__global__ void phase1Prefix(ipoint * closest, fpoint * grad, float* img, int m1, uipoint imgSize, int stepSize, bool up, bool right)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    int idy = threadIdx.y + blockDim.y * blockIdx.y;

    int bid = 2 * (idx + 1) * stepSize - 1;
    int aid = bid - stepSize;

    if (!up) {
        bid += stepSize;
        aid += stepSize;
    }

    if (!right) {
        bid = 2 * m1 - bid - 1;
        aid = 2 * m1 - aid - 1;
    }

    int ax = (aid / 2 + aid % 2) * (imgSize.x / m1) - aid % 2;
    int bx = (bid / 2 + bid % 2) * (imgSize.x / m1) - bid % 2;

    aid = idy * imgSize.x + ax;
    bid = idy * imgSize.x + bx;

    if (dist({ bx, idy }, closest[aid], grad[aid], img, imgSize) < dist({ bx, idy }, closest[bid], grad[bid], img, imgSize)) {
        closest[bid] = closest[aid];
    }
}
/* Updates the bands based on the edge-pixels closest sites */
__global__ void phase1Update(ipoint * closest, fpoint * grad, float* img, int m1, uipoint imgSize) {
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    int idy = threadIdx.y + blockDim.y * blockIdx.y;

    int bandId = idy * imgSize.x + idx * (imgSize.x / m1);
    int pointX = idx * (imgSize.x / m1) + 1;

    for (int i = bandId + 1; i < bandId + (imgSize.x / m1) - 1; ++i) {
        if (dist({ pointX, idy }, closest[bandId], grad[bandId], img, imgSize) < dist({ pointX, idy }, closest[i], grad[i], img, imgSize)) {
            closest[i] = closest[bandId];
        }
        if (dist({ pointX, idy }, closest[bandId + (imgSize.x / m1) - 1], grad[bandId + (imgSize.x / m1) - 1], img, imgSize) < dist({ pointX, idy }, closest[i], grad[i], img, imgSize)) {
            closest[i] = closest[bandId + (imgSize.x / m1) - 1];
        }

        ++pointX;
    }
}

//==================================Phase2==================================
__device__ float lineColumnIntersection(fpoint a, fpoint b, float c) {

    float a1 = b.y - a.y;
    float b1 = a.x - b.x; /* reconstructed: this line was lost at a page break */
    float c1 = a1 * a.x + b1 * a.y;

    float a2 = 1.0f;
    float b2 = 0.0f;
    float c2 = c;

    float det = a1 * b2 - a2 * b1;

    if (det == 0) return FLT_MAX;

    return (a1 * c2 - a2 * c1) / det;
}

__device__ bool isDominating(fpoint a, fpoint b, fpoint c, float column) {
    fpoint p1 = { (a.x + b.x) / 2.0f, (a.y + b.y) / 2.0f };
    fpoint p2 = { p1.x - (b.y - a.y), p1.y + (b.x - a.x) };

    fpoint q1 = { (b.x + c.x) / 2.0f, (b.y + c.y) / 2.0f };
    fpoint q2 = { q1.x - (c.y - b.y), q1.y + (c.x - b.x) };

    float p = lineColumnIntersection(p1, p2, column);
    float q = lineColumnIntersection(q1, q2, column);

    return p <= q;
}
/* Calculate the proximate sites for each band */
__global__ void phase2sweep(ipoint * closest, fpoint * grad, float* img, int m2, uipoint imgSize, stack * stk, stack * stkIndex)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;

    int stkId = idx + idy * imgSize.x;
    int stkStart = idx + idy * imgSize.x * (imgSize.y / m2);
    int stkEnd = stkStart + imgSize.x * (imgSize.y / m2);

    while (closest[stkStart].x == -1 && stkStart < stkEnd) {
        stkStart += imgSize.x;
    }

    if (stkStart >= stkEnd) {
        stkIndex[stkId].top = -1;
        stkIndex[stkId].bot = -1;
        return;
    }

    stkIndex[stkId].bot = stkStart;
    stkIndex[stkId].top = stkStart;

    for (int i = stkStart + imgSize.x; i < stkEnd; i += imgSize.x) {
        int top = stkIndex[stkId].top;
        int second = stk[top].bot;

        if (closest[i].x == -1) {
            continue;
        }

        while (stkIndex[stkId].bot != stkIndex[stkId].top) {

            int aid = closest[i].x + closest[i].y * imgSize.x;
            fpoint ag = grad[aid];
            float adf = edgedf(ag, img[aid]);
            float agLength = sqrtf(ag.x * ag.x + ag.y * ag.y);
            if (agLength > 0.0f) {
                ag.x /= agLength; /* reconstructed: this line was lost at a page break */
                ag.y /= agLength;
            }

            int bid = closest[top].x + closest[top].y * imgSize.x;
            fpoint bg = grad[bid];
            float bdf = edgedf(bg, img[bid]);
            float bgLength = sqrtf(bg.x * bg.x + bg.y * bg.y);
            if (bgLength > 0.0f) {
                bg.x /= bgLength;
                bg.y /= bgLength;
            }

            int cid = closest[second].x + closest[second].y * imgSize.x;
            fpoint cg = grad[cid];
            float cdf = edgedf(cg, img[cid]);
            float cgLength = sqrtf(cg.x * cg.x + cg.y * cg.y);
            if (cgLength > 0.0f) {
                cg.x /= cgLength;
                cg.y /= cgLength;
            }

            fpoint a = { closest[i].x + ag.x * adf, closest[i].y + ag.y * adf };
            fpoint b = { closest[top].x + bg.x * bdf, closest[top].y + bg.y * bdf };
            fpoint c = { closest[second].x + cg.x * cdf, closest[second].y + cg.y * cdf };
            if (isDominating(a, b, c, idx)) {
                stkIndex[stkId].top = second;
                top = second;
                second = stk[top].bot;
            }
            else {
                break;
            }
        }
        stk[i].bot = top;
        stk[top].top = i;
        stkIndex[stkId].top = i;
    }
    stk[stkIndex[stkId].top].top = -1;
}
/* Calculate proximate sites for each column */
__global__ void phase2merge(ipoint * closest, fpoint * grad, float* img, int m2, uipoint imgSize, stack * stk, stack * stkIndex, int stepSize)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;
    int stkLowId = idx + idy * imgSize.x * stepSize * 2;
    int stkHighId = stkLowId + stepSize * imgSize.x;
    int stkStart = stkIndex[stkLowId].bot;
    int stkMid = stkIndex[stkHighId].bot;
    int stkEnd = stkIndex[stkHighId].top;

    if (stkStart == -1) {
        stkIndex[stkLowId] = stkIndex[stkHighId];
        return;
    }
    if (stkEnd == -1) {
        return;
    }
    bool isAlmostMerged = false;
    for (int i = stkMid; i <= stkEnd && stk[i].top > 0; i = stk[i].top) {
        int top = stkIndex[stkLowId].top;
        int second = stk[top].bot;
        while (stkIndex[stkLowId].bot != stkIndex[stkLowId].top) {

            int aid = closest[i].x + closest[i].y * imgSize.x;
            fpoint ag = grad[aid]; /* reconstructed: this line was lost at a page break */
            float adf = edgedf(ag, img[aid]);
            float agLength = sqrtf(ag.x * ag.x + ag.y * ag.y);
            if (agLength > 0.0f) {
                ag.x /= agLength;
                ag.y /= agLength;
            }

            int bid = closest[top].x + closest[top].y * imgSize.x;
            fpoint bg = grad[bid];
            float bdf = edgedf(bg, img[bid]);
            float bgLength = sqrtf(bg.x * bg.x + bg.y * bg.y);
            if (bgLength > 0.0f) {
                bg.x /= bgLength;
                bg.y /= bgLength;
            }

            int cid = closest[second].x + closest[second].y * imgSize.x;
            fpoint cg = grad[cid];
            float cdf = edgedf(cg, img[cid]);
            float cgLength = sqrtf(cg.x * cg.x + cg.y * cg.y);
            if (cgLength > 0.0f) {
                cg.x /= cgLength;
                cg.y /= cgLength;
            }

            fpoint a = { closest[i].x + ag.x * adf, closest[i].y + ag.y * adf };
            fpoint b = { closest[top].x + bg.x * bdf, closest[top].y + bg.y * bdf };
            fpoint c = { closest[second].x + cg.x * cdf, closest[second].y + cg.y * cdf };
            if (isDominating(a, b, c, idx)) {
                isAlmostMerged = false;
                stkIndex[stkLowId].top = second;
                top = second;
                second = stk[top].bot;
            }
            else if (!isAlmostMerged) {
                isAlmostMerged = true;
                break;
            }
            else {
                stk[i].bot = top;
                stk[top].top = i;
                stkIndex[stkLowId].top = stkIndex[stkHighId].top;
                return;
            }
        }
        stk[i].bot = top;
        stk[top].top = i;
        stkIndex[stkHighId].bot = stk[i].top;
        stkIndex[stkLowId].top = i;
    }
    stkIndex[stkLowId].top = stkIndex[stkHighId].top;
}

//==================================Phase3==================================
/* Calculate 2d voronoi diagram */
__global__ void phase3(ipoint * voronoi, ipoint * closest, fpoint * grad, float* out, int m3, float* img, uipoint imgSize, stack * stk, stack * stkIndex, int baseY)
{
    __shared__ bool done;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y + baseY;
    int id = idy * imgSize.x + idx;

    voronoi[id] = closest[stkIndex[idx].bot];
    out[id] = dist({ idx, idy }, closest[stkIndex[idx].bot], grad[closest[stkIndex[idx].bot].x + closest[stkIndex[idx].bot].y * imgSize.x], img, imgSize); /* partially reconstructed: the start of this line was lost at a page break */

    if (threadIdx.y == m3 - 1) done = false;
    __syncthreads();

    while (!done && stkIndex[idx].bot != stkIndex[idx].top) {
        int a = stkIndex[idx].bot;
        int b = stk[a].top;

        float distC = out[id];
        float distA = dist({ idx, idy }, closest[a], grad[closest[a].x + closest[a].y * imgSize.x], img, imgSize);
        float distB = dist({ idx, idy }, closest[b], grad[closest[b].x + closest[b].y * imgSize.x], img, imgSize);

        if (distA < distC && distA < distB) {
            voronoi[id] = closest[a];
            out[id] = distA;
            if (threadIdx.y == m3 - 1) done = true;
        }

        __syncthreads();
        if (distB < distA && distB < distC) {
            voronoi[id] = closest[b];
            out[id] = distB;
            stkIndex[idx].bot = b;
        }
        else if (threadIdx.y == m3 - 1) {
            done = true;
        }
        __syncthreads();
    }
}
void PBA(float* edt, float* img, uipoint size)
{
    float* dev_img = 0;
    fpoint* dev_grad = 0;
    ipoint* dev_closest = 0;
    float* dev_edt = 0;
    stack* dev_stack = 0;
    stack* dev_stackIndex = 0;
    ipoint* dev_voronoi = 0;

    cudaSetDevice(0);

    // Allocate GPU buffers
    cudaMalloc((void**)& dev_closest, size.x * size.y * sizeof(ipoint));
    cudaMalloc((void**)& dev_grad, size.x * size.y * sizeof(fpoint));
    cudaMalloc((void**)& dev_edt, size.x * size.y * sizeof(float));
    cudaMalloc((void**)& dev_img, size.x * size.y * sizeof(float));
    cudaMalloc((void**)& dev_stack, size.x * size.y * sizeof(stack));
    cudaMalloc((void**)& dev_stackIndex, size.x * param2 * sizeof(stack));
    cudaMalloc((void**)& dev_voronoi, size.x * size.y * sizeof(ipoint));

    // Copy image from host to device.
    cudaMemcpy(dev_img, img, size.x * size.y * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block = { 8, 8 };
    dim3 grid = { size.x / 8, size.y / 8 };

    setup <<<grid, block>>> (dev_closest, dev_grad, dev_img, size);
    cudaDeviceSynchronize();

    //==============================Phase1======================================

    block = { param1, 32 };
    grid = { 1, size.y / 32 };

    phase1Sweep <<<grid, block>>> (dev_closest, dev_grad, dev_img, param1, size);
    cudaDeviceSynchronize();

    for (unsigned int i = 1; i <= param1; i *= 2) {
        block = { param1 / i, 32 };
        grid = { 1, size.y / 32 };
        phase1Prefix <<<grid, block>>> (dev_closest, dev_grad, dev_img, param1, size, i, true, true);
        cudaDeviceSynchronize();
    }
    for (unsigned int i = param1 / 2; i >= 1; i /= 2) {
        block = { param1 / i - 1, 32 };
        grid = { 1, size.y / 32 };
        phase1Prefix <<<grid, block>>> (dev_closest, dev_grad, dev_img, param1, size, i, false, true);
        cudaDeviceSynchronize();
    }

    for (unsigned int i = 1; i <= param1; i *= 2) {
        block = { param1 / i, 32 };
        grid = { 1, size.y / 32 };
        phase1Prefix <<<grid, block>>> (dev_closest, dev_grad, dev_img, param1, size, i, true, false);
        cudaDeviceSynchronize();
    }
    for (unsigned int i = param1 / 2; i >= 1; i /= 2) {
        block = { param1 / i - 1, 32 };
        grid = { 1, size.y / 32 };
        phase1Prefix <<<grid, block>>> (dev_closest, dev_grad, dev_img, param1, size, i, false, false);
        cudaDeviceSynchronize();
    }

    block = { param1, 32 };
    grid = { 1, size.y / 32 };
    phase1Update <<<grid, block>>> (dev_closest, dev_grad, dev_img, param1, size);
    cudaDeviceSynchronize();

    //==============================Phase2======================================

    block = { 32, param2 };
    grid = { size.x / 32, 1 };

    phase2sweep <<<grid, block>>> (dev_closest, dev_grad, dev_img, param2, size, dev_stack, dev_stackIndex);
    cudaDeviceSynchronize();

    for (int i = 1; i < param2; i <<= 1) {
        block = { 32, param2 / (unsigned int)(i * 2) };
        grid = { size.x / 32, 1 };

        phase2merge <<<grid, block>>> (dev_closest, dev_grad, dev_img, param2, size, dev_stack, dev_stackIndex, i);
        cudaDeviceSynchronize();
    }

    //==============================Phase3======================================
    block = { 1, param3 };
    grid = { size.x, 1 };
    for (int i = 0; i < size.y / param3; ++i) {
        phase3 <<<grid, block>>> (dev_voronoi, dev_closest, dev_grad, dev_edt, param3, dev_img, size, dev_stack, dev_stackIndex, i * param3);
        cudaDeviceSynchronize();
    }

    cudaMemcpy(edt, dev_edt, size.x * size.y * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dev_closest);
    cudaFree(dev_grad);
    cudaFree(dev_img);
    cudaFree(dev_edt);
    cudaFree(dev_stack);
    cudaFree(dev_stackIndex);
    cudaFree(dev_voronoi);

    return;
}
Bibliography
[1] NVIDIA. CUDA C Programming Guide, v10.1.168. URL https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html. Retrieved 2019-06-05.
[2] F. d. A. Zampirolli and L. Filipe. A fast CUDA-based implementation for the Euclidean distance transform. In 2017 International Conference on High Performance Computing & Simulation (HPCS), pages 815-818, July 2017. doi: 10.1109/HPCS.2017.123.
[3] N. Angayarkanni and D. Kumar. Euclidean distance transform (EDT) algorithm applied to binary image for finding breast cancer. Biomedical & Pharmacology Journal, 8(1):407-411, 2015.
[4] Per-Erik Danielsson. Euclidean distance mapping, 1980.
[5] Ricardo Fabbri, Luciano Da Fontoura Costa, Julio C. Torelli, and Odemir Martinez Bruno. 2D Euclidean distance transform algorithms: A comparative survey. ACM Computing Surveys, 40(1), 2008.
[6] Chris Green. Improved alpha-tested magnification for vector textures and special effects. In ACM SIGGRAPH 2007 Courses: Advanced Real-Time Rendering in 3D Graphics and Games, chapter 2, 2007.
[7] Stefan Gustavson and Robin Strand. Anti-aliased Euclidean distance transform. Pattern Recognition Letters, 32(2):252-257, January 2011. doi: 10.1016/j.patrec.2010.08.010. URL https://doi.org/10.1016/j.patrec.2010.08.010.
[8] Tatsuya Hayashi, Koji Nakano, and Stephan Olariu. Optimal parallel algorithms for finding proximate points, with applications. IEEE Transactions on Parallel and Distributed Systems, 9(12):1153-1166, 1998. ISSN 1045-9219.