Citation
Parallelization of Two-Dimensional Skeletonization Algorithms

Material Information

Title:
Parallelization of Two-Dimensional Skeletonization Algorithms
Series Title:
Journal of Undergraduate Research
Creator:
Daya, Bhavya
Place of Publication:
Gainesville, Fla.
Publisher:
University of Florida
Publication Date:
2008
Language:
English

Subjects

Genre:
serial


Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.

Full Text

Parallelization of Two-Dimensional Skeletonization Algorithms


Bhavya Daya


College of Engineering, University of Florida


Though non-pixel based skeletonization techniques have advantages over traditional pixel-based methods such as faster processing
time, they are more complex. Pixel-based techniques, such as the use of distance transforms and thinning, are easier to implement, and
computational efficiency can be achieved by taking advantage of parallel computers. The purpose of this paper is to formulate, design,
compare, and optimize the parallelization of the pixel-based techniques of thinning and the use of distance transforms. It was shown
that the serial and parallel distance transform algorithm performs better than the serial and parallel thinning algorithm. Results show
that the proposed parallel algorithm is computationally efficient and closely aligns with the human's perception of the underlying
shape. Optimizations were performed without compromising the quality of the skeletonization.


INTRODUCTION

Skeletonization is an algorithm that reduces an image to
the set of pixels at the center of its shape, the skeleton.
Knowing the skeleton is useful because the original image
can be recreated from it. The skeleton is therefore used
when describing and manipulating an image with the least
amount of human intervention. Medical imaging
applications, fingerprint identification systems, and robotic
vision systems all rely on skeletonization to perform
accurately.
Due to its compact shape representation, image
skeletonization has been studied for a long time in
computer vision, pattern recognition, and optical character
recognition. It is a powerful tool for intermediate
representation for a number of geometric operations on
solid models. Many image processing applications depend
on the skeletonization algorithm. The parallelization of the
skeletonization algorithm creates a stepping stone for the
entire image processing application to enter the parallel
computing realm. Increasing the performance of the
skeletonization process will speed up the applications that
depend on the algorithm.
Though non-pixel based skeletonization techniques have
advantages over traditional pixel-based methods such as
faster processing time, they are more complex. Pixel-based
techniques, such as the use of distance transforms and
thinning, are easier to implement, and computational
efficiency can be achieved by taking advantage of parallel
computers. The purpose of this paper is to formulate,
design, compare, and optimize the parallelization of the
pixel-based techniques of thinning and the use of distance
transforms.

Image Skeletonization. An image skeleton is
presumed to represent the shape of the object in a relatively
small number of pixels, all of which are, in some sense,
structural and therefore necessary. The skeleton of an


object is conceptually defined as the locus of center pixels
in the object. [3] Unfortunately, no generally agreed upon
definition of a digital image skeleton exists. But for all
definitions, at least 4 requirements must be satisfied for
skeleton objects [3]:
1. Centeredness: the skeleton must be centered within the object boundary.
2. Preservation of connectivity: the output skeleton must have the same connectivity as the original object and should not contain any background elements.
3. Consistency of topology: the topology must remain constant.
4. Thinness: the output skeleton must be as thin as possible: 1 pixel thin for a 2D skeleton, 1 voxel thin in 3D.
Image skeletonization is one of the many morphological
image processing operations. By combining different
morphological image processing applications, an algorithm
can be obtained for many image processing tasks, such as
feature detection, image segmentation, image sharpening,
and image filtering. Image skeletonization is especially
suited to the processing of binary images or grayscale
images. Many sequential algorithms exist for achieving a
two-dimensional binary image skeletonization.
The simplest type of image, used widely in a variety of
industrial and medical applications, is binary: a
black-and-white or silhouette image [4]. A binary image
is usually stored in memory as a bitmap, a packed array of
bits. Binary image processing has several advantages but
some corresponding drawbacks, as illustrated in Table 1.
There are different categories of skeletonization methods:
one category is based on distance transforms, where a
specified subset of the transformed image is a distance
skeleton [1]. Another category is defined by thinning
approaches, where the result of skeletonization should be a
connected set of digital curves or arcs [1].








Table 1: Advantages and Disadvantages of Binary Images

Advantages:
* Easy to acquire
* Low storage: no more than 1 bit/pixel, and often this can be reduced, as such images are very amenable to compression
* Simple processing

Disadvantages:
* Limited application: as the representation is only a silhouette, application is restricted to tasks where internal detail is not required as a distinguishing characteristic
* Specialized lighting is required for silhouettes: it is difficult to obtain reliable binary images without restricting the environment


Figure 1: Original Image [3]

Figure 2: Intermediate Step [3]

Figure 3: Skeleton of Original Image [3]

Figure 4: Thinning applied to a 3D image, phases 0-4 [5]




Thinning Algorithms. Thinning algorithms are a very
active area of research, with a main focus on connectivity-
preserving methods allowing parallel implementation. The
images in Figures 1-3 display the results of a 2D
skeletonization thinning algorithm.
Thinning or erosion of the image is a method that
iteratively peels off the boundary layer by layer from
outside to inside. The removal does not affect the topology
of the image. This is a repetitive and time-consuming
process of testing and deletion of each pixel. It is good for
connectivity preservation. The problem with this approach
is that the set of rules defining the removal of a pixel is
highly dependent on the type of image and different sets of
rules will be applied to different types of images. Figure 4
is an image of the thinning process as applied to a three-
dimensional image. The phases are the thinning layers.

Distance Transform Algorithms. The distance
transform is the other common technique for obtaining the
medial axis, or skeleton, of the image. There are three main
types of distance transforms, based on chamfer distances,
the Euclidean distance, and Voronoi diagrams [5]. The simplest
approach for the skeletonization algorithm is the Euclidean
Distance Transform. This method is based on the distance
of each pixel to the boundary and tries to extract the
skeleton by finding the pixels in the center of the object;
these pixels are the furthest from the boundary. The
distance coding is based on the Euclidean distance.
This method is faster than the thinning approach and can
be done with a high degree of parallelism [1]. Unfortunately,
the output is not guaranteed to preserve connectivity.
The distance transform process applied to skeletonization
can be visualized as in Figures 5 and 6; the ridges of the
distance transform in Figure 6 belong to the skeleton.









METHODOLOGY

Thinning and distance transform are pixel-based techniques
that need to process every pixel in the image [1].
This can incur a long processing time and leads to reduced
efficiency. Various techniques have been developed and
implemented, but a large percentage of them possess
common faults that limit their use: noise sensitivity,
excessive processing time, and results not conforming to a
human's perception of the underlying object in the image
[1]. Since skeletonization is very often an intermediate step
towards object recognition, it should have a low
computational cost.
Figure 7 illustrates the methodology used to parallelize
the two-dimensional skeletonization algorithms. The serial
code for both algorithms was written for comparison with
the parallel algorithms. To verify that the serial code
achieved an appropriate skeletonization, its output was
compared to MATLAB's. MATLAB contains the
functionality to perform both the distance transform
skeletonization process and the thinning skeletonization
process, but uses different algorithms; its distance
transform skeletonization is achieved by performing the
Euclidean Distance Transform.
A shared memory architecture, the Marvel cluster, was
chosen for implementation. The machine contains 16
symmetric multiprocessors with 32 GB of memory. Both
algorithms use the domain decomposition approach. The
image matrix is stored in shared memory for access by any
processor. Each processor holds, in local memory, the piece
of the image matrix that is relevant to its calculations; if
other matrix values are required but not present in local
memory, they are fetched from shared memory.
The image is statically allocated by dividing it into strips.
Each processor, or thread, performs the computation on the
strip allocated to it. Each processor performs the same
computation on a different set of data, in this case a
different part of the image matrix. Figure 8 shows the strip
allocation of an image.
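As a concrete illustration of this strip decomposition, the row bounds for each thread can be computed as in the following C sketch; the function name and the row-major layout are illustrative assumptions, not the paper's code:

/* Compute the band of rows owned by thread t of nthreads for an
 * n x n image stored in row-major order; rows are dealt out as
 * evenly as possible. */
void strip_bounds(int t, int nthreads, int n, int *row_begin, int *row_end)
{
    int rows  = n / nthreads;      /* base number of rows per strip */
    int extra = n % nthreads;      /* leftover rows go to the first strips */
    *row_begin = t * rows + (t < extra ? t : extra);
    *row_end   = *row_begin + rows + (t < extra ? 1 : 0);   /* exclusive */
}

Each thread then loops over rows row_begin through row_end - 1 of the shared image matrix.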


A study performed by Klaiber & Levy [11], comparing
the performance of applications on different types of
machines, showed that the performance of message passing
machines can surpass shared memory machines.
Nevertheless, because there is a large amount of data to
process, and the communication between processors
without shared memory was predicted to be quite large, the
shared memory communication abstraction was chosen: a
shared memory machine was predicted to perform well
when the application requires data decomposition and large
amounts of data.
Implementation. The distance transform approach can be
accomplished by finding the Euclidean distance of each
pixel from the border of the image. Figure 9 shows an
example of the Euclidean distances of a two-dimensional
image. The pixels whose distances from the border are
greatest represent the skeleton of the image; connectivity is
the most difficult property to preserve.
The approach selected for the skeletonization algorithm
is the Chessboard Distance Transform (Figure 10). It has
been found that the Chessboard Distance Transform
provides a good approximation to the Euclidean Distance
Transform with a reduction in computation cost in both the
serial and parallel domains. Since the work is done at the
pixel level, the few errors introduced will not be visible to a
person viewing the image. This method is based on the
distance of each pixel to the boundary and tries to extract
the skeleton by finding the pixels in the center of the object;
these pixels are the furthest from the boundary (Figure 10).
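For reference, the two metrics involved have the standard textbook definitions (not reproduced from the paper): for pixels $p = (x_1, y_1)$ and $q = (x_2, y_2)$,

\[
  D_8(p, q) = \max\bigl(|x_1 - x_2|,\, |y_1 - y_2|\bigr), \qquad
  D_e(p, q) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2},
\]

so the chessboard distance $D_8$ never exceeds the Euclidean distance $D_e$ and differs from it by at most a factor of $\sqrt{2}$, which is why it serves as a cheap approximation.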


Figure 9: Euclidean Distance Transform [6]





Figure 5: Original Image [5]

Figure 6: Distance Transform [5]

Figure 8: Thread Allocation






Figure 7: Methodology Used to Parallelize the Skeletonization Algorithms. For each technique, an implementation method was chosen (distance transform: an approximation of the Euclidean Distance Transform, used to decrease computation; thinning: elimination performed based on rules in a table). The steps were then the same for both: implement the serial algorithm, design the parallel algorithm, perform PRAM analysis, and implement the parallel algorithm. The speedup and the image result of each algorithm were compared, the distance transform was selected for optimization, and the implementation was optimized for best speedup using (1) dynamic memory allocation and (2) ghost zones.



























Figure 10: Chessboard Distance Transform (an 8 x 8 binary image and its chessboard distance transform)


Thinning, or erosion of the image, iteratively peels off
the boundary, layer by layer, from outside to inside,
without affecting the topology of the image. As noted
above, it is a repetitive and time-consuming process of
testing and deleting each pixel, though it preserves
connectivity well. The drawback of this approach is that the
set of rules defining the removal of a pixel is highly
dependent on the type of image; thinning performs best on
alphabets and numbers. Table 2 shows the elimination rules
used during implementation.
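The structure of one thinning iteration, under the two-array scheme described in the caption of Figure 13, can be sketched in C as follows. The eliminate() predicate is a hypothetical stand-in for the lookup into the elimination table of Table 2 [10]; the array names and the fixed image size are also assumptions:

#include <string.h>

#define N 256   /* assumed image size */

/* Hypothetical hook for the table lookup: 'amount' is the number of
 * object pixels among the 8 neighbors, nbr holds those neighbors. */
extern int eliminate(int amount, const unsigned char nbr[8]);

/* One thinning pass: read from src, write to dst, and return the
 * number of pixels removed; the caller iterates until this is 0. */
int thin_once(const unsigned char src[N][N], unsigned char dst[N][N])
{
    static const int di[8] = { -1, -1, -1,  0, 0,  1, 1, 1 };
    static const int dj[8] = { -1,  0,  1, -1, 1, -1, 0, 1 };
    int removed = 0;

    memcpy(dst, src, (size_t)N * N);
    for (int i = 1; i < N - 1; i++) {
        for (int j = 1; j < N - 1; j++) {
            if (!src[i][j])
                continue;
            unsigned char nbr[8];
            int amount = 0;
            for (int k = 0; k < 8; k++) {
                nbr[k] = src[i + di[k]][j + dj[k]];
                amount += nbr[k];
            }
            /* Amounts 0 and 8 are never eliminated (Table 2): an
             * isolated pixel and an interior pixel both stay. */
            if (amount > 0 && amount < 8 && eliminate(amount, nbr)) {
                dst[i][j] = 0;
                removed++;
            }
        }
    }
    return removed;
}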

Implementation of Serial Algorithms. The serial
distance transform algorithm contains many data
dependencies: the image has to be traversed in row-major
order to achieve the skeletonization, and each pixel
depends on values in the rows and columns already visited.
When determining the distance from the edge of the image,
the previously computed values are consulted for each
pixel. The information essential for generating the distance
transform matrix is shown in Figure 12.
A function f1 is applied to the image I in standard scan
order, producing I*(i, j) = f1(i, j, I(i, j)), and a function f2 is
applied in reverse standard scan order, producing
T(i, j) = f2(i, j, I*(i, j)). The reverse scan of the image
therefore requires that the standard scan be completed first.
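A minimal serial sketch of this two-scan scheme, specialized to the chessboard distance used in this work, is given below; the names, the fixed image size, and the treatment of pixels outside the image as background are assumptions for illustration:

#define N 256   /* assumed image size */

static int min2(int a, int b) { return a < b ? a : b; }

/* Two-scan chessboard distance transform: img holds 1 for object
 * pixels and 0 for background; dist receives the distance of each
 * pixel from the nearest background pixel. */
void chessboard_dt(const unsigned char img[N][N], int dist[N][N])
{
    /* Forward pass, standard scan order (f1): look at the four
     * neighbors already visited. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            if (!img[i][j]) { dist[i][j] = 0; continue; }
            int nw = (i > 0 && j > 0)     ? dist[i-1][j-1] : 0;
            int n_ = (i > 0)              ? dist[i-1][j]   : 0;
            int ne = (i > 0 && j < N - 1) ? dist[i-1][j+1] : 0;
            int w_ = (j > 0)              ? dist[i][j-1]   : 0;
            dist[i][j] = min2(min2(nw, n_), min2(ne, w_)) + 1;
        }
    /* Backward pass, reverse scan order (f2): tighten each value
     * using the four neighbors visited after it. */
    for (int i = N - 1; i >= 0; i--)
        for (int j = N - 1; j >= 0; j--) {
            if (!img[i][j]) continue;
            int se = (i < N - 1 && j < N - 1) ? dist[i+1][j+1] : 0;
            int s_ = (i < N - 1)              ? dist[i+1][j]   : 0;
            int sw = (i < N - 1 && j > 0)     ? dist[i+1][j-1] : 0;
            int e_ = (j < N - 1)              ? dist[i][j+1]   : 0;
            dist[i][j] = min2(dist[i][j],
                              min2(min2(se, s_), min2(sw, e_)) + 1);
        }
}

The skeleton is then read off by keeping the pixels whose distance value is not smaller than that of any immediate neighbor, as described in the caption of Figure 13.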
The steps required for the implementation of the serial
baselines of the distance transform and thinning algorithms
are illustrated in the flowcharts and described below
(Figure 13).


Design of Parallel Algorithms. The steps used to
design the distance transform algorithm and the thinning
algorithm are illustrated in Figures 14 and 15, respectively.
Two methods for parallelizing the distance transform
were considered: ignoring the data dependencies inherent
to the algorithm, and red-black ordering.
In red-black ordering, the image pixels are colored
alternately red and black, so that a red pixel is never
directly adjacent to another red pixel (directly adjacent
meaning to the right, left, above, or below). All the red
pixels can therefore be computed in parallel, and then all
the black pixels, once the red pixels have been computed.
Theoretically, and during the formulation process, this
method seemed to offer the potential for a large
improvement in performance.
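A minimal sketch of one red-black pass, written with OpenMP rather than the UPC used in this work, is shown below; the names, the 4-neighbor city-block update, and the iterate-until-no-change driver are illustrative assumptions (compile with -fopenmp):

#define N 256   /* assumed image size */

static int min2(int a, int b) { return a < b ? a : b; }

/* One relaxation pass over pixels of one color ((i + j) even = red).
 * Initialize dist to 0 on background and a large value on the object,
 * then alternate red and black passes until no value changes. Red
 * pixels have no red 4-neighbors, so each pass is data-parallel. */
static int redblack_pass(const unsigned char img[N][N], int dist[N][N],
                         int color /* 0 = red, 1 = black */)
{
    int changed = 0;
    #pragma omp parallel for reduction(|:changed)
    for (int i = 0; i < N; i++) {
        for (int j = (i + color) % 2; j < N; j += 2) {
            if (!img[i][j])
                continue;
            int up    = (i > 0)     ? dist[i-1][j] : 0;  /* outside = background */
            int down  = (i < N - 1) ? dist[i+1][j] : 0;
            int left  = (j > 0)     ? dist[i][j-1] : 0;
            int right = (j < N - 1) ? dist[i][j+1] : 0;
            int d = min2(min2(up, down), min2(left, right)) + 1;
            if (d != dist[i][j]) {
                dist[i][j] = d;
                changed = 1;
            }
        }
    }
    return changed;
}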
The problem with the red-black ordering approach is that
the result is highly dependent on the image. Some images
work well when the reverse scan order is performed before
the standard scan order; some work well with the standard
scan order performed first; and some do not yield a
skeleton at all. Since the algorithm does not produce a
reliable skeleton each time, this method was not selected.
The Chessboard Distance Transform was selected and
utilized.


Figure 11: Binary Image (left) and Chessboard Distance Transform (right) [12]








Table 2: Elimination Table [10]

Amount    Elimination Rules
0         Never
8         Never


Figure 12: Distance Transform Algorithm [9] (the two-scan scheme: f1 applied in standard scan order, then f2 in reverse standard scan order)



















Figure 13: SERIAL BASELINE FLOWCHART OF ALGORITHMS. DISTANCE TRANSFORM: The algorithm requires the image to be traversed four
times. The skeletonization process analyzes the distance transform matrix created. Each pixel is analyzed to determine if it is considered to be part of
the skeleton, by comparing the pixel value to its immediate neighbors. The pixels with the largest value are the ones that are farthest from the edge.
These pixels are guaranteed to be part of the image's skeleton. THINNING ALGORITHM: One processor goes through the entire image and eliminates
pixels based on the elimination table. Iterations need to be performed on the image to achieve a skeleton. The number of iterations depends on the
image being thinned. The thinning is performed using data from the original matrix and updating the new matrix. In order for the elimination procedure to
work, two two-dimensional arrays are required.































Figure 14: PARALLELIZATION FLOWCHART OF DISTANCE TRANSFORM ALGORITHM. The flowchart of the parallel algorithm is
similar to the serial algorithm, with the difference that the computation is performed on different parts of the image matrix
simultaneously. Barrier synchronization is in place because the next step cannot continue until all the threads or processors have
completed the previous step in the algorithm. Reading the image matrix from a file and writing the final image matrix to a file
consume a lot of processing time, especially because every processor needs to open the file. Ideally, each processor would read only
the part of the file relating to its local work and allocate it into local and shared memory. Optimization of the file I/O aspect of the
algorithm was considered, but could not be achieved due to programming language barriers. The algorithm is such that there is no
way to avoid the barrier synchronization points: to achieve a good skeletonization image, these synchronization points have to be in
place.



























Figure 15: PARALLELIZATION FLOWCHART OF THINNING ALGORITHM. The parallel thinning code is the serial algorithm
performed on each processor and on different sets of data. The data is split up using the strip decomposition which was also used in
the distance transform algorithm. The problem is that the waiting time for the algorithm is significant because each processor
processes the image, then analyzes if it needs to perform another thinning iteration, then performs the algorithm again.


Parallel Random Access Machine (PRAM) Analysis. The
PRAM analysis of the favored approach, ignoring the data
dependencies in the distance transform algorithm, revealed
that the speedup becomes linear as the size of the image,
assumed to be n x n, becomes very large. The image size n
also has to be a multiple of the system size, p, in order to
achieve the linear speedup. If the number of processors p
equaled the image size n, linear speedup could be achieved,
but the skeletonization would not be properly performed;
there has to be a compromise between the performance and
the accuracy of the algorithm. The PRAM analysis is
questionable: since communication is not considered in
PRAM, the actual performance of the algorithm cannot be
predicted. Based on the PRAM analysis, some speedup
should be achieved, but it will not be linear unless an
optimization is performed.
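The flavor of this accounting can be reconstructed as follows; this is a sketch under assumed costs, not the paper's derivation. With the n x n image split into p strips, each pass does O(n^2/p) work per processor, plus a synchronization cost that grows with n:

\[
  T_1 = c\,n^2, \qquad
  T_p = c\,\frac{n^2}{p} + c'\,n, \qquad
  S(p) = \frac{T_1}{T_p} = \frac{p}{1 + (c'/c)\,(p/n)},
\]

so S(p) approaches the linear value p only as n grows large relative to p, matching the prediction above.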


The PRAM analysis of the parallelized thinning
algorithm revealed the same outcome as the distance
transform PRAM analysis. The prediction was that the
linear speedup is achievable if the problem size and system
size both increase.

RESULTS

Comparison of Distance Transform and Thinning
Algorithms. The image input into the thinning and
distance transform algorithms is 256 by 256 pixels. Both
algorithms use dynamic memory allocation, chosen for its
convenience when the image size is not known in advance.
The performance of the serial and parallel algorithms for
both techniques is shown in Figure 16. The parallel
algorithms were run on four processors.









Figure 16: Total times by function (profiler output). From left to right: serial distance transform algorithm, serial thinning algorithm, parallel distance transform algorithm on 4 nodes, and parallel thinning algorithm on 4 nodes.


It can be seen that although the thinning algorithm is
easily parallelizable, the thinning needs to be repeated
many times, which decreases its performance. The
distance transform method yields a lower execution time,
both parallel and serial, when compared to the thinning
algorithm; the parallel distance transform even has a lower
execution time than the serial thinning algorithm on the
same input image. The parallel distance transform
algorithm, however, took approximately three times the
execution time of its serial counterpart, so the
communication between the processors needs to be
decreased. The outputs of the two algorithms are shown in
Figures 17-22. The parallel distance transform output is
similar to the serial output. The MATLAB result is, of
course, much better, but it can be seen that the
functionality was achieved.

Figure 17: Input Image

Figure 18: MATLAB Skeletonization








Figure 19: Serial DT Skeletonization

Figure 20: Parallel DT Skeletonization (4 Nodes)

Figure 21: Serial Thinning Algorithm

Figure 22: Parallel Thinning Algorithm (4 Nodes)

The outputs of the serial and parallel thinning algorithms
are identical. The MATLAB thinning algorithm uses a
different approach to thinning, so its result is not useful for
comparison. Superimposing the distance transform and
thinning outputs may provide a better skeletonization result
than either algorithm alone.
The performance comparison above does not exclude the
file I/O operations, but the manual execution time results
yielded the same conclusions: the file I/O added extra
overhead to the program, yet the thinning algorithm still
performed slower than the distance transform algorithm on
the same image. The output of the thinning algorithm is
also not as good as the output of the distance transform
algorithm. The distance transform algorithm provides the
better skeletonization and the greater potential for overall
performance.






Performance Analysis. The performance analysis does
not include the file I/O part of the program, because it was
not parallelized. The computation, which was parallelized
in the formulation phase of the project, is compared to
determine the speedup. The parallel distance transform
algorithm was compared both to the serial algorithm and to
the parallel algorithm run on a single processor. The
parallel program running on a single processor is shown to
be faster than the serial program in Table 4 and Figure 23.
The communication appears to increase with the number
of processors, and this communication is what limits the
approach to linear speedup. Another problem is that the
algorithm ignores the data dependencies; as the number of
processors grows beyond 15, the image becomes
significantly different from the serial baseline image. The
user can compromise, achieving four to five times speedup
on five to six processors while still obtaining the required
image.








Table 4: Performance of Parallel Distance Transform Algorithm

Number of    Serial Baseline    Parallel          Speedup: Compared    Speedup: Compared to
Processors   Execution Time     Execution Time    to Serial Code       Parallel Code on One Thread
1            25.381 ms          20.213 ms         1.26                 1
2            25.381 ms          10.504 ms         2.207                1.924
3            25.381 ms           9.532 ms         2.663                2.12
4            25.381 ms           7.355 ms         3.45                 2.748
5            25.381 ms           5.702 ms         4.451                3.545
6            25.381 ms           5.396 ms         4.704                3.746
7            25.381 ms           4.707 ms         5.392                4.294
8            25.381 ms           4.183 ms         6.068                4.832
9            25.381 ms           4.154 ms         6.11                 4.866
10           25.381 ms           3.557 ms         7.136                5.683
11           25.381 ms           3.432 ms         7.396                5.890
12           25.381 ms           3.372 ms         7.527                5.994
13           25.381 ms           2.983 ms         8.509                6.776


Figure 23: Graphical analysis of speedup (speedup vs. number of processors, compared to the serial code and to the parallel code on one thread, against linear speedup)


Optimization of Algorithms. The distance transform
algorithm was optimized, and the resulting performance
change analyzed. Dynamic memory allocation was
investigated as a way to optimize memory use on the
threads, but it reduces the performance of the algorithm:
the overhead of allocating memory dynamically hurts
performance, especially when the image size is small.
Figures 24 and 25 show the performance of the distance
transform program on one processor and on four
processors.
Speedup is not observed, because the communication
between the nodes and the memory allocation cause the
performance to decrease. The speedup observed with
dynamic allocation is 0.406; static memory allocation
results in a speedup of 0.473. Since the image size was
large, a large impact was not witnessed. Although it is the
communication that most needs to be improved, the
memory allocation was investigated to determine any
performance improvement.










Figure 24: Performance Analysis: Dynamic Memory Allocation Compared to Serial Baseline







The serial baseline executed in 64.64 ms. The execution
time of the dynamically allocated parallel algorithm was
159.13 ms, while the execution time of the statically
allocated parallel algorithm was 136.72 ms. Dynamic
memory allocation was used during the performance
analysis because it is convenient when the program does
not know the image size beforehand.
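The two allocation strategies can be contrasted in UPC (the language indicated by the upc_all_alloc and upc_wait entries in the profiles); the declarations below are a hedged sketch, with the layout and names as assumptions:

#include <upc.h>

#define N 256   /* assumed image size */

/* Static allocation: the shared matrix and its layout are fixed at
 * compile time; one row (N ints) per block, dealt round-robin
 * across the threads. */
shared [N] int dist_static[N][N];

/* Dynamic allocation: a single collective call sizes the matrix at
 * run time, and every thread receives a pointer to the same region. */
shared [N] int *dist_dynamic;

void alloc_image(void)
{
    dist_dynamic = (shared [N] int *)upc_all_alloc(N, N * sizeof(int));
}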
The remote shared accesses create a lot of overhead in
the parallel algorithm. It is possible, however, to overlap
communication with computation.


Figure 25: Performance Analysis: Static Memory Allocation Compared to Serial Baseline







The overlap is achieved using split-phase barriers instead
of the blocking barriers that have been used in the parallel
program: local processing can be done while waiting for
data or synchronization. The ghost zones are the boundaries
between the thread data, as shown in Figure 26. At these
points, communication between threads is required, so the
ghost zone is pre-fetched; while the pre-fetch takes place,
the computation on the other local data is performed, and
after all threads have processed their local data, the ghost
zone is processed.
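This overlap can be sketched in UPC with the split-phase barrier (upc_notify / upc_wait); the strip bounds and the two helper routines are assumptions standing in for the actual computation:

#include <upc.h>

#define N 256                      /* assumed image size */
shared [N] int dist[N][N];         /* one row per block, round-robin */

extern void compute_interior(int row_begin, int row_end);  /* assumed helper */
extern void compute_boundary(int row_begin, int row_end);  /* assumed helper */

void relax_strip(int row_begin, int row_end)
{
    upc_notify;    /* enter the barrier without blocking */

    /* Overlap: the interior rows of the strip depend only on this
     * thread's own data, so they can be processed while the other
     * threads catch up. */
    compute_interior(row_begin + 1, row_end - 1);

    upc_wait;      /* complete the barrier: all ghost rows now valid */

    /* The boundary rows read the neighbors' ghost rows, which are
     * safe to use only after every thread has passed the barrier. */
    compute_boundary(row_begin, row_end);
}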


Figure 26: Ghost Zone (the boundaries between the strips assigned to Threads 0-3)





















Figure 27: Ghost zone performance analysis


With the ghost zone optimization, the execution time of
the parallel distance transform algorithm drops to
107.41 ms, for a speedup of 0.602. The ghost zone
technique is good for performance improvement, and the
improvement should be more visible on larger images.


CONCLUSION


The analysis of both pixel-based algorithms reveals that
parallelization creates a performance improvement, but the
improvement is not linear. After implementation,
optimization, and analysis, it was found that the best
algorithm is the distance transform algorithm executed on
five to six processors. Although linear speedup was
desired, the performance at the ideal system size of five to
six processors was 4.1 to 4.65 times the serial algorithm
performance. For many applications the skeletonization
process is very time-consuming; with this performance
improvement, and with the quality of the skeletonization
remaining the same as the serial algorithm's, medical and
fingerprint applications can transition to the parallel
computing realm. The optimizations should be investigated
further to achieve linear speedup: various methods, such as
parallel file I/O, can increase the speed of the algorithm
significantly. Future work should focus on obtaining linear
speedup on a small system size; based on preliminary
results, that goal seems achievable.


Table 5: Performance Table: Execution Time vs. Number of Processors

Number of    Serial Baseline    Parallel          Speedup: Compared    Speedup: Compared to
Processors   Execution Time     Execution Time    to Serial Code       Parallel Code on One Thread
1            25.381 ms          20.232 ms         1.254                1
2            25.381 ms          10.16 ms          2.498                1.99
3            25.381 ms           7.834 ms         3.24                 2.582
4            25.381 ms           6.953 ms         3.65                 2.91
5            25.381 ms           6.186 ms         4.103                3.27
6            25.381 ms           5.452 ms         4.655                3.71
7            25.381 ms           4.891 ms         5.189                4.14










Figure 28: Speedup improvement using the ghost zone optimization







































































REFERENCES


[1] Morrison, P.; Ju Jia Zou, "An effective skeletonization method based on
adaptive selection of contour points," Information Technology and Applications,
2005. ICITA 2005. Third International Conference on, vol. 1, pp. 644-649,
4-7 July 2005.

[2] "Parallel digital signal processing: an emerging market." 24 Feb 2008.


[3] Tran, S.; Shih, L., "Efficient 3D binary image skeletonization,"
Computational Systems Bioinformatics Conference, 2005. Workshops and Poster
Abstracts. IEEE, pp. 364-372, 8-11 Aug. 2005.

[4] Gray, S.B., "Local Properties of Binary Images in Two Dimensions,"
Computers, IEEE Transactions on, vol. C-20, no. 5, pp. 551-561, May 1971.


[5] "Skeletonization." 24 Feb 2008.
szeged.hu/-palagyi/skel/skel.html>




[6] Tran, S.; Shih, L., "Efficient 3D binary image skeletonization,"
Computational Systems Bioinformatics Conference, 2005. Workshops and Poster
Abstracts. IEEE, pp. 364-372, 8-11 Aug. 2005.


[7] "Marvel Machine." High Performance Computing and Simulation Lab. 24
Feb 2008


[8] Fouard, C.; Cassot, E.; Malandain, G.; Mazel, C.; Prohaska, S.; Asselot, D.;
Westerhoff, M.; Marc-Vergnes, J.P., "Skeletonization by blocks for large 3D
datasets: application to brain microcirculation," Biomedical Imaging: Nano to
Macro, 2004. IEEE International Symposium on, pp. 89-92 Vol. 1, 15-18 April
2004.

[9] Gisela Klette, "Skeletons in Digital Image Processing," Computer Science
Department of The University of Auckland,
<http://citr.auckland.ac.nz/techreports/2002/CITR-TR-112.pdf>

[10] Lei Huang, Genxun Wan, Changping Liu, "An Improved Parallel Thinning
Algorithm," Institute of Automation, Chinese Academy of Sciences, P. R. China.

[11] Klaiber, A.C.; Levy, H.M., "A comparison of message passing and shared
memory architectures for data parallel programs," Computer Architecture, 1994.
Proceedings of the 21st Annual International Symposium on, pp. 94-105, 18-21
Apr 1994.

[12] Yu-Hua Lee; Shi-Jinn Horng, "Fast parallel chessboard distance transform
algorithms," Parallel and Distributed Systems, 1996. Proceedings., 1996
International Conference on, pp. 488-493, 3-6 Jun 1996.


























