PARALLEL IMAGE PROCESSING WITH IMAGE

ALGEBRA ON SIMD MESH-CONNECTED COMPUTERS

By

HONGCHI SHI

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL

OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT

OF THE REQUIREMENTS FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

1994

ACKNOWLEDGMENTS

I wish to express my deep gratitude to my advisors, Dr. Gerhard X. Ritter and Dr.

Joseph N. Wilson, for teaching me so much about doing research, for allowing me the

opportunity to perform independent research on their contract, and for giving me the

chance to work as a research assistant. It has been a great pleasure working with them.

I am also grateful to the members of my dissertation committee, Dr. Sartaj Sahni,

Dr. Randy Y. Chow, Dr. Andrew F. Laine, and Dr. John Harris, for patiently reading

this manuscript and providing insightful suggestions.

I would like to thank Dr. Sam Lambert and Dr. Patrick Coffield of Wright

Laboratories, Eglin AFB, for their continued support of this research through US Air

Force contract F08635-89-C-0134. I also thank Dr. Chen-Ning Yang, SUNY at Stony

Brook, and Dr. Richard A. Scheuing and Dr. Reuben Chow, Grumman Corporation, for

offering me the fellowship through the Committee on Educational Exchange with China.

On this fellowship I started my study in the United States.

I would also like to thank my colleagues, particularly Dr. David S. Patching, for

many fruitful discussions.

Finally, I wish to thank my father, Renci Shi, my late mother, Cuie Ding, my sisters,

Yuzhi Shi and Xuezhi Shi, and my brother, Hongtao Shi, for their love and support. I am

indebted to my wife, Hong Wang, and my son, August W. Shi, for their love, support,

and patience during my study. To them I dedicate this dissertation.

TABLE OF CONTENTS

ACKNOWLEDGMENTS .............. .. ..... .... .. ii

ABSTRACT .................... ...................... v

CHAPTERS ................. .......................... 1

1 INTRODUCTION ................... ................ 1

1.1 Image Processing and Parallelism ........................ 1

1.2 Image Algebra and Parallel Image Processing ................. 2

1.3 N otation .. . . . . . . . . . . . . . . . . . 5

1.4 Dissertation Outline ................................. 5

2 OVERVIEW OF IMAGE ALGEBRA ......................... 7

2.1 Im ages . . . . . . . . .. . . . . . . . . . . 7

2.2 Templates ................... .................. 9

2.3 Pixelwise Operations ................................ 10

2.4 Global Operations ................................. .11

2.5 Image-Template Operations ........................... 12

3 SIMD MESH-CONNECTED COMPUTERS .

3.1 SIMD Parallel Computers ..........

3.2 Mesh-Connected Interconnection Networks

3.3 Mapping Images onto SIMD Mesh-Connect

. . . . . . . . . 16

. . . . . . . . . 16

. . . . . . . . . 17

ed Computers ......... 19

4 PARALLEL ALGORITHMS FOR IMAGE ALGEBRA PRIMITIVES ....

4.1 Pixelwise Operations ..............................

4.2 Global Operations ...............................

4.2.1 Semigroup Computations ........................

4.2.2 Global Reduce, Card, Choice, and Image Comparison Operations

4.3 Image-Template Operations .........................

4.3.1 General Image-Template Product ...................

4.3.2 Image-Template Product without Changing Domain ........

4.3.3 Image-Template Product with Special Nonlocal Template .....

5 PARALLEL IMAGE PROCESSING WITH IMAGE ALGEBRA .

5.1 Abingdon Cross Benchmark .....................

5.2 Shrinking Binary Image Components ...............

5.2.1 Levialdi's Parallel-Shrink Operators ............

...... 40

. . . 41

...... 48

...... 49

.22

.23

.23

.23

.24

.26

.26

.27

. 38

5.2.2 New Parallel-Shrink Operators ..............

5.3 Labeling Binary Image Components . . . . . . .

5.3.1 Naive Local Labeling Algorithm ............

5.3.2 Basic Two-Phase Local Labeling Algorithm . . .

5.3.3 Improved Basic Two-Phase Local Labeling Algorithm

5.3.4 Advanced Local Labeling Algorithm ..........

5.3.5 Fast Advanced Local Labeling Algorithm . . . .

5.4 Computing Properties of Image Components . . . . .

. . . . 51

. . . . 58

. . . . 59

. . . . 62

. . . . 65

. . . . 68

. . . . 7 1

. . . . 85

5.4.1 Algorithm for Image-Template Product with Component Template 86

5.4.2 Computing Area, Perimeter, and Compactness of Each Image

Component .................................. 90

5.4.3 Computing Height, Width, and Diameter of Each Image

Component .................................. 91

5.4.4 Computing Moments and Centroid of Each Image Component . 93

6 CONCLUDING REMARKS AND FUTURE RESEARCH ............ 95

6.1 Concluding Remarks ............................... 95

6.2 Future Research ................................... 97

REFERENCES ..........................................99

BIOGRAPHICAL SKETCH ................................ 106

Abstract of Dissertation Presented to the Graduate School of the University of Florida

in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

PARALLEL IMAGE PROCESSING WITH IMAGE

ALGEBRA ON SIMD MESH-CONNECTED COMPUTERS

By

Hongchi Shi

April 1994

Chairman: Dr. Gerhard X. Ritter

Cochairman: Dr. Joseph N. Wilson

Major Department: Computer and Information Sciences

Image processing and image analysis involve intensive computations. Parallel com-

puting has been perceived as the only economical way to achieve real-time or near real-

time task performance. However, programming parallel computers for image processing

applications is not an easy job due to the lack of high-level languages for parallel image

processing on parallel computers. Image algebra is a unifying mathematical theory for

image processing and analysis. It treats images as primary operands and addresses implic-

itly the parallelism in image processing, providing a highly parallel high-level language

for image processing. This dissertation addresses issues of parallel image processing

with image algebra on SIMD mesh-connected computers. It explores two aspects of

the relationship between image algebra and parallel image processing: how well image

V

algebra serves as a model for parallel image processing algorithms and how well SIMD

mesh-connected computers are suited for image algebra implementation.

In the first part of this dissertation, I select a group of image algebra primitives

useful for parallel image processing and develop efficient algorithms to implement these

primitives on SIMD mesh-connected computers. In the second part, I demonstrate that

image algebra can serve as an excellent model for parallel image processing. This is

accomplished by using image algebra to describe several new highly parallel algorithms

I have developed in order to solve various image processing problems. In particular, I

describe a fast algorithm for the Abingdon Cross image processing benchmark and a new

algorithm for binary image component shrinking. Additionally, I describe and analyze

several local image component labeling algorithms, one of which positively answers an

open question whether there exists a local labeling algorithm to label an n x n binary

image in O(n) time on an n x n mesh-connected computer with O(log n) bits of local

memory on each processing element. I also define a special class of image-template

operations that prove useful for computing properties of image components and develop

a general algorithm for them. Finally, I provide some suggestions for future research on

parallel image processing with image algebra on parallel computers.

CHAPTER 1

INTRODUCTION

Image processing and image analysis involve intensive computations, demanding

high-performance computers for practical applications. Parallel computing appears to

be the only economical way to achieve the level of performance required by image

processing and analysis [5]. Programming parallel computers for image processing is

quite difficult without a high-level language well suited for image processing. Image

algebra is a unifying mathematical theory for image processing and analysis. It also

provides a highly parallel language for image processing. This dissertation explores two

aspects of the relationship between image algebra and parallel image processing: how

well image algebra serves as a model for parallel image processing algorithms and how

well SIMD mesh-connected computers are suited for image algebra.

1.1 Image Processing and Parallelism

Image processing is an important area of research concerned with the manipulation

and analysis of images by computers [68]. It is routinely employed in many application

domains, such as document analysis, industrial robots for product assembly and inspec-

tion, medical diagnosis, military reconnaissance, and machine processing of aerial and

satellite imagery for weather prediction and crop assessment [22].

Image processing is also a challenging area. Image processing problems usually

involve very intensive computations. Consider a sequence of images at medium resolution

(512 x 512 pixels) and standard frame rate (30 frames per second) in color (24 bits per

pixel), which represents a data input rate of about 23.6 million bytes per second [89].

2

In order to enhance and segment such a sequence of images and to extract various

features from it, we may have to apply many thousands of operations to each input

pixel. Such an intensive computation, if handled by even the largest and most powerful

serial computer, can be hopelessly slow. A serial computer is considered inadequate for

image processing problems [83], and parallel processing is now generally accepted as a

necessity to support image processing applications [88, 5].

On the other hand, most image processing algorithms are inherently parallel because

they involve similar computations for all pixels in an image, which makes load distribution

easy. The massive amount of data in an image provides a natural source of parallelism

that can be employed by partitioning the image into sections [30, 6, 87].

It is generally realized that parallel processing appears to be the only way to achieve

the necessary speedups required by image processing [5], and image processing provides

significant data parallelism that can be used in parallel processing. However, it is not

always clear how to employ the data parallelism in image processing effectively. In fact,

effective employment of data parallelism in image processing remains an active area of

research.

1.2 Image Algebra and Parallel Image Processing

To support parallel image processing, we need a unifying theory that can serve as a

model for image processing algorithms and that also fits well into the theory and practice

of parallel architectures. The search for such a theory started over thirty years ago.

Based on the work of von Neumann's logical theory of automata [86], a cellular array

machine was proposed by Unger that could implement in parallel many algorithms for

image processing and analysis [84]. Over time, more cellular array machines have been

3

proposed. Cellular array machines usually are assumed to have a fixed capacity of local

memory.

With the advent of VLSI technology, many mesh-connected computers have been

developed for image processing. Although mesh-connected computers are closely related

to cellular array machines, they allow their local memory to increase with their array

sizes. The architectures such as the ILLIAC [47], the CLIP series [15, 14, 18], the MPP

[4], the GAPP [7], and the Hughes 3D machine [28] are arrays of processing elements

arranged as two-dimensional meshes with hardwired communication between neighboring

processors. The elementary operations of these machines create a mathematical basis for

the theoretical formalism capable of expressing a large number of algorithms for image

processing and analysis. The formalism associated with these architectures is that of pixel

neighborhood arithmetic and mathematical morphology, which is concerned with image

filtering and analysis by structuring elements. Morphological concepts and methods

were first unified by Serra and Sternberg into a coherent algebraic theory specifically

designed for image processing and image analysis [80, 70, 81]. More recently, a

new theory was introduced by Maragos to unify a large class of linear and nonlinear

systems under the theory of mathematical morphology [44]. However, morphological

methods have some well-known limitations due to their set-theoretic formulation, which

is based on the Minkowski addition and subtraction of sets [23]. The morphology-based

image algebras ignore the linear domains, transformations between different domains,

and transformations between different value sets [65, 61]. Another algebraic structure

formulated by Dougherty and Giardina [12] defines a set of basic image operations.

However, it provides too low a level of specification for expressing various image

processing tasks [65].

Another class of parallel image processing architectures is based on optical com-

4

putting concepts and is intended to remove the interconnection bottleneck common to

conventional cellular arrays [27, 26, 19]. These architectures use the inherent optical

parallelism and three-dimensional free space interconnection capabilities to do parallel

binary image processing. The formalism associated with these architectures is a coherent

homogeneous algebra referred to as image logic algebra [19] or binary image algebra

[27, 26]. Binary image algebra is based on the set-theoretic formulation. It has three

fundamental operations: the complement of an image, the union of two images, and the

dilation of two images. For this reason, binary image algebra falls into the same class

as other algebras based solely on mathematical morphology. Image logic algebra is also

based on the set-theoretic formulation. It contains all the logic operations, dilation, and

erosion of images. It also contains the image shift transformation that translates an image

vertically and horizontally, the image casting transformation that allows shrinkage of the

domain, and the multiple imaging transformation that allows enlargement of the domain.

However, these transformations between domains are very simple and not capable of

effectively performing higher-level vision tasks.

The development of a more general unified algebraic structure grew out of a need, by

the U.S. Air Force System Command, for a common image processing language. The goal

was the development of a complete and unified algebraic structure that provides a common

mathematical environment for image processing algorithm development, optimization,

comparison, coding, and performance evaluation. Image algebra developed by Ritter

and his colleagues at the University of Florida proved to be highly successful, capable of

fulfilling the tasks set forth by the government [65, 61]. It is a comprehensive and unifying

mathematical theory concerned with image processing and image analysis. It incorporates

and extends mathematical morphology. Furthermore, it allows transformations between

different domains and transformations between value sets. It extends the neighborhood

5

operations of morphology to more general image-template operations.

Programming parallel computers is very difficult at this time, as was programming

serial computers in the 1960s [87]. Each parallel computer has its own language, which

runs efficiently on one computer only, and high-level programming tools for parallel

computers tend to be tailored to particular machines. A high-level and architecture-

independent language for parallel image processing is desirable. Image algebra provides

a powerful algebraic language for image processing that, if properly imbedded into a

high level parallel language, will greatly increase a programmer's productivity as the

programmer can concentrate on developing effective algorithms themselves for particular

image processing problems without worrying about the parallel architecture.

1.3 Notation

The notation used in this dissertation is similar to that used in most technical texts.

Any new notation introduced will be explained in the dissertation.

I will use log to denote the base 2 logarithm. I will also use the following notation

to describe the asymptotic behavior of functions:

1. f(n) = O(g(n)) denotes the fact that there exist constants c and no such that

f(n) < cg(n) for all n > no;

2. f(n) = ((g(n)) denotes the fact that there exist constants c and no such that

f(n) 2 cg(n) for all n > no;

3. f(n) = 0(g(n)) denotes the fact that there exist constants cl, c2, and no such that

cig(n) < f(n) < c2g(n) for all n > no.

1.4 Dissertation Outline

In this chapter, I have described the relationship between parallel image processing

and image algebra. In Chapter 2, I will overview image algebra from the viewpoint of

6

parallel image processing. I will introduce the SIMD mesh-connected computer model

and describe the general principles of mapping image algebra primitives to SIMD mesh-

connected computers in Chapter 3. In Chapter 4, I will discuss implementation issues

of image algebra on SIMD mesh-connected computers and try to identify important

architectural features for image algebra. In Chapter 5, I will develop several new

parallel image processing algorithms and demonstrate how concisely and conveniently

image algebra can be used to express these algorithms. I will discuss algorithms for

the Abingdon Cross image processing benchmark in the first section. In the second

section, I will consider shrinking binary image components. The third section will discuss

algorithms for labeling binary image components In the fourth section, I will discuss

algorithms for computing geometric properties of each image component. Finally, I will

give a brief summary of what I have done and suggest some further research issues in

Chapter 6.

CHAPTER 2

OVERVIEW OF IMAGE ALGEBRA

Image algebra (IA) is a heterogenous algebra concerned with image processing and

image analysis. It provides a very general algebraic language for image processing. We

present an overview of image algebra and select a group of image algebra primitives for

parallel image processing from the viewpoint of parallel processing.

In image algebra, basic objects in image processing such as images, point sets, and

templates are formally defined as its operands. A set of unary and binary operations is

defined on and between these operands. Image algebra is inherently parallel in that all

the primitive operations are defined to work on images that are collections of pixels. The

parallelism of image algebra is inherent in its operations, which removes the need for

sophisticated compiler analysis normally needed to uncover available parallelism.

For parallel processing, image algebra can be considered as a collection-oriented

language that manipulates aggregate data structures and operations as a whole [79]. When

implemented on parallel computers, image algebra can have images distributed across

the available processing elements. Operations can be performed by all the processing

elements in the single instruction multiple data (SIMD) fashion. Here, we only describe

image algebra from the viewpoint of parallel image processing. A complete description

of image algebra can be found in the paper by Ritter, Wilson, and Davidson [65] and

the papers by Ritter [61, 62].

2.1 Images

Images are the most fundamental operands in image algebra. They are defined in

8

terms of two other types of elementary operands: value sets and point sets. Images are

endowed with two types of information, namely the spatial relationship of the points and

some type of numeric or other descriptive information associated with these points.

A value set F is a set of values that an image could assume. The most commonly used

value sets in image processing are the sets of integers, real numbers, complex numbers,

and binary numbers of length 1, which are denoted by Z, R, C, and Z21, respectively.

The operations on and between elements of a given value set F are the usual arithmetic

operations, lattice operations, or logic operations associated with F.

A point set X is simply a subset of some topological space. The most commonly

used point sets in image processing are subsets of n-dimensional Euclidean space Rn.

The operations on and between elements of points are the usual operations of coordinate

points. The operations on and between point sets are the operations U, n, \, choice

and card. The choice function returns an arbitrary element of a point set while the card

function returns the number of elements in a point set.

Given a point set X and a value set F, an F-valued image a on X is a function

a : X -+ F, which is usually expressed in terms of the graph of a by

a= {(x,a(x)) :xE X},

where a(x) E F.

An element (x, a(x)) of a is called a pixel, where x is the pixel location and a(x)

is the pixel value at location x. The set of all F-valued images on X is denoted by Fx

On parallel computers, pixels of an image are commonly assigned to the processing

elements by a mapping that relates each pixel location to the index of a processing

element.

9

2.2 Templates

In image algebra, the definition of a template unifies and generalizes the usual

concepts of templates, masks, windows, structuring elements, and neighborhood functions

into one general mathematical entity.

Templates are special types of images. An F-valued template from Y to X is an

element of (FX)Y. If t E (F) Y, then for notational convenience we define ty t(y)

in order to denote the image t(y) for each y e Y. The pixel values ty(x) of the

image ty = {(x, ty(x)) : x X} are called the weights of the template t at the point

y, and y is called the target point of the image ty. The pixel values of the template

t = {(y, ty) : y Y} are F-valued images on X. A parameterized F-valued template

from Y to X with parameters in P is a function of form t : P -- (FX)Y. Here, P is

called the set of parameters, and each p E P is called a parameter for t. For each p E P,

t(p) is an F-valued template from Y to X.

Let X C R". A template t E (FX)X is called a translation invariant template or,

simply, invariant template if and only if for x, y 6 X with x + z, y + z e X, where

z E Rn, we have ty(x) = ty+z(x + z). A template that is not translation invariant is

called a translation variant template or, simply, variant template. Translation invariant

templates can be defined pictorially. For example, a template t e (Rx)X with X C Z2

defined by

1 if x = y

2 if x=y+(0, 1)

ty(x)= 3 if x = y+ (1,0)

4 ifx =y+(1,1)

0 otherwise

is an invariant template that can be pictorially represented in Figure 1, where the hashed

cell denotes the location of the target pixel y.

1 2

t=\ / -

3 4

Figure 1. An invariant template

The weight ty(x) for a template t is usually a simple function of the target point y

and point x. It can be expressed by the same formula for every target point y. Thus, on a

parallel computer, the weights ty(x)'s can be computed by the corresponding processing

elements in the SIMD fashion.

2.3 Pixelwise Operations

The pixelwise operations form a major class of operations in image algebra. A

pixelwise operation applies a function to every pixel of an operand image or corresponding

pixels of operand images. This class of operations maps perfectly to the SIMD parallel

programming paradigm: all processors apply the same function to different elements at

the same time.

Pixelwise operations on and between F-valued images are the natural induced opera-

tions of the algebraic system F. Any function f : F -- F induces a function Fx Fx,

again denoted by f, and defined by

f(a) = {(x,c(x)) : c(x) = f(a(x)),x E X}.

In the same fashion, we can define n-ary functions of images. When an argument is just

an element of F, we should first extend it to a constant image that has the value of that

argument at each location on X.

The pixelwise operations are the operations on and between images defined in detail

by Ritter [62] and include the unary and binary arithmetic and logic operations, the

11

characteristic functions, the restriction function, which restricts an image to a subset of

its point set, and the extension function, which extends an image to another image on

a larger point set.

2.4 Global Operations

The global operations act on images in their entirety. These operations usually

require global communication when implemented on parallel computers.

Two simple global operations on images are the card function, which returns the

number of pixels in an image, and the choice function, which returns an arbitrary point

from the point set of an image.

Another important global operation is the global reduce operation. Suppose that

X C Rn is finite, say X = {x1,X2,... ,Xm}. If 7 is an associative and commutative

binary operation on the value set F, then the global reduce operation r on FX induced

by 7 is defined by

r a =x a(x) = a(xi)7a(x2)7 .. * ya(x),

where a E Fx. Thus, r : Fx -- F. Different associative and commutative binary

operations induce different global reduce operations. Typical uses of global reduce

operations are summing the pixel values of an image, finding the minimum/maximum

pixel value of an image, and using logic reduction to determine the and/or of a boolean

image.

Image comparison operations are also global operations. First, images can be

compared pixelwise, and then a logic global reduce operation can be applied to the

pixelwise comparison result.

12

2.5 Image-Template Operations

In terms of image processing, operations that combine images and templates are

the most powerful tools of image algebra. The image-template operations combine

images and templates by using appropriate binary operations. They may be used for

transformations between different domains and transformations between different value

sets.

Suppose that A = (F, F1, F2, o,7, 0, ) is an algebra. The sets F1, F2, and F

are value sets. The operation o : F1 x F2 -- F is a binary operation and the operation

7 : F x F -+ F is an associative and commutative binary operation on F. The constant

17 E F is the identity element of the 7 operation, i.e., fy71 = 177-f = f for every f E F.

The constant 0o E F2 is called the right zero element of the o operation, i.e., fi o 0o = 17

for every fi E F1. If a E FX and t E (F) Y, then the generalized product of a with t

is the binary operation : Fx x (FX)Y FY defined by

a0t = {(y,b(y)): b(y) =-x a(x) o ty(x),y E Y}.

We define the support of ty with respect to the algebra A as S(ty) =

{x E X: ty(x) / 0}. For example, if t E (RX)Y, the support of ty with respect

to the algebra (,R, R, , +, 0,0) = (R, +,0) is So(ty) = {x EX : ty(x) 4 0}. When-

ever the algebra is understood, the support with respect to the algebra is simply called

the support. For a template t E (FX)X with support S(ty) for ty, we define the relative

support ofty as R(ty) = {x y : x E S(ty)}, and define the relative support of t as

R(t) = U R(ty). For example, the relative support of the template t shown in Figure

yEX

1 is {(0, 0), (0, 1), (1, 0), (1, 1)}. Note that the relative supports of an invariant template

at different target points are the same. A template t is called a square (rectangular,

13

(rectangular, diamond, and circular, respectively) template if and only if all the points

in its relative support form a square (rectangle, diamond, and circle, respectively).

A neighborhood on X is a function N : X -- 2X such that x E N(x). A template

t E (FX)X is local with respect to N if and only if S(ty) C N(y) for all y E X. If

N is understood, then we simply say that t is a local template. For example, if N(x)

is a Moore neighborhood of x E X and t is a local template with respect to N, then

the support of t at each y E X always fits in the Moore neighborhood of y. An image-

template operation with a local template is called a local image-template operation. An

image-template operation with a nonlocal template is called a nonlocal image-template

operation.

One can derive many image-template operations by substituting appropriate binary

operations for 7 and o in the definition of the generalized product. Among the most

commonly used image-template operations in image processing are @, M, Em, , @,

Z, and described as follows:

* The linear image-template product O is obtained from the generalized product by

substituting + and for 7 and o and is defined for integer-, real- or complex-valued

images. Specifically, a t = (y, b(y)) : b(y) = E a(x) ty(x),y E Y .

I xEX )

* The additive maximum/minimum R and [E may be applied to integer- or real-valued

images. The operation M is obtained from the generalized product by substituting

V and + for 7 and o, while the operation E is obtained by substituting A and +

for 7 and o. Specifically, aMt = (y,b(y)): b(y) = V a(x) + ty(x),y Y}

I xeX )

and aElt= {(y, b(y)):b(y)= A a(x) + ty(x), y Y .

I xX J

* The multiplicative maximum/minimum @ and @ may be applied to non-

negative integer- or real-valued images. The operation @ is obtained from

the generalized product by substituting V and for 7 and o, while the

14

operation @ is obtained by substituting A and for 7 and o. Specif-

ically, a @t = (y,b(y))): b(y)= V a(x) -ty(x),y E Y and a@t =

I xeX )

{(y,b(y)): b(y)= A a(x) ty(x), y E Y

I xEX J

The bitwise match/dismatch i and CI may be applied to 1-bit binary number-

valued images. To obtain El and E] from the generalized product, we define two

operations + and -'. The operation + is defined as the bitwise exclusive-or, i.e.,

a;_-la-2...*aoG-b-lbl-2 ...bo = c;-lcl-2...CO, where ci = (ai + bi) mod 2, the

exclusive-or of ai and bi, for 0 < i < 1. The operation r is defined as the

bitwise complement of 4, i.e., al-lal-2... ao 'bt-lbl-2 .. o = ct-1 cT-2 ... Co

if and only if a_-lal-2 ... ao4rblbl-2 ... bo = c;_cl-2 ... co. The bitwise match

operation Il is obtained from the generalized product by substituting bitwise and A

and the bitwise operation 4' for 7 and o, while the bitwise mismatch operation

C is obtained by substituting bitwise or V and the bitwise operation + for 7

and o. Specifically, a lt = {(y,b(y)) : b(y)= A a(x)-'ty(x),y e Y and

L xEX )

a l t = (y, b(y)) : b(y) = V a(x)4ty(x),y E Y Note that the operation

I. xex E

extends the morphological hit-or-miss operation and is conceptually simpler. The

morphological hit-or-miss operation is only applied to binary images and it requires

two structuring elements T1 and T2. The structuring element Ti is for the foreground

of the pattern to be matched, while T2 is for the background of the pattern. The

operation l can achieve the same result with a single template that has 1 as the

weights for the part of support corresponding to T1 and 0 for the part of support

corresponding to T2.

The image-template operations are appropriate for use whenever any neighborhood or

region-based operation is to be performed on portions of an image. They are applicable

in areas such as filtering, edge finding, region growing, and feature matching. In

15

these applications, the template defines the relationship of a neighborhood of source

image points to result image locations and gives values to these points. The weighting

operation o determines the contribution of each of the neighborhood points to the

reduction operation r, which determines the final result image value. The image-

template operations implicitly specify communications between processing elements when

implemented on parallel computers.

CHAPTER 3

SIMD MESH-CONNECTED COMPUTERS

We now introduce the parallel architecture model that we intend to explore for par-

allel image processing with image algebra. Fine grain SIMD parallel computers are well

suited for applications that exhibit significant data parallelism such as those encountered

in image processing [45]. These computers are referred to as massively parallel computers

and are based on a large number of small-size processing elements, regular interconnec-

tion networks, and a simple control mechanism. Although a large number of parallel

architectures with various interconnection networks have been proposed, massively par-

allel architectures with mesh interconnection networks are suggested as natural parallel

architectures for image processing because their structures closely mimic structures of

images and provide efficient local communication structures [67, 55, 6]. We present an

overview of SIMD mesh-connected computers.

3.1 SIMD Parallel Computers

Massively parallel computers are primarily SIMD architectures due to their massive

parallelism nature. The concept of an SIMD parallel computer is illustrated in Figure 2.

SIMD massively parallel computers are made of individual processing elements (PEs),

each having a small arithmetic-logic unit (ALU) and a small local memory. Each PE

is given a unique index. An SIMD massively parallel computer has a separate program

memory and control unit. The control unit performs instruction sequencing, fetching,

and decoding. Instructions are broadcast by the control unit to the PEs for execution.

16

17

PEs are connected by an interconnection network that plays a very important role in

Interconnection Network

0 1 P-1

PE PE r PE

M M M

I/O < -

Control Unit Program Memory

Figure 2. Diagram of an SIMD parallel computer

the performance of parallel algorithms. While many interconnection networks have been

proposed and reviewed [17, 29, 51], the two-dimensional mesh network is the simplest

one and can be efficiently implemented with VLSI technology.

Here, we consider SIMD massively parallel computers with mesh-connected inter-

connection networks. Since most images in image processing applications are two-

dimensional images and most processor arrays are also two dimensional, we consider

processing two-dimensional images on two-dimensional processor arrays. Extensions to

higher dimensions can be worked out from the discussion of the two-dimensional cases.

3.2 Mesh-Connected Interconnection Networks

In a two-dimensional m x n mesh-connected computer (MCC), the PEs are logically

arranged as a two-dimensional array. Each PE is connected to its nearest 4 or 8 neighbors.

A 4-connected mesh is shown in Figure 3 while an 8-connected mesh is shown in Figure

4. We denote the processor in the i-th row and j-th column of the mesh by PE(i,j)

with PE(0, 0) in the upper left corer of the mesh. Sometimes we use a one-dimensional

18

indexing of the mesh in which (i,j) is mapped to i n + j. Data may be transmitted

0 1 2 n-1

0 PE PE PE ----- PE

1 PE PE PE -----PE

2 PE PE PE PE

m-1

Figure 3. Diagram of a 4-connected mesh array

0 1 2 n-1

0 PE PE PE ---- PE

1 PE PE PE ----- PE

2 PE PE PE -- PE

m-1 -----

Figure 4. Diagram of an 8-connected mesh array

from one PE to another PE via the mesh network. It is assumed that a PE can transmit

data to any of its nearest neighbors in unit time, but all PEs should transmit in the same

direction to their neighbors. A minor variation to the mesh is a mesh with wraparound

connections between PEs on the edge. For a 4-connected mesh, the right end of each

row connects to the left end of the row and the bottom of each column connects to its

top. For an 8-connected mesh, diagonal connections are also wraparound. The ILLIAC,

the CLIP, and the MPP are examples of mesh-connected computers.

19

MCCs have regular interconnection structures suitable for VLSI implementation. The

area occupied by the mesh network in an MCC does not increase faster than the size of

the mesh. However, the worst case data movement in an n x n mesh may take 0(n)

time due to its large diameter of 0(n). MCCs seem to be natural architectures for image

processing because their network structures are the same as image structures. Their large

diameters, however, make image processing algorithms involving communications over

long distance inefficient.

In a word model, we assume that the interconnection links between PEs are a word

wide. A word operation and transmission of a word can be accomplished in a single step.

When we consider the time complexity of an algorithm in this model, we only count the

number of word steps. In a bit model, we assume that the interconnection links between

PEs are a bit wide. Operations on words are broken down into operations on individual

bits. A bit operation or transmission of a bit can be done in one step. In this model,

time is measured in bit steps. Usually, a time complexity measured in the bit model is

slower than that in the word model by a factor of the number of bits in a word. We will

use the word model unless we mention the bit model explicitly.

3.3 Mapping Images onto SIMD Mesh-Connected Computers

To map an image onto the array of processors, we wish to assign each PE a pixel. If

the array is smaller than the image, there are two ways to handle this [67]. One way is

to process the image a block at a time and keep track of what happens where the blocks

meet or overlap as shown in Figure 5. Alternatively, if the PEs have enough storage

capacity, we can assign each PE a block of image pixels, and neighboring PEs must

then exchange information about all the pixels located on the borders of their blocks as

shown in Figure 6.

m-.

* 0... * ** * 0 ...*

66 **** * **** * ****

* -S.. * S .'. * "..*

S...* e ****. *0 ...0

0 0 "''0 "'' * "'' **

* *

Figure 5. Block by block processing of

image (n x n) on a small processor array

/ n/rn

.nlm

. ..

* **

* **S*

* 0 **'*

"''S

/

0 1 m-1

m- 0

-- 1

- }-^F--- *m1

[S -[ ] -" { ] O-

a large

(m x m)

rn-i

66** S

S....

.0 t~

*0i*4

0 1

S. ......**- ***** ****

...o o * -. ...o o -.. t :.. m-1

* -.. * *-* ... **

Figure 6. Processing of a large image (n x n) on a small processor

array (m x m) by assigning an (n/m) x (n/m) block to each PE

For parallel computations on MCCs, it is always easy to convert an algorithm

designed for a large network of processors into an equally efficient algorithm for a

smaller network of processors [35]. For an algorithm processing an n x n image on an

n x n processor array PA1 with each pixel assigned to a PE, we can always convert the

algorithm so that it can run on an m x m processor array PA2, where m < n by having

21

each PE of PA2 process n pixels to simulate 2] PEs of PAi. This will induce

a slowdown of at most [ which is to be expected, since we use a factor of n[

fewer processors. Thus, if we can design an algorithm that is efficient for a large array

of processors, we can scale it down to have an algorithm that is just as efficient for a

smaller array. We will only consider mapping one pixel to each PE.

A mapping of an image a E FX onto a processor array PA is a simple function h

that assigns a pixel at location x E X of the image to a processing element PE(i, j) of

PA, i.e., h(x) = (i,j). The simplest mapping is an identity function that maps the pixel

at location (i,j) to PE(i,j). When an image is mapped to a processor array, each PE

should have a flag to indicate whether it corresponds to a pixel of the image or not.

CHAPTER 4

PARALLEL ALGORITHMS FOR IMAGE ALGEBRA PRIMITIVES

Image algebra provides a powerful language for image processing. Several image

algebra programming languages have been developed to implement some subsets of image

algebra. They include Image Algebra FORTRAN [96], Image Algebra C [54], Image

Algebra Ada [92, 97], and Image Algebra C++ [93]. They are high-level languages with

extensions of images as basic data types and image algebra primitives as basic operators.

On serial machines, these languages simplify the programming tasks for image processing

due to replacement of large blocks of code by short algebraic statements. In order to

increase the performance of image algebra implementations, researchers have attempted

to implement image algebra primitives on special purpose architectures. The main idea

is that a host machine controls and communicates with a special purpose architecture to

execute image algebra primitives [33]. The architectures that have been targeted thus

far include cellular arrays [63], transputers [10], ERIM's Cytocomputer, Honeywell's

PREP, CM2 Connection Machine [91, 94], Aladdin [95], VITec-50 [78], MasPar [48],

and digital-optical architectures [8, 9]. To increase the performance of image algebra

implementations on architectures that are only designed for fast small neighborhood

operations, researchers have also developed methods to decompose templates to make

full use of those architectures [20, 21, 64, 39, 42, 43, 38]. All these attempts are restricted

to some specific available machines, and the implementations have been incomplete.

Programming image processing algorithms using a set of architecture-independent

primitives from image algebra, we abstract away from a particular architecture. We do

23

not have to worry about features of machines such as the number of processors and the

type of communication networks connecting processors. However, we still do need to

worry about the relative performance of the primitives on particular architectures. To

search for architectures well suited for image algebra, we map image algebra primitives

onto SIMD mesh-connected computers and analyze their performances.

4.1 Pixelwise Operations

Any pixelwise operation of image algebra applies a function to its image operand(s)

pixelwise and requires no communication between processing elements as long as the

same image to processor array mapping is used for the operand images. We can always

have a linear speedup for any pixelwise operation. However, global operations and

image-template operations are not so simple.

4.2 Global Operations

Image algebra global operations perform basically semigroup computations. Thus, I

discuss semigroup computations on MCCs first.

4.2.1 Semigroup Computations

Suppose that 7 is an associative operation on the value set F. The semigroup

computation for a set A = {aij E F : 0 < i,j < n} on an n x n processor array with ai,j

assigned to PE(i, j) is to compute0 r ai, = ao,o7ao,17 ... an-1,n-1. Suppose that

we store the final result in PE(0, 0). With simple modification of the algorithm given

below, we could store the final result in any PE.

24

Since the diameter of an n x n MCC is O(n), any global operation is expected to

take Q(n) time. An O(n) algorithm for the semigroup computation is as follows:

Algorithm Semigroupcomputation

1 Move data column by column to the first column. Every time a column of data

is moved to the first column, a 7 operation is performed by each PE in the first

column.

2 Move data one by one to the first row by the PEs in the first column. Every time

a datum is moved to PE(0,0), it performs a 7 operation.

end algorithm.

4.2.2 Global Reduce, Card, Choice, and Image Comparison Operations

For a global reduce operation r induced by 7, we can compute F on an image by

setting each PE not holding any pixel the identity of the associative binary operation 7

and performing the semigroup computation. Thus, we have the following result.

Theorem 4.1: There is an O(n) time MCC algorithm to compute any global reduce

operation r on an image with domain within an n x n point set.

The card operation counts number of pixels in an image. The algorithm can be

described as follows:

Algorithm Card

1 Each PE holding a pixel of the image is set value 1, and each PE holding no pixel

of the image is set value 0.

2 Perform the semigroup computation with the arithmetic addition as the associative

operation.

end algorithm.

Therefore, we have the following result.

Theorem 4.2: There is an O(n) time MCC algorithm for the card operation applied

to an image with domain within an n x n point set.

The choice operation chooses randomly a point from the point set of an image. To

perform the choice operation, we use one-dimensional index for each PE in the processor

array. The algorithm is as follows:

Algorithm Choice

1 Each PE holding a pixel of the image uses its index as the value, and each PE

holding no pixel of the image uses minus infinity for maximum or infinity for

minimum.

2 Perform the semigroup computations to obtain the maximum and the minimum.

3 Generate a random number between the minimum and the maximum obtained in

Step 2.

4 Broadcast the random number to every PE.

5 Each PE holding a pixel of the image with index less than or equal to the broadcast

index uses its index as the value, and any other PE uses minus infinity as its value.

6 Perform the semigroup computation to obtain the maximum.

7 Find the point from the index using the information of the pixel to PE mapping.

end algorithm.

Steps 1, 3, 5, and 7 take constant time. Step 4 takes O(n) time. Steps 2 and 6 perform

semigroup computations and take O(n) time. Therefore, we have the following result.

Theorem 4.3: There is an O(n) time MCC algorithm for the choice operation applied

to an image with domain within an n x n point set.

For an image comparison operation, we can first perform the comparison pixelwise,

which takes 0(1) time. We then perform the semigroup operation on the result image

with the logic and operation as the associative operation, which takes O(n) time. Thus,

we have the following theorem.

Theorem 4.4: There is an O(n) time MCC algorithm for any image comparison

operation applied to an image with domain within an n x n point set.

26

4.3 Image-Template Operations

The image-template operations derived from the generalized image-template product

of image algebra are powerful operations for image processing. The generalized image-

template product requires intensive computation. In this section, I discuss efficient

algorithms for the product. The more we know about the specific product type, the

more efficient algorithms we can design for it.

4.3.1 General Image-Template Product

The image-template product in its general form is difficult to implement efficiently

on SIMD mesh-connected computers. We have to compute the product one pixel by

one pixel in the point set of the result image. For each pixel of the result image, we

can apply a global reduce operation to the source image weighted with the weights of

the template at each pixel in the point set of the source image. Suppose a E FX is the

source image, t E (F)Y is the template, and b = a t e FY is the result image.

The algorithm is as follows:

Algorithm Image-templateproduct

1 Iterate points of Y in some order. For each y E Y, do the following:

1.1 Compute each weight ty(x) using PE(i,j) assigned to the pixel of a at x,

xEX.

1.2 Compute a(x) o ty(x) using PE(i,j).

1.3 Perform the global reduce operation induced by the 7 operation and store the

result to the PE assigned to the pixel of b at y.

end algorithm.

The complexity of the algorithm is the complexity of the global reduce operation

times the number of points in the point set of the result image. Thus, we have the

following result.

27

Theorem 4.5: Suppose that the result image has m pixels and the source image is

on an n x n point set. There is an O(mn) time MCC algorithm for the image-template

product.

The above algorithm is very general but not efficient. If the image-template product

does not change image domain, i.e., the result image domain is the same as the source

image domain, we can have more efficient algorithms.

4.3.2 Image-Template Product without Changing Domain

The image-template product that does not involve changing domain covers many

image processing operations. Fundamental operations in computer vision and image

processing such as convolutions and correlations [68, 3] are special cases of this product.

Convolutions require intensive computations. For an m x n image and a template

of t elements, a convolution takes O(tmn) time in a sequential computer. Thus,

much attention has been devoted to the development of efficient parallel algorithms

for convolutions [60]. We assume that images are mapped to the processor array by

the identity mapping. On MCCs, the convolution problem has been studied by several

researchers [34, 46, 59]. However, the algorithms proposed for convolutions only work

for invariant templates. Some of the algorithms require square templates [46, 59], and

some of them rely on broadcasting template weights to processing elements [34, 46].

Although one of the algorithms does not broadcast weights, it assumes that weights are

somehow distributed over the processing elements initially and are shifted during the

convolution computation, and the algorithm becomes complicated in order to reduce the

local memory requirement to 0(1) per processing element [59]. One of the algorithms

moves partial convolution results across processing elements during the computation,

which is not efficient when the size of image pixel values (such as binary images) is

small compared to the size of the convolution values [34].

28

We propose a simple and optimal algorithm for the image-template product on an

m x n MCC for an m x n image and a template of t elements [77]. The algorithm

takes O(t) time and requires 0(1) local memory per PE. The image-template product

is computed along disjoint convolution paths of the template. In order to compute

the image-template product more efficiently, we modify and extend the concept of

convolution paths introduced by Lee and Aggarwal [34]. Compared with the previously

proposed algorithms, our algorithm works for both invariant and variant templates without

restrictions on template shapes. Furthermore, it does not broadcast template weights and

does not move partial convolution results.

Convolution paths

The concept of convolution paths was introduced by Lee and Aggarwal [34] where

an algorithm was proposed to perform convolutions along convolution paths. Under that

definition, however, a convolution path starts with some pixel in the window and ends

at the target pixel. This causes their algorithm to have t 1 partial result movements,

where t is the length of the convolution path in the window. We define a convolution

path to start with the target pixel in order to move source image pixel values instead

of partial results.

Let t E (FX)X be a template and let R(t) denote its relative support. A convolution

path of length k for t is a sequence of points P = po, ,... Pk such that

a. po = (0,0);

b. pi E R(t), 1 < i < k;

c. pi is 4-connected or 8-connected to pi+1, 0 < i < k; and

d. pi 0 pj if i 5 j, O < i,j < k.

29

For example, a 3 x 3 template t has a convolution path (0,0), (-1,0), (-1,1),

(0,1), (1,1), (1,0), (1,-1), (0,-1), (-1,-1) in its relative support R(t) =

{(i,j): -1 < i,j < 1} as shown in Figure 7. The arrows indicate the directions for

pixel value movement in the computation of the image-template product along the con-

volution path.

j -1 0 1

0

\ /

Figure 7. A convolution path of a 3 x 3 template in its relative support

For a E Fx, t E (F) X, and a convolution path P = po, P,..., Pk, the partial

result (aOt)p of the image-template product a @t along P is as follows:

k

(aOt)p(y) = I a(y + pi) o ty(y + pi),y X.

Two convolution paths of a template are disjoint if they have no common points

in their paths except their first point (0,0). A template t with relative support R(t) is

partitionable by the disjoint convolution paths P1, P2,...,P; if every point of R(t) is

in one of the convolution paths. For example, the template in Figure 8 has two disjoint

convolution paths and is partitionable by the two disjoint convolution paths.

Figure 8. Two disjoint convolution paths of a template

For a E Fx, t E (FX)x, if t is partitionable by disjoint convolution paths

P1, P2, ..., PI, we can obtain a t from the partial results along the disjoint convolution

paths as follows:

(a t)(y) = a(y) o ty(y)7Fr (a t)p (y),y e X.

30

A convolution path is called an ideal convolution path if it contains no 8-connected

edges, since on a mesh-connected computer where processing elements are 4-connected,

the fewer 8-connected edges in the convolution path, the more efficient the computation

of the image-template product along the path, which will be clear later. For example,

the 3 x 3 template in Figure 7 has an ideal convolution path. Unfortunately, not all the

templates are partitionable by disjoint ideal convolution paths. For example, the template

in Figure 9 is not partitionable by disjoint ideal convolution paths.

Figure 9. A nonideal convolution path

We now discuss construction of disjoint convolution paths for a given template. Lee

and Aggarwal [34] allow only one convolution path for a template, while we allow more

than one disjoint convolution path for a template. It should be easier to construct efficient

disjoint convolution paths for a template under our definition. Lee and Aggarwal [34]

have proposed convolution paths for rectangular, diamond, and circular templates with

target pixels in the center. Here, we consider more general cases of templates.

A shrinking spiral path of a point set S is a path starting with one point on the

external contour of S and following points counterclockwise along the external contour

of the set formed by the points of S not followed yet. To introduce a simple procedure to

obtain a shrinking spiral path for a point set, we define the s-neighbor of a point p to be

the neighbor point marked with s in Figure 10 and denote it as Ns(p, s), s = 0, 1,..., 7.

3 2 1

4 p 0

5 6 7

Figure 10. The neighbors of a point p

The relative supports of most commonly used templates such as rectangular, diamond,

and circular templates have shrinking spiral paths. The shrinking spiral path finding

procedure is similar to the chain code finding procedure [53]. The procedure starts with

the left-topmost point po of the point set S and traces points counterclockwise along the

external contour formed by the points of S not traced yet. It follows 4-connected edges

if possible during tracing. Assuming the numerical operations on neighbor numbers to

be modulo 8, the procedure can be described as follows:

Procedure Shrinkingspiralpath

1 The path initially consists of P0. Set the current point p equal to po, and the

search direction s equal to 6.

2 While S is not empty, do the following:

2.1 If N8 (p, s 2) is in 5', then set p equal to Ns(p, s 2) and s equal to

s 2. Go to substep 2.10.

2.2 If Ns(p,s) is in S and Ns(p,s 1) has another neighbor in S if

Ns(p, s 1) is also in S, then set p equal to Ns(p, s). Go to substep 2.10.

2.3 If Ns(p, s + 2) is in S and Ns(p, s + 1) has another neighbor in S if

Ns(p, s + 1) is also in S, then set p equal to Ns(p, s + 2) and s equal to

s + 2. Go to substep 2.10.

2.4 If Ns(p, s + 4) is in S and Ns(p, s + 3) has another neighbor in S if

Ns(p, s + 3) is also in S, then set p equal to Ns(p, s + 4) and s equal to

s + 4. Go to substep 2.10.

2.5 If N8(p, s 1) is in S, then set p equal to Ns(p, s 1) and s equal to

s 2. Go to substep 2.10.

2.6 If N8(p, s + 1) is in S, then set p equal to N8(p, s + 1). Go to substep

2.10.

2.10.

32

2.7 If Ns(p, s + 3) is in S, then set p equal to Ns(p, s + 3) and s equal to

s + 2. Go to substep 2.10.

2.8 If Ns(p, s + 5) is in S, then set p equal to Ns(p, s + 5) and s equal to

s + 4. Go to substep 2.10.

2.9 Otherwise, no shrinking spiral path exists for this set.

2.10 Append p to the end of P and set S equal to S\{p}

end procedure.

For a template t with relative support R(t), suppose that the above procedure finds

a shrinking spiral path P = Po,Pi, .. .,Pr-1,Pr for R(t) U {(0, 0)}. Let pq = (0,0).

We can easily construct two disjoint convolution paths: P1 = Pq, Pq-1,..., p, po and

P2 = PqPq+1, ... Pr-1, Pr for the template t. The number of 8-connected edges in

the disjoint convolution paths in total is the same as the number of 8-connected edges

in the shrinking spiral path. To see how efficient the disjoint paths for a template are,

we only need to consider the number of 8-connected edges in the shrinking spiral path

of its relative support. Obviously, the worst case is that all the edges in the path are

8-connected and the best case is that all the edges in the path are 4-connected edges. To

compare with Lee and Aggarwal's convolution path construction method, we apply our

method to the templates discussed by Lee and Aggarwal [34].

Result 4.1: A K x L rectangular template has disjoint paths with no 8-connected

edges.

Applying the shrinking spiral path finding procedure to the relative support of a K x L

rectangular template, we obtain a spiral path as shown in Figure 11 for the relative support.

There are no 8-connected edges in the shrinking spiral path. However, Lee and Aggarwal

[34] cannot construct ideal convolution paths for all rectangular templates even if their

target points are in the center.

Figure 11. The shrinking spiral path of a rectangular template (5 x 6)

Result 4.2: An even K x K diamond template has disjoint convolution paths with

no 8-connected edges.

The shrinking spiral path of an even K x K diamond template is as shown in Figure

12. Lee and Aggarwal [34] also give ideal convolution paths for even diamond templates,

but they require the target points to be in the center.

Figure 12. The shrinking spiral path of an even diamond template (8 x 8)

Result 4.3: An odd K x K diamond template has disjoint convolution paths with

K 1 8-connected edges in total.

We consider two cases: K = 4k + 1 and K = 4k 1. For the case of K = 4k + 1,

we can have a shrinking spiral path as shown in Figure 13(a) by applying the shrinking

spiral path finding procedure. A turn of a spiral path reduces the template size by 4 in

both dimensions. Thus, we need k turns to reach the center. It is easy to see that each

turn has 4 8-connected edges. Hence, the spiral path has 4k (which is K- 1) 8-connected

edges. For the case of K = 4k 1, we can have a shrinking path as shown in Figure

34

13(b). After k 1 turns of the spiral path that have 4 8-connected edges each, we need

another turn that has 2 8-connected edges. Thus, the spiral path has 4(k 1) + 2 (which

is also K 1) 8-connected edges. Lee and Aggarwal [34] have extra K 1 points for

K = 4k + 1 and extra K points for K = 4k 1, and they require the target pixel of

the template to be in the center.

(a) 9x9 diamond template

-EL -- - I IH

(b) 7x7 diamond template

Figure 13. The shrinking spiral paths of odd diamond templates

Result 4.4: A circular template has disjoint convolution paths with some 8-connected

edges.

The shrinking spiral path of a circular template can be obtained as shown in Figure

14. It is not easy to give the exact number of 8-connected edges in the shrinking spiral

path for a general circular template. For a 9 x 9 circular template, we have a shrinking

spiral path with 5 8-connected edges. However, Lee and Aggarwal [34] need 7 extra

points and require the target point to be in the center.

Figure 14. The shrinking spiral path of a circular template (9 x 9)

For most commonly used templates, we have a simple algorithm to construct disjoint

convolution paths. For the various templates Lee and Aggarwal [34] have studied, we

have obtained results as good as or better than theirs. For a template t whose relative

support R(t) has no shrinking spiral path, we have the following two methods to find

disjoint convolution paths.

One method is to divide R(t) into subsets such that each subset is adjacent to (0, 0),

find a shrinking spiral path for each subset, and construct disjoint convolution paths from

those shrinking spiral paths. Another method is to extend R(t) by adding extra points

to it, find a shrinking spiral path for the extended set, and construct disjoint convolution

paths from the shrinking spiral path. Since the weights for the extra points are 00 (the

zero element of the o operation), the result is still correct, but the computation takes

extra steps.

Algorithm for image-template product without changing domain

The result of the image-template product is obtained from all the partial results along

the disjoint convolution paths. To compute a partial result along a convolution path, we

shift the source image according to the points on the convolution path; do the o operation

between the shifted source image and the corresponding template weights; and perform

36

the 7 operation on the intermediate results of the o operation. In this process, we need the

template weights along the convolution path. There are three methods to make weights

available along the convolution path. The first one broadcasts the weights. The second

one assumes that the weights are initially distributed over the processing elements and

shifts the weights during the computation. The third method is to compute weights on the

fly along the convolution path. The first and second methods have more communication

overhead and only work for invariant templates, since the weights of variant templates

are different for different target points. The third one works for both invariant and variant

templates. Furthermore, it has less communication overhead.

We compute weights on the fly along convolution paths. The weight t(,j)(x) is a

function of the target point (i,j) and point x. It can be expressed by the same formula for

every target point (i,j). Thus, each PE(i,j) can compute t(i,j)(x) in the SIMD mode.

We use weight(p) to denote the function computing t(i,j)((i,j) + p) by each PE(i,j).

We use shift(a, p, q) to denote the procedure of moving pixel value a(i,j) from

PE(i,j) to PE((i,j) + p q) for each i,j. It is easy to see that shift(a, p, q) can be

done in one route step on an 8-connected MCC. On a 4-connected MCC, it can be done

in one route step if p and q are 4-connected, and in two route steps if p and q are

8-connected.

Using weight(p) and shift(a, p, q), we have the following procedure to compute

the image-template product b = a t along a convolution path P = P1, P,..., Pk.

Procedure Product_along_convolutionpath

1 Initialize the partial result as the identity of the 7 operator. Copy the source

image a to c.

2 For each point pr on the convolution path, r = 1, 2,..., k, do the following:

2.1 Use the procedure shift(c, Pr-, Pr) to move a pixel value originally at

PE((i, j) + p) to PE(i, j).

37

2.2 Compute the weight t(i,)((i,j) + pr) by the procedure weight(pr).

2.3 Perform the o operation between the pixel value and the weight.

2.4 Perform the 7 operation between the partial result and the result obtained in

substep 2.3. Update the partial result.

end procedure.

Since shift(c, p, q) takes 1 step on 8-connected MCCs, it can be easily proved by

induction that c((i,j) + pr) will be shifted to PE(i,j) in step r in the above procedure.

On 4-connected MCCs, since shift(c, p, q) takes 1 step for 4-connected p and q, and

it takes 2 steps for 8-connected p and q, it can also be easily proved by induction that

c((i,j) + Pr) will be shifted to PE(i,j) at the (r + ns(r))-th step, where n8(r) is the

number of 8-connected edges in the path until pr. Thus, the above procedure correctly

computes the image-template product on the convolution path in k steps on 8-connected

MCCs and in k+ns(k) steps on 4-connected MCCs. Furthermore, the procedure requires

no partial result movement.

Using the above procedure, we present an efficient algorithm for the image-template

product on SIMD mesh-connected computers. Suppose that P1,P2,...,P, are the

disjoint convolution paths for the template.

Algorithm Image-template_productwithout_changingdomain

1 Compute the weight t(i,j)(i,j) using each PE(i,j). Initialize the final result as

the result of the o operation between the pixel value at (i,j) and the

corresponding weight.

2 For each convolution path Pr, r = 1,2,..., 1, do the following:

2.1 Use the procedure Product_along_convolution_path(Pr) to compute the

partial result along the path Pr.

2.2 Perform the 7 operation between the final result and the partial result

obtained in substep 2.1. Update the final result.

end algorithm.

38

Since the number of 8-connected edges in a convolution path is at most the length

of the convolution path, it is easy to see that the above presented algorithm correctly

computes the image-template product a t in O(t) steps, where t is the total number of

points in the relative support of t. Thus, we have the following result.

Theorem 4.6: Suppose that the template t is partitionable by disjoint convolution

paths with t elements in its relative support. There is an O(t) time MCC algorithm to

compute the image-template product b = a @t.

The above simple algorithm works for both invariant and variant templates without

restrictions on the template shapes. It does not broadcast template weights and does

not move partial results. When the template is a local template, the algorithm is very

efficient on MCCs. If the template is not local, its relative support could be as large as

the image point set. In this case, the algorithm takes O(mn) time for an m x n source

image. For a special class of nonlocal image-template product, we can develop more

efficient algorithms.

4.3.3 Image-Template Product with Special Nonlocal Template

It is not easy to design an efficient algorithm for the image-template product with a

general nonlocal template on SIMD mesh-connected computers. For the image-template

product with a special class of non-local templates which is very useful in computation

of image component properties, we develop a more efficient algorithm.

Suppose that c is a binary image. A connected component in a binary image is

a maximal connected set of black pixels. Let Cc(x) be the point set of all the black

pixels of c in the same connected component of the pixel at x if the pixel at x is black;

otherwise, let Ce(x) = 0. Note that if x and y are the locations of two pixels in the

same connected component of c, then Cc(x) = Cc(y). A parameterized template t(c) is

39

called a component template with respect to the binary image c if

tw(x) if x E Cc(y)

t(c),(x) = 0, O otherwise,

where o0 is the zero element of the o operation and w is called a weight image with

w(x) = o0 whenever c(x) = 0. By the definition, the support set of t(c) at target point

y is Cc(y), i.e., S(t(c)y) = Co(y).

Let a E Fx be a source image. Suppose that t(c) E (FX) is a component template

with respect to c. Let b = aOt(c) E FX. We have b(y) =xx a(x) o t(c)y(x) =

XEc(y) a(x) o t(c)y(x) = c(y) a(x) o w(x). Note that the source image a and the result

image b are partitioned into components by the binary image c. If the pixels of c at yl

and y2 are in the same connected component of c, the pixels of b at yl and y2 have

the same value, since b(yi) = () a(x) o w(x) x=e(y2) a(x) o w(x) = b(y2). Thus,

we only need to compute one value for each component of b and propagate that value

to all the pixels of the component.

The computation of the image-template product with a component template is highly

related to the image component labeling problem which is a complex process on MCCs.

Thus, we discuss the implementation detail of the image-template product with a com-

ponent template on MCCs after we discuss labeling image components. Here, we only

state the following result.

Theorem 4.7: Suppose that the number of bits in each pixel value of the result image

is b. There is an O(kn) time algorithm to compute the image-template product with a

component template on MCCs with (bkn l/k) bits of local memory per PE, where k is

an integer between 1 and log (2n).

CHAPTER 5

PARALLEL IMAGE PROCESSING WITH IMAGE ALGEBRA

Parallel image processing activities are currently growing with development of new

parallel image processing architectures and new image processing algorithms. However,

current algorithm development is not based on an efficient mathematical structure specif-

ically designed for image manipulation and image analysis. Each researcher develops his

own architecture and ad-hoc image processing tools. This leads to an abundance of non-

standard notation and inefficient development of highly sophisticated and cost-effective

image processing algorithms. Image algebra provides an excellent algebraic notation

and a common mathematical framework for image processing algorithm specification,

comparison, and performance evaluation. It is a purely mathematical structure, indepen-

dent of any computer architecture or language. Image algebra provides highly parallel

primitives and is a good tool to write parallel programs for image processing when it is

implemented on a well-fit parallel architecture.

The low-level image processing tasks require computations corresponding to each

pixel of an image. The computations are usually highly regular and can be characterized

by their local nature. The intermediate-level image processing involves reduction of

image pixel information to a form that will be effective for high-level image analysis.

The computations for intermediate-level image processing are not very regular and

parallelism is not immediately evident. The high-level image analysis involves cognitive

use of knowledge. There is much less agreement about the computational view of

high-level image analysis [5]. Many low-level image processing algorithms such as

41

image enhancement, edge detection, morphological image processing, and thresholding

algorithms have been described using image algebra [85]. For low-level image processing

applications, we only develop and describe algorithms with image algebra for the

Abingdon Cross image processing benchmark and shrinking binary image components.

For intermediate-level image processing applications, we develop a few algorithms

for labeling image components and computing properties of image components on

SIMD mesh-connected computers and describe the algorithms with image algebra. This

demonstrate that image algebra characterizes very well the low-level and intermediate-

level image processing on parallel computers.

5.1 Abingdon Cross Benchmark

The Abingdon Cross benchmark was developed to compare the performance of

computers used in digital image processing [56, 57, 58]. The benchmark involves

extraction of a skeletonized line cross from a solid cross embedded in white Gaussian

noise. No algorithm for the benchmark is given and any algorithm may be used. Usually

it includes linear and nonlinear operations.

The standard 512 x 512 Abingdon Cross image shown in Figure 15 is generated

using the method described by Preston [56]. Note that other images appear in this

section magnified by a factor of 2 relative to the 512 x 512 image in this figure. The

Abingdon Cross image has gray levels between 0 and 255. The vertical stroke occupies

columns 224 through 288 and rows 64 through 488, and the horizontal stroke occupies

columns 64 through 488 and rows 244 through 288. Both arms have gray level 160,

and the cross center has gray level 192, while the background has gray level 128. The

cross image is imbedded in white Gaussian noise. The noise added to all pixels has

zero mean and a standard derivation of 32. Since the strokes of the cross whose gray

42

level is 32 above the background, they have a signal to noise ratio of zero decibels

(S/N = 20 logo (32/32) = 0). Similarly, the cross center has a signal to noise ratio of

6 decibels (S/N = 201oglo (64/32) ; 6).

Figure 15. The standard 512 x 512 Abingdon Cross image

Due to the size limitation of the machine on which we perform the benchmark, we

scale down the standard Abingdon Cross image to a 64 x 64 image. There are two

43

methods for the scaling process [56]. In the first method, the value of each pixel is

taken randomly from the corresponding 8 x 8 block of pixels in the standard image. The

noise in the image obtained this way still have zero mean and a standard deviation of 32.

The signal to noise ratio is still maintained. However, as the two edges of each cross

stroke move closer due to the scaling down, the total signal energy decreases, while the

noise energy remains the same. In the second method, the value of each pixel is the

local average of the pixel values of the corresponding 8 x 8 block in the standard image.

The noise is still have zero mean. But the standard deviation of the noise is reduced

to (64/512) 32 = 4. The second method is usually used if the benchmark has to be

performed on a smaller image. The left image in Figure 16 is generated using the second

scaling method. The noise added to the image has a standard deviation of 4. To illustrate

the robustness of our algorithm for the Abingdon Cross benchmark, we generate a test

image shown on the right side of Figure 16 with higher level noise. The noise has zero

mean and a standard deviation of 16.

Figure 16. The scaled down Abingdon Cross image

(left) and the test image with higher level noise (right)

To obtain a skeleton cross from the noisy cross image, we proceed in four steps:

thresholding, enhancement, erosion, and thinning. The input image a is a 64 x 64

Abingdon Cross image with Gaussian noise. The output image e is a one-pixel thick

cross image. The algorithm is described as follows:

44

1. Thresholding: Since the gray level of the background is 128 and the gray level

of the arms is 160, we choose 144 as the threshold value to threshold the noisy cross

image. This can be done by

b := X>140(a).

The thresholded image is shown in Figure 17.

Figure 17. The thresholded cross image

2. Enhancement: This step enhances the thresholded image by removing noise. We

first apply a 1 x 7 horizontal binary median filter to the thresholded image to obtain the

image shown on the left side of Figure 18, then apply a 7 x 1 vertical binary median

filter to the thresholded image to obtain the image shown in the middle of Figure 18, and

finally or together the two filtered image to obtain the image shown on the right side of

Figure 18. This step can be done by

ci := X>3(b @h)

C2 = X>3(b Ov)

C := C V c2,

where templates h and v are defined as shown in Figure 19.

Figure 18. The horizontally filtered image (left), the vertically

filtered image (middle), and the or of the filtered images (right)

1

1

1

h= 11 1 1 1 1 1 v = 1

1

1

1

Figure 19. The templates used in the enhancement step

3. Erosion: This step makes the cross thinner. We apply the image-template operation

with the 3 x 3 template t as shown in Figure 20 to erode the enhanced image 3 times

as follows:

d := c

Loop 3 times

d := dit.

t= 1 1 1

Figure 20. The template used in the erosion step

Figure 20. The template used in the erosion step

46

The eroded image is shown in Figure 21.

Figure 21. The eroded cross image

4. Thinning: This step thins the cross to a one-pixel thick line cross. We adopt a

morphological thinning algorithm [31] for this step. The algorithm uses four structuring

templates dl, d2, d3, and d4 as shown in Figure 22 to remove pixels from four directions

and four structuring templates el, e2, e3, and e4 as shown in Figure 23 to remove extra

pixels at skeleton junctions. The algorithm consists of four passes. The first pass changes

0

d= 0

d3 =

0

1

1

-\

1

0

1

0

0

d2= 1

d4 = 0

0

0

1

1

1

I /

1

o

0

0

1

Figure 22. The structuring templates di, d2, d3, and d4

0 1

e = 1 1 e2= 1 1 0

1 1

1 1

e3= 1 1 1 e4= 0 1

0 1

Figure 23. The structuring templates el, e2, e3, and e4

the value of a pixel from 1 to 0 if the neighborhood of the pixel is one of the configurations

shown in dl, d2, and el. The second pass changes the value of a pixel from 1 to 0 if its

neighborhood is one of the configurations shown in d2, da, and e2. The third pass changes

the value of a pixel from 1 to 0 if its neighborhood matches one of the configurations

shown in d3, d4, and e3. The fourth pass changes the value of a pixel from 1 to 0 if

its neighborhood matches one of the configurations shown in d4, di, and e4. We can

express the thinning process as follows:

e := d

Loop 2 times

e :=eA (ee d) A (e d2) A (eC el)

e :=eA (ee d2) A (e d3) A (eC e2)

e := eA (e d3) A (e E d4) A (e e3)

e := eA(e d4) A (e d) A (e e4)

The thinned image shown in Figure 24 is obtained by the above process from the eroded

image.

Figure 24. The thinned cross image

The above algorithm for the Abingdon Cross image processing benchmark has been

implemented on the Hughes 3D machine emulator [28] by hand. The Hughes 3D

computer is basically a mesh-connected computer, and it has a clock rate of 10 MHz.

The experiment has shown an impressive result. The benchmark measures are a quality

factor and a price performance factor. The quality factor is defined to be the size of the

image processed divided by the execution time. The price performance factor is defined

to be the number of pixels processed divided by the product of the execution time and the

cost of the computer. The fastest result reported so far [58] is 42.2 microseconds for a

104 x 104 Abingdon Cross image on Martin Marietta's GAPP II whose clock rate is also

10 MHz with a quality factor of 2.5 million. Our benchmark result for a 64 x 64 Abingdon

Cross image is 15.5 microseconds with a quality factor of 4.1 million. If we perform the

benchmark for a 128 x 128 Abingdon Cross image, we will have an even larger quality

factor, since the execution time will not increase significantly. Because the 3D machine

is not commercially available, we are unable to compute the price performance for our

benchmark result at this moment.

5.2 Shrinking Binary Image Components

Shrinking image components is a useful process in counting and labeling connected

components in the image. A set S of black pixels in a binary image can be considered

49

connected according to two different distance definitions [66, 69, 32]. For any two pixels

in S, if there exists a path joining them such that all pixels of the path are in S and all

successive pixels at (i,j) and (k,l) satisfy di((i,j),(k,1)) = |i hi + Ij kI < 1, S

is called 4-connected; if they satisfy d2((i,j), (k,)) = max (li h, Ij k ) < 1, S is

called 8-connected. A connected component in a binary image is a maximal connected

set of black pixels. A parallel-shrink operator shrinks binary image components into

representative pixels, preserving the connectivity, when applied repeatedly to a binary

image. To count the connected components in the binary image, we count the represen-

tative pixels in stead. To label the connected components, we assign unique labels to

the representative pixels and propagate the labels to all the pixels of their corresponding

components.

From now on, we assume that black pixels have value 1 and white pixels have value 0.

We also assume 8-connectivity among the pixels belonging to the same component. With

a simple modification, however, the techniques can be extended to the 4-connectivity

case.

5.2.1 Levialdi's Parallel-Shrink Operators

The basic idea of Levialdi's parallel parallel-shrink algorithm is to shrink each

component in parallel to an isolated pixel at one corer of its bounding rectangle which

is the smallest upright rectangle containing all the black pixels of the component [36].

We refer to the isolated pixel as the representative pixel of the component.

Let a E {0, 1}X denote the source image, where X is an n x n point set. Levialdi

defined four parallel-shrink operators in terms of the Heaviside function. Each uses

a different 2 x 2 window and is distinguished by the direction in which it shrinks

components. Since a parallel-shrink operator changes pixel values of an image in parallel

50

when applied to the image, it is helpful to describe a parallel-shrink operator by the

configurations of the neighborhood used for changing pixel values.

Levialdi's parallel-shrink operator p-, uses the information in the 2 x 2 neighborhood

shown in Figure 25 and shrinks components toward the upper left covers of their

bounding rectangles.

j j+1

7 \

i+1

Figure 25. The neighborhood for Levialdi's parallel-shrink operator 'p,t

The operator can be described as follows:

a. When the window has the configuration shown in Figure 26(a), change the black

pixel to white;

b. When the window has the configuration shown in Figure 26(b), change the white

pixel to black;

c. Otherwise, do not change.

1 0 0 1

0 0 1

(a) (b)

Figure 26. The configurations for the parallel-shrink operator Wp;, where is either 1 or 0

If a* is the image obtained by applying a shrinking operation to the image a, then

a* = [aA (aC t)] V (a tt2),

51

where tl and t2 are as shown in Figure 27. Obviously, Levialdi's parallel-shrink operator

aput can be implemented in 0(1) time on MCCs.

o 1

t \ / t2 =\

0 0 1

Figure 27. The templates for Levialdi's parallel-shrink operator pt.

Similarly, Levialdi defined three other parallel-shrink operators which shrink con-

nected components toward three other covers.

Levialdi [36] has proved that when any one of his parallel-shrink operators is applied

repeatedly in parallel to a binary image,

1. a correct shrinking will be obtained, i.e., only black pixels that do not disconnect

a component will be changed to white pixels, and white pixels are not allowed to

become black pixels when this merges two or more components;

2. a single isolated pixel will be obtained for each component;

3. the maximum number of steps necessary for completely shrinking a given component

with an r x r bounding rectangle is 2r 1.

5.2.2 New Parallel-Shrink Operators

Levialdi's parallel-shrink operators shrink each component toward one corer of its

bounding rectangle. If we design a new parallel-shrink operator which shrinks toward

more than one corer, we should be able to shrink an image completely in fewer steps.

Thus, we try to combine Levialdi's parallel-shrink operators to obtain new parallel-shrink

operators [73]. Since a parallel-shrink operator combined from two Levialdi's parallel-

shrink operators which shrink toward opposite directions has difficulty to determine which

direction it should shrink toward from the local information, we can only combine any

52

two of Levialdi's parallel-shrink operators which do not shrink toward opposite directions

to obtain four new parallel-shrink operators.

As we know, Levialdi's parallel-shrink operator cul shrinks toward the upper left

corer, and his another operator pjj shrinks toward the lower left corer. Thus, combining

the effects of pit and VII, we have a new parallel-shrink operator VoI to shrink each

component toward the left side of its bounding rectangle. The new operator pl uses a

3 x 2 neighborhood as shown in Figure 28.

j j+1

\ /

Figure 28. The neighborhood for the new parallel-shrink operator pt

When applied to an image, the operator pl changes the value of a pixel if and only

if one of put, and jpn changes it. Thus, the operator cp can be expressed as follows:

a. When the neighborhood matches one of the configurations shown in Figure 29(a),

change the black pixel to white;

b. When the window has the configuration shown in Figure 29(b), change the white

pixel to black;

c. Otherwise, do not change.

0 0 1

1 0 0 1

*

*

1 0 0 1

0 0 1 *

(a) (b)

Figure 29. The configurations for the parallel-shrink operator p, where is either 1 or 0

If a* is the image obtained by applying t to the image a, then

a* = [aA (a ti) A (a t3)] V [(a [ t2) V (aj t4)],

where tl and t2 are as before, and t3 and t4 are as shown in Figure 30. It is not difficult

to see that the new parallel-shrink operator ; can be also implemented in 0(1) time

on MCCs.

o 0

t3 = t4=

0 1

Figure 30. Two more templates for the new parallel-shrink operator pj

Similarly, we can define three other new parallel-shrink operators which shrink toward

three other directions.

We can prove that any new parallel-shrink operator can shrink an n x n binary image

in at most [1.5n] 1 parallel steps, preserving the 8-connectivity during the shrinking

process. Due to similarity, we only give proofs for the operator ipj.

54

Theorem 5.1: The parallel shrinking algorithm is correct, i.e., when Ip is applied

repeatedly in parallel to a binary image,

(1) only black pixels that do not disconnect a component will be changed to white;

(2) white pixels do not become black if this merges two or more components.

Proof: (1) The cases in which a*(p) = 0 when a(p) = 1 are shown in Figure 31. We

consider the three possibilities that the shrinking process might disconnect a component:

(a) a = 1, b = e = 0, and at least one of c and d is 1: b* = 1. So the 8-connectivity

is preserved.

(b) e = 1, d = a = 0, and at least one of c and b is 1: d* = 1. So the 8-connectivity

is preserved.

(c) a = e = 1, b = d = 0, and c may have 0 or 1: b* = 1 and d* = 1. Thus, the

8-connectivity is still preserved.

c b a e 0 0

d 1 0 d 1 0

e 0 0 c b a

Figure 31. The cases in which a*(p) = 0 when a(p) = 1

(2) The cases in which a*(p) = 1 when a(p) = 0 are shown in Figure 32. The only

possibility that it may merge two or more components is that c = 1, b = d = 0, and a

55

and e may have any values of 0 and 1. However, c* will be 0. Thus, the 8-connectivity

is preserved.

c b a e 1 *

d 0 1 d 0 1

e 1 c b a

Figure 32. The cases in which a*(p) = 1 when a(p) = 0

Q.E.D.

Theorem 5.2: When li is applied repeatedly in parallel to a binary image, each

connected component in the image will be completely shrunk.

Proof: Suppose a component is bounded by a rectangle formed by four lines i = il,

i = i2, j = jl, and j = j2 as shown in Figure 33. Line L1 and line L2 divide the

component into three areas. Let A denote the set of all the pixels of the rectangle in

area A, including all the pixels on the boundary of A. Let B denote the set of all the

pixels of the rectangle in area B, and C denote all the pixels of the rectangle in area C.

Assume that all the pixels in B and C are white pixels, and there are some black pixels

on both line L1 and line L2. Initially, B and C may be empty, and A may contain black

pixels and white pixels in the rectangle.

Now consider applying our parallel-shrink operator Vp to this component. We may

change some black pixels to white and some white pixels to black in A. However, we

keep all pixels in B and C white, and change all the black pixels on lines L1 and L2 to

white, which means that we can move line L1 and line L2 one pixel to the left. Thus,

56

after an application of tpj, the size of A is decreased. If we apply W, repeatedly, we can

decrease the size of A to 0. Therefore, we can completely shrink the component.

S----L---------- -------

B '

A

/ I

i L. -2

Figure 33. Illustration of parallel shrinking process

Q.E.D.

Unlike Levialdi's parallel-shrink operators, the operator p1, when applied repeatedly,

might shrink a component into either a single isolated pixel or two vertically connected

pixels surrounded by all white pixels and then shrinks the component completely when

applied one more time.

Theorem 5.3: The maximum number of steps necessary for completely shrinking a

component with an r x r bounding rectangle using Wol is [1.5r] 1.

Proof: Suppose a component is bounded by an r x r rectangle as shown in Figure 33.

As we observed in the proof of theorem 2, every application of Wo1 moves both line L1 and

line L2 one pixel to the left. Thus, they move 2 pixels closer along the right side of the

57

bounding rectangle. After at most [I] 1 application of 1oj, line L1 and line L2 will be

one or two pixels apart on the right side of the rectangle. From now on, every application

of jpl will turn the black pixel(s) on the rightmost column into white pixel(s) and leave

one or two black pixels on the column immediately to the left of the rightmost column.

Thus, r more applications of Ij will completely shrink the component. Therefore, the

maximum number of steps for completely shrinking a component bounded by an r x r

rectangle is [1.5r] 1.

Q.E.D.

For a component bounded by an r x r rectangle, it is not difficult to see that any

shrinking algorithm, which shrinks components pixel by pixel, preserving 8-connectivity

in the shrinking process, takes at least r/2 steps in the worst case, since there can not exist

a pixel in the rectangle such that the maximum 4-connected distance between that pixel

and any pixel on the component boundary is less than r/2. Thus, this kind of parallel

shrinking algorithm needs Q(n) steps to shrink an n x n binary image in the worst case.

Levialdi's parallel shrinking algorithm shrinks an n x n binary image within O(n)

steps by changing some black pixels to white pixels and changing some white pixels

to black pixels as necessary. Thus, his algorithm is optimal asymptotically, but with a

multiplicative constant of 2. The new parallel shrinking algorithm we have proposed

is also optimal asymptotically and shrinks an n x n binary image within O(n) steps.

However, we have a smaller multiplicative constant, namely, 1.5. When used in a local

image component labeling algorithm, a parallel-shrink operator taking fewer steps requires

the labeling algorithm to store fewer intermediate images.

58

5.3 Labeling Binary Image Components

Labeling the connected components of a digitized image is a fundamental process

in image analysis and machine vision [68]. The process of labeling assigns a unique

label to each connected component in the image. Once an image has been labeled, the

components which correspond to different objects can be studied, described, and possibly

recognized by higher-level image analysis algorithms.

From the definition of connectedness of image components in a binary image, we

know that labels can be propagated locally among adjacent pixels. However, two pixels in

the same component can be connected by a relatively long path. Thus, the image labeling

problem has local and global features which can be used in developing local labeling and

global labeling algorithms. The labeling of connected components has been intensively

studied and many algorithms for different architectures have been proposed [2].

On an n x n MCC, Nassimi and Sahni [50] gave a well-known labeling algorithm

which uses a divide-and-conquer technique with global operations and complex pointer

operations to label an n x n binary image in O(n) time with O(logn) bits of local

memory. However, this algorithm has a very large multiplicative constant in its time

complexity. Recently, some parallel algorithms using only local operations have been

proposed [11, 1]. These algorithms use a fast binary image shrinking algorithm devised

by Levialdi [36]. They have very small multiplicative constants in their complexities and

are more practical. Cypher, Sanz, and Snyder's first algorithm [11] takes O(n) time and

O(n) bits of local memory. To reduce the space requirement of their algorithm, we use

a new parallel shrinking algorithm discussed in the last section and propose an algorithm

which takes O(n) time and reduces local space requirement by a constant factor.

Several algorithms has been proposed to reduce the space requirement more [11, 1,

59

75, 72]. Cypher, Sanz, and Snyder's second algorithm [11] requires O(log n) bits of local

memory. However, it takes O(n log n) time. Alnuweiri and Prasanna [1] provided an

algorithm with an integer parameter k between 1 and log (2n) which requires 0 (knl/k)

bits of local memory and takes O(kn) time. These well-known local labeling algorithms

fail to achieve the lower bound of O(n) in time and O(log n) in space which can be

achieved by the global labeling algorithm. Whether there exists a local labeling algorithm

which can achieve that lower bound has remained an open question [1, 2]. We propose

a fast algorithm which positively answers this open question. In order to label an n x n

binary image in O(n) time and O(log n) space, our algorithm uses a pipeline mechanism

with stack-like data structures to shrink O(log n) images at the same time. The algorithm

achieves the complexity lower bound with small multiplicative constants, making it the

most efficient algorithm in both practical and asymptotic complexity measures.

We consider labeling an n x n image on MCCs and present labeling algorithms in

terms of image algebra primitives [74, 71]. In the following presentation, we define

d E Zx by d(i,j) = i n + j, i.e., d has a unique value at each pixel. Given an

8-connected component S and pixels p and q in S, the 8-distance between p and q is

the length of a shortest path joining p and q such that neighboring pixels in the path are

8-connected. If we restrict all the pixels of the path to be contained in S, then the length

of a shortest such path is called the intrinsic 8-distance between p and q. The 8-diameter

of S is defined as the greatest 8-distance between any pair of pixels in S. The intrinsic

8-diameter of S is the greatest intrinsic 8-distance between any pair of pixels in S.

5.3.1 Naive Local Labeling Algorithm

A naive local labeling algorithm starts with assigning each black pixel a unique label

and propagates labels in 3 x 3 neighborhoods repeatedly until the maximum label of each

component is propagated to all its pixels.

Label propagating operator

Let po be the template shown in Figure 34 whose support is a 3 x 3 neighborhood.

The label propagating operator b can be easily expressed in image algebra. If b = O(a),

then

b = (aM po) X>o(a).

0 0 0

PO = 0 \ 0/ 0

0 0 0

Figure 34. The template for the label propagating operator b

The algorithm

Since any image comparison takes O(n) time on MCCs, we only compare two images

every n steps in the naive algorithm. The algorithm is expressed in image algebra as

follows:

Algorithm Label_1

1 := d.a;

k := 1;

loop

c := 1;

1 := (1);

if k mod n = 0 then

exit when 1 = c;

end if;

k := k+l;

end loop;

end algorithm.

61

The algorithm is very simple and only stores a couple of images. The execution

time depends on the intrinsic 8-diameter of the maximum component in the image, since

the label of each component is propagated pixel by pixel. Suppose that the intrinsic

8-diameter of the maximum component is d. The time complexity of the naive algorithm

is O(max (n, d)), since every label propagation takes constant time and every image

comparison takes O(n) time.

The intrinsic 8-diameter of a component could be directly proportional to n2. Thus,

the naive local labeling algorithm takes O(n2) time in the worst case. To be precise, we

consider a worst case image which contains only a spiral component as shown in Figure

35(a). For this specific example, it takes 60 iterations to propagate the maximum label

of the pixel at the lower right corer to the interior endpoint of the spiral. In general,

Figure 35. (a) An image containing a spiral component

(b) Labeling process of the spiral component

for an n x n image containing a spiral component as shown in Figure35(a), let Ln be

the number of iterations needed to propagate the maximum label to all the pixels in the

component. The maximum label can be propagated to a pixel (the pixel with dark shade

in Figure 35(b)) on the layer next to the outside layer in 4(n 2) steps and the remaining

pixels in the spiral component form a spiral component in an (n 4) x (n 4) image.

Thus, we have the following recurrence equation:

Ln = 4(n 2) + Ln-4

62

with L1 = 0, L2 = 1, L3 = 4, and L4 = 7. Solving the recurrence equation, we have

L -1 if n is even [n2

3 if n is odd 2

Therefore, the naive algorithm takes -1 = O(n2) local operations in the worst case.

For an 11 x 11 image shown on the left of Figure 36, the naive local labeling

algorithm Label_1 takes at least 23 iterations to obtain the label image shown on the

right of Figure 36.

Figure 36. An example of the naive labeling algorithm Label_l

5.3.2 Basic Two-Phase Local Labeling Algorithm

A good method to reduce the time complexity of the naive labeling algorithm is to

have a fast image shrinking phase before the label propagating phase. The number of

iterations required by an algorithm using both a shrinking operator and a label propagating

operator is only directly proportional to n. However, the price for decreasing time

complexity is an increase in space complexity.

The basic two-phase local labeling algorithm proceeds in two phases [11]. In the

first phase, it applies the parallel-shrink operation devised by Levialdi [36] to a binary

n x n source image ao repeatedly until an all-zero image is obtained, resulting in 2n

binary images ao,al,...,a2n-1. This phase shrinks connected components to their

representative pixels. Note that the image a2n-1 is an all-zero image. We refer to I

as the index of the image aj in the sequence.

In the second phase, it assigns labels to image a2n-2 which consists of representative

pixels only. These labels are then propagated to the pixels connected to them in a2n-3

by applying a label-propagate operation, then to a2n-4, and so on until ao is labeled.

63

In the labeling process, new representative pixels which represent new components may

be encountered. When this happens, new labels are assigned to the representative pixels

and the labeling process continues. Note that more than one connected component may

be shrunk to the same representative pixel, but at different iterations. Thus, a unique

label assigned to a representative pixel should include the iteration number at which it

is generated.

The above simple algorithm applies the label-propagate operations in the reverse

order to the images generated by the parallel-shrink operations. Thus, it needs to store

2n images, requiring each PE to have 2n bits of local memory.

Label-propagate operator

We have discussed Levialdi's parallel-shrink operator pu; in the last section. Cor-

responding to the parallel-shrink operator, we have a label-propagate operator. A label-

propagate operator labels the black pixels of at from the already labeled image ai+l.

Since a parallel-shrink operator, when applied to a binary image iteratively, may shrink

different components into the same isolated pixel in different shrinking steps, we assign

a new label (1 + 1, i,j) to an isolated black pixel in at which represents a new compo-

nent, where 1 + 1 is used to distinguish the component of single pixel at (0,0) and the

background (white pixels). Since 0 < i,j < n 1 and 0 < 1 < 2n 1, (1 + 1,i,j) can

be considered as a (3 log n + 1)-bit number with I + 1 in the first log n + 1 bits, i in the

middle log n bits, and j in the last log n bits.

The label-propagate operator bu, corresponding to Levialdi's parallel-shrink operator

put uses the information of the neighborhood as shown in Figure 37 to propagate the

labels of black pixels in image a,+l to their neighboring black pixels in image al.

j-1 j

i-1

i

Figure 37. The neighborhood for the label-propagate operator 'tut

Let lat be the label image of al. If lat = l,(laIl+, al), then

lat = ((lal+l V al) l pi) .at,

where the template pi is shown in Figure 38. The operator 0,; overlaps the label image

of al+l and the image at, and then propagates the labels in the first image to the black

pixels in the second image. It is not difficult to see that the label-propagate operator can

be implemented on MCCs in O(1) time.

o 0

P =

Figure 38. The template for the label-propagator operator kult

The algorithm

The basic two-phase local labeling algorithm using Levialdi's parallel-shrink operator

pul and its corresponding label-propagate operator 'ul is as follows:

Algorithm Label_2

ao := a;

/* Shrink image components */

for k := 1 to 2n 2 loop

ak := c (ak-1); /* Shrink */

end loop;

/* Assign and propagate component labels */

65

1 := (d + n2(2n 1)) a2n-2; /* Assign new labels */

for k := 2n 3 downto 0 loop

1 := l,1(l,ak+1); /* Propagate labels */

1 := ((d + n2(k + 1)) x=1(l)) V 1; /* Assign new labels */

end loop;

end algorithm.

Levialdi [36] has shown that when the operator 0utl is applied 2n 1 times to an

n x n image, the image becomes an all-zero image. Thus, the above algorithm is correct

and uses 2n 2 parallel-shrink operations and 2n 2 label-propagate operations. It

stores 2n 2 intermediate images. Therefore, both the time complexity and the space

complexity of the algorithm are O(n).

The algorithm is illustrated in Figure 39. It takes 20 iterations to shrink the 11 x 11

image and 20 iterations to label the image components. Note that only nonzero images

are shown in the figure.

H@ rT r;rlr r

r I r

Figure 39. Illustration of the labeling algorithm Label_2

5.3.3 Improved Basic Two-Phase Local Labeling Algorithm

As we discussed in the last section, the new parallel-shrink operator #;, when used

in a local image component labeling algorithm, requires the labeling algorithm to store

fewer intermediate images. We use the new parallel-shrink operator to reduce the memory

66

requirement of the basic two-phase local labeling algorithm and obtain an improved basic

two-phase local labeling algorithm.

Label-propagate operator

Corresponding to the new parallel-shrink operator pt, a label-propagate operator 1;

is designed which uses the information of the neighborhood shown in Figure 40.

j-1 j

i-i

i

i+1

Figure 40. The neighborhood for the label-propagate operator b1

Let la; be the label image of al. If lal = 1l(lal+1, al), then

la; = ((lal+l V al) M P2) a1,

where the template p2 is shown in Figure 41. The operator 1; overlaps the label image

of al+l and the image al, and then propagates the labels in the first image to the black

pixels in the second image.

0 0

2 = 0 0

\ /

0 0

Figure 41. The template for the label-propagator operator 1\

The algorithm

Since the new shrinking operator W, may shrink a connected component to two

neighboring representative pixels in a column, we should propagate labels with a 2 x 1

67

template shown in Figure 42 after assigning unique labels to representative pixels to make

representative pixels of each component have the same label.

0

P3 =

0

Figure 42. The template for making two representative

pixels of a component have the same label

The algorithm using the new parallel-shrink operator pI and its corresponding label-

propagate operator 01 is as follows:

Algorithm Label_3

ao := a;

/* Shrink image components */

for k := 1 to [1.5n] 2 loop

ak := ul(ak-1); /* Shrink */

end loop;

/* Assign and propagate component labels */

1 := (d + n2([1.5n] 1)) a[1.5n-2; /* Assign new labels */

1 := (1 M P3) X>0(1); /* Make two new labels for a component the same */

for k := [1.5n] 3 downto 0 loop

1 := ,1(1, akl+); /* Propagate labels */

1 := ((d + n2(k + 1)) X=1(l)) V 1; /* Assign new labels */

1 := (1 P3) X>o(1); /* Make two new labels for a component the same */

end loop;

end algorithm.

It has been shown that the new shrinking operator can shrink any n x n image

completely in at most [1.5n] 1 steps. The algorithm Label_3 is correct and takes

[1.5n] -2 parallel-shrink operations, [1.5n] -2 label-propagate operations, and f1.5n] -2

additional image-template operations involved with a 2 x 1 template, but it only requires

68

to store [1.5n] 2 intermediate images. Thus, the time complexity of the algorithm

is O(n). The space complexity is also O(n). However, the labeling algorithm Label_3

stores fewer intermediate images than the labeling algorithm Label_2.

The algorithm is illustrated in Figure 43. It takes 15 iterations to shrink the 11 x 11

image and 15 iterations to label the image components. Note that only nonzero images

are shown in the figure.

E* sr Ir [mu IITJ

Figure 43. Illustration of the labeling algorithm Label_3

5.3.4 Advanced Local Labeling Algorithm

In the basic local labeling algorithms, the label-propagate operations are applied in

reverse order to the images generated by the parallel-shrink operations. Thus, we have to

store the intermediate images or generate them on the fly in the label propagating process.

Suppose that for the source image a = ao, we have intermediate images ao, al,..., a,_1

to be generated by the shrinking process. If less than s images are allowed, then only

a subset of the s images can be stored at any time, and the remaining images have to

be generated in order during the label propagating process. Suppose that the image ai+l

has been labeled and the next image to be labeled is al. If at is stored, it can be labeled

using a label-propagate operation; if not, it has to be generated from a stored image a'l

with maximum index l' < 1.

Since Levialdi's parallel-shrink operator and the new parallel-shrink operator have

the same asymptotic complexity in shrinking binary images, we simply use Levialdi's

69

parallel-shrink operator. With Levialdi's parallel-shrink operator, the number of interme-

diate images needed for labeling an n x n image is at most s = 2n.

Cypher, Sanz, and Snyder's log-space algorithm [11] reduces the local memory

requirement to O(log n) binary images, but it increases the time to O(n log n). Alnuweiri

and Prasanna's algorithm [1] requires O (knl/k) bits of local memory with k in the range

between 1 and log (2n), but it takes O(kn) time.

Both the log-space algorithm [11] and the stack-based algorithm [1] can work

under the space constraint of storing only O(logn) binary images, but they have to

take O(nlogn) time. Although they are conceptually similar and have to use more

time if less space is allowed, the log-space algorithm stores intermediate images in an

order determined by two functions and is not general, while the stack-based algorithm

determines the storage order in a natural way using stacks and is more general. The stack-

based algorithm uses k stacks, each having capacity of storing (2n)1/k binary images,

where k is an integer in the range between 1 and log (2n). The stack-based algorithm

takes O(kn) time and requires space for storing O(kn1/k) binary images. Here, we

describe the stack-based algorithm using image algebra.

Let 2n = mk for any integer k with 1 < k < log(2n). We use k stacks

So, S1,...,Sk-1, each having capacity of storing m binary images. In the algorithm,

stack Sq stores each image ar with index r divisible by mq. We first define some

primitive operations on stacks as follows:

* set_empty(S): set stack S empty

* empty(S): return true if stack S is empty and false otherwise;

* pop(S): pop the top image from stack S;

* push(a,S): push image a into stack S.

The algorithm is as follows:

Algorithm Label_4

/* Initialize */

1 := 0; 1 = 2n;

forh := 0tok -1 loop

set_empty(Sh);

end loop;

q := k;

loop

for h := q- 1 down to 0 loop

/* Shrink the image mh(m 1) times and push m images into Sh */

push(a, Sh);

for r := 1 to mh+l mh loop

if r mod mh = 0 then push(a, Sh);

a :=

end loop;

if h # 0 then a := pop(Sh);

end loop;

/* Label m images popped from So */

for r := 1 to m loop

1 := l(1, pop(So));

1 := ((d + n2) X=(1)) V 1;

l := 1;

end loop;

/* Find the highest nonempty stack */

q := 1;

while q < k and empty(Sq) loop

q := q+1;

end loop;

exit when q = k;

a := pop(Sq);

end loop;

end algorithm.

Obviously, the algorithm only requires space for storing O(km) = 0 (knl/k) images.

Let Th be the number of parallel-shrink operations needed for labeling image arnh in

stack Sh after image a(r+)mnh is labeled. To label armh, the algorithm generates the

mh intermediate images between armh and a(r+1)mh, and stores m immediate images

in stack Sh-1. The image armh is labeled after all m pairs of images in consecutive

locations of Sh-1 are labeled. Thus, we have

Th = mh + mTh-1

with To = 0. Solving the recurrence equation, we have Th = hmh. Since the labeling

problem can be considered as the problem of labeling ao after amk is labeled in an

imaginary stack Sk, the number of parallel-shrink operations for the labeling problem is

kmk. Obviously, the algorithm takes mk label-propagate operations. Therefore, the time

complexity of the algorithm is O(kn).

5.3.5 Fast Advanced Local Labeling Algorithm

The previous algorithms reduce local space requirement at the price of increasing time

complexity. The reason for this is that they only shrink one image at a time. Let 2n = mk

for any integer k with 1 < k < log (2n). We propose a fast local labeling algorithm [75,

72] which shrinks k images at each time in order to maintain the time complexity to

O(n) regardless of k with O(km) local space. In the algorithm, any intermediate image

at has been generated and is available for use by the time it is needed. The algorithm is

based on a pipeline mechanism and stack-like data structures.

Shrinking multiple images each time

A collection of k binary images can be considered as an image containing k-bit

binary numbers. Suppose that ac is a collection of k binary images with k = O(log n)

and ac[h] is some binary image at for 0 < h < k 1. We can shrink the k images in ac

simultaneously to obtain a new collection of k images in ac* as follows:

ac* := [ac A (ac(tl)] V (ac t2),

where A and V are k-bit bitwise logic operations and tl and t2 are defined as shown in

Figure 44. This shrinking operator will be referred to as pl. If the h-th image in ac

is some binary image alh, i.e., ac[h] = alh, where lh is an index and 0 < h < k 1,

then ac*[h] = ath+ .

00...0 11...1

t = t=

00...0 00...0 11...1

Figure 44. The templates for shrinking multiple images

It is not difficult to see that the parallel-shrink operator pkl with k = O(log n) can

be implemented in 0(1) time on MCCs.

Pipeline mechanism

We design a pipeline consisting of k stages Stagek_1, Stagek-2,..., Stageo as shown

in Figure 45.

Stagek_1 takes image ao as input, shrinks it mk 1 times, stores the interme-

k-1 k-2 h 1 0

Figure 45. The pipeline mechanism

73

diate images ao,amk-1,... ,a(m-_1)k-1, and outputs a(m_)mk-1, a(m-2)mk-1, ..., 0

to Stagek_2. For each almk-1 taken from Stagek-1, where 1 = m -

1,m 2,... ,0, Stagek_2 shrinks it mk-1 1 times, stores the intermediate

images almk-,mk-mk-1+mk-21, ... alrk-1+(m-1)mk-2, and outputs almk-1+(ml)mk-2,

almk-1+(m-2)mk-2, ..., almk-1 to Stagek-3. In general, for each image almh+1

taken from Stageh+1, where 1 = mk-(h+l) 1,mk-(h1+) 2,..., 1,0, Stageh

shrinks it mh+l- 1 times, stores almh+,almh+i+mh, ... ,almh+4+(m-_)mh, and out-

puts almh+1+(m-1)mh, almh+1+(m-2)mh, ..., aMh+l to Stageh_1. Thus, for each

aim from Stage1, where 1 = mk-1 1,mk-2 2,..., 1,0, Stageo shrinks

it m 1 times, stores alm, alm+1,... alm+(m-l), and provides the sequence

alm+(m-1), alm+(m-2), *, alm+l, alm to the labeling process.

From the above analysis, we can see that Stageh takes mh+2 steps to consume a

sequence of m images supplied by Stageh+1 and takes mh+l steps to produce a sequence

of m images for Stageh_1. Similarly, Stageh+1 takes mh+2 steps to produce a sequence

of m images needed by Stageh, and Stageh_1 takes mh+l steps to consume a sequence

of m images provided by Stageh. Thus, each time Stage/ consumes a sequence of m

images supplied by Stageh+1, another sequence of m images has been generated by

Stageh+1. On the other hand, each time Stageh produces a sequence of m images, the

previous sequence of m images has been consumed by Stageh_1. Therefore, we can

organize the pipeline stages to work on k different images at each step such that Stageh

produces a sequence of m images exactly when Stageh_1 needs such a sequence. When

Stageh produces its first sequence of m images, Stageh_1 starts to work. After Stageh

produces its last sequence of m images, it stops working. Whenever Stageo produces a

sequence of m images, we label them by applying m label-propagate operations.

We can use the step number to coordinate the pipeline stages. Let Stagek_1

74

start to work at step Startk-l = 0. With this information, we can determine

when each stage starts to work and when it stops working. Stageh has to start to

work at step Starth = (mk-(h+1) 1) mh+ and stops working after step Stoph =

(mk-(h+l) -_ )mh+ + (mk-h )mh, producing its first full S(h) at step Starth +

(m 1)mh = (mk-h )mh and its last full S(h) at step Starth + (m-h 1mh =

(mk-(h+l) )mh+1 + (mk-h )mh. With this pipeline scheduling, Stageh inputs

an image from Stageh+1 at a step which can be divided by mh+1 and stores an image

at a step which can be divided by mh in its working period.

Step Stage2 Stage 1 Stageo Image labeled

0 0

1 1

2 2

3 3

4 4 4

5 5

6 6 6

7 7 \ 7,6

8 0

9 1 5,4

10 2

11 \3,2

12 0

13 D 1,0

Figure 46. Illustration of how the pipeline works

The pipeline starts with the source image ao at the highest stage and ends up at the

lowest stage with a sequence of mk images amkl, amk2,... a* a0 whose indices are

in descending order. An tiny example of labeling a 4 x 4 image shown in Figure 46

illustrates how the pipeline works. Here, n = 4. Choosing k = 3, we have m = 2. We

want a sequence of images a7, ag,..., ao for the labeling process. Intermediate images

in squares generated by each stage need to be stored.

Data structure

Because Stageh outputs images to Stageh_1 in the reverse order it shrinks them, it

is natural to have stack data structures to store the intermediate images. Since in the

pipeline mechanism, Stageh prepares the next sequence of m images when the previous

sequence of m images is being used by Stageh_1, we combine two stacks together to

form a stack-like data structure for intermediate images flowing from Stageh to Stageh_1.

A stack-like data structure S of depth m is shown in Figure 47. The pointer pop_pointer

always points to the top and the pointer push_pointer always points to the first unused

element from the bottom. When push_pointer points to the bottom, S is empty; and when

m elements have been pushed into an empty S, S is full. Popping an element from S,

we obtain the top element and shift the elements in area data 1 one position to the top.

Pushing an element into S, we put the element to the position pointed by push_pointer

and adjust push_pointer to (push_pointer+l) mod m.

top bottom

mr-1 0

data 1 data 2

poppointer pushpointer

Figure 47. The stack-like data structure

We define the following basic operations which can be done in 0(1) time.

* setempty(S): set S empty;

* pop(S): return the top image in S and shift images in area data 1 one position to

the top;

* push(a, S): push image a into area data 2 of S.

76

We use k +1 stack-like data structures S(k), S(k 1),..., S(0), where S(k) is 1 bit

deep while each S(h), 0 < h < k 1, is m bits deep. The indices of images stored in

each stack-like data structure S(h) are multiples of m.

Since Stageh pushes a new sequence of m images into the stack-like data structure

S(h) when Stageh_1 pops the previous sequence of m images out of S(h), pushing

images into S(h) will never overwrite it. As the example in Figure 46 shows, Stage1

pushes the image sequence ao,a2 into S(1) when Stageo pops the previous sequence

a6,a4 out of S(1). Also, when Stageh-1 needs another sequence and starts to pop

images out from S(h), Stageh has already pushed a new sequence into S(h). Thus, we

will never pop a empty stack-like data structure. In the same example, when Stageo

needs the sequence a2, ao, Stage1 has already pushed that sequence into S(1).

The algorithm

The fast local labeling algorithm is presented below. Variable step is used to

coordinate all the pipeline stages to work concurrently and correctly. Variables high_stage

and low_stage are used to indicate that only Stagehigh_stage through Stagelow_stage are in

their working periods. Variable ac holds all the current binary images of the k pipeline

stages with ac[h] storing the current image of Stageh. When performing push(ac[h],

S(h)), each PE extracts the h-th bit from variable ac and pushes that bit into S(h). When

performing ac[h] := pop(S(h+l)), each PE pops out a bit from S(h+l) and puts that bit

into the h-th bit position of variable ac.

Algorithm Label_5

input: a (a binary image to be labeled);

output: 1 (an image with components uniquely labeled);

/* Initialize */

1 step := 0;

2 1 := 0;l := 2n;

3 low_stage := high_stage := k-1;

4 for h := 0 to k loop

5 set_empty(S(h));

6 end loop;

7 push(a, S(k));

8 while (highstage >= 0) loop

9 if (step mod mlowstage = 0) then

/* Find the last stage to update S */

10 last_stage := low_stage;

11 while (step mod mlaststage+l = 0) loop

12 last_stage := last_stage+l;

13 end loop;

/* Update stack-like data structures */

14 for h := last_stage down to low_stage loop

15 if (step mod mh+l = 0) then

16 ac[h] := pop(S(h+l));

17 end if;

18 if (step mod mh = 0) then

19 push(ac[h], S(h));

20 end if;

21 end for;

/* Label the images in S(0) when it is full */

22 if (lowstage = 0) and (step+1 mod m = 0) then

23 for r := 0 to m-1 do

24 1:= 01(1, pop(S(O)));

25 1:= [(d+n2l).X=1(1)Vl;

26 l := -1;

27 end loop;

28 end if;

78

/* Adjust low_stage and high_stage */

29 if (low_stage > 0) and (step >= (low_stage-l)start) then

30 low_stage := low_stage-l;

31 ac[low_stage] := pop[S(low_stage+1)];

32 push(ac[low_stage], S(low_stage));

33 end if;

34 if (step >= high_stagestop) then

35 high_stage := high_stage-l;

36 end if;

37 end if;

/* Shrink k images at the same time */

38 ac := 'p(ac);

39 step := step+1;

40 end loop;

end algorithm.

Algorithm correctness and complexities

We define the output of a pipeline stage Stageh to be the sequence of images popped

out from S(h) by Stageh_1 in the order they are popped out. The output of Stageo

is the sequence of images used by the label-propagate operations in the order they are

popped out from S(0).

We know from the pipeline scheduling that Stageh+1 produces a sequence of m

images forming a full S(h + 1) exactly when Stageh needs that full S(h + 1). For each

image with index Imh+l from the output of Stageh+1, Stageh produces a sequence of m

images whose indices are Imh+l + (m 1)mh, Imh+1 + (m 2)mh .. ,mh+l + mh,

Imh+l, forming a full S(h) for the output of Stageh.

To argue the correctness of the algorithm, we first give a theorem. The proof of

the theorem is a little tedious, but it is easy to understand with the example given next.

79

Stageh converts the output of Stageh+1 a sequence of images whose indices are mh+l

apart in descending order into a sequence of images whose indices are mh apart in

descending order as the output of Stageh.

Theorem 5.4: In the above local labeling algorithm, the output ofStageh is a sequence

of m-h images a(mk-h-)mh, a(mk-h-2)mh, ... almh, ... a0 for 0 < h < k 1.

Proof: We prove the theorem by induction on h.

(i) h = k-1: Stagek-1 starts to work at step 0 with the input image ao popped from S(k)

(in line 16) and pushes almk-i into S(k 1) at step (m 1 l)mk-1 (in line 19),

1 = m-1, m-2,..., 0, producing its only full S(k 1) at step (m 1)mk-1. At this

step Stagek_2 starts to work with that full S(k 1) and pops almk-1 from S(k 1)

at step (m )mk-1 + (m 1 l)mk-1 (in line 16), l = m- 1, m- 2,..., 0. Thus,

the output of Stagek-1 is a(m-1)mk-1, a(m-2)mk-1, . amk-1, ao0

(ii) Induction hypothesis: The theorem is true for h + 1. That is, the output of Stageh+1

is a(,k-(h+l)_l)mh+l, a(mk-(h+l)-2)mh+l, . ., almh+, ... ao.

(iii)We show that the theorem is also true for h. Stageh starts to work

at step (mk-(h+1) )mh+1 and pops almh+i out from S(h +1) at step

(Stageh),tt + (mk-(h+l) 1 l)mh+l (in line 16), 1 = mk-(h+l) -

1, mk-(h+) -2,...,0. Starting at step (Stageh)star + (mk-(h+l) _- 1 )h+1,

Stage shrinks a(mk-(h+l)_ll-)h+1 and pushes a(mk-(h+l)_-_l)h+l +rmh into

S(h) at step (Stageh)a + (mk-(h+l)- 1 )mh+1 + (m 1 r)m (in

line 19), r = m 1,m 2,...,0, giving the /th sequence of m images

almh+l+(m-l)mh,almh+l+(m-2)mh,... ,almh+l for the output of Stageh. Since 1 =

mk-(h+l) 1, mk-(h+) 2,...,1, 0, putting all the mk-(h1+) sequences of m

images, we have the output for Stageh as follows: a(mk-h-_)mh, a(mk-h-2)mh, .. ,

armh, . ., ao.

Q.E.D.

Corollary 5.1: The local labeling algorithm is correct.

Proof: From Theorem 5.4, we know that the output of Stageo is akk_l, amk-2,.. ., ao.

This is exactly the sequence of images the label-propagate procedure should use. Thus,

the algorithm is correct.

Q.E.D.

Theorem 5.5: The space complexity of the local labeling algorithm is 0 (knl/k) and

the time complexity is O(n).

Proof: We have one stack-like data structure with depth of 1 bit and k stack-like data

structures with depth of m bits. Each stack-like data structure has 2 pointers which take

log m bits each. Thus, the stack-like data structures take km + 1 + 2k log m bits in each

PE. The label itself takes 3 log n+ 1 bits. Thus, we need km+1+2k log m+3 log n+1 bits

of local memory per PE. Therefore, the space complexity of the local labeling algorithm

is O(knl/k), since m = (2n)1/k and 1 < k < log (2n).

The time complexity of the algorithm is dominated by the number of the k-bit parallel-

shrink operations, the number of label-propagate operations, and the number of stack-

like data structure push/pop operations. Since Stageo produces its last full S(O) at step

(mk-(0+1) 1)0+1 (k-0 1)m0 = 2mk m- 1, the while loop (lines 8-40)

is executed 2mk m 1 times, which means the number of the k-bit parallel-shrink

operations needed is 2mk m 1. Obviously, the number of label-propagate operations

executed (in lines 23-27) is 2n. Since Stageh pops mk-(h+l) images from S(h + 1) and

pushes mk-h images into S(h) in its working period and the label-propagate procedure

81

pops 2n images from S(O), the number of stack-like data structure push/pop operations

k-l

can be computed as (mk-(h+1) k-h + 2n = (mk 1) + 2n.

h=O

Since the k-bit parallel-shrink operations, label-propagate operations, and stack-like

data structure push/pop operations each take 0(1) time, and 2n = mk, the algorithm

takes O(k) time for the initialization (in lines 1-7), O(n) time for all the k-bit parallel-

shrink operations, O(n) time for all the label-propagate operations, and O(n) time for

all the stack-like data structure push/pop operations. Therefore, the time complexity of

the algorithm is O(n).

Q.E.D.

Corollary 5.2: The local labeling algorithm is an O(n)-time and O(log n)-space

algorithm.

Proof: When k = O(log n), we have O(knl/k) = O(log n). Thus, the local labeling

algorithm is an O(n)-time and 0(log n)-space algorithm.

Q.E.D.

The fast advanced local labeling algorithm we have presented takes O(n) time and

requires 0 (knl/k) bits of local memory per PE to label an n x n image on an n x n

MCC, where k is in the range between 1 and log (2n). When k is 1, the time and

space complexities of our algorithm are basically the same as the first algorithm in [11]

and the algorithm in [1]. When k is O(logn), our algorithm requires O(logn) bits

of local memory per PE, the same as the second algorithm in [11] and the algorithm

in [1]. However, our algorithm only takes O(n) time, whereas their algorithms take

O(nlogn) time. The fast advanced local labeling algorithm gives a positive answer

to the question posed by Alnuweiri and Prasanna [1, 2]. Furthermore, the algorithm

82

uses local operators and involves low communication overhead, having a very small

multiplicative constants in its complexities. The algorithm presented here is the most

efficient algorithm for image component labeling on mesh-connected computers in both

theoretical and practical complexity measures.

An example

We illustrate how our local labeling algorithm works by labeling an 8 x 8 image. If we

use Levialdi's parallel-shrink operator, we have 16 intermediate images ao,al,...,as15.

Choosing k = 5, we have 4 pipeline stages Stage3,..., Stageo and 5 stack-like data

structures S(4),S(3),...,S(0), where S(4) is 1 bit deep and S(3),...,S(0) are 2 bits

deep. In Table 1, we show which image each stage obtains by shrinking a previous image

or moving from the higher stage at each step. Note that the image indices instead of the

images are used in the table. We also show the output of each stage and the content of

each stack-like data structure at each moment. The Sth column gives the image Stageh

works on at each step. The Oh column gives the output of Stageh. The S(h) column

shows the stack-like data structure S(h) at each step. As can be seen, the output of

Stage is the sequence of images whose indices are 15, 14,..., 0.

Table 1. Illustration of labeling an 8 x 8 image

by the fast advanced local labeling algorithm

Step S(4) St3 S(3) 03 St2 S(2) 02 Stl S(1) 01 Sto S(O) Oo

o [] __

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8 80

9 9

10 10

11 11

12 12 128

_[ 12 12 1

13 13 13

14 14 14 1[412

12 14 14

15 15 15 15 1514

15

S14

16 0 0

o 8 8 1

__ 12 12 12

17 1 9 13

12 13

12

18 2 10 l

F 10 10

84

Table 1. (Continued) Illustration of labeling an 8 x 8

image by the fast advanced local labeling algorithm

Step S(4) St3 S(3) 03 St2 S(2) 02 Sti S(1) 01 Sto S(O) Oo

19 3 11 11 0]

1 11

0

20 4 40

[F_ 4 4 8 F4

ff]8 8

21 5 9

F 9

M8

22 6 6

M 6 6 6

23 7 7

[ 7

6

24 0 0 [40

_4 4

25 1 5 ax

[A:M 5

M 4

26 2 [2To1]

262 2

27 3

1 I2 3

M j2

28 0 0

29 1 1

LN1

FM ol

85

5.4 Computing Properties of Image Components

Computing geometric properties of each connected component in an image is an

important step toward high-level analysis of the image [13, 68, 22]. Some algorithms for

computing geometric properties of image components on MCCs have been proposed [16,

49]. Dyer and Rosenfeld [16] have proposed MCC algorithms for computing geometric

properties of image components such as area, perimeter, compactness, height, width,

diameter, and moments. However, their algorithms take O(n2) time in the worst case.

Miller and Stout [49] have proposed O(n) time MCC algorithms for determining the

convex hull of each image component, for determining if each component is convex, for

computing the distance to the nearest neighbor component, for computing the internal

diameter and external diameter, etc. However, their algorithms use complex global

operations such as random access read and random access write, leading to large

multiplicative factors in their time complexities. Furthermore, they have not discussed

computing area, perimeter, height, width, and moments of each image component.

We have developed a general algorithm to compute geometric properties of each

image component in an image on MCCs [76]. The geometric properties to compute are

area, perimeter, compactness, height, width, diameter, moments, and centroid. Each of

the properties can be computed by the image-template product with a specific component

template. After the general algorithm for the image-template product with component

template is implemented, we only need to design a component template to compute

each property. The general algorithm for the image-template product with component

template on MCCs is modified from the stack-based image component labeling algorithm

and is discussed in detail in this section. It has a small multiplicative factor in its time

complexity and is practical, since the algorithm uses only local operators.

86

5.4.1 Algorithm for Image-Template Product with Component Template

We modify the stack-based algorithm for binary image component labeling to com-

pute the image-template product with component template on MCCs. During the process

of shrinking each connected component to a representative pixel, we compute the result

of the product for pixels in the component. In the propagating process, we propagate the

result stored in the representative pixel of each connected component to all the pixels

of the connected component.

Let a E FX be a source image. Suppose that t(c) E (FX)x is a component template

with respect to a binary image c. To compute b = a@t(c) E FX, we use Levialdi's

parallel-shrink operator PtL to shrink binary image c to obtain a sequence of images

co,Cl,... ,C2n-1 with co = c. We correspond an image ar to each image c, with

ao = a o w, where w is the weight image. As we know, Vu, shrinks each component

toward the upper left corer of its bounding rectangle. Thus, when a black pixel in Cr-1

is changed to white in c, by applying a parallel-shrink operation to c,-1, we accumulate

the value of that pixel in a,-1 to one of its left, upper left, and upper neighbors in ar using

the 7 operation. When a pixel in Cr is a representative pixel, the corresponding pixel

in a, will have the image-template product result for all the pixels in the corresponding

component. We propagate the result to all the pixels in the connected component in the

propagating process.

Suppose that the black pixel of cr-1 at x disappears in c,. If the black pixel is not an

isolated pixel in c,, we try to find a black neighbor from the neighbors in the order : the

left neighbor, the upper left neighbor, and the upper neighbor. We then accumulate the

intermediate result of a,-1 at x to the intermediate result of the found black neighbor in

a,r. To use image algebra to describe the process of computing ar from a,-1, we design

87

a parameterized template u(cr-, Cr) E (FX)X as follows:

S1 if cr(y) = 1 and x = y

10 if cr(y) = 1 and x = y + (0,1) and

Cr-i(x) = 1 and cr(x) = 0

10 ifcr(y) = 1 andx=y+(1,1)and

SU(C y Cr-l(x) = 1 and Cr(x) = 0 and

u(cr-1r)y(x)= cr(x- (0,1)) = 0

10 if cr(y) = 1 and x = y + (1,0) and

Cr- (x) = 1 and c,(x) = 0 and

cr(x (0, 1)) = 0 and cr(x (1, 1)) = 0

00 otherwise,

where 10 is the identity of the o operation and 00 is the zero element of the o operation.

The image ar can be simply computed by

ar := ar-1 )u(cr-1, Cr).

Note that the 7 operation here is the same 7 operation in b = a (t(c), but the o operation

here may be different from the o operation in b = a t(c). Since the template u is a

local template and only uses information in a 2 x 2 neighborhood, computing ar from

ar-1 can be done in 0(1) time on MCCs.

Let br be the image holding the results corresponding to the black pixels of cr. The

pixel values of br corresponding to the white pixels of Cr are 0. The image b = a Ot

will be bo V (1 co) 1y, where 1. is a constant image whose pixel values are the identity

element of the 7 operation, and the V is a bitwise logic or operation. To obtain br from

br+l, we first propagate the results corresponding to the black pixels of cr+l to their

black neighbors of c,, and then add the values corresponding to all the representative

pixels appearing in Cr to br. The first step can be done by

br := (br+l l Pi) Cr,

where pi is the template shown in Figure 38. Note that since the pixels in a connected

component should have the same result, the 1] operation is used. To do the second

88

step, we design a template v as shown in Figure 48 and use an image-template operation

with that template to extract all the representative pixels in Cr. Thus, the second step

can be done by

br := br [1 (cr i v)] + (cr [~ v) ar.

Therefore, to compute br from br+l, we can use the following IA expression:

br := (br+1 pi) [Cr (cr v)] + (Cr V) *ar.

It is easy to see that computing br from br+i can be implemented in 0(1) time on

MCCs.

0o 0 0

v = 0 1 0

0 0 0

Figure 48. The template for isolation of representative pixels

Let 2n = mk for any integer k with 1 < k < log(2n). We use k stacks

So, Si,..., Sk-1, each having capacity of storing m pairs of images. Stack Sq stores

each pair of images (cr, ar) with index r divisible by mq. The algorithm to compute

b = a t(c) with t(c) being a component template with respect to a binary image c

is as follows:

Algorithm Image-template_productwith_component_tempalate

/* Initialize */

b := 0;

a := aow;

for h := 0 to k 1 loop

set_empty(Sh);

q := k;

loop

for h := q -1 down to 0 loop

/* Compute mh(m 1) image pairs and push m pairs into Sh */

push((c, a), Sh);

for r := 1 to mh+ mh loop

if r mod mh = 0 then push((c, a), Sh);

c' := c;

c := p ((c'); /* Shrink components of c */

a := aOu(c', c) /* Compute a corresponding to c */

end loop;

if h # 0 then (c,a) := pop(Sh);

end loop;

/* Propagate component results m times */

for r := 1 to m loop

(c,a) := pop(So);

b := (b M pi) [c (c [l v)] + (c v) a;

end loop;

/* Find the highest nonempty stack */

q := 1;

while q < k and empty(Sq) loop

q := q+1;

end loop;

exit when q = k;

(c,a) := pop(Sq);

end loop;

end algorithm.

Following the analysis of the stack-based algorithm, we claim that the above algo-

rithm requires O (bknl/k) bits of local memory per PE and takes O(kn) time on MCCs,

90

where b is the number of bits in each pixel value of the result image and k is any integer

between 1 and log(2n).

5.4.2 Computing Area, Perimeter, and Compactness of Each Image Component

The area of a component in a binary image is defined to be the number of pixels in

the component. To compute the area of each component in a binary image a, we design

a component template t(a) with respect to a as follows:

t(a) (x) = {1 if x E Ca(y)

t(ay(X= otherwise.

Let image b be the image storing the area of each component in each pixel of the

component. We can compute b as follows:

b := aOt(a),

since we have

b(y)= = a(x) t(a)y(x)

xeCa(y)

xECa(y)

= card(Ca(y)).

The perimeter of a component in a binary image can be defined as the number of

pixels in its boundary. We assume that each component of the source binary image a

forms a digital manifold. To compute the perimeter of each component, we need to set

all the interior black pixels to white. If the template v is defined as shown in Figure 49,

then a A (a CO v) is the image with interior black pixels being set white. Let c be the

image storing the perimeter of each component in each pixel of the component. Using

the same component template t(a), we can compute c as follows:

c := [a A (a CZ v)] t(a).

1

Figure 49. The template for identifying the boundary pixels of each component

The compactness of a component in a binary image is usually measured by a/p2,

where a is the area of the component and p is the perimeter of the component. The

compactness of a component is one of the measures for quantifying the shape of the

component. Let d be the image storing the compactness of each component in each pixel

of the component. After b and c are computed, we can compute d as follows:

d := b/c2

Note that the / is a pseudo division with r/O = 0, where r is any number.

Therefore, we can use the general algorithm for the image-template product with

component template on MCCs to compute the area, perimeter, compactness of each

image component on MCCs. We have the following theorem.

Theorem 5.6: There is an O(kn) time algorithm to compute the area, perimeter, and

compactness of each component in a binary image on MCCs with O (knl/k log n) bits of

local memory, where k is any integer between 1 and log (2n).

5.4.3 Computing Height, Width, and Diameter of Each Image Component

The height of a component in a binary image is the distance between the highest row

and the lowest row occupied by the component. Likewise, the width of a component is

the distance between the leftmost and rightmost columns. To compute the height of each

92

component in a, we define a component template ti(a) with respect to a as follows:

ta ( i, j) i if (i,j) E Ca(y)

tl(a)y0 otherwise.

The template tl(a) is intended to set each black pixel its row coordinate. Let b be the

image storing the height of each component in each pixel of the component. We can

compute b as follows:

b := (a@tl(a))- (a@ti(a)),

since we have

b(y) = m (y) a(x) t(a)y(x) -i () a(x) t(a)y(x)}

= max {i} mm {i}.

(ij)6Ca(y) (itj)EC,(y)

Similarly, to compute the image c storing the width of each component in each pixel of

the component, we define a component template t2(a) with respect to a as follows:

t2(a) (ij) = j if (i,j) E Ca(y)

{0 otherwise.

The image c can be computed as follows:

c := (aOt2(a)) (a@t2(a)).

The diameter of a component in a binary image is the maximum distance between any

two black pixels of the component. Since we assume 8-connectivity, we use chessboard

distance here. It is shown in [16] that the diameter of a component is equal to the length

of the longest side of its bounding rectangle. Thus, after we have the length and width of

a component, we can simply compute its diameter by taking the maximum of its length

and width. Let d be the image storing the diameter of each component in each pixel of

the component. We can compute d as follows:

d := bVc.

93

Therefore, we can apply the general algorithm to compute the height, width, and

diameter of each image component. We have the following theorem.

Theorem 5.7: There is an O(kn) time algorithm to compute the height, width, and

diameter of each component in a binary image on MCCs with O (kn/k log n) bits of

local memory, where k can be any integer between 1 and log (2n).

5.4.4 Computing Moments and Centroid of Each Image Component

Moments and centroid of a component are useful as measures of location and shape

of the component. Various kinds of moments such as basic moments, central moments,

normalized moments, and invariant moments have been defined to describe the shape of

a component [25, 22]. However, other moments are defined on basic moments. Thus,

we only discuss how to compute basic moments of each component. Suppose that a is

a gray level image whose pixels form a group of components and the components are

represented by the black pixels in the binary image c. A moment of order p + q of a

component whose pixel locations form a set C is defined as mpq = E iPjqa(i,j).

(i,j)EC

The centroid (, J) of the component is defined by i = and j =

Let mpq be the image storing a moment of order p + q of each component in each

pixel of the component. Let (i, be the image storing the centroid of each component

in each pixel of the component. Defining a component template tpq(c) with respect to

c as follows:

tPj(c if (i, j) E C(y)

tpq(C)y(ij) 0 otherwise,

we can compute mpq and (1,) as follows:

mpq := aDtpq (c)

and

S:= mio/moo

j := mo0/moo.

Thus, we can apply the general algorithm to compute moments and centroid of each

component in an image. We have the following result.

Theorem 5.8: There is an O(kn) time algorithm to compute moments and centroid

of each component in an image on MCCs with O(knl/kb) bits of local space, where b

is the number of bits in each pixel value of the result image and k is an integer between

1 and log (2n).