
Parallel computations on reconfigurable meshes with buses


Material Information

Title:
Parallel computations on reconfigurable meshes with buses
Physical Description:
vii, 122 leaves : ill. ; 29 cm.
Language:
English
Creator:
Nigam, Madhusudan
Publication Date:

Subjects

Subjects / Keywords:
Computer architecture   ( lcsh )
Parallel processing (Electronic computers)   ( lcsh )
Computer and Information Sciences thesis Ph. D
Dissertations, Academic -- Computer and Information Sciences -- UF
Genre:
bibliography   ( marcgt )
non-fiction   ( marcgt )

Notes

Thesis:
Thesis (Ph. D.)--University of Florida, 1992.
Bibliography:
Includes bibliographical references (leaves 106-121).
Statement of Responsibility:
by Madhusudan Nigam.
General Note:
Typescript.
General Note:
Vita.

Record Information

Source Institution:
University of Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
aleph - 029759069
oclc - 29374015
System ID:
AA00017678:00001


Table of Contents
    Title Page
    Acknowledgement
    Table of Contents
    Abstract
    1. Parallel mesh connected computer architectures
    2. Sorting n numbers on n x n reconfigurable meshes with buses
    3. Sorting n^2 numbers on n x n meshes
    4. Computational geometry on n x n reconfigurable meshes with buses
    5. Conclusions
    References
    Biographical sketch
Full Text










PARALLEL COMPUTATIONS ON RECONFIGURABLE MESHES
WITH BUSES










BY

MADHUSUDAN NIGAM


A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA


1992




























Copyright 1992

by

Madhusudan Nigam

All rights reserved














ACKNOWLEDGEMENTS

I express my deep gratitude to my advisor, Dr. Sartaj Sahni, for his continuous guidance and support during my course of studies. I am fortunate to have him as my mentor. It has been a privilege and a pleasure to work and to do research with him.

I also thank my dissertation committee members, Dr. Gerhard X. Ritter, Dr. Nabil N. Kamel, Dr. Ravi Varadarajan, and Dr. Haniph A. Latchman.

My work at the University of Florida was partially supported by the Computer and Information Sciences Department and the National Science Foundation under grants MIP-8617374 and MIP-9103379. This support is greatly appreciated.

I thank all members of my family for their love, support, and patience during my studies. In particular, I thank my mother, Ram Dulari, for her support during the years 1991-92. I thank my father, Lallan Ram, who inspired me to pursue this goal. I also thank my younger brothers Rakesh, Rajesh, and Mukesh and my sisters Meera and Sunita.

I would like to thank my associates at the University of Florida, Vishal Walia, Venkatramanan Thanvantri, Madhu Acharya, and Balaji Venkatraman, for their help with the typesetting software during the production of this dissertation.

Finally, my deep appreciation goes to my grandparents, the late Mr. and Mrs. J. P. Nigam, for being a source of strength and inspiration during difficult times. To them I dedicate this dissertation.














TABLE OF CONTENTS

ACKNOWLEDGEMENTS

ABSTRACT

CHAPTERS

1 PARALLEL MESH CONNECTED COMPUTER ARCHITECTURES
    1.1 Introduction
    1.2 Mesh Connected Computer
    1.3 Mesh of Trees
    1.4 Pyramid Computer
    1.5 Hypercube
    1.6 Modified MCC
    1.7 Reconfigurable Meshes with Buses
    1.8 Comparison of RMB with Other Models
    1.9 Dissertation Outline

2 SORTING n NUMBERS ON n x n RECONFIGURABLE MESHES WITH BUSES
    2.1 Introduction
    2.2 Column Sort
    2.3 Column Sort on an RMESH
        2.3.1 Ranking
        2.3.2 Analysis of RMESH Column Sort
    2.4 Column Sort on a PARBUS
    2.5 Rotate Sort
    2.6 Rotate Sort on an RMESH
    2.7 Rotate Sort on a PARBUS
    2.8 A Combined Sort
    2.9 Comparison with Other O(1) PARBUS Sorts
    2.10 Sorting on Higher Dimension RMBs
    2.11 Conclusions

3 SORTING n^2 NUMBERS ON n x n MESHES
    3.1 Introduction
    3.2 Modified Schnorr and Shamir Algorithm
    3.3 Further Enhancements
    3.4 Practical Implications
    3.5 Simulation on RMBs
    3.6 Conclusions

4 COMPUTATIONAL GEOMETRY ON N x N RECONFIGURABLE MESHES WITH BUSES
    4.1 Introduction
    4.2 Convex Hull
        4.2.1 N^3 Processor Algorithm
        4.2.2 N^2 Processor Algorithm
    4.3 Smallest Enclosing Rectangle
    4.4 ECDF Search on an N x N RMESH
    4.5 Triangulation
    4.6 Conclusions

5 CONCLUSIONS
    5.1 Summary
    5.2 Future Work

REFERENCES

BIOGRAPHICAL SKETCH














Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

PARALLEL COMPUTATIONS ON RECONFIGURABLE MESHES
WITH BUSES

By

Madhusudan Nigam

December 1992

Chairman: Dr. Sartaj Sahni
Major Department: Computer and Information Sciences

We consider different reconfigurable mesh with buses architectures, such as the PARBUS, RMESH, and MRN, for parallel computing. We compare reconfigurable meshes with buses with other parallel computing models such as the PRAM, fixed topology interconnection networks, and meshes with static buses.

We consider sorting n numbers on n x n reconfigurable meshes with buses. We develop simple algorithms of optimal area x time^2 complexity for all reconfigurable mesh with buses architectures (PARBUS/RMESH/MRN). We extend our algorithms to higher dimension reconfigurable meshes with buses. We show that it is faster to sort than to select the kth element using Lin's algorithm on a PARBUS.

We show that the lower bound established by Schnorr and Shamir for the number of routes needed to sort n^2 numbers on an n x n bidirectional mesh is incorrect. We provide algorithms that are able to sort using a number of routes that is very close to the distance lower bound for bidirectional as well as strict unidirectional meshes. We point out the practical advantages of our algorithms when applied to mesh connected computers with tR >> 2 tC. We show how to simulate the mesh algorithms on reconfigurable meshes with buses.

We develop O(1) time algorithms to determine the convex hull of N planar points, find the smallest enclosing rectangle of a set of N planar points, triangulate a set of N planar points, and solve the ECDF search problem for N planar points on N x N reconfigurable meshes with buses. We also develop an O(1) time algorithm to determine the Voronoi diagram of N planar points on N^3 x N reconfigurable meshes with buses. We show that Reisis's algorithm for determining the convex hull of N planar points is incorrect. Our algorithms work on all N x N reconfigurable meshes with buses (PARBUS/RMESH/MRN).














CHAPTER 1

PARALLEL MESH CONNECTED COMPUTER ARCHITECTURES

Introduction

Most control-flow based parallel computational models can be classified as either the Shared Memory Model (SMM) or the Distributed Memory Model (DMM). In the SMM, all processor elements (PEs) share a global memory and communicate with each other via the shared memory; each PE has its own local control and local memory. Each processor is allowed to access an arbitrary location in shared memory subject to the allowed read/write modes. In an exclusive-read, exclusive-write (EREW) model, at most one PE can read from or write to a location in the shared memory in a cycle. In a concurrent-read, exclusive-write (CREW) model, multiple PEs can read from the same memory location, but only one PE can write to a location in the shared memory. In a concurrent-write (CW) model, multiple writes to a single location in memory are allowed; the value written is resolved by a write conflict arbitration scheme. The PRAM (Parallel Random Access Machine) characterizes the SMM. Karp [KARP90] lists parallel algorithms for many applications on the SMM.

There is no global memory to share in a DMM; interprocessor communication is done via an interconnection network. In this model each PE has its own local memory and/or blocks of distributed memory modules available to PEs through an alignment network connecting PEs with memory modules. In a DMM, there is a delay associated with remote memory access through the network. The interprocessor network is usually fixed at design time, and all data routing and communication between PEs is dependent on the capabilities of the network. The computation time on a DMM depends on the diameter of the communication network, i.e., the maximum time to communicate and/or route data between any two PEs.

There are various architectural classification schemes for parallel computational models [HWAN84]. Flynn's classification [FLYN66] is based on the multiplicity of instruction streams and data streams on a multiprocessor system. Figure 1.1 gives a block diagram for an SIMD (Single Instruction Multiple Data) multiprocessor computer architecture. While many interconnection networks [HWAN84], [FENG81], [ULLM84] have been proposed, the two-dimensional mesh, the hypercube, the mesh of trees, and the pyramid have emerged as the dominant ones for many practical applications. Figures 1.2-1.7 give examples of these interconnection networks.

Mesh Connected Computer

In the Mesh Connected Computer (MCC) model, the PEs can be thought of as logically arranged in a k-dimensional array A(n_{k-1}, n_{k-2}, ..., n_0), where n_i is the size of the i'th dimension and the number of PEs is p = n_{k-1} * n_{k-2} * ... * n_0. The PE at location A(i_{k-1}, ..., i_0) is connected to the PEs at locations A(i_{k-1}, ..., i_j ± 1, ..., i_0), 0 <= j < k, provided they exist. Each PE is thus connected to up to 2k neighbors. It is assumed that a PE may transmit data to any of its nearest neighbors in unit time.

Figure 1.1 Block diagram of an SIMD Computer
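To make the neighbor rule concrete, the following sketch (ours, in Python; the function name and tuple representation are illustrative, not from the text) enumerates the neighbors of a PE in a k-dimensional mesh:

    def mcc_neighbors(index, dims):
        # Neighbors of PE `index` in a k-dimensional mesh A(n_{k-1}, ..., n_0).
        # `index` = (i_{k-1}, ..., i_0); `dims` = (n_{k-1}, ..., n_0).  A PE has
        # up to 2k neighbors, one coordinate changed by +-1, with no wrap-around.
        neighbors = []
        for j in range(len(index)):
            for delta in (-1, 1):
                coord = index[j] + delta
                if 0 <= coord < dims[j]:          # the neighbor must exist
                    neighbors.append(index[:j] + (coord,) + index[j+1:])
        return neighbors

    print(mcc_neighbors((1, 2), (4, 4)))   # interior PE of a 4 x 4 mesh: 4 neighbors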

Data may be transmitted from one PE to another via this interconnection pattern. The interconnection scheme for a 16 PE MCC with k = 2 is given in Figure 1.2. Some simple variations of the mesh include end-around connections. Figure 1.3 shows an MCC with wrap-around connections: the right end of each row connects to the left end of the same row, and the bottom of each column connects to its top. Figure 1.4 shows an MCC with toroidal connections: the right end of a row connects to the left end of the next row, and the bottom of a column connects to the top of the next column. The ILLIAC-IV, MPP, and DAP are some examples of mesh connected computers.






The regular structure and nearest neighbor (local) connectivity make the MCC suitable for VLSI implementation. The fraction of the area occupied by interconnect wires in an MCC is fixed regardless of the size N. The MCC is a natural structure for solving many problems in matrix computation and image processing. In parallel and distributed processing, computations are usually constrained by information flow and data movement operations rather than by processing time at the PEs. The MCC is not suitable for computations like sorting, where a single piece of data may have to be moved from one corner to the opposite corner. In an N^{1/2} x N^{1/2} 2-MCC, the worst case data movement may take as much as Ω(N^{1/2}) time. This deficiency, the large diameter, is the main shortcoming of MCCs: most problems involving communication over long distances have a lower bound on computation time at least equal to the diameter of the mesh. Dekel et al. [DEKE81] give parallel algorithms for matrix and graph problems on the MCC and other models. The studies [BAUD78], [KUMA83], [MARB88], [NASS79], [SCHE86], [SCHER89], [THOMK77], [PARK87,90] give sorting algorithms on the MCC, and [NASS80] gives an optimal routing algorithm on the MCC. The studies [HWAN84], [MILLS84a,88de,89], [STOU82], and [LEIG92] list MCC algorithms for many applications.






























Figure 1.2 A 4 x 4 Mesh Connected Computer

Mesh of Trees

The Mesh of Trees (MOT) is a hybrid network based on meshes and trees. The MOT has a smaller diameter and is faster than an MCC with the same number of nodes, and it is area efficient in terms of VLSI complexity [THOM79,80], [ULLM84]. The MOT is defined in [LEIG81], [LEIG84], [CAPE83], [NATH83]. The N x N mesh of trees is constructed from an N x N grid of processors by adding processors and wires to form a complete binary tree in each row and each column. The leaves of the trees are the N^2 grid nodes, and the added nodes are the internal nodes of the trees. The network has 3N^2 - 2N processors. The leaf and root processors have degree two and all others have degree three. The N x N MOT has diameter 4 log2(N). Figure 1.5 shows a 4 x 4 MOT.
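The processor count follows directly from the construction; a short check (ours) of the 3N^2 - 2N figure:

    def mot_size(N):
        # An N x N mesh of trees: the N^2 grid PEs serve as the leaves of 2N
        # complete binary trees (one per row, one per column), each adding
        # N - 1 internal nodes: N^2 + 2N(N-1) = 3N^2 - 2N processors in all.
        return N * N + 2 * N * (N - 1)

    print(mot_size(4))   # 40 processors for the 4 x 4 MOT of Figure 1.5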




Figure 1.3 Mesh with wrap-around connections

Figure 1.4 Mesh with toroidal connections




The two-dimensional mesh of trees has a very natural and regular structure. There are variations of the MOT, such as additional sets of trees along the diagonals of the mesh, or connecting the original grid nodes with mesh edges. A reduced mesh of trees has 1/3 as many processors as the usual mesh of trees. Meshes of trees can also be defined in higher dimensions. The MOT offers dramatic speedups over meshes with the same number of processors; many problems can be solved in polylogarithmic time on a polynomial size MOT (class NC). The MOT is area universal, i.e., it can simulate any other network with the same VLSI wire area with only a polylogarithmic factor slowdown. Nath et al. [NATH83] give area efficient algorithms for matrix computation, sorting, FFT, and graph problems. Other studies [ALNU86], [ALNU87], [ALNU88], [PRAS86] describe mesh of trees algorithms for image processing, graph problems, and geometric problems. Leighton [LEIG92] lists mesh of trees algorithms for many applications.

Figure 1.5 A 4 x 4 Mesh of Trees

Pyramid Computer

The Pyramid Computer (PC) is a combination of tree and mesh structures. A pyramid computer of size N has an N^{1/2} x N^{1/2} mesh connected computer as its base (level 0) and log4(N) levels of mesh connected computers above it. A PE at level k is connected to 4 siblings at level k, 4 children at level k-1, and a parent at level k+1. Figure 1.6 shows a PC with 16 PEs at the base. The apex is the topmost level and has exactly one PE. If some level of the PC is a 2^k x 2^k mesh, then the level above it is a 2^{k-1} x 2^{k-1} mesh and the level below it is a 2^{k+1} x 2^{k+1} mesh. Let (i,j,k) represent the PE in position (i,j) at level k; then it is connected to the PEs (i±1,j,k), (i,j±1,k), (i div 2, j div 2, k+1), (2i,2j,k-1), (2i+1,2j,k-1), (2i,2j+1,k-1), and (2i+1,2j+1,k-1), if they exist. Cantoni and Levialdi [CANT86] have proposed variations of the PC for computer vision applications.
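The connectivity rule can be stated compactly in code. The sketch below (ours; it uses the level-size convention defined above, with the base at level 0) lists the PEs adjacent to (i,j,k):

    def pyramid_neighbors(i, j, k, base_side):
        # Level 0 (the base) is base_side x base_side; level k is a
        # (base_side/2^k) x (base_side/2^k) mesh.  A PE connects to its mesh
        # neighbors on its own level, its parent one level up, and its four
        # children one level down, whenever those PEs exist.
        apex_level = base_side.bit_length() - 1      # base_side = 2^apex_level
        side = base_side >> k
        nbrs = []
        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):   # siblings at level k
            if 0 <= i + di < side and 0 <= j + dj < side:
                nbrs.append((i + di, j + dj, k))
        if k + 1 <= apex_level:                      # parent at level k+1
            nbrs.append((i // 2, j // 2, k + 1))
        if k >= 1:                                   # children at level k-1
            nbrs += [(2*i, 2*j, k-1), (2*i+1, 2*j, k-1),
                     (2*i, 2*j+1, k-1), (2*i+1, 2*j+1, k-1)]
        return nbrs

    print(pyramid_neighbors(0, 0, 1, 4))   # a level-1 PE of the pyramid in Figure 1.6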

A PC with a base of size N^2 has (4N^2 - 1)/3 PEs and a diameter of O(log2(N)). The PC is not as powerful as the MOT: there are many problems which admit class NC algorithms on an N^2-MOT but require on the order of N steps on an N x N PC. The studies [TANI82], [TANI84], [MILLS84b,87a], [CHAN88] give algorithms for image processing, data movement, and geometric problems on the PC.


Figure 1.6 Pyramid Computer with base 16 (4^2) nodes


Hypercube

A k-dimensional hypercube consists of N = 2^k nodes, labeled 0, 1, ..., 2^k - 1. Two nodes in a hypercube are connected if their labels differ in exactly one bit position in binary representation. The degree of each node is log2(N). The diameter of an N PE hypercube is also log2(N). Figure 1.7 shows an 8 PE hypercube. The hypercube has become a popular architecture because of its versatility and efficiency. It can be used both for general and special purpose applications, and it can efficiently simulate any other network of the same size; in particular, it can simulate meshes, binary trees, and meshes of trees efficiently. The Ncube and the iPSC are examples of commercially available hypercubes.
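The label rule translates directly into code; a small sketch (ours) for adjacency and neighbor enumeration:

    def hypercube_connected(x, y):
        # Adjacent iff the labels differ in exactly one bit position.
        diff = x ^ y
        return diff != 0 and diff & (diff - 1) == 0   # exactly one bit set

    def hypercube_neighbors(x, k):
        # The k = log2(N) neighbors of node x in a 2^k node hypercube.
        return [x ^ (1 << b) for b in range(k)]

    print(hypercube_neighbors(0b000, 3))       # [1, 2, 4]: nodes 001, 010, 100
    print(hypercube_connected(0b110, 0b111))   # True (labels differ in bit 0)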






Since the degree of a node in a hypercube is log2(N), the number of connections per PE grows logarithmically with the size N of the network. This can be a problem during physical layout for VLSI implementation of large hypercubes. There are bounded-degree derivatives of the hypercube such as the butterfly, shuffle-exchange graphs, cube-connected cycles, the Benes network, and de Bruijn graphs [ULLM84]. The studies [RANK91], [LEIG92], [MILLS87c], [STOJ86] list hypercube algorithms for many application areas including image processing and computational geometry.



Figure 1.7 Hypercube with 8 (2^3) nodes

Modified MCC

Several modifications to the MCC have been proposed to reduce the worst case PE-to-PE communication time. These include attaching trees (MOT), embedding one or more global meshes within the MCC architecture, the pyramid computer, and attaching buses to processors. Carlson [CARL85], [CARL86] explores embedding one or more global meshes in an MCC. This modification improves the asymptotic time complexity for computations with low to medium interprocessor communication, such as tree computations, prefix computations, and finding the connected components of a graph. There is no improvement for computations requiring high interprocessor communication, such as sorting and computing a Fourier transform.

Another modification suggested for the MCC is adding one or more broadcast buses. A broadcast bus connects PEs together so that one PE can transmit data to all other attached PEs in unit time; it is assumed that only one PE can broadcast at any time. Abelson [ABEL80] and Gentleman [GENT78] have analyzed problems where the worst case solution is constrained by data movement. Jordan [JORD78,79] and Gentleman [GENT78] were the first to consider supplementing the MCC with a global bus. Bokhari [BOKH81,84] and Stout [STOU83] consider finding the maximum, semigroup operations, and finding the median on an MCC with a single bus. They report asymptotic improvements in time complexity over the MCC. There is no improvement for computations requiring high interprocessor communication, such as sorting.

Prasanna Kumar [PRAS87ab], [PRAS89], Stout [STOU86], and Chen [CHEN89,90] consider augmenting the MCC with a bus for each row and column. Figure 1.8 shows such an MCC with multiple buses. Aggarwal [AGGA86] considers augmenting a d-dimensional MCC with k global buses for finding the maximum and for matrix multiplication. The studies [PRAS87b], [PRAS89], and [CHEN89,90] consider semigroup computations, linear algebra, numerical computations, image processing, and geometric problems on MCCs with multiple buses. They obtain improvements in asymptotic time complexity over the MCC with a single bus. In all these models the bus configuration is static, i.e., it is fixed over the course of a computation.



Figure 1.8 A 4 x 4 MCC with Row and Column Broadcasting

Reconfigurable Meshes with Buses

Several authors have considered various reconfigurable parallel computational models and architectures. These include the Bus Automata (BA) of Rothstein [ROTH76], the Configurable Highly Parallel Computer (CHiP) of Snyder [SYND82], the Reconfigurable Mesh with Buses (RMESH) of Miller et al. [MILLP87b,88abc], the Content Addressable Array Processor (CAAP) of Weems et al. [WEEM87,89], the polymorphic torus of Li and Maresca [LI89ab, MARE89], the processor array with a reconfigurable bus system (PARBUS) of Wang and Chen [WANG90ab], and the reconfigurable network (RN) of Ben-Asher et al. [BENA91].






The CHiP [SYND82] consists of homogeneous PEs located at grid points and interposed with programmable switches. Each PE and switch has local memory. PEs store program and data, while each switch stores a fixed number of allowed interconnection patterns. The switches are under the central global control of the program running the machine, and various interconnections between processors are accomplished by using different stored patterns. The CHiP architecture differs from other reconfigurable meshes in two ways. First, the number of connection patterns between PEs is constrained by the limited switch settings stored in the local memory. Second, the switch settings are under central control rather than under the distributed control of the PEs. Several issues related to Wafer Scale Integration of the CHiP are discussed in [HEDL82]. Gannon [GANN81] discusses solutions to numerical and signal processing problems on the CHiP.

Bus Automata (BA) [ROTH76] can be considered as cellular automata (arrays of automata with interconnections between any one automaton and a fixed, bounded number of neighbors) with a locally switchable global communication network. Each cell (PE) can connect a subset of its input lines (links) directly to some subset of its output links through internal channels. By alternating links and channels, "buses" are set up which can, in principle, establish direct communication from any particular cell to any desired subset of cells. Communication on a bus is assumed to take unit time. The works of Rothstein et al. [ROTH76,77ab,78,79], [MOSH79], [CHAM87] focus on developing O(1) algorithms for problems in the areas of parsing and formal languages, pattern recognition, neural modeling, and cryptographic methods.






Other mesh-like architectures with reconfigurable buses are the Content Addressable Array Processor (CAAP) of Weems et al. [WEEM87,89], the polymorphic torus of Li and Maresca [LI89ab, MARE89], the Reconfigurable Mesh with Buses (RMESH) of Miller et al. [MILLP87b, 88abc], the processor array with a reconfigurable bus system (PARBUS) of Wang and Chen [WANG90ab], and the reconfigurable network (RN) of Ben-Asher et al. [BENA91]. We refer to all these models as Reconfigurable Meshes with Buses (RMBs).

The CAAP [WEEM87,89] and RMESH [MILLP87b, MILLP88abc] architectures appear to be quite similar, so we shall describe the RMESH only. In this architecture, we have a bus grid with an n x n arrangement of processors at the grid points (see Figure 1.9 for a 4 x 4 RMESH). Each grid segment has a switch on it which enables one to break the bus, if desired, at that point. When all switches are closed, all n^2 processors are connected by the grid bus. The switches around a processor can be set using local information. If all processors disconnect the switch on their north, then we obtain row buses (Figure 1.10). Column buses are obtained by having each processor disconnect the switch on its east (Figure 1.11). In the exclusive write model, two processors that are on the same bus cannot simultaneously write to that bus. In the concurrent write model, several processors may simultaneously write to the same bus; rules are provided to determine which of the several writers actually succeeds (e.g., arbitrary, maximum, exclusive or, etc.). Notice that in the RMESH model it is not possible to simultaneously have n disjoint row buses and n disjoint column buses that, respectively, span the width and height of the RMESH. It is assumed that processors on the same bus can communicate in O(1) time. RMESH algorithms for fundamental data movement operations and image processing problems can be found in [MILLP88abc, MILLP91ab, JENQ91abc].
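How local switch settings induce global buses can be modeled in a few lines. The sketch below (ours; an illustrative model, not an algorithm from the text) treats each PE as owning the switches on its north and east segments and computes the resulting buses as connected components:

    from collections import defaultdict

    def rmesh_buses(n, north_open, east_open):
        # Buses of an n x n RMESH: north_open(i, j) / east_open(i, j) say
        # whether PE (i, j) has opened the switch on the segment to its
        # north / east.  Closed segments merge the PEs they join (union-find).
        parent = list(range(n * n))
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x
        for i in range(n):
            for j in range(n):
                if i > 0 and not north_open(i, j):      # segment to PE (i-1, j)
                    parent[find(i * n + j)] = find((i - 1) * n + j)
                if j + 1 < n and not east_open(i, j):   # segment to PE (i, j+1)
                    parent[find(i * n + j)] = find(i * n + j + 1)
        buses = defaultdict(list)
        for p in range(n * n):
            buses[find(p)].append(divmod(p, n))
        return list(buses.values())

    # All north switches open, all east switches closed:
    # n disjoint row buses, as in Figure 1.10.
    print(rmesh_buses(4, lambda i, j: True, lambda i, j: False))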

An n x n PARBUS (Figure 1.12) [WANG90ab] is an n x n mesh in which the interprocessor links are bus segments and each processor has the ability to connect together arbitrary subsets of the four bus segments that connect to it. Bus segments that get so connected behave like a single bus. The bus segment interconnections at a processor are done by an internal four-port switch. If the up to four bus segments at a processor are labeled N (North), E (East), W (West), and S (South), then this switch is able to realize any set A = {A_1, A_2} of connections, where A_i ⊆ {N,E,W,S}, 1 <= i <= 2, and the A_i's are disjoint. For example, A = {{N,S}, {E,W}} results in connecting the North and South segments together and the East and West segments together. If this is done in each processor, then we get, simultaneously, disjoint row and column buses (Figures 1.13 and 1.14). If A = {{N,S,E,W}}, then all four bus segments are connected. PARBUS algorithms for a variety of applications can be found in [MILLP91ab, WANG90ab, LIN92, JANG92abcde, OLAR92, ELGI91, LI91ab]. Observe that in an RMESH the realizable connections are of the form A = {A_1}, A_1 ⊆ {N,E,W,S}.

The polymorphic torus architecture [LI89ab, MARE89] is identical to the PARBUS except that the rows and columns of the underlying mesh wrap around (Figure 1.15). In a reconfigurable network (RN) [BENA91], no restriction is placed on the bus segments that connect pairs of processors or on the relative placement of the processors, i.e., processors may not lie at grid points and a bus segment may join an arbitrary pair of processors.





Figure 1.9 A 4 x 4 RMESH


Figure 1.10 Row buses


Figure 1.11 Column buses

Figure 1.12 4 x 4 PARBUS

Figure 1.13 Row buses in a PARBUS

Figure 1.14 Column buses in a PARBUS

Like the PARBUS and polymorphic torus, each processor has an internal switch that is able to connect together arbitrary subsets of the bus segments that connect to the processor. Ben-Asher et al. [BENA91] also define a mesh restriction (MRN) of their reconfigurable network. In this, the processor and bus segment arrangement is exactly as for the PARBUS (Figure 1.12). However, the switches internal to the processors are able to realize only the 10 bus configurations given in Figure 1.16. Thus an MRN is a restricted PARBUS.

While we have defined the above reconfigurable bus architectures as square two-dimensional meshes, it is easy to see how they may be extended to obtain nonsquare architectures and architectures with more than two dimensions.






















Figure 1.15 4 x 4 Polymorphic torus


Comparison of RMB with other models

Reconfigurable Meshes with Buses (RMBs) are attractive because they have a diameter of 1. The time taken to broadcast along a bus is assumed to be constant (O(1)) regardless of the length of the bus. This assumption may be unrealistic on currently realizable parallel architectures: the conductance and capacitance of wires and transistors imply a constant lower bound on the transmission rate of a single switch. Despite this, there already exist experimental RMBs on VLSI chips, such as YUPPIE (the Yorktown Ultra-Parallel Polymorphic Image Engine) [LI89ab] and the CAAP [WEEM87,89]. RMBs are simple and easy to build with short wires (each edge has O(1) length) in VLSI. The time to change the state of a switch is orders of magnitude higher than the time for a signal to traverse a distance equal to the diameter of a switch [ROTH88]. This motivates nonelectronic or hybrid implementations of RMBs; Schuster [SCHU91] considers optical implementations.

Figure 1.16 Local configurations allowed for a switch in an MRN

The RMB model is similar to an SIMD MCC except for the buses and switches. An N^{1/2} x N^{1/2} RMB has area O(N) under the assumption that processors, switches, and single links have constant size. It is assumed that only a single PE can broadcast on a bus at any time, i.e., each bus functions in CREW mode. Each switch is under the control of the adjacent processors: the switch settings are dynamically controlled (in software) by the adjacent processors, as long as there is no conflict in the settings they desire. The switch settings allow the broadcast bus to be partitioned into different configurations (subbuses), and each subbus can function as a smaller RMB. An RMB can simulate any VLSI organization of equivalent area (those which can be laid out in an N^{1/2} x N^{1/2} grid) without loss of time [MILLP87b]. The RMB supports several techniques of CRCW PRAM models, giving the same or better time performance as a CREW PRAM with N processors. Schuster [SCHU91] provides simulations of PRAMs by RMBs and vice versa. Miller et al. [MILLP87b] give an efficient simulation of the Mesh of Trees (MOT) on an RMB and a simulation of the Pyramid Computer (PC) without loss of time. Problems requiring high interprocessor communication on a fixed-topology network require at least O(diameter) time; an RMB can solve some of these problems in constant time [SCHU91], [WANG90ab, 91ab].

The RMB is more powerful than the PRAM (SMM). A one-dimensional linear array of size N with a reconfigurable bus can compute the OR of N Boolean inputs in O(1) steps, whereas a CREW PRAM with any polynomial number of processors requires Ω(log N) steps [COOK86]. The parity (summation modulo 2) of N inputs can be computed in O(1) time on a 3 x N PARBUS and MRN [LI91a], while it takes Ω(log N / log log N) steps on a CRCW PRAM with any polynomial number of processors [BEAM87]. Moreover, the parity of N^2 inputs can be computed in O(1) time on an N x N PARBUS and MRN.






The RMB is also more powerful than fixed-topology networks and meshes with static broadcast buses. Asymptotic improvements in the running time of algorithms that solve several problems on RMBs [MILLP87b,88abc,91ab], [JENQ91abc], [JANG92abcd], [OLAR92], [REIS89] have been reported relative to efficient algorithms on fixed-topology networks such as the mesh of trees (MOT), pyramid computers (PC), hypercubes, and meshes with static broadcast buses. For example, computing the parity of N inputs takes Ω(N^{1/6}) time on an N^{1/2} x N^{1/2} mesh with multiple buses [PRAS87a] and Ω(N^{1/8}) time on any rectangular N-processor mesh with multiple buses [BARN91], [CHEN89], but can be done in O(1) time on a PARBUS and MRN.

Dissertation Outline

The prospects of developing 0(1) i.e. constant time algorithms on RMBs motivated

this dissertation. Several other dissertations have been motivated by similar considera-

tions [REIS89], [SCHU91], [WANG91a]. This dissertation covers algorithms for sorting

on two-dimension and higher dimension RMBs for all configurations in 0(1) time, sort-

ing N elements on N processor meshes using a number of routing steps approximately

equal to the distance lower bound in two-dimensions and its simulation on RMBs, con-

stant time algorithms for determining the convex hull of a set of N planar points, finding

the smallest enclosing rectangle of a set of N planar points, triangulating N planar points

and the ECDF search problem for N planar points.

In chapter 2, we discuss the sorting problem in two and higher dimensions on all RMB configurations. Our algorithms require fewer routes than previously published algorithms for two-dimensional RMBs, and they are much simpler and easier to implement. We show that it is faster to sort than to select the kth element using Lin's algorithm [LIN92].

In chapter 3, sorting N elements on N-processor meshes and RMBs is discussed. We show that the lower bound established in [SCHN86] for the number of routes needed to sort on an n x n bidirectional mesh is incorrect. We give sorting algorithms on two-dimensional bidirectional and unidirectional meshes using a number of routing steps that is very close to the distance lower bound. We demonstrate the practical implications of our algorithms for sorting on a mesh. We also show how to simulate the mesh algorithms on reconfigurable meshes with buses.

In chapter 4, we derive O(1) time algorithms for determining the convex hull of N planar points, finding the smallest enclosing rectangle of N planar points, triangulating N planar points, and solving the ECDF search problem for N planar points on N x N reconfigurable meshes with buses. We prove that Reisis's convex hull algorithm [REIS92] is incorrect. Our algorithms work on all RMB architectures (PARBUS/RMESH/MRN).

Chapter 5 concludes this work with a critical analysis of the results and a discussion of future research directions.














CHAPTER 2

SORTING n NUMBERS ON n x n RECONFIGURABLE MESHES
WITH BUSES

Introduction

In this chapter, we consider the problem of sorting n numbers on the RMESH, PARBUS, and MRN. This sorting problem has been previously studied for all three architectures. One can sort n numbers in O(1) time on a three-dimensional n x n x n RMESH [JENQ91ab], PARBUS [WANG90], or MRN [BENA91]. All of these algorithms are based on a count sort [HORO90] and are easily modified to run in the same amount of time on a two-dimensional n^2 x n computer of the same model. Nakano et al. [NAKA90] have shown how to sort n numbers in O(1) time on an (n log2n) x n PARBUS. Jang and Prasanna [JANG92a] and Lin et al. [LIN92] have further reduced the number of processors required by an O(1) sort. They both present O(1) sorting algorithms that work on an n x n PARBUS. Since such a PARBUS can be realized in n^2 area, their algorithms achieve the area-time-squared (AT^2) lower bound of Ω(n^2) for sorting n numbers in the VLSI word model [LEIG85]. The algorithm of Jang and Prasanna [JANG92a] is based on Leighton's column sort [LEIG85], while that of Lin et al. [LIN92] is based on selection. Neither is directly adaptable to run on an n x n RMESH in O(1) time, as the algorithm of [JANG92a] requires that processors be able to connect their bus segments according to A = {{N,S}, {E,W}}, while the algorithm of [LIN92] requires A = {{N,S}, {E,W}} and {{N,W}, {S,E}}; these are not permissible in an RMESH. Their algorithms are, however, directly usable on an n x n MRN, as the bus connections used are permissible connections for an MRN. Ben-Asher et al. [BENA91] describe an O(1) algorithm to sort n numbers on an MRN with O(n^{1+ε}) processors for any ε > 0. This algorithm is also based on Leighton's column sort [LEIG85].

In this chapter, we show how Leighton's column sort algorithm [LEIG85] and Marberg and Gafni's rotate sort algorithm [MARB88] can be implemented on all three reconfigurable mesh with buses (RMB) architectures so as to sort n numbers in O(1) time on an n x n configuration. The resulting RMB sort algorithms are conceptually simpler than the O(1) PARBUS sorting algorithms of [JANG92a] and [LIN92]. In addition, our implementations use fewer bus broadcasts than do the algorithms of [JANG92a] and [LIN92]. Since our PARBUS implementations use only bus connections permissible in an MRN, our PARBUS algorithms may be directly used on an MRN. For the RMESH, our implementations are the first RMESH algorithms to sort n numbers in O(1) time on an n x n configuration.

In section 2, we describe Leighton's column sort algorithm. Its implementation on an RMESH is developed in section 3. In section 4, we show how to implement column sort on a PARBUS. Rotate sort is considered in sections 5 through 7: in section 5 we describe Marberg and Gafni's rotate sort [MARB88], and its implementations on the RMESH and PARBUS architectures are obtained in sections 6 and 7, respectively. In section 8, we propose a sorting algorithm that is a combination of rotate sort and Scherson et al.'s [SCHER89] iterative shear sort. In section 9, we provide a comparison of the two PARBUS sorting algorithms developed here and those of Jang and Prasanna [JANG92a] and Lin et al. [LIN92]. For the PARBUS model, Leighton's column sort uses the fewest bus broadcasts; for the RMESH model, combined sort uses the fewest bus broadcasts. In section 10, we make the observation that, using rotate sort, one can sort N^k elements in O(1) time on an N x N x ... x N (k+1)-dimensional RMESH or PARBUS.

Column Sort

Column sort is a generalization of Batcher's odd-even merge [KNUT73] and was proposed by Leighton [LEIG85]. It may be used to sort an r x s matrix Q where r >= 2(s-1)^2 and r mod s = 0. The number of elements in Q is n = rs, and the sorted sequence is stored in column major order (Figure 2.1). Our presentation of column sort follows that of [LEIG85] very closely.

There are eight steps to column sort. In the odd steps 1, 3, 5, and 7, we sort each column of Q top to bottom. In step 2, the elements of Q are picked up in column major order and placed back in row major order (Figure 2.2); this operation is called a transpose. Step 4 is the reverse of this (i.e., elements of Q are picked up in row major order and put back in column major order) and is called an untranspose. Step 6 is a shift by ⌊r/2⌋; this increases the number of columns by 1 and is shown in Figure 2.3. Step 8, unshift, is






Figure 2.1 Sorting a 9x3 matrix Q into column major order ((a) input Q; (b) output Q)

the reverse of this. Leighton [LEIG85] has shown that these eight steps are sufficient to sort Q whenever r >= 2(s-1)^2 and r mod s = 0.


Figure 2.2 Transpose (step 2) and Untranspose (step 4) of a 9x3 matrix
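Before developing the RMB implementations, it is useful to see the eight steps end to end. The following sequential sketch (ours, not the RMESH/PARBUS algorithm of this chapter) sorts a column major sequence with r >= 2(s-1)^2 and r mod s = 0:

    import math

    def column_sort(a, r, s):
        # Leighton's column sort of n = r*s numbers; `a` holds the r x s matrix
        # Q in column major order, and the result is returned the same way.
        INF = math.inf
        def sort_cols(x, rows):                       # steps 1, 3, 5 and 7
            return [v for c in range(0, len(x), rows) for v in sorted(x[c:c+rows])]
        a = sort_cols(a, r)                           # step 1
        a = [a[i*s + c] for c in range(s) for i in range(r)]   # step 2: transpose
        a = sort_cols(a, r)                           # step 3
        a = [a[c*r + i] for i in range(r) for c in range(s)]   # step 4: untranspose
        a = sort_cols(a, r)                           # step 5
        a = [-INF] * (r // 2) + a + [INF] * (r - r // 2)       # step 6: shift
        a = sort_cols(a, r)                           # step 7: sort the s+1 columns
        return a[r // 2 : r // 2 + r * s]             # step 8: unshift

    print(column_sort([4, 3, 2, 1, 8, 7, 6, 5], 4, 2))   # [1, 2, 3, 4, 5, 6, 7, 8]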

Column Sort On An RMESH

Our adaptation of column sort to an n x n RMESH is similar to the adaptation used by Ben-Asher et al. [BENA91] to obtain an O(n^{17/9}) processor reconfigurable network that can sort in O(1) time. This reconfigurable network uses approximately 11 layers.

Figure 2.3 Shift (step 6) and Unshift (step 8)

The n input numbers to be sorted reside in the Z variables of the row 1 PEs of an n x n RMESH. This is interpreted as the column major order representation of the Q matrix used in column sort (Figure 2.4). The dimensions of Q are r_1 x s_1, where r_1 = (1/2)n^{3/4} and s_1 = 2n^{1/4}. Clearly, there is an n_1 such that r_1 >= 2(s_1 - 1)^2 for all n >= n_1. Hence, column sort will work for n >= n_1. For n < n_1 we can sort in constant time (as n_1 is a constant) using any previously known RMESH sorting algorithm.

The steps in the sorting algorithm are given in Figure 2.5. Steps 1, 2, and 3 use the n x r_1 sub-RMESH A_i to sort column i of Q. For the sort of step 4, the Q matrix has dimensions r_1 x (s_1 + 1). Columns 1 and s_1 + 1 are already sorted, as the -∞'s and +∞'s of Figure 2.3 do not affect the sorted order; only the inner s_1 - 1 columns need to be sorted. Each of these columns is sorted using an n x r_1 sub-RMESH, B_i, as shown in Figure 2.6. The sub-RMESHs X and Y are idle during this sort.






Figure 2.4 Column major interpretation of Q (columns 1 through s_1, left to right)



Step 0: [Input] Q is available in column major order, one element per PE, in row 1 of the RMESH, i.e., Z[1,j] = j'th element of Q in column major order.

Step 1: [SortTranspose] Obtain the Q matrix following step 2 of column sort. This matrix is available in column major order in each row of the RMESH.

Step 2: [SortUntranspose] Obtain the Q matrix following step 4 of column sort. This matrix is available in column major order in each row of the RMESH.

Step 3: [SortShift] Obtain the Q matrix following step 6 of column sort. This matrix (excluding the -∞ and +∞ values) is available in column major order in each row of the RMESH.

Step 4: [SortUnshift] Obtain the sorted result in row 1 of the RMESH.

Figure 2.5 Sorting Q on an RMESH

Thus the sorts of each of the above four steps involve sorting r_1 numbers using an n x r_1 sub-RMESH. Each of these sorts is done using column sort with the Q matrix now being an r_2 x s_2 matrix with r_2 s_2 = r_1. We use r_2 = n^{1/2} and s_2 = (1/2)n^{1/4}. With this choice, we have 2(s_2 - 1)^2 < 2 s_2^2 = (1/2)n^{1/2} <= r_2, so column sort may be used to sort the r_2 x s_2 matrix.











Figure 2.6 Sub-RMESHs for step 4 of the sort: X (width ⌊r_1/2⌋), B_1, ..., B_{s_1-1} (width r_1 each), and Y (width ⌈r_1/2⌉)

Let us examine step 1 of Figure 2.5. To sort a column of Q we initially use only the n x r_1 sub-RMESH A_i that contains this column. This column is actually the column major representation of a matrix W_i, so sorting the column is equivalent to sorting W_i. To sort W_i, we follow the steps given in Figure 2.5, except that Q is replaced by W_i. The steps of Figure 2.5 are done differently on W_i than on Q; Figure 2.7 gives the necessary steps for each W_i. Note that this figure assumes that the n x n RMESH has been partitioned into the s_1 A_i's and that the A_i's operate independently of one another. To SortTranspose a W_i, we first broadcast W_i, which is initially stored in the Z variables of the row 1 PEs of A_i, to all rows of A_i (step 1). Following this, the U variable of each PE in a given column of A_i is the same. In steps 2-4, each A_i is partitioned into square sub-RMESHs B_ijk, 1 <= j <= r_2, 1 <= k <= s_2, of size r_2 x r_2 (Figure 2.8). Note that B_ijk contains column k of W_i in the U variables of each of its rows. B_ijk will be used to determine the rank of the j'th element of the k'th column of W_i. Here, by rank we mean the number of elements in the k'th column of W_i that are either less than the j'th element or are equal to it but not in a column to the right of the j'th column of B_ijk. So, this rank gives the position, in column k, of the j'th element following a sort of column k. To determine this rank, in step 2, we broadcast the j'th element of column k of W_i to all processors in B_ijk. This is done by broadcasting U values from column j of B_ijk using row buses that are local to B_ijk. The broadcast value is stored in the V variables of the PEs of B_ijk. Now, the processor in position (a,b) of B_ijk has the b'th element of column k of W_i in its U variable and the j'th element of this column in its V variable. Step 3 sets the S variables in the processors to 0 or 1 such that the sum of the S's in each row of B_ijk gives the rank of the j'th element of column k of W_i. In step 4, the S values in a row are summed to obtain the rank; we shall describe shortly how this is done. The computed rank is stored in the variable R of the PE in position [k,1] of B_ijk. Note that since k <= s_2 <= r_2, [k,1] is actually a position in B_ijk.

Now, if our objective were simply to sort the columns of W_i, we could route the V value in PE[k,1] (this is the j'th element in column k of W_i) to the R'th column using a row bus local to B_ijk and then broadcast it along a column bus to all PEs in column R. However, we wish to transpose the resulting W_i, which is sorted by columns. For this, we

Step 1: Broadcast the row 1 Z values, using column buses, to the U variables of all PEs in the same column of A_i.

Step 2: The column j PEs in all B_ijk's broadcast their U values, using row buses local to each B_ijk, to the V variables of all PEs in the same B_ijk.

Step 3: PE[a,b] of B_ijk sets its S value to 1 if (U < V) or (U = V and b < j); otherwise, it sets its S value to 0. This is done, in parallel, by all PEs in all B_ijk's.

Step 4: The sum of the S's in any one row of B_ijk is computed and stored in the R variable of PE[k,1] of B_ijk. This is done, in parallel, for all B_ijk's.

Step 5: Using row buses, PE[k,1] of each B_ijk sends its V value to the PE in column [(R-1) mod s_2]r_2 + (k-1)r_2/s_2 + ⌈R/s_2⌉ of A_i.

Step 6: The PEs that received V values in step 5 broadcast this value, using column buses, to the U variables of the PEs in the same column of A_i.

Figure 2.7 Steps to SortTranspose W_i in A_i
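Sequentially, the rank computed by steps 2-4 is just a stable comparison count. A sketch (ours) of the counting rule, with ties broken by position exactly as in step 3:

    def rank(column, j):
        # Number of elements of `column` that are less than column[j], or equal
        # to it but at a position b < j; i.e., the 0-based position of
        # column[j] after a stable sort of the column (cf. step 3, Figure 2.7).
        return sum(1 for b, v in enumerate(column)
                   if v < column[j] or (v == column[j] and b < j))

    col = [5, 3, 5, 1]
    print([rank(col, j) for j in range(4)])   # [2, 1, 3, 0]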

see that the j'th element of W_i will be in column k and row R of the sorted W_i. Following the transpose, this element will be in row (k-1)r_2/s_2 + ⌈R/s_2⌉ and column (R-1) mod s_2 + 1 of W_i. Hence, the V value in PE[k,1] of B_ijk is to be in all PEs of column [(R-1) mod s_2]r_2 + (k-1)r_2/s_2 + ⌈R/s_2⌉ of A_i following the SortTranspose. This is accomplished in steps 5 and 6. Note that there are no bus conflicts in the row bus broadcasts of step 5, as the broadcast for each B_ijk uses a different row bus that spans the width of A_i. The total number of broadcasts is 4 plus the number needed to sum the S values in a row of B_ijk.






We shall shortly see that the summing can be done with 6 broadcasts. Hence, SortTranspose uses 10 broadcasts.

Figure 2.8 Division of A_i into r_2 x r_2 sub-RMESHs

Ranking

To sum the S values in a row of an r_2 x r_2 RMESH, we use a slightly modified version of the ranking algorithm of Jenq and Sahni [JENQ91ab]. This algorithm ranks the row 1 processors of an r_2 x r_2 RMESH. The ranking is done by using 0/1 valued variables, S, in row 1. The rank of the processor in column i of row 1 is the number of processors in row 1 and columns k, k < i, that have S = 1. Hence, the rank of processor








[1,r_2] equals the sum of the S values in row 1, except when S[1,r_2] = 1; in this latter case we need to add 1 to the rank to get the sum. The ranking algorithm of [JENQ91ab] has three phases:

Phase 1: Rank the processors in the even columns of row 1. This does not account for the 1's in the odd columns.

Phase 2: Rank the processors in the odd columns of row 1. This does not account for the 1's in the even columns.

Phase 3: Combine the odd and even ranks.


The procedures for phases 1 and 2 are quite similar. These are easily modified to start with the configuration following step 3 of Figure 2.7 and to send the phase 1 and phase 2 results of the rightmost odd and even columns in B_ijk to PE[k,1], where the two results are added. To avoid an extra broadcast, the result for the rightmost column (phase 1 if r_2 is even and phase 2 if r_2 is odd) is incremented by 1 in case S[*,r_2] = 1 before the broadcast. While the phase 1 code of [JENQ91ab] uses 4 broadcasts, the first of these can be eliminated, as we begin with a configuration in which S[*,b] is already on all rows of column b, 1 <= b <= r_2. So, phases 1 and 2 use 6 broadcasts. Phase 3 of [JENQ91ab] uses two broadcasts. Both of these can be eliminated by having their phase 1 (2) step 10 directly broadcast the rank of the rightmost even (odd) column to PE[k,1] using a row bus that spans row k connected to a column bus that spans the rightmost even (odd) column of B_ijk. So, the summing operation of step 4 can be done using a total of 6 broadcasts.


Analysis of RMESH Column Sort

First, let us consider sorting a column of Q, which is an r_1 x s_1 matrix. This requires us to perform steps 1-4 of Figure 2.5 on the W_i's. As we have just seen, step 1 uses 10 broadcasts. Step 2 is similar to step 1, except that we begin with the data on all rows of A_i and, instead of a transpose, an untranspose is to be performed. This means that step 1 of Figure 2.7 can be eliminated and the formula in step 5 is to be changed to correspond to an untranspose. The number of broadcasts for step 2 of Figure 2.5 is therefore 9. Steps 3 and 4 are similar and each uses 9 broadcasts. The total number of broadcasts to sort a column of Q is therefore 37.

Now, to perform a SortTranspose of the columns of Q, we proceed as in a sort of the columns of Q, except that the last broadcast of the SortUnshift performs the transpose and leaves the transposed matrix in column major order in all rows of the RMESH. This takes 37 broadcasts. The SortUntranspose takes 36 broadcasts, as it begins with Q in all rows. Similarly, steps 3 and 4 each take 36 broadcasts. The total number of broadcasts is therefore 145.

A more careful analysis reveals that the number of broadcasts needed for the SortShift and SortUnshift steps can be reduced by one each, as the step 5 (Figure 2.7) broadcast can be eliminated. Taking this into account, the number of broadcasts to sort a column of Q becomes 35. However, to do a SortTranspose of the columns of Q, we need to do an additional broadcast during the SortUnshift of the W_i's. This brings the number to 36. A SortUntranspose of Q takes 35 broadcasts, and the remaining two steps of Figure 2.5 each take 34 broadcasts. Hence, the total number of broadcasts becomes 139.


Column Sort On A PARBUS

The RMESH algorithm of Section 3 will work on a PARBUS, as all connections used by it are possible in a PARBUS. We can, however, sort using fewer than 139 broadcasts on a PARBUS. If we replace the ranking algorithm of [JENQ91ab] used in Section 3 by the prefix sum algorithm of [LIN92] (i.e., prefix summing N bits on an (N+1) x N PARBUS), then we can sum the S's in a row using two broadcasts. The algorithm of [LIN92] needs to be modified slightly to allow for the fact that we begin with the S values on all columns and that we are summing r_2 S values on an r_2 x r_2 PARBUS rather than an (r_2 + 1) x r_2 PARBUS. These modifications are straightforward. Since the S's can be summed in 2 broadcasts rather than 6, the SortTranspose of Figure 2.5 requires only 6 broadcasts. The SortUntranspose can be done in 5 broadcasts. As indicated in the analysis of Section 3, the SortShift and SortUnshift steps can be done without the broadcast of step 5 of Figure 2.7; so, each of these requires only 4 broadcasts. Thus, sorting a column of Q requires 19 broadcasts. Following the analysis of Section 3, we see that a SortTranspose of Q can be done with 20 broadcasts, a SortUntranspose with 19 broadcasts, and a SortShift and SortUnshift with 18 broadcasts each. Thus n numbers can be sorted on an n x n PARBUS using 75 broadcasts.






We can actually reduce the number of broadcasts further by beginning with r_1 = n^{2/3} and s_1 = n^{1/3}. While this does not satisfy the requirement that r >= 2(s-1)^2, Leighton [LEIG85] has shown that column sort works for r and s such that r >= s(s-1), provided that the untranspose of step 4 is replaced by an undiagonalize step (Figure 2.9). The use of the undiagonalizing permutation only requires us to change the formula used in step 5 of Figure 2.7 to a slightly more complex one. This does not change the number of broadcasts in the case of the RMESH and PARBUS algorithms previously discussed. However, the ability to use r_1 = n^{2/3} and s_1 = n^{1/3} (instead of r_1 = 2n^{2/3} and s_1 = (1/2)n^{1/3}, which satisfy r >= 2(s-1)^2) significantly reduces the number of broadcasts for the PARBUS algorithm we are about to describe.

Now, the SortTranspose, SortUntranspose, SortShift, and SortUnshift for Q are not done by using another level of column sort. Instead, a count sort similar to that of Figure 2.7 is applied directly to the columns of Q. This time, A_i is an n x r_1 = n x n^{2/3} sub-PARBUS. Let D_ij be the j'th (from the top) n^{1/3} x r_1 sub-PARBUS of A_i (Figure 2.10). The D_ij's are used to do the counting previously done by the B_ijk's. To count r_1 = n^{2/3} bits using an n^{1/3} x n^{2/3} sub-PARBUS, we use the parallel prefix sum algorithm of [LIN92], which does this in 12 broadcasts when we begin with the bits in all rows of D_ij and take into account that we want only the sum and not the prefix sum. Note that the prefix sum algorithm of [LIN92] is an iterative algorithm that uses modulo M arithmetic to sum N bits on an (M+1) x N PARBUS; for this it uses log N / log M iterations. For the case of summing N bits, it is easily modified to run on an M x N PARBUS in log N / log M iterations, with each iteration using 6 broadcasts (3 for the odd bits and 3 for the even bits). With our choice of r_1 and s_1, the number of iterations is 2, while with r_1 = 2n^{2/3} and s_1 = (1/2)n^{1/3}, the number of iterations is 3. The two iterations, together, use 12 broadcasts.



Figure 2.9 Undiagonalizing a 9x3 matrix

To get the SortTranspose algorithm for the PARBUS, we replace all occurrences of B_{ijk} by D_{ij} and of PE[k,1] by PE[i,1] in Figure 2.7. The formula of step 5 is changed to [(R-1) mod s_1] r_1 + (i-1) r_1/s_1 + ⌈R/s_1⌉. The number of broadcasts used by the new SortTranspose algorithm is 16. The remaining three steps of Figure 2.7 are similar. The SortUndiagonalize takes 15 broadcasts as it begins with data in all rows and the step 1 (Figure 2.3) broadcast can be eliminated. The SortShift and SortUnshift each can be done in 14 broadcasts as the step 5 (Figure 2.7) broadcasts are unnecessary. So, the number of broadcasts in the one-level PARBUS column sort algorithm is 59.
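As a quick sanity check on the broadcast accounting, the following small Python tally reproduces the totals derived above for the two variants (the phase names are just labels for the counts stated in the text):

two_level = {"SortTranspose": 20, "SortUntranspose": 19,
             "SortShift": 18, "SortUnshift": 18}
one_level = {"SortTranspose": 16, "SortUndiagonalize": 15,
             "SortShift": 14, "SortUnshift": 14}
assert sum(two_level.values()) == 75   # two-level PARBUS column sort
assert sum(one_level.values()) == 59   # one-level column sort with undiagonalize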








Figure 2.10 Decomposing A_i into D_{ij}'s (A_i = n x r_1 = n x n^{2/3}; each D_{ij} is an s_1 x r_1 sub-PARBUS)


Rotate Sort

Rotate sort was developed by Marberg and Gafni [MARB88] to sort MN numbers on an M x N mesh with the standard four neighbor connections. To state their algorithm we need to restate some of the definitions from [MARB88]. Assume that M = 2^s and N = 2^{2t}, where s ≥ t. An M x N mesh can be tiled in a natural way with tiles of size M x N^{1/2}. This tiling partitions the mesh into vertical slices (Figure 2.11(a)). Similarly, an M x N mesh can be tiled with N^{1/2} x N tiles to obtain horizontal slices (Figure 2.11(b)). Tiling by N^{1/2} x N^{1/2} tiles results in a partitioning into blocks (Figure 2.11(c)). Marberg and Gafni define three procedures on which rotate sort is based. These are given in Figure 2.12.

Rotate sort comprises the six steps given in Figure 2.13. Recall that a vertical






slice is an M x N^{1/2} submesh; a horizontal slice is an N^{1/2} x N submesh; and a block is an N^{1/2} x N^{1/2} submesh.


Figure 2.11 Definitions of slices and blocks: (a) vertical slices, (b) horizontal slices, (c) blocks



Marberg and Gafni [MARB88] point out that when M = N, step 1 of rotate sort may be replaced by the following steps:

Step 1': (a) Sort all the columns downward;

         (b) Sort all the rows to the right;






Procedure balance(v,w);
(Operate on submeshes of size v x w)

    sort all columns of the submesh downward;

    rotate each row i of the submesh i mod w positions right;

    sort all columns of the submesh downward;

end;
(a) Balance a submesh

Procedure unblock;
(Operate on entire mesh)

    rotate each row i of the mesh (i·N^{1/2}) mod N positions right;

    sort all columns of the mesh downward;

end;
(b) Unblock

Procedure shear;
(Operate on entire mesh)

    sort all even numbered rows to the right and all odd numbered rows to the left;

    sort all columns downward;

end;
(c) Shear


Figure 2.12 Procedures from [MARB88]
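For concreteness, here is a minimal sequential Python sketch of the three procedures of Figure 2.12, operating on a list-of-lists mesh. It is a direct transcription, with the assumption that rows and columns are 0-indexed (the text of [MARB88] is 1-indexed) and that the tile sizes divide the mesh dimensions:

import math

def sort_cols_down(m):
    # sort every column of mesh m into nondecreasing top-to-bottom order
    for j in range(len(m[0])):
        col = sorted(row[j] for row in m)
        for i, row in enumerate(m):
            row[j] = col[i]

def rot_right(row, k):
    # rotate a row k positions to the right
    k %= len(row)
    return row[-k:] + row[:-k]

def balance(m, v, w):
    # Figure 2.12(a): operate on each v x w tile of m
    for r0 in range(0, len(m), v):
        for c0 in range(0, len(m[0]), w):
            tile = [m[r0 + i][c0:c0 + w] for i in range(v)]
            sort_cols_down(tile)
            tile = [rot_right(tile[i], i % w) for i in range(v)]
            sort_cols_down(tile)
            for i in range(v):
                m[r0 + i][c0:c0 + w] = tile[i]

def unblock(m):
    # Figure 2.12(b): rotate row i by (i * N^(1/2)) mod N, then sort all columns
    n_cols = len(m[0])
    root = math.isqrt(n_cols)
    for i in range(len(m)):
        m[i] = rot_right(m[i], (i * root) % n_cols)
    sort_cols_down(m)

def shear(m):
    # Figure 2.12(c): even rows rightward (ascending), odd rows leftward (descending)
    for i, row in enumerate(m):
        row.sort(reverse=(i % 2 == 1))
    sort_cols_down(m)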

This simplifies the algorithm when it is implemented on a mesh. However, it does

not reduce the number of bus broadcasts needed on reconfigurable meshes with buses.








Step 1: balance each vertical slice;

Step 2: unblock the mesh;

Step 3: balance each horizontal slice as if it were an N x N^{1/2} mesh lying on its side;

Step 4: unblock the mesh;

Step 5: shear three times;

Step 6: sort rows to the right;


Figure 2.13 Rotate Sort [MARB88]

As a result we do not consider this variant of rotate sort.

Rotate Sort On An RMESH

In this section, we adapt the rotate sort algorithm of Figure 2.13 to sort n numbers

on an n x n RMESH. Note that the algorithm of Figure 2.13 sorts MN numbers on an

M x N mesh. For the adaptation, we use M = N. Hence, n = N^2 = 2^{4t}. The n = N^2

numbers to be sorted are available, initially, in the Z variable of the row 1 PEs of the

n x n RMESH. We assume a row major mapping from the N x N mesh to row 1 of the

RMESH. Figure 2.14 gives the steps involved in sorting the columns of the N x N mesh

downward. Note that this is the first step of the balance operation of step 1 of rotate sort.

The basic strategy employed in Figure 2.14 is to use each n x N subRMESH of the

n x n RMESH to sort one column of the N x N mesh.








{sort columns of N x N mesh}
Step 1: Use column buses to broadcast Z[1,j] to all rows in column j, 1 ≤ j ≤ n. Now,
        Z[i,j] = Z[1,j], 1 ≤ i ≤ n, 1 ≤ j ≤ n.

Step 2: Use row buses to broadcast Z[i,i] to the R variable of all PEs on row i of the
        RMESH. Now, R[i,j] = Z[i,i] = Z[1,i], 1 ≤ i ≤ n, 1 ≤ j ≤ n.

Step 3: In the q'th n x N subRMESH, all PEs [i,j] such that i mod N = q mod N and
        (j-1) mod N + 1 = ⌈i/N⌉ broadcast their R values along the column buses, 1
        ≤ q ≤ N. Note that each such PE[i,j] contains the ⌈i/N⌉'th element of column
        q of the N x N mesh. This value is broadcast to the U variables of the PEs in
        the column. Now, each row of the q'th n x N subRMESH contains in its U
        variables column q of the N x N mesh.

Step 4: Now assume that the n x n RMESH is tiled by N^2 N x N subRMESHs. In the
        [a,b]'th such subRMESH, the PEs on column a of the subRMESH broadcast
        their U value using row buses local to the subRMESH. This is stored in the V
        variable of each PE.

Step 5: PE[i,j] of the [a,b]'th subRMESH sets its S value to 1 if (U < V) or
        (U = V and i < a), and to 0 otherwise.

Step 6: The sum of the S's in any one row of each of the N x N subRMESHs is
        computed. The result for the [a,b]'th subRMESH is stored in the T variable of
        PE[b,1].

Step 7: Using the row buses that span the n x n RMESH, the PE in position [b,1] of
        each N x N subRMESH [a,b] sends its V value to the Z variable of the PE in
        column (b-1)N + T + 1.

Step 8: The received Z values are broadcast along column buses.


Figure 2.14 Sorting the columns of an N x N mesh

For this, we need to first extract the columns from row 1 of the RMESH. This is

done in steps 1-3 of Figure 2.14. Following this, each row of the q'th n x N subRMESH








contains the q'th column of the N x N mesh. Steps 4-6 implement the count phase of a

count sort. This implementation is equivalent to that used in [JENQ91b] to sort m ele-

ments on an m x m x m RMESH. Steps 7 and 8 route the data back to row 1 of the

RMESH so that the Z values in row 1 (and actually in all rows) correspond to the row

major order of the N x N mesh following a sort of its columns. The total number of

broadcasts used is 12 (note that step 6 uses 6 broadcasts).
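The count phase implemented by steps 4-6 has a simple sequential analog: each element's destination is the number of elements smaller than it, with ties broken by position. A minimal Python sketch (my illustration of the counting rule, not the RMESH data movement):

def count_sort(vals):
    # rank(b) = #{a : vals[a] < vals[b], or vals[a] == vals[b] and a < b},
    # mirroring the comparison of step 5 of Figure 2.14 (U < V, or U = V and i < a)
    n = len(vals)
    out = [None] * n
    for b in range(n):
        rank = sum(1 for a in range(n)
                   if vals[a] < vals[b] or (vals[a] == vals[b] and a < b))
        out[rank] = vals[b]   # step 7: route the value to column rank+1 of its group
    return out

assert count_sort([3, 1, 2, 1]) == [1, 1, 2, 3]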

The row rotation required by procedure balance can be obtained at no additional

cost by changing the destination column computed in step 7 of Figure 2.14 so as to

account for the rotation. The second sort of the columns performed in procedure balance

can be done with 9 additional broadcasts. For this, during the first column sort of pro-

cedure balance, the step 7 broadcast of Figure 2.14 takes into account both the row rota-

tion and the row major to column major transformation to be done in steps 1-3 of Figure

2.14 for the second column sort of procedure balance. So, step 1 of rotate sort (i.e.,

balancing the vertical slices) can be done using a total of 21 broadcasts.

To unblock the data, we need to rotate each row and then sort the columns down-

ward. Once again, the rotation can be accomplished at no additional cost by modifying

the destination column function used in step 7 of the second column sort performed dur-

ing the vertical slice balancing of step 1. The column sort needs 11 broadcasts.

The horizontal slices can be balanced using 18 broadcasts as the Z data are already

distributed over the columns. The unblock of step 4 takes as many broadcasts as the

unblock of step 2 (i.e. 9).








The shear operation requires a row sort followed by a column sort. Row sorts are

performed using the same strategy as used for a column sort. The fact that all elements of

a row are in adjacent columns of the RMESH permits us to eliminate steps 1-3 of Figure

2.14. So, a row sort takes only 9 additional broadcasts. The following column sort uses 9

broadcasts. So, each application of shear takes 18 broadcasts. Since we need to shear

three times, step 5 of rotate sort uses 54 broadcasts. Step 6 of rotate sort is a row sort.

This takes 9 broadcasts. The total number of broadcasts is 120. This is 19 fewer than the

number of broadcasts used by our RMESH implementation of column sort.


Rotate Sort On A PARBUS

Our implementation of rotate sort on a PARBUS is the same as that on an RMESH.

Note, however, that on a PARBUS ranking (step 6 of Figure 2.14) takes only 3 broad-

casts. Since this is done once for each row/column sort and since a total of 13 such sorts

is done, 39 fewer broadcasts are needed on a PARBUS. Hence our PARBUS implemen-

tation of rotate sort takes 81 broadcasts. Recall that Leighton's column sort could be

implemented on a PARBUS using only 59 broadcasts.








A Combined Sort

We may combine the first three steps of the iterative shear sort algorithm of Scher-

son et al. [SCHER89] with the last four steps of rotate sort to obtain combined sort of

Figure 2.15. This is stated for an N x N mesh using nearest neighbor connections. The

number of elements to be sorted is N^2.



Step 1: Sort each N^{3/4} x N^{3/4} block;

Step 2: Shift the i'th row by (i·N^{3/4}) mod N to the right, 1 ≤ i ≤ N;

Step 3: Sort the columns downward;

Step 4: balance each horizontal slice as if it were an N x N^{1/2} mesh lying on its side;

Step 5: unblock the mesh;

Step 6: shear two times;

Step 7: sort rows to the right;


Figure 2.15 Combined Sort

Notice that steps 4-7 of Figure 2.15 differ from steps 3-6 of Figure 2.13 only in that in step 6 of Figure 2.15 the shear sort is done two times. The correctness of Figure 2.15 may be established using the results of [SCHER89] and [MARB88].

To implement the combined sort on an n x n RMESH or n x n PARBUS (n = N^2 elements to be sorted), we note that the column sort of step 3 can be done in the same manner as the column sorts of rotate sort are done. The shift of step 2 can be combined with the sort of step 1. The block sort of step 1 is done using submeshes of size n x n^{3/4} = N^2 x N^{3/2}, one for each N^{3/4} x N^{3/4} block of data. On a PARBUS, this is done by ranking in n^{1/4} x n^{3/4} sub-PARBUSes as in [LIN92], while on an RMESH, this is done using the algorithm to sort a column of Q using an n x r submesh (section 3). We omit the details here. The number of broadcasts is 77 for the PARBUS and 118 for the RMESH.


Comparison With Other O(1) PARBUS Sorts

As noted earlier, our PARBUS implementation of Leighton's column sort uses only

59 broadcasts whereas our PARBUS implementation of rotate sort uses 81 broadcasts

and our implementation of combined sort uses 77 broadcasts. The O(1) PARBUS sorting algorithm of Jang and Prasanna [JANG92a] is also based on column sort. However, it is far more complex than our adaptation and uses more broadcasts than does the O(1) PARBUS algorithm of Lin et al. [LIN92]. So, we compare our algorithm to that of

[LIN92]. This latter algorithm is not based on column sort. Rather, it is based on a multi-

ple selection algorithm that the authors develop. This multiple selection algorithm is

itself a simple modification of a selection algorithm for the PARBUS. This algorithm

selects the k'th largest element of an unordered set S of n elements. The multiple selection algorithm takes as input an increasing sequence q_1 < q_2 < ... < q_{n^{1/3}} with 1 ≤ q_i ≤ n and reports, for each i, the q_i'th largest element of S. By selecting q_i = i·n^{2/3}, 1 ≤ i ≤ n^{1/3}, one is able to determine partitioning elements such that the set of n numbers to be sorted can be partitioned into n^{1/3} buckets each having n^{2/3} elements.

Each bucket is then sorted using an n x n^{2/3} sub-PARBUS. Lin et al. [LIN92] were only

concerned with developing a constant time algorithm to sort n numbers on an n x n

PARBUS. Consequently, they did not attempt to minimize the number of broadcasts

needed for a sort. However, we analyzed versions of their algorithms that were optimized

by us. The optimized selection algorithm requires 84 broadcasts and the optimized sort

algorithm uses 103 broadcasts. Thus our PARBUS implementation of Leighton's column

sort uses slightly more than half the number of broadcasts used by an optimized version

of Lin et al.'s algorithm. Furthermore, even if one were interested only in the selection

problem, it would be faster to sort using our PARBUS implementation of Leighton's

column sort algorithm and then select the k'th element than to use the optimized version

of the PARBUS selection algorithm of [LIN92]. Our algorithm is also conceptually

simpler than those of [JANG92a] and [LIN92]. Like the algorithms of [JANG92a] and [LIN92], our PARBUS algorithms may be run directly on an n x n MRN. The number of

broadcasts remains unchanged.


Sorting On Higher Dimension RMB

Rotate sort works by sorting and/or shifting rows and columns of an N x N array.

This algorithm may be implemented on a three dimensional N x N x N reconfigurable

mesh with buses so as to sort N^2 elements in O(1) time. In other words, n = N x N elements are being sorted in O(1) time on an n^{1/2} x n^{1/2} x n^{1/2} RMB. Assume that we








start with n elements on the base of an N x N x N RMB. To sort row (column) i, we

broadcast the row (column) to the i'th layer of N x N processors and sort it in O(1) time

in this layer using the algorithms developed in this chapter to sort N elements on an

N x N RMB. The total number of such sorts required by rotate sort is 13 (i.e., O(1)) and

the required shifts may be combined with the sorts.

We can extend this to obtain an algorithm to sort N^k numbers on a (k+1) dimensional RMB with N^{k+1} processors in O(1) time. The RMB has an N x N x ··· x N configuration and the N^k numbers are initially in the face with (k+1)'st dimension equal to zero. In the preceding paragraph we showed how to do this sort when k = 2. Suppose we have an algorithm to sort N^{l-1} numbers on an l dimensional N^l processor RMB in O(1) time. We can use this to sort N^l numbers on an N^{l+1} processor RMB in O(1) time by regarding the N^l numbers as forming an N^{l-1} x N array. In the terminology of Marberg and Gafni [MARB88], we have an M x N array with M = N^{l-1} ≥ N. To use rotate sort, we need to be able to sort the columns of this array (which are of size M); sort the rows (which are of size N); and perform shifts/rotations on the rows or subrows.

To do the column sort we use l dimensional RMBs. The m'th such RMB consists of all processors with index of the type [i_1, i_2, ..., i_{l-1}, m, i_{l+1}]. By assumption, this RMB can sort its N^{l-1} numbers in O(1) time. To sort the rows, we can use two dimensional RMBs. Each such RMB consists of processors that differ only in their last two dimensions (i.e., [a, b, c, ..., i_l, i_{l+1}]). This sort is done using the O(1) sorting algorithm for two dimensional RMBs. The row shifts and rotates are easily done in O(1) time using the two








dimensional RMBs just described (actually most of these can be combined with one of

the required sorts).

Conclusions

We have developed relatively simple algorithms to sort n numbers on reconfigurable n x n meshes with buses. For the case of the RMESH, our algorithms are the first to sort in O(1) time. For the PARBUS, our algorithms are simpler than those of [JANG92a] and [LIN92]. Our PARBUS column sort algorithm is the fastest of our algorithms for the PARBUS. It uses fewer broadcasts than does the optimized version of the selection algorithm of [LIN92]. Our PARBUS algorithms can be run on an MRN with no modifications. Since n x n reconfigurable meshes require n^2 area for their layout, our algorithms (as well as those of [JANG92a] and [LIN92]) have an area-time-squared product AT^2 of n^2, which is the best one can hope for in view of the lower bound result AT^2 ≥ n^2 for the VLSI word model [LEIG85].

Using two dimensional meshes with buses, we are able to sort n elements in O(1) time using n^2 processors. Using higher dimensional RMBs, one can sort n numbers in O(1) time using fewer processors. In general, n = N^k numbers can be sorted in O(1) time using N^{k+1} = n^{1+1/k} processors in a (k+1) dimensional configuration. While the same result has been shown for a PARBUS [JANG92b], our algorithm applies to an RMESH also.














CHAPTER 3

SORTING n^2 NUMBERS ON n x n MESHES

Introduction

In this chapter, we are concerned with sorting n^2 data elements using an n x n mesh connected parallel computer. The initial and final configurations have one data element in

each of the n x n processors (say in the A variable of each processor). In the final

configuration the data elements are sorted in snake-like row-major order. This problem

has been extensively studied for mesh architectures (see, e.g., [THOMK77],

[NASS79], [KUMA83], [LEIG85], [SCHN86], [SCHER89], [MARB88]). While all of

these studies consider SIMD meshes, they differ in the permissible communication pat-

terns. Thompson and Kung [THOMK77] consider a strict unidirectional model in which

all processors that simultaneously transfer data to a neighbor processor do so to the same

neighbor. That is, all active processors transfer data to their north neighbor, or all to their

south neighbor, etc. Using this model, Thompson and Kung [THOMK77] have developed a sorting algorithm with complexity 6n t_r + n t_c + low order terms, where t_r is the time needed to transfer one data element from a processor to its neighbor and t_c is the time needed to compare two data elements that are in the same processor. In the sequel, we

omit the low order terms.








In the bidirectional model, we assume there are two links between every pair (P,Q)

of neighbors. As a result, P can send a data element to Q at the same time that Q sends

one to P. Using this model, Schnorr and Shamir [SCHN86] have developed a sorting algorithm with complexity 3n t_r + 2n t_c. In the same paper, Schnorr and Shamir "prove" that 3n routing steps are needed by every sorting algorithm for an even stronger mesh model in which each processor can read the entire contents of the memories of its (up to) four neighbors in time t_r, and the internal processor computations are free. For this stronger mesh model, they show that every sorting algorithm must take 3n t_r time by first showing that input data changes in the lower right part of the mesh cannot affect the values in the top left corner processor until time equal to the distance, d, between the top left corner processor and the nearest of these lower right processors. Next, they argue that by changing the lower right input data, they can change the final position of the data in the top left corner processor by distance n-1. As a result the sort must take at least (d+n-2) t_r time.

The fallacy is that in the first d-1 steps, we may have made several copies of the input data and it may no longer be necessary to route the data from the top left corner processor to the final destination processor. In fact, we show that one can sort in 2n t_r time using the stronger mesh model.

Park and Balasubramanium [PARK87,90] have considered a related sorting prob-

lem for the strict unidirectional model. In this, the n^2 elements to be sorted are initially

stored two to a processor on an n x n/2 mesh. The final sorted configuration also has two








elements per processor. The time complexity of their algorithm is 4n t_r + n t_c. This represents an improvement of 2n t_r over Thompson and Kung's algorithm provided the two element per processor input/output configuration is what we desire. If we desire the one element per processor configuration (for example, the data to be sorted is the result of a computation that produces this configuration and the sorted data is to be used for further computation), then it is necessary to first fold the data to get the two element per processor configuration, then sort using the algorithm of [PARK87, 90], and finally unfold to get the desired one element per processor final configuration. The folding can be done in (n/2) t_r time as below (see also Figure 3.1(a)).

F1: The left n/4 columns shift their data n/4 columns to the right.

F2: The right n/4 columns shift their data n/4 columns to the left.

The unfolding can also be done in (n/2) t_r time using the two steps:

U1: Unfold the n/4 columns labeled A in Figure 3.1(b).

U2: Unfold the n/4 columns labeled B in Figure 3.1(b).

To unfold A, we use the pipelined process described by the example of Figure 3.2.

B is unfolded in a similar way. The total time for the sort is therefore 5n t_r + n t_c, which is n t_r less than that of Thompson et al. [THOMK77]. The improvement is slightly more if

we consider the (nonstrict) unidirectional model in which there is a single link between

each pair of neighbor processors and data can be transferred, in parallel, along all links (however, if P and Q are neighbors, when P sends data to Q, Q cannot send to P). In this






model, steps F1 and F2 of folding can be done in parallel. Similarly, we can do U1 and U2 in parallel. The total sort time now becomes 4.5n t_r + n t_c.


Figure 3.1 Folding and unfolding: (a) folding the left and right n/4 columns into the middle n/2 columns; (b) unfolding the column groups A and B
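To make the data movement concrete, here is a small Python sketch of steps F1/F2 and their inverse for a single row (my sequential illustration, assuming n is divisible by 4; the mesh performs these shifts in (n/2) t_r, or (n/4) t_r when F1 and F2 proceed in parallel, and the sketch ignores the sorted-order bookkeeping):

def fold(row):
    # F1/F2: left and right n/4 columns shift n/4 positions inward; the middle
    # n/2 processors then hold two elements each.
    n = len(row)
    q = n // 4
    return ([[row[q + i], row[i]] for i in range(q)] +            # columns n/4 .. n/2-1
            [[row[2*q + i], row[3*q + i]] for i in range(q)])     # columns n/2 .. 3n/4-1

def unfold(folded):
    # U1/U2: route each pair back to its original columns.
    q = len(folded) // 2
    row = [None] * (4 * q)
    for i in range(q):
        row[q + i], row[i] = folded[i]
        row[2*q + i], row[3*q + i] = folded[q + i]
    return row

assert unfold(fold(list(range(8)))) == list(range(8))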

In section 2, we show that the Schnorr/Shamir algorithm of [SCHN86] can be

modified to sort on unidirectional meshes using the same number of routes as above. The

number of comparison steps is, however, larger. The modified Schnorr/Shamir algorithm

of section 2 runs in 2.5n t_r + 3n t_c time on an n x n bidirectional mesh. The number of

routes is therefore less than the 3n lower bound established in [SCHN86]. The algorithm

of section 2 folds data onto an n x n/2 submesh. By folding onto smaller submeshes,

i.e., onto n x n/k submeshes for k > 2, the number of routes can be reduced further. In

fact for bidirectional and strict unidirectional meshes we can come very close to the dis-

tance lower bound of 2n-2 and 4n-2, respectively. This is done in section 3. In section

4, we discuss the practical implications of the folding-unfolding approach to sorting. In








section 5, we show how the sort algorithms of sections 2 and 3 may be simulated on

n x n reconfigurable bus architectures.



Figure 3.2 Unfolding one row of A (a pipelined sequence of shifts, steps 1-4, across processors P0-P7)

Modified Schnorr&Shamir Algorithm

The sorting algorithm of Schnorr and Shamir [SCHN86] is given in Figure 3.3.

This algorithm uses the following terms and assumes that n = 2^{4q} for some integer q.


1. Block... an n^{3/4} x n^{3/4} submesh formed by a natural tiling of an n x n mesh with n^{3/4} x n^{3/4} tiles (Figure 3.4(a)).


2. Vertical Slice... an n x n^{3/4} submesh formed by a natural tiling of an n x n mesh with n x n^{3/4} tiles (Figure 3.4(b)).


3. Snake... the 1 x n^2 vector along the snake-like order in the n x n mesh (Figure 3.4(c)).


4. Even-odd transposition sort [KNUT73]... n elements are sorted in n steps. In the odd steps, each element in an odd position is compared with the element in the next even position and exchanged if larger. In the even steps, each element in an even position is compared with the element in the next odd position and exchanged if greater.
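A direct Python transcription of this definition (positions are 1-indexed in the text, so the odd steps compare positions (1,2), (3,4), ...; the sketch is sequential, each parallel step becoming one loop pass):

def even_odd_transposition_sort(a):
    # a[0] plays position 1; odd steps compare index pairs (0,1),(2,3),...,
    # even steps compare (1,2),(3,4),...
    n = len(a)
    for step in range(1, n + 1):
        start = 0 if step % 2 == 1 else 1
        for i in range(start, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

assert even_odd_transposition_sort([5, 3, 1, 4, 2]) == [1, 2, 3, 4, 5]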


5. n^{1/4}-way unshuffle... Let k = (1/4) log_2 n = q. Data in column j of the mesh is moved to column j'. Let j_{4q-1} ... j_0 be the binary representation of j. Then j_{k-1} ... j_0 j_{4q-1} ... j_k is the binary representation of j'. The unshuffle distributes the n^{3/4} columns of each vertical slice to the n^{1/4} vertical slices in a round robin manner.
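The destination column is just a rotation of the binary representation of j, which the following Python sketch computes (a transcription of the definition above, assuming 0-based column numbers):

def unshuffle_dest(j, q):
    # n = 2**(4*q) columns; move the k = q low-order bits of j to the top:
    # j' = j_{k-1}..j_0 j_{4q-1}..j_k
    k, total = q, 4 * q
    low = j & ((1 << k) - 1)
    high = j >> k
    return (low << (total - k)) | high

# consecutive columns of a vertical slice land in the n^{1/4} slices round robin;
# e.g. with q = 1 (n = 16 columns), columns 0,1,2,3 go to 0, 8, 1, 9, ...
assert [unshuffle_dest(j, 1) for j in range(4)] == [0, 8, 1, 9]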



Step 1: Sort all the blocks into snake-like row-major order.

Step 2: Perform an n^{1/4}-way unshuffle along all the rows of the array.

Step 3: Sort all the blocks into snake-like row-major order.

Step 4: Sort all the columns of the array downwards.

Step 5: Sort all the vertical slices into snake-like row-major order.

Step 6: Sort all the rows of the array into alternating left-to-right and right-to-left order.

Step 7: Perform 2n^{3/4} steps of even-odd transposition sort along the snake.


Figure 3.3 Sorting algorithm of Schnorr and Shamir

The correctness of the algorithm is established in [SCHN86]. As pointed out in

[SCHN86], steps 1, 3, 5, and 7 take O(n^{3/4}) time and are dominated by steps 2, 4, and 6. Steps 4 and 6 are done using n steps of even-odd transposition sort each, while step 2 is done by simulating even-odd transposition sort on the fixed input permutation π^{-1}, where π is the desired unshuffle permutation. Steps 4 and 6 take n t_r + n t_c time each on a bidirectional mesh and 2n t_r + n t_c on a unidirectional (whether strict or not) one. Step 2 takes n t_r time on a bidirectional mesh and 2n t_r on a unidirectional one. The total time for the algorithm of Schnorr and Shamir is therefore 3n t_r + 2n t_c on a bidirectional mesh and 6n t_r + 2n t_c on a unidirectional (strict or nonstrict) mesh. On the stronger mesh model the time is 3n t_r as t_c is zero.


Figure 3.4 Definitions of Blocks and Slices: (a) blocks, (b) vertical slices, (c) snake-like order, (d) horizontal slices


We may modify the algorithm of Schnorr and Shamir so that it works on an n x n/2 mesh with two elements per processor. In this modification, we essentially simulate the algorithm of Figure 3.3 using half as many processors. The run time for steps 1, 3, 5 and 7 increases but is still O(n^{3/4}). Step 6 takes (n/2) t_r + n t_c on a bidirectional mesh, as data routing is needed for only half of the steps of even-odd transposition sort, and n t_r + n t_c on a unidirectional mesh. Step 4 takes 2n t_r + 2n t_c on a bidirectional mesh and 4n t_r + 2n t_c on a unidirectional mesh as in each of the n steps of even-odd transposition sort, both the elements in a processor need to be routed to the neighbor processor. This can be reduced to n t_r + 2n t_c and 2n t_r + 2n t_c, respectively, by regarding the 2n elements in a processor column as a single column and sorting this into row major order using 2n steps of even-odd transposition sort (Figure 3.5). Now, a single element from each processor is to be routed on each of n steps and no data routing is needed on the remaining n steps.



8 2        3 1        1 2
6 5        6 2        3 4
3 4        7 4        5 6
7 1        8 5        7 8

(a) Processor column,   (b) Sort each column   (c) Sort as one column
    two elements per        downward
    processor


Figure 3.5 Modified Step 4

To establish the correctness of the sorting algorithm with step 4 changed to sort pairs of columns as though they were a single column, we use the zero-one principle








[KNUT73]. This was also used by Schnorr and Shamir to establish the correctness of their unmodified algorithm. Here, we assume that the input data consists solely of zeroes and ones. Since we have not changed steps 1-3 of their algorithm, the proofs of parts 1, 2, and 3 of their Theorem 3 are still valid. We shall show that part 4 remains true following the execution of the modified step 4. Since the proofs of parts 5, 6, and 7 rely only on part 4 of the theorem, these are valid for the modified algorithm. Hence, the modified algorithm is correct.

Define a horizontal slice to be an n^{3/4} x n submesh obtained by a natural tiling of an n x n mesh using tiles of size n^{3/4} x n (Figure 3.4(d)). Following step 3 of the sorting algorithm, the maximum difference d_3 between the number of zeroes in any two columns of a horizontal slice is at most 2 [SCHN86]. We need to show (i) that following the modified step 4, the maximum difference d_4' between the number of zeroes in any two columns (a column is an n x 1 vector with n elements) of the mesh is at most 2n^{1/4}, and (ii) that the maximum difference between the number of zeroes in any two vertical slices is at most n^{1/2}. The proof of (ii) is the same as that in [SCHN86] as our modification of step 4 does not move data between vertical slices. For (i), consider any two columns of n elements each. If these two columns reside in the same column of processors, then the maximum difference between the number of zeroes in the two columns is 1 as the two columns have been sorted into row major order regarding them as a single 2n element column. Suppose the two element columns reside in two different processor columns. These two processor columns contain 4 element columns A, B, C, D. Let a, b, c, d





-60-


respectively be the number of zeroes in these four columns prior to the execution of the

modified step 4. From the proof of Theorem 3, part 4 [SCHN86], we know that

Sx -y I < 2n 14 where x,y a,b,cd).

Let a',b',c',d' be the number of zeroes in column A,B,C,D following the modified

step 4. It is easy to see that a' = [(a+b)/|, b' = [(a+b)/t2, c' = r(c+d)/2, d' = L(c+d)/2J.

We need to show that I x -y | < 2 n1/4 where x,y e (a',b',c',d'}.

Without loss of generality (wlog), we may assume that I b'- c'l = max ({ x -y I I

x,y e (a',b',c',d')) (note that when x,y e a', b') orxy {c',d', I x-y I 1 ) and that

b' > c'. b'-c' [r(a+b)/21 [(c+d)1/2. Again, wlog we may assume that c 5 d.

So, b'-c' < (a+b)12] c .Since, lb-cl 2n1/4 and la-cl < 2n1'4

a + b < 2c + 4n1/4. So, b'-c' c + 2n1/4 c = 2n1/4.

Our modified Schnorr/Shamir algorithm for n x n meshes is stated in Figure 3.6. It

combines the folding and unfolding steps discussed in the introduction and the

Schnorr/Shamir modified algorithm for n x n/2 meshes described above.

Theorem 1: The sorting algorithm of Figure 3.6 is correct.

Proof. Follows directly from our earlier discussion that established the correctness of

steps 1 through 7 as a sorting algorithm for an n x n/2 mesh with two elements per pro-

cessor. □

For the complexity analysis, we note that steps 1, 3, 5, and 7 take O(n^{3/4}) time and

are dominated by the remaining steps which take O(n) time each. Since we are ignoring








Step 0: Fold the leftmost and rightmost n/4 columns so that the n^2 elements to be sorted
        are in the middle n x n/2 submesh with each processor in this submesh having
        two elements.

Step 1: Sort all n^{3/4} x n^{3/4} blocks of data into snake-like row-major order. Note that
        each block of data is in an n^{3/4} x (1/2)n^{3/4} submesh.

Step 2: Perform an n^{1/4}-way unshuffle along each row of n elements. Note that each
        row of n elements is in a row of the middle n x n/2 submesh.

Step 3: Repeat step 1.

Step 4: Sort the 2n elements in each column of the middle n x n/2 submesh into row-
        major order.

Step 5: Sort the elements in each n x (1/2)n^{3/4} submesh into snake-like row-major
        order.

Step 6: Sort all rows of the middle n x n/2 submesh into alternating left-to-right and
        right-to-left order.

Step 7: Perform 2n^{3/4} steps of even-odd transposition sort along the snake of the middle
        n x n/2 submesh.

Step 8: Unfold the middle n x n/2 submesh.


Figure 3.6 Sorting algorithm for n x n mesh.

low order terms, we need be concerned only with steps 0, 2, 4, 6, and 8. As noted earlier, steps 2, 4 and 6, respectively, take (n/2) t_r, n t_r + 2n t_c, and (n/2) t_r + n t_c on a bidirectional mesh. Steps 0 and 8 each take (n/4) t_r on a bidirectional mesh. So, the complexity of the sort algorithm on a bidirectional mesh is 2.5n t_r + 3n t_c. Since t_c is zero on the stronger mesh model, the sort time for this model is 2.5n t_r (note that the algorithm of








[SCHN86] has a run time of 3n t_r).

On a (nonstrict) unidirectional mesh, the times for steps 0, 2, 4, 6, and 8 are, respectively, (n/4) t_r, n t_r, 2n t_r + 2n t_c, n t_r + n t_c, and (n/4) t_r. The total time for this model is therefore 4.5n t_r + 3n t_c. On a strict unidirectional mesh, the times for steps 0, 2, 4, 6, and 8 are, respectively, (n/2) t_r, n t_r, 2n t_r + 2n t_c, n t_r + n t_c, and (n/2) t_r, and the total time is 5n t_r + 3n t_c.

Further Enhancements

In the stronger mesh model of Schnorr and Shamir a processor can read the entire memory of all its neighbors in unit time. This implies that the routing time is independent of message length. Let T_r denote the time needed to send a message of arbitrary length to a neighbor processor. We may generalize the sorting algorithm of Figure 3.6 to the case when each processor has k elements. Now, to sort a row or column of data, we use neighborhood sort [BAUD78], which is a generalization of even-odd transposition sort. Suppose that m processors have k elements each. The mk elements are sorted in m steps. In the even steps, the elements in each even processor are merged with those in the next odd processor. The even processor gets the smaller k and the odd one the larger k. In the odd steps, the k elements in each odd processor are merged with the k elements in the next even processor. The smaller k elements remain with the odd processor and the larger k with the even one. Note that when k = 1 neighborhood sort is identical to even-odd transposition sort.
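A sequential Python sketch of neighborhood sort as just described (processors are 1-indexed as in the text, so the odd steps merge processors (1,2), (3,4), ... and the even steps merge (2,3), (4,5), ...; my transcription, with each processor's k elements kept as a sorted list):

def neighborhood_sort(cells):
    # cells: list of m lists with k elements each (processors 1..m, left to right)
    m, k = len(cells), len(cells[0])
    for step in range(1, m + 1):
        first = 0 if step % 2 == 1 else 1   # 0-based index of the left partner
        for i in range(first, m - 1, 2):
            merged = sorted(cells[i] + cells[i + 1])
            cells[i], cells[i + 1] = merged[:k], merged[k:]   # smaller k left, larger k right
    return cells

assert neighborhood_sort([[7, 8], [1, 5], [2, 6], [3, 4]]) == [[1, 2], [3, 4], [5, 6], [7, 8]]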








Our generalization of Figure 3.6 is given in Figure 3.7. We require that k be a power of 2. As in the case of the Schnorr-Shamir algorithm, we assume n = 2^{4q}.


Step 0: Fold the leftmost and the rightmost n(k-1)/(2k) columns into the middle n/k
        columns so that the n^2 elements to be sorted are in the middle n x n/k submesh
        with each processor in this submesh having k elements. Sort the k elements in
        each processor of the middle n x n/k submesh.

Step 1: Sort all n^{3/4} x n^{3/4} blocks of data into snake-like row-major order. Note that
        each block of data is in an n^{3/4} x (n^{3/4}/k) submesh.

Step 2: Perform an n^{1/4}-way unshuffle along each row of n elements. Note that each
        row of n elements is in a row of the middle n x n/k submesh.

Step 3: Repeat step 1.

Step 4: Sort the kn elements in each column of the middle n x n/k submesh into row-
        major order. Use neighborhood sort.

Step 5: Sort the elements in each n x (n^{3/4}/k) submesh into snake-like row-major
        order.

Step 6: Sort all rows of the middle n x n/k submesh into alternating left-to-right and
        right-to-left order.

Step 7: Perform 2n^{3/4} steps of even-odd transposition sort along the snake of the middle
        n x n/k submesh.

Step 8: Unfold the middle n x n/k submesh.


Figure 3.7 Generalized sorting algorithm

Theorem 2: The generalized sorting algorithm is correct.


Proof. Similar to that of Theorem 1. □








For the complexity analysis, we may again ignore the odd-numbered steps as these run in O(n^{3/4}) time for any fixed k. The folding and unfolding of steps 0 and 8 each take n(k-1)/(2k) T_r on bidirectional and nonstrict unidirectional meshes and n(k-1)/k T_r on strict unidirectional meshes. The sorting of step 0 takes O(k log k) time. Step 2 takes (n/k) T_r on bidirectional meshes and (2n/k) T_r on unidirectional ones. Step 4 requires n steps of neighborhood sort. Each merge of two sets of k elements takes at most 2k t_c time (actually at most 2k-1 comparisons are needed). So, the step 4 time is n T_r + 2kn t_c for bidirectional meshes and 2n T_r + 2kn t_c for unidirectional meshes. The time for step 6 is (n/k) T_r + 2n t_c for bidirectional meshes and (2n/k) T_r + 2n t_c for unidirectional ones.

The total sorting time (ignoring the time to sort k elements in step 0) for a bidirectional mesh is therefore (2n + n/k) T_r + 2n(k+1) t_c. For the model of Schnorr and Shamir [SCHN86], t_c = 0 and the time becomes (2n + n/k) T_r. For large values of k, this approximates to 2n T_r. Since (2n-2) T_r is the distance lower bound for sorting on Schnorr and Shamir's model, the generalized sorting algorithm of Figure 3.7 is near optimal for large k. The sorting times on non-strict and strict unidirectional meshes are (3n + 3n/k) T_r + 2n(k+1) t_c and (4n + 2n/k) T_r + 2n(k+1) t_c, respectively. Since (4n-2) T_r is the distance lower bound for the strict unidirectional model, our algorithm is near optimal for large k for this model too.

The s^2-way merge sorting algorithm of Thompson and Kung [THOMK77] may be similarly generalized to sort n^2 elements stored k to a PE in an n x (n/k) mesh








configuration. The resulting sort has a complexity that is almost identical to that of Fig-

ure 3.7. However, Figure 3.7 is conceptually much simpler.

Practical Implications

Suppose we are to sort n^2 elements that are initially distributed one to a PE on an

n x n bidirectional mesh. The final configuration is also to have one element per PE. On

realistic computers, the time to transfer small packets of data between adjacent proces-

sors is dominated by the setup time. For instance, the time to transfer N bytes of data between adjacent processors of an NCube1 hypercube is (446.7 + 2.4 N) ms. Therefore,

it is reasonable to assume that the data transfer time is independent of packet size for

small k and small element size. Furthermore, t_R >> t_c on most commercial computers. For instance, t_R = 40 t_c on the NCube1. Table 1 gives the values of our complexity functions for the different bidirectional sort algorithms and different k. When t_R = 40 t_c we can expect to get the best performance using the algorithm of section 3 with k = 4. For t_R = 10 t_c, the algorithm of section 2 is the best of the three. The original Schnorr&Shamir algorithm will perform best only when t_R < 2 t_c.
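The entries of Table 1 (below) follow mechanically from the three complexity expressions; a short Python check, dividing through by n t_c so each printed entry is a multiple of n t_c:

# Reproduce Table 1 from the complexity formulas (times in units of n*t_c).
def schnorr_shamir(tr): return 3 * tr + 2
def section2(tr):       return 2.5 * tr + 3
def section3(tr, k):    return (2 + 1.0 / k) * tr + 2 * (k + 1)

for tr in (40, 10, 5, 2):
    row = [schnorr_shamir(tr), section2(tr)] + [section3(tr, k) for k in (3, 4, 5, 6)]
    print(f"tR = {tr:>2} tc:", "  ".join(f"{v:7.2f}n tc" for v in row))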

Simulation on RMBs

The sorting algorithms described here may be simulated by parallel computers in

the reconfigurable mesh with buses (RMB) family. We consider only two members of the

RMB family: RMESH and PARBUS.


In an RMESH [MILL88abc], we have a bus grid with an n x n arrangement of pro-






               Schnorr&Shamir     Section 2           Section 3
               3n t_r + 2n t_c    2.5n t_r + 3n t_c   (2n + n/k) t_r + 2n(k+1) t_c

                                                      k=3          k=4          k=5         k=6
t_R = 40 t_c   122n t_c           103n t_c            101.33n t_c  100n t_c     100n t_c    100.67n t_c
t_R = 10 t_c   32n t_c            28n t_c             31.33n t_c   32.5n t_c    34n t_c     35.67n t_c
t_R = 5 t_c    17n t_c            15.5n t_c           19.67n t_c   21.25n t_c   23n t_c     24.83n t_c
t_R = 2 t_c    8n t_c             8n t_c              12.67n t_c   14.5n t_c    16.4n t_c   18.33n t_c


Table 1 Comparison of bidirectional sort algorithms

cessors at the grid points (see Figure 3.8 for a 4 x 4 RMESH). Each grid segment has a

switch on it which enables one to break the bus, if desired, at that point. When all

switches are closed, all n^2 processors are connected by the grid bus. The switches around

a processor can be set by using local information. If all processors disconnect the switch

on their north, then we obtain row buses (Figure 3.9). Column buses are obtained by

having each processor disconnect the switch on its east (Figure 3.10). In the exclusive

write model two processors that are on the same bus cannot simultaneously write to that

bus. In the concurrent write model several processors may simultaneously write to the

same bus. Rules are provided to determine which of the several writers actually succeeds

(e.g., arbitrary, maximum, exclusive or, etc.). Notice that in the RMESH model it is not

possible to simultaneously have n disjoint row buses and n disjoint column buses that,

respectively, span the width and height of the RMESH. It is assumed that processors on

the same bus can communicate in O(1) time. RMESH algorithms for fundamental data








movement operations and image processing problems can be found in [MILL88abc,

MILL91ab, JENQ91abc].

An n x n PARBUS (Figure 3.11) [WANG90] is an n x n mesh in which the interprocessor links are bus segments and each processor has the ability to connect together arbitrary subsets of the four bus segments that connect to it. Bus segments that get so connected behave like a single bus. The bus segment interconnections at a processor are done by an internal four port switch. If the up to four bus segments at a processor are labeled N (North), E (East), W (West), and S (South), then this switch is able to realize any set, A = {A_1, A_2}, of connections where A_i ⊆ {N, E, W, S}, 1 ≤ i ≤ 2, and the A_i's are disjoint. For example, A = {{N,S}, {E,W}} results in connecting the North and South segments together and the East and West segments together. If this is done in each processor, then we get, simultaneously, disjoint row and column buses (Figures 3.12 and 3.13). If A = {{N,S,E,W}}, then all four bus segments are connected. PARBUS algorithms for a variety of applications can be found in [MILL91a, WANG90ab, LIN92, JANG92]. Observe that in an RMESH the realizable connections are of the form A = {A_1}, A_1 ⊆ {N, E, W, S}.
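The buses induced by a given set of switch settings are simply the connected components of the fused bus segments. The following Python sketch is my illustrative model of the PARBUS switch semantics just described (the function and helper names are hypothetical; 0-based processor coordinates):

def parbus_buses(n, conn):
    # conn[i][j] is the partition A at processor (i,j), e.g. [{'N','S'}, {'E','W'}].
    # Bus segments: ('H',i,j) joins (i,j)-(i,j+1); ('V',i,j) joins (i,j)-(i+1,j).
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    def union(a, b):
        parent[find(a)] = find(b)
    def seg(i, j, port):            # segment attached to a port, or None at the boundary
        if port == 'N': return ('V', i - 1, j) if i > 0 else None
        if port == 'S': return ('V', i, j) if i < n - 1 else None
        if port == 'W': return ('H', i, j - 1) if j > 0 else None
        if port == 'E': return ('H', i, j) if j < n - 1 else None
    for i in range(n):
        for j in range(n):
            for group in conn[i][j]:
                segs = [s for s in (seg(i, j, p) for p in group) if s is not None]
                for a, b in zip(segs, segs[1:]):
                    union(a, b)                 # ports in one group fuse their segments
    return find

# A = {{E,W},{N,S}} everywhere gives simultaneous disjoint row and column buses:
n = 4
conn = [[[{'E', 'W'}, {'N', 'S'}] for _ in range(n)] for _ in range(n)]
find = parbus_buses(n, conn)
assert find(('H', 0, 0)) == find(('H', 0, 2))   # one bus spans row 0
assert find(('H', 0, 0)) != find(('H', 1, 0))   # distinct from row 1's bus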

The RMESH requires two steps to simulate a unit route along a row or column of a unidirectional mesh. The PARBUS can simulate such a route in one step using the bus configurations of Figures 3.12 and 3.13. Additionally, the folding and unfolding into/from n/k columns can be done in (k-1) t_r time by having each group of k adjacent columns fold into the leftmost column in the group. Now, the n/k columns that contain the data





are not adjacent. However, this does not result in any inefficiency when simulating steps 1-7 as row buses to connect the columns are easily established. With this modification to the algorithms of Figures 3.6 and 3.7, the sort time for an n x n RMESH becomes






(8n + 2) t_r + 3n t_c and (4n + 8n/k + 2k - 2) T_r + 2n(k+1) t_c, and for an n x n PARBUS becomes (4n + 2) t_r + 3n t_c and (2n + 4n/k + 2k - 2) T_r + 2n(k+1) t_c.


Figure 3.8 A 4 x 4 RMESH (□ : processor; O : switch; — : link)

Figure 3.9 Row buses

Figure 3.10 Column buses

Figure 3.11 A 4 x 4 PARBUS

Figure 3.12 Row buses in a PARBUS

Figure 3.13 Column buses in a PARBUS

Conclusion

We have shown that the lower bound for the number of routes needed to sort on an n x n bidirectional mesh that was established in [SCHN86] is incorrect. Furthermore, we have provided algorithms that sort using fewer routes than the lower bound of [SCHN86]. In fact, the algorithm of section 3 is able to sort using a number of routes that is very close to the distance lower bound for bidirectional as well as strict unidirectional meshes.

The fold/sort/unfold algorithms developed here have practical implications for sorting on a mesh. The advantages of this technique when applied to computers with t_R >> 2 t_c were pointed out in section 4. We also showed how to simulate the mesh algorithms on reconfigurable meshes with buses.














CHAPTER 4

COMPUTATIONAL GEOMETRY ON N x N RECONFIGURABLE MESHES
WITH BUSES

Introduction

In this chapter, we consider computational geometry problems on N x N reconfigurable meshes with buses and give O(1) time algorithms for determining the convex hull of a set of N planar points, the smallest enclosing rectangle of N planar points, the ECDF search problem for N planar points, and the triangulation of N planar points.

Computational geometry problems have been widely studied on various parallel

computer architectures such as Meshes, PRAM, Hypercubes, Cube-Connected Cycles,

and Pyramid Computers [AGGA88], [CHOW80], [JEONL90, JEONC92],

[MILLS84ab,88e,89], [LEE84], [MACK90ab], [STOJ86], [LUV85, LU86], [SHAM78],

[WANGT87]. Most computational geometry problems have a lower bound equal to the

diameter of the underlying interconnection network. RMB algorithms for computational

geometry problems have been explored in [MILLP87b, JANG92e, REIS92]. In

[MILLP87b] RMESH algorithms for several geometric problems on digitized images

were developed. These include closest figure, extreme points of every figure, diameter,

smallest enclosing box, and smallest enclosing circle.








An O(1) time convex hull algorithm for a set of N planar points is given in [REIS92]. The algorithm uses an N x N configuration and is based on the grid convex hull algorithm of [MILL88e]. Unfortunately, the algorithm of [REIS92] is flawed. In section 2, we give a corrected constant time RMESH convex hull algorithm. This may also be run on an N x N PARBUS and MRN. Jang and Prasanna [JANG92e] develop constant time PARBUS algorithms for the all pairs nearest neighbor problem on a set of N planar points, 2-set dominance counting for N planar points, and the 3-dimensional maxima problem. All these algorithms are for the N x N PARBUS configuration. Their algorithm for the nearest neighbor problem is easily run on an N x N RMESH and MRN in constant time. Jang and Prasanna [JANG92e] state that their 2-set dominance counting algorithm can be simulated by an RMESH using the switch simulation of [JANG92d]. However, the simulation of [JANG92d] requires 16 RMESH processors for each PARBUS processor simulated. Hence a 4N x 4N RMESH is needed for the simulation. In section 4, we consider the ECDF search problem, which is closely related to the 2-set dominance counting problem, and develop a constant time algorithm that requires only an N x N RMESH.

The third geometry problem considered in [JANG92e] is the 3-dimensional maxima

problem. In this, we are given a set S of three tuples (points in three dimensional space)

and are required to find all points p ∈ S such that there is no q ∈ S with the property that each coordinate of q is greater than the corresponding coordinate of p (i.e., no q in S

dominates p). The algorithm suggested in [JANG92e] is a modification of their 2-set








dominance counting algorithm and is unnecessarily complicated. We observe that the

problem is trivially solved in constant time using the algorithm of Figure 4.1. The input

N points are in row 1 of the N x N PARBUS/RMESH/MRN. The algorithm of Figure

4.1 can be used to solve, in constant time, the k-dimensional maxima problem for any k.


Step 1: Broadcast the N points on row 1 to all rows using column buses.

Step 2: (Use the i'th row to determine if the i'th point of S is a non dominated point)
        Processor [i,i] broadcasts its point to all processors in its row using a row bus,
        1 ≤ i ≤ N.
        Processor [i,j], 1 ≤ i,j ≤ N, compares the coordinates of the point it received
        in step 1 to those of the point it received in step 2. If each coordinate of the
        step 1 point is larger than the corresponding coordinate of its step 2 point, then
        PE[i,j] sets its T variable to 0. Otherwise, T is set to 1.
        Using row bus splitting, PE[i,1] determines if there is a 0 in row i, 1 ≤ i ≤ N.
        If so, it sets its U variable to 0 (the i'th point is dominated). Otherwise, U is set
        to 1 (the i'th point is not dominated).

Step 3: (Route the results back to row 1)
        Using row buses PE[i,1] sends its U value to PE[i,i]. Using column buses,
        PE[i,i] sends the U value just received to PE[1,i], 1 ≤ i ≤ N.


Figure 4.1 Simple constant time algorithm for 3-dimensional maxima
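A sequential rendering of the test that row i of Figure 4.1 performs (point i survives iff no other point dominates it); the same code works verbatim for the k-dimensional maxima problem:

def maxima(points):
    # q dominates p iff every coordinate of q exceeds the corresponding coordinate of p
    def dominates(q, p):
        return all(qc > pc for qc, pc in zip(q, p))
    return [p for i, p in enumerate(points)
            if not any(dominates(q, p) for j, q in enumerate(points) if j != i)]

assert maxima([(1, 1, 1), (2, 2, 2), (3, 1, 2)]) == [(2, 2, 2), (3, 1, 2)]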

The remaining computational geometry problems we consider in this chapter are the

smallest enclosing rectangle problem (section 3) and the problem of triangulating a set of

N planar points (section 5). While these problems have not been considered before for

the RMB architecture, parallel algorithms for other architectures have been developed.

For example, Miller and Stout [MILL89] show how to find the smallest enclosing rectan-

gle of a set of N planar points in O(N^{1/2}) time on an N x N mesh, and Wang and Tsin








[WANGT87] develop an O(logN) time CREW PRAM algorithm to triangulate a set of N

planar points. Miller and Stout [MILL89] also develop optimal mesh algorithms for the

convex hull of N planar points and for several other computational geometry problems.

Mesh algorithms for some other computational geometry problems are considered in [JEON90, LU85, LU86].


Convex Hull

In this section, we consider the problem of determining the Convex Hull of a set of

N planar points on N x N Reconfigurable Meshes with Buses (RMBs). While the

RMESH algorithms we develop can be run without modification on PARBUSs and

MRNs, improved performance results from the use of algorithms for subtasks such as

ranking and sorting that are tailored to the PARBUS/MRN architecture.

Let S be a set of N distinct points in a plane. Let E(S) be the set of extreme points in

S and let CH(S) denote an ordering of E such that CH(S) defines a convex polygon. This

convex polygon is the convex hull of S. The extreme points problem is that of finding

E(S) while in the convex hull problem, we are to find the ordered set CH(S). Our algo-

rithms to find E(S) and CH(S) make use of the following known results:

Lemma 1: [Theorem 3.8, [PREP85], p. 104] The line segment l defined by two points is an edge of the convex hull iff all other points lie on l or to one side of it. □

Lemma 2: [[PREP85], p. 105] Let p_1 and p_2, respectively, be the lexicographically lowest and the highest points of S. Let CH(S) be (p_1, l_1, l_2, ..., l_k, p_2, u_1, ..., u_j). The ordering is a counter clockwise ordering of the extreme points and is such that the interior of the defined convex polygon is on the left as one traverses the extreme points in this order. The points l_1, ..., l_k, p_2 have the property that the polar angle made by each of these points, the positive x-axis, and the preceding point of CH(S) is least. The points u_1, u_2, ..., u_j have the property that the polar angle made by each point, the negative x-axis, and the preceding point of CH(S) is least (see Figure 4.2). □













Figure 4.2 Convex hull of a set of points

N^3 Processor Algorithm

Before developing the O(1) N x N processor algorithm, we develop a simple O(1) N^2 x N processor algorithm. An O(1) N^2 x N processor algorithm for E(S) or CH(S) is easily obtained by using either Lemma 1 or 2. We elaborate on the algorithm based on Lemma 1. The strategy employed by this algorithm is given in Figure 4.3. It assumes that the N points of S are initially in the top row of the N^2 x N RMESH/PARBUS/MRN. The i'th row, 1 ≤ i ≤ N^2, of the RMESH/PARBUS/MRN is used to determine if points




j and k, where i = (j-1)N + k, of S form an edge of the convex hull. In case they do, then j and k are in E(S) and j immediately precedes or immediately follows k in CH(S).

The correctness of Figure 4.3 is a direct consequence of Lemma 1 and the fact [PREP85, p. 100] that an ordering of the points of E by polar angle about an internal point yields the convex hull. The complexity of the algorithm is readily seen to be O(1). Note that if only E(S) is to be computed then steps 9-11 may be omitted.


Step 1: Use column buses to broadcast the points of S from row 1 to all other rows.

Step 2: In row i = (j-1)N + k, 1 ≤ i ≤ N^2, points j and k are broadcast to all
        processors using a row bus.

Step 3: Using the points j and k received in step 2, each processor computes a, b, and c
        such that ax + by + c = 0 defines the straight line through points j and k.

Step 4: Let (u, v) be the coordinates of the point of S assigned to processor [e,f] in step
        1. Processor [e,f] sets its flag to 1 if au + bv + c > 0, -1 if
        au + bv + c < 0, 0 if au + bv + c = 0 and (u,v) is on the line segment
        between points j and k of step 3 (this includes the cases when (u,v) is point j or
        k), and 2 otherwise.

Step 5: Using row buses and row bus splitting, PE[i,1], i = (j-1)N + k,
        1 ≤ i ≤ N^2, determines if there is a flag = 2 on row i. If so, (j,k) is not an
        edge of the convex hull. If not, it determines if there is a 1 and later if there is a
        -1. If both a 1 and a -1 are present, (j,k) is not an edge of the convex hull.
        Otherwise it is.


Figure 4.3 O(1) N^2 x N processor algorithm for E(S) and CH(S)
(continued on next page)








Step 6: If PE[i,1], i = (j-1)N + k, 1 ≤ i ≤ N^2, j > k, detects that (j,k) is an edge of
        the convex hull, then using a row bus it broadcasts a 1 to PEs [i,j] and [i,k].

Step 7: PEs that receive a 1 in step 6 broadcast this to the PEs in row 1 of the same
        column (note: several PEs in a column may receive a 1 in step 6; using column
        bus splitting, all but one of these may be eliminated to avoid concurrent writes
        to a column bus).

Step 8: Row 1 PEs now mark the points of S they contain as being in or out of E(S)
        (note: a point is in E(S) if and only if it receives a 1 value in step 7).

Step 9: Using row bus splitting in row 1, any three extreme points are accumulated in
        PE[1,1]. In case |E(S)| = 2, the remainder of this step is omitted. The
        centroid of these three points is computed. Since no three points of E can be
        collinear, the centroid is an interior point of the convex hull.

Step 10: The centroid computed in step 9 is broadcast to all points of E(S) using a row
        bus. Each of these points computes the polar angle made* by the point using
        the centroid as the origin.

Step 11: The points of E(S) are sorted by polar angle using the O(1)
        RMESH/PARBUS/MRN sort algorithm of [NIGA92].

* An alternative to using the polar angle is discussed on p. 100 of [PREP85].


Figure 4.3 O(1) N^2 x N processor algorithm for E(S) and CH(S)
(continued from previous page)
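The per-pair test of steps 3-5 is easy to state sequentially; the following Python sketch applies Lemma 1 to every pair (j,k) exactly as Figure 4.3 does one pair per row (an O(N^3)-work illustration of mine, not the constant-time bus implementation):

from itertools import combinations

def hull_edges(pts):
    # Lemma 1: segment (j,k) is a hull edge iff every other point lies on the
    # segment or strictly on one side of the line ax + by + c = 0 through j and k.
    edges = []
    for j, k in combinations(range(len(pts)), 2):
        (x1, y1), (x2, y2) = pts[j], pts[k]
        a, b, c = y2 - y1, x1 - x2, x2 * y1 - x1 * y2    # line through the two points (step 3)
        signs = set()
        ok = True
        for (u, v) in pts:
            s = a * u + b * v + c                        # step 4 flag computation
            if s != 0:
                signs.add(s > 0)
            elif not (min(x1, x2) <= u <= max(x1, x2) and min(y1, y2) <= v <= max(y1, y2)):
                ok = False                               # collinear but off the segment: flag 2
                break
        if ok and len(signs) < 2:                        # step 5: not both a 1 and a -1
            edges.append((j, k))
    return edges

assert sorted(hull_edges([(0, 0), (2, 0), (0, 2), (1, 1)])) == [(0, 1), (0, 2), (1, 2)]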

N^2 Processor Algorithm

The convex hull of S can be computed in O(1) time on an N x N RMESH using the algorithm of Figure 4.4. The algorithm for an N x N PARBUS/MRN is similar. Step R0 can be done using the O(1) sorting algorithm for N data items on an N x N RMESH [NIGA92]. Step R1 is done using the O(1) N^3 processor algorithm of the preceding section on each N x N^{1/2} sub-RMESH. For R2, we use the i'th N x N^{1/2} sub-RMESH to determine which points of E_i are also in E. This is done using the algorithm of Figure 4.5.



R0: Sort S by x-coordinate and within x by y-coordinate.

R1: Partition S into N^{1/2} subsets, S_i, 1 ≤ i ≤ N^{1/2}, using the ordering just obtained.
    Each subset is of size N^{1/2}. Determine the convex hull CH_i and extreme points
    E_i of S_i using an N x N^{1/2} sub-RMESH for each i, 1 ≤ i ≤ N^{1/2}.

R2: Combine the extreme points E_i, 1 ≤ i ≤ N^{1/2}, to obtain the extreme points E
    of S.

R3: Obtain the convex hull of S from E.


Figure 4.4 N x N RMESH algorithm for convex hull

Theorem 1: The extreme points algorithm of Figure 4.5 is correct.

Proof: To establish the correctness of Figure 4.5, we need to show that exactly those points of ∪_{k=1}^{N^{1/2}} E_k that are not in E(S) are eliminated in steps T5 and T6. It is easy to see that no point eliminated in T5 or T6 can be in E as in T5 the eliminated points are not in the convex hull of E_i ∪ E_j and in T6 they are not in the convex hull of ∪ E_k. To see that the points not eliminated are all in E(S), suppose that there is a non eliminated point p ∈ E_i that is not in E(S). Let Q be the set of non eliminated points. Clearly, E(S) = extreme points of Q, and p is in the interior of the convex hull of Q (see Figure 4.8), CH(Q). Without loss of generality, assume that |E(S)| > 2. Since p is in the interior of










T1: Let R_{ji} denote the (j,i)'th N^{1/2} x N^{1/2} sub-RMESH of the entire RMESH.
    Move E_i and E_j into R_{ji} such that each row of R_{ji} has E_i (one point to a
    processor) and each column has E_j (one point to a processor). This can be
    accomplished by broadcasting row 1 along column buses to all rows and then
    broadcasting the diagonal (of the N x N RMESH) processor data along row
    buses to all columns.

T2: Each PE in the k'th column of R_{ji} computes the slope of the line that joins the
    k'th point of E_i and the point of E_j it contains. These slopes form a tritonic
    sequence (they may decrease, increase, decrease or may increase, decrease,
    increase) as the points of E_j are in convex hull order.

T3: The minimum and maximum of the slopes in each column of R_{ji} are found
    using bus splitting. These are stored in row 1 of R_{ji}.

T4: The maximum of the at most N^{1/2} minimums found in step T3 is determined
    by forming the cross product of the minimums in the N processors of R_{ji}. The
    minimum of the maximums of step T3 is similarly found.


Figure 4.5 Substeps for step R2 (continued on next page)


T5: The minimum and maximum of step T4 define the two tangents of E_i and E_j.
    These are used to eliminate points of E_i that are now known to not be in E (see
    Figure 4.6).

T6: The tangents of T5 define up to 4 tangent points. Over all R_{ji}, 1 ≤ j ≤ N^{1/2},
    there are 3N^{1/2} - 2 non eliminated tangent points plus non eliminated points
    of S_i. The extreme points of this set of points is computed. Tangent points of
    E_i that are not in the set of extreme points just computed are eliminated (see
    Figure 4.7).

T7: The points of ∪ E_i not eliminated in steps T5 and T6 define E.


Figure 4.5 Substeps for step R2 (continued from previous page)






Figure 4.6 Examples of extreme points eliminated in step T5 (tangents shown as broken lines; marked points = tangent points of E_i and E_j; X = points of E_i eliminated)



Figure 4.7 Example of extreme point elimination in step T6 (i = 2; marked points = tangent points; X = extreme points of E_2 eliminated in step T5; + = extreme points of E_2 eliminated in step T6)

CH(Q), there must be a q ∈ Q such that q ∉ S_i and the line from p to q does not go through the interior of CH(S_i) and does not touch the boundary of CH(S_i) except at p. Without loss of generality, assume that q ∈ S_j and q is to the right of p or vertically above p. If p lies between the two tangents from q to S_i, then p lies between the tangents of S_i and S_j and so is eliminated in step T5 (see Figure 4.9). Since p ∈ Q, p is not eliminated in T5 and hence must be on a tangent from q to S_i (Figure 4.10). Furthermore, if p is not a tangent point with respect to S_i and S_j, then p lies between the two tangents of S_i and S_j and is eliminated in T5. So, we may assume p is one of the 3N^{1/2} - 2 points considered in step T6.



Figure 4.8 Example for correctness proof

First suppose that p is a point of the leftmost partition S_1. Since p is in the interior of CH(Q), part of the boundary of CH(Q) passes above p and part below p. Let a be the leftmost point in E, b the first point in CH(Q) that the lower boundary reaches outside of S_1, and c the first point in CH(Q) that the upper boundary reaches outside of S_1 (see Figure 4.11). If b = c, then p is not on the tangent from c to CH(S_1) and so must have been eliminated in step T5. So, b ≠ c. Let d and e, respectively, be the points in S_1 to which c and b are connected in CH(S_1). Note that it is possible that d = a or e = a (or







....... = tangents from q to Si
--- = tangents between Si and Sj


Figure 4.9 Tangents


Figure 4.10 p is on tangent from q

both). If c ∈ Sj, then d and c must be tangent points with respect to S1 and Sj (as
otherwise, some points of S1 and/or Sj lie above the upper boundary of CH(Q)). Similarly, e
and b are tangent points with respect to S1 and Sk where b ∈ Sk. It is clear that p is not
an extreme point of {a, b, c, d, e, p}. Since {a, b, c, d, e, p} is a subset of the set used in








step T6, p must have been eliminated in this step.


Figure 4.11 p ∈ S1

Next, suppose that p is a point of S_{N^{1/2}}. The proof for this case is similar to that for
p ∈ S1. Finally, suppose p ∈ Si, i ∉ {1, N^{1/2}}. Let a, b, c, d be the first points of the
portions of the boundary of CH(Q) that are above and below p that are not in Si (Figure
4.12). Note that it is possible that a = b and c = d. Also, note that a, b, c, and d must
exist as |Sj| = N^{1/2} for each j (i.e., as no Sj is empty).

Suppose that a ∈ Sj. There must be a tangent point with respect to Sj and Si that is
in the triangle defined by a, a', and a'' (see Figure 4.12). This is so as p is a tangent point







with respect to Sj and the other end of the tangent must lie in the said triangle. Let this
point be A. Similarly, there are tangent points B, C, D in the corresponding triangles
associated with b, c, and d. From the triangles in which A, B, C, D are located, we see
that p is contained in the convex hull of {A, B, C, D, p} and so must be eliminated in step
T6. Note that when a = b and c = d, there must be a tangent point, e, of Ei that is on
the upper and lower part of the boundary of CH(Si). Now, p is not an extreme point of {A,
B, C, D, e, p} (we need point e as it is possible that A = B and C = D). □




Figure 4.12 p ∈ Si, i ∉ {1, N^{1/2}}

One may verify that each of the steps T1 through T7 can be done in O(1) time. For
step T6, a modified version of the N^3 processor algorithm of the previous section is run
in each N x N^{1/2} sub-RMESH. Since the number of edges to consider is less than
(3N^{1/2})^2 = 9N, the algorithm is run at most nine times. In each run of the algorithm an
edge needs to compare with 3N^{1/2} - 2 points using N^{1/2} processors. This is done in








three passes.

For step R3, the points of E may be sorted by polar angle about an interior point as

in steps 10 and 11 of Figure 2.
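Sequentially, this angular sort is straightforward. A minimal sketch, assuming the extreme points form a convex polygon so that their centroid is an interior point:

    import math

    def hull_order(extreme_points):
        # Sort the extreme points by polar angle about an interior point.
        # The centroid of the vertices of a convex polygon is interior.
        cx = sum(x for x, _ in extreme_points) / len(extreme_points)
        cy = sum(y for _, y in extreme_points) / len(extreme_points)
        return sorted(extreme_points,
                      key=lambda p: math.atan2(p[1] - cy, p[0] - cx))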


Smallest Enclosing Rectangle

It is well known [FREE75] that the smallest enclosing rectangle of a set of N planar

points has at least one side that is an extension of an edge of the convex hull. Hence, the

smallest enclosing rectangle may be found by first computing the convex hull; then determining,
for each convex hull edge, the smallest enclosing rectangle that has one side
which is an extension of this edge; and finally determining the smallest of these rectangles.
The algorithm is given in Figure 4.13. Its correctness is immediate and one readily
sees that its complexity is O(1) on an N x N RMESH, PARBUS, and MRN.
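The per-edge computation of Figure 4.13 has a direct sequential analogue, useful as a reference. In this sketch (function and variable names are our own), h is the largest distance of a hull point from the edge's supporting line, and the projection spread plays the role of dmax - dmin:

    import math

    def min_enclosing_rect_area(hull):
        # hull: convex hull vertices in order.
        best = math.inf
        p = len(hull)
        for i in range(p):
            sx, sy = hull[i]
            tx, ty = hull[(i + 1) % p]
            ex, ey = tx - sx, ty - sy
            elen = math.hypot(ex, ey)
            # Height: farthest hull point from the edge's supporting line.
            h = max(abs((x - sx) * ey - (y - sy) * ex) for x, y in hull) / elen
            # Width: extent of projections onto the edge direction
            # (plays the role of dmax - dmin in Figure 4.13).
            proj = [((x - sx) * ex + (y - sy) * ey) / elen for x, y in hull]
            best = min(best, h * (max(proj) - min(proj)))
        return best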


ECDF Search On An N x N RMESH

Let S be the set of N distinct planar points. Point p ∈ S dominates point q ∈ S if and
only if each coordinate of p is greater than the corresponding coordinate of q. In the
ECDF search problem [STOJ86], we are to compute for each p ∈ S the number, D(p,S),
of points of S that are dominated by p. Stojmenovic [STOJ86] provides an N (N = |S|)
processor, O(log^2 N) time hypercube algorithm for this problem.

On an N x N RMB, D(p,S) for each point p ∈ S may be computed using a strategy
similar to that used in [JANG92e] to solve the 2-set dominance counting problem on








Step 1: Find the convex hull of the N points.

Step 2: [Construct convex hull edges]
(a) The convex hull points are broadcast to all rows using column buses. These
are saved in variable R of each processor.

(b) PE[i,i] broadcasts its R value using a row bus to the S variable of all
processors on row i, 1 <= i <= p, p = number of convex hull points.

(c) PE[i,i+1] broadcasts its R value using a row bus to the T variable of all
processors on row i, 1 <= i <= p-1. For i = p, PE[p,1] does this. (Note: now
each PE in row i contains the same edge of the convex hull in its S and T
variables.)

Step 3: [Determine area of minimum rectangle for each edge]
(a) Using its R, S, and T values, each PE computes the perpendicular distance D
between point R and the straight line defined by the points S and T. Since the
R's are in convex hull order, the D values in a row form a tritonic sequence
whose maximum, h, can be found in O(1) time using row bus splitting; h is the
minimum height of any enclosing rectangle whose side lies along the row's edge.


Figure 4.13 Minimum enclosing rectangle (continued on next page)

a PARBUS in O(1) time. A high-level description of the strategy is given in Figure 4.14.
For simplicity in presentation, we assume that no two points of S have the same x- or the
same y-coordinate.
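For reference, the definition yields an immediate quadratic-time sequential computation; the point of the RMB algorithm is to obtain the same values in O(1) time. A minimal sketch:

    def ecdf_counts(S):
        # D(p, S): number of points of S dominated by p, i.e., strictly
        # smaller in both coordinates (points are assumed distinct).
        return [sum(1 for q in S if q[0] < p[0] and q[1] < p[1]) for p in S]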

Step 1 can be done by sorting the N points in O(1) time by x-coordinate [NIGA92].
The point in PE[1,j] belongs to partition Xu where u = ⌊(j-1)/N^{1/2}⌋ + 1. Step 2 is similarly
accomplished by another sort (this time by y-coordinate). Following this, each point
knows which X and which Y partition it belongs to. The computation of DX(p) is done
using an N x N^{1/2} submesh for the points in each X partition. The i'th such submesh is








Step 3: (b) Let P be the perpendicular through the middle of the edge defined by points
S and T. Each processor computes the perpendicular distance of its point R
from the infinite line P (use negative distances for points on one side of P and
positive distances for points on the other side). These distances form a
quadratonic sequence and its maximum, dmax, and minimum, dmin, can be
found in O(1) time using row bus splitting.

(c) The minimum area, A, of the rectangle for row i is h*(dmax - dmin). Let
this be stored in the A value of PE[i,1].

Step 4: [Determine overall minimum rectangle]
Compute the minimum of the A's of PEs [i,1], 1 <= i <= p. This is done by
forming the cross product of the A's in a p x p subarray and comparing the
two A's in each PE.


Figure 4.13 Minimum enclosing rectangle (continued from previous page)


Step 1: Partition S into N^{1/2} sets Xi, 1 <= i <= N^{1/2}, such that |Xi| = N^{1/2}
and no point in Xi has a larger x-coordinate than any of the
points in Xi+1, 1 <= i < N^{1/2}.

Step 2: Partition S into N^{1/2} sets Yj, 1 <= j <= N^{1/2}, such that |Yj| = N^{1/2}
and no point in Yj has a larger y-coordinate than any of the
points in Yj+1, 1 <= j < N^{1/2}.

Step 3: Let Sij = Xi ∩ Yj (see Figure 4.15). For each p ∈ Sij, it is the case that
D(p,S) = (# of points dominated by p in Yj - Sij) + (# of points dominated
by p in Xi) + Σ_{u<i} Σ_{v<j} |Suv| = DY(p) + DX(p) + Mij.
Compute D(p,S) using this equation.


Figure 4.14 Strategy to compute D(p,S)

used for Xi. For this, each p ∈ Xi uses an N^{1/2} x N^{1/2} partition of the N x N^{1/2} submesh.
In processor k of row 1 of this N^{1/2} x N^{1/2} partition, a variable T is set to 1 iff p








Figure 4.15 Partitioning of the points of S (x-partitions X1, X2, ..., X_{N^{1/2}};
y-partitions Y1, Y2, ..., Y_{N^{1/2}}; Sij = Xi ∩ Yj)

dominates the k'th point of Xi. T is set to zero otherwise. The T's in each N^{1/2} x N^{1/2}
partition may be summed in O(1) time using the RMESH ranking algorithm of
[JENQ91a]. The result is DX(p). The computation of DX(p) is best done after Step 1 of
Figure 4.14 as at this time the points of S are in the needed order. DY(p) may be computed
in a similar way following Step 2. However, when setting the value of T, we note
that T is to be set to 1 iff p dominates the k'th point of Yj and that point is not in
the same X partition as p.
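The decomposition of Figure 4.14 can be checked sequentially. The sketch below (partitioning helpers and names are ours; N is assumed to be a perfect square) computes DX(p), DY(p), and Mij from their definitions and combines them; the result equals the brute-force count of the earlier sketch:

    import math

    def ecdf_by_decomposition(S):
        n = len(S)
        r = math.isqrt(n)                    # N^{1/2}; assumes n is a square
        xp = {p: k // r + 1 for k, p in enumerate(sorted(S))}
        yp = {p: k // r + 1
              for k, p in enumerate(sorted(S, key=lambda t: t[1]))}
        out = []
        for p in S:
            i, j = xp[p], yp[p]
            dx = sum(1 for q in S if xp[q] == i
                     and q[0] < p[0] and q[1] < p[1])           # DX(p)
            dy = sum(1 for q in S if yp[q] == j and xp[q] != i
                     and q[0] < p[0] and q[1] < p[1])           # DY(p)
            mij = sum(1 for q in S if xp[q] < i and yp[q] < j)  # Mij
            out.append(dy + dx + mij)                           # D(p, S)
        return out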

Let Mij = Σ_{u<i} Σ_{v<j} |Suv|. To compute Mij, we first sort the points of S by their Y
partition index and, within each Y partition, by the X partition index. Following
this, each processor in row 1 determines if the X partition number of the point in
the processor to its left is less than that of its own point. If so, then using its own column
index and the Y partition number of its point, it can compute Rij = Σ_{u<i} |Suj|, where i and
j are, respectively, the X and Y partition numbers of its point (Rij = q - (j-1)N^{1/2}








where q is the processor's column index). Finally, Mij = Σ_{v<j} Riv. Before describing
how to do this final summation, we provide, in Figure 4.16, the steps in the overall ECDF
algorithm.



Step 1: Each point determines its X partition. This is done by sorting the points by x-
coordinate.

Step 2: Compute DX(p) for each point. This is done using an N^{1/2} x N^{1/2} block for
each p ∈ S.

Step 3: Each point determines its Y partition. For this, sort all points by y-coordinate.

Step 4: Compute DY(p) as in step 2.

Step 5: Sort S by Y partition and within Y partition by X partition. Each p ∈ S
determines its Rij value where i and j are such that p ∈ Xi and p ∈ Yj.

Step 6: Each point determines Mij = Σ_{v<j} Riv.

Step 7: Each point computes D(p,S) = DY(p) + DX(p) + Mij(p).


Figure 4.16 RMESH algorithm to compute D(p,S)
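At the definition level, Rij and Mij (steps 5 and 6 of Figure 4.16) can be written directly. A minimal sketch with our own helper names, where xp and yp map each point to its X and Y partition numbers:

    def rij(S, xp, yp, i, j):
        # Rij = sum over u < i of |Suj|: points in Y partition j whose
        # X partition index is less than i.
        return sum(1 for q in S if yp[q] == j and xp[q] < i)

    def mij(S, xp, yp, i, j):
        # Mij = sum over v < j of Riv (step 6 of Figure 4.16).
        return sum(rij(S, xp, yp, i, v) for v in range(1, j))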

Now, let us consider the computation of Mij = Σ_{v<j} Riv. For each i, the N^{1/2} Mij
values (1 <= j <= N^{1/2}) are computed using the i'th N x N^{1/2} submesh. Initially, we have
Riv stored in the R variable of the PEs in the v'th N^{1/2} x N^{1/2} submesh of the i'th
N x N^{1/2} submesh (Figure 4.17(a)). The steps required to compute Mij are given in Figure
4.18. Its correctness is easily verified and its complexity is O(1).












Figure 4.17 (a) The i'th N x N^{1/2} submesh, with Ri1, Ri2, ... stored in its
N^{1/2} x N^{1/2} submeshes; (b) an N^{3/4} x N^{1/2} submesh



Step 1: [Decompose R's using the radix N^{1/4}]
PE[1,1] of each of the N^{1/2} x N^{1/2} submeshes computes a and b such that
R = a*N^{1/4} + b and 0 <= b < N^{1/4}. Since |Yv| = N^{1/2}, each Riv <= N^{1/2}.
Hence, 0 <= a <= N^{1/4}.


Step 2: [Prefix sum the a's]
(a) The i'th N x N^{1/2} submesh is partitioned into N^{1/4} N^{3/4} x N^{1/2}
submeshes. In each of these, the N^{1/4} a values contained are routed to the row
1 processors such that the k'th group of N^{1/4} such processors contains the k'th a
value (Figure 4.17(b)).

(b) In each N^{3/4} x N^{1/2} submesh, the unary representation of the a's is
computed. For this, the r'th processor in each group of N^{1/4} sets its D value to 1 if r is less
than or equal to its a value. Otherwise, it sets D to zero.


Figure 4.18 Computation of Mij (continued on next page)




Step 2: (c) The ones in row 1 of the N^{3/4} x N^{1/2} submesh are prefix summed using
the RMESH ranking algorithm of [JENQ91a]. Let the overall sum for each
group of N^{1/4} a's be stored in variable A of the [1,1] PE of the N^{3/4} x N^{1/4}
submesh.

(d) [Prefix sum the A's]
We now have N^{1/4} A values to be prefix summed. Each is in the range [0,
N^{1/2}]. The A's are decomposed using radix N^{1/4} such that A = c*N^{1/4} + d
where 0 <= d < N^{1/4}. The N^{1/4} c's may be prefix summed using an
N^{1/2} x N^{1/2} submesh by routing them to row 1 of any N^{1/2} x N^{1/2} submesh,
computing their unary representation (as in (b) above), and ranking (as in (c)
above). The d's are prefix summed similarly. By adding N^{1/4} times each prefix
sum of the c's to the corresponding prefix sum of the d's, the prefix sums of the
A's are obtained.

(e) [Obtain prefix sum of a's] The prefix sum of the a's is obtained by adding
together the prefix sum of the a's in each N^{1/4} group (as computed in (c)) and
the appropriate prefix sum of A (as computed in (d)).

Step 3: [Prefix sum the b's] This is similar to Step 2.

Step 4: [Obtain Mij] Mij is obtained from the a and b prefix sums computed in steps 2
and 3 for the j'th (from the bottom) N^{1/2} x N^{1/2} submesh of the i'th
N x N^{1/2} submesh: Mij = N^{1/4} * (prefix sum of the a's) + (prefix sum of the b's).


Figure 4.18 Computation of Mij (continued from previous page)
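Functionally, Figure 4.18 performs a radix-N^{1/4} prefix sum: each value is split into two small digits, each digit stream is prefix summed, and the two digit sums are recombined with the radix weight. A sequential emulation (the running-sum loop stands in for the unary-representation ranking of [JENQ91a]; names are ours):

    def prefix_sums_by_radix(values, radix):
        # Split each value v = a*radix + b into two digits (step 1).
        a = [v // radix for v in values]
        b = [v % radix for v in values]
        def rank(digits):
            # Prefix sums of small digits; on the RMESH each digit is
            # written in unary and the 1-bits are ranked.
            out, run = [], 0
            for d in digits:
                run += d
                out.append(run)
            return out
        pa, pb = rank(a), rank(b)            # steps 2 and 3
        # Step 4: recombine digit prefix sums with the radix weight.
        return [radix * x + y for x, y in zip(pa, pb)]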








Triangulation

In this section, we develop an O(1) time algorithm to triangulate a set S of N planar
points. The algorithm is for an N x N RMESH, an N x N PARBUS, and an N x N
MRN. For simplicity, we assume that the N points have distinct x-coordinates and that no
four are cocircular. Our algorithm is easily modified to work when these assumptions do
not hold. The overall strategy is given in Figure 4.19. We begin, in step 1, by partitioning
the N points into N^{2/3} sets, Si, each of size N^{1/3}. For this, we assume N^{1/3} is an integer.
The partitioning is done by x-coordinate. Each Si contains points that are to the left of
those in Si+1, 1 <= i < N^{2/3}. This partitioning is possible because of our assumption
that the x-coordinates are distinct. To accomplish the partitioning, the N points are sorted
by x-coordinate using the O(1) sorting algorithm of [NIGA92].
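Step 1 amounts to cutting the x-sorted order into equal groups. A minimal sequential sketch, assuming N^{1/3} is an integer and the x-coordinates are distinct:

    def partition_by_x(points):
        n = len(points)
        g = round(n ** (1 / 3))          # group size N^{1/3}
        ordered = sorted(points)         # tuples sort by x-coordinate first
        # N^{2/3} groups; all points of a group lie left of the next group.
        return [ordered[k:k + g] for k in range(0, n, g)]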


Step 1: Partition S into N^{2/3} sets, Si, each containing N^{1/3} points. The partitioning is
done such that all points in Si are to the left of those in Si+1, 1 <= i < N^{2/3}.

Step 2: Using an N x N^{1/3} submesh for each Si, compute the Voronoi diagram for Si,
1 <= i <= N^{2/3}.

Step 3: Compute the straight-line dual of each of the N^{2/3} Voronoi diagrams of step 2.

Step 4: Partition the region not in the union of the convex hulls of the Si's into basic
polygons and triangulate these.


Figure 4.19 Triangulating N planar points

In step 2, the Voronoi diagram of each Si is computed. For this, each point p ∈ Si
computes its Voronoi polygon, which consists of all points in the plane that are closer to p