







ALGORITHMS FOR THE OTIS OPTOELECTRONIC COMPUTER


By


CHIH-FANG WANG













A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY


UNIVERSITY OF FLORIDA


1998












ACKNOWLEDGMENTS


I wish to express my whole-hearted appreciation and gratitude to my advisor Professor Sartaj Sahni for giving countless hours of guidance in my research work. Without his support and patience, this research would not have been done.

I would also like to thank the other members of my supervisory committee, Dr. Sanguthevar Rajasekaran, Dr. Tim Davis, Dr. Yann-Hang Lee, and Dr. Haniph Latchman, for their interest and comments.

Many thanks to my friends here in the department. Without your constant companionship and encouragement, I would not have made it this far.

Last, but not least, my greatest appreciation goes to my parents and my brother Max. Although I am half a globe away, I am always surrounded by their love and tender care. To them I dedicate this work.

This research was supported, in part, by the Army Research Office under grant DAA H04-95-1-0111.















TABLE OF CONTENTS


ACKNOWLEDGMENTS

LIST OF FIGURES

LIST OF TABLES

ABSTRACT

1 INTRODUCTION

1.1 Optical Transpose Interconnection System (OTIS)
1.2 OTIS Parallel Computers
1.3 Permutation Routing on OTIS Computers
1.4 This Dissertation

2 PROPERTIES OF OTIS-MESH

2.1 Diameter of the OTIS-Mesh
2.2 Simulation of a 4D Mesh
2.3 Simulation of a 2D Mesh

3 DATA REARRANGEMENT ON AN OTIS-MESH

3.1 Transpose
3.2 Perfect Shuffle
3.3 Unshuffle
3.4 Bit Reversal
3.5 Vector Reversal
3.6 Bit Shuffle
3.6.1 G_yP_x Swap
3.6.2 Bit Shuffle
3.7 Shuffled Row-Major
3.8 BPC Permutations
3.8.1 Definition
3.8.2 Algorithm
3.9 Comparison

4 BASIC OPERATIONS ON AN OTIS-MESH

4.1 Data Broadcast
4.2 Window Broadcast
4.3 Prefix Sum
4.4 Data Sum
4.5 Rank
4.6 Shift











4.7 Data Accumulation
4.8 Consecutive Sum
4.9 Adjacent Sum
4.10 Concentrate
4.11 Distribute
4.12 Generalize
4.13 Sorting
4.14 Random Access Read (RAR)
4.15 Random Access Write (RAW)
4.16 Summary

5 MATRIX MULTIPLICATIONS ON AN OTIS-MESH

5.1 Mapping Matrices Onto An OTIS-Mesh
5.2 Multiplication Algorithm
5.2.1 Column Vector × Row Vector
5.2.2 Row Vector × Column Vector
5.2.3 Row Vector × Matrix
5.2.4 Matrix × Column Vector
5.2.5 Matrix × Matrix
5.3 Summary

6 IMAGE PROCESSING ON AN OTIS-MESH

6.1 Histogramming
6.1.1 Background
6.1.2 Algorithm for 0 < B ≤ √N
6.1.3 Algorithms for √N < B ≤ N
6.1.4 Algorithm for B > N
6.2 Histogram Modification
6.3 Shrinking and Expanding
6.3.1 Background
6.3.2 GRM Mapping
6.3.3 GSM Mapping
6.4 Hough Transform
6.4.1 Background
6.4.2 An Improved Algorithm For N × N Meshes
6.4.3 GRM Mapping
6.4.4 GSM Mapping
6.5 Summary

7 OTIS-HYPERCUBE

7.1 OTIS-Hypercube Diameter
7.2 Simulation of an N² hypercube
7.3 Common Data Rearrangements
7.3.1 Transpose [p/2 - 1, ..., 0, p - 1, ..., p/2]
7.3.2 Perfect Shuffle [0, p - 1, p - 2, ..., 1]
7.3.3 Unshuffle [p - 2, p - 3, ..., 0, p - 1]













7.3.4 Bit Reversal [0, 1, ..., p - 1]
7.3.5 Vector Reversal [-(p - 1), -(p - 2), ..., -0]
7.3.6 Bit Shuffle [p - 1, p - 3, ..., 1, p - 2, p - 4, ..., 0]
7.3.7 Shuffled Row-major [p - 1, p/2 - 1, p - 2, p/2 - 2, ..., p/2, 0]
7.4 BPC Permutations
7.5 Comparison

8 CONCLUSION

8.1 Outline of the Results
8.2 Open Problems

REFERENCES

BIOGRAPHICAL SKETCH















LIST OF FIGURES


1.1 2-dimensional arrangement of L = 64 inputs when M = 4 and N = 16: (a) √M × √M = 2 × 2 grouping of inputs; (b) the (i, *) group, 0 ≤ i < M = 4
1.2 Side view of the OTIS with M = 4 and N = 16
1.3 Example of OTIS connections with 16 processors
1.4 Multistage interconnection network (MIN) defined by OTIS
2.1 16 Processor OTIS-Mesh
2.2 Mapping a 4 × 4 mesh onto a 16 processor OTIS-Mesh: (a) GRM; (b) GSM
4.1 Data Configuration: (a) Initial; (b) Concentrated
4.2 Row-Column Transformation of Leighton's Column Sort
4.3 Example of Leighton's Column Sort
5.1 GRM Column × Row Multiplication
5.2 GSM Column × Row Multiply Algorithm
5.3 GRM Row × Column Multiply
5.4 GSM Row × Column Multiply
5.5 Data Paths Used in Step 1 of Figure 5.4
5.6 GSM Row × Column Multiply for k > 1
5.7 GRM Row Vector × Matrix Multiply
5.8 GSM Row Vector × Matrix Multiply
5.9 GRM Matrix × Column Vector Multiply
5.10 GSM Matrix × Column Vector Multiply
5.11 O(N) Memory GRM Matrix × Matrix Multiply
5.12 O(N) Memory GSM Matrix × Matrix Multiply













5.13 Cannon's Matrix Multiplication Algorithm
5.14 Moving A Values as per Step 1 of Cannon's Algorithm
5.15 GSM Matrix × Matrix Multiply
6.1 Data required in GRM mapping for end processor: (a) q_f even; (b) q_f odd; (c) q_s even; (d) q_s odd
6.2 Data required in GRM mapping for middle processor: (a) q_f even; (b) q_f odd; (c) q_s even; (d) q_s odd
6.3 Data required in GSM mapping
6.4 Data required in GSM mapping when q_f = 0 and q_s ≠ 0
6.5 Coordinate system used in Hough Transform
7.1 16 processor OTIS-Hypercube















LIST OF TABLES


3.1 Optimal moves for 4D mesh and respective OTIS-Mesh simulations
3.2 Source and destination of the BPC permutation [-0, 1, 2, -3] in a 16 processor OTIS-Mesh
3.3 Permutations and their permutation vectors
3.4 Complexity Comparison of Common Data Rearrangement
4.1 Processors with data to concentrate
4.2 Net change in G_x, G_y, P_x, and P_y
4.3 Comparison of complexities on SIMD model
4.4 Comparison of complexities on MIMD model
5.1 Comparison between GRM and GSM schemes
7.1 Optimal moves for N² = 2^{2d} processor hypercube and respective OTIS-Hypercube simulations
7.2 Illustration of the perfect shuffle algorithm on a 16 processor OTIS-Hypercube
7.3 Complexity Comparison of Common Data Rearrangement















Abstract of Dissertation
Presented to the Graduate School of the University of Florida
in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy


ALGORITHMS FOR THE OTIS OPTOELECTRONIC COMPUTER

By

Chih-fang Wang

August 1998



Chairman: Dr. Sartaj Sahni
Major Department: Computer and Information Science and Engineering


It is well known that optical interconnects are more effective than electronic interconnects when the interconnection distance becomes larger than a few millimeters: they provide more bandwidth and speed, with less power consumption. The OTIS optoelectronic computer provides the best of both worlds by using free-space optical interconnects to connect distant processors and electronic interconnects for processors that are close. The optical transpose interconnection system (OTIS) provides a fixed and easy to realize optical topology; the topology of the electronic interconnect is flexible. By using different electronic topologies, we arrive at different classes of OTIS computers. For example, the OTIS-Mesh is the class of OTIS computers in which the electronic interconnect follows the mesh paradigm, and the OTIS-Hypercube is the class in which the hypercube topology is used to realize the electronic interconnect.












In this dissertation we will describe the OTIS architecture as well as some of its properties. Algorithms for some frequently used permutations, BPC permutations, fundamental operations, and some applications will be presented for the OTIS-Mesh computer. Properties of OTIS-Hypercube will also be discussed, along with algorithms for commonly used data rearrangements and BPC permutations.














CHAPTER 1
INTRODUCTION

It is well known that when communication distances exceed a few millimeters, optical interconnects provide speed (bandwidth) and power advantages over electronic interconnects [7, 23]. Therefore, in the construction of very large multiprocessor computers it is prudent to interconnect physically close processors using electronic interconnects and to use optical interconnects for pairs of processors that are distant. We shall assume that physically close processors are in the same physical package (chip, wafer, board) and processors that are not physically close are in different packages. As a result, electronic interconnects are used for intrapackage communication while optical interconnects are used for interpackage communication.

Various combinations of interconnection networks for intrapackage (i.e., electronic) and interpackage (i.e., optical) communication have been proposed. In OTIS computers [12, 33, 58], optical interconnects are realized via a free space optical interconnect system known as the optical transpose interconnection system (OTIS).

In this chapter, we begin by describing the OTIS. Next, we describe the OTIS-Mesh and OTIS-Hypercube parallel computers that result, respectively, when the OTIS optical interconnect system is used for interpackage communication and a mesh or hypercube is used for intrapackage communication. Following that, we show that the OTIS computer can be used as a multistage interconnection network (MIN). Finally, we provide a brief description of the remaining chapters.










Figure 1.1. 2-dimensional arrangement of L = 64 inputs when M = 4 and N = 16: (a) √M × √M = 2 × 2 grouping of inputs; (b) the (i, *) group, 0 ≤ i < M = 4

1.1 Optical Transpose Interconnection System (OTIS)

The optical transpose interconnection system (OTIS) was proposed by Marsden et al. [33]. The OTIS connects L = MN inputs to L outputs using free space optics and two arrays of lenslets. The first lenslet array is a √M × √M array and the second one is of dimension √N × √N. Thus, a total of M + N lenslets are used. The L inputs and outputs are arranged to form a √L × √L array. The L inputs are arranged into √M × √M groups with each group containing N inputs arranged into a √N × √N array. Figure 1.1 shows the arrangement of the L = 64 inputs when M = 4 and N = 16. The M × N inputs are indexed (i, j) with 0 ≤ i < M and 0 ≤ j < N. Inputs with the same i value are in the same √N × √N block. The notation (1, *), for example, refers to all inputs of the form (1, j).

In addition to using the two-dimensional notation (i, j) to refer to an input, we also use a four-dimensional notation (i_r, i_c, j_r, j_c), where (i_r, i_c) gives the coordinates (row, column) of the √N × √N block that contains the input (see Figure 1.1(a)) and (j_r, j_c) gives the coordinates of the element within a block (see Figure 1.1(b)). So all




elements (i, *) with i = 0 have (i_r, i_c) = (0, 0); those with i = 1 have (i_r, i_c) = (0, 1); those with i = 2 have (i_r, i_c) = (1, 0); and those with i = 3 have (i_r, i_c) = (1, 1). Similarly, all inputs with j = 3 have (j_r, j_c) = (0, 3), and those with j = 12 have (j_r, j_c) = (3, 0).

The L outputs are also arranged into a √L × √L array. This time, however, the √L × √L array is composed of √N × √N blocks with each block containing M outputs that are arranged as a √M × √M array. The L = MN outputs are indexed (i, j) with 0 ≤ i < N, 0 ≤ j < M. All outputs of the form (i, *) are in the same block, block i. Block i is in position (i_r, i_c), with i = i_r√N + i_c, of the √N × √N block arrangement. Outputs of the form (*, j) are in position (j_r, j_c) of their block, j = j_r√M + j_c.

In the physical realization of OTIS, the √L × √L output arrangement is rotated 180°. We have 4 two-dimensional planes: the first is the √L × √L input plane; the second is a √M × √M lenslet plane; the third is a √N × √N lenslet plane; and the fourth is the √L × √L plane of outputs rotated 180°. When the OTIS is viewed from the side, only the first column of each of these planes is visible. Such a side view for the case L = M × N = 4 × 16 is shown in Figure 1.2. Notice that the first column of the input plane consists of the inputs (0,0), (0,4), (0,8), (0,12), (2,0), (2,4), (2,8), (2,12), which in 4D notation are (0,0,0,0), (0,0,1,0), (0,0,2,0), (0,0,3,0), (1,0,0,0), (1,0,1,0), (1,0,2,0), (1,0,3,0). The inputs in the same row as (0,0,0,0) are (0,*,0,*); those in the same row as (i_r, i_c, j_r, j_c) are (i_r, *, j_r, *). The (i_r, j_r) values top to bottom are (0,0), (0,1), (0,2), (0,3), (1,0), (1,1), (1,2), (1,3). The first column in the output plane (after the 180° rotation) has the outputs (15,3), (15,1), (11,3), (11,1), (7,3), (7,1), (3,3), (3,1), which in 4D notation are (3,3,1,1), (3,3,0,1), (2,3,1,1), (2,3,0,1), (1,3,1,1), (1,3,0,1), (0,3,1,1), (0,3,0,1). The outputs in the same row as (3,3,1,1) are (3, *, 1, *); those in the same row as (i_r, i_c, j_r, j_c) are (i_r, *, j_r, *). The (i_r, j_r) values top to bottom are (3,1), (3,0), (2,1), (2,0), (1,1), (1,0), (0,1), (0,0).

Figure 1.2. Side view of the OTIS with M = 4 and N = 16

Each lens of Figure 1.2 denotes a row of lenslets and each circle denotes a row of inputs or outputs. The interconnection pattern defined by the given arrangement of inputs, outputs, and lenslets connects input (i, j) = (i_r, i_c, j_r, j_c) to output (j, i) = (j_r, j_c, i_r, i_c). The connection is established via an optical ray that originates at input position (i_r, i_c, j_r, j_c), goes through lenslet (i_r, i_c) of the first lenslet array, then through lenslet (j_r, j_c) of the second lenslet array, and finally arrives at output position (j_r, j_c, i_r, i_c).

The basic connectivity provided by the OTIS is an optical connection between input (i, j) and output (j, i), 0 ≤ i < M, 0 ≤ j < N.
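To make the indexing concrete, here is a small Python sketch (ours, not part of the dissertation; the helper names are invented) of the 2D-to-4D input conversion and of the OTIS connection itself:

```python
import math

M, N = 4, 16                                   # L = M * N = 64 inputs and outputs
sm, sn = math.isqrt(M), math.isqrt(N)          # 2 and 4

def input_4d(i, j):
    """Input (i, j) -> (i_r, i_c, j_r, j_c): its block, then its spot in the block."""
    return i // sm, i % sm, j // sn, j % sn

def otis(i, j):
    """The optical connection: input (i, j) is imaged onto output (j, i)."""
    return j, i

assert input_4d(1, 12) == (0, 1, 3, 0)         # matches the j = 12 example above
assert otis(1, 12) == (12, 1)
```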









1.2 OTIS Parallel Computers

Marsden et al. [33] have proposed several parallel computer architectures in which OTIS is used to connect processors in different groups (packages) and an electronic interconnection network is used to connect processors in the same group. Since Krishnamoorthy et al. [23] have shown that bandwidth is maximized and power consumption minimized when an L = N² processor OTIS computer is partitioned into N groups of N processors each, Zane et al. [58] limit the study of OTIS parallel computers to those in which each processor group (package) has N processors and the parallel computer has a total of N groups (packages). Let (i, j) denote processor j of package i, 0 ≤ i < N, 0 ≤ j < N. Processor (i, j), i ≠ j, is connected to processor (j, i) using free space optics (i.e., OTIS). The only other connections available in an OTIS computer are the electronic intragroup connections.

A generic 16 processor OTIS computer is shown in Figure 1.3. The solid boxes denote processors. Each processor is labeled (g, p) where g is the group index and p is the processor index. OTIS connections are shown by arrows. Intragroup connections are not shown.

In an OTIS-Mesh, processors in the same group are connected as a 2-D mesh [33, 58, 46] (Chapter 2), and in an OTIS-Hypercube (Chapter 7), processors in the same group are connected using the hypercube topology [33, 58, 48]. OTIS-Mesh of trees [33], OTIS-Perfect shuffle, OTIS-Cube connected cycles, etc. may be defined in an analogous manner.

When analyzing algorithms for OTIS architectures, we count data moves along electronic interconnects (i.e., electronic moves) and those along optical interconnects (i.e., OTIS moves) separately. This allows us to later account for any differences in the speed and bandwidth of these two types of interconnect.










Figure 1.3. Example of OTIS connections with 16 processors

1.3 Permutation Routing on OTIS Computers

Suppose we wish to rearrange the data in an N² processor OTIS computer according to the permutation Π = Π[0] ... Π[N² - 1]. That is, data from processor i = gN + p is to be sent to processor Π[i], 0 ≤ i < N². We assume that the interconnection network in each group is able to sort the data in its N processors (equivalently, it is able to perform any permutation of the data in its N processors). This assumption is certainly valid for the mesh, hypercube, perfect shuffle, cube-connected cycles, and mesh of trees interconnections mentioned earlier.


Theorem 1.3.1 Every OTIS computer in which each group can sort can perform any permutation Π using at most 2 OTIS moves.









Figure 1.4. Multistage interconnection network (MIN) defined by OTIS

Proof When 2 OTIS moves are permitted, the data movement can be modeled by the three stage MIN of Figure 1.4. Each switch in this MIN represents a processor group, which is capable of performing any N input to N output permutation. The OTIS moves are represented by the connections from one stage to the next.

The OTIS interstage connections are equivalent to the interstage connections in a standard MIN that uses N × N switches. From MIN theory [22], we know that when k × k switches are used, 2 log_k N² - 1 stages of switches are sufficient to make an N² input N² output network that can realize every input to output permutation. In our case (Figure 1.4), k = N. Therefore, 2 log_N N² - 1 = 3 stages are sufficient. Hence 2 OTIS moves suffice to realize any permutation.

An alternative proof comes from an equivalence with the preemptive open shop scheduling problem (POSP) [9]. In the POSP we are given n jobs that are to be








scheduled on m machines. Each job i has m tasks. The task length of the jth task of job i is the integer t_ij ≥ 0. In a preemptive schedule of length T, the time interval from 0 to T is divided into slices of length 1 unit each. A time slice is divided into m slots with each slot representing a unit time interval on one machine. Time slots on each machine are labeled with a job index. The labeling is done in such a way that

(a) each job (index) i is assigned to exactly t_ij slots on machine j, 0 ≤ j < m, and

(b) no job is assigned to two or more machines in any time slice. T is the schedule length. The objective is to find the smallest T for which a schedule exists. Gonzalez and Sahni [9] have shown that the length T_min of an optimal schedule is T_min = max{J_max, M_max},

where J_max = max_i {Σ_{j=0}^{m-1} t_ij} (i.e., J_max is the maximum job length) and M_max = max_j {Σ_{i=0}^{n-1} t_ij} (i.e., M_max is the maximum processing to be done by any machine).

We can transform the OTIS computer permutation routing problem into a POSP. First, note that to realize a permutation Π with 2 OTIS moves, we must be able to write Π as a sequence of permutations Π_0 T Π_1 T Π_2, where Π_i is the permutation realized by the switches (i.e., processor groups) in stage i and T denotes the OTIS (transpose) interstage permutation. Let (g_q, p_q) denote processor p_q of group g_q, where q ∈ {i0, o0, i1, o1, i2, o2} (i0 = input of stage 0, o0 = output of stage 0, etc.). Then, the data path is (g_i0, p_i0) →Π_0 (g_o0, p_o0) →T (p_o0, g_o0) = (g_i1, p_i1) →Π_1 (g_o1, p_o1) →T (p_o1, g_o1) = (g_i2, p_i2) →Π_2 (g_o2, p_o2). We observe that to realize the permutation Π, the following must hold:

(i) Switch i of stage 1 should receive exactly one data item from each switch of stage 0, 0 ≤ i < N.









(ii) Switch i of stage 1 should receive exactly one data item destined for each switch of stage 2, 0 ≤ i < N.

Once we know which data items will get to switch i, 0 ≤ i < N, we can easily compute Π_0, Π_1, and Π_2. Therefore, it is sufficient to demonstrate the existence of an assignment of the N² stage 0 inputs to the switches in stage 1 satisfying conditions

(i) and (ii). For this, we construct an N job N machine POSP instance. Job i represents switch i of stage 0 and machine j represents switch j of stage 2. The task time t_ij equals the number of inputs to switch i of stage 0 that are destined for switch j of stage 2 (i.e., t_ij is the number of group i data that are destined for group j). Since Π is a permutation, it follows that Σ_{j=0}^{N-1} t_ij = total number of inputs to switch i of stage 0 = N and Σ_{i=0}^{N-1} t_ij = total number of inputs destined for switch j of stage 2 = N. Therefore, J_max = M_max = N and the optimal schedule length is N. Since Σ_{i=0}^{N-1} Σ_{j=0}^{N-1} t_ij = N² and the optimal schedule length is N, every slot of every machine is assigned a task in an optimal schedule. From the property of a schedule, it follows that in each time slice all N job labels occur exactly once. The N labels in slice i of the schedule define the inputs that are to be assigned to switch i of stage 1, 0 ≤ i < N. From properties (a) and (b) of a schedule, it follows that this assignment satisfies the requirements (i) and (ii) for an assignment to the stage 1 switches. □
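The assignment whose existence the proof establishes can also be computed outright. The Python sketch below (ours; it assumes nothing beyond the setup above) builds the matrix t for a random permutation and peels off N perfect matchings with Kuhn's augmenting path method; the matching found in round i gives the inputs assigned to switch i of stage 1. Because row and column sums of t stay equal throughout, Hall's condition guarantees every round succeeds.

```python
import random

def perfect_matching(t, N):
    """Match each job (stage 0 switch) to a distinct machine (stage 2 switch)
    along positive entries of t, using Kuhn's augmenting paths."""
    match = [-1] * N                      # match[j] = job currently holding machine j
    def augment(i, seen):
        for j in range(N):
            if t[i][j] > 0 and j not in seen:
                seen.add(j)
                if match[j] == -1 or augment(match[j], seen):
                    match[j] = i
                    return True
        return False
    for i in range(N):
        assert augment(i, set())
    return match

N = 4
perm = random.sample(range(N * N), N * N)     # destination of datum i is perm[i]
t = [[0] * N for _ in range(N)]               # t[i][j] = group i data bound for group j
for src, dst in enumerate(perm):
    t[src // N][dst // N] += 1

for _ in range(N):                            # one matching per stage 1 switch
    for j, i in enumerate(perfect_matching(t, N)):
        t[i][j] -= 1                          # job i uses machine j in this slice
assert all(v == 0 for row in t for v in row)
```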

Even though every permutation Π can be realized with just 2 OTIS moves, it takes many more OTIS moves to compute the decomposition Π = Π_0 T Π_1 T Π_2. Therefore, simulating the 3 stage MIN of Figure 1.4 does not result in an efficient algorithm to perform permutation routing. Consequently, we have developed customized algorithms for specific as well as generalized BPC permutations (Chapters 3 and 7). General permutations may be realized using the sorting algorithm (Section 4.13), which uses O(√N) OTIS moves.









1.4 This Dissertation

In this dissertation, properties of the OTIS-Mesh and the OTIS-Hypercube are studied. The dissertation is organized as follows:


" Chapter 2 deals with some fundamental properties such as diameter and embedding schemes of OTIS-Mesh.

" OTIS-Mesh algorithms for frequently used permutations are presented in Chapter 3, along with the algorithm for general BPC permutations.

" Algorithms for basic operations-broadcast, prefix sum, rank, sort, and so onare developed in Chapter 4.

" Chapter 5 demonstrates how matrix multiplications are performed.

" Chapter 6 presents algorithms for some well known image processing applications.

" Properties of the OTIS-hypercube are studied in Chapter 7, as well as algorithms for commonly used permutations and general BPC permutations.

" And finally, Chapter 8 summarizes the whole dissertation, and gives directions

for research.












CHAPTER 2
PROPERTIES OF OTIS-MESH

In an N² processor OTIS-Mesh, each group is a √N × √N mesh and there are a total of N groups. Figure 2.1 shows a 16 processor OTIS-Mesh. The processors of groups 0 and 2 are labeled using two dimensional local mesh coordinates while the processors in groups 1 and 3 are labeled in row-major fashion. We use the notation (g, p) to refer to processor p of group g.

In this chapter, we first show that the diameter of the OTIS-Mesh is 4√N - 3. Then, we demonstrate how the OTIS-Mesh can simulate a 4D mesh, as well as a 2D mesh.
2.1 Diameter of the OTIS-Mesh

Let (g1, p1) and (g2, p2) be two OTIS-Mesh processors. The shortest path between these two processors takes one of the following forms:

(a) The path involves only electronic moves. This is possible only when g1 = g2.

(b) The path involves an even number of optical moves. In this case the path is of the form (g1, p1) →E* (g1, p1') →O (p1', g1) →E* (p1', g1') →O (g1', p1') →E* ... →E* (g2, p2). Here E* denotes a sequence (possibly empty) of electronic moves and O denotes a single OTIS move. If the number of OTIS moves is more than two, we may compress paths of this form into the shorter path (g1, p1) →E* (g1, p2) →O (p2, g1) →E* (p2, g2) →O (g2, p2). So we may assume that the path is of the above form with exactly two OTIS moves.

(c) The path involves an odd number of OTIS moves. In this case, it must involve exactly one OTIS move (as otherwise it may be compressed into a shorter path with just one OTIS move as in (b)) and may be assumed to be of the form (g1, p1) →E* (g1, g2) →O (g2, g1) →E* (g2, p2).

Figure 2.1. 16 Processor OTIS-Mesh

Let d(i, j) be the shortest distance between processors i and j of a group using a path comprised solely of electronic moves. So, d(i, j) is the Manhattan distance between the two processors of the local mesh group. Shortest paths of type (a) have length d(p1, p2), while those of types (b) and (c) have length d(p1, p2) + d(g1, g2) + 2 and d(p1, g2) + d(p2, g1) + 1, respectively.

From the preceding discussion we have the following theorems:

Theorem 2.1.1 The length of the shortest path between processors (g1, p1) and (g2, p2) is d(p1, p2) when g1 = g2 and min{d(p1, p2) + d(g1, g2) + 2, d(p1, g2) + d(p2, g1) + 1} when g1 ≠ g2.








Proof When g1 = g2, there are three possibilities for the shortest path: it may be of type (a), (b), or (c). If it is of type (a), its length is d(p1, p2). If it is of type (b), its length is d(p1, p2) + d(g1, g2) + 2 = d(p1, p2) + 2. If it is of type (c), its length is d(p1, g2) + d(p2, g1) + 1 = d(p1, g1) + d(p2, g1) + 1 = d(p1, g1) + d(g1, p2) + 1 ≥ d(p1, p2) + 1. So, the shortest path has length d(p1, p2). When g1 ≠ g2, the shortest path is either of type (b) or of type (c). From our earlier development it follows that its length is min{d(p1, p2) + d(g1, g2) + 2, d(p1, g2) + d(p2, g1) + 1}. □

Theorem 2.1.2 The diameter of the OTIS-Mesh is 4√N - 3.

Proof Since each group is a √N × √N mesh, d(p1, p2), d(p2, g1), d(p1, g2), and d(g1, g2) are all less than or equal to 2(√N - 1). From Theorem 2.1.1, it follows that no two processors are more than 4(√N - 1) + 1 = 4√N - 3 apart. Hence, the diameter is at most 4√N - 3. Now consider the processors (g1, p1) and (g2, p2) such that p1 is in position (0, 0) of its group and p2 is in position (√N - 1, √N - 1) (i.e., p1 is the top left processor and p2 the bottom right one of its group). Let g1 be 0 and g2 be N - 1. So, d(p1, p2) = d(g1, g2) = d(p1, g2) = d(p2, g1) = 2(√N - 1). Hence, the distance between (g1, p1) and (g2, p2) is 4√N - 3. As a result, the diameter of the OTIS-Mesh is exactly 4√N - 3. □
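The diameter claim is easy to verify exhaustively for a small machine. The sketch below (ours, not part of the text) builds the 256 processor OTIS-Mesh graph for N = 16, where each group is a 4 × 4 mesh and (g, p) has an OTIS link to (p, g), and checks by breadth-first search that the largest eccentricity is 4√N - 3 = 13:

```python
from collections import deque

n, N = 4, 16                                   # n = sqrt(N); groups are n x n meshes

def neighbors(g, p):
    r, c = divmod(p, n)
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        if 0 <= r + dr < n and 0 <= c + dc < n:
            yield g, (r + dr) * n + (c + dc)   # electronic move within the group
    yield p, g                                 # OTIS move

def eccentricity(src):
    dist, q = {src: 0}, deque([src])
    while q:
        u = q.popleft()
        for v in neighbors(*u):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return max(dist.values())

assert max(eccentricity((g, p)) for g in range(N) for p in range(N)) == 4 * n - 3
```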
2.2 Simulation of a 4D Mesh

Zane et al. [58] have shown that the OTIS-Mesh can simulate each move of a √N × √N × √N × √N four-dimensional mesh by using either a single electronic move local to a group or one local electronic move and two intergroup OTIS moves. For the simulation, we must first embed the 4D mesh into the OTIS-Mesh. The embedding is rather straightforward, with processor (i, j, k, l) of the 4D mesh being identified with processor (g, p) of the OTIS-Mesh. Here, g = i√N + j and p = k√N + l.









The mesh moves (i, j, k ± 1, l) and (i, j, k, l ± 1) can be performed with one electronic move of the OTIS-Mesh, while the moves (i, j ± 1, k, l) and (i ± 1, j, k, l) require one electronic and two optical moves. For example, the move (i, j+1, k, l) may be done by the sequence (i, j, k, l) →O (k, l, i, j) →E (k, l, i, j+1) →O (i, j+1, k, l).
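A small Python sketch (ours; the function names are invented) of the embedding and of the three move simulation of a (i, j+1, k, l) step:

```python
n = 4                                    # sqrt(N): each index i, j, k, l is in [0, n)

def embed(i, j, k, l):
    return i * n + j, k * n + l          # 4D mesh processor -> (group, processor)

def otis(g, p):
    return p, g                          # the optical transpose move

def move_j_plus_1(i, j, k, l):
    g, p = embed(i, j, k, l)
    g, p = otis(g, p)                    # OTIS: (i*n+j, k*n+l) -> (k*n+l, i*n+j)
    p += 1                               # electronic: local index i*n+j -> i*n+(j+1)
    return otis(g, p)                    # OTIS: back to (i*n+(j+1), k*n+l)

assert move_j_plus_1(1, 2, 3, 0) == embed(1, 3, 3, 0)
```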

The above efficient embedding of a 4D mesh implies that 4D mesh algorithms can be run on the OTIS-Mesh with a constant factor (at most 3) slowdown [58]. Unfortunately, the body of known 4D mesh algorithms is very small compared to that of 2D mesh algorithms. So, it is desirable to consider a 2D mesh embedding. Such an embedding will enable one to run 2D mesh algorithms on the OTIS-Mesh. Naturally, one would do this only for problems for which no 4D algorithm is known or for which the known 4D mesh algorithms are not faster than the 2D algorithms.
2.3 Simulation of a 2D Mesh

There are at least two intuitively appealing ways to embed an N × N mesh into the OTIS-Mesh. One is the group row mapping (GRM), in which each group of the OTIS-Mesh represents a row of the 2D mesh. The mapping of the mesh row onto a group of OTIS processors is done in a snake-like fashion as in Figure 2.2(a). The pair of numbers in each processor of Figure 2.2(a) gives the (row, column) index of the mapped 2D mesh processor. The thick edges show the electronic connections used to obtain the 2D mesh row. Notice that the assignment of rows to groups is also done in a snake-like manner. Let (i, j) denote a processor of a 2D mesh. The move to (i, j+1) (or (i, j-1)) can be done with one electronic move as (i, j) and (i, j+1) are neighbors in a processor group. If all elements of row i are to be moved over one column, then the OTIS-Mesh would need one electronic move in case of a MIMD mesh and 3 in case of a SIMD mesh, as the row move would involve a shift by one left, right, and down within a group. A column shift can be done with 2 additional OTIS moves as in the case of a 4D mesh embedding. GRM is particularly nice for the matrix transpose operation. Data from processor (i, j) can be moved to processor (j, i) with one OTIS and zero electronic moves.

Figure 2.2. Mapping a 4 × 4 mesh onto a 16 processor OTIS-Mesh: (a) GRM; (b) GSM

The second way to embed an N × N mesh is to use the group submesh mapping (GSM). In this, the N × N mesh is partitioned into N √N × √N submeshes.

Each of these is mapped in the natural way onto a group of OTIS-Mesh processors.

Figure 2.2(b) shows GSM of a 4 × 4 mesh. Moving all elements of row or column i over by one is now considerably more expensive. For example, a row shift by +1 would be accomplished by the following data movements (a boundary processor is one on the right boundary of a group):


Step 1: Shift data in non-boundary processors right by one using an electronic move.

Step 2: Perform an OTIS move on boundary processor data. So, data from (g, p) move to (p, g).









16


Step 3: Shift the data moved in Step 2 right by one using an electronic move. Now, the data from (g, p) are in (p, g + 1).

Step 4: Perform an OTIS move on these data. Now data originally in (g, p) are in (g + 1, p).

Step 5: Shift the data left by √N - 1 using √N - 1 electronic moves. Now, the boundary data originally in (g, p) are in the processor to its right but in the next group.

The above five step process takes √N + 1 electronic and two OTIS moves. Note, however, that if each group is a wraparound mesh in which the last processor of each row connects to the first and the bottom processor of each column connects to the top one, then row and column shift operations become much simpler, as Step 1 may be eliminated and Step 5 replaced by a right wraparound shift of 1. The complexity is now two electronic and two OTIS moves.

GSM is also inferior on the transpose operation, which now requires 8(√N - 1) electronic and 2 OTIS moves.


Theorem 2.3.1 [46] The transpose operation of an N × N mesh requires 8(√N - 1) electronic and 2 OTIS moves when the GSM is used.

Proof Let g = g_xg_y and p = p_xp_y denote processor (g, p) of the OTIS-Mesh. This processor is in position (p_x, p_y) of group (g_x, g_y) and corresponds to processor (g_xp_x, g_yp_y) of the N × N embedded mesh. To accomplish the transpose, data are to be moved from the N × N mesh processor (g_xp_x, g_yp_y) (i.e., the OTIS-Mesh processor (g, p) = (g_xg_y, p_xp_y)) to the mesh processor (g_yp_y, g_xp_x) (i.e., the OTIS-Mesh processor (g_yg_x, p_yp_x)). The following movements do this: (g_xg_y, p_xp_y) →E* (g_xg_y, p_yp_x) →O (p_yp_x, g_xg_y) →E* (p_yp_x, g_yg_x) →O (g_yg_x, p_yp_x). Once again E* denotes a sequence of electronic moves local to a group and O denotes a single OTIS move. The E* moves in this case perform a transpose in a √N × √N mesh. Each of these transposes can be done in 4(√N - 1) moves [34]. So, the above transpose method uses 8(√N - 1) electronic and 2 OTIS moves.

To see that this is optimal, first note that every transpose algorithm requires at least 2 OTIS moves. For this, pick a group g_xg_y such that g_x ≠ g_y. Data from all N processors in this group are to move to the processors in group g_yg_x. This requires at least one OTIS move. However, if only one OTIS move is performed, the data from g_xg_y are scattered among the N groups. So, at least two OTIS moves are needed for the data to end up in the same group.

Next, we shall show that independent of the OTIS moves, at least 8(√N - 1) electronic moves must be performed. The electronic moves cumulatively perform one of the following two transforms (depending on whether the number of OTIS moves is even or odd; see the previous section about the diameter):

(a) local moves from (p_x, p_y) to (p_y, p_x) and local moves from (g_x, g_y) to (g_y, g_x);

(b) local moves from (p_x, p_y) to (g_y, g_x) and local moves from (g_x, g_y) to (p_y, p_x).

For (p_x, p_y) = (g_x, g_y) = (0, √N - 1), (a) and (b) require 2(√N - 1) left and 2(√N - 1) down moves. For (p_x, p_y) = (g_x, g_y) = (√N - 1, 0), (a) and (b) require 2(√N - 1) right and 2(√N - 1) up moves. The total number of moves is thus 8(√N - 1). So, 8(√N - 1) is a lower bound on the number of electronic moves needed. □












CHAPTER 3
DATA REARRANGEMENT ON AN OTIS-MESH

From Section 1.3, we know that an N² processor OTIS-Mesh can realize any permutation of N² data (one to each processor) using at most two OTIS moves. However, additional OTIS moves are needed to determine the local group data rearrangements that must be made.

In this chapter, we first develop algorithms to realize permutations such as transpose, shuffle, unshuffle, and vector reversal which arise frequently in applications. Nassimi and Sahni [34] have developed optimal 4D mesh algorithms for several frequently arising permutations. These may be simulated using the method of Zane et al. [58] to obtain algorithms for the OTIS-Mesh. Table 3.1 gives the number of 4D mesh moves used by the optimal 4D mesh algorithms, a breakdown of the number of moves in the first two and last two dimensions, and the number of electronic and OTIS moves required by the simulation.

In the following sections we shall obtain OTIS-Mesh algorithms for the permutations of Table 3.1, that require far fewer moves than the simulations of the optimal 4D mesh algorithms.

Assume that the N² OTIS-Mesh processors are numbered/indexed 0 through N² - 1 such that in the binary representation of a processor index the left half bits give the group number and the right half give the processor number local to a group. So, a processor index I is of the form I = GP, where I, G, and P are represented in binary and G and P have the same number of bits. G and P may be decomposed into halves to get G = G_xG_y and P = P_xP_y, such that G_x and G_y give the group











Table 3.1. Optimal moves for 4D mesh and respective OTIS-Mesh simulations

                      4D mesh                                    OTIS-Mesh Simulation
Permutation           total         dim. 1+2     dim. 3+4       OTIS          electronic
Transpose             8(√N - 1)     4(√N - 1)    4(√N - 1)      8(√N - 1)     8(√N - 1)
Perfect Shuffle       4√N           2√N          2√N            4√N           4√N
Unshuffle             4√N           2√N          2√N            4√N           4√N
Bit Reversal          8(√N - 1)     4(√N - 1)    4(√N - 1)      8(√N - 1)     8(√N - 1)
Vector Reversal       8(√N - 1)     4(√N - 1)    4(√N - 1)      8(√N - 1)     8(√N - 1)
Bit Shuffle           (16/3)√N - 4  (8/3)√N - 2  (8/3)√N - 2    (16/3)√N - 4  (16/3)√N - 4
Shuffled Row-major    (16/3)√N - 4  (8/3)√N - 2  (8/3)√N - 2    (16/3)√N - 4  (16/3)√N - 4
G_yP_x Swap           4(√N - 1)     2(√N - 1)    2(√N - 1)      4(√N - 1)     4(√N - 1)


location by row and column in an array layout of groups (as in Figure 2.1) and P_x and P_y locate processor P of a group by its row and column coordinates.

The permutations of Table 3.1 are members of the BPC (bit permute complement) class of permutations defined in Nassimi and Sahni [34]. The definition of the BPC permutation and its relation to the permutations of Table 3.1 will be presented in the last section, along with the development of the algorithm.
3.1 Transpose

The transpose operation may be accomplished via a single OTIS move and zero electronic moves. The simulation of the optimal 4D mesh algorithm, however, takes 8(√N - 1) OTIS and 8(√N - 1) electronic moves.
3.2 Perfect Shuffle

Let G represent the first half of the bits in the processor index and P the second half. Let b_G(i) and b_P(i), respectively, denote the bits in position i of G and P. So b_G(p/2-1) and b_P(p/2-1) are the most significant bits of G and P, while b_G(0) and b_P(0) are the least. Let G = b_G(p/2-1)G' and P = b_P(p/2-1)P'. A perfect shuffle may be performed as below:









Step 1: Perform a local perfect shuffle in each group. This moves data from every processor GP to the corresponding processor G P'b_P(p/2-1).

Step 2: This step involves processors in groups G such that b_G(p/2-1) = 0 only. In these groups, odd processors exchange data with corresponding even processors (note that the processors exchanging data differ only in bit zero). To see the new data arrangement, it is convenient to separate out four cases depending on the values of b_G(p/2-1) and b_P(p/2-1). Steps 1 and 2 accomplish the following:

0G'0P' → 0G'P'0 → 0G'P'1
0G'1P' → 0G'P'1 → 0G'P'0
1G'0P' → 1G'P'0 → 1G'P'0
1G'1P' → 1G'P'1 → 1G'P'1

Step 3: Perform an OTIS move on all processors.

Step 4: Perform a local shuffle in each group. The transformations so far are given below:

0G'0P' → 0G'P'0 → 0G'P'1 → P'10G' → P'1G'0
0G'1P' → 0G'P'1 → 0G'P'0 → P'00G' → P'0G'0
1G'0P' → 1G'P'0 → 1G'P'0 → P'01G' → P'0G'1
1G'1P' → 1G'P'1 → 1G'P'1 → P'11G' → P'1G'1

Step 5: This step involves only processors in even groups. In these groups, odd processors exchange their data with the corresponding even processors.

0G'0P' → 0G'P'0 → 0G'P'1 → P'10G' → P'1G'0 → P'1G'0
0G'1P' → 0G'P'1 → 0G'P'0 → P'00G' → P'0G'0 → P'0G'1
1G'0P' → 1G'P'0 → 1G'P'0 → P'01G' → P'0G'1 → P'0G'0
1G'1P' → 1G'P'1 → 1G'P'1 → P'11G' → P'1G'1 → P'1G'1







Step 6: Perform an OTIS move on all processors.

Step 7: Same as Step 5.

The seven step process is shown below:

0G'0P' → 0G'P'0 → 0G'P'1 → P'10G' → P'1G'0 → P'1G'0 → G'0P'1 → G'0P'0
0G'1P' → 0G'P'1 → 0G'P'0 → P'00G' → P'0G'0 → P'0G'1 → G'1P'0 → G'1P'0
1G'0P' → 1G'P'0 → 1G'P'0 → P'01G' → P'0G'1 → P'0G'0 → G'0P'0 → G'0P'1
1G'1P' → 1G'P'1 → 1G'P'1 → P'11G' → P'1G'1 → P'1G'1 → G'1P'1 → G'1P'1

The correctness of the seven step algorithm above is readily seen. From the diagram of the data movement operations, we see that data originally in 0G'0P' end up in G'0P'0; those in 0G'1P' end up in G'1P'0; those in 1G'0P' end up in G'0P'1; and those in 1G'1P' end up in G'1P'1. In other words, data are moved from GP to G'b_P(p/2-1) P'b_G(p/2-1), which is precisely what is to be done in a perfect shuffle.
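As an aside (ours, not in the original), this destination is exactly a one bit left rotation of the full p bit processor index, which can be confirmed mechanically for, say, p = 8 (an N² = 256 processor OTIS-Mesh):

```python
for I in range(1 << 8):                        # I = b7 b6 b5 b4 b3 b2 b1 b0
    G, P = I >> 4, I & 0xF
    gh, Gq = G >> 3, G & 0x7                   # G = gh G'
    ph, Pq = P >> 3, P & 0x7                   # P = ph P'
    dest = (((Gq << 1) | ph) << 4) | (Pq << 1) | gh   # G' ph P' gh
    rot = ((I << 1) | (I >> 7)) & 0xFF                # left rotate all 8 bits
    assert dest == rot
```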
Steps 1 and 4 perform perfect shuffles in √N × √N meshes. Each of these can be done optimally in 2√N electronic moves using the algorithm of Nassimi and Sahni [34]. Steps 2, 5, and 7 require exchanging data between mesh neighbors. Each exchange moves data in opposite directions on the same link and takes two electronic moves. Steps 3 and 6 take one OTIS move each. So, the total number of moves is 4√N + 6 electronic and 2 OTIS only. In contrast, the simulation of the optimal 4D mesh perfect shuffle algorithm takes 4√N electronic and 4√N OTIS moves.
3.3 Unshuffle

This is the inverse of a perfect shuffle and may be done by running the seven step shuffle algorithm backward (i.e., beginning with Step 7) and replacing the local shuffles of Steps 1 and 4 by local unshuffles. The data movement is shown below (G = G''b_G(0), P = P''b_P(0)):








G"OP"O G"oP"1 PM1G" P"1G"O P"10G"
G"OP"1 G"OP"O P"OG"O P"OG"1 P"01G"
G"1P"O tiq G"1P"O '-!p P"0G'1 5-!! P"OG"O s-!! P"OOG" tm;
G"1P"1 G"1P"1 P"1G"1 P"'1G"1 P"11G"

OG"P"1 OG"P"O OG"OP"
1G"P"o IG"MP" 1G"OP"
OG"P"O Stop? OG"P"1 G"1P"
1G"P"1 1G"P"1 1G"1P"

The number of data moves is the same as for a perfect shuffle.
3.4 Bit Reversal

A bit reversal can be done using one OTIS and 8(√N - 1) electronic moves as below. Note that when the bit reversal is done by simulating the optimal 4D mesh algorithm, 8(√N - 1) electronic and 8(√N - 1) OTIS moves are made.

Step 1: Do a local bit reversal in each group.

Step 2: Perform an OTIS move of all data.

Step 3: Do a local bit reversal in each group.

Steps 1 and 3 are done optimally in 4(√N - 1) electronic moves each using the optimal 2D mesh bit reversal algorithm of Nassimi and Sahni [34].
3.5 Vector Reversal

A vector reversal can be done using 8(√N - 1) electronic and two OTIS moves. The steps are as follows:

Step 1: Perform a local vector reversal in each group.

Step 2: Do an OTIS move of all data.

Step 3: Perform a local vector reversal in each group.








Step 4: Do an OTIS move of all data.

Note that Step 1 moves data from GP to GP̄, where P̄ denotes the bitwise complement of P. Step 2 moves this data from GP̄ to P̄G. Next, Step 3 sends that data to P̄Ḡ, and finally Step 4 sends it to ḠP̄, completing the vector reversal. The number of data moves is easily obtained by noting that the optimal way to perform the local vector reversals takes 4(√N - 1) electronic moves [34].
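A quick mechanical check (ours) of this four step chain on a 16 processor OTIS-Mesh, where complementing a 2 bit half is an XOR with 0b11:

```python
MASK = 0b11                          # complement mask for a 2 bit half (N = 4)

def local_reversal(g, p):            # Steps 1 and 3: vector reversal within a group
    return g, p ^ MASK

def otis(g, p):                      # Steps 2 and 4: the optical transpose move
    return p, g

for g in range(4):
    for p in range(4):
        s = otis(*local_reversal(*otis(*local_reversal(g, p))))
        assert s == (g ^ MASK, p ^ MASK)
```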
3.6 Bit Shuffle

Our algorithm to perform this permutation employs a G_yP_x swap permutation, in which data from processor G_xG_yP_xP_y is routed to processor G_xP_xG_yP_y. So, let us first see how to perform this permutation.

3.6.1 G_yP_x Swap

We present two algorithms for this. The first uses 4(√N - 1) electronic and log₂ N OTIS moves. The second uses 6(√N - 1) electronic and 2 OTIS moves. While the second algorithm uses a larger number of moves, it is to be preferred when the cost of an OTIS move is considerably larger than that of an electronic move.

The first algorithm performs a series of bit exchange permutations of the form B(i) = [B_{p-1}, ..., B_0], 0 ≤ i < p/4, where

B_j = p/2 + i, if j = p/4 + i
B_j = p/4 + i, if j = p/2 + i
B_j = j, otherwise.

The permutation B(i) may be realized as below:

Step 1: Processors GP with b_G(i) ≠ b_P(p/4+i) route their data to corresponding processors that differ only in bit b_P(p/4+i). This requires moving data left and right on rows of √N × √N meshes by 2^i positions (in each direction).


Step 2: Perform an OTIS move.









Step 3: The data moved in Step 1 are routed from their current processors to corresponding processors that differ only in bit i. This requires data moves left and right along rows of √N × √N meshes. The distance is 2^i in each direction.

Step 4: Perform an OTIS move.

The total number of moves is 2^{i+2} electronic and two OTIS.

To perform a G_yP_x swap permutation, we simply perform B(i) permutations for 0 ≤ i < p/4. This takes p/2 = log₂ N OTIS moves and Σ_{i=0}^{p/4-1} 2^{i+2} = 4(2^{p/4} - 1) = 4(√N - 1) electronic moves.
The second algorithm uses the following six steps:

Step 1: Shift data in group G_xG_y up circularly by G_y rows. This moves the datum from processor G_xG_yP_xP_y to processor G_xG_y((P_x - G_y) mod √N)P_y.

Step 2: Perform an OTIS move. The datum from G_xG_yP_xP_y is now in ((P_x - G_y) mod √N)P_y G_xG_y.

Step 3: In each group, shift the data right circularly along the rows by an amount given by the left half of the group bits. The datum originally in G_xG_yP_xP_y is now in ((P_x - G_y) mod √N)P_y G_xP_x.

Step 4: Perform an OTIS move. The datum is now in G_xP_x((P_x - G_y) mod √N)P_y.

Step 5: Move data up circularly along columns by an amount given by the right half of the group bits. The datum is now in G_xP_x(-G_y mod √N)P_y.

Step 6: Reverse the order of data in each column of each group. The datum is now in G_xP_xG_yP_y.









While a column or row circular shift takes √N moves in each group, the number of moves in each direction varies from group to group. Assuming that up moves in one group may not be overlapped with down moves in another, Steps 1, 3, and 5 take 2(√N - 1) electronic moves each. Step 6 may be combined with Step 5 at no extra cost. So, a total of 6(√N - 1) electronic and two OTIS moves are used.
3.6.2 Bit Shuffle

A bit shuffle may be performed following these steps:

Step 1: Perform a G_yP_x swap.

Step 2: Do a local bit shuffle in each group.

Step 3: Do an OTIS move.

Step 4: Do a local bit shuffle in each group.

Step 5: Do an OTIS move.

Using the 4(√N - 1) electronic and log₂ N OTIS move algorithm for the G_yP_x swap and the optimal mesh bit shuffle algorithm of Nassimi and Sahni [34], the number of moves becomes (approximately) (16/3)√N - 4 electronic and log₂ N + 2 OTIS.
3.7 Shuffled Row-Major

This is the inverse of a bit shuffle and may be done in the same number of moves by running the bit shuffle algorithm backwards. Of course, Steps 2 and 4 are to be changed to shuffled row-major operations.
3.8 BPC Permutations

We mentioned in the beginning that the permutations of the previous sections are members of the BPC permutation class. In this section we present the definition









of the BPC permutation, its relation to those permutations, and the algorithm to realize it.
3.8.1 Definition

In a BPC permutation, the destination processor of each data item is given by a rearrangement of the bits in the source processor index. For the case of our N² processor OTIS-Mesh we assume that N is a power of two, so the number of bits needed to represent a processor index is p = log₂ N² = 2 log₂ N. A BPC permutation of Nassimi and Sahni [34] is specified by a vector A = [A_{p-1}, A_{p-2}, ..., A_0] where

(a) A_i ∈ {±0, ±1, ..., ±(p - 1)}, 0 ≤ i < p, and

(b) [|A_{p-1}|, |A_{p-2}|, ..., |A_0|] is a permutation of [0, 1, ..., p - 1].

The destination for the data in any processor may be computed in the following manner. Let m_{p-1}m_{p-2}...m_0 be the binary representation of the processor's index and let d_{p-1}d_{p-2}...d_0 be that of the destination processor's index. Then

d_{|A_i|} = m_i if A_i ≥ 0, and d_{|A_i|} = 1 - m_i if A_i < 0.

In this definition, -0 is to be regarded as < 0, while +0 is ≥ 0.

In a 16 processor OTIS-Mesh, the processor indices have four bits, with the first two giving the group number and the second two the local processor index. The BPC permutation [-0, 1, 2, -3] requires data from each processor m3m2m1m0 to be routed to processor (1 - m0)m1m2(1 - m3). Table 3.2 lists the source and destination processors of the permutation.
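The destination rule is easy to express in code. The sketch below (ours; the helper name and the encoding of entries as (bit, complement) pairs are our own devices, chosen because -0 cannot be stored as a signed integer) reproduces Table 3.2:

```python
def bpc_dest(m, A, p):
    """A lists (|A_i|, complement?) pairs for i = p-1 down to 0, as in [A_{p-1}, ..., A_0]."""
    d = 0
    for i, (bit, comp) in zip(range(p - 1, -1, -1), A):
        d |= (((m >> i) & 1) ^ comp) << bit
    return d

A = [(0, 1), (1, 0), (2, 0), (3, 1)]           # the vector [-0, 1, 2, -3]
print([bpc_dest(m, A, 4) for m in range(16)])
# [9, 1, 13, 5, 11, 3, 15, 7, 8, 0, 12, 4, 10, 2, 14, 6], matching Table 3.2
```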

The permutation vector A for each of the permutations of Table 3.1 is given in Table 3.3.









Table 3.2. Source and destination of the BPC permutation [-0, 1, 2, -3] in a 16 processor OTIS-Mesh


Table 3.3. Permutations and their permutation vectors

Permutation           Permutation Vector
Transpose             [p/2 - 1, ..., 0, p - 1, ..., p/2]
Perfect Shuffle       [0, p - 1, p - 2, ..., 1]
Unshuffle             [p - 2, p - 3, ..., 0, p - 1]
Bit Reversal          [0, 1, ..., p - 1]
Vector Reversal       [-(p - 1), -(p - 2), ..., -0]
Bit Shuffle           [p - 1, p - 3, ..., 1, p - 2, p - 4, ..., 0]
Shuffled Row-major    [p - 1, p/2 - 1, p - 2, p/2 - 2, ..., p/2, 0]


Source Destination
Processor (G, P) Binary Binary (G, P) Processor
0 (0,0) 0000 1001 (2,1) 9
1 (0,1) 0001 0001 (0,1) 1
2 (0,2) 0010 1101 (3,1) 13
3 (0,3) 0011 0101 (1,1) 5
4 (1,0) 0100 1011 (2,3) 11
5 (1,1) 0101 0011 (0,3) 3
6 (1,2) 0110 1111 (3,3) 15
7 (1,3) 0111 0111 (1,3) 7
8 (2,0) 1000 1000 (2,0) 8
9 (2,1) 1001 0000 (0,0) 0
10 (2,2) 1010 1100 (3,0) 12
11 (2,3) 1011 0100 (1,0) 4
12 (3,0) 1100 1010 (2,2) 10
13 (3,1) 1101 0010 (0,2) 2
14 (3,2) 1110 1110 (3,2) 14
15 (3,3) 1111 0110 (1,2) 6








3.8.2 Algorithm

Every BPC permutation A may be realized by a sequence of bit exchange permutations of the form B(i, j) = [B_{p-1}, ..., B_0], p/2 ≤ i < p, 0 ≤ j < p/2, where

B_q = j, if q = i
B_q = i, if q = j
B_q = q, otherwise,

and a BPC permutation C = [C_{p-1}, ..., C_0] = Π_G Π_P, where |C_q| < p/2 for 0 ≤ q < p/2, and Π_G and Π_P involve p/2 bits each. Let Π'_G be the permutation obtained from Π_G by subtracting p/2 from each entry whose absolute value exceeds p/2 - 1. For example, if Π_G = [-3, 5, 4], then p = 6 and Π'_G = [-0, 2, 1].

The transpose permutation may be realized by the sequence B(p/2 + j, j), 0 ≤ j < p/2; bit reversal is equivalent to the sequence B(p - 1 - j, j), 0 ≤ j < p/2; vector reversal can be realized by performing no bit exchanges and using C = [-(p-1), -(p-2), ..., -0] (Π_G = [-(p-1), -(p-2), ..., -p/2], Π_P = [-(p/2-1), ..., -0]); perfect shuffle may be decomposed into B(p/2, 0) and C = [p-2, p-3, ..., p/2, p-1, p/2-2, ..., 1, 0, p/2-1] (Π_G = [p-2, p-3, ..., p/2, p-1], Π_P = [p/2-2, ..., 1, 0, p/2-1]).

A bit exchange permutation B(i, j) may be performed in 2^{i'+1} + 2^{j'+1} electronic moves, where

i' = i - p/2, if i < 3p/4
i' = i - 3p/4, if i ≥ 3p/4

j' = j, if j < p/4
j' = j - p/4, if j ≥ p/4

and 2 OTIS moves, following a process similar to that used for B(i) in Section 3.6.1.
Our algorithm for general BPC permutations is:

Step 1: Decompose the BPC permutation A into the bit exchange permutations B1(i1, j1), B2(i2, j2), ..., Bk(ik, jk) and the BPC permutation C = Π_G Π_P as above. Do this such that i1 > i2 > ... > ik and j1 > j2 > ... > jk.









Step 2: If k = 0, do the following:

Step 2.1: Do the BPC permutation Π_P in each group using the optimal algorithm of Nassimi and Sahni [34].

Step 2.2: Do an OTIS move.

Step 2.3: Do the BPC permutation Π'_G in each group using the algorithm of Nassimi and Sahni [34].

Step 2.4: Do an OTIS move.

Step 3: If k = p/2, do the following:

Step 3.1: Do the BPC permutation Π'_G in each group.

Step 3.2: Do an OTIS move.

Step 3.3: Do the BPC permutation Π_P in each group.

Step 4: If k ≤ p/4, do the following:

Step 4.1: Perform the bit exchange permutations B1, ..., Bk.

Step 4.2: Do Steps 2.1 through 2.4.

Step 5: If k > p/4, do the following:

Step 5.1: Perform a sequence of p/2 - k bit exchanges involving bits other than those in B1, ..., Bk, in the same orderly fashion described in Step 1. Recompute Π_G and Π_P. Swap Π_G and Π_P.

Step 5.2: Do Steps 3.1 through 3.3.









Consider the permutation A = [6, 11, 3, 8, 10, 7, 0, 4, 13, 14, 2, 9, 1, 15, 5, 12] in a 2^16 processor OTIS-Mesh (we have omitted complements for simplicity; bit complements can be taken care of when the local BPC permutations Π_G and Π_P are performed). For this, the decomposition of Step 1 yields B1 = B(15,7), B2 = B(13,6), B3 = B(10,4), B4 = B(9,2), and B5 = B(8,0), Π_G = [13, 11, 14, 8, 10, 9, 15, 12], Π_P = [6, 3, 2, 7, 1, 0, 5, 4]. Since k = 5 > p/4 = 4, we go to Step 5. First we perform a sequence of bit exchanges on bits not in B1 through B5, i.e., B(14,5), B(12,3), and B(11,1). Recomputing Π_G and Π_P, we get Π_G = [6, 2, 3, 1, 5, 7, 0, 4] and Π_P = [13, 14, 11, 9, 8, 15, 10, 12]. Next, Steps 3.1 through 3.3 are done. The sequence of data moves is shown below:

(b15 b14 b13 b12 b11 b10 b9 b8 b7 b6 b5 b4 b3 b2 b1 b0)
→ B(14,5): (b15 b5 b13 b12 b11 b10 b9 b8 b7 b6 b14 b4 b3 b2 b1 b0)
→ B(12,3): (b15 b5 b13 b3 b11 b10 b9 b8 b7 b6 b14 b4 b12 b2 b1 b0)
→ B(11,1): (b15 b5 b13 b3 b1 b10 b9 b8 b7 b6 b14 b4 b12 b2 b11 b0)
→ Step 3.1: (b15 b5 b13 b3 b1 b10 b9 b8 b2 b6 b7 b0 b14 b11 b4 b12)
→ Step 3.2: (b2 b6 b7 b0 b14 b11 b4 b12 b15 b5 b13 b3 b1 b10 b9 b8)
→ Step 3.3: (b2 b6 b7 b0 b14 b11 b4 b12 b10 b15 b1 b8 b13 b5 b3 b9)
It can be verified that the resulting position is exactly the destination that the original BPC permutation A dictates.

The local BPC permutations determined by Π_G and Π_P take at most 4(√N - 1) electronic moves each [34]; the bit exchanges cumulatively take at most 4(√N - 1) electronic and log₂ N OTIS moves. So, the total number of moves is at most 12(√N - 1) electronic and log₂ N + 2 OTIS.
3.9 Comparison

Table 3.4 lists the complexities of the algorithms for the commonly used permutations developed in this chapter along with the complexities of algorithms that









Table 3.4. Complexity Comparison of Common Data Rearrangement

Simulation Ours
Permutation electronic OTIS electronic OTIS
Transpose 8(vN - 1) 8(vN - 1) 0 1
Perfect Shuffle 4N 4v/N 4v/N+6 2
Unshuffle 4N 4N 4v/N +6 2
Bit Reversal 8(VN - 1) 8(7N - 1) 8(IN - 1) 1
Vector Reversal 8(vN - 1) 8(VN - 1) 8(IN - 1) 2
Bit Shuffle R N-4 16 N-4 N-4 log2N+2
Shuffled Row-major 2 TN - 4 v/N -4 - 4 log2 N +2


use the simulation method of Zane et al. [581. It is clear that each of our algorithms

outperforms the simulation by a good margin.












CHAPTER 4
BASIC OPERATIONS ON AN OTIS-MESH

In this chapter, we develop deterministic OTIS-Mesh algorithms for the basic data operations for parallel computation that are studied in Ranka and Sahni [42), such as broadcast, window broadcast, prefix sum, rank, shift, sort, random access read and write. As shown in [42], algorithms for these operations can be used to arrive at efficient parallel algorithms for numerous applications, from image processing, computational geometry, matrix algebra, graph theory, and so forth.

We consider both the synchronous SIMD and synchronous MIMD models. In both, all processors operate in lock-step fashion. In the SIMD model, all active processors perform the same operation in any step and all active processors move data along the same dimension or along OTIS connections. In the MIMD model, processors can perform different operations in the same step and can move data along different dimensions.
4.1 Data Broadcast

Data broadcast is, perhaps, the most fundamental operation for a parallel computer. In this operation, data that is initially in a single processor (G, P) is to be broadcast or transmitted to all N2 processors of the OTIS-Mesh. Data broadcast can be accomplished using the following three step algorithm:


Step 1: Processor (G, P) broadcasts its data to all other processors in group G. Step 2: Perform an OTIS move.

Step 3: Processor G of each group broadcasts the data within its group.


32







33


Following Step 2, one processor of each group has a copy of the data, and following Step 3 each processor of the OTIS-Mesh has a copy. In the SIMD model, Steps 1 and 3 take 2(v/W - 1) electronic moves each, and Step 2 takes one OTIS move. The SIMD complexity is 4(v/1 - 1) electronic moves and 1 OTIS move, or a total of 4v/W - 3 moves. Note that our algorithm is optimal because the diameter of the OTIS-Mesh is 4V'W - 3 (Section 2.1). For example, if the data to be broadcast is initially in processor (0,0), the data needs to reach processor (N - 1, N - 1), which is at a distance of 4,V - 3. In the MIMD model, the complexity of Steps 1 and 3 depends on the value of P = (P, P,) and ranges from a low of approximately v'7.N- 1 to a high of 2(v/N- 1). The overall complexity is at most 4(VW- 1) electronic moves and one OTIS move. By contrast, simulating the 4D-mesh broadcast algorithm using the simulation method of [58] takes 4(v7N- 1) electronic moves and 4(v(.-- 1) OTIS moves in the SIMD model and up to this many moves in the MIMD model.
4.2 Window Broadcast

In a window broadcast, we start with data in the top left w x w submesh of a single group G. Here w divides v'7-. Following the window broadcast operation, the initial w x w window tiles all groups; that is, the window is broadcast both within and across groups. Our algorithm for window broadcast is: Step 1: Do a window broadcast within group G. Step 2: Perform an OTIS move.

Step 3: Do an intragroup data broadcast from processor G of each group. Step 4: Perform an OTIS move.

Following Step 1 the initial window properly tiles group G and we are left with the task of broadcasting from group G to all other groups. In Step 2, data d(G, P)







34


from (G, P) is moved to (P, G) for 0 < P < N. In Step 3, d(G, P) is broadcast to all processors (P, i), 0 < P, i < N, and in Step 4 d(G, P) is moved to (i, P),
0 < i,P < N.
Step 1 of our window broadcast algorithm takes 2(VW- - w) electronic moves in both the SIMD and MIMD models, and Step 3 takes 2(VK-- 1) electronic moves in the SIMD model and up to 2(vW - 1) electronic moves in the MIMD model. The total cost is 4Vy -2w -2 electronic and 2 OTIS moves in the SIMD model and up to this many moves in the MIMD model. A simulation of the 4D mesh window broadcast algorithm takes the same number of electronic moves, but also takes 4(VN- 1) OTIS moves.
4.3 Prefix Sum

The index (G, P) of a processor may be transformed into a scalar I = GN + P with 0 < I < N2. Let D(I) be the data in processor I, 0 < I < N2. In a prefix sum, each processor I computes S(I) = L' D(i), 0 < I < N2. A simple prefix sum algorithm results from the following observation:

S(I) = SD(I) + LP(I)

where SD(I) is the sum of D(i) over all processors i that are in a group smaller than the group of I and LP(I) is the local prefix sum within the group of I. The simple prefix sum algorithm is as follows:

Step 1: Perform a local prefix sum in each group. Step 2: Perform an OTIS move of the prefix sums computed in Step 1 for all processors (G, N - 1).







35


Step 3: Group N - I computes a modified prefix sum of the values, A, received in Step 2. In this modification, processor P computes EP-1 A(i) rather than

EX A(i).

Step 4: Perform an OTIS move of the modified prefix sums computed in Step 3. Step 5: Each group does a local broadcast of the modified prefix sum received by its N - 1 processor.

Step 6: Each processor adds the local prefix sum computed in Step 1 and the modified prefix sum it received in Step 5.

The local prefix sums of Steps 1 and 3 take 3(VNT- - 1) electronic moves in both the SIMD and MIMD models, and the local data broadcast of Step 5 takes 2(VN - 1) electronic moves. The overall complexity is 8(VN - 1) electronic moves and 2 OTIS moves. This can be reduced to 7(VN - 1) electronic moves and 2 OTIS moves by deferring some of the Step 1 moves to Step 5 as below. Step 1: In each group, compute the row prefix sums R. Step 2: Column 'NY - 1 of each group computes the modified prefix sums of its R
values.

Step 3: Perform an OTIS move on the prefix sums computed in Step 2 for all processors (G, N - 1).

Step 4: Group N - 1 computes a modified prefix sum of the values, A, received in
Step 3.

Step 5: Perform an OTIS move of the modified prefix sums computed in Step 4.







36


Step 6: Each group broadcasts the modified prefix sum received in Step 5 along column vW - 1 of its mesh.

Step 7 The column ,/fV - 1 processors add the modified prefix sum received in Step 6 and the prefix sum of R values computed in Step 2 minus its own R value

computed in Step 1.

Step 8: The result computed by column VW - 1 processors in Step 7 is broadcast along mesh rows.

Step 9: Each processor adds its R value and the value it received in Step 8.

If we simulate the best 4D mesh prefix sum algorithm, the resulting OTIS mesh algorithm takes 7(vf/R - 1) electronic and 6(VWi - 1) OTIS moves.
4.4 Datam

In this operation, each processor is to compute the sum of the D values of all processors. An optimal SIMD data sum algorithm is as follows: Step 1: Each group performs the data sum. Step 2: Perform an OTIS move.

Step 3: Each group performs the data sum.

In the SIMD model Steps 1 and 3 take 4(vfN-1) electronic moves, and step 2 takes 1 OTIS move. The total cost is 8(,v/- 1) electronic and 1 OTIS moves. Note that since the distance between processors (0, 0) and (N - 1, N - 1) is 4(,/7N - 1) electronic and 1 OTIS moves and since each needs to get information from the other, at least 8(v'7 - 1) electronic and 1 OTIS moves are needed (the moves needed to send information from (0,0) to (N -1, N -1) and those from (N - 1, N -1) to (0, 0)







37


cannot be overlapped in the SIMD model). Also, note that a simulation of the 4D mesh data sum algorithm takes 8(vl' - 1) electronic and 8(v'W - 1) OTIS moves.
The MIMD complexity can be reduced by computing the group sums in the middle processor of each group rather than in the bottom right processor. The complexity now becomes 4(VV - 1) electronic and 1 OTIS moves when V is odd and 4VW# electronic and 1 OTIS moves when v/W is even. The simulation of the 4D mesh, however, takes 4(v/T- 1) electronic and 4(VW- 1) OTIS moves. Notice that the MIMD algorithm is near optimal as the diameter of the OTIS-Mesh is 4vWN - 3 (Section 2.1).
4.5 Rank

In the rank operation, each processor I has a flag S(I) E {0, 1}, 0 < I < N2 We are to compute the prefix sums of the processors with S(I) = 1. This operation can be performed in 7(vNW - 1) electronic and 2 OTIS moves using the prefix sum algorithm of Section 4.3.
4.6 Shift

Although there are many variations of the shift operation, the ones we believe are most useful in application development are as follows:

(a) mesh row shift with zero fill-in this we shift data from processor (G,, G,, P, P)
to processor (G., Gy, P., P, + s), -vfN < s < VN#. The shift is done with zero fill and end discard (i.e., if P, + s > vN# or P, +,s < 0, the data from P, is
discarded).

() mesh column shift with zero fill-similar to (a), but along mesh column P.

(c) circular shift on a mesh row--in this we shift data from processor (Ge, G,, P , P,)
to processor (G., G,, P., (P, + s) mod v'N-).







38


(d) circular shift on a mesh column-similar to (c), but instead P is used.

(e) group row shift with zero fill---similar to (a), except that G, is used in place of PY.

(I) group column shift with zero fill-similar to (e), but along group column G2.

(g) circular shift on a group row-similar to (c), but with G. rather than P,.

(h) circular shift on a group column-similar to (g), with G, in place of G,.

Shifts of types (a) through (d) are done using the best mesh algorithms while those of types (e) through (h) are done as below: Step 1: Perform an OTIS move.

Step 2: Do the shift as a P (if originally a G, shift) or a P, (if originally a G. shift)
shift.

Step 3: Perform an OTIS move.

Shifts of types (a) and (b) take s electronic moves on the SIMD and MIMD models; (c) and (d) take v/N electronic moves on the SIMD model and max{ sI, vRIs|} electronic moves on the MIMD model; (e) and (f) take s electronic and 2 OTIS moves on both SIMD and MIMD models; and (g) and (h) take v/Nh electronic and 2 OTIS moves on the SIMD model and max{isI, v'N - Isj} electronic and 2 OTIS moves on the MIMD model.
If we simulate the corresponding 4D mesh algorithms, we obtain the same complexity for (a)-(d), but (e) and (f) take an additional 2s - 2 OTIS moves, and
(g) and (h) take an additional 2 x max{IsI, -v - IsI} - 2 OTIS moves.







39


4.7 Data Accumulation

Each processor is to accumulate M, 0 < M < V/N, values from its neighboring processors along one of the four dimensions G,, Gv, P,, P. Let D(G., Gv, P., P,) be the data in processor (G., G,, P., P,). In a data accumulation along the G, dimension (for example), each processor (G., G,, P, ,) accumulates in an array A the data values from ((G, + i) mod vfN-, G,, P., P,), 0 - i < M. Specifically, we have

A[i] = D((G, + i) mod VN, G, P, P,) Accumulation in other dimensions is similar.

The accumulation operation can be done using a circular shift of -M in the appropriate dimension. The complexity is readily obtained from that for the circular shift operation (see Section 4.6).
4.8 Consecutive Sum

The N2 processor OTIS-Mesh is tiled with one-dimensional blocks of size M. These blocks may align with any of the four dimensions G,, G,, P, and P,. Each processor has M values XUjI, 0 5 j < M. The ith processor in a block is to compute the sum of the X[iJ's in that block. Specifically, processor i of a block computes M-1
S(i) = E X~i](j), 0 < i < M
j=O
where i and j are indices relative to a block.
When the one-dimensional blocks of size M align with the P, or P, dimensions, a consecutive sum can be performed by using M tokens in each block to accumulate the M sums S(i), 0 < i < M. Assume the blocks align along P. Let po,pi,... pM_1 be the M processors, left-to-right, in a block. The consecutive sum algorithm works







40


in two phases. In the first phase, processor M - 1 initiates tokens to, t1,..., t-2 one by one. These tokens move leftwards. When a processor receive token t,, it adds its X[iJ value to it and transmits the token to the processor on its left. The first phase operates for M - I moves and at the end of this phase, pi has token = -, X[i](j). The second phase is similar to the first. This time, po initiates
the tokens M- 1, t'M-2, ... , e and the tokens move rightwards. Following M-1 moves, token t' is in processor pi and t' = E'' X[i](j). Following phase 2, pi computes the desired result 4. + t + X[iJ(i). The total number of moves is 2(M - 1).

In the MIMD model, the left and right moves can be done simultaneously, and only M - 1 electronic moves are needed.

When the one-dimensional size M blocks align with G, or G., we first do an OTIS move; then run either a P, or P. consecutive sum algorithm; and then do an OTIS move. The number of electronic moves is the same as for P, or P, alignment. However, two additional OTIS moves are needed.

Simulation of the corresponding 4D mesh algorithm takes an additional 4M-6 OTIS moves for the case of G. or G, alignment in the SIMD model and an additional 2M - 4 OTIS moves in the MIMD model.
4.9 Adjacent Sum

This operation is similar to the data accumulation operation of Section 4.7 except that the M accumulated values are to be summed. The operation can be done with the same complexity as data accumulation using a similar algorithm.
4.10 Concentrate

A subset of the processors contain data. These processors have been ranked as in Section 4.5. So the data is really a pair (D, r); D is the data in the processor and r is its rank. Each pair (D, r) is to be moved to processor r, 0 < r < b, where







41


b is the number of processors with data. Using the (G, P) format for a processor index, we see that (D, r) is to be routed from its originating processor to processor ([r/NJ, r mod N). We accomplish this using the steps: Step 1: Each pair (D, r) is routed to processor r mod N within its current group. Step 2: Perform an OTIS move.

Step 3. Each pair (D, r) is routed to processor [r/NJ within its current group. Step 4: Perform an OTIS move.

Teomrrn 41.j The four step algorithm given above correctly routes every pair (D, r) to processor ([r/NJ,r mod N).

Proof Step 1 does the routing on the second coordinate. This step does not route two pairs to the same processor provided no group has two pairs (Di, ri), (D2, r2) with r, mod N = r2 mod N. Since each group has at most N pairs and the ranks of these pairs are contiguous integers, no group can have two pairs with r, mod N = r2 mod N. So following Step 1 each processor has at most one pair and each pair is in the correct processor of the group, though possibly in the wrong group.

To get the pairs to their correct groups without changing the within group index, Step 2 performs an OTIS move, which moves data from processor (G, P) to processor (P, G). Now all pairs in a group have the same r mod N value and different [r/NJ values. The routing on the [r/NJ values, as in Step 3, routes at most one pair to each processor. The OTIS move of Step 4, therefore, gets every pair to its correct destination processor. 0

In group 0, Step 1 is a concentrate localized to the group, and in the remaining groups, Step 1 is a generalized concentrate in which the ranks have been increased







42


by the same amount. In all groups we may use the mesh concentrate algorithm of Nassimi and Sahni [35] to accomplish the routing in 4(vW - 1) electronic moves. Step 3 is also a concentrate as the [r/NJ values of the pairs are in ascending order from 0, 1, 2, - - -. So Steps 1 and 3 take 4(v/7# - 1) electronic moves each in the SIMD model and 2(,V'W - 1) in the MIMD model [35]. Therefore, the overall complexity of concentrate is 8(VN - 1) electronic and 2 OTIS moves in the SIMD model and 4(VN- - 1) electronic and 2 OTIS moves in the MIMD model.
We can improve the SIMI) time to 7(/W - 1) electronic and 2 OTIS moves by using a better mesh concentrate algorithm than the one in Nassimi and Sahni [35]. The new and simpler algorithm is given below for the case of a generalized concentration on a VN- x VN mesh.

Step 1: Move data that are to be in a column right of the current one rightwards to the proper processor in the same row.

Step 2: Move data that are to be in a column left of the current one leftwards to the

proper processor in the same row.

Step 3: Move data that are to be in a smaller row upwards to the proper processor

in the same column.

Step 4: Move data that are to be in a bigger row downwards to the proper processor

in the same column.

In a concentrate operation on a square mesh the data that begin in two processors of the same row ends up in different columns as the ranks of these two data differ by at most V- 1. So Steps 1 and 2 do not leave two or more data in the same processor. Steps 3 and 4 get data to the proper row and hence to the proper processor. Note that it is possible to have up to two data items in a processor following







43


Step 1 and Step 3. The complexity of the above concentrate algorithm is 4(vfNi - 1) on a SIMD mesh and 2(v'N - 1) on an MIMD mesh (we can overlap Steps 1 and 2 as well as Steps 3 and 4 on an MIMD mesh).

For an ordinary concentrate in which the ranks begin at 1, Step 4 can be omitted as no data moves down a column to a row with bigger index. So an ordinary concentrate takes only 3(VW - 1) moves. This improves the SIMD concentration algorithm of Nassimi and Sahni [351, which takes 4(VN- 1) moves to do an ordinary concentrate.
Actually, we can show that the four step concentration algorithm just stated is optimal for the SIMD model. Consider the ordinary concentrate instance in which the selected elements are in processors (0, vr - 1), (1, rN - 2), - - -, (N- - 1, 0). The ranks are 0, 1, - - -, VN- 1. So the data in processor (0, v'K- 1) is to be moved to processor (0,0). This requires moves that yield a net of VNW - 1 left moves. Also, the data in processor (vW - 1, 0) is to be moved to processor (0, rN - 1). This requires a net of N - 1 upward moves and VW - 1 rightward moves. None of these moves can be overlapped in the SIMD model. So every SIMD concentrate algorithm must take at least vrN - 1 moves in each of the directions left, right, and up; a total of at least 3(VW - 1) moves.
For the generalized concentrate algorithm, the ranks need not start at zero. Suppose we have two elements to concentrate. One is at processor (0,0) and has rank N - i, and the other is at processor (V'7-- 1, vN -1) and has rank N. The data in (0,0) is to be moved to (VN7- - 1, vWR - 1) at a cost of vW - 1 net right and down moves. The data in (vW - 1, vW - 1) is to be moved to (0,0) at a cost of V'?- 1 net left and up moves. So at least 4(vW - 1) moves are needed.







44


Table 4.1. Processors with data to concentrate

GG, P. P,
0,0 0 < P. < VN - 1, 0 < P, < vN
0,1 P, = 0, 0 < P, <
G =1,OG, V/ - I 1 0 0 < P, < N, G = 0
VN_ - 1, 1 0 < P. < -%/SN - 1, 0 < P, vN -1, -VN - 1 (N - 1, VN_ - 1


Theorem. .2 The OTIS-Mesh data concentration algorithm described above is optimal for both the SIMD and MIMD models; that is, (a) every SIMD concentration algorithm must make 7(VNW - 1) electronic and 2 OTIS moves in the worst case, and (b) every MIMD concentration algorithm must make 4(VN - 1) electronic and

2 OTIS moves.

Proof (a) Suppose that the data to be concentrated are in the processors shown in Table 4.1. Let a denote processor (VW_ - 1, VN# - 1, vNW- 1, VN - 1), let b denote processor (vIrN- 1, 0, vfW - 1, 0), and let c denote processor (0,1,0,0). The ranks of a, b, and care N3V, N31-N+vrN-1, and N-v/N respectively. Therefore, following the concentration the data D(a), D(b), and D(c) initially in processors a, b, and c will be in processors (0,1,0,0), (0, v'N - 1,0, V - 1), and (0,0, v - 1,0) respectively. Figure 4.1 shows the initial and concentrated data layout for the case when N = 16. The change in G,, G,, P, and P values between the final and initial locations of D(a), D(b), and D(c) is shown in Table 4.2.
The maximum net negative change in each of G,, G,, P., and P. is -(v/W-1). Since a net negative change in G. can only be overlapped with a net negative change in P, and since D(b) needs -(vr - 1) negative change in both G, and P, we must make at least 2(VW- 1) electronic moves that decrease the row index within a mesh.






45


01DO E]E]00 0000 0000 0000 0000 0000 00000000 0000 0000 0000 0000 0000 0000
000Z xxxx 0000 000J 0000 0000 0000

0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
00 0000 0000 0000
moo 0m oooo 0000 0OO 0000 0000 0001
(a)




MOM 0000 0000 0000 10000 0000 0000 0000
[D000 0000 000 00 0000~ 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
00000 000 000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 000010000
000 OO G 0000


0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
(b)


Figure 4.1. Data Configuration: (a) Initial; (b) Concentrated







46


Table 4.2. Net change in G., Gi, P,, and P.

data G,G_ P, PG
D(a) -(VN - 1) +1 -(v/N- ) (N -1) 7-(v/N -1) D(b) -(VN - 1) +(vN - VN- - 1) +(VN- 1)
D(c) 0 0 +(VN - 1) 0 I


Similarly, because of D(a)'s requirements, at least 2(vNW - 1) electronic moves that increase the column index within a vWN x /N- mesh must be made. Turning our attention to net positive changes, we see that because of D(b)'s requirements there must be at least 2(vW - 1) electronic moves that increase the column index. D(c) requires vW - 1 electronic moves that increase the row index. Since positive net moves cannot be overlapped with negative net moves, and since net moves along G, and P cannot be overlapped with net moves along G, and P,, the concentration of the configuration of Table 4.1 must take at least 7(v'# - 1) electronic moves.

In addition to 7(v/N - 1) electronic moves, we need at least 2 OTIS moves to concentrate the data of Table 4.1. To see this consider the data initially in group (0,1). These data are in group (0,0) following the concentration. At least one OTIS move is needed to move the data out of group (0,1). A nontrivial OTIS-Mesh has > 2 processors on a row of a 4Y x v'N submesh. For such an OTIS-Mesh, at least two pieces of data must move from group (0,1) to group (0,0). A single OTIS move scatters data from group (0,1) to different groups with each datum going to a different group. At least one additional OTIS move must be made to get the data back into the same group. Therefore the concentration of the configuration of Table 4.1 cannot be done with fewer than 2 OTIS moves.

(b) Consider the initial configuration of Table 4.1. Since the shortest path between processor b and its destination processor is 4(vNW - 1) electronic and one







47


OTIS move, at least that many electronic moves are made, in the worst case, by every concentration algorithm. The reason that at least 2 OTIS moves are needed to complete the concentration is the same as for (a). 0
4.11 Distribute

This is the inverse of the concentrate operation of Section 4.10. We start with pairs (Do, do),..., (D,, d), do < di < - - - < d,, in the first q+l processors 0, 1,..., q and are to route pair (Di, di) to processor dt, 0 < i < q. The algorithm of Section 4.10 tells us how to start with pairs (Di, i) in processor d,, 0 5 i < q and move them so that Di is in i. By running this backwards, we can start with Di in i and route it to d,. The complexity of the distribute operation is the same as that of the concentrate operation. We have shown that the concentrate algorithm of Section 4.10 is optimal; it follows that the distribute algorithm is also optimal.
4.12 Generalize

We start with the same initial configuration as for the distribute operation. The objective is to have D, in all processors j such that d, 5 j < d,+, (set di to N2 - 1). If we simulate the 4D mesh algorithm for generalize using the simulation strategy of Zane et al. [58], it takes 8(V/N- 1) electronic and 8(v(- 1) OTIS moves to perform the generalize operation on an SIMD OTIS-Mesh. We can improve this to 8(VNW - 1) electronic and 2 OTIS moves if we run the generalize algorithm of Nassimi and Sahni [35] adapted to use OTIS moves as necessary. The outer loop of the algorithm of Nassimi and Sahni [35] examines processor index bits from 2p - 1 to 0 where p = log2 N. So in the first p iterations we are moving along bits of the G index and in the last p iterations along bits of the P index. On an OTIS-Mesh we would break this into two parts as below:


Step 1: Perform an OTIS move.







48


Step 2: Run the GENERALIZE procedure of Nassimi and Sahni 135] from bit p - I to 0, while maintaining the original index. Step 3: Perform an OTIS move.

Step 4: Run the GENERALIZE algorithm of Nassimi and Sahni [35 from bit p - 1 to 0.

On an MIMD OTIS-Mesh the above algorithm takes 4(VW - 1) electronic and 2 OTIS moves.
We can reduce the SIMI) complexity to 7(/N - 1) electronic and 2 OTIS moves by using a better algorithm to do the generalize operation on a 2D SIMD mesh. This algorithm uses the same observation as used by us in Section 4.10 to speed the 2D SIMD mesh concentrate algorithm; that is, of the four possible move directions, only three are possible. When doing a generalize on a 2D VW x VN mesh the possible move directions for data are to increasing row indexes and to decreasing and increasing column indexes. With this observation, the algorithm to generalize on a 2D mesh becomes:

Step 1: Move data along columns to increasing row indexes if the data is needed in

a row with higher index.

Step 2: Move data along rows to increasing column indexes if the data is needed in

a processor in that row with higher column index.

Step 3: Move data along rows to decreasing column indexes if the data is needed in

a processor in that row with smaller column index.

The correctness of the preceding generalize algorithm can be established using the argument of Theorem 4.10.1, and its optimality follows from Theorem 4.10.2







49


and the fact that the distribute operation, which is the inverse of the concentrate operation, is a special case of the generalize operation.

The new and more efficient generalize algorithm may be used in Step 2 of the OTIS-Mesh generalize algorithm. It cannot be used in Step 4 because the generalize of this step requires the full capability of the code of Nassimi and Sahni [35] which permits data movement in all four directions of a mesh.

When we use the new generalize algorithm for Step 2 of the OTIS-Mesh generalize algorithm, we can perform a generalize on a SIMD OTIS-Mesh using 7(1K- 1) electronic and 2 OTIS moves. The new algorithm is optimal for both SIMD and MIMD models. This follows from the lower bound on a concentrate operation established in Theorem 4.10.2 and the observation made above that the distribute operation, which is a special case of the generalize operation, is the inverse of the concentrate operation and so has the same lower bound.
4.13 Sorting

As was the case for the operations considered so far, an O(v/N) time algorithm to sort can be obtained by simulating a similar complexity 4D mesh algorithm. For sorting a 4D Mesh, the algorithm of Kunde [25] is the fastest. Its simulation will sort into snake-like row-major order using 14vK+ o(vr) electronic and 12VN+ o(vrNd) OTIS moves on the SIMD model and 7VN + o(VN) electronic and 6VN + o(VN) OTIS moves on the MIMD model. To sort into row-major order, additional moves to reverse alternate dimensions are needed. This means that an OTIS-Mesh simulation of Kunde's 4D mesh algorithm to sort into row-major order will take 18vrN+ o(V'K) electronic and 16-V + o(VN) OTIS moves on the SIMD model. We show that Leighton's column sort [26J can be implemented on an OTIS-Mesh to sort into rowmajor order using 22vN + o(V'K) electronic and O(N/8) OTIS moves on the SIMD







50


1 7 13 1 2 3
2 8 14 % 4 5 6 3 9 15 7 8 9
4 10 16 10 11 12
5 11 17 S+"'-4 13 14 15 6 12 18 16 17 18


Figure 4.2. Row-Column Transformation of Leighton's Column Sort

model and 11vV#+o(VW) electronic and O(N/8) OTIS moves on the MIMD model. Please note that the algorithm discussed here is deterministic. The randomized algorithms for sorting can be found in Rajasekaran and Sahni [41].

Our OTIS-Mesh sorting algorithm is based on Leighton's column sort [26]. This sorting algorithm sorts an r x s array, with r > 2(s - 1)2, into column-major order using the following seven steps:


Step 1: Sort each column. Step 2: Perform a row-column transformation. Step 3: Sort each column. Step 4: Perform the inverse transformation of Step 2. Step 5: Sort each column in alternating order. Step 6: Apply two steps of comparison-exchange to adjacent rows. Step 7: Sort each column.


Figure 4.2 shows an example of the transformation of Step 2, and its inverse. Figure 4.3 shows a step by step example of Leighton's column sort.







51


7 9 12 2 3 1 2 4 7 1 4 7
4 16 1 4 5 6 10 15 18 2 5 8
18 5 14 7 9 8 3 5 9 3 6 9
2 17 8 " 10 11 12 !1, 11 16 17 !tq 10 13 14 '1!'
15 11 6 15 16 13 1 6 8 11 15 17
10 3 13 18 17 14 12 13 14 12 16 18

1 3 11 1 14 11 1 11 14 1 7 13
4 6 15 2 13 12 2 12 13 2 8 14
7 9 17 4 10 15 4 10 15 3 9 15
2 10 12 ' 5 9 16 !!M 5 9 16 "'! 4 10 16
5 13 16 7 6 17 6 7 17 5 11 17
8 14 18 8 3 18 3 8 18 6 12 18


Figure 4.3. Example of Leighton's Column Sort

Although Leighton's column sort is explicitly stated for r x s arrays with r > 2(s - 1)2, it can be used to sort arrays with s > 2(r - 1)2 into row-major order by interchanging the roles of rows and columns. We shall do this and use Leighton's method to sort an N'/2 x N'12 array. We interpret our N2 OTIS-Mesh as an N1/2 x N3/2 array with G, giving the row index and GP.P, giving the column index of an element processor. We shall further subdivide G, (GV, P,, P,) into equal parts G,, G,, G,,, and G,4 from left to right. We use G.,-4, for example, to represent G,,GGZ. Since p = log2 N, G. has p/2 bits and G, has p/8 bits. These notations are helpful in describing the transformations in Steps 2 and 4 of the column sort, as we use the BPC permutations of Nassimi and Sahni [34] to realize these transformations. The definition of a BPC permutation can be found in Nassimi and Sahni [34] and in Section 3.8.1.

In describing our sorting algorithm, we shall, at times, use a 4D array interpretation of an OTIS-Mesh. In this interpretation, processor (G., G1, P., P,) of the







52


OTIS-Mesh corresponds to processor (G., G1, P., P.) of the 4D mesh. We use 9 to denote the bit positions of G,, that is the leftmost p/2 bits in a processor index, g,, to represent the leftmost p/8 bit positions, p, to represent the rightmost p/2 bit positions, pV3-4 to represent the rightmost p/4 bit positions, and so on. Our strategy for the sorting steps 1, 3, 5, and 7 of Leighton's method is to collect each row (recall that since we are sorting an N112 x N3/ array, the column-sort steps of Leighton's method become row-sort steps) of our NI12 x N3/ array into an N31 x N3/8 x N3/8 x N3/1 4D submesh of the OTIS-Mesh, and then sort this row by simulating the 4D mesh sort algorithm of Kunde [25]. This strategy translates into the following sorting algorithm:


Step 1: [Move rows of the NI12 x N3/2 array into N3/8 x N3/8 x N3/8 x N3/8 4D
submeshes]

Perform the BPC permutation P. = [9-11,,pgPl929V2-493Pz2-494P-41Step 2: [Sort each row of the NI2 x N3/2 array]

Sort each 4D submesh of size N3/' x N3/' x N3/' x N3/8.

Step 3: [Do the inverse of Step 1, perform a column-row transformation, and move
rows into N3/8 x N31 x N31 x N31 submeshes]

Perform the BPC permutation P, = [92,_-,gzgy2psa-,gYSPY2--49VyPZIPVIlStep 4: [Sort each row of the Ni12 x N3/2 array]
Sort each 4D submesh of size N3/' x N3/' x N3/' x N3/'.

Step 5: [Do the inverse of Step 1, perform a row-column transformation, and move
rows into N3/8 x N31 x N3/ x N3/8 submeshes]

Perform the BPC permutation P' = [g:4gz..py2y3PxIpylPy9lY2-4Py4PZ2-a4-







53


Step 6: [Sort each row in alternating order]
Sort each 4D submesh of size N3/8 x N3/8 x N3/1 x N3/8. Step 7: [Move rows back from 4D submeshes]
Perform the BPC permutation P, = (9g PZIP(Ig2gy2-4 9Z3PZ-4gZ4PY2- Perform the BPC permutation P, = [gzgYPZlPYlgz2gwy-4gz3PX2-494PY2-41Step 10: [Sort each row of the N112 x N3/2 array]

Sort each 4D submesh of size N3/' x N/8 x N3/8 x N3/8. Step 11: [Move rows back from 4D submeshes]
Perform the BPC permutation P' = [gllgrPXIPn 92-49.PX2-4gZ4Py2-41Notice that the row to 4D submesh transform is accomplished by the BPC permutation P. = [Z19y11PZPgZ2gY2-4gZ3PZ2-4Z4Py2-4}. Elements in the same row of our N'12 x N3/2 array interpretation have the same G. value; but in our 4D mesh interpretation, elements in the same N/8 x N3/8 x N3/8 x N3/8 submesh have the same GsG,,P,,P, value. P. results in this property. To go from Step 2 to Step 3 of Leighton's method, we need to first restore the N1/2 x N3/2 array interpretation using the inverse permutation of P., that is, perform the BPC permutation P. = [grpgy P-,Py19g2y-4913Pz2-4gZ4P2-4]; then perform a column-row transform using BPC permutation pb = [gypspygz]; and finally map the rows of our N1/2 x N3/2 array into 4D submeshes of size N3/8 x N3/ x N3/1 x N3/1 using the BPC permutation P,. The three BPC permutation sequence P'PbP. is equivalent to the single BPC permutation P. = [W2_4911 gY,2,4 gaPy,-4 949VIPSIPnI-







54


The preceding OTIS-Mesh implementation of column sort performs 6 BPC permutations, 4 4D mesh sorts, and two steps of comparison-exchange on adjacent rows. Since the sorting steps take O(N3/') time each (use Kunde's 4D mesh sort (25] followed by a transform from snake-like row-major to row-major), and since the remaining steps take O(N1/2) time, we shall ignore the complexity of the sort steps.
We can reduce the number of BPC permutations from 6 to 3 as follows. First note that the P. of Step 1 just moves elements from rows of the N1/2 x N3/2 array into N3/8 x N3/8 x N3/1 x N3/' 4D submeshes. For the sort of Step 2, it doesn't really matter which N3/2 elements go to each 4D submesh as the initial configuration is an arbitrary unsorted configuration. So we may eliminate Step 1 altogether. Next note that the BPC permutations of Steps 7 and 9 cancel each other and we can perform the comparison-exchange of Step 8 by moving data from one N3/1 x N31/ x N3/8 x N3/ 4D submesh to an adjacent one and back in O(N3/') time.

With these observations, the algorithm to sort on an OTIS-Mesh becomes:

Step 1: Sort in each subarray of size N3/8 x N318 x N 3/ x N31/ Step 2: Perform the BPC permutation P. Step 3: Sort in each subarray.

Step 4: Perform the BPC permutation P1. Step 5: Sort in each subarray.

Step 6: Apply two steps of comparison-exchange to adjacent subarrays. Step 7: Sort in each subarray.


Step 8: Perform the BPC permutation P'.







55


Using the BPC routing algorithm of Section 3.8.2, the three BPC permutations can be done using 36v'W electronic and 310g2 N + 6 OTIS moves on the SIMD model and 18v'W electronic and 310g2 N + 6 OTIS moves on the MIMD model. A more careful analysis based on the development in Nassimi and Sahni [34] and Section 3.8.2 reveals that the permutations P', PC, and P' can be done with 28vr electronic and log2 N +6 OTIS moves on the SIMD model and 14vW electronic and 310g2 N + 6 OTIS moves on the MIMD model. By using p' = [g9YgPZPY1gz2gVPz2PY2gZ39YSPzapsngz4gYPZ4PY4]I, Pc = [9x2-49Z192-4 9YIPZ2-4PZiP2-4PY,1 and p' = [gZgzi-,gy4g y-.P-Pgi .PIVPvi-.], the permutation cost becomes 22v/V electronic and log2 N + 5 OTIS moves on the SIMD model and I1 I/ electronic and log2 N + 5 OTIS moves on the MIMD model. The total number of moves is thus 22v/ + O(N3/8) electronic and O(N3/8) OTIS moves on the SIMD model and 1v/N + O(N3/8) electronic and O(N3/3) OTIS moves on the MIMD model. This is superior to the cost of the sorting algorithm that results from simulating the 4D row-major mesh sort of Kunde [25].
4.14 Random Access Read (RAR)

In a random access read (RAR) [42] processor I wishes to read data variable D of processor dl, 0 < I < N2. The steps suggested in Ranka and Sahni [42] for this operation are as follows:

Step 0: Processor I creates a triple (I, D, di) where D is initially empty. Step 1: Sort the triples by dr.

Step 2: Processor I checks processor I + 1 and deactivates if both have triples with
the same third coordinate.


Step 3: Rank the remaining processors.







56


Step 4: Concentrate the triples using the ranks of Step 3. Step 5: Distribute the triples according to their third coordinates. Step 6: Load each triple with the D value of the processor it is in. Step 7: Concentrate the triples using the ranks in Step 3. Step 8. Generalize the triples to get the configuration we had following Step 1. Step 9: Sort the triples by their first coordinates.

Using the SIMD model the RAR algorithm of Ranka and Sahni [42] takes 79(v/N' - 1) electronic moves and O(N/8) OTIS moves. On the MIMD model, it takes 45(vIN - 1) electronic O(N3/8) OTIS moves.
4.15 Random Access Write (RAW) Now processor I wants to write its D data to processor d, 0 < I < N. The steps in the RAW algorithm of Ranka and Sahni [42] are as follows: Step 0. Processor I creates the tuple (D(I), d1), 0 < I < N2. Step 1: Sort the tuples by their second coordinates. Step 2: Processor I deactivates if the second coordinate of its tuple is the same as

the second coordinate of the tuple in I + 1, 0 < I < N2 - 1. Step 3: Rank the remaining processors. Step 4: Concentrate the tuples using the ranks of Step 3. Step 5: Distribute the tuples according to their second coordinates.







57


Step 2 implements the arbitrary write method for a concurrent write. In this, any one of the processors wishing to write to the same location is permitted to succeed. The priority model may be implemented by sorting in Step 1 by d, and within d, by priority. The common and combined models can also be implemented, but with increased complexity.

On the SIMD model, an RAW takes 43(v7N- 1) electronic and O(N3/1) OTIS moves while on the MIMD model, it takes 26(V7 - 1) electronic and O(N3/') OTIS moves.
4.I6 Summar

Our algorithms run faster than the simulation of the fastest algorithms known for 4D meshes. Tables 4.3 and 4.4 summarizes the complexities of our algorithms and those of the corresponding ones obtained by simulating the best 4D-mesh algorithms on SIMD and MIMD models respectively. Note that the worst case complexities are listed for the broadcast and window broadcast operation, and that of the case when vW is even is presented for the data sum operation on the MIMD model. Also, the complexities listed for circular shift, data accumulation, and adjacent sum assume that the shift distance is < vK/2 on the MIMD model. Both tables give only the dominating v/W terms for sorting. Our algorithms for data broadcast, data sum, concentrate, distribute, and generalize are optimal.






58


Table 4.3. Comparison of complexities on SIMD model


Simulation Ours
Operation Electronic OTIS Electronic OTIS
Broadcast 4(vN - 1) 4(/N-- 1) 4(/N- I 1
WindowBroadcast 4 N- 2w-2 4(v/N-i1) 4/N - 2w - 2 2
Prefix Sum 7(vN - 1) 6(VN - I) 7(V/N - 1) 2
Data Sum 8(/N - 1) 4(/N - 1) 8(VN - 1) 1
Rank 7(VN - 1) 6(v/N - 1) 7(v/N - 1) 2
Regular Shift s 2s s 2
Circular Shift /N7 2/N -/N 2
Data Accumulation vrN 2-/N /N 2
Consecutive Sum 2(M - 1) 4(M - 1) 2(M - 1) 2
Adjacent Sum V/N 2v/N * /N 2
Concentrate 8(v/N - IL 8(v/N - 1) 7(/N - 1) 2
Distribute 8(v/N - 1) 8( /N - 1) 7(VN - 1) 2
Generalize 8(VN - 1) 8(VN - 1) 7(VN - 1) 2
Sorting 14v/N 12VN 22/Nq O(N3/8






59


Table 4.4. Comparison of complexities on MIMD model


Simulation Ours
Operation Electronic OTIS Electronic OTIS
Broadcast 4(VN - 1) 4(v/N - 1) 4(VN - 1) 1
Window Broadcast 4VN - 2w - 2 4(v/N - 1) 4VN - 2w - 2 2
Prefix Sum 7(VN- 1) 6(v/N - 1) 7(VWN - 1) 2
Data Sum 4 v/N 4VN 4VN 1
Rank 7(v'N - 1) 6(7/N - 1) 7(VN - 1) 2
Regular Shift a 2s s 2
Circular Shift 8 2s s 2
Data Accumulation M M M 2
Consecutive Sum M - 1 2(M - 1) M - 1 2
Adjacent Sum M 2M M 2
Concentrate 4(VN - 1) 4(VN - 1) 4(/N - 1) 2
Distribute 4(VWN - 1) 4(V/N - 1) 4(VN - 1) 2
Generalize 4(VN-) 4(/ -) 4(VN - 1) 2
Sorting 7/N 6 VN 11-IN O(N3/8)












CHAPTER 5
MATRIX MULTIPLICATIONS ON AN OTIS-MESH

In this chapter, we develop algorithms to multiply vectors of size kN and matrices of size kN x kN on an N2 processor OTIS-Mesh. These algorithms are developed for both of the matrix to OTIS-Mesh mapping schemes considered in Section 2.3--group row-major mapping (GRM) and group submatrix mapping (GSM). We begin, in Section 5.1, by describing the GRM and GSM schemes and making observations about the complexity of performing the matrix add and transpose operations. In Section 5.2, we develop the algorithms for various versions of vector and matrix multiplication.

For purposes of this chapter the essential differences between electronic and optical links are (a) optical links have much larger bandwidth than do electronic links; and (b) transfer times including latency are different on optical and electronic links. In our analysis, we count communication along electronic and optical interconnects separately. However, we use the simplifying assumption that any constant amount of data can be communicated over an optical link during an optical communication step while only a unit amount of data can be communicated over an electronic link during an electronic communication step. In this chapter, we assume that the processor mesh that represents any group of processors is a SIMD mesh. Therefore, in any given time step, data can be moved in only one of the four mesh dimensions: up, down, left, or right. Extensions to MIMD meshes are straightforward and thus omitted.
5.1 Mapping Matrices Onto An OTIS-Mesh

In Section 2.3 we described the GRM and GSM mapping of a matrix. For the GSM mapping, we introduce the following notation. Matrix element (i, j) is mapped


60







61


to processor (if, jf, i,, jm) where if = [i/vWrJ, i. = i mod V, jf = [j/v/NJ, and jm = j mod v .
The GRM and GSM mappings of a row or column vector are obtained from the corresponding mapping of an N x N matrix by extracting the sub-mapping corresponding to row zero or column zero of the matrix. The GRM and GSM mappings of a kN x kN matrix are obtained by partitioning the kN x kN matrix into N x N blocks of size k x k each. The N x N block matrix is then mapped onto the N2 processor OTIS-Mesh, one block per processor, using the standard GRM and GSM schemes described above. 1 x kN and kN x 1 vectors are mapped by using the sub-mapping corresponding to row zero or column zero of the kN x kN matrix mapping.

It is easy to see that regardless of which mapping is used, matrix as well as vector addition and subtraction requires no interprocessor communication. Two kN x kN matrices can be added or subtracted in 0(k2) time and two vectors of size kN can be added or subtracted in 0(k) time.

Algorithms for the matrix transpose operation were developed in Section 3.1. A kN x kN matrix can be transposed using a single OTIS move and no electronic moves when the GRM mapping is used. When the GSM mapping is used, the transpose requires 8k2(rN - 1) electronic and 2 OTIS moves. In either case, 0(k2) intraprocessor moves are needed to transpose the k x k block stored in a processor.
5.2 Multiplication Algorithm
5.2.1 Column Vector x Row Vector
GRM

First consider the GRM mapping. When an N x 1 column vector A and a 1 x N row vector B are multiplied, the result is an N x N matrix. This is to be stored in the OTIS-Mesh using the GRM mapping.







62


Step 1: Perform an OTIS move on B.

Step 2: Broadcast the A and B data in each group to all processors of the group. Step 3: Perform an OTIS move on B.

Step 4: Each processor multiplies its A and B data.


Figure 5.1. GRM Column x Row Multiplication

Initially, the element Ai in row i of A is in the 0th processor of group i (i.e., processor (i,0)) and the jth element Bj of B is in processor (0,j). Following the multiplication, the (ij) element A;Bj of the product matrix is to be in processor (i, j). The four step algorithm given in Figure 5.1 performs the multiplication.
Following Step 1, B, is in processor (j,0), 0 < j < N; and following the broadcast of Step 2, processor (i, *) has A, (* denotes all permissible indexes; in this case indexes are in the range 10, N)) and processor (j, *) has B,. After the OTIS move of Step 3, processor (ij) has A. and B,. Consequently, following the multiplication of Step 4, processor (i, j) has the (i, j)th entry of the result matrix. Therefore, the algorithm correctly multiplies the vectors A and B.

For the complexity analysis, we see that 2 OTIS moves are made in Steps 1 and 3 together. Step 2 can be done using 2VN electronic moves by first sending A and B data initially in processor 0 of a group down column zero of that group. Since only one piece of data can be moved at a time along an electronic link, this column broadcast of the A and B data can be done in VN moves if we pipeline the data movement down column zero (i.e., the B data trail the A data by one column processor). Next the A and B data in each processor of column 0 are broadcast along rows using a similar pipelining. This requires another vrN electronic moves.







63


The complexity of our GRM column x row algorithm is therefore 2vfN electronic and 2 OTIS moves.


Theorem 5.2.1 Our column x row algorithm is an optimal algorithm.

Proof To see this, first note that all the B values are initially in group 0, and all need to get to group 1 (say) either in the form of A1B, or simply B,. The only way data can move from one group to another is via an OTIS move, and a single OTIS move can only move a constant number of the B,'s accumulated into a single processor of group 0 into a single processor of group 1. Therefore, at least 2 OTIS moves are needed.

Also, 2VN-' electronic moves are necessary. To see this, observe that Bo is initially in processor (0,0) and its influence must be seen at all processors (*,0) because the (*,0) element of the result is A.BD, which is to be left in processor (*,0). OTIS moves can only transpose group and local processor indexes. To affect a change from (0,0) to (*, 0), 2vY - 2 electronic moves (vfN - 1 rightward row moves and v/N - 1 downward column moves) are essential. Further, A0 is initially in (0,0) and all values in (0, *) depend on A0. Therefore at least vf- 1 rightward row moves and v/N- 1 downward column moves are needed to communicate the AG value directly or indirectly to (0, *). Since only unit data can flow along an electronic link in a single move, we cannot overlap all of the rightward row moves needed for A0 and B0. Therefore, at least VN/ rightward moves must be made. Similarly at least V7N downward moves must be made. 0
The algorithm of Figure 5.1 also can be used when A is a kN x 1 vector and B a 1 x kN vector. Now, in Steps 1 and 3, blocks of k values are moved from a single processor via an OTIS move. In Step 2, blocks of size k are to be broadcast. The







64


Step 1: Processors that have a B value broadcast this B value to all processors in the same column of the group.

Step 2: Processors that have an A value broadcast this A value to all processors in the same row of the group.

Step 3: Perform an OTIS move on the A and B values in a processor. Step 4: Same as Step 1.

Step 5: Same as Step 2.

Step 6: Same as Step 3.

Step 7: Each processor multiplies its A and B values to produce an element of the product matrix.


Figure 5.2. GSM Column x Row multiply algorithm

strategy is the same as for the case k = 1; however, now we must pipeline the 2k A and B values in a processor for the column and row broadcast steps. This pipelining takes 2V +4k - 4 electronic moves. Steps 1 and 3 still take 2 OTIS moves as we can move k element blocks using a single OTIS move. In Step 4, each processor performs k2 multiplications to generate a k x k block of the product matrix. GSM

For A an N x 1 vector and B a 1 x N vector, we start with A, in processor (if,0, i,,0) and B, in (0, jf,0, jm) and are to leave the product term C,, = A1B, in (if, Jf, im, jm). The algorithm of Figure 5.2 does this.

Step I moves B, from (0, j, 0, jm) to (0, if, *, jm) and Step 2 moves A. from (i, 0,im,0) to (if,0, im,*). Following Step 3, Bi is in (*,jn,O,jf) and Ai is in (im, *,if,O). Step 4 now moves B, from (*,j.,O,jf) to (*,jm, *,jf) and Step 5 moves A. from (4, *, if, 0) to (im, *, if, *). Following Step 6, processor (if, jf, i., jm) has A; and B. Therefore, Step 7 correctly computes the product element C1, = Agb,.







65


The number of data moves is 4(VW - 1) electronic and 2 OTIS moves. The algorithm of Figure 5.2 is optimal because the A0 initially in (0,0,0,0) affects the final value in (fN - 1,0, v' - 1,0). This requires 2(V/ - 1) electronic column moves. Further, the B0 initially in (0,0,0,0) affects the final value in (0, vW - 1, 0, .1 - 1). This requires 2(vW - 1) electronic row moves. Additionally, VNY A values initially in group (0,0) affect the final values in group (0,1). This requires at least 2 OTIS moves (assume that VW > 2).
The algorithm of Figure 5.2 can also be used when k > 1. Now, the broadcasts and each OTIS move involves 2 blocks of k elements each. The broadcasts are done by pipelining the transfer of the k elements in a block and each OTIS move simply does a block transfer of the k elements. The total number of data move steps becomes 4vW + 4k - 8 electronic and 2 OTIS. Step 7 produces a k x k block of the result matrix using k2 multiplication steps.
5.2.2 Row Vector x Column Vector
GRM

For a 1 x N row vector A and an N x I column vector B, we begin with A, in (0, i) and Bi in (i, 0). The result E- AjB is to be left in (0,0). In the algorithm of Figure 5.3, Bi is moved from (i,0) to (0,i) in Step 1. The sum =' AiBi is computed in Step 3 by first moving the products of Step 2 upward to row 0 and adding terms in the row zero processors. Then the partial sums are moved leftward along row zero and the result computed in (0,0). The algorithm requires 1 OTIS and 2(VN- - 1) electronic moves. It is obvious that the algorithm is optimal.
When the vectors are of size 1 x kN and kN x 1, respectively, Step 2 multiplies a 1 x k block of A with a k x 1 block of B. This takes 0(k) time. We assume that the cost of 0(k) arithmetics is considerably less than the cost of an electronic move







66


Step 1: Perform an OTIS move on B values. Step 2: Each processor of group 0 multiplies its A and B values. Step 3: Sum the products of Step 2 by columns and finally along row zero, leaving the result in (0,0).


Figure 5.3. GRM Row x Column Multiply Step 1: Groups with A values move their A values from row 0 to column 0 using the data paths of Figure 5.5.

Step 2: Perform an OTIS move on all data. Step 3: Shift the A values leftward along row 0 of a group and the B values upward along column 0 and compute the sum of products in the (0,0) processor of each
group that has A and B values.

Step 4: Perform an OTIS move on the product sums computed in Step 3. Step 5: Shift the product sums upward along column 0 of group 0, summing these
sums in processor (0,0).


Figure 5.4. GSM Row x Column Multiply and, therefore, make no attempt to utilize processors from other groups to reduce the time spent on arithmetic operations. The data moves required by the algorithm of Figure 5.3 still are 2(VY - 1) electronic and 1 OTIS.


When multiplying a 1 x N vector and an N x 1 vector using the GSM mapping, the algorithm of Figure 5.4 can be used.

The algorithm of Figure 5.4 begins with A, in (0, if, 0, im) and Bi in (if, 0, im, 0). In Step 1, A. is moved to (0, if, i, 0) by performing ON - 1 downward moves and N - 1 leftward moves as in Figure 5.5. Following Step 2, A, is in (im, 0, 0, if) and







67


I I s
I I

I I





Figure 5.5. Data Paths Used in Step 1 of Figure 5.4

B is in (im, 0, if, 0). In Step 3, (im, 0,0,0) sums up VN of the terms that contribute to the result. In Step 4, these sums are moved to (0, 0, i, 0) and are added together in Step 5. Steps I and 3 take 2(vW - 1) electronic moves each and Step 5 takes V - 1 electronic moves. The total number of data moves is therefore 5(v'N - 1) electronic and 2 OTIS moves.

A straightforward generalization of the algorithm of Figure 5.4 to the case when we care multiplying a 1 x kN row with a kN x 1 column results in excessive complexity when k > 1. This is so because the pipelining of Step 3 takes 2k(VW- 1) electronic moves. When k > 1, the number of data moves is reduced by using the algorithm of Figure 5.6.

The algorithm of Figure 5.6 begins with the ith block of A in (0, if, 0, i.) and the ith block of B in (if, 0, im, 0). In Step 1, the ith block of A is moved to (0, if, i4,, 0). And following Step 2, the ith block of A is in (i, 0,0, if) while the ith block of B is in (im,0, if,0). Step 3 moves the ith block of A from (im,0,0, if) to (im,0, if,0). Now (i, 0, if, 0) contains block i of A and B. These blocks are multiplied in Step 4 to produce a single number in (i., 0, if, 0). In Step 5, the numbers computed in







68


Step 1: Groups with A values move their A value blocks from row 0 to column 0
using pipelining and the data paths of Figure 5.5.

Step 2: Perform an OTIS move on all data blocks.

Step 3: Same as Step 1.

Step 4: Processors with an A and a B block multiply their blocks (these are the
column 0 processors of each column 0 group).

Step 5: The column 0 processors, in each column 0 group, shift their Step 4 results
upward along column 0. The results are added together by the (0,0) processor
in each group.

Step 6: Perform an OTIS move on the sums computed by the (0,0) processors in Step
5.

Step 7: In group (0,0), the column 0 processors shift the values received in Step 6
upward to the (0,0) processor of the group. The (0,0) processor adds these
values together.


Figure 5.6. GSM Row x Column Multiply for k > 1







69


Step 1: Perform an OTIS move on A values. Step 2: Processor 0 of each group broadcasts its A value to the remaining processors in its group.

Step 3: All processors multiply their A and B values. Step 4: Perform an OTIS move on the products computed in Step 3. Step 5: Processor 0 of each group sums the products from all processors in the same group.

Step 6: Perform an OTIS move on the sums computed in Step 5.


Figure 5.7. GRM Row Vector x Matrix Multiply

group (im, 0) are summed in processor (im, 0,0,0). The OTIS move of Step 6 moves the resultant sums to (0,0, im, 0). These resultant sums are added together in Step

7.

The number of data moves performed by the algorithm of Figure 5.6 is 6VW-+ 4k - 10 electronic and 2 OTIS.
5.2.3 Row Vector x Matrix
RM

We are to multiply a 1 x N row vector A and an N x N matrix B. The result is a 1 x N vector C such that Cj = Ev1 AjB. Initially, A. is in (0, i) and Bij is in (i, j) and the result is to be left so that Ci is in (0, i). The multiplication algorithm is given in Figure 5.7.

In Step 1, A, is moved from (0, i) to (i, 0). Following Step 2, processor (j, i) has A, and B1. Processor (j,i) computes A3B,, in Step 3 and in Step 4, AB,, is moved to processor (i, j). Processor (i,0) computes Cj = EN-1 AB in Step 5. In







70


Step 6, C, is moved from (i,0) to (0, i). The complexity of the algorithm is 4(Vf-1) electronic and 3 OTIS moves.
When A is a 1 x kN vector and B a kN x kN matrix, a block of k A values are moved in Step 1 of Figure 5.7; the broadcast of the A block in Step 2 is done in 2(VW + k - 2) electronic moves by pipelining the broadcast of the k values; the multiplication of Step 3 is between a 1 x k vector and a k x k matrix; and the OTIS move of Step 4 moves 1 x k blocks. To do the sum of Step 5, we first sum along rows. This is done in vW + k - 2 electronic moves by pipelining the k sums to be computed. Next the partial sums in column 0 are summed; again using pipelining. Step 5 takes 2(VN + k - 2) steps. Adding in the OTIS move of Step 6, the total number of moves becomes 4VN+ 4k - 8 electronic and 3 OTIS. GSM

Our GSM algorithm to multiply a 1 x N row vector A and an N x N matrix B is given in Figure 5.8. Note that the algorithm begins with A, in (0, jf, 0, jm) and Bjj in (jf , if , j,,, i,,).
Step 1 moves A, from (0, jf,0, jm) to (0, jf, jm,0) and following Step 2, A, is in (0, jf, jm, *). Following Step 3, Aj is in (j., *, 0, jf) and B,, is in (jm, im, jf, if). After Step 5, A, is in (jm,*,jf,*). Therefore, in Step 6, processor (jm,im,jf,if) computes AB,,. In Step 7, processor (jm, im, 0, if) computes Eq m r-. AB1, which is then sent to (0, if, jm, i..) in Step 8. Finally in Step 9 (0, if, 0, im) computes C, = E _- A B,,.
Steps 1 and 4 take 2(vW- 1) electronic moves each; Steps 2, 5, 7, and 9 take
- 1 electronic moves each; and Steps 3 and 8 take 1 OTIS move each. The total number of moves is 8(v'W - 1) electronic and 2 OTIS.







71


Step 1: In each group move the A values from row 0 to column 0 using the data
paths of Figure 5.5.

Step 2: The column 0 processors broadcast their A values to all processors in the
same group and on the same row.

Step 3: Perform an OTIS move on all A and B values. Step 4: Same as Step 1.
Step 5: Same as Step 2.

Step 6: All processors multiply their A and B values. Step 7: The processor in row 0 of each group sum the products of Step 6 that are in
the same column.

Step 8: Perform an OTIS move on the sums of Step 7. Step 9: The processors in row 0 of group (0, *) sum the values received in Step 8 that
are in the same column.


Figure 5.8. GSM Row Vector x Matrix Multiply







72


Step 1: Processor 0 of each group broadcasts its B value to all processors in its group. Step 2: Perform an OTIS move on B values. Step 3: All processors multiply their A and B values. Step 4: Processor 0 of each group sums the products computed in Step 3 by all processors in its group.


Figure 5.9. GRM Matrix x Column Vector Multiply

When A is a 1 x kN vector and B a kN x kN matrix, Steps 1 and 4 can be done with 2(vrN- + k - 2) electronic moves each by transmitting the k values in each processor in a pipelined fashion; Steps 2 and 5 take V + k - 2 (again using pipelining) electronic moves; Steps 7 and 9 can be done in vW+ k - 2 moves each using the pipelined summing scheme used in Step 5 of Figure 5.7. The total number of moves is 8(VNW + k - 2) electronic and 2 OTIS.
5.2.4 Matrix x Column Vector
GRM

We start with an N x N matrix A and an N x 1 column vector B and compute the N x 1 column vector C such that C = EN-. A,,B,. Initially, A,, is in (ij) and B, is in (j, 0). On termination, C, is to be in (i, 0). Our algorithm to perform the multiplication is given in Figure 5.9.

Following Step 1, B, is in (j, *) and following Step 2 it is in (*,j). In Step 3, (i, j) computes A,,B, and in Step 4, (i, 0) computes Ci = >2,J A;,B,. The number of data moves is 4(v'N-1) electronic (Steps 1 and 4 each require 2(v'-1) electronic moves) and 1 OTIS.

Theorem a.2.2 The GRM matrix x column vector multiplication algorithm of Figure 5.9 is optimal.







73


Proof Since the value of Co depends on all Ao values, information about all these A values must get to (0,0) either directly or indirectly. For this to happen, at least

- 1 leftward row moves and vW - 1 upward column moves must be made. Let the snake-like row-major index of the bottom right processor of a group be q. Since C, = E AV B,, information originally in (0,0) (i.e., BO) must get to (q, 0) directly or indirectly. This requires a minimum of VTN - 1 rightward row moves and vF - 1 downward column moves plus one OTIS move. The row and column moves required for the computation of Co and C, are in opposite directions and cannot be overlapped in the SIMD model. Therefore, at least 4(VY - 1) electronic and 1 OTIS moves are needed. 0

When A is a kN x kN matrix and B a kN x 1 vector, we use the algorithm of Figure 5.9 and pipelining as used for the case when A is a 1 x kN vector and B a kN x kN matrix. The number of moves is 4(-vfR+ k - 2) electronic and 1 OTIS. GSM

The GSM matrix x vector multiplication algorithm is very similar to the GSM vector x matrix algorithm of Figure 5.8. The steps are given in Figure 5.10. Note that we start with A, in (if, if, iM, jM) and Bi in (if,0, im,0).

In Step 1, B, is moved from (jO,jm,0) to (jf,0,0,jm). Following Step 2, B, is in (jf,0,*, jm). The OTIS move of Step 3 moves B, to (*, jm, jj, 0) and A, to (i., jm, if, jf). Steps 4 and 5 first move B, to (*, jm,0, jj) and then to (*, jm,*, jf). Following Step 5, (im, j., if, jf) has A, and B. In Step 6, (i,, jm, if, jf) computes AgB,. In Step 7, processor (i.,,jm, if,0) computes E -=j,. Aq3B,, which
is then sent to (if,0,i,j.m) in Step 8. Finally, in Step 9 (if,0,im,0) computes EyN A3B,. The total number of data moves is 8(VN?- - 1) electronic and 2 OTIS.







74


Step 1: In each group move the B values from column 0 to row 0 using the data
paths of Figure 5.5 in the reverse direction.

Step 2: The row 0 processors of each group broadcast their B values to all processors
in the same group and on the same column.

Step 3: Perform an OTIS moves on all A and B values. Step 4: Same as Step 1.

Step 5: Same as Step 2.

Step 6: All processors multiply their A and B values. Step 7: The processor in column 0 of each group sum the products of Step 6 that are
in the same row.

Step 8: Perform an OTIS move on the sums of Step 7. Step 9: The processors in column 0 of group (*, 0) sum the values received in Step 8
that are in the same row.


Figure 5.10. GSM Matrix x Column Vector Multiply







75


Step 1: Perform an OTIS move on B.

Step 2: Each processor of each group accumulates all A and B values in its group. Step 3: Move the accumulated B values along the OTIS connection. Step 4: Each processor computes the inner product of the A and B values it has.


Figure 5.11. O(N) Memory GRM Matrix x Matrix Multiply

In the case when A is a kN x kN matrix and B a kN x 1 vector, we use the algorithm of Figure 5.10 and pipelining as used for the case when A is a 1 x kN vector and B a kN x kN matrix. The number of moves is 8(x/i-+ k - 2) electronic and 2 OTIS.
5.2.5 Matrix x Matrix
0(N) Memory/Processor Algorithms

When each processor has O(N) memory, it is possible to accumulate an entire column (or row) into each processor. This leads to simplified algorithms. Consider the case when we are to multiply two N x N matrices A and B.


G.RM The GRM algorithm is given in Figure 5.11.
We begin with As3 and Bj in (i, j). Following Step 1, (i, j) has Aj and Bi. After Step 2, (i, j) has row i of A and column i of B. Following Step 3, (i, j) has row i of A and column j of B. In Step 4, (i, j) computes Ca, = EN-1 AjA, Bkj.

Step 2 can be done in two stages. In the first stage, the B values are accumulated; and in the second stage the A values are accumulated. To accumulate the B values, each processor first accumulates all values from its row. This takes V/W- 1 rightward and V~- 1 leftward moves. Next, the accumulated blocks of vr values are accumulated along columns by making v4N_- 1 upward and VW- 1 downward moves of







76


Step 1: Perform an OTIS move on A and B values. Step 2: Each processor accumulates all VfK A values in the same row and group as well as all vfN- B values in the same column and group.

Step 3: Move the accumulated A and B values along the OTIS connection. Step 4: Each processor accumulates all N A values in the same row and group as well as all N B values in the same row and group.

Step 5: Each processor computes the inner product of the A and B values it has.


Figure 5.12. O(N) Memory GSM Matrix x Matrix Multiply

blocks of size VN. The total stage 1 moves are 2V/~(V'2-1)+2(vW-1) = 2(N-1). Stage 2 is done similarly. Step 3 takes N/K OTIS moves where K is the maximum number of B values that can be moved in unit time over an optical link. The total number of moves needed by the algorithm of Figure 5.11 is 4(N - 1) electronic and N/K +1 OTIS. Each processor needs memory for N A values and N B values. The memory requirements can be reduced to N + vW by delaying stage 2 of Step 2 to after Step 3 and coupling Step 4 with the columnwise movement of the VI size packets of A during stage 2.

The algorithm of Figure 5.11 is easily generalized to the case when A and B are kN x kN matrices. Operations previously performed on matrix elements are now performed on k x k blocks of elements. The data movement counts are 1 OTIS in Step 1, 4k2(N - 1) electronic in Step 2, and k2N/K OTIS in Step 3. The total is 4k2(N - 1) electronic and k2N/K + 1 OTIS.


GSM Our GSM algorithm to multiply two N x N matrices is given in Figure 5.12.







77


Following Step 1, A,1 and B1, are in (im, jm, if,jf). Following Step 2, (imjm, if, *) has the VN# A values A, such that q. = jn and (im, jm, *, jf) has the VW B values Bj such that rm = im. These 1r blocks of A and B values are then moved to (if,*, im, j.) and (*, jj, jm, ,m), respectively. Following Step 4, (if, *, im, *) has row i of A and (*, j, *, jm) has column j of B. Therefore, (if, jf, im, jm) has row i of A and column j of B. The inner product computation of Step 5 leaves Cij = EN AkB k, in (if, jf, im, jn).

Step 1 takes 1 OTIS move. Step 2 takes 2(V"K - 1) electronic row moves to accumulate the A values and 2(VN - 1) electronic column moves to accumulate the B values. Step 3 takes 21K/K OTIS moves. In Step 4, each electronic move moves V data. Since a total of 4(1K-1) moves are made, the total cost is 41K(1K-1) unit electronic moves. Hence the total number of moves is 4(N - 1) electronic and 21/K +1 OTIS.

The algorithm is easily extended to the case when A and B are kN x kN matrices. The number of moves is 4k2(N - 1) electronic and 2k2 /7'/K + 1 OTIS. 0(1) Memory/Processor Algorithms

Our 0(1) memory algorithm is based on Cannon's algorithm [2] to multiply two N x N matrices on an N x N mesh connected computer. Cannon's algorithm was also used by Dekel, Nassimi, and Sahni [6] in their development of hypercube algorithms for the matrix multiplication. Cannon's algorithm is given in Figure 5.13.


RM We simulate Cannon's algorithm on the OTIS-Mesh. While obtaining the alignment of Step 1, we also obtain the reverse of each aligned row of A and each aligned column of B. The process for B is similar to that used for A except that we must precede and follow the algorithm for A by an OTIS move (the preceding OTIS







78


Step 1: [Align Matrix Elements] Move Ai,(j+i) mod N and B(j+i) awd Nj to mesh processor (i, j).

Step 2: [Initialize C,,] Processor (i, j) initializes its C value to the product of its A and B values.

Step 3: [Compute and Add Remaining Terms] Repeat N - 1 times:
{ Shift A values left circularly by 1;
Shift B values up circularly by 1;
C=C+A*B; }



Figure 5.13. Cannon's Matrix Multiplication Algorithm

Step 1: In each group, move A values upward along columns. As data moves through a processor, the processor saves a copy in case it is needed in the row the
processor is in.

Step 2: Same as Step 1 except that A values are moved downward. Step 3. In each row of each group, form the forward ordering. Step 4: In each row of each group, form the reverse ordering.


Figure 5.14. Moving A Values as per Step 1 of Cannon's Algorithm

move gets all B elements in the same column into the same group and the following OTIS move gets the columns to the proper processors). We describe the alignment of rows of A only (Figure 5.14).

Steps 1 and 2 each take VNY - 1 electronic moves. Following Step 2, a processor can have up to four A values-2 belonging to the aligned ordering of Step 1 of Cannon's algorithm, and 2 belonging to the reverse of this ordering. Each row contains a total of 2v'N values with each processor in the row having 0, 1, 2, 3, or







79


4 values. These values can be moved to the proper processors on the same row by making O(1K) leftward and rightward data moves. To align B takes 0(1K) electronic and 2 OTIS moves. Therefore, making 0(1f) electronic and 2 OTIS moves, we can obtain the Step 1 alignment of Cannon's algorithm as well as the reverse of this alignment.

The circular shift of A in Step 3 of Cannon's algorithm can be implemented as a forward shift along the snake of the reverse alignment and a backward shift along the snake of the aligned data. So each circular shift takes 4 electronic moves.

To do the circular shift on B, we retain a copy of the aligned and reversed B in each group prior to the second OTIS move done in Step 1. For each circular shift, we make 4 electronic moves in each group on the copy of B and then do an OTIS move to get the shifted B values to the desired processors. The total moves required by Step 3 is (N - 1) x (8 electronic and 1 OTIS ) = 8(N - 1) electronic and N -1 OTIS. Therefore, the GRM simulation of Cannon's algorithm can be done using 8N + O(1K) electronic and N + 1 OTIS moves.

The simulation just described works even when A and B are kNx kN matrices. Now, each element that is moved is a k x k block. Therefore an electronic block move takes k2 electronic move steps. The number of moves becomes 8k2N + 0(k'Ng) electronic and N + 1 OTIS.


GQSM To multiply two N x N matrices we use a two level simulation of Cannon's algorithm. At the top level, we view each N x N matrix as a K/ x matrix in which each element is a 1K x K submatrix. Let A and B be the N x N matrices to be multiplied and let BA and BB be the corresponding 1K x 1K matrices in which each element is a v/K x VW block or submatrix of A and B,







80


respectively. Initially, BA, and BBj are in group (i, j) of the OTIS-Mesh. Let C = A x B and let BC be the corresponding VW x v/W matrix of blocks of size rv x vN each. Since BC,, = Ev BAk x BB1, we can use Cannon's algorithm to compute BC. The products of Steps 2 and 3 now are products of submatrices or blocks of size IN x v/N, each block is in an OTIS group which is a v/fN x v/TN mesh. These submatrix products can in turn be done using Cannon's algorithm (this is the second level application of Cannon's algorithm).

To implement the two level scheme, we use the algorithm of Figure 5.15.

Steps 1 and 2 do the data alignment necessary to perform Steps 2 and 3 of Cannon's algorithm to multiply two blocks/submatrices of size V/N x 'N_ each. The forward and backward ordering of the A values can be obtained by making v/N - 1 leftward and 'N_-1 rightward moves of A values. Similarly the forward and backward ordering of B values can be done using 2(VN - 1) column moves. Following Step 2, each processor has 2 A values (one from the forward ordering and the other from the backward ordering) and 2 B values.

In Steps 3 and 4 the *N x vW blocks of submatrices of A and B are aligned. For this, an OTIS move is made on the A and B values, followed by 2('FTN - 1) electronic row moves and 2(VN - 1) electronic column moves, and finally an OTIS move. For the final OTIS move, we leave a copy of the As and Bs in the originating processors also. Now each processor has 8 A and 8 B values.

Step 5 is done using Steps 2 and 3 of Cannon's algorithm at a cost of 4(x/W-1) electronic moves. In Step 6, the A and B blocks are shifted by using the copies saved during the second OTIS moves of Steps 3 and 4 followed by an OTIS move. This shifting of A and B blocks takes 2 row electronic moves (both forward and backward A blocks are to be shifted in the opposite direction) plus 2 column electronic moves







81


Step 1: [Align A data within each group/block] Reorder A values in each row of
each group so that the A value originally in (*, *, i, (i + j) mod VN) is now in (*, *, i, j). Call this the forward A ordering. Also create the reverse of this row
ordering in each group. Call this the backward ordering.

Step 2: [Align B data within each group/block] Reorder B values in each column of
each group so that the B value originally in (*, *, (i + j) mod VN, j) is now in
(*,*, i, j). Also create the backward column ordering for the Bs.
Step 3: [Align the A blocks] Rearrange the blocks of A values obtained in Step 1 so
that the block originally in group (i, (i+j) mod vIN-) (i.e., in processors (i, (i+ j) mod VM*,*))is now in the group of (i,j). Also create the corresponding
backward row ordering for the A blocks.

Step 4: [Align the B blocks] Rearrange the blocks of B values obtained in Step 2 so
that the block originally in group ((i + j) mod vrN, j) is now in group (i, j).
Also create the corresponding backward column ordering for the B blocks.

Step 5: [Initialize block BCI,] BC,, = BA, x BBi.

Step 6: [Compute and add remaining terms]

Repeat vfN- - 1 times:
{ Shift A blocks left circularly by 1 group;
Shift B blocks up circularly by 1 group;
BCj = BCq + BAj x BB,, }


Figure 5.15. GSM Matrix x Matrix Multiply







82


Table 5.1. Comparison between GRM and GSM schemes

Embedding Scheme GRM GSM
Operation Electronic OTIS Electronic OTIS
Column x Row 2,1N 2 4( N - 1) 2
Row x Column 2(JN - 1) 1 5(VN - 1) 2
Row x Matrix 4(vN - 1) 3 8(N - 1) 2
Matrix x Column 4(VN - 1) 1 8(*N - 1) 2
Matrix x Matrix O(N) 4(N - 1) N/K + 1 4(N - 1) 2/N/K + 1
Matrix x Matrix 0(1) 8N + 0(vN) N +1 4N + O(VN) N


for B blocks and 1 OTIS move. The block matrix multiply is done using steps 2 and 3 of Cannon's algorithm at a cost of 4(vW - 1) electronic and VN- 1) OTIS moves. The total number of moves made by the GSM algorithm is 4N + O(VR) electronic and vN OTIS.

The algorithm of Figure 5.15 is easily extended to the case when A and B are kN x kN matrices. The essential difference is that each element of a V# x v'N block is now itself a k x k block. So, each electronic data move becomes k2 unit moves. The number of data moves is therefore 4k2N + O(k2vrN) electronic and vW OTIS.
5.3 Summary

We have developed OTIS-Mesh algorithms for several variants of the matrix multiplication problem. For each variant, we have considered both the group row mapping and the group submatrix mapping. Our results are summarized in Table 5.1. As can be seen, the GSM mapping is superior for the case of matrix x matrix multiplication. However, for all other variants the GRM is superior. As noted in Section 2.3, GRM is also superior for the matrix transpose operation.












CHAPTER 6
IMAGE PROCESSING ON AN OTIS-MESH In this chapter, we focus on four problems from the image processing area. These problems are histogramming, histogram modification, Hough transform, and image shrinking expanding. As noted in Section 2.3, there are two plausible ways to map an N x N image onto an N2 processor OTIS-Mesh-group row mapping (GRM), and group submesh mapping (GSM).

Our histogramming and histogram modification algorithms are insensitive to how the image is mapped onto the OTIS-Mesh. Therefore, these algorithms are developed without regard to the mapping used. The algorithms for Hough transform and image shrinking and expanding depend on the particular mapping used. In Sections 6.3 and 6.4, we develop algorithms for both the GRM and GSM mappings.
6.1 Histogramming
6.1.1 Background

The input to the histogramming problem is an N x N digitized image I with I(i, j) being the gray level of pixel (i, j), 0 < i, j < N. The gray levels are integers in the range [0, B); that is, 0 < I(i, j) < B, 0 < i, j < N. The histogram of the image is a vector H such that H[b] is the number of pixels with gray value b, 0 < b < B.

Parallel algorithms to compute the histogram of an image have been developed for many parallel architectures. For example, Siegel et al.[49] have developed a histogramming algorithm for a p processor PASM multicomputer, p < N2, and Yasrebi et al.[57] have done this for the TRAC multicomputer. Grinberg et al.[10 have developed an algorithm to compute the histogram on an N2 processor cellular machine called the 3-D machine; Tanimoto [51] has developed an O(B + log N)


83







84


algorithm for a pyramid computer with an N x N base; Bestul and Davis [1] have developed an 0(viB + log(N/B)) algorithm for an N2 processor SIMD hypercube; and Jenq and Sahni [181 and Jang et al.[151 have developed algorithms for various reconfigurable mesh models.

The histogramming algorithms of Jang et al.[15] and Jenq and Sahni [18] partition B into the ranges 0 < B < vrN, vrN < B < N and B > N and solve for each range separately. Further, where appropriate, they consider the cases 0(1) memory per processor and O(B) memory per processor. We shall follow this strategy here also.
6.1.2 Algorithm for 0 < B
In this case, the histogram is left in row 0 of the group (0,0) mesh. Using our four-dimensional indexing scheme, processor (0, 0, 0, i) will have H[iJ, 0 < i < B, following the histogram computation. Our strategy is (a) compute the histogram for each row of each IN x VN mesh, (b) use the row histograms to obtain the histogram for each group, and (c) combine the group histograms into a single histogram. More formally, our algorithm for the case 0 < B < v'W is:

Step 1: Processor (gz,g,pz,p,) determines the number of pixels on its row of the

group (g., g,) mesh, 0 p, < B.

Step 2: Processor (g,9,,0, py) adds up the values, along its column, that were computed in Step 1, 0 < p, < B.

Step 3: Perform an OTIS move on the values computed in Step 2. Step 4: Processor (0, gy, 0,0) sums up all the values received in Step 3 by processors

in group (0, gy), 0 < g, < B.







85


Step 5: Perform an OTIS move on the results computed in Step 4.

Step I is accomplished by shifting the image values first leftward and then rightward within rows of the meshes. When an image value passes through processor (gz,g,p.,hpy), this processor increments its counter if the image value equals p,. v/W - 1 leftward and B - 1 rightward shifts, for a total of vW + B - 2 electronic moves are needed. Step 2 is done by shifting the counts upwards along columns to row 0 of the group. This step takes vW - 1 electronic moves. In Step 4 we add all values in a vrN x V/IN mesh. This takes 2(VN - 1) electronic moves. Therefore, histogramming can be done with 4(v/W - 1) + B - 1 electronic and 2 OTIS moves.


Theorem 6. 1. Our histogramming algorithm for 0 < B < vW is optimal

Proof We need to show that every OTIS-Mesh histogramming algorithm must make 4(vf - 1) + B - 1 electronic and 2 OTIS moves when 0 < B < vN. To see this, consider an image in which 1(0,0) = B - 1 and I(N - 1, N - 1) = 0. Since 1(0,0) is mapped to processor (0,0,0,0) of the OTIS-Mesh and since H[B - 1] is left in processor (0, 0, 0, B - 1), it is necessary for the histogramming algorithm to move in formation from (0,0,0,0) to (0, 0, 0, B - 1), requiring at least B - 1 electronic moves that increase the row index (note that OTIS moves can only transpose indices, not change their value). Further, since I(N - 1, N - 1) is mapped to processor (vrN - 1, vR# - 1, VNW - 1, VNW - 1) and H[01 is left in (0,0,0,0), it is necessary to make at least v/ -1 electronic moves to decrease each of the four indices 9, 9,, Pz, and p1, a total of 4(vWS - 1) electronic moves. Since moves that increase an index cannot be overlapped with those that decrease an index in the SIMD model, at least 4(vrN - 1) + B - 1 electronic moves are necessary.










For the 2 OTIS moves, we see that information from all processors in group (VK - 1, ,r-1) (say) must get to group 0. Assume that group (VN- 1, v - 1)
has at least 2 gray values. To get information out of a group, an OTIS move must be made. A single OTIS move, however, moves data from different processors of a group into processors of different groups. Therefore, at least 2 OTIS moves are necessary to move different data from 2 or more different processors into a single other group.


6.1.3 Algorithms for V ! < B < N

We first present an algorithm that uses 0(1) memory per processor. Next, we present an optimal algorithm that uses 0(vf) memory per processor. Our 0(1) memory per processor algorithm leaves the histogram in the processors of group 0, one histogram value per processor. More specifically, processor (0,0, p,, p,) contains H[pNV/N +pj. The algorithm is given below:

Step 1: Processor (p.,p,) of each group computes H[p.v'7+p,] for the subimage in

its group. This is done by sorting the gray values in a group using the integer sort algorithm of Krizanc [241. During the sort, equal gray values are combined

into a single gray value.

Step 2: Perform an OTIS move on the H values computed in Step 1. Step 3: Processor (0,0) of each group sums the H values in its group that were

received in Step 2. That is, processor (g., gs, 0,0) computes H[gV, + g,] for
the entire image.

Step 4: Processor (0,0) of each group performs an OTIS move on the sum computed

in Step 3. Following this move, processor (0,0,p,p,) has H[pv-+ p,].







87


Step 1 takes 41N+o(VN-) electronic moves; Steps 2 and 4 take I OTIS move each; and Step 3 takes 2(VN - 1) electronic moves. The total number of moves is 6VW + o(vW) electronic moves and 2 OTIS moves.

eorem6.1.2 Every histogramming algorithm for the case vr < B < N must make at least 5(VN7- 1) + [(B - 1)/vN'j - 1 electronic and 2 OTIS moves to compute the histogram configuration obtained by the 0(1) memory algorithm. When the output configuration has the histogram in a vB x v' submesh of group (0,0), at least 4VN/+ 2y'- -6 electronic and 2 OTIS moves are needed. Proof Consider the image in which the gray value of the pixel in processor ( 1, V/' - 1, -VN- 1, I - 1) is 0 and the pixels in group (,IN- 1, /N - 1) have at least 2 different gray values. Using the reasoning in Theorem 6.1.1, we see that at least 4(v,/ - 1) electronic and 2 OTIS moves are needed to get the histogram information from the group (v/N - 1, VN/# - 1) processors to the target processors in group (0,0).
Next, suppose that the pixel in processor (0, 0, 0, 0) has gray value v such that v mod vNY = VN-I and [v/VWJ = [(B-1)/VWJ -1. It takes [(B-1)/VWJ -1+ v/N - 1 electronic moves to get information from (0,0,0,0) to (0, 0, [(B - 1)/VN-j 1, vN- 1) and these electronic moves cannot be overlapped with those used to move information from (IN-1, vrN -1, v/N -1, vfN-1) to (0,0,0,0). Therefore, at least 5(4 - 1) + [(B - 1)/v/NJ - 1 electronic and 2 OTIS moves are needed to obtain the output configuration obtained by the 0(1) memory algorithm.
For the vrB x V5 submesh output configuration, suppose that the gray value in processor (0,0,0,0) is B - 1. Therefore, information needs to flow from (0,0,0,0) to (0,0, vf/B - 1, vIB - 1), requiring 2(v'/ - 1) electronic moves that cannot be







88


overlapped with the electronic moves made when moving information from (VN 1, V/K - 1, v/K - 1, V/N - 1) to (0,0,0,0). Thus a total of 4v/W + 2vrB- - 6 electronic and 2 OTIS moves are necessary. 0

An optimal histogramming algorithm for the V/B x VE submesh output configuration is possible when O(v/l) memory per processor is available. The algorithm given below adapts the method used in Jenq and Sahni [181, and assumes that B is a perfect square and that v/E divides v/K. Step 1: Tile each v/K x v/K mesh by v/7_ x V/7B tiles. Step 2: Processor i on each row of each vrB-x v/r tile computes an array A[0 : V- 1] of values such that A[j] equals the number of pixels in that row of the tile whose
gray value is jV/ + i, 0 < i < vrB.

Step 3: Processors in the same column of each tile perform a consecutive sum operation; processor i of a tile column sums the A[i] values of the processors on its column. Following this step, processor (ij) of a tile has the number of pixels
in its tile whose gray value is iv/B + j.

Step 4: Perform a window sum operation on the results of Step 3 using a window
size VB_ x v/B. This operation does not span group boundaries. The result of the window sum operation is in the top left v/Tg x VB window/tile of each group. Following this operation, processor (g,,, g, i, j) has the number of pixels
in group (g, g.) whose gray value is iv + j.

Step 5: Do an OTIS move on the values computed in Step 4. Step 6: Processor (ga, g,, 0,0) sums all the values received by its group. Step 7: Do an OTIS move on the values computed by (g,,g,0,0) in Step 6.







89


For the time complexity, we see that Steps 2 and 3 take 2(v/T - 1) electronic moves each; Step 4 takes 2(vNW- - -vf/) electronic moves; Steps 5 and 7 take 1 OTIS move each; and Step 6 takes 2(v#- 1) electronic moves. The total number of moves is 4V'i- + 2vB - 6 electronic and 2 OTIS moves.
6.1.4 Algorithm for B > N

This case can be done with 22vW + O(N3/8) electronic and O(N3/8) OTIS moves by modifying the sort algorithm in Section 4.13 so that during the sort, pixels with the same gray value are combined into a single pixel.
6.2 Histogram Modification

Histogram modification is the process of changing the gray values of an image based on a mapping function f; f(i) gives the new gray value for pixels whose original gray value is i, 0 < i < B. In histogram flattening or equalization [40), the function f is computed by first computing the prefix sums S[i]= Xj=, H[j], 0 < i < B of the histogram. Next, f(i) is obtained using one of the following equations f(i) = [S[i]/BJ, 0 < i < B,

or

1(i) = [sil+s-11J, 0 < i < B

where S[-1] = 0.

In the OTIS implementation of histogram flattening, we explicitly consider only the case B = N. Other values of B may be handled similarly. The prefix sums may be computed from the histogram (which is in group 0) using 3(v"N - 1) electronic moves (Section 4.3). To compute f using the first equation, no additional moves are needed. When the second definition is used, additional electronic moves are needed to shift the prefix sums by 1 processor.







90


Following the computation of f, the gray values of all pixels must be updated according to f. When we are limited to 0(1) memory per processor, this updating of pixel values may be done by first performing a window broadcast of the f values to all groups. This broadcast can then be followed by a random access read (RAR) in which each processor obtains the needed f value from within its group. The window broadcast takes 2(V'~ - 1) electronic moves and the RAR takes 23VW + o(VfK) electronic moves [54]. Thus, the updating phase takes 25V7+o(VN) electronic and

2 OTIS moves.
When 0(v/N) (= 0(v )) memory per processor is available, the updating of group values may be done by first doing a window broadcast of the f values as was done in the 0(1) memory case. Next, each processor accumulates the f values in the ,N processors that are in its column. This accumulation is done in an array C. For a processor in column j of its group, C[i] = f(iVW + j), 0 < i, j < 'N. This accumulation step takes 2(v"N- 1) electronic moves. Following the construction of the C arrays, each processor sends a token to the processor on its row that has the f value it needs. When the token reaches the target processor, the f value is written into the token, and the token returned to the originating processor. This token send/receive step can be broken into two phases-one in which tokens are sent to and received from processors to the left of the source processors and another in which the target processors are to the right. Each of these phases takes 2(vN - 1) electronic moves. Thus, the 0(vr) memory algorithm takes 8(VN - 1) electronic and 2 OTIS moves to update the gray values following the computation of f.
The complexity of the O(VW) memory updating algorithm can be reduced to 6(vW - 1) electronic and 2 OTIS moves if the histogram computation phase saves, in processors 0 and -I - 1 of each row of a group, the gray values of all VN pixels




Full Text

PAGE 1

ALGORITHMS FOR THE OTIS OPTOELECTRONIC COMPUTER By CHIH-FANG WANG A DISSERTATION PRESENTED TO THE GRADV^E ! SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHmOSOPHY UNF/ERSITY OF FLORIDA 1998

PAGE 2

ACKNOWLEDGMENTS I wish to express my whole-hearted appreciation and gratitude to my advisor Professor Sartaj Sahni for giving countless hours of guidance in my research work. Without his support and patience, this research would not have been done. I would also like to thank other members in my supervisory committee, Dr. Sanguthevar Rajasekaran, Dr. Tim Davis, Dr. Yann-Hang Lee, and Dr. Haniph Latchman, for their interest and comments. Many thanks to my friends here in the department. Without your constant companion and encouragement, I would not have made it this far. Last, but not least, my greatest appreciation goes to my parents and my brother Max. Although I am half a globe away, I am always surrounded by their love and tender care. To them I dedicate this work. This research was supported, in part, by the Army Research Office under grant DAA H04-95-1-0111. ii

PAGE 3

TABLE OF CONTENTS ACKNOWLEDGMENTS 11 LIST OF FIGURES VI LIST OF TABLES vm ABSTRACT ,x 1 INTRODUCTION 1 1.1 Optical Transpose Interconnection System (OTIS) 2 1.2 OTIS Parallel Computers £ 1.3 Permutation Routing on OTIS Computers j> 1.4 This Dissertation 2 PROPERTIES OF OTIS-MESH 11 2.1 Diameter of the OTIS-Mesh 2.2 Simulation of a 4D Mesh }j 2.3 Simulation of a 2D Mesh 3 DATA REARRANGEMENT ON AN OTIS-MESH 18 19 3.1 Transpose 3.2 Perfect Shuffle 3.3 Unshuffle 3.4 Bit Reversal g 3.5 Vector Reversal ~ 3.6 Bit Shuffle £J 3.6.1 G y P, Swap g 3.6.2 Bit Shuffle g 3.7 Shuffled Row-Major g 3.8 BPC Permutations g 3.8.1 Definition g 3.8.2 Algorithm f° 3.9 Comparison 4 BASIC OPERATIONS ON AN OTIS-MESH 32 4.1 Data Broadcast 32 4.2 Window Broadcast jj 4.3 Prefix Sum f* 4.4 Data Sum * 4.5 Rank % 4.6 Shift J7 iii

PAGE 4

4.7 Data Accumulation 4.8 Consecutive Sum *» 4.9 Adjacent Sum *n 4.10 Concentrate TX 4.11 Distribute J' 4.12 Generalize V. 4.13 Sorting 4.14 Random Access Read (RAR) jg 4.15 Random Access Write (RAW) g 4.16 Summary 57 MATRDC MULTIPLICATIONS ON AN OTIS-MESH 60 5.1 Mapping Matrices Onto An OTIS-Mesh 60 5.2 Multiplication Algorithm 61 5.2.1 Column Vector x Row Vector j» 5.2.2 Row Vector x Column Vector 65 5.2.3 Row Vector x Matrix 69 5.2.4 Matrix x Column Vector 72 5.2.5 Matrix x Matrix 75 5.3 Summary 82 IMAGE PROCESSING ON AN OTIS-MESH 83 6.1 Histogramming g 6.1.1 Background M 6.1.2 Algorithm for 0 < B < VN 84 6.1.3 Algorithms for yfN < B < N 86 6.1.4 Algorithm for B > N 89 6.2 Histogram Modification 89 6.3 Shrinking and Expanding 91 6.3.1 Background j»J 6.3.2 GRM Mapping 92 6.3.3 GSM Mapping 96 6.4 Hough Transform JJJJ 6.4.1 Background 6.4.2 An Improved Algorithm For N x N Meshes 101 6.4.3 GRM Mapping }J6 6.4.4 GSM Mapping J07 6.5 Summary 107 OTIS-HYPERCUBE 109 7.1 OTIS-Hypercube Diameter 109 7.2 Simulation of an N* hypercube 1H 7.3 Common Data Rearrangements 112 7.3.1 Transpose b/2 1, . . . ,0,p l,...,p/2] }J3 7.3.2 Perfect Shuffle [0,p-l,p2,..., 1] 113 7.3.3 Unshuffle [p 2,p 3, . . . ,0,p 1] 114 iv

PAGE 5

7.3.4 Bit Reversal [0,1,..., p-1] • } 1fi 7.3.5 Vector Reversal [-(p-l),-(p2),..-, -0J . }{» 7.3.6 Bit Shuffle [p-l,p3, ,lp-2,p-4, 01. . .. . ^ 7.3.7 Shuffled Row-major [p-l,p/2-l,p2, p/22,..., p/2,0j . 118 7.4 BPC Permutations 12 o 7.5 Comparison 122 8 CONCLUSION 122 8.1 Outline of the Results 124 8.2 Open Problems REFERENCES BIOGRAPHICAL SKETCH 131 V

PAGE 6

LIST OF FIGURES 1.1 2-dimensional arrangement of L = 64 inputs when M = 4 and N = 16: (a) y/M x y/M = 2 x 2 grouping of inputs; (b) The (t, *) group, 0 < i < A/ = 4 1.2 Side view of the OTIS with M = 4 and AT = 16 4 1.3 Example of OTIS connections with 16 processors 6 1.4 Multistage interconnection network (MIN) defined by OTIS 7 2.1 16 Processor OTIS-Mesh 12 2 2 Mapping a 4 x 4 mesh onto a 16 processor OTIS-Mesh: (a) GRM; (b) GSM lt) 4.1 Data Configuration: (a) Initial; (b) Concentrated 45 4.2 Row-Column Transformation of Leighton's Column Sort 50 4.3 Example of Leighton's Column Sort 51 5.1 GRM Column x Row Multiplication 62 5.2 GSM Column x Row multiply algorithm 64 5.3 GRM Row x Column Multiply 66 5.4 GSM Row x Column Multiply 66 5.5 Data Paths Used in Step 1 of Figure 5.4 67 5.6 GSM Row x Column Multiply for k > 1 68 5.7 GRM Row Vector x Matrix Multiply 69 5.8 GSM Row Vector x Matrix Multiply 71 5.9 GRM Matrix x Column Vector Multiply 72 5.10 GSM Matrix x Column Vector Multiply 74 5.11 0(N) Memory GRM Matrix x Matrix Multiply 7 5 5.12 O(N) Memory GSM Matrix x Matrix Multiply 7 6 vi

PAGE 7

5.13 Cannon's Matrix Multiplication Algorithm 5.14 Moving A Values as per Step 1 of Cannon's Algorithm 5.15 GSM Matrix x Matrix Multiply 6.1 Data required in GRM for end processor: (a) q } even; (b) q f odd; (c) qj even; (d) qj odd 6.2 Data required in GRM mapping for middle processor: (a) q } even; (b) q f odd; (c) q f even; (d) q f odd 6.3 Data required in GSM mapping 6.4 Data required in GSM mapping when q f = 0 and q m / 0 6.5 Coordinate system used in Hough Transform 7.1 16 processor OTIS-Hypercube vii

PAGE 8

LIST OF TABLES 3.1 Optimal moves for 4D mesh and respective OTIS-Mesh simulations . 19 3.2 Source and destination of the BPC permutation [-0, 1,2, -3] in a 16 processor OTIS-Mesh 27 3.3 Permutations and their permutation vectors 27 3.4 Complexity Comparison of Common Data Rearrangement 31 4.1 Processors with data to concentrate 44 4.2 Net change in G x , G y , P x , and P y 46 4.3 Comparison of complexities on SIMD model 58 4.4 Comparison of complexities on MIMD model 59 5.1 Comparison between GRM and GSM schemes 82 7. 1 Optimal moves for N 2 — 2 M processor hypercube and respective OTISHypercube simulations I* 2 7.2 Illustration of the perfect shuffle algorithm on a 16 processor OTISHypercube ^5 7.3 Complexity Comparison of Common Data Rearrangement 121 viii

PAGE 9

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy ALGORITHMS FOR THE OTIS OPTOELECTRONIC COMPUTER By Chih-fang Wang August 1998 Chairman: Dr. Sartaj Sahni Major Department: Computer and Information Science and Engineering It is well known that optical interconnects are more effective (i.e., provide more bandwidth, speed, and less power consumption) than electronic interconnects when the interconnection distance becomes larger than a few millimeters. The OTIS optoelectronic computer provides the best of both worlds by using free-space optical interconnects to connect distant processors and electronic interconnect for processors that are close. Optical transpose interconnection system (OTIS) provides a fixed and easy to realize optical topology; the topology of the electronic interconnect is flexible. By using different electronic topologies, we arrive at different classes of OTIS computers. For example, OTIS-Mesh is a class of OTIS computer in which the electronic interconnect follows the mesh paradigm, and OTIS-Hypercube is another class of OTIS computer such that the hypercube topology is used to realize the electronic interconnect. ix

PAGE 10

In this dissertation we will describe the OTIS architecture as well as some of its properties. Algorithms for some frequently used permutations, BPC permutations, fundamental operations, and some applications will be presented for the OTIS-Mesh computer. Properties of OTIS-Hypercube will also be discussed, along with algorithms for commonly used data rearrangements and BPC permutations. x

PAGE 11

CHAPTER 1 INTRODUCTION It is well known that when communication distances exceed a few millimeters, optical interconnects provide speed (bandwidth) and power advantages over electronic interconnects [7, 23]. Therefore, in the construction of very large multiprocessor computers it is prudent to interconnect physically close processors using electronic interconnects and to use optical interconnects for pairs of processors that are distant. We shall assume that physically close processors are in the same physical package (chip, wafer, board) and processors that are not physically close are in different packages. As a result, electronic interconnects are used for intrapackage communications while optical interconnect is used for interpackage communication. Various combinations of interconnection networks for intrapackage (i.e., electronic) communications and interpackage (i.e., optical communications) have been proposed. In OTIS computers [12, 33, 58], optical interconnects are realized via a free space optical interconnect system known as the optical transpose interconnection system (OTIS). In this chapter, we begin by describing the OTIS. Next, we describe the OTISMesh and OTIS-Hypercube parallel computers that result, respectively, when the OTIS optical interconnect system is used for interpackage communication and a mesh or hypercube is used for intrapackage communication. Following that, we show that the OTIS computer can be used as a multistage interconnection network (MIN). Finally, we provide a brief description of the remaining chapters.

PAGE 12

2 (0,*) (1.*) (2,*) (3,*) (i,0) (1,1) (i,2) (1,3) (i,4) (1,5) (1,6) (1,7) (i.8) (1,9) (1,10X1,11) (i,12)(i,13)(i,14)(i,15) Figure 1.1. 2-dimensional arrangement of L = 64 inputs when M = 4 and N 16: (a)v^xv^ = 2x2 grouping of inputs; (b) The (t, *) group, 0 < t < M = 4 yi Optical Transpose Interconnection System (OTIS) The optical transpose interconnection system (OTIS) was proposed by Marsden et aL [33]. The OTIS connects L = MN inputs to L outputs using free space optics and two arrays of lenslets. The first lenslet array is a y/Kf x y/M array and the second one is of dimension y/N x y/N. Thus, a total of M + N lenslets are used. The L inputs and outputs are arranged to form ny/Zxy/Z array. The L inputs are arranged into y/M x \fM groups with each group containing N inputs arranged into a y/N x y/N array. Figure 1.1 shows the arrangement of the L = 64 inputs when M = 4 and N = 16. The M x N inputs are indexed (ij) with 0 < i < M and 0 < j < N. Inputs with the same t value are in the same y/N x y/N block. The notation (1,*), for example, refers to all inputs of the form In addition to using the two-dimensional notation (ij) to refer to an input, we also use a four-dimensional notation (ir, tc,>> jc) where (ir,i c ) gives the coordinates (row.column) of the y/N x y/N block that contains the input (see Figure 1.1(a)) and Or, Jc) gives coordinates of the element within a block (see Figure 1.1(b)). So all

PAGE 13

3 elements (t» with t = 0 have (Wc) = (0,0); those with t = 1 have (t„i«) = (0,1); those with i = 2 have (t r ,t c ) = (1,0); and those with t = 3 have (t r ,i c ) = (1,1). Similarly, all inputs with j = 3 have (j r Jc) = (0,3), and those with j = 12 have 0r,Jc) = (3,0). The L outputs are also arranged into a VI x v/I array. This time, however, the VZ x VI array is composed of VN x VN blocks with each block containing M outputs that are arranged as a y/M x VM array. The L = MN outputs are indexed (i,j) with 0 < t < N, 0 < J < M. All outputs of the form (i» are in the same block, block t. Block i is in position (t^O with i = i T \fN + i c of the x VN block arrangement. Outputs of the form (*,j) are in position (j r Jc) of their block, j = j rS /M + j c . In the physical realization of OTIS, the VI x y/I output arrangement is rotated 180°. We have 4 two-dimensional planes; the first is the \TL x VL input plane; the second is a VM x VM lenslet plane, the third is a x VN lenslet plane, and the fourth is the \TL x VI plane of outputs rotated 180°. When the OTIS is viewed from the side, only the first column of each of these planes is visible. Such a side view for the case L = M x N = 4 x 16 is shown in Figure 1.2. Notice that the first column of the input plane consists of the inputs (0,0), (0,4), (0,8), (0,12), (2,0), (2,4), (2,8), (2,12) which in 4D notation are (0,0,0,0), (0,0,1,0), (0,0,2,0), (0,0,3,0), (1,0,0,0), (1,0,1,0), (1,0,2,0), (1,0,3,0). The inputs in the same row as (0,0,0,0) are (0,*,0,*), those in the same row as (tr, icj r , jc) are (t,,*, >,*). The (t r ,>) values top to bottom are (0,0), (0,1), (0,2), (0,3), (1,0), (1,1), (1,2), (1,3). The first column in the output plane (after the 180° rotation) has the outputs (15,3), (15,1), (11,3), (11,1), (7,3), (7,1), (3,3), (3,1) which in 4D notation are (3,3,1,1), (3,3,0,1), (2,3,1,1), (2,3,0,1), (1,3,1,1), (1,3,0,1), (0,3,1,1), (0,3,0,1). The outputs in the same row as

PAGE 14

Figure 1.2. Side view of the OTIS with M = 4 and N = 16 (3,3,1,1) are (3,*,1,*); those in the same row as (t r ,t c ,ir.ic) are (t r , *J r ,*)The (V, j r ) values top to bottom are (3,1), (3,0), (2,1), (2,0), (1,1), (1,0), (0,1), (0,0). Each lens of Figure 1.2 denotes a row of lenslets and each O * row of inputs or outputs. The interconnection pattern defined by the given arrangement of inputs, outputs, and lenslets connects input (i, j) = (t r ,t e ,; r ,ic) to output 0,0 = 0V, jc,»r,«c)The connection is established via an optical ray that originates at input position (v.tcir.ie), goes through lenslet (ir,i c ) of the first lenslet array, then through lenslet f>, j c ) of the second lenslet array, and finally arrives at output position 0r,;e.*r»«c)The basic connectivity provided by the OTIS is an optical connection between input and output (j,0» 0 < t < A/, 0 < i < AT.

PAGE 15

5 1 7 OTIS Par cel Computers Marsden et d. [33] have proposed several parallel computer architectures in which OTIS is used to connect processors in different groups (packages) and an electronic interconnection network is used to connect processors in the same group. Since Krishnamoorthy et aL [23] have shown that bandwidth is maximized and power consumption minimized when an L = AT 2 processor OTIS computer is partitioned into N groups of N processors each, Zane et al. [58] limit the study of OTIS parallel computers so that each processor group (package) has N processors and the parallel computer has a total of N groups (packages). Let denote processor j of package i, 0 < i < N, 0 < < N. Processor i ± h connected to processor (j,i) using free space optics (i.e., OTIS). The only other connections available in an OTIS computer are the electronic intragroup connections. A generic 16 processor OTIS computer is shown in Figure 1.3. The solid boxes denote processors. Each processor is labeled (g,p) where g is the group index and p is the processor index. OTIS connections are shown by arrows. Intra group connections are not shown. In an OTIS-Mesh, processors in the same group are connected as a 2-D mesh [33, 58, 46] (Chapter 2); and in an OTIS-Hypercube (Chapter 7), processors in the same group are connected using the hypercube topology [33, 58, 48]. OTIS-Mesh of trees [33], OTIS-Perfect shuffle, OTIS-Cube connected cycles, etc. may be defined in an analogous manner. When analyzing algorithms for OTIS architectures, we count data moves along electronic interconnects (i.e., electronic moves) and those along optical interconnects (te., OTIS moves) separately. This allows us to later account for any differences in the speed and bandwidth of these two types of interconnect.

PAGE 16

6 group 0 group 1 group 2 group 3 Figure 1.3. Example of OTIS connections with 16 processors 1 , 3 Permutation ftnnting o n OPS Computers Suppose we wish to rearrange the data in an N* processor OTIS computer according to the permutation II = n[0] • • • U[N* 1]. That is, data from processor i = gN + p is to be sent to processor n[t], 0 < t < N*. We assume that the interconnection network in each group is able to sort the data in its N processors (equivalent^, it is able to perform any permutation of the data in its TV processors). This assumption is certainly valid for the mesh, hypercube, perfect shuffle, cubeconnected cycles, and mesh of trees interconnections mentioned earlier. Theorem 1.3.1 Every OTIS computer in which each group can sort can perform any permutation U using at most 2 OTIS moves.

PAGE 17

7 Figure 1.4. Multistage interconnection network (MIN) defined by OTIS Proof When 2 OTIS moves are permitted, the data movement can be modeled by a 3 stage MIN (multistage interconnection network) as in Figure 1.4. Each switch represents a processor group which is capable of performing any N input to N output permutation. The OTIS moves are represented by the connections from one stage to the next. The OTIS interstage connections are equivalent to the interstage connections in a standard MIN that uses N x N switches. From MIN theory [22], we know that when k x it switches are used, 2 log* N* 1 stages of switches are sufficient to make an N 7 input tf 2 output network that can realize every input to output permutation. In our case (Figure 1.4), k = N. Therefore, 2 log* 1 = 3 stages are sufficient. Hence 2 OTIS moves suffice to realize any permutation. An alternative proof comes from an equivalence with the preemptive open shop scheduling problem (POSP) [9]. In the POSP we are given n jobs that are to be

PAGE 18

8 scheduled on m machines. Each job i has m tasks. The task length of the jth task of job t is the integer Uj > 0. In a preemptive schedule of length T, the time interval from 0 to T is divided into slices of length 1 unit each. A time slice is divided into m slots with each slot representing a unit time interval on one machine. Time slots on each machine are labeled with a job index. The labeling is done in such a way that (a) each job (index) i is assigned to exactly Uj slots on machine j, 0 < j < m, and (b) no job is assigned to two or more machines in any time slice. T is the schedule length. The objective is to find the smallest T for which a schedule exists. Gonzalez and Sahni [9] have shown that the length T**, of an optimal schedule is where = max^E]^ 1 Uj} («•«-, Jm» is the maximum job length) and = max^E^o Uj} (»•«•> Mmms the maximum processing to be done by any machine). We can transform the OTIS computer permutation routing problem into a POSP. First, note that to realize a permutation II with 2 OTIS moves, we must be able to write II as a sequence of permutations noTTTiTITa where IT, is the permutation realized by the switches (i.e., processor groups) in stage t and T denotes the OTIS (transpose) interstage permutation. Let (g v p 9 ) denote processor p, of group where q € {to,oo,»i,0i,»2,02} (*o = in P ut o f sta € e °> °» = out P ut of 8ta G e °> etc ')* Then ' the data path is (fto.p*,) (5oo>PP»j) (&>i>P<>i) > We observe that to realize the permutation IT, the following must hold: (i) Switch i of stage 1 should receive exactly one data item from each switch of stage 0, 0 < i < N.

PAGE 19

9 (ii) Switch i of stage 1 should receive exactly one data item destined for each switch of stage 2, 0 < i < N. Once we know which data items will get to switch i, 0 < « < N t we can easily compute n 0 , U u and II 2 . Therefore, it is sufficient to demonstrate the existence of an assignment of the N* stage 0 inputs to the switches in stage 1 satisfying conditions (i) and (ii). For this, we construct an N job N machine POSP instance. Job t represents switch i of stage 0 and machine j represents switch j of stage 2. The task time Uj equals the number of inputs to switch t of stage 0 that are destined for switch j of stage 2 (i.e., t 0 is the number of group i data that are destined for group j). Since n is a permutation, it follows that E£*
PAGE 20

10 ] 4 This Dissertation In this dissertation, properties of OTIS-Mesh and OTIS-Hypercube are studied and obtained. The dissertation is organized as follows: • Chapter 2 deals with some fundamental properties such as diameter and embedding schemes of OTIS-Mesh. • OTIS-Mesh algorithms for frequently used permutations are presented in Chapter 3, along with the algorithm for general BPC permutations. • Algorithms for basic operations-broadcast, prefix sum, rank, sort, and so onare developed in Chapter 4. • Chapter 5 demonstrates how matrix multiplications are performed. • Chapter 6 presents algorithms for some well known image processing applications. • Properties of the OTIS-hypercube are studied in Chapter 7, as well as algorithms for commonly used permutations and general BPC permutations. • And finally, Chapter 8 summarizes the whole dissertation, and gives directions for research.

PAGE 21

CHAPTER 2 PROPERTIES OF OTIS-MESH In an N 7 processor OTIS-Mesh, each group is a y/N x y/N mesh and there are a total of N groups. Figure 2.1 shows a 16 processor OTIS-Mesh. The processors of groups 0 and 2 are labeled using two dimensional local mesh coordinates while the processors in groups 1 and 3 are labeled in row-major fashion. We use the notation (g,p) to refer to processor p of group g. In this chapter, we first show that the diameter of the OTIS-Mesh is Ay/N-3. Then, we demonstrate how OTIS-Mesh can simulate a 4D-mesh, as well as a 2D-mesh. ? 1 Diameter of the OTIS-Mesh Let foi, pi) and pj) be two OTIS-Mesh processors. The shortest path between these two processors is of one of the form: (a) The path involves only electronic moves. This is possible only when gi = g 2 (b) The path involves an even number of optical moves. In this case the path is of the form (g u pi) (ft,rf) iA>9i) ^ (pWi) W\M — > Here E* denotes a sequence (possibly empty) of electronic moves and O denotes a single OTIS move. If the number of OTIS moves is more than two, we may compress paths of this form into the shorter path (g u Pi) (tfi>P2) (p^) (pa,^) {92,92)So we may assume that the path is of the above form with exactly two OTIS moves. (c) The path involves an odd number of OTIS moves. In this case, it must involve exactly one OTIS move (as otherwise it may be compressed into a shorter path 11

PAGE 22

12 (0,0) group 0 (0,1), group l Figure 2.1. 16 Processor OTIS-Mesh with just one OTIS move as in (b)) and may be assumed to be of the form Let d(t, j) be the shortest distance between processors i and of a group using a path comprised solely of electronic moves. So, d(ij) is the Manhattan distance between the two processors of the local mesh group. Shortest paths of type (a) have length dOh.pa) while those of types (b) and (c) have length d(pi,pi) + d{g u gi) + 2 and d(pi,«fc) + dfa, 9i) + 1, respectively. From the preceding discussion we have the following theorems: Theorem 2.1.1 The length of the shortest path between processors (ffi,Pi) and (2) when ft = g? and min{d(jh,P2) + d{g x ,gi) + 2,d(pi,^) + d(P2>0i) + 1} when gi ±

PAGE 23

13 Proof When gi = 2)+l > d(Pi,Pa)+lSo, the shortest path has length d(pi,pa). When gi ? fr, the shortest path is either of type (b) or (c). From our earlier development it follows that its length is mn{d(P u Pi) + d{G u G 2 ) + M(A,G 2 ) + d{P 2> G x ) + 1}. Thp.nrr.rn 2.1.2 The diameter of the OTISMesh is 4y/N 3. Proof Since each group is a y/N x y/N mesh, d(pi,pa), d(P2,0i), d(pi,g?) y and d(gu&) are all less than or equal to 2(y/N 1). From Theorem 2.1.1, it follows that no two processors are more than 4(y/N 1) + 1 = 4v^ 3 apart. Hence, the diameter is < 4^ 3. Now consider the processors (51, Pi), (i the top left processor and pa the bottom right one of its group). Let gi be 0 and pa be N-l. So, dOh.pa) = d(g 1 ,g 2 ) =
PAGE 24

14 The mesh moves (t,i,* ± 1,1) and (i,j,M± 1) can be performed with one electronic move of the OTIS-Mesh while the moves (t, j ± and (i ± require one electronic and two optical moves. For example, the move (i, j+l, k, J) may be done by the sequence (i,j,k,l) (*»MJ + 1) + LMThe above efficient embedding of a 4D mesh implies that 4D mesh algorithms can be run on the OTIS-Mesh with a constant factor (at most 3) slowdown [58]. Unfortunately, the body of known 4D mesh algorithms is very small compared to that of 2D mesh algorithms. So, it is desirable to consider a 2D mesh embedding. Such an embedding will enable one to run 2D mesh algorithms on the OTIS-Mesh. Naturally, one would do this only for problems for which no 4D algorithm is known or for which the known 4D mesh algorithms are not faster than the 2D algorithms. 2.3 Simula *,'™ 1 ™* * 2D Mesh There are at least two intuitively appealing ways to embed an N x N mesh into the OTIS-Mesh. One is the group row mapping (GRM) in which each group of the OTIS-Mesh represents a row of the 2D mesh. The mapping of the mesh row onto a group of OTIS processors is done in a snake-like fashion as in Figure 2.2(a). The pair of numbers in each processor of Figure 2.2(a) gives the (row.column) index of the mapped 2D mesh processor. The thick edges show the electronic connections used to obtain the 2D mesh row. Notice that the assignment of rows to groups is also done in a snake-like manner. Let (»,;) denote a processor of a 2D mesh. The move to (t, j + 1) (or (», j 1)) can be done with one electronic move as (ij) and (t, j + 1) are neighbors in a processor group. If all elements of row i are to be moved over one column, then the OTIS-Mesh would need one electronic move in case of a MIMD mesh and 3 in case of a SIMD mesh as the row move would involve a shift by one left, right, and down within a group. A column shift can be done with 2

PAGE 25

15 group 0 group 1 group 0 group 1 group 3 group 2 group 2 group 3 (»> (b) Figure 2.2. Mapping a 4 x 4 mesh onto a 16 processor OTIS-Mesh: (a) GRM; (b) GSM additional OTIS moves as in the case of a 4D mesh embedding. GRM is particularly nice for the matrix transpose operation. Data from processor (i, j) can be moved to processor fj,i) with one OTIS and zero electronic moves. The second way to embed an N x N mesh is to use the group submesh mapping (GSM). In this, the N x N mesh is partitioned into N y/Nxy/N submeshes. Each of these is mapped in the natural way onto a group of OTIS-Mesh processors. Figure 2.2(b) shows GSM of a 4 x 4 mesh. Moving all elements of row or column i over by one is now considerably more expensive. For example, a row shift by +1 would be accomplished by the following data movements (a boundary processor is one on the right boundary of a group): Step 1: Shift data in non-boundary processors right by one using an electronic move. Step 2: Perform an OTIS move on boundary processor data. So, data from (g,p) move to (p,g).

PAGE 26

16 Step 3: Shift the data moved in Step 2 right by one using an electronic move. Now, the data from (g,p) are in (p,g + 1). Step 4: Perform an OTIS move on these data. Now data originally in (g,p) are in Step 5: Shift the data left by VN-1 using y/N-l electronic moves. Now, the boundary data originally in (y,p) are in the processor to its right but in the next group. The above five step process takes electronic and two OTIS moves. Note, however, that if each group is a wraparound mesh in which the last processor of each row connects to the first and the bottom processor of each column connects to the top one, then row and column shift operations become much simpler as Step 1 may be eliminated and Step 5 replaced by a right wraparound shift of 1. The complexity is now two electronic and two OTIS moves. GSM is also inferior on the transpose operation which now requires S(y/N1) electronic and 2 OTIS moves. Theorem 2.3.1 [46] The transpose operation of an N x N mesh requires 8{y/N 1) electronic and 2 OTIS moves when the GSM is used. Proof Let g x Oy and p x p y denote processor (g,p) of the OTIS-Mesh. This processor is in position (p x ,p,) of group (g x ,g y ) and corresponds to processor (g x p x ,g y P y ) of the N x N embedded mesh. To accomplish the transpose, data are to be moved from theNxN mesh processor {g x p x ,g y p y ) (ie., the OTIS-Mesh processor {g,p) = (g x g v ,p x p y )) to the mesh processor (g y p yi g x p x ) (i.e., the OTIS-Mesh processor (^^.PyPi))The following movements do this: {g x p x ,g v Py) (9xPyi9yPx) iPy9x>Px9 v ) (p y 9y,Px9z) (9 v Py,9xPx)Once again E* denotes a sequence of electronic moves

PAGE 27

17 local to a group and O denotes a single OTIS move. The E moves in this case perform a transpose in a y/N x VN mesh. Each of these transposes can be done in A{yfN 1) moves [34]. So, the above transpose method uses &(VN 1) electronic and 2 OTIS moves. To see that this is optimal, first note that every transpose algorithm requires at least 2 OTIS moves. For this, pick a group g t g y such that g x ^ g y . Data from all N processors in this group are to move to the processors in group g v g s . This requires at least one OTIS move. However, if only one OTIS move is performed, data from g s9y is scattered to the N groups. So, at least two OTIS moves are needed if the data ends up in the same group. Next, we shall show that independent of the OTIS moves, at least 8(^-1) electronic moves must be performed. The electronic moves cumulatively perform one of the following two transforms (depending on whether the number of OTIS moves is even or odd, see previous section about the diameter): (a) local moves from (p«,p,) to (p,,p»); local moves from (g x ,g y ) to (g y ,g x ); (b) local moves from (p x ,p y ) to {g y ,g x )\ local moves from (g x ,g y ) to (p,,,Px). For (p.,*) = « (O t VN1), (a) and (b) require 2(^-1) left and 2(y/N 1) down moves. For (p„pj = (<7x,5») = ~ M)i (a) and (b) require 2{-jN 1) right and 2{VN 1) up moves. The total number of moves is thus 8(v / ^ 1). So, S(y/N 1) is a lower bound on the number of electronic moves needed.

PAGE 28

CHAPTER 3 DATA REARRANGEMENT ON AN OTIS-MESH From Section 1.3, we know that an N 7 processor OTIS-Mesh can realize any permutation of N 7 data (one to each processor) using at most two OTIS moves. However, additional OTIS moves are needed to determine the local group data rearrangements that must be made. In this chapter, we first develop algorithms to realize permutations such as transpose, shuffle, unshuffle, and vector reversal which arise frequently in applications. Nassimi and Sahni [34] have developed optimal 4D mesh algorithms for several frequently arising permutations. These may be simulated using the method of Zane et d. [58] to obtain algorithms for the OTIS-Mesh. Table 3.1 gives the number of 4D mesh moves used by the optimal 4D mesh algorithms, a breakdown of the number of moves in the first two and last two dimensions, and the number of electronic and OTIS moves required by the simulation. In the following sections we shall obtain OTIS-Mesh algorithms for the permutations of Table 3.1, that require far fewer moves than the simulations of the optimal 4D mesh algorithms. Assume that the N 2 OTIS-Mesh processors are numbered/indexed 0 through N* 1 such that in the binary representation of a processor index the left half bits give the group number and the right half give the processor number local to a group. So, a processor index / is of the form I = GP where /, G and P are represented in binary and G and P have the same number of bits. G and P may be decomposed into halves to get G = G x G y and P = P S P V such that G x and G v give the group 18

PAGE 29

19 Table 3.1. Optimal moves for 4D mesh and respective OTIS-Mesh simulations Permutation 4D mesh OTIS-Mesh Simulation total dim. 1 + 2 dim. 3 + 4 OTIS electronic Transpose S(VN-l) 4(VN 1) 4(VN 1) S(VN-l) S(VN 1) Perfect Shuffle 4\fN 2y/N 2VN AyfN 4sTN Unshuffle WN 2VN 2s/N Wn 4VN Bit Reversal StfN 1) A(sfN 1) 4(vW-l) HVn-1) HVN 1) Vector Reversal 8(VW-1) 4(V^~ 1) 4(vW-l) S(sfN-l) HVN-i) Bit Shuffle ®y/N-4 4^-2 Vy/N-4 &VN-4 Shuffled Row-major #v^-4 WN-2 4>/N-2 WN-4 #v/JV-4 G V P X Swap 4(V3v 1) 2{VN-1) KVN-l) 4[VN-1) 4{>/N-l) location by row and column in an array layout of groups (as in Figure 2.1) and P t and P t locate processor P of a group by its row and column coordinates. The permutations of Table 3.1 are members of the BPC (bit permute complement) class of permutations denned in Nassimi and Sahni [34]. The definition of the BPC permutation and its relations with those permutations in Table 3.1 will be presented in the last section, along with the development of the algorithm. 3.1 Transpose The transpose operation may be accomplished via a single OTIS move and zero electronic moves. The simulation of the optimal 4D mesh algorithm, however, takes 8(y/N 1) OTIS and S(y/N 1) electronic moves. 3.2 Perfect Shuffle Let G represent the first half of the bits in the processor index and P the second half. Let b G{i) and 6/> {l) , respectively, denote the bits in position G{i) and P(i) of G and P. So 6c(p/2-i) and &/>0>/2-i) are the most significant bits of G and P while 6 G (o) and b P ( 0 ) are the least. Let G = 6 G (j»/2-i)G' and P bp^/i-vP 1 • A perfect shuffle may be performed as below:

PAGE 30

20 Step 1: Perform a local perfect shuffle in each group. This moves data from every processor GP to the corresponding processor GP'bp^-i)Step 2: This step involves processors in groups G such that 6c(p/2-i) = 0 onlv In these groups, odd processors exchange data with corresponding even processors (note that the processors exchanging data differ only in bit zero). To see the new data arrangement, it is convenient to separate out four cases depending on the values of b G{j> /2-i) and b P{p/2 -i). Steps 1 and 2 accomplish the following: OG'OP' OG'P'O OG'P'l OG'IP* OG'P'l OG'P'O ICOP* ^ IG'P'O ^ IG'P'O IG'lP* IGPI IG'P'l Step 3: Perform an OTIS move on all processors. Step 4: Perform a local shuffle in each group. The transformations so far are given below: OG'OP' 001*0 00*1 P'lOG' P'IG'O OG'l/* OG'P'l OG'P'O P'OOG P'OG'O lG >QP> IG'P'O *2f IG'P'O ^ p>0W' P'OG'l IG'IP IG'P'l IG'P'l P'HC P'IG'l Step 5: This step involves only processors in even groups. In these groups, odd processors exchange their data with the corresponding even processors. OG'OP' OG'P'O OG'P'l P'lQG' P'IG'O OG'lP' OG'P'l OG'P'O P'OOG 1 P'OG'O WOP* ^ IG'P'O IG'P'O sj^* p'QG'l laip 1 IG'P'l IG'P'l FIIG' P'IG'l P'IG'O P'OG'l P'OG'O P'IG'l

PAGE 31

21 Step 6: Perform an OTIS move on all processors. Step 7: Same as Step 5. The seven step process is shown below: OG'OF OG'FO OG'P'l P'lOG' FlG'O OG'IP 1 OG'PTJ P'OOG' FOG'O WOP 1 St»p l iQ'p'Q Sup * IG'P'O p'oiC IG'lP 1 IG'Fl IG'Fl P'UG' FlG'l FIGO G'OP'l G'OP'O P'OCl G'IP'O ClP'O FQG'Q st»p « G'OP'O St * p > 7 G'OP'l P'lCl ClP 1 ! Step 5 The correctness of the seven step algorithm above is readily seen. From the diagram of the data movement operations, we see that data originally in OG'OP 1 end up in G'OP'O; those in OG'IP' end up in G'IP'O; those in WOP 1 end up in OG'IP'; and those in IG'lP 1 end up in ClPl. In other words, data are moved from GP to G"6 p0 , / 2-i)P / 6g( J( /2-i), which is precisely what is to be done in a perfect shuffle. Steps 1 and 4 perform perfect shuffles in VN x y/N meshes. Each of these can be done optimally in 2y/N electronic moves using the algorithm of Nassimi and Sahni [34]. Steps 2, 5, and 7 requires exchanging data between mesh neighbors. Each exchange moves data in opposite directions on the same link and takes two electronic moves. Steps 3 and 6 take one OTIS move each. So, the total number of moves is Ay/N + 6 electronic and 2 OTIS only. In contrast, the simulation of the optimal 4D mesh perfect shuffle algorithm takes iVN electronic and 4\/N OTIS moves. 3.3 Unshuffle This is the inverse of a perfect shuffle and may be done by running the seven step shuffle algorithm backward (i.e., beginning with Step 7) and replacing the local shuffles of Steps 1 and 4 by local unshuffles. The data movement is shown below {G = Crb amt P = P*b m ).

PAGE 32

22 G"OP"0 G"0P"1 P*1G"0 P"1G"0 P"10G" G"0P"1 G"OP"0 P"OG"0 P"OG"l P"01G" G n \P"0 st«p « p"QG n \ s " p ' P"OG"0 s -^ 4 P"0O/N 1) electronic and &(VN 1) OTIS moves are made. Step 1: Do a local bit reversal in each group. Step 2: Perform an OTIS move of all data. Step 3: Do a local bit reversal in each group. Steps 1 and 3 are done optimally in 4(y/N 1) electronic moves each using the optimal 2D mesh bit reversal algorithm of Nassimi and Sahni [34]. 3 5 Vector Reversal A vector reversal can be done using S(VN 1) electronic and two OTIS moves. The steps are as follows: Step 1: Perform a local vector reversal in each group. Step 2: Do an OTIS move of all data. Step 3: Perform a local vector reversal in each group.

PAGE 33

23 Step 4: Do an OTIS move of all data. Note that Step 1 moves data from GP to GP (where P is the complement of P). Step 2 moves this data from GP to PG. Next, Step 3 sends that data to PG and finally Step 4 sends it to GP completing the vector reversal. The number of data moves is easily obtained by noting that the optimal way to perform the local vector reversals takes 4(v / W 1) electronic moves [34]. a.fi Bit Shuffle Our algorithm to perform this permutation employs a G y P t Swap permutation in which data from processor G x G y P x P y is routed to processor G x P x G y P y . So, let us first see how to perform this permutation. 3fi.1 (7 r P. Swap We present two algorithms for this. The first uses 2{y/N 1) electronic and log 2 N OTIS moves. The second uses 6(VN-1) electronic and 2 OTIS moves. While the second algorithm uses a larger number of moves, it is to be preferred when the cost of an OTIS move is considerably larger than that of an electronic move. The first algorithm performs a series of bit exchange permutations of the form B[i) = [fl^i, • • • , Bo], 0 < i < p/4, where ' p/2 + t, i=p/4 + i p/4 + t, ;=p/2 + » j otherwise The permutation B(i) may be realized as below: Step 1: Processors GP with 6c W / b P{p/i+i) route their data to corresponding processors that differ only in bit b P{j , /4 +i). This requires moving data left and right on rows of x VN meshes by T positions (in each direction). Step 2: Perform an OTIS move.

PAGE 34

24 Step 3: The data moved in Step 1 is routed from their current processors to corresponding processors that differ only in bit i. This requires data moves left and right along rows of yfN x v^V meshes. The distance is 2« in each direction. Step 4: Perform an OTIS move. The total number of moves is T +2 electronic and two OTIS. To perform a G y P x Swap permutation, we simply perform B{i) permutations for 0 < i < p/4. This takes p/2 = log 2 N OTIS moves and EH* 2 ,+2 = 4(2^-1) = A(y/~N 1) electronic moves. The second algorithm uses the following six steps: Step 1: Shift data in group G s G y up circularly by G y rows. This moves the datum from processor G x G y P x P y to processor G x G y ((P s G v ) mod VN)P y . Step 2: Perform an OTIS move. The datum from G s G,P x P y is now in ((P x -G y ) mod VN)P y G x G y . Step 3: In each group, shift the data right circularly along the rows by an amount given by the left half of the group bits. The Datum originally in G x G y P x P y is now in ((P x G y ) mod y/N)P y G x P x . Step 4: Perform an OTIS move. The datum is now in G X P X ((P X G y ) mod \ffi)P y . Step 5: Move data up circularly along columns by an amount given by the Right half of the group bits. The datum is now in G x P x {-G y mod y/N)P y . Step 6: Reverse the order of data in each column of each group. The datum is now in G z P x G y P y .

PAGE 35

25 While a column or row circular shift takes VN moves in each group, the number of moves in each direction varies from group to group. Assuming that up moves in one group may not be overlapped with down moves in another, Steps 1, 3, and 5 take 2(VN 1) electronic moves each. Step 6 may be combined with Step 5 at no extra cost. So, a total of 6(VN 1) electronic and two OTIS moves are used. 3fi.2 Bit Shuffle A bit shuffle may be performed following these steps: Step 1: Perform a G y P x swap. Step 2: Do a local bit shuffle in each group. Step 3: Do an OTIS move. Step 4: Do a local bit shuffle in each group. Step 5: Do an OTIS move. Using the 4(y/N 1) electronic and log 2 N OTIS move algorithm for the G y P x Swap and the optimal mesh bit shuffle algorithm of Nassimi and Sahni [34], the number of moves becomes (approximately) f y/N 4 electronic and log 2 N + 2 OTIS. 3.7 Shuffled Row-Maior This is the inverse of a bit shuffle and may be done in the same number of moves by running the bit shuffle algorithm backwards. Of course, Steps 2 and 4 are to be changed to shuffled row-major operations. 3 ft RPD Permutations We mentioned in the beginning that the permutations of the previous sections are members of the BPC permutation class. In this section we present the definition

PAGE 36

26 of the BPC permutation, its relation to those permutations, and the algorithm to realize it. 3 ft l Pefinition In a BPC permutation, the destination processor of each data is given by a rearrangement of the bits in the source processor index. For the case of our N* processor OTIS-Mesh we assume that N is a power of two and so the number of bits needed to represent a processor index is p = log, N* = 2logN. A BPC permutation of Nassimi and Sahni [34] is specified by a vector A = [A p U A v . 2 , . . . , A>] where (a) 4€{±0,±l,...,±(p-l)}>0 0, rf W " \ 1 rm if Ai<0. In this definition, -0 is to be regarded as < 0, while +0 is > 0. In a 16 processor OTIS-Mesh, the processor indices have four bits with the first two giving the group number and the second two the local processor index. The BPC permutation [-0,1,2,-3] requires data from each processor m 3 m 2 m x mQ be routed to processor (1 rn^m^il m 3 ). Table 3.2 lists the source and destination processors of the permutation. The permutation vector A for each of the permutations of Table 3.1 is given in Table 3.3.

PAGE 37

27 Table 3.2. Source and destination of the BPC permutation [-0, 1,2, -3] i processor OTIS-Mesh Source Destination r rocessor Binary Binary (G,P) Processor n u 0000 1001 (2,1) 9 i i 0001 0001 (0,1) 1 o t 0010 \J\J X \J 1101 (3,1) 13 3 (0,3) 0011 0101 (1,1) 5 4 (1,0) 0100 1011 (2,3) 11 5 (1,1) 0101 0011 (0,3) 3 6 (1,2) 0110 1111 (3,3) 15 7 (1,3) 0111 0111 (1,3) 7 8 (2,0) 1000 1000 (2,0) 8 9 (2,1) 1001 0000 (0,0) 0 10 (2,2) 1010 1100 (3,0) 12 11 (2,3) 1011 0100 (1,0) 4 12 (3,0) 1100 1010 (2,2) 10 13 (3,1) 1101 0010 (0,2) 2 14 (3,2) 1110 1110 (3,2) 14 15 (3,3) 1111 0110 (1,2) 6 Table 3.3. Permutations and their permutation vectors Permutation Permutation Vector Transpose Perfect Shuffle Unshuffle Bit Reversal Vector Reversal Bit Shuffle Shuffled Row-major [p/2-l,...,0,p-l,...,p/2j [0,p-l,p-2,...,l] [p-2,p-3,...,0,p-l] [0,1,. ...p-i] Hp-i),-(p-2),...,-o] [p-l,p-3,...,l,p-2,p-4,...,0] [p l,p/2 l,p 2,p/2 2, . . . ,p/2,0]

PAGE 38

28 38 2 Algorithm Every BPC permutation, A, may be realized by a sequence of bit exchange permutations of the form B(i,j) = [B^u . . .,flb], p/2 < • < p, 0 < j < p/2, and f h ? = * B q = < », 9 = i I otherwise, and a BPC permutation (7 = [C,-!, . . , Co] = n G U P where |C,| < p/2, 0 < q < p/2, n G and n P involve p/2 bits each. Let U' G be the permutation obtained from II G by subtracting p/2 from each entry whose absolute value exceeds p/2 1. For example, if n G = [-3, 5, 4], then p = 6 and n' G = [-0, 2, 1]. The transpose permutation may be realized by the sequence B(p/2 +j,j), 0 < j < p/2; bit reversal is equivalent to the sequence B(p1 J, j), 0 < j < p/2; vector reversal can be realized by performing no bit exchanges and using C = [-(p-1), ~(p2), . . . , -0] (Tla = [-(p-1), -(P-2), .... -p/2], U P m [-(p/2-1), . . . , -0]) ; perfect shuffle may be decomposed into £(p/2, 0) and C = [p 2,p 3, . . . ,p/2,p l,p/2 2,...,l,0,p/2-l] (nc = b-2,p-3,...,p/2,p-l], n P = [p/2-2,...,l,0,p/2-l]). A bit exchange permutation B(iJ) may be performed in 2* + 2 1 ' electronic, where .,_/i-p/2, i<3p/4 1 3p/4, i > 3p/4; j, j < P/ 4 J-p/4, j>p/4 and 2 OTIS moves following a process similar to that used for B(i) in Section 3.6.1. Our algorithm for general BPC permutations is: Step 1: Decompose the BPC permutation A into the bit exchange permutations 5 2 (t 2 ,j 2 ),..., B k {i k ,j k ) and the BPC permutation C = n G U P as above. Do this such that ij > it > • • • > t't, and j x > fa > • • • > jt-

PAGE 39

29 Step 2: If k = 0, do the following: Step 2.1: Do the BPC permutation U P in each group using the optimal algorithm of Nassimi and Sahni [34]. Step 2.2: Do an OTIS move. Step 2.3: Do the BPC permutation U' G in each group using the algorithm of Nassimi and Sahni [34]. Step 2.4: Do an OTIS move. Step 3: If fc = p/2, do the following: Step 3.1: Do the BPC permutation II^ in each group. Step 3.2: Do an OTIS move. Step 3.3: Do the BPC permutation U P in each group. Step 4: Uk< p/4, do the following: Step 4.1: Perform the bit exchange permutation B u . . . , B k . Step 4.2: Do Steps 2.1 through 2.4. Step 5: \ik> do the following: Step 5.1: Perform a sequence of p/2 k bit exchanges involving bits other than those in B u ...,B k in the same orderly fashion described in Step 1. Recompute YIq and lip. Swap lie and lip. Step 5.2: Do Steps 3.1 through 3.3.

PAGE 40

30 Consider the permutation A = [6,11,3,8,10,7,0,4,13,14,2,9,1,15,5,12] in a 2 16 processor OTIS-Mesh (we have omitted complements for simplicity; bit complements can be taken care of when the local BPC permutations U G and U P are performed). For this, the decomposition of Step 1 yields B x = 5(15,7), 5 2 = 5(13,6), 5 3 = £(10,4), B< = 5(9,2), and 5 5 = 5(8,0), n G = [13,11,14,8,10,9,15,12], Up = [6,3,2,7,1,0,5,4]. Since k = 5 > p/4 = 4, we go to Step 5. First we perform a sequence of bit exchanges on bits not in B x through 5 5 ; i.e., 5(14,5), 5(12,3), and 5(11,1). Recomputing n G and U P , we get n c = [6,2,3,1,5,7,0,4] and U P = [13,14,11,9,8,15,10,12]. Next, Steps 3.1 through 3.3 are done. The sequence of data moves is shown below: (&15&14&13&12&11&10&9&8&7&6M4&3&2&1M £(13 3) (bibhbi3 b i2biibiob 9 b i b 7 b 6 bi A b 4 b 3 b 2 b 1 b 0 ) — (^5Ml3&3^1&10&9&8&7&6&14Ml2&2&lM (bisb 5 bi 3 b 3 bibiob 9 b i b 7 b 6 bub A bi2b7b\il>o) (bishbubzbibiobvbshbebjbobiibnbAbii) (btbsbybobububibubisbsbiibibibiobtbi) (hbsbtbQbubiibtbubiQbisbibsbisbsbzbg) It can be verified that the resulting position is exactly the destination that the original BPC permutation A dictates. The local BPC permutations determined by Tic and U P take at most 4(y/N1) electronic moves each [34]; the bit exchanges cumulatively take at most 4(y/N 1) electronic and log 2 N OTIS moves. So, the total number of moves is at most 12(VN 1) electronic and log 2 N + 2 OTIS. 3 Q Comparison Table 3.4 lists the complexities of the algorithms for the commonly used permutations developed in this chapter along with the complexities of algorithms that o

PAGE 41

31 Table 3.4. Complexity Comparison of Common Data Rearrangement Permutation Simulation Ours electronic OTIS electronic oris Transpose S(VN-l) B(VN 1) 0 1 Perfect Shuffle Wn Wn 4^ + 6 2 Unshuffle Wn Ay/N 4>/N + 6 2 Bit Reversal &(VN 1) 8(VN 1) &(VN-1) 1 Vector Reversal 8(vW 1) S(y/N 1) S(VN 1) 2 Bit Shuffle fv/N-4 log 2 JV + 2 Shuffled Row-major fv/^-4 ^v^V-4 log 2 N + 2 use the simulation method of Zane et oL [58]. It is clear that each of our algorithm outperforms the simulation by a good margin.

PAGE 42

CHAPTER 4 BASIC OPERATIONS ON AN OTIS-MESH In this chapter, we develop deterministic OTIS-Mesh algorithms for the basic data operations for parallel computation that are studied in Ranka and Sahni [42], such as broadcast, window broadcast, prefix sum, rank, shift, sort, random access read and write. As shown in [42], algorithms for these operations can be used to arrive at efficient parallel algorithms for numerous applications, from image processing, computational geometry, matrix algebra, graph theory, and so forth. We consider both the synchronous SIMD and synchronous MIMD models. In both, all processors operate in lock-step fashion. In the SIMD model, all active processors perform the same operation in any step and all active processors move data along the same dimension or along OTIS connections. In the MIMD model, processors can perform different operations in the same step and can move data along different dimensions. 4,1 Pft f * Broadcast Data broadcast is, perhaps, the most fundamental operation for a parallel computer. In this operation, data that is initially in a single processor (G,P) is to be broadcast or transmitted to all N* processors of the OTIS-Mesh. Data broadcast can be accomplished using the following three step algorithm: Step 1: Processor (G,P) broadcasts its data to all other processors in group G. Step 2: Perform an OTIS move. Step 3: Processor G of each group broadcasts the data within its group. 32

PAGE 43

33 Following Step 2, one processor of each group has a copy of the data, and following Step 3 each processor of the OTIS-Mesh has a copy. In the SIMD model, Steps 1 and 3 take 2(y/N 1) electronic moves each, and Step 2 takes one OTIS move. The SIMD complexity is 4(s/N 1) electronic moves and 1 OTIS move, or a total of 4^-3 moves. Note that our algorithm is optimal because the diameter of the OTIS-Mesh is 4y/N 3 (Section 2.1). For example, if the data to be broadcast is initially in processor (0,0), the data needs to reach processor (AT1, AT1), which is at a distance of 4y/N 3. In the MIMD model, the complexity of Steps 1 and 3 depends on the value of P = (P„ P,) and ranges from a low of approximately y/N1 to a high of 2(VN1). The overall complexity is at most 4{y/N-l) electronic moves and one OTIS move. By contrast, simulating the 4D-mesh broadcast algorithm using the simulation method of [58] takes 4(^-1) electronic moves and 4(>/A 7 1) OTIS moves in the SIMD model and up to this many moves in the MIMD model. 4fl Window Broadcast In a window broadcast, we start with data in the top left w x w submesh of a single group G. Here w divides Following the window broadcast operation, the initial id x 10 window tiles all groups; that is, the window is broadcast both within and across groups. Our algorithm for window broadcast is: Step i: Do a window broadcast within group G. Step 2: Perform an OTIS move. Step S: Do an intragroup data broadcast from processor G of each group. Step 4: Perform an OTIS move. Following Step 1 the initial window properly tiles group G and we are left with the task of broadcasting from group G to all other groups. In Step 2, data d(G, P)

PAGE 44

34 from (G,P) is moved to (P,G) for 0 < P < N. In Step 3, d(G,P) is broadcast to all processors (P,i), 0 < P,i < N, and in Step 4 d(G,P) is moved to (i,P), 0 < »', P < AT. Step 1 of our window broadcast algorithm takes 2(y/N w) electronic moves in both the SIMD and MIMD models, and Step 3 takes 2(s/N 1) electronic moves in the SIMD model and up to 2(y/N-l) electronic moves in the MIMD model. The total cost is 4VN-2w-2 electronic and 2 OTIS moves in the SIMD model and up to this many moves in the MIMD model. A simulation of the 4D mesh window broadcast algorithm takes the same number of electronic moves, but also takes 4(VN1) OTIS moves. 4 3 Prefix Sum The index (G, P) of a processor may be transformed into a scalar I = GN+P with 0 < / < N*. Let D(I) be the data in processor 1,0 < I < N 2 . In a prefix sum, each processor / computes 5(7) = ELo *>(»), 0 < / < N 2 . A simple prefix sum algorithm results from the following observation: S{I) = SD(I) + LP{I) where SD{I) is the sum of D{i) over all processors t that are in a group smaller than the group of / and LP(I) is the local prefix sum within the group of /. The simple prefix sum algorithm is as follows: Step 1: Perform a local prefix sum in each group. Step 2: Perform an OTIS move of the prefix sums computed in Step 1 for all processors (G, N 1).

PAGE 45

35 Step 3: Group N 1 computes a modified prefix sum of the values, A, received Step 2. In this modification, processor P computes EiJo ^(0 rather than in Step 4: Perform an OTIS move of the modified prefix sums computed in Step 3. Step 5: Each group does a local broadcast of the modified prefix sum received by its N — 1 processor. Step 6: Each processor adds the local prefix sum computed in Step 1 and the modified prefix sum it received in Step 5. The local prefix sums of Steps 1 and 3 take 3(y/N 1) electronic moves in both the SIMD and MIMD models, and the local data broadcast of Step 5 takes 2(jN-\) electronic moves. The overall complexity is 8(v^ 1) electronic moves and 2 OTIS moves. This can be reduced to 7(VN 1) electronic moves and 2 OTIS moves by deferring some of the Step 1 moves to Step 5 as below. Step 1: In each group, compute the row prefix sums R. Step 2: Column y/N 1 of each group computes the modified prefix sums of its R values. Step S: Perform an OTIS move on the prefix sums computed in Step 2 for all processors (G,N 1). Step 4: Group N 1 computes a modified prefix sum of the values, A, received in Step 3. Step 5: Perform an OTIS move of the modified prefix sums computed in Step 4.

PAGE 46

36 Step 6: Each group broadcasts the modified prefix sum received in Step 5 along column >/N 1 of its mesh. Step 7: The column y/N 1 processors add the modified prefix sum received in Step 6 and the prefix sum of R values computed in Step 2 minus its own R value computed in Step 1. Step 8: The result computed by column VN 1 processors in Step 7 is broadcast along mesh rows. Step 9: Each processor adds its R value and the value it received in Step 8. If we simulate the best 4D mesh prefix sum algorithm, the resulting OTIS mesh algorithm takes 7(VN 1) electronic and 6(v / N 1) OTIS moves. 4.4 Data Sum In this operation, each processor is to compute the sum of the D values of all processors. An optimal SIMD data sum algorithm is as follows: Step 1: Each group performs the data sum. Step 2: Perform an OTIS move. Step 3: Each group performs the data sum. In the SIMD model Steps 1 and 3 take 4(VN1) electronic moves, and step 2 takes 1 OTIS move. The total cost is S(y/N 1) electronic and 1 OTIS moves. Note that since the distance between processors (0,0) and (N 1,N 1) is 4(y/N 1) electronic and 1 OTIS moves and since each needs to get information from the other, at least 8{VN 1) electronic and 1 OTIS moves are needed (the moves needed to send information from (0, 0) to (N 1, N 1) and those from (N 1, N 1) to (0, 0)

PAGE 47

37 cannot be overlapped in the SIMD model). Also, note that a simulation of the 4D mesh data sum algorithm takes &(\/N 1) electronic and 8(\/N 1) OTIS moves. The MIMD complexity can be reduced by computing the group sums in the middle processor of each group rather than in the bottom right processor. The complexity now becomes 4(\/N 1) electronic and 1 OTIS moves when is odd and 4\/N electronic and 1 OTIS moves when y/N is even. The simulation of the 4D mesh, however, takes 4(y/N-l) electronic and 4(y/N 1) OTIS moves. Notice that the MIMD algorithm is near optimal as the diameter of the OTIS-Mesh is 4V^ 3 (Section 2.1). 4.5 Rank In the rank operation, each processor / has a flag 5(7) € {0, 1}, 0 < I < N 7 . We are to compute the prefix sums of the processors with S(I) = 1. This operation can be performed in 1(\/N 1) electronic and 2 OTIS moves using the prefix sum algorithm of Section 4.3. 4.6 Shift Although there are many variations of the shift operation, the ones we believe are most useful in application development are as follows: (a ) mesh row shift with zero fill — in this we shift data from processor (G s , G y ,P x ,P f ) to processor (G X1 G y , P„ P, + «), -y/N y/N or P y + s < 0, the data from P, is discarded). (b) mesh column shift with zero fill — similar to (a), but along mesh column P x . (c) circular shift on a mesh row — in this we shift data from processor (G„ G y} P x , P t ) to processor (G x , G y ,P x , (P, + s) mod \/N).

PAGE 48

38 (d) circular shift on a mesh column— similar to (c), but instead P x is used. (e) group row shift with zero fill— similar to (a), except that G v is used in place of (f) group column shift with zero fill— similar to (e), but along group column G x . (g) circular shift on a group row— similar to (c), but with G v rather than P v . (h) circular shift on a group column— similar to (g), with G x in place of G y . Shifts of types (a) through (d) are done using the best mesh algorithms while those of types (e) through (h) are done as below: Step 1: Perform an OTIS move. Step 2: Do the shift as a P x (if originally a G x shift) or a P, (if originally a G y shift) shift. Step S: Perform an OTIS move. Shifts of types (a) and (b) take s electronic moves on the SIMD and MIMD models; (c) and (d) take y/N electronic moves on the SIMD model and max{|s|, y/N\s\) electronic moves on the MIMD model; (e) and (f) take * electronic and 2 OTIS moves on both SIMD and MIMD models; and (g) and (h) take y/N electronic and 2 OTIS moves on the SIMD model and max{|s|, y/N \s\) electronic and 2 OTIS moves on the MIMD model. If we simulate the corresponding 4D mesh algorithms, we obtain the same complexity for (a)— (d), but (e) and (f) take an additional 2s 2 OTIS moves, and (g) and (h) take an additional 2 x max{|s|, y/N \s\} 2 OTIS moves.

PAGE 49

39 4,7 Data Ar r " m "l at ' on Each processor is to accumulate M, 0 < M < v/N, values from its neighboring processors along one of the four dimensions G x , G y , P„ P y Let D(G x ,G y ,P x ,P y ) be the data in processor (G x ,G yi P x ,P y ). In a data accumulation along the G x dimension (for example), each processor (G x ,G y ,P x ,P y ) accumulates in an array A the data values from ((G x + i) mod ^,G y ,P X} P y ), 0 < i < M. Specifically, we have A\i] = D{(G S + i) mod y/N, G y , P x , P y ) Accumulation in other dimensions is similar. The accumulation operation can be done using a circular shift of -M in the appropriate dimension. The complexity is readily obtained from that for the circular shift operation (see Section 4.6). 4.ft Cauaeaitia Sum The JV 2 processor OTIS-Mesh is tiled with one-dimensional blocks of size M. These blocks may align with any of the four dimensions G s , G y , P X1 and P y . Each processor has M values X\j], 0
PAGE 50

40 in two phases. In the first phase, processor M 1 initiates tokens lo»*i» • • » * adds its X[i] value to it and transmits the token to the processor on its left. The first phase operates for M 1 moves and at the end of this phase, Pi has token ti = EjlT+i^WO)The second phase is similar to the first. This time, po initiates the tokens tf u _ v f M _ 2t . . . , t x and the tokens move rightwards. Following A/-1 moves, token fi is in processor pi and = Z$*JT[t|(j). Following phase 2, p { computes the desired result U + <£ + X[i](i). The total number of moves is 2(M 1). In the MIMD model, the left and right moves can be done simultaneously, and only AT 1 electronic moves are needed. When the one-dimensional size M blocks align with G x or G y , we first do an OTIS move; then run either a P, or P, consecutive sum algorithm; and then do an OTIS move. The number of electronic moves is the same as for P x or P v alignment. However, two additional OTIS moves are needed. Simulation of the corresponding 4D mesh algorithm takes an additional AM -6 OTIS moves for the case of G x or G, alignment in the SIMD model and an additional 2M 4 OTIS moves in the MIMD model. 4 9 Adjacent Sum This operation is similar to the data accumulation operation of Section 4.7 except that the M accumulated values are to be summed. The operation can be done with the same complexity as data accumulation using a similar algorithm. 410 Concentrate A subset of the processors contain data. These processors have been ranked as in Section 4.5. So the data is really a pair (D,r); D is the data in the processor and r is its rank. Each pair (D, r) is to be moved to processor r, 0 < r < 6, where

PAGE 51

41 b is the number of processors with data. Using the (G, P) format for a processor index, we see that (D,r) is to be routed from its originating processor to processor ([r/JVj.r mod N). We accomplish this using the steps: Step 1: Each pair (D,r) is routed to processor r mod N within its current group. Step 2: Perform an OTIS move. Step 3: Each pair (D,r) is routed to processor [r/N\ within its current group. Step 4: Perform an OTIS move. Thpnrrm 1.10.1 The four step algorithm given above correctly routes every pair (D, r) to processor ([r/N\,r mod N). Proof Step 1 does the routing on the second coordinate. This step does not route two pairs to the same processor provided no group has two pairs (Z>i,ri), (£j,r 2 ) with mod N = r 2 mod N. Since each group has at most N pairs and the ranks of these pairs are contiguous integers, no group can have two pairs with ri mod N = r 2 mod N. So following Step 1 each processor has at most one pair and each pair is in the correct processor of the group, though possibly in the wrong group. To get the pairs to their correct groups without changing the within group index, Step 2 performs an OTIS move, which moves data from processor (G,P) to processor (P, G). Now all pairs in a group have the same r mod N value and different [r/N\ values. The routing on the [r/N\ values, as in Step 3, routes at most one pair to each processor. The OTIS move of Step 4, therefore, gets every pair to its correct destination processor. In group 0, Step 1 is a concentrate localized to the group, and in the remaining groups, Step 1 is a generalized concentrate in which the ranks have been increased

PAGE 52

42 by the same amount. In all groups we may use the mesh concentrate algorithm of Nassimi and Sahni [35] to accomplish the routing in 4(VN 1) electronic moves. Step 3 is also a concentrate as the [r/N\ values of the pairs are in ascending order from 0, 1, 2, • • •. So Steps 1 and 3 take i(y/N 1) electronic moves each in the SIMD model and 2{s/N 1) in the MIMD model [35]. Therefore, the overall complexity of concentrate is S{VN 1) electronic and 2 OTIS moves in the SIMD model and 4(v y N 1) electronic and 2 OTIS moves in the MIMD model. We can improve the SIMD time to 7(y/N 1) electronic and 2 OTIS moves by using a better mesh concentrate algorithm than the one in Nassimi and Sahni [35]. The new and simpler algorithm is given below for the case of a generalized concentration on a y/N x y/N mesh. Step 1: Move data that are to be in a column right of the current one rightwards to the proper processor in the same row. Step 2: Move data that are to be in a column left of the current one leftwards to the proper processor in the same row. Step 3: Move data that are to be in a smaller row upwards to the proper processor in the same column. Step 4: Move data that are to be in a bigger row downwards to the proper processor in the same column. In a concentrate operation on a square mesh the data that begin in two processors of the same row ends up in different columns as the ranks of these two data differ by at most y/N1So Steps 1 and 2 do not leave two or more data in the same processor. Steps 3 and 4 get data to the proper row and hence to the proper processor. Note that it is possible to have up to two data items in a processor following

PAGE 53

43 Step 1 and Step 3. The complexity of the above concentrate algorithm is 4(y/N 1) on a SIMD mesh and 2(y/N 1) on an MIMD mesh (we can overlap Steps 1 and 2 as well as Steps 3 and 4 on an MIMD mesh). For an ordinary concentrate in which the ranks begin at 1, Step 4 can be omitted as no data moves down a column to a row with bigger index. So an ordinary concentrate takes only Z(y/N 1) moves. This improves the SIMD concentration algorithm of Nassimi and Sahni [35], which takes 4(y/N1) moves to do an ordinary concentrate. Actually, we can show that the four step concentration algorithm just stated is optimal for the SIMD model. Consider the ordinary concentrate instance in which the selected elements are in processors (0, y/N 1), (1, y/N 2), • • •, (y/N 1,0). The ranks are 0, 1, • • •, V^1. So the data in processor (0, y/N1) is to be moved to processor (0,0). This requires moves that yield a net of y/N 1 left moves. Also, the data in processor (y/N 1,0) is to be moved to processor (0, y/N 1). This requires a net of y/N1 upward moves and y/N -I rightward moves. None of these moves can be overlapped in the SIMD model. So every SIMD concentrate algorithm must take at least y/N 1 moves in each of the directions left, right, and up; a total of at least 3(y/N 1) moves. For the generalized concentrate algorithm, the ranks need not start at zero. Suppose we have two elements to concentrate. One is at processor (0,0) and has rank N 1, and the other is at processor (y/N 1, y/N 1) and has rank N. The data in (0,0) is to be moved to (y/N 1, y/N 1) at a cost of y/N 1 net right and down moves. The data in (y/N 1, y/N 1) is to be moved to (0,0) at a cost of y/N 1 net left and up moves. So at least 4(VN 1) moves are needed.

PAGE 54

44 Table 4.1. Processors with data to concentrate 0,0 0,1 G X = 1,0
PAGE 55

ESSE BBSS BBBB BBBB 1 — II — II — II — 1 UUUU BBBB BBBB BBBB BBBB BBBB BBBB BBBB BBBB f — If — If — 1 f — 1 0000 BBBB BBBB BBBB B (a) BBBB BBBB BBBB BBBB BBBB BBBB BBBB BBBB BBBB BBBB BBBB BBBB BBBB BBBB BBBB BBBB (b) Figure 4.1. Data Configuration: (a) Initial; (b) Concentrated

PAGE 56

46 Table 4.2. Net change in G x , G y , P„ and P„ data G x ft D(a) ~(VN-l) + l -UN -I) -(VN-1) -UN -I) D{b) -UN -I) +UN-1) -UN -I) +UN-1) D(c) 0 0 HVN-l) 0 Similarly, because of £>(a)'s requirements, at least 2(y/N 1) electronic moves that increase the column index within a x s/N mesh must be made. Turning our attention to net positive changes, we see that because of D(6)'s requirements there must be at least 2{ 2 processors on a row of a
PAGE 57

47 OTIS move, at least that many electronic moves are made, in the worst case, by every concentration algorithm. The reason that at least 2 OTIS mora are needed to complete the concentration is the same as for (a). 4 11 Distribute This is the inverse of the concentrate operation of Section 4.10. We start with pairs (D 0 ,d 0 ),...,(D„d q ),do
PAGE 58

48 Step 2: Run the GENERALIZE procedure of Nassimi and Sahni [35] from bit p 1 to 0, while maintaining the original index. Step 3: Perform an OTIS move. Step 4: Run the GENERALIZE algorithm of Nassimi and Sahni [35] from bit p 1 to 0. On an MIMD OTIS-Mesh the above algorithm takes 4(>/3V 1) electronic and 2 OTIS moves. We can reduce the SIMD complexity to 7(VN 1) electronic and 2 OTIS moves by using a better algorithm to do the generalize operation on a 2D SIMD mesh. This algorithm uses the same observation as used by us in Section 4.10 to speed the 2D SIMD mesh concentrate algorithm; that is, of the four possible move directions, only three are possible. When doing a generalize on a 2D y/N x \/N mesh the possible move directions for data are to increasing row indexes and to decreasing and increasing column indexes. With this observation, the algorithm to generalize on a 2D mesh becomes: Step 1: Move data along columns to increasing row indexes if the data is needed in a row with higher index. Step 2: Move data along rows to increasing column indexes if the data is needed in a processor in that row with higher column index. Step S: Move data along rows to decreasing column indexes if the data is needed in a processor in that row with smaller column index. The correctness of the preceding generalize algorithm can be established using the argument of Theorem 4.10.1, and its optimality follows from Theorem 4.10.2

PAGE 59

49 and the fact that the distribute operation, which is the inverse of the concentrate operation, is a special case of the generalize operation. The new and more efficient generalize algorithm may be used in Step 2 of the OTIS-Mesh generalize algorithm. It cannot be used in Step 4 because the generalize of this step requires the full capability of the code of Nassimi and Sahni [35] which permits data movement in all four directions of a mesh. When we use the new generalize algorithm for Step 2 of the OTIS-Mesh generalize algorithm, we can perform a generalize on a SIMD OTIS-Mesh using 7(VN-1) electronic and 2 OTIS moves. The new algorithm is optimal for both SIMD and MIMD models. This follows from the lower bound on a concentrate operation established in Theorem 4.10.2 and the observation made above that the distribute operation, which is a special case of the generalize operation, is the inverse of the concentrate operation and so has the same lower bound. 4.13 Sorting As was the case for the operations considered so far, an 0(y/N) time algorithm to sort can be obtained by simulating a similar complexity 4D mesh algorithm. For sorting a 4D Mesh, the algorithm of Kunde [25] is the fastest. Its simulation will sort into snake-like row-major order using Uy/N+o{y/N) electronic and 12v^ +o{y/N) OTIS moves on the SIMD model and 7^ + o(y/N) electronic and 6VN + o{y/N) OTIS moves on the MIMD model. To sort into row-major order, additional moves to reverse alternate dimensions are needed. This means that an OTIS-Mesh simulation of Kunde's 4D mesh algorithm to sort into row-major order will take I8y/N + o(y/N) electronic and \6y/N + o(>/N) OTIS moves on the SIMD model. We show that Leighton's column sort [26] can be implemented on an OTIS-Mesh to sort into rowmajor order using 22y/N+o(y/N) electronic and 0(7V 3/8 ) OTIS moves on the SIMD

PAGE 60

50 1 7 13 1 2 3 2 8 14 St*p» 4 5 6 3 9 15 7 8 9 4 10 16 10 11 12 5 11 17 St«p < 13 14 15 6 12 18 16 17 18 Figure 4.2. Row-Column Transformation of Leighton's Column Sort model and l\VN+o(y/N) electronic and 0{N 3 '*) OTIS moves on the MIMD model. Please note that the algorithm discussed here is deterministic. The randomized algorithms for sorting can be found in Rajasekaran and Sahni [41]. Our OTIS-Mesh sorting algorithm is based on Leighton's column sort [26]. This sorting algorithm sorts an r x s array, with r > 2(s l) 2 , into column-major order using the following seven steps: Step 1: Sort each column. Step 2: Perform a row-column transformation. Step 3: Sort each column. Step 4: Perform the inverse transformation of Step 2. Step 5: Sort each column in alternating order. Step 6: Apply two steps of comparison-exchange to adjacent rows. Step 7: Sort each column. Figure 4.2 shows an example of the transformation of Step 2, and its inverse. Figure 4.3 shows a step by step example of Leighton's column sort.

PAGE 61

51 Step 1 2 4 7 10 15 18 3 5 9 11 16 17 1 6 8 12 13 14 SUp J 7 9 12 4 16 1 18 5 14 2 17 8 15 11 6 10 3 13 1 3 11 4 6 15 7 9 17 2 10 12 5 13 16 8 14 18 Sup 1 2 3 1 4 5 6 7 9 8 10 11 12 15 16 13 18 17 14 1 11 14 2 12 13 4 10 15 5 9 16 6 7 17 3 8 18 SUp J 1 1 4 7 o £ fi 0 3 6 9 10 13 X \J 14 11 X X 15 17 12 16 18 7 13 8 14 9 15 10 16 11 17 12 18 SUp 4 SUp » 1 2 4 5 7 8 14 11 13 12 10 15 9 16 6 17 3 18 Sup 6 Figure 4.3. Example of Leighton's Column Sort Although Leighton's column sort is explicitly stated for r x s arrays with r > 2(s l) 2 , it can be used to sort arrays with s > 2(r l) 2 into row-major order by interchanging the roles of rows and columns. We shall do this and use Leighton's method to sort an N 1 * 7 x N 3 ' 2 array. We interpret our N 2 OTIS-Mesh as an AT 1 ' 2 x TV 3 / 2 array with G x giving the row index and G y P x P y giving the column index of an element processor. We shall further subdivide G x (G„ P x , P y ) into equal parts G,,, G Sl , G x „ and G 9A from left to right. We use G„_ 4 , for example, to represent G Xl G x ,G^. Since p = log 2 AT, G x has p/2 bits and G^ has p/8 bits. These notations are helpful in describing the transformations in Steps 2 and 4 of the column sort, as we use the BPC permutations of Nassimi and Sahni [34] to realize these transformations. The definition of a BPC permutation can be found in Nassimi and Sahni [34] and in Section 3.8.1. In describing our sorting algorithm, we shall, at times, use a 4D array interpretation of an OTIS-Mesh. In this interpretation, processor (G„ G y , P x , P y ) of the

PAGE 62

52 OTIS-Mesh corresponds to processor (G„ G„ P., J»,) of the 4D mesh. We use g x to denote the bit positions of G x , that is the leftmost p/2 bits in a processor index, g Xl to represent the leftmost p/8 bit positions, p, to represent the rightmost p/2 bit positions, p w _„ to represent the rightmost p/4 bit positions, and so on. Our strategy for the sorting steps 1, 3, 5, and 7 of Leighton's method is to collect each row (recall that since we are sorting an N 1 * 2 x N" 7 array, the column-sort steps of Leighton's method become row-sort steps) of our N*<* x array into an N"> x JV 3/8 x N>» x 4D submesh of the OTIS-Mesh, and then sort this row by simulating the 4D mesh sort algorithm of Kunde [25]. This strategy translates into the following sorting algorithm: Step 1: [Move rows of the x array into N** X N>» x N>» x N** 4D submeshes] Perform the BPC permutation P. = [y*,$,nP*,Py,0x a 0 w _40*»P* 2 -40*4Pi*-<]Step 2: [Sort each row of the N l/7 x N** array] Sort each 4D submesh of size TV 3 ' 8 x N*" x N*'* x JV 3/8 . Step S: [Do the inverse of Step 1, perform a column-row transformation, and move rows into N 3 '* x N*'* x N*** x N*'* submeshes] Perform the BPC permutation P c = [0x,-«0x 1 0ittP* l -4WW40V40yiP*iPvi]Step 4: [Sort each row of the N 1 * 2 x N*P array] Sort each 4D submesh of size JV 3 ' 8 x AT 3 / 8 x N*/* x TV 3 / 8 . Step 5: [Do the inverse of Step 1, perform a row-column transformation, and move rows into W 3 / 8 x 7V 3/8 x N 3 '* x N*'* submeshes] Perform the BPC permutation P^ = tex^xj.jPy^PxiPy^ira^vi^PiMPxj^]-

PAGE 63

53 Step 6: [Sort each row in alternating order] Sort each 4D submesh of size iW» x N 3 * x N 3 '* x AT 3 / 8 . Step 7: [Move rows back from 4D submeshes] Perform the BPC permutation = [g Xl M*iPin9x,9vi-49x s Pxi-A9x
PAGE 64

54 The preceding OTIS-Mesh implementation of column sort performs 6 BPC permutations, 4 4D mesh sorts, and two steps of comparison-exchange on adjacent rows. Since the sorting steps take 0(N>«) time each (use KunoVs 4D mesh sort [25] followed by a transform from snake-like row-major to row-major), and since the remaining steps take 0(N^) time, we shall ignore the complexity of the sort steps. We can reduce the number of BPC permutations from 6 to 3 as follows. First note that the P a of Step 1 just moves elements from rows of the N 1 ^ 2 x N*' 7 array into N 3 / 6 x AT 3 / 8 x AT 3 / 8 x TV 3 / 8 4D submeshes. For the sort of Step 2, it doesn't really matter which TV 3 / 2 elements go to each 4D submesh as the initial configuration is an arbitrary unsorted configuration. So we may eliminate Step 1 altogether. Next note that the BPC permutations of Steps 7 and 9 cancel each other and we can perform the comparison-exchange of Step 8 by moving data from one AT 3 / 8 x N** x AT 3 / 8 x N*» 4D submesh to an adjacent one and back in 0(JV 3/8 ) time. With these observations, the algorithm to sort on an OTIS-Mesh becomes: Step 1: Sort in each subarray of size A*/ 8 x AT 3 / 8 x N 3 ^ x A*/ 8 Step 2: Perform the BPC permutation P c . Step 8: Sort in each subarray. Step 4: Perform the BPC permutation /*. Step 5: Sort in each subarray. Step 6: Apply two steps of comparison-exchange to adjacent subarrays. Step 7: Sort in each subarray. Step 8: Perform the BPC permutation P^.

PAGE 65

55 Using the BPC routing algorithm of Section 3.8.2, the three BPC permutations can be done using 36v^ electronic and 31og 2 N + 6 OTIS moves on the SIMD model and 18y/N electronic and 31og 2 7V* + 6 OTIS moves on the MIMD model. A more careful analysis based on the development in Nassimi and Sahni [34] and Section 3.8.2 reveals that the permutations P^, P c , and can be done with 28v^V electronic and log 2 AT + 6 OTIS moves on the SIMD model and Uy/N electronic and 31og 2 tf + 6 OTIS moves on the MIMD model. By using p*. = [9 Xi 9v 1 Px 1 Py i 9 X 29v2Px2Pv79x s 9yzPz i Py s 9x i 9vJ>^Py t ]> Pc = [gx 7 -.9x i 9 V3 i 9y,Px^Px i P 1 n-*Py,] and j/ e = \<},<9x 1 -,9y<9y i -JxJ>x>-,Py
PAGE 66

56 Step 4: Concentrate the triples using the ranks of Step 3. Step 5: Distribute the triples according to their third coordinates. Step 6: Load each triple with the D value of the processor it is in. Step 7: Concentrate the triples using the ranks in Step 3. Step 8: Generalize the triples to get the configuration we had following Step 1. Step 9: Sort the triples by their first coordinates. Using the SIMD model the RAR algorithm of Ranka and Sahni [42] takes 79(v/tf 1) electronic moves and 0{N*/ S ) OTIS moves. On the MIMD model, it takes 45(v / ^~ 1) electronic 0(N 3 '*) OTIS moves. 4 1 5 Random Access Write (RAW) Now processor / wants to write its D data to processor dj, 0 < J < N*. The steps in the RAW algorithm of Ranka and Sahni [42] are as follows: Step 0: Processor / creates the tuple (£>(/),<*/), 0 < / < N 7 . Step 1: Sort the tuples by their second coordinates. Step 2: Processor / deactivates if the second coordinate of its tuple is the same as the second coordinate of the tuple in/ + l, Q
PAGE 67

57 Step 2 implements the arbitrary write method for a concurrent write. In this, any one of the processors wishing to write to the same location is permitted to succeed. The priority model may be implemented by sorting in Step 1 by d, and within dj by priority. The common and combined models can also be implemented, but with increased complexity. On the SIMD model, an RAW takes 43(v^1) electronic and 0(N 3 ^) OTIS moves while on the MIMD model, it takes 26(^-1) electronic and 0(N 3 '*) OTIS moves. 41fi Summary Our algorithms run faster than the simulation of the fastest algorithms known for 4D meshes. Tables 4.3 and 4.4 summarizes the complexities of our algorithms and those of the corresponding ones obtained by simulating the best 4D-mesh algorithms on SIMD and MIMD models respectively. Note that the worst case complexities are listed for the broadcast and window broadcast operation, and that of the case when is even is presented for the data sum operation on the MIMD model. Also, the complexities listed for circular shift, data accumulation, and adjacent sum assume that the shift distance is < VN/2 on the MIMD model. Both tables give only the dominating y/N terms for sorting. Our algorithms for data broadcast, data sum, concentrate, distribute, and generalize are optimal.

PAGE 68

58 Table 4.3. Comparison of complexities on SIMD model Operation Simulation Ours Electronic OTIS Electronic OTIS Broadcast 4(VN 1) 4(VN 1) 4(\A/V 1) 1 Window Broadcast 4\fN -2w-2 4(y/N 1) AyfN -2w-2 2 Prefix Sum 7(VN 1) 6(SN 1) 7(VN 1) 2 Data Sum 8(^-1) A(y/N-l) B(VN-l) 1 Rank 7(VN-l) 6(y/N-l) nvN-i) 2 Regular Shift 8 2s s 2 Circular Shift 2VN y/N 2 Data Accumulation Vn 2y/N y/N 2 Consecutive Sum 2(M 1) 4(M 1) 2(M-1) 2 Adjacent Sum y/N 2y/N y/N 2 Concentrate 8(VN 1) &(VN 1) 7(y/N-\) 2 Distribute B(VN-\) B(VN-l) 7(VN 1) 2 Generalize S(VN 1) KVN-l) 7(VN 1) 2 Sorting Uy/N \2sfN 22y/N

PAGE 69

Table 4.4. Comparison of complexities on MIMD model Operation Simulation Ours Electronic OTIS Electronic oris Broadcast A{y/N 1) 4(^-1) 4(V^V1) 1 Window Broadcast Ay/N -2w-2 4(^-1) 4yfN -2w-2 2 Prefix Sum 7(VN-l) 6(VN-1) 7(VN-1) 2 Data Sum WN WN WN 1 Flank 7(VN-1) 6(VN-l) 7(VN-1) 2 Regular Shift 8 2s s 2 Circular Shift 8 2s s 2 Data Accumulation M M M 2 Consecutive Sum M 1 2{M-\) M 1 2 Adjacent Sum M 2M M 2 Concentrate 4(VN 1) \WN-\) 4(VN 1) 2 Distribute 4(^-1) A(y/N 1) A(y/N-1) 2 Generalize A(y/N 1) 4(vW-l) 4(VN-1) 2 Sorting 7VN 6v^ lWN 0(AT 3 / 8 )

PAGE 70

CHAPTER 5 MATRIX MULTIPLICATIONS ON AN OTIS-MESH In this chapter, we develop algorithms to multiply vectors of size kN and matrices of siie kN x kN on an TV 2 processor OTIS-Mesh. These algorithms are developed for both of the matrix to OTIS-Mesh mapping schemes considered in Section 2.3— group row-major mapping (GRM) and group submatrix mapping (GSM). We begin, in Section 5.1, by describing the GRM and GSM schemes and making observations about the complexity of performing the matrix add and transpose operations. In Section 5.2, we develop the algorithms for various versions of vector and matrix multiplication. For purposes of this chapter the essential differences between electronic and optical links are (ty optical links have much larger bandwidth than do electronic links; and (b) transfer times including latency are different on optical and electronic links. In our analysis, we count communication along electronic and optical interconnects separately. However, we use the simplifying assumption that any constant amount of data can be communicated over an optical link during an optical communication step while only a unit amount of data can be communicated over an electronic link during an electronic communication step. In this chapter, we assume that the processor mesh that represents any group of processors is a SIMD mesh. Therefore, in any given time step, data can be moved in only one of the four mesh dimensions: up, down, left, or right. Extensions to MIMD meshes are straightforward and thus omitted. §J Mapping Matrices O nto An OTIS-Mesh In Section 2.3 we described the GRM and GSM mapping of a matrix. For the GSM mapping, we introduce the following notation. Matrix element (», j) is mapped 60

PAGE 71

61 to processor (i/.i/.Wm) where i f = [i/v^J, i* = • mod V^, jf = U/v^J, and Jm = j mod v^ATThe GRM and GSM mappings of a row or column vector are obtained from the corresponding mapping of an N x N matrix by extracting the sub-mapping corresponding to row zero or column zero of the matrix. The GRM and GSM mappings of akNxkN matrix are obtained by partitioning the kN x kN matrix into NxN blocks of size k x Jfc each. The N x N block matrix is then mapped onto the N 9 processor OTIS-Mesh, one block per processor, using the standard GRM and GSM schemes described above. 1 x kN and kN x 1 vectors are mapped by using the sub-mapping corresponding to row zero or column zero of the kN x kN matrix mapping. It is easy to see that regardless of which mapping is used, matrix as well as vector addition and subtraction requires no interprocessor communication. Two kN x kN matrices can be added or subtracted in 0(k 7 ) time and two vectors of size kN can be added or subtracted in 0(k) time. Algorithms for the matrix transpose operation were developed in Section 3.1. A kN x kN matrix can be transposed using a single OTIS move and no electronic moves when the GRM mapping is used. When the GSM mapping is used, the transpose requires &k 2 (y/N 1) electronic and 2 OTIS moves. In either case, 0(k 2 ) intraprocessor moves are needed to transpose the * x k block stored in a processor. S.2 Mult iplication Algorithm S 2.1 Colum n Vector x Row Vector GEM First consider the GRM mapping. When an N x 1 column vector A and a \ x N row vector B are multiplied, the result is an N x N matrix. This is to be stored in the OTIS-Mesh using the GRM mapping.

PAGE 72

62 Step 1: Perform an OTIS move on B. Step 2: Broadcast the A and B data in each group to all processors of the group. Step 3: Perform an OTIS move on B. Step 4: Each processor multiplies its A and B data. Figure 5.1. GRM Column x Row Multiplication Initially, the element A in row i of A is in the Oth processor of group i (i.e., processor (i,0)) and the jth element Bj of B is in processor (0,;). Following the multiplication, the (», j) element A+Bj of the product matrix is to be in processor The four step algorithm given in Figure 5.1 performs the multiplication. Following Step 1, Bj is in processor (j',0), 0 < j < N; and following the broadcast of Step 2, processor (i, *) has A i (* denotes all permissible indexes; in this case indexes are in the range [0, N)) and processor (|, *) has Bj. After the OTIS move of Step 3, processor (i, j) has A and Bj. Consequently, following the multiplication of Step 4, processor (i, j) has the (i,j)th entry of the result matrix. Therefore, the algorithm correctly multiplies the vectors A and B. For the complexity analysis, we see that 2 OTIS moves are made in Steps 1 and 3 together. Step 2 can be done using 2y/N electronic moves by first sending A and B data initially in processor 0 of a group down column zero of that group. Since only one piece of data can be moved at a time along an electronic link, this column broadcast of the A and B data can be done in y/N moves if we pipeline the data movement down column zero (i.e., the B data trail the A data by one column processor). Next the A and B data in each processor of column 0 are broadcast along rows using a similar pipelining. This requires another y/N electronic moves.


Theorem 5.2.1 Our column × row algorithm is an optimal algorithm.

Proof. To see this, first note that all the B values are initially in group 0, and all need to get to group 1 (say) either in the form of A_i × B_j or simply B_j. The only way data can move from one group to another is via an OTIS move, and a single OTIS move can only move a constant number of the B_j's accumulated into a single processor of group 0 into a single processor of group 1. Therefore, at least 2 OTIS moves are needed.

Also, 2√N electronic moves are necessary. To see this, observe that B_0 is initially in processor (0,0) and its influence must be seen at all processors (*,0) because the (*,0) element of the result is A_*·B_0, which is to be left in processor (*,0). OTIS moves can only transpose group and local processor indexes. To effect a change from (0,0) to (*,0), 2√N − 2 electronic moves (√N − 1 rightward row moves and √N − 1 downward column moves) are essential. Further, A_0 is initially in (0,0) and all values in (0,*) depend on A_0. Therefore at least √N − 1 rightward row moves and √N − 1 downward column moves are needed to communicate the A_0 value directly or indirectly to (0,*). Since only unit data can flow along an electronic link in a single move, we cannot overlap all of the rightward row moves needed for A_0 and B_0. Therefore, at least √N rightward moves must be made. Similarly, at least √N downward moves must be made.

The algorithm of Figure 5.1 also can be used when A is a kN × 1 vector and B a 1 × kN vector. Now, in Steps 1 and 3, blocks of k values are moved from a single processor via an OTIS move. In Step 2, blocks of size k are to be broadcast. The strategy is the same as for the case k = 1; however, now we must pipeline the 2k A and B values in a processor for the column and row broadcast steps. This pipelining takes 2√N + 4k − 4 electronic moves. Steps 1 and 3 still take 2 OTIS moves as we can move k element blocks using a single OTIS move. In Step 4, each processor performs k² multiplications to generate a k × k block of the product matrix.
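The pipelining argument used here recurs throughout the chapter: sending m unit values along a path of d links takes d + m − 1 electronic moves, since the first value needs d steps and each later value trails one step behind the one before it. A one-line helper (ours, for illustration) reproduces the count above:

    def pipelined_moves(path_length, items):
        # d + m - 1 moves to pipeline m unit values over d links
        return path_length + items - 1

    k, sqrt_n = 4, 8
    # Column broadcast, then row broadcast, of the 2k A and B values:
    total = 2 * pipelined_moves(sqrt_n - 1, 2 * k)
    assert total == 2 * sqrt_n + 4 * k - 4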


Step 1: Processors that have a B value broadcast this B value to all processors in the same column of the group.
Step 2: Processors that have an A value broadcast this A value to all processors in the same row of the group.
Step 3: Perform an OTIS move on the A and B values in a processor.
Step 4: Same as Step 1.
Step 5: Same as Step 2.
Step 6: Same as Step 3.
Step 7: Each processor multiplies its A and B values to produce an element of the product matrix.

Figure 5.2. GSM Column × Row Multiply Algorithm

GSM. For A an N × 1 vector and B a 1 × N vector, we start with A_i in processor (i_f, 0, i_m, 0) and B_j in (0, j_f, 0, j_m) and are to leave the product term C_ij = A_i·B_j in (i_f, j_f, i_m, j_m). The algorithm of Figure 5.2 does this. Step 1 moves B_j from (0, j_f, 0, j_m) to (0, j_f, *, j_m) and Step 2 moves A_i from (i_f, 0, i_m, 0) to (i_f, 0, i_m, *). Following Step 3, B_j is in (*, j_m, 0, j_f) and A_i is in (i_m, *, i_f, 0). Step 4 now moves B_j from (*, j_m, 0, j_f) to (*, j_m, *, j_f), and Step 5 moves A_i from (i_m, *, i_f, 0) to (i_m, *, i_f, *). Following Step 6, processor (i_f, j_f, i_m, j_m) has A_i and B_j. Therefore, Step 7 correctly computes the product element C_ij = A_i·B_j.


The number of data moves is 4(√N − 1) electronic and 2 OTIS moves. The algorithm of Figure 5.2 is optimal because the A_0 initially in (0,0,0,0) affects the final value in (√N−1, 0, √N−1, 0); this requires 2(√N − 1) electronic column moves. Further, the B_0 initially in (0,0,0,0) affects the final value in (0, √N−1, 0, √N−1); this requires 2(√N − 1) electronic row moves. Additionally, the √N A values initially in group (0,0) affect the final values in group (0,1). This requires at least 2 OTIS moves (assume that √N > 2).

The algorithm of Figure 5.2 can also be used when k > 1. Now, the broadcasts and each OTIS move involve 2 blocks of k elements each. The broadcasts are done by pipelining the transfer of the k elements in a block, and each OTIS move simply does a block transfer of the k elements. The total number of data move steps becomes 4√N + 4k − 8 electronic and 2 OTIS. Step 7 produces a k × k block of the result matrix using k² multiplication steps.

5.2.2 Row Vector × Column Vector

GRM. For a 1 × N row vector A and an N × 1 column vector B, we begin with A_i in (0,i) and B_i in (i,0). The result Σ_{i=0}^{N−1} A_i·B_i is to be left in (0,0). In the algorithm of Figure 5.3, B_i is moved from (i,0) to (0,i) in Step 1. The sum Σ_{i=0}^{N−1} A_i·B_i is computed in Step 3 by first moving the products of Step 2 upward to row 0 and adding terms in the row zero processors. Then the partial sums are moved leftward along row zero and the result computed in (0,0). The algorithm requires 1 OTIS and 2(√N − 1) electronic moves. It is obvious that the algorithm is optimal.


When the vectors are of size 1 × kN and kN × 1, respectively, Step 2 multiplies a 1 × k block of A with a k × 1 block of B. This takes O(k) time. We assume that the cost of O(k) arithmetics is considerably less than the cost of an electronic move and, therefore, make no attempt to utilize processors from other groups to reduce the time spent on arithmetic operations. The data moves required by the algorithm of Figure 5.3 are still 2(√N − 1) electronic and 1 OTIS.

Step 1: Perform an OTIS move on B values.
Step 2: Each processor of group 0 multiplies its A and B values.
Step 3: Sum the products of Step 2 by columns and finally along row zero, leaving the result in (0,0).

Figure 5.3. GRM Row × Column Multiply

Step 1: Groups with A values move their A values from row 0 to column 0 using the data paths of Figure 5.5.
Step 2: Perform an OTIS move on all data.
Step 3: Shift the A values leftward along row 0 of a group and the B values upward along column 0, and compute the sum of products in the (0,0) processor of each group that has A and B values.
Step 4: Perform an OTIS move on the product sums computed in Step 3.
Step 5: Shift the product sums upward along column 0 of group 0, summing these sums in processor (0,0).

Figure 5.4. GSM Row × Column Multiply

GSM. When multiplying a 1 × N vector and an N × 1 vector using the GSM mapping, the algorithm of Figure 5.4 can be used. The algorithm of Figure 5.4 begins with A_i in (0, i_f, 0, i_m) and B_i in (i_f, 0, i_m, 0). In Step 1, A_i is moved to (0, i_f, i_m, 0) by performing √N − 1 downward moves and √N − 1 leftward moves as in Figure 5.5.


Figure 5.5. Data Paths Used in Step 1 of Figure 5.4

Following Step 2, A_i is in (i_m, 0, 0, i_f) and B_i is in (i_m, 0, i_f, 0). In Step 3, (i_m, 0, 0, 0) sums up the terms that contribute to the result. In Step 4, these sums are moved to (0, 0, i_m, 0) and are added together in Step 5. Steps 1 and 3 take 2(√N − 1) electronic moves each and Step 5 takes √N − 1 electronic moves. The total number of data moves is therefore 5(√N − 1) electronic and 2 OTIS moves.

A straightforward generalization of the algorithm of Figure 5.4 to the case when we are multiplying a 1 × kN row with a kN × 1 column results in excessive complexity when k > 1. This is so because the pipelining of Step 3 takes 2k(√N − 1) electronic moves. When k > 1, the number of data moves is reduced by using the algorithm of Figure 5.6.

The algorithm of Figure 5.6 begins with the ith block of A in (0, i_f, 0, i_m) and the ith block of B in (i_f, 0, i_m, 0). In Step 1, the ith block of A is moved to (0, i_f, i_m, 0). And following Step 2, the ith block of A is in (i_m, 0, 0, i_f) while the ith block of B is in (i_m, 0, i_f, 0). Step 3 moves the ith block of A from (i_m, 0, 0, i_f) to (i_m, 0, i_f, 0). Now (i_m, 0, i_f, 0) contains block i of A and B. These blocks are multiplied in Step 4 to produce a single number in (i_m, 0, i_f, 0).


Step 1: Groups with A values move their A value blocks from row 0 to column 0 using pipelining and the data paths of Figure 5.5.
Step 2: Perform an OTIS move on all data blocks.
Step 3: Same as Step 1.
Step 4: Processors with an A and a B block multiply their blocks (these are the column 0 processors of each column 0 group).
Step 5: The column 0 processors, in each column 0 group, shift their Step 4 results upward along column 0. The results are added together by the (0,0) processor in each group.
Step 6: Perform an OTIS move on the sums computed by the (0,0) processors in Step 5.
Step 7: In group (0,0), the column 0 processors shift the values received in Step 6 upward to the (0,0) processor of the group. The (0,0) processor adds these values together.

Figure 5.6. GSM Row × Column Multiply for k > 1


In Step 5, the numbers computed in group (i_m, 0) are summed in processor (i_m, 0, 0, 0). The OTIS move of Step 6 moves the resultant sums to (0, 0, i_m, 0). These resultant sums are added together in Step 7. The number of data moves performed by the algorithm of Figure 5.6 is 6√N + 4k − 10 electronic and 2 OTIS.

5.2.3 Row Vector × Matrix

GRM. We are to multiply a 1 × N row vector A and an N × N matrix B. The result is a 1 × N vector C such that C_i = Σ_{j=0}^{N−1} A_j·B_ji. Initially, A_i is in (0,i) and B_ij is in (i,j), and the result is to be left so that C_i is in (0,i). The multiplication algorithm is given in Figure 5.7.

Step 1: Perform an OTIS move on A values.
Step 2: Processor 0 of each group broadcasts its A value to the remaining processors in its group.
Step 3: All processors multiply their A and B values.
Step 4: Perform an OTIS move on the products computed in Step 3.
Step 5: Processor 0 of each group sums the products from all processors in the same group.
Step 6: Perform an OTIS move on the sums computed in Step 5.

Figure 5.7. GRM Row Vector × Matrix Multiply

In Step 1, A_i is moved from (0,i) to (i,0). Following Step 2, processor (j,i) has A_j and B_ji. Processor (j,i) computes A_j·B_ji in Step 3 and, in Step 4, A_j·B_ji is moved to processor (i,j). Processor (i,0) computes C_i = Σ_{j=0}^{N−1} A_j·B_ji in Step 5.


In Step 6, C_i is moved from (i,0) to (0,i). The complexity of the algorithm is 4(√N − 1) electronic and 3 OTIS moves.

When A is a 1 × kN vector and B a kN × kN matrix, a block of k A values is moved in Step 1 of Figure 5.7; the broadcast of the A block in Step 2 is done in 2(√N + k − 2) electronic moves by pipelining the broadcast of the k values; the multiplication of Step 3 is between a 1 × k vector and a k × k matrix; and the OTIS move of Step 4 moves 1 × k blocks. To do the sum of Step 5, we first sum along rows. This is done in √N + k − 2 electronic moves by pipelining the k sums to be computed. Next the partial sums in column 0 are summed, again using pipelining. Step 5 takes 2(√N + k − 2) steps. Adding in the OTIS move of Step 6, the total number of moves becomes 4√N + 4k − 8 electronic and 3 OTIS.

GSM. Our GSM algorithm to multiply a 1 × N row vector A and an N × N matrix B is given in Figure 5.8. Note that the algorithm begins with A_j in (0, j_f, 0, j_m) and B_ji in (j_f, i_f, j_m, i_m). Therefore, in Step 6, processor (j_m, i_m, j_f, i_f) computes A_j·B_ji. In Step 7, processor (j_m, i_m, 0, i_f) computes Σ_{j : j mod √N = j_m} A_j·B_ji, which is then sent to (0, i_f, j_m, i_m) in Step 8. Finally, in Step 9, (0, i_f, 0, i_m) computes C_i = Σ_{j=0}^{N−1} A_j·B_ji. Steps 1 and 4 take 2(√N − 1) electronic moves each; Steps 2, 5, 7, and 9 take √N − 1 electronic moves each; and Steps 3 and 8 take 1 OTIS move each. The total number of moves is 8(√N − 1) electronic and 2 OTIS.


Step 1: In each group, move the A values from row 0 to column 0 using the data paths of Figure 5.5.
Step 2: The column 0 processors broadcast their A values to all processors in the same group and on the same row.
Step 3: Perform an OTIS move on all A and B values.
Step 4: Same as Step 1.
Step 5: Same as Step 2.
Step 6: All processors multiply their A and B values.
Step 7: The processors in row 0 of each group sum the products of Step 6 that are in the same column.
Step 8: Perform an OTIS move on the sums of Step 7.
Step 9: The processors in row 0 of group (0,*) sum the values received in Step 8 that are in the same column.

Figure 5.8. GSM Row Vector × Matrix Multiply


When A is a 1 × kN vector and B a kN × kN matrix, Steps 1 and 4 can be done with 2(√N + k − 2) electronic moves each by transmitting the k values in each processor in a pipelined fashion; Steps 2 and 5 take √N + k − 2 electronic moves each (again using pipelining); and Steps 7 and 9 can be done in √N + k − 2 moves each using the pipelined summing scheme used in Step 5 of Figure 5.7. The total number of moves is 8(√N + k − 2) electronic and 2 OTIS.

5.2.4 Matrix × Column Vector

GRM. We start with an N × N matrix A and an N × 1 column vector B and compute the N × 1 column vector C such that C_i = Σ_{j=0}^{N−1} A_ij·B_j. Initially, A_ij is in (i,j) and B_j is in (j,0). On termination, C_i is to be in (i,0). Our algorithm to perform the multiplication is given in Figure 5.9.

Step 1: Processor 0 of each group broadcasts its B value to all processors in its group.
Step 2: Perform an OTIS move on B values.
Step 3: All processors multiply their A and B values.
Step 4: Processor 0 of each group sums the products computed in Step 3 by all processors in its group.

Figure 5.9. GRM Matrix × Column Vector Multiply

Following Step 1, B_j is in (j,*), and following Step 2 it is in (*,j). In Step 3, (i,j) computes A_ij·B_j, and in Step 4, (i,0) computes C_i = Σ_{j=0}^{N−1} A_ij·B_j. The number of data moves is 4(√N − 1) electronic (Steps 1 and 4 each require 2(√N − 1) electronic moves) and 1 OTIS.

Theorem 5.2.2 The GRM matrix × column vector multiplication algorithm of Figure 5.9 is optimal.


Proof. Since the value of C_0 depends on all A_0j values, information about all these A values must get to (0,0) either directly or indirectly. For this to happen, at least √N − 1 leftward row moves and √N − 1 upward column moves must be made. Let the snake-like row-major index of the bottom right processor of a group be q. Since C_q = Σ_j A_qj·B_j, information originally in (0,0) (i.e., B_0) must get to (q,0) directly or indirectly. This requires a minimum of √N − 1 rightward row moves and √N − 1 downward column moves plus one OTIS move. The row and column moves required for the computation of C_0 and C_q are in opposite directions and cannot be overlapped in the SIMD model. Therefore, at least 4(√N − 1) electronic and 1 OTIS moves are needed.

When A is a kN × kN matrix and B a kN × 1 vector, we use the algorithm of Figure 5.9 and pipelining as used for the case when A is a 1 × kN vector and B a kN × kN matrix. The number of moves is 4(√N + k − 2) electronic and 1 OTIS.

GSM. The GSM matrix × vector multiplication algorithm is very similar to the GSM vector × matrix algorithm of Figure 5.8. The steps are given in Figure 5.10. Note that we start with A_ij in (i_f, j_f, i_m, j_m) and B_j in (j_f, 0, j_m, 0). In Step 1, B_j is moved from (j_f, 0, j_m, 0) to (j_f, 0, 0, j_m). Following Step 2, B_j is in (j_f, 0, *, j_m). The OTIS move of Step 3 moves B_j to (*, j_m, j_f, 0) and A_ij to (i_m, j_m, i_f, j_f). Steps 4 and 5 first move B_j to (*, j_m, 0, j_f) and then to (*, j_m, *, j_f). Following Step 5, (i_m, j_m, i_f, j_f) has A_ij and B_j. In Step 6, (i_m, j_m, i_f, j_f) computes A_ij·B_j. In Step 7, processor (i_m, j_m, i_f, 0) computes Σ_{j : j mod √N = j_m} A_ij·B_j, which is then sent to (i_f, 0, i_m, j_m) in Step 8. Finally, in Step 9, (i_f, 0, i_m, 0) computes C_i = Σ_{j=0}^{N−1} A_ij·B_j. The total number of data moves is 8(√N − 1) electronic and 2 OTIS.


Step 1: In each group, move the B values from column 0 to row 0 using the data paths of Figure 5.5 in the reverse direction.
Step 2: The row 0 processors of each group broadcast their B values to all processors in the same group and on the same column.
Step 3: Perform an OTIS move on all A and B values.
Step 4: Same as Step 1.
Step 5: Same as Step 2.
Step 6: All processors multiply their A and B values.
Step 7: The processors in column 0 of each group sum the products of Step 6 that are in the same row.
Step 8: Perform an OTIS move on the sums of Step 7.
Step 9: The processors in column 0 of group (*,0) sum the values received in Step 8 that are in the same row.

Figure 5.10. GSM Matrix × Column Vector Multiply


In the case when A is a kN × kN matrix and B a kN × 1 vector, we use the algorithm of Figure 5.10 and pipelining as used for the case when A is a 1 × kN vector and B a kN × kN matrix. The number of moves is 8(√N + k − 2) electronic and 2 OTIS.

5.2.5 Matrix × Matrix

O(N) Memory/Processor Algorithms

When each processor has O(N) memory, it is possible to accumulate an entire column (or row) into each processor. This leads to simplified algorithms. Consider the case when we are to multiply two N × N matrices A and B.

GRM. The GRM algorithm is given in Figure 5.11.

Step 1: Perform an OTIS move on B.
Step 2: Each processor of each group accumulates all A and B values in its group.
Step 3: Move the accumulated B values along the OTIS connection.
Step 4: Each processor computes the inner product of the A and B values it has.

Figure 5.11. O(N) Memory GRM Matrix × Matrix Multiply

We begin with A_ij and B_ij in (i,j). Following Step 1, (i,j) has A_ij and B_ji. After Step 2, (i,j) has row i of A and column i of B. Following Step 3, (i,j) has row i of A and column j of B. In Step 4, (i,j) computes C_ij = Σ_{k=0}^{N−1} A_ik·B_kj.
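Functionally, both O(N)-memory algorithms (GRM here, GSM below) arrange for processor (i,j) to hold row i of A and column j of B and then take an inner product; only the routing that gets the data there differs. A reference check in Python (our own sketch):

    import numpy as np

    def accumulate_and_multiply(A, B):
        # Processor (i, j) ends up with row i of A and column j of B and
        # computes their inner product.
        N = A.shape[0]
        C = np.empty((N, N), dtype=A.dtype)
        for i in range(N):
            for j in range(N):
                C[i, j] = np.dot(A[i, :], B[:, j])
        return C

    A = np.arange(16).reshape(4, 4)
    B = np.arange(16, 32).reshape(4, 4)
    assert np.array_equal(accumulate_and_multiply(A, B), A @ B)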


Step 2 can be done in two stages. In the first stage, the B values are accumulated, and in the second stage the A values are accumulated. To accumulate the B values, each processor first accumulates all values from its row. This takes √N − 1 rightward and √N − 1 leftward moves. Next, the accumulated blocks of √N values are accumulated along columns by making √N − 1 upward and √N − 1 downward moves of blocks of size √N. The total stage 1 moves are 2√N(√N − 1) + 2(√N − 1) = 2(N − 1). Stage 2 is done similarly. Step 3 takes N/K OTIS moves, where K is the maximum number of B values that can be moved in unit time over an optical link. The total number of moves needed by the algorithm of Figure 5.11 is 4(N − 1) electronic and N/K + 1 OTIS. Each processor needs memory for N A values and N B values. The memory requirement can be reduced to N + √N by delaying stage 2 of Step 2 to after Step 3 and coupling Step 4 with the columnwise movement of the √N size packets of A during stage 2.

The algorithm of Figure 5.11 is easily generalized to the case when A and B are kN × kN matrices. Operations previously performed on matrix elements are now performed on k × k blocks of elements. The data movement counts are 1 OTIS in Step 1, 4k²(N − 1) electronic in Step 2, and k²N/K OTIS in Step 3. The total is 4k²(N − 1) electronic and k²N/K + 1 OTIS.

GSM. Our GSM algorithm to multiply two N × N matrices is given in Figure 5.12.

Step 1: Perform an OTIS move on A and B values.
Step 2: Each processor accumulates all √N A values in the same row and group as well as all √N B values in the same column and group.
Step 3: Move the accumulated A and B values along the OTIS connection.
Step 4: Each processor accumulates all N A values in the same row and group as well as all N B values in the same column and group.
Step 5: Each processor computes the inner product of the A and B values it has.

Figure 5.12. O(N) Memory GSM Matrix × Matrix Multiply


Following Step 1, A_ij and B_ij are in (i_m, j_m, i_f, j_f). Following Step 2, (i_m, j_m, i_f, *) has the √N A values A_iq with q_m = j_m, and (i_m, j_m, *, j_f) has the √N B values B_rj with r_m = i_m. These √N blocks of A and B values are then moved to (i_f, *, i_m, j_m) and (*, j_f, i_m, j_m), respectively. Following Step 4, (i_f, *, i_m, *) has row i of A and (*, j_f, *, j_m) has column j of B. Therefore, (i_f, j_f, i_m, j_m) has row i of A and column j of B. The inner product computation of Step 5 leaves C_ij = Σ_{k=0}^{N−1} A_ik·B_kj in (i_f, j_f, i_m, j_m).

Step 1 takes 1 OTIS move. Step 2 takes 2(√N − 1) electronic row moves to accumulate the A values and 2(√N − 1) electronic column moves to accumulate the B values. Step 3 takes 2√N/K OTIS moves. In Step 4, each electronic move moves √N data. Since a total of 4(√N − 1) moves are made, the total cost is 4√N(√N − 1) unit electronic moves. Hence the total number of moves is 4(N − 1) electronic and 2√N/K + 1 OTIS. The algorithm is easily extended to the case when A and B are kN × kN matrices. The number of moves is 4k²(N − 1) electronic and 2k²√N/K + 1 OTIS.

O(1) Memory/Processor Algorithms

Our O(1) memory algorithm is based on Cannon's algorithm [2] to multiply two N × N matrices on an N × N mesh connected computer. Cannon's algorithm was also used by Dekel, Nassimi, and Sahni [6] in their development of hypercube algorithms for matrix multiplication. Cannon's algorithm is given in Figure 5.13.

Step 1: [Align Matrix Elements] Move A_{i,(i+j) mod N} and B_{(i+j) mod N,j} to mesh processor (i,j).
Step 2: [Initialize C_ij] Processor (i,j) initializes its C value to the product of its A and B values.
Step 3: [Compute and Add Remaining Terms] Repeat N − 1 times: { Shift A values left circularly by 1; Shift B values up circularly by 1; C = C + A × B; }

Figure 5.13. Cannon's Matrix Multiplication Algorithm
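For reference, here is a direct sequential rendering of Cannon's algorithm of Figure 5.13 in Python, with np.roll standing in for the circular shifts. It is a sketch of the computation the OTIS-Mesh simulations below reproduce, not of their data movement:

    import numpy as np

    def cannon_multiply(A, B):
        n = A.shape[0]
        # Step 1: align -- row i of A shifts left by i, column j of B up by j.
        A = np.stack([np.roll(A[i, :], -i) for i in range(n)])
        B = np.stack([np.roll(B[:, j], -j) for j in range(n)], axis=1)
        C = A * B                       # Step 2: initialize C
        for _ in range(n - 1):          # Step 3: shift and accumulate
            A = np.roll(A, -1, axis=1)  # shift A left circularly by 1
            B = np.roll(B, -1, axis=0)  # shift B up circularly by 1
            C += A * B
        return C

    A = np.random.randint(0, 5, (4, 4))
    B = np.random.randint(0, 5, (4, 4))
    assert np.array_equal(cannon_multiply(A, B), A @ B)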


GRM. We simulate Cannon's algorithm on the OTIS-Mesh. While obtaining the alignment of Step 1, we also obtain the reverse of each aligned row of A and each aligned column of B. The process for B is similar to that used for A except that we must precede and follow the algorithm for A by an OTIS move (the preceding OTIS move gets all B elements in the same column into the same group, and the following OTIS move gets the columns to the proper processors). We describe the alignment of rows of A only (Figure 5.14).

Step 1: In each group, move A values upward along columns. As data moves through a processor, the processor saves a copy in case it is needed in the row the processor is in.
Step 2: Same as Step 1 except that A values are moved downward.
Step 3: In each row of each group, form the forward ordering.
Step 4: In each row of each group, form the reverse ordering.

Figure 5.14. Moving A Values as per Step 1 of Cannon's Algorithm

Steps 1 and 2 each take √N − 1 electronic moves. Following Step 2, a processor can have up to four A values: 2 belonging to the aligned ordering of Step 1 of Cannon's algorithm, and 2 belonging to the reverse of this ordering. Each row contains a total of 2√N values, with each processor in the row having 0, 1, 2, 3, or 4 values.


These values can be moved to the proper processors on the same row by making O(√N) leftward and rightward data moves. To align B takes O(√N) electronic and 2 OTIS moves. Therefore, making O(√N) electronic and 2 OTIS moves, we can obtain the Step 1 alignment of Cannon's algorithm as well as the reverse of this alignment.

The circular shift of A in Step 3 of Cannon's algorithm can be implemented as a forward shift along the snake of the reverse alignment and a backward shift along the snake of the aligned data. So each circular shift takes 4 electronic moves. To do the circular shift on B, we retain a copy of the aligned and reversed B in each group prior to the second OTIS move done in Step 1. For each circular shift, we make 4 electronic moves in each group on the copy of B and then do an OTIS move to get the shifted B values to the desired processors. The total number of moves required by Step 3 is (N − 1) × (8 electronic and 1 OTIS) = 8(N − 1) electronic and N − 1 OTIS. Therefore, the GRM simulation of Cannon's algorithm can be done using 8N + O(√N) electronic and N + 1 OTIS moves.

The simulation just described works even when A and B are kN × kN matrices. Now, each element that is moved is a k × k block. Therefore an electronic block move takes k² electronic move steps. The number of moves becomes 8k²N + O(k²√N) electronic and N + 1 OTIS.

GSM. To multiply two N × N matrices we use a two-level simulation of Cannon's algorithm. At the top level, we view each N × N matrix as a √N × √N matrix in which each element is a √N × √N submatrix. Let A and B be the N × N matrices to be multiplied and let BA and BB be the corresponding √N × √N matrices in which each element is a √N × √N block or submatrix of A and B, respectively.


Initially, BA_ij and BB_ij are in group (i,j) of the OTIS-Mesh. Let C = A × B and let BC be the corresponding √N × √N matrix of blocks of size √N × √N each. Since BC_ij = Σ_{k=0}^{√N−1} BA_ik × BB_kj, we can use Cannon's algorithm to compute BC. The products of Steps 2 and 3 now are products of submatrices or blocks of size √N × √N; each block is in an OTIS group, which is a √N × √N mesh. These submatrix products can in turn be done using Cannon's algorithm (this is the second level application of Cannon's algorithm).

To implement the two level scheme, we use the algorithm of Figure 5.15. Steps 1 and 2 do the data alignment necessary to perform Steps 2 and 3 of Cannon's algorithm to multiply two blocks/submatrices of size √N × √N each. The forward and backward orderings of the A values can be obtained by making √N − 1 leftward and √N − 1 rightward moves of A values. Similarly, the forward and backward orderings of the B values can be obtained using 2(√N − 1) column moves. Following Step 2, each processor has 2 A values (one from the forward ordering and the other from the backward ordering) and 2 B values.

In Steps 3 and 4 the √N × √N blocks of submatrices of A and B are aligned. For this, an OTIS move is made on the A and B values, followed by 2(√N − 1) electronic row moves and 2(√N − 1) electronic column moves, and finally an OTIS move. For the final OTIS move, we leave a copy of the As and Bs in the originating processors also. Now each processor has 8 A and 8 B values.

Step 5 is done using Steps 2 and 3 of Cannon's algorithm at a cost of 4(√N − 1) electronic moves. In Step 6, the A and B blocks are shifted by using the copies saved during the second OTIS moves of Steps 3 and 4, followed by an OTIS move. This shifting of A and B blocks takes 2 row electronic moves (the forward and backward A blocks are shifted in opposite directions) plus 2 column electronic moves for B blocks and 1 OTIS move.


Step 1: [Align A data within each group/block] Reorder the A values in each row of each group so that the A value originally in (*, *, i, (i+j) mod √N) is now in (*, *, i, j). Call this the forward A ordering. Also create the reverse of this row ordering in each group; call this the backward ordering.
Step 2: [Align B data within each group/block] Reorder the B values in each column of each group so that the B value originally in (*, *, (i+j) mod √N, j) is now in (*, *, i, j). Also create the backward column ordering for the Bs.
Step 3: [Align the A blocks] Rearrange the blocks of A values obtained in Step 1 so that the block originally in group (i, (i+j) mod √N) (i.e., in processors (i, (i+j) mod √N, *, *)) is now in group (i,j). Also create the corresponding backward row ordering for the A blocks.
Step 4: [Align the B blocks] Rearrange the blocks of B values obtained in Step 2 so that the block originally in group ((i+j) mod √N, j) is now in group (i,j). Also create the corresponding backward column ordering for the B blocks.
Step 5: [Initialize block BC_ij] BC_ij = BA_ij × BB_ij.
Step 6: [Compute and add remaining terms] Repeat √N − 1 times: { Shift A blocks left circularly by 1 group; Shift B blocks up circularly by 1 group; BC_ij = BC_ij + BA_ij × BB_ij; }

Figure 5.15. GSM Matrix × Matrix Multiply


The block matrix multiply is done using Steps 2 and 3 of Cannon's algorithm at a cost of 4(√N − 1) electronic and √N − 1 OTIS moves. The total number of moves made by the GSM algorithm is 4N + O(√N) electronic and √N OTIS.

The algorithm of Figure 5.15 is easily extended to the case when A and B are kN × kN matrices. The essential difference is that each element of a √N × √N block is now itself a k × k block. So, each electronic data move becomes k² unit moves. The number of data moves is therefore 4k²N + O(k²√N) electronic and √N OTIS.

5.3 Summary

We have developed OTIS-Mesh algorithms for several variants of the matrix multiplication problem. For each variant, we have considered both the group row mapping and the group submatrix mapping. Our results are summarized in Table 5.1. As can be seen, the GSM mapping is superior for the case of matrix × matrix multiplication. However, for all other variants the GRM is superior. As noted in Section 2.3, GRM is also superior for the matrix transpose operation.

Table 5.1. Comparison between GRM and GSM schemes

  Operation                    GRM electronic   GRM OTIS   GSM electronic   GSM OTIS
  Column × Row                 2√N              2          4(√N−1)          2
  Row × Column                 2(√N−1)          1          5(√N−1)          2
  Row × Matrix                 4(√N−1)          3          8(√N−1)          2
  Matrix × Column              4(√N−1)          1          8(√N−1)          2
  Matrix × Matrix, O(N) mem    4(N−1)           N/K + 1    4(N−1)           2√N/K + 1
  Matrix × Matrix, O(1) mem    8N + O(√N)       N + 1      4N + O(√N)       √N


CHAPTER 6
IMAGE PROCESSING ON AN OTIS-MESH

In this chapter, we focus on four problems from the image processing area. These problems are histogramming, histogram modification, Hough transform, and image shrinking and expanding. As noted in Section 2.3, there are two plausible ways to map an N × N image onto an N² processor OTIS-Mesh: group row mapping (GRM) and group submesh mapping (GSM). Our histogramming and histogram modification algorithms are insensitive to how the image is mapped onto the OTIS-Mesh. Therefore, these algorithms are developed without regard to the mapping used. The algorithms for the Hough transform and for image shrinking and expanding depend on the particular mapping used. In Sections 6.3 and 6.4, we develop algorithms for both the GRM and GSM mappings.

6.1 Histogramming

6.1.1 Background

The input to the histogramming problem is an N × N digitized image I with I(i,j) being the gray level of pixel (i,j), 0 ≤ i,j < N. The gray levels are integers in the range [0,B); that is, 0 ≤ I(i,j) < B, 0 ≤ i,j < N. The histogram of the image is a vector H such that H[b] is the number of pixels with gray value b, 0 ≤ b < B. Parallel histogramming algorithms have been developed for several architectures: among these is an
PAGE 94

algorithm for a pyramid computer with an N × N base; Bestul and Davis [1] have developed an O(√B + log(N/√B)) algorithm for an N² processor SIMD hypercube; and Jenq and Sahni [18] and Jang et al. [15] have developed algorithms for various reconfigurable mesh models. The histogramming algorithms of Jang et al. [15] and Jenq and Sahni [18] partition B into the ranges 0 < B ≤ √N, √N < B ≤ N, and B > N, and solve for each range separately. Further, where appropriate, they consider the cases of O(1) memory per processor and O(B) memory per processor. We shall follow this strategy here also.

6.1.2 Algorithm for 0 < B ≤ √N

In this case, the histogram is left in row 0 of the group (0,0) mesh. Using our four-dimensional indexing scheme, processor (0,0,0,i) will have H[i], 0 ≤ i < B, following the histogram computation. Our strategy is (a) compute the histogram for each row of each √N × √N mesh, (b) use the row histograms to obtain the histogram for each group, and (c) combine the group histograms into a single histogram. More formally, our algorithm for the case 0 < B ≤ √N is:

Step 1: Processor (g_x, g_y, p_x, p_y) determines the number of pixels on its row of the group (g_x, g_y) mesh whose gray value is p_y, 0 ≤ p_y < B.
Step 2: Processor (g_x, g_y, 0, p_y) adds together the Step 1 counts of the processors in its column; this gives the number of pixels in group (g_x, g_y) whose gray value is p_y.
Step 3: Perform an OTIS move on the counts computed in Step 2.
Step 4: Each group sums the counts it received in Step 3, leaving the sum in its (0,0) processor.
Step 5: Perform an OTIS move on the results computed in Step 4.
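A functional model of this five-step computation (groups modeled as the √N × √N subimages, with np.bincount standing in for the counting of Steps 1 and 2; the helper name is ours) can be used to check the result:

    import numpy as np

    def histogram_by_groups(img, B):
        N = img.shape[0]
        s = int(np.sqrt(N))      # group side; each group holds s x s pixels
        H = np.zeros(B, dtype=int)
        for gx in range(s):
            for gy in range(s):
                block = img[gx*s:(gx+1)*s, gy*s:(gy+1)*s]
                # Steps 1-2: per-row counts summed down columns give the
                # group histogram; Steps 3-5 sum the group histograms.
                H += np.bincount(block.ravel(), minlength=B)[:B]
        return H

    img = np.random.randint(0, 4, (16, 16))     # B = 4 <= sqrt(256)
    assert np.array_equal(histogram_by_groups(img, 4),
                          np.bincount(img.ravel(), minlength=4))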

Step 1 is accomplished by shifting the image values first leftward and then rightward within rows of the meshes. When an image value passes through processor (g_x, g_y, p_x, p_y), this processor increments its counter if the image value equals p_y. √N − 1 leftward and B − 1 rightward shifts, for a total of √N + B − 2 electronic moves, are needed. Step 2 is done by shifting the counts upwards along columns to row 0 of the group; this step takes √N − 1 electronic moves. In Step 4 we add all values in a √N × √N mesh. This takes 2(√N − 1) electronic moves. Therefore, histogramming can be done with 4(√N − 1) + B − 1 electronic and 2 OTIS moves.

Theorem 6.1.1 Our histogramming algorithm for 0 < B ≤ √N is optimal.

Proof. For the electronic moves, information about pixels in the processor of group (√N−1, √N−1) farthest from group (0,0) must reach the target processors in row 0 of group (0,0), and the B histogram values must be spread over B distinct processors of that row; distance arguments like those of Chapter 5 show that at least 4(√N − 1) + B − 1 electronic moves are needed.

For the 2 OTIS moves, we see that information from all processors in group (√N−1, √N−1) (say) must get to group 0. Assume that group (√N−1, √N−1) has at least 2 gray values. To get information out of a group, an OTIS move must be made. A single OTIS move, however, moves data from different processors of a group into processors of different groups. Therefore, at least 2 OTIS moves are necessary to move different data from 2 or more different processors into a single other group.

6.1.3 Algorithms for √N < B ≤ N

When each processor has O(1) memory, the histogram may be computed so that H[v] is left in processor (0, 0, ⌊v/√N⌋, v mod √N), 0 ≤ v < B; that is, the histogram is stored in row-major order in group (0,0). The steps are: each group first computes its own histogram, stored in row-major order within the group (Step 1); an OTIS move routes the count of gray value v in each group to group (⌊v/√N⌋, v mod √N) (Step 2); each group then sums the values it received, leaving the sum in its (0,0) processor (Step 3); and a final OTIS move routes these sums to group (0,0) (Step 4).

Step 1 takes 4√N + o(√N) electronic moves; Steps 2 and 4 take 1 OTIS move each; and Step 3 takes 2(√N − 1) electronic moves. The total number of moves is 6√N + o(√N) electronic moves and 2 OTIS moves.

Theorem 6.1.2 Every histogramming algorithm for the case √N < B ≤ N must make at least 5(√N − 1) + ⌊(B−1)/√N⌋ − 1 electronic and 2 OTIS moves to compute the histogram configuration obtained by the O(1) memory algorithm. When the output configuration has the histogram in a √B × √B submesh of group (0,0), at least 4√N + 2√B − 6 electronic and 2 OTIS moves are needed.

Proof. Consider the image in which the gray value of the pixel in processor (√N−1, √N−1, √N−1, √N−1) is 0 and the pixels in group (√N−1, √N−1) have at least 2 different gray values. Using the reasoning in Theorem 6.1.1, we see that at least 4(√N − 1) electronic and 2 OTIS moves are needed to get the histogram information from the group (√N−1, √N−1) processors to the target processors in group (0,0). Next, suppose that the pixel in processor (0,0,0,0) has gray value v such that v mod √N = √N − 1 and ⌊v/√N⌋ = ⌊(B−1)/√N⌋ − 1. It takes ⌊(B−1)/√N⌋ − 1 + √N − 1 electronic moves to get information from (0,0,0,0) to (0, 0, ⌊(B−1)/√N⌋ − 1, √N − 1), and these electronic moves cannot be overlapped with those used to move information from (√N−1, √N−1, √N−1, √N−1) to (0,0,0,0). Therefore, at least 5(√N − 1) + ⌊(B−1)/√N⌋ − 1 electronic and 2 OTIS moves are needed to obtain the output configuration obtained by the O(1) memory algorithm.

For the √B × √B submesh output configuration, suppose that the gray value in processor (0,0,0,0) is B − 1. Therefore, information needs to flow from (0,0,0,0) to (0, 0, √B − 1, √B − 1), requiring 2(√B − 1) electronic moves that cannot be overlapped with the electronic moves made when moving information from (√N−1, √N−1, √N−1, √N−1) to (0,0,0,0). Thus a total of 4√N + 2√B − 6 electronic and 2 OTIS moves are necessary.


An optimal histogramming algorithm for the √B × √B submesh output configuration is possible when O(√B) memory per processor is available. The algorithm given below adapts the method used in Jenq and Sahni [18], and assumes that B is a perfect square and that √B divides √N.

Step 1: Tile each √N × √N mesh with √B × √B tiles.
Step 2: Processor i on each row of each √B × √B tile computes an array A[0 : √B − 1] of values such that A[j] equals the number of pixels in that row of the tile whose gray value is j√B + i, 0 ≤ i < √B.
Step 3: Processors in the same column of each tile perform a consecutive sum operation; processor i of a tile column sums the A[i] values of the processors on its column. Following this step, processor (i,j) of a tile has the number of pixels in its tile whose gray value is i√B + j.
Step 4: Perform a window sum operation on the results of Step 3 using a window size of √B × √B. This operation does not span group boundaries. The result of the window sum operation is in the top left √B × √B window/tile of each group. Following this operation, processor (g_x, g_y, i, j) has the number of pixels in group (g_x, g_y) whose gray value is i√B + j.
Step 5: Do an OTIS move on the values computed in Step 4.
Step 6: Processor (g_x, g_y, 0, 0) sums all the values received by its group.
Step 7: Do an OTIS move on the values computed by (g_x, g_y, 0, 0) in Step 6.
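The effect of Steps 1-3 is easy to model: after the consecutive-sum step, position (i,j) of every √B × √B tile holds its tile's count of gray value i√B + j. A sketch (ours; it assumes, as above, that √B divides √N):

    import numpy as np

    def tile_histograms(img, B):
        # After Steps 1-3, entry (i, j) of each sqrt(B) x sqrt(B) tile
        # holds the number of pixels in its tile with gray value i*sqrt(B)+j.
        n = img.shape[0]
        t = int(np.sqrt(B))
        out = np.zeros(img.shape, dtype=int)
        for r in range(0, n, t):
            for c in range(0, n, t):
                counts = np.bincount(img[r:r+t, c:c+t].ravel(), minlength=B)
                out[r:r+t, c:c+t] = counts[:B].reshape(t, t)
        return out

    img = np.random.randint(0, 16, (8, 8))   # B = 16, sqrt(B) = 4 divides 8
    T = tile_histograms(img, 16)
    assert T[:4, :4].sum() == 16             # each tile's counts sum to 16 pixels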


For the time complexity, we see that Steps 2 and 3 take 2(√B − 1) electronic moves each; Step 4 takes 2(√N − √B) electronic moves; Steps 5 and 7 take 1 OTIS move each; and Step 6 takes 2(√N − 1) electronic moves. The total number of moves is 4√N + 2√B − 6 electronic and 2 OTIS moves.

6.1.4 Algorithm for B > N

This case can be done with 22√N + o(√N) electronic and o(√N) OTIS moves by modifying the sort algorithm of Section 4.13 so that, during the sort, pixels with the same gray value are combined into a single pixel.

6.2 Histogram Modification

Histogram modification is the process of changing the gray values of an image based on a mapping function f; f(i) gives the new gray value for pixels whose original gray value is i, 0 ≤ i < B. In histogram flattening or equalization [40], the function f is computed by first computing the prefix sums S[i] = Σ_{j=0}^{i} H[j], 0 ≤ i < B; f(i) is then obtained by scaling S[i] so that the new histogram is approximately flat.
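One common flattening rule computes f from the prefix sums S as f(i) = max{0, round(B·S[i]/N²) − 1}; the exact formula used in the source is cut off here, so the sketch below is our assumption of a standard choice, not necessarily the thesis's:

    import numpy as np

    def flattening_map(H, N, B):
        # S[i] = H[0] + ... + H[i]; scale S so the new histogram is
        # approximately flat (assumed formula, see the note above).
        S = np.cumsum(H)
        return np.maximum(0, np.rint(B * S / N**2).astype(int) - 1)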

Following the computation of f, the gray values of all pixels must be updated according to f. When we are limited to O(1) memory per processor, this updating of pixel values may be done by first performing a window broadcast of the f values to all groups. This broadcast can then be followed by a random access read (RAR) in which each processor obtains the needed f value from within its group. The window broadcast takes 2(√N − 1) electronic moves and the RAR takes 23√N + o(√N) electronic moves [54]. Thus, the updating phase takes 25√N + o(√N) electronic and 2 OTIS moves.

When O(√N) (= O(√B)) memory per processor is available, the updating of gray values may be done by first doing a window broadcast of the f values as was done in the O(1) memory case. Next, each processor accumulates the f values in the √N processors that are in its column. This accumulation is done in an array C. For a processor in column j of its group, C[i] = f(i√N + j), 0 ≤ i,j < √N. This accumulation step takes 2(√N − 1) electronic moves. Following the construction of the C arrays, each processor sends a token to the processor on its row that has the f value it needs. When the token reaches the target processor, the f value is written into the token, and the token is returned to the originating processor. This token send/receive step can be broken into two phases: one in which tokens are sent to and received from processors to the left of the source processors, and another in which the target processors are to the right. Each of these phases takes 2(√N − 1) electronic moves. Thus, the O(√N) memory algorithm takes 8(√N − 1) electronic and 2 OTIS moves to update the gray values following the computation of f.

The complexity of the O(√N) memory updating algorithm can be reduced to 6(√N − 1) electronic and 2 OTIS moves if the histogram computation phase saves, in processors 0 and √N − 1 of each row of a group, the gray values of all √N pixels in that row.


This can be done without increasing the number of moves taken by the histogramming algorithm. When processors 0 and √N − 1 of each row know the gray values of all pixels in their row, the pixel values can be updated using the C arrays in 2(√N − 1) electronic moves rather than in 4(√N − 1) electronic moves. In the new method, processor 0 of each row initiates tokens t_{√N−1}, t_{√N−2}, ..., t_1, in that order. Token t_i contains the gray value of the pixel in processor i of the row. The tokens move rightward, one processor at a time. When a processor receives a token, it checks the token's gray value and appends the updated value for that gray value in case this updated value is stored in the processor's C array. The rightward token moves are made for exactly √N − 1 steps. Following this, processor √N − 1 of each row initiates tokens t_0, t_1, ..., t_{√N−2} that move leftward for exactly √N − 1 steps. At the end of these moves, each processor should have received a pair (t, f(t)), where t is the original gray value in the processor.
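The lookup pattern behind the token scheme — which processor supplies which f value — is captured by the sketch below (our names; it models the outcome of the two sweeps, not the move-by-move pipelining):

    def update_row(gray, C, w):
        # gray[i]: pixel gray value at processor i of a group row (< w*w).
        # C[j][a]: f(a*w + j), the portion of f stored at the column-j processor.
        # Two pipelined sweeps carry each token past every processor; the
        # processor whose C array holds f(t) stamps the token as it passes.
        return [C[t % w][t // w] for t in gray]

    # Tiny check with w = 2 and f(t) = t + 1:
    f = lambda t: t + 1
    C = [[f(a * 2 + j) for a in range(2)] for j in range(2)]
    assert update_row([3, 0], C, 2) == [4, 1]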

6.3 Shrinking and Expanding

6.3.1 Background

Let I be an N × N image and let B^q[i,j] represent the pixel block {[u,v] | 0 ≤ u,v < N, max{|u−i|, |v−j|} ≤ q}. The q-step expansion, E^q, and the q-step shrinking, S^q, of I are given by the equations [43]:

E^q[i,j] = max{I[u,v] | [u,v] ∈ B^q[i,j]}, 0 ≤ i,j < N, and
S^q[i,j] = min{I[u,v] | [u,v] ∈ B^q[i,j]}, 0 ≤ i,j < N.

Parallel algorithms for image shrinking and expanding have been developed for several architectures, including a pyramid computer with an N × N base pyramid. Ranka and Sahni [42] have developed an O(log N) time algorithm to compute E^q and S^q exactly using an N² processor hypercube computer. Jenq and Sahni have developed an O(1) time RMESH algorithm to compute E^q and S^q for binary images [16]. Since image expansion and shrinking are computationally equivalent, we discuss only image expansion explicitly.

Following Ranka and Sahni [42], we compute E^q using the decomposition given below:

E^q[i,j] = max{top^q[i,j], bottom^q[i,j]}, where
top^q[i,j] = max{R^q[u,j] | 0 ≤ i−u ≤ q},
bottom^q[i,j] = max{R^q[u,j] | 0 < u−i ≤ q},
R^q[i,j] = max{left^q[i,j], right^q[i,j]},
left^q[i,j] = max{I[i,v] | 0 ≤ j−v ≤ q}, and
right^q[i,j] = max{I[i,v] | 0 < v−j ≤ q}.
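The decomposition is just the separability of the window max: a horizontal pass (left^q/right^q giving R^q) followed by a vertical pass (top^q/bottom^q) yields E^q. A direct check in Python (our own sketch; rows index the first coordinate):

    import numpy as np

    def expand(I, q):
        n = I.shape[0]
        R = np.empty_like(I)
        for i in range(n):
            for j in range(n):
                R[i, j] = I[i, max(0, j - q):j + q + 1].max()   # left/right max
        E = np.empty_like(I)
        for i in range(n):
            for j in range(n):
                E[i, j] = R[max(0, i - q):i + q + 1, j].max()   # top/bottom max
        return E

    I0 = np.random.randint(0, 9, (8, 8))
    brute = np.array([[I0[max(0, i-2):i+3, max(0, j-2):j+3].max()
                       for j in range(8)] for i in range(8)])
    assert np.array_equal(expand(I0, 2), brute)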


The algorithm used to compute right^q depends on whether the image is mapped onto the OTIS computer using the GRM or the GSM mapping.

6.3.2 GRM Mapping

In the GRM mapping, each row of the image is mapped onto an OTIS group in snake-like order. A simple way to compute right^q is to shift the gray values leftward by q units along the snake. A left shift by 1 takes 1 electronic left move, 1 electronic right move, and 1 electronic up move. Therefore, right^q can be computed using a total of 3q electronic moves. We can overlap the left and right electronic moves required to compute left^q and right^q so that the total number of moves in each of the four mesh directions is q. That is, we can compute left^q and right^q using 4q electronic moves.

To compute top^q and bottom^q, we first do an OTIS move so that each column of R^q is in a single group in snake-like order; then we run the left^q and right^q algorithm; and finally we do another OTIS move to get the E^q values back in the proper locations. Thus E^q can be computed using 8q electronic and 2 OTIS moves. When q > 1.25√N, this simple strategy can be improved upon as below.

When q > 1.25√N, right^q of a column 0 processor on an even indexed row (within a group) depends on the √N − 1 values to its right and on the same row, all values on the following q_f = ⌊(q − √N + 1)/√N⌋ = ⌊(q+1)/√N⌋ − 1 rows, and q_m = q − √N + 1 − q_f·√N = (q+1) mod √N values in the row q_f + 1 rows away (Figure 6.1(a) and (b)). Likewise, right^q of a column √N − 1 processor on an odd indexed row depends on the √N − 1 values to its left and on the same row, all values on the following q_f rows, and q_m values in the row q_f + 1 rows away (Figure 6.1(c) and (d)). Also, right^q of a column i processor on an even indexed row, i ≠ 0, depends on the √N − i − 1 values to its right and on the same row, all values in the following q_f rows, and an additional q_m + i values (Figure 6.2(a) and (b)). Similarly, for a column i processor on an odd indexed row, i ≠ √N − 1, right^q depends on the i values to its left and on the same row, all values in the following q_f rows, and an additional q_m + i values (Figure 6.2(c) and (d)). When q_m = 0 or 1, the additional q_m + i values lie on a single row for all i ∈ [0, √N − 1].


Figure 6.1. Data required in GRM mapping for end processor: (a) q_f even; (b) q_f odd; (c) q_f even; (d) q_f odd

Figure 6.2. Data required in GRM mapping for middle processor: (a) q_f even; (b) q_f odd; (c) q_f even; (d) q_f odd


When q_m = 0 or 1, we use the following steps to compute right^q:

Step 1: Column 0 processors compute the maximum gray value in their row.
Step 2: Column 0 processors shift the maximum computed in Step 1 up column 0 for q_f steps. Each column 0 processor computes the max of the q_f values it receives.
Step 3: Each processor shifts its image value up by q_f + 1 rows. Following this step, each row has the additional √N values it needs from the row q_f + 1 rows away.
Step 4: right^q may now be computed by circulating the original image values in a row, the additional values from the row q_f + 1 away, and the max value of the intermediate q_f rows.

Step 1 takes √N − 1 leftward moves, Step 2 takes q_f upward moves, and Step 3 takes q_f + 1 upward moves. For Step 4, we note that in some rows, the original row values are to be shifted left while in others, these are to be shifted right. Rows which require a left shift of same-row values require a right shift of the row q_f + 1 away values, and rows that require a right shift of same-row values require a left shift of the row q_f + 1 away values. So, the same-row and row q_f + 1 away values can be circulated through the processors that need these values using 2(√N − 1) electronic moves. The rightward circulation of the max value of the intermediate rows can be done with one additional move by pipelining it with the rightward moves for that row. The total number of moves for the computation of right^q is 3(√N − 1) + 2q_f + 2 = 3√N + 2q_f − 1 electronic and 0 OTIS. The computation of left^q requires only 2(√N − 1) + 2q_f + 2 electronic moves because the row max values need not be recomputed. Thus the left^q and right^q values may be computed using a total of 5(√N − 1) + 4q_f + 4 electronic moves. top^q and bottom^q can be computed similarly using 5(√N − 1) + 4q_f + 4 electronic and 2 OTIS moves. Thus the total number of moves to compute E^q is 10(√N − 1) + 8q_f + 8 electronic and 2 OTIS moves.


Figure 6.3. Data required in GSM mapping

When q_m > 1, the additional values needed to compute right^q lie on two rows: one is q_f + 1 away and the other is q_f + 2 away. The number of gray values needed from the row q_f + 2 away is q_m − 1. These q_m − 1 values can reach the row that needs them if Step 3 of the q_m ≤ 1 algorithm shifts upwards for q_f + 2 steps, rather than q_f + 1. So, only one additional upward move is needed. The row q_f + 2 values can be circulated in Step 4 by pipelining their movement along with that of the row q_f + 1 values. This increases the leftward and rightward moves by q_m − 1 each. The total move count for right^q increases by 2q_m − 1 over that for the case q_m ≤ 1. The new count for E^q becomes 10(√N − 1) + 8q_f + 8q_m + 4 electronic and 2 OTIS moves.

6.3.3 GSM Mapping

The strategy to compute E^q when a GSM mapping is used is similar to that used for a GRM mapping. Notice that in a GSM mapping, a row of the image is distributed over √N groups with √N row elements per group (Figure 6.3). We consider three cases: (a) q < √N, (b) q_f = ⌊(q − √N + 1)/√N⌋ = ⌊(q+1)/√N⌋ − 1 ≠ 0 and q_m = (q+1) mod √N = 0, and (c) q_f > 0 and q_m ≠ 0. The case q_f = 0 and q_m ≤ 1 is included in (a).

When q < √N, the gray values needed to compute right^q for processor (g_x, g_y, p_x, p_y) are in (g_x, g_y, p_x, *) and (g_x, g_y + 1, p_x, *). The steps to compute right^q are:

Step 1: Perform the following sequence of moves on the gray values:


  (g_x, g_y + 1, p_x, p_y)
  → (p_x, p_y, g_x, g_y + 1)   [OTIS move]
  → (p_x, p_y, g_x, g_y)       [electronic move]
  → (g_x, g_y, p_x, p_y)       [OTIS move]

This moves gray values from (g_x, g_y + 1, p_x, p_y) to (g_x, g_y, p_x, p_y). Now each row in a group has all the gray values needed to compute right^q for all processors in the group row.

Step 2: Shift the original gray values leftwards within a group row by q units.
Step 3: Shift the gray values received in Step 1 rightwards by √N − 1 units.

Steps 2 and 3 cause all data needed by a processor to go through the processor, enabling the processor to compute its right^q value. The total number of moves needed for the case q < √N is √N + q electronic and 2 OTIS moves. The moves needed to compute E^q become 4√N + 4q electronic and 8 OTIS.

When q_f ≠ 0 and q_m = 0, the right^q value of (g_x, g_y, p_x, p_y) is the max of I(g_x, g_y, p_x, p_y), the values to the right of (g_x, g_y, p_x, p_y) in the same row and group, the max of the values in (g_x, g_y + i, p_x, *), 1 ≤ i ≤ q_f, and some of the values in (g_x, g_y + q_f + 1, p_x, *) (but not the value in (g_x, g_y + q_f + 1, p_x, √N − 1)). The steps to use are:

Step 1: Processor (g_x, g_y, p_x, 0) determines the max of the gray values in (g_x, g_y, p_x, *).
Step 2: Processor (g_x, g_y, p_x, p_y) sends its gray value to (g_x, g_y, p_x, p_y + 1), 0 ≤ p_y < √N − 1.


Step 3: Perform the following move sequence on the max values computed in Step 1:

  (g_x, g_y, p_x, 0)
  → (p_x, 0, g_x, g_y)          [OTIS move]
  → (p_x, 0, g_x, g_y − 1)      [electronic move]
  → (p_x, 0, g_x, g_y − 2)
  → ...
  → (p_x, 0, g_x, g_y − q_f)
  → (g_x, g_y − q_f, p_x, 0)    [OTIS move]

During the electronic moves, each processor computes the max of the q_f max values that pass through it.

Step 4: Perform the following move sequence on the shifted values of Step 2 (note that p_y ≠ 0):

  (g_x, g_y, p_x, p_y)
  → (p_x, p_y, g_x, g_y)              [OTIS move]
  → (p_x, p_y, g_x, g_y − 1)          [electronic move]
  → ...
  → (p_x, p_y, g_x, g_y − q_f − 1)
  → (g_x, g_y − q_f − 1, p_x, p_y)    [OTIS move]

This moves the additional values that processors in a group need from the group q_f + 1 to its right.


Step 5: Shift the max values received in Step 3 by (g_x, g_y, p_x, 0) and the additional values received in Step 4 by (g_x, g_y, p_x, p_y), p_y ≠ 0, rightwards √N − 1 times within a group row.
Step 6: Shift the original values leftward √N − 1 times within a group row.

Steps 5 and 6 cause all data needed by a processor to go through it and enable the processor to compute right^q. Step 1 requires √N − 1 electronic moves, Step 3 requires 2 OTIS and q_f electronic moves, Step 4 requires 2 OTIS and q_f + 1 electronic moves, and Steps 5 and 6 each require √N − 1 electronic moves. Note, however, that all moves of Step 3 can be overlapped with moves of Step 4. Moreover, Step 6 can be combined with Step 1. Therefore right^q can be computed using 2√N + q_f electronic and 2 OTIS moves. When computing left^q, Step 1 can be omitted as the max values have already been computed, and Step 2 is not required either; however, Step 6 now contributes √N − 1 electronic moves. So left^q can be computed with an additional 2√N + q_f − 1 electronic and 2 OTIS moves. top^q and bottom^q can be computed similarly. The total number of moves needed to compute E^q is 8√N + 4q_f − 2 electronic and 8 OTIS moves.

The final case to consider is when q_f ≥ 0 and q_m ≠ 0. If q_f = 0, we may assume q_m > 1 because the case q_f = 0 and q_m ≤ 1 is covered by the case q < √N. Now we need to shift values from two adjacent groups (Figure 6.4): √N from the group on the right and q_m − 1 from the group two to the right. This shifting can be done using the move sequence

  (g_x, g_y, p_x, p_y)
  → (p_x, p_y, g_x, g_y)    [OTIS move]


  → (p_x, p_y, g_x, g_y − 1)              [electronic move]
  → (p_x, p_y, g_x, g_y − 2)              [electronic move]
  → (g_x, [g_y − 1, g_y − 2], p_x, p_y)   [OTIS move]

Figure 6.4. Data required in GSM mapping when q_f = 0 and q_m ≠ 0

In the last OTIS move, two values from each processor are moved. These are the two values that pass through the processor during the two electronic moves stated above. To compute right^q, we must now shift the original gray values leftward within each group √N − 1 times, shift the values from the right adjacent group leftward q_m − 1 times, and shift the values from the group two away rightward √N − 1 times. The computation of right^q takes 2 OTIS and 2√N + q_m − 1 electronic moves. left^q, top^q, and bottom^q may be similarly computed. The total number of moves to compute E^q is therefore 8√N + 4q_m − 4 electronic and 8 OTIS. If q_f > 0, we must also compute the max of the values in the intermediate q_f groups, as was done for case (b). This adds 4q_f electronic and 0 OTIS moves to the computation of E^q.


6.4 Hough Transform

6.4.1 Background

The Hough transform is used to detect straight lines or edges in an image. The p angle Hough transform [42] of an N × N image I is a two-dimensional array H such that

H[r,j] = |{(x,y) | r = ⌊x cos θ_j + y sin θ_j⌋ and I[x,y] = 1}|.

Here j takes the values 0, 1, ..., p − 1; these values of j correspond to the p angles θ_j = jπ/p. Since θ_j is in the range [0, π) and since 0 ≤ x, y < N, r is in the range [−√2·N, √2·N].

Parallel algorithms to compute the Hough transform have been developed for several architectures. Chaung and Li [3] and Li et al. [29] do this for systolic arrays; Rosenfeld et al. [44], Kannan and Chaung [20], Cypher et al. [5], Guerra and Hambrusch [11], and Silberberg [50] consider mesh computers; Fisher and Highnam [8] use a scan line array; Ibrahim et al. [13] use a SIMD tree; Li et al. [27, 28] and Maresca et al. [32] use a polymorphic torus; Ranka and Sahni [42] use hypercube computers; Choudhary and Ponnusamy [4] and Thazhuthaveetil and Shah [52] use shared-memory multiprocessors; Jenq and Sahni [17] use reconfigurable meshes; and Pavel and Akl [39] use optical arrays. Illingworth and Kittler [14] provide a survey of work related to the Hough transform.

6.4.2 An Improved Algorithm for N × N Meshes

Our development here is based on the work of Jenq and Sahni [17], which itself is closely related to the work reported in several of the other references cited above. The computation of H is generally broken into four phases; each phase computes H for a certain range of θ_j. The four ranges are (a) 0 ≤ θ_j < π/4, (b) π/4 ≤ θ_j < π/2, (c) π/2 ≤ θ_j < 3π/4, and (d) 3π/4 ≤ θ_j < π. Since the computation of H in each of these θ_j ranges is similar, the computation is explicitly described for only one of the four ranges. As in Jenq and Sahni [17], we explicitly consider only the case π/2 ≤ θ_j < 3π/4.
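A direct sequential computation of H from the definition is useful as a reference when checking the mesh algorithms; the sketch below (ours) assumes θ_j = jπ/p as above and uses a Python dict since r may be negative:

    import math
    import numpy as np

    def hough(I, p):
        H = {}
        for j in range(p):
            theta = j * math.pi / p
            c, s = math.cos(theta), math.sin(theta)
            for x in range(I.shape[0]):
                for y in range(I.shape[1]):
                    if I[x, y] == 1:
                        r = math.floor(x * c + y * s)
                        H[r, j] = H.get((r, j), 0) + 1
        return H

    I = np.zeros((8, 8), dtype=int)
    I[np.arange(8), np.arange(8)] = 1    # a diagonal line of pixels
    H = hough(I, 4)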


Figure 6.5. Coordinate system used in the Hough transform

Jenq and Sahni [17] have shown how to compute H[r,j] for r ≥ 0 and j such that π/2 ≤ θ_j < 3π/4 on a two-dimensional mesh computer by starting tokens on two boundaries of the mesh and successively moving these tokens towards the remaining two boundaries. These tokens accumulate the H[r,j] values. When we use the normal convention of locating the origin of the coordinate system at the bottom-left corner of the image, and so at the bottom-left corner of the N × N mesh (see Figure 6.5), tokens originate at the left and bottom boundaries of the image/mesh and move up and right till they reach either the top or right boundary.

For π/2 ≤ θ_j < 3π/4, the rules governing token movement are derived from the following facts. Here θ is the angle and r(x,y) = ⌊x cos θ + y sin θ⌋.

(a) If r(x,y) = r(x, y+k) for some k > 0, then k = 1.
(b) If r(x,y) = r(x+1, c), then c = y or c = y+1.
(c) If r(x,y) = r(x, y+1) = r(x+1, c), then c = y+1.
(d) If r(x,y) ≠ r(z, y+1) for z ≥ x, then r(x,y) ≠ r(z,w) for w > y.
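Facts (a) and (b) are easy to spot-check numerically; the snippet below (ours) verifies two of their consequences — runs of equal r in a column have length at most 2, and r drops by at most 1 when x grows by 1 — for random angles in the range:

    import math, random

    def r(x, y, theta):
        return math.floor(x * math.cos(theta) + y * math.sin(theta))

    for _ in range(1000):
        theta = random.uniform(math.pi / 2, 3 * math.pi / 4)
        x, y = random.randrange(64), random.randrange(64)
        if r(x, y, theta) == r(x, y + 1, theta):          # consequence of (a)
            assert r(x, y, theta) != r(x, y + 2, theta)
        # consequence of (b): r(x+1, y) is r(x, y) or r(x, y) - 1
        assert r(x + 1, y, theta) in (r(x, y, theta), r(x, y, theta) - 1)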


Using these facts, we arrive at the following algorithm to compute H[r,j] for a single angle θ = θ_j.

Step 1: [Create Tokens on Left and Bottom Boundaries] Processor (0,y) creates the token (sinv, cosv, r, n) = (sin θ, cos θ, r(0,y), I[0,y]) provided r(0,y) ≠ r(0, y−1). Processor (x,0) creates the token (sin θ, cos θ, r(x,0), I[x,0]) provided r(x,0) ≠ r(x−1, 0).
Step 2: [Move Tokens Up] Let (sinv, cosv, r, n) be the token (if any) in processor (x,y). If r = r(x, y+1) or r ≠ r(x+1, y), move the token to (x, y+1). If (x,y) receives a token (sinv, cosv, r', n') in this step, it increments n' by I[x,y] provided r' = r(x,y).
Step 3: [Move Tokens Right] All tokens move right, from (x,y) to (x+1, y). If (x,y) receives a token (sinv, cosv, r', n') in this step, it increments n' by I[x,y] provided r' = r(x,y).
Step 4: Repeat Steps 2 and 3 until all tokens reach the top or right boundary.

Theorem 6.4.1 The four step procedure given above is correct.

Proof. To establish the correctness of this procedure, we must show that every token (sinv, cosv, r, n) visits all processors (x,y) for which r(x,y) = r. Note that when π/2 ≤ θ < 3π/4, −1/√2 < cos θ ≤ 0, 1/√2 < sin θ ≤ 1, and sin θ + cos θ > 0.

Consider the configuration following Step 1. If token (sinv, cosv, r, n) is in (0,y), then all (x,z) for which r(x,z) = r satisfy x ≥ 0 and z ≥ y. Further, if the token is in (x,0), then all (z,y) for which r(z,y) = r satisfy z ≥ x and y ≥ 0. Therefore, all unreached processors with the same r value can be reached by making upward and rightward moves alone.


For any token (sinv, cosv, r, n) in processor (x,y), let property P be: all unreached processors (u,v) with r(u,v) = r have u ≥ x and v ≥ y (i.e., these processors can be reached by making upward and rightward moves alone). We have already shown that P holds following Step 1. We shall show that this property holds following each execution of Steps 2 and 3; therefore the algorithm is correct. To establish the result, we will also need to show that at the start of Step 2, r(x,y) = r (so we need not check r' = r(x,y) in Step 3). The first time Step 2 is initiated, r(x,y) = r and so this condition holds. Call this condition Q.

Assume, for the induction hypothesis, that P and Q hold at the start of each execution of Step 2. Consider a token that moves up in Step 2, and suppose that r = r(x,y). If r = r(x, y+1) also, then r = r(x,y) = r(x, y+1). From fact (c) it follows that r ≠ r(x+1, y). From this and cos θ ≤ 0, it follows that r > r(x+1, y) ≥ r(x+j, y), j ≥ 2. Therefore, following the upward move of the token, property P still holds for the token. From fact (a), r = r(x,y) = r(x, y+1), and sin θ > 0, it follows that r < r(x, y+j), j ≥ 2. Therefore, following the rightward move of this token in Step 3, property P again holds. Following the Step 3 move of the token, the token is in processor (x+1, y+1), and r(x+1, y+1) = ⌊(x+1) cos θ + (y+1) sin θ⌋ = ⌊x cos θ + y sin θ + cos θ + sin θ⌋ ≥ ⌊x cos θ + y sin θ⌋ = r (because cos θ + sin θ > 0). This, together with the knowledge that r = r(x, y+1) ≥ r(x+1, y+1), implies that r(x+1, y+1) = r. So condition Q holds after Step 3.

Next, suppose that r = r(x,y), r ≠ r(x, y+1), and r ≠ r(x+1, y). Again, the token moves to (x, y+1) in Step 2. Since −1/√2 < cos θ ≤ 0, r(x+1, y) = r − 1 ≥ r(x+j, y), j ≥ 2. Therefore moving the token up to (x, y+1) preserves property P. P is also preserved following the rightward move made in Step 3 because r(x, y+j) ≥ r(x, y+1) = r + 1, j ≥ 2. Further, since 1/√2 < sin θ ≤ 1, r(x, y+1) = r + 1. Since r(x+1, y+1) is in {r(x, y+1), r(x, y+1) − 1} = {r+1, r} as well as in

PAGE 115

105 {r(x + 1, y), r{x + 1, y) + 1} = {r 1, r}, r(x + 1, y + 1) = r. Therefore Q also holds following Step 3. If a token does not move up in Step 2, then r = r(x, y) = r(x+l, y) at the start and end of Step 2. Following Step 3, the token has moved to (x+1, y) and so condition Q holds. Also since r ^ r(x,y + 1) and sin0 > l/\/2, r < r(x,y + 1) < r(x,y + j), j > 2. Therefore the right move of Step 3 preserves property P also. Theorem 6.4.1 establishes the correctness of our N x N mesh single angle Hough transform algorithm by showing that each token goes through every processor with the same r value as that of the token. For the complexity analysis, we first observe that there can never be more than one token in any processor. To see this, observe that when the algorithm completes in Step 1, no processor has more than 1 token. If two tokens t\ and tj end up in the same processor following Step 2, then t\ must be in (x, y) and * 2 in (x, y + 1) prior to the upward move of Step 2. Further, in Step 2, ti must move to (x, y + 1) and <2 must remain in (x, y + 1). From condition Q of Theorem 6.4.1, we know that the r value r x of token t x must be r(x,y) and r 2 = r(x,y + 1). Since all tokens in the same column must originate in the same column (because tokens move right at the same rate), r x ^ r 2 (alternatively, since all tokens have different r values, ri ^ r 2 ). Therefore, r 2 = r(x,y + l) = r t + l. For ti to move up and ^ to remain in (x, y + 1), we must have r(x + 1, y) = r x 1 (this causes ti to move up), r(x + l,y + 1) = r x + 1 (this causes ^ to not move up). However, r(x + 1, y) and r(x + 1, y + 1) can differ by at most 1. Therefore, this condition is not possible and so two tokens cannot be in the same processor following Step 2. Since all tokens move right in Step 3, two tokens cannot be in the same processor after Step 3 unless they are in the same processor before Step 3. Therefore, it is not possible for two tokens to ever be in the same processor.
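The preceding argument is easy to exercise on small images. The following sequential sketch, in Python, mimics Steps 1 through 4 for one angle in the range π/2 ≤ θ < 3π/4; it is a simulation aid rather than an OTIS-Mesh program, the function and variable names are ours, and it returns the counts as a dictionary keyed by r value rather than as the sorted array H produced by the final sort step.

import math

def hough_single_angle(I, theta):
    # Sequential simulation of the four step single angle algorithm for
    # pi/2 <= theta < 3*pi/4; I is an N x N 0/1 image, indexed I[x][y].
    N = len(I)
    r = lambda x, y: math.floor(x * math.cos(theta) + y * math.sin(theta))

    # Step 1: create tokens [r value, count] on the left and bottom boundaries.
    tokens = {}                     # (x, y) -> [r, n]; one token per processor
    for y in range(N):
        if y == 0 or r(0, y) != r(0, y - 1):
            tokens[(0, y)] = [r(0, y), I[0][y]]
    for x in range(1, N):
        if r(x, 0) != r(x - 1, 0):
            tokens[(x, 0)] = [r(x, 0), I[x][0]]

    H = {}
    while tokens:
        # Step 2: a token moves up when r = r(x, y+1) or r != r(x+1, y);
        # the receiving processor adds its pixel when its r value matches.
        # By the argument above, no two tokens ever occupy one processor.
        moved = {}
        for (x, y), (rv, n) in tokens.items():
            up = y + 1 < N and (rv == r(x, y + 1) or
                                (x + 1 < N and rv != r(x + 1, y)))
            if up and rv == r(x, y + 1):
                n += I[x][y + 1]
            moved[(x, y + 1) if up else (x, y)] = [rv, n]
        # Step 3: all tokens move right; a token leaving the mesh is done.
        tokens = {}
        for (x, y), (rv, n) in moved.items():
            if x + 1 >= N:
                H[rv] = n           # tokens carry distinct r values
                continue
            if rv == r(x + 1, y):
                n += I[x + 1][y]
            tokens[(x + 1, y)] = [rv, n]
    return H                        # H[r] = number of 1-pixels on that line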

Since tokens are always in different processors, the upward move of Step 2 can be done in one time unit, and the rightward move of Step 3 can be done in another time unit. Since tokens can make at most N - 1 right moves before reaching the right boundary, at most N - 1 iterations of Steps 2 and 3 are needed. Therefore, the single angle Hough transform can be computed with 2(N - 1) moves. This represents an improvement over the algorithm of Jenq and Sahni [17], which takes 3(N - 1) moves to compute the single angle Hough transform.

To compute the p angle transform, we modify the basic one angle algorithm so that the token originating in (x, 0) is created only when the tokens that originated in (0, y) reach column x; that is, the (x, 0) token is created after x rightward moves have been made in Step 3. With this change, all tokens are always in the same column and correctness is not affected. Further, when tokens reach the top or right boundary, they start moving down from the top boundary or left from the right boundary. This avoids an accumulation of multiple tokens in the same boundary processor. Since the modified algorithm uses only one column of the N × N mesh at any time (excluding the backward movement of tokens from the top and right boundaries), we can pipeline the computation for all p/4 angles in the range π/2 ≤ θ < 3π/4. The total computation takes 2(N - 1) + 2(p/4 - 1) moves. The number of moves needed for all p angles is 8(N - 1) + 2p - 8. A final sort step is needed to create the array H in the desired format. This takes an additional (4 + ε)N moves [38].

6.4.3 GRM Mapping

We can simulate the single angle Hough transform algorithm of Section 6.4.2, as well as the modified version of this algorithm, on an N² processor OTIS-Mesh in which the image has been mapped using the GRM mapping. Each execution of Step 2 can be done with 2 OTIS moves and 3 electronic moves, and each execution of Step 3 takes 3 electronic moves. The total number of moves for the single angle transform becomes 6(N - 1) electronic and 2(N - 1) OTIS moves. To compute the transform for all p/4 angles in the range π/2 ≤ θ < 3π/4 takes 6(N - 1) + 6(p/4 - 1) electronic and 2(N - 1) + 2(p/4 - 1) OTIS moves, and to compute the transform for all p angles takes 24(N - 1) + 24(p/4 - 1) electronic and 8(N - 1) + 8(p/4 - 1) OTIS moves, exclusive of the final sort step.

6.4.4 GSM Mapping

When the GSM mapping is used, we first compute the Hough transform for each √N × √N mesh of the OTIS-Mesh (i.e., for each group of the OTIS-Mesh) and then combine the results from all N groups. The number of moves, exclusive of the combining step, is 8(√N - 1) + 8(p/4 - 1).

6.5 Summary

We have improved upon the N × N mesh Hough transform algorithms of [17] and [5]. We are able to compute the p angle Hough transform using 8(N - 1) + 8(p/4 - 1) moves, whereas the algorithm of [17] takes 12(N - 1) + 8(p/4 - 1) moves and that of [5] takes 48N + 20p + 4 moves (exclusive of the final sort step). The p angle Hough transform algorithm takes 24(N - 1) + 24(p/4 - 1) electronic and 8(N - 1) + 8(p/4 - 1) OTIS moves when the GRM mapping is used, and 8(√N - 1) + 8(p/4 - 1) electronic moves when the GSM mapping is used (both exclusive of the sort/combine step).

The histogramming algorithm we developed takes 4(√N - 1) + B - 1 electronic and 2 OTIS moves when 0 < B ≤ √N, and 22√N + O(N^{3/8}) electronic and O(N^{3/8}) OTIS moves when B ≥ N. When √N < B < N, it takes 6√N + o(√N) electronic moves and 2 OTIS moves for O(1) memory per processor, and 4√N + 2√B - 6 electronic and 2 OTIS moves for O(√B) memory per processor.

For histogram modification, our algorithm for the O(1) memory per processor case takes 25√N + o(√N) electronic and 2 OTIS moves, and that for the O(√N) memory per processor case takes 9(√N - 1) electronic and 2 OTIS moves. Our algorithm for image shrinking and expanding takes 10(√N - 1) + 8q_f + 8q_m + 4 electronic and 2 OTIS moves when the GRM mapping is used, and 8√N + 4q_f + 4q_m - 4 electronic and 8 OTIS moves when the GSM mapping is used.


CHAPTER 7
OTIS-HYPERCUBE

The OTIS-Hypercube is another class of the OTIS computer, in which the electronic interconnect follows the hypercube paradigm. In an N² processor OTIS-Hypercube, each group is a hypercube of dimension log₂ N. Figure 7.1 shows a 16 processor OTIS-Hypercube; the number inside a processor is the processor index within its group. In this chapter, we explore the properties of the OTIS-Hypercube. We also develop algorithms for the frequently used permutations and BPC permutations listed in Chapter 3.

Figure 7.1. 16 processor OTIS-Hypercube.

7.1 OTIS-Hypercube Diameter

Let N = 2^d and let D(i, j) be the length of the shortest path from processor i to processor j in a hypercube. Let (g₁, p₁) and (g₂, p₂) be two OTIS-Hypercube processors. Similar to the discussion of the diameter of the OTIS-Mesh, the shortest path between these two processors fits into one of the following categories (below, a single arrow denotes an OTIS move and a starred arrow denotes a sequence of electronic moves):

(a) The path employs electronic moves only. This is possible only when g₁ = g₂.

(b) The path employs an even number of OTIS moves. If the number of OTIS moves is more than two, we may compress the path into a shorter path that uses 2 OTIS moves only: (g₁, p₁) → (p₁, g₁) →* (p₁, g₂) → (g₂, p₁) →* (g₂, p₂).

(c) The path employs exactly one OTIS move. The compressed path looks like: (g₁, p₁) →* (g₁, g₂) → (g₂, g₁) →* (g₂, p₂).

Shortest paths of type (a) have length exactly D(p₁, p₂) (which equals the number of ones in the binary representation of p₁ ⊕ p₂). Paths of type (b) and type (c) have length D(p₁, p₂) + D(g₁, g₂) + 2 and D(p₁, g₂) + D(p₂, g₁) + 1, respectively. The following theorem follows from the preceding discussion.

Theorem 7.1.1 The length of the shortest path between processors (g₁, p₁) and (g₂, p₂) is D(p₁, p₂) when g₁ = g₂, and min{D(p₁, p₂) + D(g₁, g₂) + 2, D(p₁, g₂) + D(p₂, g₁) + 1} when g₁ ≠ g₂.

Theorem 7.1.2 The diameter of the OTIS-Hypercube is 2d + 1.

Proof: Since each group is a d-dimensional hypercube, D(p₁, p₂), D(g₁, g₂), D(p₁, g₂), and D(p₂, g₁) are all less than or equal to d. From Theorem 7.1.1, we conclude that no two processors are more than 2d + 1 apart. Now consider the processors (g₁, p₁) and (g₂, p₂) such that p₁ = 0 and p₂ = N - 1. Let g₁ = 0 and g₂ = N - 1. Then D(p₁, p₂) = D(g₁, g₂) = D(p₁, g₂) = D(p₂, g₁) = d. Hence, the distance between (g₁, p₁) and (g₂, p₂) is 2d + 1. As a result, the diameter of the OTIS-Hypercube is exactly 2d + 1.

7.2 Simulation of an N² Hypercube

Zane et al. [58] have shown that each move of an N² processor hypercube can be simulated by either a single electronic move or by one electronic and two OTIS moves in an N² processor OTIS-Hypercube. For the simulation, processor q of the hypercube is mapped to processor (g, p) of the OTIS-Hypercube. Here gp = q (i.e., g is obtained from the most significant log₂ N bits of q and p comes from the least significant log₂ N bits). Let g_{d-1}...g_0 and p_{d-1}...p_0, d = log₂ N, be the binary representations of g and p, respectively. The binary representation of q is q_{2d-1}...q_0 = g_{d-1}...g_0 p_{d-1}...p_0. A hypercube move moves data from processor q to processor q^(k), where q^(k) is obtained from q by complementing bit k in the binary representation of q. When k is in the range [0, d), the move is done in the OTIS-Hypercube by a local intragroup hypercube move. When k ≥ d, the move is done using the steps

(g_{d-1}...g_j...g_0 p_{d-1}...p_0) → (p_{d-1}...p_0 g_{d-1}...g_j...g_0) → (p_{d-1}...p_0 g_{d-1}...g_j′...g_0) → (g_{d-1}...g_j′...g_0 p_{d-1}...p_0),

where j = k - d; the first and third moves are OTIS moves, and the second is an electronic move along local dimension j that complements g_j.
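The index arithmetic behind Theorem 7.1.1 and the simulation above is compact enough to check exhaustively for small d. A minimal sketch in Python (the function names are ours, not from the cited papers):

def D(a, b):
    # Hypercube distance: number of ones in the binary representation of a XOR b.
    return bin(a ^ b).count("1")

def otis_dist(g1, p1, g2, p2):
    # Shortest path length of Theorem 7.1.1.
    if g1 == g2:
        return D(p1, p2)
    return min(D(p1, p2) + D(g1, g2) + 2,   # two OTIS moves, type (b)
               D(p1, g2) + D(p2, g1) + 1)   # one OTIS move, type (c)

def hypercube_move(g, p, k, d):
    # Where the data of hypercube processor q = (g << d) | p lands after the
    # simulated move along dimension k (simulation of Zane et al. [58]).
    if k < d:
        return g, p ^ (1 << k)              # one local electronic move
    g, p = p, g                             # OTIS move
    p ^= 1 << (k - d)                       # electronic move in dimension k - d
    g, p = p, g                             # OTIS move
    return g, p

# Theorem 7.1.2 can be verified exhaustively for small d:
d = 3; N = 1 << d
assert max(otis_dist(g1, p1, g2, p2) for g1 in range(N) for p1 in range(N)
           for g2 in range(N) for p2 in range(N)) == 2 * d + 1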

Table 7.1. Optimal moves for an N² = 2^{2d} processor hypercube and the respective OTIS-Hypercube simulations

Permutation          Optimal hypercube moves            Simulation
                     total   group dim.   local dim.    OTIS   electronic
Transpose             2d        d            d           2d       2d
Perfect Shuffle       2d        d            d           2d       2d
Unshuffle             2d        d            d           2d       2d
Bit Reversal          2d        d            d           2d       2d
Vector Reversal       2d        d            d           2d       2d
Bit Shuffle           2d-2      d-1          d-1         2d-2     2d-2
Shuffled Row-major    2d-2      d-1          d-1         2d-2     2d-2
G_l P_h Swap          d         d/2          d/2         d        d

7.3 Common Data Rearrangements

In this section, we concentrate on the realization of permutations such as transpose, perfect shuffle, unshuffle, and vector reversal, which are frequently used in applications. Nassimi and Sahni [36] have developed optimal hypercube algorithms for these frequently used permutations. These algorithms may be simulated by an OTIS-Hypercube using the method of Zane et al. [58] to obtain algorithms that realize these data rearrangement patterns on an OTIS-Hypercube. Table 7.1 gives the number of moves used by the optimal hypercube algorithms; a breakdown of the number of moves in the group and local dimensions; and the number of electronic and OTIS moves required by the simulation. We shall obtain OTIS-Hypercube algorithms, for the permutations of Table 7.1, that require far fewer moves than the simulations of the optimal hypercube algorithms.

As mentioned before, each processor is indexed as (G, P), where G is the group index and P the local index. An index pair (G, P) may be transformed into a singleton index I = GP by concatenating the binary representations of G and P. The permutations of Table 7.1 are members of the BPC (bit-permute-complement) class of permutations defined in Nassimi and Sahni [36]. The definition of a BPC permutation can be found in Section 3.8.2.

7.3.1 Transpose [p/2-1, ..., 0, p-1, ..., p/2]

The transpose operation may be accomplished via a single OTIS move and no electronic moves. The simulation of the optimal hypercube algorithm, however, takes 2d OTIS and 2d electronic moves.

7.3.2 Perfect Shuffle [0, p-1, p-2, ..., 1]

We can adapt the strategy of Nassimi and Sahni [36] to the OTIS-Hypercube. Each processor uses two variables A and B. Initially, all data are in the A variables and the B variables have no data. The algorithm for perfect shuffle is given below:

Step 1: Swap A and B in processors with last two bits equal to 01 or 10.

Step 2: for (i = 1; i < d; i++) ...

Step 3: Perform an OTIS move on the A and B variables.

Step 4: for (i = 1; i < d; i++) ...
Step 5: Perform an OTIS move on the A and B variables.

Step 6: Swap the B variables of processors that differ on bit 0 only.

Step 7: Swap the A and B variables of processors with last two bits equal to 01 or 10.

Actually, in Step 1 it is sufficient to copy from A to B, and in Step 7 to copy from B to A. Table 7.2 shows the working of this algorithm on a 16 processor OTIS-Hypercube.

Table 7.2. Illustration of the perfect shuffle algorithm on a 16 processor OTIS-Hypercube (a step-by-step trace of the A and B variables of all 16 processors).

The correctness of the algorithm is easily established, and we see that the number of data move steps is 2d + 2 (2d electronic moves and 2 OTIS moves; each OTIS move moves two pieces of data from one processor to another, while each electronic swap moves a single datum between two processors). The communication complexity of 2d + 2 is very close to optimal. For example, data from the processor with index I = 0101...0101 is to move to the processor with index I′ = 1010...1010, and the distance between these two processors is 2d + 1. Notice that the simulation of the optimal hypercube algorithm for perfect shuffle takes 4d moves.

7.3.3 Unshuffle [p-2, p-3, ..., 0, p-1]

This is the inverse of a perfect shuffle and may be performed by running the perfect shuffle algorithm backwards (i.e., beginning with Step 7); the for loops of Steps 2 and 4 are also run backwards. Thus the number of moves is the same as for a perfect shuffle.

7.3.4 Bit Reversal [0, 1, ..., p-1]

The simulation of the optimal hypercube algorithm takes 2d electronic moves and 2d OTIS moves. But with the following algorithm:

Step 1: Do a local bit reversal in each group.
Step 2: Perform an OTIS move of all data.
Step 3: Do a local bit reversal in each group.

we can actually achieve the rearrangement in 2d electronic moves and 1 OTIS move, since Steps 1 and 3 can be performed optimally in d electronic moves each [36]. The number of moves is optimal, since the data from processor 0101...0101 is to move to processor 1010...1010, and the distance between these two processors is 2d + 1 (Theorem 7.1.2).

7.3.5 Vector Reversal [-(p-1), -(p-2), ..., -0]

A vector reversal can be done using 2d electronic and 2 OTIS moves. The steps are as follows:

Step 1: Perform a local vector reversal in each group.
Step 2: Do an OTIS move of all data.
Step 3: Perform a local vector reversal in each group.
Step 4: Do an OTIS move of all data.

The correctness of the algorithm is obvious. The number of moves is computed using the fact that Steps 1 and 3 can be done in d electronic moves each [36]. Since a vector reversal requires us to move data from processor 00...00 to processor 11...11, and since the distance between these two processors is 2d + 1 (Theorem 7.1.2), our vector reversal algorithm can be improved by at most one move.
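Both realizations act on indices in a way that is easy to verify exhaustively for small d. A sketch in Python (the helper names are ours), with g the group index and p the local index:

def rev(x, bits):
    # Reverse the 'bits' low-order bits of x.
    out = 0
    for _ in range(bits):
        out = (out << 1) | (x & 1)
        x >>= 1
    return out

def bit_reversal(g, p, d):
    p = rev(p, d)        # Step 1: local bit reversal (d electronic moves [36])
    g, p = p, g          # Step 2: OTIS move
    p = rev(p, d)        # Step 3: local bit reversal
    return g, p

def vector_reversal(g, p, d):
    mask = (1 << d) - 1
    p = ~p & mask        # Step 1: local vector reversal (complement)
    g, p = p, g          # Step 2: OTIS move
    p = ~p & mask        # Step 3: local vector reversal
    g, p = p, g          # Step 4: OTIS move
    return g, p

d = 3; mask = (1 << 2 * d) - 1
for q in range(1 << (2 * d)):
    g, p = q >> d, q & ((1 << d) - 1)
    gb, pb = bit_reversal(g, p, d)
    assert (gb << d) | pb == rev(q, 2 * d)    # bit reversal [0, 1, ..., p-1]
    gv, pv = vector_reversal(g, p, d)
    assert (gv << d) | pv == ~q & mask        # vector reversal [-(p-1), ..., -0]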

7.3.6 Bit Shuffle [p-1, p-3, ..., 1, p-2, p-4, ..., 0]

Let G = G_h G_l, where G_h and G_l partition G in half; similarly, P = P_h P_l. Our algorithm employs a G_l P_h Swap permutation in which data from processor G_h G_l P_h P_l is routed to processor G_h P_h G_l P_l. So we need to first look at how this permutation is performed.

G_l P_h Swap [p-1, ..., 3p/4, p/2-1, ..., p/4, 3p/4-1, ..., p/2, p/4-1, ..., 0]

The swap is performed by a series of bit exchanges of the form B(i) = [B_{p-1}, ..., B_0], 0 ≤ i < p/4, where

B_j = p/2 + i if j = p/4 + i; B_j = p/4 + i if j = p/2 + i; and B_j = j otherwise.

Let G(i) and P(i) denote the ith bit of G and P, respectively. So G(0) is the least significant bit of G, and P(d-1) is the most significant bit of P. The bit exchange B(i) may be accomplished as below:

Step 1: Every processor (G, P) with G(i) ≠ P(d/2 + i) moves its data to the processor (G, P′), where P′ differs from P only in bit d/2 + i.
Step 2: Perform an OTIS move on the data moved in Step 1.
Step 3: Processors (G, P) that receive data in Step 2 move the received data to (G, P′), where P′ differs from P only in bit i.
Step 4: Perform an OTIS move on the data moved in Step 3.

The cost is 2 electronic moves and 2 OTIS moves. To perform a G_l P_h Swap permutation, we simply do B(i) for 0 ≤ i < d/2. This takes d electronic moves and d OTIS moves. By doing pairs of bit exchanges (B(0), B(1)), (B(2), B(3)), etc. together, we can reduce the number of OTIS moves to d/2.
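On indices, the four step routing above simply exchanges bit i of G with bit d/2 + i of P. A sketch in Python (the helper names are ours) that also checks the G_l P_h Swap built from d/2 such exchanges:

def bit_exchange(g, p, i, d):
    # Destination of the data of (g, p) under B(i), following Steps 1-4 above.
    if ((g >> i) & 1) == ((p >> (d // 2 + i)) & 1):
        return g, p                 # equal bits: the data does not move
    p ^= 1 << (d // 2 + i)          # Step 1: electronic move
    g, p = p, g                     # Step 2: OTIS move
    p ^= 1 << i                     # Step 3: electronic move
    g, p = p, g                     # Step 4: OTIS move
    return g, p

def glph_swap(g, p, d):
    # G_l P_h Swap = B(0), B(1), ..., B(d/2 - 1).
    for i in range(d // 2):
        g, p = bit_exchange(g, p, i, d)
    return g, p

# For d = 4, G = G_h G_l and P = P_h P_l in 2-bit halves, and the swap
# routes G_h G_l P_h P_l to G_h P_h G_l P_l:
d = 4
for q in range(1 << (2 * d)):
    g, p = q >> d, q & ((1 << d) - 1)
    gh, gl = g >> 2, g & 3
    ph, pl = p >> 2, p & 3
    assert glph_swap(g, p, d) == ((gh << 2) | ph, (gl << 2) | pl)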

Bit Shuffle

A bit shuffle can now be performed following these steps:

Step 1: Perform a G_l P_h Swap.
Step 2: Do a local bit shuffle in each group.
Step 3: Do an OTIS move.
Step 4: Do a local bit shuffle in each group.
Step 5: Do an OTIS move.

Steps 2 and 4 are done using the optimal d move hypercube bit shuffle algorithm of Nassimi and Sahni [36]. The total number of data moves is 3d electronic moves and d/2 + 2 OTIS moves.

7.3.7 Shuffled Row-major [p-1, p/2-1, p-2, p/2-2, ..., p/2, 0]

This is the inverse of a bit shuffle and may be done in the same number of moves by running the bit shuffle algorithm backwards. Of course, Steps 2 and 4 are to be changed to shuffled row-major operations.

7.4 BPC Permutations

Every BPC permutation A can be realized by a sequence of bit exchange permutations of the form B(i, j) = [B_{p-1}, ..., B_0], d ≤ i < 2d, 0 ≤ j < d, and a BPC permutation C = [C_{2d-1}, ..., C_0] = Π_G Π_P, where |C_q| < d for 0 ≤ q < d and |C_q| ≥ d for d ≤ q < 2d. For example, transpose may be decomposed into the bit exchanges B(d + j, j), 0 ≤ j < d; vector reversal can be realized by performing no bit exchanges and using C = [-(2d-1), -(2d-2), ..., -0] (Π_G = [-(2d-1), -(2d-2), ..., -d], Π_P = [-(d-1), ..., -0]); and perfect shuffle may be decomposed into B(d, 0) and C = [2d-2, 2d-3, ..., d, 2d-1, d-2, ..., 1, 0, d-1] (Π_G = [2d-2, 2d-3, ..., d, 2d-1], Π_P = [d-2, ..., 1, 0, d-1]).

A bit exchange permutation B(i, j) can be performed in 2 electronic moves and 2 OTIS moves using a process similar to that used for the bit exchange permutation B(i). Notice that B(i) = B(p/2 + i, p/4 + i).

Our algorithm for general BPC permutations is:

Step 1: Decompose the BPC permutation A into the bit exchange permutations B_1(i_1, j_1), B_2(i_2, j_2), ..., B_k(i_k, j_k) and the BPC permutation C = Π_G Π_P as above. Do this such that i_1 > i_2 > ... > i_k and j_1 > j_2 > ... > j_k.

Step 2: If k = 0, do the following:
Step 2.1: Do the BPC permutation Π_P in each group using the optimal algorithm of Nassimi and Sahni [36].
Step 2.2: Do an OTIS move.
Step 2.3: Do the BPC permutation Π′_G in each group using the algorithm of Nassimi and Sahni [36]. (Π′_G denotes Π_G reindexed to act on the local bits.)
Step 2.4: Do an OTIS move.

Step 3: If k = d, do the following:
Step 3.1: Do the BPC permutation Π′_G in each group.
Step 3.2: Do an OTIS move.

Step 3.3: Do the BPC permutation Π_P in each group.

Step 4: If k < d/2, do the following:
Step 4.1: Perform the bit exchange permutations B_1, ..., B_k.
Step 4.2: Do Steps 2.1 through 2.4.

Step 5: If k ≥ d/2, do the following:
Step 5.1: Perform a sequence of d - k bit exchanges involving bits other than those in B_1, ..., B_k, ordered in the same fashion described in Step 1. Recompute Π_G and Π_P. Swap Π_G and Π_P.
Step 5.2: Do Steps 3.1 through 3.3.

The local BPC permutations determined by Π_G and Π_P take at most d electronic moves each [36], and the bit exchange permutations take at most d electronic moves and d/2 OTIS moves. So the total number of moves is at most 3d electronic moves and d/2 + 2 OTIS moves.
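For reference, the destination computed by a BPC permutation vector can be written down directly, and the decompositions above can then be checked against it. A sketch in Python (the encoding of complemented bits via ~ is ours, since the text's -0 is not a distinct Python integer; this computes destinations only, not the routing schedule):

def bpc_dest(q, A):
    # A = [A_{k-1}, ..., A_0], listed most significant entry first as in the
    # text; entry A_j gives the destination bit of source bit j, and a
    # complemented destination bit b is written here as ~b.
    k = len(A)
    dest = 0
    for j in range(k):
        a = A[k - 1 - j]            # entry for source bit j
        bit = (q >> j) & 1
        if a < 0:                   # negative entry: complement the bit
            a, bit = ~a, bit ^ 1
        dest |= bit << a
    return dest

k = 6                               # k = 2d index bits
mask = (1 << k) - 1
shuffle = [0] + list(range(k - 1, 0, -1))      # [0, p-1, p-2, ..., 1]
vec_rev = [~b for b in range(k - 1, -1, -1)]   # [-(p-1), ..., -0]
for q in range(1 << k):
    assert bpc_dest(q, shuffle) == ((q << 1) | (q >> (k - 1))) & mask
    assert bpc_dest(q, vec_rev) == ~q & mask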

7.5 Comparison

In this chapter we have shown that the diameter of the OTIS-Hypercube is 2d + 1, which is very close to that of an N² processor hypercube. However, each OTIS-Hypercube processor is connected to at most d + 1 other processors, while in an N² processor hypercube, a processor is connected to up to 2d other processors. We have also developed algorithms for frequently used data permutations. Table 7.3 compares the performance of our algorithms with that of the algorithms obtained by simulating the optimal hypercube algorithms using the simulation technique of Zane et al. [58]. For most of the permutations considered, our algorithms are either optimal or within one move of being optimal.

Table 7.3. Complexity comparison of common data rearrangements

Permutation          Simulation              Ours
                     electronic   OTIS       electronic   OTIS
Transpose              2d          2d          0            1
Perfect Shuffle        2d          2d          2d           2
Unshuffle              2d          2d          2d           2
Bit Reversal           2d          2d          2d           1
Vector Reversal        2d          2d          2d           2
Bit Shuffle            2d-2        2d-2        3d           d/2 + 2
Shuffled Row-major     2d-2        2d-2        3d           d/2 + 2


CHAPTER 8
CONCLUSION

The OTIS computer is a relatively new architecture that contains both optical and electronic connections. Optical interconnect is used to connect pairs of processors that are in different groups. Electronic interconnect is used inside each group and is flexible. The reason for the combination is that optical interconnect is superior to electronic interconnect for long interconnects, but not for short ones (a few millimeters). Different classes of OTIS computers are obtained by using different topologies to realize the electronic interconnect. This hybrid computer combines the best of both worlds, can be easily realized, and has tremendous computing power. In the following sections, we briefly summarize the results we have obtained in this dissertation and discuss some of the problems that remain to be explored.

8.1 Outline of the Results

We have reviewed the definition of the optical transpose interconnection system (OTIS) and its construction via free-space optics (a pair of arrays of lenslets). Different classes of OTIS computers can be obtained by using different electronic interconnect topologies. Moreover, the OTIS computer can be utilized as a multistage interconnection network (MIN) using only 2 OTIS connections.

The diameter of an N² processor OTIS-Mesh computer, in which the mesh topology is used for the electronic interconnect, is 4√N - 3. It is shown that the OTIS-Mesh can be used to simulate a 2D mesh using either the group row mapping (GRM) or the group submesh mapping (GSM). OTIS-Mesh algorithms for the commonly used permutations (transpose, perfect shuffle, unshuffle, bit reversal, vector reversal, bit shuffle, and shuffled row-major) are developed. Among them, transpose and bit reversal are shown to be optimal. All of them are better than the 4D mesh simulation. The BPC permutations, of which all those commonly used permutations are members, realize a large variety of permutations. The OTIS-Mesh algorithm for general BPC permutations is thus developed.

A complete set of basic data manipulation operations (broadcast, window broadcast, prefix sum, data sum, rank, shift, data accumulation, consecutive sum, adjacent sum, concentrate, distribute, generalize, sorting, random access read (RAR), and random access write (RAW)) is shown with the corresponding OTIS-Mesh algorithms. Among them, we show that the algorithms for broadcast, data sum, concentrate, distribute, and generalize are optimal. The rest, besides sorting, RAR, and RAW, are close to optimal. These algorithms can be used for a great number of applications in image processing, computational geometry, matrix algebra, graph theory, and so forth.

We demonstrate how the various matrix multiplication operations (vector × vector, vector × matrix, matrix × vector, and matrix × matrix) are accomplished. Since the mapping of the matrix onto the OTIS-Mesh has a profound consequence on the outcome, both the GRM and GSM mapping schemes are included. We also explore some problems related to image processing: histogramming and histogram modification, the Hough transform, and image shrinking and expanding.

Another class of OTIS computer, the OTIS-Hypercube, is discussed, in which the electronic interconnect follows the hypercube paradigm. The diameter of an N² processor OTIS-Hypercube is 2d + 1, where d = log₂ N. OTIS-Hypercube algorithms for commonly used permutations are developed, and they are either optimal or close to optimal. Also, the OTIS-Hypercube algorithm for general BPC permutations is presented.

The properties of OTIS computers derive mainly from the combination of the characteristics of both optical and electronic interconnects. Although the technology is still evolving, the results obtained so far are promising. Further investigation is encouraged, as many open problems remain to be solved and many questions remain to be answered. Numerous parts of the dissertation have been published and can be found in [46, 47, 55, 54, 53, 56, 48, 45].

8.2 Open Problems

Several questions related to this type of architecture remain open, ranging from the basics of the technologies to algorithms:

• So far we have kept separate counts of electronic and OTIS moves, since at this time it is difficult to see how the technology will progress. On the one hand, the more transistors that can be put onto a chip, the better the electronic interconnect becomes. On the other hand, the optical technology is still evolving, which could further increase the advantages optical interconnect has over electronic interconnect. The combination of the two could change the partitioning into groups and give preference to one kind of move over the other.

• There are quite a few operations for which sub-optimal algorithms are obtained. The difference between these and the optimal move counts is usually only a constant number of steps. There might be some way to make them optimal.

• The deterministic sorting we presented does not improve greatly over the simulated version. There could be a better way to further increase the difference between them. Also, we concern ourselves only with general sorting; maybe there is a simpler and/or faster algorithm for integer sorting.

• We know that arbitrary permutations can be realized through sorting. But the OTIS computer is shown to be capable of performing this task as a MIN. If the whole permutation vector is known beforehand by each group, there could be a way to perform the permutation faster than sorting. Further, if each processor knows only its destination, there could be a better algorithm to accomplish the task than sorting.

• There might be other 2D mesh mappings that reduce the slowdown factor relative to either GRM or GSM. Moreover, such a mapping might further simplify and speed up the matrix multiplication operations.

• There are still other interesting applications that are worth exploring, like convex hull, FFT, etc.


REFERENCES

[1] T. Bestul and L. S. Davis. On computing complete histograms of images in log(n) steps using hypercubes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(2):212-213, 1989.

[2] L. E. Cannon. A Cellular Computer to Implement the Kalman Filter Algorithm. PhD thesis, Montana State University, 1969.

[3] H. Y. H. Chuang and C. C. Li. A systolic processor for straight line detection by modified Hough transform. In IEEE Workshop on Computer Architecture, Pattern Analysis and Data Base Management, pages 300-303. IEEE Computer Society Press, Los Alamitos, California, 1985.

[4] A. N. Choudhary and R. Ponnusamy. Implementation and evaluation of Hough transform algorithms on a shared-memory multiprocessor. Journal of Parallel and Distributed Computing, 12(2):178-188, 1991.

[5] R. E. Cypher, J. L. C. Sanz, and L. Snyder. The Hough transform has O(n) complexity on SIMD n x n mesh array architecture. In IEEE 1987 Workshop on Computer Architecture for Pattern Analysis and Machine Intelligence, pages 115-121, Seattle, Washington, 1987.

[6] E. Dekel, D. Nassimi, and S. Sahni. Parallel matrix and graph algorithms. SIAM Journal on Computing, 10(4):657-675, Nov. 1981.

[7] M. Feldman, S. Esener, C. Guest, and S. Lee. Comparison between electrical and free-space optical interconnects based on power and speed considerations. Applied Optics, 27(9):1742-1751, May 1988.

[8] A. Fisher and P. Highnam. Computing the Hough transform on a scan line array processor. In IEEE 1987 Workshop on Computer Architecture for Pattern Analysis and Machine Intelligence, pages 83-87, Seattle, Washington, 1987.

[9] T. Gonzalez and S. Sahni. Open shop scheduling to minimize finish time. Journal of the Association for Computing Machinery, 23(4):665-679, Oct. 1976.

[10] J. Grinberg, G. R. Nudd, and R. D. Etchells. A cellular VLSI architecture. IEEE Computer, 17(1):69-81, Jan. 1984.

[11] C. Guerra and S. Hambrusch. Parallel algorithms for line detection on a mesh. In IEEE 1987 Workshop on Computer Architecture for Pattern Analysis and Machine Intelligence, pages 99-106, Seattle, Washington, 1987.

[12] W. Hendrick, O. Kibar, P. Marchand, C. Fan, D. V. Blerkom, F. McCormick, I. Cokgor, M. Hansen, and S. Esener. Modeling and optimization of the optical transpose interconnection system. In Optoelectronic Technology Center, Program Review, Cornell University, Ithaca, New York, Sept. 1995.

[13] H. A. Ibrahim, J. R. Kender, and D. E. Shaw. On the application of massively parallel SIMD tree machines to certain intermediate-level vision tasks. Computer Vision, Graphics, and Image Processing, 36:53-75, 1986.

[14] J. Illingworth and J. Kittler. A survey of the Hough transform. Computer Vision, Graphics, and Image Processing, 44, 1988.

[15] J.-W. Jang, H. Park, and V. K. Prasanna. A fast algorithm for computing histogram on reconfigurable mesh. In Proceedings of the Symposium on the Frontiers of Massively Parallel Computation, McLean, Virginia, 1992.

[16] J.-F. Jenq and S. Sahni. Reconfigurable mesh algorithms for image shrinking, expanding, clustering, and template matching. In Proceedings of the 5th International Parallel Processing Symposium, pages 208-215. IEEE Computer Society Press, Los Alamitos, California, 1991.

[17] J.-F. Jenq and S. Sahni. Reconfigurable mesh algorithms for the Hough transform. In Proceedings of the 1991 International Conference on Parallel Processing, pages 34-41. Academic Press, New York, 1991.

[18] J.-F. Jenq and S. Sahni. Histogramming on a reconfigurable mesh computer. In Proceedings of the 6th International Parallel Processing Symposium, pages 425-432, Beverly Hills, California, 1992.

[19] J.-F. Jenq and S. Sahni. Image shrinking and expanding on a pyramid. IEEE Transactions on Parallel and Distributed Systems, 4(11):1291-1296, Nov. 1993.

[20] C. S. Kannan and H. Y. H. Chuang. Fast Hough transform on a mesh connected processor array. Information Processing Letters, 33:243-248, Jan. 1990.

[21] F. Kiamilev, P. Marchand, A. Krishnamoorthy, S. Esener, and S. Lee. Performance comparison between optoelectronic and VLSI multistage interconnection networks. Journal of Lightwave Technology, 9(12):1674-1692, Dec. 1991.

[22] D. M. Koppelman and A. Y. Oruç. The complexity of routing in Clos permutation networks. IEEE Transactions on Information Theory, 40(1):278-284, Jan. 1994.

[23] A. Krishnamoorthy, P. Marchand, F. Kiamilev, and S. Esener. Grain-size considerations for optoelectronic multistage interconnection networks. Applied Optics, 31(26):5480-5507, Sept. 1992.

[24] D. Krizanc. Integer sorting on a mesh-connected array of processors. Manuscript, 1989.

[25] M. Kunde. Routing and sorting on mesh-connected arrays. In Proceedings of the 3rd Aegean Workshop on Computing: VLSI Algorithms and Architectures, Lecture Notes in Computer Science, volume 319, pages 423-433. Springer-Verlag, New York, 1988.

[26] T. Leighton. Tight bounds on the complexity of parallel sorting. IEEE Transactions on Computers, C-34(4):344-354, Apr. 1985.

[27] H. Li, M. A. Lavin, and L. R. Le Master. Fast Hough transform: A hierarchical approach. Computer Vision, Graphics, and Image Processing, 36:139-161, Dec. 1986.

[28] H. Li and M. Maresca. Polymorphic-torus architecture for computer vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(3), Mar. 1989.

[29] H. F. Li, D. Pao, and R. Jayakumar. Improvements and systolic implementation of the Hough transform for straight line detection. Pattern Recognition, 22(6):697-706, 1989.

[30] J. S. Lim. Image enhancement. In M. P. Ekstrom, editor, Digital Image Processing Techniques, pages 1-51. Academic Press, San Diego, California, 1984.

[31] M. Maresca, M. A. Lavin, and H. Li. Parallel Hough transform algorithms on polymorphic torus. In S. Levialdi, editor, High Level Vision in Multicomputers. Academic Press, New York, 1988.

[32] M. Maresca, H. Li, and Sheng. Parallel computer vision on polymorphic torus architecture. International Journal of Computer Vision and Applications, 2(4), 1989.

[33] G. C. Marsden, P. J. Marchand, P. Harvey, and S. C. Esener. Optical transpose interconnection system architectures. Optics Letters, 18(13):1083-1085, July 1993.

[34] D. Nassimi and S. Sahni. An optimal routing algorithm for mesh-connected parallel computers. Journal of the Association for Computing Machinery, 27(1):6-29, Jan. 1980.

[35] D. Nassimi and S. Sahni. Data broadcasting in SIMD computers. IEEE Transactions on Computers, C-30(2):101-107, Feb. 1981.

[36] D. Nassimi and S. Sahni. Optimal BPC permutations on a cube connected computer. IEEE Transactions on Computers, C-31(4):338-341, Apr. 1982.

[37] D. Nassimi and S. Sahni. Parallel algorithms to set up the Benes permutation network. IEEE Transactions on Computers, C-31(2):148-154, Feb. 1982.

[38] M. Nigam and S. Sahni. Sorting n² numbers on n x n meshes. In Proceedings of the Seventh International Parallel Processing Symposium (IPPS'93), pages 73-78, Newport Beach, California, 1993.

[39] S. Pavel and S. G. Akl. Efficient algorithms for the Hough transform on arrays with reconfigurable optical buses. In Proceedings of the 10th International Parallel Processing Symposium (IPPS'96), pages 697-701, Honolulu, Hawaii, 1996.

[40] T. Pavlidis. Algorithms for Graphics and Image Processing. Computer Science Press, Rockville, Maryland, 1982.

[41] S. Rajasekaran and S. Sahni. Randomized routing, selection, and sorting on the OTIS-Mesh optoelectronic computer. IEEE Transactions on Parallel and Distributed Systems, 1998. To appear.

[42] S. Ranka and S. Sahni. Hypercube Algorithms with Applications to Image Processing and Pattern Recognition. Springer-Verlag, New York, 1990.

[43] A. Rosenfeld. A note on shrinking and expanding operations in pyramids. Pattern Recognition Letters, 6(4):241-244, 1987.

[44] A. Rosenfeld, J. Ornelas, Jr., and Y. Hung. Hough transform algorithms for mesh-connected SIMD parallel processors. Computer Vision, Graphics, and Image Processing, 41:293-305, 1988.

[45] S. Sahni and C.-F. Wang. BPC permutations on the OTIS-Hypercube optoelectronic computer. Technical Report 97-028, CISE Department, University of Florida, Gainesville, Florida, 1997. Available by anonymous ftp login from ftp.cise.ufl.edu under directory tech-report/tr97/tr97-028.ps.gz.

[46] S. Sahni and C.-F. Wang. BPC permutations on the OTIS-Mesh optoelectronic computer. In Proceedings of the Fourth International Conference on Massively Parallel Processing Using Optical Interconnections (MPPOI'97), pages 130-135, Montreal, Canada, 1997.

[47] S. Sahni and C.-F. Wang. BPC permutations on the OTIS-Mesh optoelectronic computer. Technical Report 97-008, CISE Department, University of Florida, Gainesville, Florida, 1997. Available by anonymous ftp login from ftp.cise.ufl.edu under directory tech-report/tr97/tr97-008.ps.gz.

[48] S. Sahni and C.-F. Wang. BPC permutations on the OTIS-Hypercube optoelectronic computer. Informatica, 1998. To appear.

[49] H. J. Siegel, L. J. Siegel, F. C. Kemmerer, P. T. Mueller, Jr., H. E. Smalley, and S. D. Smith. PASM: A partitionable SIMD/MIMD system for image processing and pattern recognition. IEEE Transactions on Computers, C-30(12):934-947, Dec. 1981.

[50] T. M. Silberberg. The Hough transform in the geometric arithmetic parallel processor. In IEEE Workshop on Computer Architecture and Image Database Management, pages 387-391. IEEE Computer Society Press, Los Alamitos, California, 1985.

[51] S. L. Tanimoto. Sorting, histogramming, and other statistical operations on a pyramid machine. In A. Rosenfeld, editor, Multiresolution Image Processing and Analysis, pages 136-145. Springer-Verlag, New York, 1984.

[52] M. J. Thazhuthaveetil and A. V. Shah. Parallel Hough transform algorithm performance. Image and Vision Computing, 9(2):88-92, 1991.

[53] C.-F. Wang and S. Sahni. Basic operations on the OTIS-Mesh optoelectronic computer. Technical Report 97-029, CISE Department, University of Florida, Gainesville, Florida, 1997. Available by anonymous ftp login from ftp.cise.ufl.edu under directory tech-report/tr97/tr97-029.ps.gz.

[54] C.-F. Wang and S. Sahni. Basic operations on the OTIS-Mesh optoelectronic computer. In Proceedings of the Fifth International Conference on Massively Parallel Processing Using Optical Interconnections (MPPOI'98), pages 150-157, Las Vegas, Nevada, 1998.

[55] C.-F. Wang and S. Sahni. Basic operations on the OTIS-Mesh optoelectronic computer. IEEE Transactions on Parallel and Distributed Systems, 1998. To appear.

[56] C.-F. Wang and S. Sahni. Matrix multiplication on the OTIS-Mesh optoelectronic computer. Technical report, CISE Department, University of Florida, Gainesville, Florida, 1998.

[57] M. Yasrebi, S. Deshpande, and J. C. Browne. A comparison of circuit switching and packet switching data transfer using two simple image processing algorithms. In Proceedings of the 1983 International Conference on Parallel Processing, pages 25-28. IEEE Computer Society Press, Los Alamitos, California, 1983.

[58] F. Zane, P. Marchand, R. Paturi, and S. Esener. Scalable network architectures using the optical transpose interconnection system (OTIS). In Proceedings of the Second International Conference on Massively Parallel Processing Using Optical Interconnections (MPPOI'96), pages 114-121, San Antonio, Texas, 1996.


BIOGRAPHICAL SKETCH

Chih-Fang Wang was born on January 2, 1967, in Taipei, Taiwan, Republic of China. He received his Bachelor of Architecture degree from Tunghai University, Taichung, Taiwan, Republic of China, in 1989. He received his Master of Science degree from the University of Miami, Coral Gables, Florida, in 1993. He will receive his Doctor of Philosophy degree from Computer and Information Science and Engineering at the University of Florida, Gainesville, Florida, in August 1998. His research interests include sequential, distributed, and parallel algorithms, reconfigurable networks, optical interconnection computers, and all-optical networks.


I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Sartaj Kumar Sahni, Chairman
Professor of Computer and Information Science and Engineering

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Sanguthevar Rajasekaran
Associate Professor of Computer and Information Science and Engineering

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Timothy A. Davis
Associate Professor of Computer and Information Science and Engineering
I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Haniph A. Latchman
Associate Professor of Electrical and Computer Engineering

This dissertation was submitted to the Graduate Faculty of the College of Engineering and to the Graduate School and was accepted as partial fulfillment of the requirements for the degree of Doctor of Philosophy.

August 1998

Winfred M. Phillips
Dean, College of Engineering

Karen A. Holbrook
Dean, Graduate School