PARALLEL COMPUTATIONS ON RECONFIGURABLE MESHES WITH BUSES

BY

MADHUSUDAN NIGAM

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA 1992

Copyright 1992 by Madhusudan Nigam. All rights reserved.

ACKNOWLEDGEMENTS

I express my deep gratitude to my advisor, Dr. Sartaj Sahni, for his continuous guidance and support during my course of studies. I am fortunate to have him as my mentor. It has been a privilege and a pleasure to work and to do research with him. I also thank my dissertation committee members, Dr. Gerhard X. Ritter, Dr. Nabil N. Kamel, Dr. Ravi Varadarajan and Dr. Haniph A. Latchman.

My work at the University of Florida was partially supported by the Computer and Information Sciences Department and by the National Science Foundation under grants MIP-8617374 and MIP-9103379. This support is greatly appreciated.

I thank all members of my family for their love, support and patience during my studies. In particular, I thank my mother, Ram Dulari, for her support during the years 1991-92. I thank my father, Lallan Ram, who inspired me to pursue this goal. I also thank my younger brothers Rakesh, Rajesh, Mukesh and my sisters Meera and Sunita.

I would like to thank my associates at the University of Florida, Vishal Walia, Venkatramanan Thanvantri, Madhu Acharya and Balaji Venkatraman, for their help with the typesetting software during the production of this dissertation.

Finally, my deep appreciation goes to my grandparents, the late Mr. and Mrs. J. P. Nigam, for being the source of strength and inspiration during difficult times. To them I dedicate this dissertation.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
ABSTRACT
CHAPTERS
1 PARALLEL MESH CONNECTED COMPUTER ARCHITECTURES
1.1 Introduction
1.2 Mesh Connected Computer
1.3 Mesh of Trees
1.4 Pyramid Computer
1.5 Hypercube
1.6 Modified MCC
1.7 Reconfigurable Meshes with Buses
1.8 Comparison of RMB with Other Models
1.9 Dissertation Outline
2 SORTING n NUMBERS ON n × n RECONFIGURABLE MESHES WITH BUSES
2.1 Introduction
2.2 Column Sort
2.3 Column Sort on an RMESH
2.3.1 Ranking
2.3.2 Analysis of RMESH Column Sort
2.4 Column Sort on a PARBUS
2.5 Rotate Sort
2.6 Rotate Sort on an RMESH
2.7 Rotate Sort on a PARBUS
2.8 A Combined Sort
2.9 Comparison with Other O(1) PARBUS Sorts
2.10 Sorting on Higher Dimension RMBs
2.11 Conclusions
3 SORTING n² NUMBERS ON n × n MESHES
3.1 Introduction
3.2 Modified Schnorr and Shamir Algorithm
3.3 Further Enhancements
3.4 Practical Implications
3.5 Simulation on RMBs
3.6 Conclusions
4 COMPUTATIONAL GEOMETRY ON N × N RECONFIGURABLE MESHES WITH BUSES
4.1 Introduction
4.2 Convex Hull
4.2.1 N³ Processor Algorithm
4.2.2 N² Processor Algorithm
4.3 Smallest Enclosing Rectangle
4.4 ECDF Search on an N × N RMESH
4.5 Triangulation
4.6 Conclusions
5 CONCLUSIONS
5.1 Summary
5.2 Future Work
REFERENCES
BIOGRAPHICAL SKETCH

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

PARALLEL COMPUTATIONS ON RECONFIGURABLE MESHES WITH BUSES

By Madhusudan Nigam
December 1992
Chairman: Dr. Sartaj Sahni
Major Department: Computer and Information Sciences

We consider different reconfigurable-meshes-with-buses architectures, such as the PARBUS, RMESH and MRN, for parallel computing. We compare reconfigurable meshes with buses with other parallel computing models such as the PRAM, fixed-topology interconnection networks and meshes with static buses.

We consider sorting n numbers on n × n reconfigurable meshes with buses. We develop simple algorithms of optimal area × time² complexity for all reconfigurable-meshes-with-buses architectures (PARBUS/RMESH/MRN). We extend our algorithms to higher dimension reconfigurable meshes with buses. We show that it is faster to sort than to select the kth element using Lin's algorithm on a PARBUS.

We show that the lower bound established by Schnorr and Shamir for the number of routes needed to sort n² numbers on an n × n bidirectional mesh is incorrect. We provide algorithms that are able to sort using a number of routes that is very close to the distance lower bound for bidirectional as well as strict unidirectional meshes. We point out the practical advantages of our algorithms when applied to mesh connected computers with tR >> 2tc. We show how to simulate the mesh algorithms on reconfigurable meshes with buses.

We develop O(1) time algorithms to determine the convex hull of N planar points, find the smallest enclosing rectangle of a set of N planar points, triangulate a set of N planar points, and for the ECDF search problem for N planar points on N × N reconfigurable meshes with buses.
We also develop an O(1) time algorithm to determine the Voronoi diagram of N planar points on N³ × N reconfigurable meshes with buses. We show that Reisis's algorithm for determining the convex hull of N planar points is incorrect. Our algorithms work on all N × N reconfigurable meshes with buses (PARBUS/RMESH/MRN).

CHAPTER 1
MESH CONNECTED PARALLEL COMPUTER ARCHITECTURES

Introduction

Most control-flow based parallel computational models can be classified as either the Shared Memory Model (SMM) or the Distributed Memory Model (DMM). In the SMM, all processor elements (PEs) share a global memory and communicate with each other via the shared memory. Each PE has its own local control and local memory. Each processor is allowed to access an arbitrary location in shared memory based on the allowed read/write modes. In an exclusive-read, exclusive-write (EREW) model, at most one PE can read from or write to a location in the shared memory in a cycle. In a concurrent-read, exclusive-write (CREW) model, multiple PEs can read from the same memory location, but only one PE can write to a location in the shared memory. In a concurrent-write (CW) model, multiple writes to a single location in memory are allowed; the value written is resolved by a write conflict arbitration scheme. The PRAM (Parallel Random Access Machine) characterizes the SMM. Karp [KARP90] lists parallel algorithms for many applications on the SMM.

There is no global memory to share in a DMM, and interprocessor communication is done via an interconnection network. In this model each PE has its own local memory and/or blocks of distributed memory modules available to the PEs through an alignment network connecting PEs with memory modules. In a DMM, there is a delay associated with remote memory access through the network. The interprocessor network is usually fixed at the design phase. All the data routing and communication between PEs is dependent on the capabilities of the network.
The computation time on a DMM is dependent on the diameter of the communication network, i.e., the maximum time to communicate and/or route data between any two PEs.

There are various architectural classification schemes for parallel computational models [HWAN84]. Flynn's classification [FLYN66] is based on the multiplicity of instruction streams and data streams on a multiprocessor system. Figure 1.1 gives a block diagram for an SIMD (Single Instruction Multiple Data) multiprocessor computer architecture. While many interconnection networks [HWAN84], [FENG81], [ULLM84] have been proposed, the two-dimensional mesh, the hypercube, the mesh of trees, and the pyramid have emerged as the dominant ones for many practical applications. Figures 1.2-1.7 give examples of these interconnection networks.

Mesh Connected Computer

In the Mesh Connected Computer (MCC) model, the PEs can be thought of as logically arranged in a k-dimensional array A(n_{k-1}, n_{k-2}, ..., n_0), where n_i is the size of the ith dimension and the number of PEs is p = n_{k-1} × n_{k-2} × ... × n_0. The PE at location A(i_{k-1}, ..., i_0) is connected to the PEs at locations A(i_{k-1}, ..., i_j ± 1, ..., i_0), 0 ≤ j < k, provided they exist. Each PE is connected to its 2k neighbors.

Figure 1.1 Block diagram of an SIMD Computer

It is assumed that a PE may transmit data to any of its nearest neighbors in unit time. Data may be transmitted from one PE to another PE via this interconnection pattern. The interconnection scheme for a 16 PE MCC with k = 2 is given in Figure 1.2. Some simple variations to the mesh include end-around connections. Figure 1.3 shows an MCC with wraparound connections: the right end of each row connects to the left end of the row and the bottom of each column connects to its top. Figure 1.4 shows an MCC with toroidal connections: the right end of a row connects to the left end of the next row and the bottom of a column connects to the top of the next column. The ILLIAC IV, MPP, and DAP are some examples of mesh connected computers.
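The neighbor rule above (differ by ±1 in exactly one coordinate, no wraparound) can be sketched as a short simulation. Python is used here purely for illustration; the code is not part of the dissertation:

```python
def mesh_neighbors(idx, dims):
    """Neighbors of the PE at index idx in a k-dimensional mesh A(n_{k-1},...,n_0).

    idx  -- tuple (i_{k-1}, ..., i_0)
    dims -- tuple (n_{k-1}, ..., n_0)
    A PE is connected to the PEs differing by +/-1 in exactly one
    coordinate, provided they exist, so it has at most 2k neighbors.
    """
    nbrs = []
    for j in range(len(dims)):
        for d in (-1, 1):
            cand = list(idx)
            cand[j] += d
            if 0 <= cand[j] < dims[j]:  # "provided they exist": no wraparound
                nbrs.append(tuple(cand))
    return nbrs

# A 4 x 4 MCC (k = 2): an interior PE has 4 neighbors, a corner PE only 2.
print(len(mesh_neighbors((1, 1), (4, 4))))  # 4
print(len(mesh_neighbors((0, 0), (4, 4))))  # 2
```

Adding the wraparound or toroidal connections of Figures 1.3 and 1.4 would amount to taking the coordinate arithmetic modulo the dimension sizes instead of discarding out-of-range candidates.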
The regular structure and nearest neighbor (local) connectivity make the MCC suitable for VLSI implementation. The fraction of the area occupied by interconnect wires in an MCC is fixed regardless of the size N. The MCC seems to be a natural structure for solving many problems in matrix computation and image processing.

In parallel and distributed processing, computations are usually constrained by information flow and data movement operations rather than by processing time at the PEs. The MCC is not suitable for computations like sorting, where a single piece of data has to be moved from one corner to the opposite corner. In an N^{1/2} × N^{1/2} MCC, the worst case data movement may take as much as 2(N^{1/2}) time. This deficiency, the large diameter, is the main shortcoming of MCCs. Most problems involving communication over long distances will have a lower bound on computation time at least equal to the diameter of the mesh.

Dekel [DEKE81] has parallel algorithms for matrix and graph problems on the MCC and other models. The studies [BAUD78], [KUMA83], [MARB88], [NASS79], [SCHE86], [SCHER89], [THOMK77], [PARK87,90] have sorting algorithms on the MCC and [NASS80] has an optimal routing algorithm on the MCC. The studies [HWAN84], [MILLS84a,88de,89], [STOU82], and [LEIG92] have listed MCC algorithms for many applications.

Figure 1.2 A 4 × 4 Mesh Connected Computer

Mesh of Trees

The Mesh of Trees (MOT) is a hybrid network based on meshes and trees. The MOT has a smaller diameter and is faster than an MCC with the same number of nodes. The MOT is area efficient in terms of VLSI complexity [THOM79,80], [ULLM84]. The MOT is defined in [LEIG81], [LEIG84], [CAPE83], [NATH83]. The N × N mesh of trees is constructed from an N × N grid of processors by adding processors and wires to form a complete binary tree in each row and each column. The leaves of the trees are the N² grid nodes and the added nodes are the internal nodes of the trees. The network has 3N² − 2N processors.
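The processor count above follows because each of the 2N row and column trees contributes N − 1 internal nodes on top of the N² shared leaves: N² + 2N(N − 1) = 3N² − 2N. A quick sketch (Python, for illustration only; the node labels are our own) that builds the node set explicitly:

```python
def mot_processor_count(n):
    """Count the processors of an n x n mesh of trees by construction.

    Leaves are the n^2 grid nodes; each of the n row trees and n column
    trees is a complete binary tree over n leaves (n a power of two),
    contributing n - 1 internal nodes.  Total: n^2 + 2n(n-1) = 3n^2 - 2n.
    """
    nodes = {("grid", i, j) for i in range(n) for j in range(n)}
    for kind in ("row", "col"):
        for t in range(n):
            # a complete binary tree over n leaves has n - 1 internal nodes
            nodes |= {(kind, t, v) for v in range(n - 1)}
    return len(nodes)

print(mot_processor_count(4))                       # 40 = 3*16 - 8
print(mot_processor_count(4) == 3 * 4**2 - 2 * 4)   # True
```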
The leaf and the root processors have degree two and all others have degree three. The N × N MOT has diameter 4 log₂(N). Figure 1.5 shows a 4 × 4 MOT.

Figure 1.3 Mesh with wraparound connections

Figure 1.4 Mesh with toroidal connections

The two-dimensional mesh of trees has a very natural and regular structure. There are variations to the MOT, such as additional sets of trees along the diagonals of the mesh, or connecting the original grid nodes with mesh edges. A reduced mesh of trees has 1/3 as many processors as the usual mesh of trees. The mesh of trees can also be defined in higher dimensions. The MOT offers dramatic speedup in time over meshes with the same number of processors. Many problems can be solved in polylogarithmic time with a polynomial size MOT (class NC). The MOT is area universal, i.e., it can simulate any other network with the same VLSI wire area with only a polylogarithmic factor slowdown.

Nath et al. [NATH83] have area efficient algorithms for matrix computation, sorting, FFT and graph problems. Other studies [ALNU86], [ALNU87], [ALNU88], [PRAS86] describe mesh of trees algorithms for image processing, graph problems, and geometric problems. Leighton [LEIG92] has listed mesh of trees algorithms for many applications.

Pyramid Computer

The Pyramid Computer (PC) is a combination of tree and mesh structures. A pyramid computer of size N has an N^{1/2} × N^{1/2} mesh connected computer as its base (level 0), and log₄(N) levels of mesh connected computers above. A PE at level k is connected to 4 siblings at level k, 4 children at level k − 1 and a parent at level k + 1. Figure 1.6 shows a PC with 16 PEs at the base. The apex is the topmost level and has exactly one PE. If some level of the PC is a 2^k × 2^k mesh, then the level above it is a 2^{k-1} × 2^{k-1} mesh and the level below is a 2^{k+1} × 2^{k+1} mesh.
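Because each level is a quarter the size of the one below it, the PE counts per level form a geometric series, which sums to (4N² − 1)/3 PEs for a base of N² PEs. A small sketch (Python, illustration only) summing the series for a base of 2^k × 2^k:

```python
def pyramid_pe_count(k):
    """Total PEs in a pyramid whose base (level 0) is a 2^k x 2^k mesh.

    Level j is a 2^(k-j) x 2^(k-j) mesh, up to the single-PE apex, so
    the per-level counts 4^k, 4^(k-1), ..., 1 form a geometric series
    summing to (4^(k+1) - 1) / 3.
    """
    return sum(4 ** (k - j) for j in range(k + 1))

# Base of 16 = 4^2 PEs (as in Figure 1.6): 16 + 4 + 1 = 21 PEs in all.
print(pyramid_pe_count(2))                       # 21
print(pyramid_pe_count(2) == (4 ** 3 - 1) // 3)  # True
```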
Let (i, j, k) represent the PE in position (i, j) at level k; then it is connected to the PEs (i ± 1, j, k), (i, j ± 1, k), (i div 2, j div 2, k + 1), (2i, 2j, k − 1), (2i + 1, 2j, k − 1), (2i, 2j + 1, k − 1), and (2i + 1, 2j + 1, k − 1), if they exist.

Figure 1.5 A 4 × 4 Mesh of Trees

Cantoni and Levialdi [CANT86] have proposed variations of the PC for computer vision applications. A PC with a base of size N² has (4N² − 1)/3 PEs and has a diameter of O(log₂(N)). The PC is not as powerful as the MOT; there are many problems which admit algorithms in class NC on an N² MOT but require N steps on an N × N PC. The studies [TANI82], [TANI84], [MILLS84b,87a], [CHAN88] have algorithms for image processing, data movement and geometric problems on the PC.

Figure 1.6 Pyramid Computer with base 16 (4²) nodes

Hypercube

A k-dimensional hypercube consists of N = 2^k nodes. The nodes are labeled 0, 1, ..., 2^k − 1. Two nodes in a hypercube are connected if their labels differ in exactly one bit position in binary representation. The degree of each node is log₂(N). The diameter of an N PE hypercube is also log₂(N). Figure 1.7 shows an 8 PE hypercube.

The hypercube has become a popular architecture because of its versatility and efficiency. It can be used both for general and special purpose applications. It can efficiently simulate any other network of the same size; in particular, it can simulate meshes, binary trees and meshes of trees efficiently. The Ncube and iPSC are examples of commercially available hypercubes.

Since the degree of a node in a hypercube is log₂(N), the number of connections per PE grows logarithmically with the size N of the network. This can be a problem for large hypercubes during physical layout for VLSI implementation. There are bounded degree derivatives of the hypercube such as the butterfly, shuffle-exchange graphs, cube-connected cycles, the Benes network, and de Bruijn graphs [ULLM84].
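The labeling rule above (neighbors differ in exactly one bit) makes neighbor computation a single XOR per dimension; a short sketch (Python, for illustration only):

```python
def hypercube_neighbors(label, k):
    """Neighbors of a node in a k-dimensional hypercube of N = 2^k nodes.

    Two labels are adjacent iff they differ in exactly one bit, so the
    log2(N) = k neighbors are obtained by flipping each bit in turn.
    """
    return [label ^ (1 << b) for b in range(k)]

# Node 0 (binary 000) of an 8-node hypercube, as in Figure 1.7:
print(sorted(hypercube_neighbors(0, 3)))  # [1, 2, 4]
# Every node has degree k = log2(N).
print(len(hypercube_neighbors(5, 3)))     # 3
```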
[RANK91], [LEIG92], [MILLS87c], [STOJ86] have listed hypercube algorithms for many application areas including image processing and computational geometry.

Figure 1.7 Hypercube with 8 (2³) nodes

Modified MCC

Several modifications to the MCC have been proposed to reduce the worst case PE-to-PE communication time. These include attaching trees (MOT), embedding one or more global meshes within the MCC architecture, the pyramid computer, and attaching buses to the processors. Carlson [CARL85], [CARL86] explores embedding one or more global meshes into the MCC. This modification improves asymptotic time complexity for computations with low to medium interprocessor communication, such as tree computations, prefix computations, and finding the connected components of a graph. There is no improvement for computations requiring high interprocessor communication, such as sorting and computing a Fourier transform.

Another modification suggested for the MCC is adding one or more broadcast buses. A broadcast bus connects PEs together so that they can transmit data to all other attached PEs in unit time. It is assumed that only one PE can broadcast at any time. Abelson [ABEL80] and Gentleman [GENT78] have analyzed problems where the worst case solution is constrained by data movement. Jordan [JORD78,79] and Gentleman [GENT78] were the first to consider supplementing the MCC with a global bus. Bokhari [BOKH81,84] and Stout [STOU83] consider finding the maximum, semigroup operations and finding the median on an MCC with a single bus. They report asymptotic improvement in time complexity over the MCC. There is no improvement for computations requiring high interprocessor communication such as sorting. Prasanna Kumar [PRAS87ab], [PRAS89], Stout [STOU86], and Chen [CHEN89,90] consider augmenting the MCC with a bus for each row and column. Figure 1.8 shows such an MCC with multiple buses.
Aggarwal [AGGA86] considers augmenting a d-dimensional MCC with k global buses for finding the maximum and for matrix multiplication. The studies [PRAS87b], [PRAS89] and [CHEN89,90] consider semigroup computations, linear algebra, numerical computations, image processing, and geometric problems on the MCC with multiple buses. They obtain improvements in asymptotic time complexity over the MCC with a single bus. In all these models the bus configuration is static, i.e., it is fixed over computations.

Figure 1.8 A 4 × 4 MCC with Row and Column Broadcasting

Reconfigurable Meshes with Buses

Several authors have considered various reconfigurable parallel computational models and architectures. These include the Bus Automata (BA) of Rothstein [ROTH76], the Configurable Highly Parallel Computer (CHiP) of Snyder [SYND82], the Reconfigurable Mesh with Buses (RMESH) of Miller et al. [MILLP87b,88abc], the Content Addressable Array Processor (CAAP) of Weems et al. [WEEM87,89], the polymorphic torus of Li and Maresca [LI89ab, MARE89], the processor array with a reconfigurable bus system (PARBUS) of Wang and Chen [WANG90ab], and the reconfigurable network (RN) of Ben-Asher et al. [BENA91].

The CHiP [SYND82] consists of homogeneous PEs located at grid points and interposed with programmable switches. Each PE and switch has local memory. PEs store program and data while each switch stores a fixed number of allowed interconnection patterns. The switches are under the central global control of the program running the machine. Various interconnections between processors are accomplished by using different stored patterns. The CHiP architecture differs from other reconfigurable meshes in two ways. First, the number of connection patterns between PEs is constrained by the limited switch settings stored in the local memory. Second, the switch settings are under central control rather than under the distributed control of the PEs.
Several issues related to Wafer Scale Integration of the CHiP are discussed in [HEDL82]. Gannon [GANN81] discusses solutions to numerical and signal processing problems on the CHiP.

A Bus Automaton (BA) [ROTH76] can be considered a cellular automaton (an array of automata with interconnections between any one automaton and a fixed bounded number of neighbors) with a locally switchable global communication network. Each cell (PE) can connect a subset of its input lines (links) directly to some subset of its output links through internal channels. By alternating links and channels, "buses" are set up which can, in principle, establish direct communication from any particular cell to any desired subset of cells. Communication on a bus is assumed to take unit time. The works of Rothstein et al. [ROTH76,77ab,78,79], [MOSH79], [CHAM87] focus on developing O(1) algorithms for problems in the areas of parsing and formal languages, pattern recognition, neural modeling, and cryptographic methods.

Other mesh-like architectures with reconfigurable buses are the Content Addressable Array Processor (CAAP) of Weems et al. [WEEM87,89], the polymorphic torus of Li and Maresca [LI89, MARE89], the reconfigurable mesh with buses (RMESH) of Miller et al. [MILLP87b, 88abc], the processor array with a reconfigurable bus system (PARBUS) of Wang and Chen [WANG90ab], and the reconfigurable network (RN) of Ben-Asher et al. [BENA91]. We refer to all these models as Reconfigurable Meshes with Buses (RMBs).

The CAAP [WEEM87,89] and RMESH [MILLP87b, MILLP88abc] architectures appear to be quite similar, so we shall describe the RMESH only. In this architecture, we have a bus grid with an n × n arrangement of processors at the grid points (see Figure 1.9 for a 4 × 4 RMESH). Each grid segment has a switch on it which enables one to break the bus, if desired, at that point. When all switches are closed, all n² processors are connected by the grid bus. The switches around a processor can be set using local information.
If all processors disconnect the switch on their north, then we obtain row buses (Figure 1.10). Column buses are obtained by having each processor disconnect the switch on its east (Figure 1.11). In the exclusive write model, two processors that are on the same bus cannot simultaneously write to that bus. In the concurrent write model, several processors may simultaneously write to the same bus; rules are provided to determine which of the several writers actually succeeds (e.g., arbitrary, maximum, exclusive or, etc.). Notice that in the RMESH model it is not possible to simultaneously have n disjoint row buses and n disjoint column buses that, respectively, span the width and height of the RMESH. It is assumed that processors on the same bus can communicate in O(1) time. RMESH algorithms for fundamental data movement operations and image processing problems can be found in [MILLP88abc, MILLP91ab, JENQ91abc].

An n × n PARBUS (Figure 1.12) [WANG90ab] is an n × n mesh in which the interprocessor links are bus segments and each processor has the ability to connect together arbitrary subsets of the four bus segments that connect to it. Bus segments that get so connected behave like a single bus. The bus segment interconnections at a processor are done by an internal four-port switch. If the up-to-four bus segments at a processor are labeled N (North), E (East), W (West), and S (South), then this switch is able to realize any set A = {A1, A2} of connections, where Ai ⊆ {N, E, W, S}, 1 ≤ i ≤ 2, and the Ai's are disjoint. For example, A = {{N,S}, {E,W}} results in connecting the North and South segments together and the East and West segments together. If this is done in each processor, then we get, simultaneously, disjoint row and column buses (Figures 1.13 and 1.14). If A = {{N,S,E,W}}, then all four bus segments are connected. PARBUS algorithms for a variety of applications can be found in [MILLP91ab, WANG90ab, LIN92, JANG92abcde, OLAR92, ELGI91, LI91ab].
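The effect of the per-processor switch settings can be modeled by computing connected components over bus segments: segments placed in the same group at some processor are fused into one bus. The sketch below (Python, illustration only; not code from the dissertation) gives every processor of a PARBUS the setting A = {{N,S}, {E,W}} and recovers the simultaneous disjoint row and column buses of Figures 1.13 and 1.14:

```python
class DSU:
    """Minimal union-find over arbitrary hashable items."""
    def __init__(self):
        self.p = {}
    def find(self, x):
        self.p.setdefault(x, x)
        while self.p[x] != x:
            self.p[x] = self.p[self.p[x]]  # path halving
            x = self.p[x]
        return x
    def union(self, a, b):
        self.p[self.find(a)] = self.find(b)

def bus_components(n, setting):
    """Connected bus components of an n x n PARBUS.

    setting(i, j) returns the switch setting at processor (i, j): a list
    of disjoint groups, each a subset of {'N', 'E', 'W', 'S'}.  Bus
    segments whose ports fall in the same group are fused into one bus.
    """
    def seg(i, j, port):  # the bus segment incident to this port of PE (i, j)
        if port == 'N' and i > 0:     return ('v', i - 1, j)
        if port == 'S' and i < n - 1: return ('v', i, j)
        if port == 'W' and j > 0:     return ('h', i, j - 1)
        if port == 'E' and j < n - 1: return ('h', i, j)
        return None  # port on the mesh boundary: no segment there

    dsu = DSU()
    for i in range(n):
        for j in range(n):
            for group in setting(i, j):
                segs = [s for s in (seg(i, j, p) for p in group) if s]
                for a, b in zip(segs, segs[1:]):
                    dsu.union(a, b)
    comps = {}
    for i in range(n):
        for j in range(n):
            for s in (seg(i, j, p) for p in 'NEWS'):
                if s:
                    comps.setdefault(dsu.find(s), set()).add(s)
    return list(comps.values())

# A = {{N,S}, {E,W}} everywhere: n disjoint row buses plus n column buses.
rowcol = bus_components(4, lambda i, j: [{'N', 'S'}, {'E', 'W'}])
print(len(rowcol))  # 8 buses: 4 row buses + 4 column buses
```

Restricting the groups passed to `setting` models the other architectures: a single group per processor gives RMESH-style settings, and limiting the groups to the ten patterns of Figure 1.16 gives MRN-style settings.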
Observe that in an RMESH the realizable connections are of the form A = {A1}, A1 ⊆ {N, E, W, S}. The polymorphic torus architecture [LI89ab, MARE89] is identical to the PARBUS except that the rows and columns of the underlying mesh wrap around (Figure 1.15). In a reconfigurable network (RN) [BENA91] no restriction is placed on the bus segments that connect pairs of processors or on the relative placement of the processors, i.e., processors may not lie at grid points and a bus segment may join an arbitrary pair of processors.

Figure 1.9 4 × 4 RMESH

Figure 1.10 Row buses

Figure 1.11 Column buses

Figure 1.12 4 × 4 PARBUS

Like the PARBUS and polymorphic torus, each processor has an internal switch that is able to connect together arbitrary subsets of the bus segments that connect to the processor. Ben-Asher et al. [BENA91] also define a mesh restriction (MRN) of their reconfigurable network. In this, the processor and bus segment arrangement is exactly as for the PARBUS (Figure 1.12). However, the switches internal to the processors are able to obtain only the 10 bus configurations given in Figure 1.16. Thus an MRN is a restricted PARBUS.

Figure 1.13 Row buses in a PARBUS

Figure 1.14 Column buses in a PARBUS

While we have defined the above reconfigurable bus architectures as square two-dimensional meshes, it is easy to see how they may be extended to obtain non-square architectures and architectures with more dimensions than two.

Figure 1.15 4 × 4 Polymorphic torus

Comparison of RMB with other models

Reconfigurable Meshes with Buses (RMBs) are attractive because they have a diameter of 1.
The time taken to broadcast along a bus is assumed to be constant (O(1)) regardless of the length of the bus. This assumption may be unrealistic on currently realizable parallel architectures: the conductance and capacitance of wires and transistors imply a constant lower bound on the transmission rate of a single switch. Despite this, there already exist experimental RMBs on VLSI chips, such as YUPPIE (Yorktown Ultra-Parallel Polymorphic Image Engine) [LI89ab] and the CAAP [WEEM87,89]. RMBs are simple and easy to build with short wires (each edge has O(1) length) in VLSI. The time to change the state of a switch is orders of magnitude higher than the time for a signal to traverse a distance equal to the diameter of a switch [ROTH88]. This motivates non-electronic or hybrid implementations of RMBs; Schuster [SCHU91] considers optical implementations.

Figure 1.16 Local configurations allowed for a switch in an MRN

The RMB model is similar to an SIMD MCC except for the buses and switches. An N^{1/2} × N^{1/2} RMB has area O(N) under the assumption that processors, switches, and single links have constant size. It is assumed that only a single PE can broadcast on a bus at any time, i.e., the bus functions in CREW mode. Each switch is under the control of its adjacent processors. The switch setting is dynamically controlled (in software) by the adjacent processors as long as there is no conflict in the desired switch settings. The switch settings allow the broadcast bus to be partitioned into different configurations (subbuses); each subbus can function as a smaller RMB. An RMB can simulate any VLSI organization with equivalent area (those which can be laid out in an N^{1/2} × N^{1/2} grid) without loss of time [MILLP87b]. The RMB supports several techniques of CRCW PRAM models, giving the same or better time performance as a CREW PRAM with N processors.
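One classic use of bus partitioning, discussed in the comparison below, is computing the OR of N bits in O(1) time on a one-dimensional reconfigurable array: every PE holding a 0 fuses its left and right bus segments, every PE holding a 1 breaks the bus and writes a 1 onto its left subbus. Each subbus then has at most one writer (CREW is respected), and PE 0 reads a 1 exactly when some input bit is 1. A sequential sketch of this behavior (Python, illustration only; the technique is standard, the code is not from the dissertation):

```python
def reconfigurable_or(bits):
    """O(1)-style OR of N bits on a 1-D reconfigurable bus, simulated sequentially.

    Each PE fuses its left and right bus segments iff its bit is 0; a PE
    holding a 1 writes a 1 onto its left subbus.  Every subbus has at
    most one writer, and PE 0's subbus carries a 1 iff some bit is 1.
    """
    n = len(bits)
    bus_of = []           # subbus id that each PE sits on
    bus = 0
    for i in range(n):
        bus_of.append(bus)
        if bits[i] == 1:  # PE i opens its switch: a new subbus starts to its right
            bus += 1
    written = {}
    for i in range(n):
        if bits[i] == 1:  # one constant-time write per 1-PE, no write conflicts
            written[bus_of[i]] = 1
    return written.get(bus_of[0], bits[0])  # PE 0 reads its own subbus

print(reconfigurable_or([0, 0, 1, 0, 1]))  # 1
print(reconfigurable_or([0, 0, 0]))        # 0
```

Every "round" simulated here corresponds to a single switch-setting step plus a single broadcast step on the array, which is why the whole computation is O(1) regardless of N.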
Schuster [SCHU91] provides simulations of PRAMs by RMBs and vice versa. Miller et al. [MILLP87b] give an efficient simulation of the Mesh of Trees (MOT) on the RMB and a simulation of Pyramid Computers (PC) without loss of time. The computation for problems requiring high interprocessor communication on a fixed-topology network requires at least Ω(diameter) time; an RMB can solve some of these problems in constant time in certain cases [SCHU91], [WANG90ab, 91ab].

The RMB is more powerful than the PRAM (SMM). A one-dimensional linear array of size N with a reconfigurable bus can compute the OR of N Boolean inputs in O(1) steps, whereas the CREW PRAM with any polynomial number of processors requires Ω(log N) steps [COOK86]. The parity (summation modulo 2) of N inputs is computed in O(1) time on a 3 × N PARBUS and MRN [LI91a]; it takes Ω(log N / log log N) steps on a CRCW PRAM with any polynomial number of processors [BEAM87]. Moreover, the parity of N² inputs can be computed in O(1) time on an N × N PARBUS and MRN.

The RMB is also more powerful than fixed-topology networks and meshes with static broadcast buses. Asymptotic improvements in the running time of algorithms that solve several problems on RMBs [MILLP87b,88abc,91ab], [JENQ91abc], [JANG92abcd], [OLAR92], [REIS89] have been reported compared to efficient algorithms on other fixed-topology networks such as the mesh of trees (MOT), pyramid computers (PC), hypercubes, and meshes with static broadcast buses. For example, the parity of N inputs on an N^{1/2} × N^{1/2} mesh with multiple buses takes Ω(N^{1/6}) time [PRAS87a] and Ω(N^{1/8}) time on any rectangular N-processor mesh with multiple buses [BARN91], [CHEN89]. This can be solved in O(1) time on a PARBUS and MRN.

Dissertation Outline

The prospect of developing O(1), i.e., constant time, algorithms on RMBs motivated this dissertation. Several other dissertations have been motivated by similar considerations [REIS89], [SCHU91], [WANG91a].
This dissertation covers O(1) time sorting algorithms on two-dimensional and higher dimensional RMBs of all configurations; algorithms for sorting N elements on N-processor meshes using a number of routing steps approximately equal to the distance lower bound in two dimensions, together with their simulation on RMBs; and constant time algorithms for determining the convex hull of a set of N planar points, finding the smallest enclosing rectangle of a set of N planar points, triangulating N planar points, and the ECDF search problem for N planar points.

In chapter 2, we discuss the sorting problem on two and higher dimensional RMBs of all configurations. Our algorithms require fewer routes than previously published algorithms for two-dimensional RMBs, and they are much simpler and easier to implement. We show that it is faster to sort than to select the kth element using Lin's algorithm [LIN92].

In chapter 3, sorting N elements on N-processor meshes and RMBs is discussed. We show that the lower bound on the number of routes needed to sort on an n × n bidirectional mesh established in [SCHN86] is incorrect. We give sorting algorithms on two-dimensional bidirectional and unidirectional meshes using a number of routing steps that is very close to the distance lower bound. We demonstrate the practical implications of our algorithms for sorting on a mesh. We also show how to simulate the mesh algorithms on reconfigurable meshes with buses.

In chapter 4, we derive O(1) time algorithms for determining the convex hull of N planar points, finding the smallest enclosing rectangle of N planar points, triangulating N planar points, and the ECDF search problem for N planar points on N × N reconfigurable meshes with buses. We prove that Reisis's convex hull algorithm [REIS92] is incorrect. Our algorithms work on all RMB architectures (PARBUS/RMESH/MRN).

Chapter 5 concludes this work with a critical analysis of the results and a discussion of future research directions.
CHAPTER 2
SORTING n NUMBERS ON n x n RECONFIGURABLE MESHES WITH BUSES

Introduction

In this chapter, we consider the problem of sorting n numbers on an RMESH, PARBUS, and MRN. This sorting problem has been previously studied for all three architectures. We can sort n numbers in O(1) time on a three-dimensional n x n x n RMESH [JENQ91ab], PARBUS [WANG90], and MRN [BENA91]. All of these algorithms are based on a count sort [HORO90] and are easily modified to run in the same amount of time on a two-dimensional n^2 x n computer of the same model. Nakano et al. [NAKA90] have shown how to sort n numbers in O(1) time on an (n log^2 n) x n PARBUS. Jang and Prasanna [JANG92a] and Lin et al. [LIN92] have further reduced the number of processors required by an O(1) sort. Both present O(1) sorting algorithms that work on an n x n PARBUS. Since such a PARBUS can be realized in n^2 area, their algorithms achieve the area-time-squared (AT^2) lower bound of Ω(n^2) for sorting n numbers in the VLSI word model [LEIG85]. The algorithm of Jang and Prasanna [JANG92a] is based on Leighton's column sort [LEIG85] while that of Lin et al. [LIN92] is based on selection. Neither is directly adaptable to run on an n x n RMESH in O(1) time, as the algorithm of [JANG92a] requires that processors be able to connect their bus segments according to A = {{N,S}, {E,W}} while the algorithm of [LIN92] requires A = {{N,S}, {E,W}} and {{N,W}, {S,E}}. These are not permissible in an RMESH. Their algorithms are, however, directly usable on an n x n MRN as the bus connections used are permissible connections for an MRN. Ben-Asher et al. [BENA91] describe an O(1) algorithm to sort n numbers on an MRN with O(n^(1+ε)) processors for any ε > 0. This algorithm is also based on Leighton's column sort [LEIG85].
In this chapter, we show how Leighton's column sort algorithm [LEIG85] and Marberg and Gafni's rotate sort algorithm [MARB88] can be implemented on all three reconfigurable mesh with buses (RMB) architectures so as to sort n numbers in O(1) time on an n x n configuration. The resulting RMB sort algorithms are conceptually simpler than the O(1) PARBUS sorting algorithms of [JANG92a] and [LIN92]. In addition, our implementations use fewer bus broadcasts than do the algorithms of [JANG92a] and [LIN92]. Since our PARBUS implementations use only bus connections permissible in an MRN, our PARBUS algorithms may be used directly on an MRN. For the RMESH, our implementations are the first RMESH algorithms to sort n numbers in O(1) time on an n x n configuration.

In section 2, we describe Leighton's column sort algorithm. Its implementation on an RMESH is developed in section 3. In section 4, we show how to implement column sort on a PARBUS. Rotate sort is considered in sections 5 through 7. In section 5 we describe Marberg and Gafni's rotate sort [MARB88]. Implementations of rotate sort are obtained in sections 6 and 7 for the RMESH and PARBUS architectures, respectively. In section 8, we propose a sorting algorithm that is a combination of rotate sort and Scherson et al.'s [SCHER89] iterative shear sort. In section 9, we provide a comparison of the two PARBUS sorting algorithms developed here and those of Jang and Prasanna [JANG92a] and Lin et al. [LIN92]. For the PARBUS model, Leighton's column sort uses the fewest bus broadcasts; for the RMESH model, combined sort uses the fewest. In section 10, we observe that, using rotate sort, one can sort N^k elements in O(1) time on an N x N x ... x N (k+1)-dimensional RMESH and PARBUS.

Column Sort

Column sort is a generalization of Batcher's odd-even merge [KNUT73] and was proposed by Leighton [LEIG85]. It may be used to sort an r x s matrix Q where r >= 2(s-1)^2 and r mod s = 0.
The number of elements in Q is n = rs and the sorted sequence is stored in column major order (Figure 2.1). Our presentation of column sort follows that of [LEIG85] very closely. There are eight steps to column sort. In the odd steps 1, 3, 5, and 7, we sort each column of Q top to bottom. In step 2, the elements of Q are picked up in column major order and placed back in row major order (Figure 2.2). This operation is called a transpose. Step 4 is the reverse of this (i.e., elements of Q are picked up in row major order and put back in column major order) and is called untranspose. Step 6 is a shift by ⌊r/2⌋. This increases the number of columns by 1 and is shown in Figure 2.3. Step 8, unshift, is the reverse of this. Leighton [LEIG85] has shown that these eight steps are sufficient to sort Q whenever r >= 2(s-1)^2 and r mod s = 0.

Figure 2.1 Sorting a 9 x 3 matrix Q into column major order: (a) input Q, (b) output Q
Figure 2.2 Transpose (step 2) and Untranspose (step 4) of a 9 x 3 matrix
Figure 2.3 Shift (step 6) and Unshift (step 8)

Column Sort On An RMESH

Our adaptation of column sort to an n x n RMESH is similar to the adaptation used by Ben-Asher et al. to obtain an O(n^(17/9)) processor reconfigurable network that can sort in O(1) time. This reconfigurable network uses approximately 11 layers. The n numbers to be sorted initially reside in the Z variables of the row 1 PEs of an n x n RMESH. This is interpreted as representing the column major order of the Q matrix used in a column sort (Figure 2.4). The dimensions of Q are r1 x s1 where r1 = (1/2)n^(3/4) and s1 = 2n^(1/4). Clearly, there is an n1 such that r1 >= 2(s1-1)^2 for all n >= n1. Hence, column sort will work for n >= n1. For n < n1 we can sort in constant time (as n1 is a constant) using any previously known RMESH sorting algorithm.
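The eight steps of column sort can be checked with a short sequential sketch (our own illustrative code; the in-place column sorts stand in for the O(1) sub-RMESH sorts developed below, and the matrix is held as a Python list of rows):

```python
import math

def sort_columns(Q):
    # odd steps 1, 3, 5, 7: sort every column top to bottom
    r, s = len(Q), len(Q[0])
    for j in range(s):
        col = sorted(Q[i][j] for i in range(r))
        for i in range(r):
            Q[i][j] = col[i]

def col_major(Q):
    r, s = len(Q), len(Q[0])
    return [Q[i][j] for j in range(s) for i in range(r)]

def refill_col_major(flat, r, s):
    # lay a flat sequence back down in column major order
    M = [[None] * s for _ in range(r)]
    for p, v in enumerate(flat):
        M[p % r][p // r] = v
    return M

def column_sort(Q):
    r, s = len(Q), len(Q[0])         # requires r >= 2(s-1)^2 and r mod s == 0
    sort_columns(Q)                                     # step 1
    flat = col_major(Q)                                 # step 2: transpose
    Q = [flat[i * s:(i + 1) * s] for i in range(r)]
    sort_columns(Q)                                     # step 3
    Q = refill_col_major([v for row in Q for v in row], r, s)   # step 4: untranspose
    sort_columns(Q)                                     # step 5
    pad = col_major(Q)                                  # step 6: shift by floor(r/2)
    pad = [-math.inf] * (r // 2) + pad + [math.inf] * (r - r // 2)
    Q = refill_col_major(pad, r, s + 1)
    sort_columns(Q)                                     # step 7
    flat = col_major(Q)[r // 2: r // 2 + r * s]         # step 8: unshift
    return refill_col_major(flat, r, s)

```

Running this on the 9 x 3 example of Figure 2.1 (r = 9 >= 2(3-1)^2 = 8) leaves the matrix sorted in column major order, as Leighton's theorem guarantees.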
The steps in the sorting algorithm are given in Figure 2.5. Steps 1, 2, and 3 use the n x r1 sub-RMESH Ai to sort column i of Q. For the sort of step 4, the Q matrix has dimensions r1 x (s1 + 1). Columns 1 and s1 + 1 are already sorted as the -∞'s and +∞'s of Figure 2.3 do not affect the sorted order. Only the inner s1 - 1 columns need to be sorted. Each of these columns is sorted using an n x r1 sub-RMESH, Bi, as shown in Figure 2.6. The sub-RMESHs X and Y are idle during this sort.

Figure 2.4 Column major interpretation of Q (sub-RMESHs A1, A2, ..., As1, one per column of Q)

Step 0: [Input] Q is available in column major order, one element per PE, in row 1 of the RMESH. I.e., Z[1,j] = j'th element of Q in column major order.
Step 1: [Sort-Transpose] Obtain the Q matrix following step 2 of column sort. This matrix is available in column major order in each row of the RMESH.
Step 2: [Sort-Untranspose] Obtain the Q matrix following step 4 of column sort. This matrix is available in column major order in each row of the RMESH.
Step 3: [Sort-Shift] Obtain the Q matrix following step 6 of column sort. This matrix (excluding the -∞ and +∞ values) is available in column major order in each row of the RMESH.
Step 4: [Sort-Unshift] Obtain the sorted result in row 1 of the RMESH.

Figure 2.5 Sorting Q on an RMESH

Thus the sorts of each of the above four steps involve sorting r1 numbers using an n x r1 sub-RMESH. Each of these is done using column sort with the Q matrix now being an r2 x s2 matrix with r2·s2 = r1. We use r2 = n^(1/2) and s2 = (1/2)n^(1/4). With this choice, we have 2(s2-1)^2 < 2s2^2 = (1/2)n^(1/2) <= r2, so column sort may be applied to the r2 x s2 matrix.

Figure 2.6 Sub-RMESHs for step 4 of sort: X, B1, B2, ..., Bs1, Y of widths ⌊r1/2⌋, r1, ..., r1, ⌈r1/2⌉

Let us examine step 1 of Figure 2.5. To sort a column of Q we initially use only the n x r1 sub-RMESH Ai that contains this column. This column is actually the column major representation of a matrix Wi. So, sorting the column is equivalent to sorting Wi.
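The two-level decomposition parameters chosen above can be sanity checked directly (a small illustrative script; exact for n that is a power of 16):

```python
def columnsort_params(n):
    # outer matrix Q is r1 x s1; each of its columns is an r2 x s2 matrix
    r1, s1 = n ** 0.75 / 2, 2 * n ** 0.25
    r2, s2 = n ** 0.5, n ** 0.25 / 2
    assert abs(r1 * s1 - n) < 1e-6                  # Q holds all n elements
    assert abs(r2 * s2 - r1) < 1e-6                 # a column of Q fills one inner matrix
    assert 2 * (s2 - 1) ** 2 < 2 * s2 ** 2 <= r2   # inner condition holds for all n > 1
    return r1 >= 2 * (s1 - 1) ** 2                  # outer condition holds once n >= n1

```

The outer condition fails for small n (those cases are handled by any known constant time RMESH sort) and holds from n = 2^16 on, which is one concrete value the constant n1 may take.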
To sort Wi, we follow the steps given in Figure 2.5 except that Q is replaced by Wi. The steps of Figure 2.5 are done differently on Wi than on Q. Figure 2.7 gives the necessary steps for each Wi. Note that this figure assumes that the n x n RMESH has been partitioned into the s1 Ai's and that each Ai operates independently of the others. To Sort-Transpose a Wi, we first broadcast Wi, which is initially stored in the Z variables of the row 1 PEs of Ai, to all rows of Ai (step 1). Following this, the U variable of each PE in a column of Ai is the same. In steps 2-4, each Ai is partitioned into square sub-RMESHs Bijk, 1 <= j <= r2, 1 <= k <= s2, of size r2 x r2 (Figure 2.8). Note that Bijk contains column k of Wi in the U variables of each of its rows. Bijk will be used to determine the rank of the j'th element of the k'th column of Wi. Here, by rank we mean the number of elements in the k'th column of Wi that are either less than the j'th element, or are equal to it but appear at a position before the j'th. So, this rank gives the position, in column k, of the j'th element following a sort of column k. To determine this rank, in step 2, we broadcast the j'th element of column k of Wi to all processors in Bijk. This is done by broadcasting U values from column j of Bijk using row buses that are local to Bijk. The broadcast value is stored in the V variables of the PEs of Bijk. Now, the processor in position (a,b) of Bijk has the b'th element of column k of Wi in its U variable and the j'th element of this column in its V variable. Step 3 sets S variables in the processors to 0 or 1 such that the sum of the S's in each row of Bijk gives the rank of the j'th element of column k of Wi. In step 4, the S values in a row are summed to obtain the rank. We shall describe shortly how this is done. The computed rank is stored in variable R of the PE in position [k,1] of Bijk. Note that since k <= s2 <= r2, [k,1] is actually a position in Bijk.
Now, if our objective were simply to sort the columns of Wi, we could route the V variable in PE[k,1] (this is the j'th element in column k of Wi) to the R'th column using a row bus local to Bijk and then broadcast it along a column bus to all PEs in column R. However, we wish to transpose the resulting Wi, which is sorted by columns.

Step 1: Broadcast the row 1 Z values, using column buses, to the U variables of all PEs in the same column of Ai.
Step 2: The column j PEs in all Bijk's broadcast their U values, using row buses local to each Bijk, to the V variables of all PEs in the same Bijk.
Step 3: PE[a,b] of Bijk sets its S value to 1 if (U < V) or (U = V and b < j). Otherwise, it sets its S value to 0. This is done, in parallel, by all PEs in all Bijk's.
Step 4: The sum of the S's in any one row of Bijk is computed and stored in the R variable of PE[k,1] of Bijk. This is done, in parallel, for all Bijk's.
Step 5: Using row buses, PE[k,1] of each Bijk sends its V value to the PE in column [(R-1) mod s2]·r2 + (k-1)·r2/s2 + ⌈R/s2⌉ of Ai.
Step 6: The PEs that received V values in step 5 broadcast this value, using column buses, to the U variables of the PEs in the same column of Ai.

Figure 2.7 Steps to Sort-Transpose Wi into Ai

For the transpose, observe that the j'th element of Wi will be in column k and row R of the sorted Wi. Following the transpose, this element will be in row (k-1)·r2/s2 + ⌈R/s2⌉ and column ((R-1) mod s2) + 1 of Wi. Hence, the V value in PE[k,1] of Bijk is to be in all PEs of column [(R-1) mod s2]·r2 + (k-1)·r2/s2 + ⌈R/s2⌉ of Ai following the Sort-Transpose. This is accomplished in steps 5 and 6. Note that there are no bus conflicts in the row bus broadcasts of step 5 as the broadcast for each Bijk uses a different row bus that spans the width of Ai. The total number of broadcasts is 4 plus the number needed to sum the S values in a row of Bijk. We shall shortly see that summing can be done with 6 broadcasts. Hence Sort-Transpose uses 10 broadcasts.
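The comparisons of steps 2-4 amount to a stable count sort rank, which a short sequential sketch makes explicit (our own illustrative code; position b plays the role of the b'th column of Bijk):

```python
def count_sort_ranks(col):
    # rank of element j = number of elements that are smaller, or equal
    # but at an earlier position (the tie-break "b < j" of step 3)
    return [sum(1 for b, x in enumerate(col)
                if x < col[j] or (x == col[j] and b < j))
            for j in range(len(col))]

def place_by_rank(col):
    out = [None] * len(col)
    for j, r in enumerate(count_sort_ranks(col)):
        out[r] = col[j]     # ranks form a permutation: routing is conflict free
    return out

```

Because ties are broken by position, equal elements receive distinct ranks; this is exactly what lets step 5 route every element over its own bus without conflicts.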
Figure 2.8 Division of Ai into r2 x r2 sub-RMESHs Bi,1,1, Bi,1,2, ..., Bi,1,s2, Bi,2,1, ..., Bi,r2,1, ..., Bi,r2,s2

Ranking

To sum the S values in a row of an r2 x r2 RMESH, we use a slightly modified version of the ranking algorithm of Jenq and Sahni [JENQ91ab]. This algorithm ranks the row 1 processors of an r2 x r2 RMESH. The ranking is done using 0/1 valued variables, S, in row 1. The rank of the processor in column i of row 1 is the number of processors in row 1 and columns k, k < i, that have S = 1. Hence, the rank of processor [1,r2] equals the sum of the S values in row 1 except when S[1,r2] = 1. In this latter case we need to add 1 to the rank to get the sum. The ranking algorithm of [JENQ91ab] has three phases:

Phase 1: Rank the processors in even columns of row 1. This does not account for the 1's in the odd columns.
Phase 2: Rank the processors in odd columns of row 1. This does not account for the 1's in the even columns.
Phase 3: Combine the odd and even ranks.

The procedures for phases 1 and 2 are quite similar. These are easily modified to start with the configuration following step 3 of Figure 2.7 and to send the phase 1 and phase 2 results of the rightmost odd and even columns in Bijk to PE[k,1], where the two results are added. To avoid an extra broadcast, the result for the rightmost column (phase 1 if r2 is even and phase 2 if r2 is odd) is incremented by 1 in case S[*,r2] = 1 before the broadcast. While the phase 1 code of [JENQ91ab] uses 4 broadcasts, the first of these can be eliminated as we begin with a configuration in which S[*,b] is already on all rows of column b, 1 <= b <= r2. So, phases 1 and 2 use 6 broadcasts. Phase 3 of [JENQ91ab] uses two broadcasts. Both of these can be eliminated by having their phase 1 (2) step 10 directly broadcast the rank of the rightmost even (odd) column to PE[k,1] using a row bus that spans row k connected to a column bus that spans the rightmost even (odd) column of Bijk.
So, the summing operation of step 4 can be done using a total of 6 broadcasts.

Analysis of RMESH Column Sort

First, let us consider sorting a column of Q, which is an r1 x s1 matrix. This requires us to perform steps 1-4 of Figure 2.5 on the Wi's. As we have just seen, step 1 uses 10 broadcasts. Step 2 is similar to step 1 except that we begin with the data on all rows of Ai and, instead of a transpose, an untranspose is to be performed. This means that step 1 of Figure 2.7 can be eliminated and the formula in step 5 is to be changed to correspond to an untranspose. The number of broadcasts for step 2 of Figure 2.5 is therefore 9. Steps 3 and 4 are similar and each uses 9 broadcasts. The total number of broadcasts to sort a column of Q is therefore 37. Now, to perform a Sort-Transpose of the columns of Q we proceed as in a sort of the columns of Q except that the last broadcast of Sort-Unshift performs the transpose and leaves the transposed matrix in column major order in all rows of the RMESH. This takes 37 broadcasts. The Sort-Untranspose takes 36 broadcasts as it begins with Q in all rows. Similarly, steps 3 and 4 each take 36 broadcasts. The total number of broadcasts is therefore 145. A more careful analysis reveals that the number of broadcasts needed for the Sort-Shift and Sort-Unshift steps can be reduced by one each, as the step 5 (Figure 2.7) broadcast can be eliminated. Taking this into account, the number of broadcasts to sort a column of Q becomes 35. However, to do a Sort-Transpose of the columns of Q, we need to do an additional broadcast during the Sort-Unshift of the Wi's. This brings the number to 36. A Sort-Untranspose of Q takes 35 broadcasts and the remaining two steps of Figure 2.5 each take 34 broadcasts. Hence, the total number of broadcasts becomes 139.

Column Sort On A PARBUS

The RMESH algorithm of Section 3 will work on a PARBUS as all connections used by it are possible in a PARBUS.
We can, however, sort using fewer than 139 broadcasts on a PARBUS. If we replace the ranking algorithm of [JENQ91ab] that we used in Section 3 by the prefix sum algorithm of [LIN92] (i.e., prefix summing N bits on an (N+1) x N PARBUS), then we can sum the S's in a row using two broadcasts. The algorithm of [LIN92] needs to be modified slightly to allow for the fact that we begin with the S values on all columns and that we are summing r2 S values on an r2 x r2 PARBUS rather than an (r2+1) x r2 PARBUS. These modifications are straightforward. Since the S's can be summed in 2 broadcasts rather than 6, the Sort-Transpose of Figure 2.5 requires only 6 broadcasts. The Sort-Untranspose can be done in 5 broadcasts. As indicated in the analysis of Section 3, the Sort-Shift and Sort-Unshift steps can be done without the broadcast of step 5 of Figure 2.7. So, each of these requires only 4 broadcasts. Thus, sorting a column of Q requires 19 broadcasts. Following the analysis of Section 3, we see that a Sort-Transpose of Q can be done with 20 broadcasts, a Sort-Untranspose with 19 broadcasts, and a Sort-Shift and Sort-Unshift with 18 broadcasts each. Thus n numbers can be sorted on an n x n PARBUS using 75 broadcasts. We can actually reduce the number of broadcasts further by beginning with r1 = n^(2/3) and s1 = n^(1/3). While this does not satisfy the requirement that r >= 2(s-1)^2, Leighton [LEIG85] has shown that column sort works for r and s such that r >= s(s-1) provided that the untranspose of step 4 is replaced by an undiagonalize step (Figure 2.9). The use of the undiagonalizing permutation only requires us to change the formula used in step 5 of Figure 2.7 to a slightly more complex one. This does not change the number of broadcasts in the case of the RMESH and PARBUS algorithms previously discussed.
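The modulo-M counting idea behind the [LIN92] bit summation can be sketched sequentially (our own illustrative rendering of the iteration structure, not the PARBUS routing itself): each round keeps only a mod-M running value and emits one carry bit per completed group of M ones, so round t recovers base-M digit t of the total and the number of live bits shrinks by a factor of about M, giving roughly log N / log M rounds.

```python
def sum_bits_mod(bits, M):
    total, power, cur = 0, 1, list(bits)
    while cur:
        running, carries = 0, []
        for b in cur:
            running = (running + b) % M   # only mod-M arithmetic is used
            if b and running == 0:        # the count just wrapped: emit a carry
                carries.append(1)
        total += power * running          # running = (number of 1's) mod M
        power *= M
        cur = carries                     # recurse on about len(cur)/M carry bits
    return total

```

The invariant is total so far + power * (true count of cur) = number of 1's in the input, which is why the digits reassemble the exact sum.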
However, the ability to use r1 = n^(2/3) and s1 = n^(1/3) (instead of r1 = 2n^(2/3) and s1 = (1/2)n^(1/3), which satisfy r1 >= 2(s1-1)^2) significantly reduces the number of broadcasts for the PARBUS algorithm we are about to describe. Now, the Sort-Transpose, Sort-Untranspose, Sort-Shift and Sort-Unshift for Q are not done by using another level of column sort. Instead, a count sort similar to that of Figure 2.7 is applied directly to the columns of Q. This time, Ai is an n x r1 = n x n^(2/3) sub-PARBUS. Let Dij be the j'th (from the top) n^(1/3) x r1 sub-PARBUS of Ai (Figure 2.10). The Dij's are used to do the counting previously done by the Bijk's. To count r1 = n^(2/3) bits using an n^(1/3) x n^(2/3) sub-PARBUS, we use the parallel prefix sum algorithm of [LIN92], which does this in 12 broadcasts when we begin with bits in all rows of Dij and take into account that we want only the sum and not the prefix sum. Note that the prefix sum algorithm of [LIN92] is an iterative algorithm that uses modulo M arithmetic to sum N bits on an (M+1) x N PARBUS. For this it uses ⌈log N / log M⌉ iterations. For the case of summing N bits, this is easily modified to run on an M x N PARBUS in ⌈log N / log M⌉ iterations with each iteration using 6 broadcasts (3 for the odd bits, and 3 for the even bits). With our choice of r1 and s1, the number of iterations is 2, while with r1 = 2n^(2/3) and s1 = (1/2)n^(1/3), the number of iterations is 3. The two iterations, together, use 12 broadcasts.

Figure 2.9 Undiagonalizing a 9 x 3 matrix

To get the Sort-Transpose algorithm for the PARBUS, we replace all occurrences of Bijk by Dij and of PE[k,1] by PE[i,1] in Figure 2.7. The formula of step 5 is changed to [(R-1) mod s1]·r1 + (i-1)·r1/s1 + ⌈R/s1⌉. The number of broadcasts used by the new Sort-Transpose algorithm is 16. The remaining three steps of Figure 2.5 are similar.
The Sort-Undiagonalize takes 15 broadcasts as it begins with data in all rows and the step 1 (Figure 2.7) broadcast can be eliminated. The Sort-Shift and Sort-Unshift can each be done in 14 broadcasts as the step 5 (Figure 2.7) broadcasts are unnecessary. So, the number of broadcasts in the one-level PARBUS column sort algorithm is 59.

Figure 2.10 Decomposing Ai (n x r1 = n x n^(2/3)) into Dij's of size s1 x r1

Rotate Sort

Rotate sort was developed by Marberg and Gafni [MARB88] to sort MN numbers on an M x N mesh with the standard four neighbor connections. To state their algorithm we need to restate some of the definitions from [MARB88]. Assume that M = 2^s and N = 2^(2t), where s >= t. An M x N mesh can be tiled in a natural way with tiles of size M x N^(1/2). This tiling partitions the mesh into vertical slices (Figure 2.11(a)). Similarly, an M x N mesh can be tiled with N^(1/2) x N tiles to obtain horizontal slices (Figure 2.11(b)). Tiling by N^(1/2) x N^(1/2) tiles results in a partitioning into blocks (Figure 2.11(c)). Marberg and Gafni define three procedures on which rotate sort is based. These are given in Figure 2.12. Rotate sort is comprised of the six steps given in Figure 2.13. Recall that a vertical slice is an M x N^(1/2) submesh; a horizontal slice is an N^(1/2) x N submesh; and a block is an N^(1/2) x N^(1/2) submesh.

Figure 2.11 Definitions of slices and blocks: (a) vertical slices, (b) horizontal slices, (c) blocks

Marberg and Gafni [MARB88] point out that when M = N, step 1 of rotate sort may be replaced by the following steps:
Step 1': (a) sort all the columns downward; (b) sort all the rows to the right.

Procedure balance(v,w); (Operate on submeshes of size v x w)
  sort all columns of the submesh downward;
  rotate each row i of the submesh i mod w positions right;
  sort all columns of the submesh downward;
end;
(a) Balance a submesh

Procedure unblock; (Operate on entire mesh)
  rotate each row i of the mesh (i·N^(1/2)) mod N positions right;
  sort all columns of the mesh downward;
end;
(b) Unblock

Procedure shear; (Operate on entire mesh)
  sort all even numbered rows to the right and all odd numbered rows to the left;
  sort all columns downward;
end;
(c) Shear

Figure 2.12 Procedures from [MARB88]

This simplifies the algorithm when it is implemented on a mesh. However, it does not reduce the number of bus broadcasts needed on reconfigurable meshes with buses.

Step 1: balance each vertical slice;
Step 2: unblock the mesh;
Step 3: balance each horizontal slice as if it were an N x N^(1/2) mesh lying on its side;
Step 4: unblock the mesh;
Step 5: shear three times;
Step 6: sort rows to the right;

Figure 2.13 Rotate Sort [MARB88]

As a result, we do not consider this variant of rotate sort.

Rotate Sort On An RMESH

In this section, we adapt the rotate sort algorithm of Figure 2.13 to sort n numbers on an n x n RMESH. Note that the algorithm of Figure 2.13 sorts MN numbers on an M x N mesh. For the adaptation, we use M = N. Hence, n = N^2 = 2^(4t). The n = N^2 numbers to be sorted are available, initially, in the Z variables of the row 1 PEs of the n x n RMESH. We assume a row major mapping from the N x N mesh to row 1 of the RMESH. Figure 2.14 gives the steps involved in sorting the columns of the N x N mesh downward. Note that this is the first step of the balance operation of step 1 of rotate sort. The basic strategy employed in Figure 2.14 is to use each n x N sub-RMESH of the n x n RMESH to sort one column of the N x N mesh.
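Before turning to the bus implementation, the mesh-level algorithm of Figures 2.12 and 2.13 can be simulated sequentially (our own sketch; row and column indices are 0-based, and our reading of "lying on its side" swaps the roles of rows and columns in the horizontal balance):

```python
def sort_col(G, rows, c):
    vals = sorted(G[r][c] for r in rows)
    for r, v in zip(rows, vals):
        G[r][c] = v

def sort_row(G, r, cols, right=True):
    vals = sorted((G[r][c] for c in cols), reverse=not right)
    for c, v in zip(cols, vals):
        G[r][c] = v

def rot_row(G, r, cols, k):                 # rotate a (sub)row k positions right
    vals = [G[r][c] for c in cols]
    k %= len(vals)
    for c, v in zip(cols, vals[-k:] + vals[:-k]):
        G[r][c] = v

def rot_col(G, rows, c, k):                 # rotate a (sub)column k positions down
    vals = [G[r][c] for r in rows]
    k %= len(vals)
    for r, v in zip(rows, vals[-k:] + vals[:-k]):
        G[r][c] = v

def balance(G, rows, cols):
    # Figure 2.12(a): sort columns, rotate row i by i mod w, sort columns
    w = list(cols)
    for c in w: sort_col(G, rows, c)
    for i, r in enumerate(rows): rot_row(G, r, w, i % len(w))
    for c in w: sort_col(G, rows, c)

def balance_sideways(G, rows, cols):
    # a horizontal slice "lying on its side": rows and columns trade roles
    h = list(rows)
    for r in h: sort_row(G, r, cols)
    for j, c in enumerate(cols): rot_col(G, h, c, j % len(h))
    for r in h: sort_row(G, r, cols)

def unblock(G, sq):                          # Figure 2.12(b)
    n = len(G)
    for i in range(n): rot_row(G, i, range(n), (i * sq) % n)
    for c in range(n): sort_col(G, range(n), c)

def shear(G):                                # Figure 2.12(c)
    n = len(G)
    for i in range(n): sort_row(G, i, range(n), right=(i % 2 == 0))
    for c in range(n): sort_col(G, range(n), c)

def rotatesort(G):                           # Figure 2.13, for an n x n mesh
    n = len(G)
    sq = int(round(n ** 0.5))
    for s in range(0, n, sq):                     # step 1: balance vertical slices
        balance(G, range(n), range(s, s + sq))
    unblock(G, sq)                                # step 2
    for s in range(0, n, sq):                     # step 3: balance horizontal slices
        balance_sideways(G, range(s, s + sq), range(n))
    unblock(G, sq)                                # step 4
    for _ in range(3): shear(G)                   # step 5
    for i in range(n): sort_row(G, i, range(n))   # step 6: final row sort

```

The simulation counts 13 row/column sorting passes in all (two per balance, one per unblock, two per shear, one final), which is the sort count used in the broadcast analysis below.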
{Sort the columns of the N x N mesh}
Step 1: Use column buses to broadcast Z[1,j] to all rows in column j, 1 <= j <= n. Now, Z[i,j] = Z[1,j], 1 <= i <= n, 1 <= j <= n.
Step 2: Use row buses to broadcast Z[i,i] to the R variable of all PEs on row i of the RMESH. Now, R[i,j] = Z[i,i] = Z[1,i], 1 <= i <= n, 1 <= j <= n.
Step 3: In the q'th n x N sub-RMESH, all PEs [i,j] such that i mod N = q mod N and (j-1) mod N + 1 = ⌈i/N⌉ broadcast their R values along the column buses, 1 <= q <= N. Note that each such PE[i,j] contains the ⌈i/N⌉'th element of column q of the N x N mesh. This value is broadcast to the U variables of the PEs in the column. Now, each row of the q'th n x N sub-RMESH contains in its U variables column q of the N x N mesh.
Step 4: Now assume that the n x n RMESH is tiled by N^2 sub-RMESHs of size N x N. In the [a,b]'th such sub-RMESH, the PEs on column a of the sub-RMESH broadcast their U values using row buses local to the sub-RMESH. This is stored in the V variable of each PE.
Step 5: PE[i,j] of the [a,b]'th sub-RMESH sets its S value to 1 if (U < V) or (U = V and j < a), and to 0 otherwise.
Step 6: The sum of the S's in any one row of each of the N x N sub-RMESHs is computed. The result for the [a,b]'th sub-RMESH is stored in the T variable of PE[b,1].
Step 7: Using the row buses that span the n x n RMESH, the PE in position [b,1] of each N x N sub-RMESH [a,b] sends its V value to the Z variable of the PE in column (b-1)N + T + 1.
Step 8: The received Z values are broadcast along column buses.

Figure 2.14 Sorting the columns of an N x N mesh

For this, we need to first extract the columns from row 1 of the RMESH. This is done in steps 1-3 of Figure 2.14. Following this, each row of the q'th n x N sub-RMESH contains the q'th column of the N x N mesh. Steps 4-6 implement the count phase of a count sort. This implementation is equivalent to that used in [JENQ91b] to sort m elements on an m x m x m RMESH.
Steps 7 and 8 route the data back to row 1 of the RMESH so that the Z values in row 1 (and, in fact, in all rows) correspond to the row major order of the N x N mesh following a sort of its columns. The total number of broadcasts used is 12 (note that step 6 uses 6 broadcasts). The row rotation required by procedure balance can be obtained at no additional cost by changing the destination column computed in step 7 of Figure 2.14 so as to account for the rotation. The second sort of the columns performed in procedure balance can be done with 9 additional broadcasts. For this, during the first column sort of procedure balance, the step 7 broadcast of Figure 2.14 takes into account both the row rotation and the row major to column major transformation done in steps 1-3 of Figure 2.14 for the second column sort of procedure balance. So, step 1 of rotate sort (i.e., balancing the vertical slices) can be done using a total of 21 broadcasts. To unblock the data, we need to rotate each row and then sort the columns downward. Once again, the rotation can be accomplished at no additional cost by modifying the destination column function used in step 7 of the second column sort performed during the vertical slice balancing of step 1. The column sort needs 9 broadcasts. The horizontal slices can be balanced using 18 broadcasts as the Z data are already distributed over the columns. The unblock of step 4 takes as many broadcasts as the unblock of step 2 (i.e., 9). The shear operation requires a row sort followed by a column sort. Row sorts are performed using the same strategy as used for a column sort. The fact that all elements of a row are in adjacent columns of the RMESH permits us to eliminate steps 1-3 of Figure 2.14. So, a row sort takes only 9 broadcasts. The following column sort uses 9 broadcasts. So, each application of shear takes 18 broadcasts. Since we need to shear three times, step 5 of rotate sort uses 54 broadcasts.
Step 6 of rotate sort is a row sort. This takes 9 broadcasts. The total number of broadcasts is 120. This is 19 fewer than the number of broadcasts used by our RMESH implementation of column sort.

Rotate Sort On A PARBUS

Our implementation of rotate sort on a PARBUS is the same as that on an RMESH. Note, however, that on a PARBUS the ranking (step 6 of Figure 2.14) takes only 3 broadcasts. Since this is done once for each row/column sort and since a total of 13 such sorts is done, 39 fewer broadcasts are needed on a PARBUS. Hence our PARBUS implementation of rotate sort takes 81 broadcasts. Recall that Leighton's column sort could be implemented on a PARBUS using only 59 broadcasts.

A Combined Sort

We may combine the first three steps of the iterative shear sort algorithm of Scherson et al. [SCHER89] with the last four steps of rotate sort to obtain the combined sort of Figure 2.15. This is stated for an N x N mesh using nearest neighbor connections. The number of elements to be sorted is N^2.

Step 1: sort each N^(3/4) x N^(3/4) block;
Step 2: shift the i'th row by (i·N^(3/4)) mod N to the right, 1 <= i <= N;
Step 3: sort the columns downward;
Step 4: balance each horizontal slice as if it were an N x N^(1/2) mesh lying on its side;
Step 5: unblock the mesh;
Step 6: shear two times;
Step 7: sort rows to the right;

Figure 2.15 Combined Sort

Notice that steps 4-7 of Figure 2.15 differ from steps 3-6 of Figure 2.13 only in that in step 6 of Figure 2.15 the shear is done two times. The correctness of Figure 2.15 may be established using the results of [SCHER89] and [MARB88]. To implement the combined sort on an n x n RMESH or n x n PARBUS (n = N^2 elements to be sorted), we note that the column sort of step 3 can be done in the same manner as the column sorts of rotate sort are done. The shift of step 2 can be combined with the sort of step 1. The block sort of step 1 is done using submeshes of size n x n^(3/4) = N^2 x N^(3/2) (each block is N^(3/4) x N^(3/4)).
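The shear primitive is also the engine of Scherson et al.'s iterative shear sort, which is easy to check sequentially (an illustrative sketch with 0-based rows; ⌈log2 N⌉ + 1 phases is the known bound for an N x N mesh):

```python
import math

def shear_phase(G):
    # one shear: snake-order row sorts, then column sorts downward
    n = len(G)
    for i in range(n):
        G[i].sort(reverse=(i % 2 == 1))     # even rows right, odd rows left
    for j in range(n):
        col = sorted(G[i][j] for i in range(n))
        for i in range(n):
            G[i][j] = col[i]

def snake_sorted(G):
    flat = []
    for i, row in enumerate(G):
        flat.extend(row if i % 2 == 0 else reversed(row))
    return flat == sorted(flat)

def shearsort(G):
    # iterate shear to a full sort; never needs more than ceil(log2 n) + 1 phases
    n, phases = len(G), 0
    while not snake_sorted(G):
        shear_phase(G)
        phases += 1
        assert phases <= math.ceil(math.log2(n)) + 1
    return phases

```

Iterating shear to completion needs a logarithmic number of phases, which is why the combined sort keeps only the first three steps of [SCHER89] and lets the rotate sort steps finish the job in a constant number of passes.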
On a PARBUS, this is done by ranking on an n^(1/4) x n^(3/4) sub-PARBUS as in [LIN92], while on an RMESH, this is done using the algorithm to sort a column of Q on an n x r1 submesh (Section 3). We omit the details here. The number of broadcasts is 77 for a PARBUS and 118 for an RMESH.

Comparison With Other O(1) PARBUS Sorts

As noted earlier, our PARBUS implementation of Leighton's column sort uses only 59 broadcasts, whereas our PARBUS implementation of rotate sort uses 81 broadcasts and our implementation of combined sort uses 77 broadcasts. The O(1) PARBUS sorting algorithm of Jang and Prasanna [JANG92a] is also based on column sort. However, it is far more complex than our adaptation and uses more broadcasts than does the O(1) PARBUS algorithm of Lin et al. [LIN92]. So, we compare our algorithm to that of [LIN92]. This latter algorithm is not based on column sort. Rather, it is based on a multiple selection algorithm that the authors develop. This multiple selection algorithm is itself a simple modification of a selection algorithm for the PARBUS. The selection algorithm selects the k'th largest element of an unordered set S of n elements. The multiple selection algorithm takes as input an increasing sequence q1 < q2 < ... < q_(n^(1/3)) with 1 <= qi <= n and reports, for each i, the qi'th largest element of S. By selecting qi = i·n^(2/3), 1 <= i <= n^(1/3), one is able to determine partitioning elements such that the set of n numbers to be sorted can be partitioned into n^(1/3) buckets each having n^(2/3) elements. Each bucket is then sorted using an n x n^(2/3) sub-PARBUS. Lin et al. [LIN92] were only concerned with developing a constant time algorithm to sort n numbers on an n x n PARBUS. Consequently, they did not attempt to minimize the number of broadcasts needed for a sort. However, we analyzed versions of their algorithms that we optimized. The optimized selection algorithm requires 84 broadcasts and the optimized sort algorithm uses 103 broadcasts.
Thus our PARBUS implementation of Leighton's column sort uses slightly more than half the number of broadcasts used by an optimized version of Lin et al.'s algorithm. Furthermore, even if one were interested only in the selection problem, it would be faster to sort using our PARBUS implementation of Leighton's column sort algorithm and then select the k'th element than to use the optimized version of the PARBUS selection algorithm of [LIN92]. Our algorithm is also conceptually simpler than those of [JANG92a] and [LIN92]. Like the algorithms of [JANG92a] and [LIN92], our PARBUS algorithms may be run directly on an n x n MRN. The number of broadcasts remains unchanged.

Sorting On Higher Dimension RMBs

Rotate sort works by sorting and/or shifting rows and columns of an N x N array. The algorithm may be implemented on a three-dimensional N x N x N reconfigurable mesh with buses so as to sort N^2 elements in O(1) time. In other words, n = N^2 elements are sorted in O(1) time on an n^(1/2) x n^(1/2) x n^(1/2) RMB. Assume that we start with the n elements on the base of an N x N x N RMB. To sort row (column) i, we broadcast the row (column) to the i'th layer of N x N processors and sort it in O(1) time in this layer using the algorithms developed in this chapter to sort N elements on an N x N RMB. The total number of such sorts required by rotate sort is 13 (i.e., O(1)) and the required shifts may be combined with the sorts. We can extend this to obtain an algorithm to sort N^k numbers on a (k+1)-dimensional RMB with N^(k+1) processors in O(1) time. The RMB has an N x N x ... x N configuration and the N^k numbers are initially in the face whose (k+1)'st coordinate is zero. In the preceding paragraph we showed how to do this sort when k = 2. Suppose we have an algorithm to sort N^(l-1) numbers on an l-dimensional N^l processor RMB in O(1) time. We can use this to sort N^l numbers on an N^(l+1) processor RMB in O(1) time by regarding the N^l numbers as forming an N^(l-1) x N array.
In the terminology of Marberg and Gafni [MARB88], we have an M x N array with M = N^{l-1} >= N. To use rotate sort, we need to be able to sort the columns of this array (which are of size M); sort the rows (which are of size N); and perform shifts/rotations on the rows or subrows. To do the column sort, we use l-dimensional RMBs. The m'th such RMB consists of all processors with index of the form [i_1, i_2, ..., i_{l-1}, m, i_{l+1}]. By assumption, this RMB can sort its N^{l-1} numbers in O(1) time. To sort the rows, we use two dimensional RMBs. Each such RMB consists of processors that differ only in their last two coordinates (i.e., [a, b, c, ..., i_l, i_{l+1}]). This sort is done using the O(1) sorting algorithm for two dimensional RMBs. The row shifts and rotates are easily done in O(1) time using the two dimensional RMBs just described (actually, most of these can be combined with one of the required sorts).

Conclusions

We have developed relatively simple algorithms to sort n numbers on reconfigurable n x n meshes with buses. For the case of the RMESH, our algorithms are the first to sort in O(1) time. For the PARBUS, our algorithms are simpler than those of [JANG92a] and [LIN92]. Our PARBUS column sort algorithm is the fastest of our algorithms for the PARBUS. It uses fewer broadcasts than does the optimized version of the selection algorithm of [LIN92]. Our PARBUS algorithms can be run on an MRN with no modifications. Since n x n reconfigurable meshes require n^2 area for their layout, our algorithms (as well as those of [JANG92a] and [LIN92]) have an area-time-squared product AT^2 of n^2, which is the best one can hope for in view of the lower bound result AT^2 >= n^2 for the VLSI word model [LEIG85]. Using two dimensional meshes with buses, we are able to sort n elements in O(1) time using n^2 processors. Using higher dimensional RMBs, one can sort n numbers in O(1) time using fewer processors.
In general, n = N^k numbers can be sorted in O(1) time using N^{k+1} = n^{1+1/k} processors in a (k+1)-dimensional configuration. While the same result has been shown for a PARBUS [JANG92b], our algorithm applies to an RMESH also.

CHAPTER 3
SORTING n^2 NUMBERS ON n x n MESHES

Introduction

In this chapter, we are concerned with sorting n^2 data elements using an n x n mesh connected parallel computer. The initial and final configurations have one data element in each of the n x n processors (say, in the A variable of each processor). In the final configuration the data elements are sorted in snakelike row-major order. This problem has been extensively studied for mesh architectures (see, e.g., [THOMK77], [NASS79], [KUMA83], [LEIG85], [SCHN86], [SCHER89], [MARB88]). While all of these studies consider SIMD meshes, they differ in the permissible communication patterns. Thompson and Kung [THOMK77] consider a strict unidirectional model in which all processors that simultaneously transfer data to a neighbor processor do so to the same neighbor. That is, all active processors transfer data to their north neighbor, or all to their south neighbor, etc. Using this model, Thompson and Kung [THOMK77] have developed a sorting algorithm with complexity 6n t_r + n t_c + low order terms, where t_r is the time needed to transfer one data element from a processor to its neighbor and t_c is the time needed to compare two data elements that are in the same processor. In the sequel, we omit the low order terms. In the bidirectional model, we assume there are two links between every pair (P,Q) of neighbors. As a result, P can send a data element to Q at the same time that Q sends one to P. Using this model, Schnorr and Shamir [SCHN86] have developed a sorting algorithm with complexity 3n t_r + 2n t_c.
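Snakelike row-major order, the target order used throughout this chapter, can be made concrete with a small helper (ours, for illustration): even-indexed rows are read left to right, odd-indexed rows right to left.

```python
def snake_order(mesh):
    """Read an n x n mesh in snakelike row-major order: even-indexed
    rows left-to-right, odd-indexed rows right-to-left."""
    out = []
    for i, row in enumerate(mesh):
        out.extend(row if i % 2 == 0 else reversed(row))
    return out

mesh = [[1, 2, 3],
        [6, 5, 4],
        [7, 8, 9]]
# this mesh is already snake-sorted: traversal yields 1..9 in order
assert snake_order(mesh) == list(range(1, 10))
```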
In the same paper, Schnorr and Shamir "prove" that 3n routing steps are needed by every sorting algorithm for an even stronger mesh model in which each processor can read the entire contents of the memories of its (up to) four neighbors in time t_r and the internal processor computations are free. For this stronger mesh model, they argue that every sorting algorithm must take 3n t_r time by first showing that input data changes in the lower right part of the mesh cannot affect the values in the top left corner processor until time equal to the distance, d, between the top left corner processor and the nearest of these lower right processors. Next, they argue that by changing the lower right input data, they can change the final position of the data in the top left corner processor by a distance of n-1. As a result, the sort must take at least (d + n - 2) t_r time. The fallacy is that in the first d - 1 steps, we may have made several copies of the input data, and it may no longer be necessary to route the data from the top left corner processor to the final destination processor. In fact, we show that one can sort in 2n t_r time using the stronger mesh model. Park and Balasubramanium [PARK87,90] have considered a related sorting problem for the strict unidirectional model. In this problem, the n^2 elements to be sorted are initially stored two to a processor on an n x n/2 mesh. The final sorted configuration also has two elements per processor. The time complexity of their algorithm is 4n t_r + n t_c. This represents an improvement of 2n t_r over Thompson and Kung's algorithm, provided the two element per processor input/output configuration is what we desire.
If we desire the one element per processor configuration (for example, the data to be sorted is the result of a computation that produces this configuration and the sorted data is to be used for further computation), then it is necessary to first fold the data to get the two element per processor configuration, then sort using the algorithm of [PARK87,90], and finally unfold to get the desired one element per processor final configuration. The folding can be done in (n/2) t_r time as below (see also Figure 3.1(a)).

F1: The left n/4 columns shift their data n/4 columns to the right.
F2: The right n/4 columns shift their data n/4 columns to the left.

The unfolding can also be done in (n/2) t_r time using the two steps:

U1: Unfold the n/4 columns labeled A in Figure 3.1(b).
U2: Unfold the n/4 columns labeled B in Figure 3.1(b).

To unfold A, we use the pipelined process described by the example of Figure 3.2. B is unfolded in a similar way. The total time for the sort is therefore 5n t_r + n t_c, which is n t_r less than that of Thompson and Kung [THOMK77]. The improvement is slightly more if we consider the (nonstrict) unidirectional model in which there is a single link between each pair of neighbor processors and data can be transferred, in parallel, along all links (however, if P and Q are neighbors, when P sends data to Q, Q cannot send to P).

Figure 3.1 Folding and unfolding

In this model, steps F1 and F2 of folding can be done in parallel. Similarly, we can do U1 and U2 in parallel. The total sort time now becomes 4.5n t_r + n t_c. In section 2, we show that the Schnorr/Shamir algorithm of [SCHN86] can be modified to sort on unidirectional meshes using the same number of routes as above. The number of comparison steps is, however, larger. The modified Schnorr/Shamir algorithm of section 2 runs in 2.5n t_r + 3n t_c time on an n x n bidirectional mesh.
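Steps F1 and F2 can be sketched for a single row. The function below is our illustration, with columns indexed from 0 rather than the dissertation's column numbering; each result position holds the pair of elements now stored at the corresponding middle processor.

```python
def fold(row):
    """Fold one row of n values into the middle n/2 positions, two per
    processor, per steps F1/F2 (n divisible by 4)."""
    n = len(row)
    q = n // 4
    middle = [[x] for x in row[q:3 * q]]          # middle n/2 columns keep their value
    for j in range(q):
        middle[j].append(row[j])                  # F1: left quarter shifts right by n/4
        middle[q + j].append(row[3 * q + j])      # F2: right quarter shifts left by n/4
    return middle

folded = fold(list(range(8)))                     # n = 8, quarters of width 2
assert folded == [[2, 0], [3, 1], [4, 6], [5, 7]]
```

Each of the q = n/4 shifts moves data one column per route, so the fold costs (n/4) t_r per step, i.e., (n/2) t_r sequentially and (n/4) t_r when F1 and F2 run in parallel, matching the text.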
The number of routes is therefore less than the 3n lower bound established in [SCHN86]. The algorithm of section 2 folds data onto an n x n/2 submesh. By folding onto smaller submeshes, i.e., onto n x n/k submeshes for k > 2, the number of routes can be reduced further. In fact, for bidirectional and strict unidirectional meshes we can come very close to the distance lower bounds of 2n - 2 and 4n - 2, respectively. This is done in section 3. In section 4, we discuss the practical implications of the folding-unfolding approach to sorting. In section 5, we show how the sort algorithms of sections 2 and 3 may be simulated on n x n reconfigurable bus architectures.

Figure 3.2 Unfolding one row of A

Modified Schnorr&Shamir Algorithm

The sorting algorithm of Schnorr and Shamir [SCHN86] is given in Figure 3.3. This algorithm uses the following terms and assumes that n = 2^{4q} for some integer q.

1. Block ... an n^{3/4} x n^{3/4} submesh formed by a natural tiling of an n x n mesh with n^{3/4} x n^{3/4} tiles (Figure 3.4(a)).
2. Vertical slice ... an n x n^{3/4} submesh formed by a natural tiling of an n x n mesh with n x n^{3/4} tiles (Figure 3.4(b)).
3. Snake ... the 1 x n^2 vector along the snakelike order in the n x n mesh (Figure 3.4(c)).
4. Even-odd transposition sort [KNUT73] ... n elements are sorted in n steps. In the odd steps, each element in an odd position is compared with the element in the next even position and exchanged if larger. In the even steps, each element in an even position is compared with the element in the next odd position and exchanged if greater.
5. n^{1/4}-way unshuffle ... Let k = (1/4) log_2 n = q. Data in column j of the mesh is moved to column j'. Let j_{4q-1} ... j_0 be the binary representation of j. Then j_{k-1} ... j_0 j_{4q-1} ... j_k is the binary representation of j'. The unshuffle distributes the n^{3/4} columns of each vertical slice to the n^{1/4} vertical slices in a round robin manner.
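Even-odd transposition sort (definition 4 above) admits a direct sequential rendering. The version below is ours, with 0-indexed positions, so the odd/even pairing alternates by phase parity; each exchange involves only neighboring positions, which is what makes the method suitable for a row of mesh processors.

```python
def odd_even_transposition_sort(a):
    """Sort n elements in n compare-exchange phases [KNUT73].  Each
    phase compares disjoint neighbor pairs (i, i+1), alternating
    between pairs starting at odd and at even positions."""
    a = list(a)
    n = len(a)
    for p in range(n):
        start = 1 if p % 2 == 0 else 0   # alternate odd/even pairs
        for i in range(start, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

assert odd_even_transposition_sort([5, 1, 4, 2, 8, 0, 3]) == [0, 1, 2, 3, 4, 5, 8]
```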
Step 1: Sort all the blocks into snakelike row-major order.
Step 2: Perform an n^{1/4}-way unshuffle along all the rows of the array.
Step 3: Sort all the blocks into snakelike row-major order.
Step 4: Sort all the columns of the array downwards.
Step 5: Sort all the vertical slices into snakelike row-major order.
Step 6: Sort all the rows of the array into alternating left-to-right and right-to-left order.
Step 7: Perform 2n^{3/4} steps of even-odd transposition sort along the snake.

Figure 3.3 Sorting algorithm of Schnorr and Shamir

The correctness of the algorithm is established in [SCHN86]. As pointed out in [SCHN86], steps 1, 3, 5, 7 take O(n^{3/4}) time and are dominated by steps 2, 4, 6.

Figure 3.4 Definitions of blocks and slices

Steps 4 and 6 are done using n steps of even-odd transposition sort each, while step 2 is done by simulating even-odd transposition sort on the fixed input permutation pi^{-1}, where pi is the desired unshuffle permutation. Steps 4 and 6 take n t_r + n t_c time each on a bidirectional mesh and 2n t_r + n t_c on a unidirectional (whether strict or not) one. Step 2 takes n t_r time on a bidirectional mesh and 2n t_r on a unidirectional one. The total time for the algorithm of Schnorr and Shamir is therefore 3n t_r + 2n t_c on a bidirectional mesh and 6n t_r + 2n t_c on a unidirectional (strict or nonstrict) mesh. On the stronger mesh model the time is 3n t_r, as t_c is zero. We may modify the algorithm of Schnorr and Shamir so that it works on an n x n/2 mesh with two elements per processor. In this modification, we essentially simulate the algorithm of Figure 3.3 using half as many processors. The run time for steps 1, 3, 5, and 7 increases but is still O(n^{3/4}). Step 6 takes (n/2) t_r + n t_c on a bidirectional mesh, as data routing is needed for only half of the steps of even-odd transposition sort, and n t_r + n t_c on a unidirectional mesh.
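The n^{1/4}-way unshuffle of step 2 is simply a rotation of the column index's bits: the low q bits become the high bits. The helper below is our illustration for n = 2^{4q}.

```python
def unshuffle_column(j, q):
    """Destination column j' for the n^(1/4)-way unshuffle, n = 2^(4q):
    rotate the 4q-bit representation of j so its low k = q bits become
    the high bits.  Columns of each vertical slice are thereby dealt to
    the n^(1/4) vertical slices in round-robin fashion."""
    bits, k = 4 * q, q
    return ((j & ((1 << k) - 1)) << (bits - k)) | (j >> k)

q = 1                                   # n = 16 columns, n^(1/4) = 2 slices
dest = [unshuffle_column(j, q) for j in range(16)]
# even columns land in slice 0 (columns 0-7), odd columns in slice 1
assert dest == [0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15]
assert sorted(dest) == list(range(16))  # it is a permutation
```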
Step 4 takes 2n t_r + 2n t_c on a bidirectional mesh and 4n t_r + 2n t_c on a unidirectional mesh, as in each of the n steps of even-odd transposition sort, both the elements in a processor need to be routed to the neighbor processor. This can be reduced to n t_r + 2n t_c and 2n t_r + 2n t_c, respectively, by regarding the 2n elements in a processor column as a single column and sorting this into row major order using 2n steps of even-odd transposition sort (Figure 3.5). Now, a single element from each processor is to be routed on each of n steps and no data routing is needed on the remaining n steps.

Figure 3.5 Modified step 4: (a) processor column, two elements per processor; (b) sort each column downward; (c) sort as one column

To establish the correctness of the sorting algorithm with step 4 changed to sort pairs of columns as though they were a single column, we use the zero-one principle [KNUT73]. This was also used by Schnorr and Shamir to establish the correctness of their unmodified algorithm. Here, we assume that the input data consists solely of zeroes and ones. Since we have not changed steps 1-3 of their algorithm, the proofs of parts 1, 2, and 3 of their Theorem 3 are still valid. We shall show that part 4 remains true following the execution of the modified step 4. Since the proofs of parts 5, 6, and 7 rely only on part 4 of the theorem, these are valid for the modified algorithm. Hence, the modified algorithm is correct. Define a horizontal slice to be an n^{3/4} x n submesh obtained by a natural tiling of an n x n mesh using tiles of size n^{3/4} x n (Figure 3.4(d)). Following step 3 of the sorting algorithm, the maximum difference d3 between the number of zeroes in any two columns of a horizontal slice is at most 2n^{1/4} [SCHN86].
We need to show (i) that following the modified step 4, the maximum difference d4' between the number of zeroes in any two columns (a column is an n x 1 vector with n elements) of the mesh is at most 2n^{1/4}, and (ii) that the maximum difference between the number of zeroes in any two vertical slices is at most n^{1/2}. The proof of (ii) is the same as that in [SCHN86], as our modification of step 4 does not move data between vertical slices. For (i), consider any two columns of n elements each. If these two columns reside in the same column of processors, then the maximum difference between the number of zeroes in the two columns is 1, as the two columns have been sorted into row major order regarding them as a single 2n element column. Suppose the two element columns reside in two different processor columns. These two processor columns contain four element columns A, B, C, D. Let a, b, c, d, respectively, be the number of zeroes in these four columns prior to the execution of the modified step 4. From the proof of Theorem 3, part 4 [SCHN86], we know that |x - y| <= 2n^{1/4}, where x, y in {a, b, c, d}. Let a', b', c', d' be the number of zeroes in columns A, B, C, D following the modified step 4. It is easy to see that a' = ceil((a+b)/2), b' = floor((a+b)/2), c' = ceil((c+d)/2), d' = floor((c+d)/2). We need to show that |x - y| <= 2n^{1/4}, where x, y in {a', b', c', d'}. Without loss of generality (wlog), we may assume that |b' - c'| = max {|x - y| : x, y in {a', b', c', d'}} (note that when x, y in {a', b'} or x, y in {c', d'}, |x - y| <= 1) and that b' > c'. Then b' - c' <= ceil((a+b)/2) - floor((c+d)/2). Again, wlog we may assume that c <= d, so that floor((c+d)/2) >= c and b' - c' <= ceil((a+b)/2) - c. Since |b - c| <= 2n^{1/4} and |a - c| <= 2n^{1/4}, a + b <= 2c + 4n^{1/4}, and hence ceil((a+b)/2) <= c + 2n^{1/4}. So, b' - c' <= c + 2n^{1/4} - c = 2n^{1/4}. Our modified Schnorr/Shamir algorithm for n x n meshes is stated in Figure 3.6. It combines the folding and unfolding steps discussed in the introduction and the modified Schnorr/Shamir algorithm for n x n/2 meshes described above.

Theorem 1: The sorting algorithm of Figure 3.6 is correct.

Proof.
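The floor/ceil argument above can be checked exhaustively for small counts. The script below is our verification aid, not part of the proof; D stands in for the bound 2n^{1/4}.

```python
import itertools
import math

# Exhaustively check the step-4 claim: if the zero counts a, b, c, d of
# four element columns pairwise differ by at most D, then after pairing
# (a,b) and (c,d) into ceil/floor halves, the new counts still differ
# pairwise by at most D.
D, limit = 4, 12
for a, b, c, d in itertools.product(range(limit), repeat=4):
    if max(abs(x - y) for x, y in itertools.combinations((a, b, c, d), 2)) <= D:
        na = math.ceil((a + b) / 2); nb = (a + b) // 2
        nc = math.ceil((c + d) / 2); nd = (c + d) // 2
        assert max(abs(x - y) for x, y in
                   itertools.combinations((na, nb, nc, nd), 2)) <= D
print("bound verified for all counts below", limit)
```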
Follows directly from our earlier discussion that established the correctness of steps 1 through 7 as a sorting algorithm for an n x n/2 mesh with two elements per processor. □

For the complexity analysis, we note that steps 1, 3, 5, and 7 take O(n^{3/4}) time and are dominated by the remaining steps, which take O(n) time each. Since we are ignoring low order terms, we need be concerned only with steps 0, 2, 4, 6, and 8.

Step 0: Fold the leftmost and rightmost n/4 columns so that the n^2 elements to be sorted are in the middle n x n/2 submesh, with each processor in this submesh having two elements.
Step 1: Sort all n^{3/4} x n^{3/4} blocks of data into snakelike row-major order. Note that each block of data is in an n^{3/4} x (1/2)n^{3/4} submesh.
Step 2: Perform an n^{1/4}-way unshuffle along each row of n elements. Note that each row of n elements is in a row of the middle n x n/2 submesh.
Step 3: Repeat step 1.
Step 4: Sort the 2n elements in each column of the middle n x n/2 submesh into row major order.
Step 5: Sort the elements in each n x (1/2)n^{3/4} submesh into snakelike row-major order.
Step 6: Sort all rows of the middle n x n/2 submesh into alternating left-to-right and right-to-left order.
Step 7: Perform 2n^{3/4} steps of even-odd transposition sort along the snake of the middle n x n/2 submesh.
Step 8: Unfold the middle n x n/2 submesh.

Figure 3.6 Sorting algorithm for n x n mesh

As noted earlier, steps 2, 4, and 6, respectively, take (n/2) t_r, n t_r + 2n t_c, and (n/2) t_r + n t_c on a bidirectional mesh. Steps 0 and 8 each take (n/4) t_r on a bidirectional mesh. So, the complexity of the sort algorithm on a bidirectional mesh is 2.5n t_r + 3n t_c. Since t_c is zero on the stronger mesh model, the sort time for this model is 2.5n t_r (note that the algorithm of [SCHN86] has a run time of 3n t_r). On a (nonstrict) unidirectional mesh, the times for steps 0, 2, 4, 6, and 8 are, respectively, (n/4) t_r, n t_r, 2n t_r + 2n t_c, n t_r + n t_c, and (n/4) t_r.
The total time for this model is therefore 4.5n t_r + 3n t_c. On a strict unidirectional mesh, the times for steps 0, 2, 4, 6, and 8 are, respectively, (n/2) t_r, n t_r, 2n t_r + 2n t_c, n t_r + n t_c, and (n/2) t_r, and the total time is 5n t_r + 3n t_c.

Further Enhancements

In the stronger mesh model of Schnorr and Shamir, a processor can read the entire memory of all its neighbors in unit time. This implies that the routing time is independent of message length. Let T_r denote the time needed to send a message of arbitrary length to a neighbor processor. We may generalize the sorting algorithm of Figure 3.6 to the case when each processor has k elements. Now, to sort a row or column of data, we use neighborhood sort [BAUD78], which is a generalization of even-odd transposition sort. Suppose that m processors have k elements each. The mk elements are sorted in m steps. In the even steps, the elements in each even processor are merged with those in the next odd processor. The even processor gets the smaller k and the odd one the larger k. In the odd steps, the k elements in each odd processor are merged with the k elements in the next even processor. The smaller k elements remain with the odd processor and the larger k with the even one. Note that when k = 1, neighborhood sort is identical to even-odd transposition sort. Our generalization of Figure 3.6 is given in Figure 3.7. We require that k be a power of 2. As in the case of the Schnorr-Shamir algorithm, we assume n = 2^{4q}.

Step 0: Fold the leftmost and the rightmost n(k-1)/(2k) columns into the middle n/k columns so that the n^2 elements to be sorted are in the middle n x n/k submesh, with each processor in this submesh having k elements. Sort the k elements in each processor of the middle n x n/k submesh.
Step 1: Sort all n^{3/4} x n^{3/4} blocks of data into snakelike row-major order. Note that each block of data is in an n^{3/4} x n^{3/4}/k submesh.
Step 2: Perform an n^{1/4}-way unshuffle along each row of n elements.
Note that each row of n elements is in a row of the middle n x n/k submesh.
Step 3: Repeat step 1.
Step 4: Sort the kn elements in each column of the middle n x n/k submesh into row major order. Use neighborhood sort.
Step 5: Sort the elements in each n x n^{3/4}/k submesh into snakelike row-major order.
Step 6: Sort all rows of the middle n x n/k submesh into alternating left-to-right and right-to-left order.
Step 7: Perform 2n^{3/4} steps of even-odd transposition sort along the snake of the middle n x n/k submesh.
Step 8: Unfold the middle n x n/k submesh.

Figure 3.7 Generalized sorting algorithm

Theorem 2: The generalized sorting algorithm is correct.

Proof. Similar to that of Theorem 1. □

For the complexity analysis, we may again ignore the odd steps, as these run in O(n^{3/4}) time for any fixed k. The folding and unfolding of steps 0 and 8 each take n(k-1)/(2k) T_r on bidirectional and nonstrict unidirectional meshes and n(k-1)/k T_r on strict unidirectional meshes. The sorting of step 0 takes O(k log k) time. Step 2 takes (n/k) T_r on bidirectional meshes and (2n/k) T_r on unidirectional ones. Step 4 requires n steps of neighborhood sort. Each merge of two sets of k elements takes at most 2k t_c time (actually, at most 2k-1 comparisons are needed). So, the step 4 time is n T_r + 2kn t_c for bidirectional meshes and 2n T_r + 2kn t_c for unidirectional meshes. The time for step 6 is (n/k) T_r + 2n t_c for bidirectional meshes and (2n/k) T_r + 2n t_c for unidirectional ones. The total sorting time (ignoring the time to sort k elements in step 0) for a bidirectional mesh is therefore (2n + n/k) T_r + 2n(k+1) t_c. For the model of Schnorr and Shamir [SCHN86], t_c = 0 and the time becomes (2n + n/k) T_r. For large values of k, this approximates 2n T_r. Since (2n - 2) T_r is the distance lower bound for sorting on Schnorr and Shamir's model, the generalized sorting algorithm of Figure 3.7 is near optimal for large k.
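Neighborhood sort, used in step 4 of Figure 3.7, admits a short sequential rendering. The sketch below is ours, with 0-indexed processors so the even/odd pairing alternates by step parity; with k = 1 it reduces to even-odd transposition sort.

```python
def neighborhood_sort(blocks):
    """Neighborhood sort [BAUD78]: m processors hold k elements each;
    in m merge-split steps, neighbor pairs merge their 2k elements,
    the left processor keeping the smaller k and the right processor
    the larger k."""
    m = len(blocks)
    blocks = [sorted(b) for b in blocks]   # each processor pre-sorts its k elements
    for step in range(m):
        start = step % 2                   # alternate the neighbor pairing
        for i in range(start, m - 1, 2):
            merged = sorted(blocks[i] + blocks[i + 1])
            k = len(blocks[i])
            blocks[i], blocks[i + 1] = merged[:k], merged[k:]
    return blocks

out = neighborhood_sort([[9, 7], [3, 1], [8, 2], [6, 4]])
assert out == [[1, 2], [3, 4], [6, 7], [8, 9]]
```

The inner merge is where the at most 2k-1 comparisons per step arise; the sequential `sorted` call here stands in for that merge.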
The sorting times on nonstrict and strict unidirectional meshes are (3n + 3n/k) T_r + 2n(k+1) t_c and (4n + 2n/k) T_r + 2n(k+1) t_c, respectively. Since (4n - 2) T_r is the distance lower bound for the strict unidirectional model, our algorithm is near optimal for large k for this model too. The s^2-way merge sorting algorithm of Thompson and Kung [THOMK77] may be similarly generalized to sort n^2 elements stored k to a PE in an n x n/k mesh configuration. The resulting sort has a complexity that is almost identical to that of Figure 3.7. However, Figure 3.7 is conceptually much simpler.

Practical Implications

Suppose we are to sort n^2 elements that are initially distributed one to a PE on an n x n bidirectional mesh. The final configuration is also to have one element per PE. On realistic computers, the time to transfer small packets of data between adjacent processors is dominated by the setup time. For instance, the time to transfer N bytes of data between adjacent processors of an NCube1 hypercube is (446.7 + 2.4N) ms. Therefore, it is reasonable to assume that the data transfer time is independent of packet size for small k and small element size. Furthermore, t_R >> t_c on most commercial computers. For instance, t_R = 40 t_c on the NCube1. Table 1 gives the value of our complexity functions for the different bidirectional sort algorithms and different k. When t_R = 40 t_c, we can expect to get the best performance using the algorithm of section 3 with k = 4. For t_R = 10 t_c, the algorithm of section 2 is the best of the three. The original Schnorr&Shamir algorithm will perform best only when t_R < 2 t_c.

Simulation on RMBs

The sorting algorithms described here may be simulated by parallel computers in the reconfigurable mesh with buses (RMB) family. We consider only two members of the RMB family: RMESH and PARBUS.
Table 1 Comparison of bidirectional sort algorithms (total time as a multiple of n t_c)

             Schnorr&Shamir    Section 2          Section 3: (2n + n/k) T_r + 2n(k+1) t_c
             3n t_r + 2n t_c   2.5n t_r + 3n t_c  k=3       k=4       k=5      k=6
t_R = 40 t_c     122n t_c          103n t_c       101.33n   100n      100n     100.67n
t_R = 10 t_c      32n t_c           28n t_c        31.3n    32.5n     34n      35.67n
t_R =  5 t_c      17n t_c           15.5n t_c      19.66n   21.25n    23n      24.83n
t_R =  2 t_c       8n t_c            8n t_c        12.67n   14.5n     16.4n    18.33n

In an RMESH [MILL88abc], we have a bus grid with an n x n arrangement of processors at the grid points (see Figure 3.8 for a 4 x 4 RMESH). Each grid segment has a switch on it which enables one to break the bus, if desired, at that point. When all switches are closed, all n^2 processors are connected by the grid bus. The switches around a processor can be set using local information. If all processors disconnect the switch on their north, then we obtain row buses (Figure 3.9). Column buses are obtained by having each processor disconnect the switch on its east (Figure 3.10). In the exclusive write model, two processors that are on the same bus cannot simultaneously write to that bus. In the concurrent write model, several processors may simultaneously write to the same bus. Rules are provided to determine which of the several writers actually succeeds (e.g., arbitrary, maximum, exclusive or, etc.). Notice that in the RMESH model it is not possible to simultaneously have n disjoint row buses and n disjoint column buses that, respectively, span the width and height of the RMESH. It is assumed that processors on the same bus can communicate in O(1) time. RMESH algorithms for fundamental data movement operations and image processing problems can be found in [MILL88abc, MILL91ab, JENQ91abc]. An n x n PARBUS (Figure 3.11) [WANG90] is an n x n mesh in which the interprocessor links are bus segments and each processor has the ability to connect together arbitrary subsets of the four bus segments that connect to it.
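The entries of Table 1 follow mechanically from the three complexity formulas. The snippet below is our recomputation (treating T_r and t_r interchangeably with t_R, as the table does), with Section 3 evaluated at k = 3, 4, 5, 6.

```python
# Total bidirectional sort time, in units of n*tc, as a function of the
# route/compare ratio r = tR/tc.
def schnorr_shamir(r):
    return 3 * r + 2                       # 3n tr + 2n tc

def section2(r):
    return 2.5 * r + 3                     # 2.5n tr + 3n tc

def section3(r, k):
    return (2 + 1 / k) * r + 2 * (k + 1)   # (2n + n/k) Tr + 2n(k+1) tc

for r in (40, 10, 5, 2):
    row = [schnorr_shamir(r), section2(r)] + [section3(r, k) for k in (3, 4, 5, 6)]
    print(f"tR = {r:2d} tc:", ["%.2f" % v for v in row])
```

For r = 40 this prints 122, 103, 101.33, 100, 100, 100.67, matching the first row of Table 1; the k = 4 and k = 5 columns tie for the minimum, consistent with the recommendation of k = 4.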
Bus segments that get so connected behave like a single bus. The bus segment interconnections at a processor are done by an internal four port switch. If the up to four bus segments at a processor are labeled N (North), E (East), W (West), and S (South), then this switch is able to realize any set, A = {A_1, A_2}, of connections, where A_i ⊆ {N, E, W, S}, 1 <= i <= 2, and the A_i's are disjoint. For example, A = {{N,S}, {E,W}} results in connecting the North and South segments together and the East and West segments together. If this is done in each processor, then we get, simultaneously, disjoint row and column buses (Figures 3.12 and 3.13). If A = {{N,S,E,W}}, then all four bus segments are connected. PARBUS algorithms for a variety of applications can be found in [MILL91a, WANG90ab, LIN92, JANG92]. Observe that in an RMESH the realizable connections are of the form A = {A_1}, A_1 ⊆ {N, E, W, S}. The RMESH requires two steps to simulate a unit route along a row or column of a unidirectional mesh. The PARBUS can simulate such a route in one step using the bus configurations of Figures 3.12 and 3.13. Additionally, the folding and unfolding into/from n/k columns can be done in (k-1) t_r time by having each group of k adjacent columns fold into the leftmost column in the group. Now, the n/k columns that contain the data are not adjacent. However, this does not result in any inefficiency when simulating steps 1-7, as row buses to connect the columns are easily established.

Figure 3.8 4 x 4 RMESH (squares: processors; circles: switches; lines: links)
Figure 3.9 Row buses
Figure 3.10 Column buses
Figure 3.11 4 x 4 PARBUS
Figure 3.12 Row buses in a PARBUS
Figure 3.13 Column buses in a PARBUS

With this modification to the algorithms of Figures 3.6 and 3.7, the sort time for an n x n RMESH becomes (8n + 2) t_r + 3n t_c and (4n + 8n/k + 2k - 2) T_r + 2n(k+1) t_c, and for an n x n PARBUS becomes (4n + 2) t_r + 3n t_c and (2n + 4n/k + 2k - 2) T_r + 2n(k+1) t_c.
Conclusion

We have shown that the lower bound for the number of routes needed to sort on an n x n bidirectional mesh that was established in [SCHN86] is incorrect. Furthermore, we have provided algorithms that sort using fewer routes than the lower bound of [SCHN86]. In fact, the algorithm of section 3 is able to sort using a number of routes that is very close to the distance lower bound for bidirectional as well as strict unidirectional meshes. The fold/sort/unfold algorithms developed here have practical implications for sorting on a mesh. The advantages of this technique when applied to computers with t_R >> 2 t_c were pointed out in section 4. We also showed how to simulate the mesh algorithms on reconfigurable meshes with buses.

CHAPTER 4
COMPUTATIONAL GEOMETRY ON N x N RECONFIGURABLE MESHES WITH BUSES

Introduction

In this chapter, we consider computational geometry problems on N x N reconfigurable meshes with buses and give O(1) time algorithms for determining the convex hull of a set of N planar points, the smallest enclosing rectangle of N planar points, the ECDF search problem for N planar points, and the triangulation of N planar points. Computational geometry problems have been widely studied on various parallel computer architectures such as meshes, PRAMs, hypercubes, cube-connected cycles, and pyramid computers [AGGA88], [CHOW80], [JEONL90, JEONC92], [MILLS84ab,88e,89], [LEE84], [MACK90ab], [STOJ86], [LUV85, LU86], [SHAM78], [WANGT87]. Most computational geometry problems have a lower bound equal to the diameter of the underlying interconnection network. RMB algorithms for computational geometry problems have been explored in [MILLP87b, JANG92e, REIS92]. In [MILLP87b], RMESH algorithms for several geometric problems on digitized images were developed. These include closest figure, extreme points of every figure, diameter, smallest enclosing box, and smallest enclosing circle. An O(1) time convex hull algorithm for a set of N planar points is given in [REIS92].
The algorithm uses an N x N configuration and is based on the grid convex hull algorithm of [MILL88e]. Unfortunately, the algorithm of [REIS92] is flawed. In section 2, we give a corrected constant time RMESH convex hull algorithm. This may also be run on an N x N PARBUS and MRN. Jang and Prasanna [JANG92e] develop constant time PARBUS algorithms for the all pairs nearest neighbor problem on a set of N planar points, 2-set dominance counting for N planar points, and the 3-dimensional maxima problem. All these algorithms are for the N x N PARBUS configuration. Their algorithm for the nearest neighbor problem is easily run on an N x N RMESH and MRN in constant time. Jang and Prasanna [JANG92e] state that their 2-set dominance counting algorithm can be simulated by an RMESH using the switch simulation of [JANG92d]. However, the simulation of [JANG92d] requires 16 RMESH processors for each PARBUS processor simulated. Hence, a 4N x 4N RMESH is needed for the simulation. In section 4, we consider the ECDF search problem, which is closely related to the 2-set dominance counting problem, and develop a constant time algorithm that requires only an N x N RMESH. The third geometry problem considered in [JANG92e] is the 3-dimensional maxima problem. In this, we are given a set S of three tuples (points in three dimensional space) and are required to find all points p ∈ S such that there is no q ∈ S with the property that each coordinate of q is greater than the corresponding coordinate of p (i.e., no q in S dominates p). The algorithm suggested in [JANG92e] is a modification of their 2-set dominance counting algorithm and is unnecessarily complicated. We observe that the problem is trivially solved in constant time using the algorithm of Figure 4.1. The input N points are in row 1 of the N x N PARBUS/RMESH/MRN. The algorithm of Figure 4.1 can be used to solve, in constant time, the k-dimensional maxima problem for any k.
Step 1: Broadcast the N points on row 1 to all rows using column buses.
Step 2: (Use the i'th row to determine if the i'th point of S is a non-dominated point.) Processor [i,i] broadcasts its point to all processors in its row using a row bus, 1 <= i <= N. Processor [i,j], 1 <= i,j <= N, compares the coordinates of the point it received in step 1 to those of the point it received in step 2. If each coordinate of the step 1 point is larger than the corresponding coordinate of its step 2 point, then PE[i,j] sets its T variable to 0. Otherwise, T is set to 1. Using row bus splitting, PE[i,1] determines if there is a 0 in row i, 1 <= i <= N. If so, it sets its U variable to 0 (the i'th point is dominated). Otherwise, U is set to 1 (the i'th point is not dominated).
Step 3: (Route the results back to row 1.) Using row buses, PE[i,1] sends its U value to PE[i,i]. Using column buses, PE[i,i] sends the U value just received to PE[1,i], 1 <= i <= N.

Figure 4.1 Simple constant time algorithm for 3-dimensional maxima

The remaining computational geometry problems we consider in this chapter are the smallest enclosing rectangle problem (section 3) and the problem of triangulating a set of N planar points (section 5). While these problems have not been considered before for the RMB architecture, parallel algorithms for other architectures have been developed. For example, Miller and Stout [MILL89] show how to find the smallest enclosing rectangle of a set of N planar points in O(N^{1/2}) time on an N x N mesh, and Wang and Tsin [WANGT87] develop an O(log N) time CREW PRAM algorithm to triangulate a set of N planar points. Miller and Stout [MILL89] also develop optimal mesh algorithms for the convex hull of N planar points and for several other computational geometry problems. Mesh algorithms for some other computational geometry problems are considered in [JEON90, LU85,86].
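The domination test of Figure 4.1 has a direct sequential analogue. The sketch below is ours, for illustration; as with the figure, it works for any dimension k.

```python
def maxima(points):
    """Sequential analogue of Figure 4.1: point p is reported unless
    some q dominates it, i.e., every coordinate of q strictly exceeds
    the corresponding coordinate of p."""
    def dominates(q, p):
        return all(qc > pc for qc, pc in zip(q, p))
    return [p for p in points if not any(dominates(q, p) for q in points)]

pts = [(1, 2, 3), (2, 3, 4), (3, 1, 2), (0, 0, 0)]
assert maxima(pts) == [(2, 3, 4), (3, 1, 2)]
```

The nested `any`/`all` mirrors the mesh computation: row i of the mesh evaluates `any(dominates(q, p_i))` in parallel, with the T variables playing the role of the inner `all` and row bus splitting playing the role of the outer `any`.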
Convex Hull

In this section, we consider the problem of determining the convex hull of a set of N planar points on N x N Reconfigurable Meshes with Buses (RMBs). While the RMESH algorithms we develop can be run without modification on PARBUSs and MRNs, improved performance results from the use of algorithms for subtasks such as ranking and sorting that are tailored to the PARBUS/MRN architecture.

Let S be a set of N distinct points in a plane. Let E(S) be the set of extreme points in S and let CH(S) denote an ordering of E(S) such that CH(S) defines a convex polygon. This convex polygon is the convex hull of S. The extreme points problem is that of finding E(S), while in the convex hull problem we are to find the ordered set CH(S). Our algorithms to find E(S) and CH(S) make use of the following known results:

Lemma 1: [Theorem 3.8, [PREP85], p. 104] The line segment l defined by two points is an edge of the convex hull iff all other points lie on l or to one side of it. □

Lemma 2: [[PREP85], p. 105] Let p1 and p2, respectively, be the lexicographically lowest and highest points of S. Let CH(S) be (p1, l1, l2, ..., lk, p2, u1, ..., uj). The ordering is a counter-clockwise ordering of the extreme points and is such that the interior of the defined convex polygon is on the left as one traverses the extreme points in this order. The points l1, ..., lk, p2 have the property that the polar angle made by each of these points, the positive x-axis, and the preceding point of CH(S) is least. The points u1, u2, ..., uj have the property that the polar angle made by each point, the negative x-axis, and the preceding point of CH(S) is least (see Figure 4.2). □

Figure 4.2 Convex hull of a set of points

N^3 Processor Algorithm

Before developing the O(1) N x N processor algorithm, we develop a simple O(1) N^2 x N processor algorithm. An O(1) N^2 x N processor algorithm for E(S) or CH(S) is easily obtained by using either Lemma 1 or Lemma 2. We elaborate on the algorithm based on Lemma 1.
The strategy employed by this algorithm is given in Figure 4.3. It assumes that the N points of S are initially in the top row of the N^2 x N RMESH/PARBUS/MRN. The i'th row, 1 ≤ i ≤ N^2, of the RMESH/PARBUS/MRN is used to determine if points j and k, where i = (j-1)N + k, of S form an edge of the convex hull. In case they do, then j and k are in E(S) and j immediately precedes or immediately follows k in CH(S). The correctness of Figure 4.3 is a direct consequence of Lemma 1 and the fact [PREP85, p. 100] that an ordering of the points of E(S) by polar angle about an internal point yields the convex hull. The complexity of the algorithm is readily seen to be O(1). Note that if only E(S) is to be computed, then steps 9-11 may be omitted.

Step 1: Use column buses to broadcast the points of S from row 1 to all other rows.

Step 2: In row i = (j-1)N + k, 1 ≤ i ≤ N^2, points j and k are broadcast to all processors using a row bus.

Step 3: Using the points j and k received in step 2, each processor computes a, b, and c such that ax + by + c = 0 defines the straight line through points j and k.

Step 4: Let (u, v) be the coordinates of the point of S assigned to processor [e,f] in step 1. Processor [e,f] sets its flag to 1 if au + bv + c > 0, -1 if au + bv + c < 0, 0 if au + bv + c = 0 and (u,v) is on the line segment between points j and k of step 3 (this includes the cases when (u,v) is point j or k), and 2 otherwise.

Step 5: Using row buses and row bus splitting, PE[i,1], i = (j-1)N + k, 1 ≤ i ≤ N^2, determines if there is a flag = 2 on row i. If so, (j,k) is not an edge of the convex hull. If not, it determines if there is a 1 and later if there is a -1. If both a 1 and a -1 are present, (j,k) is not an edge of the convex hull. Otherwise it is.

Figure 4.3 O(1) N^2 x N processor algorithm for E(S) and CH(S) (continued on next page)

Step 6: If PE[i,1], i = (j-1)N + k, 1 ≤ i ≤ N^2, j > k, detects that (j,k) is an edge of the convex hull, then using a row bus it broadcasts a 1 to PEs [i,j] and [i,k].
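Steps 3-5 amount to a sign test of every point against the line through j and k. A sequential sketch of the per-pair test (helper name and line-coefficient formulas are ours):

```python
# Sketch of the Lemma 1 test used in steps 3-5 of Figure 4.3: segment (j,k)
# is a hull edge iff every other point lies on the segment or strictly on
# one side of the line ax + by + c = 0 through points j and k.

def is_hull_edge(points, j, k):
    (x1, y1), (x2, y2) = points[j], points[k]
    a, b, c = y2 - y1, x1 - x2, x2 * y1 - x1 * y2  # line through the two points
    signs = set()
    for idx, (u, v) in enumerate(points):
        if idx in (j, k):
            continue
        d = a * u + b * v + c
        if d > 0:
            signs.add(1)
        elif d < 0:
            signs.add(-1)
        else:
            # Collinear point: the flag = 2 case if it lies outside the segment.
            if not (min(x1, x2) <= u <= max(x1, x2) and
                    min(y1, y2) <= v <= max(y1, y2)):
                return False
    return len(signs) <= 1  # not both a 1 and a -1 present

square = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]
print(is_hull_edge(square, 0, 1))  # bottom edge → True
print(is_hull_edge(square, 0, 2))  # diagonal → False
```

On the mesh, row i = (j-1)N + k runs this loop in parallel, one point per processor, with the sign collection done by row bus splitting.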
Step 7: PEs that receive a 1 in step 6 broadcast this to the PEs in row 1 of the same column (note: 0 or 4 PEs in each column receive a 1 in step 6; using column bus splitting, all but one of these may be eliminated to avoid concurrent writes to a column bus).

Step 8: Row 1 PEs now mark the points of S they contain as being in or out of E(S) (note: a point is in E(S) if and only if it receives a 1 value in step 7).

Step 9: Using row bus splitting in row 1, any three extreme points are accumulated in PE[1,1]. In case |E(S)| = 2, the remainder of this step is omitted. The centroid of these three points is computed. Since no three points of E(S) can be collinear, the centroid is an interior point of the convex hull.

Step 10: The centroid computed in step 9 is broadcast to all points of E(S) using a row bus. Each of these points computes the polar angle made by the point using the centroid as the origin. (An alternative to using the polar angle is discussed on p. 100 of [PREP85].)

Step 11: The points of E(S) are sorted by polar angle using the O(1) RMESH/PARBUS/MRN sort algorithm of [NIGA92].

Figure 4.3 O(1) N^2 x N processor algorithm for E(S) and CH(S) (continued from previous page)

N^2 Processor Algorithm

The convex hull of S can be computed in O(1) time on an N x N RMESH using the algorithm of Figure 4.4. The algorithm for an N x N PARBUS/MRN is similar. Step R0 can be done using the O(1) sorting algorithm for N data items on an N x N RMESH [NIGA92]. Step R1 is done using the O(1) N^3 processor algorithm of the preceding section on each N x N^{1/2} sub-RMESH. For R2, we use the i'th N x N^{1/2} sub-RMESH to determine which points of Ei are also in E. This is done using the algorithm of Figure 4.5.

R0: Sort S by x-coordinate and within x by y-coordinate.

R1: Partition S into N^{1/2} subsets, Si, 1 ≤ i ≤ N^{1/2}, using the ordering just obtained. Each subset is of size N^{1/2}. Determine the convex hull CHi and extreme points Ei of Si using an N x N^{1/2} sub-RMESH for each i, 1 ≤ i ≤ N^{1/2}.
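Steps 9-11 order the extreme points by polar angle about an interior point. A sequential sketch of that ordering (the mesh sorts with the O(1) algorithm of [NIGA92]; here Python's sort stands in, and the function name is ours):

```python
import math

# Sketch of steps 9-11: take any three extreme points, use their centroid
# as an interior origin, and order all extreme points by polar angle about
# it. Since no three extreme points are collinear, the centroid is interior.

def hull_order(extreme_points):
    a, b, c = extreme_points[:3]
    cx = (a[0] + b[0] + c[0]) / 3
    cy = (a[1] + b[1] + c[1]) / 3
    return sorted(extreme_points,
                  key=lambda p: math.atan2(p[1] - cy, p[0] - cx))

square = [(0, 0), (2, 2), (2, 0), (0, 2)]
print(hull_order(square))  # → [(0, 0), (2, 0), (2, 2), (0, 2)], counter-clockwise
```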
R2: Combine the extreme points Ei, 1 ≤ i ≤ N^{1/2}, to obtain the extreme points E of S.

R3: Obtain the convex hull of S from E.

Figure 4.4 N x N RMESH algorithm for convex hull

Theorem 1: The extreme points algorithm of Figure 4.5 is correct.

Proof: To establish the correctness of Figure 4.5, we need to show that exactly those points of ∪_{k=1}^{N^{1/2}} Ek that are not in E(S) are eliminated in steps T5 and T6. It is easy to see that no point eliminated in T5 or T6 can be in E, as in T5 the eliminated points are not in the convex hull of Ei ∪ Ej and in T6 they are not in the convex hull of ∪ Ek. To see that the points not eliminated are all in E(S), suppose that there is a non-eliminated point p ∈ Ei that is not in E(S). Let Q be the set of non-eliminated points. Clearly, E(S) = extreme points of Q, and p is in the interior of the convex hull of Q (see Figure 4.8), CH(Q). Without loss of generality, assume that |E(S)| > 2. Since p is in the interior of

T1: Let Rji denote the (j,i)'th N^{1/2} x N^{1/2} sub-RMESH of the entire RMESH. Move Ei and Ej into Rji such that each row of Rji has Ei (one point to a processor) and each column has Ej (one point to a processor). This can be accomplished by broadcasting row 1 along column buses to all rows and then broadcasting the diagonal (of the N x N RMESH) processor data along row buses to all columns.

T2: Each PE in the k'th column of Rji computes the slope of the line that joins the k'th point of Ei and the point of Ej it contains. These slopes form a tritonic sequence (they may decrease, increase, decrease or may increase, decrease, increase) as the points of Ej are in convex hull order.

T3: The minimum and maximum of the slopes in each column of Rji are found using bus splitting. These are stored in row 1 of Rji.

T4: The maximum of the at most N^{1/2} minimums found in step T3 is determined by forming the cross product of the minimums in the N processors of Rji. The minimum of the maximums of step T3 is similarly found.
Figure 4.5 Substeps for step R2 (continued on next page)

T5: The minimum and maximum of step T4 define the two tangents of Ei and Ej. These are used to eliminate points of Ei that are now known to not be in E (see Figure 4.6).

T6: The tangents of T5 define up to 4 tangent points. Over all Rji, 1 ≤ i ≤ N^{1/2}, there are at most 3N^{1/2} - 2 non-eliminated tangent points plus the non-eliminated points of Si. The extreme points of this set of points are computed. Tangent points of Ei that are not in the set of extreme points just computed are eliminated (see Figure 4.7).

T7: The points of ∪ Ei not eliminated in steps T5 and T6 define E.

Figure 4.5 Substeps for step R2 (continued from previous page)

Figure 4.6 Examples of extreme points eliminated in step T5 (tangents = broken lines; • = tangent points of Ei and Ej; X = points of Ei eliminated)

Figure 4.7 Example of extreme point elimination in step T6 (E1, E2, E3 with i = 2; • = tangent points; X = extreme points of E2 eliminated in step T5; + = extreme points of E2 eliminated in step T6)

CH(Q), there must be a q ∈ Q such that q ∉ Si and the line from p to q does not go through the interior of CH(Si) and does not touch the boundary of CH(Si) except at p. Without loss of generality, assume that q ∈ Sj and q is to the right of p or vertically above p. If p lies between the two tangents from q to Si, then p lies between the tangents of Si and Sj and so is eliminated in step T5 (see Figure 4.8). Since p ∈ Q, p is not eliminated in T5 and hence must be on a tangent from q to Si (Figure 4.10). Furthermore, if p is not a tangent point with respect to Si and Sj, then p lies between the two tangents of Si and Sj and is eliminated in T5. So, we may assume p is one of the 3N^{1/2} - 2 points considered in step T6.

Figure 4.8 Example for correctness proof

First suppose that p is a point of the leftmost partition S1. Since p is in the interior of CH(Q), part of the boundary of CH(Q) passes above p and part below p.
Let a be the leftmost point in E, b the first point in CH(Q) that the lower boundary reaches outside of S1, and c the first point in CH(Q) that the upper boundary reaches outside of S1 (see Figure 4.11). If b = c, then p is not on the tangent from c to CH(S1) and so must have been eliminated in step T5. So, b ≠ c. Let d and e, respectively, be the points in S1 to which c and b are connected in CH(S1). Note that it is possible that d = a or e = a (or both).

Figure 4.9 Tangents (broken lines: tangents from q to Si; solid lines: tangents between Si and Sj)

Figure 4.10 p is on tangent from q

If c ∈ Sj, then d and c must be tangent points with respect to S1 and Sj (as otherwise some points of S1 and/or Sj would lie above the upper boundary of CH(S1)). Similarly, e and b are tangent points with respect to S1 and Sk, where b ∈ Sk. It is clear that p is not an extreme point of {a, b, c, d, e, p}. Since {a, b, c, d, e, p} is a subset of the set used in step T6, p must have been eliminated in this step.

Figure 4.11 p ∈ S1

Next, suppose that p is a point of S_{N^{1/2}}. The proof for this case is similar to that for p ∈ S1. Finally, suppose p ∈ Si, i ∉ {1, N^{1/2}}. Let a, b, c, d be the first points of the portions of the boundary of CH(Q) that are above and below p that are not in Si (Figure 4.12). Note that it is possible that a = b and c = d. Also, note that a, b, c, and d must exist as |Sj| = N^{1/2} for each j (i.e., as no Sj is empty).

Suppose that a ∈ Sj. There must be a tangent point with respect to Sj and Si that is in the triangle defined by a, a′ and a″ (see Figure 4.12). This is so as p is a tangent point with respect to Sj and the other end of the tangent must lie in the said triangle. Let this point be A. Similarly, there are tangent points B, C, D in the corresponding triangles associated with b, c, and d. From the triangles in which A, B, C, D are located, we see that p is contained in the convex hull of {A, B, C, D, p} and so must be eliminated in step T6.
Note that when a = b and c = d, there must be a tangent point, e, of Ei that is on both the upper and lower parts of the boundary of CH(S). Now, p is not an extreme point of {A, B, C, D, e, p} (we need point e as it is possible that A = B and C = D). □

Figure 4.12 p ∈ Si, i ∉ {1, N^{1/2}}

One may verify that each of the steps T1 through T7 can be done in O(1) time. For step T6, a modified version of the N^3 processor algorithm of the previous section is run in each N x N^{1/2} sub-RMESH. Since the number of edges to consider is less than (3N^{1/2})^2 = 9N, the algorithm is run at most nine times. In each run of the algorithm, an edge needs to be compared with 3N^{1/2} - 2 points using N^{1/2} processors. This is done in three passes. For step R3, the points of E may be sorted by polar angle about an interior point as in steps 10 and 11 of Figure 4.3.

Smallest Enclosing Rectangle

It is well known [FREE75] that the smallest enclosing rectangle of a set of N planar points has at least one side that is an extension of an edge of the convex hull. Hence, the smallest enclosing rectangle may be found by first computing the convex hull; then determining, for each convex hull edge, the smallest enclosing rectangle that has one side which is an extension of this edge; and finally determining the smallest of these rectangles. The algorithm is given in Figure 4.13. Its correctness is immediate and one readily sees that its complexity is O(1) on an N x N RMESH, PARBUS and MRN.

ECDF Search On An N x N RMESH

Let S be a set of N distinct planar points. Point p ∈ S dominates point q ∈ S if and only if each coordinate of p is greater than the corresponding coordinate of q. In the ECDF search problem [STOJ86], we are to compute, for each p ∈ S, the number, D(p,S), of points of S that are dominated by p. Stojmenovic [STOJ86] provides an N (N = |S|) processor, O(log^2 N) time hypercube algorithm for this problem.
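The ECDF search problem has a direct quadratic-work statement that the constant time mesh algorithm parallelizes; a brute-force sketch, useful for checking results (the function name is ours):

```python
# Brute-force ECDF search: D(p, S) = number of points of S dominated by p,
# where p dominates q iff p is strictly larger in both coordinates.

def ecdf_counts(S):
    return [sum(1 for q in S if q[0] < p[0] and q[1] < p[1]) for p in S]

S = [(1, 1), (2, 3), (3, 2), (4, 4)]
print(ecdf_counts(S))  # → [0, 1, 1, 3]
```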
On an N x N RMB, D(p,S) for each point p ∈ S may be computed using a strategy similar to that used in [JANG92e] to solve the 2-set dominance counting problem on a PARBUS in O(1) time. A high level description of the strategy is given in Figure 4.14. For simplicity in presentation, we assume that no two points of S have the same x- or the same y-coordinates.

Step 1: Find the convex hull of the N points.

Step 2: [Construct convex hull edges] (a) The convex hull points are broadcast to all rows using column buses. These are saved in variable R of each processor. (b) PE[i,i] broadcasts its R value using a row bus to the S variable of all processors on row i, 1 ≤ i ≤ p, p = number of convex hull points. (c) PE[i,i+1] broadcasts its R value using a row bus to the T variable of all processors on row i, 1 ≤ i ≤ p-1. For i = p, PE[p,1] does this. (Note: now each PE in row i contains the same edge of the convex hull in its S and T variables.)

Step 3: [Determine area of minimum rectangle for each edge] (a) Using its R, S, and T values, each PE computes the perpendicular distance D between point R and the straight line defined by the points S and T. Since the R's are in convex hull order, the D values in a row form a tritonic sequence whose maximum, h, can be found in O(1) time using row bus splitting. h is the height of the minimum rectangle for the row's edge.

Figure 4.13 Minimum enclosing rectangle (continued on next page)

Step 1 of Figure 4.14 can be done by sorting the N points in O(1) time by x-coordinate [NIGA92]. The point in PE[1,j] belongs to partition Xu where u = ⌊(j-1)/N^{1/2}⌋ + 1. Step 2 is similarly accomplished by another sort (this time by y-coordinate). Following this, each point knows which X and which Y partition it belongs to. The computation of DX(p) is done using an N x N^{1/2} submesh for the points in each X partition. The i'th such submesh is

Step 3: (b) Let P be the perpendicular through the middle of the edge defined by points S and T.
Each processor computes the perpendicular distance of its point R from the infinite line P (use negative distances for points on one side of P and positive distances for points on the other side). These distances form a quadratonic sequence, and its maximum, d_max, and minimum, d_min, can be found in O(1) time using row bus splitting. (c) The minimum area, A, of the rectangle for row i is h*(d_max - d_min). Let this be stored in the A value of PE[i,1].

Step 4: [Determine overall minimum rectangle] Compute the minimum of the A's of PEs [i,1], 1 ≤ i ≤ p. This is done by forming the cross product of the A's in a p x p sub-array and comparing the two A's in each PE.

Figure 4.13 Minimum enclosing rectangle (continued from previous page)

Step 1: Partition S into N^{1/2} sets Xi, 1 ≤ i ≤ N^{1/2}, such that |Xi| = N^{1/2}, 1 ≤ i ≤ N^{1/2}, and no point in Xi has a larger x-coordinate than any of the points in Xi+1, 1 ≤ i < N^{1/2}.

Step 2: Partition S into N^{1/2} sets Yj, 1 ≤ j ≤ N^{1/2}, such that |Yj| = N^{1/2}, 1 ≤ j ≤ N^{1/2}, and no point in Yj has a larger y-coordinate than any of the points in Yj+1, 1 ≤ j < N^{1/2}.

Step 3: Let S_ij = Xi ∩ Yj (see Figure 4.15). For each p ∈ S_ij, it is the case that

D(p,S) = # of points dominated by p in (Yj - S_ij) + # of points dominated by p in Xi + Σ_{u<i} Σ_{v<j} |S_uv| = DY(p) + DX(p) + Σ_{u<i} Σ_{v<j} |S_uv|.

Compute D(p,S) using this equation.

Figure 4.14 Strategy to compute D(p,S)

used for Xi. For this, each p ∈ Xi uses an N^{1/2} x N^{1/2} partition of the N x N^{1/2} submesh. In processor k of row 1 of this N^{1/2} x N^{1/2} partition, a variable T is set to 1 iff p dominates the k'th point of Xi. T is set to zero otherwise.

Figure 4.15 Partitioning of the points of S

The T's in each N^{1/2} x N^{1/2} partition may be summed in O(1) time using the RMESH ranking algorithm of [JENQ91a]. The result is DX(p). The computation of DX(p) is best done after Step 1 of Figure 4.14, as at this time the points of S are in the needed order. DY(p) may be computed in a similar way following Step 2.
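The decomposition of Figure 4.14 can be checked directly; a sketch that verifies D(p,S) = DY(p) + DX(p) + Σ_{u<i} Σ_{v<j} |S_uv| against a brute-force count on a random instance with distinct coordinates (all helper names are ours):

```python
import random

# Verify the Figure 4.14 decomposition: partition n*n points into n vertical
# slabs X (by sorted x) and n horizontal slabs Y (by sorted y); then for p in
# X_i ∩ Y_j,
#   D(p,S) = DY(p)   (dominated points in Y_j but outside X_i)
#          + DX(p)   (dominated points in X_i)
#          + M       (every point of S_uv with u < i, v < j is dominated by p).

def verify(seed=1, n=4):
    rng = random.Random(seed)
    xs = rng.sample(range(100), n * n)       # distinct x-coordinates
    ys = rng.sample(range(100), n * n)       # distinct y-coordinates
    S = list(zip(xs, ys))
    xpart = {p: k // n for k, p in enumerate(sorted(S))}
    ypart = {p: k // n for k, p in enumerate(sorted(S, key=lambda p: p[1]))}
    dom = lambda p, q: q[0] < p[0] and q[1] < p[1]
    for p in S:
        i, j = xpart[p], ypart[p]
        DX = sum(dom(p, q) for q in S if xpart[q] == i)
        DY = sum(dom(p, q) for q in S if ypart[q] == j and xpart[q] != i)
        M = sum(1 for q in S if xpart[q] < i and ypart[q] < j)
        if DX + DY + M != sum(dom(p, q) for q in S):
            return False
    return True

print(verify())  # → True
```

The M term needs no dominance tests at all, which is what makes the counting cheap: with distinct coordinates, every point in a slab strictly left of and below p's slabs is dominated.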
However, when setting the value of T, we note that T is to be set to 1 iff p dominates the k'th point of Yj and the k'th point is not in the same X partition as p.

Let M_ij = Σ_{u<i} Σ_{v<j} |S_uv|. To compute M_ij, we first sort the points of S by their Y partition index and, within each Y partition, by the X partition index. Following this, each processor in row 1 determines if the X partition number of the point in the processor to its left is less than that of its own point. If so, then using its own column index and the Y partition number of its point, it can compute R_ij = Σ_{u<i} |S_uj|, where i and j are, respectively, the X and Y partition numbers of its point (R_ij = q - (j-1)N^{1/2}, where q is the processor's column index). Finally, M_ij = Σ_{v<j} R_iv. Before describing how to do this final summation, we provide, in Figure 4.16, the steps in the overall ECDF algorithm.

Step 1: Each point determines its X partition. This is done by sorting the points by x-coordinate.

Step 2: Compute DX(p) for each point. This is done using an N^{1/2} x N^{1/2} block for each p ∈ S.

Step 3: Each point determines its Y partition. For this, sort all points by y-coordinate.

Step 4: Compute DY(p) as in step 2.

Step 5: Sort S by Y partition and within Y partition by X partition. Each p ∈ S determines its R_ij value, where i and j are such that p ∈ Xi and p ∈ Yj.

Step 6: Each point determines M_ij = Σ_{v<j} R_iv.

Step 7: Each point computes D(p,S) = DY(p) + DX(p) + M_ij(p).

Figure 4.16 RMESH algorithm to compute D(p,S)

Now, let us consider the computation of M_ij = Σ_{v<j} R_iv. For each i, the N^{1/2} M_ij values (1 ≤ j ≤ N^{1/2}) are computed using the i'th N x N^{1/2} submesh. Initially, we have R_iv stored in the R variable of the PEs in the v'th N^{1/2} x N^{1/2} submesh of the i'th N x N^{1/2} submesh (Figure 4.17(a)). The steps required to compute M_ij are given in Figure 4.18. Its correctness is easily verified and its complexity is O(1).
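The arithmetic behind Figure 4.18 is a radix decomposition: each R is split into a high digit a and a low digit b, the two digit streams are prefix summed separately (in unary on the mesh, via the ranking algorithm of [JENQ91a]), and the results are recombined. A sequential sketch, with itertools.accumulate standing in for the mesh ranking step (the function name is ours):

```python
from itertools import accumulate

# Sketch of Figure 4.18's arithmetic: decompose each R as a*radix + b with
# 0 <= b < radix, prefix sum the a's and the b's separately, then recombine.
# Keeping each digit small (at most radix) is what lets the mesh sum a digit
# stream in unary in O(1) time.

def radix_prefix_sums(R, radix):
    a = [r // radix for r in R]                     # high digits (Step 1)
    b = [r % radix for r in R]                      # low digits (Step 1)
    pa = list(accumulate(a))                        # prefix sums of a's (Step 2)
    pb = list(accumulate(b))                        # prefix sums of b's (Step 3)
    return [x * radix + y for x, y in zip(pa, pb)]  # recombine (Step 4)

R = [5, 0, 7, 3, 2]
print(radix_prefix_sums(R, 4))  # → [5, 5, 12, 15, 17]
print(list(accumulate(R)))      # same result computed directly
```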
Figure 4.17 (a) i'th N x N^{1/2} submesh; (b) N^{3/4} x N^{1/2} submesh

Step 1: [Decompose R's using the radix N^{1/4}] PE[1,1] of each of the N^{1/2} x N^{1/2} submeshes computes a and b such that R = a*N^{1/4} + b and 0 ≤ b < N^{1/4}. Since |Yv| = N^{1/2}, each R_iv ≤ N^{1/2}. Hence, 0 ≤ a ≤ N^{1/4}.

Step 2: [Prefix sum the a's] (a) The i'th N x N^{1/2} submesh is partitioned into N^{1/4} N^{3/4} x N^{1/2} submeshes. In each of these, the N^{1/4} a values contained are routed to the row 1 processors such that the k'th group of N^{1/4} such processors contains the k'th a value (Figure 4.17(b)). (b) In each N^{3/4} x N^{1/2} submesh, the unary representation of the a's is computed. For this, the r'th processor in each group of N^{1/4} sets its D value to 1 if r is less than or equal to its a value. Otherwise, it sets D to zero.

Figure 4.18 Computation of M_ij (continued on next page)

Step 2: (c) The one's in row 1 of the N^{3/4} x N^{1/2} submesh are prefix summed using the RMESH ranking algorithm of [JENQ91a]. Let the overall sum for each group of N^{1/4} a's be stored in variable A of the [1,1] PE of the N^{3/4} x N^{1/4} submesh. (d) [Prefix sum the A's] We now have N^{1/4} A values to be prefix summed. Each is in the range [0, N^{1/2}]. The A's are decomposed using radix N^{1/4} such that A = c*N^{1/4} + d, where 0 ≤ d < N^{1/4}. The N^{1/4} c's may be prefix summed using an N^{1/2} x N^{1/2} submesh by routing them to row 1 of any N^{1/2} x N^{1/2} submesh, computing their unary representation (as in (b) above), and ranking (as in (c) above). The d's are prefix summed similarly. By adding together corresponding pairs of prefix sums of c's and d's, the prefix sums of the A's are obtained. (e) [Obtain prefix sum of a's] The prefix sum of the a's is obtained by adding together the prefix sum of the a's in each N^{1/4} group (as computed in (c)) and the appropriate prefix sum of A (as computed in (d)).

Step 3: [Prefix sum the b's] This is similar to Step 2.
Step 4: [Obtain M_ij] M_ij is obtained by combining the a and b prefix sums computed in steps 2 and 3 for the j'th (from the bottom) N^{1/2} x N^{1/2} submesh of the i'th N x N^{1/2} submesh (M_ij = N^{1/4} times the a prefix sum plus the b prefix sum).

Figure 4.18 Computation of M_ij (continued from previous page)

Triangulation

In this section, we develop an O(1) time algorithm to triangulate a set S of N planar points. The algorithm is for an N x N RMESH, an N x N PARBUS and an N x N MRN. For simplicity, we assume that the N points have distinct x-coordinates and that no four are cocircular. Our algorithm is easily modified to work when these assumptions do not hold.

The overall strategy is given in Figure 4.19. We begin, in step 1, by partitioning the N points into N^{2/3} sets, Si, each of size N^{1/3}. For this, we assume N^{1/3} is an integer. The partitioning is done by x-coordinate. Each Si contains points that are to the left of those in Si+1, 1 ≤ i < N^{2/3}. This partitioning is possible because of our assumption that the x-coordinates are distinct. To accomplish the partitioning, the N points are sorted by x-coordinate using the O(1) sorting algorithm of [NIGA92].

Step 1: Partition S into N^{2/3} sets, Si, each containing N^{1/3} points. The partitioning is done such that all points in Si are to the left of those in Si+1, 1 ≤ i < N^{2/3}.

Step 2: Using an N x N^{1/3} submesh for each Si, compute the Voronoi diagram for Si, 1 ≤ i ≤ N^{2/3}.

Step 3: Compute the straight-line dual of each of the N^{2/3} Voronoi diagrams of step 2.

Step 4: Partition the region not in the union of the convex hulls of the Si's into basic polygons and triangulate these.

Figure 4.19 Triangulating N planar points

In step 2, the Voronoi diagram of each Si is computed. For this, each point p ∈ Si computes its Voronoi polygon, which defines all points in the plane that are closer to p