Sorting n Numbers On n x n Reconfigurable Meshes With Buses*
Madhusudan Nigam and Sartaj Sahni
University of Florida
Gainesville, FL 32611
Technical Report 925
ABSTRACT
We show how column sort [LEIG85] and rotate sort [MARB88] can be implemented on the different reconfigurable mesh with buses (RMB) architectures that have been proposed in the literature. On all of these proposed RMB architectures, we are able to sort n numbers on an n x n configuration in O(1) time.
For the PARBUS RMB architecture [WANG90ab], our column sort and rotate sort implementations are simpler than the O(1) sorting algorithms developed in [JANG92] and [LIN92]. Furthermore, our sorting algorithms use fewer bus broadcasts. For the RMESH RMB architecture [MILL88abc], our algorithms are the first to sort n numbers on an n x n configuration in O(1) time.
We also observe that rotate sort can be implemented on N x N x ... x N, k+1 dimensional RMB architectures so as to sort N^k elements in O(1) time.
Keywords And Phrases
Sorting, column sort, rotate sort, reconfigurable mesh with buses.
* This research was supported, in part, by the National Science Foundation under grant MIP9103379.
1. Introduction
Several different mesh-like architectures with reconfigurable buses have been proposed in the literature. These include the content addressable array processor (CAPP) of Weems et al. [WEEM89], the polymorphic torus of Li and Maresca [LI89, MARE89], the reconfigurable mesh with buses (RMESH) of Miller et al. [MILL88abc], the processor array with a reconfigurable bus system (PARBUS) of Wang and Chen [WANG90], and the reconfigurable network (RN) of Ben-Asher et al. [BENA91].
The CAPP [WEEM89] and RMESH [MILL88abc] architectures appear to be quite
similar. So, we shall describe the RMESH only. In this, we have a bus grid with an n x n
arrangement of processors at the grid points (see Figure 1 for a 4x4 RMESH). Each grid
segment has a switch on it which enables one to break the bus, if desired, at that point.
When all switches are closed, all n^2 processors are connected by the grid bus. The
switches around a processor can be set by using local information. If all processors
disconnect the switch on their north, then we obtain row buses (Figure 2). Column buses
are obtained by having each processor disconnect the switch on its east (Figure 3). In the
exclusive write model two processors that are on the same bus cannot simultaneously
write to that bus. In the concurrent write model several processors may simultaneously
write to the same bus. Rules are provided to determine which of the several writers actually succeeds (e.g., arbitrary, maximum, exclusive or, etc.). Notice that in the RMESH
model it is not possible to simultaneously have n disjoint row buses and n disjoint
column buses that, respectively, span the width and height of the RMESH. It is assumed
that processors on the same bus can communicate in O(1) time. RMESH algorithms for
fundamental data movement operations and image processing problems can be found in
[MILL88abc, MILL91ab, JENQ91abc].
An n x n PARBUS (Figure 4) [WANG90] is an n x n mesh in which the interprocessor links are bus segments and each processor has the ability to connect together arbitrary subsets of the four bus segments that connect to it. Bus segments that get so connected behave like a single bus. The bus segment interconnections at a processor are done by an internal four port switch. If the up to four bus segments at a processor are labeled N (North), E (East), W (West), and S (South), then this switch is able to realize any set, A = {A1, A2}, of connections where Ai ⊆ {N,E,W,S}, 1 ≤ i ≤ 2, and the Ai's are disjoint. For example, A = {{N,S}, {E,W}} results in connecting the North and South segments together and the East and West segments together. If this is done in each processor, then we get, simultaneously, disjoint row and column buses (Figures 5 and 6). If A = {{N,S,E,W}}, then all four bus segments are connected. PARBUS algorithms for a variety of applications can be found in [MILL91a, WANG90ab, LIN92, JANG92]. Observe that in an RMESH the realizable connections are of the form A = {A1}, A1 ⊆ {N,E,W,S}.
Figure 1 4x4 RMESH
The polymorphic torus architecture [LI89ab, MARE89] is identical to the PARBUS except that the rows and columns of the underlying mesh wrap around (Figure 7). In a reconfigurable network (RN) [BENA91] no restriction is placed on the bus segments that connect pairs of processors or on the relative placement of the processors. I.e., processors may not lie at grid points and a bus segment may join an arbitrary pair of processors. Like the PARBUS and polymorphic torus, each processor has an internal switch that is able to connect together arbitrary subsets of the bus segments that connect to the processor. Ben-Asher et al. [BENA91] also define a mesh restriction (MRN) of their reconfigurable network. In this, the processor and bus segment arrangement is exactly as for the PARBUS (Figure 4). However, the switches internal to processors are able to
Figure 2 Row buses

Figure 3 Column buses
obtain only the 10 bus configurations given in Figure 8. Thus an MRN is a restricted
PARBUS.
While we have defined the above reconfigurable bus architectures as square two
dimensional meshes, it is easy to see how these may be extended to obtain non square
architectures and architectures with more dimensions than two.
In this paper we consider the problem of sorting n numbers on an RMESH,
PARBUS and MRN. This sorting problem has been previously studied for all three
architectures. n numbers can be sorted in O(1) time on a three dimensional n x n x n
Figure 4 4 x 4 PARBUS
RMESH [JENQ91ab], PARBUS [WANG90], and MRN [BENA91]. All of these algorithms are based on a count sort [HORO90] and are easily modified to run in the same amount of time on a two dimensional n^2 x n computer of the same model. Nakano et al. [NAKA90] have shown how to sort n numbers in O(1) time on an (n log2 n) x n PARBUS. Jang and Prasanna [JANG92] and Lin et al. [LIN92] have reduced the number of processors required by an O(1) sort further. They both present O(1) sorting algorithms that work on an n x n PARBUS. Since such a PARBUS can be realized using n^2 area, their algorithms achieve the area time squared (AT^2) lower bound of Ω(n^2) for sorting n numbers in the VLSI word model [LEIG85]. The algorithm of Jang and Prasanna [JANG92] is based on Leighton's column sort [LEIG85] while that of Lin et al. [LIN92] is based on selection. Neither is directly adaptable to run on an n x n RMESH in O(1) time as the algorithm of [JANG92] requires processors be able to connect their bus segments according to A = {{N,S}, {E,W}} while the algorithm of [LIN92] requires A = {{N,S}, {E,W}} and {{N,W}, {S,E}}. These are not permissible in an RMESH. Their algorithms are, however, directly usable on an n x n MRN as the bus connections used are permissible connections for an MRN. Ben-Asher et al. [BENA91] describe an O(1) algorithm to sort n numbers on an RN with O(n^(1+ε)) processors for any ε > 0. This algorithm is also based on Leighton's column sort [LEIG85].
Figure 5 Row buses in a PARBUS
In this paper, we show how Leighton's column sort algorithm [LEIG85] and Marberg and Gafni's rotate sort algorithm [MARB88] can be implemented on all three reconfigurable mesh with buses (RMB) architectures so as to sort n numbers in O(1) time on an n x n configuration. The resulting RMB sort algorithms are conceptually simpler than the O(1) PARBUS sorting algorithms of [JANG92] and [LIN92]. In addition, our implementations use fewer bus broadcasts than do the algorithms of [JANG92] and [LIN92]. Since the PARBUS implementations use only bus connections permissible in an MRN, our PARBUS algorithms may be directly used on an MRN. For an RMESH, our implementations are the first RMESH algorithms to sort n numbers in O(1) time on an n x n configuration.
In section 2, we describe Leighton's column sort algorithm. Its implementation on
an RMESH is developed in section 3. In section 4, we show how to implement column
sort on a PARBUS. Rotate sort is considered in sections 5 through 7. In section 5 we
describe Marberg and Gafni's rotate sort [MARB88]. The implementation of rotate sort
is obtained in sections 6 and 7 for RMESH and PARBUS architectures, respectively. In
Figure 6 Column buses in a PARBUS
Figure 7 4 x 4 Polymorphic torus
Figure 8 Local Configurations allowed for a switch in MRN
section 8, we propose a sorting algorithm that is a combination of rotate sort and Scherson et al.'s [SCHER89] iterative shear sort. In section 9, we provide a comparison of the two PARBUS sorting algorithms developed here and those of Jang and Prasanna [JANG92] and Lin et al. [LIN92]. For the PARBUS model, Leighton's column sort uses the fewest bus broadcasts. However, for the RMESH model, combined sort uses the fewest bus broadcasts. In section 10, we make the observation that using rotate sort, one can sort N^k elements in O(1) time on an N x N x ... x N, k+1 dimensional RMESH and PARBUS.
2. Column Sort
Column sort is a generalization of Batcher's odd-even merge [KNUT73] and was proposed by Leighton [LEIG85]. It may be used to sort an r x s matrix Q where r ≥ 2(s-1)^2 and r mod s = 0. The number of elements in Q is n = rs and the sorted sequence is stored in column major order (Figure 9). Our presentation of column sort follows that of [LEIG85] very closely.
21  1 16        1 10 19
 2 27 25        2 11 20
15 22  3        3 12 21
 4 14 20        4 13 22
24 13 17        5 14 23
19  6  5        6 15 24
26 23 11        7 16 25
 8  7 18        8 17 26
10 12  9        9 18 27
(a) Input Q     (b) Output Q

Figure 9 Sorting a 9x3 matrix Q into column major order
There are eight steps to column sort. In the odd steps 1, 3, 5, and 7, we sort each column of Q top to bottom. In step 2, the elements of Q are picked up in column major order and placed back in row major order (Figure 10). This operation is called a transpose. Step 4 is the reverse of this (i.e., elements of Q are picked up in row major order and put back in column major order) and is called untranspose. Step 6 is a shift by ⌊r/2⌋. This increases the number of columns by 1 and is shown in Figure 11. Step 8, unshift, is the reverse of this. Leighton [LEIG85] has shown that these eight steps are sufficient to sort Q whenever r ≥ 2(s-1)^2 and r mod s = 0.
3. Column Sort On An RMESH
Our adaptation of column sort to an n x n RMESH is similar to the adaptation used by Ben-Asher et al. to obtain an O(n^(17/9)) processor reconfigurable network that can sort in O(1) time. This reconfigurable network uses approximately 11 layers.
The input n numbers to be sorted reside in the Z variables of the row 1 PEs of an n x n RMESH. This is interpreted as representing the column major order of the Q matrix used in a column sort (Figure 12). The dimensions of Q are r1 x s1 where r1 = (1/2)n^(3/4) and s1 = 2n^(1/4). Clearly, there is an n1 such that r1 ≥ 2(s1-1)^2 for all n ≥ n1. Hence, column sort will work for n ≥ n1. For n < n1 we can sort in constant time (as n is a constant) using any previously known RMESH sorting algorithm.
Figure 10 Transpose (step 2) and Untranspose (step 4) of a 9x3 matrix
Figure 11 Shift (step 6) and Unshift (step 8)
The steps in the sorting algorithm are given in Figure 13. Steps 1, 2, and 3 use the n x r1 sub RMESH Ai to sort column i of Q. For the sort of step 4, the Q matrix has dimensions r1 x (s1+1). Columns 1 and s1+1 are already sorted as the -∞'s and +∞'s of Figure 11 do not affect the sorted order. Only the inner s1 - 1 columns need to be sorted. Each of these columns is sorted using an n x r1 sub RMESH, Bi, as shown in Figure 14. The sub RMESHs X and Y are idle during this sort.
Figure 12 Column major interpretation of Q
Step 0: [Input] Q is available in column major order, one element per PE, in row 1 of the RMESH. I.e., Z[1,j] = j'th element of Q in column major order.
Step 1: [Sort Transpose] Obtain the Q matrix following step 2 of column sort. This matrix is available in column major order in each row of the RMESH.
Step 2: [Sort Untranspose] Obtain the Q matrix following step 4 of column sort. This matrix is available in column major order in each row of the RMESH.
Step 3: [Sort Shift] Obtain the Q matrix following step 6 of column sort. This matrix (excluding the -∞ and +∞ values) is available in column major order in each row of the RMESH.
Step 4: [Sort Unshift] Obtain the sorted result in row 1 of the RMESH.
Figure 13 Sorting Q on an RMESH
Thus the sorts of each of the above four steps involve sorting r1 numbers using an n x r1 sub RMESH. Each of these is done using column sort with the Q matrix now being an r2 x s2 matrix with r2·s2 = r1. We use r2 = n^(1/2) and s2 = (1/2)n^(1/4). With this choice, we have 2(s2-1)^2 < 2s2^2 = (1/2)n^(1/2) < n^(1/2) = r2 for n ≥ 1. We use W to denote the r2 x s2 matrix.
X (width ⌊r1/2⌋), B1, B2, ..., B(s1-1) (width r1 each), Y (width ⌈r1/2⌉)
Figure 14 Sub RMESHs for step 4 sort
Let us examine step 1 of Figure 13. To sort a column of Q we initially use only the n x r1 sub RMESH Ai that contains this column. This column is actually the column major representation of a matrix Wi. So, sorting the column is equivalent to sorting Wi. To sort Wi, we follow the steps given in Figure 13 except that Q is replaced by Wi. The steps of Figure 13 are done differently on Wi than on Q. Figure 15 gives the necessary steps for each Wi. Note that this figure assumes that the n x n RMESH has been partitioned into the s1 Ai's and each Ai operates independently of the others. To SortTranspose a Wi, we first broadcast Wi, which is initially stored in the Z variables of the row 1 PEs of Ai, to all rows of Ai (step 1). Following this, the U variable of each PE in each column of Ai is the same. In steps 2-4, each Ai is partitioned into square sub RMESHs Bijk, 1 ≤ j ≤ r2, 1 ≤ k ≤ s2, of size r2 x r2 (Figure 16). Note that Bijk contains column k of Wi in the U variables of each of its rows. Bijk will be used to determine the rank of the j'th
element of the k'th column of Wi. Here, by rank we mean the number of elements in the k'th column of Wi that are either less than the j'th element or are equal to it but not in a column to the right of the j'th column of Bijk. So, this rank gives the position, in column k, of the j'th element following a sort of column k. To determine this rank, in step 2, we broadcast the j'th element of column k of Wi to all processors in Bijk. This is done by broadcasting U values from column j of Bijk using row buses that are local to Bijk. The broadcast value is stored in the V variables of the PEs of Bijk. Now, the processor in position (a,b) of Bijk has the b'th element of column k of Wi in its U variable and the j'th element of this column in its V variable. Step 3 sets S variables in the processors to 0 or 1 such that the sum of the S's in each row of Bijk gives the rank of the j'th element of column k of Wi. In step 4, the S values in a row are summed to obtain the rank. We shall describe shortly how this is done. The computed rank is stored in variable R of the PE in position [k,1] of Bijk. Note that since k ≤ s2 < r2, [k,1] is actually a position in Bijk.
Now if our objective is simply to sort the columns of Wi, we can route the V variable in PE[k,1] (this is the j'th element in column k of Wi) to the R'th column using a row bus local to Bijk and then broadcast it along a column bus to all PEs on column R. However, we wish to transpose the resulting Wi which is sorted by columns. For this, we see that the j'th element of Wi will be in column k and row R of the sorted Wi. Following the transpose, this element will be in row (k-1)r2/s2 + ⌈R/s2⌉ and column (R-1) mod s2 + 1 of Wi. Hence, the V value in PE[k,1] of Bijk is to be in all PEs of column [(R-1) mod s2]r2 + (k-1)r2/s2 + ⌈R/s2⌉ of Ai following the SortTranspose. This is accomplished in steps 5 and 6. Note that there are no bus conflicts in the row bus broadcasts of step 5 as the broadcast for each Bijk uses a different row bus that spans the width of Ai. The total number of broadcasts is 4 plus the number needed to sum the S values in a row of Bijk. We shall shortly see that summing can be done with 6 broadcasts. Hence SortTranspose uses 10 broadcasts.
Step 1: Broadcast the row 1 Z values, using column buses, to the U variables of all PEs in the same column of Ai.
Step 2: The column j PEs in all Bijk's broadcast their U values, using row buses local to each Bijk, to the V variables of all PEs in the same Bijk.
Step 3: PE [a,b] of Bijk sets its S value to 1 if (U < V) or (U = V and b < j). Otherwise, it sets its S value to 0. This is done, in parallel, by all PEs in all Bijk's.
Step 4: The sum of the S's in any one row of Bijk is computed and stored in the R variable of PE[k,1] of Bijk. This is done, in parallel, for all Bijk's.
Step 5: Using row buses, PE[k,1] of each Bijk sends its V value to the PE in column [(R-1) mod s2]r2 + (k-1)r2/s2 + ⌈R/s2⌉ of Ai.
Step 6: The PEs that received V values in step 5 broadcast this value, using column buses, to the U variables of the PEs in the same column of Ai.
Figure 15 Steps to SortTranspose Wi in Ai
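The rank that steps 2-4 of Figure 15 compute for the j'th element can be stated directly as a one-line sequential check (a Python sketch with our own naming):

```python
def rank(col, j):
    """Rank of col[j] (0-indexed) as in steps 3-4 of Figure 15: the number
    of elements smaller than col[j], with ties broken by position so that
    equal elements receive distinct ranks."""
    return sum(1 for b, u in enumerate(col)
               if u < col[j] or (u == col[j] and b < j))
```

Because ties are broken by position, the ranks of a column form a permutation even in the presence of duplicates, so each element can be routed to a distinct destination column in step 5.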
3.1. Ranking
To sum the S values in a row of an r2 x r2 RMESH, we use a slightly modified
version of the ranking alogrithm of Jenq and Sahni [JENQ91ab]. This algorithm ranks
the row 1 processors of an r2 x r2 RMESH. The ranking is done by using 0/1 valued
variables, S, in row 1. The rank of the processor in column i of row 1 is the number of
processors in row 1 and columns k, k < i that have S = 1. Hence, the rank of processor
[1,r2] equals the sum of the S values in row 1 except when S[1,r2] = 1. In this latter
case we need to add 1 to the rank to get the sum. The ranking algorithm of [JENQ91ab]
has three phases associated with it.
Figure 16 Division of Ai into r2 x r2 sub RMESHs
Phase 1: Rank the processors in even columns of row 1. This does not account for the 1's in the odd columns.
Phase 2: Rank the processors in odd columns of row 1. This does not account for the 1's in the even columns.
Phase 3: Combine odd and even ranks.
The procedures for phases 1 and 2 are quite similar. These are easily modified to
start with the configuration following step 3 of Figure 15 and send the phase 1 and phase
2 results of the rightmost odd and even columns in Bijk to PE[k, 1] where the two results
are added. To avoid an extra broadcast, the result for the rightmost column (phase 1 if r2 is even and phase 2 if r2 is odd) is incremented by 1 in case S[*,r2] = 1 before the broadcast. While the phase 1 code of [JENQ91ab] uses 4 broadcasts, the first of these can be eliminated as we begin with a configuration in which S[*,b] is already on all rows of column b, 1 ≤ b ≤ r2. So, phases 1 and 2 use 6 broadcasts. Phase 3 of [JENQ91ab] uses two broadcasts. Both of these can be eliminated by having their phase 1 (2) step 10 directly broadcast the rank of the rightmost even (odd) column to PE[k,1] using a row bus that spans row k connected to a column bus that spans the rightmost even (odd) column of Bijk. So, the summing operation of step 4 can be done using a total of 6 broadcasts.
3.2. Analysis of RMESH Column Sort
First, let us consider sorting a column of Q, which is an r1 x s1 matrix. This requires us to perform steps 1-4 of Figure 13 on the Wi's. As we have just seen, step 1 uses 10 broadcasts. Step 2 is similar to step 1 except that we begin with the data on all rows of Ai and instead of a transpose, an untranspose is to be performed. This means that step 1 of Figure 15 can be eliminated and the formula in step 5 is to be changed to correspond to an untranspose. The number of broadcasts for step 2 of Figure 13 is therefore 9. Steps 3 and 4 are similar and each uses 9 broadcasts. The total number of broadcasts to sort a column of Q is therefore 37.
Now, to perform a SortTranspose of the columns of Q we proceed as in a sort of the columns of Q except that the last broadcast of SortUnshift performs the transpose and leaves the transposed matrix in column major order in all rows of the RMESH. This takes 37 broadcasts. The SortUntranspose takes 36 broadcasts as it begins with Q in all rows. Similarly, steps 3 and 4 each take 36 broadcasts. The total number of broadcasts is therefore 145.
A more careful analysis reveals that the number of broadcasts needed for the SortShift and SortUnshift steps can be reduced by one each as the step 5 (Figure 15) broadcast can be eliminated. Taking this into account, the number of broadcasts to sort a column of Q becomes 35. However, to do a SortTranspose of the columns of Q, we need to do an additional broadcast during the SortUnshift of the Wi's. This brings the number to 36. A SortUntranspose of Q takes 35 broadcasts and the remaining two steps of Figure 13 each take 34 broadcasts. Hence, the total number of broadcasts becomes 139.
4. Column Sort On A PARBUS
The RMESH algorithm of Section 3 will work on a PARBUS as all connections used by it are possible in a PARBUS. We can, however, sort using fewer than 139 broadcasts on a PARBUS. If we replace the ranking algorithm of [JENQ91ab] that we used in Section 3 by the prefix sum of [LIN92] (i.e., prefix sum of N bits on an (N+1) x N PARBUS), then we can sum the S's in a row using two broadcasts. The algorithm of [LIN92] needs to be modified slightly to allow for the fact that we begin with the S values on all columns and we are summing r2 S values on an r2 x r2 PARBUS rather than an (r2+1) x r2 PARBUS. These modifications are straightforward. Since the S's can be summed in 2 broadcasts rather than 6, the SortTranspose of Figure 13 requires only 6 broadcasts. The SortUntranspose can be done in 5 broadcasts. As indicated in the analysis of Section 3, the SortShift and SortUnshift steps can be done without the broadcast of step 5 of Figure 15. So, each of these requires only 4 broadcasts. Thus, to sort a column of Q requires 19 broadcasts. Following the analysis of Section 3, we see that a SortTranspose of Q can be done with 20 broadcasts, a SortUntranspose with 19 broadcasts, and a SortShift and SortUnshift with 18 broadcasts each. Thus n numbers can be sorted on an n x n PARBUS using 75 broadcasts.
We can actually reduce the number of broadcasts further by beginning with r1 = n^(2/3) and s1 = n^(1/3). While this does not satisfy the requirement that r ≥ 2(s-1)^2, Leighton [LEIG85] has shown that column sort works for r and s such that r ≥ s(s-1) provided that the untranspose of step 4 is replaced by an undiagonalize step (Figure 17). The use of the undiagonalizing permutation only requires us to change the formula used in step 5 of Figure 15 to a slightly more complex one. This does not change the number of broadcasts in the case of the RMESH and PARBUS algorithms previously discussed. However, the ability to use r1 = n^(2/3) and s1 = n^(1/3) (instead of r1 = 2n^(2/3) and s1 = (1/2)n^(1/3), which satisfy r ≥ 2(s-1)^2) significantly reduces the number of broadcasts for the PARBUS algorithm we are about to describe.
Now, the SortTranspose, SortUntranspose, SortShift and SortUnshift for Q are not done by using another level of column sort. Instead, a count sort similar to that of Figure 15 is directly applied to the columns of Q. This time, Ai is an n x r1 = n x n^(2/3) sub PARBUS. Let Dij be the j'th (from the top) n^(1/3) x r1 sub PARBUS of Ai (Figure 18). The Dij's are used to do the counting previously done by the Bijk's. To count r1 = n^(2/3) bits using an n^(1/3) x n^(2/3) sub PARBUS, we use the parallel prefix sum algorithm of [LIN92] which does this in 12 broadcasts when we begin with bits in all rows of Dij and take into account that we want only the sum and not the prefix sum. Note that the prefix sum algorithm of [LIN92] is an iterative algorithm that uses modulo M arithmetic to sum N bits on an (M+1) x N PARBUS. For this it uses ⌈log N / log M⌉ iterations. For the case of summing N bits, this is easily modified to run on an M x N PARBUS in ⌈log N / log M⌉ iterations with each iteration using 6 broadcasts (3 for the odd bits, and 3 for the even bits). With our choice of r1 and s1, the number of iterations is 2, while with r1 = 2n^(2/3) and s1 = (1/2)n^(1/3), the number of iterations is 3. The two iterations, together, use 12 broadcasts.
 1  2  4                       1 10 19
 3  5  7                       2 11 20
 6  8 10                       3 12 21
 9 11 13    Undiagonalize      4 13 22
12 14 16                       5 14 23
15 17 19                       6 15 24
18 20 22                       7 16 25
21 23 25                       8 17 26
24 26 27                       9 18 27

Figure 17 Undiagonalize a 9x3 matrix
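As Figure 17 suggests, undiagonalize picks the elements up along the anti-diagonals of Q (top right to bottom left within each diagonal) and puts them back in column major order. A sequential sketch (our own naming), which reproduces Figure 17:

```python
def undiagonalize(Q):
    """Pick up the elements of the r x s matrix Q along its anti-diagonals
    (cells with i + j = d, for d = 0, 1, ...) and place them back into an
    r x s matrix in column major order."""
    r, s = len(Q), len(Q[0])
    seq = [Q[i][d - i]                      # diagonal d, top right first
           for d in range(r + s - 1)
           for i in range(max(0, d - s + 1), min(r, d + 1))]
    return [[seq[j * r + i] for j in range(s)] for i in range(r)]
```

Applied to the 9 x 3 input of Figure 17, this yields the column major matrix shown on the right of that figure.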
To get the SortTranspose algorithm for the PARBUS, we replace all occurrences of Bijk by Dij and of PE[k,1] by PE[i,1] in Figure 15. The formula of step 5 is changed to [(R-1) mod s1]r1 + (i-1)r1/s1 + ⌈R/s1⌉. The number of broadcasts used by the new SortTranspose algorithm is 16. The remaining three steps of Figure 13 are similar. The SortUndiagonalize takes 15 broadcasts as it begins with data in all rows and the step 1 (Figure 15) broadcast can be eliminated. The SortShift and SortUnshift each can be done in 14 broadcasts as the step 5 (Figure 15) broadcasts are unnecessary. So, the number of broadcasts in the one level PARBUS column sort algorithm is 59.
Di1, Di2, ..., stacked top to bottom; each Dij is an n^(1/3) x r1 sub PARBUS; Ai is n x r1 = n x n^(2/3)
Figure 18 Decomposing Ai into Dij's
5. Rotate Sort
Rotate sort was developed by Marberg and Gafni [MARB88] to sort MN numbers on an M x N mesh with the standard four neighbor connections. To state their algorithm we need to restate some of the definitions from [MARB88]. Assume that M = 2^s and N = 2^(2t) where s ≥ 2t. An M x N mesh can be tiled in a natural way with tiles of size M x N^(1/2). This tiling partitions the mesh into vertical slices (Figure 19(a)). Similarly, an M x N mesh can be tiled with N^(1/2) x N tiles to obtain horizontal slices (Figure 19(b)). Tiling by N^(1/2) x N^(1/2) tiles results in a partitioning into blocks (Figure 19(c)). Marberg and Gafni define three procedures on which rotate sort is based. These are given in Figure 20.
Rotate sort is comprised of the six steps given in Figure 21. Recall that a vertical slice is an M x N^(1/2) submesh; a horizontal slice is an N^(1/2) x N submesh; and a block is an N^(1/2) x N^(1/2) submesh.
Marberg and Gafni [MARB88] point out that when M = N, step 1 of rotate sort may be replaced by the following steps:
(a) Vertical Slices
(b) Horizontal Slices
(c) Blocks
Figure 19 Definitions of the slices and blocks
Step 1' (a) Sort all the columns downward;
(b) Sort all the rows to the right;
This simplifies the algorithm when it is implemented on a mesh. However, it does
not reduce the number of bus broadcasts needed on reconfigurable meshes with buses.
As a result we do not consider this variant of rotate sort.
6. Rotate Sort On An RMESH
In this section, we adapt the rotate sort algorithm of Figure 21 to sort n numbers on an n x n RMESH. Note that the algorithm of Figure 21 sorts MN numbers on an M x N mesh. For the adaptation, we use M = N. Hence, n = N^2 = 2^(4t). The n = N^2 numbers to
Procedure balance (v,w);
{Operate on submeshes of size v x w}
sort all columns of the submesh downward;
rotate each row i of the submesh i mod w positions right;
sort all columns of the submesh downward;
end;
(a) Balance a submesh

Procedure unblock;
{Operate on entire mesh}
rotate each row i of the mesh (i·N^(1/2)) mod N positions right;
sort all columns of the mesh downward;
end;
(b) Unblock

Procedure shear;
{Operate on entire mesh}
sort all even numbered rows to the right and all odd numbered rows to the left;
sort all columns downward;
end;
(c) Shear

Figure 20 Procedures from [MARB88]
be sorted are available, initially, in the Z variables of the row 1 PEs of the n x n RMESH. We assume a row major mapping from the N x N mesh to row 1 of the RMESH. Figure 22 gives the steps involved in sorting the columns of the N x N mesh downward. Note that this is the first step of the balance operation of step 1 of rotate sort. The basic strategy employed in Figure 22 is to use each n x N sub RMESH of the n x n RMESH to sort one column of the N x N mesh.
For this, we need to first extract the columns from row 1 of the RMESH. This is done in steps 1-3 of Figure 22. Following this, each row of the q'th n x N sub RMESH
Step 1: balance each vertical slice;
Step 2: unblock the mesh;
Step 3: balance each horizontal slice as if it were an N x N^(1/2) mesh lying on its side;
Step 4: unblock the mesh;
Step 5: shear three times;
Step 6: sort rows to the right;
Figure 21 Rotate Sort [MARB88]
contains the q'th column of the N x N mesh. Steps 4-6 implement the count phase of a count sort. This implementation is equivalent to that used in [JENQ91b] to sort m elements on an m x m x m RMESH. Steps 7 and 8 route the data back to row 1 of the RMESH so that the Z values in row 1 (and actually in all rows) correspond to the row major order of the N x N mesh following a sort of its columns. The total number of broadcasts used is 12 (note that step 6 uses 6 broadcasts).
The row rotation required by procedure balance can be obtained at no additional cost by changing the destination column computed in step 7 of Figure 22 so as to account for the rotation. The second sort of the columns performed in procedure balance can be done with 9 additional broadcasts. For this, during the first column sort of procedure balance, the step 7 broadcast of Figure 22 takes into account both the row rotation and the row major to column major transformation to be done in steps 1-3 of Figure 22 for the second column sort of procedure balance. So, step 1 of rotate sort (i.e., balancing the vertical slices) can be done using a total of 21 broadcasts.
To unblock the data, we need to rotate each row and then sort the columns downward. Once again, the rotation can be accomplished at no additional cost by modifying the destination column function used in step 7 of the second column sort performed during the vertical slice balancing of step 1. The column sort needs 9 broadcasts.
The horizontal slices can be balanced using 18 broadcasts as the Z data is already distributed over the columns. The unblock of step 4 takes as many broadcasts as the unblock of step 2 (i.e., 9).
The shear operation requires a row sort followed by a column sort. Row sorts are performed using the same strategy as used for a column sort. The fact that all elements of a row are in adjacent columns of the RMESH permits us to eliminate steps 1-3 of Figure
{Sort columns of the N x N mesh}
Step 1: Use column buses to broadcast Z[1,j] to all rows in column j, 1 ≤ j ≤ n. Now, Z[i,j] = Z[1,j], 1 ≤ i ≤ n, 1 ≤ j ≤ n.
Step 2: Use row buses to broadcast Z[i,i] to the R variable of all PEs on row i of the RMESH. Now, R[i,j] = Z[i,i] = Z[1,i], 1 ≤ i ≤ n, 1 ≤ j ≤ n.
Step 3: In the q'th n x N sub RMESH, all PEs [i,j] such that i mod N = q mod N and (j-1) mod N + 1 = ⌈i/N⌉ broadcast their R values along the column buses, 1 ≤ q ≤ N. Note that each such PE[i,j] contains the ⌈i/N⌉'th element of column q of the N x N mesh. This value is broadcast to the U variables of the PEs in the column. Now, each row of the q'th n x N sub RMESH contains in its U variables column q of the N x N mesh.
Step 4: Now assume that the n x n RMESH is tiled by N^2 N x N sub RMESHs. In the [a,b]'th such sub RMESH, the PEs on column a of the sub RMESH broadcast their U value using row buses local to the sub RMESH. This is stored in the V variable of each PE.
Step 5: PE [i,j] of the [a,b]'th sub RMESH sets its S value to 1 if (U < V) or (U = V and j < a); otherwise it sets its S value to 0.
Step 6: The sum of the S's in any one row of each of the N x N sub RMESHs is computed. The result for the [a,b]'th sub RMESH is stored in the T variable of PE [b,1].
Step 7: Using the row buses that span the n x n RMESH, the PE in position [b,1] of each N x N sub RMESH [a,b] sends its V value to the Z variable of the PE in column T·N + b.
Step 8: The received Z values are broadcast along column buses.
Figure 22 Sorting the columns of an N x N mesh
22. So, a row sort takes only 9 additional broadcasts. The following column sort uses 9
broadcasts. So, each application of shear takes 18 broadcasts. Since we need to shear
three times, step 5 of rotate sort uses 54 broadcasts. Step 6 of rotate sort is a row sort.
This takes 9 broadcasts. The total number of broadcasts is
21 + 9 + 18 + 9 + 54 + 9 = 120. This is 19 fewer than the number of broadcasts
used by our RMESH implementation of column sort.
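The ranking idea behind Figure 22 can be sketched sequentially in Python. This is a minimal sketch, not the RMESH program itself: the helper names rank_sort_column and sort_columns are ours, and the constant-time bus broadcasts and sub-RMESH parallelism are collapsed into ordinary loops so that only the rank-and-route logic remains.

```python
def rank_sort_column(col):
    """Sort one column the way the N x N sub-RMESHs of Figure 22 do:
    compute each element's rank by counting smaller elements (ties broken
    by original index) and route it directly to that position."""
    n = len(col)
    out = [None] * n
    for a, v in enumerate(col):  # sub-RMESH [a,b] ranks element a
        # Steps 5-6: S = 1 when U < V, or U == V with a smaller index.
        rank = sum(1 for j, u in enumerate(col)
                   if u < v or (u == v and j < a))
        # Step 7: element a goes to local slot `rank` (column T + 1).
        out[rank] = v
    return out

def sort_columns(mesh):
    """Apply the column sort to every column of an N x N row-major array."""
    n = len(mesh)
    cols = [rank_sort_column([mesh[i][j] for i in range(n)])
            for j in range(n)]
    return [[cols[j][i] for j in range(n)] for i in range(n)]
```

On the RMESH every rank is computed simultaneously, which is why the whole column sort costs only a constant number of broadcasts.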
7. Rotate Sort On A PARBUS
Our implementation of rotate sort on a PARBUS is the same as that on an RMESH. Note, however, that on a PARBUS, ranking (step 6 of Figure 22) takes only 3 broadcasts. Since this is done once for each row/column sort and since a total of 13 such sorts is done, 39 fewer broadcasts are needed on a PARBUS. Hence our PARBUS implementation of rotate sort takes 81 broadcasts. Recall that Leighton's column sort could be implemented on a PARBUS using only 59 broadcasts.
8. A Combined Sort
We may combine the first three steps of the iterative shear sort algorithm of Scherson et al. [SCHER89] with the last four steps of rotate sort to obtain the combined sort of Figure 23. This is stated for an N x N mesh using nearest neighbor connections. The number of elements to be sorted is N^2.
Step 1: Sort each N^(3/4) x N^(3/4) block;
Step 2: Shift the i'th row by (i*N^(3/4)) mod N to the right, 1 ≤ i ≤ N;
Step 3: Sort the columns downward;
Step 4: Balance each horizontal slice as if it were an N x N^(1/2) mesh lying on its side;
Step 5: Unblock the mesh;
Step 6: Shear two times;
Step 7: Sort rows to the right;
Figure 23 Combined Sort
Notice that steps 4-7 of Figure 23 differ from steps 3-6 of Figure 21 only in that in step 6 of Figure 23 the shear is done two times. The correctness of Figure 23 may be established using the results of [SCHER89] and [MARB88].
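The first three steps of Figure 23 can be sketched sequentially as follows. This is an illustrative sketch under our own naming, with the mesh held as a Python list of lists and the constant-time bus operations replaced by plain loops; the balance/unblock/shear phases (steps 4-7) are omitted.

```python
def sort_blocks(mesh, b):
    """Step 1: sort each b x b block (b = N^(3/4)) into row-major order."""
    n = len(mesh)
    for bi in range(0, n, b):
        for bj in range(0, n, b):
            vals = sorted(mesh[bi + i][bj + j]
                          for i in range(b) for j in range(b))
            for k, v in enumerate(vals):
                mesh[bi + k // b][bj + k % b] = v

def rotate_rows(mesh, b):
    """Step 2: shift row i cyclically by (i*b) mod N to the right."""
    n = len(mesh)
    for i in range(n):
        s = (i * b) % n
        if s:
            mesh[i] = mesh[i][-s:] + mesh[i][:-s]

def sort_columns_down(mesh):
    """Step 3: sort every column downward."""
    n = len(mesh)
    for j in range(n):
        col = sorted(mesh[i][j] for i in range(n))
        for i in range(n):
            mesh[i][j] = col[i]
```

The rotation of step 2 is what lets step 1's sorted blocks spread across all columns before the column sort of step 3, mirroring the no-extra-cost rotations used in our RMESH implementation.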
To implement the combined sort on an n x n RMESH or n x n PARBUS (n = N^2 elements to be sorted), we note that the column sort of step 3 can be done in the same manner as the column sorts of rotate sort are done. The shift of step 2 can be combined with the sort of step 1. The block sort of step 1 is done using submeshes of size n x n^(3/4) = N^2 x N^(3/2) = N^2 x (N^(3/4) * N^(3/4)). On a PARBUS, this is done by ranking in n^(1/4) x n^(3/4) as in [LIN92], while on an RMESH, this is done using the algorithm to sort a column of Q using an n x r submesh (Section 3). We omit the details here. The number of broadcasts is 77 for the PARBUS and 118 for the RMESH.
9. Comparison With Other O(1) PARBUS Sorts
As noted earlier, our PARBUS implementation of Leighton's column sort uses only 59 broadcasts, whereas our PARBUS implementation of rotate sort uses 81 broadcasts and our implementation of combined sort uses 77 broadcasts. The O(1) PARBUS sorting algorithm of Jang and Prasanna [JANG92] is also based on column sort. However, it is far more complex than our adaptation and uses more broadcasts than does the O(1) PARBUS algorithm of Lin et al. [LIN92]. So, we compare our algorithm to that of [LIN92]. This latter algorithm is not based on column sort. Rather, it is based on a multiple selection algorithm that the authors develop. This multiple selection algorithm is itself a simple modification of a selection algorithm for the PARBUS. This algorithm selects the k'th largest element of an unordered set S of n elements. The multiple selection algorithm takes as input an increasing sequence q_1 < q_2 < ... < q_j with 1 ≤ q_i ≤ n and reports, for each i, the q_i'th largest element of S. By selecting q_i = i*n^(2/3), 1 ≤ i ≤ n^(1/3), one is able to determine partitioning elements such that the set of n numbers to be sorted can be partitioned into n^(1/3) buckets each having n^(2/3) elements. Each bucket is then sorted using an n x n^(2/3) sub-PARBUS. Lin et al. [LIN92] were only concerned with developing a constant time algorithm to sort n numbers on an n x n PARBUS. Consequently, they did not attempt to minimize the number of broadcasts needed for a sort. However, we analyzed versions of their algorithms that were optimized by us. The optimized selection algorithm requires 84 broadcasts and the optimized sort algorithm requires 103 broadcasts. Thus our PARBUS implementation of Leighton's column sort uses slightly more than half the number of broadcasts used by an optimized version of Lin et al.'s algorithm. Furthermore, even if one were interested only in the selection problem, it would be faster to sort using our PARBUS implementation of Leighton's column sort algorithm and then select the k'th element than to use the optimized version of the PARBUS selection algorithm of [LIN92]. Our algorithm is also conceptually simpler than those of [JANG92] and [LIN92]. Like the algorithms of [JANG92] and [LIN92], our PARBUS algorithms may be run directly on an n x n MRN. The number of broadcasts remains unchanged.
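The bucket strategy underlying [LIN92]'s sort can be sketched sequentially as follows. This is a sketch under stated assumptions: the name bucket_sort is ours, a full stable ranking stands in for their multiple selection routine, and the parallel per-bucket sorts become ordinary calls to sorted.

```python
def bucket_sort(data):
    """Partition n elements at ranks i*n^(2/3) into n^(1/3) buckets of
    n^(2/3) elements each, then sort every bucket independently (on the
    PARBUS, each bucket would get its own n x n^(2/3) sub-PARBUS)."""
    n = len(data)
    b = round(n ** (1 / 3))            # number of buckets, n^(1/3)
    size = n // b                      # bucket size, n^(2/3)
    # Stable ranks play the role of the multiple selection output.
    order = sorted(range(n), key=lambda i: (data[i], i))
    buckets = [[data[i] for i in order[k * size:(k + 1) * size]]
               for k in range(b)]
    # Concatenating the sorted buckets yields the fully sorted sequence.
    return [v for bucket in buckets for v in sorted(bucket)]
```

The point of the decomposition is that once the partitioning elements are known, the buckets are independent, which is what makes the constant-time parallel bucket sorts possible.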
10. Sorting On An n^(1/2) x n^(1/2) x n^(1/2) Reconfigurable Mesh With Buses
Rotate sort works by sorting and/or shifting rows and columns of an N x N array. This algorithm may be implemented on a three dimensional N x N x N reconfigurable mesh with buses so as to sort N^2 elements in O(1) time. In other words, n = N x N elements are being sorted in O(1) time on an n^(1/2) x n^(1/2) x n^(1/2) RMB. Assume that we start with the n elements on the base of an N x N x N RMB. To sort row (column) i, we broadcast the row (column) to the i'th layer of N x N processors and sort it in O(1) time in this layer using the algorithms developed in this paper to sort N elements on an N x N RMB. The total number of such sorts required by rotate sort is 13 (i.e., O(1)) and the required shifts may be combined with the sorts.
We can extend this to obtain an algorithm to sort N^k numbers on a k+1 dimensional RMB with N^(k+1) processors in O(1) time. The RMB has an N x N x ... x N configuration and the N^k numbers are initially in the face with the (k+1)'st coordinate equal to zero. In the preceding paragraph we showed how to do this sort when k = 2. Suppose we have an algorithm to sort N^(l-1) numbers on an l dimensional N^l processor RMB in O(1) time. We can use this to sort N^l numbers on an N^(l+1) processor RMB in O(1) time by regarding the N^l numbers as forming an N^(l-1) x N array. In the terminology of Marberg and Gafni [MARB88], we have an M x N array with M = N^(l-1) ≥ N. To use rotate sort, we need to be able to sort the columns of this array (which are of size M); sort the rows (which are of size N); and perform shifts/rotations on the rows or subrows.
To do the column sort, we use l dimensional RMBs. The m'th such RMB consists of all processors with index of the form [i_1, i_2, ..., i_(l-1), m, i_(l+1)]. By assumption, this RMB can sort its N^(l-1) numbers in O(1) time. To sort the rows, we can use two dimensional RMBs. Each such RMB consists of processors that differ only in their last two dimensions (i.e., [a, b, c, ..., i_l, i_(l+1)]). This sort is done using the O(1) sorting algorithm for two dimensional RMBs. The row shifts and rotates are easily done in O(1) time using the two dimensional RMBs just described (actually most of these can be combined with one of the required sorts).
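The index arithmetic of the preceding paragraph can be sketched as follows. The helper names coords and column_subrmb are ours; the sketch only demonstrates how a position [r, c] of the N^(l-1) x N array maps to a processor of the (l+1)-dimensional RMB, and that the m'th column sub-RMB is the set of face processors whose l'th coordinate is fixed at m.

```python
def coords(r, c, l, N):
    """Map array position [r, c] to the processor holding it: the row
    index r is written base N across the first l-1 coordinates, the
    column index c becomes coordinate l, and the (l+1)'st coordinate is
    zero (the face holding the data)."""
    digits = []
    for _ in range(l - 1):
        digits.append(r % N)
        r //= N
    return tuple(digits[::-1]) + (c, 0)

def column_subrmb(m, l, N):
    """All face processors making up the m'th column sub-RMB: they vary
    in l coordinates, so N^l processors sort the column's N^(l-1) values."""
    return {coords(r, m, l, N) for r in range(N ** (l - 1))}
```

Because the column sub-RMBs are disjoint, all N columns can be sorted simultaneously, which is what keeps the recursion constant time at every level.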
11. Conclusions
We have developed relatively simple algorithms to sort n numbers on reconfigurable n x n meshes with buses. For the case of the RMESH, our algorithms are the first to sort in O(1) time. For the PARBUS, our algorithms are simpler than those of [JANG92] and [LIN92]. Our PARBUS column sort algorithm is the fastest of our algorithms for the PARBUS. It uses fewer broadcasts than does the optimized version of the selection algorithm of [LIN92]. Our PARBUS algorithms can be run on an MRN with no modifications. Since n x n reconfigurable meshes require n^2 area for their layout, our algorithms (as well as those of [JANG92] and [LIN92]) have an area-time-squared product AT^2 of O(n^2), which is the best one can hope for in view of the lower bound result AT^2 ≥ n^2 for the VLSI word model [LEIG85].
Using two dimensional meshes with buses, we are able to sort n elements in O(1) time using n^2 processors. Using higher dimensional RMBs, one can sort n numbers in O(1) time using fewer processors. In general, n = N^k numbers can be sorted in O(1) time using N^(k+1) = n^(1+1/k) processors in a k+1 dimensional configuration. While the same result has been shown for a PARBUS [JANG92b], our algorithm applies to an RMESH also.
12. References
[BENA91] Y. Ben-Asher, D. Peleg, R. Ramaswami, and A. Schuster, "The power of reconfiguration," Journal of Parallel and Distributed Computing, 13, 139-153, 1991.
[HORO90] E. Horowitz and S. Sahni, Fundamentals of Data Structures in Pascal, Third Edition, Computer Science Press, Inc., New York, 1990.
[JANG92] J. Jang and V. Prasanna, "An optimal sorting algorithm on reconfigurable meshes," International Parallel Processing Symposium, 1992.
[JENQ91a] J. Jenq and S. Sahni, "Reconfigurable mesh algorithms for image shrinking, expanding, clustering, and template matching," Proceedings 5th International Parallel Processing Symposium, IEEE Computer Society Press, 208-215, 1991.
[JENQ91b] J. Jenq and S. Sahni, "Reconfigurable mesh algorithms for the Hough transform," Proc. 1991 International Conference on Parallel Processing, The Pennsylvania State University Press, 34-41, 1991.
[JENQ91c] J. Jenq and S. Sahni, "Reconfigurable mesh algorithms for the area and perimeter of image components," Proc. 1991 International Conference on Parallel Processing, The Pennsylvania State University Press, 280-281, 1991.
[KNUT73] D. E. Knuth, The Art of Computer Programming, Vol. 3, Addison-Wesley, New York, 1973.
[LEIG85] T. Leighton, "Tight bounds on the complexity of parallel sorting," IEEE Trans. on Computers, C-34, 4, April 1985, 344-354.
[LI89a] H. Li and M. Maresca, "Polymorphic-torus architecture for computer vision," IEEE Trans. on Pattern Analysis & Machine Intelligence, 11, 3, 133-143, 1989.
[LI89b] H. Li and M. Maresca, "Polymorphic-torus network," IEEE Trans. on Computers, C-38, 9, 1345-1351, 1989.
[LIN92] R. Lin, S. Olariu, J. Schwing, and J. Zhang, "A VLSI-optimal constant time sorting on reconfigurable mesh," Proceedings of Ninth European Workshop on Parallel Computing, Madrid, Spain, 1992.
[MARB88] John M. Marberg and Eli Gafni, "Sorting in constant number of row and column phases on a mesh," Algorithmica, 3, 1988, 561-572.
[MARE89] M. Maresca and H. Li, "Connection autonomy in SIMD computers: A VLSI implementation," Journal of Parallel and Distributed Computing, 7, April 1989, 302-320.
[MILL88a] R. Miller, V. K. Prasanna Kumar, D. Reisis, and Q. Stout, "Data movement operations and applications on reconfigurable VLSI arrays," Proceedings of the 1988 International Conference on Parallel Processing, The Pennsylvania State University Press, pp 205-208.
[MILL88b] R. Miller, V. K. Prasanna Kumar, D. Reisis, and Q. Stout, "Meshes with reconfigurable buses," Proceedings 5th MIT Conference On Advanced Research In VLSI, 1988, pp 163-178.
[MILL88c] R. Miller, V. K. Prasanna Kumar, D. Reisis, and Q. Stout, "Image computations on reconfigurable VLSI arrays," Proceedings IEEE Conference On Computer Vision And Pattern Recognition, 1988, pp 925-930.
[MILL91a] R. Miller, V. K. Prasanna Kumar, D. Reisis, and Q. Stout, "Efficient parallel algorithms for intermediate level vision analysis on the reconfigurable mesh," Parallel Architectures and Algorithms for Image Understanding, Viktor K. Prasanna ed., 185-207, Academic Press, 1991.
[MILL91b] R. Miller, V. K. Prasanna Kumar, D. Reisis, and Q. Stout, "Image processing on reconfigurable meshes," From Pixels to Features II, H. Burkhardt ed., Elsevier Science Publishing, 1991.
[NAKA90] Koji Nakano, Toshimitsu Masuzawa, and Nobuki Tokura, "A fast sorting algorithm on a reconfigurable array," Technical Report COMP 90-69, 1990.
[SCHER89] Isaac D. Scherson, Sandeep Sen, and Yiming Ma, "Two nearly optimal sorting algorithms for mesh-connected processor arrays using shear-sort," Journal of Parallel and Distributed Computing, 6, 151-165, 1989.
[WANG90a] B. Wang and G. Chen, "Constant time algorithms for the transitive closure and some related graph problems on processor arrays with reconfigurable bus systems," IEEE Trans. on Parallel and Distributed Systems, 1, 4, 500-507, 1990.
[WANG90b] B. Wang, G. Chen, and F. Lin, "Constant time sorting on a processor array with a reconfigurable bus system," Information Processing Letters, 34, 4, 187-190, 1990.
[WEEM89] C. C. Weems, S. P. Levitan, A. R. Hanson, E. M. Riseman, J. G. Nash, and D. B. Shu, "The image understanding architecture," International Journal of Computer Vision, 2, 251-282, 1989.
