Sorting n^2 Numbers on n x n Meshes*
Madhusudan Nigam and Sartaj Sahni
University of Florida
Gainesville, FL 32611
Technical Report 9234
ABSTRACT
We show that by folding data from an n x n mesh onto an n x (n/k) submesh, sorting on the submesh, and finally unfolding back onto the entire n x n mesh, it is possible to sort on bidirectional and strict unidirectional meshes using a number of routing steps that is very close to the distance lower bound for these architectures. The technique may also be applied to reconfigurable bus architectures to obtain faster sorting algorithms.
Keywords and Phrases
Sorting, mesh architectures, distance lower bound
* This research was supported, in part, by the National Science Foundation under grant MIP-9103379.
1. Introduction
In this paper, we are concerned with sorting n^2 data elements using an n x n mesh connected parallel computer. The initial and final configurations have one data element in each of the n^2 processors (say in the A variable of each processor). In the final configuration the data elements are sorted in snakelike row-major order. This problem has been extensively studied for mesh architectures (see, e.g., [THOM77], [NASS79], [KUMA83], [LEIG85], [SCHN86], [SCHER89], [MARB88]). While all of these studies consider SIMD meshes, they differ in the permissible communication patterns. Thompson and Kung [THOM77] consider a strict unidirectional model in which all processors that simultaneously transfer data to a neighbor processor do so to the same neighbor. That is, all active processors transfer data to their north neighbor, or all to their south neighbor, etc. Using this model, Thompson and Kung [THOM77] have developed a sorting algorithm with complexity 6ntr + ntc + low order terms, where tr is the time needed to transfer one data element from a processor to its neighbor and tc is the time needed to compare two data elements that are in the same processor. In the sequel, we omit the low order terms.
In the bidirectional model, we assume there are two links between every pair (P,Q) of neighbors. As a result, P can send a data element to Q at the same time that Q sends one to P. Using this model, Schnorr and Shamir [SCHN86] have developed a sorting algorithm with complexity 3ntr + 2ntc. In the same paper, Schnorr and Shamir "prove" that 3n routing steps are needed by every sorting algorithm for an even stronger mesh model, in which each processor can read the entire contents of the memories of its (up to) four neighbors in time tr and the internal processor computations are free. For this stronger mesh model, they show that every sorting algorithm must take 3ntr time by first showing that input data changes in the lower right part of the mesh cannot affect the values in the top left corner processor until time equal to the distance, d, between the top left corner processor and the nearest of these lower right processors. Next, they argue that by changing the lower right input data, they can change the final position of the data in the top left
corner processor by distance n-1. As a result, the sort must take at least (d+n-2)tr time.
The fallacy is that in the first d-1 steps, we may have made several copies of the input data, and it may no longer be necessary to route the data from the top left corner processor to the final destination processor. In fact, we show that one can sort in 2ntr time using the stronger mesh model.
Park and Balasubramanian [PARK87, PARK90] have considered a related sorting problem for the strict unidirectional model. In this, the n^2 elements to be sorted are initially stored two to a processor on an n x n/2 mesh. The final sorted configuration also has two elements per processor. The time complexity of their algorithm is 4ntr + ntc. This represents an improvement of 2ntr over Thompson and Kung's algorithm provided the two element per processor input/output configuration is what we desire. If we desire the one element per processor configuration (for example, the data to be sorted is the result of a computation that produces this configuration and the sorted data is to be used for further computation), then it is necessary to first fold the data to get the two element per processor configuration, then sort using the algorithm of [PARK87, PARK90], and finally unfold to get the desired one element per processor final configuration. The folding can be done in (n/2)tr time as below (see also Figure 1(a)).
F1: The left n/4 columns shift their data n/4 columns to the right.
F2: The right n/4 columns shift their data n/4 columns to the left.
The unfolding can also be done in (n/2)tr time using the two steps:
U1: Unfold the n/4 columns labeled A in Figure 1(b).
U2: Unfold the n/4 columns labeled B in Figure 1(b).
[Figure 1: (a) folding, in which F1 and F2 move the outer n/4 + n/4 columns onto the middle n/2 columns; (b) unfolding, in which U1 and U2 move the column groups labeled A and B back out.]
Figure 1 Folding and Unfolding
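Representing one row of the mesh as a Python list with one element per processor, the logical effect of steps F1 and F2 can be sketched as follows. The helper name and the list-per-processor representation are our assumptions, and the code performs each shift in one go, whereas on the mesh F1 and F2 each take n/4 routing steps:

```python
def fold_row(row):
    """Fold a row of n elements (one per processor) into the middle n/2
    processors, two elements per processor (steps F1 and F2)."""
    n = len(row)
    assert n % 4 == 0
    q = n // 4
    mid = [[x] for x in row]                 # each processor's local list
    for j in range(q):                       # F1: left n/4 columns move
        mid[j + q].append(mid[j].pop())      #     n/4 columns to the right
    for j in range(3 * q, n):                # F2: right n/4 columns move
        mid[j - q].append(mid[j].pop())      #     n/4 columns to the left
    return mid[q:3 * q]                      # the middle n/2 processors

folded = fold_row(list(range(8)))
# -> [[2, 0], [3, 1], [4, 6], [5, 7]]
```

Each middle processor now holds its own element plus one guest, matching the two-element-per-processor input configuration of [PARK87, PARK90].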
To unfold A, we use the pipelined process described by the example of Figure 2. B is unfolded in a similar way. The total time for the sort is therefore 5ntr + ntc, which is ntr less than that of [THOM77]. The improvement is slightly more if we consider the (nonstrict) unidirectional model in which there is a single link between each pair of neighbor processors and data can be transferred, in parallel, along all links (however, if P and Q are neighbors, when P sends data to Q, Q cannot send to P). In this model, steps F1 and F2 of folding can be done in parallel. Similarly, we can do U1 and U2 in parallel. The total sort time now becomes 4.5ntr + ntc.
In section 2, we show that the Schnorr/Shamir algorithm of [SCHN86] can be modified to sort on unidirectional meshes using the same number of routes as above. The number of comparison steps is, however, larger. The modified Schnorr/Shamir algorithm of section 2 runs in 2.5ntr + 3ntc time on an n x n bidirectional mesh. The number of routes is therefore less than the 3n lower bound established in [SCHN86]. The algorithm of section 2 folds data onto an n x n/2 submesh. By folding onto smaller submeshes, i.e., onto n x n/k submeshes for k > 2, the number of routes can be reduced further. In fact, for bidirectional and strict unidirectional meshes, we can come very close to the distance lower bounds of 2n-2 and 4n-2, respectively. This is done in section 3. In section 4, we discuss the practical implications of the folding-unfolding approach to sorting. In section 5, we show how the sort algorithms of sections 2 and 3 may be simulated on n x n reconfigurable bus architectures.
[Figure 2: pipelined unfolding of one row across processors P0-P7. Initially the elements 0-7 are stored two per processor in the folded columns; in each step, every processor still holding an element in transit forwards it one processor toward its destination, and after the last step each of P0-P7 holds exactly one element, 0 through 7 in order.]
Figure 2 Unfolding one row of A
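The net effect of the unfold on one row can be sketched in Python; the helper name and the [resident, guest] pair representation are our assumptions, and the code computes final positions directly, whereas on the mesh the guests move one unit route per step in pipelined fashion:

```python
def unfold_row(mid):
    """Logical effect of unfolding one row: the middle n/2 processors
    each hold a pair [resident, guest]; residents stay put while the
    guests return to the outer n/4 + n/4 columns (steps U1 and U2)."""
    h = len(mid)                      # n/2 two-element processors
    n = 2 * h
    q = n // 4
    row = [None] * n
    for j in range(h):
        resident, guest = mid[j]
        row[q + j] = resident         # resident stays in mesh column q+j
        if j < q:
            row[j] = guest            # U1: guest shifts n/4 columns left
        else:
            row[j + 2 * q] = guest    # U2: guest shifts n/4 columns right
    return row

# middle processors after folding the row 0, 1, ..., 7:
assert unfold_row([[2, 0], [3, 1], [4, 6], [5, 7]]) == list(range(8))
```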
2. Modified Schnorr/Shamir Algorithm
The sorting algorithm of Schnorr and Shamir [SCHN86] is given in Figure 3. This algorithm uses the following terms and assumes that n = 2^(4q) for some integer q.
1. Block ... an n^(3/4) x n^(3/4) submesh formed by a natural tiling of an n x n mesh with n^(3/4) x n^(3/4) tiles (Figure 4(a)).
2. Vertical Slice ... an n x n^(3/4) submesh formed by a natural tiling of an n x n mesh with n x n^(3/4) tiles (Figure 4(b)).
3. Snake ... the 1 x n^2 vector along the snakelike order in the n x n mesh (Figure 4(c)).
4. Even-odd transposition sort [KNUT73] ... n elements are sorted in n steps. In the odd steps, each element in an odd position is compared with the element in the next even position and exchanged if larger. In the even steps, each element in an even position is compared with the element in the next odd position and exchanged if greater.
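As a point of reference, even-odd transposition sort can be sketched in Python; the function name is ours, and the sequential inner loop stands in for the compare-exchanges that a mesh performs in parallel within one step:

```python
def even_odd_transposition_sort(a):
    """Sort n elements in n steps [KNUT73].  The two classes of steps
    alternate between the two disjoint sets of neighbor pairs; within a
    step the pairs are independent, so applying them sequentially here
    gives the same result as applying them in parallel on the mesh."""
    a = list(a)
    n = len(a)
    for step in range(n):
        for i in range(step % 2, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a
```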
5. n^(1/4)-way unshuffle ... Let m = log2 n = 4q and k = m/4 = q. Data in column j of the mesh is moved to column j'. Let j_{m-1} ... j_1 j_0 be the binary representation of j. Then j_{k-1} ... j_0 j_{m-1} ... j_k is the binary representation of j'. The unshuffle distributes the n^(3/4) columns of each vertical slice to the n^(1/4) vertical slices in a round robin manner.
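Under the bit notation above, the destination column j' can be computed directly. This Python sketch (the helper name is ours) assumes n is a power of two with log2 n divisible by 4:

```python
def unshuffle_destination(j, n):
    """Destination column j' of column j under the n^(1/4)-way
    unshuffle: the k = (1/4) log2 n low-order bits of j are rotated to
    the top, so the n^(3/4) columns of each vertical slice are dealt
    out to the n^(1/4) vertical slices in round robin order."""
    m = n.bit_length() - 1          # m = log2 n; requires n = 2^(4q)
    assert 1 << m == n and m % 4 == 0
    k = m // 4
    low = j & ((1 << k) - 1)        # j_{k-1} ... j_0
    high = j >> k                   # j_{m-1} ... j_k
    return (low << (m - k)) | high  # j' = j_{k-1}..j_0 j_{m-1}..j_k

# for n = 16 (q = 1, k = 1) the columns alternate between the
# two vertical slices:
assert [unshuffle_destination(j, 16) for j in range(16)] == \
       [0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15]
```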
Step 1: Sort all the blocks into snakelike row-major order.
Step 2: Perform an n^(1/4)-way unshuffle along all the rows of the array.
Step 3: Sort all the blocks into snakelike row-major order.
Step 4: Sort all the columns of the array downwards.
Step 5: Sort all the vertical slices into snakelike row-major order.
Step 6: Sort all the rows of the array into alternating left-to-right and right-to-left order.
Step 7: Perform 2n^(3/4) steps of even-odd transposition sort along the snake.
Figure 3 Sorting algorithm of Schnorr and Shamir
The correctness of the algorithm is established in [SCHN86]. As pointed out in [SCHN86], steps 1, 3, 5, and 7 take O(n^(3/4)) time and are dominated by steps 2, 4, and 6. Steps 4 and 6 are done using n steps of even-odd transposition sort each, while step 2 is done by simulating even-odd transposition sort on the fixed input permutation π^(-1), where π is the desired unshuffle permutation. Steps 4 and 6 take ntr + ntc time each on a bidirectional mesh and 2ntr + ntc on a unidirectional (whether strict or not) one. Step 2 takes ntr time on a bidirectional mesh and 2ntr on a unidirectional one. The total time for the algorithm of Schnorr and Shamir is therefore 3ntr + 2ntc on a bidirectional mesh and 6ntr + 2ntc on a unidirectional (strict or nonstrict) mesh. On the stronger mesh model the time is 3ntr as tc is zero.
We may modify the algorithm of Schnorr and Shamir so that it works on an n x n/2 mesh with two elements per processor. In this modification, we essentially simulate the algorithm of Figure 3 using half as many processors. The run time for steps 1, 3, 5, and 7 increases but is still O(n^(3/4)). Step 6 takes (n/2)tr + ntc on a bidirectional mesh as
[Figure 4: (a) blocks, (b) vertical slices, (c) snakelike order, and (d) horizontal slices of an n x n mesh.]
Figure 4 Definitions of Blocks and Slices
data routing is needed for only half of the steps of even-odd transposition sort, and ntr + ntc on a unidirectional mesh. Step 4 takes 2ntr + 2ntc on a bidirectional mesh and 4ntr + 2ntc on a unidirectional mesh as, in each of the n steps of even-odd transposition sort, both the elements in a processor need to be routed to the neighbor processor. This can be reduced to ntr + 2ntc and 2ntr + 2ntc, respectively, by regarding the 2n elements in a processor column as a single column and sorting this into row-major order using 2n steps of even-odd transposition sort (Figure 5). Now, a single element from each processor is to be routed on each of n steps and no data routing is needed on the remaining n steps.
To establish the correctness of the sorting algorithm with step 4 changed to sort pairs of columns as though they were a single column, we use the zero-one principle [KNUT73]. This was also used by Schnorr and Shamir to establish the correctness of their unmodified algorithm. Here, we assume that the input data consists solely of zeroes and ones. Since we have not changed steps 1-3 of their algorithm, the proofs of parts 1, 2, and 3 of their Theorem 3 are still valid. We shall show that part 4 remains true following the execution of the modified step 4. Since the proofs of parts 5, 6, and 7 rely only on part 4 of the theorem, these are valid for the modified algorithm. Hence, the modified algorithm is correct.
8 2    3 1    1 2
6 5    6 2    3 4
3 4    7 4    5 6
7 1    8 5    7 8
(a) Processor column,   (b) Sort each column   (c) Sort as one column
    two elements per        downward               into row-major order
    processor
Figure 5 Modified Step 4
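Using Python's built-in sort as a stand-in for even-odd transposition sort, the transformation of Figure 5 can be checked directly:

```python
# Figure 5(a): a column of four processors, two elements each.
a = [[8, 2], [6, 5], [3, 4], [7, 1]]

# Figure 5(b): sort each of the two element columns downward.
cols = [sorted(c) for c in zip(*a)]
b = [list(p) for p in zip(*cols)]
assert b == [[3, 1], [6, 2], [7, 4], [8, 5]]

# Figure 5(c): regard the 2n elements as a single column and sort
# into row-major order.
flat = sorted(x for pair in a for x in pair)
c = [flat[2 * i: 2 * i + 2] for i in range(len(a))]
assert c == [[1, 2], [3, 4], [5, 6], [7, 8]]
```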
Define a horizontal slice to be an n^(3/4) x n submesh obtained by a natural tiling of an n x n mesh using tiles of size n^(3/4) x n (Figure 4(d)). Following step 3 of the sorting algorithm, the maximum difference d3 between the number of zeroes in any two columns of a horizontal slice is at most 2 [SCHN86]. We need to show (i) that following the modified step 4, the maximum difference d4' between the number of zeroes in any two columns (a column is an n x 1 vector with n elements) of the mesh is at most 2n^(1/4), and (ii) that the maximum difference between the number of zeroes in any two vertical slices is at most n/2. The proof of (ii) is the same as that in [SCHN86] as our modification of step 4 does not move data between vertical slices. For (i), consider any two columns of n elements each. If these two columns reside in the same column of processors, then the maximum difference between the number of zeroes in the two columns is 1, as the two columns have been sorted into row-major order regarding them as a single 2n element column. Suppose instead that the two element columns reside in two different processor columns.
These two processor columns contain four element columns A, B, C, D. Let a, b, c, d, respectively, be the number of zeroes in these four columns prior to the execution of the modified step 4. From the proof of Theorem 3, part 4 [SCHN86], we know that |x - y| <= 2n^(1/4), where x, y ∈ {a,b,c,d}.
Let a', b', c', d' be the number of zeroes in columns A, B, C, D following the modified step 4. It is easy to see that a' = ⌈(a+b)/2⌉, b' = ⌊(a+b)/2⌋, c' = ⌈(c+d)/2⌉, d' = ⌊(c+d)/2⌋. We need to show that |x - y| <= 2n^(1/4), where x, y ∈ {a',b',c',d'}.
Without loss of generality (wlog), we may assume that |b' - c'| = max{|x - y| : x, y ∈ {a',b',c',d'}} (note that when x, y ∈ {a',b'} or x, y ∈ {c',d'}, |x - y| <= 1) and that b' > c'. Then b' - c' <= ⌈(a+b)/2⌉ - ⌊(c+d)/2⌋. Again, wlog we may assume that c <= d. So, b' - c' <= ⌈(a+b)/2⌉ - c. Since b - c <= 2n^(1/4) and a - c <= 2n^(1/4), a + b <= 2c + 4n^(1/4). So, b' - c' <= c + 2n^(1/4) - c = 2n^(1/4).
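The ceiling/floor argument above can also be checked exhaustively for a small fixed bound; here B stands in for 2n^(1/4) and the values range over hypothetical zero counts:

```python
from itertools import product
from math import ceil, floor

# Check: if a, b, c, d pairwise differ by at most B, then so do
# a' = ceil((a+b)/2), b' = floor((a+b)/2),
# c' = ceil((c+d)/2), d' = floor((c+d)/2).
B = 4                                  # stands in for 2n^(1/4)
for a, b, c, d in product(range(2 * B + 1), repeat=4):
    if max(a, b, c, d) - min(a, b, c, d) > B:
        continue                       # hypothesis violated; skip
    new = (ceil((a + b) / 2), floor((a + b) / 2),
           ceil((c + d) / 2), floor((c + d) / 2))
    assert max(new) - min(new) <= B    # conclusion holds
```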
Our modified Schnorr/Shamir algorithm for n x n meshes is stated in Figure 6. It combines the folding and unfolding steps discussed in the introduction and the modified Schnorr/Shamir algorithm for n x n/2 meshes described above.
Step 0: Fold the leftmost and rightmost n/4 columns so that the n^2 elements to be sorted are in the middle n x n/2 submesh, with each processor in this submesh having two elements.
Step 1: Sort all n^(3/4) x n^(3/4) blocks of data into snakelike row-major order. Note that each block of data is in an n^(3/4) x (n^(3/4)/2) submesh.
Step 2: Perform an n^(1/4)-way unshuffle along each row of n elements. Note that each row of n elements is in a row of the middle n x n/2 submesh.
Step 3: Repeat step 1.
Step 4: Sort the 2n elements in each column of the middle n x n/2 submesh into row-major order.
Step 5: Sort the elements in each n x (n^(3/4)/2) submesh into snakelike row-major order.
Step 6: Sort all rows of the middle n x n/2 submesh into alternating left-to-right and right-to-left order.
Step 7: Perform 2n^(3/4) steps of even-odd transposition sort along the snake of the middle n x n/2 submesh.
Step 8: Unfold the middle n x n/2 submesh.
Figure 6 Sorting algorithm for n x n mesh.
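The overall fold/sort/unfold structure of Figure 6 can be modeled at a high level in Python. The function name is ours, a single sorted() call stands in for steps 1-7 (the sort of the middle submesh), and only the data movement and the final snakelike order are modeled, not the step-by-step routing:

```python
def snakelike_sort(grid):
    """High-level model of Figure 6: fold the outer n/4 + n/4 columns
    onto the middle n/2 columns (step 0), sort the folded data
    (stand-in for steps 1-7), and unfold into snakelike row-major
    order over the full n x n mesh (step 8)."""
    n = len(grid)
    q = n // 4
    # Step 0: two elements per processor in the middle n/2 columns
    middle = [[[row[q + j]] for j in range(2 * q)] for row in grid]
    for i, row in enumerate(grid):
        for j in range(q):
            middle[i][j].append(row[j])              # F1
            middle[i][q + j].append(row[3 * q + j])  # F2
    # Steps 1-7 (stand-in): sort all n^2 elements of the submesh
    flat = sorted(x for mrow in middle for proc in mrow for x in proc)
    # Step 8: unfold, writing snakelike row-major order
    out = [[None] * n for _ in range(n)]
    for idx, x in enumerate(flat):
        i, r = divmod(idx, n)
        out[i][r if i % 2 == 0 else n - 1 - r] = x   # snake alternates
    return out

grid = [[15, 14, 13, 12], [11, 10, 9, 8], [7, 6, 5, 4], [3, 2, 1, 0]]
assert snakelike_sort(grid) == [[0, 1, 2, 3], [7, 6, 5, 4],
                                [8, 9, 10, 11], [15, 14, 13, 12]]
```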
Theorem 1: The sorting algorithm of Figure 6 is correct.
Proof: Follows directly from our earlier discussion that established the correctness of steps 1 through 7 as a sorting algorithm for an n x n/2 mesh with two elements per processor. □
For the complexity analysis, we note that steps 1, 3, 5, and 7 take O(n^(3/4)) time and are dominated by the remaining steps, which take O(n) time each. Since we are ignoring low order terms, we need be concerned only with steps 0, 2, 4, 6, and 8. As noted earlier, steps 2, 4, and 6, respectively, take (n/2)tr, ntr + 2ntc, and (n/2)tr + ntc on a bidirectional mesh. Steps 0 and 8 each take (n/4)tr on a bidirectional mesh. So, the complexity of the sort algorithm on a bidirectional mesh is 2.5ntr + 3ntc. Since tc is zero on the stronger mesh model, the sort time for this model is 2.5ntr (note that the algorithm of [SCHN86] has a run time of 3ntr).
On a (nonstrict) unidirectional mesh, the times for steps 0, 2, 4, 6, and 8 are, respectively, (n/4)tr, ntr, 2ntr + 2ntc, ntr + ntc, and (n/4)tr. The total time for this model is therefore 4.5ntr + 3ntc. On a strict unidirectional mesh, the times for steps 0, 2, 4, 6, and 8 are, respectively, (n/2)tr, ntr, 2ntr + 2ntc, ntr + ntc, and (n/2)tr, and the total time is 5ntr + 3ntc.
3. Further Enhancements
In the stronger mesh model of Schnorr and Shamir, a processor can read the entire memory of all its neighbors in unit time. This implies that the routing time is independent of message length. Let Tr denote the time needed to send a message of arbitrary length to a neighbor processor. We may generalize the sorting algorithm of Figure 6 to the case when each processor has k elements. Now, to sort a row or column of data, we use neighborhood sort [BAUD78], which is a generalization of even-odd transposition sort. Suppose that m processors have k elements each. The mk elements are sorted in m steps. In the even steps, the elements in each even processor are merged with those in the next odd processor. The even processor gets the smaller k elements and the odd one the larger k. In the odd steps, the k elements in each odd processor are merged with the k elements in the next even processor. The smaller k elements remain with the odd processor and the larger k with the even one. Note that when k = 1, neighborhood sort is identical to even-odd transposition sort.
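A sequential sketch of neighborhood sort follows; the function name is ours, Python's sorted stands in for the k-element local sorts and the 2k-element merges, and the two pairings alternate as described above:

```python
def neighborhood_sort(procs):
    """Neighborhood sort [BAUD78]: m processors, k elements each; the
    mk elements are sorted in m merge steps.  In each step, every
    active pair of neighbors merges its 2k elements; the left processor
    of the pair keeps the k smaller elements and the right one the k
    larger.  With k = 1 this is even-odd transposition sort."""
    procs = [sorted(p) for p in procs]      # local sorts first
    m = len(procs)
    for step in range(m):
        for i in range(step % 2, m - 1, 2): # alternate the pairings
            merged = sorted(procs[i] + procs[i + 1])
            k = len(procs[i])
            procs[i], procs[i + 1] = merged[:k], merged[k:]
    return procs

assert neighborhood_sort([[9, 3], [7, 1], [8, 2], [6, 4]]) == \
       [[1, 2], [3, 4], [6, 7], [8, 9]]
```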
Our generalization of Figure 6 is given in Figure 7. We require that k be a power of 2. As in the case of the Schnorr/Shamir algorithm, we assume n = 2^(4q).
Theorem 2: The generalized sorting algorithm is correct.
Proof: Similar to that of Theorem 1. □
For the complexity analysis, we may again ignore the odd steps as these run in O(n^(3/4)) time for any fixed k. The folding and unfolding of steps 0 and 8 each take n(k-1)/(2k) Tr on bidirectional and nonstrict unidirectional meshes and n(k-1)/k Tr
Step 0: Fold the leftmost and the rightmost n(k-1)/(2k) columns into the middle n/k columns so that the n^2 elements to be sorted are in the middle n x n/k submesh, with each processor in this submesh having k elements. Sort the k elements in each processor of the middle n x n/k submesh.
Step 1: Sort all n^(3/4) x n^(3/4) blocks of data into snakelike row-major order. Note that each block of data is in an n^(3/4) x (n^(3/4)/k) submesh.
Step 2: Perform an n^(1/4)-way unshuffle along each row of n elements. Note that each row of n elements is in a row of the middle n x n/k submesh.
Step 3: Repeat step 1.
Step 4: Sort the kn elements in each column of the middle n x n/k submesh into row-major order. Use neighborhood sort.
Step 5: Sort the elements in each n x (n^(3/4)/k) submesh into snakelike row-major order.
Step 6: Sort all rows of the middle n x n/k submesh into alternating left-to-right and right-to-left order.
Step 7: Perform 2n^(3/4) steps of even-odd transposition sort along the snake of the middle n x n/k submesh.
Step 8: Unfold the middle n x n/k submesh.
Figure 7 Generalized sorting algorithm
on strict unidirectional meshes. The sorting of step 0 takes O(k log k) time. Step 2 takes (n/k)Tr on bidirectional meshes and (2n/k)Tr on unidirectional ones. Step 4 requires n steps of neighborhood sort. Each merge of two sets of k elements takes at most 2k tc time (actually, at most 2k-1 comparisons are needed). So, the step 4 time is nTr + 2kn tc for bidirectional meshes and 2nTr + 2kn tc for unidirectional meshes. The time for step 6 is (n/k)Tr + 2n tc for bidirectional meshes and (2n/k)Tr + 2n tc for unidirectional ones.
The total sorting time (ignoring the time to sort k elements in step 0) for a bidirectional mesh is therefore (2n + n/k)Tr + 2n(k+1)tc. For the model of Schnorr and Shamir [SCHN86], tc = 0 and the time becomes (2n + n/k)Tr. For large values of k, this approximates 2nTr. Since (2n-2)Tr is the distance lower bound for sorting on Schnorr and Shamir's model, the generalized sorting algorithm of Figure 7 is near optimal for large k. The sorting times on nonstrict and strict unidirectional meshes are (3n + 3n/k)Tr + 2n(k+1)tc and (4n + 2n/k)Tr + 2n(k+1)tc, respectively. Since (4n-2)Tr is the distance lower bound for the strict unidirectional model, our algorithm is near optimal for large k for this model too.
The s^2-way merge sorting algorithm of Thompson and Kung [THOM77] may be similarly generalized to sort n^2 elements stored k to a PE in an n x (n/k) mesh configuration. The resulting sort has a complexity that is almost identical to that of Figure 7. However, Figure 7 is conceptually much simpler.
4. Practical Implications
Suppose we are to sort n^2 elements that are initially distributed one to a PE on an n x n bidirectional mesh. The final configuration is also to have one element per PE. On realistic computers, the time to transfer small packets of data between adjacent processors is dominated by the setup time. For instance, the time to transfer N bytes of data between adjacent processors of an NCube-1 hypercube is (446.7 + 2.4N) ms. Therefore, it is reasonable to assume that the data transfer time is independent of packet size for small k and small element size. Furthermore, tR > tc on most commercial computers. For instance, tR = 40tc on the NCube-1. Table 1 gives the value of our complexity functions for the different bidirectional sort algorithms and different k. When tR = 40tc, we can expect to get the best performance using the algorithm of section 3 with k = 4. For tR = 10tc, the algorithm of section 2 is the best of the three. The original Schnorr/Shamir algorithm will perform best only when tR < 2tc.
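The entries of Table 1 can be recomputed from the three complexity functions; this Python sketch (function names are ours) expresses each entry as the coefficient of n in units of tc, with r = tR/tc:

```python
# Coefficients of n (in units of tc) for the three bidirectional
# algorithms, as functions of the ratio r = tR / tc.
def schnorr_shamir(r):
    return 3 * r + 2                       # 3n tR + 2n tc

def section_2(r):
    return 2.5 * r + 3                     # 2.5n tR + 3n tc

def section_3(r, k):
    return (2 + 1 / k) * r + 2 * (k + 1)   # (2n + n/k) tR + 2n(k+1) tc

for r in (40, 10, 5, 2):
    best_k = min(range(1, 7), key=lambda k: section_3(r, k))
    print(f"tR = {r:2d} tc:  S&S {schnorr_shamir(r):6.2f}n  "
          f"Sec.2 {section_2(r):6.2f}n  "
          f"Sec.3 best k={best_k}: {section_3(r, best_k):6.2f}n")
```

For r = 40 the minimum over k in 1..6 is attained at k = 4, matching the recommendation above.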
5. Reconfigurable Meshes with Buses
The sorting algorithms described here may be simulated by parallel computers in the reconfigurable mesh with buses (RMB) family. We consider only two members of the RMB family: RMESH and PARBUS.
In an RMESH [MILL88abc], we have a bus grid with an n x n arrangement of processors at the grid points (see Figure 8 for a 4 x 4 RMESH). Each grid segment has a switch on it which enables one to break the bus, if desired, at that point. When all switches are closed, all n^2 processors are connected by the grid bus. The switches around a processor can be set using local information. If all processors disconnect the switch on their north, then we obtain row buses (Figure 9). Column buses are obtained by having each processor disconnect the switch on its east (Figure 10). In the exclusive write
            Schnorr & Shamir  Section 2         Section 3: (2n + n/k) tR + 2n(k+1) tc
            3n tR + 2n tc     2.5n tR + 3n tc   k=1      k=2      k=3        k=4       k=5      k=6
tR = 40 tc  122n tc           103n tc           124n tc  106n tc  101.33n tc 100n tc   100n tc  100.67n tc
tR = 10 tc  32n tc            28n tc            34n tc   31n tc   31.33n tc  32.5n tc  34n tc   35.67n tc
tR =  5 tc  17n tc            15.5n tc          19n tc   18.5n tc 19.67n tc  21.25n tc 23n tc   24.83n tc
tR =  2 tc  8n tc             8n tc             10n tc   11n tc   12.67n tc  14.5n tc  16.4n tc 18.33n tc
Table 1 Comparison of bidirectional sort algorithms
model, two processors that are on the same bus cannot simultaneously write to that bus. In the concurrent write model, several processors may simultaneously write to the same bus. Rules are provided to determine which of the several writers actually succeeds (e.g., arbitrary, maximum, exclusive or, etc.). Notice that in the RMESH model it is not possible to simultaneously have n disjoint row buses and n disjoint column buses that, respectively, span the width and height of the RMESH. It is assumed that processors on the same bus can communicate in O(1) time. RMESH algorithms for fundamental data movement operations and image processing problems can be found in [MILL88abc, MILL91ab, JENQ91abc].
An n x n PARBUS (Figure 11) [WANG90a] is an n x n mesh in which the interprocessor links are bus segments and each processor has the ability to connect together arbitrary subsets of the four bus segments that connect to it. Bus segments that get so connected behave like a single bus. The bus segment interconnections at a processor are done by an internal four port switch. If the up to four bus segments at a processor are labeled N (North), E (East), W (West), and S (South), then this switch is able to realize any set A = {A_1, A_2} of connections, where A_i ⊆ {N,E,W,S}, 1 <= i <= 2, and the A_i's are disjoint. For example, A = {{N,S}, {E,W}} results in connecting the North and South segments together and the East and West segments together. If this is done in each processor, then we get, simultaneously, disjoint row and column buses (Figures 12 and 13). If A = {{N,S,E,W}}, then all four bus segments are connected. PARBUS algorithms for a
variety of applications can be found in [MILL91a, WANG90ab, LIN92, JANG92]. Observe that in an RMESH the realizable connections are of the form A = {A_1}, A_1 ⊆ {N,E,W,S}.

[Figure 8: a 4 x 4 RMESH. Processors sit at the grid points; each grid segment between processors carries a switch that can break the bus at that point.]
Figure 8 4x4 RMESH
The RMESH requires two steps to simulate a unit route along a row or column of a unidirectional mesh. The PARBUS can simulate such a route in one step using the bus configurations of Figures 12 and 13. Additionally, the folding and unfolding into/from n/k columns can be done in (k-1)tr time by having each group of k adjacent columns fold into the leftmost column in the group. Now, the n/k columns that contain the data are not adjacent. However, this does not result in any inefficiency when simulating steps 1-7, as row buses to connect the columns are easily established. With this modification to the algorithms of Figures 6 and 7, the sort time for an n x n RMESH becomes (8n + 2)tr + 3ntc and (4n + 8n/k + 2k - 2)Tr + 2n(k+1)tc, and for an n x n PARBUS becomes (4n + 2)tr + 3ntc and (2n + 4n/k + 2k - 2)Tr + 2n(k+1)tc.
Figure 9 Row buses
Figure 10 Column buses
6. Conclusion
We have shown that the lower bound for the number of routes needed to sort on an
n x n bidirectional mesh that was established in [SCHN86] is incorrect. Furthermore, we
have provided algorithms that sort using fewer routes than the lower bound of
[SCHN86]. In fact, the algorithm of section 3 is able to sort using a number of routes that
is very close to the distance lower bound for bidirectional as well as strict unidirectional
meshes.
Figure 11 4 x 4 PARBUS
The fold/sort/unfold algorithms developed here have practical implications for sorting on a mesh. The advantages of this technique when applied to computers with tR >> 2tc were pointed out in section 4. We also showed how to simulate the mesh algorithms on reconfigurable meshes with buses.
7. References
[BAUD78] G. Baudet and D. Stevenson, "Optimal sorting algorithms for parallel computers," IEEE Transactions on Computers, C-27, 1, Jan 1978, 84-87.
[JANG92] J. Jang and V. Prasanna, "An optimal sorting algorithm on reconfigurable meshes," International Parallel Processing Symposium, 1992.
[KUMA83] M. Kumar and D. S. Hirschberg, "An efficient implementation of Batcher's odd-even merge algorithm and its application in parallel sorting schemes," IEEE Transactions on Computers, C-32, 3, March 1983, 254-264.
[KNUT73] D. E. Knuth, The Art of Computer Programming, Vol. 3: Sorting and Searching, Addison-Wesley, Reading, MA, 1973.
[LEIG85] T. Leighton, "Tight bounds on the complexity of parallel sorting," IEEE Transactions on Computers, C-34, 4, April 1985, 344-354.
Figure 12 Row buses in a PARBUS
Figure 13 Column buses in a PARBUS

[LIN92] R. Lin, S. Olariu, J. Schwing, and J. Zhang, "A VLSI-optimal constant time sorting on reconfigurable mesh," Proceedings of the Ninth European Workshop on Parallel Computing, Madrid, Spain, 1992.
[MARB88] John M. Marberg and Eli Gafni, "Sorting in constant number of row and column phases on a mesh," Algorithmica, 3, 1988, 561-572.
[MILL88a] R. Miller, V. K. Prasanna Kumar, D. Reisis, and Q. Stout, "Data movement operations and applications on reconfigurable VLSI arrays," Proceedings of the 1988 International Conference on Parallel Processing, The Pennsylvania State University Press, 1988, 205-208.
[MILL88b] R. Miller, V. K. Prasanna Kumar, D. Reisis, and Q. Stout, "Meshes with reconfigurable buses," Proceedings of the 5th MIT Conference on Advanced Research in VLSI, 1988, 163-178.
[MILL88c] R. Miller, V. K. Prasanna Kumar, D. Reisis, and Q. Stout, "Image computations on reconfigurable VLSI arrays," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1988, 925-930.
[MILL91a] R. Miller, V. K. Prasanna Kumar, D. Reisis, and Q. Stout, "Efficient parallel algorithms for intermediate level vision analysis on the reconfigurable mesh," in Parallel Architectures and Algorithms for Image Understanding, Viktor K. Prasanna, ed., Academic Press, 1991, 185-207.
[NASS79] D. Nassimi and S. Sahni, "Bitonic sort on a mesh-connected parallel computer," IEEE Transactions on Computers, C-28, 1, Jan 1979, 2-7.
[PARK87] A. Park and K. Balasubramanian, "Improved sorting algorithms for parallel computers," Proceedings of the 1987 ACM Computer Science Conference, Feb 1987, 239-244.
[PARK90] A. Park and K. Balasubramanian, "Reducing communication costs for sorting on mesh-connected and linearly connected parallel computers," Journal of Parallel and Distributed Computing, 9, 1990, 318-322.
[SCHER89] Isaac D. Scherson, Sandeep Sen, and Yiming Ma, "Two nearly optimal sorting algorithms for mesh-connected processor arrays using shear sort," Journal of Parallel and Distributed Computing, 6, 1989, 151-165.
[SCHN86] C. P. Schnorr and A. Shamir, "An optimal sorting algorithm for mesh connected computers," Proceedings of the 18th ACM Symposium on Theory of Computing, May 1986, Berkeley, CA, 255-263.
[THOM77] C. D. Thompson and H. T. Kung, "Sorting on a mesh connected parallel computer," Communications of the ACM, 20, 4, April 1977, 263-271.
[WANG90a] B. Wang and G. Chen, "Constant time algorithms for the transitive closure and some related graph problems on processor arrays with reconfigurable bus systems," IEEE Transactions on Parallel and Distributed Systems, 1, 4, 1990, 500-507.
[WANG90b] B. Wang, G. Chen, and F. Lin, "Constant time sorting on a processor array with a reconfigurable bus system," Information Processing Letters, 34, 4, 1990, 187-190.
