Reconfigurable Mesh Algorithms For Fundamental
Data Manipulation Operations
Jing-Fu Jenq and Sartaj Sahni
Department of Soil Science, University of Minnesota, Minneapolis, MN 55455, USA
Computer and Information Sciences Department, CSE 301, University of Florida, Gainesville, FL 32611, USA
University of Florida Technical Report 91031
Abstract: Reconfigurable mesh (RMESH)
algorithms for several fundamental operations are developed. These operations include data
broadcast, prefix sum, data sum, ranking, shift, data accumulation, consecutive sum, adjacent
sum, sorting, random access read, and random access write.
Keywords:
reconfigurable mesh computer, parallel algorithms, data manipulation.
1 Introduction
Recently, several similar reconfigurable mesh (RMESH) architectures have been proposed
[MILL88abc, LI89ab, BEN90]. It has been demonstrated that these architectures are often very easy
to program and that in many cases it is possible to obtain constant time algorithms that use a polynomial
number of processors for problems that are not so solvable using the PRAM model [BEN90,
MILL88a, JENQ91b, WANG90ab]. For instance, the parity of n bits can be found in O(1) time on a
reconfigurable mesh with $n^2$ processors while it takes $\Omega(\log n / \log\log n)$ time to do this on every
CRCW PRAM with a polynomial number of processors [BEAM87]. Furthermore, the O(1) time
RMESH algorithm is fairly simple.
Because of the power and ease of programming of this model, it is interesting to explore the
potential application of this model to various application areas. Some initial work in this regard has
already been done [LI89a, MILL88c, MILL91ab, JENQ91abc, WANG90ab].
In this paper, we consider most of the fundamental parallel processing data manipulation operations
identified in [RANK90] and develop efficient RMESH algorithms for these. This should simplify the
task of developing application programs for the RMESH. We begin, in Section 2, by describing the
RMESH model that we use.
* This research was supported in part by the National Science Foundation under grants DCR-8420935 and
MIP-8617374.
2 RMESH Model
The particular reconfigurable mesh architecture that we use in this paper is due to Miller, Prasanna
Kumar, Resis and Stout [MILL88abc]. This variant employs a reconfigurable bus to connect together
all processors. Figure 1 shows a 4x4 RMESH. By opening some of the switches, the bus may be
reconfigured into smaller buses that connect only a subset of the processors.
[Diagram not reproduced; its legend distinguishes processors, switches, and links.]
Figure 1 4x4 RMESH
The important features of an RMESH are [MILL88abc]:
1 An NxM RMESH is a 2-dimensional mesh connected array of processing elements (PEs). Each
PE in the RMESH is connected to a broadcast bus which is itself constructed as an NxM grid.
The PEs are connected to the bus at the intersections of the grid. Each processor has up to four
bus switches (Figure 1) that are software controlled and that can be used to reconfigure the bus
into subbuses. The ID of each PE is a pair (i,j) where i is the row index and j is the column
index. The ID of the upper left corner PE is (0,0) and that of the lower right one is (N-1,M-1).
2 The up to four switches associated with a PE are labeled E (east), W (west), S (south) and N
(north). Notice that the east (west, north, south) switch of a PE is also the west (east, south,
north) switch of the PE (if any) on its right (left, top, bottom). Two PEs can simultaneously set
(connect, close) or unset (disconnect, open) a particular switch as long as the settings do not
conflict. The broadcast bus can be subdivided into subbuses by opening (disconnecting) some
of the switches.
3 Only one processor can put data onto a given subbus at any time.
4 In unit time, data put on a subbus can be read by every PE connected to it. If a PE is to broadcast
a value in register I to all of the PEs on its subbus, then it uses the command broadcast(I).
5 To read the content of the broadcast bus into a register R the statement R := content(bus) is
used.
6 Row buses are formed if each processor disconnects (opens) its S switch and connects (closes)
its E switch. Column buses are formed by disconnecting the E switches and connecting the S
switches.
7 Diagonalize a row (column) of elements is a command to move the specific row (column) elements
to the diagonal position of a specified window which contains that row (column). This is
illustrated in Figure 2.
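[Figure 2 panels: (a) the elements 1, 3, 5, 4, 2 in the 4th row of the window; (b) the same elements
in the 1st column; (c) the elements after the diagonalize command, one per diagonal position.]
Figure 2 Diagonalize 4th row or 1st column elements of a 5x5 window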
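The data movement performed by the diagonalize command can be sketched in Python as follows. This is only an illustration of where the values end up: the function names `diagonalize_row` and `diagonalize_column`, the list-of-lists window, and the `None` fill for non-diagonal positions are assumptions, and the bus operations themselves are not modeled.

```python
def diagonalize_row(window, r):
    """Return a new w x w grid whose diagonal holds row r of `window`."""
    w = len(window)
    out = [[None] * w for _ in range(w)]
    for i in range(w):
        out[i][i] = window[r][i]  # the element in column i moves to PE (i, i)
    return out


def diagonalize_column(window, c):
    """Return a new w x w grid whose diagonal holds column c of `window`."""
    w = len(window)
    out = [[None] * w for _ in range(w)]
    for i in range(w):
        out[i][i] = window[i][c]  # the element in row i moves to PE (i, i)
    return out
```

Calling `diagonalize_row(window, 3)` on a 5x5 window corresponds to panel (a) of Figure 2, and `diagonalize_column(window, 0)` to panel (b).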
The model described above differs from those of [LI89ab] and [BEN90]. The polymorphic
torus of [LI89ab] differs from the RMESH model just described in the following respects:
(1) It has row and column wraparound connections.
(2) The switches are not placed on bus segments as in Figure 1. Rather, each processor has four bus
segments entering. These segments are connected to a four input switch local to the processor.
This switch is able to connect together arbitrary subsets of the input bus segments.
The reconfigurable networks of [BEN90] do not have the wraparound connections of the
polymorphic torus. However, their switching mechanism is similar. Only a subset of the possible
connections obtainable from the polymorphic torus switch are permitted.
3 Fundamental Data Manipulation Operations
3.1 Window Broadcast
The data to be broadcast is initially in the A variable of the PEs in the top left wxw submesh. These
PEs have ID (0,0) .. (w-1,w-1). The data is to tile the whole mesh in such a way that
A(i,j) = A(i mod w, j mod w) (A(i,j) denotes register A of the PE with ID (i,j)). The algorithm for
this is given in Figure 3. Its complexity is O(w) and is independent of the size of the RMESH.
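The following Python sketch mirrors the structure of the window broadcast: one pass per window column, each pass consisting of a diagonalize, a column broadcast, and a row broadcast. It is a sequential simulation under the assumption that the mesh is a list of lists; the name `window_broadcast` is ours, and the buses are modeled simply by reading the value that the broadcasting PE would place on them.

```python
def window_broadcast(A, w):
    """Tile an N x N grid with its top-left w x w window, one window column per pass."""
    N = len(A)
    A = [row[:] for row in A]          # work on a copy of the mesh
    for j in range(w):                 # broadcast column j of the window
        # diagonalize: B(i,i) = A(i,j), 0 <= i < w
        diag = [A[i][j] for i in range(w)]
        # column buses: PE (k, k mod w) reads what PE (k mod w, k mod w) broadcast
        B = [diag[k % w] for k in range(N)]
        # row buses: PE (k, k mod w) broadcasts; PEs (k, i) with i mod w == j read
        for k in range(N):
            for i in range(j, N, w):
                A[k][i] = B[k]
    return A
```

The outer loop runs w times and each pass uses a constant number of broadcasts, which matches the O(w) bound stated above.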
3.2 Prefix Sum
Assume that $N^2$ values $A_0, A_1, \ldots, A_{N^2-1}$ are initially distributed in the A variables of an NxN RMESH
such that $A(i,j) = A_{iN+j}$, $0 \le i,j < N$. PE (i,j) is to compute a value Sum(i,j) such that
$$\mathrm{Sum}(i,j) = \sum_{k=0}^{iN+j} A_k, \quad 0 \le i,j < N$$
procedure WindowBroadcast(A,w);
{ broadcast the A values in the upper left wxw submesh }
begin
  for j := 0 to w-1 do { broadcast column j of the submesh }
  begin
    diagonalize the A variables in column j of the wxw submesh so that
      B(i,i) = A(i,j), 0 <= i < w;
    set switches to form column buses;
    PE (i,i) broadcasts its B value on column bus i, 0 <= i < w;
    B(k, k mod w) := content(bus), 0 <= k < N;
    set switches to form row buses;
    PE (k, k mod w) broadcasts its B value on its row bus, 0 <= k < N;
    A(k,i) := content(bus) for i mod w = j, and 0 <= k < N;
  end;
end;
Figure 3 Window broadcast
An O(logN) algorithm for computing these prefix sums is given in [MILL88a]. First consider the case
of obtaining the prefix sum of the N elements on any row of the RMESH. This can be done in O(logN)
time by using buses that are confined to the rows of the RMESH. We start with row buses of size one
and then double the size of each bus on each iteration (Figure 4). This is done by combining two
adjacent row buses on each iteration. Each processor Z has a value P which is the sum of the A values of
all processors that are currently on the same bus and are not to the right of Z. Initially, P = A for every
processor. When two adjacent buses are combined, the rightmost processor of the left bus broadcasts
its value (which is the sum of the values of all elements on the left bus) to the right bus. The
processors on the right bus add the value read to obtain their new P values. After $\lceil \log_2 N \rceil$
iterations, the P value in a processor is the sum of the A values in processors in the same row but not
to its right. Hence, we have computed the row prefix sums.
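The bus-doubling scheme for row prefix sums can be simulated sequentially as below. This is a sketch under stated assumptions: the name `row_prefix_sums` is ours, the variables `a` and `b` play the same role as in Figure 4, and the bus is modeled by looking up the value of the single PE that broadcasts on each merged bus.

```python
def row_prefix_sums(A):
    """Prefix sums of every row by repeatedly merging adjacent row buses."""
    N = len(A[0])
    P = [row[:] for row in A]
    a, b = 1, 2
    while a < N:                       # ceil(log2 N) iterations
        for row in P:
            snapshot = row[:]          # all PEs act in the same time step
            for j in range(N):
                if j % b >= a:         # PE lies on the right half of a merged bus
                    sender = j - (j % b) + a - 1   # rightmost PE of the left half
                    row[j] += snapshot[sender]
        a, b = b, 2 * b
    return P
```

Each pass of the `while` loop corresponds to one broadcast per merged bus, so the number of passes is the quantity that determines the O(logN) running time.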
procedure RowPrefixSums(A,P,N);
{ Compute the prefix sums for each row of the NxN RMESH }
begin
  open all switches;
  a := 1; b := 2;
  P(i,j) := A(i,j), 0 <= i,j < N;
  for k := 1 to ⌈log2 N⌉ do
  begin
    All PEs (i,j) with j mod b = a connect their west switch;
    All PEs (i,j) with (j+1) mod b = a broadcast their P value;
    All PEs (i,j) with j mod b >= a read their bus and
      increase their P value by the value read from the bus;
    a := b; b := 2*b;
  end;
end;
Figure 4 Row prefix sum
The prefix sum can now be obtained by performing the following three steps:
Step 1: Perform a row prefix sum on all rows of the RMESH. Let the resulting prefix values be
stored in the P variables of the PEs.
Step 2: Perform a column prefix sum on the P values in the rightmost column of the RMESH. Let
these be stored in the Q variables of these PEs. Let R(i,N-1) := Q(i,N-1) - P(i,N-1).
Step 3: Broadcast the R(i,N-1) values on row buses. Each PE adds the value read from its bus to
its current P value. The new P value is the desired prefix sum for this PE.
Step 2 is very similar to a row prefix sum and also takes O(logN) time. Step 3 takes O(1) time.
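The three steps can be checked with a short Python sketch. It is an illustration only: the name `mesh_prefix_sum` is ours, `itertools.accumulate` stands in for the row and column prefix sum procedures, and the subtraction in Step 2 follows the reading R(i,N-1) = Q(i,N-1) - P(i,N-1).

```python
from itertools import accumulate


def mesh_prefix_sum(A):
    """Row-major prefix sums of an N x N grid via the three steps above."""
    N = len(A)
    # Step 1: row prefix sums P
    P = [list(accumulate(row)) for row in A]
    # Step 2: column prefix sums Q of the rightmost column, then R = Q - P
    Q = list(accumulate(P[i][N - 1] for i in range(N)))
    R = [Q[i] - P[i][N - 1] for i in range(N)]
    # Step 3: every PE on row i adds R(i, N-1) to its P value
    return [[P[i][j] + R[i] for j in range(N)] for i in range(N)]
```

Here R(i,N-1) is the sum of all values on rows 0 through i-1, which is exactly what each PE on row i is missing after Step 1.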
3.3 Data Sum
Initially, each PE of the NxN RMESH has an A value. Each PE is to sum up the A values of all the $N^2$
PEs and put the result in its B variable. I.e., following the data sum operation we have:
$$B(i,j) = \sum_{k=0}^{N-1} \sum_{l=0}^{N-1} A(k,l), \quad 0 \le i,j < N$$
This can be done in O(logN) time by first performing a prefix sum [MILL88a] and then having PE
(N-1,N-1) broadcast Sum(N-1,N-1) to the remaining PEs in the RMESH. For this, all switches can
be closed.
3.4 Ranking
Consider the linear ordering of the $N^2$ PEs defined by row major order. I.e., PE [i,j] is in position
iN+j of this ordering. Assume that each PE has a Boolean variable selected. If selected(i,j) is
true then rank(i,j) is the number of PEs with selected true that precede it in the defined linear
ordering. If selected(i,j) is false, then rank(i,j) is undefined. We consider two cases for this operation.
3.4.1 $N^2$ Processors and $N^2$ elements
In this case, ranking can be done in O (logN) time using the prefix sum algorithm. PE [i,j] sets
A (i,j) to 0 if selected (i,j) is false and to 1 otherwise. The prefix sum of A's is computed. If
selected (i,j) is true then rank (i,j) is the prefix sum in PE [i,j] less 1.
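As a small illustration of ranking by prefix sums, the sketch below computes the rank of every selected PE in row major order. The name `rank_selected` and the use of `None` for the undefined ranks are assumptions made for this example, and `itertools.accumulate` stands in for the RMESH prefix sum.

```python
from itertools import accumulate


def rank_selected(selected):
    """Row-major rank of each selected PE; None where selected is false."""
    cols = len(selected[0])
    # A(i,j) is 1 for selected PEs and 0 otherwise
    flat = [1 if s else 0 for row in selected for s in row]
    prefix = list(accumulate(flat))
    # rank = prefix sum less 1, defined only for selected PEs
    return [[prefix[i * cols + j] - 1 if s else None
             for j, s in enumerate(row)]
            for i, row in enumerate(selected)]
```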
3.4.2 $N^2$ processors and N elements
Suppose that all the PEs with selected(i,j) = true are on row 0 (i.e., selected(i,j) = false for i > 0).
In this case rank(0,j), $0 \le j < N$, can be computed using the three steps of
Figure 5. An example is given in Figure 6.
Step 1 [ rank even columns ]
Compute r(0,j) for j even where r is defined as
$r(0,j) = |\{q \mid q \text{ is even and selected}(0,q) \text{ and } q \le j\}|$
Step 2 [ rank odd columns ]
Compute r(0,j) for j odd where r is defined as
$r(0,j) = |\{q \mid q \text{ is odd and selected}(0,q) \text{ and } q \le j\}|$
Step 3 [ combine ]
$$\mathrm{rank}(0,j) = \begin{cases} r(0,0) - 1, & j = 0 \\ r(0,j) + r(0,j-1) - 1, & j > 0 \end{cases}$$
Figure 5 $N^2$ processor algorithm for ranking
j               0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
selected        F  T  T  F  F  T  T  T  F  T  T  T  T  F  T  F
r(0,j), even j  0     1     1     2     2     3     4     5
r(0,j), odd j      1     1     2     3     4     5     5     5
rank           -1  0  1  1  1  2  3  4  4  5  6  7  8  8  9  9
Figure 6 $N^2$ processor ranking example
The algorithms for Steps 1 and 2 are similar. So we describe the algorithm only for Step 1. To
compute r(0,j), for even j, we set the bus switches as in Figure 7 (a) in case selected(0,j) is true and
as in Figure 7 (b) in case it is not. The switch settings are similar to those used to compute the
exclusive or of 1's in [MILL88]. In this figure e denotes an even index and o an odd index. So, (e,j)
denotes all PEs [i,j] with even i. Note that since j is even (e,j) is equivalent to (e,e) and (e,j+1) to
(e,o). Solid lines indicate connected (closed) switches; blanks indicate disconnected (open) switches.
Figure 8 shows an example switch setting for the case N = 8. As can be seen, the switch setting
scheme of Figure 7 results in several disjoint buses. The bus of interest is the one that includes
PE [0,0]. This bus includes a PE from each column j such that j is even and selected(0,j) is true.
Furthermore, this bus moves down by one row at each such j and by another row at j+1. The bus does
not move down at any other column. Hence if the bus with PE [0,0] contains PE [i,j], i and j are
even, and selected(0,j) is true, then r(0,j) = i/2+1. The algorithm to implement this strategy is given
in Figure 9. Its complexity is readily seen to be O(1). As mentioned earlier, the algorithm for Step 2
is similar. Step 3 simply requires a rightward shift of 1 which can be easily done in O(1) time. Hence
the entire ranking can be done in O(1) time.
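The three-step ranking strategy can be checked against the example of Figure 6 with the sketch below. It is a sequential illustration only: the name `rank_row` is ours, and the counts r(0,j) are obtained by direct counting (taking q <= j in both definitions, which is what the example values indicate) rather than by the switch settings of Figure 7.

```python
def rank_row(selected):
    """Ranks of row 0 from the even-column and odd-column counts r(0,j)."""
    n = len(selected)
    r = [0] * n
    for parity in (0, 1):              # Step 1 (even j), then Step 2 (odd j)
        count = 0
        for j in range(parity, n, 2):
            count += 1 if selected[j] else 0
            r[j] = count               # selected q with the parity of j and q <= j
    # Step 3: combine each column with its left neighbour
    return [r[j] + (r[j - 1] if j > 0 else 0) - 1 for j in range(n)]
```

For the selected pattern of Figure 6 the selected columns receive the ranks 0 through 9 in left to right order; the values returned for unselected columns carry no meaning, since rank is undefined there.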
[Figure 7 panels: (a) settings for selected(0,j) = true; (b) settings for selected(0,j) = false. Each
panel shows the switches of PEs (e,j), (e,j+1), (o,j) and (o,j+1); the settings are listed in Step 6
of Figure 9.]
Figure 7 Switch settings to compute r(0,j) for j even
[Diagram not reproduced.]
Figure 8 Example switch setting
{ Compute r(0,j) for j even }
Step 1 t(0,j) := selected(0,j), 0 <= j < N
Step 2 set up column buses
Step 3 broadcast t(0,j) on column bus j, 0 <= j < N
Step 4 t(i,j) := content(bus), 0 <= i,j < N
Step 5 { send t(i,j) for j even to t(i,j) for j odd }
  all PEs [i,j] with j even disconnect their N, S, W switches and connect their E
  switch.
  all PEs [i,j] with j even broadcast t(i,j)
  all PEs [i,j] with j odd set t(i,j) to their bus content
Step 6 { set switches as in Figure 7 }
  if t(i,j) then case (i,j) of
    (odd,odd),(even,even): PE [i,j] disconnects its E switch and connects its S switch
    else PE [i,j] connects its E switch and disconnects its S switch
  endcase
  else case i of
    odd : PE [i,j] disconnects its E and S switches
    else : PE [i,j] connects its E switch and disconnects its S switch
  endcase
Step 7 PE [0,0] broadcasts a special value on its bus
Step 8 All PEs [i,j] with i and j even read their bus. If the special value is read,
  then they set their S value to true and r value to i/2 + 1.
Step 9 Set up column buses
Step 10 PE [i,j] puts its r value on its bus if S(i,j) is true
Step 11 r(0,j) := content(bus), j even
Figure 9 RMESH algorithm to compute r(0,j) for j even
3.5 Shift
Each PE has data in its A variable that is to be shifted to the B variable of a processor that is s, s > 0,
units to the right but on the same row. Following the shift, we have
$$B(i,j) = \begin{cases} \text{null}, & j < s \\ A(i,j-s), & j \ge s \end{cases}$$
A circular shift variant of the above shift requires
$$B(i,j) = A(i, (j-s) \bmod N)$$
Let us examine the first variant first. This can be done in O(s) time by dividing the send and
receive processor pairs ((i,j-s), (i,j)) into s+1 equivalence classes as below:
$$\text{class } k = \{((i,j-s), (i,j)) \mid (j-s) \bmod (s+1) = k\}$$
The send and receive pairs in each class can be connected by disjoint buses and so we can accomplish
the shift of the data in the send processors of each class in O(1) time. In O(s) time all the classes can
be handled. The algorithm is given in Figure 10. The number of broadcasts is s+1. The procedure is
easily extended to handle the case of left shifts. Assume that s < 0 denotes a left shift by |s| units on
the same row. This can also be done with |s|+1 broadcasts.
A circular shift of s can be done in O(s) time by first performing an ordinary shift of s and then
shifting A(i,N-s), ..., A(i,N-1) left by N-s. The latter shift can be done by first shifting A(i,N-s),
then A(i,N-s+1), ..., and finally A(i,N-1). The exact number of broadcasts is 2s+1.
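The class-by-class shift can be sketched for a single row as follows. This is a sequential illustration in which the name `shift_right` and the returned broadcast counter are assumptions; each pass of the outer loop stands for one time step in which all buses of class k are used at once.

```python
def shift_right(row, s):
    """Shift one row right by s > 0 using s + 1 rounds, one per class k."""
    n = len(row)
    B = [None] * n                     # positions j < s keep the null value
    broadcasts = 0
    for k in range(s + 1):
        # senders of class k: j mod (s+1) == k; each owns the bus j .. j+s,
        # so the buses of one class never overlap
        for j in range(k, n - s, s + 1):
            B[j + s] = row[j]          # the receiver is s units to the right
        broadcasts += 1                # one time step for the whole class
    return B, broadcasts
```

Because consecutive senders of a class are s+1 positions apart and each bus spans s+1 processors, the buses of one class are disjoint, which is why one broadcast per class suffices.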
procedure Shift (s,A,B)
{ Shift from A(i,j) to B(i,j+s), s > 0 }
begin
  All PEs disconnect their N and S switches;
  for k := 0 to s do { shift class k }
  begin
    PE (i,j) disconnects its E switch if (j-s) mod (s+1) = k;
    PE (i,j) disconnects its W switch and broadcasts
      A(i,j) if j mod (s+1) = k;
    B(i,j) := content(bus) for every PE (i,j) with (j-s) mod (s+1) = k;
  end;
end;
Figure 10 Shifting by s, s > 0
Circular shifts of s, s > N/2, can be accomplished more efficiently by performing a shift of
-(N-s) instead. For $s \le N/2$, we observe that data from PEs (i,0), (i,1), ..., (i,s-1) need to be sent
to PEs (i,s), (i,s+1), ..., (i,2s-1), respectively. So, by limiting the data movement to within rows,
s pieces of data need to use the bus segment between PE (i,s-1) and (i,s). This takes O(s) time. If
only the data on one row of the NxN RMESH is to be shifted, the shifting can be done in O(1) time by
using each row to shift one of the elements. The circular shift operation can be extended to shift in
1xW row windows or Wx1 column windows. Let RowCircularShift(A,s,W) and ColumnCircularShift(A,s,W),
respectively, be procedures that shift the A values by s units in windows of size 1xW and
Wx1. Let $A^{in}$ and $A^{f}$, respectively, denote the initial and final values of A. Then, for
ColumnCircularShift we have
$$A^{in}(i,j) = A^{f}(q,j)$$
where PEs (i,j) and (q,j) are, respectively, the a = i mod W'th and b = q mod W'th PEs in the same
Wx1 column window and b = (a-s) mod W. The strategy of Figure 10 is easily extended so that
RowCircularShift and ColumnCircularShift are done using 2s+1 broadcasts.
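The effect of a windowed circular shift can be illustrated by the sketch below, which models only the final placement of the data. It assumes the relation between the window positions is read as b = (a - s) mod W, that the number of rows is a multiple of W, and the name `column_circular_shift` is ours.

```python
def column_circular_shift(A, s, W):
    """Move the row at window position a to position b = (a - s) mod W."""
    out = [row[:] for row in A]
    for i, row in enumerate(A):
        base = i - i % W               # first row of the W x 1 window containing i
        q = base + (i % W - s) % W     # destination row inside the same window
        for j, value in enumerate(row):
            out[q][j] = value
    return out
```

With W equal to the number of rows the whole column is rotated as one window; with a smaller W each window is rotated independently.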
3.6 Data Accumulation
In this operation PE (i,j) initially has a value I(i,j), $0 \le i,j < N$. Each PE is required to accumulate M
values in its array A as specified below:
$$A[q](i,j) = I(i, (j+q) \bmod N), \quad 0 \le q < M$$
This can be done using 2M-1 broadcasts. The algorithm is given in Figure 11.
procedure Accumulate (A,I,M)
{ each PE accumulates in A, the next M I values }
PE (i,j) disconnects its S switch and connects its W switch, 0 <= i,j < N;
begin
  { accumulate from the right }
  for k := 0 to M-1 do
  begin
    { PEs (i,j) with j mod M = k broadcast to PEs
      on their left that need their I value }
    PE (i,j) disconnects its E switch if j mod M = k
      and then broadcasts I(i,j);
    A[(k + M - (j mod M)) mod M](i,j) := content(bus);
  end;
  { accumulate from the left }
  Each PE (i,j) disconnects its S switch and connects its W switch, 0 <= i,j < N;
  for k := 0 to M-2 do
  begin
    PE (i,k) broadcasts I(i,k), 0 <= i < N;
    A[q+k](i,N-q) := content(bus), 1 <= q < M-k;
  end;
end;
Figure 11 Data accumulation
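The two phases of data accumulation can be simulated for one row as follows. This is a sketch under stated assumptions: the name `accumulate_row` is ours, N is taken to be at least M, and each pass of an outer loop stands for one broadcast step on the row buses.

```python
def accumulate_row(I, M):
    """Each PE j of one row collects A[q] = I[(j + q) mod N], 0 <= q < M."""
    N = len(I)
    A = [[None] * M for _ in range(N)]
    broadcasts = 0
    # accumulate from the right: PEs with j mod M == k broadcast leftwards
    for k in range(M):
        for src in range(k, N, M):
            # the bus reaches back to the previous broadcaster, M columns away
            for j in range(max(src - M + 1, 0), src + 1):
                A[j][src - j] = I[src]
        broadcasts += 1
    # accumulate from the left: PE k serves the wraparound entries at the row end
    for k in range(M - 1):
        for q in range(1, M - k):
            A[N - q][q + k] = I[k]
        broadcasts += 1
    return A, broadcasts
```

The first loop contributes M broadcasts and the second M-1, which gives the 2M-1 total.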
3.7 Consecutive Sum
Assume that an NxN RMESH is tiled by 1xM blocks (M divides N) in a natural manner with no
blocks overlapping. So, processor (i,j) is the j mod M'th processor in its block. Each processor (i,j)
of the RMESH has an array X[0..M-1](i,j) of values. If j mod M = q, then PE (i,j) is to compute
S(i,j) such that
$$S(i,j) = \sum_{r=0}^{M-1} X[q](i, (j \,\mathrm{div}\, M) M + r)$$
That is, the q'th processor in each block sums the q'th X value of the processors in its block. The
consecutive sum operation is performed by having each PE in a 1xM block initiate a token that will
accumulate the desired sum for the processor to its right and in its block. More specifically, the token
generated by the q'th PE in a block will compute the sum for the (q+1) mod M'th PE in the block,
$0 \le q < M$. The tokens are shifted left circularly within their 1xM block until each token has visited
each PE in its block and arrived at its destination PE. The algorithm is given in Figure 12. The
number of broadcasts is 3M-3 as each row circular shift of 1 takes 3 broadcasts.
procedure ConsecutiveSum (X,S,M);
{ Consecutive Sum of X in 1xM blocks }
begin
  S(i,j) := X[((j mod M)+1) mod M](i,j), 0 <= i,j < N;
  for k := 2 to M do
  begin
    { circularly shift S in 1xM blocks and add terms }
    RowCircularShift (S, M, 1);
    S(i,j) := S(i,j) + X[((j mod M)+k) mod M](i,j), 0 <= i,j < N;
  end;
end;
Figure 12 Consecutive sums in 1xM blocks
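The token-passing scheme for consecutive sums can be simulated for one row as below. In this sketch `X[j][q]` denotes X[q] of the PE in column j, the name `consecutive_sum` is ours, and the circular left shift within each 1xM block is done by list indexing rather than by bus operations.

```python
def consecutive_sum(X, M):
    """X[j][q] is X[q] of PE j on one row; returns S with one sum per PE."""
    N = len(X)
    # the token created at PE j starts with the term for block position (j mod M) + 1
    S = [X[j][(j % M + 1) % M] for j in range(N)]
    for k in range(2, M + 1):
        # circular left shift by 1 inside each 1 x M block
        S = [S[j - j % M + (j % M + 1) % M] for j in range(N)]
        # each PE adds its own term for the token it now holds
        S = [S[j] + X[j][(j % M + k) % M] for j in range(N)]
    return S
```

The loop performs M-1 shifts, in agreement with the 3M-3 broadcast count when each shift costs 3 broadcasts.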
3.8 Adjacent Sum
We consider two forms of this operation: row adjacent sum and column adjacent sum. In each, PE
(i,j) begins with an array X[0..M-1](i,j) of values. In a row adjacent sum, PE (i,j) is to compute
$$S(i,j) = \sum_{q=0}^{M-1} X[q](i, (j+q) \bmod N), \quad 0 \le i,j < N$$
While in a column adjacent sum it is to compute
$$S(i,j) = \sum_{q=0}^{M-1} X[q]((i+q) \bmod N, j), \quad 0 \le i,j < N$$
Since the algorithms for both are similar, we discuss only the one for row adjacent sum. The strategy
is similar to that for consecutive sum. Each processor initiates a token that will accumulate the
desired sum for the processor that is M-1 units to its left. That is, PE (i,j) initiates the token that will
eventually have the desired value of S(i, (N+j-M+1) mod N), $0 \le i,j < N$. The tokens are shifted
left circularly 1 processor at a time until they reach their destination PE. The details of the algorithm
are given in Figure 13. As each circular shift by 1 requires 3 broadcasts, the algorithm of Figure 13
requires 3(M-1) broadcasts.
procedure RowAdjacentSum (S,X,M);
begin
  S(i,j) := X[M-1](i,j);
  for k := M-2 down to 0 do
  begin
    RowCircularShift (S, N, 1);
    S(i,j) := S(i,j) + X[k](i,j);
  end;
end;
Figure 13 Row adjacent sum
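A sequential sketch of the row adjacent sum for one row is given below. Here `X[j][q]` denotes X[q] of the PE in column j, the name `row_adjacent_sum` is ours, and the sketch assumes the row adjacent sum is taken over columns (j+q) mod N; the circular left shift by one processor is again done by list indexing.

```python
def row_adjacent_sum(X, M):
    """One row: S[j] = sum of X[(j + q) mod N][q] for 0 <= q < M."""
    N = len(X)
    S = [X[j][M - 1] for j in range(N)]            # every PE starts a token
    for k in range(M - 2, -1, -1):
        S = [S[(j + 1) % N] for j in range(N)]     # circular left shift by 1
        S = [S[j] + X[j][k] for j in range(N)]     # add the local k'th term
    return S
```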
3.9 Sorting
$N^2$ elements, one per processor, can be sorted in O(N) time on an NxN RMESH by simulating the
O(N) sorting algorithm for ordinary mesh computers [NASS79]. That O(N) is optimal for an
RMESH can be seen by considering the amount of data that might need to cross the boundary
between the left N/2 columns and the right N/2 columns. This is $N^2/2$ in the worst case. The
bandwidth of this boundary is N. Hence O(N) time is needed to accomplish this data transfer. Miller
et al. [MILL88a] present an O(logN) sorting algorithm for the case when N elements are to be sorted
on an RMESH with $N^2$ processors. The initial and final configuration has the data in row 0 of the
NxN RMESH. N elements can be sorted on an $N^2 \times N$ RMESH in O(1) time using our O(1) ranking
algorithm and count sort [HORO90]. The algorithm is given in Figure 14.
procedure Sort(A)
{ Sort A(0,j), 0 <= j < N on an N^2xN RMESH }
begin
  set up column buses;
  PE [0,j] broadcasts A(0,j) on its bus, 0 <= j < N;
  A(i,j) := content(bus), 0 <= i < N^2, 0 <= j < N;
  set up row buses;
  PE [kN,k] broadcasts A(kN,k), 0 <= k < N;
  B(i,j) := content(bus), 0 <= i < N^2, 0 <= j < N;
  { now A(kN,j) = A(0,j) and B(kN,j) = A(0,k) }
  if (A(kN,j) < B(kN,j)) or (A(kN,j) = B(kN,j) and j <= k)
    then selected(kN,j) := true
    else selected(kN,j) := false;
  rank the processors in row kN using the NxN block of PEs beginning at row kN, 0 <= k < N;
  select the rank of the rightmost processor in row kN with selected(kN,j) = true; this is
    broadcast to variable r of PE [kN,N-1];
  PE [kN,N-1] sends r(kN,N-1) to PE [kN,r(kN,N-1)] along a row bus,
    this PE now broadcasts the B value to PE [0,r(kN,N-1)] along a column bus to complete the
    sort;
end;
Figure 14 Sorting with N^3 PEs
Each NxN block of PEs is used to obtain the count value for one of the elements to be sorted.
The k'th block from the top obtains the count for A(0,k). This is done by first using the N PEs in row
zero of this block (i.e., the PEs [kN,j], $0 \le j < N$) to compare A(0,j) and A(0,k). If A(0,j) is to come
before A(0,k) after the sort, selected(kN,j) is set to true, otherwise it is set to false. The number of
elements that come to the left of A(0,k) in the sorted order is now the rank of the rightmost processor
in row kN with selected value true. First, the processors in row kN are ranked using the $N^2$ processors
in rows kN through kN+N-1, $0 \le k < N$. To determine the rank of the rightmost selected processor in
row kN, each processor in this row with selected value true disconnects its W, N, and S switches.
Then all processors with selected value true broadcast their rank. PEs [kN,N-1], $0 \le k < N$, read their
bus and the value read is the rank of the rightmost selected processor in row kN. Let this be
r(kN,N-1). Now, it is known that A(0,k) is to be the (r(kN,N-1)+1)'th element in sorted order. This
rearrangement is accomplished by first having PE [kN,N-1] route its r value to PE [kN,r(kN,N-1)]
and then having this PE route the B value along a column bus to PE [0,r(kN,N-1)].
The sorting algorithm just described can be coupled with the column sort of [LEIG85] to obtain
an O (1) time algorithm to sort N elements on an NxN RMESH [LIN91, JANG91, NIGA91 ].
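The count sort idea behind this sorting scheme can be sketched sequentially as follows. The name `rank_sort` is ours, each pass of the loop plays the role of one NxN block, and ties are broken by taking j <= k in the comparison, an assumption that makes the rank of the rightmost selected processor equal to the final position of A(0,k).

```python
def rank_sort(A):
    """Sort N values by sending A[k] to the position given by its count."""
    N = len(A)
    out = [None] * N
    for k in range(N):                 # block k handles A[k]
        # selected[j] is true if A[j] comes before A[k], or if j == k
        selected = [A[j] < A[k] or (A[j] == A[k] and j <= k) for j in range(N)]
        # rank of the rightmost selected PE = (number selected) - 1
        r = sum(selected) - 1
        out[r] = A[k]                  # A[k] is the (r+1)'th element in sorted order
    return out
```

Since ties are broken by index, every element receives a distinct position and the sort is stable.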
3.10 RAR And RAW
The random access read (RAR) and random access write (RAW) operations are defined in [NASS81].
In a RAR each PE has a read address associated with it. This is the address of the PE whose A variable
it wishes to read. In a RAW each PE has a write address which is the address of the PE to which
it wishes to send the value of its A variable. Conflicts may be resolved arbitrarily. Miller et al.
[MILL88a] have developed RMESH algorithms for RARs and RAWs. When k data items are to be
moved in the RAR or RAW, their algorithm takes O(k + logN) time, $k \le N^2$. If the number of
source and destination processors in each kxk block of PEs is O(k), $1 \le k \le N$, then their algorithm
takes O(logN) time.
When the source and destination processors are all on a single row of an NxN RMESH, then
RARs and RAWs can be done in O(1) time. The RAR and RAW algorithms are given in Figures 15
and 16, respectively. These assume that the source and destination processors are all on row 0 of the
RMESH.
Step 1: Diagonalize the row 0 data.
Step 2: The diagonal PEs broadcast their data to all processors on their row using row buses.
Step 3: The row 0 PEs that need to read data broadcast the source PE column index to all processors
on their column using column buses.
Step 4: The processors that have a match between the source indexes just received and their row
index broadcast their data to the row 0 PE on their column using column buses.
Figure 15 O(1) time RAR
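The four steps of the RAR can be traced with the sketch below. It is a sequential illustration in which the name `rar_row0` and the use of `None` for PEs that do not read are assumptions; the grid `held` records what every PE holds after Steps 1 and 2.

```python
def rar_row0(data, read_addr):
    """data[j]: A value of PE (0,j); read_addr[j]: source column or None."""
    N = len(data)
    # Steps 1-2: after diagonalizing and row broadcasts, every PE on row i holds data[i]
    held = [[data[i]] * N for i in range(N)]
    result = [None] * N
    for j in range(N):                 # Steps 3-4, one column bus per column j
        src = read_addr[j]
        if src is not None:
            # PE (src, j) sees that its row index matches and answers on column j
            result[j] = held[src][j]
    return result
```

Several PEs may read the same source without conflict, because each reader is answered on its own column bus.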
Step 1: All row 0 PEs that wish to write use column buses to broadcast the data and column index
of the destination PE. This information is saved only by PEs whose row index is the same as
the destination column index.
Step 2: To resolve write conflicts, row buses are established. Each PE that has saved data from the
last step disconnects its row bus at its right. Then all processors with saved data from step 1
broadcast this data on their row bus. The column 0 processors read this data.
Step 3: Diagonalize the column 0 data.
Step 4: The diagonal PEs use column buses to broadcast their data to the row 0 PE on their column.
Figure 16 O(1) time RAW
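The RAW steps can be traced in the same way. In this sketch the name `raw_row0` is ours and `None` marks PEs that do not write or are not written to. The conflict rule is our reading of Step 2: the saving PE nearest to column 0 is the only broadcaster on the bus segment that reaches column 0, so the writer with the smallest column index wins.

```python
def raw_row0(data, write_addr):
    """data[j], write_addr[j]: value and destination column of PE (0,j), or None."""
    N = len(data)
    col0 = [None] * N                  # what PE (d, 0) reads in Step 2
    has = [False] * N
    for j in range(N):                 # increasing j: the leftmost writer wins
        d = write_addr[j]
        if d is not None and not has[d]:
            col0[d], has[d] = data[j], True
    # Steps 3-4: column 0 data is diagonalized and sent up column buses to row 0,
    # so entry d of the result is the value that arrives at PE (0, d)
    return col0
```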
4 Conclusions
We have considered many of the fundamental data manipulation operations identified in [RANK90]
and shown how these can be performed efficiently on a reconfigurable mesh parallel computer. These
operations are useful in the development of efficient parallel algorithms. In [JENQ91abc], we have
used these operations to arrive at efficient reconfigurable mesh algorithms for several problems that
arise in the image processing area.
5 References
[BEAM87] P. Beame and J. Hastad, "Optimal bounds for decision problems on the CRCW
PRAM", Proc. 19th ACM Symp. on Theo. of Computing, 83-93, 1987.
[BEN90] Y. Ben-Asher, D. Peleg, R. Ramaswami, and A. Schuster, "The power of
reconfiguration," Research Report, The Hebrew University, Israel, 1990.
[HORO90] E. Horowitz and S. Sahni, Fundamentals of data structures in Pascal, Third Edition,
Computer Science Press, Inc., New York, 1990.
[JANG91] J. Jang and V. Prasanna, "An optimal sorting algorithm on reconfigurable meshes",
University of Southern California, Technical Report IRIS 277, 1991.
[JENQ91a] J. Jenq and S. Sahni, "Reconfigurable mesh algorithms for image shrinking, expanding,
clustering, and template matching," Proceedings 5th International Parallel Processing
Symposium, IEEE Computer Society Press, 208-215, 1991.
[JENQ91b] J. Jenq and S. Sahni, "Reconfigurable mesh algorithms for the Hough transform," Proc.
1991 International Conference on Parallel Processing, The Pennsylvania State
University Press, 34-41, 1991.
[JENQ91c] J. Jenq and S. Sahni, "Reconfigurable mesh algorithms for the area and perimeter of
image components," Proc. 1991 International Conference on Parallel Processing, The
Pennsylvania State University Press, 280-281, 1991.
[LEIG85] T. Leighton, "Tight bounds on the complexity of parallel sorting", IEEE Trans. on
Computers, C-34, 4, April 1985, pp 344-354.
[LI89a] H. Li and M. Maresca, "Polymorphic-torus architecture for computer vision," IEEE
Trans. on Pattern & Machine Intelligence, 11, 3, 133-143, 1989.
[LI89b] H. Li and M. Maresca, "Polymorphic-torus network", IEEE Trans. on Computers, C-38,
9, 1345-1351, 1989.
[LIN91] R. Lin, S. Olariu, J. Schwing, and J. Zhang, "Sorting in O(1) time on an nxn
reconfigurable mesh", Technical Report, Old Dominion University, Virginia.
[MILL88a] R. Miller, V. K. Prasanna Kumar, D. Resis and Q. Stout, "Data movement operations
and applications on reconfigurable VLSI arrays", Proceedings of the 1988 International
Conference on Parallel Processing, The Pennsylvania State University Press, pp
205-208.
[MILL88b] R. Miller, V. K. Prasanna Kumar, D. Resis and Q. Stout, "Meshes with reconfigurable
buses", Proceedings 5th MIT Conference On Advanced Research In VLSI, 1988, pp
163-178.
[MILL88c] R. Miller, V. K. Prasanna Kumar, D. Resis and Q. Stout, "Image computations on
reconfigurable VLSI arrays", Proceedings IEEE Conference On Computer Vision And
Pattern Recognition, 1988, pp 925-930.
[MILL91a] R. Miller, V. K. Prasanna Kumar, D. Resis and Q. Stout, "Image processing on
reconfigurable meshes," in From Pixels to Features II, Elsevier Science, H. Burkhardt,
ed., 1991.
[MILL91b] R. Miller, V. K. Prasanna Kumar, D. Resis and Q. Stout, "Efficient parallel algorithms
for intermediate-level vision analysis on the reconfigurable mesh," in Parallel
Algorithms and Architectures for Image Understanding, ed. V. Prasanna Kumar, Academic
Press, 1991.
[NASS79] D. Nassimi and S. Sahni, "Bitonic sort on a mesh connected parallel computer", IEEE
Transactions on Computers, vol C-27, no. 1, Jan. 1979, pp 2-7.
[NASS81] D. Nassimi and S. Sahni, "Data broadcasting in SIMD computers", IEEE Transactions
on Computers, vol C-30, no. 2, Feb. 1981, pp 101-107.
[NIGA91] M. Nigam and S. Sahni, "On the equivalence of certain reconfigurable mesh models",
Technical Report, University of Florida, 1991.
[RANK90] S. Ranka and S. Sahni, Hypercube Algorithms with Applications to Image Processing
and Pattern Recognition, Springer Verlag, 1990.
[SIEG81] H. J. Siegel, L. Siegel, F. C. Kemmerer, P. T. Muller, H. E. Smalley, and S. D. Smith
"PASM: A partitionable SIMD/MIMD system for image processing and pattern
recognition", IEEE Transactions on Computers, vol. C-30, no. 12, Dec. 1981, pp 934-947.
[WANG90a] B. Wang and G. Chen, "Constant time algorithms for the transitive closure and some
related graph problems on processor arrays with reconfigurable bus systems," IEEE
Trans. on Parallel and Distributed Systems, 1, 4, 500-507, 1990.
[WANG90b] B. Wang, G. Chen, and F. Lin, "Constant time sorting on a processor array with a
reconfigurable bus system," Info. Proc. Letrs., 34, 4, 187-190, 1990.
