
Full Citation 
Material Information 

Title: 
Basic operations on the OTIS-Mesh optoelectronic computer 

Series Title: 
Department of Computer and Information Science and Engineering Technical Reports 

Physical Description: 
Book 

Language: 
English 

Creator: 
Wang, Chih-fang; Sahni, Sartaj 

Affiliation: 
University of Florida; University of Florida 

Publisher: 
Department of Computer and Information Science and Engineering, University of Florida 

Place of Publication: 
Gainesville, Fla. 

Copyright Date: 
1997 
Record Information 

Bibliographic ID: 
UF00095413 

Volume ID: 
VID00001 

Source Institution: 
University of Florida 

Holding Location: 
University of Florida 

Rights Management: 
All rights reserved by the source institution and holding location. 

Basic Operations On The OTIS-Mesh Optoelectronic Computer*
Chih-fang Wang and Sartaj Sahni
Department of Computer and Information Science and Engineering
University of Florida
Gainesville, FL 32611
{wang,sahni}@cise.ufl.edu
Abstract
In this paper we develop algorithms for some basic operations ( broadcast, window broadcast, prefix sum, data sum, rank, shift, data accumulation, consecutive sum, adjacent sum, concentrate, distribute, generalize, sorting, and random access read and write ) on the OTIS-Mesh [10] model. These operations are useful in the development of efficient algorithms for numerous applications [8].
1 Introduction
The Optical Transpose Interconnection System ( OTIS ), proposed by Marsden et al. [4], is a hybrid
optical and electronic interconnection system for large parallel computers. The OTIS architecture
employs free space optics to connect distant processors and electronic interconnect to connect
nearby processors. Specifically, to maximize bandwidth and power efficiency, and to minimize system
area and volume [1], the processors of an N² processor OTIS computer are partitioned into N
groups of N processors each. Each processor is indexed by a tuple (G, P), 0 ≤ G, P < N, where
G is the group index ( i.e., the group the processor is in ) and P the processor index within a
group. The inter-group interconnects are optical while the intra-group interconnects are electronic.
The optical or OTIS interconnects connect pairs of processors of the form [(G, P), (P, G)]; that is,
the group and processor indices are transposed by an optical interconnect. The electrical or intra-group
interconnections are according to any of the well studied electronic interconnection networks:
mesh, hypercube, mesh of trees, and so forth. The choice of the electronic interconnection network
defines a subfamily of OTIS computers: OTIS-Mesh, OTIS-Hypercube, and so forth. Figure 1
shows a 16 processor OTIS-Mesh. Each small square represents a processor. The number inside a
processor square is the processor index P. Some processor squares have a pair (Px, Py) inside them.
This pair gives the row and column index of the processor P within its √N × √N mesh. Each large
*This work was supported, in part, by the Army Research Office under grant DAA H04-95-1-0111.
[Figure: a 2 × 2 arrangement of groups 0 through 3, with group indices (0,0), (0,1), (1,0), (1,1); each group is a 2 × 2 mesh of processors.]
Figure 1: 16 processor OTIS-Mesh
square encloses a group of processors. A group index G may also be given as a pair (Gx, Gy), where
Gx and Gy are the row and column indices of the group, assuming a √N × √N layout of groups.
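The index conventions above can be illustrated with a small sketch ( our own illustration with hypothetical helper names, using N = 4 as in Figure 1 ):

```python
# Sketch of OTIS index arithmetic for a 16-processor system (N = 4 groups of
# N = 4 processors). An OTIS move sends data from processor (G, P) to (P, G).
import math

N = 4                       # number of groups = number of processors per group
n = math.isqrt(N)           # mesh side within a group, sqrt(N) = 2

def otis_move(G, P):
    """The optical connection transposes group and processor indices."""
    return (P, G)

def to_pair(P):
    """Row/column coordinates of processor P within its sqrt(N) x sqrt(N) mesh."""
    return (P // n, P % n)

# Two consecutive OTIS moves return every datum to its source processor.
for G in range(N):
    for P in range(N):
        assert otis_move(*otis_move(G, P)) == (G, P)
```

The same `to_pair` mapping applies to group indices, giving the (Gx, Gy) pair of a group G.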
Zane et al. [11] have shown that an N² processor OTIS-Mesh can simulate each move of a
√N × √N × √N × √N four-dimensional ( 4D ) mesh computer using either one electronic move or
one electronic and two OTIS moves ( depending on which dimension of the 4D mesh we are to move
along ). They have also shown that an N² processor OTIS-Hypercube can simulate each move of
an N² processor hypercube using either one electronic move or one electronic and two OTIS moves.
Sahni and Wang [10, 9] have developed efficient algorithms to rearrange data according to bit-permute-complement
( BPC ) [5] permutations on OTIS-Mesh and OTIS-Hypercube computers,
respectively. Rajasekaran and Sahni [7] have developed efficient randomized algorithms for routing,
selection, and sorting on an OTIS-Mesh.
In this paper, we develop deterministic OTIS-Mesh algorithms for the basic data operations for
parallel computation that are studied in [8]. As shown in [8], algorithms for these operations can
be used to arrive at efficient parallel algorithms for numerous applications in image processing,
computational geometry, matrix algebra, graph theory, and so forth.
We consider both the synchronous SIMD and synchronous MIMD models. In both, all processors
operate in lockstep fashion. In the SIMD model, all active processors perform the same
operation in any step, and all active processors move data along the same dimension or along OTIS
connections. In the MIMD model, processors can perform different operations in the same step
and can move data along different dimensions.
2 Basic Operations
2.1 Data Broadcast
Data broadcast is, perhaps, the most fundamental operation for a parallel computer. In this
operation, data that is initially in a single processor (G, P) is to be broadcast or transmitted to all
N² processors of the OTIS-Mesh. Data broadcast can be accomplished using the following three
step algorithm:
Step 1: Processor (G, P) broadcasts its data to all other processors in group G.
Step 2: Perform an OTIS move.
Step 3: Processor G of each group broadcasts the data within its group.
Following Step 2, one processor of each group has a copy of the data, and following Step 3
each processor of the OTIS-Mesh has a copy. In the SIMD model, Steps 1 and 3 take 2(√N − 1)
electronic moves each, and Step 2 takes one OTIS move. The SIMD complexity is 4(√N − 1)
electronic moves and 1 OTIS move, or a total of 4√N − 3 moves. Note that our algorithm is
optimal because the diameter of the OTIS-Mesh is 4√N − 3 [10]. For example, if the data to be
broadcast is initially in processor (0,0), the data needs to reach processor (N − 1, N − 1), which is
at a distance of 4√N − 3. In the MIMD model, the complexity of Steps 1 and 3 depends on the
value of P = (Px, Py) and ranges from a low of approximately √N − 1 to a high of 2(√N − 1).
The overall complexity is at most 4(√N − 1) electronic moves and one OTIS move. By contrast,
simulating the 4D-mesh broadcast algorithm using the simulation method of [11] takes 4(√N − 1)
electronic moves and 4(√N − 1) OTIS moves in the SIMD model and up to this many moves in
the MIMD model.
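The three step broadcast can be checked with a small simulation ( a sketch of ours, not the paper's implementation; the intra-group electronic broadcasts are collapsed into single assignments since only the data movement pattern is being verified ):

```python
# Simulate the 3-step OTIS-Mesh broadcast of Section 2.1 for N = 16.
N = 16
data = {(G, P): None for G in range(N) for P in range(N)}
src = (3, 7)                  # arbitrary source processor (G, P)
data[src] = "payload"

# Step 1: the source broadcasts within its group G (electronic moves).
G, P = src
for p in range(N):
    data[(G, p)] = data[src]

# Step 2: OTIS move (G, P) -> (P, G); processor G of every group now has a copy.
data = {(p, g): v for (g, p), v in data.items()}

# Step 3: processor G of each group broadcasts within its group.
for g in range(N):
    for p in range(N):
        data[(g, p)] = data[(g, G)]

assert all(v == "payload" for v in data.values())
```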
2.2 Window Broadcast
In a window broadcast, we start with data in the top left w × w submesh of a single group G.
Here w divides √N. Following the window broadcast operation, the initial w × w window tiles all
groups; that is, the window is broadcast both within and across groups. Our algorithm for window
broadcast is:
Step 1: Do a window broadcast within group G.
Step 2: Perform an OTIS move.
Step 3: Do an intra group data broadcast from processor G of each group.
Step 4: Perform an OTIS move.
Following Step 1 the initial window properly tiles group G, and we are left with the task of
broadcasting from group G to all other groups. In Step 2, data d(G, P) from (G, P) is moved to
(P, G) for 0 ≤ P < N. In Step 3, d(G, P) is broadcast to all processors (P, i), 0 ≤ i < N, and in
Step 4 d(G, P) is moved to (i, P), 0 ≤ i, P < N.
Step 1 of our window broadcast algorithm takes 2(√N − w) electronic moves in both the SIMD
and MIMD models, and Step 3 takes 2(√N − 1) electronic moves in the SIMD model and up to
2(√N − 1) electronic moves in the MIMD model. The total cost is 4√N − 2w − 2 electronic and
2 OTIS moves in the SIMD model and up to this many moves in the MIMD model. A simulation
of the 4D mesh window broadcast algorithm takes the same number of electronic moves, but also
takes 4(√N − 1) OTIS moves.
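The four step window broadcast can likewise be traced with a short simulation ( our sketch; the helper `pos` and the source group `Gsrc` are illustrative choices ):

```python
# Simulate the 4-step window broadcast of Section 2.2 (N = 16, sqrt(N) = 4, w = 2).
N, n, w = 16, 4, 2            # n = sqrt(N); w divides n
Gsrc = 5                      # hypothetical source group

def pos(P):                   # (row, col) of processor P within its n x n mesh
    return (P // n, P % n)

# Initially only the top-left w x w submesh of group Gsrc holds window values.
data = {}
for P in range(N):
    r, c = pos(P)
    data[(Gsrc, P)] = ("win", r, c) if r < w and c < w else None

# Step 1: tile group Gsrc with the window (intra-group electronic moves).
for P in range(N):
    r, c = pos(P)
    data[(Gsrc, P)] = ("win", r % w, c % w)

# Step 2: OTIS move (G, P) -> (P, G).
data = {(P, Gsrc): data[(Gsrc, P)] for P in range(N)}

# Step 3: within each group, broadcast from its processor Gsrc.
data = {(P, i): data[(P, Gsrc)] for P in range(N) for i in range(N)}

# Step 4: OTIS move (P, i) -> (i, P): every group is now tiled by the window.
data = {(i, P): data[(P, i)] for P in range(N) for i in range(N)}

for G in range(N):
    for P in range(N):
        r, c = pos(P)
        assert data[(G, P)] == ("win", r % w, c % w)
```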
2.3 Prefix Sum
The index (G, P) of a processor may be transformed into a scalar I = GN + P with 0 ≤ I < N². Let
D(I) be the data in processor I, 0 ≤ I < N². In a prefix sum, each processor I computes S(I) =
Σ_{i=0}^{I} D(i), 0 ≤ I < N². A simple prefix sum algorithm results from the following observation:
S(I) = SD(I) + LP(I)
where SD(I) is the sum of D(i) over all processors i that are in a group smaller than the group of
I, and LP(I) is the local prefix sum within the group of I. The simple prefix sum algorithm is:
Step 1: Perform a local prefix sum in each group.
Step 2: Perform an OTIS move of the prefix sums computed in Step 1 for all processors (G, N − 1).
Step 3: Group N − 1 computes a modified prefix sum of the values, A, received in Step 2. In this
modification, processor P computes Σ_{i=0}^{P−1} A(i) rather than Σ_{i=0}^{P} A(i).
Step 4: Perform an OTIS move of the modified prefix sums computed in Step 3.
Step 5: Each group does a local broadcast of the modified prefix sum received by its processor
N − 1.
Step 6: Each processor adds the local prefix sum computed in Step 1 and the modified prefix sum
it received in Step 5.
The local prefix sums of Steps 1 and 3 take 3(√N − 1) electronic moves each in both the SIMD
and MIMD models, and the local data broadcast of Step 5 takes 2(√N − 1) electronic moves.
The overall complexity is 8(√N − 1) electronic moves and 2 OTIS moves. This can be reduced to
7(√N − 1) electronic moves and 2 OTIS moves by deferring some of the Step 1 moves to Step 5 as
below.
Step 1: In each group, compute the row prefix sums R.
Step 2: Column √N − 1 of each group computes the modified prefix sums of its R values.
Step 3: Perform an OTIS move of the prefix sums computed in Step 2 for all processors (G, N − 1).
Step 4: Group N − 1 computes a modified prefix sum of the values, A, received in Step 3.
Step 5: Perform an OTIS move of the modified prefix sums computed in Step 4.
Step 6: Each group broadcasts the modified prefix sum received in Step 5 along column √N − 1
of its mesh.
Step 7: The column √N − 1 processors add the modified prefix sum received in Step 6 and the
prefix sum of R values computed in Step 2 minus its own R value computed in Step 1.
Step 8: The result computed by the column √N − 1 processors in Step 7 is broadcast along mesh
rows.
Step 9: Each processor adds its R value and the value it received in Step 8.
If we simulate the best 4D mesh prefix sum algorithm, the resulting OTIS-Mesh algorithm takes
7(√N − 1) electronic and 6(√N − 1) OTIS moves.
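The decomposition S(I) = SD(I) + LP(I) underlying both prefix sum algorithms can be verified directly ( a sketch of ours that computes the two terms arithmetically, ignoring the mesh moves that produce them ):

```python
# Check the prefix-sum decomposition S(I) = SD(I) + LP(I) of Section 2.3 (N = 16).
N = 16
D = list(range(1, N * N + 1))            # D[I] for I = G*N + P

# LP(I): local prefix sum within the group of I (Step 1).
LP = [sum(D[G * N : G * N + P + 1]) for G in range(N) for P in range(N)]

# SD(I): sum of D over all processors in groups smaller than I's group
# (produced by the modified prefix sum of group N-1 in the algorithm).
group_tot = [sum(D[G * N : (G + 1) * N]) for G in range(N)]
SD = [sum(group_tot[:G]) for G in range(N) for _ in range(N)]

S = [sd + lp for sd, lp in zip(SD, LP)]
assert S == [sum(D[: I + 1]) for I in range(N * N)]
```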
2.4 Data Sum
In this operation, each processor is to compute the sum of the D values of all processors. An
optimal SIMD data sum algorithm is:
Step 1: Each group performs the data sum.
Step 2: Perform an OTIS move.
Step 3: Each group performs the data sum.
In the SIMD model Steps 1 and 3 take 4(√N − 1) electronic moves each, and Step 2 takes 1 OTIS
move. The total cost is 8(√N − 1) electronic and 1 OTIS moves. Note that since the distance
between processors (0, 0) and (N − 1, N − 1) is 4(√N − 1) electronic and 1 OTIS moves, and since
each needs to get information from the other, at least 8(√N − 1) electronic and 1 OTIS moves
are needed ( the moves needed to send information from (0, 0) to (N − 1, N − 1) and those from
(N − 1, N − 1) to (0, 0) cannot be overlapped in the SIMD model ). Also, note that a simulation
of the 4D mesh data sum algorithm takes 8(√N − 1) electronic and 8(√N − 1) OTIS moves.
The MIMD complexity can be reduced by computing the group sums in the middle processor
of each group rather than in the bottom right processor. The complexity now becomes 4(√N − 1)
electronic and 1 OTIS moves when √N is odd and 4√N electronic and 1 OTIS moves when √N
is even. The simulation of the 4D mesh, however, takes 4(√N − 1) electronic and 4(√N − 1) OTIS
moves. Note that the MIMD algorithm is near optimal as the diameter of the OTIS-Mesh is
4√N − 3 [10].
2.5 Rank
In the rank operation, each processor I has a flag S(I) ∈ {0, 1}, 0 ≤ I < N². We are to compute
the prefix sums of the processors with S(I) = 1. This operation can be performed in 7(√N − 1)
electronic and 2 OTIS moves using the prefix sum algorithm of Section 2.3.
2.6 Shift
Although there are many variations of the shift operation, the ones we believe are most useful in
application development are:
(a) mesh row shift with zero fill: in this we shift data from processor (Gx, Gy, Px, Py) to processor
(Gx, Gy, Px, Py + s), −√N < s < √N. The shift is done with zero fill and end discard ( i.e.,
if Py + s ≥ √N or Py + s < 0, the data from Py is discarded ).
(b) mesh column shift with zero fill: similar to (a), but along mesh column Px.
(c) circular shift on a mesh row: in this we shift data from processor (Gx, Gy, Px, Py) to processor
(Gx, Gy, Px, (Py + s) mod √N).
(d) circular shift on a mesh column: similar to (c), but instead Px is used.
(e) group row shift with zero fill: similar to (a), except that Gy is used in place of Py.
(f) group column shift with zero fill: similar to (e), but along group column Gx.
(g) circular shift on a group row: similar to (c), but with Gy rather than Py.
(h) circular shift on a group column: similar to (g), with Gx in place of Gy.
Shifts of types (a) through (d) are done using the best mesh algorithms, while those of types (e)
through (h) are done as below:
Step 1: Perform an OTIS move.
Step 2: Do the shift as a Px shift ( if originally a Gx shift ) or a Py shift ( if originally a Gy shift ).
Step 3: Perform an OTIS move.
Shifts of types (a) and (b) take s electronic moves on the SIMD and MIMD models; (c) and
(d) take √N electronic moves on the SIMD model and max{|s|, √N − |s|} electronic moves on the
MIMD model; (e) and (f) take s electronic and 2 OTIS moves on both SIMD and MIMD models;
and (g) and (h) take √N electronic and 2 OTIS moves on the SIMD model and max{|s|, √N − |s|}
electronic and 2 OTIS moves on the MIMD model.
If we simulate the corresponding 4D mesh algorithms, we obtain the same complexity for (a)
through (d), but (e) and (f) take an additional 2s − 2 OTIS moves, and (g) and (h) take an additional
2 max{|s|, √N − |s|} − 2 OTIS moves.
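The OTIS-move sandwich used for shifts (e) through (h) can be traced with a small example ( our sketch for type (g), with N = 4 and s = 1 ):

```python
# Sketch of a group-row circular shift ( type (g), Section 2.6 ) realized as
# OTIS move -> Py circular shift within each group -> OTIS move, for N = 4.
N, n, s = 4, 2, 1                      # N groups of N processors; n = sqrt(N)

def otis(d):                           # OTIS move: (G, P) -> (P, G)
    return {(p, g): v for (g, p), v in d.items()}

def shift_py(d, s):                    # circular shift by s along mesh rows
    out = {}
    for (g, p), v in d.items():
        r, c = divmod(p, n)
        out[(g, r * n + (c + s) % n)] = v
    return out

data = {(G, P): (G, P) for G in range(N) for P in range(N)}
data = otis(shift_py(otis(data), s))

# Each datum moved from group (Gx, Gy) to group (Gx, (Gy + s) mod sqrt(N)),
# with its processor index P unchanged.
for (G, P), (G0, P0) in data.items():
    assert P == P0
    assert G // n == G0 // n and G % n == (G0 % n + s) % n
```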
2.7 Data Accumulation
Each processor is to accumulate M, 0 < M < √N, values from its neighboring processors along
one of the four dimensions Gx, Gy, Px, Py. Let D(Gx, Gy, Px, Py) be the data in processor
(Gx, Gy, Px, Py). In a data accumulation along the Gx dimension ( for example ), each processor
(Gx, Gy, Px, Py) accumulates in an array A the data values from ((Gx + i) mod √N, Gy, Px, Py),
0 ≤ i < M. Specifically, we have
A[i] = D((Gx + i) mod √N, Gy, Px, Py)
Accumulation in other dimensions is similar.
The accumulation operation can be done using a circular shift of M in the appropriate dimension.
The complexity is readily obtained from that for the circular shift operation ( see Section 2.6 ).
2.8 Consecutive Sum
The N² processor OTIS-Mesh is tiled with one-dimensional blocks of size M. These blocks may
align with any of the four dimensions Gx, Gy, Px, and Py. Each processor has M values X[j],
0 ≤ j < M. The ith processor in a block is to compute the sum of the X[i]s in that block.
Specifically, processor i of a block computes
S(i) = Σ_{j=0}^{M−1} X_j(i), 0 ≤ i < M
where i and j are indices relative to a block.
When the one-dimensional blocks of size M align with the Px or Py dimension, a consecutive
sum can be performed by using M tokens in each block to accumulate the M sums S(i), 0 ≤ i < M.
Assume the blocks align along Py. Each processor in a block initiates a token labeled with the
processor's intra-block index. The tokens from processors 0 through M − 2 are right bound and
that from M − 1 is left bound. In odd time steps, right bound tokens move one processor right
along the block, and in even time steps left bound tokens move one processor left along the block.
When a token reaches the rightmost or leftmost processor in the block, it reverses direction. Each
token visits each processor in its block twice, once while moving left and once while moving right.
During the rightward visits it adds in the appropriate X value from the processor. After 4(M − 1)
time steps ( and hence 4(M − 1) electronic moves ), all tokens return to their originating processors,
and we are done.
In the MIMD model, the left and right moves can be done simultaneously, and only 2(M − 1)
electronic moves are needed.
When the one-dimensional size M blocks align with Gx or Gy, we first do an OTIS move; then
run either a Px or Py consecutive sum algorithm; and then do an OTIS move. The number of
electronic moves is the same as for Px or Py alignment. However, two additional OTIS moves are
needed.
Simulation of the corresponding 4D mesh algorithm takes an additional 8M − 10 OTIS moves
for the case of Gx or Gy alignment in the SIMD model and an additional 4M − 6 OTIS moves in
the MIMD model.
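The bouncing-token scheme can be simulated directly ( our sketch; the odd/even step scheduling is abstracted into a simple bounce walk, which preserves the visiting order but not the exact timing ):

```python
# Simulate the token scheme of Section 2.8 for one block of size M aligned
# along Py. X[j][i] is the ith value held by processor j; token i must end
# with S(i) = X[0][i] + ... + X[M-1][i].
M = 5
X = [[(j + 1) * 10 + i for i in range(M)] for j in range(M)]

S = [0] * M
for i in range(M):
    pos = i
    direction = 1 if i < M - 1 else -1   # tokens 0..M-2 right bound, M-1 left bound
    seen = set()                         # processors visited while moving right
    while len(seen) < M:
        if direction == 1 and pos not in seen:
            S[i] += X[pos][i]            # rightward visit: add this processor's value
            seen.add(pos)
        if pos + direction in (-1, M):
            direction = -direction       # bounce at the block boundary
        else:
            pos += direction

assert S == [sum(X[j][i] for j in range(M)) for i in range(M)]
```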
2.9 Adjacent Sum
This operation is similar to the data accumulation operation of Section 2.7 except that the M
accumulated values are to be summed. The operation can be done with the same complexity as
data accumulation using a similar algorithm.
2.10 Concentrate
A subset of the processors contain data. These processors have been ranked as in Section 2.5, so
the data is really a pair (D, r); D is the data in the processor and r is its rank. Each pair (D, r) is
to be moved to processor r, 0 ≤ r < b, where b is the number of processors with data. Using the
(G, P) format for a processor index, we see that (D, r) is to be routed from its originating processor
to processor (⌊r/N⌋, r mod N). We accomplish this using the steps:
Step 1: Each pair (D, r) is routed to processor r mod N within its current group.
Step 2: Perform an OTIS move.
Step 3: Each pair (D, r) is routed to processor ⌊r/N⌋ within its current group.
Step 4: Perform an OTIS move.
Theorem 1 The four step algorithm given above correctly routes every pair (D, r) to processor
(⌊r/N⌋, r mod N).
Proof Step 1 does the routing on the second coordinate. This step does not route two pairs to
the same processor provided no group has two pairs (D1, r1), (D2, r2) with r1 mod N = r2 mod N.
Since each group has at most N pairs and the ranks of these pairs are contiguous integers, no group
can have two pairs with r1 mod N = r2 mod N. So following Step 1 each processor has at most
one pair and each pair is in the correct processor of the group, though possibly in the wrong group.
To get the pairs to their correct groups without changing the within group index, Step 2 performs
an OTIS move, which moves data from processor (G, P) to processor (P, G). Now all pairs in a
group have the same r mod N value and different ⌊r/N⌋ values. Thus, routing on the ⌊r/N⌋ values,
as in Step 3, routes at most one pair to each processor. The OTIS move of Step 4, therefore, gets
every pair to its correct destination processor. □
In group 0, Step 1 is a concentrate localized to the group, and in the remaining groups, Step
1 is a generalized concentrate in which the ranks have been increased by the same amount. In all
groups we may use the mesh concentrate algorithm of [6] to accomplish the routing in 4(√N − 1)
electronic moves. Step 3 is also a concentrate, as the ⌊r/N⌋ values of the pairs are in ascending order
0, 1, 2, .... So Steps 1 and 3 take 4(√N − 1) electronic moves each in the SIMD model and
2(√N − 1) in the MIMD model. Hence, the overall complexity of concentrate is 8(√N − 1)
electronic and 2 OTIS moves in the SIMD model and 4(√N − 1) electronic and 2 OTIS moves in
the MIMD model.
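The four step routing of Theorem 1 can be checked mechanically ( our sketch; the intra-group concentrates are modeled as direct placements, and the collision-freeness argued in the proof is asserted at each stage ):

```python
# Check Theorem 1: the four step concentrate routing of Section 2.10 sends each
# pair (D, r) to processor (r // N, r % N) with no collisions along the way.
N = 16
sel = [I for I in range(N * N) if I % 3 == 0]    # selected processors, ranks 0,1,...
pairs = {divmod(I, N): ("D%d" % r, r) for r, I in enumerate(sel)}

def route_within_group(d, key):
    out = {}
    for (G, P), (D, r) in d.items():
        dest = (G, key(r))
        assert dest not in out                   # no two pairs collide in a group
        out[dest] = (D, r)
    return out

step1 = route_within_group(pairs, lambda r: r % N)     # Step 1
step2 = {(P, G): v for (G, P), v in step1.items()}     # Step 2: OTIS move
step3 = route_within_group(step2, lambda r: r // N)    # Step 3
step4 = {(P, G): v for (G, P), v in step3.items()}     # Step 4: OTIS move

assert all((G, P) == (r // N, r % N) for (G, P), (D, r) in step4.items())
```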
We can improve the SIMD time to 7(√N − 1) electronic and 2 OTIS moves by using a better
mesh concentrate algorithm than the one in [6]. The new and simpler algorithm is given below for
the case of a generalized concentration on a √N × √N mesh.
Step 1: Move data that is to be in a column right of the current one rightwards to the proper
processor in the same row.
Step 2: Move data that is to be in a column left of the current one leftwards to the proper processor
in the same row.
Step 3: Move data that is to be in a smaller row upwards to the proper processor in the same
column.
Step 4: Move data that is to be in a bigger row downwards to the proper processor in the same
column.
In a concentrate operation on a square mesh, data that begins in two processors of the same row
ends up in different columns, as the ranks of these two data differ by at most √N − 1. So Steps 1
and 2 do not leave two or more data in the same processor. Steps 3 and 4 get data to the proper
row and hence to the proper processor. Note that it is possible to have up to two data items in
a processor following Steps 1 and 3. The complexity of the above concentrate algorithm is
4(√N − 1) on a SIMD mesh and 2(√N − 1) on an MIMD mesh ( we can overlap Steps 1 and 2 as
well as Steps 3 and 4 on an MIMD mesh ).
For an ordinary concentrate, in which the ranks begin at 0, Step 4 can be omitted as no data
moves down a column to a row with bigger index. So an ordinary concentrate takes only 3(√N − 1)
moves. This improves the SIMD concentration algorithm of [6], which takes 4(√N − 1) moves to
do an ordinary concentrate.
Actually, we can show that the four step concentration algorithm just stated is optimal for the
SIMD model. Consider the ordinary concentrate instance in which the selected elements are in
processors (0, √N − 1), (1, √N − 2), ..., (√N − 1, 0). Their ranks are 0, 1, ..., √N − 1. So the data
in processor (0, √N − 1) is to be moved to processor (0, 0). This requires moves that yield a net of
√N − 1 left moves. Also, the data in processor (√N − 1, 0) is to be moved to processor (0, √N − 1).
This requires a net of √N − 1 upward moves and √N − 1 rightward moves. None of these moves
can be overlapped in the SIMD model. So every SIMD concentrate algorithm must take at least
√N − 1 moves in each of the directions left, right, and up; a total of at least 3(√N − 1) moves.
For the generalized concentrate, the ranks need not start at zero. Suppose we have
two elements to concentrate. One is at processor (0, 0) and has rank N − 1, and the other is at
processor (√N − 1, √N − 1) and has rank N. The data in (0, 0) is to be moved to (√N − 1, √N − 1)
at a cost of √N − 1 net right and √N − 1 net down moves. The data in (√N − 1, √N − 1) is to be moved to
(0, 0) at a cost of √N − 1 net left and √N − 1 net up moves. So at least 4(√N − 1) moves are needed.
Theorem 2 The OTIS-Mesh data concentration algorithm described above is optimal for both the
SIMD and MIMD models; that is, (a) every SIMD concentration algorithm must make 7(√N − 1)
electronic and 2 OTIS moves in the worst case, and (b) every MIMD concentration algorithm must
make 4(√N − 1) electronic and 2 OTIS moves in the worst case.
Proof (a) Suppose that the data to be concentrated are in the processors shown in Table 1. Let
a denote processor (√N − 1, √N − 1, √N − 1, √N − 1), let b denote processor (√N − 1, 0, √N − 1, 0),
and let c denote processor (0, 1, 0, 0). The ranks of a, b, and c are N^{3/2}, N^{3/2} − N + √N − 1, and
N − √N, respectively. Therefore, following the concentration the data D(a), D(b), and D(c) initially
in processors a, b, and c will be in processors (0, 1, 0, 0), (0, √N − 1, 0, √N − 1), and (0, 0, √N − 1, 0),
respectively. Figure 2 shows the initial and concentrated data layout for the case N = 16.
The change in the Gx, Gy, Px, and Py values between the final and initial locations of D(a), D(b), and
D(c) is shown in Table 2.
[Figure: two 16 × 16 processor diagrams marking which processors hold data before and after concentration.]
Figure 2: Data Configuration: (a) Initial; (b) Concentrated
Table 2: Net change in Gx, Gy, Px, and Py

data    Gx            Gy              Px            Py
D(a)    −(√N − 1)     −(√N − 1) + 1   −(√N − 1)     −(√N − 1)
D(b)    −(√N − 1)     +(√N − 1)       −(√N − 1)     +(√N − 1)
D(c)    0             −1              +(√N − 1)     0
The maximum net negative change in each of Gx, Gy, Px, and Py is (√N − 1). Since a net
negative change in Gx can only be overlapped with a net negative change in Px, and since D(b)
needs a (√N − 1) negative change in both Gx and Px, we must make at least 2(√N − 1) electronic
moves that decrease the row index within a mesh. Similarly, because of D(a)'s requirements, at
least 2(√N − 1) electronic moves that decrease the column index within a √N × √N mesh must be
made. Turning our attention to net positive changes, we see that because of D(b)'s requirements
there must be at least 2(√N − 1) electronic moves that increase the column index. D(c) requires
√N − 1 electronic moves that increase the row index. Since positive net moves cannot be overlapped
with negative net moves, and since net moves along Gx and Px cannot be overlapped with net moves
along Gy and Py, the concentration of the configuration of Table 1 must take at least 7(√N − 1)
electronic moves.
In addition to the 7(√N − 1) electronic moves, we need at least 2 OTIS moves to concentrate the
data of Table 1. To see this, consider the data initially in group (0,1). This data is in group (0,0)
following the concentration. At least one OTIS move is needed to move the data out of group
(0,1). A nontrivial OTIS-Mesh has ≥ 2 processors on a row of a √N × √N submesh. For such
an OTIS-Mesh, at least two pieces of data must move from group (0,1) to group (0,0). A single
OTIS move scatters the data from group (0,1) to different groups, with each data going to a different
group. At least one additional OTIS move must be made to get the data back into the same group.
Table 1: Processors with data to concentrate

Gx, Gy                       Px, Py
0, 0                         0 ≤ Px < √N − 1, 0 ≤ Py < √N
0, 1                         Px = 0, 0 ≤ Py < √N
Gx = 1, 0 ≤ Gy < √N          0, 0
√N − 1, 0                    0, 0
√N − 1, 1                    0, 0
√N − 1, √N − 1               √N − 1, √N − 1
Hence, the concentration of the configuration of Table 1 cannot be done with fewer than 2 OTIS
moves.
(b) Consider the initial configuration of Table 1. Since the shortest path between processor
b and its destination processor is 4(√N − 1) electronic and one OTIS move, at least that many
electronic moves are made, in the worst case, by every concentration algorithm. The reason that
at least 2 OTIS moves are needed to complete the concentration is the same as for (a). □
2.11 Distribute
This is the inverse of the concentrate operation of Section 2.10. We start with pairs (D0, d0), ..., (Dq, dq),
d0 < d1 < ... < dq, in the first q + 1 processors 0, 1, ..., q and are to route pair (Di, di) to processor
di, 0 ≤ i ≤ q. The algorithm of Section 2.10 tells us how to start with pairs (Di, i) in processor
di, 0 ≤ i ≤ q, and move them so that Di is in processor i. By running this backwards, we can start with
Di in processor i and route it to processor di. The complexity of the distribute operation is the same as that of the
concentrate operation. We have shown that the concentrate algorithm of Section 2.10 is optimal;
it follows that the distribute algorithm is also optimal.
2.12 Generalize
We start with the same initial configuration as for the distribute operation. The objective is to
have Di in all processors j such that di ≤ j < di+1 ( set dq+1 to N² ). If we simulate the 4D
mesh algorithm for generalize using the simulation strategy of [11], it takes 8(√N − 1) electronic
and 8(√N − 1) OTIS moves to perform the generalize operation on an SIMD OTIS-Mesh. We can
improve this to 8(√N − 1) electronic and 2 OTIS moves if we run the generalize algorithm of [6],
adapted to use OTIS moves as necessary. The outer loop of the algorithm of [6] examines processor
index bits from 2p − 1 to 0, where p = log2 N. So in the first p iterations we are moving along bits
of the G index and in the last p iterations along bits of the P index. On an OTIS-Mesh we would
break this into two parts as below:
Step 1: Perform an OTIS move.
Step 2: Run the GENERALIZE procedure of [6] from bit p − 1 to 0, while maintaining the original
index.
Step 3: Perform an OTIS move.
Step 4: Run the GENERALIZE procedure of [6] from bit p − 1 to 0.
On an MIMD OTIS-Mesh the above algorithm takes 4(√N − 1) electronic and 2 OTIS moves.
We can reduce the SIMD complexity to 7(√N − 1) electronic and 2 OTIS moves by using a
better algorithm to do the generalize operation on a 2D SIMD mesh. This algorithm uses the
same observation as used by us in Section 2.10 to speed up the 2D SIMD mesh concentrate algorithm;
that is, of the four possible move directions, only three are needed. When doing a generalize on
a 2D √N × √N mesh, the possible move directions for data are to increasing row indexes and to
decreasing and increasing column indexes. With this observation, the algorithm to generalize on a
2D mesh becomes:
Step 1: Move data along columns to increasing row indexes if the data is needed in a row with
higher index.
Step 2: Move data along rows to increasing column indexes if the data is needed in a processor in
that row with higher column index.
Step 3: Move data along rows to decreasing column indexes if the data is needed in a processor
in that row with smaller column index.
The correctness of the preceding generalize algorithm can be established using the argument of
Theorem 1, and its optimality follows from Theorem 2 and the fact that the distribute operation,
which is the inverse of the concentrate operation, is a special case of the generalize operation.
The new and more efficient generalize algorithm may be used in Step 2 of the OTIS-Mesh
generalize algorithm. It cannot be used in Step 4 because the generalize of this step requires the
full capability of the code of [6], which permits data movement in all four directions of a mesh.
When we use the new generalize algorithm for Step 2 of the OTIS-Mesh generalize algorithm,
we can perform a generalize on a SIMD OTIS-Mesh using 7(√N − 1) electronic and 2 OTIS moves.
The new algorithm is optimal for both the SIMD and MIMD models. This follows from the lower
bound on a concentrate operation established in Theorem 2 and the observation made above that
the distribute operation, which is a special case of the generalize operation, is the inverse of the
concentrate operation and so has the same lower bound.
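The input/output relation of the generalize operation can be stated as a short reference computation ( our sketch; the pairs chosen are illustrative ):

```python
# Reference semantics of the generalize operation (Section 2.12): starting with
# (Di, di) in processor i, Di must reach every processor j with di <= j < d(i+1).
N2 = 16                                  # total processors (N = 4)
pairs = [("A", 0), ("B", 3), ("C", 4), ("D", 9)]   # (Di, di), di ascending

result = [None] * N2
bounds = [d for _, d in pairs] + [N2]    # d(q+1) set to N^2
for (Di, di), nxt in zip(pairs, bounds[1:]):
    for j in range(di, nxt):
        result[j] = Di

assert result == ["A", "A", "A", "B", "C", "C", "C", "C", "C",
                  "D", "D", "D", "D", "D", "D", "D"]
```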
2.13 Sorting
As was the case for the operations considered so far, an O(√N) time algorithm to sort can be
obtained by simulating a similar complexity 4D mesh algorithm. For sorting a 4D mesh, the
1  7 13        1  2  3
2  8 14        4  5  6
3  9 15        7  8  9
4 10 16       10 11 12
5 11 17       13 14 15
6 12 18       16 17 18

Figure 3: Row-Column Transformation of Leighton's Column Sort
algorithm of Kunde [2] is the fastest. Its simulation will sort into snakelike row-major order using
14√N + o(√N) electronic and 12√N + o(√N) OTIS moves on the SIMD model and 7√N + o(√N)
electronic and 6√N + o(√N) OTIS moves on the MIMD model. To sort into row-major order,
additional moves to reverse alternate dimensions are needed. This means that an OTIS-Mesh
simulation of Kunde's 4D mesh algorithm to sort into row-major order will take 18√N + o(√N)
electronic and 16√N + o(√N) OTIS moves on the SIMD model. We show that Leighton's column
sort [3] can be implemented on an OTIS-Mesh to sort into row-major order using 22√N + o(√N)
electronic and O(N^{3/8}) OTIS moves on the SIMD model and 11√N + o(√N) electronic and O(N^{3/8})
OTIS moves on the MIMD model.
Our OTIS-Mesh sorting algorithm is based on Leighton's column sort [3]. This sorting algorithm
sorts an r × s array, with r ≥ 2(s − 1)², into column-major order using the following seven steps:
Step 1: Sort each column.
Step 2: Perform a rowcolumn transformation.
Step 3: Sort each column.
Step 4: Perform the inverse transformation of Step 2.
Step 5: Sort each column in alternating order.
Step 6: Apply two steps of comparison-exchange to adjacent elements in each row.
Step 7: Sort each column.
Figure 3 shows an example of the transformation of Step 2 and its inverse. Figure 4 shows a
step by step example of Leighton's column sort.
    Input           Step 1          Step 2          Step 3
  7  9 12         2  3  1         2  4  7         1  4  7
  4 16  1         4  5  6        10 15 18         2  5  8
 18  5 14         7  9  8         3  5  9         3  6  9
  2 17  8        10 11 12        11 16 17        10 13 14
 15 11  6        15 16 13         1  6  8        11 15 17
 10  3 13        18 17 14        12 13 14        12 16 18

    Step 4          Step 5          Step 6          Step 7
  1  3 11         1 14 11         1 11 14         1  7 13
  4  6 15         2 13 12         2 12 13         2  8 14
  7  9 17         4 10 15         4 10 15         3  9 15
  2 10 12         5  9 16         5  9 16         4 10 16
  5 13 16         7  6 17         6  7 17         5 11 17
  8 14 18         8  3 18         3  8 18         6 12 18

Figure 4: Example of Leighton's column sort
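The seven steps can also be followed in code. The sketch below is our own sequential illustration (function and variable names are ours, not from the paper); run on the input of Figure 4, it reproduces the final column-major sorted array shown there.

```python
def leighton_column_sort(a, r, s):
    """Sketch of Leighton's column sort on an r x s array a (list of rows),
    sorting in place into column-major order. The sortedness guarantee
    requires r >= 2*(s-1)**2; the Figure 4 example (r=6, s=3) is smaller
    but still sorts."""

    def sort_columns(alternating=False):
        for j in range(s):
            col = sorted(a[i][j] for i in range(r))
            if alternating and j % 2 == 1:   # odd columns descending (Step 5)
                col.reverse()
            for i in range(r):
                a[i][j] = col[i]

    def row_column_transform(inverse=False):
        # Step 2 reads the array column-major and rewrites it row-major;
        # Step 4 undoes this.
        if inverse:
            flat = [a[i][j] for i in range(r) for j in range(s)]
            b = [[flat[j * r + i] for j in range(s)] for i in range(r)]
        else:
            flat = [a[i][j] for j in range(s) for i in range(r)]
            b = [flat[i * s:(i + 1) * s] for i in range(r)]
        a[:] = b

    def comparison_exchange():
        # Step 6: two steps of comparison-exchange on adjacent elements
        # of each row (an even pass, then an odd pass).
        for start in (0, 1):
            for row in a:
                for j in range(start, s - 1, 2):
                    if row[j] > row[j + 1]:
                        row[j], row[j + 1] = row[j + 1], row[j]

    sort_columns()                       # Step 1
    row_column_transform()               # Step 2
    sort_columns()                       # Step 3
    row_column_transform(inverse=True)   # Step 4
    sort_columns(alternating=True)       # Step 5
    comparison_exchange()                # Step 6
    sort_columns()                       # Step 7
```

Each intermediate array produced by this sketch matches the corresponding panel of Figure 4.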
Although Leighton's column sort is explicitly stated for r × s arrays with r ≥ 2(s − 1)², it
can be used to sort arrays with s ≥ 2(r − 1)² into row-major order by interchanging the roles of
rows and columns. We shall do this and use Leighton's method to sort an N^{1/2} × N^{3/2} array.
We interpret our N² processor OTIS-Mesh as an N^{1/2} × N^{3/2} array with G_x giving the row
index and G_y P_x P_y giving the column index of a processor. We shall further subdivide G_x
(similarly G_y, P_x, P_y) into four equal parts G_{x1}, G_{x2}, G_{x3}, and G_{x4}, from left to
right. We use G_{x2-4}, for example, to represent G_{x2}G_{x3}G_{x4}. Since p = log2 N, G_x has
p/2 bits and each G_{xi} has p/8 bits. These notations are helpful in describing the
transformations in Steps 2 and 4 of the column sort, as we use the BPC permutations of [5] to
realize these transformations. A BPC permutation [5] is specified by a vector
A = [A_{p-1}, A_{p-2}, ..., A_0], where
(a) A_i ∈ {±0, ±1, ..., ±(p − 1)}, 0 ≤ i < p, and
(b) [|A_{p-1}|, |A_{p-2}|, ..., |A_0|] is a permutation of [0, 1, ..., p − 1].
The destination for the data in any processor may be computed in the following manner. Let
m_{p-1} m_{p-2} ... m_0 be the binary representation of the processor's index, and let
d_{p-1} d_{p-2} ... d_0 be that of the destination processor's index. Then

    d_{|A_i|} = m_i                              if A_i ≥ 0,
    d_{|A_i|} = m̄_i (the complement of m_i)      if A_i < 0.

In this definition, −0 is to be regarded as < 0, while +0 is > 0. Table 3 shows an example
of the BPC permutation defined by the permutation vector A = [−0, 1, 2, −3] on a 16 processor
OTIS-Mesh.
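The destination computation can be written out directly. The sketch below is our own illustration (not the paper's notation): each vector entry A_i is encoded as a pair (|A_i|, complement) so that −0 and +0 remain distinguishable; the vector [−0, 1, 2, −3] of Table 3 becomes [(0, True), (1, False), (2, False), (3, True)].

```python
def bpc_destination(m, A):
    """Destination index for source index m under the BPC permutation A.
    A lists (|A_i|, complement) pairs from i = p-1 down to i = 0, so a
    negative entry in the paper's vector becomes complement = True here."""
    p = len(A)
    d = 0
    for i in range(p):                 # i runs over source bit positions
        pos, comp = A[p - 1 - i]       # entry for source bit i
        bit = (m >> i) & 1
        if comp:
            bit ^= 1                   # A_i < 0: complement the bit
        d |= bit << pos
    return d

# The permutation of Table 3, [-0, 1, 2, -3], in the pair encoding:
A = [(0, True), (1, False), (2, False), (3, True)]
```

Evaluating bpc_destination(m, A) for m = 0, ..., 15 reproduces the destination column of Table 3.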
Table 3: Source and destination of the BPC permutation [−0, 1, 2, −3] on a 16 processor OTIS-Mesh
In describing our sorting algorithm, we shall, at times, use a 4D array interpretation of an
OTIS-Mesh. In this interpretation, processor (G_x, G_y, P_x, P_y) of the OTIS-Mesh corresponds to
processor (G_x, G_y, P_x, P_y) of the 4D mesh. We use g_x to denote the bit positions of G_x, that
is, the leftmost p/2 bits in a processor index; g_{x1} to represent the leftmost p/8 bit positions;
p_y to represent the rightmost p/2 bit positions; p_{y3-4} to represent the rightmost p/4 bit
positions; and so on. Our strategy for the sorting Steps 1, 3, 5, and 7 of Leighton's method is to
collect each row ( recall that since we are sorting an N^{1/2} × N^{3/2} array, the column-sort steps
of Leighton's method become row-sort steps ) of our N^{1/2} × N^{3/2} array into an
N^{3/8} × N^{3/8} × N^{3/8} × N^{3/8} 4D submesh of the OTIS-Mesh and then sort this row by
simulating the 4D mesh sort algorithm of [2]. This strategy translates into the following sorting
algorithm:
Step 1: [ Move rows of the N^{1/2} × N^{3/2} array into N^{3/8} × N^{3/8} × N^{3/8} × N^{3/8} 4D submeshes ]
Perform the BPC permutation P_a.
Step 2: [ Sort each row of the N^{1/2} × N^{3/2} array ]
Sort each 4D submesh of size N^{3/8} × N^{3/8} × N^{3/8} × N^{3/8}.
Step 3: [ Do the inverse of Step 1, perform a column-row transformation, and move rows into
          Source                    Destination
Processor  (G, P)  Binary      Binary  (G, P)  Processor
    0      (0,0)    0000        1001   (2,1)       9
    1      (0,1)    0001        0001   (0,1)       1
    2      (0,2)    0010        1101   (3,1)      13
    3      (0,3)    0011        0101   (1,1)       5
    4      (1,0)    0100        1011   (2,3)      11
    5      (1,1)    0101        0011   (0,3)       3
    6      (1,2)    0110        1111   (3,3)      15
    7      (1,3)    0111        0111   (1,3)       7
    8      (2,0)    1000        1000   (2,0)       8
    9      (2,1)    1001        0000   (0,0)       0
   10      (2,2)    1010        1100   (3,0)      12
   11      (2,3)    1011        0100   (1,0)       4
   12      (3,0)    1100        1010   (2,2)      10
   13      (3,1)    1101        0010   (0,2)       2
   14      (3,2)    1110        1110   (3,2)      14
   15      (3,3)    1111        0110   (1,2)       6
N^{3/8} × N^{3/8} × N^{3/8} × N^{3/8} submeshes ]
Perform the BPC permutation P_c.
Step 4: [ Sort each row of the N^{1/2} × N^{3/2} array ]
Sort each 4D submesh of size N^{3/8} × N^{3/8} × N^{3/8} × N^{3/8}.
Step 5: [ Do the inverse of Step 1, perform a row-column transformation, and move rows into
N^{3/8} × N^{3/8} × N^{3/8} × N^{3/8} submeshes ]
Perform the BPC permutation P_e.
Step 6: [ Sort each row in alternating order ]
Sort each 4D submesh of size N^{3/8} × N^{3/8} × N^{3/8} × N^{3/8}.
Step 7: [ Move rows back from 4D submeshes ]
Perform the BPC permutation P_a^{-1}.
Step 8: Apply two steps of comparison-exchange to adjacent rows.
Step 9: [ Move rows into submeshes of size N^{3/8} × N^{3/8} × N^{3/8} × N^{3/8} ]
Perform the BPC permutation P_a.
Step 10: [ Sort each row of the N^{1/2} × N^{3/2} array ]
Sort each 4D submesh of size N^{3/8} × N^{3/8} × N^{3/8} × N^{3/8}.
Step 11: [ Move rows back from 4D submeshes ]
Perform the BPC permutation P_a^{-1}.
Notice that the row to 4D submesh transform is accomplished by the BPC permutation P_a.
Elements in the same row of our N^{1/2} × N^{3/2} array interpretation have the same G_x value;
but in our 4D mesh interpretation, elements in the same N^{3/8} × N^{3/8} × N^{3/8} × N^{3/8}
submesh have the same G_{x1}G_{y1}P_{x1}P_{y1} value. P_a results in this property. To go from
Step 2 to Step 3 of Leighton's method, we need to first restore the N^{1/2} × N^{3/2} array
interpretation using the inverse permutation of P_a, that is, perform the BPC permutation
P_a^{-1}; then perform a column-row transform using the BPC permutation p_b; and finally map
the rows of our N^{1/2} × N^{3/2} array into 4D submeshes of size
N^{3/8} × N^{3/8} × N^{3/8} × N^{3/8} using the BPC permutation P_a. This three BPC permutation
sequence P_a ∘ p_b ∘ P_a^{-1} is equivalent to the single BPC permutation P_c.
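That a sequence of BPC permutations is itself a single BPC permutation can be checked mechanically. The sketch below is our own illustration (not the paper's notation): vector entries are encoded as (|A_i|, complement) pairs so that −0 and +0 stay distinguishable, and composing two vectors yields the vector of the combined permutation.

```python
def apply_bpc(m, A):
    """Destination of index m under BPC vector A, where A lists
    (|A_i|, complement) pairs from i = p-1 down to i = 0."""
    p, d = len(A), 0
    for i in range(p):
        pos, comp = A[p - 1 - i]
        d |= (((m >> i) & 1) ^ comp) << pos
    return d

def compose_bpc(B, A):
    """Vector of the BPC permutation 'first apply A, then B'."""
    p = len(A)
    C = [None] * p
    for i in range(p):
        pos_a, comp_a = A[p - 1 - i]       # bit i lands at pos_a under A
        pos_b, comp_b = B[p - 1 - pos_a]   # ... and then at pos_b under B
        C[p - 1 - i] = (pos_b, comp_a ^ comp_b)
    return C
```

For example, composing the permutation [−0, 1, 2, −3] of Table 3 with itself gives the identity vector [3, 2, 1, 0], since each complement is applied twice.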
The preceding OTIS-Mesh implementation of column sort performs 6 BPC permutations, 4
4D mesh sorts, and two steps of comparison-exchange on adjacent rows. Since the sorting steps
take O(N^{3/8}) time each ( use Kunde's 4D mesh sort [2] followed by a transform from snake-like
row-major to row-major ), and since the remaining steps take O(N^{1/2}) time, we shall ignore the
complexity of the sort steps.
We can reduce the number of BPC permutations from 6 to 3 as follows. First, note that the P_a
of Step 1 just moves elements from rows of the N^{1/2} × N^{3/2} array into
N^{3/8} × N^{3/8} × N^{3/8} × N^{3/8} 4D submeshes. For the sort of Step 2, it doesn't really matter
which N^{3/2} elements go to each 4D submesh, as the initial configuration is an arbitrary
unsorted configuration. So we may eliminate Step 1 altogether. Next, note that the BPC
permutations of Steps 7 and 9 cancel each other, and we can perform the comparison-exchange of
Step 8 by moving data from one N^{3/8} × N^{3/8} × N^{3/8} × N^{3/8} 4D submesh to an adjacent
one and back in O(N^{3/8}) time.
With these observations, the algorithm to sort on an OTIS-Mesh becomes:
Step 1: Sort in each subarray of size N^{3/8} × N^{3/8} × N^{3/8} × N^{3/8}.
Step 2: Perform the BPC permutation P_c.
Step 3: Sort in each subarray.
Step 4: Perform the BPC permutation P_e.
Step 5: Sort in each subarray.
Step 6: Apply two steps of comparison-exchange to adjacent subarrays.
Step 7: Sort in each subarray.
Step 8: Perform the BPC permutation P_a^{-1}.
Using the BPC routing algorithm of [10], the three BPC permutations can be done using
36√N electronic and 3 log2 N + 6 OTIS moves on the SIMD model and 18√N electronic and
3 log2 N + 6 OTIS moves on the MIMD model. A more careful analysis based on the development
in [5] and [10] reveals that the permutations P_c, P_e, and P_a^{-1} can be done with 28√N
electronic and log2 N + 6 OTIS moves on the SIMD model and 14√N electronic and 3 log2 N + 6
OTIS moves on the MIMD model. By using the modified BPC permutations P_c', P_e', and P_a',
the permutation cost becomes 22√N electronic and log2 N + 5 OTIS moves on the SIMD model and
11√N electronic and log2 N + 5 OTIS moves on the MIMD model. The total number of moves is
thus 22√N + O(N^{3/8}) electronic and O(N^{3/8}) OTIS moves on the SIMD model and
11√N + O(N^{3/8}) electronic and O(N^{3/8}) OTIS moves on the MIMD model. This is superior
to the cost of the sorting algorithm that results from simulating the 4D row-major mesh sort of
Kunde [2].
2.14 Random Access Read ( RAR )
In a random access read ( RAR ) [8], processor I wishes to read the data variable D of
processor d_I, 0 ≤ I < N². The steps suggested in [8] for this operation are:
Step 0: Processor I creates a triple (I, D, d_I), where D is initially empty.
Step 1: Sort the triples by d_I.
Step 2: Processor I checks processor I + 1 and deactivates if both have triples with the same
third coordinate.
Step 3: Rank the remaining processors.
Step 4: Concentrate the triples using the ranks of Step 3.
Step 5: Distribute the triples according to their third coordinates.
Step 6: Load each triple with the D value of the processor it is in.
Step 7: Concentrate the triples using the ranks in Step 3.
Step 8: Generalize the triples to get the configuration we had following Step 1.
Step 9: Sort the triples by their first coordinates.
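The nine steps can be mirrored by a short sequential sketch. This is our own simplification (names are ours): the concentrate, distribute, and generalize data movements become plain list and dictionary operations.

```python
def random_access_read(D, dest):
    """Sequential sketch of the RAR steps: processor I reads D[dest[I]].
    D[i] is the data in processor i; dest[i] is the processor it reads."""
    n = len(D)
    # Steps 0-1: build triples (I, data, d_I) and sort by d_I.
    triples = sorted(((i, None, dest[i]) for i in range(n)),
                     key=lambda t: t[2])
    # Steps 2-4: keep one representative triple per distinct destination.
    reps = [t for k, t in enumerate(triples)
            if k == 0 or triples[k - 1][2] != t[2]]
    # Steps 5-6: each representative loads the D value of its destination.
    loaded = {d: D[d] for (_, _, d) in reps}
    # Steps 7-9: generalize the value back to every triple and return
    # the results in processor order.
    result = [None] * n
    for i, _, d in triples:
        result[i] = loaded[d]
    return result
```

For instance, with D = [10, 20, 30, 40] and dest = [2, 0, 0, 3], processors 1 and 2 both read processor 0, and the sketch returns [30, 10, 10, 40].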
Using the SIMD model, the RAR algorithm of [8] takes 79(√N − 1) electronic moves and O(N^{3/8})
OTIS moves. On the MIMD model, it takes 45(√N − 1) electronic and O(N^{3/8}) OTIS moves.
2.15 Random Access Write ( RAW )
Now processor I wants to write its D data to processor d_I, 0 ≤ I < N². The steps in the RAW
algorithm of [8] are:
Step 0: Processor I creates the tuple (D(I), d_I), 0 ≤ I < N².
Step 1: Sort the tuples by their second coordinates.
Step 2: Processor I deactivates if the second coordinate of its tuple is the same as the second
coordinate of the tuple in processor I + 1, 0 ≤ I < N² − 1.
Step 3: Rank the remaining processors.
Step 4: Concentrate the tuples using the ranks of Step 3.
Step 5: Distribute the tuples according to their second coordinates.
Step 2 implements the arbitrary write method for a concurrent write. In this method, any one of
the processors wishing to write to the same location is permitted to succeed. The priority model
may be implemented by sorting in Step 1 by d_I and, within d_I, by priority. The common and
combined models can also be implemented, but with increased complexity.
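The RAW steps, with the arbitrary concurrent-write rule just described, can likewise be sketched sequentially. This is our own simplification (names are ours); here the surviving writer for each location is the one whose tuple comes last in sorted order, matching the deactivation rule of Step 2.

```python
def random_access_write(D, dest):
    """Sequential sketch of RAW: processor I writes D[I] to processor
    dest[I]. Under the arbitrary rule, one writer per destination succeeds;
    unwritten locations are left as None."""
    n = len(D)
    mem = [None] * n
    # Steps 0-1: build tuples (D(I), d_I) and sort by destination.
    tuples = sorted(((D[i], dest[i]) for i in range(n)), key=lambda t: t[1])
    # Step 2: processor I deactivates if the next tuple has the same
    # destination, so the last tuple in each run survives.
    # Steps 3-5: the surviving tuples deliver their data.
    for k, (v, d) in enumerate(tuples):
        if k + 1 == n or tuples[k + 1][1] != d:
            mem[d] = v
    return mem
```

For instance, with D = [5, 6, 7, 8] and dest = [1, 3, 1, 0], processors 0 and 2 both write to location 1 and only one of them succeeds, giving [8, 7, None, 6].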
On the SIMD model, an RAW takes 43(√N − 1) electronic and O(N^{3/8}) OTIS moves while on
the MIMD model, it takes 21(√N − 1) electronic and O(N^{3/8}) OTIS moves.
3 Conclusion
We have developed OTIS-Mesh algorithms for the basic parallel computing operations of [8]. Our
algorithms run faster than the simulation of the fastest algorithms known for 4D meshes. Table 4
summarizes the complexities of our algorithms and those of the corresponding ones obtained by
simulating the best 4D-mesh algorithms. Note that the worst case complexities are listed for the
broadcast and window broadcast operations, and that of the case when √N is even is presented for
the data sum operation on the MIMD model. Also, the complexities listed for circular shift, data
accumulation, and adjacent sum assume that the shift distance is ≤ √N/2 on the MIMD model.
Table 4 gives only the dominating √N terms for sorting. Our algorithms for data broadcast, data
sum, concentrate, distribute, and generalize are optimal.
References
[1] A. Krishnamoorthy, P. Marchand, F. Kiamilev, and S. Esener. Grain-size considerations for
optoelectronic multistage interconnection networks. Applied Optics, 31, Sept. 1992.
[2] M. Kunde. Routing and sorting on mesh-connected arrays. In Proceedings of the 3rd Aegean
Workshop on Computing: VLSI Algorithms and Architectures, Lecture Notes in Computer Science,
volume 319, pages 423-433. Springer-Verlag, 1988.
[3] T. Leighton. Tight bounds on the complexity of parallel sorting. IEEE Transactions on
Computers, C-34(4):344-354, Apr. 1985.
[4] G. C. Marsden, P. J. Marchand, P. Harvey, and S. C. Esener. Optical transpose interconnection
system architectures. Optics Letters, 18(13):1083-1085, July 1993.
[5] D. Nassimi and S. Sahni. An optimal routing algorithm for mesh-connected parallel computers.
Journal of the Association for Computing Machinery, 27(1):6-29, Jan. 1980.
[6] D. Nassimi and S. Sahni. Data broadcasting in SIMD computers. IEEE Transactions on
Computers, C-30(2):101-107, Feb. 1981.
[7] S. Rajasekaran and S. Sahni. Randomized routing, selection, and sorting algorithms on the
OTIS-Mesh optoelectronic computer. Manuscript, 1997.
[8] S. Ranka and S. Sahni. Hypercube Algorithms with Applications to Image Processing and
Pattern Recognition. Springer-Verlag, 1990.
[9] S. Sahni and C.-F. Wang. BPC permutations on the OTIS-Hypercube optoelectronic computer.
Technical report, CISE Department, University of Florida, 1997.
[10] S. Sahni and C.-F. Wang. BPC permutations on the OTIS-Mesh optoelectronic computer. In
Proceedings of the Fourth International Conference on Massively Parallel Processing Using
Optical Interconnections (MPPOI '97), pages 130-135, 1997.
[11] F. Zane, P. Marchand, R. Paturi, and S. Esener. Scalable network architectures using the
optical transpose interconnection system (OTIS). In Proceedings of the Second International
Conference on Massively Parallel Processing Using Optical Interconnections (MPPOI '96), pages
114-121, 1996.

