BPC Permutations On The OTIS-Hypercube Optoelectronic

Computer*

Sartaj Sahni and Chih-Fang Wang

Department of Computer and Information Science and Engineering

University of Florida

Gainesville, FL 32611

{sahni,wang}@cise.ufl.edu

Abstract

We show that the diameter of an N^2 processor OTIS-Hypercube computer ( N = 2^d ) is

2d + 1. OTIS-Hypercube algorithms for some commonly performed permutations, transpose,

bit reversal, vector reversal, perfect shuffle, unshuffle, shuffled row-major, and bit shuffle, are

developed. We also propose an algorithm for general BPC permutations.

1 Introduction

Electronic interconnects are superior to optical interconnects when the interconnect distance is

up to a few millimeters [1, 3]. However, for longer interconnects, optics ( and in particular, free

space optics ) provides power, speed, and bandwidth advantages over electronics. With this in mind,

Marsden et al. [5], Hendrick et al. [2], and Zane et al. [9] have proposed a hybrid computer architecture

in which the processors are divided into groups; intra-group connects are electronic, and inter-group

interconnects are optical. Krishnamoorthy et al.[4] have demonstrated that bandwidth and power

consumption are minimized when the number of processors in a group equals the number of groups.

Marsden et al. [5] propose a family of optoelectronic architectures in which the number of groups

equals the number of processors per group. In this family, the optical transpose interconnection

system ( OTIS ) provides the inter-group connects: the optical interconnects connect processor p of group

g to processor g of group p. The intra-group interconnect ( or electronic interconnect ) can be

any of the standard previously studied connection schemes for electronic computers. This strategy

gives rise to the OTIS-Mesh, OTIS-Hypercube, OTIS-Perfect shuffle, OTIS-Mesh of trees, and so

forth computers.

Figure 1 shows a generic 16 processor OTIS computer; only the optical connections are shown.

The solid squares indicate individual processors, and a processor index is given by the pair (G, P)

*This work was supported, in part, by the Army Research Office under grant DAA H04-95-1-0111.

Figure 1: Example of OTIS connections with 16 processors

where G is its group index and P the processor or local index. Figure 2 shows a 16 processor

OTIS-Hypercube. The number inside a processor is the processor index within its group.

Hendrick et al. [2] have computed the performance characteristics ( power, throughput, volume,

etc. ) of the OTIS-Hypercube architecture. Zane et al. [9] have shown that each move of an N^2

processor hypercube can be simulated by an N^2 processor OTIS-Hypercube using either one local

electronic move, or one local electronic move and two optical inter-group moves using the OTIS

interconnection. We shall refer to the latter as OTIS moves. Sahni and Wang [7] and Wang and

Sahni [8] have evaluated thoroughly the characteristics of the OTIS-Mesh architecture, developing

algorithms for basic data rearrangements.

In this paper, we study the OTIS-Hypercube architecture and obtain basic properties and basic

permutation routing algorithms for this architecture. These algorithms can be used to develop

efficient application programs.

In the following, when we describe a path through an OTIS-Hypercube, we use the term elec-

tronic move to refer to a move along an electronic interconnect ( so it is an intra-group move )

and the term OTIS move to refer to a move along an optical interconnect.

Figure 2: 16 processor OTIS-Hypercube

2 OTIS-Hypercube Diameter

Let N = 2^d and let d(i, j) be the length of the shortest path from processor i to processor j in

a hypercube. Let (G1, P1) and (G2, P2) be two OTIS-Hypercube processors. The shortest path

between these two processors fits into one of the following categories:

(a) The path employs electronic moves only. This is possible only when G1 = G2.

(b) The path employs an even number of OTIS moves. Paths of this type look like

(G1, P1) -E-> (G1, P1') -O-> (P1', G1) -E-> (P1', G1') -O-> (G1', P1') -E-> ... -E-> (G2, P2)

Here E denotes a sequence ( possibly empty ) of electronic moves and O denotes a single

OTIS move. If the number of OTIS moves is more than two, we may compress the path

into a shorter path that uses two OTIS moves only:

(G1, P1) -E-> (G1, P2) -O-> (P2, G1) -E-> (P2, G2) -O-> (G2, P2).

(c) The path employs an odd number of OTIS moves. Again, if the number of moves is more than

one, we can compress the path into a shorter one that employs exactly one OTIS move as in

(b). The shorter path looks like: (G1, P1) -E-> (G1, G2) -O-> (G2, G1) -E-> (G2, P2).

Shortest paths of type (a) have length exactly d(P1, P2) ( which equals the number of ones in the

binary representation of P1 ⊕ P2 ). Paths of type (b) and type (c) have length d(P1, P2) + d(G1, G2) + 2

and d(P1, G2) + d(P2, G1) + 1, respectively.

As a result, we obtain the following theorem:

Theorem 1 The length of the shortest path between processors (G1, P1) and (G2, P2) is d(P1, P2)

when G1 = G2 and min{ d(P1, P2) + d(G1, G2) + 2, d(P1, G2) + d(P2, G1) + 1 } when G1 ≠ G2.

Theorem 2 The diameter of the OTIS-Hypercube is 2d + 1.

Proof Since each group is a d-dimensional hypercube, d(P1, P2), d(G1, G2), d(P1, G2), and d(P2, G1)

are all less than or equal to d. From Theorem 1, we conclude that no two processors are more than

2d + 1 apart. Now consider the processors (G1, P1), (G2, P2) such that P1 = 0 and P2 = N - 1.

Let G1 = 0 and G2 = N - 1. So d(P1, P2) = d(G1, G2) = d(P1, G2) = d(P2, G1) = d. Hence, the

distance between (G1, P1) and (G2, P2) is 2d + 1. As a result, the diameter of the OTIS-Hypercube is

exactly 2d + 1. □
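Theorem 1 lends itself to a direct mechanical check. The sketch below ( our own Python, with illustrative function names, not from the paper ) computes the shortest-path length of Theorem 1 and confirms the 2d + 1 diameter for d = 3 by exhaustive search over all processor pairs.

```python
def hypercube_dist(a, b):
    """Hypercube distance: the number of ones in the binary form of a XOR b."""
    return bin(a ^ b).count("1")

def otis_dist(g1, p1, g2, p2):
    """Shortest-path length between (g1, p1) and (g2, p2) per Theorem 1."""
    if g1 == g2:
        return hypercube_dist(p1, p2)
    return min(hypercube_dist(p1, p2) + hypercube_dist(g1, g2) + 2,
               hypercube_dist(p1, g2) + hypercube_dist(p2, g1) + 1)

# Exhaustive diameter check for d = 3 (N = 2^d = 8): expect 2d + 1 = 7.
d, N = 3, 8
diameter = max(otis_dist(g1, p1, g2, p2)
               for g1 in range(N) for p1 in range(N)
               for g2 in range(N) for p2 in range(N))
```

The maximum is attained, for example, at (0, 0) and (N - 1, N - 1), as in the proof of Theorem 2.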

3 Common Data Rearrangements

In this section, we concentrate on the realization of permutations such as transpose, perfect shuffle,

unshuffle, and vector reversal, which are frequently used in applications. Nassimi and Sahni [6] have

developed optimal hypercube algorithms for these frequently used permutations. These algorithms

may be simulated by an OTIS-Hypercube using the method of [9] to obtain algorithms to realize

these data rearrangement patterns on an OTIS-Hypercube. Table 1 gives the number of moves

used by the optimal hypercube algorithms; a breakdown of the number of moves in the group and

local dimensions; and the number of electronic and OTIS moves required by the simulation.

We shall obtain OTIS-Hypercube algorithms, for the permutations of Table 1, that require far

fewer moves than the simulations of the optimal hypercube algorithms.

As mentioned before, each processor is indexed as (G, P) where G is the group index and P

the local index. An index pair (G, P) may be transformed into a singleton index I = GP by

concatenating the binary representations of G and P.

The permutations of Table 1 are members of the BPC ( bit-permute-complement ) class of

permutations defined in [6]. In a BPC permutation, the destination processor of each data is given

by a rearrangement of the bits in the source processor index. For the case of our N^2 processor

OTIS-Hypercube we know that N is a power of two, and so the number of bits needed to represent

a processor index is p = log2 N^2 = 2 log2 N = 2d. A BPC permutation is specified by a vector

A = [A_{p-1}, A_{p-2}, ..., A_0], where

(a) A_i ∈ { ±0, ±1, ..., ±(p - 1) }, 0 ≤ i < p, and

(b) [ |A_{p-1}|, |A_{p-2}|, ..., |A_0| ] is a permutation of [0, 1, ..., p - 1].

                        Optimal Hypercube [6]               OTIS-Hypercube Simulation
Permutation            total   group dimension   local dimension      OTIS   electronic
Transpose               2d           d                  d              2d        2d
Perfect Shuffle         2d           d                  d              2d        2d
Unshuffle               2d           d                  d              2d        2d
Bit Reversal            2d           d                  d              2d        2d
Vector Reversal         2d           d                  d              2d        2d
Bit Shuffle            2d-2         d-1                d-1            2d-2      2d-2
Shuffled Row-major     2d-2         d-1                d-1            2d-2      2d-2
GlPu Swap                d          d/2                d/2              d         d

Table 1: Optimal moves for an N^2 = 2^{2d} processor hypercube and the respective OTIS-Hypercube simulations

The destination for the data in any processor may be computed in the following manner. Let

m_{p-1} m_{p-2} ... m_0 be the binary representation of the processor's index. Let d_{p-1} d_{p-2} ... d_0 be that

of the destination processor's index. Then

    d_{|A_i|} = m_i       if A_i > 0,
    d_{|A_i|} = 1 - m_i   if A_i < 0.

In this definition, -0 is to be regarded as < 0, while +0 is > 0.

In a 16-processor OTIS-Hypercube, the processor indices have four bits, with the first two giving

the group number and the second two the local processor index. The BPC permutation [-0, 1, 2, -3]

requires data from each processor m_3 m_2 m_1 m_0 to be routed to processor (1 - m_0) m_1 m_2 (1 - m_3).

Table 2 lists the source and destination processors of the permutation.
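The destination computation can be sketched as follows ( our own Python, not from the paper ); since -0 cannot be represented by a Python int, each entry of A is given here as a ( magnitude, complement ) pair.

```python
def bpc_dest(src, A, p):
    """Destination of the data in processor src under the BPC permutation with
    vector A = [A_{p-1}, ..., A_0]; each entry is a (magnitude, complement)
    pair so that -0 is representable."""
    m = [(src >> i) & 1 for i in range(p)]       # m[i] = bit i of the source
    dest = [0] * p
    for k, (mag, neg) in enumerate(A):           # A[k] is entry A_{p-1-k}
        i = p - 1 - k
        dest[mag] = (1 - m[i]) if neg else m[i]  # complement when A_i < 0
    return sum(b << i for i, b in enumerate(dest))

# The example vector [-0, 1, 2, -3] as (magnitude, complement) pairs:
A = [(0, True), (1, False), (2, False), (3, True)]
```

For instance, bpc_dest(0, A, 4) gives 9 and bpc_dest(9, A, 4) gives 0, in agreement with Table 2.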

The permutation vector A for each of the permutations of Table 1 is given in Table 3.

3.1 Transpose [p/2 - 1, ..., 0, p - 1, ..., p/2]

The transpose operation may be accomplished via a single OTIS move and no electronic moves.

The simulation of the optimal hypercube algorithm, however, takes 2d OTIS and 2d electronic

moves.
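At the index level, the single OTIS move is just the exchange of the group and local halves of the index; a small sketch ( illustrative Python, ours ):

```python
def transpose_dest(I, d):
    """One OTIS move sends the data at (G, P) to (P, G): the transpose."""
    mask = (1 << d) - 1
    G, P = I >> d, I & mask
    return (P << d) | G

# Transpose is an involution: two OTIS moves restore the original placement.
```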

Source                              Destination
Processor   (G, P)   Binary    Binary   (G, P)   Processor
 0          (0,0)    0000      1001     (2,1)     9
 1          (0,1)    0001      0001     (0,1)     1
 2          (0,2)    0010      1101     (3,1)    13
 3          (0,3)    0011      0101     (1,1)     5
 4          (1,0)    0100      1011     (2,3)    11
 5          (1,1)    0101      0011     (0,3)     3
 6          (1,2)    0110      1111     (3,3)    15
 7          (1,3)    0111      0111     (1,3)     7
 8          (2,0)    1000      1000     (2,0)     8
 9          (2,1)    1001      0000     (0,0)     0
10          (2,2)    1010      1100     (3,0)    12
11          (2,3)    1011      0100     (1,0)     4
12          (3,0)    1100      1010     (2,2)    10
13          (3,1)    1101      0010     (0,2)     2
14          (3,2)    1110      1110     (3,2)    14
15          (3,3)    1111      0110     (1,2)     6

Table 2: Source and destination of the BPC permutation [-0, 1, 2, -3] in a 16 processor OTIS-Hypercube

Permutation           Permutation Vector
Transpose             [p/2 - 1, ..., 0, p - 1, ..., p/2]
Perfect Shuffle       [0, p - 1, p - 2, ..., 1]
Unshuffle             [p - 2, p - 3, ..., 0, p - 1]
Bit Reversal          [0, 1, ..., p - 1]
Vector Reversal       [-(p - 1), -(p - 2), ..., -0]
Bit Shuffle           [p - 1, p - 3, ..., 1, p - 2, p - 4, ..., 0]
Shuffled Row-major    [p - 1, p/2 - 1, p - 2, p/2 - 2, ..., p/2, 0]
GlPu Swap             [p - 1, ..., 3p/4, p/2 - 1, ..., p/4, 3p/4 - 1, ..., p/2, p/4 - 1, ..., 0]

Table 3: Permutations and their permutation vectors

3.2 Perfect Shuffle [0, p - 1, p - 2, ..., 1]

We can adapt the hypercube strategy of [6] to an OTIS-Hypercube. Each processor uses two variables A and

B. Initially, all data are in the A variables and the B variables have no data. The algorithm for

perfect shuffle is given below:

Step 1: Swap A and B in processors with last two bits equal to 01 or 10.

Step 2: for (i = 1; i < d; i++) {

(a) Swap the B variables of processors that differ on bit i only;

(b) Swap the A and B variables of processors with bit i of their

index I not equal to bit i + 1 of their index; }

Step 3: Perform an OTIS move on the A and B variables.

Step 4: for (i = 0; i < d; i++) {

(a) Swap the B variables of processors that differ on bit i only;

(b) Swap the A and B variables of processors with bit i of their

index I not equal to bit i + 1 of their index; }

Step 5: Perform an OTIS move on the A and B variables.

Step 6: Swap the B variables of processors that differ on bit 0 only.

Step 7: Swap the A and B variables of processors with last two bits equal to 01 or 10.

Actually, in Step 1 it is sufficient to copy from A to B, and in Step 7 to copy from B to A.

Table 4 shows the working of this algorithm on a 16 processor OTIS-Hypercube. The correctness

of the algorithm is easily established, and we see that the number of data move steps is 2d + 2 ( 2d

electronic moves and 2 OTIS moves; each OTIS move moves two pieces of data from one processor

to another, each electronic swap moves a single datum between two processors ).

The communication complexity of 2d + 2 is very close to optimal. For example, data from the

processor with index I = 0101...0101 is to move to the processor with index I' = 1010...1010,

and the distance between these two processors is 2d + 1.

Note that the simulation of the optimal hypercube algorithm for perfect shuffle takes 4d moves.

( Table 4 traces the contents of the A and B variables of each of the 16 processors: initially, after

Step 1, after parts (a) and (b) of Step 2 ( i = 1 ), after the OTIS move, after parts (a) and (b) of

each iteration ( i = 0, 1 ) of Step 4, after the second OTIS move, and after Steps 6 and 7. )

Table 4: Illustration of the perfect shuffle algorithm on a 16 processor OTIS-Hypercube
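The seven steps can be checked by simulation. The sketch below is our own Python, not from the paper; the function and variable names are ours, and the loop bounds i < d in Steps 2 and 4 are inferred from the stated 2d + 2 move count. For small d it verifies that the data initially in processor I ends up in the A variable of the processor whose index is the left cyclic shift of I.

```python
def perfect_shuffle_otis(d):
    """Simulate Steps 1-7 on an N^2 processor OTIS-Hypercube (N = 2^d);
    returns the final contents of the A variables."""
    p = 2 * d
    n = 1 << p
    A = list(range(n))            # A[I] = data initially in processor I
    B = [None] * n

    def bit(x, i):
        return (x >> i) & 1

    def swap_AB(pred):            # internal swap, no communication cost
        for I in range(n):
            if pred(I):
                A[I], B[I] = B[I], A[I]

    def swap_B_neighbors(i):      # one electronic move along dimension i
        for I in range(n):
            J = I ^ (1 << i)
            if I < J:
                B[I], B[J] = B[J], B[I]

    def otis():                   # (G, P) -> (P, G) for both variables
        A2, B2 = A[:], B[:]
        for I in range(n):
            G, P = I >> d, I & ((1 << d) - 1)
            J = (P << d) | G
            A2[J], B2[J] = A[I], B[I]
        A[:], B[:] = A2, B2

    swap_AB(lambda I: I & 3 in (1, 2))                       # Step 1
    for i in range(1, d):                                    # Step 2
        swap_B_neighbors(i)                                  # (a)
        swap_AB(lambda I, i=i: bit(I, i) != bit(I, i + 1))   # (b)
    otis()                                                   # Step 3
    for i in range(d):                                       # Step 4
        swap_B_neighbors(i)                                  # (a)
        swap_AB(lambda I, i=i: bit(I, i) != bit(I, i + 1))   # (b)
    otis()                                                   # Step 5
    swap_B_neighbors(0)                                      # Step 6
    swap_AB(lambda I: I & 3 in (1, 2))                       # Step 7
    return A
```

For d = 2 the intermediate states of this simulation agree with the surviving entries of Table 4.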

3.3 Unshuffle [p - 2, p - 3, ..., 0, p - 1]

This is the inverse of a perfect shuffle and may be performed by running the perfect shuffle algorithm

mentioned above backwards ( i.e., beginning with Step 7 ); the for loops of Steps 2 and 4 are also

run backwards. Thus the number of moves is the same as for a perfect shuffle.

3.4 Bit Reversal [0, 1, ..., p - 1]

When simulating the optimal hypercube algorithm, the task requires 2d electronic moves and 2d

OTIS moves. But with the following algorithm:

Step 1: Do a local bit reversal in each group.

Step 2: Perform an OTIS move of all data.

Step 3: Do a local bit reversal in each group.

we can actually achieve the rearrangement in 2d electronic moves and 1 OTIS move, since Steps 1

and 3 can be performed optimally in d electronic moves each [6].

The number of moves is optimal since the data from processor 0101...0101 is to move to

processor 1010...1010, and the distance between these two processors is 2d + 1 ( Theorem 1 ).
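Steps 1-3 compose, at the index level, to a reversal of all 2d index bits; a sketch ( illustrative Python, ours ):

```python
def rev(x, b):
    """Reverse the b-bit binary representation of x."""
    r = 0
    for _ in range(b):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r

def bit_reversal_dest(I, d):
    """Destination of the data at processor I under Steps 1-3."""
    G, P = I >> d, I & ((1 << d) - 1)
    P = rev(P, d)        # Step 1: local bit reversal in each group
    G, P = P, G          # Step 2: OTIS move
    P = rev(P, d)        # Step 3: local bit reversal in each group
    return (G << d) | P
```

The result equals rev(I, 2d), the reversal of the full 2d-bit index.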

3.5 Vector Reversal [-(p - 1), -(p - 2), ..., -0]

A vector reversal can be done using 2d electronic and 2 OTIS moves. The steps are:

Step 1: Perform a local vector reversal in each group.

Step 2: Do an OTIS move of all data.

Step 3: Perform a local vector reversal in each group.

Step 4: Do an OTIS move of all data.

The correctness of the algorithm is obvious. The number of moves is computed using the fact

that Steps 1 and 3 can be done in d electronic moves each [6].

Since a vector reversal requires us to move data from processor 00...00 to processor 11...11,

and since the distance between these two processors is 2d + 1 ( Theorem 1 ), our vector reversal

algorithm can be improved by at most one move.
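At the index level, Steps 1-4 compose to the complement of the full 2d-bit index, which is exactly the vector reversal destination; a sketch ( illustrative Python, ours ):

```python
def vector_reversal_dest(I, d):
    """Destination of the data at processor I under Steps 1-4."""
    mask = (1 << d) - 1
    G, P = I >> d, I & mask
    P ^= mask            # Step 1: local vector reversal (complement P)
    G, P = P, G          # Step 2: OTIS move
    P ^= mask            # Step 3: local vector reversal
    G, P = P, G          # Step 4: OTIS move
    return (G << d) | P
```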

3.6 Bit Shuffle [p - 1, p - 3, ..., 1, p - 2, p - 4, ..., 0]

Let G = Gu Gl, where Gu and Gl partition G in half. Similarly, let P = Pu Pl. Our algorithm employs a

GlPu Swap permutation in which data from processor Gu Gl Pu Pl is routed to processor Gu Pu Gl Pl.

So we need to first look at how this permutation is performed.

3.6.1 GlPu Swap [p - 1, ..., 3p/4, p/2 - 1, ..., p/4, 3p/4 - 1, ..., p/2, p/4 - 1, ..., 0]

This swap is performed by a series of bit exchanges of the form B(i) = [B_{p-1}, ..., B_0], 0 ≤ i < p/4,

where

    B_j = p/2 + i   if j = p/4 + i,
    B_j = p/4 + i   if j = p/2 + i,
    B_j = j         otherwise.

Let G(i) and P(i) denote the ith bit of G and P, respectively. So G(0) is the least significant

bit in G, and P(d - 1) is the most significant bit in P. The bit exchange B(i) may be accomplished as

below:

Step 1: Every processor (G, P) with G(i) ≠ P(d/2 + i) moves its data to the processor (G, P')

where P' differs from P only in bit d/2 + i.

Step 2: Perform an OTIS move on the data moved in Step 1.

Step 3: Processors (G, P) that receive data in Step 2 move the received data to (G, P'), where P'

differs from P only in bit i.

Step 4: Perform an OTIS move on the data moved in Step 3.

The cost is 2 electronic moves and 2 OTIS moves.

To perform a GlPu Swap permutation, we simply do B(i) for 0 ≤ i < d/2. This takes d

electronic moves and d OTIS moves. By doing pairs of bit exchanges (B(0), B(1)), (B(2), B(3)),

etc. together, we can reduce the number of OTIS moves to d/2.
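Steps 1-4 of B(i), followed for i = 0, ..., d/2 - 1, can be traced at the index level as below ( our own Python, not from the paper ); the net effect is the exchange of the Gl and Pu bit blocks.

```python
def bit_exchange_B(I, d, i):
    """B(i), Steps 1-4: exchange the value of group bit i with local bit
    d/2 + i; data moves only when the two bit values differ."""
    mask = (1 << d) - 1
    G, P = I >> d, I & mask
    if (G >> i) & 1 != (P >> (d // 2 + i)) & 1:
        P ^= 1 << (d // 2 + i)    # Step 1: electronic move
        G, P = P, G               # Step 2: OTIS move
        P ^= 1 << i               # Step 3: electronic move
        G, P = P, G               # Step 4: OTIS move
    return (G << d) | P

def glpu_swap_dest(I, d):
    """GlPu Swap: apply B(i) for 0 <= i < d/2 (d even)."""
    for i in range(d // 2):
        I = bit_exchange_B(I, d, i)
    return I
```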

3.6.2 Bit Shuffle

A bit shuffle, now, can be performed following these steps:

Step 1: Perform a GlPu Swap.

Step 2: Do a local bit shuffle in each group.

Step 3: Do an OTIS move.

Step 4: Do a local bit shuffle in each group.

Step 5: Do an OTIS move.

Steps 2 and 4 are done using the optimal d move hypercube bit shuffle algorithm of [6]. The

total number of data moves is 3d electronic moves and d/2 + 2 OTIS moves.
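The five steps compose, at the index level, to the full 2d-bit bit shuffle; the sketch below ( our own Python, d even ) performs Step 1 directly as an exchange of the Gl and Pu bit blocks and Steps 2-5 as local shuffles and OTIS moves.

```python
def shuffle_bits(x, b):
    """Bit shuffle of a b-bit number (b even): interleave the halves,
    low-half bit k -> position 2k, high-half bit k -> position 2k + 1."""
    h, r = b // 2, 0
    for k in range(h):
        r |= ((x >> k) & 1) << (2 * k)
        r |= ((x >> (h + k)) & 1) << (2 * k + 1)
    return r

def bit_shuffle_dest(I, d):
    """Destination of the data at processor I under Steps 1-5 (d even)."""
    mask, h = (1 << d) - 1, d // 2
    # Step 1: GlPu Swap, written directly as an exchange of bit blocks.
    Pl, Pu = I & ((1 << h) - 1), (I >> h) & ((1 << h) - 1)
    Gl, Gu = (I >> d) & ((1 << h) - 1), I >> (d + h)
    I = (Gu << (d + h)) | (Pu << d) | (Gl << h) | Pl
    G, P = I >> d, I & mask
    P = shuffle_bits(P, d)    # Step 2: local bit shuffle in each group
    G, P = P, G               # Step 3: OTIS move
    P = shuffle_bits(P, d)    # Step 4: local bit shuffle in each group
    G, P = P, G               # Step 5: OTIS move
    return (G << d) | P
```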

3.7 Shuffled Row-major [p - 1, p/2 - 1, p - 2, p/2 - 2, ..., p/2, 0]

This is the inverse of a bit shuffle and may be done in the same number of moves by running the

bit shuffle algorithm backwards. Of course, Steps 2 and 4 are to be changed to shuffled row-major

operations.

4 BPC Permutations

Every BPC permutation A can be realized by a sequence of bit exchange permutations of the form

B(i, j) = [B_{2d-1}, ..., B_0], d ≤ i < 2d, 0 ≤ j < d, where

    B_q = j   if q = i,
    B_q = i   if q = j,
    B_q = q   otherwise,

and a BPC permutation C = [C_{2d-1}, ..., C_0] = ΠG ΠP, where |C_q| < d for 0 ≤ q < d; ΠG and ΠP

involve d bits each.

For example, the transpose permutation may be realized by the sequence B(d + j, j), 0 ≤ j < d;

bit reversal is equivalent to the sequence B(2d - 1 - j, j), 0 ≤ j < d; vector reversal can be

realized by performing no bit exchanges and using C = [-(2d - 1), -(2d - 2), ..., -0] ( ΠG =

[-(2d - 1), -(2d - 2), ..., -d], ΠP = [-(d - 1), ..., -0] ); and perfect shuffle may be decomposed into

B(d, 0) and C = [2d - 2, 2d - 3, ..., d, 2d - 1, d - 2, ..., 1, 0, d - 1] ( ΠG = [2d - 2, 2d - 3, ..., d, 2d - 1],

ΠP = [d - 2, ..., 1, 0, d - 1] ).

A bit exchange permutation B(i, j) can be performed in 2 electronic moves and 2 OTIS moves

using a process similar to that used for the bit exchange permutation B(i). Note that B(i) =

B(p/2 + i, p/4 + i).
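Analogous to B(i), a sketch of B(i, j) at the index level ( our own Python, not from the paper ): when index bits i and j differ, the data makes two electronic and two OTIS moves; otherwise it stays put.

```python
def bit_exchange_Bij(I, d, i, j):
    """B(i, j), d <= i < 2d, 0 <= j < d: exchange index bit i (a group bit)
    with index bit j (a local bit) in 2 electronic and 2 OTIS moves."""
    mask = (1 << d) - 1
    G, P = I >> d, I & mask
    gi = i - d                  # position of index bit i within G
    if (G >> gi) & 1 != (P >> j) & 1:
        P ^= 1 << j             # electronic move
        G, P = P, G             # OTIS move
        P ^= 1 << gi            # electronic move
        G, P = P, G             # OTIS move
    return (G << d) | P
```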

Our algorithm for general BPC permutations is:

Step 1: Decompose the BPC permutation A into the bit exchange permutations B1(i1, j1), B2(i2, j2), ...,

Bk(ik, jk) and the BPC permutation C = ΠG ΠP as above. Do this such that i1 > i2 > ... >

ik, and j1 > j2 > ... > jk.

Step 2: If k = 0, do the following:

Step 2.1: Do the BPC permutation ΠP in each group using the optimal algorithm of [6].

Step 2.2: Do an OTIS move.

Step 2.3: Do the BPC permutation ΠG in each group using the algorithm of [6].

Step 2.4: Do an OTIS move.

Step 3: If k = d, do the following:

Step 3.1: Do the BPC permutation ΠG in each group.

Step 3.2: Do an OTIS move.

Step 3.3: Do the BPC permutation ΠP in each group.

Step 4: If k < d/2, do the following:

Step 4.1: Perform the bit exchange permutations B1, ..., Bk.

Step 4.2: Do Steps 2.1 through 2.4.

Step 5: If k > d/2, do the following:

Step 5.1: Perform a sequence of d - k bit exchanges involving bits other than those in

B1, ..., Bk in the same orderly fashion described in Step 1. Recompute ΠG and ΠP.

Swap ΠG and ΠP.

Step 5.2: Do Steps 3.1 through 3.3.

The local BPC permutations determined by ΠG and ΠP take at most d electronic moves each

[6]; and the bit exchange permutations take at most d electronic moves and d/2 OTIS moves. So

the total number of moves is at most 3d electronic moves and d/2 + 2 OTIS moves.

5 Conclusion

In this paper we have shown that the diameter of the OTIS-Hypercube is 2d+1, which is very close

to that of an N2 processor hypercube. However, each OTIS-Hypercube processor is connected to

at most d + 1 other processors; while in an N2 processor hypercube, a processor is connected to up

to 2d other processors. We have also developed algorithms for frequently used data permutations,

and made performance comparisons between our algorithms and those obtained by simulating the

optimal hypercube algorithms using the simulation technique of [9]. For most of the permutations

considered, our algorithms are either optimal or within one move of being optimal. An algorithm

for general BPC permutations has also been proposed.

References

[1] M. Feldman, S. Esener, C. Guest, and S. Lee. Comparison between electrical and free-space optical

interconnects based on power and speed considerations. Applied Optics, 27(9), May 1988.

[2] W. Hendrick, O. Kibar, P. Marchand, C. Fan, D. V. Blerkom, F. McCormick, I. Cokgor, M. Hansen, and

S. Esener. Modeling and optimization of the optical transpose interconnection system. In Optoelectronic

Technology Center, Program Review, Cornell University, Sept. 1995.

[3] F. Kiamilev, P. Marchand, A. Krishnamoorthy, S. Esener, and S. Lee. Performance comparison between

optoelectronic and VLSI multistage interconnection networks. Journal of Lightwave Technology, 9(12),

Dec. 1991.

[4] A. Krishnamoorthy, P. Marchand, F. Kiamilev, and S. Esener. Grain-size considerations for optoelec-

tronic multistage interconnection networks. Applied Optics, 31, Sept. 1992.

[5] G. C. Marsden, P. J. Marchand, P. Harvey, and S. C. Esener. Optical transpose interconnection system

architectures. Optics Letters, 18(13):1083-1085, July 1993.

[6] D. Nassimi and S. Sahni. Optimal BPC permutations on a cube connected computer. IEEE Transactions

on Computers, C-31(4):338-341, Apr. 1982.

[7] S. Sahni and C.-F. Wang. BPC permutations on the OTIS-Mesh optoelectronic computer. In Proceedings

of the Fourth International Conference on Massively Parallel Processing Using Optical Interconnections

(MPPOI '97), pages 130-135, 1997.

[8] C.-F. Wang and S. Sahni. Basic operations on the OTIS-Mesh optoelectronic computer. Technical report,

CISE Department, University of Florida, 1997.

[9] F. Zane, P. Marchand, R. Paturi, and S. Esener. Scalable network architectures using the optical

transpose interconnection system (OTIS). In Proceedings of the Second International Conference on

Massively Parallel Processing Using Optical Interconnections (MPPOI '96), pages 114-121, 1996.