Title Page
Table of Contents
List of Figures
Program and algorithm models
Arrays for uniform recurrent algorithms
Mapping to fixed architectures
Numerical linear algebraic arrays
Bus connected architectures -- theory & practice
FRAP assembly language definition file
FRAP assembly language 3X3 matrix-matrix multiply program
FRAP c-language pin level simulator
Biographical sketch

Title: Systolic and bus connected arrays -- architectures, algorithms and algorithm transformations
Permanent Link: http://ufdc.ufl.edu/UF00082281/00001
 Material Information
Title: Systolic and bus connected arrays -- architectures, algorithms and algorithm transformations
Physical Description: viii, 173 leaves : ill. ; 28 cm.
Language: English
Creator: Dowling, Eric M., 1962-
Publication Date: 1989
Subject: Systolic array circuits   ( lcsh )
Array processors   ( lcsh )
Algorithms   ( lcsh )
Electrical Engineering thesis Ph. D
Dissertations. Academic -- Electrical Engineering -- UF
Genre: bibliography   ( marcgt )
non-fiction   ( marcgt )
Thesis: Thesis (Ph. D.)--University of Florida, 1989.
Bibliography: Includes bibliographical references (leaves 142-145).
Statement of Responsibility: by Eric M. Dowling.
General Note: Typescript.
General Note: Vita.
 Record Information
Bibliographic ID: UF00082281
Volume ID: VID00001
Source Institution: University of Florida
Rights Management: All rights reserved by the source institution and holding location.
Resource Identifier: aleph - 001480206
oclc - 21049903
notis - AGZ2167









ACKNOWLEDGEMENTS

First and foremost I would like to thank Dr. Fred J. Taylor. He was able to guide me technically and show me the way in the academic world. He was an inspiration to me and remains a role model to follow in the years to come. He provided me with the support and resources I needed to get to where I am today. I also thank my committee members, Dr. L.W. Couch, Dr. J.C. Principe, Dr. H. Lam, and Dr. K. Sigmon. Each of these professors provided me with guidance, advice and high quality instruction in the classroom.

I would like to express my respect and thanks to the members of the HSDAL. They were all great to work with and full of energy and good ideas. Special thanks go to Dr. Mike Sousa for the work we did together in the systolic array algorithm mapping area and the leadership he provided in the area of mathematics.

During the entire time that I worked on my Ph.D. I was funded and supported by the Army Research Office under an ARO Fellowship. They provided me with the necessities I needed to stick it out for over three years in pursuit of this degree. In addition, they provided me with the means to get to conferences so that I could meet my colleagues nationwide. For this I am deeply grateful.

I would also like to take this opportunity to thank my parents for their support and for providing me with strong role models. It was natural for me to seek graduate level education since my mother has a Ph.D. and my father has a law degree.

Finally I would like to thank my lovely girlfriend, Joan. She made my Ph.D. years both exciting and enjoyable.


TABLE OF CONTENTS

ACKNOWLEDGEMENTS .................................. ii

LIST OF FIGURES ................................... v

ABSTRACT .......................................... vii

ONE INTRODUCTION .................................. 1
    1.1 Array Processor Fundamentals .............. 1
    1.2 Scope of Dissertation ..................... 3
    1.3 Literature Review ......................... 6
    1.4 Outline and Summary ....................... 12

TWO PROGRAM AND ALGORITHM MODELS .................. 15
    2.1 Programs and Algorithms ................... 16
    2.2 Program and Algorithm Dependence Structures 22
    2.3 Program and Algorithm Equivalences ........ 29

THREE ARRAYS FOR UNIFORM RECURRENT ALGORITHMS ..... 33
    3.1 Uniform Recurrence Programs and Algorithms 34
    3.2 Matrix Based URP and URA Representation and Analysis 38
    3.3 Transform Domain Analysis ................. 44
    3.4 Algorithmically Specified Array Synthesis . 53

FOUR MAPPING TO FIXED ARCHITECTURES ............... 62
    4.1 Partitioning Fundamentals ................. 63
    4.2 Integer Based Index Set Pretransformations 70
    4.3 Processor Domain Post-transformations ..... 75
    4.4 Array Synthesis Using Index Transformations 83

FIVE NUMERICAL LINEAR ALGEBRAIC ARRAYS ............ 88
    5.1 Multiple Expression Nested Loop Programs (MENLPs) 88
    5.2 NLA Mapping Methodology ................... 90
    5.3 Design of Cholesky Decomposition Arrays ... 94
    5.4 Design of QR-Decomposition Arrays ......... 103

SIX BUS CONNECTED ARCHITECTURES -- THEORY & PRACTICE 112
    6.1 The Bus Connected Architecture ............ 113
    6.2 Design of URA Implementations for BCA Arrays 118
    6.3 The HSDAL/Texas Instruments FPAP Architecture 125
    6.4 The HSDAL/Texas Instruments FPAP Software . 131

SEVEN CONCLUSION .................................. 136
    7.1 Summary ................................... 136
    7.2 Conclusions ............................... 138
    7.3 Future Research Potentials ................ 140

BIBLIOGRAPHY ...................................... 142

APPENDICES
    FRAP Assembly Language Definition File ........ 146
    FRAP Assembly Language 3x3 Matrix-Matrix Multiply Program 150
    FRAP C-Language Pin Level Simulator ........... 152

BIOGRAPHICAL SKETCH ............................... 173


LIST OF FIGURES

1.1 Systolic Interconnection Structure and Embedded Topologies ... 4
1.2 The Bus Connected Architecture ... 5
2.1 Graphical View of Simple Algorithm with a 1-D Index Set ... 18
2.2 The Lexicographical Execution Ordering of Program P3 ... 21
2.3 Simple Two Statement Algorithm Graph with Dependence and Variable Usage Information Displayed ... 25
2.4 Nonlocalized and Localized Algorithm Dependence Graphs ... 27
2.5 Localized Matrix Multiply Dependence Structure ... 29
3.1 Uniform and Nonuniform Dependence Graphs ... 33
3.2 Upper Triangular Matrix Multiply Dependence Graph ... 42
3.3 Data Distributions for Algorithm with 6y=I2 ... 49
3.4 Position Trace of y00 Through the Processor Array ... 50
3.5 The Systolic Algorithm of Example 3.1 with Velocity and Distribution Vectors Displayed ... 58
3.6 2-D Convolution Algorithm Designed in Example 3.2 ... 61
4.1 In-Place Partitioning Method ... 66
4.2 In-Place Algorithm with Partitioning Lines ... 68
4.3 The Partitioned Algorithm as Four Sub-Arrays ... 69
4.4 1-D Systolic Convolution Array with x0 Position Trace Shown ... 78
4.5 1-D Convolution on a 2-D Array with x0 Position Trace Shown Using Division Algorithm Mapping ... 79
4.6 1-D Convolution on a 2-D Array with x0 Position Trace Shown Using CRT Based Mapping ... 79
4.7 Dimensions of Transformed Virtual Array and Non-zero Data Velocities of Example 4.4 ... 80
4.8 Unpartitioned Array's Initial 0th Row Data Alignment ... 81
4.9 Physical Array with y-Memory and Initial y- and z-Memory Pointer Values Displayed ... 82
5.1 Orderings for Computing the 4x4 Cholesky Decomposition ... 96
5.2 The Cholesky Decomposition Physical Array Processor ... 100
5.3 The 4x4 Cholesky Decomposition Array Algorithm ... 101
5.4 Triangularization of a Matrix Using Givens Rotations ... 105
5.5 The QR-Decomposition Physical Array Processor ... 109
5.6 The 4x3 QR-Decomposition Array Algorithm ... 111
6.1 The General Bus Connected Architecture (BCA) ... 114
6.2 A Simple PE Architecture Ideal for Use in a BCA Array ... 115
6.3 4x4 Matrix Multiply Algorithm for a BCA ... 118
6.4 A Single Output Determines the Input Sequences of Two Busses ... 121
6.5 A 9x1 BCA for Concurrently Computing Nine FIR Outputs ... 124
6.6 A 3x3 BCA for Concurrently Computing Nine FIR Outputs ... 125
6.7 Basic SN74ACT8847 PE Architecture ... 127
6.8 FPAP Architecture with Common Bus and IBM PC-AT Interface ... 129
6.9 Microinstruction Format Used to Control FPAP Architecture ... 130
6.10 FPAP Control Structure and Host Interface ... 131
6.11 Programming Models for Data Routing Commands ... 133

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

SYSTOLIC AND BUS CONNECTED ARRAYS -- ARCHITECTURES, ALGORITHMS AND ALGORITHM TRANSFORMATIONS

By

Eric M. Dowling

May 1989

Chairman: Dr. Fred J. Taylor
Major Department: Electrical Engineering

Systolic and bus connected arrays have recently been shown to offer a significant throughput advantage over serial architectures for many digital signal, image and matrix processing algorithms. While array structures are well suited for the parallel execution of computationally intensive algorithms that operate on large data sets, they are difficult to conceptualize, design and program. This dissertation focuses on the problem of automatically generating an algorithmically specified array design from a serial program description. Program loop structures are cast into index set based program and algorithm models. The models are subsequently transformed to array structures. Matrix based analysis tools are derived to manipulate and transform the models. Integer transformations are introduced in order to pretransform or post-transform the algorithm so that it can be mapped into fixed sized or fixed dimensioned arrays. A set of design equations is derived that may be used to solve for an algorithm transformation that will generate an algorithmically specified array with desired data flow characteristics. The method is showcased by examples where several arrays are designed. Arrays are designed for Cholesky decompositions, QR-decompositions, matrix multiplies, convolutions and more. Also, a structured method is presented to generate bus connected array implementations of algorithms based on the dependence graph of the serial algorithm. Finally, the design and implementation of an experimental programmable bus connected array are presented.


CHAPTER ONE
INTRODUCTION

1.1 Array Processor Fundamentals

An array processor consists of a regular lattice of interconnected processor elements and attendant memory. Control over the processors in the array may come from a central controller (Single Instruction Multiple Data, SIMD), or each processor may execute its own instruction stream (Multiple Instruction Multiple Data, MIMD). The complexity of the processor elements (PEs) may vary from simple single-bit ALUs to complex instruction set processors with word parallel fixed or floating point arithmetic.

Array processors present an alternative computing structure for the high speed execution of certain types of algorithms. They have been applied to image processing, mathematical modeling of physical processes, digital signal processing (DSP), and numerical linear algebra (NLA). All of these applications have several common attributes. First of all, they are characterized by regularly structured algorithms. Secondly, they all operate on n-dimensional data sets where n is at least one. Thirdly, they often run in real time. The high complexity and real time requirements provide the motivation for seeking faster computing structures for executing these algorithms. The regularly structured algorithms acting on n-dimensional data sets give rise to the regular n-dimensional lattice structures of the arrays. Some arrays are constructed specifically for one algorithm (algorithmically specified arrays); the second type of array is a programmable array, capable of implementing a wide class of algorithms.

One encounters many problems when implementing an algorithm on an array processor in hardware. First of all, most problems require a large number of processors to execute a given algorithm on a given data set without partitioning. Another problem is the hardware complexity of an interconnection topology. If word parallel data paths are used, the number of connections may grow too fast for practical implementations, and as a result many practical designs are bit serial [Fou88]. Array control poses another physical problem; the simplest control structure to implement is SIMD. Another problem concerns array level I/O bandwidth. If data arrive from a single source and are passed to the array boundary of n PEs, and word parallel transactions are occurring, then n words must be read from memory in order to service a single cycle of the array. This means that the memory structure must operate n times faster than the PEs. Again, this problem has caused many designs to be bit serial so that a single m-bit memory read could service m boundary processors in a single cycle.
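The bandwidth arithmetic above can be made concrete with a small back-of-the-envelope model (the function names and example array sizes here are illustrative assumptions, not figures from the text):

```python
# Back-of-the-envelope model of array-boundary I/O bandwidth.
# If n boundary PEs each consume one word per array cycle from a single
# memory, that memory must deliver n words per PE cycle.

def required_memory_speedup(boundary_pes: int, words_per_pe_per_cycle: int = 1) -> int:
    """Factor by which the memory must outpace the PE clock (word-parallel case)."""
    return boundary_pes * words_per_pe_per_cycle

# A 16-PE boundary fed word-parallel from one memory needs 16x memory speed:
assert required_memory_speedup(16) == 16

# The bit-serial workaround: a single m-bit read services m boundary
# processors, so the number of reads per array cycle collapses.
def bit_serial_reads_per_cycle(boundary_pes: int, memory_word_bits: int) -> int:
    # ceiling division: reads of m bits needed to feed n 1-bit-per-cycle PEs
    return -(-boundary_pes // memory_word_bits)

assert bit_serial_reads_per_cycle(16, 32) == 1   # one read feeds the whole boundary
assert bit_serial_reads_per_cycle(33, 32) == 2
```

This is why, as noted above, many practical designs traded word-parallel paths for bit-serial ones.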

VLSI technology has had and is having an impact on array processor designs. Now multiple PEs may be packaged on a single chip, making these systems more practical to build. More complex PEs that required whole boards to implement may now be implemented on a single chip. Still, VLSI technology imposes constraints on designs. The cost of communications in VLSI is very high. Communication in VLSI circuits takes up approximately 90% of the total real estate of the chip [Sei84]. The I/O pads and drivers for inter-chip communication may take 50% [Sei84] of the power. Also, if many chips are to be connected together, the interprocessor communication cost at the board level will be exceedingly high, especially if word wide paths are used.


Besides the impact of VLSI on array processor technology, advances have also been made on array processor algorithms. Mathematicians have rewritten many algorithms for parallel implementations, going back to Givens rotations and Jacobi iterations over the methods preferred on serial machines (Householder transformations) [Gol83]. Researchers in the fields of image and signal processing have developed array algorithms for many of their applications. Also, a new science is emerging that considers the problem of geometrically or mathematically representing an algorithm and then transforming it to an equivalent array processor algorithm using linear and/or affine transformations. These aspects of advancement will be covered in detail in the literature review in Section 1.3.

1.2 Scope of Dissertation

This dissertation research focuses on systolic and bus connected array processors and algorithms. More specifically, it studies systematic mathematical approaches to designing algorithms for these architectures. In considering these architectures, several implicit assumptions are made. The systolic architecture is considered to have nearest neighbor connections as depicted in Figure 1.1a for the two dimensional case. Figures 1.1b, 1.1c, and 1.1d show how linear, mesh and hexagonal interconnection topologies may be implemented with the connections of Figure 1.1a. Also, each processor may operate autonomously in a word parallel mode and be able to compute whatever operations (sqrt, *, +, -, /, ...) are required. In some cases the PEs may be required to make a decision. The Bus Connected Architecture (BCA) is assumed to be configured as depicted in Figure 1.2, where all paths are also assumed to be word parallel. In the BCA a SIMD control strategy is assumed, with the addition of a halt line that allows given PEs to ignore instructions and an output enable that allows selected PEs to write to busses individually.
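As a rough illustration of this control discipline, the sketch below steps a toy single-bus BCA in which a per-PE halt mask gates the broadcast instruction and exactly one output-enabled PE drives the bus (the PE model and instruction names are hypothetical, not the FRAP instruction set):

```python
# Toy single-bus BCA cycle: SIMD broadcast instruction, per-PE halt bits,
# and one output-enabled PE driving the shared bus.

def bca_step(regs, instr, halt_mask, out_enable):
    """regs: per-PE accumulators; instr: ('load_bus',) or ('add_bus',).
    halt_mask[i] True means PE i ignores the broadcast instruction.
    out_enable: index of the single PE allowed to write the bus."""
    bus = regs[out_enable]            # the selected PE drives the bus
    new = list(regs)
    for i, r in enumerate(regs):
        if halt_mask[i]:
            continue                  # halted PEs ignore the instruction
        if instr == ('load_bus',):
            new[i] = bus
        elif instr == ('add_bus',):
            new[i] = r + bus
    return new

regs = [1, 2, 3, 4]
# PE 2 drives the bus (value 3); PE 0 is halted; the rest add the bus value.
regs = bca_step(regs, ('add_bus',), [True, False, False, False], out_enable=2)
assert regs == [1, 5, 6, 7]
```

The halt line and output enable are what let a SIMD controller steer data through the shared busses without per-PE instruction streams.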





Figure 1.1 : Systolic Interconnection Structure and Embedded Topologies
la: The Interconnection Network; Ib: Linear Topology ; Ic: Mesh Topology ;
Id: Hexagonal Topology

The central focus of the dissertation is to develop methods for the design of systolic and BCA algorithms. To this end, known serial algorithms are cast into a mathematical framework that allows them to be manipulated and transformed into equivalent parallel algorithms. The mathematical model is developed for a subclass of algorithms that consist of a single statement embedded in an n-level nested loop. The statement operates on m-dimensional data sets where indexing into the data is done via a linear combination of the loop indices. The method is developed for the systolic array case and is later used to help design BCA algorithms. Next the model is extended to include more diversified algorithms such as those that arise in the fields of image processing and numerical linear algebra. The central objective is to provide a methodology which one can apply to design array algorithms, or that one could program into a CAD tool to convert known serial algorithms to parallel algorithms automatically.
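A canonical member of this algorithm class is matrix multiply: one statement inside a three-level nested loop, where every array access is a linear combination of the loop indices (a sketch of the class, not the dissertation's notation):

```python
# Matrix multiply as a single-statement, 3-level nested loop.
# Each access c[i][j], a[i][k], b[k][j] selects data through a linear map
# of the index vector (i, j, k) -- the property the mapping method exploits.

def matmul(a, b, n):
    c = [[0] * n for _ in range(n)]
    for i in range(n):              # index set: {(i, j, k) : 0 <= i, j, k < n}
        for j in range(n):
            for k in range(n):
                c[i][j] = c[i][j] + a[i][k] * b[k][j]   # the single statement
    return c

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
assert matmul(a, b, 2) == [[19, 22], [43, 50]]
```

Every iteration is a point of the integer index set, which is what allows the loop nest to be reindexed and mapped to an array.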

Figure 1.2 : The Bus Connected Architecture (BCA)

Another focus of the dissertation is the design, implementation and programming of an experimental attached array processor system. The system, called the Floating Point Array Processor (FPAP), is under development as an experimental machine for image and signal processing as well as matrix computations. The machine has a modified bus connected architecture (MBCA), SIMD control, and 32-bit data paths. This machine will be examined as an experimental endeavor that will keep a focus on the practical and technological issues that are involved in implementing an actual system.

1.3 Literature Review

As mentioned in Section 1.1, research trends in array processor design and analysis have taken several directions. Some of these directions are the building of models for parallel computation, the rewriting of algorithms so that they may be more efficiently executed on parallel machines, the algebraic and/or geometric representation of algorithms, and the transformation of these representations to equivalent parallel implementations. Another research trend has been to study the characteristics an architecture must possess in order to have an efficient VLSI implementation. Hence the study of VLSI architectures and VLSI algorithms was introduced. Researchers studied algorithms and how to alter them to run efficiently on VLSI array processors. Another area is the design and analysis of experimental machines, user programming environments and the like. This literature review will present the highlights of these research trends along with related references.

Early work in the area of algorithm mapping was performed as early as 1967 by Karp, Miller and Winograd [Kar67] at the IBM Watson Research Center. They considered systems of uniform recurrence equations and their implementation on theoretical parallel computers. This work introduced the concept of computations occurring at points in an integer space. The uniform recurrence equations were defined in such a way that they were guaranteed to have uniform dependencies throughout the index set. They also viewed the algorithm in terms of a dependence graph and looked at algebraic conversions of the graph to perform parallel scheduling of computations.

In 1969, Karp and Miller produced a formal mathematical model for parallel computation [Kar69]. An important concept introduced in this paper was that the sequencing of computations was considered separately from the actual computations themselves. They also used the concept of algorithm equivalence. Although they did not present a method to map from one "schema" to an equivalent maximally parallel form, they did propose that future work be done to this end.

Around the same time an aggressive research effort was proceeding at the University of Illinois. The ILLIAC III array processor project spanned the 1960s, and the ILLIAC IV was under development in 1967 and eventually built [Fou88]. Research concentrated both on hardware and software aspects. Kuck was a key figure who worked on transforming programs into forms that could be executed more efficiently in parallel [Kuc72],[Ban79]. The research focused on ways to rewrite statements and loops into equivalent forms that were better suited for parallel execution. Several rules were introduced to transform programs to equivalent enhanced forms. In [Kuc72] the concept of dependence as an arc from the output of one computation to the input of another was stated. The central theme involved the reordering of computations in advantageous ways that preserved the dependence structure of the original program.

Lamport [Lam74] was designing an optimizing compiler for the ILLIAC IV and looked at methods to parallelize FORTRAN DO loops. He derived two methods for converting the loops to explicitly parallel forms. The first was called the hyperplane method and was applicable to SIMD and MIMD machines. The second was called the coordinate method and was applicable only to SIMD structures. This work would influence others in the development of methods to map nested loops to VLSI arrays in years to come.


Another researcher at the University of Illinois was Kuhn [Kuh80], who extended some of Kuck's ideas of program transformations [For84]. He developed methods to transform the index set of an algorithm into an alternate index set that produced an equivalent algorithm preserving the dependencies of the original algorithm. Using his method he was able to map several algorithms to SIMD structures, and he showed how the index set could be transformed to directly yield VLSI array algorithms. His method involved interpreting the transformed index set as time and processor dimensions.

Moldovan [Mol82],[Mol83] extended and formalized the work of Kuhn. He looked at the problem of algorithm transformation as reindexing the index set. He provided necessary and sufficient conditions for a transformation to exist, which were later found to be flawed [Sou87] and corrected. He cast the problem in matrix form and defined a matrix whose column vectors were the algorithm's dependence vectors. He used a linear transformation to transform the algorithm's dependencies to a new coordinate system. The new coordinates represented time and processors. A valid transformation was an integer bijection matrix that forced the dependence vectors to be projected to positive time coordinates. The transform was partitioned into a time and a space transform, the details of which will be discussed more thoroughly in chapter two.
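The validity condition can be sketched concretely. Taking the localized matrix-multiply dependencies as the three unit vectors, a candidate transformation T (first row the time transform, remaining rows the space transform) is valid in this simplified sketch when it is an integer bijection (unimodular) and its time row maps every dependence vector to strictly positive time; architecture-specific constraints are ignored here:

```python
# Sketch of a Moldovan-style validity check: T must be a unimodular
# integer matrix whose first row (the time transform) projects every
# dependence vector onto a strictly positive time coordinate.

def det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
          - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
          + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def valid_transform(T, deps):
    """T: 3x3 integer matrix; deps: dependence vectors (matrix columns)."""
    if abs(det3(T)) != 1:                 # integer bijection (unimodular)
        return False
    pi = T[0]                             # time transform
    return all(sum(p * d for p, d in zip(pi, dep)) > 0 for dep in deps)

# Localized matrix-multiply dependencies: unit vectors along i, j, k.
D = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]

# Classic choice: time = i + j + k, space = (j, k) coordinates.
T = [[1, 1, 1],
     [0, 1, 0],
     [0, 0, 1]]
assert valid_transform(T, D)

# The identity fails: dependencies along j and k get zero time components.
assert not valid_transform([[1, 0, 0], [0, 1, 0], [0, 0, 1]], D)
```

The positive-time requirement is what guarantees that every computation's operands are produced before they are consumed in the transformed schedule.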

A main difference between Moldovan's work and previous work was the type of parallel architecture being considered. Kuck and the others were considering general parallel machines that were assumed to act as a set of uniprocessors that could interact with memory without conflict. Moldovan was considering systolic architectures, which are practical for VLSI implementation. Moldovan brought the theoretical work in parallelism extraction and algorithm transformations together with the newly introduced systolic architecture.


In 1978 Kung and Leiserson introduced the concept of systolic arrays [Kun78], [Mea80], [Ku82] and developed some of the first systolic algorithms. They also developed graph based methods to assist in the design of systolic algorithms. In [Lei81] the systolic conversion lemma is presented along with related corollaries. The net result was a method to convert an algorithm flow graph into an equivalent systolic graph by adding delay elements in a specified way. Using this technique, broadcasts are eliminated. In [Kun84] the cut theorem was introduced. This provided a method to further manipulate dependence graphs by splitting them up into subgraphs and inserting delays between them. Again, the key idea was to provide a tool to manipulate the graphs into equivalent systolic forms.
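The flavor of these delay-insertion manipulations can be seen in a toy register-level simulation of a 1-D systolic FIR array: weights stay resident in the cells, inputs pass through two inserted delay registers per cell instead of being broadcast, and partial sums advance one cell per cycle (an illustrative sketch, not one of the cited constructions):

```python
# Broadcast-free systolic FIR: two delays per cell on the x path, partial
# sums moving one cell per cycle. Computes y[n] = sum_j w[j] * x[n-j].

def systolic_fir(x, w):
    k = len(w)
    xd = [0] * (2 * k)        # inserted delay registers: xd[d] = x delayed d cycles
    sreg = [0] * k            # sreg[0] is the freshly injected sum (always 0)
    y = []
    for t in range(len(x) + k - 1):
        xin = x[t] if t < len(x) else 0
        xd = [xin] + xd[:-1]                              # advance the delay line
        # cell j sees x delayed by 2j cycles, matching its in-flight sum
        outs = [sreg[j] + w[j] * xd[2 * j] for j in range(k)]
        if t >= k - 1:
            y.append(outs[-1])        # completed output leaves the last cell
        sreg = [0] + outs[:-1]        # sums advance one cell per cycle
    return y

assert systolic_fir([1, 2, 3, 4], [1, 1, 1]) == [1, 3, 6, 9]   # moving sum
```

The two x-delays per cell are exactly the kind of inserted registers the conversion lemma uses to replace an input broadcast while preserving the computed values.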

Kung at Carnegie Mellon University stressed a number of practical hardware issues concerning systolic arrays, and Leiserson [Lei81] studied VLSI layout problems. Kung et al. designed a VLSI chip called the Programmable Systolic Chip (PSC) [Kun85b] that would act as a systolic processing node. The chip was designed to act as a simple programmable processing node that could be put into board level systolic arrays. Their experience with the project was unfavorable and they modified their concept of a systolic array from very fine grain to more coarse grain. Using the modified concept, they produced a mini-supercomputer called the WARP [Ann87] machine under a DARPA contract with General Electric as their industrial partner. In this design, a single systolic node occupied a one square foot board. The machine was configured as a linear array of ten cells. The project was successful and the machine was primarily applied to computer vision. A strong effort went into the design and implementation of W2, a parallel language for the WARP [Gro87]. System software was also written for a SUN workstation based WARP network. A command shell was implemented in LISP [Bru87] to form an elegant user environment. In 1990 the iWarp, a single chip VLSI version of the systolic node found in the WARP, will become available from Intel Corp. The introduction of the iWarp should influence the design of many systolic machines in the future.

Another key researcher in the field of VLSI array processors is S.Y. Kung. He has recently published a textbook [Ku88] in which several aspects of array processor technology are described, including algorithm mapping techniques, systolic and wavefront architectures, and hardware and software issues. He devised a variation of the systolic array called a wavefront array [Ku85a]. A wavefront array is essentially a systolic array with a data driven asynchronous control strategy. Some algorithms are better suited to this type of structure, and some problems such as clock skewing can be alleviated. According to [Kun88], he also developed array processor hardware and software. A machine called the WAP (Wavefront Array Processor) was originally constructed as an 8x8 system [Kun82b]. He developed a language called MDFL (Matrix Data Flow Language) and a simulator to execute the language in order to assist in the hardware design.

With the introduction of the systolic and wavefront architectures, algorithm mapping research tended to concentrate heavily on these architectures, especially the systolic array. Since the wavefront array is just a variation of a systolic array, the techniques for mapping algorithms to systolic arrays apply to wavefront arrays as well. A review of these methods was compiled by Fortes and Wah [For85]. Most of these methods were either graph based [Kun88], geometrically based [Cap83] or algebraic [Mol82],[For84]. One method is based on a set of equations derived in "the theorem of systolic processing" [Li85]. These equations serve as constraining equations for an optimization procedure that can find a systolic array that optimizes a time-processor cost function. The method was interesting, and several parameters of systolic array algorithms such as data velocities and distributions were formally defined. The problem with the method was that the theorem of systolic processing did not govern all algorithms but just a subclass of uniform recurrences in two dimensions. Therefore the method cannot be extended to algorithms that deviate from the scope of the theorem, which is quite limited. The method can be extended if new "theorems of systolic processing" are derived for algorithm classes of interest.

Fortes has made significant contributions to the algorithm transformation field. He studied algorithm mapping theory under Moldovan and extended and further developed his method [For84]. He developed methods to optimize the time transform, and methods to select a space transform. With Moldovan, he developed a partitioning technique based on the transformation method [Mol86]. He also studied fault tolerance and alternate transformation methods that preserved various types of algorithm equivalence.

Due to the research in systolic architectures and algorithms, several systolic and wavefront machines have been constructed, such as the SLAPP [Dra87] (Systolic Linear Algebra Parallel Processor), which was built at the Naval Ocean Systems Center for numerical linear algebraic algorithms. A systolic array, called the Matrix-1, was built by Saxpy [Fou87] and performs nearly a gigaFLOP for some DSP and NLA algorithms. Several other systems were reported by Fortes and Wah [For87] that include systolic arrays for beamforming and general DSP and NLA applications.

In 1983 a systolic algorithm was introduced to solve linear and least squares

systems via Givens rotations on a systolic array [Hel83]. Others have developed

different algorithms that compute the QR decomposition via similar algorithms

[Boj84], [Luk86]. Stewart derived an algorithm based on Jacobi iterations to com-

pute a Schur decomposition [Ste85]. Jacobi iterations were also used by Brent and

Luk [Bre85] to solve eigenvalue and singular value problems on systolic architec-

tures. Nash and Hansen [Nas88] developed a modified Faddeeva algorithm to solve
a wide class of NLA problems on a systolic SIMD array. These researchers are
looking at the algorithms themselves and finding stable alternatives that can be

efficiently implemented in parallel.

The research in systolic arrays can be separated into algorithm modeling and

transformation techniques, hardware/architecture, and pure algorithms research.

All three of these areas are related and intertwined. Array processor designers,

programmers, and algorithms researchers should be aware of all areas and use

them to their advantage. Current and future research must extend the knowledge

base of good parallelizable algorithms, broaden algorithm mapping techniques, and

explore architectural and practical hardware issues.

1.4 Outline and Summary

This dissertation focuses on the problem of mapping algorithms to systolic and

bus connected processor arrays. It presents matrix-based techniques to compute

the parallel algorithm's data flow parameters and to solve for an algorithm trans-

formation matrix. Many of the ideas in this work stem from the works of Kuck

[Kuc79], Moldovan [Mol82], [Mol83], and Fortes [For84]. This dissertation will

present a systematic methodology to design parallel algorithms from serial algo-

rithm descriptions. Also, a 3x3 experimental array processor will be developed and

explored. An algorithm design methodology will be presented to help design algo-

rithms for the experimental machine.

In chapter two, a mathematical framework for the description and transforma-

tion of algorithms and programs is established. Both programs and algorithms are

defined over an index set of integer vectors that will allow the programs and algo-

rithms to be manipulated mathematically. The important concepts of program and

algorithm equivalence are also defined.

In chapter three, a class of algorithms known as uniform recurrent algorithms

is studied along with the programs that describe (generate) them. The algorithms


are studied both in their original form and in their systolic forms. The original

sequential algorithms are mapped to systolic algorithms through a linear transfor-

mation that was first studied by Moldovan and Fortes. Several functions are de-

rived that reveal information concerning data flow and collisions in the systolic

algorithm. The general notion of time is expanded from a purely scalar entity to a

temporal vector. The meaning of vector time is explained and the temporal vector

is used to map the two dimensional convolution algorithm to a two dimensional

systolic array. Several examples are used to illustrate how the theory is to be ap-

plied to actual problems. A set of design equations called the systolic array synthe-

sis equations is derived and applied to design systolic array algorithms with

prespecified data flow constraints.

In chapter four, the problem of mapping algorithms to fixed architectures is

addressed. Methods are presented to map algorithms to fixed sized and/or fixed

dimensioned arrays. A set of nonlinear integer transformations is used in conjunc-

tion with the linear transformation to expand the degrees of freedom available to

meet the physical architectural constraints.

In chapter five, the framework developed in chapter three is extended to

accommodate a wider class of algorithms. Program transformations are used to

put more complex algorithms in a form that may subsequently be mapped to sys-

tolic MIMD architectures. The method will be illustrated by mapping several algo-

rithms such as the LU, Cholesky, and QR decompositions into array structures.

In chapter six, the general Bus Connected Architecture (BCA) and the design

of an attached modified bus connected array processor is described. The bus con-

nected architecture will be studied and a systematic design method will be pre-

sented to design uniform recurrent algorithms for BCA arrays. By designing an

actual system, the practical difficulties and hardware limitations are discovered.

Various architectural options are considered that affect the hardware complexity


and efficiency of the overall system. Tradeoffs that must be made when construct-

ing a system are described along with the way the target algorithms affect the

design of the architecture.

Chapter seven offers conclusions of the research and outlines where future

work would be useful. A bibliography is provided for reference and appendices

are included that contain relevant software listings.

PROGRAM AND ALGORITHM MODELS


An algorithm is normally thought of as a method for solving a problem, usually

on a computer. The algorithm is coded via a programming language so that the

computer may interpret and execute it. When the algorithm executes, it accepts an

input data set and generates an output data set. The algorithm may be viewed as

an ordered set of operations that are applied to the input data set to produce the

output data set. The program is the ordered set of instructions that are used to

code the algorithm. Normally the program is considered to be an implementation

of an algorithm. The algorithm is the general recurrence relation and the program

is a particular realization of the algorithm. In the systolic array mapping research

area the program is considered to be a computer language description of an algo-

rithm [For84]. In an array compiler or CAD tool, the underlying recurrence struc-

ture of the algorithm will be derived from the particular program that is presented

as input. Once the general algorithm structure is determined, a parallel realization

will be produced.

In this chapter programs and algorithms will be viewed in mathematical terms.

Definitions will first be stated for serial algorithms and programs, then more gen-

eral definitions that incorporate the concepts of the index set, dependence struc-

ture, and the sets of computations and expressions are introduced. The algorithm

model presented in this chapter will follow the works of Karp, Kuhn, Moldovan

and Fortes to a large degree. In Karp and Miller [Kar67], the concept of the index


set and the dependence structure defined over the index set was introduced. Kuhn

used the index set to map algorithms to parallel processing structures. The de-

pendence matrix and the concept of transforming the dependence matrix to an

altered dependence matrix of the array algorithm was introduced by Moldovan

[Mol82]. Fortes [For84] developed a detailed mathematical description of an algorithm.


2.1 Programs and Algorithms

It is important to draw a distinction between a program and an algorithm. Let
X be the input data set, Y be the output data set and W be a transformation

W:X->Y. The algorithm may be thought of as a sequence of finitary operations
that are applied to X to obtain the set Y. If Ω is the sequence of operations that is
applied to X to obtain Y, and Z is the set of intermediate values generated as the
algorithm executes, then the serial algorithm, Aser, is the tuple Aser = (X,Y,Z,Ω). If
Γ is a list of instructions that causes a given computer to execute Ω, and ψP is the
set of program variables that are referenced in the list of instructions, then the
serial program, Pser, is the tuple Pser = (ψP,Γ). The algorithm consists of the data

values and the physical action involved in a processing task, and a program con-

sists of the data structures and the list of instructions that cause the computer to

execute an algorithm. Program variables are data objects that may take on many

values, while algorithm variables are specific instances or values of program vari-

ables. If a program Pser causes the computer to execute the algorithm, Aser, then

Pser is said to generate Aser and we write Pser => Aser.
A data transformation executed on a computer may be viewed from mathemati-

cal, algorithm, or program levels. The mathematical level concerns the transfor-

mation itself. For example, one may wish to compute the Discrete Fourier Trans-

form (DFT). It is well known that there are many ways to compute this transfor-


mation on a computer. That is, several algorithms exist to compute the DFT. The

algorithm level concerns the sequence of computations used to perform the data

transformation. The program level concerns the instructions that are used to code

the algorithm. So an infinite set of programs exists that will generate a given algo-

rithm. A program transformation or translation is used to move between programs

in this set.

The above view gives rise to the notions of algorithm and program equivalence.

Two algorithms A1 = (X1,Y1,Z1,Ω1) and A2 = (X2,Y2,Z2,Ω2) are Input Output
Equivalent (IOE) if X1 = X2 implies that Y1 = Y2. This type of equivalence does not
take numerical accuracy and stability into account. Another, tighter form of algo-
rithm equivalence is Π-equivalence [For84] (formally defined in section 2.3). In
this type of equivalence the algorithms must first be IOE, but in addition both
algorithms must perform the exact same set of computations, possibly in a different
order. This type of equivalence may also be called computational equivalence
because the exact same set of computations is performed. Consequently X1 = X2
implies Y1 = Y2 and that Ω2 is a permutation of Ω1. Computationally equivalent
algorithms will have the same numerical roundoff and stability properties. If
programs P1 and P2 both generate the same algorithm A, then P1 and P2 are said
to be operationally equivalent or Ω-equivalent. Ω-equivalent programs cause the
computer to execute the exact same ordered set of computations, Ω. The notion of
Ω-equivalence between programs will allow a given algorithm to be advantageously
expressed by alternate programs.

More general program and algorithm definitions rely on the concept of the

index set. The index set, denoted I^n, is a set of vectors in Z^n where one computa-

tion or expression is associated with each vector. In an n-level nested loop, the

index vector is a vector where each element is one of the loop indices. In these


cases the index set will be the set of vectors containing all combinations of indices

generated by the loop. For example, in the program below,

for (i=0:9)
  for (j=0:9)
    for (k=0:9)
      y[i,j] = x[j,i] + 5
The index vector is

I = [i j k]^T

and the cardinality of the index set is 1000 since 1000 combinations of indices will

be generated as the loop executes.
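The cardinality of this index set can be checked directly; a minimal Python sketch (the enumeration below is illustrative, not part of the original program):

```python
from itertools import product

# Index set of the triple loop nest: all (i, j, k) with 0 <= i, j, k <= 9,
# generated in the loop's own (lexicographical) order.
index_set = [(i, j, k) for i, j, k in product(range(10), repeat=3)]

print(len(index_set))   # cardinality: 10 * 10 * 10 = 1000
```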

Given a serial program P1 => Aser, an Ω-equivalent serial program P2 may be
written that explicitly assigns each computation to a point in the index set. This is

done by rewriting the arbitrary program as a nested loop. For example,

P1 : x = x - 1
     y = x - 3
     y = x + y^2
can be written in terms of a one dimensional index set as

P2 : for i=0:2
       if (i=0) x = x - 1
       if (i=1) y = x - 3
       if (i=2) y = x + y^2
which may be viewed graphically as shown in figure 2.1.

Figure 2.1: Graphical View of Simple Algorithm with a 1-D Index Set

Here the index vector is I ∈ Z^1, and I = i. The index set is I^n = {0,1,2}. If the

compiler is assumed to make the data independent decisions in program P2 at


compile time and the loop is expanded into inline code, both programs will gener-

ate identical object codes and be Ω-equivalent. It is easy to see that any finite

serial program may be written in a form that expressly assigns each expression to

a point in the index set. The original program is converted to another form where

each expression has an explicit index vector. This alternate form may be used to

generate an algorithm where each computation has an explicit index vector.
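The P1 to P2 rewrite above can be mimicked in executable form; a Python sketch, where the initial values of x and y are illustrative assumptions (they are not given in the text):

```python
# A straight-line program rewritten over an explicit 1-D index set,
# mirroring the P1 -> P2 rewrite: each statement is guarded by its index.
x, y = 10, 0        # illustrative initial values
trace = []          # records the index vectors in execution order
for i in range(3):  # index set I^n = {0, 1, 2}
    if i == 0: x = x - 1
    if i == 1: y = x - 3
    if i == 2: y = x + y ** 2
    trace.append(i)
```

With these starting values the three indexed statements produce x = 9 and y = 45, exactly as the un-indexed program would.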

Since any program may be written as an Ω-equivalent program that explicitly

assigns an index vector to each expression, we shall only consider nested loop

programs where the index assignments are explicit. The algorithms generated by

these programs will then also have an explicit index set. Formal index set based

definitions for programs and algorithms are presented below.

Definition 2.1: A program is a four tuple

P = (ψP, I^n, E, D^P) where

ψP is the set of all program variables in the program

I^n is the index set, where I^n ⊂ Z^n

E is the set of expressions, i.e. a mapping of I^n onto E

D^P is the set of dependencies in the program

Definition 2.2: An algorithm is a six tuple

A = (X, Y, Z, I^n, C, D^A) where

X is the input data set

Y is the output data set

Z is the intermediate data set

I^n is the index set, where I^n ⊂ Z^n

C is the set of computations, i.e. a mapping of I^n onto C

D^A is the set of dependencies in the algorithm.


The concept of dependencies in programs and algorithms will be developed further

in section 2.2.

The first distinction to make in the above definitions concerns the way variables

are defined. In a program, ψP denotes the set of all the program variables. That is,
the elements of ψP are data structures. If the program variable y appeared in an
expression of a program P, then y ∈ ψP. The variable y may be used in many

different expressions and in many computations in the resulting algorithm. In an

algorithm on the other hand, each variable has a unique value. If a computation

computes a result, a new variable is generated in the algorithm. The algorithm

definition is concerned with each separate value, not with the variable names. This

distinction between program variables and algorithm variables was the main rea-

son for presenting the program definition.

Two important mappings, E and C, appear in the above definitions. In a pro-

gram P => A, E(I) is the expression in P that causes the computation C(I) to occur in
A at point I ∈ I^n. The computation function may also be viewed as a function of

both the index vector and the expression that causes C(I) to be executed. In this

context, C is written as a function of two variables, C(I,E(I)). The forms C(I) and

C(I,E(I)) may be used interchangeably depending on context.

A serial nested loop program is a convenient way to represent a program de-

fined over an index set. If this is the case, the index set of the general model may

be connected into a serial path that corresponds to the serial loop's execution

ordering. For example, consider program P3 shown below.

P3 : for (i = 0:4)
       for (j = 0:4)
         x[i,j] = x[i,j] + 5
This program will have the index set and execution ordering as depicted in figure

2.2. In this figure, the nodes represent index vectors and the arrows show the

serial ordering.


The execution ordering provides a means for comparing vectors in the index
set. One vector in the index set may be said to be greater than another if it
occurred after the first in the execution ordering. A standard execution ordering
that is natural when considering nested loop programs is the lexicographical order-

ing defined below.





Figure 2.2: The Lexicographical Execution Ordering of Program P3

Definition 2.3: A lexicographical ordering over Z^n is the ordering s.t. if x1, x2 ∈
Z^n, then x1 is greater than x2 in a lexicographical sense if, for the minimum index
i such that x1[i] ≠ x2[i], x1[i] > x2[i]. One writes x1 >L x2.
Nested loops where all indices increment in positive directions generate index
vectors that are lexicographically ordered. For example, a special case of the lexi-
cographical ordering was depicted in figure 2.2. Whether the index set is lexi-
cographically ordered or not, the execution ordering may be used to compare
vectors. If I1 appears in the execution of A before I2, then we write I1 <E I2. In
positively incrementing nested loop programs, the execution ordering and the lexi-
cographical ordering are the same.
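Definition 2.3 translates directly into code; a small sketch of the lexicographical comparison (the function name is chosen here for illustration):

```python
def lex_greater(x1, x2):
    """True if x1 > x2 in the lexicographical sense of Definition 2.3."""
    for a, b in zip(x1, x2):
        if a != b:          # minimum index where the vectors differ
            return a > b
    return False            # equal vectors: neither is greater

# Index vectors of a positively incrementing nested loop appear in
# lexicographically increasing order:
print(lex_greater((0, 1, 0), (0, 0, 9)))   # True
```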
A concept related closely to the index set and the execution ordering is that of
a trace [Len87] of an algorithm. If the index set is drawn as a grid in Z^n, and a


path is drawn from node to node as the serial algorithm executes (as in figure 2.2),

the entire path from the first node to the last is the trace of the algorithm. The

trace then is an ordered set of index vectors that are ordered by the serial execu-

tion ordering of the program. The trace ties the serial program description together

with the general program and algorithm models. The trace of the algorithm will be

used in the design of parallel algorithms in chapters 3, 4, and 5.
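The trace of P3 can be produced by enumerating its index vectors in the serial execution ordering; a sketch (assuming the 5x5 bounds of P3):

```python
from itertools import product

# Trace of program P3: the ordered set of index vectors (i, j) visited
# by the serial (lexicographical) execution of the 5x5 loop nest.
trace = list(product(range(5), range(5)))

print(trace[0], trace[-1], len(trace))   # (0, 0) (4, 4) 25
```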

2.2 Program and Algorithm Dependence Structures

The dependence structure, DA, of an algorithm determines the amount of paral-

lelism that can be used during execution. If an algorithm has no dependencies, it

can be executed completely in parallel in a single time step. On the other hand, if

each computation requires input from the output of the previous computation in

the execution ordering, then no parallelism at all may be used. In a program,

dependencies arise for each program variable, y. If y is referenced in the expres-

sion E(I1) and again in E(I2) where I1 <E I2, and if I1 and I2 are the closest such
vectors, then E(I2) depends on E(I1) and a program dependence exists from I1 to
I2 in D^P. In an algorithm, if the variable v is generated by computation C(I1),
written C(I1) -> v, and if v is used by computation C(I2), written v -> C(I2), then an
algorithm dependence exists from I1 to I2 in D^A. The dependence structure of the

algorithm is a subset of its generating program's dependence structure. Because of

this, the algorithm's dependence graph may be deduced from the program.
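The closest-reference rule for program dependencies can be sketched as follows; the helper function name is hypothetical, not from the text:

```python
# Program dependence sketch: given one program variable's references
# (index vectors in execution order), difference each pair of CLOSEST
# successive references to obtain the dependence vectors d = I2 - I1.
def self_dependencies(refs):
    """refs: index vectors referencing one variable, in execution order."""
    return [tuple(b - a for a, b in zip(i1, i2))
            for i1, i2 in zip(refs, refs[1:])]

# A scalar variable referenced at every point of a 3-iteration loop:
print(self_dependencies([(0,), (1,), (2,)]))   # [(1,), (1,)]
```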

The dependence structure determines what orderings are admissible. It is used

to determine which algorithm transformations will yield valid parallel realizations.

Also, the dependence structure will dictate the possible array processor topologies

that can be used to implement the algorithm since the dependencies in the algo-

rithm will transform to data paths in the processor array. The following definition


will give a method to determine the dependence structure for programs and algorithms.

Definition 2.4: The Program Dependence Function, D, is a mapping from the in-
dex set and the set of all program variables in the program P to the set of depend-
encies, D^P. Individually, D maps a program variable y that appears in an expres-
sion E(I) to a dependence vector(s) d = D(y,I) where d ∈ Z^n. The program de-
pendence vectors are grouped into two types as follows:

1) Input Dependencies -- if y ∈ ψP and E(I) is the first expression in
the execution ordering that references y, then d = D(y,I) = 0.

2) Self Dependencies -- if y ∈ ψP and y appears in both E(I1) and
E(I2) for some I1, I2 ∈ I^n, where I1 <E I2, and if I1 and I2 are the closest
such index vectors in the execution ordering, then d = D(y,I1) = I2 - I1.

Definition 2.5: The Algorithm Dependence Function, D, is a mapping from the
index set and the set of all variables in the algorithm A to the set of dependencies,
D^A. Individually, the function maps a variable v from a computational node C(I)
to a dependence vector(s) d = D(v,I) where d ∈ Z^n. The algorithm dependence
vectors are grouped into three types as follows:

1) Input Dependencies -- if v ∈ X and v -> C(I), then d = D(v,I) = 0.

2) Absolute Self Dependencies -- if v ∈ Y ∪ Z and C(I1) -> v and v is
not used by any computation C(I2), I2 ∈ I^n, then d = D(v,I1) = 0.

3) Internal Dependencies -- if v ∈ Z and C(I1) -> v and v -> C(I2)
for some I1, I2 ∈ I^n, then d = D(v,I1) = I2 - I1.
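As a concrete check of Definition 2.4, the self dependencies of a 3x3 matrix multiply loop nest (the localized example used later in this chapter) can be computed mechanically; a Python sketch:

```python
from itertools import product
from collections import defaultdict

# Self dependencies (Definition 2.4) for the matrix multiply program
#   for i: for j: for k: z[i,j] = z[i,j] + x[i,k] y[k,j]
# Each program variable's references are collected in execution order,
# then closest successive references are differenced.
refs = defaultdict(list)
for i, j, k in product(range(3), repeat=3):   # lexicographic ordering
    I = (i, j, k)
    refs[('z', i, j)].append(I)
    refs[('x', i, k)].append(I)
    refs[('y', k, j)].append(I)

deps = set()
for points in refs.values():
    for I1, I2 in zip(points, points[1:]):
        deps.add(tuple(b - a for a, b in zip(I1, I2)))

print(sorted(deps))   # [(0, 0, 1), (0, 1, 0), (1, 0, 0)]
```

One unit vector arises per variable (z travels in k, x in j, y in i), which is why the dependence matrix of the localized matrix multiply is the identity.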
Other useful functions are the variable functions.

Definition 2.6: The Program Variable Function, ψP, is a mapping from the index
set of a program, P, to the set of all program variables ψP. Individually, ψP(I) is
the set of variables referenced in expression E(I).

Definition 2.7: The Algorithm Variable Function, υ, is a mapping from the index
set of an algorithm, A, to the set of all variables V = X ∪ Y ∪ Z. Individually, υ(I)
is the set of input and generated variables in computation C(I).

Some of the above definitions may be used to generate dependence graphs for

programs and algorithms. For the program, apply ψP: I^n -> ψP and then D: (ψP, I^n) ->
D^P. For the algorithm, apply υ: I^n -> V and then D: (υ, I^n) -> D^A. Because of the

way dependence vectors were defined, they will originate and terminate at points in

the index set. The index set is the set of nodes and the program and algorithm

dependence vectors may be viewed as arcs in the program and algorithm graphs

respectively. In the algorithm graph, each node represents a computation, C(I). In

the program graph, each node represents an expression, E(I). If a graph G is

defined in its usual manner as G = (N,Ar), where N is a set of nodes and Ar is a set
of arcs, then the dependence graph of P may be written Dg = (I^n, D^P) and the
dependence graph of A may be written Dg = (I^n, D^A).
So we see the program and algorithm definitions may be written in shorthand
as the set of all program or algorithm variables, the set of expressions or computa-
tions, and the dependence graph of the program or algorithm; i.e. P = (ψP, E, Dg)
and A = (V, C, Dg). These definitions imply knowledge of the variables' distribution
throughout the index set via the mappings ψP and υ. The program may be viewed
as a graph where the variables ψP(I) appear in expression E(I) and move by an
offset of D(ψP(I), I) to appear in the next expression. The program graph consists
of a set of expression nodes with a directed dependence arc for each program
variable in an expression, drawn from one node that references a given program
variable to the next node in the execution ordering that references that same program
variable. The algorithm may be viewed as a graph where computation C(I) oper-
ates on the variables υ(I) at point I in the graph. When the results are computed,
each generated variable vg ∈ υ(I) will move to node I + D(vg, I). If D(vg, I) has m


outcomes, then vg will be broadcast to nodes {I + d1, I + d2, ..., I + dm}. Figure 2.3
illustrates the concept for a simple algorithm without broadcasts.

Figure 2.3: Simple Two Statement Algorithm Graph with Dependence and
Variable Usage Information Displayed

The main difference between the program and algorithm models is the way
variables and dependencies are defined. The program variables are standard vari-

ables that can take on many values while the algorithm variables can only take on

a single value. Also, program dependence arcs are drawn between nodes that refer-

ence the same program variables while algorithm dependencies are only drawn
between nodes that generate and use a given algorithm variable. Because of this,
the set of algorithm dependencies will be a subset of the set of program depend-
encies and may be easily derived from the program.
In an algorithm, the dependence arc may be viewed as a path from the node
that generated that variable v to the node that used it. It may also be viewed as a
precedence relation. This view leads to the concept of the free ordering of the

algorithm. In this ordering all the elements of the input data set are viewed as

input nodes to the algorithm and all the output data elements are viewed as output

nodes. The input nodes have arcs going to all nodes in the index set that use the

input data. The generated variables are passed to other computational nodes over

dependence arcs or to output nodes. The algorithm is started by letting all the


inputs travel along all their arcs simultaneously. As soon as a given computation

C(I) has all of its inputs present on its input arcs, it performs the computation and

passes its generated variables to other nodes along their respective dependence

arcs. In this way, the algorithm is allowed to execute to completion. The resulting

parallel ordering is called the free ordering.
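The free ordering can be simulated by letting each computation fire one step after its latest predecessor; a sketch on illustrative graphs (not those of the figures):

```python
# Free ordering sketch: a node fires as soon as all of its inputs are
# present, i.e. one step after its latest predecessor. The largest level
# is the number of parallel time steps the free ordering needs.
def free_ordering_steps(deps):
    """deps: {node: set of predecessor nodes}. Returns time steps needed."""
    level = {}
    def lvl(n):
        if n not in level:
            level[n] = 1 + max((lvl(p) for p in deps[n]), default=0)
        return level[n]
    return max(lvl(n) for n in deps)

# A four-node dependence chain needs four time steps:
chain = {0: set(), 1: {0}, 2: {1}, 3: {2}}
print(free_ordering_steps(chain))   # 4
```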

The free ordering of an algorithm dictates the minimum time hardware struc-

ture to execute the algorithm. A processor is provided for each node and a data

path is provided for each input, output and dependence arc. This solution requires

too much hardware and also requires data broadcasts to be made which is undesir-

able when considering locally connected VLSI implementations. Also, a given

processor would only execute a single computation during the entire algorithm,

making the PE utilization efficiency very low. For these reasons the free ordered

array algorithm is rarely directly implemented. Rather the algorithm's dependence

graph is localized and mapped to a more practical locally connected structure that

attains a higher PE utilization efficiency.

If an algorithm is to execute on a processor array with a topology as shown in
figure 1a, it is required that ‖d‖∞ ≤ 1 for all d ∈ D^A. This is required so

the transformation discussed in section 2.3 will map the dependence arcs to local

communication paths in the array processor. The net effect is that a constraint is

put on the length of admissible dependence and I/O arcs. Figure 2.4 shows two I/O

equivalent algorithms. The graph of figure 2.4a reflects the free ordering of the

algorithm. Figure 2.4b shows the algorithm with additional dependencies imposed

to enforce local communications.

Several researchers [Lei81], [Mol82], [Kun88] have considered methods to local-

ize dependence graphs. One method [Mol82] involves converting the algorithm to

an I/O equivalent algorithm where in the new algorithm identity operations are

inserted at the computational nodes. By inserting identity operations, the computer


does not have to perform any extra computations, but new variables will be gener-
ated in the algorithm's mathematical representation. In the original algorithm, if
v -> C(I1) and v -> C(I2) for some selected I1 and I2, then in the new algorithm
v -> C(I1) and C(I1) -> v' where v' = v, and then v' -> C(I2), so that a dependence
d = I2 - I1 is generated. This method imposes dependencies that are not in the
original algorithm and thus imposes more restrictions on the possible execution
orderings for the algorithm, and may therefore slow down its parallel execution.
For example, the free ordering of figure 2.4a requires four time steps while that of
figure 2.4b requires seven. Care must be taken when imposing dependencies.
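The identity-operation method can be sketched as replacing each broadcast by a chain of unit-length forwarding arcs; the node labels below are illustrative:

```python
# Localization sketch (after Moldovan's identity-operation method): a value
# broadcast from one node to several others is instead forwarded node to
# node by identity operations, so every dependence arc has unit length.
def localize_broadcast(source, targets):
    """Return unit-length dependence arcs replacing one broadcast."""
    arcs, prev = [], source
    for t in sorted(targets):
        arcs.append((prev, t))   # identity operation forwards the value
        prev = t
    return arcs

print(localize_broadcast(0, [1, 2, 3]))   # [(0, 1), (1, 2), (2, 3)]
```

The chain serializes what the broadcast did in one step, which is exactly the slowdown the text warns about.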

Figure 2.4: Nonlocalized and Localized Algorithm Dependence Graphs.
(a) Nonlocalized Algorithm (b) Localized Algorithm

The algorithm may be localized by converting it to an alternate algorithm by
setting D^A = D^P, where D^P is the dependence structure of a program that generates
the algorithm. This is equivalent to Moldovan's method of localizing an algorithm
graph as described above. For example, the algorithm generated by program P4

P4 : for (i=0:2)
       for (j=0:2)
         for (k=0:2)
           z[i,j] = z[i,j] + x[i,k] y[k,j]




would only have dependence arcs pointing in the k-direction. Since x and y are
input variables, each node would have two non-local input arcs drawn from the

input set. The program dependence structure, on the other hand, has dependence
arcs pointing in all three axial directions and is localized as depicted in figure 2.5.

The matrix multiply algorithm is localized by imposing the dependence structure of

the matrix multiply program on the matrix multiply algorithm.

Some matrix definitions will be useful in describing the dependencies and the

imposed dependencies of the serial program.

Definition 2.7: A Variable Surface, Iy, for a variable y ∈ ψP in a program P is
the set

Iy = { I | I ∈ I^n, y -> E(I) or E(I) -> y }

Definition 2.8: The Self Dependence Matrix, Dy, for a variable y in a program P
is a matrix whose columns are the unique values of the self dependence
vectors of y.

Definition 2.9: The Dependence Matrix, D, for an algorithm A generated by a
program P is a matrix whose columns are the unique values of all vectors in D^P.

Note that in figure 2.5 the dependence surfaces as defined above appear as straight

lines. Also in figure 2.5, the self dependence vectors are the vectors tracing out
the straight lines of the dependence surfaces. The self dependence matrices are

just the dependence vectors in this case, and the dependence matrix of the algo-

rithm is the identity matrix as shown below.

dx = [0 1 0]^T    dy = [1 0 0]^T    dz = [0 0 1]^T

D = [1 0 0]
    [0 1 0]
    [0 0 1]

Figure 2.5: Localized Matrix Multiply Dependence Structure

Note that the order of the columns of D is not important, so they can be arranged
in the most convenient manner. In some programs the self dependence matrices will

have more than one vector in them. These arise in programs with a non-uniform

dependence structure, or in programs with dependence surfaces that are planar or

higher dimensional. Nonuniform dependencies arise in many NLA algorithms

such as the Cholesky decomposition. Higher dimensional dependence surfaces

arise in algorithms with higher complexity, such as the two dimensional convolution
algorithm.


2.3 Program and Algorithm Equivalences

Algorithms and programs are modeled to provide a mathematical framework

by which a known serial algorithm may be transformed into an equivalent algo-


rithm that is well suited for fast execution on a parallel architecture. First the

program that generates the algorithm is transformed to an alternate equivalent

form where each expression has an associated unique index vector. Once the pro-

gram is in this form the index set based algorithm mathematical model may be

produced. A parallel implementation of the algorithm is produced by transforming
the original algorithm model to an alternate one with explicit parallelism. A clear

concept of algorithm equivalence is needed in order to find ways to transform from

one algorithm to another. The definition below comes from [For84].

Definition 2.10: Given two algorithms A1 = (X1,Y1,Z1,I^n,C1,D1^A) and A2 =
(X2,Y2,Z2,J^n,C2,D2^A), A1 and A2 are Π-equivalent, written A1 ~ A2, if and
only if

1) A1 is input output equivalent to A2

2) ∃ transformations T and T* s.t. T: I^n -> J^n and
T*: proj(D1^A) -> proj(D2^A)

3) T is a bijection

4) T* is a surjection

5) C1(I) = C2(TI) = C2(J)

6) d2 = T*d1 and J = TI
From the definition above, Π-equivalence between A1 and A2 implies that A1
and A2 are IOE and that there is an invertible map from I^n to J^n where the compu-
tation C1(I) is mapped to computation C2(TI), and C1(I) and C2(TI) are the same
computation. It can be shown [For84] that for the above definition to hold, it is

required that T*(d)=T(I)-T(I-d). For the special case of linear transformations, the
above reduces to T* = T, so that the dependencies and the index set are trans-
formed through the same transformation. Linear transformations will receive most

attention in the dissertation. Nonlinear transformations will be discussed when ad-

dressing the partitioning problem in chapter 4.
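The reduction T* = T for linear transformations can be verified numerically, since T(I) - T(I - d) = Td for any linear T; a sketch with an illustrative 2x2 transformation:

```python
# For a linear transformation T, T(I) - T(I - d) = Td, so the dependence
# transformation T* coincides with T itself. A check with an illustrative
# 2x2 unimodular matrix:
def apply(T, v):
    """Matrix-vector product over integer tuples."""
    return tuple(sum(r * x for r, x in zip(row, v)) for row in T)

T = [[1, 1],
     [0, 1]]
I, d = (3, 2), (1, 0)
shifted = tuple(a - b for a, b in zip(I, d))          # I - d
lhs = tuple(a - b for a, b in zip(apply(T, I), apply(T, shifted)))
print(lhs == apply(T, d))   # True
```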


The Π-equivalence transformation can be viewed as simply a reindexing of a

given algorithm. Suppose an algorithm's graph was drawn as those of figures 2.4

and 2.5. If the index and dependence vectors were transformed through a linear

transformation, T, then the algorithm graph would be transformed so that the

nodes and arcs had different labels, but the structure of the graph would remain

the same. That is, the same arcs connect the same nodes so that the free orderings

of the two algorithms are identical. The two algorithms are isomorphic since the

transformation linking the two algorithms is a bijection. The transformed index set

is now viewed in a parallel algorithm coordinate system. Some of the axes in the
transformed index set are viewed as temporal axes and the others as processor

coordinate axis. This interpretation imposes an ordering on the transformed algo-

rithm. The only problem is to make sure that the parallel execution ordering pre-

serves the precedence relations of the algorithm's free ordering.

The transformation T is a map from the original algorithm's index coordinate system to a coordinate system consisting of time and processor dimensions. Therefore, T may be viewed in terms of time and processor sub-transformations. The time transform π ∈ Z^(q×n) maps an index vector I ∈ I^n to a temporal vector t = πI. The space transform, S ∈ Z^(m×n), where m = n-q, maps I to a processor coordinate vector P = SI. The time transform schedules a computation C(I) to occur at temporal vector t in the parallel algorithm, and the space transform tells at which processor, P, in the array the computation C(I) will occur. The temporal vectors are ordered lexicographically.

The transformation T is formed by augmenting the time transform with the space transform. Given π ∈ Z^(q×n), S ∈ Z^(m×n) and the requirement q+m = n, if S is augmented to π, an n×n matrix results. Both π and S are required to have full rank and the rows of π are required to be linearly independent of the rows of S, so that the resulting augmented matrix will be invertible. These restrictions cause the resulting transformation,

        [ π ]
    T = [---]
        [ S ]

to be a bijection as required in the definition of π-equivalence. The precedence relations of the original algorithm will be preserved if πd >L 0 for every d ∈ DA. This will cause computations that depend on results from a previous computation to be executed at a later time in the parallel algorithm. With these conditions holding, T will map A to a π-equivalent algorithm. The transformation both schedules and allocates resources for the computation C(I) in the array as J = TI, and there exists a T^(-1) such that I = T^(-1)J.
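The two conditions on T stated above (invertibility, and πd lexicographically positive for each dependence vector) lend themselves to a direct numerical check. The following sketch is illustrative only: the transformation and dependence vectors below are assumed examples, not ones fixed by the text.

```python
# Sketch: checking that a candidate transformation T = [pi; S] is
# invertible and that the time transform preserves precedence
# (pi*d lexicographically positive for every dependence vector d).
# T and the dependence vectors here are illustrative assumptions.

def det3(M):
    """Determinant of a 3x3 integer matrix by cofactor expansion."""
    a, b, c = M[0]
    d, e, f = M[1]
    g, h, i = M[2]
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def lex_positive(v):
    """True if the first nonzero entry of v is positive."""
    for x in v:
        if x != 0:
            return x > 0
    return False  # the zero vector is not lexicographically positive

pi = [[1, 1, 1]]                   # time transform (q = 1)
S = [[1, 0, 0], [0, 1, 0]]         # space transform (m = 2)
T = pi + S                         # augmented n x n transformation

deps = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]  # assumed dependence vectors

assert det3(T) != 0                # T is invertible, hence a bijection
for d in deps:
    pi_d = [sum(p * x for p, x in zip(row, d)) for row in pi]
    assert lex_positive(pi_d)      # precedence relations preserved
```

With q = 1 the test degenerates to requiring that the scalar πd be positive for every dependence vector, which is the usual one-clock systolic scheduling condition.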


Uniform recurrent algorithms (URAs) are an important class of algorithms that

are especially well suited for mapping to systolic arrays. URAs have the property

that the same set of dependence vectors emanate from each node in the index set.

The idea of a URA is depicted in figure 3.1a where in contrast, figure 3.1b shows

an algorithm that is not a URA. The uniformity of the dependence structure will be reflected in the array domain by regular communication networks and simple control strategies. Some examples of URAs are matrix-matrix multiplies, convolutions, correlations, DFTs and partial differential equation solvers. URAs are normally computationally intensive and often require real time execution, making them prime candidates for array implementation.

Figure 3.1: Uniform and Nonuniform Dependence Graphs. (a) URA (b) Not a URA

The class of URAs is the simplest to analyze in terms of the program and algorithm models presented in the last chapter. Most programs that generate URAs may be written as an n-level nested loop with a single expression in the inner loop.

Several tools and methodologies for algorithmically specified array design and

analysis will be developed for URAs. These tools and methods will be extended to

broader classes of algorithms in chapter 5.

In the first part of this chapter URAs and programs that generate URAs will be

studied. In section 3.1 necessary and sufficient conditions will be given for a class

of programs to generate URAs. Also a subclass of programs will be studied that

will generate URAs via affine and/or linear indexing transformations. In section

3.2 streamlined matrix based program and algorithm representations will be devel-

oped for URAs that contain the same information as the general definitions. Using

the matrix based representations, simple closed form expressions may be derived

for the variable and dependence functions, and a recursive procedure may be used

to generate the index set [Dow88d] [Dow89]. These constructs will be useful for

computer aided URA specified systolic array design [Dow88a].

In sections 3.3 and 3.4 the transformation domains of URAs will be analyzed.

In section 3.3 matrix methods will be introduced to determine the data velocities

and distributions in the processor array. Also, a position function will be derived

that tells a datum's position at a given time (temporal vector) in the array algorithm. Other temporal vector based functions will be derived that reveal data collision information and PE memory contents. In section 3.4 a set of equations will be derived whose solution will provide a transformation that will yield an array algorithm with desired data flow properties. Examples will be given to illustrate the synthesis procedure.


3.1 Uniform Recurrence Programs and Algorithms

The central focus of this section is the URA and its generating program. The

URA is formally defined below.


Definition 3.1: A Uniform Recurrent Algorithm (URA) is an algorithm A = (X,Y,Z,C,I^n,DA) where each dependence vector d ∈ DA is valid at each point I ∈ I^n.

A convenient form of a program to generate a URA is a Single Expression

Nested Loop Program (SENLP). The SENLP is defined below.

Definition 3.2: A Single Expression Nested Loop Program (SENLP) is a program P = (V, I^P, D^P, E) with the following conditions

1) The index set is lexicographically ordered and is generated by an n-level nested loop.

2) E(I) = E for every I ∈ I^P

3) If y ∈ V then y ∈ R^m and m ≤ n. The variables are indexed in the expression using indexing transformations Fy: I -> viy, viy ∈ Z^m. This causes y[viy] to be an individual element of y. In the expression the variable y is referenced as y[Fy(I)].
The most important point of the above definition is the way the variables are refer-

enced. A single expression is used that contains transformations that map from a

point in the index set to the variable indices of a particular program variable that is

referenced in the expression at that index point. These transformations allow a

single expression to cause a unique computation to occur at each point in the index

set of the algorithm generated by the program.

The indexing transformations provide a mathematical way to find program dependencies. Suppose P is a SENLP with y ∈ V, y ∈ R^m, and viy ∈ Z^m is the indexing vector for y. Then a program dependence exists from I1 to I2 if y[viy] is referenced in the expressions at I1 and I2, where I1 < I2 in the lexicographical sense. If this is the case, then D(y,I1) = dy = I2 - I1 and both I1 and I2 must satisfy Fy(I) = viy. So if I1 is a solution to Fy(I) = viy and D(y,I1) = I2 - I1, then I2 must also satisfy the equation Fy(I) = viy. If the solution set to Fy(I) = viy is known, the dependence structure may be found by ordering the solution vectors lexicographically and differencing adjacent vectors in the list. This provides a useful way to set up the dependence graph for an algorithm generated by a SENLP. If the program is known a-priori to generate a URA, then the equation Fy(I) = viy needs to be analyzed only once and the dependencies for all elements of y will be known at all points of the index set. The following lemma will provide a way to know a-priori whether a SENLP will generate a URA. A theorem, whose proof is based on the lemma, will show that if the variable indexing transformations are affine or linear, the SENLP will always generate a URA.

Lemma 3.1: Necessary and sufficient conditions for a SENLP, P, to generate a URA are that each variable indexing transformation, Fy, in E of P satisfies

Fy(I) = Fy(I+dy) for every I ∈ I^P and each dy ∈ D^P.

(Necessary) Assume P=>A where P is a SENLP and A is a URA. Let y ∈ V and let Fy: I -> viy be the indexing transformation for y in E. If the variable y[viy] is referenced at points I1 and I2 of the index set, then viy = Fy(I1) and viy = Fy(I2). Therefore Fy(I1) = Fy(I2). Furthermore, if I1 <L I2 then for some I2 in the solution set of viy = Fy(I), I2 = I1+dy where dy = D(y,I1). The above two conditions lead to the relation

Fy(I1) = Fy(I1+dy)

but since P=>A where A is a URA, we know that dy holds at all points in the index set, so that the above equation must hold for each dependence vector of y at all points of the index set, i.e.

Fy(I) = Fy(I+dy) for every I ∈ I^P and each dy ∈ D^P.
(Sufficient) Suppose P is a SENLP where for each indexing transformation Fy in E, the relation Fy(I) = Fy(I+dy) holds for every I ∈ I^P and each dy ∈ D^P. If this is the case, the same variable y[viy] is referenced at points I and I+dy for each I ∈ I^P. Therefore dy is a valid dependence vector at each point in the index set. Since the condition also holds for each dy ∈ D^P, all the dependencies in the program are valid at each point of the index set. Therefore if the algorithm A is generated from P and the algorithm dependencies are derived by setting DA = D^P, then A will be a URA.
Theorem 3.1: If P is a SENLP where each Fy in E is an affine transformation, then P=>A where A is a URA.

Proof: Suppose P has only affine indexing transformations. Then if dy is a dependence vector for the variable y ∈ V at point I1 ∈ I^P, the relation

viy = Fy(I1) = Fy(I1+dy)

must hold. Because Fy is an affine transformation, it may be written as the sum of a linear transformation and a constant vector, i.e.

Fy(I1+dy) = Fy^L(I1+dy) + vic

where vic ∈ Z^m has the same dimensions as viy. Because Fy^L is linear, the above may also be written

Fy(I1+dy) = Fy^L(I1) + Fy^L(dy) + vic

but since viy = Fy(I1) = Fy(I1+dy), we can replace the left hand side of the above equation with Fy(I1) = Fy^L(I1) + vic to obtain

Fy^L(I1) + vic = Fy^L(I1) + Fy^L(dy) + vic

so that it is clear that Fy^L(dy) = 0. That is, either dy = 0 or dy ∈ N(Fy^L). Therefore at any arbitrary point I ∈ I^P

Fy(I+dy) = Fy^L(I) + Fy^L(dy) + vic = Fy^L(I) + vic = Fy(I)

so that

Fy(I) = Fy(I+dy) for every I ∈ I^P and each dy ∈ D^P.

Since the program will satisfy the necessary and sufficient conditions to generate a URA, we know that P=>A where A is a URA.


The above theorem identifies a class of programs that will be known a-priori to

generate URAs. If a SENLP is written with only affine transformations in E it will

be known to generate a URA. Since a linear transformation is a special case of

affine transformations where vic=0, the above theorem also holds for algorithms

with linear indexing transforms. Programs of this form are common in DSP and

NLA applications.

Another important point brought out in the proof of the above theorem concerns the program's dependence structure. If a SENLP is written with only affine indexing transformations of the form Fy(I) = Fy^L(I) + vic, where Fy^L(I) is a linear transformation and vic is a constant vector, then either dy = 0 or dy ∈ N(Fy^L). This gives an easy way to find the dependence vectors by inspection of the indexing transformation, as will be discussed in the next section. Also, it shows that the subsets of the index set that involve computations that use a given program variable will be (n-m)-dimensional surfaces in the index set. These surfaces of variables will transform to paths in the array algorithm that a given variable will travel. If N(Fy^L) is one-dimensional, the variable y[viy] will follow a straight line through the index set as was exemplified in figure 2.5.

3.2 Matrix Based URP and URA Representation and Analysis

We next derive matrix representations for SENLPs and URAs. Let P be a SENLP with only linear indexing transformations. If Fy: I -> viy, where viy ∈ Z^m and I ∈ Z^n, then Fy may be viewed as a linear transformation Fy: Z^n -> Z^m. The definition below serves to define the transformation matrix of such a transformation.

Definition 3.1: An indexing matrix Fy ∈ Z^(m×n) for a variable y is a transformation matrix that maps the index vector I ∈ I^n to a variable index vector viy ∈ Z^m. That is, Fy is defined so viy = FyI.


To see how the indexing matrices are used, consider the matrix multiply program

P1 : for (i = 0 : 2)
       for (j = 0 : 2)
         for (k = 0 : 2)
           Zij = Zij + Xik * Ykj

which has an index vector,

    I = [i j k]^T

This program has V = { X, Y, Z }. Each variable is indexed by the variable index vectors viz = [i j]^T, vix = [i k]^T and viy = [k j]^T respectively. The index transformations are given for each variable as,

    viz = [1 0 0] I    vix = [1 0 0] I    viy = [0 0 1] I
          [0 1 0]            [0 0 1]            [0 1 0]

so that the indexing matrices are given by

    Fz = [1 0 0]    Fx = [1 0 0]    Fy = [0 0 1]
         [0 1 0]         [0 0 1]         [0 1 0]

As is clear from the above example, the indexing matrices may be easily found by inspection. In practice, a program's indexing scheme will never be much more complicated than the one above. A translator could be written that would take an expression written in a standard computer language and convert its indexing clauses to indexing matrices, assuming the variables in the original program were indexed by linear combinations of the loop indices. Once the indexing matrices are found, it is a simple task to determine the dependence vectors. Simply find the minimum length lexicographically positive vector satisfying the equation Fy dy = 0 for each transform Fy. In general n - Rank(Fy) such vectors must be found that are linearly independent. This can be done numerically or by using list searching techniques on the index set. Also, it can be done by inspection for most programs. For example, it can be determined by inspection of the indexing matrices that

    dz = [0]    dx = [0]    dy = [1]
         [0]         [1]         [0]
         [1]         [0]         [0]

since these vectors are in the null spaces of their respective indexing matrices.
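The null-space membership claimed above can be checked mechanically. A minimal sketch, using the indexing matrices and dependence vectors of P1 as reconstructed above:

```python
# Sketch: verifying by matrix-vector multiplication that the dependence
# vectors of program P1 lie in the null spaces of the indexing matrices.

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

Fz = [[1, 0, 0], [0, 1, 0]]   # vi_z = [i, j]
Fx = [[1, 0, 0], [0, 0, 1]]   # vi_x = [i, k]
Fy = [[0, 0, 1], [0, 1, 0]]   # vi_y = [k, j]

dz, dx, dy = [0, 0, 1], [0, 1, 0], [1, 0, 0]

# F*d = 0 means the same y[vi_y] is referenced at I and I + d,
# i.e. d is a dependence vector valid everywhere in the index set.
assert matvec(Fz, dz) == [0, 0]
assert matvec(Fx, dx) == [0, 0]
assert matvec(Fy, dy) == [0, 0]
```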
Another example of how the indexing matrices are used is the linear convolution program

P2 : for (i = 0 : N-1)
       for (j = 0 : N-1)
         Zi = Zi + Xj * Yi-j

where it follows that

    Fz = [1 0]    Fx = [0 1]    Fy = [1 -1]

and the dependence vectors are

    dz = [0]    dx = [1]    dy = [1]
         [1]         [0]         [1]

In this example the dependence vector dy is not in a standard unit basis direction.
The indexing matrices may be inserted in the expression directly to give a standard expression format. The expression format is,

    y1[Fy1 I] = f(y1[Fy1 I], y2[Fy2 I], ..., ys[Fys I])

This expression is denoted E and is used at all points in the program's index set.
The loop limits may be represented by two matrices that determine the boundaries of the index set. The matrices allow the loops to range from a constant to another constant, or from a constant plus a linear combination of the previous loop indices to another constant plus a linear combination of the previous loop indices. For example, consider the problem of multiplying two upper triangular square matrices. Since the product is known to be upper triangular also, only the upper triangle of the product needs to be computed. That is, one would write

P3 : for (i = 0 : 2)
       for (j = i : 2)
         for (k = i : j)
           Zij = Zij + Xik * Ykj

Here one limit of the second coordinate of the index vector depends on the first index coordinate, and the limits of the third coordinate depend on the first and second coordinates. Figure 3.2 shows the geometry of the index set generated by the "for" statements of program P3. Note the difference between this figure and figure 2.5, which was for the multiplication of square matrices. The affine transformations defined below serve to represent the loop limits and index set boundaries.

Definition 3.3: The boundary matrices Φ1 and Φ2 ∈ Z^(n×(n+1)) are affine transformations of the index vector that linearly map an augmented index vector

    Î = [I^T | 1]^T

to a point in the index set I^n. The leading n×n submatrices of Φ1 and Φ2 are strictly lower triangular.
The boundary matrices completely characterize the boundaries of the index set.

For example, consider again the upper triangular matrix multiply loop, where the loop limits of

for (i = 0 : 2)
  for (j = i : 2)
    for (k = i : j)
      Zij = Zij + Xik * Ykj

may be written

for ( I = Φ1 Î : Φ2 Î )

with

    Φ1 = [0 0 0 0]    Φ2 = [0 0 0 2]
         [1 0 0 0]         [0 0 0 2]
         [1 0 0 0]         [0 1 0 0]

where Î = [i j k 1]^T.
Using this scheme, either of the square or upper triangular matrix-matrix multiply loops or the convolution loop can be represented as

for ( I = Φ1 Î : Φ2 Î )
  Z[Fz I] = Z[Fz I] + X[Fx I] * Y[Fy I]

by changing the dimension and/or the entries in the matrices Φ1, Φ2, Fx, Fy, and Fz. This provides the foundation for the definition of a subclass of SENLPs.
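The way the boundary matrices drive index set generation can be sketched directly. The Φ1/Φ2 entries below encode the loop bounds i = 0..2, j = i..2, k = i..j assumed for the upper triangular product; the recursion exploits the strictly lower triangular leading submatrices, so each limit depends only on coordinates already chosen.

```python
# Sketch: generating the index set of the upper triangular multiply
# from its boundary matrices, as in Definition 3.3.  Phi1/Phi2 map the
# augmented index vector [i, j, k, 1] to the lower/upper loop limits.

Phi1 = [[0, 0, 0, 0],   # lower limits: i >= 0
        [1, 0, 0, 0],   #               j >= i
        [1, 0, 0, 0]]   #               k >= i
Phi2 = [[0, 0, 0, 2],   # upper limits: i <= 2
        [0, 0, 0, 2],   #               j <= 2
        [0, 1, 0, 0]]   #               k <= j

def gen_index_set(Phi1, Phi2, n):
    """Enumerate the index set coordinate by coordinate."""
    points = []
    def rec(prefix):
        level = len(prefix)
        if level == n:
            points.append(tuple(prefix))
            return
        aug = prefix + [0] * (n - level) + [1]   # later coords unused
        lo = sum(p * x for p, x in zip(Phi1[level], aug))
        hi = sum(p * x for p, x in zip(Phi2[level], aug))
        for v in range(lo, hi + 1):
            rec(prefix + [v])
    rec([])
    return points

pts = gen_index_set(Phi1, Phi2, 3)
assert all(i <= k <= j for (i, j, k) in pts)   # only upper triangle work
assert len(pts) == 10                          # versus 27 for the full cube
```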

Figure 3.2: Upper Triangular Matrix-Matrix Multiply Dependence Graph

Definition 3.4: A linear recurrence program (LRP) involving s variables is a program that may be specified by an (s+3)-tuple consisting of s+2 matrices and an expression as shown below.

P = (Φ1, Φ2, Fy1, Fy2, ..., Fys, E)

Here the boundary and the s indexing matrices are defined as above and E is the standard expression. The nested loop program may be written in matrix form as

for ( I = Φ1 Î : Φ2 Î )
  y1[Fy1 I] = f(y1[Fy1 I], y2[Fy2 I], ..., ys[Fys I])

For LRPs, the above tuple contains the same information as the general program definition. The sets of variables and dependencies in the program are specified by the indexing matrices, the index set is specified by Φ1 and Φ2, and the set of expressions is specified by E. All the information of the LRP is compactly represented by s+2 relatively small matrices and a single expression. The LRP is used to generate a Uniform Linear Recurrence Algorithm (ULRA). The above tuple can also be used to represent a ULRA if it is recognized that C(I,E) occurs at each point of the index set.
The above results may be extended to programs that use affine indexing transformations. Let Fy(I) = Fy^L I + vic, where Fy is an affine indexing transformation, Fy^L is an indexing matrix and vic is a constant variable index vector with the same dimensions as viy. Then the affine transformation of I ∈ I^n may be viewed as a linear transformation Fy: Î -> viy, where Fy may be written as an indexing matrix

    Fy = [Fy^L | vic]

with Fy ∈ Z^(m×(n+1)), Fy^L ∈ Z^(m×n), vic ∈ Z^m and Î = [I^T | 1]^T. Using the affine indexing matrix, another subclass of SENLPs is defined below.
Definition 3.4: An Affine Recurrence Program (ARP) involving s variables is a program that may be specified by an (s+3)-tuple consisting of s+2 affine transformation matrices and an expression as shown below.

P = (Φ1, Φ2, Fy1, Fy2, ..., Fys, E)

Here the boundary and the s affine indexing matrices are defined as above and E is the standard expression. The nested loop program may be written in matrix form as

for ( I = Φ1 Î : Φ2 Î )
  y1[Fy1 Î] = f(y1[Fy1 Î], y2[Fy2 Î], ..., ys[Fys Î])

For ARPs, the above tuple contains the same information as the general program definition. Everything is the same as the linear case in terms of the information that is contained in the above definition, except the way the dependence information is contained in the indexing matrices. Now the dependence structure must be determined by examining the linear submatrix of the indexing matrix. The ARP is used to generate a Uniform Affine Recurrence Algorithm (UARA). The above tuple can also be used to represent a UARA if it is recognized that C(I,E) occurs at each point of the index set.

3.3 Transform Domain Analysis

So far it has been shown how to represent LRPs and ARPs, and thus their generated URAs, as tuples of s+2 matrices and an expression. The next problem is to map the algorithm to a systolic array. Again, we shall first derive the results for the linear case and then extend them to the affine case. As mentioned in section 2.4, an invertible integer transformation matrix T ∈ Z^(n×n) may be applied to the index vector I ∈ Z^n to perform the mapping of computation C(I) to a systolic computation C(J) where J = TI. C(I) and C(J) are identical computations, but where I tells where in the index set C(I) is performed, J tells where and when in the systolic array C(J) is performed. The transformation must be selected so that the dependence vectors project to positive temporal vectors in order for the algorithm's precedence relations to be maintained. In this section we will show how to use the matrix representation along with the transformation to extract data velocity and data distribution [Li85] vectors. Also, time parameterized functions will be derived to yield data position, PE register contents, and data collision information.
Recall from section 2.4 that the algorithm transformation may be broken into a time transform π ∈ Z^(q×n) and a space transform S ∈ Z^(m×n) where m+q = n. The computation C(I) is performed at time t = πI at processor P = SI. The transform T is constructed by augmenting the π rows with the space transform, yielding a non-singular square matrix T ∈ Z^(n×n). The q-dimensional temporal vector may be used to admit multiple clocks with separate data velocities associated with each clocking (time) coordinate. This allows a linear transformation to force data to make multiple passes through the array. The transformation has the form

        [ t ]   [ π ]
    J = [---] = [---] I = TI
        [ P ]   [ S ]

A useful set of transformation matrices are constructed by augmenting the time transform π with each of the indexing matrices Fy. The matrices, called the time-index matrices, are defined below:

Definition 3.5: The time-index matrix for a variable y with index matrix Fy ∈ Z^(m×n), in a LRP whose generated algorithm is scheduled to execute on a systolic array via the time transform π ∈ Z^(q×n), is the invertible matrix formed by combining π and Fy to obtain Ty ∈ Z^(n×n). This matrix is shown below.

         [ π  ]
    Ty = [----]     (time-index matrix)
         [ Fy ]

The above matrix defines another coordinate system. Vectors in the new system are called time-index vectors and are written viy^t, where,

            [ t   ]
    viy^t = [-----]     (time-index vector)
            [ viy ]


Because Ty is required to be invertible (posing an additional restriction on the selection of π), Ty can be viewed as a change of basis transformation (bijection) from the index set to the time-index set. That is,

    Ty I = viy^t   and   I = Ty^(-1) viy^t

Furthermore, the basis may be changed from viy2^t to viy1^t using the cross transformation matrix.

Definition 3.6: The cross transformation, Ty1y2, maps a vector viy2^t to another vector viy1^t. The transformation is composed of Ty1 and Ty2 as

    Ty1y2 = Ty1 Ty2^(-1)

so that

    viy1^t = Ty1y2 viy2^t

since

    Ty1y2 viy2^t = Ty1 (Ty2^(-1) viy2^t) = Ty1 I = viy1^t

The ability to change basis from I to viy^t, or from viy1^t to viy2^t, will be useful in deriving formulae concerning data position, velocity, distributions, and collisions. First consider the position function. That is, given a data element y[viy], find its position P in the processor array at time t. The known information is viy ∈ Z^m and t ∈ Z^q. This information is contained in the vector viy^t. Since P = SI and I = Ty^(-1) viy^t, it follows that P = S Ty^(-1) viy^t, which gives the position of y[viy] at time t. It will prove to be useful to define the following matrix based on the above result.

Definition 3.7: The parameter matrix, Sy ∈ Q^(m×n), for a variable y in a recurrence algorithm transformed by a transformation T is given by Sy = S Ty^(-1).


Definition 3.8: The position function, Py, for a variable y, is a transformation Py: viy^t -> P, where P gives the processor coordinates of the datum y[viy] at time t, and is written

    Py(viy^t) ≜ Sy viy^t = P

Note that Sy ∈ Q^(m×n) and viy^t ∈ Z^n, so that Sy viy^t ∈ Q^m in general.

Thus the position function may map a time-index vector to a non-integer point in the processor space. Since processors are only at the integer points, these cases must be interpreted. In practice it is often the case that Sy ∈ Z^(m×n) ⊂ Q^(m×n), but this is not required. In cases where it takes several clock cycles for a datum to move to the next processor, its position will be interpreted as being between processors; hence the datum has a non-integer position in the processor space. These positions may be interpreted as delay registers that are inserted between processors. Another view is to assume the datum resides in PE local memory when it is not at an integer position in the processor coordinate system. When enough clock cycles have passed to make the position function evaluate to another integer position, the datum is passed along the appropriate data path to the next processor. The position function is not evaluated while the algorithm is executing, but tells how to design or program the hardware.
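As a concrete illustration, the parameter matrix and position function can be computed for the matrix multiply program P1. The schedule below (π = [1 1 1] with S picking out the (i, j) plane, so that C(I) runs at PE (i, j) at time i+j+k) is an assumed example, not one fixed by the text; the inverse of Ty is hand-computed and verified in the code.

```python
# Sketch: computing Sy = S * Ty^(-1) and the position function
# Py(viy_t) = Sy * viy_t for the variable Y of the matrix multiply,
# under an assumed schedule pi = [1 1 1], S = [[1,0,0],[0,1,0]].

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

pi = [[1, 1, 1]]                 # time transform, q = 1
S = [[1, 0, 0], [0, 1, 0]]       # space transform, m = 2
Fy = [[0, 0, 1], [0, 1, 0]]      # vi_y = [k, j]

Ty = pi + Fy                     # time-index matrix [pi; Fy]
Ty_inv = [[1, -1, -1],           # hand-computed inverse, verified below
          [0, 0, 1],
          [0, 1, 0]]
I3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
assert matmul(Ty, Ty_inv) == I3

Sy = matmul(S, Ty_inv)           # parameter matrix Sy = S * Ty^(-1)

def Py(t, viy):
    """Position of y[viy] at time t: Py(viy_t) = Sy * [t; viy]."""
    return matvec(Sy, [t] + list(viy))

assert Py(1, [0, 1]) == [0, 1]   # y[k=0, j=1] at PE (0, 1) at t = 1
assert Py(2, [0, 1]) == [1, 1]   # one PE further along +i at t = 2
```

Under this schedule Sy = [[1, -1, -1], [0, 0, 1]], so the Y wavefront drifts one processor per cycle in the +i direction, as expected for this classic mapping.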

The position function is used to find the data velocities. There are q velocity vectors per variable y, one with respect to each time coordinate. Because of the uniform dependence structure, the velocity found for one data element y[viy] is the same for each element of y. That is, the velocity of one specific datum y[viy] will actually be y's wavefront velocity. The data velocity with respect to the ith time coordinate is the change in processor position, ΔP, of y[viy] when the ith time coordinate is incremented by one. Therefore, if ei denotes the ith standard basis vector of Z^n, the ith velocity vector of y is

    vy^i = Py(viy^t + ei) - Py(viy^t)     1 ≤ i ≤ q

or by definition of the position function,

    vy^i = Sy(viy^t + ei) - Sy viy^t = Sy(viy^t + ei - viy^t) = Sy ei

This means that the ith (1 ≤ i ≤ q) column of the parameter matrix Sy is the velocity vector of y for the ith time coordinate of t ∈ Z^q.
The data distribution vectors, introduced by Li and Wah [Li85], are defined as

Definition 3.9: Given a datum y[viy] with viy ∈ Z^m and another datum y[viy + ei], 1 ≤ i ≤ m (ei is the ith standard basis vector for Z^m), the ith data distribution vector δy^i ∈ Q^m is a vector in the processor coordinate system that, for t = constant, points from the processor containing y[viy] to the processor containing y[viy + ei].

The distribution vectors may be found in terms of the position function. Let i = q + j, 1 ≤ j ≤ m; then the jth distribution vector is

    δy^j = Py(viy^t + ei) - Py(viy^t)     1 ≤ j ≤ m,  q < i ≤ n

or by definition of the position function,

    δy^j = Sy(viy^t + ei) - Sy(viy^t) = Sy(viy^t + ei - viy^t) = Sy ei

This states that the jth distribution vector of y is the (q+j)th column of Sy. It is convenient to define the velocity and distribution matrices vy ∈ Q^(m×q) and δy ∈ Q^(m×m) as the submatrices of Sy, namely Sy = [vy | δy]. The parameter matrix is used to compute the data position of y[viy] and explicitly lists the data velocity and distribution vectors of y. The velocities and/or distributions will be non-integer when data takes more than one cycle to reach the next processor, as mentioned earlier.
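The partitioning Sy = [vy | δy] means the velocity and distribution vectors can be read straight off the columns of the parameter matrix. A minimal sketch, assuming the parameter matrix Sy = [[1, -1, -1], [0, 0, 1]] that results for Y of the matrix multiply under an assumed schedule π = [1 1 1], S = [(i, j) rows]:

```python
# Sketch: slicing the parameter matrix Sy = [vy | dy] into its velocity
# column(s) (first q) and distribution columns (remaining m).

Sy = [[1, -1, -1],    # assumed parameter matrix for Y (q = 1, m = 2)
      [0,  0,  1]]

q, m = 1, 2
vy = [[row[i] for i in range(q)] for row in Sy]        # velocity columns
dy = [[row[q + j] for j in range(m)] for row in Sy]    # distribution cols

assert vy == [[1], [0]]           # wavefront moves one PE per cycle in +i
assert dy == [[-1, -1], [0, 1]]   # spacing between adjacent Y elements
```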


The position function may be written in a familiar form using the above partitioning of the parameter matrix. Since vy ∈ Q^(m×q), t ∈ Z^q, δy ∈ Q^(m×m) and viy ∈ Z^m, the equation P = Sy viy^t may be written P = vy t + δy viy. This is a vector form of the familiar position function for linear motion at a constant velocity, P(t) = vt + p0. At t = 0, Py(viy^t) = δy viy, so that δy viy is the initial position of the datum at time t = 0. The initial position is a function of viy, i.e. each separate datum y[viy] has its own initial position in the array processor coordinate system.


Figure 3.3: Data Distributions for Algorithm With δy = I2

The above view of the position function gives insight into how data moves in the systolic array. In the common case of one-dimensional time, it shows that data moves in a straight line by discrete steps of vy each time the clock is incremented. A more interesting case is when q > 1, which arises when there are q dependence vectors for each variable y. For example, suppose in some array algorithm that a given variable y had the following parameter matrix

    Sy = [1 0 1 0]
         [0 1 0 1]

where n = 4, q = 2 and m = 2 so that t, viy, P ∈ Z^2. The initial configuration of the algorithm, based on the identity distribution matrix, would be that of figure 3.3.

Now consider how the datum Y00 moves through the array as the temporal vector increments. The time trace of an algorithm is the lexicographically ordered set of unique temporal vectors that arise as the algorithm executes. Suppose in this example that the time trace is given by

    trace = { [0 0]^T, [0 1]^T, [0 2]^T, [1 0]^T, [1 1]^T, [1 2]^T, [2 0]^T, [2 1]^T, [2 2]^T }

Then the position function for Y00 would become

    Py(t, [0 0]^T) = vy t + δy [0 0]^T = vy t = t

so that the path traced out by Y00 will be the same as the time trace.





Figure 3.4: Position Trace of Y00 Through The Processor Array



In the time trace, when t1 is incremented, t2 is set back to zero. This causes jumps in the position of the datum as time increments. This is the mechanism whereby a linear transformation allows data to make multiple passes through the array. The datum Y00's position trace is displayed in figure 3.4 and is based on the above position function and time trace. In the figure, absolute one-dimensional time is used by lexicographically ordering the temporal vectors. This is the approach taken in actual hardware.
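The multi-pass behavior above can be simulated in a few lines, assuming the reconstructed parameter matrix Sy = [I2 | I2] (velocity and distribution both identity) and a 3-by-3 two-coordinate clock:

```python
# Sketch: tracing datum Y00 through the array when q = 2, visiting the
# temporal vectors in lexicographic order as a real implementation's
# absolute one-dimensional clock would.

from itertools import product

Sy = [[1, 0, 1, 0],
      [0, 1, 0, 1]]          # assumed Sy = [vy | dy] with vy = dy = I2

def Py(t, viy):
    v = list(t) + list(viy)
    return [sum(s * x for s, x in zip(row, v)) for row in Sy]

trace = sorted(product(range(3), range(3)))   # lexicographic time trace
path = [Py(t, (0, 0)) for t in trace]         # positions of datum Y00

assert path[0] == [0, 0] and path[-1] == [2, 2]
# resetting t2 to zero when t1 increments makes the position jump,
# i.e. the datum starts another pass through the array:
assert path[2] == [0, 2] and path[3] == [1, 0]
```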
Another set of functions that can be derived to help design and analyze systolic algorithms are the processor memory functions. The memory function was introduced in the context of the space-time-data equations for a systolic algorithm [Kuo84]. The memory function states what indices viy of y are contained in processor P at time t. There is a separate memory function for each variable y in the algorithm. The given information is the time t, and the processor coordinates, P. This information is contained in the time-processor vector,

    J = [ t ]
        [ P ]

Recall that I = T^(-1) J, so that since viy = Fy I, it follows that viy = Fy T^(-1) J, which leads to the following definitions.

Definition 3.10: The memory matrix, My ∈ Q^(m×n), is given by

    My = Fy T^(-1)

and is a linear transformation mapping from time-processor vectors to the variable index vector viy for the variable y.

Definition 3.11: The memory function, for a variable y, is a mapping My: J -> viy and is written

    viy = My J

- 52 -

The memory function specifies what data element y[viy] is contained in a processor P at time t. If My J contains rational non-integer entries, then no data element is contained in an ALU register of processor P at time t.
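A short sketch of the memory function, continuing the matrix multiply example under an assumed transformation T = [π; S] with π = [1 1 1] and S selecting (i, j); the inverse of T is hand-computed and verified in the code.

```python
# Sketch: the memory function viy = My * J with My = Fy * T^(-1),
# telling which element of Y sits in processor P at time t.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

T = [[1, 1, 1],              # assumed T = [pi; S]: t = i+j+k, P = (i, j)
     [1, 0, 0],
     [0, 1, 0]]
T_inv = [[0, 1, 0],          # hand-computed inverse, verified below
         [0, 0, 1],
         [1, -1, -1]]        # recovers I = (i, j, k) from J = (t, i, j)
I3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
assert matmul(T, T_inv) == I3

Fy = [[0, 0, 1], [0, 1, 0]]  # vi_y = [k, j] for the matrix multiply

My = matmul(Fy, T_inv)       # memory matrix My = Fy * T^(-1)

def memory(t, P):
    """Which element y[viy] sits in processor P at time t."""
    J = [t] + list(P)
    return [sum(m * x for m, x in zip(row, J)) for row in My]

assert memory(3, (1, 1)) == [1, 1]   # PE (1,1) holds y[k=1, j=1] at t=3
```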

The last function to be derived is the data collision function. The data collision function determines with which data element from another data set a given datum collides at time t. The collision function is a mapping Cy1y2: viy2^t -> viy1. Given viy2^t, it follows that I = Ty2^(-1) viy2^t and viy1 = Fy1 I, so that viy1 = Fy1 Ty2^(-1) viy2^t, which leads to the following definitions.

Definition 3.12: The collision matrix, Cy1y2 ∈ Q^(m×n), is a transformation matrix mapping Cy1y2: viy2^t -> viy1, where Cy1y2 = Fy1 Ty2^(-1).

Definition 3.13: The collision function of a variable y2 with another variable y1 is a function whose input is a variable-index vector viy2 and a time vector t, (viy2^t), and whose output is a variable-index vector viy1. The vector viy1 is the variable-index vector of the element of y1 with which y2[viy2] collides at time t. We write

    viy1 = Cy1y2 viy2^t
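A short sketch of the collision function for the matrix multiply, asking which element of X a given element of Y meets. The schedule π = [1 1 1] is an assumed example, and the inverse of Ty is hand-computed and verified in the code.

```python
# Sketch: the collision matrix Cxy = Fx * Ty^(-1) and collision function
# vix = Cxy * viy_t for the matrix multiply variables X and Y.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

pi = [[1, 1, 1]]                # assumed time transform
Fx = [[1, 0, 0], [0, 0, 1]]     # vi_x = [i, k]
Fy = [[0, 0, 1], [0, 1, 0]]     # vi_y = [k, j]

Ty = pi + Fy                    # time-index matrix of Y
Ty_inv = [[1, -1, -1],          # hand-computed inverse, verified below
          [0, 0, 1],
          [0, 1, 0]]
I3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
assert matmul(Ty, Ty_inv) == I3

Cxy = matmul(Fx, Ty_inv)        # collision matrix Cxy = Fx * Ty^(-1)

def collide(t, viy):
    """Which element x[vix] the datum y[viy] meets at time t."""
    v = [t] + list(viy)
    return [sum(c * x for c, x in zip(row, v)) for row in Cxy]

assert collide(3, (1, 1)) == [1, 1]   # y[k=1, j=1] meets x[i=1, k=1]
```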

So far all the work has been done for ULRAs. Now the results will be extended to the affine case. Recall that ARPs had indexing transformations of the form

    Fy(I) = Fy^L I + vic

This indexing policy will affect the data's initial placement, but not the velocity nor the distribution. Therefore, the time-index matrix is defined for the linear portion

    Ty^L = [ π    ]
           [ Fy^L ]

and the parameter matrix for the linear portion is given by Sy^L = S (Ty^L)^(-1), so that the velocity and the distribution vectors are stored in Sy^L just as they were in Sy. The linear parameter matrix will find the position if the proper variable indices are presented. If vic is subtracted as a correction term before the transform Sy^L is applied, the proper results will be attained. To see this, let vic^t ∈ Z^n be a vector created by augmenting a q-dimensional zero vector with the constant offset vector of the affine indexing transformation. Then vic^t will be compatible with the time-index vector and may be used to make the correction in the variable index vector portion of the time-index vector before applying the linear portion of the time-index matrix, so that I = (Ty^L)^(-1) (viy^t - vic^t). Thus the new position function, obtained by applying P = SI where I is given by the above, is the same as the previous one, so that the position function may be written

    Py(viy^t) = Sy^L (viy^t - vic^t)

The memory function is also simple to derive when affine indexing transformations are present. In the linear case My = Fy T^(-1) and viy = Fy T^(-1) J = Fy I. Now just set My^L = Fy^L T^(-1) and the memory function becomes viy = My^L J + vic.

The final function to be extended is the collision function. In the linear case viy1 = Fy1 Ty2^(-1) viy2^t. The key was that I = Ty2^(-1) viy2^t. From the discussion of the position function, we know that I = (Ty2^L)^(-1) (viy2^t - vic2^t), where vic2^t is the zero-augmented constant indexing offset vector for y2. Then the second transformation is applied as viy1 = Fy1^L I + vic1. Thus the full function is computed by

    viy1 = Fy1^L (Ty2^L)^(-1) (viy2^t - vic2^t) + vic1

3.4 Algorithmically Specified Array Synthesis
The synthesis problem may be stated: given a serial algorithm in matrix form,

find a transformation T ∈ Zn×n that will transform the serial algorithm to a π-equivalent systolic algorithm where data flow matches the interconnection con-


straints of the target array. The synthesis method uses the same equations as the

previous analysis techniques, but in a different configuration. A system of equa-

tions is derived that has as its solution the transformation T. Once the transform is

found, it may be tested to see if it preserves the algorithm's precedence relation-

ships by multiplying the dependence vectors by the time transformation and testing

to see if the result is lexicographically positive. Programs with affine indexing

transformations may be designed by the methods discussed in this section by using

the linear portion of the transformation.
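The precedence test just described can be carried out mechanically. The sketch below is illustrative (the helper names are mine, not from the text): it checks that every column of πD is lexicographically positive, using the identity dependence matrix of the matrix multiply algorithm as the test case.

```python
# Precedence test: a time transform pi preserves the algorithm's
# dependencies when every column of pi*D is lexicographically positive.

def lex_positive(v):
    """True if the first nonzero entry of v is positive."""
    for x in v:
        if x != 0:
            return x > 0
    return False  # the zero vector does not count as positive

def preserves_precedence(pi, D):
    """pi: q x n time transform, D: n x r dependence matrix (row lists)."""
    for col in zip(*D):                      # each dependence vector d
        pid = [sum(p * d for p, d in zip(row, col)) for row in pi]
        if not lex_positive(pid):
            return False
    return True

# Matrix multiply has D = I3, so pi = [1 1 1] passes the test:
assert preserves_precedence([[1, 1, 1]], [[1, 0, 0], [0, 1, 0], [0, 0, 1]])
# A transform with a negative entry violates one dependence:
assert not preserves_precedence([[1, -1, 1]], [[1, 0, 0], [0, 1, 0], [0, 0, 1]])
```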

In synthesizing a systolic algorithm, one specifies selected data velocities and
distributions in order to get a desired data flow. The other velocities and distribu-

tions are left as free parameters in order to make the resulting system more solv-

able. The data velocity and distribution information is contained in the parameter

matrices, which are related to the time-indexing matrix through the space trans-

form as Sy = STy^-1. Multiplying on the right by Ty yields SyTy = S, which may be written

vyπ + δyFy = S

The above is a synthesis equation. It relates data velocity, distribution and

indexing for the variable y to the time and space transforms. There is one synthe-

sis equation for each variable y in the generating program. All of the synthesis

equations must be solved simultaneously for π and S. When the system is solved,

the result is the algorithm transformation T that will map a serial algorithm to a

parallel algorithm with the desired data velocities and distributions. If no solution

exists, the constraints must be loosened until a solution is found.

The synthesis equation for the variable y is very similar to y's position function. Note that if both sides of the synthesis equation vyπ + δyFy = S are multiplied


on the right by the index vector and the sides are switched, the position function P = vy t + δy viy results. While the position function relates vectors of different

coordinate systems, the synthesis equation relates the transformation matrices. The

synthesis equations are solved to determine the transform that will give the desired

position functions. This amounts to specifying vy and δy and solving for π and S. If all the vy and δy matrices are specified, usually no solution will exist. Rather, only the essential parameters are specified and the rest fall as they will to give a valid solution.

There are several reasons why one would want to generate a systolic algorithm

with specified data flow parameters. The main reason is that if only particular data

paths are available, then one would like to design an algorithm that uses the exist-

ing paths. Sometimes it is desirable to constrain the distribution in order to have

data properly fit an array or to obtain algorithms with higher PE utilization effi-

ciency due to closely spaced data. Other times algorithms must be designed to run

on an array with fixed size and dimensions. In such cases the algorithm must be

partitioned to execute on the limited hardware resources, as is described in the next chapter.

The synthesis equations may be applied using available information to simplify

their solution. For example, the matrix multiply algorithm was shown to have an identity dependence matrix in section 2.4. Also, time is one-dimensional, so that π ∈ Z1×3. Because D = I3, πD = π, so to satisfy the precedence relations of the algorithm each entry of π must be positive. To obtain a minimal time algorithm the entries are set to the minimum positive integer so that π = [1 1 1]. This type of

reasoning simplifies the solution of the synthesis equation for many algorithms. In

light of this, consider the following example.


Example 3.1: Use the synthesis equations to map the algorithm generated by the
matrix multiply program below to a two-dimensional systolic array with a four
nearest neighbor connection topology. Assume that no data are to stay in place.
Program: for (i = 0 : 2)
for (j = 0 : 2)
for (k = 0 : 2)
Zij = Zij + Xik Ykj
Solution: Note that n=3, m=2 and q=1 and that the indexing matrices were given in
section 3.2. Since q=1, there is only one time dimension and data will make a
single pass through the array. From the discussion above it is already known that
the minimum time algorithm will be generated by π = [1 1 1] and some S yet to
be found. Since only four nearest neighbor connections are available and all data
must move, specify the velocities as

vz = [0 1]T,  vx = [-1 0]T,  vy = [1 0]T

so that if the processor axes are set up as in figures 3.3 and 3.4, these velocities will
cause Z to move from left to right, X to move from top to bottom, and Y to move
from bottom to top. These velocities will cause data to travel along the existing
data paths of the array. The distributions are not critical so will be left as free
parameters and determined later by computing the parameter matrices. The synthesis
equations, vyπ + δyFy = S, will be written for X, Y and Z in matrix form with
π = [1 1 1] and the multiplications vyπ already performed.
Z:  [0 0 0]   [δz11 δz12] [1 0 0]   [S11 S12 S13]
    [1 1 1] + [δz21 δz22] [0 1 0] = [S21 S22 S23]

X:  [-1 -1 -1]   [δx11 δx12] [1 0 0]   [S11 S12 S13]
    [ 0  0  0] + [δx21 δx22] [0 0 1] = [S21 S22 S23]

Y:  [1 1 1]   [δy11 δy12] [0 0 1]   [S11 S12 S13]
    [0 0 0] + [δy21 δy22] [0 1 0] = [S21 S22 S23]


If the multiplications δyFy are performed, the above equations reduce to

Z:  [0 0 0]   [δz11 δz12 0]   [S11 S12 S13]
    [1 1 1] + [δz21 δz22 0] = [S21 S22 S23]

X:  [-1 -1 -1]   [δx11 0 δx12]   [S11 S12 S13]
    [ 0  0  0] + [δx21 0 δx22] = [S21 S22 S23]

Y:  [1 1 1]   [0 δy12 δy11]   [S11 S12 S13]
    [0 0 0] + [0 δy22 δy21] = [S21 S22 S23]

The Z equation requires S13 = 0 and S23 = 1. The X equation forces S12 = -1 and
S22 = 0. The Y equation forces S11 = 1 and S21 = 0. Thus the S matrix is completely
determined and is given by

S = [1 -1 0]
    [0  0 1]

and the parameter matrices can now be determined using Sy = STy^-1 to be

Sz = [0  1 -1]   Sx = [-1 2 1]   Sy = [1 -1 -2]
     [1 -1 -1]        [ 0 0 1]        [0  1  0]

The first column of each of the above matrices is the velocity vector for the respec-
tive variable as specified at the beginning of the problem. The distribution vectors
are given in the second and third columns of the above matrices. Figure 3.5 shows
the initial configuration of the data as the algorithm is about to execute. The veloc-
ity and distribution vectors are also shown in the figure. Note that if the data
moves as specified, the correct results will be computed. Also note how the distri-
bution vectors point from the point in the processor space where a given variable is
located to the point in the processor space where another datum resides that has
one of its variable indices incremented with respect to the other's. The distribution
vectors tell how the data is distributed in the processor space.
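The parameter matrices above can be checked numerically. The following sketch (the helper names mat_inv and mat_mul are mine) recomputes Sy = STy^-1 for each variable with exact rational arithmetic and recovers the specified velocities in the first columns.

```python
from fractions import Fraction

def mat_inv(A):
    """Gauss-Jordan inverse over the rationals."""
    n = len(A)
    M = [[Fraction(x) for x in row] + [Fraction(i == j) for j in range(n)]
         for i, row in enumerate(A)]
    for c in range(n):
        piv = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[piv] = M[piv], M[c]
        p = M[c][c]
        M[c] = [x / p for x in M[c]]
        for r in range(n):
            if r != c and M[r][c] != 0:
                M[r] = [a - M[r][c] * b for a, b in zip(M[r], M[c])]
    return [row[n:] for row in M]

def mat_mul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

pi = [[1, 1, 1]]
S = [[1, -1, 0], [0, 0, 1]]
F = {'Z': [[1, 0, 0], [0, 1, 0]],   # viz = (i, j)
     'X': [[1, 0, 0], [0, 0, 1]],   # vix = (i, k)
     'Y': [[0, 0, 1], [0, 1, 0]]}   # viy = (k, j)

for name, Fy in F.items():
    Ty = pi + Fy                    # time-index matrix [pi; Fy]
    Sy = mat_mul(S, mat_inv(Ty))    # parameter matrix: [velocity | distribution]
    print(name, [[int(x) for x in row] for row in Sy])
```

Running this reproduces Sz = [0 1 -1; 1 -1 -1], Sx = [-1 2 1; 0 0 1] and Sy = [1 -1 -2; 0 1 0], matching the matrices derived above.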

Figure 3.5: The Systolic Algorithm of Example 3.1 with Velocity and Distribution Vectors Displayed


Example 3.2: Design a systolic algorithm to filter a 32 x32 pixel image with a 3 x3
window on a 3 x3 systolic array with a four nearest neighbor connection topology.
The filter coefficients must sit still, one per processor. Assume that random ac-
cess memory buffers exist at the array periphery that are controlled by an address
generator with read-write control.
Solution: Let x, y, w represent the input, output and window images respectively,
where x ∈ R31×31, y ∈ R33×33, w ∈ R3×3. Figure 3.6 shows how the filtering
operation may be written as four loops in matrix form. To make w remain still
and be distributed one per processor, let

vw = [0 0]   and   δw = [1 0]
     [0 0]              [0 1]

Next we will assume the x's and y's move in the same direction but at different speeds,
as in H.T. Kung's W9 one-dimensional convolution algorithm [Ku82]. To achieve
this, set

vy = [1 0]   and   vx = [1/2  0 ]
     [0 1]              [ 0  1/2]

Note that these velocities are specified so that it will take two clock cycles in either
temporal dimension for a given X-variable to reach another processor.
Due to the way time increments, this will cause some data not to engage any
processors while time progresses in the least significant temporal dimension. The
data whose position due to the most significant time coordinate is an integer will
engage a processor every other cycle.
If the X and Y data distributions are left as free parameters, the synthesis
equations become,


Y:  [π11 π12 π13 π14]   [δy11 δy12] [1 0 0 0]   [S11 S12 S13 S14]
    [π21 π22 π23 π24] + [δy21 δy22] [0 1 0 0] = [S21 S22 S23 S24]

X:  1/2 [π11 π12 π13 π14]   [δx11 δx12] [1 0 1  0]   [S11 S12 S13 S14]
        [π21 π22 π23 π24] + [δx21 δx22] [0 1 0 -1] = [S21 S22 S23 S24]

W:  [0 0 0 0]   [1 0] [0 0 1 0]   [S11 S12 S13 S14]
    [0 0 0 0] + [0 1] [0 0 0 1] = [S21 S22 S23 S24]

The W equation yields S directly,

S = [0 0 1 0]
    [0 0 0 1]

so that the X and Y equations may be rewritten

Y:  [π11 π12 π13 π14]   [0 0 1 0]   [δy11 δy12 0 0]
    [π21 π22 π23 π24] = [0 0 0 1] - [δy21 δy22 0 0]

X:  1/2 [π11 π12 π13 π14]   [0 0 1 0]   [δx11 δx12 δx11 -δx12]
        [π21 π22 π23 π24] = [0 0 0 1] - [δx21 δx22 δx21 -δx22]

From the Y equation it is clear that

[π13 π14]   [1 0]
[π23 π24] = [0 1]

From the X equation we see that

1/2 [π11 π12]     [δx11 δx12]
    [π21 π22] = - [δx21 δx22]

and that

1/2 [π13 π14]   [1 0]   [δx11 -δx12]
    [π23 π24] = [0 1] - [δx21 -δx22]

so that

δx = 1/2 [1 0]    δy = [1 0]    [π11 π12]   [-1  0]
         [0 1],        [0 1],   [π21 π22] = [ 0 -1]



so that the required transformation is

    [-1  0  1  0]
T = [ 0 -1  0  1]
    [ 0  0  1  0]
    [ 0  0  0  1]

Figure 3.6 shows a snapshot of the algorithm executing.

Figure 3.6: 2-D Convolution Algorithm Designed in Example 3.2


An algorithm must often be mapped to a fixed architecture. The algorithms will

usually operate on large data sets which cause the algorithm to have a large index

set. The large index set is then mapped to a processor set through the space trans-

form S: In -> Pm. Since S is a linear transformation with integer entries, the dimen-

sions of the processor set will be at least as big as the smallest m dimensions of

the index set. Sometimes certain dimensions of the index set are small, as in the

two-dimensional convolution example of the last chapter. Usually this is not the

case so further methods must be developed to map large algorithms into small

arrays. Another related problem is mapping an arbitrarily dimensioned algorithm

to a k-dimensional array. For example, one may wish to compute a matrix multi-

ply on a one-dimensional array, or a one-dimensional convolution on a two-

dimensional array.

This chapter will explore and develop methods to map an algorithm to a fixed

architecture. This will be done by reviewing some known methods and introducing

some nonlinear number theoretic index set permutation transformations into the

existing framework. The new transformations will be used to transform sets of

integer vectors into sets of integer vectors of different dimensions. This added

capability will allow more types of algorithms to be mapped and will add more

degrees of freedom to the mappings themselves. Also, the nonlinear transformations may be integrated into the existing framework to allow the synthesis of algo-

rithmically specified arrays with size and dimension constraints.

The non-linear transformations are based on number theoretic set transforma-

tions. If a given dimension of an index set ranges from 0 to n-1, then this dimen-

sion may be viewed as the integer ring Zn. The elements of the ring may be

mapped to isomorphic rings of different dimensions. The maps are simple to im-

plement using known mappings. One map is based on the integer division theorem,

and the other is based on the Chinese remainder theorem. Both maps depend on
the factors of n. For example, if n = a*b*c, then there exists a bijective map

φ : Zn -> Za × Zb × Zc.

In section 4.1 partitioning fundamentals will be covered. The concept of a vir-

tual processor will be introduced. Some partitioning methods in the literature will

be discussed, and an in-place partitioning method will be presented. In section 4.2

integer based program pretransformations will be applied to the index set to create

a new algorithm that may be mapped to a fixed structure. In section 4.3, integer

transformations are applied to the time-processor space to partition and alter the

dimensions of an array algorithm directly. The architectural ramifications of the

mapping are discussed to show what properties are required of the basic Process-

ing Element (PE) to support the partitioning scheme. In section 4.4, the integer
maps will be used with the synthesis equations to synthesize arrays of different

dimensions than the variable index vectors.

4.1 Partitioning Fundamentals

Array processor hardware is very complex due to the relatively large number of

processors and the large number of connections between them. If the array has

dimensions m × m, then m^2 processors will be required and the number of connections
will be on the order of 2m^2 or 4m^2. Another concern is that m inputs and m


outputs will need to be serviced at the array boundaries each cycle. This means the

memory must be m times as fast as the PEs. These factors have kept systolic
arrays small when compared to the data sets that they must process.
So in order for systolic arrays to solve real-world applications, the algorithms
must be partitioned to fit the physical hardware structures.

The properties of the array and PE architectures must be taken into account

when mapping algorithms to the fixed architecture. Some architectures consist of

very simple PEs that can perform some basic arithmetic operations, while others

have more complex PEs. If the system has the simple PEs, then the partitioning

must be handled at the array boundaries. The controller must store intermediate

results in the off-array memory and send the data through the array several times

in order to complete the algorithm. The central issue is to determine the proper

sequences of data to send through the array for multiple passes. If the PE has

some local memory and possibly a pointer based addressing unit, then the parti-

tioning may be handled by the array itself. Here the main partitioning problem is

to tell the processors how to sequence data in local memory.

Most partitioning schemes employ some variation of the concept of a virtual

processor or virtual node. The idea is that the set of processors that is determined
by the mapping is actually a set of virtual processors. These virtual processors

must then be emulated by the physical array. That is, each physical PE must act as

a set of virtual PEs. This means that some additional scheduling and resource
allocation must be done to get the physical PEs to act as sets of virtual processing

nodes. Also, memory and addressing/sequencing hardware must be built into the

PEs. The addition of this hardware greatly complicates the array algorithm design

and implementation problem. Parallel languages such as OCCAM [Pou86] and W2

[Ann87] can make this task easier, but still the hardware must be programmed.



Some partitioning methods based on algorithm transformation techniques are

available in the literature. The method of [For84] and [Mol86] involves slicing the
index set into bands and mapping these bands modulo L to a fixed L×L array.
The index set is sliced by separating the m×n space transformation matrix, S, into
m 1×n sub-transforms. Bands are formed by finding surfaces in the index set that
satisfy SiI = constant; the bands are then reduced modulo L to fit on an L×L array.
The computations within each band are scheduled by the time transform, π.
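A minimal sketch of the modulo-L reduction follows; the array sizes and helper names are illustrative assumptions of mine, not taken from [For84] or [Mol86].

```python
# Reduce the virtual processor coordinates of an oversized synthesized
# array modulo L so that they land on a fixed L x L physical array.
from collections import defaultdict

L1, L = 6, 2          # virtual and physical array dimensions (assumed)

def physical_pe(p):
    """Map a virtual PE coordinate pair onto the L x L physical array."""
    return (p[0] % L, p[1] % L)

load = defaultdict(list)
for p1 in range(L1):
    for p2 in range(L1):
        load[physical_pe((p1, p2))].append((p1, p2))

# Each physical PE emulates (L1/L)^2 virtual PEs.
assert len(load) == L * L
assert all(len(v) == (L1 // L) ** 2 for v in load.values())
print(len(load), "physical PEs,", (L1 // L) ** 2, "virtual PEs each")
```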

The method of [Hor87] is based on synthesizing algorithms with very low PE-
utilization efficiency and then eliminating the unused processors each cycle. The

net result is a relatively high efficiency array algorithm that uses only a few proces-

sors. A time transform with larger than normal entries is used which gives rise to

low data velocities. If a datum's velocity is less than one, it will not be involved in

a computation for many cycles. Also, processors will be idle many cycles. The

space transform generates a full size processor array but the time transform is

selected so that the execution time of the algorithm is longer than the minimum

execution time. This causes many processors to be idle most of the time. Thus if

the available array has dimensions L × M, then the array algorithm is synthesized
with dimensions L1 × M1, where groups of (L1/L) × (M1/M) processors will have
only one active PE at any given time. This group of (L1/L) × (M1/M) virtual
processors may be handled by a single physical processor since only one processor
in the group will be active each clock cycle. The algorithm takes longer to execute,

but this is expected since fewer processors are available to execute the task.

Another way to efficiently partition algorithms is based on computing the out-

put variables in place. This method is used for ULRPs that compute operations of
the form Z = X*Y. The first step is to synthesize an algorithm with vz = 0 and δz = Im.
This will cause the z-variables to sit in place and to be distributed one per processor,
so that the size of the synthesized array is the same as the size of the z-data set. If


the synthesized array has dimensions L1 × M1 and the physical array has dimensions
L × M, the array is partitioned into L × M subarrays as shown in figure 4.1
below. If L and M do not divide L1 and M1 respectively, then the boundary
subarrays will have some unused processors, but the method may still be used.

The algorithm is executed by sequentially executing the algorithm on each sub-

array to compute the L × M sub-blocks of the output data set. Care is taken to

remove unused cycles in the time trace of the algorithm. This is done by starting

each subarray's time trace when the data wave first encounters the subarray and

halting once the data wave passes over.


Figure 4.1: In-Place Partitioning Method

Each sub-array algorithm may be set up and controlled with the help of the

memory functions and velocity vectors for the X and Y data sets. The X and Y

memory functions are applied to the boundary processors of each sub array to

obtain the sub-array input data variable index sequences. The velocity vectors are

used to tell which boundary processors will receive the input data. An example
will illustrate.

Example 4.1: Synthesize an array to execute the 4 x4 matrix multiply algorithm

described by the program below. Next partition the algorithm to execute on a 2 x 2




physical array structure. Compare the execution times of the unpartitioned and
partitioned algorithms with a uniprocessor algorithm.
Program: for (i = 0 : 3)
for (j = 0 : 3)
for (k = 0 : 3)
Zij = Zij + Xik Ykj
Solution: To make the algorithm execute in place, set δz = Im and vz = 0 so that the z
synthesis equation, vyπ + δyFy = S, becomes 0π + ImFz = S, or Fz = S. If π is
selected as π = [1 1 1] in order to satisfy the program dependencies and provide a
minimum time algorithm, then the parameter matrices may be computed to be

Sz = [0 1 0]   Sx = [0  1  0]   Sy = [1 -1 -1]
     [0 0 1]        [1 -1 -1]        [0  0  1]

and the memory matrices are given by

Mz = [0 1 0]   Mx = [0  1  0]   My = [1 -1 -1]
     [0 0 1]        [1 -1 -1]        [0  0  1]

Note in this example that Sy = My for each y ∈ { X, Y, Z }.
The resulting array algorithm is shown in figure 4.2 with partitioning lines
drawn. Figure 4.3 shows four separate algorithms that are executed in sequence.
The outputs must be flushed from the array after each sub-array algorithm exe-
cutes. The flushing operation could be pipelined to avoid wasted cycles at the cost
of a more complex controller.
The partitioning of figure 4.2 was performed graphically and was conceptually
simple. If the partitioning is to be performed automatically, a more structured
approach is needed. Because vx = [0 1]T, we know the x data will be input to the
right boundary processors. Similarly, because vy = [1 0]T, the y data will be input
to the bottom boundary processors. Thus the boundary processors that must receive
input data are known. Next the time trace for the entire algorithm is formed by
transforming the index set through π: In -> ttrace. In this example, the sorted time trace is given by


ttrace = { 0,1,2,3,4,5,6,7,8,9 }
The trace is used with the virtual processor coordinates of the boundary processors

of the subarrays to obtain the input data index sequences. Each element of the

time trace is augmented with the subarray boundary virtual processor coordinates
and loaded into a matrix as column vectors, as shown below for the (0,0) processor:

        [0 1 2 3 4 5 6 7 8 9]
Jp00 =  [0 0 0 0 0 0 0 0 0 0]
        [0 0 0 0 0 0 0 0 0 0]

Figure 4.2: In Place Algorithm with Partitioning Lines



This is a matrix of time-processor vectors whose columns are ordered in time. If

the matrix is multiplied by a memory matrix then the columns of the product

matrix will be variable index vectors. These vectors will be the ordered sequence

of indices that will be input to the given boundary processor. For example if the

above matrix is multiplied by Mx, the sequence

Figure 4.3: The Partitioned Algorithm As Four Sub-Arrays a) The Array to
Compute the (0,0) Subblock b) The Array to Compute the (0,1) Subblock c)
The Array to Compute the (1,0) Subblock d) The Array to Compute the (1,1)
Subblock


         [0 0 0 0 0 0 0 0 0 0]
vix00 =  [0 1 2 3 4 5 6 7 8 9]

is obtained. If it is multiplied by My, the resulting sequence is

         [0 1 2 3 4 5 6 7 8 9]
viy00 =  [0 0 0 0 0 0 0 0 0 0]

If the matrix is multiplied by Mz, an all-zero matrix results, since Z00 is always
present at processor P00.

The column vectors of the vix00 and viy00 matrices are the input variable index
sequences to P00 in figures 4.2 and 4.3. Note that the column vectors begin to
contain out of range values after 4 cycles. This signals that the data wave has

passed over the processor and no valid computations are occurring. These cycles

would not be executed in the partitioned algorithm. In other cases there would be a
number of columns that contained out of range values at the beginning and possi-

bly more at the end. This corresponds to a sub-array in the middle of the array

where the data wave takes several cycles to reach the boundary processor. The

computer detects the subsequence where the input indices are in range and sched-
ules the algorithm accordingly.
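The sequence computation just described can be sketched as follows, using the memory matrices Mx and My derived in this example; the variable names and the in-range filter are my own illustration.

```python
# Input index sequences for boundary processor (0,0) of the partitioned
# 4x4 in-place matrix multiply.  Columns of Jp00 are time-processor
# vectors [t, p1, p2]; a memory matrix maps them to variable indices.

def mat_mul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

Mx = [[0, 1, 0], [1, -1, -1]]
My = [[1, -1, -1], [0, 0, 1]]

Jp00 = [list(range(10)),        # time trace t = 0..9
        [0] * 10, [0] * 10]     # processor coordinates p = (0,0)

vix00 = mat_mul(Mx, Jp00)       # -> [[0]*10, [0,1,...,9]]
viy00 = mat_mul(My, Jp00)       # -> [[0,1,...,9], [0]*10]

# Keep only the cycles whose indices stay in range (0..3 for 4x4 data);
# the rest are the wasted cycles that the scheduler removes.
valid = [t for t in range(10)
         if all(0 <= m[t] <= 3 for m in vix00 + viy00)]
print(valid)   # the data wave covers P00 only during cycles 0..3
```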

The last task in the example is to compare execution times. The serial algorithm
has complexity O(n^3), so the 4×4 algorithm takes 64 cycles to execute. The
full-sized array algorithm can be seen to have complexity O(3n-2), so it would
require ten cycles to execute. If it is assumed that the flushing is not pipelined,
then the partitioned algorithm has a complexity of O(n^3/2) and takes 32 cycles.

4.2 Integer Based Index Set Pretransformations

An algorithm transformation that preserves π-equivalence must be a bijection,
but it does not have to be integer or affine. As mentioned earlier, S: In -> Pm, where


the size of Pm is directly related to the size of In. As was shown in example 3.2, if
m dimensions of the index set are of the size of the array, then S may be
selected to make the algorithm fit the array. Usually the size of the index set is
independent of the size of the physical array, so that no fit may be made by carefully
selecting a space transform. Rather, a number theoretic integer index set
bijection may be applied to the index set before the transform T. The index set

pretransformation will alter the shape of the index set in such a way that m dimen-

sions of the transformed index set will be of the desired size.

Certain conditions on the index set must be met before a number theoretic
permutation may be applied. First of all, the i-th element of the index vector must
range from 0 to bi-1. If an element of the index ranges from one constant to
another constant, then the program may easily be rewritten to meet the requirement.
In other words, the leading n×n sub-matrix of the boundary matrices must
be zero to apply the method. If I ∈ In and i and j are components of I that range
from 0 to bi and bj respectively, the algorithm can be made to map to an a1 × a2
array if a1 | bi and a2 | bj. An integer bijection, φI, is applied to the original pro-

gram's index set to pretransform the program so that it will generate an algorithm that
may be directly mapped to a fixed size systolic array through a linear transformation.
Thus the index set of the original program is mapped to another index set
that is subsequently mapped to the systolic array time-processor set Jn. The mapping
may be viewed as φI: In -> I'n and T: I'n -> Jn, so that TφI: In -> Jn. Thus the
transformation TφI maps the program to the systolic array of fixed size.

Let a given dimension of the index set be generated by the loop

for ( i = 0 : b-1 )

The dimension then consists of the integers { 0, 1, 2, ..., b-1 }, which is the integer
ring Zb. For the case where b = a1a2a3 and an a1 × a2 array is available, the
pretransformation φ will be a map φ: Zb -> Za1 × Za2 × Za3. The map is known to


exist by the division algorithm of number theory [Ros84], although it will not
preserve the ring operations in general. Let i ∈ Zb and (i1,i2,i3) ∈ Za1 × Za2 × Za3,
where φ: i -> (i1,i2,i3). Then the forward map is given by (i1,i2,i3) = (i div a2a3,
(i mod a2a3) div a3, i mod a3). The inverse map, φ^-1: (i1,i2,i3) -> i, is given by the
linear equation i = a2a3i1 + a3i2 + i3. The above is easily verified and follows directly
from the referenced theorem. The above relations show how to compute the mappings
φ and φ^-1. Another easily verifiable fact is that the ordering of Zb is transformed
to an equivalent lexicographical ordering in Za1 × Za2 × Za3. Thus the ordering of
computations is known for a program where some of the dimensions of the index
set are transformed through the above transformation.
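A sketch of this pair of maps (the function names are mine):

```python
# phi: Z_b -> Z_a1 x Z_a2 x Z_a3 for b = a1*a2*a3, built from the
# integer division theorem, together with its linear inverse.

def phi(i, a2, a3):
    return (i // (a2 * a3), (i % (a2 * a3)) // a3, i % a3)

def phi_inv(i1, i2, i3, a2, a3):
    return a2 * a3 * i1 + a3 * i2 + i3

a1, a2, a3 = 2, 2, 3
b = a1 * a2 * a3
triples = [phi(i, a2, a3) for i in range(b)]

# phi is a bijection that carries the natural order on Z_12 to the
# lexicographic order on Z_2 x Z_2 x Z_3.
assert all(phi_inv(*t, a2, a3) == i for i, t in enumerate(triples))
assert len(set(triples)) == b and triples == sorted(triples)
print(triples[:4])   # [(0, 0, 0), (0, 0, 1), (0, 0, 2), (0, 1, 0)]
```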

The above mapping may be carried out quite naturally in a SENLP. The SENLP
will consist of nested "for" statements and a single expression. If one of the for
loops had the form

for ( i = 0 : b-1 )

and b = a1a2a3, then the above statement may be written in the transformed form

for ( i1 = 0 : a1-1 )
for ( i2 = 0 : a2-1 )
for ( i3 = 0 : a3-1 )

This loop nest generates an index set that may be viewed as a set of tuples (i1,i2,i3).
The set of these tuples is all of the elements of the ring Za1 × Za2 × Za3. Furthermore,
the tuples are generated in lexicographical order, so the triple loop has the
same effect as the single loop while performing the mapping φ. The mappings may be
performed either mathematically, or by rewriting the program. The total index set
map, φI, is the composite of applying these transformations to selected dimensions
of the index vector. If φI is applied to the index set so that a total of k extra
dimensions are produced, the transformed index set may be viewed as a subset of


Zn+k. The inverse map is linear and may be represented as a matrix. The
tuple (i1,i2,...,in+k) is represented as a vector I' = [i1,i2,...,in+k]T, and the inverse
map as a matrix Γ^-1 ∈ Zn×(n+k), so that I = Γ^-1 I'.

Once the index vector has been transformed by φI, the indexing matrices will
need to be altered to operate on the transformed index vector. Since viy = Fy I and
I = Γ^-1 I', the new indexing matrices are given by Fy' = Fy Γ^-1. The transformed indexing
matrices now have dimensions Fy' ∈ Zm×(n+k). If the mapping is carried out at the
program level as described above, the indexing matrices in the expression must be
replaced by the transformed indexing matrices Fy'. If the mapping is carried out
automatically, the above expressions may be used to compute the new indexing
scheme without involving the programmer.
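A quick consistency check of this construction can be sketched as follows (the loop bounds are those of the 12-point split used in Example 4.2 below, and the helper names are mine): for every transformed index vector I', the product Fy'I' must pick out the same variable element as Fy I.

```python
from itertools import product

def mat_mul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def mat_vec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

# Inverse map for splitting a 12-point dimension as Z12 -> Z2 x Z2 x Z3:
Gamma_inv = [[6, 3, 1, 0, 0],
             [0, 0, 0, 1, 0],
             [0, 0, 0, 0, 1]]

Fx = [[1, 0, 0], [0, 0, 1]]          # vix = (i, k) in Z = Z + X*Y
Fx_new = mat_mul(Fx, Gamma_inv)      # Fx' = Fx * Gamma^-1

for Ip in product(range(2), range(2), range(3), range(12), range(12)):
    I = mat_vec(Gamma_inv, Ip)               # I = Gamma^-1 I' = [i, j, k]
    assert mat_vec(Fx_new, Ip) == mat_vec(Fx, I)
print("Fx' indexes X identically on the transformed index set")
```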

Now the mathematical model exists to transform a SENLP to another SENLP
with an altered index set and expression. If the transform is chosen wisely, the

transformed program will generate an algorithm that will readily map to a fixed-size
systolic array. The extra temporal dimensions in the transformed index
vector will cause extra dimensions to arise in the time-processor space, requiring

data to make multiple passes through the array. The following example will illus-

trate the method.

Example 4.2: Map the 12 × 12 matrix multiply program below to a 2 × 2 systolic
array. Rewrite the program to perform the mapping φI. Find the indexing
matrices in the transformed program. Evaluate the data velocities and distributions
of the resulting algorithmically specified systolic array. Determine the complexity
of the algorithm and determine the total execution time. Compare with the
previous method.
Pi: for (i= 0 : 11)
for (j = 0 : 11)
for (k = 0: 11)
Zij = Zij + Xik Ykj


Solution: Select the first coordinate, i, for transformation. Note that 12 = 2 × 2 × 3,
so that the above loop may be rewritten,
P2: for ( i1 = 0 : 1)
for ( i2 = 0 : 1)
for ( i3 = 0 : 2)
for ( j = 0 : 11)
for ( k = 0 : 11)
so that the transformed index vector is I' = [i1,i2,i3,j,k]T and I' ∈ Z5, π ∈ Z3×5,
S ∈ Z2×5, n = 5, q = 3 and m = 2. The new index vector may be computed from
the original using I' = [i div 6, (i div 3) mod 2, i mod 3, j, k]T. Now that the index
vector has been transformed, the indexing matrices must also be transformed. The
transformed indexing matrices are given by Fy' = Fy Γ^-1. In this example,

       [6 3 1 0 0]
Γ^-1 = [0 0 0 1 0]
       [0 0 0 0 1]

so that the transformed indexing matrices become

Fz' = [6 3 1 0 0]   Fx' = [6 3 1 0 0]   Fy' = [0 0 0 0 1]
      [0 0 0 1 0]         [0 0 0 0 1]         [0 0 0 1 0]
Next S is selected as

S = [1 0 0 0 0]
    [0 1 0 0 0]

Now S: I' -> { [0,0]T, [0,1]T, [1,0]T, [1,1]T }, so that only four processors are needed to
execute the algorithm. A time transform will be selected to satisfy the program
dependencies and the matrix invertibility constraints of T and Ty' as

    [1 0 1 0 0]
π = [0 1 0 1 0]
    [1 0 0 0 1]


The above transform was selected to generate a regular time trace and meet the
constraint that its rows are linearly independent of the rows of S and each Fy'.
Now the data velocities and distributions may be computed using Sy' = S(Ty')^-1 to be

Sz' = [-1/5 -3/5 0 1/5 3/5]   Sx' = [  0   0   1    0   -1 ]   Sy' = [0 0 1 -1  0]
      [  0    1  0  0   -1 ]        [-1/3  0 -5/3  1/3  5/3]         [0 1 0  0 -1]

The worst case execution time is when the clock runs from t = [0,0,0]T to
t = [3,12,12]T with no wasted cycles removed. In this case a total of 4*13*13 = 676
cycles would be required for execution. The serial algorithm requires
12*12*12 = 1728. In this case the algorithm has a speedup of 1728/676 = 2.55. Since
4 processors are available, the PE efficiency is 2.55/4 = 63%.
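As a sanity check on π-equivalence, the sketch below runs P1 and the pretransformed P2 on the same operands and confirms they produce identical products; the randomly generated test data are my illustration, not part of the text.

```python
import random

N = 12
X = [[random.randint(0, 9) for _ in range(N)] for _ in range(N)]
Y = [[random.randint(0, 9) for _ in range(N)] for _ in range(N)]

Z1 = [[0] * N for _ in range(N)]        # P1: original triple loop
for i in range(N):
    for j in range(N):
        for k in range(N):
            Z1[i][j] += X[i][k] * Y[k][j]

Z2 = [[0] * N for _ in range(N)]        # P2: pretransformed loop nest
for i1 in range(2):
    for i2 in range(2):
        for i3 in range(3):
            for j in range(N):
                for k in range(N):
                    i = 6 * i1 + 3 * i2 + i3    # inverse map phi^-1
                    Z2[i][j] += X[i][k] * Y[k][j]

assert Z1 == Z2
print("P1 and P2 compute the same product")
```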

The above example mapped a 12 × 12 algorithm to a 2 × 2 array. The direct
mapping would have created an array algorithm with a single clock, while the
partitioned algorithm used three clocks. Each data wave has one distinct velocity

vector associated with each clock. These extra degrees of freedom in the data

movement allow the data to be channeled through the fixed size array. The array
algorithm would require a controller at the array boundary moving data on and off

the array and into off-array memory. The data velocities are on the order of 1/3 to

1/5 so that a small amount of memory would need to be distributed throughout the

array in order for the algorithm to execute. This mapping technique generated an

array algorithm suitable for execution on a fine grain systolic array.

4.3 Processor Domain Post-transformations

In the last section integer bijections were applied to the index set prior to
applying the algorithm transformation, T. The general formulation was T Φ_I : I -> J
so that J = T Φ_I I. In this section the integer transformation will be applied directly to the


time-processor vector, J. Now T : I -> J and Φ_J : J -> J', so that Φ_J T : I -> J'. The
transformation Φ_J does not affect the time portion of the time index vector, but only the

processor portion. Thus the map may be viewed as a q-dimensional identity trans-

formation augmented with an integer transformation that maps the m processor

dimensions into m' processor dimensions. The processor transformation portion of

the transformation may be applied directly to a processor coordinate vector to find

the new processor coordinates. This map has the form Φ_P : P -> P', so that P' = Φ_P P.

One use of Φ_J is to map m-dimensional arrays to (m+k)-dimensional
arrays. For example, it can be used to map a one dimensional array to

a two dimensional array if a two dimensional array is available to execute an

algorithm that operates on one-dimensional data such as a one-dimensional con-

volution. If a three or higher dimensional architecture were available, the transfor-

mation Φ_J would readily map a lower dimensional algorithm to that structure. Of

course, due to current technological constraints, systolic arrays of more than two

dimensions are rarely built. If such structures do become available in the future,

then these methods could be applied to map algorithms into them.

Another use of Φ_J is to partition an algorithm to execute on an architecture

with distributed memory and addressing units. The large virtual array is mapped

to a higher dimensional virtual array. The transform is selected so that two dimen-

sions of the transformed virtual array are the same size as the physical array. Next

each physical processor is assigned the set of virtual processors that are orthogonal

to the physical processor's coordinates. The mapping will give a way to arrange

data in the distributed memories as well as provide an addressing scheme to trans-

fer data between local memories and the PE's arithmetic units.

The position functions tell precisely how each data wave propagates through the

virtual arrays. The position functions of the transformed virtual array are given by



    P'y(viy) = Φ_P Py(viy) = Φ_P vy t + Φ_P δy viy

Thus the initial positions are given by P'y0 = Φ_P δy viy, and the velocities in the trans-
formed array are given by v'y = Φ_P vy. If the integer division algorithm is used to

compute the map, then movement along an axial direction will transform into a
lexicographical movement in the primed dimensions. If the Chinese remainder
theorem method is used, movement will be along diagonals in the transformed
virtual processor space with edge wrapping. The first flow is simpler to control so
that is the map that would most often be selected.
A simple use of the processor domain mapping is to get a one-dimensional
algorithm to execute on a two-dimensional array. Example 4.3 will illustrate how
the mapping forces data to flow on a higher dimensional array when the dimen-
sions are available. The other use of the processor map Φ_P is illustrated in Example
4.4. In this example an algorithm will be partitioned by restructuring the index set
so that virtual processor dimensions are created. The subsets of virtual processors
with the same physical processor coordinates will be mapped to the physical proc-
essor. Data movement in virtual dimensions will be carried via addressing in the
physical processor's local memory.
Example 4.3: Map the convolution algorithm described in the program below to a
3 x 4 array using the relation J' = Φ_J T I to map the algorithm. Select Φ_J by applying
the integer division algorithm and the Chinese remainder theorem methods. Ana-
lyze the data flow for each method.

P: for (i = 0 : 11)
       for (j = 0 : 22)
           Zi = Zi + Xj Yi-j

Solution: Here the indexing matrices are given by

    Fz = [1 0]   Fx = [0 1]   Fy = [1 -1]


To get the minimum time algorithm and satisfy the dependencies (dz = [0 1]^T,
dx = [1 0]^T, dy = [1 1]^T) we select π = [1 1]. To make the Z-variables compute in
place and be distributed one per processor, set vz = 0 and δz = 1. This makes the
Z-synthesis equation specify that S = Fz = [1 0]. Now that the transform is known, the
data velocities and distributions may be computed to be

    Sz = [0 1]   Sx = [1 -1]   Sy = [1/2 1/2]

The matrices above show that the algorithm mapped to a one dimensional array.
The array and x0's path through the array are depicted in figure 4.4. The PEs are
numbered, and zj is computed in processor j. The Y-variables are not shown; they
move in the same direction as the X-variables but at half the speed.
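The three parameter matrices can be reproduced mechanically from S'v = S Tv^-1; a small exact-arithmetic sketch (illustrative, with π = [1 1] and S = [1 0] as chosen above):

```python
from fractions import Fraction

# S_v = S * inv([pi; F_v]) for the 1-D convolution.
def inv2(m):
    (a, b), (c, d) = m
    det = Fraction(a * d - b * c)
    return [[d / det, -b / det], [-c / det, a / det]]

pi, S = [1, 1], [1, 0]
F = {'z': [1, 0], 'x': [0, 1], 'y': [1, -1]}
params = {}
for name, f in F.items():
    m = inv2([pi, f])   # inverse of the time-index matrix [pi; F_v]
    params[name] = [S[0] * m[0][j] + S[1] * m[1][j] for j in range(2)]
    print(name, [str(q) for q in params[name]])   # [velocity, distribution]
# → z ['0', '1'],  x ['1', '-1'],  y ['1/2', '1/2']
```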

Figure 4.4: 1-D Systolic Convolution Array with x0 Position Trace Shown

Two maps may be applied to the above array to map it to a 3 x 4 array. The
first map, based on the integer division algorithm, is given by P' = Φ_P P = (P'1, P'2)
= (P div 4, P mod 4). The new processor vector is thus P' = [P'1, P'2]^T. If this
mapping is used, the systolic array of figure 4.4 is transformed to the systolic array
of figure 4.5. The CRT map is P' = Φ_P P = (P'1, P'2) = (P mod 3, P mod 4). The
new processor vector is then given by P' = [P'1, P'2]^T. The map exists because 3
and 4 are relatively prime. The array and the x0 position trace are shown in figure
4.6. The data flow is now along the diagonal directions. This map requires more
I/O overhead by the array I/O controller and requires eight-neighborhood connec-
tions instead of four-neighborhood connections.
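The two processor maps can be tabulated to make the contrast in data movement concrete; a sketch over the twelve PEs of the linear array:

```python
# Division-algorithm map and CRT map from the 12-PE linear array to a 3 x 4 grid.
div_map = lambda p: (p // 4, p % 4)    # lexicographic flow
crt_map = lambda p: (p % 3, p % 4)     # diagonal flow with edge wrapping

print([div_map(p) for p in range(12)])
print([crt_map(p) for p in range(12)])
# Both are bijections onto the 3 x 4 processor set; the CRT map is one
# because 3 and 4 are relatively prime.
assert len({div_map(p) for p in range(12)}) == 12
assert len({crt_map(p) for p in range(12)}) == 12
```

Stepping p -> p+1 under div_map advances along a grid row and wraps to the next row (lexicographic movement); under crt_map both coordinates advance at once, giving the diagonal movement described above.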


Figure 4.5: 1-D Convolution on 2-D Array with x0 Position Trace Shown Using
Division Algorithm Mapping

Figure 4.6: 1-D Convolution on a 2-D Array with x0 Position Trace Shown Using
CRT Based Mapping.


Example 4.4: Map a 16 x 16 matrix multiply to a 4 x 4 array where each PE has
local memory and a memory addressing unit.

Solution: First map the algorithm to a 16 x 16 array with the outputs computed in

place. This is the same mapping used in example 4.1, so the indexing matrices are

the same, π = [1 1 1], S = Fz, and the parameter matrices are given by

    Sz = [ 0 1 0 ]   Sx = [ 0  1  0 ]   Sy = [ 1 -1 -1 ]
         [ 0 0 1 ]        [ 1 -1 -1 ]        [ 0  0  1 ]

Next select the processor transformation as

    P' = Φ_P P = [ P1       ]
                 [ P2 div 4 ]
                 [ P2 mod 4 ]

so that the data position functions become

    P'z = [ i       ]   P'x = [ i                 ]   P'y = [ t - k - j ]
          [ j div 4 ]         [ (t - i - k) div 4 ]         [ j div 4   ]
          [ j mod 4 ]         [ (t - i - k) mod 4 ]         [ j mod 4   ]



Figure 4.7: Dimensions of Transformed Virtual Array and Non-zero Data Ve-
locities of Example 4.4


The virtual array space may be viewed as shown in figure 4.7. If the virtual array

is to be mapped to a 4 x 4 physical array, then sixteen virtual processors must be

mapped to each physical processor. The assignment may be made by assigning
physical PE(P'2, P'3) the set of virtual processor nodes {[0 P'2 P'3]^T, [1 P'2 P'3]^T,
..., [15 P'2 P'3]^T}. This assignment causes the Z's to sit still, but an entire column

is assigned to each processor. Since the Y's move in a direction that is orthogonal

to the physical array, they do not move from processor to processor in the physical

array but reside on a single physical PE as the Z's do. Thus an entire column of

the Y-matrix is assigned to reside in memory on each PE. The columns of Z and Y
that reside on PE(P'2, P'3) are column number 4P'2 + P'3. Each PE will be required to

compute one column of sixteen output variables.


Figure 4.8: Unpartitioned Array's Initial Oth Row Data Alignment

So far the resources have been allocated to execute the algorithm, but still the

scheduling needs to be done. This amounts to specifying how the X-data will be

sequenced through the array and how the local Y- and Z-memories will be ad-

dressed. To schedule the virtual array, note that the first computation in the algo-

rithm is scheduled to execute in PE P = [0 0]^T at time t = 0. The X's will move in a
manner similar to that depicted in figure 4.5. The 0th row of the virtual array is

shown in figure 4.8. The Φ_P map will transform the bottom row of processors of


figure 4.8 to the 4 x 4 grid of processors shown in figure 4.9. Instead of waiting for
their X-data to pass from the bottom row of the array to the next rows in its data
position trace, the X-stream of data will be injected into all rows of the processor array

simultaneously. The memory pointers will be initialized as shown in figure 4.9,

and one output will be computed in each processor each sixteen cycles. The Y-

memory pointer will be incremented each clock cycle and wrapped modulo sixteen.

The Z-memory pointer will be used to address output data and will be incre-

mented every sixteen cycles. At the end of the array execution, each cell will

contain one column of the output matrix.
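The per-PE schedule just described can be sketched as follows. This is an illustrative simulation, not the dissertation's FRAP code; the per-PE pointer skew shown in figure 4.9 is simplified here to a common alignment, and the helper name pe_column is hypothetical:

```python
import random

# One PE of the partitioned 16 x 16 multiply: the PE holding column c keeps
# Y[:, c] and Z[:, c] in local memory, streams X in row by row, bumps its
# Z pointer every sixteen cycles, and wraps its Y pointer modulo sixteen.
n = 16
X = [[random.randint(-9, 9) for _ in range(n)] for _ in range(n)]
Y = [[random.randint(-9, 9) for _ in range(n)] for _ in range(n)]

def pe_column(c):
    y_local = [Y[k][c] for k in range(n)]   # resident Y column
    z_local = [0] * n                       # resident Z column
    y_ptr = 0
    for cycle in range(n * n):              # 256 multiply-accumulate cycles
        z_ptr = cycle // n                  # incremented every sixteen cycles
        x_in = X[z_ptr][cycle % n]          # streamed X element
        z_local[z_ptr] += x_in * y_local[y_ptr]
        y_ptr = (y_ptr + 1) % n             # wraps modulo sixteen
    return z_local

# Each PE ends up holding one column of the product X * Y.
for c in range(n):
    assert pe_column(c) == [sum(X[i][k] * Y[k][c] for k in range(n))
                            for i in range(n)]
```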


Figure 4.9: Physical Array with y-Memory and Initial y- and z-Memory Pointer
Values Displayed


Now consider the complexity of the algorithm. The serial algorithm takes 16*16*16 = 4096
cycles, while the unpartitioned array algorithm takes 46 cycles. The partitioned

algorithm takes 260 cycles. Both the partitioned and the unpartitioned algorithms
then need to have the results flushed from the array. The unpartitioned array has

dimensions 16 x 16 so that 256 processors are used. The serial algorithm takes
4096 cycles on one processor, which runs at 100% efficiency. The unpartitioned
algorithm completes 4096 operations on 256 processors, so each PE must complete
4096/256 = 16 computations in 46 cycles so that the PEs are operating at 16/46 =

34.8% efficiency. The partitioned array has dimensions 4 x 4 so that sixteen proc-
essors are used. Each processor must complete 4096/16 = 256 operations in 260

cycles so that the PEs are operating at 256/260 = 98.5% efficiency. So we see that
the algorithm took longer to execute on the partitioned array, but the PEs were put to

much better use.
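The operation counts behind these efficiency figures (16^3 = 4096 total multiply-accumulates, so 4096/256 = 16 per PE unpartitioned and 4096/16 = 256 per PE partitioned) check out as a two-line sketch:

```python
# PE efficiency = (operations per PE) / (cycles available).
ops = 16 * 16 * 16                                    # 4096 multiply-accumulates
print(ops // 256, round(100 * (ops / 256) / 46, 1))   # unpartitioned PEs
print(ops // 16, round(100 * (ops / 16) / 260, 1))    # partitioned PEs
# → 16 34.8
# → 256 98.5
```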

4.4 Array Synthesis Using Index Transformations

In this section integer bijections will be applied to variable indexing vectors.
This will relax the previous requirement that each of viy1, viy2, ..., viyr ∈ Z^m.

For example this will allow algorithms with three dimensional variable index vec-

tors to be mapped directly to two dimensional arrays. Also, if some of the variable

index vectors are of different dimensions, an array may now be synthesized using

the new transformation.

Let Φ_Y : viy -> v'iy, where a subcomponent of Φ_Y maps the elements of Zb1 x Zb2
to the elements of Zb1b2 using the inverse division algorithm linear map. That is,
the map Φ_Y maps from many dimensions to fewer, while the previous maps Φ_I, Φ_J
and Φ_P mapped from few dimensions to many. For example, to map the two
dimensional variable index vector of X ∈ R^(n x n) to a one-dimensional variable in-
dex vector, the map would be Φ_Y vix = [n 1] vix. Since the transform is linear, it may
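For instance, with n = 4 the map [n 1] sends (i, j) to 4i + j, a bijection from Z_4 x Z_4 onto Z_16 whose inverse is exactly the division-algorithm pair (div n, mod n); a quick sketch:

```python
n = 4
# phi_Y vix = [n 1] vix flattens (i, j) to n*i + j (row-major order).
flat = [n * i + j for i in range(n) for j in range(n)]
assert sorted(flat) == list(range(n * n))          # one-to-one and onto Z_16
assert all(divmod(n * i + j, n) == (i, j)          # inverse is (div n, mod n)
           for i in range(n) for j in range(n))
print(flat)
# → [0, 1, 2, ..., 15]
```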


be postmultiplied by the variable indexing matrix to make a new variable indexing
matrix given by F'y = Φ_Y Fy. The new time-index matrix is then given by

    T'y = [   π    ]
          [ Φ_Y Fy ]

By setting S'y = S(T'y)^-1, it can be seen that S'y = [ vy | δ'y ], where δ'y is the distribu-
tion matrix for Y using the transformed variable indices. The position function is
readily given in terms of time and the original variable indices as

    Py = vy t + δ'y Φ_Y viy

If I is factored from the above equation, the synthesis equation becomes

    S = S'y T'y = vy π + δ'y Φ_Y Fy

This form of the synthesis equations gives an extra degree of freedom to the
array designer. Now a transformation Φ_Y is placed before the variable index ma-
trix to change its dimensions. Two examples will illustrate how the method may be
applied. The first example uses the transformation to map a matrix multiply di-
rectly to a one-dimensional array. The second example will map a matrix-vector
multiply to a one-dimensional array. Without the variable index pretransforma-
tion, the matrix-vector multiply could not be mapped.

Example 4.5: Map a 4 x 4 matrix multiply to a one-dimensional systolic array by

directly synthesizing the algorithm using the extended synthesis equations.

Solution: In the matrix multiply the indexing matrices are given by

    Fz = [ 1 0 0 ]   Fx = [ 1 0 0 ]   Fy = [ 0 0 1 ]
         [ 0 1 0 ]        [ 0 0 1 ]        [ 0 1 0 ]

To make the algorithm map to a one dimensional array, the variable index vectors
must all map to one dimensional variable index vectors through the mapping Φ_Y.

Because both components of the variable index vectors for each variable range


from zero to three, the transformation Φ_Y = [4 1] may be used in all three synthesis
equations. The transformed variable index matrices are then given by

    F'z = [4 1 0]   F'x = [4 0 1]   F'y = [0 1 4]

Because I ∈ Z^3, the algorithm transformation must have dimensions T ∈ Z^(3x3). To
map to a one-dimensional array, S ∈ Z^(1x3), so that π ∈ Z^(2x3). With the trans-
formed indexing matrices being of dimension 1 x 3, the transformed time-index
matrices T'y will all have the proper dimensions T'y ∈ Z^(3x3). Still, π must be
selected so that the original program dependencies map to lexicographically posi-
tive temporal vectors, and so that each of the time-indexing matrices is invertible.
To achieve this and to achieve a minimum time array execution, select

    π = [ 1 1 0 ]
        [ 0 0 1 ]

Next the synthesis equation S'y T'y = S will be written in transposed form as
T'y^T S'y^T = S^T, or [ π^T | F'y^T ] S'y^T = S^T. This equation is in the form of a standard linear
system and may be readily solved. The three synthesis equations may be written

    Z: [ 1 0 4 ] [v1]   [S1]    X: [ 1 0 4 ] [v1]   [S1]    Y: [ 1 0 0 ] [v1]   [S1]
       [ 1 0 1 ] [v2] = [S2]       [ 1 0 0 ] [v2] = [S2]       [ 1 0 1 ] [v2] = [S2]
       [ 0 1 0 ] [δ ]   [S3]       [ 0 1 1 ] [δ ]   [S3]       [ 0 1 4 ] [δ ]   [S3]

We shall constrain only the z-variable's movement and distribution to be vz1 = 1/3,
vz2 = 1, and δ'z = -1/3. The z-equation then gives S = [-1 0 1]. Next the x- and
y-equations are solved to yield the full set of parameter matrices to be

    S'z = [1/3 1 -1/3]   S'x = [0 5/4 -1/4]   S'y = [-1 -3 1]

The processor array's size is determined by examining the processor set generated by
S : I^P -> P^m. Here P ∈ P^m (m=1) is in the range -3 ≤ P ≤ 3, so that a linear array of
seven processors is needed. The execution time of the algorithm is found by exam-
ining the time trace generated by the mapping π : I^P -> t^q (q=2). Here the compo-
nents of the temporal vector range as 0 ≤ t1 ≤ 6 and 0 ≤ t2 ≤ 3, so that a total of
7*4 = 28 cycles are required to execute the algorithm. There are 64 total computa-
tions to be performed on seven processors, so that at 100% efficiency the algorithm
would take 64/7 = 9.14 cycles to execute. The PE efficiency of the algorithm is then
9.14/28 = 32.7%. In comparison, the maximum PE efficiency on a two-
dimensional array is 40%, but a total of sixteen processors must be available and
extra communication paths are also needed.
Example 4.6: Map the matrix-vector multiply algorithm generated by the program
below to a one-dimensional array.

Program: for (i = 0 : 3)
             for (j = 0 : 3)
                 Zi = Zi + Xij Yj

Solution: First note that

    Fz = [1 0]   Fx = [ 1 0 ]   Fy = [0 1]
                      [ 0 1 ]

Also note that dz = [0 1]^T, dx = [0 0]^T and dy = [1 0]^T, so that π = [1 1] will satisfy the
dependence requirements of the algorithm. Next note that I ∈ Z^2, so that T ∈ Z^(2x2)
and S ∈ Z^(1x2). If the time-index matrices are constructed, Tz and Ty will be of the
proper dimensions, but Tx will not. Previously this would cause serious problems in
synthesizing the array, but now select Φ_Y = Φ_X = [4 1] so that F'x = Φ_X Fx = [4 1].
Now the array may be synthesized using the original Z- and Y-synthesis equa-
tions and the extended synthesis equation for the X-variables. The synthesis equa-
tions may again be written in their transposed form [ π^T | F^T ] S'^T = S^T, where
now Fz = F'z, Fy = F'y and Fx = [4 1]. The equations may then be written

    Z: [ 1 1 ] [vz]   [S1]    X: [ 1 4 ] [vx]   [S1]    Y: [ 1 0 ] [vy]   [S1]
       [ 1 0 ] [δz] = [S2]       [ 1 1 ] [δx] = [S2]       [ 1 1 ] [δy] = [S2]


Set vz = 1 and δz = -1 so that the Z-equation specifies S to be S = [0 1]. This causes
the algorithm to map to a one-dimensional array with four PEs and the data flow
given by the parameter matrices below.

    Sz = [1 -1]   S'x = [4/3 -1/3]   Sy = [0 1]
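The transposed synthesis systems of this example are only 2 x 2 and can be solved exactly; a sketch using the quantities above (π = [1 1], S = [0 1], and F'x = [4 1] from the pre-map Φ_X):

```python
from fractions import Fraction

def solve2(M, b):
    # Cramer's rule for a 2 x 2 system M [v, d]^T = b.
    det = Fraction(M[0][0] * M[1][1] - M[0][1] * M[1][0])
    return ((b[0] * M[1][1] - b[1] * M[0][1]) / det,
            (M[0][0] * b[1] - M[1][0] * b[0]) / det)

S = [0, 1]
F = {'z': [1, 0], 'x': [4, 1], 'y': [0, 1]}   # F'_x pre-mapped by phi_X = [4 1]
for name, f in F.items():
    v, d = solve2([[1, f[0]], [1, f[1]]], S)  # columns are pi^T and F_v^T
    print(name, str(v), str(d))               # velocity, distribution
# → z 1 -1
# → x 4/3 -1/3
# → y 0 1
```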

The above examples show how to directly synthesize algorithmically specified
arrays using the variable index bijection. These new transforms give more degrees
of freedom to the design process. The examples in this chapter were all simple and
designed to illustrate particular techniques. In a real problem a design may use
more than one of the transformations introduced in this chapter. For example,
suppose example 4.6 required the matrix-vector multiplication to be mapped to a
2 x 2 array. Then the map Φ_P : P -> [P div 2, P mod 2]^T could be applied to the proces-
sor vector synthesized in the example to generate the required array.


Numerical Linear Algebra (NLA) routines may be characterized in terms of

their loop and indexing structures. They normally consist of simple regular loops

with linear indexing performed on the variables. There are usually multiple expres-

sions that often occur at different levels of nesting in the loop structure. These

programs normally generate algorithms with non-uniform dependence graphs.

This chapter will focus on a method to design systolic arrays to implement such
programs.

The method for designing arrays for NLA routines will be a simple extension of

the previously developed methods. A program transformation will be applied to a

Multi-Level Multiple Expression Program (MLMEP) to convert it to a Multiple

Expression Nested Loop Program (MENLP). In the MENLP all expressions are

found at the deepest level of nesting. A matrix representation is developed for the

MENLP that allows the array synthesis methods of chapters three and four to be

applied. Some NLA programs transform to MENLPs where more than one expres-

sion appears at a given point in the index set. Such algorithms may also be

mapped if certain assumptions are made about the PEs. Namely they must be able

to execute small subroutines and have a limited amount of local storage.

5.1 Multiple Expression Nested Loop Programs (MENLPs)

The MENLP is a generalization of the SENLP where now different expressions

or lists of expressions may appear at different points in the program's index set.


Also, in an MENLP a given variable may be indexed by more than one indexing

matrix in a single expression. As was mentioned in chapter two, any program may

be written as a nested loop program provided conditional statements appear at the

deepest level of nesting. The transformation to the MENLP involves expanding the

index set so that a unique index vector exists for each expression or list of expres-

sions. Also conditional statements are inserted in the inner loop so that the opera-

tions will be executed in the proper order. The MENLP is defined below.

Definition 5.1: A Multiple Expression Nested Loop Program (MENLP) is a pro-
gram P = (I^P, Ŷ, D^P, E) with the following conditions:

1) The index set is lexicographically ordered and is generated by an
n-level nested loop.

2) E(I) is an expression or list of expressions that must be evaluated at
the point I ∈ I^P.

3) There exist global variables y ∈ Ŷ such that y ∈ R^m and m ≤ n.
These variables are indexed using indexing matrices Fiy : I -> viy,
viy ∈ Z^m, where y[viy] is an individual element of y. Local
variables may also appear in E(I) that are dependent only on
other variables referenced in E(I).
In the above definition a distinction was made between global and local vari-

ables. The global variables are the matrix variables processed by the program

while the local variables are intermediate variables used at the expression node

E(I). This causes the placement of E(I) in the program dependence graph to be

determined solely by the global variables referenced in E(I).

Using the definition above, a matrix representation may be forwarded for an

MENLP. Because all dependencies in the MENLP are determined by the global

variables that are indexed via linear or affine indexing matrices, the global de-

pendence structure of the program may be easily determined. The indexing matri-


ces determine the dependence vectors, while the boundary matrices determine the

index set. The expression function determines which indexing matrices are used

and for which variables at a given point I E P. This leads to the matrix-based

representation for an MENLP given below.

    P = (Θ1, Θ2, F1, F2, ..., Fr, E)

The MENLP is then executed using

    for (I = Θ1 : Θ2) E(I)

where Θ1 and Θ2 are the boundary matrices as defined in chapter three. The
matrices F1, F2, ..., Fr comprise the set of indexing matrices that are used to refer-
ence the global variables. The expression function evaluates to the expression E(I)
at the index point I. The expression E(I) contains global and local variables, the
variable indexing transformations, and the arithmetic operation description.

Normally an NLA program will be simpler than may be implied by the above
model. The expression function will usually evaluate to between one and five
expression lists for an arbitrarily sized index set. Also, the expression lists will
normally contain only one expression with no local variables. Occasionally, however,
an expression list with several expressions will be needed at selected points in the
index set. Sections 5.3 and 5.4 will illustrate what typical NLA programs look like
in MENLP form.


5.2 NLA Mapping Methodology

NLA algorithmically specified arrays may be designed using a systematic ap-

proach. First the NLA program is converted to an MENLP and then the algorithm is mapped

using a similar approach to the one used to map SENLPs. The first step is to
translate the MLMEP to an MENLP. For example, consider the LU-decomposition

which may be written


for (k = 1 : n-1)
    for (i = k+1 : n)
        aik = aik / akk
        for (j = k+1 : n)
            aij = aij - aik*akj
The above loop has one expression at the second level of nesting and another at

the third, making it an MLMEP. The MLMEP is translated to an MENLP by writing

an equivalent nested loop program with conditional statements at the deepest level

of nesting. Consider the program below.

for (k = 1 : n-1)
    for (i = k+1 : n)
        for (j = k : n)
            if (j = k) aik = aik / akk
            else aij = aij - aik*akj
The above program is an MENLP and may be verified to be input-output
equivalent to the previous MLMEP. Note that in the MENLP the j-index ranges

from k to n where in the MLMEP j ranges from k+1 to n. This creates index

vectors in the MENLP for the computations that occurred at the second level of

nesting in the MLMEP. In the MLMEP these computations could not be assigned

index vectors because the j-index was not defined at the second level of nesting.

With the above modification to the index set, each computation in the algorithm

generated by the above program will be assigned a unique index vector.
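The claimed input-output equivalence of the two forms is easy to spot-check numerically; a sketch (loop indices shifted to 0-based, and a diagonally dominant test matrix so that no pivoting is needed):

```python
import random

# MLMEP form: the division sits at the second level of nesting.
def lu_mlmep(a, n):
    for k in range(n - 1):
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]
            for j in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]
    return a

# MENLP form: the division is guarded by the (j = k) conditional.
def lu_menlp(a, n):
    for k in range(n - 1):
        for i in range(k + 1, n):
            for j in range(k, n):           # j now starts at k
                if j == k:
                    a[i][k] /= a[k][k]      # the level-two expression
                else:
                    a[i][j] -= a[i][k] * a[k][j]
    return a

n = 5
m = [[random.uniform(1, 2) + (4.0 if i == j else 0.0) for j in range(n)]
     for i in range(n)]
# Identical operation sequences give bit-identical results.
assert lu_mlmep([r[:] for r in m], n) == lu_menlp([r[:] for r in m], n)
```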

In a serial computer the above MLMEP is more efficient than the MENLP. If

the MENLP were executed on a serial computer the condition (j=k) would need to

be tested once for each computation that occurred, causing a large overhead. This

is of no concern because the MENLP is only an intermediate form to represent an

algorithm and the conditional statements will never be executed at run time.

Rather, the conditional statement is a compiler directive of sorts. It tells the com-

piler what operation a given processor must compute at a time t in the array.


The set of all indexing matrices and the way they are used in the MENLP
dictate the structure of the computational dependence graph. The only variable in
the program is a. Yet this variable is indexed in four different ways, and thus the
MENLP has four indexing matrices. The indexing matrices F1, F2, F3 and F4 are
defined for aij, aik, akj and akk respectively. With the index vector I = [k i j]^T, the
indexing matrices are given by

    F1 = [ 0 1 0 ]   F2 = [ 0 1 0 ]   F3 = [ 1 0 0 ]   F4 = [ 1 0 0 ]
         [ 0 0 1 ]        [ 1 0 0 ]        [ 0 0 1 ]        [ 1 0 0 ]

The fact that the rank of F4 is one creates a problem in computing the velocity and
distribution vectors. The velocity and distribution vectors are determined by com-
puting the parameter matrices as was done in chapters three and four. A time-in-
dex matrix is formed by augmenting the time transform with each indexing matrix.

The kth parameter matrix is then given by Sk = S Tk^-1. This clearly causes a problem
in computing the parameter matrix S4. The time index matrix T4 will be singular
so that the inverse will not exist. This problem may be alleviated by altering the
MENLP further. The condition for the expression referencing akk to be executed is
that k equals j. Thus the program may be rewritten once more as
for (k = 1 : n-1)
    for (i = k+1 : n)
        for (j = k : n)
            if (j = k) aik = aik / akj
            else aij = aij - aik*akj

where now only the three nonsingular indexing matrices are used. This technique
will often need to be applied to the indexing matrices that come from expressions
that are moved to a deeper level of nesting in the program transformation.
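The singularity that forces this rewrite is visible with a few lines of arithmetic. The row chosen for π below is just an illustrative stand-in, not the dissertation's actual time transform:

```python
# With I = [k i j]^T, the akk reference uses F4 = [[1,0,0],[1,0,0]], whose two
# rows are identical (rank one). Any time-index matrix T4 = [pi; F4] then has
# two equal rows, so det(T4) = 0 for every pi and S4 = S * inv(T4) cannot be
# formed. Replacing akk by akj (valid exactly when j = k) uses full-rank F3.
def det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
          - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
          + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

F3 = [[1, 0, 0], [0, 0, 1]]   # akj
F4 = [[1, 0, 0], [1, 0, 0]]   # akk
pi = [1, 1, 1]                # hypothetical time transform row
assert det3([pi] + F4) == 0   # singular: parameter matrix undefined
assert det3([pi] + F3) != 0   # invertible for this choice of pi
```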
With the above modification, all velocity and distribution vectors may be com-
puted, but the meaning of these parameters needs to be interpreted. There is one
variable in the algorithm and there are three sets of velocity and distribution vec-
