SYSTOLIC AND BUS CONNECTED ARRAYS 
ARCHITECTURES, ALGORITHMS AND ALGORITHM TRANSFORMATIONS
BY
ERIC M. DOWLING
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
1989
ACKNOWLEDGMENTS
First and foremost I would like to thank Dr. Fred J. Taylor. He was able to
guide me technically and show me the way in the academic world. He was an
inspiration to me and remains a role model to follow in the years to come. He
provided me with the support and resources I needed to get to where I am today.
I also thank my committee members, Dr. L.W. Couch, Dr. J.C. Principe, Dr. H.
Lam, and Dr. K. Sigmon. Each of these professors provided me with guidance,
advice, and high-quality instruction in the classroom.
I would like to express my respect and thanks to the members of the HSDAL.
They all were great to work with and full of energy and good ideas. Special thanks
goes to Dr. Mike Sousa for the work we did together in the systolic array algorithm
mapping area and the leadership he provided in the area of mathematics.
During the entire time that I worked on my Ph.D. I was funded and supported
by the Army Research Office under an ARO Fellowship. They provided me with the
necessities I needed to stick it out for over three years in pursuit of this degree. In
addition, they provided me with the means to get to conferences so that I could
meet my colleagues nationwide. For this I am deeply grateful.
I would also like to take this opportunity to thank my parents for their support
and for providing me with strong role models. It was natural for me to seek
graduate-level education since my mother has a Ph.D. and my father has a law degree.
Finally I would like to thank my lovely girlfriend, Joan. She made my Ph.D.
years both exciting and enjoyable.
TABLE OF CONTENTS
ACKNOWLEDGMENTS .................................. ii
LIST OF FIGURES .................................... v
ABSTRACT ......................................... vii
CHAPTERS
ONE INTRODUCTION ...................................... 1
1.1 Array Processor Fundamentals ...................... 1
1.2 Scope of Dissertation ............................. 3
1.3 Literature Review ................................. 6
1.4 Outline and Summary ............................... 12
TWO PROGRAM AND ALGORITHM MODELS ...................... 15
2.1 Programs and Algorithms ........................... 16
2.2 Program and Algorithm Dependence Structures ....... 22
2.3 Program and Algorithm Equivalences ................ 29
THREE ARRAYS FOR UNIFORM RECURRENT ALGORITHMS ......... 33
3.1 Uniform Recurrence Programs and Algorithms ........ 34
3.2 Matrix Based URP and URA Representation and Analysis ... 38
3.3 Transform Domain Analysis ......................... 44
3.4 Algorithmically Specified Array Synthesis ......... 53
FOUR MAPPING TO FIXED ARCHITECTURES ................... 62
4.1 Partitioning Fundamentals ......................... 63
4.2 Integer Based Index Set Pretransformations ........ 70
4.3 Processor Domain Posttransformations .............. 75
4.4 Array Synthesis Using Index Transformations ....... 83
FIVE NUMERICAL LINEAR ALGEBRAIC ARRAYS ................ 88
5.1 Multiple Expression Nested Loop Programs (MENLPs) ... 88
5.2 NLA Mapping Methodology ........................... 90
5.3 Design of Cholesky Decomposition Arrays ........... 94
5.4 Design of QR Decomposition Arrays ................. 103
SIX BUS CONNECTED ARCHITECTURES: THEORY & PRACTICE .... 112
6.1 The Bus Connected Architecture .................... 113
6.2 Design of URA Implementations for BCA Arrays ...... 118
6.3 The HSDAL/Texas Instruments FPAP Architecture ..... 125
6.4 The HSDAL/Texas Instruments FPAP Software ......... 131
SEVEN CONCLUSION ...................................... 136
7.1 Summary ........................................... 136
7.2 Conclusions ....................................... 138
7.3 Future Research Potentials ........................ 140
BIBLIOGRAPHY .......................................... 142
APPENDIX A FPAP ASSEMBLY LANGUAGE DEFINITION FILE ..... 146
APPENDIX B FPAP ASSEMBLY LANGUAGE 3X3 MATRIX-MATRIX
           MULTIPLY PROGRAM ........................... 150
APPENDIX C FPAP C LANGUAGE PIN LEVEL SIMULATOR ........ 152
BIOGRAPHICAL SKETCH ................................... 173
LIST OF FIGURES
FIGURE                                                           PAGE
1.1 Systolic Interconnection Structure and Embedded Topologies .... 4
1.2 The Bus Connected Architecture ................................ 5
2.1 Graphical View of Simple Algorithm with a 1D Index Set ........ 18
2.2 The Lexicographic Execution Ordering of Program P3 ............ 21
2.3 Simple Two Statement Algorithm Graph with Dependence
    and Variable Usage Information Displayed ...................... 25
2.4 Nonlocalized and Localized Algorithm Dependence Graphs ........ 27
2.5 Localized Matrix Multiply Dependence Structure ................ 29
3.1 Uniform and Nonuniform Dependence Graphs ...................... 33
3.2 Upper Triangular Matrix Multiply Dependence Graph ............. 42
3.3 Data Distributions for Algorithm with 6y=I2 ................... 49
3.4 Position Trace of y00 Through the Processor Array ............. 50
3.5 The Systolic Algorithm of Example 3.1 with Velocity and
    Distribution Vectors Displayed ................................ 58
3.6 2D Convolution Algorithm Designed in Example 3.2 .............. 61
4.1 In-Place Partitioning Method .................................. 66
4.2 In-Place Algorithm with Partitioning Lines .................... 68
4.3 The Partitioned Algorithm as Four Sub-Arrays .................. 69
4.4 1D Systolic Convolution Array with x0 Position Trace Shown .... 78
4.5 1D Convolution on a 2D Array with x0 Position Trace Shown
    Using Division Algorithm Mapping .............................. 79
4.6 1D Convolution on a 2D Array with x0 Position Trace Shown
    Using CRT Based Mapping ....................................... 79
4.7 Dimensions of Transformed Virtual Array and Nonzero
    Data Velocities of Example 4.4 ................................ 80
4.8 Unpartitioned Array's Initial 0th Row Data Alignment .......... 81
4.9 Physical Array with y Memory and Initial y and z Memory
    Pointer Values Displayed ...................................... 82
5.1 Orderings for Computing the 4x4 Cholesky Decomposition ........ 96
5.2 The Cholesky Decomposition Physical Array Processor ........... 100
5.3 The 4x4 Cholesky Decomposition Array Algorithm ................ 101
5.4 Triangularization of a Matrix Using Givens Rotations .......... 105
5.5 The QR Decomposition Physical Array Processor ................. 109
5.6 The 4x3 QR Decomposition Array Algorithm ...................... 111
6.1 The General Bus Connected Architecture (BCA) .................. 114
6.2 A Simple PE Architecture Ideal for Use in a BCA Array ......... 115
6.3 4x4 Matrix Multiply Algorithm for a BCA ....................... 118
6.4 A Single Output Determines the Input Sequences of Two
    Busses ........................................................ 121
6.5 A 9x1 BCA for Concurrently Computing Nine FIR Outputs ......... 124
6.6 A 3x3 BCA for Concurrently Computing Nine FIR Outputs ......... 125
6.7 Basic SN74ACT8847 PE Architecture ............................. 127
6.8 FPAP Architecture with Common Bus and IBM PC-AT Interface ..... 129
6.9 Microinstruction Format Used to Control FPAP Architecture ..... 130
6.10 FPAP Control Structure and Host Interface .................... 131
6.11 Programming Models for Data Routing Commands ................. 133
Abstract of Dissertation Presented to the Graduate School of the University of
Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of
Philosophy
SYSTOLIC AND BUS CONNECTED ARRAYS 
ARCHITECTURES, ALGORITHMS, AND ALGORITHM TRANSFORMATIONS
By
ERIC M. DOWLING
May 1989
Chairman: Dr. Fred J. Taylor
Major Department: Electrical Engineering
Systolic and bus connected arrays have recently been shown to offer a
significant throughput advantage over serial architectures for many digital
signal, image, and matrix processing algorithms. While array structures are well
suited for the parallel execution of computationally intensive algorithms that
operate on large data sets, they are difficult to conceptualize, design, and
program. This dissertation focuses on the problem of automatically generating an
algorithmically specified array design from a serial program description.
Program loop structures are cast into index set based program and algorithm
models. The models are subsequently transformed to array structures. Matrix
based analysis tools are derived to manipulate and transform the models. Integer
transformations are introduced in order to pretransform or posttransform the
algorithm so that it can be mapped into fixed sized or fixed dimensioned arrays.
A set of design equations is derived that may be used to solve for an algorithm
transformation that will generate an algorithmically specified array with
desired data flow characteristics. The method is showcased by examples in which
several arrays are designed, including arrays for Cholesky decompositions, QR
decompositions, matrix multiplies, and convolutions. Also, a structured method
is presented to generate bus connected array implementations of algorithms based
on the dependence graph of the serial algorithm. Finally, the design and
implementation of an experimental programmable bus connected array are
presented.
CHAPTER ONE
INTRODUCTION
1.1 Array Processor Fundamentals
An array processor consists of a regular lattice of interconnected processor
elements and attendant memory. Control over the processors in the array may
come from a central controller (Single Instruction, Multiple Data, or SIMD), or
each processor may execute its own instruction stream (Multiple Instruction,
Multiple Data, or MIMD). The complexity of the processor elements (PEs) may vary
from simple single-bit ALUs to complex instruction set processors with word
parallel fixed or floating point arithmetic.
Array processors present an alternative computing structure for the high speed
execution of certain types of algorithms. They have been applied to image
processing, mathematical modeling of physical processes, digital signal
processing (DSP), and numerical linear algebra (NLA). All of these applications
have several common attributes. First, they are characterized by regularly
structured algorithms. Second, they all operate on n-dimensional data sets,
where n is at least one. Third, they often run in real time. The high complexity
and real time requirements provide the motivation for seeking faster computing
structures for executing these algorithms. The regularly structured algorithms
acting on n-dimensional data sets give rise to the regular n-dimensional lattice
structures of the arrays. Some arrays are constructed specifically for one
algorithm (algorithmically specified arrays). The second type of array is a
programmable array. Programmable arrays are capable of implementing a wide
class of algorithms.
One encounters many problems when implementing an algorithm on an array
processor in hardware. First, most problems require a large number of
processors to execute a given algorithm on a given data set without
partitioning. Another problem is the hardware complexity of an interconnection
topology. If word parallel data paths are used, the number of connections may
grow too fast for practical implementations, and as a result many practical
designs are bit serial [Fou88]. Array control poses another physical problem;
the simplest control structure to implement is SIMD. Another problem concerns
array level I/O bandwidth. If data arrive from a single source and are passed to
the array boundary of n PEs, and word parallel transactions are occurring, then
n words must be read from memory in order to service a single cycle of the
array. This means that the memory structure must operate n times faster than the
PEs. Again, this problem caused many designs to be bit serial so that a single
m-bit memory read could service m boundary processors in a single cycle.
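The bandwidth argument above is easy to quantify. The following sketch
(function names and example figures are illustrative, not from the
dissertation) captures both cases:

```python
def required_memory_word_rate(pe_rate_hz, boundary_pes):
    """Word-parallel case: each array cycle consumes one word per
    boundary PE, so memory must supply boundary_pes words per PE cycle."""
    return pe_rate_hz * boundary_pes

def bit_serial_pes_serviced(memory_word_bits):
    """Bit-serial case: one m-bit memory read yields one bit for each of
    m boundary processors in a single cycle."""
    return memory_word_bits

# For example, a boundary of 16 word-parallel PEs clocked at 10 MHz
# would demand a 160 MHz word rate from memory, while a 32-bit memory
# word can service 32 bit-serial boundary PEs per read.
word_rate = required_memory_word_rate(10_000_000, 16)
serviced = bit_serial_pes_serviced(32)
```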
VLSI technology has had, and is having, an impact on array processor designs.
Now multiple PEs may be packaged on a single chip, making these systems more
practical to build. More complex PEs that once required whole boards to
implement may now be implemented on a single chip. Still, VLSI technology
imposes constraints on designs. The cost of communications in VLSI is very high:
communication in VLSI circuits takes up approximately 90% of the total real
estate of a chip [Sei84], and the I/O pads and drivers for interchip
communication may take 50% of the power [Sei84]. Also, if many chips are to be
connected together, the interprocessor communication cost at the board level
will be exceedingly high, especially if word wide paths are used.
Besides the impact of VLSI on array processor technology, advances have also
been made in array processor algorithms. Mathematicians have rewritten many
algorithms for parallel implementation, going back to Givens rotations and
Jacobi iterations over the methods preferred on serial machines (Householder
transformations) [Gol83]. Researchers in the fields of image and signal
processing have developed array algorithms for many of their applications.
Also, a new science is emerging that considers the problem of geometrically or
mathematically representing an algorithm and then transforming it to an
equivalent array processor algorithm using linear and/or affine
transformations. These aspects of advancement will be covered in detail in the
literature review in Section 1.3.
1.2 Scope of Dissertation
This dissertation research focuses on systolic and bus connected array
processors and algorithms. More specifically, it studies systematic
mathematical approaches to designing algorithms for these architectures. In
considering these architectures, several implicit assumptions are made. The
systolic architecture is assumed to have nearest neighbor connections, as
depicted in Figure 1.1a for the two dimensional case. Figures 1.1b, 1.1c, and
1.1d show how linear, mesh, and hexagonal interconnection topologies may be
implemented with the connections of Figure 1.1a. Also, each processor may
operate autonomously in a word parallel mode and is able to compute whatever
operations (sqrt, *, +, -, /, ...) are required. In some cases the PEs may be
required to make a decision. The Bus Connected Architecture (BCA) is assumed to
be configured as depicted in Figure 1.2, where all paths are also assumed to be
word parallel. In the BCA, a SIMD control strategy is assumed, with the addition
of a halt line that allows given PEs to ignore instructions and an output enable
that allows selected PEs to write to busses individually.
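As a toy illustration of this control strategy (all names are hypothetical;
the actual FPAP control structure is described in chapter six), one SIMD step
with per-PE halt flags and a single output enable might be modeled as:

```python
def simd_step(pe_values, instruction, halted, output_enable):
    """Apply one broadcast instruction to every non-halted PE, then let
    the single output-enabled PE drive the shared bus."""
    new_values = [v if halted[i] else instruction(v)
                  for i, v in enumerate(pe_values)]
    bus = new_values[output_enable]  # only the selected PE writes the bus
    return new_values, bus

values, bus = simd_step([1, 2, 3, 4],
                        lambda v: v + 10,
                        halted=[False, True, False, False],
                        output_enable=2)
# PE 1 ignores the broadcast instruction; PE 2 drives the bus.
```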
Figure 1.1: Systolic Interconnection Structure and Embedded Topologies
1.1a: The Interconnection Network; 1.1b: Linear Topology; 1.1c: Mesh Topology;
1.1d: Hexagonal Topology
The central focus of the dissertation is to develop methods for the design of
systolic and BCA algorithms. To this end, known serial algorithms are cast into
a mathematical framework that allows them to be manipulated and transformed
into equivalent parallel algorithms. The mathematical model is developed for a
subclass of algorithms that consist of a single statement embedded in an
n-level nested loop. The statement operates on m-dimensional data sets, where
indexing into the data is done via a linear combination of the loop indices.
The method is developed for the systolic array case and is later used to help
design BCA algorithms. Next, the model is extended to include more diversified
algorithms such as those that arise in the fields of image processing and
numerical linear algebra. The central objective is to provide a methodology
which one can apply to design array algorithms, or that one could program into a
CAD tool to convert known serial algorithms to parallel algorithms
automatically.
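A concrete instance of this algorithm class, sketched here purely for
illustration, is matrix multiplication: a single statement inside a 3-level
nested loop, where every array index is a linear combination of the loop
indices:

```python
# 3x3 matrix multiply written in the single-statement nested-loop form
# treated by the mapping method; N plays the role of the loop bound.
N = 3
a = [[1, 0, 2], [0, 1, 0], [3, 0, 1]]
b = [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
c = [[0] * N for _ in range(N)]
for i in range(N):
    for j in range(N):
        for k in range(N):
            # the single statement; its indices (i,j), (i,k), (k,j)
            # are linear in the loop index vector (i, j, k)
            c[i][j] = c[i][j] + a[i][k] * b[k][j]
```

It is this regular indexing structure that lets the loop be modeled over an
integer index set and transformed into an array algorithm.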
Figure 1.2 : The Bus Connected Architecture (BCA)
Another focus of the dissertation is the design, implementation, and
programming of an experimental attached array processor system. The system,
called the Floating Point Array Processor (FPAP), is under development as an
experimental machine for image and signal processing as well as matrix
computations. The machine has a modified bus connected architecture (MBCA),
SIMD control, and 32-bit data paths. This machine will be examined as an
experimental endeavor that keeps a focus on the practical and technological
issues that are involved in implementing an actual system.
1.3 Literature Review
As mentioned in Section 1.1, research trends in array processor design and
analysis have taken several directions. Some of these directions are the
building of models for parallel computation, the rewriting of algorithms so
that they may be more efficiently executed on parallel machines, the algebraic
and/or geometric representation of algorithms, and the transformation of these
representations to equivalent parallel implementations. Another research trend
has been to study the characteristics an architecture must possess in order to
have an efficient VLSI implementation. Hence the study of VLSI architectures
and VLSI algorithms was introduced. Researchers studied algorithms and how to
alter them to run efficiently on VLSI array processors. Another area is the
design and analysis of experimental machines, user programming environments,
and the like. This literature review presents the highlights of these research
trends along with related reference information.
Early work in the area of algorithm mapping was performed in 1967 by Karp,
Miller, and Winograd [Kar67] at the IBM Watson Research Center. They considered
systems of uniform recurrence equations and their implementation on theoretical
parallel computers. This work introduced the concept of computations occurring
at points in an integer space. The uniform recurrence equations were defined in
such a way that they were guaranteed to have uniform dependencies throughout
the index set. They also viewed the algorithm in terms of a dependence graph
and looked at algebraic conversions of the graph to perform parallel scheduling
of computations.
In 1969, Karp and Miller produced a formal mathematical model for parallel
computation [Kar69]. An important concept introduced in this paper was that the
sequencing of computations was considered separately from the actual
computations themselves. They also used the concept of algorithm equivalence.
Although they did not present a method to map from one schema to an equivalent
maximally parallel form, they did propose that future work be done to this end.
Around the same time, an aggressive research effort was proceeding at the
University of Illinois. The ILLIAC III array processor project spanned the
1960s, and the ILLIAC IV was under development in 1967 and eventually built
[Fou88]. Research concentrated on both hardware and software aspects. Kuck was
a key figure who worked on transforming programs into forms that could be
executed more efficiently in parallel [Kuc72], [Ban79]. The research focused on
ways to rewrite statements and loops into equivalent forms that were better
suited for parallel execution. Several rules were introduced to transform
programs to equivalent enhanced forms. In [Kuc72] the concept of dependence, as
an arc from the output of one computation to the input of another, was stated.
The central theme involved the reordering of computations in advantageous ways
that preserved the dependence structure of the original program.
Lamport [Lam74] was designing an optimizing compiler for the ILLIAC IV and
looked at methods to parallelize FORTRAN DO loops. He derived two methods for
converting the loops to explicitly parallel forms. The first, called the
hyperplane method, was applicable to SIMD and MIMD machines. The second, called
the coordinate method, was applicable only to SIMD structures. This work would
influence others in the development of methods to map nested loops to VLSI
arrays in years to come.
Another researcher at the University of Illinois was Kuhn [Kuh80], who extended
some of Kuck's ideas on program transformations [For84]. He developed methods
to transform the index set of an algorithm into an alternate index set that
produced an equivalent algorithm preserving the dependencies of the original
algorithm. Using his method he was able to map several algorithms to SIMD
structures, and he showed how the index set could be transformed to directly
yield VLSI array algorithms. His method involved interpreting the transformed
index set as time and processor dimensions.
Moldavan [Mol82], [Mol83] extended and formalized the work of Kuhn. He looked
at the problem of algorithm transformation as a reindexing of the index set. He
provided necessary and sufficient conditions for a transformation to exist,
conditions which were later found to be flawed [Sou87] and corrected. He cast
the problem in matrix form and defined a matrix whose column vectors were the
algorithm's dependence vectors. He used a linear transformation to transform
the algorithm's dependencies to a new coordinate system, whose coordinates
represented time and processors. A valid transformation was an integer
bijection matrix that forced the dependence vectors to be projected to positive
time coordinates. The transform was partitioned into a time and a space
transform, the details of which will be discussed more thoroughly in chapter
two.
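The validity conditions just described are easy to check mechanically. The
sketch below (dependence vectors and candidate transform chosen for
illustration; chapter three derives them properly) verifies that a transform
for the localized matrix multiply is an integer bijection and projects every
dependence vector to positive time:

```python
# Dependence vectors (columns of the dependence matrix) in (i, j, k)
# index space for the localized matrix multiply: c propagates along k,
# a along j, b along i.
D = [(0, 0, 1), (0, 1, 0), (1, 0, 0)]

# Candidate transform T: first row is the time transform pi; the
# remaining rows form the space transform S.
T = [(1, 1, 1),
     (0, 1, 0),
     (0, 0, 1)]

def det3(m):
    # determinant of a 3x3 integer matrix by cofactor expansion
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

pi = T[0]
is_bijection = abs(det3(T)) == 1                 # integer bijection test
positive_time = all(sum(p * x for p, x in zip(pi, d)) > 0 for d in D)
```

Here both conditions hold, so T is a valid transformation in the sense above.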
A main difference between Moldavan's work and previous work was the type of
parallel architecture being considered. Kuck and the others were considering
general parallel machines that were assumed to act as a set of uniprocessors
that could interact with memory without conflict. Moldavan was considering
systolic architectures, which are practical for VLSI implementation. He brought
the theoretical work in parallelism extraction and algorithm transformations
together with the newly introduced systolic architecture.
In 1978, Kung and Leiserson introduced the concept of systolic arrays [Kun78],
[Mea80], [Ku82] and developed some of the first systolic algorithms. They also
developed graph based methods to assist in the design of systolic algorithms.
In [Lei81] the systolic conversion lemma is presented along with related
corollaries. The net result was a method to convert an algorithm flow graph
into an equivalent systolic graph by adding delay elements in a specified way;
using this technique, broadcasts are eliminated. In [Kun84] the cut theorem was
introduced. This provided a method to further manipulate dependence graphs by
splitting them into subgraphs and inserting delays between them. Again, the key
idea was to provide a tool to manipulate the graphs into equivalent systolic
forms.
Kung, at Carnegie-Mellon University, stressed a number of practical hardware
issues concerning systolic arrays, and Leiserson [Lei81] studied VLSI layout
problems. Kung et al. designed a VLSI chip called the Programmable Systolic
Chip (PSC) [Kun85b] that would act as a systolic processing node. The chip was
designed to act as a simple programmable processing node that could be put into
board level systolic arrays. Their experience with the project was unfavorable,
and they modified their concept of a systolic array from very fine grain to
more coarse grain. Using the modified concept, they produced a
minisupercomputer called the WARP [Ann87] machine under a DARPA contract with
General Electric as their industrial partner. In this design, a single systolic
node occupied a one square foot board. The machine was configured as a linear
array of ten cells. The project was successful, and the machine was primarily
applied to computer vision. A strong effort went into the design and
implementation of W2, a parallel language for the WARP [Gro87]. System software
was also written for a SUN workstation based WARP network. A command shell was
implemented in LISP [Bru87] to form an elegant user environment. In 1990 the
iWarp, a single chip VLSI version of the systolic node found in the WARP, will
become available from Intel Corp. The introduction of the iWarp should
influence many systolic machines to be designed in the future.
Another key researcher in the field of VLSI array processors is S.Y. Kung. He
has recently published a textbook [Ku88] in which several aspects of array
processor technology are described, including algorithm mapping techniques,
systolic and wavefront architectures, and hardware and software issues. He
devised a variation of the systolic array called the wavefront array [Ku85a]. A
wavefront array is essentially a systolic array with a data driven,
asynchronous control strategy. Some algorithms are better suited to this type
of structure, and some problems, such as clock skewing, can be alleviated.
According to Kung [Kun88], he also developed array processor hardware and
software. A machine called the WAP (Wavefront Array Processor) was originally
constructed as an 8x8 system [Kun82b]. He developed a language called MDFL
(Matrix Data Flow Language) and a simulator to execute the language in order to
assist in the hardware design.
With the introduction of the systolic and wavefront architectures, algorithm
mapping research tended to concentrate heavily on these architectures,
especially the systolic array. Since the wavefront array is just a variation of
the systolic array, the techniques for mapping algorithms to systolic arrays
apply to wavefront arrays as well. A review of these methods was compiled by
Fortes and Wah [For85]. Most of these methods were either graph based [Kun88],
geometrically based [Cap83], or algebraic [Mol82], [For84]. One method is based
on a set of equations derived in "the theorem of systolic processing" [Li85].
These equations serve as constraining equations for an optimization procedure
that can find a systolic array that optimizes a time-processor cost function.
The method was interesting, and several parameters of systolic array
algorithms, such as data velocities and distributions, were formally defined.
The problem with the method was that the theorem of systolic processing did not
govern all algorithms but just a subclass of uniform recurrences in two
dimensions. Therefore the method cannot be extended to algorithms that deviate
from the scope of the theorem, which is quite limited. The method could be
extended if new "theorems of systolic processing" were derived for algorithm
classes of interest.
Fortes has made significant contributions to the algorithm transformation
field. He studied algorithm mapping theory under Moldavan and extended and
further developed his method [For84]. He developed methods to optimize the time
transform and methods to select a space transform. With Moldavan, he developed
a partitioning technique based on the transformation method [Mol86]. He also
studied fault tolerance and alternate transformation methods that preserved
various types of algorithm equivalence.
Due to the research in systolic architectures and algorithms, several systolic
and wavefront machines have been constructed, such as the SLAPP (Systolic
Linear Algebra Parallel Processor) [Dra87], which was built at the Naval Ocean
Systems Center for numerical linear algebraic algorithms. A systolic array
called the Matrix-1 was built by Saxpy [Fou87] and performs nearly a gigaFLOP
for some DSP and NLA algorithms. Several other systems were reported by Fortes
and Wah [For87], including systolic arrays for beamforming and general DSP and
NLA algorithms.
In 1983 a systolic algorithm was introduced to solve linear and least squares
systems via Givens rotations on a systolic array [Hel83]. Others have developed
different algorithms that compute the QR decomposition via similar means
[Boj84], [Luk86]. Stewart derived an algorithm based on Jacobi iterations to
compute a Schur decomposition [Ste85]. Jacobi iterations were also used by
Brent and Luk [Bre85] to solve eigenvalue and singular value problems on
systolic architectures. Nash and Hansen [Nas88] developed a modified Faddeeva
algorithm to solve a wide class of NLA problems on a systolic SIMD array. These
researchers are looking at the algorithms themselves and finding stable
alternatives that can be efficiently implemented in parallel.
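To illustrate the building block these arrays pipeline (a plain scalar sketch,
not the systolic schedule or any particular published algorithm), a single
Givens rotation annihilates one subdiagonal element of a 2x2 block:

```python
import math

def givens(a, b):
    """Return (c, s) such that [[c, s], [-s, c]] maps (a, b) to (r, 0)."""
    r = math.hypot(a, b)
    return a / r, b / r

# Zero A[1][0] of a 2x2 matrix; full triangularization applies such
# rotations column by column down the matrix.
A = [[3.0, 1.0],
     [4.0, 2.0]]
c, s = givens(A[0][0], A[1][0])
R = [[ c * A[0][0] + s * A[1][0],  c * A[0][1] + s * A[1][1]],
     [-s * A[0][0] + c * A[1][0], -s * A[0][1] + c * A[1][1]]]
# R[1][0] is now (numerically) zero.
```

Because each rotation touches only two rows, independent rotations can be
applied concurrently, which is what makes the method attractive for systolic
implementation.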
The research in systolic arrays can be separated into algorithm modeling and
transformation techniques, hardware/architecture, and pure algorithms research.
All three of these areas are related and intertwined. Array processor
designers, programmers, and algorithms researchers should be aware of all three
areas and use them to their advantage. Current and future research must extend
the knowledge base of good parallelizable algorithms, broaden algorithm mapping
techniques, and explore architectural and practical hardware issues.
1.4 Outline and Summary
This dissertation focuses on the problem of mapping algorithms to systolic and
bus connected processor arrays. It presents matrix-based techniques to compute
the parallel algorithm's data flow parameters and to solve for an algorithm
transformation matrix. Many of the ideas in this work stem from the works of
Kuck [Kuc79], Moldavan [Mol82], [Mol83], and Fortes [For84]. This dissertation
will present a systematic methodology for designing parallel algorithms from
serial algorithm descriptions. Also, a 3x3 experimental array processor will be
developed and explored. An algorithm design methodology will be presented to
help design algorithms for the experimental machine.
In chapter two, a mathematical framework for the description and
transformation of algorithms and programs is established. Both programs and
algorithms are defined over an index set of integer vectors, which allows the
programs and algorithms to be manipulated mathematically. The important
concepts of program and algorithm equivalence are also defined.
In chapter three, a class of algorithms known as uniform recurrent algorithms
is studied along with the programs that describe (generate) them. The
algorithms are studied both in their original form and in their systolic forms.
The original sequential algorithms are mapped to systolic algorithms through a
linear transformation that was first studied by Moldavan and Fortes. Several
functions are derived that reveal information concerning data flow and
collisions in the systolic algorithm. The general notion of time is expanded
from a purely scalar entity to a temporal vector. The meaning of vector time is
explained, and the temporal vector is used to map the two dimensional
convolution algorithm to a two dimensional systolic array. Several examples are
used to illustrate how the theory is applied to actual problems. A set of
design equations, called the systolic array synthesis equations, is derived and
applied to design systolic array algorithms with prespecified data flow
constraints.
In chapter four, the problem of mapping algorithms to fixed architectures is
addressed. Methods are presented to map algorithms to fixed sized and/or fixed
dimensioned arrays. A set of nonlinear integer transformations is used in
conjunction with the linear transformation to expand the degrees of freedom
available to meet the physical architectural constraints.
In chapter five, the framework developed in chapter three is extended to
accommodate a wider class of algorithms. Program transformations are used to
put more complex algorithms into a form that may subsequently be mapped to
systolic MIMD architectures. The method is illustrated by mapping several
algorithms, such as the LU, Cholesky, and QR decompositions, into array
structures.
In chapter six, the general Bus Connected Architecture (BCA) and the design of
an attached modified bus connected array processor are described. The bus
connected architecture is studied, and a systematic design method is presented
for designing uniform recurrent algorithms for BCA arrays. By designing an
actual system, the practical difficulties and hardware limitations are
discovered. Various architectural options are considered that affect the
hardware complexity and efficiency of the overall system. Tradeoffs that must
be made when constructing a system are described, along with the way the target
algorithms affect the design of the architecture.
Chapter seven offers conclusions of the research and outlines where future work would be useful. A bibliography is provided for reference and appendices are included that contain relevant software listings.
CHAPTER TWO
PROGRAM AND ALGORITHM MODELS
An algorithm is normally thought of as a method for solving a problem, usually on a computer. The algorithm is coded via a programming language so that the computer may interpret and execute it. When the algorithm executes, it accepts an input data set and generates an output data set. The algorithm may be viewed as an ordered set of operations that are applied to the input data set to produce the output data set. The program is the ordered set of instructions that are used to code the algorithm. Normally the program is considered to be an implementation of an algorithm. The algorithm is the general recurrence relation and the program is a particular realization of the algorithm. In the systolic array mapping research area the program is considered to be a computer language description of an algorithm [For84]. In an array compiler or CAD tool, the underlying recurrence structure of the algorithm will be derived from the particular program that is presented as input. Once the general algorithm structure is determined, a parallel realization will be produced.
In this chapter programs and algorithms will be viewed in mathematical terms. Definitions will first be stated for serial algorithms and programs, then more general definitions that incorporate the concepts of the index set, dependence structure, and the sets of computations and expressions are introduced. The algorithm model presented in this chapter will follow the works of Karp, Kuhn, Moldovan and Fortes to a large degree. In Karp and Miller [Kar67], the concept of the index set and the dependence structure defined over the index set was introduced. Kuhn used the index set to map algorithms to parallel processing structures. The dependence matrix and the concept of transforming the dependence matrix to an altered dependence matrix of the array algorithm was introduced by Moldovan [Mol82]. Fortes [For84] developed a detailed mathematical description of an algorithm.
2.1 Programs and Algorithms
It is important to draw a distinction between a program and an algorithm. Let X be the input data set, Y be the output data set and W be a transformation W: X → Y. The algorithm may be thought of as a sequence of finitary operations that are applied to X to obtain a set Y. If Q is a sequence of operations that are applied to X to obtain Y, and Z is the set of intermediate values generated as the algorithm executes, then the serial algorithm, Aser, is a tuple Aser = (X, Y, Z, Q). If F is a list of instructions that cause a given computer to execute Q, and ψp is the set of program variables that are referenced in the list of instructions, then the serial program, Pser, is a tuple Pser = (ψp, F). The algorithm consists of the data values and the physical action involved in a processing task, and a program consists of the data structures and the list of instructions that cause the computer to execute an algorithm. Program variables are data objects that may take on many values, while algorithm variables are specific instances or values of program variables. If a program Pser causes the computer to execute the algorithm, Aser, then Pser is said to generate Aser and we write Pser ⇒ Aser.
A data transformation executed on a computer may be viewed from the mathematical, algorithm, or program levels. The mathematical level concerns the transformation itself. For example, one may wish to compute the Discrete Fourier Transform (DFT). It is well known that there are many ways to compute this transformation on a computer. That is, several algorithms exist to compute the DFT. The algorithm level concerns the sequence of computations used to perform the data transformation. The program level concerns the instructions that are used to code the algorithm. So an infinite set of programs exists that will generate a given algorithm. A program transformation or translation is used to move between programs in this set.
The above view gives rise to the notions of algorithm and program equivalence. Two algorithms A1 = (X1, Y1, Z1, Q1) and A2 = (X2, Y2, Z2, Q2) are Input Output Equivalent (IOE) if X1 = X2 implies that Y1 = Y2. This type of equivalence does not take numerical accuracy and stability into account. Another, tighter form of algorithm equivalence is Π-equivalence [For84] (formally defined in section 2.3). In this type of equivalence the algorithms must first be IOE, but also both algorithms must perform the exact set of computations, though possibly in a different order. This type of equivalence may also be called computational equivalence because the exact same set of computations is performed. Consequently X1 = X2 implies Y1 = Y2 and that Q2 is a permutation of Q1. Computationally equivalent algorithms will have the same numerical roundoff and stability properties. If programs P1 and P2 both generate the same algorithm A, then P1 and P2 are said to be operationally equivalent or Ω-equivalent. Ω-equivalent programs cause the computer to execute the exact same ordered set of computations, Q. The notion of Ω-equivalence between programs will allow a given algorithm to be advantageously expressed by alternate programs.
More general program and algorithm definitions rely on the concept of the index set. The index set, denoted Iⁿ, is a set of vectors in Zⁿ where one computation or expression is associated with each vector. In an n-level nested loop, the index vector is a vector where each element is one of the loop indices. In these cases the index set will be the set of vectors containing all combinations of indices generated by the loop. For example, in the program below,
for (i=0:9)
  for (j=0:9)
    for (k=0:9)
      y[i,j] = x[j,i] + 5
the index vector is I = [i j k]ᵀ, and the cardinality of the index set is 1000 since 1000 combinations of indices will be generated as the loop executes.
Given a serial program P1 ⇒ Aser, an Ω-equivalent serial program P2 may be written that explicitly assigns each computation to a point in the index set. This is done by rewriting the arbitrary program as a nested loop. For example,
P1: x = x - 1
    y = x - 3
    y = x + y - 2
can be written in terms of a one dimensional index set as
P2: for i=0:2
      if (i=0) x = x - 1
      if (i=1) y = x - 3
      if (i=2) y = x + y - 2
which may be viewed graphically as shown in figure 2.1.
[figure: three nodes in a line, computing x=x-1 at i=0, y=x-3 at i=1, and y=x+y-2 at i=2]
Figure 2.1: Graphical View of Simple Algorithm with a 1D Index Set
Here the index vector is I ∈ Z¹, and I = i. The index set is Iⁿ = {0, 1, 2}. If the compiler is assumed to make the data independent decisions in program P2 at compile time and the loop is expanded into inline code, both programs will generate identical object codes and be Ω-equivalent. It is easy to see that any finite serial program may be written in a form that expressly assigns each expression to a point in the index set. The original program is converted to another form where each expression has an explicit index vector. This alternate form may be used to generate an algorithm where each computation has an explicit index vector.
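The equivalence of P1 and P2 above can be checked mechanically. The sketch below (assuming the dropped operators in P1 and P2 are subtractions, as reconstructed above; the function names are hypothetical) verifies that both programs leave x and y in the same final state:

```python
# P1: straight-line form of the program.
def p1(x):
    x = x - 1
    y = x - 3
    y = x + y - 2
    return x, y

# P2: the same computations assigned to points of a 1-D index set {0,1,2}.
def p2(x):
    y = None
    for i in range(3):
        if i == 0:
            x = x - 1
        if i == 1:
            y = x - 3
        if i == 2:
            y = x + y - 2
    return x, y

assert p1(10) == p2(10)   # identical observable results for this input
```

Since the two forms execute the exact same ordered set of computations, they are Ω-equivalent in the sense defined earlier, not merely input-output equivalent.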
Since any program may be written as an Ω-equivalent program that explicitly assigns an index vector to each expression, we shall only consider nested loop programs where the index assignments are explicit. The algorithms generated by these programs will then also have an explicit index set. Formal index set based definitions for programs and algorithms are presented below.
Definition 2.1: A program is a four tuple
P = (ψp, Iⁿ, E, Dᴾ)
where
ψp is the set of all program variables in the program
Iⁿ is the index set, where Iⁿ ⊂ Zⁿ
E is a set of expressions, i.e. a mapping of Iⁿ onto E
Dᴾ is the set of dependencies in the program
Definition 2.2: An algorithm is a six tuple
A = (X, Y, Z, Iⁿ, C, Dᴬ)
where
X is the input data set
Y is the output data set
Z is the intermediate data set
Iⁿ is the index set, where Iⁿ ⊂ Zⁿ
C is the set of computations, i.e. a mapping of Iⁿ onto C
Dᴬ is the set of dependencies in the algorithm.
The concept of dependencies in programs and algorithms will be developed further in section 2.2.
The first distinction to make in the above definitions concerns the way variables are defined. In a program, ψp denotes the set of all the program variables. That is, the elements of ψp are data structures. If the program variable y appeared in an expression of a program P, then y ∈ ψp. The variable y may be used in many different expressions and in many computations in the resulting algorithm. In an algorithm, on the other hand, each variable has a unique value. If a computation computes a result, a new variable is generated in the algorithm. The algorithm definition is concerned with each separate value, not with the variable names. This distinction between program variables and algorithm variables was the main reason for presenting the program definition.
Two important mappings, E and C, appear in the above definitions. In a program P ⇒ A, E(I) is the expression in P that causes the computation C(I) to occur in A at point I ∈ Iⁿ. The computation function may also be viewed as a function of both the index vector and the expression that causes C(I) to be executed. In this context, C is written as a function of two variables, C(I, E(I)). The forms C(I) and C(I, E(I)) may be used interchangeably depending on context.
A serial nested loop program is a convenient way to represent a program defined over an index set. If this is the case, the index set of the general model may be connected into a serial path that corresponds to the serial loop's execution ordering. For example, consider program P3 shown below.
P3 : for (i = 0:4)
for (j = 0:4)
x[i,j] = x[i,j] + 5
This program will have the index set and execution ordering as depicted in figure
2.2. In this figure, the nodes represent index vectors and the arrows show the
serial ordering.
The execution ordering provides a means for comparing vectors in the index set. One vector in the index set may be said to be greater than another if it occurred after the other in the execution ordering. A standard execution ordering that is natural when considering nested loop programs is the lexicographical ordering defined below.
[figure: 5 by 5 grid of index points in the (i, j) plane, traversed row by row from start (0,0) to finish (4,4)]
Figure 2.2: The Lexicographical Execution Ordering of Program P3
Definition 2.3: A lexicographical ordering over Zⁿ is the ordering s.t. if x1, x2 ∈ Zⁿ, then x1 is greater than x2 in a lexicographical sense if, for the minimum index i such that x1[i] ≠ x2[i], x1[i] > x2[i]. One writes x1 >L x2.
Nested loops where all indices increment in positive directions generate index vectors that are lexicographically ordered. For example, a special case of the lexicographical ordering was depicted in figure 2.2. Whether the index set is lexicographically ordered or not, the execution ordering may be used to compare vectors. If I1 appears in the execution of A before I2, then we write I1 <E I2. In positively incrementing nested loop programs, the execution ordering and the lexicographical ordering are the same.
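Definition 2.3 can be sketched as a small comparison routine (the function name is an assumption used only for illustration):

```python
# x1 >_L x2 iff at the first index where the vectors differ,
# x1 has the larger entry (Definition 2.3).
def lex_greater(x1, x2):
    for a, b in zip(x1, x2):
        if a != b:
            return a > b
    return False   # equal vectors: neither is lexicographically greater

assert lex_greater((1, 0, 0), (0, 9, 9))   # first differing index decides
assert not lex_greater((0, 1), (0, 2))
assert not lex_greater((2, 2), (2, 2))
```

This is exactly the ordering in which a positively incrementing nested loop visits its index vectors, so it doubles as the execution ordering <E for such programs.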
A concept closely related to the index set and the execution ordering is that of a trace [Len87] of an algorithm. If the index set is drawn as a grid in Zⁿ, and a path is drawn from node to node as the serial algorithm executes (as in figure 2.2), the entire path from the first node to the last is the trace of the algorithm. The trace then is an ordered set of index vectors that are ordered by the serial execution ordering of the program. The trace ties the serial program description together with the general program and algorithm models. The trace of the algorithm will be used in the design of parallel algorithms in chapters 3, 4, and 5.
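For program P3 above, the trace is simply its index vectors listed in serial execution order, which can be sketched as follows (the function name is hypothetical):

```python
# Trace of P3 (figure 2.2): the 5x5 index set visited row by row,
# i.e. in the lexicographical execution ordering.
def trace_p3():
    return [(i, j) for i in range(5) for j in range(5)]

t = trace_p3()
assert t[0] == (0, 0)    # start of the trace
assert t[-1] == (4, 4)   # finish of the trace
assert len(t) == 25      # one index vector per executed expression
```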
2.2 Program and Algorithm Dependence Structures
The dependence structure, Dᴬ, of an algorithm determines the amount of parallelism that can be used during execution. If an algorithm has no dependencies, it can be executed completely in parallel in a single time step. On the other hand, if each computation requires input from the output of the previous computation in the execution ordering, then no parallelism at all may be used. In a program, dependencies arise for each program variable, y. If y is referenced in the expression E(I1) and again in E(I2) where I1 <E I2, and if I1 and I2 are the closest such vectors, then E(I2) depends on E(I1) and a program dependence exists from I1 to I2 in Dᴾ. In an algorithm, if the variable v is generated by computation C(I1), written C(I1) → v, and if v is used for computation C(I2), written v → C(I2), then an algorithm dependence exists from I1 to I2 in Dᴬ. The dependence structure of the algorithm is a subset of its generating program's dependence structure. Because of this, the algorithm's dependence graph may be deduced from the program.
The dependence structure determines what orderings are admissible. It is used to determine which algorithm transformations will yield valid parallel realizations. Also, the dependence structure will dictate the possible array processor topologies that can be used to implement the algorithm, since the dependencies in the algorithm will transform to data paths in the processor array. The following definitions give a method to determine the dependence structure for programs and algorithms.
Definition 2.4: The Program Dependence Function, D, is a mapping from the index set and the set of all program variables in the program P to the set of dependencies, Dᴾ. Individually, D maps a program variable y that appears in an expression E(I) to a dependence vector(s) d = D(y, I) where d ∈ Zⁿ. The program dependence vectors are grouped into two types as follows:
1) Input Dependencies: if y ∈ ψp and E(I) is the first expression in the execution ordering that references y, then d = D(y, I) = 0.
2) Self Dependencies: if y ∈ ψp and y appears in both E(I1) and E(I2) for some I1, I2 ∈ Iⁿ, where I1 <E I2, and if I1 and I2 are the closest such index vectors in the execution ordering, then d = D(y, I1) = I2 - I1.
Definition 2.5: The Algorithm Dependence Function, D, is a mapping from the index set and the set of all variables in the algorithm A to the set of dependencies, Dᴬ. Individually, the function maps a variable v from a computational node C(I) to a dependence vector(s) d = D(v, I) where d ∈ Zⁿ. The algorithm dependence vectors are grouped into three types as follows:
1) Input Dependencies: if v ∈ X and v → C(I), then d = D(v, I) = 0.
2) Absolute Self Dependencies: if v ∈ Y ∪ Z and C(I1) → v, and v → C(I2) for no I2 ∈ Iⁿ, then d = D(v, I1) = 0.
3) Internal Dependencies: if v ∈ Z and C(I1) → v and v → C(I2) for some I1, I2 ∈ Iⁿ, then d = D(v, I1) = I2 - I1.
Other useful functions are the variable functions.
Definition 2.6: The Program Variable Function, ψp, is a mapping from the index set of a program P to the set of all program variables. Individually, ψp(I) is the set of variables referenced in expression E(I).
Definition 2.7: The Algorithm Variable Function, υ, is a mapping from the index set of an algorithm, A, to the set of all variables V = X ∪ Y ∪ Z. Individually, υ(I) is the set of input and generated variables in computation C(I).
Some of the above definitions may be used to generate dependence graphs for programs and algorithms. For the program, apply ψp: Iⁿ → ψp and then D: (ψp, Iⁿ) → Dᴾ. For the algorithm, apply υ: Iⁿ → V and then D: (υ, Iⁿ) → Dᴬ. Because of the way dependence vectors were defined, they will originate and terminate at points in the index set. The index set is the set of nodes, and the program and algorithm dependence vectors may be viewed as arcs in the program and algorithm graphs respectively. In the algorithm graph, each node represents a computation, C(I). In the program graph, each node represents an expression, E(I). If a graph G is defined in its usual manner as G = (N, Ar), where N is a set of nodes and Ar is a set of arcs, then the dependence graph of P may be written Dg = (Iⁿ, Dᴾ) and the dependence graph of A may be written Dg = (Iⁿ, Dᴬ).
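The self dependence part of Definition 2.4 can be sketched computationally: collect the index points that reference the same element of a variable, then difference consecutive points in execution order. The example below (a sketch with assumed loop bounds, using an element z[i,j] referenced at every k, as in a matrix multiply inner loop) is not from the original text:

```python
# Map each referenced variable element to the ordered list of index
# vectors whose expressions reference it (execution ordering).
refs = {}
for i in range(3):
    for j in range(3):
        for k in range(3):
            refs.setdefault(('z', i, j), []).append((i, j, k))

# Self dependencies: differences between closest pairs of index
# vectors, in execution order, that touch the same element.
deps = set()
for points in refs.values():
    for p, q in zip(points, points[1:]):
        deps.add(tuple(b - a for a, b in zip(p, q)))

assert deps == {(0, 0, 1)}   # z is reused only along the k direction
```

The resulting set of unique difference vectors is exactly what is later collected as the columns of a self dependence matrix.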
So we see the program and algorithm definitions may be written in shorthand as the set of all program or algorithm variables, the set of expressions or computations, and the dependence graph of the program or algorithm; i.e. P = (ψp, E, Dg) and A = (υ, C, Dg). These definitions imply knowledge of the variables' distribution throughout the index set via the mappings ψp and υ. The program may be viewed as a graph where variables ψp(I) appear in expression E(I) and move an offset of D(ψp(I), I) to appear in the next expression. The program graph consists of a set of expression nodes with a directed dependence arc for each program variable in the expression, drawn from one node that references a given program variable to the next node in the execution ordering that references that same program variable. The algorithm may be viewed as a graph where computation C(I) operates on variables υ(I) at point I in the graph. When the results are computed, each generated variable vg ∈ υ(I) will move to node I + D(vg, I). If D(vg, I) has m outcomes, then vg will be broadcast to nodes {I + d1, I + d2, ..., I + dm}. Figure 2.3 illustrates the concept for a simple algorithm without broadcasts.
[figure: two computation nodes joined by a dependence arc carrying the generated variable vg from C(I) to the node at I + D(vg, I)]
Figure 2.3: Simple Two Statement Algorithm Graph with Dependence and
Variable Usage Information Displayed
The main difference between the program and algorithm models is the way variables and dependencies are defined. The program variables are standard variables that can take on many values while the algorithm variables can only take on a single value. Also, program dependence arcs are drawn between nodes that reference the same program variables, while algorithm dependencies are only drawn between nodes that generate and use a given algorithm variable. Because of this, the set of algorithm dependencies will be a subset of the set of program dependences and may be easily derived from the program.
In an algorithm, the dependence arc may be viewed as a path from the node that generated the variable v to the node that used it. It may also be viewed as a precedence relation. This view leads to the concept of the free ordering of the algorithm. In this ordering all the elements of the input data set are viewed as input nodes to the algorithm and all the output data elements are viewed as output nodes. The input nodes have arcs going to all nodes in the index set that use the input data. The generated variables are passed to other computational nodes over dependence arcs or to output nodes. The algorithm is started by letting all the inputs travel along all their arcs simultaneously. As soon as a given computation C(I) has all of its inputs present on its input arcs, it performs the computation and passes its generated variables to other nodes along their respective dependence arcs. In this way, the algorithm is allowed to execute to completion. The resulting parallel ordering is called the free ordering.
The free ordering of an algorithm dictates the minimum time hardware structure to execute the algorithm. A processor is provided for each node and a data path is provided for each input, output and dependence arc. This solution requires too much hardware and also requires data broadcasts to be made, which is undesirable when considering locally connected VLSI implementations. Also, a given processor would only execute a single computation during the entire algorithm, making the PE utilization efficiency very low. For these reasons the free ordered array algorithm is rarely directly implemented. Rather, the algorithm's dependence graph is localized and mapped to a more practical locally connected structure that attains a higher PE utilization efficiency.
If an algorithm is to execute on a processor array with a topology as shown in figure 1a, it is required that max(‖d‖∞) = 1 for all d ∈ Dᴬ. This is required so the transformation discussed in section 2.3 will map the dependence arcs to local communication paths in the array processor. The net effect is that a constraint is put on the length of admissible dependence and I/O arcs. Figure 2.4 shows two I/O equivalent algorithms. The graph of figure 2.4a reflects the free ordering of the algorithm. Figure 2.4b shows the algorithm with additional dependencies imposed to enforce local communications.
Several researchers [Lei81] [Mol82] [Kun88] have considered methods to localize dependence graphs. One method [Mol82] involves converting the algorithm to an I/O equivalent algorithm where, in the new algorithm, identity operations are inserted at the computational nodes. By inserting identity operations, the computer does not have to perform any extra computations, but new variables will be generated in the algorithm's mathematical representation. In the original algorithm, if v → C(I1) and v → C(I2) for some selected I1 and I2, then in the new algorithm v → C(I1) and C(I1) → v' where v = v', and then v' → C(I2) so that a dependence d = I2 - I1 is generated. This method imposes dependencies that are not in the original algorithm and thus imposes more restrictions on the possible execution orderings for the algorithm, and may therefore slow down its parallel execution. For example, the free ordering of figure 2.4a requires four time steps while that of figure 2.4b requires seven. Care must be taken when imposing dependencies.
Figure 2.4: Nonlocalized and Localized Algorithm Dependence Graphs.
(a) Nonlocalized Algorithm (b) Localized Algorithm
The algorithm may be localized by converting it to an alternate algorithm by setting Dᴬ = Dᴾ, where Dᴾ is the dependence structure of a program that generates the algorithm. This is equivalent to Moldovan's method of localizing an algorithm graph as described above. For example, the algorithm generated by program P4 below
P4: for (i=0:2)
      for (j=0:2)
        for (k=0:2)
          z[i,j] = z[i,j] + x[i,k] y[k,j]
would only have dependence arcs pointing in the k-direction. Since x and y are input variables, each node would have two nonlocal input arcs drawn from the input set. The program dependence structure, on the other hand, has dependence arcs pointing in all three axial directions and is localized as depicted in figure 2.5. The matrix multiply algorithm is localized by imposing the dependence structure of the matrix multiply program on the matrix multiply algorithm.
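The localization of P4 can be sketched as follows. In this sketch (an illustration with assumed names, not the dissertation's implementation), node (i,j,k) receives x from node (i, j-1, k), y from node (i-1, j, k) and the partial sum z from node (i, j, k-1), so every arc has unit length, yet the final plane of partial sums equals the ordinary matrix product:

```python
n = 3
x = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
y = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]

xc, yc, zc = {}, {}, {}          # values carried out of each node (i,j,k)
for i in range(n):
    for j in range(n):
        for k in range(n):
            xv = x[i][k] if j == 0 else xc[i, j - 1, k]   # x flows along j
            yv = y[k][j] if i == 0 else yc[i - 1, j, k]   # y flows along i
            zv = 0       if k == 0 else zc[i, j, k - 1]   # z accumulates along k
            xc[i, j, k], yc[i, j, k] = xv, yv
            zc[i, j, k] = zv + xv * yv

# The k = n-1 plane holds the completed inner products.
product = [[zc[i, j, n - 1] for j in range(n)] for i in range(n)]
expected = [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]
assert product == expected
```

Because each value is fetched from an immediate neighbor rather than broadcast from the input set, the dependence arcs here are exactly the three unit vectors that appear as the columns of the dependence matrix discussed next.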
Some matrix definitions will be useful in describing the dependencies and the imposed dependencies of the serial program.
Definition 2.7: A Variable Surface, Iy, for a variable y ∈ ψp in a program P is the set
Iy = { I | I ∈ Iⁿ, y → E(I) or E(I) → y }
Definition 2.8: The Self Dependence Matrix, Dy, for a variable y in a program P is a matrix whose columns are the unique values of the self dependence vectors of y.
Definition 2.9: The Dependence Matrix, D, for an algorithm A generated by a program P is a matrix whose columns are the unique values of all vectors in Dᴾ.
Note that in figure 2.5 the dependence surfaces as defined above appear as straight lines. Also in figure 2.5, the self dependence vectors are the vectors tracing out the straight lines of the dependence surfaces. The self dependence matrices are just the dependence vectors in this case, and the dependence matrix of the algorithm is the identity matrix as shown below.

dx = [0 1 0]ᵀ   dy = [1 0 0]ᵀ   dz = [0 0 1]ᵀ   D = [dx dy dz] = the 3 x 3 identity matrix
Figure 2.5: Localized Matrix Multiply Dependence Structure
Note that the order of the columns of D is not important, so they can be arranged in the most convenient manner. In some programs the self dependence matrices will have more than one vector in them. These arise in programs with a nonuniform dependence structure, or in programs with dependence surfaces that are planar or higher dimensional. Nonuniform dependencies arise in many NLA algorithms such as the Cholesky decomposition. Higher dimensional dependence surfaces arise in algorithms with higher complexity, such as the two dimensional convolution algorithm.
2.3 Program and Algorithm Equivalences
Algorithms and programs are modeled to provide a mathematical framework by which a known serial algorithm may be transformed into an equivalent algorithm that is well suited for fast execution on a parallel architecture. First the program that generates the algorithm is transformed to an alternate equivalent form where each expression has an associated unique index vector. Once the program is in this form the index set based algorithm mathematical model may be produced. A parallel implementation of the algorithm is produced by transforming the original algorithm model to an alternate one with explicit parallelism. A clear concept of algorithm equivalence is needed in order to find ways to transform from one algorithm to another. The definition below comes from [For84].
Definition 2.10: Given two algorithms A1 = (X1, Y1, Z1, Iⁿ, C1, D1ᴬ) and A2 = (X2, Y2, Z2, Jⁿ, C2, D2ᴬ), A1 and A2 are Π-equivalent, and we write A1 ~ A2, if and only if
1) A1 is input output equivalent to A2
2) ∃ transformations T and T* s.t. T: Iⁿ → Jⁿ and T*: proj(D1ᴬ) → proj(D2ᴬ)
3) T is a bijection
4) T* is a surjection
5) C1(I) = C2(TI) = C2(J)
6) d2 = T*d1 and J = TI
From the definition above, Π-equivalence between A1 and A2 implies that A1 and A2 are IOE and that there is an invertible map from Iⁿ to Jⁿ where the computation C1(I) is mapped to computation C2(TI), and C1(I) and C2(TI) are the same computation. It can be shown [For84] that for the above definition to hold, it is required that T*(d) = T(I) - T(I - d). For the special case of linear transformations, the above reduces to T* = T, so that the dependencies and the index set are transformed through the same transformation. Linear transformations will receive the most attention in this dissertation. Nonlinear transformations will be discussed when addressing the partitioning problem in chapter 4.
The Π-equivalence transformation can be viewed as simply a reindexing of a given algorithm. Suppose an algorithm's graph was drawn as those of figures 2.4 and 2.5. If the index and dependence vectors were transformed through a linear transformation, T, then the algorithm graph would be transformed so that the nodes and arcs had different labels, but the structure of the graph would remain the same. That is, the same arcs connect the same nodes so that the free orderings of the two algorithms are identical. The two algorithms are isomorphic since the transformation linking the two algorithms is a bijection. The transformed index set is now viewed in a parallel algorithm coordinate system. Some of the axes in the transformed index set are viewed as temporal axes and the others as processor coordinate axes. This interpretation imposes an ordering on the transformed algorithm. The only problem is to make sure that the parallel execution ordering preserves the precedence relations of the algorithm's free ordering.
The transformation T is a map from the original algorithm's index coordinate system to a coordinate system consisting of time and processor dimensions. Therefore, T may be viewed in terms of time and processor subtransformations. The time transform π ∈ Z^(q×n) maps an index vector I ∈ Iⁿ to a temporal vector t = πI. The space transform, S ∈ Z^(m×n), where m = n - q, maps I to a processor coordinate vector P = SI. The time transform schedules a computation C(I) to occur at temporal vector t in the parallel algorithm, and the space transform tells at which processor, P, in the array the computation C(I) will occur. The temporal vectors are ordered lexicographically.
The transformation T is formed by augmenting the time transform with the space transform. Given π ∈ Z^(q×n), S ∈ Z^(m×n) and the requirement q + m = n, if S is augmented to π, an n × n matrix results. Both π and S are required to have full rank, and the rows of π are required to be linearly independent of the rows of S, so that the resulting augmented matrix will be invertible. These restrictions cause the resulting transformation,

T = [π]
    [S]

to be a bijection as required in the definition of Π-equivalence. The precedence relations of the original algorithm will be preserved if πd >L 0 for every d ∈ Dᴬ. This will cause computations that depend on results from a previous computation to be executed at a later time in the parallel algorithm. With these conditions holding, T will map A to a Π-equivalent algorithm. The transformation both schedules and allocates resources for the computation C(I) in the array as

J = TI = [πI]
         [SI]

and there exists a T⁻¹ such that I = T⁻¹J.
CHAPTER THREE
ARRAYS FOR UNIFORM RECURRENT ALGORITHMS
Uniform recurrent algorithms (URAs) are an important class of algorithms that are especially well suited for mapping to systolic arrays. URAs have the property that the same set of dependence vectors emanates from each node in the index set. The idea of a URA is depicted in figure 3.1a where, in contrast, figure 3.1b shows an algorithm that is not a URA. The uniformity of the dependence structure will be reflected in the array domain by regular communication networks and simple control strategies. Some examples of URAs are matrix-matrix multiplies, convolutions, correlations, DFTs and partial differential equation solvers. URAs are normally computationally intensive and often require real time execution, making them prime candidates for array implementation.
Figure 3.1 : Uniform and Nonuniform Dependence Graphs.
(a) URA (b) Not a URA
The class of URAs is the simplest to analyze in terms of the program and algorithm models presented in the last chapter. Most programs that generate URAs may be written as an n-level nested loop with a single expression in the inner loop. Several tools and methodologies for algorithmically specified array design and analysis will be developed for URAs. These tools and methods will be extended to broader classes of algorithms in chapter 5.
In the first part of this chapter URAs and programs that generate URAs will be studied. In section 3.1 necessary and sufficient conditions will be given for a class of programs to generate URAs. Also a subclass of programs will be studied that will generate URAs via affine and/or linear indexing transformations. In section 3.2 streamlined matrix based program and algorithm representations will be developed for URAs that contain the same information as the general definitions. Using the matrix based representations, simple closed form expressions may be derived for the variable and dependence functions, and a recursive procedure may be used to generate the index set [Dow88d] [Dow89]. These constructs will be useful for computer aided URA specified systolic array design [Dow88a].
In sections 3.3 and 3.4 the transformation domains of URAs will be analyzed. In section 3.3 matrix methods will be introduced to determine the data velocities and distributions in the processor array. Also, a position function will be derived that tells a datum's position at a given time (temporal vector) in the array algorithm. Other temporal vector based functions will be derived that reveal data collision information and PE memory contents. In section 3.4 a set of equations will be derived whose solution will provide a transformation that will yield an array algorithm with desired data flow properties. Examples will be given to illustrate the concepts.
3.1 Uniform Recurrence Programs and Algorithms
The central focus of this section is the URA and its generating program. The URA is formally defined below.
Definition 3.1: A Uniform Recurrent Algorithm (URA) is an algorithm A = (X, Y, Z, Iⁿ, C, Dᴬ) where each dependence vector d ∈ Dᴬ is valid at each point I ∈ Iⁿ.
A convenient form of a program to generate a URA is a Single Expression
Nested Loop Program (SENLP). The SENLP is defined below.
Definition 3.2: A Single Expression Nested Loop Program (SENLP) is a program
P = (V^P, I^P, D^P, E) with the following conditions:
1) The index set is lexicographically ordered and is generated by an n-level nested loop.
2) E(I) = E for every I ∈ I^P.
3) If y ∈ V^P then y ∈ R^m and m ≤ n. The variables are indexed in the expression using indexing transformations F_y : I → vi_y, vi_y ∈ Z^m. This causes y[vi_y] to be an individual element of y. In the expression the variable y is referenced as y[F_y(I)].
The most important point of the above definition is the way the variables are referenced. A single expression is used that contains transformations that map from a
point in the index set to the variable indices of a particular program variable that is
referenced in the expression at that index point. These transformations allow a
single expression to cause a unique computation to occur at each point in the index
set of the algorithm generated by the program.
The indexing transformations provide a mathematical way to find program dependencies. Suppose P is a SENLP with y ∈ V^P, y ∈ R^m, and vi_y ∈ Z^m is the indexing vector for y. Then a program dependence exists from I1 to I2 if y[vi_y] is referenced in the expressions at both I1 and I2, with I1 < I2 in the lexicographical sense. If this is the case, then D(y, I1) = d_y = I2 − I1 and both I1 and I2 must satisfy F_y(I) = vi_y. So if I1 is a solution to F_y(I) = vi_y and D(y, I1) = I2 − I1, then I2 must also satisfy the equation F_y(I) = vi_y. If the solution set to F_y(I) = vi_y is known, the dependence structure may be found by ordering the solution vectors lexicographically and differencing adjacent vectors in the list. This provides a useful way to set up the dependence graph for an algorithm generated by a SENLP. If the program is known a priori to generate a URA, then the equation F_y(I) = vi_y needs to be analyzed only once and the dependencies for all elements of y will be known at all points of the index set. The following lemma will provide a way to know a priori if a SENLP will generate a URA. A theorem, whose proof is based on the lemma, will show that if the variable indexing transformations are affine or linear, the SENLP will always generate a URA.
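As a concrete check of this solve-order-and-difference procedure, the following sketch (the helper names are illustrative, not from the text) recovers the dependence vector of the variable Y in the 3x3 matrix multiply, whose indexing transformation is F_y(I) = (k, j) for I = (i, j, k):

```python
from itertools import product

def F_y(I):
    i, j, k = I
    return (k, j)            # vi_y = [k j]^T

# The index set, generated in lexicographic order.
index_set = list(product(range(3), range(3), range(3)))

vi_y = (0, 1)                # follow the single datum y[0, 1]
solutions = [I for I in index_set if F_y(I) == vi_y]   # already ordered

# Adjacent differences of the ordered solution list are the dependence
# vectors for y[vi_y].
deps = [tuple(b - a for a, b in zip(I1, I2))
        for I1, I2 in zip(solutions, solutions[1:])]

print(deps)   # every difference is (1, 0, 0): a uniform dependence
```

Because the algorithm is a URA, the same difference (1, 0, 0) would be obtained for any choice of vi_y, which is why the equation needs to be analyzed only once.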
Lemma 3.1: A necessary and sufficient condition for a SENLP, P, to generate a URA is that each variable indexing transformation, F_y, in E of P satisfies
F_y(I) = F_y(I + d_y) for every I ∈ I^P and each d_y ∈ D^P.
Proof:
(Necessary) Assume P => A where P is a SENLP and A is a URA. Let y ∈ V^P and let F_y : I → vi_y be the indexing transformation for y in E. If the variable y[vi_y] is referenced at points I1 and I2 of the index set, then vi_y = F_y(I1) and vi_y = F_y(I2). Therefore F_y(I1) = F_y(I2). Furthermore, if I1 <_L I2 then for some I2 in the solution set of vi_y = F_y(I), I2 = I1 + d_y where d_y = D(y, I1). The above two conditions lead to the equation
F_y(I1) = F_y(I1 + d_y)
but since P => A where A is a URA, we know that d_y holds at all points in the index set, so the above equation must hold for each dependence vector of D^P at all points of the index set, i.e.
F_y(I) = F_y(I + d_y) for every I ∈ I^P and each d_y ∈ D^P.
(Sufficient) Suppose P is a SENLP where for each indexing transformation F_y in E, the relation F_y(I) = F_y(I + d_y) holds for every I ∈ I^P and each d_y ∈ D^P. If this is the case, the same variable y[vi_y] is referenced at points I and I + d_y for each I ∈ I^P. Therefore d_y is a valid dependence vector at each point in the index set. Since the condition also holds for each d_y ∈ D^P, all the dependencies in the program are valid at each point of the index set. Therefore if the algorithm A is generated from P and the algorithm dependencies are derived by setting D^A = D^P, then A will be a URA.
Theorem 3.1: If P is a SENLP where each F_y in E is an affine transformation, then P => A where A is a URA.
Proof: Suppose P has only affine indexing transformations. Then if d_y is a dependence vector for the variable y ∈ V^P at point I1 ∈ I^P, the relation
vi_y = F_y(I1) = F_y(I1 + d_y)
must hold. Because F_y is an affine transformation, it may be written as the sum of a linear transformation and a constant vector, i.e.
F_y(I1 + d_y) = F_y^L(I1 + d_y) + vi_c
where vi_c ∈ Z^m has the same dimensions as vi_y. Because F_y^L is linear, the above may also be written
F_y(I1 + d_y) = F_y^L(I1) + F_y^L(d_y) + vi_c
but since vi_y = F_y(I1) = F_y(I1 + d_y), we can replace the left hand side of the above equation with F_y(I1) = F_y^L(I1) + vi_c to obtain
F_y^L(I1) + vi_c = F_y^L(I1) + F_y^L(d_y) + vi_c
so that it is clear that F_y^L(d_y) = 0. That is, either d_y = 0 or d_y ∈ N(F_y^L). Therefore at any arbitrary point I ∈ I^P
F_y(I + d_y) = F_y^L(I) + F_y^L(d_y) + vi_c = F_y^L(I) + vi_c = F_y(I)
so
F_y(I) = F_y(I + d_y) for every I ∈ I^P and each d_y ∈ D^P.
Since the program satisfies the necessary and sufficient condition to generate a URA, we know that P => A where A is a URA.
The above theorem identifies a class of programs that will be known a priori to generate URAs. If a SENLP is written with only affine transformations in E, it is known to generate a URA. Since a linear transformation is a special case of an affine transformation with vi_c = 0, the above theorem also holds for algorithms with linear indexing transformations. Programs of this form are common in DSP and NLA applications.
Another important point brought out in the proof of the above theorem concerns the program's dependence structure. If a SENLP is written with only affine indexing transformations of the form F_y(I) = F_y^L(I) + vi_c, where F_y^L(I) is a linear transformation and vi_c is a constant vector, then either d_y = 0 or d_y ∈ N(F_y^L). This gives an easy way to find the dependence vectors by inspection of the indexing transformation, as will be discussed in the next section. Also, it shows that the subsets of the index set that involve computations using a given program variable will be (n − m)-dimensional surfaces in the index set. These surfaces of variables will transform to paths in the array algorithm that a given variable will travel. If N(F_y^L) is one dimensional, the variable y[vi_y] will follow a straight line through the index set, as was exemplified in figure 2.5.
3.2 Matrix Based URP and URA Representation and Analysis
We next derive matrix representations for SENLPs and URAs. Let P be a SENLP with only linear indexing transformations. If F_y : I → vi_y, where vi_y ∈ Z^m and I ∈ Z^n, then F_y may be viewed as a linear transformation F_y : Z^n → Z^m. The definition below serves to define the transformation matrix of such a transformation.
Definition 3.1: An indexing matrix F_y ∈ Z^{m×n} for a variable y is a transformation matrix that maps the index vector I ∈ I^P to a variable index vector vi_y ∈ Z^m. That is, F_y is defined so that vi_y = F_y I.
To see how the indexing matrices are used, consider the matrix multiply program.
P1: for (i = 0 : 2)
      for (j = 0 : 2)
        for (k = 0 : 2)
          Zij = Zij + Xik Ykj
which has an index vector
I = [i j k]^T.
This program has V^P = { X, Y, Z }. The variables are indexed by the variable index vectors vi_z = [i j]^T, vi_x = [i k]^T and vi_y = [k j]^T respectively. The indexing transformations are given for each variable as
vi_z = [1 0 0; 0 1 0] I,   vi_x = [1 0 0; 0 0 1] I,   vi_y = [0 0 1; 0 1 0] I
so that the indexing matrices are given by
F_z = [1 0 0; 0 1 0],   F_x = [1 0 0; 0 0 1],   F_y = [0 0 1; 0 1 0].
As is clear from the above example, the indexing matrices may be easily found by inspection. In practice, a program's indexing scheme will rarely be much more complicated than the one above. A translator could be written that would take an expression written in a standard computer language and convert its indexing clauses to indexing matrices, assuming the variables in the original program are indexed by linear combinations of the loop indices. Once the indexing matrices are found, it is a simple task to determine the dependence vectors: simply find the minimum length lexicographically positive vector satisfying the equation F_y d_y = 0 for each transform F_y. In general, n − Rank(F_y) such linearly independent vectors must be found. This can be done numerically or by using list searching techniques on the index set. Also, it can be done by inspection for most programs. For example, it can be determined by inspection of the indexing matrices that
d_z = [0 0 1]^T,   d_x = [0 1 0]^T,   d_y = [1 0 0]^T
since these vectors are in the null spaces of their respective indexing matrices.
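The null-space condition is easy to verify mechanically. A small sketch (the helper name `matvec` is illustrative, not from the text) checks that each dependence vector above satisfies F_y d_y = 0, so the same element is referenced along d_y:

```python
def matvec(M, v):
    # Multiply an integer matrix (list of rows) by a vector.
    return [sum(m * x for m, x in zip(row, v)) for row in M]

# Indexing matrices of program P1 and the dependence vectors found above.
Fz, Fx, Fy = [[1, 0, 0], [0, 1, 0]], [[1, 0, 0], [0, 0, 1]], [[0, 0, 1], [0, 1, 0]]
dz, dx, dy = [0, 0, 1], [0, 1, 0], [1, 0, 0]

for F, d in ((Fz, dz), (Fx, dx), (Fy, dy)):
    assert matvec(F, d) == [0, 0]   # F_y d_y = 0: d_y lies in N(F_y)
print("F d = 0 for all three dependence vectors")
```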
Another example of how the indexing matrices are used is the linear convolution program
P2: for (i = 0 : N−1)
      for (j = 0 : N−1)
        Zi = Zi + Xj Yi−j
where it follows that
F_z = [1 0],   F_x = [0 1],   F_y = [1 −1]
and the dependence vectors are
d_z = [0 1]^T,   d_x = [1 0]^T,   d_y = [1 1]^T.
In this example the dependence vector d_y is not in a standard unit basis direction.
The indexing matrices may be inserted in the expression directly to give a standard expression format. The expression format is
y1[F_y1 I] = f(y1[F_y1 I], y2[F_y2 I], ..., ys[F_ys I]).
This expression is denoted E and is used at all points in the program's index set.
The loop limits may be represented by two matrices that determine the boundaries of the index set. The matrices allow each loop to range from a constant plus a linear combination of the previous loop indices to another constant plus a linear combination of the previous loop indices. For example, consider the problem of multiplying two upper triangular
square matrices. Since the product is known to be upper triangular also, only the upper triangle of the product needs to be computed. That is, one would write
P3: for (i = 0 : 2)
      for (j = 0 : i)
        for (k = i : j)
          Zij = Zij + Xik Ykj
Here the second limit of the second coordinate of the index vector is dependent on the first index coordinate, and the limits of the third coordinate are dependent on the first and second coordinates. Figure 3.2 shows the geometry of the index set generated by the "for" statements of program P3. Note the difference between this figure and figure 2.5, which was for the multiplication of square matrices. The affine transformations defined below serve to represent the loop limits and index set boundaries.
Definition 3.3: The boundary matrices Φ1 and Φ2 ∈ Z^{n×(n+1)} are affine transformations of the index vector that linearly map an augmented index vector
Ī = [I^T 1]^T
to a point in the index set I^P. The leading n×n submatrices of Φ1 and Φ2 are strictly lower triangular.
The boundary matrices completely characterize the boundaries of the index set. For example, consider again the upper triangular matrix multiply loop, where the loop limits of
for (i = 0 : 2)
  for (j = 0 : i)
    for (k = i : j)
      Zij = Zij + Xik Ykj
may be written
for (I = Φ1 Ī : Φ2 Ī)
with
Φ1 = [0 0 0 0; 0 0 0 0; 1 0 0 0],   Φ2 = [0 0 0 2; 1 0 0 0; 0 1 0 0],   Ī = [i j k 1]^T.
Using this scheme, either of the square or upper triangular matrix-matrix multiply loops or the convolution loop can be represented as
for (I = Φ1 Ī : Φ2 Ī)
  Z[F_z I] = Z[F_z I] + X[F_x I] Y[F_y I]
by changing the dimension and/or the entries in the matrices Φ1, Φ2, F_x, F_y, and F_z. This provides the foundation for the definition of a subclass of SENLPs.
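The recursive generation of the index set from the boundary matrices can be sketched as follows. The loop and helper names are illustrative (not from the text), and a simple two-level triangular loop, for (i = 0 : 2) for (j = i : 2), is used as the example: row r of Φ1 and Φ2 gives the lower and upper limit of loop r as an affine function of the previously fixed coordinates.

```python
def index_set(Phi1, Phi2, n):
    points = []
    def recurse(prefix):
        r = len(prefix)
        if r == n:
            points.append(tuple(prefix))
            return
        # Augmented index vector; coordinates deeper than r are still free
        # but their Phi entries are zero (strictly lower triangular).
        aug = prefix + [0] * (n - r) + [1]
        low  = sum(a * b for a, b in zip(Phi1[r], aug))
        high = sum(a * b for a, b in zip(Phi2[r], aug))
        for v in range(low, high + 1):
            recurse(prefix + [v])
    recurse([])
    return points

# Boundary matrices for:  for (i = 0 : 2)  for (j = i : 2)
Phi1 = [[0, 0, 0], [1, 0, 0]]
Phi2 = [[0, 0, 2], [0, 0, 2]]
print(index_set(Phi1, Phi2, 2))
# [(0, 0), (0, 1), (0, 2), (1, 1), (1, 2), (2, 2)]
```

The points come out in lexicographic order, matching condition 1 of the SENLP definition.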
Figure 3.2: Upper Triangular Matrix-Matrix Multiply Dependence Graph
Definition 3.4: A linear recurrence program (LRP) involving s variables is a program that may be specified by an (s + 3)-tuple consisting of s + 2 matrices and an expression as shown below.
P = (Φ1, Φ2, F_y1, F_y2, ..., F_ys, E)
Here the boundary and the s indexing matrices are defined as above and E is the standard expression. The nested loop program may be written in matrix form as
for (I = Φ1 Ī : Φ2 Ī)
  y1[F_y1 I] = f(y1[F_y1 I], y2[F_y2 I], ..., ys[F_ys I])
For LRPs, the above tuple contains the same information as the general program definition. The sets of variables and dependencies in the program are specified by the indexing matrices, the index set is specified by Φ1 and Φ2, and the set of expressions is specified by E. All the information of the LRP is compactly represented by s + 2 relatively small matrices and a single expression. The LRP is used to generate a Uniform Linear Recurrence Algorithm (ULRA). The above tuple can also be used to represent a ULRA if it is recognized that C(I,E) occurs at each point of the index set.
The above results may be extended to programs that use affine indexing transformations. Let F_y(I) = F_y^L I + vi_c, where F_y is an affine indexing transformation, F_y^L is an indexing matrix, and vi_c is a constant variable index vector with the same dimensions as vi_y. Then the affine transformation of I ∈ I^P may be viewed as a linear transformation F_y : Ī → vi_y, where F_y may be written as an indexing matrix
F_y = [F_y^L | vi_c]
with F_y ∈ Z^{m×(n+1)}, F_y^L ∈ Z^{m×n}, vi_c ∈ Z^m and Ī = [I^T 1]^T. Using the affine indexing matrix, another subclass of SENLPs is defined below.
Definition 3.4: An Affine Recurrence Program (ARP) involving s variables is a program that may be specified by an (s + 3)-tuple consisting of s + 2 affine transformation matrices and an expression as shown below.
P = (Φ1, Φ2, F_y1, F_y2, ..., F_ys, E)
Here the boundary and the s affine indexing matrices are defined as above and E is the standard expression. The nested loop program may be written in matrix form as
for (I = Φ1 Ī : Φ2 Ī)
  y1[F_y1 Ī] = f(y1[F_y1 Ī], y2[F_y2 Ī], ..., ys[F_ys Ī])
For ARPs, the above tuple contains the same information as the general program definition. Everything is the same as in the linear case in terms of the information contained in the above definition, except the way the dependence information is contained in the indexing matrices. Now the dependence structure must be determined by examining the linear submatrix of the indexing matrix. The ARP is used to generate a Uniform Affine Recurrence Algorithm (UARA). The above tuple can also be used to represent a UARA if it is recognized that C(I,E) occurs at each point of the index set.
3.3 Transform Domain Analysis
So far it has been shown how to represent LRPs and ARPs, and thus their generated URAs, as tuples of s + 2 matrices and an expression. The next problem is to map the algorithm to a systolic array. Again, we shall first derive the results for the linear case and then extend them to the affine case. As mentioned in section 2.4, an invertible integer transformation matrix T ∈ Z^{n×n} may be applied to the index vector I ∈ Z^n to perform the mapping of computation C(I) to a systolic computation C(J) where J = TI. C(I) and C(J) are identical computations, but where I tells where in the index set C(I) is performed, J tells where and when in the systolic array C(J) is performed. The transformation must be selected so that the dependence vectors project to positive temporal vectors in order for the algorithm's precedence relations to be maintained. In this section we will show how to use the matrix representation along with the transformation to extract data velocity and data distribution [Li85] vectors. Also, time parameterized functions will be derived to yield data position, PE register contents, and data collision information.
Recall from section 2.4 that the algorithm transformation may be broken into a time transform π ∈ Z^{q×n} and a space transform S ∈ Z^{m×n} where m + q = n. The computation C(I) is performed at time t = πI at processor P = SI. The transform T is constructed by augmenting the π row block with the space transform, yielding a nonsingular square matrix T ∈ Z^{n×n}. The q-dimensional temporal vector may be used to admit multiple clocks with separate data velocities associated with each clocking (time) coordinate. This allows a linear transformation to force data to make multiple passes through the array. The transformation has the form
J = [t; P] = [π; S] I = TI.
A useful set of transformation matrices is constructed by augmenting the time transform π with each of the indexing matrices F_y. The matrices, called the time-index matrices, are defined below.
Definition 3.5: The time-index matrix for a variable y with index matrix F_y ∈ Z^{m×n} in an LRP whose generated algorithm is scheduled to execute on a systolic array via the time transform π ∈ Z^{q×n} is the invertible matrix formed by combining π and F_y to obtain T_y ∈ Z^{n×n}. This matrix is shown below.
T_y = [π; F_y]   (time-index matrix)
The above matrix defines another coordinate system. Vectors in the new system are called time-index vectors and are written vi_y^t, where
vi_y^t = [t; vi_y]   (time-index vector)
Because T_y is required to be invertible (posing an additional restriction on the selection of π), T_y can be viewed as a change of basis transformation (bijection) from the index set to the time-index set. That is,
T_y I = vi_y^t   and   I = T_y^{-1} vi_y^t.
Furthermore, the basis may be changed from vi_y1^t to vi_y2^t using the cross transformation matrix.
Definition 3.6: The cross transformation, T_y1y2, maps a vector vi_y2^t to another vector vi_y1^t. The transformation is composed of T_y1 and T_y2 as
T_y1y2 = T_y1 T_y2^{-1}
so that
vi_y1^t = T_y1y2 vi_y2^t
since
T_y1y2 vi_y2^t = T_y1 (T_y2^{-1} vi_y2^t) = T_y1 I = vi_y1^t.
The ability to change basis from I to vi_y^t or from vi_y1^t to vi_y2^t will be useful in deriving formulae concerning data position, velocity, distributions, and collisions. First consider the position function. That is, given a data element y[vi_y], find its position P in the processor array at time t. The known information is vi_y ∈ Z^m and t ∈ Z^q. This information is contained in the vector vi_y^t. Since P = SI and I = T_y^{-1} vi_y^t, it follows that P = S T_y^{-1} vi_y^t, which gives the position of y[vi_y] at time t. It will prove useful to define the following matrix based on the above result.
Definition 3.7: The parameter matrix, S_y ∈ Q^{m×n}, for a variable y in a recurrence algorithm transformed by a transformation T is given by S_y = S T_y^{-1}.
Definition 3.8: The position function, P_y, for a variable y, is a transformation P_y : vi_y^t → P, where P gives the processor coordinates of the datum y[vi_y] at time t, and is written
P_y(vi_y^t) ≜ S_y vi_y^t = P.
Note that S_y ∈ Q^{m×n} and vi_y^t ∈ Z^n, so that S_y vi_y^t ∈ Q^m in general. Thus the position function may map a time-index vector to a noninteger point in the processor space. Since processors are only at the integer points, these cases must be interpreted. In practice it is often the case that S_y ∈ Z^{m×n} ⊂ Q^{m×n}, but this is not required. In cases where it takes several clock cycles for a datum to move to the next processor, its position will be interpreted as being between processors; hence the datum has a noninteger position in the processor space. These positions may be interpreted as delay registers that are inserted between processors. Another view is to assume the datum resides in PE local memory when it is not at an integer position in the processor coordinate system. When enough clock cycles have passed to make the position function evaluate to another integer position, the datum is passed along the appropriate data path to the next processor. The position function is not evaluated while the algorithm is executing, but tells how to design or program the hardware.
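To make the parameter matrix concrete, the sketch below computes S_z = S T_z^{-1} exactly over the rationals and evaluates a position. The transform values (π = [1 1 1] and S = [−1 1 0; 0 0 1]) are the ones derived for the matrix multiply in section 3.4, and the helper names are illustrative; here T_z happens to be unimodular, so S_z is integer.

```python
from fractions import Fraction

def inverse(M):
    # Gauss-Jordan inversion over the rationals.
    n = len(M)
    A = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(M)]
    for c in range(n):
        p = next(r for r in range(c, n) if A[r][c] != 0)
        A[c], A[p] = A[p], A[c]
        A[c] = [x / A[c][c] for x in A[c]]
        for r in range(n):
            if r != c and A[r][c] != 0:
                A[r] = [x - A[r][c] * y for x, y in zip(A[r], A[c])]
    return [row[n:] for row in A]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

pi = [[1, 1, 1]]                    # time transform (assumed)
S  = [[-1, 1, 0], [0, 0, 1]]        # space transform (assumed)
Fz = [[1, 0, 0], [0, 1, 0]]         # indexing matrix of Z in P1

Tz = pi + Fz                        # time-index matrix T_z = [pi; F_z]
# Parameter matrix S_z = S T_z^{-1}; integral because T_z is unimodular.
Sz = [[int(x) for x in row] for row in matmul(S, inverse(Tz))]

# Position of z[1, 2] at time t = 4:  P = S_z [t vi_z]^T
P = [sum(s * x for s, x in zip(row, [4, 1, 2])) for row in Sz]
print(P)    # [1, 1]: at t = 4 the datum z[1, 2] sits in PE (1, 1)
```

As a cross-check, the computation C(1, 2, 1) has πI = 4 and SI = (1, 1), which agrees with the position function.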
The position function is used to find the data velocities. There are q velocity vectors per variable y, one with respect to each time coordinate. Because of the uniform dependence structure, the velocity found for one data element y[vi_y] is the same for each element of y. That is, the velocity of one specific datum y[vi_y] will actually be y's wavefront velocity. The data velocity with respect to the ith time coordinate is the change in processor position, ΔP, of y[vi_y] when the ith time coordinate is incremented by one. Therefore, if e_i denotes the ith standard basis vector of Z^n, the ith velocity vector of y is
v_y^i = P_y(vi_y^t + e_i) − P_y(vi_y^t)   1 ≤ i ≤ q
or, by definition of the position function,
v_y^i = S_y(vi_y^t + e_i) − S_y vi_y^t = S_y(vi_y^t + e_i − vi_y^t) = S_y e_i.
This means that the ith (1 ≤ i ≤ q) column of the parameter matrix S_y is the velocity vector of y for the ith time coordinate of t ∈ Z^q.
The data distribution vectors, introduced by Li and Wah [Li85], are defined as follows:
Definition 3.9: Given a datum y[vi_y] with vi_y ∈ Z^m and another datum y[vi_y + e_i], 1 ≤ i ≤ m (e_i is the ith standard basis vector of Z^m), the ith data distribution vector δ_y^i ∈ Q^m is a vector in the processor coordinate system that, for t = constant, points from the processor containing y[vi_y] to the processor containing y[vi_y + e_i].
The distribution vectors may be found in terms of the position function. Let i = q + j, 1 ≤ j ≤ m; then the jth distribution vector is
δ_y^j = P_y(vi_y^t + e_i) − P_y(vi_y^t)   1 ≤ j ≤ m,  q < i ≤ n
or, by definition of the position function,
δ_y^j = S_y(vi_y^t + e_i) − S_y(vi_y^t) = S_y(vi_y^t + e_i − vi_y^t) = S_y e_i.
This states that the jth distribution vector of y is the (q + j)th column of S_y. It is convenient to define the velocity and distribution matrices v_y ∈ Q^{m×q} and δ_y ∈ Q^{m×m} as the submatrices of S_y, namely S_y = [v_y | δ_y]. The parameter matrix is used to compute the data position of y[vi_y] and explicitly lists the data velocity and distribution vectors of y. The velocities and/or distributions will be noninteger when data takes more than one cycle to reach the next processor, as mentioned earlier.
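The column partitioning S_y = [v_y | δ_y] can be read off directly. In the sketch below (values assumed from the matrix multiply case, q = 1 and m = 2), the first column is extracted as the velocity and checked against a one-step position difference:

```python
# An assumed parameter matrix S_z = [v_z | delta_z] with q = 1, m = 2.
Sz = [[0, -1, 1], [1, -1, -1]]

vz      = [row[0]  for row in Sz]     # velocity vector v_z (column 1)
delta_z = [row[1:] for row in Sz]     # distribution matrix delta_z (columns 2..3)

def position(t, vi):
    vec = [t] + list(vi)
    return [sum(a * b for a, b in zip(row, vec)) for row in Sz]

# Incrementing the time coordinate by one moves the datum by exactly v_z.
step = [b - a for a, b in zip(position(4, (1, 2)), position(5, (1, 2)))]
print(step == vz)   # True: P_y(t + 1) - P_y(t) = v_y
```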
The position function may be written in a familiar form using the above partitioning of the parameter matrix. Since v_y ∈ Q^{m×q}, t ∈ Z^q, δ_y ∈ Q^{m×m} and vi_y ∈ Z^m, the equation P = S_y vi_y^t may be written P = v_y t + δ_y vi_y. This is a vector form of the familiar position function for linear motion at a constant velocity, p(t) = vt + p_0. At t = 0, P_y(vi_y^t) = δ_y vi_y, so that δ_y vi_y is the initial position of the datum at time t = 0. The initial position is a function of vi_y, i.e. each separate datum y[vi_y] has its own initial position in the array processor coordinate system.
Figure 3.3: Data Distributions for Algorithm With δ_y = I_2 (the elements y00 through y22 laid out on the (P1, P2) processor grid)
The above view of the position function gives insight into how data moves in the systolic array. In the common case of one-dimensional time, it shows that data moves in a straight line by discrete steps of v_y each time the clock is incremented. A more interesting case is when q > 1, which arises when there are q dependence vectors for each variable y. For example, suppose in some array algorithm that a given variable y had the parameter matrix
S_y = [1 0 1 0; 0 1 0 1]
where n = 4, q = 2 and m = 2, so that t, vi_y, P ∈ Z^2. The initial configuration of the algorithm, based on the identity distribution matrix, would be that of figure 3.3.
Now consider how the datum y00 moves through the array as the temporal vector increments. The time trace of an algorithm is the lexicographically ordered set of unique temporal vectors that arise as the algorithm executes. Suppose in this example that the time trace is given by
trace = { [0 0]^T, [0 1]^T, [0 2]^T, [1 0]^T, [1 1]^T, [1 2]^T, [2 0]^T, [2 1]^T, [2 2]^T }.
Then the position function for y00 would become
P_y([t1 t2 0 0]^T) = v_y [t1 t2]^T + δ_y [0 0]^T = [t1 t2]^T
so that the path traced out by y00 will be the same as the time trace.
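The q = 2 behavior can be sketched directly (names are illustrative): with S_y = [I2 | I2], the position of y[0, 0] equals the temporal vector itself, so the position trace follows the lexicographic time trace and jumps whenever t2 wraps back to zero.

```python
from itertools import product

Sy = [[1, 0, 1, 0], [0, 1, 0, 1]]                # S_y = [I2 | I2]
trace = sorted(product(range(3), range(3)))      # lexicographic time trace

def position(t, vi):
    vec = list(t) + list(vi)
    return tuple(sum(a * b for a, b in zip(row, vec)) for row in Sy)

path = [position(t, (0, 0)) for t in trace]
print(path == list(trace))   # True: the path of y[0, 0] is the time trace
```

Note the discontinuity between path entries (0, 2) and (1, 0), which is the jump discussed below.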
Figure 3.4: Position Trace of y00 Through The Processor Array (the datum steps along the grid through t = 8, with labeled instants t = 3, 4, 5, 7, 8 marking the path)
Note in the time trace that when t1 is incremented, t2 is set back to zero. This causes jumps in the position of the datum as time increments. This is the mechanism whereby a linear transformation allows data to make multiple passes through the array. The datum y00's position trace is displayed in figure 3.4 and is based on the above position function and time trace. In the figure, absolute one-dimensional time is used by lexicographically ordering the temporal vectors. This is the approach taken in actual hardware.
Another set of functions that can be derived to help design and analyze systolic algorithms are the processor memory functions. The memory function was introduced in the context of the space-time-data equations for a systolic algorithm [Kuo84]. The memory function states which indices vi_y of y are contained in processor P at time t. There is a separate memory function for each variable y in the algorithm. The given information is the time t and the processor coordinates P. This information is contained in the time-processor vector
J = [t; P].
Recall that I = T^{-1} J, so that since vi_y = F_y I, it follows that vi_y = F_y T^{-1} J, which leads to the following definitions.
Definition 3.10: The memory matrix, M_y ∈ Q^{m×n}, is given by
M_y = F_y T^{-1}
and is a linear transformation mapping from time-processor vectors to the variable index vector vi_y for the variable y.
Definition 3.11: The memory function, for a variable y, is a mapping M_y : J → vi_y and is written
vi_y = M_y J.
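A sketch of the memory function for the matrix multiply follows, under the assumed transform π = [1 1 1], S = [−1 1 0; 0 0 1] (so T = [π; S] is invertible but not unimodular, and M_z has rational entries). The helper names are illustrative.

```python
from fractions import Fraction

def inverse(M):
    # Gauss-Jordan inversion over the rationals.
    n = len(M)
    A = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(M)]
    for c in range(n):
        p = next(r for r in range(c, n) if A[r][c] != 0)
        A[c], A[p] = A[p], A[c]
        A[c] = [x / A[c][c] for x in A[c]]
        for r in range(n):
            if r != c and A[r][c] != 0:
                A[r] = [x - A[r][c] * y for x, y in zip(A[r], A[c])]
    return [row[n:] for row in A]

T  = [[1, 1, 1], [-1, 1, 0], [0, 0, 1]]      # T = [pi; S] (assumed)
Fz = [[1, 0, 0], [0, 1, 0]]                  # indexing matrix of Z
Ti = inverse(T)
Mz = [[sum(f * t for f, t in zip(row, col)) for col in zip(*Ti)] for row in Fz]

def memory(J):                               # vi_z = M_z J,  J = [t P1 P2]
    return [sum(m * x for m, x in zip(row, J)) for row in Mz]

print(memory([4, 1, 1]))   # integer entries: z[1, 2] is in PE (1, 1) at t = 4
print(memory([4, 0, 1]))   # noninteger entries: no element of Z there then
```

The second query returning noninteger entries is exactly the "no data element in an ALU register" case discussed below.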
The memory function specifies which data element y[vi_y] is contained in a processor P at time t. If M_y J contains rational noninteger entries, then no data element is contained in an ALU register of processor P at time t.
The last function to be derived is the data collision function. The data collision function determines with which data element from another data set a given datum collides at time t. The collision function is a mapping C_y1y2 : vi_y2^t → vi_y1. Given vi_y2^t, it follows that I = T_y2^{-1} vi_y2^t and vi_y1 = F_y1 I, so that vi_y1 = F_y1 T_y2^{-1} vi_y2^t, which leads to the following definitions.
Definition 3.12: The collision matrix, C_y1y2 ∈ Q^{m×n}, is a transformation matrix mapping C_y1y2 : vi_y2^t → vi_y1 where C_y1y2 = F_y1 T_y2^{-1}.
Definition 3.13: The collision function of a variable y2 with another variable y1 is a function whose input is a variable-index vector vi_y2 and a time vector t (i.e. vi_y2^t) and whose output is a variable-index vector vi_y1. The vector vi_y1 is the variable-index vector of the element of y1 with which y2[vi_y2] collides at time t. We write
vi_y1 = C_y1y2 vi_y2^t.
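A sketch for the matrix multiply (π = [1 1 1] assumed, helper names illustrative): C_xy = F_x T_y^{-1} maps the time-index vector of y[k, j] to the index of the X element it meets. T_y^{-1} is written out by hand for T_y = [π; F_y].

```python
Fx     = [[1, 0, 0], [0, 0, 1]]                 # indexing matrix of X
Ty_inv = [[1, -1, -1], [0, 0, 1], [0, 1, 0]]    # inverse of T_y = [pi; F_y]

# Collision matrix C_xy = F_x T_y^{-1}
Cxy = [[sum(f * t for f, t in zip(row, col)) for col in zip(*Ty_inv)]
       for row in Fx]

t, k, j = 4, 1, 2                  # the datum y[1, 2] at time t = 4
vix = [sum(c * x for c, x in zip(row, [t, k, j])) for row in Cxy]
print(vix)   # [1, 1]: y[1, 2] meets x[1, 1], both operands of C(1, 2, 1)
```

The result agrees with the inner-loop expression: at I = (1, 2, 1) the product Xik Ykj uses x[1, 1] and y[1, 2], and πI = 4.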
So far all the work has been done for ULRAs. Now the results will be extended to the affine case. Recall that ARPs had indexing transformations of the form
F_y(I) = F_y^L I + vi_c.
This indexing policy will affect the data's initial placement, but not the velocity nor the distribution. Therefore, the time-index matrix is defined for the linear portion as
T_y^L = [π; F_y^L]
and the parameter matrix for the linear portion is given by S_y^L = S (T_y^L)^{-1}, so that the velocity and distribution vectors are stored in S_y^L just as they were in S_y. The linear parameter matrix will find the position if the proper variable indices are presented. If vi_c is subtracted as a correction term before the transform S_y^L is applied, the proper results will be attained. To see this, let vi_c^t ∈ Z^n be a vector created by augmenting a q-dimensional zero vector with the constant offset vector of the affine indexing transformation. Then vi_c^t will be compatible with the time-index vector and may be used to make the correction in the variable index vector portion of the time-index vector before applying the linear portion of the time-index matrix, so that I = (T_y^L)^{-1}(vi_y^t − vi_c^t). Thus the new position function is obtained by applying P = SI where I is given as above, so that the position function may be written P_y(vi_y^t) = S_y^L (vi_y^t − vi_c^t).
The memory function is also simple to derive when affine indexing transformations are present. In the linear case M_y = F_y T^{-1} and vi_y = F_y T^{-1} J = F_y I. Now just set M_y^L = F_y^L T^{-1} and the memory function becomes vi_y = M_y^L J + vi_c.
The final function to be extended is the collision function. In the linear case vi_y1 = F_y1 T_y2^{-1} vi_y2^t. The key was that I = T_y2^{-1} vi_y2^t. From the discussion of the position function, we know that I = (T_y2^L)^{-1}(vi_y2^t − vi_c2^t), where vi_c2^t is the zero-augmented constant indexing offset vector for y2. Then the second transformation is applied as vi_y1 = F_y1^L I + vi_c1. Thus the full function is computed by
vi_y1 = F_y1^L (T_y2^L)^{-1}(vi_y2^t − vi_c2^t) + vi_c1.
3.4 Algorithmically Specified Array Synthesis
The synthesis problem may be stated: given a serial algorithm in matrix form, find a transformation T ∈ Z^{n×n} that will transform the serial algorithm to an equivalent systolic algorithm whose data flow matches the interconnection constraints of the target array. The synthesis method uses the same equations as the
previous analysis techniques, but in a different configuration. A system of equations is derived that has as its solution the transformation T. Once the transform is found, it may be tested to see if it preserves the algorithm's precedence relationships by multiplying the dependence vectors by the time transformation and testing whether the result is lexicographically positive. Programs with affine indexing transformations may be designed by the methods discussed in this section by using the linear portion of the transformation.
In synthesizing a systolic algorithm, one specifies selected data velocities and distributions in order to get a desired data flow. The other velocities and distributions are left as free parameters in order to make the resulting system more solvable. The data velocity and distribution information is contained in the parameter matrices, which are related to the time-index matrix through the space transform as S_y = S T_y^{-1}. Multiplying on the right by T_y yields S_y T_y = S, or
[v_y | δ_y] [π; F_y] = S
which may be written
v_y π + δ_y F_y = S.
The above is a synthesis equation. It relates data velocity, distribution and indexing for the variable y to the time and space transforms. There is one synthesis equation for each variable y in the generating program. All of the synthesis equations must be solved simultaneously for π and S. When the system is solved, the result is the algorithm transformation T that will map a serial algorithm to a parallel algorithm with the desired data velocities and distributions. If no solution exists, the constraints must be loosened until a solution is found.
The synthesis equation for the variable y is very similar to y's position function. Note that if both sides of the synthesis equation v_y π + δ_y F_y = S are multiplied on the right by the index vector I and the sides are switched, the position function P = v_y t + δ_y vi_y results. While the position function relates vectors of different coordinate systems, the synthesis equation relates the transformation matrices. The synthesis equations are solved to determine the transform that will give the desired position functions. This amounts to specifying v_y and δ_y and solving for π and S. If all the v_y and δ_y matrices are specified, usually no solution will exist. Rather, only the essential parameters are specified and the rest fall as they will to give a valid algorithm.
There are several reasons why one would want to generate a systolic algorithm with specified data flow parameters. The main reason is that if only particular data paths are available, then one would like to design an algorithm that uses the existing paths. Sometimes it is desirable to constrain the distribution in order to have data properly fit an array, or to obtain algorithms with higher PE utilization efficiency due to closely spaced data. Other times algorithms must be designed to run on an array with fixed size and dimensions. In such cases the algorithm must be partitioned to execute on the limited hardware resources, as is described in the next chapter.
The synthesis equations may be applied using available information to simplify their solution. For example, the matrix multiply algorithm was shown to have an identity dependence matrix in section 2.4. Also, time is one dimensional, so that π ∈ Z^{1×3}. Because D = I_3, πD = π, so to satisfy the precedence relations of the algorithm each entry of π must be positive. To obtain a minimal time algorithm the entries are set to the minimum positive integer, so that π = [1 1 1]. This type of reasoning simplifies the solution of the synthesis equation for many algorithms. In light of this, consider the following example.
Example 3.1: Use the synthesis equations to map the algorithm generated by the matrix multiply program below to a two-dimensional systolic array with a four nearest neighbor connection topology. Assume that no data are to stay in place.
Program: for (i = 0 : 2)
           for (j = 0 : 2)
             for (k = 0 : 2)
               Zij = Zij + Xik Ykj
Solution : Note that n=3,m=2 and q=l and that the indexing matrices were give in
section 3.2. Since q=l, there is only one time dimension and data will make a
single pass through the array. From the discussion above it is already known that
the minimum time algorithm will be generated by r = [ 1 1 1] and some S yet to
be found. Since only four nearest neighbor connections are available and all data
must move, specify the velocities as
1o]
vz = Vx = ] v=
so that if the processor axes are set up as in figures 3.3 and 3.4 these velocities will cause Z to move from left to right, X to move from top to bottom, and Y to move from bottom to top. These velocities will cause data to travel along the existing data paths of the array. The distributions are not critical so they will be left as free parameters and determined later by computing the parameter matrices. The synthesis equations, vy π + δy Fy = S, will be written for X, Y and Z in matrix form with π = [1 1 1] and the multiplications vy π already performed.
Z:  [0 0 0]   [δz11 δz12] [1 0 0]   [S11 S12 S13]
    [1 1 1] + [δz21 δz22] [0 1 0] = [S21 S22 S23]

X:  [1 1 1]   [δx11 δx12] [1 0 0]   [S11 S12 S13]
    [0 0 0] + [δx21 δx22] [0 0 1] = [S21 S22 S23]

Y:  [-1 -1 -1]   [δy11 δy12] [0 0 1]   [S11 S12 S13]
    [ 0  0  0] + [δy21 δy22] [0 1 0] = [S21 S22 S23]
If the multiplications δy Fy are performed, the above equations reduce to
Z:  [0 0 0]   [δz11 δz12 0]   [S11 S12 S13]
    [1 1 1] + [δz21 δz22 0] = [S21 S22 S23]

X:  [1 1 1]   [δx11 0 δx12]   [S11 S12 S13]
    [0 0 0] + [δx21 0 δx22] = [S21 S22 S23]

Y:  [-1 -1 -1]   [0 δy12 δy11]   [S11 S12 S13]
    [ 0  0  0] + [0 δy22 δy21] = [S21 S22 S23]
The Z equation requires S13 = 0 and S23 = 1. The X equation forces S12 = 1 and S22 = 0. The Y equation forces S11 = -1 and S21 = 0. Thus the S matrix is completely determined and is given by

S = [-1 1 0]
    [ 0 0 1]
and the parameter matrices can now be determined using Sy = S Ty^(-1) to be

Sz = [0 -1  1]    Sx = [1 -2 -1]    Sy = [-1 1 2]
     [1 -1 -1]         [0  0  1]         [ 0 1 0]
The first column of each of the above matrices is the velocity vector for the respective variable as specified at the beginning of the problem. The distribution vectors are given in the second and third columns of the above matrices. Figure 3.5 shows the initial configuration of the data as the algorithm is about to execute. The velocity and distribution vectors are also shown in the figure. Note that if the data move as specified, the correct results will be computed. Also note how the distribution vectors point from the point in the processor space where a given variable is located to the point in the processor space where another datum resides that has one of its variable indices incremented with respect to the other's. The distribution vectors tell how the data are distributed in the processor space.
Figure 3.5: The Systolic Algorithm of Example 3.1 with Velocity and Distribution Vectors Displayed
Example 3.2: Design a systolic algorithm to filter a 32 × 32 pixel image with a 3 × 3 window on a 3 × 3 systolic array with a four nearest neighbor connection topology. The filter coefficients must sit still, one per processor. Assume that random access memory buffers exist at the array periphery that are controlled by an address generator with read/write control.
Solution: Let x, y, w represent the input, output and window images respectively, where x ∈ R^(31×31), y ∈ R^(33×33), w ∈ R^(3×3). Figure 3.6 shows how the filtering operation may be written as four loops in matrix form. To make w remain still and be distributed one per processor, let

vw = [0]  and  δw = [1 0]
     [0]            [0 1]
Next we will assume the x's and y's move in the same direction but at different speeds as in H.T. Kung's W1 one-dimensional convolution algorithm [Ku82]. To achieve this, set

vy = [1 0]  and  vx = [1/2  0 ]
     [0 1]            [ 0  1/2]
Note that these velocities are specified so that it will take two clock cycles in either temporal dimension for a given X-variable to reach another processor. Due to the way time increments, this will cause some data not to engage any processors while time progresses in the least significant temporal dimension. The data whose position due to the most significant time coordinate is an integer will engage a processor every other cycle.
If the X and Y data distributions are left as free parameters, the synthesis equations become,
Y:  [1 0] [τ11 τ12 τ13 τ14]   [δy11 δy12] [1 0 0 0]   [S11 S12 S13 S14]
    [0 1] [τ21 τ22 τ23 τ24] + [δy21 δy22] [0 1 0 0] = [S21 S22 S23 S24]

X:  [1/2  0 ] [τ11 τ12 τ13 τ14]   [δx11 δx12] [1 0 1 0]   [S11 S12 S13 S14]
    [ 0  1/2] [τ21 τ22 τ23 τ24] + [δx21 δx22] [0 1 0 1] = [S21 S22 S23 S24]

W:  [0 0] [τ11 τ12 τ13 τ14]   [1 0] [0 0 1 0]   [S11 S12 S13 S14]
    [0 0] [τ21 τ22 τ23 τ24] + [0 1] [0 0 0 1] = [S21 S22 S23 S24]

where π = [τij] is the unknown 2 × 4 time transform.
The W equation yields S directly,

W:  S = [0 0 1 0]
        [0 0 0 1]
so that the X and Y equations may be rewritten

Y:  [τ11 τ12 τ13 τ14]   [δy11 δy12 0 0]   [0 0 1 0]
    [τ21 τ22 τ23 τ24] + [δy21 δy22 0 0] = [0 0 0 1]

X:  1/2 [τ11 τ12 τ13 τ14]   [δx11 δx12 δx11 δx12]   [0 0 1 0]
        [τ21 τ22 τ23 τ24] + [δx21 δx22 δx21 δx22] = [0 0 0 1]
From the Y equation it is clear that

[τ13 τ14]   [1 0]
[τ23 τ24] = [0 1]

From the X equation we see that

1/2 [τ13 τ14]        [1 0]
    [τ23 τ24] + δx = [0 1]

so that

δx = 1/2 [1 0]
         [0 1]

and that

1/2 [τ11 τ12]
    [τ21 τ22] + δx = 0

so that

[τ11 τ12]   [-1  0]
[τ21 τ22] = [ 0 -1]
Finally, the Y equation gives

δy = -[τ11 τ12]   [1 0]
      [τ21 τ22] = [0 1]

so that the required transformation is

T = [π]   [-1  0  1  0]
    [S] = [ 0 -1  0  1]
          [ 0  0  1  0]
          [ 0  0  0  1]
Figure 3.6 shows a snapshot of the algorithm executing.
Figure 3.6: 2D Convolution Algorithm Designed in Example 3.2 [figure: snapshot of the executing array showing the x and y data streams with their velocity and distribution vectors and the indexing matrices]
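As a check on the derivation, the sketch below substitutes the solved time transform π back into the three synthesis equations of Example 3.2. The 1/2 entries are kept exact with Python's fractions module; all values are the ones derived above.

```python
# Substitute the solved time transform back into the Example 3.2
# synthesis equations: for each of Y, X and W,  v*pi + d*F  must equal S.
from fractions import Fraction as Fr

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def madd(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

pi = [[-1, 0, 1, 0], [0, -1, 0, 1]]   # solved time transform (q = 2)
S  = [[0, 0, 1, 0], [0, 0, 0, 1]]     # forced directly by the W equation

I2   = [[1, 0], [0, 1]]
half = [[Fr(1, 2), 0], [0, Fr(1, 2)]]
zero = [[0, 0], [0, 0]]

Fy = [[1, 0, 0, 0], [0, 1, 0, 0]]     # indexing matrices of the four-loop program
Fx = [[1, 0, 1, 0], [0, 1, 0, 1]]
Fw = [[0, 0, 1, 0], [0, 0, 0, 1]]

# (velocity, distribution, indexing matrix) for Y, X and W respectively
for v, d, F in [(I2, I2, Fy), (half, half, Fx), (zero, I2, Fw)]:
    assert madd(matmul(v, pi), matmul(d, F)) == S
print("Y, X and W synthesis equations verified")
```

Note how the W equation, with vw = 0 and δw = I, fixes S before any of the unknowns are solved; the X equation then pins down both the distributions and the remaining columns of π.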
CHAPTER FOUR
MAPPING TO FIXED ARCHITECTURES
An algorithm must often be mapped to a fixed architecture. The algorithms will usually operate on large data sets which cause the algorithm to have a large index set. The large index set is then mapped to a processor set through the space transform S : I^n → P^m. Since S is a linear transformation with integer entries, the dimensions of the processor set will be at least as big as the smallest m dimensions of the index set. Sometimes certain dimensions of the index set are small, as in the two-dimensional convolution example of the last chapter. Usually this is not the case, so further methods must be developed to map large algorithms into small arrays. Another related problem is mapping an arbitrarily dimensioned algorithm to a k-dimensional array. For example, one may wish to compute a matrix multiply on a one-dimensional array, or a one-dimensional convolution on a two-dimensional array.
This chapter will explore and develop methods to map an algorithm to a fixed architecture. This will be done by reviewing some known methods and introducing some nonlinear number theoretic index set permutation transformations into the existing framework. The new transformations will be used to transform sets of integer vectors into sets of integer vectors of different dimensions. This added capability will allow more types of algorithms to be mapped and will add more degrees of freedom to the mappings themselves. Also, the nonlinear transformations may be integrated into the existing framework to allow the synthesis of algorithmically specified arrays with size and dimension constraints.
The nonlinear transformations are based on number theoretic set transformations. If a given dimension of an index set ranges from 0 to n-1, then this dimension may be viewed as the integer ring Zn. The elements of the ring may be mapped to isomorphic rings of different dimensions. The maps are simple to implement using known mappings. One map is based on the integer division theorem, and the other is based on the Chinese remainder theorem. Both maps depend on the factors of n. For example, if n = a·b·c, then there exists a bijection map φ : Zn → Za × Zb × Zc.
In section 4.1 partitioning fundamentals will be covered. The concept of a virtual processor will be introduced. Some partitioning methods in the literature will be discussed, and an in-place partitioning method will be presented. In section 4.2 integer based program pretransformations will be applied to the index set to create a new algorithm that may be mapped to a fixed structure. In section 4.3, integer transformations are applied to the time-processor space to partition and alter the dimensions of an array algorithm directly. The architectural ramifications of the mapping are discussed to show what properties are required of the basic Processing Element (PE) to support the partitioning scheme. In section 4.4, the integer maps will be used with the synthesis equations to synthesize arrays of different dimensions than the variable index vectors.
4.1 Partitioning Fundamentals
Array processor hardware is very complex due to the relatively large number of processors and the large number of connections between them. If the array has dimensions m × m, then m^2 processors will be required and the number of connections will be on the order of 2m^2 or 4m^2. Another concern is that m inputs and m outputs will need to be serviced at the array boundaries each cycle. This means the memory must be m times as fast as the PEs. These reasons have kept systolic arrays to small sizes when compared to the data sets that they must process. So in order for systolic arrays to solve real-world applications, the algorithms must be partitioned to fit the physical hardware structures.
The properties of the array and PE architectures must be taken into account when mapping algorithms to the fixed architecture. Some architectures consist of very simple PEs that can perform some basic arithmetic operations, while others have more complex PEs. If the system has the simple PEs, then the partitioning must be handled at the array boundaries. The controller must store intermediate results in the off-array memory and send the data through the array several times in order to complete the algorithm. The central issue is to determine the proper sequences of data to send through the array for multiple passes. If the PE has some local memory and possibly a pointer based addressing unit, then the partitioning may be handled by the array itself. Here the main partitioning problem is to tell the processors how to sequence data in local memory.
Most partitioning schemes employ some variation of the concept of a virtual
processor or virtual node. The idea is that the set of processors that is determined
by the mapping is actually a set of virtual processors. These virtual processors
must then be emulated by the physical array. That is, each physical PE must act as
a set of virtual PEs. This means that some additional scheduling and resource
allocation must be done to get the physical PEs to act as sets of virtual processing
nodes. Also, memory and addressing/sequencing hardware must be built into the
PEs. The addition of this hardware greatly complicates the array algorithm design
and implementation problem. Parallel languages such as OCCAM [Pou86] and W2
[Ann87] can make this task easier, but still the hardware must be programmed
directly.
Some partitioning methods based on algorithm transformation techniques are available in the literature. The method of [For84] and [Mol86] involves slicing the index set into bands and mapping these bands modulo L to a fixed L × L array. The index set is sliced by separating the m × n space transformation matrix, S, into m 1 × n subtransforms. Bands are formed by finding surfaces in the index set that satisfy Si I = constant and are then reduced modulo L to fit on an L × L array. The computations within each band are scheduled by the time transform, π.
The method of [Hor87] is based on synthesizing algorithms with very low PE utilization efficiency and then eliminating the unused processors each cycle. The net result is a relatively high efficiency array algorithm that uses only a few processors. A time transform with larger than normal entries is used, which gives rise to low data velocities. If a datum's velocity is less than one, it will not be involved in a computation for many cycles. Also, processors will be idle many cycles. The space transform generates a full size processor array but the time transform is selected so that the execution time of the algorithm is longer than the minimum execution time. This causes many processors to be idle most of the time. Thus if the available array has dimensions L × M, then the array algorithm is synthesized with dimensions L1 × M1 where groups of (L1/L) × (M1/M) processors will have only one active PE at any given time. This group of (L1/L) × (M1/M) virtual processors may be handled by a single physical processor since only one processor in the group will be active each clock cycle. The algorithm takes longer to execute, but this is expected since fewer processors are available to execute the task.
Another way to efficiently partition algorithms is based on computing the output variables in place. This method is used for ULRPs that compute operations of the form Z = X*Y. The first step is to synthesize an algorithm with vz = 0 and δz = Im. This will cause the z-variables to sit in place and to be distributed one per processor so that the size of the synthesized array is the same size as the z-data set. If the synthesized array has dimensions L1 × M1 and the physical array has dimensions L × M, the array is partitioned into L × M subarrays as shown in figure 4.1 below. If L and M do not divide L1 and M1 respectively, then the boundary subarrays will have some unused processors, but the method may still be used.
The algorithm is executed by sequentially executing the algorithm on each subarray to compute the L × M subblocks of the output data set. Care is taken to remove unused cycles in the time trace of the algorithm. This is done by starting each subarray's time trace when the data wave first encounters the subarray and halting once the data wave passes over.
Figure 4.1: In-Place Partitioning Method
Each subarray algorithm may be set up and controlled with the help of the memory functions and velocity vectors for the X and Y data sets. The X and Y memory functions are applied to the boundary processors of each subarray to obtain the subarray input data variable index sequences. The velocity vectors are used to tell which boundary processors will receive the input data. An example will illustrate.
Example 4.1: Synthesize an array to execute the 4 x4 matrix multiply algorithm
described by the program below. Next partition the algorithm to execute on a 2 × 2
physical array structure. Compare the execution times of the unpartitioned and
partitioned algorithms with a uniprocessor algorithm.
Program: for (i = 0 : 3)
for (j= 0 : 3)
for (k= 0 : 3)
Zij = Zij + Xik Ykj
Solution: To make the algorithm execute in place, set δz = Im and vz = 0 so that the Z synthesis equation, vz π + δz Fz = S, becomes 0π + Im Fz = S, or Fz = S. If π is selected as π = [1 1 1] in order to satisfy the program dependencies and provide a minimum time algorithm, then the parameter matrices may be computed to be

Sz = [0 1 0]    Sx = [0  1  0]    Sy = [1 -1 -1]
     [0 0 1]         [1 -1 -1]         [0  0  1]

and the memory matrices are given by

Mz = [0 1 0]    Mx = [0  1  0]    My = [1 -1 -1]
     [0 0 1]         [1 -1 -1]         [0  0  1]

Note in this example that Sy = My for each y ∈ { X, Y, Z }.
The resulting array algorithm is shown in figure 4.2 with partitioning lines drawn. Figure 4.3 shows four separate algorithms that are executed in sequence. The outputs must be flushed from the array after each subarray algorithm executes. The flushing operation could be pipelined to avoid wasted cycles at the cost of a more complex controller.
The partitioning of figure 4.2 was performed graphically and was conceptually simple. If the partitioning is to be performed automatically, a more structured approach is needed. Because vx = [0 1]^T we know the x data will be input to the right boundary processors. Similarly, because vy = [1 0]^T the y data will be input to the bottom boundary processors. Thus the boundary processors that must receive input data are known. Next the time trace for the entire algorithm is formed by transforming π : I^n → ttrace. In this example, the sorted time trace is given by
ttrace = { 0,1,2,3,4,5,6,7,8,9 }
The trace is used with the virtual processor coordinates of the boundary processors of the subarrays to obtain the input data index sequences. Each element of the time trace is augmented with the subarray boundary virtual processor coordinates and loaded into a matrix as column vectors as shown below for the (0,0) processor.
P00 = [0 1 2 3 4 5 6 7 8 9]
      [0 0 0 0 0 0 0 0 0 0]
      [0 0 0 0 0 0 0 0 0 0]
Figure 4.2: In Place Algorithm with Partitioning Lines [figure: the 4 × 4 virtual array with the X data entering at the right boundary and the Y data entering at the bottom boundary]
This is a matrix of time-processor vectors whose columns are ordered in time. If the matrix is multiplied by a memory matrix then the columns of the product matrix will be variable index vectors. These vectors will be the ordered sequence of indices that will be input to the given boundary processor. For example, if the above matrix is multiplied by Mx, the sequence
Figure 4.3: The Partitioned Algorithm As Four Subarrays a) The Array to Compute the (0,0) Subblock b) The Array to Compute the (0,1) Subblock c) The Array to Compute the (1,0) Subblock d) The Array to Compute the (1,1) Subblock
vix,00 = [0 0 0 0 0 0 0 0 0 0]
         [0 1 2 3 4 5 6 7 8 9]
is obtained. If it is multiplied by My, the resulting sequence is
viy,00 = [0 1 2 3 4 5 6 7 8 9]
         [0 0 0 0 0 0 0 0 0 0]
If the sequence is multiplied by Mz, an all zero matrix results since Z00 is always present at processor P00.
The column vectors of the vix,00 and viy,00 matrices are the input variable index sequences to P00 in figures 4.1 and 4.2. Note that the column vectors begin to contain out of range values after 4 cycles. This signals that the data wave has passed over the processor and no valid computations are occurring. These cycles would not be executed in the partitioned algorithm. In other cases there would be a number of columns that contained out of range values at the beginning and possibly more at the end. This corresponds to a subarray in the middle of the array where the data wave takes several cycles to reach the boundary processor. The computer detects the subsequence where the input indices are in range and schedules the algorithm accordingly.
The last task in the example is to compare execution times. The serial algorithm has complexity O(n^3) so the 4 × 4 algorithm takes 64 cycles to execute. The full sized array algorithm can be seen to have complexity O(3n) so that it would require ten cycles to execute. If it is assumed that the flushing is not pipelined, then the partitioned algorithm has a complexity of O(n^3/2) and takes 32 cycles.
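The sequencing procedure of this example can be sketched in a few lines of Python. The memory matrices are the ones reconstructed in the example, and the range test 0..3 reflects the 4 × 4 index set; the variable names are illustrative only.

```python
# Sketch of the sub-array sequencing in Example 4.1: multiply the
# time-processor matrix for boundary PE (0,0) by the memory matrices to
# recover the input index sequences, then keep only the in-range cycles.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

ttrace = list(range(10))               # sorted time trace {0,1,...,9}
P00 = [ttrace, [0] * 10, [0] * 10]     # columns are the vectors [t, 0, 0]^T

Mx = [[0, 1, 0], [1, -1, -1]]          # memory matrices from the example
My = [[1, -1, -1], [0, 0, 1]]
Mz = [[0, 1, 0], [0, 0, 1]]

vix = matmul(Mx, P00)                  # -> [[0]*10, [0,...,9]]: x(0,t) at cycle t
viy = matmul(My, P00)                  # -> [[0,...,9], [0]*10]: y(t,0) at cycle t
assert matmul(Mz, P00) == [[0] * 10, [0] * 10]   # Z00 never moves

# indices outside 0..3 mean the data wave has passed over the processor;
# the controller schedules the sub-array only over the valid window
valid = [t for t in ttrace if all(0 <= rows[t] <= 3 for rows in vix + viy)]
print(valid)                           # cycles 0..3 carry valid computations
```

For an interior subarray the same test would trim invalid columns from both ends of the trace, which is exactly the detection step described above.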
4.2 Integer Based Index Set Pretransformations
An algorithm transformation that preserves equivalence must be a bijection, but it does not have to be integer or affine. As mentioned earlier, S : I^n → P^m where the size of P^m is directly related to the size of I^n. As was shown in example 3.2, if m dimensions of the index set are of the size of the array, then S may be selected to make the algorithm fit the array. Usually the size of the index set is independent of the size of the physical array so that no fit may be made by carefully selecting a space transform. Rather, a number theoretic integer index set bijection may be applied to the index set before the transform T. The index set pretransformation will alter the shape of the index set in such a way that m dimensions of the transformed index set will be of the desired size.
Certain conditions on the index set must be met before a number theoretic permutation may be applied. First of all, the ith element of the index set must range from 0 to bi - 1. If the element of the index ranges from one constant to another constant, then the program may easily be rewritten to meet the requirement. In other words, the leading n × n submatrix of the boundary matrices must be zero to apply the method. If I ∈ I^n and i and j are components of I that range from 0 to bi and bj respectively, the algorithm can be made to map to an a1 × a2 array if a1 | bi and a2 | bj. An integer bijection φI is applied to the original program's index set to pretransform the program so that it will generate an algorithm that may be directly mapped to a fixed size systolic array through a linear transformation. Thus the index set of the original program is mapped to another index set that is subsequently mapped to the systolic array time-processor set J^n. The mapping may be viewed as φI : I^n → I'^n, and T : I'^n → J^n so that TφI : I^n → J^n. Thus the transformation TφI maps the program to the systolic array of fixed size.
Let a given dimension of the index set be generated by the loop

for (i = 0 : b-1)

The dimension then consists of the integers { 0, 1, 2, ..., b-1 }, which is the integer ring Zb. For the case where b = a1·a2·a3 and an a1 × a2 array is available, the pretransformation φ will be a map φ : Zb → Za1 × Za2 × Za3. The map is known to exist by the division algorithm of number theory [Ros84], although it will not preserve the ring operations in general. Let i ∈ Zb and (i1,i2,i3) ∈ Za1 × Za2 × Za3 where φ : i → (i1,i2,i3). Then the forward map is given by (i1,i2,i3) = (i div a2a3, (i mod a2a3) div a3, i mod a3). The inverse map, φ^(-1) : (i1,i2,i3) → i, is given by the linear equation i = a2a3 i1 + a3 i2 + i3. The above is easily verified and follows directly from the referenced theorem. The above relations show how to compute the mappings φ and φ^(-1). Another easily verifiable fact is that the ordering of Zb is transformed to an equivalent lexicographical ordering in Za1 × Za2 × Za3. Thus the ordering of computations is known for a program where some of the dimensions of the index set are transformed through the above transformation.
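A minimal sketch of the map and its inverse, using the factorization 12 = 2·2·3 of Example 4.2 for concreteness (the function names are illustrative):

```python
# The division-theorem bijection  phi : Z_b -> Z_a1 x Z_a2 x Z_a3  and its
# linear inverse, for b = a1*a2*a3 (here 12 = 2*2*3, as in Example 4.2).

def phi(i, a2, a3):
    return (i // (a2 * a3), (i % (a2 * a3)) // a3, i % a3)

def phi_inv(i1, i2, i3, a2, a3):
    return a2 * a3 * i1 + a3 * i2 + i3

a1, a2, a3 = 2, 2, 3
b = a1 * a2 * a3
images = [phi(i, a2, a3) for i in range(b)]

# phi is a bijection whose inverse is the stated linear map ...
assert all(phi_inv(*t, a2, a3) == i for i, t in enumerate(images))
# ... and the natural order of Z_b becomes lexicographic order of the tuples
assert images == sorted(images)
print(images[:4])                      # (0,0,0) (0,0,1) (0,0,2) (0,1,0)
```

The order-preservation check is what guarantees that the triple loop below generates the same computation ordering as the original single loop.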
The above mapping may be carried out quite naturally in a SENLP. The SENLP will consist of nested "for" statements and a single expression. If one of the for loops had the form

for (i = 0 : b-1)

and b = a1·a2·a3, then the above statement may be written in the transformed form as

for (i1 = 0 : a1-1)
for (i2 = 0 : a2-1)
for (i3 = 0 : a3-1)

This loop generates an index set that may be viewed as a set of tuples (i1,i2,i3). The set of these tuples are all of the elements of the ring Za1 × Za2 × Za3. Furthermore, the tuples are generated in lexicographical order, so the triple loop has the same effect as the single loop and performs the mapping φ. The mappings may be performed either mathematically, or by rewriting the program. The total index set map φI is the composite of applying these transformations to selected dimensions of the index vector. If φI is applied to the index set so that a total of k extra dimensions are produced, the transformed index set may be viewed as a subset of Z^(n+k). The inverse map is linear and may be represented as a matrix. The tuple (i1,i2,...,i(n+k)) is represented as a vector I' = [i1,i2,...,i(n+k)]^T, and the inverse map as a matrix Γ^(-1) ∈ Z^(n×(n+k)) so that I = Γ^(-1) I'.
Once the index vector has been transformed by φI, the indexing matrices will need to be altered to operate on the transformed index vector. Since viy = Fy I, and I = Γ^(-1) I', the new indexing matrices are given by Fy' = Fy Γ^(-1). The transformed indexing matrices now have dimensions Fy' ∈ Z^(m×(n+k)). If the mapping is carried out at the program level as described above, the indexing matrices in the expression must be replaced by the transformed indexing matrices Fy'. If the mapping is carried out automatically, the above expressions may be used to compute the new indexing scheme without involving the programmer.
Now the mathematical model exists to transform a SENLP to another SENLP with an altered index set and expression. If the transform is chosen wisely, the transformed program will generate an algorithm that will readily map to a fixed sized systolic array. The extra temporal dimensions in the transformed index vector will cause extra dimensions to arise in the time-processor space, requiring data to make multiple passes through the array. The following example will illustrate the method.
Example 4.2: Map the 12 × 12 matrix multiply program below to a 2 × 2 systolic array. Rewrite the program to perform the mapping φI. Find the indexing matrices in the transformed program. Evaluate the data velocities and distributions of the resulting algorithmically specified systolic array. Determine the complexity of the algorithm and determine the total execution time. Compare with the previous method.
P1: for (i = 0 : 11)
for (j = 0 : 11)
for (k = 0: 11)
Zij = Zij + Xik Ykj
Solution: Select the first coordinate, i, for transformation. Note that 12 = 2 × 2 × 3 so that the above loop may be rewritten,

P2: for (i1 = 0 : 1)
for (i2 = 0 : 1)
for (i3 = 0 : 2)
for (j = 0 : 11)
for (k = 0 : 11)

so that the transformed index vector is I' = [i1,i2,i3,j,k]^T and I' ∈ Z^5, π ∈ Z^(3×5), S ∈ Z^(2×5), n = 5, q = 3 and m = 2. The new index vector may be computed from the original using I' = [i div 6, (i div 3) mod 2, i mod 3, j, k]^T. Now that the index vector has been transformed, the indexing matrices must also be transformed. The transformed indexing matrices are given by Fy' = Fy Γ^(-1). In this example,

Γ^(-1) = [6 3 1 0 0]
         [0 0 0 1 0]
         [0 0 0 0 1]

so that the transformed indexing matrices become

Fz' = [6 3 1 0 0]    Fx' = [6 3 1 0 0]    Fy' = [0 0 0 0 1]
      [0 0 0 1 0]          [0 0 0 0 1]          [0 0 0 1 0]
Next S is selected as

S = [1 0 0 0 0]
    [0 1 0 0 0]

Now S : I'^5 → { [0,0]^T, [0,1]^T, [1,0]^T, [1,1]^T } so that only four processors are needed to execute the algorithm. A time transform will be selected to satisfy the program dependencies and matrix invertibility constraints of T and Ty as

π = [1 0 1 0 0]
    [0 1 0 1 0]
    [1 0 0 0 1]
The above transform was selected to generate a regular time trace and meet the constraint that its rows are linearly independent of the rows of S and each Fy'. Now the data velocities and distributions may be computed using Sy = S Ty^(-1) to be

Sz = [-1/5 -3/5 0 1/5 3/5]    Sx = [  0    0   1    0   -1 ]    Sy = [0 0 1 -1  0]
     [  0    1  0   0  -1 ]        [-1/3   0 -5/3  1/3  5/3]         [0 1 0  0 -1]

The worst case execution time is when the clock runs from t = [0,0,0]^T to t = [3,12,12]^T with no wasted cycles removed. In this case a total of 4*13*13 = 676 cycles would be required for execution. The serial algorithm requires 12*12*12 = 1728. In this case the algorithm has a speedup of 1728/676 = 2.55. Since 4 processors are available, the PE efficiency is 2.55/4 = 63%.
The above example mapped a 12 × 12 algorithm to a 2 × 2 array. The direct mapping would have created an array algorithm with a single clock, while the partitioned algorithm used three clocks. Each data wave has one distinct velocity vector associated with each clock. These extra degrees of freedom in the data movement allow the data to be channeled through the fixed size array. The array algorithm would require a controller at the array boundary moving data on and off the array and into off-array memory. The data velocities are on the order of 1/3 to 1/5 so that a small amount of memory would need to be distributed throughout the array in order for the algorithm to execute. This mapping technique generated an array algorithm suitable for execution on a fine grain systolic array.
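The claim that the pretransformed program indexes the same data can be spot checked numerically: for every i ∈ Z12, Fy' I' should equal Fy I. A small sketch (plain Python, matrices from this example):

```python
# Spot check Example 4.2: the transformed indexing matrices F' = F*Gamma^-1
# pick out the same variable indices as the originals, i.e. F'*I' == F*I.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

Fz = [[1, 0, 0], [0, 1, 0]]
Fx = [[1, 0, 0], [0, 0, 1]]
Fy = [[0, 0, 1], [0, 1, 0]]
Ginv = [[6, 3, 1, 0, 0],          # inverse map: i = 6*i1 + 3*i2 + i3
        [0, 0, 0, 1, 0],
        [0, 0, 0, 0, 1]]

for F in (Fz, Fx, Fy):
    Fp = matmul(F, Ginv)          # transformed indexing matrix F' = F*Gamma^-1
    for i in range(12):
        for j in range(0, 12, 5):
            for k in range(0, 12, 7):
                I  = [i, j, k]                            # original index vector
                Ip = [i // 6, (i // 3) % 2, i % 3, j, k]  # transformed index vector
                assert matvec(F, I) == matvec(Fp, Ip)
print("F'*I' == F*I for all sampled indices")
```

This is the programmer-free route mentioned above: the new indexing scheme is obtained purely by the matrix product with Γ^(-1).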
4.3 Processor Domain Posttransformations
In the last section integer bijections were applied to the index set prior to applying the algorithm transformation, T. The general formulation was TφI : I → J so that J = TφI I. In this section the integer transformation will be applied directly to the time-processor vector, J. Now T : I → J and φJ : J → J' so that φJT : I → J'. The transformation φJ does not affect the time portion of the time index vector, but only the processor portion. Thus the map may be viewed as a q-dimensional identity transformation augmented with an integer transformation that maps the m processor dimensions into m' processor dimensions. The processor transformation portion of the transformation may be applied directly to a processor coordinate vector to find the new processor coordinates. This map has the form φP : P → P' so that P' = φP P.
One purpose φJ may be used for is to map m dimensional arrays to m+k dimensional arrays. For example, it can be used to map a one dimensional array to a two dimensional array if a two dimensional array is available to execute an algorithm that operates on one-dimensional data such as a one-dimensional convolution. If a three or higher dimensional architecture were available, the transformation φJ would readily map a lower dimensional algorithm to that structure. Of course, due to current technological constraints, systolic arrays of more than two dimensions are rarely built. If such structures do become available in the future, then these methods could be applied to map algorithms into them.
Another use of φJ is to partition an algorithm to execute in an architecture with distributed memory and addressing units. The large virtual array is mapped to a higher dimensional virtual array. The transform is selected so that two dimensions of the transformed virtual array are the same size as the physical array. Next each physical processor is assigned the set of virtual processors that are orthogonal to the physical processor's coordinates. The mapping will give a way to arrange data in the distributed memories as well as provide an addressing scheme to transfer data between local memories and the PE's arithmetic units.
The position functions tell precisely how each data wave propagates through the virtual arrays. The position functions of the transformed virtual array are given by

P'y(viy) = φP Py(viy) = φP(vy t + δy viy)

Thus the initial positions are given by P'y = φP δy viy and the velocities in the transformed array are given by v'y = φP vy. If the integer division algorithm is used to compute the map, then movement along an axial direction will transform into a lexicographical movement in the primed dimensions. If the Chinese remainder theorem method is used, movement will be along diagonals in the transformed virtual processor space with edge wrapping. The first flow is simpler to control so that is the map that would most often be selected.
A simple use of the processor domain mapping is to get a one-dimensional algorithm to execute on a two-dimensional array. Example 4.3 will illustrate how the mapping forces data to flow on a higher dimensional array when the dimensions are available. The other use of the processor map φP is illustrated in Example 4.4. In this example an algorithm will be partitioned by restructuring the index set so that virtual processor dimensions are created. The subsets of virtual processors with the same physical processor coordinates will be mapped to the physical processor. Data movement in virtual dimensions will be carried out via addressing in the physical processor's local memory.
Example 4.3: Map the convolution algorithm described in the program below to a 4 × 3 array using the relation J' = φJ T I to map the algorithm. Select φJ by applying the integer division algorithm and the Chinese remainder theorem methods. Analyze data flow for each method.

P: for (i = 0 : 11)
for (j = 0 : 22)
Zi = Zi + Xj Yi-j
Solution: Here the indexing matrices are given by

Fz = [1 0]    Fx = [0 1]    Fy = [1 -1]

To get the minimum time algorithm and satisfy the dependencies (dz = [0 1]^T, dx = [1 0]^T, dy = [1 1]^T) we select π = [1 1]. To make the Z-variables compute in place and be distributed one per processor, set vz = 0 and δz = 1. This makes the Z synthesis equation specify that S = Fz = [1 0]. Now that the transform is known the data velocities and distributions may be computed to be

Sz = [0 1]    Sx = [1 -1]    Sy = [1/2 1/2]
The matrices above show that the algorithm mapped to a one dimensional array. The array and x0's path through the array are depicted in figure 4.4. The PEs are numbered and zi is computed in processor i. The Y-variables are not shown but move in the same direction as the x variables but at half the speed.
Figure 4.4: 1D Systolic Convolution Array with xo Position Trace Shown
Two maps may be applied to the above array to map it to a 3 x 4 array. The
first map, based on the integer division algorithm, is given by P' = ΦpP = (P1', P2')
= (P div 4, P mod 4). The new processor vector is thus P' = [P1', P2']T. If this
mapping is used, the systolic array of figure 4.4 is transformed to the systolic array
of figure 4.5. The CRT map is P' = ΦpP = (P1', P2') = (P mod 3, P mod 4). The
new processor vector is then given by P' = [P1', P2']T. The map exists because 3
and 4 are relatively prime. The array and the x0 position trace are shown in figure
4.6. The data flow is now along the diagonal directions. This map requires more
I/O overhead by the array I/O controller and requires eight-neighborhood connections
instead of four-neighborhood connections.
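Both processor maps are easy to state concretely. A minimal sketch, assuming the twelve PEs of the linear array are numbered P = 0..11 (function names are illustrative):

```python
# The integer-division map and the CRT map of Example 4.3, both taking a
# linear-array PE number P in Z_12 onto a 3 x 4 grid.
def div_map(p):
    return (p // 4, p % 4)      # P' = (P div 4, P mod 4)

def crt_map(p):
    return (p % 3, p % 4)       # P' = (P mod 3, P mod 4); needs gcd(3,4) = 1
```

Both maps are bijections onto the 3 x 4 grid; under the CRT map consecutive PE numbers differ in both coordinates, which is the diagonal data flow noted above.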
Figure 4.5: 1D Convolution on 2D Array with x0 Position Trace Shown Using
Division Algorithm Mapping
Figure 4.6: 1D Convolution on a 2D Array with x0 Position Trace Shown Using
CRT Based Mapping.
Example 4.4: Map a 16 x 16 matrix multiply to a 4 x 4 array where each PE has
local memory and a memory addressing unit.
Solution: First map the algorithm to a 16 x 16 array with the outputs computed in
place. This is the same mapping used in example 4.1, so the indexing matrices are
the same, π = [1 1 1], S = Fz, and the parameter matrices are given by

Sz = [0  1  0]   Sx = [0  1  0]   Sy = [1 -1 -1]
     [0  0  1]        [1 -1 -1]        [0  0  1]
select the processor transformation as

P' = ΦpP = [ P1, P2 div 4, P2 mod 4 ]T

so that the position of an X datum, whose untransformed second coordinate
advances as p2(t) = p2(0) + t, becomes

P'x(t) = [ P1, p2(t) div 4, p2(t) mod 4 ]T

while νy points along the virtual dimension 0 ≤ P'1 ≤ 15, so the Y data never
cross physical processor boundaries.
Figure 4.7: Dimensions of Transformed Virtual Array and Nonzero Data Velocities
of Example 4.4
The virtual array space may be viewed as shown in figure 4.7. If the virtual array
is to be mapped to a 4 x 4 physical array, then sixteen virtual processors must be
mapped to each physical processor. The assignment may be made by assigning
physical PE(P'2, P'3) the set of virtual processor nodes {[0 P'2 P'3]T, [1 P'2 P'3]T,
..., [15 P'2 P'3]T}. This assignment causes the Z's to sit still, but an entire column
is assigned to each processor. Since the Y's move in a direction that is orthogonal
to the physical array, they do not move from processor to processor in the physical
array but reside on a single physical PE as the Z's do. Thus an entire column of
the Y matrix is assigned to reside in memory on each PE. The column of X and Y
data that resides on PE(P'2, P'3) is column number 4P'2 + P'3. Each PE will be
required to compute one column of sixteen output variables.
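The assignment of virtual nodes and data columns to physical PEs can be sketched directly from these formulas (function names are illustrative):

```python
# Each physical PE (p2, p3) of the 4 x 4 grid hosts sixteen virtual nodes
# [p1 p2 p3], p1 = 0..15, and stores X/Y column 4*p2 + p3 in local memory.
def virtual_nodes(p2, p3):
    return [(p1, p2, p3) for p1 in range(16)]

def column_on_pe(p2, p3):
    return 4 * p2 + p3
```

Together the sixteen PEs cover all 256 virtual processors exactly once, and the column numbers 0 through 15 are assigned one per PE.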
Figure 4.8: Unpartitioned Array's Initial 0th Row Data Alignment
So far the resources have been allocated to execute the algorithm, but the
scheduling still needs to be done. This amounts to specifying how the X data will be
sequenced through the array and how the local Y and Z memories will be
addressed. To schedule the virtual array, note that the first computation in the
algorithm is scheduled to execute in PE P = [0 0]T at time t = 0. The X's will move
in a manner similar to that depicted in figure 4.5. The 0th row of the virtual array is
shown in figure 4.8. The Φp map will transform the bottom row of processors of
figure 4.8 to the 4 x 4 grid of processors shown in figure 4.9. Instead of waiting for
their X data to pass from the bottom row of the array to the next rows in its data
position trace, the X data stream will be injected into all rows of the processor
array simultaneously. The memory pointers will be initialized as shown in figure 4.9,
and one output will be computed in each processor every sixteen cycles. The Y
memory pointer will be incremented each clock cycle and wrapped modulo sixteen.
The Z memory pointer will be used to address output data and will be incremented
every sixteen cycles. At the end of the array execution, each cell will
contain one column of the output matrix.
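The pointer discipline in each PE can be sketched as follows; the trace format is an illustrative assumption:

```python
# One PE's memory addressing in the partitioned array: the Y pointer wraps
# modulo sixteen every cycle, and the Z pointer advances every sixteen
# cycles, so one output element is finished per sixteen-cycle sweep.
def pe_address_trace(cycles=256):
    y_ptr, z_ptr, trace = 0, 0, []
    for t in range(cycles):
        trace.append((y_ptr, z_ptr))   # addresses used at cycle t
        y_ptr = (y_ptr + 1) % 16
        if (t + 1) % 16 == 0:
            z_ptr += 1
    return trace
```

Over 256 cycles the Y pointer sweeps its sixteen-word memory sixteen times while the Z pointer walks once through the sixteen output locations.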
Figure 4.9: Physical Array with Y Memory and Initial Y and Z Memory Pointer
Values Displayed
Now consider the complexity of the algorithm. The serial algorithm takes 4096
cycles, while the unpartitioned array algorithm takes 46 cycles. The partitioned
algorithm takes 260 cycles. Both the partitioned and the unpartitioned algorithms
then need to have the results flushed from the array. The unpartitioned array has
dimensions 16 x 16 so that 256 processors are used. The serial algorithm takes
4096 cycles on one processor, which runs at 100% efficiency. The unpartitioned
algorithm completes 4096 operations on 256 processors, so each PE must complete
4096/256 = 16 computations in 46 cycles; the PEs therefore operate at 16/46 =
34.8% efficiency. The partitioned array has dimensions 4 x 4 so that sixteen
processors are used. Each processor must complete 4096/16 = 256 operations in 260
cycles, so the PEs operate at 256/260 = 98.5% efficiency. So we see that the
algorithm took longer to execute on the partitioned array, but the PEs were put to
much better use.
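The efficiency bookkeeping follows from counting multiply-accumulates: a 16 x 16 matrix multiply performs 16^3 = 4096 of them, and each architecture divides that work differently (one operation per PE per cycle is assumed throughout):

```python
# Efficiency comparison for the serial, unpartitioned, and partitioned cases.
ops = 16 ** 3                       # 4096 multiply-accumulates in total

serial_eff = ops / (1 * 4096)       # one PE, 4096 cycles
unpart_eff = (ops / 256) / 46       # 256 PEs, 46 cycles: 16 ops per PE
part_eff = (ops / 16) / 260         # 16 PEs, 260 cycles: 256 ops per PE
```

The partitioned array trades total execution time for a nearly sevenfold gain in per-PE utilization.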
4.4 Array Synthesis Using Index Transformations
In this section integer bijections will be applied to variable indexing vectors.
This will relax the previous requirement that each of the variable index vectors
vIY1, vIY2, ..., vIYr ∈ Zm. For example, this will allow algorithms with
three-dimensional variable index vectors to be mapped directly to two-dimensional
arrays. Also, if some of the variable index vectors are of different dimensions, an
array may now be synthesized using the new transformation.
Let ΦY : vIY → v'IY, where a subcomponent of ΦY maps the elements of Zb1 x Zb2
to the elements of Zb1b2 using the inverse division algorithm linear map. That is,
the map ΦY maps from many dimensions to fewer, while the previous maps ΦI, ΦJ
and ΦP mapped from few dimensions to many. For example, to map the two-dimensional
variable index vector of X ∈ Rn x n to a one-dimensional variable index
vector, the map would be ΦXvIX = [n 1]vIX. Since the transform is linear, it may
be postmultiplied by the variable indexing matrix to make a new variable indexing
matrix given by F'Y = ΦYFY. The new time-index matrix is then given by

T'Y = [ π ]
      [ ΦYFY ]

By setting S'Y = S(T'Y)-1, it can be seen that S'Y = [ νY | δ'Y ], where δ'Y is the
distribution matrix for Y using the transformed variable indices. The position
function is readily given in terms of time and the original variable indices as

PY = νY t + δ'Y v'IY = νY t + δ'Y ΦY vIY

If I is factored from the above equation, the synthesis equation becomes

S = S'Y T'Y = νY π + δ'Y ΦY FY
This form of the synthesis equation allows an extra degree of freedom to the
array designer. Now a transformation ΦY is placed before the variable index matrix
to change its dimensions. Two examples will illustrate how the method may be
applied. The first example uses the transformation to map a matrix multiply
directly to a one-dimensional array. The second example will map a matrix-vector
multiply to a one-dimensional array. Without the variable index pre-transformation,
the matrix-vector multiply could not be mapped.
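The index bijection and its inverse are simple to state in code. A minimal sketch for the [n 1] map discussed above (function names are illustrative):

```python
# Flatten a two-dimensional variable index (a, b) in Z_n x Z_n to the
# single index n*a + b, i.e. phi = [n 1]; the inverse division-algorithm
# map recovers (a, b).
def flatten(v, n):
    a, b = v
    return n * a + b

def unflatten(s, n):
    return (s // n, s % n)
```

Because the map is a bijection between Z_n x Z_n and Z_{n*n}, no two distinct two-dimensional indices collide after flattening.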
Example 4.5: Map a 4 x 4 matrix multiply to a one-dimensional systolic array by
directly synthesizing the algorithm using the extended synthesis equations.
Solution: In the matrix multiply the indexing matrices are given by

Fz = [1 0 0]   Fx = [1 0 0]   Fy = [0 0 1]
     [0 1 0]        [0 0 1]        [0 1 0]
To make the algorithm map to a one-dimensional array, the variable index vectors
must all map to one-dimensional variable index vectors through the mapping ΦY.
Because both components of the variable index vectors for each variable range
from zero to three, the transformation ΦY = [4 1] may be used in all three synthesis
equations. The transformed variable index matrices are then given by

F'z = [4 1 0]   F'x = [4 0 1]   F'y = [0 1 4]
Because I ∈ Z3, the algorithm transformation must have dimensions T ∈ Z3x3. To
map to a one-dimensional array, S ∈ Z1x3, so that π ∈ Z2x3. With the transformed
indexing matrices being of dimension 1 x 3, the transformed time-index matrices T'
will all have the proper dimensions T' ∈ Z3x3. Still, π must be selected so that the
original program dependencies map to lexicographically positive temporal vectors,
and so that each of the time-index matrices is invertible. To achieve this and to
achieve a minimum time array execution, select an appropriate π ∈ Z2x3.
Next the synthesis equation S'YT'Y = S will be written in transposed form as
T'YTS'YT = ST, or [ πT | F'YT ]S'YT = ST. This equation is in the form of a standard
linear system and may be readily solved. The three synthesis equations may be written

Z: [ πT | F'zT ]S'zT = ST   X: [ πT | F'xT ]S'xT = ST   Y: [ πT | F'yT ]S'yT = ST
We shall constrain only the Z variable's movement and distribution to be
νz = [1/2 1] and δ'z = 1. The Z equation then gives S = [1 0 -1]. Next the X and
Y equations are solved to yield the full set of parameter matrices

S'z = [1/2 1 1]   S'x = [0 4/3 1/3]   S'y = [2 1]
The processor array's size is determined by examining the processor set generated
by S : Ip → Pm. Here P ∈ Pm (m = 1) is in the range -3 ≤ P ≤ 3, so that a linear
array of seven processors is needed. The execution time of the algorithm is found
by examining the time trace generated by the mapping π : Ip → tq (q = 2). Here the
components of the temporal vector range as 0 ≤ t1 ≤ 6 and 0 ≤ t2 ≤ 3, so that a
total of 7*4 = 28 cycles are required to execute the algorithm. There are 64 total
computations to be performed on seven processors, so that at 100% efficiency the
algorithm would take 64/7 = 9.14 cycles to execute. The PE efficiency of the
algorithm is then 9.14/28 = 32.7%. In comparison, the maximum PE efficiency on a
two-dimensional array is 40%, but a total of sixteen processors must be available and
extra communication paths are also needed.
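The cycle and efficiency counts quoted above amount to the following arithmetic:

```python
# Bookkeeping for Example 4.5: 64 multiply-accumulates on a seven-PE
# linear array over a 7 x 4 temporal grid.
total_ops = 4 ** 3                  # 64 operations in the 4 x 4 multiply
pes = 7                             # processors P = -3..3
cycles = 7 * 4                      # t1 = 0..6 paired with t2 = 0..3
ideal = total_ops / pes             # about 9.14 cycles at 100% efficiency
efficiency = ideal / cycles         # about 32.7%
```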
Example 4.6: Map the matrix-vector multiply algorithm generated by the program
below to a one-dimensional array.
Program: for (i = 0 : 3)
for (j = 0 : 3)
Zi = Zi + Xij Yj
Solution: First note that

Fz = [1 0]   Fx = [1 0]   Fy = [0 1]
                  [0 1]
Also note that dz = [0 1]T, dx = [0 0]T and dy = [1 0]T, so that π = [1 1] will satisfy
the dependence requirements of the algorithm. Next note that I ∈ Z2, so that T ∈ Z2x2
and S ∈ Z1x2. If the time-index matrices are constructed, Tz and Ty will be of the
proper dimensions, but Tx will not. Previously this would cause serious problems
synthesizing the array, but now select ΦX = [4 1] so that F'x = ΦXFx = [4 1].
Now the array may be synthesized using the original Z and Y synthesis equations
and the extended synthesis equation for the X variables. The synthesis equations
may again be written in their transposed form [ πT | FT ]S'T = ST where
now Fz = F'z, Fy = F'y and Fx = [4 1]. The equations may then be written

Z: [1 1; 1 0][νz δz]T = ST   X: [1 4; 1 1][νx δ'x]T = ST   Y: [1 0; 1 1][νy δy]T = ST
Set νz = 1 and δz = -1 so that the Z equation specifies S to be S = [0 1]. This causes
the algorithm to map to a one-dimensional array with four PEs and the data flow
given by the parameter matrices below.

Sz = [1 -1]   S'x = [4/3 -1/3]   Sy = [0 1]
The above examples show how to directly synthesize algorithmically specified
arrays using the variable index bijection. These new transforms give more degrees
of freedom to the design process. The examples in this chapter were all simple and
designed to illustrate particular techniques. In a real problem a design may use
more than one of the transformations introduced in this chapter. For example,
suppose example 4.6 required the matrix-vector multiplication to be mapped to a
2 x 2 array. Then the map ΦpP = [P div 2, P mod 2]T could be applied to the
processor array synthesized in the example to generate the required array.
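The schedule of Example 4.6 can be checked with a small simulation: with π = [1 1] and S = [0 1], computation (i, j) executes on PE j at time i + j. The sketch below (names are illustrative) verifies both the result and the absence of PE/time conflicts:

```python
# Simulate the one-dimensional matrix-vector array of Example 4.6.
def systolic_matvec(x, y):
    n = len(y)
    z = [0.0] * n
    busy = set()                       # (pe, t) pairs already used
    for i in range(n):
        for j in range(n):
            pe, t = j, i + j           # PE = S.I, time = pi.I
            assert (pe, t) not in busy # one operation per PE per cycle
            busy.add((pe, t))
            z[i] += x[i][j] * y[j]
    return z
```

Because (j, i + j) determines (i, j) uniquely, no processor is ever asked to perform two operations in the same cycle.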
CHAPTER FIVE
NUMERICAL LINEAR ALGEBRAIC ARRAYS
Numerical Linear Algebra (NLA) routines may be characterized in terms of
their loop and indexing structures. They normally consist of simple regular loops
with linear indexing performed on the variables. There are usually multiple expres
sions that often occur at different levels of nesting in the loop structure. These
programs normally generate algorithms with nonuniform dependence graphs.
This chapter will focus on a method to design systolic arrays to implement such
algorithms.
The method for designing arrays for NLA routines will be a simple extension of
the previously developed methods. A program transformation will be applied to a
Multi-Level Multiple Expression Program (MLMEP) to convert it to a Multiple
Expression Nested Loop Program (MENLP). In the MENLP all expressions are
found at the deepest level of nesting. A matrix representation is developed for the
MENLP that allows the array synthesis methods of chapters three and four to be
applied. Some NLA programs transform to MENLPs where more than one expression
appears at a given point in the index set. Such algorithms may also be
mapped if certain assumptions are made about the PEs. Namely, they must be able
to execute small subroutines and have a limited amount of local storage.
5.1 Multiple Expression Nested Loop Programs (MENLPs)
The MENLP is a generalization of the SENLP where now different expressions
or lists of expressions may appear at different points in the program's index set.
 88 
 89 
Also, in an MENLP a given variable may be indexed by more than one indexing
matrix in a single expression. As was mentioned in chapter two, any program may
be written as a nested loop program provided conditional statements appear at the
deepest level of nesting. The transformation to the MENLP involves expanding the
index set so that a unique index vector exists for each expression or list of
expressions. Also, conditional statements are inserted in the inner loop so that the
operations will be executed in the proper order. The MENLP is defined below.
Definition 5.1: A Multiple Expression Nested Loop Program (MENLP) is a
program P = (Ω1, Ω2, F, E) with the following conditions:
1) The index set is lexicographically ordered and is generated by an
n-level nested loop.
2) E(I) is an expression or list of expressions that must be evaluated at
the point I ∈ P.
3) There exist global variables Y such that Y ∈ Rm and m ≤ n.
These variables are indexed using indexing matrices Fi : I → vIY,
vIY ∈ Zm, where Y[vIY] is an individual element of Y. Local
variables may also appear in E(I) that are dependent only on
other variables referenced in E(I).
In the above definition a distinction was made between global and local variables.
The global variables are the matrix variables processed by the program,
while the local variables are intermediate variables used at the expression node
E(I). This causes the placement of E(I) in the program dependence graph to be
determined solely by the global variables referenced in E(I).
Using the definition above, a matrix representation may be formulated for an
MENLP. Because all dependencies in the MENLP are determined by the global
variables that are indexed via linear or affine indexing matrices, the global
dependence structure of the program may be easily determined. The indexing matrices
determine the dependence vectors, while the boundary matrices determine the
index set. The expression function determines which indexing matrices are used
and for which variables at a given point I ∈ P. This leads to the matrix-based
representation for an MENLP given below.
P = (Ω1, Ω2, F1, F2, ..., Fr, E)
The MENLP is then executed using
for ( I = Ω1 : Ω2 ) E(I)
where Ω1 and Ω2 are the boundary matrices as defined in chapter three. The
matrices F1, F2, ..., Fr comprise the set of indexing matrices that are used to
reference the global variables. The expression function evaluates to the expression E(I)
at the index point I. The expression E(I) contains global and local variables, the
variable indexing transformations, and the arithmetic operation description.
Normally an NLA program will be simpler than may be implied by the above
model. The expression function will usually evaluate to between one and five
expression lists for an arbitrarily sized index set. Also, the expression lists will
normally contain only one expression with no local variables. Occasionally, however, an
expression list with several expressions will be needed at selected points in the index set.
Sections 5.3 and 5.4 will illustrate what typical NLA programs look like in MENLP
form.
5.2 NLA Mapping Methodology
NLA algorithmically specified arrays may be designed using a systematic
approach. First the NLA routine is converted to an MENLP and then the algorithm is
mapped using an approach similar to the one used to map SENLPs. The first step
is to translate the MLMEP to an MENLP. For example, consider the LU
decomposition, which may be written
for (k = 1 : n-1)
for (i = k+1 : n)
aik = aik / akk
for (j = k+1 : n)
aij = aij - aik*akj
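The loop above is directly runnable once translated to 0-based indices; a sketch (in-place, no pivoting, which the program text tacitly assumes):

```python
# In-place LU decomposition following the MLMEP above: after it runs,
# the multipliers l_ik sit below the diagonal and U sits on and above it.
def lu_inplace(a):
    n = len(a)
    for k in range(n - 1):
        for i in range(k + 1, n):
            a[i][k] = a[i][k] / a[k][k]           # multiplier l_ik
            for j in range(k + 1, n):
                a[i][j] = a[i][j] - a[i][k] * a[k][j]
    return a
```

For a 2 x 2 example, [[4, 3], [6, 3]] factors so that the stored multiplier is 6/4 = 1.5 and the trailing entry becomes 3 - 1.5*3 = -1.5.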
The above loop has one expression at the second level of nesting and another at
the third, making it an MLMEP. The MLMEP is translated to an MENLP by writing
an equivalent nested loop program with conditional statements at the deepest level
of nesting. Consider the program below.
for (k = 1 : n-1)
for (i = k+1 : n)
for (j = k : n)
if (j = k) aik = aik / akk
else aij = aij - aik*akj
The above program is an MENLP and may be verified to be input-output
equivalent to the previous MLMEP. Note that in the MENLP the j index ranges
from k to n, whereas in the MLMEP j ranges from k+1 to n. This creates index
vectors in the MENLP for the computations that occurred at the second level of
nesting in the MLMEP. In the MLMEP these computations could not be assigned
index vectors because the j index was not defined at the second level of nesting.
With the above modification to the index set, each computation in the algorithm
generated by the above program will be assigned a unique index vector.
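The claimed input-output equivalence is easy to check by running both loop forms on the same data (a sketch; 0-based indices):

```python
# The MLMEP and MENLP forms of the LU loop produce identical results:
# the MENLP merely extends the j range to start at k and guards the
# multiplier computation with the (j = k) conditional.
def lu_mlmep(a):
    n = len(a)
    for k in range(n - 1):
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]
            for j in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]
    return a

def lu_menlp(a):
    n = len(a)
    for k in range(n - 1):
        for i in range(k + 1, n):
            for j in range(k, n):
                if j == k:
                    a[i][k] /= a[k][k]      # the (j = k) branch
                else:
                    a[i][j] -= a[i][k] * a[k][j]
    return a
```

For any matrix, both functions perform the same operations in the same order per (k, i) pair, so the in-place results agree exactly.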
In a serial computer the above MLMEP is more efficient than the MENLP. If
the MENLP were executed on a serial computer, the condition (j = k) would need to
be tested once for each computation that occurred, causing a large overhead. This
is of no concern because the MENLP is only an intermediate form used to represent an
algorithm, and the conditional statements will never be executed at run time.
Rather, the conditional statement is a compiler directive of sorts. It tells the
compiler what operation a given processor must compute at a time t in the array.
The set of all indexing matrices and the way they are used in the MENLP
dictates the structure of the computational dependence graph. The only variable in
the program is a. Yet this variable is indexed in four different ways, and thus the
MENLP has four indexing matrices. The indexing matrices F1, F2, F3 and F4 are
defined for aij, aik, akj and akk respectively. With the index vector I = [k i j]T, the
indexing matrices are given by

F1 = [0 1 0]   F2 = [0 1 0]   F3 = [1 0 0]   F4 = [1 0 0]
     [0 0 1]        [1 0 0]        [0 0 1]        [1 0 0]
The fact that the rank of F4 is one creates a problem in computing the velocity and
distribution vectors. The velocity and distribution vectors are determined by
computing the parameter matrices as was done in chapters three and four. A time-index
matrix is formed by augmenting the time transform with each indexing matrix.
The kth parameter matrix is then given by Sk = STk-1. This clearly causes a problem
in computing the parameter matrix S4. The time-index matrix T4 will be singular,
so the inverse will not exist. This problem may be alleviated by altering the
MENLP further. The condition for the expression referencing akk to be executed is
that k equals j. Thus the program may be rewritten once more as
for (k = 1 : n-1)
for (i = k+1 : n)
for (j = k : n)
if (j = k) aik = aik / akj
else aij = aij - aik*akj
where now only the three nonsingular indexing matrices are used. This technique
will often need to be applied to the indexing matrices that come from expressions
that are moved to a deeper level of nesting in the program transformation.
With the above modification, all velocity and distribution vectors may be
computed, but the meaning of these parameters needs to be interpreted. There is one
variable in the algorithm and there are three sets of velocity and distribution vec
