A SYSTOLIC DISTRIBUTED ARITHMETIC COMPUTING MACHINE FOR
DIGITAL SIGNAL PROCESSING AND LINEAR ALGEBRA APPLICATIONS
By
GIN-KOU MA
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN
PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
1989
To my parents,
whose love and
unwavering support have never diminished
ACKNOWLEDGMENTS
I would like to take this opportunity to thank my advisor, Dr. Fred J. Taylor, who has given me financial support and an excellent learning environment for the years leading to my dissertation. In these years, I have had the great pleasure of working for, and with, him. He has not only been an advisor and mentor, but a friend as well. He has taught me how to learn from failures and from successes. His kindness will be kept in my heart forever.
I would also like to thank my supervisory committee members, Dr. J. Staudhammer, Dr. Y.C. Chow, Dr. J.C. Principe, and Dr. H. Lam. They offered their valued wisdom on many occasions. Their knowledge and insight have kept me from going too far astray in the technical literature.
Special thanks go to Monica Murphy; her help has been too great to acknowledge with mere thanks. Thanks also go to Michael F. Griffin, Eric Dowling, Mike Sousa, Neil Euliano, Chintana Griffin, Preeti Rao, and Rom-Shen Kao.
I would, especially, like to thank Sheue-Jen; without her support and patience, this dissertation would never have been completed.
Finally, I would like to thank my parents, who have given me all they have.
TABLE OF CONTENTS

ACKNOWLEDGMENTS ............................................... iii
LIST OF TABLES ................................................ vi
LIST OF FIGURES ............................................... vii
ABSTRACT ...................................................... ix

CHAPTER 1 INTRODUCTION ........................................ 1
    1.1 DSP System Architecture
    1.2 High-Performance DSP Processors
    1.3 System Integration
    1.4 Method Of Study

CHAPTER 2 FAST POLICIES FOR ELEMENTARY
        ARITHMETIC FUNCTIONS COMPUTATION ...................... 9
    2.1 Fast Two-Operand Adders/Subtractors
    2.2 Multiplier Policies
        2.2.1 Traditional Methods And Cellular Arrays
        2.2.2 Non-Traditional Methods
        2.2.3 Multiplier/Accumulator
    2.3 Division
    2.4 Fast Arithmetic Function Computation Techniques
        2.4.1 Approximation Method
        2.4.2 CORDIC Technique
        2.4.3 Co-Transformation Method

CHAPTER 3 MULTI-PURPOSE FLOATING-POINT
        DISTRIBUTED ARITHMETIC UNIT ........................... 38
    3.1 Distributed Arithmetic Unit ........................... 39
    3.2 Floating-Point Distributed Arithmetic Unit (FPDAU) .... 42
    3.3 Throughput And Precision Of FPDAU ..................... 45
    3.4 Complex FPDAU (CFPDAU) ................................ 49
    3.5 Multi-Purpose FPDAU/CFPDAU ............................ 53

CHAPTER 4 SYSTOLIC ARCHITECTURE ............................... 69
    4.1 Systolic Arrays ....................................... 70
    4.2 Mapping Algorithms .................................... 73
    4.3 Algebraic Mapping Techniques .......................... 78
        4.3.1 Algorithm Model ................................. 79
        4.3.2 Transformation Function ......................... 83
        4.3.3 Example ......................................... 87
        4.3.4 Transformation Matrix And The Behavior
              Of The Systolic Array ........................... 94
    4.4 Partitioning .......................................... 97
        4.4.1 Moldovan And Fortes' Method ..................... 97
        4.4.2 Redistributing The Computation Method ........... 102
        4.4.3 Partitioning Of In-Place Algorithms ............. 104
        4.4.4 Partitioning For General-Purpose Systolic Systems 104
    4.5 Systolic Array Algorithm Synthesis .................... 105
    4.6 Multiple Dimensions Of Time ........................... 108
    4.7 Interrupt Of Systolic Arrays .......................... 115

CHAPTER 5 OPTIMAL PARTITIONABLE TRANSFORMATION
        AND THE RECONFIGURABLE SYSTOLIC ARRAY ................. 119
    5.1 Algebraic Modeling For The Algorithm .................. 121
        5.1.1 Dependence Vectors .............................. 125
        5.1.2 Velocity And Distribution Vectors ............... 131
        5.1.3 Systolic Array Algorithm Synthesis .............. 134
    5.2 Optimal Transformation Matrix ......................... 139
    5.3 Partitioning With The Modulus Theory .................. 143
    5.4 System Architecture ................................... 153

CHAPTER 6 CONCLUSIONS AND FUTURE RESEARCH ..................... 160

APPENDICES
    A WHY 2's COMPLEMENT ..................................... 167
    B A BIT-SERIAL DISTRIBUTED ARITHMETIC FILTER ............. 168

BIBLIOGRAPHY .................................................. 181

BIOGRAPHICAL SKETCH ........................................... 187
LIST OF TABLES

TABLE 2.1 The Floating-Point And The Fixed-Point
          Multiplier And Multiplier/Accumulator Chips ......... 11
TABLE 2.2 Summary Of The Co-Transformation Algorithm ......... 34
LIST OF FIGURES

FIGURE 2.1 Distributed Arithmetic Multiplier/Accumulator .............. 15
FIGURE 2.2 Typical Computing Step For CORDIC Algorithm ................ 25
FIGURE 2.3 The Implementation Of The CORDIC Arithmetic Unit ........... 27
FIGURE 2.4 The Evaluation Of Co-Transformation Function z0 = f(x0, y0)  29
FIGURE 2.5 The Unified Apparatus ...................................... 35
FIGURE 3.1 Core Distributed Arithmetic Unit ........................... 40
FIGURE 3.2 The Floating-Point Distributed Arithmetic Unit ............. 44
FIGURE 3.3 Latency Of The FPDAU vs. The Conventional
           Multiplier/Accumulator ..................................... 46
FIGURE 3.4 Complex Floating-Point Distributed Arithmetic Unit (CFPDAU)  52
FIGURE 3.5 The Real Part Of The Modified CFPDAU ....................... 54
FIGURE 3.6 Complex Distributed Arithmetic Multiplier .................. 55
FIGURE 3.7 Multi-Purpose Arithmetic Unit (MPAU) ....................... 57
FIGURE 3.8 The Systolic MPAU Array .................................... 66
FIGURE 4.1 3x3 Matrix-Matrix Multiplication (Design 1) ................ 71
FIGURE 4.2 3x3 Matrix-Matrix Multiplication (Design 2) ................ 72
FIGURE 4.3 Two 3x3 Matrix-Matrix Multiplications (Design 2) ........... 73
FIGURE 4.4 Index Set Of A 3x3 Matrix-Matrix Multiply Algorithm ........ 76
FIGURE 4.5 Geometric Mapping Of A Convolution [CAP83] ................. 77
FIGURE 4.6 Global Bus-Connection Architecture ......................... 81
FIGURE 4.7 Interconnection Matrix For A Systolic Array ................ 85
FIGURE 4.8 Systolic Array/Algorithm Synergism ......................... 88
FIGURE 4.9 Procedure Of Mapping The Matrix-Matrix Multiply (T1) ....... 92
FIGURE 4.10 Data Distribution And Velocity Of Matrix-Matrix Multiply (T2) 93
FIGURE 4.11 Partitioned Index Space [MOL86] ........................... 99
FIGURE 4.12 Bands And FIFO Queue Registers For Partitioning [MOL86] ... 101
FIGURE 4.13 Data Distributions Of 2-D Convolution Example ............. 112
FIGURE 4.14 Interrupt For A 3x3 In-Place Matrix-Matrix Multiplication . 116
FIGURE 5.1 Flowchart Of Finding The Transformation Matrix For
           A Systolic Array ........................................... 120
FIGURE 5.2 Dependency Vector .......................................... 125
FIGURE 5.3 Algorithm Synthesis For An Image Processing Example ........ 138
FIGURE 5.4 Partitioning Example Of A Matrix-Matrix Multiplication ..... 150
FIGURE 5.5 Reconfigurable Systolic Array Architecture ................. 155
FIGURE 6.1 3x3 Cholesky Matrix Decomposition .......................... 165
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
A SYSTOLIC DISTRIBUTED ARITHMETIC COMPUTING
MACHINE FOR DIGITAL SIGNAL PROCESSING AND LINEAR
ALGEBRA APPLICATIONS
By
Gin-Kou Ma
May 1989
Chairman: Dr. Fred J. Taylor
Major Department: Electrical Engineering
High-end, real-time signal and image processing, communications, computer graphics, and similar tasks are generally compute-bound rather than input/output-bound. As such, there is an immediate as well as continuing need to develop high-speed, high-performance, low-cost numeric data processors which can be integrated into small packages with low-power dissipation.
In this dissertation, a floating-point distributed arithmetic unit (FPDAU) is developed and extended to the design of a complex floating-point DA unit (CFPDAU). The new class of processor is shown to be an enhancement to existing fixed-point distributed arithmetic processors and, when compared to traditional arithmetic processors, is faster, more compact, and more precise. In addition, a means of partitioning a large system into smaller distributed arithmetic subsystems is presented, where each sub-SOP can also be used as a general-purpose processor. Finally, the processor study shows that by integrating a minimal additional amount of logic into the basic design, a fast multi-purpose arithmetic unit (MPAU) can be created which can also perform elementary arithmetic functions faster than existing state-of-the-art devices.
The intrinsic speed and power of the DA processor can be leveraged by parallel processing. A special type of parallel architecture that has particular utility to DSP and linear algebra applications is the systolic array [KUN78]. This dissertation has addressed several important issues which extend this traditionally application-specific architecture to more general-purpose problems. By dealing with such problems as optimally mapping algorithms to systolic arrays, partitioning algorithms onto small systolic arrays, and using rolling interrupts, plus introducing a reconfigurable architecture for various data flow requirements, the dissertation research begins to clear the way for general-purpose systolic array computing based on fast compact processors.
CHAPTER 1
INTRODUCTION
Signal processing, referring to the study of signals and systems, traces its origins back to the empirical science developed by designers of musical instruments. Today, signal processing technology is used in such areas as filtering, adaptive filtering, spectral analysis, neural networks, biomedical data processing, communications and sound reproduction, sonar and radar processing, data communication, seismic signal processing, computer graphics, and a host of other applications. Digital signal processing (DSP) is a relatively new technical branch of signal processing that is concerned with the study of systems and signals with respect to the constraints and attributes imposed on them by digital computing machinery.
Rapid advances in digital electronics have contributed to the rapid assimilation of DSP theory and technology into a host of problems. While it is tempting to view today's DSP devices as powerful (capable of performing several million operations per second), they fall far short of the multibillion-operations-per-second performance required to support real-time applications in the areas of computer vision, communications, speech and image understanding, and so forth. The theory of how to process these signals is now known, but because of the compute-bound nature of the current technology base, the technology remains untested. To build this bridge between the science of DSP and these high-end applications, new and powerful high-performance DSP tools and systems must be developed.
 
1.1 DSP System Architecture
To understand where research energies should be focused, it is important to define where the current state of the art has taken us. Prior to this decade, the increase in the speed of digital systems could be attributed to the increased speed and superior packaging of electronic parts. Although further advances in this direction will provide increased performance, this trend has slowed. Now, as we enter a new decade, it is apparent that parallelism is required. Here, a large number of individually powerful digital components will each work on a small portion of the total problem. Collectively, they will have the computational capacity to attend to those problems currently beyond the limits of today's DSP technology. The many possible forms that a parallel digital computing machine may take are called architectures. A special type of parallel architecture that has demonstrated a powerful signal processing capability is the systolic array [KUN78]. The systolic array is particularly adept at implementing the primary DSP operations, which are modeled as matrix-matrix, matrix-vector, or vector-vector multiplications, or sum-of-products arithmetic. This relatively new science, unfortunately, still exhibits some shortcomings which prohibit its direct insertion into high-end DSP applications. For example, a systolic array currently performs only as a special-purpose, single-task computing architecture. To extend the systolic array to the real-time, general-purpose DSP computing arena, the automatic synthesis of systolic programs, array reconfigurability, and interrupt servicing must be considered.
1.2 High-Performance DSP Processors
As stated, the major DSP operations are matrix-matrix, matrix-vector, and vector-vector multiplications, and sum-of-products (SOP) arithmetic. Since DSP is an arithmetic-intensive field of study, a key to developing fast systems is designing new, high-speed arithmetic processors. These innovative fast arithmetic processors must be capable of performing fast addition and multiplication and, to a lesser extent, division, logarithms, exponentiation, square roots, transcendentals, etc. This study is historically referred to as computer arithmetic.

Traditional fixed-point and floating-point computer arithmetic schemes have served the field of DSP well for over two decades. However, even in their fastest forms, they result in very expensive, complex, and large hardware packages and run at speeds which are too slow to provide a viable multiprocessor-system solution to the truly compute-bound problems. Over the last six years, researchers at the University of Florida have been developing several technologies which can achieve high DSP speeds in small packages [TAY84A]. These methods offer an opportunity to achieve the needed computational bandwidth, provided they lend themselves to systolic array insertion and the weaknesses of the systolic array can be remedied.
1.3 System Integration
The hardware/algorithm synergism problem refers to the study of achieving the optimal balance between increasing performance through added algorithm complexity and simultaneously reducing hardware complexity. The importance of this claim is well illustrated by contemporary DSP chips, which emphasize both fast arithmetic and a simple instruction set. These devices belong to a class of computers referred to as reduced instruction set computer (RISC) machines. While these chips provide very high throughput, they often exhibit a weakness in implementing complex operations, such as logarithms, efficiently. Therefore, in order to achieve a synergistic design, the demands of the candidate DSP algorithms must be carefully balanced with the advantages and limitations of the candidate processors and system architectures.
1.4 Method of Study
This study of high-performance DSP systems began with a critical review of the current state-of-the-art arithmetic processors. This review, found in Chapter 2, led to a fundamentally important technique for shortening the propagation delay of a basic carry/borrow unit. The fast adder/subtractor which resulted was shown to be applicable to the design of fast multiply/divide units. The study of these fast arithmetic processors compared the algorithm/architecture synergism of traditional and proposed systems. Topics relating to fast multiplication and scaling were subdivided into major groupings which included stand-alone multipliers, cellular arrays, memory-intensive policies, logarithmic systems, modular arithmetic, and distributed arithmetic systems. Based on this study, more efficient DSP computational hardware was developed for later insertion into a systolic array.
The principal outcome of this phase of the study was the identification of the distributed arithmetic (DA) unit as the processor of choice. DA units (see Section 2.2.3 and Appendix B) can be applied whenever a linear sum of weighted (scaled) partial products of the form y = Σ ai xi is to be produced for fixed coefficients ai. The throughput potential of this class of filters is essentially limited only by the delay of the high-speed semiconductor memory cell and the data wordlength. Many authors ([ZOH73], [PEL74], [LIU75], [BUR77], [JEN77], [KAM77], [TAN78], [ZEM80], [ARJ81], [TAY83B], [TAY84C], [CHE85], and [TAY86]) have demonstrated the efficacy of the fixed-point DA for use in finite impulse response (FIR) and infinite impulse response (IIR) filtering plus fast Fourier transforms (FFT). Their work supports the analysis, which showed that DA units (when applicable) are faster and more precise than equivalently packaged fixed-point multiply/accumulate units. The presented study further showed that when input-output latency is not a critical issue, a bit-serial DA (presented in Appendix B) could be used to minimize the processor's complexity without reducing real-time throughput. The design also achieved an ideal synergism with the data communication needs of a dense systolic array.
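The DA mechanization sketched above can be illustrated in a few lines of modern code (simplified to unsigned integer inputs; the function names and the software form are illustrative only, since a hardware unit realizes the table as a ROM and the loop as a shift-accumulator):

```python
# Illustrative sketch of distributed arithmetic (DA): the SOP
# y = sum(a[i] * x[i]) is formed by table lookups on the bit slices
# of the x[i], one bit position per step, with no multiplier at all.

def build_da_table(coeffs):
    """Precompute all 2**N partial sums of the fixed coefficients."""
    n = len(coeffs)
    return [sum(c for i, c in enumerate(coeffs) if addr >> i & 1)
            for addr in range(1 << n)]

def da_sop(coeffs, xs, bits=8):
    """Bit-serial sum of products: shift-and-add over bit slices."""
    table = build_da_table(coeffs)
    acc = 0
    for j in range(bits):                 # LSB to MSB
        addr = 0
        for i, x in enumerate(xs):        # one bit from each input word
            addr |= ((x >> j) & 1) << i
        acc += table[addr] << j           # scaled table lookup
    return acc

# Agrees with a direct multiply/accumulate for unsigned inputs:
coeffs, xs = [3, 5, 7], [10, 20, 30]
assert da_sop(coeffs, xs) == sum(a * x for a, x in zip(coeffs, xs))
```

Note that the table grows as 2^N in the number of terms, which is why DA units target fixed, modest-length coefficient sets.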
The existing DA technology base applies only to fixed-point designs. Many important DSP applications require high-precision floating-point operations. The presented study extended the fixed-point distributed arithmetic unit (DAU) to the floating-point distributed arithmetic unit (FPDAU) case. In Sections 3.2 and 3.3, the FPDAU was formally derived and analyzed. It was shown that the resulting arithmetic processor is more precise, faster, less complex, and less costly than any currently available commercial floating-point devices.

The FPDAU was further modified to process complex SOP operations. However, in this case, the size of the processor was doubled in order to maintain the speed capability of a real FPDAU. Nevertheless, this is a factor-of-two advantage over conventional methods, which require that the device complexity be increased by a factor of four.

Currently, all DA devices presume that the DSP coefficients and/or parameters are known a priori. The new DA structure was shown to allow the basic processor to service general arithmetic calls as well. The new robust processor was shown to be capable of supporting traditional DA operations when coefficients are known a priori and general-purpose multiply/accumulate operations otherwise. Compared to commercially available multiply/accumulate chips, the researched architecture was shown to be both faster and less complex.
Other elementary arithmetic functions, such as division, the exponential function, logarithms, square roots, transcendentals, etc., are also required in DSP. These elementary arithmetic functions are usually performed using approximation methods (which require only multiplication and addition/subtraction; presented in Sections 2.3 and 2.4.1) at the expense of throughput. In addition to the approximation methods, there are other algorithms called co-transformations (Section 2.4). They are based on the use of a small lookup table, simple control circuits, shifters, and adders/subtractors. Compared to approximation methods, co-transformation methods offer a higher performance potential. Co-transformation algorithms were combined with the developed floating-point/complex distributed arithmetic unit to produce a new Multi-Purpose Arithmetic Unit, or MPAU (Section 3.5). The MPAU can be used as a special-purpose or general multi-purpose arithmetic unit which provides high-speed evaluation of the elementary arithmetic functions. The MPAUs can also be connected to construct a systolic array computing system for DSP and linear algebra applications.
The innovative processor technology was shown to be able to implement the primitive instructions required in DSP. In addition, it was shown to have an attractive very-large-scale integrated circuit (VLSI) implementation. This makes the unit a logical candidate for use with dense parallel processing machines.
As stated, DSP and linear algebra applications lend themselves to systolic array implementation because the candidate algorithms, though computationally intensive, are very regular. Systolic arrays belong to a broader class of fine-grain, parallel, single instruction, multiple data (SIMD) path machines. The design of a large fine-grain parallel processor array is as much limited by the interprocessor communication requirements as it is by processor speed. All of these factors favor VLSI architectures with a minimum amount of local physical communication and little or no global communication. To design parallel processing systems in VLSI successfully, the design should be modular, have regular data and control paths, and, most importantly, contain only localized communication. The processor technology previously referenced was shown to be well suited to this mission. However, simply providing a systolic array with a superior processor will not remove many of the current limitations of these arrays. Chapters 4 and 5 present techniques which were developed to map the sequential loops found in DSP algorithms onto systolic arrays. The techniques are enhancements to transformation methods originally proposed by Moldovan [MOL82], [MOL83].
It is shown that the mapping of an algorithm onto a systolic array can be viewed as a transformation of the original algorithm loop indices into physical array spatial and temporal coordinates. Different transformation matrices are shown to result in distinct velocity and distribution functions. Thus, the architecture of the systolic array is heavily reliant on the transformation matrix. Methods for finding the optimal transformation matrix (where "optimal" means that under the limit of finite available array size or the minimal required execution time, the system will finish the computation in the least time or require the least array size, respectively) are not found in the literature. In order to find the optimal transformation matrix, the conventional one-dimensional time index theory should be extended to multi-dimensional time indices (presented in Sections 4.6 and 5.2). Meanwhile, for a real-time computing system, the interrupt problem should also be considered (Section 4.7).
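The index-space view of this mapping can be sketched as follows for a 3x3 matrix-matrix multiply. The particular matrix T shown is one valid choice, not the optimal transformation discussed in the text, and all names are illustrative:

```python
# Illustrative sketch: a transformation matrix T maps each loop index
# point (i, j, k) of a 3x3 matrix-matrix multiply into one temporal
# coordinate (firing time, from the first row of T) and two spatial
# coordinates (the processing-element position, from the other rows).
T = [[1, 1, 1],   # t = i + j + k
     [1, 0, 0],   # PE row = i
     [0, 1, 0]]   # PE column = j

def transform(index):
    """Apply T to an index point, giving (t, x, y)."""
    return tuple(sum(t * v for t, v in zip(row, index)) for row in T)

def schedule(n=3):
    """Collect which index points fire where and when."""
    sched = {}
    for i in range(n):
        for j in range(n):
            for k in range(n):
                t, x, y = transform((i, j, k))
                # C[i][j] += A[i][k] * B[k][j] fires at time t on PE (x, y)
                sched.setdefault((t, x, y), []).append((i, j, k))
    return sched

# A legal space-time mapping gives each PE at most one operation per step:
assert all(len(ops) == 1 for ops in schedule().values())
```

Choosing a different T changes when and where each operation fires, which is exactly why the data velocities and distributions, and hence the array architecture, depend on the transformation matrix.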
The University of Florida research effort also focuses on developing procedures that can be used to partition a large DSP problem into one which can be executed on small systolic arrays. In Section 4.4, these techniques are reviewed. One of the most visible, namely Moldovan and Fortes' partitioning method, is shown to have three drawbacks: (1) external FIFO buffers are required, (2) the utilization of the PEs is inefficient, and (3) it is difficult to find the optimal partition solution. The partitioning scheme developed by Horiike, Nishida, and Sakaguchi can produce a more efficient implementation but requires extra internal storage to synchronize the computation. Other impediments are: (1) there is no efficient way to find the required transformation matrix, and (2) once the transformation matrix is found, the actual ordering of the data set in the I/O buffers is not defined.
In Sections 5.1, 5.2, and 5.3, the optimal partitionable transformation matrix was shown to be directly calculable. This new technique can also be used to implement systolic array algorithm synthesis where some functions of the systolic array are predefined (e.g., the flows of some data sets are desired and fixed and others are flexible). It was demonstrated that the drawbacks of existing partitioning methods could be eliminated using the proposed Modulus Theory Partitioning algorithm (Section 5.3). This algorithm provided the optimal transformation matrix for the partitioning such that no external FIFO was required. In this way, the data flow control and data distribution remained similar to cases where no partitioning was applied. Finally, a reconfigurable systolic array architecture was presented. These results made the interrupt strategy presented in Section 4.7 viable.
The results of the presented research demonstrate that a general-purpose reconfigurable systolic array can be developed based on the proposed new processor technology. In addition, the new system provides powerful software control and programming tools which extend the utility and capability of the array. The new computing system provides the high bandwidth required of high-end DSP and algebraic applications in an affordable and compact package, and the ultimate goal of processor/array/algorithm synergism is achieved.
CHAPTER 2
FAST POLICIES FOR ELEMENTARY ARITHMETIC
FUNCTIONS COMPUTATION
Digital signal processing (DSP), whether relating to filters or transforms, is an arithmetic-intensive study with the predominant operation being multiply/accumulate. As such, the key to developing high-speed systems is the design of fast additions and multiplications. Other elementary arithmetic functions, such as division, logarithms, the exponential function, square roots, transcendentals, etc., are also often used in DSP (adaptive filters, transforms, matrix inversion, image processing, plus others). Thus, fast algorithms and architectures for implementing these functions should also be investigated.
2.1 Fast Two-Operand Adders/Subtractors
High-performance adders are essential not only for addition, but also for subtraction (subtraction in a 2's complement system is actually a complement-addition operation), multiplication, and division, plus others. The slowest adder configuration is the ripple adder. Here, carry information is passed (rippled) from a digit location to one of higher significance. For such a ripple adder, execution speed is essentially limited by the time delay associated with propagating the carry information from the least significant bit (LSB) to the most significant bit (MSB). As a result, the key to designing fast adders is carry acceleration. There are various carry acceleration techniques for designing fast two-operand adders (e.g., asynchronous carry-completion adders and three classes of synchronous adders, namely, conditional-sum, carry-select, and carry-lookahead adders). Asynchronous adders were studied by [GIL55]. Studies on carry propagation length were reported in [REI60] and [BRI73]. Conditional-sum addition was proposed by [SKL60A]. Carry-select adders were introduced by [BED62]. Other researchers, [ALE67], [FEI68], [LEH61], [MAC61], [SKL60B], [SKL63], and [WEL69], have investigated carry propagation speedup techniques and their possible implementations. It is fair to say that the popular carry-lookahead adder is the end product of this research into parallel carry generation. In particular, the work of [SKL63] is the most comprehensive. With today's VLSI technology, adder construction is no longer a great burden on the designer.
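As an illustration of the carry-lookahead principle, the following sketch forms every carry as a flat AND-OR expression of the per-bit generate and propagate signals, rather than rippling it from the LSB (the width and function names are illustrative, and the software loops stand in for what is combinational logic in hardware):

```python
def cla_add(a, b, width=8):
    """Carry-lookahead addition: each carry c[i+1] is expanded as
    g[i] | p[i]g[i-1] | p[i]p[i-1]g[i-2] | ..., so all carries can be
    formed in parallel from the inputs instead of bit by bit."""
    g = [(a >> i & 1) & (b >> i & 1) for i in range(width)]   # generate
    p = [(a >> i & 1) ^ (b >> i & 1) for i in range(width)]   # propagate
    c = [0] * (width + 1)
    for i in range(width):
        carry = 0
        for j in range(i, -1, -1):        # one product term per source bit
            term = g[j]
            for m in range(j + 1, i + 1):
                term &= p[m]              # carry propagates through bits j+1..i
            carry |= term
        c[i + 1] = carry
    s = sum((p[i] ^ c[i]) << i for i in range(width))
    return s | (c[width] << width)        # include the final carry-out

assert cla_add(100, 55) == 155
assert cla_add(200, 100) == 300
```

In hardware the expansion is a two-level AND-OR network per carry; fan-in limits are handled by building the lookahead in groups (block carry-lookahead).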
2.2 Multiplier Policies
The importance of the fast multiplier is well illustrated by the available DSP chips, which emphasize relatively fast fixed-point multipliers in their RISC instruction sets. For example, the TMS32020 contains a 16x16-bit, 200-ns multiplier which occupies about 40 percent of the chip area. Newer DSP chips belonging to the third generation, such as the TMS320C30, AT&T DSP32, plus others, have introduced architectural improvements but leave the fast multiplier as their cornerstone. Another recent technological innovation has been the dedicated multiply, multiply/accumulate, and numeric processor chips. They appear in both fixed-point and floating-point forms and are summarized in Table 2.1. A variation on this theme is the arithmetic coprocessor chip (e.g., the INTEL 8087/80287), which can perform the higher-order algebraic tasks assigned to it by a CPU chip more efficiently than the CPU itself (e.g., multiplication, division, exponentiation, transcendentals, etc.). From all this we see that the key to performance is translating the basic algebraic procedures of multiplication into fast, compact digital operations. These concepts must be understood at the
TABLE 2.1
The Floating-Point And The Fixed-Point
Multiplier And Multiplier/Accumulator Chips

FLOATING-POINT CHIPS

Type                       32-b MFLOPS   64-b MFLOPS   Power (W)   Price ($ U.S.)
AMD                        8             N/A           7.5         700
Analog Dev.                10            2.5           0.4         350
Weitek                     4             2             2.0         900
                           8             8             2.0         900
                           10            N/A           2.0         350
Bipolar Integration Tech.  N/A           45            5.5         N/A
IDT                        N/A           10            0.75        N/A
TI                         N/A           14.7          1.0         N/A
TRW                        10            N/A           0.21        N/A

FIXED-POINT CHIPS (multiply / multiply-accumulate times, ns)

Type         12x12 MUL   12x12 MAC   16x16 MUL   16x16 MAC   24x24 MUL
Analog Dev.  110         130         75          85          200
TRW          N/A         135         45          50          N/A
AMD          N/A         N/A         90          N/A         N/A
Logic Dev.   80          N/A         45          45          N/A
Weitek       N/A         N/A         55          75          N/A
IDT          30          30          35          35          N/A
CYPRESS      N/A         N/A         45          N/A         N/A

ARRAY PROCESSOR

ST100, 100 MFLOPS, $250,000
system level. The design of a host of fast multiplier strategies has been studied in the context of their algorithms and architectures.
2.2.1 Traditional Methods And Cellular Arrays
Among the traditional multipliers, the first was the slow shift-add architecture, followed by others (e.g., Booth's algorithm and its modified version, and Wallace trees [WAS82] and [HWA79]). Later, the advent of MSI, LSI, and now VLSI resulted in a profusion of monolithic multiplier devices (Table 2.1). If these units can operate as stand-alone multipliers, without the need to interconnect them to other multiplier chips, they are called Nonadditive Multiplier Modules, or NMMs. However, it is often necessary to connect identical NMMs together to achieve multiplier wordlengths which exceed the capability of a single module. Using multiple copies of an NMM, large-wordlength multiplier arrays, called Cellular Arrays, can be designed. The objective of an effective cellular design is to additively recombine the subproduct outputs of the NMM cells in such a way as to reconstruct the final product (e.g., the NMM Wallace array [HWA79]). However, as is often the case, high performance is gained at the expense of system complexity. Thus, some authors deal with the multiplication at the bit level (e.g., the Pezaris multiplier array [HWA79], the Baugh-Wooley multiplier [HWA79], and bit-serial multipliers [SIP84] and [KAN85]). This is in sharp contrast to conventional architectures, which communicate in word-wide transactions. As a result, highly functional VLSI designs can be realized. Bit-serial networks can be easily routed and interconnected without the problems of bit-parallel busing. However, this technique needs more complex control circuitry and introduces longer latency.
2.2.2 Non-Traditional Methods
All the above multipliers are designed to process the multiplication of fixed-point numbers, which appear in sign-magnitude plus 1's and 2's complement forms. The overwhelming choice for a fixed-point number system is 2's complement (see Appendix A). Alternative number systems can also be used to design a fast multiplier (e.g., the canonic signed digit number system, the logarithmic number system (LNS), and the residue number system (RNS)). The canonic signed digit multiplier [TAY84B] is based on the concept that a zero term will contribute nothing to the sum-of-products (SOP) and may, therefore, be "bypassed," so that an overall speedup can be achieved. To implement this architecture, however, extra circuitry is required for recoding a 2's complement integer into canonical form, and the speed is data dependent. The LNS has been studied in a DSP environment by numerous authors ([KIN71], [TAY83A], [SWA83], [SIC83], [TAY85B], and [TAY85C]). In the LNS, a word is given by:

X = r^ex    (2.1)

where r is the radix and ex is the signed fractional exponent. In this system, the operations of multiply, divide, square, and square root only require exponent manipulation. These operations (i.e., addition or binary shift) are simple to implement in fast hardware. Addition and subtraction are, as one can see, a different story. It is known that the LNS, compared to fixed-point schemes, enjoys a wide dynamic range and high precision. However, the precision and dynamic range capability of an LNS unit is essentially limited by the available address space of a high-speed RAM or ROM.
The residue number system ([McC79], [TAY84A], [WAN78], [HER75],
[ARM80], [TAY85A], [TAY84D], [TAY82A], and [TAY82B]) is an integer
number system defined with respect to a set P (called a moduli set) of relatively
prime integers (called moduli). In particular, if P = (p_1, ..., p_L) and GCD(p_i, p_j)
= 1 for i ≠ j, then over the integer range [0, M) ([−M/2, M/2) for signed numbers),
where M = p_1·p_2·...·p_L, an integer X is isomorphic to the L-tuple of residues given
by X ↔ (x_1, x_2, ..., x_L), x_i = X mod p_i. If two integers X and Y have admissible
RNS representations, and if they are algebraically combined under the arithmetic
operation 'o', where o ∈ {+, −, ×} (division is not closed in the RNS), then Z = X
o Y ↔ (x_1 o y_1, ..., x_L o y_L) = (z_1, z_2, ..., z_L) if Z belongs to the residue class
Z_M = {0, 1, ..., M−1}. The remarkable feature of this statement is that in the RNS, Z
can be computed as L concurrent operations without the need of producing
carry digits or establishing a carry management policy! For complex number
multiplication, the QRNS [TAY85A] can be used.
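The residue mechanics above can be sketched in a few lines of Python. The moduli set P = (7, 11, 13) below is an illustrative choice; the Chinese-remainder reconstruction (not needed inside an RNS datapath itself) is included only to check results.

```python
from functools import reduce

# Illustrative moduli set; any pairwise relatively prime set works.
P = (7, 11, 13)                        # M = 7*11*13 = 1001
M = reduce(lambda a, b: a * b, P)

def to_rns(x):
    """Map an integer in [0, M) to its L-tuple of residues."""
    return tuple(x % p for p in P)

def from_rns(residues):
    """Chinese-remainder reconstruction of X from (x1, ..., xL)."""
    x = 0
    for r, p in zip(residues, P):
        Mp = M // p                    # product of the other moduli
        x += r * Mp * pow(Mp, -1, p)   # pow(.., -1, p): inverse of Mp mod p
    return x % M

# Carry-free arithmetic: each residue channel is processed independently.
def rns_mul(a, b):
    return tuple((ai * bi) % p for ai, bi, p in zip(a, b, P))

def rns_add(a, b):
    return tuple((ai + bi) % p for ai, bi, p in zip(a, b, P))
```

The L channels never exchange carries, which is exactly the concurrency the text highlights; results are valid whenever the true answer falls inside [0, M).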
2.2.3 Multiplier/Accumulator
The major operation found in DSP is the multiply/accumulate. As such,
many commercial DSP chips deal with fixed-point numbers. A fundamental
problem in developing fixed-point DSP systems is managing dynamic range
overflow. This is especially true in high-gain, narrow-band filters, where the
internal gain of a filter can be several orders of magnitude higher than that
measured at the output. A serious overflow problem can also occur when two n-bit
numbers are multiplied. Overflow may immediately occur if data and coefficients
are not properly scaled. Assuming that the data is properly scaled, another
overflow problem can emerge. If two n-bit 2's complement numbers are
multiplied, the 2n-bit product contains two sign bits. The leftmost sign bit can
be masked, but that is not sufficient to inhibit overflow. This, in fact, is a
weakness in some DSP microprocessors (e.g., the TMS 32020), but it can be detected
in software at the expense of increased overhead. Although recently announced
DSP chips deal with floating-point operations, they only perform the
multiply/accumulate operation well. For computing other elementary arithmetic
functions (e.g., division, transcendentals, plus others), the approximation methods
(discussed later) must be used and, therefore, speed is slow. The systolic array
multiplier/accumulator ([ULL84] and [MEA80]) uses a multiprocessor architecture
to perform the multiply/accumulate operations. In Chapters 4 and 5, the
systolic array is presented in detail.
Technically, multiplication is the algebraic combining of two variables using
a set of consistent multiplication rules. A special case is called scaling, which
applies the same combining rules to a variable and a constant. Such operations
are ubiquitous in DSP and are often found as "coefficient scaling" operations.
For example, a simple finite impulse response (FIR) filter is given by
y(k) = Σ_j a_j x(k−j);  j ∈ [0, L)   (2.2)
where the {a_j} coefficient set is known a priori. Any of the previously
referenced fixed-point multiplier schemes could be used to accept a_j and x(k−j) as
input operands and export a partial product. If each multiply-add cycle
requires t_c units of time, then y(k) can be produced in approximately L·t_c time
units. An alternative scheme has been proposed, based on interpreting
Equation 2.2 as a nested summation. If x(r) is an n-bit 2's complement word,
then
x(r) = Σ_m 2^m x[r : m] − 2^(n−1) x[r : n−1]   (2.3)

for m = 0, 1, ..., n−2, where x[r : v] is the v-th bit of x(r), x[r : v] ∈ {0, 1}. Upon
substitution, Equation 2.2 becomes

y(k) = Σ_m 2^m Φ(m) − 2^(n−1) Φ(n−1)   (2.4)

where Φ(q) = Σ_j a_j x[k−j : q], j ∈ [0, L). Observe that Φ(q) is a function of the
known coefficient set {a_j} and the q-th common bits of x(r), r = 0, 1, ..., L−1.

[FIGURE 2.1 — Distributed Arithmetic Multiplier/Accumulator: an input queue of
L n-bit registers supplies the q-th bit of each word as an L-bit address into a
2^L-word Φ ROM; the ROM output drives a shift-adder/accumulator under a
simple controller.]

Referring to Figure 2.1, it can be seen that the L-bit data field, obtained by
accessing the q-th common bit of x(k) through x(k−L+1), is used as an L-bit
address which is presented to the read-only memory (ROM) table. The ROM has
simply been programmed with the 2^L precomputed values of Φ(m). For
example, if L = 3, a_0 = 1.5, a_1 = −3.0, and a_2 = 1.0, then
Address   x[n : m]   x[n−1 : m]   x[n−2 : m]    Φ(m)
   0         0           0            0          0.0
   1         0           0            1          1.5
   2         0           1            0         −3.0
   3         0           1            1         −1.5
   4         1           0            0          1.0
   5         1           0            1          2.5
   6         1           1            0         −2.0
   7         1           1            1         −0.5
Finally, scaling by 2^m, as found in Equation 2.4, can be implemented as a
simple shift-add task. The result is that a filter cycle can now be completed in
approximately n memory-cycle units. For example, if L = 12, n = 16, t_mult =
100 ns, and t_mem.cyc = 10 ns, then a conventional multiplier filter cycle delay would
be on the order of 1200 ns versus 160 ns for the distributed design. As we see,
this method can provide the fastest computation for scaling SOP operations. Of
course, the filter coefficient set must be known a priori, which precludes its use
in adaptive filter applications. This problem is solved in Chapter 3.
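The bit-plane lookup mechanism of Equations 2.3 and 2.4 can be modeled behaviorally. The sketch below uses an illustrative integer coefficient set (so the "ROM" entries are exact); a hardware DA unit would realize the loop body as one ROM access plus one shift-add per bit plane.

```python
# Behavioral sketch of the distributed-arithmetic SOP of Equations 2.2-2.4.
L, n = 3, 8                      # filter length, data word length
a = [3, -5, 2]                   # known coefficient set {a_j} (illustrative)

# ROM: 2**L precomputed entries, PHI[addr] = sum of a_j over set address bits.
PHI = [sum(a[j] for j in range(L) if (addr >> j) & 1) for addr in range(2 ** L)]

def da_sop(x):
    """sum_j a[j]*x[j] for n-bit 2's-complement x[j] in [-2**(n-1), 2**(n-1))."""
    y = 0
    for m in range(n):           # bit plane m, LSB first
        addr = sum(((x[j] >> m) & 1) << j for j in range(L))
        if m == n - 1:
            y -= PHI[addr] << m  # the sign-bit plane carries negative weight
        else:
            y += PHI[addr] << m
    return y
```

Note that the multiplier has disappeared entirely: the inner product is produced by n table lookups and shift-adds, independent of the coefficient values.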
2.3 Division
Division is also an important arithmetic operation in DSP and linear algebra
applications. The computation time of division is always greater than that
of multiplication. Thus, for scaling operations (i.e., coefficients known a priori),
the designer should use multiplication instead of division. However, for the
general case, fast dividers are required.
Division algorithms can be grouped into two classes according to their iterative
operator. The first class, where subtraction is the iterative operation, contains
many familiar algorithms (e.g., nonrestoring division), is relatively slow,
and has an execution speed proportional to the operand (divisor) length. In the
second class, multiplication is used as the iterative operator to achieve higher
speed. Algorithms of this class obtain a reciprocal of the divisor first and then
multiply the result by the dividend. There are two main multiplicative methods
of finding the reciprocal. One is the series expansion based on the Maclaurin
series ([AND67] and [SPI68]) and the other is the Newton-Raphson iteration
[BUR81]. This type of algorithm converges quadratically, and its execution time
is proportional to log_2 of the divisor length. As a result, a fast multiplier can
play an important role in fast division. Compared to the co-transformation
method, however, this is slower.
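As a concrete instance of the multiplicative class, the sketch below forms a reciprocal by Newton-Raphson iteration. The divisor is assumed pre-scaled into [0.5, 1), and the linear seed 2.9142 − 2d is a common textbook choice, not one prescribed by this chapter.

```python
# Newton-Raphson reciprocal: z <- z*(2 - d*z) converges quadratically to 1/d.
def reciprocal(d, iterations=5):
    assert 0.5 <= d < 1.0
    z = 2.9142 - 2.0 * d         # seed with roughly 0.09 maximum relative error
    for _ in range(iterations):
        z = z * (2.0 - d * z)    # each pass roughly doubles the correct bits
    return z

def divide(x, d):
    """x/d via reciprocal-then-multiply, as in the second algorithm class."""
    return x * reciprocal(d)
```

Because the relative error squares on every pass, the iteration count grows only with log_2 of the desired bit count, which is why a fast multiplier dominates the cost of this divider.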
2.4 Fast Arithmetic Function Computation Techniques
For computing the elementary arithmetic functions, the Taylor series
method can be used. However, even at low precision, the Taylor series is still
computationally intensive. For some functions, like the exponential, logarithm,
ratio, square root, and trigonometric functions, the co-transformation and CORDIC
techniques can be used to achieve higher computational speed with a relatively
simple hardware architecture consisting of a small lookup table, a simple
controller, shifters, and adders, as shown in the following.
2.4.1 Approximation Method [LYU65] and [COD80]
The Taylor series or power series expansion of some f(x) is

f(x) = f(a) + f'(a)(x−a) + f''(a)(x−a)²/2! + ... + f^(n−1)(a)(x−a)^(n−1)/(n−1)! + R_n
R_n = the remainder after n terms   (2.5)

and generally converges for all values of x in some domain called the interval
of convergence and diverges for all x outside this interval. For example,
e^x = 1 + x + x²/2! + x³/3! + ...,   −∞ < x < ∞

ln(x) = 2[ (x−1)/(x+1) + (1/3)((x−1)/(x+1))³ + (1/5)((x−1)/(x+1))⁵ + ... ],   x > 0

sin(x) = x − x³/3! + x⁵/5! − x⁷/7! + ...,   −∞ < x < ∞

cos(x) = 1 − x²/2! + x⁴/4! − x⁶/6! + ...,   −∞ < x < ∞

sin⁻¹(x) = x + (1/2)(x³/3) + ((1·3)/(2·4))(x⁵/5) + ((1·3·5)/(2·4·6))(x⁷/7) + ...,   |x| ≤ 1
(2.6)
Although the Taylor series can be used to compute arithmetic functions, its
convergence is too slow. Thus, instead of the Taylor series, approximation
algorithms should be used to calculate the elementary arithmetic functions.
The arithmetic functions can usually be approximated by a series of
Chebyshev polynomials

f(x) = Σ_{y=0}^{n} c_y T_y(x)   (2.7)

This series requires fewer arithmetic operations than the Taylor series and can
speed up a computation provided the approximation is sufficiently accurate.
The approximation formulas for the elementary arithmetic functions are
illustrated as follows for x a floating-point operand:

e^x  = 2^y  where y = x·log_2(e)
2^x  = 2^y  where y = x
10^x = 2^y  where y = x·log_2(10)   (2.8)
To calculate these exponentials, first compute y = i + f, where i and f are the
integer and fractional parts of the floating-point number y, respectively, and
log_2(e) and log_2(10) are prestored constants. Also, f must be positive and
in the range [0, 1). The result equals
2^y = 2^i × 2^f   (2.9)

where 2^f can be computed approximately by

2^f = C_0 + C_1·f + C_2·f² + ... + C_5·f⁵ + ε(f)   (2.10)

with |ε(f)| < 10⁻⁷ and

C_0 = 0.9999999702            C_1 = 0.6931530732007278
C_2 = 0.240153617040129       C_3 = 0.0558263180623292
C_4 = 0.0089893400833312      C_5 = 0.0018775766770276   (2.11)

The above equation can be conveniently evaluated by means of nested
multiplication, that is,

2^f = C_0 + f( C_1 + f( C_2 + f( C_3 + f( C_4 + f·C_5 ))))   (2.12)

Thus, in total, the computation requires 6 floating-point multiplications, 5
floating-point additions, and 1 integer addition (exponent adjustment of the result).
This algorithm is accurate to at least six significant digits. For higher precision,
more power terms are required, and the minimax algorithm can be used to
derive the coefficients.
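Equations 2.9-2.12 can be exercised directly in floating point. The sketch below reuses the coefficient set (2.11), with `math.ldexp` standing in for the exponent-adjustment step.

```python
import math

# Sketch of Equations 2.9-2.12: split y into integer and fractional parts,
# evaluate the degree-5 polynomial for 2**f by nested multiplication, then
# patch the exponent.
C = [0.9999999702, 0.6931530732007278, 0.240153617040129,
     0.0558263180623292, 0.0089893400833312, 0.0018775766770276]

def exp2(y):
    i = math.floor(y)            # integer part; f = y - i lies in [0, 1)
    f = y - i
    p = C[5]
    for c in reversed(C[:5]):    # Horner form of Equation 2.12
        p = c + f * p
    return math.ldexp(p, i)      # p * 2**i: exponent addition only
```

With e^x = exp2(x·log_2(e)) and 10^x = exp2(x·log_2(10)), the same routine serves all three exponentials of Equation 2.8.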
The following approximation formulas [ATT88] can be used to compute the
floating-point logarithms of the floating-point operand x > 0:

ln(x)    = C_0 + C_1(M/2) + C_2(M/2)² + ... + C_8(M/2)⁸ + C_9(E + 1)
log_2(x) = C_0 + C_1(M/2) + C_2(M/2)² + ... + C_8(M/2)⁸ + (E + 1)
log_10(x)= C_0 + C_1(M/2) + C_2(M/2)² + ... + C_8(M/2)⁸ + C_9(E + 1)
(2.13)

where

x = M·2^E,  M = mantissa (M ∈ [1, 2)) and E = exponent   (2.14)
The coefficients are as follows:

        ln(x)                    log_2(x)                  log_10(x)
C_0 = −3.067466148        C_0 = −4.4254182          C_0 = −1.332183622
C_1 =  11.30516183        C_1 =  16.3099009156      C_1 =  4.909769401
C_2 = −27.7466647         C_2 = −40.02997556826     C_2 = −12.05022337
C_3 =  51.49518505        C_3 =  74.29184809343     C_3 =  22.36407471
C_4 = −66.69583126        C_4 = −96.221745          C_4 = −28.96563148
C_5 =  58.53503341        C_5 =  84.44820241827     C_5 =  25.42144201
C_6 = −33.20167437        C_6 = −47.89989096077     C_6 = −14.41930397
C_7 =  10.98927014        C_7 =  15.8541655496      C_7 =  4.772579384
C_8 = −1.613006222        C_8 = −2.32707607725      C_8 = −0.7005197015
C_9 =  0.6931471806                                 C_9 =  0.3010299957
(2.15)
Similarly, nested multiplication can be used to calculate the above results. In
this way, ln(x) and log_2(x) are accurate to at least four decimal places and
five significant digits, and log_10(x) is accurate to at least five decimal places
and six significant digits.
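A floating-point sketch of Equations 2.13-2.15 for ln(x). Python's `math.frexp(x)` returns a mantissa in [0.5, 1), which plays exactly the role of M/2 in Equation 2.13, with an exponent already equal to E + 1; the coefficients are entered with their alternating signs.

```python
import math

# Sketch of Equations 2.13-2.15 for ln(x): nested polynomial in the
# mantissa plus a scaled exponent term.
C = [-3.067466148, 11.30516183, -27.7466647, 51.49518505, -66.69583126,
     58.53503341, -33.20167437, 10.98927014, -1.613006222]
C9 = 0.6931471806                # ln(2)

def ln_approx(x):
    assert x > 0.0
    m, e = math.frexp(x)         # x = m * 2**e, with m in [0.5, 1)
    p = C[8]
    for c in reversed(C[:8]):    # nested multiplication
        p = c + m * p
    return p + C9 * e            # polynomial ~ ln(m); add e*ln(2)
```

Substituting C_9 = 1 or C_9 = log_10(2) (and the matching coefficient columns) yields log_2(x) and log_10(x) from the same structure.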
There are three algorithms, namely square root, fast square root, and quick
square root, for computing the floating-point square root of the floating-point
operand x > 0. The square root function is more accurate than either the fast
square root or quick square root functions. The following algorithm is used for
computing the square root function:

√x = 2^Z  where  Z = log_2(x)/2   (2.16)

The above log_2(x) algorithm can then be used to calculate Z = i + f, where i
and f are the integer and fractional parts of the floating-point number Z,
respectively. Also, f must be positive and in the range [0, 1). Equation 2.12
can then be used to compute 2^f. This algorithm is accurate to at least six
significant digits. The fast square root function is faster but less accurate than the
square root function and uses the following algorithm:
y = √x  where  x = M·2^E with M ∈ [1, 2)   (2.17)

Compute Z ≈ √M using two Newton-Raphson iterations, that is,

Z[n] = 0.5·Z[n−1] + 0.5·M/Z[n−1]  for n = 1, 2  and  Z[0] = 1.18   (2.18)

Then

y = Z[2]·2^(E/2)           for E even
y = √2·Z[2]·2^((E−1)/2)    for E odd   (2.19)
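A sketch of the fast square root of Equations 2.17-2.19, with `math.frexp` used to split the operand into mantissa and exponent:

```python
import math

# Fast square root: two Newton-Raphson passes on the mantissa, then
# exponent patch-up as in Equation 2.19.
def fast_sqrt(x):
    assert x > 0.0
    M, E = math.frexp(x)          # x = M * 2**E with M in [0.5, 1)
    M, E = 2.0 * M, E - 1         # rescale so M lies in [1, 2), as in Eq. 2.17
    Z = 1.18                      # fixed starting guess Z[0]
    for _ in range(2):            # Z[n] = 0.5*Z[n-1] + 0.5*M/Z[n-1]
        Z = 0.5 * Z + 0.5 * M / Z
    if E % 2 == 0:
        return Z * 2.0 ** (E // 2)
    return math.sqrt(2.0) * Z * 2.0 ** ((E - 1) // 2)
```

Two iterations from the fixed seed already reproduce √M to a few units in the sixth decimal over [1, 2), consistent with the four-significant-digit claim.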
The fast square root algorithm is accurate to at least four significant digits. The
quick square root function is the fastest method, but it is less accurate than
either the square root or the fast square root functions. The quick square root
function uses the following algorithm:

y = √x  where  x = M·2^E with M ∈ [1, 2)   (2.20)

Compute Z ≈ √M using the following approximation:

Z = 0.59 + 0.4237288136·M − 0.28·(M − 1.47)⁴

y = Z·2^(E/2)           for E even
y = √2·Z·2^((E−1)/2)    for E odd   (2.21)

This algorithm is accurate to at least three significant digits.
Trigonometric functions are very important for transformation and rotation
processing. The tangent function returns the floating-point tangent of the
floating-point operand x (in radians, −π/4 < x ≤ π/4), which can be approximated by
the following polynomial ratio:

tan(x) ≈ ( 945x − 105x³ + x⁵ ) / ( 945 − 420x² + 15x⁴ )   (2.22)
with |error| < 10⁻⁷. Its inverse function, arctangent, can be computed using the
following algorithm:

tan⁻¹(x) = π/4 + C_1·y + C_3·y³ + ... + C_9·y⁹,   x > 0   (2.23)

where y = (x − 1)/(x + 1) and

C_1 = 0.999866       C_3 = −0.3302995     C_5 = 0.180141
C_7 = −0.085133      C_9 = 0.0208351   (2.24)

This algorithm is accurate to at least three significant digits.
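A sketch of Equations 2.23-2.24 for x > 0, with the coefficients carrying their alternating signs:

```python
import math

# arctan(x) = pi/4 + arctan(y), y = (x-1)/(x+1), the latter from the odd
# polynomial of Equation 2.23.
C = {1: 0.999866, 3: -0.3302995, 5: 0.180141, 7: -0.085133, 9: 0.0208351}

def arctan_pos(x):
    assert x > 0.0
    y = (x - 1.0) / (x + 1.0)    # |y| < 1 for all x > 0
    return math.pi / 4.0 + sum(c * y ** k for k, c in C.items())
```

The change of variable compresses the infinite range x > 0 into |y| < 1, which is what makes the short polynomial adequate.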
The other elementary trigonometric functions, sin(x) and cos(x), return
the floating-point sine and cosine of the floating-point operand x (in radians,
−π/2 ≤ x < π/2), respectively. These functions can be calculated using the
following algorithms:

sin(x) = x + C_1·x³ + C_2·x⁵ + C_3·x⁷ + C_4·x⁹
       = x( 1 + x²( C_1 + x²( C_2 + x²( C_3 + C_4·x² ))))
cos(x) = Z( 1 + Z²( C_1 + Z²( C_2 + Z²( C_3 + C_4·Z² ))))
         where Z = π/2 − |x|
(2.25)

For higher precision, more higher-power terms are needed. The coefficients for
different accuracies are listed below:
for b ≤ 24, or d ≤ 8:
C_1 = −0.16666 65668 E+0
C_2 =  0.83330 25139 E−2
C_3 = −0.19807 41872 E−3
C_4 =  0.26019 03036 E−5

for 25 ≤ b ≤ 32, or 9 ≤ d ≤ 10:
C_1 = −0.16666 66660 833 E+0
C_2 =  0.83333 30720 566 E−2
C_3 = −0.19840 83282 313 E−3
C_4 =  0.27523 97106 775 E−5
C_5 = −0.23868 34640 601 E−7

for 33 ≤ b ≤ 50, or 11 ≤ d ≤ 15:
C_1 = −0.16666 66666 66659 653 E+0
C_2 =  0.83333 33333 27592 139 E−2
C_3 = −0.19841 26982 32225 068 E−3
C_4 =  0.27557 31642 12926 457 E−5
C_5 = −0.25051 87088 34705 760 E−7
C_6 =  0.16047 84463 23816 900 E−9
C_7 = −0.73706 62775 07114 174 E−12

for 51 ≤ b ≤ 60, or 16 ≤ d ≤ 18:
C_1 = −0.16666 66666 66666 65052 E+0
C_2 =  0.83333 33333 33316 50314 E−2
C_3 = −0.19841 26984 12018 40457 E−3
C_4 =  0.27557 31921 01527 56119 E−5
C_5 = −0.25052 10679 82745 84544 E−7
C_6 =  0.16058 93649 03715 89114 E−9
C_7 = −0.76429 17806 89104 67734 E−12
C_8 =  0.27204 79095 78888 46175 E−14
(2.26)

where b represents the number of binary bits and d the number of decimal
digits of accuracy. The computation of the inverse functions, sin⁻¹(x) and
cos⁻¹(x), called arcsine and arccosine, for |x| ≤ 0.966, can use

sin⁻¹(x) = C_1·x + C_3·x³ + ... + C_11·x¹¹
         = x( C_1 + x²( C_3 + x²( ... + x²( C_9 + C_11·x² ))))
cos⁻¹(x) = π/2 − sin⁻¹(x)
(2.27)

with the coefficients

C_1 = 0.9999999711      C_3 = 0.1666698337      C_5 = 0.0749014744
C_7 = 0.0459387798      C_9 = 0.0223169693      C_11 = 0.0448569846
(2.28)

This approximation provides an accuracy of |error| < 10⁻⁷ for |x| ≤ 0.5. For |x|
> 0.5, accuracy decreases as |x| increases. The result is accurate to the first
decimal place at |x| = 0.95.
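The b ≤ 24 coefficient grade of (2.26) plugs directly into the nested form of Equation 2.25; a minimal sketch:

```python
import math

# Equation 2.25 with the b <= 24 coefficients of (2.26).
C1, C2 = -0.1666665668, 0.8333025139e-2
C3, C4 = -0.1980741872e-3, 0.2601903036e-5

def sin_approx(x):
    """sin(x) for |x| <= pi/2, by nested multiplication."""
    x2 = x * x
    return x * (1.0 + x2 * (C1 + x2 * (C2 + x2 * (C3 + C4 * x2))))

def cos_approx(x):
    return sin_approx(math.pi / 2.0 - abs(x))  # Z = pi/2 - |x|, as in Eq. 2.25
```

Only four multiplications beyond x² are needed for the sine, which is the operation count the next section compares CORDIC against.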
2.4.2 CORDIC [VOL59] Technique
From the previous section, we see that, compared to the power series, the
approximation method requires fewer arithmetic operations, especially
multiplication/division operations. Thus, it can be used to speed up the computation of
elementary arithmetic functions. However, the computation of trigonometric
functions still requires many operations (e.g., using the nested multiplication
method, the tangent function requires 5 multiplications, 4 additions/subtractions,
and 1 division; and the sine function needs 5 multiplications and 4 additions/
subtractions). To further reduce computational complexity, the COordinate
Rotation DIgital Computer (CORDIC) trigonometric computing technique can be
used.
There are two computing modes, rotation and vectoring, in the CORDIC
trigonometric operations. In the rotation mode, the coordinate components of a
vector and an angle of rotation are given, and the coordinate components of the
original vector, after rotation through the given angle, are computed. In the
second mode, vectoring, the coordinate components of a vector are given, and the
magnitude and the angular argument of the original vector are computed. That
is,

Rotation :  y' = k( y cos λ + x sin λ )
            x' = k( x cos λ − y sin λ )

Vectoring : R' = k·√(x² + y²)
            θ  = tan⁻¹( y/x )
(2.29)
where k is an invariable constant. These equations can be solved by the
following method. Let the original vector (x_i, y_i) be

y_i = R_i sin θ_i
x_i = R_i cos θ_i   (2.30)

Then, if we add/subtract the x and y components by 2^−(i−2)·y_i and
2^−(i−2)·x_i, respectively, we get

y_{i+1} = y_i ± 2^−(i−2)·x_i = √(1 + 2^−2(i−2)) · R_i sin( θ_i ± α_i )
x_{i+1} = x_i ∓ 2^−(i−2)·y_i = √(1 + 2^−2(i−2)) · R_i cos( θ_i ± α_i )   (2.31)

where the general expression for α_i with i > 1 is

α_i = tan⁻¹ 2^−(i−2)   (2.32)
These relations are shown in Figure 2.2. The special angle α_i can be used as
the magnitude of rotation associated with each computing step.

[FIGURE 2.2 — Typical Computing Step For The CORDIC Algorithm: the vector
(x_i, y_i) of magnitude R_i is rotated through ±α_i by cross-addition of the shifted
components 2^−(i−2)·x_i and 2^−(i−2)·y_i, growing its magnitude by the factor
√(1 + 2^−2(i−2)).]

The peculiar
magnitude of each α is such that a "rotation" of coordinate components
through ±α may be accomplished by the simple process of shifting and adding.
Likewise, after the completion of the rotation step in which the i+1 terms
are obtained, the i+2 terms may be computed from these terms, with the results

y_{i+2} = √(1 + 2^−2(i−1)) · √(1 + 2^−2(i−2)) · R_i sin( θ_i + ζ_i·α_i + ζ_{i+1}·α_{i+1} )
x_{i+2} = √(1 + 2^−2(i−1)) · √(1 + 2^−2(i−2)) · R_i cos( θ_i + ζ_i·α_i + ζ_{i+1}·α_{i+1} )   (2.33)

where

ζ_j = +1 or −1   (2.34)
Consider the initial coordinate components y_1 and x_1, where

y_1 = R_1 sin θ_1
x_1 = R_1 cos θ_1   (2.35)

By establishing the first and most significant step as a rotation through ±π/2,
and by establishing the number of steps as n, the expression for the final
coordinate components will be

y_{n+1} = [ √(1+2⁰) · √(1+2⁻²) · ... · √(1+2^−2(n−2)) ] · R_1 sin( θ_1 + ζ_1·α_1 + ζ_2·α_2 + ... + ζ_n·α_n )
x_{n+1} = [ √(1+2⁰) · √(1+2⁻²) · ... · √(1+2^−2(n−2)) ] · R_1 cos( θ_1 + ζ_1·α_1 + ζ_2·α_2 + ... + ζ_n·α_n )   (2.36)

The increase in the magnitude of the components for a particular value of n is
a constant, represented by k. The value selected for n is a function of the desired
computing accuracy. For example, for n = 24, k = 1.646760255.
It can now be shown that, for a rotation of a set of coordinate components
y_1 and x_1 through a given angle λ (as required in the rotation mode), it is
necessary to obtain the expressions

y_{n+1} = kR_1 sin( θ_1 + λ )
x_{n+1} = kR_1 cos( θ_1 + λ )   (2.37)
To obtain these relationships, it is required that

λ = ζ_1·α_1 + ζ_2·α_2 + ... + ζ_n·α_n   (2.38)

and for vectoring, it is required that

θ_1 = ζ_1·α_1 + ζ_2·α_2 + ... + ζ_n·α_n   (2.39)

Equations 2.38 and 2.39 form a special radix representation equivalent to the
desired angle, λ or θ, where

α_1 = π/2
α_2 = tan⁻¹ 2⁰ = π/4
α_3 = tan⁻¹ 2⁻¹ = 0.463647609
...
α_i = tan⁻¹ 2^−(i−2)   (2.40)

These α terms are referred to as ATR (Arc Tangent Radix) constants and can
be precomputed and stored in the computer. The ζ terms are referred to as
ATR digits and are determined during each operation. The implementation of
the CORDIC arithmetic unit is shown in Figure 2.3. Then, from the following
equations
[FIGURE 2.3 — The Implementation Of The CORDIC Arithmetic Unit: X, Y, and
ANGLE registers, each feeding an adder/subtractor; shifters supply the cross
terms 2^−(i−2)·x_i and 2^−(i−2)·y_i between the X and Y paths, while the ATR
constants feed the ANGLE adder/subtractor.]
y_{n+1} = k( y_1 cos λ + x_1 sin λ )
x_{n+1} = k( x_1 cos λ − y_1 sin λ )
θ = tan⁻¹( y_1/x_1 )   (2.41)

we see that

(1) sin(λ) can be computed by initializing the y register, x register, and
angle register to 0, 1, and λ, respectively. Then, use the sign of the
result in the angle register as the determinator for the ATR digits ζ.
When the resulting angle equals zero, the value y_{n+1} multiplied by 1/k
is equal to sin(λ).

(2) cos(λ) is calculated similarly to sin(λ), except that the desired value
cos(λ) is computed by multiplying the result x_{n+1} by 1/k. cos(λ) and
sin(λ) can be computed simultaneously. Thus, tan λ = sin(λ)/cos(λ) can
also be computed.

(3) For computing the arctangent function, the y register and the x register
are loaded with the values y_1 and x_1, respectively. The angle register
is initialized to zero. The sign of the result in the y register will
be used as the determinator for the ATR digits ζ. When the resulting
y_{n+1} equals zero, the accumulated sum in the angle register is equal to the
original angular argument θ_1 for the coordinate components y_1 and x_1.
The result in the x register is equal to kR_1. Therefore, R_1 (= √(x_1² + y_1²))
can be obtained by multiplying the result in the x register by 1/k.

During each iterative operation, the ATR digit ζ_i is determined by the negative
of the sign of the controlling register.
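A behavioral rotation-mode sketch follows. Unlike the text's indexing, which begins with a ±π/2 step, this version starts at the α = tan⁻¹(2⁰) step; the idea — rotate by fixed ATR angles using only shifts and adds, then divide out the constant gain k — is the same.

```python
import math

def cordic_sin_cos(angle, n=32):
    """sin and cos for |angle| <= ~1.74 rad (the sum of the ATR constants)."""
    x, y, z = 1.0, 0.0, angle
    k = 1.0
    for i in range(n):
        sigma = 1.0 if z >= 0.0 else -1.0
        t = sigma * 2.0 ** (-i)              # shifted component 2**(-i)*(x or y)
        x, y = x - t * y, y + t * x          # cross-addition (Equation 2.31)
        z -= sigma * math.atan(2.0 ** (-i))  # subtract the ATR constant
        k *= math.sqrt(1.0 + 2.0 ** (-2 * i))  # accumulate the gain
    return y / k, x / k                      # scale out the constant k
```

Each pass decides its rotation direction from the sign of the residual angle, exactly the "determinator" role the ATR digits play above; in hardware, `k` is a precomputed constant rather than a running product.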
2.4.3 Co-Transformation Method [CHE72]
In this section, it will be shown that a relatively simple device can evaluate
exponentials, logarithms, ratios, and square roots for fraction arguments,
employing only shifts, adds, small table lookups, and bit counting. The scheme is
based on the co-transformation of a number pair (x, y) such that F(x, y) = f(x_0)
is invariant; while x is driven towards a known value x_w, y is driven towards
the result. The evaluation of Z_0 = f(x_0) can be carried out by introducing a
parameter y to form a two-variable function F(x, y) such that

(1) There is a known starting value y = y_0 with F(x_0, y_0) = Z_0.
(2) There are convenient, known mechanisms to co-transform the number
pair (x_i, y_i) into (x_{i+1}, y_{i+1}), leaving F invariant, namely
F(x_{i+1}, y_{i+1}) = F(x_i, y_i).
(3) The sequence of x-transformations tends to a known goal x = x_w,
while the corresponding sequence of y-transformations tends to y = y_w.
F is so defined that F(x_w, y_w) = y_w.
(4) The iterative part of the algorithm is essentially the repeated application
of the co-transformation procedure, carried out in the absence of
detailed knowledge about the value Z_0.
In terms of solid coordinate geometry, F is easily represented by a curve
confined to the z = Z_0 plane and passing through P_0(x_0, y_0, Z_0), as shown in
Figure 2.4.

[FIGURE 2.4 — The Evaluation Of The Co-Transformation Function Z_0 = f(x_0, y_0):
the curve F(x, y) = Z_0 lies in the z = Z_0 plane, passing through P_0(x_0, y_0, Z_0)
and terminating at Q(x_w, y_w, Z_0), where y_w = Z_0.]

That is,
F(x, y) = Z_0 = f(x_0)   (2.42)

and the transformation invariance is merely the requirement that if the point
P_i(x_i, y_i, Z_0) is on the curve F, then so is P_{i+1}(x_{i+1}, y_{i+1}, Z_0). Clearly, then,

Z_0 = F(x_0, y_0) = F(x_1, y_1) = F(x_2, y_2) = ... = F(x_i, y_i)
    = ... = F(x_w, y_w) = y_w   (2.43)

If the x-process converges to x_w, the corresponding y_w should then equal Z_0.
In reality, it may be difficult or impossible to reach x_w exactly, and
extrapolation by a Taylor series would be desirable if F(x, y) is differentiable with
respect to x near x_w. Thus,

F(x_w − μ, y_n) = F(x_w, y_n) − μ·(∂F/∂x)|_{x=x_w} + O(μ²)   (2.44)

For an N-bit fraction, if |μ| < 2^−N, just the first term will usually be adequate.
On the other hand, both the iteration cost and the roundoff error can be
halved by including the term linear in μ when |μ| < 2^(−N/2). Within this
termination algorithm, the multiplication (if any) involves a half-precision number (μ)
as one of the operands, costing half of a standard multiply time. In what follows,
we shall assume that linear extrapolation will be used when |μ| < 2^(−N/2)
with N >> 1.
The functions being considered here are:

(1) w·e^x,    0 ≤ x < ln(2),   with F = y·e^x,    x_w = 0
(2) w + ln(x),  0.5 ≤ x < 1,   with F = y + ln(x),  x_w = 1
(3) w/x,      0.5 ≤ x < 1,     with F = y/x,      x_w = 1
(4) w/√x,     0.25 ≤ x < 1,    with F = y/√x,     x_w = 1
(2.45)
* Exponential ( 0 ≤ x < ln(2) = 0.69314... )

Z_0 = w·e^(x_0) = y_0·e^(x_0) = (y_0·a_0)·e^(x_0 − ln a_0) = ... = y_i·e^(x_i) = ...
    = y_n·e^μ ≈ y_n + μ·y_n

Initiation      : y_0 = w
Function        : F(x, y) = y·e^x
Transformations : x_{i+1} = x_i − ln a_i,  y_{i+1} = y_i·a_i
Termination     : x_n = μ;  Z_0 ≈ y_n + x_n·y_n
Relative Error  : δ ≈ −( μ²/2 − μ³/3 ) < 2^(−N−1)
(2.46)
* Logarithm ( 0.5 ≤ x < 1 )

Z_0 = w + ln(x_0) = y_0 + ln(x_0) = (y_0 − ln a_0) + ln(x_0·a_0) = ... = y_i + ln(x_i) = ...
    = y_n + ln(1 − μ) ≈ y_n − μ

Initiation      : y_0 = w
Function        : F(x, y) = y + ln(x)
Transformations : x_{i+1} = x_i·a_i,  y_{i+1} = y_i − ln a_i
Termination     : x_n = 1 − μ;  Z_0 ≈ y_n − (1 − x_n)
Error           : δ ≈ −( μ²/2 + μ³/3 )
(2.47)
* Ratio ( 0.5 ≤ x < 1 )

Z_0 = w/x_0 = y_0/x_0 = (y_0·a_0)/(x_0·a_0) = ... = y_i/x_i = ...
    = y_n/(1 − μ) ≈ y_n + μ·y_n

Initiation      : y_0 = w
Function        : F(x, y) = y/x
Transformations : x_{i+1} = x_i·a_i,  y_{i+1} = y_i·a_i
Termination     : x_n = 1 − μ;  Z_0 ≈ y_n + y_n·(1 − x_n)
Relative Error  : δ ≈ −μ²·( 1 + μ + ... )
(2.48)
* Inverse Square Root ( 0.25 ≤ x < 1 )

Z_0 = w/√x_0 = y_0/√x_0 = (y_0·a_0)/√(x_0·a_0²) = ... = y_i/√x_i = ...
    = y_n/√(1 − μ) ≈ y_n + (μ/2)·y_n

Initiation      : y_0 = w
Function        : F(x, y) = y/√x
Transformations : x_{i+1} = x_i·a_i²,  y_{i+1} = y_i·a_i
Termination     : x_n = 1 − μ;  Z_0 ≈ y_n + y_n·(1 − x_n)/2
Relative Error  : δ ≈ −(μ²/2)·( 3/4 + ... ) < 2^(−N−1)
(2.49)
We can then select a_i to be of the form (1 + 2^−m), such that multiplications
by a_i can be replaced by a shift and an add. The value m is usually chosen as
the position number (the i-th bit after the binary point is said to have position
number i) of the leading 1-bit in |x_i − x_w|; but an increase by 1 is
needed for the inverse square root. Therefore,

(1) For w·e^x, m is the position number of the leading 1-bit in x_i. Thus,
we have

x_i = 2^−m + p,  0 ≤ p < 2^−m   (2.50)

Here p represents the bit pattern after the m-th bit. The leading bit, among
all bits in x_i, is responsible for over half of the value of |x_i − x_w| and
is the leading candidate for removal. With the help of a small table
{Φ_m = ln(1 + 2^−m)}, we have

x_{i+1} = (2^−m + p) − ln(1 + 2^−m)
        = (2^−m + p) − ( 2^−m − 2^−2m/2 + 2^−3m/3 − ... )
        = p + O(2^−2m)   (2.51)
where the most objectionable bit is replaced by a second-order disturbance,
and p is largely intact. The y-transformation is performed by a
right shift of m places, followed by an add to the unshifted y_i; that is,

y_{i+1} = y_i + 2^−m·y_i   (2.52)

The iteration requires a table lookup, a shift, and two adds. With m
varying from 1 to N/2, a total of N/2 tabular entries are needed; even if
N = 60, the table requirement is only 30 words.
(2) For w + ln(x), m is selected to be the position number of the leading
1-bit in (1 − x_i); here we have

x_i = 1 − (2^−m + p),  0 ≤ p < 2^−m
x_{i+1} = x_i + 2^−m·x_i = 1 − p + O(2^−2m)   (2.53)

Again the most objectionable bit is replaced by a second-order disturbance.
Also,

y_{i+1} = y_i − ln(1 + 2^−m)   (2.54)

can be calculated using the same table as in (1).

(3) For w/x, the x-transformation is the same as in (2) and the y-transformation
is the same as in (1), namely,

x_{i+1} = x_i + 2^−m·x_i
y_{i+1} = y_i + 2^−m·y_i   (2.55)
(4) For w/√x, m is chosen as 1 plus the position number of the leading
1-bit in (1 − x_i):

x_i = 1 − 2(2^−m + p),  0 ≤ p < 2^−m   (2.56)

then

t_i = x_i + 2^−m·x_i
x_{i+1} = t_i + 2^−m·t_i
        = 1 − 2p(1 + 2^(−m+1) + 2^−2m) − 2^−2m(3 + 2^(−m+1))
        = 1 − 2[ p + O(2^−2m) ]   (2.57)

The y-transformation is the same as in (1), namely

y_{i+1} = y_i + 2^−m·y_i   (2.58)
The explicit algorithms are summarized in Table 2.2. The four iteration
algorithms can be handled by one unified apparatus (see Figure 2.5), consisting
of an adder A, a shifting mechanism S, and an N/2-word fast memory M holding
{ln(1 + 2^−m)}, 1 ≤ m ≤ N/2. The apparatus is capable of the following
elementary operations:

(a) Load C(X) (the contents of register X) into T
(b) Deduce m from C(T) and store it in V
(c) Use m as an address to fetch Φ_m from M; add Φ_m and C(T)
(d) Store the sum in X
(e) Load C(Y) into T
(f) Shift C(T) to the right by m places and put the result into U; add C(T)
and C(U)
(g) Store the sum in Y
(h) Go to step (a) if m < N/2; else enter the termination algorithm.

TABLE 2.2
Summary Of The Co-Transformation Algorithm

Range of x   f(x)      F(x_i, y_i)    x_i              x_{i+1}                     y_{i+1}              x_n     Termination Algorithm
[0, ln 2)    w·e^x     y_i·e^(x_i)    2^−m + p         x_i − ln(1+2^−m)            y_i + 2^−m·y_i       μ       Z_0 = y_n·e^μ ≈ y_n + μ·y_n
[0.5, 1)     w+ln(x)   y_i + ln(x_i)  1 − (2^−m + p)   x_i + 2^−m·x_i ≈ 1 − p      y_i − ln(1+2^−m)     1 − μ   Z_0 = y_n + ln(1−μ) ≈ y_n − μ
[0.5, 1)     w/x       y_i/x_i        1 − (2^−m + p)   x_i + 2^−m·x_i ≈ 1 − p      y_i + 2^−m·y_i       1 − μ   Z_0 = y_n/(1−μ) ≈ y_n + μ·y_n
[0.25, 1)    w/√x      y_i/√x_i       1 − 2(2^−m + p)  x_i·(1 + 2^−m)² ≈ 1 − 2p    y_i + 2^−m·y_i       1 − μ   Z_0 = y_n/√(1−μ) ≈ y_n + μ·y_n/2

[FIGURE 2.5 — The Unified Apparatus: registers T and U feed the adder A; the
shifter S forms 2^−m·C(T); the memory M holds Φ_m = ln(1 + 2^−m); m is also
passed to the termination algorithm.]
Starting by putting x_0 into X and w (= y_0) into Y, the sequencing for the
evaluations is

(1) e^(x_0)  : w = y_0 = 1;  (abcdefgh);   then C(Y) + C(Y)·C(X) ≈ e^(x_0)
(2) ln(x_0)  : w = y_0 = 0;  (abfdecgh);   then C(Y) − [1 − C(X)] ≈ ln(x_0)
(3) 1/x_0    : w = y_0 = 1;  (abfdefgh);   then C(Y) + C(Y)·[1 − C(X)] ≈ 1/x_0
(4) 1/√x_0   : w = y_0 = 1;  (abfdafdefgh); then C(Y) + C(Y)·[1 − C(X)]/2 ≈ 1/√x_0
(2.59)
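The w + ln(x) row of Table 2.2 can be exercised in floating point. The sketch below mimics the apparatus behaviorally: the leading-1-bit position m is found with `log2` rather than a bit counter, and `math.log` stands in for the small Φ_m table.

```python
import math

def cotransform_ln(x, N=24):
    """ln(x) for x in [0.5, 1), by the w + ln(x) co-transformation."""
    assert 0.5 <= x < 1.0
    y = 0.0                            # y_0 = w = 0
    while True:
        d = 1.0 - x                    # |x_i - x_w|
        if d < 2.0 ** (-(N // 2)):     # stop once mu is half-precision small
            break
        m = math.ceil(-math.log2(d))   # position of the leading 1-bit of d
        x += x * 2.0 ** (-m)           # x-transformation: a shift and an add
        y -= math.log(1.0 + 2.0 ** (-m))  # y-transformation: table value
    return y - (1.0 - x)               # termination: Z0 ~ y_n - mu
```

Each pass removes the leading bit of (1 − x) while y absorbs the matching −ln(1 + 2^−m), so the invariant y + ln(x) is preserved until the cheap linear termination step.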
The apparatus invokes the termination algorithm by shipping C(X) and C(Y) away,
to be processed by more conventional equipment. If self-contained operation is
desired, the iterations can be allowed to run until μ ≈ 2^−N. It would then be
necessary to double the table size.
To derive a rough timing estimate, we equate memory access time with
shift time, and note that a conventional multiplier takes time T = N shift-add
times. The expected iterations involve 3N/4 shift-adds for the inverse square
root and N/2 for the other three functions; the time estimates are, therefore, 3T/4
and T/2, respectively. To these, one has to add the timing overhead of the
termination algorithm. For logarithms, the complement-add requirement adds little to
the iteration cost; the total timing cost is still measured by T/2. For exponentials,
ratios, and inverse square roots, a half-multiply is needed. If this is done with a
conventional multiplier, the cost would again be T/2, for a total cost of T
(exponential, ratio) and 1.25T (inverse square root).
For out-of-range arguments x, these functions can be computed as follows.

(1) Exponentials:
Let x = m·2^E equal i·ln(2) + f, where i is an integer and f is a fractional
number, 0 ≤ f < ln(2). Then

e^x = e^(i·ln(2) + f) = e^f · 2^i   (2.60)

Here e^f can then be calculated by the above algorithm. Multiplying by 2^i
is nothing but exponent addition/subtraction. Transforming x to
i·ln(2) + f can be achieved by

x · (1/ln(2)) = I + F   (2.61)

where 1/ln(2) = 1.442695040...; I and F are the integer and fractional
parts of the product, respectively. Then,

if I ≥ 0 :  i = I,      f = F·ln(2)
if I < 0 :  i = I − 1,  f = (1 − |F|)·ln(2)

with ln(2) = 0.693147...   (2.62)
(2) Logarithms:

x = m·2^E,  x > 0 and m ∈ [0.5, 1)
ln(x) = E·ln(2) + ln(m)   (2.63)

where ln(m) can be computed using the above algorithm and ln(2) is a
prestored constant.

(3) Ratios:

1/x = 1/(m·2^E) = 2^−E·(1/m),  m ∈ [0.5, 1)   (2.64)

(4) Inverse square roots:

x = m·2^E,  x > 0 and m ∈ [0.5, 1)
1/√x = 2^(−E/2)·(1/√m)             if E even
1/√x = 2^(−(E+1)/2)·√2·(1/√m)      if E odd   (2.65)
The CORDIC and co-transformation techniques can therefore provide
faster computation by employing only shifts, adds/subtracts, and small lookup
tables. Thus, these algorithms are integrated into the designed multipurpose
arithmetic unit, which is discussed in the next chapter.
CHAPTER 3
MULTIPURPOSE FLOATING-POINT DISTRIBUTED ARITHMETIC UNIT
The major operations found in DSP can be modeled as a matrix-matrix,
matrix-vector, or vector-vector multiplication, each of which uses a sum-of-products
(SOP) as its primitive operation. In a fixed-point SOP, the multiplication task
is generally dominant from a temporal viewpoint. Thus, for real-time DSP, a
high-speed multiplier is required. From the previous chapter, we have seen that
if the coefficients of a SOP operation are known a priori, then the fastest SOP
processor is based on the so-called distributed arithmetic (DA) paradigm.
The concept of DA can be traced to Anderson [AND71] and Zohar
[ZOH73], who noted that the lumped-parameter algebra associated with a linear
shift-invariant filter can be redistributed over a set of bit patterns. By the
mid-1970s, the low-cost, high-performance memory chips needed to implement
the required DA mappings made the DA SOP commercially available. By
replacing the time-consuming conventional lumped arithmetic operations with
high-speed table lookups, high data throughputs were achieved. Since its
introduction, the distributed filter approach has been applied to a variety of
problems using MSI, LSI, VLSI, microprocessor, and bit-slice microprocessor
technologies ([AND71], [LIU75], and [BUR77]), and has been studied extensively in
terms of complexity-throughput tradeoffs ([AND81], [ZOH73], [BUR77],
[PEL74], [JEN77], [TAN78], [KAM77], [ARJ81], [TAY83B], [TAY84C],
[CHE85], and [TAY86]). With the exception of a block floating-point study
[TAY84A], these designs have remained fixed-point. However, as mentioned in
Section 2.2.3, extra overhead must be paid to process the effects of dynamic
 38 
 39 
range overflow in a fixedpoint number system. This overhead can be elimi
nated by considering the floatingpoint DA computing system designed in Sec
tion 3.2.
In Section 3.4, the floating-point distributed arithmetic unit (FPDAU) is
extended to a complex floating-point distributed arithmetic unit (CFPDAU).
While a traditional complex multiply requires 4 real products and 2 adds, the
CFPDAU requires but two concurrent table lookup operations; one produces the
real part of the result and the other produces the imaginary part. In this way,
the computational speed is quadrupled with only a doubling of circuit
complexity.
The FPDAU or the CFPDAU, like the DA unit, is applicable to linear SOP
calculations where one set of the operands is known a priori. Thus, the CFPDAU
needs to be further modified if it is to serve as a general-purpose arithmetic
processor. The architecture for such a multi-purpose arithmetic unit (MPAU) is
presented in Section 3.5. The more general-purpose CFPDAU requires only a
slight modification of the basic CFPDAU. Essentially, only two extra small
lookup tables are required (one for the trigonometric functions and the other for
the exponentials, logarithms, ratios, and inverse square roots). Thus the
CFPDAU architecture can become the cornerstone for the design of a fast,
compact, robust processor.
The data regularity and control characteristics of the developed arithmetic
unit also make it an ideal candidate for integration into a fine-grain systolic
array, as discussed in the next chapter.
3.1 Distributed Arithmetic Unit
In general, a linear SOP is given as
y(n) = Σ_{i=0}^{N-1} a_i·x(n-i); x = input, y = output. (3.1)
Suppose further that the coefficient set {a_i} is known a priori (e.g., digital
filters, digital transforms, numerical analysis) and x(n) ( |x(n)| < 1 ) is coded as a
fixed-point (FXP) 2's complement L-bit word; then the DA version of Equation
3.1 is
y(n) = -Φ_n(0) + Σ_{k=1}^{L-1} Φ_n(k)·2^(-k); Φ_n(k) = Σ_{i=0}^{N-1} a_i·x[k : n-i] (3.2)
where x[k : n-i] is the kth least significant bit of x(n-i). The value of Φ_n(k)
can be obtained as a table lookup call and the scaling by 2^(-k) can be performed
using a shifter. The implementation of Equation 3.2 is shown in Figure 3.1 and
[Figure: a shift register and address generator buffer drive partial-product table lookups feeding a shift-adder that produces the output y(n).]
FIGURE 3.1
Core Distributed Arithmetic Unit
is referred to as a distributed arithmetic unit. The throughput of this unit is
essentially limited only by the memory cycle time and the data wordlength. When
the order (i.e., N) exceeds the addressing space limits of the memory table,
sub-SOPs must be designed (i.e., the original SOP is divided into small groups of
SOPs) and integrated together. For such cases, the conventional approach,
proposed by Burrus [BUR77] and others, is to operate a set of distributed
sub-SOP sections in parallel and then combine their outputs with an adder-tree.
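As an illustration of Equation 3.2, the fixed-point DA recursion can be sketched in software. This is a hypothetical model, not the hardware: a 2^N-entry Python list plays the role of the partial-product memory, and the sign-bit plane enters with a negative weight as required by 2's complement coding.

```python
def build_da_table(coeffs):
    # One entry per N-bit address: the sum of coefficients whose bit is set.
    return [sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
            for addr in range(1 << len(coeffs))]

def da_sop(coeffs, samples, L):
    """DA sum of products; samples are fractions in [-1, 1), L-bit words."""
    table = build_da_table(coeffs)
    # 2's complement integer codes of the samples
    codes = [int(round(x * (1 << (L - 1)))) & ((1 << L) - 1) for x in samples]
    y = 0.0
    for k in range(L):
        addr = 0
        for i, c in enumerate(codes):      # gather bit plane k across inputs
            addr |= ((c >> k) & 1) << i
        weight = 2.0 ** (k - (L - 1))
        if k == L - 1:                     # sign-bit plane subtracts
            y -= table[addr] * weight
        else:
            y += table[addr] * weight
    return y
```

One table lookup and one shift-add per bit plane replace all N multiplications, which is exactly why the throughput below is set by memory cycle time and wordlength rather than by N.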
Both finite impulse response (FIR) digital filters and infinite impulse
response (IIR) digital filters can be implemented with this distributed arithmetic
unit. An error analysis of FIR digital filters implemented using the DA method
has been reported by Peled and Liu [PEL74] and improved by Kammeyer
[KAM77]. The Peled and Liu model suggests that the finite wordlength effect
errors are of zero mean with a variance of σ² = Q²/9, where Q is the
quantization step size. An error analysis for a DA IIR digital filter has been
published by Taylor [TAY86], and it indicates:
σ²_{D-I} = (Q²/12)·|H(z)|² + Q²/9; for Direct-I architecture
σ²_{D-II} = (Q²/12)·|H(z)|² + Q²/9; for Direct-II architecture (3.3)
These studies support the thesis that DA data processing (when applicable) is
faster and more precise than an equivalently packaged conventional unit, which
has error variance:
σ² = N·(Q²/3) + N·((NQ)²/12); where N is the order of the IIR filter (3.4)
Fixed-point DA is a method by which high-throughput digital filtering can
be supported. However, as is often the case, high performance is gained at the
expense of system complexity. In Appendix B, a bit-serial DA architecture is
presented which mitigates the pinout and large-memory problems without
reducing throughput. The method lends itself to direct VLSI integration and can
be used to develop high-order, high-precision digital filters.
3.2 Floating-Point Distributed Arithmetic Unit (FPDAU)
Distributed arithmetic units (DAUs) are well studied in the context of
fixed-point data processing. The only exception is the block floating-point
DAU reported by Taylor [TAY84A]. This concept will now be extended to a full
floating-point DAU revision, which is referred to as the floating-point distributed
arithmetic unit (FPDAU).
If the data are represented in the conventional floating-point form
x = m(x)·r^e(x), where m(x) is the m-bit mantissa, r is the radix, and e(x) is the
e-bit exponent, then the SOP with floating-point precision output yields the
following:
y(n) = Σ_{i=0}^{N-1} a_i·x(n-i) = Σ_{i=0}^{N-1} a_i·m(x(n-i))·r^e(x(n-i)) (3.5)
Assume E = max{e(x(j))} where j = n, n-1, ..., n-N+1; then
y(n) = r^E·( Σ_{i=0}^{N-1} a_i·m̃(x(n-i)) )
m̃(x(n-i)) = m(x(n-i))·γ(x(n-i)); γ(x(n-i)) = r^Δ(x(n-i))
Δ(x(n-i)) = e(x(n-i)) - E ≤ 0 (3.6)
For r = 2, the scaling of m(x(j)) by γ(x(j)) can be accomplished using a
standard 2's complement "right shift" at the address generator buffer.
In the aforementioned fixed-point DAU designs, each cell of the address
generator register was of length m bits, for a total length of N·m bits (N is the
order of the filter). Since a scaled mantissa (i.e., m̃(x(j))) is used in the
FPDAU, it may be desirable to extend the length of each cell to (m+V) bits,
where V = max{ |Δ(x(n-i))| }, for a total length of N·(m+V) bits. In this
way, a mantissa can be scaled by a maximal value, 2^V, without a loss of
significance. Under these assumptions, the FPDAU version of Equation 3.6 becomes:
y(n) = r^E·( -Φ_n(x[0 : n-i]) + Σ_{k=1}^{m+V-1} 2^(-k)·Φ_n(x[k : n-i]) ) (3.7)
where Φ_n and x[k : n-i] have been previously defined.
An FPDAU is diagrammed in Figure 3.2. It consists of a "core" floating-point
DA engine which is augmented with a set of peripheral function modules.
The additional hardware cells consist of:
(1) an exponent queue;
(2) an exponent magnitude compare module;
(3) an address generator (possibly of extended length);
(4) normalization circuitry; and
(5) a floating-point shift-adder instead of a fixed-point shift-adder.
An FPDAU cycle consists of the following operations in the indicated order.
(1) Read the next exponent e(x(j)) and mantissa m(x(j)). Present e(x(j))
to the exponent queue and exponent comparator. Load m(x(j)) into
the address generator buffer.
(2) After reading the data, determine the maximum exponent E = max{
e(x(j)) | j = n, n-1, n-2, ..., n-N+1 }.
(3) Adjust the contents of the jth address generator cell (in a 2's complement
sense) by the amount E - e(x(j)), for all j concurrently.
(4) Perform a core floating-point DA cycle of up to m+V table lookups.
(5) Adjust the final exponent by adding (or subtracting) E to produce
e(y(n)).
In practice, the input and output adjustments would be performed concurrently,
reducing temporal overhead to a minimum. As a result, the FPDAU cycle is
again essentially defined by the speed of the core floating-point unit. The value
is approximately m+V memory accesses.
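The five-step cycle can be mimicked in software. In this hedged sketch the core DA table lookups of step (4) are collapsed into an ordinary dot product on the aligned mantissas; only the exponent bookkeeping of steps (2), (3), and (5) is modeled, with radix r = 2 assumed.

```python
import math

def fpdau_sop(coeffs, samples):
    ms_es = [math.frexp(x) for x in samples]            # step 1: split m and e
    E = max(e for _, e in ms_es)                        # step 2: block exponent
    aligned = [math.ldexp(m, e - E) for m, e in ms_es]  # step 3: right shifts
    acc = sum(a * m for a, m in zip(coeffs, aligned))   # step 4: core DA cycle
    return math.ldexp(acc, E)                           # step 5: restore exponent
```

Because all inputs share the single block exponent E, the core engine never sees anything but aligned fixed-point mantissas, which is what lets the fixed-point DA machinery carry over unchanged.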
[Figure: the core DA engine surrounded by the exponent queue, exponent comparator, and normalization circuitry; inputs are the mantissa m(x(n)) and exponent, outputs the result mantissa and exponent e(y(n)).]
FIGURE 3.2
The Floating-Point Distributed Arithmetic Unit
The FPDAU provides a means of computing a floating-point SOP operation.
The same calculation could be implemented using traditional components
by supplying all N coefficient-data pairs (i.e., (a_i, x(n-i))) to a floating-point
multiplier in a serial manner. The N products would be sent to a floating-point
arithmetic unit for accumulation. Assuming that accumulation is interleaved
with multiplication, the traditionally fashioned SOP would have a latency on the
order of N floating-point multiply cycles. In addition, there would be some loss
of precision (due to rounding/truncating) during each multiplication; this loss
may or may not be significant. Compared to traditional methods, the
FPDAU offers several distinct advantages. These include
accelerated arithmetic speed,
improved error budget, and
superior packaging and cost.
3.3 Throughput and Precision of FPDAU
A preliminary analysis of a DA floating-point vector processor of length 16
will be presented. The data for the new unit is based on commercially available
memory and support chips. The latency of the preliminary design is compared
to a state-of-the-art Weitek WTL1032/1033 floating-point chip set which
dissipates two watts and costs $700 in small quantities. For a pipelined
architecture, the latency of the WTL1032 multiplier (32x32 bits) is 100 ns per
stage and the latency of the WTL1033 ALU is also 100 ns per stage, with a total
latency of 500 ns each. The 16-tap SOP will then take 2.6 μs when implemented
using this chip set in pipeline mode. The FPDAU model is predicated on the
assumption that m = 24 bits and V = 7 bits (i.e., a 32-bit single precision
model). Based on a 12 ns table lookup latency (e.g., Performance P4C187), a
20 ns shift-adder (64 bits) latency, a 10 ns address register shift/align, a 10 ns
exponent compare, and a 10 ns mantissa normalization, the latency of the
floating-point SOP using the FPDAU is reduced to 1.02 μs without pipelining.
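The quoted 1.02 μs figure can be reproduced from the stated component latencies. This is a back-of-envelope model; the breakdown into per-step and one-time costs is an assumption drawn from the numbers above.

```python
# m+V = 31 partial-product steps, each needing a 12 ns table lookup plus a
# 20 ns shift-add, plus one-time align/compare/normalize overhead.
def fpdau_latency_ns(m=24, V=7, table_ns=12, shift_add_ns=20,
                     align_ns=10, compare_ns=10, normalize_ns=10):
    return (m + V) * (table_ns + shift_add_ns) + align_ns + compare_ns + normalize_ns

fpdau_latency_ns()   # 31 * 32 + 30 = 1022 ns, i.e. about 1.02 microseconds
```

Note that the tap count N never enters the formula: latency depends only on the wordlength m+V.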
Without pipelining, the FPDAU is still approximately 2.6 times faster than
the state-of-the-art pipelined multiply/accumulate processor. In addition, the
latency of the FPDAU is dependent on the wordlength, not on the length of the
SOP as in the traditional multiply/accumulate processor. The latency versus the
length of the SOP for the FPDAU and the traditional multiply/accumulate
processor is shown in Figure 3.3. It is seen that for the FPDAU, the latency is a constant
[Figure: latency in μs versus the number of SOP taps; the pipelined WTL1032/1033 latency grows linearly with the tap count while the FPDAU latency remains flat.]
FIGURE 3.3
Latency Of The FPDAU vs. The Conventional Multiplier/Accumulator
if the wordlength of the address generator is fixed. However, the wordlength is
dependent on the divergence of the data and the precision requirement. In
addition, the size of the table increases geometrically as the length of the SOP
increases. The solution to these problems is to use multiple sub-SOPs and an
adder-tree. Details are discussed in Section 3.5.
Next, the precision of implementing the SOP using the FPDAU will be compared
with the precision of using the traditional multiply/accumulate processor. In
previous fixed-point DA studies, it was demonstrated that DA offers distinct
precision advantages over conventional SOP designs. How the floating-point
architectures differ can be similarly analyzed using statistical modeling methods
developed for the study of digital filters. For each implementation of the
Nth-order FIR equation:
y(n) = Σ_{i=0}^{N-1} a_i·x_i; x_i = x(n-i) (3.8)
the conventional architecture consists of many error sources distributed
throughout its architecture. For a direct implementation, the multiplicative
mantissa adjustment and normalization operation rounds the full-precision
product into a single-precision value and normalizes the result to the range
0.5 ≤ m_i < 1.0. If a_i and x_i produce a uniformly distributed mantissa (a
reasonable assumption), then
σ²_i = σ²·(r^(e_xi + e_ai))² if 0.5 ≤ m_i < 1
σ²_i = σ²·(r^(e_xi + e_ai - 1))² if 0.25 ≤ m_i < 0.5 (3.9)
where r = 2 for a binary system, σ² = Q²/12, and Q = 2^(-m) (the mantissa has
m bits). The total error budget of using the conventional multiply/accumulate
processor then becomes
σ²_TC = σ²·( 0.5·Σ_{i=0}^{N-1} ε((r^(e_xi + e_ai))²) + 0.5·Σ_{i=0}^{N-1} ε((r^(e_xi + e_ai - 1))²) ) (3.10)
where ε represents the expected value. If the range of the products is assumed
to be uniformly distributed over [-M, M), M = 2^q, q = 2^e (the exponent has e bits),
then
ε((r^(e_xi + e_ai))²) = ∫_{-M}^{M} x²·dx / (2M) = M²/3 (3.11)
which yields
σ²_TC = σ²·(M²/3)·(2.5N - 2.5) (3.12)
The FPDAU model consists of a σ²_DAU = (Q²/9)·ε((r^E)²) error variance
attributed to the core floating-point DAU operation. The resulting output has an
equivalent floating-point form y(n) = y_DAU·r^E where
E = max{ log_r |x_i| }; r = 2 (for a binary system) (3.13)
As a result, the error variance for the FPDAU can be modeled as
σ²_FPDAU = σ²_DAU = (Q²/9)·ε((r^(e_ai + e_xi))²)
= (4/3)·σ²·ε((r^(e_ai))²·(r^(e_xi))²)
= (4/3)·σ²·(M²/3) (assuming a_i and x_i are independent)
= 4σ²M²/9 (3.14)
The relative error is therefore
σ²_TC / σ²_FPDAU = ( σ²·(M²/3)·(2.5N - 2.5) ) / ( 4σ²M²/9 )
= (7.5N - 7.5)/4 ≈ 2N (3.15)
That is, the FPDAU has about a 2N-fold variance advantage over traditional
schemes. For a typical value of N = 16, the improvement can amount to an
additional five bits of precision. This may or may not be significant, depending
on the mantissa wordwidth m. These results are in complete agreement with what
has been observed with fixed-point conventional and distributed architectures.
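The five-bit claim follows directly, assuming the variance ratio σ²_TC/σ²_FPDAU = (7.5N - 7.5)/4 ≈ 2N of Equation 3.15:

```python
import math

# Variance ratio and its equivalent bit gain, assuming the ratio
# sigma_TC^2 / sigma_FPDAU^2 = (7.5N - 7.5)/4 (approximately 2N).
def precision_gain_bits(N):
    ratio = (7.5 * N - 7.5) / 4.0
    return ratio, math.log2(ratio)

ratio, bits = precision_gain_bits(16)   # ratio = 28.125, roughly 4.8 bits
```

Each factor of two in error variance is worth half a bit of mantissa, so a ratio near 2^5 corresponds to about five extra significant bits.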
In a conventional floating-point system, the allocated wordwidth (32 bits or
64 bits, typically) is divided between the mantissa and exponent. This division
establishes precision and dynamic range limits. These parameters, in turn,
establish latency (throughput) bounds and, to some degree, packaging and power
requirements. However, in many application-specific designs, the optimal
choice of precision and range delimitation may not correlate well with the
standard format sizes (i.e., double or extended precision mantissa and exponent
values). Often, in real-time applications, one may choose to sacrifice precision
(mantissa) to gain speed. This will have the effect of reducing the amount of
time spent in multiplication. However, the dynamic alteration of a precision
qualifier is not permitted in commercially available, standard format,
floating-point devices.
The dynamic assignment of mantissa and exponent wordwidths is permitted
in an FPDAU design. The effect on latency will be essentially a scaling of the
partial product/accumulation time, which is on the order of m+V memory lookup
cycles. That is, speed will scale linearly with precision (i.e., m) and
logarithmically with range (i.e., V = log₂(2^max(e))). In fact, the FPDAU can be
reduced to a fixed-point processor simply by setting the exponent to zero.
Therefore, the FPDAU can be dynamically "tuned" to provide the best speed,
precision, and range mixes. This can be accomplished without any hardware or
architecture changes.
3.4 Complex FPDAU (CFPDAU)
Complex arithmetic is essential to many transformations and image and
signal processing operations. In the complex domain, the sum of N complex
products is given by
y = Σ_{i=0}^{N-1} w_i·x_i
w_i = w_iR + j·w_iI ; known a priori
x_i = x_iR + j·x_iI ; input
y = y_R + j·y_I ; output
(3.16)
where all the variables (real and imaginary) are coded as floating-point words.
For a traditional system, the execution of Equation 3.16 would require
4N real products and
2N real adds.
Using Equation 3.6 as a model, Equation 3.16 can be rewritten to read
y = Σ_{i=0}^{N-1} w_i·x_i = r^E·Σ_{i=0}^{N-1} (w_iR + j·w_iI)·(m̃_iR + j·m̃_iI) (3.17)
with
E = max{ e_xR(i), e_xI(i) | i = 0, ..., N-1 } (3.18)
where
e_xR(i) = exponent value of x_iR
e_xI(i) = exponent value of x_iI (3.19)
In addition, m̃_i denotes the scaled mantissa values, defined in the manner used
in the FPDAU development. Equation 3.17 can be directly realized as an
FPDAU satisfying the equations:
y_R = r^E·( -Φ_n(x[0 : i]) + Σ_{k=1}^{m+V-1} 2^(-k)·Φ_n(x[k : i]) )
y_I = r^E·( -Ω_n(x[0 : i]) + Σ_{k=1}^{m+V-1} 2^(-k)·Ω_n(x[k : i]) ) (3.20)
where
Φ_n(x[k : i]) = Σ_{i=0}^{N-1} ( w_iR·m̃_iR[k : i] - w_iI·m̃_iI[k : i] )
Ω_n(x[k : i]) = Σ_{i=0}^{N-1} ( w_iI·m̃_iR[k : i] + w_iR·m̃_iI[k : i] )
V = max{ E - e_x(i) } (3.21)
The required architecture is shown in Figure 3.4. Observe that all the external
circuitry about the core floating-point DAU remains functionally the same,
except that the address generator register size is doubled to hold both the real and
imaginary mantissas. The core of the DAU is also doubled, with one section
producing real partial products and the other generating the imaginary terms.
One significant difference between the complex FPDAU (CFPDAU) and the
FPDAU is that the address space requirement, for a given N, increases by a
factor of two. The CFPDAU differs from conventional complex multipliers in
several important areas as well. First, while a traditional complex multiply
requires 4 real products and 2 real adds per multiplication, the CFPDAU tables
collect these operations into two groups and execute the 4 real products and 2
real adds as two concurrent table lookup operations. One table lookup generates
the real part of the complex product and the other produces the imaginary
part.
The speed at which a CFPDAU can operate is again established by table
lookup cycle times. Based on a 12 ns memory module and a 20 ns floating-point
shift-adder, the 16-tap complex SOP can be performed at an effective latency
of 32·(m+V) nanoseconds, where m+V is the address register cell length in bits.
For a typical value of m+V = 48 bits, the effective latency of a 16-tap complex
SOP is approximately 1.5 μs. The latency of a typical real floating-point 16-tap
SOP, implemented using a state-of-the-art multiplier and ALU, is much greater
FIGURE 3.4
Complex Floating-Point Distributed Arithmetic Unit (CFPDAU)
than this figure (as shown in the previous section) and, for the complex
multiply/accumulate case, would be four times longer. Thus, the CFPDAU can
outperform existing products and does so within a comparable hardware budget.
3.5 Multi-Purpose FPDAU/CFPDAU
As indicated in the previous section, the address space of the CFPDAU is
increased by a factor of two compared to the FPDAU. This means that the
lookup table size increases from 2^N to 2^(2N) words (N is the number of taps of
the SOP). For example, a 16-tap complex SOP will require two 2^32-word
memories. Compared to the FPDAU, each table is larger by over 4 orders of
magnitude. This memory size requirement obviously precludes fabrication in
one VLSI chip. In order to reduce the table size, the address generator buffer
can be divided into several groups. In this way, the memory size is reduced
geometrically. Each of these groups has its own sub-SOP lookup table. The
partial SOPs can then be summed together using an adder-tree. The number of
levels of the adder-tree depends on the number of groups, which in turn affects
processor latency. More groups result in a longer latency. However, this effect
can be reduced when pipelining is employed. Trading large table sizes off
against minimal extra circuitry (adders and shift-adders) and a small increment
of latency is generally an accepted design choice. For example, an 18-tap
complex SOP will require two 2^36-word lookup tables. If the address generator
register buffer is divided into six groups, then only twelve 2^6-word lookup
tables are needed. In this instance, we save more than 8 orders of magnitude of
memory by introducing only five extra shift-adders and one adder-tree
consisting of six adders. The diagram of this scheme is shown in Figure 3.5. The
adder-tree has three levels. For the 20 ns floating-point adder technique, it
introduces 60 ns (pipelined) to the total latency and can be ignored.
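The table-size arithmetic behind this example is easy to check (a sketch; `n_bits` is the total address width, which for the complex case is 2 bits per tap):

```python
def partitioned_table_words(n_bits, groups):
    """Words in one full table vs. total words after splitting the address
    generator into equal groups, each with its own sub-SOP lookup table."""
    full = 2 ** n_bits
    split = groups * 2 ** (n_bits // groups)
    return full, split

full, split = partitioned_table_words(36, 6)  # 18-tap complex SOP
# full = 2**36 words; split = 6 * 2**6 = 384 words per (real or imaginary) bank
```

The exponential-versus-linear contrast in the two expressions is the whole argument: memory shrinks by orders of magnitude while only a handful of shift-adders and one adder-tree are added.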
Reducing the table size is one reason to divide the address generator buffer
into groups. Another reason is that, under such conditions, each sub-SOP of the
CFPDAU can be used as a general-purpose multiplier. These multipliers can
[Figure: the address generator buffer divided into groups, each with its own exponent/mantissa inputs and sub-SOP lookup table, combined through shift-adders and an adder-tree.]
FIGURE 3.5
The Real Part Of The Modified CFPDAU
be connected together to construct a systolic array, which is the subject of the
next two chapters.
Recall that the complex multiply
y_R = w_R·x_R - w_I·x_I
y_I = w_I·x_R + w_R·x_I
(3.22)
can be rewritten to read:
y_R = r^(E_x)·( -Φ(m[0 : x]) + Σ_{k=1}^{m+V-1} 2^(-k)·Φ(m[k : x]) )
y_I = r^(E_x)·( -Ω(m[0 : x]) + Σ_{k=1}^{m+V-1} 2^(-k)·Ω(m[k : x]) ) (3.23)
where
E_x = max{ e_xR, e_xI }
V = max{ E_x - e_x }
Φ is the table consisting of { 0, w_R, -w_I, w_R - w_I }
Ω is the table consisting of { 0, w_I, w_R, w_I + w_R }
m[k : x] is the kth least significant bit of the
mantissas x_R and x_I (3.24)
Compared to the traditional complex multiply, which requires 4 real products
and 2 real adds, the DA multiply (Equation 3.23; its architecture is shown in
Figure 3.6) is faster and simpler.
FIGURE 3.6
Complex Distributed Arithmetic Multiplier
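A software model of this multiplier (hypothetical, with the floating-point alignment omitted and the input taken as a pair of fixed-point L-bit fractions) makes the two-table mechanism concrete:

```python
def da_complex_multiply(w, x, L=24):
    # Phi/Omega tables of Eq. 3.24, addressed each bit plane by one bit of
    # the real and one bit of the imaginary mantissa of x.
    phi   = [0.0, w.real, -w.imag, w.real - w.imag]
    omega = [0.0, w.imag,  w.real, w.imag + w.real]
    mask = (1 << L) - 1
    cr = int(round(x.real * (1 << (L - 1)))) & mask   # 2's complement codes
    ci = int(round(x.imag * (1 << (L - 1)))) & mask
    yr = yi = 0.0
    for k in range(L):
        addr = ((cr >> k) & 1) | (((ci >> k) & 1) << 1)
        wgt = 2.0 ** (k - (L - 1))
        if k == L - 1:               # sign-bit plane subtracts
            yr -= phi[addr] * wgt
            yi -= omega[addr] * wgt
        else:
            yr += phi[addr] * wgt
            yi += omega[addr] * wgt
    return complex(yr, yi)
```

Each bit plane costs exactly two lookups and two shift-adds, regardless of the fact that four real products and two real adds are being absorbed.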
Thus, when the address generator buffer of the CFPDAU is divided into
several groups, each group with its own sub-SOP lookup table can be used
as a DA complex multiplier. In such a case, each group can be used as a
sub-SOP unit when the coefficients are known a priori, or as a general-purpose
multiplier if the coefficients are not known a priori. However, this design is still
not powerful enough for the device to serve as the arithmetic unit of a
general-purpose computing system for DSP and linear algebraic applications.
Fortunately, the fast architecture for computing the elementary arithmetic
functions (e.g., trigonometric functions, logarithms, exponentials, ratios, and
inverse square roots) can be installed into the complex DA multiplier very easily.
The diagram of this architecture is shown in Figure 3.7. This arithmetic unit is
referred to as a Multi-Purpose Arithmetic Unit (MPAU), which can implement
the following arithmetic functions in about one to three shift-add multiplication
times:
half of a 3-tap complex SOP
6-tap real SOP
half of a complex multiplication
real multiplication
real division
trigonometric functions (sin, cos, tan)
rotation (x' = R·cos(θ+λ), y' = R·sin(θ+λ))
vectoring (R = √(x² + y²), θ = tan⁻¹(y/x))
real natural logarithms
real natural exponentials
real inverse square roots
Compared to the traditional implementation methods, the MPAU is much faster
and simpler.
FIGURE 3.7
Multi-Purpose Arithmetic Unit (MPAU)
The MPAU achieves the above elementary arithmetic functions in the manner
explicitly explained in the following (with math-error, overflow, and underflow
detection ignored):
A. Complex/Real SOP and Multiplication
(i) Real Part of Complex SOP
LOADT(MT,ΦSOP); /* Load lookup table from Data RAM into lookup table ΦSOP */
LOADS(Ms,SOP,6); /* Load 6 data (for complex data it is 3 complex numbers, for real data it is 6 real numbers) from Data RAM into registers SOP(i), i = 0, 1, 2, ..., 5 */
SELSOP; /* Select the SOP lookup table */
AJE; /* Do exponent compare, generate max{e} and max{Δe}, adjust the mantissas of SOP(i) */
MOVE(E,EO); /* Move C(E) (contents of register E) to register EO */
ZERO(X1); /* Set register X1 to zero */
for (i=1; i<m̃; i++) { /* m̃ = # of bits of mantissa + max{Δe} */
AMOVE(MS,V); /* Shift-right one bit the mantissas of registers SOP(i) to generate the table address in register MS and move it into register V */
ADDA1(X1,TAB); /* get the lookup value from the SOP table, put it into register TAB, and C(ACC1) = C(X1) + C(TAB) */
SEMA1(X1); } /* decrease the exponent of C(ACC1) by 1 and move the result into register X1 */
AMOVE(MS,V);
SUBA1(X1,TAB); /* get the value C(TAB) from the SOP lookup table and C(ACC1) = C(X1) - C(TAB) */
MOVE(ACC1,X1); /* C(X1) ← C(ACC1) */
MOVE(X1,JO); /* C(JO) ← C(X1) */
EPADD; /* Add the exponent of JO by C(EO) and store the exponent into register E1 and the mantissa into M1 */
RCV; /* Combine C(E1) and C(M1) into a floating-point number and store it into register J1 */
SAVE(J1,M); /* Store C(J1) into Data RAM */
The same operations are used to generate the imaginary part except that the
contents of the lookup table are different.
(ii) The above procedure can also be used to generate a 6-tap real sub-SOP.
(iii) For complex and real multiplications, the same operations are used except
that the first two commands are replaced as follows.
(1) Real Part of Complex Multiplication
LOADR(Mr,X1); /* Load real part of multiplicand from Data RAM into register X1 */
LOADR(Mi,Y1); /* Load imaginary part of multiplicand from Data RAM into register Y1 */
MOVE(X1,R1); /* Move C(X1) into register R1 */
MOVE(Y1,R2); /* C(R2) ← C(Y1) */
SUBA1(X1,Y1); /* C(ACC1) = C(X1) - C(Y1) */
MOVE(ACC1,R3); /* C(R3) ← C(ACC1) */
LOADTR; /* Load the contents of registers R1, R2, and R3 into table ΦSOP */
ZERO(SOP); /* Set registers SOP(i) to zeros */
LOADS(Ms,SOP,2); /* Load the complex multiplier into registers SOP(0) and SOP(1) */
The imaginary part is computed the same way except that the fifth command,
SUBA1, is replaced by ADDA1(X1,Y1).
(2) Real Multiplication
LOAD(M,R1); /* Load multiplicand into register R1 */
LOADTR;
ZERO(SOP);
LOADS(Ms,SOP,1); /* Load multiplier into register SOP(0) */
B. Real Division (Ratio)
LOADR(Mx,JO); /* Load divisor, X, into register JO */
SEM; /* Separate the exponent and mantissa; the exponent is stored in register E1 and the mantissa in register M1 */
if (SN==1) /* X is a negative number */
MOVE(-M1,X1); /* C(X1) ← -C(M1) */
else /* X is a positive number */
MOVE(M1,X1); /* C(X1) ← C(M1) */
LOADR(My,X2); /* Load dividend into register X2 */
while (DEDM < m) { /* Deduce m̃ from 1-C(X1) and move it into register V; m = # of bits of mantissa */
*SEMY1(m̃); /* Decrease the exponent of C(X1) by m̃ and store the result into register Y1 */
*SEMY2(m̃); /* Decrease the exponent of C(X2) by m̃ and store the result into register Y2 */
**ADDA1(X1,Y1); /* C(ACC1) = C(X1) + C(Y1) */
**ADDA2(X2,Y2); /* C(ACC2) = C(X2) + C(Y2) */
***MOVE(ACC1,X1); /* C(X1) ← C(ACC1) */
***MOVE(ACC2,X2); } /* C(X2) ← C(ACC2) */
MOVE(-E1,EO); /* C(EO) ← -C(E1) */
if (SN==1) /* the divisor is a negative number */
MOVE(-X2,JO); /* C(JO) ← -C(X2) */
else
MOVE(X2,JO); /* C(JO) ← C(X2) */
EPADD;
RCV;
SAVE(J1,M); /* Store C(J1) into Data RAM */
*, **, *** : commands with the same marker can be processed concurrently
C. Sin, Cos, Tan, and Rotation
LOADR(Mθ,ANG); /* Load angle (in radians) from Data RAM into register ANG */
ZERO(V); /* Set register V to zero */
SELTR; /* Select the trigonometric function table (tan⁻¹(2^-(2i-2))) for table lookup */
if (SIGN==0) { /* Angle is a positive number */
LOADR(Mx,X2); /* Load x-coordinate value into register X2 */
LOADR(My,X1); /* C(X1) ← (y-coordinate value) */
SUBA3; } /* get the value C(TAB) from the lookup table and C(ANG) = C(ANG) - C(TAB) */
else { /* Angle is a negative number */
LOADR(Mx,X2);
LOADR(My,X1);
ADDA3; } /* C(ANG) = C(ANG) + C(TAB) */
INC(V); /* C(V) = C(V) + 1 */
for (i=1; i < m; i++) {
*SEMY1(2i-2); /* Decrease the exponent of C(X1) by 2i-2 and store the result into register Y1 */
*SEMY2(2i-2); /* Decrease the exponent of C(X2) by 2i-2 and store the result into register Y2 */
EXCHY12; /* exchange the contents of registers Y1 and Y2 */
if (SIGN==0) {
**SUBA1(X1,Y1); /* C(ACC1) = C(X1) - C(Y1) */
**ADDA2(X2,Y2);
SUBA3; }
else {
**ADDA1(X1,Y1);
**SUBA2(X2,Y2);
ADDA3; }
***MOVE(ACC1,X1);
***MOVE(ACC2,X2);
INC(V);
}
For Sin: set C(Mx) = 1 and C(My) = 0; after running the above procedure, do
C(X1) = C(X2)*C(C3) and store C(X1) into Data RAM, where register C3
contains the constant 1/k.
For Cos: set C(Mx) = 1 and C(My) = 0; after running the above procedure, do
C(X1) = C(X1)*C(C3) and store C(X1) into Data RAM.
For Tan: set C(Mx) = 1 and C(My) = 0; after running the above procedure, do
C(X1) = C(X2)/C(X1) and store C(X1) into Data RAM.
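For reference, procedure C is a variant of the textbook CORDIC rotation. This sketch uses the standard shift-by-i schedule and a final divide by the accumulated gain, rather than the MPAU's table selects and register moves:

```python
import math

def cordic_rotate(x, y, angle, iters=32):
    k = 1.0
    for i in range(iters):
        d = 1.0 if angle >= 0.0 else -1.0          # rotate toward zero residual
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        angle -= d * math.atan(2.0 ** -i)          # prestored arctangent table
        k *= math.sqrt(1.0 + 4.0 ** -i)            # accumulated CORDIC gain
    return x / k, y / k

c, s = cordic_rotate(1.0, 0.0, math.pi / 6)        # approximately cos 30, sin 30
```

Starting from (1, 0), as the Sin/Cos recipes above do, the rotated pair delivers cosine and sine once the constant gain (the MPAU's 1/k in register C3) is divided out.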
D. Vectoring
ZERO(V); /* Set register V to zero */
SELTR;
ZERO(ANG); /* Set angle to zero */
ADDA3;
LOADR(My,X1);
LOADR(Mx,X2);
while (C(X2) != 0) {
*SEMY1(2*C(V)-2); /* decrease the exponent of C(X1) by 2*C(V)-2 and store the result into register Y1 */
*SEMY2(2*C(V)-2); /* decrease the exponent of C(X2) by 2*C(V)-2 and store the result into register Y2 */
EXCHY12;
if (C(X2) > 0) {
**ADDA1(X1,Y1);
**SUBA2(X2,Y2);
**ADDA3; }
else {
**SUBA1(X1,Y1);
**ADDA2(X2,Y2);
**SUBA3; }
MOVE(ACC1,X1);
MOVE(ACC2,X2);
INC(V);
}
The value in register ANG is θ = tan⁻¹(y/x). To get R = √(x² + y²), we must
multiply C(X1) by C(C3) and store the result to Data RAM.
E. Natural Logarithm Y+ln(X), X > 0
LOADR(Mx,JO); /* Load operand X to register JO */
SEM; /* Separate the exponent and mantissa of X; E1 holds the exponent and M1 the mantissa */
if (SN==1) /* operand X is negative */
MOVE(-M1,X1); /* C(X1) ← -C(M1) */
else
MOVE(M1,X1); /* C(X1) ← C(M1) */
SELLN; /* Select the ln(1+2^-m̃) table for table lookup */
LOADR(My,X2); /* Load operand Y to register X2 */
while (DEDM < m) { /* DEDM of (1-X1); m is the number of bits of mantissa */
SEMY1(m̃);
*ADDA1(X1,Y1);
*SUBA2(X2,TAB);
**MOVE(ACC1,X1);
**MOVE(ACC2,X2);
}
MOVE(E1,X1);
{ C(X1) = C(X1)*C(C2); } /* multiply C(X1) by C(C2); register C2 contains the constant ln(2) */
MOVE(X2,Y1);
if (SN==1)
SUBA1(X1,Y1);
else
ADDA1(X1,Y1);
MOVE(ACC1,X1); /* Y+ln(X) = Y+ln(m·2^E) = Y+E·ln(2)+ln(m) */
SAVE(X1,M);
For computing ln(X), we can set C(My) = 0 (i.e., Y = 0).
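The loop above is a multiplicative normalization. A hedged software sketch of the same idea: drive the mantissa toward 1 with factors (1 + 2^-k), subtract the matching prestored ln(1 + 2^-k) values, then add E·ln(2) per Equation 2.63.

```python
import math

def ln_cotransform(x, bits=40):
    m, E = math.frexp(x)                   # x = m * 2^E, m in [0.5, 1)
    acc = 0.0
    for k in range(1, bits + 1):
        t = m * (1.0 + 2.0 ** -k)          # a shift-and-add in hardware
        if t <= 1.0:
            m = t
            acc += math.log1p(2.0 ** -k)   # prestored table value
    return E * math.log(2.0) - acc         # ln(m) = -acc once m is near 1
```

Every step is a shift, an add/subtract, or a small-table read; no multiplier is needed anywhere in the loop.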
F. Natural Exponential Function Y·e^X
LOADR(Mx,X1); /* Load operand X into register X1 */
{ C(X1) = C(X1)*C(C1); } /* C(X1) = X·(1/ln(2)); register C1 contains the constant 1/ln(2) */
MOVE(X1,IO); /* Separate the integer and fractional parts of C(X1); the integer and fractional parts are stored in registers I and F, respectively */
if (C(I) > 0) { /* X/ln(2) = I+F. If I > 0, i = I and f = F·ln(2); otherwise i = I-1 and f = (1+F)·ln(2) */
MOVE(I,EO);
MOVE(F,X1); }
else {
MOVE(I-1,EO);
MOVE(1+F,X1); }
{ C(X1) = C(X1)*C(C2); } /* f = F·ln(2) or f = (1+F)·ln(2) */
SELLN;
LOADR(My,X2); /* Load operand Y to register X2 */
while (DEDM < m) { /* DEDM of X1; compute Y·e^f. Y·e^X = Y·e^(i·ln(2)+f) = Y·e^f·2^i */
SEMY2(m̃);
*SUBA1(X1,TAB);
*ADDA2(X2,Y2);
**MOVE(ACC1,X1);
**MOVE(ACC2,X2);
}
MOVE(X2,JO);
EPADD;
RCV;
SAVE(J1,M);
For computing e^X, we set C(My) = 1 (i.e., Y = 1).
G. Inverse Square Root Y/√X, X > 0
LOADR(Mx,JO); /* Load operand X into register JO */
CEP; /* X = m·2^E. If E is even, C(E1) = -E/2 and C(M1) = m; otherwise C(E1) = -(E+1)/2 and C(M1) = m/2 */
MOVE(E1,EO);
MOVE(M1,X1);
LOADR(My,X2); /* Load operand Y into register X2 */
while (DEDM < m) { /* DEDM of 2^-1·(1-X1); compute Y/√C(M1) */
SEMY1(m̃); /* first multiplication of C(X1) by (1+2^-m̃) */
ADDA1(X1,Y1);
MOVE(ACC1,X1);
SEMY1(m̃); /* second multiplication: C(X1)·(1+2^-m̃)² */
ADDA1(X1,Y1);
MOVE(ACC1,X1);
SEMY2(m̃); /* C(X2)·(1+2^-m̃) */
ADDA2(X2,Y2);
MOVE(ACC2,X2);
}
MOVE(X2,JO);
EPADD;
RCV;
SAVE(J1,M);
For computing 1/√X, we can set C(My) = 1 (i.e., Y = 1).
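The loop pattern (two exponent-shifted adds on one channel, one on the other) squares the normalization factors on the X side. A hedged software model, with the exponent split of Equation 2.65 and the register-level detail omitted:

```python
import math

def inv_sqrt_cotransform(x, bits=40):
    m, E = math.frexp(x)                # x = m * 2^E, m in [0.5, 1)
    if E % 2:                           # odd exponent: use m/2 and E+1 (Eq. 2.65)
        m, E = m / 2.0, E + 1
    y = 1.0
    for k in range(1, bits + 1):
        f = 1.0 + 2.0 ** -k
        if m * f * f <= 1.0:
            m *= f * f                  # two shift-adds on the X channel
            y *= f                      # one shift-add on the Y channel
    return math.ldexp(y, -E // 2)       # y -> 1/sqrt(m), scaled by 2^(-E/2)
```

Driving m·(∏f)² toward 1 forces ∏f toward 1/√m, so the Y channel accumulates the answer as a side effect of normalizing the X channel.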
The above commands are used to explain how the MPAU implements the
elementary arithmetic functions. In fact, these commands can be hardwired
such that the MPAU will actually be a RISC arithmetic processor. The data
RAM and program RAM can be used for connecting the MPAUs into a systolic
array (see Figure 3.8), as discussed in Section 5.4.
PE : Multi-Purpose Arithmetic Unit
DM : Data RAM
C : Adder-Tree For Column PEs
R : Adder-Tree For Row PEs
FIGURE 3.8
The Systolic MPAU Array
Each MPAU uses the Data RAM as its I/O port and has its own function
controller (program RAM). The reduced instructions available to each MPAU are:
(1) CSOP(Ms,MD,N) /* Complex SOP; Ms: begin address of the data array; MD: the destination address for storing the result; N = # of data (≤ 6) */
(2) RSOP(Ms,MD,N) /* Real SOP */
(3) ADD(Ms1,Ms2,MD) /* C(MD) = C(Ms1)+C(Ms2) */
(4) SUB(Ms1,Ms2,MD) /* C(MD) = C(Ms1)-C(Ms2) */
(5) RCMULT(Ms1,Ms2,MD) /* Calculate the real part of C(MD) = C(Ms1)*C(Ms2), where C(MD), C(Ms1), and C(Ms2) are complex numbers */
(6) ICMULT(Ms1,Ms2,MD) /* Calculate the imaginary part of the complex multiplication */
(7) MULT(Ms1,Ms2,MD) /* Real multiplication */
(8) DIV(Ms1,Ms2,MD) /* Real division, C(MD) = C(Ms1)/C(Ms2) */
(9) ROT(Ms1,Ms2,Ms3,MD1,MD2) /* Rotation. If C(Ms1) = 1, C(Ms2) = 0, and C(Ms3) = λ (angle in radians), then C(MD1) = cos(λ) and C(MD2) = sin(λ). If C(Ms1) = X, C(Ms2) = Y, and C(Ms3) = λ, then C(MD1) = R*cos(θ+λ) and C(MD2) = R*sin(θ+λ), where X = R*cos(θ) and Y = R*sin(θ) */
(10) VET(Ms1,Ms2,MD1,MD2) /* Vectoring. C(MD1) = √(C(Ms1)² + C(Ms2)²) and C(MD2) = tan⁻¹(C(Ms2)/C(Ms1)) */
(11) LN(Ms1,Ms2,MD) /* C(MD) = C(Ms2)+ln(C(Ms1)) */
(12) EXP(Ms1,Ms2,MD) /* C(MD) = C(Ms2)·e^C(Ms1) */
(13) ISQRT(Ms1,Ms2,MD) /* C(MD) = C(Ms2)/√C(Ms1) */
(14) LOADT(Ms,N) /* Load a table of length N from Data RAM to the lookup table RAM (N ≤ 64) */
(15) MOVE(Ms,MD,N) /* Move data (length N) beginning at address Ms to address MD */
(16) ADDT(Ms,MD) /* Load the data C(Ms) to the adder-tree, get the result, and store it into MD */
For such an architecture, if the coefficients are known a priori, then each
MPAU can be used as a SOP processor to compute a sub-SOP. The adder
trees can then be used to sum together the results of the sub-SOPs. For use as
a general-purpose computing system, the MPAUs can be programmed as
independent arithmetic units for parallel computation or as a systolic array for
computing FOR/DO loops. The methods for mapping sequential FOR/DO-loop
algorithms, which are often used to model DSP and linear algebra
applications, are discussed in the next two chapters.
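As a concrete illustration, the sub-SOP scheme above can be modeled in a few lines of plain Python. This is only a sketch, not the MPAU microcode; the helper names `rsop`, `adder_tree`, and `long_sop` are invented here, and the six-term limit mirrors the N ≤ 6 restriction of the CSOP/RSOP instructions.

```python
# A plain-Python model (not the MPAU microcode) of splitting a long
# sum-of-products into sub-SOPs of at most six terms, one per MPAU,
# with the partial results combined by an adder tree.
def rsop(coeffs, data):
    """One MPAU executing RSOP on at most 6 coefficient/data pairs."""
    assert len(coeffs) <= 6 and len(coeffs) == len(data)
    return sum(c * d for c, d in zip(coeffs, data))

def adder_tree(partials):
    """Pairwise reduction, as a row or column adder tree would do it."""
    while len(partials) > 1:
        partials = [sum(partials[i:i + 2]) for i in range(0, len(partials), 2)]
    return partials[0]

def long_sop(coeffs, data, chunk=6):
    """Distribute a long SOP over several MPAUs, then sum the pieces."""
    partials = [rsop(coeffs[i:i + chunk], data[i:i + chunk])
                for i in range(0, len(coeffs), chunk)]
    return adder_tree(partials)

print(long_sop(list(range(1, 13)), [1.0] * 12))   # 78.0 = 1 + 2 + ... + 12
```

A twelve-term SOP thus occupies two MPAUs working in parallel, with one adder-tree pass producing the final sum.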
CHAPTER 4
SYSTOLIC ARCHITECTURE
Fields such as signal and image processing, communications, computer
graphics, matrix operations, physical simulation, and others require an
enormous number of computations. Many of these applications must be executed
in real time, further increasing the speed and throughput demands. These
increasing demands indicate the need to develop high-speed, low-cost, high-
performance numeric computers having low packaging volumes. High-speed
arithmetic processors, especially those designed using multi-purpose arithmetic
units (MPAUs), are potentially capable of meeting these requirements. Another
technology, namely high-speed parallel architectures, should also be considered.
Only recently have very large scale integrated circuit (VLSI) techniques provided
the economic justification for this study. However, large parallel machines have
a fundamental limitation, namely, the high cost of communication. For example,
it is easy to integrate twenty 8-bit adders into one VLSI chip, but infeasible to
have over 20*3*8 = 480 I/O pins on one package.
Communication is expensive in terms of real costs, input-output (I/O)
requirements (measured in pins), and the attendant problem of long
communication links. Furthermore, delays associated with off-chip
communications introduce another bottleneck. All of these factors favor VLSI
architectures with a minimum amount of communication, especially global
communication. Therefore, the design of a parallel system should be modular,
have regular data and control paths, and most importantly contain feasible I/O
and localized communication. A special type of parallel architecture that has
particular utility to signal processing, and which can be efficiently implemented
in VLSI, is the systolic array.
4.1 Systolic Arrays
The systolic array was originally proposed by Kung and Leiserson in 1978
[KUN78]. The term "systolic array" stems from the analogy that the data flows
from computer memory in a rhythmic fashion, passing through the processor
elements (PEs) before it returns to memory, similar to blood circulating from
and to the heart [KUN82]. Formally, a systolic array consists of a set of
identical interconnected computational cells, each capable of performing some
simple operations. Communication of information from one cell to another is
accomplished along tightly coupled, localized (nearest-neighbor) data links.
Furthermore, I/O operations are restricted to the array boundary.
A systolic array assumes that once data is removed from memory, it is
used by a number of PEs before returning to memory. This not only allows for
localized communication among the PEs, but also reduces the I/O requirements
of the system. As such, the systolic array can alleviate the imbalance (the
inability of the I/O system to provide data at a sufficient rate to the arithmetic
unit) by balancing I/O speed with computational speed.
Similar to an assembly line, each PE of the systolic array performs a small
part of a much larger task. For example, a 3x3 matrix-matrix multiplication
algorithm C = A*B can be implemented using a systolic array as shown in Figure
4.1 (Design 1). In this example, the algorithm executes by passing the rows of
A from left to right through the array while simultaneously passing the columns
of B from the top to the bottom of the array. Elements of the resulting matrix,
C, remain in place in the PEs, which accumulate the results. Notice that only local
communication between a PE and its nearest neighbors is needed. The key to
FIGURE 4.1
3x3 Matrix-Matrix Multiplication (Design 1)
[Figure: the skewed rows of A enter the 3x3 PE array from the left and the
skewed columns of B from the top; the elements of C = A*B accumulate in
place in the PEs.]
the systolic array is that the data elements are used multiple times before
returning to memory. Each row and column is operated on by each of three PEs
before the calculation is complete.
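The timing of Design 1 can be checked with a small cycle-level simulation. This is a hedged sketch in plain Python: the input skew is inferred from Figure 4.1, under which the operands A[i][k] and B[k][j] meet at PE(i,j) exactly at cycle t = i + j + k, and the whole product finishes in 3N - 2 cycles.

```python
# A cycle-level sketch of Design 1 (skew convention inferred from
# Figure 4.1, so treat it as an assumption).  C stays resident in
# PE(i,j); A[i][k] and B[k][j] arrive there together at cycle i+j+k.
def systolic_matmul_design1(A, B):
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for t in range(3 * n - 2):          # the product takes 3N-2 cycles
        for i in range(n):
            for j in range(n):
                k = t - i - j           # which operand pair meets here now
                if 0 <= k < n:
                    C[i][j] += A[i][k] * B[k][j]
    return C

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]   # identity, so C should equal A
print(systolic_matmul_design1(A, B))
```

Each PE touches each row of A and column of B exactly once, which is the data reuse the text describes.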
Most DSP algorithms can be implemented in a systolic array in various
ways. The above example can also be implemented as shown in Figure 4.2
(Design 2). In this design, all three matrices move through the array: the A
matrix from left to right, the B matrix from right to left, and the C matrix
from top to bottom. Compared to Figure 4.1 (Design 1), this design exhibits
some different characteristics. Most notable is that the number of PEs needed
is increased from 9 to 15. This design,
FIGURE 4.2
3x3 Matrix-Matrix Multiplication (Design 2)
[Figure: the A data enter the 15-PE array from the left, the B data from the
right, and the skewed diagonals of C = A*B move from top to bottom, exiting
at the bottom of the array.]
however, offers a number of advantages which may overcome the cost
differential of the added hardware. First, the results move through the array
and thus can be extracted as they exit the bottom of the array. Another
advantage is that each processor is used only every other cycle (utilization
≤ 50%), thus allowing two different matrix multiplications (say, C = A*B and
Z = X*Y) to be executed concurrently, as shown in Figure 4.3. Note that in only
one extra cycle, twice the effective number of matrix multiplies can be
performed. Design 2 can thus offer increased PE utilization and throughput
when more than one matrix multiply is needed. Thus, the mapping of
algorithms onto a systolic array is nonunique
FIGURE 4.3
Two 3x3 Matrix-Matrix Multiplications (Design 2)
[Figure: the data streams of C = A*B and Z = X*Y interleaved through the
same 15-PE array, with each PE serving the two products on alternating
cycles.]
and the designer has a great deal of flexibility in choosing which design to
implement.
4.2 Mapping Algorithms
Mapping an algorithm onto a systolic array consists of completely specifying
the hardware and I/O requirements of the target systolic array. The hardware
requirements include the number of PEs, the configuration of the PEs (linear,
rectangular, triangular, etc.), and the communication paths between the PEs.
The key is to position and order the data flow through the array properly. The
I/O requirements include positioning the data in the I/O buffers found at the
periphery of the array such that they can flow through the array in the correct
order and at the right time.
The following parameters specify the mapping of an algorithm onto a
systolic array:
1. Number of processors
2. Execution time
3. Velocity vectors (for each variable)
4. Distribution vectors (for each variable)
The number of processors depends upon the algorithm mapping rule. The
number of PEs needed can, however, be much larger than the number of PEs
physically available in the systolic system. In such a case, a partitioning scheme
is needed to map the algorithm onto the actually available PEs (see Section 4.4
for details).
Execution time is also dependent on the specific implementation chosen.
In the strictest sense, this usually excludes the time to "preload" or "flush" the
systolic array. Determining execution time becomes increasingly complicated
when more than one time dimension is necessary. With multiple time
dimensions, the data sets move in different directions depending on the
increment sequence of the time vectors, and the array timing control becomes
piecewise continuous. Piecewise-continuous time vectors prevent the current
mapping techniques from utilizing all of the possible pipelining in an algorithm
(see Section 4.6 for details).
Velocity vectors represent the direction and the distance traveled by a data
element during each clock cycle and are constant for each element of a data set.
There is one velocity vector for each time dimension, and each contains one
element for each spatial dimension of the systolic array. Thus, if q represents
the number of spatial dimensions of the systolic array and m represents the
number of time dimensions in the systolic implementation (where q + m is the
number of index dimensions of the algorithm), then there will be m velocity
vectors, each of dimension q x 1.
The distribution vectors determine the layout of the data elements in each
data set. For each index (or subscript) of a data set, there is one distribution
vector, which contains q elements (q is the spatial dimension of the systolic
array, as above). For example, A(I,J) of matrix-matrix multiplication has two
distribution vectors; one is the vector from A(I,J) to A(I+1,J) and the other is
the vector from A(I,J) to A(I,J+1). Thus, given the position of the first element
in a data set, the entire data set can be configured using the distribution
vectors.
In a sequential algorithm, or a recurrence equation, each calculation can be
uniquely defined by either the indices of the defining FOR/DO loop or the
subscripts in a recurrence equation. This representation facilitates the
visualization of the problem. An index space is then defined as a set of vectors
spanning the dimensions of the algorithm. For example, two-dimensional
matrix-matrix multiplication has three loops; thus its index space is J³ (Figure
4.4), and each algorithm variable y is indexed by an index vector v_y ∈ J³.
Each valid vector in the index space represents one (or more) calculations that
are performed in the algorithm kernel. Thus, for mapping algorithms onto
systolic arrays, the index space should be transformed to the systolic array
space, consisting of time and spatial coordinates. As such, the newly
transformed space can be used to indicate when and where the computations
will occur.
Many techniques exist for mapping algorithms onto systolic arrays. The
earliest technique is the heuristic method. This method consists of tracing the
[Figure: the 3x3x3 lattice of index points (i,j,k) in the index space J³, with
the positions of the A, B, and C data marked along the coordinate axes. The
algorithm kernel is:]
(** initialization of C[i,j] is ignored **)
for i = 0 to 2 do
  for j = 0 to 2 do
    for k = 0 to 2 do
      C[i,j] = C[i,j] + A[i,k]*B[k,j]
    end k;
  end j;
end i;
FIGURE 4.4
Index Set Of A 3x3 Matrix-Matrix Multiply Algorithm
execution of an algorithm and then determining by hand the positions of the
data variables and the number of PEs needed. Most other methods use a
representation of the data dependencies of an algorithm (see the following
section for further details) and transform these dependencies into systolic array
spatial and time coordinates. One group of methods uses geometric techniques
based on the use of data-flow graphs and other similar structures [CAP86],
[HUA86], [QUI84], [KUN85]. Figure 4.5 shows the geometric mapping of a
convolution algorithm after [CAP83] (where the length of both X and Y is
assumed to be three).

FIGURE 4.5
Geometric Mapping Of A Convolution [CAP83]
[Figure: (A) the index-space graph, with I as the time coordinate; (B) the
resulting three-processor architecture.]

This geometric mapping technique consists of graphing a representative
number of points in the index set to the array. These points are then connected
and transformed into a multiprocessor design. Note that in Figure 4.5A, the
solid lines represent constant values of X, the dashed lines represent constant
values of Y, and the dotted lines represent constant values of W. Now, if we
simply replace the I coordinate axis by a time coordinate axis and allow the J
coordinate to become the spatial coordinate of the systolic array, we obtain the
design shown in Figure 4.5B. This design requires three processors (the
number of distinct J coordinates), and the values of Y remain stationary in the
PEs because they are constant over time. The values of W move from top to
bottom through the array, and the values of X move from bottom to top at
twice the frequency of the values of W.
Geometric methods usually consist of some form of a directed graph which
is derived from an execution trace of the algorithm. The nodes most often
represent points in the index space of the algorithm, and the directed graph is
translated, rotated, or projected into a graph that represents the systolic
algorithm. LISP is well suited for this type of mapping because graphs are
easily represented in LISP via pointers and lists [HUA86], [QUI84]. The main
problem with geometric techniques is that they are often difficult to implement
via computer for higher-dimension problems. The algebraic techniques
described below can be more easily implemented on computers and are more
efficient.
4.3 Algebraic Mapping Techniques
The major operations of signal processing and linear algebra applications
can be modeled in matrix-matrix, matrix-vector, or sum-of-products (SOP)
form. In a uniprocessor environment, the implementation is highly reliant on
nested computational loops. For example, the product of two NxN matrices, A
and B, producing C = A*B, is given by
for i = 1 to N
  for j = 1 to N
    for k = 1 to N
      S1 : C[i,j] = C[i,j] + A[i,k]*B[k,j]
    end k
  end j
end i                                                      (4.1)
Here, the initialization of C[i,j] is ignored. For future reference, S1 will be
called a statement, and the three-tuple v = (i,j,k) will be called an index tuple.
The presented sequential algorithm is said to run in O(N³) time over its three
independent time indices. To accelerate this process, a multiprocessor systolic
array can be used. Algebraic mapping techniques are similar to the geometric
techniques (Section 4.2) in that both normally extract the data dependencies
of the algorithm first and then base the systolic implementation on this
information [MOL83], [LI85], [MOL82], [FOR85], [MOL86], [SOU86], and
[DOW87]. To use the algebraic mapping technique, the following information
is required:
1) the algorithm index set,
2) the computations performed at each index point,
3) the data dependencies, which ultimately dictate the algorithm
communication requirements, and
4) the algorithm I/O variables.
After extracting the above information from an algorithm, a transformation
function can then be found to map the algorithm onto the systolic array. The
transformation function transforms the index space of the algorithm to the
systolic array space while preserving the same computations. In the following
sections, we use the symbol I to denote the set of nonnegative integers and Z to
denote the set of all integers. The n-th Cartesian powers of I and Z are denoted
Iⁿ and Zⁿ, respectively.
4.3.1 Algorithm Model
In order to map an algorithm onto a systolic array, it is convenient to
define an algorithm model. An algorithm G over an algebraic structure A is a
5-tuple G = (Jⁿ, C, D, X, Y) where
Jⁿ is the finite index set of G, Jⁿ ⊆ Iⁿ;
C is a set of triples (v, y_g, y_u) where v ∈ Jⁿ, y_g is a variable, and y_u is a
term built from operations of A and variables ranging over A. We
call y_g the variable generated at v, and any variable appearing in y_u is
called a used variable;
D is a set of triples (v, y, d) where v ∈ Jⁿ, y is a variable, and d is an
element of Zⁿ;
X is the set of input variables of G;
Y is the set of output variables of G.
There are three types of dependencies in D:
(i) Input dependence: (v, y, d) is an input dependence if y ∈ X and y is an
operand of y_u in computation (v, y_g, y_u); by definition d = 0.
(ii) Self-dependence: (v, y, d) is a self-dependence if y is one of the
operands of y_u in computation (v, y_g, y_u); by definition d = 0.
(iii) Internal dependence: (v, y, d) is an internal dependence if y, the output
of a statement S_i at some index tuple v*, is supplied to a statement
at some index tuple v; by definition d = v - v* (v, v* ∈ Jⁿ).
It is convenient to represent the dependencies in D as a matrix D = [D⁰ D']
where D⁰ is the submatrix of D containing all input and self-dependencies, and
D' is the submatrix of internal dependencies. Every column of D is the last
element d of some triple (v, y, d) and is labeled d_y. If the dependencies are
valid at almost every point of the index set, the labels of the columns of D need
not be shown and, for practical purposes, the self-dependencies can be ignored.
For example, the dependencies of A and B in the above matrix-matrix multiply
example can be taken as input dependencies, resulting in the architecture shown
in Figure 4.6. In order to reduce the I/O pin count, this architecture uses either
bus connection or global communication. On the other hand, the sequential
matrix-matrix multiply example can be interpreted as a systolic process by
defining statement S1 using the following conventions:
FIGURE 4.6
Global Bus-Connection Architecture
(1) define the indices explicitly displayed in the sequential code to be
explicit variables,
(2) define the indices not explicitly displayed to be implicit (e.g., C[i,j] is
an explicit function of (i,j) and an implicit function of k),
(3) rewrite the statements of the sequential code using explicit variables as
subscripts and implicit variables as superscripts,
(4) apply rule (3) to the left-hand side of the assignment statement
directly, but decrease the implicit indices by one on the right-hand side.
This establishes that the right-hand-side data sets were defined in the
previous systolic cycle (precedence relation). For example,

    C^(k)[i,j] = C^(k-1)[i,j] + A^(j-1)[i,k] * B^(i-1)[k,j]
    B^(i)[k,j] = B^(i-1)[k,j]    (constant B, from the previous iteration over i)
    A^(j)[i,k] = A^(j-1)[i,k]    (constant A, from the previous iteration over j)   (4.2)

The index set for this example is J³ = { (i,j,k) : 1 ≤ i,j,k ≤ N }. The
dependence vectors are

    d1 = [0, 0, 1]^t    from { C^(k)[i,j] = f(C^(k-1)[i,j]) }
    d2 = [0, 1, 0]^t    from { A^(j)[i,k] = f(A^(j-1)[i,k]) }
    d3 = [1, 0, 0]^t    from { B^(i)[k,j] = f(B^(i-1)[k,j]) }
    (t = transpose)                                            (4.3)

which produces a dependence matrix D (where the positional dependency of the
C, A, B and i, j, k data is explicitly exhibited for clarity) given by

                           C  A  B
                         [ 0  0  1 ]  i
    D = [d1, d2, d3]  =  [ 0  1  0 ]  j
                         [ 1  0  0 ]  k                        (4.4)
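The dependence matrix can be written down directly in code, and a candidate timing hyperplane checked against the precedence requirement described below (π·d > 0 for every dependence). This is a minimal sketch in plain Python; the helper name `valid_timing` is ours, not the dissertation's.

```python
# The dependence matrix D of Equation 4.4, stored column-wise, and a
# check that a timing row pi satisfies pi . d > 0 for every dependence.
D = [(0, 0, 1),   # d1: C at (i,j,k) uses C from the previous k
     (0, 1, 0),   # d2: A reused across the j loop
     (1, 0, 0)]   # d3: B reused across the i loop

def valid_timing(pi, deps):
    """True if the hyperplane pi preserves every precedence in deps."""
    return all(sum(p * x for p, x in zip(pi, d)) > 0 for d in deps)

print(valid_timing((1, 1, 1), D))   # True: a usable first row for T
print(valid_timing((1, 0, 0), D))   # False: d1 and d2 get zero delay
```

Only hyperplanes that delay every dependence by at least one cycle are admissible as the first row of the transformation T.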
So far, the algorithm model is only a static model which does not indicate the
execution order on the index set. Though the index points in the above example
are ordered lexicographically, this is an artificial ordering and can be modified
for the purpose of extracting parallelism. Next, the execution order must be
adjusted to guarantee the correctness of the computation. The execution of an
algorithm G = (Jⁿ, C, D, X, Y) is described by
1) the specification of a partial ordering ∝ on Jⁿ (called the execution
ordering) such that for all (v, y, d) ∈ D' we have 0 ∝ d (i.e., d larger
than zero in the sense of ∝);
2) the execution rule: until all computations in C have been performed,
execute (v, y_g, y_u) for all v* ∝ v for which (v*, y_g, y_u) have terminated.
In the following text, d = v - v* > 0 means that the computations indexed by v*
must be performed prior to those indexed by v. Mapping the algorithm onto the
systolic array can then be viewed as changing the features of an algorithm while
preserving its computational equivalence. That is, two algorithms
G = (Jⁿ, C, D, X, Y) and G* = (J*ⁿ, C*, D*, X, Y) are said to be equivalent if
and only if
1) algorithm G* is input-output equivalent to G;
2) the index set of G* is the transformed index set of G: J*ⁿ = T(Jⁿ),
where T is a bijection;
3) to any operation of G there corresponds an identical operation in G*,
and vice versa;
4) the dependencies of G* are the transformed dependencies of G: D* = T(D).
In the algebraic mapping method, a transformation function T must be found
such that the first coordinate of the transformed index set represents the
execution order, leaving the rest of the index coordinates to be selected by the
algorithm designer to meet specific systolic array communication requirements.
The following section presents how a transformation T can be selected such
that the transformed algorithm can be mapped onto a systolic array.
4.3.2 Transformation Function
A mapping which transforms an algorithm G into an algorithm G* is
defined as

    T = [ π ]
        [ S ]                                              (4.5)

where the mappings π and S are defined as π : Jⁿ → J and S : Jⁿ → Jⁿ⁻¹.
Here, only a linear transformation T (where T ∈ Zⁿˣⁿ) is considered. Thus,
for a given dependence matrix D, there exists a nonsingular matrix T which
maps the sequential algorithm into concurrent processes. The transformation T
can therefore be more precisely given as

    T = [ t_11  ...  t_1n ]   }  π : Jⁿ → J
        [ t_21  ...  t_2n ]   }
        [  :          :   ]   }  S : Jⁿ → Jⁿ⁻¹
        [ t_n1  ...  t_nn ]   }                            (4.6)
The mapping T will produce a new array dependence matrix D* = TD. The first
row of T, denoted π, produces timing information about the array process (the
transformed algorithm). It is again required that π·d_i > 0 for all d_i ∈ D so as
to ensure that the precedence rule is preserved. The resulting first row of D*
contains the delay information existing within the systolic array. Thus, a
computation indexed by v ∈ Jⁿ in the original algorithm will be processed at
time t = π·v. Moreover, the total running time of the new algorithm is usually
T = max(π·v) - min(π·v) + 1. This assumes a unitary time increment. In
general, the time increment may not be unitary; it is given by the smallest
transformed dependence (i.e., minimum π·d_i). Therefore, the execution time is
the number of hyperplanes π sweeping the index space Jⁿ and is given by the
ratio

    T_TOTAL = { max[π·(v1 - v2)] + 1 } / min[π·d_i]        (4.7)

for any v1, v2 ∈ Jⁿ and d_i ∈ D.
Finally, the second transformation submatrix S captures the spatial
coordinate information. The transformation submatrix S should be selected
such that the transformed dependencies are mapped onto a systolic array
modeled as (Jⁿ⁻¹, P), where Jⁿ⁻¹ ⊆ Zⁿ⁻¹ is the spatial index set of the systolic
array and P ∈ Z⁽ⁿ⁻¹⁾ˣʳ is a matrix of interconnection primitives defined as

    P = [P1, P2, ..., Pr]                                  (4.8)

where P_j is a column vector indicating a unique direction of a communication
link and r is the number of communication paths between PEs (three examples
of P are shown in Figure 4.7). In other words, the selected transformation
submatrix S should satisfy

    S D = P K                                              (4.9)

where the matrix K indicates the utilization of the primitive interconnections in
matrix P. The matrix K = [k_ji] is such that

    k_ji ≥ 0                                               (4.10)

    Σ_j k_ji ≤ π·d_i                                       (4.11)

Equation 4.10 simply requires that all elements of the K matrix be nonnegative,
and Equation 4.11 requires that the communication of data associated with
dependence d_i must be performed using some primitives P_j exactly Σ_j k_ji
times.
FIGURE 4.7
Interconnection Matrix For A Systolic Array
[Figure: three example interconnection matrices. (A) a two-dimensional array
with nine primitives P1 ... P9; (B) an array with four primitives P1 ... P4;
(C) an array with three primitives,

        P1 P2 P3
    P = [ 0  1  0 ]
        [ 0  0  1 ]

which is the matrix used in the example below.]
For example, let us select the transformation function

    T = [ 1  1  1 ]
        [ 0  1  0 ]
        [ 0  0  1 ]                                        (4.12)

for the above matrix-matrix multiplication with dependence matrix

                           C  A  B
                         [ 0  0  1 ]  i
    D = [d1, d2, d3]  =  [ 0  1  0 ]  j
                         [ 1  0  0 ]  k                    (4.13)

and assume

         P1 P2 P3
    P = [ 0  1  0 ]
        [ 0  0  1 ]                                        (4.14)

as shown in Figure 4.7C. Then

    π D = [1, 1, 1]                                        (4.15)

and

          d1 d2 d3
    S D = [ 0  1  0 ] = P K,   with   K = [ 0  0  1 ]
          [ 1  0  0 ]                     [ 0  1  0 ]
                                          [ 1  0  0 ]      (4.16)

This means that the implementation of the data communication associated with
dependence d1 requires only the single primitive P3; likewise, only P2 is
required for d2, and only P1 for d3.
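The worked example can be verified numerically. This sketch (plain Python, not the dissertation's code) simply checks that each transformed dependence S·d_i coincides with one of the interconnection primitives of Figure 4.7C.

```python
# Numerical verification of Equations 4.12-4.16: with S the lower two
# rows of T, each transformed dependence S*d must coincide with one
# column of the interconnection matrix P.
def matvec(M, v):
    return tuple(sum(m * x for m, x in zip(row, v)) for row in M)

S = [(0, 1, 0),
     (0, 0, 1)]                             # spatial rows of T
D = [(0, 0, 1), (0, 1, 0), (1, 0, 0)]       # d1 (C), d2 (A), d3 (B)
P = [(0, 0), (1, 0), (0, 1)]                # primitives P1, P2, P3

for name, d in zip("CAB", D):
    sd = matvec(S, d)
    print(name, sd, sd in P)                # every S*d is some P_j
```

Here d1 maps to P3, d2 to P2, and d3 to the null primitive P1, exactly as the text states.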
It is possible that some interconnection primitives will not be used. These
correspond to rows of the matrix K with all-zero elements. Most often, many
transformation submatrices S can be found, and each transformation leads to a
different array. After selecting the transformation S, the corresponding
(n-1)xn submatrix of D* displays the systolic interprocessor connections or
communication paths. As a result, the specification of T provides the designer
with a wealth of information about the physical array, its wiring
(communication), and its data flow. In such a way, the velocity vector of the
variable y with dependence d_i is determined from

    T d_i = [ π d_i ]
            [ S d_i ] ,     V_i = S d_i / (π d_i)          (4.17)
and the distribution vector of y, for an index displacement e between adjacent
elements of the data set, follows as the spatial separation observed at a
common time instant:

    Δ_y = S e - (π e) V_i                                  (4.18)
All of these systolic array/algorithm synergism processes are summarized in
Figure 4.8.
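The velocity computation can be sketched numerically for the running example. Note that the formula V = S·d / (π·d) is our reading of Equation 4.17 (the printed equation is damaged), so treat it as an assumption.

```python
# Velocities of the three data sets under the example transformation.
# The formula V = S*d / (pi . d) is an assumed reading of Equation 4.17.
from fractions import Fraction

pi = (1, 1, 1)
S = [(0, 1, 0), (0, 0, 1)]
deps = {"C": (0, 0, 1), "A": (0, 1, 0), "B": (1, 0, 0)}

def velocity(d):
    t = sum(p * x for p, x in zip(pi, d))               # pi . d, in cycles
    return tuple(Fraction(sum(s * x for s, x in zip(row, d)), t)
                 for row in S)

for name, d in deps.items():
    print(name, velocity(d))
# C moves one PE per cycle along +J2, A one PE per cycle along +J1,
# and B (velocity (0, 0)) stays in place, matching Section 4.3.3.
```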
4.3.3 Example
For the matrix-matrix multiply example, the sequential algorithm runs in
O(N³) time. By committing N² processors to this problem, it would be
expected that the execution time would be reduced N²-fold. As shown in
Section 4.3.1, the dependence vectors are

    d1 = [0, 0, 1]^t
    d2 = [0, 1, 0]^t
    d3 = [1, 0, 0]^t                                       (4.19)

which produce the dependence matrix D given by

                           C  A  B
                         [ 0  0  1 ]  i
    D = [d1, d2, d3]  =  [ 0  1  0 ]  j
                         [ 1  0  0 ]  k                    (4.20)

Now, a nonsingular transformation matrix T must be found which maps each
column vector of the dependence matrix to a vector whose first element is
greater than zero. For this example, the D matrix is a 3x3 exchange matrix
[GOL83], so the first row π of T can be chosen as π = [1, 1, 1], which makes
the first row of the transformed dependence matrix D* equal to [1, 1, 1] (i.e.,
all elements are
FIGURE 4.8
Systolic Array/Algorithm Synergism
[Figure: the transformation T maps the algorithm space (the index set
v = (i,j,k,...) of the nested FOR loops) to the array space (array time
t = π·v, array coordinates V = S·v). The resulting information includes: the
algorithm index dependence D; the algorithm-to-array mapping T; the array
interprocessor dependence D* = TD; the start time T0 = min(π·v); the stop
time Tf = max(π·v); the computational latency Tf - T0 + 1; the array location
of the algorithm index set; and the algorithm activity at clock t and array
coordinate V.]
greater than zero). There are numerous choices of S which all satisfy the
requirement SD = PK. Different choices of S will result in distinct
communication paths and variant data flows. For instance, selecting

    S = [ 0  1  0 ]
        [ 0  0  1 ]                                        (4.21)

it follows that

         [ 1  1  1 ]                            C  A  B
    T1 = [ 0  1  0 ]   and   D1* = T1 D  =  [ 1  1  1 ]  t
         [ 0  0  1 ]                        [ 0  1  0 ]  J1
                                            [ 1  0  0 ]  J2   (4.22)

The first row of D1* indicates the number of time units for each cycle. The
second two rows indicate the variables' directions of motion in the systolic
array. In this implementation, the first column of D1* is

    [ 1 ]  t
    [ 0 ]  J1
    [ 1 ]  J2                                              (4.23)

which means the C data set will move in the positive J2 direction (one PE per
unit cycle). The second and third columns indicate, respectively, that the A
data set will move in the positive J1 direction (one PE per unit cycle) and that
the B data set will remain in place in the PEs. The start-execution and
stop-execution times, denoted T0 and Tf, are given by

    T0 = min[π·v] = π [1, 1, 1]^t = 3                      (4.24)

and

    Tf = max[π·v] = π [N, N, N]^t = 3N                     (4.25)

The total execution time is then

    T_TOTAL = Tf - T0 + 1 = 3N - 2                         (4.26)

and the algorithm will run in O(N) time, as expected.
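These timing results can be reproduced by sweeping the hyperplane π over the index set, as the following sketch (plain Python, for N = 3) does.

```python
# Start, stop, and total execution times of the transformed 3x3 example
# (Equations 4.24-4.26), found by sweeping pi = [1,1,1] over the index set.
from itertools import product

N = 3
pi = (1, 1, 1)
times = [sum(p * x for p, x in zip(pi, v))
         for v in product(range(1, N + 1), repeat=3)]
T0, Tf = min(times), max(times)
print(T0, Tf, Tf - T0 + 1)   # 3 9 7, i.e. T0 = 3, Tf = 3N, total 3N-2
```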
Once the transformation matrix T is found, the index set can be
transformed by using

    T v = [ t, J1, J2 ]^t                                  (4.27)

where v is one index vector from the index set, consisting of the values (i,j,k),
and the transformed index vector tells the time and location at which the
computation associated with that index vector is executed. Assume N = 3 for
this example; then from

        [ i ]   [ i + j + k ]   [ t  ]
    T   [ j ] = [     j     ] = [ J1 ] ;   1 ≤ i, j, k ≤ 3
        [ k ]   [     k     ]   [ J2 ]                     (4.28)

we know that the ranges of the time and spatial indices are 3 ≤ t ≤ 9,
1 ≤ J1 ≤ 3, and 1 ≤ J2 ≤ 3. These provide the following information: (1) the
first computation occurs at t = 3 and the last computation at t = 9, so the total
execution time is 9 - 3 + 1 = 7; and (2) a total of nine PEs are required for
this implementation, and they are numbered by { (J1, J2) | 1 ≤ J1, J2 ≤ 3 }.
The computation at the index values (i,j,k) = (1,2,3), corresponding to the
calculation C[1,2] = C[1,2] + A[1,3]*B[3,2], is transformed as follows:

    [ 1  1  1 ] [ 1 ]   [ 6 ]
    [ 0  1  0 ] [ 2 ] = [ 2 ]
    [ 0  0  1 ] [ 3 ]   [ 3 ]                              (4.29)

This indicates that the above computation will be executed at cycle 6, in
processor (2,3). Because the transformation matrix T is nonsingular, we can
always find its inverse transformation matrix, T⁻¹. For this instance, it is

    T⁻¹ = [ 1 -1 -1 ]
          [ 0  1  0 ]
          [ 0  0  1 ]                                      (4.30)

The algorithm index of the activity happening at clock t and spatial coordinates
J1 and J2 can then be derived in the reverse order using the inverse
transformation matrix T⁻¹, as follows
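As a quick numerical sketch of this reverse derivation (plain Python, not the dissertation's notation): applying the inverse transformation of Equation 4.30 to the array-space point (t, J1, J2) = (6, 2, 3) of Equation 4.29 recovers the algorithm index (i, j, k) = (1, 2, 3).

```python
# Recovering the algorithm index from array time and coordinates via
# the inverse transformation T^-1 of Equation 4.30.
T_inv = [(1, -1, -1),
         (0,  1,  0),
         (0,  0,  1)]

def apply_t(M, v):
    return tuple(sum(m * x for m, x in zip(row, v)) for row in M)

print(apply_t(T_inv, (6, 2, 3)))   # (1, 2, 3)
```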