










A SYSTOLIC DISTRIBUTED ARITHMETIC COMPUTING MACHINE FOR
DIGITAL SIGNAL PROCESSING AND LINEAR ALGEBRA APPLICATIONS










By

GIN-KOU MA


A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN
PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF DOCTOR OF PHILOSOPHY






UNIVERSITY OF FLORIDA


1989















To my parents,
whose love and
unwavering support have never diminished
















ACKNOWLEDGMENTS


I would like to take this opportunity to thank my advisor Dr. Fred J. Taylor

who has given me the financial support and an excellent learning environment

for the years leading to my dissertation. In these years, I have had the great

pleasure of working for, and with, him. He has not only been an advisor and

mentor, but a friend as well. He has taught me how to learn from failures and

from successes. His favor is kept in my heart, forever.

I would also like to thank my supervisory committee members, Dr. J.

Staudhammer, Dr. Y.C. Chow, Dr. J.C. Principe, and Dr. H. Lam. They of-

fered their valued wisdom on many occasions. Their knowledge and insight

have kept me from going too far astray in the technical literature.

Special thanks go to Monica Murphy; her help has been too great to only

say thanks. Thanks also go to Michael F. Griffin, Eric Dowling, Mike Sousa,

Neil Euliano, Chintana Griffin, Preeti Rao, and Rom-Shen Kao.

I would, especially, like to thank Sheue-Jen; without her support and pa-

tience, this dissertation would never have been completed.

Finally, I would like to thank my parents, who have given me all they have.


- ii -










TABLE OF CONTENTS


ACKNOWLEDGMENTS ........................................................ iii

LIST OF TABLES .......................................................... vi

LIST OF FIGURES ........................................................ vii

ABSTRACT ................................................................ ix

CHAPTER 1 INTRODUCTION ................................................... 1

1.1 DSP System Architecture
1.2 High-Performance DSP Processors
1.3 System Integration
1.4 Method Of Study

CHAPTER 2 FAST POLICIES FOR ELEMENTARY
    ARITHMETIC FUNCTIONS COMPUTATION .................................... 9

2.1 Fast Two-Operand Adders/Subtractors
2.2 Multiplier Policies
    2.2.1 Traditional Methods And Cellular Arrays
    2.2.2 Non-Traditional Methods
    2.2.3 Multiplier/Accumulator
2.3 Division
2.4 Fast Arithmetic Function Computation Techniques
    2.4.1 Approximation Method
    2.4.2 CORDIC Technique
    2.4.3 Co-Transformation Method

CHAPTER 3 MULTI-PURPOSE FLOATING-POINT
    DISTRIBUTED ARITHMETIC UNIT ........................................ 38

3.1 Distributed Arithmetic Unit ......................................... 39
3.2 Floating-Point Distributed Arithmetic Unit (FPDAU) .................. 42
3.3 Throughput And Precision of FPDAU ................................... 45
3.4 Complex FPDAU (CFPDAU) .............................................. 49
3.5 Multi-Purpose FPDAU/CFPDAU .......................................... 53

CHAPTER 4 SYSTOLIC ARCHITECTURE ......................................... 69

4.1 Systolic Arrays ..................................................... 70
4.2 Mapping Algorithms .................................................. 73
4.3 Algebraic Mapping Techniques .............................. 78
4.3.1 Algorithm Model .................................. 79
4.3.2 Transformation Function ............................ 83
4.3.3 Example ........................................ 87
4.3.4 Transformation Matrix And The Behavior
Of The Systolic Array ............................... 94
4.4 Partitioning ......................................... . 97
4.4.1 Moldavan And Fortes' Method ........................ 97
4.4.2 Redistributing The Computation Method ............... 102
4.4.3 Partitioning Of In-Place Algorithms .................. 104
4.4.4 Partitioning For General-Purpose Systolic Systems ...... 104
4.5 Systolic Array Algorithm Synthesis .......................... 105
4.6 Multiple Dimensions Of Time ............................. 108
4.7 Interrupt of Systolic Arrays ................................ 115
CHAPTER 5 OPTIMAL PARTITIONABLE TRANSFORMATION
AND THE RECONFIGURABLE SYSTOLIC ARRAY ....... 119
5.1 Algebraic Modeling For The Algorithm ....................... 121
5.1.1 Dependence Vectors ............................... 125
5.1.2 Velocity And Distribution Vectors .................... 131
5.1.3 Systolic Array Algorithm Synthesis ................... 134
5.2 Optimal Transformation Matrix ............................ 139
5.3 Partitioning With The Modulus Theory ....................... 143
5.4 System Architecture ..................................... 153
CHAPTER 6 CONCLUSIONS AND FUTURE RESEARCH .............. 160

APPENDICES
A WHY 2's COMPLEMENT ................................ 167
B A BIT-SERIAL DISTRIBUTED ARITHMETIC FILTER .......... 168

BIBLIOGRAPHY .................................................. 181

BIOGRAPHICAL SKETCH ......................................... 187
















LIST OF TABLES


TABLE 2.1  The Floating-Point And The Fixed-Point
           Multiplier And Multiplier/Accumulator Chips ............... 11

TABLE 2.2  Summary Of The Co-Transformation Algorithm ................ 34


- vi -
















LIST OF FIGURES


FIGURE 2.1 Distributed Arithmetic Multiplier/Accumulator ............... 15

FIGURE 2.2 Typical Computing Step For CORDIC Algorithm ............ 25

FIGURE 2.3 The Implementation Of The CORDIC Arithmetic Unit ........ 27

FIGURE 2.4 The Evaluation Of Co-Transformation Function z_0 = f(x_0, y_0) 29

FIGURE 2.5 The Unified Apparatus .................................. 35

FIGURE 3.1 Core Distributed Arithmetic Unit .......................... 40

FIGURE 3.2 The Floating-Point Distributed Arithmetic Unit .............. 44

FIGURE 3.3 Latency Of The FPDAU vs. The Conventional
Multiplier/Accumulator .................................. 46

FIGURE 3.4 Complex Floating-Point Distributed Arithmetic Unit (CFPDAU) 52

FIGURE 3.5 The Real Part Of The Modified CFPDAU .................. 54

FIGURE 3.6 Complex Distributed Arithmetic Multiplier .................. 55

FIGURE 3.7 Multi-Purpose Arithmetic Unit (MPAU) .................... 57

FIGURE 3.8 The Systolic MPAU Array ............................... 66

FIGURE 4.1 3x3 Matrix-Matrix Multiplication (Design 1) ................ 71

FIGURE 4.2 3x3 Matrix-Matrix Multiplication (Design 2) ................ 72

FIGURE 4.3 Two 3x3 Matrix-Matrix Multiplications (Design 2) ........... 73

FIGURE 4.4 Index Set Of A 3x3 Matrix-Matrix Multiply Algorithm ....... 76


- vii -









FIGURE 4.5 Geometric Mapping Of A Convolution [CAP83] ............. 77

FIGURE 4.6 Global Bus-Connection Architecture ....................... 81

FIGURE 4.7 Interconnection Matrix For A Systolic Array ................ 85

FIGURE 4.8 Systolic Array/Algorithm Synergism ....................... 88

FIGURE 4.9 Procedure Of Mapping The Matrix-Matrix Multiply (T1) ...... 92

FIGURE 4.10 Data Distribution And Velocity Of Matrix-Matrix Multiply (T2) 93

FIGURE 4.11 Partitioned Index Space [MOL86] ......................... 99

FIGURE 4.12 Bands And FIFO Queue Registers For Partitioning [MOL86] 101

FIGURE 4.13 Data Distributions Of 2-D Convolution Example ........... 112

FIGURE 4.14 Interrupt For A 3x3 In-Place Matrix-Matrix Multiplication ... 116

FIGURE 5.1 Flowchart Of Finding The Transformation Matrix For
A Systolic Array ....................................... 120

FIGURE 5.2 Dependency Vector ................................... 125

FIGURE 5.3 Algorithm Synthesis For An Image Processing Example ..... 138

FIGURE 5.4 Partitioning Example Of A Matrix-Matrix Multiplication ..... 150

FIGURE 5.5 Reconfigurable Systolic Array Architecture ................ 155

FIGURE 6.1 3x3 Cholesky Matrix Decomposition ...................... 165


- viii -















Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy





A SYSTOLIC DISTRIBUTED ARITHMETIC COMPUTING
MACHINE FOR DIGITAL SIGNAL PROCESSING AND LINEAR
ALGEBRA APPLICATIONS





By

Gin-Kou Ma


May 1989





Chairman: Dr. Fred J. Taylor
Major Department: Electrical Engineering



High-end, real-time signal and image processing, communications, com-

puter graphics, and similar tasks are generally compute-bound rather than input/

output-bound. As such, there is an immediate as well as continuing need to

develop high-speed, high-performance, low-cost numeric data processors which

can be integrated into small packages with low-power dissipation.


- ix-








In this dissertation, a floating-point distributed arithmetic unit (FPDAU) is

developed and extended to the design of a complex floating-point DA unit

(CFPDAU). The new class of processor is shown to be an enhancement to ex-

isting fixed-point distributed arithmetic processors and, when compared to tradi-

tional arithmetic processors, is faster, more compact, and more precise. In ad-

dition, a means of partitioning a large system into smaller distributed arithmetic

subsystems is presented where each sub-SOP can also be used as a general-

purpose processor. Finally, the processor study shows that by integrating a

minimal additional amount of logic into the basic design, a fast multi-purpose

arithmetic unit (MPAU) can be created which can also perform elementary

arithmetic functions faster than existing state-of-the-art devices.

The intrinsic speed and power of the DA processor can be leveraged by

parallel processing. A special type of parallel architecture that has particular

utility to DSP and linear algebra applications is the systolic array [KUN78].

This dissertation has addressed several important issues which extend this tradi-

tionally application-specific architecture to more general-purpose problems. By

dealing with such problems as optimally mapping algorithms to systolic arrays,

partitioning algorithms onto small systolic arrays, and using rolling interrupts,

plus introducing reconfigurable architecture for various data flow requirements,

the dissertation research begins to clear the way for general-purpose systolic ar-

ray computing based on fast compact processors.

















CHAPTER 1
INTRODUCTION


Signal processing, referring to the study of signals and systems, traces its

origins back to the empirical science developed by designers of musical instru-

ments. Today, signal processing technology is used in such areas as filtering,

adaptive filtering, spectral analysis, neural networks, biomedical data processing,

communications and sound reproduction, sonar and radar processing, data com-

munication, seismic signal processing, computer graphics, and a host of other

applications. Digital signal processing (DSP) is a relatively new technical

branch of signal processing that is concerned with the study of systems and sig-

nals with respect to the constraints and attributes imposed on them by digital

computing machinery.

Rapid advances in digital electronics have contributed to the rapid assimila-

tion of DSP theory and technology into a host of problems. While it is tempting

to view today's DSP devices as powerful (capable of performing several million

operations per second), they fall far short of the multi-billion operations per

second performance required to support real-time applications in the area of

computer vision, communications, speech and image understanding, and so

forth. The theory of how to process these signals is now known, but because of

the compute-bound nature of the current technology base, the technology re-

mains untested. To make this bridge between the science of DSP and these

high-end applications, new and powerful high-performance DSP tools and sys-

tems must be developed.


- 1 -






-2-


1.1 DSP System Architecture

To understand where research energies should be focused, it is important to

define where the current state-of-the-art has taken us. Prior to this decade, the

increase in the speed of digital systems could be attributed to the increased

speed and superior packaging of electronic parts. Although further advances in

this direction will provide increased performance, this trend has slowed. Now,

as we enter a new decade, it is apparent that parallelism is required. Here, a

large number of individually powerful digital components will work on a small

portion of the total problem. Collectively, they will have the computational ca-

pacity to attend to those problems currently beyond the limits of today's DSP

technology. The many possible forms that a parallel digital computing machine

may take are called architectures. A special type of parallel architecture that

has demonstrated a powerful signal processing capability is the systolic array

[KUN78]. The systolic array is particularly adept at implementing the primary

DSP operations which are modeled as matrix-matrix, matrix-vector, or vector-

vector multiplications, or sum-of-products arithmetic. This relatively new sci-

ence, unfortunately, still exhibits some shortcomings which prohibit its direct

insertion into high-end DSP applications. For example, a systolic array cur-

rently performs only as a special-purpose single-task computing architecture.

To extend the systolic array to a real-time general-purpose DSP computing

arena, the automatic synthesis of systolic programs, array reconfigurability, and

interrupt servicing problems must be considered.


1.2 High-Performance DSP Processors

As stated, the major DSP operations are matrix-matrix, matrix-vector, and

vector-vector multiplications, and sum-of-products (SOP) arithmetic. Since

DSP is an arithmetic intensive field of study, a key to developing fast systems is





-3-


designing new, high-speed arithmetic processors. These innovative fast arithme-

tic processors must be capable of performing fast addition and multiplication,

and to a lesser extent, division, logarithm, exponentiation, square root, transcen-

dental, etc. This study is historically referred to as computer arithmetic.

Traditional fixed-point and floating-point computer arithmetic schemes

have served the field of DSP well for over two decades. However, even in their

fastest forms, they result in very expensive, complex, and large hardware pack-

ages and run at speeds which are too slow to provide a viable multi-processor

systems solution to the truly compute-bound problems. Over the last six years,

researchers at the University of Florida have been developing several technolo-

gies which can achieve high DSP speeds in small packages [TAY84A]. These

methods offer an opportunity to achieve the needed computational bandwidth,

provided they lend themselves to systolic array insertion and the weaknesses of

the systolic array can be abridged.


1.3 System Integration


The hardware/algorithm synergism problem refers to the study of achieving
the optimal balance between increasing performance through adding algorithm

complexity and simultaneously reducing hardware complexity. The importance

of this claim is well illustrated by contemporary DSP chips, which emphasize

both fast arithmetic and a simple instruction set. These devices belong to a

class of computers which are referred to as reduced instruction set computer

(RISC) machines. While these chips provide very high throughput, they often

exhibit a weakness in implementing complex operations, such as logarithms, ef-

ficiently. Therefore, in order to achieve a synergistic design, the demands of

the candidate DSP algorithms must be carefully balanced with advantages and

limitations of the candidate processors and system architectures.






-4-


1.4 Method of Study

This study of high-performance DSP systems began with a critical review

of the current state-of-the-art arithmetic processors. This review, found in

Chapter 2, led to a fundamentally important technique for speeding up the

propagation delay of a basic carry/borrow unit. The fast adder/subtractor which
resulted was shown to be applicable to the design of fast multiply/divide units.

The study of these fast arithmetic processors compared the algorithm/architec-

ture synergism of traditional and proposed systems. Topics relating to fast

multiplication and scaling were subdivided into major groupings which included

stand-alone multipliers, cellular arrays, memory intensive policies, logarithmic

systems, modular arithmetic, and distributed arithmetic systems. Based on this

study, more efficient DSP computational hardware was developed for later inser-

tion onto a systolic array.

The principal outcome of this phase of the study was the identification of

the distributed arithmetic (DA) unit as the processor of choice. DA units (see

Section 2.2.3 and Appendix B) can be applied whenever a linear sum of

weighted (scaled) partial products of the form y = Σ a_i x_i is to be produced for

fixed coefficients ai. The throughput potential of this class of filters is essen-

tially limited only by the delay of the high-speed semi-conductor memory cell

and the data wordlength. Many authors [ZOH73], [PEL74], [LIU75], [BUR77],

[JEN77], [KAM77], [TAN78], [ZEM80], [ARJ81], [TAY83B], [TAY84C],

[CHE85], and [TAY86] have demonstrated the efficacy of the fixed-point DA

for use in finite impulse response (FIR) and infinite impulse response (IIR) fil-

tering plus fast Fourier transformations (FFT). Their work supports the analy-

sis, which showed that DA units (when applicable) are faster and more precise

than equivalently packaged fixed-point multiply/accumulate units. The pre-

sented study further showed that when input-output latency is not a critical is-






-5-


sue, a bit-serial DA (presented in Appendix B) could be used to minimize the

processor's complexity without reducing real-time throughput. The design also

achieved an ideal synergism with the data communication needs of a dense sys-

tolic array.

The existing DA technology base applies only to fixed-point designs. Many

important DSP applications require high-precision floating-point operations.

The presented study extended the fixed-point distributed arithmetic unit (DAU)

to the floating-point distributed arithmetic unit (FPDAU) case. In Sections 3.2

and 3.3, the FPDAU was formally derived and analyzed. It was shown that the

resulting arithmetic processor is more precise, faster, less complex, and less

costly than any currently available commercial floating-point devices.

The FPDAU was further modified to process complex SOP operations.

However, in this case, the size of the processor was doubled in order to main-

tain the speed capability of a real FPDAU. Nevertheless, this is a factor of two

advantage over conventional methods, which require the device complexity be

increased by a factor of four.

Currently all DA devices presume that the DSP coefficients and/or parame-

ters are known a priori. The new DA structure was shown to allow the basic

processor to service general arithmetic calls as well. The new robust processor

was shown to be capable of supporting traditional DA operations when coeffi-

cients are known a priori and general-purpose multiply/accumulate otherwise.

Compared to commercially available multiply/accumulate chips, it was shown

that the researched architecture is both faster and less complex.

Other elementary arithmetic functions, such as division, exponential func-

tion, logarithm, square root, transcendentals, etc., are also required in DSP.

These elementary arithmetic functions are usually performed using approxima-

tion methods (which requires only multiplication and addition/subtraction; pre-






-6-


sented in Sections 2.3 and 2.4.1) at the expense of throughput. In addition to
the approximation methods, there are other algorithms called co-transformations

(Section 2.4). They are based on the use of a small lookup table, simple con-

trol circuits, shifters, and adders/subtractors. Compared to approximation meth-

ods, co-transformation methods offer a higher performance potential. Co-trans-

formation algorithms were combined with the developed floating-point/complex

distributed arithmetic unit to produce a new Multi-Purpose Arithmetic Unit or

MPAU (Section 3.5). The MPAU can be used as a special-purpose or general

multi-purpose arithmetic unit which provides high-speed elementary arithmetic

function arithmetic. The MPAUs can also be connected to construct a systolic

array computing system for DSP and linear algebra applications.

The innovative processor technology was shown to be able to implement the
primitive instructions required in DSP. In addition, it was shown to have an at-

tractive very large scale integrated circuit (VLSI) implementation. This makes

the unit a logical candidate for use with dense parallel processing machines.

As stated, DSP and linear algebra applications lend themselves to systolic

array implementation because the candidate algorithms, though computationally

intensive, are very regular. Systolic arrays belong to a broader class of fine-

grain, parallel, single instruction multiple data (SIMD) path machines. The de-

sign of a large fine-grain parallel processor array is as much limited by the in-

ner processor communication requirements as it is by processor speed. All of

these factors favor VLSI architectures with a minimum amount of local physical

communication and little or no global communication. To design parallel proc-

essing systems in VLSI successfully, the design should be modular, have regular

data and control paths, and most importantly contain only localized communica-

tion. The processor technology previously referenced was shown to be well

suited to this mission. However, simply providing a systolic array with a supe-






-7-


rior processor will not reduce many of the current limitations of these arrays.

Chapters 4 and 5 present techniques which were developed to map the sequen-

tial loops found in DSP algorithms onto the systolic arrays. The techniques are

enhancements to transformation methods originally proposed by Moldavan

[MOL82] and [MOL83].

It is shown that the mapping of an algorithm onto a systolic array can be

viewed as a transformation of the original algorithm loop indices into physical

array spatial and temporal coordinates. Different transformation matrices are

shown to result in distinct velocity and distribution functions. Thus, the archi-

tecture of the systolic array is heavily reliant on the transformation matrix.

Methods discussing how to find the optimal transformation matrix (where "opti-

mal" means that under the limit of finite available array size or the minimal

required execution time, the system will finish the computation in the least time

or require the least array size, respectively) are not found in the literature. In

order to find the optimal transformation matrix, the conventional one-

dimensional time index theory should be extended to multi-dimensional time

indices (presented in Sections 4.6 and 5.2). Meanwhile, for a real-time comput-

ing system, the interrupt problem should also be considered (Section 4.7).

The University of Florida research effort also focuses on developing proce-

dures that can be used to partition a large DSP problem into one which can be

executed on small systolic arrays. In Section 4.4, these techniques are re-

viewed. One of the most visible, namely the Moldavan and Fortes' partitioning

method, is shown to have three drawbacks: (1) external FIFO buffers are re-

quired, (2) the utilization of the PEs is inefficient, and (3) it is difficult to

find the optimal partition solution. The partitioning scheme developed by

Horiike, Nishida, and Sakaguchi can produce a more efficient implementation

but requires extra internal storage to synchronize the computation. Other im-






-8-


pediments are: (1) there is no efficient way to find the required transformation

matrix, and (2) once the transformation matrix is found, the actual ordering of

the data set in the I/O buffers is not defined.

In Sections 5.1, 5.2, and 5.3, the optimal partitionable transformation ma-

trix was shown to be directly calculable. This new technique can also be used

to implement the systolic array algorithm synthesis where some functions of the

systolic array are pre-defined (e.g., the flows of some data sets are desired and

fixed and others are flexible). It was demonstrated that the drawbacks of exist-

ing partitioning methods could be eliminated using the proposed Modulus The-

ory Partitioning algorithm (Section 5.3). This algorithm provided the optimal

transformation matrix for the partitioning such that no external FIFO was re-

quired. In this way, the data flow control and data distribution remained similar

to cases where no partitioning was applied. Finally, a reconfigurable systolic

array architecture was presented. These results made the interrupt strategy pre-

sented in Section 4.7 viable.

The results of the presented research demonstrate that a general purpose

reconfigurable systolic array can be developed based on the proposed new proc-

essor technology. In addition, the new system provides powerful software con-

trol and programming tools which extend the utility and capability of the array.

The new computing system provides the high-bandwidth required of high-end

DSP and algebraic applications in an affordable and compact package, and the

ultimate goal of processor/array/algorithm synergism is achieved.
















CHAPTER 2
FAST POLICIES FOR ELEMENTARY ARITHMETIC
FUNCTIONS COMPUTATION


Digital signal processing (DSP), whether relating to filters or transforms, is

an arithmetic intensive study with the predominant operation being multiply/ac-

cumulate. As such, the key to developing high-speed systems is the design of

the fast additions and multiplications. Other elementary arithmetic functions

such as division, logarithm, exponential function, square root, transcendental,

etc., are also often used in DSP (adaptive filters, transforms, matrix inverse, im-

age processing, plus others). Thus, the fast algorithms and architectures for im-

plementing these functions should also be investigated.


2.1 Fast Two-Operand Adders/Subtractors


High-performance adders are essential not only for addition, but also for

subtraction (subtraction in 2's complement system is actually a complement-ad-

dition operation), multiplication, and division, plus others. The slowest adder

configuration is the ripple adder. Here, carry information is passed (rippled)

from a digit location to one of higher significance. For such a ripple adder, exe-

cution speed is essentially limited to the time delay associated with propagating

the carry information for the least significant bit (LSB) to the most significant

bit (MSB). As a result, the key to designing fast adders is carry acceleration.

There are various carry acceleration techniques for designing fast two-operand

adders (e.g., asynchronous carry-completion adders and three classes of syn-

chronous adders, namely, conditional-sum, carry-select, and carry-lookahead


-9-






- 10 -


adders). Asynchronous adders were studied by [GIL55]. Studies on carry propa-

gation length were reported in [REI60] and [BRI73]. The conditional-sum addi-

tion was proposed by [SKL60A]. Carry-select adders were introduced by

[BED62]. Other researchers, [ALE67], [FEI68], [LEH61], [MAC61], [SKL60B],

[SKL63], and [WEL69], have investigated the carry propagation speed-up tech-

niques and their possible implementations. It is fair to say that the popular

carry-lookahead adder is the end product of this research into parallel carry

generations. In particular, the work of [SKL63] is most comprehensive. With

today's VLSI technology, adder construction is no longer a great burden on the

designer.


2.2 Multiplier Policies


The importance of the fast multiplier is well illustrated by the available

DSP chips which emphasize relatively fast fixed-point multipliers in their RISC

instruction sets. For example, the TMS32020 contains a 16x16 bits 200ns multi-

plier which occupies about 40 percent of the chip area. Newer DSP chips be-

longing to the third generation, such as the TMS32030, AT&T DSP32, plus oth-

ers, have introduced architectural improvements but leave the fast multiplier as

their cornerstone. Another recent technological innovation has been the dedi-

cated multiply, multiply/accumulate, and numeric processor chip. They appear in

both fixed-point and floating-point forms and are summarized in Table 2.1. A

variation on this theme is the arithmetic co-processor chip (e.g., INTEL

8087/80287) which can perform the higher order algebraic tasks assigned to

them by a CPU chip more efficiently than the CPU itself (e.g., multiplication,

division, exponentiate, transcendental, etc.). From all this we see that the key

to performance is translating the basic algebraic procedures of multiplication

into fast, compact digital operations. These concepts must be understood at the






- 11 -


TABLE 2.1
The Floating-Point And The Fixed-Point
Multiplier And Multiplier/Accumulator Chips


FLOATING-POINT CHIPS
32b 64b Power $
Type M-FLOPS M-FLOPS W U.S.
AMD 8 N/A 7.5 700
Analog Dev. 10 2.5 0.4 350
Weitek 4 2 2.0 900
8 8 2.0 900
10 N/A 2.0 350
Bipolar Integration Tech. N/A 45 5.5 N/A
IDT N/A 10 0.75 N/A
TI N/A 14.7 1.0 N/A
TRW 10 N/A 0.21 N/A


FIXED-POINT CHIPS

Type 12x12 MUL 12x12 MAC 16x15 MUL 16x16 MAC 24x24 MUL
Analog Dev. 110n 130n 75n 85n 200n
TRW N/A 135n 45n 50n N/A
AMD N/A N/A 90n N/A N/A
Logic Dev. 80n N/A 45n 45n N/A
Weitek N/A N/A 55n 75n N/A
IDT 30n 30n 35n 35n N/A
CYPRESS N/A N/A 45n N/A N/A


ARRAY PROCESSOR
ST-100, 100 MFLOP, $250,000


system level. The design of a host of fast multiplier strategies has been stud-

ied in the context of their algorithms and architecture.


2.2.1 Traditional Methods And Cellular Arrays


Among the traditional multipliers, the first was the slow shift-add architec-

ture followed by others (e.g., Booth's algorithm and modified version, and Wal-

lace Trees [WAS82] and [HWA79]). Later, the advent of MSI, LSI, and now

VLSI, resulted in a profusion of monolithic multiplier devices (Table 2.1). If

these units can operate as stand-alone multipliers, without the need to intercon-






- 12 -


nect them to other multiplier chips, they are called Non-additive Multiplier Mod-

ules or NMMs. However, it is often necessary to connect identical NMMs to-

gether to achieve multiplier wordlengths which exceed the capability of a single

module. Using multiple copies of an NMM, large wordlength multiplier arrays,

called Cellular Arrays, can be designed. The objective of an effective cellular

design is to additively recombine the subproduct outputs of the NMM cells in

such a way so as to reconstruct the final product (e.g., NMM Wallace Array

[HWA79]). However, as is often the case, high performance is often gained at

the expense of system complexity. Thus, some authors deal with the multiplica-

tion at the bit level (e.g., Pezaris Multiplier Array [HWA79], Baugh-Wooley

Multiplier [HWA79] and bit-serial multiplier [SIP84] and [KAN85]). This is in

sharp contrast to conventional architectures which communicate as word-wide

transactions. As a result, highly functional VLSI designs can be realized. Bit-

serial networks can be easily routed and interconnected without the problems of

bit-parallel busing. However, this technique needs more complex control cir-

cuitry and will introduce longer latency.


2.2.2 Non-Traditional Methods


All the above multipliers are designed to process the multiplication of

fixed-point numbers which appear in sign-magnitude plus 1's and 2's comple-

ment form. The overwhelming choice for a fixed-point number system is 2's

complement (see Appendix A). Alternative number systems can also be used to

design a fast multiplier (e.g., the canonic signed digit number system, the loga-

rithmic number system (LNS), and the residue number system (RNS)). The

canonic signed digit multiplier [TAY84B] is based on the concept that the zero

term will contribute nothing to the sum-of-products (SOP) and may, therefore,

be "by-passed," and an overall speed-up can be achieved. To implement this





- 13 -


architecture, however, extra circuitry is required for recoding a 2's complement
integer into a canonical form and the speed is data dependent. The LNS has
been studied in a DSP environment by numerous authors ([KIN71], [TAY83A],
[SWA83], [SIC83], [TAY85B], and [TAY85C]). In the LNS a word is given by:

X = r^(e_x)                                                         (2.1)

where r is the radix and e_x is the signed fractional exponent. In this system, the
operations of multiply, divide, square, and square root only require exponent
manipulation. These operations (i.e., addition or binary shift) are simple to im-
plement in fast hardware. Addition and subtraction are, as one can see, a dif-
ferent story. It is known that the LNS, compared to fixed-point schemes, en-
joys a wide dynamic range and high precision. However, the precision and dy-
namic range capability of a LNS unit is essentially limited by the available ad-
dress space of a high speed RAM or ROM.
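As a toy illustration of Equation 2.1, the Python sketch below (radix r = 2; helper names are illustrative, not from the referenced designs) stores a value as a sign and a base-2 exponent, so that multiplication, division, and square root reduce to exponent addition, subtraction, and shifting; the hard cases noted above, addition and subtraction, are deliberately not shown.

import math

# A toy LNS sketch of Eq. 2.1 with r = 2: a nonzero x is held as (sign, e_x)
# with |x| = 2**e_x, so *, /, and sqrt reduce to exponent add/subtract/shift.
def lns_encode(x):
    return (1 if x >= 0 else -1, math.log2(abs(x)))

def lns_decode(s):
    return s[0] * 2.0 ** s[1]

a, b = lns_encode(12.0), lns_encode(3.0)
print(lns_decode((a[0] * b[0], a[1] + b[1])))   # multiply: 36.0
print(lns_decode((a[0] * b[0], a[1] - b[1])))   # divide:    4.0
print(lns_decode((1, a[1] / 2.0)))              # square root of 12: ~3.4641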

The residue number system ([McC79], [TAY84A], [WAN78], [HER75],
[ARM80], [TAY85A], [TAY84D], [TAY82A], and [TAY82B]) is an integer num-
ber system defined with respect to a set P (called a moduli set) of relatively
prime integers (called moduli). In particular, if P = (p_1, ..., p_L) and GCD(p_i, p_j)
= 1, then over the integer range [0, M) ([-M/2, M/2) for signed numbers), where
M = p_1 p_2 ... p_L, an integer X is isomorphic to the L-tuple of residues given
by X → (x_1, x_2, ..., x_L), x_i = X mod p_i. If two integers X and Y have admissible
RNS representations, and if they are algebraically combined under the arithmetic
operation 'o', where o ∈ (+, -, *) (division is not closed in the RNS), then Z = X
o Y → (x_1 o y_1, ..., x_L o y_L) = (z_1, z_2, ..., z_L) if Z belongs to the residue class Z_M
= {0, 1, ..., M-1}. The remarkable feature of this statement is that in the RNS, Z
can be computed as L concurrent operations without the need of producing
carry digits or establishing a carry management policy! For complex number
multiplication, the QRNS [TAY85A] can be used.
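The carry-free property can be seen in the small Python sketch below (the moduli set and helper names are chosen only for illustration); each residue channel is combined independently and a Chinese-remainder reconstruction maps the result back to the range [0, M).

from math import prod

# A small RNS sketch: pairwise relatively prime moduli, M = 1001.
P = (7, 11, 13)
M = prod(P)

def to_rns(x):
    return tuple(x % p for p in P)

def from_rns(r):
    total = 0
    for p, x in zip(P, r):
        Mp = M // p
        total += x * Mp * pow(Mp, -1, p)   # modular inverse of Mp modulo p
    return total % M

a, b = 21, 38
z = tuple((x * y) % p for x, y, p in zip(to_rns(a), to_rns(b), P))
print(from_rns(z), a * b)                  # both 798, computed without carries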






- 14 -


2.2.3 Multiplier/Accumulator



The major operation found in the DSP is the multiply/accumulate. As such,

many commercial DSP chips deal with the fixed-point numbers. A fundamental

problem in developing fixed-point DSP systems is managing dynamic range

overflow. This is especially true in high-gain, narrow band filters where the in-

ternal gain of a filter can be several orders of magnitude higher than that meas-

ured at the output. A serious overflow problem can also occur when two n-bit

numbers are multiplied. Overflow may immediately occur if data and coeffi-

cients are not properly scaled. Assuming that the data is properly scaled, an-

other overflow problem can emerge. If two n-bit 2's complement numbers are

multiplied, the 2n-bit product contains two sign-bits. The leftmost sign-bit can

be masked but that is not sufficient to inhibit overflow. This, in fact, is a weak-

ness in some DSP microprocessors (e.g., TMS 32020) but can be detected in

software at the expense of increased overhead. Although the recently an-

nounced DSP chips deal with floating-point operations, they only perform the

multiply/accumulate operation well. For computing other elementary arithmetic

functions (e.g., division, transcendental, plus others), the approximation methods

(discussed later) must be used and, therefore, speed is slow. The systolic array

multiplier/accumulator ([ULL84] and [MEA80]) uses the multi-process architec-

ture to perform the multiply/accumulate operations. In Chapters 4 and 5, the

systolic array is presented in detail.

Technically, multiplication is the algebraic combining of two variables using

a set of consistent multiplication rules. A special case is called scaling, which

applies the same combining rules to a variable and a constant. Such operations

are ubiquitous in DSP and are often found as "coefficient scaling" operations.

For example, a simple finite impulse response (FIR) is given by






- 15 -


y(k) = Σ_j a_j x(k-j),    j ∈ [0, L)                                (2.2)

where the {aj} coefficient set is known a priori. Any of the previously refer-

enced fixed-point multiplier schemes could be used to accept aj and x(k-j) as

input operands and export a partial product. If each multiply-add cycle re-

quires t_c units of time, then y(k) can be produced in approximately L·t_c time

units. An alternative scheme has been proposed and is based on interpreting

Equation 2.2 as a nested summation. If x(r) is an n-bit 2's complement word,

then

x(r) = Σ_m 2^m x[r : m] − 2^(n-1) x[r : n-1]                        (2.3)

for m = 0, 1, ..., n-2, where x[r : v] is the v-th bit of x(r), x[r : v] ∈ {0, 1}. Upon

substitution, Equation 2.2 becomes

y(k) = Σ_m 2^m Φ(m) − 2^(n-1) Φ(n-1)                                (2.4)

where Φ(q) = Σ_j a_j x[k-j : q], j ∈ [0, L). Observe that Φ(q) is a function of the

known coefficient set {aj} and the qth common bits of x(r), r = 0,1,...,L-1. Re-

ferring to Figure 2.1, it can be seen that the L-bit data field, obtained by ac-


[Figure 2.1 block diagram: an input queue of L n-bit registers X(n), ..., X(n-L-1) supplies an L-bit address to a 2^L-word ROM of Φ(q) values; a controller, bit counter, and shift/add (+/-) accumulator complete the unit.]



FIGURE 2.1
Distributed Arithmetic Multiplier/Accumulator






- 16 -


cessing the qth common bit of x(k) through x(k-L+1), is used as an L-bit ad-

dress which is presented to the read only memory (ROM) table. The ROM has

simply been programmed with the 2^L precomputed values of Φ(m). For exam-

ple, if L = 3, a_0 = 1.5, a_1 = -3.0, and a_2 = 1.0, then

Address   x[n : m]   x[n-1 : m]   x[n-2 : m]   Φ(m)
0 0 0 0 0.0
1 0 0 1 1.5
2 0 1 0 -3.0
3 0 1 1 -1.5
4 1 0 0 1.0
5 1 0 1 2.5
6 1 1 0 -2.0
7 1 1 1 -0.5

Finally, scaling by 2^m, as found in Equation 2.4, can be implemented as a sim-

ple shift-add task. The result is that a filter cycle can now be completed in ap-

proximately n memory cycle units. For example, if L = 12, n = 16, t_mult =

100 ns and t_mem-cycle = 10 ns, then a conventional multiplier filter cycle delay would

be on the order of 1200 ns versus 160 ns for the distributed design. As we see,

this method can provide the fastest computation for scaling SOP operations. Of

course, the filter coefficient set must be known a priori which precludes its use

in adaptive filter applications. This problem is solved in Chapter 3.
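A minimal software model of the mechanism of Figure 2.1 is sketched below in Python (helper names and the bit-to-tap assignment are illustrative, not from the referenced hardware); it precomputes the 2^L-entry Φ table and then accumulates one shifted table word per bit plane, so each output needs n lookup/shift/add steps rather than L multiplies.

def build_rom(coeffs):
    L = len(coeffs)
    # rom[addr] = sum of a_j for which bit j of addr is set: the Phi values of Eq. 2.4
    return [sum(a for j, a in enumerate(coeffs) if (addr >> j) & 1)
            for addr in range(1 << L)]

def da_filter_cycle(rom, samples, n_bits):
    # samples: the L most recent inputs x(k), x(k-1), ..., as n-bit 2's-complement ints
    acc = 0.0
    for m in range(n_bits):
        addr = 0
        for j, x in enumerate(samples):
            addr |= ((x >> m) & 1) << j       # m-th common bit of every sample
        phi = rom[addr]
        if m == n_bits - 1:
            acc -= phi * 2 ** m               # MSB column has weight -2^(n-1), Eq. 2.3
        else:
            acc += phi * 2 ** m
    return acc

rom = build_rom([1.5, -3.0, 1.0])                # the L = 3 example above
print(da_filter_cycle(rom, [3, 1, 2], n_bits=4)) # 1.5*3 - 3.0*1 + 1.0*2 = 3.5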


2.3 Division


Division is also an important arithmetic operation in DSP and linear alge-

bra applications. The computation time of division is always greater than that

of multiplication. Thus, for scaling operations (i.e., coefficients known a priori),

the designer should use multiplication instead of division. However, for the

general case, fast dividers are required.

Division algorithms can be grouped into two classes according to their itera-

tive operator. The first class, where subtraction is the iterative operation, con-

tains many familiar algorithms (e.g., non-restoring division), is relatively slow,






- 17 -


and has an execution speed proportional to the operand (divisor) length. In the

second class, multiplication is used as the iterative operator to achieve higher

speed. Algorithms of this class obtain a reciprocal of the divisor first, and then

multiply the result by the dividend. There are two main multiplicative methods

of finding the reciprocal. One is the series expansion based on the Maclaurin

series ([AND67] and [SPI68]) and the other is the Newton-Raphson iteration

[BUR81]. This type of algorithm converges quadratically and its execution time

is proportional to log2 of the divisor length. As a result, a fast multiplier can

play an important role in fast division. Compared to the co-transformation

method, however, this is slower.
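A Python sketch of the multiplicative scheme is given below under the usual assumptions (normalize the divisor to [0.5, 1), seed the reciprocal with a crude linear estimate, then iterate z ← z(2 − dz)); the seed constants are a common textbook choice, not values taken from this chapter.

import math

def reciprocal(d, iterations=4):
    # Newton-Raphson reciprocal: z <- z*(2 - d*z) converges quadratically, so the
    # iteration count grows with log2 of the required precision.
    if d == 0.0:
        raise ZeroDivisionError
    sign = math.copysign(1.0, d)
    m, e = math.frexp(abs(d))              # |d| = m * 2**e with m in [0.5, 1)
    z = 48.0 / 17.0 - (32.0 / 17.0) * m    # crude linear seed, ~4-5 correct bits
    for _ in range(iterations):
        z = z * (2.0 - m * z)              # each pass roughly doubles the correct bits
    return sign * math.ldexp(z, -e)        # 1/d = (1/m) * 2**-e

def divide(x, d):
    # reciprocal of the divisor first, then one multiply by the dividend
    return x * reciprocal(d)

print(divide(355.0, 113.0))                # ~3.14159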


2.4 Fast Arithmetic Function Computation Techniques


For computing the elementary arithmetic functions, the Taylor series
method can be used. However, even at low precision, the Taylor series is still

computationally intensive. For some functions like exponential, logarithm, ratio,

square root and trigonometric functions, the co-transformation and the CORDIC

techniques can be used to achieve higher computational speed with a relatively

simple hardware architecture consisting of a small lookup table, a simple con-

troller, shifters, and adders, as shown in the following.


2.4.1 Approximation Method [LYU65] and [COD80]


The Taylor series or power series expansion of some f(x) is:

f(x) = f(a) + f'(a)(x-a) + f''(a)(x-a)^2 / 2! + ... + f^(n-1)(a)(x-a)^(n-1) / (n-1)! + R_n

R_n = the remainder after n terms                                   (2.5)

and generally converges for all values of x in some domain called the interval

of convergence and diverges for all x outside this interval. For example






- 18 -


e^x = 1 + x + x^2/2! + x^3/3! + ...                                 -∞ < x < ∞

ln(x) = 2 [ (x-1)/(x+1) + (1/3)((x-1)/(x+1))^3 + (1/5)((x-1)/(x+1))^5 + ... ]    x > 0

sin(x) = x - x^3/3! + x^5/5! - x^7/7! + ...                         -∞ < x < ∞

cos(x) = 1 - x^2/2! + x^4/4! - x^6/6! + ...                         -∞ < x < ∞

sin^(-1)(x) = x + x^3/(2·3) + 1·3·x^5/(2·4·5) + 1·3·5·x^7/(2·4·6·7) + ...    |x| ≤ 1

                                                                    (2.6)
Although the Taylor series can be used to compute arithmetic functions, the

convergent speed is too slow. Thus, instead of the Taylor series, approximation

algorithms should be used to calculate the elementary arithmetic functions.

The arithmetic functions can usually be approximated by a series of
Chebyshev polynomials

f(x) = Σ_(y=0)^(n) C_y T_y(x)                                       (2.7)

This series requires fewer arithmetic operations than the Taylor series and can

speed up a computation provided the approximation is sufficiently accurate.

The approximation formulas for the elementary arithmetic functions are illus-
trated as follows for x a floating-point operand:

e^x = 2^y     where y = x · log2(e)
2^x = 2^y     where y = x
10^x = 2^y    where y = x · log2(10)                                (2.8)

To calculate these exponentials, first compute y = (i+f) where i and f are the
integer and fractional parts of the floating-point number y, respectively and





- 19 -


log2(e) and log2 (10) are the pre-stored constants. Also, f must be positive and
in the range of [0,1). The result equals

2^y = 2^i × 2^f                                                     (2.9)

where 2^f can be computed approximately by

2^f = C_0 + C_1 f + C_2 f^2 + ... + C_5 f^5 + ε(f)                  (2.10)

with |ε(f)| < 10^-7 and

C_0 = 0.9999999702          C_1 = 0.6931530732007278
C_2 = 0.240153617040129     C_3 = 0.0558263180623292
C_4 = 0.0089893400833312    C_5 = 0.0018775766770276                (2.11)

The above equation can be conveniently evaluated by means of a nested multi-

plication, that is

2^f = C_0 + f (C_1 + f (C_2 + f (C_3 + f (C_4 + f C_5))))           (2.12)

Thus, in total the computation requires 6 floating-point multiplications, 5 float-

ing-point additions and 1 integer addition (exponential adjustment of the result).
This algorithm is accurate to at least six significant digits. For higher precision,
more power terms are required and the min-max algorithm can be used to de-

rive the coefficients.
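The recipe of Equations 2.8 through 2.12 can be condensed into a short Python sketch (function name illustrative): y = x·log2(e) is split into integer and fractional parts, 2^f is evaluated by nested multiplication with the coefficients of Equation 2.11, and the integer part becomes the exponent adjustment of the result.

import math

# Coefficients of Eq. 2.11 for 2**f on [0, 1).
C = (0.9999999702, 0.6931530732007278, 0.240153617040129,
     0.0558263180623292, 0.0089893400833312, 0.0018775766770276)

def exp_approx(x):
    y = x * math.log2(math.e)          # Eq. 2.8
    i = math.floor(y)
    f = y - i                          # f in [0, 1)
    p = C[5]
    for c in reversed(C[:5]):          # Eq. 2.12, nested multiplication
        p = c + f * p
    return math.ldexp(p, i)            # 2**i * 2**f

print(exp_approx(1.0), math.e)         # agree to about seven significant digits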

The following approximation formulas [ATT88] can be used to compute the
floating-point logarithms of the floating-point operand x > 0:

ln(x) = C_0 + C_1 (M/2) + C_2 (M/2)^2 + ... + C_8 (M/2)^8 + C_9 (E + 1)

log2(x) = C_0 + C_1 (M/2) + C_2 (M/2)^2 + ... + C_8 (M/2)^8 + (E + 1)

log10(x) = C_0 + C_1 (M/2) + C_2 (M/2)^2 + ... + C_8 (M/2)^8 + C_9 (E + 1)
                                                                    (2.13)


where






- 20 -


x = M · 2^E     M = mantissa and E = exponent                       (2.14)

The coefficients are as follows:

        ln(x)                      log2(x)                      log10(x)
C_0 = -3.067466148           C_0 = -4.4254182             C_0 = -1.332183622
C_1 =  11.30516183           C_1 =  16.3099009156         C_1 =  4.909769401
C_2 = -27.7466647            C_2 = -40.02997556826        C_2 = -12.05022337
C_3 =  51.49518505           C_3 =  74.29184809343        C_3 =  22.36407471
C_4 = -66.69583126           C_4 = -96.221745             C_4 = -28.96563148
C_5 =  58.53503341           C_5 =  84.44820241827        C_5 =  25.42144201
C_6 = -33.20167437           C_6 = -47.89989096077        C_6 = -14.41930397
C_7 =  10.98927014           C_7 =  15.8541655496         C_7 =  4.772579384
C_8 = -1.613006222           C_8 = -2.32707607725         C_8 = -0.7005197015
C_9 =  0.6931471806                                       C_9 =  0.3010299957

                                                                    (2.15)
Similarly, nested multiplication can be used to calculate the above results. In

such a way, ln(x) and log2(x) are accurate to at least four decimal places and

five significant digits, and log10(x) is accurate to at least five decimal places

and six significant digits.
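A Python sketch of the ln(x) recipe follows (function name illustrative); math.frexp supplies M/2 and E+1 directly, and the nested multiplication uses the ln(x) coefficients of Equation 2.15, whose signs are taken as alternating — that choice reproduces ln(1) = 0 and ln(0.5) = -0.693147 from the tabulated magnitudes.

import math

# ln(x) coefficients of Eq. 2.15 (alternating signs assumed).
C = (-3.067466148, 11.30516183, -27.7466647, 51.49518505, -66.69583126,
     58.53503341, -33.20167437, 10.98927014, -1.613006222)
C9 = 0.6931471806

def ln_approx(x):
    m, e = math.frexp(x)               # x = m * 2**e, m in [0.5, 1): m = M/2, e = E + 1
    poly = 0.0
    for c in reversed(C):              # nested multiplication, highest power first
        poly = c + m * poly
    return poly + C9 * e               # Eq. 2.13

print(ln_approx(10.0), math.log(10.0)) # agree to four or five decimal places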

There are three algorithms, namely square root, fast square root, and quick
square root, for computing the floating-point square root of the floating-point

operand x > 0. The square root function is more accurate than either fast

square root or quick square root functions. The following algorithm is used for

computing the square root function:

√x = 2^Z     where Z = log2(x) / 2                                  (2.16)
The above log2(x) algorithm can then be used to calculate Z = (i+f), where i

and f are the integer and fractional parts of the floating-point number Z, re-

spectively. Also, f must be positive and in the range of [0,1). Equation 2.12

can then be used to compute 2^f. This algorithm is accurate to at least six sig-






- 21 -


nificant digits. The fast square root function is faster but less accurate than the

square root function and uses the following algorithm:

y = √x     where x = M · 2^E with M in [1, 2)                       (2.17)

Compute Z = √M using two Newton-Raphson iterations, that is,

Z[n] = 0.5 Z[n-1] + 0.5 M / Z[n-1]    for n = 1, 2 and Z[0] = 1.18  (2.18)

Then

y = Z[2] · 2^(E/2)               for E even
y = √2 · Z[2] · 2^((E-1)/2)      for E odd                          (2.19)

The fast square root algorithm is accurate to at least four significant digits. The

quick square root function is the fastest method but it is less accurate than

either the square root or the fast square root functions. The quick square root function

uses the following algorithm:

y = √x     where x = M · 2^E with M in [1, 2)                       (2.20)

Compute Z = √M using the following approximation:

Z = 0.59 + 0.4237288136 M − 0.28 (M − 1.47)^4

y = Z · 2^(E/2)               for E even
y = √2 · Z · 2^((E-1)/2)      for E odd                             (2.21)

This algorithm is accurate to at least three significant digits.
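A Python sketch of the fast square root of Equations 2.17 through 2.19 follows (function name illustrative); math.frexp provides the M·2^E split, two Newton-Raphson passes refine √M from the seed 1.18, and odd exponents pick up the extra √2 factor.

import math

def fast_sqrt(x):
    m, e = math.frexp(x)               # x = m * 2**e with m in [0.5, 1)
    M, E = 2.0 * m, e - 1              # re-express as M * 2**E with M in [1, 2), Eq. 2.17
    Z = 1.18                           # Z[0]
    for _ in range(2):                 # two Newton-Raphson passes, Eq. 2.18
        Z = 0.5 * Z + 0.5 * M / Z
    if E % 2 == 0:                     # Eq. 2.19
        return Z * 2.0 ** (E // 2)
    return math.sqrt(2.0) * Z * 2.0 ** ((E - 1) // 2)

print(fast_sqrt(10.0), math.sqrt(10.0))   # agree to four or five significant digits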

Trigonometric functions are very important for transformation and rotation

processing. The tangent function returns the floating-point tangent of the float-

ing-point operand x (in radians, -π/4 ≤ x ≤ π/4), which can be approximated by

the following polynomial ratio:

tan(x) ≈ (945x − 105x^3 + x^5) / (945 − 420x^2 + 15x^4)             (2.22)






- 22 -


with |error| < 10^-7. Its inverse function, arc-tangent, can be computed using the

following algorithm:

tan^(-1)(x) = π/4 + C_1 y + C_3 y^3 + ... + C_9 y^9,    x > 0       (2.23)

where y = (x − 1)/(x + 1) and

C_1 = 0.999866       C_3 = −0.3302995      C_5 = 0.180141
C_7 = −0.085133      C_9 = 0.0208351                                (2.24)

This algorithm is accurate to at least three significant digits.
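The arc-tangent recipe can be checked with the short Python sketch below (function name illustrative); the odd-order coefficients are taken with alternating signs, an assumption noted in the comment, which reproduces tan^(-1)(3) to the stated accuracy.

import math

# Coefficients of Eq. 2.24; signs of C3 and C7 assumed negative (alternating-sign
# choice, which reproduces tan^-1(3) = 1.2490 from the tabulated magnitudes).
C1, C3, C5, C7, C9 = 0.999866, -0.3302995, 0.180141, -0.085133, 0.0208351

def arctan_pos(x):                     # Eq. 2.23, valid for x > 0
    y = (x - 1.0) / (x + 1.0)
    y2 = y * y
    return math.pi / 4.0 + y * (C1 + y2 * (C3 + y2 * (C5 + y2 * (C7 + y2 * C9))))

print(arctan_pos(3.0), math.atan(3.0)) # ~1.2490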

The other elementary trigonometric functions, called sin(x) and cos(x), re-

turn the floating-point sine and cosine of the floating-point operand x (in radi-

ans, -π/2 ≤ x ≤ π/2), respectively. These functions can be calculated using the

following algorithms:

sin(x) = x + C_1 x^3 + C_2 x^5 + C_3 x^7 + C_4 x^9
       = x (1 + x^2 (C_1 + x^2 (C_2 + x^2 (C_3 + C_4 x^2))))

cos(x) = Z (1 + Z^2 (C_1 + Z^2 (C_2 + Z^2 (C_3 + C_4 Z^2))))

where Z = π/2 − |x|                                                 (2.25)
For higher precision, additional higher-power terms are needed. The coefficients for

different accuracies are listed below:






- 23 -


for b ≤ 24, or d ≤ 8
C_1 = -0.16666 65668               E+0
C_2 =  0.83330 25139               E-2
C_3 = -0.19807 41872               E-3
C_4 =  0.26019 03036               E-5

for 25 ≤ b ≤ 32, or 9 ≤ d ≤ 10
C_1 = -0.16666 66660 833           E+0
C_2 =  0.83333 30720 566           E-2
C_3 = -0.19840 83282 313           E-3
C_4 =  0.27523 97106 775           E-5
C_5 = -0.23868 34640 601           E-7

for 33 ≤ b ≤ 50, or 11 ≤ d ≤ 15
C_1 = -0.16666 66666 66659 653     E+0
C_2 =  0.83333 33333 27592 139     E-2
C_3 = -0.19841 26982 32225 068     E-3
C_4 =  0.27557 31642 12926 457     E-5
C_5 = -0.25051 87088 34705 760     E-7
C_6 =  0.16047 84463 23816 900     E-9
C_7 = -0.73706 62775 07114 174     E-12

for 51 ≤ b ≤ 60, or 16 ≤ d ≤ 18
C_1 = -0.16666 66666 66666 65052   E+0
C_2 =  0.83333 33333 33316 50314   E-2
C_3 = -0.19841 26984 12018 40457   E-3
C_4 =  0.27557 31921 01527 56119   E-5
C_5 = -0.25052 10679 82745 84544   E-7
C_6 =  0.16058 93649 03715 89114   E-9
C_7 = -0.76429 17806 89104 67734   E-12
C_8 =  0.27204 79095 78888 46175   E-14
                                                                    (2.26)

where b represents the binary bits and d represents the decimal digits. The

computation of their inverse functions, sin^(-1)(x) and cos^(-1)(x), called arcsine

and arccosine, for |x| < 0.966, can use

sin^(-1)(x) = C_1 x + C_3 x^3 + ... + C_11 x^11
            = x (C_1 + x^2 (C_3 + x^2 (... + x^2 (C_9 + C_11 x^2))))

cos^(-1)(x) = π/2 − sin^(-1)(x)                                     (2.27)

with the coefficients

C_1 = 0.9999999711      C_3 = 0.1666698337      C_5 = 0.0749014744
C_7 = 0.0459387798      C_9 = 0.0223169693      C_11 = 0.0448569846
                                                                    (2.28)






- 24 -


This approximation provides an accuracy of |error| < 10^-7 for |x| < 0.5. For |x|

> 0.5, accuracy decreases as |x| increases. The result is accurate to the first

decimal place at |x| = 0.95.
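The nested sine/cosine evaluation of Equation 2.25, with the single-precision (b ≤ 24) coefficients of Equation 2.26, is summarized in the Python sketch below (function names illustrative); cosine reuses the sine polynomial with Z = π/2 − |x|.

import math

# Single-precision (b <= 24) coefficients of Eq. 2.26, written as plain decimals.
C = (-0.1666665668, 0.8333025139e-2, -0.1980741872e-3, 0.2601903036e-5)

def poly_sin(x):                       # Eq. 2.25, intended for |x| <= pi/2
    x2 = x * x
    return x * (1.0 + x2 * (C[0] + x2 * (C[1] + x2 * (C[2] + C[3] * x2))))

def poly_cos(x):
    return poly_sin(math.pi / 2.0 - abs(x))   # Z = pi/2 - |x|

print(poly_sin(0.5), math.sin(0.5))    # ~0.4794255
print(poly_cos(0.5), math.cos(0.5))    # ~0.8775826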



2.4.2 CORDIC [VOL59] Technique



From the previous section, we see that compared to the power series the

approximation method has fewer arithmetic operations, especially multiplication/

division operations. Thus, it can be used to speed up the computation of ele-

mentary arithmetic functions. However, the computation of trigonometric func-

tions still requires a lot of operations (e.g., using the nested multiplication

method, the tangent function requires 5 multiplications, 4 additions/subtractions,

and 1 division; and the sine function needs 5 multiplications and 4 additions/

subtractions). To further reduce computational complexity, the COordinate Ro-

tation DIgital Computer (CORDIC) trigonometric computing technique can be

used.

There are two computing modes, rotation and vectoring, in the CORDIC

trigonometric operations. In the rotation mode, the coordinate components of a

vector and an angle of rotation are given and the coordinate components of the

original vector, after rotation through the given angle, are computed. In the sec-

ond mode, vectoring, the coordinate components of a vector are given and the

magnitude and the angular argument of the original vector are computed. That

is,

Rotation :   y' = k (y cos λ + x sin λ)
             x' = k (x cos λ − y sin λ)

Vectoring :  R = k √(x^2 + y^2)
             θ = tan^(-1) (y / x)                                   (2.29)






- 25 -


where k is an invariable constant. These equations can be solved by the follow-

ing method. Let the original vector (x_i, y_i) be

y_i = R_i sin θ_i
x_i = R_i cos θ_i                                                   (2.30)

Then, if we add/subtract the x and y components by 2^-(i-2) y_i and

2^-(i-2) x_i, respectively, we get

y_(i+1) = y_i ± 2^-(i-2) x_i
        = √(1 + 2^-2(i-2)) R_i sin (θ_i ± α_i)

x_(i+1) = x_i ∓ 2^-(i-2) y_i
        = √(1 + 2^-2(i-2)) R_i cos (θ_i ± α_i)                      (2.31)

where the general expression for α_i with i > 1 is

α_i = tan^(-1) 2^-(i-2)                                             (2.32)

These relations are shown in Figure 2.2. The special angle α_i can be used as

the magnitude of rotation associated with each computing step. The peculiar

[Figure 2.2: geometric sketch of one CORDIC micro-rotation, showing the vector (x_i, y_i) of length R_i rotated through ±α_i to (x_(i+1), y_(i+1)) of length √(1 + 2^-2(i-2)) R_i.]


FIGURE 2.2
Typical Computing Step For CORDIC Algorithm





- 26 -


magnitude of each a is such that a "rotation" of coordinate components
through :ia may be accomplished by the simple process of shifting and adding.
Likewise, after the completion of the rotation step in which the i+1 terms
are obtained, the i+2 terms may be computed from these terms with the results

y_{i+2} = √( 1 + 2^{-2(i-1)} ) √( 1 + 2^{-2(i-2)} ) R_i sin ( θ_i + ξ_i α_i + ξ_{i+1} α_{i+1} )

x_{i+2} = √( 1 + 2^{-2(i-1)} ) √( 1 + 2^{-2(i-2)} ) R_i cos ( θ_i + ξ_i α_i + ξ_{i+1} α_{i+1} )   (2.33)

where

ξ_j = +1 or -1                                                         (2.34)

Consider the initial coordinate components yi and xi where

y_i = R_i sin θ_i
x_i = R_i cos θ_i                                                      (2.35)

By establishing the first and most significant step as a rotation through ±π/2,
and by establishing the number of steps as n, the expression for the final coordinate components will be

y_{n+1} = [ √(1+2^0) √(1+2^-2) ... √(1+2^{-2(n-2)}) ] R_i sin ( θ_i + ξ_1 α_1 + ξ_2 α_2 + ... + ξ_n α_n )

x_{n+1} = [ √(1+2^0) √(1+2^-2) ... √(1+2^{-2(n-2)}) ] R_i cos ( θ_i + ξ_1 α_1 + ξ_2 α_2 + ... + ξ_n α_n )   (2.36)

The increase in the magnitude of the components for a particular value of n is
a constant represented by k. The value selected for n is a function of the desired
computing accuracy. For example, for n = 24, k = 1.646760255.
It can now be shown that, for a rotation of a set of coordinate components
yi and xi through a given angle (as required in the rotation mode), it is neces-
sary to obtain the expressions

y_{n+1} = k R_i sin ( θ_i + λ )
x_{n+1} = k R_i cos ( θ_i + λ )                                        (2.37)








To obtain these relationships, it is required that

λ = ξ_1 α_1 + ξ_2 α_2 + ... + ξ_n α_n                                  (2.38)

and for vectoring, it is required that

-θ_1 = ξ_1 α_1 + ξ_2 α_2 + ... + ξ_n α_n                               (2.39)

Equations 2.38 and 2.39 form a special radix representation equivalent to the

desired angle, λ or θ_1, where

α_1 = π/2
α_2 = tan^{-1} 2^0 = π/4
α_3 = tan^{-1} 2^{-1} = 0.463647609
   ...
α_i = tan^{-1} 2^{-(i-2)}                                              (2.40)

These α terms are referred to as ATR (Arc Tangent Radix) constants and can

be pre-computed and stored in the computer. The ξ terms are referred to as

ATR digits and are determined during each operation. The implementation of

the CORDIC arithmetic unit is shown in Figure 2.3. Then, from the following

equations



FIGURE 2.3
The Implementation Of The CORDIC Arithmetic Unit








y_{n+1} = k ( y_1 cos λ + x_1 sin λ )
x_{n+1} = k ( x_1 cos λ - y_1 sin λ )

θ_1 = tan^{-1} ( y_1 / x_1 )                                           (2.41)
we see that

(1) sin(λ) can be computed by initializing the y register, x register, and
angle register to 0, 1, and λ, respectively. Then, use the sign of the
result in the angle register as the determinator for the ATR digits ξ.
When the resulting angle equals zero, the value y_{n+1} multiplied by 1/k
is equal to sin(λ).

(2) cos(λ) is calculated similarly to sin(λ) except that the desired value
cos(λ) is computed by multiplying the result x_{n+1} by 1/k. cos(λ) and
sin(λ) can be computed simultaneously. Thus, tan(λ) = sin(λ)/cos(λ) can
also be computed.

(3) For computing the arc-tangent function, the y register and the x register
are loaded with the values y_1 and x_1, respectively. The angle register
is initialized to zero. The sign of the result in the y register will
be used as the determinator for the ATR digits ξ. When the resulting
y_{n+1} equals zero, the accumulated sum in the angle register is equal to the
original angular argument θ_1 for the coordinate components y_1 and x_1.
The result in the x register is equal to kR_1. Therefore, R_1 ( = √( x_1^2 + y_1^2 ) )
can be obtained by multiplying the result in the x register by 1/k.

During each iterative operation, the ATR digit ξ_i is determined by the negative

of the sign of the controller.
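A minimal Python sketch of the rotation mode follows (a software illustration only, not the register-level design of Figure 2.3); the first step is the ±π/2 rotation, the remaining steps use the ATR constants of Equation 2.40, and the constant gain k is removed at the end to yield cos(λ) and sin(λ).

    import math

    def cordic_rotation(angle, n=24):
        # Rotation mode: start from (x, y) = (1, 0) and drive the angle register to zero.
        x, y = 1.0, 0.0
        z = angle                                  # angle register
        d = 1.0 if z >= 0.0 else -1.0              # ATR digit for the +/- pi/2 step
        x, y = -d * y, d * x                       # exact rotation through d*pi/2
        z -= d * (math.pi / 2.0)
        k = 1.0                                    # accumulated magnitude growth
        for i in range(2, n + 1):
            alpha = math.atan(2.0 ** -(i - 2))     # ATR constant alpha_i (Equation 2.40)
            d = 1.0 if z >= 0.0 else -1.0          # sign of the angle register picks the ATR digit
            x, y = x - d * y * 2.0 ** -(i - 2), y + d * x * 2.0 ** -(i - 2)
            z -= d * alpha
            k *= math.sqrt(1.0 + 2.0 ** (-2 * (i - 2)))
        return x / k, y / k                        # (cos(angle), sin(angle))

    if __name__ == "__main__":
        c, s = cordic_rotation(0.7)
        print(c - math.cos(0.7), s - math.sin(0.7))   # both differences are tiny
        # the computed k for n = 24 is about 1.6468, matching the value quoted above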

2.4.3 Co-Transformation Method [CHE72]

In this section, it will be shown that a relatively simple device can evaluate

exponentials, logarithms, ratios, and square roots for fraction arguments, em-








ploying only shifts, adds, small table look-ups, and bit counting. The scheme is

based on the co-transformation of a number pair (x, y) such that F(x, y) = f(x_0)

is invariant; when x is driven towards a known value x_w, y is driven towards

the result. The evaluation of Z_0 = f(x_0) can be calculated by introducing a pa-

rameter y to form a two-variable function F(x,y) such that

(1) There is a known starting value y = yo with F(xo,yo) = Zo.

(2) There are convenient, known mechanisms to co-transform the number

pair (x_i, y_i) into (x_{i+1}, y_{i+1}), leaving F invariant, namely

F(xi+1, yi+l) = F(xi, yi).

(3) The sequence of x-transformations tends to a known goal x = xw,

while the corresponding sequence of y-transformations tends to y = yw.

F is so defined that F(xw, yw) = yw.

(4) The iterative part of the algorithm is essentially the repeated applica-

tion of the co-transformation procedure, carried out in the absence of

detailed knowledge about the value Zo.

In terms of solid coordinate geometry, F is easily represented by a curve

confined to the Z = Zo plane and passing through P0 (xo, yo, Zo) as shown in Fig-

ure 2.4. That is


FIGURE 2.4
The Evaluation Of Co-Transformation Function z_0 = f(x_0, y_0)








F(x, y) = Z_0 = f(x_0)                                                 (2.42)

and the transformation invariance is merely the requirement that if the point

P_i(x_i, y_i, Z_0) is on the curve F, then so is P_{i+1}(x_{i+1}, y_{i+1}, Z_0). Clearly, then,

Z_0 = F(x_0, y_0) = F(x_1, y_1) = F(x_2, y_2) = ... = F(x_i, y_i)
    = ... = F(x_w, y_w) = y_w                                          (2.43)

If the x-process converges to x_w, the corresponding y_w should then equal Z_0.

In reality, it may be difficult or impossible to reach Xw exactly, and ex-

trapolation by a Taylor series would be desirable, if F(x,y) is differentiable with

respect to x near Xw. Thus,

F(x_n, y_n) = F(x_w, y_n) + (x_n - x_w) (∂F/∂x)|_{x_w} + O(μ^2)        (2.44)

For an N-bit fraction, if |μ| < 2^{-N}, just the first term will usually be adequate.

On the other hand, both the iteration cost and round-off error can be

halved by including the term linear in μ with |μ| < 2^{-N/2}. Within this termina-

tion algorithm, the multiplication (if any) involves a half-precision number (μ)

as one of the operands, costing half of a standard multiply time. In what fol-

lows, we shall assume that linear extrapolation will be used when |μ| <= 2^{-N/2},

with N >> 1.

The functions being considered here are:

(1)  w e^x ,      0 <= x < ln(2) ,   with F = y e^x ,      x_w = 0
(2)  w + ln(x) ,  0.5 <= x < 1 ,     with F = y + ln(x) ,  x_w = 1
(3)  w / x ,      0.5 <= x < 1 ,     with F = y / x ,      x_w = 1
(4)  w / √x ,     0.25 <= x < 1 ,    with F = y / √x ,     x_w = 1     (2.45)

Exponential ( 0 <= x < ln(2) = 0.69314... )

Z_0 = w e^{x_0} = y_0 e^{x_0} = (y_0 a_0) e^{x_0 - ln a_0} = y_1 e^{x_1} = ...
    = y_n e^μ ≈ y_n + μ y_n

Initiation      : y_0 = w
Function        : F(x, y) = y e^x
Transformations : x_{i+1} = x_i - ln a_i ,   y_{i+1} = y_i a_i
Termination     : x_n = μ ;   Z_0 ≈ y_n + x_n y_n
Relative Error  : 0 <= ε = μ^2/2! + μ^3/3! + ... <= 2^{-(N+1)}         (2.46)

Logarithm ( 0.5 <= x < 1 )

Z_0 = w + ln(x_0) = y_0 + ln(x_0) = (y_0 - ln a_0) + ln(x_0 a_0) = y_1 + ln(x_1) = ...
    = y_n + ln(1 - μ) ≈ y_n - μ

Initiation      : y_0 = w
Function        : F(x, y) = y + ln(x)
Transformations : x_{i+1} = x_i a_i ,   y_{i+1} = y_i - ln a_i
Termination     : 1 - x_n = μ ;   Z_0 ≈ y_n - (1 - x_n)
Relative Error  : 0 >= ε = -( μ^2/2 + μ^3/3 + ... ) >= -2^{-(N+1)}     (2.47)

Ratio ( 0.5 <= x < 1 )

Z_0 = w / x_0 = y_0 / x_0 = (y_0 a_0) / (x_0 a_0) = y_1 / x_1 = ...
    = y_n / (1 - μ) ≈ y_n + μ y_n

Initiation      : y_0 = w
Function        : F(x, y) = y / x
Transformations : x_{i+1} = x_i a_i ,   y_{i+1} = y_i a_i
Termination     : 1 - x_n = μ ;   Z_0 ≈ y_n + y_n (1 - x_n)
Relative Error  : 0 <= ε = μ^2 ( 1 + μ + μ^2 + ... ) <= 2^{-N}         (2.48)

Inverse Square Root ( 0.25 <= x < 1 )

Z_0 = w / √x_0 = y_0 / √x_0 = (y_0 a_0) / √(x_0 a_0^2) = y_1 / √x_1 = ...
    = y_n / √(1 - μ) ≈ y_n + μ y_n / 2

Initiation      : y_0 = w
Function        : F(x, y) = y / √x
Transformations : x_{i+1} = x_i a_i^2 ,   y_{i+1} = y_i a_i
Termination     : 1 - x_n = μ ;   Z_0 ≈ y_n + y_n (1 - x_n) / 2
Relative Error  : 0 <= ε = 3μ^2/8 + ... <= 2^{-(N+1)}                  (2.49)
We can then select a_i to be of the form (1 + 2^{-m}), such that multiplications

by a_i can be replaced by a shift and an add. The value m is usually chosen as

the position number (the ith bit after the binary point is said to have position

number i) of the leading 1-bit in |x_i - x_w|; but an increase by 1 is

needed for the inverse square root. Therefore,

(1) For w e^x, m is the position number of the leading 1-bit in x_i. Thus,
we have

    x_i = 2^{-m} + p ,      0 <= p < 2^{-m}                            (2.50)

Here p represents the bit pattern after the mth bit. The latter, among
all bits in x_i, is responsible for over half of the value of |x_i - x_w| and
is the leading candidate for removal. With the help of a small table
{ Φ_m = ln(1 + 2^{-m}) }, we have

    x_{i+1} = (2^{-m} + p) - ln(1 + 2^{-m})
            = (2^{-m} + p) - (2^{-m} - 2^{-2m}/2 + 2^{-3m}/3 - ...)
            = p + O(2^{-2m})                                           (2.51)

where the most objectionable bit is replaced by a second-order disturbance,
and p is largely intact. The y-transformation is performed by a
right shift of m places, and then by adding to the unshifted y_i; that is,

    y_{i+1} = y_i + 2^{-m} y_i                                         (2.52)

The iteration requires a table lookup, a shift, and two adds. With m
varying from 1 to N/2, a total of N/2 tabular entries are needed; even if
N = 60, the table requirement is only 30 words.

(2) For w + ln(x), m is selected to be the position number of the leading
1-bit in (1 - x_i); here we have

    x_i = 1 - (2^{-m} + p) ,      0 <= p < 2^{-m}
    x_{i+1} = x_i + 2^{-m} x_i = 1 - p + O(2^{-2m})                    (2.53)

Again the most objectionable bit is replaced by a second-order disturbance.
Also,

    y_{i+1} = y_i - ln(1 + 2^{-m})                                     (2.54)

can be calculated by using the same table as in (1).

(3) For w/x, the x-transformation is the same as in (2) and the y-transformation
is the same as in (1), namely,

    x_{i+1} = x_i + 2^{-m} x_i
    y_{i+1} = y_i + 2^{-m} y_i                                         (2.55)

(4) For w/√x, m is chosen as 1 plus the position number of the leading
1-bit in (1 - x_i):

    x_i = 1 - 2(2^{-m} + p) ,      0 <= p < 2^{-m}                     (2.56)

then

    t_i = x_i + 2^{-m} x_i
    x_{i+1} = t_i + 2^{-m} t_i
            = 1 - 2p(1 + 2·2^{-m} + 2^{-2m}) - 2^{-2m}(3 + 2·2^{-m})
            = 1 - 2[ p + O(2^{-2m}) ]                                  (2.57)

The y-transformation is the same as in (1), namely

    y_{i+1} = y_i + 2^{-m} y_i                                         (2.58)
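To make the iteration concrete, here is a minimal Python sketch of case (2), w + ln(x); it is a behavioural model only (ordinary floating point stands in for the shift-and-add hardware), the table holds ln(1 + 2^-m) exactly as above, and the termination applies the linear correction of Equation 2.47.

    import math

    def co_transform_log(x, w=0.0, n_bits=24):
        # Compute w + ln(x) for 0.5 <= x < 1 by driving x toward x_w = 1 (Equation 2.47).
        table = {m: math.log(1.0 + 2.0 ** -m) for m in range(1, n_bits + 1)}
        y = w
        while True:
            mu = 1.0 - x                               # remaining distance to x_w
            if mu < 2.0 ** -(n_bits // 2):             # small enough for linear termination
                return y - mu                          # y_n + ln(1 - mu) ~ y_n - mu
            m = -int(math.floor(math.log2(mu)))        # position number of the leading 1-bit of (1 - x)
            x = x + x * 2.0 ** -m                      # x_{i+1} = x_i + 2**-m x_i (a shift and an add)
            y = y - table[m]                           # y_{i+1} = y_i - ln(1 + 2**-m) (a table lookup)

    if __name__ == "__main__":
        print(co_transform_log(0.7) - math.log(0.7))   # difference is on the order of 2**-24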

The explicit algorithms are summarized in Table 2.2. The four iteration

algorithms can be handled by one unified apparatus (see Figure 2.5), consisting

of an adder A, a shifting mechanism S, and an N/2-word fast memory M holding

{ -ln(1 + 2^{-m}) } , 1 <= m <= N/2. The apparatus is capable of the following ele-

mentary operations:

(a) Load C(x) (the contents of register x) into T

(b) Deduce m from C(T) and store it in V

(c) Use m as address to fetch Φ_m from M; add Φ_m and C(T)

(d) Store the sum in x

(e) Load C(Y) into T

(f) Shift C(T) to the right by m places and put result into U; add C(T)

and C(U)


TABLE 2.2
Summary Of The Co-Transformation Algorithm

Range of x   f(x)        F(x_i, y_i)    x_i                x_{i+1}              y_{i+1}              x_n      Termination Algorithm
[0, ln 2)    w e^x       y_i e^{x_i}    2^{-m} + p         x_i - ln(1+2^{-m})   y_i + 2^{-m} y_i     μ        Z_0 = y_n e^μ ≈ y_n + μ y_n
[0.5, 1)     w + ln(x)   y_i + ln(x_i)  1 - (2^{-m}+p)     x_i + 2^{-m} x_i     y_i - ln(1+2^{-m})   1 - μ    Z_0 = y_n + ln(1-μ) ≈ y_n - μ
[0.5, 1)     w / x       y_i / x_i      1 - (2^{-m}+p)     x_i + 2^{-m} x_i     y_i + 2^{-m} y_i     1 - μ    Z_0 = y_n / (1-μ) ≈ y_n + μ y_n
[0.25, 1)    w / √x      y_i / √x_i     1 - 2(2^{-m}+p)    x_i (1 + 2^{-m})^2   y_i + 2^{-m} y_i     1 - μ    Z_0 = y_n / √(1-μ) ≈ y_n + μ y_n / 2








FIGURE 2.5
The Unified Apparatus


(g) Store the sum in Y

(h) Go to step (a) if m < N/2; otherwise, enter the termination algorithm.

Starting by putting x_0 into X and w (= y_0) into Y, the sequencing for the evalu-

ations is

(1) e^{x_0}  :   w = y_0 = 1 ;   (abcdefgh) ;      then C(Y) + C(Y)·C(X) -> e^{x_0}

(2) ln(x_0)  :   w = y_0 = 0 ;   (abfdecgh) ;      then C(Y) - [ 1 - C(X) ] -> ln(x_0)

(3) 1/x_0    :   w = y_0 = 1 ;   (abfdefgh) ;      then C(Y) + C(Y)·[ 1 - C(X) ] -> 1/x_0

(4) 1/√x_0   :   w = y_0 = 1 ;   (abfdafdefgh) ;   then C(Y) + C(Y)·[ 1 - C(X) ]/2 -> 1/√x_0
                                                                       (2.59)
The apparatus invokes the termination algorithm by shipping C(x),C(y) away, to

be processed by more conventional equipment. If self-contained operation is








desired, the iterations can be allowed to run until |μ| <= 2^{-N}. It would then be

necessary to double the table size.

To derive a rough timing estimate, we equate memory access time with

shift time, and note that a conventional multiplier takes time T = N shift-add

times. The expected iterations involve about 3N/4 shift-adds for the inverse square

root and about N/2 for the other three functions; the time estimates are, therefore, 3T/4

and T/2, respectively. To these, one has to add the timing overhead of the termina-

tion algorithm. For logarithms, the complement-add requirement adds little to

the iteration cost. The total timing cost is still measured by T/2. For exponentials,

ratios, and inverse square roots, a half-multiply is needed. If this is done with a

conventional multiplier, the cost would again be T/2, for a total cost of T (expo-

nential, ratio) and 1.25T (inverse square root).

For the out of range numbers, x's, these functions can be computed as fol-

lows.

(1) Exponentials:

Let x = m·2^E equal i·ln(2) + f, where i is an integer and f is a fractional

number, 0 <= f < ln(2). Then

    e^x = e^{ i·ln(2) + f } = e^f · 2^i                                (2.60)

Here e^f can then be calculated by the above algorithm. Multiplying by 2^i

is nothing but exponent addition/subtraction. Transferring x to

i·ln(2) + f can be achieved by

    x · ( 1 / ln(2) ) = I + F                                          (2.61)

where 1/ln(2) = 1.442695040... ; I and F are the integer and fractional

parts of the product, respectively. Then,

    if I >= 0 :  i = I ,      f = F · ln(2)
    if I < 0  :  i = I - 1 ,  f = (1 - F) · ln(2)
    with ln(2) = 0.693147...                                           (2.62)

A small illustrative sketch of this range reduction appears after item (4) below.








(2) Logarithms:

    x = m·2^E ,   x > 0 and m ∈ [0.5, 1)
    ln(x) = E·ln(2) + ln(m)                                            (2.63)

where ln(m) can be computed using the above algorithm and ln(2) is a

pre-stored constant.

(3) Ratios:

    x = ± m·2^E ,    1/x = ± 2^{-E} · (1/m) ,    m ∈ [0.5, 1)          (2.64)

(4) Inverse square roots:

    x = m·2^E ,   x > 0 and m ∈ [0.5, 1)

    1/√x = 2^{-E/2} · ( 1/√m )              if E is even
    1/√x = 2^{-(E+1)/2} · ( 1/√(m/2) )      if E is odd                (2.65)
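The exponential range reduction of item (1) can be sketched in Python as below; math.exp stands in for the in-range co-transformation kernel, the constants are those of Equations 2.61 and 2.62, and the negative branch keeps the sign of the fractional part F (so it appears as (1 + F) rather than (1 - F)).

    import math

    INV_LN2 = 1.442695040888963   # 1/ln(2), Equation 2.61
    LN2 = 0.693147180559945       # ln(2),   Equation 2.62

    def exp_reduced(x):
        # e**x = e**f * 2**i with 0 <= f < ln(2)  (Equations 2.60-2.62)
        prod = x * INV_LN2
        I = int(prod)             # integer part, truncated toward zero
        F = prod - I              # signed fractional part
        if I >= 0 and F >= 0.0:
            i, f = I, F * LN2
        else:
            i, f = I - 1, (1.0 + F) * LN2     # F < 0 here, so 0 <= f < ln(2)
        ef = math.exp(f)          # stand-in for the in-range algorithm of this section
        return ef * 2.0 ** i      # multiplying by 2**i is only an exponent add

    if __name__ == "__main__":
        for x in (3.7, -2.3, 10.0):
            print(x, exp_reduced(x) - math.exp(x))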

The CORDIC and the co-transformation techniques can therefore provide

faster computation by employing only shifts, adds/subtracts, and small lookup

tables. Thus, these algorithms are integrated into the designed multi-purpose

arithmetic unit which is discussed in the next chapter.















CHAPTER 3
MULTI-PURPOSE FLOATING-POINT DISTRIBUTED ARITHMETIC UNIT

The major operations found in DSP can be modeled as a matrix-matrix,

matrix-vector, or vector-vector multiplication, which uses a sum-of-products

(SOP) as their primitive operation. In a fixed-point SOP, the multiplication task

is generally dominant from a temporal viewpoint. Thus, for real-time DSP, a

high-speed multiplier is required. From the previous chapter, we have seen that

if the coefficients of a SOP operation are known a priori, then the fastest SOP

processor is based on the so-called distributed arithmetic (DA) paradigm.

The concept of DA can be traced to Anderson [AND71] and Zohar

[ZOH73] who noted that the lumped parameter algebra associated with a linear

shift-invariant filter can be redistributed over a set of bit patterns. By the

mid-1970's, the low-cost, high-performance memory chips needed to implement

the required DA mappings made the DA SOP commercially available. By re-

placing the time consuming conventional lumped arithmetic operations with

high-speed table lookups, high data throughputs were achieved. Since its intro-

duction, the distributed filter approach has been applied to a variety of prob-

lems using MSI, LSI, VLSI, microprocessor, and bit-slice microprocessor tech-

nologies ([AND71], [LIU75], and [BUR77]), and has been studied extensively in

terms of complexity-throughput tradeoffs([AND81], [ZOH73], [BUR77],

[PEL74], [JEN77], [TAN78], [KAM77], [ARJ81], [TAY83B], [TAY84C],

[CHE85], and [TAY86]). With the exception of a block floating-point study

[TAY84A], these designs have remained fixed-point. However, as mentioned in
Section 2.2.3, extra overhead must be paid to process the effects of dynamic










range overflow in a fixed-point number system. This overhead can be elimi-

nated by considering the floating-point DA computing system designed in Sec-

tion 3.2.

In Section 3.3, the floating-point distributed arithmetic unit (FPDAU) is ex-

tended to a complex floating-point distributed arithmetic unit (CFPDAU).

While a traditional complex multiply requires 4 real products and 2 adds, the

CFPDAU requires but two concurrent table lookup operations; one produces the

real part of the result and the other produces the imaginary part. In this way,

the computational speed is quadrupled with only a doubling of circuit complex-

ity.

The FPDAU or the CFPDAU, like the DA, is applicable to linear SOP cal-

culations where one set of the operands is known a priori. Thus, the CFPDAU

needs to be further modified if it is to serve as a general-purpose arithmetic

processor. The architecture for such a multi-purpose arithmetic unit (MPAU) is

presented in Section 3.4. The more general-purpose CFPDAU requires only a

slight modification of the basic CFPDAU. Essentially, only two extra small

lookup tables are required (one for the trigonometric functions and the other for

the exponentials, logarithms, ratios and inverse square roots). Thus the

CFPDAU architecture can become the cornerstone for the design of a fast, com-

pact, robust processor.

The data regularity and control characteristics of the developed arithmetic

unit also make it an ideal candidate for integration into a fine-grain systolic array, as

discussed in the next chapter.



3.1 Distributed Arithmetic Unit



In general, a linear SOP is given as








y(n) = Σ_{i=0}^{N-1} a_i x(n-i) ;    x = input, y = output.            (3.1)

Suppose further that the coefficient set {a_i} is known a priori (e.g., digital fil-

ters, digital transforms, numerical analysis) and x(n) ( |x(n)| < 1 ) is coded as a

fixed-point (FXP) 2's complement L-bit word; then the DA version of Equation

3.1 is

y(n) = -Φ_n(0) + Σ_{k=1}^{L-1} Φ_n(k) 2^{-k} ;    Φ_n(k) = Σ_{i=0}^{N-1} a_i x[k : n-i]    (3.2)

where x[k : n-i] is the kth least significant bit of x(n-i). The value of Φ_n(k)

can be obtained as a table lookup call and the scaling by 2^{-k} can be performed

using a shifter. The implementation of Equation 3.2 is shown in Figure 3.1 and



FIGURE 3.1
Core Distributed Arithmetic Unit


is referred to as a distributed arithmetic unit. The throughput of this unit is es-

sentially limited only by memory cycle time and the data wordlength. When the

order (i.e., N) exceeds the addressing space limits of the memory table, sub-








SOP's must be designed (i.e., divide the original SOP into small groups of SOP)

and integrated together. For such cases, the conventional approach, proposed

by Burrus [BUR77] and others, is to operate a set of distributed sub-SOP sec-

tions in parallel and then combine their outputs with an adder-tree.
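A behavioural Python sketch of the mechanism of Equation 3.2 is given below; it precomputes the 2^N-entry partial-product table, treats each input as an L-bit 2's complement fraction, and gives the sign bit negative weight. Floating-point arithmetic stands in for the fixed-point shift-adder, so the sketch illustrates the data flow rather than the hardware.

    def da_sop(a, x, L=16):
        # Distributed-arithmetic SOP y = sum_i a[i]*x[i], |x[i]| < 1, L-bit 2's complement inputs.
        N = len(a)
        # Partial-product table: Phi[addr] = sum of a[i] over the bits set in addr (Equation 3.2)
        Phi = [sum(a[i] for i in range(N) if (addr >> i) & 1) for addr in range(1 << N)]

        q = [int(v * 2 ** (L - 1)) for v in x]          # L-bit 2's complement fractions
        def bit(v, k):                                  # bit of weight 2**-k (k = 0 is the sign bit)
            return (v >> (L - 1 - k)) & 1

        y = -Phi[sum(bit(q[i], 0) << i for i in range(N))]      # sign bits carry negative weight
        for k in range(1, L):
            addr = sum(bit(q[i], k) << i for i in range(N))     # one table address per bit position
            y += Phi[addr] * 2.0 ** -k                          # shift-and-add of the table output
        return y

    if __name__ == "__main__":
        a = [0.25, -0.5, 0.125, 0.75]
        x = [0.3, -0.4, 0.9, -0.6]
        print(da_sop(a, x), sum(ai * xi for ai, xi in zip(a, x)))   # agree to ~2**-(L-1)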

Both finite impulse response (FIR) digital filters and infinite impulse re-

sponse (IR) digital filters can be implemented with this distributed arithmetic

unit. An error analysis of the FIR digital filters which were implemented using

the DA method, has been reported by Peled and Liu [PEL74] and improved by

Kammeyer [KAM77]. The Peled and Liu model suggests that the finite

wordlength effect errors are of zero mean with a variance of a IR = Q2/9 where

Q is the quantization step size. An error analysis for a DA IIR digital filter has

been published by Taylor [TAY86], and it indicates:

σ²_{D-I}  = (Q²/12) |H(z)|² + Q²/9 ;   for the Direct-I architecture
σ²_{D-II} = (Q²/12) |H(z)|² + Q²/9 ;   for the Direct-II architecture  (3.3)

These studies prove the thesis that DA data processing (when applicable) is

faster and more precise than an equivalently packaged conventional unit which

has error variance:

σ²_C = N (Q²/3) + N ((NQ)²/12) ;   where N is the order of the IIR filter    (3.4)

Fixed-point DA is a method by which high throughput digital filtering can

be supported. However, as is often the case, high performance is gained at the

expense of system complexity. In Appendix B, a bit-serial DA architecture is

presented which mitigates the pin-out and large memory problems without re-

ducing throughput. The method lends itself to direct VLSI integration and can

be used to develop high-order, high-precision digital filters.








3.2 Floating-Point Distributed Arithmetic Unit (FPDAU)



Distributed arithmetic units (DAUs) are well studied in the context of

fixed-point data processing. The only exception was the block floating-point

DAU reported by Taylor [TAY84A]. This concept will be extended to a full

floating-point DAU revision which is referred to as the floating-point distributed

arithmetic unit (FPDAU).

If the data are represented in the conventional floating-point form

x = m(x) · r^{e(x)}, where m(x) is the m-bit mantissa, r is the radix, and e(x) is the

e-bit exponent, then the SOP with floating-point precision output yields the fol-

lowing:

y(n) = Σ_{i=0}^{N-1} a_i x(n-i) = Σ_{i=0}^{N-1} a_i m(x(n-i)) r^{e(x(n-i))}    (3.5)

Assume E = max( e(x(j)) ) where j = n, n-1, ..., n-N+1; then

y(n) = r^E ( Σ_{i=0}^{N-1} a_i m̄(x(n-i)) )

m̄(x(n-i)) = m(x(n-i)) r^{Δ(x(n-i))}

Δ(x(n-i)) = e(x(n-i)) - E <= 0                                         (3.6)

For r = 2, the scaling of m(x(j)) by r^{Δ(x(j))} can be accomplished using a stan-

dard 2's complement "right shift" at the address generator buffer.

In the aforementioned fixed-point DAU designs, each cell of the address

generator register was of length m-bits for a total length of N*m bits (N is the

order of the filter). Since a scaled mantissa (i.e., m̄(x(j))) is used in the

FPDAU, it may be desirable to extend the length of each cell to (m+V) bits,

where V = max( -Δ(x(n-i)) ), for a total length of N(m+V) bits. In this

way, a mantissa can be scaled by the maximal value 2^{-V} without a loss of signifi-

cance. Under these assumptions, the FPDAU version of Equation 3.6 becomes:







y(n) = r^E ( Σ_{k=1}^{m+V-1} 2^{-k} Φ_n( x[k : n-i] ) - Φ_n( x[0 : n-i] ) )    (3.7)

where Φ_n and x[k : n-i] have been previously defined.

An FPDAU is diagrammed in Figure 3.2. It consists of a "core" floating-

point DA engine which is augmented with a set of peripheral function modules.

The additional hardware cells consist of:

(1) an exponent queue;

(2) an exponent magnitude compare module;

(3) an address generator (possibly of extended length);

(4) normalization circuitry; and

(5) a floating-point shift-adder instead of fixed-point shift-adder.
An FPDAU cycle consists of the following operations in the indicated order.

(1) Read the next exponent e(x(j)) and mantissa m(x(j)). Present e(x(j))

to the exponent queue and exponent comparator. Load m(x(j)) into

the address generator buffer.

(2) After reading the data, determine the maximum exponent E = max{

e(x(j)) I j = n, n-1, n-2, ..., n-N+1}
(3) Adjust the contents of the jth address generator cell (in a 2's comple-

ment sense) by the amount E-e(x(j)) for all j concurrently.
(4) Perform a core floating-point DA cycle for up to m+V table lookups.

(5) Adjust the final exponent by adding (or subtracting) E from E(y(n)) to

produce e(y(n)).

In practice, the input and output adjustments would be performed concurrently,

reducing temporal overhead to a minimum. As a result, the FPDAU cycle is

again essentially defined by the speed of the core floating-point unit. The value

is approximately m+V memory accesses.
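The cycle above can be modelled behaviourally as in the sketch below (an illustration only, using Python floats where the hardware uses tables, shifters, and a floating-point shift-adder); the mantissa width m_bits and guard length V are assumed parameters, and the core loop is the same bit-serial DA mechanism as in Section 3.1, applied to the aligned mantissas and rescaled by 2^E at the end.

    import math

    def fpdau_sop(a, x, m_bits=24, V=7):
        # y(n) = 2**E * sum_i a[i] * mbar(x(n-i))   (Equations 3.6 and 3.7)
        L = m_bits + V
        N = len(a)
        mants, exps = zip(*(math.frexp(v) for v in x))   # v = m * 2**e with 0.5 <= |m| < 1
        E = max(exps)                                    # step 2: maximum exponent
        # step 3: scaled mantissas mbar = m * 2**(e - E), held as L-bit 2's complement fractions
        q = [int(m * 2 ** (L - 1)) >> (E - e) for m, e in zip(mants, exps)]

        Phi = [sum(a[i] for i in range(N) if (addr >> i) & 1) for addr in range(1 << N)]
        def bit(v, k):
            return (v >> (L - 1 - k)) & 1
        acc = -Phi[sum(bit(q[i], 0) << i for i in range(N))]    # sign bits, negative weight
        for k in range(1, L):                                   # step 4: m+V table lookups
            acc += Phi[sum(bit(q[i], k) << i for i in range(N))] * 2.0 ** -k
        return acc * 2.0 ** E                                   # step 5: restore the exponent

    if __name__ == "__main__":
        a = [0.5, -0.25, 0.125]
        x = [3.5, -0.011, 20.0]
        print(fpdau_sop(a, x), sum(ai * xi for ai, xi in zip(a, x)))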









FIGURE 3.2
The Floating-Point Distributed Arithmetic Unit



The FPDAU provides a means of computing a floating-point SOP opera-

tion. The same calculation could be implemented using traditional components

by supplying all N coefficient data pairs (i.e., (ai, x(n i))) to a floating-point

multiplier in a serial manner. The N-products would be sent to a floating-point







arithmetic unit for accumulation. Assuming that accumulation is interleaved

with multiplication, the traditionally fashioned SOP would have a latency of the

order of N floating-point multiply cycles. In addition, there would be some loss

of precision (due to rounding/truncating) during each multiplication. The prod-

ucts may or may not be less significant. Compared to traditional methods, the

FPDAU offers several distinct advantages. These include

accelerated arithmetic speed,

improved error budget, and

superior packaging and cost.


3.3 Throughput and Precision of FPDAU


A preliminary analysis of a DA floating-point vector processor of length 16

will be presented. The data for the new unit is based on commercially available

memory and support chips. The latency of the preliminary design is compared

to a state-of-the-art Weitek WTL1032/1033 floating-point chip set which dissi-

pates two watts and costs $700 in small quantities. For a pipelined architecture,

the latency of the WTL1032 multiplier (32x32 bits) is 100 ns per stage and the

latency of the WTL1033 ALU is also 100 ns per stage, with a total latency of

500 ns each. The 16-tap SOP will then take 2.6 μs when implemented using this

chip set in pipeline mode. The FPDAU model is predicated on the assumption

that m = 24 bits and V = 7 bits (i.e., a 32-bit single precision model). Based on a

12 ns table lookup latency (e.g., Performance P4C187), a 20 ns shift-adder

(64-bit) latency, a 10 ns address register shift/align, a 10 ns exponent compare,

and a 10 ns mantissa normalization, the latency of the floating-point SOP using

the FPDAU is reduced to 1.02 μs without pipelining.

Without pipelining, the FPDAU is still approximately 2.6 times faster than

the state-of-the-art pipelined multiply/accumulate processor. In addition, the








latency of the FPDAU is dependent on the wordlength, not on the length of SOP

as in the traditional multiply/accumulate processor. The latency versus the

length of SOP for the FPDAU and the traditional multiply/accumulate processor

is shown in Figure 3.3. It is seen that for the FPDAU, the latency is a constant



(Plot: latency in μs versus the number of taps of the SOP, for the pipelined WTL1032/1033 and for the FPDAU)


FIGURE 3.3
Latency Of The FPDAU v.s. The Conventional Multiplier/Accumulator



if the wordlength of the address generator is fixed. However, the wordlength is

dependent on the divergence of the data and the precision requirement. In ad-

dition, the size of the table increases geometrically as the length of SOP in-

creases. The solution to these problems is to use the multiple sub-SOP and the

adder-tree. Details are discussed in Section 3.5.

Next, the precision of implementing SOP using FPDAU will be compared

with the precision of using the traditional multiply/accumulate processor. In

previous fixed-point DA studies, it was demonstrated that DA offers distinct

precision advantages over conventional SOP designs. A comparison of how

floating-point architectures differ can be similarly analyzed using statistical

modeling methods developed for the study of digital filters. For each imple-

mentation of the Nth order FIR equation:








y(n) = Σ_{i=0}^{N-1} a_i x_i ;    x_i = x(n-i)                         (3.8)

the conventional architecture consists of many error sources distributed through-

out its architecture. For a direct implementation, the multiplicative mantissa

adjustment and normalization operation rounds the full precision product into a

single precision value and normalizes the result to a range 0.5 <= m̄_i < 1.0. If

a_i and x_i produce a uniformly distributed mantissa (a reasonable assumption),

then

σ_i² = σ² ( r^{ e_xi + e_ai } )²          if 0.5 <= m̄_i < 1
σ_i² = σ² ( r^{ e_xi + e_ai + 1 } )²      if 0.25 <= m̄_i < 0.5         (3.9)

where r = 2 for a binary system, σ² = Q²/12, and Q = 2^{-m} (the mantissa has m bits). The

total error budget of using the conventional multiply/accumulate processor then

becomes

σ²_TC = σ² Σ_{i=0}^{N-1} [ 0.5 ε( ( r^{e_xi + e_ai} )² ) + 0.5 ε( ( r^{e_xi + e_ai + 1} )² ) ]    (3.10)

where ε represents the expected value. If the range of the products is assumed

to be uniformly distributed over [-M, M), M = 2^q, q = 2^e (the exponent has e bits),

then

ε( ( r^{e_xi + e_ai} )² ) = ∫_{-M}^{M} x² dx / (2M) = M²/3             (3.11)

which yields

σ²_TC = σ² (M²/3) ( 2.5N - 2.5 )                                       (3.12)








The FPDAU model consists of a σ²_DAU = (Q²/9) ε( (r^E)² ) error variance attrib-

uted to the core floating-point DAU operation. The resulting output y(n) has an

equivalent floating-point value of y(n) = y_DAU r^E where

E = max( log_r x_i ) ;    r = 2 (for a binary system)                  (3.13)

As a result, the error variance for the FPDAU can be modeled as

σ²_TFPDAU = σ²_DAU = (Q²/9) ε( (r^E)² )
          ≈ (4/3) σ² (M²/3)      (assuming a_i and x_i are independent)
          = (4/9) σ² M²                                                (3.14)

The relative error is therefore

σ²_TC / σ²_TFPDAU = [ σ² (M²/3) (2.5N - 2.5) ] / [ (4/9) σ² M² ]
                  = (7.5/4)(N - 1) ≈ 2N                                (3.15)

That is, the FPDAU has about a 2N-fold advantage over traditional schemes.

For a typical value of N = 16, the improvement in precision can be an addi-

tional five bits of precision. This may or may not be significant based on the

mantissa wordwidth m. These results are in complete agreement with what has

been observed with fixed-point conventional and distributed architecture.

In a conventional floating-point system, the allocated wordwidth (32-bits or

64-bits typical) is divided between the mantissa and exponent. This division

establishes precision and dynamic range limits. These parameters, in turn, es-

tablish latency (throughput) bounds and, to some degree, packaging and power

requirements. However, in many application specific designs, the optimal








choice of precision and range delimitation may not correlate well with the stan-

dard format sizes (i.e., double or extended precision mantissa and exponent val-

ues). Often, in real-time applications, one may choose to sacrifice precision

(mantissa) to gain speed. This will have the effect of reducing the amount of

time spent in multiplication. However, the dynamic alteration of a precision

qualifier is not permitted in commercially available, standard format, floating-

point devices.

The dynamic assignment of mantissa and exponent wordwidths is permitted

in an FPDAU design. The effect on latency will be essentially a scaling of the

partial product/accumulation time which is on the order of m+V memory lookup

cycles. That is, speed will scale linearly with precision (i.e., m) and logarithmi-

cally with range (i.e., V = log_2 2^{max(e)}). In fact, the FPDAU can be reduced to

a fixed-point processor simply by setting the exponent to zero. Therefore, the

FPDAU can be dynamically "tuned" to provide the best speed, precision, and

range mixes. This can be accomplished without any hardware or architecture

changes.


3.4 Complex FPDAU (CFPDAU)


Complex arithmetic is essential to many transformations and image and sig-

nal processing operations. In the complex domain, the N sum of complex prod-

ucts is given by

y = Σ_{i=0}^{N-1} w_i x_i

w_i = w_Ri + j w_Ii ;   known a priori

x_i = x_Ri + j x_Ii ;   input

y = y_R + j y_I ;       output                                         (3.16)







where all the variables (real and imaginary) are coded as floating-point words.
For a traditional system, the execution of Equation 3.16 would require
4N real products and
2N real adds.

Using Equation 3.6 as a model, Equation 3.16 can be rewritten to read

y_R = r^E ( Σ_{i=0}^{N-1} ( w_Ri m̄_Ri - w_Ii m̄_Ii ) )

y_I = r^E ( Σ_{i=0}^{N-1} ( w_Ri m̄_Ii + w_Ii m̄_Ri ) )                  (3.17)

with

E = max( e_xR(i), e_xI(i) ;  i = 0, ..., N-1 )                         (3.18)

where

e_xR(i) = exponent value of x_Ri
e_xI(i) = exponent value of x_Ii                                       (3.19)

In addition, m̄_i is defined to be the scaled mantissa value, defined in the

manner used in the FPDAU development. Equation 3.17 can be directly real-

ized as an FPDAU satisfying the equation:

y_R = r^E ( Σ_{k=1}^{m+V-1} 2^{-k} Φ_n( x[k:i] ) - Φ_n( x[0:i] ) )

y_I = r^E ( Σ_{k=1}^{m+V-1} 2^{-k} Ω_n( x[k:i] ) - Ω_n( x[0:i] ) )      (3.20)

where

Φ_n( x[k:i] ) = Σ_{i=0}^{N-1} ( w_Ri m̄_Ri[k:i] - w_Ii m̄_Ii[k:i] )

Ω_n( x[k:i] ) = Σ_{i=0}^{N-1} ( w_Ri m̄_Ii[k:i] + w_Ii m̄_Ri[k:i] )

V = max( E - e_x )                                                     (3.21)

The required architecture is shown in Figure 3.4. Observe that all the external

circuitry about the core floating-point DAU remains functionally the same, ex-

cept that the address generator register size is doubled to hold both the real and

imaginary mantissas. The core of DAU is also doubled with one section pro-

ducing real partial products and the other generating the imaginary terms.

One significant difference between the complex FPDAU (CFPDAU) and the

FPDAU is that the address space requirement, for a given N, increases by a

factor of two. The CFPDAU differs from conventional complex multipliers in

several important areas as well. First, while a traditional complex multiply re-

quires 4 real products and 2 real adds per multiplication, the CFPDAU tables

collect these operations into two groups and execute the 4 real products and 2

real adds as two concurrent table lookup operations. One table lookup gener-

ates the real part of complex product and the other produces the imaginary

part.

The speed at which a CFPDAU can operate is again established by table

lookup cycle times. Based on a 12 ns memory module and a 20 ns floating-point

shift-adder, a 16-tap complex SOP can be performed at an effective latency

of 32x(m+V) nanoseconds, where m+V is the address register cell length in bits.

For a typical value of m+V = 48 bits, the effective latency of the 16-tap complex

SOP is approximately 1.5 μs. The latency of a typical real floating-point 16-tap

SOP, implemented using a state-of-the-art multiplier and ALU, is much greater










FIGURE 3.4
Complex Floating-Point Distributed Arithmetic Unit (CFPDAU)



than this figure (as shown in previous section) and, for the complex multiply/ac-

cumulate case, would be four times longer. Thus, the CFPDAU can outperform

existing products and does so within a comparable hardware budget.








3.5 Multi-Purpose FPDAU/CFPDAU



As indicated in the previous section, the address space of the CFPDAU is

increased by a factor of two compared to the FPDAU. This means that the

lookup table size increases 2^N times (N is the number of taps of the SOP). For

example, the 16-tap complex SOP will require two 2^32-word memories. Com-

pared to the FPDAU, each table is larger by over 4 orders of magnitude. This

memory size requirement obviously precludes fabrication in one VLSI chip. In

order to reduce the table size, the address generator buffer can be divided into

several groups. In this way, the memory size is reduced geometrically. Each of

these groups has its own sub-SOP lookup table. The partial SOPs can then be

summed together using an adder-tree. The number of levels of the adder-

tree depends on the number of groups, which in turn affects processor la-

tency. More groups result in longer latency. However, this effect can be

reduced when pipelining is employed. Trading large table sizes off against

minimal extra circuitry (adders and shift-adders) and a small increment of la-

tency is generally an accepted design choice. For example, an 18-tap complex

SOP will require two 2^36-word lookup tables. If the address generator register

buffer is divided into six groups, then only twelve 2^6-word lookup tables are

needed. In this instance, we save more than 8 orders of magnitude of memory

by introducing only five extra shift-adders and one adder-tree consisting of

six adders. The diagram of this scheme is shown in Figure 3.5. The adder-

tree has three levels. With the 20 ns floating-point adder technology, it introduces

60 ns (pipelined) to the total latency and can be ignored.
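The bookkeeping behind these numbers is easy to check with a few lines of Python (a counting sketch only; the group size and the real/imaginary table doubling are the assumptions stated in the text):

    def table_words(n_taps, group_size, complex_data=True):
        # Total lookup-table words when an n_taps SOP is split into groups of group_size taps.
        bits_per_tap = 2 if complex_data else 1          # one real and one imaginary address bit per tap
        groups = -(-n_taps // group_size)                # ceiling division
        words_per_table = 2 ** (group_size * bits_per_tap)
        tables = groups * (2 if complex_data else 1)     # separate real-part and imaginary-part tables
        return groups, tables, tables * words_per_table

    if __name__ == "__main__":
        print(table_words(18, 18))   # (1, 2, 2*2**36): one monolithic pair of 2**36-word tables
        print(table_words(18, 3))    # (6, 12, 768):    six groups -> twelve 2**6-word tables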

Reducing the table size is one reason to divide the address generator buffer

into groups. Another reason is that under such conditions, each sub-SOP of the

CFPDAU can be used as a general-purpose multiplier. These multipliers can











FIGURE 3.5
The Real Part Of The Modified CFPDAU





be connected together to construct a systolic array, which is the subject of the

next two chapters.

Recall that the complex multiply


y_R = w_R x_R - w_I x_I

y_I = w_R x_I + w_I x_R                                                (3.22)

can be rewritten to read:

y_R = r^{E_x} ( Σ_{k=1}^{m+V-1} 2^{-k} Φ( m[k:x] ) - Φ( m[0:x] ) )

y_I = r^{E_x} ( Σ_{k=1}^{m+V-1} 2^{-k} Ω( m[k:x] ) - Ω( m[0:x] ) )      (3.23)

where

E_x = max( e_xR , e_xI )
V = max( E_x - e_x )
Φ is the table consisting of 0, w_R, -w_I, w_R - w_I
Ω is the table consisting of 0, w_I, w_R, w_R + w_I
m[k:x] is the kth least significant bit of the
    mantissas x_R and x_I                                              (3.24)

Compared to the traditional complex multiply, which requires 4 real products

and 2 real adds, the DA multiply (Equation 3.23; its architecture is shown in

Figure 3.6) is faster and simpler.

FIGURE 3.6
Complex Distributed Arithmetic Multiplier
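A behavioural Python sketch of Equations 3.23-3.24 follows; the two four-entry tables Φ and Ω are built from the a priori complex coefficient w, each cycle forms a two-bit address from one bit of the real mantissa and one bit of the imaginary mantissa of x, and the sign bits receive negative weight. As before, ordinary floating point stands in for the shift-adder.

    def complex_da_multiply(w, x, L=16):
        # y = w * x with w known a priori and x held as a pair of L-bit 2's complement fractions.
        wr, wi = w.real, w.imag
        Phi   = [0.0, wr, -wi, wr - wi]    # Phi[bR + 2*bI]: contribution to the real part
        Omega = [0.0, wi,  wr, wr + wi]    # Omega[bR + 2*bI]: contribution to the imaginary part

        qr = int(x.real * 2 ** (L - 1))    # quantized mantissas
        qi = int(x.imag * 2 ** (L - 1))
        def bit(v, k):                     # bit of weight 2**-k (k = 0 is the sign bit)
            return (v >> (L - 1 - k)) & 1

        addr = bit(qr, 0) + 2 * bit(qi, 0)
        yr, yi = -Phi[addr], -Omega[addr]          # sign bits carry negative weight
        for k in range(1, L):
            addr = bit(qr, k) + 2 * bit(qi, k)
            yr += Phi[addr] * 2.0 ** -k            # one lookup per cycle for the real part
            yi += Omega[addr] * 2.0 ** -k          # one concurrent lookup for the imaginary part
        return complex(yr, yi)

    if __name__ == "__main__":
        w, x = 0.5 - 0.25j, 0.3 + 0.7j
        print(complex_da_multiply(w, x), w * x)    # agree to within the quantization error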








Thus, when the address generator buffer of the CFPDAU is divided into

several groups, each group with its own sub-SOP lookup table can then be used

as a DA complex multiplier. In such a case, each group can be used as a sub-

SOP unit when the coefficients are known a priori or as general-purpose multi-

plier if the coefficients are not known a priori. However, this design alone is still not

powerful enough to serve as the arithmetic unit of a general-purpose

computing system for DSP and linear algebraic applications. Fortunately, the

fast architecture for computing the elementary arithmetic functions (e.g., trigo-

nometric functions, logarithmic, exponential, ratio, and inverse square root) can

be installed into the complex DA multiplier very easily. The diagram of this ar-

chitecture is shown in Figure 3.7. This arithmetic unit is referred to as a Multi-

Purpose Arithmetic Unit (MPAU) which can implement the following arithmetic

functions in about one to three shift-add multiplication times:

  half of a 3-tap complex SOP

  6-tap real SOP

  half of a complex multiplication

  real multiplication

  real division

  trigonometric functions (sin, cos, tan)

  rotation ( x' = R cos(θ + λ), y' = R sin(θ + λ) )

  vectoring ( R = √(x^2 + y^2), θ = tan^{-1}(y/x) )

  real natural logarithms

  real natural exponentials

  real inverse square roots

Compared to the traditional implementation methods, MPAU is much faster and

simpler.








FIGURE 3.7
Multi-Purpose Arithmetic Unit (MPAU)





The MPAU achieves the above elementary arithmetic functions in the man-

ner explained in the following (with math error, overflow, and under-

flow detection ignored):








A. Complex/Real SOP and Multiplication

(i) Real part of a complex SOP

LOADT(MT, ΦSOP);      /* Load the lookup table from Data RAM into lookup table ΦSOP */
LOADS(MS, SOP, 6);    /* Load 6 data (for complex data, 3 complex numbers; for real data,
                         6 real numbers) from Data RAM into registers SOP(i), i = 0, 1, ..., 5 */
SELSOP;               /* Select the SOP lookup table */
AJE;                  /* Do the exponent compare, generate max{e} and max{Δe}, and adjust
                         the mantissas of SOP(i) */
MOVE(E, E0);          /* Move C(E) (contents of register E) to register E0 */
ZERO(X1);             /* Set register X1 to zero */
for (i = 1; i < m̄; i++) {     /* m̄ = # of bits of mantissa + max{Δe} */
  AMOVE(MS, V);       /* Shift-right one bit the mantissas of registers SOP(i) to generate the
                         table address in register MS and move it into register V */
  ADDA1(X1, TAB);     /* Get the lookup value from the SOP table, put it into register TAB,
                         and C(ACC1) = C(X1) + C(TAB) */
  SEMA1(X1);          /* Decrease the exponent of C(ACC1) by 1 and move the result into
                         register X1 */
}
AMOVE(MS, V);
SUBA1(X1, TAB);       /* Get the value C(TAB) from the SOP lookup table and
                         C(ACC1) = C(X1) - C(TAB) */
MOVE(ACC1, X1);       /* C(X1) <- C(ACC1) */
MOVE(X1, J0);         /* C(J0) <- C(X1) */
EPADD;                /* Add the exponent of J0 by C(E0); store the exponent into register E1
                         and the mantissa into register M1 */
RCV;                  /* Combine C(E1) and C(M1) into a floating-point number and store it
                         into register J1 */
SAVE(J1, M);          /* Store C(J1) into Data RAM */
The same operations are used to generate the imaginary part except that the

contents of the lookup table are different.

(ii) the above procedure can be used to generate 6-tap sub-SOP.

(iii) For complex and real multiplications, the same operations are used except

that the first two commands are replaced by

(1) Real Part of a Complex Multiplication

LOADR(Mr, X1);        /* Load the real part of the multiplicand from Data RAM into register X1 */
LOADR(Mi, Y1);        /* Load the imaginary part of the multiplicand from Data RAM into register Y1 */
MOVE(X1, R1);         /* Move C(X1) into register R1 */
MOVE(Y1, R2);         /* C(R2) <- C(Y1) */
SUBA1(X1, Y1);        /* C(ACC1) = C(X1) - C(Y1) */
MOVE(ACC1, R3);       /* C(R3) <- C(ACC1) */
LOADTR;               /* Load the contents of registers R1, R2, and R3 into table ΦSOP */
ZERO(SOP);            /* Set registers SOP(i) to zero */
LOADS(MS, SOP, 2);    /* Load the complex multiplier into registers SOP(0) and SOP(1) */

The imaginary part is the same except that the fifth command, SUBA1, is replaced

by ADDA1(X1, Y1).

(2) Real Multiplication


LOAD(M, R1);          /* Load the multiplicand into register R1 */
LOADTR;
ZERO(SOP);
LOADS(MS, SOP, 1);    /* Load the multiplier into register SOP(0) */

B. Real Division (Ratio)

LOADR(Mx, J0);        /* Load the divisor, x, into register J0 */
SEM;                  /* Separate the exponent and mantissa; the exponent is stored in
                         register E1 and the mantissa in register M1 */
if (SN == 1)          /* x is a negative number */
  MOVE(-M1, X1);      /* C(X1) <- -C(M1) */
else                  /* x is a positive number */
  MOVE(M1, X1);       /* C(X1) <- C(M1) */
LOADR(My, X2);        /* Load the dividend into register X2 */
while (DEDM < m) {    /* Deduce m̄ from 1 - C(X1) and move it into register V;
                         m = # of bits of mantissa */
  *SEMY1(m̄);          /* Decrease the exponent of C(X1) by m̄; store the result into register Y1 */
  *SEMY2(m̄);          /* Decrease the exponent of C(X2) by m̄; store the result into register Y2 */
  **ADDA1(X1, Y1);    /* C(ACC1) = C(X1) + C(Y1) */
  **ADDA2(X2, Y2);    /* C(ACC2) = C(X2) + C(Y2) */
  ***MOVE(ACC1, X1);  /* C(X1) <- C(ACC1) */
  ***MOVE(ACC2, X2);  /* C(X2) <- C(ACC2) */
}
MOVE(-E1, E0);        /* C(E0) <- -C(E1) */
if (SN == 1)          /* the divisor is a negative number */
  MOVE(-X2, J0);      /* C(J0) <- -C(X2) */
else
  MOVE(X2, J0);       /* C(J0) <- C(X2) */
EPADD;
RCV;
SAVE(J1, M);          /* Store C(J1) into Data RAM */

*, **, *** : commands marked with the same symbol can be processed concurrently.


C. Sin, Cos, Tan, and Rotation

LOADR(Mλ, ANG);       /* Load the angle (in radians) from Data RAM into register ANG */
ZERO(V);              /* Set register V to zero */
SELTR;                /* Select the trigonometric function table ( tan^{-1} 2^{-(i-2)} ) for table lookup */
if (SIGN == 0) {      /* the angle is a positive number */
  LOADR(Mx, X2);      /* Load the x-coordinate value into register X2 */
  LOADR(-My, X1);     /* C(X1) <- -(y-coordinate value) */
  SUBA3; }            /* Get the value C(TAB) from the lookup table and C(ANG) = C(ANG) - C(TAB) */
else {                /* the angle is a negative number */
  LOADR(-Mx, X2);
  LOADR(My, X1);
  ADDA3; }            /* C(ANG) = C(ANG) + C(TAB) */
INC(V);               /* C(V) = C(V) + 1 */
for (i = 1; i < m; i++) {
  *SEMY1(2i-2);       /* Decrease the exponent of C(X1) by 2i-2; store the result into register Y1 */
  *SEMY2(2i-2);       /* Decrease the exponent of C(X2) by 2i-2; store the result into register Y2 */
  EXCHY12;            /* Exchange the contents of registers Y1 and Y2 */
  if (SIGN == 0) {
    **SUBA1(X1, Y1);  /* C(ACC1) = C(X1) - C(Y1) */
    **ADDA2(X2, Y2);
    SUBA3; }
  else {
    **ADDA1(X1, Y1);
    **SUBA2(X2, Y2);
    ADDA3; }
  ***MOVE(ACC1, X1);
  ***MOVE(ACC2, X2);
  INC(V);
}

For Sin: Set C(Mx) = 1 and C(My) = 0; after running the above procedure, do
C(X1) = C(X2)*C(C3) and store C(X1) into Data RAM, where register C3
contains the constant 1/k.

For Cos: Set C(Mx) = 1 and C(My) = 0; after running the above procedure, do
C(X1) = C(X1)*C(C3) and store C(X1) into Data RAM.

For Tan: Set C(Mx) = 1 and C(My) = 0; after running the above procedure, do
C(X1) = C(X2)/C(X1) and store C(X1) into Data RAM.


D. Vectoring

ZERO(V);              /* Set register V to zero */
SELTR;
ZERO(ANG);            /* Set the angle to zero */
ADDA3;
LOADR(My, X1);
LOADR(-Mx, X2);
while (C(X2) != 0) {
  *SEMY1(2*C(V)-2);   /* Decrease the exponent of C(X1) by 2*C(V)-2; store the result into register Y1 */
  *SEMY2(2*C(V)-2);   /* Decrease the exponent of C(X2) by 2*C(V)-2; store the result into register Y2 */
  EXCHY12;
  if (C(X2) > 0) {
    **ADDA1(X1, Y1);
    **SUBA2(X2, Y2);
    **ADDA3; }
  else {
    **SUBA1(X1, Y1);
    **ADDA2(X2, Y2);
    **SUBA3; }
  MOVE(ACC1, X1);
  MOVE(ACC2, X2);
  INC(V);
}

The value in register ANG is θ = tan^{-1}(y/x). To get R = √(x^2 + y^2), we must multi-
ply C(X1) by C(C3) and store the result into Data RAM.


E. Natural Logarithm  y + ln(x),  x > 0

LOADR(Mx, J0);        /* Load operand x into register J0 */
SEM;
if (SN == 1)          /* if operand x is negative, move -M1 to register X1; otherwise move
  MOVE(-M1, X1);         M1 to register X1, where M1 is the mantissa of x */
else
  MOVE(M1, X1);
SELLN;                /* Select the ln(1 + 2^{-m}) table for table lookup */
LOADR(My, X2);        /* Load operand y into register X2 */
while (DEDM < m) {    /* DEDM of (1 - X1); m is the number of bits of mantissa */
  SEMY1(m̄);
  *ADDA1(X1, Y1);
  *SUBA2(X2, TAB);
  **MOVE(ACC1, X1);
  **MOVE(ACC2, X2);
}
MOVE(E1, X1);
{ C(X1) = C(X1)*C(C2); }   /* Multiply C(X1) by C(C2); register C2 contains the constant ln(2) */
MOVE(X2, Y1);
if (SN == 1)
  SUBA1(X1, Y1);
else
  ADDA1(X1, Y1);      /* y + ln(x) = y + ln(m·2^E) = y + E*ln(2) + ln(m) */
MOVE(ACC1, X1);
SAVE(X1, M);

For computing ln(x), we can set C(My) = 0 (i.e., y = 0).


F. Natural Exponential Function  w e^x

LOADR(Mx, X1);        /* Load operand x into register X1 */
{ C(X1) = C(X1)*C(C1); }   /* C(X1) = x*(1/ln(2)); register C1 contains the constant 1/ln(2) */
MOVE(X1, I0);         /* Separate the integer and fractional parts of C(X1); the integer and
                         fractional parts are stored in registers I and F, respectively */
if (C(I) >= 0) {
  MOVE(I, E0);        /* x/ln(2) = I + F. If I >= 0, i = I and f = F*ln(2); otherwise
  MOVE(F, X1); }         i = I - 1 and f = (1 - F)*ln(2) */
else {
  MOVE(I-1, E0);
  MOVE(1-F, X1); }
{ C(X1) = C(X1)*C(C2); }   /* f = F*ln(2) or f = (1 - F)*ln(2) */
SELLN;
LOADR(My, X2);        /* Load operand y into register X2 */
while (DEDM < m) {    /* DEDM of X; compute y e^f.  w e^x = w e^{i*ln(2)+f} = (w e^f) 2^i */
  SEMY2(m̄);
  *SUBA1(X1, TAB);
  *ADDA2(X2, Y2);
  **MOVE(ACC1, X1);
  **MOVE(ACC2, X2);
}
MOVE(X2, J0);
EPADD;
RCV;
SAVE(J1, M);

For computing e^x, we set C(My) = 1 (i.e., y = 1).


G. Inverse Square Root  y/√x,  x > 0

LOADR(Mx, J0);        /* Load operand x into register J0 */
CEP;                  /* x = m·2^E. If E is even, C(E1) = E/2 and C(M1) = m; otherwise
                         C(E1) = (E+1)/2 and C(M1) = m/2 */
MOVE(E1, E0);
MOVE(M1, X1);
LOADR(My, X2);        /* Load operand y into register X2 */
while (DEDM < m) {    /* DEDM of 2^{-1}(1 - X1); compute y / √C(M1) */
  SEMY1(m̄);
  ADDA1(X1, Y1);
  MOVE(ACC1, X1);
  SEMY1(m̄);
  ADDA1(X1, Y1);
  MOVE(ACC1, X1);
  SEMY2(m̄);
  ADDA2(X2, Y2);
  MOVE(ACC2, X2);
}
MOVE(X2, J0);
EPADD;
RCV;
SAVE(J1, M);

For computing 1/√x, we can set C(My) = 1 (i.e., y = 1).








The above commands are used to explain how the MPAU implements the

elementary arithmetic functions. In fact, these commands can be hard-wired

such that the MPAU will actually be a RISC arithmetic processor. The data

RAM and program RAM can be used for connecting the MPAUs into a systolic

array (see Figure 3.8) as discussed in Section 5.4. Each MPAU uses the Data


PE : Multi-Purpose Arithmetic Unit
DM : Data RAM
C : Adder-Tree For Column PEs
R : Adder-Tree For Row PEs

FIGURE 3.8
The Systolic MPAU Array







RAM as its I/O port and has its own function controller (program RAM). The

reduced instructions to each MPAU are:

(1) CSOP(MS, MD, N)          /* Complex SOP; MS: begin address of the data array; MD: the
                                destination address for storing the result; N = # of data (<= 6) */
(2) RSOP(MS, MD, N)          /* Real SOP */
(3) ADD(MS1, MS2, MD)        /* C(MD) = C(MS1) + C(MS2) */
(4) SUB(MS1, MS2, MD)        /* C(MD) = C(MS1) - C(MS2) */
(5) RCMULT(MS1, MS2, MD)     /* Calculate the real part of C(MD) = C(MS1)*C(MS2), where
                                C(MD), C(MS1), and C(MS2) are complex numbers */
(6) ICMULT(MS1, MS2, MD)     /* Calculate the imaginary part of the complex multiplication */
(7) MULT(MS1, MS2, MD)       /* Real multiplication */
(8) DIV(MS1, MS2, MD)        /* Real division, C(MD) = C(MS1)/C(MS2) */
(9) ROT(MS1, MS2, MS3, MD1, MD2)
                             /* Rotation. If C(MS1) = 1, C(MS2) = 0, and C(MS3) = λ (angle in
                                radians), then C(MD1) = cos(λ) and C(MD2) = sin(λ). If C(MS1) = X,
                                C(MS2) = Y, and C(MS3) = λ, then C(MD1) = R*cos(θ+λ) and
                                C(MD2) = R*sin(θ+λ), where X = R*cos(θ) and Y = R*sin(θ) */
(10) VET(MS1, MS2, MD1, MD2) /* Vectoring. C(MD1) = √( C(MS1)^2 + C(MS2)^2 ) and
                                C(MD2) = tan^{-1}( C(MS2)/C(MS1) ) */
(11) LN(MS1, MS2, MD)        /* C(MD) = C(MS2) + ln( C(MS1) ) */
(12) EXP(MS1, MS2, MD)       /* C(MD) = C(MS2) * e^{C(MS1)} */
(13) ISQRT(MS1, MS2, MD)     /* C(MD) = C(MS2) / √C(MS1) */
(14) LOADT(MS, N)            /* Load a table of length N from Data RAM to the lookup table
                                RAM (N <= 64) */
(15) MOVE(MS, MD, N)         /* Move data (length N) beginning at address MS to address MD */







(16) ADDT(Ms,MD) /* Load the data C(Ms) to adder-
tree, get the result, and store it
into MD.
For such an architecture, if the coefficients are known a priori, then each

MPAU can be used as a SOP processor to compute the sub-SOP. The adder-

trees can then be used to sum together the results of sub-SOPs. For use as a

general purpose computing system, the MPAUs can be programmed as inde-

pendent arithmetic units for parallel computation or as a systolic array for com-

puting FOR/DO loops. The methods of mapping sequential FOR/DO loop

algorithms, which are often used to model DSP and linear algebraic applica-

tions, are discussed in the next two chapters.















CHAPTER 4
SYSTOLIC ARCHITECTURE

Fields such as signal and image processing, communications, computer

graphics, matrix operations, physical simulation, and others require an enor-

mous number of computations. Many of these applications must be executed in

real-time, thus further increasing the speed and the throughput demands. These

increasing demands indicate the need to develop high-speed, low-cost, high-

performance numeric computers having low packaging volumes. High-speed

arithmetic processors, especially those designed using multi-purpose arithmetic

units (MPAUs), are potentially capable of meeting these requirements. Another

technology, namely high-speed parallel architectures, should also be considered.

Only recently have very large scale integrated circuit (VLSI) techniques provided

the economic justification for this study. However, large parallel machines have

a fundamental limitation, namely, the high cost of communication. For example,

it is easy to integrate twenty 8-bit adders into one VLSI chip but infeasible to

have over 20*3*8=480 I/O pins on one package.

Communication is expensive in terms of real costs, input-output (I/O) re-

quirements (measured in pins), and the attendant problem of long communica-

tion links. Furthermore, delays associated with off-chip communications intro-

duce another bottleneck. All of these factors favor VLSI architectures with a

minimum amount of communication, especially global communication. There-

fore, the design of a parallel system should be modular, have regular data and

control paths, and most importantly contain feasible I/O and localized communi-

cation. A special type of parallel architecture that has particular utility to signal










processing and which can be efficiently implemented in VLSI is the systolic ar-

ray.


4.1 Systolic Arrays


The systolic array was originally proposed by Kung and Leiserson in 1978

[KUN78]. The term "systolic array" stems from the analogy that the data flows

from computer memory in a rhythmic fashion, passing through the processor

elements (PEs) before it returns to memory, similar to blood circulating from

and to the heart [KUN82]. Formally, a systolic array consists of a set of identi-

cal interconnected computational cells, each capable of performing some simple

operations. Communication of information from one cell to another is accom-

plished along tightly coupled, localized (nearest neighbor) data links. Further-

more, I/O operations are restricted to the array boundary.

A systolic array assumes that once data is removed from memory, it is

used by a number of PEs before returning to memory. This not only allows for

localized communication among the PEs, but also reduces the I/O requirements

of the system. As such, the systolic array can alleviate the imbalance (the in-

ability of the I/O system to provide data at a sufficient rate to the arithmetic

unit) by balancing I/O speed with computational speed.

Similar to an assembly line, each PE of the systolic array performs a small

part of a much larger task. For example, a 3x3 matrix-matrix multiplication al-

gorithm C=A*B can be implemented using a systolic array as shown in Figure

4.1 (Design 1). In this example, the algorithm executes by passing the rows of

A from left to right through the array while simultaneously passing the columns

of B from the top to the bottom of the array. Elements of the resulting matrix,

C, remain in-place in the PEs and accumulate the results. Notice that only local

communication between a PE and its nearest neighbors is needed. The key to










FIGURE 4.1
3x3 Matrix-Matrix Multiplication (Design 1)


the systolic array is that the data elements are used multiple times before re-

turning to memory. Each row and column is operated on by each of 3 PEs be-

fore the calculation is complete.

Most DSP algorithms can be implemented in a systolic array in various

ways. The above example can also be implemented as shown in Figure 4.2 (De-

sign 2). In this design, all three matrices are moving through the array. The A

matrix is moving from left to right, the B matrix is moving from right to left,

and the C matrix is moving from top to bottom. Compared to Figure 4.1 (De-

sign 1), this design exhibits some different characteristics. Most notable is that the

number of PEs needed in this design increases from 9 to 15. This design,










FIGURE 4.2
3x3 Matrix-Matrix Multiplication (Design 2)



however, offers a number of advantages which may overcome the cost differen-

tial of added hardware. First, the results are moving through the array and thus

can be extracted as they exit the bottom of the array. Another advantage is that

each processor is used only every other cycle (utilization <= 50%), thus allowing

two different matrix multiplications (say, C=A*B and Z=X*Y) to be executed

concurrently, as shown in Figure 4.3. Note that in only one extra cycle, twice the

effective number of matrix multiplies can be performed. Design 2 can thus offer

increased PE utilization and throughput when more than one matrix multiply is

needed. Thus, the mapping of algorithms onto a systolic array is non-unique










FIGURE 4.3
Two 3x3 Matrix-Matrix Multiplications (Design 2)


and the designer has a great deal of flexibility in choosing which design to imple-

ment.


4.2 Mapping Algorithms


Mapping algorithms onto systolic arrays consists of completely specifying

the hardware and I/O requirements of the target systolic array. The hardware

requirements include the number of PEs, the configuration of the PEs (linear,

rectangular, triangular, etc.), and the communication paths between the PEs.








The key is to position and order the data flow through the array properly. The

I/O requirements include positioning the data in the I/O buffers found at the pe-

riphery of the array such that they can flow through the array in the correct or-

der and at the right time.

The following parameters specify the mapping of an algorithm onto a sys-

tolic array:

1. Number of processors

2. Execution time

3. Velocity vectors (for each variable)

4. Distribution vectors (for each variable)

The number of processors depends upon the algorithm mapping rule. The num-

ber of PEs needed can, however, be much larger than the number of PEs physi-

cally available in the systolic system. In such a case, a partitioning scheme is

needed to map the algorithm to the actually available PEs (see Section 4.4 for

details).

Execution time is also dependent on the specific implementation chosen.

In the strictest sense, this usually excludes the time to "preload" or "flush" the

systolic array. Determining execution time becomes increasingly complicated

when more than one time dimension is necessary. With multiple time dimen-

sions, the data sets move in different directions depending on the increment se-

quence of the time vectors and the array timing control becomes piece-wise

continuous. Piece-wise continuous time vectors prevent the current mapping

techniques from utilizing all of the possible pipelining in an algorithm (see Sec-

tion 4.6 for details).

Velocity vectors represent the direction and the distance traveled by a data

element for each clock cycle and are constant for each element of a data set.

There is one velocity vector for each time dimension, which contains one ele-








ment for each spatial dimension of the systolic array. Thus, if q represents the

number of spatial dimensions of the systolic array and m represents the number

of time dimensions in the systolic implementation (where q+m is the number of

index dimensions of the algorithm), then there will be m velocity vectors with a

dimension of q×1.

The distribution vectors determine the layout of the data elements in each

data set. For each index (or subscript) of a data set, there is one distribution

vector which contains q elements (q is the spatial dimension of the systolic ar-

ray as above). For example, A(I,J) of matrix-matrix multiplication has two dis-

tribution vectors; one distribution vector is a vector from A(I,J) to A(I+1,J) and

the other is a vector from A(I,J) to A(I,J+1). Thus, given the position of the

first element in a data set, the entire data set can be configured using the distri-

bution vectors.
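To make this bookkeeping concrete, the following Python sketch lays out a 3x3 data set A from the position of its first element using two distribution vectors and then advances the whole set by one clock cycle with a velocity vector. The particular vectors (dist_row, dist_col, velocity) are illustrative assumptions, not values taken from a specific design in this chapter.

# Sketch: laying out a data set with distribution vectors and moving it
# with a velocity vector (illustrative values, q = 2 spatial dimensions).
N = 3
first_pos = (0, 0)        # assumed array position of A(1,1)
dist_row  = (1, 0)        # assumed vector from A(I,J) to A(I+1,J)
dist_col  = (0, 1)        # assumed vector from A(I,J) to A(I,J+1)
velocity  = (1, 0)        # assumed movement per clock cycle (one time dimension)

def position(i, j, t=0):
    """Array coordinates of A(i,j) after t clock cycles (1-based i, j)."""
    return tuple(first_pos[d]
                 + (i - 1) * dist_row[d]
                 + (j - 1) * dist_col[d]
                 + t * velocity[d]
                 for d in range(2))

for t in (0, 1):
    layout = {(i, j): position(i, j, t)
              for i in range(1, N + 1) for j in range(1, N + 1)}
    print(f"cycle {t}: A(1,1) at {layout[(1, 1)]}, A(3,3) at {layout[(3, 3)]}")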

In a sequential algorithm, or a recurrence equation, each calculation can be

uniquely defined by either the indices of the defining FOR/DO loop or the sub-

scripts in a recurrence equation. This representation facilitates the visualization

of the problem. An index space is then defined as a set of vectors that are the

dimensions of the algorithm. For example, two-dimensional matrix-matrix mul-

tiply has three loops, thus its index space is J^3 (Figure 4.4), and each algorithm

variable y is indexed by an index vector v_y ∈ J^3. Each valid vector in the in-

dex space represents one (or more) calculations that are being performed in the

algorithm kernel. Thus, for mapping algorithms onto systolic arrays, the index

space should then be transformed to the systolic array space consisting of time

and spatial coordinates. As such, the newly transformed space can be used to

indicate when and where the computations will occur.
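As a small illustration, the fragment below enumerates the index space J^3 of the 3x3 multiply in Figure 4.4; each index point stands for one multiply-accumulate of the kernel, and Section 4.3 shows how such points are transformed into time and space coordinates.

# Enumerate the index space J^3 of the 3x3 matrix-matrix multiply.
N = 3
index_space = [(i, j, k) for i in range(N)
                         for j in range(N)
                         for k in range(N)]

# Each index point corresponds to one multiply-accumulate
# C[i][j] += A[i][k] * B[k][j] in the kernel of Figure 4.4.
print(len(index_space), "index points, e.g.", index_space[:3])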

Many techniques exist for mapping algorithms onto systolic arrays. The

earliest technique is the heuristic method. This method consists of tracing the








[Figure: the 27 points of the index set drawn in (i, j, k) space, with the A, B, and C data elements positioned along the corresponding axes.]

Index Set of a 3x3 Matrix-Matrix Multiply Algorithm
(** initialization of C[i,j] is ignored **) :

for i=0 to 2 do
for j=0 to 2 do
for k=0 to 2 do
C[i,j] = C[i,j] + A[i,k]*B[k,j]
end k;
end j;
end i;

FIGURE 4.4
Index Set Of A 3x3 Matrix-Matrix Multiply Algorithm

execution of an algorithm and then determining by hand the positions of the

data variables and the number of PEs needed. Most other methods use a repre-

sentation of the data dependencies (see the following section for further details)

of an algorithm and transform these dependencies into systolic array spatial and

time coordinates. One group of methods uses geometric techniques based on

the use of data flow graphs and other similar structures [CAP86], [HUA86],








[QUI84], and [KUN85]. Figure 4.5 shows the geometric mapping of a convolu-


[Figure: (A) the index-space graph of the convolution, with the I axis serving as time; (B) the resulting three-processor architecture with the X, W, and Y data streams.]


FIGURE 4.5
Geometric Mapping Of A Convolution [CAP83]


tion algorithm after [CAP83] (where the length of both X and Y is assumed to

be three). This geometric mapping technique consists of graphing a representa-

tive number of points in the index set to the array. These points are then con-

nected and transformed into a multi-processor design. Note that in Figure

4.5A, the solid lines represent constant values of X, the dashed lines represent

constant values of Y, and the dotted lines represent constant values of W. Now,

if we simply replace the I coordinate axis by a time coordinate axis and allow

the J coordinate to become the spatial coordinate of the systolic array, we obtain

the design shown in Figure 4.5B. This design requires three processors (the

number of distinct J coordinates) and the values of Y remain stationary in the

PEs because they are constant over time. The values of W move from top to








bottom through the array and the values of X move from bottom to top at twice

the frequency as the values of W.
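The substitution just described can be carried out mechanically once the index space is written down. The Python sketch below does so for one plausible reading of the example (it assumes the convolution y(j) = sum_i w(i)·x(j-i) with three taps and x(n) = 0 for n < 0; the exact formulation in [CAP83] may differ), mapping the I axis to time and the J axis to a processor number and printing which product each of the three PEs evaluates at each step. The actual data movement of Figure 4.5B (W flowing down, X flowing up) is not modeled.

# Sketch: projecting the convolution index space onto (time, processor).
# Assumed form (illustration only): y(j) = sum_i w(i) * x(j - i), with
# len(w) = len(x) = len(y) = 3 and x(n) taken as 0 for n < 0.
TAPS = 3

for i in range(TAPS):                    # I axis of Figure 4.5A -> time
    work = []
    for j in range(TAPS):                # J axis -> processor number
        if j - i >= 0:                   # the needed x sample exists
            work.append(f"PE{j}: y[{j}] += w[{i}]*x[{j - i}]")
    print(f"t={i}: " + "; ".join(work))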

Geometric methods usually consist of some form of a directed graph which

is derived from an execution trace of the algorithm. The nodes most often rep-

resent points in the index space of the algorithm and the directed graph is

translated, rotated, or projected into a graph that represents the systolic algo-

rithm. LISP is well suited for this type of mapping because graphs are easily

represented in LISP via pointers and lists [HUA86], [QUI84]. The main

problem with geometric techniques is that they are often difficult to implement

via computer for higher dimension problems. The algebraic techniques de-

scribed below can be more easily implemented on computers and are more effi-

cient.


4.3 Algebraic Mapping Techniques


The major operations of signal processing and linear algebra applications

can be modeled in matrix-matrix, matrix-vector, or sum-of-products

(SOP) form. In a uni-processor environment, the implementation is highly reli-

ant on nesting computational loops. For example, the product of two NxN matri-

ces, A and B, producing C = A*B, is given by

for i = 1 to N
  for j = 1 to N
    for k = 1 to N
      S1 : C[i,j] = C[i,j] + A[i,k]*B[k,j]
    end k
  end j
end i                                                                        (4.1)
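For reference, a directly equivalent and runnable Python rendering of (4.1) is given below; an explicit zero initialization of C is added only so that the fragment runs standalone.

# Sequential O(N^3) matrix-matrix multiply, the uni-processor form of (4.1).
def matmul(A, B):
    N = len(A)
    C = [[0] * N for _ in range(N)]      # explicit initialization (added here)
    for i in range(N):
        for j in range(N):
            for k in range(N):
                C[i][j] = C[i][j] + A[i][k] * B[k][j]   # statement S1
    return C

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
print(matmul(A, B))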

Here, the initialization of C[i,j] is ignored. For future reference, S1 will be

called a statement and the three tuple v = (i,j,k) will be called an index tuple.

The presented sequential algorithm is said to run in O(N^3) time over its three








independent time indices. To accelerate this process, a multi-processor systolic

array can be used. Algebraic mapping techniques are similar to the geometric

techniques (Section 4.2) in that they both normally extract the data dependen-

cies of the algorithm first and then base the systolic implementation on this in-

formation [MOL83], [LI85], [MOL82], [FOR85], [MOL86], [SOU86], and

[DOW87]. To use the algebraic mapping technique, the following information

is required:

1) the algorithm index set,

2) the computations performed at each index point,

3) the data dependencies which ultimately dictate the algorithm communi-

cation requirements, and

4) the algorithm I/O variables.

After extracting the above information from an algorithm, a transformation

function can then be found to map the algorithm onto the systolic array. The

transformation function transforms the index space of the algorithm to the sys-

tolic array space while preserving the same computations. In the following sec-

tions, we use the symbol I to denote the set of non-negative integers and Z to

denote the set of all integers. The nth cartesian powers of I and Z are denoted

as I^n and Z^n, respectively.


4.3.1 Algorithm Model


In order to map an algorithm onto a systolic array, it is convenient to de-

fine an algorithm model. An algorithm G over an algebraic structure A is a 5-

tuple G = (J^n, C, D, X, Y) where

J^n is the finite index set of G, J^n ⊆ I^n;

C is a set of triples (v, yg, yu) where v ∈ J^n, yg is a variable, and yu is a

term built from operations of A and variables ranging over A. We








call yg the variable generated at v, and any variable appearing in yu is

called a used variable;

D is a set of triples (v, y, d) where v ∈ J^n, y is a variable, and d is an

element of Z^n;

X is the set of input variables of G;

Y is the set of output variables of G.

There are three types of dependencies in D:

(i) Input dependence: (v,y,d) is an input dependence if y ∈ X and y is an

operand of yu in computation (v, yg, yu); by definition d = 0.

(ii) Self-dependence: (v,y,d) is a self-dependence if y is one of the oper-

ands of yu in computation (v, yg, yu); by definition d = 0.

(iii) Internal dependence: (v,y,d) is an internal dependence if y, the output

of a statement S1 at some index tuple v*, is supplied to a statement

at some index tuple v; by definition d = v - v* (v, v* ∈ J^n).

It is convenient to represent dependencies in D as a matrix D = [D^0 D'] where D^0
is a submatrix of D containing all input and self-dependencies, and D' is the
submatrix of internal dependencies. Every column of D is the last element of
the triple (v, y, d) and is labeled d_y. If dependencies are valid at almost every
point of the index set, the labels of the columns of D need not be shown and,

for practical purposes, self-dependencies can be ignored. For example, the de-

pendencies of A and B of the above matrix-matrix multiply example can be

taken as input dependencies, resulting in the architecture shown in Figure 4.6.

In order to reduce the I/O pins, this architecture uses either bus-connection or

global communication. On the other hand, the sequential matrix-matrix multi-

ply example can be interpreted as a systolic process by defining statement

S1 using the following conventions:








FIGURE 4.6
Global Bus-Connection Architecture

(1) define the indices explicitly displayed in the sequential code to be ex-

plicit variables,

(2) define the indices not explicitly displayed to be implicit (e.g., C[i,j] is

an explicit function of (i,j) and an implicit function of k),

(3) rewrite the statements of the sequential code using explicit variables as

subscripts and implicit variables as superscripts,

(4) apply rule (3) to the left-hand side of the assignment statement di-
rectly but decrease the implicit indices by one on the right-hand side.

This establishes that the right-hand side data sets were defined in the

previous systolic cycle (precedence relation). For example,
    C[i,j]^k = C[i,j]^(k-1) + A[i,k]^(j-1) * B[k,j]^(i-1)

    constant:  B[k,j]^i = B[k,j]^(i-1)      (propagated unchanged from the previous iteration)

    constant:  A[i,k]^j = A[i,k]^(j-1)      (propagated unchanged from the previous iteration)     (4.2)

The index set for this example is J^3 = { (i, j, k) : 1 ≤ i, j, k ≤ N }. The dependence


vectors are

    d1 = [0, 0, 1]^t     from { C[i,j]^k = f( C[i,j]^(k-1) ) }

    d2 = [0, 1, 0]^t     from { A[i,k]^j = f( A[i,k]^(j-1) ) }

    d3 = [1, 0, 0]^t     from { B[k,j]^i = f( B[k,j]^(i-1) ) }

    (t = transpose)                                                          (4.3)








which produces a dependence matrix D (where the positional dependency of the

A, B, C and i, j, k data is explicitly exhibited for clarity) given by

                            C  A  B
                          [ 0  0  1 ]  i
    D = [d1, d2, d3]  =   [ 0  1  0 ]  j
                          [ 1  0  0 ]  k                                     (4.4)
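These relations can be checked mechanically: each dependence vector is the difference between the index point at which a value is generated and the index point at which it is used. A minimal Python sketch of that check is shown below; it reproduces the columns of (4.4).

# Dependence vectors of the recurrence (4.2): generated index minus used index.
def dep(generated, used):
    return tuple(g - u for g, u in zip(generated, used))

i, j, k = 2, 2, 2                      # any interior index point
d_C = dep((i, j, k), (i, j, k - 1))    # C[i,j]^k uses C[i,j]^(k-1)
d_A = dep((i, j, k), (i, j - 1, k))    # A[i,k]^j uses A[i,k]^(j-1)
d_B = dep((i, j, k), (i - 1, j, k))    # B[k,j]^i uses B[k,j]^(i-1)

# Dependence matrix D = [d1 d2 d3], columns ordered C, A, B as in (4.4).
D = [[d[r] for d in (d_C, d_A, d_B)] for r in range(3)]
print(D)   # [[0, 0, 1], [0, 1, 0], [1, 0, 0]]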
So far, the algorithm model is only a static model which does not indicate the

execution order on the index set. Though the index points in the above exam-

ple are ordered in lexicographical order, this is an artificial ordering and can be

modified for the purpose of extracting parallelism. Next, the execution order

must be adjusted to guarantee the correctness of computation. The execution of

an algorithm G = (J^n, C, D, X, Y) is described by

1) the specification of a partial ordering α on J^n (called the execution order-

ing) such that for all (v, y, d) ∈ D' we have 0 α d (i.e., d larger

than zero in the sense of α);

2) the execution rule: until all computations in C have been performed,

execute (v, yg, yu) for all v for which the preceding computations (v*, yg, yu), v* α v, have terminated.

In the following text, if d=v-v* > 0, it means that computations indexed by v*

must be performed prior to those indexed by v. Mapping the algorithm onto the
systolic array can then be viewed as changing the features of an algorithm while

preserving its equivalence in computations. That is, two algorithms

G = (J^n, C, D, X, Y) and G* = (Ĵ^n, C, D*, X, Y) are said to be equivalent if and only
if

1) algorithm G* is input-output equivalent to G;

2) the index set of G* is the transformed index set of G: Ĵ^n = T(J^n)

where T is a bijective function;

3) to any operation of G there corresponds an identical operation in G*

and vice versa;








4) the dependencies of G* are the transformed dependencies of G: D* = T(D).

In the algebraic mapping method, a transfer function T must be found such

that the first coordinate of the transformed index set presents the execution or-

der, leaving the rest of the index coordinates to be selected by the algorithm de-

signer to meet specific systolic array communication requirements. The follow-

ing section presents how a transformation, T, can be selected such that the

transformed algorithm can be mapped onto a systolic array.


4.3.2 Transformation Function

A mapping which transforms an algorithm G into an algorithm G* is de-

fined as:

    T = [ π ]
        [ S ]                                                                (4.5)

where the mappings π and S are defined as π : J^n → Ĵ^1 and S : J^n → Ĵ^(n-1).

Here, only a linear transformation T (where T ∈ Z^(n×n)) is considered. Thus,

for a given dependence matrix D, there exists a non-singular matrix T which

maps the sequential algorithm into concurrent processes. The transformation T

can therefore be more precisely given as

        [ π ]     J^n → Ĵ^1
    T = [---]
        [ S ]     J^n → Ĵ^(n-1)                                              (4.6)

The mapping T will produce a new array dependence matrix D* = TD. The first

row of T, denoted π, produces timing information about the array process (the
transformed algorithm). It is again required that π·di > 0 for all di ∈ D so as to
ensure that the precedence rule is preserved. The resulting first row of D* contains
the delay information existing within the systolic array. Thus, a computation
indexed by v ∈ J^n in the original algorithm will be processed at time
t = π·v. Moreover, the total running time of the new algorithm is usually







T = max(π·v) - min(π·v) + 1. This assumes a unitary time increment. In general,
the time increment may not be unitary; it is given by the smallest transformed
dependence (i.e., the minimum π·di). Therefore, the execution time is the
number of hyperplanes π sweeping the index space J^n and is given by the ratio

    T_TOTAL = { max[ π·(v1 - v2) ] + 1 } / min[ π·di ]                       (4.7)

for any v1, v2 ∈ J^n and di ∈ D.
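As a concrete check of (4.7), the following fragment evaluates the formula for the 3x3 matrix multiply with the schedule vector π = [1, 1, 1] that is used later in Section 4.3.3; the seven time steps it reports agree with the 3N - 2 result obtained there.

from itertools import product

# Execution time by (4.7) for the 3x3 matrix multiply with pi = [1, 1, 1].
N = 3
pi = (1, 1, 1)
index_set = list(product(range(1, N + 1), repeat=3))
D_cols = [(0, 0, 1), (0, 1, 0), (1, 0, 0)]          # dependence vectors d1, d2, d3

dot = lambda a, b: sum(x * y for x, y in zip(a, b))

max_span = max(dot(pi, tuple(a - b for a, b in zip(v1, v2)))
               for v1 in index_set for v2 in index_set)
t_total = (max_span + 1) // min(dot(pi, d) for d in D_cols)
print(t_total)   # 7 time steps for N = 3, i.e. 3N - 2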

Finally, the second transformation submatrix S captures the spatial coordi-

nate information. The transformation submatrix S should be selected such that

the transformed dependencies are mapped onto a systolic array modeled as

(Ĵ^(n-1), P) where Ĵ^(n-1) ⊆ Z^(n-1) is the spatial index set of the systolic array and

P ∈ Z^((n-1)×r) is a matrix of interconnection primitives defined as

    P = [P1, P2, ..., Pr]                                                    (4.8)

where Pj is a column vector indicating a unique direction of a communication

link and r is the number of communication paths between PEs (three examples

of P are shown in Figure 4.7). In other words, the selected transformation sub-

matrix S should satisfy

SD = PK (4.9)

where matrix K indicates the utilization of primitive interconnections in matrix

P. Matrix K = [kji] is such that

    kji ≥ 0                                                                  (4.10)

    Σj kji ≤ π·di                                                            (4.11)
Equation 4.10 simply requires that all elements of the K matrix are nonnegative

and Equation 4.11 requires that communication of data associated with depend-

ence di must be performed using some primitives Pj exactly Σj kji times.







[Figure: three example interconnection-primitive matrices P for systolic arrays: (A) a two-dimensional array with nine primitives (no move, the four nearest neighbors, and the four diagonal neighbors); (B) an array with four primitives; (C) an array with three primitives, P = [0 1 0; 0 0 1], used in the example below.]


FIGURE 4.7
Interconnection Matrix For A Systolic Array







For example, let us select the transfer function


        [ 1  1  1 ]
    T = [ 0  1  0 ]
        [ 0  0  1 ]                                                          (4.12)
for the above matrix-matrix multiplication with dependence matrix

                            C  A  B
                          [ 0  0  1 ]  i
    D = [d1, d2, d3]  =   [ 0  1  0 ]  j
                          [ 1  0  0 ]  k                                     (4.13)

and assume

          P1 P2 P3
    P = [ 0  1  0 ]
        [ 0  0  1 ]                                                          (4.14)

as shown in Figure 4.7.c. Then,

    πD = [1, 1, 1]                                                           (4.15)
and

           d1 d2 d3                            [ 0  0  1 ]
    SD = [ 0  1  0 ] = PK ,   with   K    =    [ 0  1  0 ]
         [ 1  0  0 ]                           [ 1  0  0 ]                   (4.16)

This means that the implementation of data communication associated with de-

pendence d1 requires only one primitive P3, and likewise, only one primitive P2

for d2, and only one primitive P1 for d3.
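The selection above can be verified mechanically. The sketch below recomputes πD and SD for the T and P of (4.12) and (4.14) and exhibits a nonnegative K with PK = SD, in agreement with (4.15) and (4.16).

# Verify pi*D, S*D and the factorization S*D = P*K for (4.12)-(4.16).
def matmul(X, Y):
    return [[sum(X[r][t] * Y[t][c] for t in range(len(Y)))
             for c in range(len(Y[0]))] for r in range(len(X))]

D  = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]        # columns: C, A, B   (4.13)
pi = [[1, 1, 1]]                              # first row of T     (4.12)
S  = [[0, 1, 0], [0, 0, 1]]                   # last two rows of T (4.12)
P  = [[0, 1, 0], [0, 0, 1]]                   # primitives P1..P3  (4.14)
K  = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]        # one primitive per dependence

print(matmul(pi, D))                          # [[1, 1, 1]]  -> eq. (4.15)
print(matmul(S, D) == matmul(P, K))           # True         -> eq. (4.16)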

It is possible that some interconnection primitives will not be used. These
correspond to rows of matrix K with zero elements. Most often, many transfor-
mation submatrices S can be found, and each transformation leads to a differ-

ent array. After selecting the transformation S, the corresponding (n-1)×n sub-

matrix of D* displays the systolic interprocessor connections or communication

paths. As a result, the specification of T provides the designer with a wealth of







information about the physical array, its wiring (communication), and data flow.
In such a way, the velocity vector of the variable y is determined from

    T·d_y = [ t_y ]
            [ s̄_y ]  ,        V̄_y = s̄_y / t_y                                (4.17)

and the distribution vector of y can be derived from

    T^(-1) · [ t ]  =  v
             [ V̄ ]                                                           (4.18)
All these systolic array/algorithm synergism processes are summarized in Figure
4.8.


4.3.3 Example

For the matrix-matrix multiply example, the sequential algorithm runs in

O(N^3) time. By committing N^2 processors to this problem, it would be ex-
pected that the execution time would be reduced N^2-fold. As shown in Section
4.3.1, the dependence vectors are

    d1 = [0, 0, 1]^t
    d2 = [0, 1, 0]^t
    d3 = [1, 0, 0]^t                                                         (4.19)

which produces a dependence matrix D given by

                            C  A  B
                          [ 0  0  1 ]  i
    D = [d1, d2, d3]  =   [ 0  1  0 ]  j
                          [ 1  0  0 ]  k                                     (4.20)
Now, a non-singular transformation matrix T must be found which maps each col-
umn vector of the dependence matrix to a vector whose first element is greater
than zero. For this example, the D matrix is a 3x3 exchange matrix [GOL83],
so the π of T can be chosen as π = [1, 1, 1], which makes the first row of

the transformed dependence matrix D* to be [1, 1, 1] (i.e., all elements are








[Figure: the transformation T maps the algorithm space (the nested loops for i, for j, for k, ... with algorithm index set v = (i, j, k, ...)) onto the array space (array time t = π·v and array coordinates V̄ = S·v = (J1, J2, ...)) for parallel execution. The resulting information:

    Algorithm index dependence                            D
    Algorithm-to-array mapping                            T
    Array interprocessor dependence                       D* = T·D
    Start time                                            T0 = min(π·v)
    Stop time                                             Tf = max(π·v)
    Computational latency                                 max(π·v) - min(π·v) + 1 = Tf - T0 + 1
    Array location of the algorithm index set             T·v = [ time ; coordinate ] = [ π·v ; S·v ]
    Algorithm activity at clock t and coordinate V̄        T^(-1)·[ t ; V̄ ] = v ]

FIGURE 4.8
Systolic Array/Algorithm Synergism



greater than zero). There are numerous choices of S which all satisfy the re-
quirements SD = PK. Different S's will result in distinct communication paths
and variant data flows. For instance

    S1 = [ 0  1  0 ]
         [ 0  0  1 ]                                                         (4.21)

it follows

                                                C  A  B
         [ 1  1  1 ]                          [ 1  1  1 ]  t
    T1 = [ 0  1  0 ]    and    D* = T1·D  =   [ 0  1  0 ]  J1
         [ 0  0  1 ]                          [ 1  0  0 ]  J2                (4.22)

The first row of D* gives the delay (in time units) associated with each dependence. The sec-
ond two rows indicate the variables' direction of motion in the systolic array. In

this implementation, the first column of D* is

      C
    [ 1 ]  t
    [ 0 ]  J1
    [ 1 ]  J2                                                                (4.23)

which means the C data set will move in the positive J2 direction (one PE per
unit cycle). The second and third columns indicate, respectively, that the A data

set will move in the positive J1 direction (one PE per unit cycle) and the B data
set will remain in-place in the PEs. The start-execution time and stop-execu-
tion time calculations, denoted by T0 and Tf, are given by

    T0 = min[π·v] = π·[1, 1, 1]^t = 3                                        (4.24)

and

    Tf = max[π·v] = π·[N, N, N]^t = 3N                                       (4.25)

The total execution time is then

    T_TOTAL = Tf - T0 + 1 = 3N - 2                                           (4.26)

and the algorithm will run in O(N) time, as expected.
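Following (4.17), the per-cycle motion of each variable can also be read off by transforming its dependence vector with T1. The short sketch below does so and reproduces the directions stated above: C advances one PE per cycle in J2, A advances one PE per cycle in J1, and B stays in place.

# Velocity of each variable: T1 * d_y gives (delay, movement in J1, movement in J2).
T1 = [[1, 1, 1], [0, 1, 0], [0, 0, 1]]
deps = {"C": (0, 0, 1), "A": (0, 1, 0), "B": (1, 0, 0)}

for name, d in deps.items():
    t_d = [sum(T1[r][c] * d[c] for c in range(3)) for r in range(3)]
    delay, dj1, dj2 = t_d
    print(f"{name}: delay {delay}, moves ({dj1/delay:+.0f}, {dj2/delay:+.0f}) in (J1, J2) per cycle")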
Once the transformation matrix, T, is found, the index set can be trans-
formed by using








    T·v = v̂                                                                  (4.27)

where v is one index vector from the index set consisting of values (i,j,k) and
the transformed index vector tells the time and location where the computation
associated with the index vector is executed. Assume N = 3 for this example,
then from

        [ i ]   [ i + j + k ]   [ t  ]
    T · [ j ] = [     j     ] = [ J1 ] ,     1 ≤ i, j, k ≤ 3                 (4.28)
        [ k ]   [     k     ]   [ J2 ]

we know that the ranges of the time and spatial indices are 3 ≤ t ≤ 9, 1 ≤ J1 ≤ 3, and

1 ≤ J2 ≤ 3. These provide the following information: (1) the first computation
occurs at t = 3 and the last computation at t = 9; the total execution time is 9 - 3 +
1 = 7, and (2) a total of nine PEs are required for this implementation and they
are numbered by { (J1, J2) | 1 ≤ J1, J2 ≤ 3 }. The computation at the index values

(i,j,k) (e.g. (1,2,3)), corresponding to calculation C[1,2] = C[1,2] + A[1,3] *
B[3,2], is transformed as follows
    [ 1  1  1 ] [ 1 ]   [ 6 ]
    [ 0  1  0 ] [ 2 ] = [ 2 ]
    [ 0  0  1 ] [ 3 ]   [ 3 ]                                                (4.29)
This indicates that the above computation will be executed at cycle 6, at proces-
sor [2,3]. Because the transformation matrix, T, is non-singular, we can always
find its inverse transformation matrix, T^(-1). For this instance, it is

             [ 1  -1  -1 ]
    T^(-1) = [ 0   1   0 ]
             [ 0   0   1 ]                                                   (4.30)
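Both directions of the mapping can be exercised numerically. The fragment below applies T to every index point for N = 3 (confirming the range 3 ≤ t ≤ 9 and the nine distinct processor coordinates), maps the sample point (1,2,3) to cycle 6 and processor (2,3) as in (4.29), and recovers the original index with the inverse matrix of (4.30).

from itertools import product

T     = [[1, 1, 1], [0, 1, 0], [0, 0, 1]]
T_inv = [[1, -1, -1], [0, 1, 0], [0, 0, 1]]

def apply(M, v):
    return tuple(sum(M[r][c] * v[c] for c in range(3)) for r in range(3))

points = [apply(T, v) for v in product((1, 2, 3), repeat=3)]
print(min(p[0] for p in points), max(p[0] for p in points))   # 3 9
print(len({(p[1], p[2]) for p in points}))                    # 9 processors

print(apply(T, (1, 2, 3)))           # (6, 2, 3): cycle 6, processor (2, 3)
print(apply(T_inv, (6, 2, 3)))       # (1, 2, 3): recovered algorithm index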
The algorithm index of activity happening at clock t and spatial coordinates J1

and J2 can then be derived in the reverse order using the inverse transformation
matrix T^(-1), as follows



