Permanent Link: http://ufdc.ufl.edu/UF00082386/00001
 Material Information
Title: A fault tolerant GEQRNS processing element for linear systolic array DSP applications
Physical Description: x, 120 leaves : ill., photos ; 29 cm.
Language: English
Creator: Smith, Jeremy C., 1966-
Publication Date: 1994
 Subjects
Subject: Systolic array circuits   ( lcsh )
Array processors   ( lcsh )
Electrical Engineering thesis Ph. D
Dissertations, Academic -- Electrical Engineering -- UF
Genre: bibliography   ( marcgt )
non-fiction   ( marcgt )
 Notes
Thesis: Thesis (Ph. D.)--University of Florida, 1994.
Bibliography: Includes bibliographical references (leaves 117-119).
Statement of Responsibility: by Jeremy C. Smith.
General Note: Typescript.
General Note: Vita.
 Record Information
Bibliographic ID: UF00082386
Volume ID: VID00001
Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.
Resource Identifier: aleph - 002007920
oclc - 32490064
notis - AKJ5193











A FAULT TOLERANT GEQRNS PROCESSING ELEMENT
FOR LINEAR SYSTOLIC ARRAY DSP APPLICATIONS











By

JEREMY C. SMITH


A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY


UNIVERSITY OF FLORIDA


1994





























Copyright 1994


by


Jeremy C. Smith














ACKNOWLEDGEMENTS


I would like to thank my advisor Dr. Fred J. Taylor for providing me with the

means and the environment necessary for getting this work done, and for allowing

me the freedom to pursue the research directions I saw fit. Special thanks also go to

Dr. Graham A. Jullien for serving on my Ph.D. committee from so great a distance.

I would also like to thank Dr. Mark E. Law, Dr. Jose C. Principe and Dr.

Bernard A. Mair for serving on my committee.

I would also like to thank my parents, whose early preparation and dedication

have made my accomplishments possible.

I would, especially, like to thank Diane for her patience, dedication and love,

which were all necessary ingredients for the completion of this dissertation.

















TABLE OF CONTENTS


ACKNOWLEDGEMENTS ............................ iii

LIST OF TABLES ..................... ............ vi

LIST OF FIGURES .................... ............ vii

ABSTRACT .................................... x

CHAPTERS

1 INTRODUCTION ................... .......... 1


2 THEORY ........................................... 4

   2.1 The Quadratic Residue Number System (QRNS) .... 4
   2.2 The Galois-Enhanced QRNS (GEQRNS) ............. 6
   2.3 Dynamic Range ................................. 7
   2.4 Example ....................................... 9

3 KEY ARRAY PROCESSOR IMPLEMENTATION ISSUES ........ 15

   3.1 Synchronization ............................... 15
       3.1.1 Traditional Approaches .................. 16
       3.1.2 True Single Phase Clocked Systems ....... 24
   3.2 Synchronous vs. Asynchronous Systems .......... 29
   3.3 Fundamental Manufacturing Limitations ......... 33
       3.3.1 Defect Size Distribution ................ 34
       3.3.2 Defect Spatial Distribution and Yield Models 38

4 SYSTEM ARCHITECTURE .............................. 46

   4.1 Architectural Overview ........................ 46
   4.2 Multiply Accumulate PE Architecture ........... 46
       4.2.1 Modulo-P Adders ......................... 50
   4.3 Forward Mapping: Integer to GEQRNS ............ 52
   4.4 Inverse Mapping: QRNS to Residue .............. 54
   4.5 Chinese Remainder Theorem ..................... 54

5 VLSI IMPLEMENTATION OF PROCESSING ELEMENT ........ 57

   5.1 True Single Phase Clocking Scheme ............. 57









5.1.1 Pipeline Registers . . . . . . . ... . . 58
5.1.2 Data Storage Shift Register . . . . . . . ... 59
5.2 Exponentiation ROM ....................... 60
5.3 Electronic Reconfiguration Switches . . . . . . ... 65
5.4 PE Performance ................... ........ 67
5.5 Early Versions of the Processing element . . . . . .... 73
5.5.1 Version One .......................... 73
5.5.2 Version Two ......................... 74

6 YIELD ENHANCEMENT AND FAULT TOLERANCE ........ 79

6.1 Yield Enhancement via Reconfiguration . . . . . ... 79
6.1.1 Yield Estimates ....................... 82
6.2 A Comparison with Replacing Moduli . . . . . . ... 87
6.3 Detecting Faults ................... ........ 90

7 CONCLUSIONS AND FUTURE WORK . . . . . . ... 95


APPENDICES

A VLSI CELL LAYOUTS ............... .. ..... 100


B OBSERVED CHIP DATA ........................ 109


C COMPUTER PROGRAMS ........................ 110


REFERENCES ...................... ............. 117

BIOGRAPHICAL SKETCH .................. ......... 120















LIST OF TABLES


2.1 Table of Maximum Dynamic Ranges for Eight to Four Modulus System. 9

2.2 Table of Maximum Inner Product Lengths . . . . . . . 9

2.3 Log-Antilog Table for p_1 = 5 ..................... 11

2.4 Log-Antilog Table for p_2 = 13 .................... 11


4.1 Full Adder truth table. ............ .... ....... ... 51


6.1 Table of System Component Areas. . . . . . . . . ... 83

6.2 Table of Non-Redundant Chip Yields. . . . . . . . ... 83

6.3 Table of Redundant Chip Yields. . . . . . . . ..... 85















LIST OF FIGURES


3.1 Single Phase Latch System ................... ..... 17

3.2 Double-latch non-overlapping pseudo two phase system . . ... 19

3.3 High performance non-overlapping pseudo two phase system . . 20

3.4 Delay transformation model ....................... 21

3.5 Transparency problem introduced by complementary phase of clock. 23

3.6 System model for TSPC edge based clocking. . . . . . ... 25

3.7 Timing waveforms for TSPC edge based clocking. . . . . .... 26

3.8 System model for TSPC latch based clocking. . . . . . ... 27

3.9 Timing waveforms for TSPC latch based clocking. . . . . .... 28

3.10 Potential skew hazard with TSPC scheme [44] . . . . . . 30

3.11 Clocking against the data flow to exploit skew [44]. . . . ... 30

3.12 Non-local communication problem solution [44]. . . . . ... 31

3.13 SEM photographs of defects from early PE fabrications. . . ... 35

3.14 Defect density vs. defect radius . . . . . . . ..... . 36

3.15 Critical area for Parallel Conductors. . . . . . . . ... 38


4.1 System Architecture ........................... 47

4.2 Processor Architecture . . . . . . . ..... ......... 49

4.3 Standard Modulo P Architecture . . . . . . . ..... 51

4.4 Modulo-P Adder Building Block Primitives. . . . . . ... 52

4.5 Carry Select Modulo Adder. . . . . . . . ..... . . 53

4.6 Forward-mapping (θ) conversion module with GEQRNS log table . 54









4.7 Inverse-mapping (θ^{-1}) conversion module. . . . . . .... 55

4.8 CRT block diagram. .......................... .. 56


5.1 TSPC Pipeline Register ......................... 58

5.2 SPICE simulation of fast transition path for shift register ...... ..61

5.3 TSPC Shift Register Cell with Storage . . . . . . ..... 62

5.4 Floorplan of Exponentiation ROM. . . . . . . . .... 63

5.5 Key ROM Circuit Elements ......................... 63

5.6 SPICE simulation of ROM operation. . . . . . . . ... 66

5.7 Die photograph of processor. . . . . . . . ..... . . 69

5.8 Oscilloscope photo of clock signal and output bit zero. ... . .... 70

5.9 Pass-through test ................... ......... 71

5.10 Oscilloscope photo of pass-through test output. . . . . ... 72

5.11 SPICE simulation of pass-through test. . . . . . . . ... 72

5.12 Processor architecture of first version. . . . . . . . ... 75

5.13 Die photograph of first version of PE. . . . . . . . ... 76

5.14 Oscilloscope photo of non-overlapping clocks for first chip. . . 77

5.15 Processor architecture of second version. . . . . . . ... 77

5.16 Die photograph of second version of PE. . . . . . . ... 78


6.1 Yield Curves for Various Length Arrays (Scheme 1). . . . . 86

6.2 Yield Curves for Various Length Arrays (Scheme 2). . . . ... 91

6.3 System Areas for Scheme 1 (lower) and Scheme 2 (higher). . . 91

6.4 Adjusted Yield Curves For Scheme 1 (Best Case). . . . ... 92

6.5 Adjusted Yield Curves For Scheme 1 (Worst Case). . . . ... 92

6.6 Adjusted Yield Curves For Scheme 2 (Best Case). . . . ... 93

6.7 Adjusted Yield Curves For Scheme 2 (Worst Case). . . . .... 93


A.1 ROM Word-line Decoder Cell . . . . . . . . . . 101









A.2 ROM Sense Amplifier ................... ........ 102

A.3 ROM Programming Matrix ....................... 103

A.4 Mod P Building Block (Zero) ....................... 104

A.5 Mod P Building Block (One) ............. .......... 105

A.6 DSSR Cell ................. ............... 106

A.7 Pipeline Register with Direction Logic . . . . . . . ... 107

A.8 Reconfigurable Switch Element . . . . . . . ..... . 108















Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy




A FAULT TOLERANT GEQRNS PROCESSING ELEMENT
FOR LINEAR SYSTOLIC ARRAY DSP APPLICATIONS

By

Jeremy C. Smith

August 1994


Chairman: Dr. Fred J. Taylor
Major Department: Electrical Engineering

In this work the design of a Galois Enhanced Quadratic Residue Number

System (GEQRNS) processor is presented, which can be used to construct linear

systolic arrays. The processor architecture has been optimized to perform multiply-

accumulate type operations on complex operands. The properties of finite fields have

been exploited to perform this complex multiplication in a manner which results

in greatly reduced hardware complexity. The processor is also shown to have a

high degree of tolerance to manufacturing defects and faults which can occur during

operation. The combination of these two factors makes this an ideal candidate for

array signal processing applications, where high complex arithmetic data rates are

required. A prototype processing element has been fabricated in 1.5 μm CMOS

technology, which is shown to operate at 40 MHz.















CHAPTER 1
INTRODUCTION

Arithmetic bandwidth continues to remain the principal limitation in high

speed Digital Signal Processing (DSP) applications. In the past, systolic arrays have

been proposed as a means to achieve high computational throughput for compute

bound applications [14]. The highly modular nature of systolic arrays makes them

attractive for larger than conventional levels of integration such as Ultra Large Scale

Integration (ULSI), and Wafer Scale Integration (WSI). In an array processor, each

building-block cell or Processing Element (PE) performs some basic arithmetic op-

eration on data which arrives at its inputs. The data flow is in some predetermined

and regular ordering, so that operands arrive at the correct processor at the correct

time. The architecture of each PE is highly optimized and is then repeated over a

large silicon area.

In order to realize large monolithic arrays of processing elements, it is necessary

to cope with the fact that many of the processors in an arbitrary array will have some

fatal defect at the time of manufacture. Additionally, as chip transistor count and

scaling density increases, so does the probability of operational failure, due to bias-

related physical phenomena [8].

In this work the design of a Residue Number System (RNS) processor is pre-

sented, which is used to construct a linear systolic array. The processor architec-

ture has been optimized to perform multiply-accumulate type operations on complex

operands. The properties of finite fields have been exploited to perform this complex

multiplication in a manner which results in greatly reduced hardware complexity.









The processor is also shown to have a high degree of tolerance to manufacturing de-

fects and faults which occur during operation. The combination of these two factors

makes this an ideal candidate for signal processing applications, where high complex

arithmetic data rates are required.

In Chapter Two, an introduction to the theory leading up to the processor archi-

tecture is presented. It will be shown how complex multiplication can be performed in

two non-communicating parallel channels via the Quadratic Residue Number System

(QRNS) mapping. Furthermore, the mapping may be taken to one more level, where

the actual multiplication is performed as a sum of two number-theoretic exponents.

Next, some bounds are given on the maximum length of a complex inner product

that can be computed, within the given dynamic range of the system. Finally, the

chapter concludes with an illustrative numerical example of the mapping techniques.

Some key implementation issues which relate to the design of large area in-

tegrated circuits (ICs) are presented in Chapter Three. In a broad sense, these are

synchronization and manufacturing yield. The first section of Chapter Three de-

scribes the clocking techniques that have been historically used in integrated circuits

and some of their limitations. The section concludes with the introduction of the

true single phase clocked (TSPC) technique, which represents the latest development

in synchronous system design. The second section of Chapter Three presents the

key issues of IC manufacturing. A physical notion of manufacturing defects will be

developed along with some models used to predict integrated circuit yield.

An overview of the proposed system will be presented in Chapter Four, along

with the architectural design of the processing element. The design tradeoffs which

resulted in the final PE implementation are discussed in detail. Finally, the architec-

ture of the support modules necessary to perform the forward and reverse mappings

of the input and output data are presented. It is shown that these conversion elements

can be implemented exclusively with the key modules of the PE.









The VLSI design of the PE is presented in Chapter Five. The chapter focuses

heavily on the transistor-level design of the PE. The internal details of each major

module of the PE are presented along with computer simulation of their behavior.

Some results of the testing of fabricated chips are then presented. It is shown that

the PE is capable of maintaining a very high data rate, which is due, ultimately, to

the aggressive design techniques used for its electronics. Finally, a discussion of the

earlier versions of the PE is presented, which detail the evolution and optimization

of the current architecture.

An analysis of the fault-tolerant properties of the proposed system is given in

Chapter Six. It is shown that the yield of a large-area sixteen processor array can

be significantly increased by the chosen redundancy scheme. This scheme is then

compared to one which is traditionally used for RNS systems and is shown to be far

more efficient and beneficial.

Finally, Chapter Seven summarizes the accomplishments of this dissertation.

Some concluding remarks and future directions which this work might take, are also

presented.















CHAPTER 2
THEORY

2.1 The Quadratic Residue Number System (QRNS)

The Residue Numbering System (RNS) has long been proposed as a means

of achieving high-computational bandwidths in signal processing systems [34, 26].

The RNS gets its speed advantage because computations over a large base ring can

be implemented over smaller computation rings, due to an isomorphism between

elements in the base ring and the direct sum of the computation rings. An integer

X in the RNS is represented by an L-tuple of residues:


X = (x_1, x_2, ..., x_L)                                    (2.1)

where x_i = <X>_{m_i} is the ith residue and m_i is the ith modulus. The

production rule for the ith digit in a RNS computation is


z_i = <x_i + y_i>_{m_i}
                                                            (2.2)
z_i = <x_i x y_i>_{m_i}

which represents modular addition and multiplication, respectively. The im-

portance of Equation 2.2 is that the computation of any digit in the L-tuple is in-

dependent of any other digit. This means that there are no carries between the

residue channels. If the residue channels are of small wordwidth, then high compu-

tation rates can be achieved in physical systems. This is the central theme in RNS

implementations.
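The digit-independence of Equation 2.2 can be modeled in a few lines of Python. This is an illustrative sketch only (the moduli and helper names are our own, not the hardware described in this work):

```python
# Illustrative model of carry-free, digit-wise RNS arithmetic (Eqs. 2.1-2.2).
MODULI = (5, 13)  # pairwise coprime moduli m_1..m_L

def to_rns(x, moduli=MODULI):
    """Map an integer to its L-tuple of residues (Equation 2.1)."""
    return tuple(x % m for m in moduli)

def rns_add(xs, ys, moduli=MODULI):
    """Each digit z_i = <x_i + y_i>_{m_i}; no carries cross channels."""
    return tuple((x + y) % m for x, y, m in zip(xs, ys, moduli))

def rns_mul(xs, ys, moduli=MODULI):
    """Each digit z_i = <x_i * y_i>_{m_i}, again fully parallel."""
    return tuple((x * y) % m for x, y, m in zip(xs, ys, moduli))

# 7 * 9 = 63 stays inside the dynamic range M = 5 * 13 = 65
assert rns_mul(to_rns(7), to_rns(9)) == to_rns(63)
```

Each tuple position is computed from that position alone, which is exactly why the residue channels can run as independent small-wordwidth hardware.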

The rule which maps computations over the implementation rings back to the

base ring is the Chinese Remainder Theorem (CRT). The CRT is given below:










X = < Σ_{i=1}^{L} m̂_i <(m̂_i)^{-1} x x_i>_{m_i} >_M   (mod M)        (2.3)

where M = Π_{i=1}^{L} m_i, and for i, j ∈ {1, 2, 3, ..., L}, gcd(m_i, m_j) = 1 for i ≠ j,
and m̂_i = M/m_i, with <m̂_i x (m̂_i)^{-1}>_{m_i} = 1. A historical complaint about RNS systems is
that the speed gains obtained by the fast parallel channels are lost in the CRT since
it requires a final mod(M) operation across the entire dynamic range. However,
this is less of an issue today as large wordwidth fast binary adders are regularly
demonstrated in the literature [23, 27]. As we will see shortly, smaller binary adders

can be used to make larger modulo adders, with minimal area complexity.
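As an aside, Equation 2.3 is easy to model in software. The following is a sketch with our own helper names, not the hardware CRT discussed later:

```python
# Illustrative CRT reconstruction (Equation 2.3).
from math import prod

def crt(residues, moduli):
    M = prod(moduli)
    total = 0
    for x_i, m_i in zip(residues, moduli):
        m_hat = M // m_i                    # m̂_i = M / m_i
        m_hat_inv = pow(m_hat, -1, m_i)     # <(m̂_i)^{-1}>_{m_i}
        total += m_hat * ((m_hat_inv * x_i) % m_i)
    return total % M                        # final mod(M) reduction

assert crt((63 % 5, 63 % 13), (5, 13)) == 63
```

The final `% M` is the full-dynamic-range reduction that the text identifies as the traditional CRT bottleneck.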
Complex operations in the RNS can be performed by simply emulating con-
ventional architectures with RNS elements (Complex Residue Numbering System
(CRNS)). However, the QRNS first introduced by Leung [17] and later developed by
Krogmeier and Jenkins [12] is a much better way of performing complex operations.
In QRNS real and imaginary components are encoded into two independent quan-
tities, whereby complex operations can be performed independently in two parallel
channels. This requires that the moduli be restricted to primes of the form 4k + 1.
If this is so, the equation


x² + 1 ≡ 0 (mod p)                                          (2.4)

has two solutions in the ring Z_p, denoted by ĵ and ĵ^{-1}, which are additive and
multiplicative inverses of each other. We define a forward mapping
θ: Z_p[j]/(j² + 1) → Z_p x Z_p to be


θ(a + jb) = (z, z*)                                         (2.5)
z = <a + ĵb>_p     z* = <a − ĵb>_p

We will call the z and z* operands the normal and conjugate components,
respectively.
The inverse mapping θ^{-1}: Z_p x Z_p → Z_p[j]/(j² + 1) is given by











θ^{-1}(z, z*) = <2^{-1}(z + z*)>_p + j<2^{-1}ĵ^{-1}(z − z*)>_p       (2.6)

If (z, z*), (w, w*) ∈ Z_p x Z_p, then addition and multiplication operations in
the ring <Z_p x Z_p, +, ·> are given by


(z, z*) + (w, w*) = (z + w, z* + w*)                        (2.7)
(z, z*) · (w, w*) = (zw, z*w*)
Since the z and z* channels are independent, they can be implemented in two
separate channels. Complex arithmetic can thus be performed in two simultaneous
operations, executed in one clock cycle.
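A minimal software model of the θ and θ^{-1} mappings for a single 4k + 1 prime channel might look as follows (illustrative only; the function names are ours):

```python
# Sketch of the QRNS forward/inverse maps (Eqs. 2.4-2.7) for p = 13.
p = 13
j_hat = next(x for x in range(p) if (x * x + 1) % p == 0)  # solves Eq. 2.4
j_inv = pow(j_hat, -1, p)   # ĵ^{-1}, the multiplicative inverse of ĵ
inv2  = pow(2, -1, p)       # <2^{-1}>_p

def theta(a, b):
    """(a + jb) -> (z, z*), Equation 2.5."""
    return ((a + j_hat * b) % p, (a - j_hat * b) % p)

def theta_inv(z, zs):
    """(z, z*) -> (a, b), Equation 2.6."""
    return ((inv2 * (z + zs)) % p, (inv2 * j_inv * (z - zs)) % p)

# complex multiply as two independent channel multiplies (Equation 2.7)
z, zs = theta(6, 3)
w, ws = theta(4, 5)
a, b = theta_inv((z * w) % p, (zs * ws) % p)
# (6 + j3)(4 + j5) = 9 + j42, and 42 mod 13 = 3
assert (a, b) == (9, 3)
```

Note that the two channel products never exchange information; only the inverse map recombines them.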


2.2 The Galois-Enhanced QRNS (GEQRNS)

The properties of Galois fields can be used to further simplify complex mul-
tiplication in the RNS [25, 24, 18, 35]. It is well known that for any prime mod-

ulus p that there exists some α ∈ Z_p that generates all non-zero elements of the

field GF(p). That is, any non-zero element in Z_p can be represented by α^k, where
k ∈ {0, 1, 2, ..., p − 2}. Since we can represent all elements of GF(p) − {0} by
exponents, multiplication can be performed via exponent addition. This is highly
desirable from a hardware standpoint since n-bit adders tend to be smaller and faster

than n-bit multipliers. A number theoretic logarithm table is used to obtain the

power of a for each QRNS operand, and an antilogarithm table is used to recover the
summed powers (modulo p-1). Exploitation of this cyclic property also permits the
use of moduli which are larger than those typically used for RNS systems (typically
less than five bits), since a hardware multiplier is not needed. This translates into
increased dynamic range for fewer channels. Eight bit moduli have been used in this
implementation, but the technique could easily be extended to 10 or 11 bit moduli.

Beyond this, the logarithm and antilogarithm tables become too large and slow to
be of beneficial use.
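The log/antilog technique can be sketched in software as follows (a model with our own helper names; the hardware uses ROM lookup tables rather than dictionaries, and flags zero with a dedicated code):

```python
# Illustrative GEQRNS multiply: exponent addition modulo p-1.
def make_tables(p, alpha):
    """alpha must generate GF(p)*; zero is handled as a special case."""
    antilog = [pow(alpha, k, p) for k in range(p - 1)]
    log = {v: k for k, v in enumerate(antilog)}
    return log, antilog

def geqrns_mul(x, y, log, antilog, p):
    if x == 0 or y == 0:            # the '*' special case for zero
        return 0
    return antilog[(log[x] + log[y]) % (p - 1)]  # exponents add mod p-1

log13, antilog13 = make_tables(13, 7)   # 7 generates GF(13)*
assert geqrns_mul(8, 3, log13, antilog13, 13) == (8 * 3) % 13
```

In hardware, the dictionary lookups become the exponentiation ROM of Chapter Five, and the multiply reduces to a small modulo-(p−1) adder.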









2.3 Dynamic Range

The legitimate range of integers in a RNS system is [0, M − 1] (where M =

Π_{i=1}^{L} p_i). All 4k + 1 primes bounded by eight bits belong to the set { 241, 233, 197,
181, 173, 157, 149, 137, 113, 109, 101, 97, 89, 73, 61, 57, 53, 41, 37, 29, 17, 13,
5 }. In theory, a RNS system could be constructed in which the maximum range

was the product of all of these primes, 7.796643721 x 10^42 (≈ 2^142). Clearly this

would result in an impractical implementation, as a massive CRT would be needed
to recover the residues. Systems of practical interest would constitute moduli sets of

say the first eight moduli. If a signed representation for integers is desired, we can
divide the interval [0, M) up evenly into a positive half and a negative half. Since

M will be odd for any product of moduli we have defined, the dynamic range will

be [−(M − 1)/2, (M − 1)/2]. Each integer X is mapped onto the range [0, M − 1]
according to


X → (x_1, x_2, ..., x_L),   x_i = { X (mod p_i)              X ≥ 0
                                  { p_i − (|X| (mod p_i))    X < 0    (2.8)

where X (mod p_i) is the least positive residue of X with respect to p_i. The
negative part of the dynamic range thus maps to the upper part of the legitimate

range:


Positive Range : [0, (M − 1)/2]
                                                            (2.9)
Negative Range : [(M + 1)/2, M − 1]
Table 2.1 depicts the maximum ranges achievable for eight through four modu-
lus systems. We will consider the four modulus case. We are interested in performing

inner product computations of the form:

c = Σ_{k=0}^{N−1} a(k) b(k)                                 (2.10)









where a(k) and b(k) are complex sequences. We will assume that the real and

imaginary parts of both sequences are signed numbers where


−2^α + 1 ≤ a_r, a_i ≤ 2^α − 1,     −2^β + 1 ≤ b_r, b_i ≤ 2^β − 1     (2.11)

The real and imaginary components of Equation 2.10 will thus satisfy the

bounds


−2N(2^α − 1)(2^β − 1) ≤ c_r, c_i ≤ 2N(2^α − 1)(2^β − 1)     (2.12)

We must now contain these limits within our total dynamic range so that

overflow will not occur during the computation of Equation 2.10, we thus obtain the

following inequality


(M − 1)/2 ≥ 2N(2^α − 1)(2^β − 1)                            (2.13)
which implies that the maximum inner product length NMAX is


N_MAX ≤ (M − 1) / (4(2^α − 1)(2^β − 1))                     (2.14)

Some tabulated inner product lengths are shown in Table 2.2 for various val-

ues of α and β. We will assume that integers with α, β > 7 represent quantities

which are known a priori, as it is unlikely that numbers of larger wordwidths would

be available from high-speed signal acquisition circuits. Large wordwidth quantities

would typically be filter coefficients, which can be reduced modulo p_i by a host pro-

cessor, prior to entering the RNS system. Thus, two random eight-bit data streams,

or one random eight-bit and one pre-known data stream of wordwidth greater than

eight bits can be input.
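Equation 2.14 is easy to check numerically. The sketch below (our own code, not part of the thesis) reproduces entries of Table 2.2 for the four-modulus set:

```python
# Maximum inner product length, Equation 2.14.
from math import prod

def n_max(moduli, alpha, beta):
    """N_MAX <= (M - 1) / (4 (2^alpha - 1)(2^beta - 1))."""
    M = prod(moduli)
    return (M - 1) // (4 * (2**alpha - 1) * (2**beta - 1))

four_mod = (241, 233, 197, 181)          # M = 2002247521
assert n_max(four_mod, 7, 7) == 31034    # matches Table 2.2
assert n_max(four_mod, 7, 15) == 120
```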










Table 2.1: Table of Maximum Dynamic Ranges for Eight to Four Modulus System.

Modulus Set                        | Dynamic Range Π p_i  | Closest power of 2
241,233,197,181,173,157,149,137    | 1110121095908704853  | 2^59
241,233,197,181,173,157,149        | 8103073692764269     | 2^52
241,233,197,181,173,157            | 54383044917881       | 2^45
241,233,197,181,173                | 346388821133         | 2^38
241,233,197,181                    | 2002247521           | 2^30


Table 2.2: Table of Maximum Inner Product Lengths

α   β    Maximum Inner Product Length N_MAX
7   7    31,034
7   9    7,713
7   11   1,925
7   13   481
7   15   120


2.4 Example

We will now consider a sample calculation based on the theory presented so

far. For this simple case, the smallest two 4k + 1 primes will be used as moduli.

These are p_1 = 5 and p_2 = 13. Some preliminary constants that will be needed by

Equation 2.3, Equation 2.5 and Equation 2.6 will be presented first, as well as the

contents of the lookup tables needed to compute the number theoretic logarithm and

antilogarithm values.

For the CRT (Equation 2.3) we will need to know the values of M, m̂_1,

(m̂_1)^{-1}, m̂_2 and (m̂_2)^{-1}. These are obtained as follows:


M = Π_{i=1}^{2} m_i = 5 x 13 = 65

m̂_1 = M/m_1 = 65/5 = 13

<m̂_1 x (m̂_1)^{-1}>_5 = 1   ⟹   (m̂_1)^{-1} = 2             (2.15)

m̂_2 = M/m_2 = 65/13 = 5

<m̂_2 x (m̂_2)^{-1}>_13 = 1  ⟹   (m̂_2)^{-1} = 8
The forward QRNS mapping (Equation 2.5) requires that we find the two

constants such that the equation <x² + 1 ≡ 0>_p can be solved. These are obtained
for the moduli used here as follows:


<x² + 1 ≡ 0>_5   ⟹   ĵ_1 = 2,  ĵ_1^{-1} = 3
                                                            (2.16)
<x² + 1 ≡ 0>_13  ⟹   ĵ_2 = 5,  ĵ_2^{-1} = 8

We note that <2 x 3>_5 = 1 and that <2 + 3>_5 = 0, thus 2 and 3 are
multiplicative and additive inverses of each other, modulo 5. Similarly, <5 x 8>_13 =

1 and <5 + 8>_13 = 0. Thus, two elements always exist in each case, which behave

exactly like the imaginary operator (j) which we are familiar with.

The inverse QRNS mapping requires that we also find the multiplicative in-

verse of 2, modulo our prime moduli. These are:

<2^{-1}>_5 = 3
                                                            (2.17)
<2^{-1}>_13 = 7
Finally, we need to obtain the logarithm-antilogarithm tables to perform the

QRNS to GEQRNS mapping and the GEQRNS to QRNS inverse mapping, respec-

tively. We recall that we can only generate the non-zero elements in GF(p) with some

ak (where k is modulo p 1). We must thus consider a zero as a special case, which

will be denoted by *. The tables are given below, with the generators for each case.

The number theoretic logarithm is obtained from going right-to-left in the tables,

and the anti-logarithm, from left-to-right.















Table 2.3: Log-Antilog Table for p_1 = 5.

Log-Antilog Table for p_1 = 5, α = 3
Power k   <α^{<k>_4}>_5    Element
0         <3^{<0>_4}>_5    1
1         <3^{<1>_4}>_5    3
2         <3^{<2>_4}>_5    4
3         <3^{<3>_4}>_5    2
*         —                0


Table 2.4: Log-Antilog Table for p_2 = 13.

Log-Antilog Table for p_2 = 13, α = 7
Power k   Element      Power k   Element
0         1            6         12
1         7            7         6
2         10           8         3
3         5            9         8
4         9            10        4
5         11           11        2
*         0

Now, suppose we wish to compute the product of two complex numbers 6 +j3
and 4 + j5. The product using standard arithmetic is 9 + j42. Let us now compute
the product using RNS. We must first perform the forward QRNS mapping:


θ(6 + j3) = (z, z*)
z  = (<6 + ĵ_1 x 3>_5, <6 + ĵ_2 x 3>_13) = (2, 8)
z* = (<6 − ĵ_1 x 3>_5, <6 − ĵ_2 x 3>_13) = (0, 4)

θ(4 + j5) = (w, w*)                                         (2.18)
w  = (<4 + ĵ_1 x 5>_5, <4 + ĵ_2 x 5>_13) = (4, 3)
w* = (<4 − ĵ_1 x 5>_5, <4 − ĵ_2 x 5>_13) = (4, 5)

Thus, our complex numbers map into the set of ordered pairs


6 + j3 → (2, 8)(0, 4)
4 + j5 → (4, 3)(4, 5)                                       (2.19)
At this point we can perform the multiplication as an actual multiplication as
in QRNS, or we can use logarithmic addition as in GEQRNS. For now we will use
QRNS. Performing component wise multiplication as usual, we obtain

(2, 8)(0, 4)
     x
(4, 3)(4, 5)
     ↓                                                      (2.20)
(<2 x 4>_5, <8 x 3>_13)(<0 x 4>_5, <4 x 5>_13)

(3, 11)(0, 7)
We must now use the inverse QRNS mapping, which yields

θ_1^{-1}(3, 0)  = <2^{-1}(3 + 0)>_5 + j<2^{-1}(ĵ_1)^{-1}(3 − 0)>_5
              = <3 x 3>_5 + j<3 x 3 x 3>_5
              = 4 + j2
                                                            (2.21)
θ_2^{-1}(11, 7) = <2^{-1}(11 + 7)>_13 + j<2^{-1}(ĵ_2)^{-1}(11 − 7)>_13
              = <7 x 18>_13 + j<7 x 8 x 4>_13
              = 9 + j3








Finally, we must use a real and an imaginary CRT to recover our standard
integer representation. The following expressions result


c_r = < <m̂_1 x <(m̂_1)^{-1} x 4>_5>_65 + <m̂_2 x <(m̂_2)^{-1} x 9>_13>_65 >_65
    = < <13 x <2 x 4>_5>_65 + <5 x <8 x 9>_13>_65 >_65
    = <39 + 35>_65
    = 9
c_i = < <m̂_1 x <(m̂_1)^{-1} x 2>_5>_65 + <m̂_2 x <(m̂_2)^{-1} x 3>_13>_65 >_65   (2.22)
    = < <13 x <2 x 2>_5>_65 + <5 x <8 x 3>_13>_65 >_65
    = <52 + 55>_65
    = 42
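The entire worked example can be collapsed into a short script (our own code, not the thesis's; it hard-codes the ĵ constants from Equation 2.16 and re-derives the rest):

```python
# End-to-end check: (6 + j3)(4 + j5) over moduli {5, 13} recovers 9 + j42.
moduli = (5, 13)
M = 65
J_HAT = {5: 2, 13: 5}                    # from Equation 2.16

def forward(p, a, b):
    """theta: (a + jb) -> (z, z*) in channel p."""
    j = J_HAT[p]
    return (a + j * b) % p, (a - j * b) % p

def inverse(p, z, zs):
    """theta^{-1}: (z, z*) -> (a, b) in channel p."""
    j_inv, inv2 = pow(J_HAT[p], -1, p), pow(2, -1, p)
    return (inv2 * (z + zs)) % p, (inv2 * j_inv * (z - zs)) % p

def crt(residues):
    """Equation 2.3 over the two moduli."""
    total = 0
    for x, p in zip(residues, moduli):
        m_hat = M // p
        total += m_hat * ((pow(m_hat, -1, p) * x) % p)
    return total % M

re_parts, im_parts = [], []
for p in moduli:
    z, zs = forward(p, 6, 3)
    w, ws = forward(p, 4, 5)
    a, b = inverse(p, (z * w) % p, (zs * ws) % p)
    re_parts.append(a)
    im_parts.append(b)
assert (crt(re_parts), crt(im_parts)) == (9, 42)
```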

Thus, we obtain the same product as before. Now, we will perform the multi-
plication in Equation 2.20 using our GEQRNS logarithm tables. We will simply use
Table 2.3 and Table 2.4 to lookup the logarithm and antilogarithms of the operands.
Thus,


(2, 8)(0, 4)
     x
(4, 3)(4, 5)
     ↓
(3^3, 7^9)(*, 7^10)
     x
(3^2, 7^8)(3^2, 7^3)                                        (2.23)
     ↓
(3^{<3+2>_4}, 7^{<9+8>_12})(*, 7^{<10+3>_12})
     ↓
(3^1, 7^5)(*, 7^1)
     ↓
(3, 11)(0, 7)
We obtain the same result as we did in Equation 2.20, and thus, if we proceed
with the inverse QRNS mapping and CRT, will obtain the same final result. Finally,
we point out that the GEQRNS is only defined for multiplication. As we will see later,
this encoding is highly desirable from a hardware standpoint. In Chapters Four and
Five, the architecture of the multiply-accumulate processor is presented. Operands

are input to the multiplier portion of the chip as GEQRNS exponents. Once they







are multiplied, their product is converted back to QRNS inside the processor and

subsequently used in the accumulate sections of the chip.















CHAPTER 3
KEY ARRAY PROCESSOR IMPLEMENTATION ISSUES

3.1 Synchronization

The continual increase in integrated circuit scaling density has permitted the

development of chips which simultaneously exhibit increased complexity and operat-

ing speed. It is now possible to build entire systems on a single chip which consist

of many interacting circuit elements. By far, the largest gains in integration are at-

tained with monolithic (single chip) ICs. This is because the delay times associated

with communication between modules in a chip can easily be an order of magnitude

less than those associated with chip-to-chip communication. Synchronous systems

constitute the bulk of chips fabricated today. A synchronous system is one in which

data is passed to or from communicating modules in a chip on the active "edges"

or "states" of a global clock. This greatly simplifies the design of individual circuit

elements as data only needs to be stable at these edges or states, rather than in

continuous time.

As die areas and clock speeds continue to grow, however, it becomes more and

more difficult to guarantee that global clock signals arrive at different locations on a

chip at the same time. The differences in the arrival times of global clocking signals

are due to differences in the path lengths to individual modules. Even if these path

lengths can be physically made the same, there are random unavoidable variations in

the delay characteristics of the paths that are intrinsic to the manufacturing process.

Additionally, the time delays of the clock paths are influenced by thermal and power

supply variations, which can also be random in nature or dependent on operating

conditions. The difference between arrival times of the clock signal is known as clock









skew. Clock synchronization is a nontrivial problem for present-day large area, high

performance die [6]. Clock skew is a fundamental limitation to the maximum speed

achievable for very fast chips. As the cycle times of these chips decrease, the skew
time becomes an increasingly significant fraction of the total cycle time. The

problem is particularly aggravated when data must be passed to and from distant

modules in a chip.

The choice of a clocking strategy is thus of paramount importance for the

design of high performance integrated circuits. The general trend with time has been

a reduction in the number of clock signals that are generated and routed around

a chip. There are associated tradeoffs, however, in circuit complexity, speed and

clocking safety. The sections that follow will describe some of the more modern

clocking schemes that have been used successfully in integrated circuits in the past

and some emerging technologies for very high performance chips.

3.1.1 Traditional Approaches

Pseudo Single Phase Latch Based Systems

The simplest type of data storage element is a latch. Latches are used to hold

data at the inputs and outputs of combinational logic gates. In its simplest form, a

latch passes data present at its input to its output when the clock is high (active).

This is the transparent phase. If the data changes at the input of the latch when the

clock is high, it will also change at the output after the time it takes for the change to

propagate through the latch. When the clock goes low, the data that was present at

the falling edge of the clock is stored and cannot change. This is the nontransparent

phase. Latches exhibit the lowest transistor count, due to their simplicity. This factor

motivates their use in VLSI systems. In later sections, we will expand our definitions

of transparent and non-transparent latch phases.










Figure 3.1: Single Phase Latch System

Signals propagate through combinational logic networks at different rates,

depending on their values. This is due to the internal characteristics of the transistor

switching paths comprising the gates. It is not uncommon, for example, to find

propagation delays for highs to be twice as long as those for lows, or vice versa.

This difference in delay is also related to the function being implemented and to

combinations of the input variables. Consider the implementation of a state machine

shown in Figure 3.1, which employs a pseudo single phase latch clocking scheme.

The output to the next stage and the next state information are computed from the

present inputs and present state data.

New data is sampled by the logic network just after the rising edge of the

clock, is modified by the combinational logic (CL) and should be ready to be passed

on to the next stage (and feedback path) just before the next rising edge of the clock.

We are interested in placing constraints on the allowable delays in the logic so that
the maximum operating frequency can be obtained. The clock period, τφ, is given by
the sum of the high and low phases as τφ = τH + τL. We desire the clocking to be data
independent, which requires that the slowest delay in the logic be less than τφ. At the








same time we require the fastest delay in the logic to be slower than τH, otherwise a
potential conflict could occur. If the fast path delay were significantly faster than τH

then the newly generated data from the CL network could race through the feedback

path and to the next stage, thereby corrupting the previously generated (valid) data.

Non-deterministic behavior of the system results if this phenomenon occurs. This
is called a race condition, which must be avoided at all costs. The delay of the CL
network, τCL, is thus subject to the two-sided constraint:

τH < τCL < τφ                    (3.1)

To put it succinctly, we require that the slow path be fast enough and the

fast path be slow enough. This two-sided requirement is very difficult to guarantee

in VLSI systems, in the context of process parameters which have some statistical

distribution. This clocking scheme was used in some early VLSI chips, but was later

abandoned due to its implicit hazards. More security can be obtained by using a

multiphase scheme. The tradeoff is that the simplicity of this scheme is forfeited
for reduced risk and ease of design.
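The two-sided constraint of Equation 3.1 can be checked numerically. The sketch below is only an illustration; the delay values are hypothetical, chosen to exercise the inequality.

```python
def single_phase_safe(t_cl_fast, t_cl_slow, t_high, t_low):
    """Check Equation 3.1: the fastest CL path must be slower than the
    clock high phase, and the slowest CL path faster than the period."""
    t_period = t_high + t_low
    return t_high < t_cl_fast and t_cl_slow < t_period

# Hypothetical delays in nanoseconds.
print(single_phase_safe(t_cl_fast=6, t_cl_slow=18, t_high=5, t_low=15))  # True
print(single_phase_safe(t_cl_fast=3, t_cl_slow=18, t_high=5, t_low=15))  # False: fast path races
```

The second call fails the lower bound: a 3 ns fast path finishes inside the 5 ns transparent phase, which is exactly the race condition described above.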

Non-overlapping Pseudo Two Phase Clocking

Non-overlapping pseudo two phase clocking (NPTC) has been the mainstay of

the semiconductor industry for several years now. Most integrated circuits designed

today, still employ a NPTC scheme. Its popularity has been due to its relative safety

and immunity to race conditions. This security is achieved by the introduction of

a second clock phase. The active phase of the second clock does not overlap with

the active phase of the first clock. There is thus no possibility of a race condition

occurring. This is shown in Figure 3.2.

Here, a double latch scheme is employed. New input data is let in to the L1

latch just after the rising edge of φ1. On the falling edge of φ1 it is "frozen" and cannot
change again. After time τA, φ2 becomes active and the new data propagates to the











Figure 3.2: Double-latch non-overlapping pseudo two phase system


CL network. We notice that the fast path race condition has been eliminated, as even

if the CL network is very fast for a particular input (i.e. would have computed the

new output with a very small delay), it could not race to the next stage (or through

the feedback path) since the L2 latches are non-transparent. A two-sided constraint

has now been reduced to a one-sided constraint. Now, the only requirement of the

CL network is that it meet the upper bound on the slowest transition path. This is

satisfied if the following equation holds true:


τCL < τ2 + τB + τ1
or τCL < τφ − τA                    (3.2)

where we define τφ = τφ1 = τφ2 = τ1 + τA + τ2 + τB. We notice that the
τA time delay is an overhead that must be paid in every cycle. This is the price we

have paid for timing safety, along with the increased complexity of the double latches

and wiring of the second phase. We also notice that we have effectively made an

edge-triggered latch from our cascade of L1 and L2 latches, as new data enters the

system on the rising edge of φ2. This idea will be important later, and we will call

an edge sensitive latch a Flip-flop (FF). Flip-flops can be both negative and positive



























Figure 3.3: High performance non-overlapping pseudo two phase system

edge sensitive. The time wasted during τA can be gained back if the CL network can

be partitioned appropriately. This is shown in Figure 3.3. Here, combinational logic

has been placed between latches, rather than after a cascade of two latches. The

advantage of this scheme can be appreciated by examining the constraint equations,

which can be obtained from a similar analysis:


τCL1 < τ1 + τA + τ2
τCL2 < τ2 + τB + τ1                    (3.3)

The sum of the CL network delays must be less than the clock period, or
τCL1 + τCL2 < τφ. Any combination of delay times that satisfies this constraint can
be implemented. If one of the CL networks is faster than the other, we can thus

trade delay, giving the slow one more time and the fast one less. The total delay,
τCL1 + τCL2, now approaches the clock period, which is more efficient than before.

The NPTC clocking scheme is safe as long as the "dead-time" between the
phases can be maintained. In the presence of clock skew, however, this requirement














Figure 3.4: Delay transformation model

may be violated. For example, if two non-local modules are communicating, then the

relative differences in path delays may degrade our safety margin enough to cause

timing errors. We can model the effects of skew by introducing a delay transformation

that preserves the overall timing scheme presented so far. For a general logic module

shown in Figure 3.4, if we add a positive delay to all inputs and subtract this delay

from all outputs, the overall timing remains the same. We can thus model the effects

of skew by considering an overall timing loop. A circuit with known delays through

the combinational logic and uncertain clock skew can be transformed into a system

with uncertain delays through the combinational network and no clock skew [7].

Delay is thus added to one piece of the combinational network and subtracted from

the other. This delay may be positive or negative. Negative delay has no physical

significance in a real system. Negative delays have significance in the transformed

system, however. We note that if φ1 is delayed more than φ2 such that this delay is
greater than or equal to τA, an overlap of active phases results, which requires
two-sided constraints as in the case of single phase latch clocking. The system is likely

to fail at this point.

Pseudo Single Phase Edge Clocking

If edge triggering is used for the system in Figure 3.1 then the drawbacks of

using an additional clock phase would be avoided. Data would move to the next









adjacent stage (and to the feedback path) on the rising edge of the clock and thus,

would not be subject to the lower bound on logic delay as before. Of course the system

would still be vulnerable to clock skew between communicating non-local modules.

Single phase edge based clocking and NPTC schemes have been used successfully in

integrated circuits, with the former associated with higher performance.

At this point, the reader is probably wondering why the word "pseudo" ap-

pears in all of the descriptions. This is because we have purposefully stayed away

from the internal details of the latches. In reality, complementary (inverted) phases

of the clock signals must be generated to make all of our latches work. This is due

to the nature of CMOS logic which employs two different species of transistors to

pass the full range of logic values. If we view the transistors as ideal switches, then

their operation (in the most elementary form) requires that a high voltage turns the

NMOS device on and the PMOS device off. The converse is true, that a low turns

the NMOS off and the PMOS on. The NMOS device will pass a strong zero and a

weak one, while the PMOS will pass a strong one and a weak zero. The simplest

switch that will pass the full logic level range is a parallel combination of an NMOS

and PMOS device, called a transmission gate. This is shown in Figure 3.5, where a

simple positive edge-triggered latch is shown. The circuit works by letting the new

data into the first half of the latch during the low phase of φ and presenting it to the
output during the high phase of φ. The internal states in the latch are stored on the

parasitic capacitances at the inputs of the inverters in the latch (dynamic logic). If

the switching time of φ is zero as in the ideal case, then there is no potential race

condition. In reality, however, this time cannot be zero and the clock signal must

have a finite slope. There is thus a built-in transparency during the interval τ. If the
propagation delay through the latch is on the order of τ then a race condition exists.
We must thus keep τ as small as possible, which implies that the clock edge-rate must

be high. To gain an appreciation of the problem, suppose that we are working with












(Waveform panels: ideal case; finite transition time; finite transition time with inversion delay.)
Figure 3.5: Transparency problem introduced by complementary phase of clock.

a typical 5V chip, and we require that τ < 2 ns. The clock edge rate must thus be

greater than 2.5 billion volts per second! Of course the output of the latch cannot rise

in zero time (nor can the inverters in the latch), so this will relax our delay constraint

somewhat. Nevertheless, the transition time of the clock and its complement must be

very small. The problem is compounded since the complementary phase is typically
generated locally from φ. This implies that it will always lag behind φ since it
cannot be produced in zero time, which increases τ. There are thus built-in problems associated with inverting

the clock signal for a complementary phase. If we had a family of latch circuits that

could operate with only one clock phase then this problem would be solved. The

edge rate sensitivity will be a fundamental limitation, if the logic block that follows

the latches can switch in times on the order of the clock transition time. For very

high performance chips this is in fact the case, and very careful attention must be

paid to the worst case clock edge rate.
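The edge-rate figure quoted above follows directly from the supply swing and the allowed transparency window; a quick check of the arithmetic:

```python
v_swing = 5.0        # volts: full clock swing on a 5 V chip
tau_max = 2e-9       # seconds: required bound on the transparency interval
slew = v_swing / tau_max
print(slew)          # 2500000000.0, i.e. 2.5 billion volts per second
```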








3.1.2 True Single Phase Clocked Systems

The True-Single-Phase-Clocking (TSPC) scheme [43, 44, 1] represents the

state of the art in integrated circuit clocking. In TSPC, the clock signal is never

inverted to produce a complementary phase, which significantly ameliorates the prob-

lems pointed out earlier. TSPC schemes become more attractive as chip die areas

grow towards ULSI and WSI dimensions, as only one clock line needs to be routed

around the chip and since clock skew between phases is eliminated. It also much eas-

ier to guarantee the long term reliability of a single clock interconnect, or distribution

network. Reducing the clocking complexity of the storage elements is important for

large systems since the clock capacitive load influences the overall system speed.

At the heart of the technique is the use of two types of latch circuits which

are alternately transparent on either the high or low phases of the clock. The latch

which is transparent on the high phase of the clock and is non-transparent on the

low phase is called the N-latch. Likewise, the latch transparent on the low phase

and non-transparent on the high phase is called the P-latch. The latches permit

combinational logic circuits to be placed between N and P latch sections, or actually

embedded in the latches themselves. The scheme supports static or dynamic CMOS

logic elements, and is thus fully applicable to all types of CMOS systems. We will

defer the discussion of specific circuit topologies of the latch elements until Chapter

Five. For now, in latch based schemes, we can deal only with the abstraction of

N-blocks and P-blocks, where "blocks" consist of only their respective latch elements

together with combinational logic. We note that the notion of combinational logic

elements contained in or between latch elements is equivalent. The previously defined

concepts of edge-based systems still hold. A positive edge flip-flop is made from a

cascade of a P-latch followed by an N-latch. Similarly, a negative edge flip-flop is made
from a cascade of an N-latch followed by a P-latch.






















Figure 3.6: System model for TSPC edge based clocking.

TSPC Edge Based Systems

The effects of clock skew in a TSPC scheme can be examined with the system

shown in Figure 3.6. Skew is defined as the difference in time between clock signals

arriving at the receiving flip-flop relative to the transmitting flip-flop. Skew may take

on a negative or positive value as shown, depending on the relative magnitudes of

the path delays (Δ1 and Δ2). We note that negative skew has a positive value for
Δ21 and positive skew has a negative value for Δ21. We will also define τs as the

setup-time, which is the minimum amount of time that the data must be stable prior

to the active edge of the clock. Likewise, we can define τh as the hold-time, which

is the minimum amount of time that the data must be held after the active edge

of the clock. There are also delay times associated with the propagation times for

new data through the flip-flops and combinational logic circuits. We will define the

propagation delay through the flip-flops (i.e. after the active edge) by τQ, and the
propagation delay through the CL logic by τCL. All of the quantities defined so far

may take on maximum and minimum values, which will be denoted with subscripts

M and m, respectively.













Figure 3.7: Timing waveforms for TSPC edge based clocking.

As before, we are interested in deriving a set of constraint equations to char-

acterize our system. We wish to be able to run the system at the maximum allowable

clock frequency. We now consider the case with clock skew when non-local flip-flops

are communicating. The timing diagram for the situation where A2 > A1 (negative

skew) is shown in Case (b) of Figure 3.7. The minimum allowable clock period can
be expressed as:



τφ ≥ τQM + τCLM + τs − Δ21m                    (3.4)

On the other hand, the minimum delays must be such that the newly generated

data does not reach the distant flip-flop before its hold time. This requires that:



τCLm + τQm ≥ τh + Δ21M                    (3.5)

The maximum allowable clock skew in the system is given by:


Δ21M ≤ τCLm + τQm − τh                    (3.6)


























Figure 3.8: System model for TSPC latch based clocking.

Typically τh is near zero for this class of circuits, so the maximum allowable
clock skew is just given by the minimum flip-flop and logic delays. The situation
for positive clock skew is shown in Case (c) of Figure 3.7 (−Δ21). Substituting a
negative value for Δ21 into Equation 3.4 will yield an increase in the minimum clock

period. Thus, positive skew will tend to slow the system down. Negative skew will

allow the system to operate faster, but will place constraints on the minimum speed

of the logic, since fast CL networks will tend to have shorter minimum delay times.
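Equations 3.4 through 3.6 can be collected into a short sketch. For simplicity a single skew value stands in for Δ21m and Δ21M, and the delay figures are hypothetical.

```python
def tspc_edge_timing(tQM, tQm, tCLM, tCLm, ts, th, skew21):
    """Edge-based TSPC timing: minimum clock period (Equation 3.4),
    hold-time check (Equation 3.5), and maximum tolerable skew
    (Equation 3.6). Positive skew21 corresponds to negative skew."""
    t_period_min = tQM + tCLM + ts - skew21     # Equation 3.4
    hold_ok = tCLm + tQm >= th + skew21         # Equation 3.5
    skew_max = tCLm + tQm - th                  # Equation 3.6
    return t_period_min, hold_ok, skew_max

# Hypothetical delays (ns): slow CL path 6, fast CL path 2,
# flip-flop delays 1.5/0.5, setup 0.5, hold 0, skew 1.
print(tspc_edge_timing(tQM=1.5, tQm=0.5, tCLM=6, tCLm=2,
                       ts=0.5, th=0, skew21=1))   # (7.0, True, 2.5)
```

With τh near zero, the tolerable skew is simply the minimum flip-flop plus logic delay, as noted in the text.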

TSPC Latch Based Systems

We can obtain a similar set of equations for non-local communication between

modules in latch based TSPC systems. Our prototype system is shown in Figure 3.8,

and timing waveforms are shown in Figure 3.9. For simplicity we will assume that

the delays in the P and N latches are the same. Data stored in the N-latch (when φ

goes low) must propagate through the CL network and arrive at the P-latch before

its setup time. This is shown in Figure 3.9, Case (a). For negative skew (Case(b))

we thus obtain the following expression which is very similar to Equation 3.4:
















Figure 3.9: Timing waveforms for TSPC latch based clocking.




τφ ≥ τQM + τCLM + τs − Δ21m                    (3.7)

We note that there is also a constraint on the high width of the clock in order

to give N1 enough time to obtain data. This is given by the following equation, where
τQM is the maximum delay of the P latch:

τφ,H ≥ τQM + τs                    (3.8)

Again we must consider the minimum allowable delay in the system. Data

must not race through the P latch and the CL network before the hold time of the

N2 latch. Thus, the requirement below must hold



τCLm + τQm ≥ τh + Δ21M                    (3.9)

The maximum allowable clock skew in the system is given by:


Δ21M ≤ τCLm + τQm − τh                    (3.10)










Again, positive skew has the effect of lengthening the clock period. Design of

these systems must strike a balance between Equations 3.7, 3.8, and 3.10.

It is interesting to consider situations in which Equation 3.10 cannot be sat-

isfied [44]. This is shown in Figure 3.10. Here the skew has exceeded the gate delay.

Non-deterministic behavior of the system results because the evaluate phases of N

and P blocks overlap. The system can be made to operate correctly, however, by

simply clocking against the data flow. This is shown in Figure 3.11. In this way, the

evaluation phase of the next block will be completely contained in the data-stable

zone of the last block. It is thus possible to exploit the skew, if its direction can

be guaranteed. Another solution is to only latch data on the start transitions of the

evaluation phases. This can be accomplished by adding a "re-synchronizing" N-block

in front of the P block as shown in Figure 3.12. This illustrates the flexibility of the

technique, where we can selectively add edge-triggering where needed. Finally we

note that in systems where the data flow must be in two directions, a totally edge-

based scheme is best. The maximum skew in this case can be nearly half a clock
cycle [44].


3.2 Synchronous vs. Asynchronous Systems

Since the early days of systolic arrays, it has been realized that clock synchro-

nization over a large silicon area would be a limiting factor. In an attempt to solve

this problem, Kung [15] proposed his Wavefront Array, in which data would move

between processing elements in a self-timed manner, via a handshaking protocol.

More recently, Afghahi and Svensson [2] have conducted a study of syn-

chronous and asynchronous clocking schemes, for layout groundrules ranging from 3

to 0.3 µm. The study was based on physical entities in actual processes, rather than

on speculative models. The results of their findings have important consequences for











Figure 3.10: Potential skew hazard with TSPC scheme [44].






Figure 3.11: Clocking against the data flow to exploit skew [44].














Figure 3.12: Non-local communication problem solution [44].


the clocking scheme used for a particular implementation, since the optimum choice

(in terms of overall system speed) is intimately related to the module grain size. Some

of their findings are reproduced below, as they are relevant to our problem.

Asynchronous timing schemes have been proposed as a solution to clock skew

in VLSI systems. It is generally realized that fully asynchronous logic (i.e. no

clock) requires too much handshaking overhead for practical use when the number of

inputs and outputs is large. Present day interest in asynchronous systems centers on

locally synchronous, globally asynchronous schemes, where each module in a system

operates with its own clock, but communicates with other modules asynchronously.

Since data arriving at a module must be processed synchronously, a synchronizer

circuit is necessary for each input. The synchronizer is vulnerable to error, if the

input signal changes during the edge of the local sampling clock. If this occurs, the

synchronizer goes into what is known as a metastable state (MSS), where the output

is undefined for some period of time. A synchronization failure potentially results

in a system failure. In order to avoid this, a time interval must be allowed for the

synchronizer to resolve the MSS. This time interval is an extra delay which impacts









the overall data rate. Obviously, reducing the module communication rate decreases

the likelihood of synchronization failure, at the penalty of reduced system bandwidth.

A more favorable approach is to develop probabilistic bounds on the resolution

time, t, of the synchronizer circuit. In this work, to estimate t, a synchronization

failure rate of 1 per year was considered acceptable. Synchronization time was shown

to be the limiting factor for asynchronous systems.

The study showed that the time complexity of synchronous systems is

O((log R)^1/2)

while for asynchronous systems it is

O(log R)

where R is the size of the system. This suggests that for fine-grained systems

(where skew is acceptable) synchronous clocking should be used, while for coarser

grained applications, asynchronous schemes should be used. The authors also showed

that where a pipelined clocking mode is used (i.e. the global clock is segmented

into an optimum number of small sections and each section driven by a repeater),

synchronous systems will always outperform asynchronous systems. This is

very different from the general belief that asynchronous systems will be the fastest

possible implementation in a scaled technology. They also showed that speeds up to

approximately 2 GHz can be achieved in single phase synchronous CMOS systems
for 0.3 µm technology, although the line repeaters must be spaced 1 millimeter apart

to achieve this.
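The two growth laws can be compared directly. This small sketch normalizes all constants to one, so only the trend with system size R is meaningful.

```python
import math

def t_sync(R):
    """Synchronous time complexity trend: O((log R)^1/2)."""
    return math.log(R) ** 0.5

def t_async(R):
    """Asynchronous time complexity trend: O(log R)."""
    return math.log(R)

# The synchronous curve grows more slowly as the system size R increases.
for R in (1e2, 1e4, 1e8):
    print(R, t_sync(R), t_async(R))
```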









3.3 Fundamental Manufacturing Limitations

The occurrence of defects is inherent to the semiconductor manufacturing pro-

cess. These defects arise due to many physical mechanisms such as particle contam-

ination, imperfections in insulating oxide layers, mask misalignment, step-coverage

problems, warping of the wafer during high temperature steps, etc. The first two

cases tend to produce random defects, which affect local regions of a wafer, while the

remaining three tend to produce defects which affect many chips (global regions) on

a wafer. Global defects are usually not considered in defect-tolerant analysis, since

they represent gross disturbances in the manufacturing process, and are generally not

a function of chip area. Additionally, various process monitors (test circuits placed

at strategic locations of a wafer to determine the quality of each manufacturing step)

are used to reject (or correct) wafers which exhibit such characteristics early in the

process. Consequently, the overall quality of the manufacturing process is determined

primarily by local defects.

Integrated circuit yield is intimately related to manufacturing defects. Yield is

defined as the ratio of the number of working chips on a wafer to the number of chips

fabricated, and is always less than 1. For conventional VLSI systems, yield studies

pertain to manufacturing economics with the goal of maximizing profits. In the

context of ULSI and WSI systems, however, yield relates to fundamental feasibility.

Yield is further divided into two broad categories: functional yield and parametric

yield. Functional yield (sometimes called catastrophic yield) is determined based on

the criterion of a chip successfully performing its desired logical functions. Parametric

yield is determined based on the chip meeting some predefined operating specification,

such as minimum speed or power dissipation. For our purposes, we will not consider

parametric yield since it is usually highly correlated to global disturbances on a wafer.









3.3.1 Defect Size Distribution

Early work on defects occurring in the semiconductor manufacturing process

modeled random defects as dimensionless points, where any defect occurring in an

integrated circuit was assumed to cause a failure. A more modern view of defects

[28] models them as extra or missing disks of material in the conducting and non-

conducting layers of an IC, which are characterized by varying radius and spatial

distribution on a wafer. These may take the form of shorts or opens in conducting

layers used for interconnections, oxide pinholes in insulating layers which can cause a

short (or leakage) between conductors, junction leakages or shorts to the substrate in

diffusions, etc. The model assumes that defect types are independent, in that a de-

fect of a particular type does not interact with, or cause, another defect of a different

type. In practice this has been shown to be a good approximation [39, pp.149-171].

Examples of actual defect types from early prototype fabrication runs of the PE are

shown in Figure 3.13. These defects may cause an immediate fault, as in Case A

(left), or may not cause a fault as in Case B (right). Case A shows a short between

two power busses, which was caused by a particle on the chip (the particle is the

dark spot). Case B is a pinhole in the passivation layer (overglass) layer, which does

not cause a short circuit since there is no conductor above it. Clearly, then, we must

consider the nature of a defect in order to determine if it will manifest itself as a

circuit fault. Random defects (sometimes called spot defects) are typically caused

by particle contamination (dirt) on the chip or on photolithographic masks during

manufacture.

As alluded to in the previous paragraph, defects are of many differing types.
The size distribution function for defects of type i is given by:



fi(R) = k1 R          0 ≤ R ≤ X0i
        k2 / R^pi     X0i ≤ R                    (3.11)


























Figure 3.13: SEM photographs of defects from early PE fabrications.

where R is the defect radius and X0i and pi are parameters which are extracted from

the fabrication line. Each manufacturing step in a process has associated with it its

own characteristic set of defect types. The parameter i in Equation 3.11 can thus be

taken on a mask-by-mask basis (i.e. consider polysilicon shorts and opens, first-level

metal shorts and opens etc.). Smaller defects tend to be more numerous than larger

defects, since it is more difficult to filter small particles from the ambient environment

and manufacturing chemicals. We see in Figure 3.14 that the defect density peaks at

the value X0i. This corresponds to the resolution limit of the photolithography

in the process. The physical reason for the peak is that defects of a smaller radius

simply cannot be resolved by the photolithography, and thus manifest themselves

with decreasing frequency. Typically, minimum design rules are set well above this

value, so that the only size distribution exhibited in practice is fi(R) = k2/R^pi. In the
above expression, both X0i and pi may have different values for each defect type i.
Typically, a value of pi ≈ 3 is assumed [28].
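The piecewise distribution of Equation 3.11 can be sketched as below. The constant k1 is chosen here so the two branches meet at X0i (a simplifying assumption; in practice k1 and k2 are extracted from the fabrication line), and all parameter values are illustrative.

```python
def defect_size_pdf(R, X0, p, k2):
    """Equation 3.11: defect size distribution, with k1 chosen so the two
    branches meet at the photolithographic resolution limit X0."""
    k1 = k2 / X0 ** (p + 1)      # continuity at R = X0: k1*X0 == k2/X0**p
    return k1 * R if R <= X0 else k2 / R ** p

# Illustrative parameters: the density peaks at the resolution limit X0.
X0, p, k2 = 0.5, 3, 1.0
print(defect_size_pdf(0.25, X0, p, k2) < defect_size_pdf(0.5, X0, p, k2))  # True
print(defect_size_pdf(3.0, X0, p, k2) < defect_size_pdf(0.5, X0, p, k2))   # True
```

Both comparisons reflect Figure 3.14: defects smaller than X0 occur with decreasing frequency, and the 1/R^p tail makes large defects rare.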

As mentioned previously, not all defects in a semiconductor process cause

faults. For example, Walker has suggested that if Xoi = 0.5 pm and the minimum










Figure 3.14: Defect density vs. defect radius

line separation is 3 µm (i.e. minimum design rule in Figure 3.14), then only one-in-
seventy-two defects are potentially fault producing [39, p. 41]. With this in mind,
we must then consider the effective defect density, D0i, which is the defect density as
seen by the layout. The defect density variation with respect to defect radius, Di(R),
is thus obtained by multiplying D0i and fi(R). Hence,


Di(R) = D0i k2 / R^pi = Ki / R^pi                    (3.12)

where Ki = D0i k2.
The relationship between parameters Ki, pi and the effective defect density is
as follows. Suppose that two metal lines in an IC have a minimum spacing s, and
we are interested in determining the effective defect density for shorts. All defects
with radius less than s/2 will not be able to cause a short in the layout, regardless of
where they lie. The effective defect density, Di-effective, as seen by the layout is then


Di-effective = ∫_{s/2}^{∞} (Ki / R^pi) dR = (Ki / (pi − 1)) (2/s)^(pi−1)                    (3.13)

Similarly, the effective defect density for opens in lines of minimum width w,

is given by:



Di-effective = (Ki / (pi − 1)) (2/w)^(pi−1)    (3.14)
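As an illustration (not part of the original analysis), Equations 3.13 and 3.14 can be evaluated directly in closed form. The numeric values of Ki, pi, s and w below are assumed for the example only.

```python
# Closed-form evaluation of Equations 3.13 and 3.14: effective defect
# density for shorts (minimum spacing s) and opens (minimum width w),
# assuming f_i(R) = k_i / R^p_i above the resolution limit, with
# K_i = D_0i * k_i.  All numeric values are illustrative assumptions.

def d_effective(K, p, dimension):
    """Integral of K / R^p from dimension/2 to infinity (requires p > 1)."""
    return (K / (p - 1.0)) * (2.0 / dimension) ** (p - 1.0)

K, p = 1.0, 3.0          # p_i = 3 is the typical exponent cited in [28]
s, w = 3.0, 2.0          # minimum spacing and width, in um (assumed)

shorts = d_effective(K, p, s)   # Eq. 3.13: defects that can bridge two lines
opens  = d_effective(K, p, w)   # Eq. 3.14: defects that can sever one line
```

Note that the narrower dimension always yields the larger effective density, since more of the size distribution lies above its threshold radius.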

We have seen, in the case of this simple example, a relationship between

effective defect density and the layout geometry. There is thus a finite probability of

a particular defect producing a circuit fault. This idea is further extended into the

concept of critical area, which is the portion of a layout sensitive to a defect of a

particular radius. Critical area A(R) can be defined for a defect of radius R, as that

area on a die in which the center of a circular defect has to fall for a fault to occur

in a circuit. This is illustrated in Figure 3.15, where we consider again, the case of

a chip consisting of wires of width w spaced s units apart. The total chip area is

A0. As we have seen before, defects of radius less than s/2 will not cause shorts, and

thus, A(R) = 0 for defects with radius R < s/2. If we consider defects with radius

R > s/2, A(R) increases. The critical area increases linearly until it reaches Ao. This

occurs at defect radius Ro = (s + w/2), where a defect occurring anywhere on the chip

would cause a short. The increase in critical area is linear for this simple case, but

for a complex layout, we can only state that critical area will increase monotonically

with defect radius. In practice very large defects seldom occur, and the distribution

given in Figure 3.14 can be truncated at some maximum radius.
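The piecewise behavior just described for the parallel-conductor case can be sketched as a short function; the linear transition region is an assumption that holds only for this simple geometry, as the text notes.

```python
# Piecewise critical-area model for the parallel-conductor example of
# Figure 3.15: A(R) is zero below R = s/2, rises linearly, and reaches
# the full chip area A0 at R = s + w/2, as described in the text.

def critical_area(R, s, w, A0):
    lo, hi = s / 2.0, s + w / 2.0
    if R <= lo:
        return 0.0                       # defect too small to bridge the gap
    if R >= hi:
        return A0                        # any location produces a short
    return A0 * (R - lo) / (hi - lo)     # linear transition region
```

For a complex layout the transition is merely monotonic, so A(R) would be tabulated by a CAD extraction tool rather than written in closed form.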















Figure 3.15: Critical area for Parallel Conductors.

3.3.2 Defect Spatial Distribution and Yield Models

Poisson Statistics

The spatial distribution of defects must also be considered when determining

circuit yield. Early yield models assumed that defects followed a Poisson distribu-

tion, and were considered as point defects. Poisson processes are modeled by the

expression:


prob{X = x} = (e^(−λ) λ^x) / x!    (3.15)

The three underlying assumptions of a Poisson spatial process are [16]: (1)

that the number of events occurring in one segment of space is independent of the

number of events in any nonoverlapping segment; (2) that the mean process rate λ

must remain constant for the span of space considered; (3) the smaller the segment

of space, the less likely it is for more than one event to occur in that segment. For

yield purposes, we are concerned with the case where X = 0, as this is the condition









for zero circuit failures. The mean process rate, λ, is the mean number of circuit faults. We

can define this per process step as λi, which is given by λi = Doi Ai, with Doi and

Ai as defined previously (i.e. the defect density and critical area, respectively). The

yield of a particular step in a semiconductor process is thus:


Yi = e^(−λi) = e^(−Doi Ai)    (3.16)

The total yield of the chip is then the product of the individual yields, or

simply:



Ytotal = ∏_(i=1)^n Yi    (3.17)
where there are n total defect types. The formal methodology in Equation 3.16

and Equation 3.17 of considering defect mechanisms on a mask-by-mask basis with

regard to layout dependent critical areas is best left to CAD tools. For analytical

purposes, we usually consider overall average defect densities only. For this reason

we will drop the i subscript in subsequent discussions, and define the quantity Do to

be the average fatal defect density. Critical areas are usually taken to be the area of

the entire chip, or the areas of major subsections of the chip. The yield for Poisson

distributed defects, thus becomes:


Y = e^(−λ) = e^(−Do A)    (3.18)

It is still possible to calculate the individual yields of particular sections of

a chip with this simplified view, from the area and defect density of that particular

section. Equation 3.17 suggests that areas on an integrated circuit which are the most

complex (i.e. require more manufacturing steps), will exhibit the lowest yield. This

is the case for logic, which usually makes use of all design layers (typically greater

than 12 masks). Conversely, interconnection busses which consist of one layer, say

second level metal, with no other layers beneath them, only require a few processing









steps (three or so). The corresponding yields of such sections will be higher than

that of same area logic circuits. The statistical independence of failures modeled

by the Poisson distribution permits the total yield of a chip to be computed from

the product of its individual component parts. This is the most useful property of

Poisson based yield models.
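The product property can be demonstrated with a small sketch. The section areas and defect densities below are assumed values chosen to reflect the logic-versus-bus contrast discussed above, not data from this work.

```python
import math

# Illustration of Equations 3.16-3.18: Poisson yield per chip section,
# with total yield as the product of section yields.  The areas and
# defect densities are assumed values, not data from this dissertation.

def poisson_yield(D0, A):
    """Eq. 3.18: Y = exp(-D0 * A)."""
    return math.exp(-D0 * A)

sections = {                     # (area in cm^2, fatal defects per cm^2)
    "logic":      (0.40, 1.0),   # uses all mask layers: higher density
    "metal2_bus": (0.10, 0.2),   # few process steps: lower density
}

y_total = 1.0
for area, d0 in sections.values():
    y_total *= poisson_yield(d0, area)   # product of yields, as in Eq. 3.17
```

Because the exponentials multiply, the result equals a single exponential in the summed λi, which is exactly why section yields compose freely under the Poisson model.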

Modified Poisson Statistics

It is now well known that for large chips, the Poisson model gives pessimistic

predictions for yield when compared to actual fabrication line data. This is because

of a phenomenon known as defect clustering. Defects tend to cluster between lots of

wafers, between wafers in the same lot and across individual wafers [11]. Clustering

of defects between lots is perhaps the easiest behavior to appreciate, due to the

batch oriented nature of the semiconductor process. Manufacturing parameters will

certainly change over a period of weeks or months, due to equipment and operating-

environment variations. Thus, it is reasonable to expect some statistical differences

in the yields of identical chips from different batches.

It is interesting to consider the physical mechanisms relating to the other types

of defect clustering exhibited. Clustering within wafers is due to minute differences in

environmental conditions at different locations on the surface of a wafer. Stapper has

suggested that the defect clusters are generated when vibration or other environmen-

tal changes (i.e. irregular gas flow or pressure changes) cause a cloud of particles to

break loose from the manufacturing equipment [32]. When these clouds land on the

surface of a wafer, the resulting defects produced will be clustered. Other very subtle

mechanisms also influence the spatial characteristics of defect patterns. For example,

it was observed for many years in the semiconductor industry that defects tended to

be random within the center of a wafer, and correlated towards the perimeter. This

caused many researchers to divide wafers into concentric zones, where the yield in

each zone was modeled by a Poisson distribution with its own defect density. This









behavior arose because wafers were carried in plastic boxes called "boats", between

process steps [32]. The inside of a "boat" is similar in construction to the inside of a

slide projector, where grooves permit wafers to be stacked vertically (in parallel). A

boat is open on one side only, and dust particles can only approach the wafers within

from this side. Suppose a dust cloud is present near a group of wafers in a boat. Even

if the particles are uniformly suspended, they are electrostatically attracted to the

nearest edge of the wafers due to the electrostatic potential created when the wafer is

slid into the boat. This leads to the observed edge clustering. These defect patterns

are seen less often today, due to improved wafer handling techniques. We also note that

the dust cloud would only affect wafers closest to it in the boat, and may not affect

other wafers in the same boat. This explains clustering from wafer-to-wafer in the

same lot.

The mechanisms previously described suggest that if there is a defect present

at some position on a wafer, it is highly likely that there is another one nearby.

This directly violates the spatial independence assumption of a Poisson process, which

assumes that a defect at a particular location is not correlated with an adjacent defect.

Thus, defect locations on a wafer are statistically dependent, rather than independent.

This is why purely Poisson expressions do not accurately model integrated circuit

yield. Clustering improves yield since it is better to have defects clumped together

affecting fewer chips, than randomly distributed, potentially affecting more chips.

The yield formula of Equation 3.18 can be modified to account for defect

clustering by assuming that defects are still Poisson distributed, but considering A

to be a random variable. The mere fact that A is a random variable suggests defect

clustering, regardless of the distribution used [11]. This technique was first described

in the literature by Murphy [19]. If F(A) is a cumulative distribution function for

the average number of faults per chip, then associated with F(A) is the probability

density function f(A) given by:










f(λ) = dF(λ)/dλ    (3.19)

where f(λ)dλ is the probability of having an average number of faults per chip be-
tween λ and λ + dλ. The overall yield thus becomes:


Y = ∫_0^∞ e^(−λ) f(λ) dλ    (3.20)

The function f(λ) is known as a compounder or mixing function. Murphy
reasoned that a bell-shaped Gaussian distribution would be appropriate for f(λ),
although he could not integrate the resulting expression. He approximated the Gaus-
sian distribution with a triangular distribution. The yield expression he obtained
matched his manufacturing data more accurately than a Poisson distribution. Many
distributions for f(λ) have been suggested in the past, but none have gained more
acceptance than that first used by Stapper, the Gamma distribution [32, 30]. The
Gamma distribution is given below as:


f(λ) = (1 / (Γ(α) β^α)) λ^(α−1) e^(−λ/β)    (3.21)

where α and β are parameters. The mean and variance of the Gamma distribu-
tion are E(λ) = αβ and V(λ) = αβ^2. If Equation 3.21 is substituted into Equation 3.20
and solved, the following expression results:


prob(X = x) = (Γ(α + x) β^x) / (x! Γ(α) (1 + β)^(α+x))    (3.22)
This distribution is known as the Negative Binomial distribution. The average
number of faults per chip (the grand average) is normally taken to be λ, where λ =
E(X) = αβ, so that β = λ/α. Equation 3.22 can thus be expressed as:


prob(X = x) = (Γ(α + x) (λ/α)^x) / (x! Γ(α) (1 + λ/α)^(α+x))    (3.23)









The mean and variance of Equation 3.23 are given by:


E(X) = λ
V(X) = λ(1 + λ/α)    (3.24)
We note that the variance of this distribution (σ^2) is greater than the mean, which

is different from the Poisson distribution where the variance equals the mean. The

parameter a is determined from data on the distribution of defects, and can be

calculated from the expression:


α = λ^2 / (σ^2 − λ)    (3.25)
We see from this equation that, as the variance of the defect distribution

approaches the mean, the value of α approaches ∞. This corresponds to the Poisson

case. Small values of α correspond to increased clustering. We can thus account

for the Poisson distribution by choosing an appropriate value of α. This suggests

that Negative Binomial statistics are more fundamental to IC manufacturing than

Poisson statistics. For all practical purposes, α > 10 adequately models the Poisson

case. As before, we are interested in the case where X = 0 for yield and Equation 3.23

becomes:


Y = prob(X = 0) = (1 + λ/α)^(−α) = (1 + Do A/α)^(−α)    (3.26)

This expression is known as the Negative Binomial Model (NBM), and has

been widely used in the industry to forecast integrated circuit yields. The NBM

formally contains an additional gross yield term, Yo, which multiplies Equation 3.26.

This term models the effects of large scale process disturbances, but as before, is

generally not considered for preliminary yield analysis, since it is not a function of

chip area. Conservative values for Do are between 1 and 2 fatal defects per cm2, and

α < 1. An average defect density of 1 per cm2 has remained the defect standard

for some time (although probably now less than 0.5 per cm2 is more appropriate for









some fabrication lines). It was recently reported that a facility could be considered

"world-class" if it was able to maintain an average fatal defect density of 0.3 per cm2

(in 1992) [41]. In the past, defect densities have kept commercial IC die areas below

1 cm2.
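The relationship between the two models can be checked numerically. The sketch below uses assumed values of Do, A and α; the Gamma-compounding check uses α = 2 purely to keep the integrand smooth at λ = 0 (typical fitted values are below 1, as noted above).

```python
import math

# Comparison of the Poisson model (Eq. 3.18) with the Negative Binomial
# Model (Eq. 3.26), plus a numeric check that compounding the Poisson
# yield with a Gamma mixing density (Eq. 3.20) reproduces the NBM.
# D0, A and the alpha values are illustrative assumptions.

def y_poisson(D0, A):
    return math.exp(-D0 * A)

def y_nbm(D0, A, alpha):
    return (1.0 + D0 * A / alpha) ** (-alpha)

D0, A = 1.0, 1.0                                   # 1 fatal defect/cm^2, 1 cm^2 die
assert y_nbm(D0, A, 0.5) > y_poisson(D0, A)        # clustering raises yield
assert abs(y_nbm(D0, A, 1e6) - y_poisson(D0, A)) < 1e-4   # alpha -> inf: Poisson

# Riemann-sum evaluation of Y = integral of e^-lam * f(lam) d lam with
# a Gamma density f (alpha = 2 keeps the integrand smooth at lam = 0):
alpha = 2.0
beta = D0 * A / alpha
f = lambda lam: lam ** (alpha - 1) * math.exp(-lam / beta) / (math.gamma(alpha) * beta ** alpha)
h, n = 60.0 / 200000, 200000
Y = sum(math.exp(-i * h) * f(i * h) * h for i in range(1, n))
assert abs(Y - y_nbm(D0, A, alpha)) < 1e-3
```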

Clustering has been shown to apply to large areas of a wafer, which typically

exceeds the area of a chip. This is known as large area clustering. If large area

clustering is assumed, it is common in analysis to assume that defects are uniformly

distributed within the cluster (i.e. within a chip). We note that this was the assump-

tion implicitly made in Equation 3.26, in that the same value of a was applied to

the entire chip. To determine if large area clustering holds for a chip of a particular

size, it is necessary to examine particle distributions on actual wafers. This is accom-

plished by first dividing the wafers into square regions called quadrats. The number

of particles which occur per quadrat is then counted so that a frequency distribution

for the number of particles per quadrat can be obtained [29]. The parameters of the

Negative Binomial distribution can then be determined from a maximum likelihood

estimation technique and checked for goodness of fit with a chi-square test. The

quadrat area is varied and the process repeated. The variability in the estimates of

α can be obtained. The validation of the large-area clustering assumption is based

on the overlap of the standard deviations of the estimated values of α for

increasing quadrat areas. For quadrat areas up to a critical value, the ranges will

overlap, indicating that α can be considered constant. Beyond this point the ranges

will no longer overlap, and the value of α will begin to increase. Equation 3.26

will no longer be valid beyond this point.
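A simpler moment-based estimate of α follows directly from Equation 3.25, as a sketch; the quadrat counts below are invented for illustration, and real estimates would use the maximum likelihood and chi-square procedure described above.

```python
# Method-of-moments sketch of Equation 3.25: estimating the clustering
# parameter alpha from quadrat counts.  The counts below are invented
# for illustration only.

def estimate_alpha(counts):
    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts) / n
    if var <= mean:
        return float("inf")           # no over-dispersion: Poisson-like
    return mean ** 2 / (var - mean)   # Eq. 3.25

clustered = [0, 0, 9, 0, 1, 8, 0, 0, 7, 0]   # defects per quadrat (assumed)
alpha_hat = estimate_alpha(clustered)        # small alpha: strong clustering
```

The clumped counts give a small α (below 1), consistent with the clustered regime discussed above.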

The large area clustering assumption has proven to be valid for well over a

decade and a half. Equation 3.26 has been used successfully in the industry for

many years and for many products. In our yield analysis, we will assume large area

clustering. As chip sizes approach wafer scale dimensions, however, it is very much






an open question as to what yield model is most appropriate. For very large chips

the relationships between clusters is critical, as it impacts the amount and type of

redundancy needed for fault-tolerance. What essentially happens for very large chip

areas is that there is clustering of clusters. There have been several models proposed

[31, 22, 38] to evaluate WSI or near-WSI yields, but there are no standards as yet.














CHAPTER 4
SYSTEM ARCHITECTURE

4.1 Architectural Overview

Linear systolic arrays can be used to implement a variety of DSP algorithms

such as convolution, FIR filtering, Fourier Transforms and polynomial operations

[33]. Linear arrays can also be used to perform linear algebraic operations such

as matrix-matrix and matrix-vector multiplication [13]. Linear arrays have reduced

input/output requirements compared to two-dimensional arrays and vector arrays;

these requirements remain constant as more processors are added. This is a key point, as large-area

chips are typically I/O constrained. Figure 4.1 depicts the architecture of our pro-

posed system. It consists of sixteen multiply-accumulate processing elements (PEs)

connected to form a fault tolerant linear array. Here, a PE which has failed is by-

passed completely, and replaced by a spare. There is one spare PE per modulus for

both normal and conjugate channels. If more than one PE has failed per modulus,

then the system can still operate with a reduced number of PEs. We note that it is

possible to obtain full utilization of all good processors in a linear array, which, in

general, is not the case in a two-dimensional array.


4.2 Multiply Accumulate PE Architecture

Figure 4.2 depicts the architectural details of the GEQRNS processing ele-

ment, and a die photograph is shown in Figure 5.7. The PE has been optimized

to perform complex multiply-accumulate type operations on both in-place or partial

result data. Two eight-bit operands to be multiplied, x and y, are the exponents of

elements α^x, α^y ∈ GF(p) \ {0}. The y operand bus (Y-bus) supplies data to the

















Figure 4.1: System Architecture

multiplier directly. The x operand bus (X-bus) supplies data to the multiplier and

to the input of the data-storage shift register (DSSR). It is desirable to incorporate

local storage at each PE in the array, since there are some algorithms for which the

data dependency does not permit operands to arrive at each PE via a purely linear

flow (whether in the same or opposing directions). An example of this is matrix

multiplication, which typically employs a 2-D, or vector array. For signal process-

ing, however, matrix multiplication is normally restricted to the case where at least

one matrix is pre-known (usually filter coefficients), or even further to matrix-vector

multiplication (i.e. signal vector). We can thus pre-store the columns of the known









matrix for matrix multiplication (or block multiplication, for large matrices), and lin-

early propagate the rows of the input matrix or signal along the array. This results

in greatly reduced I/O requirements.

In this implementation, the DSSR can be used to store up to sixteen operands

which are known a priori. A shift register was chosen over a SRAM since it requires

less area than a sixteen by eight-bit SRAM (i.e. no decoders or read/write cir-

cuitry) and requires minimal control logic. As it stands, the DSSR is approximately

60% of the size of the multiply-accumulate portion of the PE, which illustrates how

area-expensive programmable storage is. If more storage is desired for a particular

application, then at some point a SRAM will become more area-efficient, even with

the associated increase in overhead. Once data has been loaded into the DSSR, it can

be circulated continuously via an internal feedback path. The X-Bus can be freed for

other global data-move operations when locally stored data is used. Data shifting

in the DSSR can also be halted (which is a significant feature since we are using

dynamic latches). This will greatly simplify array data-flow timing, as processors

further along the linear chain can wait for operands/partial results from previous

processors, with their pre-stored internal operands "lined up in place" and ready for

processing.

Just prior to multiplication, it is necessary to check for a zero operand, since

zero must be handled as an exception in GEQRNS. An unused binary code word is

chosen as the GEQRNS zero (i.e. some value between pi and 2^8 − 1). In this case,

255D was chosen, since zero detection will simply be the logical AND of all data bits.

If a zero is detected for either operand, a flag is raised, which will set the input of

the exponentiation table's pipeline register to zero after the next clock cycle. The

output of the modular adder provides the address input to the exponentiation table.

We note that we could have used a standard binary adder here, and performed the

modular reduction in the ROM. It is much more area efficient, however, to perform





































Figure 4.2: Processor Architecture


the modular reduction first, since the size of the ROM is halved. This is evident from

a comparison of the relative size of the modular adders and ROM in Figure 5.7. A

doubling in ROM area would represent a substantial increase in PE area, whereas our

modular adders are only about 50% larger than a similar word-width binary adder.

The ROM table has the value of α^j mod pi programmed at each corresponding

address location j. The QRNS product is thus obtained at the output of the ROM.
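The exponent-add-then-look-up path can be sketched behaviorally. The modulus p = 13 and generator α = 2 are toy assumptions (the chip uses eight-bit moduli), and the reserved zero code word follows the 255 encoding described above.

```python
# Behavioral sketch of the GEQRNS multiply path: operands are exponents
# x, y of alpha^x, alpha^y in GF(p)\{0}; the product is formed by a
# mod-(p - 1) exponent addition followed by an exponentiation-table
# ("ROM") lookup.  p = 13 and alpha = 2 are assumed toy values.

p, alpha, ZERO = 13, 2, 255

exp_table = [pow(alpha, e, p) for e in range(p - 1)]      # the "ROM"
log_table = {pow(alpha, e, p): e for e in range(p - 1)}   # forward log mapping

def geqrns_mul(x, y):
    if x == ZERO or y == ZERO:        # zero flag forces a zero product
        return 0
    return exp_table[(x + y) % (p - 1)]

# The exponent addition plus table lookup reproduces multiplication mod p:
assert geqrns_mul(log_table[5], log_table[7]) == (5 * 7) % p
```

The exponent addition here is shown with `%`; in the hardware it is the biased mod(pi − 1) adder feeding the exponentiation ROM.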

The computed product is then fed to one input of another modulo adder,

which reduces the computed sum mod(pi). If the second input of this modulo adder

is connected in feedback to its output, an accumulator is formed. This mode is used

for algorithms requiring results to be computed in place. For algorithms requiring









partial results to flow from PE to PE, the second input of the mod(pi) adder is

connected to an adjacent processor. A dedicated bi-directional systolic output bus

is used to transfer computed QRNS results (in place or partial) to the next adjacent

processor in the array, in the natural ordering predicated by the algorithm being

implemented.

4.2.1 Modulo-P Adders

Figure 4.3 depicts the basic construction of the modulo adders used for the

multiplier and accumulator portions of the PE. This is a standard biased-addition

scheme [9], where an offset of value 2^n − pi is added to an n-bit adder to make a

mod(pi) adder (i.e. any n-bit binary adder is intrinsically mod(2^n)). Operation of

this scheme requires that the input words be ∈ {0 ... (pi − 1)}, which is accomplished

during input conversion. The magnitude of pi is also required to be less than 2^n − 1.

Two binary adders are cascaded such that the output of the first adder is input to

the second adder which has an appropriate offset added. Here, eight bit ripple carry

adders are used, and the offset added to the second adder is the value (2^8 − pi). The

correct mod(pi) sum is selected from the outputs of either the first or second adder

via a multiplexer controlled by the logical OR of the carry bits of the first and second

adders.
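A bit-level sketch of this biased-addition scheme follows; the modulus p = 241 is an assumed example value, not one of the chip's moduli.

```python
# Bit-level sketch of the biased-addition modulo adder of Figure 4.3:
# two n-bit adders in cascade, the second with the constant offset
# 2^n - p wired in, and a multiplexer steered by the OR of the carries.

def mod_add(x, y, p, n=8):
    assert 0 <= x < p and 0 <= y < p and p < 2 ** n - 1
    s1 = x + y
    c1 = s1 >> n                     # carry out of the first adder
    s1 &= (1 << n) - 1
    s2 = s1 + (2 ** n - p)           # second adder adds the offset
    c2 = s2 >> n
    s2 &= (1 << n) - 1
    return s2 if (c1 | c2) else s1   # mux select = OR of the two carries

for x, y in [(0, 0), (120, 121), (100, 200), (240, 240)]:
    assert mod_add(x, y, 241) == (x + y) % 241   # p = 241 is assumed
```

If either carry fires, the sum exceeded pi (or 2^n), so the offset path has already performed the single subtraction of pi that the restricted input range guarantees is sufficient.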

In most implementations, standard cell adder modules are used, and the off-

set programmed by hardwiring inputs to low or high as needed. This is somewhat

wasteful as the offset is known a priori, and thus half of the logic of the second adder

need not be included. In this implementation, the offset has been bit programmed

by only implementing the logic corresponding to an added zero or one at each bit

position of the second adder. Basic mod(p) adder primitive cells are then constructed

from a FA and the logic corresponding to offset-zero and offset-one (see Figure 4.4).

The transmission gate adder presented in [42, pp.317-320] was used to implement the

FA circuit, since pass-gate implementations were found to be faster than standard


















Figure 4.3: Standard Modulo P Architecture

Table 4.1: Full Adder truth table.

FA Truth Table
Inputs         Outputs
Ci Ai Bi       Si+1 Ci+1
0  0  0        0    0
0  0  1        1    0
0  1  0        1    0
0  1  1        0    1
1  0  0        1    0
1  0  1        0    1
1  1  0        0    1
1  1  1        1    1


CMOS realizations. These blocks can then be used to construct a mod(p) adder of

arbitrary value. Since the datapath width is only eight bits, a ripple carry scheme

can be used here without speed penalty.

For larger wordwidths, such as needed in the CRT, a carry select scheme can

be used with these same small width ripple carry modulo adders (Figure 4.5). Here,

the larger dynamic range is partitioned into m k-bit sections. The offset value

corresponding to 2^(mk) − M, where M is the desired modulus, is programmed as before

by selecting and arranging the appropriate primitive cells. We can think of these

zero and one primitives as having two input and output carry bits, and a multiplexer

select input. The four possibilities for the input carries can be preprogrammed and

selected by the output carries from the previous k bit section. The total modulo add

time is thus:

























Figure 4.4: Modulo-P Adder Building Block Primitives.



TmodM = Tk-ripple + (m − 1)Tmux4 + TOR + Tmux2 + Tmux4    (4.1)

where Tmux4, TOR and Tmux2 are the propagation delays of the four-input carry
multiplexer, the final OR-gate and the two-input multiplexer (in the primitive cells),
respectively. For typical technologies, these times will all be < 1 ns (including wiring
delays over the distances involved), and the delay for a single primitive cell is 1 ns.
The area required for an eight bit modulo section is 708 λ × 140 λ (283 μm × 56
μm in 0.8 μm technology). Thus, replicating the k bit sections 4(m − 1) times will
consume minimal area.

4.3 Forward Mapping: Integer to GEQRNS

Equation 2.5 and Equation 2.6 describe mappings necessary to convert input
data to and from QRNS, respectively. These equations are implemented as shown
in Figure 4.6 and Figure 4.7, respectively. For the forward mapping, it is necessary
to reduce the real and imaginary components of the input eight-bit data streams



Figure 4.5: Carry Select Modulo Adder.


mod(pi), since the correct operation of the mod(pi) adders requires that the input

operands be ∈ {0 ... pi − 1}. This is accomplished by two 256 by eight-bit ROMs. For

the imaginary part of the input words, multiplication by ĵ and modular reduction

is also accomplished in the same ROM. We note that +ĵ is used for the normal

channel and that −ĵ is used for the conjugate channel. The ROM outputs are then

input to a modular adder to complete the QRNS mapping. The QRNS operand is

then converted to GEQRNS via a final logarithm table, which has the zero encoding

present at address zero. Again, it is more area-efficient to reduce the sum modulo pi

first, rather than perform this operation in the ROM, as the size of the ROM would





















Figure 4.6: Forward-mapping (θ) conversion module with GEQRNS log table

approximately double. We note that four of the modules shown in Figure 4.6 are

needed per modulus (see Figure 4.1).
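The forward path can be sketched behaviorally as follows. The modulus p = 13, generator α = 2 and the derived ĵ are toy assumptions (ĵ^2 ≡ −1 mod p requires p ≡ 1 mod 4), and the zero code word follows the 255 encoding used by the PE.

```python
# Behavioral sketch of the forward mapping of Figure 4.6: reduce the
# inputs mod p, form z = a + j_hat*b (normal channel) and
# z* = a - j_hat*b (conjugate channel), then take GEQRNS logarithms.
# p = 13 and alpha = 2 are assumed toy values; j_hat satisfies
# j_hat^2 = -1 (mod p), which requires p = 1 (mod 4).

p, alpha, ZERO = 13, 2, 255
j_hat = next(j for j in range(2, p) if (j * j) % p == p - 1)   # 5 for p = 13
log_table = {pow(alpha, e, p): e for e in range(p - 1)}

def forward(a, b):
    z  = (a % p + (j_hat * b) % p) % p          # normal channel
    zc = (a % p + ((p - j_hat) * b) % p) % p    # conjugate channel
    to_log = lambda v: ZERO if v == 0 else log_table[v]
    return to_log(z), to_log(zc)
```

In the hardware, the reductions, the ĵ (or −ĵ) multiplication, and the final logarithm are all folded into the ROMs and a single modular adder, as described above.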


4.4 Inverse Mapping: QRNS to Residue

The architecture of the inverse QRNS mapping is shown in Figure 4.7. Input

data arrives from the output of the last PE in the array. Equation 2.6 requires that we

perform a z − z* operation for the imaginary part of the output. This is accomplished

by looking-up the modular complement of z* (i.e. −z* + pi) before adding to the z

term. This keeps the inputs to the modular adder in the desired range for correct

operation (i.e. {0... pi -1}). The result of the modular addition is used to look-up

the value of (2ĵ)^−1 times the input to the ROM, with a final mod pi reduction. The

operation of the real part of the inverse mapping is analogous, except that the

normal and conjugate operands can be added modulo pi directly, and that 2^−1 times

the input is looked-up in the ROM, with a final mod pi reduction.
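A behavioral sketch of the inverse path, matching the forward toy example (p = 13, ĵ = 5, both assumed), is given below; the three-argument `pow` used for modular inverses needs Python 3.8+.

```python
# Behavioral sketch of the inverse mapping of Figure 4.7: the real part
# is 2^-1 (z + z*) and the imaginary part is (2*j_hat)^-1 (z - z*), all
# mod p, with z - z* formed by adding the modular complement p - z*.
# p = 13 and j_hat = 5 are assumed toy values.

p, j_hat = 13, 5
inv2  = pow(2, -1, p)                 # multiplicative inverse of 2 mod p
inv2j = pow(2 * j_hat, -1, p)         # inverse of 2*j_hat mod p

def inverse(z, zc):
    a = (inv2 * ((z + zc) % p)) % p           # real part
    b = (inv2j * ((z + (p - zc)) % p)) % p    # imaginary part
    return a, b

# Round trip through the QRNS mapping z = a + j_hat*b, z* = a - j_hat*b:
assert inverse((3 + j_hat * 4) % p, (3 - j_hat * 4) % p) == (3, 4)
```

As in the hardware, the complement addition keeps both adder inputs in {0 ... p − 1}, and the constant multiplications are table lookups.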


4.5 Chinese Remainder Theorem

The architecture of the CRT is shown in Figure 4.8. Here the real and imag-

inary outputs of each of the modules in Figure 4.7 are grouped and fed to real

and imaginary CRT modules (see Figure 4.1). This is denoted by the xi inputs in















Figure 4.7: Inverse-mapping (θ^−1) conversion module.

Figure 4.8, and likewise in Equation 2.3. We note that the xi terms are the only

"unknowns" in the CRT, as ri;i and r^n1 are pre-known constants. We can thus use

xi to look-up the corresponding expansion terms in the CRT, and perform the de-

sired mod(M) reduction at the same time in four ROMs. We could have used 256 by

31-bit ROMs here (the dynamic range is 30.9 bits), but they would be significantly

slower than their narrower counterparts due to increased internal capacitive loads

(i.e. higher word-line capacitance). It is imperative to keep the delays in the CRT

the same as those in the rest of the system, so as not to degrade overall performance.

The final modular summation is accomplished by a mod(M) adder tree of carry-select

modular adders of the type described in Figure 4.5. Thus, all of the input and out-

put conversion hardware can be implemented with the same basic ROM and modular

adder cells developed for the PE. The computational throughput is thus the same for

all elements in the system.
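The look-up-then-sum structure of the CRT can be sketched behaviorally. The small moduli below are toy assumptions; the chip uses four eight-bit primes giving roughly 31 bits of dynamic range. The three-argument `pow` needs Python 3.8+ for modular inverses.

```python
from functools import reduce

# Behavioral sketch of the CRT reconstruction of Figure 4.8: each
# residue x_i indexes a precomputed "ROM" holding its expansion term
# m_hat_i * (m_hat_i^-1 * x_i mod p_i) mod M, and the terms are summed
# in a mod-M adder tree.  The moduli are assumed toy values.

primes = [13, 17, 29]
M = reduce(lambda a, b: a * b, primes, 1)

ROMS = []
for p in primes:
    m_hat = M // p                       # product of the other moduli
    m_hat_inv = pow(m_hat, -1, p)        # pre-known CRT constant
    ROMS.append([(m_hat * ((m_hat_inv * x) % p)) % M for x in range(p)])

def crt(residues):
    total = 0
    for rom, x in zip(ROMS, residues):
        total = (total + rom[x]) % M     # mod-M adder "tree"
    return total

assert crt([4321 % p for p in primes]) == 4321
```

Since the expansion terms are already reduced mod M inside the ROMs, only modular additions remain at run time, matching the hardware's carry-select adder tree.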



















Figure 4.8: CRT block diagram.














CHAPTER 5
VLSI IMPLEMENTATION OF PROCESSING ELEMENT

5.1 True Single Phase Clocking Scheme

A TSPC edge-based clocking scheme was selected for the implementation of

the processing element. Edge-based clocking was selected over a latch based scheme

for several reasons, the most significant of which is the requirement that the data

should be able to flow in two directions. Bi-directional data flow offers the most

flexibility from an algorithm-mapping standpoint, which is critical for the linear array.

Another reason for the use of an edge based scheme is that the circuitry of the PE is

very heterogeneous from a VLSI perspective. The modular adders are combinational

circuits and the ROM is a CMOS domino-logic circuit. Furthermore, the accumulator

portion of the PE is either a state machine (i.e. new data gets added to the running

sum which is fed-back), or a combinational circuit if partial sums arrive from other

adjacent PEs. We thus have a combinational multiplier (mod-P adder) which drives a

CMOS domino ROM which in turn drives a state machine or another CL block. Also,

the data storage shift register introduces its own set of timing requirements as we

will see subsequently. The clocking scheme of the PE was the dominant design issue,

which required extreme care to implement successfully. The timing requirements

ultimately determined the types of circuits selected. All of the synchronization issues

introduced in Chapter Three came into play for the design. The move from NPTC

to TSPC clocking was the most significant feature which distinguished this version

of the PE from its earlier predecessors. The associated increase in performance was

greater than a factor of two.










(Schematic: transistors M0-M7; internal nodes SXP, SXN, SYP, SYN.)

Figure 5.1: TSPC Pipeline Register

5.1.1 Pipeline Registers

The split-output latch circuit [44] shown in Figure 5.1 was used to implement
all pipeline registers and DSSR cells in this chip. This particular cell exhibits minimal
clock loading in that only two transistor gate loads are seen per stage. This is
ultimately why this particular variant was selected over others in its family, which
have four gate loads per stage [44]. This fact is very important since there are
many gates connected to the clock line in a highly pipelined architecture such as

this. The DSSR contributes the highest portion of capacitance per PE to the clock
line, so halving the clock load per latch is an excellent tradeoff. The split-output
latch is more difficult to implement than its counterparts, and must be designed very
carefully. This will become apparent shortly.
The pipeline registers (flip-flops) are of the negative edge triggered kind.
Negative-edge triggering was chosen because it was most compatible with the timing
requirements of the ROM. A single large inverter can also be used as a clock buffer,
since most external system level clocks are positive-edge triggered.









5.1.2 Data Storage Shift Register

The data storage shift register imposes the strictest timing constraints on the

clock signal. A shift register can be built with the un-buffered (inverting) form of the

split-output latch, if there are an even number of stages. This results in a minimal

transistor-count implementation. As we are interested in building a shift register

which is sixteen levels deep, the use of this configuration is a possible choice here.

There are implicit subtleties, however, in such a scheme. This is because a potential

fast-transition race condition exists for the case where the output of a register directly

drives the input of another register. If the clock-edge transition time is slower than

(or even on the order of) the transition time of the register output, then timing
errors may result. This is demonstrated in Figure 5.2 from SPICE simulations on

the extracted layout of the un-buffered shift-register cell. The simulation represents

the output for a cascade of two cells. The first cell in the chain is driven from a
"stable" data source (bit0), which changes well before the negative-edge of the clock.

The output of the first cell (bit1) drives the second cell whose output is shown in

the third curve (bit2). The clock-edge transition time is swept from 1.75 ns to 3.75

ns, in 0.25 ns increments. At about an input clock-edge transition time of 2.75 ns, a

gradual negative slope can be observed at the output of the second latch, during the

high portion of the clock.

The exact failure mechanism is complex, and can best be understood by

an examination of the fourth and fifth curves. A fast low-to-high transition will

cause transistor M7 to turn off too early (i.e. before M3 fully turns off). This may

cause excess charge from node SYP to be deposited on node SYN (which should

have been fully discharged), thereby causing its potential to rise. If this potential

exceeds the threshold voltage of M6 then an unwanted discharging of the output node

will result. The potential of node SYN is augmented by clock-feedthrough when the

clock transitions from low-to-high, which exacerbates the problem. We note that this









behavior is a function of the clock edge-rate only, and cannot be solved by slowing

the system down. The clock-edge rate will thus have to be kept relatively fast for the

unbuffered case. The total parasitic capacitance on node SYN should be kept as large

as is practical, so that the magnitude of the "hop" induced by clock-feedthrough is

minimized. For this reason, oversized diffusion islands were used on the drains of

transistors M7 and M3 (see Appendix A).

The situation can be ameliorated, however, by introducing some delay between

latches. The most obvious solution is to use "weak" inverters to buffer the outputs.

This was used here, but the idea was taken a step further, yielding a double gain for the

tradeoff in area. A feedback path was added to the DSSR cells, which permits the

data contained in the DSSR to be "frozen" (note the latches are dynamic). From a

data-flow standpoint this is desirable, as we can stop the shift registers in adjacent

processors with operands in the correct "place". This means that we do not have

to zero-pad the stored data for alignment in the shift register. The DSSR is costly

enough in area, and including extra cells just so that they can be zero-padded is an

unacceptable waste. The tradeoff, of course, is more transistors per cell (and another

control signal). However, this was deemed viable during the design phase, as the area

of the DSSR is still smaller than an SRAM based implementation. The final shift

register cell implementation is shown in Figure 5.3. The maximum transition time

of the clock is now approximately 4.25 ns.
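The shift/freeze behavior of the final cell can be captured in a simple behavioral model. The sketch below is illustrative only (the class and parameter names are ours, not taken from the chip): a sixteen-deep register that either shifts or recirculates its contents, analogous to the feedback path added to the DSSR cells.

```python
class DataStorageShiftRegister:
    """Behavioral model of the DSSR: a 16-deep shift register whose
    contents can be frozen in place via a feedback (store) control."""

    def __init__(self, depth=16):
        self.cells = [0] * depth  # cell 0 is the input end

    def clock(self, din=0, freeze=False):
        """One clock cycle: emit the last cell, then shift din in --
        unless freeze is asserted, in which case every cell holds
        (recirculates) its own value."""
        dout = self.cells[-1]
        if not freeze:
            self.cells = [din] + self.cells[:-1]
        return dout
```

Freezing the register with the operands in the correct place is exactly what removes the need to zero-pad the stored data.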


5.2 Exponentiation ROM

Semiconductor memories provide the key computational elements of most RNS

systems and usually determine maximum obtainable operating speed. A ROM was

chosen over a RAM for the Exponentiation table of the PE, since a ROM of a partic-

ular size is a factor of four to six times smaller than a RAM of the same size. There

is also no need for programmability here, as the moduli are fixed. A block diagram

















[SPICE waveform plot — "Failure mode in shift register for early low-to-high input": panels show the clock (PHI), the second-latch output (BIT2) for the buffered and un-buffered cells, and internal nodes X2.SXN and X2.SXP, over 0 to 125 ns.]


Figure 5.2: SPICE simulation of fast transition path for shift register














[Schematic: TSPC shift-register cell with storage feedback path, controlled by the ST (store) signal.]

Figure 5.3: TSPC Shift Register Cell with Storage

of the logical organization of the Exponentiation ROM is shown in Figure 5.4, and

the key circuit elements are shown in Figure 5.5. The ROM is mask programmed

based on the modulus choice, by including a contact to connect a programming

(discharge) transistor to the bit line. Connecting a programming transistor in the ROM

array results in a logical zero at the output of the ROM, for the accessed bit location.

There are eight parallel bit locations accessed per address, thereby producing a byte

of data per address input. There are a total of 256 bytes stored in the ROM. Since

the programming transistors are of minimal width, they will limit operational speed

due to the discharge delay of the bit-line. The ROM logic is partitioned so that the

bit line height can be kept relatively short (which results in reduced parasitic
capacitance). Differential sensing techniques were also used, so that a logical zero could

be determined well before the bit line has fully discharged.
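The table contents themselves are straightforward to generate in software. The following sketch is our own illustration (not the design tooling used for the chip): it builds the exponentiation table for a prime modulus p, with address e holding α^e mod p for a primitive root α.

```python
def build_exp_rom(p):
    """Generate exponentiation-table contents for a prime modulus p:
    address e holds alpha**e mod p, where alpha is the smallest
    primitive root of p."""
    def is_primitive_root(a):
        # a generates the multiplicative group iff a, a^2, ..., a^(p-1)
        # cover all p - 1 nonzero residues
        x, seen = 1, set()
        for _ in range(p - 1):
            x = (x * a) % p
            seen.add(x)
        return len(seen) == p - 1

    alpha = next(a for a in range(2, p) if is_primitive_root(a))
    return alpha, [pow(alpha, e, p) for e in range(p - 1)]
```

Since the exponent arithmetic is carried out modulo p − 1, adding two table addresses corresponds to multiplying the underlying residues.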

Figure 5.5 reveals that the ROM is precharged during the low clock level,

which is just after new data has arrived at the address inputs. The data is stable

during the evaluation (high clock) stage, when the word-line decoders become active.













[Floorplan: two 128 × 8-bit storage arrays flanking a 5:32 word-line decoder, each array with its own block decoder and sense amplifiers.]


Figure 5.4: Floorplan of Exponentiation ROM.




[Schematic: sense amplifier and block decoder serving bit lines BLy through BLy+3, accessed by word line WLx.]


Figure 5.5: Key ROM Circuit Elements.




A precharge PMOS transistor which is controlled by the clock and a "weak" PMOS

device connected in a positive feedback configuration have been included at the sense

amplifier bit-line-voltage-input. The precharge transistor was included so that this

point may quickly charge to Vdd. If there were no PMOS devices here, then the

potential at this node would slowly charge to approximately 4V through two series

NMOS devices of the column decoder (since one of four paths is always active).

The PMOS device in the feedback loop also helps with precharging once the sense

amplifier has "tripped", but was primarily included to maintain a high potential at

this node (for a programmed high), in case of any charge loss caused by leakage paths.

The low cycle duration of the clock can thus be shortened during precharge, so that

more time can be given to the high cycle (where bit-line pull down occurs) without

adding to the total clock period. This is consistent with the notion of "trading delay",

introduced in Chapter Three.

Figure 5.6 shows a SPICE simulation of the operation of the ROM, for the case

of a programmed low. This represents the worst case delay, as a logical high value

is obtained by default from the action of precharging the bit-line. The simulation

was done on an extracted layout of a bit-line programmed with all zeros, as this

is the worst case for parasitic diffusion capacitance. A sense amplifier and the full

word-line decoder were also included, and the parasitic (gate) capacitance of the

maximum possible number of programming transistors per word-line (64) was also

taken into account. The first graph shows the clock input and address line zero,

which becomes active shortly after the clock goes low. We note that even though a

valid address is present, all word-lines are low due to the action of dynamic NAND

gate and inverter combination of the word-line decoders (see Figure 5.5). The output

of the sense amplifier is shown in the second graph. This goes high as expected

during precharge. The potential of the sense amplifier input as well as the bit line

is shown in the third graph. These two points charge at essentially the same rate.









The potential of word-line-one is shown in the fourth graph, which becomes active

after some delay during the evaluate phase of the clock. When the clock becomes

high, the bit line begins to fall. We note the input of the sense amp begins to

transition slightly after the bit line, due to the action of the "weak" PMOS device.

This point soon "catches-up" to the bit line once the sense amp has had time to turn

the feedback device off. The bit-line potential falls as expected. The rise time of the

bit-line is approximately 2.4 ns during precharge and the fall time is approximately

3.0 ns during discharge. The propagation delay taken from the mid-point of the

rising clock edge to the mid-point of the sense-amp output is approximately 5.6 ns.

The simulation was conducted with clock rise and fall times of 2 ns, and with a clock

period of 15 ns. The simulation output suggests, however, that the cycle time could

possibly be shortened to approximately 12 ns.


5.3 Electronic Reconfiguration Switches

In an effort to improve the survivability and manufacturability of large linear

chains of PEs, electronic reconfiguration switches were incorporated into the PE

architecture. These elements are the outermost transmission gates in Figure 4.2.

The layout ground rule used for these elements was less aggressive than that of other

circuits in the chip, since it is desirable that a very high switch yield be exhibited.

For example, a six-lambda minimum metal width and spacing was used (minimum

is three-lambda), in order to reduce the likelihood of shorts and opens, and double

contacts and vias were used between all layers. One author has suggested that extra

or missing material defects greater than two line spacings or widths, respectively, are

considered so rare that they almost never occur in practice [10]. Indeed, it has been

verified consistently for several years that the defect frequency falls off with the cube

of the defect radius [28]; thus small changes in line spacings or widths can greatly impact

survivability. Bypass interconnection lines were also run in metal two, with minimal




















TIMING WAVEFORMS FOR ROM PROGRAMMED ZERO)


.o ...... .. ... .............. ........ I .......... .. V
L I /



I I
N .0 ^--v -/-7----- --v v '-y --: ----

...1..... ... ... ... ... L. .. ..... ..... . .. .. .. ..




0.
ROM2.TRO









L .0 -- -
L
I BLO
2.0


593.921U
ROM2.TRO

4 0. ---------ILI






0. TIME LLIN] 25.ON


Figure 5.6: SPICE simulation of ROM operation.









logic placed underneath these channels, in order to reduce the probability of inter-

layer shorts due to oxide pinholes. In order to truly determine the fault vulnerability

of the reconfiguration switches, it is necessary to have detailed knowledge of the

process defect statistics. For example, it has been suggested [40] that, due to the

use of positive photoresist in most high resolution processes [3], extra material

defects are a factor of ten more likely than missing material defects. This means

that wider spaced, thinner interconnection wires would actually have a higher yield

than thicker, closer spaced wires in this process. Thus, the accurate modeling of the

interconnection yield within the context of the process defect characteristics is crucial.

There are advantages and disadvantages to electronic switches versus physical

restructuring techniques [36]. Physical switches, such as laser programmed links,

have lower on-state resistance compared to electronic switches, and typically exhibit

smaller area since no other logic is needed (i.e. latches) to store configuration infor-

mation. The main drawback with physical switches is that they must be programmed

at the time of manufacture, and thus, cannot correct for faults which occur in the

field. Conversely, electronic switches can correct for run-time failures and can also

compensate for manufacturing defects. In a fully pipelined architecture such as this,

the added switch propagation delay is contained in the pipeline delay, and, therefore,

is less of a design issue. Only one configuration latch is needed for all reconfiguration

switches in the PE, since all busses are switched out simultaneously. It is important to

note that the operation of the reconfiguration circuitry only depends on twenty-four

transistors (i.e., 12 transmission gates (Figure 4.2)) and twenty-four wire segments

per PE.
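Functionally, the reconfiguration scheme reduces to a simple rule: a PE whose configuration latch marks it as bypassed passes its busses straight through. A minimal behavioral sketch of this (the function and parameter names are ours, for illustration only):

```python
def route_through_array(data_in, process, bypassed):
    """Behavioral model of the reconfigurable linear array: each stage
    either applies its PE function, or, if its configuration latch marks
    it bypassed, passes the data straight through on the bypass bus."""
    x = data_in
    for i, skip in enumerate(bypassed):
        if not skip:
            x = process(i, x)  # healthy PE processes the operand
        # bypassed PE: transmission gates route x through unchanged
    return x
```

Because the array is fully pipelined, the extra pass-through delay of a bypassed stage is absorbed into the pipeline delay, as noted above.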


5.4 PE Performance

The processing element shown in Figure 5.7 was fabricated in a 1.5 μm CMOS

process (ORBIT Semiconductor [21]) and occupies a total die area of 2.4 mm × 2.4









mm. The chip represents a full-custom design with the exception of the I/O pad

drivers which were supplied by MOSIS [37]. Data was fed to the chip from an eight-

bit binary counter circuit. Because of the 40 pin limitation on the package, it was not

possible to bring out all 24 input signals (8 x 3) in addition to the power, ground,

control inputs and the 8 outputs needed. Instead, both multiplier inputs were brought

out and the external accumulator input was fed internally from one of the multiplier

inputs (the Y input). It is thus not possible to test the multiplier and accumulator

portions of the chip independently.

The tests consisted of holding the Y-multiplier input constant, while incre-

menting the X-input. In this way the contents of the exponentiation ROM could be

examined. The Y-input was held at a value of zero (decimal zero), which gets added

to the output of the ROM by the accumulator. Both the input signals and output

data were simultaneously examined with a Hewlett-Packard 16500A logic analyzer,

which sampled data at each rising clock transition. The chip clock is produced in-

ternally from a buffered and inverted version of the external clock, so a positive-edge

externally is a negative-edge internally (of course, the phases are flipped also). Recall

that the chip is negative-edge triggered internally. It was possible to verify the pro-

grammed contents of the ROM (see Appendix B). It should be pointed out that this

is the strictest test that could be conducted, as the ROM could fail due to insufficient

precharge or discharge time. Also, since the ROM is addressed by a modulo adder,

the test verifies that the mod adder can sustain the data rate. The observed data was

compared to a mask of the programmed data by the logic analyzer, which was setup

to stop if there was any difference between observed and mask data. This test was

run for approximately three weeks. The chip was run at a clock frequency of 40 MHz

during this test without the logic analyzer stopping. The test was discontinued after

three weeks. A bit-line compare voltage of approximately 4.6V was needed to obtain

this data rate. This suggests that the bit-lines were not fully discharging at this




















































Figure 5.7: Die photograph of processor.



























Figure 5.8: Oscilloscope photo of clock signal and output bit zero.


speed, which further supports the use of differential sensing techniques for flexibility.

The second test involved holding the X-input at zero (which we recall is coded

as 255D), while sequencing the (shared) Y-input. In this way, the operation of

the modular accumulator could be verified. A photograph of the clock signal and

the least significant data output line is shown in Figure 5.8. The least significant

data bit is "statistically" the output that changes the most, and timing problems will

usually be seen here first. Figure 5.8 shows that the output data transitions

several nanoseconds after the rising edge of the clock. This is observed because of

the sum total of I/O delay times. That is, the total delay before the new data can be

presented externally, relative to our fixed reference of the external rising edge, is the

input buffer time TIB plus the output buffer time TOB. There are also other delays,

such as latch propagation delays, in addition to the internal clock buffer

delay. The output seems to lag the rising clock-edge by a time on the order of 10 to

11 ns.

In order to gain a better estimate of the I/O delay times involved, a simple

pass-through test was implemented. This is just an input buffer which directly drives














[Diagram: an input pad buffer (delay TIB) directly driving an inverting output pad buffer (delay TOB).]

Figure 5.9: Pass-through test.

an output buffer (the output buffers are inverting). This is depicted in Figure 5.9

and the actual output is shown in Figure 5.10. The time between the peak of the

input and the low of the output is approximately 5-6 ns. It is interesting to compare

the observed delay time with the simulated delay time. The output of a SPICE

simulation of the pass-through test is shown in Figure 5.11. For the simulation, a

load capacitance of 10 picofarads was used to model the capacitance seen at the

output of the chip. It was not possible to actually measure this capacitance, but

10 pF is a reasonable estimate (the scope probe is 7 pF). The simulated delay time

for a high-in to low-out is approximately 3.1 ns and, for a low-in to high-out,

approximately 4.4 ns. This suggests that the measured delays are on the order of

25-40% higher than the simulated ones. Although this cannot be generalized for the rest of the

circuits in the chip, it still provides a crude measure of the variability between actual

and simulated performance.























































Figure 5.10: Oscilloscope photo of pass-through test output.













[SPICE waveform plot: input and output waveforms of the pass-through test, over 0 to 25 ns.]





Figure 5.11: SPICE simulation of pass-through test.









5.5 Early Versions of the Processing Element

5.5.1 Version One

Figure 5.12 details the architecture of the first version of the PE and a die pho-

tograph is shown in Figure 5.13. This chip did not use the modular adder structure

described earlier, but rather, computed the modular reduction of output operands in

ROMs. The two input operands were "multiplied" in a standard binary adder and fed

to the exponentiation ROM where they were reduced modulo p − 1 and the generator,

α, raised to the corresponding power (all done as a single lookup). A modular adder

was made for the accumulator portion of the PE from a standard adder and a final

ROM table to perform the modulo p reduction. The PE could only be programmed

for seven-bit moduli since, as described in Section 4.2, the result of performing the

modular reduction in a ROM table after an addition, is an effective doubling of the

ROM size (i.e. the sum of two k-bit numbers is a k + 1 bit number).

The modular adders produced from a binary adder and a look-up table were

two pipeline stages deep (see Figure 5.12). For the accumulator there is an implicit

subtlety here. If this structure is used, then odd and even indexed terms in a sum-

mation will be accumulated independently. That is, two partial sums will be formed,

one for the even terms and one for the odd terms, which cannot directly be summed.

This requires a final summation externally to the PE to complete the accumulation.

This fact forced the development of cells that would support the implementation of

a single-cycle modulo adder, without compromising speed. At the time, the only

cells available [20] to build the modular adders were four-bit carry-lookahead adders

which would not have operated fast enough to support an un-pipelined cascade. The

bit-programmed modular adder cells described earlier, thus significantly impacted

the design.
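A single-cycle modulo adder can, in principle, be realized by forming the (k + 1)-bit binary sum and conditionally subtracting the modulus, which avoids the second ROM lookup entirely. A behavioral sketch (illustrative only — not the exact logic of the bit-programmed cells):

```python
def mod_add(x, y, p):
    """Single-cycle modulo-p addition of two residues in [0, p):
    form the (k+1)-bit binary sum, then conditionally subtract p,
    so no reduction ROM (and no doubling of table size) is needed."""
    assert 0 <= x < p and 0 <= y < p
    s = x + y                    # sum of two k-bit values fits in k+1 bits
    return s - p if s >= p else s
```

In hardware the two candidate results (s and s − p) can be computed in parallel and selected by the carry/borrow, keeping the operation to one cycle.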

The first version of the PE also employed non-overlapping pseudo two phase

clocking. A photograph of the clocking waveforms is shown in Figure 5.14. In this









figure, the vertical axis is 10x, the lower trace is φ1 and the upper trace is φ2. The

pipeline registers were composed of a cascade of two transparent latches (just as

described in Chapter Three). Data was input to the first latch during the high phase

of φ1 and was presented to the logic circuits on the rising edge of φ2. The ROM was

precharged during the high phase of φ2 (i.e. PMOS precharge devices were controlled

by φ2) and was evaluated on the low phase of φ2. Output data was then latched on

the falling edge of φ1. It was actually necessary to make a third clock signal to avoid

a race condition in the ROM word-line decoder, by delaying φ2 slightly. The early

ROM employed a dynamic NOR-gate decoding structure rather than the dynamic

NAND-gate used in the final version. The NOR gate was gated (ANDed) with the

delayed version of φ2, so that the word-lines would be low during precharge. The

gating clock signal had to be delayed so that a momentary glitch would not occur on

the word-lines just after precharge, before all but one of the NOR gates discharged. This

complication was eliminated with the NAND structure (although the dynamic NOR

is intrinsically a faster gate). The first PE was fabricated in a 2 μm CMOS technology

[37] and was shown to operate at a clock frequency of 16 MHz.

5.5.2 Version Two

The second revision of the PE was the first to use the modular adder scheme

in its accumulator. Its architecture is given in Figure 5.15, and a die photograph

is shown in Figure 5.16. The removal of the second ROM resulted in a significant

decrease in area. A modulo adder was not used as the "multiplier" portion of the

chip, due to vertical space limitations in the MOSIS Tiny-Chip [37] pad-frame. This

PE thus also used seven-bit moduli. The second chip was also an NPTC machine,

which employed the same ROM and pipeline latches as its predecessor. This chip

was also fabricated in the MOSIS 2 μm CMOS technology and was shown to operate

at 16 MHz.














































Figure 5.12: Processor architecture of first version.




















































Figure 5.13: Die photograph of first version of PE.































Figure 5.14: Oscilloscope photo of non-overlapping clocks for first chip.




Figure 5.15: Processor architecture of second version.

















































Figure 5.16: Die photograph of second version of PE.















CHAPTER 6
YIELD ENHANCEMENT AND FAULT TOLERANCE

6.1 Yield Enhancement via Reconfiguration

It is evident from Chapter Three that large area integrated circuits must

have some measure of fault tolerance. For highly structured architectures, which

have identical replicated cells, it is sometimes possible to include redundant modules

of which m out of n must function in order for the chip to be considered usable. In

the past, many authors have considered the reconfiguration yield to be unity and

that all failures could be corrected for by the reconfiguration scheme. Assuming that

the reconfiguration yield is 1 is increasingly being shown to be a bad assumption,

particularly if the system area is large. What is really assumed is that m out of

n cells are free from defects which affect reconfigurable nets and that all n cells are

free from defects that affect non-reconfigurable nets. Actually, we must also consider

defects that affect global signals such as power and ground, as they cannot directly

be reconfigured for. This requires a very detailed consideration (layout extraction)

of susceptible areas on a net by net basis, however, which is more suitable for CAD

tools. For our purposes we can consider our PE area as being composed of two

components:


A_PE = A_PE-r + A_PE-n    (6.1)

where A_PE-n is the area of the bypass and interconnection busses and A_PE-r

is the area of the remaining computational-logic portions of the PE.

As mentioned previously, the layout ground rule used for the bypass elements

was less aggressive than that of other circuits in the chip. Since the complexity









of these elements is greatly reduced over that of other portions of the circuit, an

argument could be made that the effective fatal defect density of the switches is

reduced over that of the rest of the circuit (i.e. fewer applicable defect mechanisms).

However, in order to truly quantitatively determine the relative fault vulnerability of

reconfiguration switches, it is necessary to have detailed knowledge of mask-by-mask

process defect statistics. Since we do not have such statistics, we will not speculate,

and will consider all areas of the PEs at the same defect density. Our estimates will

thus be slightly conservative. Since we are assuming large area clustering, the faults

in adjacent modules are dependent (uniformly distributed), and we cannot use simple

Binomial expressions for determining if M out of N modules are functioning (these

assume faults are independently distributed).

Systolic arrays in which R spare PEs have been added, where at least M =

N- R out of N must function, have been proposed before. The independent, parallel

nature of RNS channels, however, provides further unique opportunities for enhanc-
ing fault and defect tolerance of systolic arrays. For our proposed four modulus

system, each processing node in the array is effectively broken up (algorithmically

and physically) into four smaller PEs via the RNS mapping. This would suggest that
the inclusion of an extra PE per modulus would permit up to four faults to be tol-

erated per channel, rather than just one, when compared to a similar system (i.e. in

dynamic range and arithmetic functionality), in which the computations are carried

out over one physically contiguous PE. The requirement, here, is that no more than

one fault occurs per modulus. If there is more than one fault per modulus, then the
number of usable PEs will be less than M.

Koren and Stapper [11] have presented a yield model for chips with redun-

dancy consisting of multiple module types, in which the failures in adjacent modules

are dependent. We can consider our proposed linear array chip as fitting this cate-

gory, where there are eight different module types (two × four moduli), where M out









of N PEs must function, and there are R spares for each module type. The following

expression describes the total yield:


$$
Y = \sum_{M_1=N_1-R_1}^{N_1} \cdots \sum_{M_8=N_8-R_8}^{N_8}\;
\sum_{k_1=0}^{N_1-M_1} \cdots \sum_{k_8=0}^{N_8-M_8}
(-1)^{k_1} \cdots (-1)^{k_8}\,
\prod_{i=1}^{8} \binom{N_i}{M_i}\binom{N_i-M_i}{k_i}
\left[1 + \frac{(M_1+k_1)\lambda_1 + \cdots + (M_8+k_8)\lambda_8 + \lambda_{CK}}{\alpha}\right]^{-\alpha}
C_{M_1,\ldots,M_8}
\qquad (6.2)
$$

where the λ_i terms represent the number of faults in each PE and λ_CK is the

number of faults in the "chip-kill" area. The "coverage factor" term in the above

expression, C_{M1,M2,...,M8} = 1 if the chip is acceptable with M1, M2, ...,

M8 fault-free modules of types 1 ... 8. Otherwise, C_{M1,M2,...,M8} = 0. This

term provides the means of counting those terms in Equation 6.2 which are fixable.

We note that all combinations are fixable since the moduli are physically discrete and

the system is operable if there are at least M PEs surviving in each modulus.

The number of faults in each module λ_i is related to the number of faults in

the total chip, λ, by the following:


λ_i = (A_i / A_total) · λ    (6.3)
For our proposed system, the corresponding parameters are:



N_i = 17

R_i = 1

λ_i = λ_PE-r

λ_CK = λ_I/O pads + λ_out conv + λ_CRT + λ_in conv + 8 × 17 × λ_PE-n

(6.4)

and C_{M1,M2,...,M8} = 1 for M_i ∈ {16, 17}, since all of these combinations

are fixable. We note that the chip kill area contains the area of all input

and output conversion circuits, the CRT and I/O pads as well as the area of the

reconfiguration elements in the PE.

6.1.1 Yield Estimates

We will consider the case of no redundancy first, and then compare this to the

reconfigured yield. For simplicity, assume average fatal defect densities of 1, 1.5 and

2.0 per cm². Area values are based on a 0.8 μm CMOS technology (where λ = 0.4

μm). The areas of the input and output conversion modules are based on floorplans

consisting of cells developed and fabricated in the PE, and are λ-scalable to the

0.8 μm process. Thus, the area estimates presented are realistic, as they are based

on existing cells. Table 6.1 depicts the areas of the major system components. It

can be seen that the PE area dominates the bulk of the total system area (78.1%);

therefore, this is the most logical place to apply fault tolerance. The yields for no

redundancy are presented in Table 6.2, for various values of the clustering parameter

α. Equation 3.26 has been used to obtain the values in Table 6.2.
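The non-redundant values follow directly from the negative binomial yield expression with large-area clustering. A one-line check, assuming Equation 3.26 has the standard form Y = (1 + D0·A/α)^(−α), with A = 1.393 cm² taken from the component-area totals:

```python
def yield_nonredundant(d0, alpha, area=1.393):
    """Negative binomial (large-area clustering) yield, Eq. 3.26:
    Y = (1 + D0 * A / alpha) ** (-alpha),
    with D0 in fatal defects per cm^2 and chip area A in cm^2."""
    return (1.0 + d0 * area / alpha) ** (-alpha)
```

For example, D0 = 1 per cm² with α = 0.25 gives a yield of about 62.5%, matching the table.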

Let us now consider the case for the redundant sixteen processor array. The

total area of each processing element is 0.008464 cm2, and the area of the switching

elements in the PE is 0.002208 cm2 (ApE-n). The difference between these two

areas (0.006256 cm2) represents the area of the PE which can tolerate a circuit fault

(ApE-r). We have considered the case for one added redundant PE since this is
the minimum amount of redundancy that can be added to the array, and since the

switching out of more than one defective PE adds to the critical path delay, as each

bypassed PE adds the delay of two series transmission gates plus interconnect delay











Table 6.1: Table of System Component Areas.


Areas of System Components
Module                          Area [λ × λ]    Area [cm²]   Number   Total Area [cm²]
Input Conversion (Projected):
  x_r + j·x_i                   1500 × 1500     0.0036       8        0.0288
  y_r + j·y_i                   1500 × 1500     0.0036       8        0.0288
  Log                           1150 × 750      0.0014       16       0.0224
Output Conversion (Projected):
  2^-1(z + z*)                  1550 × 800      0.0020       4        0.0080
  (2j)^-1(z − z*)               2700 × 850      0.0037       4        0.0148
  CRT                           6800 × 5600     0.0609       2        0.1218
PE (Actual):
  Normal                        2300 × 2300     0.0085       64       0.544
  Conjugate                     2300 × 2300     0.0085       64       0.544
I/O Drivers
  (Actual, non-scalable)        —               0.0004       200      0.08
Chip Total                                                            1.393


Table 6.2: Table of Non-Redundant Chip Yields.

Projected Non-Redundant Chip Yields
   α      Average Fatal Defect Densities per cm²
            1.0       1.5       2.0
  0.25    0.6246    0.5717    0.5357
  0.50    0.5139    0.4394    0.3901
  0.75    0.4550    0.3684    0.3125
  1.00    0.4179    0.3237    0.2641
  2.00    0.3474    0.2392    0.1746









(see Figure 4.2). As we will show shortly, just one extra processor will significantly
improve the system yield. Equation 6.2 thus becomes:


$$
Y = \sum_{M_1=16}^{17} \cdots \sum_{M_8=16}^{17}\;
\sum_{k_1=0}^{17-M_1} \cdots \sum_{k_8=0}^{17-M_8}
(-1)^{k_1} \cdots (-1)^{k_8}\,
\prod_{i=1}^{8} \binom{17}{M_i}\binom{17-M_i}{k_i}
\left[1 + \frac{(M_1+k_1)\lambda_1 + \cdots + (M_8+k_8)\lambda_8 + \lambda_{CK}}{\alpha}\right]^{-\alpha}
\qquad (6.5)
$$

Equation 6.5 was used in a computer program (see Appendix C) along with
the previous values of defect densities and α, to produce Table 6.3. We see that the

yield of the array changes from 62.46% to 72.53% in the best case (with D0 = 1

per cm² and α = 0.25), which is a 16.1% improvement. For the worst case (with

D0 = 2 per cm² and α = 2), the yield changes from 17.46% to 35.87%, which is a

105.4% improvement. The added area overhead associated with the redundancy is

approximately 5%. Thus, this is a worthwhile tradeoff in both cases, and in particular for the

worst case, in which the yield is doubled for a 5% increase in area. Equation 6.5

was extended to arrays of up to 32 processors. For the best case above, the yield of

non-redundant vs. redundant is 55.04% and 67.33% respectively, which is a 22.32%

increase. For the worst case, the results become 8.34% and 23.38%, a 180.3% increase.

The area overhead for one redundant processor for the 32 PE case is approximately 2.7%, again a

worthwhile tradeoff. Any redundancy not used for yield enhancement at the time of
manufacture, can be used for fault tolerance in the field, if a failure occurs in a PE.

Since the PEs consist of 78% of the active area of our system, this will most often be
the type of failure that occurs.
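The evaluation of Equation 6.5 can be sketched as an inclusion-exclusion sum over per-channel survivor counts. This is our own minimal re-implementation, not the Appendix C program; the per-PE area, chip-kill area, and channel count below are illustrative values roughly consistent with the area table (PE interconnect and bypass wiring are ignored):

```python
from itertools import product
from math import comb

def array_yield(n_ch, n_pe, n_spare, a_pe, a_ck, d0, alpha):
    """Yield of a chip with n_ch identical channels, each needing n_pe
    working PEs out of n_pe + n_spare, per-PE area a_pe [cm^2], plus a
    non-redundant 'chip-kill' area a_ck, under the negative-binomial
    defect model (inclusion-exclusion form of Eq. 6.5)."""
    n_tot = n_pe + n_spare
    # Per-channel terms: (inclusion-exclusion coefficient, PEs charged to area)
    terms = [((-1) ** k * comb(n_tot, m) * comb(n_tot - m, k), m + k)
             for m in range(n_pe, n_tot + 1)
             for k in range(n_tot - m + 1)]
    y = 0.0
    for combo in product(terms, repeat=n_ch):
        coeff = 1
        area = a_ck
        for c, mk in combo:
            coeff *= c
            area += mk * a_pe
        y += coeff * (1.0 + area * d0 / alpha) ** (-alpha)
    return y

# Illustrative: 8 channels of 16 PEs at 0.0085 cm^2 each; the remaining
# 0.305 cm^2 of the 1.393 cm^2 chip is treated as chip-kill area.
y_nr = array_yield(8, 16, 0, 0.0085, 0.305, 1.0, 0.25)  # no spares
y_r  = array_yield(8, 16, 1, 0.0085, 0.305, 1.0, 0.25)  # one spare per channel
```

With no spares, the sum collapses to the single negative-binomial term for the full 1.393 cm^2 chip (≈ 0.6246), and adding one spare per channel raises the yield, in line with Table 6.3; the exact redundant values differ slightly because the bypass wiring overhead is omitted here.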










Table 6.3: Table of Redundant Chip Yields.

           Projected Redundant Chip Yields

    α     |  Average Fatal Defect Density D0 [per cm^2]
          |      1.0        1.5        2.0
  --------+--------------------------------------------
   0.25   |    0.7253     0.6706     0.6318
   0.50   |    0.6588     0.5798     0.5236
   0.75   |    0.6259     0.5323     0.4657
   1.00   |    0.6060     0.5028     0.4290
   2.00   |    0.5703     0.4472     0.3587


Yield curves for the best and worst case manufacturing parameters have been

plotted in Figure 6.1. The top two curves represent the redundant and non-redundant

best case, respectively. The bottom two curves represent the redundant and non-

redundant worst case, respectively.

At this point, some cautions are in order. We must be very careful when

interpreting the increase in yield of redundant vs. non-redundant chips. Ultimately,

all that really matters from a manufacturing standpoint is the number of good chips

that leave the foundry. By including extra circuitry for redundancy, we have at the

same time diminished the number of chips that are made. That is, the area of a

redundant chip is larger than that of a non-redundant one, which translates into

fewer chips per wafer. Since only good chips can be sold, we must be very careful to

guarantee that the added redundancy does not cause too much product loss. If we

are not careful, we could actually wind up losing money with redundant chips. In

the following equations, we will represent the yield of non-redundant chips by YNR,

and the yield of redundant chips by YR. The actual number of good chips for the

non-redundant case will be denoted by KNR and for the redundant case, KR. The

total number of chips fabricated for non-redundant and redundant cases is NNR and

NR, respectively. Finally, the percentage of product loss, due to redundancy, will be

denoted by PL. Consider the following:










[Plot: yield (0.1 to 0.8) versus number of PEs (16 to 32); four curves as described above.]

Figure 6.1: Yield Curves for Various Length Arrays (Scheme 1).






$$
\begin{aligned}
Y_{NR} &= \frac{K_{NR}}{N_{NR}}, \qquad Y_R = \frac{K_R}{N_R} \\
N_R &= (1 - P_L)\,N_{NR} \\
K_R &\;\overset{?}{>}\; K_{NR} \\
Y_R \times N_R &\;\overset{?}{>}\; Y_{NR} \times N_{NR} \\
Y_R \times (1 - P_L) &\;\overset{?}{>}\; Y_{NR}
\end{aligned}
\tag{6.6}
$$

Equation 6.6 must always be satisfied if the chosen redundancy scheme is actually to produce more working chips. We will assume that the percentage increase in area for redundant chips translates into the same percentage of product loss. The accuracy of this assumption ultimately depends on how many rectangular chips can fit on a round wafer. For the most part, though, it is a good assumption, since a few hundred chips will typically fit on a commercially sized wafer (6 to 8 inch diameter). For our 16-processor example, there is thus a 5% product loss.

For the best and worst cases, we get


$$
\begin{aligned}
(0.95) \times Y_{R,\,best} &\;\overset{?}{>}\; Y_{NR,\,best} \\
(0.95) \times (0.7253) &> 0.6246 \\
0.6890 &> 0.6246 \\[1ex]
(0.95) \times Y_{R,\,worst} &\;\overset{?}{>}\; Y_{NR,\,worst} \\
(0.95) \times (0.3587) &> 0.1746 \\
0.3407 &> 0.1746
\end{aligned}
\tag{6.7}
$$
Thus, we obtain more working chips in each case, which supports the chosen

redundancy scheme for the linear array.
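The product-loss test of Equations 6.6 and 6.7 amounts to a single comparison; a one-line sketch (the function name is ours):

```python
def redundancy_pays(y_r, y_nr, p_l):
    """Eq. 6.6: redundancy produces more good chips only if the
    redundant yield, scaled by the product loss, beats the
    non-redundant yield: Y_R * (1 - P_L) > Y_NR."""
    return y_r * (1.0 - p_l) > y_nr

# 16-PE example with 5% product loss (Eq. 6.7):
best  = redundancy_pays(0.7253, 0.6246, 0.05)  # 0.6890 > 0.6246 -> True
worst = redundancy_pays(0.3587, 0.1746, 0.05)  # 0.3407 > 0.1746 -> True
```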


6.2 A Comparison with Replacing Moduli

An alternative redundancy scheme will be considered in this section. Sup-

pose that fault tolerance was introduced into the system by including an additional

modulus in the array, rather than an extra PE per modulus. This would have the

effect of partitioning the system differently with respect to fault tolerance. The new

"modules" would consist of a set of both normal and conjugate channels (for each

modulus), together with the input and output conversion modules. This is depicted

in Figure 4.1 by the dotted lines around each modulus. The CRT would now have to

be reconfigurable, as the dynamic range of the system would change when a defective

module is switched out and a spare switched in. If the system were configured in

this way, then faults could be tolerated in the input and output conversion modules, as well as in the PEs. The only "chip-kill" area is that composed of the CRT

(which now is larger) and the I/O pads. The system would now consist of a single








"module-type", of which there are four total for the non-redundant case and five for

the redundant. There are now five fixable combinations, compared to eight as before.

Since there is only a single module type, Equation 6.2 reduces to:


$$
Y = \sum_{M_1=N_1-R_1}^{N_1}\;\sum_{k_1=0}^{N_1-M_1}
    (-1)^{k_1}\binom{N_1}{M_1}\binom{N_1-M_1}{k_1}\,C_{M_1}
    \left[1+\frac{\bigl((M_1+k_1)A_1+A_{CK}\bigr)D_0}{\alpha}\right]^{-\alpha}
\tag{6.8}
$$

For this scheme, the corresponding parameters are:



$$
\begin{aligned}
N_1 &= 5 \\
R_1 &= 1 \\
A_1 &= 2\times\Bigl(A_{PE} + \tfrac{1}{2}A_{PEint}\Bigr)\times 16
       + 4\times\bigl(A_{InConv1} + A_{Log1}\bigr) + A_{OutConv1} \\
A_{CK} &= A_{IOpads} + A_{CRT_{new}}
\end{aligned}
\tag{6.9}
$$

with C_{M_1} = 1 for M_1 ∈ {4, 5}. We have divided the area of the PE interconnections by two since we do not bypass PEs in this scheme, and thus need only half of the wires. A_InConv1 is the area of one forward QRNS mapping module and A_Log1 is the area of a single log table. There are four of each of these per reconfigurable module. There is one inverse QRNS block per reconfigurable module, the area of which is denoted by A_OutConv1. We will assume the I/O overhead is the same as before. Finally, we need to obtain a value for the area of the reconfigurable CRT. Approximately 2/3 of the CRT is composed of ROMs. Since the new CRT will have to be re-programmable (i.e., to change the overall M), this section will need to be









RAM-based. RAMs occupy about four to six times the area of ROMs (as stated earlier). We will use a multiplication factor of four to estimate the area of a RAM-based implementation. The carry-select mod(M) adders used in the CRT will also have to be reconfigurable. We will not be able to use the bit-programming scheme described in Chapter Four, and thus will increase the complexity by 25%. The new variable offsets will also have to be stored in registers. We will again be very generous and assume that the total complexity of the modulo adder portion of the CRT increases by 50%. The overall area multiplication factor for the reconfigurable CRT is thus (2/3) × 4 + (1/3) × 1.5 ≈ 3.166. If we substitute the redundancy parameters defined above into Equation 6.8, the following expression results:


$$
Y = \sum_{M_1=4}^{5}\;\sum_{k_1=0}^{5-M_1}
    (-1)^{k_1}\binom{5}{M_1}\binom{5-M_1}{k_1}\,C_{M_1}
    \left[1+\frac{\bigl((M_1+k_1)A_1+A_{CK}\bigr)D_0}{\alpha}\right]^{-\alpha}
\tag{6.10}
$$

The area values for the new system were substituted into Equation 6.10, which was evaluated with a computer program (see Appendix C). For our sixteen-processor example, the best-case non-redundant vs. redundant yield is 60.80% vs. 69.27%. For the worst-case manufacturing parameters, the yields are 15.04% vs. 26.18%. How do these values fare once product loss is accounted for? The overhead associated with this redundancy scheme is approximately 41%. Thus, after scaling the redundant yields by unity minus this amount, we get (1 − 0.41) × 69.27 = 40.87 < 60.80 for the best case, and (1 − 0.41) × 26.18 = 15.44 > 15.04 for the worst case. For the best case of manufacturing parameters we have actually lost money by including redundancy; that is, we would have obtained more working chips if we had done nothing! For the worst case, the gains are questionable, since the margins are so low. The yield curves for best and









worst cases are plotted in Figure 6.2 for 16- to 32-processor systems. The areas of chips using the first and second redundancy schemes are plotted in Figure 6.3. We point out that the areas of chips using the first scheme are less than those using the second; thus the first scheme always wins. Finally, the product-loss-adjusted yields are plotted for both schemes for the best and worst cases. Figure 6.4 and Figure 6.5 are plots for the best and worst cases, respectively, for the first scheme. These curves illustrate that we always benefit from this scheme for these manufacturing parameters. Figures 6.6 and 6.7 show the curves for the second scheme. Figure 6.6 shows that we always lose money in this case, while Figure 6.7 shows that, as the array size grows, we gain more from the implemented redundancy. This is because the percent-area-overhead associated with the redundancy decreases as the array grows (in both schemes).

The results of this analysis suggest that it is better to implement redundancy within the moduli rather than between the moduli. Thus, the redundancy scheme used in the proposed system is justifiable. In reconfigurable systems, the amount of redundancy must be kept as small as possible to obtain the most benefit. In much of the RNS literature, the point is commonly made that fault tolerance can be achieved by including extra moduli; in the context of this analysis, however, that point is questionable. Finally, it should be pointed out that if the moduli are very small, such that the inclusion of extra channels does not significantly increase the chip area, then the analysis may be more favorable.
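The scheme-2 numbers above can be checked with the same product-loss scaling used for scheme 1; a short sketch (the 41% overhead is the approximation from the text, and the function name is ours):

```python
def adjusted_yield(y_r, p_l):
    """Product-loss-adjusted redundant yield: Y_R * (1 - P_L)."""
    return y_r * (1.0 - p_l)

# Scheme 2 (extra modulus), 16-PE example, ~41% product loss:
best  = adjusted_yield(0.6927, 0.41)  # ~0.4087: below  Y_NR = 0.6080 -> a net loss
worst = adjusted_yield(0.2618, 0.41)  # ~0.1544: barely above Y_NR = 0.1504
```

The best case fails the Equation 6.6 test outright, and the worst case passes only marginally, which is why the within-moduli scheme is preferred.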


6.3 Detecting Faults

The detection of failed processors in the chosen scheme must be done off-line. The test procedure combines bypassing, to isolate the columns in which failed PEs lie, with the supplying of specific test vectors, to determine which row of each column has failed. For example, all columns in the array can be bypassed except



