A FAULT TOLERANT GEQRNS PROCESSING ELEMENT
FOR LINEAR SYSTOLIC ARRAY DSP APPLICATIONS
By
JEREMY C. SMITH
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
1994
Copyright 1994
by
Jeremy C. Smith
ACKNOWLEDGEMENTS
I would like to thank my advisor Dr. Fred J. Taylor for providing me with the
means and the environment necessary for getting this work done, and for allowing
me the freedom to pursue the research directions I saw fit. Special thanks also go to
Dr. Graham A. Jullien for serving on my Ph.D. committee from such a great distance
away. I would also like to thank Dr. Mark E. Law, Dr. Jose C. Principe and Dr.
Bernard A. Mair for serving on my committee.
I would also like to thank my parents, whose early preparation and dedication
have made my accomplishments possible.
I would, especially, like to thank Diane for her patience, dedication and love,
which were all necessary ingredients for the completion of this dissertation.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTERS

1 INTRODUCTION

2 THEORY
  2.1 The Quadratic Residue Number System (QRNS)
  2.2 The Galois-Enhanced QRNS (GEQRNS)
  2.3 Dynamic Range
  2.4 Example

3 KEY ARRAY PROCESSOR IMPLEMENTATION ISSUES
  3.1 Synchronization
    3.1.1 Traditional Approaches
    3.1.2 True Single Phase Clocked Systems
  3.2 Synchronous vs. Asynchronous Systems
  3.3 Fundamental Manufacturing Limitations
    3.3.1 Defect Size Distribution
    3.3.2 Defect Spatial Distribution and Yield Models

4 SYSTEM ARCHITECTURE
  4.1 Architectural Overview
  4.2 Multiply Accumulate PE Architecture
    4.2.1 Modulo-P Adders
  4.3 Forward Mapping: Integer to GEQRNS
  4.4 Inverse Mapping: QRNS to Residue
  4.5 Chinese Remainder Theorem

5 VLSI IMPLEMENTATION OF PROCESSING ELEMENT
  5.1 True Single Phase Clocking Scheme
    5.1.1 Pipeline Registers
    5.1.2 Data Storage Shift Register
  5.2 Exponentiation ROM
  5.3 Electronic Reconfiguration Switches
  5.4 PE Performance
  5.5 Early Versions of the Processing Element
    5.5.1 Version One
    5.5.2 Version Two

6 YIELD ENHANCEMENT AND FAULT TOLERANCE
  6.1 Yield Enhancement via Reconfiguration
    6.1.1 Yield Estimates
  6.2 A Comparison with Replacing Moduli
  6.3 Detecting Faults

7 CONCLUSIONS AND FUTURE WORK

APPENDICES

A VLSI CELL LAYOUTS
B OBSERVED CHIP DATA
C COMPUTER PROGRAMS

REFERENCES
BIOGRAPHICAL SKETCH
LIST OF TABLES
2.1 Table of Maximum Dynamic Ranges for Eight to Four Modulus Systems.
2.2 Table of Maximum Inner Product Lengths.
2.3 Log-Antilog Table for p_1 = 5.
2.4 Log-Antilog Table for p_2 = 13.
4.1 Full Adder truth table.
6.1 Table of System Component Areas.
6.2 Table of Non-Redundant Chip Yields.
6.3 Table of Redundant Chip Yields.
LIST OF FIGURES
3.1 Single Phase Latch System.
3.2 Double-latch non-overlapping pseudo two phase system.
3.3 High performance non-overlapping pseudo two phase system.
3.4 Delay transformation model.
3.5 Transparency problem introduced by complementary phase of clock.
3.6 System model for TSPC edge based clocking.
3.7 Timing waveforms for TSPC edge based clocking.
3.8 System model for TSPC latch based clocking.
3.9 Timing waveforms for TSPC latch based clocking.
3.10 Potential skew hazard with TSPC scheme [44].
3.11 Clocking against the data flow to exploit skew [44].
3.12 Non-local communication problem solution [44].
3.13 SEM photographs of defects from early PE fabrications.
3.14 Defect density vs. defect radius.
3.15 Critical area for Parallel Conductors.
4.1 System Architecture.
4.2 Processor Architecture.
4.3 Standard Modulo-P Architecture.
4.4 Modulo-P Adder Building Block Primitives.
4.5 Carry Select Modulo Adder.
4.6 Forward-mapping (θ) conversion module with GEQRNS log table.
4.7 Inverse-mapping (θ^{-1}) conversion module.
4.8 CRT block diagram.
5.1 TSPC Pipeline Register.
5.2 SPICE simulation of fast transition path for shift register.
5.3 TSPC Shift Register Cell with Storage.
5.4 Floorplan of Exponentiation ROM.
5.5 Key ROM Circuit Elements.
5.6 SPICE simulation of ROM operation.
5.7 Die photograph of processor.
5.8 Oscilloscope photo of clock signal and output bit zero.
5.9 Pass-through test.
5.10 Oscilloscope photo of pass-through test output.
5.11 SPICE simulation of pass-through test.
5.12 Processor architecture of first version.
5.13 Die photograph of first version of PE.
5.14 Oscilloscope photo of non-overlapping clocks for first chip.
5.15 Processor architecture of second version.
5.16 Die photograph of second version of PE.
6.1 Yield Curves for Various Length Arrays (Scheme 1).
6.2 Yield Curves for Various Length Arrays (Scheme 2).
6.3 System Areas for Scheme 1 (lower) and Scheme 2 (higher).
6.4 Adjusted Yield Curves for Scheme 1 (Best Case).
6.5 Adjusted Yield Curves for Scheme 1 (Worst Case).
6.6 Adjusted Yield Curves for Scheme 2 (Best Case).
6.7 Adjusted Yield Curves for Scheme 2 (Worst Case).
A.1 ROM Word-line Decoder Cell.
A.2 ROM Sense Amplifier.
A.3 ROM Programming Matrix.
A.4 Mod-P Building Block (Zero).
A.5 Mod-P Building Block (One).
A.6 DSSR Cell.
A.7 Pipeline Register with Direction Logic.
A.8 Reconfigurable Switch Element.
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
A FAULT TOLERANT GEQRNS PROCESSING ELEMENT
FOR LINEAR SYSTOLIC ARRAY DSP APPLICATIONS
By
Jeremy C. Smith
August 1994
Chairman: Dr. Fred J. Taylor
Major Department: Electrical Engineering
In this work the design of a Galois Enhanced Quadratic Residue Number
System (GEQRNS) processor is presented, which can be used to construct linear
systolic arrays. The processor architecture has been optimized to perform multiply-
accumulate type operations on complex operands. The properties of finite fields have
been exploited to perform this complex multiplication in a manner which results
in greatly reduced hardware complexity. The processor is also shown to have a
high degree of tolerance to manufacturing defects and faults which can occur during
operation. The combination of these two factors makes this an ideal candidate for
array signal processing applications, where high complex arithmetic data rates are
required. A prototype processing element has been fabricated in 1.5 μm CMOS
technology, which is shown to operate at 40 MHz.
CHAPTER 1
INTRODUCTION
Arithmetic bandwidth continues to remain the principal limitation in high
speed Digital Signal Processing (DSP) applications. In the past, systolic arrays have
been proposed as a means to achieve high computational throughput for compute
bound applications [14]. The highly modular nature of systolic arrays makes them
attractive for larger than conventional levels of integration such as Ultra Large Scale
Integration (ULSI), and Wafer Scale Integration (WSI). In an array processor, each
building-block cell or Processing Element (PE) performs some basic arithmetic op-
eration on data which arrives at its inputs. The data flow is in some predetermined
and regular ordering, so that operands arrive at the correct processor at the correct
time. The architecture of each PE is highly optimized and is then repeated over a
large silicon area.
In order to realize large monolithic arrays of processing elements, it is necessary
to cope with the fact that many of the processors in an arbitrary array will have some
fatal defect at the time of manufacture. Additionally, as chip transistor count and
scaling density increases, so does the probability of operational failure, due to bias-
related physical phenomena [8].
In this work the design of a Residue Number System (RNS) processor is pre-
sented, which is used to construct a linear systolic array. The processor architec-
ture has been optimized to perform multiply-accumulate type operations on complex
operands. The properties of finite fields have been exploited to perform this complex
multiplication in a manner which results in greatly reduced hardware complexity.
The processor is also shown to have a high degree of tolerance to manufacturing de-
fects and faults which occur during operation. The combination of these two factors
makes this an ideal candidate for signal processing applications, where high complex
arithmetic data rates are required.
In Chapter Two, an introduction to the theory leading up to the processor archi-
tecture is presented. It will be shown how complex multiplication can be performed in
two non-communicating parallel channels via the Quadratic Residue Number System
(QRNS) mapping. Furthermore, the mapping may be taken to one more level, where
the actual multiplication is performed as a sum of two number-theoretic exponents.
Next, some bounds are given on the maximum length of a complex inner product
that can be computed, within the given dynamic range of the system. Finally, the
chapter concludes with an illustrative numerical example of the mapping techniques.
Some key implementation issues which relate to the design of large area in-
tegrated circuits (ICs) are presented in Chapter Three. In a broad sense, these are
synchronization and manufacturing yield. The first section of Chapter Three de-
scribes the clocking techniques that have been historically used in integrated circuits
and some of their limitations. The section concludes with the introduction of the
true single phase clocked (TSPC) technique, which represents the latest development
in synchronous system design. The second section of Chapter Three presents the
key issues of IC manufacturing. A physical notion of manufacturing defects will be
developed along with some models used to predict integrated circuit yield.
An overview of the proposed system will be presented in Chapter Four, along
with the architectural design of the processing element. The design tradeoffs which
resulted in the final PE implementation are discussed in detail. Finally, the architec-
ture of the support modules necessary to perform the forward and reverse mappings
of the input and output data are presented. It is shown that these conversion elements
can be implemented exclusively with the key modules of the PE.
The VLSI design of the PE is presented in Chapter Five. The chapter focuses
heavily on the transistor-level design of the PE. The internal details of each major
module of the PE are presented along with computer simulation of their behavior.
Some results of the testing of fabricated chips are then presented. It is shown that
the PE is capable of maintaining a very high data rate, which is due, ultimately, to
the aggressive design techniques used for its electronics. Finally, a discussion of the
earlier versions of the PE is presented, which detail the evolution and optimization
of the current architecture.
An analysis of the fault-tolerant properties of the proposed system is given in
Chapter Six. It is shown that the yield of a large-area sixteen processor array can
be significantly increased by the chosen redundancy scheme. This scheme is then
compared to one which is traditionally used for RNS systems and is shown to be far
more efficient and beneficial.
Finally, Chapter Seven summarizes the accomplishments of this dissertation.
Some concluding remarks and future directions which this work might take, are also
presented.
CHAPTER 2
THEORY
2.1 The Quadratic Residue Number System (QRNS)
The Residue Number System (RNS) has long been proposed as a means
of achieving high-computational bandwidths in signal processing systems [34, 26].
The RNS gets its speed advantage because computations over a large base ring can
be implemented over smaller computation rings, due to an isomorphism between
elements in the base ring and the direct sum of the computation rings. An integer
X in the RNS is represented by an L-tuple of residues:
X = (x_1, x_2, ..., x_L)                                    (2.1)

where x_i = < X >_{m_i} is the ith residue and m_i is the ith modulus. The
production rule for the ith digit in an RNS computation is

z_i = < x_i + y_i >_{m_i}
z_i = < x_i × y_i >_{m_i}                                   (2.2)
which represents modular addition and multiplication, respectively. The im-
portance of Equation 2.2 is that the computation of any digit in the L-tuple is in-
dependent of any other digit. This means that there are no carries between the
residue channels. If the residue channels are of small wordwidth, then high compu-
tation rates can be achieved in physical systems. This is the central theme in RNS
implementations.
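The digit-independence of Equation 2.2 can be sketched in a few lines of Python; the two-modulus set {5, 13} and the function names are chosen purely for illustration:

```python
# Sketch of RNS digit-wise arithmetic (Eqs. 2.1-2.2): each residue channel
# is computed independently, with no carries between channels.
M_SET = (5, 13)  # pairwise-coprime moduli m_1 ... m_L (illustrative)

def to_rns(x, moduli=M_SET):
    """Map an integer X to its L-tuple of residues (Eq. 2.1)."""
    return tuple(x % m for m in moduli)

def rns_add(xs, ys, moduli=M_SET):
    """Digit-wise modular addition: z_i = <x_i + y_i>_{m_i}."""
    return tuple((x + y) % m for x, y, m in zip(xs, ys, moduli))

def rns_mul(xs, ys, moduli=M_SET):
    """Digit-wise modular multiplication: z_i = <x_i * y_i>_{m_i}."""
    return tuple((x * y) % m for x, y, m in zip(xs, ys, moduli))
```

A quick check of the isomorphism: to_rns(7 + 9) equals rns_add(to_rns(7), to_rns(9)), and likewise for products within the dynamic range.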
The rule which maps computations over the implementation rings back to the
base ring is the Chinese Remainder Theorem (CRT). The CRT is given below:
X = < Σ_{i=1}^{L} m̂_i × < (m̂_i)^{-1} × x_i >_{m_i} >_M      (2.3)

where M = Π_{i=1}^{L} m_i, and for i, j ∈ {1, 2, 3, ..., L}, gcd(m_i, m_j) = 1 for i ≠ j,
and m̂_i = M/m_i, with < m̂_i × (m̂_i)^{-1} >_{m_i} = 1. A historical complaint about RNS systems is
that the speed gains obtained by the fast parallel channels are lost in the CRT since
it requires a final mod(M) operation across the entire dynamic range. However,
this is less of an issue today as large wordwidth fast binary adders are regularly
demonstrated in the literature [23, 27]. As we will see shortly, smaller binary adders
can be used to make larger modulo adders, with minimal area complexity.
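A direct software sketch of Equation 2.3 follows. It is illustrative only (the hardware CRT discussed later is a very different structure) and assumes Python 3.8+ for the modular inverse via pow(x, -1, m):

```python
# Sketch of CRT reconstruction (Eq. 2.3) for pairwise-coprime moduli.
from math import prod

def crt(residues, moduli):
    M = prod(moduli)
    total = 0
    for x_i, m_i in zip(residues, moduli):
        m_hat = M // m_i                      # m-hat_i = M / m_i
        inv = pow(m_hat, -1, m_i)             # (m-hat_i)^{-1} mod m_i
        total += m_hat * ((inv * x_i) % m_i)  # inner < . >_{m_i} term
    return total % M                          # final mod-M reduction
```

The final mod-M step is the historically expensive operation referred to above; in software it is a single reduction.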
Complex operations in the RNS can be performed by simply emulating con-
ventional architectures with RNS elements (Complex Residue Numbering System
(CRNS)). However, the QRNS first introduced by Leung [17] and later developed by
Krogmeier and Jenkins [12] is a much better way of performing complex operations.
In QRNS real and imaginary components are encoded into two independent quan-
tities, whereby complex operations can be performed independently in two parallel
channels. This requires that the moduli be restricted to primes of the form 4k + 1.
If this is so, the equation
x2 + 1 = 0 (mod p) (2.4)
has two solutions in the ring Z_p, denoted by ĵ and ĵ^{-1}, which are additive and
multiplicative inverses of each other. We define a forward mapping
θ: Z_p[j]/(j² + 1) → Z_p × Z_p to be

θ(a + jb) = (z, z*)
z = < a + ĵb >_p,    z* = < a − ĵb >_p                      (2.5)
We will call the z and z* operands the normal and conjugate components,
respectively.
The inverse mapping θ^{-1}: Z_p × Z_p → Z_p[j]/(j² + 1) is given by

θ^{-1}(z, z*) = < 2^{-1}(z + z*) >_p + j < 2^{-1} ĵ^{-1}(z − z*) >_p      (2.6)

If (z, z*), (w, w*) ∈ Z_p × Z_p, then addition and multiplication operations in
the ring < Z_p × Z_p, +, × > are given by

(z, z*) + (w, w*) = (z + w, z* + w*)
(z, z*) × (w, w*) = (zw, z*w*)                              (2.7)
Since the z and z* channels are independent, they can be implemented in two
separate channels. Complex arithmetic can thus be performed in two simultaneous
operations, executed in one clock cycle.
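The mapping pair of Equations 2.5 and 2.6 can be sketched as follows for a single prime channel; find_jhat does a brute-force search for the root of x² + 1, which a real system would precompute (names are illustrative, and Python 3.8+ is assumed for modular inverses):

```python
# Sketch of the QRNS forward/inverse maps for one prime p = 4k + 1.
def find_jhat(p):
    """Brute-force root of x^2 + 1 = 0 (mod p); exists when p = 4k + 1."""
    return next(x for x in range(2, p) if (x * x + 1) % p == 0)

def qrns_fwd(a, b, p):
    """theta(a + jb) = (z, z*) per Eq. 2.5."""
    j = find_jhat(p)
    return ((a + j * b) % p, (a - j * b) % p)

def qrns_inv(z, zs, p):
    """theta^{-1}(z, z*) = (a, b) per Eq. 2.6."""
    j = find_jhat(p)
    inv2 = pow(2, -1, p)
    a = (inv2 * (z + zs)) % p
    b = (inv2 * pow(j, -1, p) * (z - zs)) % p
    return (a, b)
```

Component-wise products of (z, z*) pairs then correspond exactly to complex products reduced modulo p, which is the property exploited by the two parallel channels.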
2.2 The Galois-Enhanced QRNS (GEQRNS)
The properties of Galois fields can be used to further simplify complex mul-
tiplication in the RNS [25, 24, 18, 35]. It is well known that for any prime mod-
ulus p there exists some α ∈ Z_p that generates all non-zero elements of the
field GF(p). That is, any non-zero element in Z_p can be represented by α^k, where
k ∈ {0, 1, 2, ..., p − 2}. Since we can represent all elements of GF(p) − {0} by
exponents, multiplication can be performed via exponent addition. This is highly
desirable from a hardware standpoint since n-bit adders tend to be smaller and faster
than n-bit multipliers. A number theoretic logarithm table is used to obtain the
power of a for each QRNS operand, and an antilogarithm table is used to recover the
summed powers (modulo p-1). Exploitation of this cyclic property also permits the
use of moduli which are larger than those typically used for RNS systems (typically
less than five bits), since a hardware multiplier is not needed. This translates into
increased dynamic range for fewer channels. Eight bit moduli have been used in this
implementation, but the technique could easily be extended to 10 or 11 bit moduli.
Beyond this, the logarithm and antilogarithm tables become too large and slow to
be of beneficial use.
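The following sketch builds the log/antilog tables for an arbitrary prime p by brute force (a hardware realization would hold them in ROM) and multiplies via exponent addition modulo p − 1; the generator search and function names are illustrative:

```python
# Sketch of GEQRNS multiplication: log/antilog tables over GF(p).
def find_generator(p):
    """Brute-force search for a generator of the multiplicative group."""
    for g in range(2, p):
        seen, x = set(), 1
        for _ in range(p - 1):
            x = (x * g) % p
            seen.add(x)
        if len(seen) == p - 1:
            return g

def make_tables(p):
    g = find_generator(p)
    antilog = [pow(g, k, p) for k in range(p - 1)]   # k -> g^k mod p
    log = {v: k for k, v in enumerate(antilog)}      # element -> exponent
    return log, antilog

def geqrns_mul(a, b, p, log, antilog):
    if a == 0 or b == 0:              # zero is the special '*' case
        return 0
    return antilog[(log[a] + log[b]) % (p - 1)]      # exponent addition
```

An exhaustive check over a small field confirms that exponent addition agrees with direct modular multiplication for every operand pair.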
2.3 Dynamic Range
The legitimate range of integers in an RNS system is [0, M − 1] (where M =
Π_{i=1}^{L} p_i). All 4k + 1 primes bounded by eight bits belong to the set { 241, 233, 197,
181, 173, 157, 149, 137, 113, 109, 101, 97, 89, 73, 61, 57, 53, 41, 37, 29, 17, 13,
5 }. In theory, an RNS system could be constructed in which the maximum range
was the product of all of these primes, 7.796643721 × 10^42 (≈ 2^142). Clearly this
would result in an impractical implementation, as a massive CRT would be needed
to recover the residues. Systems of practical interest would constitute moduli sets of
say the first eight moduli. If a signed representation for integers is desired, we can
divide the interval [0, M) up evenly into a positive half and a negative half. Since
M will be odd for any product of moduli we have defined, the dynamic range will
be [−(M − 1)/2, (M − 1)/2]. Each integer X is mapped onto the range [0, M − 1]
according to

X → (x_1, x_2, ..., x_L),   x_i = { < X >_{p_i}           X ≥ 0       (2.8)
                                  { p_i − < −X >_{p_i}    X < 0

where < X >_{p_i} is the least positive residue of X with respect to p_i. The
negative part of the dynamic range thus maps to the upper part of the legitimate
range:

Positive Range : [0, (M − 1)/2]
Negative Range : [(M + 1)/2, M − 1]                         (2.9)
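In Python the two branches of Equation 2.8 collapse into the % operator, which already returns the least positive residue; the following is a minimal sketch (function names and the two-modulus set are illustrative only):

```python
from math import prod

def signed_to_rns(x, moduli):
    """Map a signed integer into its residue tuple (Eq. 2.8).

    For x < 0, Python's x % p equals p - <-x>_p, which is exactly the
    second branch of Eq. 2.8, so one expression covers both cases.
    """
    return tuple(x % p for p in moduli)

def rns_range(moduli):
    """Signed dynamic range [-(M-1)/2, (M-1)/2] of Eq. 2.9 (M odd)."""
    M = prod(moduli)
    return (-(M - 1) // 2, (M - 1) // 2)
```

For the example moduli {5, 13}, M = 65, so the signed range is [−32, 32] and a negative value such as −3 lands in the upper half of the legitimate range in each channel.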
Table 2.1 depicts the maximum ranges achievable for eight through four modu-
lus systems. We will consider the four modulus case. We are interested in performing
inner product computations of the form:
c = Σ_{k=0}^{N−1} a(k)b(k)                                  (2.10)
where a(k) and b(k) are complex sequences. We will assume that the real and
imaginary parts of both sequences are signed numbers where

−2^α + 1 ≤ a_r, a_i ≤ 2^α − 1,    −2^β + 1 ≤ b_r, b_i ≤ 2^β − 1       (2.11)

The real and imaginary components of Equation 2.10 will thus satisfy the
bounds

−2N(2^α − 1)(2^β − 1) ≤ c_r, c_i ≤ 2N(2^α − 1)(2^β − 1)               (2.12)

We must now contain these limits within our total dynamic range so that
overflow will not occur during the computation of Equation 2.10; we thus obtain the
following inequality

(M − 1)/2 ≥ 2N(2^α − 1)(2^β − 1)                            (2.13)

which implies that the maximum inner product length N_MAX is

N_MAX ≤ (M − 1) / (4(2^α − 1)(2^β − 1))                     (2.14)
Some tabulated inner product lengths are shown in Table 2.2 for various val-
ues of α and β. We will assume that integers with α, β > 7 represent quantities
which are known a priori, as it is unlikely that numbers of larger wordwidths would
be available from high-speed signal acquisition circuits. Large wordwidth quantities
would typically be filter coefficients, which can be reduced modulo p_i by a host pro-
cessor, prior to entering the RNS system. Thus, two random eight-bit data streams,
or one random eight-bit and one pre-known data stream of wordwidth greater than
eight bits, can be input.
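Equation 2.14 is easy to evaluate directly. The sketch below reproduces entries of Table 2.2 for the four-modulus set; the function name is illustrative:

```python
from math import prod

def n_max(moduli, alpha, beta):
    """Maximum inner product length of Eq. 2.14 for alpha-bit and
    beta-bit signed operand magnitudes."""
    M = prod(moduli)
    return (M - 1) // (4 * (2**alpha - 1) * (2**beta - 1))

# Four-modulus set from Table 2.1 (M = 2002247521).
MODULI = (241, 233, 197, 181)
```

For example, n_max(MODULI, 7, 7) evaluates to 31,034 and n_max(MODULI, 7, 9) to 7,713, matching Table 2.2.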
Table 2.1: Table of Maximum Dynamic Ranges for Eight to Four Modulus Systems.

Modulus Set                           Dynamic Range (Π p_i)    Closest power of 2
241,233,197,181,173,157,149,137       1110121095908704853      2^59
241,233,197,181,173,157,149           8103073692764269         2^52
241,233,197,181,173,157               54383044917881           2^45
241,233,197,181,173                   346388821133             2^38
241,233,197,181                       2002247521               2^30
Table 2.2: Table of Maximum Inner Product Lengths
2.4 Example
We will now consider a sample calculation based on the theory presented so
far. For this simple case, the smallest two 4k + 1 primes will be used as moduli.
These are pi = 5 and P2 = 13. Some preliminary constants that will be needed by
Equation 2.3, Equation 2.5 and Equation 2.6 will be presented first, as well as the
contents of the lookup tables needed to compute the number theoretic logarithm and
antilogarithm values.
For the CRT (Equation 2.3) we will need to know the values of M, m̂_1,
(m̂_1)^{-1}, m̂_2 and (m̂_2)^{-1}. These are obtained as follows:
Table 2.2: Table of Maximum Inner Product Lengths.

α     β     N_MAX
7     7     31,034
7     9      7,713
7    11      1,925
7    13        481
7    15        120
M = m_1 × m_2 = 5 × 13 = 65

m̂_1 = M/m_1 = 65/5 = 13
< m̂_1 × (m̂_1)^{-1} >_5 = 1   ⇒   (m̂_1)^{-1} = 2             (2.15)

m̂_2 = M/m_2 = 65/13 = 5
< m̂_2 × (m̂_2)^{-1} >_13 = 1   ⇒   (m̂_2)^{-1} = 8
The forward QRNS mapping (Equation 2.5) requires that we find the two
constants such that the equation < x² + 1 ≡ 0 >_p can be solved. These are obtained
for the moduli used here as follows:

< x² + 1 ≡ 0 >_5    ⇒   ĵ_1 = 2,   ĵ_1^{-1} = 3
                                                            (2.16)
< x² + 1 ≡ 0 >_13   ⇒   ĵ_2 = 5,   ĵ_2^{-1} = 8
We note that < 2 × 3 >_5 = 1 and that < 2 + 3 >_5 = 0; thus 2 and 3 are
multiplicative and additive inverses of each other, modulo 5. Similarly, < 5 × 8 >_13 =
1 and < 5 + 8 >_13 = 0. Thus, two elements always exist in each case, which behave
exactly like the imaginary operator (j) with which we are familiar.
The inverse QRNS mapping requires that we also find the multiplicative in-
verse of 2, modulo our prime moduli. These are:
< 2^{-1} >_5 = 3
                                                            (2.17)
< 2^{-1} >_13 = 7
Finally, we need to obtain the logarithm-antilogarithm tables to perform the
QRNS to GEQRNS mapping and the GEQRNS to QRNS inverse mapping, respec-
tively. We recall that we can only generate the non-zero elements in GF(p) with some
α^k (where k is taken modulo p − 1). We must thus consider zero as a special case, which
will be denoted by *. The tables are given below, with the generators for each case.
The number theoretic logarithm is obtained from going right-to-left in the tables,
and the anti-logarithm, from left-to-right.
Table 2.3: Log-Antilog Table for p_1 = 5.

Log-Antilog Table for p_1 = 5, α = 3
Power k    < α^{<k>_4} >_5     Element
0          < 3^{<0>_4} >_5     1
1          < 3^{<1>_4} >_5     3
2          < 3^{<2>_4} >_5     4
3          < 3^{<3>_4} >_5     2
*                              0
Table 2.4: Log-Antilog Table for p_2 = 13.

Log-Antilog Table for p_2 = 13, α = 7
Power k    < α^{<k>_12} >_13    Element
0          < 7^{<0>_12} >_13    1
1          < 7^{<1>_12} >_13    7
2          < 7^{<2>_12} >_13    10
3          < 7^{<3>_12} >_13    5
4          < 7^{<4>_12} >_13    9
5          < 7^{<5>_12} >_13    11
6          < 7^{<6>_12} >_13    12
7          < 7^{<7>_12} >_13    6
8          < 7^{<8>_12} >_13    3
9          < 7^{<9>_12} >_13    8
10         < 7^{<10>_12} >_13   4
11         < 7^{<11>_12} >_13   2
*                               0
Now, suppose we wish to compute the product of two complex numbers 6 +j3
and 4 + j5. The product using standard arithmetic is 9 + j42. Let us now compute
the product using RNS. We must first perform the forward QRNS mapping:
θ(6 + j3) = (z, z*)
z  = (< 6 + ĵ_1 × 3 >_5, < 6 + ĵ_2 × 3 >_13) = (2, 8)
z* = (< 6 − ĵ_1 × 3 >_5, < 6 − ĵ_2 × 3 >_13) = (0, 4)
                                                            (2.18)
θ(4 + j5) = (w, w*)
w  = (< 4 + ĵ_1 × 5 >_5, < 4 + ĵ_2 × 5 >_13) = (4, 3)
w* = (< 4 − ĵ_1 × 5 >_5, < 4 − ĵ_2 × 5 >_13) = (4, 5)

Thus, our complex numbers map into the set of ordered pairs

6 + j3 → (2, 8)(0, 4)
4 + j5 → (4, 3)(4, 5)                                       (2.19)
At this point we can perform the multiplication as an actual multiplication as
in QRNS, or we can use logarithmic addition as in GEQRNS. For now we will use
QRNS. Performing component wise multiplication as usual, we obtain
(2, 8)(0, 4)
    ×
(4, 3)(4, 5)
    ↓
(< 2 × 4 >_5, < 8 × 3 >_13)(< 0 × 4 >_5, < 4 × 5 >_13)      (2.20)
(3, 11)(0, 7)
We must now use the inverse QRNS mapping, which yields
θ_1^{-1}(3, 0)  = < 2^{-1}(3 + 0) >_5 + j < 2^{-1} ĵ_1^{-1}(3 − 0) >_5
               = < 3 × 3 >_5 + j < 3 × 3 × 3 >_5
               = 4 + j2
                                                            (2.21)
θ_2^{-1}(11, 7) = < 2^{-1}(11 + 7) >_13 + j < 2^{-1} ĵ_2^{-1}(11 − 7) >_13
               = < 7 × 18 >_13 + j < 7 × 8 × 4 >_13
               = 9 + j3
Finally, we must use a real and an imaginary CRT to recover our standard
integer representation. The following expressions result
c_r = < < m̂_1 × < (m̂_1)^{-1} × 4 >_5 >_65 + < m̂_2 × < (m̂_2)^{-1} × 9 >_13 >_65 >_65
    = < < 13 × < 2 × 4 >_5 >_65 + < 5 × < 8 × 9 >_13 >_65 >_65
    = < 39 + 35 >_65
    = 9
                                                            (2.22)
c_i = < < m̂_1 × < (m̂_1)^{-1} × 2 >_5 >_65 + < m̂_2 × < (m̂_2)^{-1} × 3 >_13 >_65 >_65
    = < < 13 × < 2 × 2 >_5 >_65 + < 5 × < 8 × 3 >_13 >_65 >_65
    = < 52 + 55 >_65
    = 42
Thus, we obtain the same product as before. Now, we will perform the multi-
plication in Equation 2.20 using our GEQRNS logarithm tables. We will simply use
Table 2.3 and Table 2.4 to lookup the logarithm and antilogarithms of the operands.
Thus,
(2, 8)(0, 4)
    ×
(4, 3)(4, 5)
    ↓
(3^3, 7^9)(*, 7^10)
    ×
(3^2, 7^8)(3^2, 7^3)                                        (2.23)
    ↓
(3^{<3+2>_4}, 7^{<9+8>_12})(*, 7^{<10+3>_12})
(3^1, 7^5)(*, 7^1)
(3, 11)(0, 7)
We obtain the same result as we did in Equation 2.20, and thus, if we proceed
with the inverse QRNS mapping and CRT, will obtain the same final result. Finally,
we point out that the GEQRNS is only defined for multiplication. As we will see later,
this encoding is highly desirable from a hardware standpoint. In Chapters Four and
Five, the architecture of the multiply-accumulate processor is presented. Operands
are input to the multiplier portion of the chip as GEQRNS exponents. Once they
are multiplied, their product is converted back to QRNS inside the processor and
subsequently used in the accumulate sections of the chip.
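The worked example above can be checked end to end with a short script; the brute-force searches and variable names here are illustrative only, not a model of the hardware:

```python
# End-to-end check of the Chapter 2 example: (6+j3)(4+j5) = 9 + j42,
# computed through the QRNS channels for p1 = 5, p2 = 13 and recovered
# with real and imaginary CRTs (Python 3.8+ for modular inverses).
def jhat(p):
    return next(x for x in range(2, p) if (x * x + 1) % p == 0)

def fwd(a, b, p):                         # Eq. 2.5
    j = jhat(p)
    return ((a + j * b) % p, (a - j * b) % p)

def inv_map(z, zs, p):                    # Eq. 2.6
    h = pow(2, -1, p)
    return ((h * (z + zs)) % p, (h * pow(jhat(p), -1, p) * (z - zs)) % p)

def crt(res, moduli):                     # Eq. 2.3
    M = 1
    for m in moduli:
        M *= m
    return sum((M // m) * pow(M // m, -1, m) * r
               for r, m in zip(res, moduli)) % M

moduli = (5, 13)
zz = [fwd(6, 3, p) for p in moduli]       # (z, z*) per channel
ww = [fwd(4, 5, p) for p in moduli]
pr = [((z * w) % p, (zs * ws) % p)        # channel-wise multiply (Eq. 2.20)
      for (z, zs), (w, ws), p in zip(zz, ww, moduli)]
re_im = [inv_map(z, zs, p) for (z, zs), p in zip(pr, moduli)]
c_re = crt([r for r, _ in re_im], moduli)   # -> 9
c_im = crt([i for _, i in re_im], moduli)   # -> 42
```

Running the script reproduces the intermediate pairs (3, 11)(0, 7) of Equation 2.20 and the final product 9 + j42 of Equation 2.22.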
CHAPTER 3
KEY ARRAY PROCESSOR IMPLEMENTATION ISSUES
3.1 Synchronization
The continual increase in integrated circuit scaling density has permitted the
development of chips which simultaneously exhibit increased complexity and operat-
ing speed. It is now possible to build entire systems on a single chip which consist
of many interacting circuit elements. By far, the largest gains in integration are at-
tained with monolithic (single chip) ICs. This is because the delay times associated
with communication between modules in a chip can easily be an order of magnitude
less than those associated with chip-to-chip communication. Synchronous systems
constitute the bulk of chips fabricated today. A synchronous system is one in which
data is passed to or from communicating modules in a chip on the active "edges"
or "states" of a global clock. This greatly simplifies the design of individual circuit
elements as data only needs to be stable at these edges or states, rather than in
continuous time.
As die areas and clock speeds continue to grow, however, it becomes more and
more difficult to guarantee that global clock signals arrive at different locations on a
chip at the same time. The differences in the arrival times of global clocking signals
are due to differences in the path lengths to individual modules. Even if these path
lengths can be physically made the same, there are random unavoidable variations in
the delay characteristics of the paths that are intrinsic to the manufacturing process.
Additionally, the time delays of the clock paths are influenced by thermal and power
supply variations, which can also be random in nature or dependent on operating
conditions. The difference between arrival times of the clock signal is known as clock
skew. Clock synchronization is a nontrivial problem for present-day large area, high
performance die [6]. Clock skew is a fundamental limitation to the maximum speed
achievable for very fast chips. As the cycle times of these chips decrease, the skew
time becomes more and more of a significant fraction of the total cycle time. The
problem is particularly aggravated when data must be passed to and from distant
modules in a chip.
The choice of a clocking strategy is thus of paramount importance for the
design of high performance integrated circuits. The general trend with time has been
a reduction in the number of clock signals that are generated and routed around
a chip. There are associated tradeoffs, however, in circuit complexity, speed and
clocking safety. The sections that follow will describe some of the more modern
clocking schemes that have been used successfully in integrated circuits in the past
and some emerging technologies for very high performance chips.
3.1.1 Traditional Approaches
Pseudo Single Phase Latch Based Systems
The simplest type of data storage element is a latch. Latches are used to hold
data at the inputs and outputs of combinational logic gates. In its simplest form, a
latch passes data present at its input to its output when the clock is high (active).
This is the transparent phase. If the data changes at the input of the latch when the
clock is high, it will also change at the output after the time it takes for the change to
propagate through the latch. When the clock goes low, the data that was present at
the falling edge of the clock is stored and cannot change. This is the nontransparent
phase. Latches exhibit the lowest transistor count, due to their simplicity. This factor
motivates their use in VLSI systems. In later sections, we will expand our definitions
of transparent and non-transparent latch phases.
Figure 3.1: Single Phase Latch System
Signals propagate through combinational logic networks at different rates,
depending on their values. This is due to the internal characteristics of the transistor
switching paths comprising the gates. It is not uncommon, for example, to find
propagation delays for highs to be twice as long as those for lows, or vice versa.
This difference in delay is also related to the function being implemented and to
combinations of the input variables. Consider the implementation of a state machine
shown in Figure 3.1, which employs a pseudo single phase latch clocking scheme.
The output to the next stage and the next state information, are computed from the
present inputs and present state data.
New data is sampled by the logic network just after the rising edge of the
clock, is modified by the combinational logic (CL) and should be ready to be passed
on to the next stage (and feedback path) just before the next rising edge of the clock.
We are interested in placing constraints on the allowable delays in the logic so that
the maximum operating frequency can be obtained. The clock period, τ_Φ, is given by
the sum of the high and low phases as τ_Φ = τ_H + τ_L. We desire the clocking to be data
independent, which requires that the slowest delay in the logic be less than τ_Φ. At the
same time we require the fastest delay in the logic to be slower than τ_H, otherwise a
potential conflict could occur. If the fast path delay were significantly faster than τ_H,
then the newly generated data from the CL network could race through the feedback
path and to the next stage, thereby corrupting the previously generated (valid) data.
Non-deterministic behavior of the system results if this phenomenon occurs. This
is called a race condition, which must be avoided at all costs. The delay of the CL
network, τ_CL, is thus subject to the two-sided constraint:

τ_H < τ_CL < τ_Φ                                            (3.1)
To put it succinctly, we require that the slow path be fast enough and the
fast path be slow enough. This two-sided requirement is very difficult to guarantee
in VLSI systems, where process parameters have some statistical distribution. This
clocking scheme was used in some early VLSI chips, but was later abandoned due to
its implicit hazards. More security can be obtained by using a multiphase scheme;
the tradeoff is that the simplicity of this scheme is forfeited for reduced risk and ease
of design.
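As a concrete illustration, the two-sided constraint of Equation 3.1 can be expressed as a simple predicate. The delay values used below are hypothetical, chosen only to show a safe case and a race-hazard case:

```python
def single_phase_ok(t_cl_min, t_cl_max, t_high, t_period):
    """Check the two-sided constraint tau_H < tau_CL < tau_phi.

    The fastest logic path must be slower than the clock high time
    (otherwise data races through the transparent latch), and the
    slowest path must settle within one clock period.
    """
    return t_high < t_cl_min and t_cl_max < t_period

# Illustrative delays in nanoseconds (hypothetical, for demonstration).
print(single_phase_ok(t_cl_min=6, t_cl_max=18, t_high=5, t_period=20))  # True: safe
print(single_phase_ok(t_cl_min=3, t_cl_max=18, t_high=5, t_period=20))  # False: race hazard
```

The second call fails precisely because the fast path (3 ns) beats the clock high time (5 ns), which is the race condition described above.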
Non-overlapping Pseudo Two Phase Clocking
Non-overlapping pseudo two phase clocking (NPTC) has been the mainstay of
the semiconductor industry for several years now. Most integrated circuits designed
today, still employ a NPTC scheme. Its popularity has been due to its relative safety
and immunity to race conditions. This security is achieved by the introduction of
a second clock phase. The active phase of the second clock does not overlap with
the active phase of the first clock. There is thus no possibility of a race condition
occurring. This is shown in Figure 3.2.
Figure 3.2: Double-latch non-overlapping pseudo two phase system
Here, a double latch scheme is employed. New input data is let into the L1
latch just after the rising edge of φ1. On the falling edge of φ1 it is "frozen" and cannot
change again. After time τ_A, φ2 becomes active and the new data propagates to the
CL network. We notice that the fast path race condition has been eliminated: even
if the CL network is very fast for a particular input (i.e., would have computed the
new output with a delay less than τ_φ), it could not race to the next stage (or through
the feedback path) since the L2 latches are non-transparent. A two-sided constraint
has now been reduced to a one-sided constraint. Now, the only requirement of the
CL network is that it meet the upper bound on the slowest transition path. This is
satisfied if the following holds:
τ_CL < τ_2 + τ_B + τ_1 (3.2)
or τ_CL < τ_φ − τ_A
where we define the clock period as τ_φ = τ_1 + τ_A + τ_2 + τ_B (the same for both
phases). We notice that the τ_A time delay is an overhead that must be paid in every
cycle. This is the price we have paid for timing safety, along with the increased
complexity of the double latches and wiring of the second phase. We also notice that
we have effectively made an edge-triggered latch from our cascade of L1 and L2
latches, as new data enters the system on the rising edge of φ2. This idea will be
important later, and we will call an edge sensitive latch a flip-flop (FF). Flip-flops can
be both negative and positive edge sensitive.
Figure 3.3: High performance non-overlapping pseudo two phase system
The time wasted during τ_A can be gained back if the CL network can
be partitioned appropriately. This is shown in Figure 3.3. Here, combinational logic
has been placed between latches, rather than after a cascade of two latches. The
advantage of this scheme can be appreciated by examining the constraint equations,
which can be obtained from a similar analysis:
TCL1 < 71 + TA + 72
TCL2 < T2+ TB 1 (3.3)
The sum of the CL network delays must be less than the clock period, or
τ_CL1 + τ_CL2 < τ_φ. Any combination of delay times that satisfies this constraint can
be implemented. If one of the CL networks is faster than the other, we can thus
trade delay, giving the slow one more time and the fast one less time. The total delay,
τ_CL1 + τ_CL2, now approaches the clock period, which is more efficient than before.
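The per-stage bounds of Equation 3.3, together with the overall budget τ_CL1 + τ_CL2 < τ_φ, can be checked with a short sketch. All phase timings below are hypothetical:

```python
def nptc_budget_ok(t_cl1, t_cl2, t1, ta, t2, tb):
    """Check the per-stage NPTC bounds (Eq. 3.3) and the overall
    budget t_cl1 + t_cl2 < clock period, where the period is
    t1 + ta + t2 + tb."""
    period = t1 + ta + t2 + tb
    return (t_cl1 < t1 + ta + t2 and
            t_cl2 < t2 + tb + t1 and
            t_cl1 + t_cl2 < period)

# Hypothetical phase timings (ns): trading delay between a slower CL1
# and a faster CL2 while keeping the total under one clock period.
print(nptc_budget_ok(t_cl1=11, t_cl2=7, t1=8, ta=2, t2=8, tb=2))  # True: fits
print(nptc_budget_ok(t_cl1=15, t_cl2=7, t1=8, ta=2, t2=8, tb=2))  # False: total too slow
```

Note how the second case passes both per-stage bounds but still fails, because the combined delay exceeds the 20 ns period.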
The NPTC clocking scheme is safe as long as the "dead-time" between the
phases can be maintained. In the presence of clock skew, however, this requirement
Figure 3.4: Delay transformation model
may be violated. For example, if two non-local modules are communicating, then the
relative differences in path delays may degrade our safety margin enough to cause
timing errors. We can model the effects of skew by introducing a delay transformation
that preserves the overall timing scheme presented so far. For a general logic module
shown in Figure 3.4, if we add a positive delay to all inputs and subtract this delay
from all outputs, the overall timing remains the same. We can thus model the effects
of skew by considering an overall timing loop. A circuit with known delays through
the combinational logic and uncertain clock skew can be transformed into a system
with uncertain delays through the combinational network and no clock skew [7].
Delay is thus added to one piece of the combinational network and subtracted from
the other. This delay may be positive or negative. Negative delay has no physical
significance in a real system. Negative delays have significance in the transformed
system, however. We note that if 1 is delayed more than 2 such that this delay is
greater than of equal to TA an overlap of active phases results, which requires two-
sided constraints as in the case of the single phase latch clocking. The system is likely
to fail at this point.
Pseudo Single Phase Edge Clocking
If edge triggering were used for the system in Figure 3.1, then the drawbacks of
using an additional clock phase would be avoided. Data would move to the next
adjacent stage (and to the feedback path) on the rising edge of the clock and thus,
would not be subject to the lower bound on logic delay as before. Of course the system
would still be vulnerable to clock skew between communicating non-local modules.
Single phase edge based clocking and NPTC schemes have been used successfully in
integrated circuits, with the former associated with higher performance.
At this point, the reader is probably wondering why the word "pseudo" ap-
pears in all of the descriptions. This is because we have purposefully stayed away
from the internal details of the latches. In reality, complementary (inverted) phases
of the clock signals must be generated to make all of our latches work. This is due
to the nature of CMOS logic which employs two different species of transistors to
pass the full range of logic values. If we view the transistors as ideal switches, then
their operation (in the most elementary form) requires that a high voltage turns the
NMOS device on and the PMOS device off. Conversely, a low turns
the NMOS off and the PMOS on. The NMOS device will pass a strong zero and a
weak one, while the PMOS will pass a strong one and a weak zero. The simplest
switch that will pass the full logic level range is a parallel combination of an NMOS
and PMOS device, called a transmission gate. This is shown in Figure 3.5, where a
simple positive edge-triggered latch is shown. The circuit works by letting the new
data into the first half of the latch during the low phase of φ and presenting it to the
output during the high phase of φ. The internal states in the latch are stored on the
parasitic capacitances at the inputs of the inverters in the latch (dynamic logic). If
the switching time of φ is zero, as in the ideal case, then there is no potential race
condition. In reality, however, this time cannot be zero and the clock signal must
have a finite slope. There is thus a built-in transparency during the interval τ. If the
propagation delay through the latch is on the order of τ, then a race condition exists.
We must thus keep τ as small as possible, which implies that the clock edge-rate must
be high. To gain an appreciation of the problem, suppose that we are working with
(Clock waveforms: the ideal case; finite transition time; finite transition time with inversion delay.)
Figure 3.5: Transparency problem introduced by complementary phase of clock.
a typical 5 V chip, and we require that τ < 2 ns. The clock edge rate must thus be
greater than 2.5 billion volts per second! Of course the output of the latch cannot rise
in zero time (nor can the inverters in the latch), so this will relax our delay constraint
somewhat. Nevertheless, the transition time of the clock and its complement must be
very small. The problem is compounded since the complement of φ is typically
generated locally from φ by inversion. This implies that the complement will always
lag behind φ, since it cannot be produced in zero time, which increases τ. There are
thus built-in problems associated with inverting
the clock signal for a complementary phase. If we had a family of latch circuits that
could operate with only one clock phase then this problem would be solved. The
edge rate sensitivity will be a fundamental limitation, if the logic block that follows
the latches can switch in times on the order of the clock transition time. For very
high performance chips this is in fact the case, and very careful attention must be
paid to the worst case clock edge rate.
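The edge-rate figure quoted above follows from simple arithmetic:

```python
# Required clock edge rate for a 5 V swing completed in tau < 2 ns,
# reproducing the "2.5 billion volts per second" figure in the text.
v_swing = 5.0   # clock voltage swing, volts
tau = 2e-9      # allowed transition interval, seconds
edge_rate = v_swing / tau
print(edge_rate)  # 2500000000.0 V/s, i.e. 2.5 billion volts per second
```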
3.1.2 True Single Phase Clocked Systems
The True-Single-Phase-Clocking (TSPC) scheme [43, 44, 1] represents the
state of the art in integrated circuit clocking. In TSPC, the clock signal is never
inverted to produce a complementary phase, which significantly ameliorates the prob-
lems pointed out earlier. TSPC schemes become more attractive as chip die areas
grow towards ULSI and WSI dimensions, as only one clock line needs to be routed
around the chip and clock skew between phases is eliminated. It is also much easier
to guarantee the long term reliability of a single clock interconnect, or distribution
network. Reducing the clocking complexity of the storage elements is important for
large systems since the clock capacitive load influences the overall system speed.
At the heart of the technique is the use of two types of latch circuits which
are alternately transparent on either the high or low phases of the clock. The latch
which is transparent on the high phase of the clock and is non-transparent on the
low phase is called the N-latch. Likewise, the latch transparent on the low phase
and non-transparent on the high phase is called the P-latch. The latches permit
combinational logic circuits to be placed between N and P latch sections, or actually
embedded in the latches themselves. The scheme supports static or dynamic CMOS
logic elements, and is thus fully applicable to all types of CMOS systems. We will
defer the discussion of specific circuit topologies of the latch elements until Chapter
Five. For now, in latch based schemes, we can deal only with the abstraction of
N-blocks and P-blocks, where "blocks" consist of only their respective latch elements
together with combinational logic. We note that the notion of combinational logic
elements contained in or between latch elements is equivalent. The previously defined
concepts of edge-based systems still hold. A positive edge flip-flop is made from a
cascade of a P-latch followed by a N-latch. Similarly, a negative edge flip-flop is made
from a cascade of a N-latch followed by a P-latch.
Figure 3.6: System model for TSPC edge based clocking.
TSPC Edge Based Systems
The effects of clock skew in a TSPC scheme can be examined with the system
shown in Figure 3.6. Skew is defined as the difference in time between clock signals
arriving at the receiving flip-flop relative to the transmitting flip-flop. Skew may take
on a negative or positive value as shown, depending on the relative magnitudes of
the path delays (Δ1 and Δ2). We note that negative skew corresponds to a positive
value of Δ21, and positive skew to a negative value of Δ21. We will also define τ_s as the
setup-time, which is the minimum amount of time that the data must be stable prior
to the active edge of the clock. Likewise, we can define τ_h as the hold-time, which
is the minimum amount of time that the data must be held after the active edge
of the clock. There are also delay times associated with the propagation times for
new data through the flip-flops and combinational logic circuits. We will define the
propagation delay through the flip-flops (i.e. after the active edge) by τ_Q, and the
propagation delay through the CL network by τ_CL. All of the quantities defined so far
may take on maximum and minimum values, which will be denoted with subscripts
M and m, respectively.
Figure 3.7: Timing waveforms for TSPC edge based clocking.
As before, we are interested in deriving a set of constraint equations to char-
acterize our system. We wish to be able to run the system at the maximum allowable
clock frequency. We now consider the case with clock skew when non-local flip-flops
are communicating. The timing diagram for the situation where Δ2 > Δ1 (negative
skew) is shown in Case (b) of Figure 3.7. The minimum allowable clock period can
be expressed as:
τ_φ ≥ τ_QM + τ_CLM + τ_s − Δ21m (3.4)
On the other hand, the minimum delays must be such that the newly generated
data does not reach the distant flip-flop before its hold time. This requires that:
τ_CLm + τ_Qm ≥ τ_h + Δ21M (3.5)
The maximum allowable clock skew in the system is given by:
Δ21M ≤ τ_CLm + τ_Qm − τ_h (3.6)
Figure 3.8: System model for TSPC latch based clocking.
Typically τ_h is near zero for this class of circuits, so the maximum allowable
clock skew is just given by the minimum flip-flop and logic delays. The situation
for positive clock skew is shown in Case (c) of Figure 3.7 (−Δ21). Substituting a
negative value for Δ21 into Equation 3.4 will yield an increase in the minimum clock
period. Thus, positive skew will tend to slow the system down. Negative skew will
allow the system to operate faster, but will place constraints on the minimum speed
of the logic, since fast CL networks will tend to have shorter minimum delay times.
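Equations 3.4 through 3.6 can be evaluated numerically. The sketch below uses hypothetical delay values (in nanoseconds, not taken from any particular process) to compute the minimum clock period and the skew margin:

```python
def tspc_edge_limits(tq_max, tcl_max, t_setup, tq_min, tcl_min, t_hold, skew):
    """TSPC edge-triggered timing limits (Eqs. 3.4 and 3.6).

    'skew' is Delta21, which is positive for negative skew as
    defined in the text.  Returns the minimum clock period and
    the maximum tolerable skew.
    """
    min_period = tq_max + tcl_max + t_setup - skew  # Eq. 3.4
    max_skew = tcl_min + tq_min - t_hold            # Eq. 3.6
    return min_period, max_skew

# Hypothetical delays in ns.  With zero hold time, the skew margin
# is just the minimum flip-flop plus logic delay, as noted in the text.
period, margin = tspc_edge_limits(tq_max=1.5, tcl_max=8.0, t_setup=0.5,
                                  tq_min=0.8, tcl_min=2.0, t_hold=0.0,
                                  skew=0.5)
print(period)  # minimum clock period: 9.5 ns
print(margin)  # maximum allowable skew: 2.8 ns
```

Passing a negative skew value (positive skew in the text's convention) increases the minimum period, slowing the system down, exactly as the discussion above predicts.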
TSPC Latch Based Systems
We can obtain a similar set of equations for non-local communication between
modules in latch based TSPC systems. Our prototype system is shown in Figure 3.8,
and timing waveforms are shown in Figure 3.9. For simplicity we will assume that
the delays in the P and N latches are the same. Data stored in the N-latch (when φ
goes low) must propagate through the CL network and arrive at the P-latch before
its setup time. This is shown in Figure 3.9, Case (a). For negative skew (Case (b))
we thus obtain the following expression, which is very similar to Equation 3.4:
Figure 3.9: Timing waveforms for TSPC latch based clocking.
τ_φ ≥ τ_QM + τ_CLM + τ_s − Δ21m (3.7)
We note that there is also a constraint on the high width of the clock, in order
to give N1 enough time to obtain data. This is given by the following equation, where
τ_QM is the maximum delay of the P-latch:
τ_φH ≥ τ_QM + τ_s (3.8)
Again we must consider the minimum allowable delay in the system. Data
must not race through the P latch and the CL network before the hold time of the
N2 latch. Thus, the requirement below must hold:
τ_CLm + τ_Qm ≥ τ_h + Δ21M (3.9)
The maximum allowable clock skew in the system is given by:
Δ21M ≤ τ_CLm + τ_Qm − τ_h (3.10)
Again, positive skew has the effect of lengthening the clock period. Design of
these systems must strike a balance between Equations 3.7, 3.8, and 3.10.
It is interesting to consider situations in which Equation 3.10 cannot be sat-
isfied [44]. This is shown in Figure 3.10. Here the skew has exceeded the gate delay.
Non-deterministic behavior of the system results because the evaluate phases of N
and P blocks overlap. The system can be made to operate correctly, however, by
simply clocking against the data flow. This is shown in Figure 3.11. In this way, the
evaluation phase of the next block will be completely contained in the data-stable
zone of the last block. It is thus possible to exploit the skew, if its direction can
be guaranteed. Another solution is to only latch data on the start transitions of the
evaluation phases. This can be accomplished by adding a "re-synchronizing" N-block
in front of the P block as shown in Figure 3.12. This illustrates the flexibility of the
technique, where we can selectively add edge-triggering where needed. Finally we
note that in systems where the data flow must be in two directions, a totally edge-
based scheme is best. The maximum skew in this case is nearly up to a half clock
cycle [44].
3.2 Synchronous vs. Asynchronous Systems
Since the early days of systolic arrays, it has been realized that clock synchro-
nization over a large silicon area would be a limiting factor. In an attempt to solve
this problem, Kung [15] proposed his Wavefront Array, in which data would move
between processing elements in a self-timed manner, via a handshaking protocol.
More recently, Afghahi and Svensson [2] have conducted a study of syn-
chronous and asynchronous clocking schemes, for layout groundrules ranging from 3
to 0.3 μm. The study was based on physical entities in actual processes, rather than
on speculative models. The results of their findings have important consequences for
Figure 3.10: Potential skew hazard with TSPC scheme [44].
Figure 3.11: Clocking against the data flow to exploit skew [44].
Figure 3.12: Non-local communication problem solution [44].
the clocking scheme used for a particular implementation, since the optimum choice
(in terms of overall system speed) is intimately related to the module grain size. Some
of their findings are reproduced below, as they are relevant to our problem.
Asynchronous timing schemes have been proposed as a solution to clock skew
in VLSI systems. It is generally realized that fully asynchronous logic (i.e. no
clock) requires too much handshaking overhead for practical use when the number of
inputs and outputs is large. Present day interest in asynchronous systems centers on
locally synchronous, globally asynchronous schemes, where each module in a system
operates with its own clock, but communicates with other modules asynchronously.
Since data arriving at a module must be processed synchronously, a synchronizer
circuit is necessary for each input. The synchronizer is vulnerable to error, if the
input signal changes during the edge of the local sampling clock. If this occurs, the
synchronizer goes into what is known as a metastable state (MSS), where the output
is undefined for some period of time. A synchronization failure potentially results
in a system failure. In order to avoid this, a time interval must be allowed for the
synchronizer to resolve the MSS. This time interval is an extra delay which impacts
the overall data rate. Obviously, reducing the module communication rate decreases
the likelihood of synchronization failure, at the penalty of reduced system bandwidth.
A more favorable approach is to develop some probabilistic bounds on the resolution
time, t, of the synchronizer circuit. In their study, to estimate t, a synchronization
failure rate of 1 per year was considered acceptable. Synchronization time was shown
to be the limiting factor for asynchronous systems.
The study showed that the time complexity of synchronous systems is
O((log R)^0.5), while for asynchronous systems it is O(log R),
where R is the size of the system. This suggests that for fine-grained systems
(where skew is acceptable) synchronous clocking should be used, while for coarser
grained applications, asynchronous schemes should be used. The authors also showed
that where a pipelined clocking mode is used (i.e. the global clock is segmented
into an optimum number of small sections and each section driven by a repeater),
synchronous systems will always outperform asynchronous systems. This is
very different from the general belief that asynchronous systems will be the fastest
possible implementation in a scaled technology. They also showed that speeds up to
approximately 2 GHz can be achieved in single phase synchronous CMOS systems
for 0.3 μm technology, although the line repeaters must be spaced 1 millimeter apart
to achieve this.
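The reported complexities can be compared numerically. The sketch below tabulates (log R)^0.5 against log R for a few system sizes; the asymptotic result fixes no constant factors, so only the growth trend is meaningful:

```python
import math

# Illustrative comparison of the reported time complexities:
# synchronous ~ (log R)^0.5, asynchronous ~ log R, up to unknown
# constant factors.  The gap widens as the system size R grows,
# favoring synchronous clocking for large pipelined systems.
for r in (1e2, 1e4, 1e6, 1e8):
    sync = math.log(r) ** 0.5
    asyn = math.log(r)
    print(f"R = {r:.0e}:  sync ~ {sync:.2f}   async ~ {asyn:.2f}")
```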
3.3 Fundamental Manufacturing Limitations
The occurrence of defects is inherent to the semiconductor manufacturing pro-
cess. These defects arise due to many physical mechanisms such as particle contam-
ination, imperfections in insulating oxide layers, mask misalignment, step-coverage
problems, warping of the wafer during high temperature steps, etc. The first two
cases tend to produce random defects, which affect local regions of a wafer, while the
remaining three tend to produce defects which affect many chips (global regions) on
a wafer. Global defects are usually not considered in defect-tolerant analysis, since
they represent gross disturbances in the manufacturing process, and are generally not
a function of chip area. Additionally, various process monitors (test circuits placed
at strategic locations of a wafer to determine the quality of each manufacturing step)
are used to reject (or correct) wafers which exhibit such characteristics early in the
process. Consequently, the overall quality of the manufacturing process is determined
primarily by local defects.
Integrated circuit yield is intimately related to manufacturing defects. Yield is
defined as the ratio of the number of working chips on a wafer to the number of chips
fabricated, and is always less than 1. For conventional VLSI systems, yield studies
pertain to manufacturing economics with the goal of maximizing profits. In the
context of ULSI and WSI systems, however, yield relates to fundamental feasibility.
Yield is further divided into two broad categories: functional yield and parametric
yield. Functional yield (sometimes called catastrophic yield) is determined based on
the criterion of a chip successfully performing its desired logical functions. Parametric
yield is determined based on the chip meeting some predefined operating specification,
such as minimum speed or power dissipation. For our purposes, we will not consider
parametric yield since it is usually highly correlated to global disturbances on a wafer.
3.3.1 Defect Size Distribution
Early work on defects occurring in the semiconductor manufacturing process
modeled random defects as dimensionless points, where any defect occurring in an
integrated circuit was assumed to cause a failure. A more modern view of defects
[28] models them as extra or missing disks of material in the conducting and non-
conducting layers of an IC, which are characterized by varying radius and spatial
distribution on a wafer. These may take the form of shorts or opens in conducting
layers used for interconnections, oxide pinholes in insulating layers which can cause a
short (or leakage) between conductors, junction leakages or shorts to the substrate in
diffusions, etc. The model assumes that defect types are independent, in that a de-
fect of a particular type does not interact with, or cause, another defect of a different
type. In practice this has been shown to be a good approximation [39, pp.149-171].
Examples of actual defect types from early prototype fabrication runs of the PE are
shown in Figure 3.13. These defects may cause an immediate fault, as in Case A
(left), or may not cause a fault as in Case B (right). Case A shows a short between
two power busses, which was caused by a particle on the chip (the particle is the
dark spot). Case B is a pinhole in the passivation layer (overglass) layer, which does
not cause a short circuit since there is no conductor above it. Clearly, then, we must
consider the nature of a defect in order to determine if it will manifest itself as a
circuit fault. Random defects (sometimes called spot defects) are typically caused
by particle contamination (dirt) on the chip or on photolithographic masks during
manufacture.
As alluded to in the previous paragraph, defects are of many differing types.
The size distribution function for defects of type i is given by:
f_i(R) = k1 R,        0 < R < X0i
f_i(R) = k2 / R^pi,   X0i ≤ R (3.11)
Figure 3.13: SEM photographs of defects from early PE fabrications.
where R is the defect radius, and X0i and pi are parameters which are extracted from
the fabrication line. Each manufacturing step in a process has associated with it its
own characteristic set of defect types. The parameter i in Equation 3.11 can thus be
taken on a mask-by-mask basis (i.e. consider polysilicon shorts and opens, first-level
metal shorts and opens etc.). Smaller defects tend to be more numerous than larger
defects, since it is more difficult to filter small particles from the ambient environment
and manufacturing chemicals. We see in Figure 3.14 that the defect density peaks at
the value X0i. This corresponds to the resolution limit of the photolithography
in the process. The physical reason for the peak is that defects of a smaller radius
simply cannot be resolved by the photolithography, and thus manifest themselves
with decreasing frequency. Typically, minimum design rules are set well above this
value, so that the only size distribution exhibited in practice is f_i(R) = k2/R^pi. In the
above expression, both X0i and pi may have different values for each defect type i.
Typically, a value of pi ≈ 3 is assumed [28].
As mentioned previously, not all defects in a semiconductor process cause
faults. For example, Walker has suggested that if X0i = 0.5 μm and the minimum
Figure 3.14: Defect density vs. defect radius
line separation is 3 μm (i.e. the minimum design rule in Figure 3.14), then only one-in-
seventy-two defects are potentially fault producing [39, pp.41]. With this in mind,
we must then consider the effective defect density, D0i, which is the defect density as
seen by the layout. The defect density variation with respect to defect radius, Di(R),
is thus obtained by multiplying D0i and f_i(R). Hence,
D_i(R) = D0i k2 / R^pi = K_i / R^pi (3.12)
where K_i = D0i k2.
The relationship between parameters Ki, pi and the effective defect density is
as follows. Suppose that two metal lines in an IC have a minimum spacing s, and
we are interested in determining the effective defect density for shorts. All defects
with radius less than s/2 will not be able to cause a short in the layout, regardless of
where they lie. The effective defect density, D_i-effective, as seen by the layout, is then:
D_i-effective = ∫ from s/2 to ∞ of (K_i / R^pi) dR = [K_i / (pi − 1)] (2/s)^(pi − 1) (3.13)
Similarly, the effective defect density for opens in lines of minimum width w
is given by:
D_i-effective = [K_i / (pi − 1)] (2/w)^(pi − 1) (3.14)
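Equations 3.13 and 3.14 are straightforward to evaluate. The sketch below uses hypothetical parameter values (K = 1, p = 3, and the 3 μm spacing and 2 μm width are illustrative only):

```python
def effective_density(K, p, dimension):
    """Effective defect density for shorts (dimension = spacing s)
    or opens (dimension = width w), per Eqs. 3.13-3.14:
    the integral of K / R^p from dimension/2 to infinity."""
    return (K / (p - 1.0)) * (2.0 / dimension) ** (p - 1.0)

# Hypothetical extracted parameters: K = 1 (arbitrary units), p = 3.
K, p = 1.0, 3.0
shorts = effective_density(K, p, dimension=3.0)  # 3 um line spacing
opens = effective_density(K, p, dimension=2.0)   # 2 um line width
print(shorts)  # 0.5 * (2/3)^2 ~ 0.222
print(opens)   # 0.5 * (2/2)^2 = 0.5
```

Wider spacing or wider lines shrink the effective density, reflecting the fact that small defects cannot bridge (or sever) generously drawn geometry.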
We have seen, in this simple example, a relationship between effective defect
density and the layout geometry. There is thus a finite probability of
a particular defect producing a circuit fault. This idea is further extended into the
concept of critical area, which is the portion of a layout sensitive to a defect of a
particular radius. Critical area A(R) can be defined for a defect of radius R, as that
area on a die in which the center of a circular defect has to fall for a fault to occur
in a circuit. This is illustrated in Figure 3.15, where we consider again, the case of
a chip consisting of wires of width w spaced s units apart. The total chip area is
A0. As we have seen before, defects of radius less than s/2 will not cause shorts, and
thus, A(R) = 0 for defects with radius R < s/2. If we consider defects with radius
R > s/2, A(R) increases. The critical area increases linearly until it reaches Ao. This
occurs at defect radius R0 = s + w/2, where a defect falling anywhere on the chip
would cause a short. The increase in critical area is linear for this simple case, but
for a complex layout, we can only state that critical area will increase monotonically
with defect radius. In practice very large defects seldom occur, and the distribution
given in Figure 3.14 can be truncated at some maximum radius.
Figure 3.15: Critical area for Parallel Conductors.
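The piecewise behavior of A(R) described above can be sketched directly. The line dimensions below are hypothetical, and the chip area is normalized:

```python
def critical_area_shorts(R, s, w, A0):
    """Critical area for shorts between parallel lines of width w and
    spacing s on a chip of area A0, using the linear-ramp model from
    the text: zero below R = s/2, all of A0 above R = s + w/2."""
    r_low, r_high = s / 2.0, s + w / 2.0
    if R <= r_low:
        return 0.0      # defect too small to bridge the spacing
    if R >= r_high:
        return A0       # defect causes a short anywhere on the chip
    # Linear increase between the two radii (simple-layout case only;
    # a real layout merely increases monotonically with R).
    return A0 * (R - r_low) / (r_high - r_low)

A0 = 1.0  # normalized chip area; s = 3, w = 2 (hypothetical units)
print(critical_area_shorts(1.0, s=3.0, w=2.0, A0=A0))  # 0.0: below s/2
print(critical_area_shorts(4.0, s=3.0, w=2.0, A0=A0))  # 1.0: above s + w/2
```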
3.3.2 Defect Spatial Distribution and Yield Models
Poisson Statistics
The spatial distribution of defects must also be considered when determining
circuit yield. Early yield models assumed that defects followed a Poisson distribu-
tion, and were considered as point defects. Poisson processes are modeled by the
expression:
prob{X = x} = (λ^x e^(−λ)) / x! (3.15)
The three underlying assumptions of a Poisson spatial process are [16]: (1)
that the number of events occurring in one segment of space is independent of the
number of events in any nonoverlapping segment; (2) that the mean process rate A
must remain constant for the span of space considered; (3) the smaller the segment
of space, the less likely it is for more than one event to occur in that segment. For
yield purposes, we are concerned with the case where X = 0, as this is the condition
for zero circuit failures. The mean process rate, λ, is the mean number of circuit
faults. We can define this per process step by λi, which is given by λi = D0i Ai, with
D0i and Ai as defined previously (i.e. the defect density and critical area, respectively).
The yield of a particular step in a semiconductor process is thus:
Y_i = e^(−λi) = e^(−D0i Ai) (3.16)
The total yield of the chip is then the product of the individual yields, or
simply:
Y_total = ∏ from i=1 to n of Y_i (3.17)
where there are n total defect types. The formal methodology in Equation 3.16
and Equation 3.17, of considering defect mechanisms on a mask-by-mask basis with
regard to layout dependent critical areas is best left to CAD tools. For analytical
purposes, we usually consider overall average defect densities only. For this reason
we will drop the i subscript in subsequent discussions, and define the quantity Do to
be the average fatal defect density. Critical areas are usually taken to be the area of
the entire chip, or the areas of major subsections of the chip. The yield for Poisson
distributed defects, thus becomes:
Y = e^(−λ) = e^(−D0 A) (3.18)
It is still possible to calculate the individual yields of particular sections of
a chip with this simplified view, from the area and defect density of that particular
section. Equation 3.17 suggests that areas on an integrated circuit which are the most
complex (i.e. require more manufacturing steps), will exhibit the lowest yield. This
is the case for logic, which usually makes use of all design layers (typically greater
than 12 masks). Conversely, interconnection busses which consist of one layer, say
second level metal, with no other layers beneath them, only require a few processing
steps (three or so). The corresponding yields of such sections will be higher than
that of same area logic circuits. The statistical independence of failures modeled
by the Poisson distribution permits the total yield of a chip to be computed from
the product of its individual component parts. This is the most useful property of
Poisson based yield models.
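Equations 3.16 and 3.17 combine into a short sketch. The defect densities and areas below are hypothetical, chosen to contrast a complex logic section with a simple bus section:

```python
import math

def poisson_yield(defect_densities, areas):
    """Total chip yield as the product of per-step Poisson yields:
    Y = prod over i of exp(-D0i * Ai), per Eqs. 3.16-3.17."""
    total = 1.0
    for d0, a in zip(defect_densities, areas):
        total *= math.exp(-d0 * a)
    return total

# Hypothetical process: a logic section (many masks, higher fatal
# defect density) and a metal bus section (few steps, lower density).
# Areas in cm^2, densities in fatal defects per cm^2.
y = poisson_yield(defect_densities=[0.5, 0.1], areas=[1.0, 1.0])
print(y)  # exp(-0.6) ~ 0.549
```

The multiplicative form is what makes the Poisson model convenient: each section's yield can be computed independently and the results simply multiplied, as noted above.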
Modified Poisson Statistics
It is now well known that for large chips, the Poisson model gives pessimistic
predictions for yield when compared to actual fabrication line data. This is because
of a phenomenon known as defect clustering. Defects tend to cluster between lots of
wafers, between wafers in the same lot and across individual wafers [11]. Clustering
of defects between lots is perhaps the easiest behavior to appreciate, due to the
batch oriented nature of the semiconductor process. Manufacturing parameters will
certainly change over a period of weeks or months, due to equipment and operating-
environment variations. Thus, it is reasonable to expect some statistical differences
in the yields of identical chips from different batches.
It is interesting to consider the physical mechanisms relating to the other types
of defect clustering exhibited. Clustering within wafers is due to minute differences in
environmental conditions at different locations on the surface of a wafer. Stapper has
suggested that the defect clusters are generated when vibration or other environmen-
tal changes (i.e. irregular gas flow or pressure changes) cause a cloud of particles to
break loose from the manufacturing equipment [32]. When these clouds land on the
surface of a wafer, the resulting defects produced will be clustered. Other very subtle
mechanisms also influence the spatial characteristics of defect patterns. For example,
it was observed for many years in the semiconductor industry that defects tended to
be random within the center of a wafer, and correlated towards the perimeter. This
caused many researchers to divide wafers into concentric zones, where the yield in
each zone was modeled by a Poisson distribution with its own defect density. This
behavior arose because wafers were carried in plastic boxes called "boats", between
process steps [32]. The inside of a "boat" is similar in construction to the inside of a
slide projector, where grooves permit wafers to be stacked vertically (in parallel). A
boat is open on one side only, and dust particles can only approach the wafers within
from this side. Suppose a dust cloud is present near a group of wafers in a boat. Even
if the particles are uniformly suspended, they are electrostatically attracted to the
nearest edge of the wafers due to the electrostatic potential created when the wafer is
slid into the boat. This leads to the observed edge clustering. These defect patterns
are seen less often today, due to improved wafer handling techniques. We also note that
the dust cloud would only affect wafers closest to it in the boat, and may not affect
other wafers in the same boat. This explains clustering from wafer-to-wafer in the
same lot.
The mechanisms previously described suggest that if there is a defect present
at some position on a wafer, it is highly likely that there is another one nearby.
This directly violates the spatial independence assumption of a Poisson process, which
assumes that a defect at a particular location is not correlated with an adjacent defect.
Thus, defect locations on a wafer are statistically dependent, rather than independent.
This is why purely Poisson expressions do not accurately model integrated circuit
yield. Clustering improves yield since it is better to have defects clumped together
affecting fewer chips, than randomly distributed, potentially affecting more chips.
The yield formula of Equation 3.18 can be modified to account for defect
clustering by assuming that defects are still Poisson distributed, but considering λ
to be a random variable. The mere fact that λ is a random variable suggests defect
clustering, regardless of the distribution used [11]. This technique was first described
in the literature by Murphy [19]. If F(λ) is a cumulative distribution function for
the average number of faults per chip, then associated with F(λ) is the probability
density function f(λ) given by:

f(λ) = dF(λ)/dλ    (3.19)

where f(λ)dλ is the probability of having an average number of faults per chip between
λ and λ + dλ. The overall yield thus becomes:

Y = ∫₀^∞ e^(−λ) f(λ) dλ    (3.20)
The function f(A) is known as a compounder or mixing function. Murphy
reasoned that a bell shaped Gaussian distribution would be appropriate for f(A),
although he could not integrate the resulting expression. He approximated the Gaus-
sian distribution with a triangular distribution. The yield expression he obtained
matched his manufacturing data more accurately than a Poisson distribution. Many
distributions for f(A) have been suggested in the past, but none have gained more
acceptance than that first used by Stapper, the Gamma distribution [32, 30]. The
Gamma distribution is given below as:
f(λ) = (1 / (Γ(α) β^α)) λ^(α−1) e^(−λ/β)    (3.21)

where α and β are parameters. The mean and variance of the Gamma distribution
are E(λ) = αβ and V(λ) = αβ². If Equation 3.21 is substituted into Equation 3.20
and solved, the following expression results:

prob(X = x) = Γ(α + x) β^x / (x! Γ(α) (1 + β)^(α+x))    (3.22)
This distribution is known as the Negative Binomial distribution. The average
number of faults per chip (the grand average) is normally taken to be λ, where λ =
E(X) = αβ, so that β = λ/α. Equation 3.22 can thus be expressed as:

prob(X = x) = Γ(α + x) (λ/α)^x / (x! Γ(α) (1 + λ/α)^(α+x))    (3.23)

The mean and variance of Equation 3.23 are given by:

E(X) = λ
V(X) = λ(1 + λ/α)    (3.24)
We note that the variance of this distribution (σ²) is greater than the mean, which
is different from the Poisson distribution, where the variance equals the mean. The
parameter α is determined from data on the distribution of defects, and can be
calculated from the expression:

α = λ² / (σ² − λ)    (3.25)
We see from this equation that as the variance of the defect distribution
approaches the mean, the value of α approaches ∞. This corresponds to the Poisson
case. Small values of α correspond to increased clustering. We can thus account
for the Poisson distribution by choosing an appropriate value of α. This suggests
that Negative Binomial statistics are more fundamental to IC manufacturing than
Poisson statistics. For all practical purposes, α > 10 adequately models the Poisson
case. As before, we are interested in the case where X = 0 for yield, and Equation 3.23
becomes:

Y = prob(X = 0) = (1 + λ/α)^(−α) = (1 + A·D₀/α)^(−α)    (3.26)
This expression is known as the Negative Binomial Model (NBM), and has
been widely used in the industry to forecast integrated circuit yields. The NBM
formally contains an additional gross yield term, Yo, which multiplies Equation 3.26.
This term models the effects of large scale process disturbances, but as before, is
generally not considered for preliminary yield analysis, since it is not a function of
chip area. Conservative values for Do are between 1 and 2 fatal defects per cm2, and
a < 1. An average defect density of 1 per cm2 has remained the defect standard
for some time (although probably now less than 0.5 per cm2 is more appropriate for
some fabrication lines). It was recently reported that a facility could be considered
"world-class" if it was able to maintain an average fatal defect density of 0.3 per cm2
(in 1992) [41]. In the past, defect densities have kept commercial IC die areas below
1 cm2.
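The effect of the clustering parameter can be sketched numerically. The fragment below compares the Poisson yield e^(−λ) with the NBM of Equation 3.26, and also checks Equation 3.20 by numerically compounding the Poisson yield with a Gamma mixing function; all parameter values are illustrative.

```python
import math

def poisson_yield(lam):
    """Poisson yield: Y = exp(-lambda), random (unclustered) defects."""
    return math.exp(-lam)

def nbm_yield(lam, alpha):
    """Negative Binomial Model (Eq. 3.26): Y = (1 + lambda/alpha)^(-alpha)."""
    return (1.0 + lam / alpha) ** (-alpha)

# A 1 cm^2 die at D0 = 1 fatal defect/cm^2, so lambda = A*D0 = 1 fault/chip.
lam = 1.0
for alpha in (0.5, 1.0, 2.0, 10.0, 100.0):
    print(f"alpha={alpha:6.1f}  Y_NBM={nbm_yield(lam, alpha):.3f}  "
          f"Y_Poisson={poisson_yield(lam):.3f}")

# Check Eq. 3.20 numerically: compounding exp(-lambda) with a Gamma mixing
# function f(lambda) reproduces the NBM result (crude midpoint-rule sum).
alpha, beta = 2.0, 0.5                       # mean = alpha*beta = 1.0
def gamma_pdf(x):
    return x ** (alpha - 1) * math.exp(-x / beta) / (math.gamma(alpha) * beta ** alpha)
step = 50.0 / 200_000
y_int = sum(math.exp(-x) * gamma_pdf(x) * step
            for x in (step * (i + 0.5) for i in range(200_000)))
assert abs(y_int - nbm_yield(1.0, 2.0)) < 1e-4
```

Small α (strong clustering) gives a markedly higher yield, while large α converges to the Poisson value e^(−1) ≈ 0.368, in line with the discussion above.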
Clustering has been shown to apply to large areas of a wafer, which typically
exceeds the area of a chip. This is known as large area clustering. If large area
clustering is assumed, it is common in analysis to assume that defects are uniformly
distributed within the cluster (i.e. within a chip). We note that this was the assump-
tion implicitly made in Equation 3.26, in that the same value of a was applied to
the entire chip. To determine if large area clustering holds for a chip of a particular
size, it is necessary to examine particle distributions on actual wafers. This is accom-
plished by first dividing the wafers into square regions called quadrats. The number
of particles which occur per quadrat is then counted so that a frequency distribution
for the number of particles per quadrat can be obtained [29]. The parameters of the
Negative Binomial distribution can then be determined from a maximum likelihood
estimation technique and checked for goodness of fit with a chi-square test. The
quadrat area is varied and the process repeated. The variability in the estimates of
a can be obtained. The validation of the large-area clustering assumption is based
on the overlap of the standard deviations (σ_α) of the estimated values of α for
increasing quadrat areas. For quadrat areas up to a critical value, the ranges will
overlap, indicating that α can be considered constant. Beyond this point the ranges
will no longer overlap, and the value of α will begin to increase. Equation 3.26
will no longer be valid beyond this point.
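A simple moment-based estimate of α from quadrat counts, per Equation 3.25 with λ and σ² replaced by the sample mean and variance, can be sketched as follows; the count data is hypothetical (a full treatment would use maximum likelihood and a chi-square fit, as described above).

```python
import statistics

def estimate_alpha(quadrat_counts):
    """Moment estimate of the clustering parameter (Eq. 3.25):
    alpha = mean^2 / (variance - mean); returns None when the variance does
    not exceed the mean (no over-dispersion, i.e. the Poisson case)."""
    m = statistics.mean(quadrat_counts)
    v = statistics.pvariance(quadrat_counts)
    return None if v <= m else m * m / (v - m)

# Hypothetical particle counts per quadrat from one wafer map (illustrative).
counts = [0, 0, 1, 0, 3, 5, 0, 0, 1, 0, 6, 0, 0, 2, 0, 0]
alpha = estimate_alpha(counts)   # small alpha -> strong clustering
```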
The large area clustering assumption has proven to be valid for well over a
decade and a half. Equation 3.26 has been used successfully in the industry for
many years and for many products. In our yield analysis, we will assume large area
clustering. As chip sizes approach wafer scale dimensions, however, it is very much
an open question as to what yield model is most appropriate. For very large chips
the relationships between clusters is critical, as it impacts the amount and type of
redundancy needed for fault-tolerance. What essentially happens for very large chip
areas is that there is clustering of clusters. There have been several models proposed
[31, 22, 38] to evaluate WSI or near-WSI yields, but there are no standards as yet.
CHAPTER 4
SYSTEM ARCHITECTURE
4.1 Architectural Overview
Linear systolic arrays can be used to implement a variety of DSP algorithms
such as convolution, FIR filtering, Fourier Transforms and polynomial operations
[33]. Linear arrays can also be used to perform linear algebraic operations such
as matrix-matrix and matrix-vector multiplication [13]. Linear arrays have reduced
input/output requirements compared to two-dimensional and vector arrays; moreover,
these requirements remain constant as more processors are added. This is a key point, as large-area
chips are typically I/O constrained. Figure 4.1 depicts the architecture of our pro-
posed system. It consists of sixteen multiply-accumulate processing elements (PEs)
connected to form a fault tolerant linear array. Here, a PE which has failed is by-
passed completely, and replaced by a spare. There is one spare PE per modulus for
both normal and conjugate channels. If more than one PE has failed per modulus,
then the system can still operate with a reduced number of PEs. We note that it is
possible to obtain full utilization of all good processors in a linear array, which, in
general, is not the case in a two-dimensional array.
4.2 Multiply Accumulate PE Architecture
Figure 4.2 depicts the architectural details of the GEQRNS processing ele-
ment, and a die photograph is shown in Figure 5.7. The PE has been optimized
to perform complex multiply-accumulate type operations on both in-place or partial
result data. Two eight-bit operands to be multiplied, x and y, are the exponents of
elements α^x, α^y ∈ GF(p) \ {0}. The y operand bus (Y-bus) supplies data to the
Figure 4.1: System Architecture
multiplier directly. The x operand bus (X-bus) supplies data to the multiplier and
to the input of the data-storage shift register (DSSR). It is desirable to incorporate
local storage at each PE in the array, since there are some algorithms for which the
data dependency does not permit operands to arrive at each PE via a purely linear
flow (whether in the same or opposing directions). An example of this is matrix
multiplication, which typically employs a 2-D or vector array. For signal processing,
however, matrix multiplication is normally restricted to the case where at least
one matrix is pre-known (usually filter coefficients), or even further to matrix-vector
multiplication (i.e. signal vector). We can thus pre-store the columns of the known
matrix for matrix multiplication (or block multiplication, for large matrices), and lin-
early propagate the rows of the input matrix or signal along the array. This results
in greatly reduced I/O requirements.
In this implementation, the DSSR can be used to store up to sixteen operands
which are known a priori. A shift register was chosen over a SRAM since it requires
less area than a sixteen by eight-bit SRAM (i.e. no decoders or read/write cir-
cuitry) and requires minimal control logic. As it stands, the DSSR is approximately
60% of the size of the multiply-accumulate portion of the PE, which illustrates how
area-expensive programmable storage is. If more storage is desired for a particular
application, then at some point a SRAM will become more area-efficient, even with
the associated increase in overhead. Once data has been loaded into the DSSR, it can
be circulated continuously via an internal feedback path. The X-Bus can be freed for
other global data-move operations when locally stored data is used. Data shifting
in the DSSR can also be halted (which is a significant feature since we are using
dynamic latches). This will greatly simplify array data-flow timing, as processors
further along the linear chain can wait for operands/partial results from previous
processors, with their stored internal operands "lined up in place" and ready for
processing.
Just prior to multiplication, it is necessary to check for a zero operand, since
zero must be handled as an exception in GEQRNS. An unused binary code word is
chosen as the GEQRNS zero (i.e. some value between p_i and 2^8 − 1). In this case,
255D was chosen, since zero detection will simply be the logical AND of all data bits.
If a zero is detected for either operand, a flag is raised, which will set the input of
the exponentiation table's pipeline register to zero after the next clock cycle. The
output of the modular adder provides the address input to the exponentiation table.
We note that we could have used a standard binary adder here, and performed the
modular reduction in the ROM. It is much more area efficient, however, to perform
the modular reduction first, since the size of the ROM is halved. This is evident from
a comparison of the relative size of the modular adders and ROM in Figure 5.7.

Figure 4.2: Processor Architecture

A
doubling in ROM area would represent a substantial increase in PE area, whereas our
modular adders are only about 50% larger than a similar word-width binary adder.
The ROM table has the value of α^s mod p_i programmed at each corresponding
address location s. The QRNS product is thus obtained at the output of the ROM.
The computed product is then fed to one input of another modulo adder,
which reduces the computed sum mod(pi). If the second input of this modulo adder
is connected in feedback to its output, an accumulator is formed. This mode is used
for algorithms requiring results to be computed in place. For algorithms requiring
partial results to flow from PE to PE, the second input of the mod(pi) adder is
connected to an adjacent processor. A dedicated bi-directional systolic output bus
is used to transfer computed QRNS results (in place or partial) to the next adjacent
processor in the array, in the natural ordering predicated by the algorithm being
implemented.
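The PE data path described above can be sketched behaviorally. In the fragment below, the prime p = 193 is an assumed, illustrative 8-bit modulus (the actual moduli are fixed at mask-programming time), and the exponent addition is modeled as a mod (p − 1) adder feeding the exponentiation ROM.

```python
# Behavioral sketch of one GEQRNS PE channel for a single modulus. The prime
# p = 193 and the generator search are illustrative assumptions.
P = 193
ZERO_CODE = 255              # unused code word chosen as the GEQRNS zero
                             # (all ones, so zero-detect is an AND of the bits)

def find_generator(p):
    """Brute-force search for a generator alpha of the cyclic group GF(p)*."""
    for g in range(2, p):
        seen, x = set(), 1
        for _ in range(p - 1):
            x = (x * g) % p
            seen.add(x)
        if len(seen) == p - 1:
            return g

ALPHA = find_generator(P)
EXP = [pow(ALPHA, k, P) for k in range(P - 1)]    # exponentiation ROM contents
LOG = {EXP[k]: k for k in range(P - 1)}           # log table (input conversion)

def to_geqrns(v):
    """Map a QRNS residue to its exponent; zero is the tagged exception."""
    return ZERO_CODE if v % P == 0 else LOG[v % P]

def pe_mac(acc, x_exp, y_exp):
    """One MAC step: add exponents (modeled as a mod (p-1) adder), look up
    the product in the exponentiation ROM, then accumulate mod p."""
    if ZERO_CODE in (x_exp, y_exp):
        prod = 0                                  # zero-flag exception path
    else:
        prod = EXP[(x_exp + y_exp) % (P - 1)]
    return (acc + prod) % P

# Accumulate 3*5 + 0*7 + 10*20 = 215, i.e. 22 mod 193.
acc = 0
for x, y in ((3, 5), (0, 7), (10, 20)):
    acc = pe_mac(acc, to_geqrns(x), to_geqrns(y))
```

Note how the multiply itself costs only an exponent addition and a table look-up, which is the essence of the Galois-enhanced scheme.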
4.2.1 Modulo-P Adders
Figure 4.3 depicts the basic construction of the modulo adders used for the
multiplier and accumulator portions of the PE. This is a standard biased-addition
scheme [9], where an offset of value 2^n − p_i is added to an n-bit adder to make a
mod(p_i) adder (i.e. any n-bit binary adder is intrinsically mod(2^n)). Operation of
this scheme requires that the input words be ∈ {0, …, (p_i − 1)}, which is accomplished
during input conversion. The magnitude of p_i is also required to be less than 2^n − 1.
Two binary adders are cascaded such that the output of the first adder is input to
the second adder, which has an appropriate offset added. Here, eight-bit ripple carry
adders are used, and the offset added to the second adder is the value (2^8 − p_i). The
correct mod(pi) sum is selected from the outputs of either the first or second adder
via a multiplexer controlled by the logical OR of the carry bits of the first and second
adders.
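A bit-level model of this biased-addition scheme, under the same assumptions (n = 8, operands already reduced, and an illustrative modulus), can be sketched as:

```python
# Bit-level sketch of the biased-addition mod(p) adder of Figure 4.3.
N = 8
MASK = (1 << N) - 1

def mod_add(x, y, p):
    """(x + y) mod p for x, y in {0,...,p-1}, with p < 2^n - 1."""
    s1 = x + y
    c1, sum1 = s1 >> N, s1 & MASK          # first adder: sum and carry-out
    s2 = sum1 + ((1 << N) - p)             # second adder adds offset 2^n - p
    c2, sum2 = s2 >> N, s2 & MASK
    return sum2 if (c1 | c2) else sum1     # mux on the OR of the two carries

# Exhaustive check against ordinary modular arithmetic for p = 193.
p = 193
assert all(mod_add(a, b, p) == (a + b) % p for a in range(p) for b in range(p))
```

If neither adder overflows, the sum was already below p and the first adder's output is selected; otherwise the offset sum equals x + y − p, which the carry-controlled multiplexer selects.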
In most implementations, standard cell adder modules are used, and the off-
set programmed by hardwiring inputs to low or high as needed. This is somewhat
wasteful as the offset is known a priori, and thus half of the logic of the second adder
need not be included. In this implementation, the offset has been bit programmed
by only implementing the logic corresponding to an added zero or one at each bit
position of the second adder. Basic mod(p) adder primitive cells are then constructed
from a FA and the logic corresponding to offset-zero and offset-one (see Figure 4.4).
The transmission gate adder presented in [42, pp.317-320] was used to implement the
FA circuit, since pass-gate implementations were found to be faster than standard
Figure 4.3: Standard Modulo P Architecture
Table 4.1: Full Adder truth table.
FA Truth Table
Inputs          Outputs
C_i  A_i  B_i   S_i  C_{i+1}
0    0    0     0    0
0    0    1     1    0
0    1    0     1    0
0    1    1     0    1
1    0    0     1    0
1    0    1     0    1
1    1    0     0    1
1    1    1     1    1
CMOS realizations. These blocks can then be used to construct a mod(p) adder of
arbitrary value. Since the datapath width is only eight bits, a ripple carry scheme
can be used here without speed penalty.
For larger wordwidths, such as needed in the CRT, a carry select scheme can
be used with these same small-width ripple carry modulo adders (Figure 4.5). Here,
the larger dynamic range is partitioned into m k-bit sections. The offset value
corresponding to 2^(mk) − M, where M is the desired modulus, is programmed as before
by selecting and arranging the appropriate primitive cells. We can think of these
zero and one primitives as having two input and output carry bits, and a multiplexer
select input. The four possibilities for the input carries can be preprogrammed and
selected by the output carries from the previous k bit section. The total modulo add
time is thus:
T_modM = T_k-ripple + (m − 1)T_mux4 + T_OR + T_mux2 + T_mux4    (4.1)

where T_mux4, T_OR, and T_mux2 are the propagation delays of the four-input carry
multiplexer, the final OR-gate and the two-input multiplexer (in the primitive cells),
respectively. For typical technologies, these times will all be < 1 ns (including wiring
delays over the distances involved), and the delay for a single primitive cell is 1 ns.
The area required for an eight-bit modulo section is 708λ × 140λ (283 μm × 56
μm in a 0.8 μm technology). Thus, replicating the k-bit sections 4(m − 1) times will
consume minimal area.

Figure 4.4: Modulo-P Adder Building Block Primitives.
4.3 Forward Mapping: Integer to GEQRNS
Equation 2.5 and Equation 2.6 describe mappings necessary to convert input
data to and from QRNS, respectively. These equations are implemented as shown
in Figure 4.6 and Figure 4.7, respectively. For the forward mapping, it is necessary
to reduce the real and imaginary components of the input eight-bit data streams
Figure 4.5: Carry Select Modulo Adder.
mod(p_i), since the correct operation of the mod(p_i) adders requires that the input
operands be ∈ {0, …, p_i − 1}. This is accomplished by two 256 by eight-bit ROMs. For
the imaginary part of the input words, multiplication by ĵ and modular reduction
are also accomplished in the same ROM. We note that +ĵ is used for the normal
channel and that −ĵ is used for the conjugate channel. The ROM outputs are then
input to a modular adder to complete the QRNS mapping. The QRNS operand is
then converted to GEQRNS via a final logarithm table, which has the zero encoding
present at address zero. Again, it is more area-efficient to reduce the sum modulo p_i
first, rather than perform this operation in the ROM, as the size of the ROM would
approximately double. We note that four of the modules shown in Figure 4.6 are
needed per modulus (see Figure 4.1).

Figure 4.6: Forward-mapping (θ) conversion module with GEQRNS log table
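The forward mapping for one modulus can be sketched arithmetically. Here p = 193 is an assumed illustrative modulus; any prime p ≡ 1 (mod 4) guarantees that an element ĵ with ĵ² ≡ −1 (mod p) exists.

```python
# Sketch of the forward mapping: a complex integer a + jb maps to the QRNS
# pair (z, z*). The modulus p = 193 is an assumed illustrative value.
P = 193
J_HAT = next(x for x in range(2, P) if (x * x) % P == P - 1)

def forward_map(a, b):
    """z = a + j_hat*b (normal channel), z* = a - j_hat*b (conjugate
    channel), both mod p. In hardware the multiply by +/-j_hat and the
    reduction are folded into the input-conversion ROMs."""
    return (a + J_HAT * b) % P, (a - J_HAT * b) % P

# QRNS turns complex multiplication into two independent real products:
z1, zc1 = forward_map(3, 4)
z2, zc2 = forward_map(5, -2)
prod = ((z1 * z2) % P, (zc1 * zc2) % P)
# (3 + 4j)(5 - 2j) = 23 + 14j, and prod matches forward_map(23, 14).
```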
4.4 Inverse Mapping: QRNS to Residue
The architecture of the inverse QRNS mapping is shown in Figure 4.7. Input
data arrives from the output of the last PE in the array. Equation 2.6 requires that we
perform a z − z* operation for the imaginary part of the output. This is accomplished
by looking up the modular complement of z* (i.e. −z* + p_i) before adding it to the z
term. This keeps the inputs to the modular adder in the desired range for correct
operation (i.e. {0, …, p_i − 1}). The result of the modular addition is used to look up
(2ĵ)^−1 times the input to the ROM, with a final mod p_i reduction. The
operation of the real part of the inverse mapping is analogous, except that the
normal and conjugate operands can be added modulo p_i directly, and that 2^−1 times
the input to the ROM is looked up, with a final mod p_i reduction.
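This inverse mapping can be sketched for one modulus; the values p = 193 and ĵ = 81 are assumed for illustration (81² ≡ −1 mod 193).

```python
# Sketch of the inverse mapping (Eq. 2.6): recover the real and imaginary
# residues from the normal/conjugate pair (z, z*). Assumed p and j_hat.
P, J_HAT = 193, 81
INV2  = pow(2, P - 2, P)                  # 2^-1 mod p (Fermat inverse)
INV2J = pow(2 * J_HAT, P - 2, P)          # (2*j_hat)^-1 mod p

def inverse_map(z, zc):
    """a = 2^-1 (z + z*), b = (2 j_hat)^-1 (z - z*), both mod p. The z - z*
    term adds the modular complement -z* + p so inputs stay in range."""
    a = (INV2 * (z + zc)) % P
    b = (INV2J * (z + (P - zc) % P)) % P  # complement look-up, then add
    return a, b

# Round trip for the product example: (z, z*) = (192, 47) -> a + jb = 23 + 14j.
a, b = inverse_map(192, 47)
```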
4.5 Chinese Remainder Theorem
The architecture of the CRT is shown in Figure 4.8. Here the real and imag-
inary outputs of each of the modules in Figure 4.7 are grouped and fed to real
and imaginary CRT modules (see Figure 4.1). This is denoted by the xi inputs in
Figure 4.7: Inverse-mapping (θ^−1) conversion module.
Figure 4.8, and likewise in Equation 2.3. We note that the x_i terms are the only
"unknowns" in the CRT, as m̂_i and m̂_i^−1 are pre-known constants. We can thus use
x_i to look up the corresponding expansion terms in the CRT, and perform the de-
sired mod(M) reduction at the same time in four ROMs. We could have used 256 by
31-bit ROMs here (the dynamic range is 30.9 bits), but they would be significantly
slower than their narrower counterparts due to increased internal capacitive loads
(i.e. higher word-line capacitance). It is imperative to keep the delays in the CRT
the same as those in the rest of the system, so as not to degrade overall performance.
The final modular summation is accomplished by a mod(M) adder tree of carry-select
modular adders of the type described in Figure 4.5. Thus, all of the input and out-
put conversion hardware can be implemented with the same basic ROM and modular
adder cells developed for the PE. The computational throughput is thus the same for
all elements in the system.
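The ROM-based CRT stage can be sketched as follows. The moduli below are assumed 8-bit primes chosen to give roughly the 31-bit dynamic range quoted in the text; the actual mask-programmed values may differ.

```python
from math import prod

# Sketch of the CRT output stage: each residue x_i addresses a ROM holding
# its precomputed expansion term reduced mod M; a modular adder tree then
# sums the four terms. Moduli are illustrative assumptions.
MODULI = (193, 197, 229, 233)
M = prod(MODULI)                                       # about 2^30.9

M_HAT = [M // p for p in MODULI]                       # m_hat_i
M_HAT_INV = [pow(mh, -1, p) for mh, p in zip(M_HAT, MODULI)]

# One "ROM" per modulus: expansion term for every possible residue value.
CRT_ROM = [[(mh * ((mhi * x) % p)) % M for x in range(p)]
           for mh, mhi, p in zip(M_HAT, M_HAT_INV, MODULI)]

def crt(residues):
    """Modular adder tree: sum the looked-up expansion terms mod M."""
    total = 0
    for rom, x in zip(CRT_ROM, residues):
        total = (total + rom[x]) % M
    return total

value = 123_456_789
assert crt([value % p for p in MODULI]) == value
```

Since the residues are the only variable inputs, the per-modulus look-up tables absorb all the constant multiplications, leaving only modular additions in the output path, exactly as the text describes.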
Figure 4.8: CRT block diagram.
CHAPTER 5
VLSI IMPLEMENTATION OF PROCESSING ELEMENT
5.1 True Single Phase Clocking Scheme
A TSPC edge-based clocking scheme was selected for the implementation of
the processing element. Edge-based clocking was selected over a latch based scheme
for several reasons, the most significant of which is the requirement that the data
should be able to flow in two directions. Bi-directional data flow offers the most
flexibility from an algorithm-mapping standpoint, which is critical for the linear array.
Another reason for the use of an edge based scheme is that the circuitry of the PE is
very heterogeneous from a VLSI perspective. The modular adders are combinational
circuits and the ROM is a CMOS domino-logic circuit. Furthermore, the accumulator
portion of the PE is either a state machine (i.e. new data gets added to the running
sum which is fed-back), or a combinational circuit if partial sums arrive from other
adjacent PEs. We thus have a combinational multiplier (mod-P adder) which drives a
CMOS domino ROM, which in turn drives a state machine or another CL block. Also,
the data storage shift register introduces its own set of timing requirements as we
will see subsequently. The clocking scheme of the PE was the dominant design issue,
which required extreme care to implement successfully. The timing requirements
ultimately determined the types of circuits selected. All of the synchronization issues
introduced in Chapter Three came into play for the design. The move from NPTC
to TSPC clocking was the most significant feature which distinguished this version
of the PE from its earlier predecessors. The associated increase in performance was
greater than a factor of two.
Figure 5.1: TSPC Pipeline Register
5.1.1 Pipeline Registers
The split-output latch circuit [44] shown in Figure 5.1 was used to implement
all pipeline registers and DSSR cells in this chip. This particular cell exhibits minimal
clock loading in that only two transistor gate loads are seen per stage. This is
ultimately why this particular variant was selected over others in its family, which
have four gate loads per stage [44]. This fact is very important since there are
many gates connected to the clock line in a highly pipelined architecture such as
this. The DSSR contributes the highest portion of capacitance per PE to the clock
line, so halving the clock load per latch is an excellent tradeoff. The split-output
latch is more difficult to implement than its counterparts, and must be designed very
carefully. This will become apparent shortly.
The pipeline registers (flip-flops) are negative-edge triggered.
Negative-edge triggering was chosen because it was most compatible with the timing
requirements of the ROM. A single large inverter can also be used as a clock buffer,
since most external system level clocks are positive-edge triggered.
5.1.2 Data Storage Shift Register
The data storage shift register imposes the strictest timing constraints on the
clock signal. A shift register can be built with the un-buffered (inverting) form of the
split-output latch, if there are an even number of stages. This results in a minimal
transistor-count implementation. As we are interested in building a shift register
which is sixteen levels deep, then the use of this configuration is a possible choice here.
There are implicit subtleties, however, in such a scheme. This is because a potential
fast-transition race condition exists for the case where the output of a register directly
drives the input of another register. If the clock-edge transition time is slower than
(or even on the order of) the transition time of the register output, then timing
errors may result. This is demonstrated in Figure 5.2 from SPICE simulations on
the extracted layout of the un-buffered shift-register cell. The simulation represents
the output for a cascade of two cells. The first cell in the chain is driven from a
"stable" data source (bitO), which changes well before the negative-edge of the clock.
The output of the first cell (bitl), drives the second cell whose output is shown in
the third curve (bit2). The clock-edge transition time is swept from 1.75 ns to 3.75
ns, in 0.25 ns increments. At about an input clock-edge transition time of 2.75 ns, a
gradual negative slope can be observed at the output of the second latch, during the
high portion of the clock.
The exact failure mechanism is complex, and can best be understood by
an examination of the fourth and fifth curves. A fast low-to-high transition will
cause transistor M7 to turn off too early (i.e. before M3 fully turns off). This may
cause excess charge from node SYP to be deposited on node SYN (which should
have been fully discharged), thereby causing its potential to rise. If this potential
exceeds the threshold voltage of M6 then an unwanted discharging of the output node
will result. The potential of node SYN is augmented by clock-feedthrough when the
clock transitions from low-to-high, which exacerbates the problem. We note that this
behavior is a function of the clock edge-rate only, and cannot be solved by slowing
the system down. The clock-edge rate will thus have to be kept relatively fast for the
unbuffered case. The total parasitic capacitance on node SYN should be kept as large
as is practical, so that the magnitude of the "hop" induced by clock-feedthrough is
minimized. For this reason, oversized diffusion islands were used on the drains of
transistors M7 and M3 (see Appendix A).
The situation can be ameliorated, however, by introducing some delay between
latches. The most obvious solution is to use "weak" inverters to buffer the outputs.
This was used here, but was taken a step further which yielded a double gain for the
tradeoff in area. A feedback path was added to the DSSR cells, which permits the
data contained in the DSSR to be "frozen" (note the latches are dynamic). From a
data-flow standpoint this is desirable, as we can stop the shift registers in adjacent
processors with operands in the correct "place". This means that we do not have
to zero-pad the stored data for alignment in the shift register. The DSSR is costly
enough in area, and including extra cells just so that they can be zero-padded is an
unacceptable waste. The tradeoff, of-course, is more transistors per cell (and another
control signal). However, this was deemed viable during the design phase, as the area
of the DSSR is still smaller than an SRAM based implementation. The final shift
register cell implementation is shown in Figure 5.3. The maximum transition time
of the clock is now approximately 4.25 ns.
5.2 Exponentiation ROM
Semiconductor memories provide the key computational elements of most RNS
systems and usually determine maximum obtainable operating speed. A ROM was
chosen over a RAM for the Exponentiation table of the PE, since a ROM of a partic-
ular size is a factor of four to six times smaller than a RAM of the same size. There
is also no need for programmability here, as the moduli are fixed.

Figure 5.2: SPICE simulation of fast transition path for shift register

Figure 5.3: TSPC Shift Register Cell with Storage

A block diagram
of the logical organization of the Exponentiation ROM is shown in Figure 5.4, and
the key circuit elements are shown in Figure 5.5. The ROM is masked programmed
based on the modulus choice, by including a contact to connect a programming (dis-
charge) transistor to the bit line. Connecting a programming transistor in the ROM
array, results in a logical zero at the output of the ROM, for the accessed bit location.
There are eight parallel bit locations accessed per address, thereby producing a byte
of data per address input. There are a total of 256-bytes stored in the ROM. Since
the programming transistors are of minimal width, they will limit operational speed
due to the discharge delay of the bit-line. The ROM logic is partitioned so that the
bit line height can be kept relatively short (which results in reduced parasitic capac-
itance). Differential sensing techniques were also used, so that a logical zero could
be determined well before the bit line has fully discharged.
Figure 5.5 reveals that the ROM is precharged during the low clock level,
which is just after new data has arrived at the address inputs. The data is stable
during the evaluation (high clock) stage, when the word-line decoders become active.
Figure 5.4: Floorplan of Exponentiation ROM.
Figure 5.5: Key ROM Circuit Elements.
A precharge PMOS transistor which is controlled by the clock and a "weak" PMOS
device connected in a positive feedback configuration have been included at the sense
amplifier bit-line-voltage-input. The precharge transistor was included so that this
point may quickly charge to Vdd. If there were no PMOS devices here, then the
potential at this node would slowly charge to approximately 4 V through two series
NMOS devices of the column decoder (since one of the four paths is always active).
The PMOS device in the feedback loop also helps with precharging once the sense
amplifier has "tripped", but was primarily included to maintain a high potential at
this node (for a programmed high), in case of any charge loss caused by leakage paths.
The low cycle duration of the clock can thus be shortened during precharge, so that
more time can be given to the high cycle (where bit-line pull down occurs) without
adding to the total clock period. This is consistent with the notion of "trading delay",
introduced in Chapter Three.
Figure 5.6 shows a SPICE simulation of the operation of the ROM, for the case
of a programmed low. This represents the worst case delay, as a logical high value
is obtained by default from the action of precharging the bit-line. The simulation
was done on an extracted layout of a bit-line programmed with all zeros, as this
is the worst case for parasitic diffusion capacitance. A sense amplifier and the full
word-line decoder were also included, and the parasitic (gate) capacitance of the
maximum possible number of programming transistors per word-line (64) was also
taken into account. The first graph shows the clock input and address line zero,
which becomes active shortly after the clock goes low. We note that even though a
valid address is present, all word-lines are low due to the action of the dynamic NAND-gate
and inverter combination of the word-line decoders (see Figure 5.5). The output
of the sense amplifier is shown in the second graph. This goes high as expected
during precharge. The potential of the sense amplifier input as well as the bit line
is shown in the third graph. These two points charge at essentially the same rate.
The potential of word-line-one is shown in the fourth graph, which becomes active
after some delay during the evaluate phase of the clock. When the clock becomes
high, the bit line begins to fall. We note the input of the sense amp begins to
transition slightly after the bit line, due to the action of the "weak" PMOS device.
This point soon "catches-up" to the bit line once the sense amp has had time to turn
the feedback device off. The bit-line potential falls as expected. The rise time of the
bit-line is approximately 2.4 ns during precharge and the fall time is approximately
3.0 ns during discharge. The propagation delay taken from the mid-point of the
rising clock edge to the mid-point of the sense-amp output is approximately 5.6 ns.
The simulation was conducted with clock rise and fall times of 2 ns, and with a clock
period of 15 ns. The simulation output suggests, however, that the cycle time could
possibly be shortened to approximately 12 ns.
5.3 Electronic Reconfiguration Switches
In an effort to improve the survivability and manufacturability of large linear
chains of PEs, electronic reconfiguration switches were incorporated into the PE
architecture. These elements are the outermost transmission gates in Figure 4.2.
The layout ground rule used for these elements was less aggressive than that of other
circuits in the chip, since it is desirable that a very high switch yield be exhibited.
For example, a six-lambda minimum metal width and spacing was used (the minimum
is three-lambda), in order to reduce the likelihood of shorts and opens, and double
contacts and vias were used between all layers. One author has suggested that extra
or missing material defects greater than two line spacings or widths, respectively, are
considered so rare that they almost never occur in practice [10]. Indeed, it has been
verified consistently, for several years, that the defect frequency falls off with the cube
of the defect radius [28]; thus, small changes in line spacings or widths can greatly impact
survivability. Bypass interconnection lines were also run in metal two, with minimal
Figure 5.6: SPICE simulation of ROM operation.
logic placed underneath these channels, in order to reduce the probability of inter-
layer shorts due to oxide pinholes. In order to truly determine the fault vulnerability
of the reconfiguration switches, it is necessary to have detailed knowledge of the
process defect statistics. For example, it has been suggested [40] that, due to the
use of positive photoresist in most high resolution processes [3], extra material
defects are a factor of ten more likely than missing material defects. This means
that wider spaced, thinner interconnection wires would actually have a higher yield
than thicker, closer spaced wires in such a process. Thus, accurate modeling of the
interconnection yield within the context of the process defect characteristics is crucial.
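The leverage of the relaxed ground rule can be made concrete under the cubic fall-off cited above [28]: if the defect size density decays as x^-3, the count of defects large enough to bridge a given spacing decays as the inverse square of that spacing. A minimal sketch (the x^-3 form and the cutoff x0 are modeling assumptions, not data from this process):

```python
def fatal_fraction(spacing, x0=0.1):
    """Fraction of extra-material defects large enough to bridge a given
    line spacing, assuming the defect size density falls off as 1/x**3
    beyond a minimum size x0 (x0 = 0.1 um is a hypothetical placeholder).
    Integrating c/x**3 from `spacing` to infinity gives c/(2*spacing**2),
    so, normalized to fatal_fraction(x0), the result is (x0/spacing)**2."""
    return (x0 / spacing) ** 2

# Doubling a spacing (e.g. three-lambda to six-lambda) quarters the
# population of defects able to bridge it.
ratio = fatal_fraction(2.0) / fatal_fraction(1.0)   # 0.25
```

This inverse-square dependence is why the modest move from three- to six-lambda rules buys a disproportionate improvement in switch survivability.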
There are advantages and disadvantages to electronic switches versus physi-
cal restructuring techniques [36]. Physical switches, such as laser programmed links,
have lower on-state resistance compared to electronic switches, and typically exhibit
smaller area since no other logic (i.e., latches) is needed to store configuration infor-
mation. The main drawback with physical switches is that they must be programmed
at the time of manufacture, and thus, cannot correct for faults which occur in the
field. Conversely, electronic switches can correct for run-time failures and can also
compensate for manufacturing defects. In a fully pipelined architecture such as this,
the added switch propagation delay is contained in the pipeline delay, and, therefore,
is less of a design issue. Only one configuration latch is needed for all reconfiguration
switches in the PE, since all busses are switched out simultaneously. It is important to
note that the operation of the reconfiguration circuitry depends on only twenty-four
transistors (i.e., 12 transmission gates (Figure 4.2)) and twenty-four wire segments
per PE.
5.4 PE Performance
The processing element shown in Figure 5.7 was fabricated in a 1.5 µm CMOS
process (ORBIT Semiconductor [21]) and occupies a total die area of 2.4 mm x 2.4
mm. The chip represents a full-custom design with the exception of the I/O pad
drivers which were supplied by MOSIS [37]. Data was fed to the chip from an eight-
bit binary counter circuit. Because of the 40 pin limitation on the package, it was not
possible to bring out all 24 input signals (8 x 3) in addition to the power, ground,
control inputs and the 8 outputs needed. Instead, both multiplier inputs were brought
out and the external accumulator input was fed internally from one of the multiplier
inputs (the Y input). It is thus not possible to test the multiplier and accumulator
portions of the chip independently.
The tests consisted of holding the Y-multiplier input constant, while incre-
menting the X-input. In this way the contents of the exponentiation ROM could be
examined. The Y-input was held at a value of zero (decimal zero), which gets added
to the output of the ROM by the accumulator. Both the input signals and output
data were simultaneously examined with a Hewlett-Packard 16500A logic analyzer,
which sampled data at each rising clock transition. The chip clock is produced in-
ternally from a buffered and inverted version of the external clock, so a positive-edge
externally is a negative-edge internally (of course, the phases are flipped as well). Recall
that the chip is negative-edge triggered internally. It was possible to verify the pro-
grammed contents of the ROM (see Appendix B). It should be pointed out that this
is the strictest test that could be conducted, as the ROM could fail due to insufficient
precharge or discharge time. Also, since the ROM is addressed by a modulo adder,
the test verifies that the mod adder can sustain the data rate. The observed data was
compared to a mask of the programmed data by the logic analyzer, which was set up
to stop if there was any difference between observed and mask data. This test was
run for approximately three weeks. The chip was run at a clock frequency of 40 MHz
during this test without the logic analyzer stopping. The test was discontinued after
three weeks. A bit-line compare voltage of approximately 4.6V was needed to obtain
this data rate. This suggests that the bit-lines were not fully discharging at this
Figure 5.7: Die photograph of processor.
Figure 5.8: Oscilloscope photo of clock signal and output bit zero.
speed, which further supports the use of differential sensing techniques for flexibility.
The second test involved holding the X-input at zero (which we recall is coded
as 255D), while sequencing the (shared) Y-input. In this way, the operation of
the modular accumulator could be verified. A photograph of the clock signal and
the least significant data output line is shown in Figure 5.8. The least significant
data bit is "statistically" the output that changes the most, and timing problems will
usually be seen here first. Figure 5.8 shows that the output data transitions several
nanoseconds after the rising edge of the clock. This is observed because of
the sum total of I/O delay times. That is, the total delay before the new data can be
presented externally, relative to our fixed reference of the external rising edge, is the
input buffer time TIB plus the output buffer time ToB. There are also other delays
such as latch propagation delay times as well, in addition to the clock internal buffer
delay. The output seems to lag the rising clock-edge by a time on the order of 10 to
11 ns.
In order to gain a better estimate of the I/O delay times involved, a simple
pass-through test was implemented. This is just an input buffer which directly drives
Figure 5.9: Pass-through test.
an output buffer (the output buffers are inverting). This is depicted in Figure 5.9
and the actual output is shown in Figure 5.10. The time between the peak of the
input and the low of the output is approximately 5-6 ns. It is interesting to compare
the observed delay time with the simulated delay time. The output of a SPICE
simulation of the pass-through test is shown in Figure 5.11. For the simulation, a
load capacitance of 10 picofarads was used to model the capacitance seen at the
output of the chip. It was not possible to actually measure this capacitance, but
10 pF is a reasonable estimate (the scope probe is 7 pF). The simulated delay time
for a high-in to a low-out is approximately 3.1 ns and, for a low-in to high-out,
approximately 4.4 ns. This suggests that the measured delays are on the order of
25-40% higher than the simulated ones. Although this cannot be generalized for the rest of the
circuits in the chip, it still provides a crude measure of the variability between actual
and simulated performance.
Figure 5.10: Oscilloscope photo of pass-through test output.
Figure 5.11: SPICE simulation of pass-through test.
5.5 Early Versions of the Processing Element
5.5.1 Version One
Figure 5.12 details the architecture of the first version of the PE and a die pho-
tograph is shown in Figure 5.13. This chip did not use the modular adder structure
described earlier, but rather, computed the modular reduction of output operands in
ROMs. The two input operands were "multiplied" in a standard binary adder and fed
to the exponentiation ROM where they were reduced modulo p − 1 and the generator,
α, was raised to the corresponding power (all done as a single lookup). A modular adder
was made for the accumulator portion of the PE from a standard adder and a final
ROM table to perform the modulo p reduction. The PE could only be programmed
for seven-bit moduli since, as described in Section 4.2, the result of performing the
modular reduction in a ROM table after an addition, is an effective doubling of the
ROM size (i.e. the sum of two k-bit numbers is a k + 1 bit number).
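The ROM-size penalty described above follows directly from operand widths; a small illustrative sketch (generic, not tied to the fabricated cells):

```python
def rom_entries(k):
    """Address space of a mod-p reduction ROM placed after a binary
    adder of two k-bit operands: the sum is k + 1 bits wide, so the
    ROM must be addressed by k + 1 bits, doubling the 2**k entries a
    k-bit-addressed table would need."""
    return 2 ** (k + 1)

# For the seven-bit moduli of version one, the reduction table needs
# 256 entries -- double the 128 of a table addressed by k bits alone.
```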
The modular adders produced from a binary adder and a look-up table were
two pipeline stages deep (see Figure 5.12). For the accumulator there is an implicit
subtlety here. If this structure is used, then odd and even indexed terms in a sum-
mation will be accumulated independently. That is, two partial sums will be formed,
one for the even terms and one for the odd terms, which cannot directly be summed.
This requires a final summation, external to the PE, to complete the accumulation.
This fact forced the development of cells that would support the implementation of
a single-cycle modulo adder, without compromising speed. At the time, the only
cells available [20] to build the modular adders were four-bit carry-lookahead adders
which would not have operated fast enough to support an un-pipelined cascade. The
bit-programmed modular adder cells described earlier, thus significantly impacted
the design.
The first version of the PE also employed non-overlapping pseudo two phase
clocking. A photograph of the clocking waveforms is shown in Figure 5.14. In this
figure, the vertical axis is 10x, the lower trace is φ1 and the upper trace is φ2. The
pipeline registers were composed of a cascade of two transparent latches (just as
described in Chapter Three). Data was input to the first latch during the high phase
of φ1 and was presented to the logic circuits on the rising edge of φ2. The ROM was
precharged during the high phase of φ2 (i.e., PMOS precharge devices were controlled
by φ2) and was evaluated on the low phase of φ2. Output data was then latched on
the falling edge of φ1. It was actually necessary to make a third clock signal to avoid
a race condition in the ROM word-line decoder, by delaying φ2 slightly. The early
ROM employed a dynamic NOR-gate decoding structure rather than the dynamic
NAND-gate used in the final version. The NOR gate was gated (ANDed) with the
delayed version of φ2, so that the word-lines would be low during precharge. The
gating clock signal had to be delayed so that a momentary glitch would not occur on
the word-lines just after precharge, before all but one of the NOR gates discharged. This
complication was eliminated with the NAND structure (although the dynamic NOR
is intrinsically a faster gate). The first PE was fabricated in a 2 µm CMOS technology
[37] and was shown to operate at a clock frequency of 16 MHz.
5.5.2 Version Two
The second revision of the PE was the first to use the modular adder scheme
in its accumulator. Its architecture is given in Figure 5.15, and a die photograph
is shown in Figure 5.16. The removal of the second ROM resulted in a significant
decrease in area. A modulo adder was not used as the "multiplier" portion of the
chip, due to vertical space limitations in the MOSIS Tiny-Chip [37] pad-frame. This
PE thus also used seven-bit moduli. The second chip was also an NPTC machine,
which employed the same ROM and pipeline latches as its predecessor. This chip
was also fabricated in the MOSIS 2 µm CMOS technology and was shown to operate
at 16 MHz.
Figure 5.12: Processor architecture of first version.
Figure 5.13: Die photograph of first version of PE.
Figure 5.14: Oscilloscope photo of non-overlapping clocks for first chip.
Figure 5.15: Processor architecture of second version.
Figure 5.16: Die photograph of second version of PE.
CHAPTER 6
YIELD ENHANCEMENT AND FAULT TOLERANCE
6.1 Yield Enhancement via Reconfiguration
It is evident from Chapter Three that large area integrated circuits must
have some measure of fault tolerance. For highly structured architectures, which
have identical replicated cells, it is sometimes possible to include redundant modules,
of which m out of n must function in order for the chip to be considered usable. In
the past, many authors have considered the reconfiguration yield to be unity, assuming
that all failures could be corrected for by the reconfiguration scheme. Assuming a
reconfiguration yield of one is increasingly being shown to be a poor assumption,
particularly if the system area is large. What is really assumed is that m out of
n cells are free from defects which affect reconfigurable nets and that all n cells are
free from defects that affect non-reconfigurable nets. Actually, we must also consider
defects that affect global signals such as power and ground, as they cannot directly
be reconfigured for. This requires a very detailed consideration (layout extraction)
of susceptible areas on a net-by-net basis, however, which is more suitable for CAD
tools. For our purposes we can consider our PE area as being composed of two
components:
A_PE = A_PE-r + A_PE-n    (6.1)

where A_PE-n is the area of the bypass and interconnection busses and A_PE-r is
the area of the remaining computational-logic portions of the PE.
As mentioned previously, the layout ground rule used for the bypass elements
was less aggressive than that of other circuits in the chip. Since the complexity
of these elements is greatly reduced over that of other portions of the circuit, an
argument could be made that the effective fatal defect density of the switches is
reduced over that of the rest of the circuit (i.e. fewer applicable defect mechanisms).
However, in order to truly quantitatively determine the relative fault vulnerability of
reconfiguration switches, it is necessary to have detailed knowledge of mask-by-mask
process defect statistics. Since we do not have such statistics, we will not speculate,
and will consider all areas of the PEs at the same defect density. Our estimates will
thus be slightly conservative. Since we are assuming large area clustering, the faults
in adjacent modules are dependent (uniformly distributed), and we cannot use simple
binomial expressions for determining whether M out of N modules are functioning (such
expressions assume faults are independently distributed).
Systolic arrays in which R spare PEs have been added, where at least M =
N − R out of N must function, have been proposed before. The independent, parallel
nature of RNS channels, however, provides further unique opportunities for enhanc-
ing fault and defect tolerance of systolic arrays. For our proposed four modulus
system, each processing node in the array is effectively broken up (algorithmically
and physically) into four smaller PEs via the RNS mapping. This would suggest that
the inclusion of an extra PE per modulus would permit up to four faults to be
tolerated (one per channel), rather than just one, when compared to a similar system (i.e., similar
in dynamic range and arithmetic functionality) in which the computations are carried
out over one physically contiguous PE. The requirement, here, is that no more than
one fault occurs per modulus. If there is more than one fault per modulus, then the
number of usable PEs will be less than M.
Koren and Stapper [11] have presented a yield model for chips with redun-
dancy consisting of multiple module types, in which the failures in adjacent modules
are dependent. We can consider our proposed linear array chip as fitting this cate-
gory, where there are eight different module types (two x four moduli), where M out
of N PEs must function, and there are R spares for each module type. The following
expression describes the total yield:
Y = Σ_{M1=N1−R1}^{N1} Σ_{M2=N2−R2}^{N2} ... Σ_{M8=N8−R8}^{N8}  Σ_{k1=0}^{N1−M1} Σ_{k2=0}^{N2−M2} ... Σ_{k8=0}^{N8−M8}
       (−1)^{k1} (−1)^{k2} ... (−1)^{k8} (N1 choose M1) (N1−M1 choose k1) ... (N8 choose M8) (N8−M8 choose k8)
       × [1 + ((M1 + k1)λ1 + ... + (M8 + k8)λ8 + λ_CK) / α]^(−α)
       × C_{M1,M2,...,M8}
    (6.2)
where the λi terms represent the number of faults in each PE and λ_CK is the
number of faults in the "chip-kill" area. The "coverage factor" term in the above
expression, C_{M1,M2,...,M8}, equals 1 if the chip is acceptable with M1, M2, ...,
M8 fault-free modules of types 1 ... 8; otherwise, C_{M1,M2,...,M8} = 0. This
term provides the means of counting those terms in Equation 6.2 which are fixable.
We note that all combinations are fixable since the moduli are physically discrete and
the system is operable if there are at least M PEs surviving in each modulus.
The number of faults in each module, λi, is related to the number of faults in
the total chip, λ, by the following:

λi = (Ai / A_total) λ    (6.3)
For our proposed system, the corresponding parameters are:
Ni = 17
Ri = 1
Ai = A_PE-r
λ_CK-area: A_CK = A_IOpads + A_OutConv + A_CRT + A_InConv + 8 × 17 × A_PE-n
    (6.4)
and C_{M1,M2,...,M8} = 1 for Mi ∈ {16, 17}, since all of these combinations
are fixable. We note that the chip-kill area contains the area of all input
and output conversion circuits, the CRT and I/O pads as well as the area of the
reconfiguration elements in the PE.
6.1.1 Yield Estimates
We will consider the case of no redundancy first, and then compare this to the
reconfigured yield. For simplicity, assume average fatal defect densities of 1, 1.5 and
2.0 per cm2. Area values are based on a 0.8 µm CMOS technology (where λ = 0.4
µm). The areas of the input and output conversion modules are based on floorplans
consisting of cells developed and fabricated in the PE, and are λ-scalable to the
0.8 µm process. Thus, the area estimates presented are realistic, as they are based
on existing cells. Table 6.1 depicts the areas of the major system components. It
can be seen that the PE area dominates the bulk of the total system area (78.1%);
therefore, this is the most logical place to apply fault tolerance. The yields for no
redundancy are presented in Table 6.2, for various values of the clustering parameter
a. Equation 3.26 has been used to obtain the values in Table 6.2.
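The entries of Table 6.2 follow from the negative binomial (large-area clustering) yield expression of Equation 3.26; a minimal sketch (the helper name is ours, and the 1.393 cm2 chip area is taken from Table 6.1):

```python
def yield_nb(area_cm2, d0, alpha):
    """Negative binomial (large-area clustering) yield model,
    Equation 3.26: Y = (1 + A*D0/alpha)**(-alpha)."""
    return (1.0 + area_cm2 * d0 / alpha) ** (-alpha)

# Corner entries of Table 6.2 for the 1.393 cm2 non-redundant chip:
best = yield_nb(1.393, d0=1.0, alpha=0.25)   # Table 6.2 lists 0.6246
worst = yield_nb(1.393, d0=2.0, alpha=2.0)   # Table 6.2 lists 0.1746
```

Note that for fixed defect density, smaller α (heavier clustering) gives a higher chip yield, which is visible down each column of Table 6.2.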
Let us now consider the case for the redundant sixteen processor array. The
total area of each processing element is 0.008464 cm2, and the area of the switching
elements in the PE is 0.002208 cm2 (ApE-n). The difference between these two
areas (0.006256 cm2) represents the area of the PE which can tolerate a circuit fault
(ApE-r). We have considered the case for one added redundant PE since this is
the minimum amount of redundancy that can be added to the array, and since the
switching out of more than one defective PE adds to the critical path delay, as each
bypassed PE adds the delay of two series transmission gates plus interconnect delay
Table 6.1: Table of System Component Areas.

Areas of System Components
Module                          Area [λ x λ]   Area [cm2]   Number   Total Area [cm2]
Input Conversion (Projected):
  x_r + j x_i                   1500 x 1500    0.0036       8        0.0288
  y_r + j y_i                   1500 x 1500    0.0036       8        0.0288
  Log                           1150 x 750     0.0014       16       0.0224
Output Conversion (Projected):
  2^-1 (z + z*)                 1550 x 800     0.0020       4        0.0080
  (2j)^-1 (z − z*)              2700 x 850     0.0037       4        0.0148
  CRT                           6800 x 5600    0.0609       2        0.1218
PE (Actual):
  Normal                        2300 x 2300    0.0085       64       0.544
  Conjugate                     2300 x 2300    0.0085       64       0.544
I/O Drivers
  (Actual, Non-Scalable)        --             0.0004       200      0.08
Chip Total                                                           1.393
Table 6.2: Table of Non-Redundant Chip Yields.
Projected Non-Redundant Chip Yields
α          Average Fatal Defect
Densities per [cm2]
1.0 1.5 2.0
0.25 0.6246 0.5717 0.5357
0.50 0.5139 0.4394 0.3901
0.75 0.4550 0.3684 0.3125
1.00 0.4179 0.3237 0.2641
2.00 0.3474 0.2392 0.1746
(see Figure 4.2). As we will show shortly, just one extra processor will significantly
improve the system yield. Equation 6.2 thus becomes:
Y = Σ_{M1=16}^{17} Σ_{M2=16}^{17} ... Σ_{M8=16}^{17}  Σ_{k1=0}^{17−M1} Σ_{k2=0}^{17−M2} ... Σ_{k8=0}^{17−M8}
       (−1)^{k1} (−1)^{k2} ... (−1)^{k8} (17 choose M1) (17−M1 choose k1) ... (17 choose M8) (17−M8 choose k8)
       × [1 + ((M1 + k1)λ + ... + (M8 + k8)λ + λ_CK) / α]^(−α)
    (6.5)
Equation 6.5 was used in a computer program (see Appendix C) along with
the previous values of defect densities and a, to produce Table 6.3. We see that the
yield of the array changes from 62.46% to 72.53% in the best case (with Do = 1
per cm2 and a = 0.25), which is a 16.1% improvement. For the worst case (with
Do = 2 per cm2 and a = 2), the yield changes from 17.46% to 35.87%, which is a
105.4% improvement. The added area overhead associated with the redundancy is
approximately 5%. Thus, this is a worthwhile tradeoff in both cases, and in particular for the
worst case, in which the yield is doubled for a 5% increase in area. Equation 6.5
was extended to arrays of up to 32 processors. For the best case above, the yield of
non-redundant vs. redundant is 55.04% and 67.33% respectively, which is a 22.32%
increase. For the worst case, the results become 8.34% and 23.38%, a 180.3% increase.
The area overhead for one redundant processor for the 32 PE case is approximately 2.7%; again, a
worthwhile tradeoff. Any redundancy not used for yield enhancement at the time of
manufacture, can be used for fault tolerance in the field, if a failure occurs in a PE.
Since the PEs constitute 78% of the active area of our system, this will most often be
the type of failure that occurs.
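The values in Table 6.3 can be reproduced by evaluating Equation 6.5 directly. The sketch below does so under the identification λ = A·D0 per module (consistent with Equation 6.3); the function name and area bookkeeping are this sketch's, while the dissertation's own program appears in Appendix C:

```python
from itertools import product
from math import comb

def yield_redundant(lam_pe, lam_ck, alpha, n=17, m=16, types=8):
    """Evaluate Equation 6.5: yield of a chip with `types` module types,
    `n` modules of each type of which at least `m` must work, under
    large-area (negative binomial) fault clustering."""
    # Per-type inclusion-exclusion states: the effective module count
    # M + k and the signed weight (-1)**k * C(n, M) * C(n - M, k).
    states = [(M + k, (-1) ** k * comb(n, M) * comb(n - M, k))
              for M in range(m, n + 1)
              for k in range(n - M + 1)]
    total = 0.0
    for combo in product(states, repeat=types):
        weight = 1.0
        faults = lam_ck
        for mk, w in combo:
            weight *= w
            faults += mk * lam_pe
        total += weight * (1.0 + faults / alpha) ** (-alpha)
    return total

# Area terms (cm2) from Table 6.1 and the text of Section 6.1:
A_PE_R = 0.006256                 # repairable PE area
A_PE_N = 0.002208                 # PE switch/bypass area (chip-kill)
A_CK = (0.08 + 0.1218 + 0.0288 + 0.0288 + 0.0224
        + 0.0080 + 0.0148 + 8 * 17 * A_PE_N)
best = yield_redundant(A_PE_R * 1.0, A_CK * 1.0, alpha=0.25)
worst = yield_redundant(A_PE_R * 2.0, A_CK * 2.0, alpha=2.0)
```

With D0 = 1 per cm2 and α = 0.25 this evaluates to about 0.7253, and with D0 = 2 and α = 2 to about 0.3587, matching the corner entries of Table 6.3.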
Table 6.3: Table of Redundant Chip Yields.
Projected Redundant Chip Yields
α          Average Fatal Defect
Densities per [cm2]
1.0 1.5 2.0
0.25 0.7253 0.6706 0.6318
0.50 0.6588 0.5798 0.5236
0.75 0.6259 0.5323 0.4657
1.00 0.6060 0.5028 0.4290
2.00 0.5703 0.4472 0.3587
Yield curves for the best and worst case manufacturing parameters have been
plotted in Figure 6.1. The top two curves represent the redundant and non-redundant
best case, respectively. The bottom two curves represent the redundant and non-
redundant worst case, respectively.
At this point, some cautions are in order. We must be very careful when
interpreting the increase in yield of redundant vs. non-redundant chips. Ultimately,
all that really matters from a manufacturing standpoint is the number of good chips
that leave the foundry. By including extra circuitry for redundancy, we have at the
same time diminished the number of chips that are made. That is, the area of a
redundant chip is larger than that of a non-redundant one, which translates into
fewer chips per wafer. Since only good chips can be sold, we must be very careful to
guarantee that the added redundancy does not cause too much product loss. If we
are not careful, we could actually wind up losing money with redundant chips. In
the following equations, we will represent the yield of non-redundant chips by YNR,
and the yield of redundant chips by YR. The actual number of good chips for the
non-redundant case will be denoted by KNR and for the redundant case, KR. The
total number of chips fabricated for non-redundant and redundant cases is NNR and
NR, respectively. Finally, the percentage of product loss, due to redundancy, will be
denoted by PL. Consider the following:
Figure 6.1: Yield Curves for Various Length Arrays (Scheme 1).
Y_R = K_R / N_R
N_R = (1 − P_L) N_NR
K_R > K_NR ?
Y_R × N_R > Y_NR × N_NR ?
Y_R × (1 − P_L) > Y_NR ?
    (6.6)
Equation 6.6 must always be satisfied to determine if the chosen redundancy
scheme actually produced more working chips. We will assume that the percentage
increase in area for redundant chips, translates into the same percentage of product
loss. The degree of accuracy of this assumption ultimately depends on how many
rectangular chips can fit into a round wafer. For the most part, though, this is a good
assumption since a few hundred chips will typically fit into a commercially sized wafer
(6-8 inch diameter). For our 16 processor example, there is thus an approximate 5% product loss.
For the best and worst cases, we get
(0.95) × Y_R,best > Y_NR,best ?
(0.95) × (0.7253) > 0.6246
0.6890 > 0.6246

(0.95) × Y_R,worst > Y_NR,worst ?
(0.95) × (0.3587) > 0.1746
0.3407 > 0.1746
    (6.7)
Thus, we obtain more working chips in each case, which supports the chosen
redundancy scheme for the linear array.
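The chain of inequalities in Equation 6.6 reduces to a one-line check; a minimal sketch using the 16-PE numbers (the helper name is ours):

```python
def more_good_chips(y_red, y_nonred, product_loss):
    """Equation 6.6: redundancy produces more sellable chips only if the
    redundant yield, scaled by the fraction of chips that still fit on
    the wafer, exceeds the non-redundant yield."""
    return y_red * (1.0 - product_loss) > y_nonred

# The 16-PE array with ~5% area overhead (Equation 6.7):
best_ok = more_good_chips(0.7253, 0.6246, 0.05)    # 0.6890 > 0.6246
worst_ok = more_good_chips(0.3587, 0.1746, 0.05)   # 0.3407 > 0.1746
```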
6.2 A Comparison with Replacing Moduli
An alternative redundancy scheme will be considered in this section. Sup-
pose that fault tolerance was introduced into the system by including an additional
modulus in the array, rather than an extra PE per modulus. This would have the
effect of partitioning the system differently with respect to fault tolerance. The new
"modules" would consist of a set of both normal and conjugate channels (for each
modulus), together with the input and output conversion modules. This is depicted
in Figure 4.1 by the dotted lines around each modulus. The CRT would now have to
be reconfigurable, as the dynamic range of the system would change when a defective
module is switched out and a spare switched in. If the system were configured in
this way, then faults could be tolerated in the input and output conversion mod-
ules, as well as in the PEs. The only "chip-kill" area is that composed of the CRT
(which is now larger) and the I/O pads. The system would now consist of a single
"module-type", of which there are four total for the non-redundant case and five for
the redundant. There are now five fixable combinations, compared to eight as before.
Since there is only a single module type, Equation 6.2 reduces to:
Y = Σ_{M1=N1−R1}^{N1} Σ_{k1=0}^{N1−M1} (−1)^{k1} (N1 choose M1) (N1−M1 choose k1)
       × [1 + ((M1 + k1)λ1 + λ_CK) / α]^(−α) × C_{M1}
    (6.8)
For this scheme, the corresponding parameters are:

N1 = 5
R1 = 1
A1 = 2 × (A_PE-r + A_PE-n/2) × 16 + 4 × (A_InConv1 + A_Log1) + A_OutConv1
A_CK = A_IOpads + A_CRTNew
    (6.9)
with C_M1 = 1 for M1 ∈ {4, 5}. We have divided the area of the PE intercon-
nections by two since we do not bypass PEs in this scheme, and thus only need half
of the wires. A_InConv1 is the area of one forward QRNS mapping module and A_Log1
is the area of a single log table. There are four of each of these per reconfigurable
module. There is one inverse QRNS block per reconfigurable module, the area of
which is denoted by A_OutConv1. We will assume the I/O overhead is the same as
before. Finally, we need to obtain a value for the area of the reconfigurable CRT.
Approximately 2/3 of the CRT is composed of ROMs. Since the new CRT will have
to be re-programmable (i.e., to change the overall M), this section will need to be
RAM based. RAMs occupy about four to six times the area of ROMs (as stated
earlier). We will use a value of four as a multiplication factor to estimate the area
of a RAM based implementation. The carry-select mod(M) adders used in the CRT
will also have to be reconfigurable. We will not be able to use the bit-programming
scheme described in Chapter Four, and thus will increase the complexity by a factor
of 25%. The new variable offsets will have to be stored in registers also. We will
again be very generous and assume that the total complexity of the modulo adder
portion of the CRT increases by 50%. The overall area multiplication factor for
the reconfigurable CRT is then 3.166. If we substitute the redundancy parameters defined
above in Equation 6.8 the following expression results:
Y = Σ_{M1=4}^{5} Σ_{k1=0}^{5−M1} (−1)^{k1} (5 choose M1) (5−M1 choose k1)
       × [1 + ((M1 + k1)λ1 + λ_CK) / α]^(−α) × C_{M1}
    (6.10)
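Equation 6.10 is just Equation 6.8 specialized to a single module type with one spare; a sketch of its evaluation (function name and sanity-check parameters are ours, not the dissertation's):

```python
from math import comb

def yield_single_type(lam_mod, lam_ck, alpha, n=5, m=4):
    """Equation 6.10 (Equation 6.8 with the scheme-2 parameters): yield
    with a single module type, n modules of which at least m must
    function, under negative binomial fault clustering."""
    total = 0.0
    for M in range(m, n + 1):
        for k in range(n - M + 1):
            w = (-1) ** k * comb(n, M) * comb(n - M, k)
            total += w * (1.0 + ((M + k) * lam_mod + lam_ck) / alpha) ** (-alpha)
    return total
```

A useful sanity check on the reconstruction: with m = n (no spares) the double sum collapses to the single all-good term, i.e., Equation 3.26 applied to the whole chip area.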
The area values for the new system were substituted in Equation 6.10, which
was evaluated with a computer program (see Appendix C). For our sixteen processor
example, the best case non-redundant vs. redundant yield is 60.80% vs. 69.27%. For
the worst case manufacturing parameters, the yields are 15.04% vs. 26.18%. How do
these values now fare with product loss accounted for? The overhead associated with
this redundancy scheme is approximately 41%. Thus, after scaling the redundant
yields by unity minus this amount, we get (1 − 0.41) × 69.27 = 40.87 < 60.80 and, for
the worst case, (1 − 0.41) × 26.18 = 15.44 > 15.04. For the best case of manufacturing
parameters we have actually lost money by including redundancy. That is we would
have obtained more working chips if we had done nothing! For the worst case, the
gains are questionable since the margins are so low. The yield curves for best and
worst cases are plotted in Figure 6.2 for 16 to 32 processor systems. The areas of
chips using the first and second redundancy schemes are plotted in Figure 6.3. We
point out that the areas of chips using the first scheme are less than those using the
second, thus the first scheme always wins. Finally the product loss adjusted yields
are plotted for both schemes for best and worst cases. Figure 6.4 and Figure 6.5
are plots for the best and worst cases, respectively, for the first scheme. These curves
illustrate that we always benefit for this scheme, for these manufacturing parameters.
Figures 6.6 and 6.7 show the curves for the second scheme. Figure 6.6 shows
that we always lose money for this case, while Figure 6.7 shows that as the array size
grows, we gain more for the implemented redundancy. This is because the percent-
area-overhead associated with the redundancy decreases as the array grows (in both
schemes).
The results of this analysis suggest that it is better to implement redundancy
within the moduli, rather than between the moduli. Thus, the redundancy scheme
used in the proposed system is justifiable. In reconfigurable systems, the amount of
redundancy must be kept as small as possible to obtain the most benefit. In much of
the RNS literature, the point is commonly made that fault tolerance can be achieved
by including extra moduli. However, in the context of this analysis, this point is
questionable. Finally, it should be pointed out that if the moduli are very small,
such that the inclusion of extra channels will not significantly increase the chip area,
then the analysis may be more favorable.
6.3 Detecting Faults
The detection of failed processors in the chosen scheme must be done off-line.
The test procedure combines the bypassing of failed PEs, to isolate the columns in
which failed PEs lie, with the supplying of specific test vectors to determine which row
of the column has failed. For example, all columns in the array can be bypassed except