
Efficient algorithms for VLSI CAD


Material Information

Title:
Efficient algorithms for VLSI CAD
Physical Description:
ix, 90 leaves : ill. ; 29 cm.
Language:
English
Creator:
Cheng, Yu Cheuk, 1971-
Publication Date:

Subjects

Subjects / Keywords:
Integrated circuits -- Very large scale integration -- Design and construction -- Data processing   ( lcsh )
Computer-aided design   ( lcsh )
Computer and Information Science and Engineering thesis, Ph.D   ( lcsh )
Dissertations, Academic -- Computer and Information Science and Engineering -- UF   ( lcsh )
Genre:
bibliography   ( marcgt )
non-fiction   ( marcgt )

Notes

Thesis:
Thesis (Ph.D.)--University of Florida, 1998.
Bibliography:
Includes bibliographical references (leaves 87-89).
Statement of Responsibility:
by Yu Cheuk Cheng.
General Note:
Typescript.
General Note:
Vita.

Record Information

Source Institution:
University of Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
aleph - 030006090
oclc - 41448289
System ID:
AA00018779:00001


Table of Contents
    Front Cover
    Acknowledgement
    Table of Contents
    List of Tables
    List of Figures
    Abstract
    1. Introduction
    2. Transistor folding
    3. Performance driven module implementation selection
    4. Gate resizing to reduce power consumption
    5. Conclusions and future work
    References
    Biographical sketch

Full Text









EFFICIENT ALGORITHMS FOR VLSI CAD


By

YU CHEUK CHENG














A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA


1998













ACKNOWLEDGMENTS


I would like to express my appreciation and gratitude to my advisor, Professor Sartaj Sahni, for his support and guidance of my research. I thank him for spending much of his valuable time with me to give me research ideas. Without his patience and encouragement, this research would not have been done.

I would also like to thank the other members of my supervisory committee, Dr. Tim Davis, Dr. Richard Newman, Dr. Sanguthevar Rajasekaran and Dr. Andrew Vince, for their interest and comments. I would like to express my appreciation to Dr. Steve Thebaut for his support and encouragement throughout my study.

Many thanks go to my friends here at the university as well as those in Hong Kong and around the world. Special thanks go to my friend Desmond Kwan, who gave me constant support and encouragement through the final year of my Ph.D. study.

I would like to express my endless thanks to my parents and my brother Eddie for their love, support and energy throughout my lifelong educational endeavors. Special thanks go to my fiancee Isabella Fung for her love, encouragement and patience through all these years of study. To them I dedicate this work.














TABLE OF CONTENTS

ACKNOWLEDGMENTS ............................................ ii
LIST OF TABLES ............................................. v
LIST OF FIGURES ............................................ vi
ABSTRACT ................................................... viii

1 INTRODUCTION ............................................. 1
  1.1 Background ............................................ 1
  1.2 Physical Design Automation ............................ 2
  1.3 Dissertation Outline .................................. 3

2 TRANSISTOR FOLDING ....................................... 5
  2.1 Introduction .......................................... 5
  2.2 Problem Formulation ................................... 6
  2.3 Our Algorithm ......................................... 9
      2.3.1 Phase I ......................................... 9
      2.3.2 Phase II ........................................ 11
  2.4 Experimental Results .................................. 13
  2.5 Conclusion ............................................ 15

3 PERFORMANCE DRIVEN MODULE IMPLEMENTATION SELECTION ....... 18
  3.1 Introduction .......................................... 18
  3.2 O(p log n) Algorithm of Her et al. .................... 22
  3.3 Our O(p log n) Algorithm .............................. 24
      3.3.1 Stage 1 ......................................... 24
      3.3.2 Stage 2 ......................................... 30
      3.3.3 Implementation Details .......................... 31
      3.3.4 Time Complexity ................................. 32
  3.4 Multichannel 2-PDMIS Problem .......................... 36
  3.5 Experimental Results .................................. 38
  3.6 Conclusion ............................................ 40

4 GATE RESIZING TO REDUCE POWER CONSUMPTION ................ 41
  4.1 Introduction .......................................... 41
  4.2 Series-Parallel Circuits .............................. 45
      4.2.1 Definition ...................................... 45
      4.2.2 Complete Library Gate Resizing (CGR) ............ 47
      4.2.3 Complete Library with Upper Bounds (CUGR) and
            Convex c()s ..................................... 49
      4.2.4 Time Complexity of Convex GR Problem ............ 54
  4.3 Tree Circuits ......................................... 62
  4.4 CGR with Multigate Modules Is NP-Hard ................. 63
  4.5 General Circuits ...................................... 68
      4.5.1 The CGR Algorithm of Chen and Sarrafzadeh ....... 68
      4.5.2 Comments on the Algorithm of Chen and Sarrafzadeh 71
      4.5.3 A Unified Framework for CGR, CUGR and ConvexCGR . 72
  4.6 The General Gate Resizing Problem (GGR) ............... 77
  4.7 Experimental Results .................................. 78
  4.8 Conclusion ............................................ 81

5 CONCLUSIONS AND FUTURE WORK .............................. 85

REFERENCES ................................................. 87

BIOGRAPHICAL SKETCH ........................................ 90













LIST OF TABLES

2.1 Run time and speedup using a uniform distribution ................... 16
2.2 Run time and speedup using a uniform distribution with larger limits . 16
2.3 Run time and speedup using a normal distribution .................... 17
3.1 Running time for benchmark channels ................................. 39
3.2 Running time for generated channels ................................. 39
4.1 Run time and speedup when required time is equal to critical path length 80
4.2 Run time and speedup when required time is doubled .................. 80
4.3 Run time for squar5 with different required time .................... 80
4.4 Power reduction of GGR algorithms (1) ............................... 81
4.5 Power reduction of GGR algorithms (2) ............................... 82
4.6 Power reduction of GGR algorithms (3) ............................... 82
4.7 Power reduction of GGR algorithms (4) ............................... 83
4.8 Power reduction of GGR algorithms (5) ............................... 83
4.9 Power reduction of GGR algorithms (6) ............................... 84













LIST OF FIGURES

2.1 An example circuit with 4 pairs of transistors ....................... 7
2.2 The Circuit of Figure 2.1 after folding with hp = 4 and hn = 3 ....... 7
2.3 Computing SP and SN ................................................. 10
2.4 Compute optimal hp and hn ........................................... 12
2.5 Refined Phase 2 algorithm ........................................... 14
3.1 An example PDMIS problem. (a) first implementation; (b) second
    implementation; (c) selections that satisfy the net span constraints;
    (d) selection with better density ................................... 21
3.2 Critical modules of net i ........................................... 26
3.3 Function Assign ..................................................... 27
3.4 Partition a routing channel into regions ............................ 29
3.5 Function Satisfy .................................................... 33
3.6 Function Search ..................................................... 34
3.7 Procedure Undo ...................................................... 34
3.8 The two regions to be searched recursively after the binary search ... 38
4.1 Digraph corresponding to a circuit .................................. 45
4.2 Circuit Examples (Source: Li et al. [20]). (a) Chain; (b) Simple
    Parallel Circuit; (c) Series-Parallel Circuit; (d) Non-Series-Parallel
    Circuit ............................................................. 46
4.3 Transformation of Series-Parallel Circuits. (a) Chain; (b) Simple
    Parallel Circuit .................................................... 49
4.4 Transformation of a Series-Parallel Circuit into a single gate ....... 50
4.5 Computation of new delay for each gate .............................. 51
4.6 Convex delay-power-consumption graph ................................ 52
4.7 Algorithm Parallel-Merge ............................................ 55
4.8 Worst-case merging of n gates ....................................... 56
4.9 BBST used to represent DP list {(3, 28), (5, 24), (3, 21)} ........... 57
4.10 Update of D()s and C()s for internal nodes during tree rotations .... 59
4.11 Change of c values of internal nodes of L2 for the kth tuple of L1 .. 60
4.12 Circuit C2 that exerts the worst-case behavior ...................... 62
4.13 Transformation of a basic tree to a simple parallel circuit ......... 63
4.14 A module v with two gates A and B .................................. 63
4.15 Variable subassembly for variable xi ............................... 64
4.16 Clause subassembly for (l1 v l2 v l3) .............................. 66
4.17 Application of the algorithm of Chen and Sarrafzadeh [3] on a CGR
     circuit. (a) An example CGR circuit; (b) sensitive graph ............ 69
4.18 A simple example CGR circuit ....................................... 71
4.19 Application of the algorithm of Chen and Sarrafzadeh [3] on a CUGR
     circuit. (a) An example CUGR circuit; (b) delay of each gate after the
     first iteration; (c) delay of each gate after algorithm [3] terminates;
     (d) delay of each gate for optimal power reduction .................. 73
4.20 A PERT network for the CGR circuit of Figure 4.17(a) ................ 75
4.21 Transformation of vertex v into a chain in a PERT network ........... 77














Abstract of Dissertation
Presented to the Graduate School of the University of Florida
in Partial Fulfillment of the Requirements for the
Degree of Doctor of Philosophy


EFFICIENT ALGORITHMS FOR VLSI CAD

By

Yu Cheuk Cheng

December 1998



Chairman: Dr. Sartaj Sahni
Major Department: Computer and Information Science and Engineering


In this dissertation, we develop efficient algorithms for three problems that arise in

very large scale integration computer-aided design (VLSI CAD): (1) transistor folding,

(2) module implementation selection, and (3) gate resizing.

Transistor folding reduces the area of row-based designs that employ transistors of

different size. Kim and Kang have developed an O(m^2 log m) algorithm to optimally fold m transistor pairs. In this dissertation we develop an O(m^2) algorithm for

optimal transistor folding. Our experiments indicate that our algorithm runs 3 to 50

times as fast for m values in the range [100, 100000].

We develop an O(p log n) time algorithm to obtain optimal solutions to the p

pin n net single channel performance-driven implementation selection problem in












which each module has at most two possible implementations (2-PDMIS). Although

Her, Wang and Wong have also developed an O(p log n) algorithm for this problem,

experiments indicate that our algorithm is twice as fast on small circuits and up to

eleven times as fast on larger circuits. We also develop an O(pn^(c-1)) time algorithm for the c, c > 1, channel version of the 2-PDMIS problem.

We study the problem of resizing gates to reduce overall power consumption while

satisfying a circuit's timing constraints. Polynomial time algorithms for series-parallel

and tree circuits are obtained. Gate resizing with multigate modules is shown to be

NP-hard. Algorithms that improve upon those developed by Chen and Sarrafzadeh

for general circuits are also developed.













CHAPTER 1
INTRODUCTION


1.1 Background

The design and fabrication of VLSI chips has been made possible by the automa-

tion of several steps in the design process. The VLSI design process transforms a

formal specification into a fully packaged chip. It consists of the following steps [30]:


1. System specification: In this step a high level representation of the system

is created. Performance, functionality, physical dimensions, choice of design

techniques and fabrication technology are considered in this step.


2. Functional design: The output of this step is a timing diagram which is obtained

by considering the behavioral aspects of the system.


3. Logic design: The logic design, in general, is represented by Boolean expres-

sions. The logic design that represents the functional design is obtained in

this step. The Boolean expressions are minimized to obtain the smallest logic

design. Correctness of the logic design is also asserted in this step.


4. Circuit design: A circuit which represents the logic design of the system is de-

veloped in this step by taking into consideration speed and power requirements,

and the electrical behavior of the components used in the development of the

circuit.












5. Physical design: This is the most time consuming step in the VLSI design

cycle. In this step, the components and the interconnections are represented by

geometric patterns. The objective of this step is to obtain an arrangement of

these geometric patterns which minimizes the area and power and satisfies the

timing requirements of the chip. Due to its high complexity this step is broken

down into smaller sub-steps. We will look into this step in detail later in this

chapter.


6. Design verification: In this step design rule checking and circuit extraction are

done to verify that the circuit layout from the physical design step satisfies the

system specification and design rules.


7. Fabrication: The verified layout is used in the fabrication process to produce

the chip.


8. Packaging, testing and debugging: The fabricated chip is packaged and tested

to ensure proper functioning.

1.2 Physical Design Automation

Given a circuit description, the physical design process transforms it into a geometric description, called a layout, for fabrication. As the complexity of the physical design process is extremely large, Computer-Aided Design (CAD)

is used in almost all phases of the physical design process.

The physical design process is divided into 5 stages [27]:











1. Partitioning: A large circuit is decomposed into a collection of smaller blocks

of sub-circuits or modules, taking sizes and interconnections between blocks as

factors. Partitioning can be hierarchical if the given circuit is very large.


2. Floorplanning and placement: Logical components of each block are assigned

an approximate location in floorplanning. In placement, blocks are exactly

positioned on a chip so as to minimize the area of the chip and so that the

interconnections between blocks can be completed.


3. Routing: The interconnections between blocks are completed as specified. In

global routing, connections are completed between the proper blocks of the

circuit disregarding the exact geometric details of each wire and pin. In detailed

routing, each connection is assigned a precise geometric position.


4. Compaction: The layout is compressed in all directions to reduce the area.


5. Extraction and verification: The final layout is verified in terms of functionality

by circuit extraction. Other specific requirements, such as performance and

reliability, are also verified in the verification process.


1.3 Dissertation Outline

In this dissertation, we consider some of the problems that arise in the automation

of various stages of the VLSI design process. In Chapter 2, we consider folding

transistors to reduce the layout area of a row-based design. We develop an optimal











algorithm to fold transistors in a channel to minimize the layout area. Our algorithm

is both theoretically and practically faster than the algorithm proposed by Kim and Kang [18].

In Chapter 3, we consider a module implementation selection algorithm which

minimizes the density of a channel. We develop an optimal algorithm to select module

implementations along a channel to satisfy the net span constraints of each net and

minimize the density of the channel, where each module has at most two possible

implementations. The algorithm is experimentally compared to the one developed

by Her et al. [11]. We also develop a polynomial-time algorithm for the multichannel

version of the problem.

In Chapter 4, we consider resizing gates to reduce the power consumption. We

develop fast optimal algorithms to resize gates in series-parallel circuits and trees to

minimize the power consumption subject to the timing constraint. We also prove

that gate resizing with multigate modules is NP-hard. We develop fast algorithms

to perform gate resizing on general circuits. Experimental results comparing our algorithm with that of Chen and Sarrafzadeh [3] are also presented.

In Chapter 5, we present conclusions and some future directions for this research.













CHAPTER 2
TRANSISTOR FOLDING


2.1 Introduction

In high-performance circuit design, the transistor sizing problem was investigated

widely in the past (for example, [26, 7, 28, 4]). The objective of transistor sizing is

to reduce the circuit delay by increasing the area of transistors. One by-product of

transistor sizing is the generation of layouts of transistors of widely varying size. In

row-based layout synthesis ([17, 29, 32, 34]), we group pMOS and nMOS transistors

together and place them in rows. The layout area for these designs is wasted due to

nonuniform cell heights. The layout area required can be reduced by folding large

transistors so that their height is reduced. Transistor folding to optimize layout

area has been considered by Kim and Kang [18] and by Her and Wong [12]. Her and Wong [12] have developed an O(m^6) dynamic programming algorithm for the general transistor folding problem. (If only s heights are possible for the folded transistors, the complexity of Her and Wong's algorithm is O(m^3 s^3). In general, s is O(m).) Kim and Kang [18] have developed a more practical algorithm for the case of row-based designs. The complexity of their algorithm is O(m^2 log m) or O(s(m + s) log m). They also show that the area of row-based designs can be reduced by as much as 30% by performing transistor folding. In this chapter, we consider the row-based-design transistor-folding problem of reference [18] and develop an O(m^2) or O(s(m + s)) algorithm to minimize area. We also report on experiments, conducted by us, that show that our algorithm actually runs much faster than the algorithm of Kim and Kang [18]. The test circuits used in our experiments have between 100 and 100,000 transistor pairs, so our tests are similar to those conducted by Kim and Kang [18], where the circuits had from 192 to 88,258 transistor pairs.

2.2 Problem Formulation

We are given a CMOS circuit with a row of m transistor pairs. Each transistor

pair consists of a pMOS transistor and its dual nMOS transistor. Let pi and ni,

respectively, be the heights of the pMOS and nMOS transistors in the ith pair,

1 ≤ i ≤ m. pi and ni are integers that give transistor height in multiples of the minimum resolution λ. Figure 2.1 shows a CMOS circuit with 4 pairs of transistors, with p2 = 10 and n2 = 12. If the folding height of pMOS transistors is 4 and that of

nMOS transistors is 3, then the circuit layout is as in Figure 2.2. The second pMOS

transistor is divided into three columns of height 4, 4, and 2 respectively, and the

second nMOS transistor is divided into four columns of height 3 each. The area

occupied by the folded transistor pair is shown by a shaded box in Figure 2.2. In

practice, the height of the layout area is slightly larger than the sum of the pMOS

and nMOS folding heights, and the layout width is slightly larger than the number

of transistor columns because of overheads.
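As a concrete illustration of this folding (our own sketch, not code from the dissertation; the function name is ours), the column heights of a folded transistor can be computed as:

```python
import math

def fold(height, fold_height):
    """Split a transistor of the given height into columns of at most
    fold_height; the last column holds the remainder."""
    cols = math.ceil(height / fold_height)
    last = height - (cols - 1) * fold_height
    return [fold_height] * (cols - 1) + [last]

# The second transistor pair of Figure 2.1: p2 = 10, n2 = 12.
print(fold(10, 4))  # pMOS folds into three columns of height 4, 4, 2
print(fold(12, 3))  # nMOS folds into four columns of height 3 each
```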















Figure 2.1. An example circuit with 4 pairs of transistors








Figure 2.2. The Circuit of Figure 2.1 after folding with hp = 4 and hn = 3











Let hp and hn be the folded heights of the pMOS and nMOS transistors, respectively. The width of the folded layout is Σ_{i=1}^{m} max(⌈pi/hp⌉, ⌈ni/hn⌉) + ch and the height is hp + hn + cv, where cv and ch are, respectively, the vertical and horizontal overheads. The area of the folded layout is [18]

    A = (hp + hn + cv)(Σ_{i=1}^{m} max(⌈pi/hp⌉, ⌈ni/hn⌉) + ch)        (2.1)

In practice, there is a technological constraint on how small hp and hn can be. It is required [18] that hp ≥ PMIN and hn ≥ NMIN.

Kim and Kang [18] give two algorithms to determine hp and hn so that the layout area is minimized. The first algorithm is an exhaustive search algorithm that simply tries out all integer choices for hp and hn such that PMIN ≤ hp ≤ max(P) and NMIN ≤ hn ≤ max(N). (Here P = {p1, p2, ..., pm} and N = {n1, n2, ..., nm}.) The complexity of the exhaustive search algorithm is O(max(P)·max(N)·m) = O(m^3) because max(P) and max(N) are O(m) for practical circuits [18].
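The exhaustive search amounts to evaluating Equation 2.1 for every feasible height pair. A minimal sketch (our own illustrative code; the function names and the toy data are ours, and the overheads default to zero):

```python
import math

def area(P, N, hp, hn, cv=0, ch=0):
    # Equation 2.1: (folded height) * (folded width).
    width = sum(max(math.ceil(p / hp), math.ceil(n / hn))
                for p, n in zip(P, N)) + ch
    return (hp + hn + cv) * width

def exhaustive(P, N, PMIN, NMIN, cv=0, ch=0):
    # Try every integer pair (hp, hn); O(max(P) * max(N) * m) time.
    return min((area(P, N, hp, hn, cv, ch), hp, hn)
               for hp in range(PMIN, max(P) + 1)
               for hn in range(NMIN, max(N) + 1))

# A toy instance with m = 4 transistor pairs.
print(exhaustive([8, 10, 6, 9], [6, 12, 5, 7], PMIN=3, NMIN=3))
```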

The second algorithm [18] works in two phases. In the first phase, the algorithm constructs a subset SP of [PMIN, max(P)] and another subset SN of [NMIN, max(N)] with the property that the optimal hp is in SP and the optimal hn is in SN. The basic observation used to arrive at SP and SN is that if the heights hi and hi + k divide a transistor into the same number of columns, then hi is preferred











over hi + k (for example, if pi = 14, then folding heights 7, 8, 9, 10, 11, 12 and 13 all fold the transistor into two columns; 7 is preferred over the remaining choices). In the second phase the optimal combination (hp, hn) is determined from SP and SN. The complexity of the second phase is O(s(m + s) log m) = O(m^2 log m), where s = |SP| + |SN|, and that of the first phase is Θ(Σ_{i=1}^{m}(pi + ni)) = O(m^2) (assuming max(P) and max(N) are O(m)).

2.3 Our Algorithm

2.3.1 Phase I

Our algorithm is also a two-phase algorithm. The first phase of our algorithm is identical to the first phase of Kim and Kang's algorithm [18]. We compute the subsets SP and SN using the code of Figure 2.3. The arrays SPL and SNL are initialized to zero in the first two for loops. Then we determine the members of SP and SN; we set SPL[i] = 1 if and only if i ∈ SP and SNL[i] = 1 if and only if i ∈ SN. Finally, SP and SN are computed in compact form from SPL and SNL respectively. Note that we can compute SP and SN in either ascending or descending order easily by controlling the direction of traversal of the SPL and SNL arrays, respectively, in the last two for loops. The algorithm presented in Figure 2.3 computes SP in ascending order and SN in descending order.

















Algorithm Phase I (P, N, PMIN, NMIN)
/* Compute SP and SN */

/* Initialize SPL[] and SNL[] */
for i = PMIN to max(P) do
    SPL[i] <- 0;
for i = NMIN to max(N) do
    SNL[i] <- 0;

/* Set SPL[] and SNL[] */
for i = 1 to m do
    for j = 1 to pi do
        SPL[⌈pi/j⌉] <- 1;
    for j = 1 to ni do
        SNL[⌈ni/j⌉] <- 1;
end for

/* Collect items from SPL[] and SNL[] and store them into SP[] and SN[] */
SPsize <- 0; SNsize <- 0;
for i = PMIN to max(P) do
    if SPL[i] = 1 then
        SP[SPsize++] <- i;
for i = max(N) downto NMIN do
    if SNL[i] = 1 then
        SN[SNsize++] <- i;


Figure 2.3. Computing SP and SN
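In a higher-level language, this marking idea might be sketched as follows (our own rendering, set-based rather than the bit arrays of Figure 2.3; names are ours):

```python
import math

def phase1(P, N, PMIN, NMIN):
    """Candidate folding heights: for a transistor of height t, only the
    smallest height giving each column count j, namely ceil(t / j), matters."""
    def candidates(T, lo):
        marked = set()
        for t in T:
            for j in range(1, t + 1):
                marked.add(math.ceil(t / j))
        return [h for h in range(lo, max(T) + 1) if h in marked]
    SP = candidates(P, PMIN)        # ascending order
    SN = candidates(N, NMIN)[::-1]  # descending order
    return SP, SN

SP, SN = phase1([8, 10, 6, 9], [6, 12, 5, 7], PMIN=3, NMIN=3)
print(SP)  # ascending candidate pMOS heights (7 is absent: no transistor prefers it)
print(SN)  # descending candidate nMOS heights
```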












2.3.2 Phase II

Assume that the transistor pairs have been reordered so that p1/n1 ≤ p2/n2 ≤ ... ≤ pm/nm; also assume that p0/n0 = 0 and p(m+1)/n(m+1) = ∞. The formula, Equation 2.1, for the layout area can be rewritten as

    A = (hp + hn + cv)(Σ_{i=1}^{k-1} ⌈ni/hn⌉ + Σ_{i=k}^{m} ⌈pi/hp⌉ + ch)        (2.2)

where k ∈ [1, m + 1] is such that

    p(k-1)/n(k-1) < hp/hn ≤ pk/nk        (2.3)

Let LN(hn, k) = Σ_{i=1}^{k-1} ⌈ni/hn⌉ and LP(hp, k) = Σ_{i=k}^{m} ⌈pi/hp⌉. We can rewrite Equation 2.2 as

    A = (hp + hn + cv)(LN(hn, k) + LP(hp, k) + ch)        (2.4)

From Equation 2.3 and the ordering of the transistors by pi/ni, it follows that if hn is held fixed and hp increased, the value of k cannot decrease. This observation results in the algorithm of Figure 2.4.

Since SP and SN can be computed in ascending and descending order respectively by Algorithm Phase I of Figure 2.3, no sorting is needed to evaluate the members of SP and SN in the specified order. We can sort the transistors into increasing (actually nondecreasing) order of pi/ni in O(m log m) time; and the arrays LN and LP can be computed in Θ(m|SN|) and Θ(m|SP|) time respectively.

Algorithm Phase II (P, N, SP, SN, ch, cv)
/* SP is in ascending order and SN is in descending order */
Sort P and N in increasing P[i]/N[i] ratio;
Compute LN[][] and LP[][];
for each hn ∈ SN do
    k <- 1;
    for each hp ∈ SP do
        while P[k]/N[k] < hp/hn do
            k <- k + 1;
        A <- min(A, (hp + hn + cv)(LN[hn][k] + LP[hp][k] + ch));
    end for
end for


Figure 2.4. Compute optimal hp and hn

Each iteration of the outer for loop takes O(|SP| + m) time. Therefore, the time needed for all |SN| iterations is O(|SN|(|SP| + m)). We can change this complexity to O(min{|SN|, |SP|}(max{|SN|, |SP|} + m)) by interchanging the inner and outer for loop headers.

Further improvement in run time is possible. Consider the algorithm of Figure 2.4. Let ki be the k value that satisfies Equation 2.3 when we use the first (i.e., largest) hn value hn1 and the ith (i.e., ith smallest) hp value hpi. On the next iteration of the outer for loop, hn2 ≤ hn1, so hpi/hn2 ≥ hpi/hn1, and the k value that satisfies Equation 2.3 is at least ki. Hence if we save ki from the first iteration, we can start the search for the new k value at ki. This observation leads to the refinement shown in Figure 2.5.











Although its worst-case complexity is the same as that of Figure 2.4, it is expected to run faster in practice.

2.4 Experimental Results

The phase 1 algorithm of Figure 2.3, the phase 2 algorithm of Figure 2.5, and

the two algorithms of Kim and Kang [18] were implemented, by us, in C and run on

a SUN SPARCstation 4. Similar programming methodologies were used to develop

the codes for our algorithm and that of Kim and Kang [18]. As a result, we expect

that almost all of the performance difference exhibited in our experiments is due to

algorithmic rather than programming differences. Since we were unable to obtain

the test data used by Kim and Kang [18], we generated random data. We ignore

any possible correlation between pMOS and nMOS transistors. For our test data,

the number of transistor pairs ranged from 100 to 100,000. This covers the range

in transistor numbers (192 to 88,258) in the circuits of Kim and Kang [18]. For

our first test set, the sizes of the pMOS and nMOS transistors were generated using

a uniform random number generator with range [30,90] for pMOS and [20,60] for

nMOS. These size ranges correspond to those for the circuit fract that was used by Kim and Kang [18]; fract has 598 transistors. Since all three algorithms

generate optimal solutions, run time is the only comparative factor. This time is

provided in Table 2.1. The exhaustive search algorithm was not run for m > 10,000

as its run time becomes prohibitive. In the case of the algorithm proposed by Kim





















Algorithm Refined Phase II (P, N, SP, SN, ch, cv)
/* SP is in ascending order and SN is in descending order */
Sort P and N in increasing P[i]/N[i] ratio;
Compute LN[][] and LP[][];
Initialize K[hp] to 0 for all hp ∈ SP;
if |SN| < |SP| then
    for each hn ∈ SN do
        k <- 1;
        for each hp ∈ SP do
            k <- max(k, K[hp]);
            while P[k]/N[k] < hp/hn do
                k <- k + 1;
            K[hp] <- k;
            A <- min(A, (hp + hn + cv)(LN[hn][k] + LP[hp][k] + ch));
        end for
    end for
else
    /* same as "if", but interchange the inner and outer for loop headers, and
       replace K[hp] by K[hn] */
end if


Figure 2.5. Refined Phase 2 algorithm
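A compact executable sketch of this second phase, including the saved-k refinement, might look like the following (our own code and naming, not the dissertation's implementation; LN and LP are stored as prefix and suffix sums, indices are 0-based, and overheads default to zero):

```python
import math

def phase2(P, N, SP, SN, cv=0, ch=0):
    """SP: candidate pMOS heights; SN: candidate nMOS heights.
    Returns (best area, hp, hn) per Equations 2.2-2.4."""
    pairs = sorted(zip(P, N), key=lambda t: t[0] / t[1])  # order by pi/ni
    P = [p for p, _ in pairs]
    N = [n for _, n in pairs]
    m = len(P)
    # LN[hn][k] = sum_{i<k} ceil(ni/hn) (prefix sums);
    # LP[hp][k] = sum_{i>=k} ceil(pi/hp) (suffix sums).
    LN = {hn: [0] * (m + 1) for hn in SN}
    LP = {hp: [0] * (m + 1) for hp in SP}
    for hn in SN:
        for k in range(1, m + 1):
            LN[hn][k] = LN[hn][k - 1] + math.ceil(N[k - 1] / hn)
    for hp in SP:
        for k in range(m - 1, -1, -1):
            LP[hp][k] = LP[hp][k + 1] + math.ceil(P[k] / hp)
    best = None
    K = {hp: 0 for hp in SP}              # saved k values (Figure 2.5)
    for hn in sorted(SN, reverse=True):   # hn in descending order
        k = 0
        for hp in sorted(SP):             # hp in ascending order
            k = max(k, K[hp])
            # Advance k while pi/ni < hp/hn (integer cross-multiplication).
            while k < m and P[k] * hn < hp * N[k]:
                k += 1
            K[hp] = k
            a = (hp + hn + cv) * (LN[hn][k] + LP[hp][k] + ch)
            if best is None or a < best[0]:
                best = (a, hp, hn)
    return best

print(phase2([8, 10, 6, 9], [6, 12, 5, 7], SP=[3, 4, 5, 6, 8, 9, 10],
             SN=[12, 7, 6, 5, 4, 3]))
```

The cross-multiplied comparison avoids floating-point error at the boundary of Equation 2.3; the area value itself is exact integer arithmetic either way.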









and Kang [18], the phase 2 time is significantly larger than the phase 1 time. Our

algorithm for phase 2 has brought this time down to approximately the phase 1 time.

For small circuits (m < 10,000), our phase 2 algorithm is 6 to 10 times as fast as

the phase 2 algorithm of Kim and Kang [18] and provides an overall speedup of 3.5

to 5.8 for the entire area minimization process (phase 1 plus phase 2). On larger

circuits, the speedup is more dramatic. For instance, when m = 100,000 our phase

2 algorithm is almost 50 times as fast as that of Kim and Kang [18] and provides an

overall speedup of almost 28.

We experimented with two other data sets. Table 2.2 reports the run times for

circuits in which the range of the uniform random number generator was set to

[30, 180] for pMOS transistor sizes and [20, 120] for nMOS sizes, and Table 2.3 gives

the run times when the transistor sizes are from a normal distribution with mean 40

and standard deviation 10 for pMOS transistors and mean 30 and standard deviation

10 for nMOS transistors. The overall speedups range from a low of 3.95 to a high of

48.02.

2.5 Conclusion

We have developed a transistor folding algorithm that is both theoretically and

practically faster than the algorithm proposed by Kim and Kang [18]. Our algorithm

is also simpler to code. Experiments suggest that our algorithm runs 3 to 50 times as














Table 2.1. Run time and speedup using a uniform distribution

                                 Phase 2 time           Speedup
      m    Exhaustive  Phase 1  Kim & Kang    Ours   Phase 2  Overall
     100        1.46     0.03        0.30     0.03     10.00     5.55
     300        4.41     0.08        0.60     0.09      6.67     4.00
     500        7.34     0.14        0.89     0.14      6.36     3.62
     600        8.79     0.16        1.05     0.17      6.18     3.63
    1000       14.67     0.28        1.69     0.27      6.26     3.56
    5000       74.59     1.38       11.43     1.45      7.88     4.53
   10000      149.12     2.75       30.71     3.01     10.20     5.81
   50000           -    13.64      458.24    17.51     26.17    15.15
  100000           -    27.24     1716.02    35.29     48.63    27.88

Time in seconds.


Table 2.2. Run time and speedup using a uniform distribution with larger limits

          m   Exhaustive   Phase 1   Kim & Kang Phase 2   Our Phase 2   Phase 2 Speedup   Overall Speedup
        100         6.35      0.05                 0.97          0.06             16.17              9.67
        300        19.87      0.16                 2.18          0.20             10.90              6.58
        500        33.15      0.27                 2.94          0.33              8.91              5.39
        600        39.77      0.32                 3.38          0.40              8.45              5.16
       1000        66.31      0.53                 4.73          0.65              7.28              4.47
       5000       336.92      2.60                21.82          3.31              6.59              4.13
      10000       673.43      5.23                49.25          6.89              7.15              4.50
      50000          -       26.09               485.10         38.87             12.48              7.87
     100000          -       52.12              3710.35         85.40             43.45             27.36
Time in seconds












Table 2.3. Run time and speedup using a normal distribution

          m   Exhaustive   Phase 1   Kim & Kang Phase 2   Our Phase 2   Phase 2 Speedup   Overall Speedup
        100         0.90      0.02                 0.20          0.02             10.00              5.30
        300         3.22      0.07                 0.47          0.06              7.83              4.24
        500         5.79      0.10                 0.75          0.11              6.82              3.98
        600         6.70      0.12                 0.88          0.13              6.77              3.95
       1000        12.69      0.20                 1.48          0.23              6.43              3.96
       5000        68.26      1.03                12.92          1.38              9.36              5.79
      10000       129.67      1.99                36.91          2.80             13.18              8.12
      50000          -       10.05               679.50         18.05             37.65             24.54
     100000          -       20.04              2676.08         36.10             74.13             48.02
Time in seconds


fast as the algorithm of Kim and Kang [18] on circuits with 100 to 100,000 transistor


pairs. These circuit sizes are comparable to those used in their experiments.













CHAPTER 3
PERFORMANCE DRIVEN MODULE IMPLEMENTATION SELECTION


3.1 Introduction

In the channel routing problem, we have a routing channel with modules on the

top and bottom of the channel, the modules have pins, and subsets of pins define

nets. The objective is to route the nets while minimizing channel height. Several

algorithms have been proposed for channel routing [35].

When the modules on either side of the channel are programmable logic arrays,

we have the flexibility of reordering the pins in each module; any pin permutation

may be used. The ability to reorder module pins adds a new dimension to the

routing problem. Channel routing with rearrangeable pins was studied by Kobayashi

and Drozd [19]. They proposed a three step algorithm: (1) permute pins so as to

maximize the number of aligned pin pairs (a pair of pins on different sides of the

channel is aligned iff they occupy the same horizontal location and they are pins of

the same net), (2) permute the nonaligned pins so as to remove cyclic constraints,

and (3) while maintaining an acyclic vertical constraint graph, permute unaligned

pins so as to minimize channel density. Lin and Sahni [21] developed a linear time

algorithm for step (1), and Sahni and Wu [25] showed that steps (2) and (3) are

NP-hard. Tragoudas and Tollis [31] present a linear time algorithm to determine

whether there is a pin permutation for which a channel is river routable. They also










showed that the problem of determining a pin permutation that results in minimum

density is NP-hard in general, and they developed polynomial time algorithms for

the special case of channels with two terminal nets and channels with at most one

terminal of each net being in each module.

Variants of the channel routing with permutable pins problem have also been

studied [14, 2, 16, 13]. In these variants restrictions are placed on the allowable

pin permutations for each module. Restrictions may arise, for example, because the

module library contains only a limited set of implementations of each module [14].

Another variant, considered by Cai and Wong [2], permits the shifting of modules and

pins to minimize channel density. Extensions to the case when over-the-cell routing

is permitted have also been considered [16, 13].

The variant of the channel routing with permutable pins problem that we consider

in this paper is the performance-driven module implementation selection (PDMIS)

problem formulated by Her et al. [11]. In the k-PDMIS problem, we are given two

rows of modules with a routing channel in between, up to k possible implementations

for each module (different implementations of a module differ only in the location of

pins, the module size and pin count are the same) and a set of net span constraints

(the span of a net is the distance between its leftmost and rightmost pins). A feasible

solution to a k-PDMIS instance is a selection of module implementations so that all

net span constraints are satisfied. An optimal solution is a feasible solution with

minimum channel density.
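The objective in this definition can be made concrete with a small sketch. Channel density is the maximum, over all columns, of the number of nets whose span covers that column; the sweep below computes it, assuming (hypothetically) that each net has already been reduced to a (leftmost, rightmost) column pair:

```python
def channel_density(spans):
    """Maximum number of nets whose horizontal span covers any one
    column.  `spans` is a list of (left, right) column pairs, one per
    net -- an assumed input format; in the problem statement the span
    is derived from the net's leftmost and rightmost pins."""
    events = []
    for left, right in spans:
        events.append((left, +1))       # net starts covering columns here
        events.append((right + 1, -1))  # net stops covering after `right`
    events.sort()
    density = best = 0
    for _, delta in events:
        density += delta
        best = max(best, density)
    return best
```

For example, three nets spanning columns (1, 5), (3, 4) and (2, 6) all overlap at column 3, so the density is 3.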











Figure 3.1(a) shows a routing channel with two modules on either side of the

routing channel. Assume that each module has two implementations and that the

pin locations for the second implementation of each module are as in Figure 3.1(b).

The net span constraints of the five nets are 4, 4, 1, 1 and 6, respectively. This defines

an instance of the 2-PDMIS problem. Using the implementations of Figure 3.1(a), the

net spans are 5, 3, 1, 1 and 6, respectively. The span constraint of net 1 is violated.

If each module is implemented as in Figure 3.1(b), the net spans are 1, 5, 1, 1 and

4, respectively. This time, the span constraint of net 2 is violated. If we implement

the modules as in Figure 3.1(c) (i.e., for modules 1 and 2 use the implementations

of Figure 3.1(a) and for modules 3 and 4, use the implementations of Figure 3.1(b)),

the net spans are 4, 4, 1, 1 and 6, respectively. Now, the net span constraints are

satisfied for all nets. The channel density, when module implementations are selected

as in Figure 3.1(c), is 5. Selecting module implementations as in Figure 3.1(d), we

obtain a feasible solution whose density is 3.

Her et al. [11] show that the k-PDMIS problem is NP-hard for every k ≥ 3.

For the 2-PDMIS problem, they develop an O(plogn) algorithm to find an optimal

solution. In this paper, we develop an alternative O(plogn) algorithm to find an

optimal solution to the 2-PDMIS problem. Experiments indicate that our algorithm

is twice as fast on small circuits and up to eleven times as fast on larger circuits.

We begin, in Section 3.2, by providing an overview of the O(p log n) algorithm [11].

Then, in Section 3.3, we describe our O(p log n) algorithm. In Section 3.4, we develop



















Figure 3.1. An example PDMIS problem. (a) first implementation; (b) second im-
plementation; (c) selections that satisfy the net span constraints; (d) selection with
better density











an O(pn^(c-1)) algorithm for the c, c > 1, channel 2-PDMIS problem. Experimental

results using the single channel 2-PDMIS algorithm are presented in Section 3.5.

3.2 O(p log n) Algorithm of Her et al.

Her et al. [11] show how to transform an instance P of 2-PDMIS with net span

constraints and a constraint, d, on channel density into an instance S of the 2-SAT

problem (each instance of the 2-SAT problem is a conjunctive normal form formula

in which each clause has at most two literals). The 2-SAT instance S is satisfiable iff

the corresponding 2-PDMIS instance has a feasible solution with channel density ≤ d.

The size of the constructed 2-SAT formula S is O(p), where p is the total number of

pins in the modules of P. Since the channel density of the optimal solution is in the

range [1, n], where n is the total number of nets, a binary search over d can be used

to obtain an optimal solution in O(p log n) time.
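The outer loop of both algorithms is this same monotone binary search. A minimal sketch, with a hypothetical `feasible(d)` predicate standing in for the O(p) test (2-SAT satisfiability in Her et al., or the forcing-list test of Section 3.3):

```python
def min_feasible_density(n, feasible):
    """Binary search for the smallest density bound d in [1, n] for
    which `feasible(d)` holds.  `feasible` is a hypothetical stand-in
    for the O(p) feasibility test and must be monotone: once feasible,
    feasible for every larger d."""
    lo, hi = 1, n
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if feasible(mid):
            best, hi = mid, mid - 1      # record d and try a smaller bound
        else:
            lo = mid + 1
    return best                          # None: even d = n is infeasible
```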

Her et al. [11] use one boolean variable to represent each module. The

interpretation is: variable x_i is true iff implementation 1 of module i is selected. The steps

in the 2-PDMIS algorithm [11] are


1. Construct the 2-SAT formula C_span such that C_span is satisfiable iff the given

2-PDMIS instance has a feasible solution. This is done by constructing a 2-SAT

formula for each net and then taking the conjunction of these instances. For

each net j, the leftmost and rightmost modules on the top row and bottom

row are identified. These (at most four) modules are the critical modules for











net j as the span of net j is determined solely by these modules. A 2-SAT

formula involving the boolean variables that represent these critical modules is

constructed. This 2-SAT formula has the property that truth value assignments

satisfy the 2-SAT formula iff the corresponding module implementations cause

the net span constraint for net j to be satisfied.


2. Construct a 2-SAT formula C_den using a density constraint d. C_den is satisfiable

only by module implementation selections which result in a channel density

that is ≤ d. To construct C_den, partition the channel into a minimum number

of regions such that no region contains a module boundary in its interior; for

each region, construct a 2-SAT formula so that the density in the region is

≤ d whenever the 2-SAT formula is true (this 2-SAT formula involves only the

module in the top row of the region and the one in the bottom row); take the

conjunction of the region 2-SAT formulae.


3. Determine if the 2-SAT formula C_span ∧ C_den is satisfiable by using the strongly

connected components method described in Papadimitriou and Steiglitz [22].

This requires that we first construct a directed graph from C_span ∧ C_den.


4. Repeat steps 2 and 3 performing a binary search for the minimum value of d

for which C_span ∧ C_den is satisfiable.


As shown in Her et al. [11], the size of C_span ∧ C_den is O(p); step 3 takes O(p)

time; and the overall complexity is O(p log n).
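The strongly connected components test of step 3 is the standard implication-graph method: each clause (a ∨ b) contributes edges ¬a → b and ¬b → a, and the formula is satisfiable iff no variable shares a component with its negation. A generic 2-SAT checker along those lines (a sketch, not the code of Her et al.):

```python
def two_sat(num_vars, clauses):
    """Generic 2-SAT check via the implication graph and strongly
    connected components (Papadimitriou & Steiglitz).  Literals are
    +v / -v for v in 1..num_vars; each clause (a or b) adds edges
    not-a -> b and not-b -> a.  Satisfiable iff no variable is in the
    same SCC as its negation."""
    N = 2 * (num_vars + 1)
    node = lambda lit: 2 * abs(lit) + (1 if lit < 0 else 0)
    graph = [[] for _ in range(N)]
    rev = [[] for _ in range(N)]
    for a, b in clauses:
        for u, v in ((node(-a), node(b)), (node(-b), node(a))):
            graph[u].append(v)
            rev[v].append(u)
    # Kosaraju pass 1: record finish order with an iterative DFS.
    seen, order = [False] * N, []
    for s in range(2, N):
        if seen[s]:
            continue
        seen[s] = True
        stack = [(s, iter(graph[s]))]
        while stack:
            u, it = stack[-1]
            for v in it:
                if not seen[v]:
                    seen[v] = True
                    stack.append((v, iter(graph[v])))
                    break
            else:
                order.append(u)
                stack.pop()
    # Kosaraju pass 2: label components on the reverse graph.
    comp, label, seen = [0] * N, 0, [False] * N
    for s in reversed(order):
        if seen[s]:
            continue
        label += 1
        seen[s] = True
        stack = [s]
        while stack:
            u = stack.pop()
            comp[u] = label
            for v in rev[u]:
                if not seen[v]:
                    seen[v] = True
                    stack.append(v)
    return all(comp[node(v)] != comp[node(-v)] for v in range(1, num_vars + 1))
```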











3.3 Our O(p log n) Algorithm

Our algorithm is a two stage algorithm that does not construct a 2-SAT formula.

In the first stage, we construct a set of 2m "forcing lists", where m is the number of

modules. L[i] is a list of module implementation selections that get forced if the first

implementation of module i, 1 ≤ i ≤ m, is selected; L[m + i] is the corresponding

list for module i when the second implementation of module i is selected. By forced,

we mean that unless the module implementations on L[i] (L[m + i]) are selected

whenever the first (second) implementation of module i is selected, we cannot have a

feasible solution that also satisfies the given density constraint. In the second stage,

we use the limited branching method [6] and the forcing lists constructed in stage 1

to obtain a module implementation selection that satisfies the net span and density

constraints (provided such a selection is possible). To find an optimal solution, we

use binary search to determine the smallest density constraint for which a feasible

solution exists.

3.3.1 Stage 1

In stage 1, we construct the forcing lists L[1..2m]. If the selection of implementa-

tion 1 of module i requires that we select implementation 1 of module j, we place j on

the list L[i]; if the selection of implementation 1 of module i requires that we select

implementation 2 of module j, we place m + j on L[i]. Similarly when the selection

of implementation 2 of module i requires a particular implementation be selected for











module j, we place either j or m + j on L[m + i]. To assist in the construction of

the forcing lists, we use another array C[1..m] with C[i] = 0 if no implementation of

module i has been selected so far; C[i] = 1 if the first implementation of module i

has been selected; and C[i] = 2 if the second implementation has been selected.

First, we construct the forcing lists necessary to ensure the net span constraints.

For each net i for which a net span constraint is specified, identify the leftmost and

rightmost modules, in each module row, that contain net i (see Figure 3.2). There

are at most four such modules: leftmost module with net i in the top module row

(module u of Figure 3.2), leftmost in the bottom module row (w), rightmost in top

row (v) and rightmost in bottom row (x). The span of net i is determined by a pair

of these critical modules. One module in this pair is a leftmost critical module and

the other is a rightmost critical module. So, there are at most four module pairs to

consider (for the example of Figure 3.2, these four pairs are (u, v), (w, v), (u, x) and

(w, x)).

When a critical module pair is considered, let A denote the implementation of the

left module (of the pair) in which the leftmost pin of net i is to the right of the leftmost

pin of net i in the other implementation (ties are broken arbitrarily). Let A' denote

the other implementation of the left module. Let B denote the implementation of the

right module for which the rightmost pin of net i is to the left of the rightmost pin

of net i in the other implementation (ties are broken arbitrarily). Let B' denote the

other implementation of the right module. In the example of Figure 3.2, consider the












critical module pair (u, x), u is the left module and x is the right module. The second

implementation of u is A and its first implementation is A'; the first implementation

of x is B and its second implementation is B'. There are four ways in which we

can select the implementations of the modules u and x: (A, B), (A, B'), (A', B) and

(A', B'). For each of these four selections, we can determine the span of net i and

classify the selection as feasible (i.e., does not violate the net span constraint) or

infeasible. Notice that if the selection (A, B) violates the net span constraint for net

i, then each of the remaining three selection pairs also violates the net span constraint

for this net.

Figure 3.2. Critical modules of net i


We have the following possibilities:


Case 1: [No selection is infeasible.] All four selections are feasible. In this case no

addition is made to the forcing lists.


Case 2: [Exactly one selection is infeasible.] The infeasible selection must be (A', B')

and the other three selections are feasible. Now, the selection of A' forces us to











select B and the selection of B' forces us to select A. Therefore, we add B to

the forcing list for A' and A to that for B'. To add B to the forcing list of A'

(and similarly to add A to the list of B'), we first check C[] to determine if an

implementation for the module corresponding to A' has already been selected.

If no implementation has been selected, we simply append B to the list for

A'. If the implementation A has been selected, then we do nothing. If the

implementation A' has been selected, then the implementation B is forced and

we run the function Assign (L, C, B) of Figure 3.3 which selects implementation

B as well as other implementations that may now be forced. This function

returns the value False iff it has determined that no feasible solution exists.

Algorithm Boolean Assign (L[], C[], M)
/* Select implementation M and related modules */

if M is selected then
    return True;
if M' is selected then
    return False;
/* M is undecided */
Mark M selected in C[];
for each X ∈ L[M] do
    if not Assign (L, C, X) then
        return False;
end for
Remove L[M] and L[M'];
return True;


Figure 3.3. Function Assign
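In an imperative language, the routine of Figure 3.3 might look as follows (a sketch with illustrative names, not the thesis code; labels 1..m denote first implementations, m+1..2m second implementations):

```python
def assign(L, C, m, M):
    """Sketch of Figure 3.3's Assign.  Modules are 1..m; label i means
    implementation 1 of module i and label m+i means implementation 2.
    C[i] is 0 (undecided), 1 or 2.  Selects M and, recursively,
    everything on its forcing list; returns False iff a contradiction
    shows no feasible solution exists under the selections made so
    far."""
    module, choice = (M, 1) if M <= m else (M - m, 2)
    if C[module] == choice:              # M is already selected
        return True
    if C[module] != 0:                   # M' is selected: contradiction
        return False
    C[module] = choice                   # mark M selected
    for X in list(L[M]):                 # follow M's forcing list
        if not assign(L, C, m, X):
            return False
    opposite = M + m if M <= m else M - m
    L[M], L[opposite] = [], []           # remove L[M] and L[M']
    return True
```

For instance, with m = 2 and L[1] = [4] (implementation 1 of module 1 forces implementation 2 of module 2), selecting label 1 also selects label 4; selecting an implementation whose opposite is already recorded in C returns False.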











Case 3: [Exactly two selections are infeasible.] This can arise in one of two ways

(a) (A, B) and (A, B') are feasible and (A', B') and (A', B) are infeasible and

(b) (A, B) and (A', B) are feasible and (A', B') and (A, B') are infeasible. In

case (a), we must select implementation A. This is done by executing Assign

(L, C, A). In case (b), we must select implementation B; so, we perform Assign

(L,C,B).


Case 4: [Exactly three selections are infeasible.] Now (A, B) is the only feasible

selection and we perform Assign (L, C, A) and Assign (L, C, B).


Case 5: [All four selections are infeasible.] In this case, the 2-PDMIS instance has

no feasible solution.


Once we have constructed the forcing lists for the net span constraints, we proceed

to augment these lists to account for the channel density constraint. Of course, this

augmentation is to be done only when we have not already determined that the given

2-PDMIS is infeasible. Our strategy to augment the forcing lists to account for the

density constraint begins by partitioning the routing channel into regions such that

no module boundary falls inside of a region (see Figure 3.4).

To ensure that the channel density is ≤ d, we require that the density in each

region of the channel be ≤ d. This can be done by examining each channel region.

Let T be the module on the top row of the channel region and B the module on

the bottom row. The density in this channel region is completely determined by the













Figure 3.4. Partition a routing channel into regions


nets that enter this region from its left or right and by the implementations of T

and B. Let T1, T2 (B1, B2) denote the two possible implementations of T (B). We

have four possible implementation pairs (T1, B1), (T1, B2), (T2, B1) and (T2, B2). We

can determine which of these four implementation pairs are infeasible (i.e. result in

a channel region density > d) and use a case analysis similar to that used above for

net span constraints. The cases are


Case 1: [None are infeasible.] Do nothing.


Case 2: [Exactly one is infeasible.] Suppose, for example, only (T1, B2) is infeasible.

We need to add B1 to the forcing list for T1 and T2 to the list for B2. This is

similar to case 2 for net span constraints.


Case 3: [Exactly two are infeasible.] This can happen in one of six ways. If the

feasible pairs are (T1, B2) and (T2, B1), then T1 forces B2, B2 forces T1, T2

forces B1 and B1 forces T2. The remaining five cases are similar.











Case 4: [Exactly three are infeasible.] There are four ways this can happen. For

example, if (T1, B1) is the only feasible pair, then implementations T1 and B1

must be selected. The remaining three cases are similar.


Case 5: [All four are infeasible.] The 2-PDMIS instance with density constraint d

has no feasible solution.


3.3.2 Stage 2

If following stage 1 we have not determined that the 2-PDMIS instance is infea-

sible, stage 2 is entered. If no nonempty forcing list remains, all implementations of

the modules for which no implementation has been selected result in feasible solu-

tions. When nonempty forcing lists remain, we use the limited branching method [6]

to make the remaining module implementation selections. In this method, we start

with a module i whose implementation is yet to be selected. For this module, we try

out both implementations, in parallel, following the forcing lists L[i] and L[m + i],

respectively. This is equivalent to running Assign (L, C, i) and Assign (L, C, m + i) in

parallel and terminating when either (a) both return with value False or (b) one (or

both) return with value True. When (a) occurs, we have an infeasible solution. When

(b) occurs, the selections made by the branch that returns True are used. Note that

the parallel execution of Assign (L, C, i) and Assign (L, C, m + i) is actually done via

simulation by a single processor; this processor alternates between performing one

step of Assign (L, C, i) and one of Assign (L, C, m+ i) and stops when one of the two











conditions (a) or (b) occurs. In case of (b), we proceed with the next module with

unselected implementation.

3.3.3 Implementation Details

To implement stage 2, we need two copies of the implementation selection array C;

one copy for each parallel execution branch. Call these copies C1 and C2. Although

both are identical at the start of Assign (L, C1, i) and Assign (L, C2, m + i), C1 and C2

may differ later. When the execution of these two branches terminates, we need to set

the Ci corresponding to the unselected branch equal to that of the selected branch.

This is done efficiently by maintaining two lists A1 and A2 of changes made to C1

and C2 since the start of the two branches. Then, if C1 is selected, we can use A2 to

first convert C2 back to its initial state and then use A1 to convert it from the initial

state to C1. If C2 is selected, a similar process can be used to convert C1 to C2. The

time needed for this is |A1| + |A2| rather than |C1| = |C2| = m (as would be the case

if we simply copy C1 to C2 or C2 to C1).
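The delta-list idea can be sketched in a few lines. Each branch records its changes since the branch point; making the loser's array equal to the winner's then costs only the number of recorded changes (the names here are illustrative, not the thesis code):

```python
def sync_to_winner(C_loser, A_loser, A_winner):
    """Make the losing branch's selection array equal to the winning
    branch's, touching only entries either branch changed (the idea
    behind procedure Undo).  A_loser and A_winner list the
    (index, value) changes each branch recorded since the branch
    point, so the cost is |A_loser| + |A_winner| rather than m."""
    for i, _ in A_loser:            # roll back to the branch-point state
        C_loser[i] = 0              # every recorded change was a selection
    for i, v in A_winner:           # replay the winner's changes
        C_loser[i] = v
```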

Further, since the forcing lists are shared by two branches, these branches should

not modify the forcing lists. Therefore the simulation of Assign omits the steps that

remove forcing lists. Finally, to efficiently simulate two parallel executions of Assign,

we need to convert the recursive version of Figure 3.3 into an iterative version. Our

iterative code which simulates the parallel execution of two Assign branches employs











two queues Q1 and Q2. A high level description of the code is given in Figures 3.5,

3.6 and 3.7.

3.3.4 Time Complexity

To construct the net span constraints' portion of the forcing lists, we must identify

the up to four critical modules of each net and establish the forcing constraints for

each of the up to four critical module pairs that determine the net span. The critical

modules for all nets can be determined in Θ(p) time by making a left to right sweep

of the modules, keeping track, for each net i, of the first and last modules in the

top and bottom module row that contain net i. Since all pin locations and module

boundaries are integers, the modules can be sorted in left to right order in linear time

using bin sort [24]. Each net's contribution to the forcing lists can now be determined

in Θ(1) time. Therefore, representing each L[i] as a chain, the net span constraints'

contribution to the L[i]s can be determined in Θ(p + n) = Θ(p) time.
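The sweep itself is simple; a sketch, assuming each module is given as the set of nets with a pin on it, in left to right order per row (a hypothetical input format):

```python
def critical_modules(rows):
    """One left-to-right sweep to find, for each net, the leftmost and
    rightmost module containing it in each of the two rows (the up to
    four critical modules).  `rows[r]` lists the modules of row r in
    left-to-right order; each module is the set of nets with a pin on
    it -- an assumed input format.  Runs in time linear in the total
    pin count."""
    crit = {}       # net -> [[first, last] for the top row, bottom row]
    for r, row in enumerate(rows):
        for idx, nets in enumerate(row):
            for net in nets:
                entry = crit.setdefault(net, [[None, None], [None, None]])
                if entry[r][0] is None:
                    entry[r][0] = idx    # first module of row r with net
                entry[r][1] = idx        # last module of row r seen so far
    return crit
```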

To construct the portion of L[i] that results from the channel density constraint,

we partition the channel into regions by performing a left to right sweep of the

modules and using the module end points as region boundaries. The number of

channel regions is, therefore, Θ(m). In our implementation, we scan the channel

four times to compute the maximum density of each region for each of the four

possible implementations of the module pair that bounds the region. This takes Θ(p)

time. Once we have the densities of each region we can, given the density constraint,












Algorithm Boolean Satisfy (L[], C2[])
/* Test whether L is satisfiable */
Copy C2[] into C1[];
for i = 1 to m do
    if C1[i] == 0 then /* i is undecided */
        if L[i] is empty then
            C1[i] = C2[i] = 1; /* select first implementation */
        else if L[m + i] is empty then
            C1[i] = C2[i] = 2; /* select second implementation */
        else
            EnQueue (Q1, i);
            EnQueue (Q2, m + i); /* m + i represents the 2nd implementation of module i */
            while Q1 not empty and Q2 not empty do
                a = DeQueue (Q1);
                b = DeQueue (Q2);
                if a is rejected in C1 and b is rejected in C2 then
                    return False;
                else if a is rejected in C1 then
                    EnQueue (Q1, a);
                    if not Search (L, Q2, C2, A2, b) then
                        return False;
                else if b is rejected in C2 then
                    EnQueue (Q2, b);
                    if not Search (L, Q1, C1, A1, a) then
                        return False;
                else
                    if a is undecided in C1 then
                        Add List L[a] into Q1;
                        Insert a into A1;
                        Mark a selected in C1;
                    if b is undecided in C2 then
                        Add List L[b] into Q2;
                        Insert b into A2;
                        Mark b selected in C2;
            end while /* Q1 not empty and Q2 not empty */
            if Q1 is empty then
                Undo (C2, A2, C1, A1); /* make C2 = C1 */
            else /* Q2 is empty */
                Undo (C1, A1, C2, A2); /* make C1 = C2 */
        end if /* L[i] is empty */
    end if /* module i is undecided */
end for
return True;


Figure 3.5. Function Satisfy














Algorithm Boolean Search (L, Q, C, A, x)
/* Select module x and modules in Q and related modules, update list A */
Mark x selected in C;
Insert x into A;
Add List L[x] into Q;
while Q not empty do
    y = DeQueue (Q);
    if y is rejected in C then
        return False;
    else if y is undecided in C then
        Add List L[y] into Q;
        Insert y into A;
        Mark y selected in C;
    /* else y is already selected: do nothing */
end while
return True;


Figure 3.6. Function Search







Algorithm Undo (C1, A1, C2, A2)
/* make C1 = C2 by using delta lists */
for each x E A1 do
Mark x undecided in C1;
for each x E A2 do
Mark x selected in C1;


Figure 3.7. Procedure Undo
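The single-processor alternation that Figures 3.5-3.7 implement can be modelled abstractly with two coroutines stepped in lockstep. This sketch (an abstract model, not the thesis code) captures only the termination rule: stop as soon as one branch succeeds or both have failed:

```python
def limited_branch(first, second):
    """Lockstep simulation of the two branches of the limited
    branching method.  `first` and `second` are generators that yield
    once per unit of work and finally return True (branch feasible) or
    False.  One 'processor' alternates a step of each and stops as
    soon as one branch succeeds (returning 1 or 2) or both have failed
    (returning None)."""
    branches = {1: first, 2: second}
    while branches:
        for tag in (1, 2):
            gen = branches.get(tag)
            if gen is None:
                continue
            try:
                next(gen)                    # one step of this branch
            except StopIteration as done:
                if done.value:               # this branch found a solution
                    return tag
                del branches[tag]            # this branch failed
    return None                              # both branches failed
```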











construct the forcing lists L[1..2m] in Θ(m) time. Notice that on succeeding iterations

of the binary search for an optimal solution, only the contribution to L[] from the

density constraint may change. The new contribution to L[] can be determined

without recomputing the densities of each region.

The limited branching method of stage 2 uses two queues Q1 and Q2. The time

needed to add (EnQueue) or delete (DeQueue) an element to/from a queue is Θ(1)

[24]. In each iteration of the for loop of Figure 3.5, the time spent following the

successful branch equals that spent following the unsuccessful branch and the time

needed to make C1 and C2 identical (i.e., the cost of the Undo operation) is, asymp-

totically, no more than the time spent following the successful branch. The time

spent following all successful branches is no more than the size of the forcing lists

because no forcing list is examined twice. Therefore, the stage 2 time is O(p).

The binary search for the minimum density solution iterates O(logn) times.

Therefore, our algorithm finds an optimal solution to the 2-PDMIS problem in

O(p log n) time.

Comparing our algorithm to that of Her et al. [11], we note that our algorithm

has the potential of identifying infeasible 2-PDMIS instances quite early; that is,

during the construction of the forcing lists. Although infeasibility resulting from

the critical modules of a single net being too far apart is detected immediately by

both algorithms, our algorithm also can quickly detect infeasibility resulting from

forced selections during stage 1. The algorithm of Her et al. [11] does not do this.











Because of the calls to Assign made during stage 1, the size of the forcing lists to be

processed in stage 2 is often significantly reduced. As a result, the limited branching

operation is often applied to much smaller data sets than the 2-SAT graph on which

the strongly connected component algorithm is applied [11]. These factors contribute

to the observed speedup provided by our algorithm relative to that of Her et al. [11].

3.4 Multichannel 2-PDMIS Problem

In the multichannel 2-PDMIS problem, we have c + 1, c > 1 rows of modules.

Each module has pins on its upper and lower boundaries, each module has two

possible implementations, there is a routing channel between every pair of adjacent

rows, and net span bounds are provided for every channel [11]. Although Her et al.

[11] develop a heuristic for the general multichannel PDMIS problem, they do not

consider polynomial time algorithms for the multichannel 2-PDMIS problem.

For any fixed channel density tuple (d1, d2, ..., dc) for the c routing channels,

we can develop the forcing lists in O(p) time, where p is the total number of pins.

These lists are developed using ideas similar to those used in Section 3.3. Then,

using the limited branching method of Section 3.3, we can determine, in O(p) time,

whether it is possible to select module implementations so that the channel densities

do not exceed (d1, d2, ..., dc) and so that the net span bounds are satisfied. Thus,

the method of Section 3.3 is easily extended to obtain an O(p) feasibility test for

(d1, d2, ..., dc). Since there are O(n^c) possible density vectors (n is the number of










nets), the c channel 2-PDMIS problem can be solved by trying out all O(n^c) tuples

in O(pn^c) time.

We can reduce this time to O(pn^(c-1)) as follows. When c = 2, first determine the

least y such that (n/2, y) is a feasible channel density tuple. This is done using a binary

search on d2 and takes O(log n) feasibility tests, each test taking O(p) time. We can

ignore tuples (d1, d2) with d1 ≤ n/2 and d2 < y because these tuples are infeasible,

and we can ignore tuples (d1, d2) with d1 ≥ n/2 and d2 ≥ y because these are inferior

to (n/2, y). Therefore, the search for a better tuple than (n/2, y) may be limited to the

regions d1 < n/2 and d2 > y, and d1 > n/2 and d2 < y. These two regions (Figure 3.8)

may now be searched recursively. For example, to find the best tuple in the region

d1 < n/2 and d2 > y, find the least z such that (n/4, z) is feasible. Now search the two

regions d1 < n/4 and d2 > z, and n/4 < d1 < n/2 and d2 < z, for a better tuple than (n/4, z).

The worst-case number of feasibility tests for the above search strategy is given

by the recurrence

N(n) = 2N(n/2) + log n,  n ≥ 2


and N(1) = 1. The solution to this recurrence is N(n) = O(n). Since each feasibility

test takes O(p) time, the 2-channel 2-PDMIS problem can be solved in O(pn) time.
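The recursion can be sketched as follows, with a hypothetical monotone oracle `feasible(d1, d2)` in place of the O(p) test. It returns candidate tuples (a superset of the minimal feasible staircase), from which the best is then chosen under the objective:

```python
def search_tuples(feasible, lo1, hi1, hi2):
    """Recursive region search of Figure 3.8 for the 2-channel case.
    `feasible(d1, d2)` is a hypothetical monotone oracle standing in
    for the O(p) test (feasibility is preserved when either bound
    grows).  For the midpoint d1 the least feasible d2 is found by
    binary search; the two remaining regions are searched
    recursively."""
    if lo1 > hi1:
        return []
    mid = (lo1 + hi1) // 2
    lo2, top, best2 = 1, hi2, None
    while lo2 <= top:                        # least y with (mid, y) feasible
        d2 = (lo2 + top) // 2
        if feasible(mid, d2):
            best2, top = d2, d2 - 1
        else:
            lo2 = d2 + 1
    if best2 is None:                        # nothing feasible for d1 <= mid
        return search_tuples(feasible, mid + 1, hi1, hi2)
    return ([(mid, best2)]
            + search_tuples(feasible, lo1, mid - 1, hi2)         # d2 > best2 side
            + search_tuples(feasible, mid + 1, hi1, best2 - 1))  # d2 < best2 side
```

With the toy oracle `d1 * d2 >= 12` the candidates are exactly the minimal feasible tuples (1, 12), (2, 6), (3, 4), (4, 3), (6, 2) and (12, 1).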

By doing an exhaustive search on the densities of c - 2 channels and using the

above technique for the remaining 2 channels (i.e., for each choice of densities for










c - 2 channels, find the overall best choice for the remaining 2 channels as above), we can solve

the c-channel 2-PDMIS problem in O(p · n^(c-2) · n) = O(pn^(c-1)) time.

Figure 3.8. The two regions to be searched recursively after the binary search


3.5 Experimental Results


We implemented our algorithm as well as that of Her et al. [11] in C and measured

the run time performance of the two algorithms on a SUN SPARCstation 5. Our first

data set consists of benchmark channels used in Her et al. [11]. We partitioned the

top row and bottom row of the channel into intervals, considered these intervals

as "modules", and assumed that each module has two implementations. Table 3.1 gives

the characteristics of these circuits as well as the time, in seconds, taken by the two

algorithms. The optimal densities given in Table 3.1 differ from those reported [11]











because the partitioning of the top and bottom rows of pins used by us is different

from that used in Her et al. [11]. The speedup provided by our algorithm ranges

from 1.67 to 2.20. Our second data set consists of circuits designed to minimize the

size of the forcing lists constructed in stage 1. The characteristics of these circuits

as well as the performance of the two algorithms on these circuits are given in

Table 3.2. Our algorithm is 9 to 11 times as fast on these circuits.

Table 3.1. Running time for benchmark channels


Channel     n    m     p   Optimal density   Time [11]   Time Our   Speedup
ex1        21   19    74                12      0.0022     0.0010      2.20
ex3a       44   36   158                14      0.0046     0.0023      2.00
ex3b       47   24   158                16      0.0035     0.0021      1.67
ex3c       54   23   178                18      0.0039     0.0023      1.70
ex4b       54   28   192                17      0.0045     0.0024      1.88
ex5        64   40   190                18      0.0042     0.0025      1.68
Time in seconds


Table 3.2. Running time for generated channels


Channel          n      m      p   Time [11]   Time Our   Speedup
w32x32          64     66    192      0.0425     0.0046      9.24
w64x64         128    130    384      0.0999     0.0105      9.51
w128x128       256    258    768      0.2275     0.0225     10.11
w256x256       512    514   1536      0.5130     0.0487     10.53
w512x512      1024   1026   3072      1.1755     0.1066     11.03
w1024x1024    2048   2050   6144      2.6150     0.2309     11.33
w2048x2048    4096   4098  12288      5.6700     0.4886     11.60
w4096x4096    8192   8194  24576     12.0500     1.0280     11.72
w8192x8192   16384  16386  49152     24.8800     2.1260     11.70
Time in seconds











3.6 Conclusion

We have developed an O(p log n) time algorithm for the single channel 2-PDMIS

problem and an O(pn^(c-1)) time algorithm for the c, c > 1, channel 2-PDMIS problem.

Experiments indicate that our single channel algorithm is substantially faster than

the single channel algorithm [11]. The heuristic proposed in Her et al. [11] for the

k-PDMIS problem, k > 2, uses the algorithm for the 2-PDMIS problem. By using

our 2-PDMIS algorithm, the k-PDMIS heuristic of Her et al. [11] will also run faster.













CHAPTER 4
GATE RESIZING TO REDUCE POWER CONSUMPTION


4.1 Introduction

Power consumption, speed and area are three important and related characteris-

tics of a circuit. With the increase in circuit density and the enhanced use of battery

operated devices, the emphasis on power consumption has increased. By reducing

power consumption, we simultaneously reduce heat dissipation and increase battery

life.

In this paper we consider the problem of minimizing the power consumed by

a circuit subject to satisfying the circuit's timing constraints. Power reduction is

obtained by gate resizing: larger gates are replaced by smaller ones that have higher

delay but lower power consumption. Power reduction via gate resizing has been

considered [3].

In the general gate resizing problem (GGR), for each gate in the circuit we have a

list of (delay, capacitance) pairs. Each pair gives the delay and capacitance associated

with a possible implementation of that gate. Since the power consumed by a gate is

linearly proportional to the product of its capacitance and the switching activity at

its inputs, the gate's power consumption can be computed from its capacitance once

the circuit characteristics are known. Therefore, we assume that instead of (delay,

capacitance) pairs, we have (delay, power consumption) pairs. In this model we ignore

the change in power from load change and switching activity change due to change of

gate delay. The same assumption has been used in Chen and Sarrafzadeh [3]. In the

GGR problem, we begin with a realization for each gate (i.e., a selection of a (delay,

power consumption) pair) such that the timing constraints are satisfied. We wish to

change the realization of some or all of the gates by replacing their assigned pair with

one that has larger delay (i.e., gate resizing) and such that the timing constraints

remain satisfied and the power consumption of the resized circuit (this is the sum of

the power consumption at each gate) is minimum. The GGR problem (referred to

as the incomplete library problem [3]) is equivalent to the BCI problem studied in Li

et al. [20]. The BCI problem was shown to be NP-Complete [20], even for circuits

that were simply a chain of single input single output gates. Bahar et al. [1] propose

a greedy heuristic for the GGR problem. This heuristic resizes one gate at a time.

Chen and Sarrafzadeh [3] have proposed a heuristic that resizes several gates at a time.

Experimental results presented by Chen and Sarrafzadeh show that their heuristic

is able to reduce the power consumed by benchmark circuits by approximately 10%.

More precisely, the method of Chen and Sarrafzadeh [3] did worse than the greedy

method of Bahar et al. [1] by 5.3% on one of 9 benchmark circuits and did better by

1.4% to 20.6% on the remaining 8 circuits.

Chen and Sarrafzadeh also propose a pseudo polynomial time algorithm for the

Low Power Complete Library-Specific Gate Resizing (CGR) problem. In this prob-

lem, each gate can be realized to have any delay (delays are assumed to be integral).











Further, the power consumed by gate v decreases by the constant c(v) for each unit

increase in delay. Let d1(v) be the delay of the initially assigned realization of gate v and let d(v) ≥ d1(v) be the delay of the resized gate v. Then the power reduction ΔP resulting from resizing an n gate circuit is

    ΔP = Σ_{i=1}^{n} c(i)(d(i) − d1(i))
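As a quick illustration of this cost model (a minimal sketch with hypothetical gate data, not taken from the dissertation's benchmarks), the power reduction of a resizing can be computed directly:

```python
# Sketch of the CGR cost model with hypothetical data: gate i has a
# constant power-reduction rate c[i] per unit of added delay, initial
# delay d1[i], and resized delay d[i] >= d1[i].

def power_reduction(c, d1, d):
    """Compute ΔP = sum_i c[i] * (d[i] - d1[i])."""
    assert all(di >= d1i for di, d1i in zip(d, d1)), "resizing only adds delay"
    return sum(ci * (di - d1i) for ci, d1i, di in zip(c, d1, d))

print(power_reduction([8, 5, 3], [2, 1, 4], [4, 1, 6]))  # 8*2 + 5*0 + 3*2 = 22
```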

In Section 4.2.2 we develop a linear algorithm for the CGR problem for series-

parallel circuits. In Section 4.2.3 we extend this linear algorithm for series-parallel

graphs to obtain an O(n log² n) time algorithm that works when there is an upper bound on the delay of each gate. That is, each gate v has realizations with integral delays in the range [d1(v), du(v)]. As in the CGR problem, each unit increase in delay

reduces power consumption by c(v). We call this the CUGR problem. The CUGR

problem for tree circuits can also be solved in linear time (Section 4.3). In Section 4.4,

we show that the CGR problem is NP-Complete for circuits that have a special type

of multi-input multi-output gate. An alternative algorithm for the CGR problem is

presented in Section 4.5. This algorithm transforms the CGR problem to an activity-

on-edge network [15] and then uses a known method to minimize project cost [8] to

obtain an optimal solution to the CGR problem. The approach of Section 4.5 is

quite general and can be used even for the CUGR problem and when the c(v)s

are convex functions rather than constants. In Section 4.5 we also point out that

the CGR algorithm of Chen and Sarrafzadeh does not work for CUGR and convex

c(v)s. In Section 4.6 we use the approach of Section 4.5 to obtain a heuristic for the

GGR problem with convex c(v)s. Experimental results comparing our algorithm of

Section 4.5, 4.6 and the CGR algorithms of Chen and Sarrafzadeh [3] are presented in

Section 4.7. Although both CGR algorithms generate minimum power circuits, our

algorithm does this using significantly less time. The GGR heuristic we developed

obtains better power reduction in many circuits than the one developed in Chen and

Sarrafzadeh [3].

Throughout this chapter we assume that a circuit is represented as a directed

acyclic graph. The vertices of this graph represent gates and the edges represent

signal flow. Primary inputs may be modeled as vertices with no incoming edge and

primary outputs may be modeled as vertices with no outgoing edge.

Figure 4.1 gives the digraph for an example circuit. The vertices corresponding

to primary inputs are labeled with the time at which the primary input is available;

vertices corresponding to primary outputs are labeled with the time by which the

output signal must arrive; and the remaining vertices (these correspond to circuit

gates) are labeled with the (delay, power consumption) pair corresponding to their

initial implementation.











[Figure content: four gate vertices labeled with the (delay, power consumption) pairs (1,12), (2,13), (2,15), and (3,21).]

Figure 4.1. Digraph corresponding to a circuit

4.2 Series-Parallel Circuits

4.2.1 Definition

Series-parallel circuits were considered in Li et al. [20]. A series-parallel circuit may be defined recursively as below [20]:


SPI: a chain of gates is a series-parallel circuit (Figure 4.2(a)).


SP2: several chains of gates joined at the ends to a common first gate and a common

last gate (Figure 4.2(b)) define a simple parallel circuit. A simple parallel circuit

is a series parallel circuit.


SP3: a circuit obtained from a series-parallel circuit C by replacing any interconnect

of C by another series-parallel circuit is a series-parallel circuit (Figure 4.2(c)).


Figure 4.2 gives example series-parallel circuits as well as a circuit that is not

series-parallel.


















[Figure content: four example circuits, labeled (a) through (d).]

Figure 4.2. Circuit Examples (Source: Li et al. [20]). (a) Chain; (b) Simple Parallel
Circuit; (c) Series-Parallel Circuit; (d) Non-Series-Parallel Circuit











4.2.2 Complete Library Gate Resizing (CGR)

Our strategy to solve the CGR problem for series-parallel circuits is to reduce the

circuit to one that has a single gate. The CGR problem for the reduced single gate

circuit is easily solved, and finally the solution to this single gate problem is used to

reconstruct the solution for the initial circuit.

To transform an arbitrary series-parallel circuit into an equivalent single gate

circuit (i.e., a single gate circuit with the same maximum power reduction), we first

obtain the series parallel decomposition of the circuit using the linear time algorithm

[33]. This series-parallel decomposition essentially tells us how to build the original

circuit using chains (SP1), simple parallel circuits (SP2), and replacing interconnects

of C by series-parallel circuits (SP3). During this rebuild process, we shall replace

each chain and simple parallel circuit by a single gate. Consequently, when the rebuild

is complete, we will be left with a single gate. The replacement rules for chains and

simple parallel circuits are given below:


Chain: Suppose the chain has n gates labeled with delays d1, d2, ..., dn. Let ci be the power reduction obtained by increasing di by 1. The signal delay through the chain is Σ_{i=1}^{n} di. For each unit increase in signal delay over Σ_{i=1}^{n} di, the maximum possible power reduction is max_{1≤i≤n}{ci}. Therefore, the chain is equivalent to a gate v with delay Σ_{i=1}^{n} di and c(v) = max_{1≤i≤n}{ci} (Figure 4.3(a)). The power reduction ΔP obtained by making this replacement is zero.











Simple Parallel Circuit: First transform each chain in the simple parallel circuit into an equivalent single gate using the transformation of Figure 4.3(a). This results in the parallel circuit of Figure 4.3(b). The signal delay between the output of gate s and the input of gate t is max_{1≤i≤n}{di}. So we can increase the delay of all gates between s and t to max_{1≤i≤n}{di} without increasing the overall circuit delay. This gives us a power reduction ΔP = Σ_{i=1}^{n} ci(max_{1≤j≤n}{dj} − di). Each unit increase in delay between s and t beyond max_{1≤i≤n}{di} gives us a power reduction of Σ_{i=1}^{n} ci. Therefore the n gates between s and t are equivalent to a single gate with delay max_{1≤i≤n}{di} and c value Σ_{i=1}^{n} ci. Hence a simple parallel circuit may be replaced by the three gate chain shown in Figure 4.3(b). This chain can, in turn, be replaced by a single gate using the chain transformation of Figure 4.3(a).
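The two replacement rules above can be sketched in a few lines (an illustrative representation with each gate as a (delay, c) pair; this is a sketch of the rules, not code from the dissertation):

```python
# Sketch of the Figure 4.3 replacement rules for constant-c gates.
# Each gate is a (delay, c) pair; each function returns the equivalent
# single gate and the power reduction ΔP banked by the replacement.

def reduce_chain(gates):
    """Chain: delays add; c is the best (largest) rate; ΔP = 0."""
    return (sum(d for d, _ in gates), max(c for _, c in gates)), 0

def reduce_parallel(gates):
    """Simple parallel circuit between s and t: the equivalent delay is
    the largest branch delay, the rates add, and padding every slower
    branch up to that delay banks ΔP = sum c_i * (d_max - d_i)."""
    dmax = max(d for d, _ in gates)
    dp = sum(c * (dmax - d) for d, c in gates)
    return (dmax, sum(c for _, c in gates)), dp

print(reduce_chain([(3, 1), (5, 2)]))     # ((8, 2), 0)
print(reduce_parallel([(3, 1), (5, 2)]))  # ((5, 3), 2)
```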


Using the above transformations on the series parallel decomposition yields a

single gate circuit. The input to this gate is the primary input of the original circuit

and the gate output is the primary output of the original circuit. Let v be the single

gate that remains. Let ti and to be the arrival time of the input and the required time

for the output respectively. Since we start with a circuit that can meet its arrival

time requirements (i.e., a feasible circuit) and since the transformations of Figure 4.3

do not affect feasibility, ti + d(v) ≤ to. The additional power reduction possible is (to − ti − d(v))c(v). The maximum power reduction ΔPmax for the original circuit is

[Figure content: (a) a chain with rates c1, ..., cn replaced by a single gate with c = max(ci) and ΔP = 0; (b) a simple parallel circuit between s and t replaced by a gate with delay max(dj) and ΔP = Σ ci(max(dj) − di).]

Figure 4.3. Transformation of Series-Parallel Circuits. (a) Chain; (b) Simple Parallel
Circuit

(to − ti − d(v))c(v) plus the sum of the ΔPs from the simple parallel circuit transformations (Figure 4.3(b)).

To obtain the delay values for each gate of the original circuit that will result in a power reduction of ΔPmax, simply follow the reduction process backwards. The total

time taken is linear in the number of gates in the original circuit. Figures 4.4 and

4.5 show an example. Each gate is represented by a box, the number inside a box

is the gate delay, the number below a box is the gate's c value, the primary input is

available at time 0, and the primary output is needed at time 37.

4.2.3 Complete Library with Upper Bounds (CUGR) and Convex c(v)s

The power reduction per unit delay increase function c(v) for gate v is convex iff there exist positive δ1, δ2, ..., δm and c1 ≥ c2 ≥ ... ≥ cm such that c(v) = c1 for delay






















[Figure content: the successive reduction steps, with ΔP1 = 0, ΔP2 = 6·5 + 2·2 + 3·3 = 43, ΔP3 = 0, ΔP4 = 8·15 = 120, and, with Δd = 37 − 29 = 8, ΔP5 = 21·Δd = 168. The primary input is available at 0 and the primary output is required at 37.]

Figure 4.4. Transformation of a Series-Parallel Circuit into a single gate.












[Figure content: the reduction of Figure 4.4 followed backwards, distributing the available slack to assign each gate its new delay while meeting the output requirement of 37.]

Figure 4.5. Computation of new delay for each gate.











increases between 0 and δ1; c(v) = c2 for delay increases between δ1 + 1 and δ1 + δ2; c(v) = c3 for delay increases between δ1 + δ2 + 1 and δ1 + δ2 + δ3; and so on. Figure 4.6 shows the power consumption as a function of the increase in delay relative to the gate's initial delay d1(v). P0 is the power consumption when the gate has its initial delay d1(v).

[Figure content: power consumption versus delay increase; the curve starts at P0 and is piecewise linear with slopes −c1, −c2, ..., −cm over the successive segments ending at δ1, δ1 + δ2, ..., Σ_{i=1}^{m} δi.]

Figure 4.6. Convex delay-power-consumption graph


The CUGR problem can be modeled using gates with convex power reduction functions c(v). For example, if gate v provides a power reduction of c for each unit increase in delay between d1(v) and du(v), then we may use c(v) with δ1 = du(v) − d1(v) and c1 = c. Because of this correspondence between the CUGR problem and the convex gate resizing problem ConvexCGR, we consider only the ConvexCGR problem

in this section.











To solve the ConvexCGR problem for series-parallel circuits, we need only develop

methods to transform a chain of convex gates into an equivalent convex gate and

to transform a simple parallel circuit comprised of convex gates into an equivalent

convex gate. These transformations can then be used in place of the transformations

of Section 4.2.2 to obtain an algorithm for the ConvexCGR (and hence also for

the CUGR) problem. We assume that a convex gate is given by a list of tuples

{(δ1, c1), (δ2, c2), ..., (δm, cm)}, c1 ≥ c2 ≥ ... ≥ cm. This list is called the DP (delay-

power) list of the gate.


Chain of Convex Gates: A chain of n convex gates with initial delays d1, d2, ..., dn is replaced by a convex gate with initial delay Σ_{i=1}^{n} di as in Figure 4.3(a). The DP

list for the new gate is obtained by merging the DP lists of the n gates in the

original chain into a single list sorted by non-increasing ci values. During this process, pairs with the same c value are combined into a single pair. For example, if

(5,24) and (2,24) are pairs in DP1 and DP2 respectively, the combined pair

is (7,24). Suppose we have a 3 gate chain with DP1 = {(3, 28), (5,24), (3,21)},

DP2 = {(2, 24), (4, 23)} and DP3 = {(9, 26)}. The DP list for the replacement

gate for this chain is {(3, 28), (9, 26), (7, 24), (4, 23), (3, 21)}.
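The series merge just described can be sketched as follows (an illustrative Python rendering, not the dissertation's implementation; it reproduces the three-gate example above):

```python
# Sketch: series merge of convex-gate DP lists. A DP list is a list of
# (delta, c) tuples sorted by non-increasing c; the merge keeps the
# combined tuples sorted and coalesces tuples that share a c value.
from itertools import groupby

def series_merge(*dp_lists):
    tuples = sorted((t for dp in dp_lists for t in dp),
                    key=lambda t: -t[1])            # non-increasing c
    return [(sum(d for d, _ in grp), c)             # coalesce equal c
            for c, grp in groupby(tuples, key=lambda t: t[1])]

dp1 = [(3, 28), (5, 24), (3, 21)]
dp2 = [(2, 24), (4, 23)]
dp3 = [(9, 26)]
print(series_merge(dp1, dp2, dp3))
# [(3, 28), (9, 26), (7, 24), (4, 23), (3, 21)]
```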


Simple Parallel Circuit with Convex Gates: First the chains in the circuit are transformed into equivalent single convex gates. Then the delays of these equivalent convex gates are increased to max_{1≤i≤n}{di}, the largest initial delay. This increase in delay provides us a power reduction ΔP and changes the tuples at the front of the DP lists of the gates whose delay is increased. Let the new DP lists be DP1', DP2', ..., DPn'. Now the gates between s and t are replaced by a gate with delay max_{1≤i≤n}{di} and a DP list obtained by a parallel merge of the DP' lists. Figure 4.7 shows the process when n = 2. Here L1 and L2 denote the two DP lists DP1' and DP2' and L denotes the DP list of the replacement convex gate. Finally, s, t and the replacement gate are combined into a single convex gate using the method for a chain of convex gates. Suppose we have two parallel convex gates with delays 3 and 2 respectively and the corresponding DP lists DP1 = {(3, 28), (5, 24), (3, 21)} and DP2 = {(2, 24), (4, 23)}. The delay of the equivalent convex gate is thus 3, which reduces the power consumption of the second gate by 24. The DP list of the second gate is modified to {(1, 24), (4, 23)}. By performing the ParallelMerge operation we obtain the DP list {(1, 52), (2, 51), (2, 47), (3, 24), (3, 21)} for the equivalent gate.


4.2.4 Time Complexity of Convex GR Problem

A straightforward implementation of our algorithm of the previous section for the

ConvexCGR problem uses sorted chains (linked lists) to represent the DP lists. The

time needed to combine/merge the DP lists L1 and L2 of two gates is O(IL1 + IL21)

regardless of whether we do a series or parallel merge. If we start with gates having


















Algorithm ParallelMerge(L1, L2)
/* Merge L1 and L2, considering the two gates in parallel */

p1 ← head(L1);
p2 ← head(L2);
L ← NULL;
while L1 not empty and L2 not empty do
    if δ(p1) < δ(p2) then
        Insert (δ(p1), c(p1) + c(p2)) into L;
        δ(p2) ← δ(p2) − δ(p1);
        p1 ← next(p1);
    else if δ(p1) > δ(p2) then
        /* Symmetric to the "if" part above, with p1 and p2 interchanged. */
    else /* δ(p1) = δ(p2) */
        Insert (δ(p1), c(p1) + c(p2)) into L;
        p1 ← next(p1);
        p2 ← next(p2);
    end if
end while
if L1 is empty then
    Append the remaining nodes of L2, starting from p2, to L;
else
    Append the remaining nodes of L1, starting from p1, to L;
end if
return L;


Figure 4.7. Algorithm ParallelMerge
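For reference, a direct, runnable rendering of Algorithm ParallelMerge on plain Python lists (a sketch using list indices instead of pointers; applied here to the two-gate example of Section 4.2.3, where the shorter gate's DP list has already been adjusted to {(1, 24), (4, 23)}):

```python
# Runnable sketch of Algorithm ParallelMerge (Figure 4.7).
# Each DP list is a list of (delta, c) tuples.

def parallel_merge(l1, l2):
    l1, l2 = list(l1), list(l2)   # work on copies
    out = []
    i = j = 0
    while i < len(l1) and j < len(l2):
        d1, c1 = l1[i]
        d2, c2 = l2[j]
        if d1 < d2:
            out.append((d1, c1 + c2))
            l2[j] = (d2 - d1, c2)   # consume d1 units of l2's tuple
            i += 1
        elif d1 > d2:
            out.append((d2, c1 + c2))
            l1[i] = (d1 - d2, c1)   # consume d2 units of l1's tuple
            j += 1
        else:
            out.append((d1, c1 + c2))
            i += 1
            j += 1
    out.extend(l1[i:])   # append whichever list still has tuples
    out.extend(l2[j:])
    return out

print(parallel_merge([(3, 28), (5, 24), (3, 21)], [(1, 24), (4, 23)]))
# [(1, 52), (2, 51), (2, 47), (3, 24), (3, 21)]
```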











DP lists with k tuples each, then the time needed to transform an n gate series-parallel circuit into its equivalent single gate is O(kn²). To see this, observe that each series/parallel combine step reduces the number of gates by at least 1. Therefore, there can be at most n − 1 combine steps. Further, after q combines, the size of a DP list is O(kq) = O(kn). So the cost of O(n) combines is O(kn²). An example circuit that exhibits this kn² worst-case behavior is given in Figure 4.8.











Figure 4.8. Worst-case merging of n gates


We can reduce the asymptotic time complexity to O(kn log² n) by using balanced binary search trees (BBSTs) [15] to represent the DP lists. Each DP list is represented by a BBST such that the external nodes represent the pairs (δi, ci) in the DP list in right to left order (i.e., in decreasing order of power reduction). Each internal

node x contains a triple of the form (D(x), C(x), M(x)), where D(x) is the sum of the

delays of the DP list pairs in the left subtree of x, C(x) is a corrective factor needed

to compute the ci values of pairs in the left subtree of x, and M(x) is a pointer to

the rightmost external node in the left subtree of x. Each external node y stores a










pair (d(y), c(y)) such that d(y) is the δ value of the DP list pair represented by node y; the c value of this DP list pair is

    c(y) + Σ_{x : y is in the left subtree of x} C(x)

Figure 4.9 shows a possible BBST for the DP list {(3, 28), (5, 24), (3, 21)}. The

leftmost external node contains the pair (3,13) which represents the DP list pair

(3,21). The correct c value for the DP list pair is obtained by adding to 13 the C

values in the ancestors of the external node.

[Figure content: a BBST whose external nodes, left to right, store the pairs (3,13), (5,14) and (3,21); adding the C correctors of the internal nodes on the path from the root recovers the actual c values.]

Figure 4.9. BBST used to represent DP list {(3, 28), (5, 24), (3, 21)}


To insert a new DP list pair into our BBST, we must be able to trace a path

from the root to an appropriate external node. This path tracing is facilitated by the

pointer M(x) in internal node x. By using the c() value in the external node M(x)

and the C values in the nodes from the root to x, we can compute the maximum c

value of any DP pair in the left subtree of x.











Since insertions require rotations, we show how the D() and C() values in internal nodes are to be changed when rebalancing rotations are done (Figure 4.10). Note that M() values remain unchanged during tree rotations, and are omitted from the figure. The tuple (D(x), C(x)) of each internal node x is shown next to the node.

To merge the DP lists of two gates in a chain, we first perform an inorder traversal

of the smaller DP list's BBST to extract the DP list's pairs. Then, these pairs are

inserted into the BBST for the larger DP list. During this insertion, pairs with the

same c value are combined into a single pair. If the two DP lists are L1 and L2, the time needed to do the series merge is O(|L1| log(|L1| + |L2|)), where L1 is the smaller DP list.

For a parallel merge of two DP lists L1 and L2 (L1 is the smaller list), we need to identify, for each (δk, ck) in L1, the external nodes z in the BBST for L2 for which Σ_{i=1}^{k−1} δi < f(z) ≤ Σ_{i=1}^{k} δi, where the δi s are defined with respect to L1 and f(z) is the sum of the d() values of the external nodes in the BBST of L2 that lie to the left of z plus the d() value of z. Let x and y be the leftmost and rightmost such external nodes (see Figure 4.11). These nodes can be found in O(log |L2|) time using Σ_{i=1}^{k−1} δi and Σ_{i=1}^{k} δi, and the D values in the triples of the internal nodes of the BBST for L2. Actually, node x may already be known from the processing of the pair (δk−1, ck−1) of L1. We need to increase the c values of the external nodes from x to y. This can be done in logarithmic time by changing the C correctors stored in the internal nodes on the paths from x and y to their common ancestor (see Figure 4.11).












[Figure content: the four rebalancing rotations (LL, LR, RL and RR), each shown with the (D, C) tuples of the affected internal nodes before and after the rotation; the updated tuples involve sums and differences such as (Da + Db, Ca + Cc), (Dc − Db, Cc) and (Db, Cb − Ca).]

Figure 4.10. Update of D()s and C()s for internal nodes during tree rotations


















[Figure content: the BBST of L2, with the external nodes x through y and the C correctors on the paths from x and y to their common ancestor increased by ck.]

Figure 4.11. Change of c values of internal nodes of L2 for the kth tuple of L1











In addition to the above change in C correctors, we may need to insert a new external node. If f(y) = Σ_{i=1}^{k} δi, then no insertion is needed; we simply increase c(y) by ck. Otherwise, we change the original node to (Σ_{i=1}^{k} δi − f(y) + d(y), c(y) + ck) and insert (f(y) − Σ_{i=1}^{k} δi, c(y)) into the BBST; note that the two delays sum to d(y). The inserted external node is the x node for the next pair of L1.

When the BBST method is used on the worst-case example for the linked list method (Figure 4.8), |L1| = k and |L2| ≤ kn for each of the n − 1 merges. Therefore, the run time is O(kn log kn). The worst case for the BBST method arises when we continually merge DP lists of the same size. This worst case is described by log n stages of merges, each stage involving pairs of DP lists of the same size. In stage 1, n/2 pairs of lists each of size k are merged in O(2k log 2k) time per pair to produce n/2 DP lists of size 2k each; in stage 2, n/4 pairs of DP lists of size 2k each are merged in O(4k log 4k) time per pair to produce n/4 DP lists of size 4k each; and so on. The total time is O(Σ_{i=1}^{log n} (n/2^i) · 2^i k · log(2^i k)) = O(kn log kn log n).

Figure 4.12 shows a circuit on which this worst-case bound is achieved. Let C0 be a circuit with a single module. Ci, i > 0, is a simple parallel circuit obtained from Ci−1 as shown in Figure 4.12. The number of modules in Ci is 2^(i+1) + i(i − 1), and Ci requires 2^(i−1) parallel merges and 2^i − 1 series merges. The total cost is O(kn log kn log n).























Figure 4.12. Circuit C2 that exhibits the worst-case behavior


4.3 Tree Circuits

Gates in circuits with a tree topology (for example, distribution trees) can be

resized by transforming the trees into equivalent single gate circuits using the ba-

sic transformation shown in Figure 4.13. The transformation of Figure 4.13 first

transforms a node, all of whose children are leaves, into an equivalent simple parallel

circuit by the introduction of additional gates/nodes with delay r − ri, where ri is the required time for the output of leaf i and r = max(ri). The c values for the new gates are 0. The simple parallel circuit can now be transformed into an equivalent single gate using the transformation of Figure 4.3. By repeatedly applying this transformation to any tree, the tree can be transformed into an equivalent single gate.
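The padding step of this transformation can be sketched as follows (hypothetical leaf data; pad_leaves is an illustrative helper, not from the dissertation):

```python
# Sketch of the Figure 4.13 padding step: each leaf i (delay d_i,
# rate c_i, required time r_i) is followed by a zero-c padding gate
# of delay r - r_i, where r = max(r_i), so that every branch shares
# the common required time r and forms a simple parallel circuit.

def pad_leaves(leaves):
    """leaves: [(delay, c, required_time), ...] for a node all of whose
    children are leaves. Returns the padded two-gate branches and the
    common required time r."""
    r = max(ri for _, _, ri in leaves)
    branches = [[(d, c), (r - ri, 0)] for d, c, ri in leaves]
    return branches, r

branches, r = pad_leaves([(2, 5, 7), (1, 3, 9)])
print(r)          # 9
print(branches)   # [[(2, 5), (2, 0)], [(1, 3), (0, 0)]]
```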

Although the preceding transformation was described specifically for the CGR

problem, it is easily extended to the CUGR and ConvexCGR problems using the

ideas of Section 4.2.












[Figure content: a node with leaf children (delays di, c values ci, required times ri) and the equivalent simple parallel circuit obtained by appending to each leaf a gate with delay r − ri and c value 0, where r = max(ri).]

Figure 4.13. Transformation of a basic tree to a simple parallel circuit

4.4 CGR with Multigate Modules Is NP-Hard


Suppose that a circuit is to be realized with modules that contain multiple gates.

Increasing the delay of a module results in an increase in the delay of all gates on the

module. Figure 4.14 shows a module v with two gates A and B, each is a two-input

one-output gate. The delay of the selected module implementation is d(v) and each

unit increase in module delay reduces power consumption by c(v) and increases the

delay of both A and B by one unit. We shall show that the CGR problem with

multigate modules (MCGR) is NP-hard. For the proof, we show that if MCGR can

be solved in polynomial time, then the one-in-three 3SAT problem [9] can also be

solved in polynomial time.



(d(v),c(v))




Figure 4.14. A module v with two gates A and B


Definition 1 (one-in-three 3SAT)











Input: A collection of clauses C1, C2, ..., Cm over variables x1, x2, ..., xn such that each clause is the disjunction of exactly three literals.

Output: "yes" if and only if there is a truth assignment to the variables such that each clause has exactly one true literal.


Theorem 1 MCGR is NP-hard.


Proof. We show how to transform, in polynomial time, any instance I of the one-in-

three 3SAT problem into an instance I' of the MCGR problem such that the maximum

power reduction for I' is (2m + 1)n + m if and only if the answer to I is "yes". Here

m is the number of clauses in I and n is the number of variables.

For the transformation, we define two circuit subassemblies: a variable subassembly and a clause subassembly. A variable subassembly consists of two multigate modules, each having two gates, connected as in Figure 4.15.

[Figure content: the two two-gate modules of the variable subassembly, their primary inputs, and the outputs o(xi) and o(x̄i).]

Figure 4.15. Variable subassembly for variable xi


The first module of the variable subassembly for variable xi is called module xi and the second is module x̄i. The inputs to gates A and B of module xi and to gate B of module x̄i are primary inputs which are available at time 0. One of the inputs to gate A of module x̄i is a primary input available at 0 and the other input is the output of gate A of module xi. The output of gate A of module x̄i is a primary output which has a required arrival time of 1. The outputs of the two B gates are non-primary outputs. The c value for each module is 2m + 1 and the initial delay of each is 0. Notice that we can increase the delay of either module xi or module x̄i (but not both) by 1 and still satisfy the arrival time requirement of the primary output. Therefore, the maximum power reduction obtainable from a variable subassembly is 2m + 1. If module xi has delay 1, then we say that literal xi is true; otherwise xi is false. Similarly, if module x̄i has delay 1, the literal x̄i is true; otherwise x̄i is false. Although we can assign delays to the two modules so that both literals are false, delay assignments can make at most one literal true.

We construct one variable subassembly for each of the n variables in the 3SAT

instance I. The maximum power reduction obtainable from these n subassemblies is

(2m + 1)n.

A clause subassembly consists of 3 modules with one gate each; each gate has 3 inputs and 1 output. Let l1, l2 and l3 be the three literals in a clause. Figure 4.16

shows the corresponding clause subassembly together with the inputs to each gate.

These inputs are the outputs of the variable subassemblies. The outputs of the

modules of the clause subassemblies are primary outputs with required time of 1.

The c value for each module in a clause subassembly is 1.













[Figure content: three one-gate modules; reconstructed from the text, the first takes o(l̄1), o(l2) and o(l3) as inputs, the second o(l1), o(l̄2) and o(l3), and the third o(l1), o(l2) and o(l̄3).]

Figure 4.16. Clause subassembly for (l1 ∨ l2 ∨ l3)

The maximum power reduction obtainable from a clause subassembly is 3. This corresponds to the case when all 6 literals (l1, l̄1, l2, l̄2, l3 and l̄3) are false. We shall have one clause subassembly for each of the m clauses in I. Therefore the maximum power reduction available from all m clause subassemblies is 3m.

The circuit I' is comprised of the n variable and m clause subassemblies described above.

If there is a truth assignment T for the variables of I such that the answer to the one-in-three 3SAT instance I is "yes", then make the delay of module xi 1 if xi is true in T; otherwise make the delay of module x̄i 1. Further, since exactly one literal is true in each clause of I, we can make the delay of exactly 1 module in each clause subassembly 1. The total power reduction is (2m + 1)n + m.











Now suppose there is a solution S to I' which gives us a power reduction ≥ (2m + 1)n + m. In S, each variable subassembly must have exactly one module with delay 1. To see this, observe that no variable subassembly can have two modules with delay 1, and if any variable subassembly has no module with delay 1, the power reduction obtained by S is at most (2m + 1)(n − 1) + 3m < (2m + 1)n + m. So, assume that each variable subassembly has exactly one module with delay 1. This means that we have a consistent truth assignment; that is, there is no variable xi for which either both xi and x̄i are true or both are false.

Now let's determine the power reduction obtainable from the clause subassemblies. If l1 is the only literal that is true among l1, l2 and l3, we can make the delay of the topmost module 1 because the arrival times of o(l̄1), o(l2) and o(l3) are all 0. The delays for the remaining two modules must be 0 because the arrival time of o(l1) is 1. A similar analysis can be done for the cases when l2 and l3 are the only true literals. We conclude that when exactly one literal of a clause is true, we can get a power reduction of at most 1 from its clause subassembly.

Therefore we can get at most m units of power reduction from the m clause subassemblies. The maximum of m is obtained only when exactly one literal of each clause is true. Hence the solution to the MCGR instance I' provides a power reduction ≥ (2m + 1)n + m if and only if the one-in-three 3SAT instance has answer "yes". □










4.5 General Circuits

4.5.1 The CGR Algorithm of Chen and Sarrafzadeh

Chen and Sarrafzadeh [3] have proposed a pseudo polynomial time algorithm for

the CGR problem. Let a(v) denote the arrival time of the signal at the output of

gate v and let r(v) denote the required time for the signal at the output of gate v.

For a primary input, a(v) is the time at which the signal becomes available and for

a primary output r(v) is the required time for that output. Hence a(v) is known for

primary inputs and r(v) is known for primary outputs. The remaining a's and r's

are defined as below (d(v) is the assigned delay of gate v):



    a(v) = max_{u:(u,v)∈E} (a(u) + d(v))                   (4.1)

    r(v) = min_{w:(v,w)∈E} (r(w) − d(w))                   (4.2)


Hence a(v) is the length of the longest delay path from the primary inputs to the

output of v, and r(v) is the latest time by which the signal must arrive at the output

of gate v so that it is still possible for the signal to reach the primary outputs by

their required times. Note that a(v) and r(v) are very closely related to the early

and late event times for activity networks [5]. The slack, s(v), of gate v is


    s(v) = r(v) − a(v)                                     (4.3)











A circuit is feasible if and only if all primary outputs arrive by their required times. From the definitions of r(v) and a(v), it follows that a circuit is feasible if and only if s(v) ≥ 0 for all v.
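Equations (4.1)-(4.3) translate directly into two topological sweeps over the circuit DAG. The sketch below (hypothetical node names; gates carry delays, primary inputs carry arrival times, primary outputs carry required times) is one way to compute the slacks:

```python
# Sketch: compute a(v), r(v) and s(v) = r(v) - a(v) on a circuit DAG,
# following equations (4.1)-(4.3). Primary outputs are modeled as
# zero-delay vertices with given required times.
from graphlib import TopologicalSorter

def slacks(edges, delay, a_in, r_out):
    succ, pred = {}, {}
    nodes = set(delay) | set(a_in) | set(r_out)
    for u, v in edges:
        succ.setdefault(u, []).append(v)
        pred.setdefault(v, []).append(u)
        nodes |= {u, v}
    order = list(TopologicalSorter(
        {v: pred.get(v, []) for v in nodes}).static_order())
    a = dict(a_in)                       # forward sweep: equation (4.1)
    for v in order:
        if v not in a:
            a[v] = max(a[u] for u in pred[v]) + delay.get(v, 0)
    r = dict(r_out)                      # backward sweep: equation (4.2)
    for v in reversed(order):
        if v not in r:
            r[v] = min(r[w] - delay.get(w, 0) for w in succ[v])
    return {v: r[v] - a[v] for v in nodes}   # equation (4.3)

s = slacks([('in', 'g1'), ('g1', 'g2'), ('g2', 'out')],
           {'g1': 1, 'g2': 2}, {'in': 0}, {'out': 5})
print(s['g1'], s['g2'])   # 2 2
```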

Figure 4.17(a) shows an example circuit graph. This corresponds to a circuit

with three gates, 2 primary inputs and 2 primary outputs. The a() values for the two

primary inputs are 0 and 2 respectively. The r() values for the two primary outputs

are 5 and 4 respectively. The 3 gates a, b and c are shown by boxes. The c() value

of a gate is given below the box. The selected delay for each gate is 1 and is shown

inside the box. The a(v)|s(v)|r(v) values for each gate are given above the box.

[Figure content: (a) the example circuit graph, with the a(v)|s(v)|r(v) labels above the gate boxes and the c() values 15, 14 and 13 below gates a, b and c; (b) the corresponding sensitive graph with vertex weights.]

Figure 4.17. Application of algorithm of Chen and Sarrafzadeh [3] on a CGR circuit.
(a) An example CGR circuit; (b) sensitive graph


The algorithm of Chen and Sarrafzadeh comprises the following steps:


Step 1: Compute the slack for each node of G.


Step 2: If no node has slack > 0, stop.











Step 3: Compute the sensitive graph Gs from G as follows. Gs contains exactly those vertices of G that have slack > 0. (u, v) is an edge of Gs if and only if either a(v) − a(u) = d(v) or r(v) − r(u) = d(v). The weight of a vertex in Gs is its c() value.


Step 4: Compute the transitive closure graph Gt of G,.


Step 5: Compute a maximum weighted independent set of Gt.


Step 6: Increase the delays of all gates in the maximum weighted independent set by

1.


Step 7: Go to Step 1.


The maximum weighted independent set of Gt may be computed using a maxflow algorithm [23]. This takes O(nm log(n²/m)) time for a graph with n vertices and m edges. Since the algorithm of Chen and Sarrafzadeh [3] may reduce the power consumption by only one unit on each iteration, its complexity is O(S · nm log(n²/m)), where S is the obtained power reduction.

Figure 4.17(b) shows the graph Gs that corresponds to the graph G of Figure 4.17(a). The numbers inside the vertices are their weights. The transitive closure graph Gt for Gs is the same as Gs, and the maximum weighted independent set is

{b, c} with a weight of 14 + 13 = 27. The delays of gates b and c are increased by 1

to obtain a power reduction of 27 and we proceed to the second iteration.
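For illustration only, a brute-force maximum weighted independent set suffices on tiny sensitive graphs (Step 5 actually uses a maxflow-based method; the edge set below is an assumption matching Figure 4.17(b), namely that gate a conflicts with both b and c in Gt):

```python
# Illustration of Step 5 on the Figure 4.17 example, using brute force
# in place of the maxflow-based computation (fine for tiny graphs).
from itertools import combinations

def max_weight_independent_set(vertices, weight, closure_edges):
    edges = set(closure_edges) | {(v, u) for u, v in closure_edges}
    best, best_w = set(), 0
    verts = list(vertices)
    for k in range(1, len(verts) + 1):
        for subset in combinations(verts, k):
            if any((u, v) in edges
                   for u in subset for v in subset if u != v):
                continue                      # subset is not independent
            w = sum(weight[v] for v in subset)
            if w > best_w:
                best, best_w = set(subset), w
    return best, best_w

# Assumed transitive closure Gt for Figure 4.17(b): a conflicts with b, c.
ivs, w = max_weight_independent_set(
    'abc', {'a': 15, 'b': 14, 'c': 13}, [('a', 'b'), ('a', 'c')])
print(sorted(ivs), w)   # ['b', 'c'] 27
```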











4.5.2 Comments on the Algorithm of Chen and Sarrafzadeh

1. Although we expect the algorithm of Chen and Sarrafzadeh [3] to be quite

efficient on circuits for which only a small power reduction is possible (i.e., S is

small), it is not expected to be efficient on circuits whose power consumption

can be significantly reduced. For example, consider the one-gate circuit of

Figure 4.18. The arrival time of the primary input is 0, the required time of

the primary output is r, and the initial gate delay is 0. The algorithm [3] takes

r iterations to complete. We would like an algorithm that can increase gate

delay by more than 1 on each iteration. In particular, it should be possible to

obtain the optimal solution for the circuit of Figure 4.18 in one iteration.


Figure 4.18. A simple example CGR circuit


2. Chen and Sarrafzadeh [3] have proved that their algorithm indeed solves the

CGR problem optimally. In most realistic situations, however, one or more

of the gates will have an upper limit on the obtainable delay. That is, the

problem will really be a CUGR problem. The algorithm [3] does not obtain

optimal solutions to the CUGR problem. For example, consider the circuit

of Figure 4.19(a). The numbers above a gate give the upper bound on the

gate's delay. This is essentially the circuit of Figure 4.17(a) with the addition











of upper bounds on gate delay. The first iteration of the algorithm of Chen

and Sarrafzadeh [3] proceeds exactly as it did without the upper bounds and

we arrive at the configuration of Figure 4.19(b). For the second iteration, gates
b and c are eliminated from Gs because their delays cannot be increased any
further. Gs is now just a single-vertex graph, Gt = Gs, and the maximum
weighted independent set is {a}. The delay of gate a is increased by 1 and
the algorithm terminates (see Figure 4.19(c)). The power reduction obtained
is 13 + 14 + 15 = 42. However, the optimal power reduction of 44 is obtained by
changing the delay of gate a to 3 and gate b to 2, and leaving the delay of gate
c at 1 (see Figure 4.19(d)).


3. As noted in Section 4.2, gates with an upper bound on their delay may be

modeled by gates with convex delay-power consumption functions. Since the

algorithm of Chen and Sarrafzadeh [3] does not obtain optimal solutions for

the CUGR problem, it does not obtain optimal solutions for the ConvexCGR

problem.


4.5.3 A Unified Framework for CGR, CUGR and ConvexCGR

The CGR, CUGR and ConvexCGR problems can all be solved in pseudopolynomial
time by transforming the circuit into an activity-on-edge PERT (Performance
Evaluation and Review Technique) network and then using the algorithm of Fulkerson
[8] for project cost curves.





















Figure 4.19. Application of algorithm of Chen and Sarrafzadeh [3] on CUGR circuit.
(a) An example CUGR circuit; (b) delay of each gate after first iteration; (c) delay
of each gate after algorithm [3] terminates; (d) delay of each gate for optimal power
reduction











The PERT network G for any circuit C is obtained as follows:


Step 1: For each gate v of C, G contains two vertices, v- and v+. There is an edge
(v-, v+) from vertex v- to v+. With this edge (v-, v+), we associate a triple
(a(v-, v+), b(v-, v+), c(v-, v+)), where a(v-, v+) = di(v); b(v-, v+) = di(v) +
s(v) for the CGR problem and b(v-, v+) = min{di(v) + s(v), du(v)} for the
CUGR problem, where di(v) is the initial delay and du(v) is the upper bound
on the delay of gate v (the ConvexCGR case is discussed later); and
c(v-, v+) = c(v).

Step 2: For each edge (u, v) in C, there is an edge (u+, v-) in G. The triple for this
edge is (0, 0, 0).


Step 3: G has two special vertices, s (source) and t (sink). There is an edge (s, v-)
for every gate v that has a primary input. The triple for this edge is
(max{a(v)}, max{a(v)}, 0), where the maximum is taken over the arrival times
of all primary inputs to v. Additionally, there is an edge (v+, t) for every gate v
that has a primary output. The triple for this edge is (a, a, 0) where a = max{
required times of all primary outputs} - (required time of primary output of
gate v).

The PERT network for the circuit of Figure 4.17(a) is shown in Figure 4.20. Pairs
of vertices (v-, v+) are enclosed in broken boxes. Edge triples are shown above each
edge. The number inside each vertex is its initial τ() value, which will be defined later.

The interpretation of an edge triple (a, b, c) is: a is the smallest delay through the












edge, b is the maximum delay through the edge, and c is the power reduction per

unit increase in delay in the range a through b.



Figure 4.20. A PERT network for the CGR circuit of Figure 4.17(a).


The objective is to assign integer values τ() to the vertices of the PERT network,
and integer weights w() to the edges, so as to

maximize

    Σ_{(x,y) ∈ E} c(x, y) w(x, y)                                  (4.4)

subject to

    w(x, y) ≤ τ(y) - τ(x)          for every edge (x, y) ∈ E       (4.5)
    a(x, y) ≤ w(x, y) ≤ b(x, y)    for every edge (x, y) ∈ E       (4.6)
    τ(s) = 0                                                       (4.7)
    τ(t) ≤ max{required times of all primary outputs}              (4.8)











It is easy to see that the optimal solution to the above integer linear program de-
fines an optimal solution for the power reduction problem. In this solution, the delay
of gate v is τ(v+) - τ(v-) and the obtained power reduction is Σ_{(x,y) ∈ E} c(x, y) w(x, y).

The algorithm of Fulkerson [8] solves the above linear program using a primal-
dual approach and a network flow algorithm. It begins by setting w(x, y) = b(x, y)
for each edge and computes the smallest τ() that satisfies Equations 4.5-4.7 using a
topological-order scan of the PERT network beginning at vertex s. These values
become the initial τ() values. In Figure 4.20, the number in each vertex indicates
this initial τ value. If the computed τ(t) satisfies Equation 4.8, we are done. If not,
the w's are reduced using augmenting path methods until we have an assignment
of w's and τ's that satisfies all the constraints (Equations 4.5-4.8). Fulkerson [8] has
extended his algorithm to the case when c(x, y) is given by a convex function for each
edge. This extension essentially increases the number of edges in G. Therefore we
are also able to solve the ConvexCGR problem with this formulation.
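The opening move of Fulkerson's method (fix w = b on every edge, then take longest paths from s in topological order) can be sketched as follows. The edge-dictionary layout, mapping (tail, head) to the triple (a, b, c), is our assumption for illustration:

```python
def initial_tau(E, topo_order):
    """Initial tau() values for Fulkerson's primal-dual method.

    With w(x, y) fixed at b(x, y), the smallest tau() satisfying
    tau(y) - tau(x) >= w(x, y) and tau(s) = 0 is the longest-path
    distance from s, computed by one topological-order scan.

    E          : dict mapping (x, y) -> triple (a, b, c)
    topo_order : topological order of vertices, starting at s
    """
    tau = {topo_order[0]: 0}
    for y in topo_order[1:]:
        # tau(y) = max over in-edges (x, y) of tau(x) + b(x, y)
        tau[y] = max(tau[x] + b for (x, yy), (a, b, c) in E.items()
                     if yy == y)
    return tau
```

If the resulting τ(t) already satisfies the deadline of Equation 4.8, the schedule is feasible and no augmenting-path reductions are needed.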

Although the asymptotic complexity of Fulkerson's method [8] is the same as that

of the algorithm of Chen and Sarrafzadeh [3], we expect Fulkerson's algorithm to be

faster for the following reasons:


1. Fulkerson's algorithm can reduce the delay of gates by more than 1 on each

iteration. For example, the circuit of Figure 4.18 is handled with just one

iteration.











2. Successive iterations of Fulkerson's algorithm use the results of preceding it-

erations; each iteration requires the computation of new augmenting paths.

Successive iterations of the algorithm of Chen and Sarrafzadeh [3] essentially
start from scratch, recomputing Gs, Gt and the maximum weighted indepen-
dent set.


4.6 The General Gate Resizing Problem (GGR)

Chen and Sarrafzadeh [3] have proposed a heuristic for the general gate resizing
problem. We show how our methodology of the previous section can be extended
to obtain a heuristic for the GGR problem with convex (delay, power consumption)
pairs. Let the (delay, power consumption) pairs for a gate be (d1, p1), (d2, p2), ..., (dk, pk)
with d1 < d2 < ... < dk and p1 > p2 > ... > pk. (d1, p1) is the pair for the initially
selected gate size. The pairs are convex if and only if c1 > c2 > ... > ck-1, where
ci = (pi - pi+1)/(di+1 - di). In practice, we expect most GGR instances to be convex.
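The convexity condition is mechanical to check. A small sketch (the list-of-pairs input layout is our assumption):

```python
def is_convex(pairs):
    """Check convexity of (delay, power) pairs for one gate.

    pairs is [(d1, p1), ..., (dk, pk)] with d1 < d2 < ... and
    p1 > p2 > ... (as assumed in the text). The pairs are convex
    iff c_i = (p_i - p_{i+1}) / (d_{i+1} - d_i) is strictly
    decreasing in i.
    """
    c = [(p1 - p2) / (d2 - d1)
         for (d1, p1), (d2, p2) in zip(pairs, pairs[1:])]
    return all(x > y for x, y in zip(c, c[1:]))
```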

To solve GGR with convex pairs, we construct a PERT network as before. How-
ever, each edge (v-, v+) is now a chain as shown in Figure 4.21, where δi = di+1 - di.
The k-1 chain edges carry the triples (d1, d1 + δ1, c1), (0, δ2, c2), ..., (0, δk-1, ck-1).

Figure 4.21. Transformation of vertex v into a chain in PERT network


Once the network has been solved using the algorithm of Fulkerson [8], the
flows in the chains are adjusted to obtain a feasible solution to the GGR problem.
Since c1 > c2 > ... > ck-1, we may assume that, in the optimal solution, the delay
of edge (vi, vi+1) is made δi before the delay of edge (vi+1, vi+2) is increased above
0. For each chain, find the rightmost edge (vi, vi+1) whose delay exceeds 0. If there
is no such edge, select edge (v1, v2). If the delay of the edge is less than δi, set the
delay of the gate v to di; otherwise set it to di+1.
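The rounding rule above can be sketched as follows. Representing each chain edge by the extra delay it carries beyond its minimum (so edge i carries at most δi) is our simplification; the thesis only states the rule in prose:

```python
def round_chain(deltas, chain_delays, d):
    """Map solved chain delays for one gate back to a library delay.

    deltas       : [delta_1, ..., delta_{k-1}], delta_i = d_{i+1} - d_i
    chain_delays : extra delay on each chain edge beyond its minimum
    d            : library delays [d_1, ..., d_k]
    Rule: take the rightmost chain edge with positive delay (default:
    the first edge); if its delay is below delta_i choose d_i,
    otherwise choose d_{i+1}.
    """
    i = 0
    for j, w in enumerate(chain_delays):
        if w > 0:
            i = j                     # rightmost edge with positive delay
    if chain_delays[i] < deltas[i]:
        return d[i]
    return d[i + 1]
```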

4.7 Experimental Results

We implemented our CGR and GGR algorithms as well as those of Chen and
Sarrafzadeh [3] in C and benchmarked them on a SUN SPARCstation 5. All algo-
rithms were coded using similar programming methodologies so that any observed
performance differences can be attributed to algorithmic differences rather than to
differences in programming style.

The test circuits we used include combinational circuits from the MCNC-91 bench-
mark suite. The library we used includes NAND, NOR, and INVERTER gates. Each
gate has a minimum delay of one clock cycle. Technology mapping was done using
Berkeley SIS, and power consumption was calculated using a 5V supply voltage and
a 20MHz clock frequency. The switching activity factors of individual gates were cal-
culated using the symbolic simulation technique described in Ghosh et al. [10], which
is implemented in Berkeley SIS.

Tables 4.1 and 4.2 give our experimental results for the CGR algorithms. The
number of gates in each circuit is given by n; ta is the length of the critical path in the
circuit (i.e., the length of the longest path from a primary input to a primary output when
gate delays equal the initially assigned delays). The run times are given in seconds.
In the experiments reported in Table 4.1, the required time for the output signal was
set equal to the critical path length, ta, of the circuit.

Since both CGR algorithms produce provably optimal solutions, the only
difference between them is run time. On 6 of the 9 tested circuits, our algorithm
is noticeably faster; on another 2, the two algorithms took about the same time;
and on only 1 of the 9 circuits did the algorithm of [3] outperform our algorithm. The
disparity between the two algorithms becomes more striking when the required time
for the output signal is increased beyond the critical path length ta. The run times
for the case when the required time is 2ta are shown in Table 4.2. Now, our algorithm
provided a speedup between 4 and 11 over the algorithm of [3]! Table 4.3 shows the
relative performance of the two algorithms as we increase the required time tr for one
of the test circuits, squar5. The relative speedup provided by our algorithm increases
from a low of 0.78 (tr = ta; Table 4.1) to a high of 42 when tr = 100 = 12.5ta.

Tables 4.4 to 4.9 show the results for the two GGR algorithms using convex pairs.
Our library includes five implementations (i.e., (delay, power consumption) pairs) for
each gate type (NAND, NOR, and INVERTER). Since both GGR algorithms
are heuristics, we compare the power reductions obtained by each rather than their
run times. The first group of tables (Tables 4.4 to 4.6) shows the results for the case
tr = ta; the second group (Tables 4.7 to 4.9) is for the case tr = 2ta.












Table 4.1. Run time and speedup when required time is equal to critical path length

circuit  n    ta  [3]   Our   speedup
5xp1     158  11  0.39  0.10  3.99
b12      147  13  0.34  0.09  3.91
clip     278  15  1.28  0.25  5.07
rd73     219  17  1.74  0.21  8.37
sao2     225  18  1.13  0.21  5.47
sct      170  8   0.10  0.09  1.10
squar5   99   8   0.03  0.04  0.78
t481     59   10  0.01  0.01  1.00
ttt2     340  11  4.94  0.47  10.62


Table 4.2. Run time and speedup when required time is doubled

circuit  n    ta  Chen  Our   speedup
5xp1     158  11  1.04  0.14  7.22
b12      147  13  0.55  0.12  4.46
clip     278  15  4.82  0.43  11.16
rd73     219  17  1.55  0.29  5.29
sao2     225  18  1.61  0.30  5.40
sct      170  8   1.29  0.14  9.47
squar5   99   8   0.22  0.05  4.29
t481     59   10  0.16  0.01  13.25
ttt2     340  11  7.17  0.67  10.75


Table 4.3. Run time for squar5 with different required times

tr   Chen   Our    speedup
10   0.076  0.044  1.72
20   0.315  0.052  6.06
30   0.555  0.054  10.28
40   0.795  0.052  15.29
50   1.033  0.055  18.78
100  2.235  0.053  42.17












Within the same group of tables, the circuits differ in the selection of initial (delay,
power consumption) pairs for the gates. As can be seen, when tr = ta, our algorithm
obtained a larger power reduction in 18 of 27 tests; when tr = 2ta, our algorithm
obtained a larger power reduction in all 27 cases.

Our GGR heuristic took between 0.01 seconds and 5.60 seconds for the test cases.

This is considerably more time than required by the heuristic of Chen and Sarrafzadeh

[3], which took less than 0.30 seconds for each test. However, the run time of our

heuristic is reasonable and our heuristic generally produces better solutions than

those produced by the heuristic of Chen and Sarrafzadeh [3].

Table 4.4. Power reduction of GGR algorithms (1)

circuit  Chen    Our     Diff   % imp
5xp1     47416   51386   3970   8.37
b12      56628   63191   6563   11.59
clip     70912   77252   6340   8.94
rd73     76374   88223   11849  15.51
sao2     88922   96865   7943   8.93
sct      54756   53809   -947   -1.73
squar5   32603   34481   1878   5.76
t481     3428    3466    38     1.11
ttt2     139626  151590  11964  8.57


4.8 Conclusion

We have developed polynomial time algorithms for the CGR, CUGR, and Convex-

CGR problems for series-parallel and tree circuits. The CGR problem with multigate

modules was shown to be NP-hard. We presented a unified framework for the solution
















Table 4.5. Power reduction of GGR algorithms (2)

circuit  Chen    Our     Diff   % imp
5xp1     102586  112719  10133  9.88
b12      101174  108306  7132   7.05
clip     171917  194559  22642  13.17
rd73     147608  154986  7378   5.00
sao2     132655  137355  4700   3.54
sct      98939   111848  12909  13.05
squar5   68189   73090   4901   7.19
t481     25345   27461   2116   8.35
ttt2     231072  247243  16171  7.00


Table 4.6. Power reduction of GGR algorithms (3)

circuit  Chen    Our     Diff   % imp
5xp1     41448   46050   4602   11.10
b12      42716   45812   3096   7.25
clip     62986   66819   3833   6.09
rd73     63160   71430   8270   13.09
sao2     74291   80777   6486   8.73
sct      46212   44805   -1407  -3.04
squar5   26192   28172   1980   7.56
t481     3222    2652    -570   -17.69
ttt2     112569  119937  7368   6.55
















Table 4.7. Power reduction of GGR algorithms (4)

circuit  Chen    Our     Diff   % imp
5xp1     90859   100867  10008  11.01
b12      80015   84321   4306   5.38
clip     157222  171358  14136  8.99
rd73     120554  126382  5828   4.83
sao2     109121  111934  2813   2.58
sct      79458   86699   7241   9.11
squar5   58536   61768   3232   5.52
t481     18361   20373   2012   10.96
ttt2     177478  190093  12615  7.11


Table 4.8. Power reduction of GGR algorithms (5)

circuit  Chen   Our    Diff   % imp
5xp1     29651  26047  -3604  -12.15
b12      31532  31771  239    0.76
clip     40721  39796  -925   -2.27
rd73     45337  45015  -322   -0.71
sao2     53423  58414  4991   9.34
sct      26175  22871  -3304  -12.62
squar5   13123  13186  63     0.48
t481     1843   1622   -221   -11.99
ttt2     72498  70742  -1756  -2.42












Table 4.9. Power reduction of GGR algorithms (6)

circuit  Chen    Our     Diff   % imp
5xp1     53356   67287   13931  26.11
b12      57504   65071   7567   13.16
clip     101648  117108  15460  15.21
rd73     91287   99225   7938   8.70
sao2     91862   97392   5530   6.02
sct      56641   62326   5685   10.04
squar5   38067   41557   3490   9.17
t481     9782    14015   4233   43.27
ttt2     128515  147654  19139  14.89


of CGR, CUGR and ConvexCGR problems on general circuits. This framework can

also be used to obtain a heuristic for the ConvexGGR problem. Experimental results

obtained by us indicate that our CGR algorithm is faster than the CGR algorithm of

Chen and Sarrafzadeh [3] and that our GGR heuristic often obtains better solutions

than those obtained by the GGR heuristic of Chen and Sarrafzadeh [3].













CHAPTER 5
CONCLUSIONS AND FUTURE WORK


We have considered some problems that arise in the automation of various stages

of the VLSI physical design process. The first problem we considered is transistor

folding to reduce layout area. An algorithm was developed to minimize the layout

area. This algorithm outperforms the existing one both asymptotically and experi-

mentally.

We considered the problem of selecting the implementations of two rows of mod-

ules on a routing channel so as to satisfy the net-span constraints as well as minimize

the channel density. An algorithm was developed by applying the limited branching

method. Experimental results indicate a significant reduction in run time over the

existing algorithm.

Another problem we considered is low power gate resizing. We increase the area

of gates to reduce the power consumption while satisfying the time constraint for the

circuit. Fast algorithms were developed for series-parallel and tree circuits, and a variant
of the problem with multigate modules was proved to be NP-hard. We also developed

a unified framework for the solution of CGR, CUGR and ConvexCGR problems on

general circuits. We used this framework to obtain a heuristic for the ConvexGGR

problem. Experimental results indicate a significant reduction in run time for our










CGR algorithm over the existing algorithm. Our ConvexGGR heuristic often obtains

better solutions than those obtained by the heuristic of Chen and Sarrafzadeh [3].

Future research on these problems could include the development of better algo-

rithms for the multichannel 2-PDMIS problem (especially for c > 2); the develop-

ment of better heuristics for the general single and multichannel PDMIS problems;

and faster, improved heuristics for the GGR problem.













REFERENCES


[1] R. Iris Bahar, Gary D. Hachtel, Enrico Macii, and Fabio Somenzi. A symbolic
method to reduce power consumption of circuits containing false paths. In IEEE
International Conference on Computer-Aided Design, pages 368-371, San Jose,
California, November 1994.

[2] Yang Cai and D. F. Wong. On shifting blocks and terminals to minimize channel
density. IEEE Transactions on Computer-Aided Design of Integrated Circuits
and Systems, 13(2):178-186, February 1994.

[3] De-Sheng Chen and Majid Sarrafzadeh. An exact algorithm for low power
library-specific gate re-sizing. In Proceedings of the 33rd Design Automation Con-
ference, pages 783-788, Las Vegas, Nevada, 1996.

[4] Z. Dai and K. Asada. MOSIZ: A two-step transistor sizing algorithm based on
optimal timing assignment method for multi-stage complex gates. In Proceedings
1989 Custom Integrated Circuits Conference, pages 17.3.1-17.3.4, San Diego,
California, May 1989.

[5] Salah E. Elmaghraby. Activity Networks: Project Planning and Control by Net-
work Models. John Wiley and Sons, New York, 1977.

[6] S. Even, A. Itai, and A. Shamir. On the complexity of timetable and multicom-
modity flow problems. SIAM Journal on Computing, 5(4):691-703, December
1976.

[7] J. P. Fishburn and A. E. Dunlop. TILOS: A posynomial programming approach
to transistor sizing. In IEEE International Conference on Computer-Aided De-
sign, pages 326-328, Santa Clara, California, November 1985.

[8] L. R. Ford and D. R. Fulkerson. Flows in Networks. Princeton University Press,
Princeton, New Jersey, 1962.

[9] Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide
to the Theory of NP-Completeness. W. H. Freeman and Company, San Fran-
cisco, 1979.

[10] A. Ghosh, S. Devadas, K. Keutzer, and J. White. Estimation of average switching
activity in combinational and sequential circuits. In Proceedings of the 29th
Design Automation Conference, pages 253-259, Anaheim, California, 1992.











[11] T. W. Her, Ting-Chi Wang, and D. F. Wong. Performance-driven channel pin
assignment algorithms. IEEE Transactions on Computer-Aided Design of Inte-
grated Circuits and Systems, 14(7):849-857, July 1995.
[12] T. W. Her and D. F. Wong. Cell area minimization by transistor folding. In
Proceedings 1993 Euro-DAC, pages 172-177, Hamburg, Germany, 1993.
[13] T. W. Her and D. F. Wong. On over-the-cell channel routing with cell orienta-
tions consideration. IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems, 14(6):766-772, June 1995.
[14] T. W. Her and D. F. Wong. Module implementation selection and its applica-
tion to transistor placement. IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems, 16(6):645-651, June 1997.
[15] Ellis Horowitz, Sartaj Sahni, and Dinesh Mehta. Fundamentals of Data Struc-
tures in C++. W. H. Freeman and Company, New York, 1995.

[16] C. Y. Hou and C. Y. Chen. A pin permutation algorithm for improving over-the-
cell channel routing. In Proceedings of the 29th Design Automation Conference,
pages 594-599, Anaheim, California, 1992.
[17] Y. C. Hsieh, C. Y. Hwang, Y. L. Lin, and Y. C. Hsu. LiB: A CMOS cell compiler.
IEEE Transactions on Computer-Aided Design, 10(8), 1991.
[18] Jaewon Kim and S. M. Kang. An efficient transistor folding algorithm for row-
based CMOS layout design. In Proceedings of the 34th Design Automation Con-
ference, pages 456-459, Anaheim, California, 1997.
[19] H. Kobayashi and C. E. Drozd. Efficient algorithms for routing interchangeable
terminals. IEEE Transactions on Computer-Aided Design of Integrated Circuits
and Systems, CAD-4(3):204-207, 1985.
[20] Wing-Ning Li, Andrew Lim, Prathima Agrawal, and Sartaj Sahni. On the cir-
cuit implementation problem. IEEE Transactions on Computer-Aided Design
of Integrated Circuits and Systems, 12(8):1147-1156, August 1993.
[21] Li-Shin Lin and Sartaj Sahni. Maximum alignment of interchangeable terminals.
IEEE Transactions on Computers, 37(10):1166-1177, October 1988.
[22] C. Papadimitriou and K. Steiglitz. Combinatorial Optimization: Algorithms
and Complexity. Prentice-Hall, Englewood Cliffs, New Jersey, 1982.
[23] I. Rival, editor. Graphs and Orders: the Role of Graphs in the Theory of Ordered
Sets and its Applications. D. Reidel Publishing Company, Dordrecht, Holland,
May 1984.











[24] Sartaj Sahni. Data Structures, Algorithms, and Applications in C++. McGraw
Hill, Boston, Massachusetts, 1998.

[25] Sartaj Sahni and San-Yuan Wu. Two NP-hard interchangeable terminal prob-
lems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and
Systems, 7(4):467-472, April 1988.

[26] Sachin S. Sapatnekar, Vasant B. Rao, Pravin M. Vaidya, and Sung-Mo Kang. An
exact solution to the transistor sizing problem for CMOS circuits using convex
optimization. IEEE Transaction on Computer-Aided Design, 12(11):1621-1634,
November 1993.

[27] Naveed Sherwani. Algorithms for VLSI Physical Design Automation. Kluwer
Academic Publishers, Norwell, Massachusetts, 2nd edition, 1995.

[28] J. Shyu, J. P. Fishburn, A. E. Dunlop, and A. L. Sangiovanni-Vincentelli.
Optimization-based transistor sizing. IEEE Journal of Solid-State Circuits,
23(2):400-409, April 1988.

[29] A. Stauffer and R. Nair. Optimal CMOS cell transistor placement: A relaxation
approach. In IEEE International Conference on Computer-Aided Design, pages
364-367, Santa Clara, California, November 1988.

[30] Venkat Thanvantri. Efficient Algorithms for Electronic CAD. PhD dissertation,
University of Florida, Gainesville, Florida, 1995.

[31] Spyros Tragoudas and Ioannis G. Tollis. River routing and density minimization
for channels with interchangeable terminals. Integration, the VLSI Journal,
15:151-178, 1993.

[32] T. Uehara and W. VanCleemput. Optimal layout of CMOS functional arrays.
IEEE Transactions on Computers, C-30(5), 1981.

[33] J. Valdes, R. Tarjan, and E. Lawler. The recognition of series and parallel
digraphs. SIAM Journal of Computing, 11(2):298-313, May 1982.

[34] S. Wimer, R.Y. Pinter, and J.A. Feldman. Optimal chaining of CMOS transis-
tors in a functional cell. IEEE Transactions on Computer-Aided Design, 30(5),
1987.

[35] Takeshi Yoshimura and Ernest S. Kuh. Efficient algorithms for channel rout-
ing. IEEE Transactions on Computer-Aided Design of Integrated Circuits and
Systems, CAD-1(1):25-35, January 1982.













BIOGRAPHICAL SKETCH


Yu Cheuk Cheng was born on January 23, 1971, in Hong Kong. He received

his Bachelor of Engineering degree in computer engineering from the University of

Hong Kong, Hong Kong, in 1992. He received his Master of Philosophy degree in

computer science from the Hong Kong University of Science and Technology, Hong

Kong, in 1994. He will receive his Doctor of Philosophy degree from the Department

of Computer and Information Science and Engineering, the University of Florida,

Gainesville, Florida, in December 1998. His research interests include VLSI CAD,
algorithm design, and theory of computation.











I certify that I have read this study and that in my opinion it conforms to
acceptable standards of scholarly presentation and is fully adequate, in scope and
quality, as a dissertation for the degree of Doctor of Philosophy.



Sartaj K. Sahni, Chairman
Professor of Computer and
Information Science and Engineering


I certify that I have read this study and that in my opinion it conforms to
acceptable standards of scholarly presentation and is fully adequate, in scope and
quality, as a dissertation for the degree of Doctor of Philosophy.



Timothy A. Davis
Associate Professor of Computer and
Information Science and Engineering


I certify that I have read this study and that in my opinion it conforms to
acceptable standards of scholarly presentation and is fully adequate, in scope and
quality, as a dissertation for the degree of Doctor of Philosophy.



Richard E. Newman
Assistant Professor of Computer and
Information Science and Engineering


I certify that I have read this study and that in my opinion it conforms to
acceptable standards of scholarly presentation and is fully adequate, in scope and
quality, as a dissertation for the degree of Doctor of Philosophy.



Sanguthevar Rajasekaran
Associate Professor of Computer and
Information Science and Engineering