EFFICIENT ALGORITHMS FOR VLSI CAD

By

YU CHEUK CHENG

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA 1998

ACKNOWLEDGMENTS

I would like to express my appreciation and gratitude to my advisor, Professor Sartaj Sahni, for his support and guidance of my research. I thank him for spending much of his valuable time giving me research ideas. Without his patience and encouragement, this research would not have been completed. I would also like to thank the other members of my supervisory committee, Dr. Tim Davis, Dr. Richard Newman, Dr. Sanguthevar Rajasekaran and Dr. Andrew Vince, for their interest and comments. I would like to express my appreciation to Dr. Steve Thebaut for his support and encouragement throughout my study. Many thanks go to my friends here at the university as well as those in Hong Kong and around the world. Special thanks go to my friend Desmond Kwan, who gave me constant support and encouragement through the final year of my PhD study. I would like to express my endless thanks to my parents and my brother Eddie for their love, support and energy throughout my lifelong educational endeavors. Special thanks go to my fiancee Isabella Fung for her love, encouragement and patience through all these years of study. To them I dedicate this work.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
1 INTRODUCTION
  1.1 Background
  1.2 Physical Design Automation
  1.3 Dissertation Outline
2 TRANSISTOR FOLDING
  2.1 Introduction
  2.2 Problem Formulation
  2.3 Our Algorithm
    2.3.1 Phase I
    2.3.2 Phase II
  2.4 Experimental Results
  2.5 Conclusion
3 PERFORMANCE DRIVEN MODULE IMPLEMENTATION SELECTION
  3.1 Introduction
  3.2 O(p log n) Algorithm of Her et al.
  3.3 Our O(p log n) Algorithm
    3.3.1 Stage 1
    3.3.2 Stage 2
    3.3.3 Implementation Details
    3.3.4 Time Complexity
  3.4 Multichannel 2PDMIS Problem
  3.5 Experimental Results
  3.6 Conclusion
4 GATE RESIZING TO REDUCE POWER CONSUMPTION
  4.1 Introduction
  4.2 Series-Parallel Circuits
    4.2.1 Definition
    4.2.2 Complete Library Gate Resizing (CGR)
    4.2.3 Complete Library with Upper Bounds (CUGR) and Convex CGR
    4.2.4 Time Complexity of the Convex GR Problem
  4.3 Tree Circuits
  4.4 CGR with Multigate Modules Is NP-Hard
  4.5 General Circuits
    4.5.1 The CGR Algorithm of Chen and Sarrafzadeh
    4.5.2 Comments on the Algorithm of Chen and Sarrafzadeh
    4.5.3 A Unified Framework for CGR, CUGR and Convex CGR
  4.6 The General Gate Resizing Problem (GGR)
  4.7 Experimental Results
  4.8 Conclusion
5 CONCLUSIONS AND FUTURE WORK
REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

2.1 Run time and speedup using a uniform distribution
2.2 Run time and speedup using a uniform distribution with larger limits
2.3 Run time and speedup using a normal distribution
3.1 Running time for benchmark channels
3.2 Running time for generated channels
4.1 Run time and speedup when the required time equals the critical path length
4.2 Run time and speedup when the required time is doubled
4.3 Run time for squar5 with different required times
4.4 Power reduction of GGR algorithms (1)
4.5 Power reduction of GGR algorithms (2)
4.6 Power reduction of GGR algorithms (3)
4.7 Power reduction of GGR algorithms (4)
4.8 Power reduction of GGR algorithms (5)
4.9 Power reduction of GGR algorithms (6)

LIST OF FIGURES

2.1 An example circuit with 4 pairs of transistors
2.2 The circuit of Figure 2.1 after folding with hp = 4 and hn = 3
2.3 Computing SP and SN
2.4 Computing the optimal hp and hn
2.5 Refined Phase 2 algorithm
3.1 An example PDMIS problem. (a) first implementation; (b) second implementation; (c) selections that satisfy the net span constraints; (d) selection with better density
3.2 Critical modules of net i
3.3 Function Assign
3.4 Partitioning a routing channel into regions
3.5 Function Satisfy
3.6 Function Search
3.7 Procedure Undo
3.8 The two regions to be searched recursively after the binary search
4.1 Digraph corresponding to a circuit
4.2 Circuit examples (source: Li et al. [20]). (a) chain; (b) simple parallel circuit; (c) series-parallel circuit; (d) non-series-parallel circuit
4.3 Transformation of series-parallel circuits. (a) chain; (b) simple parallel circuit
4.4 Transformation of a series-parallel circuit into a single gate
4.5 Computation of the new delay for each gate
4.6 Convex delay-power-consumption graph
4.7 Algorithm ParallelMerge
4.8 Worst-case merging of n gates
4.9 BBST used to represent the DP list {(3, 28), (5, 24), (3, 21)}
4.10 Update of D()s and C()s for internal nodes during tree rotations
4.11 Change of c values of internal nodes of L2 for the kth tuple of L1
4.12 Circuit C2 that exhibits the worst-case behavior
4.13 Transformation of a basic tree to a simple parallel circuit
4.14 A module v with two gates A and B
4.15 Variable subassembly for variable xi
4.16 Clause subassembly for (l1 v l2 v l3)
4.17 Application of the algorithm of Chen and Sarrafzadeh [3] to a CGR circuit. (a) An example CGR circuit; (b) sensitive graph
4.18 A simple example CGR circuit
4.19 Application of the algorithm of Chen and Sarrafzadeh [3] to a CUGR circuit. (a) An example CUGR circuit; (b) delay of each gate after the first iteration; (c) delay of each gate after the algorithm [3] terminates; (d) delay of each gate for optimal power reduction
4.20 A PERT network for the CGR circuit of Figure 4.17(a)
4.21 Transformation of vertex v into a chain in the PERT network

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

EFFICIENT ALGORITHMS FOR VLSI CAD

By Yu Cheuk Cheng
December 1998
Chairman: Dr. Sartaj Sahni
Major Department: Computer and Information Science and Engineering

In this dissertation, we develop efficient algorithms for three problems that arise in very large scale integrated computer-aided design (VLSI CAD): (1) transistor folding, (2) module implementation selection, and (3) gate resizing.

Transistor folding reduces the area of row-based designs that employ transistors of different sizes. Kim and Kang have developed an O(m^2 log m) algorithm to optimally fold m transistor pairs. In this dissertation we develop an O(m^2) algorithm for optimal transistor folding. Our experiments indicate that our algorithm runs 3 to 50 times as fast for m values in the range [100, 100000].

We develop an O(p log n) time algorithm that obtains optimal solutions to the p-pin, n-net, single-channel performance-driven module implementation selection problem in which each module has at most two possible implementations (2PDMIS). Although Her, Wang and Wong have also developed an O(p log n) algorithm for this problem, experiments indicate that our algorithm is twice as fast on small circuits and up to eleven times as fast on larger circuits. We also develop an O(pn^(c-1)) time algorithm for the c-channel, c > 1, version of the 2PDMIS problem.

We study the problem of resizing gates to reduce overall power consumption while satisfying a circuit's timing constraints. Polynomial time algorithms for series-parallel and tree circuits are obtained. Gate resizing with multigate modules is shown to be NP-hard. Algorithms that improve upon those developed by Chen and Sarrafzadeh for general circuits are also developed.
CHAPTER 1
INTRODUCTION

1.1 Background

The design and fabrication of VLSI chips has been made possible by the automation of several steps in the design process. The VLSI design process transforms a formal specification into a fully packaged chip. It consists of the following steps [30]:

1. System specification: A high-level representation of the system is created. Performance, functionality, physical dimensions, choice of design techniques and fabrication technology are considered in this step.

2. Functional design: The output of this step is a timing diagram, which is obtained by considering the behavioral aspects of the system.

3. Logic design: The logic design, in general, is represented by Boolean expressions. The logic design that represents the functional design is obtained in this step. The Boolean expressions are minimized to obtain the smallest logic design. Correctness of the logic design is also asserted in this step.

4. Circuit design: A circuit that represents the logic design of the system is developed in this step, taking into consideration speed and power requirements and the electrical behavior of the components used in the development of the circuit.

5. Physical design: This is the most time-consuming step in the VLSI design cycle. In this step, the components and the interconnections are represented by geometric patterns. The objective is to obtain an arrangement of these geometric patterns that minimizes the area and power and satisfies the timing requirements of the chip. Because of its high complexity, this step is broken down into smaller substeps. We look into this step in detail later in this chapter.

6. Design verification: Design rule checking and circuit extraction are done to verify that the circuit layout from the physical design step satisfies the system specification and design rules.

7. Fabrication: The verified layout is used in the fabrication process to produce the chip.

8. Packaging, testing and debugging: The fabricated chip is packaged and tested to ensure proper functioning.

1.2 Physical Design Automation

Given a circuit description, the physical design process transforms it into a geometric description, called a layout, for fabrication. As the complexity of the physical design process is extremely large, computer-aided design (CAD) is used in almost all of its phases. The physical design process is divided into five stages [27]:

1. Partitioning: A large circuit is decomposed into a collection of smaller blocks of subcircuits or modules, taking block sizes and the interconnections between blocks as factors. Partitioning can be hierarchical if the given circuit is very large.

2. Floorplanning and placement: In floorplanning, the logical components of each block are assigned an approximate location. In placement, blocks are positioned exactly on the chip so as to minimize the chip area and so that the interconnections between blocks can be completed.

3. Routing: The interconnections between blocks are completed as specified. In global routing, connections are completed between the proper blocks of the circuit, disregarding the exact geometric details of each wire and pin. In detailed routing, each connection is assigned a precise geometric position.

4. Compaction: The layout is compressed in all directions to reduce the area.

5. Extraction and verification: The final layout is verified for functionality by circuit extraction. Other specific requirements, such as performance and reliability, are also verified in the verification process.

1.3 Dissertation Outline

In this dissertation, we consider some of the problems that arise in the automation of various stages of the VLSI design process. In Chapter 2, we consider folding transistors to reduce the layout area of a row-based design. We develop an optimal algorithm to fold the transistors in a channel so as to minimize the layout area.
Our algorithm is both theoretically and practically faster than the algorithm proposed by Kim and Kang [18]. In Chapter 3, we consider a module implementation selection problem in which the objective is to minimize the density of a channel. We develop an optimal algorithm that selects module implementations along a channel so as to satisfy the net span constraint of each net and minimize the channel density, where each module has at most two possible implementations. The algorithm is experimentally compared with the one developed by Her et al. [11]. We also develop a polynomial-time algorithm for the multichannel version of the problem. In Chapter 4, we consider resizing gates to reduce power consumption. We develop fast optimal algorithms that resize the gates of series-parallel circuits and trees so as to minimize power consumption subject to a timing constraint. We also prove that gate resizing with multigate modules is NP-hard, and we develop fast algorithms to perform gate resizing on general circuits. Experimental results comparing our algorithms with that of Chen and Sarrafzadeh [3] are also presented. In Chapter 5, we present conclusions and some future directions for this research.

CHAPTER 2
TRANSISTOR FOLDING

2.1 Introduction

In high-performance circuit design, the transistor sizing problem has been investigated widely (for example, [26, 7, 28, 4]). The objective of transistor sizing is to reduce circuit delay by increasing the area of transistors. One byproduct of transistor sizing is the generation of layouts with transistors of widely varying size. In row-based layout synthesis ([17, 29, 32, 34]), pMOS and nMOS transistors are grouped together and placed in rows. Layout area is wasted in these designs because of nonuniform cell heights. The required layout area can be reduced by folding large transistors so that their height is reduced. Transistor folding to optimize layout area has been considered by Kim and Kang [18] and Her and Wong [12].
Her and Wong [12] have developed an O(m^6) dynamic programming algorithm for the general transistor folding problem. (If only s heights are possible for the folded transistors, the complexity of Her and Wong's algorithm is O(m^3 s^3). In general, s is O(m).) Kim and Kang [18] have developed a more practical algorithm for the case of row-based designs. The complexity of their algorithm is O(m^2 log m), or O(s(m + s) log m). They also show that the area of row-based designs can be reduced by as much as 30% by performing transistor folding. In this paper, we consider the row-based transistor-folding problem of reference [18] and develop an O(m^2), or O(s(m + s)), algorithm to minimize area. We also report on experiments conducted by us that show that our algorithm actually runs much faster than the algorithm of Kim and Kang [18]. The test circuits used in our experiments have between 100 and 100,000 transistor pairs, so our tests are similar to those conducted by Kim and Kang [18], whose circuits had from 192 to 88,258 transistor pairs.

2.2 Problem Formulation

We are given a CMOS circuit with a row of m transistor pairs. Each transistor pair consists of a pMOS transistor and its dual nMOS transistor. Let pi and ni, respectively, be the heights of the pMOS and nMOS transistors in the ith pair, 1 <= i <= m. pi and ni are integers that give transistor height in multiples of the minimum resolution lambda. Figure 2.1 shows a CMOS circuit with 4 pairs of transistors, with p2 = 10 and n2 = 12. If the folding height of the pMOS transistors is 4 and that of the nMOS transistors is 3, then the circuit layout is as in Figure 2.2. The second pMOS transistor is divided into three columns of height 4, 4 and 2, respectively, and the second nMOS transistor is divided into four columns of height 3 each. The area occupied by the folded transistor pair is shown by a shaded box in Figure 2.2.
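As a quick check of the folding example above: a transistor of height p folded at height h occupies ceil(p/h) columns, all of height h except possibly the last. A small sketch (the helper name is ours, not the paper's):

```python
from math import ceil

def fold_columns(height, h):
    """Column heights after folding a transistor of the given height
    at folding height h: ceil(height/h) columns, the last one possibly
    shorter than h."""
    cols = [h] * (height // h)
    if height % h:
        cols.append(height % h)
    return cols
```

For the second transistor pair of the example, fold_columns(10, 4) yields the three columns of height 4, 4 and 2, and fold_columns(12, 3) yields four columns of height 3, matching Figure 2.2.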
In practice, the height of the layout area is slightly larger than the sum of the pMOS and nMOS folding heights, and the layout width is slightly larger than the number of transistor columns, because of overheads.

Figure 2.1. An example circuit with 4 pairs of transistors

Figure 2.2. The circuit of Figure 2.1 after folding with hp = 4 and hn = 3

Let hp and hn be the folded heights of the pMOS and nMOS transistors, respectively. The width of the folded layout is sum_{i=1}^{m} max(ceil(pi/hp), ceil(ni/hn)) + ch and the height is hp + hn + cv, where cv and ch are, respectively, the vertical and horizontal overheads. The area of the folded layout is [18]

    A = (hp + hn + cv) * (sum_{i=1}^{m} max(ceil(pi/hp), ceil(ni/hn)) + ch)    (2.1)

In practice, there is a technological constraint on how small hp and hn can be: it is required [18] that hp >= PMIN and hn >= NMIN. Kim and Kang [18] give two algorithms to determine hp and hn so that the layout area is minimized. The first is an exhaustive search algorithm that simply tries out all integer choices of hp and hn such that PMIN <= hp <= max(P) and NMIN <= hn <= max(N), where P = {p1, p2, ..., pm} and N = {n1, n2, ..., nm}. The complexity of the exhaustive search algorithm is O(max(P) * max(N) * m) = O(m^3), because max(P) and max(N) are O(m) for practical circuits [18]. The second algorithm [18] works in two phases. In the first phase, the algorithm constructs a subset SP of [PMIN, max(P)] and another subset SN of [NMIN, max(N)] with the property that the optimal hp is in SP and the optimal hn is in SN. The basic observation used to arrive at SP and SN is that if the heights h and h + k divide a transistor into the same number of columns, then h is preferred over h + k (for example, if pi = 14, then the folding heights 7, 8, 9, 10, 11, 12 and 13 all fold the transistor into two columns; 7 is preferred over the remaining choices). In the second phase, the optimal combination (hp, hn) is determined from SP and SN.
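Both algorithms of Kim and Kang, as described above, are easy to prototype. The sketch below is our own illustration (helper names are ours; unit overheads cv = ch = 1 by default): it implements the area formula of Equation 2.1, the exhaustive search, and the Phase I observation that only heights of the form ceil(s/j), the smallest height realizing each column count, need to be tried.

```python
from math import ceil

def folded_area(P, N, hp, hn, cv=1, ch=1):
    """Layout area for folding heights (hp, hn), per Equation 2.1."""
    width = sum(max(ceil(p / hp), ceil(n / hn)) for p, n in zip(P, N)) + ch
    return (hp + hn + cv) * width

def candidates(sizes, hmin):
    """Phase I idea: keep only heights of the form ceil(s/j) -- for each
    transistor size s, the smallest height yielding each column count."""
    keep = set()
    for s in sizes:
        for j in range(1, s + 1):
            h = ceil(s / j)
            if h >= hmin:
                keep.add(h)
    return sorted(keep)

def best_folding(P, N, pmin=1, nmin=1, cv=1, ch=1, exhaustive=False):
    """Minimize Equation 2.1 over (hp, hn). With exhaustive=True, try
    every integer pair; otherwise, only the Phase I candidate heights."""
    hps = range(pmin, max(P) + 1) if exhaustive else candidates(P, pmin)
    hns = range(nmin, max(N) + 1) if exhaustive else candidates(N, nmin)
    return min((folded_area(P, N, hp, hn, cv, ch), hp, hn)
               for hp in hps for hn in hns)
```

Both settings return the same minimum area, since for any height there is a candidate height no larger that yields the same column counts for every transistor, hence no larger an area.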
The complexity of the second phase is O(s(m + s) log m) = O(m^2 log m), where s = |SP| + |SN|, and that of the first phase is Theta(sum_{i=1}^{m}(pi + ni)) = O(m^2) (assuming max(P) and max(N) are O(m)).

2.3 Our Algorithm

2.3.1 Phase I

Our algorithm is also a two-phase algorithm. The first phase of our algorithm is identical to the first phase of Kim and Kang's algorithm [18]. We compute the subsets SP and SN using the code of Figure 2.3. The arrays SPL and SNL are initialized to zero in the first two for loops. Then we determine the members of SP and SN; we set SPL[i] = 1 if and only if i is in SP, and SNL[i] = 1 if and only if i is in SN. Finally, SP and SN are computed in compact form from SPL and SNL, respectively. Note that we can compute SP and SN in either ascending or descending order easily by controlling the direction of traversal of the SPL and SNL arrays in the last two for loops. The algorithm presented in Figure 2.3 computes SP in ascending order and SN in descending order.

Algorithm Phase I(P, N, PMIN, NMIN)
  /* Compute SP and SN */
  /* Initialize SPL[] and SNL[] */
  for i = PMIN to max(P) do SPL[i] <- 0
  for i = NMIN to max(N) do SNL[i] <- 0
  /* Mark SPL[] and SNL[] */
  for i = 1 to m do
    for j = 1 to pi do SPL[ceil(pi/j)] <- 1
    for j = 1 to ni do SNL[ceil(ni/j)] <- 1
  end for
  /* Collect the marked items from SPL[] and SNL[] into SP and SN */
  SPsize <- 0; SNsize <- 0
  for i = PMIN to max(P) do
    if SPL[i] = 1 then SP[SPsize++] <- i
  for i = max(N) downto NMIN do
    if SNL[i] = 1 then SN[SNsize++] <- i

Figure 2.3. Computing SP and SN

2.3.2 Phase II

Assume that the transistor pairs have been reordered so that p1/n1 <= p2/n2 <= ... <= pm/nm; also assume that p0/n0 = 0 and p_{m+1}/n_{m+1} = infinity.
The formula, Equation 2.1, for the layout area can then be rewritten as

    A = (hp + hn + cv) * (sum_{i=1}^{k-1} ceil(ni/hn) + sum_{i=k}^{m} ceil(pi/hp) + ch)    (2.2)

where k in [1, m + 1] is such that

    p_{k-1}/n_{k-1} < hp/hn <= pk/nk    (2.3)

Letting LN(hn, k) = sum_{i=1}^{k-1} ceil(ni/hn) and LP(hp, k) = sum_{i=k}^{m} ceil(pi/hp), we can rewrite Equation 2.2 as

    A = (hp + hn + cv) * (LN(hn, k) + LP(hp, k) + ch)    (2.4)

From Equation 2.3 and the ordering of the transistor pairs by pi/ni, it follows that if hn is held fixed and hp is increased, the value of k cannot decrease. This observation results in the algorithm of Figure 2.4. Since SP and SN can be computed in ascending and descending order, respectively, by Algorithm Phase I of Figure 2.3, no sorting is needed to enumerate the members of SP and SN in the specified order. We can sort the transistor pairs into increasing (actually nondecreasing) order of pi/ni in O(m log m) time, and the arrays LN and LP can be computed in Theta(m|SN|) and Theta(m|SP|) time, respectively.

Algorithm Phase II(P, N, SP, SN, ch, cv)
  /* SP is in ascending order and SN is in descending order */
  Sort P and N into increasing P[i]/N[i] ratio
  Compute LN[][] and LP[][]
  for each hn in SN do
    k <- 1
    for each hp in SP do
      while P[k]/N[k] < hp/hn do k <- k + 1
      A <- min(A, (hp + hn + cv) * (LN[hn][k] + LP[hp][k] + ch))
    end for
  end for

Figure 2.4. Computing the optimal hp and hn

Each iteration of the outer for loop takes O(|SP| + m) time. Therefore, the time needed for all |SN| iterations is O(|SN|(|SP| + m)). We can change this complexity to O(min{|SN|, |SP|} * (max{|SN|, |SP|} + m)) by interchanging the inner and outer for loop headers. Further improvement in run time is possible. Consider the algorithm of Figure 2.4. Let ki be the k value that satisfies Equation 2.3 when we use the first (i.e., largest) hn value hn1 and the ith (i.e., ith smallest) hp value hpi. On the next iteration of the outer for loop, hn2 <= hn1, so hpi/hn2 >= hpi/hn1, and the k value that satisfies Equation 2.3 is at least ki. Hence, if we save ki from the first iteration, we can start the search for the new k value at ki.
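The basic Phase II scan of Figure 2.4 can be sketched in Python. This simplified version (names are ours) recomputes the two ceiling sums directly instead of using precomputed LN/LP tables, so each (hp, hn) evaluation costs O(m) rather than O(1), but it shows the key point: with the pairs sorted by p/n ratio, the split index k moves only rightward as hp grows for a fixed hn.

```python
from math import ceil, inf

def phase2_area(P, N, SP, SN, ch=1, cv=1):
    """Minimum of Equation 2.4 over every pair (hp, hn) in SP x SN.
    pairs[:k] are counted by nMOS columns, pairs[k:] by pMOS columns,
    where k satisfies Equation 2.3 for the current (hp, hn)."""
    pairs = sorted(zip(P, N), key=lambda t: t[0] / t[1])
    m = len(pairs)
    best = inf
    for hn in sorted(SN, reverse=True):   # SN in descending order
        k = 0
        for hp in sorted(SP):             # SP in ascending order
            # advance the split index monotonically (Equation 2.3)
            while k < m and pairs[k][0] / pairs[k][1] < hp / hn:
                k += 1
            LN = sum(ceil(n / hn) for _, n in pairs[:k])
            LP = sum(ceil(p / hp) for p, _ in pairs[k:])
            best = min(best, (hp + hn + cv) * (LN + LP + ch))
    return best
```

The split is exact: pi/ni < hp/hn implies ceil(pi/hp) <= ceil(ni/hn), so the per-pair max of Equation 2.1 reduces to the two partial sums.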
The observation that k cannot decrease leads to the refinement shown in Figure 2.5. Although its worst-case complexity is the same as that of Figure 2.4, it is expected to run faster in practice.

Algorithm Refined Phase II(P, N, SP, SN, ch, cv)
  /* SP is in ascending order and SN is in descending order */
  Sort P and N into increasing P[i]/N[i] ratio
  Compute LN[][] and LP[][]
  Initialize K[hp] to 0 for all hp in SP
  if |SN| <= |SP| then
    for each hn in SN do
      k <- 1
      for each hp in SP do
        k <- max(k, K[hp])
        while P[k]/N[k] < hp/hn do k <- k + 1
        K[hp] <- k
        A <- min(A, (hp + hn + cv) * (LN[hn][k] + LP[hp][k] + ch))
      end for
    end for
  else
    /* same as the "if" branch, but interchange the inner and outer for loop headers and replace K[hp] by K[hn] */
  end if

Figure 2.5. Refined Phase 2 algorithm

2.4 Experimental Results

The phase 1 algorithm of Figure 2.3, the phase 2 algorithm of Figure 2.5, and the two algorithms of Kim and Kang [18] were implemented by us in C and run on a SUN SPARCstation 4. Similar programming methodologies were used to develop the codes for our algorithm and that of Kim and Kang [18]. As a result, we expect that almost all of the performance difference exhibited in our experiments is due to algorithmic rather than programming differences. Since we were unable to obtain the test data used by Kim and Kang [18], we generated random data. We ignore any possible correlation between pMOS and nMOS transistors. For our test data, the number of transistor pairs ranged from 100 to 100,000. This covers the range in transistor numbers (192 to 88,258) in the circuits of Kim and Kang [18]. For our first test set, the sizes of the pMOS and nMOS transistors were generated using a uniform random number generator with range [30, 90] for pMOS and [20, 60] for nMOS. These size ranges correspond to those for the circuit fract that was used by Kim and Kang [18]; fract has 598 transistors. Since all three algorithms generate optimal solutions, run time is the only comparative factor. This time is provided in Table 2.1. The exhaustive search algorithm was not run for m > 10,000 as its run time becomes prohibitive.

Table 2.1. Run time and speedup using a uniform distribution (times in seconds)

  m       Exhaustive  Phase 1  K&K Phase 2  Our Phase 2  Phase 2 speedup  Overall speedup
  100     1.46        0.03     0.30         0.03         10.00            5.55
  300     4.41        0.08     0.60         0.09         6.67             4.00
  500     7.34        0.14     0.89         0.14         6.36             3.62
  600     8.79        0.16     1.05         0.17         6.18             3.63
  1000    14.67       0.28     1.69         0.27         6.26             3.56
  5000    74.59       1.38     11.43        1.45         7.88             4.53
  10000   149.12      2.75     30.71        3.01         10.20            5.81
  50000   -           13.64    458.24       17.51        26.17            15.15
  100000  -           27.24    1716.02      35.29        48.63            27.88

In the case of the algorithm proposed by Kim and Kang [18], the phase 2 time is significantly larger than the phase 1 time. Our phase 2 algorithm has brought this time down to approximately the phase 1 time. For small circuits (m <= 10,000), our phase 2 algorithm is 6 to 10 times as fast as the phase 2 algorithm of Kim and Kang [18] and provides an overall speedup of 3.5 to 5.8 for the entire area minimization process (phase 1 plus phase 2). On larger circuits, the speedup is more dramatic. For instance, when m = 100,000, our phase 2 algorithm is almost 50 times as fast as that of Kim and Kang [18] and provides an overall speedup of almost 28. We experimented with two other data sets. Table 2.2 reports the run times for circuits in which the range of the uniform random number generator was set to [30, 180] for pMOS transistor sizes and [20, 120] for nMOS sizes, and Table 2.3 gives the run times when the transistor sizes are drawn from a normal distribution with mean 40 and standard deviation 10 for pMOS transistors and mean 30 and standard deviation 10 for nMOS transistors. The overall speedups range from a low of 3.95 to a high of 48.02.

Table 2.2. Run time and speedup using a uniform distribution with larger limits (times in seconds)

  m       Exhaustive  Phase 1  K&K Phase 2  Our Phase 2  Phase 2 speedup  Overall speedup
  100     6.35        0.05     0.97         0.06         16.17            9.67
  300     19.87       0.16     2.18         0.20         10.90            6.58
  500     33.15       0.27     2.94         0.33         8.91             5.39
  600     39.77       0.32     3.38         0.40         8.45             5.16
  1000    66.31       0.53     4.73         0.65         7.28             4.47
  5000    336.92      2.60     21.82        3.31         6.59             4.13
  10000   673.43      5.23     49.25        6.89         7.15             4.50
  50000   -           26.09    485.10       38.87        12.48            7.87
  100000  -           52.12    3710.35      85.40        43.45            27.36

Table 2.3. Run time and speedup using a normal distribution (times in seconds)

  m       Exhaustive  Phase 1  K&K Phase 2  Our Phase 2  Phase 2 speedup  Overall speedup
  100     0.90        0.02     0.20         0.02         10.00            5.30
  300     3.22        0.07     0.47         0.06         7.83             4.24
  500     5.79        0.10     0.75         0.11         6.82             3.98
  600     6.70        0.12     0.88         0.13         6.77             3.95
  1000    12.69       0.20     1.48         0.23         6.43             3.96
  5000    68.26       1.03     12.92        1.38         9.36             5.79
  10000   129.67      1.99     36.91        2.80         13.18            8.12
  50000   -           10.05    679.50       18.05        37.65            24.54
  100000  -           20.04    2676.08      36.10        74.13            48.02

2.5 Conclusion

We have developed a transistor folding algorithm that is both theoretically and practically faster than the algorithm proposed by Kim and Kang [18]. Our algorithm is also simpler to code. Experiments suggest that our algorithm runs 3 to 50 times as fast as the algorithm of Kim and Kang [18] on circuits with 100 to 100,000 transistor pairs. These circuit sizes are comparable to theirs.

CHAPTER 3
PERFORMANCE DRIVEN MODULE IMPLEMENTATION SELECTION

3.1 Introduction

In the channel routing problem, we have a routing channel with modules on the top and bottom of the channel; the modules have pins, and subsets of pins define nets. The objective is to route the nets while minimizing the channel height. Several algorithms have been proposed for channel routing [35].
When the modules on either side of the channel are programmable logic arrays, we have the flexibility of reordering the pins in each module; any pin permutation may be used. The ability to reorder module pins adds a new dimension to the routing problem. Channel routing with rearrangeable pins was studied by Kobayashi and Drozd [19]. They proposed a three-step algorithm: (1) permute the pins so as to maximize the number of aligned pin pairs (a pair of pins on different sides of the channel is aligned iff they occupy the same horizontal location and are pins of the same net); (2) permute the nonaligned pins so as to remove cyclic constraints; and (3) while maintaining an acyclic vertical constraint graph, permute the unaligned pins so as to minimize channel density. Lin and Sahni [21] developed a linear time algorithm for step (1), and Sahni and Wu [25] showed that steps (2) and (3) are NP-hard. Tragoudas and Tollis [31] present a linear time algorithm to determine whether there is a pin permutation for which a channel is river routable. They also showed that the problem of determining a pin permutation that results in minimum density is NP-hard in general, and they developed polynomial time algorithms for the special cases of channels with two-terminal nets and channels with at most one terminal of each net in each module. Variants of the channel routing with permutable pins problem have also been studied [14, 2, 16, 13]. In these variants, restrictions are placed on the allowable pin permutations for each module. Restrictions may arise, for example, because the module library contains only a limited set of implementations of each module [14]. Another variant, considered by Cai and Wong [2], permits the shifting of modules and pins to minimize channel density. Extensions to the case when over-the-cell routing is permitted are considered in [16, 13].
The variant of the channel routing with permutable pins problem that we consider in this paper is the performance-driven module implementation selection (PDMIS) problem formulated by Her et al. [11]. In the k-PDMIS problem, we are given two rows of modules with a routing channel in between, up to k possible implementations for each module (different implementations of a module differ only in the location of pins; the module size and pin count are the same) and a set of net span constraints (the span of a net is the distance between its leftmost and rightmost pins). A feasible solution to a k-PDMIS instance is a selection of module implementations such that all net span constraints are satisfied. An optimal solution is a feasible solution with minimum channel density. Figure 3.1(a) shows a routing channel with two modules on either side of the routing channel. Assume that each module has two implementations and that the pin locations for the second implementation of each module are as in Figure 3.1(b). The net span constraints of the five nets are 4, 4, 1, 1 and 6, respectively. This defines an instance of the 2PDMIS problem. Using the implementations of Figure 3.1(a), the net spans are 5, 3, 1, 1 and 6, respectively; the span constraint of net 1 is violated. If each module is implemented as in Figure 3.1(b), the net spans are 1, 5, 1, 1 and 4, respectively; this time, the span constraint of net 2 is violated. If we implement the modules as in Figure 3.1(c) (i.e., for modules 1 and 2, use the implementations of Figure 3.1(a), and for modules 3 and 4, use the implementations of Figure 3.1(b)), the net spans are 4, 4, 1, 1 and 6, respectively. Now the net span constraints are satisfied for all nets. The channel density, when module implementations are selected as in Figure 3.1(c), is 5. Selecting module implementations as in Figure 3.1(d), we obtain a feasible solution whose density is 3. Her et al. [11] show that the k-PDMIS problem is NP-hard for every k >= 3.
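For any fixed selection of module implementations, both feasibility (the net span constraints) and the channel density objective are straightforward to evaluate. The sketch below assumes each net is given as the list of horizontal coordinates of its pins under the chosen implementations; the helper names are ours, not from Her et al.

```python
def net_spans(pin_positions):
    """pin_positions: net id -> list of horizontal pin coordinates for a
    fixed choice of module implementations. Span = rightmost - leftmost."""
    return {net: max(xs) - min(xs) for net, xs in pin_positions.items()}

def channel_density(pin_positions):
    """Maximum number of nets whose horizontal interval covers a single
    column, computed with a sweep over interval endpoints."""
    events = []
    for xs in pin_positions.values():
        events.append((min(xs), 1))        # net interval opens
        events.append((max(xs) + 1, -1))   # closes past its rightmost pin
    density = cur = 0
    for _, delta in sorted(events):
        cur += delta
        density = max(density, cur)
    return density
```

A selection is feasible when net_spans(...)[j] does not exceed net j's span constraint for every j; among feasible selections, the one minimizing channel_density(...) is optimal.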
For the 2-PDMIS problem, they develop an O(p log n) algorithm to find an optimal solution. In this paper, we develop an alternative O(p log n) algorithm to find an optimal solution to the 2-PDMIS problem. Experiments indicate that our algorithm is twice as fast on small circuits and up to eleven times as fast on larger circuits. We begin, in Section 3.2, by providing an overview of the O(p log n) algorithm of [11]. Then, in Section 3.3, we describe our O(p log n) algorithm. In Section 3.4, we develop an O(pn^(c-1)) algorithm for the c-channel 2-PDMIS problem, c > 1. Experimental results using the single channel 2-PDMIS algorithm are presented in Section 3.5.

Figure 3.1. An example PDMIS problem. (a) first implementation; (b) second implementation; (c) selections that satisfy the net span constraints; (d) selection with better density.

3.2 O(p log n) Algorithm of Her et al.

Her et al. [11] show how to transform an instance P of 2-PDMIS with net span constraints and a constraint, d, on channel density into an instance S of the 2SAT problem (each instance of the 2SAT problem is a conjunctive normal form formula in which each clause has at most two literals). The 2SAT instance S is satisfiable iff the corresponding 2-PDMIS instance has a feasible solution with channel density ≤ d. The size of the constructed 2SAT formula S is O(p), where p is the total number of pins in the modules of P. Since the channel density of the optimal solution is in the range [1, n], where n is the total number of nets, a binary search over d can be used to obtain an optimal solution in O(p log n) time. Her et al. [11] use one boolean variable to represent each module. The interpretation is: variable xi is true iff implementation 1 of module i is selected. The steps in the 2-PDMIS algorithm of [11] are:

1. Construct a 2SAT formula Cspan such that Cspan is satisfiable iff the given 2-PDMIS instance has a feasible solution.
This is done by constructing a 2SAT formula for each net and then taking the conjunction of these formulae. For each net j, the leftmost and rightmost modules on the top row and bottom row are identified. These (at most four) modules are the critical modules for net j, as the span of net j is determined solely by these modules. A 2SAT formula involving the boolean variables that represent these critical modules is constructed. This 2SAT formula has the property that truth value assignments satisfy the 2SAT formula iff the corresponding module implementations cause the net span constraint for net j to be satisfied.

2. Construct a 2SAT formula Cden using a density constraint d. Cden is satisfiable only by module implementation selections which result in a channel density that is ≤ d. To construct Cden, partition the channel into a minimum number of regions such that no region contains a module boundary in its interior; for each region, construct a 2SAT formula so that the density in the region is ≤ d whenever the 2SAT formula is true (this 2SAT formula involves only the module in the top row of the region and the one in the bottom row); take the conjunction of the region 2SAT formulae.

3. Determine if the 2SAT formula Cspan ∧ Cden is satisfiable by using the strongly connected components method described in Papadimitriou and Steiglitz [22]. This requires that we first construct a directed graph from Cspan ∧ Cden.

4. Repeat steps 2 and 3, performing a binary search for the minimum value of d for which Cspan ∧ Cden is satisfiable.

As shown in Her et al. [11], the size of Cspan ∧ Cden is O(p); step 3 takes O(p) time; and the overall complexity is O(p log n).

3.3 Our O(p log n) Algorithm

Our algorithm is a two-stage algorithm that does not construct a 2SAT formula. In the first stage, we construct a set of 2m "forcing lists", where m is the number of modules.
L[i] is a list of module implementation selections that get forced if the first implementation of module i, 1 ≤ i ≤ m, is selected; L[m + i] is the corresponding list for module i when the second implementation of module i is selected. By forced, we mean that unless the module implementations on L[i] (L[m + i]) are selected whenever the first (second) implementation of module i is selected, we cannot have a feasible solution that also satisfies the given density constraint. In the second stage, we use the limited branching method [6] and the forcing lists constructed in stage 1 to obtain a module implementation selection that satisfies the net span and density constraints (provided such a selection is possible). To find an optimal solution, we use binary search to determine the smallest density constraint for which a feasible solution exists.

3.3.1 Stage 1

In stage 1, we construct the forcing lists L[1..2m]. If the selection of implementation 1 of module i requires that we select implementation 1 of module j, we place j on the list L[i]; if the selection of implementation 1 of module i requires that we select implementation 2 of module j, we place m + j on L[i]. Similarly, when the selection of implementation 2 of module i requires a particular implementation be selected for module j, we place either j or m + j on L[m + i]. To assist in the construction of the forcing lists, we use another array C[1..m] with C[i] = 0 if no implementation of module i has been selected so far; C[i] = 1 if the first implementation of module i has been selected; and C[i] = 2 if the second implementation has been selected. First, we construct the forcing lists necessary to ensure the net span constraints. For each net i for which a net span constraint is specified, identify the leftmost and rightmost modules, in each module row, that contain net i (see Figure 3.2).
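The bookkeeping just introduced — forcing lists L[1..2m] and selection array C[1..m] — together with the forced-selection propagation developed below (function Assign, Figure 3.3) can be sketched in Python as follows. The dict/list encoding and the toy forcing constraints are ours; the dissertation's Assign additionally removes processed lists, which this sketch omits.

```python
# Sketch of the stage-1 data structures. Index i (1..m) stands for
# "implementation 1 of module i"; index m + i for "implementation 2".

m = 3                                       # hypothetical module count
L = {i: [] for i in range(1, 2 * m + 1)}    # forcing lists L[1..2m]
C = [0] * (m + 1)                           # C[i]: 0 = none, 1 = first, 2 = second

def assign(sel):
    """Select implementation `sel` and everything it transitively forces.
    Returns False iff some module would need both of its implementations."""
    module = sel if sel <= m else sel - m
    want = 1 if sel <= m else 2
    if C[module] == want:
        return True                         # already selected: nothing to do
    if C[module] != 0:
        return False                        # opposite implementation selected
    C[module] = want
    return all(assign(forced) for forced in L[sel])

# Hypothetical constraints: implementation 1 of module 1 forces
# implementation 2 of module 2, which forces implementation 1 of module 3.
L[1].append(m + 2)
L[m + 2].append(3)
assign(1)
print(C[1:])    # [1, 2, 1]
```

A later attempt to select the complementary implementation of an already-decided module (e.g., `assign(m + 1)` here) returns False, which is exactly the infeasibility signal the stage-1 construction exploits.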
There are at most four such modules: leftmost module with net i in the top module row (module u of Figure 3.2), leftmost in the bottom module row (w), rightmost in the top row (v) and rightmost in the bottom row (x). The span of net i is determined by a pair of these critical modules. One module in this pair is a leftmost critical module and the other is a rightmost critical module. So, there are at most four module pairs to consider (for the example of Figure 3.2, these four pairs are (u, v), (w, v), (u, x) and (w, x)). When a critical module pair is considered, let A denote the implementation of the left module (of the pair) in which the leftmost pin of net i is to the right of the leftmost pin of net i in the other implementation (ties are broken arbitrarily). Let A' denote the other implementation of the left module. Let B denote the implementation of the right module for which the rightmost pin of net i is to the left of the rightmost pin of net i in the other implementation (ties are broken arbitrarily). Let B' denote the other implementation of the right module. In the example of Figure 3.2, consider the critical module pair (u, x); u is the left module and x is the right module. The second implementation of u is A and its first implementation is A'; the first implementation of x is B and its second implementation is B'. There are four ways in which we can select the implementations of the modules u and x: (A, B), (A, B'), (A', B) and (A', B'). For each of these four selections, we can determine the span of net i and classify the selection as feasible (i.e., does not violate the net span constraint) or infeasible. Notice that if the selection (A, B) violates the net span constraint for net i, then each of the remaining three selection pairs also violates the net span constraint for this net.

Figure 3.2. Critical modules of net i.
We have the following possibilities:

Case 1: [No selection is infeasible.] All four selections are feasible. In this case, no addition is made to the forcing lists.

Case 2: [Exactly one selection is infeasible.] The infeasible selection must be (A', B'), and the other three selections are feasible. Now, the selection of A' forces us to select B, and the selection of B' forces us to select A. Therefore, we add B to the forcing list for A' and A to that for B'. To add B to the forcing list of A' (and similarly to add A to the list of B'), we first check C[] to determine if an implementation for the module corresponding to A' has already been selected. If no implementation has been selected, we simply append B to the list for A'. If the implementation A has been selected, then we do nothing. If the implementation A' has been selected, then the implementation B is forced and we run the function Assign(L, C, B) of Figure 3.3, which selects implementation B as well as other implementations that may now be forced. This function returns the value False iff it has determined that no feasible solution exists.

Algorithm Boolean Assign(L[], C[], M)
/* Select implementation M and related modules */
  if M is selected then return True;
  if M' is selected then return False;
  /* M is undecided */
  Mark M selected in C[];
  for each X ∈ L[M] do
    if not Assign(L, C, X) then return False;
  end for
  Remove L[M] and L[M'];
  return True;

Figure 3.3. Function Assign

Case 3: [Exactly two selections are infeasible.] This can arise in one of two ways: (a) (A, B) and (A, B') are feasible and (A', B') and (A', B) are infeasible, or (b) (A, B) and (A', B) are feasible and (A', B') and (A, B') are infeasible. In case (a), we must select implementation A. This is done by executing Assign(L, C, A). In case (b), we must select implementation B; so, we perform Assign(L, C, B).

Case 4: [Exactly three selections are infeasible.]
Now (A, B) is the only feasible selection and we perform Assign(L, C, A) and Assign(L, C, B).

Case 5: [All four selections are infeasible.] In this case, the 2-PDMIS instance has no feasible solution.

Once we have constructed the forcing lists for the net span constraints, we proceed to augment these lists to account for the channel density constraint. Of course, this augmentation is to be done only when we haven't already determined that the given 2-PDMIS instance is infeasible. Our strategy to augment the forcing lists to account for the density constraint begins by partitioning the routing channel into regions such that no module boundary falls inside of a region (see Figure 3.4). To ensure that the channel density is ≤ d, we require that the density in each region of the channel be ≤ d. This can be done by examining each channel region. Let T be the module on the top row of the channel region and B the module on the bottom row. The density in this channel region is completely determined by the nets that enter this region from its left or right and by the implementations of T and B. Let T1, T2 (B1, B2) denote the two possible implementations of T (B). We have four possible implementation pairs (T1, B1), (T1, B2), (T2, B1) and (T2, B2). We can determine which of these four implementation pairs are infeasible (i.e., result in a channel region density > d) and use a case analysis similar to that used above for net span constraints.

Figure 3.4. Partitioning a routing channel into regions.

The cases are:

Case 1: [None are infeasible.] Do nothing.

Case 2: [Exactly one is infeasible.] Suppose, for example, only (T1, B2) is infeasible. We need to add B1 to the forcing list for T1 and T2 to the list for B2. This is similar to case 2 for net span constraints.

Case 3: [Exactly two are infeasible.] This can happen in one of six ways.
If the feasible pairs are (T1, B2) and (T2, B1), then T1 forces B2, B2 forces T1, T2 forces B1 and B1 forces T2. The remaining five cases are similar.

Case 4: [Exactly three are infeasible.] There are four ways this can happen. For example, if (T1, B1) is the only feasible pair, then implementations T1 and B1 must be selected. The remaining three cases are similar.

Case 5: [All four are infeasible.] The 2-PDMIS instance with density constraint d has no feasible solution.

3.3.2 Stage 2

If, following stage 1, we have not determined that the 2-PDMIS instance is infeasible, stage 2 is entered. If no nonempty forcing list remains, all implementations of the modules for which no implementation has been selected result in feasible solutions. When nonempty forcing lists remain, we use the limited branching method [6] to make the remaining module implementation selections. In this method, we start with a module i whose implementation is yet to be selected. For this module, we try out both implementations, in parallel, following the forcing lists L[i] and L[m + i], respectively. This is equivalent to running Assign(L, C, i) and Assign(L, C, m + i) in parallel and terminating when either (a) both return with value False or (b) one (or both) return with value True. When (a) occurs, we have an infeasible solution. When (b) occurs, the selections made by the branch that returns True are used. Note that the parallel execution of Assign(L, C, i) and Assign(L, C, m + i) is actually done via simulation by a single processor; this processor alternates between performing one step of Assign(L, C, i) and one of Assign(L, C, m + i) and stops when one of the two conditions (a) or (b) occurs. In case of (b), we proceed with the next module with an unselected implementation.

3.3.3 Implementation Details

To implement stage 2, we need two copies of the implementation selection array C, one copy for each parallel execution branch. Call these copies C1 and C2.
Although both are identical at the start of Assign(L, C1, i) and Assign(L, C2, m + i), C1 and C2 may differ later. When the execution of these two branches terminates, we need to set the Ci corresponding to the unselected branch equal to that of the selected branch. This is done efficiently by maintaining two lists A1 and A2 of the changes made to C1 and C2 since the start of the two branches. Then, if C1 is selected, we can use A2 to first convert C2 back to its initial state and then use A1 to convert it from the initial state to C1. If C2 is selected, a similar process can be used to convert C1 to C2. The time needed for this is |A1| + |A2| rather than |C1| = |C2| = m (as would be the case if we simply copy C1 to C2 or C2 to C1). Further, since the forcing lists are shared by the two branches, these branches should not modify the forcing lists. Therefore, the simulation of Assign omits the steps that remove forcing lists. Finally, to efficiently simulate two parallel executions of Assign, we need to convert the recursive version of Figure 3.3 into an iterative version. Our iterative code, which simulates the parallel execution of two Assign branches, employs two queues Q1 and Q2. A high level description of the code is given in Figures 3.5, 3.6 and 3.7.

3.3.4 Time Complexity

To construct the net span constraints' portion of the forcing lists, we must identify the up to four critical modules of each net and establish the forcing constraints for each of the up to four critical module pairs that determine the net span. The critical modules for all nets can be determined in Θ(p) time by making a left to right sweep of the modules, keeping track, for each net i, of the first and last modules in the top and bottom module row that contain net i. Since all pin locations and module boundaries are integers, the modules can be sorted in left to right order in linear time using bin sort [24]. Each net's contribution to the forcing lists can now be determined in Θ(1) time.
Therefore, representing each L[i] as a chain, the net span constraints' contribution to the L[i]s can be determined in Θ(p + n) = Θ(p) time. To construct the portion of L[i] that results from the channel density constraint, we partition the channel into regions by performing a left to right sweep of the modules and using the module end points as region boundaries. The number of channel regions is, therefore, Θ(m). In our implementation, we scan the channel four times to compute the maximum density of each region for each of the four possible implementations of the module pair that bounds the region. This takes Θ(p) time. Once we have the densities of each region we can, given the density constraint,

Algorithm Boolean Satisfy(L[], C2[])
/* Test whether L is satisfiable */
  Copy C2[] into C1[];
  for i = 1 to m do
    if C1[i] == 0 then /* i is undecided */
      if L[i] is empty then
        C1[i] = C2[i] = 1; /* select first implementation */
      else if L[m + i] is empty then
        C1[i] = C2[i] = 2; /* select second implementation */
      else
        EnQueue(Q1, i); EnQueue(Q2, m + i); /* m + i represents the 2nd implementation of module i */
        while Q1 not empty and Q2 not empty do
          a = DeQueue(Q1); b = DeQueue(Q2);
          if a is rejected in C1 and b is rejected in C2 then return False;
          else if a is rejected in C1 then
            EnQueue(Q1, a);
            if not Search(L, Q2, C2, A2, b) then return False;
          else if b is rejected in C2 then
            EnQueue(Q2, b);
            if not Search(L, Q1, C1, A1, a) then return False;
          else
            if a is undecided in C1 then
              Add list L[a] into Q1; Insert a into A1; Mark a selected in C1;
            if b is undecided in C2 then
              Add list L[b] into Q2; Insert b into A2; Mark b selected in C2;
        end while /* Q1 not empty and Q2 not empty */
        if Q1 is empty then Undo(C2, A2, C1, A1); /* make C2 = C1 */
        else /* Q2 is empty */ Undo(C1, A1, C2, A2); /* make C1 = C2 */
      end if /* L[i] is empty */
    end if /* module i is undecided */
  end for
  return True;

Figure 3.5. Function Satisfy

Algorithm Boolean Search(L,
Q, C, A, x)
/* Select module x, the modules in Q and related modules; update list A */
  Mark x selected in C; Insert x into A;
  Add list L[x] into Q;
  while Q not empty do
    y = DeQueue(Q);
    if y is rejected in C then return False;
    else if y is undecided in C then
      Add list L[y] into Q;
      Insert y into A; Mark y selected in C;
  end while
  return True;

Figure 3.6. Function Search

Algorithm Undo(C1, A1, C2, A2)
/* make C1 = C2 by using the delta lists */
  for each x ∈ A1 do Mark x undecided in C1;
  for each x ∈ A2 do Mark x selected in C1;

Figure 3.7. Procedure Undo

construct the forcing lists L[1..2m] in Θ(m) time. Notice that on succeeding iterations of the binary search for an optimal solution, only the contribution to L[] from the density constraint may change. The new contribution to L[] can be determined without recomputing the densities of each region. The limited branching method of stage 2 uses two queues Q1 and Q2. The time needed to add (EnQueue) or delete (DeQueue) an element to/from a queue is Θ(1) [24]. In each iteration of the for loop of Figure 3.5, the time spent following the successful branch equals that spent following the unsuccessful branch, and the time needed to make C1 and C2 identical (i.e., the cost of the Undo operation) is, asymptotically, no more than the time spent following the successful branch. The time spent following all successful branches is no more than the size of the forcing lists because no forcing list is examined twice. Therefore, the stage 2 time is O(p). The binary search for the minimum density solution iterates O(log n) times. Therefore, our algorithm finds an optimal solution to the 2-PDMIS problem in O(p log n) time. Comparing our algorithm to that of Her et al. [11], we note that our algorithm has the potential of identifying infeasible 2-PDMIS instances quite early; that is, during the construction of the forcing lists.
Although infeasibility resulting from the critical modules of a single net being too far apart is detected immediately by both algorithms, our algorithm can also quickly detect infeasibility resulting from forced selections during stage 1. The algorithm of Her et al. [11] does not do this. Because of the calls to Assign made during stage 1, the size of the forcing lists to be processed in stage 2 is often significantly reduced. As a result, the limited branching operation is often applied to much smaller data sets than the 2SAT graph on which the strongly connected components algorithm is applied [11]. These factors contribute to the observed speedup provided by our algorithm relative to that of Her et al. [11].

3.4 Multichannel 2-PDMIS Problem

In the multichannel 2-PDMIS problem, we have c + 1, c > 1, rows of modules. Each module has pins on its upper and lower boundaries, each module has two possible implementations, there is a routing channel between every pair of adjacent rows, and net span bounds are provided for every channel [11]. Although Her et al. [11] develop a heuristic for the general multichannel PDMIS problem, they do not consider polynomial time algorithms for the multichannel 2-PDMIS problem. For any fixed channel density tuple (d1, d2, ..., dc) for the c routing channels, we can develop the forcing lists in O(p) time, where p is the total number of pins. These lists are developed using ideas similar to those used in Section 3.3. Then, using the limited branching method of Section 3.3, we can determine, in O(p) time, whether it is possible to select module implementations so that the channel densities do not exceed (d1, d2, ..., dc) and so that the net span bounds are satisfied. Thus, the method of Section 3.3 is easily extended to obtain an O(p) feasibility test for (d1, d2, ..., dc). Since there are O(n^c) possible density vectors (n is the number of nets), the c-channel 2-PDMIS problem can be solved by trying out all O(n^c) tuples in O(pn^c) time.
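The naive enumeration just described can be sketched as follows. Here `feasible` is a hypothetical stand-in for the O(p) forcing-list feasibility test, and minimizing the maximum channel density is one possible notion of "best" tuple; the names are ours.

```python
# Sketch of the naive O(p * n^c) search: try every density tuple in
# [1, n]^c and keep a best feasible one. `feasible` abstracts the O(p)
# forcing-list test of Section 3.3 and is supplied by the caller.

from itertools import product

def best_density_tuple(n, c, feasible):
    """Return a feasible density tuple minimizing the maximum channel
    density, or None if no tuple is feasible."""
    best = None
    for tup in product(range(1, n + 1), repeat=c):   # O(n^c) tuples
        if feasible(tup) and (best is None or max(tup) < max(best)):
            best = tup
    return best

# Toy stand-in feasibility: feasible iff every channel density is >= 3.
print(best_density_tuple(5, 2, lambda t: min(t) >= 3))   # (3, 3)
```

Each call to `feasible` costs O(p) in the real algorithm, which is what makes the O(n^c)-tuple enumeration expensive and motivates the pruned search developed next.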
We can reduce this time to O(pn^(c-1)) as follows. When c = 2, first determine the least y such that (n/2, y) is a feasible channel density tuple. This is done using a binary search on d2 and takes O(log n) feasibility tests, each test taking O(p) time. We can ignore tuples (d1, d2) with d1 < n/2 and d2 < y because these tuples are infeasible, and we can ignore tuples (d1, d2) with d1 > n/2 and d2 > y because these are inferior to (n/2, y). Therefore, the search for a better tuple than (n/2, y) may be limited to the regions d1 < n/2 and d2 > y, and d1 > n/2 and d2 < y. These two regions (Figure 3.8) may now be searched recursively. For example, to find the best tuple in the region d1 < n/2 and d2 > y, find the least z such that (n/4, z) is feasible. Now search the two regions d1 < n/4 and d2 > z, and d1 > n/4 and d2 < z, for a better tuple than (n/4, z). The worst-case number of feasibility tests for the above search strategy is given by the recurrence N(n) = 2N(n/2) + log n, n > 2, and N(1) = 1. The solution to this recurrence is N(n) = O(n). Since each feasibility test takes O(p) time, the 2-channel 2-PDMIS problem can be solved in O(pn) time. By doing an exhaustive search on the densities of c − 2 channels and using the above technique for the remaining 2 channels (i.e., for each choice of densities for the c − 2 channels, find the overall best choice for the remaining channels as above), we can solve the c-channel 2-PDMIS problem in O(p · n^(c-2) · n) = O(pn^(c-1)) time.

Figure 3.8. The two regions to be searched recursively after the binary search.

3.5 Experimental Results

We implemented our algorithm as well as that of Her et al. [11] in C and measured the run time performance of the two algorithms on a SUN SPARCstation 5. Our first data set consists of the benchmark channels used in Her et al. [11]. We partitioned the top row and bottom row of the channel into intervals, consider these intervals as "modules", and assume each module has two implementations.
Table 3.1 gives the characteristics of these circuits as well as the time, in seconds, taken by the two algorithms. The optimal densities given in Table 3.1 differ from those reported in [11] because the partitioning of the top and bottom rows of pins used by us is different from that used in Her et al. [11]. The speedup provided by our algorithm ranges from 1.67 to 2.20. Our second data set consists of circuits designed to minimize the size of the forcing lists constructed in stage 1. The characteristics of these circuits as well as the performance of the two algorithms on these circuits are given in Table 3.2. Our algorithm is 9 to 11 times as fast on these circuits.

Table 3.1. Running time for benchmark channels

                            Optimal    Time/Second
Channel   n     m     p     density    [11]      Our       Speedup
ex1       21    19    74    12         0.0022    0.0010    2.20
ex3a      44    36    158   14         0.0046    0.0023    2.00
ex3b      47    24    158   16         0.0035    0.0021    1.67
ex3c      54    23    178   18         0.0039    0.0023    1.70
ex4b      54    28    192   17         0.0045    0.0024    1.88
ex5       64    40    190   18         0.0042    0.0025    1.68

Table 3.2. Running time for generated channels

                                        Time/Second
Channel      n      m      p      [11]       Our       Speedup
w32x32       64     66     192    0.0425     0.0046    9.24
w64x64       128    130    384    0.0999     0.0105    9.51
w128x128     256    258    768    0.2275     0.0225    10.11
w256x256     512    514    1536   0.5130     0.0487    10.53
w512x512     1024   1026   3072   1.1755     0.1066    11.03
w1024x1024   2048   2050   6144   2.6150     0.2309    11.33
w2048x2048   4096   4098   12288  5.6700     0.4886    11.60
w4096x4096   8192   8194   24576  12.0500    1.0280    11.72
w8192x8192   16384  16386  49152  24.8800    2.1260    11.70

3.6 Conclusion

We have developed an O(p log n) time algorithm for the single channel 2-PDMIS problem and an O(pn^(c-1)) time algorithm for the c-channel 2-PDMIS problem, c > 1. Experiments indicate that our single channel algorithm is substantially faster than the single channel algorithm of [11]. The heuristic proposed in Her et al. [11] for the k-PDMIS problem, k > 2, uses the algorithm for the 2-PDMIS problem. By using our 2-PDMIS algorithm, the k-PDMIS heuristic of Her et al. [11] will also run faster.
CHAPTER 4
GATE RESIZING TO REDUCE POWER CONSUMPTION

4.1 Introduction

Power consumption, speed and area are three important and related characteristics of a circuit. With the increase in circuit density and the enhanced use of battery-operated devices, the emphasis on power consumption has increased. By reducing power consumption, we simultaneously reduce heat dissipation and increase battery life. In this paper, we consider the problem of minimizing the power consumed by a circuit subject to satisfying the circuit's timing constraints. Power reduction is obtained by gate resizing: larger gates are replaced by smaller ones that have higher delay but lower power consumption. Power reduction via gate resizing has been considered in [3]. In the general gate resizing problem (GGR), for each gate in the circuit we have a list of (delay, capacitance) pairs. Each pair gives the delay and capacitance associated with a possible implementation of that gate. Since the power consumed by a gate is linearly proportional to the product of its capacitance and the switching activity at its inputs, the gate's power consumption can be computed from its capacitance once the circuit characteristics are known. Therefore, we assume that instead of (delay, capacitance) pairs, we have (delay, power consumption) pairs. In this model, we ignore the change in power from load change and switching activity change due to change of gate delay. The same assumption has been used in Chen and Sarrafzadeh [3]. In the GGR problem, we begin with a realization for each gate (i.e., a selection of a (delay, power consumption) pair) such that the timing constraints are satisfied. We wish to change the realization of some or all of the gates by replacing their assigned pair with one that has larger delay (i.e., gate resizing) such that the timing constraints remain satisfied and the power consumption of the resized circuit (this is the sum of the power consumption at each gate) is minimum.
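The timing-feasibility side of this formulation can be sketched as a longest-path computation over the circuit DAG; choosing, per gate, the (delay, power) pair that minimizes total power subject to this check is the GGR problem. The encoding, names, and example numbers below are ours, not from the dissertation.

```python
# Sketch: check that a choice of gate delays keeps every primary output
# on time. The circuit is a DAG; arrival times propagate along the
# longest path, as in standard static timing analysis.

from collections import deque

def topo_order(succ, vertices):
    """Topological order of the circuit DAG (Kahn's algorithm)."""
    indeg = {v: 0 for v in vertices}
    for v in vertices:
        for w in succ.get(v, []):
            indeg[w] += 1
    q = deque(v for v in vertices if indeg[v] == 0)
    order = []
    while q:
        v = q.popleft()
        order.append(v)
        for w in succ.get(v, []):
            indeg[w] -= 1
            if indeg[w] == 0:
                q.append(w)
    return order

def circuit_ok(succ, vertices, delay, avail, deadline):
    """True iff, with the chosen gate delays, every primary output meets
    its required time. `avail`: input availability; `deadline`: output times."""
    arrive = dict(avail)
    for v in topo_order(succ, vertices):
        for w in succ.get(v, []):
            arrive[w] = max(arrive.get(w, 0),
                            arrive.get(v, 0) + delay.get(w, 0))
    return all(arrive.get(o, 0) <= t for o, t in deadline.items())

# Hypothetical 2-gate chain: input a (ready at 0) -> g1 -> g2 -> output z (due 10).
succ = {"a": ["g1"], "g1": ["g2"], "g2": ["z"]}
vertices = ["a", "g1", "g2", "z"]
delay = {"g1": 3, "g2": 4}              # chosen realizations
print(circuit_ok(succ, vertices, delay, {"a": 0}, {"z": 10}))   # True
```

Resizing g2 to a slower, lower-power realization with delay 8 would push the output to time 11 and fail the check, illustrating how timing constraints bound how far gates may be slowed.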
The GGR problem (referred to as the incomplete library problem in [3]) is equivalent to the BCI problem studied in Li et al. [20]. The BCI problem was shown to be NP-complete [20], even for circuits that are simply a chain of single input single output gates. Bahar et al. [1] propose a greedy heuristic for the GGR problem. This heuristic resizes one gate at a time. Chen and Sarrafzadeh [3] have proposed a heuristic that resizes several gates at a time. Experimental results presented by Chen and Sarrafzadeh show that their heuristic is able to reduce the power consumed by benchmark circuits by approximately 10%. More precisely, the method of Chen and Sarrafzadeh [3] did worse than the greedy method of Bahar et al. [1] by 5.3% on one of 9 benchmark circuits and did better by 1.4% to 20.6% on the remaining 8 circuits. Chen and Sarrafzadeh also propose a pseudo-polynomial time algorithm for the Low Power Complete Library-Specific Gate Resizing (CGR) problem. In this problem, each gate can be realized to have any delay (delays are assumed to be integral). Further, the power consumed by gate v decreases by the constant c(v) for each unit increase in delay. Let dI(v) be the delay of the initially assigned realization of gate v and let d(v) ≥ dI(v) be the delay of the resized gate v. Then the power reduction ΔP resulting from resizing an n gate circuit is

ΔP = Σ(i=1..n) c(i) (d(i) − dI(i)).

In Section 4.2.2 we develop a linear time algorithm for the CGR problem for series-parallel circuits. In Section 4.2.3 we extend this linear time algorithm for series-parallel circuits to obtain an O(n log^2 n) time algorithm that works when there is an upper bound on the delay of each gate. That is, each gate v has realizations with integral delays in the range [dI(v), du(v)]. As in the CGR problem, each unit increase in delay reduces power consumption by c(v). We call this the CUGR problem. The CUGR problem for tree circuits can also be solved in linear time (Section 4.3).
In Section 4.4, we show that the CGR problem is NP-complete for circuits that have a special type of multi-input multi-output gate. An alternative algorithm for the CGR problem is presented in Section 4.5. This algorithm transforms the CGR problem to an activity-on-edge network [15] and then uses a known method to minimize project cost [8] to obtain an optimal solution to the CGR problem. The approach of Section 4.5 is quite general and can be used even for the CUGR problem and when the c(v)s are convex functions rather than constants. In Section 4.5, we also point out that the CGR algorithm of Chen and Sarrafzadeh does not work for CUGR and convex c(v)s. In Section 4.6, we use the approach of Section 4.5 to obtain a heuristic for the GGR problem with convex c(v)s. Experimental results comparing our algorithms of Sections 4.5 and 4.6 and the CGR algorithms of Chen and Sarrafzadeh [3] are presented in Section 4.7. Although both CGR algorithms generate minimum power circuits, our algorithm does this using significantly less time. The GGR heuristic we developed obtains better power reduction on many circuits than the one developed in Chen and Sarrafzadeh [3]. Throughout this chapter, we assume that a circuit is represented as a directed acyclic graph. The vertices of this graph represent gates and the edges represent signal flow. Primary inputs may be modeled as vertices with no incoming edge and primary outputs may be modeled as vertices with no outgoing edge. Figure 4.1 gives the digraph for an example circuit. The vertices corresponding to primary inputs are labeled with the time at which the primary input is available; vertices corresponding to primary outputs are labeled with the time by which the output signal must arrive; and the remaining vertices (these correspond to circuit gates) are labeled with the (delay, power consumption) pair corresponding to their initial implementation.

Figure 4.1.
Digraph corresponding to a circuit.

4.2 Series-Parallel Circuits

4.2.1 Definition

Series-parallel circuits were considered in Li et al. [20]. A series-parallel circuit may be defined recursively as below [20]:

SP1: a chain of gates is a series-parallel circuit (Figure 4.2(a)).

SP2: several chains of gates joined at the ends to a common first gate and a common last gate (Figure 4.2(b)) define a simple parallel circuit. A simple parallel circuit is a series-parallel circuit.

SP3: a circuit obtained from a series-parallel circuit C by replacing any interconnect of C by another series-parallel circuit is a series-parallel circuit (Figure 4.2(c)).

Figure 4.2 gives example series-parallel circuits as well as a circuit that is not series-parallel.

Figure 4.2. Circuit examples (source: Li et al. [20]). (a) Chain; (b) Simple parallel circuit; (c) Series-parallel circuit; (d) Non-series-parallel circuit.

4.2.2 Complete Library Gate Resizing (CGR)

Our strategy to solve the CGR problem for series-parallel circuits is to reduce the circuit to one that has a single gate. The CGR problem for the reduced single gate circuit is easily solved, and finally the solution to this single gate problem is used to reconstruct the solution for the initial circuit. To transform an arbitrary series-parallel circuit into an equivalent single gate circuit (i.e., a single gate circuit with the same maximum power reduction), we first obtain the series-parallel decomposition of the circuit using the linear time algorithm of [33]. This series-parallel decomposition essentially tells us how to build the original circuit using chains (SP1), simple parallel circuits (SP2), and replacement of interconnects of C by series-parallel circuits (SP3). During this rebuild process, we shall replace each chain and simple parallel circuit by a single gate. Consequently, when the rebuild is complete, we will be left with a single gate.
The replacement rules for chains and simple parallel circuits are given below:

Chain: Suppose the chain has n gates with delays d1, d2, ..., dn. Let ci be the power reduction obtained by increasing di by 1. The signal delay through the chain is Σ_{i=1}^{n} di. For each unit increase in signal delay over Σ_{i=1}^{n} di, the maximum possible power reduction is max_{1≤i≤n}{ci}. Therefore the chain is equivalent to a gate v with delay Σ_{i=1}^{n} di and c(v) = max_{1≤i≤n}{ci} (Figure 4.3(a)). The power reduction ΔP obtained by making this replacement is zero.

Simple Parallel Circuit: First transform each chain in the simple parallel circuit into an equivalent single gate using the transformation of Figure 4.3(a). This results in the parallel circuit of Figure 4.3(b). The signal delay between the output of gate s and the input of gate t is max_{1≤i≤n}{di}. We may therefore increase the delay of all gates between s and t to this maximum delay. This gives us a power reduction ΔP = Σ_{i=1}^{n} ci(max_{1≤j≤n}{dj} − di). Each unit increase in delay between s and t beyond max_{1≤j≤n}{dj} gives us a power reduction of Σ_{i=1}^{n} ci. Therefore the n gates between s and t are equivalent to a single gate with delay max_{1≤i≤n}{di} and c value Σ_{i=1}^{n} ci. Thus a simple parallel circuit may be replaced by the three-gate chain shown in Figure 4.3(b). This chain can, in turn, be replaced by a single gate using the chain transformation of Figure 4.3(a).

Using the above transformations on the series-parallel decomposition yields a single-gate circuit. The input to this gate is the primary input of the original circuit and the gate output is the primary output of the original circuit. Let v be the single gate that remains. Let ti and to be the arrival time of the input and the required time for the output, respectively. Since we start with a circuit that can meet its arrival time requirements (i.e., a feasible circuit) and since the transformations of Figure 4.3 do not affect feasibility, ti + d(v) ≤ to. The additional power reduction possible is (to − ti − d(v))c(v).

Figure 4.3. Transformation of Series-Parallel Circuits. (a) Chain, which becomes a gate with delay Σ di and c = max{ci} (ΔP = 0); (b) Simple Parallel Circuit, which becomes a gate with delay max{di} and c = Σ ci (ΔP = Σ ci(max{dj} − di))

The maximum power reduction ΔPmax for the original circuit is (to − ti − d(v))c(v) plus the sum of the ΔPs from the simple parallel circuit transformations (Figure 4.3(b)). To obtain the delay values for each gate of the original circuit that result in a power reduction of ΔPmax, simply follow the reduction process backwards. The total time taken is linear in the number of gates in the original circuit. Figures 4.4 and 4.5 show an example. Each gate is represented by a box, the number inside a box is the gate delay, the number below a box is the gate's c value, the primary input is available at time 0, and the primary output is needed at time 37.

Figure 4.4. Transformation of a series-parallel circuit into a single gate (the ΔP values obtained at the intermediate steps are ΔP1 = 0, ΔP2 = 6·5 + 2·2 + 3·3 = 43, ΔP3 = 0, ΔP4 = 8·15 = 120 with Δd = 37 − 29 = 8, and ΔP5 = 21·Δd = 168)

Figure 4.5. Computation of the new delay for each gate

4.2.3 Complete Library with Upper Bounds (CUGR) and Convex c(v)s

The power reduction per unit delay increase function c(v) for gate v is convex iff there exist positive δ1, δ2, ..., δm and c1 ≥ c2 ≥ ... ≥ cm such that c(v) = c1 for delay increases between 0 and δ1; c(v) = c2 for delay increases between δ1 + 1 and δ1 + δ2; c(v) = c3 for delay increases between δ1 + δ2 + 1 and δ1 + δ2 + δ3; and so on. Figure 4.6 shows the power consumption as a function of the increase in delay relative to the gate's initial delay dI(v). P0 is the power consumption when the gate has its initial delay dI(v).

Figure 4.6. Convex delay-power consumption graph (segments of slope −c1, −c2, ..., −cm over the delay increases 0 to δ1, δ1 to δ1 + δ2, ..., up to Σ δi)

The CUGR problem can be modeled using gates with convex power reduction functions c(v).
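The two replacement rules can be sketched directly. The function names below are mine, and gates are given as (delay, c) pairs as in the text; each function returns the equivalent gate's delay and c value together with the banked power reduction ΔP.

```python
def chain_equiv(gates):
    """Replace a chain of (d_i, c_i) gates by one gate: delays add, the
    largest c survives, and no power reduction is banked (dP = 0)."""
    delay = sum(d for d, _ in gates)
    c = max(c for _, c in gates)
    return delay, c, 0

def parallel_equiv(gates):
    """Replace n parallel (d_i, c_i) gates by one gate: the slowest branch
    sets the delay, each faster branch is stretched up to it (banking the
    power reduction dP), and beyond that point all branches must slow down
    together, so the c values add."""
    dmax = max(d for d, _ in gates)
    dP = sum(c * (dmax - d) for d, c in gates)
    return dmax, sum(c for _, c in gates), dP
```

Once the whole circuit has collapsed to a single gate v, the final term (to − ti − d(v))·c(v) is added to the banked ΔP values to obtain ΔPmax.
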
For example, if gate v provides a power reduction of c for each unit increase in delay between dI(v) and du(v), then we may use c(v) with δ1 = du(v) − dI(v) and c1 = c. Because of this correspondence between the CUGR problem and the convex gate resizing problem ConvexCGR, we consider only the ConvexCGR problem in this section. To solve the ConvexCGR problem for series-parallel circuits, we need only develop methods to transform a chain of convex gates into an equivalent convex gate and to transform a simple parallel circuit comprised of convex gates into an equivalent convex gate. These transformations can then be used in place of the transformations of Section 4.2.2 to obtain an algorithm for the ConvexCGR (and hence also for the CUGR) problem. We assume that a convex gate is given by a list of tuples {(δ1, c1), (δ2, c2), ..., (δm, cm)}, c1 > c2 > ... > cm. This list is called the DP (delay power) list of the gate.

Chain of Convex Gates: A chain of n convex gates with initial delays d1, d2, ..., dn is replaced by a convex gate with initial delay Σ_{i=1}^{n} di as in Figure 4.3(a). The DP list for the new gate is obtained by merging the DP lists of the n gates in the original chain into a single list sorted by nonincreasing ci's. During this process, pairs with the same ci value are combined into a single pair. For example, if (5, 24) and (2, 24) are pairs in DP1 and DP2 respectively, the combined pair is (7, 24). Suppose we have a 3-gate chain with DP1 = {(3, 28), (5, 24), (3, 21)}, DP2 = {(2, 24), (4, 23)} and DP3 = {(9, 26)}. The DP list for the replacement gate for this chain is {(3, 28), (9, 26), (7, 24), (4, 23), (3, 21)}.

Simple Parallel Circuit with Convex Gates: First the chains in the circuit are transformed into equivalent single convex gates. Then the delays of these equivalent convex gates are increased to max_{1≤i≤n}{di}. This increase in delay provides a power reduction ΔP and changes the tuples at the front of the DP lists of the gates whose delay is increased.
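The DP-list merge for a chain of convex gates can be sketched as below (my own code, not from the text); it reproduces the 3-gate example above.

```python
from collections import defaultdict

def series_merge(*dp_lists):
    """Merge DP lists of chained convex gates: collect all (delta, c) pairs,
    combine pairs sharing a c value, and sort by nonincreasing c."""
    total = defaultdict(int)            # c value -> summed delta
    for dp in dp_lists:
        for delta, c in dp:
            total[c] += delta
    return [(total[c], c) for c in sorted(total, reverse=True)]

dp1 = [(3, 28), (5, 24), (3, 21)]
dp2 = [(2, 24), (4, 23)]
dp3 = [(9, 26)]
# series_merge(dp1, dp2, dp3)
#   -> [(3, 28), (9, 26), (7, 24), (4, 23), (3, 21)]
```
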
Let the new DP lists be DP'1, DP'2, ..., DP'n. Now the gates between s and t are replaced by a single gate with delay max_{1≤i≤n}{di} whose DP list is obtained by a parallel merge of the DP' lists. Figure 4.7 shows the merging process for n = 2. Here L1 and L2 denote the two DP lists DP'1 and DP'2 and L denotes the DP list of the replacement convex gate. Finally, s, t and the replacement gate are combined into a single convex gate using the method for a chain of convex gates.

Suppose we have two parallel convex gates with delays 3 and 2 respectively and the corresponding DP lists DP1 = {(3, 28), (5, 24), (3, 21)} and DP2 = {(2, 24), (4, 23)}. The delay of the equivalent convex gate is thus 3; increasing the delay of the second gate to 3 reduces its power consumption by 24, and its DP list is modified to {(1, 24), (4, 23)}. By performing the ParallelMerge operation we obtain the DP list {(1, 52), (2, 51), (2, 47), (3, 24), (3, 21)} for the equivalent gate.

Algorithm ParallelMerge(L1, L2)
/* Merge L1 and L2, considering two gates in parallel */
p1 <- head(L1); p2 <- head(L2); L <- NULL;
while L1 not empty and L2 not empty do
    if delta(p1) < delta(p2) then
        Insert (delta(p1), c(p1) + c(p2)) into L;
        delta(p2) <- delta(p2) - delta(p1); p1 <- next(p1);
    else if delta(p1) > delta(p2) then
        /* Symmetric to the "if" part above, with p1 and p2 interchanged. */
    else /* delta(p1) = delta(p2) */
        Insert (delta(p1), c(p1) + c(p2)) into L;
        p1 <- next(p1); p2 <- next(p2);
    end if
end while
if L1 is empty then
    Append the remaining nodes of L2, starting from p2, to L;
else
    Append the remaining nodes of L1, starting from p1, to L;
end if
return L;

Figure 4.7. Algorithm ParallelMerge

4.2.4 Time Complexity of the ConvexCGR Problem

A straightforward implementation of our algorithm of the previous section for the ConvexCGR problem uses sorted chains (linked lists) to represent the DP lists. The time needed to combine/merge the DP lists L1 and L2 of two gates is O(|L1| + |L2|) regardless of whether we do a series or a parallel merge. If we start with gates having DP lists with k tuples each, then the time needed to transform an n-gate series-parallel circuit into its equivalent single gate is O(kn^2). To see this, observe that each series/parallel combine step reduces the number of gates by at least 1. Therefore, there can be at most n − 1 combine steps. Further, after q combines, the size of a DP list is O(kq) = O(kn). So the cost of O(n) combines is O(kn^2). An example circuit that exhibits this kn^2 worst-case behavior is given in Figure 4.8.

Figure 4.8. Worst-case merging of n gates

We can reduce the asymptotic time complexity to O(kn log^2 n) by using balanced binary search trees (BBSTs) [15] to represent the DP lists. Each DP list is represented by a BBST such that the external nodes represent the pairs (δi, ci) in the DP list in right-to-left order (i.e., in decreasing order of power reduction). Each internal node x contains a triple of the form (D(x), C(x), M(x)), where D(x) is the sum of the delays of the DP list pairs in the left subtree of x, C(x) is a corrective factor needed to compute the ci values of pairs in the left subtree of x, and M(x) is a pointer to the rightmost external node in the left subtree of x. Each external node y stores a pair (d(y), c(y)) such that d(y) is the δ value of the DP list pair represented by node y; the c value of this DP list pair is c(y) + Σ{C(x) : y is in the left subtree of x}. Figure 4.9 shows a possible BBST for the DP list {(3, 28), (5, 24), (3, 21)}. The leftmost external node contains the pair (3, 13), which represents the DP list pair (3, 21). The correct c value for the DP list pair is obtained by adding to 13 the C values in the ancestors of the external node.

Figure 4.9. BBST used to represent the DP list {(3, 28), (5, 24), (3, 21)}

To insert a new DP list pair into our BBST, we must be able to trace a path from the root to an appropriate external node. This path tracing is facilitated by the pointer M(x) in internal node x.
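A list-based sketch of ParallelMerge (my own index-based transcription of the pointer-based pseudocode of Figure 4.7):

```python
def parallel_merge(L1, L2):
    """Merge the DP lists of two parallel convex gates: walk both lists,
    summing c values over overlapping delta segments."""
    a = [list(p) for p in L1]       # mutable copies, since deltas get consumed
    b = [list(p) for p in L2]
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        d1, c1 = a[i]
        d2, c2 = b[j]
        if d1 < d2:
            out.append((d1, c1 + c2)); b[j][0] = d2 - d1; i += 1
        elif d1 > d2:
            out.append((d2, c1 + c2)); a[i][0] = d1 - d2; j += 1
        else:                       # equal deltas: advance both lists
            out.append((d1, c1 + c2)); i += 1; j += 1
    out += [tuple(p) for p in a[i:]] + [tuple(p) for p in b[j:]]
    return out

# The worked example: DP1 against the shifted DP2' = {(1, 24), (4, 23)}
# parallel_merge([(3, 28), (5, 24), (3, 21)], [(1, 24), (4, 23)])
#   -> [(1, 52), (2, 51), (2, 47), (3, 24), (3, 21)]
```
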
By using the c() value in the external node M(x) and the C values in the nodes from the root to x, we can compute the maximum c value of any DP pair in the left subtree of x. Since insertions require rotations, we show how the D() and C() values in internal nodes are to be changed when rebalancing rotations are done (Figure 4.10). Note that M() values remain unchanged during tree rotations and are omitted from the figure. The tuple (D(x), C(x)) of each internal node x is shown next to the node.

Figure 4.10. Update of the D()s and C()s of internal nodes during tree rotations: (a) LL rotation; (b) LR rotation; (c) RL rotation; (d) RR rotation

To merge the DP lists of two gates in a chain, we first perform an inorder traversal of the smaller DP list's BBST to extract the DP list's pairs. Then these pairs are inserted into the BBST for the larger DP list. During this insertion, pairs with the same c value are combined into a single pair. If the two DP lists are L1 and L2, the time needed to do the series merge is O(|L1| log(|L1| + |L2|)), where L1 is the smaller DP list.

For a parallel merge of two DP lists L1 and L2 (L1 is the smaller list), we need to identify, for each (δk, ck) in L1, the external nodes z in the BBST for L2 for which Σ_{i=1}^{k−1} δi < f(z) ≤ Σ_{i=1}^{k} δi, where the δi's are defined with respect to L1 and f(z) is the sum of the d() values of the external nodes in the BBST of L2 that lie to the left of z plus the d() value of z. Let x and y be the leftmost and rightmost such external nodes (see Figure 4.11). These nodes can be found in O(log |L2|) time using Σ_{i=1}^{k−1} δi and Σ_{i=1}^{k} δi together with the D values in the triples of the internal nodes of the BBST for L2. Actually, node x may already be known from the processing of the pair (δ_{k−1}, c_{k−1}) of L1. We need to increase the c values of the external nodes from x to y. This can be done in logarithmic time by changing the C correctors stored in the internal nodes on the paths from x and y to their common ancestor (see Figure 4.11).

Figure 4.11. Change of the c values of external nodes of L2 for the kth tuple of L1

In addition to the above change in C correctors, we may need to insert a new external node. If f(y) = Σ_{i=1}^{k} δi, then no insertion is needed; we simply increase c(y) by ck. Otherwise, we change the original node to (Σ_{i=1}^{k} δi − (f(y) − d(y)), c(y) + ck) and insert (f(y) − Σ_{i=1}^{k} δi, c(y)) into the BBST. The inserted external node is the x node for the next pair of L1.

When the BBST method is used on the worst-case example for the linked-list method (Figure 4.8), |L1| = k and |L2| ≤ kn for each of the n − 1 merges. Therefore the run time is O(kn log kn). The worst case for the BBST method arises when we continually merge DP lists of the same size. This worst case consists of log n stages of merges, each stage involving pairs of DP lists of the same size. In stage 1, n/2 pairs of lists each of size k are merged in O(k log 2k) time per pair to produce DP lists of size 2k each; in stage 2, n/4 pairs of DP lists of size 2k each are merged in O(2k log 4k) time per pair to produce DP lists of size 4k each; and so on. The total time is O(kn log kn log n). Figure 4.12 shows a circuit on which this worst-case bound is achieved. Let C0 be a circuit with a single module. Ci, i > 0, is a simple parallel circuit obtained from C_{i−1} as shown in Figure 4.12. The number of modules in Ci is 2^{i+1} + i(i − 1), and Ci requires 2^i parallel merges and 2^i − 1 series merges. The total cost is O(kn log kn log n).

Figure 4.12. Circuit C2, which exhibits the worst-case behavior

4.3 Tree Circuits

Gates in circuits with a tree topology (for example, distribution trees) can be resized by transforming the trees into equivalent single-gate circuits using the basic transformation shown in Figure 4.13. The transformation of Figure 4.13 first transforms a node, all of whose children are leaves, into an equivalent simple parallel circuit by the introduction of additional gates/nodes with delay r − ri, where ri is the required time for the output of leaf i and r = max{ri}. The c values for the new gates are 0. The simple parallel circuit can now be transformed into an equivalent single gate using the transformation of Figure 4.3. By repeatedly applying this transformation, any tree can be transformed into an equivalent single gate. Although the preceding transformation was described specifically for the CGR problem, it is easily extended to the CUGR and ConvexCGR problems using the ideas of Section 4.2.

Figure 4.13. Transformation of a basic tree to a simple parallel circuit

4.4 CGR with Multigate Modules Is NP-Hard

Suppose that a circuit is to be realized with modules that contain multiple gates. Increasing the delay of a module results in an increase in the delay of all gates in the module. Figure 4.14 shows a module v with two gates A and B, each a two-input, one-output gate. The delay of the selected module implementation is d(v); each unit increase in module delay reduces power consumption by c(v) and increases the delay of both A and B by one unit. We shall show that the CGR problem with multigate modules (MCGR) is NP-hard. For the proof, we show that if MCGR can be solved in polynomial time, then the one-in-three 3SAT problem [9] can also be solved in polynomial time.

Figure 4.14. A module v with two gates A and B

Definition 1 (one-in-three 3SAT) Input: A collection of clauses C1, C2, ..., Cm over variables x1, x2, ..., xn such that each clause is the disjunction of exactly three literals. Output: "yes" if and only if there is a truth assignment to the variables such that each clause has exactly one true literal.

Theorem 1 MCGR is NP-hard.

Proof. We show how to transform, in polynomial time, any instance I of the one-in-three 3SAT problem into an instance I' of the MCGR problem such that the maximum power reduction for I' is (2m + 1)n + m if and only if the answer to I is "yes". Here m is the number of clauses in I and n is the number of variables. For the transformation, we define two circuit subassemblies: the variable subassembly and the clause subassembly.

A variable subassembly consists of two multigate modules, each having two gates and connected as in Figure 4.15.

Figure 4.15. Variable subassembly for variable xi

The first module of the variable subassembly for variable xi is called module xi and the second is module x̄i. The inputs to gates A and B of module xi and to gate B of module x̄i are primary inputs which are available at time 0. One of the inputs to gate A of module x̄i is a primary input available at time 0 and the other input is the output of gate A of module xi. The output of gate A of module x̄i is a primary output which has a required arrival time of 1. The outputs of the two B gates are nonprimary outputs. The c value for each module is 2m + 1 and the initial delay of each is 0. Notice that we can increase the delay of either module xi or module x̄i (but not both) by 1 and still satisfy the arrival time requirement of the primary output. Therefore the maximum power reduction obtainable from a variable subassembly is 2m + 1. If module xi has delay 0, then we say that literal xi is true; otherwise xi is false. Similarly, if module x̄i has delay 0, the literal x̄i is true; otherwise x̄i is false. Although we can assign delays to the two modules so that both literals are false, delay assignments can make at most one literal true.
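For small instances, the one-in-three 3SAT predicate of Definition 1 can be checked by brute force. The encoding below is my own illustration (literals are signed 1-based variable indices), not part of the dissertation.

```python
from itertools import product

def one_in_three_3sat(clauses, n):
    """Return True iff some assignment to x_1..x_n makes exactly one literal
    true in every clause. A literal +i means x_i and -i means its negation."""
    for bits in product([False, True], repeat=n):
        if all(sum(bits[abs(l) - 1] == (l > 0) for l in cl) == 1
               for cl in clauses):
            return True
    return False

# (x1 v x2 v x3) alone: satisfiable with exactly one true literal (e.g. x1).
# Adding (-x1 v -x2 v -x3) demands exactly one false literal, i.e. exactly
# two true ones -- contradicting the first clause, so no assignment works.
```
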
We construct one variable subassembly for each of the n variables in the 3SAT instance I. The maximum power reduction obtainable from these n subassemblies is (2m + 1)n.

A clause subassembly consists of 3 modules with one gate each; each gate has 3 inputs and 1 output. Let l1, l2 and l3 be the three literals in a clause. Figure 4.16 shows the corresponding clause subassembly together with the inputs to each gate. These inputs are the outputs of the variable subassemblies. The outputs of the modules of the clause subassemblies are primary outputs with required time 1. The c value for each module in a clause subassembly is 1.

Figure 4.16. Clause subassembly for (l1 v l2 v l3)

The maximum power reduction obtainable from a clause subassembly is 3. This corresponds to the case when all 6 literals (l1, l̄1, l2, l̄2, l3 and l̄3) are false. We shall have one clause subassembly for each of the m clauses in I. Therefore the maximum power reduction available from all m clause subassemblies is 3m.

The circuit I' is comprised of the n variable and m clause subassemblies described above. If there is a truth assignment T for the variables of I such that the answer to the one-in-three 3SAT instance I is "yes", then make the delay of module xi 1 if xi is true in T; otherwise make the delay of module x̄i 1. Further, since exactly one literal is true in each clause of I, we can make the delay of exactly one module in each clause subassembly 1. The total power reduction is (2m + 1)n + m.

Now suppose there is a solution S to I' which gives us a power reduction ≥ (2m + 1)n + m. In S, each variable subassembly must have exactly one module with delay 1. To see this, observe that no variable subassembly can have two modules with delay 1, and if any variable subassembly has no module with delay 1, the power reduction obtained by S is at most (2m + 1)(n − 1) + 3m < (2m + 1)n + m. So, assume that each variable subassembly has exactly one module with delay 1.
This means that we have a consistent truth assignment; that is, there is no variable xi for which both xi and x̄i are true or both are false. Now let us determine the power reduction obtainable from the clause subassemblies. If l1 is the only literal that is true among l1, l2 and l3, we can make the delay of the topmost module 1 because the arrival times of o(l1), o(l2) and o(l3) are all 0. The delays of the remaining two modules must be 0 because the arrival time of o(l̄1) is 1. A similar analysis applies to the cases when l2 and l3 are the only true literals. We conclude that when exactly one literal of a clause is true, we can get a power reduction of at most 1 from its clause subassembly. Therefore we can get at most m units of power reduction from the m clause subassemblies, and this maximum of m is obtained only when exactly one literal of each clause is true. Hence the solution to the MCGR instance I' provides a power reduction ≥ (2m + 1)n + m if and only if the one-in-three 3SAT instance has answer "yes". □

4.5 General Circuits

4.5.1 The CGR Algorithm of Chen and Sarrafzadeh

Chen and Sarrafzadeh [3] have proposed a pseudopolynomial-time algorithm for the CGR problem. Let a(v) denote the arrival time of the signal at the output of gate v and let r(v) denote the required time for the signal at the output of gate v. For a primary input, a(v) is the time at which the signal becomes available, and for a primary output, r(v) is the required time for that output. Hence a(v) is known for primary inputs and r(v) is known for primary outputs.
The remaining a's and r's are defined as below (d(v) is the assigned delay of gate v):

a(v) = max_{u : (u,v) in E} {a(u) + d(v)}    (4.1)

r(v) = min_{w : (v,w) in E} {r(w) − d(w)}    (4.2)

Hence a(v) is the length of the longest delay path from the primary inputs to the output of v, and r(v) is the latest time by which the signal must arrive at the output of gate v so that it is still possible for the signal to reach the primary outputs by their required times. Note that a(v) and r(v) are very closely related to the early and late event times of activity networks [5]. The slack s(v) of gate v is

s(v) = r(v) − a(v)    (4.3)

A circuit is feasible if and only if all primary outputs arrive by their required times. From the definitions of r(v) and a(v), it follows that a circuit is feasible if and only if s(v) ≥ 0 for all v.

Figure 4.17(a) shows an example circuit graph. This corresponds to a circuit with three gates, 2 primary inputs and 2 primary outputs. The a() values for the two primary inputs are 0 and 2 respectively. The r() values for the two primary outputs are 5 and 4 respectively. The 3 gates a, b and c are shown as boxes. The c() value of a gate is given below its box. The selected delay for each gate is 1 and is shown inside the box. The a(v)|s(v)|r(v) values for each gate are given above the box.

Figure 4.17. Application of the algorithm of Chen and Sarrafzadeh [3] to a CGR circuit. (a) An example CGR circuit; (b) the sensitive graph

The algorithm of Chen and Sarrafzadeh comprises the following steps:

Step 1: Compute the slack for each node of G.
Step 2: If no node has slack > 0, stop.
Step 3: Compute the sensitive graph Gs from G as follows. Gs contains exactly those vertices of G that have slack > 0. (u, v) is an edge of Gs if and only if either a(v) − a(u) = d(v) or r(v) − r(u) = d(v). The weight of a vertex in Gs is its c() value.
Step 4: Compute the transitive closure graph Gt of Gs.
Step 5: Compute a maximum-weight independent set of Gt.
Step 6: Increase the delays of all gates in the maximum-weight independent set by 1.
Step 7: Go to Step 1.

The maximum-weight independent set of Gt may be computed using a maxflow algorithm [23]. This takes O(nm log(n^2/m)) time for a graph with n vertices and m edges. Since the algorithm of Chen and Sarrafzadeh [3] may reduce the power consumption by only one unit on each iteration, its complexity is O(Snm log(n^2/m)), where S is the obtained power reduction.

Figure 4.17(b) shows the graph Gs that corresponds to the graph G of Figure 4.17(a). The numbers inside the vertices are their weights. The transitive closure graph Gt of Gs is the same as Gs, and the maximum-weight independent set is {b, c} with a weight of 14 + 13 = 27. The delays of gates b and c are increased by 1 to obtain a power reduction of 27, and we proceed to the second iteration.

4.5.2 Comments on the Algorithm of Chen and Sarrafzadeh

1. Although we expect the algorithm of Chen and Sarrafzadeh [3] to be quite efficient on circuits for which only a small power reduction is possible (i.e., S is small), it is not expected to be efficient on circuits whose power consumption can be significantly reduced. For example, consider the one-gate circuit of Figure 4.18. The arrival time of the primary input is 0, the required time of the primary output is r, and the initial gate delay is 0. The algorithm of [3] takes r iterations to complete. We would like an algorithm that can increase gate delay by more than 1 on each iteration. In particular, it should be possible to obtain the optimal solution for the circuit of Figure 4.18 in one iteration.

Figure 4.18. A simple example CGR circuit

2. Chen and Sarrafzadeh [3] have proved that their algorithm indeed solves the CGR problem optimally. In most realistic situations, however, one or more of the gates will have an upper limit on the obtainable delay. That is, the problem will really be a CUGR problem.
The algorithm of [3] does not obtain optimal solutions to the CUGR problem. For example, consider the circuit of Figure 4.19(a). The numbers above a gate give the upper bound on the gate's delay. This is essentially the circuit of Figure 4.17(a) with the addition of upper bounds on gate delay. The first iteration of the algorithm of Chen and Sarrafzadeh [3] proceeds exactly as it did without the upper bounds, and we arrive at the configuration of Figure 4.19(b). For the second iteration, gates b and c are eliminated from Gs because their delays cannot be increased any further. Gs is now just a single-vertex graph. Gt = Gs and the maximum-weight independent set is {a}. The delay of gate a is increased by 1 and the algorithm terminates (see Figure 4.19(c)). The power reduction obtained is 13 + 14 + 15 = 42. However, the optimal power reduction of 44 is obtained by changing the delay of gate a to 3 and gate b to 2, and leaving the delay of gate c at 1 (see Figure 4.19(d)).

Figure 4.19. Application of the algorithm of Chen and Sarrafzadeh [3] to a CUGR circuit. (a) An example CUGR circuit; (b) delay of each gate after the first iteration; (c) delay of each gate after the algorithm of [3] terminates; (d) delay of each gate for optimal power reduction

3. As noted in Section 4.2, gates with an upper bound on their delay may be modeled by gates with convex delay-power consumption functions. Since the algorithm of Chen and Sarrafzadeh [3] does not obtain optimal solutions for the CUGR problem, it does not obtain optimal solutions for the ConvexCGR problem.

4.5.3 A Unified Framework for CGR, CUGR and ConvexCGR

The CGR, CUGR and ConvexCGR problems can all be solved in pseudopolynomial time by transforming the circuit into an activity-on-edge PERT (Performance Evaluation and Review Technique) network and then using the algorithm of Fulkerson [8] for project cost curves. The PERT network G for any circuit C is obtained as follows:

Step 1: For each gate v of C, G contains two vertices, v− and v+, and an edge (v−, v+). With this edge we associate a triple (a(v−, v+), b(v−, v+), c(v−, v+)), where a(v−, v+) = dI(v); b(v−, v+) = dI(v) + s(v) for the CGR problem and b(v−, v+) = min{dI(v) + s(v), du(v)} for the CUGR problem, where du(v) is the upper bound on the delay of gate v (the ConvexCGR case is discussed later); and c(v−, v+) = c(v).

Step 2: For each edge (u, v) in C, there is an edge (u+, v−) in G. The triple for this edge is (0, 0, 0).

Step 3: G has two special vertices s (source) and t (sink). There is an edge (s, v−) for every gate v that has a primary input. The triple for this edge is (max{a(v)}, max{a(v)}, 0), where the maximum is taken over the arrival times of all primary inputs to v. Additionally, there is an edge (v+, t) for every gate v that has a primary output. The triple for this edge is (a, a, 0), where a = max{required times of all primary outputs} − (required time of the primary output of gate v).

The PERT network for the circuit of Figure 4.17(a) is shown in Figure 4.20. Pairs of vertices (v−, v+) are enclosed in broken boxes. Edge triples are shown above each edge. The number inside each vertex is its initial τ() value, which is defined below. The interpretation of an edge triple (a, b, c) is: a is the smallest delay through the edge, b is the maximum delay through the edge, and c is the power reduction per unit increase in delay in the range a through b.

Figure 4.20. The PERT network for the CGR circuit of Figure 4.17(a)

The objective is to assign integer values τ() to the vertices of the PERT network, and integer weights w() to the edges, so as to maximize

Σ_{(x,y) in E} c(x, y) w(x, y)    (4.4)

subject to

a(x, y) ≤ w(x, y) ≤ b(x, y)    (4.5)

τ(y) − τ(x) ≥ w(x, y)    (4.6)

τ(s) = 0    (4.7)

τ(t) ≤ max{required times of primary outputs}    (4.8)

It is easy to see that the optimal solution to the above integer linear program defines an optimal solution for the power reduction problem. In this solution, the delay of gate v is τ(v+) − τ(v−) and the obtained power reduction is Σ_{(x,y) in E} c(x, y) w(x, y). The algorithm of Fulkerson [8] solves the above linear program using a primal-dual approach and a network flow algorithm. It begins by setting w(x, y) = b(x, y) for each edge and computes the smallest τ() values that satisfy Equations 4.5-4.7 using a topological-order scan of the PERT network beginning at vertex s. These become the initial τ() values; in Figure 4.20, the number in each vertex indicates this initial τ value. If the computed τ(t) satisfies Equation 4.8, we are done. If not, the w's are reduced using augmenting-path methods until we have an assignment of w's and τ's that satisfies all the constraints (Equations 4.5-4.8).

Fulkerson [8] has extended his algorithm to the case when c(x, y) is given by a convex function for each edge. This extension essentially increases the number of edges in G. Therefore we are also able to solve the ConvexCGR problem with this formulation. Although the asymptotic complexity of Fulkerson's method [8] is the same as that of the algorithm of Chen and Sarrafzadeh [3], we expect Fulkerson's algorithm to be faster for the following reasons:

1. Fulkerson's algorithm can change the delay of gates by more than 1 on each iteration. For example, the circuit of Figure 4.18 is handled with just one iteration.

2. Successive iterations of Fulkerson's algorithm use the results of preceding iterations; each iteration requires only the computation of new augmenting paths. Successive iterations of the algorithm of Chen and Sarrafzadeh [3] essentially start from scratch, recomputing Gs, Gt and the maximum-weight independent set.
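The arrival/required/slack computation of Equations 4.1-4.3 (equivalently, the early and late event times underlying the PERT view) amounts to one forward and one backward topological sweep. A sketch, with my own graph encoding:

```python
from graphlib import TopologicalSorter

def slacks(edges, delay, a_in, r_out):
    """Compute s(v) = r(v) - a(v) per Eqs. 4.1-4.3.
    edges: (u, v) signal-flow pairs; delay: gate delays; a_in: arrival times
    of primary inputs; r_out: required times at primary-output vertices."""
    succ, pred = {}, {}
    verts = set(delay) | set(a_in) | set(r_out)
    for u, v in edges:
        succ.setdefault(u, []).append(v)
        pred.setdefault(v, []).append(u)
        verts |= {u, v}
    order = list(TopologicalSorter({v: pred.get(v, []) for v in verts}).static_order())
    a = dict(a_in)
    for v in order:                              # Eq. 4.1, forward sweep
        if v not in a:
            a[v] = max(a[u] for u in pred[v]) + delay[v]
    r = dict(r_out)
    for v in reversed(order):                    # Eq. 4.2, backward sweep
        if v not in r:
            r[v] = min(r[w] - delay[w] for w in succ[v])
    return {v: r[v] - a[v] for v in verts}       # Eq. 4.3

# Tiny chain p -> g -> o: input available at 0, output required by 5,
# unit gate delays; every vertex ends up with slack 3.
s = slacks([('p', 'g'), ('g', 'o')], {'g': 1, 'o': 1}, {'p': 0}, {'o': 5})
```

The circuit is feasible exactly when every returned slack is nonnegative, matching the feasibility criterion stated above.
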
4.6 The General Gate Resizing Problem (GGR)

Chen and Sarrafzadeh [3] have proposed a heuristic for the general gate resizing problem. We show how our methodology of the previous section can be extended to obtain a heuristic for the GGR problem with convex (delay, power consumption) pairs. Let the (delay, power consumption) pairs for a gate be (d1, p1), (d2, p2), ..., (dk, pk) with d1 < d2 < ... < dk and p1 > p2 > ... > pk. (d1, p1) is the pair for the initially selected gate size. The pairs are convex if and only if c1 > c2 > ... > c_{k−1}, where ci = (pi − p_{i+1})/(d_{i+1} − di). In practice, we expect most GGR instances to be convex. To solve GGR with convex pairs, we construct a PERT network as before. However, each edge (v−, v+) is now a chain as shown in Figure 4.21, where δi = d_{i+1} − di.

Figure 4.21. Transformation of vertex v into a chain in the PERT network

Once the network has been solved using the algorithm of Fulkerson [8], the flows in the chains are adjusted to obtain a feasible solution to the GGR problem. Since c1 > c2 > ... > c_{k−1}, we may assume that, in the optimal solution, the delay of edge (vi, v_{i+1}) is made δi before the delay of edge (v_{i+1}, v_{i+2}) is increased above 0. For each chain, find the rightmost edge (vi, v_{i+1}) whose delay exceeds 0. If there is no such edge, select edge (v1, v2). If the delay of the selected edge is less than δi, set the delay of gate v to di; otherwise set it to d_{i+1}.

4.7 Experimental Results

We implemented our CGR and GGR algorithms as well as those of Chen and Sarrafzadeh [3] in C and benchmarked them on a SUN SPARCstation 5. All algorithms were coded using similar programming methodologies so that any observed performance differences can be attributed to algorithmic differences rather than to differences in programming style. The test circuits we used include combinational circuits from the MCNC91 benchmark suite. The library we used includes NAND, NOR, and inverter gates.
Each gate has a minimum delay of one clock cycle. Technology mapping was done using Berkeley SIS, and power consumption was calculated using a 5V supply voltage and a 20MHz clock frequency. The switching activity factors of individual gates were calculated using the symbolic simulation technique described in Ghosh et al. [10], which is implemented in Berkeley SIS. Tables 4.1 and 4.2 give our experimental results for the CGR algorithms. The number of gates in each circuit is given by n; ta is the length of the critical path in the circuit (i.e., the length of the longest path from a primary input to a primary output when gate delays equal the initially assigned delays). The run times are given in seconds. In the experiments reported in Table 4.1, the required time for the output signal was set equal to the critical path length, ta, of the circuit. Since both CGR algorithms produce provably optimal solutions, the only difference between them is run time. On 6 of the 9 tested circuits, our algorithm is noticeably faster; on another 2, the two algorithms took about the same time; and on only 1 of the 9 circuits did the algorithm of [3] outperform our algorithm. The disparity between the two algorithms becomes more striking when the required time for the output signal is increased beyond the critical path length ta. The run times for the case when the required time is 2ta are shown in Table 4.2. Now, our algorithm provided a speedup between 4 and 11 over the algorithm of [3]! Table 4.3 shows the relative performance of the two algorithms as we increase the required time tr for one of the test circuits, squar5. The relative speedup provided by our algorithm increases from a low of 0.78 (tr = ta; Table 4.1) to a high of 42 when tr = 100 = 12.5ta. Tables 4.4 to 4.9 show the results for the two GGR algorithms using convex pairs. Our library includes five implementations (i.e., (delay, power consumption) pairs) for each gate type (NAND, NOR, and INVERTER).
Since both GGR algorithms are heuristics, we compare the power reductions obtained by each rather than their run times. The first group of tables (Tables 4.4 to 4.6) shows the results for the case tr = ta; the second group (Tables 4.7 to 4.9) is for the case tr = 2ta.

Table 4.1. Run time and speedup when required time is equal to critical path length

circuit  n    ta   [3]   Our   speedup
5xp1     158  11   0.39  0.10    3.99
b12      147  13   0.34  0.09    3.91
clip     278  15   1.28  0.25    5.07
rd73     219  17   1.74  0.21    8.37
sao2     225  18   1.13  0.21    5.47
sct      170   8   0.10  0.09    1.10
squar5    99   8   0.03  0.04    0.78
t481      59  10   0.01  0.01    1.00
ttt2     340  11   4.94  0.47   10.62

Table 4.2. Run time and speedup when required time is doubled

circuit  n    ta   Chen  Our   speedup
5xp1     158  11   1.04  0.14    7.22
b12      147  13   0.55  0.12    4.46
clip     278  15   4.82  0.43   11.16
rd73     219  17   1.55  0.29    5.29
sao2     225  18   1.61  0.30    5.40
sct      170   8   1.29  0.14    9.47
squar5    99   8   0.22  0.05    4.29
t481      59  10   0.16  0.01   13.25
ttt2     340  11   7.17  0.67   10.75

Table 4.3. Run time for squar5 with different required times

tr    Chen   Our    speedup
10    0.076  0.044    1.72
20    0.315  0.052    6.06
30    0.555  0.054   10.28
40    0.795  0.052   15.29
50    1.033  0.055   18.78
100   2.235  0.053   42.17

Within the same group of tables, the circuits differ in the selection of initial (delay, power consumption) pairs for the gates. As can be seen, when tr = ta, our algorithm obtained a larger power reduction in 18 of the 27 tests; when tr = 2ta, our algorithm obtained a larger power reduction in all 27 cases. Our GGR heuristic took between 0.01 seconds and 5.60 seconds for the test cases. This is considerably more time than required by the heuristic of Chen and Sarrafzadeh [3], which took less than 0.30 seconds for each test. However, the run time of our heuristic is reasonable and our heuristic generally produces better solutions than those produced by the heuristic of Chen and Sarrafzadeh [3]. Table 4.4.
Power reduction of GGR algorithms (1)

circuit  Chen    Our     Diff    % imp
5xp1      47416   51386    3970    8.37
b12       56628   63191    6563   11.59
clip      70912   77252    6340    8.94
rd73      76374   88223   11849   15.51
sao2      88922   96865    7943    8.93
sct       54756   53809    -947   -1.73
squar5    32603   34481    1878    5.76
t481       3428    3466      38    1.11
ttt2     139626  151590   11964    8.57

Table 4.5. Power reduction of GGR algorithms (2)

circuit  Chen    Our     Diff    % imp
5xp1     102586  112719   10133    9.88
b12      101174  108306    7132    7.05
clip     171917  194559   22642   13.17
rd73     147608  154986    7378    5.00
sao2     132655  137355    4700    3.54
sct       98939  111848   12909   13.05
squar5    68189   73090    4901    7.19
t481      25345   27461    2116    8.35
ttt2     231072  247243   16171    7.00

Table 4.6. Power reduction of GGR algorithms (3)

circuit  Chen    Our     Diff    % imp
5xp1      41448   46050    4602   11.10
b12       42716   45812    3096    7.25
clip      62986   66819    3833    6.09
rd73      63160   71430    8270   13.09
sao2      74291   80777    6486    8.73
sct       46212   44805   -1407   -3.04
squar5    26192   28172    1980    7.56
t481       3222    2652    -570  -17.69
ttt2     112569  119937    7368    6.55

Table 4.7. Power reduction of GGR algorithms (4)

circuit  Chen    Our     Diff    % imp
5xp1      90859  100867   10008   11.01
b12       80015   84321    4306    5.38
clip     157222  171358   14136    8.99
rd73     120554  126382    5828    4.83
sao2     109121  111934    2813    2.58
sct       79458   86699    7241    9.11
squar5    58536   61768    3232    5.52
t481      18361   20373    2012   10.96
ttt2     177478  190093   12615    7.11

Table 4.8. Power reduction of GGR algorithms (5)

circuit  Chen    Our     Diff    % imp
5xp1      29651   26047   -3604  -12.15
b12       31532   31771     239    0.76
clip      40721   39796    -925   -2.27
rd73      45337   45015    -322   -0.71
sao2      53423   58414    4991    9.34
sct       26175   22871   -3304  -12.62
squar5    13123   13186      63    0.48
t481       1843    1622    -221  -11.99
ttt2      72498   70742   -1756   -2.42

Table 4.9. Power reduction of GGR algorithms (6)

circuit  Chen    Our     Diff    % imp
5xp1      53356   67287   13931   26.11
b12       57504   65071    7567   13.16
clip     101648  117108   15460   15.21
rd73      91287   99225    7938    8.70
sao2      91862   97392    5530    6.02
sct       56641   62326    5685   10.04
squar5    38067   41557    3490    9.17
t481       9782   14015    4233   43.27
ttt2     128515  147654   19139   14.89

4.8 Conclusion

We have developed polynomial time algorithms for the CGR, CUGR, and ConvexCGR problems for series-parallel and tree circuits. The CGR problem with multi-gate modules was shown to be NP-hard. We presented a unified framework for the solution of CGR, CUGR, and ConvexCGR problems on general circuits. This framework can also be used to obtain a heuristic for the ConvexGGR problem. Experimental results obtained by us indicate that our CGR algorithm is faster than the CGR algorithm of Chen and Sarrafzadeh [3] and that our GGR heuristic often obtains better solutions than those obtained by the GGR heuristic of Chen and Sarrafzadeh [3].

CHAPTER 5
CONCLUSIONS AND FUTURE WORK

We have considered some problems that arise in the automation of various stages of the VLSI physical design process. The first problem we considered is transistor folding to reduce layout area. An algorithm was developed to minimize the layout area. This algorithm outperforms the existing one both asymptotically and experimentally. We considered the problem of selecting the implementations of two rows of modules on a routing channel so as to satisfy the net-span constraints as well as minimize the channel density. An algorithm was developed by applying the limited branching method. Experimental results indicate a significant reduction in run time over the existing algorithm. Another problem we considered is low power gate resizing. We increase the area of gates to reduce the power consumption while satisfying the time constraint for the circuit. Fast algorithms were developed for series-parallel and tree circuits, and a variant of the problem with multi-gate modules was proved to be NP-hard. We also developed a unified framework for the solution of CGR, CUGR, and ConvexCGR problems on general circuits. We used this framework to obtain a heuristic for the ConvexGGR problem.
Experimental results indicate a significant reduction in run time for our CGR algorithm over the existing algorithm. Our ConvexGGR heuristic often obtains better solutions than those obtained by the heuristic of Chen and Sarrafzadeh [3]. Future research on these problems could include the development of better algorithms for the multichannel 2PDMIS problem (especially for c > 2); the development of better heuristics for the general single- and multichannel PDMIS problems; and faster, improved heuristics for the GGR problem.

REFERENCES

[1] R. Iris Bahar, Gary D. Hachtel, Enrico Macii, and Fabio Somenzi. A symbolic method to reduce power consumption of circuits containing false paths. In IEEE International Conference on Computer-Aided Design, pages 368-371, San Jose, California, November 1994.

[2] Yang Cai and D. F. Wong. On shifting blocks and terminals to minimize channel density. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 13(2):178-186, February 1994.

[3] DeSheng Chen and Majid Sarrafzadeh. An exact algorithm for low power library-specific gate resizing. In Proceedings of the 33rd Design Automation Conference, pages 783-788, Las Vegas, Nevada, 1996.

[4] Z. Dai and K. Asada. MOSIZ: A two-step transistor sizing algorithm based on optimal timing assignment method for multistage complex gates. In Proceedings 1989 Custom Integrated Circuits Conference, pages 17.3.1-17.3.4, San Diego, California, May 1989.

[5] Salah E. Elmaghraby. Activity Networks: Project Planning and Control by Network Models. John Wiley and Sons, New York, 1977.

[6] S. Even, A. Itai, and A. Shamir. On the complexity of timetable and multicommodity flow problems. SIAM Journal on Computing, 5(4):691-703, December 1976.

[7] J. P. Fishburn and A. E. Dunlop. TILOS: A posynomial programming approach to transistor sizing. In IEEE International Conference on Computer-Aided Design, pages 326-328, Santa Clara, California, November 1985.

[8] L. R. Ford and D. R. Fulkerson.
Flows in Networks. Princeton University Press, Princeton, New Jersey, 1962.

[9] Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company, San Francisco, 1979.

[10] A. Ghosh, S. Devadas, K. Keutzer, and J. White. Estimation of average switching activity in combinational and sequential circuits. In Proceedings of the 29th Design Automation Conference, pages 253-259, Anaheim, California, 1992.

[11] T. W. Her, Ting-Chi Wang, and D. F. Wong. Performance-driven channel pin assignment algorithms. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 14(7):849-857, July 1995.

[12] T. W. Her and D. F. Wong. Cell area minimization by transistor folding. In Proceedings 1993 Euro-DAC, pages 172-177, Hamburg, Germany, 1993.

[13] T. W. Her and D. F. Wong. On over-the-cell channel routing with cell orientations consideration. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 14(6):766-772, June 1995.

[14] T. W. Her and D. F. Wong. Module implementation selection and its application to transistor placement. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 16(6):645-651, June 1997.

[15] Ellis Horowitz, Sartaj Sahni, and Dinesh Mehta. Fundamentals of Data Structures in C++. W. H. Freeman and Company, New York, 1995.

[16] C. Y. Hou and C. Y. Chen. A pin permutation algorithm for improving over-the-cell channel routing. In Proceedings of the 29th Design Automation Conference, pages 594-599, Anaheim, California, 1992.

[17] Y. C. Hsieh, C. Y. Hwang, Y. L. Lin, and Y. C. Hsu. LiB: A CMOS cell compiler. IEEE Transactions on Computer-Aided Design, 10(8), 1991.

[18] Jaewon Kim and S. M. Kang. An efficient transistor folding algorithm for row-based CMOS layout design. In Proceedings of the 34th Design Automation Conference, pages 456-459, Anaheim, California, 1997.

[19] H. Kobayashi and C. E. Drozd.
Efficient algorithms for routing interchangeable terminals. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, CAD-4(3):204-207, 1985.

[20] Wing-Ning Li, Andrew Lim, Prathima Agrawal, and Sartaj Sahni. On the circuit implementation problem. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 12(8):1147-1156, August 1993.

[21] Li-Shin Lin and Sartaj Sahni. Maximum alignment of interchangeable terminals. IEEE Transactions on Computers, 37(10):1166-1177, October 1988.

[22] C. Papadimitriou and K. Steiglitz. Combinatorial Optimization: Algorithms and Complexity. Prentice-Hall, Englewood Cliffs, New Jersey, 1982.

[23] I. Rival, editor. Graphs and Orders: The Role of Graphs in the Theory of Ordered Sets and its Applications. D. Reidel Publishing Company, Dordrecht, Holland, May 1984.

[24] Sartaj Sahni. Data Structures, Algorithms, and Applications in C++. McGraw-Hill, Boston, Massachusetts, 1998.

[25] Sartaj Sahni and San-Yuan Wu. Two NP-hard interchangeable terminal problems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 7(4):467-472, April 1988.

[26] Sachin S. Sapatnekar, Vasant B. Rao, Pravin M. Vaidya, and Sung-Mo Kang. An exact solution to the transistor sizing problem for CMOS circuits using convex optimization. IEEE Transactions on Computer-Aided Design, 12(11):1621-1634, November 1993.

[27] Naveed Sherwani. Algorithms for VLSI Physical Design Automation. Kluwer Academic Publishers, Norwell, Massachusetts, 2nd edition, 1995.

[28] J. Shyu, J. P. Fishburn, A. E. Dunlop, and A. L. Sangiovanni-Vincentelli. Optimization-based transistor sizing. IEEE Journal of Solid-State Circuits, 23(2):400-409, April 1988.

[29] A. Stauffer and R. Nair. Optimal CMOS cell transistor placement: A relaxation approach. In IEEE International Conference on Computer-Aided Design, pages 364-367, Santa Clara, California, November 1988.

[30] Venkat Thanvantri. Efficient Algorithms for Electronic CAD.
PhD dissertation, University of Florida, Gainesville, Florida, 1995.

[31] Spyros Tragoudas and Ioannis G. Tollis. River routing and density minimization for channels with interchangeable terminals. Integration, the VLSI Journal, 15:151-178, 1993.

[32] T. Uehara and W. VanCleemput. Optimal layout of CMOS functional arrays. IEEE Transactions on Computers, C-30(5), 1981.

[33] J. Valdes, R. Tarjan, and E. Lawler. The recognition of series parallel digraphs. SIAM Journal on Computing, 11(2):298-313, May 1982.

[34] S. Wimer, R. Y. Pinter, and J. A. Feldman. Optimal chaining of CMOS transistors in a functional cell. IEEE Transactions on Computer-Aided Design, 30(5), 1987.

[35] Takeshi Yoshimura and Ernest S. Kuh. Efficient algorithms for channel routing. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, CAD-1(1):25-35, January 1982.

BIOGRAPHICAL SKETCH

Yu Cheuk Cheng was born on January 23, 1971, in Hong Kong. He received his Bachelor of Engineering degree in computer engineering from the University of Hong Kong, Hong Kong, in 1992. He received his Master of Philosophy degree in computer science from the Hong Kong University of Science and Technology, Hong Kong, in 1994. He will receive his Doctor of Philosophy degree from the Department of Computer and Information Science and Engineering, the University of Florida, Gainesville, Florida, in December 1998. His research interests include VLSI CAD, algorithm design, and theory of computation.

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Sartaj K.
Sahni, Chairman
Professor of Computer and Information Science and Engineering

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Timothy A. Davis
Associate Professor of Computer and Information Science and Engineering

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Richard E. Newman
Assistant Professor of Computer and Information Science and Engineering

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Sanguthevar Rajasekaran
Associate Professor of Computer and Information Science and Engineering