
Analysis and Design of High Speed I/O Links using a Current-Density Centric Logical Effort Model

Permanent Link: http://ufdc.ufl.edu/UFE0042885/00001

Material Information

Title: Analysis and Design of High Speed I/O Links using a Current-Density Centric Logical Effort Model
Physical Description: 1 online resource (168 p.)
Language: English
Creator: HU,YAN
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2011

Subjects

Subjects / Keywords: CURRENT -- DECISION -- ELECTRICAL -- EQUALIZER -- FEEDFORWARD -- HIGH -- LINK -- LOGICAL -- PHASE -- RECEIVER -- TRANSMITTER
Electrical and Computer Engineering -- Dissertations, Academic -- UF
Genre: Electrical and Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: Given the increasing demand for higher functional density and processing throughput in high-performance computational platforms such as multi-core and network processors, high-speed, low-power input/output (I/O) links are desirable. In particular, reducing power consumption is critical when hundreds of parallel I/Os are integrated on one chip. This dissertation develops a current-density centric logical effort model to mathematically analyze the speed and power performance of I/O links. At the system level, a fast design exploration methodology is demonstrated to search for the optimal link parameters and achieve the optimal energy-delay metrics across technology nodes. At the circuit level, new design techniques are proposed for high-speed I/O links up to Tb/s. Chip prototypes are demonstrated in hardware to achieve high-speed link operation with good signal integrity and energy-delay product.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by YAN HU.
Thesis: Thesis (Ph.D.)--University of Florida, 2011.
Local: Adviser: Bashirullah, Rizwan.
Electronic Access: RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2012-04-30

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2011
System ID: UFE0042885:00001



Full Text

PAGE 1

ANALYSIS AND DESIGN OF HIGH SPEED I/O LINKS USING A CURRENT-DENSITY CENTRIC LOGICAL EFFORT MODEL

By

YAN HU

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2011

PAGE 2

© 2011 Yan Hu

PAGE 3

To my family

PAGE 4

ACKNOWLEDGMENTS

I would first like to thank my advisor, Dr. Rizwan Bashirullah, for giving me this splendid opportunity to work towards a Ph.D. under his supervision. His constant guidance and encouragement provided me a clear path for my study, and I have truly enjoyed working with him over the years, acquiring technical knowledge as well as other soft skills. I would also like to thank Dr. William Eisenstadt, Dr. Jenshan Lin and Dr. Loc Vu Quoc for their valuable time and for being on my Ph.D. committee.

I feel very fortunate to have worked with all my colleagues in the ICR group, especially Pengfei Li, Hong Yu, Chun ming Tang, Zhiming Xiao, Jikai Chen, Walker Turner, Lin Xue, Qiuzhong Wu, Chris Dougherty, Ian Mclemore, Abhimanyu Kapoor and Abhinav Pandey, whose helpful discussions, suggestions and friendship have greatly improved the quality of my work. Without them, the completion of this project would not have been possible.

Finally, I would like to acknowledge the love and continuous encouragement from my parents, my husband and my son, Ryan, to whom I dedicate this work.

PAGE 5

TABLE OF CONTENTS

page

ACKNOWLEDGMENTS .... 4
LIST OF TABLES .... 8
LIST OF FIGURES .... 9
ABSTRACT .... 15

CHAPTER

1 INTRODUCTION .... 18
  1.1 Motivation .... 18
  1.2 Thesis Organization .... 21
2 HIGH SPEED PARALLEL I/O LINKS .... 23
  2.1 Overview .... 23
  2.2 Band Limited Channel and Signal Integrity .... 23
  2.3 Interference .... 29
    2.3.1 Inter Symbol Interference (ISI) .... 29
      2.3.1.1 Skin effect .... 29
      2.3.1.2 Dielectric loss .... 30
    2.3.2 Inter Channel Interference (Crosstalk) .... 31
  2.4 Equalization Techniques .... 33
    2.4.1 Linear and Nonlinear Equalizers .... 34
    2.4.2 Transmitter and Receiver Side Equalizers .... 35
3 CURRENT CENTRIC LOGICAL EFFORT MODEL .... 38
  3.1 Motivation .... 38
  3.2 Current Density Property in CMOS Process .... 39
    3.2.1 Characteristic Current Density in CMOS Transistors .... 39
    3.2.2 Characteristic Current Density in CML Circuits .... 42
  3.3 Current Density Centric Logical Effort Model .... 48
    3.3.1 CML Gates Delay .... 49
    3.3.2 Logical Effort Model for Constant Current Densities .... 53
      3.3.2.1 RC delay for J
PAGE 6

  3.4 Model Validation .... 62
    3.4.1 Analysis of a CML Inverter .... 63
    3.4.2 Analysis of a CML Multiplexer .... 64
  3.5 Design Example I: A PRBS Generator .... 67
  3.6 Design Example II: A High Speed Transmitter .... 70
4 SYSTEM MODELING FOR HIGH SPEED PARALLEL I/O LINKS .... 78
  4.1 Motivation .... 78
  4.2 Package Strategies for High Speed I/O Links .... 78
  4.3 Design Framework for High Speed I/O Links .... 86
    4.3.1 System Level Model .... 86
    4.3.2 A High Speed Transceiver System Blocks .... 87
    4.3.3 Link Optimization Flowchart .... 90
  4.4 Analysis Results .... 92
    4.4.1 Optimal Data Rate per Link .... 92
    4.4.2 Energy per bit Cost for Transceiver Blocks .... 93
    4.4.3 Optimal Aggregate Bandwidth .... 94
    4.4.4 Power Density vs. Aggregate Bandwidth .... 95
5 AN ACTIVE CROSSTALK EQUALIZER FOR PARALLEL HIGH SPEED LINKS .... 97
  5.1 Motivation .... 97
  5.2 Crosstalk Analysis .... 97
  5.3 Crosstalk Induced Jitter Equalizer .... 100
  5.4 Circuit Implementation .... 103
    5.4.1 System Architecture .... 104
    5.4.2 FFE and CIJ Equalizer .... 105
    5.4.3 Phase Interpolator .... 107
    5.4.4 Phase Locked Loop .... 108
      5.4.4.1 VCO .... 109
      5.4.4.2 PFD .... 111
      5.4.4.3 Voltage doubler and level shifter .... 112
      5.4.4.4 Charge pump .... 114
  5.5 Chip Fabrication .... 115
  5.6 Experimental Results .... 115
    5.6.1 Channel Characterization .... 115
    5.6.2 Eye Diagram Measurement .... 118
6 LOW POWER HIGH SPEED EQUALIZED CHIP TO CHIP LINK .... 121
  6.1 Motivation .... 121
  6.2 Low Power Link Consideration .... 122
  6.3 Link Circuits .... 126
    6.3.1 Link Architecture .... 126
    6.3.2 Front End Circuit .... 127
    6.3.3 Transmitter .... 132

PAGE 7

    6.3.3 Receiver .... 133
  6.4 Chip Fabrication .... 134
  6.5 Experimental Results .... 135
    6.5.1 Test Setup .... 135
    6.5.2 Air Gap Channel Measurement .... 136
    6.5.3 Eye Diagrams and BER Measurement .... 137
    6.5.4 Energy Measurement .... 140
7 SUMMARY AND FUTURE WORK .... 142
  7.1 Summary .... 142
  7.2 Future Work .... 144

APPENDIX

A MATLAB CODES FOR DESIGN FRAMEWORK .... 145
B MATLAB CODES FOR STATISTICAL LINK ANALYSIS .... 155

LIST OF REFERENCES .... 161
BIOGRAPHICAL SKETCH .... 168

PAGE 8

LIST OF TABLES

Table  page

2-1 Summary of seven test cases for HM-Zd backplane from Tyco .... 25
2-2 .... 30
3-1 Design parameters in 130 nm CMOS technology .... 55
3-2 Logical effort model for CML inverter with constant input and output voltage swings .... 56
3-3 Logical effort model for CML inverter with non-constant input and output voltage swings .... 59
3-4 Predicted and measured FFE taps for a 16 in Tyco channel .... 76
3-5 Predicted and measured FFE taps for a 30 in Tyco channel .... 76
4-1 Performance summary for the I/O pad arrangements shown in Figure 4-4 .... 85
4-2 Performance metrics for 16 nm CMOS technology .... 96
4-3 Summary of package type B specifications .... 96
5-1 Truth table of mode detect block .... 102
5-2 Jitter reduction using FFE and CIJ equalization .... 118
6-1 Power breakdown of the transceiver from [77] and [78] .... 121
6-2 Link performance summary and comparison with the previous work .... 141

PAGE 9

LIST OF FIGURES

Figure  page

1-1 Published I/O data for A) bit rate and B) energy per bit versus technology nodes .... 19
1-2 Performance metrics and design tradeoffs in high speed parallel I/O links .... 20
2-1 Parallel I/O interface and system diagram of each I/O link .... 24
2-2 Cross section of a HM-Zd Tyco backplane system .... 25
2-3 Insertion loss of Tyco backplane system .... 26
2-4 Sample pulse response of the channel to a 200 ps wide pulse .... 27
2-5 Convolution of ISI PDFs at corresponding ISI amplitude .... 28
2-6 Plots of ISI PDF and BER for Tyco backplane case 1 .... 28
2-7 Skin effect and dielectric loss from a 30 inch FR4 stripline .... 31
2-8 Crosstalk and pulse response for Tyco backplane. A) NEXT and B) FEXT .... 33
2-9 Conceptual illustration of equalization with channel response, equalizer response and equalized system response in the frequency domain .... 33
2-10 A linear FIR equalizer .... 34
2-11 A nonlinear equalizer: decision feedback equalizer .... 35
2-12 System diagram for a high speed link .... 36
2-13 Pulse response and BER performance W/O equalization. A) FFE coefficients of 0.3 1 0 0 and DFE coefficients of 1 0.3 0.2, B) FFE coefficients of 0.3 1 0.3 0.2 and DFE coefficients of 1 0 0 .... 37
3-1 Simulated f_T as a function of current density for 180 nm, 130 nm, 90 nm, 65 nm CMOS technology nodes .... 41
3-2 Schematic of a typical current mode logic inverter .... 42
3-3 Open circuit time constant of CML INV across technology nodes .... 45
3-4 A high speed CML buffer example .... 45
3-5 Speed performance of a CML buffer biased at various current densities. A) Simulated eye diagrams with various current density bias, B) 20-80% rise time .... 46

PAGE 10

3-6 Schematic of the MOS CML A) Latch and B) Selector .... 47
3-7 A CML buffer composed of inverters .... 49
3-8 Simulated input voltage swing of a CML inverter vs. current density J .... 53
3-9 Design metrics of CML gate vs. logical effort g .... 58
3-10 A data path composed of CML gates .... 61
3-11 Calculated logical effort and simulated delays as a function of logical and electrical efforts for a CML inverter with W_G = 1 µm, 4 µm and 40 µm .... 63
3-12 Simulated energy delay product as a function of data rate and electrical fan out h for a CML inverter with W_load = 80 µm .... 64
3-13 Simulated energy delay product as a function of data rate and electrical fan out h for a CML inverter with W_load = 80 µm .... 65
3-14 Block diagram of a high speed serializer simulation bench .... 65
3-15 Simulated and calculated normalized delay as a function of logical effort g and error histogram .... 66
3-16 Plot of simulated energy delay products as a function of logical effort and data rate .... 66
3-17 Block diagram of a 2^7-1 PRBS generator .... 67
3-18 Schematic of a CML XOR circuit .... 68
3-19 Schematic of a CML output driver .... 68
3-20 Die picture of the PRBS data generator chip .... 69
3-21 Plot of the attenuation for the PRBS chip output data .... 69
3-22 Measured eye diagrams for the PRBS chip .... 70
3-23 Measured and simulated energy delay performance for the PRBS chip .... 70
3-24 Block diagram of a high speed transmitter with FFE .... 71
3-25 Schematic of current DAC .... 71
3-26 Block diagram of the FFE and output driver for TX .... 72
3-27 Block diagrams of 2^7-1 parallel PRBS data generator and CMOS D-FF .... 74
3-28 Transmitter chip die photo .... 74

PAGE 11

3-29 Measured channel mag. response .... 75
3-30 Signal attenuation at Nyquist frequency W/ FFE and vertical eye opening W/O FFE .... 75
3-31 Energy per bit performance for TX and measured statistical eye diagram. A) 16 in Tyco channel, B) 30 in Tyco channel .... 77
4-1 BGA packages from NEC and IBM .... 79
4-2 Package trends from ITRS: A) flip chip and BGA, B) interposer .... 80
4-3 3D packaging strategy for high speed I/O interconnect network .... 82
4-4 Pad arrangements and escape routing for four rows of differential I/Os and high density peripheral I/Os. Case A: inline pads with 2 lines between pads and PTH vias. Case B: staggered pads with 3 lines between pads and blind or buried vias. Case C: inline pads with up to 4 lines between pads and PTH vias .... 84
4-5 D = 30 µm, chip area equals 310 mm², number of stripline signal layers is 2 .... 85
4-6 Number of different IOs and IO chip area as a function of total chip area .... 85
4-7 System level link model for a high speed transceiver with FFE and DFE .... 87
4-8 A typical high speed I/O example: A) transmitter and PLL, B) receiver and CDR .... 88
4-9 Circuit model for a transceiver front end .... 89
4-10 Flow chart of a full link simulation platform .... 90
4-11 An I/O link optimization framework .... 91
4-12 Energy/bit in mW/Gb/s with data rate per channel with 1dB/GHz and 8dB/GHz channel roll offs .... 93
4-13 Energy/bit of link components in mW/Gb/s for 16 nm CMOS design at different loss rate of 1dB/GHz and 8dB/GHz for channels .... 93
4-14 3D and contour plot for energy/bit of link components in mW/Gb/s with total aggregate bandwidth and chip area (case B) for 16 nm CMOS design at different loss rate of 1dB/GHz and 8dB/GHz for channels .... 95
4-15 Power density in W/mm² with aggregate bandwidth in Tb/s for fixed die area of 310 mm² .... 96
5-1 Lossless coupled transmission lines model .... 98
5-2 Crosstalk induced jitter generated in the data eye with data transitions .... 99

PAGE 12

5-3 Methods of CIJ equalization. A) conventional, B) proposed .... 101
5-4 Block diagram of the CIJ clock generation .... 101
5-5 Block diagram of mode detect block .... 102
5-6 High speed CIJ clock generation circuit with the tree architecture. A) Circuit block diagram and B) Timing diagram .... 103
5-7 Block diagram of the 4-channel transmitter .... 104
5-8 Transmitter front end with the CIJ equalization and FFE .... 105
5-9 Timing diagram of the transmitter data path and mode detect path .... 106
5-10 Schematic of the retiming mode detect block .... 106
5-11 Schematic of the phase interpolator .... 108
5-12 PLL block diagram .... 109
5-13 4-stage ring oscillator with the multiple pass loop .... 110
5-14 Block diagram of the 1/2 frequency divider and schematic of the CML latch .... 111
5-15 Conventional PFD topology .... 111
5-16 Schematic of the voltage doubler .... 112
5-17 Schematic of the level shifter circuit .... 113
5-18 Schematic of the charge pump and loop filter .... 114
5-19 Die photo of the 4-channel transmitter .... 115
5-20 BGA package with 4 planes (signal, ground, power, ground) and the typical wire bonding configuration .... 116
5-21 Picture of the test board with the BGA package and the chip die photo .... 116
5-22 Method of the channel characterization .... 117
5-23 Channel characteristics of insertion loss and crosstalk .... 117
.... 118
.... 119
.... 119

PAGE 13

.... 120
.... 120
6-1 Electrical link: A) no termination, B) RX termination, C) TX/RX termination .... 123
6-2 Block diagrams of driver with termination. A) voltage mode driver with series termination, B) current mode driver with parallel termination .... 124
6-3 Differential high impedance current mode driver A) with and B) without terminations (open drain driver) at the transmitter side .... 124
6-4 I/O links with A) DC coupling and B) AC coupling .... 125
6-5 System architecture of the chip to chip link .... 126
6-6 Schematic of the current sharing front end circuitries: open drain driver, common gate amplifier, impedance matching network and offset control circuit .... 128
6-7 Schematic of the link front end: open drain driver and transimpedance amplifier when switching M1 ON and M2 OFF .... 129
6-8 Simulated voltage swing and impedance vs. driver current I_T .... 131
6-9 Schematic of the transmitter .... 132
6-10 Block diagram of the 2^7-1 PRBS generator .... 133
6-11 Schematic of the receiver and tracking phase V_th shift .... 134
6-12 Timing diagram of the proposed DFE architecture .... 135
6-13 TX/RX chip die photos .... 135
6-14 Link test setup .... 136
6-15 Measured insertion loss of a 20 cm air gap channel .... 137
6-16 Photo of the link chips mounted on the air gap board .... 137
6-17 Measured eye diagrams for 6.25 Gb/s link tests with FFE only enabled .... 138
6-18 Measured eye diagrams for 6.25 Gb/s link tests with DFE only enabled and FFE+DFE enabled .... 139
6-19 Measured BER bathtub curves at various equalization settings .... 140


6-20 Measured energy per bit performance at various equalization settings
6-21 Energy per bit performance compared with previous work
A-1 Flow chart of MATLAB functions used in the design framework
B-1 Flow chart of MATLAB functions used to estimate BER


Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

ANALYSIS AND DESIGN OF HIGH SPEED I/O LINKS USING A CURRENT-DENSITY CENTRIC LOGICAL EFFORT MODEL

By Yan Hu

May 2011

Chair: Rizwan Bashirullah
Major: Electrical and Computer Engineering

The increasing demand for high aggregate off-chip bandwidth in high performance computational platforms requires a large number of high data rate and power efficient input/output (IO) links. Since IO power efficiency is a strong function of channel response, per-lane data rates, process technology and link architectures, it is increasingly important to analyze their implications on overall link performance from a system-level perspective. This thesis presents a link design framework to model overall IO link performance and understand its design tradeoffs. Various high speed IO circuits and architectures are developed to demonstrate power efficient links over band-limited and high density interconnects.

An algorithmic design and scaling methodology is developed to systematically size high speed current mode logic (CML) circuits, and to evaluate the impact of device scaling on link performance in terms of compound metrics such as energy per bit, bandwidth density and power density. A logical effort (LE) model is proposed to optimize speed and power performance in MOS high speed CML circuits. The relative invariance of the characteristic current density for peak transistor cutoff frequency across technology nodes is used as a normalization factor for CML gates to define logical effort parameters. The proposed logical effort model is simple yet sufficiently accurate, and valid over technology nodes in the constant field scaling regime. To validate this model, a 10 Gb/s transmitter with a 3-tap feed-forward equalizer (FFE) was designed, fabricated in 65 nm CMOS technology, and tested.

The proposed model is used to develop a design framework that estimates power dissipation for equalized high speed I/O links, serving as a fast design exploration tool to determine link design parameters. High speed packaging strategies with cost-constrained I/O counts are used to determine practical solutions for high aggregate bandwidth systems. The optimum data rate per link, number of I/O pins, energy per bit, optimal aggregate bandwidth and power density can be computed from the proposed framework under various channel and noise constraints for different technology nodes.

Since dense band-limited interconnects in IO link systems with high aggregate bandwidth lead to increased crosstalk, a multi-tap high speed transmitter with feed-forward equalization (FFE) and a crosstalk reduction technique is developed to compensate for frequency-dependent channel loss and crosstalk-induced jitter, respectively. This approach adjusts the retiming clock phase via a phase interpolator actively controlled by the corresponding data transitions to compensate for even/odd propagating modes. In order to detect the data transitions at high data rates, a tree-based parallel architecture is implemented at lower data speeds and multiplexed to perform the timing adjustments. A phase lock loop (PLL) with a four-stage dual-loop ring oscillator is designed to generate high speed in-phase and quadrature-phase clocks for phase interpolation. A voltage doubler is implemented to provide a higher supply voltage for the charge pump to increase the oscillator tuning range. A chip prototype with four parallel transmitter cores is implemented in 65 nm CMOS technology to verify the proposed crosstalk equalizer scheme. Each transmitter includes a PRBS data generator, serializer, FFE, crosstalk equalizer, phase interpolator, PLL and clock distribution circuits. Mode detection blocks are placed between two adjacent transmitters to generate control signals for the crosstalk equalizers.

An active low power electrical link demonstration is also reported herein. The transmitter (TX) utilizes a 1-tap feed-forward equalizer (FFE) for pre-cursor cancellation and the receiver (RX) a 1-tap decision feedback equalizer (DFE) for post-cursor cancellation. The TX and RX frontends implement a current sharing scheme to reduce the overall link power consumption. The link operates at 6.25 Gb/s with ~0.6 mW/Gb/s (0.6 pJ/bit) over a 20 cm channel.


CHAPTER 1
INTRODUCTION

1.1 Motivation

The demand for higher functional density and processing throughput in high performance computational platforms such as multi-core and network processors is progressively increasing the aggregate off-chip input/output (I/O) link requirements. Significant performance gain is achieved via scaling of integrated circuit technology, as evidenced by recently published data on I/O macros and systems. For instance, Figure 1-1(A) shows link bit rate versus technology node for various signaling macros such as clock data recovery circuits, equalizers and entire transmit/receive modules. These trends continue to exceed projections by the International Technology Roadmap for Semiconductors (ITRS) [1]. In addition, as shown in Figure 1-1(B), the energy per bit of the reported I/O links has been progressively decreasing across technology nodes to meet the continued demand for low power consumption. A recent example is a complete transceiver core with a reported energy efficiency of 2.2 pJ/bit, and 1 pJ/bit performance is clearly within reach for short chip-to-chip IO links [2-4].

In order to achieve high aggregate bandwidths, a large number of parallel links must be integrated into a single chip. Increasing the number of I/O channels results in higher aggregate I/O bandwidth but also increased power consumption. Introducing metrics to properly characterize link performance is therefore important. The energy per bit, E_b, typically expressed in units of pJ/bit or mW/Gbps, is defined as the total link power (mW) divided by the link data rate (Gbps). The link power includes the total power associated with the transceiver core, which in turn is closely related to circuit topology, architecture, technology, overall impedance levels and channel loss characteristics.
Overall link performance is also closely related to the bit error rate (BER) required for reliable data transfer [5]. The area constraint imposed by practical packaging platforms also affects the power density performance of the link, through both the die area of the transceiver chip and the number of IO pins available inside the package for high speed signaling. Therefore, high speed parallel IO links can be evaluated in terms of simple metrics such as power consumption, bandwidth per IO link, and area, and compound metrics such as energy efficiency (pJ/bit or mW/Gbps), bandwidth density (Gb/s/mm^2), and power density (mW/mm^2).

Figure 1-1. Published I/O data for A) bit rate and B) energy per bit versus technology nodes

In order to optimize link metrics, tradeoffs need to be made among design variables, e.g. data rate per I/O pin, transceiver architecture, and circuit topology, from high-level system design down to sub-system and circuit-level design, as illustrated in Figure 1-2. At the lowest level of the system hierarchy, significant effort has been devoted to increasing the data rates in chip-to-chip communication links. With advances in technology scaling, transistor cutoff frequencies f_T have increased significantly (e.g. 200 GHz for 90 nm CMOS technology) and accordingly higher clock rates within the chips are achievable. For instance, the FO4 delay for digital circuits in the 45 nm node is approximately 15 ps, which is sufficient to implement core circuits. With speeds in excess of 10 GHz, a significant challenge is to achieve these high data rates on band-limited channels with frequency-dependent attenuation, signal reflections and other signal integrity limitations.

Figure 1-2. Performance metrics and design tradeoffs in high speed parallel I/O links
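The compound metrics listed above follow directly from the three simple metrics. The sketch below is only illustrative: the function name `link_metrics` and the example power and area values are assumptions for demonstration and are not taken from the dissertation.

```python
def link_metrics(power_mw, data_rate_gbps, area_mm2):
    """Compound link metrics from total power, data rate and occupied area."""
    return {
        # mW / Gbps is numerically identical to pJ/bit
        "energy_pj_per_bit": power_mw / data_rate_gbps,
        "bandwidth_density_gbps_per_mm2": data_rate_gbps / area_mm2,
        "power_density_mw_per_mm2": power_mw / area_mm2,
    }


# Hypothetical link: 3.75 mW at 6.25 Gb/s occupying 0.5 mm^2 (illustrative values)
metrics = link_metrics(power_mw=3.75, data_rate_gbps=6.25, area_mm2=0.5)
print(metrics["energy_pj_per_bit"])               # 0.6
print(metrics["bandwidth_density_gbps_per_mm2"])  # 12.5
print(metrics["power_density_mw_per_mm2"])        # 7.5
```

Because energy per bit is a ratio, two links with very different absolute power can share the same E_b, which is why the area-normalized densities are tracked alongside it.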


From a system-level perspective, the optimal link architecture (including the types and amounts of equalization) can be obtained from the channel characteristics and packaging strategy in order to meet target BER requirements [6]. We apply signal integrity analysis used in communication systems to address the architecture of high speed I/Os over band-limited channels. The link power dissipation is given by the individual power consumption of the transceiver building blocks at the circuit level, which depends on circuit topologies and design strategies. Current Mode Logic (CML) circuits are widely used in high speed serial links. As the peak f_T of MOS transistors remains constant at a current density of ~0.3 mA/µm regardless of technology node [7], we propose a new current-centric design methodology to systematically design CML circuits and systems. The energy-delay product can be calculated mathematically in terms of the proposed logical effort model for CML circuits. Complex tradeoffs associated with high speed I/O links are explored in terms of the energy profile of CML circuits and the system performance based on the logical effort model. In order to obtain the optimal performance metrics for parallel I/O links, an optimization-based design framework is developed to guide I/O design at both system and circuit levels, bridging the gap between analog/mixed-signal circuit design and high-level system design.

1.2 Thesis Organization

In Chapter 2, we introduce design parameters associated with typical high speed link environments. We first look at typical transmission media for electrical links, such as backplanes. Next we describe the impact of band-limited channels on signal integrity, namely Inter-Symbol Interference (ISI) and Inter-Channel Interference (crosstalk). Equalization techniques to compensate for ISI and crosstalk are studied at both the transmitter and receiver sides of the link.
In Chapter 3, the causes and effects of the relative invariance of the characteristic current density for MOS transistors and MOS CML circuits are reviewed. A current-centric logical effort model is introduced for CML gates in order to derive the energy-delay product. A high speed PRBS data generator is implemented as a simple digital system to validate the energy profile of the proposed model. A more complex high speed transmitter with equalizer is also implemented to validate both the statistical link analysis at the system level and the logical effort methodology at the circuit level.

A complete analysis framework is developed in Chapter 4, based on the system-level link model composed of band-limited channels and equalizers discussed in Chapter 2 and the circuit-level logical effort model proposed in Chapter 3. High performance packaging strategy is also analyzed, including possible pad arrangements, escape routing and performance degradation on high speed signaling. Practical solutions and compound metrics of high speed parallel I/O links can be projected for future technologies.

In Chapter 5, we present an active phase interpolator based crosstalk equalizer to compensate for crosstalk-induced jitter (CIJ). A chip prototype of 4 parallel high speed transmitters is fabricated and tested using BGA and QFN packages separately to verify the proposed CIJ equalization scheme and evaluate its performance.

In Chapter 6, a low power TX/RX with 1-tap FFE, 1-tap look-ahead DFE and current sharing frontend is implemented to demonstrate a power efficient chip-to-chip link. The overall link has been measured over a 20 cm channel with an energy per bit performance of 0.6 mW/Gb/s at 6.25 Gb/s.

In Chapter 7, we summarize our results and provide ideas for extending this work.


CHAPTER 2
HIGH SPEED PARALLEL I/O LINKS

2.1 Overview

In order to achieve high aggregate bandwidth on chip, a large number of parallel links must be integrated into a single chip, as illustrated in Figure 2-1. The aggregate chip-to-chip bandwidth can be increased by adding more I/O pins placed in parallel, as well as by scaling the bandwidth of each pin in step with the clock speed, as expressed in (2.1).

(2.1)

As CMOS technology continues to scale down, design efforts have focused on increasing the bandwidth of each channel and the clock speed, and on decreasing the required number of high speed I/O pins [8]. This results in less complexity, smaller area and lower cost for the overall system. However, high speed data transmission currently faces a challenge, as the characteristics of conventional band-limited channels are not keeping pace with the improvements being made in multi-gigabit circuit implementations. To fully understand the limiting factors that affect high speed link design, the band-limited channel and its effect on signal integrity are investigated next. In Section 2.3, the fundamental cause of inter-symbol interference (ISI) is studied and different equalization techniques are presented to improve signal integrity over band-limited channels. Finally, in Section 2.4, a behavioral model is introduced to characterize a high speed I/O system consisting of a band-limited channel and equalizers at both the transmitter and receiver sides.

2.2 Band Limited Channel and Signal Integrity

As the system diagram in Figure 2-1 shows, a typical high speed I/O link contains a transmitter sending a synchronized data stream through a channel, and a receiver processing the incoming signal and recovering the transmitted bits. When the slicer interprets the received signal incorrectly, the recovered data differs from the transmitted data and bit errors occur. Signal integrity can be quantified by the BER, which is a measure of the signal quality arriving at the receiver. The main obstacles to maintaining good signal integrity in band-limited channels are ISI, crosstalk, reflection, and jitter [9]. With higher data rates or longer channels, these challenges become severe and can make it impossible to recover the original bits. In this section, a conventional electrical backplane is studied as an example of a band-limited channel, and its impact on signal integrity is characterized through statistical analysis.

Figure 2-1. Parallel I/O interface and system diagram of each I/O link

Figure 2-2 shows a cross-section diagram of a legacy HM-Zd backplane system from Tyco Electronics. The line cards plug into the backplane using dense through-hole vias and connectors at both edges of the mother board. The I/O chips are packaged and soldered on the two line cards separately for communication. The signal transmitted from an I/O chip has to travel through a number of traces, via stubs and various components before arriving at its destination. To characterize the transfer function of the signal path through this backplane system, a 4-port Vector Network Analyzer (VNA) was used to perform S-parameter measurements for seven cases. The test cases, covering various channel lengths and routings, are summarized in Table 2-1. Mixed-mode differential insertion losses can be calculated from the measured 4-port S-parameters [10-11] using (2.2). Figure 2-3 shows that the |SDD21| frequency response varies widely in its loss slope, and that notches in the magnitude are caused by some via stubs and connectors.

Figure 2-2. Cross section of a HM-Zd Tyco backplane system

(2.2)

Table 2-1. Summary of seven test cases for the HM-Zd backplane from Tyco (columns: test case, line card, backplane, total length)
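The mixed-mode conversion mentioned above can be sketched as follows. The exact form of (2.2) is not reproduced here; this uses the standard single-ended to mixed-mode relation under an assumed port numbering (ports 1 and 3 form the input differential pair, ports 2 and 4 the output pair). The port assignment, function names and sample values are assumptions, not taken from the measurement setup.

```python
import math


def sdd21(s21, s23, s41, s43):
    """Differential insertion loss from single-ended 4-port S-parameters.

    Assumed port map: differential input = ports (1, 3),
    differential output = ports (2, 4). Arguments may be complex.
    """
    return 0.5 * (s21 - s23 - s41 + s43)


def magnitude_db(s):
    """Magnitude of a (complex) S-parameter in dB."""
    return 20.0 * math.log10(abs(s))


# Hypothetical values at one frequency point: two identical, uncoupled lines
value = sdd21(s21=0.5 + 0j, s23=0j, s41=0j, s43=0.5 + 0j)
print(abs(value))  # 0.5
```

With a different port numbering the same structure holds, but the four single-ended terms must be re-indexed accordingly.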


Figure 2-3. Insertion loss of Tyco backplane system

Despite the variability of lengths and routings, the backplane system has a mostly low-pass characteristic. High speed NRZ signals transmitted along the band-limited channel suffer from high frequency attenuation, which causes signal interference. For example, a narrow pulse of 200 ps width at the input of case 1 is significantly attenuated and yields an increased pulse width at the output. In Figure 2-4, the dots indicate the symbol-spaced samples; pre-cursors h(-5), ..., h(-1) and post-cursors h(1), ... are displayed in addition to the main cursor h(0). By treating the band-limited channel as a linear time-invariant (LTI) filter and ignoring noise effects, the output samples y(n) can be calculated as the convolution of the input NRZ bit stream x(n) with the sampled pulse response of the filter h(n), as shown in (2.3).

$$y(n) = \sum_{k} x(k)\, h(n-k) \tag{2.3}$$

The product x(n)h(0) is the ideal output. However, the superposition of the post-cursors and pre-cursors weighted by the bits adjacent to x(n) distorts the output and causes ISI. A statistical analysis can be performed to characterize the effects of ISI on signal integrity, i.e. BER performance [12-15]. Since the input NRZ data is random in nature, the ISI distortion from the pre-/post-cursors can be represented statistically, with each possible combination of cursors superposed with equal probability. By recursively convolving the distributions of all significant ISI terms, the probability density function (PDF) of the ISI can be obtained. For example, for a given system pulse response h(n) with a finite number of cursors, the discrete ISI distribution can be characterized by its individual probability densities and corresponding ISI amplitude values, as shown in (2.4). Assuming equal probability of +1 or -1 for the random NRZ input data, the probability of each ISI value is 0.5.

(2.4)

Figure 2-4. Sample pulse response of the channel to a 200 ps wide pulse

Figure 2-5 shows the principle of convolving two individual PDFs of significant ISI amplitudes h(i) and h(j). The overall ISI PDF, which includes a finite number N of pre-/post-cursors, can be derived as

(2.5)

where z_i and a_i are calculated recursively for i = 1 to N over the pre-/post-cursors of interest, and z_N and a_N are the resulting ISI PDF and its corresponding ISI amplitude values.
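The recursive convolution described above can be sketched in a few lines. This is a minimal illustration assuming equiprobable, independent +1/-1 data and ignoring noise; the function name `isi_pdf` and the sample cursor values are hypothetical and do not come from the measured channel.

```python
from collections import defaultdict


def isi_pdf(cursors):
    """Discrete ISI distribution {amplitude: probability}.

    Each cursor h contributes +h or -h with probability 0.5, so its
    two-point PDF is convolved into the running distribution.
    """
    pdf = {0.0: 1.0}  # no ISI before any cursor is included
    for h in cursors:
        updated = defaultdict(float)
        for amplitude, prob in pdf.items():
            for bit in (-1, 1):
                # rounding merges amplitudes that differ only by float error
                updated[round(amplitude + bit * h, 12)] += 0.5 * prob
        pdf = dict(updated)
    return pdf


# Hypothetical pre-/post-cursor amplitudes (volts)
pdf = isi_pdf([0.1, 0.05])
for amplitude in sorted(pdf):
    print(amplitude, pdf[amplitude])
# -0.15 0.25
# -0.05 0.25
# 0.05 0.25
# 0.15 0.25
```

The dictionary grows as 2^N in the worst case, so a practical implementation would bin amplitudes onto a fixed voltage grid; the rounding step here is only a crude stand-in for that.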


Figure 2-5. Convolution of ISI PDFs at corresponding ISI amplitudes

Therefore, the PDF and the received amplitude r corresponding to a transmitted bit of [-1, +1] are given by (2.6).

(2.6)

By calculating the cumulative distribution function (CDF) from the PDF of the received voltage, the BER can be obtained as shown in (2.7), where M is the size of the PDF vector.

(2.7)

For example, based on the pulse response of case 1 obtained from the measured S-parameters, the calculation procedure for the BER as a function of threshold voltage is illustrated in Figure 2-6.

Figure 2-6. Plots of ISI PDF and BER for Tyco backplane case 1
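The step from an ISI distribution to a BER value can be illustrated as below. Since the exact form of (2.7) is not reproduced here, this sketch assumes a noiseless slicer, equiprobable bits, and a received amplitude of +h0 or -h0 plus the ISI term; the function name and the example distribution are hypothetical.

```python
def ber_at_threshold(h0, isi_pdf, v_th):
    """BER for a slicer threshold v_th, given main cursor h0 and an ISI PDF.

    isi_pdf maps ISI amplitude -> probability. A transmitted +1 is
    received as h0 + isi, a transmitted -1 as -h0 + isi.
    """
    # +1 sent but the sample falls below the threshold
    p_miss_one = sum(p for a, p in isi_pdf.items() if h0 + a < v_th)
    # -1 sent but the sample lands at or above the threshold
    p_miss_zero = sum(p for a, p in isi_pdf.items() if -h0 + a >= v_th)
    return 0.5 * (p_miss_one + p_miss_zero)


# Hypothetical four-point ISI distribution (volts)
isi = {-0.15: 0.25, -0.05: 0.25, 0.05: 0.25, 0.15: 0.25}

print(ber_at_threshold(h0=0.2, isi_pdf=isi, v_th=0.0))  # 0.0  (eye is open)
print(ber_at_threshold(h0=0.1, isi_pdf=isi, v_th=0.0))  # 0.25 (eye is closed)
```

Sweeping `v_th` and plotting the result gives a BER-versus-threshold curve of the kind the text describes; adding Gaussian noise would replace the hard comparisons with tail probabilities.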


2.3 Interference

As discussed in the previous section, a band-limited channel introduces ISI, which degrades signal integrity. Another cause of signal degradation is inter-channel interference (crosstalk), which occurs between different channels due to electromagnetic coupling through mutual capacitance or inductance [16]. In order to characterize the high speed link environment, the physical causes of ISI and crosstalk are studied in this section.

2.3.1 Inter-Symbol Interference (ISI)

ISI is the most significant factor limiting the maximum achievable data rate for high speed data transmission on band-limited channels. Looking at the physical properties of the backplane channels, the primary causes of ISI are found to be the skin effect, conductor loss, and dielectric loss [17], although crosstalk and stub reflections caused by poor termination can also contribute.

2.3.1.1 Skin effect

At low frequencies near DC, the current density inside a conductor has a uniform distribution. As frequency increases, the current tends to localize near the surface of the conductor due to electromagnetic wave interaction within the conductor material. This phenomenon, the skin effect, results in a larger effective resistance and hence signal attenuation. Taking the skin effect into account, the resistance per unit length for DC and AC current can be found using (2.8) and (2.9) [8], where D_r is the relative resistivity (compared with copper, equal to 1), and f is the frequency in Hz.

(2.8)

(2.9)


The DC resistance stays constant, while the AC resistance is proportional to the square root of the frequency. Therefore, the total resistance per unit length can be approximated as shown in (2.10).

(2.10)

2.3.1.2 Dielectric loss

In backplane systems, a laminate material such as FR-4 is used to isolate the copper layers used for signal, ground and power supply. The laminate material is not a perfect insulator, and thus a resistive drop may cause signal attenuation. The dielectric loss can be associated with the dielectric shunt conductance G [8], where C_R is the self loss tangent of the laminate material.

(2.11)

FR-4 is commonly used in low cost backplane systems; it has the highest loss tangent and thus causes large signal attenuation at high frequency. Higher cost dielectric materials such as Roger or NELCO are required for higher performance backplanes. Table 2-2 shows the dielectric constant and loss tangent of some typical PCB insulators.

Table 2-2

Material       Dielectric constant   Loss tangent
FR-4           3.9-4.7               0.02-0.03
GETEK          3.5-4.3               0.012
Polyimide      4.0-4.5               0.01
Roger          3.0                   0.02
Nelco N4000    4.1                   0.012

The dielectric loss is proportional to frequency, as seen in (2.11). Thus, dielectric loss dominates over the skin effect at high frequencies, since the loss related to the skin effect is proportional only to the square root of the frequency. Both effects are illustrated in Figure 2-7, which shows the attenuation of a 30 inch FR4 stripline as a function of frequency. The crossover frequency above which the dielectric loss exceeds the skin effect loss depends on the material properties and dimensions of the stripline.
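The two frequency dependences described above, and the crossover between them, can be illustrated numerically. This is a sketch only: the loss coefficients `K_SKIN` and `K_DIEL` are hypothetical fitting constants chosen for illustration and are not the values for the 30 inch FR4 stripline.

```python
import math

K_SKIN = 2.0e-4  # dB per sqrt(Hz), hypothetical
K_DIEL = 2.0e-9  # dB per Hz, hypothetical


def skin_loss_db(f_hz, k_skin=K_SKIN):
    """Skin effect loss grows with the square root of frequency."""
    return k_skin * math.sqrt(f_hz)


def dielectric_loss_db(f_hz, k_diel=K_DIEL):
    """Dielectric loss grows linearly with frequency."""
    return k_diel * f_hz


def crossover_hz(k_skin=K_SKIN, k_diel=K_DIEL):
    """Frequency where k_skin * sqrt(f) equals k_diel * f."""
    return (k_skin / k_diel) ** 2


f_cross = crossover_hz()
for f in (1e9, f_cross, 2e10):
    print(f"{f:.3g} Hz: skin {skin_loss_db(f):.2f} dB, "
          f"dielectric {dielectric_loss_db(f):.2f} dB")
```

Below the crossover the skin term is the larger of the two; above it the dielectric term takes over, which matches the qualitative behavior described for Figure 2-7.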


Figure 2-7. Skin effect and dielectric loss from a 30 inch FR4 stripline

To determine the attenuation per unit length of a band-limited channel when S-parameters are not available, (2.12) is written in terms of the skin effect and dielectric loss.

(2.12)

Here Z_0 is the characteristic line impedance, R is the DC and skin effect resistance in (2.9) and (2.10), and G is the dielectric shunt conductance shown in (2.11). Thus, the magnitude response of a band-limited channel can be approximated using (2.13), with two coefficients that depend on the skin effect and the dielectric loss, respectively.

(2.13)

The phase of the transfer function H_channel(f) can be determined by assuming a constant group delay for the LTI system. This theoretical model of the band-limited channel yields the system pulse response through the inverse fast Fourier transform (ifft). However, the loss due to discontinuities such as vias is not taken into account.

2.3.2 Inter-Channel Interference (Crosstalk)

In a high speed parallel I/O interface, the electromagnetic fields from different channels may interact and cause inter-channel interference (crosstalk). This coupling between channels can be divided into two categories, near-end crosstalk (NEXT) and far-end crosstalk (FEXT). As the signal travels from the source to the destination, crosstalk occurs in the different components along the path, such as on-chip transmission lines, packages, connectors, vias, and the backplane board [18-20]. Since crosstalk is caused by mutual capacitance or inductance between two signal paths, the amount of crosstalk in high speed digital systems increases dramatically as applications demand physically smaller and faster parallel I/O links. Excessive crosstalk introduces noise on the victim signal, which reduces the noise margin and degrades signal integrity. In this section, the crosstalk between channels on a backplane board is discussed.

The NEXT and FEXT responses were measured using a 4-port VNA for the Tyco backplane system. Using mixed-mode analysis of the measured S-parameters, the differential transfer function of the crosstalk can be obtained, which gives the system pulse response through the ifft, as shown in Figure 2-8. The crosstalk between transmission lines on the line card and mother board has a small effect, since ground planes are placed between different signal layers. Stronger crosstalk occurs between the signal lines in chip packages and connectors due to the area constraint, as discussed in Chapter 5. Nevertheless, the degradation of signal integrity due to crosstalk can be taken into account in the BER by recursively convolving the crosstalk pulse response with the ISI pulse response.

In addition to ISI and crosstalk, other issues such as reflection, jitter, and skew may cause signal integrity problems. The effect of clock jitter and skew on signal integrity will be discussed in Chapter 4. Reflection is caused by impedance discontinuities along the signal path; it depends on the physical channel design and is outside the scope of this dissertation.
In the next section, a few commonly used equalization techniques are reviewed that compensate for ISI and improve signal integrity.


Figure 2-8. Crosstalk and pulse response for Tyco backplane A) NEXT and B) FEXT

2.4 Equalization Techniques

As discussed earlier, ISI is the dominant limiting factor on signal integrity for high speed I/O links. Equalization techniques can be used to compensate for ISI effects caused by the band-limited characteristic of a channel; the equalization concept is illustrated in Figure 2-9.

Figure 2-9. Conceptual illustration of equalization with channel response, equalizer response and equalized system response in the frequency domain


The band-limited channel has a low-pass frequency response. An equalizer is used to boost the high frequency content attenuated by the channel, and thus has a high-pass characteristic [21-22]. Ideally, the resulting system response is flat over all frequencies and the ISI effect is eliminated. Two types of equalizers, linear and nonlinear, are discussed here.

2.4.1 Linear and Nonlinear Equalizers

The linear equalizer shown in Figure 2-10 is the most common type of channel equalizer. Delayed versions of the input data stream are generated by consecutive delay elements and summed together with various coefficients to subtract the ISI from the original signal. The time delay element, with a delay of T_b, is called a tap delay. The transfer function for an N-tap symbol-spaced linear FIR equalizer is written in (2.14).

(2.14)

Figure 2-10. A linear FIR equalizer

Linear equalization is a straightforward and effective way to compensate for channel loss by subtracting the ISI from the distorted signal directly. However, linear equalizers have a high-pass characteristic and thus amplify both signal and noise at high frequencies. To circumvent this noise amplification, nonlinear equalization introduces a slicer before the delay elements, as shown in Figure 2-11. Assuming the previous bits are detected correctly, their effect on the current bit is subtracted from the received analog signal, thus cancelling the ISI from the M post-cursor taps. Note that a decision feedback equalizer (DFE) cannot cancel pre-cursor effects, since these come from the following bits, which are not yet available at the slicer or delay element outputs when the current bit is decided. The transfer function for an M-tap DFE is shown in (2.15). Rather than building a linear filter to equalize the pulse response from the delayed original data, the DFE uses the history of the received bits (previous bits) to cancel the trailing ISI (post-cursors) caused by the limited bandwidth. The DFE is therefore susceptible to error propagation: if a bit has been detected incorrectly, its ISI effect on other bits will not be corrected, yielding errors in subsequent bits [24]. Nevertheless, the DFE is heavily used in communication because, owing to the slicer and its resulting nonlinear nature, it does not amplify high frequency noise.

Figure 2-11. A nonlinear equalizer: decision feedback equalizer

(2.15)

2.4.2 Transmitter and Receiver Side Equalizers

In high speed I/O links, both linear and nonlinear equalizers are used to compensate for signal degradation due to band-limited channels. In the mathematical model of a system composed of cascaded filters, the placement of equalizers at the transmitter or receiver side does not affect the overall transfer function. However, the tradeoffs among power, area, and performance affect the choice of equalization scheme at the circuit level. An equalizer placed at the transmitter side is also called pre-emphasis; it usually takes the form of linear feed-forward equalization (FFE), since the input data stream is known [25]. The FFE can be implemented as an FIR filter, which pre-distorts the signal by de-emphasizing low frequency signal content and keeping high frequency content unchanged before the transmitted signal is distorted by the channel. Nonlinear equalizers are usually placed at the receiver side, since the slicer output is available there. A system diagram for a typical I/O link composed of an FFE, a channel, and a DFE is illustrated in Figure 2-12. By describing the channel as a band-limited filter, the overall system response of the link can be written as the product of the transfer functions of three cascaded filters,

Figure 2-12. System diagram for a high speed link

(2.16)

where C_n and K_n are the FFE and DFE tap coefficients. The received signal after the DFE can be expressed as the convolution of the input signal with the impulse response of the system h(n), the ifft of the system transfer function H(f).

For example, FFE and DFE equalizers are used to compensate for the lossy channel of case 1 on the Tyco backplane board. The BER performance of the link system can be calculated by performing statistical analysis on the FFE, the channel, and the DFE. Figure 2-13 shows the system pulse response and the BER at a data rate of 5 Gbps as a function of threshold voltage, for different amounts of FFE and DFE equalization. For example, with 4-tap FFE and 3-tap DFE coefficients of (A) 0.3 1 0 0 and 1 0.3 0.2, and (B) 0.3 1 0.3 0.2 and 1 0 0, both equalization schemes result in comparable BER performance (10^-12) with a voltage opening of ~400 mV at the ideal sampling point. This raises the question of which equalization scheme should be implemented in the design, given the different amounts of effort involved. Energy per bit is therefore another important design metric that must be taken into account, and it is discussed in the next chapter.

Figure 2-13. Pulse response and BER performance W/O equalization A) FFE coefficients of 0.3 1 0 0 and DFE coefficients of 1 0.3 0.2 B) FFE coefficients of 0.3 1 0.3 0.2 and DFE coefficients of 1 0 0
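The effect of the two equalizer types on a sampled pulse response can be sketched as follows. This is an illustration under simplifying assumptions (symbol-spaced samples, ideal DFE decisions, no noise); the helper names, tap values and pulse responses are hypothetical and are not the coefficients of Figure 2-13.

```python
def convolve(a, b):
    """Full discrete convolution of two sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            out[i + j] += a_i * b_j
    return out


def apply_ffe(pulse, ffe_taps):
    """FFE as an FIR filter: the equalized pulse is pulse * taps."""
    return convolve(pulse, ffe_taps)


def apply_dfe(pulse, main_index, dfe_taps):
    """DFE with correct decisions: subtract tap k from post-cursor k."""
    out = list(pulse)
    for k, tap in enumerate(dfe_taps, start=1):
        if main_index + k < len(out):
            out[main_index + k] -= tap
    return out


# A 2-tap FFE removes the first post-cursor of a hypothetical channel [1, 0.5]
print(apply_ffe([1.0, 0.5], [1.0, -0.5]))            # [1.0, 0.0, -0.25]

# A 2-tap DFE cancels the post-cursors but leaves the pre-cursor untouched
print(apply_dfe([0.1, 1.0, 0.4, 0.2], 1, [0.4, 0.2]))  # [0.1, 1.0, 0.0, 0.0]
```

The FFE example also shows the residual term (-0.25) that a short FIR filter pushes further out, while the DFE example shows why pre-cursor ISI still needs a transmitter-side or linear equalizer.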


CHAPTER 3
CURRENT CENTRIC LOGICAL EFFORT MODEL

3.1 Motivation

As presented in the last chapter, the equalization tap coefficients for high-speed data transmission over band-limited channels can be determined in the system statistical analysis to meet the target BER requirements. To obtain design metrics such as speed and power dissipation, a circuit-level analysis is essential for high-speed I/O links. In this chapter, we present a logical effort (LE) model that relates system-level parameters to circuit-level design parameters for the analysis of high-speed links.

Logical effort (LE) is a design methodology to estimate the delay of CMOS logic circuits for a given logic function. It uses a simple reformulation of the conventional RC gate delay model to separate the effects of CMOS gate sizing, topology, self-loading and fan-out by normalizing the delay of a gate to that of a reference inverter. The relative simplicity of this method provides a simple yet sufficiently accurate approach to evaluate CMOS datapaths for early design exploration. In this chapter we present a method to extend the logical effort model to high-speed Current Mode Logic (CML) circuits for simple back-of-the-envelope evaluation of speed and delay performance.

CMOS process technology has been widely used for high-speed CML circuits due to its wide availability, high integration and fast improvement in speed resulting from the rapid scaling of feature sizes. At a given technology node, the speed performance of a CMOS transistor is found to be associated with its current density J, defined as the ratio of bias current to transistor size [26]. We first review the invariance of the characteristic current density corresponding to the peak f_T as CMOS technologies scale into the nanometer regime. CMOS Current Mode Logic (CML) is typically used as the basic logic style for high-speed applications instead of CMOS logic. A current-centric logical effort model is developed for CML circuits to describe the speed and energy performance [27]. Based on the method of logical effort, algorithmic gate design, including sizing and biasing, can be developed for high-speed I/O links. A high-speed Pseudo Random Binary Sequence (PRBS) generator is implemented as a simple digital system to validate the energy and delay profile of the proposed logical effort model. A more complex high-speed transmitter system with an equalizer is fabricated to validate both the system-level link analysis and the circuit-level energy-delay performance.

3.2 Current Density Property in CMOS Process

Semiconductor technology has improved the intrinsic speed of MOSFETs by more than three orders of magnitude in the past 30 years. The intrinsically high device speed is a key enabler for higher operating frequencies in high-speed I/O applications. In this section, the speed performance of a CMOS transistor and of a CML circuit is discussed in detail.

3.2.1 Characteristic Current Density in CMOS Transistors

For CMOS transistors, the cutoff frequency f_T depends strongly on the small-signal transconductance g_m and the total gate capacitance C_gg, as shown in (3.1).

$f_T = \dfrac{g_m}{2\pi C_{gg}}$ (3.1)

Here g_m is the derivative of the drain current I_D with respect to the input voltage V_GS. In the saturation region, the drain current of the NMOS transistor is given by (3.2) if the square-law model is applied.

$I_D = \dfrac{1}{2}\mu_n C_{OX}\dfrac{W}{L}\left(V_{GS}-V_T\right)^2$ (3.2)


As illustrated in (3.3), the effective mobility µ_n is not constant in the saturation region; it degrades at high V_GS, with a degradation coefficient θ, as a result of increased Si/SiO2 interface scattering due to the high vertical electric field [28].

$\mu_n = \dfrac{\mu_{n0}}{1+\theta\left(V_{GS}-V_T\right)}$ (3.3)

Substituting (3.3) into (3.2), the I_D-V_GS characteristics of NMOS transistors biased in saturation exhibit two distinct regions. At relatively small V_GS, the electron mobility µ_n remains approximately constant at the intrinsic mobility µ_n0 and the typical square-law model applies for I_D. At high V_GS, the I_D-V_GS characteristic becomes linear due to mobility degradation. In both the square-law and high-field regimes, g_m and f_T can be obtained by differentiating I_D with respect to V_GS while neglecting the second derivative of the mobility.

(3.4)

(3.5)

In order to study the dependence of speed on current density, the first-order derivative of the cutoff frequency f_T with respect to the current density J (I_D/W) is derived in (3.6).

(3.6)

Thus, f_T reaches its peak value when (3.7) is satisfied [26].

(3.7)


The dependence of the cutoff frequency f_T of NMOS transistors on the current density J was simulated for production 180 nm, 130 nm, 90 nm, and 65 nm CMOS technologies from multiple foundries, as shown in Figure 3-1. The peak f_T occurs at the same current density for all the technology nodes, approximately 0.3-0.4 mA/µm. Furthermore, the delay changes by less than 10% when the current density varies between 0.15 mA/µm and 0.5 mA/µm. The optimal noise occurs at the same current density J_OPT ~0.3 mA/µm, and the peak f_MAX occurs at the current density J_pfMAX ~0.2 mA/µm, irrespective of frequency and technology node [26]. PMOS transistors exhibit the same characteristic current density, owing to the mobility degradation of holes in the high-field regime. Therefore, the characteristic current densities J_pfT, J_OPT and J_pfMAX for both PMOS and NMOS transistors remain approximately unchanged across technology nodes.

Figure 3-1. Simulated f_T as a function of current density for the 180 nm, 130 nm, 90 nm and 65 nm CMOS technology nodes


3.2.2 Characteristic Current Density in CML Circuits

MOS CML circuits are widely used in high-speed digital logic in place of conventional CMOS logic circuits for their speed advantage; the basic idea is to decouple the charging current from the switching transistors to reduce the delay time. This type of logic was first implemented using bipolar transistors and then extended to MOS transistors [29]. For example, the CML inverter shown in Figure 3-2 can be shown to be twice as fast as a CMOS inverter when driving an identical stage; the delays of these two inverters are estimated in (3.8) and (3.9), assuming a PMOS transistor twice the size of the NMOS transistor in the CMOS inverter and a specified small-signal gain in the CML inverter.

(3.8)

(3.9)

Figure 3-2. Schematic of a typical current mode logic inverter

However, unlike CMOS logic, the delay or speed of the basic CML inverter is largely determined by the load resistance R_L and the total capacitive load that it drives (assuming R_L is smaller than the output impedance looking into the drain of the differential pair, which is typically the case for high-speed designs).
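As a rough numerical illustration of the statement above, the sketch below estimates the output time constant of a single CML stage as R_L times the capacitive load. It assumes that the load resistor is chosen so that the full tail current develops the output swing across it (R_L = ΔV / I_T), and it ignores the output impedance of the differential pair. The function name and the 1.2 mA tail current are hypothetical; the 600 mV swing and 100 fF load are borrowed from the buffer example later in this chapter purely for scale.

```python
def cml_time_constant(delta_v, tail_current, c_load):
    """Return (R_L, tau) for a CML stage.

    Assumption of this sketch: R_L = delta_v / tail_current, i.e. the load
    resistor is sized so the fully steered tail current produces the swing.
    """
    r_load = delta_v / tail_current
    return r_load, r_load * c_load


if __name__ == "__main__":
    r_load, tau = cml_time_constant(0.6, 1.2e-3, 100e-15)
    print("R_L = %.0f ohm, tau = %.1f ps" % (r_load, tau * 1e12))
    # Doubling the tail current at the same swing halves R_L and tau.
    _, tau_fast = cml_time_constant(0.6, 2.4e-3, 100e-15)
    print("tau at 2x current = %.1f ps" % (tau_fast * 1e12))
```

For a fixed load, the time constant scales with ΔV / I_T, which is why the swing and the bias current (and hence the current density for a given transistor width) are the natural design knobs.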


In turn, R_L is set by the output voltage swing ΔV_OUT and the tail current I_T. It follows that the input-pair transistor widths W_G of the CML inverter are sized to sustain the desired current I_T under complete switching conditions, which in turn determines the minimum required input swing ΔV_in. In the example of a high-speed buffer, constant input and output voltage swings are applied to each individual CML inverter, i.e. ΔV_in = ΔV_out = ΔV. Since the characteristic current density of CMOS transistors remains approximately unchanged over technology nodes, we expect that CML circuits composed of these transistors will exhibit a similar characteristic. To provide a useful metric for the speed, the open-circuit time constant of a CML inverter with a fan-out of h is derived in (3.10).

(3.10)

C_in and C_para are the input and parasitic capacitances, and h·C_in represents the load capacitance at the inverter output. As derived in (3.11) and (3.12), C_in and C_para both depend on the transistor size W_G and on the technology parameters L, C_OV and C_OX.

(3.11)

(3.12)

The open-circuit time constant τ_CML_INV is proportional to ΔV/J; thus the speed performance depends on the current density J as well as on the voltage swing ΔV. For proper operation, ΔV is required to be large enough to ensure full switching of the tail current I_T in the differential transistors, which depends strongly on the DC characteristic of the transistors. Instead of using a simple square-law model for both the inversion region and the high vertical field region, a piecewise linear (PWL) model is used to express the I_D-V_GS and g_m-V_GS transfer functions, as shown in (3.13) and (3.14), where the channel length modulation effect is ignored for simplicity.

(3.13)

(3.14)

Here, V_eff,pfT = V_GS,pfT - V_T, and V_GS,pfT is the gate-to-source voltage at the boundary of the square-law and high vertical field regimes (which corresponds to a DC current density I_D/W_G of 0.15 mA/µm). The minimum voltage swing ΔV_min is determined from the DC condition that the differential pair of the inverter is fully switched, which gives

(3.15)

For complete switching of a differential pair operating in the square-law regime with the specified small-signal gain, the required voltage swing is given by

(3.16)

In the high-field regime, the minimum required voltage swing to guarantee the specified small-signal gain is derived in [26]. The result is:

(3.17)


In practice, a swing ΔV larger than the ΔV_min expressed in (3.17) is typically chosen for both the input and output voltage swings of individual gates to provide enough margin for complete switching. The ΔV corresponding to the minimum-delay bias is approximately 600 mV, 480 mV, 320 mV, 250 mV, 200 mV, 160 mV and 135 mV in the 130 nm, 90 nm, 65 nm, 45 nm, 32 nm, 22 nm and 16 nm nodes, respectively. At different technology nodes, the fan-out-of-1 delays of a CML inverter are calculated according to (3.10), as shown in Figure 3-3. Irrespective of the CMOS technology generation, the gate delay is also found to be minimized when the tail current density is approximately equal to the peak-f_T current density J_pfT, which is invariant across technology nodes. Therefore, a current-centric design philosophy is appropriate for foundry-independent design of high-speed CML circuits.

Figure 3-3. Open-circuit time constant of a CML inverter across technology nodes

Figure 3-4. A high-speed CML buffer example


High-speed buffers are core circuits in many high-speed transceivers. As shown in Figure 3-4, a buffer composed of four cascaded inverters is studied as an example in UMC 130 nm CMOS technology. Each inverter stage is biased at the same current density with a constant voltage swing (600 mV) to ensure an adequate voltage margin regardless of process variations. Figure 3-5(A) shows simulated eye diagrams for the buffer biased at various tail current densities. In all cases, each buffer drives the same load of 100 fF. The 20%-80% rise time is measured at various current densities, and the minimum value occurs around 0.3 mA/µm in Figure 3-5(B).

Figure 3-5. Speed performance of a CML buffer biased at various current densities. A) Simulated eye diagrams with various current density biases. B) 20%-80% rise time.

In addition to the simple inverter, there are other typical CML gates, such as the latch and the selector (multiplexer) shown in Figure 3-6, whose logic core is formed by stacking transistors on top of each other. When the clock signal (CLK) of the latch is high, the tail current I_T flows through M1, allowing the NRZ input data D to pass through to the output, and M3-M4 behave like the differential pair in a CML inverter. On the opposite half clock cycle, transistor M2 is turned on and the positive feedback in the differential pair M5-M6 holds the data at the output. The selector simply selects which data input, D1 or D2, appears at the output. If CLK is high, M1 turns on and the differential pair M3-M4 allows D1 to appear at the output. Similarly, D2 passes to the output when CLK is low.

Figure 3-6. Schematic of the MOS CML A) latch and B) selector

As with a CML inverter, the main design variables for the latch and selector include W_G (the transistor size for the data inputs), I_T (the tail current) and ΔV (the constant voltage swing for data input and output). The size of the clock-input transistors is not as critical as W_G, since these transistors contribute less parasitic capacitance to the output nodes; a size roughly 20%-50% larger than that of the data-input differential pairs they feed is therefore chosen. In the latch/selector example, the widths of M1-M2 would be set 20% larger than the widths of M3-M4 (L should be the minimum in all cases). This relieves the voltage headroom problems caused by transistor stacking while still maintaining speed performance. In the latch example, the cross-coupled pair M5-M6 is used for regenerative operation and would be set to ~80% of the widths of M3-M4 for adequate regenerative gain and less parasitic capacitance.

The open-circuit time constants for the latch and selector can also be written as in (3.10), where C_in is the capacitance of the data input and C_para,sel or C_para,latch is the parasitic capacitance at the output node. In addition to the parasitic capacitance C_para from the data-input transistors, the extra pair of transistors M5-M6 in the latch and selector contributes to the parasitic capacitance at the output nodes; this contribution is captured by a topology-dependent coefficient defined as the size ratio W_G,M5-6 / W_G,M3-4. From the open-circuit time constant, it follows that the speed performance of the CML latch and selector theoretically also peaks at a current density of ~0.3 mA/µm.

(3.18)

(3.19)

(3.20)

(3.21)

3.3 Current Density Centric Logical Effort Model

The underlying assumption in the logical effort model for CMOS circuits is that wider transistors, which provide increased current drive, have correspondingly lower resistances and larger diffusion and input capacitances, so that the time constants remain approximately unchanged. Thus, both the logical effort (g) and the parasitic delay (p) are determined by the gate topology or gate complexity and are largely independent of the transistor sizes in the gate.
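The latch sizing rules of thumb given in Section 3.2.2 (clock transistors 20%-50% wider than the data pair, cross-coupled pair at about 80% of the data pair) can be collected in a small helper. This is a sketch for illustration: the function name, argument names and dictionary keys are hypothetical, and the default ratios are simply the values quoted in the text.

```python
def size_cml_latch(w_data_um, clock_margin=0.2, regen_ratio=0.8):
    """Return transistor widths (in um) for a CML latch.

    w_data_um    : width of the data differential pair M3-M4
    clock_margin : extra width of the clock pair M1-M2 (0.2 to 0.5 per the text)
    regen_ratio  : width of the cross-coupled pair M5-M6 relative to M3-M4
    """
    if not 0.2 <= clock_margin <= 0.5:
        raise ValueError("clock_margin outside the 20%-50% range quoted in the text")
    return {
        "M1-M2 (clock)": w_data_um * (1.0 + clock_margin),
        "M3-M4 (data)": w_data_um,
        "M5-M6 (cross-coupled)": w_data_um * regen_ratio,
    }


if __name__ == "__main__":
    for name, width in size_cml_latch(10.0).items():
        print("%-22s %.1f um" % (name, width))
```

For a 10 µm data pair this gives 12 µm clock devices and an 8 µm cross-coupled pair, all at minimum channel length.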


However, unlike CMOS logic, CML circuits have many design variables, and most designers rely on a simulator to tweak the device parameters W_G, I_T and ΔV. In this section, a fast and systematic design methodology is proposed for CML circuits based on the current-centric logical effort model.

3.3.1 CML Gates Delay

The speed performance of CML gates can be characterized in terms of the propagation delay, which is defined as the time difference between the zero crossings of the differential input and output. A general CML gate is composed of three parts: resistor loads, switches and a constant tail current source. CML circuits operate by steering the tail current. As illustrated in Figure 3-2, a CML inverter is the simplest CML gate; it can be used either as an inverter or as a buffer depending on the input/output polarity. By using different combinations of input signals, other CML gate functions can be easily implemented.

Figure 3-7. A CML buffer composed of inverters

In the buffer chain shown in Figure 3-7, stage i has the input voltage swing ΔV_in,i and the output voltage swing ΔV_out,i, which is equivalent to the input voltage swing of the next stage, ΔV_in,i+1. The RC propagation delay of stage i, with the schematic shown in Figure 3-2, can be calculated as

$d_{RC,i} = k_{RC}\, R_{L,i}\left(C_{out,i} + C_{par,i}\right)$ (3.22)

where k_RC is a constant (~0.69 for 50% delay), and C_out,i and C_par,i are the load and parasitic capacitances. Substituting ΔV_out,i / I_T,i for R_L,i and the product of transistor current density J_i and W_G,i for I_T,i (i.e. I_T,i = J_i W_G,i), the RC gate delay can be rewritten as

(3.23)

where C_g = C_in,i / W_G,i and C_p = C_par,i / W_G,i are the gate and parasitic capacitances per micron of gate width, respectively. C_p, C_g and the ratio C_p/C_g are all technology-dependent parameters. The RC delay can be expanded to include the effect of finite input edge rate by expressing the overall gate delay d_T in a chain of CML gates as a function of the delay of stage i and the delays of the previous stages [36-41].

(3.24)

A recursive expansion of (3.24) in a cascaded chain (i:0) with d_RC,i-1 ... d_RC,1 = d_RC,0 yields

(3.25)

where k is a constant with a value of 0.3~0.5. For typical applications of CML circuits such as I/O and serial link transceivers, the propagation delays of the CML circuits in a data path are related to each other. For example, d_RC,i-1 = b·d_RC,i, with b = 1 in a buffer chain, b > 1 in a multiplexer data path, and b < 1 in a de-multiplexer data path. Therefore, the propagation delay d_T,i can be rewritten in terms of a constant B that depends on the type of CML circuits in the transceiver system.

(3.26)


(3.27)

Therefore, the speed performance of CML gates is mainly determined by the RC delay derived in (3.23). In addition to the current density J_i, the RC delay also depends on the output voltage swing ΔV_out,i, which is equivalent to the input voltage swing ΔV_in,i+1 of the following (i+1) stage and is therefore determined by the current density J_i+1.

As discussed in Section 3.2, the minimum voltage swing of a CML inverter required to fully switch the transistors depends on its DC operating point. In this section, a more general alpha-power law model is used to characterize the drain current of a MOSFET in the inversion region, written as

(3.28)

where α is ~2 for very long channel devices and ~1 for very short channel devices [30]. Here, the high vertical field region (J > J_pfT) is not included in the discussion, since the speed is degraded and more power is dissipated there. The effective gate voltage V_GS - V_th, denoted V_eff, can be found from

(3.29)

where V_eff is determined by the current density J, defined as I_T/W_G, and V_eff0 is included as the effective voltage for impractically low current densities. For sub-threshold CML circuits, a constant switching voltage of 3nV_t (~117 mV) is used, where n, given by 1 + C_d/C_ox (~1.5), is the slope factor resulting from the voltage division between the oxide capacitance C_ox and the depletion capacitance C_d in the MOS transistor, and V_t is the thermal voltage defined by V_t = kT/q [33]. This constant sub-threshold voltage can be used to define the lower boundary of the effective gate voltage for transistors biased in the inversion region. ΔV_in is typically determined by the effective gate voltage V_eff with a margin factor m ~1.5. Based on (3.29), a piecewise linear (PWL) model can be used to characterize ΔV_in for low and high current densities, as given by

(3.30)

where ΔV_0 is the constant voltage swing used for low current densities and is determined by V_eff0.

(3.31)

J_m is defined as the maximum current density, which is usually chosen in terms of the peak transistor current density (~J_pfT). The corresponding input voltage swing required for complete current switching at J_m is denoted ΔV_m, which can be found from

(3.32)

ΔV_m is approximately 600 mV in the 130 nm node at J_m (equal to J_pfT) [34]. By equating the two scenarios in (3.32), the boundary current density J_0 can be determined by

(3.33)

Figure 3-8 shows the simulated ΔV_in of a CML inverter for 180 nm, 130 nm, 90 nm and 65 nm CMOS technologies from multiple foundries. Although the ratio C_d/C_ox is technology dependent, the threshold voltage swing ΔV_0 for different technology nodes can be approximated as a constant value of ~200 mV, since C_d/C_ox is less than 1. For example, the calculated current density boundary J_0 for 130 nm CMOS technology is ~0.06 mA/µm. With each technology node, J_0 increases by approximately a factor of 2 until it reaches the maximum value J_m.

Figure 3-8. Simulated input voltage swing of a CML inverter vs. current density J

From the RC delay of stage i in (3.23), d_RC,i depends on its own current density J_i and on the current density J_i+1 of the following stage. For example, constant current densities are applied in a buffer chain, i.e. J_i-1 = J_i = J_i+1 = J. However, in some cases J_i is not necessarily equal to J_i+1. In Section 3.3.2, we propose a current density centric logical effort model for constant current densities in a data path. In Section 3.3.3, we further discuss a more general logical effort model for non-constant current densities.

3.3.2 Logical Effort Model for Constant Current Densities

Assuming constant current densities, i.e. J_i-1 = J_i = J_i+1 = J, the input and output voltage swings of all the inverters are the same, i.e. ΔV_in,i = ΔV_out,i = ΔV_in,i+1 = ΔV. Based on (3.23), it is highly desirable to reduce the voltage swing as much as possible in order to reduce the propagation delay. However, the required voltage swing depends on the current density J when J exceeds J_0, and its minimum value is set by ΔV_0 when J is below J_0.

Table 3-1 lists example design parameters in 130 nm CMOS technology, such as the power-law factor α, the current density J_m, the oxide thickness t_ox, and the calculated oxide capacitance C_ox and gate capacitance per unit width C_g.

Table 3-1. Design parameters in 130 nm CMOS technology

Design parameter    Value
L_g                 0.12 µm
V_dd                1.2 V
α                   1.43
J_m                 0.3 mA/µm
t_ox                2.73 nm
C_ov                2.9e-10 F/m
C_g                 1.3 fF/µm
ΔV_m                600 mV
k_RC                0.69
τ                   1.65 ps

3.3.2.2 RC delay for J > J_0

For current densities larger than J_0, the required voltage swing can be expressed in terms of the current density, i.e. ΔV = ΔV_m (J/J_m)^(1/α), and the RC delay equation can be written as a function of current density only, instead of voltage swing:

(3.37)

The logical effort, g, is now such that

(3.38)

The delay equation in (3.36) holds true for J > J_0 with the same definitions for fan-out (h), parasitic delay (p), and delay unit (τ),

(3.39)

From (3.38), for the case in which the drain current in (3.28) is linearly proportional to V_eff, the delay is constant across current densities. The logical effort parameters and delay equation for the CML inverter are summarized in Table 3-2. For g > g_0, CML circuits are biased at low current densities (J < J_0). For J > J_m, the transistors in the CML gates enter the high vertical field region with mobility degradation. For very short channel devices (α ~ 1), (3.41) results in g_0 ~ 1 with J_0 close to J_m, as illustrated in Figure 3-8.


The power dissipation can be found from the tail current (I_T = J·W_G,i) such that

(3.42A)

(3.42B)

where P_i,m is the power dissipation at current density J_m (g = 1). The transistor size W_G,i can be recursively obtained from the load, i.e. W_G,i = W_G,i+1 / h_i. From the power and delay equations, the energy (power-delay) product is

(3.43A)

(3.43B)

where E_i,m is the energy for J_m. The energy-delay (ED) product can be found by

(3.44A)

(3.44B)

where ED_i,m is the ED performance for J_m. Therefore, in a data path with constant current densities, the design metrics of a stage can be recursively obtained from its following stages.

3.3.2.4 Design optimization

The proposed logical effort model can be used to gain insight into the important design metrics of CML gates, such as delay and energy-delay product. The maximum operational speed of CML gates is reached at g = 1, and the delay increases with g since it is a linear function of g. However, the delay does not follow the same trend for g < 1, since the mobility of the transistors is degraded by the high vertical field when J > J_m. At a given fan-out and transistor size, the ED product at J_m (ED_i,m) is fixed. From (3.44), the ED product decreases with g for 1 ≤ g ≤ g_0 and increases with g for g > g_0. Thus, a minimum value of the ED product exists at g = g_0.

Figure 3-9. Design metrics of a CML gate vs. logical effort g

Figure 3-9 graphically illustrates the design metrics of CML gates. Following the trends of each metric, it can be seen that there are three regions (g < 1, 1 ≤ g ≤ g_0 and g > g_0) in which CML circuits can be designed. CML circuits work inefficiently at g < 1 (J > J_m) due to speed degradation and the higher ED product caused by larger current consumption. With logical effort 1 ≤ g ≤ g_0, the CML circuits are found to be energy efficient because the maximum speed is achieved at g = 1 and the ED product decreases until it reaches its minimum value at g = g_0. For 1 ≤ g ≤ g_0 (J_0 ≤ J ≤ J_m), the power shown in (3.42) is proportional to g^(-α/(α-1)), which causes the ED product to decrease with g. For g > g_0 (J < J_0), with the constant voltage swing ΔV_0 across current densities, the power dissipation associated with J is linearly proportional to g^-1, which causes the ED product to be proportional to g. For logical effort g > g_0, very low power can be achieved, but the speed performance degrades further and the energy-delay product suffers.

As CMOS technology scales, the boundary logical effort g_0 decreases due to the scaling of the voltage swing ΔV_m. Thus, the energy-efficient region in Figure 3-9 will shrink to a single optimal point at g = 1, where the fastest speed and the minimum energy-delay product can be achieved simultaneously.

3.3.3 Logical Effort Model for Non-Constant Current Densities

When non-constant current densities are applied in the buffer chain shown in Figure 3-7, the voltage swing ΔV_out,i in the delay equation (3.23) is determined by the current density J_i+1. In the following, we discuss the delay and power optimization for non-constant current densities in a serializer and a deserializer based on a logical effort model.

3.3.3.1 RC delay and power

Based on the PWL model for the voltage swing shown in (3.30), the logical effort parameters are derived in Table 3-3 as a function of J_i+1.

For J_i+1

By comparing (3.42) and (3.47), the power P_i is found to be proportional to g_i^(-α/(α-1)) (for 1 ≤ g_i ≤ g_0) or g_i^-1 (for g_i > g_0) for both constant and non-constant current densities. Therefore, the optimization strategy for the design metrics is the same as that shown in Figure 3-9, and the region 1 ≤ g_i ≤ g_0 is used for energy-efficient design.

3.3.3.2 Design optimization

The normalized delays of stage i and stage i+1 in Figure 3-10 can be written as d_i = g_i (h_i + p) and d_i+1 = g_i+1 (h_i+1 + p). As discussed before, the minimum energy-delay product of an individual CML stage at a given fan-out is achieved at g = g_0. In a data path composed of two CML gates with a constant path fan-out H = h_i h_i+1, we discuss the minimum energy-delay product of the path.

Figure 3-10. A data path composed of CML gates

For a serializer data path from low to high data rate, the normalized delay can be approximated as d_i = b·d_i+1, where b > 1. If the logical effort parameters are held constant at the point of minimum ED product for these two stages, i.e. g_i = g_i+1 = g_0, the fan-out of each stage can be approximated as h_i = (bH)^0.5 and h_i+1 = (H/b)^0.5 when the parasitic effect is ignored. By defining a constant K_E = V_dd J_m W_load (Bτ)^2 (ΔV_0/ΔV_m) g_0, the sum of the ED products in the path can be derived as

(3.50)

For constant fan-outs, i.e. h_i = h_i+1 = H^0.5, the logical efforts are expressed as g_i = b·g_0 and g_i+1 = g_0. By ignoring the parasitic effect, the energy-delay products can be derived as

(3.51)

Comparing (3.50) and (3.51), the use of constant logical effort in a serializer data path results in a lower minimum energy-delay product and less delay for b > 1.

For a de-serializer data path from high to low data rates, the normalized delay is expressed as d_i+1 = b·d_i instead, where N is an integer. For constant logical effort, i.e. g_i = g_i+1 = g_0, the fan-out of each stage can be approximated as h_i = (H/b)^0.5 and h_i+1 = (bH)^0.5 by ignoring the parasitic effect at large path fan-out values. With the same definition of K_E, the energy-delay products can be derived as

(3.52)

For constant fan-outs, i.e. h_i = h_i+1 = H^0.5, the logical efforts can be expressed as g_i = g_0 and g_i+1 = b·g_0. The energy-delay products can be approximated as

(3.53)

Comparing the two results from (3.52) and (3.53), the use of constant logical effort in a de-serializer data path gives a lower minimum energy-delay product and less delay for b > 1.

3.4 Model Validation

The proposed logical effort model can be used to estimate delay and other design metrics such as power and energy for a variety of CML circuits. In this section, the accuracy of the proposed model is analyzed for a high-speed CML inverter and a multiplexer.
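The two sizing strategies compared above for a two-stage serializer path can be checked with a short calculation. This sketch covers only the delay side of the comparison, using the normalized delay d = g·h with the parasitic term ignored, as in the approximations quoted in the text; it does not reproduce the energy-delay expressions (3.50)-(3.53). The function names and the example values H = 16, b = 2 and g_0 = 1.5 are assumptions chosen for illustration.

```python
import math


def constant_effort_fanouts(path_fanout, b):
    """Stage fan-outs (h_i, h_next) for equal logical effort in a serializer.

    Chosen so that h_i * h_next = H and h_i / h_next = b, which gives
    d_i = b * d_next when both stages share the same g and parasitics
    are ignored.
    """
    return math.sqrt(b * path_fanout), math.sqrt(path_fanout / b)


def path_delay_constant_effort(path_fanout, b, g0):
    """Sum of normalized stage delays with g_i = g_next = g0."""
    h_i, h_next = constant_effort_fanouts(path_fanout, b)
    return g0 * h_i + g0 * h_next


def path_delay_constant_fanout(path_fanout, b, g0):
    """Sum of normalized stage delays with h_i = h_next = sqrt(H), g_i = b*g0."""
    h = math.sqrt(path_fanout)
    return b * g0 * h + g0 * h


if __name__ == "__main__":
    H, b, g0 = 16.0, 2.0, 1.5
    h_i, h_next = constant_effort_fanouts(H, b)
    print("h_i = %.3f, h_i+1 = %.3f" % (h_i, h_next))
    print("delay, constant effort :", path_delay_constant_effort(H, b, g0))
    print("delay, constant fan-out:", path_delay_constant_fanout(H, b, g0))
```

Both strategies satisfy d_i = b·d_i+1, but the constant-effort sizing gives the smaller total delay for any b > 1, because sqrt(b) + 1/sqrt(b) is less than b + 1. For a de-serializer the same functions apply with the two stages swapped.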


3.4.1 Analysis of a CML Inverter

The proposed logical effort delay model is verified against simulations in a 130 nm CMOS process technology using a CML inverter chain, as shown in Figure 3-11. Four CML inverter stages (INV1-INV4) with g = h = 1 are used to generate a realistic rise time at the input of the CML inverter under test (INV5). As in a typical application of a high-speed buffer, the same input and output voltage swings are assumed for INV5 [42-43]. The logical and electrical (fan-out) efforts of the test inverter are varied, and the corresponding delays are analytically determined using the logical effort model in Table 3-1. These are compared against simulated results over three different gate widths W_G. The normalized delay d_T is in agreement with simulations. The maximum error for all cases was within 10% and the average error was within 4%.

Figure 3-11. Calculated logical effort and simulated delays as a function of logical and electrical efforts for a CML inverter with W_G = 1 µm, 4 µm and 40 µm


The simulated power, energy and energy-delay product as a function of logical effort g and electrical effort h for a CML inverter with a load capacitor of 100 fF are illustrated in Figure 3-12. The trends of power, energy and energy-delay product are consistent with the theoretical result presented in Figure 3-9. The power dissipation is lower with larger logical effort g and fan-out h at a given capacitive load. The energy decreases with g until it reaches a constant value at g = g_0 (~1.5). A minimum value of the energy-delay product at a given fan-out is achieved at g = g_0, and a larger fan-out results in a higher energy-delay product because of the larger delay. The speed performance of a CML inverter is limited by its propagation delay. As shown in Figure 3-13, an optimal data rate corresponding to the minimum energy-delay product can be achieved at different fan-out values, and a smaller fan-out gives a higher optimal data rate. The maximum energy-delay error compared with the theoretical result is within 15%.

Figure 3-12. Simulated energy-delay product as a function of data rate and electrical fan-out h for a CML inverter with W_load = 80 µm

3.4.2 Analysis of a CML Multiplexer

A second common CML gate is the high-speed multiplexer used for chip-to-chip communication. Figure 3-14 shows a 4:1 multiplexer used as a simulation example. Four CML inverter stages (INV1-INV4) and a 2:1 multiplexer, denoted MUX0, are used to generate a realistic rise time at the input of the 4:1 MUX, which is composed of two cascaded 2:1 MUXs denoted MUX1 and MUX2. The propagation delay of MUX1 is chosen to be approximately twice that of MUX2 according to the data rate requirement.

Figure 3-13. Simulated energy-delay product as a function of data rate and electrical fan-out h for a CML inverter with W_load = 80 µm

Figure 3-14. Block diagram of a high-speed serializer simulation bench


The input and output voltage swings are first assumed to be the same along the data path. The same logical effort g (g_1 = g_2) is chosen for each MUX. The fan-out of MUX1 is chosen to be twice that of MUX2 (h_1 = 2h_2) in order to obtain d_RC,1 ≈ 2d_RC,2, according to the data rates in each stage. The calculated and simulated normalized sum delay of MUX1 and MUX2 as a function of constant logical effort from 1 to 3 is shown in Figure 3-15. The simulated energy and energy-delay products are displayed in Figure 3-16, which shows that the minimum energy-delay product is obtained at g = g_0 (~1.5). At g = 1, more energy per bit is consumed, but a higher data rate can be achieved.

Figure 3-15. Simulated and calculated normalized delay as a function of logical effort g, and error histogram

Figure 3-16. Simulated energy-delay products as a function of logical effort and data rate
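The location of the energy-delay minimum at g = g_0 can be reproduced from simple scaling relations. The sketch below is an illustration under stated assumptions rather than the model equations themselves: at a fixed fan-out and transistor size it takes the delay as proportional to g, the power as proportional to g^(-α/(α-1)) for 1 ≤ g ≤ g_0 (which follows from the alpha-power law if the delay is proportional to ΔV/J), and the power as proportional to 1/g for g > g_0 where the swing is constant. All quantities are normalized to their values at g = 1; the function name and the values α = 1.43 and g_0 = 1.6 are assumptions.

```python
def relative_metrics(g, g0=1.6, alpha=1.43):
    """Return (power, energy, energy_delay), each normalized to g = 1."""
    if g < 1.0:
        raise ValueError("g < 1 (J > J_m) is outside the assumed model")
    exponent = -alpha / (alpha - 1.0)
    if g <= g0:
        power = g ** exponent
    else:
        # Continuous at g0, then inversely proportional to g.
        power = (g0 ** exponent) * (g0 / g)
    delay = g
    energy = power * delay
    return power, energy, energy * delay


if __name__ == "__main__":
    for g in (1.0, 1.2, 1.4, 1.6, 2.0, 3.0):
        p, e, ed = relative_metrics(g)
        print("g = %.1f  P = %.3f  E = %.3f  ED = %.3f" % (g, p, e, ed))
```

In this sketch the energy per bit falls with g and then stays flat beyond g_0, while the ED product falls until g_0 and rises linearly afterwards, which is the same qualitative behaviour as the simulated results described above.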


3.5 Design Example I: A PRBS Generator

A 2^7-1 PRBS generator is studied and implemented as a simple high-speed digital system to verify, at the circuit level, the energy and delay performance predicted by the previously proposed logical effort model. As shown in Figure 3-17, the PRBS generator is a series-type implementation composed of seven D flip-flops (DFFs), one XOR circuit, a three-stage CML buffer and an output driver that matches the low characteristic impedance Z_0 of the testing cables [44]. A typical DFF is implemented using two identical CML latches with the schematic shown in Figure 3-7.

Figure 3-17. Block diagram of a 2^7-1 PRBS generator

The schematic of the high-speed XOR circuit is shown in Figure 3-18, with the two stacked differential inputs A and B. The tail current is steered through only one of the four branches. The transistor size for the two data inputs is the same, to provide the same load capacitance at the output nodes of the DFF stages. As shown in Figure 3-19, the circuit topology of the output driver is identical to that of a CML inverter; however, the design strategy differs from conventional CML. The output voltage swing V_TX is determined by the receiver sensitivity V_RX and the signal transmission loss atten. For a large enough input voltage swing, the output voltage of the driver is the product of the bias current I_0 and the termination resistor Z_term. Therefore, the power consumed by the output driver, given by (3.54), depends on the receiver sensitivity requirement and the interface characteristic.

Figure 3-18. Schematic of a CML XOR circuit

Figure 3-19. Schematic of a CML output driver

The design procedure starts with the high-speed back end and proceeds to the low-speed blocks. The sizing of the output driver is determined from its current, which also sets the load capacitance for the preceding buffer stage. Thus, the total power dissipation can be obtained recursively from the individual CML circuits as in (3.55), where W_G,0 is the transistor size of the output driver and N is the total number of CML circuits in the digital system.

Figure 3-20. Die picture of the PRBS data generator chip

Figure 3-21. Plot of the attenuation for the PRBS chip output data

The PRBS data generator was fabricated in UMC 130 nm CMOS technology. The die picture is shown in Figure 3-20, with a core chip area of 100 µm × 80 µm. RF probes and cables were used for on-die testing. The signal attenuation was measured as shown in Figure 3-21, including the 6 dB loss caused by the impedance matching. An Agilent 81600B oscilloscope with a maximum frequency of 65 GHz was used to measure the data received from the PRBS data generator at a constant single-ended voltage swing of 100 mV, as illustrated in Figure 3-22. The power consumption of the chip at various data rates was measured and compared to the theoretical results from (3.34). The energy-per-bit products are plotted in Figure 3-23, showing a minimum value around 7 Gb/s, which validates the theoretical results of the logical effort model.

Figure 3-22. Measured eye diagrams for the PRBS chip

Figure 3-23. Measured and simulated energy-delay performance for the PRBS chip

3.6 Design Example II: A High Speed Transmitter

A high-speed transmitter system with a feed-forward equalizer (FFE) is studied in this section for a typical application in backplane systems. The block diagram is shown in Figure 3-24. The transmitter chip generates random data at a low rate and serializes the data at full rate into the channel. The clock for synchronizing the data is provided by an external source. To minimize the power dissipation, static and dynamic CMOS gates are used for the low-speed blocks, while the high-speed blocks are implemented with CML gates. At the front end, a 3-tap FFE composed of retiming latches, selectors and drivers is implemented to compensate for the ISI. All CML circuits are designed to have variable tail currents controlled by a 5-bit current DAC.

Figure 3-24. Block diagram of a high speed transmitter with FFE

Figure 3-25. Schematic of current DAC


The schematic of the current DAC (IDAC) is displayed in Figure 3-25. The tail current of a CML gate, I_T, is mirrored from a reference current I_ref, which can be varied by switching an array of PMOS transistors ON or OFF through the gate signals C. Controlled by digital bits from the loader, a buffer stage generates each signal C, equal either to the power supply or to V_bias from the reference voltage.

Figure 3-26. Block diagram of the FFE and output driver for TX

This high-speed transmitter was implemented for data transmission on commercial backplane boards. To overcome the severe signal loss at high data rates, a half-rate FFE (Figure 3-26), which consumes less power than a full-rate FFE, is used at the transmitter front end. The FFE receives two data streams, d_1 and d_2, at half of the bit rate. For proper operation of the 2:1 selector, the data streams are shifted by a unit interval (UI) using a data retiming latch. The two half-rate streams are then interleaved and multiplexed into a full-rate signal for the first FFE filter tap. Additional shifting and interleaving yields delayed data streams for the remaining taps of the FFE. The output driver for the main tap is also displayed in Figure 3-26. The output currents from the other equalization taps are summed together at the termination resistor Z_term to implement the filter function.

The transmitter output voltage swing depends on the equalization coefficients, the channel characteristics and the receiver sensitivity. The design procedure begins by calculating the system bit error rate (BER) from the combined channel and equalizer response at different sampling times and offset voltages through the recursive convolution of the ISI PDFs. The normalized vertical eye opening at the receiver, atten, for a target BER can then be computed at different sampling times, along with the optimal equalization coefficients k_i for the combined channel/FFE response. Jitter in the transmit-chain synchronization clocks can also be taken into account in the BER calculations by averaging the conditional PDFs. The resulting minimum required transmitter output voltage swing V_TX is obtained from the receiver eye-opening amplitude V_RX and the normalized vertical eye opening atten as in (3.56), where atten is the maximum eye-opening value when the transmitted data is sampled at the optimal time without sampling jitter. The total power consumption of the CML circuits in the transmitter then follows as in (3.57), where W_G,0 is the transistor size of the output driver and N is the total number of CML circuits in the system.

For testing purposes, a 2^7-1 PRBS data generator is also included inside the transmitter chip. As presented in Section 3.5, a series implementation of the PRBS data generator uses seven D flip-flops (DFFs) in the shift register. However, 10 XOR gates are required to form eight phase-shifted sequences for multiplexing according to the algorithm given in [46], and another eight DFFs are used to retime the signals and equalize the delays. Thus, the parallel structure shown in Figure 3-27 is more suitable in all cases where the output data streams from the data generator are phase-shifted appropriately for direct further multiplexing [47].

Figure 3-27. Block diagrams of 2^7-1 parallel PRBS data generator and CMOS DFF

Figure 3-28. Transmitter chip die photo

A high-speed transmitter with FFE was fabricated in TI 65 nm CMOS technology; the die picture is shown in Figure 3-28, with a core area of 100 µm × 200 µm. A Tyco 16/30 in. board with two daughter cards was used to validate the predicted power dissipation. The measured channel response with the cascaded cables and DC blocks is shown in Figure 3-29. The calculated signal attenuation without the FFE is shown in Figure 3-30. The signal attenuation without the FFE is given by the magnitude response at the Nyquist frequency and is severe at higher frequencies. With the frequency response boosted by the FFE at high data rates, less loss and a better attenuation atten at a low target BER (10^-12) can be obtained.

Figure 3-29. Measured channel mag. response

Figure 3-30. Signal attenuation at Nyquist frequency W/ FFE and vertical eye opening W/O FFE
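
As a rough illustration of how a transmit FFE trades low-frequency amplitude for high-frequency boost, the following Python sketch evaluates the magnitude response of a UI-spaced FIR filter at DC and at the Nyquist frequency. It is not the model used in this work: the tap values are illustrative, the negative post-cursor sign is an assumption, and the helper names `ffe_gain` and `nyquist_boost_db` are hypothetical.

```python
import cmath
import math


def ffe_gain(taps, f_norm):
    """Magnitude of a UI-spaced FIR response at f_norm = f * T_bit.

    f_norm = 0.0 is DC and f_norm = 0.5 is the Nyquist frequency.
    """
    h = sum(c * cmath.exp(-2j * math.pi * f_norm * k)
            for k, c in enumerate(taps))
    return abs(h)


def nyquist_boost_db(taps):
    """Gain at Nyquist relative to DC, in dB (positive means high-frequency boost)."""
    return 20.0 * math.log10(ffe_gain(taps, 0.5) / ffe_gain(taps, 0.0))


if __name__ == "__main__":
    # Illustrative taps: main cursor 1, first post-cursor -0.25 (sign assumed).
    taps = [1.0, -0.25, 0.0]
    print("DC gain      :", ffe_gain(taps, 0.0))
    print("Nyquist gain :", ffe_gain(taps, 0.5))
    print("Boost (dB)   :", round(nyquist_boost_db(taps), 2))
```

With these illustrative taps the filter attenuates DC to 0.75 and raises the Nyquist component to 1.25, which is the kind of relative boost that offsets part of the channel roll-off.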


The FFE tap coefficients can be estimated from the digital control bits applied during testing, since each tap has the same controllable IDAC with branches weighted 1, 2, 4, 8 and 16. Table 3-4 and Table 3-5 list the predicted and measured FFE taps on the Tyco backplane board.

Table 3-4. Predicted and measured FFE taps for a 16 in. Tyco channel

Data rate (Gb/s)    Predicted         Measured
2                   1/0/0             1/0/0
4                   1/-0.1/0          1/-0.13/0
5                   1/-0.2/0          1/-0.25/0
6                   1/-0.4/-0.2       1/-0.44/-0.13
7                   1/-0.4/-0.4       1/-0.44/-0.38

Table 3-5. Predicted and measured FFE taps for a 30 in. Tyco channel

Data rate (Gb/s)    Predicted         Measured
1                   1/0/0             1/0/0
2                   1/-0.1/0          1/-0.11/0
3                   1/-0.1/-0.1       1/-0.13/0
4                   1/-0.15/-0.15     1/-0.18/-0.1
5                   1/-0.2/-0.15      1/-0.25/-0.1

The power performance of the transmitter chip on the Tyco board, from modeling and testing, is illustrated in Figure 3-31. To maintain the same system requirements at different data rates, an Agilent ParBERT 81250 was used to measure the received data and keep the eye opening at 50 mV for the target BER of 10^-12. The predicted and tested energy-per-bit performance of the transmitter are consistent; the measured values are higher because the packaging, PCB traces, testing cables, etc. contribute ISI in addition to the backplane channel. For the 16 in. and 30 in. channels, the minimum energy-per-bit products at low target BER are measured around 5 Gb/s and 3 Gb/s, respectively [48]. Compared with the theoretical optimum for the driver at 3 Gb/s and 1 Gb/s regardless of BER performance [49], equalization on severe channels is shown to push the optimal data rate higher.
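
The estimation of tap coefficients from the IDAC control bits can be sketched as follows. This is a minimal illustration under stated assumptions: each tap current is taken to be proportional to the sum of the enabled branch weights (1, 2, 4, 8, 16), the coefficient magnitude is taken as the ratio of a tap's current to the main-tap current, and the bit patterns in the usage example are hypothetical since the codes used in testing are not listed in the text. The function names are also hypothetical.

```python
# Branch weights of the 5-bit IDAC, LSB first (from the text: 1, 2, 4, 8, 16).
BRANCH_WEIGHTS = (1, 2, 4, 8, 16)


def idac_units(bits):
    """Return the tail current in unit-branch multiples for 5 control bits.

    bits: sequence of five 0/1 values, ordered like BRANCH_WEIGHTS (LSB first).
    Assumption: current scales linearly with the enabled branch weights.
    """
    if len(bits) != len(BRANCH_WEIGHTS):
        raise ValueError("expected 5 control bits")
    return sum(w for w, b in zip(BRANCH_WEIGHTS, bits) if b)


def tap_magnitudes(tap_bits):
    """Normalize every tap current to the main tap (first entry)."""
    units = [idac_units(bits) for bits in tap_bits]
    if units[0] == 0:
        raise ValueError("main tap current must be non-zero")
    return [u / units[0] for u in units]


if __name__ == "__main__":
    # Hypothetical codes: main tap = 16 units, second tap = 7, third tap = 2.
    codes = [(0, 0, 0, 0, 1), (1, 1, 1, 0, 0), (0, 1, 0, 0, 0)]
    print(tap_magnitudes(codes))  # [1.0, 0.4375, 0.125]
```

Under these assumptions the hypothetical codes give magnitudes of 0.4375 and 0.125, which round to the 0.44 and 0.13 listed as measured at 6 Gb/s in Table 3-4; the tap polarity is set elsewhere in the driver and is not modeled here.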


Figure 3-31. Energy-per-bit performance for TX and measured statistical eye diagram. A) 16 in. Tyco channel. B) 30 in. Tyco channel.
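
For reference, the series 2^7-1 PRBS generator of Section 3.5 (seven flip-flops and one XOR) can be modeled in software as a linear-feedback shift register. The sketch below assumes the common PRBS7 polynomial x^7 + x^6 + 1; the text does not state which feedback taps the chip uses, and the function name `prbs7` is hypothetical.

```python
def prbs7(seed=0x7F, nbits=127):
    """Generate nbits of a 2^7-1 PRBS from a 7-stage series LFSR.

    Assumed polynomial: x^7 + x^6 + 1 (feedback = XOR of stages 7 and 6).
    seed must be a non-zero 7-bit value; the all-zero state is a lock-up state.
    """
    if not 0 < seed < 128:
        raise ValueError("seed must be a non-zero 7-bit value")
    state = seed
    out = []
    for _ in range(nbits):
        new_bit = ((state >> 6) ^ (state >> 5)) & 1  # the single XOR gate
        out.append((state >> 6) & 1)                 # output of the last flip-flop
        state = ((state << 1) | new_bit) & 0x7F      # shift the seven flip-flops
    return out


if __name__ == "__main__":
    seq = prbs7(nbits=254)
    print("repeats after 127 bits:", seq[:127] == seq[127:])
    print("ones per period:", sum(seq[:127]))
```

A maximal-length sequence of this kind repeats every 127 bits and contains 64 ones and 63 zeros per period, which is a quick sanity check for any hardware or software implementation.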


CHAPTER 4
SYSTEM MODELING FOR HIGH SPEED PARALLEL I/O LINKS

4.1 Motivation

Based on the high-speed I/O system behavioral model presented in Chapter 2 and the logical effort model for CML circuits presented in Chapter 3, a design framework is introduced in this chapter to relate the system-level parameters (i.e. BER) to circuit design parameters such as gate sizing and current biasing. The proposed design framework can be used to estimate:

- The required number of I/O links for a desired aggregate bandwidth
- The required per-lane data rates to maintain power efficiency
- The amount of channel equalization and the tap coefficients needed to compensate for channel loss, dispersion, crosstalk, etc.

The number of channels for a desired aggregate bandwidth depends largely on the channel environment and on the area constraints imposed by the packaging requirements. In addition to the channel discussed in Chapter 2, i.e. backplane systems, packaging design strategies limit the achievable number of high-speed I/O pins and contribute to the overall attenuation in the signal path. In Section 4.2, we discuss potential packaging strategies to derive practical I/O counts for signaling. The design framework is presented in Section 4.3 to describe an I/O example including the serializer, deserializer, FFE, DFE, phase-locked loop (PLL) and clock and data recovery (CDR) circuit. Performance metrics such as energy per bit, bandwidth per unit area and power density are derived for a template link architecture and compared across technology nodes in Section 4.4.

4.2 Package Strategies for High Speed I/O Links

Packaging platforms play a critical role in determining practical solutions for high aggregate bandwidth systems. For instance, while packages with a high I/O count maximize the attainable bandwidth transferred from the IC to the board-level interconnect, cost constraints limit the overall package footprint and the number of available I/Os for signaling. Increasing the number of pins requires high-performance interposer substrates with smaller feature sizes, a larger number of layers and finer pad pitches to enable high off-chip bandwidth communication within practical package footprints and board real estate [50]. On the chip side, die yield limits the total die area. Since only a fraction of this area can be used for off-chip communication circuitry, the I/O density, or bandwidth per unit area, is limited. This affects not only the style of circuit design (i.e. mostly digital, to exploit technology scaling), but also the type of die-to-interposer connections (i.e. flip chip vs. wirebond) and the overall package requirements.

Figure 4-1. BGA packages from NEC and IBM

The highest contact density in packages used for high-performance systems such as processors and graphics engines is available in fine-pitch flip-chip ball grid array (FCBGA) packages. Flip chip is a standard process for connecting the die to an interposer substrate; it offers high-density connections that maximize the aggregate bandwidth per unit area that can be transferred in and out of the chip surface. State-of-the-art FCBGA packages use organic substrate build-ups with pads in the 150-200 µm pitch range. For instance, the HyperBGA FCBGA package is a PTFE (polytetrafluoroethylene) chip carrier technology from Endicott Interconnects that can support 180-200 µm pitch flip chip attached to 50 µm laser-drilled vias, with line spacing and width of 28 µm and 33 µm, respectively [51]. These packages are typically composed of multiple low-loss copper metallization layers, with signal layers embedded in a true stripline environment sandwiched between voltage and/or ground layers. In addition, redistribution layers at the top and bottom are used to establish wiring connections to the chip array pads and BGA pads, respectively.

Figure 4-2. Package trends from ITRS. A) Flip chip and BGA. B) Interposer.


Despite the continued advances in fine-pitch FCBGA packaging technologies, substrate costs limit the pin count and I/O densities for high-volume, high-yield chip-to-system circuit board connectivity solutions. Higher I/O densities and the resulting smaller pad pitch require an increased number of substrate signal layers and smaller line widths and spacing, and bring about high-volume manufacturing issues such as joint reliability, substrate co-planarity, copper adhesion, choice of materials, thermal expansion and underfill requirements [52]. According to the ITRS trends shown in Figure 4-2, cost-driven projections limit the BGA solder ball pitch for conventional system board designs to 0.5 mm, whereas the chip-to-substrate flip-chip area-array pad pitch is likely to scale down to 100 µm for future technology nodes [1]. Thus the overall cost and performance tradeoffs of packaging technologies bound the I/O count at the substrate-to-board interconnection interface and set an upper limit on the number of signaling channels available.

On the chip side, the available silicon area for an I/O transceiver module or macro is limited to a fraction of the total chip area, which may range from 5% to 25% depending on the type of ASIC. As process technology continues to scale, the area of functionally equivalent I/O signaling circuits will also scale proportionally across technology nodes. This suggests that the effective bandwidth per unit area will increase with technology scaling, partly due to higher transistor operating frequencies and higher transistor densities. However, since these high-performance I/O macros must be placed around and even underneath the pad area to mitigate propagation loss within the chip, the minimum pitch of the flip-chip area-array pads will progressively determine the overall chip surface area utilization for data transfer.
For maximum I/O density, the circuit area dedicated to each I/O link should be smaller than the area defined by the minimum flip-chip pad pitch arrangement. For differential signaling, a preferred choice for high-speed communications, at least two pads per I/O link are required, and hence the I/O macro circuit area can be inferred from the minimum flip-chip pad pitch and footprint. To determine the number of I/O channels available for signaling, we consider the overall chip pad arrangement shown in Figure 4-3. This model assumes high-density peripheral I/Os to maximize the utilization of the chip surface area while minimizing the overall flip-chip area-array pad count and the number of required interposer substrate routing layers. Centrally located I/O pads can be made less dense and are mainly used for power and ground connections, which typically constitute 30% to 50% of the total pad count. These pads are also used to connect supply and ground pins to flip-side surface-mount decoupling capacitors. Along the chip edge, lower-density corner and middle pads can be used to relax routing to surface-mount components on the top layer of the interposer and also provide better inlet channels for edge-dispensed underfills, which relieve the stress caused by the CTE mismatch between the silicon and the interposer substrate.

Figure 4-3. 3-D packaging strategy for high speed I/O interconnect network

For the high-density peripheral differential I/O modules, three different pad arrangement strategies are analyzed, as shown in Figure 4-4. In each scenario, channel routing can be accomplished using an interposer substrate with two signal layers embedded in a true stripline, impedance-controlled environment sandwiched between power and ground layers. This allows routing of four rows of differential I/Os from the chip perimeter to the bottom BGA layer, with every stripline signal layer supporting two rows of differential I/Os. Adding more substrate layers simply increases the I/O count while keeping the I/O density at the chip-substrate interface unchanged. Based on the different pad arrangements, the area of a transceiver core can be defined in terms of the pad pitch P and ranges from 2P^2 to 3P^2. Similarly, the effective number of supply/ground pins per I/O module ranges from 0.5 to 1, with the [...] effective. Via type requirements, such as blind versus plated through-hole (PTH) vias, can also be derived. For instance, the pad arrangements in A and C require PTH vias, whereas the pad arrangement in B requires blind or buried microvias to relax the escape routing density bottleneck. Many manufacturers already provide stacked microvia options for ultra-thin substrate technologies with fine-pitch lines smaller than 30 µm and microvias smaller than 30 µm in diameter. Examples include Any Layer Inner Via Hole (ALIVH) from Matsushita, Buried Bump Interconnection Technology (B2it) from Toshiba and the Multi-layer Thin Substrate (MLTS) from NEC, all of which support blind and stacked vias to improve signal routing [22].

Figure 4-5 shows the resulting pad pitch and I/O core area when the substrate design is driven by the minimum line width/spacing parameters. In all cases, equal line width and spacing with a flip-chip pad diameter D of 30 µm is used. Based on the pad arrangements shown in Figure 4-4, a flip-chip pitch of 200 µm or less will require lines and spacing of less than 20 µm.
At these dimensions, the total transceiver I/O core area is about 0.1 mm^2. Figure 4-6 shows an estimate of the total number of differential signaling channels and the percentage area occupied by signaling cores as a function of chip area. The plots assume equal substrate lines and spacing of 20 µm, a pad diameter of 30 µm, two stripline signal layers for routing, and that the high-density peripheral I/O pads shown in Figure 4-3 populate about 80% of the die edge. As indicated in Figure 4-6, for smaller die sizes of about 100 mm^2 the number of I/Os ranges between 500 and 750, whereas larger chip sizes approaching 400 mm^2 can support about 900 to 1500 I/O cores. The total area occupied by the transceiver cores across all pad arrangements is 20%-55% for smaller die sizes and 10%-30% for larger chips. Pad arrangement [...] I/Os at a moderate die area overhead of less than 25% for larger chip sizes. A summary of performance parameters for the various pad configurations is presented in Table 4-1.

Figure 4-4. Pad arrangements and escape routing for four rows of differential I/Os and high-density peripheral I/Os. Case A: inline pads with 2 lines between pads and PTH vias. Case B: staggered pads with 3 lines between pads and blind or buried vias. Case C: inline pads with up to 4 lines between pads and PTH vias.


Figure 4-5. D = 30 µm, chip area equals 310 mm^2, number of stripline signal layers is 2

Figure 4-6. Number of differential I/Os and I/O chip area as a function of total chip area

Table 4-1. Performance summary for the I/O pad arrangements shown in Figure 4-4

Pad arrangement                         A              B              C
Minimum flip-chip pad pitch (P)         P              0.707P         P
Pad pitch (assume W = S)                D+5W           D+7W           D+9W
Via type                                PTH            Blind via      PTH
Number of stripline signal layers       2              2              2
Number of substrate layers              6-9*           6-9*           6-9*
Core area                               3P^2           2P^2           3P^2
Supply/ground pins per core             0.5/0.5        1/1            0.5/0.5
Total number of differential I/Os
  (chip area 100 mm^2 / 400 mm^2)       492/985        753/1506       610/1220
% of chip area
  (chip area 100 mm^2 / 400 mm^2)       22.53/11.87    36.12/19.91    55.24/33.97

* Depends on build-up process
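
The pad-pitch and core-area rows of Table 4-1 can be evaluated numerically. The sketch below is only an illustration: it assumes that the P in the core-area row is the pad pitch from the "Pad pitch (assume W = S)" row, and the function and dictionary names are hypothetical. With D = 30 µm and W = S = 20 µm, case B gives a 170 µm pitch and a core area of about 0.058 mm^2, which agrees with the package type B values quoted in Table 4-3.

```python
# Coefficients read from Table 4-1: pitch = D + k*W, core area = a * P^2.
PAD_ARRANGEMENTS = {
    "A": {"pitch_lines": 5, "area_factor": 3},
    "B": {"pitch_lines": 7, "area_factor": 2},
    "C": {"pitch_lines": 9, "area_factor": 3},
}


def pad_pitch_um(case, pad_diameter_um, line_width_um):
    """Pad pitch in micrometers, assuming equal line width and spacing."""
    return pad_diameter_um + PAD_ARRANGEMENTS[case]["pitch_lines"] * line_width_um


def core_area_mm2(case, pad_diameter_um, line_width_um):
    """Transceiver core area in mm^2 (assumes P is the pad pitch above)."""
    pitch_mm = pad_pitch_um(case, pad_diameter_um, line_width_um) / 1000.0
    return PAD_ARRANGEMENTS[case]["area_factor"] * pitch_mm ** 2


if __name__ == "__main__":
    for case in ("A", "B", "C"):
        print(case,
              pad_pitch_um(case, 30, 20), "um,",
              round(core_area_mm2(case, 30, 20), 4), "mm^2")
```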


4.3 Design Framework for High Speed I/O Links

A design framework is presented in this section to analyze the area, power and bandwidth of high-speed I/O links and to determine compound metrics such as energy per bit, bandwidth per unit area and power density [54]. For a given target aggregate I/O bandwidth and channel characteristic, a high-speed transceiver I/O link is studied as an example at both the system and circuit levels to determine the optimal number of channels, the data rate per channel and the minimum energy per bit across technology nodes. Alternatively, for pad-limited designs, the degradation in energy per bit can be easily determined.

4.3.1 System Level Model

Figure 4-7 shows a behavioral model of the I/O link at the system level, composed of the serializer, FFE and PLL on the TX side, the channel, and the DFE, deserializer and CDR on the RX side. As discussed in Chapter 2, statistical modeling is applied to describe the signal integrity of the FFE, channel and DFE. The design procedure begins by calculating the system bit error rate (BER) from the combined channel and equalizer response at different sampling times and offset voltages through the recursive convolution of the inter-symbol interference (ISI) probability distribution functions (PDFs). The normalized vertical eye opening at the receiver end for a target BER can then be computed at different sampling times, and atten is the maximum vertical eye-opening value with the optimal equalization coefficients c_i (i up to M for an M-tap FFE) and k_i (i up to N for an N-tap DFE). Jitter in the PLL clocks can also be taken into account in the BER calculations by averaging the conditional PDFs. Simultaneously, the optimal current-centric design methodology discussed in Chapter 3 can be applied to obtain the power and energy metrics of the individual building blocks in the system.
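
The recursive convolution of ISI PDFs can be sketched as below. This is a simplified illustration rather than the statistical model of Chapter 2: it assumes equiprobable, independent binary symbols, ignores noise, jitter and crosstalk, quantizes voltages to a fixed grid, and uses hypothetical function names and cursor values. Each ISI cursor contributes a two-point distribution (+h or -h), and convolving these one by one yields the distribution of the total ISI, from which the eye opening at a target BER is read off.

```python
from collections import defaultdict


def isi_pmf(isi_cursors, step=1e-3):
    """Distribution of the total ISI, built by recursive convolution.

    Returns a dict mapping a grid index to a probability, where
    voltage = index * step. Assumes +/-1 symbols with probability 0.5 each.
    """
    pmf = {0: 1.0}
    for h in isi_cursors:
        k = round(h / step)
        nxt = defaultdict(float)
        for idx, p in pmf.items():  # convolve with the two-point PMF of this cursor
            nxt[idx + k] += 0.5 * p
            nxt[idx - k] += 0.5 * p
        pmf = dict(nxt)
    return pmf


def normalized_eye_opening(main_cursor, isi_cursors, target_ber, step=1e-3):
    """Vertical eye opening (relative to the main cursor) at target_ber.

    ISI levels whose cumulative probability stays at or below target_ber are
    ignored; the worst remaining ISI level closes the eye.
    """
    pmf = isi_pmf(isi_cursors, step)
    tail = 0.0
    worst = 0
    for idx in sorted(pmf):  # most negative ISI first
        tail += pmf[idx]
        if tail > target_ber:
            worst = idx
            break
    opening = main_cursor + worst * step
    return max(opening, 0.0) / main_cursor


if __name__ == "__main__":
    # Hypothetical pulse response: main cursor 1.0, two residual ISI cursors.
    print(normalized_eye_opening(1.0, [0.1, 0.05], 1e-12))
```

With only two cursors, every ISI combination is far more likely than 10^-12, so the result equals the worst-case eye; with many small cursors the statistical eye at the target BER becomes noticeably larger than the worst-case one, which is the benefit of the statistical approach.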


Figure 4-7. System level link model for a high speed transceiver with FFE and DFE

4.3.2 A High Speed Transceiver System Blocks

A typical transceiver example including the PLL and CDR is shown in Figure 4-8. The transmitter consists of a 4:1 parallel-to-serial converter with a 3-tap pre-emphasis filter. A PLL generates a full-rate clock and divided-down clock signals from a slower reference (i.e. 1/32 of the clock rate) to synchronize the latches. Clock buffers are inserted in the transmitter and receiver to account for the deserializer/serializer clock loading. The receiver front end consists of an amplification stage and a 3-tap decision feedback equalizer. A clock and data recovery (CDR) loop in the receiver selects the appropriate clock edge from the phase interpolator. To save power, both the PLL and the CDR can be implemented using digital approaches, as discussed in [55, 56].

The topology of an example PLL used in the modeling is shown in Figure 4-8(A); it is composed of a 4-stage ring oscillator, dividers, a time-to-digital converter (TDC) and a digital loop filter. As a replacement for the phase/frequency detector and charge pump in an all-digital PLL, a conventional TDC is employed to measure the edge time difference between the divider output and the reference clock signal [57]. The number of inverters and flip-flops in the TDC can be estimated from the required timing resolution and range, which depends partly on the clock frequency. The topology of an example digital CDR is illustrated in Figure 4-8(B); it is composed of a phase interpolator, dividers and a CDR logic circuit. The high-speed phase interpolator is implemented in current-mode logic style, and the CDR logic, which runs at a rate of f_d/4, is implemented using CMOS logic gates.

Figure 4-8. A typical high speed I/O example. A) Transmitter and PLL. B) Receiver and CDR.

Calculating the power dissipation of the individual blocks in the high-speed transceiver shows that the TX output driver dominates the other circuit components because of the low impedance level of the channel it drives. The design procedure for the TX building blocks was discussed in Chapter 3. As shown in Figure 4-9, the power dissipation of the TX driver is determined by the output voltage V_TX, which depends on the channel loss, the RX amplifier gain and the FFE/DFE coefficients from the statistical analysis at the system level; at the same time, the TX output voltage must satisfy the receiver input sensitivity V_RX after being attenuated by the channel and boosted by the equalizers. The amplifier stage in the receiver compensates for the amplitude loss of the received signal, which is then compared to a known target value in the following slicer stage. The power metrics of the transmitter and receiver can be written as in (4.1).

Figure 4-9. Circuit model for a transceiver front end

All of the CML building blocks in the RX (N_2 in number) can be designed with the current-centric methodology presented earlier. The power dissipation of the DFE is related to the equalization coefficients, the loading from W_G,N2-1 (the size of the following gate) and its path effort. Only the front-end amplifier operates differently from conventional CML digital gates; it can be analyzed through the small-signal analysis presented in [58].
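
The dependence of the TX driver power on the link budget can be illustrated with a short calculation. This sketch is not equation (4.1), whose form is not reproduced here. It assumes that the required swing is V_TX = V_RX / atten, that the driver current is I_0 = V_TX / Z_term (the output voltage being the product of bias current and termination resistance, as stated in Section 3.5), and that the static driver power is V_DD times I_0. The supply voltage, the numeric values and the function name are hypothetical, and the RX amplifier gain is left out.

```python
def tx_driver_budget(v_rx, atten, z_term, v_dd):
    """Estimate TX swing, bias current and static power of a CML output driver.

    v_rx   : required receiver eye opening in volts
    atten  : normalized vertical eye opening of the equalized link (0 < atten <= 1)
    z_term : effective termination resistance in ohms
    v_dd   : supply voltage in volts (assumption: power = v_dd * bias current)
    """
    if not 0.0 < atten <= 1.0:
        raise ValueError("atten must be in (0, 1]")
    v_tx = v_rx / atten   # swing needed at the transmitter output
    i_0 = v_tx / z_term   # bias current that produces this swing
    power = v_dd * i_0    # static power drawn from the supply
    return v_tx, i_0, power


if __name__ == "__main__":
    # Hypothetical numbers: 50 mV eye, atten = 0.25, 50 ohm load, 1.2 V supply.
    v_tx, i_0, power = tx_driver_budget(0.05, 0.25, 50.0, 1.2)
    print(f"V_TX = {v_tx * 1e3:.0f} mV, I_0 = {i_0 * 1e3:.1f} mA, P = {power * 1e3:.1f} mW")
```

Under these assumptions, halving atten (a lossier or less well equalized channel) doubles the required swing and therefore the driver power, which is why the driver dominates the energy budget on high-loss channels.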


4.3.3 Link Optimization Flowchart

The developed simulation framework incorporates signal integrity analysis for high-speed I/Os to predict the link performance. Given the methodology employed for obtaining accurate representations of the building blocks in the transmitter and receiver, behavioral models for CML and CMOS circuits are added to the design framework.

Figure 4-10. Flow chart of a full link simulation platform

The flow chart of the simulation platform is shown in Figure 4-10. The inputs of the platform are a channel, represented by S-parameters or a loss roll-off, and the total aggregate bandwidth at a BER target. First, the data rate per pin (f_d/pin) that satisfies the packaging constraint is swept to obtain the system and circuit metrics for further comparison and optimization. At the system level, the link architectures that satisfy the requirements, including the types and amounts of equalization, are determined so as to achieve the target BER through statistical BER calculation or simulation. The simulation platform provides various options for achieving the target BER through different amounts of equalization. At the circuit level, the current-centric design methodology is used to obtain the power metrics of these link options. The optimal link architecture is the option with the minimum power consumption. Thus, by sweeping the data rate per channel, the proposed platform yields the optimal link architecture together with the system metrics, i.e. energy efficiency and power density, at a given degree of parallelism. Specifically, once the optimal equalization coefficients are obtained, the eye diagrams associated with the BER can be plotted.

Figure 4-11. An I/O link optimization framework

As shown in Figure 4-11, the optimization approaches used here can be validated by simply sending a test PRBS pattern through the link system and simulating the BER in the time domain. The transmitted data at the TX output is obtained by convolving the system response of the FFE and channel with the test data. The crosstalk effect, represented by the S-parameters H_XF(f), can be taken into account by sending an uncorrelated test PRBS pattern through the cascade of the FFE and H_XT(f) and adding the result to the original transmitted pattern at the TX output. The AWGN and jitter can also be effectively counted as voltage noise on the transmitted data. The received data is calculated by convolving the data at the TX output with the RX DFE response and is then compared with the original test pattern to obtain the error information. After the optimal settings of the analog equalizer are found, the power metrics can be obtained using the current-centric design methodology.

4.4 Analysis Results

Based on the transceiver link architecture shown in Figure 4-8 and the design methodology presented in Chapter 3, the optimum data rate per link, the energy-per-bit cost of the transceiver blocks, the optimal aggregate bandwidth and the power density for different aggregate bandwidths are computed under various channel and noise constraints. In this section, the optimal design for minimum energy per bit is analyzed first, and then a realistic design under package constraints is discussed.

4.4.1 Optimal Data Rate per Link

The optimal data rate per link that minimizes the energy per bit across different process technologies is shown in Figure 4-12 for channels with 1 dB/GHz and 8 dB/GHz roll-off. Adjacent-channel crosstalk coupling of -20 dB is assumed for all results. The minimum receiver sensitivity is set to -26 dBm. As shown in Figure 4-12, for low-loss channels, higher data rates of 10 Gb/s can be achieved under minimum power dissipation constraints. As the channel loss increases, this optimal data rate decreases to 2.5-5 Gb/s. This is indicative of the increasing amount of equalization power required to compensate for channel losses. The impact of technology scaling is therefore more pronounced for low-loss channels, where the overall power dissipation associated with channel equalization is not dominant.

Figure 4-12. Energy/bit in mW/Gb/s versus data rate per channel for 1 dB/GHz and 8 dB/GHz channel roll-offs

Figure 4-13. Energy/bit of link components in mW/Gb/s for a 16 nm CMOS design at channel loss rates of 1 dB/GHz and 8 dB/GHz

4.4.2 Energy per bit Cost for Transceiver Blocks

The energy/bit (mW/Gb/s) cost breakdown for the transmitter (TX), receiver (RX) and PLL in 16 nm gate-length technology for 1 dB/GHz and 8 dB/GHz channel roll-offs is shown in Figure 4-13. In all cases the transmitter dissipates the most, due to the low impedance level of the channel transmission lines. As the data rate increases, the energy cost of the PLL increases due to the higher operating clock frequency. This is the case for both simulated channel types. The energy per bit of each component does not vary significantly with data rate for the low-loss channel (1 dB/GHz), owing to the low energy cost of equalizing the channel. In contrast, for the 8 dB/GHz channel, the cost of equalization in both the TX and RX becomes increasingly dominant as the data rate increases, as shown in Figure 4-13.

4.4.3 Optimal Aggregate Bandwidth

Figure 4-14 shows the 3-D and contour plots of the attainable energy per bit as a function of aggregate bandwidth and chip area for case B. As seen from the 3-D plot, for a fixed aggregate bandwidth there is a minimum energy-per-bit value corresponding to a certain chip area. Likewise, for a fixed chip area, there is an optimal aggregate bandwidth that minimizes the energy per bit; a larger or smaller aggregate bandwidth yields a sub-optimal energy per bit for that chip area. For a 16 nm CMOS design and a moderate chip area of 310 mm^2, the maximum aggregate bandwidth under optimal energy-per-bit signaling constraints can be derived for the two types of channels considered here (1 dB/GHz and 8 dB/GHz) and for the packaging configurations discussed in Section 4.2. Table 4-2 summarizes various performance parameters, including the maximum attainable aggregate bandwidth, power density and total power dissipation for low-loss (1 dB/GHz) and high-loss (8 dB/GHz) channels. The maximum aggregate bandwidth ranges from 6.5 Tb/s to 10.6 Tb/s for the 1 dB/GHz channel and from 2.2 Tb/s to 3.3 Tb/s for the 8 dB/GHz channel.
Similarly, the power density ranges from 6.7 mW/mm² to 18.9 mW/mm² for the 1 dB/GHz channel and from 5.5 mW/mm² to 15.5 mW/mm² for the 8 dB/GHz channel.

PAGE 95

Figure 4-14. 3-D and contour plots of the energy/bit of link components in mW/Gb/s versus total aggregate bandwidth and chip area (case B) for a 16 nm CMOS design with channel loss rates of 1 dB/GHz and 8 dB/GHz.

4.4.4 Power Density vs. Aggregate Bandwidth

Figure 4-15 shows the power consumption per unit area as a function of total aggregate bandwidth for 1 dB/GHz and 8 dB/GHz roll-offs using different process technologies. When the pitch of the high-speed IOs inside the package is fixed, the number of high-speed IOs and the total IO core area for the case B pad arrangement are largely determined. Because the energy per bit can be calculated for a given data rate on a channel with a given roll-off rate, the power density follows directly for the total aggregate bandwidth. As shown in Figure 4-15, the impact of

PAGE 96

technology scaling is more dominant at larger aggregate bandwidths, where the power dissipated in channel equalization dominates.

Figure 4-15. Power density in W/mm² versus aggregate bandwidth in Tb/s for a fixed die area of 310 mm².

Table 4-2. Performance metrics for 16 nm CMOS technology
Channel type                  1 dB/GHz    8 dB/GHz
Energy per bit                ~0.56       ~1
Total aggregate bandwidth     ~19 Tb/s    ~4.5 Tb/s
IO density (# IOs/mm²)        ~17
Power density (W/cm²)         ~15         ~6.5
Bandwidth/area (Gb/s/mm²)     ~270        ~64

Table 4-3. Summary of package type B specifications
Package type B                                      Specifications
# interposer layers                                 6
Diff. stripline layers                              2
TX/RX core area                                     ~0.058 mm²
C4 pitch / line spacing / pad diameter (P/S/D)      170 µm / 20 µm / 30 µm
# of diff. IOs                                      1325
Total IO area (% of die area)                       22%
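As a rough cross-check of Table 4-2, the power density and bandwidth density can be recomputed from the energy per bit and the aggregate bandwidth. The sketch below assumes that the energy per bit is in pJ/bit (numerically equal to mW/Gb/s) and that both densities are normalized to the IO area, taken here as 22% of the 310 mm² die from Table 4-3. The choice of normalization area is an assumption made for this check and is not stated in the text; the function names are hypothetical.

```python
# Cross-check of Table 4-2. Assumption: densities are normalized to the
# IO area, i.e. 22% of the 310 mm^2 die listed in Table 4-3.
DIE_AREA_MM2 = 310.0
IO_AREA_FRACTION = 0.22


def link_power_w(energy_pj_per_bit, bandwidth_tbps):
    """Total link power in W, since (pJ/bit) * (Tb/s) = W."""
    return energy_pj_per_bit * bandwidth_tbps


def densities(energy_pj_per_bit, bandwidth_tbps):
    """Return (power in W, power density in W/cm^2, bandwidth in Gb/s/mm^2)."""
    io_area_mm2 = DIE_AREA_MM2 * IO_AREA_FRACTION
    power_w = link_power_w(energy_pj_per_bit, bandwidth_tbps)
    power_density_w_cm2 = power_w / (io_area_mm2 / 100.0)  # 100 mm^2 per cm^2
    bw_density_gbps_mm2 = bandwidth_tbps * 1000.0 / io_area_mm2
    return power_w, power_density_w_cm2, bw_density_gbps_mm2


for name, energy, bandwidth in [("1 dB/GHz", 0.56, 19.0), ("8 dB/GHz", 1.0, 4.5)]:
    p, pd, bd = densities(energy, bandwidth)
    print(f"{name}: {p:.1f} W, {pd:.1f} W/cm^2, {bd:.0f} Gb/s/mm^2")
```

Under these assumptions the recomputed densities land within about 5% of the tabulated values for both channel types.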

PAGE 97

CHAPTER 5
AN ACTIVE CROSSTALK EQUALIZER FOR PARALLEL HIGH SPEED LINKS

5.1 Motivation

In this chapter, we present an active high-speed far-end crosstalk (FEXT) equalizer. In Section 5.2, we review how crosstalk-induced jitter (CIJ) is generated. Section 5.3 presents the basic concept of the CIJ equalizer. The circuit details, including a phase interpolator, a feed-forward equalizer and a phase-locked loop (PLL), are described in Section 5.4. To verify the proposed CIJ equalization scheme and evaluate its performance, a chip prototype is implemented and housed in both BGA and QFN packages, which are measured separately in Section 5.5. The test results demonstrate reduced jitter and a lower bit error rate (BER) at high data rates.

5.2 Crosstalk Analysis

At high data rates, signals are distorted not only in amplitude but also in phase. As discussed in the previous chapters, amplitude distortion can be compensated by a linear equalizer implemented in the form of an FFE and a DFE. Phase distortion can be caused by CIJ between adjacent channels and by signal dispersion. Crosstalk worsens as the wire density increases, especially when signal-line shielding is unavailable. For example, the crosstalk

PAGE 98

between two coupled transmission lines on a board has been discussed in [61]. Figure 5-1 shows the lumped equivalent circuit of two coupled transmission lines. The voltage and current relations of each line can be described by (5.1), where L_m and C_m are the mutual inductance and capacitance per unit length, and L_s and C_s are the self-inductance and self-capacitance per unit length [62].

Figure 5-1. Lossless coupled transmission-line model.

(5.1)

Solutions to (5.1) are associated with an even or an odd mode. The even mode is excited when v_1(x,t) = v_2(x,t), and the odd mode is excited when v_1(x,t) = -v_2(x,t). For data communications, we are mainly concerned with three signal modes: the even mode, the odd mode, and the static mode, which is the sum of the even and odd modes. Data transitions occur independently on adjacent lines, so the mode changes randomly. For two parallel transmission lines, the amount of CIJ depends on the transition polarities of the aggressor and victim lines, which result in either odd- or even-mode propagation delays, expressed as [63]

PAGE 99

(5.2)

The static term is the amount of delay caused by static-mode propagation, and f is the amount of delay variation caused by even- or odd-mode propagation,

(5.3)

where Z_0 is the characteristic impedance of the transmission line and l is its length.

Figure 5-2. Crosstalk-induced jitter generated in the data eye by data transitions.

To display the effect of crosstalk-induced jitter, two adjacent striplines with a spacing equal to three times the line width are simulated in ADS and excited by two random data streams. Figure 5-2 shows the increase in data-eye jitter caused by the different data transitions of the even, odd and static modes. In a high-speed I/O link, the data transmission path is complex: from the transmitter chip through bonding wires (or flip-chip balls), packages and PCBs, and finally to the receiver chip. The close spacing between high-speed IO pins leads to crosstalk coupling between them. The data path can be divided into N sections, and the worst-case delay variation caused by even/odd-mode propagation for the link system can be expressed as

(5.4)
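Because the mode-dependent delays drive everything that follows, a small numerical sketch may help. It uses a common textbook convention for two symmetric lossless coupled lines, in which C_s is the per-unit-length capacitance to ground and C_m is the line-to-line capacitance. That convention, the function name and all element values are assumptions chosen for illustration, and they may differ from the definitions behind (5.1) to (5.4).

```python
import math


def mode_delays(l_s, l_m, c_s, c_m, length):
    """Even/odd-mode time of flight for two symmetric lossless coupled lines.

    Assumed convention: c_s is the capacitance to ground and c_m is the
    mutual (line-to-line) capacitance, all per unit length.
    """
    t_even = length * math.sqrt((l_s + l_m) * c_s)
    t_odd = length * math.sqrt((l_s - l_m) * (c_s + 2.0 * c_m))
    return t_even, t_odd


# Hypothetical per-metre values with capacitive coupling dominating.
L_S, L_M = 350e-9, 35e-9      # H/m
C_S, C_M = 100e-12, 20e-12    # F/m
LENGTH = 0.05                 # m

t_even, t_odd = mode_delays(L_S, L_M, C_S, C_M, LENGTH)
print(f"even: {t_even * 1e12:.1f} ps, odd: {t_odd * 1e12:.1f} ps, "
      f"spread: {(t_odd - t_even) * 1e12:.1f} ps")
```

With these illustrative values the even mode is the faster one, which is the ordering assumed in this chapter; under this convention the ordering reverses when inductive coupling dominates.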

PAGE 100

It is difficult to match the crosstalk coefficients under all routing conditions; an active crosstalk mitigation technique is therefore necessary to improve link performance.

5.3 Crosstalk Induced Jitter Equalizer

Signals in different propagation modes (even, odd) travel at different velocities, causing data edges to arrive at the receiver input at different times. Because the received analog signal is already attenuated and distorted by reflection and coupling after the channel, it is desirable to pre-compensate for the mismatch between propagation modes at the transmitter, before the data is launched, rather than at the receiver. As shown in Figure 5-3(A), a CIJ cancelling scheme was reported in [64] that adds variable delay cells to the data streams directly after the sampling stage, i.e. the multiplexer (MUX). This scheme requires the delay cells to work at the full data rate and to provide high linearity over a wide range. The active CIJ equalization scheme proposed in this chapter varies the sampling clock edge instead of the data streams, as shown in Figure 5-3(B) [65]. By advancing or delaying the sampling clock edge CK_MUX, the transmitted data streams are pre-distorted in phase according to the odd or even propagation mode, so that the received data arrive at the same time regardless of the signal transition mode. The sampling clock CK_MUX is generated by a CIJ clock generation block, shown in Figure 5-4, which is composed of a mode detect block and a phase rotator. The mode detect block detects the signal mode (even, odd or static) of the synchronized data D1 and D2 on the adjacent channels and generates the control signals for the phase interpolator, LEAD and LAG. The even mode has less propagation delay, so the LAG signal is enabled to delay the sampling clock.
The odd mode has more propagation delay, so the LEAD signal is enabled to advance the sampling clock. In the static mode, both LEAD and LAG are disabled. The phase rotator

PAGE 101

is used to rotate the synchronization clock using the in-phase and quadrature-phase clocks generated by the PLL. The polarity with which the clock is advanced or delayed is adjustable through digital control bits.

Figure 5-3. Methods of CIJ equalization. A) Conventional. B) Proposed.

Figure 5-4. Block diagram of the CIJ clock generation.

PAGE 102

Figure 5-5. Block diagram of the mode detect block.

Table 5-1. Truth table of the mode detect block
D1      D2      Mode      LEAD   LAG
0→1     0→1     EVEN      0      1
1→0     1→0     EVEN      0      1
0→1     1→0     ODD       1      0
1→0     0→1     ODD       1      0
else            STATIC    0      0

According to Table 5-1, the mode detect block can be implemented as shown in Figure 5-5, using XOR and AND gates to detect transitions. All building blocks are easily implemented with CMOS logic gates, but these operate only at low speed [64]. To detect signal transitions at high speed, a parallel tree architecture can be employed that incorporates two low-speed mode detect blocks and multiplexers, as shown in Figure 5-6. At half the data rate, each channel has two data inputs, denoted A and B, and B_d is a one-bit-delayed version of B. With inputs A and B from channel 1 and channel 2, the first mode detect block generates the LEAD1/LAG1 signals for transitions from A to B. With inputs B_d and A, the second mode detect block generates the LEAD2/LAG2 signals for transitions from B to A. The outputs of the two mode detect blocks are then multiplexed into the full-rate LEAD and LAG signals, respectively.
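The truth table in Table 5-1 can be expressed with XOR and AND operations only. The following behavioral sketch is one possible realization and is not claimed to match the gate arrangement of Figure 5-5; the function names and the example bit sequences are hypothetical.

```python
def mode_detect(d1_prev, d1_cur, d2_prev, d2_cur):
    """Return (mode, LEAD, LAG) for one bit period, following Table 5-1."""
    t1 = d1_prev ^ d1_cur          # XOR flags a transition on line 1
    t2 = d2_prev ^ d2_cur          # XOR flags a transition on line 2
    both = t1 & t2                 # AND: both lines switch in this period
    opposite = d1_cur ^ d2_cur     # final values differ, so edges are opposite
    lead = both & opposite         # odd mode: advance the sampling clock
    lag = both & (opposite ^ 1)    # even mode: delay the sampling clock
    mode = "ODD" if lead else "EVEN" if lag else "STATIC"
    return mode, lead, lag


def mode_stream(d1, d2):
    """Apply the detector to two synchronized bit sequences."""
    return [mode_detect(d1[i - 1], d1[i], d2[i - 1], d2[i])
            for i in range(1, len(d1))]


# Hypothetical data on two adjacent channels.
d1 = [0, 1, 0, 1]
d2 = [0, 1, 1, 0]
for result in mode_stream(d1, d2):
    print(result)
```

For these sequences the three bit periods are classified as even, static and odd in turn, so LAG, neither signal, and LEAD are asserted respectively.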

PAGE 103

Figure 5-6. High-speed CIJ clock generation circuit with the tree architecture. A) Circuit block diagram. B) Timing diagram.

5.4 Circuit Implementation

In this section, we describe a high-speed design of four parallel transmitters with the proposed CIJ equalizers. First, the system architecture is described. Next, the detailed circuit building blocks are introduced, including the FFE/CIJ equalizer, the phase interpolator and the PLL.

PAGE 104

5.4.1 System Architecture

The design is composed of four identical parallel transmitters; the block diagram is shown in Figure 5-7. Each TX has an independent data generator, an FFE and a CIJ equalizer. There are two phase-locked loops on the chip, one for every two TX cores. The data source generates an 8-bit-wide parallel data stream for two sets of 4:1 multiplexers (MUXs), each implemented as a tree of three 2:1 MUXs. These data streams are further multiplexed by the final 2:1 MUX, which is followed by the FFE. The CIJ clock generation block determines, from the half-rate data streams of the transmitter and its neighboring line, when even- or odd-mode data transitions occur. The resulting LEAD/LAG signals are fed to the phase interpolators (PIs) to generate the appropriate sampling clock for the final 2:1 MUX and the FFE filter taps. The in-phase and quadrature (I/Q) clocks used in the PI are provided by the shared PLL core. The final 2:1 MUX, FFE and PI use high-speed current-mode logic (CML) gates, whereas all other circuits are implemented in full-swing CMOS logic from the standard cell libraries.

Figure 5-7. Block diagram of the 4-channel transmitter.

PAGE 105

5.4.2 FFE and CIJ Equalizer

To overcome the high-frequency loss of the signal path, a half-rate rather than a full-rate FFE is used in each transmitter because it consumes less power [66]. As illustrated in Figure 5-8, the FFE receives two data streams at half the bit rate. For proper operation of the 2:1 MUX, the data streams are shifted by one unit interval (UI) using a data retiming latch. The two half-rate streams are then interleaved and multiplexed into a full-rate signal by the first FFE filter tap selector. Additional shifting and interleaving produce the delayed data streams for the remaining three taps of the FFE. All the retiming latches are clocked by CK, which is generated directly by the on-chip PLL. All MUXs are retimed by CK_MUX, generated by the PI, which is delayed or advanced according to the LEAD/LAG signals from the CIJ clock generation logic. The tap weights for the post tap k_1, the main tap k_0 and the two pre taps k_-1 and k_-2 are digitally programmable using 4-bit digital-to-analog converter (DAC) current sources with the same topology shown in Figure 3-27.

Figure 5-8. Transmitter front end with the CIJ equalization and FFE.
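Behaviorally, the four-tap FFE described above is a finite impulse response filter applied to the transmitted symbols. The sketch below is a minimal discrete-time model, not the half-rate circuit of Figure 5-8; the tap values, the symbol sequence and the function name are made up for illustration, and the 4-bit DAC quantization is not modeled.

```python
def ffe(symbols, taps):
    """4-tap FIR filter: y[n] = sum_j taps[j] * x[n - j], with x = 0 before start.

    taps is ordered [k_-2, k_-1, k_0, k_1], so the main cursor is delayed
    by two unit intervals relative to the input sequence.
    """
    out = []
    for n in range(len(symbols)):
        acc = 0.0
        for j, k in enumerate(taps):
            if n - j >= 0:
                acc += k * symbols[n - j]
        out.append(acc)
    return out


# Hypothetical tap weights: small negative pre and post taps, large main tap.
TAPS = [-0.05, -0.10, 0.70, -0.15]

# NRZ symbols (+1/-1); values are arbitrary.
data = [1, 1, -1, 1, -1, -1, -1, 1]
for level in ffe(data, TAPS):
    print(f"{level:+.2f}")
```

Feeding a single impulse through the filter returns the tap weights themselves, which is a convenient sanity check when changing the tap ordering.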

PAGE 106

Figure 5-9. Timing diagram of the transmitter data path and mode detect path.

Figure 5-10. Schematic of the retiming mode detect block.

The timing diagram of the FFE and CIJ equalizer is shown in Figure 5-9. For synchronization, flip-flops are used for retiming in the mode detect and data signal paths. The

PAGE 107

mode detect block with retiming flip-flops is shown in Figure 5-10. The clock-to-Q delay of the flip-flops and digital gates is compensated by adding extra delay circuits. The edge of CK_MUX in the static mode is denoted CK_MUX_static; it is generated by the PI without any even/odd-mode adjustment to achieve the maximum sampling margin for both data paths A and B. In the proposed CIJ clock generation circuit, two mode detect blocks generate separate mode signals for transitions from B to A and from A to B, which are aligned by the retiming latches. The two mode signals are then multiplexed by CK_MUX_static to generate the control signal that makes the phase interpolator advance or delay the corresponding edges of CK_MUX.

5.4.3 Phase Interpolator

The phase interpolator (PI) consists of current-mode summing logic with a pseudo-pMOS load, as shown in Figure 5-11. It blends the phases of the I/Q clocks from the PLL to generate the retiming signal (CK_MUX) for the final 2:1 MUX and the FFE filter taps. A total of 16 weighting bits (C0 to C7 and C24 to C31) are used to guarantee sufficient clock sampling margin, accounting for the finite clock-to-Q delays of the data sampling latches under worst-case conditions. An additional 16 bits (C8 to C23) are used by the crosstalk equalizer to adjust the clock sampling edge based on the even/odd-mode data transitions of the adjacent lines. The C8 to C23 bit weights are programmable to match the amount of crosstalk-induced jitter. The minimum delay adjustment step is approximately 1/32 of the clock period. The polarity of the CIJ equalizer is set by the LEAD/LAG signals from the CIJ clock generation circuit. Without the CIJ equalizer, the PI rotates the I/Q clock signals to sample the multiplexed data streams with the best phase margin. A PI usually introduces nonlinearity in its interpolation steps [67]. However, the CIJ equalizer only requires a certain advance/delay value to compensate

PAGE 108

for the even/odd-mode delay variation of a given channel, so the PI linearity is not critical for the CIJ equalizer.

Figure 5-11. Schematic of the phase interpolator.

5.4.4 Phase Locked Loop

The phase-locked loop (PLL) implemented in this design is a conventional charge-pump PLL with several design features, as shown in Figure 5-12. A three-state linear phase/frequency detector (PFD) compares a reference clock with the voltage-controlled oscillator (VCO) clock divided by 16. A conventional single-ended charge pump (CP) with a third-order passive loop filter (LF) generates the control voltage (V_CTRL) for the VCO. A 4-stage dual-loop ring-oscillator VCO produces 8 high-speed I/Q clock signals as 4-phase differential clocks. To increase the tuning range of the VCO, a voltage doubler and a level shifter clocked by the input reference provide a higher supply voltage, HVDD, for the charge pump. This incurs very little power overhead because the total current of the CP and level shifters is

PAGE 109

very small. In addition, since the voltage doubler is clocked by the input reference, its switching noise is attenuated by the LF. Each building block of the PLL is discussed in detail below.

Figure 5-12. PLL block diagram.

5.4.4.1 VCO

The maximum oscillation frequency of a single-loop ring oscillator is determined by the minimum delay through the feedback path, which is the product of the number of stages and the delay of each stage. Most practical ring oscillators have at least 3 stages to sustain stable oscillation. Even though it is possible to construct a two-stage VCO, the two-stage structure can be difficult to stabilize [68]. For a given number of stages, the maximum frequency depends on the minimum delay of a single stage, which is set by the technology node. The single-loop VCO is therefore limited in high-speed applications, and other architectures must be explored to increase the maximum frequency of ring oscillators. In this design, a 4-stage ring oscillator with a fully differential multi-pass loop [69] is implemented for high-frequency operation and quadrature outputs. The VCO topology is shown in Figure 5-13. To reduce the delay of each stage, a set of secondary inputs, vin2+ and vin2-, is added and switched earlier than the primary inputs of

PAGE 110

vin1+ and vin1- during operation [70]. Note that the auxiliary loops should not be stronger than the main loop, to avoid undesired oscillation modes. The basic delay cell shown in Figure 5-13 provides rail-to-rail output signals and full switching of the transistors in the stage. The NMOS transistors M1 and M2 form the input pair for the primary loop, while the PMOS transistors M5 and M6 serve as the input pair for the auxiliary loop. The oscillation frequency is controlled by tuning the gate voltage of M3 and M4 to adjust the strength of the negative cross-coupled PMOS transistors M7 and M8. Phase noise and jitter are important considerations in VCO design. The implemented saturated gain stage periodically switches the gain transistors in and out of conduction, which reduces the noise [71].

Figure 5-13. 4-stage ring oscillator with the multiple-pass loop.

The 1/16 frequency divider is formed by four cascaded 1/2 divider circuits. Current-mode logic (CML) circuits are used to reduce the switching noise. Figure 5-14 shows a block diagram of the 1/2 frequency divider with the positive feedback of two D flip-flops. Each D flip-flop is composed of two identical latches.

PAGE 111

Figure 5-14. Block diagram of the 1/2 frequency divider and schematic of the CML latch.

5.4.4.2 PFD

Figure 5-15 shows the schematic of a conventional PFD. The PFD uses two flip-flops to produce three states (pull-up, pull-down and high impedance) [72]. It compares the output signal of the frequency divider, f_div, with an external reference source, f_ref. The operation of the PFD implies that f_ref and f_div have the same frequency and phase when the PLL is locked.

Figure 5-15. Conventional PFD topology.

If the output voltages of the PFD are not sufficient to turn on the switches in the charge pump, the charge pump gain, and thus the PLL loop gain, becomes zero. This nonlinearity in the PFD characteristic is called a dead zone, in which the PLL cannot properly respond to the

PAGE 112

small phase error. A delay block composed of inverters gives a fixed minimum width to the PFD output pulses, so that the dead zone for phase errors close to zero is avoided and the VCO noise leakage due to the dead zone is reduced. To further suppress the dead zone, a complementary pass gate is inserted to compensate for the delay difference between the UP and DN paths.

5.4.4.3 Voltage Doubler and Level Shifter

To increase the oscillator tuning range, a voltage doubler is incorporated in the PLL. To reduce the complexity of the voltage doubler as well as the noise in the PLL, the number of circuit blocks driven by the voltage doubler should be minimized. Therefore, only the charge pump and the interface circuit that converts the PFD outputs to the charge pump inputs are powered by the voltage doubler output (VDDH). The maximum current supplied by the voltage doubler need only be a few hundred µA.

Figure 5-16. Schematic of the voltage doubler.

A circuit schematic of the voltage doubler is shown in Figure 5-16 [73]. The reference clock (f_ref) of the PLL is used as the input clock to drive the charge pump of the voltage doubler formed by M1 and M2. The use of cross-connected NMOS transistors M1 and M2 is efficient, not only because of their higher carrier speed, but particularly because it provides

PAGE 113

automatic reverse bias of the junctions [74]. In 65 nm CMOS technology, VDD is 1.2 V. The voltages at the sources of M1 and M2 swing between 1.2 V and 2.4 V and are 180° out of phase. The series PMOS switches M3 and M4 are turned on only when their sources and gates are at 2.4 V and 1.2 V, respectively. Transistors M5 and M6 prevent forward biasing of the drain-to-body junctions of M3 to M6. The capacitor C_B (100 fF) is necessary to preserve the bulk potential during switching. An external 100 pF capacitor C_4 filters the output to reduce the voltage ripple. Because of the loss in the switches and the loading of the CP circuit, the output DC voltage is ~1.8 V.

A level shifter converts the UP/DN signals generated by the PFD into the input signals UPh/DNh of the CP [73]. The circuit schematic is shown in Figure 5-17. The two input inverters use the normal 1.2 V supply, while the rest of the circuit uses VDDH as the supply. The differential outputs of the inverters drive the source-coupled NMOS pair M1 and M2 and are amplified by M3 and M4. Since M3 and M4 form a cross-coupled pair with positive feedback, one of the output voltages is pulled up to VDDH and the other is held at 0 V. M5 and M6 form a buffer that drives the switch transistors in the CP.

Figure 5-17. Schematic of the level shifter circuit.

PAGE 114

5.4.4.4 Charge Pump

The charge pump (CP) is controlled by the UPh and DNh signals generated by the PFD after the level shifter. When UPh is high, the CP deposits charge on the capacitors of the loop filter (LF) to increase the control voltage of the VCO. Similarly, when DNh is high, it withdraws charge from the LF. The schematic of the CP is shown in Figure 5-18. In contrast to the conventional CP, the UPh/DNh signals are applied to the gates of M1 and M2 instead of M3 and M4 to avoid large glitches at the CP output [75]. The current glitches now occur at the sources of M3 and M4 due to the switching of M1 and M2, and are attenuated at the output node. M5 and M7 are dummy transistors used to match the voltage drop across the switches M1 and M2 so that the reference current is accurately mirrored to M3 and M4. The reference current can be tuned by varying the voltage CP_ctrl.

Figure 5-18. Schematic of the charge pump and loop filter.

The loop filter (LF) used in the PLL is a third-order passive filter. The design parameters are C_p = 2 pF, R_z = 2.6 kΩ, C_z = 32 pF, R_4 = 256 Ω and C_4 = 3.2 pF. Together with the CP current of 100 µA, these set the loop bandwidth to ~10 MHz and the phase margin to ~62°.
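The quoted phase margin can be sanity-checked from the loop filter values alone. The sketch below assumes the usual passive topology (R_z in series with C_z, shunted by C_p, followed by an R_4-C_4 section whose loading on the first section is neglected) and takes the ~10 MHz loop bandwidth as the crossover frequency. The VCO gain is not given in this section, so the crossover frequency is an input to the estimate rather than a computed result; the function names are hypothetical.

```python
import math

# Loop-filter component values quoted in the text.
C_P, R_Z, C_Z = 2e-12, 2.6e3, 32e-12
R_4, C_4 = 256.0, 3.2e-12


def corner_frequencies():
    """Approximate zero and pole frequencies (Hz) of the passive loop filter.

    Assumes R_Z in series with C_Z, shunted by C_P, followed by an R_4-C_4
    section whose loading on the first section is neglected.
    """
    f_zero = 1.0 / (2 * math.pi * R_Z * C_Z)
    c_series = C_P * C_Z / (C_P + C_Z)
    f_pole2 = 1.0 / (2 * math.pi * R_Z * c_series)
    f_pole3 = 1.0 / (2 * math.pi * R_4 * C_4)
    return f_zero, f_pole2, f_pole3


def phase_margin_deg(f_cross):
    """Phase margin of a charge-pump (type-II) loop at an assumed crossover."""
    f_zero, f_pole2, f_pole3 = corner_frequencies()
    return math.degrees(math.atan(f_cross / f_zero)
                        - math.atan(f_cross / f_pole2)
                        - math.atan(f_cross / f_pole3))


f_z, f_p2, f_p3 = corner_frequencies()
print(f"zero {f_z / 1e6:.2f} MHz, poles {f_p2 / 1e6:.1f} MHz and {f_p3 / 1e6:.0f} MHz")
print(f"phase margin at 10 MHz: {phase_margin_deg(10e6):.1f} deg")
```

With a 10 MHz crossover this simplified estimate gives roughly 59°, in reasonable agreement with the quoted ~62° given the approximations involved.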

PAGE 115

5.5 Chip Fabrication

Figure 5-19. Die photo of the 4-channel transmitter (500 µm × 530 µm).

5.6 Experimental Results

5.6.1 Channel Characterization

First, the transmitter chip is assembled into a 36 mm × 36 mm, 344-pin BGA package from TI [76]. The package is a 4-layer plastic ball grid array with one signal layer, one power plane and two ground planes, as shown in Figure 5-20. The 4-layer test board provides a platform for mounting the package and test connectors, and facilitates measurement and connection to the high-speed logic, clock and power supplies. The package is mounted on the bottom side of the PCB. Inside the package, the bond wires run in parallel for approximately 4 mm, followed by about 12 mm of closely coupled routing to the solder balls. The output signals are probed directly on the PCB transmission lines using RF probes about 32 mm away from the package. The board is connected to the testing

PAGE 116

instrument through a 6 ft long, 3 GHz coaxial RG cable. Figure 5-21 shows the test board with the BGA package and the chip die photo.

Figure 5-20. BGA package with 4 planes (signal, ground, power, ground) and the typical wire bonding configuration.

Figure 5-21. Picture of the test board with the BGA package and the chip die photo.

To investigate the effect of the BGA package on signal integrity, the package structure was modeled from the AutoCAD drawing using a 3-D full-wave EM solver, as shown in Figure 5-22.

PAGE 117

The signal trace from the package is connected to the PCB test board through a solder ball and a through-hole via. BT resin with a dielectric constant of 4.7 is used as the substrate material, and the die is attached to the package fingers with gold wire bonds. The 4 mm bonding wires and the coaxial RG cables were modeled in Agilent ADS. The channel characteristics were then obtained by cascading the S-parameters of the bonding wires, package, PCB and test cables in ADS. Figure 5-23 shows the simulated channel characteristics, including insertion loss, return loss and crosstalk. The channel, composed of the BGA package die-board interface, bonding wires, PCB and test cables, forms a complex interconnect system that is susceptible to crosstalk noise and frequency-dependent attenuation.

Figure 5-22. Method of the channel characterization.

Figure 5-23. Channel characteristics of insertion loss and crosstalk.

PAGE 118

5.6.2 Eye Diagram Measurement

A PC-controlled pattern generator (Agilent E4832A) is used to load the control bits through an on-chip serial interface. It also provides the reference clock to the PLL. The single-ended eye diagrams of the TX output measured on an oscilloscope (Agilent 86100B) without equalization, with FFE only, and with both FFE and CIJ equalization are shown in Figure 5-24 for operation up to 8 Gb/s/channel. The jitter and eye-opening measurements are summarized in Table 5-2. With both FFE and CIJ equalization enabled, the eye opening increases from 30 mV to 50 mV, the RMS jitter is reduced from 14.1 ps to 6.6 ps, and the peak-to-peak jitter from 96 ps to 33 ps; both jitter improvements exceed 50%.

Table 5-2. Jitter reduction using FFE and CIJ equalization
Equalization    RMS jitter (ps)    Peak-to-peak jitter (ps)    Vertical eye opening (mV)
None            14.1               96                          ~30
FFE             12.0               77                          ~50
FFE and CIJ     6.6                33.3                        ~50

Figure 5-25 shows the measured single-ended clock waveform from the PLL test chip. The RMS jitter is about 7.3 ps, which is worse than that obtained from the equalized TX output eye pattern. This is due to the extra jitter added by the PLL output buffers, BGA package, PCB transmission line and SMA connectors. The measured nominal power consumption of the 4-channel, 8 Gb/s/channel transmitter is 225 mW. The energy per bit is about 7 pJ/bit, or 7 mW/Gb/s.
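The headline figures in this section follow from simple arithmetic on the measured values, reproduced in the sketch below. The helper names are hypothetical; the inputs are taken from the text and Table 5-2.

```python
def energy_per_bit_pj(power_mw, channels, rate_gbps):
    """Energy per bit in pJ/bit; mW/(Gb/s) is numerically equal to pJ/bit."""
    return power_mw / (channels * rate_gbps)


def reduction_pct(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# Measured values quoted in the text and in Table 5-2.
print(f"energy/bit: {energy_per_bit_pj(225.0, 4, 8.0):.2f} pJ/bit")
print(f"RMS jitter reduction: {reduction_pct(14.1, 6.6):.1f} %")
print(f"peak-to-peak jitter reduction: {reduction_pct(96.0, 33.3):.1f} %")
```

The result is about 7.03 pJ/bit, consistent with the quoted ~7 pJ/bit, and the RMS and peak-to-peak jitter reductions come out near 53% and 65%.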

PAGE 119


PAGE 120


PAGE 121

CHAPTER 6
LOW POWER HIGH SPEED EQUALIZED CHIP TO CHIP LINK

6.1 Motivation

To meet the large bandwidth requirements of communication systems such as microprocessors, servers and routers, high-speed and low-power chip-to-chip links are desirable. In particular, reducing the power consumption of wireline transceivers is critical when hundreds of parallel I/Os are integrated on one chip to achieve a large aggregate bandwidth. Recent examples of highly efficient chip-to-chip interfaces in [2-3] demonstrated power efficiencies as low as ~1-2 pJ/bit in CMOS nodes of 65 nm and below. For the two transceivers published by Fukuda [77] and Palermo [78], each of which contains a multiplexer/demultiplexer, output driver, input amplifier, and clock generation and distribution circuits, the power breakdown is listed in Table 6-1. As predicted by the link analysis, the transmitter output driver and receiver front end are the dominant sources of power dissipation, consuming up to ~70% of the total power in [77] and ~40% in [78].

Table 6-1. Power breakdown of the transceivers from [77] and [78]
Transceiver                   Fukuda          Palermo
Technology                    65 nm           90 nm
Supply voltage (V)            1.0             1.0
Data rate (Gb/s)              12.5            16
TX MUX (mW)                   0.58 (4.8%)     23 (18%)
TX driver (mW)                4.85 (40.6%)    24 (19%)
TX subtotal (mW)              5.43 (45.4%)    47 (37%)
RX frontend (mW)              2.16 (18%)      23 (18%)
RX slicer + DeMUX (mW)        1.61 (13.5%)
RX CDR (mW)                   1.96 (16.4%)    35 (27%)
RX subtotal (mW)              5.73 (47.9%)    58 (45%)
TX PLL + clock driver (mW)    0.80 (6.7%)     23 (18%)
Energy per bit (mW/(Gb/s))    0.98            8.1

To improve the power efficiency of the transceiver, a current-sharing technique between the transmitter output driver and the receiver amplifier is proposed in this chapter. We first describe the low-power design considerations for the transmitter/receiver architecture and interface,

PAGE 122

including reflection and termination strategies, in Section 6.2. Section 6.3 presents the circuit details of the proposed I/O link, including the front-end interface, transmitter and receiver. To verify the proposed I/O scheme and evaluate its performance, a transmitter chip and a receiver chip prototype are implemented, and the link performance is measured over a conventional FR-4 channel and an air-gap channel on a standard low-cost FR-4 board in Section 6.4.

6.2 Low Power Link Consideration

The transmitter output driver contributes a large portion of the power because it must generate a large voltage swing across the termination resistor, which is usually a low 50 Ω to match the characteristic impedance of the channel. Nevertheless, termination at the transceiver interface is essential for reliable data transmission. The electrical channel used for high-speed signaling is normally modeled as a transmission line with characteristic impedance Z_0. When a signal travels along a transmission line as a travelling wave, portions of the wave are reflected at impedance discontinuities. A proper termination strategy is therefore required at the link interconnect to suppress reflections effectively. Figure 6-1 illustrates the cases in which the channel is and is not properly terminated. When the channel is not terminated, as in (A), the signal that arrives at the receiving end bounces back in the opposite direction. With no termination at the transmitter side either, the signal keeps bouncing back and forth until it dissipates in the channel. In this case, for a correct data decision, the transmitter may have to wait until the reflections are very small before sending the next bit, which reduces the data rate significantly.
When the channel is perfectly terminated at the RX side, as in (B), the signal sent from the TX side is fully absorbed, so consecutive bits can be sent through the channel with no delay imposed by reflections. Termination at the transmitter is also necessary, since deviations from ideal matching at the receiver can

PAGE 123

cause some reflections that should be absorbed at the transmitter side; however, termination at the transmitter side decreases the signal amplitude, as discussed in a later section.

Figure 6-1. Electrical link. A) No termination. B) RX termination. C) TX/RX termination.

At the transmitter side, the output driver can be a voltage-mode or a current-mode driver. As shown in Figure 6-2, a low-impedance voltage-mode driver typically employs series termination, and a high-impedance current-mode driver typically employs parallel termination at the transmitter side. The voltage-mode driver can be implemented with a conventional CMOS inverter without static power consumption [79]. However, the low impedance of the voltage-mode driver can cause power supply noise on the transmitter chip to appear unattenuated on the transmitted signal. The voltage-mode driver is thus widely used in low-power applications because it is more power efficient, but it suffers from worse supply rejection. The high-impedance current-mode driver is usually implemented in fully differential form, as shown in Figure 6-3, (A) with and (B) without termination at the transmitter side.
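The termination cases of Figure 6-1 can be quantified with the standard voltage reflection coefficient, Γ = (Z_L - Z_0)/(Z_L + Z_0). The sketch below assumes a lossless line and purely resistive terminations; the function names, the 5% settling tolerance and the 60 Ω example are hypothetical choices for illustration.

```python
def reflection_coefficient(z_load, z0=50.0):
    """Voltage reflection coefficient at a resistive termination z_load."""
    if z_load == float("inf"):
        return 1.0  # open circuit reflects the full wave
    return (z_load - z0) / (z_load + z0)


def round_trips_to_settle(gamma_rx, gamma_tx, tolerance=0.05):
    """Round trips until the residual reflection drops below `tolerance`.

    Lossless line assumed, so only the terminations absorb energy.
    Returns None when the reflections never decay.
    """
    per_trip = abs(gamma_rx * gamma_tx)
    if per_trip == 0.0:
        return 0
    if per_trip >= 1.0:
        return None
    trips, residual = 0, 1.0
    while residual > tolerance:
        residual *= per_trip
        trips += 1
    return trips


open_end = reflection_coefficient(float("inf"))
matched = reflection_coefficient(50.0)
mismatched_rx = reflection_coefficient(60.0)   # hypothetical 20% RX mismatch

print(round_trips_to_settle(open_end, open_end))        # case A: no termination
print(round_trips_to_settle(matched, open_end))         # case B: ideal RX termination
print(round_trips_to_settle(mismatched_rx, open_end))   # RX mismatch, open-drain TX
print(round_trips_to_settle(mismatched_rx, matched))    # case C: TX absorbs the residue
```

In this idealized model the unterminated link never settles, a perfectly matched receiver needs no settling time, and a mismatched receiver needs extra round trips unless the transmitter is also terminated.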

PAGE 124

Figure 6-2. Block diagrams of drivers with termination. A) Voltage-mode driver with series termination. B) Current-mode driver with parallel termination.

Figure 6-3. Differential high-impedance current-mode driver A) with and B) without termination (open-drain driver) at the transmitter side.
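For the two current-mode configurations in Figure 6-3, the drive current needed for a target swing follows from Ohm's law. The sketch below uses a simplified single-ended view and assumes that the receiver end is terminated in Z_0, so a terminated transmitter sees Z_0 in parallel with Z_0 while an open-drain transmitter sees Z_0 alone. The function name and these modeling choices are assumptions for illustration.

```python
def driver_current_ma(swing_mv, z0_ohm=50.0, tx_terminated=True):
    """Current (mA) needed to develop `swing_mv` across the driver load.

    Assumes the far end is terminated in z0. With TX termination the driver
    sees z0 || z0; as an open-drain driver it sees z0 only.
    """
    load_ohm = z0_ohm / 2.0 if tx_terminated else z0_ohm
    return swing_mv / load_ohm  # mV / ohm = mA


for terminated in (True, False):
    label = "TX terminated" if terminated else "open drain"
    print(f"{label}: {driver_current_ma(250.0, tx_terminated=terminated):.0f} mA")
```

For a 250 mV swing on a 50 Ω channel, this gives 10 mA with transmitter termination and 5 mA for the open-drain driver, which is the power versus matching tradeoff at the I/O interface.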

PAGE 125

As discussed earlier, if the input impedance of the receiver is not perfectly matched to the channel, termination at the transmitter side is necessary to absorb the reflection. However, placing the termination at the driver output consumes more power than the open-drain structure without termination. For example, to achieve an output voltage swing of 250 mV, the minimum required current is 10 mA with a 50 Ω termination matched to the characteristic impedance of the channel, whereas the open-drain structure needs only 5 mA. There is therefore a design tradeoff between power dissipation and impedance matching at the I/O interface.

Figure 6-4. I/O links with A) DC coupling and B) AC coupling.

After the channel, the signal from the transmitter may be AC or DC coupled to the receiver, as shown in Figure 6-4. For DC coupling, the transmitter output lines are directly connected to the receiver input lines, so any DC voltage on the transmitter output appears at the receiver input. The common-mode voltage of a DC-coupled receiver therefore varies with the common-mode voltage of the transmitter. For an AC-coupled link, the transmitter output lines are connected to the receiver input lines through series capacitors, which serve as DC blocks. AC coupling thus allows an independent receiver common-mode level, but

PAGE 126

transmission data must be coded, since the combined response of the channel and the series capacitors has a low-frequency cutoff. In chip-to-chip links, DC coupling is typically used to avoid capacitors that occupy a large chip area; thus, the common-mode output voltage of the TX needs to be designed properly to bias the receiver input stage.

6.3 Link Circuits

In this section, we describe the circuit building blocks of the proposed chip-to-chip link system, including the transceiver frontend, transmitter, and receiver. Current-mode logic is used to implement all circuit building blocks to suppress supply noise.

6.3.1 Link Architecture

Figure 6-5 shows the block-level schematic of the proposed chip-to-chip link. On the left is the transmitter with an FFE. On the right is the receiver with a look-ahead DFE. A key goal of chip-to-chip links is achieving low power, high data rate, and good signal integrity. With increasing data rates, signal integrity is degraded not only by ISI but also by reflections caused by unmatched impedance at the interface. Equalizers are widely used to compensate ISI in high-speed I/O links, as discussed in the previous chapter.

Figure 6-5. System architecture of the chip-to-chip link.
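
As a quick check of the termination tradeoff quoted earlier in this chapter (250 mV swing, 10 mA versus 5 mA), the two driver currents follow from Ohm's law. The sketch below assumes a 50 Ω receiver termination in both cases and an additional 50 Ω transmitter termination in the terminated case; the symbols V_sw, R_load, and I_drv are introduced here for illustration only.

```latex
\begin{aligned}
\text{TX and RX terminated:}\quad
  & R_{\text{load}} = 50\,\Omega \parallel 50\,\Omega = 25\,\Omega,
  & I_{\text{drv}} &= \frac{V_{\text{sw}}}{R_{\text{load}}}
                    = \frac{250\,\text{mV}}{25\,\Omega} = 10\,\text{mA} \\
\text{Open drain (RX termination only):}\quad
  & R_{\text{load}} = 50\,\Omega,
  & I_{\text{drv}} &= \frac{250\,\text{mV}}{50\,\Omega} = 5\,\text{mA}
\end{aligned}
```

Under these assumptions the transmitter-side termination doubles the driver current for the same swing, which is the power cost weighed against better reflection absorption.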

6.3.2 Front-End Circuit

As discussed in the previous chapter, a large portion of the power dissipation in high-speed I/O links comes from the transmitter output and receiver input interfaces. To generate sufficient voltage swing at the output driver across the termination resistor (R = Z_0), a large bias current is required for the driver. At the input of the receiver, an amplifier is placed to amplify the attenuated signal from the transmitter for correct operation of the slicer. The amplifier is required to work at the full data rate, so it contributes a significant amount of power. Therefore, the total power dissipation can be decreased significantly if the current of the two most power-hungry blocks, the driver and the amplifier, can be shared.

Figure 6-6 shows the basic concept of the current-sharing front-end circuit: an open-drain high-impedance current driver is DC coupled to the receiver via a differential common-gate (CG) transimpedance amplifier. Instead of terminating the transmitter to the power supply, the current I_T of the open-drain driver is entirely supplied by the DC-coupled receiver through the CG input stage, thus reusing the receiver current and reducing the total link power consumption. The CG stage at the receiver provides low-to-high impedance transformation and boosts the input voltage for the following slicer stage. Meanwhile, an offset cancellation block using the MO1 and MO2 branches is placed at the input of the receiver to compensate the current mismatch at the differential input caused by the inherent CG amplifier offset, the output driver, and physical channel mismatch. The offset control block provides both polarity and amplitude control using an IDAC, as discussed in Chapter 3. The input impedance of the CG transimpedance amplifier can be approximated as

(6.1)

Figure 6-6. Schematic of the current-sharing front-end circuitry: open-drain driver, common-gate amplifier, impedance matching network, and offset control circuit.

where g_m3 and g_m4 are the transconductances of M3 and M4, approximately equal at equilibrium and dependent on the ratio of drain current to overdrive voltage, i.e., g_m = 2I_D/(V_GS - V_th) when ignoring the body effect. It would be ideal to match the characteristic impedance of the transmission line Z_0 over a wide range of the bias current I_T. In the CG transimpedance amplifier design of [80], the impedance match at the receiver was achieved through a feedback circuit that adjusts the DC value of the transistor bias (V_GS) in proportion to the input current so that g_m is kept constant. However, extra bias current was needed for the transimpedance amplifier and the feedback loop, which causes extra power dissipation. In our design, an 8-bit digitally controlled resistor network R_match is connected across the receiver inputs, which, together with the front-end M3 and M4,
forms the termination. To compensate for the dependence of the g_m of M3 and M4 on I_T, R_match is chosen such that the differential input impedance Z_D stays matched to the channel for I_T from 0.5 mA to 5 mA, effectively terminating the channel and preventing reflections. The switchable resistor network can also compensate the effect of the load impedance R_L on the input impedance, which arises from the finite intrinsic gain of transistors M3 and M4. Thus, the input impedance can be calculated as

(6.2)

Figure 6-7. Schematic of the link frontend: open-drain driver and transimpedance amplifier when switching M1 ON and M2 OFF.

To make the differential output swing of the driver as large as possible, the input voltage of the open-drain driver must be large enough to switch one transistor on and the other off. For example, as shown in Figure 6-7, transistor M1 is switched on to steer all of the tail current I_T, and M2 is turned off. In order to provide low-to-high impedance
transformation, the CG transistors M3 and M4 are required to work in the saturation region, and the drain currents of M3 and M4 can be calculated as

(6.3)

In addition, the sum of the currents I_D3 and I_D4 is the bias current of the driver,

I_D3 + I_D4 = I_T (6.4)

Combining (6.3) and (6.4), the input voltage swing of the transimpedance amplifier can be derived as

(6.5)

The input voltage swing is also the voltage across the switchable resistor network with resistance R_match,

(6.6)

Therefore, the numeric solution for the input voltage swing, the transconductances g_m3 and g_m4, and the required R_match can be calculated from (6.7).

(6.7)

The output voltage swing is the voltage difference across the load resistors R_L,
which can be derived as in (6.8).

(6.8)

Without the matching network, the output voltage swing of the CG amplifier can be written as I_T R_L, since there is no current through M4. When a parallel R_match is added at the receiver input for matching, the output voltage swing decreases because part of the current flows through R_match.

Figure 6-8. Simulated voltage swing and impedance vs. driver current I_T.

Beyond a peak output voltage of ~550 mV at I_T of ~3.5 mA, the CG amplifier becomes voltage-headroom limited as M3 and M4 are forced into the triode region. By digitally controlling R_match, the differential impedance Z_D is kept matched for I_T from 0.5 mA to 5 mA to avoid reflection at the frontend. Meanwhile, the simulated common
mode impedance Z_C is evaluated at a bias current I_T of 1 mA.

6.3.3 Transmitter

Figure 6-9 shows the schematic of the transmitter, composed of a data generator, retimer, 2:1 MUX, 1-tap FFE, and output driver. A 2^7-1 PRBS data generator produces the test pattern at half of the data rate, f_d/2. The two half-rate data streams are then retimed and multiplexed into a full-rate data stream together with its delayed version. These two data streams, data0 and data1, are summed with coefficients set by controlling the tail currents I_T0 and I_T1 through the IDAC. This 1-tap FFE can be used to equalize either the pre-cursor or the post-cursor, depending on the choice of tap coefficients. The half-rate clock for data generation, latches, and multiplexers is provided from off chip.

Figure 6-9. Schematic of the transmitter.

As shown in Figure 6-10, a 2^7-1 parallel PRBS generator produces parallel PRBS bit sequences, which are shifted appropriately for direct multiplexing [47]. All signals in the system are differential. The schematics of the D flip-flop, composed of two identical latches, and of the 2:1 multiplexer are the same as shown in Figure 3-6, and the tail current is controlled by the IDAC.
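
The 2^7-1 pattern and the half-rate multiplexing idea can be illustrated behaviorally. The following is a minimal sketch, not the on-chip parallel generator: it assumes the common PRBS7 polynomial x^7 + x^6 + 1 (the text does not state which polynomial is used), generates the sequence serially, and then splits it into two half-rate streams that a 2:1 MUX would interleave. All variable names are hypothetical.

```matlab
% PRBS7 sketch: Fibonacci LFSR for x^7 + x^6 + 1 (assumed polynomial)
nbits  = 7;
period = 2^nbits - 1;            % 127 bits per period
state  = ones(1, nbits);         % any nonzero seed works
seq    = zeros(1, 2*period);     % two periods, to check the repetition

for k = 1:numel(seq)
    seq(k) = state(nbits);                 % output the last stage
    fb     = xor(state(7), state(6));      % feedback from stages 7 and 6
    state  = [fb, state(1:nbits-1)];       % shift the feedback bit in
end

% Two half-rate streams (odd and even bit positions of the full-rate data)
stream_a = seq(1:2:end);
stream_b = seq(2:2:end);

% A 2:1 MUX alternates between the streams to rebuild the full-rate data
remuxed = reshape([stream_a; stream_b], 1, []);

fprintf('ones per period: %d of %d\n', sum(seq(1:period)), period);
```

A maximal-length 7-bit sequence repeats every 127 bits and contains 64 ones per period, which gives a simple sanity check on the generator before it is used as a test pattern.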

Figure 6-10. Block diagram of the 2^7-1 PRBS generator.

6.3.4 Receiver

The receiver implements a decision feedback equalizer (DFE) with a CG preamplifier front end. The DFE uses a 1-tap look-ahead architecture and an embedded 1-to-2 DeMUX with half-rate clocking to avoid power-hungry full-rate latches and relax the stringent timing requirements of the feedback loop [81-83]. As shown in Figure 6-11, ISI cancellation is performed using a latch (L1) with an adjustable decision threshold. The threshold adjustment is embedded in the latch, which avoids a summer and saves power. The tap current I_dfe raises or lowers the latch threshold V_th under control of the UP/DN signals. Since the CG preamplifier preceding the latches improves the receiver sensitivity, threshold adjustment of the four parallel data-receiving latches (L1) is accomplished by applying UP/DN current-weighted tap control signals during the active tracking phase. This also eliminates the need for a second
latch prior to the speculative select MUXs, compared with the regeneration-phase V_th shift proposed by IBM [82]. The timing diagram of the proposed DFE architecture with tracking-phase V_th shift is illustrated in Figure 6-12. The overall timing is 1 bit time earlier for the data available to the MUXs and the required feedback select signals (the previous data). The latency of the proposed DFE is the delay of the MUXs plus the following latch before S_1 (or S_2).

Figure 6-11. Schematic of the receiver and tracking-phase V_th shift.

6.4 Chip Fabrication

To explore the impact of equalization on the power efficiency of on-chip interconnects and to test the DC-coupled current-sharing chip-to-chip link concept, proof-of-concept transmitter and receiver chips were fabricated in a 1.2 V CMOS process and tested. The TX and RX
cores measure 0.03 mm^2 and 0.02 mm^2, respectively. Figure 6-13 shows the die photos and the test board with the fabricated air-gapped transmission line.

Figure 6-12. Timing diagram of the proposed DFE architecture.

Figure 6-13. TX/RX chip die photos.

6.5 Experimental Results

6.5.1 Test Setup

A block diagram of the test setup is illustrated in Figure 6-14. An external signal generator provides the synchronization clock for the transmitter. The synchronizing clock for the
receiver is manually adjusted using an SHF 2000 variable delay line so that the DFE makes its decisions with the maximum voltage and timing margins. The TX chip generates 2^7-1 PRBS data and connects to the RX chip through a channel. The differential output of the receiver is connected to an Agilent 86100B high-frequency oscilloscope to display the time-domain waveform, or to an Agilent ParBERT 81250 to measure the BER.

Figure 6-14. Link test setup.

6.5.2 Air-Gap Channel Measurement

An air-gap channel fabricated on a standard FR-4 PCB is used for link testing. Figure 6-15 shows the measured insertion loss up to 40 GHz for 20 cm long test structures. Measurements indicate an attenuation of 12 dB at 3.125 GHz and 19 dB at 5 GHz for a 20 cm line. The effective dielectric constant is ~1.73, confirming the effectiveness of the air-gap line in comparison with the FR-4 base material, which has a dielectric constant of 4.4. The transmitter and receiver test chips were assembled in wire-bonded 32-pin QFN packages and mounted on an FR-4 board with a 20 cm air-gapped stripline interconnect structure. The test bench is shown in Figure 6-16, with the high-speed I/O pins using SMA connectors and the low-speed pins using DC headers.
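
To put the measured effective dielectric constant in perspective, the propagation delay over the 20 cm line can be estimated from the line length and the effective dielectric constant. This is an illustrative estimate only: it assumes a lossless TEM line and, for the comparison, a line fully embedded in FR-4 with an effective dielectric constant of 4.4. The delay values are not measurements from this work.

```latex
t_d = \frac{\ell\,\sqrt{\varepsilon_{\text{eff}}}}{c}, \qquad
t_d\Big|_{\varepsilon_{\text{eff}}=1.73}
  = \frac{0.2\,\text{m}\times\sqrt{1.73}}{3\times10^{8}\,\text{m/s}} \approx 0.88\,\text{ns}, \qquad
t_d\Big|_{\varepsilon_{\text{eff}}=4.4}
  = \frac{0.2\,\text{m}\times\sqrt{4.4}}{3\times10^{8}\,\text{m/s}} \approx 1.40\,\text{ns}
```

Under these assumptions the air-gap line is roughly 37% faster than an FR-4-filled line of the same length.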

Figure 6-15. Measured insertion loss of a 20 cm air-gap channel.

Figure 6-16. Photo of the link chips mounted on the air-gap board.

6.5.3 Eye Diagrams and BER Measurement

Figure 6-17 presents the measured eye diagrams: (A) the RX frontend output without equalization at 6.25 Gb/s, (B) the RX frontend output (after the CG amplifier) with FFE enabled, (C) the DeMUX output with only FFE enabled, and (D) the DeMUX transient waveform with only FFE enabled. The RX frontend eye was closed without any equalization at a data rate of 6.25 Gb/s. With only FFE enabled, the measured eye height of the CG amplifier
output is ~100 mV after the buffer, and the eye width is ~60% of the bit time. The DeMUX output had good horizontal and vertical eye openings at 3.125 Gb/s.

Figure 6-17. Measured eye diagrams for 6.25 Gb/s link tests with only FFE enabled.

Figure 6-18 presents the measured eye diagrams with only DFE enabled and with both FFE and DFE enabled. The RX frontend eye was closed without any equalization at a data rate of 6.25 Gb/s. With only DFE enabled after the CG amplifier, the ISI at the frontend output is equalized, and the DeMUX output had good horizontal and vertical eye openings at 3.125 Gb/s. With both FFE and DFE enabled, the DeMUX output also has good eye quality.

Figure 6-18. Measured eye diagrams for 6.25 Gb/s link tests with only DFE enabled and with FFE+DFE enabled.

Thus, good eye quality at the DeMUX output can be achieved at various settings: with only FFE enabled, with only DFE enabled, and with both FFE and DFE enabled, as shown in Figures 6-17 and 6-18. The DeMUX output waveforms for the different settings are quite similar because they are shaped by the large output buffers driving the output pads. To compare the link performance at the various equalization settings, the measured receiver bathtub curves are shown in Figure 6-19. By continuously tuning the external variable delay line to adjust the sampling time of the slicer, the BER of the link output can be recorded and plotted. At a data rate of 6.25 Gb/s and a BER of 10^-12 with only the TX FFE enabled, the eye opening is 30% UI.
Enabling the RX DFE and disabling the FFE improves the eye opening to 37% UI. Enabling both FFE and DFE yields a 56% UI horizontal eye opening but decreases power efficiency.

Figure 6-19. Measured BER bathtub curves at various equalization settings.

6.5.4 Energy Measurement

Figure 6-20 presents the measured link energy per bit at different data rates, for the same equalization settings as in Figure 6-19. Detailed energy-per-bit figures are obtained by measuring the supply currents of the transmitter and receiver under different conditions, such as enabling or disabling the FFE/DFE, data generator, and output driver, by digitally adjusting the loader bits that tune the individual tail current sources.

Figure 6-20. Measured energy-per-bit performance at various equalization settings.
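
The bathtub openings quoted above in unit intervals can be converted to absolute timing margin. The conversion below is simple arithmetic on the reported percentages, with the unit interval taken as the inverse of the 6.25 Gb/s data rate; the symbol T_UI is introduced here for illustration.

```latex
T_{\text{UI}} = \frac{1}{6.25\,\text{Gb/s}} = 160\,\text{ps}, \qquad
\begin{aligned}
\text{FFE only:}\quad  & 0.30 \times 160\,\text{ps} = 48\,\text{ps} \\
\text{DFE only:}\quad  & 0.37 \times 160\,\text{ps} = 59.2\,\text{ps} \\
\text{FFE + DFE:}\quad & 0.56 \times 160\,\text{ps} = 89.6\,\text{ps}
\end{aligned}
```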

At 6.25 Gb/s and a BER of 10^-12 with only the TX FFE enabled, the eye opening is 30% UI. Enabling the RX DFE and disabling the FFE improves the eye opening to 37% UI, while the overall power efficiency improves from 0.9 to 0.6 mW/(Gb/s). Enabling both FFE and DFE yields a 56% UI horizontal eye opening but decreases power efficiency.

Figure 6-21. Energy-per-bit performance compared with previous work.

Table 6-2. Link performance summary and comparison with previous work.

                              This work               Fukuda
Technology                    130 nm                  90 nm
Supply voltage (V)            1.2                     1.0
Data rate (Gb/s)              6.25                    12.5
Frontend swing (mV)           125                     100
BER                           1e-12                   1e-12
Horizontal eye                60% UI @ 6.25 Gb/s
Power (mW)                    TX 1.44 (39%)           TX+Driver 5.43 (45%)
                              Frontend 1.20 (32%)     Clock 0.80 (7%)
                              RX 1.06 (29%)           VGA+RX 5.73 (48%)
Energy per bit (mW/(Gb/s))    0.6*                    0.98
TX/RX core area (mm^2)        0.03/0.02               0.24/0.24

*Does not include clock generation and distribution circuits.
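
The energy-per-bit entry for this work can be cross-checked against the block powers in Table 6-2. The sketch below only re-derives the table's totals and percentages from the listed TX, frontend, and RX powers; the variable names are hypothetical, and clock generation and distribution are excluded, as noted in the table footnote.

```matlab
% Block powers for this work, in mW, as listed in Table 6-2
p_tx       = 1.44;
p_frontend = 1.20;
p_rx       = 1.06;
data_rate_gbps = 6.25;

p_total = p_tx + p_frontend + p_rx;                  % total link power, mW
share   = 100 * [p_tx p_frontend p_rx] / p_total;    % percentage per block
energy_per_bit = p_total / data_rate_gbps;           % mW/(Gb/s), i.e. pJ/bit

fprintf('total = %.2f mW, energy = %.2f mW/(Gb/s)\n', p_total, energy_per_bit);
fprintf('TX %.0f%%, frontend %.0f%%, RX %.0f%%\n', share);
```

The total of 3.70 mW at 6.25 Gb/s gives about 0.59 mW/(Gb/s), consistent with the rounded 0.6 in the table, and the shares round to the listed 39%, 32%, and 29%.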

CHAPTER 7
SUMMARY AND FUTURE WORK

7.1 Summary

The demand for higher functional density and processing throughput in high-performance computational platforms such as multi-core and network processors is progressively increasing the aggregate I/O bandwidth. In particular, reducing the power consumption is critical when hundreds of parallel I/Os are integrated on one chip. In this dissertation, a logical effort (LE) model is presented for high-speed CML gates to describe the design metrics, i.e., power and speed. Incorporating system and circuit analysis, a complete design framework is developed for high-speed parallel I/O links to predict the compound metrics and guide the most energy-efficient design, including the link architectures and circuit design specifications, across technology nodes.

The band-limited characteristics of link channels, e.g., backplane systems, are studied in terms of their physical causes and their effect on signal integrity. Both linear and nonlinear equalizers are introduced to compensate the frequency-dependent attenuation caused by link channels. A statistical approach has been developed to characterize the signal integrity, i.e., the BER, of I/O links by behaviorally modeling the equalizers and channels. Through system-level analysis, the options for the link architecture, including the types and amounts of equalization for a band-limited channel, can be optimized to achieve the target BER.

At the circuit level of CML gates, the logical effort g is defined to capture the driving effort, which is related to the current density. The fact that the logical effort g is invariant across technology nodes under constant-field scaling, while the electrical effort h scales with the required data rate, enables algorithmic sizing of the CML gates in high-speed I/O links. A prototype PRBS data generator implemented in UMC 130 nm CMOS technology validated the logical
effort based power and speed model. A more complex high-speed transmitter implemented in 65 nm CMOS technology was measured to validate both the link analysis, including the feed-forward equalizer (FFE), and the circuit model for power and speed performance.

For parallel I/O links with very large aggregate bandwidth, a simulation framework incorporating the system and circuit modeling is developed to facilitate the analysis of link architectures across technology nodes and to predict the performance metrics. To achieve optimal energy efficiency at the target BER, the level of parallelism for high-speed I/O links can be determined from the simulation framework for different band-limited channels. The area constraint from a realistic packaging strategy for I/O links can also be included in the simulation framework, from which we obtain the compound metrics, i.e., power density and achievable bandwidth per area, across technology nodes.

Although ISI is the dominant factor that limits high-speed signaling on band-limited channels and is properly modeled in the design framework, crosstalk might be severe when high-speed signals travel closely in parallel inside packages or on boards. Based on measured S-parameters or a theoretical model, the crosstalk can be included in the statistical BER calculation using the same convolution method as for ISI. A high-speed jitter equalizer is presented in this dissertation that compensates the propagation-time differences caused by crosstalk-induced jitter (CIJ). By adjusting the sampling clock edge using a phase interpolator, the output data of the transmitter is advanced or delayed according to the modes of data transitions between adjacent channels. The proposed CIJ equalizer is readily combined with a feed-forward equalizer to compensate channel loss and crosstalk at the same time.
To verify the proposed CIJ equalization scheme and evaluate its performance, a chip prototype with 4 parallel high-speed transmitters was fabricated and tested using both BGA and QFN packages. The
measurement results demonstrate reduced jitter (~50%) and a lower BER (10^-12) after enabling the CIJ equalizer at high data rates.

Based on the proposed link model, the power breakdown of the individual blocks in a high-speed I/O link can be estimated, and the transceiver frontend is found to be a major source of power consumption. This dissertation also proposes a new circuit technique that shares current within the transceiver frontend, between the output driver at the transmitter and the input transimpedance amplifier at the receiver. The proposed current-sharing transceiver conducts the transimpedance amplifier current directly into the channel to power the output driver. To verify the proposed low-power circuit technique, transmitter and receiver chips were fabricated and the link performance was measured. The measurement results indicate operation of the link up to 6.25 Gb/s over a 20 cm air-gap channel with an energy-per-bit cost of around 0.6 mW/(Gb/s).

7.2 Future Work

In this work, the main building blocks of the high-speed I/O link were implemented, including the transmitter, receiver, equalizers, and PLL. The link performance was measured by externally tuning the variable delay of the sampling clock for the receiver. A clock recovery circuit can be designed to complete the function of the I/O link.

APPENDIX A
MATLAB CODES FOR DESIGN FRAMEWORK

The MATLAB codes used to predict the link performance across technology nodes (from 130 nm to 16 nm) are provided in this section as an example of the proposed design framework. The flow chart of the MATLAB functions is illustrated in Figure A-1. In the main function of the design framework, the technology parameters are first read, and the performance metrics of power and energy at the different technology nodes are calculated across the data rates. For a given channel insertion loss rate, the tap coefficients of the FFE and DFE are swept and optimized to obtain the minimum BER. According to the transceiver architecture presented in Chapter 4, the power of the transmitter and receiver can be calculated separately and summed to obtain the link power dissipation.

Figure A-1. Flow chart of MATLAB functions used in the design framework.
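
The technology-parameter step described above can also be written compactly with vectors. The following is a minimal sketch, not part of the original framework: the numeric values are copied from the RC_tech_para and main-function listings of this appendix, while the names nodes, tau, and c_io are assumptions made for illustration. Units follow the listings as given.

```matlab
% Values copied from the listings in this appendix, one entry per node
nodes = [130 90 65 45 32 22 16];                      % technology node, nm
vdd   = [1.2 1.2 1.1 1.0 0.9 0.8 0.7];                % nominal supply, V
idsat = [900 1110 1510 1900 2050 2400 2768];          % NMOS drive current, as listed
tau   = [1.6 0.95 0.64 0.39 0.26 0.15 0.1] * 1e-12;   % intrinsic delay, s

% Gate capacitance per node: Cgate = tau * Idsat / Vdd
cgate = tau .* idsat ./ vdd;

% Pad plus ESD capacitance, scaled with Cgate relative to the 130 nm node
c_io = (0.08e-12 + 0.2e-12) .* cgate ./ cgate(1);

for k = 1:numel(nodes)
    fprintf('%3d nm: cgate = %.3g, c_io = %.3g\n', nodes(k), cgate(k), c_io(k));
end
```

Holding the parameters in vectors indexed by node makes it straightforward to loop over technologies instead of repeating one block of code per node.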

Main function of design framework

tpp=RC_tech_para;
vdd=tpp.a; cgate=tpp.b; tox=tpp.c; lg=tpp.d; v_swing=tpp.e; p=tpp.f;
% define channel loss rate 1dB/GHz or 8dB/GHz
alpha=1.0*(1e-9);
% define current density
Jc=0.3e-3/1e-6;
% define working data rates, increase data rates for low lossy channel
Rb=[0.5e9 0.75e9 1e9 2.5e9 5e9 7.5e9 10e9 12.5e9 15e9 20e9 30e9 40e9];
c_pad=0.08e-12.*cgate./cgate(1);
c_esd=0.2e-12.*cgate./cgate(1);
c_IO=c_pad+c_esd;
% define IO cap_ratio of input output RX, either 1 or other values
cap_ratio_RX_TX=1;
% define receiver sensitivity
vrx_sen=10e-3;
% define crosstalk value
crosstalk_db=20;
% define noise value in channel
k_noise=1e-3; %0.001;
% get data rate number
[xk,size_fd]=size(Rb); kx=size_fd;
% sweep data rate at a given technology, 1->7
% 130 nm technology
for n=1:1:kx
    [Power_T_opt1g(n),Power_R_opt1g(n),Power_osc_opt1g(n),Power_TR_opt1g(n),ksample_1(n),k1_1(n),k2_1(n),kdfe1_1(n),k_dfe2_1(n)]=yIO_Power_EQU(Rb(n),tox(1),lg(1),cgate(1),c_IO(1),vdd(1),alpha,v_swing(1),vrx_sen,cap_ratio_RX_TX,Jc,p(1));
end
save power1_g Power_TR_opt1g;
x1g=Power_TR_opt1g.*1000./(Rb.*1e-9);
% 90 nm
for n=1:1:kx
    [Power_T_opt2g(n),Power_R_opt2g(n),Power_osc_opt2g(n),Power_TR_opt2g(n),ksample_2(n),k1_2(n),k2_2(n),kdfe1_2(n),k_dfe2_2(n)]=yIO_Power_EQU(Rb(n),tox(2),lg(2),cgate(2),c_IO(2),vdd(2),alpha,v_swing(2),vrx_sen,cap_ratio_RX_TX,Jc,p(2));
end
save power2_g Power_TR_opt2g;
x2g=Power_TR_opt2g.*1000./(Rb.*1e-9);
% 65 nm
for n=1:1:kx
    [Power_T_opt3g(n),Power_R_opt3g(n),Power_osc_opt3g(n),Power_TR_opt3g(n),ksample_3(n),k1_3(n),k2_3(n),kdfe1_3(n),k_dfe2_3(n)]=yIO_Power_EQU(Rb(n),tox(3),lg(3),cgate(3),c_IO(3),vdd(3),alpha,v_swing(3),vrx_sen,cap_ratio_RX_TX,Jc,p(3));
end
save power3_g Power_TR_opt3g;
x3g=Power_TR_opt3g.*1000./(Rb.*1e-9);
% 45 nm
for n=1:1:kx
    [Power_T_opt4g(n),Power_R_opt4g(n),Power_osc_opt4g(n),Power_TR_opt4g(n),ksample_4(n),k1_4(n),k2_4(n),kdfe1_4(n),k_dfe2_4(n)]=yIO_Power_EQU(Rb(n),tox(4),lg(4),cgate(4),c_IO(4),vdd(4),alpha,v_swing(4),vrx_sen,cap_ratio_RX_TX,Jc,p(4));
end
save power4_g Power_TR_opt4g;
x4g=Power_TR_opt4g.*1000./(Rb.*1e-9);
% 32 nm
for n=1:1:kx
    [Power_T_opt5g(n),Power_R_opt5g(n),Power_osc_opt5g(n),Power_TR_opt5g(n),ksample_5(n),k1_5(n),k2_5(n),kdfe1_5(n),k_dfe2_5(n)]=yIO_Power_EQU(Rb(n),tox(5),lg(5),cgate(5),c_IO(5),vdd(5),alpha,v_swing(5),vrx_sen,cap_ratio_RX_TX,Jc,p(5));
end
save power5_g Power_TR_opt5g;
x5g=Power_TR_opt5g.*1000./(Rb.*1e-9);
% 22 nm
for n=1:1:kx
    [Power_T_opt6g(n),Power_R_opt6g(n),Power_osc_opt6g(n),Power_TR_opt6g(n),ksample_6(n),k1_6(n),k2_6(n),kdfe1_6(n),k_dfe2_6(n)]=yIO_Power_EQU(Rb(n),tox(6),lg(6),cgate(6),c_IO(6),vdd(6),alpha,v_swing(6),vrx_sen,cap_ratio_RX_TX,Jc,p(6));
end
save power6_g Power_TR_opt6g;
x6g=Power_TR_opt6g.*1000./(Rb.*1e-9);
% 16 nm
for n=1:1:kx
    [Power_T_opt7g(n),Power_R_opt7g(n),Power_osc_opt7g(n),Power_TR_opt7g(n),ksample_7(n),k1_7(n),k2_7(n),kdfe1_7(n),k_dfe2_7(n)]=yIO_Power_EQU(Rb(n),tox(7),lg(7),cgate(7),c_IO(7),vdd(7),alpha,v_swing(7),vrx_sen,cap_ratio_RX_TX,Jc,p(7));
end
save power7_g Power_TR_opt7g;
x7g=Power_TR_opt7g.*1000./(Rb.*1e-9);

Technology parameters from ITRS

function ss = RC_tech_para
% nominal power supply voltage
vdd_130=1.2; vdd_90=1.2; vdd_65=1.1; vdd_45=1.0; vdd_32=0.9; vdd_22=0.8; vdd_16=0.7;
vdd=[vdd_130 vdd_90 vdd_65 vdd_45 vdd_32 vdd_22 vdd_16];
% nominal high performance NMOS saturation drive current
idsat_130=900; idsat_90=1110; idsat_65=1510; idsat_45=1900; idsat_32=2050; idsat_22=2400; idsat_16=2768;
% high performance NMOS intrinsic delay
t_130=1.6e-12; t_90=0.95e-12; t_65=0.64e-12; t_45=0.39e-12; t_32=0.26e-12; t_22=0.15e-12; t_16=0.1e-12;
% NMOS device gate capacitance Cgate
cgate_130=t_130*idsat_130/vdd_130;
cgate_90=t_90*idsat_90/vdd_90;
cgate_65=t_65*idsat_65/vdd_65;
cgate_45=t_45*idsat_45/vdd_45;
cgate_32=t_32*idsat_32/vdd_32;
cgate_22=t_22*idsat_22/vdd_22;
cgate_16=t_16*idsat_16/vdd_16;
cgate=[cgate_130 cgate_90 cgate_65 cgate_45 cgate_32 cgate_22 cgate_16];
% equivalent electrical oxide thickness in inversion tox
tox_130=2.3e-9; % 130nm technology
tox_90=2.1e-9;  % 90nm technology
tox_65=1.3e-9;  % 65nm technology
tox_45=1.1e-9;  % 45nm technology
tox_32=1.0e-9;  % 32nm technology
tox_22=0.9e-9;  % 22nm technology
tox_16=0.9e-9;  % 16nm technology
tox=[tox_130 tox_90 tox_65 tox_45 tox_32 tox_22 tox_16];
er=3.9;
J=0.3e-3/1e-6;
%[v_swing_130]=v_gen(tox_130,lg_130,er,0.5*J);
%v_full_130=v_swing_130*2
[v_swing_130]=RC_v_gen(tox_130,lg_130,er,1*J);
[v_swing_90]=RC_v_gen(tox_90,lg_90,er,J);
[v_swing_65]=RC_v_gen(tox_65,lg_65,er,J);
[v_swing_45]=RC_v_gen(tox_45,lg_45,er,J);
[v_swing_32]=RC_v_gen(tox_32,lg_32,er,J);
[v_swing_22]=RC_v_gen(tox_22,lg_22,er,J);
[v_swing_16]=RC_v_gen(tox_16,lg_16,er,J);
v_swing=1.5.*[v_swing_130 v_swing_90 v_swing_65 v_swing_45 v_swing_32 v_swing_22 v_swing_16];
ss.a=vdd; ss.b=cgate; ss.c=tox; ss.d=lg; ss.e=v_swing;
p_130=0.19; p_90=0.27; p_65=0.27;
p_45=0.31; p_32=0.36; p_22=0.42; p_16=0.5;
p=[p_130 p_90 p_65 p_45 p_32 p_22 p_16];
ss.f=p;

Transceiver link power calculation function

function [Power_TRX,Power_TX,Power_RX,Power_osc_loop,I_IO_TX,c_tx_in,c_rx_in,vlimit,H_TX,H_RX]=Power_TRX(fd,w_g,cgate,v0,vm,vrx_sen,h_diff,c_IO,vdd,c0,c1,c2,cdfe1,cdfe2,Rd,cap_ratio_RX_TX,Jc,p,tox)
[Power_osc,Power_TX_data,Power_ck_loop_TX,Power_ck_buf_TX,I_IO_TX,c_tx_in,H_TX]=power_TX(fd,w_g,cgate,Rd,v0,vm,h_diff,c_IO,vdd,c0,c1,c2,vrx_sen,Jc,p);
cap_IO=c_IO/1;
c_rx_out=cap_ratio_RX_TX*c_tx_in+cap_IO;
Jm=0.3e-3/1e-6;
v_swing=vm*(Jc/Jm)^2;
gain=v_swing/vrx_sen;
[Power_RX,Power_RX_data,Power_CDR,c_rx_in,Namp,vlimit,H_RX]=power_RX(fd,w_g,cgate,v0,vm,gain,c_rx_out,vdd,cdfe1,cdfe2,Jc,p,tox);
Power_osc_loop=Power_osc+Power_ck_loop_TX;
Power_TX=Power_ck_buf_TX+Power_TX_data;
Power_TRX=Power_TX+Power_RX+Power_osc_loop;

Transmitter power calculation function

function [Power_osc,Power_TX_data,Power_ck_loop_TX,Power_ck_buf_TX,I_IO_TX,c_tx_in,Hx_TX]=power_TX(fd,w_g,cgate,Rd,v0,vm,eyemax,c_IO,vdd,c0,c1,c2,vrx_sen,Jc,p)
h_diff=eyemax;
[Power_driver_TX,c_driver,I_IO_TX,cap_out_TX,Hx_driver]=power_TX_driver(Rd,cgate,h_diff,vdd,c0,c1,c2,vrx_sen,Jc,p,c_IO);
fopt=fd;
if c1 == 0 && c2 == 0 % NO FIR
    [Power_latch_FIR_TX,c_FIR,I_FIR_TX,c_ck_fir_latch,Hx_fir_latch]=power_ff_cml(fd,w_g,cgate,c_driver,v0,vm,vdd,Jc,p);
    c_ck_fir=c_ck_fir_latch;
    Power_FIR_TX=Power_latch_FIR_TX;
    Hx_fir=Hx_fir_latch;
end
if c2 == 0 && c1 ~= 0
    [Power_latch_FIR_TX2,c_FIR2,I_FIR_TX2,c_ck_fir_latch2,Hx_fir_latch2]=power_ff_cml(fd,w_g,cgate,c_driver,v0,vm,vdd,Jc,p);
    % 1st LATCH for 1st tap, load cap=c_driver+c_FIR2
    [Power_latch_FIR_TX1,c_FIR1,I_FIR_TX1,c_ck_fir_latch1,Hx_fir_latch1]=power_ff_cml(fd,w_g,cgate,c_driver+c_FIR2,v0,vm,vdd,Jc,p);
    % 1st+2nd LATCH
    Power_FIR_TX=Power_latch_FIR_TX2+Power_latch_FIR_TX1;
    c_ck_fir=c_ck_fir_latch2+c_ck_fir_latch1; % FIR clock load
    c_FIR=c_FIR1; % data path input cap
    Hx_fir=Hx_fir_latch2;
end
if c2 ~= 0 % 1st, 2nd, 3rd flip flop
    % 3rd LATCH for 3rd tap, load cap=c_driver
    [Power_latch_FIR_TX3,c_FIR3,I_FIR_TX3,c_ck_fir_latch3,Hx_fir_latch3]=power_ff_cml(fd,w_g,cgate,c_driver,v0,vm,vdd,Jc,p);
    % 2nd LATCH for 2nd tap, load cap=c_driver+c_FIR3
    [Power_latch_FIR_TX2,c_FIR2,I_FIR_TX2,c_ck_fir_latch2,Hx_fir_latch2]=power_ff_cml(fd,w_g,cgate,c_driver+c_FIR3,v0,vm,vdd,Jc,p);
    % 1st LATCH for 1st tap, load cap=c_driver+c_FIR2
    [Power_latch_FIR_TX1,c_FIR1,I_FIR_TX1,c_ck_fir_latch1,Hx_fir_latch1]=power_ff_cml(fd,w_g,cgate,c_driver+c_FIR2,v0,vm,vdd,Jc,p);
    % 1st+2nd LATCH
    Power_FIR_TX=Power_latch_FIR_TX3+Power_latch_FIR_TX2+Power_latch_FIR_TX1;
    c_ck_fir=c_ck_fir_latch3+c_ck_fir_latch2+c_ck_fir_latch1; % FIR clock load
    c_FIR=c_FIR1; % data path input cap
    Hx_fir=Hx_fir_latch3;
end
% Full rate clock load capacitance
ci_1to1_tx=c_ck_fir; % fd clock input cap for FIR DFF
%Power_fullrate=Power_retimer+Power_FIR_TX+Power_driver_TX;
Power_fullrate=Power_FIR_TX+Power_driver_TX; % Power FF for FIR + TX driver
% 2:1 mux includes: 2 latch + 3 latch + 1 MUX (latch) working at fd/2->fd
c_2to1=c_FIR; % data path input cap
[Power_mux_2to1,c_latch_2to1,I_2to1mux,c_ck_mux_2to1,Hx_mux_2to1]=power_mux_cml(fd,w_g,cgate,c_2to1,v0,vm,vdd,Jc,p);
% Assume 5 same Latch for 2:1 fd/2->fd Mux
[Power_latch_2to1,c_4to2,I_2to1latch,c_ck_latch_2to1,WG_latch_2to1,R_latch_2to1,Hx_latch_2to1]=power_latch_cml(fd/2,w_g,cgate,c_latch_2to1,v0,vm,vdd,Jc,p);
Power_halfrate=Power_mux_2to1+5*Power_latch_2to1;
% Half rate clock load capacitance
% fd/2 clock input cap for 5latch+1mux
ci_2to1_tx=c_ck_mux_2to1+5*c_ck_latch_2to1;
Hx_2to1=Hx_latch_2to1*Hx_mux_2to1;
% 4:2 mux includes: 2 latch + 3 latch + 1 MUX, working at fd/4->fd/2
[Power_mux_4to2,c_latch_4to2,I_4to2mux,c_ck_mux_4to2,Hx_mux_4to2]=power_mux_cml(fd/2,w_g,cgate,c_4to2,v0,vm,vdd,Jc,p);
[Power_latch_4to2,c_tx_in,I_4to2latch,c_ck_latch_4to2,WG_latch_4to2,R_latch_4to2,Hx_latch_4to2]=power_latch_cml(fd/4,w_g,cgate,c_latch_4to2,v0,vm,vdd,Jc,p);
Power_quadrate=2*(Power_mux_4to2+5*Power_latch_4to2); % 2 branch
% Quad rate clock fd/2 clock input cap for 2*(5latch+1mux)
ci_4to2_tx=2*(c_ck_mux_4to2+5*c_ck_latch_4to2);
Hx_4to2=Hx_mux_4to2*Hx_latch_4to2;
% TX datapath power consumption
Power_TX_data=Power_fullrate+Power_halfrate+Power_quadrate;
Hx_TX=Hx_4to2*Hx_2to1*Hx_fir*Hx_driver;
[Power_osc,Power_ck_loop,Power_ck_buf_TX,c_load_osc]=power_PLL_TX(fd,w_g,cgate,v0,vm,vdd,Jc,p,ci_4to2_tx,ci_2to1_tx,ci_1to1_tx);
% calculate TX PLL loop dynamic power
% TDC: 32 bit TDC = 32 INV + 32 DFF (32*8 INV), at fb/32
% digital loop filter: 20 bit voter (adder), 1 adder = 20 INV at fb/32
cap_inv=4*w_g*cgate;
cap_tot=(32+32*8+20*20)*cap_inv;
Power_filter=0.5*cap_tot*(vdd*vdd)*(fd/32);
Power_ck_loop_TX=Power_ck_loop+Power_filter; % Include PI, Loop filter + PD

Receiver power calculation function

function [Power_RX,Power_RX_data,Power_CDR,c_rx_in,Namp,vlimit,Hx_RX]=power_RX(fd,w_g,cgate,v0,vm,gain,c_rx_out,vdd,cdfe1,cdfe2,Jc,p,tox)
% 2:4 Demux includes: 2 latch + 3 latch, working at fb/4
c_2to4=c_rx_out;
[Power_latch_2to4,c_1to2,I_2to4demux,c_ck_latch_2to4,WX,RX,Hx_2to4]=power_latch_cml(fd/4,w_g,cgate,c_2to4,v0,vm,vdd,Jc,p);
Power_demux_2to4=5*Power_latch_2to4;
Power_quadrate_RX=2*Power_demux_2to4;
ci_2to4=10*c_ck_latch_2to4; % fd/4 clock driving 10 latch
% 1:2 Demux includes: 2 latch + 3 latch, working at fb/2
[Power_latch_1to2,c_1to1,I_1to2demux,c_ck_latch_1to2,WX,RX,Hx_1to2]=power_latch_cml(fd/2,w_g,cgate,2*c_1to2,v0,vm,vdd,Jc,p);
Power_demux_1to2=5*Power_latch_1to2;
Power_halfrate_RX=Power_demux_1to2;
ci_1to2=5*c_ck_latch_1to2; % fb/2 clock driving 5 latch
% DFE Delay includes: 2 latch, working at fd
if cdfe1 == 0 && cdfe2 == 0 % NO DFE
    c_load_slicer=2*c_1to1; % slicer only driving 2 latch
    [Power_latch_slicer,c_adder,I_slicer_latch,c_ck_latch_slicer,WX,RX,Hx_dfe]=power_latch_cml(fd,w_g,cgate,c_load_slicer,v0,vm,vdd,Jc,p);
    Power_slicer=2*Power_latch_slicer;
    c_ck_slicer=2*c_ck_latch_slicer;
    [Power_adder,c_amp2,I_adder,Hx_adder]=power_sum(fd,w_g,cgate,c_adder,v0,vm,vdd,cdfe1,cdfe2,Jc,p);
    Power_DFE=0; cap_dfe=0; c_ck_dfe=0;
end
if cdfe1 ~= 0 && cdfe2 == 0 % 1 tap DFE
    c_load_slicer=4*c_1to1;
    % Slicer Power=1DFF=2Latch
    [Power_latch_slicer,c_adder,I_slicer_latch,c_ck_latch_slicer,WX,RX,Hx_dfe]=power_latch_cml(fd,w_g,cgate,c_load_slicer,v0,vm,vdd,Jc,p);
    Power_slicer=2*Power_latch_slicer;
    c_ck_slicer=2*c_ck_latch_slicer;
    % Adder Power
    [Power_adder,c_amp2,I_adder,Hx_adder]=power_sum(fd,w_g,cgate,c_adder,v0,vm,vdd,cdfe1,cdfe2,Jc,p);
    c_load_DFE=c_amp2;
    [Power_latch_DFE,cap_dfe,I_DFE_latch,c_ck_latch_dfe,WX,RX,Hx_dfe]=power_latch_cml(fd,w_g,cgate,c_load_DFE,v0,vm,vdd,Jc,p);
    Power_DFE=2*Power_latch_DFE;
150 c_ck_dfe=2*c_ck_latch_dfe; end if cdfe1 ~= 0 & cdfe2 ~= 0 % 2 tap DFE c_load_slicer=6*c_1to1; [Power_latch_slicer, c_adder, I_slicer_latch,c_ck_latch_slicer,WX,RX,Hx_dfe]=po wer_latch_cml(fd,w_g,cgate,c_load_slicer,v0,vm,vdd,Jc,p); Power_slicer=2*Power_latch_slicer; c_ck_slicer=2*c_ck_latch_slicer; %Adder Power [Power_adder, c_amp2, I_adder,Hx_adder]=power_sum(fd,w_g,cgate,c_adder,v0,vm,vdd,cdfe1,cdfe2,Jc,p); c_load_DFE= c_amp2; [Power_latch_DFE, cap_dfe, I_DFE_latch,c_ck_latch_dfe,WX,RX,Hx_dfe]=power_latch_cml(fd,w_g,cgate,c_load_DFE,v0,vm,vdd,Jc,p); Power_DFE=4*Power_latch_DFE; c_ck_dfe=4*c_ck_latch_dfe; end if cdfe1 == 0 & cdfe2 ~= 0 % Only 1 tape DFE c_load_slicer=5*c_1to1; [Power_latch_slicer, c_adder, I_slicer_latch,c_ck_latch_slicer,WX,RX,Hx_dfe]=power_latch_cml(fd,w_g,cgate,c_load_slicer,v0,vm,vdd,Jc,p); Power_slicer=2*Power_latch_slicer; c_ck_slicer=2*c_ck_latch_slicer; %Adder Power [Power _adder, c_amp2, I_adder,Hx_adder]=power_sum(fd,w_g,cgate,c_adder,v0,vm,vdd,cdfe1,cdfe2,Jc,p); c_load_DFE=c_amp2; [Power_latch_DFE, cap_dfe, I_DFE_latch,c_ck_latch_dfe,WX,RX,Hx_dfe]=power_latch_cml(fd,w_g,cgate,c_load_DFE,v0,vm,vdd,Jc,p); Power_DFE=4*Pow er_latch_DFE; c_ck_dfe=4*c_ck_latch_dfe; end Power_sum=1*Power_adder; ci_1to1=c_ck_slicer+c_ck_dfe; [Power_CDR,Power_ck_loop,Power_ck_buf_RX,Power_filter_RX,c_load_osc_rx]=power_CDR_RX(fd,w_g,cgate,v0,vm,vdd,ci_2to4, ci_1to2,ci_1to1,Jc,p); [Power_amp, c_rx _in,Namp,vlimit]=power_amplifier(fd,w_g,cgate,tox,c_amp2,v0,vm,gain,vdd,Jc,p); Power_RX_data=Power_amp+Power_DFE+Power_sum+Power_slicer+Power_halfrate_RX+Power_quadrate_RX; Power_RX=Power_RX_data+Power_CDR; Hx_RX=c_rx_out/c_rx_in; Template CML INV design function [Ct, It, Rt, Wt, Hx]=design_INV_temp(fd,w_g,cgate,cload,v0,vm,vdd,Jx,p,alpha) Jm=0.3e 3/1e 6; % Define maximum J t_min=vm/Jm*cgate; % delay_unit beta=2; gm=1; % beta=2 >square law model, gm=1@Jm J0=Jm*(v0/vm)^beta; g0=v0/vm*Jm/J0; % g0 >J0 >v0 % define region of J0
end
Power_latch=vdd*Icurrent;

CML MUX design function

function [Power_mux, c_in, Icurrent, c_ck, Hx]=power_mux_cml(fd,wg_min,cgate,cload,v0,vm,vdd,Jx,p)
alpha=1; % for CML Latch
[Ctx, Itx, Rtx, Wtx, Hx]=design_INV_temp(fd,wg_min,cgate,cload,v0,vm,vdd,Jx,p,alpha);
if Hx < 1
    Icurrent=1e6;Power_latch_dc=1e9;Wgate=1;Wck=1;Rload=0;c_in=1;c_ck=1;
else
    Wgate=Wtx;Wck=1.5*Wtx; % clock path gate capacitance=1.5*main path gate for voltage headroom
    c_in=Ctx;
    c_ck=Wck*cgate;
    Icurrent=Itx;
    Rload=Rtx;
end
Power_mux=vdd*Icurrent;

Transmitter CML driver

function [Power_IO, c_in, Icurrent,cap_total, Hx]=power_TX_driver(Rload,cgate,ymin,vdd,k0,k1,k2,vrx_sen,Jc,p,c_IO)
R_T=50; % RX input termination resistance
Z0=50; % Channel impedance
I_max=vdd/(Rload);%+Z0);
R_eff=Rload;%*50/(Rload+50);
v_diff=vrx_sen/ymin; % output rail-to-rail voltage swing = IT*Rload
Icurrent=v_diff/(Rload); % meet bandwidth requirement
if v_diff >= vdd % output voltage swing MUST be < Vdd
    Icurrent=1e6;
end
Power_IO_dc=vdd*Icurrent*(1+abs(k1)+abs(k2)); % Power consumed by TX output driver main tap, 1st tap, 2nd tap
Power_driver_dc=Power_IO_dc;%*(k0+abs(k1)+abs(k2));
Wgate=Icurrent/Jc;
Cin=Wgate*cgate;
if k1==0 & k2==0 % only main tap, no tap 1, tap 2
    cap_total=c_IO+Cin*p;
end
if k1~=0 & k2==0 % only main tap and tap 1
    cap_total=c_IO+2*(p*Cin);
end
if k2~=0 % main tap + tap 1 + tap 2
    cap_total=c_IO+3*(p*Cin);
end
c_in=Cin; % Input capacitance for data in
Hx=cap_total/c_in; % Electrical Effort
Power_dyn=0;%0.5*cap_total*(v_diff^2)*fb; % TX output dynamic power
Power_IO=Power_driver_dc+Power_dyn; % TX output driver total power

CML DFF design function

function [Power_ff, c_in, Icurrent, c_ck, Hx]=power_ff_cml(fd,wg_min,cgate,cload,v0,vm,vdd,Jx,p)
alpha=0.8;
[Ctx, Itx, Rtx, Wtx, Hx]=design_INV_temp(fd,wg_min,cgate,cload,v0,vm,vdd,Jx,p,alpha);
if Hx < 1
    Icurrent=1e6;Power_ff_dc=1e9;c_in=1;c_ck=1;
else
    Wgate=Wtx; Wck=1.5*Wtx; % clock path gate capacitance=1.5*main path gate for voltage headroom
    c_in=1*Ctx; % input capacitance
    c_ck=2*Wck*cgate;Icurrent=2*Itx;Rload0=Rtx; % composed of 2 latches
end
Power_ff=vdd*Icurrent;

Transmitter PLL

function [Power_osc,Power_ck_loop,Power_ck_buf,c_load_osc1]=power_PLL_TX(fd,w_g,cgate,v0,vm,vdd,Jx,p,ci_4to2,ci_2to1,ci_1to1)
% ci_4to2, ci_2to1, ci_1to1: TX 4to2 MUX, 2to1 MUX, 1to1 DFF input CK cap
% 1/16 to 1/32 frequency divider, including 2 cml latch
% 1/8 to 1/16 frequency divider, including 2 cml latch, working at fb/8
% 1/4 to 1/8 frequency divider, including 2 cml latch, working at fb/4
% Change CML divider to static divider for fd/4 -> fd/8 -> fd/16 -> fd/32
% Static Divider=2DFF=2*8 INV
% cap_inv=3*w_g*cgate;dynpower=0.5*cap_tot*(vdd*vdd)*(fb/32);
c_pfd=24*w_g*cgate;

Power_div_32=0.5*c_pfd*(vdd*vdd)*(fd/32);
% 1/8 to 1/16 frequency divider, static, working at fb/8
static_div=0.5*(vdd*vdd)*(1*8)*(3*w_g*cgate); % 1/8 -> 1/16 frequency divider: CMOS static divider
Power_div_16=static_div*(fd/16);
c_ck_8=12*w_g*cgate;
% 1/4 to 1/8 frequency divider, including 2 cml latch, working at fb/4, CML
% function [Power_ff, c_in, Icurrent, c_ck, Hx]=RC2_power_ff_cml(fd,wg_min,cgate,cload,v0,vm,vdd,Jx,p)
[Power_latch_div_8, c_in_8, Icurrent_8, c_ck_4,Hx_div_8]=power_ff_cml(fd/4,w_g,cgate,c_ck_8,v0,vm,vdd,Jx,p);
Power_div_8=2*Power_latch_div_8;
% Clock buffer_4
[Power_ck_buf_4,c_buf_4, Icurrent_buf4,Hx_buf_4]=power_buf_cml(fd/4,w_g,cgate,ci_4to2,v0,vm,vdd,Jx,p);
c_load_div4=c_ck_4+c_buf_4;
% 1/2 to 1/4 frequency divider, working at fb/2, CML
[Power_latch_div_4, c_in_4, Icurrent_4, c_ck_2,Hx_div_4]=power_ff_cml(fd/2,w_g,cgate,c_load_div4,v0,vm,vdd,Jx,p);
Power_div_4=2*Power_latch_div_4;
% Clock buffer_2
[Power_ck_buf_2,c_buf_2, Icurrent_buf2,Hx_buf_2]=power_buf_cml(fd/2,w_g,cgate,ci_2to1,v0,vm,vdd,Jx,p);
c_load_div2=c_ck_2+c_buf_2;
% 1/1 to 1/2 frequency divider, working at fb
[Power_latch_div_2, c_in_2, Icurrent_2, c_ck_1,Hx_div_2]=power_ff_cml(fd/1,w_g,cgate,c_load_div2,v0,vm,vdd,Jx,p);
Power_div_2=2*Power_latch_div_2;
% Total divider power
Power_div=Power_div_32+Power_div_16+Power_div_8+Power_div_4+Power_div_2;
% Clock buffer_1
[Power_ck_buf_1,c_buf_1, Icurrent_buf1,Hx_buf_1]=power_buf_cml(fd/1,w_g,cgate,ci_1to1,v0,vm,vdd,Jx,p);
c_load_oscbuf=c_ck_1+c_buf_1;
% osc output clock buffer for TX
[Power_osc_buf,c_load_osc, Icurrent_oscbuf,Hx_oscbuf]=power_buf_cml(fd,w_g,cgate,c_load_oscbuf,v0,vm,vdd,Jx,p); % second osc buffer
[Power_osc_buf1,c_load_osc1, Icurrent_oscbuf1,Hx_oscbuf1]=power_buf_cml(fd,w_g,cgate,c_load_osc,v0,vm,vdd,Jx,p); % first osc buffer
% CK buffer: ck_buf_1 ck_buf_2 ck_buf_4
Power_ck_buf=Power_ck_buf_1+Power_ck_buf_2+Power_ck_buf_4;
% ck loop: 1st osc_buf1, 2nd osc_buf, 1/2 divider, 2/4 divider, 4/8 divider, 8/16 divider, 16/32 divider
Power_ck_loop=Power_osc_buf1+Power_osc_buf+Power_div;
[Power_osc, Icurrent_osc]=power_ringosc4(fd,w_g,cgate,c_load_osc1,vm,vdd,Jx,p);

Transmitter oscillator design in PLL

function [Power_tot, Icurrent]=power_ringosc4(fd,w_g,cgate,cload,v_sw0,vdd,Jc,p)
% fosc: 4 stage ring oscillator frequency, cgate: gate capacitance
fosc=fd; % Frequency, not angular frequency
I_tail0=0.1e-3; % Initial tail current (0.1 mA)
A_0=sqrt(2); % Gain for each stage: sqrt(2)
k_mod=2.5; % Mobility ratio, un/up=2.5
wn_0=I_tail0/(Jc);%2*I_tail0*glength/(v_sw0*v_sw0*un*cox)%I_tail0/(0.3e-3/1e-6); % Initial width of nmos transistor for peak fT
wp_0=k_mod*wn_0/A_0; % Initial width of pmos transistor
cap_p0=1*wp_0*cgate; % Intrinsic pmos cap
cap_n0=4*wn_0*cgate+2*wp_0*cgate; % Intrinsic nmos cap
cap_0=cap_n0+cap_p0;%(4+k_mod*2/A_0)*wn_0*cgate % Initial total capacitance for each stage, intrinsic cap
RL_0=v_sw0/I_tail0; % Initial PMOS load resistance
fmax=1/(8*0.69*RL_0*cap_0); % Initial osc frequency, 4 stage ring oscillator
Power_dc_0=4*vdd*I_tail0; % Initial dc power consumption, 4 stage
x_opt=cload/(cap_n0+cap_p0)/(fmax/fosc-1);
Icurrent=I_tail0*x_opt;
Power_dc_opt=Power_dc_0*x_opt;
cap_tot=cload+x_opt*(cap_n0+cap_p0); % Total output capacitance of each stage
v_swing_opt=v_sw0; % Swing does NOT change
Power_dyn=0;
Power_tot=Power_dc_opt+Power_dyn;
if x_opt<=0 % speed cannot be met with a 4 stage ring osc, change topology
    fmax=1/(4*0.69*RL_0*cap_0); % Initial osc frequency, 2 stage ring oscillator
    Power_dc_0=2*vdd*(4*I_tail0); % Initial dc power consumption, 2 stage
    x_opt=cload/(cap_n0+cap_p0)/(fmax/fosc-1);
    Icurrent=I_tail0*x_opt;
    Power_dc_opt=Power_dc_0*x_opt;
    cap_tot=cload+x_opt*(cap_n0+cap_p0); % Total output capacitance of each stage
    v_swing_opt=v_sw0; % Swing does NOT change
    Power_dyn=0;%0.5*cap_tot*(v_swing_opt)^2*fosc;

    Power_tot=Power_dc_opt+Power_dyn;
    if x_opt<=0
        Power_tot=1e12; Icurrent=1e12;
    end
end

CML adder design for DFE

function [Power_adder,c_in, Icurrent,Hx]=power_sum(fd,w_g,cgate,cload,v0,vm,vdd,kdfe1,kdfe2,Jc,p)
kdfe=abs(kdfe1)+abs(kdfe2);
if kdfe1 == 0 & kdfe2 == 0 % NO DFE case
    alpha=0;
    [Ct, It, Rt, Wt, H]=design_INV_temp(fd,w_g,cgate,cload,v0,vm,vdd,Jc,p,alpha);
    Icurrent=It; c_in=Ct;
    Power_adder_dc=vdd*Icurrent;
end
if kdfe1 ~= 0 & kdfe2 == 0 % 1 tap DFE case
    alpha=1;
    [Ct, It, Rt, Wt, H]=design_INV_temp(fd,w_g,cgate,cload,v0,vm,vdd,Jc,p,alpha);
    Icurrent=It; c_in=Ct;
    Power_adder_dc=vdd*Icurrent;
end
if kdfe1 ~= 0 & kdfe2 ~= 0 % 2 tap DFE case
    alpha=2;
    [Ct, It, Rt, Wt, H]=design_INV_temp(fd,w_g,cgate,cload,v0,vm,vdd,Jc,p,alpha);
    Icurrent=It;%cload/(1/(v_swing*fbw1)-3*p*cgate/(kk/1e-6*(1-abs(kdfe1)-abs(kdfe2))))/(1-abs(kdfe1)-abs(kdfe2));
    Power_adder_dc=vdd*Icurrent; c_in=Ct;
end
if kdfe2 ~= 0 & kdfe1 == 0 % 1 tap DFE case
    alpha=1;
    [Ct, It, Rt, Wt, H]=design_INV_temp(fd,w_g,cgate,cload,v0,vm,vdd,Jc,p,alpha);
    Icurrent=It;%cload/(1/(v_swing*fbw1)-2*p*cgate/(kk/1e-6*(1-abs(kdfe1)-abs(kdfe2))))/(1-abs(kdfe1)-abs(kdfe2));
    c_in=Ct;
end
Power_adder_dc=vdd*Icurrent;
Power_adder=Power_adder_dc*(1+abs(kdfe1)+abs(kdfe2));
Hx=H;

Receiver CDR

function [Power_CDR,Power_ck_loop,Power_ck_buf,Power_filter,c_load_osc_rx]=power_CDR_RX(fd,w_g,cgate,v0,vm,vdd,ci_2to4,ci_1to2,ci_1to1,Jc,p)
% Clock buffer_4
[Power_ck_buf_4,c_buf_4, Icurrent_buf4, Hxx]=power_buf_cml(fd/4,w_g,cgate,ci_2to4,v0,vm,vdd,Jc,p);
% 1/2 to 1/4 frequency divider, including 2 cml latch, working at fb/2
% c_load_div4=c_ck_4+c_buf_4;
c_load_div4=c_buf_4*2;
[Power_latch_div_4, c_in_4, Icurrent_4, c_ck_2,HXX]=power_ff_cml(fd/2,w_g,cgate,c_load_div4,v0,vm,vdd,Jc,p);
Power_div_4=2*Power_latch_div_4;
% Clock buffer_2
[Power_ck_buf_2,c_buf_2, Icurrent_buf2,HXX]=power_buf_cml(fd/2,w_g,cgate,ci_1to2,v0,vm,vdd,Jc,p);
% 1/1 to 1/2 frequency divider, including 2 cml latch, working at fb
c_load_div2=c_ck_2+c_buf_2;
[Power_latch_div_2, c_in_2, Icurrent_2, c_ck_1,hxx]=power_ff_cml(fd,w_g,cgate,c_load_div2,v0,vm,vdd,Jc,p);
Power_div_2=2*Power_latch_div_2;
% Model phase detector for CDR, 3 DFF, but clock driving 2 DFF equivalently
% Power_PD=6*Power_latch_PD;
% ci_1to1_pd=4*c_ck_latch_pd;
% Clock buffer_1
% osc output clock buffer for TX
c_load_oscbuf=c_ck_1+ci_1to1; % 2nd osc buffer cap load
[Power_osc_buf2,c_load_oscbuf2, Icurrent_oscbuf2,hxx]=power_buf_cml(fd,w_g,cgate,c_load_oscbuf,v0,vm,vdd,Jc,p); % 2nd osc buffer
% Phase Interpolator
alpha=3;k=0.7;cload_PI=c_load_oscbuf2;
[Ct, It, Rt, Wt, H]=design_INV_temp(fd,w_g,cgate,cload_PI,v0,vm,vdd,Jc,p,alpha);
Power_PI=4*vdd*It;
c_load_osc_rx=Ct;

Power_ck_buf=Power_ck_buf_2+Power_ck_buf_4+Power_osc_buf2+Power_div_2+Power_div_4; % full rate, half rate, quad rate clock
% I oscbuf
% osc_buf2 + div2 + div4 + div8 + div16 + div32
% Q oscbuf
% calculate RX CDR loop dynamic power
% digital loop filter: 20 bit voter (adder), 1 adder = 20 INV
% Calculate the CDR logic and digital filter
cap_inv=3*w_g*cgate;
cap_tot=(20*20+16*8)*cap_inv;
Power_filter=0.5*cap_tot*(vdd*vdd)*(fd/4);
Power_ck_loop=Power_PI+Power_filter;
Power_CDR=Power_ck_loop+Power_ck_buf;
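
As a rough numerical illustration of the dynamic-power term used for the CDR digital loop filter in the listing above (P = 0.5*C*V^2*f, with the logic clocked at fd/4), the sketch below evaluates it for one set of inputs. The values of w_g, cgate, vdd and fd are hypothetical placeholders chosen only for illustration, and their units (metres of gate width, farads per metre) are an assumption; none of them is taken from the dissertation.

```matlab
% Hypothetical technology and operating values (assumptions for illustration only)
w_g   = 0.2e-6;   % unit gate width [m] (assumed)
cgate = 1e-9;     % gate capacitance per unit width [F/m], i.e. 1 fF/um (assumed)
vdd   = 1.0;      % supply voltage [V] (assumed)
fd    = 10e9;     % data rate [Hz] (assumed)

% Same structure as the loop-filter estimate in power_CDR_RX
cap_inv = 3*w_g*cgate;              % one inverter counted as 3 unit gate widths
n_inv   = 20*20 + 16*8;             % inverter-equivalent count used in the listing
cap_tot = n_inv*cap_inv;            % total switched capacitance [F]
Power_filter = 0.5*cap_tot*(vdd*vdd)*(fd/4);   % switched at fd/4 [W]

fprintf('Power_filter = %.3g W\n', Power_filter);
```

With these assumed inputs the estimate is roughly 0.4 mW; the result scales linearly with the gate width, the data rate and the inverter count, and quadratically with the supply voltage.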

APPENDIX B
MATLAB CODES FOR STATISTICAL LINK ANALYSIS

A statistical analysis is developed in Chapter 2 to estimate the BER performance of an equalized high speed I/O link. The MATLAB code used to estimate the BER and power metrics of an equalized transmitter is shown below as an example. Figure B-1 illustrates the flow chart of the functions used to estimate BER and calculate power dissipation. For given equalization taps and data rate, the pulse response is generated first. After the PDF and CDF of the ISI are calculated, the BER performance and its corresponding power dissipation can be calculated. The MATLAB code for the functions listed in the flow chart is shown below.

Figure B-1. Flow chart of MATLAB functions used to estimate BER

Main function of the statistical TX analysis

function [Power_TX_total,Power_TX_data,I_driver,c1,c2,maxeye,atten_eff]=design_TX(Rb,Nt,wg_min,cgate,Rd,v0,v_swing,maxeye,c_IO,vdd,c0,c1,c2,vrx_sen,Jc,p)
% amplitude max. value and step
amp_max=1.5;amp_s=0.005;
% sweep FFE taps
for n1=1:1:6

for n2=1:1:6
    % generate pulse response
    [h0, t0,norm_amp0,Hnew0,f0,hm0,tm0,atten]=gen_hparam(Rb,Nt,n1,n2,n3);
    h_norm0=h0'.*norm_amp0;
    % jitter effect, Gaussian distribution, mean=0
    delta_rj=0.01;
    % specify # of cursors included
    Npre=15;Npost=20;
    ber_log=-11; % BER threshold
    % calculate cdf at different sampling times
    [amp_norm0,cdf_abs_time0,cdf_log_time0,amp_eye_time0]=cdf_time(h_norm0,delta_rj,amp_max,amp_s,Npre,Npost,Nt,ber_log);
    ts=-1:1/(Nt-1):1;
    % Test whether amp_eye_time0 has a maximum that meets the BER threshold;
    % otherwise there is no need to do the polyfit or to calculate power
    eye_test=max(amp_eye_time0);
    % round cdf_log_time0
    cdf_log_time1=[cdf_log_time0' cdf_log_time0']';
    amp_eye_time1=[amp_eye_time0 amp_eye_time0];
    if eye_test > 0
        % polyfit amp_eye_time0 only when the eye is open; otherwise the polyfit is of no use
        [xx,yy]=size(ts);eye_start=0;eye_end=0;
        amp_eye_end=amp_eye_time0;
        for n=1:1:yy/2
            if amp_eye_end(n)==0
                eye_start=n;
            end
        end
        eye_start=eye_start+1;
        for n=round(yy/2):1:yy
            if amp_eye_end(n)>0
                eye_end=n;
            end
        end
        eye_eff=amp_eye_end(eye_start-1:eye_end+1);
        ts_eff=ts(eye_start-1:eye_end+1);
        p_fit=polyfit(ts_eff,eye_eff,6); % named p_fit so that the input argument p is not overwritten
        f=polyval(p_fit,ts_eff);
        %plot(ts,amp_eye_time0,'o',ts_eff,f,'-');
        eye_f=[amp_eye_end(1:eye_start-2) f amp_eye_end(eye_end+2:end)];
        [eye_max0,max_index0]=max(eye_f);
        t_dev=max_index0-Nt;
        cdf_log0_1Tb(1:Nt,:)=cdf_log_time1(max_index0-(Nt-1)/2:max_index0+(Nt-1)/2,:);
        amp_eye0_1Tb(1:Nt)=amp_eye_time1(max_index0-(Nt-1)/2:max_index0+(Nt-1)/2);
        ts_1Tb=-0.5:1/(Nt-1):0.5;
        eye_max1=amp_eye0_1Tb((Nt+1)/2);
        cdf_log0_cmp2=cdf_log0_1Tb((Nt+1)/2,:);
    else
        eye_max1=0;
    end
    eye_max(n1,n2)=eye_max1;
    katt(n1,n2)=atten;
end
end
% Get the maximum value and corresponding n1, n2
[a1 b1]=max(eye_max);
[a2 b2]=max(max(eye_max));
maxeye=a2;k2=b2;k1=b1(b2);eye_max(k1,k2);atten_eff=katt(k1,k2);
nfir1=11; nfir2=11;nfir3=3;
s1=-0.5/((nfir1-1)/1);
s2=-0.5/((nfir2-1)/1);
s3=-0.25/((nfir3-1)/1);
x1=[0:s1:-0.5];
x2=[0:s2:-0.5];
x3=[0:s3:-0.25];
c1=x1(k1);c2=x2(k2);
% TX power calculation
fd=Rb;
[Power_TX_total,Power_TX_data,I_driver,Wg_retimer,Rload_retimer]=power_TX(fd,wg_min,cgate,Rd,v0,v_swing,maxeye,c_IO,vdd,c0,c1,c2,vrx_sen,Jc,p);

CDF calculation at different sampling time

function [amp_norm,cdf_abs_time,cdf_log_time,amp_eye_time]=cdf_time(h,delta_rj,amp_max,amp_s,Npre,Npost,Nt,ber_mac)
ns=-1*(Nt-1)/2*2:+1*(Nt-1)/2*2; % sampling time: -1UI -> 1UI

ts=-1:1/(Nt-1):1;
for xt=1:1:2*Nt-1
    Nsample=ns(xt);
    [amp_norm,cdf_abs, cdf_log,eye_mac]=tyco_cdf2(h,delta_rj,amp_max,amp_s,Npre,Npost,Nt,Nsample,ber_mac);
    cdf_abs_time(xt,:)=cdf_abs;
    cdf_log_time(xt,:)=cdf_log;
    amp_eye_time(:,xt)=eye_mac';
end

CDF calculation at a given sampling time

function [amp_norm,cdf_abs, cdf_log,eye_mac]=cdf2(h,delta_rj,amp_max,amp_s,Npre,Npost,Nt,Nsample,ber_mac)
% with jitter effect
if delta_rj~=0
    [amp1_norm,x1_norm]=tyco_cdf_pdf_cal_tj(h,delta_rj,amp_max,amp_s,Npre,Npost,Nt,Nsample);
end
% no jitter effect
if delta_rj==0
    [amp1_norm,x1_norm]=tyco_cdf_pdf_cal_notj(h,delta_rj,amp_max,amp_s,Npre,Npost,Nt,Nsample);
end
% 1 here
[amp_norm1,cdf_abs1, cdf_log1]=pdf_cdf(amp1_norm,x1_norm);
% output eye opening from cdf_abs1, amp_norm1
[xn,cdf_size]=size(cdf_abs1);tag_eye=0;
% add ber_mac: matrix of ber_log
[xm,ym]=size(ber_mac);
for mm=1:1:ym % different ber_log value
    ber_log=ber_mac(mm);
    for n=1:1:cdf_size
        if cdf_log1(n)<=ber_log
            tag_eye=n;
        end
    end
    amp_eye=0;
    if tag_eye>0
        amp_eye=amp_norm1(tag_eye);
    end
    eye_mac(mm)=amp_eye;
end
amp_norm=[-amp_norm1(end:-1:2) amp_norm1];
cdf_abs=[cdf_abs1(end:-1:2) cdf_abs1];
cdf_log=[cdf_log1(end:-1:2) cdf_log1];

CDF calculation from PDF

function [amp_norm,cdf_abs, cdf_log]=pdf_cdf(amp,pdf)
cdf_abs_1 = cumsum(0.5.*pdf); % ---> amp1_p1
[xn,x1_p1_size]=size(amp);
tag_1=0;
for n=1:1:x1_p1_size
    if amp(n)<0
        tag_1=n;
    end
end
tag_1=tag_1+1;
cdf_abs=cdf_abs_1(tag_1:end);
cdf_log=log10(cdf_abs);
amp_norm=amp(tag_1:end);

PDF calculation including jitter effect

function [amp1_norm,x1_norm]=pdf_cal_tj(h,delta_rj,amp_max,amp_s,Npre,Npost,Nt,Nsample)
if Nt~=1
    ts=0:1/(Nt-1):1;
    pdfck=(1./(sqrt(2*pi)*delta_rj).*exp(-(ts-0.5).^2./(2*delta_rj^2)));
    pdfck_norm=pdfck./sum(pdfck);
    pj=pdfck_norm;
end
if Nt==1
    pj=1;
end
% find h0
[hmax,tag_hmax]=max(h);
%Npost=floor((N-tag_hmax)/Nt);
%Npre=floor(tag_hmax/Nt);
%Npre=Npre-1*Nt;Npost=Npost-1*Nt;
%%%%%%%%%%%%%%%%%% original coding for different sampling time
% h_mc=h(tag_hmax-(Nt-1)/2+ns(xt):tag_hmax+(Nt-1)/2+ns(xt));
% h_post2=h(tag_hmax-(Nt-1)/2+Nt*2+ns(xt):tag_hmax+(Nt-1)/2+Nt*2+ns(xt));
% h_post1=h(tag_hmax-(Nt-1)/2+Nt*1+ns(xt):tag_hmax+(Nt-1)/2+Nt*1+ns(xt));
% h_pre2=h(tag_hmax-(Nt-1)/2-Nt*2+ns(xt):tag_hmax+(Nt-1)/2-Nt*2+ns(xt));
% h_pre1=h(tag_hmax-(Nt-1)/2-Nt*1+ns(xt):tag_hmax+(Nt-1)/2-Nt*1+ns(xt));
h_mc=h(tag_hmax-(Nt-1)/2+Nsample:tag_hmax+(Nt-1)/2+Nsample);
for n=1:1:Npost+1 % +1 including jitter effect
    h_post(n,:)=h(tag_hmax-(Nt-1)/2+Nt*n+Nsample:tag_hmax+(Nt-1)/2+Nt*n+Nsample);
end
for n=1:1:Npre+1 % +1 including jitter effect
    h_pre(n,:)=h(tag_hmax-(Nt-1)/2-Nt*n+Nsample:tag_hmax+(Nt-1)/2-Nt*n+Nsample);
end
h_pre_rev=h_pre(Npre+1:-1:1,:);
h_matrix=[h_pre_rev;h_mc;h_post];
[Nh,Nx]=size(h_matrix);
hmpre_matrix=h_matrix(Npre+2:-1:1,:);
hmpost_matrix=h_matrix(Npre+2:Nh,:);
[amp0_shrink_pre,x0_shrink_pre, amp1_shrink_pre,x1_shrink_pre]=casper_pdf_pre_cal(hmpre_matrix,pj,amp_max,amp_s);
[amp0_shrink_post,x0_shrink_post, amp1_shrink_post,x1_shrink_post]=casper_pdf_post_cal(hmpost_matrix,pj,amp_max,amp_s);
Hmain=sum(h_mc);pmain=1;
[amp0_tmp, x0]=conv_isi2(amp0_shrink_pre,amp0_shrink_post,x0_shrink_pre,x0_shrink_post);
amp0=-1*Hmain+amp0_tmp;
[amp1_tmp, x1]=conv_isi2(amp1_shrink_pre,amp1_shrink_post,x1_shrink_pre,x1_shrink_post);
amp1=+1*Hmain+amp1_tmp;
[amp0_norm,x0_norm]=pdf_norm2(amp0,x0,amp_max,amp_s);
[amp1_norm,x1_norm]=pdf_norm2(amp1,x1,amp_max,amp_s);

PDF calculation of pre-cursors including jitter effect

function [amp0_shrink,x0_shrink, amp1_shrink,x1_shrink]=pdf_pre_cal(hmp_matrix,pj,amp_max,amp_s)
[xp,Npj]=size(pj);
npj_dev=(Npj-1)/2;

[Nmp_total,Nt]=size(hmp_matrix);Nmp=Nmp_total-1;
Hpre=zeros(Nmp,1);
Nisi=2*amp_max/amp_s+1;
isi_00_norm=zeros(Nmp,Nisi); pdf_00_norm=zeros(Nmp,Nisi);
isi_01_norm=zeros(Nmp,Nisi); pdf_01_norm=zeros(Nmp,Nisi);
isi_10_norm=zeros(Nmp,Nisi); pdf_10_norm=zeros(Nmp,Nisi);
isi_11_norm=zeros(Nmp,Nisi); pdf_11_norm=zeros(Nmp,Nisi);
for n=1:1:Nmp
    Hpre(n)=sum(hmp_matrix(n,:));
    Hpre(1)=0; % assume post cal including main tap
    h_pre_00=-1.*[Hpre(n)];
    p_pre_00=1;
    [h_pre_00_norm,p_pre_00_norm]=pdf_norm2(h_pre_00,p_pre_00,amp_max,amp_s);
    isi_00_norm(n,:)=h_pre_00_norm;
    pdf_00_norm(n,:)=p_pre_00_norm;
    h_pre_11=+1.*[Hpre(n)];
    p_pre_11=1;
    [h_pre_11_norm,p_pre_11_norm]=pdf_norm2(h_pre_11,p_pre_11,amp_max,amp_s);
    isi_11_norm(n,:)=h_pre_11_norm;
    pdf_11_norm(n,:)=p_pre_11_norm;
    xamp_jitter_lead=0;xamp_jitter_lag=0;
    for t=0:1:npj_dev-1 %tail_isi_jitter 1
        isi_amp_lag(t+1)=xamp_jitter_lag+hmp_matrix(n,Nt-t);
        xamp_jitter_lag=isi_amp_lag(t+1);
        isi_p_lag(t+1)=pj(npj_dev-t);
        isi_amp_lead(t+1)=xamp_jitter_lead+hmp_matrix(n+1,1+t);
        xamp_jitter_lead=isi_amp_lead(t+1);
        isi_p_lead(t+1)=pj(Npj+1-(npj_dev-t));
    end
    if Nt==1
        isi_amp_lag=0;isi_amp_lead=0;
        isi_p_lag=0;isi_p_lead=0;
    end
    if pj==1
        isi_amp_lag=0;isi_amp_lead=0;
        isi_p_lag=0;isi_p_lead=0;
    end
    h_pre_01_lag=+1.*Hpre(n)-2.*isi_amp_lag;
    p_pre_01_lag=isi_p_lag;
    h_pre_01_lead=+1.*Hpre(n)-2.*isi_amp_lead;
    p_pre_01_lead=isi_p_lead;
    h_pre_01_nodev=+1.*[Hpre(n)];
    p_pre_01_nodev=pj(npj_dev+1);
    h_pre_01_tmp=[h_pre_01_lead,h_pre_01_nodev,h_pre_01_lag];
    p_pre_01_tmp=[p_pre_01_lead,p_pre_01_nodev,p_pre_01_lag];
    h_pre_10_tmp=-1.*h_pre_01_tmp;
    p_pre_10_tmp=p_pre_01_tmp;
    [h_pre_01_norm,p_pre_01_norm]=pdf_norm2(h_pre_01_tmp,p_pre_01_tmp,amp_max,amp_s);
    [h_pre_10_norm,p_pre_10_norm]=pdf_norm2(h_pre_10_tmp,p_pre_10_tmp,amp_max,amp_s);
    isi_01_norm(n,:)=h_pre_01_norm;
    pdf_01_norm(n,:)=p_pre_01_norm;
    isi_10_norm(n,:)=h_pre_10_norm;
    pdf_10_norm(n,:)=p_pre_10_norm;
end
% conv each tap
isi_amp=-amp_max:amp_s:amp_max;
% The last post cursor pdf
x0_tmp=0.5.*(pdf_00_norm(Nmp,:)+pdf_10_norm(Nmp,:)); % --> pdf corresponding to [-amp_max,amp_max] %isi_00_norm(n_rev,:)
x1_tmp=0.5.*(pdf_11_norm(Nmp,:)+pdf_01_norm(Nmp,:));
[amp0_shrink x0_shrink]=pdf_shrink(isi_amp, x0_tmp,Nisi);
[amp1_shrink x1_shrink]=pdf_shrink(isi_amp, x1_tmp,Nisi);
for n=1:1:Nmp-1
    % shrink the normalized distribution, get rid of pdf=0 values
    n_rev=Nmp-n; % The cursor before the last cursor
    [isi_00_shrink, pdf_00_shrink]=pdf_shrink(isi_00_norm(n_rev,:), pdf_00_norm(n_rev,:),Nisi);
    [isi_01_shrink, pdf_01_shrink]=pdf_shrink(isi_01_norm(n_rev,:), pdf_01_norm(n_rev,:),Nisi);
    [isi_10_shrink, pdf_10_shrink]=pdf_shrink(isi_10_norm(n_rev,:), pdf_10_norm(n_rev,:),Nisi);
    [isi_11_shrink, pdf_11_shrink]=pdf_shrink(isi_11_norm(n_rev,:), pdf_11_norm(n_rev,:),Nisi);
    % Do conv conv(x0,pdf00) --> 00
    % Do conv conv(x0,pdf01) --> 01
    % Do conv conv(x1,pdf10) --> 10
    % Do conv conv(x1,pdf11) --> 11
    [amp00_tmp, pdf00_tmp]=conv_isi2(amp0_shrink,isi_00_shrink,x0_shrink,pdf_00_shrink);
    [amp01_tmp, pdf01_tmp]=conv_isi2(amp0_shrink,isi_01_shrink,x0_shrink,pdf_01_shrink);
    [amp10_tmp, pdf10_tmp]=conv_isi2(amp1_shrink,isi_10_shrink,x1_shrink,pdf_10_shrink);
    [amp11_tmp, pdf11_tmp]=conv_isi2(amp1_shrink,isi_11_shrink,x1_shrink,pdf_11_shrink);

    % Norm to get x0,x1
    [amp00_norm,pdf00_norm]=pdf_norm2(amp00_tmp,pdf00_tmp,amp_max,amp_s);
    [amp01_norm,pdf01_norm]=pdf_norm2(amp01_tmp,pdf01_tmp,amp_max,amp_s);
    [amp10_norm,pdf10_norm]=pdf_norm2(amp10_tmp,pdf10_tmp,amp_max,amp_s);
    [amp11_norm,pdf11_norm]=pdf_norm2(amp11_tmp,pdf11_tmp,amp_max,amp_s);
    x0_tmp=0.5.*(pdf00_norm+pdf10_norm);
    x1_tmp=0.5.*(pdf11_norm+pdf01_norm);
    % shrink x0_tmp and x1_tmp, isi_amp
    [amp0_shrink x0_shrink]=pdf_shrink(isi_amp, x0_tmp,Nisi);
    [amp1_shrink x1_shrink]=pdf_shrink(isi_amp, x1_tmp,Nisi);
end

Convolution of two ISI vectors

function [isi_amp, isi_p]=conv_isi2(a,b,pa,pb)
% conv vector a and vector b, row vectors
[atp,na]=size(a);
[btp,nb]=size(b);
samp=zeros(na,nb);tp=samp;
for n=1:1:na
    samp(n,:)=a(n)+b;
    tp(n,:)=pa(n).*pb;
end
y_amp=reshape(samp,[1,na*nb]);
y_p=reshape(tp,[1,na*nb]);
isi_amp=unique(y_amp);
isi_p=zeros(size(isi_amp));
[xtp,isi_size]=size(isi_amp);
for m=1:1:isi_size
    for n=1:1:na*nb
        if isi_amp(m)==y_amp(n)
            isi_p(m)=isi_p(m)+y_p(n);
        end
    end
end
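
The nested loops in conv_isi2 combine two discrete ISI distributions: every pair of amplitudes is summed, the matching probabilities are multiplied, and entries that share an amplitude are merged. The sketch below expresses the same idea in vectorized form using the built-in unique and accumarray functions. It assumes row-vector inputs whose amplitudes lie on an exactly representable grid, so that exact equality is meaningful, as it is for the `==` test in the listing. The input values are hypothetical, and this is an illustrative alternative rather than the code used in the dissertation.

```matlab
% Two hypothetical ISI distributions (amplitude, probability); values are illustrative only
a  = [-0.25 0.25];  pa = [0.5 0.5];
b  = [-0.25 0.25];  pb = [0.5 0.5];

% All pairwise amplitude sums and probability products
[A, B]   = ndgrid(a, b);
[PA, PB] = ndgrid(pa, pb);
y_amp = A(:) + B(:);
y_p   = PA(:) .* PB(:);

% Merge entries that share the same amplitude (exact match, as in conv_isi2)
[isi_amp, ~, idx] = unique(y_amp);
isi_p = accumarray(idx(:), y_p);

% Return to row vectors, matching the listing's convention
isi_amp = isi_amp.';
isi_p   = isi_p.';

disp([isi_amp; isi_p]);
```

For these inputs the merged distribution has three amplitudes (-0.5, 0 and 0.5), and the probabilities still sum to one. The number of pairwise terms grows as the product of the two input lengths, which is why the listing shrinks each PDF before convolving.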

PAGE 161

161 LIST OF REFERENCES [1] Semiconductor Ind. Assoc., The International Technology Roadmap for Semiconductors [Online]. Available: http://public.itrs.net. [2] R. Palmer, CMOS for Serial Chip to IEEE Int. Solid State Circuit Conf. Dig. Tech. Papers pp. 440 441, Feb. 2007. [3] Frank O Mahony, James E. Jaussi, et.al, A 47 10Gb/s 1. 4m W/Gb/s Parallel Interface in 45 nm CMOS IEEE J. Solid State Circuits vol. 4 5 no. 12 pp. 2828 2837 Dec. 20 10 [4] K. Hu, T. Jiang, J. Wang, F. O Mahony, and P. Y. Chiang, A 0.6mW/Gbps, 6.4 8.0 Gbps serial link receiver using local injection locked ring oscillators in 90nm CMOS, in Proc. Symp. VLSI Circuits Jun. 2009, pp. 46 47. [5] M. J.E. Lee, W.Dally, and P. Chiang, Low power area efficient high speed I/O circuit techniques, IEEE J. Solid State Circuits vol. 35 pp. 1591 1 599 Nov ., 200 0 [6] Ganesh Bala IEEE. Trans. on Advanced Pack aging vol. 32, no. 2, pp. 237 247, May 2009. [7] T. Dickson, R. Beerkens, and S. Voinige V 45 Gb/s decision circuit using IEEE J. Solid State Circuits vol. 40, no.4, pp. 994 1003, Apr., 2005. [8] G. Wei, J. Kim, D. Liu, S. Sidiropoulos and M. Horowitz, A variable frequency parallel I/O Interface with adaptive powe r supply regulation, IEEE J. Solid State Circuits vol. 35 no. 11, pp. 1600 1 610 Nov ., 200 0 [9] IEEE Proceedings of Custom Integrated Circuits Conference pp. 589 594, Sep. 2003. [10] W. F an, A. Lu, L. Wai, and B. Lok, Mixed mode s parameter characterization of differential structures, in Electronics Packaging Technology 2003 5 th Conference, pp. 533 537, EPTC 2003, IEEE, Dec 2003. [11] K. Kurokawa, Power waves and the scattering matrix, IEE E Transactions on Microwave Theory and Techniques vol. MTT 13, pp. 194 202, March 1965. [12] DesignCon 2004. [13] B. Casper, M. Haycock, and multi Gb/s chip to IEEE VLSI Circuits Symp. Tech. Papers pp. 54 57, Jun. 2002.

PAGE 162

162 [14] B. Ahmad, Performance specification of interconnect, presented at the DesignCon, Santa Clar a, CA, Feb. 2003. [15] Kyung Suk (Dan) Oh, Frank Lambrecht, Sam Chang, Qi Lin, Jihong Ren, Chuck Yuan, Jared Zerbe, and Vladimir Stonanovic, Accurate System Voltage and Timing Margin Simulation in High Speed I/O System Designs, IEEE Trans. on Advanced Packagi ng vol. 31 no.4, pp. 722 7 3 0 Nov ., 200 8 [16] K. Lee, H. K. Jung, H. J. Chi, H. J. Kwon, J. Y. Sim, H. J. Park, Serpentine Microstrip Lines With Zero Far End Crosstalk for Parallel High Speed DRAM Interfaces, IEEE Trans on Advanced Packaging vol. 33 no. 2 pp. 552 558 May 20 10 [17] Prentice Hall April 1993. [18] T. H. Kim, et.al, Signal transient and crosstalk model of capacitively and inductively coupled VLSI interconnec t lines, J. S emiconductor Technol. Sci., vol. 7 pp. 260 200 7. [19] S. K. Lee, et.al, FEXT eliminated stub alternated microstrip line for multi gigabit/second parallel links, Electron. Lett. vol.44, pp.272, 2008. [20] L. Zhi, W. Qiang and S. Changsheng, Applic ation of guard traces with vias in the RF PCB layout, Proc. 3 rd Int. Symp. Electromagn. Cornpatibil ., pp. 771, 2002. [21] Carl Werner, Claus Hoyer, et.al, Modeling, Simulation, and Design of a Multi Mode 2 10Gb/sec Fully Adaptive Serial Link System, IEEE CIC C, pp. 709 716 2005 [22] R. Kollipara, G J Yeh, B. Chia, A. Agarwal, Design, modeling and characterization of high speed backplane interconnects, DesignCon 2003. [23] W. J. Dally, et.al, Transmitter Equalization for 4 Gbps Signaling, IEEE Micro Vol. 17, No. 1 Jan. Feb, 1997, pp.48 56. [24] IEEE J. Solid State Circuits vol. 32, No. 4, pp. 514 520, April 1997. [25] A. Fiedler, et.al, A 1.0625 Gbps transceiver with 2 oversampling and tran smit signal pre emphasis, IEEE Int. Solid State Circuits Conf. Dig. Tech. Papers pp. 238 239, 1997. [26] T. Dickson, K. Yau, T. Chalvatzis, A. Mangan, E. Laskin, R. Beerkens, P.Westergaard, M. Tazlauanu, M. Yang, and S. P. 
Voinigescu, The invariance of chara cteristic current densities in nanoscale MOSFETs and its impact on algorithmic design methodologies and design prorting of Si (Ge) (Bi)CMOS high speed building blocks, IEEE J. Solid State Circuits vol. 41, pp. 1830 1845, Aug. 2006. [27] Y. Hu and R. Bashirull ah, A Current Density Centric Logical Effort Delay Model for High Speed Current Mode Logic Circuits, submitted to Electronics Letters

PAGE 163

163 [28] Gergory M. Yeric, Al F. Tasch, and Sanjay K. Banerjee, A Universal MOSFET Mobility Degradation Model for Circuit Simul ation, IEEE Trans. on Computer Aided Design vol. 9, no. 10, October, 1990. [29] High Performance at Low IBM Journal of Research and Development vol. 35, No. 3, pp. 313, 1991. [30] Takayasu Sakur Power Law MOSFET Model and its IEEE J. of Solid State Circuits vol. 25, no.2, pp. 584 594, April 1990. [31] O x ford University Press USA, Aug. 2007. [32] Ivan Sutherland, Bob Sproull, and David Harris Circuits Morgan Kaufmann Publishers 1999 [33] Benton H. Calhoun, Alice Wang, and Anantha Chandrakasan Modeling and Sizing for Min imum Energy Operation in Subthreshold Circuits IEEE J. Solid State Circuits vol. 40, no. 9 pp. 1778 1 786 Sept. 2005. [34] T. Dickson, R. Beerkens, and S. Voinigescu, A 2.5 V 45 Gb/s decision circuit using SiGe BiCMOS logic, IEEE J. Solid State Circuits vol. 40, no. 4, pp. 994 1003, April, 2005. [35] Y. Hu, R. Bashirullah, et.al, A Current Density Centric Logical Effort Model for High Speed CML Gates, submitted to IEEE Trans. o n Circuits and System I: Regular Papers [36] Myeong Eun Hwang, Seong Ook Jung, and Ka ushik Roy, Slope Interconnect Effect: Gate Interconnect Interdependent Delay Modeling for Early CMOS Circuit Simulation, IEEE. Trans. on Circuit and System I: Regular Papers vol. 56, no. 7, July, 2009. [37] A. Kabbani D. Al Khalili, and A. J. Al Khalili, Delay analysis of CMOS gates using modified logical effort model, IEEE Trans. Comput. Aided Design Integr. Circuits Syst ., vol. 24, no. 6, pp. 937 947. Jun. 2005. [38] E. G. Friedman, J. H. Mulligan, Jr, Ramp input response of RC tree networks, Analog Integ rated Circuits and Signal Processing, vol. 14, no. pp. 53 58, Sept. 1997. [39] R. Mita, G. Palumbo, and M. Poli, Propagation delay of an RC chain with a ramp input, IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 54, no. 1, pp. 66 70, Jan. 2007. 
[40] Utku Seck in, and Chih Kong Yang A Comprehensive Delay Model for CMOS CML Circuits IEEE Trans. On Circuits and System I: Regular Papers pp. 2608 2618 Oct 200 8

PAGE 164

[41] Payam Heydari and Ravindran Mohanavelu, Design of Ultra High-Speed Low-Voltage CMOS CML Buffers and Latches, IEEE Trans. on VLSI Systems, vol. 12, no. 10, pp. 1081-1093, Oct. 2004.
[42] N. Hedenstierna and K. O. Jeppson, CMOS circuit speed and buffer optimization, IEEE Trans. Computer-Aided Design, vol. CAD-6, pp. 270-281, Mar. 1987.
[43] Neil Weste and David Harris, CMOS VLSI Design: A Circuits and Systems Perspective (3rd Edition), Addison-Wesley, May 2004.
[44] J. J. O'Reilly, Series-parallel generation of m-sequences, The Radio and Electronic Engineer, vol. 45, pp. 171-176, Apr. 1975.
[45] A. N. Van Luyn, Shift register connections for delayed versions of m-sequences, Electron. Lett., vol. 14, pp. 713-715, Oct. 1978.
[46] F. Sinnesbichler, A. Ebberg, A. Felder, and R. Weigel, Generation of high-speed pseudorandom sequences using multiplex techniques, IEEE Trans. Microwave Theory Tech., vol. 44, no. 12, pp. 2738-2742, Dec. 1996.
[47] Ekaterina Laskin and Sorin P. Voinigescu, A 60 mW per Lane, 4 × 23-Gb/s 2^7-1 PRBS Generator, IEEE J. Solid-State Circuits, vol. 41, no. 10, pp. 2198-2208, Oct. 2006.
[48] Y. Hu et al., A current-density centric logical effort based design methodology for high speed IO links, TECHCON, Sept. 2010.
[49] Hamid Hatamkhani and Chih Speed I/O IEEE VLSI Circuits Symp. Tech. Papers, Jun. 2004.
[50] Lesley Anne Polka, Challenge for Tera Intel Technology Journal, pp. 197-206, 2007.
[51] J. Xue et al., Evaluation of Manufacturing Assembly Process Impact on Long Term Reliability of a High Performance ASIC using Flip Chip HyperBGA Package, Proc. 53rd Electronic Components and Technology Conference, pp. 359-364, May 2003.
[52] Packages with High Electronic Components and Technology Conference, pp. 187-193, 2006.
[53] Performance PCBGA Based on Ultra Thin Packaging NEC J. of Adv. Tech., pp. 222-228, Summer 2005.
[54] Y. Hu, J. Chen, and R. Bashirullah, Performance Analysis of High Speed IOs for Systems with Large Aggregate Bandwidth Requirements, submitted to IEEE Trans. on Very Large Scale Integration (VLSI) Systems.

[55] Digital PLL and Transmitter for IEEE J. Solid-State Circuits, vol. 40, no. 12, pp. 2469-2482, Dec. 2005.
[56] IEEE J. Solid-State Circuits, vol. 41, no. 8, pp. 1867-1875, Aug. 2006.
[57] R. B. Staszewski, D. Leipold, C based frequency IEEE Proc. of 2004 IEEE Radio Frequency Integrated Circuits Symp., pp. 215-218, Jun. 2004.
[58] Behzad Razavi, Design of Integrated Circuits for Optical Communications, 1st ed., McGraw-Hill, 2003.
[59] PAM Parallel Bus Interface with Transmit IEEE Int. Solid-State Circuits Conference Digest, vol. 2, pp. 66-67, Feb. 2001.
[60] Cambridge Univ. Press, 1998.
[61] Dependent Jitter in Serial IEEE Trans. on Microwave Theory and Techniques, vol. 53, no. 11, pp. 3388-3397, Nov. 2005.
[62] Young Soo Sohn, Jeong Cheol Lee, Hong June Park, and Soo Equations on Electrical Parameters of Coupled Microstrip Lines for Crosstalk Estimation in Printed Circuit Board IEEE Trans. on Advanced Packaging, vol. 24, no. 4, pp. 521-527, Nov. 2001.
[63] Wiley Press, 1997.
[64] IEEE J. Solid-State Circuits, vol. 41, no. 3, pp. 621-632, Mar. 2006.
[65] Y. Hu, J. Chen, M. Lamson, and R. Bashirullah, An Active Crosstalk Reduction Technique for Parallel High Speed Links in Low Cost Wirebond BGA Packages, IEEE Electrical Performance of Electronic Packaging, pp. 27-29, Oct. 2008.
[66] J. F. Bulzacchelli et al., A 10 Gb/s 5 Tap DFE/4 Tap FFE transceiver in 90 nm CMOS technology, IEEE J. Solid-State Circuits, vol. 41, pp. 2885-2990, Dec. 2006.
[67] IEEE J. Solid-State Circuits, vol. 38, no. 12, pp. 2094-2100, Dec. 2003.
[68] Chan Noise, 900 MHz VCO in 0.6 IEEE J. Solid-State Circuits, vol. 34, no. 5, pp. 586-591, May 1999.

[69] GHz Voltage Controlled Ring Oscillator in 0.18 IEEE J. Solid-State Circuits, vol. 39, no. 1, pp. 230-233, Jan. 2004.
[70] W.-Z. Chen, C. K. Kuo, and C. C. Liu, A CMOS 10 Gb/s SONET transceiver, ESSCIRC '03 Conference, pp. 361-364, Sept. 2003.
[71] N. H. E. Weste and K. Eshraghian, 2nd ed., Reading, MA: Addison-Wesley, 1993.
[72] European Journal of Scientific Research, vol. 33, no. 2, pp. 261-269, 2009.
[73] Pierre Favrat, Philippe Deval, and Mic Efficiency CMOS IEEE J. Solid-State Circuits, vol. 33, no. 3, pp. 410-416, Mar. 1998.
[74] Chih V 5.5 GHz CMOS Phase IEEE J. Solid-State Circuits, vol. 37, no. 4, pp. 521-525, Apr. 2002.
[75] GHz 8 Modulus Prescaler and a 20 GHz Phase Locked Loop Fabricated in 130 IEEE J. Solid-State Circuits, vol. 42, no. 6, pp. 1240-1249, June 2007.
[76] IEEE Electrical Performance of Electronic Packaging Conference, pp. 15-18, Sep. 2007.
[77] Koji Fukuda, Hiroki Yamashita, et al., A 12.3 mW 12.5 Gb/s Complete Transceiver in 65 nm CMOS Process, IEEE J. Solid-State Circuits, vol. 45, no. 12, pp. 2838-2894, Apr. 2005.
[78] Samuel Palermo, Azita Emami IEEE J. Solid-State Circuits, vol. 43, no. 5, pp. 1235-1246, May 2008.
[79] Liang Zhang, John M. Wilson, Rizwan Bashirullah, Lei Luo, Jian Xu, and Paul D. Mode Driver Preemphasis Technique for On IEEE Trans. on Very Large Scale Integration (VLSI) Systems, vol. 15, no. 2, pp. 231-236, Feb. 2007.
[80] Stephen I. Long and J Mode 1.2 Gb/s IEEE J. Solid-State Circuits, vol. 32, no. 6, pp. 890-897, June 1997.
[81] A. Emami-Neyestanak, A. Varzaghani, J. Bulzacchelli, A. Rylyakov, C. K. Yang, D. Power Receiver with Switch IEEE Symp. VLSI Circuits, pp. 322-325, June 2006.

[82] D. Turker, A. Rylyakov, D. Friedman, S. Gowda, E. Sanchez 38mW 1 IEEE Symp. VLSI Circuits, pp. 216-217, June 2009.
[83] Byungsub Kim, Yong Liu, Timothy O. Dickson, John F. Bulzacchelli, and Daniel J. Gb/s Compact Low Power Serial I/O With DFE IIR Equalization in 65 IEEE J. Solid-State Circuits, vol. 44, no. 12, pp. 3526-3538, Dec. 2009.
[84] Channel Fabrication for Microelectromechanical Systems Via Sacrificial Photosensitive J. of Microelectromechanical Systems, pp. 147-159, 2003.

BIOGRAPHICAL SKETCH

Yan Hu was born in Yangzhou, Jiangsu Province, China. She received the Bachelor of Engineering and Master of Science degrees in electronic engineering from Southeast University, Nanjing, China, in July 2001 and May 2004, respectively. She received the Master of Science degree in electrical and computer engineering from the University of Florida, Gainesville, Florida, in May 2007 and the Ph.D. degree from the same department in May 2011. Since 2005, she has been with the Integrated Circuits Research (ICR) Lab, ECE Department, University of Florida. Her research focuses on CMOS high-speed I/O link design.