Characterizing, Modeling and Mitigating Microarchitecture Vulnerability and Variability in Light of Small-Scale Processing Technology

Permanent Link: http://ufdc.ufl.edu/UFE0024945/00001

Material Information

Title: Characterizing, Modeling and Mitigating Microarchitecture Vulnerability and Variability in Light of Small-Scale Processing Technology
Physical Description: 1 online resource (239 p.)
Language: english
Creator: Fu, Xin
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2009

Subjects

Subjects / Keywords: nbti, noc, pv, ser
Electrical and Computer Engineering -- Dissertations, Academic -- UF
Genre: Electrical and Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: The rapidly increased soft error rate (SER) due to scaled processing technology is one of the critical reliability concerns in current processor design. In order to characterize and mitigate the microarchitecture soft-error vulnerability in modern superscalar and multithreaded processors, I developed Sim-SODA (Software Dependability Analysis), a unified framework for estimating microprocessor reliability in the presence of soft errors at the architectural level. By using Sim-SODA, I observed that a single performance metric is not a good indicator of program vulnerability; on the other hand, a combination of several performance metrics can predict the architecture-level soft-error vulnerability well. Based on the observation that the issue queue (IQ) is a reliability hot-spot on Simultaneous Multithreaded (SMT) processors, I proposed VISA (Vulnerable InStruction Aware) issue and ORBIT (Operand Readiness Based InsTruction) dispatch to improve IQ reliability. I further combined circuit and microarchitecture techniques for soft error robustness on SMT processors to leverage the advantages of the two levels' techniques while overcoming the disadvantages of both. Results show that my proposed techniques can substantially improve IQ reliability with negligible performance penalty. As one of the nano-scale design challenges, process variation (PV) significantly affects chip performance and power. I characterized the microarchitecture soft error vulnerability in the presence of PV, and proposed two techniques that work at fine-grain and coarse-grain levels to mitigate the impact of PV mitigation techniques on reliability and maintain optimal vulnerability, performance, and power trade-offs. Negative Bias Temperature Instability (NBTI) has become another important reliability concern as processing technology scales down. Observing that PV has both positive and negative effects on circuits, I took advantage of the positive effects in NBTI tolerant microarchitecture design to efficiently mitigate the detrimental impact of PV and NBTI simultaneously. The trend towards multi-/many-core design has made network-on-chip (NoC) a crucial hardware component of future microprocessors. I proposed several techniques that hierarchically mitigate the PV and NBTI effects on NoC while leveraging their benign interplay.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Xin Fu.
Thesis: Thesis (Ph.D.)--University of Florida, 2009.
Local: Adviser: Li, Tao.
Local: Co-adviser: Fortes, Jose A.

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2009
System ID: UFE0024945:00001

Full Text

CHARACTERIZING, MODELING AND MITIGATING MICROARCHITECTURE VULNERABILITY AND VARIABILITY IN LIGHT OF SMALL-SCALE PROCESSING TECHNOLOGY

By

XIN FU

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2009

© 2009 Xin Fu

To my mother and father, and my fiancé Zongyu

ACKNOWLEDGMENTS

Without the support of many, I would not have successfully journeyed through this beginning. My first, and most earnest, acknowledgment must go to my advisor, Dr. Tao Li, for his encouragement, advice, mentoring, and research financial support. His tough and incisive questions and his unending drive and dedication to the field of computer architecture never cease to amaze me. Not only was he readily available for me, as he so generously is for all of his students, but he always read and responded to the drafts of my work more quickly than I could have hoped. In every sense, none of this work would have been possible without him.

I would like to gratefully acknowledge my co-advisor, Dr. José A. B. Fortes, for his advice and support. José has been instrumental in ensuring my academic, professional, financial, and moral well-being ever since. His patience and encouragement always inspire me to explore and challenge further. Many thanks also go to committee members Dr. Renato J. Figueiredo and Dr. Jih-Kwon Peir for their valuable comments, productive suggestions, and the time spent reading the draft of my thesis.

I would like to thank the students (past and current) at the Intelligent Design and Efficient Architectures Laboratory (IDEAL) and the Advanced Computing and Information Systems (ACIS) Laboratory. I especially want to thank (in alphabetical order) Chang-Burm Cho, Clay Hughes, James Michael Poe II, and Wangyuan Zhang for helping me set up the simulation environments used in this research, sharing their technical wisdom and research ideas, and providing valuable comments on drafts of my paper submissions and useful feedback at practice talks. I also thank Prapaporn Rattanatamrong, David Wolinsky, Liping Zhu, Ming Zhao, Jing Xu, Yuchu Tong, Jiangyan Xu, Jian Zhang, Girish Venkatasubramanian, Pierre Tony St. Juste, Selvi Kadirvel, Priya Chaman Bhat, Erin Taylor, Maurício Tsugawa, Andréa Matsunaga, and Yonggang Liu for their friendships and their collective encouragement to finish this dissertation.

A penultimate thank-you goes to my friends (Qiong Zhou, Haoyang Zhuang, Jian Sun) for their help and support throughout the completion of this degree. I express my gratitude to the High Performance Computing (HPC) Center at the University of Florida for providing me the simulation resources. Thanks to Dina, Cathy, Shannon, Julie, and the other administrative assistants who have worked in Computer Engineering in the past years. During the course of my research, I have submitted several papers to peer-reviewed conferences. The anonymous reviewers have provided valuable insights, pointers to literature, and criticisms that I have used to make my research stronger.

My final, and most heartfelt, acknowledgment must go to my wonderful parents, Yafei Wang and Zhongyang Fu, my fiancé Zongyu Zhai, and my family (my grandfather, uncles and aunts), for their love, support, and encouragement. They are always there when I need them most, and they deserve far more credit than I can ever give them. This dissertation is dedicated to them.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
   Processor Vulnerability and Variability
      Soft Errors
      Negative Biased Temperature Instability
      Process Variation
   Contributions
   Dissertation Organization

2 SIM-SODA: SOFTWARE DEPENDABILITY ANALYSIS
   Related Work
   Microarchitecture Soft-Error Vulnerability Estimation
   The Sim-SODA Reliability Estimation Framework
      Overview
      Fine-Grained Reliability Estimation
         Instruction window
         Trivial instruction
      Reliability Estimation of Unexplored Structures
         Hybrid AVF computation for register files
         ROB AVF computation
         Victim buffer's AVF computation
   Experimental Setup and Results
      Experimental Setup
      Experimental Results
         Program vulnerability profile at instruction level
         AVF of major microarchitecture structures

3 SOFT ERROR VULNERABILITY PHASE CHARACTERIZATION
   Characterizing Run-Time Microarchitecture Vulnerability to Soft Errors
      Time Varying Behavior of Microarchitecture Vulnerability
      How Correlated Is AVF to Performance Metrics
   Program Reliability Phase Classification
      Basic Block Vectors vs. Performance Monitoring Counters
      The Effectiveness of Hybrid Schemes
      Sensitivity Analysis

4 ISSUE QUEUE RELIABILITY OPTIMIZATION
   Vulnerable InStruction Aware (VISA) Issue
      Vulnerable InStruction Aware (VISA) Issue
      Exploring VISA Issue Based Optimizations
         Optimization 1: dynamic IQ resource allocation
         Optimization 2: handling L2 cache misses
      Experimental Setup
      Evaluation
   Operand Readiness Based Instruction Dispatch (ORBIT)
      IQ Soft-Error Mitigation Through Instruction Dispatch
      Combine ORBIT with Prediction
      Evaluation
         Reliability and performance impact
         A comparison with different fetch policies
         Reliability impact on the entire processor core
         Discussion
   Related Work

5 COMBINED CIRCUIT AND MICROARCHITECTURE TECHNIQUES FOR SOFT ERROR ROBUSTNESS
   Background
      Soft Error Robust SRAM (rSRAM)
      Voltage Scaling for SRAM Soft Error Robustness
   Combined Circuit and Microarchitecture Techniques
      Radiation Hardening IQ Design Using Hybrid Techniques
      The RIQ Design and Alternative
      Using Dual-VDD to Improve ROB Reliability
   Experimental Setup and Evaluation Results
      Experimental Setup
      Evaluation
         Effectiveness of rSRAM based IQ design
         Sensitivity analysis on criticality threshold and RIQ size
         Effectiveness of dual-VDD in ROB SER robustness
         Putting them together
   Related Work

6 CHARACTERIZING AND MITIGATING SOFT-ERROR VULNERABILITY IN THE PRESENCE OF PROCESS VARIATION
   Experimental Methodology
      Process Variation Modeling
      Impact on Circuit-Level Soft Error Vulnerability
      Microarchitecture-Level Soft Error Vulnerability Analysis
   Microarchitecture-Level Soft Error Vulnerability Characterization Under Process Variations
      FIT Variation Across the Analysis Cells
      Microarchitecture Soft Error Vulnerability Under Process Variation Mitigation Techniques
         The effect of variable-latency register file (VL-RF)
         The effect of dynamic fine-grained body biasing (DFGBB)
   Reliability-Aware Process Variation Mitigation
      Entry-Level Granularity Vulnerability Mitigation
      Structure-Level Granularity Vulnerability Mitigation
   Evaluation
      Experimental Methodology
      Effectiveness of Entry-BVM
      Effectiveness of Structure-BVM
      Combining Entry-BVM and Structure-BVM
      Overhead of Entry-BVM and Structure-BVM
   Related Work

7 NBTI TOLERANT MICROARCHITECTURE DESIGN IN THE PRESENCE OF PROCESS VARIATION
   Background
      Negative Bias Temperature Instability (NBTI)
      The Interplay Between NBTI and PV
   Process Variation Aware NBTI Tolerant Microarchitecture
      Motivation
      PV-Aware NBTI Mitigation for Multi-Ported Based Microarchitecture Structures
      PV-Aware NBTI Mitigation for Combinational Blocks
      PV-Aware NBTI Mitigation for Storage Cell Based Structures
   Evaluation
      Effectiveness of O1
      Effectiveness of O2
      Effectiveness of O3
      NBTI&PV Efficiency Regarding the Entire Chip
   Related Work

8 ARCHITECTING RELIABLE NETWORK-ON-CHIP MICROARCHITECTURE IN LIGHT OF SCALED TRANSISTOR PROCESSING TECHNOLOGY
   Packet-Switched NoC Router Microarchitecture
   A Hierarchical Design of NoC That Mitigates the Effects of NBTI and PV
      Intra-Router Level NoC PV and NBTI Optimization
         NBTI and PV optimization for NoC combinational-logic structures
         NBTI and PV optimization for NoC storage cell based structures
      Inter-Router Level NoC PV and NBTI Optimization
   Evaluation
      Experimental Methodologies
      Effectiveness of the VA_SA_M1
      Effectiveness of VC_M2
      Effectiveness of IR_M3
      Real Workload Results
   Related Work

9 CONCLUSIONS AND FUTURE WORKS
   Conclusions
   Future Works

LIST OF REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

2-1 A comparison of different architectural level reliability analysis tools
2-2 Trivial instructions (in Alpha ISA) identified by Sim-SODA
2-3 Simulated machine configuration
2-4 SPEC 2000 INT and BioInfoMark benchmarks (the data set name is integrated with the benchmark name)
3-1 Variation in run-time microarchitecture vulnerability
3-2 Events used by the PMC-15 and PMC-5 schemes
3-3 Number of phases identified by the K-means clustering algorithm
3-4 AVF phase classification error on different machine configurations
4-1 Accuracy of using PC to identify ACE instructions (committed instructions only)
4-2 Simulated machine configuration
4-3 The studied SMT workloads
4-4 A summary of the proposed techniques
5-1 The studied SMT workloads

LIST OF FIGURES

2-1 ACE and un-ACE lifetime partitions due to the different orders of un-ACE reads (marked as read*) and ACE reads
2-2 Instruction level vulnerability profile of the studied benchmarks
2-3 AVF of instruction window
2-4 AVF of instruction window and wake-up table
2-5 AVF of ROB
2-6 AVF of register files
2-7 AVF of function unit
2-8 Data array AVF of L1 data cache (DL1), data TLB (DTLB), victim buffer (VBuf), load queue (LQ) and store queue (SQ)
2-9 Tag array AVF of L1 data cache (DL1), data TLB (DTLB), victim buffer (VBuf), load queue (LQ) and store queue (SQ)
3-1 Run-time instruction window AVF profile on benchmarks gcc and parser
3-2 The correlations between AVFs and a simple performance metric. A) Instruction windows. B) Reorder buffer. C) Function unit. D) Wakeup table
3-3 COVs yielded by different phase classification schemes. A) Instruction windows. B) Reorder buffer. C) Function unit. D) Wakeup table
3-4 COVs yielded by hybrid schemes. A) Instruction windows. B) Reorder buffer. C) Function unit. D) Wakeup table
3-5 The COVs of reliability phase classification on the 8-issue and 16-issue processors. A) Instruction windows. B) Reorder buffer. C) Function unit. D) Wakeup table
4-1 Microarchitecture soft-error vulnerability profile on SMT processor
4-2 The histograms of ready queue length and ACE instruction percentage on a 96-entry IQ. SMT processor issue width = 8; the experimented workloads: 4-context CPU (bzip, eon, gcc, perlbmk)
4-3 Dynamic IQ resource allocation based on IPC, ready queue length and total IQ size
4-4 L2-cache-miss sensitive IQ resource allocation
4-5 Reliability and performance with ICOUNT fetch policy. A) Normalized IQ AVF. B) Normalized throughput IPC
4-6 Reliability and performance using different fetch policies. A) Normalized IQ AVF. B) Normalized IPC
4-7 A) IQ AVF contributed by waiting instructions and ready instructions. B) Profiles of the quantity of ready instructions and waiting instructions in a 96-entry IQ. C) Residency cycles of ready instructions and waiting instructions in a 96-entry IQ. The ICOUNT fetch policy is used
4-8 ORBIT with prediction
4-9 Predict the load instruction completion time
4-10 Readiness prediction example on PredictALL and Predict_non_load
4-11 Reliability and performance results when using ICOUNT fetch policy. A) Normalized IQ AVF. B) Normalized throughput IPC. C) Normalized harmonic IPC
4-12 A comparison with different fetch policies. A) IQ AVF. B) Throughput IPC. C) Harmonic IPC
4-13 The impact of proposed techniques on core and other microarchitecture structures' vulnerability (DALL => DelayALL, PDACE => Predict_DelayACE)
4-14 A) IQ AVF contributed by waiting instructions and ready instructions. B) Profiles of the quantity of waiting instructions in the 96-entry IQ with Predict_DelayACE. C) Profiles of residency cycles of waiting instructions in the 96-entry IQ with Predict_DelayACE. The ICOUNT fetch policy is used
5-1 Soft error robust SRAM (rSRAM) cell (6T+2C). The rSRAM cell is built from a standard 6-transistor high-density SRAM cell above which two stacked Metal-Insulator-Metal (MIM) capacitors are symmetrically added. The embedded capacitors increase the critical charge required to flip the cell logic state and lead to a much lower SER. The common node of the two capacitors is biased at VDD/2
5-2 The control flow of instruction dispatch in the proposed IQ using hybrid radiation hardening techniques
5-3 An overview of radiation hardened IQ design using hybrid techniques
5-4 The wakeup logic of the RIQ
5-5 The design alternatives of the hybrid radiation hardened IQ. A) First design. B) Second design
5-6 The correlation between ROB AVF and L2 cache misses
5-7 A comparison of normalized IQ SER, throughput and harmonic IPCs. A) Normalized IQ SER. B) Normalized throughput IPC. C) Normalized harmonic IPC
5-8 Criticality threshold analysis. A) CPU combination workloads. B) MIX combination workloads. C) MEM combination workloads
5-9 ROB SER reduction and processor power overhead with L2_miss trigger and enhanced trigger. A) ROB SER reduction. B) Power overhead
5-10 Vulnerability threshold analysis. A) CPU combination workloads. B) MIX combination workloads. C) MEM combination workloads
5-11 The aggregate effect of the proposed two techniques
6-1 Multi-level quad-tree partitioning to model systematic variation
6-2 Standard 6T SRAM schematic with current source inserted at Q
6-3 V(Q) and V(QB) of SRAM (in 45nm processing technology) under a particle strike. A) Without PV. B) In the presence of PV
6-4 Critical charge variation map for a chip (data are presented in units of fC)
6-5 Qcrit_entry of each entry in the register file (80 entries, each entry contains 64 bits)
6-6 Entry- and bit-level Qcrit distribution in the register file
6-7 A) IQ AVF in the baseline case and VL-RF technique on gcc over a period of 375ms. B) The number of issue pipe stalls and IQ AVF difference between VL-RF and the baseline case
6-8 Critical charge vs. delta threshold voltage in the range [-0.3V, 0.3V] in 6T SRAM
6-9 Entry-BVM architectural overview
6-10 BB in each phase
6-11 Structure-BVM pseudo-code
6-12 Normalized IQ SER and IPC*fpv yielded by VL related techniques. A) Normalized IQ SER. B) Normalized IPC*fpv
6-13 Normalized VPP on VL related techniques
6-14 Normalized IQ SER with BB related techniques
6-15 Normalized IPC*fpv with BB related techniques
6-16 Normalized VPP with BB related techniques
6-17 AVFpv versus IPC*fpv plot for intervals on equake (up) and swim (down) (intervals within the drawn area are close to the mean value)
6-18 A) Normalized IQ SER under Entry-BVM + Structure-BVM. B) Normalized VPP under Entry-BVM + Structure-BVM
7-1 Different guardband settings for tolerating NBTI
7-2 The limitation of simply combining NBTI and PV mitigation techniques
7-3 2-read-port register files with detailed read port design
7-4 Cycle time under NBTI and PV effects. A) Baseline case without optimization. B) O1
7-5 Pseudo code for port switching in O1
7-6 Examples of PS in O1
7-7 An example of a 2-to-4 decoder
7-8 O2 circuit design
7-9 Vth (in mV) variation map for a cache
7-10 Fundamental idea of O3 in the 4-way L1 cache
7-11 The effectiveness of O1 in RF. A) Normalized CPI. B) Normalized NBTI guardband. C) Normalized NBTI&PV_efficiency
7-12 The effectiveness of O2 in IntALU. A) Normalized NBTI&PV_guardband. B) Normalized NBTI&PV_efficiency
7-13 The effectiveness of O3 in L1 cache. A) Normalized CPI. B) Normalized NBTI&PV_efficiency
7-14 NBTI&PV_efficiency with various granularities
8-1 Two-stage adaptive router microarchitecture
8-2 Circuit design of the two-stage adaptive router [28]. A) Circuit design of the two-stage adaptive router. B) Zoom-in view at VA logic
8-3 Circuit design in the first step of the VA stage (one VC entry is shown)
8-4 The implementation of VA_M1
8-5 The guardband and power benefit of applying AVS+ during the lifetime. A) Guardband benefit. B) Power benefit
8-6 Vth (in mV) variation map for a router
8-7 The implementation of VC_M2
8-8 The implementation of IR_M3
8-9 The effectiveness of VA_SA_M1 on NBTI&PV_overhead and NBTI_guardband. A) Uniform random traffic. B) Bit-complement traffic. C) Transpose traffic. D) Tornado traffic
8-10 Sensitivity analysis on threshold (URT with 0.3 flit/node/cycle injection rate)
8-11 The effectiveness of VC_M2 on network latency and NBTI&PV_overhead. A) Uniform random traffic. B) Bit-complement traffic. C) Transpose traffic. D) Tornado traffic
8-12 The effectiveness of IR_M3 on network latency and NBTI_guardband. A) Uniform random traffic. B) Bit-complement traffic. C) Transpose traffic. D) Tornado traffic
8-13 The effectiveness of IR_M3 on NBTI&PV_overhead. A) Uniform random traffic. B) Bit-complement traffic. C) Transpose traffic. D) Tornado traffic
8-14 The effectiveness of the combined techniques (VA_SA_M1+VC_M2+IR_M3) on real workloads. A) Normalized NBTI&PV_guardband. B) Normalized network latency. C) Normalized NBTI&PV_overhead

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

CHARACTERIZING, MODELING AND MITIGATING MICROARCHITECTURE VULNERABILITY AND VARIABILITY IN LIGHT OF SMALL-SCALE PROCESSING TECHNOLOGY

By

Xin Fu

December 2009

Chair: Tao Li
Cochair: José A. B. Fortes
Major: Electrical and Computer Engineering

The rapidly increased soft error rate (SER) due to scaled processing technology is one of the critical reliability concerns in current processor design. In order to characterize and mitigate the microarchitecture soft-error vulnerability in modern superscalar and multithreaded processors, I developed Sim-SODA (Software Dependability Analysis), a unified framework for estimating microprocessor reliability in the presence of soft errors at the architectural level. By using Sim-SODA, I observed that a single performance metric is not a good indicator of program vulnerability; on the other hand, a combination of several performance metrics can predict the architecture-level soft-error vulnerability well. Based on the observation that the issue queue (IQ) is a reliability hot-spot on Simultaneous Multithreaded (SMT) processors, I proposed VISA (Vulnerable InStruction Aware) issue and ORBIT (Operand Readiness Based InsTruction) dispatch to improve IQ reliability. I further combined circuit and microarchitecture techniques for soft error robustness on SMT processors to leverage the advantages of the two levels' techniques while overcoming the disadvantages of both. Results show that my proposed techniques can substantially improve IQ reliability with negligible performance penalty.

As one of the nano-scale design challenges, process variation (PV) significantly affects chip performance and power. I characterized the microarchitecture soft error vulnerability in the presence of PV, and proposed two techniques that work at fine-grain and coarse-grain levels to mitigate the impact of PV mitigation techniques on reliability and maintain optimal vulnerability, performance, and power trade-offs. Negative Bias Temperature Instability (NBTI) has become another important reliability concern as processing technology scales down. Observing that PV has both positive and negative effects on circuits, I took advantage of the positive effects in NBTI tolerant microarchitecture design to efficiently mitigate the detrimental impact of PV and NBTI simultaneously. The trend towards multi-/many-core design has made network-on-chip (NoC) a crucial hardware component of future microprocessors. I proposed several techniques that hierarchically mitigate the PV and NBTI effects on NoC while leveraging their benign interplay.

CHAPTER 1
INTRODUCTION

VLSI technology continues to provide increasing numbers of transistors and higher clock speeds, allowing computer architects to build powerful microprocessors and computer systems. On the other hand, continuously scaled processing technology at the nano-scale exacerbates processor reliability and variability issues, which lead to high error rates in processors, greatly affect processor lifetime, and result in substantial performance loss and power overhead. High availability and reliability are essential for any computer system, so it is now crucial for computer architects to quantify and mitigate these deleterious vulnerability and variability factors. This chapter (1) introduces three critical reliability and variability issues: soft errors, negative biased temperature instability, and process variations; (2) describes the necessity of mitigating those vulnerability and variability factors; and (3) presents the objectives and contributions of this dissertation.

Processor Vulnerability and Variability

With the advance of semiconductor processing technologies, several critical reliability issues have become major causes of processor failures. For instance, the soft error rate increases significantly as transistor size becomes smaller, and Negative Bias Temperature Instability (NBTI), a wear-out mechanism, has become a considerable reliability concern. Moreover, it is becoming harder to control sub-wavelength lithography and channel doping, which results in Process Variation (PV); these variations cause a large amount of frequency loss and leakage power consumption and also affect chip lifetime.

Soft Errors

Soft errors [2, 27, 28], also referred to as transient faults or Single Event Upsets (SEU), can cause single or multiple bit flips in memory, flip-flops, and registers, and finally lead to a wrong computation even in a perfectly working circuit. As opposed to hard errors, soft errors are not due to physical damage but rather are caused by unpredictable, temporary, and environmental conditions. By virtue of their nature, soft errors are outside the scope of testing and verification, and are difficult to pinpoint in the field.

Soft errors are caused by high-energy (more than 1 MeV) neutrons from cosmic radiation, alpha particles emitted by trace uranium and thorium impurities in packaging materials, and low-energy cosmic neutron interactions with the isotope boron-10 (10B) in IC materials used to form insulator layers in manufacturing. When the particles pass through semiconductor devices, electron-hole pairs are generated. The source and diffusion nodes collect the charges, and a sufficient amount of accumulated charge will invert the state of the logic device and cause a bit flip. The error will not be restored until the device is reset or rewritten.

Safety-critical applications are severely threatened by soft errors. For instance, electronics must operate correctly under high-altitude and harsh environmental conditions in the aerospace and military domains; in automotive and transportation, the failure of an antilock brake system caused by soft errors can even endanger human lives; and the attack of soft errors in banking systems can result in substantial monetary losses.

Soft errors are not a new problem; they became well known during the 1970s with the introduction of RAM. They originally caused problems mainly for applications in high-risk environments. However, as semiconductor technology scales down to the nano-scale, it results in increasing microprocessor complexity, shrinking transistor size, lower supply voltage, and an increasing number of gates on the same die. Therefore, soft-error rates of future generations of processors are projected to increase significantly.
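This qualitative picture is often quantified with an empirical model. As an illustrative sketch (this is the well-known Hazucha-Svensson style formulation from the SER literature, not a result derived in this dissertation), the soft error rate of a circuit node can be written as

\[
SER \;\propto\; F \times A \times \exp\!\left(-\frac{Q_{crit}}{Q_{s}}\right)
\]

where \(F\) is the particle flux, \(A\) is the sensitive (charge-collecting) area, \(Q_{crit}\) is the critical charge, i.e. the minimum collected charge that flips the stored value, and \(Q_{s}\) is the charge-collection efficiency of the device. Technology scaling lowers node capacitance and supply voltage, which shrinks \(Q_{crit}\), while also packing more sensitive nodes onto each chip; both effects push the exponential term and the per-chip SER upward, consistent with the projection above.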

Negative Biased Temperature Instability

Negative bias temperature instability (NBTI) is a considerable reliability concern for sub-micrometer CMOS technologies. NBTI occurs in PMOS devices when the gate-source voltage is negative (Vgs = -Vdd). NBTI increases the threshold voltage (Vth) and reduces the drive current (Idsat), which causes degradation in circuit speed and requires a minimum voltage (Vmin) increase in storage cells to keep their content. Eventually, this will lead to failures in logic circuits and storage structures due to timing violations or Vmin limitations. The NBTI effect in PMOS transistors, which stems from an electrochemical reaction involving the electric field, holes, Si-H bonds, and temperature, is not a recently discovered wear-out mechanism. It was originally observed in the early phases of CMOS development (almost 40 years ago), but was not considered important because of the low electric fields under normal operating conditions. However, technology scaling has resulted in the convergence of several factors (e.g. the introduction of nitrided oxides, the increase in gate oxide fields, and operating temperature), which have made NBTI the most critical reliability concern for deep sub-micrometer transistors [98, 99, 100]. For example, it has been observed that NBTI can increase Vth by as much as 50mV for devices operating at 1.2V or below [99], and the circuit performance degradation may extend up to 20% in 10 years [100].
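A frequently used first-order description of this wear-out is the reaction-diffusion power law; the form below is a generic sketch from the NBTI literature (the prefactor and exponent are technology- and stress-dependent, and this is not the exact model calibrated later in this dissertation):

\[
\Delta V_{th}(t) \;=\; A_{NBTI}\,\cdot\, t^{\,n}, \qquad n \approx 1/6
\]

where \(t\) is the accumulated stress time and \(A_{NBTI}\) grows with the oxide field and temperature. Because part of the shift relaxes when the stress is removed (Vgs = 0), reducing the fraction of time a PMOS device spends under negative bias directly slows the degradation; this recovery property is what duty-cycle-oriented NBTI mitigation schemes generally rely on.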

Process Variation

Process variation (PV), the divergence of transistor process parameters from their nominal specifications, results in variability in circuit performance/power and has become a major challenge in the design and fabrication of future microprocessors [101, 102, 103, 104]. For example, chip frequency can be degraded by as much as 30% in 45nm process technology due to process variation [102], and a 20x increase in leakage power consumption is reported in [101]. PV is caused by the difficulty in controlling sub-wavelength lithography and channel doping as process technology scales. Process variation consists of die-to-die (D2D) and within-die (WID) variations. Die-to-die variation consists of parameter fluctuations across dies and wafers, whereas within-die variation refers to variations of design parameters within a single die. As technology scales, within-die variation, which is the primary focus of this study, has become more significant and is a growing threat to future microprocessor design [101, 102].
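Within-die variation is commonly decomposed into a spatially correlated systematic component and an independent random component. One standard way to generate such correlated maps is multi-level quad-tree partitioning, the approach illustrated later in Figure 6-1. The Python sketch below is a minimal illustration under assumed parameters (the grid size, number of levels, and sigma values are hypothetical); it is not the calibrated variation model used in the evaluation chapters.

import numpy as np

def quadtree_vth_map(size=64, levels=4, sigma_sys=0.03, sigma_rand=0.02,
                     vth_nominal=0.30, seed=0):
    """Generate a size x size within-die Vth map (in volts).

    Systematic part: at quad-tree level l the die is divided into 2^l x 2^l
    cells; each cell draws one Gaussian offset shared by every grid point it
    covers, so neighboring points accumulate similar sums (spatial
    correlation). Random part: an independent per-point Gaussian.
    """
    rng = np.random.default_rng(seed)
    sigma_level = sigma_sys / np.sqrt(levels)   # split variance evenly across levels
    systematic = np.zeros((size, size))
    for level in range(levels):
        cells = 2 ** level                      # cells per side at this level
        offsets = rng.normal(0.0, sigma_level, (cells, cells))
        reps = size // cells                    # grid points per cell side
        systematic += np.kron(offsets, np.ones((reps, reps)))
    random_part = rng.normal(0.0, sigma_rand, (size, size))
    return vth_nominal + systematic + random_part

if __name__ == "__main__":
    vth = quadtree_vth_map()
    print(f"Vth across the die: min {vth.min():.3f} V, max {vth.max():.3f} V")

Summing one Gaussian offset per enclosing cell at every level gives nearby grid points nearly identical systematic components, so contiguous fast (low-Vth) and slow (high-Vth) regions emerge; it is this spatial structure that the later chapters characterize and exploit.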

Contributions

This dissertation makes several contributions: it models, characterizes, and mitigates microarchitecture soft error vulnerability in superscalar and multi-threaded processors; characterizes and mitigates soft error vulnerability in the presence of process variation; builds NBTI-tolerant microarchitecture designs under the impact of process variation; and further improves reliability in networks-on-chip. The contributions are summarized below:

1. In order to improve soft error robustness, architects need a framework to understand structures' vulnerability and evaluate the effectiveness of their proposed vulnerability mitigation techniques. I developed Sim-SODA (SOftware Dependability Analysis). Sim-SODA, built on top of the Sim-Alpha tool sets, is the first unified simulation framework for high-performance microprocessor reliability estimation in the presence of soft errors. Sim-SODA is publicly available. It has been used by more than 90 universities and research labs (e.g. Cornell University, Louisiana State University, and Lawrence Livermore National Lab) for their research projects.

2. Characterizing and predicting program phase behavior from a reliability perspective is crucial in order to apply dynamic fault-tolerant mechanisms and to optimize performance/reliability trade-offs. I found that a single performance metric (e.g. IPC, cache misses) is not a good indicator of program vulnerability. The vulnerabilities of the structures are correlated with program code structure and run-time events to identify vulnerability phase behavior. I observed that, in general, tracking runtime performance metrics performs better than tracking program control flow in vulnerability phase classification.

3. The issue queue (IQ) is a key microarchitecture structure for exploiting instruction-level and thread-level parallelism in dynamically scheduled SMT processors. However, exploiting more parallelism yields high susceptibility to transient faults on a conventional IQ. With rapidly increasing soft error rates, the IQ is likely to be a reliability hot-spot on SMT processors. I proposed reliability-aware instruction scheduling (VISA) and resource allocation to reduce the quantity and residency cycles of vulnerable instructions in the IQ and thereby optimize IQ robustness to soft errors. Moreover, I observed that IQ soft-error vulnerability is largely affected by instructions waiting for their source operands. I explored operand-readiness-based instruction dispatch (ORBIT) to minimize the number of waiting instructions in the IQ and reduce the IQ vulnerability with negligible performance degradation. Furthermore, I extended ORBIT with prediction methods that can anticipate the readiness of source operands ahead of time to achieve more attractive reliability/performance trade-offs.

4. Techniques for structure reliability improvement in SMT processors exist at both the circuit and microarchitecture levels, but there are relatively few studies that cost-effectively integrate them. I bridged the gap by proposing combined circuit and microarchitecture techniques that leverage their advantages while overcoming their disadvantages. The combined techniques achieve significant reliability improvement in a multithreaded environment.

5. As transistor process technology approaches the nanometer scale, process variation (PV) significantly affects chip performance and power. However, the impact of process variation on soft error vulnerability is not well studied. I characterized the microarchitecture soft error vulnerability in the presence of PV, and proposed two techniques that work at fine-grain (entry-based) and coarse-grain (structure-based) levels to mitigate the deleterious impact of PV mitigation techniques on reliability and maintain optimal vulnerability, performance, and power trade-offs.

6. Negative bias temperature instability (NBTI), which increases the delay and reduces the lifetime of PMOS transistors, is becoming a growing reliability concern for sub-micrometer CMOS technologies. Process variation (PV) can exacerbate the PMOS transistor wear-out problem and further reduce the reliable lifetime of microprocessors. Observing that PV also has positive effects, I proposed to take advantage of the PV positive effects in NBTI tolerant microarchitecture design, which effectively mitigates the detrimental impact of PV and NBTI simultaneously while achieving good trade-offs among chip performance, power, lifetime, and area overhead.

7. The trend towards multi-/many-core design has made network-on-chip (NoC) a crucial hardware component of future microprocessors. Meanwhile, process variation (PV) and negative bias temperature instability (NBTI) increasingly affect hardware reliability and lifetime. I proposed novel techniques that hierarchically mitigate the PV and NBTI effects on NoC while leveraging their benign interplay. The low-level mechanisms improve PV and NBTI efficiency of key components (e.g. virtual channels, switch arbiters) that are in the critical paths of the pipelined router microarchitecture. The high-level mechanisms leverage NBTI degradation and PV information from multiple routers to intelligently route packets, delivering optimized performance-power-reliability efficiency across the NoC substrate.

Dissertation Organization

The rest of the dissertation is organized as follows. Chapter 2 describes the development of a unified simulation framework to estimate microarchitecture vulnerability in the presence of soft errors. Chapter 3 characterizes microarchitecture soft-error vulnerability phase behavior. Chapter 4 proposes two architecture-level techniques to mitigate the soft-error vulnerability of the issue queue in simultaneous multithreaded architectures. Chapter 5 combines circuit-level and microarchitecture-level techniques to efficiently reduce the soft error rate of microarchitecture structures in simultaneous multithreaded architectures. Chapter 6 characterizes the impact of process variation on soft error vulnerability and improves processor reliability and variability while achieving optimal trade-offs among performance, reliability, and power. Chapter 7 takes advantage of the positive effects of process variation in NBTI tolerant microarchitecture design to efficiently mitigate the detrimental impact of process variation and NBTI simultaneously. Chapter 8 targets reliability enhancement in networks-on-chip and proposes several techniques to hierarchically mitigate process variation and NBTI effects on the network-on-chip. Chapter 9 concludes the dissertation and suggests future opportunities.


CHAPTER 2
SIM-SODA: SOFTWARE DEPENDABILITY ANALYSIS

Sim-SODA estimates the dependability of hardware components in a high-performance, out-of-order superscalar microprocessor using the computation methods introduced in [4, 22]. Compared with previous studies [4, 19, 22], Sim-SODA provides a unified infrastructure to study the reliability of all major units of a high-performance microprocessor within a single run, performs fine-grained reliability analysis to improve the accuracy of the reliability estimation, and also offers a hybrid method that can be used to accurately estimate the vulnerability of complex structures. While previous architectural reliability analysis tools were built on proprietary performance models [13, 24], Sim-SODA uses an open-source, publicly available simulator, Sim-Alpha [11, 12], which makes porting the reliability analysis framework described in this dissertation to other popular simulator tool suites (such as Simplescalar [6] and M5 [3]) relatively easy.

Related Work

There has been prior work on dependability modeling at a high level. For example, hardware RTL models have been used in the past to estimate processor reliability [25, 27]. The RTL models contain all of the detailed information about the microprocessor. Nevertheless, the simulation slowdown of RTL models is too expensive for architecture studies, in which the trade-offs between many hardware configurations need to be considered. Moreover, these models are generally not available during the architectural exploration phase of a microprocessor design. The Architectural Vulnerability Factor (AVF) analysis methods proposed by Mukherjee et al. used a performance model to generate reliability estimates. In [4, 22], the vulnerability of hardware structures (e.g., issue queue, execution unit, TLB and caches) of an Itanium2-like IA64 processor was studied. In [1], Asadi et al. estimated the vulnerability of the L1 cache through the residency time of critical words in the cache.


In [19], Li and Adve developed SoftArch, an architecture-level tool for modeling and analyzing soft errors. The SoftArch framework estimates reliability using a probabilistic model of the error generation and propagation process in a processor. As a complementary approach to AVF computation, statistical fault injection has been used in several studies [5, 10, 25, 27] to evaluate architectural reliability. To obtain statistical significance, a large number of experiments need to be performed on the investigated hardware component. Table 2-1 summarizes the features of several architectural reliability estimation tools from the perspectives of methodology, modeled hardware structures and the availability of baseline models.

Microarchitecture Soft-Error Vulnerability Estimation

Sim-SODA estimates microprocessor reliability using the Architectural Vulnerability Factor (AVF) computing methods introduced in [4, 22]. In this section, I briefly review the concept and the computation of AVF. Since not all soft errors cause erroneous program execution, the probability that a fault in a hardware structure will cause an externally visible error in the final output of a program is referred to as the architectural vulnerability factor (AVF) of that hardware structure. A hardware structure's error rate is the product of its raw error rate, which is mainly determined by device and circuit design technology, and its AVF. The key to calculating the AVF is to determine which bits affect the final system output and which do not. In [22], the subset of processor state bits required for architecturally correct execution (ACE) are called ACE bits. Hence, the AVF of a hardware structure in a given cycle is the percentage of ACE bits in that structure, and the AVF of a hardware structure during program execution is its average AVF over time.
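To make the computation concrete, the per-structure bookkeeping can be sketched as follows (a minimal sketch with hypothetical counter names, not Sim-SODA's actual internals; the hard part, deciding how many resident bits are ACE in each cycle, is assumed to be done elsewhere):

    /* Minimal AVF bookkeeping sketch (hypothetical names).
     * AVF = (sum over cycles of resident ACE bits)
     *       / (structure size in bits * simulated cycles). */
    #include <stdint.h>

    typedef struct {
        uint64_t ace_bit_cycles;  /* accumulated ACE-bit residency */
        uint64_t total_cycles;    /* simulated cycles observed     */
        uint64_t size_in_bits;    /* structure capacity in bits    */
    } avf_counter_t;

    /* called once per simulated cycle with the number of ACE bits
     * currently resident in the structure */
    static void avf_tick(avf_counter_t *c, uint64_t ace_bits_this_cycle)
    {
        c->ace_bit_cycles += ace_bits_this_cycle;
        c->total_cycles++;
    }

    static double avf(const avf_counter_t *c)
    {
        return (double)c->ace_bit_cycles /
               ((double)c->size_in_bits * (double)c->total_cycles);
    }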


The Sim-SODA Reliability Estimation Framework

Overview

I have developed Sim-SODA, an architectural framework to estimate the reliability of programs running on high-performance, out-of-order microprocessors. To track the residence time of ACE bits in various structures, I instrumented Sim-alpha, an open-source, validated cycle-accurate performance simulator for the Alpha 21264. In the Sim-SODA framework, I classify each dynamic instruction of a program's execution based on whether the instruction's output affects the outcome of that program. Since instructions executed along a mispredicted path will not be committed, and do not affect AVF, I only considered committed instructions. I consider an instruction an ACE instruction if its results might affect the final program output, and an instruction un-ACE if its results definitely will not affect the program output. Bits in an ACE instruction are ACE, but an un-ACE instruction contains both ACE and un-ACE bits (details are explained in [22]). Additionally, I classify one type of un-ACE instruction as dynamically dead if its results are not used subsequently (a more detailed classification of un-ACE instructions will be introduced in section 5.1). In Sim-SODA, I implemented the post-commit analysis window proposed in [22] to determine whether an instruction is dynamically dead or whether any of its bits are logically masked. Through cycle-level simulation, both microarchitecture and architecture states are classified into ACE/un-ACE bits, and their residency and resource usage counts are generated. This information is then used to estimate the reliability of various hardware structures.

Fine-Grained Reliability Estimation

Instruction window

In high-performance processors, the instruction window is used to support dynamic scheduling and out-of-order execution. In [22], the instruction window is treated as a bulk structure. Sim-SODA provides fine-grained reliability analysis for the instruction window.


When an instruction completes its execution, its destination register ID is broadcast to all the instructions in the window to inform all dependent instructions of the availability of the result. Each entry compares the broadcast register ID with its own source register ID. If there is a match, the source operand is latched and the dependent instruction may be ready to execute. This register ID broadcast and associated comparison is called instruction wake-up. A soft error that results in an incorrect match between a broadcast physical register ID and a corrupted tag may cause instructions waiting for that operand to be issued prematurely. A single-bit error in the tag array that results in a mismatch where there should have been a hit can prevent ready instructions from being issued, causing a deadlock in instruction issue. Therefore, the wake-up table is vulnerable to soft error strikes. The Sim-SODA framework estimates the vulnerability of both the instruction window and the wake-up table.

When a new instruction is allocated in the instruction window, the wake-up table records the renamed physical register IDs of the instructions on which that instruction depends. There are two fields in each wake-up table entry to hold the renamed register IDs for the two source operands of an instruction. A field in the wake-up table entry becomes invalid once the source operand is ready. The operations on the wake-up table include fill, read and invalidate; therefore, its non-overlapping lifetime can be partitioned into fill-to-read, read-to-read and invalidate-to-fill periods. Note that there is no read-to-invalidate component, because the last read between fill and invalidate causes a match between the stored register ID and the broadcast register ID, and once there is a match, the field in the wake-up table becomes invalid immediately. In other words, the lifetime of the read-to-invalidate component in the wake-up table is always zero. Therefore, I combine the fill-to-read and read-to-read components together, and attribute the invalidate-to-fill component as un-ACE.
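The lifetime partitioning described above can be sketched per wake-up table field as follows (hypothetical types and event names, not Sim-SODA's actual code; each span is classified by the events that bound it, matching the fill-to-read/read-to-read/invalidate-to-fill breakdown):

    /* Wake-up table lifetime accounting sketch (hypothetical bookkeeping). */
    enum wut_event { WUT_FILL, WUT_READ, WUT_INVALIDATE };

    typedef struct {
        enum wut_event last_event;
        unsigned long long last_cycle;
        unsigned long long ace_cycles;
        unsigned long long unace_cycles;
    } wut_field_t;

    static void wut_observe(wut_field_t *f, enum wut_event e,
                            unsigned long long now)
    {
        unsigned long long span = now - f->last_cycle;
        if (f->last_event == WUT_INVALIDATE)
            f->unace_cycles += span;  /* invalidate-to-fill: un-ACE */
        else if (e == WUT_READ)
            f->ace_cycles += span;    /* fill-to-read, read-to-read: ACE */
        /* read-to-invalidate spans are zero-length by construction */
        f->last_event = e;
        f->last_cycle = now;
    }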


Trivial instruction

In [22], Mukherjee et al. identified logical masking instructions as a source of un-ACE bits. An operand and its bits are logically masked and can be attributed to un-ACE bits if the operand does not influence the result of an instruction's execution. In their study, Mukherjee et al. considered three types of logical masking instructions: compare instructions prior to a branch, bitwise logical operations, and 32-bit operations in a 64-bit architecture. In this study, I identified further logical masking bits. I found that the bits used to encode the specifiers of source registers which hold logically masked values are un-ACE bits: a corrupted register specifier may cause the processor to fetch the wrong data from a different register, but the computation result will not be altered because of the logical masking effect.

Additionally, I extend logical masking instructions to trivial instructions [29] in this study. Trivial instructions are those computations whose output can be determined without performing the computation, so they cover all the un-ACE bits that logical masking instructions can identify. In this study, I further classified the trivial instructions into the following three categories. The first type of trivial instruction has two source registers. For these trivial instructions, a soft error is tolerated when it strikes a register whose contribution to the computation result is masked by the second register. For example, in a multiplication instruction, if one of the source registers is equal to zero, a soft error that hits the other register will not affect the result. Therefore, the bits held by that source register are un-ACE bits. Additionally, the bits used for encoding the other source register's specifier within the same instruction are also un-ACE bits. The second type of trivial instruction contains an immediate value and only one source register. The bits in the source register can be considered un-ACE when the immediate value masks the instruction's contribution to the computation results. Similarly, bits in the immediate value can be considered un-ACE when the source register does not affect the computation result. The source register specifier bits in that instruction become un-ACE if the value held in that register is un-ACE. Note that in both types of trivial instructions, only one of the source operands can be considered un-ACE at a time, since a soft error hit to the operand which provides the masking function will skew the computation results. The last type of trivial instruction is specific to XOR and EQV operations (see Table 2-2). For these two operations, when the first and second register specifiers are identical, bits in that register are un-ACE.

Table 2-2 summarizes the trivial instructions identified by the Sim-SODA framework. The particular source operand value that trivializes the operation is also listed. Integer additions and subtractions are not included in Table 2-2 because their computation results depend on both operands: a bit change in either operand may result in incorrect computation output. Integer divisions are not listed in Table 2-2 because a division operation is implemented through multiplication instructions in the Alpha instruction set [8, 9, 16]. Note that trivial instructions can also be dynamically dead instructions. If an instruction is both a trivial instruction and a dynamically dead instruction, I attribute it to the dynamically dead instructions because more un-ACE bits can be derived from that type of instruction. In other words, the trivial instructions analyzed by the Sim-SODA framework are special ACE instructions that have un-ACE bits in the instructions and their registers.
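A classifier for a few of these cases might look like the sketch below (hypothetical opcode enumeration and operand accessors; the full rule set is given in Table 2-2, and interpreting the "A=1"/"B=1" conditions there as all-ones operands is my assumption):

    /* Trivial-instruction detection sketch for a few Alpha-style cases
     * (hypothetical helper, not Sim-SODA's actual classifier). Returns
     * nonzero if one source operand is masked, making its bits and its
     * register specifier un-ACE. */
    #include <stdint.h>

    enum op { OP_MUL, OP_AND, OP_BIS /* logical OR */, OP_XOR };

    static int is_trivial(enum op opcode, uint64_t a, uint64_t b,
                          int ra, int rb /* source register specifiers */)
    {
        switch (opcode) {
        case OP_MUL:
        case OP_AND:
            return a == 0 || b == 0;          /* zero masks the other source  */
        case OP_BIS:
            return a == ~0ULL || b == ~0ULL;  /* all-ones masks the other one */
        case OP_XOR:
            return ra == rb;                  /* identical specifiers: A^A = 0 */
        default:
            return 0;
        }
    }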


Reliability Estimation of Unexplored Structures

Hybrid AVF computation for register files

The register files hold the architectural state of program execution. Previous studies [19, 25, 27] show that the register files are highly vulnerable to soft errors. In these studies, the reliability analysis of register files was performed by injecting faults statistically or by modeling error propagation; the AVF calculation for register files has not been addressed.


When comparing the register files with the data cache, I summarize some common characteristics between the two. The register files are similar to the data array in a data cache: both are used to keep values for instruction execution. Activities occurring during the lifetime of a bit in the register files also include idle, fill, read, write and evict. Therefore, if I follow the methodology for computing the AVF of address-based structures [4], I can also classify the register files' lifetimes into non-overlapping ACE or un-ACE periods. For example, idle, read-to-write and write-to-write are un-ACE, while fill-to-read and write-to-read are ACE. However, calculating the register files' AVF in this way can be very conservative, because, unlike the data cache, registers are heavily utilized and write and read operations occur very frequently.

To obtain a more realistic estimate of the register files' AVF, I analyzed each ACE lifetime component to discover which operations on the bits convert ACE lifetime to un-ACE lifetime. I found that not every read affects the final program output. For example, if a read is caused by a dynamically dead instruction, the final output will not change even if the data is incorrect. Written data is also un-ACE when the write is caused by a dynamically dead instruction. Additionally, as described in section 3.2.2, when a trivial instruction has a read operation on bits of the register files, the read data is un-ACE if it does not affect the result of the instruction. Since I combine ACE/un-ACE lifetime analysis with an instruction analysis window (which is used to detect dynamically dead instructions and trivial instructions, see section 4), I call the method a hybrid AVF computation scheme.

A write to the register file is usually followed by more than one read, and two types of reads can occur. The first is caused by ACE instructions; I call it an ACE read. The second is caused by dynamically dead or trivial instructions; I call it an un-ACE read. Whether ACE lifetime can be converted to un-ACE lifetime depends on the order in which un-ACE reads take place among all the reads after a write. I identified three cases based on the order of un-ACE read operations. First, as Figure 2-1 (A) shows, un-ACE reads (marked as read*) occur closely after the write activity. The second case is shown in Figure 2-1 (B): there are one or more ACE reads before and after the un-ACE reads. In the third case, as illustrated by Figure 2-1 (C), no more ACE reads occur, but another write or evict follows the un-ACE reads. In Figure 2-1 (A) and (B), un-ACE reads cannot convert ACE time into un-ACE time, since they are followed by ACE reads, which cannot tolerate any soft error. In Figure 2-1 (C), if I follow the methodology applied in address-based structure AVF computation [4], the lifetime component between the last ACE read and the last un-ACE read would be identified as ACE; however, I can convert it into un-ACE since no ACE reads follow the un-ACE reads. To calculate the un-ACE lifetime, I identify the last ACE read after the write and then attribute the remaining lifetime between it and the next write or evict to un-ACE.
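The last-ACE-read rule can be sketched as follows (hypothetical bookkeeping; the per-32-bit granularity and the end-of-simulation edge cases discussed next are omitted):

    /* Register-file lifetime sketch implementing the last-ACE-read rule. */
    typedef struct {
        unsigned long long write_cycle;    /* cycle of the defining write  */
        unsigned long long last_ace_read;  /* 0 means no ACE read seen yet */
        unsigned long long ace_cycles;
        unsigned long long unace_cycles;
    } reg_lifetime_t;

    static void reg_read(reg_lifetime_t *r, unsigned long long now, int is_ace)
    {
        if (is_ace)
            r->last_ace_read = now;  /* un-ACE reads leave the marker alone */
    }

    /* called at the next write or evict, which closes the interval */
    static void reg_close(reg_lifetime_t *r, unsigned long long now)
    {
        if (r->last_ace_read) {      /* ACE only up to the last ACE read */
            r->ace_cycles += r->last_ace_read - r->write_cycle;
            r->unace_cycles += now - r->last_ace_read;
        } else {                     /* no ACE read at all: whole span un-ACE */
            r->unace_cycles += now - r->write_cycle;
        }
        r->write_cycle = now;
        r->last_ace_read = 0;
    }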


Similar to the data cache array, edge effects also arise in the register files' AVF computation. For example, if the simulation ends at a point after a write to the register files completes, I cannot determine whether the period between that write and the ending point is ACE lifetime or un-ACE lifetime. Therefore, I have to count that period as unknown. Since the COOLDOWN mechanism [4] has a remarkable impact on reducing the unknown portion of a data cache array's AVF, I applied the COOLDOWN strategy to compute the register files' AVF.

The granularity at which I maintain the lifetime information can have a significant impact on the register files' AVF. I cannot set the granularity to 64 bits, since not all of the instructions defined in the Alpha instruction set [8, 9, 16] consume the whole 64-bit data word (a register in the Alpha 21264 processor is 64 bits wide). They occasionally read or write 32-bit data, which means the other 32 bits in that register are idle at that time.


The methodology used to compute the AVF of address-based structures [4] partitions a tag-based structure into two parts: a data array and a tag array. Since register files are specified by their identifiers instead of a tag array, there is no false positive or false negative match in the register files' AVF computation. Because reads of less than 32 bits from the register files occur infrequently, it is unnecessary to perform per-byte analysis, or even more detailed per-bit analysis such as the one I used to analyze the logical masking effect. In this study, I maintain the granularity of the register files' AVF analysis at 32 bits.

ROB AVF computation

In an out-of-order execution microprocessor, the reorder buffer (ROB) stores all uncommitted instructions. To effectively exploit ILP, modern processors use a large ROB, implying that its AVF can greatly affect the AVF of the entire chip. I have developed a ROB AVF model for the Sim-SODA framework. Data in each ROB entry is allocated for an in-flight instruction. A ROB entry includes the instruction number, register specifiers and operands. If an instruction is un-ACE, the program output will not be affected, so bits in that entry are un-ACE bits. If an instruction is a trivial instruction, some bits in that entry are un-ACE. In the Sim-SODA framework, I use an instruction analysis window and trivial instruction classification to identify these scenarios in the ROB.

Victim buffer's AVF computation

The Sim-SODA framework models the AVF of a victim buffer. Whenever there is a miss in the L1 cache, the replaced block is evicted to the victim buffer. Since the evicted block may be recalled by the program again, any single-bit error in it can cause incorrect program output. The victim buffer is also an address-based structure, and only has fill, read, evict and end activities.


I classified non-overlapping ACE, un-ACE and unknown lifetime components for it. For example, fill-to-read and read-to-read are ACE; read-to-evict, fill-to-evict, evict-to-fill and evict-to-end are un-ACE; and fill-to-end and read-to-end are unknown. I used the COOLDOWN and hamming-distance-one mechanisms introduced in [4] to accurately compute its AVF.

The above AVF models can be integrated into a range of architectural simulators to provide reliability estimates. To implement these AVF models in a unified, timing-accurate framework, I instrumented the Sim-Alpha architectural simulator. I chose Sim-Alpha because previous work [11, 12] has shown that Sim-Alpha can accurately model an Alpha 21264 processor and is much more accurate than Simplescalar for modeling real hardware. I extended the simulator with a post-commit instruction analysis window (with a size of 40,000 instructions) which supports the identification of dynamically dead and trivial instructions. The Sim-SODA framework also includes AVF models for the cache, TLB, and load/store queue. I used synthesized micro-benchmarks with known characteristics to validate the Sim-SODA framework. The dynamically dead and trivial instructions and the ACE and un-ACE time reported by Sim-SODA match my expectations for the micro-benchmarks.

Experimental Setup and Results

Experimental Setup

Using the Sim-SODA framework, I performed a detailed reliability analysis of an Alpha-21264-like microprocessor running a wide range of applications. Table 2-3 summarizes the simulated machine configuration. The workloads I used in this study include 12 programs from SPEC 2000 INT and 6 programs from BioInfoMark [20]. I did not include SPEC 2000 FP benchmarks because Sim-Alpha does not model floating-point pipeline execution accurately [11].


To reduce the simulation time while still maintaining representative program behavior, I obtained the number of instructions to skip using SimPoint analysis [26] and ran each SimPoint for 50 million instructions. Table 2-4 lists the skipped instructions and the input data set for each benchmark. The numbers I present in this dissertation are results for the first SimPoint of each benchmark. I abbreviate the input names as follows: bzip2-source is bzip2-s, gcc-166 is gcc-1, eon-rushmeier is eon-r, gzip-graphic is gzip-g, parser-dict is parser-d, perlbmk-splitmail is perlbmk-s, vpr-route is vpr-r, clustalw-ureaplasma is clustalw-u, dnapenny-ribosomal is dnapenny-r, glimmer-bacteria is glimmer-b, hmmer-SWISS-PROT is hmmer-S, predator-eukaryote is predator-e, and promlk-17species is promlk-1.

Experimental Results

Program vulnerability profile at instruction level

Figure 2-2 shows an instruction-level vulnerability profile of the studied benchmarks. On average, 69% and 73% of the committed instructions are ACE instructions for the SPEC 2000 integer and BioInfoMark suites respectively. The un-ACE instructions include NOPs, prefetches and dynamically dead instructions. As suggested in [22], dynamically dead instructions can be classified into two types: (1) first-level dynamically dead (FDD) if their computation results are simply not read by any other instructions, or (2) transitively dynamically dead (TDD) if their results are only consumed by FDD or other TDD instructions. "Unknown" refers to those instructions whose destination registers' lifetimes cannot be determined by the instruction analysis window.

As shown in Figure 2-2, NOPs and FDD instructions (FDD_reg and FDD_mem) dominate the un-ACE instructions. In this study, I found 10% NOPs in the SPEC 2000 integer suite, which matches the NOP fraction Fahs et al. [14] reported for SPEC 2000 integer using the Alpha instruction set. TDD instructions (TDD_reg and TDD_mem) contribute a negligible fraction (less than 1%) of un-ACE instructions. Similar to the results reported in [22], my study shows that the fraction of FDD_reg instructions (11% in SPEC and 9% in BioInfoMark) is normally higher than that of FDD_mem instructions (7% in both benchmark suites).


The fraction of TDD and FDD instructions reported by the Sim-SODA framework is 17% in the SPEC 2000 suite, which is close to that reported in [14] (14% FDD and TDD instructions tracked via register and memory).

AVF of major microarchitecture structures

Figure 2-3 shows the AVF of the instruction window. I further decompose the ACE bits stored in the instruction window based on their instruction types. As can be seen, the dominant portion of ACE bits in the instruction window comes from ACE instructions. For prefetch and NOP instructions, only the instruction opcodes are ACE bits [22]. In this work, I count all the opcode and destination register specifier bits of FDD and TDD instructions as ACE bits; all other instruction bits are un-ACE bits [22]. Figure 2-3 shows that the instruction window's AVF ranges from 26% (gzip) to 62% (gap) in SPEC 2000 and from 44% (dnapenny) to 84% (clustalw) in BioInfoMark. On average, the AVF of the instruction window is 42% in SPEC 2000 and 52% in BioInfoMark. The biological multiple sequence alignment benchmark clustalw has the highest AVF because it has the highest ACE instruction fraction (90%, as shown in Figure 2-2). Instruction residency time in the instruction window also affects the AVF result, so the benchmark promlk yields a higher AVF than hmmer even though its ACE instruction fraction is lower (58% vs. 85% respectively).

Figure 2-4 shows the AVF of the instruction window, the wake-up table and the aggregated results (i.e., considering the instruction window and the wake-up table as a single structure). The AVF of the wake-up table is much lower than that of the instruction window because an instruction's ACE time in the wake-up table is always shorter than its residency time in the instruction window: an instruction may still need to wait for a function unit by staying in the instruction window after all of its source operands are ready.


As shown in Figure 2-4, the aggregated results do not decrease significantly (by 4%-10% in SPEC 2000 and 6%-9% in BioInfoMark), because the number of bits contained in the wake-up table is smaller than that in the instruction window.

Figure 2-5 shows the AVF of the reorder buffer. Interestingly, the ROB's AVF is significantly lower than that of the instruction window. This is due to the following effect: the Alpha 21264 processor has separate integer and floating-point instruction windows. The integer instruction window has 20 entries, whereas the ROB holds all types of instructions and has 80 entries. Because of the lack of floating-point operations, the fraction of idle bits in the ROB is much higher than that in the instruction window.

Sim-SODA models both the high and low 32 bits of the physical registers. In this dissertation, I report the average AVF of the entire 64-bit registers. As shown in Figure 2-6, hybrid AVF computation can reduce the register files' AVF on many workloads (e.g., by 9% on crafty, 10% on promlk and 13% on predator). On average, hybrid AVF calculation reduces the register files' AVF by 4% and 5% on SPEC 2000 and BioInfoMark respectively.

I assume each function unit has about 50% control latches and 50% datapath latches, and that the datapath within it has a width of 64 bits. The AVF numbers shown in Figure 2-7 are the average statistics of all four function units. I apply trivial instruction analysis to each function unit to further attribute un-ACE bits to different instructions. The semantics of trivial instructions imply that at least one input value to the function unit can be un-ACE. There are other instructions that only produce 32-bit outputs; in that case, the upper 32 bits in the output datapath become idle. As is shown, the function units' ACE bits are mainly caused by the operands of ACE instructions as well as the output of those instructions. Because the majority of instructions have two input operands and one output operand, the input bits contribute more ACE bits in the function units than the output bits do.


The level-1 data cache, data TLB, victim buffer and load/store queues are address-based structures. I applied lifetime analysis to both the tag and data arrays of these structures and classified their lifetimes into ACE, un-ACE and unknown components. The Sim-SODA framework uses bit-level analysis for the tag arrays and byte-level analysis for the data arrays. I implemented the COOLDOWN mechanism to reduce the unknown fraction, since edge effects can be significant in these structures [4]. To avoid false positive and false negative matches in the tag arrays, I have also implemented the hamming-distance-one analysis method [4] in Sim-SODA.

Figures 2-8 and 2-9 show the data array and tag array AVFs for the L1 data cache (DL1), data TLB (DTLB), victim buffer (VBuf), load queue (LQ) and store queue (SQ). As can be seen, the L1 data cache tag array's AVF is higher than the data array's AVF. This is because the L1 data cache in the Alpha 21264 [17, 18] is a write-back cache and the cache tag must be correct at eviction time; therefore, all bits of the tag are ACE from the time any write activity occurs in that entry until it is evicted. The same scenario happens in the load and store queues. In contrast, the victim buffer tag array's AVF is lower than its data array's AVF because it is write-through and the ACE time in the tag array is much lower.


Table 2-1. A comparison of different architectural-level reliability analysis tools

Mukherjee et al. [22], Biswas et al. [4]:
    Methodology: AVF
    Hardware structures modeled: issue queue, function unit, data cache and TLB, store buffer
    Baseline model and availability: Asim, Intel's proprietary tool for modeling an Itanium 2-like processor
    Comment: complex hardware such as the issue queue is modeled as a bulk structure

Wang et al. [27]:
    Methodology: statistical fault injection
    Hardware structures modeled: pipeline and its control states
    Baseline model and availability: Verilog model of an Alpha processor
    Comment: a subset of the Alpha ISA is modeled; caches are not modeled; RTL models are not usually available at an early design stage

SoftArch [19]:
    Methodology: probabilistic model of error generation and propagation
    Hardware structures modeled: instruction buffer, decode unit, register file, functional unit, TLB, issue queue
    Baseline model and availability: Turandot, available on request
    Comment: memory hierarchy is not modeled; complex hardware such as the issue queue is modeled as a bulk structure

Sim-SODA [this dissertation]:
    Methodology: AVF, AVF for address-based structures, and hybrid AVF computation
    Hardware structures modeled: issue queue, register file, function unit, cache, TLB, ROB, load/store queue, victim buffer
    Baseline model and availability: Sim-Alpha, publicly available
    Comment: fine-grained AVF models for complex structures; covers more hardware structures

Table 2-2. Trivial instructions (in Alpha ISA) identified by Sim-SODA

Type I (two source registers):
    MULL/MULQ/MULH (A × B): trivial when A=0 or B=0
    AND (A & B): trivial when A=0 or B=0
    BIS (A | B): trivial when A=1 or B=1
    BIC (A & ~B): trivial when A=0 or B=1
    ORNOT (A | ~B): trivial when A=1 or B=0
Type II (one source register and an immediate):
    MULLI/MULQI/MULHI (A × IMM): trivial when A=0 or IMM=0
    ANDI (A & IMM): trivial when A=0 or IMM=0
    BISI (A | IMM): trivial when A=0 or IMM=1
    BICI (A & ~IMM): trivial when A=0 or IMM=1
    ORNOTI (A | ~IMM): trivial when A=0 or IMM=0
Type III (identical register specifiers):
    XOR (A ^ B): trivial when A=B
    EQV (A ^ ~B): trivial when A=B


Table 2-3. Simulated machine configuration

Parameter                            Configuration
Pipeline depth                       7
Integer ALUs/multipliers             4/4
Integer ALU/multiplier latency       1/7
Fetch/slot/map/issue/commit width    4/4/4/4/11 instructions per cycle
Issue queue size                     20
Reorder buffer size                  80
Register file size                   80
Load/store queue size                32
Branch predictor                     hybrid, 4K global + 2-level 1K local + 4K choice
Return address stack                 32-entry
Branch misprediction penalty         7 cycles
L1 instruction cache                 64KB, 2-way, 64B line, 1-cycle latency
L1 data cache                        64KB, 2-way, 64B line, 3-cycle latency
L2 cache                             2048KB, direct mapped, 64B line, 7-cycle latency
TLB size                             128-entry ITLB/128-entry DTLB, fully associative
MSHR entries                         8 per cache
Prefetch MSHR entries                2 per cache
Victim buffer                        8 entries, 1-cycle hit latency

Table 2-4. SPEC 2000 INT and BioInfoMark benchmarks (the data set name is integrated with the benchmark name)

SPEC 2000 INT   Instructions Fast-Forwarded    BioInfoMark   Instructions Fast-Forwarded
bzip2-s             64 M                       clustalw-u      20,400 M
gcc-1               30 M                       dnapenny-r         140 M
crafty             123 M                       glimmer-b           20 M
eon-r              216 M                       hmmer-S         27,200 M
gap                 88 M                       predator-e      25,900 M
gzip-g               1 M                       promlk-1           320 M
mcf                143 M
parser-d         1,771 M
perlbmk-s            1 M
twolf              312 M
vortex3             47 M
vpr-r                3 M


Figure 2-1. ACE and un-ACE lifetime partitions due to the different orders of un-ACE reads (marked as read*) and ACE reads.
Figure 2-2. Instruction-level vulnerability profile of the studied benchmarks.
Figure 2-3. AVF of the instruction window.


Figure 2-4. AVF of the instruction window and wake-up table.
Figure 2-5. AVF of the ROB.
Figure 2-6. AVF of the register files.


Figure 2-7. AVF of the function units.
Figure 2-8. Data array AVF of the L1 data cache (DL1), data TLB (DTLB), victim buffer (VBuf), load queue (LQ) and store queue (SQ) across the SPEC 2000 INT and BioInfoMark benchmarks.
Figure 2-9. Tag array AVF of the L1 data cache (DL1), data TLB (DTLB), victim buffer (VBuf), load queue (LQ) and store queue (SQ) across the SPEC 2000 INT and BioInfoMark benchmarks.


CHAPTER 3
SOFT ERROR VULNERABILITY PHASE CHARACTERIZATION

Experimental results from Sim-SODA show that microarchitecture components manifest significant time-varying behavior in their run-time reliability characteristics. To identify appropriate reliability estimators, I used various performance metrics and examined their correlations with vulnerability factors at the microarchitecture level. I found that, in general, a single performance metric is insufficient to predict program vulnerability behavior. I then explored code-structure-based and run-time-event-based phase analysis techniques and compared their reliability phase prediction abilities.

For my reliability-oriented phase characterization experiments, I simulated an Alpha-21264-like 4-way dynamically scheduled microprocessor with a two-level cache hierarchy. The baseline microarchitecture model is detailed in Table 2-3. I used 10 SPEC 2000 integer benchmarks and obtained AVF and performance characteristics using Sim-SODA. To perform this study within a reasonable amount of time, I used the MinneSPEC [59] reduced input datasets. However, bzip2 and gzip still require a significant amount of time to finish simulation; therefore, I could not include their results in this dissertation. In order to collect enough interval information for accurate AVF phase behavior classification, I chose each benchmark's interval size based on the total dynamic instructions executed. For instance, gcc and parser have 100,000 instructions in each interval, while perlbmk has an interval size of 200,000 instructions.

Characterizing Run-Time Microarchitecture Vulnerability to Soft Errors

Time Varying Behavior of Microarchitecture Vulnerability

Figure 3-1 shows the run-time instruction window AVF profile for the benchmarks gcc and parser. Each point of the plots represents the AVF of an interval with a size of 100,000 instructions. (I do not show the AVF profiles of all 10 benchmarks because of space limitations.)


As can be seen, during program execution the instruction window's vulnerability shows significant variation between intervals, and some intervals also show similar AVF behavior regardless of temporal adjacency. Table 3-1 lists AVF statistics in terms of the mean and coefficient of variation (COV) for all studied structures and all simulated benchmarks. The COV has been widely used in the evaluation of phase analysis techniques. It is the standard deviation divided by the mean; in other words, the COV measures the standard deviation as a percentage of the mean.

Table 3-1 shows that although both the instruction window and the ROB are used to support out-of-order execution, they show a significant discrepancy in their AVFs: the ROB's AVF is lower than the instruction window's AVF on a majority of the studied benchmarks. This can be explained as follows: the Alpha 21264 processor has separate integer and floating-point instruction windows [56]. The integer instruction window has 20 entries, while the ROB holds all types of instructions and has 80 entries. Due to the lack of floating-point operations, the fraction of idle bits in the ROB is much higher than that in the instruction window. Table 3-1 also shows that the COVs of the run-time AVFs on mcf, gap, and vortex are much higher than those on the remaining benchmarks. This indicates that these programs contain complex vulnerability phases which may be challenging for accurate phase classification.

How Correlated Is AVF to Performance Metrics?

If a hardware structure's AVF shows strong correlation with a simple performance metric, its AVF can be predicted using that performance metric across different benchmarks. To determine how correlated vulnerability is to performance metrics, I calculated the correlation coefficients between each run-time hardware structure's AVF and each of the simple performance metrics, using statistics gathered from each interval during program execution. I chose widely used performance metrics such as IPC, branch prediction rate, L1 data and instruction cache miss rates, and L2 cache miss rate.
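The per-interval correlation can be computed in the standard Pearson form; a minimal sketch, assuming one AVF sample and one metric sample per interval, is:

    /* Pearson correlation between per-interval AVF and a performance metric. */
    #include <math.h>
    #include <stddef.h>

    static double pearson(const double *x, const double *y, size_t n)
    {
        double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
        for (size_t i = 0; i < n; i++) {
            sx += x[i]; sy += y[i];
            sxx += x[i] * x[i]; syy += y[i] * y[i];
            sxy += x[i] * y[i];
        }
        double cov = sxy - sx * sy / n;   /* n times the covariance   */
        double vx  = sxx - sx * sx / n;   /* n times the variance of x */
        double vy  = syy - sy * sy / n;   /* n times the variance of y */
        return cov / sqrt(vx * vy);       /* result lies in [-1, 1]    */
    }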


Figure 3-2 shows that there are fuzzy correlations between the AVFs and the different performance metrics. For example, for the instruction window, IPC shows strong correlation with AVF on the benchmarks perlbmk, vortex, mcf and crafty, while yielding near-zero correlation coefficients on twolf and eon. Intuitively, high IPC reduces the ACE bits' residency time in a microarchitecture structure, resulting in a reduction of the AVF. On the other hand, the high ILP (usually manifested as high IPC) in a program can cause the microprocessor to aggressively bring more instructions into the pipeline, increasing the total number of ACE bits in a microarchitecture structure. Interestingly, I observed that the same performance metric (e.g., IPC) can exhibit both positive and negative correlations with the AVFs of different structures running the same benchmark. For example, on perlbmk, the correlation coefficient between IPC and the ROB's AVF is -0.98, while the correlation coefficients between IPC and the other structures' AVFs are all around 0.99. As mentioned in Chapter 2.2, the ACE bits' residency time plays an important role in determining AVF, and for the same instruction, its residency time in different structures can vary significantly. For example, when a processor executes a low-ILP code segment, few instructions can graduate. This can increase the ROB's AVF if those instructions contain a significant amount of ACE bits. Note that low IPC does not necessarily mean a long residency time of instructions in the instruction window, the functional units and the wakeup table: the instructions may have been completed long before their final graduation from the ROB because of out-of-order execution and in-order commitment. Overall, the results shown in Figure 3-2 suggest that I simply cannot use one performance metric to indicate hardware reliability behavior.

Program Reliability Phase Classification

Various phase detection techniques have been proposed in the past. In [26, 37, 41, 60, 62], phases are classified by examining a program's control flow behavior via features such as basic blocks, working sets, program counters, call graphs and control branch counters.


This allows phase tracking to be independent of the underlying architecture configuration. In [39, 40], performance characteristics of the applications are used to determine phases. These methods detect phase changes without using knowledge of program code structure; however, the identified phases may not be consistent across different architectures. There are also previous studies comparing and evaluating different phase detection techniques. In [38], Dhodapkar and Smith showed that basic block vectors perform better than instruction working sets and control branch counters in tracking performance-oriented phases. The studies reported in [41, 43] revealed the correlations between applications' phase behavior and code signatures. Isci and Martonosi recently conducted a study [40] in which they compared different phase classification techniques in tracking run-time power-related phases. Although the above methods have been used in performance and power characterization, their effectiveness in classifying program reliability-oriented phases remains unknown. In this section, I examine program vulnerability phase detection using two popular techniques: the basic-block-vector-based and the performance-event-counter-based schemes. A complete comparison of all of the phase analysis techniques exceeds the scope of this dissertation.

Basic Block Vectors vs. Performance Monitoring Counters

A basic block is a single-entry, single-exit section of code with no internal control flow. Therefore, sequences of basic blocks can be used to represent the control flow behavior of applications. In [26], Sherwood and Calder proposed the use of Basic Block Vectors (BBV) as a metric to capture a program's phase behavior. They first determine all of the unique basic blocks in a program, create a BBV to represent the number of times each unique basic block is executed in an interval, and then weight each basic block so that instructions are represented equally regardless of basic block size. In this dissertation, I quantify the effectiveness of basic block vectors in capturing AVF phase behavior across several different microarchitecture structures.
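A per-interval BBV can be built with bookkeeping along the following lines (a sketch with assumed array sizes and hypothetical names; the SimPoint tool additionally applies random projection to reduce the vector's dimensionality):

    /* BBV construction sketch (hypothetical bookkeeping). */
    #include <stddef.h>

    #define NUM_BLOCKS 65536                      /* assumed static block count */

    static unsigned long exec_count[NUM_BLOCKS];  /* per-interval exec counts   */
    static unsigned block_size[NUM_BLOCKS];       /* instructions per block     */
    static double bbv[NUM_BLOCKS];                /* weighted, normalized vector */

    /* called at the end of each execution interval */
    static void finish_interval(void)
    {
        double total = 0.0;
        for (size_t i = 0; i < NUM_BLOCKS; i++) {
            bbv[i] = (double)exec_count[i] * block_size[i];
            total += bbv[i];
        }
        for (size_t i = 0; i < NUM_BLOCKS; i++) {
            if (total > 0.0)
                bbv[i] /= total;  /* fraction of the interval's instructions */
            exec_count[i] = 0;    /* reset for the next interval             */
        }
    }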


In [39], Isci and Martonosi showed that hardware performance monitoring counters (PMC) can be exploited to efficiently capture program power behavior. To examine an approach that uses performance characteristics in AVF phase analysis, I collected a set of performance monitoring counters to build a PMC vector for each execution interval. I then used the PMC vectors as a metric to classify AVF phases. In my experiments, I instrumented the Sim-SODA framework with a set of 18 performance counters and dumped the statistics of these counters for each execution interval. I then ran an exhaustive search on the 18 counters to determine the 15 counters (shown as PMC-15 in Table 3-2) that characterized the AVF phases most accurately. I chose the size of the counter vector to be 15 because a BBV is commonly reduced to 15 dimensions after random projection [26]. To investigate whether the AVF phases are predictable using a small set of counters, I further reduced the PMC vector to 5 dimensions (shown as PMC-5 in Table 3-2). The 5 counters were chosen based on their importance in performance characterization and their generality and availability across different hardware platforms.

The BBVs of each benchmark were generated using the SimPoint tool set [26], and the PMC vectors were dumped for every interval during Sim-SODA simulation. I then ran the k-means clustering algorithm [61] to compare the similarity of all the intervals and to group them into phases. Table 3-3 lists the number of phases identified by the k-means clustering algorithm. In this study, I also examined the use of hybrid information that combines both BBV and PMC vectors to characterize reliability phases. More detailed information about the hybrid schemes can be found in Chapter 3.2.2. As shown in Table 3-3, the BBV generates a slightly larger number of phases than PMC-15.
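The clustering step itself is textbook k-means over the interval vectors; a simplified sketch follows (naive initialization and a fixed iteration count; the actual tooling differs in how centroids are seeded and how the number of clusters is chosen):

    /* k-means sketch for grouping interval vectors (BBV or PMC) into phases. */
    #include <float.h>
    #include <string.h>

    #define DIM 15   /* vector dimensionality (e.g., PMC-15) */
    #define K   32   /* number of phases to form             */

    static double dist2(const double *a, const double *b)
    {
        double d = 0.0;
        for (int j = 0; j < DIM; j++) {
            double t = a[j] - b[j];
            d += t * t;
        }
        return d;
    }

    /* v: n interval vectors; phase[i] receives interval i's cluster id */
    static void kmeans(double v[][DIM], int n, int phase[], int iters)
    {
        static double c[K][DIM];
        static int cnt[K];
        for (int k = 0; k < K; k++)        /* naive init: spread over inputs */
            memcpy(c[k], v[(long)k * n / K], sizeof(c[k]));
        while (iters-- > 0) {
            for (int i = 0; i < n; i++) {  /* assignment step */
                double best = DBL_MAX;
                for (int k = 0; k < K; k++) {
                    double d = dist2(v[i], c[k]);
                    if (d < best) { best = d; phase[i] = k; }
                }
            }
            memset(c, 0, sizeof(c));       /* update step: recompute means */
            memset(cnt, 0, sizeof(cnt));
            for (int i = 0; i < n; i++) {
                cnt[phase[i]]++;
                for (int j = 0; j < DIM; j++)
                    c[phase[i]][j] += v[i][j];
            }
            for (int k = 0; k < K; k++)
                if (cnt[k])
                    for (int j = 0; j < DIM; j++)
                        c[k][j] /= cnt[k];
        }
    }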


The PMC-15 scheme detects more phases than PMC-5 on a majority of the studied benchmarks.

I use the COV to compare different phase classification methods. After classifying a program's intervals into phases, I examined each phase and computed the average AVF of all of the intervals within the phase. I then calculated the standard deviation of the AVF for each phase and divided it by the average to get the COV. I calculate an overall COV metric for all phases by taking the COV of each phase, weighting it by the percentage of execution that the phase accounts for, and then summing the weighted COVs. Better phase classification will exhibit a lower COV; in the extreme case, if all of the intervals in the same phase have exactly the same AVF, the COV will be zero (a sketch of this metric follows below).

Figure 3-3 shows that both BBV and PMC methods can capture program reliability phase behavior on all of the studied hardware structures. Compared to the COV baseline case without phase classification, the BBV-based techniques achieve significantly higher accuracy in identifying reliability phases. On average, the BBV-based scheme reduces the COVs of AVF by 6x, 5x, 6x, and 4x on the issue queue, ROB, function unit and wakeup table respectively. Figure 3-3 further shows that the PMC-15 scheme can achieve even higher accuracy in characterizing program reliability behavior. For example, PMC-15 yields lower COVs than BBV on 8 out of the 10 studied benchmarks for all of the examined structures. On the benchmarks eon and vpr, BBV performs better than PMC-15 because of the weak correlations between performance metrics and vulnerability shown in Figure 3-2. Overall, the PMC-15 scheme leads to average COVs of 3.5%, 4.5%, 4.3% and 5.7% on the issue queue, reorder buffer, function unit and wakeup table, while BBVs achieve COVs of 4.9%, 5.8%, 5.4% and 6% on the four studied microarchitecture structures respectively.
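The weighted COV metric sketched below scores a classification exactly as described above: per-phase standard deviation divided by mean, weighted by the fraction of intervals each phase covers.

    /* Weighted-COV sketch; phase[] and avf[] hold one entry per interval. */
    #include <math.h>

    #define NPHASES 32  /* upper bound on cluster ids from the k-means step */

    static double weighted_cov(const int phase[], const double avf[], int n)
    {
        double result = 0.0;
        for (int k = 0; k < NPHASES; k++) {
            double sum = 0.0, sum2 = 0.0;
            int cnt = 0;
            for (int i = 0; i < n; i++) {
                if (phase[i] != k) continue;
                sum += avf[i];
                sum2 += avf[i] * avf[i];
                cnt++;
            }
            if (cnt == 0 || sum == 0.0) continue;  /* skip empty/zero phases */
            double mean = sum / cnt;
            double var = sum2 / cnt - mean * mean;
            result += (sqrt(var > 0 ? var : 0) / mean) * ((double)cnt / n);
        }
        return result;  /* lower is better; 0 if every phase is AVF-uniform */
    }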


As I expected, Figure 3-3 shows that the COV of AVF for PMC-5 is higher than that for PMC-15 in most cases. This is because PMC-15 provides additional information that can improve the accuracy of phase clustering. The only exception is vortex, shown in Figure 3-3 (A). From Figure 3-2 (A), one can see that on the benchmark vortex several performance metrics (e.g., IPC, DL1 miss rate and L2 miss rate) already exhibit strong correlation with the instruction window's AVF. When more performance metrics are included, it is possible that additional noise is also introduced into the clustering. However, this situation does not happen frequently, as I observed only one exception among all the cases that I analyzed. On average, reducing the PMC dimensionality from 15 to 5 increases the COV by 1.3%, 0.8%, 1.5% and 2.1% on the 4 studied microarchitecture structures respectively. In practice, gathering information for 5 counters is much easier than collecting statistics for 15 counters. This suggests that the PMC-5 scheme can be used as a good approximation to the more expensive PMC-15 scheme. Compared with the BBV technique, the PMC-5 scheme yields better AVF prediction for the ROB but exhibits worse accuracy for the wakeup table. For the instruction window and function unit, both schemes show similar performance in AVF classification. In summary, my experiments show that both BBV and PMC phase analysis have significant benefit in characterizing reliability behavior, and in general the PMC-15 phase analysis performs better than the BBV-based approach.

The Effectiveness of Hybrid Schemes

I also explored two hybrid phase detection approaches that combine the information used by the BBV and PMC schemes. In my first experiment, I used random projection to obtain a BBV with 10 dimensions and then concatenated it with the PMC-5 vector of the same program execution interval to form a hybrid vector, i.e., Hybrid(10+5). In my second experiment, I appended the PMC-5 vector to the original BBV to form a 20-dimension hybrid vector, i.e., Hybrid(15+5).


I then invoked the k-means clustering algorithm on the two hybrid vectors. I compared the effectiveness of the hybrid schemes with the default BBV and a 20-dimension BBV (BBV-20).

Figure 3-4 shows the COVs of the Hybrid(10+5), BBV, Hybrid(15+5) and BBV-20 schemes. As can be seen, simply concatenating BBV and PMC information does not show significant benefit in reliability phase detection. In fact, Hybrid(10+5) shows slightly worse performance than BBV on a majority of benchmarks, and a similar observation holds when Hybrid(15+5) and BBV-20 are compared. One possible reason is that when the PMC vector and the BBV are merged together, their information can interfere with each other. Such interference can reduce the quality of AVF phase classification. Although I only examine two combinations of PMC and BBV in this dissertation, it may be possible to gain some benefit from other combinations that take advantage of both schemes while avoiding interference between them. Further exploration of effective hybrid phase detection schemes is one aspect of my future work.

Sensitivity Analysis

Phase analysis techniques which use architecture-dependent characteristics (e.g., PMC) may be sensitive to the machine configuration. To evaluate the robustness of the previously studied phase classification techniques in the context of AVF phase classification, I examined the applicability of using PMC, BBV, and hybrid techniques to detect program reliability phases across different architectures. As my baseline model is a 4-way issue Alpha-21264-like superscalar microprocessor, to generate different machine configurations I varied the bandwidth of each pipeline stage, modified the latency, size and associativity of the L2 cache, and resized the branch predictor, load/store queue and register files. I created two additional configurations to model 8-issue and 16-issue processors. The 8-issue machine has 120 physical registers, a 20KB hybrid branch predictor, a 64-entry load/store queue, and an 8MB 8-way set-associative L2 cache.


I also increased the branch misprediction penalty to 14 cycles on the 8-issue processor, and I used a similar approach to further scale up resources for the 16-issue machine. Using these new configurations, I ran the Sim-SODA framework on all of the benchmarks. The collected data were then processed using the k-means clustering algorithm to obtain the phase classification results (shown in Figure 3-5).

Figure 3-5 shows that the COVs of reliability phase classification on the 8-issue architecture are very close to those on the 16-issue architecture, regardless of the phase characterization technique. This indicates that the various phase classification schemes I examined in Chapters 3.2.1 and 3.2.2 show similar performance in reliability phase detection across different architecture configurations. Interestingly, I observed that the COV on the 8- and 16-issue machines is higher than that on the 4-issue machine. Intuitively, as the processor issue width increases, programs show larger variation in their characteristics because of the reduced resource constraints, and the increased standard deviation can result in an increased COV. However, if the configuration of the hardware structures (i.e., instruction window, ROB, function unit and wakeup table) is fixed, the impact of the increased issue width is limited by a threshold: the structures' AVF will not change significantly once the issue width surpasses it. This explains the noticeable increase in COV when the machine configuration is varied from 4-issue to 8-issue and the diminishing variation in COV when the issue width is further increased to 16. Table 3-4 shows the AVF phase classification error on different machine configurations. Due to space limitations, I report the average statistics of the 10 studied benchmarks. Compared with the COV, the AVF errors change only slightly across different configurations. This is because positive and negative errors may cancel each other out, resulting in an overall reduction in that metric.


The results shown in Table 3-4 indicate that the various techniques are still able to track the aggregated AVF behavior on the more aggressive 8- and 16-issue machines despite the increased COV in phase classification.


Table 3-1. Variation in run-time microarchitecture vulnerability

Benchmark   Inst. Window     Reorder Buffer   Function Unit    Wakeup Table
            MEAN   COV       MEAN   COV       MEAN   COV       MEAN   COV
mcf          23    86%        23    28%        17    100%       11    28%
eon          37     7%        17     7%        31     10%        9    17%
gcc          27    11%        10    15%        14     10%        5    12%
gap          24    96%        40    43%         5    121%        9    45%
parser       32    16%        13    13%        19      5%        6    20%
perlbmk      29     9%        12    47%        16      8%        8    13%
twolf        40    13%        15    14%        18      8%       14    20%
vortex       26    20%        13    54%        17     23%        3    24%
vpr          40     9%        15    20%        19     10%       12    14%
crafty       29    25%        10    37%        15     17%        7    33%
AVG          31    29%        17    28%        17     31%        8    23%

Table 3-2. Events used by the PMC-15 and PMC-5 schemes

Event                            PMC-15   PMC-5
Number of loads                    X
Number of stores                   X
Number of branches                 X
Total instructions
Total cycles                       X        X
Correctly predicted branches       X        X
Pipeline flushes due to branch     X
Pipeline flushes due to other      X
Victim buffer misses               X
Integer register file reads        X
Integer register file writes       X
L1 data cache misses               X        X
L1 instruction cache misses        X
L2 cache misses                    X        X
Data TLB misses                    X
Squashed instructions              X        X
Idle entries in IQ
Idle entries in ROB


Table 3-3. Number of phases identified by the k-means clustering algorithm

Benchmark   BBV   PMC-15   PMC-5   Hybrid(5+10)   Hybrid(5+15)   BBV-20
crafty       35     28       26         33             32           34
eon          37     25       28         26             34           29
gap          28     25       23         24             22           24
gcc          30     26       25         29             28           33
mcf          25     27       22         22             24           25
parser       33     26       22         25             30           25
perlbmk      11     21       20         11             11           12
twolf        26     29       28         26             25           25
vortex       31     27       22         26             27           25
vpr          29     26       22         20             29           30

Table 3-4. AVF phase classification error on different machine configurations

                Instruction Window           Reorder Buffer
                4-issue  8-issue  16-issue   4-issue  8-issue  16-issue
PMC-5            0.70%    1.59%    1.86%      0.71%    2.24%    2.20%
PMC-15           2.06%    2.32%    1.04%      1.45%    1.20%    1.19%
BBV              3.21%    3.19%    3.09%      1.75%    2.83%    2.71%
BBV-20           2.74%    1.57%    1.79%      1.24%    1.26%    2.07%
Hybrid(10+5)     1.28%    1.66%    1.71%      0.98%    1.61%    2.10%
Hybrid(15+5)     1.48%    1.16%    1.39%      1.95%    1.35%    1.64%

Table 3-4. Continued

                Function Unit                Wakeup Table
                4-issue  8-issue  16-issue   4-issue  8-issue  16-issue
PMC-5            1.04%    1.38%    0.98%      1.69%    3.24%    2.59%
PMC-15           3.85%    2.50%    1.08%      3.72%    2.27%    2.40%
BBV              3.67%    3.63%    2.47%      1.91%    4.07%    2.24%
BBV-20           3.69%    2.40%    1.86%      1.50%    3.03%    3.51%
Hybrid(10+5)     0.83%    1.29%    0.90%      1.12%    1.96%    1.91%
Hybrid(15+5)     1.68%    0.88%    1.03%      1.81%    2.19%    1.99%


Figure 3-1. Run-time instruction window AVF profile on the benchmarks gcc and parser (y-axis: instruction window AVF (%); x-axis: execution time).


Figure 3-2. The correlations between AVFs and simple performance metrics (IPC, branch prediction rate, L1 data/instruction cache miss rates, L2 cache miss rate). A) Instruction window. B) Reorder buffer. C) Function unit. D) Wakeup table.


Figure 3-3. COVs yielded by different phase classification schemes (PMC-15, PMC-5, BBV). A) Instruction window. B) Reorder buffer. C) Function unit. D) Wakeup table.


Figure 3-4. COVs yielded by hybrid schemes (Hybrid(10+5), BBV, Hybrid(15+5), BBV-20). A) Instruction window. B) Reorder buffer. C) Function unit. D) Wakeup table.


Figure 3-5. The COVs of reliability phase classification on the 8-issue and 16-issue processors. A) Instruction window. B) Reorder buffer. C) Function unit. D) Wakeup table.


CHAPTER 4
ISSUE QUEUE RELIABILITY OPTIMIZATION

Using a reliability-aware architecture simulator, I characterize the soft error vulnerability of several key microarchitecture structures in an SMT processor. Results are shown in the form of the architecture vulnerability factor (AVF) [22] of a hardware structure, which estimates the probability that a transient fault in that structure will produce incorrect program results. Details of my simulation framework, machine configuration, evaluated workloads and metrics are presented in Section 4.1.3. Figure 4-1 indicates that among the microarchitecture structures studied, the IQ exhibits the highest vulnerability. This suggests that the IQ is likely to be a reliability hot-spot in SMT processors fabricated using advanced technology nodes. In this chapter, I explore several optimizations to mitigate IQ soft error vulnerability on SMT architectures. I focus my study on the IQ because it shows the highest vulnerability; this is a necessary first step towards protecting an entire chip with multiple components, to which similar principles might also be applicable.

Vulnerable InStruction Aware (VISA) Issue

Since ACE instructions are the major contributor of ACE bits, the IQ soft error vulnerability can be mitigated by reducing the residency cycles and quantity of ACE instructions in the IQ. To reduce ACE instruction residency cycles in the IQ, I propose Vulnerable InStruction Aware (VISA) issue, which gives ACE instructions higher priority than un-ACE instructions. Therefore, once there is a ready ACE instruction, it can bypass all the ready-to-execute un-ACE instructions. If there are several ready ACE instructions, they are issued in program order. Note that the ready un-ACE instructions cannot be issued until all the ready-to-execute ACE instructions have been issued. If the number of ready ACE instructions is less than the number of available issue slots, the ready un-ACE instructions can also be issued in their program order.
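As a concrete illustration, the selection loop below sketches this policy in Python; the instruction fields (is_ace, seq) and the issue_width parameter are illustrative names rather than identifiers from my simulation framework:

    def visa_select(ready_insts, issue_width):
        """Sketch of VISA issue: ready ACE instructions are issued
        first, with program order within each class."""
        # Partition the ready queue by the ACE-ness tag.
        ace = sorted([i for i in ready_insts if i.is_ace], key=lambda i: i.seq)
        un_ace = sorted([i for i in ready_insts if not i.is_ace], key=lambda i: i.seq)
        # ACE instructions consume issue slots first; any leftover
        # slots are filled with un-ACE instructions in program order.
        return (ace + un_ace)[:issue_width]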


Workload IQ utilization characteristics (e.g., the average number of ready instructions per cycle and the fraction of ACE instructions among ready instructions) can significantly affect the efficiency of applying VISA issue for IQ vulnerability reduction. For example, if the average number of ready instructions per cycle is smaller than the issue bandwidth, it is unnecessary to use the VISA issue policy since all ready ACE and un-ACE instructions can be issued within the same cycle. If the average fraction of ACE instructions per cycle is small (or there is no ACE instruction in the ready state at all), the benefit of applying the VISA issue policy will be marginal. To answer the above questions, I set up experiments to characterize an SMT processor's IQ utilization from both performance and reliability perspectives. In each cycle, I count the number of instructions in the ready queue (a collection of IQ entries that holds ready-to-execute instructions) and break them down into ACE and un-ACE instructions. I then plot the histograms of ready queue size with the corresponding ACE instruction percentage. Figure 4-2 shows results for the 4-context CPU workload. The Y-axis on the left represents the probabilistic ready queue length distribution and the Y-axis on the right represents the corresponding ACE instruction percentage.

As can be seen, the maximal ready queue length is 73. This indicates that on SMT processors, exploiting TLP increases the number of ready-to-execute instructions in the IQ. Interestingly, a hill is shown in the ready queue length distribution and its peak value is around 26. Moreover, only 10% of execution cycles have a ready queue length less than 9. Since the issue width of the SMT processor is 8, there are abundant ready-to-execute instructions that can be chosen by the issue logic. The ACE instruction percentage plot shows that on average, 60% of the ready instructions are ACE instructions. The fraction of ACE instructions becomes higher when the ready queue is short. The above observations indicate that there are good opportunities to reduce IQ vulnerability by giving ACE instructions a higher issue priority.

The VISA-based issue policy heavily relies on the capability of identifying the ACE-ness of each instruction in the IQ. However, in general, a retired instruction cannot be classified as ACE or un-ACE until a large number of its following instructions have graduated. In [22], a post-graduate instruction analysis window with a size of 40,000 instructions is used for vulnerability classification. To perform just-in-time vulnerability identification at the per-instruction level, I propose to perform instruction vulnerability characterization offline and extend the ISA (the Alpha ISA is applied in my work) to encode a 1-bit ACE-ness tag (i.e., ACE or un-ACE). When an instruction is decoded, its ACE-ness tag is checked to determine its vulnerability. The offline profiling statically classifies each PC as ACE or un-ACE. A PC is classified as ACE if any of its dynamic instances is identified as an ACE instruction. Instructions executed along mispredicted paths are not used for this classification. By doing this, I make my classification independent of the branch predictor implementation and the non-deterministic inter-thread branch aliasing. The profiling method is conservative since it can predict some finally squashed instructions as ACE. Moreover, the same instruction is not always ACE or un-ACE during the entire program execution. For example, an instruction within a loop may be un-ACE in the first several iterations, but become ACE at the last iteration if only the last iteration's computation result is consumed by other instructions, and vice versa. I refer to the case that un-ACE instructions are incorrectly identified as ACE as false-positive, whereas false-negative is the opposite case. As described above, using the instruction PC to identify vulnerable instructions avoids false-negatives; namely, no ACE instruction is mispredicted. However, my method can cause false-positives. Table 4-1 shows the accuracy of using the instruction PC to identify ACE instructions among committed instructions across different SPEC CPU2000 benchmarks. As can be seen, the identification accuracy in most applications is around 98% and the average accuracy is 94%. This indicates that false-positive matches happen infrequently. My PC-based ACE-ness classification can incorrectly predict a number of squashed instructions as ACE if their PCs are tagged as ACE; however, the prediction still achieves good accuracy (83% on average) when squashed instructions are considered.
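The offline profiling pass behind these numbers can be summarized by the following sketch; the trace record fields are hypothetical stand-ins for the simulator's retirement stream:

    def profile_ace_pcs(trace):
        """Conservative, PC-level ACE-ness classification (offline).
        A static PC is tagged ACE if any committed dynamic instance
        of it is ACE; mispredicted-path instructions are excluded."""
        ace_pcs = set()
        for inst in trace:
            if inst.on_mispredicted_path:
                continue              # keeps the result predictor-independent
            if inst.is_ace:
                ace_pcs.add(inst.pc)  # a single ACE instance taints the PC
        return ace_pcs                # later encoded as the 1-bit ISA tag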


the ready instructions are ACE instructions. Th e fraction of ACE instructions becomes higher when the ready queue is short. The above obser vations indicate that th ere are good opportunities to reduce IQ vulnerability by giving ACE instructions a higher issue priority. The VISA-based issue policy heavily relies on the capability of identifying the ACE-ness of each instruction in the IQ. However, in genera l, a retired instruction cannot be classified as ACE or un-ACE until a large amount of its followi ng instructions have graduated. In [22], a post-graduate instruction analysis window w ith a size of 40,000 instru ctions is used for vulnerability classification. To perform the ju st-in-time vulnerability identification at perinstruction level, I propose to perform instru ction vulnerability characterization offline and extend ISA (Alpha ISA is applied in my work) to encode the 1-bit ACE-ness tag (e.g. ACE or un-ACE). When an instruction is decoded, its ACE-ness tag will be checked to determine its vulnerability. The offline profiling statically cl assifies each PC as ACE or un-ACE. A PC is classified as ACE if any of its dynamic instances is identified as an ACE instruction. Instructions executed along the mispredicted paths are not used for this classification. By doing this, I make my classification independent of branch predictor implemen tation and the non-deterministic inter-thread branch aliasing. The profiling method is conservative since it can predict some finally squashed instructions as ACE. Moreover, the same instruction is not always ACE or unACE during the entire program execution. For exam ple, an instruction within a loop may be unACE in the first several iterations, but become s ACE at the last iteration if only the last iterations computation result is consumed by other instructions, and vise versa. I refer falsepositive to the case that un-ACE instructions are incorrectly identified as ACE, whereas falsenegative is the opposite case. As described above, using instruct ion PC to identify vulnerable instructions can avoid false-nega tive, namely, no ACE instruction is mispredicted. However, my 62


method can cause false-positive. Table 4-1 show s the accuracy of using instruction PC to identify ACE instructions from committed instructions across different SPEC CPU2000 benchmarks. As can be seen, the identification accuracy in most applications is around 98% and the average accuracy is 94%. This indicates the false-positive matches happen infrequently. My PC-based ACE-ness classification can incorrectly pr edict a number of squashed instructions as ACE if their PCs are tagged as ACE, however, the prediction still achie ves good accuracy (83% on average) when squashed instructions are considered. Exploring VISA Issue Based Optimizations The vulnerability-aware instru ction scheduling shows the poten tial to reduce the residency cycles of vulnerable bits in the IQ. In this section, I further apply dynamic resource allocation to the IQ to prevent excessive vulnerable bits from entering the IQ. Combining with vulnerabilityaware instruction scheduling, the proposed schemes achieve significant IQ reliability enhancement by effectively reducing the number of vulnerable bits and their residency in the IQ. Optimization 1: dynamic IQ resource allocation Using VISA-based issue reduces ACE instructio ns residency cycles but it also increases the overall IQ utilization. Because ACE instructi ons generally exhibit longer data dependence chains than unACE instructions, more ILP can be exploited by issuing ACE instructions earlier. As a result, more instructions in ROB can be disp atched into IQ. From the reliability perspective, as the number of instructions in IQ increases which results in more ACE bits moving to the IQ, the IQ becomes more vulnerable to soft erro r strikes. If IQ resource allocation can be dynamically controlled to maintain performan ce while minimizing soft error vulnerability, a better performance/reliabili ty tradeoff can be achieved. In this section, I explore reliability-aware IQ resource allocation to contro l the quantity of ACE bits that the processor can bring into the 63


IQ. Incorporated with VISA issue policy, the pr oposed techniques can effectively mitigate IQ vulnerability to soft error. When an instruction is in the dispatch stage, the processor checks whether there is a free entry in the IQ. If the IQ is fully occupied, the instruction has to stay in the ROB to wait for an available entry. To reduce IQ vulnerability, I setup a threshold to cap IQ utilization. The instruction dispatch logic compares current IQ ut ilization with this threshold. If the current IQ utilization is higher than the threshold, the processor will not allocate new IQ entries for instructions even there are idle entries in the IQ As a result, the processor can not dispatch any instruction to the IQ until the IQ utilization drop below the th reshold. To identify the optimal threshold, a large number of off-line simulations need to be performed for each SMT workload. In this dissertation, I propose an on-line mechanism that learns and adapts the IQ utilization threshold dynamically. I sample workload executi on characteristics within fixed-size intervals and then use the gathered statistics to setup IQ resource utilization appropriately for the next execution interval. Figure 4-3 shows the dynamic IQ resource allocation algorithm. In Figure 4-3, RQL is the ready queue length an d IQ_SIZE is the total size of the IQ. As can be seen, I adapt IQ resource allocation strategies using worklo ad IPC. This is because IQ utilization can highly correlate wi th IPC: high-IPC workloads require more IQ entries to exploit ILP. Since processor performance is sensitive to the size of ready queue, I incorporate it in my decision making policies. To give vulnerability re duction a priority, I use static caps which are proportional to the total size of the IQ. Since the maximal commit bandwidth of the studied SMT processor is 8, I partition the IPC into 4 non-overlapping regions (my experimental results show that 4 regions outperform other number of regions) and set up the ratios using the low and high IPCs of each region. Alternatively, these ratios can be dynamically setup using the actual IPC. I 64


experiment with dynamic ratio setup using linear models that correlates with IPC. Simulation results show that both static and dynamic ratios show similar efficien cy. I use static ratios in this dissertation due to their simplicity. The interval size is another important parameter in my design. If it is too large, the technique will not be adaptive enough to the resource requirements. If it is too small, the technique will be too sensitive to the cha nges of workload behavior. After experimenting with various interval sizes, I choose an interval si ze of 10K instructions in my design. Optimization 2: handling L2 cache misses As the number of L2 cache miss increases, the ready queue becomes shorter and there will be more instructions sitting in the waiting qu eue until the cache misses are solved. Due to the clogged IQ, processors exhibit low IPC perfor mance. After the cache misses are solved, the ready queue size and IPC increase s rapidly. However, as shown in Figure 4-3 when the IPC and the ready queue length are both low, the number of allocated IQ entries w ill be small. This will limit the number of instructions that can be dispatched to the waiting queue. After L2 misses are solved, the number of instructions that become ready-to-execute will be much less than the scenario where no IQ resource control is perfor med. Therefore, the optimization shown in Figure 4-3 will result in noticeable performance degrad ation if there are frequent L2 cache misses. Recall that my goal is to mitigate IQ soft e rror vulnerability with negligible performance penalty. To achieve this, I propose a L2-cache-miss sensitive IQ resource allocation strategy. As Figure 4-4 shows, when the L2 cache miss frequency is below a threshold ( Tcache_miss), the dynamic IQ resource allocation mechanism described in Figure 4-3 is used to perform reliability optimization. On the other hand, when the L2 cache miss frequency exceeds the threshold, FLUSH fetch policy [31] is used for vulnerabil ity mitigation. On SMT processors, FLUSH stalls threads that cause L2 cache misses by flushing th eir instructions from the pipelines. The de65


allocated pipeline resource can be efficiently used by the non-offending threads to boost their performance. Note that the cache miss threshold Tcache_miss is an important parameter in my design. Using different SMT workload mixes, I pe rformed a sensitivity analysis and choose 16 as the L2 cache miss threshold. Experimental Setup To evaluate the reliability and performance impact of the proposed techniques, I use a reliability-aware SMT simulation framework developed in [63]. It is built on a heavily modified and extended M-Sim simulator [33] which models a detailed, execution driven simultaneous multithreading processor. I use the computed AVF as a metric to quantify hardware soft error susceptibility. In my experiments, although ACE-ness is classified at instruction-level, the AVF computation is performed at bit-level. Table 42 lists microarchitecture parameters of the SMT processor I considered in this study. The ICOUNT [35], which assign the highest priority to the thread that has the fewest in-flight instructions is used as the default fetch policy. In this work, I further examine the efficiency of the proposed IQ reliability optimiza tions using a set of advanced fetch policies such as STALL [3 1], DG [42], PDG [ 42] and FLUSH [31]. The SMT workloads in my experiments ar e comprised of SPEC CPU 2000 integer and floating point benchmarks. I create a set of SMT workloads with individual thread characteristics ranging from computation intensive to memory access intensive (see Table 4-3). The CPU and MEM workloads consist of programs all from th e computation intensive and memory intensive workloads respectively. Half of the programs in a SMT workload with mixed behavior (MIX) are selected from the CPU intensive group and th e rest are selected from the MEM intensive group. The results I presented in this disserta tion are average statistics of each benchmark category. I use the Simpoint tool [26] to pick th e most representative simulation point for each benchmark and each benchmark is fast-forwarded to its representative point before detailed 66


multithreaded simulation takes place. The simulations are terminated once the total number of simulated instructions 400 million. Evaluation In this section, I evaluate the efficiency of VISA-based issue and optimizations across various SMT workload mixes. Figure 4-5 presen ts the average IQ AVF and the throughput IPC yielded on each type of SMT workloads (CPU, MIX and MEM). ICOUNT is used as the default fetch policy. The IQ AVFs and throughput IPCs are normalized to the baseline case without any optimization. As I expected, by issuing ACE instructions fi rst, the IQ AVF is reduced moderately (5% on average), and throughput IPC is close to the baseline case (1% im provement on average). This is due to the higher IQ u tilization. The IQ AVF is further reduced by applying dynamic IQ resource allocation (VISA+opt1). Using CPU workloads as an example, IQ AVF is reduced by about 34% while maintaining the same IPC. Th is suggests that VISA+opt1 can be used to effectively control the IQ u tilization and more aggressively reduce vulnerability without performance penalty on computation intensiv e workloads. Conversely, on MIX and MEM workloads, VISA+opt1 reduces both throughput IP C and IQ AVF noticeably. This indicates that this scheme is not well suited to handling memo ry intensive workloads which introduce resource contention more frequently. The results become mo re promising, however, if I use the number of L2 cache misses to trigger FLUSH. When I apply VISA+opt2, the throughput IPC is improved 1% than the baseline case on the average, with a 48% reduction in IQ AVF. The IQ AVF reduction on MIX and MEM workloads (56%) is higher than that on CPU workloads (33%) because the baseline IQ AVF is lower on CPU workloads which encounter fewer resource clogs that extend ACE instruction residency. On MEM workloads, VISA+opt2 yields slightly lower IPC than the baseline case. This is caused by the FLUSH fetch policy. FLUSH continues to fetch 67


for at least one thread even if all other threads are stalle d and their corresponding pipeline resources have been de-allocated. Thus, the activ e threads instructions will occupy the entire pipeline. On MEM workloads, the performance of the active threads can not be improved much by increasing the pipeline resour ces due to their inherently lo w ILPs. Worse, the IQ can be occupied by active threads with even lower IPCs than the stalled threads. Interestingly, the normalized IPC yielded by VISA+opt2 on MIX workloads is highe r than that on the baseline case. Due to the mix of computation intensiv e and memory intensive programs, FLUSH is triggered less frequently. Further, when FLUS H is triggered, the proba bility that the active threads will be low IPC is less than that on the MEM workloads. If the active threads are computation intensive, the throughput IPC will increases. Therefore, the normalized IPC on MIX workloads shows less predictable behavior. VISA-based issue and optimizations can be in tegrated into any SMT fetch policy. Next I show the results achieved by using FLUSH, ST ALL, DG and PDG as the default fetch policy. Figures 4-6 (a) and (b) show the average IQ AVF and IPC in each workload category when the above fetch policies are used. The results ar e normalized to the baseline cases of the corresponding fetch policies. As is shown in the figures, even when advanced fetch policies are used, my approaches are still ab le to provide an impressive IQ AVF reduction of 36% with only a 1% performance penalty on the average. On MIX and MEM workloads, the IQ AVF reduction is less significant using the FLUSH policy than when using the other fetch policies. This is because the FLUSH baseline case is already proficient at handling resource congestion and its IQ AVF is already much lower than the baseline cases of the other fetch policies. On CPU workloads, however, the differences in the adva nced fetch policies do not affect the IQ AVF reduction since there are less cache misses and thus the advanced fetch polices have less 68


opportunity to take effect. Wh en VISA+opt2 is employed, IPC increases on MIX workloads but decreases on MEM workloads when comparing DG and PDG baseline cases. The trend is similar to that of using ICOUNT as the defa ult fetch policy (see Figure 4-5 (b)). Operand Readiness Based Instruction Dispatch (ORBIT) IQ Soft-Error Mitigation Through Instruction Dispatch In a dynamic-issue, out-of-order execution micr oprocessor, an instruction dispatched from the reorder buffer (ROB) will stay in the IQ until all of its so urce operands are ready and the appropriate functional unit is available. An in struction IQ residency time can be broken down into cycles during which the instruction is wa iting for its source opera nds and cycles during which the instruction is ready to execute but is waiting for an availa ble function unit. An instruction in the IQ can be classified as ei ther a waiting instruction or a ready instruction, depending on the readiness of its source operands. Both wa iting instructions and ready instructions affect the IQ soft-error susceptibi lity. Figure 4-7 (a) shows the IQ AVF contributed by waiting instructions and ready instructions across three types of workloads (shown as Table 43) on the studied SMT processor (shown as Tabl e 4-2). As IQ AVF is determined by the number of vulnerable instructions per cy cle and instruction resi dency cycles in IQ, Figure 4-7 (b) and (c) depict the quantity and residency cycles of waiting instructions and ready inst ructions in the IQ. Since ACE instructions are the major source of AC E bits, Figure 4-7 (b) and (c) also profile the number of waiting and ready ACE instructions and their residency cycles. As Figure 4-7 (a) shows, on an average, waiting instructions contribute to 86% of the total IQ AVF. Figure 4-7 (b) and (c) help to e xplain the high AVF contribution from waiting instructions. As can be seen, wa iting instruction residency time in the IQ ranges from 10 to 48 cycles, whereas ready instructions usually spend 1.5 cycles in the IQ on average. This suggests that an instruction can spend a significant fracti on (91% on average) of it s IQ residency cycles 69


waiting for source operands that ar e being produced by other instruct ions. Previous studies [33] have also observed that instruc tions usually spend most of thei r IQ residency cycles waiting for their operands to be ready. Nevert heless, no attempt has been made to exploit operand readiness in reliability optimizations. Figure 4-7 shows that at every cycle, the number (61 on average) of waiting instructions also overwhelms that (9 on average) of ready instructions. Especially on MEM workloads that exhibit higher cache miss rates, most instructions ar e congested in the IQ waiting for ready operands both the quantity an d residency cycles of waiting instructions greatly surpass those of ready inst ructions. As a result, waiting in structions contribute to 98% of the total IQ AVF. Furthermore, as Figure 4-7 (b ) and (c) show, waiting ACE instructions also play a much more important role than ready ACE instructions in determin ing the IQ AVF due to their higher quantity and longer IQ residency. In short, in order to mitigate IQ AVF, I should focus on the waiting instructions. IQ residency cy cles can be minimized if instructions are dispatched from the ROB with ready operands; m eanwhile, the number of wa iting instructions is also reduced because when instructions are di spatched they are ready-to-execute directly. Therefore, dispatching instructions only when their operands are ready can effectively control both the quantity and residency of waiting instru ction in the IQ and reduce IQ AVF significantly. Note that using operand readiness for instruct ion dispatch does not adversely affect other structures (e.g. ROB) AVF. For example, even t hough the instructions disp atch is delayed, their residency time in the ROB is the same since they have to remain in the ROB until graduation. If the ready instructions can be dispatched to the IQ in a timely fashion for execution, the performance impact will be negligible. To effectively reduce instruction waiting cycles in the IQ, I propose a scheme ( DelayALL ) which delays the dispatch for instructions with at least one non-ready operand. Since physical 70


registers are allocated at the register renaming stage, upon inst ruction dispatch the operands availability of an instructi on can be obtained by checking the ready state of the corresponding registers. The DelayALL scheme blocks an instruction dispatch if this check returns at least one non-ready status. Younger instructions can still be dispatched if their source operands are ready and this does not affect the correctness of program execution since in structions are still committed in order. The primary goal of ORBIT is to reduce the instruction waiting cycles in the IQ. Compared with un-ACE instructions, ACE in structions are the major source of ACE-bits since bits in an ACE instruct ion are ACE while an un-ACE inst ruction only contains a small portion of ACE-bits (e.g. opcode). Based on this observation, I propose applying operand readiness based instruction disp atch only to ACE instructions (DelayACE). Compared with DelayALL, the DelayACE can achieve better perfor mance since it does not block the dispatch of un-ACE instructions (31% instru ctions [30] in SPEC CPU20 00 workloads are un-ACE) which are not critical to reliability. Combine ORBIT with Prediction The schemes I propose in Section 4.2.1 can de grade performance since they delay the dispatch of instructions (ACE instructions in DelayACE ) until their operands are ready. Alternatively, I can dispat ch instructions slight ly before their operands are ready. By doing so, instructions will become ready-to-execute soon af ter entering the IQ. Putting more instructions into the IQ one cycle ahead of time can help im prove performance by effectively exploiting ILP. In this subsection, I propose a scheme called PredictALL which allows instructions with nonready operands to be dispatched just be fore the predicted operand ready cycle. The efficiency of the PredictALL scheme relies on several key design parameters. First, the ready time of instruction source operands should be predicted. Otherwise, in structions can not be dispatched ahead of their ready time. Second, once the ready time is predicted, the scheme 71


should be able to dispatch instructions ahead of time while minimizing their residency cycles in the IQ. In this dissertation, I opt for dispatching instructions onl y one cycle before they become ready to execute because this is the required latency to place in structions from the ROB to the IQ. By doing this, instructions will be immediately available fo r issuing and execution at the time they enter the IQ. Finally, since predic ting an operands ready time usually involves forecasting the completion time of the instructio ns which produce the source operands, it is crucial to decide when such a prediction s hould be made. The pred iction of instruction completion time can be made at several pipeline stages, such as fetch, renaming, dispatch and issue. Predictions are less accurate when made earlier since there are more unpredictable variations (e.g. resource conflicts ) between the prediction time and the actual completion time. In this study, I predict an operands re ady time in the dispatch stage. Note that if an instructions predicted operand readiness time is longer than the actual readiness time, ORBIT will dispatch the instruction based on its actual readiness time. Figure 4-8 illustrates the overall architecture of ORBIT with pr ediction. As can be seen, to implement PredictALL, I need a timer for each physical register to count down the remaining cycles during which a source operand will not be available. The timer decreases its value by one in each cycle and it returns ready-to-dispatch when it reaches one since instructions are dispatched one cycle before they become ready to execute. The timer is initialized when the corresponding register is allocated in renami ng stage and will not count down until it is set. When instructions are in the ROB, their dispatch will be blocked if thei r prediction is not ready, so each ROB entry will add two fields: the source re gister ID and the register ready bit. There is a dedicated bus between the ROB and the timers, wh en a timer is reduced to one, it will send the signal to the ROB and write the corresponding ready bits in each ROB entry as ready. When an 72


instruction is considered for disp atch, the ready bits of the source registers in the ROB entry are checked. The instruction is blocked if either of the ready bits is set as not ready. When an instruction is permitted to dispatch, its predicted completion time will be recorded into the timer associated with its destination register. The timer is set to the sum of dispatch-to-issue delay, issue-to-execute delay and func tion unit latency (which can be obtained from an instructions opcode). Due to cache misses, accurate prediction of th e memory reference instructions completion time is challenging. Since store instructions ha ve no data consumers, I do not predict their completion time. Therefore I only need to predic t operand readiness for load instructions. The load latency, depending on cache hit/miss at different levels, varies from zer o cycles (e.g. a hit in load store queue) to hundreds of cycles (e.g. a L2 cache miss). This information can not be accurately determined until the effective address is calculated in the execution stage and the cache access is performed. In order to obtain the load latency for in structions pre-schedule in the IQ, researchers [44, 48] have proposed several co mplicated predictors to predict the effective address and cache hit/miss at the instruction fetch stage. To implement those predictors, multilevel last value tables, prediction engine and complex operations on the predictors are required. Note that in my proposed techniques, predic ting a cache miss is not required since forecasting load instruction completion time can be deferred to one cycle ahead of the dispatch of load dependent instructions. As a resu lt, my prediction techniques do not require the last value table [44] and the cache miss prediction engine [48]. Figure 4-9 illustrates the load instruction completion time prediction in detail. When a load instruction is dispatched I temporarily predict that its destination regist er will be ready after a long latency which is equal to the largest number of cycles the timer can be set to. The pred icted completion time is updated after the load 73


instructions effective address is calculated a nd the cache access is performed. Note that bus competition is not taken into consideration for prediction simplicity, and my load instruction complete time prediction does not consider scen arios such as TLB misses and page faults. In these cases, the faulty instruction will be te rminated and re-execute d after the exception is handled. In this study, I also examine an alternative design called Predict_non_load which skips completion time prediction for all load instructions. As a result, instructions that directly depend on loads can not be dispatched to the IQ ahead of time, but prediction resumes for the following dependent instructions (except loads). Fi gure 4-10 illustrates the difference between Predict_non_load and PredictALL with a code segment exampl e, and the data dependence among instructions is also presented. Note that node 9 performs a store operation and its completion time is not predicted in either of the designs, and other inst ructions readiness is predicted in PredictALL In Predict_non_load, however, the readiness of instruction 3 and 6 (in grey color) are not forecasted since they are loads. Correspondingly, in struction 4 and 7 (in shade) are not dispatched until they are ready-to-execute due to the direct data dependence on instruction 3 and 6. Additionally, readiness predic tion on instruction 5 and 10 are resumed in the Predict_non_load scheme since they are indirectly dependent on loads 3 and 6. Since un-ACE instructions contribute fewer ACE-bits th an ACE instructions, I further extend the PredictALL and Predict_non_load to PredictALL_DelayACE and Predict_non_load_DelayACE Similar to DelayACE, both PredictALL_DelayACE and Predict_non_load_DelayACE dont apply the operand readiness based dispatch to un-ACE instructions. To improve the performance of SMT processors, Joseph et al. propose 2OP_BLOCK [33], which blocks instructions with two non-ready operands and their corre sponding threads in the 74


dispatch stage until one of the s ources becomes ready. The goal of 2OP_BLOCK is to improve IQ utilization by suspending thr eads with long latency instruc tions and allocating more IQ entries to threads exhibiting high ILP. 2OP_BLOCK can reduce IQ AVF since instructions with long waiting cycles are blocke d at the dispatch stage. 2OP_BLOCK resumes instruction dispatch once one source operand of the instruction is rea dy. Therefore, the instru ction still spends IQ residency cycles waiting for other so urce operands. Another difference between 2OP_BLOCK and my techniques is that by usi ng out-of-order dispatch, I bloc k not-ready instructions but dont stall the corresponding thread. I implement 2OP_BLOCK and examine its efficiency on softerror vulnerability mitiga tion. In addition, I explore 2OP_BLOCKACE which applies 2OP_BLOCK to only ACE instructions. Table 4-4 summarizes the eight IQ soft-error vulnerability mitigation schemes that I examine in this dissertation. Evaluation The machine configuration is shown in Ta ble 4-2. And the simulation workloads are shown in Table 4-3. Reliability and performance impact Figure 4-11 (a)-(c) shows IQ AVF, throughput IPC and harmonic IPC yielded on the proposed techniques. The results are normalized to a baseline case without optimization. As Figure 4-11 (a) shows, on average, the DelayALL scheme reduces IQ AVF by about 86%. On MEM workloads where a high frequency of L2 cache misses cause more long latency instructions, the IQ AVF is reduced by 96%. This suggests that the long residency cycles of these instructions are effectively reduced by ORBIT. Figure 4-11 (b) and (c) show that DelayALL decreases throughput and harmonic IPC by 6% and 10% respectively. DelayACE achieves an 83% IQ AVF reduction which is sl ightly smaller than that on DelayALL since DelayACE does not block the dispatch of un-ACE instructions and un-ACE instruction re sidency cycles also 75


contribute to the IQ AVF as un-AC E instructions still contain a small number of ACE bits. On average, DelayACE results in better performance by showing a 5% reduction in both throughput and harmonic IPC. PredictALL yields higher perf ormance than both DelayALL and DelayACE. On average, it decreases throughput IPC by 4% and harmonic IPC by 4%. The IQ AVF reduction is 84%, which is smaller than that on DelayALL since statistically each inst ruction could reside for one more cycle in the IQ. Ideally, PredictALL should have no performance penalty since instructions are always permitted to dispatch one cycle be fore they are ready. However, the maximum number of dispatched instructi ons can not exceed the processor di spatch bandwidth. In a scenario that multiple predicted ready to execute instructions compete for limited dispatch bandwidth, dispatch congestion occurs and prevents the disp atch of these instructions on time. This will eventually affect the number of instructions that the proces sor can issue. The performance penalty of PredictALL is more noticeable on MEM workloads due to burst increased pressure on the dispatch bandwidth caused by L2 misses. Compared with PredictALL on average across all of the workloads, PredictACE further improves performance by showing 1% throughput IPC and 3% harmonic IPC reduction. The IQ AVF reduction on PredictACE is 79%. Figure 4-11 (b) and (c) show compared to PredictALL Predict_non_load_DelayALL yields lower throughput and harmonic IPC. In Predict_non_load_DelayALL, the dispatch of direct consumers of load instructions are delayed until their operands are available as the completion time of load instructions is not pr edicted. As a result, Predict_non_load_DelayALL also shows lower IQ AVF. A similar observation holds when comparing Predict_non_load_DelayACE with PredictACE. 76


Figure 4-11 also shows the IQ AVF and performance yielded on 2OP_BLOCKALL and 2OP_BLOCKACE The 2OP_BLOCK techniques increase performance by 2% and reduce IQ AVF by 65%. Note that [33] reports a higher pe rformance improvement (9% and 5% increase in throughput IPC and harmonic IPC) on 2OP_BLOCK on an IQ with a size of 64. The modeled SMT processor in my study has as a 96-entry IQ. As discussed in [33], the benefit of 2OP_BLOCK reduces with increased IQ size. On CPU and MIX workloads, PredictACE and Predict_non_load_delayACE show similar performance characteristics as 2OP_BLOCK while PredictACE and Predict_non_load_delayACE exhibit superior capability in mitigating IQ AVF (e.g. on MIX workloads, IQ AVF is further decreased by 18% on PredictACE and Predict_non_load_delayACE ). A comparison with different fetch policies To improve performance in modern SMT pro cessors, various fetch policies have been proposed in the past. For example, STALL [31] blocks instruction fetch from offending threads when experiencing a L2 cache miss. As an exte nsion of STALL, FLUSH [31] not only stalls those threads but also squashes instructions from them. DG [42] and PDG [42] respond to cache misses by assigning a lower priority to offending threads. These fetch policies are built on the ICOUNT scheme. In [63], the impact of SMT fe tch policies on microarchitecture soft-error vulnerability was analyzed and the study show s that compared with ICOUNT, the advanced fetch policies exhibit a superior capability of soft-error mitigati on. In this subsection, I compare the reliability and performance characteristics of the proposed techniques implemented with ICOUNT with those on advanced fetch policies Figure 4-12 shows the results on IQ AVF (a), throughput IPC (b) and harmonic IPC (c). Due to space limitations, I show results using the worst case ( DelayALL), two best cases ( Predict_DelayACE and Predict_no_load_DelayACE ), and the average statistics across all 8 77


schemes. As Figure 4-12 (a) depi cts, the advanced fetch polices improve IQ reliability to some extent, especially FLUSH which reduces IQ AVF significantly on MEM workloads due to the frequently triggered flush opera tion. Compared to advanced polic ies, ORBIT schemes achieve a greater IQ AVF reduction. Note th at all of these fetch policies aim to improve IQ throughput by disallowing instructions to occupy an IQ entry for too long. However, instructions are permitted to dispatch with not ready operands and cy cles spent waiting on source operands are unavoidable. Throughput IPC comparison (Figur e 4-12.b) shows that FLUSH and STALL can boost the overall performa nce of SMT execution while my designs (e.g Predict_DelayACE and Predict_no_load_DelayACE ) yield negligible performance de gradation. Nevertheless, FLUSH and STALL suffer the fairness problem as shown in Figure 4-12 (c) since they blindly enforce flush/stall operations on all of the instructions from the offendi ng threads and are biased to high ILP threads. Contrary to this, ORBIT techniques only yield a negligible lo ss in harmonic IPC. Although DG and PDG yield higher performance th an ORBIT schemes, the much higher IQ vulnerability reduction gained fr om ORBIT schemes outweigh the sl ight performance difference. Reliability impact on the entire processor core As can be seen, the proposed techniques exhibit strong ability in im proving IQ reliability, it is interesting and important to evaluate their impact on the entire proces sor core and other core structures. Figure 4-13 shows AVF of the pro cessor core and the major microarchitecture structures (such as reorder buffe r, register file, load store queue, function units and DTLB) under the techniques across the three types of workloads. Results are normalized to the baseline case without any optimization. Due to the space lim itation, I present the results of the worst case ( DelayALL) and the best case ( Predict_DelayACE ) on performance de gradation, and the averaged case through all the proposed techniques. As Figur e 4-13 shows, on average, my techniques reduce core AVF by 10%. The impact of the techniques on other microarchitecture 78


structures is trivial except ROB (AVF reduces 22 %). Because my techniques delay instructions dispatch in each thread, the default fetch policy (ICOUNT) will fetch fewer instructions when the thread has a number of delayed instructions in the pipeline. As a resu lt, it prevents vulnerable bits which have long residency cycles from movi ng into the threads private ROB. On the other hand, ICOUNT brings more vulnerabl e bits into ROBs of other thr eads with fewer instructions in the pipeline, however, their resi dency time in those ROBs is short. And the overall ROB AVF still decreases. Discussion Recall that my proposed techni ques in the dissertation are motivated by Figure 4-7 which reveals waiting instructions contribute 86% of IQ AVF, and the large quantity and long IQ residency cycles of waiting inst ructions are the two factors whic h result in a highly vulnerable IQ. In this subsection, I revisit the IQ AVF and two characteristic s of waiting instructions with Predict_DelayACE on the three types of workloads. The re sults are shown in Figure 4-14 (a)-(c). Dotted lines represent th e reduction obtained from Predict_DelayACE compared with the baseline case. Characteristics of waiting ACE inst ructions are also presented in Figure 4-14 (b) and (c). Due to page limitations, I focus the discussion on Predict_DelayACE and the other techniques show similar behavior. As Figure 4-14 shows, both the quantity and residency cycles of waiting instructions on Predict_DelayACE decrease dramatically. This is also the case on waiting ACE instructions. On average, the number of waiting instructions per cycle drops from 61 to 8, and instruction residenc y cycles also reduce from 23 to 2. Therefore, the IQ AVF reduces significantly. The quantity and residenc y cycle reductions on waiting instruction show the efficiency of my techniques in IQ reliabi lity improvement by preventing the dispatch of notready instructions. Note that my techniques do not enforce limita tions on ready instructions, the 79


quantity and residency cycles of ready instructions decrease slightly which is not shown in Figure 4-14. Related Work In the past, there has been extensive research on instruction scheduli ng. To avoid the large issue queue size which degrades the clock cycle time, Lebeck et al. [51] proposed the waiting instruction buffer to hold instru ctions dependent on a long latency operation and reinsert them into a small size IQ when the operation comple tes. Various forms of dependence-based IQ design were proposed in [52, 53, 54, 55, 57], they ge nerally sorted instructions following dataflow order in the partitioned IQ based on depende nce chains, instructions thereby is issued inorder which reduces the complexity of issue logic and improves the clock frequency for large IQ window size. In [58], checkpoint processing and recovery micr oarchitecture is proposed to implement large instruction window processor without large cycle-cr itical buffers. Srinivasan et al. [64] found out critical load inst ructions by building a cr itical table. Tune et al. [65] and Fields et al. [66] independently proposed the prediction of cr itical path instructio ns which provides the opportunity of optimizing processo r performance. Seng et al. [67] proposed to issue critical instruction in-order and slot th em to fast function units to reduce power consumption. In my work, instruction is prioriti zed and scheduled based on their reliability criticality. As power consumption has become an important consideration in processor design, researchers have also studied low power IQ design. Ponomarev et al. [68] proposed to dynamically adjust the IQ size based on the interv al sampling of its occupancy. Folegnani et al. [69] partitioned issue queue into blocks, and disa bled them while the IPC monitoring mechanism reports they have no contribution to IPC. Jones et al. [70] save d power via software assistance which dynamically resizes IQ based on compiler an alysis. In [71], Brooks et al. explored the tradeoffs between several mechanisms for res ponding to periods of thermal trauma. With 80


appropriate dynamic thermal management, the CP U can be designed for a much lower maximum power rating with minimal performance impact fo r typical applications. In [72], a novel issue queue is proposed to exploit the varying dynami c scheduling requirement of basic blocks to lower the power dissipation and complexity of the dynamic issue hardware. In [73], a circuit design of an issue queue is presented for a superscalar processor that leverages transmission gate insertion to provide dynamic lowcost configurability of size a nd speed. Karkhanis et al. [74] proposed to save energy by delivering instructions just-in-time, it inhibi ts instruction fetching when there are a certain number of in-flight instructions in th e pipeline. Buyuktosunoglu et al. [75] combined fetch gating and issue queue adap tation to match the instruction window size with the application ILP characteris tics and achieved energy savi ng. In my study, the dynamic resource allocation in IQ is dependent on several important reliability-relevant feedback metrics (e.g. RQL) which are not applied in previous studies, and the methodology to implement the resource allocation is different from prior wo rk. Additionally, DVM works on SMT architectures with several novel futures incl uding wq_ratio adaptation, re liability-aware instruction dispatching and on-line AVF estimation. There is a growing amount of work aimed at characterizing soft error behavior at the microarchitecture level. Walcott et al. [32] used a set of proce ssor metrics to predict structure AVF, which is then applied to trigger redundant multithreading implementation for structures reliability maintenance. Soundararajan et al. [76] proposed di spatching stall and partial redundant thread to control the ROB AVF unde r certain threshold at cycle leve l. My work is unique in its joint consideration of performance and reliab ility of IQ design on multithread environment without using redundant execution. 81


Joseph et al. [33] proposed 2OP_BLOCK to block instructions with 2 non-ready operands and the corresponding thread at dispatch stage to improve the performance, but its impact on reliability is unknown. My work is unique in its joint consideration of performance and reliability of IQ design on SMT architectures. In [77, 78, 79], instructions completion time is predicted for instructions pre-schedule in instruction window and therefore, improve the performance. To achieve the target, prediction ha s to be done at an early pipeline stage (e.g. decode and renaming stage), and as a resu lt, these mechanisms rely on implementing complicated prediction tables to maintain th e required prediction accuracy. While in my techniques, a much simpler predictor was implem ented since I only need to perform prediction during the dispatch stage. 82


Table 4-1. Accuracy of using PC to identify ACE instructions (committed instruction only) Benchmark Accuracy Benchmark Accuracy Benchmark Accuracy applu 99.8% galgel 98.8% mgird 99.9% bzip 87.8% gap 95.9% perlbmk 99.9% crafty 89.4% gcc 96.5% swim 99.8% eon 87.6% lucas 99.2% twolf 95.8% equake 99.1% mcf 96.1% vpr 81.8% facerec 93.7% mesa 74.9% wupwise 97.5% AVG 93.7% Table 4-2. Simulated machine configuration Parameter Configuration Processor Width 8-wi de fetch/issue/commit Baseline Fetch Policy ICOUNT Issue Queue 96 ITLB 128 entries, 4-way, 200 cycle miss Branch Predictor 2K entries Gshare 10-bit global history per thread BTB 2K entries, 4-way Return Address Stack 32 entries RAS per thread L1 Instruction Cache 32K, 2-way, 32 Byte/line, 2 ports, 1 cycle access ROB Size 96 entries per thread Load/ Store Queue 48 entries per thread Integer ALU 8 I-ALU, 4 I-MUL/DIV, 4 Load/Store FP ALU 8 FP-ALU, 4FP-MUL/DIV/SQRT DTLB 256 entries, 4-way, 200 cycle miss L1 Data Cache 64KB, 4-way, 64 Byte/line, 2 ports, 1 cycle access L2 Cache unified 2MB, 4-way, 128 Byte/line, 12 cycle access Memory Access 64 bit wide, 200 cycles access latency 83


Table 4-3. The studied SMT workloads Thread Type Benchmarks CPU Group A bzip2, eon, gcc, perlbmk Group B gap, facerec, crafty, mesa Group C gcc, perlbmk, facerec, crafty MIX Group A gcc, mcf, vpr, perlbmk Group B mcf, mesa, crafty, equake Group C vpr, facerec, swim, gap MEM Group A mcf, equake, vpr, swim Group B lucas, galgel, mcf, vpr Group C equake, swim, twolf, galgel Table 4-4. A summary of the proposed techniques Schemes Description DelayALL Delay dispatch of all the non-r eady-to-execute instructions DelayACE Delay dispatch of ACE non-ready-to-execute instructions, ignore un-ACE instructions Predict_DelayALL Delay all the instructions dispatch till they are predicted to be ready in the next cycle Predict_DelayACE Delay ACE instructions dispatch till they are predicted to be ready in the next cycle, ignore un-ACE instructions Predict_non_load_DelayALL Similar to Predict_DelayALL but load instructions completion time is not predicted Predict_non_load_DelayACE Similar to Predict_DelayACE but load instructions completion time is not predicted 2OP_BLOCKALL Block all the instructions w ith two non-ready operands and their corresponding threads until one source is ready 2OP_BLOCKACE Block ACE instructions with two non-ready operands and their corresponding threads until one source is ready, ignore un-ACE instructions 84


0 10 20 30 40 50 60 70 80 CPUMIXMEMArchitecture Vulnerability Factor (%) Instruction Queue Reorder Buffer Funtion Unit Register File Figure 4-1. Microarchitecture soft-error vulnerability profile on SMT processor 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 1611162126313641465156616671 Ready Queue LengthHistogram of Ready Queue Length (%)0 10 20 30 40 50 60 70 80 90 100ACE Instruction Percentage (%) Histogram of Ready Queue Length ACE instruction Percentage Figure 4-2. The histograms of r eady queue length and ACE instru ction percentage on a 96-entry IQ. SMT processor issue width = 8, the e xperimented workloads: 4-context CPU ( bzip ,eon ,gcc ,perlbmk ) 11 02, *_,*_ 63 11 24, *_,*_ 32 12 46, *_,*_ 23 2 68, *_,_ 3 I PCIQLminRQLIQSIZEIQSIZE I PCIQLminRQLIQSIZEIQSIZE I PCIQLminRQLIQSIZEIQSIZE IPCIQLminRQLIQSIZEIQSIZE Figure 4-3. Dynamic IQ resource allocation based on IPC, ready queue length and total IQ size 85


_11 02, *_,*_ 63 11 24, *_,*_ 32 2__ 12 46, *_,*_ 23 68,cachemissT I PCIQLminRQLIQSIZEIQSIZE I PCIQLminRQLIQSIZEIQSIZE Lcachemiss I PCIQLminRQLIQSIZEIQSIZE IPCIQLminRQL _2 *_,_ 3 2__ ,__cachemissTIQSIZEIQSIZE LcachemissenableFLUSHpolicy Figure 4-4. L2-cache-miss sensi tive IQ resource allocation 0 0.2 0.4 0.6 0.8 1 1.2CPUMIXMEMNormalized IQ AVF VISA VISA+opt1 VISA+opt2 A 0 0.2 0.4 0.6 0.8 1 1.2 1.4CPUMIXMEMNormalized Throughput IPC VISA VISA+opt1 VISA+opt2 B Figure 4-5. Reliability and perf ormance with ICOUNT fetch polic y. A) Normalized IQ AVF. B) Normalized throughput IPC. 86


0 0.2 0.4 0.6 0.8 1 1.2FLUSH STALL DG PDG FLUSH STALL DG PDG FLUSH STALL DG PDG CPU MIX MEM Norm alized IQ A VF VISA VISA+opt1 VISA+opt2 A 0 0.2 0.4 0.6 0.8 1 1.2FLUSH STALL DG PDG FLUSH STALL DG PDG FLUSH STALL DG PDG CPU MIX MEM Normalized throughput IPC VISA VISA+opt1 VISA+opt2 B Figure 4-6. Reliability and performance using diffe rent fetch policies. A) Normalized IQ AVF. B) Normalized IPC 87


0 20 40 60 80C PU MIX MEMIQ AVF (%) Ready Instruction Waiting Instruction A 0 10 20 30 40 50 60 70 80 90CPU MIX MEM CPU MIX MEM ALL Instruction ACE Instruction Number of Instructions per Cycl e Waiting Instructions Ready Instructions B 0 10 20 30 40 50CPU MIX MEM CPU MIX MEM ALL Instruction ACE Instruction Residency Cycles Waiting Instruction Ready Instruction C Figure 4-7. A) IQ AVF contributed by waiting instructions and ready instructions. B) Profiles of the quantity of ready instructions and wa iting instructions in a 96 entries IQ. C) Residency cycles of ready instructions and waiting instructions in a 96 entries IQ. The ICOUNT fetch policy is used. 88


Fetch Queue Reorder Buffer Decode & Renaming Timers Ready for Dispatch Instruction Queue Execute Latency Function Units Issue + Delays Complete Time Initialization Auto-decrease Physical Register Files Figure 4-8. ORBIT with prediction Reorder Buffer Timers 111...111 Ready for Dispatch Instruction Queue Function Units Issue Load Inst Effective Address Load Store Queue DL1 Cache DL2 Cache Updated Complete Time Figure 4-9. Predict the load instruction completion time 89


0: addl r2, r5, 3 1: addl r8, r4, 5 2: subl r9, r8, 1 3: ldq_u r10, 0(r2) 4: mult r16, r2, 10 5: subl r12, r16, r8 6: ldwu r11, 0(r9) 7: addl r15, r11, 6 8: addl r3, r7, 8 9: stb r3, 0(r15) 10: mult r6, r15, 2 11: beq r6, r12, L3 12: subl r20, r21, r26 13: addl r25, r20, 8 14: mult r22, r20, 5 0 3 4 5 1 2 6 7 10 11 8 9 12 14 13 Figure 4-10. Readiness prediction example on PredictALL and Predict_non_load 90


0 0.2 0.4 0.6 0.8 1 1.2 CPUMIXMEMAVGnormalized IQ AVF Baseline DelayALL DelayACE Predict_DelayALL Predict_DelayACE Predict_non_load_DelayALL Predict_non_load_DelayACE 2OP_BLOCKALL 2OP_BLOCKACE A 0 0.2 0.4 0.6 0.8 1 1.2 CPUMIXMEMAVGNormalized throughput IP C Baseline DelayALL DelayACE predict_DelayALL Predict_DelayACE Predict_non_load_DelayALL Predict_non_load_DelayACE 2OP_BLOCKALL 2OP_BLOCKACE B 0 0.2 0.4 0.6 0.8 1 1.2 CPU MIXMEMAVGNormalized hamonic IPC Baseline DelayALL DelayACE predict_DelayALL Predict_DelayACE Predict_non_load_DelayALL Predict_non_load_DelayACE 2OP_BLOCKALL 2OP_BLOCKACE C Figure 4-11. Reliability and performance re sults when using ICOUNT fetch policy. A) Normalized IQ AVF. B) Normalized thr oughput IPC. C) Normalized harmonic IPC 91


0 10 20 30 40 50 60 70 80 CPU MIX MEMIQ AVF (%) Baseline STALL FLUSH DG PDG DelayALL Predict_DelayACE Predict_no_load_DelayACE AVG A 0 1 2 3 4 5 6 CPU MIX MEMThroughput IPC Baseline STALL FLUSH DG PDG DelayALL Predict_DelayACE Predict_no_load_DelayACE AVG B 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 CPU MIX MEMHarmonic IPC Baseline STALL FLUSH DG PDG DelayALL Predict_DelayACE Predict_no_load_DelayACE AVG C Figure 4-12. A Comparison with di fferent fetch policies. A) IQ AVF. B) Throughput IPC. C) Harmonic IPC 92


0 0.2 0.4 0.6 0.8 1 1.2DALLPDACEAVGDALLPDACEAVGDALLPDACEAVG CPU MIX MEM Normalized AV F Core AVF ROB AVF register File AVF LSQ AVF Functional Unit AVF DTLB AVF Figure 4-13. The Impact of proposed techniques on core and other microarchitecture structures vulnerability (DALL => DelayALL, PDACE => Predict_DelayACE ) 93


0 20 40 60 80 100CPU MIX MEMIQ AVF (%) Eliminated by Predict_DelayACE Ready Instruction Waiting Instruction A 0 10 20 30 40 50 60 70 80 90CPU MIX MEM CPU MIX MEM All Instruction ACE Instruction Number of Instructions per Cycle Eliminated by Predict_DelayACE Predict_DelayACE B 0 10 20 30 40 50CPU MIX MEM CPU MIX MEM All Instruction ACE Instruction Residency Cycles Eliminated by Predict_DelayACE Predict_DelayACE C Figure 4-14. (A) IQ AVF contributed by waiting inst ructions and ready instructions. B) Profiles of the quantity of waiting instructions in the 96 entries IQ with Predict_DelayACE C) Profiles of residency cycles of waiting instructions in the 96 entries IQ with Predict_DelayACE The ICOUNT fetch policy is used. 94


CHAPTER 5 COMBINED CIRCUIT AND MICROARCHITECTURE TECHNIQUES FOR SOFT ERROR ROBUSTNESS Background The circuit-level SER can be expressed by the following empirical model [80]: ,,()crit s Q Q dpdnSERFAAKe (5-1) Where F is the total neutron flux within the whole energy spectrum, Ad,p and Ad,n are the ptype and n-type drain diffusion areas which are sensitive to particle strikes, K is a technologyindependent fitting parameter, Qcrit is the critical charge, and Qs is the charge collection efficiency of the device. A soft e rror occurs if the collected charge Q exceeds the critical charge Qcrit of a circuit node. For a give n technology and circuit node, Qcrit depends on supply voltage VDD, the effective capacitance of the drain nodes C and the charge collection waveform. The critical charge Qcrit of a six transistor SRAM cell is a function (shown as Equation 5-2) of VDD, the threshold voltage VT and the effective time constant T of the collection waveform. In Equation 5-2, the time dependence of current transients is given by T0, which depends strongly on the strike location and activated mobility models. Equation 5-1 and 5-2 show that SER increases exponentially with reduction in Qcrit and Qcrit is proportional to the effective capacitance of the node and the supply voltage. Hence, the SER is exponentially dependent onC and VDD. )) ((),(0T T VVVCTVQTDD DD DDcrit (5-2) Soft Error Robust SRAM (rSRAM) Equation 5-2 suggests that the minimum amount of charges required to flip the SRAM cell logic state is proportional to the internal node capacitances. Therefore, increasing the effective 95


capacitances will reduce the SER of a storage node. In [81], the soft error robust SRAM (rSRAM) cell (see Figure 5-1) is built by symmet rically adding two stacked capacitors to a standard six transistor high dens ity SRAM cell. Both area penalty and manufacturing cost of the rSRAM can be mitigated by adding the two capacitors in the vertical dimension (i.e. between the polysilicon and the Metal 1 levels) and manufactured with a st andard embedded DRAM process flow. Accelerated alpha and neut ron tests have demonstrated th at the rSRAM devices are alpha immune and almost insensitive to neutrons [82]. The rSRAM cell symmetry and the transistor sizing remain strictly identical to the standard SRAM. Further comparison between rSRAM and st andard SRAM shows that they both have similar power consumption, leakage and area. Howe ver, there are trade-offs between robustness and timing performance. Compared with the standa rd SRAM, both the read current and the static noise margin of rSRAM are unchanged, whereas the intrinsic write operation of the rSRAM is slowed down proportionally to the extra loads on the two internal nodes. The normalized SER rates for the rSRAM as a function of the added cap acitor value were studie d in [82] using Monte Carlo simulations. For example, to achieve th e desired SER rates on rSRAM fabricated using 90nm CMOS technology, a capacitor with value of 12fF needs to be added to each node which degrades the memory cell write timing performan ce by a factor three in typical conditions. For very high capacitor values, the write might beco me even slower than the read, leading to significant cycle time penalty. Su ch disadvantage limits the applic ability of using the rSRAM to harden hardware structures that resident in the critical path of the processor pipeline. Voltage Scaling for SRAM Soft Error Robustness Equation 5-2 shows that Qcrit has a linear relation with the supply voltage. Transistors with high supply voltage exhibit strong immunity to so ft errors since the pa rticle energy threshold required to cause soft errors is increased. Th erefore, scaling up supply voltage can provide 96


Therefore, scaling up the supply voltage can provide immunity to soft errors. In this dissertation, I use dual-VDD [83], a technique originally proposed for power saving, to enhance hardware SER robustness. However, scaling up the voltage increases dynamic and leakage power consumption; for example, the dynamic power of a circuit is proportional to the square of the supply voltage. Therefore, it is important to scale up supply voltages selectively, so that the power cost can be balanced against the reliability concerns. This dissertation proposes methods that selectively adjust the supply voltage and achieve attractive trade-offs between power and reliability.

Differing from circuit-level radiation hardening methods, microarchitecture-level soft error vulnerability mitigation techniques exploit program characteristics to achieve application-oriented reliability optimization. In general, these techniques can reduce the soft error failure rate but do not guarantee convergence to a high-reliability design goal.

Combined Circuit and Microarchitecture Techniques

Radiation Hardening IQ Design Using Hybrid Techniques

As described in Chapter 4, microarchitecture-level techniques such as operand readiness based instruction (ORBIT) dispatch can effectively reduce the IQ AVF even though they provide no circuit-level protection against soft errors. However, ORBIT incurs a performance penalty because instructions are selected for dispatch solely based on their operand readiness instead of their criticality to performance. Note that the increased program runtime raises the processor's overall transient fault susceptibility, since soft errors have more opportunities to strike the chip. Therefore, microarchitecture soft error mitigation techniques should cause minimal performance overhead. Due to the superior soft error robustness of the rSRAM cell, it can be used to implement the IQ, an SRAM-based structure.


To leverage the advantage of circuit and microarchitectur e level soft error tolerant techniques while overcoming th e disadvantage of both, I propose an IQ consists of a part implemented using the standard SRAM cells (NIQ ) and a part implemented using the radiation hardened rSRAM technologies (RIQ). The operands ready instructions are dispatched into NIQ while other not-ready but performance critical instructions are dispatched into RIQ. By decreasing both quantity and residency cycles of instructions vulnerabl e bits in a hardware structure, the operand readiness based dispatch can effectively mitigate soft error vulnerability of NIQ where no error protection is provided. The filt ering out of performan ce critical instructions from the delayed dispatch alleviates performa nce penalty. Meanwhile, the write latency of the rSRAM based RIQ can be efficiently hidden since instructions dispatched to the RIQ normally will not be immediately ready for issuing. The rS RAM technique, which provides great soft error immunity, successfully protects those instructio ns from soft error strikes during their RIQ residency period. Therefore, compared with me thods that exclusivel y rely on circuit or microarchitecture solution, the hybrid schemes can achieve more desirable trade-offs between reliability and performance. Figure 5-2 presents the control flow of instruction dispatch in the proposed IQ design that uses hybrid radiation hardeni ng techniques. When instructio ns in ROB are scheduled for dispatch, the dispatch logic only places ready-to-e xecute instructions into the NIQ. By doing so, the quantity and residency cycles of instructions in the NIQ are significantly reduced and the corresponding IQ SER decreases. The performance criticality of other not-ready-to-execute instructions is examined and cr itical instructions are dispat ched to the RIQ without delay. Therefore, only non-critical instructions are delayed at the dispatch stage. 98


In this study, I investigate hybrid schemes that can achieve attractive reliability and performance trade-offs without significantly increasing the hardware cost. I assume that the NIQ and RIQ together have a total size equal to that of the original IQ and that they share the same dispatch bandwidth as the original design. Figure 5-3(a) provides an overview of the architectural support for the proposed ideas. The detailed circuit design supporting RIQ wakeup is discussed in Section 5.2.2. In order to obtain instructions' operand readiness while they sit in the ROB, a multi-banked, multi-ported bit array is built to record the register file's readiness state. The bit array is updated during the write-back stage. The ROB can be logically partitioned into several segments to allow parallel accesses to the multiple banks of the array, which hold identical copies of the information. A simple AND gate is added to each ROB entry to determine the readiness of an instruction. Note that in my scheme, younger instructions can still be dispatched if their source operands are ready; this does not affect the correctness of program execution, since instructions are still committed in order.

Figure 5-3(b) illustrates how critical instructions are identified. In this dissertation, I define performance-critical instructions as branch instructions and instructions with long dependence chains in the ROB. I use the critical tables proposed in [64] to quantify an instruction's criticality. Each thread's ROB is associated with a critical table, and each ROB entry has a corresponding critical table entry that represents the data dependences of other instructions on this instruction. Each critical table entry is a vector with one bit per ROB entry; a given bit of the vector is set if its corresponding ROB entry is directly or indirectly data dependent on the current ROB entry. The sum of the bits in the ith critical table entry represents the length of the ith instruction's data dependence chain, which, in other words, describes its performance criticality. The critical table is updated at the decode and renaming stages.


As shown in Figure 5-3(b), assume that an instruction is placed into the jth ROB entry and that it directly consumes the computation results of the instructions in the mth and nth ROB entries. The jth column in the critical table is then updated as the bitwise OR of the mth and nth columns; meanwhile, the mth and nth entries of the jth column are also set to one. Therefore, the bits within the jth column indicate on which instructions the newly inserted instruction is data dependent. Since an instruction's criticality is available in the critical table, a criticality threshold is set to classify instructions as critical or non-critical: instructions with criticality higher than the threshold are recognized as critical, and vice versa. Branch instructions are always identified as critical. Note that the criticality threshold affects the required RIQ size and, correspondingly, the performance and reliability of the proposed techniques. A detailed analysis can be found in Section 5.3.2.
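
A minimal sketch of this critical table bookkeeping is shown below, using Python integers as bit vectors. The class interface is illustrative; the dissertation's design is a hardware bit matrix updated at rename time.

    class CriticalTable:
        # Bit matrix for one thread's ROB: bit j of rows[i] means the
        # instruction in ROB entry j depends (directly or transitively) on
        # entry i. Row i therefore lists entry i's dependents, and its
        # popcount is entry i's criticality.

        def __init__(self, rob_size):
            self.n = rob_size
            self.rows = [0] * rob_size

        def insert(self, j, src_entries):
            # Instruction enters ROB entry j, consuming results of src_entries.
            # Equivalent to: column j := OR of the source columns, plus ones
            # in the source rows for the direct dependences.
            for i in range(self.n):
                if any((self.rows[i] >> m) & 1 for m in src_entries):
                    self.rows[i] |= 1 << j   # j inherits the dependence on i
            for m in src_entries:
                self.rows[m] |= 1 << j       # direct dependence on each source

        def criticality(self, i):
            return bin(self.rows[i]).count("1")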


The RIQ Design and Alternatives

A conventional IQ entry consists of several fields: 1) the payload area (the opcode, destination register address, function unit type, and so on); 2) the left and right tags of the two source registers, each coupled with a CAM (content-addressable memory) for register number comparison; 3) the left and right source ready bits, which record the availability of the source registers; and 4) another ready bit representing the instruction's readiness, which is the logical AND of the two source ready bits. When an instruction completes its execution, its destination register identifier is sent to the tag buses and broadcast through all IQ entries. The CAM in each IQ entry determines whether there is a match between the instruction's source register number and the identifier on the tag buses, and the corresponding source ready bit is set to 1 if a match occurs. When both source ready bits are set to 1, the instruction is ready, and the ready bit raises the issue request signal to the selection logic.

In my hybrid IQ, the wakeup logic of the NIQ is identical to that of the conventional IQ. Care must be taken in the RIQ design due to the extra write latency of the rSRAM cells. Figure 5-4 shows the detailed circuit design of each field of an RIQ entry. Since instructions dispatched into the RIQ are usually not ready to execute, the latency of the initial write operations to the RIQ entry can be overlapped with the instruction's waiting-for-ready period. As a result, rSRAM is used to build the payload area and the tags in each RIQ entry. However, the write latency would delay the update of the ready bits and prevent instructions from being issued on time; in other words, the selection and issue stages of the pipeline would be postponed. To avoid this negative performance impact of the rSRAM, I implement the three ready bits of each IQ entry using standard SRAM cells and use ECC to protect their integrity. The overhead of the ECC is small due to the small overall number of ready bits in the IQ.

Another important design consideration for the RIQ entry is the CAM, which is composed of a storage cell (SRAM) and a comparison circuit (XOR gates); the rSRAM technique can also be used to implement a robust CAM without any area penalty. [81] proposed extending the rSRAM technique to the CAM (i.e., rCAM). The rCAM has characteristics similar to the rSRAM: it also suffers from the write latency, but its read time is unchanged. In this study, I also consider an rCAM implementation for a fault-free RIQ. Since the data (the source register number) is written into the CAM storage cell once the instruction is dispatched into the RIQ and stays there until the instruction is issued, the write latency of the rCAM is overlapped with that of writing the instruction's information into the RIQ payload and tags. Therefore, the rCAM does not introduce extra performance delay in the RIQ. However, it is possible that an instruction misses the register number broadcast while its information is still being written into the rCAM.


To update the instruction's source ready bits in time, as shown in Figure 5-3(a), the register ready bits array is checked once the write operation completes.

The use of rCAM can be avoided in a wakeup-free version of the RIQ design. In [85, 86], instruction reinsertion and selective replay were proposed for performance enhancement, and I propose two corresponding designs in Figure 5-5. The first approach (shown in Figure 5-5A) is to reinsert instructions from the RIQ back into the NIQ when they will be ready soon. The reinserted instructions take dispatch bandwidth and have higher priority than other, normally dispatched instructions. If the NIQ is full, the reinsert operation has to stall. Since the RIQ is wakeup free, instructions cannot be selected and issued directly from the RIQ. To fully exploit the SER robustness of the RIQ, instructions should not be reinserted until the cycle before they become ready. To satisfy this requirement, the ready time of the source registers has to be predicted, which involves complicated hardware [86, 87] such as a load address predictor, a cache miss/hit predictor, and a timing table recording the predicted ready times. An alternative approach (Figure 5-5B) is to issue instructions directly from the RIQ when they are predicted to be ready to execute. Accurate wakeup prediction is required in this case, which again introduces the complicated predictors mentioned above. Since a predictor cannot be 100% accurate, mispredictions have to be handled: instructions' source register ready states have to be checked before they are actually issued to the function units. If they are not ready, the instructions have to be replayed (reinserted into the IQ), and the information in the predictor has to be updated as well. As can be seen, the two wakeup-free designs depend on complicated register ready time prediction, which requires additional hardware and operations. In my study, the ECC-protected SRAM-based ready bits and the rCAM are selected.


Using Dual-VDD to Improve ROB Reliability

The ROB is another important microarchitecture structure in SMT processors. As introduced in Section 5.1, supplying a high VDD to CMOS circuits can improve a hardware structure's raw soft error rate. However, the high VDD should be applied judiciously, since the dynamic power consumption is quadratic in the supply voltage. In this dissertation, I explore using microarchitecture-level soft error vulnerability characteristics and runtime events to enable and disable the high VDD, which can achieve attractive trade-offs between reliability and power. Recall that the overall soft error rate of a microarchitecture structure is determined by the FIT rate per bit and the AVF at the microarchitecture level. In the case where different VDD values yield different FIT rates, the SER can be rewritten as:

    SER = \frac{\sum_{T_{normal\_FIT}} FIT_{normal} \times \#ACE\,bits\,per\,cycle \;+\; \sum_{T_{enhanced\_FIT}} FIT_{enhanced} \times \#ACE\,bits\,per\,cycle}{\#B \times T_{execution\_cycles}}    (5-3)

where FIT_normal represents the FIT with the normal VDD, FIT_enhanced represents the FIT with the high VDD, and T_normal_FIT and T_enhanced_FIT denote the periods during which FIT_normal and FIT_enhanced apply, respectively. As can be seen from Equation 5-3, when the number of ACE bits in the structure is small during T_enhanced_FIT, the SER reduction gained by reducing the FIT (i.e., increasing VDD) is substantially discounted. Take an extreme case: when there are no ACE bits, no benefit can be gained from increasing VDD, since all errors are masked at the microarchitecture level. On the other hand, when all the bits in the structure are ACE (i.e., no error can be masked), the benefit can be fully exploited.

To effectively improve ROB reliability while controlling the extra power consumption, I propose to trigger the high VDD when the ROB shows high vulnerability at the microarchitecture level and to switch VDD back to normal when the vulnerability drops below a threshold.


Due to circuit-level complexity concerns, I limit my scheme to two supply voltages; this two-level supply voltage transition is called the dual-VDD technique [83]. A DC-DC converter can adjust the supply voltage continuously; unfortunately, such a converter requires a long time for voltage ramping [88] and is not suitable for high-performance SMT processors. I instead use two separate power supply lines for quick VDD switching, with a pair of PMOS transistors inserted to handle the voltage transition. Li et al. [88] and Usami et al. [89] showed that the energy and area overhead of such a two-supply power network is negligible. In this dissertation, I make the simplifying assumption that varying the supply voltage in CMOS causes no power or performance overhead. The clock frequency remains the same while dual-VDD is applied, since the transistors can operate at the normal frequency when VDD switches to the high voltage. In [90], Burd et al. showed that CMOS circuits can operate continuously as long as the voltage change per nanosecond is limited to a certain amount; in other words, the voltage transition cannot be completed instantaneously. Therefore, when triggering the high VDD, the structure's high-vulnerability period should be long enough to cover the transition cycles. Figure 5-6 shows the relation between L2 misses and ROB AVF over a period of 5,000 cycles of the benchmark vpr's execution. Note that the right Y-axis simply indicates the occurrence of an L2 miss: a value of 1 represents that an L2 miss is outstanding at that cycle. As can be seen, the ROB AVF jumps high when an L2 miss occurs and drops after the miss is resolved. Upon an L2 cache miss, the pipeline usually stalls waiting for data, so instructions quickly fill up the ROB, and the congestion is not relieved until the L2 cache miss is handled. Furthermore, since high ROB utilization results in a large quantity of vulnerable bits, the ROB AVF usually exhibits a strong correlation with L2 cache misses. In SMT processors, the L2 cache miss latency often lasts for hundreds of cycles, which can cover the VDD transition cycles; therefore, the L2 cache miss is a good trigger for VDD switching.
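
A minimal sketch of this L2-miss-driven VDD switching policy is given below. The voltage values and the 20-cycle ramp budget match the assumptions stated later in the experimental setup (1.2 V/2.4 V, 0.05 V/ns, 1 GHz); the controller interface itself is hypothetical.

    VDD_NORMAL, VDD_HIGH = 1.2, 2.4     # volts
    TRANSITION_CYCLES = 20              # 1.2 V swing at 0.05 V/ns on a 1 GHz core

    class DualVddController:
        def __init__(self):
            self.vdd = VDD_NORMAL
            self.ramp_left = 0

        def tick(self, l2_miss_pending):
            # Called once per cycle with the thread's L2-miss status.
            if self.ramp_left > 0:
                self.ramp_left -= 1      # voltage is still ramping
                return self.vdd
            if l2_miss_pending and self.vdd == VDD_NORMAL:
                # Miss stalls last hundreds of cycles, covering the ramp.
                self.vdd, self.ramp_left = VDD_HIGH, TRANSITION_CYCLES
            elif not l2_miss_pending and self.vdd == VDD_HIGH:
                self.vdd, self.ramp_left = VDD_NORMAL, TRANSITION_CYCLES
            return self.vdd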


Experimental Setup and Evaluation Results

Experimental Setup

To evaluate the reliability and performance impact of the proposed techniques, I use the same framework as in Chapter 4. In addition, I ported the Wattch power model [91] into my simulation framework for power evaluation. The baseline machine configuration is shown in Table 4-2. The IQ is a shared structure, while the ROB is private to each thread. I use ICOUNT as the baseline fetch policy and assume 90nm CMOS technology. In [81], the relation between the added capacitor value, the write time, and the SER of the 90nm standard rSRAM was studied. The results show that the write time is linearly related to the capacitor value, and that the SER drops dramatically as the capacitor value increases from 0fF to 12fF but varies only slightly beyond 12fF. Therefore, in my experiments I select a capacitor value of 12fF; correspondingly, the write time of the rSRAM is three times that of the standard SRAM. I assume the normal VDD is 1.2V and the high VDD is 2.4V. The enhanced SER of the SRAM is computed using Equations 5-1 and 5-2. I assume the voltage can transition at 0.05V/ns, so the transition lasts 20 cycles on the simulated 1GHz processor.

The SMT workloads in my experiments are composed of SPEC CPU 2000 integer and floating point benchmarks. I create a set of SMT workloads with individual thread characteristics ranging from computation intensive to memory access intensive (see Table 5-1). The overall SER, which captures vulnerability at both the circuit and microarchitecture levels, is used as the baseline metric to estimate how susceptible a microarchitecture structure is to soft error strikes. For performance, I use throughput IPC, which quantifies the throughput improvement, and the harmonic mean of weighted IPC.


Evaluation

Effectiveness of the rSRAM based IQ design

I compare my hybrid scheme with several existing techniques (e.g., 2OP_BLOCK [33] and ORBIT) that exhibit good capability in enhancing IQ reliability. A comparison is also performed with a design that uses rSRAM to implement the entire IQ. Additionally, [63] showed that among the several advanced fetch policies for SMT processors, FLUSH can effectively reduce IQ vulnerability, so I also compare my technique with a baseline SMT processor using the FLUSH fetch policy. In the hybrid scheme, I set the criticality threshold to 2 with an RIQ size of 24, and the threshold is raised as high as the ROB size during an L2 miss. A detailed sensitivity analysis is presented in Section 5.3.2.

Figures 5-7(a)-(c) present the overall IQ soft error rate, throughput IPC, and harmonic IPC yielded by the various techniques across the three SMT workload categories. The results are normalized to the baseline case without any optimization technique. Note that because the rSRAM-based IQ has a normalized soft error rate of zero, its SER is not presented in Figure 5-7(a). As can be seen in Figure 5-7(a), on average my hybrid scheme exhibits strong SER robustness: it reduces the IQ SER by 80% with only a 0.3% throughput IPC and 1% harmonic IPC reduction across all the workloads. The IQ SER reduction is more noticeable on the MEM workloads, because low-IPC workloads have fewer ready-to-execute instructions and the RIQ is fully utilized to protect the ACE bits in those instructions. ORBIT obtains an IQ SER reduction similar to my design's, since the two share the property that only ready-to-execute instructions can be dispatched into the unprotected IQ. The 2OP_BLOCK scheme, which blocks instructions with two non-ready operands (and the corresponding thread) at the dispatch stage but still allows the dispatch of unready instructions into the unprotected IQ, gains 20% less SER reduction than the hybrid scheme. Moreover, my design outperforms the FLUSH policy by 58% in reliability improvement.


From the performance perspective, as Figures 5-7(b) and (c) show, the hybrid scheme surpasses the other techniques in both throughput and fairness, and the performance difference is more noticeable in the MIX and MEM workloads. As expected, the rSRAM-based IQ suffers a significant performance penalty (20% degradation in both throughput and harmonic IPC), and the degradation can be as bad as 35%.

Sensitivity analysis on criticality threshold and RIQ size

In an SMT environment, an L2 miss can cause congestion in the corresponding thread's ROB. As a result, the instruction criticality computed with the critical table can easily surpass the pre-set criticality threshold. Nevertheless, most of these instructions are data dependent on the load miss instruction and cannot become ready to execute until the L2 cache miss is resolved. Their entrance into the RIQ, however, causes RIQ resource congestion and prevents the dispatch of critical instructions from other, high-performance threads. In my study, to avoid RIQ congestion and improve the overall throughput, each thread is assigned a pre-set criticality threshold, and the threshold is adjusted to a high value (for example, equal to the RIQ size) when the thread is handling an L2 cache miss.

Both the criticality threshold and the RIQ size control the dispatch of instructions into the RIQ and affect the effectiveness of my hybrid scheme. In this dissertation, I perform a sensitivity analysis to understand the impact and interaction of these two factors. The two factors interact with each other: when the criticality threshold is high, a large RIQ is unnecessary; conversely, a small RIQ requires a high criticality threshold. In my study, I start the analysis from a fixed criticality threshold of two, because instructions with fewer than two consumers are likely to be dynamically dead instructions whose computation results will not affect the program's final output; they are therefore not performance critical. The fixed criticality threshold is combined with RIQ sizes ranging from 8 to 64.


By doing this, I can quickly determine the optimal RIQ size required to satisfy the lowest criticality threshold. Note that the RIQ size cannot be made extraordinarily large or small. With the total IQ size fixed, an extra-large RIQ corresponds to an extremely small NIQ, which has difficulty holding all the ready-to-execute instructions and delays their dispatch; on the other hand, the benefit of dispatching not-ready critical instructions into the RIQ disappears with an extremely small RIQ. Figures 5-8(a)-(c) present the throughput IPC, harmonic IPC, and IQ SER, normalized to the baseline case, for the various RIQ sizes. As can be seen, the IQ SER decreases as the RIQ size increases, because the unprotected NIQ shrinks and fewer vulnerable bits are exposed to soft error strikes. However, increasing the RIQ size has a deleterious performance impact because of the NIQ's need to hold the ready-to-execute instructions. As shown in Figure 5-8, an RIQ size of 24 yields the performance closest to the baseline case in all three workload categories, and it satisfies my goal of maintaining application performance while improving IQ reliability. After the RIQ size is fixed at 24 for the lowest criticality threshold, another set of experiments could be performed to search for a more appropriate criticality threshold. However, a higher criticality threshold requires a smaller RIQ, which results in higher IQ vulnerability; that does not suit my goal even though performance could be improved. In this dissertation, the 24-entry RIQ with a pre-set criticality threshold of two is used in the experiments.

Effectiveness of dual-VDD in ROB SER robustness

In this subsection, the efficiency of applying dual-VDD for ROB SER enhancement is examined. The black bars in Figures 5-9(a) and (b) show the ROB SER reduction and the power overhead of the processor core after the proposed technique is applied to the three types of workloads. As can be seen, on average the ROB SER is reduced by 50% at the cost of an extra 6% of core power, and on the MEM workloads, which encounter a large number of L2 misses, my technique gains a 67% ROB SER reduction.


In most architecture designs, a 6% power overhead is beyond the acceptable bound; therefore, using the L2 miss alone as a trigger has to be improved. Notice that the number of vulnerable bits in the ROB is not always positively proportional to the ROB utilization, which suggests that an L2 miss does not always imply a large number of ACE bits in the ROB. In this dissertation, I propose an enhanced trigger that takes the quantity of ACE bits in the ROB into account. The trigger performs as follows: when an L2 miss occurs, the number of ACE bits in the ROB per cycle is counted and averaged over the following 20 cycles, and the high VDD is not supplied if there are not enough ACE bits, that is, fewer than a vulnerability threshold. After the L2 miss is resolved, VDD is switched back to normal. Since accurate, online identification of ACE bits is difficult, in my study I approximate the number of ACE bits at the instruction level. The basic idea is that the longer an instruction's dependence chain, the higher the probability that its computation result affects the program's final output. Consequently, I assume the bits in instructions with high criticality (e.g., criticality > 16) are ACE. The information stored in the critical table can be used for this ACE-ness estimation.
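
A minimal sketch of the enhanced trigger is shown below; it layers the ACE-bit check onto the dual-VDD controller sketched earlier. The criticality cutoff (> 16) and the 20-cycle averaging window come from the text; the ROB snapshot interface is hypothetical.

    ACE_CRITICALITY = 16      # instructions above this are treated as ACE
    SAMPLE_CYCLES = 20        # averaging window after an L2 miss

    def ace_entries(rob):
        # Approximate vulnerability: ROB entries holding instructions with a
        # long dependence chain (from the critical table) are assumed ACE.
        return sum(1 for e in rob.entries if e.criticality > ACE_CRITICALITY)

    def enhanced_trigger(rob_samples, vuln_threshold):
        # Decide whether a pending L2 miss should enable the high VDD.
        # rob_samples: per-cycle ROB snapshots over the 20 cycles after the
        # miss; the threshold is expressed in ROB entries (the text sweeps it
        # from 1/2 to 5/6 of the ROB size and settles on 64).
        avg = sum(ace_entries(r) for r in rob_samples) / len(rob_samples)
        return avg >= vuln_threshold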


Note that the pre-defined vulnerability threshold affects both the ROB SER reduction and the power overhead. Care must be taken when choosing it: setting the value too high results in limited ROB reliability improvement, while setting it too low provides minimal control over the power consumption. In this study, I vary the vulnerability threshold depending on the size of the ROB (within a range of 1/2*ROB_size to 5/6*ROB_size). To evaluate the effectiveness of the various thresholds, I propose a metric, SER_reduction/power_overhead, which describes the trade-off between reliability and power; a higher value indicates a better trade-off. Figures 5-10(a)-(c) present the ROB SER reduction, the power overhead, and SER_reduction/power_overhead across the various vulnerability thresholds and the three workload categories. As expected, both the ROB SER reduction and the power overhead increase as the threshold decreases, because the high VDD is triggered more frequently. However, this is not the case for SER_reduction/power_overhead: when the threshold is set to 64, as shown in Figures 5-10(a) and (b), SER_reduction/power_overhead attains its maximum value on the CPU and MIX workloads. Therefore, a vulnerability threshold of 64 is selected for my study. The white bars in Figure 5-9 present the results yielded by the enhanced trigger: on average, the ROB SER is reduced by 35% with only a 3.5% power overhead.

Putting Them Together

Figures 5-7 and 5-9 show that both the hybrid radiation hardened IQ and the dual-VDD based ROB exhibit strong SER robustness while incurring negligible performance and power overhead. I also apply the two techniques simultaneously and evaluate their aggregate effect on the SER of the entire processor core. The impact of the two proposed techniques on the vulnerability of other primary structures, such as the register files, load/store queue, DTLB, and function units, is also examined. The SER results, normalized to the baseline case where no optimization is applied, are shown in Figure 5-11. As can be seen, on average the core SER decreases substantially, by 23%, while the other structures' SERs are only slightly affected by my techniques; furthermore, the load/store queue vulnerability is also reduced, by 15%. I omit a discussion of the performance penalty and power overhead of the aggregated technique, as they have already been covered in the previous sections.

Related Work

Various methodologies exist to model processor vulnerability to soft errors. In the past, duplicated coarse-grained structures such as functional units, processor cores, or hardware contexts have been used to detect and tolerate transient faults [21]. However, those approaches result in significant overhead in performance, area, and power.


[92] proposed performing redundant execution only during low-ILP periods and L2 misses in order to achieve high error coverage with low performance loss. In [93], SlicK is introduced to avoid redundancy for instructions whose results are predictable. Wang et al. [94] showed that soft errors produce observable symptoms and used these symptoms to trigger rollback execution. Soft error tolerance techniques also exist at the circuit level. [80] established a relation between the atmospheric neutron soft error rate and the technology feature size. [95] proposed a soft error detection circuit based on abnormal switching currents. Choudhury et al. [7] proposed using gate size and VDD as design parameters to realize SER-robust circuit design. [96] selectively resized the most vulnerable gates to improve the single-event upset (SEU) robustness of combinational logic circuits. In [97], Srinivasan et al. applied an Asymmetric SRAM optimized for storing zeros (ASRAM-0) to reduce structures' SER, based on the observation that most of the configuration bit stream is composed of zeros. My approach is unique in that I explore designs using combined circuit- and microarchitecture-level SER robustness techniques.


Table 5-1. The studied SMT workloads

    Thread Type   Group     Benchmarks
    CPU           Group A   bzip2, facerec, gap, wupwise
                  Group B   crafty, fma3d, mesa, perlbmk
                  Group C   eon, gcc, wupwise, mesa
    MIX           Group A   crafty, gap, lucas, swim
                  Group B   mcf, mesa, twolf, wupwise
                  Group C   equake, facerec, perlbmk, vpr
    MEM           Group A   applu, galgel, twolf, vpr
                  Group B   ammp, equake, lucas, twolf
                  Group C   lucas, mcf, mgrid, swim


Figure 5-1. Soft error robust SRAM (rSRAM) cell (6T+2C). The rSRAM cell is built from a standard six-transistor high-density SRAM cell above which two stacked Metal-Insulator-Metal (MIM) capacitors are symmetrically added. The embedded capacitors increase the critical charge required to flip the cell logic state and lead to a much lower SER. The common node of the two capacitors is biased at VDD/2.

Figure 5-2. The control flow of instruction dispatch in the proposed IQ using hybrid radiation hardening techniques.


Figure 5-3. An overview of the radiation hardened IQ design using hybrid techniques.

Figure 5-4. The wakeup logic of the RIQ.


Figure 5-5. The design alternatives of the hybrid radiation hardened IQ. A) First design. B) Second design.

Figure 5-6. The correlation between ROB AVF and L2 cache miss.


Figure 5-7. A comparison of normalized IQ SER, throughput and harmonic IPCs. A) Normalized IQ SER. B) Normalized throughput IPC. C) Normalized harmonic IPC.


Figure 5-8. Criticality threshold analysis. A) CPU combination workloads. B) MIX combination workloads. C) MEM combination workloads.


Figure 5-9. ROB SER reduction and processor power overhead with the L2_miss trigger and the enhanced trigger. A) ROB SER reduction. B) Power overhead.


Figure 5-10. Vulnerability threshold analysis. A) CPU combination workloads. B) MIX combination workloads. C) MEM combination workloads.


Figure 5-11. The aggregate effect of the two proposed techniques.


CHAPTER 6
CHARACTERIZING AND MITIGATING SOFT-ERROR VULNERABILITY IN THE PRESENCE OF PROCESS VARIATION

The continued scaling of CMOS process technology to the nanometer scale has resulted in process variation (PV) that leads to significant variability in chip performance and power [101, 102, 104, 107, 141]. Existing work on analyzing and mitigating the impact of process variation on microprocessor design has largely focused on the performance and power domains [118, 119]. Since the soft error rate is inherently related to device parameters (e.g., gate length (L) and threshold voltage (Vth)) [116, 117, 139], as technology advances, the impact of process variation should be taken into consideration by architects in reliability evaluation. Furthermore, techniques aimed at mitigating process variation should also consider their reliability impact. Nevertheless, at the microarchitecture level, there has been no prior work on (1) characterizing the impact of parameter variation on soft error robustness or (2) optimizing soft error reliability in light of process variation.

In this chapter, I characterize the impact of process variation on microarchitecture soft error vulnerability. I find that although the critical charge (a widely used metric for evaluating the raw soft error rate (SER)) varies significantly under process variation at the bit level, it exhibits small variation at the entry level of microarchitecture structures. I examine the impact of two recently proposed process variation tolerant techniques, Variable Latency Register Files (VL-RF) [118] and Dynamic Fine-Grain Body Biasing (DFGBB) [119], on microarchitecture soft error reliability and find that both techniques increase the SER. I then propose reliability-aware process variation tolerant mechanisms that are capable of achieving an optimal vulnerability-frequency-leakage operating point.


Experimental Methodology

This section presents a model of process variation and describes the methodology I use to estimate the impact of PV on soft error robustness. In addition, the microarchitecture-level soft error reliability computation method is introduced.

Process Variation Modeling

Process variation is a combination of random effects (e.g., random dopant fluctuations) and systematic effects (e.g., lithographic lens aberrations). Random variation refers to random fluctuations in parameters from die to die and device to device. Systematic variation, on the other hand, refers to layout-dependent variation through which nearby devices share similar parameters; devices thus exhibit spatial correlation within a given die. Die-to-die (D2D) variation mainly manifests as random variation, whereas within-die (WID) variation is composed of both random and systematic variation. I model variations in L and Vth, since they are the two major sources of process variation [102]. The L and Vth of each device can be represented as follows:

    L = L_{nom} + \Delta L_{D2D} + \Delta L_{WID}    (6-1)

    V_{th} = V_{th\_nom} + \Delta V_{th\_D2D} + \Delta V_{th\_WID}    (6-2)

where L_nom and Vth_nom are the nominal values of the gate length and the threshold voltage, respectively. ΔL_D2D and ΔVth_D2D represent the D2D variations; devices in a single die share the same ΔL_D2D and ΔVth_D2D, which are essentially constant offsets. ΔL_WID and ΔVth_WID represent the WID variation, which can be further expressed as the additive effect of systematic and random variations. I focus the PV modeling on within-die variation, since die-to-die variation can be modeled as an offset applied to all devices within the chip.

To model the random effects of WID variation, I generate random variables that follow a normal distribution.


To model systematic variations, I use the multi-level quad-tree partitioning method proposed in [121], which has been widely used in prior work [105, 118, 122]. Figure 6-1 illustrates this method. A chip is covered by several layers, and each layer has a number of quadrants. Each quadrant is assigned a random variable (following a normal distribution), and all devices in the same quadrant share the same value. The systematic variation of each device within the chip is equal to the sum of the random variables of the quadrants, through all the layers, to which it belongs. Nearby devices share more quadrants than far-away devices. For example, as shown in Figure 6-1, devices in quadrants (2,1) and (2,2) share the random variables of quadrants (0,1) and (1,1), but devices in quadrants (2,1) and (2,16) share only quadrant (0,1). This approach effectively captures the spatial correlation between devices. I chose an area of 32 6T SRAM cells as the granularity of the smallest quadrant, which is sufficient to capture systematic variation. The WID variation follows a normal distribution (random variables are generated through Monte Carlo simulation) with a standard deviation \sigma = \sqrt{\sigma_{rand}^2 + \sigma_{sys}^2}, where \sigma_{rand} and \sigma_{sys} represent the standard deviations of the random and systematic variation, respectively. In this study, I simulate processors developed in 45nm process technology and assume σ/μ = 12% with equal random and systematic contributions (σ_rand = σ_sys = σ/√2), based on the variability projections from [123]. I use the Alpha 21264 as the baseline machine. The layout is scaled down to 45nm from an Alpha 21264 chip floorplan, and 400 chips are generated for statistical analysis. Predictive Technology Models [124], the evolution of the earlier Berkeley Predictive Technology Models (BPTM), are used to provide the basic device parameters for HSPICE simulations.
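
A minimal NumPy sketch of this variation model is shown below. The grid resolution and sigma values are illustrative; the actual model maps quadrants onto the scaled Alpha 21264 floorplan and draws 400 chip instances.

    import numpy as np

    def systematic_map(levels=3, sigma_sys=0.06, rng=np.random.default_rng(0)):
        # Multi-level quad-tree: layer l is a 2^l x 2^l grid of quadrants,
        # and a device's systematic offset is the sum of the quadrant
        # variables it falls in across all layers. Per-layer sigma is split
        # so the summed map has standard deviation sigma_sys.
        n = 2 ** (levels - 1)
        total = np.zeros((n, n))
        for l in range(levels):
            g = 2 ** l
            layer = rng.normal(0.0, sigma_sys / np.sqrt(levels), (g, g))
            total += np.kron(layer, np.ones((n // g, n // g)))
        return total

    def device_parameter(nominal, sigma_rand, sys_map,
                         rng=np.random.default_rng(1)):
        # WID value per device: nominal + systematic + independent random part
        # (Equations 6-1 and 6-2, restricted to the within-die terms).
        return nominal + sys_map + rng.normal(0.0, sigma_rand, sys_map.shape)

    vth_map = device_parameter(0.3, 0.02, systematic_map())  # e.g. Vth in volts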


Impact on Circuit-Level Soft Error Vulnerability

As Equation 5-1 shows, the SER increases exponentially with reduction in Qcrit. In this study, I use the variability of the critical charge under process variation to estimate the raw SER variation at the circuit level.

There are several types of storage cells (e.g., SRAM, transmission gate flip-flop, and dynamic latch), and the critical charge of each type is different. As an example, I present the analysis of a standard 6T SRAM in this subsection; Figure 6-2 shows the schematic. When the word line is not asserted, transistors M5 and M6 disconnect the cell from the bit lines, and the two cross-coupled inverters formed by M1-M4 reinforce one another, maintaining a stable state in the cell. There are only two stable states (each corresponding to storing a 0 or a 1) in the SRAM: Q=0 and QB=1, or Q=1 and QB=0. When energetic particles hit Q (assuming Q=1) and discharge it, M2 begins conducting and charges QB. The state of the cell is flipped once Q drops below VDD/2, and a soft error occurs. The critical charge can be estimated as follows:

    Q_{crit} = \int_{0}^{T_f} I_d \, dt    (6-3)

where Tf is the flipping time and Id is the drain current induced by the particle. Device parameters (e.g., L and Vth) impact both Tf and Id, which indicates their effect on Qcrit. I insert a current source (shown in Equation 6-4) into node Q to model the particle strike:

    I_{in}(t) = \frac{Q}{\tau_\alpha - \tau_\beta} \left( e^{-t/\tau_\alpha} - e^{-t/\tau_\beta} \right)    (6-4)

where Q is the charge deposited as a result of the particle strike, τ_α is the collection time constant of the junction, and τ_β is the ion-track establishment time constant.
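
The Qcrit search implied by Equations 6-3 and 6-4 can be sketched numerically as below: inject the double-exponential current with increasing deposited charge Q until the flip condition (node Q falling below VDD/2) is met. The single-node cell model, with a fixed restoring current, is a crude stand-in for the HSPICE netlist used in the dissertation, and all parameter values are illustrative.

    import numpy as np

    def strike_current(t, q, tau_a=150e-12, tau_b=50e-12):
        # Double-exponential particle-strike current (Equation 6-4).
        return q / (tau_a - tau_b) * (np.exp(-t / tau_a) - np.exp(-t / tau_b))

    def cell_flips(q, vdd=1.0, c_node=1e-15, i_restore=20e-6, dt=1e-12):
        # One-node stand-in for the 6T cell: the strike discharges node Q
        # while the pull-up fights back with a fixed restoring current.
        v = vdd
        for step in range(2000):                 # simulate 2 ns
            i_net = strike_current(step * dt, q) - i_restore
            v = min(v - i_net * dt / c_node, vdd)
            if v < vdd / 2:                      # flip condition from the text
                return True
        return False

    def q_crit(lo=0.0, hi=100e-15, iters=40):
        # Bisect for the smallest deposited charge that flips the cell.
        for _ in range(iters):
            mid = (lo + hi) / 2
            lo, hi = (lo, mid) if cell_flips(mid) else (mid, hi)
        return hi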


Figure 6-3 depicts the impact of PV on SRAM Qcrit. The figure plots two sets of experimental results showing how the node voltages change when an SRAM cell with process variation (Figure 6-3(a)) and without process variation (Figure 6-3(b)) is struck by the same particle at 100 picoseconds. {V(Q), V(QB)} refers to the voltages of Q and QB without PV, and {V(Qpv), V(QBpv)} refers to the voltages of Q and QB with PV. As can be seen, V(Q) and V(QB) maintain the initial state and tolerate the strike, while V(Qpv) and V(QBpv) flip quickly after the strike, leading to a failure. Note that PV can also have a positive effect on the critical charge, since parameters can vary both up and down around their mean values.

I model the critical charge variability of different types of storage cells and combinational logic (e.g., a chain of 6 inverters, a 2-input NAND gate) using the Qcrit-based SER model and HSPICE simulations similar to the one described above. The L and Vth of the transistors are set using random variables from Monte Carlo simulation. Figure 6-4 shows the chip-wide critical charge variation map obtained from these experiments (data are presented in units of fC). The results, which exhibit an average variation of 18.9% across the simulated chips, show that there is large variation in the critical charge under PV. Since the SER is exponentially related to the critical charge, one can expect an even larger variation in the SER in the presence of PV.

Microarchitecture-Level Soft Error Vulnerability Analysis

A key observation about the effect of soft errors at the microarchitecture level is that a single-event upset (SEU) may not affect processor state that is required for correct program execution. The overall soft error rate of a microarchitecture structure without the impact of PV, as given in Equation 6-5 [126], is determined by two factors: the FIT rate (Failures in Time, i.e., the raw SER at the circuit level), which is mainly determined by circuit design and process technology, and the architecture vulnerability factor (AVF):

    SER = FIT \times AVF    (6-5)

A hardware structure's AVF refers to the probability that a transient fault within that hardware structure will result in incorrect program results.


Therefore, the AVF, which can be used as a metric to estimate how vulnerable a hardware structure is to soft errors during program execution, is determined by the processor state bits required for architecturally correct execution (ACE) [22]. At the cycle level, the AVF of a hardware structure is the percentage of bits within that structure that are ACE. At the program level, the AVF of a hardware structure is derived by averaging the cycle-level AVF of the structure across the program's execution [22], as shown in Equation 6-6:

    AVF = \frac{\sum_{execution\,cycles} \#ACE\,bits\,per\,cycle}{\#B \times T_{execution\_cycles}}    (6-6)

where #B is the number of bits in the structure. As can be seen, the AVF is primarily determined by the quantity of ACE bits per cycle and their residency time within the structure. Note that Equation 6-6 assumes that the frequency is constant. However, the frequency may vary throughout execution due to PV mitigation techniques (e.g., DFGBB boosts the chip frequency [119]). To quantify the architecture vulnerability factor in light of process variation, I propose a PV-aware vulnerability metric, AVF_pv, which is expressed as follows:

    AVF_{pv} = \frac{\sum_{execution\,cycles} \#ACE\,bits\,per\,cycle \times \frac{1}{f_{pv}}}{\#B \times \sum_{execution\,cycles} \frac{1}{f_{pv}}}    (6-7)

where f_pv is the frequency under the influence of PV mitigation techniques. Since the FIT/bit (determined by the critical charge) can vary widely due to PV, it is not appropriate to use a random bit's FIT to compute the entire microarchitecture structure's FIT and then obtain the SER (note that without PV, the FIT of a structure is the product of the FIT/bit and the number of bits in the structure). An analysis cell should therefore be introduced, which can be as small as a single bit or as large as the entire structure. Each bit within a cell is assumed to share the same FIT/bit, and the overall structure's SER under PV (SER_pv) can be expressed as:

    SER_{pv} = \sum_{\#\,of\,cells} FIT_{pv\_cell} \times AVF_{pv\_cell}    (6-8)

where FIT_pv_cell and AVF_pv_cell represent the FIT and AVF_pv of the finest analysis cell in the structure. Note that even though each bit's FIT within an analysis cell is assumed identical, AVF_pv is still computed at the bit level.
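
The following is a minimal sketch of Equations 6-6 through 6-8 as they might appear in a simulator's reliability bookkeeping. The per-cycle traces are hypothetical inputs, and the 1/f_pv time-weighting reflects the reconstructed form of Equation 6-7 above.

    def avf(ace_per_cycle, total_bits):
        # Equation 6-6: AVF at a constant clock frequency.
        return sum(ace_per_cycle) / (total_bits * len(ace_per_cycle))

    def avf_pv(ace_per_cycle, freq_per_cycle, total_bits):
        # Equation 6-7 (reconstructed): weight each cycle by its real
        # duration 1/f_pv, so frequency-changing PV techniques are counted.
        num = sum(a / f for a, f in zip(ace_per_cycle, freq_per_cycle))
        den = total_bits * sum(1.0 / f for f in freq_per_cycle)
        return num / den

    def ser_pv(cells):
        # Equation 6-8: sum FIT x AVF over per-entry analysis cells, since
        # under PV the FIT/bit is only uniform within one cell.
        # cells: iterable of (fit_cell, avf_cell) pairs.
        return sum(fit * a for fit, a in cells)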


Microarchitecture-Level Soft Error Vulnerability Characterization Under Process Variation

In this section, I perform an extensive study to estimate the FIT variation across the analysis cells. In addition, the impact of PV mitigation techniques on microarchitecture-level soft error vulnerability is presented.

FIT Variation Across the Analysis Cells

In the previous section, I introduced a method to quantify critical charge variation at the bit level. At the microarchitecture level, data within a structure are usually accessed on a per-entry basis. Therefore, a characterization of the critical charge variation at the entry level (Qcrit_entry) can provide further insight for exploring reliability-aware PV optimizations. In this section, I present a Qcrit_entry variation analysis for the register file; the same characterization procedure has been applied to the other microarchitecture structures. For a given microarchitecture structure, the analysis cell is set equal to the size of one entry of that structure. In this study, I opt to use the minimum critical charge within an entry as its Qcrit_entry. Since a low critical charge increases the FIT rate, choosing the minimum critical charge bases my estimation on the bit that is most critical in determining the upper-bound SER of an entry within the microarchitecture structure. I obtain a Qcrit_entry distribution for each chip's register file and present data from the chip with the largest Qcrit_entry variation.
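
As a small sketch (with NumPy), the entry-level statistic and its contrast with the bit-level spread can be computed as follows; the input array is a hypothetical stand-in for the per-bit HSPICE Monte Carlo results.

    import numpy as np

    def entry_vs_bit_variation(qcrit_bits_per_entry):
        # qcrit_bits_per_entry: 2-D array, one row of bit Qcrit values per
        # register file entry. Qcrit_entry is the minimum over the entry's
        # bits, since the weakest bit bounds the entry's SER.
        bits = np.asarray(qcrit_bits_per_entry)
        q_entry = bits.min(axis=1)
        bit_cv = bits.std() / bits.mean()          # bit-level spread
        entry_cv = q_entry.std() / q_entry.mean()  # entry-level spread
        return entry_cv, bit_cv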


Figure 6-5 shows each entry's Qcrit_entry within the 80-entry register file, along with a zoomed-in view of the entry with the minimum Qcrit_entry that displays each bit's critical charge within that entry. Figure 6-6 plots both the entry-level and the bit-level critical charge distributions within the register file. The standard deviation of the entry-level critical charge (3.5%) is much smaller than that at the bit level (25.9%), because a large portion of the bit-level critical charge variation is smoothed out at the entry level (shown in Figure 6-5). I conclude that Qcrit_entry changes only slightly within a structure under PV.

Microarchitecture Soft Error Vulnerability Under Process Variation Mitigation Techniques

In this section, I evaluate the impact of PV mitigation techniques on microarchitecture soft error vulnerability, considering the variable-latency register file and dynamic fine-grained body biasing techniques.

The effect of the variable-latency register file (VL-RF)

The multi-ported register file (RF) is a critical component in determining both the frequency and the IPC of a processor. Delay in the register file is dominated by the SRAM access time. The frequency loss within the register file due to process variation can be reduced by applying variable-latency (VL) techniques. In [118], for each read port in the register file, entries are partitioned into fast and slow entries based on their SRAM read delay. Read operations complete within one cycle in fast entries but take two cycles in slow entries. These slow entries are not accounted for during register file frequency calibration: an n% VL-RF defines the RF frequency based on the slowest read time of the fastest n% of RF entries for each read port. The frequency is pre-defined by testing the read ports of each RF entry. In VL-RF, it is possible that an RF entry has both slow and fast read ports. When a slow port is assigned to a read operation, a one-cycle stall is incurred in the pipeline to which the port belongs; consequently, the issue width for the next cycle shrinks and the IPC is reduced.

[118] evaluated the effect of VL-RF on performance. In this dissertation, I perform a complementary evaluation of its impact on reliability.


The issue queue (IQ) is a good starting point for such an investigation, since shrinking the issue width directly affects instruction issue from the IQ. It is important to note that since the bandwidths of the other stages (e.g., fetch, renaming, etc.) are not affected when a slow read port is selected, only the number of instructions exiting the IQ is reduced; the number of instructions entering the IQ remains the same. As a result, the IQ holds a larger number of instructions, and the quantity of ACE bits within the IQ increases correspondingly. Moreover, a slow read operation on the RF causes back-to-back, dependent instructions to wait an extra cycle, extending their residency cycles within the IQ. Compared to the baseline case that includes PV but no fast/slow RF entry partitioning, the IQ AVF increases, since both the quantity and the residency time of the ACE bits are increased. Figure 6-7(a) compares the IQ AVF of the baseline case to that of the VL-RF technique on the benchmark gcc over a period of 375ms at 1ms granularity. Figure 6-7(b) plots the IQ AVF difference between the two, together with the average number of issue pipe stalls in each interval. As can be seen in Figure 6-7(a), the IQ AVF with VL-RF is substantially higher than in the baseline case. As the number of issue pipeline stalls climbs, there is a corresponding increase in the IQ AVF difference between VL-RF and the baseline case (shown in Figure 6-7(b)). This illustrates the degradation in soft error robustness introduced by VL-RF and suggests that the degradation is largely a result of the pipeline stalls.

To mitigate the IPC loss due to VL-RF, [118] proposed a port switching technique that switches from slow ports to fast ports when reading from the RF. This technique compensates for the IPC loss by avoiding a large portion of the reads from slow ports. It produces fewer pipeline stalls, so the high IQ AVF caused by VL-RF is also mitigated.


The effect of dynamic fine-gra ined body biasing (DFGBB) Body Biasing (BB) applies a voltage between the source or drain and substrate to adjust the threshold voltage. Forward body biasing (FBB) decreases the Vth, decreasing the delay of the transistors but making them leakier. Contra rily, reverse body biasi ng (RBB) increases the Vth, creating less leaky but slower transistors. The impact of body biasing on storage cells SER has been studied in the past. [138] found that FBB has the ability to improve a flip-flops soft error robustness by 35% and RBB degrades the SER by 36% in 130nm process technology. To evaluate the effect of body biasing on an SRAMs SER in 45nm technology, I compute the critical charge using HSPICE simulations by biasing Vth in the range of [-0.3v, 0.3v] in 6T SRAM with a resolution of 0.032v. Re sults are presented in Figure 68 (delta threshold voltage is equal to the subtraction of Vth after BB has been applied from the original Vth. FBB corresponds to positive since it reduces Vth and RBB corresponds to negative thV thV as it increases Vth). As can be seen, there is a linear relati onship between critical charge and Vth, and critical charge reduces as Vth increases. Compared to the baseline case without Vth biasing, the critical charge increases 44% with a forward body biasing of 0.3v, and decreases 33% when a reverse body biasing of 0.3v is applied. Since the SER is expone ntial to the critical ch arge, the impact of the varying Vth on the SER will be further amplified. To mitigate the impact of PV on performan ce under a pre-defined power budget, Static Fine-Grained Body Biasing (SFGBB) is applied dur ing the chip calibration time. It partitions the chip into cells, and statically sets each cel ls body biasing value under the worst case (e.g. the cell is fully loaded in the worst case temperat ure). Recently, Teodorescu and Torrellas proposed Dynamic Fine-Grain Body Biasing (DFGBB) scheme [119], which allows bi as voltages to adapt to dynamic conditions. Take a cell which is init ially slow for example, SFGBB applies forward body biasing at the calibration time. With DFGBB, each cells body biasing valu e is set to be the 130


With DFGBB, each cell's body biasing value is set to the SFGBB value when the cell is first powered up. DFGBB then dynamically reduces the forward body biasing, or even applies reverse body biasing, to save leakage power when the cell is under-loaded. Conversely, when the cell is fully loaded and the rising temperature causes a frequency loss, DFGBB increases the forward body biasing value to recover the frequency. As can be seen, DFGBB reduces the power overhead by adapting each cell's body biasing value to the dynamically changing workload. DFGBB, however, ignores reliability-domain characteristics during its body biasing adaptation, and this results in reliability degradation. For example, when an L2 cache miss occurs and the pipeline stalls, cells such as the IQ and the ROB buffer large numbers of instructions, leading to a high AVF. In this scenario, DFGBB reduces the forward body biasing, or even applies reverse body biasing, and as a result degrades the soft error robustness because of the associated decrease in critical charge. On the other hand, when the structure has a high throughput and the instructions' residency time is short, the AVF is low. In this situation, DFGBB increases the applied forward body biasing, since the temperature has increased. This forward body biasing reduces the FIT (which is governed by the critical charge), but the SER improvement is limited due to the already low AVF. To summarize, DFGBB does not effectively capitalize on opportunities to reduce a structure's vulnerability.

Reliability-Aware Process Variation Mitigation

As discussed above, previous PV mitigation techniques degrade microarchitecture soft error robustness. In this section, I propose reliability-aware PV tolerant schemes that are capable of (1) mitigating microarchitecture soft error vulnerability and (2) achieving an optimal trade-off between performance, reliability, and power in the presence of PV.

Entry-Level Granularity Vulnerability Mitigation

I propose a technique that operates at a fine granularity to reduce IQ vulnerability, built on VL-RF with port switching (VL-RF+PS).


Note that port switching is not an essential requirement for my technique; my method is also adaptable to plain VL-RF. My goal is to mitigate the IQ AVF without losing the performance improvement obtained by VL-RF+PS. Since a high IQ AVF is the result of both a large quantity of ACE bits and the long residency cycles of ACE bits in the IQ, both factors should be considered to reduce the IQ AVF.

As described in Chapter 4, assigning ACE-ready instructions a higher issue priority than un-ACE instructions can reduce the residency time of ACE bits, because these instructions are removed from the IQ more quickly, reducing the number of resident ACE bits. A greater reliability benefit is gained when the issue width decreases due to slow RF reads: as the issue pipe becomes more contended, granting an issue request to an ACE instruction reduces its probability of waiting extra cycles within the IQ.

As described in [22], an instruction cannot be characterized as ACE or un-ACE until its dependence chain has been determined. This is difficult to accomplish at run time, but there are several methods available to characterize ACE instructions on the fly. A method that uses a SlicK table is proposed in [93] to identify ACE instructions by taking advantage of the time lag between the leading and trailing threads; this method, however, is not adaptable to my design. In [64], a critical table was proposed to quantify an instruction's criticality. The rationale of this approach is that performance-critical instructions usually have a large number of dependent instructions, and their computation results are likely to be propagated to the program's final output. Consequently, there is a large overlap between ACE and performance-critical instructions. In this study, I apply the criticality-based approach to dynamically identify vulnerable instructions: instructions with high criticality are assumed to be ACE instructions. Note that I still perform rigorous AVF computation [22] in the technique evaluation.
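As a minimal illustration of this criticality heuristic, the sketch below counts dependents per in-flight instruction and flags an instruction as ACE once the count crosses a threshold. The table layout, ROB-id keys, and threshold value are illustrative assumptions, not the implementation evaluated here.

    from collections import defaultdict

    ACE_THRESHOLD = 3              # hypothetical criticality cut-off
    dependents = defaultdict(int)  # producer rob_id -> dependent count

    def rename(rob_id, src_rob_ids):
        """On rename, credit each producer feeding this instruction."""
        for src in src_rob_ids:
            dependents[src] += 1

    def is_ace(rob_id):
        """Looked up in parallel with wakeup; adds no pipeline delay."""
        return dependents[rob_id] >= ACE_THRESHOLD

    rename(1, [])        # instruction 1 produces a value
    rename(2, [1])       # instructions 2..4 all consume inst 1's result
    rename(3, [1])
    rename(4, [1])
    print(is_ace(1), is_ace(2))   # True False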


With VL-RF+PS, a large number of ACE bits in the IQ are created by the imbalanced bandwidth between the issue stage and the other pipeline stages. In [118], a stall signal is generated and sent to the issue stage when a slow RF read occurs. In Entry-BVM, this signal is also sent to the other pipeline stages, since the input/output imbalance should be avoided in all pipeline stages if possible. For example, if the dispatch stage were the only stage to receive the stall signal, the decode buffer would have to tolerate the imbalance and the vulnerability would migrate from the IQ to the decode buffer. The only exception is the commit stage, since reducing the commit width degrades the IPC.

In Entry-BVM, RF entries with a larger number of fast ports are assigned a higher renaming priority; by doing this, slow RF entries are avoided. This has a positive effect on IPC and helps to compensate for the performance loss caused by the issue width reduction. As in VL-RF+PS, the per-port speed information for registers is loaded from a ROM which records the pre-collected port speed information for each RF entry; it is therefore straightforward to obtain the number of fast ports within the registers. To implement this approach, two bits describing the fast-port quantity of the 4-port RF entry are added into the free list macro. These bits are read when selecting free physical registers during register renaming.

Figure 6-9 presents an architectural overview of Entry-BVM. Floating point structures are omitted due to space limitations, but are similar to the integer structures. Note that Entry-BVM is built upon VL-RF+PS; the detailed implementation of VL-RF+PS, such as the slow and fast RF entry pre-partitioning, the VL-RF frequency pre-definition, and the circuit support (e.g. latches, MUXes) for port switching, is described in [118]. In the renaming stage, registers with a large number of fast ports are given higher priority when selecting from the free physical register pool.
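The following sketch illustrates this renaming priority, assuming a hypothetical free list that pairs each free physical register with the fast-port count carried by the per-entry bits described above; the register ids and counts are invented examples.

    free_list = [(12, 1), (30, 3), (57, 2)]   # (physical reg id, fast-port count)

    def allocate():
        """Select the free register with the largest fast-port count."""
        best = max(free_list, key=lambda entry: entry[1])
        free_list.remove(best)
        return best[0]

    print(allocate())   # 30: slow RF entries are avoided whenever possible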


The critical table is added to implement dynamic vulnerability identification. The table is updated when an instruction is renamed, in order to record up-to-date dependence chain information. For each instruction within the IQ, the table provides the number of dependent instructions in the pipeline; a larger number represents a higher performance criticality. Upon table lookup, the instruction criticality is provided to the IQ. This is done in parallel with the instruction wakeup stage and does not introduce any delay into the pipeline. Instructions are recognized as ACE or un-ACE depending on whether their criticality is higher or lower than a given threshold. The request signal, accompanied by the one-bit vulnerability information, is sent to the selection logic when the instruction becomes ready, and the selection logic grants the request taking the vulnerability information into consideration. In the register read stage, the stall signal is sent to the other stages, and their bandwidth is reduced by one for each stall signal in that cycle. Since variable-latency techniques can be applied to other microarchitecture structures, Entry-BVM is also adaptable to other structures.

Structure-Level Granularity Vulnerability Mitigation

In order to fully exploit soft error robustness improvement opportunities while achieving optimal trade-offs between vulnerability, performance and power, I propose a coarse-grain, Structure-Based Vulnerability Mitigation (Structure-BVM) technique which uses two metrics, AVFpv and IPC*fpv, to guide dynamic body biasing adaptations. Although IPC is a widely used performance metric, it is not suitable for measuring performance across chips with variable frequencies; in my scheme, IPC*fpv is used for performance evaluation. Similar to [119], Structure-BVM is implemented on top of SFGBB, where the body biasing value is statically set for the worst-case execution environment. The maximum forward body biasing applied in Structure-BVM is the body biasing value determined by SFGBB, since the power consumption cannot exceed the upper-bound power overhead.


AVF is high. Since both AVFpv and IPC*fpv are considered in Struct ure-BVM and do not always exhibit a strong correlation (i.e. high IPC does not always imply a high AVF, and vice versa), I propose the categorization of program execu tion into four types of phases based on AVFpv and IPC*fpv; namely high AVFpv + high IPC*fpv, high AVFpv + low IPC*fpv, low AVFpv + high IPC*fpv and low AVFpv + low IPC*fpv. The boundary of a phase is the mean value of AVFpv and IPC*fpv, which equally partitions the two combined parame ters into four sets. Consequently, the number of times I apply FBB and RBB will be roughly equal, avoiding the substantial power overhead of over using FBB or the performance loss of over using RBB. The above two-dimensional partition allows me to explore desirable trad e-offs across multiple domains (e.g. reliability, performance and power). For instance, a high AVFpv phase will require the application of forward body biasing to gain the highest degr ee of soft error robustness; and a higher AVFpv corresponds to a higher forw ard body biasing value. On the other hand, for a low AVFpv phase, I can apply reverse body biasing to save l eakage power with ne gligible effect on SERpv; similarly a lower AVFpv corresponds to a higher reverse body biasi ng value. However, overall performance may be reduced significantly when reverse body biasing is applied during a phase of low AVFpv + high IPC*fpv. Because of this, reverse body biasing will not be applied if the total performance cannot be maintained at a level comparable to the baseline case with pr ocess variation without any optimization. While in a phase of low AVFpv + low IPC*fpv, the use of reverse body biasing will be less constrained since its negative effect on performance is small. Figure 6-10 illustrates how forward body biasing and reverse body biasing are applied in each phase in the proposed scheme. I present the pseudo code of our technique in Figure 6-11. Body biasing is adjusted at an interval granularity with a length of 1ms. At the beginning of each interval, the AVFpv and 135


I present the pseudo code of the technique in Figure 6-11. Body biasing is adjusted at an interval granularity with a length of 1ms. At the beginning of each interval, the AVFpv and IPC*fpv of the last interval are computed (the temperature effect on fpv is also considered in this study), and their maximum, mean and minimum values are updated correspondingly. The performance effect due to body biasing is evaluated as the product of the last interval's IPC and the frequency difference with and without body biasing applied within the interval; the overall performance impact (perf_gain) is updated correspondingly. The AVFpv and IPC*fpv of the current interval are predicted to be the same as in the last interval, and the body biasing value for the current interval is determined by the AVFpv. The total performance effect and IPC*fpv are used to check whether RBB should be applied. Since the body biasing value is discrete, I partition [mean AVFpv, maximum AVFpv] into smaller, equal ranges. The number of ranges is determined by FBB_steps, the total number of FBB steps, and each range corresponds to an FBB value. Similarly, [minimum AVFpv, mean AVFpv] is partitioned and linked to RBB values. These ranges can be partitioned statically to avoid the overhead of the maximum, mean, and minimum AVFpv updates and the dynamic range computations: since the AVFpv can only vary from 0% to 100%, the AVFpv ranges can be pre-determined using information such as the number of body biasing steps that can be applied. Static range partitioning, however, loses the ability to keep a balanced utilization between FBB and RBB (see the Evaluation Section for a detailed evaluation).
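The sketch below illustrates the dynamic range partitioning, mapping an observed AVFpv to a discrete bias step; the step counts and the AVFpv statistics are illustrative.

    def bb_step(avf, avf_min, avf_mean, avf_max, fbb_steps=8, rbb_steps=8):
        """Positive return values are FBB steps, negative ones RBB steps."""
        if avf >= avf_mean:
            step = (avf_max - avf_mean) / fbb_steps
            return min(int((avf - avf_mean) / step), fbb_steps)
        step = (avf_mean - avf_min) / rbb_steps
        return -min(int((avf_mean - avf) / step), rbb_steps)

    print(bb_step(0.42, avf_min=0.05, avf_mean=0.25, avf_max=0.45))  # 6: strong FBB
    print(bb_step(0.10, avf_min=0.05, avf_mean=0.25, avf_max=0.45))  # -6: strong RBB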


As can be seen, Structure-BVM requires online AVF estimation for the structure. The critical table introduced in Entry-BVM for ACE instruction identification is not applied in Structure-BVM, because every cell that requires AVF computation would have to access the table; this would introduce a large number of read ports and wires into the table, increasing the power and area overhead. [140] proposed online AVF estimation by performing fault injections during program execution and observing whether the errors result in failures, but this requires extra hardware support and results in an area overhead. In this study, I use the methodology proposed in [76], which computes the number of ACE bits by observing the number of instructions read from and written into the cell on every cycle, and therefore performs just-in-time vulnerability estimation. The dynamic vulnerability estimation is used to guide body biasing adaptation within the cell, and accurate AVF computation is still performed for evaluation purposes. Fine-grain cells within the chip are generally partitioned at the microarchitecture structure level [108], and [119] demonstrates the feasibility of applying body biasing techniques at a per-structure level. Note that FGBB is applied independently to each structure, so my technique does not simply migrate vulnerability from one structure to another. To implement body biasing, a bias generator is attached to the structure. It generates bidirectional but equivalent body biasing values for PMOS and NMOS transistors separately, because the same body biasing value has opposite effects on the two types of transistors. The detailed circuit modifications can be found in [119].

Evaluation

In this section, I describe the experimental methodology and evaluate the two proposed reliability-aware PV mitigation techniques. Next, I examine the aggregated effect of the two techniques. Finally, I discuss the area and power overhead.

Experimental Methodology

To evaluate reliability at the microarchitecture level, I use the reliability-aware simulation framework Sim-SODA [131]. In addition, I port Wattch [91] into my simulation framework for dynamic power evaluation. The leakage power is estimated using HotLeakage [134], and the power results are scaled based on technology projections from ITRS [135]. The HotSpot tool [136] is used to estimate chip temperature.


I use a default Alpha 21264 machine configuration with a 20-entry INT and a 15-entry FP IQ, an 80-entry ROB, an 80-entry INT and a 72-entry FP register file with 4-rd/2-wr ports, a 32KB L1 D-cache, a 32KB L1 I-cache, and a 2MB L2 cache. The processor pipeline bandwidth is set to four. I use the SPEC CPU 2000 integer and floating point benchmarks in the evaluation. The Simpoint tool [26] is used to identify the most representative simulation interval for each benchmark; each benchmark is then fast-forwarded to its representative interval before detailed simulation takes place. I simulate 500 million instructions for each benchmark and present the average result across 400 chips. An 80% VL-RF is chosen in [118] because it is built on 65nm technology, which has smaller process variation than 45nm technology. For the Entry-BVM technique, I implement a 70% VL-RF and obtain a 10% frequency increase compared to the chip without VL-RF; a larger portion of slow RF entries has to be discounted in order to gain a significant frequency improvement in 45nm. I set the body biasing range to [-0.3v, 0.3v], with a resolution of 32mv. The dynamic FIT computation is based on the linear interpolation shown in Figure 6-8.

Besides mitigating microarchitecture vulnerability in the presence of process variation, my techniques also target an optimal trade-off among reliability, performance and power. I propose the metric Vulnerability-Power/Performance (VPP), expressed as VPP = (SERpv × Power) / (IPC × fpv), to evaluate the cross-domain efficiency of the proposed techniques. A better technique achieves a lower VPP.
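Concretely, the metric can be compared across techniques as in the sketch below; the per-technique numbers are placeholders rather than measured results.

    def vpp(ser_pv, power, ipc, f_pv):
        return (ser_pv * power) / (ipc * f_pv)

    baseline = vpp(ser_pv=1.00, power=1.00, ipc=1.00, f_pv=1.00)
    candidate = vpp(ser_pv=0.76, power=1.03, ipc=1.02, f_pv=1.10)
    print(candidate / baseline)   # < 1: a better reliability/power/performance trade-off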


Effectiveness of Entry-BVM

I compare the Entry-BVM scheme to the VL-RF and VL-RF+PS techniques. Figure 6-12 (a)-(b) shows the IQ SER and performance (IPC*fpv) across the studied benchmarks. Results are normalized to the baseline case without any optimization under the impact of process variation. As expected, VL-RF reduces IQ reliability significantly; for example, the IQ SER increases by 58% on gcc. VL-RF does not, however, introduce a severe IQ reliability degradation on some benchmarks (e.g. equake, mcf). This is because those benchmarks incur a large number of L2 cache misses, which limits the quantity of ready instructions in the IQ for issue; there are still enough pipes for instruction issue even though the number of pipes decreases under VL-RF, so the IQ AVF is only slightly affected. As shown in Figure 6-12 (a), on average VL-RF increases the IQ SER by 18% compared to the baseline case. VL-RF+PS mitigates the negative effect of VL-RF to some degree, but still increases the IQ SER by 8%. Entry-BVM exhibits a strong ability to improve IQ soft error robustness: it not only mitigates the vulnerability increase caused by the VL-RF techniques, but further reduces the IQ SER by 24%. As Figure 6-12 (b) shows, the IPC*fpv of VL-RF+PS and Entry-BVM is almost equal; thus Entry-BVM maintains the same performance improvement as VL-RF+PS. Figure 6-13 shows the trade-off metric VPP across the various techniques, normalized to the baseline case. Entry-BVM produces a much lower VPP than the other approaches: on average, the VPP of Entry-BVM is 50% and 28% lower than that of VL-RF and VL-RF+PS, respectively. Since Entry-BVM controls the number of input/output instructions for each microarchitecture structure in the entire pipeline via broadcasting the stall signal, it does not migrate the vulnerability from the IQ to other microarchitecture structures.

Effectiveness of Structure-BVM

I first compare the Structure-BVM schemes (using both dynamic and static AVFpv range partitioning) with DFGBB [119] on reliability. Figure 6-14 presents the IQ SER across the various techniques, normalized to the baseline case. Because DFGBB does not effectively take advantage of FBB to improve soft error robustness, it does not decrease but instead increases the IQ SER, by an average of 8% across the studied benchmarks. Compared to the baseline case, static AVFpv range partitioning only slightly affects IQ vulnerability, because it lacks the ability to adapt to the dynamic AVFpv behavior and choose an optimal BB value.


Structure-BVM with dynamic AVFpv range partitioning shows a strong ability to reduce IQ vulnerability: the IQ SER is 30% and 20% lower than with DFGBB and the baseline case, respectively. The normalized performance (IPC*fpv) results are shown in Figure 6-15. As can be seen, DFGBB suffers a considerable performance loss even though it maintains the frequency, while Structure-BVM with dynamic AVFpv range partitioning maintains or improves performance, by 22%. This is because the per-interval performance constraint ensures that biasing the threshold voltage does not incur any negative effect; consequently, performance usually increases. Note that while Structure-BVM has an intrinsic power constraint, both DFGBB and Structure-BVM have the ability to reduce the power overhead introduced by SFGBB. One can expect a greater power saving with DFGBB than with Structure-BVM, since Structure-BVM trades a certain amount of power reduction for improved reliability and performance. Instead of analyzing the power consumption of these techniques individually, it is more interesting to evaluate the trade-off among the three domains. I present the normalized VPP in Figure 6-16: on average, the VPP of DFGBB is 40% higher than that of Structure-BVM with dynamic AVFpv ranges.

As shown in Figure 6-16, the benefit gained from Structure-BVM varies substantially across benchmarks. For example, a large IQ SER reduction is obtained on equake, while the IQ SER on swim changes only slightly when Structure-BVM is applied. With Structure-BVM, intervals with a close-to-mean AVFpv are assigned a close-to-zero BB value; if most of an application's intervals have an AVFpv close to the mean value, the effectiveness of Structure-BVM is reduced. I select one representative chip and show the AVFpv vs. IPC*fpv plot in Figure 6-17 for the intervals of equake and swim. Dotted lines mark the boundaries of the four phases, and an area is drawn with its center (at the mean AVFpv and mean IPC*fpv) at the intersection of the boundaries. As shown, a large number of intervals of swim are covered by this area, whereas few intervals of equake fall within it.


Combining Entry-BVM and Structure-BVM

As the results have shown, both Entry-BVM and Structure-BVM exhibit a strong ability to improve a structure's soft error robustness. Since they work at different levels, they can be combined to further reduce the vulnerability and push the trade-off among the three studied domains closer to an optimal point. Figure 6-18 (a) and (b) show the normalized IQ SER and VPP under the aggregate effect of Entry-BVM + Structure-BVM across the benchmarks. Compared to the baseline case with process variation, the combined technique reduces the IQ SER substantially, by 40%, while improving the VPP by 46%.

Overhead of Entry-BVM and Structure-BVM

It was found in [118] that VL-RF+PS adds a 2% area overhead and less than a 5% power overhead. Entry-BVM adds a small amount of logic for register renaming and instruction selection on top of VL-RF+PS. An n × n bit table is added for dynamic vulnerability identification, where n is the number of ROB entries. I estimate the extra area overhead introduced by Entry-BVM to be less than 1%, with negligible power overhead. Previous studies have also estimated the overhead of body biasing techniques [108]: generally, FGBB introduces a 2% area overhead and a 1% power overhead, and the timing overhead of FGBB is negligible [119]. Structure-BVM updates several parameters (e.g. AVFpv, IPC*fpv) at the end of each interval, but the processor does not need to stop for these parameter updates; the previous BB value can be applied until the re-evaluation is completed, so Structure-BVM introduces no timing overhead. The extra power consumed by the updates is small. Few circuits and buffers are added by Structure-BVM, and the estimated extra area overhead is less than 1%.


Related Work

There have been many studies characterizing the impact of PV on performance and power. Borkar et al. [101] investigated the impact of PV at both the circuit and microarchitecture levels. Bowman et al. [102] introduced the maximum clock frequency (FMAX) model to capture chip frequency under PV. Orshansky et al. [103] derived a model to estimate the performance degradation for a given circuit and process parameters in the presence of PV. Agarwal et al. [105] presented a statistical timing analysis technique that accounts for WID variations with spatial correlation. Statistical modeling and analysis of leakage power under PV were performed in [104, 106, 107]. There are few works studying the impact of PV on SER. Ding et al. [110] investigated the impact of L and Vth variations on the critical charge separately; their work was performed at the circuit level and is limited to single storage cells with a single parameter variation. My study extends the SER variation characterization to the microarchitecture level and considers the combined effect of L and Vth variations.

Various PV mitigation techniques have been proposed in the past. Body biasing is widely applied to mitigate both D2D and WID variation [102] or to achieve an optimal trade-off between frequency and power [108]. [109] developed power optimization techniques that consider the PV effect based on statistical analysis. [115] showed that PV can slow down issue queue read/write operations and hurt performance, and proposed instruction steering together with operand- and port-switching to improve performance. Tiwari et al. [111] proposed Recycle, which applies cycle time stealing to microarchitecture pipeline stages for frequency improvement. Liang et al. [112] proposed a memory architecture based on a novel 3T1D DRAM to tolerate PV. [113] applied two fine-grained tuning techniques, voltage interpolation and variable latency, to reduce the frequency variation between chips, between cores on one chip, and between units within one core. [14] used linear programming to find the best voltage and frequency levels for each core of the CMP in the presence of PV.


However, these PV optimization schemes largely ignore microarchitecture soft error reliability. In this chapter, I propose novel techniques that mitigate the negative PV effect while considering the reliability, performance and power factors together.


Figure 6-1. Multi-level quad-tree partitioning to model systematic variation

Figure 6-2. Standard 6T SRAM schematic with current source inserted at Q

Figure 6-3. V(Q) and V(QB) of SRAM (in 45nm processing technology) under a particle strike. A) Without PV. B) In the presence of PV


Figure 6-4. Critical charge variation map for a chip (data are presented in units of fC)

Figure 6-5. Qcrit_entry of each entry in the register file (80 entries, each entry contains 64 bits)

Figure 6-6. Entry- and bit-level Qcrit distribution in the register file


Figure 6-7. A) IQ AVF in the baseline case and with the VL-RF technique on gcc over a period of 375ms. B) The number of issue pipe stalls and the IQ AVF difference between VL-RF and the baseline case

Figure 6-8. Critical charge vs. delta threshold voltage in the range of [-0.3v, 0.3v] in 6T SRAM


Figure 6-9. Entry-BVM architectural overview

Figure 6-10. BB applied in each phase (FBB in the high-AVFpv phases; RBB in the low-AVFpv + low-IPC*fpv phase; in the low-AVFpv + high-IPC*fpv phase, RBB only if there is no performance loss, otherwise no BB)


    every 1ms {
        compute AVFpv and IPC*fpv;
        update Max, Min and Mean values;
        compute Perf_gain;

        if AVFpv belongs to [Mean_AVFpv, Max_AVFpv] then {
            AVF_step = (Max_AVFpv - Mean_AVFpv) / FBB_steps;
            BB = Max_FBB * (AVFpv - Mean_AVFpv) / AVF_step;
        } else {
            AVF_step = (Mean_AVFpv - Min_AVFpv) / RBB_steps;
            BB = Max_RBB * (Mean_AVFpv - AVFpv) / AVF_step;
            if IPC*fpv > Mean_IPC*fpv and Perf_gain < 0 then BB = 0;
        }
    }

Figure 6-11. Structure-BVM pseudo-code

Figure 6-12. Normalized IQ SER and IPC*fpv yielded by the VL-related techniques. A) Normalized IQ SER. B) Normalized IPC*fpv


Figure 6-13. Normalized VPP for the VL-related techniques

Figure 6-14. Normalized IQ SER with the BB-related techniques

Figure 6-15. Normalized IPC*fpv with the BB-related techniques


Figure 6-16. Normalized VPP with the BB-related techniques

Figure 6-17. AVFpv vs. IPC*fpv plot for the intervals of equake (top) and swim (bottom); intervals within the drawn area are close to the mean values


Figure 6-18. A) Normalized IQ SER under Entry-BVM + Structure-BVM. B) Normalized VPP under Entry-BVM + Structure-BVM


CHAPTER 7
NBTI TOLERANT MICROARCHITECTURE DESIGN IN THE PRESENCE OF PROCESS VARIATION

Technology scaling has resulted in the convergence of several factors (e.g. the introduction of nitrided oxides, the increase in gate oxide fields, and rising operating temperatures) which have made Negative Bias Temperature Instability (NBTI) [100] a critical reliability threat for deep sub-micrometer CMOS technologies. NBTI increases the PMOS transistor threshold voltage (Vth) and reduces the drive current, degrading circuit speed and requiring an increase in the minimal voltage that storage cells need to retain their contents.

The PMOS degradation (i.e. wear-out) problem due to NBTI is aggravated in the presence of process variation. Under the impact of PV, the circuit operating frequency decreases significantly after the chip is fabricated (frequency is determined by the slowest critical path), and the NBTI effect further exacerbates circuit performance degradation during chip operation due to the increased Vth. The decreasing circuit operating frequency is therefore a cumulative effect of both PV and NBTI. Current PV-tolerant mechanisms largely ignore the NBTI wear-out problem; on the other hand, existing NBTI-tolerant techniques lack the ability to address the deleterious impact of PV. As a result, the chip can still suffer a significant frequency loss and increased power overhead even when NBTI-tolerant mechanisms are applied. In the upcoming nano-/atom-scale transistor design era, microarchitecture design techniques which can effectively address the combined PV and NBTI effect are greatly needed.

In this chapter, I show that simply combining PV mitigation techniques with NBTI recovery mechanisms cannot efficiently address the aggregated effect. Observing that process variation has both positive and negative effects on circuits, I take advantage of the positive effects in NBTI-tolerant design. I propose three microarchitecture NBTI reliability enhancements in the presence of process variation which mitigate the detrimental impact of PV and NBTI simultaneously, while achieving attractive trade-offs among chip performance, power, lifetime, and area overhead.


I show that the proposed techniques can be applied to a wide range of microarchitecture structures, leading to significant reliability and performance improvements at the chip level.

Background

In this section, I illustrate the effect of NBTI on PMOS transistors and describe mechanisms to recover from NBTI degradation. The interaction between NBTI and PV is discussed as well.

Negative Bias Temperature Instability (NBTI)

NBTI is the result of interface trap generation at the silicon/oxide interface of PMOS transistors. When a PMOS transistor is under negative voltage, the silicon-hydrogen bonds at the silicon/oxide interface can easily break and generate interface traps (NIT). The interface traps capture electrons flowing from the source to the drain and increase the PMOS threshold voltage. As a result, the transistor becomes slower and can cause failures when the delay exceeds the timing specification. NBTI leads to failures in storage cells as well: a higher Vth requires a higher Vmin to retain the content, and the Vmin of the cell may not be able to satisfy this requirement under a limited power budget. Note that PBTI (Positive Bias Temperature Instability) also occurs in NMOS transistors, but its impact is negligible compared to the NBTI effect in PMOS transistors [84].

NBTI degradation can be recovered when a positive voltage is set at the gate of the PMOS transistor; it helps to heal the generated interface traps, which partially recovers Vth. Thus, a PMOS experiences periods of either stress mode (gate set to '0') or recovery mode (gate set to '1') during its lifetime, and the NBTI degradation is partially recovered once the stress is removed. Therefore, minimizing the period during which a negative voltage is applied at the gate of a PMOS can reduce the NBTI effect.
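To give a feel for why the stress duty cycle matters, the sketch below uses a simple power-law stand-in for a reaction-diffusion NBTI model; both the functional form and the constants are illustrative assumptions only, not fitted values.

    def delta_vth_mv(stress_fraction, years, k=30.0, n=0.25):
        """Rough Vth shift after `years` at a given stress duty cycle."""
        return k * stress_fraction * (years ** n)

    for duty in (1.0, 0.5):
        print(f"stress duty {duty:.0%}: ~{delta_vth_mv(duty, years=7):.1f} mV shift")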


Other methods, such as resizing the PMOS or reducing the operating voltage, can also be applied to mitigate NBTI degradation [120, 126]. As discussed in [127], considering the performance, power, and area overhead introduced, reducing the amount of time the PMOS is under stress outperforms the other NBTI mitigation methods.

To mitigate NBTI degradation in combinational logic units, [127] proposed feeding special vectors into the units when they are idle, avoiding aggressive stress on any specific PMOS. As a result, the PMOS transistors in the units degrade evenly and their lifetime is extended, since lifetime is determined by the most degraded PMOS. In storage cell (e.g. 6T SRAM) based structures (e.g. the register file and caches), there is always one PMOS under stress and another under recovery; the best NBTI degradation scenario is therefore to degrade the two PMOS in the SRAM cell evenly. Storing '0' and '1' each 50% of the time achieves balanced NBTI degradation. To reach this goal, [127] observed that, on average, a register file entry is free (the time between release and the next write operation) around 50% of the time and proposed inverting the register file entry while it is in the free state. In addition, [127] proposed invalidating 50% of the L1 cache lines and storing sampled inverted values into them during the entire lifetime to statistically degrade the two PMOS in each SRAM bit evenly.

Guardbanding, as a conservative approach and a last resort, can be used to tolerate NBTI degradation. Guardbanding reduces the processor frequency or increases the minimal voltage to defend against the expected degradation in logic circuits or storage structures during the targeted microprocessor lifetime; for instance, in [128], 20% of the cycle time is reserved to combat NBTI degradation. Mitigating NBTI degradation can reduce the necessary guardbanding, leading to frequency improvements and power savings. However, NBTI mitigation techniques can themselves cause performance penalties and power overhead, making them a poor choice if the overhead outweighs that of guardbanding. NBTI-efficiency (shown as Eq. 7-1) is proposed in [127] to evaluate the efficiency of NBTI tolerant schemes. It quantifies the trade-off among performance (Delay), power and area overhead (TDP), and lifetime (the amount of required NBTI guardband). The Delay and TDP obtained by a technique are normalized to the case without NBTI and PV effects. A lower NBTI-efficiency implies a better approach; the optimum technique achieves an NBTI-efficiency of 1, since both Delay and TDP will be 1 and the NBTI guardband is equal to zero.

    NBTI-efficiency = (Delay x (1 + NBTIguardband))^3 x TDP    (7-1)
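Using the formulation of Eq. 7-1, the comparison below contrasts pure guardbanding against a mitigation scheme that trades a little delay and power for a smaller reserved guardband; the sample numbers are invented.

    def nbti_efficiency(delay, guardband, tdp):
        return (delay * (1.0 + guardband)) ** 3 * tdp

    print(nbti_efficiency(1.00, 0.20, 1.00))  # guardbanding only: 1.2^3 = 1.73
    print(nbti_efficiency(1.02, 0.06, 1.04))  # mitigation with a smaller guardband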


The Interplay Between NBTI and PV

As described earlier, both NBTI and PV affect the PMOS Vth; guardbanding should therefore consider the potential Vth increase contributed by all factors. Targeting only NBTI (or only PV) underestimates the guardband requirement and results in a shorter lifetime, because the frequency loss and power overhead caused by PV (or NBTI) is not counted. On the other hand, simply adding an NBTI guardband to the PV guardband overestimates the actual guardband investment, since doing so conservatively assumes the worst-case scenario and ignores the benign impact of PV on NBTI, which helps reduce the guardband. The excessive guardband causes unnecessary frequency loss and power overhead.

Since parameters vary around their nominal design specification, PV can have both positive and negative effects on transistor characteristics: it either decreases Vth (-ΔVth) or increases Vth (+ΔVth). NBTI degradation only increases Vth, but the amount of increase in PMOS Vth varies significantly with the stress period; the NBTI impact can generally be described as either a high Vth increase (ΔVth_high) or a low increase (ΔVth_low). The aggregated effect of PV and NBTI can thus be classified into four categories: (+ΔVth & ΔVth_high), (+ΔVth & ΔVth_low), (-ΔVth & ΔVth_high), and (-ΔVth & ΔVth_low).


The guardband must be as high as the sum of the NBTI and PV guardbands if the (+ΔVth & ΔVth_high) case dominates. Note that NBTI is a temporal effect: its impact on Vth changes dynamically at runtime over the lifetime, depending on the fraction of time the gate is set to '0'. The ΔVth_high shift can be compensated by PMOS transistors with -ΔVth (due to PV) at a low performance penalty and power overhead. Therefore, the total guardband can be reduced to max((+ΔVth & ΔVth_low), (-ΔVth & ΔVth_high)), and a large amount of frequency and power savings is reclaimed. In an ideal scenario, where all positive effects of PV are exploited to mitigate the NBTI degradation, the guardband decreases to as low as the PV guardband alone. Figure 7-1 illustrates the difference between the conservatively estimated guardband and the optimized one, which considers the interaction between NBTI and PV; the difference can be as large as 36% based on my evaluation.

As discussed above, achieving an optimized NBTI+PV guardband setting requires considering the interaction between NBTI and PV. However, to my knowledge, existing NBTI and PV tolerant mechanisms [118, 119, 127] address the two factors individually and separately. In this study, I propose several cost-effective PV-aware NBTI tolerant methodologies.

Process Variation Aware NBTI Tolerant Microarchitecture

In this section, I argue that simply putting NBTI and PV tolerant techniques together can only reduce the total guardband requirement to a limited extent; moreover, even when it does maximally reduce the guardband, it results in a large performance penalty. To efficiently reduce the total guardband while minimizing the negative impact on performance and power, I propose a set of PV-aware NBTI tolerant techniques for different types of microarchitecture structures that exploit the positive interaction between NBTI and PV.
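The sketch below contrasts the conservative summed guardband with the optimized bound obtained when heavy NBTI stress is steered onto PV-fast transistors, following the max() rule above; all percentage figures are illustrative, not the evaluated numbers.

    PV_GB, NBTI_HIGH_GB, NBTI_LOW_GB = 0.15, 0.20, 0.05  # hypothetical guardbands
    PV_FAST_MARGIN = 0.10                                # slack on -dVth transistors

    conservative = PV_GB + NBTI_HIGH_GB                  # assumes (+dVth & dVth_high)
    optimized = max(PV_GB + NBTI_LOW_GB,                 # slow transistors, light stress
                    NBTI_HIGH_GB - PV_FAST_MARGIN)       # fast transistors absorb heavy stress
    print(f"conservative: {conservative:.0%}, optimized: {optimized:.0%}")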


Motivation

To reduce the required NBTI and PV guardbands, one can apply NBTI tolerant and PV mitigation techniques together, mitigating the NBTI degradation and the deleterious PV effect independently. Take a multi-ported register file (RF) as an example. It is comprised of combinational logic circuits (decoders, wordlines, bitlines, and output amplifiers) and storage cells (SRAM-based RF entries). The NBTI mitigation techniques that target logic circuits and storage cells can be applied to reduce the NBTI guardband; the NBTI guardband of the entire RF is then determined by the higher of the two parts' guardbands. Meanwhile, the VL+PS (i.e. variable latency and port switching) scheme can be applied to the RF to reduce the frequency loss caused by PV and to minimize the PV guardband. However, as my results in the Evaluation Section show, simply putting the NBTI and PV mitigation techniques together only reduces the PV guardband, and even has a negative effect on the NBTI guardband because the PV mitigation technique exacerbates the NBTI degradation. The reason is that this method largely ignores the interplay between NBTI and PV and loses the opportunity to reduce the total guardband further. Since the ultimate goal of NBTI mitigation techniques is the same for different microarchitecture structures, one can expect similar scenarios in other structures (e.g. the issue queue and the functional units). Figure 7-2 illustrates the limitation of the simple NBTI+PV mitigation technique.

Note that, at a considerable performance and power overhead, it is still possible for the simple combined approach to reduce the total guardband by a significant margin. However, as Equation 7-1 shows, the guardband is not the only factor that determines the efficiency of the proposed techniques; the trade-off between reliability and performance/power should also be considered.


The interaction between NBTI and PV provides the opportunity to minimize the performance penalty or power overhead without degrading the guardband reduction obtained by the combined technique. To summarize, simply combining NBTI with PV mitigation techniques lacks the capability to exploit the positive interaction between NBTI and PV, which is beneficial for achieving either a lower guardband or a smaller performance penalty and power overhead.

PV-Aware NBTI Mitigation for Multi-Ported Microarchitecture Structures

In this section, I present the proposed techniques in light of register file (RF) design, since the RF is a representative multi-ported microarchitecture structure. In a multi-ported RF, the RF delay is dominated by the read access time, since write accesses are not as delay critical as read accesses [129]. In this study, I focus on RF read accesses and leave write accesses as future work. Figure 7-3 presents a 2-read-port RF with the detailed read port design; only one bit cell is shown in the figure due to space limitations. As it shows, a read port includes two wordline transistors (the inverter) and two bitline transistors. The read access time consists of the wordline charge delay and the bitline discharge delay. Variation among the four transistors causes the read access time of each port to differ, which further affects the RF frequency, determined by the slowest read access. Therefore, the effect of PV and NBTI on the read ports should be accounted for in the guardband estimation.

When a read port is selected to perform a read operation (e.g. read port A in Figure 7-3), the decoder triggers the wordline associated with that port. This sets a negative voltage at the PMOS gate of the inverter and triggers NBTI degradation. On the other hand, if the port is not selected (e.g. port B in Figure 7-3), a positive voltage is set at the PMOS gate, putting that PMOS into recovery mode. As can be seen, the port is under stress mode whenever it is enabled for a read operation; reducing the port utilization can therefore help mitigate NBTI degradation.

Based on the above observation, I propose microarchitecture optimization 1 (O1), which assigns higher utilization to the ports with shorter read access times.


By doing so, the ports with longer read access times suffer much less NBTI degradation since their utilization decreases. O1 leverages the interaction between NBTI and PV by migrating more NBTI degradation to the ports with low Vth (due to PV); it therefore minimizes the (+ΔVth & ΔVth_high) case and efficiently reduces the NBTI guardband requirement. Since VL has proven to be an efficient PV mitigation method [118], I use the VL technique in O1 to reduce the PV guardband. The read ports are partitioned into fast and slow ports. In 45nm processing technology, the fastest 60% to 80% of ports can be classified as fast ports; correspondingly, the slowest of the slow ports requires 1.16x to 1.22x the cycle time to complete a read access [118]. Since the slow ports are assigned two cycles for the read operation, at least 78% of a cycle time remains available to tolerate the extra delay caused by NBTI degradation. Therefore, aggressively using the slow ports affects neither the VL frequency nor, as a consequence, the required guardband. Note that the access time also varies among the fast ports, and a fraction of the fast ports have access times short enough that they can be utilized continuously (their PMOS under stress mode) without contributing to the NBTI guardband; I define these as absolute fast ports (AFPs). The remaining fast ports are called possible fast ports (PFPs), because NBTI degradation on them is likely to lead to a timing violation and thus contributes to the NBTI guardband. I estimated the read port speed of each RF entry across 400 chips under the impact of PV and observed that, on average, the fastest 36% of the read ports in a chip can be classified as AFPs, since they are at least 15% faster than the VL cycle time. One may notice that even when using AFPs we may still eventually fail to meet the timing specification, since NBTI degradation can cause as much as a 20% frequency loss during the targeted lifetime period [127]. The PFPs still need to be used when no AFP is available.
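The sketch below illustrates the AFP/PFP/slow partition; delays are in units of the VL cycle time, the 15% AFP margin follows the text above, and the per-port delays are invented.

    port_delay = {0: 0.78, 1: 0.83, 2: 0.97, 3: 1.18}  # hypothetical read delays (cycles)

    def classify(delay, afp_margin=0.85):
        if delay > 1.0:
            return "slow"   # given two VL cycles; ample slack for NBTI wear-out
        if delay <= afp_margin:
            return "AFP"    # can be stressed continuously at no guardband cost
        return "PFP"        # fast today, but NBTI drift may violate timing

    print({port: classify(d) for port, d in port_delay.items()})
    # {0: 'AFP', 1: 'AFP', 2: 'PFP', 3: 'slow'}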


Meanwhile, using PFPs lowers the threshold for AFP classification and increases the fraction of ports that can be included in the AFP category. As a result, the overall guardband requirement should consider the wear-out of both the PFPs and the AFPs and is determined by the maximum of the two. Migrating RF port utilization from PFPs to AFPs and slow ports can greatly reduce the guardband requirement. To better illustrate the proposed technique, Figure 7-4 presents the cycle time variation under the impact of NBTI and PV: Figure 7-4 (a) shows the baseline case and Figure 7-4 (b) the optimized scenario. In both cases, the read ports are arranged by access delay. In the baseline case, the initial cycle time is determined by the longest port delay due to PV; NBTI generally degrades the ports evenly, so the final cycle time is the accumulated effect of the worst case in both PV and NBTI. With O1, on the other hand, the initial cycle time is greatly improved by VL; the read ports are partitioned into AFPs, PFPs and slow ports based on their delay, and only the PFPs are vulnerable to NBTI effects. Moreover, under the control of O1, NBTI degrades the ports unevenly based on their category. Therefore, the cycle time is reduced efficiently compared to the baseline case. The description above mainly concerns the combinational circuits in the RF, since they are crucial to the RF frequency; the inversion method proposed in [127] is applied to the SRAM-based RF entries for NBTI recovery.

To implement O1, the key issue is the port utilization assignment. In the proposed scheme, PS is applied to switch from a PFP to either an AFP or a slow port whenever possible, occurring once the instruction is dispatched into the issue queue (IQ). Since instructions have to stay in the IQ for wakeup and selection, the port information checking and switching can be performed concurrently without affecting performance. When the IPC is low, switching from a PFP to a slow port occurs: the amount of required issue bandwidth is usually low during a low-IPC phase, pipe stalls caused by the slow port cause few issue stalls in the following cycles, and hence the impact on performance is small.
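A minimal sketch of this IPC-guided switching policy follows; the interface and the free-port bookkeeping are hypothetical, and the IPC threshold of 1 matches the description below.

    IPC_THRESHOLD = 1.0   # boundary between low- and high-IPC phases

    def switch_port(assigned, ipc, free_afps, free_slow):
        if assigned != "PFP":
            return assigned              # only PFP reads are migrated
        if ipc < IPC_THRESHOLD and free_slow:
            return free_slow.pop()       # low IPC: occasional stalls are cheap
        if free_afps:
            return free_afps.pop()       # high IPC: spare the PFPs from stress
        return assigned                  # no better port available

    print(switch_port("PFP", ipc=0.6, free_afps=[], free_slow=["S3"]))  # S3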


Intuitively, one might limit the number of instructions using slow ports for RF reads in order to avoid a large number of pipe stalls. I found this to be unnecessary: since only about 20% of the ports are slow, the probability that all instructions issued in the same cycle use slow ports, causing pipeline stalls, is low. When the IPC is high, O1 checks the possibility of switching from a PFP to an AFP. If this cannot be done and the use of a slow port is unavoidable, O1 will try to use a slow port for the other operand read as well. Because a pipe stall will occur anyway, the performance impact is the same whether one or both of the read ports are slow; the NBTI effect, however, differs between using one PFP plus one slow port and using two slow ports. Figure 7-5 shows the pseudo code of PS in O1; the IPC is updated every 100 cycles, and an IPC of 1 is used as the threshold between high and low performance phases. Figure 7-6 shows an example of port switching in O1: the port information is attached to each register file entry, each operand in an instruction is originally assigned a read port, and the detailed operations are shown when a PS occurs for a given instruction. The implementation of port information profiling and reading, and the hardware support for port switching, can be found in [118]. As discussed in [118], VL+PS results in a 2% area overhead; O1 introduces an extra 1% area overhead to record the port information.

Note that each read port is assigned a decoder for port activation, and the port is linked to a specific decode line in the decoder. Since the read critical path delay includes the decode delay [118] as well, the NBTI effect caused by port utilization on the decoder cannot be ignored. For illustration, consider the 2-to-4 decoder in Figure 7-7. A decode line contains an inverter, a NOR gate, and a NAND gate, which also have PMOS transistors. To clarify the input of each gate for the NBTI degradation analysis, a truth table is included in Figure 7-7; an output of '1' on D0~D3 causes the port connected to that decode line to be activated for a read operation.


In addition, the detailed circuits of the NOR and NAND gates are presented to illustrate each PMOS transistor's stress or recovery mode depending on its two inputs; the example shows the case where both inputs to the gate are '0'. As can be seen, a '0' input stresses the PMOS and a '1' input recovers it. As the truth table shows, when a port is activated, its corresponding decode line has two '0' inputs to the NOR gate and two '1' inputs to the NAND gate; correspondingly, the two PMOS transistors in the NOR gate are under stress mode while those in the NAND gate are under recovery mode. When a port is deactivated, there are three possible input combinations to the NOR gate, which result in either one or two of its PMOS transistors being under recovery; additionally, the two PMOS transistors in the NAND gate are under stress mode. Approaches such as resizing transistors [120] can be used to tolerate the NBTI degradation on the inverter, which is not private to a specific decode line. In general, half of the PMOS transistors in a decode line are under stress mode and the rest are under recovery mode, whether the port connected to the line is enabled or disabled; in other words, O1 does not change the amount of NBTI stress placed on the decode line. The idea of inserting input vectors [127] when the decoder is idle is used to recover NBTI degradation, solving the uneven degradation problem in the decode line.

PV-Aware NBTI Mitigation for Combinational Blocks

In this section, I propose PV-aware NBTI tolerant schemes that target a microprocessor's combinational blocks, illustrating the design on the functional units.

As described in the Background Section, NBTI recovery in a functional unit can be performed whenever the unit is idle, and a longer idle time provides more opportunity for recovery [127], resulting in a reduced NBTI guardband. In high-performance 64-bit microprocessors, many operand values do not require the full 64-bit width; these operands are referred to as narrow-width values.


When an instruction's operands are narrow-width values, an add operation on two values that each fit in 16 bits needs only 1/4 of the 64-bit functional unit, and the remaining 3/4 of the unit can stay in idle mode, providing opportunities for NBTI recovery. Narrow-width values thus help exploit idle time within a functional unit for NBTI recovery. Previous studies show that there are a large number of narrow-width operations in general-purpose applications; for example, in the SPEC 2000 INT benchmarks, about 50% of the instructions have operands no wider than 16 bits. In this study, a 64-bit functional unit is partitioned into four segments at a granularity of 16 bits. Each segment can complete a 16-bit execution independently; for normal-width values, which are wider than 16 bits, all four segments participate in the computation.

To achieve high performance, the combinational blocks in functional units are either pipelined or parallelized. Take the carry look-ahead adder (CLA) as an example. Instead of waiting for the carry to ripple through all of the previous stages to find its proper value, as in a ripple carry adder (RCA), the CLA calculates the dependence of each carry-out bit on the first carry-in bit and parallelizes the carry-out bit computation; the add operation in a CLA is therefore much faster than in an RCA. The frequency of a CLA is determined by the longest carry-out bit computation. The disadvantage of the CLA is its rapidly increasing complexity as the number of bits grows, so a multi-level CLA is used to build larger adders, whose frequency is determined by the carry-out computation delay across all the levels. For instance, a 64-bit adder can be built from four parallelized 16-bit CLAs, which matches the segment partition introduced above; the delay of the 64-bit CLA (partitioned into 4 segments) is dominated by the carry-out computation delay in the 16-bit CLAs. The case is similar for other pipelined or parallelized units. As can be seen, the functional unit's frequency is closely tied to the critical path delay of each pipeline stage or parallelized block, which is the partitioned segment in this study.
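The width test itself reduces to checking whether a 64-bit value sign-extends from its low 16 bits, as in this sketch:

    def is_narrow(value_64):
        """True when the upper 48 bits are all zeros or all ones."""
        upper = (value_64 >> 16) & ((1 << 48) - 1)
        return upper == 0 or upper == (1 << 48) - 1

    print(is_narrow(0x0000000000001F2A))  # True: zero-extended 16-bit value
    print(is_narrow(0xFFFFFFFFFFFF8000))  # True: sign-extended negative value
    print(is_narrow(0x0000000100000000))  # False: needs the full width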


Due to the effect of PV, the critical path delay varies across the segments, so narrow-width operations should not be assigned to segments randomly, without considering the interaction between NBTI and PV. For example, the benefit of narrow-width operations for NBTI guardband reduction is nullified if the operation is always performed on the segment with the longest delay, which produces more (+ΔVth & ΔVth_high) cases; even if the other segments achieve high NBTI mitigation, the result is equivalent to the case without narrow-width detection, since the guardband is determined by the worst-case delay. In this study, I propose optimization technique 2 (O2), which steers each narrow-width operation to the fastest segment. In general, a functional unit is more resilient to PV than the RF, because its critical path is longer and the delay difference among the segments is usually smaller than 20% [118]. This differs from the AFPs in the RF: an absolute fast segment is usually nonexistent, and the initially fastest segment would become the bottleneck for guardband reduction if it were utilized continuously. Online detection of the aggregated effect of NBTI and PV is therefore required to guide the migration of narrow-width operations to the currently fastest segment. IDDQ, the standby leakage current of a circuit, can be applied to detect this effect. IDDQ was originally used for testing manufacturing faults [132], and IDDQ values can reveal the underlying parameter variations [133]. Recently, [142] discovered that IDDQ can also be applied to NBTI degradation detection, because the leakage current decreases exponentially as the transistor Vth increases. Therefore, IDDQ can capture both the static and the dynamic variations in Vth. In this study, the segment with the highest IDDQ is the fastest one and is selected for the narrow-width operation.
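The steering decision itself is simple, as the sketch below shows: the segment with the highest periodic IDDQ sample computes the narrow-width operation while the others receive the recovery vector. The sample readings are invented.

    iddq_ua = [41.2, 55.7, 48.3, 39.9]   # hypothetical per-segment IDDQ samples (uA)

    def steer(samples):
        fastest = max(range(len(samples)), key=lambda s: samples[s])
        roles = ["recover"] * len(samples)
        roles[fastest] = "compute"       # narrow-width operation runs here
        return roles

    print(steer(iddq_ua))   # ['recover', 'compute', 'recover', 'recover']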


Figure 7-8 shows the hardware implementation supporting O2. The narrow-width value detection occurs after a result is computed: the 48 most significant bits are checked in parallel to determine whether they are all 1s (one-detector) or all 0s (zero-detector), indicating that the operand is a narrow-width value. One bit, the narrow-width record bit (NWR), is added to each RF entry to record whether the value is narrow. When the two operands are read out of the RF and written into the latch, the NWR is checked to determine whether a narrow-width operation can be performed; if the value is narrow, the highest 48 bits are not latched and are written directly into the result lines. For each operand, 4 MUXes are added between the latch and the four segments to select each segment's input, choosing between the NBTI recovery patterns (shown as the special input in Figure 7-8) and the real value (shown as A or B in Figure 7-8); a total of 8 MUXes are thus used for the two operands. If this is a narrow-width operation, 4 copies of the 16-bit value (a total of 8 copies for the two operands) are sent to the MUXes; otherwise, the normal value is used as the input. Since the 16-bit operation may cause an overflow, 4 carry-out lines are added at the output. In O2, the IDDQ testing is performed in each segment periodically and the testing current is sent to a 4-input comparator. The comparison output determines which segment is selected for the narrow-width operation; that segment's two inputs will be the 16-bit values, and the other segments are fed the recovery vector. Another 4 MUXes are added at the output of the comparator before the comparison result is sent to those 8 MUXes for input selection, because the comparison output should be masked if the current operation is not narrow-width, in which case all of the inputs should be the real values instead of the NBTI recovery vector. The signal "select the real value" is multiplexed with the comparison output, and the signal "narrow-width operation" determines which of the two is sent to the 8 MUXes. Similarly, this signal is sent to the output of each segment to decide whose computation result is valid to launch onto the result lines. The IDDQ testing circuit, which mirrors the circuit IDDQ ("in" in Figure 7-8) to Mn through Mm, is also shown.

PAGE 166
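The zero-/one-detector logic amounts to the following check (a minimal sketch assuming 64-bit operands with a 16-bit narrow-width threshold, as in the text):

    def is_narrow_width(value, width=64, low=16):
        # A value is narrow width when its 48 most significant bits are
        # all 0s (zero-detector) or all 1s (one-detector).
        mask = (1 << (width - low)) - 1
        upper = (value >> low) & mask
        return upper == 0 or upper == mask

    assert is_narrow_width(42)                    # small positive value
    assert is_narrow_width((-7) & (2**64 - 1))    # small negative value
    assert not is_narrow_width(1 << 20)           # needs more than 16 bits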

The circuit of the IDDQ testing, which mirrors the circuit IDDQ (in in Figure 7-8) through transistors Mn to Mm, is also shown; the analog voltage signal (out in Figure 7-8) reflects the changes in circuit IDDQ. Note that the IDDQ testing and the comparator are not in the critical path, and they do not introduce any extra delay into the cycle time. As shown in Figure 7-8, O2 only introduces the MUX and the zero detection into the critical path; considering the comparatively long execution path in the functional unit, their effect on the cycle time is negligible. Moreover, their area and power overhead is around 1%.

PV-Aware NBTI Mitigation for Storage Cell Based Structures

In this section, the PV-aware NBTI mitigation technique is proposed for the cache, which is the representative storage cell based structure. PV exhibits both random and systematic effects. Due to the systematic effects, transistors share similar parameters with other (e.g. nearby) transistors. These transistor groups define an area in which the transistors exhibit similar behavior. Since the parameter variation between two transistors becomes larger as the distance between them increases [143], transistors far away from this area will exhibit different behavior; if they share another parameter with the transistors around them, those transistors can be classified into another area. Figure 7-9 shows the Vth variation map for a cache. As can be seen, the Vth variation in the cache is not entirely random. Since the cache occupies a large portion of the chip area, transistor Vth can be generally high or low in some areas of the cache, and areas with similar Vth can easily be found. For other structures, such as the RF and functional units, which occupy a small area of the chip, the critical path variation is mainly caused by the random effect, since a similar systematic effect applies across the entire structure. Therefore, although the critical paths in the structure are physically very close, they still vary in path delay.

It is well known that body biasing (BB) is an efficient method for PV mitigation. However, it must be applied at the structure level; a finer granularity is not achievable with BB technology [119]. Usually, a cache is assigned one BB generator, and a uniform voltage biasing is applied in all areas, whether they have high or low Vth.
The amount of BB applied is determined by the worst case across the entire cache. [127] proposes an NBTI recovery mechanism for cache structures that invalidates 50% of the cache lines and uses them to store the inverted values. However, keeping half of the cache invalidated increases the cache miss rate and degrades performance, especially on applications with high cache utilization. When the BB technology [119] is combined with the NBTI recovery approach [127], the guardband is reduced significantly. Note that areas with low initial Vth can tolerate more NBTI degradation and, as long as the final Vth does not exceed that in the areas with high initial Vth due to PV, the strict cache line inversion percentage (e.g. 50%) can be appropriately relaxed in those areas. Doing so reduces the number of invalidated cache lines, which decreases the cache miss rate and the performance loss, leading to an improvement in the technique's efficiency for NBTI and PV mitigation in terms of performance, power, and chip lifetime.

Based on the above observation, I propose O3, which takes advantage of the systematic effect of PV in guardband reduction while maintaining performance. I apply adaptive body biasing (ABB) in O3 to mitigate the PV effect. First, O3 partitions the cache into several areas according to the similarity of the transistors' Vth. Each area has its individual inversion percentage (areas with lower Vth are assigned a lower inversion percentage, corresponding to a smaller number of invalidated and inverted cache lines). The percentage is estimated based on the difference between the highest Vth in the cache and that in the area. Similar to the proposal in [127], the valid/state bits are used to indicate whether a cache line is valid and non-inverted, or invalid and inverted. A counter is used in each area to count the number of inverted cache lines. Once it falls below the pre-defined threshold, one LRU cache line is invalidated and written with the inverted value.
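The calibration step can be sketched as follows; the linear mapping from Vth headroom to inversion relaxation is an assumption made here purely for illustration (the actual relation comes from the NBTI model characterization):

    def area_inversion_percentages(area_vth, full_pct=0.50, headroom_for_zero=0.10):
        # area_vth: representative (highest) Vth per area, in volts.
        # An area already at the cache-wide maximum keeps the full 50%
        # inversion; areas with more headroom are relaxed proportionally.
        vth_max = max(area_vth)
        pcts = []
        for vth in area_vth:
            headroom = vth_max - vth
            relax = min(headroom / headroom_for_zero, 1.0)
            pcts.append(full_pct * (1.0 - relax))
        return pcts

    print(area_inversion_percentages([0.30, 0.25, 0.28]))  # approx. [0.5, 0.25, 0.4]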

Since different cache ways are implemented close to one another, PV exhibits a stronger systematic effect in the horizontal direction than in the vertical direction [144]; the cache area is therefore partitioned at the set level. However, the partition granularity must be chosen carefully. If it is too small, fewer cache lines can be chosen from each area for inversion, and it becomes more difficult to match the required inversion percentage to a concrete inversion number; in addition, a large number of counters are required for the inversion percentage control, which causes a higher area overhead. On the other hand, if the granularity is too large, the systematic effect cannot be efficiently exploited. For example, when the granularity is set to the entire cache, O3 becomes the same as combining BB with the technique proposed in [127]. I perform a sensitivity analysis in the Evaluation section and choose a granularity of 8 sets. Figure 7-10 illustrates the idea of O3 in the 4-way L1 cache; the cache lines shown in gray represent the invalidated and inverted lines.

Evaluation

In this section, I evaluate the three techniques proposed in this chapter. I model the dynamic NBTI degradation in Vth by applying the reaction-diffusion (R-D) model proposed in [145]: the PMOS stress and recovery cycles are obtained via the microarchitectural simulator, and the signal probability is computed and inserted into the model to determine the shift in Vth due to NBTI. I use the same architecture-level evaluation methodology described in Chapter 6 for the technique evaluation. Since both NBTI and PV effects are addressed in our study, I extend the NBTI_efficiency metric to NBTI&PV_efficiency (Equation 7-2), which quantifies the technique's efficiency with respect to both NBTI and PV. Correspondingly, the NBTI+PV guardband is named the NBTI&PV_guardband:

    NBTI&PV_efficiency = (Delay x (1 + NBTI&PV_guardband))^3 x TDP    (7-2)
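As a concrete check of the metric (assuming the cubic form of Equation 7-2 as written above; lower values are better, and 1 means no NBTI or PV effect):

    def nbti_pv_efficiency(delay, guardband, tdp):
        # Equation 7-2: slowdown and guardband are penalized cubically,
        # power (TDP) linearly.
        return (delay * (1.0 + guardband)) ** 3 * tdp

    print(nbti_pv_efficiency(1.0, 0.0, 1.0))  # ideal case -> 1.0
    print(nbti_pv_efficiency(1.0, 0.5, 1.0))  # 50% combined guardband -> 3.375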

Effectiveness of O1

O1 is compared with the baseline case without any optimization. I also compare against the technique that combines 70% VL with port switching (PS) and the NBTI mitigation technique that inserts a special input vector (SIV) during idle time (defined as VL+PS+SIV). Figure 7-11 (a)-(c) presents the CPI, NBTI guardband, and NBTI&PV_efficiency of the three cases in the RF. The CPI and NBTI guardband are normalized to the baseline case. The TDP of VL+PS+SIV and O1 is 1.02 and 1.03, respectively, due to the area overhead. As shown in Figure 7-11 (a), CPI increases under both NBTI&PV mitigation techniques because the use of slow read ports cannot be eliminated: when they are selected for an RF read operation, pipeline stalls occur and degrade performance. However, the performance penalty is negligible in some applications (e.g. equake, mcf) because they run in low-IPC phases most of the time and the pipeline stalls are tolerated by the low bandwidth requirement. One may notice that O1 increases the CPI by 2% compared to VL+PS+SIV. This happens because slow ports are intentionally chosen for read operations when the IPC is low in order to reduce the PFP utilization; when the IPC information obtained from the last phase generates an incorrect prediction, a slow port is selected by mistake, which causes performance loss. Even though O1 slightly increases the CPI, it gains a significant NBTI guardband reduction. As Figure 7-11 (b) shows, on average, O1 reduces the NBTI guardband by 35% and 36% compared to the baseline case and VL+PS+SIV, respectively. Interestingly, VL+PS+SIV exacerbates the NBTI degradation compared to the baseline case, because fast ports are used aggressively in VL+PS+SIV and they must accept the utilization migrating from the slow ports. Meanwhile, the SIV does not help reduce the NBTI degradation in the read ports, since a port switches to the recovery mode automatically when it is free; additionally, the positive effect of SIV on the decoder line is not strong enough to combat the negative effect. I forgo a presentation of the NBTI&PV guardband, which is equal to the sum of the NBTI and PV guardbands.

In the baseline case, averaged across all the simulated chips, the PV guardband is set to 0.3; applying the VL technique improves the frequency by 20% and reduces the PV guardband to 0.1. Figure 7-11 (c) shows that O1 greatly reduces the NBTI&PV_efficiency. It reduces the efficiency by as much as 1.00 relative to the baseline case, which implies a 100% efficiency improvement, since the best possible technique has an efficiency of 1 (no PV and NBTI effect). Moreover, O1 exhibits a much stronger ability than VL+PS+SIV in addressing NBTI and PV, achieving a 30% improvement in NBTI&PV_efficiency.

Effectiveness of O2

O2 is compared with the baseline case, the NBTI mitigation technique SIV, and the technique that applies SIV and takes narrow-width operations into consideration (defined as SIV+NW). Since the VL technique [118] is orthogonal to the above methodologies, the discussion of their combination with VL is skipped. Figure 7-12 (a)-(b) presents the NBTI&PV_guardband, which is normalized to the baseline case, and the efficiency of the four cases in the integer ALU. CPI is not shown in the figure, since the effect on performance is negligible. The TDP is 1.01 in SIV+NW and O2, and 1 in SIV and the baseline case. I show the results for the IntALU because most narrow-width operations are integer arithmetic and logic operations; it would not be fair to judge the efficiency of the techniques in functional units (e.g. the FPU) that see few narrow-width operations. As Figure 7-12 shows, compared to the baseline case, on average across all the benchmarks, SIV reduces the guardband by 28%. It gains less reduction than that reported in [127] (63%) because I focus on the IntALU, which performs both arithmetic and logic operations and has less idle time than the adder studied in [127]. O2 exhibits a much stronger capability in guardband reduction, 55% and 59% in the INT and FP benchmarks respectively, and as a result improves the efficiency by 73% and 76% in the two benchmark categories. Compared to SIV+NW, which blindly assigns the narrow-width operations inside the unit, O2 decreases the guardband by 15% and 12% in the INT and FP benchmarks, contributing to 18% and 13% efficiency improvements over SIV+NW.

Effectiveness of O3

Figure 7-13 (a)-(b) shows the normalized CPI and NBTI&PV_efficiency generated by the baseline case, the technique applying ABB with cache line inversion (defined as ABB+CLI), and O3. Since the NBTI and PV problem can be easily solved in the L2 cache by implementing periodic inversion [146], I focus the study on the L1 data cache. Note that HotLeakage is used to evaluate the power overhead caused by ABB. As can be seen, ABB+CLI has a negligible CPI impact on some benchmarks (e.g. lucas, mcf) because of frequent L2 cache misses: an L1 miss latency caused by the cache line inversion is covered by the L2 miss that occurs simultaneously. However, it degrades performance significantly on benchmarks with low L2 cache miss rates. O3 solves this problem, since it efficiently utilizes the L1 resources. For example, O3 improves performance by 19% in eon and 8% in mesa. As shown in Figure 7-13 (a), O3 obtains CPI results similar to the baseline case. It improves the NBTI&PV_efficiency by 13% compared to ABB+CLI. Figure 7-14 describes the NBTI&PV_efficiency obtained by O3 as the granularity varies from a single set to the entire cache. I perform the analysis on benchmarks (e.g. eon, vpr) that are sensitive to the ABB+CLI technique. As expected, the performance loss is high when the granularity is extremely small or large. An 8-set granularity achieves the best efficiency; it is chosen in the O3 implementation, but it requires an extra cache line and 16 counters, which results in 1% additional area overhead.

NBTI&PV Efficiency for the Entire Chip

In order to evaluate the effectiveness of the three proposed techniques on the entire chip, I compute the NBTI&PV_efficiency of the processor following the equations proposed in [127], based on each structure's Delay, NBTI&PV_guardband, and TDP generated by our techniques.

On average, I obtain an efficiency of 2.20 for the entire chip. In the baseline case without any optimization, the chip NBTI&PV_efficiency goes up to 3.375. As can be seen, the proposed techniques improve the efficiency by 117%. The effectiveness of simply combining PV and NBTI mitigation techniques is evaluated for comparison; its NBTI&PV_efficiency is 2.41, and my technique outperforms it by 21%.

Related Work

There have been several studies on NBTI modeling and mitigation at both the circuit and microarchitectural levels. The reaction-diffusion (R-D) model has been widely used to model the NBTI degradation and recovery effects [147, 148]. [145] recently considered temperature variation in the NBTI model. The impact of NBTI on the performance of combinational circuits is investigated in [149], which shows that NBTI degradation is sensitive to the input patterns and the stress time. In addition, the NBTI effect on the SRAM array is modeled and studied in [150], where it is shown that the read stability degrades due to NBTI and that the degradation is exacerbated in the presence of PV. To mitigate combinational circuit aging under NBTI, adaptive body biasing (ABB) is applied in NBTI-resilient circuits [151]. [157] proposes to identify the critical gates that are most important for timing degradation and protect them from NBTI. To improve storage cell reliability under NBTI, [152] proposes a new memory cell design consisting of a number of NAND gates instead of inverters to reduce the average degradation on each PMOS. Periodic inversion [146] is proposed to flip the contents of all cells periodically, keeping the balance between 0s and 1s in each cell; it is an efficient way to mitigate NBTI in storage cells, but the extra flipping delay in the critical path causes a 10% frequency loss. [153] improves cache reliability under NBTI by proposing proactive use of microarchitectural redundancy, in which two components operate either in active mode or in recovery mode, periodically transitioning between the two modes according to a recovery schedule.

The combined effect of PV and NBTI has been modeled and analyzed in [154, 155]. Moreover, [158] proposes online PV and NBTI detection in logic circuits and applies ABB to tolerate the Vth variations. [156] proposed a technique called Razor to tune the supply voltage by monitoring the error rate caused by PV and NBTI during circuit operation, thereby eliminating the need for voltage margins; Razor mainly targets combinational logic. In our study, we target the mitigation of the NBTI and PV effects in both combinational circuits and storage cell based structures with desirable trade-offs among performance, reliability, and power. To my knowledge, this is the first work taking advantage of the interplay between PV and NBTI to efficiently address the variation problem caused by NBTI and PV.

[Figure: bar diagram contrasting a conservative NBTI+PV guardband with an optimized guardband that considers the interaction between NBTI and PV.]
Figure 7-1. Different guardband settings for tolerating NBTI

[Figure: the same comparison extended with PV-only mitigation, NBTI-only mitigation, and a simple combination of NBTI and PV mitigation techniques.]
Figure 7-2. The limitation of simply combining NBTI and PV mitigation techniques

[Figure: SRAM array with decoders, precharge, bitlines, wordlines, one write port, and two read ports (A and B), with a zoomed view of one bit cell.]
Figure 7-3. 2-read-port register file with detailed read port design

[Figure: port delay versus cycle time under PV and under NBTI&PV, for the baseline case and after VL/O1, with slow ports, PFP, AFP, and fast ports marked.]
Figure 7-4. Cycle time under NBTI and PV effects. A) Baseline case without optimization. B) O1.

    every cycle {
        update IPC every 100 cycles;
        if (last interval IPC <= 1) {
            switch from PFP to slow ports;
        } else if (AFP is available for switch) {
            switch from PFP or slow ports to AFP;
        } else if (slow ports are unavoidable) {
            switch from PFP to slow ports;
        } else {
            no port switching;
        }
    }

Figure 7-5. Pseudo code for port switching in O1

[Figure: read-port assignment example over registers R1-R7 marked as PFP, AFP, or slow (S). For the instruction sequence (1) ADD R5, R1, R3; (2) AND R7, R4, R5; (3) SUB R6, R2, R6, the port switching decisions are: (1) if last_interval_IPC <= 1, PS(R1, R3) switches from PFP to slow ports, else no PS; (2) PS(R4, R5) switches from PFP to AFP when an AFP is available; (3) PS(R2, R6) switches from PFP to slow ports when a slow port is unavoidable, even in a high-performance phase.]
Figure 7-6. Examples of PS in O1

[Figure: truth table and wordline bit patterns of the decoder.]
Figure 7-7. An example of a 2-to-4 decoder

[Figure: O2 datapath with per-operand high/low latches, zero detectors, NWR bits from the RF, 8 input MUXs selecting between the 16-bit values and the NBTI special inputs, per-segment IDDQ testing, 17-bit (1 carry-out) segment outputs, and the comparator that picks the fastest segment; the IDDQ test circuit mirrors the unit's IDDQ from the "in" node through transistors Mn and Mm and produces the analog "out" signal.]
Figure 7-8. O2 circuit design

[Figure: spatial map of threshold-voltage variation across the cache array.]
Figure 7-9. Vth (in mV) variation map for a cache

[Figure: a 4-way cache (Way 0 to Way 3) with per-area inversion counters and a shared body biasing generator driving the PMOS and NMOS bias voltages.]
Figure 7-10. Fundamental idea of O3 in the 4-way L1 cache

[Figure: per-benchmark bars for Baseline, VL+PS+SIV, and O1 across the SPEC CPU2000 applications, with averages.]
Figure 7-11. The effectiveness of O1 in RF. A) Normalized CPI. B) Normalized NBTI guardband. C) Normalized NBTI&PV_efficiency

[Figure: per-benchmark bars for Baseline, SIV, SIV+NW, and O2, separated into integer and floating point benchmarks, with averages.]
Figure 7-12. The effectiveness of O2 in IntALU. A) Normalized NBTI&PV_guardband. B) Normalized NBTI&PV_efficiency

[Figure: per-benchmark bars for Baseline, ABB+CLI, and O3, with averages.]
Figure 7-13. The effectiveness of O3 in L1 cache. A) Normalized CPI. B) Normalized NBTI&PV_efficiency

[Figure: normalized CPI as the number of sets per area varies from 1 to 32 for the granularity-sensitive benchmarks.]
Figure 7-14. NBTI&PV_efficiency with various granularities

CHAPTER 8
ARCHITECTING RELIABLE NETWORK-ON-CHIP MICROARCHITECTURE IN LIGHT OF SCALED TRANSISTOR PROCESSING TECHNOLOGY

The trend towards multi-/many-core processor design has made a scalable and high-bandwidth on-chip communication fabric that connects these cores critically important. The packet-switched network-on-chip (NoC) [159] is emerging as the pervasive design paradigm for multi-core communication fabrics. With the continuous down-scaling of CMOS processing technologies, reliability and variability (e.g. process variation (PV) and negative bias temperature instability (NBTI)) are becoming primary targets in NoC design [160]. Although PV and NBTI can be addressed at the device or circuit levels, such solutions are costly in terms of area and power and exhibit poor scalability. Several architecture and system level techniques [111-115, 118, 127, 153, 161, 162] have been proposed to mitigate the NBTI and PV effects on processor operations and achieve a lower guardband; as a result, the frequency loss and power consumption reserved for the guardband can be reduced. These techniques focus on processor cores and the memory hierarchy, however, and largely ignore the emerging NoC architectures, whose design is significantly different from processor architectures. Ignoring the reliability of the NoC can turn it into a potential reliability bottleneck of multi-/many-core architectures. In NoC architectures, ultra-low latency designs are desired, since shared-memory workloads are highly sensitive to the interconnect latency [163]. In addition, power management is also critical in a NoC: the literature [164, 165, 166] reports the interconnect power consumption at approximately 30% to 40% of total chip power consumption. Since NBTI and PV affect both NoC delay and power, it is imperative to address these challenges at the NoC architecture design stage to ensure its efficiency as the underlying CMOS fabrication technologies continue to scale [167].

As introduced in Chapter 7, PMOS transistor wear-out caused by NBTI is aggravated in the presence of process variation. The guardband considering both NBTI and PV will be an additive result of the guardbands for the two effects.

In the upcoming nano-scale transistor design era, NoC design techniques that can effectively address the combined PV and NBTI effects are needed. In this chapter, I target NoC architectures and propose novel circuit, router microarchitecture, and inter-router reliability enhancements in the presence of process variation and NBTI, while achieving attractive trade-offs among NoC performance, power, guardbands for reliability, and area overhead. Instead of using a flat, centralized method (as in [161]), I propose three techniques that hierarchically mitigate the PV and NBTI effects in NoC at both the inter-router (high) and intra-router (low) levels.

Packet-Switched NoC Router Microarchitecture

Peh and Dally [168] proposed the canonical NoC virtual channel router microarchitecture. The router is input-queued and has P input and P output ports, where P is usually set to five: four of them connect to the four cardinal directions (North, East, West, and South) and one connects to the local processing element. The router microarchitecture is composed of the following key elements: the virtual channel (VC) FIFOs, the routing computation unit, the VC allocation logic, the crossbar allocation logic, and the crossbar itself. There are four stages in the router pipeline. When a flit enters the router through one of the input ports, it is stored in the VC buffer that has been reserved by the upstream node. If the flit is a header (the first flit of a new packet), it proceeds to the route computation (RC) stage, in which the routing computation unit determines the output port for this new packet. The following pipeline stage is VC allocation (VA), during which the VC allocation logic attempts to assign a free VC in the next hop to the header flit. In the following cycle, if VC allocation is successful, the flit enters the switch allocation (XA) stage, during which it competes with other flits from the router for the output port. The data and tail flits belonging to the same packet as the header flit can skip the RC and VA stages and proceed directly to XA. Once the link between the input port and its output port is built up, the flit enters the crossbar traversal (XB) stage and is sent to the next hop.
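The stage sequencing can be restated compactly (illustrative only; the stage names follow the text):

    HEADER, BODY, TAIL = range(3)

    def stages_for(flit_type):
        # Header flits walk all four stages; body and tail flits inherit
        # the route and output VC of their packet and skip RC and VA.
        if flit_type == HEADER:
            return ["RC", "VA", "XA", "XB"]
        return ["XA", "XB"]

    assert stages_for(HEADER) == ["RC", "VA", "XA", "XB"]
    assert stages_for(BODY) == stages_for(TAIL) == ["XA", "XB"]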

To reduce router latency and improve performance, researchers have proposed a two-stage router pipeline, which incorporates route look-ahead [169] and path speculation [168]. The first technique performs routing one hop in advance. The second technique speculates that the waiting packet will successfully obtain the output VC from the VC allocation logic and parallelizes the switch allocation and virtual channel allocation stages; if both allocation requests are granted, the latency of switch arbitration is absorbed. The router pipeline can be further reduced to a single stage [170] by performing additional speculation, but this incurs more mis-speculations at high loads and causes a one-cycle penalty. Prior studies [171, 172] have also proposed adaptive routing for NoC design. Compared with dimension-order routing (DOR), adaptive routing schemes can achieve better fault tolerance and congestion avoidance. In this study, we consider a two-stage adaptive router microarchitecture similar to that proposed by Kim et al. [172]; I opt to use this design since both reliability and performance are considered in NoC design. Figure 8-1 shows the baseline adaptive router microarchitecture and pipeline. Adaptive routing requires extra logic to collect congestion statistics, which is used to pre-select the preferred output port (the port with the least congestion) for each packet one cycle in advance [172].

A Hierarchical Design of NoC That Mitigates the Effects of NBTI and PV

In this section, I propose a hierarchical design of NoC that efficiently mitigates the impact of NBTI and PV while leveraging the benign interaction between them. The hierarchical design operates at both the intra-router (low) and inter-router (high) levels.

Intra-Router Level NoC PV and NBTI Optimization

The key components of the pipelined router can be classified into combinational-logic structures (e.g. the VC allocation logic and the switch allocation unit) and storage-cell structures (e.g. the virtual channels).

My intra-router NBTI and PV optimization schemes target both types of structures.

NBTI and PV optimization for NoC combinational-logic structures

Prior studies [168, 173] have shown that, compared to the other router pipeline stages (e.g. XB and LT), the delay in the VA and XA stages largely determines the frequency of the canonical router microarchitecture. I thus focus the study on the virtual channel allocation logic and the switch allocation unit as the representative combinational-logic structures in a pipelined router. Figure 8-2(a) illustrates the detailed circuit design of the virtual channels, and Figure 8-2(b) shows a zoomed-in view of the VA logic of the two-stage adaptive router [172]. A packet in a minimally routed 2D mesh can only proceed to two of the four quadrants (i.e. NE, NW, SE, and SW). The RC unit determines the quadrant to which the packet should travel based on its destination direction, and the pre-selection function selects one output port for the quadrant depending on the congestion information; the VA and XA are then performed for the selected output port. In order to support adaptive routing, the VCs are partitioned into four sets. Each VC set is assigned to one quadrant and is used to collect the flits routed to this quadrant. There are three groups of VCs within each set to accept flits from the local processing element (PE) as well as from the two other directions that are not included in this quadrant. For example, the VC set assigned to the NW quadrant can only accept flits from the PE and the E and S directions. Note that a packet whose destination is the local PE is injected into the PE directly without going through the VA, XA, and XB stages. In the adaptive router, the VA stage operates in two steps. The first step assigns a free output VC (a VC at the downstream node) to each request from the VCs in the four sets; it requires one arbiter per VC in the sets, and a total of 12v arbiters is needed (there are four sets, each set has three groups of VCs, and v stands for the number of virtual channels in a group). The second step produces a winner for each output VC among all the competing VCs in the sets; correspondingly, one arbiter is required for each output VC, and the total number of arbiters required for the second step is 8v.
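A quick arithmetic check of these counts (assuming v virtual channels per group, with v = 4 in the evaluation configuration used later; the 12v and 8v figures are my reading of the counts above):

    v = 4                                        # virtual channels per group
    sets, groups_per_set = 4, 3                  # one VC set per quadrant
    step1_arbiters = sets * groups_per_set * v   # one per input VC: 12v = 48
    ports, reachable_sets = 4, 2                 # minimal routing: 2 quadrants per output port
    step2_arbiters = ports * reachable_sets * v  # one per output VC: 8v = 32
    print(step1_arbiters, step2_arbiters)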

Since the VA logic consists of multiple arbiters in parallel, the VA delay is sensitive to the PV effect [173]. Therefore, mitigating the NBTI and PV effects in the VA stage will allow a lower guardband and directly boost the router frequency. As shown in Figure 8-2 (b), each VC entry has a private arbiter in the first VA step. Since there are multiple parallel arbiters in the first VA step, they exhibit various delays due to the PV effect, and the slowest critical path under the impact of PV determines the delay of the first VA step. As an NBTI recovery method, the special input values [127] can be inserted into an arbiter when there is no request signal coming from the associated VC entry (i.e. the arbiter is idle). However, it is possible that the slowest arbiter is frequently utilized, losing the opportunity for NBTI recovery. This will lead to a longer VA delay, since the NBTI degradation on the slowest arbiter is not efficiently recovered even when the NBTI optimization technique is applied. In this study, I propose NBTI and PV mitigation technique 1 at the VA stage (VA_M1), which assigns higher utilization to the faster arbiters and inserts the special input values into idle arbiters. By doing so, the slowest arbiters obtain more idle time to perform NBTI recovery: the fast arbiters absorb more NBTI degradation while the slowest arbiters are protected from the impact of NBTI to the greatest degree, and therefore the guardband decreases. Due to its overhead, the adaptive body biasing (ABB) technique [119] is not suitable for mitigating the PV effect at fine granularity; in this study, I apply the ABB technique in a chip-wide manner to tolerate the PV effect across the entire chip. As mentioned earlier, the first VA step selects a free output VC entry from a specific VC group at the downstream node. Since arbiter utilization depends on the requests from the associated VC, instead of blocking the requests to the arbiter, I prefer to avoid using the VC even when it is free.

If there is no flit stored in the VC, no request is sent out; in other words, the arbiter gets the opportunity to recover from NBTI degradation if its associated VC is idle. Figure 8-3 shows the circuit of the first VA step (only one VC entry is shown). As can be seen, when a flit at a local VC sends an output VC allocation request to its private arbiter, the request is fanned out to v AND gates (v stands for the number of virtual channels in a group) and ANDed with each VC status (idle or occupied) at the downstream node. The arbiter only accepts the fan-out request signals for downstream VCs whose status is idle. At the downstream node, when the VC associated with the slowest arbiter is free and there is more than one free VC in the group, VA_M1 marks it as occupied before a credit describing its status is sent back to the upstream node. Consequently, other free VCs in the same group that link to faster arbiters will be used to store the flit. In the case where there is only one free VC, which is exactly the one linked to the slowest arbiter, VA_M1 marks it as free in order to maintain performance. Since VCs are not fully occupied during most of the router service period, VA_M1 has substantial opportunities to migrate the NBTI degradation from the slowest arbiter to faster ones. In addition, because the arbiters use a least-recently-served priority scheme [170], every faster arbiter receives an even share of the NBTI degradation migrated from the slowest one. Therefore, the NBTI degradation will not accumulate on a specific fast arbiter, which could otherwise eventually become the new bottleneck for guardband reduction.
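The credit-marking policy of VA_M1 can be sketched as follows (a minimal model; the IDDQ readings and helper names are hypothetical inputs):

    def vc_status_credits(iddq, vc_free):
        # iddq[i]: latest IDDQ reading of the arbiter tied to VC i
        # vc_free[i]: True if VC i is actually free
        credits = list(vc_free)                 # what is advertised upstream
        free_vcs = [i for i, f in enumerate(vc_free) if f]
        if len(free_vcs) > 1:
            # Mark the free VC tied to the slowest (lowest-IDDQ) arbiter
            # as occupied so faster arbiters absorb the utilization.
            slowest = min(free_vcs, key=lambda i: iddq[i])
            credits[slowest] = False
        # With a single free VC, it stays advertised to preserve performance.
        return credits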

Figure 8-4 presents the implementation of VA_M1. An IDDQ detector is attached to each arbiter to perform online detection of the aggregated NBTI and PV effect on it. In this study, I assume the critical paths within each arbiter have similar delays due to the systematic effect of PV; therefore, one detector per arbiter can describe the NBTI and PV effect in the arbiter well, and the arbiter with the lowest IDDQ value is the slowest one. The detector operates periodically and sends the testing current to a v-input NMOS analog voltage comparator [174] that determines which VC should be marked as occupied. The comparison result is written into the VC status table (a hardware component that already exists in a typical flow-control-based router), which reflects the availability of each VC in the router. Note that the IDDQ detector and comparator operate concurrently with the VA stage, and the backward credit representing a VC status does not need to wait for the latest VC status before being sent out; therefore, no extra delay is introduced into the router pipeline. When the special input values are inserted into an arbiter, invalid grant signals are generated; in order to block them from entering the second VA step, an invalid port number is sent to the DEMUX at the arbiter output. Note that VA consists of a two-step arbitration, and VA_M1 would lose its efficiency if the arbitration delay variation in the first step were not critical to the VA delay. Upon successful operation in the first step, the grant signal is fanned out to the arbiters in the second step. Therefore, the delay variation in the first step cannot be tolerated by the second step: the slow arbiters in the first step will affect the entire VA delay. Moreover, VA_M1 mitigates the NBTI degradation within the first step and does not introduce any extra NBTI effect into the second step. In order to mitigate the NBTI and PV effects in the second VA step, VA_M1 inserts the special input values when the arbiter is idle.

The switch allocation unit in the XA stage, which is another combinational-logic structure, is sensitive to the effects of PV and NBTI as well [173]. Since every flit of a packet has to experience the XA stage while only the head flit requires virtual channel allocation, the switch allocation unit is frequently used and suffers more NBTI degradation than the virtual channel arbiters. Therefore, focusing only on PV and NBTI mitigation at the VA stage would simply create another frequency bottleneck at the XA stage. As a widely used mechanism for dynamically tuning circuit delay, Adaptive Voltage Scaling (AVS) [162] is applied to the switch allocation units in our study.

By increasing the supply voltage of the combinational-logic blocks, the critical path delay can be effectively reduced. However, a higher supply voltage results in larger power consumption (and higher temperature as well), which also increases the transistor aging rate (i.e. the transistors will have shorter lifetimes). Generally, the XA delay is shorter than the VA delay at the beginning of the service time, when PV is the major effect on circuit delay [168, 173]. In this study, I propose to trigger the higher supply voltage (AVS+) when the XA delay becomes the limit on the router frequency (ideally, the point at which the XA delay exceeds the VA delay at the end of the lifetime, after VA_M1 is applied). By doing so, not only is the power consumption significantly reduced, but the negative effect on aging rate is also efficiently alleviated (applying AVS+ towards the end of the lifetime has little effect on aging rate [162]). Figure 8-5 (a)-(b) shows the guardband and power benefit of applying AVS+ during the lifetime. As Figure 8-5(a) shows, after integrating the AVS+ technique with VA_M1, the guardband decreases significantly compared to the case in which only VA_M1 is applied. In addition, the power overhead is reduced substantially when AVS+ is enabled later in the service time instead of at the very beginning. The combined VA_M1 and AVS+ technique is named VA_SA_M1.

Since the AVS+ technique at the XA stage is implemented simultaneously with VA_M1, one cannot observe the final effectiveness of VA_M1 on the VA delay during the service time and use it to determine the appropriate trigger time for AVS+. In order to keep the design simple, I enable AVS+ when the XA delay is longer than a pre-set threshold delay. Note that the threshold is supposed to be shorter than the final VA delay, and less power is consumed when it is closer to that ultimate delay; I set a conservative threshold based on the sensitivity analysis in the Evaluation section. Since a conservative threshold is applied, instead of monitoring each arbiter's delay in the switch allocation unit to obtain the accurate XA delay (and incurring a large area overhead), I apply one IDDQ detector to the whole unit for XA delay estimation.
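The trigger condition itself is simple (a sketch; the threshold is expressed here as an IDDQ reading, since a lower IDDQ implies a longer estimated XA delay):

    def avs_plus_on(xa_unit_iddq, threshold_iddq, already_on):
        # Once triggered, the higher supply voltage stays latched on.
        return already_on or (xa_unit_iddq < threshold_iddq)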

The IDDQ output is compared with the threshold for AVS+ activation. Note that the IDDQ detection and comparison occur simultaneously with the router pipeline, and no extra delay is introduced. To reduce the implementation complexity, two different power lines supplying the normal (1.0V) and higher (1.2V) voltages are used at the chip level, and a pair of PMOS transistors is inserted at the switch allocation unit in each router to handle the voltage switching. Note that the AVS+ trigger time at the switch allocation unit varies per router, according to the PV and NBTI effects on each individual router. As can be seen, the three-transistor based IDDQ detectors, the 2-to-1 MUXs, and the comparators are the major extra overhead induced by VA_SA_M1; a gate-level estimation shows that VA_SA_M1 causes around 3% area overhead to the virtual channel and switch allocation logics. In addition, as shown in Figure 8-2, the crossbar delay is dominated by the wire delay [168]. Wires are immune to NBTI degradation, and techniques reducing the wire delay are out of the scope of this paper.

NBTI and PV optimization for NoC storage cell based structures

In this section, I present an optimization technique for the VC buffers, which are the representative storage-cell structures in NoC. Recall that an efficient NBTI degradation mitigation approach for storage-cell structures is to keep 50% of the bits storing the sampled inverted value. To apply this technique in our work, the inversion is performed at flit-size granularity, which is large enough to statistically summarize the NBTI degradation in all of the VCs. To implement this technique, only half of the VC capacity can be used to store flits, and the performance loss is significant on workloads that exhibit heavy traffic. Due to the systematic component of process variation (i.e. the spatial correlation effect of within-die variation), transistors usually share similar behavior with nearby transistors.

I synthesized and generated the layout of a prototype router at 45nm processing technology; the router layout is similar to those reported in previous work [163, 175]. The methodology to model the systematic and random variations in Vth can be found in Chapter 6. Figure 8-6 presents the Vth variation map for the router. Note that it is based on the physical layout of the router instead of the conceptual router microarchitecture shown in Figure 8-2. As shown, the distance between the transistors in the same VC set (shown by the rectangle) is much shorter than that between transistors from different VC sets. As a result, the Vth of transistors within each input port exhibits similar characteristics, allowing VCs from the same port to be grouped into one area. Transistors from the same area are characterized by one uniform Vth, which is determined by the worst-case Vth (i.e. the highest Vth, because it determines the minimal voltage that has to be applied) in that area. Therefore, there are several areas in the router (one area per set) with various representative Vth values. I observe that areas with low Vth can tolerate more NBTI degradation and, as long as the final Vth does not exceed that in the area with the highest Vth, the strict inversion percentage (i.e. 50%) can be appropriately relaxed in those areas. The second mitigation technique proposed in this chapter targets the VC buffers (VC_M2) and assigns lower inversion percentages to areas with lower Vth. By doing so, more VC buffers are available to hold flits, mitigating the performance loss and achieving a good trade-off among reliability, performance, and power. Based on my NBTI modeling, I characterize the relation between the inversion percentage and the corresponding increase of Vth. Using this information together with the statistics of the highest Vth and the Vth of a given area, one can compute the inversion percentage for that area during the calibration time. Figure 8-7 (a) shows one VC set when VC_M2 is applied. In VC_M2, a recover flit is defined as a flit with inverted values; it does not contain valid data.

To implement VC_M2, two extra buffers are added to each set. One stores the sampled inverted value. The other, called the status buffer, records the status (0: free, 1: occupied) of all the flit-size buffers in the VCs: if a flit-size buffer is written by a flit or a recover flit, its corresponding status bit is set to 1; conversely, when the buffer is released, the status bit is reset to 0. Note that at the XA stage, a flit will not send out a traversal request if there is no free buffer in the allocated output VC. In general, a credit returns to the upstream node when there is at least one free buffer in the allocated VC at the current node. Note that this credit describes the availability of buffers in a specific VC; it is different from the VC status credit (discussed under the VA_M1 technique), which shows the availability of a VC. In VC_M2, when sending back the credit representing the buffer availability, a buffer holding a recover flit is also marked as occupied. However, the VC status is still considered free if the VC only contains recover flits: since a recover flit does not belong to any packet, allocating a recover-flit-occupied VC to a new packet will not cause the mixing of packets within a single VC, which is not allowed in flit-based flow control [168]. VC_M2 mainly concerns the credit of buffer availability and does not affect the credit of VC availability; therefore, it is orthogonal to VA_M1. As introduced in Section 2.1, in a typical router each VC is a FIFO-based structure [168, 171], and every flit in the VC moves from the tail to the head and finally enters the crossbar. In VC_M2, the recover flit follows the same policy. When it arrives at the head of the FIFO (as shown in Figure 8-7 (a)), it is not read out, but is overwritten by the following flit; therefore, it sends no request to the arbiters in the first VA step. As can be seen, even though a VC is re-defined as occupied in VA_M1 due to its link to the slowest arbiter, the buffers in this VC can still be used to hold recover flits, and its private arbiter stays idle during that period for NBTI recovery.

In each set, a threshold value describes the number of inverted flit-size buffers corresponding to the required inversion percentage. Meanwhile, a counter (IBCNT) is attached to the set to track the inversion number. Once the IBCNT falls below the threshold (e.g. when a recover flit arrives at the head of the FIFO and is overwritten), VC_M2 reads the status buffer and picks a free flit-size buffer to store the inverted value and maintain the inversion percentage. When the recover flit is overwritten, a buffer at the tail of the FIFO is released simultaneously; therefore, there is always at least one free buffer available for inversion (shown in Figure 8-7 (a) to (b)). As can be seen, VC_M2 operates simultaneously with the router pipeline stages and does not increase the router latency. When implemented at each set of virtual channels, VC_M2 introduces a status buffer with a bit size equal to the number of buffers in the set, a flit-size buffer to keep the inverted value, a 5-bit buffer for the inversion threshold, a 5-bit counter for the IBCNT, and simple combinational logic. In summary, the area overhead caused by VC_M2 to the virtual channels is around 3%.
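The bookkeeping of VC_M2 for one VC set can be sketched as follows (a hypothetical helper; the status buffer, threshold, and IBCNT mirror the description above):

    class InversionTracker:
        def __init__(self, num_buffers, threshold):
            self.status = [0] * num_buffers  # 0: free, 1: occupied
            self.threshold = threshold       # required number of inverted buffers
            self.ibcnt = 0                   # inverted-buffer count (IBCNT)

        def release(self, idx):
            self.status[idx] = 0             # a buffer freed at the tail of the FIFO

        def on_recover_flit_overwritten(self, head_idx):
            self.ibcnt -= 1                  # the recover flit at the head is gone
            self.status[head_idx] = 1        # the following flit now occupies it
            if self.ibcnt < self.threshold and 0 in self.status:
                free_idx = self.status.index(0)
                self.status[free_idx] = 1    # store the sampled inverted value there
                self.ibcnt += 1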

Inter-Router Level NoC PV and NBTI Optimization

In the NoC architecture, routers experience WID variations, allowing them to support different frequencies. In this work, I assume NoCs with a single frequency domain, which is determined by the slowest router, and apply the chip-wide adaptive body biasing technique [162] to mitigate the PV effect. The routers under the positive PV effect (i.e. faster routers) can be assigned higher utilization (e.g. handling a larger number of packets) while exhibiting less NBTI degradation. By doing so, the routers under the negative PV effect process fewer packets, which limits the impact of NBTI. Since the NBTI degradation migrates from the slower routers to faster ones, the guardband for the chip frequency decreases significantly. Note that the network congestion status should be taken into consideration when mitigating the NBTI effect at the inter-router level. For instance, when migrating the NBTI effect away from the slower routers, it is possible that packets are all routed to the faster routers; consequently, the faster routers quickly become congested, resulting in longer network latency. Prior work [171] proposes regional-congestion-aware routing to balance the traffic load in the NoC and improve performance; however, the NBTI and PV effects are not considered. The third technique proposed in this paper to mitigate the NBTI and PV effects achieves both reliability- and congestion-aware routing at the inter-router level (IR_M3).

In IR_M3, each router (1) collects its NBTI, PV, and congestion status and produces an aggregated statistic; (2) computes reliability and performance efficiency metrics by considering the aggregated statistics from both local and remote routers; (3) determines whether the aggregated metric or the local congestion metric should be used for the port selection. In the case where a packet stays in the VC for a long time and keeps sending the output VC request to the VA logic, the router suffers more NBTI degradation; therefore, if the current router is the slowest one, instead of trying to route its packet to a faster but possibly more congested node, I prefer to quickly send it to a less congested node, which reduces the VA utilization in the slowest router. Each router further (4) selects a preferred output port for the packets based on the computed metrics to achieve a good trade-off between reliability and performance; and (5) propagates its aggregated statistic to its neighbor routers. Figure 8-8 shows the implementation of IR_M3. The aggregation and propagation modules are added into the adaptive router to perform steps (1), (2), and (5); in addition, the pre-selection unit takes the computed metrics to perform step (4). Note that the above-mentioned steps proceed simultaneously with the router pipeline and do not cause any extra delay to the flit traversal. As mentioned earlier, IDDQ can quantify both NBTI and PV effects. Since the delay in the VA and XA stages mainly determines the router frequency, in each router I reuse the IDDQ detectors deployed in VA_SA_M1, and the lowest IDDQ current is used as the reliability estimation.

A high IDDQ value represents small NBTI and PV effects. As suggested in [171], the combination of free virtual channels and crossbar demand (vc_xb) is used for congestion estimation in our study; a high vc_xb value indicates low congestion. In order to route the packets along an optimal path, the collected estimations need to be integrated into a single statistic. Note that the remote statistics are obtained from all network directions, and an aggregated statistic is produced for each network direction. The integration consists of two stages. First, a weighting function combines the reliability and congestion estimations to produce an aggregated statistic for the local router. If there is no preference between reliability and performance, one can simply choose a 50-50 weighting; if reliability has a higher priority, one can grant it a higher weight, and vice versa. Furthermore, one can dynamically change the weighting function based on the packet characteristics. For example, if there is a large amount of redundant data in a packet, the content of the packet can be represented by a small number of bits, and the VC buffers storing this packet can be used to store the inverted values for NBTI mitigation. In that case, the reliability statistic need not be considered in its next-hop routing, since the packet will not exacerbate the reliability of the downstream router and a router with high NBTI degradation can still accept it; a 100-0 weighting between performance and reliability can then be applied. A sensitivity analysis exploring various weighting functions is performed in the Evaluation section. In the second stage, the router assigns equal weight to the local and remote aggregated statistics to compute the combined reliability and performance metrics. In the pre-selection function, the metrics from the two directions of each quadrant are compared, and the neighboring node with the higher value (i.e. representing a better reliability and performance trade-off) is chosen as the next hop. The propagation is performed in each direction as well; a more detailed discussion can be found in [171]. The aggregation/propagation module mainly consists of two adders.
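The two-stage combination can be sketched as follows (a hypothetical helper; local holds this router's normalized IDDQ and vc_xb readings, remote maps each candidate direction of a quadrant to its neighbor's aggregated statistic, and higher scores are better):

    def preselect_port(local, remote, w_rel=0.5):
        def aggregate(rel, cong):
            # Stage 1: weighting function over reliability and congestion.
            return w_rel * rel + (1.0 - w_rel) * cong
        local_stat = aggregate(local["iddq"], local["vc_xb"])
        # Stage 2: equal weight between local and remote aggregated statistics.
        scores = {d: 0.5 * local_stat + 0.5 * s for d, s in remote.items()}
        return max(scores, key=scores.get)

    # e.g. preselect_port({"iddq": 0.8, "vc_xb": 0.6}, {"N": 0.7, "E": 0.4}) -> "N"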

Our gate-level estimation shows that the extra area overhead caused by IR_M3 is around 2% per router.

Evaluation

Experimental Methodologies

In this study, I use Garnet [176], a detailed cycle-accurate NoC simulator, and extend it to support the two-stage adaptive routing. I use the integrated Orion power model [177] to track the dynamic and leakage power of the NoC. All simulations are performed for a 25-node (5x5) mesh network. I restrict our evaluation to a 2D mesh NoC, but the general principles presented here could be applied to other NoC topologies as well. Each VC group has 4 virtual channels, and each VC holds four 128-bit flits. I evaluate the proposed techniques using a set of representative synthetic traffic patterns (i.e. uniform random, transpose, bit-complement, and tornado). In addition, I use traffic traces from real-world workloads such as SPLASH-2 [178], SPEComp [179], and SPECjbb 2005 [180] in the technique evaluation. For the synthetic traffic simulation, I modified the Garnet simulator to inject packets during a period of 1 million cycles (including 100K warm-up cycles); both one-flit and five-flit packets are injected. Please also refer to the Evaluation sections of Chapters 6 and 7 for the NBTI and process variation modeling and the circuit- and architecture-level evaluation methodology.

Effectiveness of VA_SA_M1

As shown in [127], inserting the special input values (SIV) into a combinational logic unit during its idle periods shows a strong capability for NBTI mitigation. I compare the effectiveness of VA_SA_M1 with that of SIV in both the virtual channel allocation and switch allocation logics; I incorporate adaptive body biasing in the SIV scheme for the purpose of fair comparison. Note that VA_SA_M1 shares the same PV guardband as SIV. Figure 8-9 (a)-(d) shows the NBTI_guardband and NBTI&PV_overhead (in this dissertation, I use NBTI&PV_efficiency and NBTI&PV_overhead interchangeably) achieved by VA_SA_M1 and SIV when using the four synthetic traffic patterns with different packet injection rates.

Since there are multiple routers on the chip and VA_SA_M1 operates at the intra-router level, the analysis focuses on the routers that suffer the most severe NBTI degradation. Note that VA_SA_M1 does not block any packet during the VA stage, so it does not influence the number of cycles that a packet needs to traverse the router pipeline; therefore, the network latency is not affected by our proposed techniques. The area and power overhead of VA_SA_M1, which affects the TDP, is incorporated in the NBTI&PV_overhead calculation.

Figure 8-9 shows that VA_SA_M1 provides strong NBTI mitigation under heavy traffic loads. For example, it achieves a 47% NBTI_guardband reduction compared to SIV in uniform random traffic when the injection rate is 0.1. When the traffic load is light (e.g. 0.02 flits/node/cycle), VA_SA_M1 gains less NBTI_guardband reduction: due to the lower arbiter utilization, the long idle periods already provide a good opportunity for SIV to achieve NBTI recovery, leaving limited room for VA_SA_M1 to improve on it. On the other hand, when the network loads are extremely heavy (e.g. 0.42 flits/node/cycle in uniform random traffic), all the arbiters are busy and the possibility for VA_SA_M1 to migrate the utilization is low; the guardband reduction therefore becomes smaller. In general, VA_SA_M1 reduces the NBTI&PV_overhead by 10% compared to SIV. One may notice that VA_SA_M1 shows a smaller NBTI mitigation improvement for tornado traffic. This is because in tornado traffic packets are only sent along the X direction, so the VCs and their local arbiters at the east and west network directions are highly utilized; as a result, VA_SA_M1 does not have enough opportunities to hide the NBTI degradation via the faster arbiters in the X dimension. Note that the NBTI_guardband is determined by the maximum Vth.

NBTI_guardband is determined by the maximum Vth. Even though VA_SA_M1 efficiently mitigates the NBTI effect in the Y dimension, its benefit does not show up in the final results. Figure 8-10 shows the sensitivity analysis on the threshold guardband used for enabling AVS+. The uniform random traffic with 0.3flit/node/cyc le injection rate is us ed in this example. As it shows, when the threshold ex ceeds a certain value (17%), both NBTI_guardband and NBTI&PV_overhead are high and largely hold constant. Because the threshold is higher than the final VA delay under VA_M1, AVS+ is not triggered and the NBTI_guardband is mainly affected by XA delay. Meanwhile, a high NBTI_guardband leads to a high NBTI&PV_overhead When the threshold guardband is lower, VA_SA_M1 obtains considerable NBTI_guardband reduction with even lower NBTI&PV_overhead However, when it drops below 12%, VA_SA_M1 cannot obtain more benefit in reducing the NBTI_guardband and NBTI&PV_overhead starts to increase again. This is because the extremely low threshold causes a large amount of power overhead due to AVS+; but the NBTI_guardband is limited by the VA delay. In this study, I conservatively set the threshold guardband as 11%. Effectiveness of VC_M2 VC_M2 is compared with the 50%_inversion sc heme which fixes the inversion percentage as 50% for NBTI mitigation. Note that ad aptive body biasing technique is trigged in 50%_inversion as well. Figure 8-11 (a)-(d) presents the network latency and NBTI&PV_overhead of VC_M2 and 50%_inversion for the different traffic patterns. Note that both techniques target the same maximum Vth. Therefore, they achieve the same guardband. The TDP of VC_M2 and 50%_inversion is 1.03. As can be seen, when there is heavy traffic, VC_M2 significantly improves the network latency and reduces the NBTI&PV_overhead compared to 50%_inversion. Take uniform random traffic as an example, compared to 50%_inversion, VC_M2 can absorb 24% more NoC load before becoming saturated (an NoC is saturated when 198


Effectiveness of IR_M3

IR_M3 is compared with two other schemes: RCA (Regional Congestion Aware routing) [171], an adaptive routing scheme that gives the performance metric 100% weight and therefore does not consider the NBTI and PV effects, and AR_reliability, which takes only the reliability statistics into consideration when routing packets. The adaptive body biasing technique is included in both RCA and AR_reliability. Figure 8-12 shows the NBTI&PV_guardband and network latency of these three techniques across different traffic patterns. Note that SIV is applied to all three schemes. As Figure 8-12 shows, in most cases IR_M3 achieves a much lower guardband (e.g. 50%) compared to RCA. The benefit becomes smaller when the traffic is heavy, due to the high router utilization. On the other hand, IR_M3 significantly improves the network throughput (around 19%) when compared to AR_reliability. Figure 8-13 shows the NBTI&PV_overhead, which captures the trade-offs among performance, reliability, and power. As can be seen, IR_M3 outperforms both RCA and AR_reliability: in general, it reduces the NBTI&PV_overhead by 14% and 30% compared to the two schemes, respectively.

Figure 8-13 shows that when the packet injection rate increases, the NBTI&PV_overhead in AR_reliability first increases to an extremely high value (due to its higher network latency compared to the baseline case without the PV and NBTI effects), and then drops (because the network latency in the baseline case grows toward infinity as well). Different from AR_reliability, the NBTI&PV_overhead in IR_M3 and RCA first drops below that of the variation-free baseline, because both of them improve the network throughput compared to the baseline case. For example, under the same injection rate (e.g. 0.29 flits/node/cycle in transpose traffic), the network latency in the baseline case becomes infinite while it still stays low in the two schemes. IR_M3, RCA, and AR_reliability merge to the same NBTI&PV_overhead when the injection rate is around 0.45 flits/node/cycle, since the network is saturated in all these cases.
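The contrast among the three routing policies can be sketched as a weighted port-selection rule; the normalization of the two metrics and the 0.5 blend weight below are illustrative assumptions, not values from the dissertation.

# Sketch of next-hop output-port selection. RCA corresponds to w_perf = 1.0
# (congestion only), AR_reliability to w_perf = 0.0 (NBTI/PV degradation
# only), and an IR_M3-like policy blends both. Both metrics are assumed to be
# normalized to [0, 1], lower being better.
def pick_output_port(candidates, w_perf=0.5):
    # candidates: list of (port, congestion_metric, degradation_metric)
    def cost(entry):
        _, congestion, degradation = entry
        return w_perf * congestion + (1.0 - w_perf) * degradation
    return min(candidates, key=cost)[0]

ports = [("north", 0.2, 0.9), ("west", 0.4, 0.1)]
print(pick_output_port(ports, w_perf=1.0))   # RCA-like: "north"
print(pick_output_port(ports, w_perf=0.0))   # AR_reliability-like: "west"
print(pick_output_port(ports, w_perf=0.5))   # blended: "west", sparing the aged router

The blended rule explains the measured middle ground: it gives up a little of RCA's latency to steer load away from the most degraded routers, and it gives up some of AR_reliability's guardband to keep throughput high.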


Since the proposed intra-router and inter-router techniques are orthogonal to each other, I propose to combine them to gain additional benefits in NBTI mitigation. The combined techniques cause a total 4% area overhead per router. I compare the combined scheme with the technique that uses SIV, 50%_inversion, and RCA together (i.e. SIV+50%_inversion+RCA); adaptive body biasing is incorporated as well. Compared with SIV+50%_inversion+RCA, my combined technique reduces the NBTI&PV_guardband and NBTI&PV_overhead by as much as 70% and 41%, respectively, while improving network throughput by 5%. Note that the network latency has already been improved by SIV+50%_inversion+RCA, due to its use of congestion-aware adaptive routing to balance the buffer utilization in routers.

Real Workload Results

Figure 8-14 (a)-(c) shows the NBTI&PV_guardband, network latency, and NBTI&PV_overhead of the real workloads when the proposed schemes are applied. The results are normalized to SIV+50%_inversion+RCA. As it shows, my techniques are able to improve the NBTI recovery on benchmarks with high traffic load. For example, on water-spatial, which exhibits high traffic, the combined scheme reduces the NBTI&PV_guardband and network latency by 59% and 4% respectively compared with SIV+50%_inversion+RCA, leading to a 33% reduction in NBTI&PV_overhead. Similarly, on benchmarks with relatively high traffic (e.g. water-nsquared, equake, fma3d, and mgrid), the combined technique improves the NBTI&PV_guardband, network latency, and NBTI&PV_overhead by 48%, 2%, and 20%, respectively. The improvement on medium-load benchmarks (e.g. ocean, radix, barnes, and fft) is 30% in NBTI&PV_guardband, 1% in network latency, and 12% in NBTI&PV_overhead.


The combined technique achieves smaller improvements on light-load benchmarks such as raytrace, because there are fewer uses of the VA arbiters and less congestion in the routers. The NBTI&PV_guardband reduction is 5%, with 0.5% and 2% improvements in network latency and NBTI&PV_overhead, respectively.

Related Work

There is a vast body of previous work on improving the performance and power efficiency of NoC designs [165, 166, 181-184]. These studies all assume that the underlying transistors share the same characteristics. In [164], Li and Peh observed that NoC design choices are strongly influenced by the effects of process variation. Their study focuses on analyzing process variation early in the design flow, while our work proposes PV mitigation techniques suitable for runtime NoC operation. In [125], Ogras and Marculescu studied the use of NoCs consisting of multiple voltage-frequency islands to cope with parameter variation problems. To tolerate the effect of process variations on NoC, [86] proposed self-correcting links that automatically detect delay variations and compensate for them, and [173] proposed VA compaction and SA folding mechanisms to increase the NoC's immunity to PV effects. The above studies focus exclusively on PV, while my work targets both PV and NBTI, which inherently interplay with each other.

In the past, various PV mitigation techniques have been proposed [111-115]. Those PV mitigation schemes largely focus on processor core architectures, while our work proposes both intra-router and inter-router PV mitigation techniques suitable for the state-of-the-art router microarchitecture and the distributed, large-scale NoC substrate. I expect that my work will open new opportunities for robust nano-scale NoC design. The combined effect of PV and NBTI has been modeled and analyzed in [154, 155, 161]. In this work, we focus on leveraging NoC microarchitecture characteristics for the purpose of NBTI and PV mitigation. For example, in flow-control-based NoC, a credit indicating VC availability is sent back from the downstream router to its upstream router.


In our two low-level techniques, we take advantage of the backward credits to intelligently steer the VC allocation (VA_M1) and control the required inversion percentage (VC_M2) in the downstream node. Compared to Razor [156], this study targets the mitigation of the NBTI and PV effects in both combinational circuits and storage-cell structures, with desirable trade-offs between performance and reliability. Recently, Tiwari and Torrellas proposed aging-driven application scheduling [162], which can hide the effect of aging by intelligently mapping workloads to cores with different frequencies (caused by process variation). To the best of my knowledge, there has been no prior work on addressing the NBTI effect in NoCs.


Figure 8-1. Two-stage adaptive router microarchitecture.

Figure 8-2. Circuit design of the two-stage adaptive router [28]. A) Circuit design of the two-stage adaptive router. B) Zoom-in view of the VA logic.

Figure 8-3. Circuit design in the first step of the VA stage (one VC entry is shown).

Figure 8-4. The implementation of VA_M1.

Figure 8-5. The guardband and power benefit of applying AVS+ during the lifetime. A) Guardband benefit. B) Power benefit.

Figure 8-6. Vth (in mV) variation map for a router.

Figure 8-7. The implementation of VC_M2.

Figure 8-8. The implementation of IR_M3.

Figure 8-9. The effectiveness of VA_SA_M1 on NBTI&PV_overhead and NBTI_guardband. A) Uniform random traffic. B) Bit-complement traffic. C) Transpose traffic. D) Tornado traffic.

Figure 8-9. Continued.

Figure 8-10. Sensitivity analysis on threshold (uniform random traffic with 0.3 flits/node/cycle injection rate).

Figure 8-11. The effectiveness of VC_M2 on network latency and NBTI&PV_overhead. A) Uniform random traffic. B) Bit-complement traffic. C) Transpose traffic. D) Tornado traffic.

0 20 40 60 80 100 0.020.080.140.20.26 Injected traffic (flits/node/cycle)Network Latency (cycle)0 5 10 15 20 25 30NBTI&PV_ overhead 50%_inversion(latency) VC_M2(latency) 50%_inversion(overhead) VC_M2(overhead) C 0 20 40 60 80 100 0.020.080.140.20.260.32 Injected traffic (flits/node/cycle)Network Latency (cycle)0 1 2 3 4 5NBTI&PV_ overhead 50%_inversion(latency) VC_M2(latency) 50%_inversion(overhead) VC_M2(overhead) D Figure 8-11. Continued 213

PAGE 214

Figure 8-12. The effectiveness of IR_M3 on network latency and NBTI_guardband. A) Uniform random traffic. B) Bit-complement traffic. C) Transpose traffic. D) Tornado traffic.

Figure 8-12. Continued.

Figure 8-13. The effectiveness of IR_M3 on NBTI&PV_overhead. A) Uniform random traffic. B) Bit-complement traffic. C) Transpose traffic. D) Tornado traffic.

Figure 8-13. Continued.

Figure 8-14. The effectiveness of the combined techniques (VA_SA_M1+VC_M2+IR_M3) on real workloads. A) Normalized NBTI&PV_guardband. B) Normalized network latency. C) Normalized NBTI&PV_overhead.

Figure 8-14. Continued.

CHAPTER 9
CONCLUSIONS AND FUTURE WORKS

Conclusions

Semiconductor transient faults have become an increasing challenge for reliable software execution. To explore cost-effective fault tolerant mechanisms for dependable execution of the next generation of software, researchers need to analyze program vulnerability to soft errors at a high level and at an early design stage. I have developed Sim-SODA, a unified framework to estimate software vulnerability to transient faults at the architectural level. The foundations of the vulnerability modeling infrastructure are parameterized AVF models of the microarchitecture structures present in modern high-performance microprocessors. Compared with previously proposed tools, Sim-SODA provides fine-grained AVF models and covers more hardware structures.
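The parameterized AVF models referred to above follow the standard ACE-bit formulation of Mukherjee et al. [22]; restating it here for reference (this is the published definition, not a new derivation), a structure's AVF over an N-cycle window is the fraction of its bit-cycles that hold ACE (Architecturally Correct Execution) state:

\[ \mathrm{AVF}_{\mathrm{structure}} \;=\; \frac{\sum_{n=1}^{N} \bigl(\text{ACE bits resident in cycle } n\bigr)}{B_{\mathrm{structure}} \times N} \]

where B_structure is the total number of bits in the structure. Reducing either the number of ACE bits or their resident cycles reduces the AVF, which is the lever exploited by the IQ techniques summarized below.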


Phase analysis is becoming increasingly important for optimizing the efficiency of next-generation computer systems. As semiconductor transient faults become an increasing threat to reliable software execution, it is imperative to design workload-dependent fault tolerant mechanisms for future microprocessors, which contain billions of transistors. Observing reliability-oriented program phase behavior is the first step toward dynamic fault tolerant systems that can be tuned to meet workload reliability requirements. I studied the run-time reliability characteristics of four important microarchitecture structures and investigated the applicability and effectiveness of different phase identification methods in classifying program reliability-oriented phases. I found that both code-structure based and run-time event based phase analysis techniques perform well in detecting reliability-oriented phases. In general, the performance counter scheme outperforms the control flow based scheme on a majority of benchmarks. I also found that program reliability phases can be cost-effectively identified using five performance counters that are generally available across different hardware platforms. This suggests that on-line reliability phase detection and prediction hardware with low complexity can be built on a small set of performance counters.

Since the IQ exhibits high vulnerability in an SMT environment, I have presented novel microarchitecture techniques designed to reduce the IQ vulnerability to soft errors on SMT processors. The key observation is that the number of vulnerable instructions that are ready to execute on SMT processors is much higher than on superscalar processors. The IQ vulnerability to soft errors can be reduced by assigning vulnerable instructions a higher issue priority. This has the effect of reducing the number of ACE-bit resident cycles and thus the vulnerability of the IQ. I further apply reliability-aware dynamic resource allocation to the IQ to prevent excessive vulnerable bits from entering the IQ. Results from the implementation and evaluation of the proposed techniques show a 42% reliability improvement in the IQ together with a 1% performance improvement.

Based on the observation that instructions' long waiting-for-ready time significantly increases IQ soft-error vulnerability, I have further developed six novel microarchitecture techniques that improve the instruction queue reliability while maintaining processor performance. I use instruction operand readiness prediction to reduce the performance penalty of the ORBIT schemes. The proposed Predict_DelayACE scheme reduces IQ vulnerability by 79% with 1% throughput IPC and 3% harmonic IPC degradation on average across all three types of workloads. The ORBIT schemes outperform the advanced SMT fetch policies when reliability, throughput, and fairness are all considered. Results also show that the IQ vulnerability does not migrate to other structures. In this study I focus on the IQ; however, I believe the technique could be extended to other microarchitecture structures.
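As a rough illustration of the priority rule summarized above (my sketch, not the dissertation's selection logic), ready instructions can be ranked by accumulated ACE-bit exposure so that the most vulnerable state drains from the IQ first:

# Sketch: among ready instructions, issue the one whose ACE bits have been
# resident the longest, minimizing further exposure. The exposure metric
# (ace_bits * resident_cycles) is an illustrative choice.
def select_to_issue(ready_insts, cycle):
    def exposure(inst):
        resident_cycles = cycle - inst["dispatch_cycle"]
        return inst["ace_bits"] * resident_cycles
    return max(ready_insts, key=exposure)["id"]

ready = [
    {"id": "i1", "dispatch_cycle": 90, "ace_bits": 64},
    {"id": "i2", "dispatch_cycle": 40, "ace_bits": 32},  # older, fewer ACE bits
]
print(select_to_issue(ready, cycle=100))  # "i2": 32*60 beats 64*10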


Improved reliability design and control over design overhead can be achieved through hybrid approaches that allow simultaneous trade-offs between circuit and architecture techniques. In this dissertation, SER robustness techniques at the circuit and microarchitecture levels are effectively combined. They benefit each other and obtain a greater reliability improvement with negligible performance loss and power overhead. To my knowledge, this is the first work to efficiently bridge the gap between circuit and microarchitecture level fault tolerance techniques. I first proposed a hybrid radiation-hardened IQ design, which dispatches performance-critical but not-ready instructions into a well-protected RIQ and uses an unprotected NIQ to hold only operand-ready instructions. The hybrid design takes advantage of the RIQ's SER robustness to not only protect critical instructions but also provide a time-of-issue guarantee. Results show the technique achieves an 80% IQ SER reduction with only 0.3% throughput IPC and 1% harmonic IPC penalty. The hybrid scheme also outperforms other existing techniques (e.g. 2OP_BLOCK, ORBIT, and the FLUSH fetch policy). To improve ROB reliability, I propose using architectural characteristics to trigger DualVDD, switching the normal VDD to a higher level when the ROB exhibits high vulnerability during an L2 miss. This technique obtains 35% ROB SER robustness with 3.5% power overhead. I also put the two techniques together and evaluate their aggregate effect on the entire core. On average, the core SER is reduced by 23%, and as a byproduct, the LSQ achieves a 15% SER reduction.
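The dispatch rule of the hybrid IQ can be sketched as follows; the criticality test is left abstract because the dissertation's actual predictor is not reproduced here, and holding back non-critical, not-ready instructions is an ORBIT-style fallback assumed for illustration.

# Sketch of hybrid RIQ/NIQ dispatch: ready instructions go to the unprotected
# NIQ (their residency, hence exposure, is short); critical but not-ready
# instructions wait in the radiation-hardened RIQ; the rest are delayed at
# dispatch until their operands arrive (ORBIT-style).
def dispatch_target(operands_ready, is_critical, riq_has_space):
    if operands_ready:
        return "NIQ"
    if is_critical and riq_has_space:
        return "RIQ"
    return "DELAY"

print(dispatch_target(True, True, True))    # NIQ: issues quickly anyway
print(dispatch_target(False, True, True))   # RIQ: protected while it waits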


Process variation is becoming a major challenge that needs to be addressed as process technology continues to scale down. Most PV-related studies focus only on the performance and power domains, and little work has been done to link PV to soft error vulnerability. However, an efficient PV mitigation technique should have the ability to take reliability, performance, and power into account. I present the first study to optimize the reliability of PV mitigation techniques while considering all three domains. I characterize the critical charge variation at the bit and entry levels in the presence of PV, and I propose Entry-BVM and Structure-BVM to improve microarchitecture soft error reliability and its trade-offs with performance and power in light of PV. Simulation results show that Entry-BVM and Structure-BVM can reduce IQ SER by 24% and 20% and achieve trade-offs improvements of 27% and 26%, respectively. Combining the two techniques achieves an even higher IQ SER reduction (i.e. 40%) with a 46% trade-offs improvement.

NBTI is another crucial reliability concern in nanometer technology. It degrades PMOS transistors by increasing their Vth, which leads to failures in both logic circuits and storage cells. Meanwhile, process variation, which results in static parameter variation (e.g. L and Vth) across transistors, exacerbates the reliability problem in current high-performance processors. Methodologies to mitigate both PV and NBTI effects are highly desired. In this study, I observe that techniques leveraging the positive interaction between PV and NBTI can obtain attractive trade-offs among performance, reliability, and power. I propose three microarchitecture optimizations that efficiently take advantage of the positive interplay between NBTI and PV to mitigate the NBTI effect in the presence of PV. My techniques are flexible and can be applied to most microarchitecture structures. My experimental results show that the aggregated effect of the proposed methods improves the chip NBTI&PV_efficiency by 117% compared to the baseline case without any optimization, and by 21% compared to the technique that simply combines NBTI and PV mitigation methods.

NoC is becoming the imperative communication fabric for emerging multi-/many-core processors. Existing NoC designs largely assume reliable processing technologies and uniform transistor characteristics.


As CMOS fabrication technologies approach the nano- and atom-scales, process variation can significantly degrade the performance and reliability of these designs. This problem is further compounded by the NBTI effect, which wears out transistors and reduces their lifetime. Therefore, it is unwise to ignore the impact of PV and NBTI in NoC architecture design. In this study, I propose novel techniques to mitigate the impact of PV and NBTI on NoC. Experimental results show that my intra-router techniques (i.e. VA_SA_M1 and VC_M2) reduce the guardband by 47% while improving network throughput by 24%. My inter-router optimization scheme (i.e. IR_M3) results in a 50% guardband reduction and a 19% network latency improvement. To the best of our knowledge, this is the first work that optimizes both the PV and NBTI effects in emerging NoC architecture design.

Future Works

In the near future, I am interested in extending my Ph.D. research to the design of network-on-chip architectures in many-/multi-core processors, considering the interactions among the OS, architecture, and circuit levels to improve performance, reliability, and power for emerging applications. For example, observing that routers in a NoC exhibit different speeds due to the PV effect, and that faster routers can tolerate more delay caused by NBTI degradation, I intend to leverage the OS to schedule applications with heavy traffic onto regions containing a large number of fast routers, while applications with light traffic are assigned to regions with slow routers. By doing so, we can achieve an optimal trade-off among performance, reliability, and power.

In addition, one-to-many (multicasting) traffic degrades performance significantly in current network-on-chip architectures, so it is crucial to explore an improved NoC architecture that provides low latency, high throughput, and high reliability for multicasting traffic. I intend to dynamically build and update the multicasting trees to avoid traffic congestion and mitigate the impact of process variation and NBTI degradation in each router.


Moreover, the OS can help to reduce the number of hops required in each multicasting tree by assigning the specific cores that should be involved in an application. Furthermore, in order to further improve the latency and throughput of NoC architectures, 3D technologies and nanophotonic devices have been introduced to network-on-chip design; I intend to investigate the challenges and opportunities of those novel techniques for NoC design.

My long-term research will focus on exploring reliable and highly efficient many-/multi-core processor architecture design. Semiconductor roadmaps forecast that ten years from now, processor chips will integrate 128 billion transistors in 11nm technology. This will likely translate into hundreds and even thousands of cores on one chip. It greatly helps to boost performance, but will keep exacerbating the reliability issues. For example, some cores may fail in infancy due to the impact of process variation, some cores may fail during the required lifetime due to severe NBTI degradation, and some cores may not be able to execute programs correctly due to their high vulnerability to soft errors. I intend to develop techniques across the OS, architecture, and circuit levels to build robust many-/multi-core processors. I intend to explore techniques to support systems that automatically analyze heterogeneous workloads and dynamically reconfigure the application components based on the current reliability/performance/power characteristics of the cores assigned to the application. At the circuit level, vulnerability mitigation techniques (e.g. dynamic voltage scaling, adaptive body biasing) can be efficiently combined with the architecture and OS level techniques to achieve optimal trade-offs among reliability, performance, and power in future many-/multi-core architecture design.


LIST OF REFERENCES

[1] G. Asadi, V. Sridharan, M. B. Tahoori, and D. Kaeli, Balancing Performance and Reliability in the Memory Hierarchy, in Proc. ISPASS, 2005.

[2] R. C. Baumann, Soft Errors in Commercial Semiconductor Technology: Overview and Scaling Trends, in IEEE Reliability Physics Tutorial Notes, Reliability Fundamentals, 2002.

[3] N. L. Binkert, E. G. Hallnor, and S. K. Reinhardt, Network-Oriented Full-System Simulation using M5, in Workshop on Computer Architecture Evaluation using Commercial Workloads, 2003.

[4] A. Biswas, R. Cheveresan, J. Emer, S. S. Mukherjee, P. B. Racunas, and R. Rangan, Computing Architectural Vulnerability Factors for Address-Based Structures, in Proc. ISCA, 2005.

[5] J. Blome, S. Mahlke, D. Bradley, and K. Flautner, A Microarchitectural Analysis of Soft Error Propagation in a Production-Level Embedded Microprocessor, in Workshop on Architectural Reliability, 2005.

[6] D. Burger and T. M. Austin, The SimpleScalar Tool Set, Version 2.0, Technical Report No. 1342, Computer Science Dept., University of Wisconsin-Madison, June 1997.

[7] M. Choudhury, Q. Zhou, and K. Mohanram, Design Optimization for Single-Event Upset Robustness using Simultaneous Dual-VDD and Sizing Techniques, in Proc. ICCAD, 2006.

[8] Compaq Computer Corporation, Alpha 21264 Microprocessor Hardware Reference Manual, July 1999.

[9] Compaq Computer Corporation, Compiler Writer's Guide for the Alpha 21264, 1999.

[10] K. Constantinides, S. Plaza, J. Blome, B. Zhang, V. Bertacco, S. Mahlke, T. Austin, and M. Orshansky, Assessing SEU Vulnerability via Circuit-Level Timing Analysis, in Workshop on Architectural Reliability, 2005.

[11] R. Desikan, D. Burger, S. W. Keckler, and T. Austin, Sim-alpha: A Validated, Execution-Driven Alpha 21264 Simulator, Technical Report TR-01-23, Dept. of Computer Sciences, University of Texas at Austin, 2001.

[12] R. Desikan, D. Burger, and S. W. Keckler, Measuring Experimental Error in Microprocessor Simulation, in Proc. ISCA, 2001.

[13] J. Emer, P. Ahuja, N. Binkert, E. Borch, R. Espasa, T. Juan, A. Klauser, C.-K. Luk, S. Manne, S. S. Mukherjee, H. Patil, and S. Wallace, Asim: A Performance Model Framework, IEEE Computer, vol. 35, no. 2, pp. 68-76, February 2002.

[14] B. Fahs, S. Bose, M. Crum, B. Slechta, F. Spadini, T. Tung, S. J. Patel, and S. S. Lumetta, Performance Characterization of a Hardware Mechanism for Dynamic Optimization, in Proc. MICRO, 2001.

[15] Y. Yeh, Triple-triple Redundant 777 Primary Flight Computer, in Proc. IEEE Aerospace Applications Conference, 1996.

[16] R. E. Kessler, E. J. McLellan, and D. A. Webb, The Alpha 21264 Microprocessor Architecture, in Proc. ICCD, 1998.

[17] R. E. Kessler, The Alpha 21264 Microprocessor, IEEE Micro, vol. 19, no. 2, pp. 24-36, March/April 1999.

[18] D. Leibholz and R. Razdan, The Alpha 21264: A 500 MHz Out-of-Order Execution Microprocessor, in Proc. Compcon, 1997, pp. 28-36.

[19] X. D. Li, S. V. Adve, P. Bose, and J. A. Rivers, SoftArch: An Architecture Level Tool for Modeling and Analyzing Soft Errors, in Proc. DSN, 2005.

[20] Y. Li, T. Li, T. Kahveci, and J. A. B. Fortes, Workload Characterization of Bioinformatics Applications, in Proc. MASCOTS, 2005.

[21] S. K. Reinhardt and S. S. Mukherjee, Transient Fault Detection via Simultaneous Multithreading, in Proc. ISCA, 2000.

[22] S. S. Mukherjee, C. Weaver, J. Emer, S. K. Reinhardt, and T. Austin, A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor, in Proc. MICRO, 2003.

[23] H. Zhou, A Case for Fault Tolerance and Performance Enhancement using Chip Multi-Processors, IEEE Computer Architecture Letters, September 2005.

[24] G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, and D. I. August, SWIFT: Software Implemented Fault Tolerance, in Proc. CGO, 2005.

[25] G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, D. I. August, and S. S. Mukherjee, Design and Evaluation of Hybrid Fault-Detection Systems, in Proc. ISCA, 2005.

[26] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, Automatically Characterizing Large Scale Program Behavior, in Proc. ASPLOS, 2002.

[27] N. J. Wang, J. Quek, T. M. Rafacz, and S. J. Patel, Characterizing the Effects of Transient Faults on a High-Performance Processor Pipeline, in Proc. DSN, 2004.

[28] C. Weaver, J. Emer, S. S. Mukherjee, and S. K. Reinhardt, Techniques to Reduce the Soft Error Rate of a High-Performance Microprocessor, in Proc. ISCA, 2004.

[29] J. J. Yi and D. J. Lilja, Improving Processor Performance by Simplifying and Bypassing Trivial Computations, in Proc. ICCD, 2002.

[30] X. Fu, J. Poe, T. Li, and J. Fortes, Characterizing Microarchitecture Soft Error Vulnerability Phase Behavior, in Proc. MASCOTS, 2006.

[31] D. Tullsen and J. Brown, Handling Long-latency Loads in a Simultaneous Multithreading Processor, in Proc. MICRO, 2001.

[32] K. R. Walcott, G. Humphreys, and S. Gurumurthi, Dynamic Prediction of Architectural Vulnerability from Microarchitectural State, in Proc. ISCA, 2007.

[33] J. Sharkey and D. Ponomarev, Efficient Instruction Schedulers for SMT Processors, in Proc. HPCA, 2006.

[34] X. Shen, Y. Zhong, and C. Ding, Locality Phase Prediction, in Proc. ASPLOS, 2004.

[35] D. Tullsen, S. Eggers, J. Emer, H. Levy, J. Lo, and R. Stamm, Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, in Proc. ISCA, 1996.

[36] T. Sherwood, S. Sair, and B. Calder, Phase Tracking and Prediction, in Proc. ISCA, 2003.

[37] A. Dhodapkar and J. Smith, Managing Multi-Configurable Hardware via Dynamic Working Set Analysis, in Proc. ISCA, 2002.

[38] A. Dhodapkar and J. Smith, Comparing Program Phase Detection Techniques, in Proc. MICRO, 2003.

[39] C. Isci and M. Martonosi, Identifying Program Power Phase Behavior using Power Vectors, in Workshop on Workload Characterization, 2003.

[40] C. Isci and M. Martonosi, Phase Characterization for Power: Evaluating Control-Flow-Based and Event-Counter-Based Techniques, in Proc. HPCA, 2006.

[41] M. Annavaram, R. Rakvic, M. Polito, J.-Y. Bouguet, R. Hankins, and B. Davies, The Fuzzy Correlation between Code and Performance Predictability, in Proc. MICRO, 2004.

[42] A. El-Moursy and D. H. Albonesi, Front-end Policies for Improved Issue Efficiency in SMT Processors, in Proc. HPCA, 2003.

[43] J. Lau, J. Sampson, E. Perelman, G. Hamerly, and B. Calder, The Strong Correlation between Code Signatures and Performance, in Proc. ISPASS, 2005.

[44] K. Wang and M. Franklin, Highly Accurate Data Value Prediction using Hybrid Predictors, in Proc. MICRO, 1997.

[45] D. Marr, F. Binns, D. Hill, G. Hinton, D. Koufaty, J. Miller, and M. Upton, Hyper-Threading Technology Architecture and Microarchitecture, Intel Technology Journal, vol. 6, no. 1, February 2002.

[46] M. Huang, J. Renau, and J. Torrellas, Positional Adaptation of Processors: Application to Energy Reduction, in Proc. ISCA, 2003.

[47] A. Iyer and D. Marculescu, Power Aware Microarchitecture Resource Scaling, in Proc. DATE, 2001.

[48] G. Memik, G. Reinman, and W. H. Mangione-Smith, Just Say No: Benefits of Early Cache Miss Determination, in Proc. HPCA, 2003.

[49] J. Borkenhagen, R. Eickemeyer, R. Kalla, and S. Kunkel, A Multithreaded PowerPC Processor for Commercial Servers, IBM Journal of Research and Development, vol. 44, no. 6, pp. 885-898, 2000.

[50] S. Kim and A. K. Somani, Soft Error Sensitivity Characterization of Microprocessor Dependability Enhancement Strategy, in Proc. DSN, 2002.

[51] A. R. Lebeck, J. Koppanalil, T. Li, J. Patwardhan, and E. Rotenberg, A Large, Fast Instruction Window for Tolerating Cache Misses, in Proc. ISCA, 2002.

[52] S. E. Raasch, N. L. Binkert, and S. K. Reinhardt, A Scalable Instruction Queue Design Using Dependence Chains, in Proc. ISCA, 2002.

[53] R. Canal and A. Gonzalez, A Low-complexity Issue Logic, in Proc. ICS, 2000.

[54] R. Canal and A. Gonzalez, Reducing the Complexity of the Issue Logic, in Proc. ICS, 2001.

[55] S. Palacharla, N. P. Jouppi, and J. E. Smith, Complexity-Effective Superscalar Processors, in Proc. ISCA, 1997.

[56] R. E. Kessler, E. J. McLellan, and D. A. Webb, The Alpha 21264 Microprocessor Architecture, in Proc. ICCD, 1998.

[57] P. Michaud and A. Seznec, Data-flow Prescheduling for Large Instruction Windows in Out-of-order Processors, in Proc. HPCA, 2001.

[58] H. Akkary, R. Rajwar, and S. T. Srinivasan, Checkpoint Processing and Recovery: Towards Scalable Large Instruction Window Processors, in Proc. MICRO, 2003.

[59] A. J. KleinOsowski and D. J. Lilja, MinneSPEC: A New SPEC Benchmark Workload for Simulation-Based Computer Architecture Research, IEEE Computer Architecture Letters, vol. 1, June 2002.

[60] W. Liu and M. Huang, EXPERT: Expedited Simulation Exploiting Program Behavior Repetition, in Proc. ICS, 2004.

[61] J. MacQueen, Some Methods for Classification and Analysis of Multivariate Observations, in Proc. Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967.

[62] R. Balasubramonian, D. H. Albonesi, A. Buyuktosunoglu, and S. Dwarkadas, Memory Hierarchy Reconfiguration for Energy and Performance in General Purpose Architectures, in Proc. MICRO, 2000.

[63] W. Zhang, X. Fu, T. Li, and J. Fortes, An Analysis of Microarchitecture Vulnerability to Soft Errors on Simultaneous Multithreaded Architectures, in Proc. ISPASS, April 2007.

[64] S. T. Srinivasan, R. D. Ju, A. R. Lebeck, and C. Wilkerson, Locality vs. Criticality, in Proc. ISCA, 2001.

[65] B. Fields, S. Rubin, and R. Bodik, Focusing Processor Policies via Critical-Path Prediction, in Proc. ISCA, 2001.

[66] E. Tune, D. Liang, D. M. Tullsen, and B. Calder, Dynamic Prediction of Critical Path Instructions, in Proc. HPCA, 2001.

[67] J. S. Seng, E. S. Tune, and D. M. Tullsen, Reducing Power with Dynamic Critical Path Information, in Proc. MICRO, 2001.

[68] D. Ponomarev, G. Kucuk, and K. Ghose, Reducing Power Requirements of Instruction Scheduling through Dynamic Allocation of Multiple Datapath Resources, in Proc. MICRO, 2001.

[69] D. Folegnani and A. Gonzalez, Energy-Effective Issue Logic, in Proc. ISCA, 2001.

[70] T. M. Jones, M. F. P. O'Boyle, J. Abella, and A. Gonzalez, Software Directed Issue Queue Power Reduction, in Proc. HPCA, 2005.

[71] D. Brooks and M. Martonosi, Dynamic Thermal Management for High-Performance Microprocessors, in Proc. HPCA, 2001.

[72] M. G. Valluri, L. K. John, and K. S. McKinley, Low-power, Low-complexity Instruction Issue using Compiler Assistance, in Proc. ICS, 2005.

[73] A. Buyuktosunoglu, S. Schuster, D. Brooks, P. Bose, P. W. Cook, and D. H. Albonesi, An Adaptive Issue Queue for Reduced Power at High Performance, in Proc. PACS, 2000.

[74] T. Karkhanis, J. E. Smith, and P. Bose, Saving Energy with Just In Time Instruction Delivery, in Proc. ISLPED, 2002.

[75] A. Buyuktosunoglu, T. Karkhanis, D. H. Albonesi, and P. Bose, Energy Efficient Co-Adaptive Instruction Fetch and Issue, in Proc. ISCA, 2003.

[76] N. Soundararajan, A. Parashar, and A. Sivasubramaniam, Mechanisms for Bounding Vulnerabilities of Processor Structures, in Proc. ISCA, 2007.

[77] D. Ernst, A. Hamel, and T. Austin, Cyclone: A Broadcast-Free Dynamic Instruction Scheduler with Selective Replay, in Proc. ISCA, 2003.

[78] Y. Liu, A. Shayesteh, G. Memik, and G. Reinman, Scaling the Issue Window with Look-Ahead Latency Prediction, in Proc. ICS, 2004.

[79] T. E. Ehrhart and S. J. Patel, Reducing the Scheduling Critical Cycle using Wakeup Prediction, in Proc. HPCA, 2004.

[80] P. Hazucha and C. Svensson, Impact of CMOS Technology Scaling on the Atmospheric Neutron Soft Error Rate, IEEE Trans. Nuclear Science, vol. 47, no. 6, pp. 2586-2594, December 2000.

[81] P. Roche, F. Jacquet, G. Gasiot, C. Caillat, B. Borot, and J. P. Schoellkopf, High Density SRAM Robust to Radiation-Induced Soft Errors in 90nm CMOS Technology, in Proc. ICMTD, 2005.

[82] P. Roche, F. Jacquet, C. Caillat, and J. P. Schoellkopf, An Alpha Immune and Ultra Low Neutron SER High Density SRAM, in Proc. IRPS, 2004.

[83] D. Marculescu, On the Use of Microarchitectural-Driven Dynamic Voltage Scaling, in Workshop on Complexity-Effective Design, 2000.

[84] V. Reddy et al., Impact of Negative Bias Temperature Instability on Digital Circuit Reliability, in Proc. IRPS, 2002.

[85] A. R. Lebeck, J. Koppanalil, T. Li, J. Patwardhan, and E. Rotenberg, A Large, Fast Instruction Window for Tolerating Cache Misses, in Proc. ISCA, 2002.

[86] M. Simone, M. Lajolo, and D. Bertozzi, Variation Tolerant NoC Design by Means of Self-Calibrating Links, in Proc. DATE, March 2008.

[87] Y. Liu, A. Shayesteh, G. Memik, and G. Reinman, Scaling the Issue Window with Look-Ahead Latency Prediction, in Proc. ICS, 2004.

[88] H. Li, C. Cher, T. N. Vijaykumar, and K. Roy, VSV: L2-Miss-Driven Variable Supply-Voltage Scaling for Low Power, in Proc. MICRO, 2003.

[89] K. Usami et al., Design Methodology of Ultra Low-Power MPEG4 Codec Core Exploiting Voltage Scaling Techniques, in Proc. DAC, 1998.

[90] T. Burd and R. Brodersen, Design Issues for Dynamic Voltage Scaling, in Proc. ISLPED, 2000.

[91] D. Brooks et al., Wattch: A Framework for Architectural-Level Power Analysis and Optimizations, in Proc. ISCA, 2000.

[92] M. A. Gomaa and T. N. Vijaykumar, Opportunistic Transient-Fault Detection, in Proc. ISCA, 2005.

[93] A. Parashar, A. Sivasubramaniam, and S. Gurumurthi, SlicK: Slice-Based Locality Exploitation for Efficient Redundant Multithreading, in Proc. ASPLOS, 2006.

[94] N. J. Wang and S. J. Patel, ReStore: Symptom Based Soft Error Detection in Microprocessors, in Proc. DSN, 2005.

[95] P. Ndai, A. Agarwal, O. Chen, and K. Roy, A Soft Error Monitor Using Switching Current Detection, in Proc. ICCD, 2005.

[96] J. M. Cazeaux et al., On Transistor Level Gate Sizing for Increased Robustness to Transient Faults, in Proc. International On-Line Testing Symposium, 2005.

[97] S. Srinivasan et al., Improving Soft-error Tolerance of FPGA Configuration Bits, in Proc. ICCAD, 2004.

[98] D. K. Schroder and J. A. Babcock, Negative Bias Temperature Instability: Road to Cross in Deep Submicron Silicon Semiconductor Manufacturing, J. Applied Physics, 2003.

[99] L. Peters, NBTI: A Growing Threat to Device Reliability, Semiconductor International, 2004.

[100] K. Bernstein, D. J. Frank, A. E. Gattiker, W. Haensch, B. L. Ji, S. R. Nassif, E. J. Nowak, D. J. Pearson, and N. J. Rohrer, High-performance CMOS Variability in the 65-nm Regime and Beyond, IBM J. Res. & Dev., 2006.

[101] S. Borkar, T. Karnik, S. Narendra, J. Tschanz, A. Keshavarzi, and V. De, Parameter Variations and Impact on Circuits and Microarchitecture, in Proc. DAC, 2003.

[102] K. Bowman, S. Duvall, and J. Meindl, Impact of Die-to-Die and Within-Die Parameter Fluctuations on the Maximum Clock Frequency Distribution for Gigascale Integration, IEEE J. Solid-State Circuits, 2002.

[103] M. Orshansky, L. Milor, P. Chen, K. Keutzer, and C. Hu, Impact of Spatial Intrachip Gate Length Variability on the Performance of High-Speed Digital Circuits, IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, May 2002.

[104] H. Chang and S. S. Sapatnekar, Full-chip Analysis of Leakage Power under Process Variations, including Spatial Correlations, in Proc. DAC, 2005.

[105] A. Agarwal, D. Blaauw, and V. Zolotov, Statistical Timing Analysis for Intra-Die Process Variations with Spatial Correlations, in Proc. ICCAD, 2003.

[106] A. Srivastava, R. Bai, D. Blaauw, and D. Sylvester, Modeling and Analysis of Leakage Power Considering Within-Die Process Variations, in Proc. ISLPED, 2002.

[107] R. Rao, A. Srivastava, D. Blaauw, and D. Sylvester, Statistical Estimation of Leakage Current Considering Inter- and Intra-Die Process Variations, in Proc. ISLPED, 2003.

[108] J. Tschanz, J. Kao, and S. Narendra, Adaptive Body Bias for Reducing Impacts of Die-to-Die and Within-Die Parameter Variations on Microprocessor Frequency and Leakage, IEEE J. Solid-State Circuits, 2002.

[109] A. Srivastava, D. Sylvester, and D. Blaauw, Statistical Optimization of Leakage Power Considering Process Variations using Dual-Vth and Sizing, in Proc. DAC, 2004.

[110] Q. Ding, R. Luo, H. Wang, H. Yang, and Y. Xie, Modeling the Impact of Process Variation on Critical Charge Distribution, in Proc. SOCC, 2006.

[111] A. Tiwari, S. R. Sarangi, and J. Torrellas, ReCycle: Pipeline Adaptation to Tolerate Process Variation, in Proc. ISCA, 2007.

[112] X. Liang, R. Canal, G. Y. Wei, and D. Brooks, Process Variation Tolerant 3T1D-Based Cache Architectures, in Proc. MICRO, 2007.

[113] X. Liang, G. Y. Wei, and D. Brooks, ReVIVaL: A Variation-Tolerant Architecture Using Voltage Interpolation and Variable Latency, in Proc. ISCA, 2008.

[114] R. Teodorescu and J. Torrellas, Variation-Aware Application Scheduling and Power Management for Chip Multiprocessors, in Proc. ISCA, 2008.

[115] N. Soundararajan, A. Yanamandra, C. Nicopoulos, and N. Vijaykrishnan, Analysis and Solutions to Issue Queue Process Variation, in Proc. DSN, 2008.

[116] V. Degalahal, R. Ramanarayanan, N. Vijaykrishnan, Y. Xie, and M. J. Irwin, The Effect of Threshold Voltages on the Soft Error Rate, in Proc. ISQED, 2004.

[117] R. R. Rao, D. Blaauw, and D. Sylvester, Soft Error Reduction in Combinational Logic using Gate Resizing and Flipflop Selection, in Proc. ICCAD, 2006.

[118] X. Liang and D. Brooks, Mitigating the Impact of Process Variations on Processor Register Files and Execution Units, in Proc. MICRO, 2006.

[119] R. Teodorescu, J. Nakano, A. Tiwari, and J. Torrellas, Mitigating Parameter Variation with Dynamic Fine-Grain Body Biasing, in Proc. MICRO, 2007.

[120] R. Vattikonda, W. Wang, and Y. Cao, Modeling and Minimization of PMOS NBTI Effect for Robust Nanometer Design, in Proc. DAC, 2006.

[121] A. Agarwal, D. Blaauw, S. Sundareswaran, V. Zolotov, M. Zhou, K. Gala, and R. Panda, Path-based Statistical Timing Analysis Considering Inter- and Intra-Die Correlations, in Proc. TAU, 2002.

[122] K. Meng and R. Joseph, Process Variation Aware Cache Leakage Management, in Proc. ISLPED, 2006.

[123] A. Kahng, How Much Variability Can Designers Tolerate?, IEEE Design & Test of Computers, 2003.

[124] Nanoscale Integration and Modeling (NIMO) Group, Predictive Technology Model (PTM), [updated November 2008; cited August 2009], available from http://www.eas.asu.edu/~ptm/

[125] U. Y. Ogras, R. Marculescu, and D. Marculescu, Variation-Adaptive Feedback Control for Networks-on-Chip with Multiple Clock Domains, in Proc. DAC, June 2008.

[126] C. Schlunder et al., On the Degradation of P-MOSFETs in Analog and RF Circuits under Inhomogeneous Negative Bias Temperature Stress, in Proc. IRPS, 2003.

[127] J. Abella, X. Vera, and A. Gonzalez, Penelope: The NBTI-Aware Processor, in Proc. MICRO, 2007.

[128] W. Abadeer and W. Ellis, Behavior of NBTI under AC Dynamic Circuit Conditions, in Proc. IRPS, 2003.

[129] X. Liang and D. Brooks, Latency Adaptation for Multiported Register Files to Mitigate the Impact of Process Variations, Workshop on ASGI, 2006.

[130] K. R. Walcott, G. Humphreys, and S. Gurumurthi, Dynamic Prediction of Architectural Vulnerability from Microarchitectural State, in Proc. ISCA, 2007.

[131] X. Fu, T. Li, and J. A. B. Fortes, Sim-SODA: A Unified Framework for Architectural Level Software Reliability Analysis, [updated June 2006; cited August 2009], available from http://www.ideal.ece.ufl.edu/sim-soda

[132] R. Rajsuman, IDDQ Testing for CMOS VLSI, in Proc. IEEE, 2000.

[133] A. Agarwal, K. Kang, and K. Roy, Accurate Estimation and Modeling of Total Chip Leakage Considering Inter- & Intra-Die Process Variations, in Proc. ISLPED, 2005.

[134] Y. Zhang, D. Parikh, K. Sankaranarayanan, K. Skadron, and M. Stan, HotLeakage: A Temperature-Aware Model of Subthreshold and Gate Leakage for Architects, Technical Report CS-2003-05, University of Virginia, 2003.

[135] International Technology Roadmap for Semiconductors (ITRS), Semiconductor Industry Association, 2006.

[136] K. Skadron, M. R. Stan, W. Huang, S. Velusamy, K. Sankaranarayanan, and D. Tarjan, Temperature-Aware Microarchitecture, in Proc. ISCA, June 2003.

[137] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, Automatically Characterizing Large Scale Program Behavior, in Proc. ASPLOS, 2002.

[138] T. Karnik et al., Impact of Body Bias on Alpha- and Neutron-Induced Soft Error Rates of Flip-Flops, Symposium on VLSI Circuits Digest of Technical Papers, 2004.

[139] C. Weaver, J. Emer, S. S. Mukherjee, and S. K. Reinhardt, Techniques to Reduce the Soft Error Rate of a High Performance Microprocessor, in Proc. ISCA, 2004.

[140] X. Li, S. V. Adve, P. Bose, and J. A. Rivers, Online Estimation of Architectural Vulnerability Factor for Soft Errors, in Proc. ISCA, 2008.

[141] O. S. Unsal, J. Tschanz, K. A. Bowman, V. De, X. Vera, A. Gonzalez, and O. Ergin, Impact of Parameter Variations on Circuits and Microarchitecture, IEEE Micro, 2006.

[142] K. Kang, M. A. Alam, and K. Roy, Characterization of NBTI Induced Temporal Performance Degradation in Nano-Scale SRAM Array using IDDQ, IEEE International Test Conference, 2007.

[143] P. Friedberg et al., Modeling Within-Die Spatial Correlation Effects for Process-Design Co-optimization, in Proc. ISQED, 2005.

[144] S. Ozdemir, D. Sinha, G. Memik, J. Adams, and H. Zhou, Yield-Aware Cache Architectures, in Proc. MICRO, 2006.

[145] H. Luo, Y. Wang, K. He, R. Luo, H. Yang, and Y. Xie, Modeling of PMOS NBTI Effect Considering Temperature Variation, in Proc. ISQED, 2007.

[146] S. V. Kumar, C. H. Kim, and S. S. Sapatnekar, Impact of NBTI on SRAM Read Stability and Design for Reliability, in Proc. ISQED, 2006.

[147] S. V. Kumar, C. H. Kim, and S. S. Sapatnekar, An Analytical Model for Negative Bias Temperature Instability, in Proc. ICCAD, 2006.

[148] R. Vattikonda, Y. Luo, A. Gyure, X. Qi, S. Lo, M. Shahram, Y. Cao, K. Singhal, and D. Toffolon, A New Simulation Method for NBTI Analysis in SPICE Environment, in Proc. ISQED, 2007.

[149] W. Wang et al., The Impact of NBTI on the Performance of Combinational and Sequential Circuits, in Proc. DAC, 2007.

[150] K. Kang, H. Kufluoglu, K. Roy, and M. A. Alam, Impact of Negative-Bias Temperature Instability in Nanoscale SRAM Array: Modeling and Analysis, IEEE Trans. on CAD, 2007.

[151] Z. Qi and M. Stan, NBTI Resilient Circuits Using Adaptive Body Biasing, in Proc. GLSVLSI, 2008.

[152] J. Abella et al., NBTI-Resilient Memory Cells with NAND Gates for Highly-Ported Structures, Workshop on DSN, 2007.

[153] J. Shin, V. Zyuban, P. Bose, and T. Pinkston, A Proactive Wear-out Recovery Approach for Exploiting Microarchitectural Redundancy to Extend Cache SRAM Lifetime, in Proc. ISCA, 2008.

[154] S. Basu and R. Vemuri, Process Variation and NBTI Tolerant Standard Cells to Improve Parametric Yield and Lifetime of ICs, in Proc. ISVLSI, 2007.

[155] Th. Fischer et al., Impact of Process Variations and Long Term Degradation on 6T-SRAM Cells, Advances in Radio Science, 2007.

[156] D. Ernst et al., Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation, in Proc. MICRO, 2003.

[157] W. Wang, Z. Wei, S. Yang, and Y. Cao, An Efficient Method to Identify Critical Gates under Circuit Aging, in Proc. ICCAD, 2007.

[158] K. Kang, K. Kim, and K. Roy, Variation Resilient Low-Power Circuit Design Methodology Using On-Chip Phase Locked Loop, in Proc. DAC, 2007.

[159] W. J. Dally and B. Towles, Route Packets, Not Wires: On-Chip Interconnection Networks, in Proc. DAC, 2001.

[160] D. Park, C. Nicopoulos, J. Kim, N. Vijaykrishnan, and C. R. Das, Exploring Fault-Tolerant Network-on-Chip Architectures, in Proc. DSN, 2006.

[161] X. Fu, T. Li, and J. Fortes, NBTI Tolerant Microarchitecture Design in the Presence of Process Variation, in Proc. MICRO, 2008.

[162] A. Tiwari and J. Torrellas, Facelift: Hiding and Slowing Down Aging in Multicores, in Proc. MICRO, 2008.

[163] A. Kumar, P. Kundu, A. Singh, L.-S. Peh, and N. K. Jha, A 4.6Tbits/s 3.6GHz Single-Cycle NoC Router with a Novel Switch Allocator in 65nm CMOS, in Proc. ICCD, October 2007.

[164] B. Li, L.-S. Peh, and P. Patra, Impact of Process and Temperature Variations on Network-on-Chip Design Exploration, in Proc. NOCS, April 2008.

[165] A. Kumar, L. S. Peh, and N. K. Jha, Token Flow Control, in Proc. MICRO, 2008.

[166] D. Park, S. Eachempati, R. Das, A. Mishra, Y. Xie, V. Narayanan, and C. Das, MIRA: A Multi-Layer On-Chip Interconnect Router Architecture, in Proc. ISCA, 2008.

[167] J. D. Owens, W. J. Dally, R. Ho, D. N. Jayasimha, S. W. Keckler, and L. S. Peh, Research Challenges for On-Chip Interconnection Networks, IEEE Micro, Special Issue on On-Chip Interconnects for Multicores, September/October 2007.

[168] L.-S. Peh and W. J. Dally, A Delay Model and Speculative Architecture for Pipelined Routers, in Proc. HPCA, January 2001.

[169] M. Galles, Scalable Pipelined Interconnect for Distributed Endpoint Routing: The SGI Spider Chip, HOT Interconnects IV, 1996.

[170] R. Mullins, A. West, and S. Moore, Low-Latency Virtual-Channel Routers for On-Chip Networks, in Proc. ISCA, June 2004.

[171] P. Gratz, B. Grot, and S. W. Keckler, Regional Congestion Awareness for Load Balance in Networks-on-Chip, in Proc. HPCA, February 2008.

[172] J. Kim, D. Park, T. Theocharides, N. Vijaykrishnan, and C. R. Das, A Low Latency Router Supporting Adaptivity for On-Chip Interconnects, in Proc. DAC, June 2005.

[173] C. Nicopoulos, S. Srinivasan, A. Yanamandra, D. Park, V. Narayanan, C. R. Das, and M. J. Irwin, On the Effects of Process Variation in Network-on-Chip Architectures, IEEE Trans. Dependable and Secure Computing, October 2008.

[174] E. G. Pumphrey, NMOS Analog Voltage Comparator, US patent 4812681, 1988.

[175] J. Dielissen, A. Radulescu, K. Goossens, and E. Rijpkema, Concepts and Implementation of the Philips Network-on-Chip, IP-Based SOC Design, 2003.

[176] N. Agarwal, L. S. Peh, and N. Jha, Garnet: A Detailed Interconnection Network Model inside a Full-System Simulation Framework, Technical Report CE-P08-001, Dept. of Electrical Engineering, Princeton University, February 2008.

[177] H. Wang, X. Zhu, L.-S. Peh, and S. Malik, Orion: A Power-Performance Simulator for Interconnection Networks, in Proc. MICRO, 2002.

[178] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, The SPLASH-2 Programs: Characterization and Methodological Considerations, [updated June 1995; cited August 2009], available from http://www-flash.stanford.edu/apps/SPLASH/

[179] V. Aslot, M. J. Domeika, R. Eigenmann, G. Gaertner, W. B. Jones, and B. Parady, SPEComp: A New Benchmark Suite for Measuring Parallel Computer Performance, in Proc. WOMPAT, 2001.

[180] Standard Performance Evaluation Corporation (SPEC), SPECjbb2005 (Java Server Benchmark), [updated August 2006; cited August 2009], available from http://www.spec.org/jbb2005

[181] M. F. Chang, J. Cong, A. Kaplan, M. Naik, G. Reinman, E. Socher, and S. Tam, CMP Network-on-Chip Overlaid with Multi-Band RF-Interconnect, in Proc. HPCA, February 2008.

[182] D. Vantrease, R. Schreiber, M. Monchiero, M. McLaren, N. P. Jouppi, M. Fiorentino, A. Davis, N. Binkert, R. G. Beausoleil, and J. H. Ahn, Corona: System Implications of Emerging Nanophotonic Technology, in Proc. ISCA, 2008.

[183] A. Kumar, L. S. Peh, P. Kundu, and N. K. Jha, Express Virtual Channels: Towards the Ideal Interconnection Fabric, in Proc. ISCA, 2007.

[184] N. E. Jerger, L. S. Peh, and M. H. Lipasti, Virtual Circuit Tree Multicasting: A Case for On-Chip Hardware Multicast Support, in Proc. ISCA, 2008.

BIOGRAPHICAL SKETCH

Xin Fu was born in Xiangxiang, China, as the daughter of Zhongyang Fu and Yafei Wang. After completing her high school education at the First Middle School in Xiangxiang, China, she entered the Department of Computer Science and Technology, Central South University in Changsha, China, in September 1999. She received the degree of Bachelor of Engineering in computer science and engineering from Central South University in July 2003. From September 2003 to July 2004, she joined the graduate program for computer science and engineering at the Department of Computer Science and Technology, Changsha, China. In September 2004, she entered the Ph.D. program in computer engineering at the University of Florida. She is a student member of the Institute of Electrical and Electronics Engineers (IEEE), the IEEE Computer Society, the Association for Computing Machinery (ACM), and the Association for Computing Machinery Special Interest Group on Computer Architecture (ACM SIGARCH).